An encoding detector library ported from Mozilla

Overview
Comments
  • Transferring to uchardet organization?

    Hi @BYVoid,

    Would you mind if we created a "uchardet" organization on GitHub and transferred the uchardet project there? It would make it more "official" and help it stand out among the various forks.

    As an alternative, we could be hosted by a friendly organization, for example in the GNOME repositories (i.e. the main repository would live outside GitHub, though GNOME also mirrors all of its repositories here: https://github.com/GNOME). I have not asked the GNOME Foundation yet, but I think they would accept. This would not make it a GNOME project, simply a friend project that keeps its independence. This second alternative would probably be my favorite. :-) But I'm fine if you are absolutely attached to keeping the main repository on GitHub.

    Thanks!

    Jehan

    opened by Jehan 17
  • WINDOWS-1253 file detected as ISO-8859-7

    These two are the most common Greek encodings and they are mostly identical. One major difference between them is the mapping of GREEK CAPITAL LETTER ALPHA WITH TONOS (Ά), which is very common in Greek texts/subtitles.

    code | ISO 8859-7                                     | windows-1253
    ------------------------------------------------------------------------------------------------------
    0xA1 | [U+2018] LEFT SINGLE QUOTATION MARK            | [U+0385] GREEK DIALYTIKA TONOS
    0xA2 | [U+2019] RIGHT SINGLE QUOTATION MARK           | [U+0386] GREEK CAPITAL LETTER ALPHA WITH TONOS
    0xA4 | _unassigned_                                   | [U+00A4] [CURRENCY SIGN]
    0xA5 | _unassigned_                                   | [U+00A5] [YEN SIGN]
    0xAE | _unassigned_                                   | [U+00AE] [REGISTERED SIGN]
    0xB5 | [U+0385] GREEK DIALYTIKA TONOS                 | [U+00B5] [MICRO SIGN]
    0xB6 | [U+0386] GREEK CAPITAL LETTER ALPHA WITH TONOS | [U+00B6] [PILCROW SIGN]
    

    Source: ISO 8859-7 vs. windows-1253

    I don't know how the detection works but more 0xA2 than 0xB6 would be a strong indication of WINDOWS-1253 (and vice versa).
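
The suggestion above can be sketched in a few lines of C. This is only an illustration of the byte-counting idea, not how uchardet works internally; the function name `guess_greek_charset` is hypothetical, and the sketch assumes the input is already known to be Greek text in one of the two encodings.

```c
#include <stddef.h>

/* Hypothetical helper, not part of uchardet.
 * U+0386 (GREEK CAPITAL LETTER ALPHA WITH TONOS) is byte 0xA2 in
 * WINDOWS-1253 and byte 0xB6 in ISO-8859-7, so whichever byte occurs
 * more often hints at the encoding.
 * Returns "WINDOWS-1253", "ISO-8859-7", or NULL when the counts tie. */
const char *guess_greek_charset(const unsigned char *data, size_t len)
{
    size_t a2 = 0, b6 = 0;

    for (size_t i = 0; i < len; i++) {
        if (data[i] == 0xA2)
            a2++;
        else if (data[i] == 0xB6)
            b6++;
    }
    if (a2 > b6)
        return "WINDOWS-1253";
    if (b6 > a2)
        return "ISO-8859-7";
    return NULL; /* no evidence either way */
}
```

A tie (including a text with neither byte) returns NULL, so a caller would fall back to whatever the regular detection reported.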

    PS. I use uchardet through mpv for the subtitle language detection and I would say about 10-20% of the subtitles have this problem

    Attached sample file

    opened by larvanitis 10
  • Invalid WINDOWS-1255 file detected as WINDOWS-1255

    Uchardet detects this file as WINDOWS-1255 whereas it contains the octet 0xFB, which is invalid in this charset.

    How to reproduce:

    $ echo -ne "\xf0\xe0\xfb\xf4" | uchardet
    > WINDOWS-1255
    $ echo -ne "\xf0\xe0\xfb\xf4" | iconv -f WINDOWS-1255
    > נiconv: invalid escape sequence at position 2
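
A check of the kind this report asks for could look like the following sketch. It is an assumption about how such a filter might be written, not uchardet's code: `first_unassigned_byte` and the lookup table are hypothetical, and the caller is expected to fill the table for the candidate charset (here only 0xFB would be flagged, since that is the one byte the report names).

```c
#include <stddef.h>

/* Hypothetical validity check, not uchardet's actual code.
 * unassigned[b] is nonzero when byte value b has no mapping in the
 * candidate single-byte charset.
 * Returns the offset of the first such byte, or -1 if there is none. */
long first_unassigned_byte(const unsigned char *data, size_t len,
                           const unsigned char unassigned[256])
{
    for (size_t i = 0; i < len; i++) {
        if (unassigned[data[i]])
            return (long)i;
    }
    return -1;
}
```

With a table where only index 0xFB is set, the four bytes from the reproduction (f0 e0 fb f4) yield offset 2, which is the position iconv complains about; a detector could then drop the candidate charset.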
    
    opened by lovasoa 10
  • Future roadmap?

    Hi Carbo,

    It's been a long time, so I'd like to ask about the future roadmap here. mpv and other music players would like to support special charsets via this lib, but as far as I can see from the commits it is not being developed.

    Opinions?

    opened by cicku 9
  • Can this code be used to make a Windows DLL? How?

    Hi, sorry for contacting you through this channel... THIS IS NOT AN ISSUE AT ALL... but I don't know how else to ask you this:

    I am a Windows programmer (Delphi and Pascal) with zero experience in C (or C++).

    My question is: Can this code be used to make a Windows DLL? If yes, could you please guide me on how? I have some hope because the CMakeLists.txt file has an option for win32 (although it is commented out).

    Basic questions:

    • Name and version of a compiler you know can build this (on Windows, of course)
    • Basic procedure (or at least some hints)
    • With the right compiler and procedure, can it be done as is, or does the code need some modification?

    Thanks in advance!

    PS: I know that a Windows version of the Mozilla code, written in Delphi, exists here, but yours seems more complete and detects many more encodings.

    opened by antekgla 6
  • GB18030 file detected as WINDOWS-1252

    I encountered an issue when using uchardet on a file that is almost entirely English (with very little Chinese). The file is GB18030-encoded but is detected as WINDOWS-1252.

    opened by yanxijian 6
  • New release?

    There are no tags in git, and there seem to be no releases. The version is currently 0.0.1, so at first glance the project looks very young, as if it had just started, yet the first commit dates from 2011. Maybe it's time to create a real release.

    opened by ghost 5
  • document the difference between this and libchardet

    On reviewing bomi, it looks like it uses libchardet, which, like this library, is also based on Mozilla's code. I see that the public APIs of the two projects differ as well; however, having two copies of the same code is not great for the FOSS community in general.

    Could anyone give a more detailed account of the differences, and maybe merge the two libraries? For example, which version of Mozilla's code this library contains, the history of both codebases, how easy it would be to merge the two, etc.

    opened by infinity0 3
  • (void) and () empty arguments are different in C.

    This fixes the following warning when including uchardet.h in C source, built with -Wstrict-prototypes: uchardet.h:52:1: warning: function declaration isn't a prototype

    We want to use uchardet in gtksourceview (used for instance for gedit): https://bugzilla.gnome.org/show_bug.cgi?id=669448 I had this warning when compiling with a define to uchardet.h.
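
For readers unfamiliar with the distinction, here is a minimal sketch. The function names are hypothetical and not taken from uchardet.h; the point is only the difference between the two declaration forms in C before C23.

```c
/* Hypothetical declarations for illustration only. */

/* Empty parentheses: before C23 this declares a function with an
 * unspecified parameter list, so it is not a prototype and
 * -Wstrict-prototypes warns about it. */
int detector_count_old();

/* (void): a real prototype for a function taking no arguments. */
int detector_count(void);

int detector_count_old() { return 1; }
int detector_count(void) { return 1; }
```

In C++ both forms mean "no arguments", which is why a header written with C++ habits can trigger the warning only when it is included from C source.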

    opened by Jehan 3
  • ISO-8859-2 should be detected

    In your README, ISO-8859-2 is not listed as supported. Yet I can find a model for it in src/LangHungarianModel.cpp. I tried it with an ISO-8859-2 file I built myself: https://cloud.libreart.info/public.php?service=files&t=40140bd3fd105b2c03d7716dfe4b498a and detection fails: the file is reported as "windows-1252".

    On the other hand python-chardet was able to properly detect the ISO-8859-2 encoding:

    $ chardetect iso-8859-2.smi
    iso-8859-2.smi: ISO-8859-2 with confidence 0.850807928898

    Considering that both are supposed to be based on the same algorithm from Mozilla, and that this encoding is mentioned in your code, I think it would be cool if it were supported.

    opened by Jehan 3
  • uchardet wrongly determines the text as UTF-8

    The file has these bytes:

    00000000  78 78 78 e2 80 99 78 78  78 0a 63 68 61 72 20 27  |xxx...xxx.char '|
    00000010  e2 27 20 28 69 6e 0a 4d  69 6c 6f c5 a1 5f 46 6f  |.' (in.Milo.._Fo|
    00000020  72 6d 61 6e 0a                                    |rman.|
    

    Please note that it has 3 non-ascii areas:

    1. e2 80 99: is U+2019 RIGHT SINGLE QUOTATION MARK
    2. e2: could start a 3-byte UTF-8 sequence, but the following bytes 27 20 are not continuation bytes, so this is not a valid UTF-8 sequence
    3. c5 a1: U+0161 LATIN SMALL LETTER S WITH CARON
    

    However, uchardet determines that it is UTF-8:

    $ uchardet < xxx 
    UTF-8
    

    FreeBSD file(1) determines this file as:

    $ file xxx 
    xxx: C source, Non-ISO extended-ASCII text
    

    I am not sure how it should determine this text, but this isn't UTF-8 for sure.
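
The claim can be checked with a small strict UTF-8 validator. This is a sketch written for this example, not uchardet's detection logic, and the name `utf8_first_invalid` is hypothetical.

```c
#include <stddef.h>

/* Minimal strict UTF-8 check (a sketch, not uchardet's code).
 * Returns the offset of the first byte that starts an invalid sequence,
 * or -1 if the whole buffer is well-formed UTF-8. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long min;  /* smallest code point valid at this length */
        unsigned long cp;

        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { need = 1; min = 0x80;    cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { need = 2; min = 0x800;   cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { need = 3; min = 0x10000; cp = c & 0x07; }
        else return (long)i; /* stray continuation or invalid lead byte */

        if (len - i <= need)
            return (long)i;  /* sequence truncated by end of buffer */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlong forms, surrogates, and values above U+10FFFF */
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return (long)i;
        i += need + 1;
    }
    return -1;
}
```

Run over the 37 bytes of the hexdump above, the function stops at offset 0x10, the lone e2 followed by 27, which supports the report that the file is not valid UTF-8.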

    opened by yurivict 3
  • Make a portable executable

    Hi,

    I like your project and I would like to use your tool on a workstation without manually installing the library.

    Can you explain how to create a portable executable of your project?

    Thanks in advance!

    opened by finitha 0
  • UTF-8 with right single quote (U+2019) mistaken as Windows-1250

    I'm seeing an issue where English UTF-8 encoded text with a Unicode Right Single Quotation Mark (U+2019), is being identified by uchardet as WINDOWS-1250.

    $ hexdump -C test/en/utf-8.txt 
    00000000  45 6e 67 6c 69 73 68 20  74 65 78 74 20 77 69 74  |English text wit|
    00000010  68 20 61 20 72 69 67 68  74 20 73 69 6e 67 6c 65  |h a right single|
    00000020  20 71 75 6f 74 65 20 28  55 2b 32 30 31 39 29 20  | quote (U+2019) |
    00000030  69 6e 73 74 65 61 64 20  6f 66 20 61 6e 20 61 70  |instead of an ap|
    00000040  6f 73 74 72 6f 70 68 65  20 73 68 6f 75 6c 64 6e  |ostrophe shouldn|
    00000050  e2 80 99 74 0d 0a 62 65  20 6d 69 73 74 61 6b 65  |...t..be mistake|
    00000060  6e 20 66 6f 72 20 73 6f  6d 65 74 68 69 6e 67 20  |n for something |
    00000070  65 6c 73 65 2e                                    |else.|
    00000075
    
    $ uchardet test/en/utf-8.txt
    WINDOWS-1250
    

    (I've also seen this misidentified elsewhere as WINDOWS-1258, but those files have confidential content and I couldn't share them. I was able to reproduce misidentification with this smaller sample, though with a slightly different outcome.)

    opened by AlanKrueger 2
Releases(v0.0.5)
  • v0.0.5(Dec 5, 2015)

    • Revert UTF-16 and UTF-32 label change: it was an error to specify endianness for texts with BOM. The Unicode standard explicitly warns against it, and it actually even (partially) breaks conversions.
    • Added support for:
      • French: Windows-1252.
      • German: ISO-8859-1, Windows-1252
      • Esperanto: ISO-8859-3
      • Turkish: ISO-8859-3 and ISO-8859-9
      • Thai: ISO-8859-11 (and TIS-620 model rebuilt).
    • Single Byte charset detection algorithm improved: detection of control characters lowers confidence.
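
As an illustration of the first item in this list, BOM sniffing that reports labels without endianness might look like the sketch below. It is an assumption about the approach, not uchardet's code, and `charset_from_bom` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of BOM sniffing (not uchardet's actual code).
 * Returns a label without endianness, because the BOM itself already
 * tells a converter the byte order, or NULL when no BOM is recognised. */
const char *charset_from_bom(const unsigned char *d, size_t len)
{
    if (len >= 3 && memcmp(d, "\xEF\xBB\xBF", 3) == 0)
        return "UTF-8";
    /* Test UTF-32 before UTF-16: the UTF-32LE BOM FF FE 00 00 begins
     * with the UTF-16LE BOM FF FE. */
    if (len >= 4 && (memcmp(d, "\xFF\xFE\x00\x00", 4) == 0 ||
                     memcmp(d, "\x00\x00\xFE\xFF", 4) == 0))
        return "UTF-32";
    if (len >= 2 && (memcmp(d, "\xFF\xFE", 2) == 0 ||
                     memcmp(d, "\xFE\xFF", 2) == 0))
        return "UTF-16";
    return NULL;
}
```

The ordering matters: a buffer starting with FF FE 00 00 is treated as UTF-32 here, although it could in principle be UTF-16LE text whose first character is U+0000.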
  • v0.0.4(Dec 3, 2015)

    • Add support of ISO-8859-1 and ISO-8859-15 for French.
    • Re-enable Hungarian language models (ISO-8859-2 and Windows-1250) which used to conflict with other charsets (should be better now).
    • Differentiate ASCII detection and detection failure.
    • Improve single-byte charset detection confidence algorithm (fixes for instance Windows-1251 Russian text detection).
    • "UTF-16" is now outputted with endianness information (UTF-16LE/BE).
    • Add UTF-32 BOM detection.
    • Discard single byte charsets upon illegal codepoint detection.
    • Internal redesign of single-byte charmaps with more semantics, and variable sample size length (different languages have different sizes of grapheme lists).
    • A lot more test files (33 unit tests should pass with make test).
    • Adding python scripts to generate language models from Wikipedia data in a single command.
  • v0.0.3(Nov 19, 2015)

    A quick release after 0.0.2 mostly to fix a bad crash on the command line tool when charset detection failed (or detected ASCII).

    Additionally:

    • The build now includes more test files for various language/encoding and a make test target for unit testing (20 encoding detection tests should be successful upon running it).
    • The build has a new BUILD_STATIC option, set to ON by default, which allows disabling the static library build if it is not needed.
    • All encoding names are iconv-compatible, enabling developers to directly feed the result of uchardet_get_charset() into libiconv.
    • Compilation warnings fixed.
  • v0.0.2(Nov 16, 2015)

    The primary goal of this release is to set a fixed point in time for distributions, since most are using various commits as their source, but still calling it 0.0.1 (there was actually a version 0.0.1 tarball available in GoogleCode, dating from 2011).

    Version 0.0.2 mostly fixes various bugs and allows querying charsets for multiple files in the same command with the uchardet command line tool.

Owner
Carbo Kuo