Tesseract Open Source OCR Engine (main repository)

Tesseract OCR


About

This package contains an OCR engine, libtesseract, and a command line program, tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but it also still supports the legacy Tesseract OCR engine of Tesseract 3, which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). That mode also needs traineddata files which support the legacy engine, for example those from the tessdata repository.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The main branch also has experimental support for ALTO (XML) output.
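As a rough illustration of consuming one of these formats, the sketch below parses TSV text into word records using only the Python standard library. The column names and the sample rows are assumptions based on the header Tesseract commonly writes for TSV output; check them against the output of your own version before relying on them.

```python
import csv
import io

# Assumed TSV header and made-up sample rows, for illustration only.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t102\t92\t70\t24\t91.0\tworld\n"
)


def parse_words(tsv_text, min_conf=0.0):
    """Return (text, confidence) pairs for rows that carry recognized text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows (page, block, line) have no text; skip them.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


print(parse_words(SAMPLE_TSV))
```

Passing `min_conf` filters out low-confidence words, which can be a simple first step when post-processing OCR output.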

In many cases, you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

This project does not include a GUI application. If you need one, please see the 3rdParty documentation.

Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

The latest stable version is 5.1.0, released on March 1, 2022. The latest source code is available from the main branch on GitHub. Open issues can be found in the issue tracker and the planning documentation.

See Release Notes and Change Log for more details of the releases.

Installing Tesseract

You can either install Tesseract via a pre-built binary package or build it from source.

A C++ compiler with good C++17 support is required for building Tesseract from source.

Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

For more information about the various command line options use tesseract --help or man tesseract.

Examples can be found in the documentation.
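The usage line above can be assembled programmatically. The following is a minimal sketch, not part of Tesseract itself: `build_tesseract_command` is a hypothetical helper, `page.png` is a placeholder file name, and `hocr` is assumed to be available as a config file in your installation.

```python
import shlex


def build_tesseract_command(image, output_base, lang=None, oem=None,
                            psm=None, configfiles=()):
    """Assemble an argument list following the documented usage line."""
    cmd = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config file names go last, as in the usage line.
    cmd += list(configfiles)
    return cmd


cmd = build_tesseract_command("page.png", "page", lang="eng", oem=0, psm=6,
                              configfiles=["hocr"])
print(shlex.join(cmd))
# To actually run it (requires tesseract on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when file names contain spaces.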

For developers

Developers can use the libtesseract C or C++ API to build their own applications. If you need bindings to libtesseract for other programming languages, please see the wrapper section in the AddOns documentation.

Documentation of Tesseract generated from source code by doxygen can be found on tesseract-ocr.github.io.

Support

Before you submit an issue, please review the guidelines for this repository.

For support, first read the documentation, particularly the FAQ to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can't find what you need, ask for support in the mailing-lists.

Please report an issue only for a bug, not for asking questions.

License

The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

Tesseract uses the Leptonica library, which essentially uses a BSD 2-clause license.

Dependencies

Tesseract uses the Leptonica library for opening input images (images only, not documents such as PDF). It is suggested to use Leptonica with built-in support for zlib, PNG and TIFF (for multipage TIFF).

Latest Version of README

For the latest online version of the README.md see:

https://github.com/tesseract-ocr/tesseract/blob/main/README.md

Issues
  • RFC: Tesseract 4.0.0 – open tasks

    I'd like to collect open tasks which should be addressed before tagging the official release 4.0.0.

    These tasks are on my own list and to be discussed whether we consider them important for the new release or not:

    • Remove deprecated code. This does not include OpenCL or the old Tesseract engine.
    • Add --version parameter for all command line commands.
    • Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version. This will make the command slower, because each file must be opened and parsed.
    • Add option to optionally select implementation for dot product (CPU, SSE, AVX, ...).
    • Relative includes for traineddata: tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
    • Maybe more fixes for compiler warnings and issues reported by Coverity Scan.
    • (list still incomplete)
    RFC 
    opened by stweil 194
  • Build Tesseract from source with Visual Studio

    Environment

    • Tesseract Version: 5.0.0 alpha
    • Commit Number: a1a177f
    • Platform: Windows 10 64 bit

    Current Behavior:

    I cannot build from source. I downloaded the SW client, saved it at "D:\Essam\Software\SW", and added that directory to the Path. I can run SW from the command line and see the SW information as follows:

    D:\Tutorial\Git\tesseract\build>sw --version
    sw.client.sw version 1.0.0
    git revision 083bb99144549c1f361298e8284daa6b54422965
    assembled on 30.01.2020 18:36:29 Egypt Standard Time

    Then I ran the following commands to compile from source, as described at https://github.com/tesseract-ocr/tesseract/wiki/Compiling:

    git clone https://github.com/tesseract-ocr/tesseract tesseract
    cd tesseract
    mkdir build && cd build
    cmake .. -G "Visual Studio 15 2017 Win64" -DCMAKE_INSTALL_PREFIX=inst

    I receive the following error:

    -- Selecting Windows SDK version 10.0.17763.0 to target Windows 10.0.18363.
    Configuring tesseract version 5.0.0-alpha-621-ga1a17...
    -- target changed from "auto" to "kaby-lake"
    CMake Error at CMakeLists.txt:197 (find_package):
      By not providing "FindSW.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "SW", but CMake did not find one.

      Could not find a package configuration file provided by "SW" with any of the following names:

        SWConfig.cmake
        sw-config.cmake

      Add the installation prefix of "SW" to CMAKE_PREFIX_PATH or set "SW_DIR" to a directory containing one of the above files. If "SW" provides a separate development package or SDK, be sure it has been installed.

    -- Configuring incomplete, errors occurred!
    See also "D:/Tutorial/Git/tesseract/build/CMakeFiles/CMakeOutput.log".

    The log file is attached:

    CMakeOutput.log

    Expected Behavior:

    build tesseract solution

    Suggested Fix:

    build process 
    opened by essamzaky 114
  • Tag a new version for LSTM 4.0

    Many fixes have been made to master branch for 4.0 since the 4.00.00alpha release in November 2016. A number of assertions have been fixed.

    @zdenop Please add a new tag eg. 4.0.0alpha-1 / 2 (numbering as you consider appropriate). Thanks!

    RFC 
    opened by Shreeshrii 108
  • RFC: Remove the legacy OCR Engine

    Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

    From #518:

    @stweil commented:

    I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

    @theraysmith commented:

    Please provide examples of where you get better results with the old engine. Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

    legacy RFC 
    opened by amitdo 106
  • good accuracy but too slow, how to improve Tesseract speed

    I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

    It’s working pretty good, but very slow. It takes close to 1000ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

    I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

    Is there any way to make it faster? Any ideas on how to make Tesseract read faster? Thanks.

    performance OpenMP SIMD 
    opened by ychtioui 90
  • Tesseract 4.0.0 crashed on Intel I5-8400 CPU with Debian 9.6.0 amd64 (SSE/AVX/AVX2)

    Environment

    • Tesseract Version: 4.0.0 Release
    • Commit Number: 51316994ccae0b48692d547030f26c0969308214
    • Platform: Debian 9.6.0 amd64

    Current Behavior: Tesseract 4.0.0 crashed on an Intel I5-8400 CPU with Debian 9.6.0 amd64 (SSE/AVX/AVX2).

    I compiled Tesseract 4.0 on an Intel I5-8400 CPU with Debian 9.6.0 amd64. tesseract --version outputs this:

    tesseract 4.0.0
    leptonica-1.74.2
    libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8
    Found AVX2
    Found AVX
    Found SSE

    When I call tesseract several times, a crash happens and the PC reboots.

    I also have an Intel G4650 CPU, which does not support AVX2 / AVX, and everything works fine there; a crash never happens. How can I make Tesseract work on the Intel I5-8400 with AVX/AVX2/SSE?

    Expected Behavior:

    Suggested Fix:

    bug SIMD unexpected termination 
    opened by edilinux 83
  • RFC: Add initial support for traineddata files in compressed archive formats (don't merge)

    This requires libminizip-dev, so expect failures from CI.

    Up to now, little endian tesseract works with the new zip format.

    More work is needed for training tools and big endian support and also to maintain compatibility with the current proprietary format.

    Signed-off-by: Stefan Weil

    feature request build process RFC 
    opened by stweil 81
  • trying to add tessedit_char_whitelist etc. again:

    • ignore matrix outputs in ComputeTopN if they belong to a disabled unichar_id
    • pass UNICHARSET refs to check that
    • in SetBlackAndWhitelist, also update the unicharset of the lstm_recognizer_ instance, if any
    RFC enhancement allowlist / denylist 
    opened by bertsky 79
  • RFC: Reorganize source tree

    I'd like to propose changes to tesseract source tree structure. Today the common way is to have src folder with all program stuff and include folder with public headers. Now we have a lot of dirs in the root - that's very annoying. On the first stage I propose:

    1. move all sources into src
    2. move training tools from training to tools/training

    Later we can try to move public headers to include directory.

    The new look will be like: pic

    If there are no objections, I'll commit changes.

    RFC 
    opened by egorpugin 69
  • 4.0 bugs on MAC OS X and a step by step for reference

    This is the step-by-step procedure I used to install Tesseract 4.0 on Mac OS X, together with the fixes and workarounds I needed to make it work. I'm sharing this "guide" to help other people who may run into the same problems I had.

    Special thanks to Shree, who helped me on the Google group.

    Project and more details: https://github.com/tesseract-ocr/tesseract

    Where to get help?

    Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
    GitHub issues: https://github.com/tesseract-ocr/tesseract/issues

    Platform: Mac OS X 10.13.3
    Tesseract: 4.0.0-beta.1-69-g10f4
    leptonica-1.75.3
    libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
    Found AVX2
    Found AVX
    Found SSE

    Compiling Tesseract - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos

    Warning: Don't install tesseract using brew, since you can't generate the ScrollView.jar from it! (At least I wasn't able to generate it)

    Steps

    1 - Install these libs

    brew install automake autoconf autoconf-archive libtool
    brew install pkgconfig
    brew install icu4c
    brew install leptonica
    brew install gcc
    

    2 - Run the code

    ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c
    

    Note: text2image is set to use icu4c/60.2, but the actual version is icu4c/61.1

    3 - Clone tesseract repo

    git clone https://github.com/tesseract-ocr/tesseract/
    

    4 - Enter in the folder

    cd tesseract
    

    5 - Run the script

    ./autogen.sh
    

    6 - Run the code, and copy the CPPFLAGS and LDFLAGS

    brew info icu4c
    

    7 - Update the CPPFLAGS and LDFLAGS and execute the code

    ./configure \
      CPPFLAGS=-I/usr/local/opt/icu4c/include \
      LDFLAGS=-L/usr/local/opt/icu4c/lib
    

    8 - Run the code

    make -j
    

    9 - Run the code

    sudo make install
    

    10 - Run the code

    sudo update_dyld_shared_cache
    

    Note: this is the Mac OS X equivalent of sudo ldconfig

    11 - Run the code

    make training
    

    Creating ScrollView.jar - tesseract 4.0

    Reference:
    https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
    https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

    Important: Use the JDK 8 to build, or else it is going to return an error

    Steps

    1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar

    http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
    http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar

    2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java

    3 - Enter the tesseract/java folder

    cd java
    

    4 - Set the var SCROLLVIEW_PATH to your tesseract/java folder and run the code

    SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
    

    Training Font - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain

    Steps

    1 - Clone the langdata dir from git

    git clone https://github.com/tesseract-ocr/langdata
    

    2 - Enter the tesseract folder

    cd ..
    

    3 - Execute this code and select one font from the list (I recommend "Verdana")

    text2image --list_available_fonts --fonts_dir=/Library/Fonts
    

    The font directory on Mac can be any of:

    ~/Library/Fonts
    /Library/Fonts/
    /Network/Library/Fonts/
    /System/Library/Fonts/
    /System Folder/Fonts/

    More details here: https://support.apple.com/en-us/HT201722

    4 - Replace line 195 of the file tesseract/training/tesstrain_utils.sh as follows:

    - export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
    + export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
    

    Note: this is a fix for the error:

    mktemp: illegal option -- -
    usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
           mktemp [-d] [-q] [-u] -t prefix
    /Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
    

    5 - Clone the tessdata repo from git (I recommend "tessdata_best" since it is more accurate; "tessdata_fast" is just faster)

    git clone https://github.com/tesseract-ocr/tessdata_best
    

    or

    git clone https://github.com/tesseract-ocr/tessdata_fast
    

    6 - Copy tessdata_best/eng.traineddata (for English training) from the tessdata repo you just cloned and paste it into tesseract/tessdata/

    7 - Create the training data

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --exposures "0"    \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Verdana" \
      --output_dir ~/tesstutorial/engtrain
    

    Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

    8 - Create other training data using other font to compare

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --exposures "0"    \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Times New Roman," \
      --output_dir ~/tesstutorial/engeval
    

    Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

    9 - Create the needed folder

    mkdir -p ~/tesstutorial/engoutput
    

    10 - Start the training

    SCROLLVIEW_PATH=~/projects/tesseract/java \
    ~/projects/tesseract/training/lstmtraining \
    --debug_interval 100 \
    --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output ~/tesstutorial/engoutput/base \
    --learning_rate 20e-4 \
    --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
    --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
    --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
    

    If you failed to build ScrollView.jar, set debug_interval to -1: --debug_interval -1

    11 - Monitor the log on another console

    tail -f ~/tesstutorial/engoutput/basetrain.log
    

    12 - Test Accuracy with other font

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/engoutput/base_checkpoint \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    13 - Test Accuracy with best traindata

    ~/projects/tesseract/training/lstmeval \
      --model ~/projects/tessdata_best/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    14 - Test Accuracy with actual traindata (in this case the same as step 13)

    ~/projects/tesseract/training/lstmeval \
      --model ~/projects/tesseract/tessdata/eng.traineddata \
      --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
    

    Fine tuning - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

    Steps

    1 - Create the necessary folder

    mkdir -p ~/tesstutorial/verdana_from_small
    

    2 - Start to fine tuning

    ~/projects/tesseract/training/lstmtraining \
      --model_output ~/tesstutorial/verdana_from_small/verdana \
      --continue_from ~/tesstutorial/engoutput/base_checkpoint \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
      --max_iterations 1200
    

    3 - Validate the progress

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    4 - Create the necessary folder

    mkdir -p ~/tesstutorial/verdana_from_full
    

    5 - Combine the trained data

    ~/projects/tesseract/training/combine_tessdata \
      -e ~/projects/tesseract/tessdata/eng.traineddata \
      ~/tesstutorial/verdana_from_full/eng.lstm
    

    6 - Train merged data

    ~/projects/tesseract/training/lstmtraining \
      --model_output ~/tesstutorial/verdana_from_full/verdana \
      --continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
      --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
      --max_iterations 400
    

    7 - Validate the results on the main training file

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
      --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
    

    8 - Validate the results on our training file

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
      --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
    

    Fine tuning add ± character - tesseract 4.0

    Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

    Steps

    1 - Modify langdata/eng/eng.training_text and include these lines:

    alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
    TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
    Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
    VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
    PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
    Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
    Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
    Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
    Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
    United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
    Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
    Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
    netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
    

    2 - Generate the training file

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Times New Roman," \
                  "Times New Roman, Bold" \
                  "Times New Roman, Bold Italic" \
                  "Times New Roman, Italic" \
                  "Courier New" \
                  "Courier New Bold" \
                  "Courier New Bold Italic" \
                  "Courier New Italic" \
      --output_dir ~/tesstutorial/trainplusminus
    

    3 - Generate the eval data

    PANGOCAIRO_BACKEND=fc \
    ~/projects/tesseract/training/tesstrain.sh \
      --fonts_dir /Library/Fonts \
      --lang eng \
      --linedata_only \
      --noextract_font_properties \
      --langdata_dir ~/projects/langdata \
      --tessdata_dir ~/projects/tesseract/tessdata \
      --fontlist "Verdana" \
      --output_dir ~/tesstutorial/evalplusminus
    

    4 - Combine trained data files

    ~/projects/tesseract/training/combine_tessdata \
      -e ~/projects/tesseract/tessdata/eng.traineddata \
      ~/tesstutorial/trainplusminus/eng.lstm
    

    5 - Fine tuning

    ~/projects/tesseract/training/lstmtraining \
      --model_output ~/tesstutorial/trainplusminus/plusminus \
      --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
      --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
      --old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
      --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
      --max_iterations 3600
    

    6 - Test the result on other fonts

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
      --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt
    

    7 - Test the result on the main font

    ~/projects/tesseract/training/lstmeval \
      --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
      --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
    
    build process 
    opened by FernandoGOT 57
  • Some programs can't find OCR text in Tesseract's PDFs (3.04)

    While Acrobat XI can find text in a PDF, it appears that poppler's pdftotext program, OS X's Preview app, and the library PyPDF2's extractText() function all fail to locate text. It seems that Tesseract is encoding text in a way that makes it inaccessible to many PDF viewers.

    pdftotext produces empty output. Preview app allows highlighting of text in the appropriate locations, but it cannot be copied to the clipboard or searched. PyPDF2 extractText also produces an empty string as text.

    bug PDF 
    opened by jbarlow83 56
  • Tesseract can not recognize small superscript numbers behind words

    Environment

    • Tesseract Version: 4.0.0-2
    • Platform: MX Linux-19 64bit (Debian 10 based)

    Sometimes I have texts where the word is followed by a superscript number, because there are clues to that word at the bottom of the page.

    Current Behavior:

    It can not recognize the small number "1".

    Expected Behavior:

    It should recognize the small Number "1"

    Gif-Animation: OCr training

    Question: I guess this is a mix of letters and numbers in one word, with different font sizes in the same word. Is there a way to train Tesseract so that it can correctly recognize these small numbers?

    Thank you.

    opened by Golddouble 1
  • tesseract::TessBaseAPI::ProcessPages cannot be stopped on demand

    Environment

    • Tesseract Version: Tesseract 4.1.1
    • Platform: Win10 64bit, VS2017, MFC C++ application

    Current Behavior:

    tesseract::TessBaseAPI::ProcessPages cannot be stopped on demand. I have not found any way to stop tesseract::TessBaseAPI::ProcessPages when I want to. A timeout can be set up, but this does not help.

    Expected Behavior:

    tesseract::TessBaseAPI::ProcessPages should have a solution to be stopped on demand

    Suggested Fix:

    tesseract::TessBaseAPI::ProcessPages should take ETEXT_DESC into account somehow. More details can be found here: https://stackoverflow.com/questions/72719440/stop-tesseracttessbaseapiprocesspages-on-demand

    feature request 
    opened by flaviu22 5
  • Encoding of string failed! Chinese

    Environment

    Tesseract Version: v5.1.0.20220510

    Current Behavior:

    Extracting tessdata components from chi_sim.traineddata
    Wrote chi_sim.lstm
    Version:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
    0:config:size=1966, offset=192
    17:lstm:size=12152851, offset=2158
    18:lstm-punc-dawg:size=282, offset=12155009
    19:lstm-word-dawg:size=590634, offset=12155291
    20:lstm-number-dawg:size=82, offset=12745925
    21:lstm-unicharset:size=258834, offset=12746007
    22:lstm-recoder:size=72494, offset=13004841
    23:version:size=84, offset=13077335
    Loaded file D:\Download\tess\tess_trainrance\chi_sim.lstm, unpacking...
    Warning: LSTMTrainer deserialized an LSTMRecognizer!
    Continuing from D:\Download\tess\tess_trainrance\chi_sim.lstm
    Encoding of string failed! Failure bytes: e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d
    Can't encode transcription: '銆屽憸鍛溾€︹€?楦e憸鍛滆璁よ璁~~鈥︹€︺€? in language ''
    Encoding of string failed! Failure bytes: e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d
    Can't encode transcription: '銆屽憸鍛溾€︹€?楦e憸鍛滆璁よ璁~~鈥︹€︺€? in language ''
    Encoding of string failed! Failure bytes: e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e e2 80 a6 e2 80 a6 e3 80 8d
    Can't encode transcription: '銆屽憸鍛溾€︹€?楦e憸鍛滆璁よ璁~~鈥︹€︺€? in language ''

    Expected Behavior:

    There is a Chinese character '诶' in the image I trained on, and the error "Encoding of string failed!" is reported after training. After investigation and testing, I found that the reason is that the chi_sim.traineddata from tessdata_best does not contain '诶'. What should I do? How can I add '诶' to chi_sim.traineddata?
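    The "Failure bytes" in the log are a hex dump of UTF-8 text, so they can be decoded to see where encoding stopped. The sketch below does that with the Python standard library; decoding the bytes from this report gives a string that starts with '诶' (U+8BF6), which matches the diagnosis above. The `missing_characters` helper and the `known` set are hypothetical stand-ins: a real check would read the character list from the model's lstm-unicharset component.

```python
def decode_failure_bytes(hex_bytes):
    """Turn a space-separated hex dump into the UTF-8 text it encodes."""
    return bytes.fromhex(hex_bytes.replace(" ", "")).decode("utf-8")


def missing_characters(text, known_chars):
    """List characters in text absent from known_chars, in first-seen order."""
    missing = []
    for ch in text:
        if ch not in known_chars and ch not in missing:
            missing.append(ch)
    return missing


# Failure bytes copied from the log in this report.
failure = ("e8 af b6 e8 ae a4 e8 ae a4 e8 ae a4 7e 7e 7e "
           "e2 80 a6 e2 80 a6 e3 80 8d")
text = decode_failure_bytes(failure)
print(text)

# Hypothetical character set that lacks U+8BF6, for illustration only.
known = set("\u8ba4~\u2026\u300d")
print(missing_characters(text, known))
```

    Running the same check over all training transcriptions before training would list every character the starting model cannot encode.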

    Suggested Fix:

    training encoding failed 
    opened by Gnakkk 0
  • Tesseract bug:execute lstm.train but fail, print Deserialize header failed

    example.zip contains 7 tif files. I merged all of them into zq.newbox.exp0.tif and generated zq.newbox.exp0.box. Then I executed this command in the Windows CMD to generate the lstmf file for zq.newbox.exp0.tif and zq.newbox.exp0.box:

    tesseract C:\Users\zhang\Desktop\test\zq.newbox.exp0.tif zq.newbox.exp0 --psm 6 -l eng lstm.train

    But it prints:

    Page 1

    Page 2
    Deserialize header failed: zq.newbox.exp0.lstmf
    Failed to read training data from zq.newbox.exp0.lstmf!
    Error during processing.

    Then I deleted 22KY.tif and merged the other 6 tif files into zq.newbox.exp0.tif. Retrying lstm.train then succeeded.

    I don't know why 22KY.tif causes an error.

    However, once zq.newbox.exp0.lstmf has been generated successfully from the 6 tif files and is present in the working directory, running lstm.train on all 7 tif files (including 22KY.tif) succeeds.

    So I wonder whether this might be a bug in Tesseract.

    Environment

    Tesseract 5.1.0.20220510 windows 10 (64bit)

    Here is my box file and tif file

    example.zip

    eng.traineddata was downloaded from https://github.com/tesseract-ocr/tessdata_best, and I executed combine_tessdata -e "C:\Program Files\Tesseract-OCR\tessdata\eng.traineddata" "eng.lstm"

    opened by zhangqi-ulua 0
  • OCR tool timing

    Hi

    Using Tesseract OCR to recognize text from an image works normally, but sometimes execution of the following line takes 450-500 ms when detecting only 3-5 characters.

    char* out = tess[nArrayIndex].GetUTF8Text();

    Please assist me to resolve this issue.

    performance 
    opened by balaji-6975 0
Releases(5.1.0)
  • 5.1.0(Mar 1, 2022)

    This is a new minor version of Tesseract 5.

    • Handle image and line regions in output formats ALTO, hOCR and text.
    • New parameter curl_timeout for curl_easy_setopt.
    • Build fixes and improvements.
    • Catch nullptr in PageIterator::Orientation to improve robustness.
    • Remove unused code.

    See also list of all changes.

  • 5.0.1(Jan 7, 2022)

    This is a bug fix release of Tesseract 5.0.

    • Add SPDX-License-Identifier to public include files.
    • Support redirections when running OCR on a URL.
    • Lots of fixes and improvements for cmake builds. Distributions should use the autoconf build.
    • Fix broken msys2 build with gcc 11.
    • Fix parameter certainty_scale (was duplicated).
    • Fix some compiler warnings and clean code.
    • Correctly detect amd64 and i386 on FreeBSD.
    • Add libarchive and libcurl in continuous integration actions.
    • Update submodule googletest to release v1.11.0.

    See also list of all changes.

  • 5.0.0(Nov 30, 2021)

  • 5.0.0-rc3(Nov 22, 2021)

  • 4.1.3(Nov 15, 2021)

  • 5.0.0-rc2(Nov 14, 2021)

  • 4.1.2(Nov 14, 2021)

    This is a new stable release of Tesseract 4.1.

    Note: The autoconf build is broken (see issue #3642), so please use 4.1.3.

    • Allow line images with larger width for training
    • Bug fixes
    • Build updates and fixes

    See also list of all changes.

  • 5.0.0-rc1(Oct 29, 2021)

    This is the first release candidate of Tesseract 5.0.0.

    • Enable fast float32 LSTM by default
    • Switch to NFC normalisation everywhere
    • Remove banner message
    • Disable music staff detection and removal
    • Add new command line option --loglevel
    • Bug fixes

    See also list of all changes.

  • 5.0.0-beta-20210916(Sep 16, 2021)

    This is a new pre-release of Tesseract 5.0.0.

    • Bug fixes
    • Extend URI support for Tesseract with libcurl
    • Rename processed TIFF output file and add page number if needed

    See also list of all changes.

  • 5.0.0-beta-20210815(Aug 15, 2021)

    This is a new pre-release of Tesseract 5.0.0.

    • Bug fixes
    • Modernize more code
    • More options for binarization
    • Improved support for ARM NEON
    • No longer depends on Abseil for unit tests
    • Support float for model training and text recognition (faster, requires less RAM)
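
The RAM saving from the float option follows from the scalar size. A minimal sketch, not Tesseract's actual implementation: `WeightBytes` and `Dot` are hypothetical helpers, and the halving assumes the common 4-byte `float` and 8-byte `double`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed to store `count` weights of scalar type T.
template <typename T>
std::size_t WeightBytes(std::size_t count) {
  return count * sizeof(T);
}

// Dot product of two equally sized vectors, generic in the scalar type so the
// same code can run with float (smaller, often faster) or double (more precise).
template <typename T>
T Dot(const std::vector<T>& a, const std::vector<T>& b) {
  assert(a.size() == b.size());
  T sum = T(0);
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

On such a platform, `WeightBytes<double>(n)` is twice `WeightBytes<float>(n)`, which is where the lower memory footprint comes from. Any speed gain depends on the hardware and on how well the loops vectorise.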

    See also list of all changes.

  • 5.0.0-alpha-20210401(Apr 1, 2021)

    This is a new pre-release of Tesseract 5.0.0.

    • Replaced all remaining STRING by std::string
    • Replaced lots of GenericVector by std::vector
    • Replaced all malloc / free by C++ code
    • Modernized and formatted code
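
The kind of change behind the malloc / free and GenericVector items can be illustrated with a generic before-and-after. This is an illustrative sketch only, not code from the Tesseract source tree, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Old style: the caller owns the buffer and must remember to call free().
int* MakeSquaresMalloc(std::size_t n) {
  int* buf = static_cast<int*>(std::malloc(n * sizeof(int)));
  if (buf == nullptr) {
    return nullptr;  // allocation failed (or n == 0 on some platforms)
  }
  for (std::size_t i = 0; i < n; ++i) {
    buf[i] = static_cast<int>(i * i);
  }
  return buf;
}

// Modern style: std::vector owns the memory and releases it automatically
// when it goes out of scope, so there is no matching free() to forget.
std::vector<int> MakeSquares(std::size_t n) {
  std::vector<int> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<int>(i * i);
  }
  return out;
}
```

Both functions produce the same values. The second removes the manual ownership contract, which is the usual motivation for this kind of cleanup.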

    See also list of all changes.

  • 5.0.0-alpha-20201231(Dec 31, 2020)

    This is a new pre-release of Tesseract 5.0.0.

    It has massive changes in the public API which is a great step towards a final 5.0.0. All unit tests pass, but because of those changes more practical experience is needed.

    • the public API no longer uses proprietary data types GenericVector, STRING
    • pdf.ttf is no longer needed because it is now integrated into the code

    See also list of all changes.

  • 5.0.0-alpha-20201224(Dec 24, 2020)

    This is a new pre-release of Tesseract 5.0.0.

    It is considered to be production ready for end users, but nevertheless not stable because more incompatible API changes are planned.

    • improved performance (also on ARM / ARM64)
    • improved unit tests
    • many fixes
    • faster flat build with automake
    • support for latest macOS (including new M1 processor)

    See also list of all changes.

  • 4.1.1(Dec 26, 2019)

  • 4.1.0(Jul 7, 2019)

    • Added new renderers Alto, LSTMBox, WordStrBox.
    • Added character boxes in hOCR output.
    • Added Python training scripts (experimental) as an alternative to the shell scripts.
    • Better support AVX / AVX2 / SSE.
    • Disable OpenMP support by default (see e.g. #1171, #1081).
    • Fix for bounding box problem.
    • Implemented support for whitelist/blacklist in LSTM engine.
    • Improved cmake configuration.
    • Code modernization and improvements.
    • A lot of bug fixes...

    Detailed changelog is on wiki.

    Windows installer can be downloaded from https://github.com/UB-Mannheim/tesseract/wiki.

  • 4.0.0(Oct 29, 2018)

  • 3.05.02(Jun 19, 2018)

  • 3.05.01(Jun 1, 2017)

  • 3.05.00(Feb 16, 2017)

    • Fine-tuned the hOCR output.
    • Added TSV as another optional output format.
    • Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
    • text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
    • Training tools - Replaced asserts with tprintf() and exit(1).
    • Fixed Cygwin compatibility.
    • Improved multipage tiff processing.
    • Improved the embedded pdf font (pdf.ttf).
    • Enable selection of OCR engine mode from command line.
    • Changed tesseract command line parameter '-psm' to '--psm'.
    • Added new C API for orientation and script detection, removed the old one.
    • Increased minimum autoconf version to 2.59.
    • Removed dead code.
    • Fixed many compiler warnings.
    • Fixed memory and resource leaks.
    • Fixed some issues with the 'Cube' OCR engine.
    • Fixed some OpenCL issues.
    • Added option to build Tesseract with CMake build system.
    • Implemented CPPAN support for easy Windows building.
  • 3.04.01(Feb 16, 2016)

  • 3.04.00(Jul 24, 2015)
