How does Tesseract-OCR work? A Developer Guide

You've certainly heard of Tesseract-OCR if you've worked in the OCR field. This tutorial is a developer guide to how Tesseract works internally. Specifically, it follows what happens when you run the following example.

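#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <leptonica/allheaders.h>

int main()
{
    const char *output_base = "OUTPUT_BASE";

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Init returns 0 on success.
    if (api->Init(NULL, "eng") != 0) {
        delete api;
        return 1;
    }
    api->SetPageSegMode(tesseract::PSM_AUTO);

    Pix *image = pixRead("image.png");
    if (image == NULL) {
        api->End();
        delete api;
        return 1;
    }
    api->SetImage(image);

    // The renderers are declared in renderer.h, inside the tesseract namespace.
    tesseract::TessAltoRenderer alto_renderer(output_base);
    // Pass an empty document title rather than a null pointer.
    alto_renderer.BeginDocument("");

    api->ProcessPage(image, 0, nullptr, nullptr, 0, &alto_renderer);

    alto_renderer.EndDocument();

    api->End();
    delete api;
    pixDestroy(&image);

    return 0;
}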

or, somewhat equivalently, when you run the command below.

tesseract image.png OUTPUT_BASE -l eng --psm 3 alto # Enters tesseract.cpp::main

Let's start!

In the code, we first instantiated a TessBaseAPI object. This is a stateful API defined in baseapi.cpp. Some of its important members are worth stating:

tesseract_: Tesseract*: A class defining important methods such as SegmentPage (layout analysis) and recog_all_words (recognition). Its notable attributes include the binarized image pix_binary_: Image, textord_: Textord for layout analysis, and lstm_recognizer_: LSTMRecognizer*. It also holds almost 200 parameters as members, mostly controlling algorithmic subtleties as well as things like debugging.
thresholder_: ImageThresholder*: The image thresholding module.
block_list_: BLOCK_LIST*: A list of blocks with rows and recognized words, containing the result of a Tesseract run.
page_res_: PAGE_RES*: The data type that block_list_ is finally converted to.

A note on Tesseract parameters

The parameters used by Tesseract are defined throughout its code and control its various aspects and parts. Some of them are defined as global variables in the tesseract namespace and the others as member variables of classes like Tesseract, Textord and LanguageModel.

Each parameter has a default value that can be overridden by either of the following methods:

.config file within the .traineddata files used by the engine.
Config files located in TESSDATA_DIR/configs directory.

Depending on whether you're using the command or the code above, you can also use -c param=value or api->SetVariable(param, value) respectively.
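To make the override idea concrete, here is a small standalone sketch. It does not use Tesseract's own parameter classes: ParamStore and ParseAssignment are hypothetical names invented for this illustration, and the only behavior assumed is the one described above, namely a default value that a later param=value assignment replaces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: splits "name=value" (the shape passed to -c) at the first '='.
bool ParseAssignment(const std::string &text, std::string *name, std::string *value) {
    assert(name != nullptr && value != nullptr);
    const std::size_t eq = text.find('=');
    if (eq == std::string::npos || eq == 0) {
        return false;  // not a "name=value" assignment
    }
    *name = text.substr(0, eq);
    *value = text.substr(eq + 1);
    return true;
}

// Hypothetical stand-in for a parameter table: defaults that can be overridden.
class ParamStore {
public:
    void SetDefault(const std::string &name, const std::string &value) {
        values_[name] = value;
    }

    // Only parameters that already have a default can be overridden.
    bool SetVariable(const std::string &name, const std::string &value) {
        std::map<std::string, std::string>::iterator it = values_.find(name);
        if (it == values_.end()) {
            return false;
        }
        it->second = value;
        return true;
    }

    // Returns the current value, or an empty string for an unknown name.
    std::string Get(const std::string &name) const {
        std::map<std::string, std::string>::const_iterator it = values_.find(name);
        return it == values_.end() ? std::string() : it->second;
    }

private:
    std::map<std::string, std::string> values_;
};
```

With a default registered for textord_show_blobs, parsing "textord_show_blobs=1" and passing the two halves to SetVariable replaces the default, while an unknown name is rejected.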

We start with api->ProcessPage(). This method first calls TessBaseAPI::Recognize() and then calls the renderer's AddImage() to handle the OCR output, i.e., to convert the PAGE_RES data structure into an XML string in the ALTO standard.

TessBaseAPI::Recognize() is an important function in tesseract and works as follows.

Calls TessBaseAPI::FindLines()
Initializes page_res_ from block_list_.
Calls tesseract_->recog_all_words(page_res_, ...)
Calls TessBaseAPI::DetectParagraphs(true)

TessBaseAPI::FindLines() binarizes the input image by calling TessBaseAPI::Threshold() and then analyses its layout by calling tesseract_->SegmentPage(..., block_list_, ...). tesseract_->recog_all_words(page_res_, ...) performs the text recognition using LSTM, and finally DetectParagraphs detects the paragraphs within each block of page_res_.
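The call order described above can be summarized in a toy sketch. PipelineTrace is a hypothetical stand-in that only records stage names; none of the real work happens here, and the construction of page_res_ is reduced to a comment.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for TessBaseAPI: each stage only records that it ran.
struct PipelineTrace {
    std::vector<std::string> stages;

    void Threshold() { stages.push_back("Threshold"); }
    void SegmentPage() { stages.push_back("SegmentPage"); }
    void RecogAllWords() { stages.push_back("recog_all_words"); }
    void DetectParagraphs() { stages.push_back("DetectParagraphs"); }

    // FindLines: binarize first, then analyse the layout.
    void FindLines() {
        Threshold();
        SegmentPage();
    }

    // Recognize: layout, then recognition, then paragraph detection.
    void Recognize() {
        assert(stages.empty());  // this toy traces a single page
        FindLines();
        // (page_res_ would be built from block_list_ here)
        RecogAllWords();
        DetectParagraphs();
    }
};
```

Calling Recognize() on a fresh PipelineTrace leaves the four stage names in stages, in the order in which the text lists them.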

Let's move on to Tesseract::SegmentPage, but before that, we'd better get familiar with the common data types in Tesseract. Here is the UML diagram of the most common data types in Tesseract.

TO_BLOCK and TO_ROW are used in the intermediate stages of tesseract_->SegmentPage and are finally converted to BLOCK and ROW respectively. The BLOBNBOX type is likewise used only while SegmentPage runs.

SegmentPage is located in pagesegmain.cpp, which is also home to another important function, AutoPageSeg(pageseg_mode, blocks, to_blocks, ...). As its docstring says, AutoPageSeg is responsible for dividing the page image into blocks of uniform text linespacing and images. SegmentPage first calls AutoPageSeg with blocks containing a single BLOCK that represents the whole image and with to_blocks as an empty list of TO_BLOCK to be filled, among other parameters. Then it calls textord_.TextordPage(pageseg_mode, ..., pix_binary_, ..., blocks, &to_blocks).

According to the comments in textord.h, the Textord class gathers the text line and word finding functionality, and TextordPage() is its entry point. We revisit this method shortly.

Understanding Tesseract Layout Analysis

Here we explain the algorithm implemented in AutoPageSeg, which is laid out in the paper Ray Smith, Hybrid Page Layout Analysis via Tab-Stop Detection, 2009. The idea of the paper is to exploit the fact that the text lines in a block start at the same x coordinate; these aligned edges are called tab-stops. The description of the algorithm follows. For each step, we mention the corresponding part of the code and the debug-related parameters, in case you want to examine it. Note that many of the debug parameters require you to build ScrollView.jar, the Java program that ships alongside Tesseract, and to set the environment variable SCROLLVIEW_PATH to its path.
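The tab-stop idea itself fits in a few lines. The sketch below is only an illustration of "many lines share the same left x": it is not the algorithm from the paper or from Tesseract, which works on blobs and tab vectors. FindLeftTabStops, tolerance and min_lines are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Clusters the left edges of text lines and reports the mean x of every
// cluster that is supported by at least min_lines lines.
std::vector<int> FindLeftTabStops(std::vector<int> left_edges, int tolerance, int min_lines) {
    assert(tolerance >= 0 && min_lines >= 1);
    std::sort(left_edges.begin(), left_edges.end());
    std::vector<int> tab_stops;
    std::size_t start = 0;
    while (start < left_edges.size()) {
        std::size_t end = start;
        long sum = 0;
        // Grow the cluster while edges stay within tolerance of its first edge.
        while (end < left_edges.size() &&
               left_edges[end] - left_edges[start] <= tolerance) {
            sum += left_edges[end];
            ++end;
        }
        const long count = static_cast<long>(end - start);
        if (count >= min_lines) {
            tab_stops.push_back(static_cast<int>(sum / count));
        }
        start = end;
    }
    return tab_stops;
}
```

For a two-column page, the left edges of the lines pile up around two x values, and a stray indented line that aligns with nothing is ignored because too few lines support it.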

Preprocessing. This step takes place in Tesseract::SetupPageSegAndDetectOrientation. First, the image mask and the separators are identified using ordinary image processing methods, and the separators are subtracted from the input. The image mask, which covers photos, is used later in the code. LineFinder::FindAndRemoveLines and ImageFind::FindImages are responsible for this step.

Debug parameters

Parameter name	Description
tessedit_dump_pageseg_images	Results of intermediate steps are collected as images into the image array `tesseract::pixa_debug_` and finally are written into a PDF file. This is done in `Tesseract::Clear()`
textord_tabfind_show_vlines	Intermediate results of `LineFinder::FindAndRemoveLines` are written into a PDF named `vhlinefinding.pdf`.

Afterwards, a connected component analysis is run over the image by calling textord_.find_components(pix_binary_, blocks, to_blocks). By connected component we mean a contiguous area of black pixels, like a dot or a letter. It is also called a blob. Each blob is represented by a list of outlines, and each outline (of class C_OUTLINE) is represented by a list of points. An outline can also have a list of outlines as its children. For example, the blob of the English letter 'o' has an outline for its outer circle, with one child for the inner circle. Blobs are extracted as C_BLOB objects in the extract_edges(pix, block) function and assigned to their corresponding block. A parallel list of BLOBNBOX objects is also constructed and assigned to the corresponding to_block: TO_BLOCK. This last operation is done in assign_blobs_to_blocks2.
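To see what a connected component is, here is a minimal flood-fill sketch on a tiny text image where '#' stands for a black pixel. It only counts blobs and is not how Tesseract extracts them: extract_edges traces outlines, whereas this sketch labels pixels, and the choice of 8-connectivity is an assumption of the example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Counts the 8-connected groups of '#' pixels in a small binary image.
int CountBlobs(const std::vector<std::string> &image) {
    const int rows = static_cast<int>(image.size());
    const int cols = rows > 0 ? static_cast<int>(image[0].size()) : 0;
    for (int r = 0; r < rows; ++r) {
        assert(static_cast<int>(image[r].size()) == cols);  // rectangular input only
    }
    std::vector<std::vector<bool> > seen(rows, std::vector<bool>(cols, false));
    int blobs = 0;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (image[r][c] != '#' || seen[r][c]) {
                continue;
            }
            ++blobs;  // an unvisited black pixel starts a new blob
            std::vector<std::pair<int, int> > stack;
            stack.push_back(std::make_pair(r, c));
            seen[r][c] = true;
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                // Visit all eight neighbors of the current pixel.
                for (int dr = -1; dr <= 1; ++dr) {
                    for (int dc = -1; dc <= 1; ++dc) {
                        const int nr = p.first + dr;
                        const int nc = p.second + dc;
                        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
                        if (image[nr][nc] != '#' || seen[nr][nc]) continue;
                        seen[nr][nc] = true;
                        stack.push_back(std::make_pair(nr, nc));
                    }
                }
            }
        }
    }
    return blobs;
}
```

A ring of pixels shaped like the letter 'o' counts as a single blob even though it encloses a hole; in Tesseract's representation that hole would be the child outline mentioned above.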

From now on, we deal with blobs. filter_blobs partitions the blobs in each to_block by size into separate attributes of the to_block. While doing so, it estimates and sets line_size, line_spacing and max_blob_size as TO_BLOCK attributes. filter_blobs relies on a handful of parameters for its computation, notably textord_max_noise_size and textord_excess_blobsize. This function is located in tordmain.cpp.

Debug parameters

Parameter name	Description
textord_show_blobs	Shows the blobs of the image in different colors depending on their partition.
textord_show_boxes	Shows the bounding box of the image blobs.
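The size-based partitioning done by filter_blobs can be pictured with the sketch below. The three thresholds in SizeLimits are hypothetical values supplied by the caller; Tesseract derives its own limits from parameters such as textord_max_noise_size and textord_excess_blobsize, and that derivation is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum BlobSize { kNoise = 0, kSmall = 1, kMedium = 2, kLarge = 3 };

// Hypothetical thresholds, in pixels of blob height.
struct SizeLimits {
    int max_noise_height;   // at or below this: noise
    int min_medium_height;  // below this (and above noise): small
    int max_medium_height;  // above this: large
};

BlobSize ClassifyBlob(int height, const SizeLimits &limits) {
    assert(limits.max_noise_height < limits.min_medium_height);
    assert(limits.min_medium_height <= limits.max_medium_height);
    if (height <= limits.max_noise_height) return kNoise;
    if (height < limits.min_medium_height) return kSmall;
    if (height > limits.max_medium_height) return kLarge;
    return kMedium;
}

// Splits blob heights into four buckets, indexed by BlobSize.
std::vector<std::vector<int> > PartitionBlobs(const std::vector<int> &heights,
                                              const SizeLimits &limits) {
    std::vector<std::vector<int> > buckets(4);
    for (std::size_t i = 0; i < heights.size(); ++i) {
        buckets[ClassifyBlob(heights[i], limits)].push_back(heights[i]);
    }
    return buckets;
}
```

The four buckets play the role of the separate to_block attributes: later stages mostly build on the medium blobs and treat the other buckets differently.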

Now a ColumnFinder object is constructed, its SetupAndFilterNoise method is called and the object is returned for further use. SetupAndFilterNoise does blob partitioning once again with different logic and parameters than the filter_blobs function.

Debug parameters

Parameter name	Description
textord_tabfind_show_blocks	Shows the blobs of the image in different colors depending on their new partition.

Next, SetupAndFilterNoise calls the SetNeighboursOnMediumBlobs(input_block) method of the column finder's stroke_width_: StrokeWidth attribute. For each medium blob in the block, this method finds its neighbors in all four directions (stored in the BLOBNBOX::neighbours_ attribute) and marks the good ones, i.e., the neighbors with similar stroke width. Afterwards, a CCNonTextDetect object is created and asked to ComputeNonTextMask(..., photo_mask_pix, input_block). It constructs a noise mask out of the block's noise and small blobs, the medium blobs without good neighbors, and the photo mask computed earlier. It then adds to the noise mask the medium and large blobs that overlap with many small and noise blobs, as well as the large blobs that overlap with many medium blobs. The final result is stored in the column finder's nontext_map_ attribute.

Debug parameters

Parameter name	Description
textord_debug_tabfind	Writes the initial noise mask in an image named `junknoisemask.png` and the final noise+photo mask in an image named `junkccphotomask.png`. It also plots noise (in red) and non-noise blobs in a window named `Photo Mask Blobs`.

Finding Tab-stops, Column Partitions and Blocks. Suppose we slice a column horizontally. Each slice is called a column partition. We find column partitions before forming the columns. This is done in finder->FindBlocks(...) as follows. First, stroke_width_->FindLeaderPartitions(input_block, &part_grid_) finds leader dots and constructs a ColPartition for every chain of them. It also records on the left and right neighbor blobs whether they have a leader partition on that side. Then stroke_width_->RemoveLineResidue(&big_parts_) finds very tall line-like blobs, constructs a ColPartition out of each, and adds them to the big_parts_ attribute of the column finder.

Debug parameters

Parameter name	Description
textord_tabfind_show_strokewidths	Displays blobs with colors according to their neighbors and other criteria in a window named `LeaderNeighbours`. It also draws those tall line-like partitions.

Then it finds tab-stop blobs in FindInitialTabVectors(...).

Debug parameters

Parameter name	Description
textord_tabfind_show_initialtabs	Displays tab-stops.

The next step is to construct column partitions out of medium and large blobs by looking for horizontal or vertical chains of blobs, ruling out small and noise blobs. This is done in stroke_width_->GradeBlobsIntoPartitions(...).

Debug parameters

Parameter name	Description
textord_tabfind_show_strokewidths	Plots blobs and the boxes being processed during `GradeBlobsIntoPartitions`.

Image column partitions are found next, using the image mask and first running a connected component analysis on it. This is done by ImageFind::FindImagePartitions(...).

Debug parameters

Parameter name	Description
textord_tabfind_show_images	Shows image column partitions along with the other ones in a window named `With Images`.

Now that the column partitions are available, the final tab-stops are computed, along with the column widths. This is done by FindTabVectors(...). The tab-stops are then assigned to the corresponding column partitions in part_grid_.SetTabStops(this).

Debug parameters

Parameter name	Description
textord_tabfind_show_finaltabs	Displays final tab-stops.

Subsequently, in the column finder's MakeColumns(), column partitions are grouped into a list of candidate ColPartitionSets representing the page columns. The column partition set that best explains the page layout is then selected.

Debug parameters

Parameter name	Description
textord_tabfind_show_columns	Shows the final columns of the page.

After some further operations, horizontal and vertical separators are added as column partitions in GridInsertHLinePartitions(...) and GridInsertVLinePartitions(...), and the types of the column partitions are reset based on the found columns in SetPartitionTypes().

Debug parameters

Parameter name	Description
textord_tabfind_show_initial_partitions	Displays the column partitions in this state of the algorithm.

Now some refinements are applied to the column partitions, such as optionally finding tables, deleting unknown partitions (by DeleteUnknownParts()), finding partition partners, finding figure captions, and so on.

Debug parameters

Parameter name	Description
textord_tabfind_show_partitions	Shows the final column partitions.

In the next step column partitions are transformed into blocks in TransformToBlocks(blocks, to_blocks).

Debug parameters

Parameter name	Description
textord_tabfind_show_blocks	Displays the blocks.

A note on how Tesseract finds things in the image

Much of the searching in Tesseract is done through a generic class called BBGrid<class BBC, class BBC_CLIST, class BBC_C_IT>. This class is like a matrix whose cells host instances of BBC objects. For example, BlobGrid holds the blobs of the image, and ColPartitionGrid holds the ColPartitions located in the image. One ColPartition or blob can span many grid cells. GridSearch<class BBC, class BBC_CLIST, class BBC_C_IT> then provides facilities to search or move through the grid efficiently in different patterns.
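The grid idea can be sketched in a few dozen lines. BoxGrid below is a hypothetical toy, not BBGrid itself: it stores plain bounding boxes, registers each box in every cell it touches, and offers a single rectangle search. The names Box, Insert and RectangleSearch are chosen for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Box {
    int left, bottom, right, top;  // inclusive pixel bounds
};

// Toy uniform grid: every cell lists the ids of the boxes that touch it.
class BoxGrid {
public:
    BoxGrid(int cell_size, int width, int height) : cell_size_(cell_size) {
        assert(cell_size > 0 && width > 0 && height > 0);
        cols_ = (width + cell_size - 1) / cell_size;
        rows_ = (height + cell_size - 1) / cell_size;
        cells_.resize(static_cast<std::size_t>(cols_) * rows_);
    }

    // Registers the box in every cell it overlaps and returns its id.
    int Insert(const Box &box) {
        const int id = static_cast<int>(boxes_.size());
        boxes_.push_back(box);
        for (int y = CellOf(box.bottom, rows_); y <= CellOf(box.top, rows_); ++y) {
            for (int x = CellOf(box.left, cols_); x <= CellOf(box.right, cols_); ++x) {
                cells_[static_cast<std::size_t>(y) * cols_ + x].push_back(id);
            }
        }
        return id;
    }

    // Returns the ids of all boxes overlapping the query, in ascending order.
    std::vector<int> RectangleSearch(const Box &query) const {
        std::set<int> found;  // a box spanning several cells is reported once
        for (int y = CellOf(query.bottom, rows_); y <= CellOf(query.top, rows_); ++y) {
            for (int x = CellOf(query.left, cols_); x <= CellOf(query.right, cols_); ++x) {
                const std::vector<int> &cell = cells_[static_cast<std::size_t>(y) * cols_ + x];
                for (std::size_t i = 0; i < cell.size(); ++i) {
                    if (Overlaps(boxes_[cell[i]], query)) found.insert(cell[i]);
                }
            }
        }
        return std::vector<int>(found.begin(), found.end());
    }

private:
    // Maps a pixel coordinate to a cell index, clamped to the grid.
    int CellOf(int coord, int limit) const {
        return std::min(std::max(coord / cell_size_, 0), limit - 1);
    }

    static bool Overlaps(const Box &a, const Box &b) {
        return a.left <= b.right && b.left <= a.right &&
               a.bottom <= b.top && b.bottom <= a.top;
    }

    int cell_size_;
    int cols_;
    int rows_;
    std::vector<std::vector<int> > cells_;
    std::vector<Box> boxes_;
};
```

The point of the grid is that a search only looks at the cells under the query rectangle instead of scanning every box on the page, which is what makes repeated neighbor lookups affordable.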

So far, we've extracted the blocks of the page into to_blocks; next we should extract lines and words from the text blocks. This responsibility goes to textord_.TextordPage(...). This function first calls filter_blobs(..., to_blocks, ...) to partition the blobs again. Then it calls make_rows(page_tr_, to_blocks). In make_rows, make_initial_textrows and cleanup_rows_making are first called for each block, which results in a list of TO_ROWs per block. The rows for the medium blobs of a block are allocated in the assign_blobs_to_rows(TO_BLOCK *block,...) function, which either assigns a blob to a suitable existing row or creates a new row for it. The rows resulting from make_initial_textrows are refined in cleanup_rows_making by expanding rows towards their neighbors, removing the empty ones, and merging the highly overlapping ones. Finally, the other blobs of the block are assigned to a suitable row without creating new ones.

Debug parameters

Parameter name	Description
textord_show_initial_rows	Draws initial rows.
textord_show_expanded_rows	Draws expanded rows.
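The "assign to an existing row or create a new one" rule can be sketched as follows. This is a simplified illustration, not the logic of assign_blobs_to_rows: the half-height overlap test and the names BlobBox, Row and AssignBlobsToRows are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct BlobBox {
    int bottom, top;  // vertical extent of a blob, bottom < top
};

struct Row {
    int bottom, top;            // vertical extent covered by the row so far
    std::vector<int> blob_ids;  // indices into the input blob list
};

// Puts each blob into the first row that overlaps at least half of the blob's
// height; a blob that fits no row starts a new one.
std::vector<Row> AssignBlobsToRows(const std::vector<BlobBox> &blobs) {
    std::vector<Row> rows;
    for (std::size_t i = 0; i < blobs.size(); ++i) {
        const BlobBox &blob = blobs[i];
        assert(blob.bottom < blob.top);
        std::size_t target = rows.size();  // rows.size() means "no row found"
        for (std::size_t r = 0; r < rows.size(); ++r) {
            const int overlap = std::min(rows[r].top, blob.top) -
                                std::max(rows[r].bottom, blob.bottom);
            if (2 * overlap >= blob.top - blob.bottom) {
                target = r;
                break;
            }
        }
        if (target == rows.size()) {
            Row fresh;
            fresh.bottom = blob.bottom;
            fresh.top = blob.top;
            rows.push_back(fresh);
        } else {
            // Let the row grow so that it covers the newly added blob.
            rows[target].bottom = std::min(rows[target].bottom, blob.bottom);
            rows[target].top = std::max(rows[target].top, blob.top);
        }
        rows[target].blob_ids.push_back(static_cast<int>(i));
    }
    return rows;
}
```

Cleaning up afterwards (expanding, dropping empty rows, merging overlapping rows) is a separate pass in Tesseract and is left out of this sketch.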

After the rows are created, a straight baseline and a spline baseline are fitted to each one. This is done by the ComputeStraightBaselines and ComputeBaselineSplinesAndXheights methods of the BaselineDetect class. The computed baselines are also used later in proportional word finding.

Debug parameters

Parameter name	Description
textord_show_final_blobs	Draws final blobs. It's here because an optional and false-by-default function named `vigorous_noise_removal` in `ComputeBaselineSplinesAndXheights` removes some blobs.
textord_show_final_rows	Draws final rows.
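As an illustration of what fitting a straight baseline means, the sketch below fits a line through the bottom points of the blobs in a row. It swaps in ordinary least squares for simplicity and is not claimed to be the fitting method used by BaselineDetect; Point, Line and FitBaseline are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point {
    double x, y;  // bottom-center of one blob
};

struct Line {
    double slope, intercept;  // y = slope * x + intercept
};

// Ordinary least-squares fit of a straight line through the blob bottoms.
Line FitBaseline(const std::vector<Point> &bottoms) {
    assert(bottoms.size() >= 2);
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < bottoms.size(); ++i) {
        sx += bottoms[i].x;
        sy += bottoms[i].y;
        sxx += bottoms[i].x * bottoms[i].x;
        sxy += bottoms[i].x * bottoms[i].y;
    }
    const double n = static_cast<double>(bottoms.size());
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // all points share one x: no unique line
    Line line;
    line.slope = (n * sxy - sx * sy) / denom;
    line.intercept = (sy - line.slope * sx) / n;
    return line;
}
```

A slightly skewed row shows up as a small non-zero slope. Plain least squares is sensitive to outliers such as descenders, which is one reason a production fit needs to be more robust than this sketch.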

After creating rows, we have to create words. This is done by make_words. This function first tests for fixed-pitch words and, if that fails, falls back to proportional words. In the next step, cleanup_blocks(true, blocks) removes noise from words and rows, which may eliminate an entire row. After that, diacritics are assigned to the nearest word in the function TransferDiacriticsToBlockGroups(diacritic_blobs, blocks). We are now at the end of both Tesseract::SegmentPage(..., blocks,...) and TessBaseAPI::FindLines(), with blocks containing the result. After calling FindLines, TessBaseAPI::Recognize constructs a PAGE_RES object out of the blocks, calls tesseract_->recog_all_words(page_res_,...) for text recognition with an LSTM neural network, and finally calls DetectParagraphs(true) for paragraph detection.

Debug parameters

Parameter name	Description
interactive_display_mode	Shows the result blocks interactively and with details.
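The proportional fallback in make_words boils down to deciding which gaps between neighboring blobs are word spaces. The sketch below shows only that core idea with a fixed, caller-supplied threshold; how Tesseract estimates its gap thresholds, and the fixed-pitch test, are not reproduced. Span, SplitIntoWords and min_word_gap are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Span {
    int left, right;  // horizontal extent of one blob
};

// Groups the blobs of a row (sorted left to right) into words: a gap wider
// than min_word_gap starts a new word.
std::vector<std::vector<int> > SplitIntoWords(const std::vector<Span> &blobs,
                                              int min_word_gap) {
    std::vector<std::vector<int> > words;
    for (std::size_t i = 0; i < blobs.size(); ++i) {
        if (i > 0) {
            assert(blobs[i].left >= blobs[i - 1].left);  // input must be sorted
        }
        const bool starts_word =
            i == 0 || blobs[i].left - blobs[i - 1].right > min_word_gap;
        if (starts_word) {
            words.push_back(std::vector<int>());
        }
        words.back().push_back(static_cast<int>(i));
    }
    return words;
}
```

With five blobs separated by gaps of 2, 2, 12 and 2 pixels and a threshold of 5, the row splits into a three-blob word followed by a two-blob word.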

Finally, the PAGE_RES object is given to the renderer, which produces the output in the desired format and saves it to a file.

The End.
