If you've worked in the OCR field, you've almost certainly heard of Tesseract-OCR. This tutorial is a developer guide to how tesseract works internally. Specifically, it traces what happens when you run the following example.
    #include <tesseract/baseapi.h>
    #include <tesseract/renderer.h>
    #include <leptonica/allheaders.h>

    int main()
    {
        const char *output_base = "OUTPUT_BASE";
        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        // Init returns 0 on success.
        if (api->Init(nullptr, "eng") != 0) {
            delete api;
            return 1;
        }
        api->SetPageSegMode(tesseract::PSM_AUTO);
        Pix *image = pixRead("image.png");
        api->SetImage(image);
        tesseract::TessAltoRenderer alto_renderer(output_base);
        // An empty title avoids handing a null pointer to the renderer.
        alto_renderer.BeginDocument("");
        api->ProcessPage(image, 0, nullptr, nullptr, 0, &alto_renderer);
        alto_renderer.EndDocument();
        api->End();
        delete api;
        pixDestroy(&image);
        return 0;
    }

or, somewhat equivalently, when you run the command below.

    tesseract image.png OUTPUT_BASE -l eng --psm 3 alto # Enters tesseract.cpp::main

Let's start!
In the code, we first instantiated a TessBaseAPI object. This is a stateful API defined in baseapi.cpp. Some of its important members are worth listing:
- tesseract_: Tesseract*: A class defining some important methods related to layout analysis and recognition, e.g., SegmentPage and recog_all_words. Its notable attributes include the binarized image pix_binary_: Image, textord_: Textord for layout analysis, and lstm_recognizer_: LSTMRecognizer*. It also holds almost 200 parameters as members, mostly controlling algorithmic subtleties but also things like debugging.
- thresholder_: ImageThresholder*: The image thresholding module.
- block_list_: BLOCK_LIST*: A list of blocks with rows and recognized words, containing the result of a tesseract run.
- page_res_: PAGE_RES*: block_list_ is finally converted to this data type.
The parameters used by tesseract are defined throughout its code and control various aspects of its behavior. Some of them are defined as global variables in the tesseract namespace, and the others as member variables of classes like Tesseract, Textord and LanguageModel.
Each parameter has a default value that can be overridden by either of the following methods:
- A .config file within the .traineddata files used by the engine.
- Config files located in the TESSDATA_DIR/configs directory.
Depending on whether you're using the command or the code above, you can also use -c param=value or api->SetVariable(param, value), respectively.
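The layering of parameter values can be pictured with a small stand-alone sketch. It does not use tesseract's own parameter classes: ParamStore, Define, Set, Get and ApplyOverrides are hypothetical names introduced only for illustration. The sketch also assumes that later sources simply win over earlier ones, which is an assumption of this example rather than a statement about tesseract's exact loading order.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a parameter table; not tesseract's real classes.
class ParamStore {
 public:
  // Registers a parameter together with its compiled-in default value.
  void Define(const std::string &name, const std::string &default_value) {
    values_[name] = default_value;
  }

  // Overrides an already defined parameter. Returns false for unknown names,
  // so that typos in a parameter name do not silently create new entries.
  bool Set(const std::string &name, const std::string &value) {
    auto it = values_.find(name);
    if (it == values_.end()) {
      return false;
    }
    it->second = value;
    return true;
  }

  // Returns the current value; the parameter must have been defined.
  const std::string &Get(const std::string &name) const {
    auto it = values_.find(name);
    assert(it != values_.end());
    return it->second;
  }

 private:
  std::map<std::string, std::string> values_;
};

// Applies one source of overrides (for example the contents of a config
// file). Calling this once per source, in order, lets later sources win.
void ApplyOverrides(ParamStore &store,
                    const std::map<std::string, std::string> &source) {
  for (const auto &entry : source) {
    store.Set(entry.first, entry.second);
  }
}
```

In this picture, a value from a config file replaces the default, and a later call to Set (playing the role of -c or SetVariable) replaces that again.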
We start with api->ProcessPage(). This method first calls TessBaseAPI::Recognize(), then simply calls alto_renderer->AddImage() to handle the OCR output, i.e., to convert the PAGE_RES data structure to an XML string in the ALTO standard.
TessBaseAPI::Recognize() is an important function in tesseract and works as follows.
- Calls TessBaseAPI::FindLines()
- Initializes page_res_ from block_list_.
- Calls tesseract_->recog_all_words(page_res_, ...)
- Calls TessBaseAPI::DetectParagraphs(true)
TessBaseAPI::FindLines() is one of the important functions in tesseract. It binarizes the input image by calling TessBaseAPI::Threshold(), then analyses its layout by calling tesseract_->SegmentPage(..., block_list_, ...). tesseract_->recog_all_words(page_res_, ...) does the text recognition using LSTM, and finally DetectParagraphs detects paragraphs within each block of page_res_.
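The call order described above can be summarized in a tiny mock. Everything in it is hypothetical: RecognizeTrace and its stub methods only record their names in the order the text gives, and the "MakePageRes" label is an invented name for the step that builds page_res_ from block_list_. It is a reading aid, not tesseract's code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mock of the call order; each stage is a stub that only
// records its name. None of this is tesseract's real implementation.
struct RecognizeTrace {
  std::vector<std::string> calls;
  bool thresholded = false;
  bool segmented = false;

  void Threshold() {
    calls.push_back("Threshold");
    thresholded = true;
  }

  void SegmentPage() {
    assert(thresholded);  // layout analysis works on the binarized image
    calls.push_back("SegmentPage");
    segmented = true;
  }

  void FindLines() {
    calls.push_back("FindLines");
    Threshold();
    SegmentPage();
  }

  void Recognize() {
    calls.push_back("Recognize");
    FindLines();
    assert(segmented);  // recognition needs the blocks found by layout analysis
    calls.push_back("MakePageRes");       // page_res_ built from block_list_
    calls.push_back("recog_all_words");   // text recognition
    calls.push_back("DetectParagraphs");  // paragraph detection
  }
};
```

Calling Recognize() on a fresh RecognizeTrace fills calls with the seven stage names in the order in which the text introduces them.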
Let's move forward with Tesseract::SegmentPage, but before that, we'd better get familiar with the common data types in tesseract. Here is the UML diagram of the most common data types in tesseract.
TO_BLOCK and TO_ROW are used in the intermediate stages of tesseract_->SegmentPage and are finally converted to BLOCK and ROW respectively. The BLOBNBOX type is likewise used only while SegmentPage runs.
SegmentPage is located in pagesegmain.cpp, which is also home to another important function, AutoPageSeg(pageseg_mode, blocks, to_blocks, ...). As its docstring says, AutoPageSeg is responsible for dividing the page image into blocks of uniform text linespacing and images. SegmentPage first calls AutoPageSeg with blocks containing a single BLOCK that represents the whole image and with to_blocks as an empty list of TO_BLOCK to be filled, among other parameters. Then it calls textord_.TextordPage(pageseg_mode, ..., pix_binary_, ..., blocks, &to_blocks).
According to the comments in textord.h, the Textord class gathers the text line and word finding functionality, and TextordPage() is the entry point to it. We revisit this method shortly.
Here we should explain the algorithm implemented in AutoPageSeg, which is laid out in the paper Ray Smith, Hybrid Page Layout Analysis via Tab-Stop Detection, 2009. The idea of the paper is to exploit the fact that the leftmost coordinates of the lines in a block of text, the so-called tab-stops, share the same x. The algorithm is described step by step below. For each step, we mention the corresponding part of the code and the debug-related parameters in case you want to examine it. Note that in order to use many of the debug parameters, you have to build the Java program that ships alongside tesseract, namely ScrollView.jar, and set the environment variable SCROLLVIEW_PATH to its path.
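Before going through the steps, the tab-stop observation itself can be illustrated with a toy sketch: given the left edge of each text line, look for x positions where several edges line up. FindTabStopCandidates, its tolerance and its min_support are hypothetical names and parameters chosen for this example. The grouping used here is a deliberately simple stand-in and not the procedure from the paper or from tesseract.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns x positions at which at least `min_support`
// line edges align within `tolerance` pixels of the group's leftmost edge.
std::vector<int> FindTabStopCandidates(std::vector<int> left_edges,
                                       int tolerance, int min_support) {
  assert(tolerance >= 0 && min_support >= 1);
  std::sort(left_edges.begin(), left_edges.end());
  std::vector<int> candidates;
  std::size_t start = 0;
  while (start < left_edges.size()) {
    // Extend the group while edges stay within tolerance of its first edge.
    std::size_t end = start;
    long sum = 0;
    while (end < left_edges.size() &&
           left_edges[end] - left_edges[start] <= tolerance) {
      sum += left_edges[end];
      ++end;
    }
    const int count = static_cast<int>(end - start);
    if (count >= min_support) {
      // Report the mean x of the aligned edges as the candidate position.
      candidates.push_back(static_cast<int>(sum / count));
    }
    start = end;
  }
  return candidates;
}
```

With two columns of three lines each plus one indented line, the indented line forms a group of its own and is dropped for lack of support, while the two column edges are reported.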
-
Preprocessing

This step takes place in Tesseract::SetupPageSegAndDetectOrientation. First, the image mask and the separators are identified in the image using ordinary image processing methods, and the separators are subtracted from the input. The image mask, which contains photos, is used later in the code. LineFinder::FindAndRemoveLines and ImageFind::FindImages are responsible for this.

Debug parameters

| Parameter name | Description |
| --- | --- |
| tessedit_dump_pageseg_images | Results of intermediate steps are collected as images into the image array tesseract::pixa_debug_ and are finally written into a PDF file. This is done in Tesseract::Clear(). |
| textord_tabfind_show_vlines | Intermediate results of LineFinder::FindAndRemoveLines are written into a PDF named vhlinefinding.pdf. |

Afterwards, a connected component analysis is run over the image by calling textord_.find_components(pix_binary_, blocks, to_blocks). A connected component is a contiguous area of black pixels, like a dot or a letter. It is also called a blob. Each blob is represented by a list of outlines, and each outline (of class C_OUTLINE) is represented by a list of points. An outline can also have a list of outlines as its children. For example, the blob of the English letter 'o' has an outline for its outer circle, which has one child for the inner circle. Blobs are extracted as C_BLOB objects in the extract_edges(pix, block) function and assigned to their corresponding block. A parallel list of BLOBNBOX objects is also constructed and assigned to the corresponding to_block: TO_BLOCK. This last operation is done in assign_blobs_to_blocks2.

From now on, we deal with blobs. filter_blobs partitions the blobs in each to_block by size into separate attributes of the to_block. In doing so, it estimates and sets line_size, line_spacing and max_blob_size as TO_BLOCK attributes. filter_blobs relies on a handful of parameters for its computation, notably textord_max_noise_size and textord_excess_blobsize. This function is located in tordmain.cpp.

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_show_blobs | Shows the blobs of the image in different colors depending on their partition. |
| textord_show_boxes | Shows the bounding boxes of the image blobs. |

Now a ColumnFinder object is constructed, its SetupAndFilterNoise method is called and the object is returned for further use. SetupAndFilterNoise partitions the blobs once again, with different logic and parameters than the filter_blobs function.

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_blocks | Shows the blobs of the image in different colors depending on their new partition. |

Next, SetupAndFilterNoise calls the SetNeighboursOnMediumBlobs(input_block) method of the column finder's stroke_width_: StrokeWidth attribute. For each medium blob in the block, this method finds its neighbors in all four directions (they are stored in the BLOBNBOX::neighbours_ attribute) and marks the good ones, i.e., the neighbors with a similar stroke width. Afterwards, a CCNonTextDetect object is created and asked to ComputeNonTextMask(..., photo_mask_pix, input_block). It constructs a noise mask out of the block's noise and small blobs, the medium blobs without good neighbors and the already computed photo mask. Then it adds to the noise mask the medium and large blobs that overlap with a lot of small and noise blobs, as well as the large blobs that overlap with a lot of medium blobs. The final result is stored in the column finder's nontext_map_ attribute.

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_debug_tabfind | Writes the initial noise mask to an image named junknoisemask.png and the final noise+photo mask to an image named junkccphotomask.png. It also plots noise (in red) and non-noise blobs in a window named Photo Mask Blobs. |

-
Finding Tab-stops, Column Partitions and Blocks

Suppose we slice a column horizontally. Each slice is called a column partition. We find column partitions before forming the columns. This is done in finder->FindBlocks(...) as follows. First, it finds leader dots and constructs a ColPartition for every chain of them in stroke_width_->FindLeaderPartitions(input_block, &part_grid_). It also records, for the left and right neighbor blobs, whether they have a leader partition on their side. Then, in stroke_width_->RemoveLineResidue(&big_parts_), it finds very tall line-like blobs, constructs ColPartitions out of them and adds them to the big_parts_ attribute of the column finder.

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_strokewidths | Displays blobs with colors according to their neighbors and other criteria in a window named LeaderNeighbours. It also draws those tall line-like partitions. |

Then it finds tab-stop blobs in FindInitialTabVectors(...).

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_initialtabs | Displays tab-stops. |

The next step is to construct column partitions out of medium and large blobs by looking for horizontal/vertical chains of blobs, ruling out small and noise blobs. This is done in stroke_width_->GradeBlobsIntoPartitions(...).

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_strokewidths | Plots blobs and the boxes being processed during GradeBlobsIntoPartitions. |

Image column partitions are found afterwards, by taking the image mask and first running a CC analysis on it. This is done by ImageFind::FindImagePartitions(...).

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_images | Shows image column partitions along with the other ones in a window named With Images. |

Now, given the column partitions, the final tab-stops are computed along with the column widths. This is done by FindTabVectors(...). Then the tab-stops are assigned to the corresponding column partitions in part_grid_.SetTabStops(this).

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_finaltabs | Displays final tab-stops. |

Subsequently, in the column finder's MakeColumns(), column partitions are grouped into a list of candidate ColPartitionSets representing the page columns. The column partition set that best explains the page layout is then selected.

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_columns | Shows the final columns of the page. |

After some operations, horizontal and vertical separators are added as column partitions in GridInsertH/VLinePartitions(...), and the type of each column partition is reset using the found columns in SetPartitionTypes().

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_initial_partitions | Displays the column partitions in this state of the algorithm. |

Now some refinements are made to the column partitions: optionally finding tables, deleting unknown partitions (by DeleteUnknownParts()), finding partition partners, finding figure captions and so on.

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_partitions | Shows the final column partitions. |

In the next step, column partitions are transformed into blocks in TransformToBlocks(blocks, to_blocks).

Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_tabfind_show_blocks | Displays the blocks. |

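Since everything in the steps above operates on blobs, it may help to see the notion of a connected component in code. The sketch below labels 8-connected groups of black pixels with a flood fill and returns one bounding box per group. This is a swapped-in technique chosen for brevity: tesseract's extract_edges builds outlines rather than flood filling. Box, FindBlobBoxes and the '#' pixel convention are hypothetical and exist only in this example.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bounding box type (inclusive pixel coordinates).
struct Box {
  int left, top, right, bottom;
};

// Labels 8-connected components of '#' pixels and returns one box per
// component, ordered by the raster position of each component's first pixel.
std::vector<Box> FindBlobBoxes(const std::vector<std::string> &image) {
  const int height = static_cast<int>(image.size());
  const int width = height == 0 ? 0 : static_cast<int>(image[0].size());
  for (const std::string &row : image) {
    assert(static_cast<int>(row.size()) == width);  // rectangular input only
  }
  std::vector<std::vector<bool>> seen(height, std::vector<bool>(width, false));
  std::vector<Box> boxes;
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      if (image[y][x] != '#' || seen[y][x]) {
        continue;
      }
      // Start a new component and grow it breadth-first.
      Box box{x, y, x, y};
      std::queue<std::pair<int, int>> frontier;
      frontier.push(std::make_pair(x, y));
      seen[y][x] = true;
      while (!frontier.empty()) {
        const std::pair<int, int> current = frontier.front();
        frontier.pop();
        box.left = std::min(box.left, current.first);
        box.right = std::max(box.right, current.first);
        box.top = std::min(box.top, current.second);
        box.bottom = std::max(box.bottom, current.second);
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = current.first + dx;
            const int ny = current.second + dy;
            if (nx < 0 || ny < 0 || nx >= width || ny >= height) {
              continue;
            }
            if (image[ny][nx] == '#' && !seen[ny][nx]) {
              seen[ny][nx] = true;
              frontier.push(std::make_pair(nx, ny));
            }
          }
        }
      }
      boxes.push_back(box);
    }
  }
  return boxes;
}
```

A bounding box per component is roughly the information a BLOBNBOX carries about its blob. The nested outlines described for the letter 'o' are not modeled here.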
Much of the searching in tesseract is done through a generic class called BBGrid<class BBC, class BBC_CLIST, class BBC_C_IT>. This class is like a matrix whose cells host instances of BBC objects. For example, BlobGrid holds the blobs of the image, and ColPartitionGrid hosts the ColPartitions located in the image. One ColPartition or blob can span many grid cells. GridSearch<class BBC, class BBC_CLIST, class BBC_C_IT> then provides facilities to search or move through the grid efficiently in different patterns.
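The grid idea can be sketched in a few lines. The miniature below is not BBGrid: Rect, BoxGrid, Insert and Search are hypothetical names, boxes are stored by value, and the only search pattern is a rectangle query. It does show the two properties mentioned above: a box is registered in every cell it touches, and a query only inspects the cells it overlaps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical rectangle with inclusive coordinates (left <= right,
// top <= bottom).
struct Rect {
  int left, top, right, bottom;
  bool Overlaps(const Rect &o) const {
    return left <= o.right && o.left <= right &&
           top <= o.bottom && o.top <= bottom;
  }
};

// Hypothetical miniature of a bounding-box grid over a width x height image.
class BoxGrid {
 public:
  BoxGrid(int width, int height, int cell_size)
      : cell_size_(cell_size),
        cols_(CellCount(width, cell_size)),
        rows_(CellCount(height, cell_size)),
        cells_(static_cast<std::size_t>(cols_) * rows_) {}

  // Stores the box and registers its id in every cell it touches.
  int Insert(const Rect &box) {
    const int id = static_cast<int>(boxes_.size());
    boxes_.push_back(box);
    for (int cy = Cell(box.top, rows_); cy <= Cell(box.bottom, rows_); ++cy) {
      for (int cx = Cell(box.left, cols_); cx <= Cell(box.right, cols_); ++cx) {
        cells_[Index(cx, cy)].push_back(id);
      }
    }
    return id;
  }

  // Returns ids of stored boxes overlapping the query, looking only at the
  // cells the query touches.
  std::set<int> Search(const Rect &query) const {
    std::set<int> hits;
    for (int cy = Cell(query.top, rows_); cy <= Cell(query.bottom, rows_); ++cy) {
      for (int cx = Cell(query.left, cols_); cx <= Cell(query.right, cols_); ++cx) {
        for (int id : cells_[Index(cx, cy)]) {
          if (boxes_[id].Overlaps(query)) {
            hits.insert(id);
          }
        }
      }
    }
    return hits;
  }

 private:
  static int CellCount(int extent, int cell_size) {
    assert(extent > 0 && cell_size > 0);
    return (extent + cell_size - 1) / cell_size;
  }
  // Maps a pixel coordinate to a cell index, clamped to the grid.
  int Cell(int coord, int count) const {
    return std::max(0, std::min(coord / cell_size_, count - 1));
  }
  std::size_t Index(int cx, int cy) const {
    return static_cast<std::size_t>(cy) * cols_ + cx;
  }

  int cell_size_, cols_, rows_;
  std::vector<std::vector<int>> cells_;
  std::vector<Rect> boxes_;
};
```

Because a box can be registered in several cells, Search collects ids in a set so that each box is reported once.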
So far, we've extracted the blocks of the page into to_blocks, and we now need to extract lines and words from the text blocks. This responsibility goes to textord_.TextordPage(...). This function first calls filter_blobs(..., to_blocks, ...) to partition the blobs again. Then make_rows(page_tr_, to_blocks), an important function in tesseract, is called. In make_rows, make_initial_textrows and cleanup_rows_making are first called for each block, which results in a list of TO_ROWs per block. The algorithm that carefully allocates enough TO_ROWs for the medium blobs of a block lives in the assign_blobs_to_rows(TO_BLOCK *block, ...) function: each blob is either assigned to a suitable existing row or gets a new row created for it. The rows resulting from make_initial_textrows are refined in cleanup_rows_making by expanding rows towards their neighbors, removing the empty ones and merging the highly overlapping ones. Finally, the other blobs of the block are assigned to the proper row without creating new ones.
Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_show_initial_rows | Draws initial rows. |
| textord_show_expanded_rows | Draws expanded rows. |

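The "assign to an existing row or create a new one" rule from make_rows can be illustrated with a toy version. Blob, Row and AssignBlobsToRows below are hypothetical, and the criterion (a blob joins the row it overlaps most, provided the overlap covers at least half of the blob's height) is an assumption made for this sketch, not the test that assign_blobs_to_rows applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertical extent of a blob in image coordinates (top < bottom).
struct Blob {
  int top, bottom;
};

// Hypothetical row: its vertical extent plus the indices of its blobs.
struct Row {
  int top, bottom;
  std::vector<std::size_t> blob_indices;
};

// Greedy row assignment: each blob joins the row it overlaps most if that
// overlap is at least half its height; otherwise it starts a new row.
std::vector<Row> AssignBlobsToRows(const std::vector<Blob> &blobs) {
  std::vector<Row> rows;
  for (std::size_t i = 0; i < blobs.size(); ++i) {
    const Blob &blob = blobs[i];
    assert(blob.top < blob.bottom);
    const int height = blob.bottom - blob.top;
    Row *best = nullptr;
    int best_overlap = 0;
    for (Row &row : rows) {
      const int overlap =
          std::min(row.bottom, blob.bottom) - std::max(row.top, blob.top);
      if (overlap > best_overlap) {
        best_overlap = overlap;
        best = &row;
      }
    }
    if (best != nullptr && 2 * best_overlap >= height) {
      // Grow the row to cover the new blob.
      best->top = std::min(best->top, blob.top);
      best->bottom = std::max(best->bottom, blob.bottom);
      best->blob_indices.push_back(i);
    } else {
      rows.push_back(Row{blob.top, blob.bottom, {i}});
    }
  }
  return rows;
}
```

The result depends on the order in which blobs arrive, which is one reason a later cleanup pass that merges or removes rows, as cleanup_rows_making does, is useful.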
After creating rows, a straight baseline and a spline baseline are fitted to each one. This is done by the ComputeStraightBaselines and ComputeBaselineSplinesAndXheights methods of the BaselineDetect class. The computed baselines are also used later in proportional word finding.
Debug parameters

| Parameter name | Description |
| --- | --- |
| textord_show_final_blobs | Draws final blobs. It is listed here because an optional, false-by-default function named vigorous_noise_removal in ComputeBaselineSplinesAndXheights removes some blobs. |
| textord_show_final_rows | Draws final rows. |

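To make "fitting a straight baseline" concrete, here is a minimal sketch that fits a line through the bottom points of a row's blobs. It swaps in ordinary least squares for simplicity and makes no claim to match the fitting that BaselineDetect actually performs. Point, Line and FitStraightBaseline are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical point type: x position of a blob and y of its bottom edge.
struct Point {
  double x, y;
};

// Hypothetical line type for y = slope * x + intercept.
struct Line {
  double slope, intercept;
};

// Ordinary least-squares fit through the given bottom points.
Line FitStraightBaseline(const std::vector<Point> &bottoms) {
  assert(bottoms.size() >= 2);
  double sum_x = 0.0, sum_y = 0.0, sum_xx = 0.0, sum_xy = 0.0;
  for (const Point &p : bottoms) {
    sum_x += p.x;
    sum_y += p.y;
    sum_xx += p.x * p.x;
    sum_xy += p.x * p.y;
  }
  const double n = static_cast<double>(bottoms.size());
  const double denom = n * sum_xx - sum_x * sum_x;
  assert(denom != 0.0);  // needs at least two distinct x positions
  Line line;
  line.slope = (n * sum_xy - sum_x * sum_y) / denom;
  line.intercept = (sum_y - line.slope * sum_x) / n;
  return line;
}
```

Plain least squares is sensitive to outliers such as descenders or stray noise blobs, which is a known weakness of this simplification.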
After creating rows, we have to create words. This is done by make_words, another important function in tesseract. It first tests for fixed-pitch words and, if that is not possible, falls back to proportional words. In the next step, cleanup_blocks(true, blocks) removes noise from words and rows, which might eliminate an entire row. After that, diacritics are assigned to the nearest word in TransferDiacriticsToBlockGroups(diacritic_blobs, blocks). We are now at the end of both Tesseract::SegmentPage(..., blocks, ...) and TessBaseAPI::FindLines(), with blocks containing the result. After calling FindLines, TessBaseAPI::Recognize constructs a PAGE_RES object out of the blocks, calls tesseract_->recog_all_words(page_res_, ...) for text recognition, which uses an LSTM neural network, and finally calls DetectParagraphs(true) for paragraph detection.
Debug parameters

| Parameter name | Description |
| --- | --- |
| interactive_display_mode | Shows the result blocks interactively and with details. |

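Word creation for proportionally spaced text can be pictured as cutting a row wherever the horizontal gap between neighboring blobs is large. The sketch below is only that picture: BlobSpan, SplitRowIntoWords and the max_intra_word_gap threshold are hypothetical, and the threshold is passed in by the caller, whereas make_words derives its spacing decisions from the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical horizontal extent of a blob; blobs are given left to right.
struct BlobSpan {
  int left, right;
};

// Groups blob indices into words, starting a new word whenever the gap to
// the previous blob exceeds `max_intra_word_gap`.
std::vector<std::vector<std::size_t>> SplitRowIntoWords(
    const std::vector<BlobSpan> &blobs, int max_intra_word_gap) {
  assert(max_intra_word_gap >= 0);
  std::vector<std::vector<std::size_t>> words;
  for (std::size_t i = 0; i < blobs.size(); ++i) {
    const bool starts_word =
        i == 0 || blobs[i].left - blobs[i - 1].right > max_intra_word_gap;
    if (starts_word) {
      words.push_back(std::vector<std::size_t>());
    }
    words.back().push_back(i);
  }
  return words;
}
```

For fixed-pitch text a single gap threshold like this is less natural, since characters sit on a regular grid, which fits the text's note that fixed pitch is tested first.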
Finally, the PAGE_RES object is given to the renderer, which produces the output in the desired format and saves it to a file.
The End.
