CIB ocr technical manual (EN)

Site: CIB eLearning
Course: CIB ocr
Date: Friday, 22 November 2024, 4:39 PM

1. Scope of Delivery

CIB ocr is delivered as a binary module, in the form of DLLs for Windows or shared libraries (Unix).

 

Component

Scope of software

CIB ocr

  • CIBOcr32.dll or CIBOcr64.dll
  • CIB ocr DLL, the interface for applications

Tesseract-language package

  • Folder “tessdata” containing the dictionaries for the languages used, e.g. deu.traineddata for German.

Hunspell dictionary

  • Hunspell dictionaries and stopword lists within the ‘hunspell’ directory

CIB runshell

  • cibrsh.exe
  • cibrsh64.exe

 

Note:

The language package mentioned above is used for Tesseract text recognition. By default, CIB ocr searches for the subfolder “tessdata” in the current folder. If a different folder is used for the language files, it has to be specified in the property “DataFolder”.

 

CIB runshell

CIB runshell (cibrsh.exe) makes it possible to call the CIB ocr DLL directly. With this call, text and barcodes can be extracted from a specified input file.

Example:

cibrsh.exe -oc input.tif output.txt

 

CIB job

(From CIB ocr version 2.3.0 and CIB job version 1.8.0)

The CIB ocr DLL can be started via a CIB job XML.

A CIB job XML can be used by CIB runshell (cibrsh.exe) or CIB documentServer.

Example CIB runshell:

cibrsh.exe -d job.xml

 

For a detailed example of a CIB job XML, see chapter 5, Calling CIB ocr via CIB job/CIB documentServer.


2. Introduction

CIB ocr is a CIB solution for optical character recognition and is mainly employed as a component of CIB products and modules such as CIB doXima, CIB pdf toolbox or CIB doXisafe. It supports automatic text and barcode recognition in all these modules. For text recognition, the Tesseract engine is used.

About Tesseract:

Tesseract is an optical character recognition engine and is considered to be one of the most accurate open source OCR engines currently available. It is free software, released under the Apache License, and its development has been sponsored by Google since 2006.

Tesseract is available for Linux, Windows and Mac OS X. Since version 3.00, Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can process many different languages, including German.
  

Input formats supported by CIB ocr:

  • bmp image;
  • tiff image (including multipage tiff);
  • jpeg image;
  • png image;
  • xml (serialized cv::Mat);
  • yml (serialized cv::Mat)


Supported barcode types:

  • Code128A, Code128B, Code128C, Code128Auto
  • Code39
  • Code39Extended
  • QR-Code
  • Datamatrix

 

For output the following formats can be used: 

  • plain text
  • hocr

3. Use case: Calling CIB ocr via CIB runshell

In the following example, a QR code from a .jpg image file is processed by CIB ocr and written to a .txt text file.

First, a valid license is set and tracing is activated. To recognize the QR code, the properties “BarcodeType” and “Recognize” have to be set accordingly. Then the output filename, including the output directory, is set. CIB ocr is called via the command -oc, followed by the filename of the .jpg image that is to be processed.

All properties used in this call are described in more detail in chapters 6 to 9.

CIB runshell can be called directly from the command line in the appropriate directory, or via a batch script.

cibrsh.exe LicenseCompany="Example Company" LicenseKey="xxxx-xxxx-xxxxxxxx"
TraceFilename="trace.log" BarcodeType=QR Recognize=BarcodeRecognizer
BarcodeOutputFilename="./output/CIB_web.jpg.txt" -oc "./input/CIB_web.jpg"
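
When the call is issued from a script rather than typed by hand, it can help to assemble the arguments programmatically so that values containing spaces need no manual quoting. The following is a minimal sketch in Python, assuming that properties are passed as Name=Value arguments before the command and that the command is written with a plain hyphen (-oc). The helper name build_cibrsh_args is hypothetical and not part of CIB runshell.

```python
def build_cibrsh_args(properties, input_file, executable="cibrsh.exe"):
    """Return an argument list: executable, Name=Value pairs, -oc, input file.

    Hypothetical helper for illustration; the property names are taken
    from the example call in this chapter.
    """
    args = [executable]
    for name, value in properties.items():
        # Each property becomes one argument, so spaces need no quoting.
        args.append(f"{name}={value}")
    args += ["-oc", input_file]
    return args


args = build_cibrsh_args(
    {
        "LicenseCompany": "Example Company",
        "LicenseKey": "xxxx-xxxx-xxxxxxxx",
        "TraceFilename": "trace.log",
        "BarcodeType": "QR",
        "Recognize": "BarcodeRecognizer",
        "BarcodeOutputFilename": "./output/CIB_web.jpg.txt",
    },
    "./input/CIB_web.jpg",
)
print(args)
# To run the call: import subprocess; subprocess.run(args, check=True)
```

Passing the list to subprocess.run avoids shell quoting issues altogether; the call itself is left commented out because it needs an installed cibrsh.exe.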

Inputfile: 

This is the QR Code as JPG file:

Result: 

The output textfile contains the following text:  

BARCODE:QR;http://www.cib.de 

4. Use case: Calling CIB ocr via CIB pdf toolbox

In CIB pdf toolbox it is possible to pass properties directly through to CIB ocr. Further details can be found in the CIB pdf toolbox technical manual.

This example uses CIB ocr to extract text from a PDF file whose text is not machine-readable, i.e. the text is represented by a picture. To pass CIB ocr properties through in this CIB pdf toolbox call, “CibOcr” is added as a prefix to the CIB ocr properties described in this document.

Again, a valid license has to be set. Then the property “OutputFormat” is set to specify that a searchable PDF should be generated. It is also specified that the image in the input PDF should be removed and the text extracted from that image shown instead in the output PDF (properties “FormatSearchablePdfRemoveImage” and “FormatSearchablePdfShowText”).

The language of the text in the PDF is set (“CibOcrOCRLanguage”), as is the path to the necessary dictionary files (“CibOcrDataFolder”). In addition, a preprocessor is selected to improve the OCR result.

Then CIB pdf toolbox is called via -fj (join) and the input and output files are specified.

cibrsh.exe LicenseCompany="Example Company" LicenseKey="xxxx-xxxx-xxxxxxxx"
OutputFormat=FormatSearchablePdf FormatSearchablePdfShowText=1
FormatSearchablePdfRemoveImages=1 CibOcrDataFolder="./tessdata" CibOcrOCRLanguage="eng"
CibOcrPreprocess=SauvolaThresholding -fj "./input/TextAsPicture.pdf"
"./output/TextAsPicture.out.pdf"
    

5. Use case: Calling CIB ocr via CIB job/CIB documentServer

(From CIB ocr version 2.3.0 and CIB job version 1.8.0)

To use CIB ocr via CIB job or CIB documentServer, the corresponding XML has to be defined. This XML can be used for a CIB documentServer request. The following XML example uses CIB ocr to extract text from an image file using the Tesseract library.

In the load step, the input file is loaded into memory.
In the ocr step, a valid license is set first; “Tesseract” is the OCR library to be used and the OutputFormat is “FormatHocr”. The text to be extracted is in German (OCRLanguage=deu).

In the save step, the in-memory output is written to a file.

Example:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<root> 
<Comod> 
<defaults> 
<properties command="job"> 
<property name="OutputMode">XML</property> 
<property name="UseInMemoryProcessing">1</property> 
</properties> 
</defaults> 
<jobs> 
<job name="tesseract" expected-result-code="404"> 
<steps> 
<step name="LoadStep" command="load"> 
<properties> 
<property name="InputFilename">./input/input.png</property> 
</properties> 
</step> 
<step name="OcrStep" expected-result-code="1000" command="ocr"> 
<properties> 
<property name="LicenseCompany">CustomerLicensee</property> 
<property name="LicenseKey">4444-cccc-88888888</property> 
<property name="OCRLibraryName">Tesseract</property> 
<property name="DataFolder">.</property> 
<property name="OutputFormat">FormatHocr</property> 
<property name="TraceFilename">OCR_trace.log</property> 
<property name="OCRLanguage">deu</property> 
<property name="TracePreprocessOutput">0</property> 
</properties> 
</step> 
<step name="SaveStep" expected-result-code="0" command="save"> 
<properties> 
<property name="OutputFilename">./ocr_out.html</property> 
</properties> 
</step> 
</steps> 
</job> 
</jobs> 
</Comod> 
</root> 
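
If many such jobs have to be created, the XML can be generated instead of written by hand. The sketch below uses Python's standard xml.etree.ElementTree module to build the load/ocr/save structure shown above. It is an illustration only: the function names are hypothetical, the expected-result-code attributes are omitted for brevity, and only element, attribute and property names that appear in the example are used.

```python
import xml.etree.ElementTree as ET


def add_properties(parent, properties, **attrs):
    """Append a <properties> element with one <property> child per entry."""
    node = ET.SubElement(parent, "properties", attrs)
    for name, value in properties.items():
        ET.SubElement(node, "property", {"name": name}).text = str(value)
    return node


def build_ocr_job(input_file, output_file, ocr_properties):
    """Build a job XML string with a load, an ocr and a save step."""
    root = ET.Element("root")
    comod = ET.SubElement(root, "Comod")
    defaults = ET.SubElement(comod, "defaults")
    add_properties(
        defaults,
        {"OutputMode": "XML", "UseInMemoryProcessing": 1},
        command="job",
    )
    jobs = ET.SubElement(comod, "jobs")
    job = ET.SubElement(jobs, "job", {"name": "tesseract"})
    steps = ET.SubElement(job, "steps")
    for name, command, props in (
        ("LoadStep", "load", {"InputFilename": input_file}),
        ("OcrStep", "ocr", ocr_properties),
        ("SaveStep", "save", {"OutputFilename": output_file}),
    ):
        step = ET.SubElement(steps, "step", {"name": name, "command": command})
        add_properties(step, props)
    return ET.tostring(root, encoding="unicode")


job_xml = build_ocr_job(
    "./input/input.png",
    "./ocr_out.html",
    {
        "LicenseCompany": "CustomerLicensee",
        "LicenseKey": "4444-cccc-88888888",
        "OCRLibraryName": "Tesseract",
        "OutputFormat": "FormatHocr",
        "OCRLanguage": "deu",
    },
)
print(job_xml)
```

The resulting string would be written to a file such as job.xml; whether the omitted attributes are required by CIB job is not covered by this sketch.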


6. General properties

In this chapter general properties of CIB ocr are described.

− New: the property “PageSelection” has been added (release 2.3.0).

− The property “TraceRecognitionOutput” has been added (release 2.1.0).


6.1. Config

Property-Name

Data-Type

Type

Config

String

Set

 

This property specifies the name of the config file (a text file) which contains parameters for CIB ocr. The parameters specified in the config file overwrite those set directly in the job. Using several config files (containing different configurations) that are processed one after the other in the order in which they are listed in the property is not supported yet.

Syntax

Config=<onefilename>

 Example

Config=C:\Test\config.txt

6.2. Recognize

Property-Name

Data-Type

Type

Recognize

String

Set

 

This property defines the purpose for which CIB ocr is run. Currently there are two purposes: text recognition and barcode recognition. By default, text recognition is performed.

Syntax

Recognize=<Value>
<Value>: BarcodeRecognizer | OcrRecognizerWithDeeper | OcrRecognizer | WordRecognizer

default=OcrRecognizer

Example

Recognize=BarcodeRecognizer

6.3. DisableRecognition

Property-Name

Data-Type

Type

DisableRecognition

String

Set

 

When this property is set, text/barcode recognition is switched off; the user should also switch on TracePreprocessOutput. Instead of only the preprocessors defined in Preprocess, all preprocessors are then applied one by one to the initial image.

For each preprocessor, an image file is written.

 

Syntax

DisableRecognition=<value>
<value>: 0 | 1

default=0

 

Example

DisableRecognition=1
TracePreprocessOutput=1

That means:
No recognition is performed,
TracePreprocessOutput=1 is set,
and for each preprocessor an image file is created, e.g. “image1_preprocessor_MedianBlurAT.png”.


6.4. DpiX

Property-Name

Data-Type

Type

DpiX

String

Set

 

This property sets the DPI value of the input image in the X direction as an integer. The value is needed for xfdf processing.
DpiX can be used when the property RegionTemplate is set.

 Syntax

DpiX=<value>
<value>: 1 to N 

default=300

 Example

DpiX=200

6.5. DpiY

Property-Name

Data-Type

Type

DpiY

String

Set

 

This property sets the DPI value of the input image in the Y direction as an integer. The value is needed for xfdf processing.
DpiY can be used when the property RegionTemplate is set.

 Syntax

DpiY=<value>
<value>: 1 to N 

default=300

 Example

DpiY=200

6.6. InputFilename

Property-Name

Data-Type

Type

InputFilename

String

Set/Get

 

This property specifies the name of the input file.
It is optional; memory input can be used instead (via the property InputMemoryAddress).

The following input-formats are supported:

  • bmp image;
  • tiff image (includes multipage tiff);
  • jpeg image;
  • png image;
  • xml (serialized cv::Mat);
  • yml (serialized cv::Mat)

 Syntax

InputFilename=<name>
<name>: name.ext 

 Example

InputFilename=Rechnung.tiff

6.7. InputMemoryAddress

Property-Name

Data-Type

Type

InputMemoryAddress

String

Set

 

This property specifies the memory address of the input data.
It is optional; an input file can be used instead (via the property InputFilename).

 Syntax

<InputMemoryAddress> ::= <Integer> | ";" <Integer>

 Example

InputMemoryAddress=1034924448#4096#1034924448#1292

6.8. LicenseCompany

Property-Name

Data-Type

Type

LicenseCompany

String

Set

 

This property sets the licensee from the license information. It is used in connection with LicenseKey. This property has to be set in a CIB ocr call to activate its functions.

Syntax

LicenseCompany=CustomerLicensee

6.9. LicenseKey

Property-Name

Data-Type

Type

LicenseKey

String

Set

 

This property sets the license key from the license information. It is used in connection with LicenseCompany. This property has to be set in a CIB ocr call to activate its functions.

Syntax

LicenseKey=xxxx-xxxx-xxxxxxxx

Default:
If no license information is set, a test license is used.

Example

LicenseCompany=CIB software GmbH 
LicenseKey=4444-cccc-88888888

6.10. OCRLanguage

Property-Name

Data-Type

Type

OCRLanguage

String

Set

 

This property specifies the language used in the input document (ISO 639-2 code, three letters).

Syntax

OCRLanguage=<language>
<language>: deu | eng | rus 

default=deu

Example

OCRLanguage=eng

6.11. Preprocess

General
Preprocessor details
     ▸ Thresholding Methods
     ▸ Median filters
     ▸ BilateralFilter
     ▸ Thinning Algorithms
     ▸ DeSkew
     ▸ Invert / AutoInvert
     ▸ AutoRotate
Composite Algorithms

General

Property-Name 

Data-Type

Type

Preprocess 

String 

Set 

 

This property specifies the methods used for preprocessing the input file in order to get a better OCR result.

 

Syntax  

Preprocess=<preprocessnames>
<preprocessnames>: <preprocessor> | <preprocessor> "+" <preprocessnames>
<preprocessor>: NativeAdaptiveThresholding | PureMedianBlur | PureAdaptiveThresholding |
PureAdaptiveGaussianThresholding | MedianBlurAGT | MedianBlurAT | MedianBlurGAT |
SauvolaThresholding | NiblackThresholding | WolfJolionThresholding | NickThresholding |
FengThresholding | OtsuThresholding | BilateralFilter | ThinningZhangSuen |
ThinningGuoHall | DeSkew | Invert | AutoInvert | Composite

No default

 

Example 

Preprocess = NativeAdaptiveThresholding 


Preprocessor details

Thresholding Methods

Thresholding is the simplest method of image segmentation. From a grayscale image, thresholding can be used to create binary images.
The simplest thresholding methods replace each pixel in an image with a black pixel if the image intensity is less than some fixed constant T, or a white pixel if the image intensity is greater than that constant.
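
The fixed-threshold rule can be illustrated with a few lines of Python. This is a sketch for explanation only and not the CIB ocr implementation; the image is represented as a list of rows with gray values from 0 to 255, and a pixel equal to T is treated as white here, which is an arbitrary choice.

```python
def threshold_global(image, t):
    """Binarize a grayscale image (list of rows, values 0..255).

    Pixels darker than t become black (0), all others white (255).
    """
    return [[0 if pixel < t else 255 for pixel in row] for row in image]


gray = [[12, 200, 90], [130, 40, 255]]
binary = threshold_global(gray, 128)
print(binary)  # [[0, 255, 0], [255, 0, 255]]
```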

Adaptive Thresholding

Using a global value as the threshold may not work well when an image has different lighting conditions in different areas. In that case, adaptive thresholding is used. Adaptive thresholding means that the algorithm calculates the threshold for small regions of the image. Thus we get different thresholds for different regions of the same image, which gives better results for images with varying illumination.

It has three ‘special’ input parameters and only one output argument.

Adaptive Method - It decides how the thresholding value is calculated.

  • cv2.ADAPTIVE_THRESH_MEAN_C : threshold value is the mean of the neighborhood area.
  • cv2.ADAPTIVE_THRESH_GAUSSIAN_C : threshold value is the weighted sum of neighborhood values where weights are a Gaussian window.

Block Size - It decides the size of the neighborhood area.

C - It is just a constant which is subtracted from the mean or weighted mean calculated.
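
How the three parameters interact can be sketched in plain Python for the mean method. The sketch is a simplified illustration and not the OpenCV code: the neighborhood is clamped at the image border, and the parameter names block_size and c are chosen to mirror the description above.

```python
def adaptive_mean_threshold(image, block_size=3, c=2):
    """Binarize with a per-pixel threshold: local mean minus constant c."""
    height, width = len(image), len(image[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp coordinates so border pixels reuse edge values.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            local_threshold = total / (block_size * block_size) - c
            row.append(255 if image[y][x] > local_threshold else 0)
        result.append(row)
    return result


# A uniform gray page with one darker pixel in the middle.
page = [[100] * 5 for _ in range(5)]
page[2][2] = 50
binary_page = adaptive_mean_threshold(page, block_size=3, c=2)
print(binary_page[2])
```

Only the darker pixel falls below its local threshold and becomes black; the uniform background stays white because each background pixel exceeds its local mean minus C.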

 

NativeAdaptiveThresholding 

This is a complex filter which consists of the following steps using the OpenCV library:

  • cv::medianBlur()
  • cv::adaptiveThreshold() using the CV_ADAPTIVE_THRESH_MEAN_C threshold type
  • cv::bilateralFilter()

The result is a grayscale image.

PureAdaptiveThresholding  

While the conventional thresholding operator uses a global threshold for all pixels, adaptive thresholding changes the threshold dynamically over the image. This more sophisticated version of thresholding can accommodate changing lighting conditions in the image, e.g. those occurring as a result of a strong illumination gradient or shadows.

PureAdaptiveThresholding consists of only one step:

  • cv::adaptiveThreshold() using the CV_ADAPTIVE_THRESH_MEAN_C threshold type

The result is a binary image.

Alternative: PureAdaptiveGaussianThresholding

Thresholding based on standard deviation

The methods described in the following sections (FengThresholding, SauvolaThresholding, NiblackThresholding, WolfJolionThresholding, NickThresholding) differ only in the final formula for the threshold value of a particular pixel; they all use the same matrices of local mean and standard deviation.


SauvolaThresholding

The basic idea behind Sauvola is that if there is a lot of local contrast, the threshold should be chosen close to the mean value, whereas if there is very little contrast, the threshold should be chosen below the mean, by an amount proportional to the normalized local standard deviation.

 

NiblackThresholding

Niblack’s method can be considered the first local threshold method. It has the advantage of detecting the text, but it introduces a lot of background noise. Sauvola and Pietikäinen modified the Niblack threshold to decrease the background noise, but the text detection rate is also decreased, while bleed-through still remains in most cases.
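
The relationship between the two formulas can be made concrete with the definitions commonly given in the literature: Niblack uses T = m + k·s and Sauvola uses T = m·(1 + k·(s/R − 1)), where m and s are the local mean and standard deviation. The Python sketch below is an illustration under these assumptions; the parameter values (k = −0.2 for Niblack, k = 0.5 and R = 128 for Sauvola) are conventional textbook defaults and not necessarily the values used by CIB ocr.

```python
import math


def local_mean_std(image, x, y, half):
    """Mean and standard deviation of the window around (x, y)."""
    values = [
        image[yy][xx]
        for yy in range(max(0, y - half), min(len(image), y + half + 1))
        for xx in range(max(0, x - half), min(len(image[0]), x + half + 1))
    ]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def niblack_threshold(mean, std, k=-0.2):
    """Niblack: shift the local mean by a multiple of the deviation."""
    return mean + k * std


def sauvola_threshold(mean, std, k=0.5, r=128.0):
    """Sauvola: close to the mean for high contrast, below it otherwise."""
    return mean * (1.0 + k * (std / r - 1.0))


window = [[200, 200, 200], [200, 40, 200], [200, 200, 200]]
m, s = local_mean_std(window, 1, 1, 1)
print("Niblack:", niblack_threshold(m, s))
print("Sauvola:", sauvola_threshold(m, s))
```

With a standard deviation equal to R the Sauvola threshold equals the local mean, and with no contrast at all it drops to (1 − k) times the mean, which matches the behaviour described for SauvolaThresholding.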

 

WolfJolionThresholding

For most colored images, as well as for images with background noise and anti-aliased fonts, the WolfJolion preprocessor achieves the best recognition quality.

 

NickThresholding

Nick's binarization derives its thresholding formula from the basic Niblack algorithm, the parent of many local image thresholding methods. The major advantage of Nick's method over Niblack is that it considerably improves binarization for "white" and light page images by shifting down the binarization threshold.

 

FengThresholding

The Feng thresholding method is interesting because it can qualitatively outperform the Sauvola thresholding method. However, the Feng method contains many parameters which have to be set. Hence this method was never widely accepted.

 

OtsuThresholding

Considering a bimodal image (a bimodal image is an image whose histogram has two peaks) we can approximately take a value in the middle of those peaks as threshold value. That is what Otsu binarization does. So it automatically calculates a threshold value from an image’s histogram for a bimodal image. (For images which are not bimodal, binarization won’t be accurate.)
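
Otsu's method is usually formulated as a search for the threshold that maximizes the between-class variance of the two pixel groups. The following Python sketch implements that standard formulation on a flat list of gray values; it is meant as an illustration of the idea and not as the code used inside CIB ocr.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) maximizing between-class variance.

    Class 0 holds values <= t, class 1 holds values > t.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of gray values in class 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Bimodal data: a dark cluster around 20..30, a bright one around 200..220.
samples = [20] * 50 + [30] * 50 + [200] * 50 + [220] * 50
t = otsu_threshold(samples)
print(t)
```

For this bimodal input the returned threshold separates the dark cluster from the bright one.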


Median filters

A median filter is an example of a non-linear filter and, if properly designed, is very good at preserving image detail. Running a median filter:

  1. considers each pixel in the image,
  2. sorts the neighboring pixels into order based upon their intensities,
  3. replaces the original value of the pixel by the median value from the list.
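
The three steps above can be sketched as follows. The example is a plain-Python illustration with a square window that is simply cut off at the image border; it does not reproduce the border handling of cv::medianBlur().

```python
def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    half = size // 2
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            # Steps 1 and 2: collect and sort the neighboring intensities.
            neighbors = sorted(
                image[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            )
            # Step 3: take the median of the sorted list.
            row.append(neighbors[len(neighbors) // 2])
        output.append(row)
    return output


# A single "salt" pixel in an otherwise uniform area is removed.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
print(median_filter(noisy))  # [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
```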

A median filter is a rank-selection (RS) filter and a particularly harsh member of the family of rank-conditioned rank-selection (RCRS) filters. A much milder member of that family, for example one that selects the closest of the neighboring values when a pixel's value is external in its neighborhood and leaves it unchanged otherwise, is sometimes preferred, especially in photographic applications.

Median and other RCRS filters are good at removing salt-and-pepper noise from an image and cause relatively little blurring of edges; hence they are often used in computer vision applications.

Disadvantage: the rest of the image becomes blurred, which impairs the borders of characters and consequently the recognition accuracy.

At the same time (and rather unexpectedly), the MedianBlurGAT preprocessor is the best choice for “recipes” and for images with “curved” or generally “complex” text.

Available filters:

  • PureMedianBlur

The following filters additionally apply thresholding:

  • MedianBlurAGT
  • MedianBlurAT
  • MedianBlurGAT


BilateralFilter

A bilateral filter is a non-linear, edge-preserving and noise-reducing smoothing filter for images. The intensity value at each pixel in an image is replaced by a weighted average of intensity values from nearby pixels. This weight can be based on a Gaussian distribution. Crucially, the weights depend not only on the Euclidean distance of pixels, but also on the radiometric differences (e.g. range differences, such as color intensity, depth distance, etc.). This preserves sharp edges by systematically looping through each pixel and adjusting weights to the adjacent pixels accordingly.

It is normally used for non-text images or after thresholding. 
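
The combination of a spatial weight and a range weight is easiest to see in one dimension. The Python sketch below filters a single row of gray values with Gaussian weights for both terms; the parameter names sigma_space and sigma_range are chosen for this illustration, and the code is not the cv::bilateralFilter() implementation.

```python
import math


def bilateral_1d(signal, radius=2, sigma_space=2.0, sigma_range=20.0):
    """Smooth a row of gray values while keeping strong edges."""
    output = []
    for i, center in enumerate(signal):
        weight_sum = 0.0
        value_sum = 0.0
        for j in range(max(0, i - radius), min(len(signal), i + radius + 1)):
            # Weight 1: how close the neighbor is in position.
            w_space = math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2))
            # Weight 2: how similar the neighbor is in intensity.
            diff = signal[j] - center
            w_range = math.exp(-(diff ** 2) / (2 * sigma_range ** 2))
            weight = w_space * w_range
            weight_sum += weight
            value_sum += weight * signal[j]
        output.append(value_sum / weight_sum)
    return output


# A noisy step edge: dark values on the left, bright values on the right.
row = [10, 12, 9, 11, 10, 200, 198, 202, 199, 201]
smoothed = bilateral_1d(row)
print([round(v, 1) for v in smoothed])
```

Neighbors across the edge differ so much in intensity that their range weight is practically zero, so the noise on each side is averaged out while the step itself stays sharp.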


Thinning Algorithms

Thinning is applied to binary images to reduce a black-and-white area to a skeleton, e.g. one that is one pixel wide.

A fast parallel thinning algorithm consists of two sub-iterations:
one aimed at deleting the south-east boundary points and the north-west corner points, while the other is aimed at deleting the north-west boundary points and the south-east corner points. End points and pixel connectivity are preserved. Each pattern is thinned down to a “skeleton” of unitary thickness. Experimental results show that this method is very effective.

Available algorithms:

  • ThinningZhangSuen
  • ThinningGuoHall

DeSkew

Deskewing an image can help a lot if you want to do barcode detection or just improve the readability of scanned images. In photos of goods with a barcode, for example, the skew angle is often too high, so the barcode cannot be detected. After deskewing, the barcode can be read.

If the image is a logo, a good choice is DeSkew+AutoInvert together with any of the preprocessors Feng, Nick, Sauvola or WolfJolion.
For invoices, the suggestion is DeSkew together with Sauvola or WolfJolion.


Invert / AutoInvert

Both filters are suitable for images containing more black than white.

Applying “Invert” changes black to white and vice versa.

The filter “AutoInvert” first checks whether the page really contains more black than white.

Good results are obtained if “Invert” (or “AutoInvert”) is used together with “BilateralFilter” and “DeSkew”.
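
The difference between the two filters can be sketched in Python. The decision rule used here (count the pixels darker than a limit of 128 and invert if they are in the majority) is an assumption made for illustration; the document does not specify how AutoInvert measures "more black than white".

```python
def invert(image):
    """Swap dark and light: every gray value v becomes 255 - v."""
    return [[255 - pixel for pixel in row] for row in image]


def auto_invert(image, dark_limit=128):
    """Invert only if dark pixels outnumber light pixels.

    dark_limit is a hypothetical parameter for this sketch.
    """
    pixels = [pixel for row in image for pixel in row]
    dark = sum(1 for pixel in pixels if pixel < dark_limit)
    if dark > len(pixels) - dark:
        return invert(image)
    return image


white_on_black = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
black_on_white = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
print(auto_invert(white_on_black))  # inverted: mostly white afterwards
print(auto_invert(black_on_white))  # unchanged
```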


AutoRotate

This preprocessor detects image rotation by 90/180/270 degrees using an artificial-intelligence algorithm. It detects the rotation of the image and rotates the image before the text recognition process. The following preprocessor setting detects the image rotation, rotates the image and then de-skews the resulting image before text recognition:

Example

Preprocess = AutoRotate+DeSkew

 
To use this algorithm, an additional property has to be set: AutoRotateModel. This property should point to a TensorFlow-based model file trained to detect image rotation.


Composite Algorithms

(From CIB ocr version 2.3.0) 

CIB ocr can use complex algorithms for image preprocessing. To use complex image processing algorithms, the preprocessor "Composite" has to be selected. This feature is based on CIB image toolbox functionality. Each preprocessing algorithm has to be described in XML format (details are available in the CIB image toolbox documentation).

Example CIB runshell:

cibrsh.exe -oc Preprocess=Composite AlgorithmsSetName=AlgorithmsSet_sample.xml
AlgorithmName=SepaTextExtraction AlgorithmProfile=processing_profile.xml
IPLTraceFilename=OCR_IPL.log


Example CIB job/CIB documentServer:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
<Comod>
<defaults>
<properties command="job">
<property name="OutputMode">XML</property>
<property name="UseInMemoryProcessing">1</property>
</properties>
</defaults>
<jobs>
<job name="tesseract" expected-result-code="404">
<steps>
<step name="LoadStep" command="load">
<properties>
<property name="InputFilename">./input/input.png</property>
</properties>
</step>
<step name="OcrStep" expected-result-code="1000" command="ocr">
<properties>
<property name="LicenseCompany">CustomerLicensee</property>
<property name="LicenseKey">4444-cccc-88888888</property>
<property name="OCRLibraryName">Tesseract</property>
<property name="DataFolder">.</property>
<property name="OutputFormat">FormatHocr</property>
<property name="TraceFilename">OCR_trace.log</property>
<property name="OCRLanguage">deu</property>
<property name="TracePreprocessOutput">1</property>
<property name="Preprocess">Composite</property>
<property name="AlgorithmsSetName">AlgorithmsSet_sample.xml</property>
<property name="AlgorithmName">SepaTextExtraction</property>
<property name="AlgorithmProfile">processing_profile.xml</property>
<property name="IPLTraceFilename">OCR_IPL.log</property>
</properties>
</step>
<step name="SaveStep" expected-result-code="0" command="save">
<properties>
<property name="OutputFilename">./ocr_out.html</property>
</properties>
</step>
</steps>
</job>
</jobs>
</Comod>
</root>

6.12. TracePreprocessOutput

Property-Name

Data-Type

Type

TracePreprocessOutput

String

Set

 

When this property is set, the preprocessed image is written to a file.

Syntax

TracePreprocessOutput=<value>
<value>: 0 | 1

default=0

Example

TracePreprocessOutput=1

A file is created, e.g. “image1_preprocessor_MedianBlurAT.png”.


7. Properties Text-Recognition

DataFolder
OCRConfigs
OCRRegion
PaddingHorizontal
PaddingVertical
OutputFilename
OutputFormat
OutputText
OutputTextLength
OutputType
RegionTemplate

DataFolder

Property-Name 

Data-Type 

Type 

DataFolder 

String 

Set 

 

This property specifies the path to the Tesseract language package “tessdata”.

If the property is empty, the folder “tessdata” is assumed to be located in the current folder.

 

Syntax 

DataFolder=<path> 

default=No input 
(current folder is used) 

 

Example 

DataFolder=C:\Test\Invoice 

OCRConfigs

(From CIB ocr version 2.3.1) 

Property-Name 

Data-Type 

Type 

OCRConfigs 

String 

Set 

 

Names of Tesseract config files. 

All Tesseract config files should be located in $(TESSDATA_PREFIX)\tessdata\configs\ 

 

Syntax 

OCRConfigs=config_name1[;config_name2[;config_name3...]] 

 

Examples 

OCRConfigs=hocr 

or 

OCRConfigs=hocr;debug 

OCRRegion

Property-Name 

Data-Type 

Type 

OCRRegion 

String 

Set 

 

This property specifies a rectangle on a page.
This rectangle defines the scan area from which text is extracted. All characters outside this scan area are ignored.
A rectangle is defined by two points (left,top and right,bottom).
The point of origin is the top-left corner of the page; the unit is mm.

 

Syntax 

OCRRegion=<onerectangle> 
<onerectangle>: <left> ";" <top> ";" <right> ";" <bottom> 

default=No input 

The whole page is scanned if no value is set or if the rectangle given by the coordinates is empty.

 

Example 

OCRRegion=5;5;15;20 
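
Because the region is given in millimetres while the image consists of pixels, a conversion via the resolution is needed at some point. The Python sketch below shows the arithmetic (1 inch = 25.4 mm) for the example above at 300 DPI. That CIB ocr performs the conversion in exactly this way, and that the values of DpiX and DpiY are used for it, are assumptions made for this illustration; the function names are hypothetical.

```python
def mm_to_px(mm, dpi):
    """Convert millimetres to pixels at the given resolution."""
    return round(mm / 25.4 * dpi)


def region_mm_to_px(region, dpi_x=300, dpi_y=300):
    """Convert an OCRRegion string 'left;top;right;bottom' from mm to pixels."""
    left, top, right, bottom = (float(v) for v in region.split(";"))
    return (
        mm_to_px(left, dpi_x),
        mm_to_px(top, dpi_y),
        mm_to_px(right, dpi_x),
        mm_to_px(bottom, dpi_y),
    )


print(region_mm_to_px("5;5;15;20"))  # (59, 59, 177, 236)
```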

PaddingHorizontal

Property-Name 

Data-Type 

Type 

PaddingHorizontal 

String 

Set 

 

This property adds horizontal padding to the rectangle defined by “OCRRegion” on a page.
The main use case is text recognition with deepER on a single line. This property extends the OCRRegion in the horizontal direction so that context information in the image does not get lost.

The unit is %. For example, with an OCRRegion width of 100 and PaddingHorizontal of 10, the region is extended by 10 pixels on the left and 10 pixels on the right. The center of the OCRRegion and of the padded rectangle stays the same.

Syntax 

PaddingHorizontal=<integer_value>

default=0

Example

PaddingHorizontal=10

PaddingVertical

Property-Name 

Data-Type 

Type 

PaddingVertical 

String 

Set 

 

This property adds vertical padding to the rectangle defined by “OCRRegion” on a page.
The main use case is text recognition with deepER on a single line. This property extends the OCRRegion in the vertical direction so that context information in the image does not get lost.

The unit is %. For example, with an OCRRegion height of 100 and PaddingVertical of 10, the region is extended by 10 pixels at the top and 10 pixels at the bottom. The center of the OCRRegion and of the padded rectangle stays the same.

Syntax 

PaddingVertical=<integer_value> 

default=0 

 

Example 

PaddingVertical=10 
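
The effect of the two padding properties on a region can be written down as a small calculation. The following Python sketch follows the description above (the padding is a percentage of the region's width or height and is added on both sides); clipping at the page border is not covered, and the function name is hypothetical.

```python
def pad_region(left, top, right, bottom,
               padding_horizontal=0, padding_vertical=0):
    """Extend a rectangle by a percentage of its width/height on each side."""
    dx = (right - left) * padding_horizontal / 100.0
    dy = (bottom - top) * padding_vertical / 100.0
    return (left - dx, top - dy, right + dx, bottom + dy)


# Width and height of 100, padding of 10 % in both directions.
padded = pad_region(50, 50, 150, 150, padding_horizontal=10, padding_vertical=10)
print(padded)  # (40.0, 40.0, 160.0, 160.0)
```

The center of the rectangle is unchanged because the same amount is added on opposite sides.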

OutputFilename

Property-Name 

Data-Type 

Type 

OutputFilename 

String 

Set 

 

This property specifies the name of the output file.
The property OutputFilename is optional; if it is empty, OutputTextLength and OutputText are used.
The format/extension is described under the next property, OutputFormat.

 

Syntax 

OutputFilename=<name> 
<name>: name.ext  

default=No input (OutputTextLength and OutputText are used).

 

Example 

OutputFilename=Rechnung.txt 

OutputFormat

Property-Name 

Data-Type 

Type 

OutputFormat 

String 

Set 

 

This property defines the format of the created output file.

 

Syntax 

OutputFormat=<format>
<format>: FormatText | FormatHocr | FormatHocrText

default=FormatHocr 

 

Example 

OutputFormat=FormatHocr 

OutputText

Property-Name 

Data-Type 

Type 

OutputText 

String 

Get 

 

This property contains the result of text-recognition. 
If used, it is also required to define the size of the output buffer with the property OutputTextLength. 

 

Syntax of output: 

[textstring] 

 

Example 

Das ist der gelesene Text. 

OutputTextLength

Property-Name 

Data-Type 

Type 

OutputTextLength 

String 

Get 

 

This property contains the length of the output result and specifies the required size of the output buffer. 

 

Syntax of output: 

[integer] (string representation)  

 

Example 

1000 

OutputType

Property-Name 

Data-Type 

Type 

OutputType 

String 

Set 

 

This property defines whether output should be in memory or to a file. 
This property is set automatically depending on whether OutputFilename is set or not. If OutputFilename is set, OutputType=File is set automatically; otherwise OutputType=Memory is set.

Syntax 

OutputType=<type> 
<type>: File | Memory  

default=File 

 

Example 

OutputType=File 

RegionTemplate

Property-Name 

Data-Type 

Type 

RegionTemplate 

String 

Set 

 

The property RegionTemplate contains the name of the xfdf file in which the OCRRegions are defined.

 

Syntax 

RegionTemplate=<filename.xfdf> 

default=No input 

 

Example 

RegionTemplate=region.xfdf 

8. Properties Text-Recognition with deepER

Instead of using Tesseract for OCR, it is also possible to choose text recognition with deepER. The OCR is then performed on a server: a RESTful service runs on the server, and the client uses libcurl to send the request.


DataFolder
InputFilename
OutputFilename
Recognize
DeeperURL
DeeperAuthentication
DeeperImageFormat
OcrGrayScaleConversion

InputFilename

This property is mandatory.

It is not yet possible to use in-memory processing for the input.

Property-Name 

Data-Type

Type

InputFilename 

String 

Set/Get 

 

This property specifies the name of the input file.  
The following input-formats are supported: 

  • bmp image;
  • tiff image (includes multipage tiff);
  • jpeg image;
  • png image;

OutputFilename

Property-Name 

Data-Type 

Type 

OutputFilename 

String 

Set 

 

This property specifies the name of the output file.
The output format is fixed to hOCR.

 

Syntax 

OutputFilename=<name> 
<name>: name.ext  

 

Example 

OutputFilename=Rechnung.html

Recognize

Property-Name 

Data-Type 

Type 

Recognize 

String 

Set 

 

Syntax  

Recognize=<Value>
<Value>: OcrRecognizerWithDeeper 

default=OcrRecognizer 

 

Example 

Recognize= OcrRecognizerWithDeeper 

DeeperURL

Property-Name 

Data-Type 

Type 

DeeperUrl 

String 

Set 

 

Syntax 

DeeperUrl=<Value>
<Value>: http://localhost:5000

default=http://localhost:5000

 

Example 

DeeperUrl = http://graphix:5000 

DeeperAuthentication

Property-Name 

Data-Type 

Type 

DeeperAuthentication 

String 

Set 

 

Syntax 

DeeperAuthentication=<Value>
<Value>: User:password

default=“”

 

Example

DeeperAuthentication=Franz:TopSecret

DeeperImageFormat

Property-Name 

Data-Type 

Type 

DeeperImageFormat 

String 

Set 

 

The default value (if not set) is PNG. Possible values: JPG (JPEG) / PNG / Smallest.

CIB ocr converts the input image into the requested format before sending it to the deepER server.

If DeeperImageFormat is set to Smallest, CIB ocr converts the input image into both PNG and JPG, and the smaller representation is sent to the deepER server for recognition.
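
The selection rule behind the value Smallest can be sketched in a few lines of Python. The encoded images are represented here by placeholder byte strings, since the actual PNG and JPG encoding happens inside CIB ocr; the function name pick_smallest is hypothetical.

```python
def pick_smallest(encodings):
    """Return (format name, data) of the shortest encoding.

    encodings maps a format name such as "PNG" to the encoded bytes.
    """
    fmt = min(encodings, key=lambda name: len(encodings[name]))
    return fmt, encodings[fmt]


# Placeholder data standing in for the two encoded versions of one image.
candidates = {"PNG": b"\x00" * 1200, "JPG": b"\x00" * 800}
chosen_format, payload = pick_smallest(candidates)
print(chosen_format, len(payload))  # JPG 800
```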


OcrGrayScaleConversion

Property-Name 

Data-Type 

Type 

OcrGrayScaleConversion 

String 

Set 

 

Syntax 

OcrGrayScaleConversion=<Value>
<Value>: 0|1 

default=1

 

Example 

OcrGrayScaleConversion = 0 

9. Properties Barcode-Recognition

BarcodeOutputFilename
BarcodeOutputType
BarcodeRegion
BarcodeRemoveChecksum
BarcodeShowPageNumber
BarcodeStopAfter
BarcodeType
BarcodeTimeout
BarcodeValue
BarcodeValueLength
DatamatrixAngleDeviation
DatamatrixShrinkingFactor
DatamatrixScanGap
DatamatrixThreshold
ZBarConfig

BarcodeOutputFilename

Property-Name 

Data-Type 

Type 

BarcodeOutputFilename 

String 

Set 

 

This property sets the name of the output file to which the values of all barcodes found in the input file are written.
The property BarcodeOutputFilename is optional; if it is empty, BarcodeValueLength and BarcodeValue are used.
For the format/extension, see OutputFormat.

The format of this file is given by “BarcodeOutputType”.

Syntax 

BarcodeOutputFilename=<filename> 
<filename>:name.ext 

default=No input (BarcodeValueLength and BarcodeValue are used).

 Example 

BarcodeOutputFilename=barcodes.txt 

BarcodeOutputType

Property-Name 

Data-Type 

Type 

BarcodeOutputType 

String 

Set 

 

This property defines whether the output should be in memory or to a file.
This property is set automatically depending on whether BarcodeOutputFilename is set or not. If BarcodeOutputFilename is set, BarcodeOutputType=File is set; otherwise BarcodeOutputType=Memory is set.

 

Syntax 

BarcodeOutputType=<type> 
<type>: File | Memory  

default=File 

 

Example 

BarcodeOutputType=File 

BarcodeRegion

This property restricts the search for barcodes to a defined rectangle.

If a barcode is always located in the same area of each page, performance is improved by limiting the search for barcodes to this area.
This area (= rectangle) is defined by two points (left, top and right, bottom).
The point of origin is the top-left corner of the page; the unit is pixels.

 

Syntax 

BarcodeRegion=<rectangle> 
<rectangle>: <left> ";" <top> ";" <right> ";" <bottom> 

default=No input 

The whole page is scanned if no value is given or if the rectangle defined by the coordinates is empty.

 

Example 

BarcodeRegion=10.5;50;100.0;200.5 
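
The rectangle syntax can also be produced programmatically. The following Python sketch is illustrative only: the helper name format_barcode_region is hypothetical and not part of CIB ocr, and rejecting an empty rectangle is a design choice of the sketch (CIB ocr itself scans the whole page in that case).

```python
def format_barcode_region(left, top, right, bottom):
    """Build a BarcodeRegion value "<left>;<top>;<right>;<bottom>" in pixels.

    Hypothetical helper, not part of CIB ocr. The origin is the
    top-left corner of the page, so right > left and bottom > top.
    """
    if right <= left or bottom <= top:
        # Assumption of this sketch: treat an empty rectangle as a caller error.
        raise ValueError("rectangle is empty")
    return ";".join(str(v) for v in (left, top, right, bottom))


region = format_barcode_region(10.5, 50, 100.0, 200.5)
print("BarcodeRegion=" + region)
```

The resulting string can be passed as the value of the BarcodeRegion property.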

BarcodeRemoveChecksum

Property-Name | Data-Type | Type
BarcodeRemoveChecksum | String | Set

When this property is set to 1, the checksum is cut off from the value when a barcode is read.

The property can be used for the barcode types Code39, Code39Extended and Code128.

Syntax

BarcodeRemoveChecksum=<value>
<value>: 0 | 1

default=0
(Checksum is not cut off.)

Example

BarcodeRemoveChecksum=1

BarcodeShowPageNumber

Property-Name | Data-Type | Type
BarcodeShowPageNumber | String | Set

When this property is set, each barcode returned in the property “BarcodeValue” is prefixed with the number of the input-file page on which it was found.

Syntax

BarcodeShowPageNumber=<value>
<value>: 0 | 1

default=0
(No output of page number.)

Example

BarcodeShowPageNumber=1
Afterwards, the property “BarcodeValue” contains e.g.
BARCODE:1;DATAMATRIX;00011122233344455566677788;BARCODE:2;DATAMATRIX;899374032904908

BarcodeStopAfter

Property-Name | Data-Type | Type
BarcodeStopAfter | String | Set

For Datamatrix only.

Defines that the search is stopped after the Nth barcode has been retrieved.

Syntax

BarcodeStopAfter=<value>
<value>: 1 to N

default=“”
If no value is set, the search continues until the end of the input file.

Example

BarcodeStopAfter=5

The search stops after the 5th barcode has been retrieved.


BarcodeType

Property-Name | Data-Type | Type
BarcodeType | String | Set

This property specifies the type of barcode to search for.

Syntax

BarcodeType=<onebarcodetype>
<onebarcodetype>: "DataMatrix" | "Code128" | "Code39" | "Code39Extended" | "QR"

default=No input; CIB ocr searches for all possible barcode types.

Code128 includes the subtypes:

  • Code128A
  • Code128B
  • Code128C
  • Code128Auto

Example

BarcodeType=DataMatrix

BarcodeTimeout

Property-Name | Data-Type | Type
BarcodeTimeout | String | Set

For Datamatrix only.

This property specifies the time (in milliseconds) after which CIB ocr stops looking for further barcode candidates in the input file.

Syntax

BarcodeTimeout=<value>
<value>: 1 to N

default=No timeout; the whole input file is processed to find all barcode candidates, regardless of the time this takes.

Example

BarcodeTimeout=50

BarcodeValue

Property-Name | Data-Type | Type
BarcodeValue | String | Get

This property contains a list of all barcodes found in the input image file.
BarcodeValueLength is needed to determine the size of the output buffer.

Syntax of output:

[BARCODE:[pagenumber;]BarcodeType;TextValue;]

Example

BARCODE:DATAMATRIX;00011122233344455566677788
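
The output syntax above can be split into individual barcodes on the caller's side. The following Python sketch is one possible approach, not CIB functionality: parse_barcode_value is a hypothetical helper, and it assumes that the text value of a barcode never contains the literal string "BARCODE:". The flag with_page_numbers must match the BarcodeShowPageNumber setting used for the run.

```python
def parse_barcode_value(value, with_page_numbers=False):
    """Split a BarcodeValue string into (page, type, text) tuples.

    Hypothetical helper. page is None when page numbers are not
    part of the output (BarcodeShowPageNumber=0).
    """
    results = []
    for chunk in value.split("BARCODE:"):
        chunk = chunk.strip().rstrip(";")  # drop the trailing separator
        if not chunk:
            continue
        if with_page_numbers:
            page, kind, text = chunk.split(";", 2)
            results.append((int(page), kind, text))
        else:
            kind, text = chunk.split(";", 1)
            results.append((None, kind, text))
    return results


single = parse_barcode_value("BARCODE:DATAMATRIX;00011122233344455566677788")
paged = parse_barcode_value(
    "BARCODE:1;DATAMATRIX;00011122233344455566677788;"
    "BARCODE:2;DATAMATRIX;899374032904908",
    with_page_numbers=True,
)
print(single)
print(paged)
```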

BarcodeValueLength

Property-Name | Data-Type | Type
BarcodeValueLength | String | Get

 

This property contains the length of the output result and thus gives the required size of output buffer.  

 

Syntax of output: 

[integer] (string representation)  

 

Example 

1000 

DatamatrixAngleDeviation

Property-Name | Data-Type | Type
DatamatrixAngleDeviation | String | Set

For Datamatrix only.

This property gives the allowed deviation of the corners of a rectangle from a right angle, in degrees (0-90).

The appropriate deviation depends on the application:

  • Faxing and flatbed scanning: A low squareness deviation (5-10 degrees) is enough, since all right angles in the subject will appear as right angles in the image.
  • Scanning from a cell phone or webcam: Higher deviations (20-40 degrees) should be set, as distortion due to extreme scanning angles may occur. The dmtxread utility allows large deviation values by default.

Syntax

DatamatrixAngleDeviation=<value>
<value>: 0 to 90

default=10

Example

DatamatrixAngleDeviation=20

DatamatrixShrinkingFactor

Property-Name | Data-Type | Type
DatamatrixShrinkingFactor | String | Set

For Datamatrix only.

This property sets a factor by which a high-resolution image is shrunk internally.
This can improve performance considerably, as the number of pixels per page is reduced. It helps especially when an image has a high resolution but a blurry focus.

Syntax

DatamatrixShrinkingFactor=<value>
<value>: 1 to N

default=1
(no change of original resolution)

Example

DatamatrixShrinkingFactor=2

The resolution is halved.


DatamatrixScanGap

Property-Name | Data-Type | Type
DatamatrixScanGap | String | Set

For Datamatrix only.

This property specifies the size of the gaps in the scan grid pattern (in pixels).

Increasing the gaps (e.g. to 100) can improve performance, but if the grid is too coarse, the barcode may not be found at all.

Syntax

DatamatrixScanGap=<value>
<value>: 1 to N

default=1

Example

DatamatrixScanGap=50

DatamatrixThreshold

Property-Name | Data-Type | Type
DatamatrixThreshold | String | Set

For Datamatrix only.

Lowering the threshold can increase the number of features to be scanned and thereby slow down processing, but this may be necessary if the image is blurry or has low contrast.
Sometimes lowering the threshold actually improves performance, because a good barcode candidate is found more quickly.

Syntax

DatamatrixThreshold=<value>
<value>: 1 to 100

default=5

Example

DatamatrixThreshold=10

Weak edges below threshold 10 are ignored.


ZBarConfig

(From CIB ocr version 2.4.0) 

 

Property-Name | Data-Type | Type
ZBarConfig | String | Set

Property for tuning of barcode recognition (ZBar functionality).

Syntax

ZBarConfig=config_line1[;config_line2[;config_line3...]]

Example

ZBarConfig=code39.enable
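
Several configuration lines are joined with ";". The following Python sketch shows one way to assemble the value; build_zbar_config is a hypothetical helper, the lines are treated as opaque strings, and the second line in the usage (code128.disable) is only an assumed illustration that follows the pattern of the example above.

```python
def build_zbar_config(config_lines):
    """Join configuration lines into a ZBarConfig property assignment.

    Hypothetical helper. Empty lines are skipped; a line containing
    ";" is rejected because ";" is the separator between lines.
    """
    lines = [line.strip() for line in config_lines if line.strip()]
    for line in lines:
        if ";" in line:
            raise ValueError("config line must not contain ';': " + line)
    return "ZBarConfig=" + ";".join(lines)


# "code128.disable" is an assumed illustration, not taken from the manual.
print(build_zbar_config(["code39.enable", "code128.disable"]))
```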

10. Properties Word-Recognition

Recognize
DictionaryPath
InputFormat
WordRecognizerOptions
WordRecognizerResult

Recognize

Property-Name | Data-Type | Type
Recognize | String | Set

 

In order to use the WordRecognizer this property has to be set to “WordRecognizer”.  

Syntax 

Recognize=<Value> 
<Value>:BarcodeRecognizer | OcrRecognizer | WordRecognizer 

default=OcrRecognizer   

Example 

Recognize=WordRecognizer 

 

DictionaryPath

Property-Name | Data-Type | Type
DictionaryPath | String | Set

The dictionary path can also be defined within WordRecognizerOptions. It is recommended to define it there, as WordRecognizerOptions overrules this property.

In any case, the dictionary path has to be defined in at least one of the two places: this property or WordRecognizerOptions.

Syntax

DictionaryPath=<Value>

No default value! It has to be set.

Example

DictionaryPath=".\\hunspell"

InputFormat

Property-Name | Data-Type | Type
InputFormat | String | Set

This property specifies the format of the input that is passed to the WordRecognizer.

Syntax

InputFormat=<Value>
<Value>: HOCR | UTF8 | UTF16

HOCR: input is an HOCR file, which must be UTF-8 encoded
UTF8: input is plain text in UTF-8 encoding (with or without UTF-8 BOM)
UTF16: input is plain text in UTF-16 encoding. The BOM (FE FF or FF FE) must be present

Example

InputFormat=UTF8
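
For plain-text input, the BOM rules above allow the caller to choose the value automatically. The following Python sketch is an illustration under stated assumptions: guess_input_format is a hypothetical helper, it only distinguishes UTF8 from UTF16, and it does not detect HOCR input.

```python
import codecs


def guess_input_format(data: bytes) -> str:
    """Return "UTF16" if the data starts with a UTF-16 BOM, else "UTF8".

    Hypothetical helper: FE FF and FF FE are the two UTF-16 byte order
    marks; anything else (including a UTF-8 BOM) is treated as UTF8.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF16"
    return "UTF8"


print(guess_input_format("Deutschland".encode("utf-16")))
print(guess_input_format("Deutschland".encode("utf-8")))
```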

WordRecognizerOptions

This property is defined as a JSON string and contains all the information that the WordRecognizer needs in order to analyse the document.

Property-Name | Data-Type | Type
WordRecognizerOptions | String | Set

The property might look like the following example. A more detailed explanation of each component is given after the example:

Example: 

{
"DictionaryPath": "D:\\PROJEKTE-SVN\\products\\CIB ocr\\trunk\\src-test\\testdata\\hunspell",
"Dictionaries": {"DE": {"Dictionaries": "de_DE_frami-UTF8", "StopwordFiles": "stopword_german.txt", "DigramScores": "de_digramscores.txt"}},
"InputFormat": "UTF8",
"RecognizedWordsFilename": "recognized.log",
"RejectedWordsFilename": "rejected.log",
"StatisticsFilename": "statistics.log",
"StatisticsOutputFormat": "FormatCsv"}

Explanation of each component: 

Component | Value | Note
InputText | <string> | (required if property InputFilename / InputMemoryAddress is empty) Text to parse; must be in UTF-8 format.
DictionaryPath | <string> | (required) Path to the hunspell folder; may be absolute or relative to the working directory.
Dictionaries | <dictionary-object> | (required) Dictionaries to use, one or more dictionaries for each language (see below).
InputFormat | <string> | Specifies the input format (e.g. “UTF8”).
RecognizedWordsFilename | <string> | (optional) Filename for recognized words. Writes the number of occurrences of each recognized word, per language. A word is considered recognized if it is not a stop word and is contained in at least one dictionary for that language. For the language "<GLOBAL>", all stop word lists are ignored, and the word is recognized if it is contained in at least one dictionary (excluding stop word dictionaries). The words are written as <language> <TAB> <count> <TAB> <word>.
RejectedWordsFilename | <string> | (optional) Filename for rejected words. Writes the number of occurrences of each rejected word, per language. A word is considered rejected if it is not a stop word and is not contained in any dictionary for that language. For the language "<GLOBAL>", all stop word lists are ignored, and the word is rejected if it is not contained in any dictionary (excluding stop word dictionaries). The words are written as <language> <TAB> <count> <TAB> <word>.
StatisticsFilename | <string> | (optional) Filename for the summary of the word recognizer run.
StatisticsOutputFormat | <string> | Defines the output format for the summary. "FormatText": output is written as tabbed text. "FormatCsv": output is written in CSV format (with ";" as delimiter). "FormatJSON": the property "WordRecognizerResult" is written to the specified file (as JSON string).
StatisticsPerPage | <boolean> | Adds page-wise statistics to the WordRecognizer result (if InputFormat is not HOCR, all input text is considered as page 1).
TextAcceptThreshold | <number> | Sets the "TextAccepted" flag in the result if the "longer glyph ratio" is at least this value. Only meaningful if LargeWordLimit > 0.
SmallWordLimit | <number> | (optional) If > 0, words with at most that many characters are counted in the "SmallWord" group.
LargeWordLimit | <number> | (optional) If > 0, words with at least that many characters are counted in the "LargeWord" group.
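
Because the property value is a JSON string, it is less error-prone to generate it with a JSON library than to write it by hand (commas, colons and escaped backslashes are then handled automatically). The following Python sketch is one possible approach; build_word_recognizer_options is a hypothetical helper, and the dictionary names are taken from the examples in this chapter.

```python
import json


def build_word_recognizer_options(dictionary_path, dictionaries,
                                  input_format="UTF8", **optional):
    """Serialize WordRecognizerOptions to a JSON string.

    Hypothetical helper. DictionaryPath and Dictionaries are the
    required components; further components (e.g. StatisticsFilename)
    can be passed as keyword arguments.
    """
    if not dictionary_path:
        raise ValueError("DictionaryPath is required")
    if not dictionaries:
        raise ValueError("Dictionaries is required")
    options = {
        "DictionaryPath": dictionary_path,
        "Dictionaries": dictionaries,
        "InputFormat": input_format,
    }
    options.update(optional)
    return json.dumps(options)


options_json = build_word_recognizer_options(
    ".\\hunspell",
    {"DE": ["de_DE-frami-UTF8", "de_user"], "EN": "en_US-UTF8"},
    StatisticsFilename="statistics.log",
    StatisticsOutputFormat="FormatCsv",
)
print(options_json)
```

The resulting string can be passed as the value of the WordRecognizerOptions property.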

 

Component “Dictionaries”:

Specification of <dictionary-object> (as referenced in the table above):

{<language-name>: <language>, ...}

<language-name> = JSON-String: "..." (specifies a language name)
<language> = JSON-String: "..." (specifies a single dictionary for that language, no stopwords)
<language> = JSON-Array: ["...","..."] (specifies one or more dictionaries for that language, no stopwords)
<language> = JSON-Object: {"Dictionaries": <dictionaries>, "StopwordDictionaries": <stopword-dicts>, "Stopwords": <stopwords>}
<dictionaries> = JSON-String: "..." (specifies a single dictionary for that language)
<dictionaries> = JSON-Array: ["...","..."] (specifies one or more dictionaries for that language)
<stopword-dicts> = JSON-String: "..." (specifies a single stopword dictionary for that language)
<stopword-dicts> = JSON-Array: ["...","..."] (specifies one or more stopword dictionaries for that language)
<stopwords> = JSON-Array: ["...", "..."] (specifies a list of stopwords (UTF-8 encoded))

Example 1: 

"Dictionaries": {"DE": ["de_DE-frami-UTF8", "de_user"], "EN": "en_US-UTF8"} 

Example 2:

"Dictionaries": {
"DE": {"Dictionaries": "de_DE-frami-UTF8", "StopwordDictionaries": "de_stopwords", "DigramScores": "de_digramscores.txt"},
"EN": {"Dictionaries": "en_US-UTF8", "Stopwords": ["a", "an", "in"], "DigramScores": "de_digramscores.txt"} }

WordRecognizerResult

Property-Name | Data-Type | Type
WordRecognizerResult | String | Set

This property contains all the output information. It makes sense to set the property to mode=“out”. As a result, the output-output.xml and the trace file contain all the results of the WordRecognizer (in addition to the statistics files).

In version 2.7, WordRecognizerResult is a JSON object as follows:  

{<language-string>: <statistics-object>, ...} 

language-string: one of the language strings given in the WordRecognizerOptions (for instance ‘EN’ for english) 

 
<statistics-object> =  
{ 
"SmallWordCount": <number> number of recognized words (excluding stop words), which are small words (according to SmallWordLimit) 
"LargeWordCount": <number> number of recognized words (excluding stop words), which are large words (according to LargeWordLimit) 
"MainWordCount": <number> number of words which are recognized, and are not stop words 
"StopWordCount": <number> number of words which are in the stop word list/dictionary 
"RejectedWordCount": <number> number of words which are neither stop words nor in one of the language dictionaries 
(for example, most English words are rejected in German dictionaries)
"TotalWordCount": <number> number of words which are recognized (including stop words). Should equal MainWordCount + StopWordCount

"SmallWordCoverage": <number> number of characters over all small words
"LargeWordCoverage": <number> number of characters over all large words
"MainWordCoverage": <number> number of characters over all recognized words (excluding stop words)
"StopWordCoverage": <number> number of characters over all stop words
"RejectedWordCoverage": <number> number of characters over all rejected words
"TotalWordCoverage": <number> number of characters over all recognized words
"MainWordCountPerLength": [<length1>, <count1>, <length2>, <count2>, ...] number of occurrences per word length (counting only recognized words which are not stop words)
"TotalWordCountPerLength": [<length1>, <count1>, <length2>, <count2>, ...] number of occurrences per word length (counting only recognized words, including stop words, but not rejected words)

"GlyphRatioLongWords": <number> (old) percentage of long words in relation to all words. (Glyphs like "%", "&" etc. are filtered out beforehand.)

"LongerGlyphRate": <number> (new) percentage of not-short words in relation to all words. (Glyphs like "%", "&" are included; a large number of such symbols therefore reduces this value.)

"DigramScoreArithmetic": <number> number in the range [0;9] that indicates the text quality based on digram score tables. There is a score table for each language. For the "All" language, the score table of the language with the highest TotalWordCount is used.

"FulltextQuality": <number> percentage that indicates the text quality. The formula takes the following values into consideration:

GlyphRatioLongWords, LongerGlyphRate, TotalWordCount, DigramScoreArithmetic
} 

Note 1: In addition to the languages specified in WordRecognizerOptions, there is an additional language "<GLOBAL>". This (virtual) language consists of all dictionaries over all languages, excluding all stop word dictionaries and stop word lists. This means, if a token (word to check) is contained in at least one of these dictionaries, it is considered as "recognized". Otherwise, it is considered as "rejected" 
Note 2: The character count counts only the characters of the words passed to the spellchecker. The parser may have eliminated blanks, numbers, punctuation marks, quotes, hyphens and such.  

 

Since version 2.8, WordRecognizerResult is a JSON object as follows:

{"DocumentStatistics": <language-statistics>, "PageStatistics": <page-statistics>}
("PageStatistics" is only present if the "StatisticsPerPage" option is set to true)
 

<page-statistics> is a JSON object with page numbers as key and <language-statistics>  objects as value. 
Example: {"1": <language-statistics> , "3": <language-statistics> } 
(if the input is HOCR, and a page has no "ppageno" attribute, the page number is "0")  

 

<language-statistics> is a JSON object as follows: 
{"AllLanguages": <word-statistics> , 
"Languages": <language-specific-statistics> , 
"TextAccepted": true | false} 
(TextAccepted is false if the GlyphRatioLongWords of "AllLanguages" is lower than the TextAcceptThreshold specified in the WordRecognizerOptions. The TextAccepted flag of the document-global statistics is also set to false if at least one page has a "longer glyph ratio" below the threshold, even if there are enough other pages to lift the global ratio above the limit.)

<language-specific-statistics> is a JSON object as follows:
{<language-key>: <word-statistics>, ...}
where <language-key> is one of the language keys defined in the WordRecognizerOptions
(note: the statistics for all languages combined are now the value of "AllLanguages". The special language key "<GLOBAL>" is no longer used)

<word-statistics> is the same object as the <statistics-object> described above for version 2.7, but with an additional key "GlyphRatioLongWords". The value of this key is defined as LargeWordCoverage * 100 / (TotalWordCoverage + RejectedWordCoverage), i.e. the ratio of glyphs in long words to the total number of checked glyphs (excluding blanks, delimiters, numbers). The value is expressed as an integer percentage ranging from 0 to 100.

Since version 2.14, there are two additional keys: 
"RawGlyphCount": <number> number of glyphs before parsing (excluding whitespaces but including digits, punctuation marks etc.) 
"LongerGlyphRate": calculated as 
(TotalWordCoverage - SmallWordCoverage) / RawGlyphCount.  
This is the ratio of glyphs in recognized words which are not "small", compared to the number of all glyphs including digits etc. (see above). 
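
The two ratios can be recomputed from the coverage values for plausibility checks. The following Python sketch applies the formulas given in this chapter; the function names are hypothetical, and two details are assumptions of the sketch rather than documented behaviour: the integer percentage is obtained by truncation, and LongerGlyphRate is returned as a plain ratio without scaling.

```python
def glyph_ratio_long_words(large_word_coverage, total_word_coverage,
                           rejected_word_coverage):
    """LargeWordCoverage * 100 / (TotalWordCoverage + RejectedWordCoverage).

    Returned as an integer percentage (truncated; the rounding mode
    actually used by CIB ocr is not documented here).
    """
    checked = total_word_coverage + rejected_word_coverage
    if checked == 0:
        return 0
    return large_word_coverage * 100 // checked


def longer_glyph_rate(total_word_coverage, small_word_coverage,
                      raw_glyph_count):
    """(TotalWordCoverage - SmallWordCoverage) / RawGlyphCount."""
    if raw_glyph_count == 0:
        return 0.0
    return (total_word_coverage - small_word_coverage) / raw_glyph_count


# Made-up coverage values, for illustration only.
print(glyph_ratio_long_words(400, 450, 50))
print(longer_glyph_rate(450, 50, 800))
```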

An example of WordRecognizerResult, with multiple pages and languages, might look like this: 

WordRecognizerResult = {
"DocumentStatistics": {
"AllLanguages": {"GlyphRatioLongWords": 80, ...},
"Languages": {
"DE": {"GlyphRatioLongWords": 55, ...},
"EN": {"GlyphRatioLongWords": 33, ...}
},
"TextAccepted": false
},
"PageStatistics": {
"1": {
"AllLanguages": {"GlyphRatioLongWords": 100, ...},
"Languages": {
"DE": {"GlyphRatioLongWords": 100, ...},
"EN": {"GlyphRatioLongWords": 16, ...}
},
"TextAccepted": true
},
"2": {
"AllLanguages": {"GlyphRatioLongWords": 55, ...},
"Languages": {
"DE": {"GlyphRatioLongWords": 0, ...},
"EN": {"GlyphRatioLongWords": 55, ...}
},
"TextAccepted": false
}
}
}
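
A result of this shape can be evaluated with any JSON parser. The following Python sketch lists the pages whose "TextAccepted" flag is false; pages_not_accepted is a hypothetical helper, and the sample data is a shortened, made-up result in the format used since version 2.8.

```python
import json


def pages_not_accepted(result_json):
    """Return the page numbers (as strings) whose TextAccepted flag is false.

    Hypothetical helper. "PageStatistics" may be missing if the
    StatisticsPerPage option is not set; then the list is empty.
    """
    result = json.loads(result_json)
    pages = result.get("PageStatistics", {})
    rejected = [page for page, stats in pages.items()
                if not stats.get("TextAccepted", False)]
    return sorted(rejected, key=int)


# Shortened sample data for illustration only.
sample = json.dumps({
    "DocumentStatistics": {
        "AllLanguages": {"GlyphRatioLongWords": 80},
        "Languages": {},
        "TextAccepted": False,
    },
    "PageStatistics": {
        "1": {"AllLanguages": {"GlyphRatioLongWords": 100},
              "Languages": {}, "TextAccepted": True},
        "2": {"AllLanguages": {"GlyphRatioLongWords": 55},
              "Languages": {}, "TextAccepted": False},
    },
})
print(pages_not_accepted(sample))
```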

 

A complete Job XML-Example for Word Recognition might look like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<root>
<Comod>
<defaults/>
<jobs>
<job name="TextRecognize">
<properties>
<property name="LicenseCompany">Example Company</property>
<property name="LicenseKey">xxxx-xxxx-xxxxxxxx</property>
<property name="OutputMode">Xml</property>
</properties>
<steps>
<step name="ocr-step" command="ocr">
<properties>
<property name="LicenseCompany">CIB Demo</property>
<property name="LicenseKey">xxxx-xxxx-xxxxxxxx</property>
<property name="InputFilename">..\templates-txt\wikipedia-Deutschland_DE.txt</property>
<property name="TraceFilename">ocr.txt</property>
<property name="PageSelection">All</property>
<property name="Recognize">WordRecognizer</property>
<property name="WordRecognizerOptions">{
"DictionaryPath": "..\\hunspell",
"Dictionaries":{"deu":{"Dictionaries":"de_DE_frami-UTF8","StopwordFiles":"de_stopwords.txt"}},"InputFormat":"UTF8","RecognizedWordsFilename": "recognizedWords.txt","StatisticsFilename":"statistics.txt","StatisticsOutputFormat":"FormatJSON"}
</property>
</properties>
</step>
</steps>
</job>
</jobs>
</Comod>
</root>

11. Technical interface: Native functions

This chapter provides a brief overview of native functions.

CIB ocr job handle
CibOcrJobCreate
CibOcrJobSetProperty
CibOcrJobSetPropertyW
CibOcrJobGetProperty
CibOcrJobGetPropertyW
CibOcrJobGetProgress
CibOcrJobStart
CibOcrJobFree
CibOcrJobCancel
CibOcrGetVersion
CibOcrGetVersionText
CibOcrGetVersionTextW
CibOcrJobGetErrorText
CibOcrJobGetErrorTextW
CibOcrJobGetError

CIB ocr job handle

Every CIB ocr task is assigned to a „job handle“ of the type Handle*. This object represents the task. The steps

  • Setting and reading properties (CibOcrJobSetProperty / CibOcrJobGetProperty)
  • Executing the task (CibOcrJobStart)
  • Getting error information (CibOcrJobGetError / CibOcrJobGetErrorText)

always refer to such a job handle.

A CIB ocr task is initiated by creating a job handle (CibOcrJobCreate). After the necessary properties have been set and the task has been run, the job handle is released again (CibOcrJobFree).
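
The life cycle described above (create, set properties, start, query the error, free) can be sketched independently of a concrete binding. The following Python example is an illustration only: FakeOcrApi is a stand-in that merely records calls, and the method names job_create, job_set_property, job_start, job_get_error and job_free are hypothetical wrappers around the native functions, not an existing CIB API.

```python
class FakeOcrApi:
    """Stand-in for a binding to the CIB ocr library (hypothetical)."""

    def __init__(self):
        self.calls = []

    def job_create(self):
        self.calls.append("CibOcrJobCreate")
        return 1  # dummy job handle

    def job_set_property(self, job, name, value):
        self.calls.append("CibOcrJobSetProperty")
        return True

    def job_start(self, job):
        self.calls.append("CibOcrJobStart")
        return True

    def job_get_error(self, job):
        self.calls.append("CibOcrJobGetError")
        return 0

    def job_free(self, job):
        self.calls.append("CibOcrJobFree")
        return True


def run_job(api, properties):
    """Run one job and return its error code (0 = no error)."""
    job = api.job_create()
    try:
        for name, value in properties.items():
            if not api.job_set_property(job, name, value):
                return api.job_get_error(job)
        if not api.job_start(job):
            return api.job_get_error(job)
        return 0
    finally:
        # The handle is released even if a step fails.
        api.job_free(job)


api = FakeOcrApi()
print(run_job(api, {"InputFilename": "input.tif"}))
print(api.calls)
```

Releasing the handle in a finally block mirrors the recommendation to always pair CibOcrJobCreate with CibOcrJobFree.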


CibOcrJobCreate

bool exportfunc CibOcrJobCreate(Handle *job);

This function creates a job handle. The job handle is passed to all subsequent functions to ensure thread safety. It should be released again with CibOcrJobFree after the task is completed.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Creates a new job handle and stores it at *Job


CibOcrJobSetProperty

bool exportfunc CibOcrJobSetProperty(Handle job, const char *name, const char *value);

This function sets a property for a CIB ocr job. The names and values are expected to be UTF-8 encoded, zero-terminated strings.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that this property refers to
char* | Name | Name of the property that is to be set
char* | Value | Value of the property that is to be set


CibOcrJobSetPropertyW

Windows

bool exportfunc CibOcrJobSetPropertyW(Handle job, const wchar *name, const wchar *value);

This function sets a property for a CIB ocr job. The names and values are expected to be zero-terminated wide strings.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that this property refers to
wchar* | Name | Name of the property that is to be set
wchar* | Value | Value of the property that is to be set


CibOcrJobGetProperty

bool exportfunc CibOcrJobGetProperty(Handle *job, const char *name, const char *buffer, int size);

This function writes the current value of the specified property into the given buffer. The returned value is a zero-terminated string in UTF-8 encoding.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that this property refers to
char* | Name | Name of the property whose value is to be returned
char* | Buffer | Buffer that receives the property’s current value
int | Size | Maximum buffer length


CibOcrJobGetPropertyW

Windows

bool exportfunc CibOcrJobGetPropertyW(Handle *job, const wchar *name, const wchar *buffer, int size);

This function writes the current value of the specified property into the given buffer. The returned value is a zero-terminated wide string.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that this property refers to
wchar* | Name | Name of the property whose value is to be returned
wchar* | Buffer | Buffer that receives the property’s current value
int | Size | Maximum buffer length


CibOcrJobGetProgress

(From CIB ocr version 2.3.2) 

bool exportfunc CibOcrJobGetProgress(Handle* job, char *buffer, size_t size);

This function returns the progress of the recognition in percent.

It fills the buffer with the string:
<page_number> <page_count> <page_progress>

  • <page_number>       number of the page that is being processed at the moment
  • <page_count>           total page count
  • <page_progress>     progress for the current page

Special values of <page_progress>:

  • 1   Recognition process has not started
  • 2   Recognition finished successfully
  • 3   Recognition cancelled
  • 4   Recognition finished with error

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job whose progress is queried
char* | Buffer | Buffer that receives the progress string
size_t | Size | Maximum buffer length
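
On the caller's side, the progress string can be split into its three fields. The following Python sketch assumes the fields are separated by whitespace, as shown above; Progress and parse_progress are hypothetical names, and the sketch deliberately does not interpret the special values of <page_progress>.

```python
from typing import NamedTuple


class Progress(NamedTuple):
    page_number: int    # page that is being processed at the moment
    page_count: int     # total page count
    page_progress: int  # progress for the current page


def parse_progress(buffer_text):
    """Parse "<page_number> <page_count> <page_progress>" into a Progress."""
    fields = buffer_text.split()
    if len(fields) != 3:
        raise ValueError("expected three fields, got %d" % len(fields))
    return Progress(*(int(field) for field in fields))


# Made-up progress string, for illustration only.
print(parse_progress("3 10 42"))
```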

 


CibOcrJobStart

bool exportfunc CibOcrJobStart(Handle *job);

This function starts a CIB ocr job.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that is to be started


CibOcrJobFree

bool exportfunc CibOcrJobFree(Handle *job);

This function frees the created CibOcrJobHandle and other resources allocated by CIB ocr.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that is to be terminated


CibOcrJobCancel

(From CIB ocr version 2.3.2)

bool exportfunc CibOcrJobCancel(Handle* job);

This function stops the recognition process.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the job that is to be cancelled


CibOcrGetVersion

bool CibOcrGetVersion(unsigned long * iVersion);

This function provides access to the current CIB ocr version number as an integer.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
unsigned long* | iVersion | Pointer to the stored product version


CibOcrGetVersionText

bool exportfunc CibOcrGetVersionText(char *text, long *maxlength);

This function provides access to the current CIB ocr version number as a string.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
char* | Text | Pointer to the character buffer where the version text is stored
long* | maxlength | Maximum length of the version text


CibOcrGetVersionTextW

Windows

bool exportfunc CibOcrGetVersionTextW(wchar *text, long *maxlength);

This function provides access to the current CIB ocr version number as a wide string.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
wchar* | Text | Pointer to the character buffer where the version text is stored
long* | maxlength | Maximum length of the version text


CibOcrJobGetErrorText

bool exportfunc CibOcrJobGetErrorText(Handle *job, char *text, long *maxlength);

This function returns the error text that was produced by the execution of a function.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the current job
char* | Text | Pointer to the string buffer where the error message text is stored
long* | maxlength | Maximum length of the error message (size of message buffer)


CibOcrJobGetErrorTextW

Windows

bool exportfunc CibOcrJobGetErrorTextW(Handle *job, wchar *text, long *maxlength);

This function returns the error text that was produced by the execution of a function, as a wide string.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the current job
wchar* | Text | Pointer to the string buffer where the error message text is stored
long* | maxlength | Maximum length of the error message (size of message buffer)


CibOcrJobGetError

bool exportfunc CibOcrJobGetError(Handle *job, int *ErrorCode);

This function gives access to the current error state of CIB ocr after a function has been executed.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type | Variable | Description
Handle* | Job | Handle of the current job
int* | ErrorCode | Outputs the current error code

For all possible error codes, see the chapter Error Codes.


12. JNI Interface

CIB ocr also provides a JNI interface.

In order to use the JNI interface, three Java classes are required:
CibOcr.java, CibOcrConstants.java, CibOcrJNI.java

CibOcrJNI.java contains the following methods:

public final static native int CibOcrJobCreate(long[] jarg1);
public final static native int CibOcrJobStart(long jarg1);
public final static native int CibOcrJobCancel(long jarg1);
public final static native int CibOcrJobReset(long jarg1);
public final static native int CibOcrJobFree(long[] jarg1);
public final static native int CibOcrJobGetProperty(long jarg1, String jarg2, byte[] jarg3);
public final static native int CibOcrJobSetProperty(long jarg1, String jarg2, String jarg3);
public final static native int CibOcrJobGetProgress(long jarg1, byte[] jarg2);
public final static native int CibOcrGetVersion(long[] jarg1);
public final static native int CibOcrGetVersionText(byte[] jarg1);
public final static native int CibOcrJobGetError(long jarg1, int[] jarg2);
public final static native int CibOcrJobGetErrorText(long jarg1, byte[] jarg2);

 

The methods listed above are almost identical to the native C++ functions (see chapter Technical interface: Native functions). However, the JNI interface internally calls only the function variants that accept wide characters (for example CibOcrGetVersionTextW).


13. Error Codes

error code | description
0 | no error
9 | input file not found
11 | the function/method has not been implemented
47 | buffer too small
99 | the specified property name is not supported
122 | invalid or missing license
198 | unexpected exception
951 | Neither image file nor memory address are specified
952 | Can not load image file
953 | Can not load image from memory
954 | Invalid property value
955 | Incorrect barcode type. Should be Datamatrix.
956 | Can not open file for writing
957 | Can not create MODI control
958 | MODI recognition failed
959 | File not found
960 | Can not load tessdll.dll
961 | Tesseract recognition failed
962 | Image type recognition failed
963 | Image type is not supported
964 | Can not load cuneiform.dll
965 | Cuneiform recognition failed
966 | Can not load FineReader FREngine.dll
967 | FineReader recognition failed
968 | Can not load Omnipage KernelAPI.dll
969 | Omnipage recognition failed
970 | conversion to output codepage failed
971 | Invalid argument
972 | Output result error
973 | Recognition was cancelled
974 | Invalid output format specified
975 | Invalid output type specified
976 | Preprocessor can't be applied
977 | Invalid recognizer name
978 | "DataFolder" or "TESSDATA_PREFIX" should be defined
979 | Error during initialization OCR framework
980 | Error during text recognition
981 | Invalid or unsupported barcode type specified
982 | Error during initialization barcode recognition framework
983 | Error during barcodes recognition
984 | The specified configuration file does not exist
985 | The specified configuration file has invalid format
986 | Invalid xfdf input specified
987 | word recognizer error
988 | Path to dictionaries is missing
989 | Path to dictionaries is invalid
990 | Dictionary not found
991 | Unknown input format
992 | The HOCR file could not be parsed
993 | The HOCR file could not be processed
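
In calling code, the numeric codes can be mapped to readable messages for logging. The following Python sketch uses only a subset of the table above; ERROR_TEXTS and describe_error are hypothetical names, and the fallback text for unknown codes is an assumption of the sketch.

```python
# Subset of the error code table above; extend as needed.
ERROR_TEXTS = {
    0: "no error",
    9: "input file not found",
    47: "buffer too small",
    122: "invalid or missing license",
    973: "Recognition was cancelled",
    990: "Dictionary not found",
}


def describe_error(code):
    """Return the description for an error code, or a generic fallback."""
    return ERROR_TEXTS.get(code, "unknown error code %d" % code)


print(describe_error(47))
print(describe_error(12345))
```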


14. Trace

In case of unclear error situations, a trace file can be created:

TraceFilename

Property-Name | Data-Type | Type
TraceFilename | String | Set

Syntax

TraceFilename=<filename>
<filename>: tracename.log

Example

TraceFilename=ocrtrace.log

 

Environment Variable

Alternatively, set the environment variable CIB_OCRTRACE to a filename and then start the process that produces the error.

Example:

set CIB_OCRTRACE=ocrtrace.log
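
When the failing process is started from a script, the variable can be set for that child process only. The following Python sketch prepares such an environment; trace_environment is a hypothetical helper, and the commented-out command line is only an assumed example based on the CIB runshell call shown in chapter 1.

```python
import os


def trace_environment(trace_file, base_env=None):
    """Return a copy of the environment with CIB_OCRTRACE set.

    Hypothetical helper: the current process environment is left
    unchanged, so tracing only affects the process started with
    the returned mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CIB_OCRTRACE"] = trace_file
    return env


env = trace_environment("ocrtrace.log")
print(env["CIB_OCRTRACE"])

# Assumed usage, not executed here:
# import subprocess
# subprocess.run(["cibrsh.exe", "-oc", "input.tif", "output.txt"], env=env)
```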