CIB ocr technical manual (EN)

1. Scope of Delivery

CIB ocr is delivered as a binary module, in the form of DLLs for Windows or shared libraries (Unix).

Component	Scope of software
CIB ocr	CIBOcr32.dll or CIBOcr64.dll (the CIB ocr DLL, interface for the application)
Tesseract language package	Folder „tessdata“ containing dictionaries for the languages used, e.g. deu.traineddata for German.
Hunspell dictionary	Hunspell dictionaries and stopword lists within the ‘hunspell’ directory
CIB runshell	cibrsh.exe cibrsh64.exe

Note:

The language package mentioned above is used for Tesseract text recognition. By default, CIB ocr searches the subfolder “tessdata” within the current folder. If a different folder is used for the language files, it has to be declared in the property “DataFolder”.

CIB runshell

CIB runshell (cibrsh.exe) makes it possible to call the CIB ocr DLL directly. With this call, text and barcodes can be extracted from a specified input file.

Example:

cibrsh.exe –oc input.tif output.txt

CIB job

(From CIB ocr version 2.3.0 and CIB job version 1.8.0)

CIB ocr DLL can be started via CIB job xml.

CIB job xml can be used by CIB runshell (cibrsh.exe) or CIB documentserver.

Example CIB runshell:

cibrsh.exe –d job.xml

For a detailed example of a CIB job XML, see chapter 5, Calling CIB ocr via CIB job/CIB documentServer.

2. Introduction

CIB ocr is a CIB solution for optical character recognition and mainly employed as a component of CIB products and modules like CIB doXima, CIB pdf toolbox or CIB doXisafe. It is used to support automatic text- and barcode-recognition in all these modules. For text-recognition the engine Tesseract is used.

About Tesseract:

Tesseract is an optical character recognition engine and is considered to be one of the most accurate open source OCR engines currently available. It is free software, released under the Apache License and development has been sponsored by Google since 2006.

Tesseract is available for Linux, Windows and Mac OS X. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract is able to process a lot of different languages, German included.

Input formats supported by CIB ocr:

bmp image;
tiff image (including multipage tiff);
jpeg image;
png image;
xml (serialized cv::Mat);
yml (serialized cv::Mat)

Supported barcode-types:

Code128A, Code128B, Code128C, Code128Auto
Code39
Code39Extended
QR-Code
Datamatrix

For output the following formats can be used:

plain text
hocr

3. Use case: Calling CIB ocr via CIB runshell

In the following example, a QR-code from a .jpg image file is processed by CIB ocr and output in a .txt textfile.

First a valid license is set and tracing is activated. To recognize the QR-code, the properties “BarcodeType” and “Recognize” have to be set accordingly. Then the output filename and the output directory are set. CIB ocr is called via the command “–oc”, followed by the input filename of the .jpg image that is to be processed.

All properties used in this call are described in more detail in the property chapters (chapter 6 onwards).

CIB runshell can be called directly via command line in the appropriate directory, or via a batch script.

Cibrsh.exe LicenseCompany="Example Company" LicenseKey="xxxx-xxxx-xxxxxxxx"
TraceFilename="trace.log" BarcodeType=QR Recognize=BarcodeRecognizer
BarcodeOutputFilename="./output/CIB_web.jpg.txt" –oc "./input/CIB_web.jpg"

Inputfile:

This is the QR Code as JPG file:

Result:

The output textfile contains the following text:

BARCODE:QR;http://www.cib.de
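
When the call is issued from a script instead of being typed by hand, the same argument list can be assembled programmatically. The following Python sketch is only an illustration: the helper name build_cibrsh_args is hypothetical, the property names and values are taken from the example above, and the command switch is assumed to be typed with a plain hyphen on the command line.

```python
import subprocess


def build_cibrsh_args(executable, properties, command_switch, files):
    """Return the argument list: executable, Name=Value pairs, switch, files."""
    args = [executable]
    # Each property becomes one Name=Value argument. No shell quoting is
    # needed because the list is handed to the process without a shell.
    args += [f"{name}={value}" for name, value in properties.items()]
    args.append(command_switch)
    args += list(files)
    return args


properties = {
    "LicenseCompany": "Example Company",
    "LicenseKey": "xxxx-xxxx-xxxxxxxx",
    "TraceFilename": "trace.log",
    "BarcodeType": "QR",
    "Recognize": "BarcodeRecognizer",
    "BarcodeOutputFilename": "./output/CIB_web.jpg.txt",
}
# "-oc" with a plain hyphen is an assumption about how the switch is typed.
args = build_cibrsh_args("cibrsh.exe", properties, "-oc", ["./input/CIB_web.jpg"])
print(args)

# Uncomment to execute; this needs cibrsh.exe and a valid license.
# subprocess.run(args, check=True)
```

Passing the arguments as a list avoids quoting problems with values that contain spaces, such as the company name.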

4. Use case: Calling CIB ocr via CIB pdf toolbox

In CIB pdf toolbox it is possible to pass properties directly through to CIB ocr. Further details can be found in the CIB pdf toolbox technical manual.

This example uses CIB ocr to extract text from a PDF file that is not readable, i.e. the text is represented by a picture. To pass through CIB ocr properties in this CIB pdf toolbox call, “CibOcr” is added as a prefix to the CIB ocr properties described in this document.

Again a valid license has to be set. Then the property “OutputFormat” is set to specify that a searchable PDF should be generated. It is also specified that in this case the image in the input PDF should be removed and the extracted text from that image shown instead in the output PDF (properties “FormatSearchablePdfRemoveImage” and “FormatSearchablePdfShowText”).

The language of the text in the PDF is set accordingly (“CibOcrOCRLanguage”) and the path to the necessary dictionary files is set (“CibOcrDataFolder”). Also, a preprocessor is selected to improve the ocr result.

Then CIB pdf toolbox is called via “–fj” (join) and the input and output files are specified.

CibRsh.exe LicenseCompany="Example Company" LicenseKey="xxxx-xxxx-xxxxxxxx"
OutputFormat=FormatSearchablePdf FormatSearchablePdfShowText=1
FormatSearchablePdfRemoveImages=1 CibOcrDataFolder="./tessdata" CibOcrOCRLanguage="eng"
CibOcrPreprocess=SauvolaThresholding –fj "./input/TextAsPicture.pdf"
"./output/TextAsPicture.out.pdf"
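
The prefix rule described above (CIB ocr property names get the prefix “CibOcr” when they are passed through CIB pdf toolbox) can be applied mechanically. The following Python sketch illustrates this; the helper name prefix_ocr_properties is hypothetical, and the property values are the ones from the call above.

```python
def prefix_ocr_properties(ocr_properties, prefix="CibOcr"):
    """Prefix every CIB ocr property name for use in a CIB pdf toolbox call."""
    return {prefix + name: value for name, value in ocr_properties.items()}


ocr_properties = {
    "DataFolder": "./tessdata",
    "OCRLanguage": "eng",
    "Preprocess": "SauvolaThresholding",
}
toolbox_properties = prefix_ocr_properties(ocr_properties)
for name, value in toolbox_properties.items():
    print(f"{name}={value}")
```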

5. Use case: Calling CIB ocr via CIB job/CIB documentServer

(From CIB ocr version 2.3.0 and CIB job version 1.8.0)

To use CIB ocr via CIB job or CIB documentServer, the corresponding XML has to be defined. This XML can be used for a CIB documentServer request. The following XML example uses CIB ocr to extract text from an image file using the Tesseract library.

In the load step, the input file is loaded into memory.
In the ocr step, a valid license is set first; “Tesseract” is the OCR library to be used and OutputFormat is “FormatHocr”. The text to be extracted is in German (OCRLanguage=deu).

In the save step, the in-memory output is written to a file.

Example:

<?xml version="1.0" encoding="ISO-8859-1" ?>  
<root> 
   <Comod> 
      <defaults> 
         <properties command="job"> 
            <property name="OutputMode">XML</property> 
            <property name="UseInMemoryProcessing">1</property> 
         </properties> 
      </defaults> 
      <jobs> 
         <job name="tesseract" expected-result-code="404"> 
         <steps> 
            <step name="LoadStep" command="load"> 
            <properties> 
               <property name="InputFilename">./input/input.png</property> 
            </properties> 
            </step> 
            <step name="OcrStep" expected-result-code="1000" command="ocr"> 
            <properties> 
               <property name="LicenseCompany">CustomerLicensee</property> 
               <property name="LicenseKey">4444-cccc-88888888</property> 
               <property name="OCRLibraryName">Tesseract</property> 
               <property name="DataFolder">.</property> 
               <property name="OutputFormat">FormatHocr</property> 
               <property name="TraceFilename">OCR_trace.log</property> 
               <property name="OCRLanguage">deu</property> 
               <property name="TracePreprocessOutput">0</property> 
            </properties> 
            </step> 
            <step name="SaveStep" expected-result-code="0" command="save"> 
            <properties> 
               <property name="OutputFilename">./ocr_out.html</property> 
            </properties> 
            </step> 
         </steps> 
         </job> 
      </jobs> 
   </Comod> 
</root>
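
If many jobs of this kind have to be created, the XML can be generated instead of written by hand. The following Python sketch rebuilds the structure of the example above with the standard library; the function names add_step and build_ocr_job are hypothetical, and all element names, attributes and property values are copied from the example, not from a schema.

```python
import xml.etree.ElementTree as ET


def add_step(steps, name, command, properties, expected_result_code=None):
    """Append one <step> element with its <properties> to <steps>."""
    attributes = {"name": name}
    if expected_result_code is not None:
        attributes["expected-result-code"] = str(expected_result_code)
    attributes["command"] = command
    step = ET.SubElement(steps, "step", attributes)
    props = ET.SubElement(step, "properties")
    for prop_name, value in properties.items():
        ET.SubElement(props, "property", {"name": prop_name}).text = value
    return step


def build_ocr_job(input_file, output_file, ocr_properties):
    """Build the load / ocr / save job shown in the example above."""
    root = ET.Element("root")
    comod = ET.SubElement(root, "Comod")
    defaults = ET.SubElement(comod, "defaults")
    job_props = ET.SubElement(defaults, "properties", {"command": "job"})
    for name, value in {"OutputMode": "XML", "UseInMemoryProcessing": "1"}.items():
        ET.SubElement(job_props, "property", {"name": name}).text = value
    jobs = ET.SubElement(comod, "jobs")
    job = ET.SubElement(jobs, "job", {"name": "tesseract", "expected-result-code": "404"})
    steps = ET.SubElement(job, "steps")
    add_step(steps, "LoadStep", "load", {"InputFilename": input_file})
    add_step(steps, "OcrStep", "ocr", ocr_properties, expected_result_code=1000)
    add_step(steps, "SaveStep", "save", {"OutputFilename": output_file}, expected_result_code=0)
    return root


ocr_properties = {
    "LicenseCompany": "CustomerLicensee",
    "LicenseKey": "4444-cccc-88888888",
    "OCRLibraryName": "Tesseract",
    "DataFolder": ".",
    "OutputFormat": "FormatHocr",
    "TraceFilename": "OCR_trace.log",
    "OCRLanguage": "deu",
    "TracePreprocessOutput": "0",
}
root = build_ocr_job("./input/input.png", "./ocr_out.html", ocr_properties)
# Serialize with the same encoding as the example file.
xml_bytes = ET.tostring(root, encoding="ISO-8859-1")
print(xml_bytes.decode("ISO-8859-1"))
```

The resulting bytes can be written to job.xml and passed to CIB runshell as shown in chapter 1.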

6. General properties

In this chapter general properties of CIB ocr are described.

− NEW: add new property "PageSelection" RELEASE 2.3.0

− property “TraceRecognitionOutput” has been added; REL 2.1.0

6.1. Config

Property-Name	Data-Type	Type
Config	String	Set

This property specifies the name of the config file (a text file) that contains parameters for CIB ocr. Parameters specified in the config file overwrite those set directly in the job. Using several config files (containing different configurations), which would be processed one after the other in the order given in the property, is not supported yet.

Syntax

Config= <onefilename>

Example

Config=C:\Test\config.txt

6.2. Recognize

Property-Name	Data-Type	Type
Recognize	String	Set

This property defines for which purpose CIB ocr is run. Currently there are two options: text recognition and barcode recognition. By default, this is set to text recognition.

Syntax

Recognize=<Value>
<Value>:BarcodeRecognizer | OcrRecognizerWithDeeper | OcrRecognizer | WordRecognizer

default= OcrRecognizer

Example

Recognize=BarcodeRecognizer

6.3. DisableRecognition

Property-Name	Data-Type	Type
DisableRecognition	String	Set

When this property is set, text/barcode recognition is switched off; the user should then switch on TracePreprocessOutput. Instead of the defined preprocessors, all preprocessors are applied one by one to the initial image.

For each preprocessor an imagefile is written.

Syntax

DisableRecognition=<value>
<value>: 0 | 1

default=0

Example

DisableRecognition=1
TracePreprocessOutput=1

That means:
No recognition is done,
TracePreprocessOutput=1 is set,
for each preprocessor an imagefile is created, e.g. “image1_preprocessor_MedianBlurAT.png”.

6.4. DpiX

Property-Name	Data-Type	Type
DpiX	String	Set

This property sets the DPI value of the input image in X direction as an integer. The value is needed for xfdf.
DpiX can be used when the property RegionTemplate is set.

Syntax

DpiX=<value>
<value>: 1 to N

default=300

Example

DpiX=200

6.5. DpiY

Property-Name	Data-Type	Type
DpiY	String	Set

This property sets the DPI value of the input image in Y direction as an integer. The value is needed for xfdf.
DpiY can be used when the property RegionTemplate is set.

Syntax

DpiY=<value>
<value>: 1 to N

default=300

Example

DpiY=200

6.6. InputFilename

Property-Name	Data-Type	Type
InputFilename	String	Set/Get

This property specifies the name of the input file.
This information is optional, memory input can be used alternatively (via property InputMemoryAddress).

The following input-formats are supported:

bmp image;
tiff image (includes multipage tiff);
jpeg image;
png image;
xml (serialized cv::Mat);
yml (serialized cv::Mat)

Syntax

InputFilename=<name>
<name>: name.ext

Example

InputFilename=Rechnung.tiff

6.7. InputMemoryAddress

Property-Name	Data-Type	Type
InputMemoryAddress	String	Set

This property specifies the memory address of the input data.
This information is optional; an input file can be used alternatively (via property InputFilename).

Syntax

<InputMemoryAddress> ::= <Integer> | ";" <Integer>

Example

InputMemoryAddress=1034924448#4096#1034924448#1292

6.8. LicenseCompany

Property-Name	Data-Type	Type
LicenseCompany	String	Set

This property sets the licensee from the license information. It is used in connection with LicenseKey. This property has to be set in a CIB ocr call to activate its functions.

Syntax

LicenseCompany=CustomerLicensee

6.9. LicenseKey

Property-Name	Data-Type	Type
LicenseKey	String	Set

This property sets the license key from the license information. It is used in connection with LicenseCompany. This property has to be set in a CIB ocr call to activate its functions.

Syntax

LicenseKey=xxxx-xxxx-xxxxxxxx

Default:
If no license information is set, a test license is used.

Example

LicenseCompany=CIB software GmbH 
LicenseKey=4444-cccc-88888888

6.10. OCRLanguage

Property-Name	Data-Type	Type
OCRLanguage	String	Set

This property specifies the language used in the input document (ISO 639-2 code, three letters).

Syntax

OCRLanguage=<language>
<language>: deu | eng | rus

default=deu

Example

OCRLanguage=eng

6.11. Preprocess

General
Preprocessor details
     ▸ Thresholding Methods
     ▸ Median filters
     ▸ BilateralFilter
     ▸ Thinning Algorithms
     ▸ DeSkew
     ▸ Invert / AutoInvert
     ▸ AutoRotate
Composite Algorithms

General

Property-Name	Data-Type	Type
Preprocess	String	Set

This property specifies the methods used to preprocess the input file in order to get a better OCR result.

Syntax

Preprocess=<preprocessnames>
<preprocessnames>: <preprocessor> | <preprocessor> "+" <preprocessnames>
<preprocessor>: NativeAdaptiveThresholding | PureMedianBlur | PureAdaptiveThresholding |
PureAdaptiveGaussianThresholding | MedianBlurAGT | MedianBlurAT | MedianBlurGAT |
SauvolaThresholding | NiblackThresholding | WolfJolionThresholding | NickThresholding |
FengThresholding | OtsuThresholding | BilateralFilter | ThinningZhangSuen |
ThinningGuoHall | DeSkew | Invert | AutoInvert | AutoRotate | Composite

No default

Example

Preprocess = NativeAdaptiveThresholding

Preprocessor details

Thresholding is the simplest method of image segmentation. From a grayscale image, thresholding can be used to create binary images.
The simplest thresholding methods replace each pixel in an image with a black pixel if the image intensity is less than some fixed constant T, or a white pixel if the image intensity is greater than that constant.
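
The fixed-threshold rule just described can be written in a few lines. The following pure-Python sketch is an illustration only and not the implementation used by CIB ocr; the function name global_threshold and the treatment of pixels exactly equal to T (mapped to white) are assumptions.

```python
def global_threshold(image, t):
    """Binarize a grayscale image given as a list of rows of 0..255 values."""
    # Pixels darker than t become black (0), all others white (255).
    return [[0 if pixel < t else 255 for pixel in row] for row in image]


image = [
    [12, 200, 180],
    [90, 130, 30],
]
binary = global_threshold(image, 128)
for row in binary:
    print(row)
```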

Adaptive Thresholding

Using a global value as threshold value may not be good in all conditions where an image has different lighting conditions in different areas. In that case, we go for adaptive thresholding. Adaptive thresholding means that the algorithm calculates the threshold for small regions of the image. Thus we get different thresholds for different regions of the same image and this gives us better results for images with varying illumination.

It has three ‘special’ input parameters and only one output argument.

Adaptive Method - It decides how the thresholding value is calculated.

cv2.ADAPTIVE_THRESH_MEAN_C : threshold value is the mean of the neighborhood area.
cv2.ADAPTIVE_THRESH_GAUSSIAN_C : threshold value is the weighted sum of neighborhood values where weights are a Gaussian window.

Block Size - It decides the size of the neighborhood area.

C - It is just a constant which is subtracted from the mean or weighted mean calculated.
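
How the three parameters interact can be seen in a small pure-Python sketch of the mean variant. It is an illustration of the technique only: the function name adaptive_mean_threshold is hypothetical, and the neighborhood is simply clipped at the image border, which is an assumption of this sketch (libraries such as OpenCV pad the border instead).

```python
def adaptive_mean_threshold(image, block_size, c):
    """Binarize with a per-pixel threshold: mean of the surrounding block minus c."""
    if block_size % 2 == 0 or block_size < 3:
        raise ValueError("block_size must be an odd number >= 3")
    height, width = len(image), len(image[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image border.
            block = [
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            threshold = sum(block) / len(block) - c
            row.append(255 if image[y][x] > threshold else 0)
        result.append(row)
    return result


# Left half is dimly lit, right half brightly lit; each half has one darker pixel.
image = [
    [50, 50, 50, 200, 200, 200],
    [50, 20, 50, 200, 120, 200],
    [50, 50, 50, 200, 200, 200],
]
result = adaptive_mean_threshold(image, block_size=3, c=5)
for row in result:
    print(row)
```

Both darker pixels (values 20 and 120) come out black although they differ by 100 gray levels, because each is compared with its own neighborhood. Pixels directly at the brightness step are also affected, which is why the block size has to be chosen to fit the image.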

NativeAdaptiveThresholding

This is a multi-step filter which consists of the following steps using the OpenCV library:

cv::medianBlur()
cv::adaptiveThreshold() using CV_ADAPTIVE_THRESH_MEAN_C threshold type
cv::bilateralFilter()
The result is a grayscale image.

PureAdaptiveThresholding

While the conventional thresholding operator uses a global threshold for all pixels, adaptive thresholding changes the threshold dynamically over the image. This more sophisticated version of thresholding can accommodate changing lighting conditions in the image , e.g. those occurring as a result of a strong illumination gradient or shadows.

PureAdaptiveThresholding consists of only one step:

cv::adaptiveThreshold() using CV_ADAPTIVE_THRESH_MEAN_C threshold type
The result is a binary image.

Alternative: PureAdaptiveGaussianThresholding

Thresholding based on standard deviation

The methods described in the following sections (FengThresholding, SauvolaThresholding, NiblackThresholding, WolfJolionThresholding, NickThresholding) differ only in the final formula for the threshold value of a particular pixel, but use the same matrices of local standard deviations.

SauvolaThresholding

The basic idea behind Sauvola is that if there is a lot of local contrast, the threshold should be chosen close to the mean value, whereas if there is very little contrast, the threshold should be chosen below the mean, by an amount proportional to the normalized local standard deviation.

NiblackThresholding

Niblack’s method can be considered the first local threshold method. It has the advantage of detecting the text, but it introduces a lot of background noise. Sauvola and Pietikäinen modified the Niblack threshold to decrease the background noise, but the text detection rate is also decreased while bleed-through still remains in most cases.
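
The difference between the two methods is easiest to see in their per-pixel threshold formulas. The following Python sketch uses the formulas as they are commonly given in the literature (Niblack: T = m + k*s; Sauvola: T = m*(1 + k*(s/R - 1)), with local mean m and local standard deviation s). The default values for k and R are typical textbook choices and an assumption here; they are not taken from CIB ocr.

```python
def niblack_threshold(mean, std, k=-0.2):
    """Niblack: T = m + k * s."""
    return mean + k * std


def sauvola_threshold(mean, std, k=0.5, r=128.0):
    """Sauvola: T = m * (1 + k * (s / R - 1)), R = assumed maximum of s."""
    return mean * (1.0 + k * (std / r - 1.0))


# Nearly uniform background window: bright, very little contrast.
print(niblack_threshold(200, 2))   # stays close to the mean
print(sauvola_threshold(200, 2))   # far below the mean

# High-contrast window (text on background).
print(sauvola_threshold(120, 128))  # equal to the mean when s == R
```

In the low-contrast window the Niblack threshold stays right next to the mean, so slight background variations are split into black and white (the background noise mentioned above), while the Sauvola threshold drops far below the mean and leaves the background white.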

WolfJolionThresholding

In particular, the WolfJolion preprocessor achieves the best recognition quality for most colored images, as well as for images with background noise and anti-aliased fonts.

NickThresholding

Nick's binarization derives its thresholding formula from the basic Niblack algorithm, the parent of many local image thresholding methods. The major advantage of Nick's method over Niblack is that it considerably improves binarization for "white" and light page images by shifting down the binarization threshold.

FengThresholding

The Feng thresholding method is interesting because it can qualitatively outperform the Sauvola thresholding method. However, the Feng method contains many parameters which have to be set. Hence this method was never widely accepted.

OtsuThresholding

Considering a bimodal image (a bimodal image is an image whose histogram has two peaks) we can approximately take a value in the middle of those peaks as threshold value. That is what Otsu binarization does. So it automatically calculates a threshold value from an image’s histogram for a bimodal image. (For images which are not bimodal, binarization won’t be accurate.)
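
The following pure-Python sketch shows how such a threshold can be derived from the histogram. It uses the usual formulation of Otsu's method (choose the threshold that maximizes the between-class variance); the function name otsu_threshold is hypothetical and the code is an illustration, not the CIB ocr implementation.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes the between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    total_sum = sum(i * count for i, count in enumerate(histogram))

    best_t, best_variance = 0, -1.0
    weight0, sum0 = 0, 0
    for t in range(256):
        weight0 += histogram[t]
        sum0 += t * histogram[t]
        weight1 = total - weight0
        if weight0 == 0 or weight1 == 0:
            continue  # one class is empty, no valid split
        mean0 = sum0 / weight0
        mean1 = (total_sum - sum0) / weight1
        variance = weight0 * weight1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Bimodal sample: dark text pixels around 10, bright background around 200.
pixels = [10] * 50 + [12] * 10 + [198] * 10 + [200] * 40
t = otsu_threshold(pixels)
print(t)
```

For the sample data the returned threshold lies between the two peaks, so all dark pixels fall into one class and all bright pixels into the other.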

Median filters

A median filter is an example of a non-linear filter and, if properly designed, is very good at preserving image detail. Running a median filter:

considers each pixel in the image
sorts the neighboring pixels into order based upon their intensities,
replaces the original value of the pixel by the median value from the list.
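
The three steps above can be sketched in pure Python as follows. This is an illustration only; the function name median_filter is hypothetical, and clipping the neighborhood at the image border (taking the upper middle element when the count is even) is an assumption of this sketch.

```python
def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    height, width = len(image), len(image[0])
    half = size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Sort the neighboring intensities (neighborhood clipped at the border).
            neighbors = sorted(
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            )
            # Take the middle element of the sorted list.
            row.append(neighbors[len(neighbors) // 2])
        result.append(row)
    return result


# A single white "salt" pixel on a dark background.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
print(median_filter(noisy))
```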

A median filter is one kind of rank-conditioned rank-selection (RCRS) filter. A milder member of that family, for example one that selects the closest of the neighboring values when a pixel's value is external in its neighborhood and leaves it unchanged otherwise, is sometimes preferred, especially in photographic applications.

Median and other RCRS filters are good at removing salt-and-pepper noise from an image and cause relatively little blurring of edges, and hence are often used in computer vision applications.

Disadvantage: the rest of the image becomes blurred, which impairs the borders of characters and consequently the recognition accuracy.

At the same time (and rather unexpectedly), the best choice for “recipes” and images with “curved” or “complex in general” text is the MedianBlurGAT preprocessor.

Used filters:

PureMedianBlur

The following filters additionally apply thresholding:

MedianBlurAGT
MedianBlurAT
MedianBlurGAT

BilateralFilter

A bilateral filter is a non-linear, edge-preserving and noise-reducing smoothing filter for images. The intensity value at each pixel in an image is replaced by a weighted average of intensity values from nearby pixels. This weight can be based on a Gaussian distribution. Crucially, the weights depend not only on the Euclidean distance of pixels, but also on the radiometric differences (e.g. range differences, such as color intensity, depth distance, etc.). This preserves sharp edges by systematically looping through each pixel and adjusting weights to the adjacent pixels accordingly.

It is normally used for non-text images or after thresholding.

Thinning Algorithms

Thinning is applied to binary images to reduce a black-and-white area to a skeleton, e.g. one pixel wide.

A fast parallel thinning algorithm consists of two iteration loops:
One is aimed at deleting the south-east boundary points and the north-west corner points, while the other one is aimed at deleting the north-west boundary points and the south-east corner points. End points and pixel connectivity are preserved. Each pattern is thinned down to a "skeleton" of unitary thickness. Experimental results show that this method is very effective.

Used algorithms:

ThinningZhangSuen
ThinningGuoHall

DeSkew

Deskewing an image can help a lot if you want to do barcode detection or just improve the readability of scanned images. In photos of goods with a barcode, for example, the skew angle is often too high, so the barcode cannot be detected. After deskewing, the barcode can be read.

If an image is a logo, a good choice is DeSkew+AutoInvert and any of the preprocessors Feng, Nick, Sauvola or WolfJolion.
For invoices a suggestion is DeSkew and Sauvola or WolfJolion.

Invert / AutoInvert

Both filters are suitable for images containing more black than white color.

Application of “Invert” changes black to white and vice versa.

The filter “AutoInvert” first checks whether the page really contains more black than white.

Good results are obtained if “Invert” (or “AutoInvert”) is used together with “BilateralFilter” and “DeSkew”.
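
The difference between the two filters can be sketched in pure Python as follows. The function names and the rule used to count a pixel as black (gray value below 128) are assumptions of this illustration, not the criteria used by CIB ocr.

```python
def invert(image):
    """Swap black and white: every gray value v becomes 255 - v."""
    return [[255 - pixel for pixel in row] for row in image]


def auto_invert(image, black_below=128):
    """Invert only if the image contains more black than white pixels."""
    pixels = [pixel for row in image for pixel in row]
    black = sum(1 for pixel in pixels if pixel < black_below)
    white = len(pixels) - black
    if black > white:
        return invert(image)
    return image


mostly_black = [[0, 0, 0], [0, 255, 0]]
mostly_white = [[255, 255], [0, 255]]
print(auto_invert(mostly_black))  # inverted
print(auto_invert(mostly_white))  # returned unchanged
```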

AutoRotate

This preprocessor detects an image rotation of 90/180/270 degrees using an artificial intelligence algorithm and rotates the image back before the text recognition process. The following preprocessor setting detects the image rotation, corrects it and then de-skews the resulting image before text recognition:

Example

Preprocess = AutoRotate+DeSkew

To use this algorithm, an additional property has to be set: AutoRotateModel. This property should point to a TensorFlow-based model file that was trained to detect image rotation.

Composite Algorithms

(From CIB ocr version 2.3.0)

CIB ocr can use complex algorithms for image preprocessing. To use them, the preprocessor "Composite" has to be selected. This feature is based on CIB image toolbox functionality. Each preprocessing algorithm has to be described in XML format (details are available in the CIB image toolbox documentation).

Example CIB runshell:

cibrsh.exe –oc Preprocess=Composite AlgorithmsSetName=AlgorithmsSet_sample.xml
AlgorithmName=SepaTextExtraction AlgorithmProfile=processing_profile.xml
IPLTraceFilename=OCR_IPL.log

Example CIB Job/CIB DocumentServer

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<root> 
   <Comod> 
      <defaults> 
         <properties command="job"> 
            <property name="OutputMode">XML</property>
            <property name="UseInMemoryProcessing">1</property>                     
        </properties> 
      </defaults> 
      <jobs> 
         <job name="tesseract" expected-result-code="404"> 
         <steps> 
            <step name="LoadStep" command="load">
            <properties> 
               <property name="InputFilename">./input/input.png</property>
            </properties> 
            </step> 
            <step name="OcrStep" expected-result-code="1000" command="ocr"> 
            <properties> 
               <property name="LicenseCompany">CustomerLicensee</property> 
               <property name="LicenseKey">4444-cccc-88888888</property>
               <property name="OCRLibraryName">Tesseract</property>
               <property name="DataFolder">.</property> 
               <property name="OutputFormat">FormatHocr</property>
               <property name="TraceFilename">OCR_trace.log</property>
               <property name="OCRLanguage">deu</property>
               <property name="TracePreprocessOutput">1</property> 
               <property name="Preprocess">Composite</property>
               <property name="AlgorithmsSetName">AlgorithmsSet_sample.xml</property>
               <property name="AlgorithmName">SepaTextExtraction</property>
               <property name="AlgorithmProfile">processing_profile.xml</property>
               <property name="IPLTraceFilename">OCR_IPL.log</property>
            </properties> 
            </step> 
            <step name="SaveStep" expected-result-code="0" command="save">
            <properties> 
               <property name="OutputFilename">./ocr_out.html</property>
            </properties> 
            </step> 
         </steps> 
         </job> 
      </jobs> 
   </Comod> 
</root>

6.12. TracePreprocessOutput

Property-Name	Data-Type	Type
TracePreprocessOutput	String	Set

When this property is set, the preprocessed image is written to a file.

Syntax

TracePreprocessOutput=<value>
<value>: 0 | 1

default=0

Example

TracePreprocessOutput=1

File created e.g. “image1_preprocessor_MedianBlurAT.png”.

7. Properties Text-Recognition

DataFolder
OCRConfigs
OCRRegion
PaddingHorizontal
PaddingVertical
OutputFilename
OutputFormat
OutputText
OutputTextLength
OutputType
RegionTemplate

DataFolder

Property-Name	Data-Type	Type
DataFolder	String	Set

This property specifies a path to the Tesseract language package „tessdata“.

If the property is empty, it is assumed the folder „tessdata“ is located in the currently used folder.

Syntax

DataFolder=<path>

default=No input
(current folder is used)

Example

DataFolder=C:\Test\Invoice

OCRConfigs

(From CIB ocr version 2.3.1)

Property-Name	Data-Type	Type
OCRConfigs	String	Set

Names of Tesseract config files.

All Tesseract config files should be located in $(TESSDATA_PREFIX)\tessdata\configs\

Syntax

OCRConfigs=config_name1[;config_name2[;config_name3...]]

Examples

OCRConfigs=hocr

or

OCRConfigs=hocr;debug

OCRRegion

Property-Name	Data-Type	Type
OCRRegion	String	Set

This property specifies a rectangle on a page.
This rectangle defines the scan area from which text is extracted. All characters outside of this scan area are ignored.
A rectangle is defined by two basic points (left,top and right,bottom).
Point of origin is the top-left corner of the page, the unit is mm.

Syntax

OCRRegion=<onerectangle> 
<onerectangle>: <left> ";" <top> ";" <right> ";" <bottom>

default=No input

The whole page is scanned if no input is set or if the rectangles given by the coordinates are empty.

Example

OCRRegion=5;5;15;20
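
Because OCRRegion is given in mm while images are measured in pixels, it can be helpful to see which pixel rectangle a region corresponds to at a given resolution. The following Python sketch does this conversion with the standard relation 1 inch = 25.4 mm. The function name region_mm_to_pixels is hypothetical, and using the defaults of DpiX and DpiY (300) for this purpose is an assumption; the manual does not state how CIB ocr converts the values internally.

```python
MM_PER_INCH = 25.4


def region_mm_to_pixels(region, dpi_x=300, dpi_y=300):
    """Convert "left;top;right;bottom" in mm to rounded pixel coordinates."""
    left, top, right, bottom = (float(value) for value in region.split(";"))
    return (
        round(left / MM_PER_INCH * dpi_x),
        round(top / MM_PER_INCH * dpi_y),
        round(right / MM_PER_INCH * dpi_x),
        round(bottom / MM_PER_INCH * dpi_y),
    )


print(region_mm_to_pixels("5;5;15;20"))
```

At 300 DPI the region from the example above covers roughly 118 x 177 pixels, which is quite small for a line of text.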

PaddingHorizontal

Property-Name	Data-Type	Type
PaddingHorizontal	String	Set

This property adds a horizontal padding to the rectangle determined by “OCRRegion” on a page.
The main use case is text recognition with deepER on a single line. This property extends the OCRRegion further in horizontal direction so that context information in the image does not get lost.

The unit is %. For example, with an OCRRegion width of 100 pixels and PaddingHorizontal=10, the region is extended by 10 pixels on the left and 10 pixels on the right. The center of the padded rectangle stays the same as the center of the OCRRegion.

Syntax

PaddingHorizontal =<integer_value>

default=0

Example

PaddingHorizontal = 10

PaddingVertical

Property-Name	Data-Type	Type
PaddingVertical	String	Set

This property adds a vertical padding to the rectangle determined by “OCRRegion” on a page.
The main use case is text recognition with deepER on a single line. This property extends the OCRRegion further in vertical direction so that context information in the image does not get lost.

The unit is %. For example, with an OCRRegion height of 100 pixels and PaddingVertical=10, the region is extended by 10 pixels at the top and 10 pixels at the bottom. The center of the padded rectangle stays the same as the center of the OCRRegion.

Syntax

PaddingVertical=<integer_value>

default=0

Example

PaddingVertical=10
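
The effect of both padding properties on a region can be sketched as follows. The Python function pad_region is hypothetical and follows the examples given above (the percentage refers to the region's own width or height and is added on each side); clipping at the page border is not considered in this sketch.

```python
def pad_region(left, top, right, bottom, padding_horizontal=0, padding_vertical=0):
    """Extend a rectangle by a percentage of its width/height on each side."""
    pad_x = (right - left) * padding_horizontal / 100.0
    pad_y = (bottom - top) * padding_vertical / 100.0
    return (left - pad_x, top - pad_y, right + pad_x, bottom + pad_y)


# Region of 100 x 100 with 10 % padding in both directions.
padded = pad_region(50, 50, 150, 150, padding_horizontal=10, padding_vertical=10)
print(padded)
```

The padded rectangle is 120 x 120 and keeps the center of the original region.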

OutputFilename

Property-Name	Data-Type	Type
OutputFilename	String	Set

This property specifies the name of the outputfile.
The property OutputFilename is optional. If it is empty, OutputTextLength and OutputText are used.
The format/extension is described in the next property OutputFormat.

Syntax

OutputFilename=<name> 
<name>: name.ext

default=No input, use of OutputTextLength and OutputText.

Example

OutputFilename=Rechnung.txt

OutputFormat

Property-Name	Data-Type	Type
OutputFormat	String	Set

This property defines the format of the created outputfile.

Syntax

OutputFormat=<format> 
<format>: FormatText | FormatHocr | FormatHocrText

default=FormatHocr

Example

OutputFormat=FormatHocr

OutputText

Property-Name	Data-Type	Type
OutputText	String	Get

This property contains the result of text-recognition.
If used, it is also required to define the size of the output buffer with the property OutputTextLength.

Syntax of output:

[textstring]

Example

Das ist der gelesene Text.

OutputTextLength

Property-Name	Data-Type	Type
OutputTextLength	String	Get

This property contains the length of the output result and specifies the required size of the output buffer.

Syntax of output:

[integer] (string representation)

Example

OutputType

Property-Name	Data-Type	Type
OutputType	String	Set

This property defines whether output should be in memory or to a file.
This property is automatically set depending on whether OutputFilename is set or not. If OutputFilename is set, then OutputType=File is automatically set, otherwise OutputType=Memory is set.

Syntax

OutputType=<type> 
<type>: File | Memory

default=File

Example

OutputType=File

RegionTemplate

Property-Name	Data-Type	Type
RegionTemplate	String	Set

The property RegionTemplate contains the name of the xfdf-file, where the OCRRegions are defined.

Syntax

RegionTemplate=<filename.xfdf>

default=No input

Example

RegionTemplate=region.xfdf

8. Properties Text-Recognition with deepER

Instead of using Tesseract for OCR, it is also possible to choose text recognition with deepER. In this case the OCR is performed on a server: a RESTful service runs on the server, and the client uses libcurl to send the request.

DataFolder
InputFilename
OutputFilename
Recognize
DeeperURL
DeeperAuthentication
DeeperImageFormat
OcrGrayScaleConversion

InputFilename

This Property is mandatory.

It is not yet possible to use In-Memory-Processing for the input.

Property-Name	Data-Type	Type
InputFilename	String	Set/Get

This property specifies the name of the input file.
The following input-formats are supported:

bmp image;
tiff image (includes multipage tiff);
jpeg image;
png image;

OutputFilename

Property-Name	Data-Type	Type
OutputFilename	String	Set

This property specifies the name of the output file.
The output format is fixed to hOCR.

Syntax

OutputFilename=<name> 
<name>: name.ext

Example

OutputFilename=Rechnung.html

Recognize

Property-Name	Data-Type	Type
Recognize	String	Set

Syntax

Recognize=<Value>
<Value>: OcrRecognizerWithDeeper

default=OcrRecognizer

Example

Recognize= OcrRecognizerWithDeeper

DeeperURL

Property-Name	Data-Type	Type
DeeperUrl	String	Set

Syntax

DEEPERURL=<Value>
<Value>: http://localhost:5000

default= http://localhost:5000

Example

DeeperUrl = http://graphix:5000

DeeperAuthentication

Property-Name	Data-Type	Type
DeeperAuthentication	String	Set

Syntax

DeeperAuthentication=<Value>
<Value>: User:password

default= “”

Example

DeeperAuthentication = Franz:TopSecret

DeeperImageFormat

Property-Name	Data-Type	Type
DeeperImageFormat	String	Set

Possible values: JPG (JPEG) / PNG / Smallest. The default value (if not set) is PNG.

CIB ocr converts the input image into the requested format before sending it to the deepER server.

If DeeperImageFormat is set to Smallest, CIB ocr converts the input image into both PNG and JPG, and the smaller representation is sent to the deepER server for recognition.
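
The selection rule for Smallest amounts to comparing the sizes of the encoded images. The following Python sketch shows only this rule; the function name pick_smallest is hypothetical and the byte strings are placeholders standing in for real PNG and JPG data, since the actual encoding is done inside CIB ocr.

```python
def pick_smallest(encodings):
    """encodings maps a format name to the encoded image bytes."""
    name = min(encodings, key=lambda fmt: len(encodings[fmt]))
    return name, encodings[name]


# Placeholder data instead of real encoded images.
encodings = {"PNG": b"\x00" * 120, "JPG": b"\x00" * 80}
chosen_format, payload = pick_smallest(encodings)
print(chosen_format, len(payload))
```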

OcrGrayScaleConversion

Property-Name	Data-Type	Type
OcrGrayScaleConversion	String	Set

Syntax

OcrGrayScaleConversion=<Value>
<Value>: 0|1

Default =1

Example

OcrGrayScaleConversion = 0

9. Properties Barcode-Recognition

BarcodeOutputFilename
BarcodeOutputType
BarcodeRegion
BarcodeRemoveChecksum
BarcodeShowPageNumber
BarcodeStopAfter
BarcodeType
BarcodeTimeout
BarcodeValue
BarcodeValueLength
DatamatrixAngleDeviation
DatamatrixShrinkingFactor
DatamatrixScanGap
DatamatrixThreshold
ZBarConfig

BarcodeOutputFilename

Property-Name	Data-Type	Type
BarcodeOutputFilename	String	Set

This property sets the name of the outputfile to which values of all barcodes found in the inputfile are written.
The property BarcodeOutputFilename is optional. If it is empty, BarcodeValueLength and BarcodeValue are used.
For the format/extension, see OutputFormat.

The format of this file is given by “BarcodeOutputType”.

Syntax

BarcodeOutputFilename=<filename> 
<filename>:name.ext

default=No input, BarcodeValueLength and BarcodeValue are used.

Example

BarcodeOutputFilename=barcodes.txt

BarcodeOutputType

Property-Name	Data-Type	Type
BarcodeOutputType	String	Set

This property defines whether the output should be in memory or to file.
This property is automatically set depending on whether BarcodeOutputFilename is set or not. If BarcodeOutputFilename is set, then BarcodeOutputType=File is set, otherwise BarcodeOutputType=Memory.

Syntax

BarcodeOutputType=<type> 
<type>: File | Memory

default=File

Example

BarcodeOutputType=File

BarcodeRegion

With this property, the search for barcodes is restricted to a defined rectangle.

If a barcode is always located in the same area of each page, performance is improved by limiting the search for barcodes to this area.
This area (a rectangle) is defined by two points (left, top and right, bottom).
The origin is the top-left corner of the page; the unit is pixels.

Syntax

BarcodeRegion=<rectangle> 
<rectangle>: <left> ";" <top> ";" <right> ";" <bottom>

default=No input

The whole page is scanned if no value is given or if the rectangle defined by the coordinates is empty.

Example

BarcodeRegion=10.5;50;100.0;200.5
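
The rectangle syntax can be assembled programmatically. The following is a minimal sketch in Python; the helper name barcode_region is hypothetical and not part of CIB ocr, and rejecting an empty rectangle is a design choice of this sketch (CIB ocr itself would scan the whole page in that case).

```python
def barcode_region(left, top, right, bottom):
    """Return a BarcodeRegion property line for a pixel rectangle.

    The origin is the top-left corner of the page, so right must be
    greater than left and bottom greater than top.
    """
    if right <= left or bottom <= top:
        # Assumption of this sketch: treat an empty rectangle as a mistake
        # instead of silently falling back to a full-page scan.
        raise ValueError("empty rectangle")
    coords = ";".join(str(value) for value in (left, top, right, bottom))
    return f"BarcodeRegion={coords}"


print(barcode_region(10.5, 50, 100.0, 200.5))
# prints: BarcodeRegion=10.5;50;100.0;200.5
```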

BarcodeRemoveChecksum

Property-Name	Data-Type	Type
BarcodeRemoveChecksum	String	Set

When this property is set, the checksum is not considered when reading a barcode, i.e. it is cut off from the barcode value.

The property can be used for the barcode types Code39, Code39Extended and Code128.

Syntax

BarcodeRemoveChecksum=<value> 
<value>: 0 | 1

default=0
(Checksum is not cut off).

Example

BarcodeRemoveChecksum=1

BarcodeShowPageNumber

Property-Name	Data-Type	Type
BarcodeShowPageNumber	String	Set

When this property is set, the number of the page on which a barcode was found is added to each barcode in the output of the property “BarcodeValue”.

Syntax

BarcodeShowPageNumber=<value> 
<value>: 0 | 1

default=0
(No output of page-number).

Example

BarcodeShowPageNumber=1 
Afterwards the property “BarcodeValue” contains, for example:
BARCODE:1;DATAMATRIX;00011122233344455566677788;BARCODE:2;DATAMATRIX;899374032904908

BarcodeStopAfter

Property-Name	Data-Type	Type
BarcodeStopAfter	String	Set

For Datamatrix only.

Defines that the search is stopped after the Nth barcode has been retrieved.

Syntax

BarcodeStopAfter=<value> 
<value>: 1 to N

default=””
If no value is set, the search continues until the end of the input file.

Example

BarcodeStopAfter=5

The search stops after the 5th barcode has been retrieved.

BarcodeType

Property-Name	Data-Type	Type
BarcodeType	String	Set

This property specifies the type of barcode which is to be searched.

Syntax

BarcodeType=<onebarcodetype>

<onebarcodetype>= "DataMatrix" | "Code128" | "Code39" | "Code39Extended" | "QR"

default=No input, CIB ocr searches for all possible barcode-types.

Code128 includes the subtypes:

Code128A
Code128B
Code128C
Code128Auto

Example

BarcodeType=DataMatrix

BarcodeTimeout

Property-Name	Data-Type	Type
BarcodeTimeout	String	Set

For Datamatrix only.

This property specifies the time (in milliseconds) after which CIB ocr stops looking for further barcode candidates in the input file.

Syntax

BarcodeTimeout=<value>
<value>:1 to N

default=The whole input file is processed to find all barcode candidates, regardless of the time it takes.

Example

BarcodeTimeout=50

BarcodeValue

Property-Name	Data-Type	Type
BarcodeValue	String	Get

This property contains a list of all barcodes found in the input image file.
BarcodeValueLength is needed to determine the required size of the output buffer.

Syntax of output:

[BARCODE:[pagenumber;]BarcodeType;TextValue;]

Example

BARCODE:DATAMATRIX;00011122233344455566677788
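
The output syntax can be split back into individual barcodes on the caller's side. The following Python sketch is an illustration only: the function parse_barcode_value is hypothetical, and it assumes that a barcode text never contains the literal "BARCODE:" and never ends with ";", which the syntax above does not guarantee.

```python
def parse_barcode_value(value, with_page_numbers=False):
    """Split a BarcodeValue string into one dict per barcode.

    Set with_page_numbers=True if BarcodeShowPageNumber=1 was used,
    because each record then starts with the page number.
    """
    barcodes = []
    for record in value.split("BARCODE:"):
        record = record.strip().rstrip(";")
        if not record:
            continue  # text before the first "BARCODE:" marker
        if with_page_numbers:
            page, barcode_type, text = record.split(";", 2)
            barcodes.append({"page": int(page), "type": barcode_type, "text": text})
        else:
            barcode_type, text = record.split(";", 1)
            barcodes.append({"page": None, "type": barcode_type, "text": text})
    return barcodes


print(parse_barcode_value("BARCODE:DATAMATRIX;00011122233344455566677788"))
```

With BarcodeShowPageNumber=1, the same function is called with with_page_numbers=True and returns the page number of each barcode as an integer.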

BarcodeValueLength

Property-Name	Data-Type	Type
BarcodeValueLength	String	Get

This property contains the length of the output result and thus gives the required size of output buffer.

Syntax of output:

[integer] (string representation)

Example

DatamatrixAngleDeviation

Property-Name	Data-Type	Type
DatamatrixAngleDeviation	String	Set

For Datamatrix only.

This property gives the allowed non-squareness of corners of rectangles in degrees (0-90).

The size of the allowed deviation depends on the application:

Faxing and flatbed scanning: A low squareness deviation (5-10 degrees) is sufficient, since all right angles in the subject will appear as right angles in the image.
Scanning from a cell phone or webcam: Higher deviations (20-40 degrees) should be set as distortion due to extreme scanning angles may occur. The dmtxread utility allows large deviation values by default.

Syntax

DatamatrixAngleDeviation=<value>
<value>:0 to 90

default=10

Example

DatamatrixAngleDeviation=20

DatamatrixShrinkingFactor

Property-Name	Data-Type	Type
DatamatrixShrinkingFactor	String	Set

For Datamatrix only.

This property sets a factor by which a high-resolution image is shrunk internally.
This can provide a dramatic performance benefit, as the number of pixels per page is reduced. It helps especially when an image has a high resolution but blurry focus.

Syntax

DatamatrixShrinkingFactor=<value>
<value>: 1 to N

Default: 1
(no change of original resolution)

Example

DatamatrixShrinkingFactor=2

This means the resolution is halved.

DatamatrixScanGap

Property-Name	Data-Type	Type
DatamatrixScanGap	String	Set

For Datamatrix only.

This property specifies the size of the gaps in the scan grid pattern (in pixels).

Increasing the gaps (e.g. to 100) can improve performance, but if the grid is too coarse the barcode may not be found at all.

Syntax

DatamatrixScanGap=<value>
<value>: 1 to N

default=1

Example

DatamatrixScanGap=50

DatamatrixThreshold

Property-Name	Data-Type	Type
DatamatrixThreshold	String	Set

For Datamatrix only.

Lowering the threshold can increase the number of features to be scanned, which slows down processing. This may nevertheless be necessary if the image is blurry or has low contrast.
In some cases, lowering the threshold actually improves performance, because a good barcode candidate is found more quickly.

Syntax

DatamatrixThreshold=<value>
<value>: 1 to 100

default=5

Example

DatamatrixThreshold=10

Weak edges below threshold 10 are ignored.

ZBarConfig

(From CIB ocr version 2.4.0)

Property-Name	Data-Type	Type
ZBarConfig	String	Set

Property for tuning of barcode recognition (ZBar functionality).

Syntax

ZBarConfig=config_line1[;config_line2[;config_line3...]]

Example

ZBarConfig=code39.enable

10. Properties Word-Recognition

Recognize
DictionaryPath
InputFormat
WordRecognizerOptions
WordRecognizerResult

Recognize

Property-Name	Data-Type	Type
Recognize	String	Set

In order to use the WordRecognizer this property has to be set to “WordRecognizer”.

Syntax

Recognize=<Value> 
<Value>:BarcodeRecognizer | OcrRecognizer | WordRecognizer

default=OcrRecognizer

Example

Recognize=WordRecognizer

DictionaryPath

Property-Name	Data-Type	Type
DictionaryPath	String	Set

This property sets the path to the hunspell folder. It can also be defined within WordRecognizerOptions, which is the recommended place, as WordRecognizerOptions overrules this property.

The path has to be defined in at least one of the two places: this property or WordRecognizerOptions.

Syntax

DictionaryPath=<Value>

No default value! It has to be set.

Example

DictionaryPath=".\\hunspell"

InputFormat

Property-Name	Data-Type	Type
InputFormat	String	Set

This property specifies the format of the input that is analysed by the WordRecognizer.

Syntax

InputFormat=<Value> 
<Value>: HOCR | UTF8 | UTF16 | Unicode

HOCR: input is an HOCR file, which must be UTF-8 encoded
UTF8: input is plain text in UTF-8 encoding (with or without UTF-8 BOM)
UTF16: input is plain text in UTF-16 encoding. The BOM (FE FF or FF FE) must be present.

Example

InputFormat=UTF8

WordRecognizerOptions

This property is defined as a JSON string and contains all the information the WordRecognizer needs in order to analyse the document.

Property-Name	Data-Type	Type
WordRecognizerOptions	String	Set

This property might look like this. A more detailed explanation for each component can be found below the example:

Example:

{ 
"DictionaryPath": "D:\\PROJEKTE-SVN\\products\\CIB ocr\\trunk\\src-test\\testdata\\hunspell",  
"Dictionaries": {"DE": {"Dictionaries": "de_DE_frami-UTF8", "StopwordFiles": "stopword_german.txt", "DigramScores": "de_digramscores.txt"}}, 
"InputFormat": "UTF8",  
"RecognizedWordsFilename": "recognized.log",  
"RejectedWordsFilename": "rejected.log",  
"StatisticsFilename": "statistics.log",  
"StatisticsOutputFormat": "FormatCsv"}

Explanation of each component:

Component	Value	Note
InputText	<string>	(required if property InputFilename / InputMemoryAddress is empty) text to parse, must be in UTF-8 format
DictionaryPath	<string>	(required) path to the hunspell folder, may be absolute or relative to the working directory
Dictionaries	<dictionary-object>	(required) dictionaries to use, one or more dictionaries for each language (see below)
InputFormat	<string>	Specifies the input format (e.g. “UTF8”)
RecognizedWordsFilename	<string>	(optional) filename for recognized words. Writes the number of occurrences of each recognized word, per language. A word is considered recognized if it is not a stopword and is contained in at least one dictionary for that language. For the language "<GLOBAL>", all stopword lists are ignored, and the word is recognized if it is contained in at least one dictionary (excluding stopword dictionaries). The words are written as <language> <TAB> <count> <TAB> <word>.
RejectedWordsFilename	<string>	(optional) filename for rejected words. Writes the number of occurrences of each rejected word, per language. A word is considered rejected if it is not a stopword and is not contained in any dictionary for that language. For the language "<GLOBAL>", all stopword lists are ignored, and the word is rejected if it is not contained in any dictionary (excluding stopword dictionaries). The words are written as <language> <TAB> <count> <TAB> <word>.
StatisticsFilename	<string>	(optional) filename for the summary of the word recognizer run
StatisticsOutputFormat	<string>	defines the output format for the summary. "FormatText": output is written as tabbed text. "FormatCsv": output is written in CSV format (with ";" as delimiter). "FormatJSON": the property "WordRecognizerResult" is written to the specified file (as JSON string).
StatisticsPerPage	<boolean>	adds page-wise statistics to the WordRecognizer result (if InputFormat is not HOCR, all input text is considered as page 1)
TextAcceptThreshold	<number>	sets "TextAccepted" flag in the result, if the "longer glyph ratio" is at least this value. Only meaningful if LargeWordLimit > 0
SmallWordLimit	<number>	(optional) if > 0, words with at most that many characters are counted in the "SmallWord" group
LargeWordLimit	<number>	(optional) if > 0, words with at least that many characters are counted in the "LargeWord" group
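
Because the property value is a JSON string, it is usually safer to generate it than to write it by hand, in particular because backslashes in Windows paths must be escaped. The following Python sketch shows one way to do this; the helper build_word_recognizer_options is hypothetical, and only the component names are taken from the table above.

```python
import json


def build_word_recognizer_options(dictionary_path, dictionaries, **optional):
    """Serialize the WordRecognizerOptions components to a JSON string."""
    if not dictionary_path or not dictionaries:
        raise ValueError("DictionaryPath and Dictionaries are required")
    options = {"DictionaryPath": dictionary_path, "Dictionaries": dictionaries}
    options.update(optional)  # e.g. InputFormat, StatisticsFilename
    return json.dumps(options)


value = build_word_recognizer_options(
    ".\\hunspell",  # a single backslash in the actual path
    {"DE": {"Dictionaries": "de_DE_frami-UTF8"}},
    InputFormat="UTF8",
    StatisticsFilename="statistics.log",
    StatisticsOutputFormat="FormatCsv",
)
print(value)  # json.dumps writes the backslash as \\ in the JSON text
```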

Component “Dictionaries”:

Specification of <dictionary-object> (as referenced in the table above):

{<language-name>: <language>, ...}

<language-name> = JSON-String: "..." (specifies a language name) 
<language> = JSON-String: "..." (specifies a single dictionary for that language, no stopwords)
<language> = JSON-Array: ["...","..."] (specifies one or more dictionaries for that language, no stopwords)
<language> = JSON-Object: {"Dictionaries": <dictionaries>, "StopwordDictionaries": <stopword-dicts>, "Stopwords": <stopwords>}
<dictionaries> = JSON-String: "..." (specifies a single dictionary for that language)
<dictionaries> = JSON-Array: ["...","..."] (specifies one or more dictionaries for that language)
<stopword-dicts> = JSON-String: "..." (specifies a single stopword dictionary for that language)
<stopword-dicts> = JSON-Array: ["...","..."] (specifies one or more stopword dictionaries for that language)
<stopwords> = JSON-Array: ["...", "..."] (specifies a list of stopwords (UTF-8 encoded))

Example 1:

"Dictionaries": {"DE": ["de_DE-frami-UTF8", "de_user"], "EN": "en_US-UTF8"}

Example 2:

"Dictionaries": {
"DE": {"Dictionaries": "de_DE-frami-UTF8", "StopwordDictionaries": "de_stopwords", "DigramScores": "de_digramscores.txt"},
"EN": {"Dictionaries": "en_US-UTF8", "Stopwords": ["a", "an", "in"], "DigramScores": "de_digramscores.txt"}
    }
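
The three accepted forms of <language> (string, array, object) describe the same information at different levels of detail. As an illustration, the following Python sketch converts any of them into the object form with lists; the function normalize_language is hypothetical and only mirrors the specification above, it is not how CIB ocr processes the value internally.

```python
def normalize_language(entry):
    """Return the object form of a <language> entry, with lists as values."""
    if isinstance(entry, str):
        return {"Dictionaries": [entry]}       # single dictionary, no stopwords
    if isinstance(entry, list):
        return {"Dictionaries": list(entry)}   # several dictionaries, no stopwords
    if isinstance(entry, dict):
        normalized = dict(entry)
        for key in ("Dictionaries", "StopwordDictionaries"):
            if isinstance(normalized.get(key), str):
                normalized[key] = [normalized[key]]
        return normalized
    raise TypeError("a <language> must be a string, an array or an object")


dictionaries = {"DE": ["de_DE-frami-UTF8", "de_user"], "EN": "en_US-UTF8"}
for name, entry in dictionaries.items():
    print(name, normalize_language(entry))
```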

WordRecognizerResult

Property-Name	Data-Type	Type
WordRecognizerResult	String	Set

This property contains all the output information. It is advisable to set the property to mode=“out”, so that the output-output.xml and the trace file contain all the results of the WordRecognizer (in addition to the statistics files).

In version 2.7, WordRecognizerResult is a JSON object as follows:

{<language-string>: <statistics-object>, ...}

<language-string>: one of the language strings given in the WordRecognizerOptions (for instance ‘EN’ for English)

<statistics-object> = 
{
"SmallWordCount": <number> number of recognized words (excluding stop words), which are small words (according to SmallWordLimit)
"LargeWordCount": <number> number of recognized words (excluding stop words), which are large words (according to LargeWordLimit)
"MainWordCount": <number> number of words which are recognized, and are not stop words
"StopWordCount": <number> number of words which are in the stop word list/dictionary
"RejectedWordCount": <number> number of words which are neither stop words nor in one of the language dictionaries
(for example, most English words are rejected in German dictionaries)
"TotalWordCount": <number> number of words which are recognized (including stop words). Should equal MainWordCount + StopWordCount

"SmallWordCoverage": <number> number of characters over all small words
"LargeWordCoverage": <number> number of characters over all large words
"MainWordCoverage": <number> number of characters over all recognized words (excluding stop words)
"StopWordCoverage": <number> number of characters over all stop words
"RejectedWordCoverage": <number> number of characters over all rejected words
"TotalWordCoverage": <number> number of characters over all recognized words
"MainWordCountPerLength": [<length1>, <count1>, <length2>, <count2>, ...] number of occurrences per word length (counting only recognized words which are not stop words)
"TotalWordCountPerLength": [<length1>, <count1>, <length2>, <count2>, ...] number of occurrences per word length (counting only recognized words, including stop words, but not rejected words)

"GlyphRatioLongWords": <number> (old) percentage of long words found in relation to all words. (Glyphs like "%", "&" etc. are filtered beforehand.)

"LongerGlyphRate": <number> (new) percentage of not-short words found in relation to all words. (Glyphs like "%", "&" are included; a large number of such symbols therefore reduces this value.)

"DigramScoreArithmetic": <number> number in the range [0;9] that indicates the text quality based on digram score tables. There is a score table for each language. For the "All" language, the language with the highest TotalWordCount is chosen.

"FulltextQuality": <number> number in percent that indicates text quality. The formula takes the following values into consideration:

GlyphRatioLongWords, LongerGlyphRate, TotalWordCount, DigramScoreArithmetic
}

Note 1: In addition to the languages specified in WordRecognizerOptions, there is an additional language "<GLOBAL>". This (virtual) language consists of all dictionaries over all languages, excluding all stop word dictionaries and stop word lists. This means that a token (word to check) is considered "recognized" if it is contained in at least one of these dictionaries. Otherwise, it is considered "rejected".
Note 2: The character count counts only the characters of the words passed to the spellchecker. The parser may have eliminated blanks, numbers, punctuation marks, quotes, hyphens and such.

Since version 2.8, WordRecognizerResult is a JSON object as follows:

{"DocumentStatistics": <language-statistics> , "PageStatistics": <page-statistics> }
("PageStatistics" is only present if the "StatisticsPerPage" option is set to true)

<page-statistics> is a JSON object with page numbers as key and <language-statistics> objects as value.
Example: {"1": <language-statistics> , "3": <language-statistics> }
(if the input is HOCR, and a page has no "ppageno" attribute, the page number is "0")

<language-statistics> is a JSON object as follows:
{"AllLanguages": <word-statistics> ,
"Languages": <language-specific-statistics> ,
"TextAccepted": true | false}
(TextAccepted is false if the GlyphRatioLongWords of "AllLanguages" is lower than the TextAcceptedThreshold specified in the WordRecognizerOptions. The TextAccepted flag of the document-global statistics is also set to false if at least one page has a "longer glyph ratio" below the threshold, even if there are enough other pages to get the global ratio above the limit)

<language-specific-statistics> is a JSON object as follows:
{<language-key>: <word-statistics> , ...}
where <language-key> is one of the language keys defined in the WordRecognizerOptions
(note: the statistics for all languages combined is now the value of "AllLanguages". The special language key "<GLOBAL>" is no longer used)

<word-statistics> is the same object as <statistics-object> described above for version 2.7, but with an additional key "GlyphRatioLongWords". The value of this key is defined as largeWordCoverage * 100 / (totalWordCoverage + rejectedWordCoverage), i.e. the ratio of glyphs in long words compared to the total number of checked glyphs (excluding blanks, delimiters, numbers). The value is expressed as an integer (percentage) ranging from 0 to 100.
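
The formula for GlyphRatioLongWords can be reproduced from the coverage values of a <word-statistics> object. The Python sketch below is a worked illustration under two assumptions that the text does not state: the integer result is obtained by truncation, and an input without any checked glyphs yields 0.

```python
def glyph_ratio_long_words(large_word_coverage, total_word_coverage, rejected_word_coverage):
    """largeWordCoverage * 100 / (totalWordCoverage + rejectedWordCoverage)."""
    checked_glyphs = total_word_coverage + rejected_word_coverage
    if checked_glyphs == 0:
        return 0  # assumption: no checked glyphs means a ratio of 0
    # Assumption: the integer percentage is truncated, not rounded.
    return large_word_coverage * 100 // checked_glyphs


# 80 glyphs in large words, 90 in recognized words, 10 in rejected words
print(glyph_ratio_long_words(80, 90, 10))  # prints 80
```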

Since version 2.14, there are two additional keys:
"RawGlyphCount": <number> number of glyphs before parsing (excluding whitespaces but including digits, punctuation marks etc.)
"LongerGlyphRate": calculated as
(TotalWordCoverage - SmallWordCoverage) / RawGlyphCount.
This is the ratio of glyphs in recognized words which are not "small", compared to the number of all glyphs including digits etc. (see above).
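
As a worked example, LongerGlyphRate can be computed from the three values named in the formula. The Python function below is a hypothetical sketch; it returns the plain ratio exactly as the formula is written, and whether CIB ocr additionally scales the result to a percentage is not stated here.

```python
def longer_glyph_rate(total_word_coverage, small_word_coverage, raw_glyph_count):
    """(TotalWordCoverage - SmallWordCoverage) / RawGlyphCount."""
    if raw_glyph_count == 0:
        return 0.0  # assumption: empty input yields a rate of 0
    return (total_word_coverage - small_word_coverage) / raw_glyph_count


# 90 glyphs in recognized words, 10 of them in small words, 100 raw glyphs
print(longer_glyph_rate(90, 10, 100))  # prints 0.8
```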

An example of WordRecognizerResult, with multiple pages and languages, might look like this:

WordRecognizerResult = { 
   "DocumentStatistics": { 
      "AllLanguages": {"GlyphRatioLongWords": 80, ...}, 
      "Languages": { 
         "DE": {"GlyphRatioLongWords": 55, ...}, 
         "EN": {"GlyphRatioLongWords": 33, ...}  
      }, 
      "TextAccepted": false  
}, 
"PageStatistics": { 
   "1": { 
      "AllLanguages": {"GlyphRatioLongWords": 100, ...}, 
      "Languages": { 
         "DE": {"GlyphRatioLongWords": 100, ...}, 
         "EN": {"GlyphRatioLongWords": 16, ...}  
      }, 
      "TextAccepted": true  
   }, 
   "2": { 
      "AllLanguages": {"GlyphRatioLongWords": 55, ...}, 
      "Languages": { 
         "DE": {"GlyphRatioLongWords": 0, ...}, 
         "EN": {"GlyphRatioLongWords": 55, ...}  
      }, 
      "TextAccepted": false  
    }  
  }  
}
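
A result of this shape can be evaluated with any JSON parser. The following Python sketch lists the pages whose TextAccepted flag is false; the function name pages_not_accepted is hypothetical, and the sample input is a shortened version of the example above with the elided keys left out.

```python
import json


def pages_not_accepted(result_json):
    """Return the sorted page numbers whose TextAccepted flag is false."""
    result = json.loads(result_json)
    # "PageStatistics" is only present if StatisticsPerPage is set to true.
    pages = result.get("PageStatistics", {})
    return sorted(int(number) for number, stats in pages.items()
                  if not stats["TextAccepted"])


sample = json.dumps({
    "DocumentStatistics": {
        "AllLanguages": {"GlyphRatioLongWords": 80},
        "Languages": {},
        "TextAccepted": False,
    },
    "PageStatistics": {
        "1": {"AllLanguages": {"GlyphRatioLongWords": 100}, "Languages": {}, "TextAccepted": True},
        "2": {"AllLanguages": {"GlyphRatioLongWords": 55}, "Languages": {}, "TextAccepted": False},
    },
})
print(pages_not_accepted(sample))  # prints [2]
```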

A complete Job XML-Example for Word Recognition might look like this:

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<root> 
   <Comod> 
      <defaults/> 
      <jobs> 
      <job name="TextRecognize"> 
         <properties> 
            <property name="LicenseCompany">Example Company</property>  
            <property name="LicenseKey">xxxx-xxxx-xxxxxxxx</property> 
            <property name="OutputMode">Xml</property> 
         </properties> 
         <steps> 
         <step name="ocr-step" command="ocr"> 
            <properties> 
            <property name="LicenseCompany">CIB Demo</property>  
            <property name="LicenseKey">xxxx-xxxx-xxxxxxxx</property>	 
            <property name="InputFilename">..\templates-txt\wikipedia-Deutschland_DE.txt</property>	 
            <property name="TraceFilename">ocr.txt</property>	 
            <property name="PageSelection">All</property>	 
            <property name="Recognize">WordRecognizer</property>			 
            <property name="WordRecognizerOptions">{
               "DictionaryPath": "..\\hunspell",
               "Dictionaries":{"deu":{"Dictionaries":"de_DE_frami-UTF8","StopwordFiles":"de_stopwords.txt"}},
               "InputFormat":"UTF8",
               "RecognizedWordsFilename":"recognizedWords.txt",
               "StatisticsFilename":"statistics.txt",
               "StatisticsOutputFormat":"FormatJSON"}
            </property>
            </properties> 
         </step> 
         </steps> 
      </job> 
      </jobs> 
   </Comod> 
</root>

11. Technical interface: Native functions

This chapter provides a brief overview of native functions.

CIB ocr job handle
CibOcrJobCreate
CibOcrJobSetProperty
CibOcrJobSetPropertyW
CibOcrJobGetProperty
CibOcrJobGetPropertyW
CibOcrJobGetProgress
CibOcrJobStart
CibOcrJobFree
CibOcrJobCancel
CibOcrGetVersion
CibOcrGetVersionText
CibOcrGetVersionTextW
CibOcrJobGetErrorText
CibOcrJobGetErrorTextW
CibOcrJobGetError

CIB ocr job handle

Every CIB ocr task is assigned to a “job handle” of the type Handle*. This object represents the task. The steps

Setting and reading properties (CibOcrJobSetProperty/CibOcrJobGetProperty)
Executing the task (CibOcrJobStart)
Getting error information (CibOcrJobGetError/CibOcrJobGetErrorText)

always refer to such a job handle.

A CIB ocr task is initiated by creating a job handle (CibOcrJobCreate). After setting the necessary properties and running the task, the job handle is released again (CibOcrJobFree).
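
The order of the calls can be sketched as follows. This Python example does not call the real library: binding stands for a hypothetical wrapper object whose methods merely carry the names of the native functions, with simplified return conventions (CibOcrJobCreate returns the handle, CibOcrJobGetError returns the code). The RecordingBinding class is a stand-in that only records calls, so that the sequence can be run and inspected without CIB ocr being installed.

```python
def run_ocr_job(binding, properties):
    """Create a job, set its properties, start it and always free the handle."""
    job = binding.CibOcrJobCreate()
    try:
        for name, value in properties.items():
            if not binding.CibOcrJobSetProperty(job, name, value):
                raise RuntimeError(binding.CibOcrJobGetErrorText(job))
        if not binding.CibOcrJobStart(job):
            code = binding.CibOcrJobGetError(job)
            text = binding.CibOcrJobGetErrorText(job)
            raise RuntimeError(f"CIB ocr error {code}: {text}")
    finally:
        # Release the handle even if one of the steps above failed.
        binding.CibOcrJobFree(job)


class RecordingBinding:
    """Hypothetical stand-in for a real binding; it only records calls."""

    def __init__(self, start_ok=True):
        self.calls = []
        self.start_ok = start_ok

    def CibOcrJobCreate(self):
        self.calls.append("CibOcrJobCreate")
        return 1  # dummy handle

    def CibOcrJobSetProperty(self, job, name, value):
        self.calls.append(f"CibOcrJobSetProperty {name}={value}")
        return True

    def CibOcrJobStart(self, job):
        self.calls.append("CibOcrJobStart")
        return self.start_ok

    def CibOcrJobGetError(self, job):
        return 0 if self.start_ok else 9

    def CibOcrJobGetErrorText(self, job):
        return "" if self.start_ok else "input file not found"

    def CibOcrJobFree(self, job):
        self.calls.append("CibOcrJobFree")


binding = RecordingBinding()
run_ocr_job(binding, {"InputFilename": "input.tif", "Recognize": "BarcodeRecognizer"})
print("\n".join(binding.calls))
```

Placing CibOcrJobFree in a finally block reflects the rule that the handle is released after the task, regardless of whether the task succeeded.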

CibOcrJobCreate

bool exportfunc CibOcrJobCreate(Handle *job);

This method creates a job handle. The job handle is passed to all subsequent functions to ensure thread safety. It should be released again with CibOcrJobFree after the task is completed.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Creates a new job handle and stores it at *Job

CibOcrJobSetProperty

bool exportfunc CibOcrJobSetProperty (Handle job, const char *name, const char *value);

This function sets a property for a CIB ocr job. The names and values are expected to be UTF-8 encoded, zero-terminated strings.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the job that this property refers to
Char*	Name	Name of the property that is to be set
Char*	Value	Value of the property that is to be set

CibOcrJobSetPropertyW

Windows

bool exportfunc CibOcrJobSetPropertyW(Handle job, const wchar *name, const wchar *value);

This function sets a property for a CIB ocr job. The names and values are expected to be zero-terminated wide strings.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the job that this property refers to
wchar*	Name	Name of the property that is to be set
wchar*	Value	Value of the property that is to be set

CibOcrJobGetProperty

bool exportfunc CibOcrJobGetProperty (Handle *job, const char *name, const char *buffer, int size);

This function writes the current value of the specified property into the given buffer. The returned values are zero-terminated strings in UTF-8 encoding.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Value	Description
Handle*	Job	Handle of the job that this property refers to
char*	Name	Name of the property whose value is to be returned
char*	Buffer	The property’s value that is currently set
int	Size	Maximum buffer length

CibOcrJobGetPropertyW

Windows

bool exportfunc CibOcrJobGetPropertyW(Handle *job, const wchar *name, const wchar *buffer, int size);

This function writes the current value of the specified property into the given buffer. The returned values are zero-terminated wide strings.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Value	Description
Handle*	Job	Handle of the job that this property refers to
wchar*	Name	Name of the property whose value is to be returned
wchar*	Buffer	The property’s value that is currently set
int	Size	Maximum buffer length

CibOcrJobGetProgress

(From CIB ocr version 2.3.2)

bool exportfunc CibOcrJobGetProgress(Handle* job, char *buffer, size_t size);

Gets the recognition progress in percent.

This function fills the buffer with a string of the form:
<page_number> <page_count> <page_progress>

<page_number> number of the page currently being processed
<page_count> total page count
<page_progress> progress for the current page

Special values of <page_progress>:

1 Recognition process has not started
2 Recognition finished successfully
3 Recognition cancelled
4 Recognition finished with error

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the job whose progress is queried
char*	Buffer	Buffer that receives the progress string
size_t	Size	Maximum buffer length
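
The progress string consists of three whitespace-separated numbers and can be split on the caller's side. The following Python sketch assumes that the buffer content has already been copied out of the native buffer; parse_progress and the Progress type are hypothetical names, and the sketch does not interpret the special values of <page_progress>.

```python
from typing import NamedTuple


class Progress(NamedTuple):
    page_number: int
    page_count: int
    page_progress: int


def parse_progress(buffer):
    """Parse '<page_number> <page_count> <page_progress>' into a Progress."""
    if isinstance(buffer, bytes):
        # Cut at the terminating zero of the C string before decoding.
        buffer = buffer.split(b"\x00", 1)[0].decode("utf-8")
    fields = buffer.split()
    if len(fields) != 3:
        raise ValueError(f"unexpected progress string: {buffer!r}")
    return Progress(*(int(field) for field in fields))


print(parse_progress(b"2 10 50\x00"))
# prints: Progress(page_number=2, page_count=10, page_progress=50)
```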

CibOcrJobStart

bool exportfunc CibOcrJobStart(Handle *job);

Starts a CIB ocr-Job.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the job that is to be started.

CibOcrJobFree

bool exportfunc CibOcrJobFree(Handle *job);

This function frees the created CibOcrJobHandle and other resources allocated by CIB ocr.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the job that is to be terminated

CibOcrJobCancel

(From CIB ocr version 2.3.2)

bool exportfunc CibOcrJobCancel(Handle* job);

This function stops the recognition process.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the job that is to be cancelled

CibOcrGetVersion

bool CibOcrGetVersion(unsigned long * iVersion);

This function provides access to the current CIB ocr version number as an integer.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Unsigned long*	iVersion	Pointer to the stored product version

CibOcrGetVersionText

bool exportfunc CibOcrGetVersionText(char *text, long *maxlength);

This function provides access to the current CIB ocr version number as a string.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Char*	Text	Pointer to character buffer where the version text is stored
Long*	maxlength	Maximum length of version text

CibOcrGetVersionTextW

Windows

bool exportfunc CibOcrGetVersionTextW(wchar *text, long *maxlength);

This function provides access to the current CIB ocr version number as a string.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
wchar*	Text	Pointer to character buffer where the version text is stored
Long*	maxlength	Maximum length of version text

CibOcrJobGetErrorText

bool exportfunc CibOcrJobGetErrorText(Handle *job, char *text, long *maxlength);

This function returns the error text that is output after executing a function.

If no error occurs the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the current job
Char*	Text	Pointer to the string buffer where the error message text is stored
Long*	maxlength	Maximum length of the error message (size of message buffer)

CibOcrJobGetErrorTextW

Windows

bool exportfunc CibOcrJobGetErrorTextW(Handle *job, wchar *text, long *maxlength);

This function returns the error text that is output after executing a function.

If no error occurs the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the current job
wchar*	Text	Pointer to the string buffer where the error message text is stored
Long*	maxlength	Maximum length of the error message (size of message buffer)

CibOcrJobGetError

bool exportfunc CibOcrJobGetError(Handle *job, int *ErrorCode);

This function gives access to the current error state of CIB ocr after executing various functions.

If no error occurs, the function result is TRUE, otherwise FALSE.

Type	Variable	Description
Handle*	Job	Handle of the current job
Int*	Errorcode	Outputs the current error code

For all possible error codes, please see the appendix.

12. JNI Interface

CIB ocr also provides a JNI Interface.

In order to use the JNI interface, three Java classes are required:
CibOcr.java, CibOcrConstants.java, CibOcrJNI.java

CibOcrJNI.java contains the following methods:

public final static native int CibOcrJobCreate(long[] jarg1);
public final static native int CibOcrJobStart(long jarg1);
public final static native int CibOcrJobCancel(long jarg1);
public final static native int CibOcrJobReset(long jarg1);
public final static native int CibOcrJobFree(long[] jarg1);
public final static native int CibOcrJobGetProperty(long jarg1, String jarg2, byte[] jarg3);
public final static native int CibOcrJobSetProperty(long jarg1, String jarg2, String jarg3);
public final static native int CibOcrJobGetProgress(long jarg1, byte[] jarg2);
public final static native int CibOcrGetVersion(long[] jarg1);
public final static native int CibOcrGetVersionText(byte[] jarg1);
public final static native int CibOcrJobGetError(long jarg1, int[] jarg2);
public final static native int CibOcrJobGetErrorText(long jarg1, byte[] jarg2);

The methods listed above are almost identical to the native functions (see chapter Technical interface: Native functions). However, only the functions that accept wide characters (for example CibOcrGetVersionTextW) are called within the JNI interface.

13. Error Codes

error code	description
0	no error
9	input file not found
11	the function/method has not been implemented
47	buffer too small
99	the specified property name is not supported
122	invalid or missing license
198	unexpected exception
951	Neither image file nor memory address are specified
952	Can not load image file
953	Can not load image from memory
954	Invalid property value
955	Incorrect barcode type. Should be Datamatrix.
956	Can not open file for writing
957	Can not create MODI control
958	MODI recognition failed
959	File not found
960	Can not load tessdll.dll
961	Tesseract recognition failed
962	Image type recognition failed
963	Image type is not supported
964	Can not load cuneiform.dll
965	Cuneiform recognition failed
966	Can not load FineReader FREngine.dll
967	FineReader recognition failed
968	Can not load Omnipage KernelAPI.dll
969	Omnipage recognition failed
970	conversion to output codepage failed
971	Invalid argument
972	Output result error
973	Recognition was cancelled
974	Invalid output format specified
975	Invalid output type specified
976	Preprocessor can't be applied
977	Invalid recognizer name
978	"DataFolder" or "TESSDATA_PREFIX" should be defined
979	Error during initialization OCR framework
980	Error during text recognition
981	Invalid or unsupported barcode type specified
982	Error during initialization barcode recognition framework
983	Error during barcodes recognition
984	The specified configuration file does not exist
985	The specified configuration file has invalid format
986	Invalid xfdf input specified
987	word recognizer error
988	Path to dictionaries is missing
989	Path to dictionaries is invalid
990	Dictionary not found
991	Unknown input format
992	The HOCR file could not be parsed
993	The HOCR file could not be processed

14. Trace

In case of unclear error-situations it is possible to create a trace-file:

TraceFilename

Property-Name	Data-Type	Type
TraceFilename	String	Set

Syntax

TraceFilename= <filename>
<filename>= tracename.log

Example

TraceFilename= ocrtrace.log

Environment Variable

Alternatively, the environment variable CIB_OCRTRACE is set to a filename before the process to be analysed is started.

Example:

set CIB_OCRTRACE=ocrtrace.log
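
When the process is started from a script rather than from a shell, the variable can be set for that child process only. The following Python sketch builds such an environment; the helper name environment_with_trace is hypothetical, and only the variable name CIB_OCRTRACE is taken from this chapter.

```python
import os


def environment_with_trace(trace_filename, base_env=None):
    """Return a copy of the environment with CIB_OCRTRACE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CIB_OCRTRACE"] = trace_filename
    return env


env = environment_with_trace("ocrtrace.log")
print(env["CIB_OCRTRACE"])  # prints ocrtrace.log
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting the process to be traced, so that the current shell environment stays unchanged.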


Table of contents

1. Scope of Delivery

2. Introduction

3. Usecase: Calling CIB OCR via CIB runshell

4. Usecase: Calling CIB ocr Via CIB pdf toolbox

5. Usecase: Calling CIB ocr Via CIB job/CIB documentServer

6. General properties

6.1. Config

6.2. Recognize

6.3. DisableRecognition

6.4. DpiX

6.5. DpiY

6.6. InputFilename

6.7. InputMemoryAddress

6.8. LicenseCompany

6.9. LicenseKey

6.10. OCRLanguage

6.11. Preprocess

6.12. TracePreprocessOutput

7. Properties Text-Recognition

8. Properties Text-Recognition with deepER

9. Properties Barcode-Recognition

10. Properties Word-Recognition

11. Technical interface: Native functions

12. JNI Interface

13. Error Codes

14. Trace