CIB ocr technical manual (EN): Introduction

2. Introduction

CIB ocr is the CIB solution for optical character recognition. It is mainly employed as a component of CIB products and modules such as CIB doXima, CIB pdf toolbox, or CIB doXisafe, where it supports automatic text and barcode recognition. Text recognition is performed by the Tesseract engine.

About Tesseract:

Tesseract is an optical character recognition engine that is considered one of the most accurate open-source OCR engines currently available. It is free software released under the Apache License, and its development has been sponsored by Google since 2006.

Tesseract is available for Linux, Windows, and Mac OS X. Since version 3.00, Tesseract has supported output text formatting, hOCR positional information, and page layout analysis. Support for a number of additional image formats was added through the Leptonica library. Tesseract can process many languages, including German.

Input formats supported by CIB ocr:

bmp image;
tiff image (including multipage tiff);
jpeg image;
png image;
xml (serialized cv::Mat);
yml (serialized cv::Mat)

Supported barcode types:

Code128A, Code128B, Code128C, Code128Auto
Code39
Code39Extended
QR-Code
Datamatrix
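
The difference between the Code39 and Code39Extended entries above comes from the symbologies themselves: basic Code 39 defines 43 characters (digits, uppercase letters, and the seven symbols "-", ".", space, "$", "/", "+", "%"), while the extended variant covers the full 7-bit ASCII range. The following Python sketch illustrates that distinction only. The function name suggest_code39_variant is hypothetical and is not part of CIB ocr; how CIB ocr itself distinguishes the two types is not described here.

```python
import string

# The 43 characters defined by the basic Code 39 symbology.
CODE39_CHARS = set(string.digits + string.ascii_uppercase + "-. $/+%")


def suggest_code39_variant(data):
    """Return which Code 39 variant can represent `data`.

    Hypothetical helper for illustration; not a CIB ocr API.
    Returns "Code39", "Code39Extended", or None if neither fits.
    """
    if not data:
        return None  # nothing to encode
    if all(ch in CODE39_CHARS for ch in data):
        return "Code39"
    if all(ord(ch) < 128 for ch in data):
        return "Code39Extended"  # full 7-bit ASCII
    return None  # non-ASCII content fits neither variant


if __name__ == "__main__":
    for sample in ("INV-2024/07", "inv-2024", "Größe"):
        print(repr(sample), "->", suggest_code39_variant(sample))
```

Content that contains lowercase letters therefore needs the extended variant, and content outside ASCII would need a different barcode type, for example one of the 2D codes in the list.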

The following output formats can be used:

plain text
hOCR
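
Unlike plain text, hOCR output is HTML in which each recognized element carries its position in a title attribute of the form "bbox x0 y0 x1 y1". As a minimal sketch of how such output could be consumed, the following Python code extracts words and their bounding boxes using only the standard library. It assumes the output uses Tesseract-style ocrx_word spans; the sample markup and the name parse_hocr_words are illustrative and are not taken from CIB ocr.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class _HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) from an hOCR string."""
    parser = _HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Illustrative hOCR fragment (assumed structure, not real CIB ocr output).
SAMPLE = (
    "<div class='ocr_page' title='bbox 0 0 640 480'>"
    "<span class='ocr_line' title='bbox 36 92 210 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116'>Hello</span> "
    "<span class='ocrx_word' title='bbox 110 92 210 116'>world</span>"
    "</span></div>"
)

if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE):
        print(word, bbox)
```

The coordinates are pixel positions in the source image, which is what makes hOCR suitable for tasks such as highlighting search hits or building a searchable text layer.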