CIB ocr technical manual (EN)
2. Introduction
CIB ocr is the CIB solution for optical character recognition (OCR). It is
mainly used as a component of other CIB products and modules such as
CIB doXima, CIB pdf toolbox or CIB doXisafe, where it provides automatic
text and barcode recognition. Text recognition is performed by the
Tesseract engine.
About Tesseract:
Tesseract is an optical character recognition engine and is considered one of the most accurate open source OCR engines available. It is free software released under the Apache License, and its development has been sponsored by Google since 2006.
Tesseract is available for Linux, Windows and Mac OS X. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library.
Tesseract can process many languages, including German.
Input formats supported by CIB ocr:
- bmp image;
- tiff image (including multipage tiff);
- jpeg image;
- png image;
- xml (serialized cv::Mat);
- yml (serialized cv::Mat)
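As an illustration of the input formats listed above, the following minimal Python sketch checks a file name against them before it is handed to the OCR step. The mapping and the function name `describe_input` are hypothetical and not part of CIB ocr; the additional extension spellings `.tif`, `.jpg` and `.yaml` are assumptions, since the list above only names the formats.

```python
from pathlib import Path

# Formats taken from the list above. The mapping itself is an illustrative
# assumption and not part of the CIB ocr interface; ".tif", ".jpg" and
# ".yaml" are assumed alternative spellings.
SUPPORTED_INPUT_EXTENSIONS = {
    ".bmp": "bmp image",
    ".tif": "tiff image",
    ".tiff": "tiff image",
    ".jpg": "jpeg image",
    ".jpeg": "jpeg image",
    ".png": "png image",
    ".xml": "xml (serialized cv::Mat)",
    ".yml": "yml (serialized cv::Mat)",
    ".yaml": "yml (serialized cv::Mat)",
}


def describe_input(path):
    """Return the input format for a file name, or None if it is not listed."""
    # Compare case-insensitively so "SCAN.TIFF" and "scan.tiff" behave alike.
    return SUPPORTED_INPUT_EXTENSIONS.get(Path(path).suffix.lower())
```

A caller could use `describe_input("scan.tiff")` to decide whether a file should be passed on, and reject files for which the function returns `None`.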
Supported barcode types:
- Code128A, Code128B, Code128C, Code128Auto
- Code39
- Code39Extended
- QR-Code
- Datamatrix
The following output formats are available:
- plain text
- hOCR
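In contrast to plain text, hOCR output is HTML in which each recognized element carries its position in a `title` attribute of the form `bbox x0 y0 x1 y1`. The following Python sketch shows how word positions could be read from such output using only the standard library. The sample string is hand-written for illustration and is not actual CIB ocr output; it is assumed here that words are marked with the class `ocrx_word`, as Tesseract conventionally does. The names `parse_bbox`, `HocrWordParser` and `extract_words` are hypothetical.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # The title holds properties separated by ";", e.g. "bbox 1 2 3 4; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for elements with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # The first non-empty text after a word start tag is the word itself.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample for illustration only (assumed structure).
SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
<span class='ocr_line' title='bbox 36 92 220 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' title='bbox 110 92 220 116; x_wconf 91'>World</span>
</span></div>"""

for text, bbox in extract_words(SAMPLE):
    print(text, bbox)
```

Plain text output is sufficient when only the recognized characters are needed; hOCR is the better fit when the position of each word on the page matters, for example for highlighting search hits in a scanned document.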