CIB ocr technical manual (EN)

6. General properties

6.11. Preprocess

General
Preprocessor details
     ▸ Thresholding Methods
     ▸ Median filters
     ▸ BilateralFilter
     ▸ Thinning Algorithms
     ▸ DeSkew
     ▸ Invert / AutoInvert
     ▸ AutoRotate
Composite Algorithms

General

Property name	Data type	Type
Preprocess	String	Set

This property specifies one or more methods for preprocessing the input file in order to improve the OCR result.

Syntax

Preprocess: <preprocessnames> 
<preprocessnames>= <preprocessor> | <preprocessor> “+” <preprocessnames> 
<preprocessor>: NativeAdaptiveThresholding | PureMedianBlur | PureAdaptiveThresholding |
PureAdaptiveGaussianThresholding | MedianBlurAGT | MedianBlurAT | MedianBlurGAT | 
SauvolaThresholding | NiblackThresholding | WolfJolionThresholding | NickThresholding | 
FengThresholding | OtsuThresholding | BilateralFilter | ThinningZhangSuen | 
ThinningGuoHall | DeSkew | Invert | AutoInvert | Composite

No default

Example

Preprocess = NativeAdaptiveThresholding
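
The grammar above can be checked with a few lines of code. The following is a minimal sketch, not part of CIB OCR: the function name parse_preprocess is hypothetical, and the set contains only the names listed in the <preprocessor> rule above.

```python
# Names copied from the <preprocessor> rule of the syntax above.
KNOWN_PREPROCESSORS = {
    "NativeAdaptiveThresholding", "PureMedianBlur", "PureAdaptiveThresholding",
    "PureAdaptiveGaussianThresholding", "MedianBlurAGT", "MedianBlurAT",
    "MedianBlurGAT", "SauvolaThresholding", "NiblackThresholding",
    "WolfJolionThresholding", "NickThresholding", "FengThresholding",
    "OtsuThresholding", "BilateralFilter", "ThinningZhangSuen",
    "ThinningGuoHall", "DeSkew", "Invert", "AutoInvert", "Composite",
}


def parse_preprocess(value):
    """Split a Preprocess value on '+' and validate every name (hypothetical helper)."""
    names = [part.strip() for part in value.split("+")]
    unknown = [name for name in names if name not in KNOWN_PREPROCESSORS]
    if unknown:
        raise ValueError("unknown preprocessor(s): " + ", ".join(unknown))
    return names


print(parse_preprocess("DeSkew+AutoInvert+SauvolaThresholding"))
```

Whether CIB OCR itself treats the names as case-sensitive is not stated here; the sketch assumes an exact match.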

Preprocessor details

Thresholding Methods

Thresholding is the simplest method of image segmentation. It can be used to create a binary image from a grayscale image.
The simplest thresholding methods replace each pixel in an image with a black pixel if its intensity is less than a fixed constant T, or with a white pixel if its intensity is greater than that constant.
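
The fixed-constant rule can be sketched in plain Python as follows. The image is assumed to be a list of rows of 8-bit gray values; a pixel exactly equal to T is treated as white here, which is an assumption because the rule above does not cover that case.

```python
def global_threshold(image, t):
    """Binarize a grayscale image with one fixed threshold t.

    Pixels below t become black (0), all others white (255).
    """
    return [[0 if pixel < t else 255 for pixel in row] for row in image]


gray = [[10, 200],
        [128, 127]]
print(global_threshold(gray, 128))
```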

Adaptive Thresholding

A single global threshold may not work well when an image has different lighting conditions in different areas. In that case, adaptive thresholding is the better choice: the algorithm calculates a separate threshold for each small region of the image. Different regions of the same image thus get different thresholds, which gives better results for images with varying illumination.

Adaptive thresholding takes three method-specific input parameters and produces a single output image.

Adaptive Method - It decides how the thresholding value is calculated.

cv2.ADAPTIVE_THRESH_MEAN_C : threshold value is the mean of the neighborhood area.
cv2.ADAPTIVE_THRESH_GAUSSIAN_C : threshold value is the weighted sum of neighborhood values where weights are a Gaussian window.

Block Size - It decides the size of the neighborhood area.

C - A constant that is subtracted from the calculated mean or weighted mean.
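
To illustrate how Block Size and C interact, here is a minimal pure-Python sketch of the mean variant. It is not the OpenCV implementation: the function name is hypothetical, and the neighborhood is simply clipped at the image border, whereas OpenCV extends the border instead.

```python
def adaptive_mean_threshold(image, block_size, c):
    """Threshold each pixel against (mean of its block_size x block_size neighborhood) - c."""
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")
    height, width = len(image), len(image[0])
    radius = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image border (assumption).
            values = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            threshold = sum(values) / len(values) - c
            row.append(255 if image[y][x] > threshold else 0)
        result.append(row)
    return result
```

On a page whose background brightens from left to right, this per-pixel threshold can keep dark text in both the dark and the bright part, where a single global threshold would turn the whole dark part black.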

NativeAdaptiveThresholding

This is a multi-step filter that consists of the following steps, implemented with the OpenCV library:

cv::medianBlur()
cv::adaptiveThreshold() using the CV_ADAPTIVE_THRESH_MEAN_C adaptive method
cv::bilateralFilter()
The result is a grayscale image.

PureAdaptiveThresholding

While the conventional thresholding operator uses a global threshold for all pixels, adaptive thresholding changes the threshold dynamically over the image. This more sophisticated version of thresholding can accommodate changing lighting conditions in the image, e.g. those occurring as a result of a strong illumination gradient or shadows.

PureAdaptiveThresholding consists of only one step:

cv::adaptiveThreshold() using the CV_ADAPTIVE_THRESH_MEAN_C adaptive method
The result is a binary image.

Alternative: PureAdaptiveGaussianThresholding

Thresholding based on standard deviation

The methods described in the following sections (FengThresholding, SauvolaThresholding, NiblackThresholding, WolfJolionThresholding and NickThresholding) all use the same standard deviation matrices; they differ only in the final formula that computes the threshold value for a particular pixel.

SauvolaThresholding

The basic idea behind Sauvola is that if there is a lot of local contrast, the threshold should be chosen close to the mean value, whereas if there is very little contrast, the threshold should be chosen below the mean, by an amount proportional to the normalized local standard deviation.
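
This idea is usually written as T = m * (1 + k * (s / R - 1)), where m is the local mean, s the local standard deviation, R the dynamic range of the standard deviation and k a tuning factor. The sketch below applies that formula in plain Python. The window radius and the values k = 0.5 and R = 128 are commonly quoted defaults for 8-bit images and are assumptions here, not the settings used by CIB OCR.

```python
import math


def local_mean_std(image, y, x, radius):
    """Mean and standard deviation of the window around (y, x), clipped at the borders."""
    height, width = len(image), len(image[0])
    values = [image[j][i]
              for j in range(max(0, y - radius), min(height, y + radius + 1))
              for i in range(max(0, x - radius), min(width, x + radius + 1))]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def sauvola_binarize(image, radius=1, k=0.5, r=128.0):
    """Binarize with the Sauvola threshold T = m * (1 + k * (s / r - 1))."""
    result = []
    for y in range(len(image)):
        row = []
        for x in range(len(image[0])):
            mean, std = local_mean_std(image, y, x, radius)
            threshold = mean * (1.0 + k * (std / r - 1.0))
            row.append(255 if image[y][x] > threshold else 0)
        result.append(row)
    return result
```

With s close to R the threshold approaches the mean; with s close to 0 it drops to (1 - k) times the mean, so flat background stays white.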

NiblackThresholding

Niblack’s method can be considered the first local threshold method. It has the advantage of detecting the text, but it introduces a lot of background noise. Sauvola and Pietikäinen modified the Niblack threshold to decrease the background noise, but the text detection rate is also decreased, while bleed-through still remains in most cases.
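
Niblack's threshold is commonly given as T = m + k * s, with local mean m, local standard deviation s and a negative factor k. The short sketch below uses k = -0.2, a frequently quoted value and an assumption here, to hint at where the background noise comes from: in a nearly flat background window the threshold sits just below the mean, so slightly darker background pixels are classified as black.

```python
def niblack_threshold(mean, std, k=-0.2):
    """Niblack threshold T = mean + k * std (k is an assumed, commonly quoted value)."""
    return mean + k * std


# Nearly uniform background window: mean 200, standard deviation 2.
threshold = niblack_threshold(200.0, 2.0)
background_pixel = 199
print(threshold, "black" if background_pixel < threshold else "white")
```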

WolfJolionThresholding

For most color images, as well as for images with background noise and anti-aliased fonts, the WolfJolion preprocessor achieves the best recognition quality.

NickThresholding

Nick's binarization derives its thresholding formula from the basic Niblack algorithm, the parent of many local image thresholding methods. The major advantage of Nick's method over Niblack is that it considerably improves binarization for "white" and light page images by shifting down the binarization threshold.

FengThresholding

The Feng thresholding method is interesting because it can qualitatively outperform the Sauvola thresholding method. However, the Feng method contains many parameters which have to be set. Hence this method was never widely accepted.

OtsuThresholding

Consider a bimodal image, that is, an image whose histogram has two peaks. A value roughly in the middle between those peaks is a suitable threshold, and this is what Otsu binarization finds: it calculates the threshold automatically from the image's histogram. (For images that are not bimodal, the binarization will not be accurate.)
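
A minimal sketch of the histogram search follows. It uses the usual formulation of Otsu's method, which picks the threshold that maximizes the between-class variance; the function name is hypothetical and the input is assumed to be a flat list of 8-bit gray values.

```python
def otsu_threshold(pixels):
    """Return t so that the classes {p <= t} and {p > t} have maximal between-class variance."""
    histogram = [0] * 256
    for p in pixels:
        histogram[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(histogram))

    best_t, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        weight_high = total - weight_low
        if weight_low == 0:
            continue          # no pixels in the dark class yet
        if weight_high == 0:
            break             # all pixels are in the dark class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t
```

For a histogram with two clearly separated peaks, the returned value lies between the peaks; when several values separate the peaks equally well, this sketch returns the lowest one.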

Median filters

A median filter is an example of a non-linear filter and, if properly designed, is very good at preserving image detail. A median filter:

considers each pixel in the image,
sorts the neighboring pixels into order based upon their intensities,
replaces the original value of the pixel by the median value from the list.
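
The three steps above can be sketched in plain Python as follows. This is an illustration, not the OpenCV cv::medianBlur() implementation: the neighborhood is clipped at the image border, and for an even number of neighbors the upper of the two middle values is taken, both of which are assumptions of this sketch.

```python
def median_filter(image, radius=1):
    """Replace every pixel by the median of its (2 * radius + 1)^2 neighborhood."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):                      # step 1: consider each pixel
        row = []
        for x in range(width):
            neighbors = sorted(                  # step 2: sort neighbors by intensity
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1)))
            row.append(neighbors[len(neighbors) // 2])   # step 3: take the median
        result.append(row)
    return result
```

A single bright "salt" pixel in a uniform area is removed completely, because it never becomes the median of any neighborhood.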

A median filter is a rank-selection (RS) filter and a particularly harsh member of the family of rank-conditioned rank-selection (RCRS) filters. A much milder member of that family, for example one that selects the closest of the neighboring values when a pixel's value is external in its neighborhood and leaves it unchanged otherwise, is sometimes preferred, especially in photographic applications.

Median and other RCRS filters are good at removing salt and pepper noise from an image, and also cause relatively little blurring of edges, and hence are often used in computer vision applications.

Disadvantage: the rest of the image becomes blurred, which degrades the borders of characters and consequently the recognition accuracy.

At the same time (and rather unexpectedly), the best choice for “recipes” and images with “curved” or “complex in general” text is the MedianBlurGAT preprocessor.

Available filters:

PureMedianBlur

Filters that additionally apply thresholding:

MedianBlurAGT
MedianBlurAT
MedianBlurGAT

BilateralFilter

A bilateral filter is a non-linear, edge-preserving and noise-reducing smoothing filter for images. The intensity value at each pixel in an image is replaced by a weighted average of intensity values from nearby pixels. This weight can be based on a Gaussian distribution. Crucially, the weights depend not only on the Euclidean distance of pixels, but also on the radiometric differences (e.g. range differences, such as color intensity, depth distance, etc.). This preserves sharp edges by systematically looping through each pixel and adjusting weights to the adjacent pixels accordingly.
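
The weighting described above can be sketched as follows. This is a minimal pure-Python illustration with Gaussian weights for both the spatial and the intensity distance; it is not the cv::bilateralFilter() implementation, and the parameter names are hypothetical.

```python
import math


def bilateral_filter(image, radius, sigma_space, sigma_range):
    """Weighted average where weights fall off with pixel distance AND intensity difference."""
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            weight_sum = 0.0
            value_sum = 0.0
            for j in range(max(0, y - radius), min(height, y + radius + 1)):
                for i in range(max(0, x - radius), min(width, x + radius + 1)):
                    spatial = ((j - y) ** 2 + (i - x) ** 2) / (2.0 * sigma_space ** 2)
                    radiometric = (image[j][i] - center) ** 2 / (2.0 * sigma_range ** 2)
                    weight = math.exp(-spatial - radiometric)
                    weight_sum += weight
                    value_sum += weight * image[j][i]
            # The center pixel always contributes weight 1, so weight_sum > 0.
            result[y][x] = value_sum / weight_sum
    return result
```

Pixels on the other side of a strong edge differ greatly in intensity, so their weight is practically zero; small noise is averaged out while the edge itself stays sharp.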

It is normally used for non-text images or after thresholding.

Thinning Algorithms

Thinning is applied to binary images to reduce a black-and-white area to a skeleton, e.g. one that is a single pixel wide.

A fast parallel thinning algorithm consists of two iteration loops: one aimed at deleting the south-east boundary points and the north-west corner points, the other aimed at deleting the north-west boundary points and the south-east corner points. End points and pixel connectivity are preserved. Each pattern is thinned down to a "skeleton" of unitary thickness. Experimental results show that this method is very effective.
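
The two loops can be sketched in plain Python using the commonly published Zhang-Suen conditions. The sketch assumes a binary image given as rows of 0 (background) and 1 (foreground); it illustrates the technique and is not the code used by CIB OCR.

```python
def zhang_suen_thinning(image):
    """Thin a binary image (1 = foreground) with the two Zhang-Suen sub-iterations."""
    img = [row[:] for row in image]
    height, width = len(img), len(img[0])

    def pixel(j, i):
        return img[j][i] if 0 <= j < height and 0 <= i < width else 0

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for y in range(height):
                for x in range(width):
                    if img[y][x] != 1:
                        continue
                    # Neighbors P2..P9, clockwise, starting with the pixel above.
                    n = [pixel(y - 1, x), pixel(y - 1, x + 1), pixel(y, x + 1),
                         pixel(y + 1, x + 1), pixel(y + 1, x), pixel(y + 1, x - 1),
                         pixel(y, x - 1), pixel(y - 1, x - 1)]
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    count = sum(n)
                    transitions = sum(1 for k in range(8)
                                      if n[k] == 0 and n[(k + 1) % 8] == 1)
                    if step == 0:   # south-east boundary, north-west corner points
                        directional = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:           # north-west boundary, south-east corner points
                        directional = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= count <= 6 and transitions == 1 and directional:
                        to_delete.append((y, x))
            # Delete in parallel: all decisions of a sub-iteration use the same state.
            for y, x in to_delete:
                img[y][x] = 0
            changed = changed or bool(to_delete)
    return img
```

Applied to a horizontal bar three pixels high, the sketch leaves a one-pixel-high line along the middle row of the bar.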

Available algorithms:

ThinningZhangSuen
ThinningGuoHall

DeSkew

Deskewing an image can help considerably with barcode detection, or simply improve the readability of scanned images. In photos of goods with a barcode, for example, the skew angle is often too large for the barcode to be detected. After deskewing, the barcode can be read.

If the image is a logo, a good choice is DeSkew+AutoInvert combined with any of the preprocessors FengThresholding, NickThresholding, SauvolaThresholding or WolfJolionThresholding.
For invoices, a suggestion is DeSkew combined with SauvolaThresholding or WolfJolionThresholding.

Invert / AutoInvert

Both filters are suitable for images that contain more black than white.

“Invert” changes black to white and vice versa.

“AutoInvert” first checks whether the page really contains more black than white.

Good results are obtained when “Invert” (or “AutoInvert”) is used together with “BilateralFilter” and “DeSkew”.
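
The difference between the two filters can be sketched as follows. How CIB OCR actually counts black and white pixels is not documented here, so the cut-off of 128 and both function names are assumptions made for illustration.

```python
def invert(image):
    """Swap dark and light: every 8-bit value v becomes 255 - v."""
    return [[255 - pixel for pixel in row] for row in image]


def auto_invert(image, black_level=128):
    """Invert only if the image contains more dark than light pixels.

    black_level is a hypothetical cut-off between 'black' and 'white'.
    """
    pixels = [pixel for row in image for pixel in row]
    black = sum(1 for pixel in pixels if pixel < black_level)
    white = len(pixels) - black
    return invert(image) if black > white else image
```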

AutoRotate

This preprocessor detects image rotation by 90/180/270 degrees using an artificial intelligence algorithm and rotates the image back before text recognition. The following preprocessor setting detects and corrects the rotation and then de-skews the resulting image before text recognition:

Example

Preprocess = AutoRotate+DeSkew

To use this algorithm, the additional property AutoRotateModel must be set. It should point to a TensorFlow-based model file trained to detect image rotation.

Composite Algorithms

(From CIB ocr version 2.3.0)

CIB OCR can use complex algorithms for image preprocessing. To use them, specify the preprocessor "Composite". This feature is based on the functionality of the CIB image toolbox. Each preprocessing algorithm must be described in XML format (details are available in the CIB image toolbox documentation).

Example CIB runshell:

cibrsh.exe -oc Preprocess=Composite AlgorithmsSetName=AlgorithmsSet_sample.xml
AlgorithmName=SepaTextExtraction AlgorithmProfile=processing_profile.xml
IPLTraceFilename=OCR_IPL.log

Example CIB Job/CIB DocumentServer

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<root> 
   <Comod> 
      <defaults> 
         <properties command="job"> 
            <property name="OutputMode">XML</property>
            <property name="UseInMemoryProcessing">1</property>                     
        </properties> 
      </defaults> 
      <jobs> 
         <job name="tesseract" expected-result-code="404"> 
         <steps> 
            <step name="LoadStep" command="load">
            <properties> 
               <property name="InputFilename">./input/input.png</property>
            </properties> 
            </step> 
            <step name="OcrStep" expected-result-code="1000" command="ocr"> 
            <properties> 
               <property name="LicenseCompany">CustomerLicensee</property> 
               <property name="LicenseKey">4444-cccc-88888888</property>
               <property name="OCRLibraryName">Tesseract</property>
               <property name="DataFolder">.</property> 
               <property name="OutputFormat">FormatHocr</property>
               <property name="TraceFilename">OCR_trace.log</property>
               <property name="OCRLanguage">deu</property>
               <property name="TracePreprocessOutput">1</property> 
               <property name="Preprocess">Composite</property>
               <property name="AlgorithmsSetName">AlgorithmsSet_sample.xml</property>
               <property name="AlgorithmName">SepaTextExtraction</property>
               <property name="AlgorithmProfile">processing_profile.xml</property>
               <property name="IPLTraceFilename">OCR_IPL.log</property>
            </properties> 
            </step> 
            <step name="SaveStep" expected-result-code="0" command="save">
            <properties> 
               <property name="OutputFilename">./ocr_out.html</property>
            </properties> 
            </step> 
         </steps> 
         </job> 
      </jobs> 
   </Comod> 
</root>