CIB ocr technical manual (EN)

4. Use case: Calling CIB ocr via CIB pdf toolbox

In CIB pdf toolbox it is possible to pass properties directly through to CIB ocr. Further details can be found in the CIB pdf toolbox technical manual.

This example uses CIB ocr to extract text from a PDF file whose text is not machine-readable, i.e. the text is represented by a picture. To pass CIB ocr properties through in this CIB pdf toolbox call, “CibOcr” is added as a prefix to the CIB ocr properties described in this document.
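The prefix rule can be sketched in Python as follows. This is only an illustration: the helper name `to_toolbox_properties` is hypothetical and not part of any CIB product, and the unprefixed names `OCRLanguage` and `DataFolder` are assumed by stripping “CibOcr” from the prefixed properties used in this section.

```python
# Hypothetical helper: illustrates the "CibOcr" prefix rule only.
CIB_OCR_PREFIX = "CibOcr"


def to_toolbox_properties(ocr_properties):
    """Return a new dict whose keys carry the CibOcr prefix."""
    return {CIB_OCR_PREFIX + name: value for name, value in ocr_properties.items()}


# Unprefixed names are assumptions derived from the prefixed ones in this section.
ocr_properties = {"OCRLanguage": "eng", "DataFolder": "./tessdata"}
prefixed = to_toolbox_properties(ocr_properties)
print(prefixed)
```

The original dictionary is left unchanged, so the same CIB ocr settings could still be used for a direct CIB ocr call.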

Again, a valid license has to be set. Then the property “OutputFormat” is set to specify that a searchable PDF should be generated. In this case it is also specified that the image in the input PDF should be removed and the text extracted from that image shown in its place in the output PDF (properties “FormatSearchablePdfRemoveImage” and “FormatSearchablePdfShowText”).

The language of the text in the PDF is set accordingly (“CibOcrOCRLanguage”) and the path to the necessary dictionary files is set (“CibOcrDataFolder”). In addition, a preprocessor is selected (“CibOcrPreprocess”) to improve the OCR result.

Then CIB pdf toolbox is called via –fj (join) and the input and output files are specified.

CibRsh.exe LicenseCompany="Example Company" LicenseKey="xxxx-xxxx-xxxxxxxx"
OutputFormat=FormatSearchablePdf FormatSearchablePdfShowText=1
FormatSearchablePdfRemoveImages=1 CibOcrDataFolder="./tessdata" CibOcrOCRLanguage="eng"
CibOcrPreprocess=SauvolaThresholding –fj "./input/TextAsPicture.pdf"
"./output/TextAsPicture.out.pdf"
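When the call is issued from a script, it can help to assemble the arguments as a list so that values containing spaces (such as the company name) need no manual quoting. The following Python sketch mirrors the call in this section. The function names `build_command` and `run_command` are hypothetical, and the sketch assumes that CibRsh.exe is on the search path and that the join switch is written with a plain ASCII hyphen (`-fj`); check both against the CIB pdf toolbox technical manual.

```python
import subprocess


def build_command(input_pdf, output_pdf, company, key):
    """Assemble the argument list for the call shown in this section."""
    properties = [
        ("LicenseCompany", company),
        ("LicenseKey", key),
        ("OutputFormat", "FormatSearchablePdf"),
        ("FormatSearchablePdfShowText", "1"),
        ("FormatSearchablePdfRemoveImages", "1"),
        ("CibOcrDataFolder", "./tessdata"),
        ("CibOcrOCRLanguage", "eng"),
        ("CibOcrPreprocess", "SauvolaThresholding"),
    ]
    args = ["CibRsh.exe"]  # assumption: executable is on the search path
    args += [f"{name}={value}" for name, value in properties]
    # Assumption: the join switch uses a plain ASCII hyphen.
    args += ["-fj", input_pdf, output_pdf]
    return args


def run_command(args):
    """Run the assembled call; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(args, check=True)


args = build_command(
    "./input/TextAsPicture.pdf",
    "./output/TextAsPicture.out.pdf",
    company="Example Company",
    key="xxxx-xxxx-xxxxxxxx",
)
print(args)
```

Passing the list to `run_command(args)` would start the conversion. Because no shell is involved, each `Name=Value` pair reaches the program as a single argument.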