CIB pdf toolbox technical guide (EN)

11. Supported graphic formats

The graphic formats supported by the CIB pdf toolbox depend on the mode in which the PDFs are processed. The following sections describe the individual cases.

CIB pdf join/ CIB pdf merge
CIB pdf print / processing for the CIB viewer
Graphic Overlay
Generating graphic files
Insert text from graphics into PDFs using CIB ocr

CIB pdf join/ CIB pdf merge

All graphic formats are possible here, since the image objects are not processed, but only copied.


CIB pdf print / processing for the CIB viewer

Supported formats are:

  • RAW (Image data are available in a format described by the PDF Spec.)
  • JPG
  • TIFF
  • JBIG2 (from CIB pdf toolbox Version 1.4.100 onwards)
  • JPEG2000 (from CIB pdf toolbox Version 1.4.101 onwards)


Hint:

The image objects in these PDFs do not contain a complete JPG or TIFF, but only the image data itself (i.e. no color palette and metadata).

For processing JBIG2 a special library Jbig2dec.dll is required. The library is also available for Unix platforms.


Graphic Overlay

Supported graphic formats are BMP, JPG, GIF and PNG.

Please find detailed information about this in the chapter „Overlay functionality/graphics“.

Generating graphic files

(from CIB pdf toolbox 1.4.102 onwards)

With the CIB pdf join module of the CIB pdf toolbox, it is possible to create graphic files in addition to PDF output.

The output format is set by the OutputFormat property.

The following graphic formats are supported:

  • FormatTiff
  • FormatPng
  • FormatJpeg

One graphic file per page in the PDF is generated, whereby the file names are made unique by automatic numbering. Only with TIFF is multi-page output to a TIFF file possible.

From CIB pdf toolbox version 1.5.113 onwards, these graphic formats are also available on Unix platforms.

(from CIB pdf toolbox Version 1.8.5a onwards):

  • FormatBmp
  • FormatBmpLz4 (BMP, but the files are compressed according to the LZ4 standard).

(from CIB pdf toolbox Version 1.9.0 onwards):

  • FormatJpegXR
  • FormatWebP (if RenderingEngine=CIBRenderer)

The resolution, compression, etc. can be specified in more detail for the individual graphic formats via corresponding property assignments.

Further information on this topic can be found in the chapter „CIB pdf/join / split“.

(from CIB pdf toolbox 1.4.113 onwards)

Property OutputFormat = FormatExtractImages

If this output format is specified, all Image-XObjects contained in the input PDFs are exported. The output is in TIFF format or (for certain PDF image objects) in JPEG format under the file name specified in OutputFilename.

For TIFF images, it is possible to export into a single TIFF file or into a separate TIFF file for each image. For details on this and on exporting JPEG images, see description of the Property OutputFormat = FormatExtractImages.


PDF Layer Support

(from CIB pdf toolbox 1.40.0 onwards)

Processing and rendering of optional content groups is implemented into CIB pdf toolbox.

Optional content is handled within marked content streams as well as for XObjects and annotations that have an OC entry. The toolbox can communicate with doxiview using input and output properties and JSON-based arguments.

Property description

Functionality

Kind

PdfLayers

Syntax:

PdfLayers={"RequestedStates":[{"LayerId":<Id1>,"State":"On"/"Off"},..., {"LayerId":<IdN>,"State":"On"/"Off"}]}

where:

Id1...IdN are numeric IDs of the optional content layers, and the layer state is "Off" (invisible), or "On" (visible)

Notes:
After a PDF file has been processed for the first time, an external application can read information about the existing OC groups of the default configuration from the output property PdfLayersInfo.

With this information, the external application can request arbitrary states (visible/not visible) for any OC layer.

 

Set
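The PdfLayers syntax above can be illustrated with a minimal Python sketch that assembles the property value from a mapping of layer IDs to visibility. The helper name build_pdf_layers and the sample IDs 3 and 7 are hypothetical; real IDs would come from the output property PdfLayersInfo.

```python
import json


def build_pdf_layers(states):
    """Build the value of the PdfLayers input property.

    `states` maps numeric layer IDs to True (visible) or False (invisible).
    The helper name and its signature are illustrative assumptions.
    """
    requested = [
        {"LayerId": layer_id, "State": "On" if visible else "Off"}
        for layer_id, visible in states.items()
    ]
    # Compact separators keep the value on one line without extra spaces.
    return json.dumps({"RequestedStates": requested}, separators=(",", ":"))


# Hypothetical layer IDs: show layer 3, hide layer 7.
print("PdfLayers=" + build_pdf_layers({3: True, 7: False}))
```

How the resulting string is handed to the toolbox depends on the integration and is not shown here.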

PdfLayersInfo

The output property outputs info about the existing optional content layers and their states.

Syntax:

PdfLayersInfo={"Tree":[<LayerDescription>], "PageLayers":[<PageLayersDescription>]}

where:

Tree is an array describing the hierarchical structure of the existing layers in the PDF.

Tree contains the items: <LayerDescription> which is a description of each layer:

{"Name":"<LayerName>", "LayerId":<Id>,"Locked":true/false, "State":"On"/"Off", "RBGroups":[<RBId1>,...<RBIdN>],"Kids":[ <LayerDescription> ]}

where:

"LayerId": <Id> is a unique number (the layer identifier that should be used for the input property PdfLayers).

Locked: true/false shows whether a user can switch the current state of the layer.

Kids: array, describing all kids of the layer.

 

PageLayers is an array of pages and their layers:

PageLayersDescription = {"PageIndex":<PageId>, "LayerIds":[<LayerId>]}

 <PageId> is an index of a page, starting from 0

LayerIds is an array of <LayerId> (numeric id) that could be mapped to appropriate LayerIds from Tree object.

Note:

  1. If LayerDescription contains only Name and Kids (no LayerId and State entries), then it is not a real layer, but a simple node that can be expanded or collapsed and that contains other layers. If LayerDescription contains all entries and also has Kids, then the layer is a node which can be switched to On or Off and which also contains other layers.
  2. A new action of type SetOcgState is additionally output into the metafile to allow OCG switching using form fields and widget annotations.
  3. The output property PdfLayersInfo will be also filled for OutputFormat=FormatInfo with appropriate FilterInfo.

Get
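As a rough illustration of how an application might evaluate PdfLayersInfo, the following Python sketch flattens the Tree array and distinguishes real layers from pure grouping nodes. It assumes that the property value is valid JSON; the sample document, layer names and IDs are invented for the example.

```python
import json


def collect_layers(tree, depth=0):
    """Flatten the Tree array into (depth, name, layer_id, state) tuples.

    Pure grouping nodes have no LayerId and State entries, so both
    values are None for them.
    """
    rows = []
    for node in tree:
        rows.append((depth, node.get("Name"), node.get("LayerId"), node.get("State")))
        rows.extend(collect_layers(node.get("Kids", []), depth + 1))
    return rows


def layers_on_page(info, page_index):
    """Return the layer IDs listed in PageLayers for a 0-based page index."""
    for entry in info.get("PageLayers", []):
        if entry["PageIndex"] == page_index:
            return entry["LayerIds"]
    return []


# Hypothetical PdfLayersInfo value (not taken from a real document).
sample = """{"Tree":[{"Name":"Group","Kids":[
  {"Name":"Text","LayerId":1,"Locked":false,"State":"On","RBGroups":[],"Kids":[]},
  {"Name":"Grid","LayerId":2,"Locked":true,"State":"Off","RBGroups":[],"Kids":[]}]}],
 "PageLayers":[{"PageIndex":0,"LayerIds":[1,2]}]}"""

info = json.loads(sample)
for depth, name, layer_id, state in collect_layers(info["Tree"]):
    kind = "group" if layer_id is None else f"layer {layer_id} ({state})"
    print("  " * depth + f"{name}: {kind}")
```

The layer IDs collected this way are the ones to use in the input property PdfLayers.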



Insert text from graphics into PDFs using CIB ocr

(from CIB pdf toolbox 1.6.116 onwards)

For images in PDF documents that contain text, the CIB ocr module can be used to extract this text from the image via the CIB pdf toolbox and insert it into the PDF document as text. The main use case for this feature is scanned PDF documents, where the text is only available as an image and therefore cannot be searched or copied.

This functionality is available for both PDF Join and PDF Merge. It requires the CIB ocr module with a corresponding license.

The result is a PDF document similar to the input document(s), but containing (in)visible text, the text extracted from the images. This text can be searched for and copied from the PDF document.

From CIB pdf toolbox 1.9.0 onwards, it is possible to import such (in)visible text from external sources. This means that the text is no longer extracted from the images of the PDF document, but is instead transferred to the toolbox by the property "HocrInputData". This property consists of one or more memory blocks containing the Hocr data. For the exact format please see the property below.

In addition, from CIB pdf toolbox 1.9.0 onwards, all generated or imported Hocr data can be output as multi-Hocr files. The HocrOutputFilename property specifies the file in which the Hocr data should be saved.

From CIB pdf toolbox 1.10.0 onwards it is also possible to import a multi-page HOCR XML file. For this purpose, the property HocrInputData is assigned the path of this file. All information (such as page numbers, fonts, ...) is contained in this HOCR XML file. The properties FormatSearchablePdfTextColor, FormatSearchablePdfLayerName and FormatSearchablePdfDTDFolder are supported for this multi-page HOCR XML file. Such a multi-page HOCR XML file can be generated, for example, with CIB format and the output format FormatHocr.

Properties:

Property description

Type

Functionality

Kind

OutputFormat

String

FormatSearchablePdf

Set

CurrentProgress

(from CIB pdf toolbox 1.24.0 onwards)

String

This property can be used to check the progress of the text recognition process of CIB ocr.

The CIB pdf toolbox transfers the content of the method CibOcrJobGetProgress of the CIB ocr module unchanged.

The property contains a string with the structure:
<current page number> <total page number> <processing progress for current page>

For details see „Technical manual CIB ocr“, chapter „CibOcrJobGetProgress“

Get
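A caller polling CurrentProgress could split the string into its three parts as in the following Python sketch. It assumes that the fields are separated by whitespace and that the first two are integers; the third field is kept as a string because its exact format is defined by CibOcrJobGetProgress and is not described here. The sample value is invented.

```python
def parse_progress(value):
    """Split a CurrentProgress value into (current_page, total_pages, page_progress).

    Assumption: the three fields are whitespace-separated. The third
    field is returned unchanged as a string.
    """
    parts = value.split()
    if len(parts) != 3:
        raise ValueError(f"expected three fields, got {len(parts)}: {value!r}")
    current_page, total_pages, page_progress = parts
    return int(current_page), int(total_pages), page_progress


# Hypothetical value: page 2 of 10, third field passed through as-is.
print(parse_progress("2 10 45"))
```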

DictionaryWorkSpace

String

This property sets the path to the data required by CIB ocr.

Set

FormatSearchablePdfShowText

String

It can be defined whether the text inserted in the output PDF is visible or not.

„1“       Text is visible

„0“       Text is not visible (default)

Set

FormatSearchablePdfRemoveImages

String

It can be defined whether the images are removed from the output PDF or not.

Setting this property is only possible if "HocrInputData" is empty and is only useful in conjunction with "FormatSearchablePdfShowText=1". Then the images are replaced by visible text in the output PDF.

„1“       Images will be removed

„0“       Images will be maintained (default)

Set

PdfVersion

(from version 1.6.116b onwards)

String

The PdfVersion property can also be set optionally for OutputFormat=FormatSearchablePdf. Then the created PDF corresponds to the specified PDF/A standard.

Note: this is currently only supported in combination with FormatSearchablePdfShowText=0.

Possible values:

PDF/A-1b

PDF/A-2b

PDF/A-3b

Set

HocrInputData

(from version 1.9.0, multi page format from 1.10.0 onwards)

String

This string makes it possible to pass the hocr data directly to the toolbox (instead of extracting it from the images of the PDF document). If this string is empty (default), it is not used.

Otherwise it includes a list of memory addresses and lengths for strings containing the hocrdata. It must be in the following format:

Syntax:

HocrInputData ::= <OneHocrFile> [“;“ <OneHocrFile>] ...

OneHocrFile ::= “{” <Pagenumber> “};” <MemoryBlocks>

MemoryBlocks ::= [<MemoryBlocks-Delimiter> “;”] <MemoryBlock> [<MemoryBlocks-Delimiter> <MemoryBlock>] ...

MemoryBlock ::= <Address> <MemoryBlock-Delimiter> <Length>

MemoryBlock-Delimiter ::= A single symbol that is not equal to ‚;‘ or ‚\0‘, e.g. ‚#‘ or ‚?‘; digits and letters should be avoided as well.

Pagenumber ::= Page number of the page for which the hocrdata are specified.

Address ::= A decimal number that specifies the address of a memory block for the hocrdata.

Length ::= A decimal number that specifies the length of a memory area for the hocrdata.

Example:

HocrInputData=“{1};?;111?100?222?200;{2};113?100;{3};+;300+100+400+200“ means:

Page 1 has hocr data in the memory areas (address, length) (111, 100) and (222, 200); page 2 has hocr data in the memory areas (113, 100); page 3 has hocr data in the memory areas (300, 100) and (400, 200).

From CIB pdf toolbox 1.10.0 onwards, a multi-page format is also possible:

If the first character is not equal to "{", only one XML HOCR file is specified. This can consist of several individual XML HOCR pages. The file has to be in a special format, such as generated by CIB format with the property OutputFormat=FormatHocr.

Set
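The memory-block syntax of HocrInputData can be checked with a short Python sketch. The two helpers are hypothetical and only handle the string format; obtaining real memory addresses and keeping the buffers alive is the caller's responsibility. The builder declares the delimiter for every page, which the grammar allows; the parser simply pairs up the decimal numbers of each block list, so it does not depend on which delimiter was chosen.

```python
import re


def build_hocr_input_data(pages, delimiter="?"):
    """Build a HocrInputData value from {page_number: [(address, length), ...]}."""
    if len(delimiter) != 1 or delimiter in ";\0" or delimiter.isalnum():
        raise ValueError("delimiter must be a single non-alphanumeric symbol other than ';'")
    parts = []
    for page, blocks in pages.items():
        block_text = delimiter.join(f"{addr}{delimiter}{length}" for addr, length in blocks)
        parts.append(f"{{{page}}};{delimiter};{block_text}")
    return ";".join(parts)


def parse_hocr_input_data(value):
    """Parse the memory-block format into {page_number: [(address, length), ...]}."""
    result = {}
    page = None
    for token in value.split(";"):
        page_match = re.fullmatch(r"\{(\d+)\}", token)
        if page_match:
            page = int(page_match.group(1))
            result[page] = []
        elif any(ch.isdigit() for ch in token):
            if page is None:
                raise ValueError("memory blocks found before any page number")
            numbers = [int(n) for n in re.findall(r"\d+", token)]
            if len(numbers) % 2:
                raise ValueError(f"address without length in {token!r}")
            result[page].extend(zip(numbers[0::2], numbers[1::2]))
        # Tokens without digits are delimiter declarations such as "?" or "+".
    return result


# The example string from the text above, written with plain ASCII quotes.
example = "{1};?;111?100?222?200;{2};113?100;{3};+;300+100+400+200"
print(parse_hocr_input_data(example))
print(build_hocr_input_data({1: [(111, 100), (222, 200)], 2: [(113, 100)]}))
```

For the documented example, the parser yields the same (address, length) pairs per page as listed in the explanation above.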

HocrOutputFilename (from version 1.9.0 onwards)

String

If the string is empty, no hocr data is written out (default).

If the string is not empty, all used hocr data will be written in the file with this filename as multi-hocr file.

The line "<!-- CIB:page=page number -->" is written before each individual "hocr part", where "page number" is the page number for the hocr part. (e.g. "<!-- CIB:page=3 -->" = Hocrdata for page 3). Texts like 'CIB ocr testlicense' are removed from the hocr parts.

Set
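A file written via HocrOutputFilename can be split back into its individual hocr parts by looking for the page marker lines. The following Python sketch assumes that the markers appear exactly in the form "<!-- CIB:page=N -->"; the function name and the sample content are invented.

```python
import re

# Marker line that precedes each hocr part in the multi-hocr file.
PAGE_MARKER = re.compile(r"<!-- CIB:page=(\d+) -->")


def split_multi_hocr(text):
    """Return a list of (page_number, hocr_part) tuples in file order.

    A list is used instead of a dict because a page may have more
    than one hocr part.
    """
    # With one capturing group, re.split yields
    # [text_before_first_marker, page, part, page, part, ...].
    pieces = PAGE_MARKER.split(text)
    return [(int(pieces[i]), pieces[i + 1].strip()) for i in range(1, len(pieces), 2)]


# Hypothetical, heavily shortened multi-hocr content.
sample = (
    "<!-- CIB:page=1 -->\n<html>first</html>\n"
    "<!-- CIB:page=3 -->\n<html>second</html>\n"
)
for page, part in split_multi_hocr(sample):
    print(page, part)
```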

OCRDebug

String

This property is only relevant for technical test purposes. It controls the output of images from the PDF.

Possible values:

„1“     The intermediate steps of barcode extraction using OCR are output as individual files.

„0“     No output of intermediate steps (default)

For every single image in the PDF document, which was transferred to the CIB Ocr Dll, several files are output:

  • The image itself as a bitmap as it was passed to OCR.
  • The barcode result of the CIB Ocr Dll, if the barcode result is not empty. If the barcode result is empty, this file is not output.

The files are written to the same directory as the output file. The names of these files have the following form:

“output-file”__Page_“page-number”_Image_“image-number”“file-extension”

The “file-extension” is:

  • For bitmap file: „.bmp“
  • For the barcode result of CIB Ocr Dll: „_BARCODE.txt“

Example:

If the output document is "Output.xml", the following files are output for the 4th image of the 3rd page:

Output.xml__Page_3_Image_4.bmp
Output.xml__Page_3_Image_4_BARCODE.txt

Set
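The naming scheme of the OCRDebug files can be reproduced with a few lines of Python, for instance to collect or clean up the files after a test run. The helper name is hypothetical; the pattern follows the example above.

```python
def ocr_debug_filenames(output_file, page_number, image_number):
    """Return the names of the debug files for one image.

    The first entry is the bitmap, the second the barcode result
    (which is only written if the barcode result is not empty).
    """
    stem = f"{output_file}__Page_{page_number}_Image_{image_number}"
    return [stem + ".bmp", stem + "_BARCODE.txt"]


# 4th image on the 3rd page of the output document "Output.xml".
for name in ocr_debug_filenames("Output.xml", 3, 4):
    print(name)
```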

FormatSearchablePdfTextColor

(from version 1.10.0 onwards)

String

This property is only used if OutputFormat=FormatSearchablePdf and FormatSearchablePdfShowText="1" and the first character of HocrInputData is not equal to '{'. It specifies the color for the inserted text.

Possible values:

  • empty String (default): black will be used as standard font color.
  • The text color is given in the form "R;G;B", where R, G and B are natural decimal numbers between 0 and 255. (R is the red component, G is the green component and B is the blue component).

For example, "255;0;0" is red.

Set
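A value for FormatSearchablePdfTextColor can be built and validated as in the following Python sketch; the helper name is an assumption made for this example.

```python
def format_text_color(red, green, blue):
    """Return a color in the form "R;G;B" with components from 0 to 255."""
    for component in (red, green, blue):
        if not isinstance(component, int) or not 0 <= component <= 255:
            raise ValueError(f"color component out of range: {component!r}")
    return f"{red};{green};{blue}"


print("FormatSearchablePdfTextColor=" + format_text_color(255, 0, 0))
```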

FormatSearchablePdfLayerName

(from version 1.10.0 onwards)

String

This property is only used if OutputFormat=FormatSearchablePdf and FormatSearchablePdfShowText="1" and the first character of HocrInputData is not equal to '{'.

It indicates (if specified) the PDF layer name (in Adobe Reader) for the inserted text.

Possible values:

  • Empty String (default): The inserted text is not part of any PDF layer.
  • The inserted text is part of the Pdf layer with this layer name.

Set

FormatSearchablePdfDTDFolder

(from version 1.10.0 onwards)

String

This property is only used if OutputFormat=FormatSearchablePdf and FormatSearchablePdfShowText="1" and the first character of HocrInputData is not equal to '{'.

It is an auxiliary property and specifies (if set) the local path for the XHTML DTD file xhtml1-transitional.dtd. Background: the XML parser can return a NetAccessorException when the DTD is referenced by URL, and the W3C recommends that DTDs be stored locally.

Possible values:

  • Empty String (default): Nothing will be done.
  • The URL for xhtml1-transitional.dtd in the HOCR-XML-file of HocrInputData is replaced by a reference to the local file of the same name.

Set

FormatSearchablePdfConversionMode

(from version 1.18.0 onwards)

String

This property defines the form in which the images contained in the PDF are transferred to CIB ocr.

Possible values:

FormatSearchablePdfConvertImages
The previous behavior is applied, i.e. each image on a page is treated as a separate object.

FormatSearchablePdfConvertPages
Each page is converted into a single image and passed to CIB ocr. The resolution of this image can be controlled by the property TiffResolution (recommended values are TiffResolution=150 and higher).

FormatSearchablePdfConvertAuto
On pages that contain only a single image, the previous behavior is applied. On all other pages, the behavior of „FormatSearchablePdfConvertPages“ is applied. (Default)

Set

FormatSearchablePdfUseRotationHint

(from version 1.18.0 onwards)

String

If this property is set, the information about a rotation is passed to CIB ocr for each page of the PDF. This improves the recognition rate of CIB ocr for rotated pages.

For this purpose the CIB pdf toolbox uses the CIB ocr property ImageRotationAngle.

Possible values:

0          Previous behavior

1        Transfer of information about the page rotation to CIB ocr. (Default)

Set

FormatSearchablePdfReplaceText

String

This property is only valid for OutputFormat=FormatSearchablePdf

If the property FormatSearchablePdfReplaceText is not set, the original behavior applies: the pdf toolbox always adds new CIB HOCR content without removing existing content.

If the property is set as FormatSearchablePdfReplaceText=1 then pdf toolbox will remove existing invisible CIB HOCR content from the processed document before adding a new one.

 

The new property FormatSearchablePdfReplaceText can also work in conjunction with TextSelectionFilter:

If FormatSearchablePdfReplaceText=1 and TextSelectionFilter contains a filter, this filter is used to remove text from the original document before the new HOCR text is added.

 

Examples:

1. OutputFormat=FormatSearchablePdf FormatSearchablePdfReplaceText=1 - will remove existing CIB HOCR text before adding new HOCR text

2. OutputFormat=FormatSearchablePdf FormatSearchablePdfReplaceText=1 TextSelectionFilter={"groups":["any_invisible"]} - will remove any invisible text (including CIB HOCR) before adding new HOCR text
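The two examples can be sketched as a small Python helper that assembles the property names and values. The helper name searchable_pdf_properties is hypothetical, and how the properties are passed to the CIB pdf toolbox depends on the integration and is not shown.

```python
import json


def searchable_pdf_properties(replace_text=False, selection_groups=None):
    """Assemble property name/value pairs for OutputFormat=FormatSearchablePdf.

    `selection_groups` is an optional list of group names for
    TextSelectionFilter; it is only used together with replace_text.
    """
    props = {"OutputFormat": "FormatSearchablePdf"}
    if replace_text:
        props["FormatSearchablePdfReplaceText"] = "1"
        if selection_groups:
            props["TextSelectionFilter"] = json.dumps(
                {"groups": list(selection_groups)}, separators=(",", ":")
            )
    return props


# Example 1: remove existing CIB HOCR text only.
print(searchable_pdf_properties(replace_text=True))
# Example 2: remove any invisible text before adding new HOCR text.
print(searchable_pdf_properties(replace_text=True, selection_groups=["any_invisible"]))
```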