CIB pdf toolbox 2 technical documentation

4. Use Cases

4.21. Exporting Text

Options for exporting text from a PDF document

Property

Meaning

Type

TextExtraction

A flag indicating that text should be extracted from a PDF.

By default, only visible text is extracted and saved in UTF-16 format. To change this behavior, use the additional options TextFormattingOptions and TextSelectionFilter.

Set

FillTextOutput

A flag indicating whether the extracted text is saved in memory (1) or not (0).

Set

TextOutputFilename

Name of the text file to which the extracted text is saved.
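As an illustration, the exported file can be read back in Python. This is a minimal sketch and not part of the toolbox: the helper name read_exported_text is hypothetical, and it assumes that the default UTF-16 output begins with a byte order mark (without one, Python's utf-16 codec falls back to the platform's native byte order).

```python
from pathlib import Path


def read_exported_text(path, encoding="utf-16"):
    """Read a text file produced via TextOutputFilename.

    Hypothetical helper. The default encoding matches the documented
    default output (UTF-16); the BOM handling is an assumption.
    """
    return Path(path).read_text(encoding=encoding)
```

For a file written with the utf8txt output format, pass encoding="utf-8" instead.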

Set

TextFormattingOptions

Optional. Specifies the output format (UTF-8, UTF-16, or hOCR) and can enable additional word repositioning.

The options are specified as a JSON object.

Example:
The following formatting options are set by default:

TextFormattingOptions={"OutputFormats":["txt"], "OutputResolution":72, "Options":{"EnableWordSorting":false, "SeparateTextBlocks":false}}

So, if the TextFormattingOptions option is not set explicitly, the text is saved to the output file in UTF-16 format, and the word order is the same as in the PDF content stream.

The following output formats are currently supported:
1. txt: text is saved to the output file in UTF-16 format;
2. utf8txt: text is saved to the output file in UTF-8 format;
3. Hocr: text is saved to the output file in hOCR format.

If EnableWordSorting is set to true, the words in the output file are reordered according to their coordinates in the PDF document.
The OutputResolution option affects only hOCR output: it specifies the resolution of the processed pages that is used to calculate the positions and sizes of all bounding boxes. The default resolution for PDF documents is 72 dpi.
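The JSON value described above can be assembled programmatically rather than typed by hand, which avoids quoting mistakes. The following is a minimal Python sketch under stated assumptions: the function name build_text_formatting_options is hypothetical, the helper only produces the JSON string, and it does not call the toolbox itself.

```python
import json

# Output format identifiers listed in this section.
SUPPORTED_FORMATS = {"txt", "utf8txt", "Hocr"}


def build_text_formatting_options(output_format="txt", resolution=72,
                                  enable_word_sorting=False,
                                  separate_text_blocks=False):
    """Return a JSON string for the TextFormattingOptions property.

    Hypothetical helper: it mirrors the documented default object and
    rejects output formats that this section does not list.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    options = {
        "OutputFormats": [output_format],
        "OutputResolution": resolution,
        "Options": {
            "EnableWordSorting": enable_word_sorting,
            "SeparateTextBlocks": separate_text_blocks,
        },
    }
    return json.dumps(options)


# Calling the helper without arguments reproduces the documented defaults.
default_value = build_text_formatting_options()

# hOCR output with bounding boxes computed at 300 dpi and words sorted
# by their coordinates on the page.
hocr_value = build_text_formatting_options("Hocr", 300,
                                           enable_word_sorting=True)
```

The resulting string would then be passed as the value of TextFormattingOptions in whatever way the calling environment sets toolbox properties.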

Set

TextSelectionFilter

Optional. Filters the exported text by its visibility (visible or invisible) within the PDF document and by special content markers (tags).

The options are specified as a JSON object. Currently, only filtering by predefined text groups is supported.

Example:
TextSelectionFilter = {"groups": ["any_visible", "cibocr_invisible", "others_invisible", …]}

The following groups may be combined freely within the groups array:
1. any_visible: any visible text, in both marked and unmarked content;
2. any_invisible: any invisible text, in both marked and unmarked content;
3. simple_invisible: invisible text in unmarked content;
4. cibocr_invisible: invisible text in content marked with the CIB_HOCR tag;
5. others_invisible: invisible text in content marked with tags other than CIB_HOCR;
6. marked_invisible: invisible text in content marked with the tag specified in the TextMark property (CIB_HOCR by default).

Note:
The text group any_invisible is a composite group: it includes all groups whose names end with the suffix _invisible. So, to extract all text from a PDF, set the groups array as follows:
{"groups": ["any_visible", "any_invisible"]}
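A filter value can be built and checked in the same way before it is handed to the toolbox. The Python sketch below is illustrative only: the function name build_text_selection_filter is hypothetical, and the check simply rejects group names that are not listed in this section.

```python
import json

# Group names listed in this section.
KNOWN_GROUPS = {
    "any_visible",
    "any_invisible",
    "simple_invisible",
    "cibocr_invisible",
    "others_invisible",
    "marked_invisible",
}


def build_text_selection_filter(groups):
    """Return a JSON string for the TextSelectionFilter property.

    Hypothetical helper: it validates the group names against the list
    above and keeps them in the order given.
    """
    groups = list(groups)
    unknown = [g for g in groups if g not in KNOWN_GROUPS]
    if unknown:
        raise ValueError(f"unknown text groups: {unknown}")
    return json.dumps({"groups": groups})


# Extract all text, visible and invisible.
all_text_filter = build_text_selection_filter(["any_visible", "any_invisible"])
```

Validating the names up front catches typing errors, such as a missing quote or a misspelled group, before the value reaches the toolbox.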

Set