CIB pdf toolbox 2 technical documentation
4. Use cases
4.21. Exporting Text
Options for exporting text from a PDF document
Property | Meaning | Type |
TextExtraction | A flag indicating that text should be extracted from a PDF. By default, only visible text is extracted and saved in Utf16 format. To change this behavior, use the additional options TextFormattingOptions and TextSelectionFilter. | Set |
FillTextOutput | A flag indicating whether the extracted text is saved in memory (1) or not (0). | Set |
TextOutputFilename | Name of the text file to which the extracted text is saved. | Set |
TextFormattingOptions | Optional. Specifies the output format (Utf8, Utf16, Hocr) and enables additional word repositioning. The options are specified as a JSON object. Example: TextFormattingOptions={"OutputFormats":["txt"], "OutputResolution":72, "Options":{"EnableWordSorting":false, "SeparateTextBlocks":false}} If TextFormattingOptions is not set explicitly, the text is saved to the output file in Utf16 format and the word order is the same as in the PDF stream. The following output formats are currently supported: If EnableWordSorting is set to true, the words in the output file are reordered according to their coordinates in the PDF document. | Set |
TextSelectionFilter | Optional. Filters the exported text by its visibility (visible/invisible) within a PDF document and by special content markers (tags). The following groups may be set in any combination within the groups array: | Set |
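The TextFormattingOptions value must be valid JSON with plain double quotes, so generating it from a data structure is less error-prone than typing it by hand. The following Python sketch assembles the properties from the table above as a name-to-value mapping. It is an illustration only: the helper name build_text_export_properties is hypothetical, the value "1" for TextExtraction and the string-typed flag values are assumptions, and how the properties are passed to the toolbox depends on the interface in use and is not shown.

```python
import json


def build_text_export_properties(output_filename, output_formats=("txt",),
                                 resolution=72, sort_words=False,
                                 separate_blocks=False):
    """Assemble the text-export properties as a name -> value mapping.

    Hypothetical helper: it only builds strings and does not call the toolbox.
    """
    # Key names are taken from the TextFormattingOptions example in the table.
    formatting = {
        "OutputFormats": list(output_formats),
        "OutputResolution": resolution,
        "Options": {
            "EnableWordSorting": sort_words,
            "SeparateTextBlocks": separate_blocks,
        },
    }
    return {
        # Assumption: the flag is switched on with "1", like FillTextOutput.
        "TextExtraction": "1",
        # 0 = do not keep the extracted text in memory; write it to the file.
        "FillTextOutput": "0",
        "TextOutputFilename": output_filename,
        # json.dumps emits straight double quotes and lowercase true/false.
        "TextFormattingOptions": json.dumps(formatting, separators=(",", ":")),
    }


if __name__ == "__main__":
    props = build_text_export_properties("extracted.txt", sort_words=True)
    for name, value in props.items():
        print(f"{name}={value}")
```

Serializing with json.dumps avoids the typographic quotes that word processors tend to substitute, which a JSON parser would reject.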