CIB pdf toolbox 2 technical documentation: Exporting Text

4. Usecases

4.21. Exporting Text

Options for exporting text from a PDF document

Property	Meaning	Type
TextExtraction	A flag indicating that text should be extracted from a PDF. By default, only visible text is extracted and saved in Utf16 format. To change this behavior, use the additional options TextFormattingOptions and TextSelectionFilter.	Set
FillTextOutput	A flag indicating if the extracted text is saved in memory (1) or not (0).	Set
TextOutputFilename	Name of the text file to which the extracted text is saved.	Set
TextFormattingOptions	Optional. Specifies the output format (Utf8, Utf16, Hocr) and can enable additional word repositioning. The option is specified as a JSON object. Example (these are the default formatting options): TextFormattingOptions={"OutputFormats":["txt"], "OutputResolution":72, "Options":{"EnableWordSorting":false, "SeparateTextBlocks":false}} If TextFormattingOptions is not set explicitly, the text is therefore saved to the output file in Utf16 format, and the word order is the same as in the PDF stream. The following output formats are currently supported: 1. txt: text is saved to the output file in Utf16 format; 2. utf8txt: text is saved to the output file in Utf8 format; 3. Hocr: text is saved to the output file in HOCR format. If EnableWordSorting is set to true, the words in the output file are reordered according to their coordinates in the PDF document. The OutputResolution option affects HOCR output only: it specifies the resolution of the processed pages that is used to calculate the positions and sizes of all bounding boxes. The default resolution for PDF documents is 72 dpi.	Set
TextSelectionFilter	Optional. Filters the exported text by its visibility (visible/invisible) within the PDF document and by special content markers (tags). The option is specified as a JSON object. Currently, only filtering by predefined text groups is supported. Example: TextSelectionFilter = {"groups": ["any_visible", "cibocr_invisible", "others_invisible", …]} The following groups may be set in any combination within the groups array: 1. any_visible: any visible text, in both marked and unmarked content; 2. any_invisible: any invisible text, in both marked and unmarked content; 3. simple_invisible: invisible text within unmarked content; 4. cibocr_invisible: invisible text within content marked with the CIB_HOCR tag; 5. others_invisible: invisible text within content marked with tags other than CIB_HOCR; 6. marked_invisible: invisible text within content marked with the tag specified in the TextMark property (CIB_HOCR is the default). Note: The text group any_invisible is a composite group: it includes all groups with the suffix _invisible. To extract all text from a PDF, set the groups array to {"groups": ["any_visible", "any_invisible"]}	Set
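
Because TextFormattingOptions and TextSelectionFilter are both passed as JSON objects, it can help to generate their values programmatically instead of typing them by hand, which avoids quoting mistakes. The following Python sketch only builds the two JSON strings from the values listed in the table above. The helper names formatting_options and selection_filter are hypothetical and not part of the toolbox, and how the resulting strings are handed to CIB pdf toolbox is deliberately not shown.

```python
import json

# Output format names, spelled as they appear in the table above.
SUPPORTED_FORMATS = {"txt", "utf8txt", "Hocr"}

# Text group names documented for TextSelectionFilter.
KNOWN_GROUPS = {
    "any_visible",
    "any_invisible",
    "simple_invisible",
    "cibocr_invisible",
    "others_invisible",
    "marked_invisible",
}


def formatting_options(output_format="txt", resolution=72,
                       word_sorting=False, separate_blocks=False):
    """Return a JSON string for TextFormattingOptions (hypothetical helper).

    The default arguments mirror the documented default options.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return json.dumps({
        "OutputFormats": [output_format],
        "OutputResolution": resolution,  # only relevant for HOCR output
        "Options": {
            "EnableWordSorting": word_sorting,
            "SeparateTextBlocks": separate_blocks,
        },
    })


def selection_filter(groups):
    """Return a JSON string for TextSelectionFilter (hypothetical helper)."""
    unknown = [g for g in groups if g not in KNOWN_GROUPS]
    if unknown:
        raise ValueError(f"unknown text groups: {unknown}")
    return json.dumps({"groups": list(groups)})


# Utf8 output with words reordered by their page coordinates.
utf8_sorted = formatting_options("utf8txt", word_sorting=True)

# All text, visible and invisible, as described in the note above.
all_text = selection_filter(["any_visible", "any_invisible"])
```

Validating the format and group names before serializing is a design choice of this sketch: a misspelled name raises a ValueError in the calling code rather than being passed on in the property value.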
