CIB ocr technical manual (EN)

10. Properties Word-Recognition

Recognize
DictionaryPath
InputFormat
WordRecognizerOptions
WordRecognizerResult

Recognize

Property-Name: Recognize
Data-Type: String
Type: Set

 

In order to use the WordRecognizer this property has to be set to “WordRecognizer”.  

Syntax 

Recognize=<Value> 
<Value>: BarcodeRecognizer | OcrRecognizer | WordRecognizer 

default=OcrRecognizer   

Example 

Recognize=WordRecognizer 

 

DictionaryPath

Property-Name: DictionaryPath
Data-Type: String
Type: Set

 

This property can also be defined within WordRecognizerOptions. It is recommended to define it there, since a value given in WordRecognizerOptions overrules this property.

In any case, the dictionary path must be set in at least one of the two places: this property or WordRecognizerOptions.

Syntax 

DictionaryPath=<Value> 

No default value! It has to be set. 

Example 

DictionaryPath=".\\hunspell" 

InputFormat

Property-Name: InputFormat
Data-Type: String
Type: Set

 

This property specifies the format and encoding of the input passed to the WordRecognizer. 

Syntax 

InputFormat=<Value> 
<Value>: HOCR | UTF8 | UTF16 

HOCR: input is an HOCR file, which must be UTF-8 encoded 
UTF8: input is plain text in UTF-8 encoding (with or without a UTF-8 BOM) 
UTF16: input is plain text in UTF-16 encoding; the BOM (FE FF or FF FE) must be present 

Example 

InputFormat=UTF8 
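The BOM rules above determine which InputFormat value fits a given plain-text file. The following is a purely illustrative sketch (not part of CIB ocr) that suggests a value from the first bytes; HOCR must still be chosen explicitly, since HOCR input is UTF-8 encoded markup:

```python
def guess_input_format(data: bytes) -> str:
    """Suggest an InputFormat value from the first bytes of a plain-text file.

    Illustrative only: UTF-16 is detected by its mandatory BOM
    (FE FF or FF FE); everything else is treated as UTF-8, whose BOM
    (EF BB BF) is optional.
    """
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UTF16"
    return "UTF8"
```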

WordRecognizerOptions

This property is defined as a JSON string and contains all the information the WordRecognizer needs to analyse the document. 

 

Property-Name: WordRecognizerOptions
Data-Type: String
Type: Set

 

This property might look like the following example; a more detailed explanation of each component can be found below it:  

Example: 

{ 
"DictionaryPath": "D:\\PROJEKTE-SVN\\products\\CIB ocr\\trunk\\src-test\\testdata\\hunspell",  
"Dictionaries": {"DE": {"Dictionaries": "de_DE_frami-UTF8", "StopwordFiles": "stopword_german.txt", "DigramScores": "de_digramscores.txt"}}, 
"InputFormat": "UTF8",  
"RecognizedWordsFilename": "recognized.log",  
"RejectedWordsFilename": "rejected.log",  
"StatisticsFilename": "statistics.log",  
"StatisticsOutputFormat": "FormatCsv"} 

Explanation of each component: 

InputText (<string>)
(required if property InputFilename / InputMemoryAddress is empty)
Text to parse; must be in UTF-8 format.

DictionaryPath (<string>)
(required) Path to the hunspell folder; may be absolute or relative to the working directory.

Dictionaries (<dictionary-object>)
(required) Dictionaries to use, one or more dictionaries for each language (see below).

InputFormat (<string>)
Specifies the input format (e.g. "UTF8").

RecognizedWordsFilename (<string>)
(optional) Filename for recognized words. Writes the number of occurrences of each recognized word, per language.
A word is considered recognized if it is no stopword and is contained in at least one dictionary for that language.
For the language "<GLOBAL>", all stop word lists are ignored, and a word is recognized if it is contained in at least one dictionary (excluding stop word dictionaries).
The words are written as
<language> <TAB> <count> <TAB> <word>

RejectedWordsFilename (<string>)
(optional) Filename for rejected words. Writes the number of occurrences of each rejected word, per language.
A word is considered rejected if it is no stopword and is not contained in any dictionary for that language.
For the language "<GLOBAL>", all stop word lists are ignored, and a word is rejected if it is not contained in any dictionary (excluding stop word dictionaries).
The words are written as
<language> <TAB> <count> <TAB> <word>

StatisticsFilename (<string>)
(optional) Filename for the summary of the word recognizer run.

StatisticsOutputFormat (<string>)
Defines the output format for the summary:
"FormatText": output is written as tabbed text
"FormatCsv": output is written in CSV format (with ";" as delimiter)
"FormatJSON": the property "WordRecognizerResult" is written to the specified file (as a JSON string)

StatisticsPerPage (<boolean>)
Adds page-wise statistics to the WordRecognizer result.
(If InputFormat is not HOCR, all input text is considered as page 1.)

TextAcceptThreshold (<number>)
Sets the "TextAccepted" flag in the result if the "longer glyph ratio" is at least this value. Only meaningful if LargeWordLimit > 0.

SmallWordLimit (<number>)
(optional) If > 0, words with at most that many characters are counted in the "SmallWord" group.

LargeWordLimit (<number>)
(optional) If > 0, words with at least that many characters are counted in the "LargeWord" group.
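The recognized-words and rejected-words files described above share one line format: <language> <TAB> <count> <TAB> <word>. A small hypothetical helper (not part of the product) for reading such a file back into per-language counts might look like this:

```python
from collections import defaultdict

def parse_word_log(lines):
    """Parse lines of the form '<language>\t<count>\t<word>' into a
    nested dict {language: {word: count}}. Malformed lines are skipped."""
    result = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        language, count, word = parts
        try:
            result[language][word] = int(count)
        except ValueError:
            continue
    return dict(result)
```

For example, the lines "DE<TAB>3<TAB>Haus" and "EN<TAB>1<TAB>house" would yield {"DE": {"Haus": 3}, "EN": {"house": 1}}.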

 

Component “Dictionaries”: 

specification of <dictionary-object> (as used in the examples above): 

{<language-name>: <language>, ...}  

<language-name> = JSON-String: "..." (specifies a language name)  
<language> = JSON-String: "..." (specifies  a single dictionary for that language, no stopwords) 
<language> = JSON-Array: ["...","..."] (specifies one or more dictionaries for that language, no stopwords) 
<language> = JSON-Object: {"Dictionaries": <dictionaries>, "StopwordDictionaries": <stopword-dicts>, "Stopwords": <stopwords>} 
<dictionaries> = JSON-String: "..." (specifies a single dictionary for that language) 
<dictionaries> = JSON-Array: ["...","..."] (specifies one or more dictionaries for that language) 
<stopword-dicts> = JSON-String: "..." (specifies a single stopword dictionary for that language) 
<stopword-dicts> = JSON-Array: ["...","..."] (specifies one or more stopword dictionaries for that language) 
<stopwords> = JSON-Array: ["...", "..."] (specifies a list of stopwords (UTF-8 encoded)) 

Example 1: 

"Dictionaries": {"DE": ["de_DE-frami-UTF8", "de_user"], "EN": "en_US-UTF8"} 

Example 2: 

"Dictionaries": { 
"DE": {"Dictionaries": "de_DE-frami-UTF8", "StopwordDictionaries": "de_stopwords""DigramScores""de_digramscores.txt"}  
"EN": {"Dictionaries": "en_US-UTF8", "Stopwords": ["a", "an", "in"]"DigramScores""de_digramscores.txt} }

WordRecognizerResult

Property-Name: WordRecognizerResult
Data-Type: String
Type: Set

 

This property contains all the output information. It is recommended to set the property mode to "out"; the output XML and the trace file will then contain all results of the WordRecognizer (in addition to the statistics files).  

In version 2.7, WordRecognizerResult is a JSON object as follows:  

{<language-string>: <statistics-object>, ...} 

language-string: one of the language strings given in the WordRecognizerOptions (for instance "EN" for English) 

 
<statistics-object> =  
{ 
"SmallWordCount": <number> number of recognized words (excluding stop words), which are small words (according to SmallWordLimit) 
"LargeWordCount": <number> number of recognized words (excluding stop words), which are large words (according to LargeWordLimit) 
"MainWordCount": <number> number of words which are recognized, and are not stop words 
"StopWordCount": <number> number of words which are in the stop word list/dictionary 
"RejectedWordCount": <number> number of words which are neither stop words nor in one of the language dictionaries 
(for example, most English words are rejected by German dictionaries) 
"TotalWordCount": <number> number of recognized words (including stop words); should equal MainWordCount + StopWordCount  

"SmallWordCoverage": <number> number of characters over all small words 
"LargeWordCoverage": <number> number of characters over all large words 
"MainWordCoverage": <number> number of characters over all recognized words (excluding stop words) 
"StopWordCoverage": <number> number of characters over all stop words 
"RejectedWordCoverage": <number> number of characters over all rejected words 
"TotalWordCoverage": <number> number of characters over all recognized words 
"MainWordCountPerLength": [<length1>, <count1>, <length2>, <count2>, ...] number of occurrences per word length (counting only recognized words which are not stop words) 
"TotalWordCountPerLength": [<length1>, <count1>, <length2>, <count2>, ...] number of occurrences per word length (counting only recognized words, including stop words, but not rejected words)  

"GlyphRatioLongWords": <number> (old) number in percent of long words found in relation to all words. (Glyphs like "%","&" etc are filtered beforehand. 

"LongerGlyphRate": <number> (new) number in percent of not-short words found in relation to all words. (Glyphs like "%","&" are included / and therefore lots of those symbols will reduce this value).  

"DigramScoreArithmetic": <number> number between [0;9] that indicates the text quality based on digramm score tables. There are scoretables for each language. The language chosen in the "All" language is chosen by the language that has the highest TotalWordCount.  

"FulltextQuality": <number> number in percent that indicates text quality. The formula takes the following values into consideration:  

GlyphRatioLongWords, LongerGlyphRate, TotalWordCount, DigramScoreArithmetic 
} 

Note 1: In addition to the languages specified in WordRecognizerOptions, there is an additional language "<GLOBAL>". This (virtual) language consists of all dictionaries over all languages, excluding all stop word dictionaries and stop word lists. This means, if a token (word to check) is contained in at least one of these dictionaries, it is considered as "recognized". Otherwise, it is considered as "rejected" 
Note 2: The character count counts only the characters of the words passed to the spellchecker. The parser may have eliminated blanks, numbers, punctuation marks, quotes, hyphens and such.  
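As stated above, TotalWordCount should equal MainWordCount + StopWordCount. A tiny illustrative check (not part of the product) over a parsed <statistics-object>:

```python
def check_word_counts(stats):
    """Verify the documented invariant
    TotalWordCount == MainWordCount + StopWordCount
    on a parsed <statistics-object> dict. Returns True if it holds."""
    return stats["TotalWordCount"] == stats["MainWordCount"] + stats["StopWordCount"]
```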

 

Since version 2.8, WordRecognizerResult is a JSON object as follows: 

{"DocumentStatistics"<language-statistics> , "PageStatistics": <page-statistics> } 
("PageStatistics" is only present if the "StatisticsPerPage" option is set to true) 
 

<page-statistics> is a JSON object with page numbers as key and <language-statistics>  objects as value. 
Example: {"1": <language-statistics> , "3": <language-statistics> } 
(if the input is HOCR, and a page has no "ppageno" attribute, the page number is "0")  

 

<language-statistics> is a JSON object as follows: 
{"AllLanguages": <word-statistics> , 
"Languages": <language-specific-statistics> , 
"TextAccepted": true | false} 
(TextAccepted is false if the GlyphRatioLongWords of "AllLanguages" is lower than the TextAcceptThreshold specified in the WordRecognizerOptions. The TextAccepted flag of the document-global statistics is also set to false if at least one page has a "longer glyph ratio" below the threshold, even if there are enough other pages to get the global ratio above the limit.)  

 

<language-specific-statistics> is a JSON object as follows: 
{<language-key>: <word-statistics> , ...} 
where <language-key> is one of the language keys defined in the WordRecognizerOptions 
(note: the statistics for all languages combined is now the value of "AllLanguages". The special language key "<GLOBAL>" is no longer used)  

 

<word-statistics> is the same object as <statistics-object> described in 7.5, but with an additional key "GlyphRatioLongWords". The value of this key is defined as LargeWordCoverage * 100 / (TotalWordCoverage + RejectedWordCoverage), i.e. the ratio of glyphs in long words compared to the total number of checked glyphs (excluding blanks, delimiters, and numbers). The value is expressed as an integer (percentage) ranging from 0 to 100.  

Since version 2.14, there are two additional keys: 
"RawGlyphCount": <number> number of glyphs before parsing (excluding whitespaces but including digits, punctuation marks etc.) 
"LongerGlyphRate": calculated as 
(TotalWordCoverage - SmallWordCoverage) / RawGlyphCount.  
This is the ratio of glyphs in recognized words which are not "small", compared to the number of all glyphs including digits etc. (see above). 
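The two ratios defined above can be restated as a short sketch. Field names are taken from the statistics object; the integer-percentage rounding and the zero-denominator handling are assumptions, not taken from the product:

```python
def glyph_ratio_long_words(stats):
    """GlyphRatioLongWords = LargeWordCoverage * 100
    / (TotalWordCoverage + RejectedWordCoverage), as an integer percentage.
    Returns 0 for an empty denominator (assumed behaviour)."""
    denom = stats["TotalWordCoverage"] + stats["RejectedWordCoverage"]
    return stats["LargeWordCoverage"] * 100 // denom if denom else 0

def longer_glyph_rate(stats):
    """LongerGlyphRate = (TotalWordCoverage - SmallWordCoverage)
    / RawGlyphCount, expressed here in percent (rounding assumed)."""
    if not stats["RawGlyphCount"]:
        return 0
    return (stats["TotalWordCoverage"] - stats["SmallWordCoverage"]) * 100 // stats["RawGlyphCount"]
```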

An example of WordRecognizerResult, with multiple pages and languages, might look like this: 

WordRecognizerResult = { 
"DocumentStatistics": { 
"AllLanguages": {"GlyphRatioLongWords": 80, ...}, 
"Languages": { 
"DE": {"GlyphRatioLongWords": 55, ...}, 
"EN": {"GlyphRatioLongWords": 33, ...}  
}, 
"TextAccepted": false  
}, 
"PageStatistics": { 
"1": { 
"AllLanguages": {"GlyphRatioLongWords": 100, ...}, 
"Languages": { 
"DE": {"GlyphRatioLongWords": 100, ...}, 
"EN": {"GlyphRatioLongWords": 16, ...}  
}, 
"TextAccepted": true  
}, 
"2": { 
"AllLanguages": {"GlyphRatioLongWords": 55, ...}, 
"Languages": { 
"DE": {"GlyphRatioLongWords": 0, ...}, 
"EN": {"GlyphRatioLongWords": 55, ...}  
}, 
"TextAccepted": false  
 
 
} 
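Given the structure above, a consumer can easily list the pages whose text was not accepted. An illustrative sketch (the field names come from the example; the function itself is hypothetical):

```python
def rejected_pages(result):
    """Return the page numbers (as strings, sorted) whose "TextAccepted"
    flag is false in the "PageStatistics" part of a parsed
    WordRecognizerResult dict."""
    pages = result.get("PageStatistics", {})
    return sorted(page for page, stats in pages.items()
                  if not stats.get("TextAccepted"))
```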

 

A complete Job XML-Example for Word Recognition might look like this:  

 

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<root> 
<Comod> 
<defaults/> 
<jobs> 
<job name="TextRecognize"> 
<properties> 
<property name="LicenseCompany">Example Company</property>  
<property name="LicenseKey">xxxx-xxxx-xxxxxxxx</property> 
<property name="OutputMode">Xml</property> 
</properties> 
<steps> 
<step name="ocr-step" command="ocr"> 
<properties> 
<property name="LicenseCompany">CIB Demo</property>  
<property name="LicenseKey">xxxx-xxxx-xxxxxxxx</property>  
<property name="InputFilename">..\templates-txt\wikipedia-Deutschland_DE.txt</property>  
<property name="TraceFilename">ocr.txt</property>  
<property name="PageSelection">All</property>  
<property name="Recognize">WordRecognizer</property>  
<property name="WordRecognizerOptions">{ 
"DictionaryPath": "..\\hunspell",  
"Dictionaries":{"deu":{"Dictionaries":"de_DE_frami- UTF8","StopwordFiles":"de_stopwords.txt"}},"InputFormat":"UTF8","RecognizedWordsFilename": "recognizedWords.txt","StatisticsFilename":"statistics.txt","StatisticsOutputFormat":"FormatJSON"}  
</property> 
</properties> 
</step> 
</steps> 
</job> 
</jobs> 
</Comod> 
</root>