Provides basic support for comparing two text files: a reference file (a ground-truth document) and the output of an OCR engine (a text file). It can be used through the provided graphical user interface (GUI) or from the command line. Supported input formats are plain text, FineReader 10 XML, PAGE XML, ALTO XML and hOCR HTML.
For additional information, please visit the github repository.
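As an illustration of what comparing a reference text with OCR output can involve, the sketch below computes a character error rate from the Levenshtein edit distance. The source does not say which metrics the tool reports, so the choice of metric and the names `editDistance` and `characterErrorRate` are assumptions made only for this example. It works on bytes; a real tool would more likely compare Unicode characters.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance between two byte strings: insertions, deletions and
// substitutions all cost 1. Only two rows of the table are kept in memory.
std::size_t editDistance(const std::string& ref, const std::string& hyp) {
    std::vector<std::size_t> prev(hyp.size() + 1), curr(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            std::size_t subst = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            curr[j] = std::min({subst, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[hyp.size()];
}

// Character error rate: edit distance divided by the reference length.
// An empty reference yields 0 for an empty hypothesis and 1 otherwise.
double characterErrorRate(const std::string& ref, const std::string& hyp) {
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;
    return static_cast<double>(editDistance(ref, hyp)) / ref.size();
}
```

Calling `characterErrorRate(groundTruth, ocrOutput)` with the contents of the two files gives the fraction of reference characters that would have to be edited to obtain the OCR output.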
Description
Requirements
Usage
Performs external DTD simplification according to previously tagged text, as described in Bia, Carrasco and Forcada. Parameter entities are replaced at every model group and simplified independently. The behaviour with namespaces has not been checked.
Uses STL and libxml2. Usually compiled as c++ `xml2-config --cflags` -lxml2 -o dtdprune dtdprune.C
dtdprune [-s] file.dtd file1.xml [file2.xml ...]
Option -s prints DTD statistics only; no simplification is performed.
Statistical parser based on the extension described by Chappelier and Rajman. The grammar cannot contain empty rules (that is, rules with an empty rhs). Unit production chains are followed up to N steps (the default N=4 can be #defined at compile time).
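The bound on unit production chains can be pictured with the following sketch, which collects every variable that derives a given symbol through at most a fixed number of unit rules. It is not taken from parser.C: the `UnitRules` type, the `unitAncestors` function and the choice to ignore probabilities are all assumptions made for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Unit rules indexed by their rhs: parents[B] lists every A with a rule A -> B.
// (Hypothetical representation, chosen for this example.)
using UnitRules = std::map<std::string, std::vector<std::string>>;

// Returns the variables that derive `start` through a chain of at least one
// and at most maxSteps unit productions.
std::set<std::string> unitAncestors(const UnitRules& parents,
                                    const std::string& start,
                                    int maxSteps) {
    std::set<std::string> reached;
    std::set<std::string> frontier{start};
    for (int step = 0; step < maxSteps && !frontier.empty(); ++step) {
        std::set<std::string> next;
        for (const std::string& sym : frontier) {
            auto it = parents.find(sym);
            if (it == parents.end()) continue;
            for (const std::string& parent : it->second) {
                // Expand each variable only the first time it is reached.
                if (reached.insert(parent).second) next.insert(parent);
            }
        }
        frontier.swap(next);
    }
    return reached;
}
```

With rules A -> B, B -> C and C -> D, a bound of 2 starting from D reaches only C and B, while a bound of 4 also reaches A. A larger bound finds longer chains at the cost of more work per cell of the parse table.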
Uses STL. Compiled as c++ -D N=6 -oparser parser.C.
parser grammar_file [-S initial_variable] < text
The text contains sentences (one per line; words separated by whitespace) such as
Pierre Vinken , 61 years old, will join the board as a nonexecutive director Nov. 29.
The grammar file contains rules (one per line) such as
1086 NP NNP NNP
219 NP CD NNS
11 ADJP NP JJ
4 NP NP , ADJP ,
Each line starts with the number of times the rule is used in the training set (or its probability). The first variable after that number is the lhs of the production; the remaining symbols form the rhs.
Hyphenates Spanish words as described by J. Mañas.
java Hyphenator [-f] input1 input2 ...
The inputs are words or, with option -f, files.