Description: This set of Java classes provides basic support for comparing two text files: a reference file (a ground-truth document) and the output of an OCR engine (a text file). Options for specific behavior include: ignore case, ignore diacritics, ignore punctuation, ignore stop-words, and Unicode and user-defined equivalences between characters. It can be used through the provided graphical user interface (GUI) or from the command line.
Supported input formats: plain text, FineReader 10 XML, PAGE XML, ALTO XML and hOCR HTML.
Output: a report with statistics (including the CER and WER error rates) and a table with the parallel input texts where the differences are highlighted.
Usage notes: A gentle introduction to OCR evaluation and to this tool can be found at https://sites.google.com/site/textdigitisation/
Requirements: Java 1.7
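The CER and WER statistics in the report are commonly defined as the edit distance between the reference and the OCR output, divided by the length of the reference, counted in characters or in words respectively. The following is a minimal sketch under that assumption. The class and method names (ErrorRateSketch, normalize, cer, wer) are hypothetical and are not the tool's API, and the tool's exact definitions and options may differ.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of the tool's actual API.
final class ErrorRateSketch {

    /** Optionally folds case and strips diacritics before comparison. */
    static String normalize(String s, boolean ignoreCase, boolean ignoreDiacritics) {
        String t = s;
        if (ignoreCase) {
            t = t.toLowerCase(Locale.ROOT);
        }
        if (ignoreDiacritics) {
            // Decompose accented letters, then drop the combining marks.
            t = Normalizer.normalize(t, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        }
        return t;
    }

    /** Splits a string into one-character tokens. */
    static String[] chars(String s) {
        String[] out = new String[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = String.valueOf(s.charAt(i));
        }
        return out;
    }

    /** Splits a string into whitespace-separated words. */
    static String[] words(String s) {
        String t = s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /** Levenshtein distance (insertions, deletions, substitutions) between token sequences. */
    static int editDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    /** Edit distance divided by the reference length (assumed definition). */
    static double errorRate(String[] reference, String[] output) {
        if (reference.length == 0) {
            throw new IllegalArgumentException("reference must not be empty");
        }
        return editDistance(reference, output) / (double) reference.length;
    }

    static double cer(String reference, String output) {
        return errorRate(chars(reference), chars(output));
    }

    static double wer(String reference, String output) {
        return errorRate(words(reference), words(output));
    }

    public static void main(String[] args) {
        // With case and diacritics ignored, these two strings compare as equal.
        String ref = normalize("Ni\u00f1o", true, true);
        String ocr = normalize("nino", true, true);
        System.out.println("CER: " + cer(ref, ocr));
        System.out.println("WER: " + wer("the cat sat", "the cat sit"));
    }
}
```

Normalizing both texts before measuring the distance is one simple way to realize options such as "ignore case" and "ignore diacritics"; the sketch does not cover punctuation, stop-words or user-defined equivalences.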
Description: Performs external DTD simplification based on previously tagged text, as described in Bia, Carrasco and Forcada. Parameter entities are replaced at every model group and simplified independently. The behaviour with namespaces has not been checked.
Requirements: STL and libxml2.
Description: Generates equitable job assignments with a uniform temporal distribution.
Requirements: Java 1.6 and jexcel API.
Description: Statistical parser based on the extension described by Chappelier and Rajman. The grammar cannot contain empty rules (that is, rules with an empty right-hand side). Unit production chains are followed up to N steps (the default N=4 can be #defined at compile time).
Requirements: STL.
Usage notes:
The input text contains sentences (one per line; words separated by whitespace) such as Pierre Vinken , 61 years old, will join the board as a nonexecutive director Nov. 29.
The grammar file contains rules (one per line) such as
1086 NP NNP NNP
219 NP CD NNS
11 ADJP NP JJ
4 NP NP , ADJP ,
Each line starts with the number of times the rule is used in the training set (or its probability). The first symbol after that number is the left-hand side (lhs) of the production; the remaining symbols form its right-hand side.
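As an illustration of this rule format, the sketch below reads rule lines and, assuming the leading numbers are counts, turns them into relative frequencies per left-hand side (count divided by the total count of all rules sharing that lhs). The parser itself is built on the STL; this Java sketch is independent of it, the names GrammarFileSketch, Rule, parseRule and totalsByLhs are hypothetical, and relative-frequency estimation is an assumption about how the counts might be used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the rule format; not the parser's own code.
final class GrammarFileSketch {

    /** One production: a weight (count or probability), an lhs and a non-empty rhs. */
    static final class Rule {
        final double weight;
        final String lhs;
        final List<String> rhs;

        Rule(double weight, String lhs, List<String> rhs) {
            this.weight = weight;
            this.lhs = lhs;
            this.rhs = rhs;
        }
    }

    /** Parses a line such as "1086 NP NNP NNP" into weight, lhs and rhs. */
    static Rule parseRule(String line) {
        String[] tok = line.trim().split("\\s+");
        if (tok.length < 3) {
            // The grammar cannot contain rules with an empty right-hand side.
            throw new IllegalArgumentException("rule needs a weight, an lhs and a non-empty rhs: " + line);
        }
        double weight = Double.parseDouble(tok[0]);
        List<String> rhs = Arrays.asList(Arrays.copyOfRange(tok, 2, tok.length));
        return new Rule(weight, tok[1], rhs);
    }

    /** Sums the weights of all rules that share the same lhs. */
    static Map<String, Double> totalsByLhs(List<Rule> rules) {
        Map<String, Double> totals = new HashMap<String, Double>();
        for (Rule r : rules) {
            Double old = totals.get(r.lhs);
            totals.put(r.lhs, (old == null ? 0.0 : old) + r.weight);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1086 NP NNP NNP",
            "219 NP CD NNS",
            "11 ADJP NP JJ",
            "4 NP NP , ADJP ,"
        };
        List<Rule> rules = new ArrayList<Rule>();
        for (String line : lines) {
            rules.add(parseRule(line));
        }
        Map<String, Double> totals = totalsByLhs(rules);
        for (Rule r : rules) {
            // Relative frequency: rule count divided by the total count for its lhs.
            double p = r.weight / totals.get(r.lhs);
            System.out.println(r.lhs + " -> " + r.rhs + "  p=" + p);
        }
    }
}
```

Note that punctuation tokens such as "," are ordinary right-hand-side symbols in this format, so splitting on whitespace is enough to recover them.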
Description: Hyphenates Spanish words as described by J. Mañas. Input is a list of words or text files (option -f).
Requirements: Java 1.5 (or higher).