Software

ocrevalUAtion

Provides basic support for comparing two text files: a reference file (a ground-truth document) and the output of an OCR engine. It can be used through the provided graphical user interface (GUI) or from the command line. Supported input formats include plain text, FineReader 10 XML, PAGE XML, ALTO XML and hOCR HTML.

For additional information, please visit the GitHub repository.
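As an illustration of what comparing a reference text with OCR output can involve, the sketch below computes a character error rate as the edit distance divided by the reference length. This is an assumption about a typical metric, not a description of how ocrevalUAtion is implemented; the names `edit_distance` and `char_error_rate` are hypothetical, and the strings are compared byte by byte, so multi-byte UTF-8 characters would count as several symbols.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance between a reference string and an OCR output string.
// Hypothetical helper: compares bytes, not Unicode code points.
std::size_t edit_distance(const std::string& ref, const std::string& ocr) {
    std::vector<std::size_t> prev(ocr.size() + 1), curr(ocr.size() + 1);
    for (std::size_t j = 0; j <= ocr.size(); ++j) prev[j] = j;  // insertions only
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        curr[0] = i;  // deletions only
        for (std::size_t j = 1; j <= ocr.size(); ++j) {
            std::size_t subst = prev[j - 1] + (ref[i - 1] == ocr[j - 1] ? 0 : 1);
            curr[j] = std::min({subst, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[ocr.size()];
}

// Character error rate: edit operations per reference character.
// For an empty reference, returns 0 if the output is also empty, 1 otherwise.
double char_error_rate(const std::string& ref, const std::string& ocr) {
    if (ref.empty()) return ocr.empty() ? 0.0 : 1.0;
    return static_cast<double>(edit_distance(ref, ocr)) / ref.size();
}
```

For instance, `char_error_rate(ground_truth, ocr_text)` would be called with the contents of the two files once they have been read into strings.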

Script to obtain DBLP records in BibTeX format

DTDprune: reduce DTD to match XML files

Description

Performs external DTD simplification according to previously tagged text, as described in Bia, Carrasco and Forcada. Parameter entities are replaced at every model group and simplified independently. The behaviour with namespaces has not been checked.

Requirements

Uses STL and libxml2. Usually compiled as c++ `xml2-config --cflags` -lxml2 -o dtdprune dtdprune.C

Usage

dtdprune [-s] file.dtd file1.xml [file2.xml ...]

Option -s prints DTD statistics (no simplification is performed).

parser.C

Description

Statistical parser based on the extension described by Chappelier and Rajman. The grammar cannot contain empty rules (that is, rules with an empty rhs). Unit production chains are followed up to N steps (the default N=4 can be #defined at compile time).

Requirements

Uses STL. Compiled as c++ -D N=6 -oparser parser.C.

Usage

parser grammar_file [-S initial_variable] < text

- Text contains sentences (one per line; words separated by whitespace) such as
    - Pierre Vinken , 61 years old, will join the board as a nonexecutive director Nov. 29.
- The grammar file contains rules (one per line) such as
    - 1086 NP NNP NNP
    - 219 NP CD NNS
    - 11 ADJP NP JJ
    - 4 NP NP , ADJP ,
- Each line contains the number of times the rule is used in the training set (or its probability). The first variable is the lhs of the production.
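The rule format described above (a count or probability, then the lhs, then the rhs symbols) could be read along the following lines. This is a minimal sketch and not the code in parser.C: the `Rule` struct and the functions `parse_rule` and `normalize` are hypothetical, and turning counts into probabilities by relative frequency per lhs is an assumption about how the counts might be used.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One grammar rule as it appears on a line of the grammar file.
struct Rule {
    double weight;                 // count in the training set, or a probability
    std::string lhs;               // left-hand side variable
    std::vector<std::string> rhs;  // right-hand side symbols
};

// Parses a line such as "1086 NP NNP NNP".
// Returns false for malformed lines and for rules with an empty rhs,
// which the grammar is not allowed to contain.
bool parse_rule(const std::string& line, Rule& rule) {
    std::istringstream in(line);
    rule.rhs.clear();
    if (!(in >> rule.weight >> rule.lhs)) return false;
    std::string symbol;
    while (in >> symbol) rule.rhs.push_back(symbol);
    return !rule.rhs.empty();
}

// Assumed relative-frequency estimate: divide each weight by the
// total weight of all rules sharing the same lhs.
void normalize(std::vector<Rule>& rules) {
    std::map<std::string, double> total;
    for (const Rule& r : rules) total[r.lhs] += r.weight;
    for (Rule& r : rules) {
        if (total[r.lhs] > 0) r.weight /= total[r.lhs];
    }
}
```

Under this assumption, the weights of all rules with the same lhs sum to one after calling `normalize`.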

Spanish Hyphenator

Description

Hyphenates Spanish words as described by J. Mañas.

Requirements

Java 1.5 (or higher).

Usage

java Hyphenator [-f] input1 input2 ....

Input is a list of words or files (option -f).
