TextProcessor

TextProcessor is a Java text processing toolkit that provides frequently used functions such as stemming, stop-word removal, vocabulary generation, and computation of the term-document frequency matrix. Basic topic mining models such as LDA and sparse NMF are also supported. The package can also generate feature files from a given text dataset in LDA and LIBSVM formats for downstream tasks such as classification or clustering. The toolkit is also being extended to support more advanced text analysis tasks based on natural language processing techniques.

New functionality:

Extract an NP-VP-NP triplet from a sentence. This is useful when you want to analyze who performs which action on what. For example, if the input is "Chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.", the output triplet is "Rudolph Agnew/NP NP; was named/VP VP; a director/NP NP".

SourceForge:

https://sourceforge.net/projects/textprocessor/files/

Data format:

TextProcessor can process either a data directory containing separate text files or a single merged text file with one document per line. Tokens are delimited by a whitespace character. If labels are available, the label is expected to be the first token. More specifically, the content of a text file has the following format:

Label[\t]token1[whitespace]token2[whitespace]token3[whitespace]... if labels are available or

token1[whitespace]token2[whitespace]token3[whitespace]... if labels are unknown
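
As an illustration of the format above, here is a minimal sketch that splits one document line into an optional label and its tokens. The class DocLineParser and its members are hypothetical names used only for this example; they are not part of the TextProcessor API, and the toolkit's own parsing may differ in details.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the TextProcessor API.
final class DocLineParser {

    /** One parsed document: its label (null when unlabeled) and its tokens. */
    static final class Doc {
        final String label;
        final List<String> tokens;

        Doc(String label, List<String> tokens) {
            this.label = label;
            this.tokens = tokens;
        }
    }

    /** Parses a line of the form "Label<TAB>token1 token2 ..." or "token1 token2 ...". */
    static Doc parse(String line, boolean hasLabel) {
        String label = null;
        String body = line.trim();
        if (hasLabel) {
            // The label is the first token; cut at the first run of whitespace.
            String[] parts = body.split("\\s+", 2);
            label = parts[0];
            body = parts.length > 1 ? parts[1] : "";
        }
        List<String> tokens = new ArrayList<>();
        for (String token : body.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return new Doc(label, tokens);
    }
}
```

Calling DocLineParser.parse(line, true) on a labeled line yields the label and the remaining tokens; with hasLabel set to false, every token is treated as document content.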

Usage:

(1) Command line

The default class path is lib/*, so all dependency libraries should be placed in a folder named "lib" in the same directory as TextProcessor.jar.

Syntax: java -jar TextProcessor.jar [options]

options:

-workSpacePath : workspace path, i.e., the parent folder of the data

-mergedFileName : Name of the single merged text file with one document per line

-dataDirName : Name of the data directory which stores separate text data files

-ext : Extension of text files to be processed

-hasLabel : whether the input documents are labeled (true) or not (false)

-verbose : whether to show the processing log (true) or not (false)

-doesStem : whether document contents should be stemmed (true) or not (false)

-doTopicMining : whether to run the topic mining task (true) or not (false)

-method : topic mining method

LDA : Latent Dirichlet Allocation

L1NMF : L1-norm regularized NMF

-nTopic : number of topics

-maxIter : maximum number of iterations

-nTopTerm : number of top terms for each topic

For LDA:

-burnIn : number of burn-in steps for Gibbs sampling

-thinInterval : number of steps between samples

-sampleLag : number of steps for sampling

Example:

java -jar TextProcessor.jar -workSpacePath data -dataDirName CNNTest -doesStem false -hasLabel true -verbose false -ext txt -doTopicMining true -method L1NMF -nTopic 3 -nTopTerm 15 -maxIter 500

or

java -jar TextProcessor.jar -workSpacePath data -mergedFileName CNNTest.txt -doesStem false -hasLabel true -verbose false -ext txt -doTopicMining true -method LDA -nTopic 3 -nTopTerm 15 -maxIter 5000 -burnIn 1500 -thinInterval 200 -sampleLag 10

(2) API

To use the TextProcessor package, four String variables and two boolean variables need to be specified:

String workSpacePath: Home path for the data, either a single merged text file with one document per line, or a data directory which stores separate text data files.

String mergedFileName: Name of the single merged text file with one document per line, not including its parent path. If there is no such file, use an empty string.

String dataDirName: Name of the data directory which stores separate text data files, not including its parent path. If there is no such directory, use an empty string.

String ext: Extension of text files to be processed.

boolean hasLabel: A boolean variable indicating whether the input documents are labeled, with the label of each document given as the first token of its line.

boolean doesStem: A boolean variable indicating if contents of documents should be stemmed.

There are three ways to use the TextProcessor API:

The easiest way to call:

TextProcessor textProcessor = new TextProcessor(workSpacePath, mergedFileName, dataDirName, ext, hasLabel, doesStem);
textProcessor.verbose = verbose;
textProcessor.process();
textProcessor.saveResults();

The second way is to use the built-in Options class to configure the parameters:

Options options = new Options();
options.workSpacePath = workSpacePath;
options.mergedFileName = mergedFileName;
options.dataDirName = dataDirName;
options.ext = ext;
options.hasLabel = hasLabel;
options.doesStem = doesStem;
options.verbose = verbose;
TextProcessor textProcessor = new TextProcessor(options);
textProcessor.process();
textProcessor.saveResults();

Advanced users can use the original way to set up the saving locations and filenames for the generated text files such as vocabulary and doc-term-count file:

TextProcessor textProcessor = new TextProcessor();
textProcessor.verbose = verbose;
textProcessor.doesStem = doesStem;
textProcessor.ext = ext;
String dataDir = workSpacePath;
String mergedFilePath = dataDir + File.separator + mergedFileName;
textProcessor.buildWordCnt(mergedFilePath, hasLabel);
String wordCntFilePath = dataDir + File.separator + "WordCnt.txt";
textProcessor.saveWordCnt(wordCntFilePath);
String wordListFilePath = dataDir + File.separator + "WordList.txt";
textProcessor.buildWordList();
textProcessor.saveWordList(wordListFilePath);
textProcessor.buildWordIDMap();
textProcessor.buildDocTermCountMatrix(mergedFilePath, hasLabel);
String docTermCountFilePath = workSpacePath + File.separator + "DocTermCount.txt";
textProcessor.saveDocTermCountMatrix(docTermCountFilePath);
if (hasLabel) {
    String labelIDMapFilePath = dataDir + File.separator + "LabelIDMap.txt";
    textProcessor.saveLabelIDMap(labelIDMapFilePath);
    String labelFilePath = dataDir + File.separator + "GroundTruth.txt";
    textProcessor.saveGroundTruth(labelFilePath);
}

Output:

There are several text files generated in the home data directory specified by workSpacePath.

DocTermCount.txt -- A doc-term-count text file. Each line corresponds to an element in the doc-term-count matrix with the format (docID,[whitespace]termID):[whitespace]count, e.g., (1, 6): 2 indicates that the word with termID 6 occurs 2 times in the document with docID 1.
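
To show how such a file might be consumed, the following sketch parses lines of the form (docID, termID): count and fills a dense matrix with zero-based indices. DocTermCountReader is a hypothetical name used only for this example and is not part of the TextProcessor API. Accepting a decimal count is an assumption, and the matrix dimensions are inferred from the largest IDs seen, which may be smaller than the real vocabulary size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for DocTermCount.txt lines; not part of the TextProcessor API.
final class DocTermCountReader {

    // Matches "(docID, termID): count"; tolerating a decimal count is an assumption.
    private static final Pattern ENTRY = Pattern.compile(
            "\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)\\s*:\\s*(\\d+(?:\\.\\d+)?)");

    /** Returns {docID, termID, count}, keeping the 1-based IDs as written in the file. */
    static double[] parseLine(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        return new double[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Double.parseDouble(m.group(3))
        };
    }

    /** Builds a dense matrix indexed as [docID - 1][termID - 1]; blank lines are skipped. */
    static double[][] toDense(List<String> lines) {
        List<double[]> entries = new ArrayList<>();
        int nDoc = 0;
        int nTerm = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            double[] e = parseLine(line);
            entries.add(e);
            nDoc = Math.max(nDoc, (int) e[0]);
            nTerm = Math.max(nTerm, (int) e[1]);
        }
        double[][] matrix = new double[nDoc][nTerm];
        for (double[] e : entries) {
            // Shift the file's 1-based IDs to Java's 0-based array indices.
            matrix[(int) e[0] - 1][(int) e[1] - 1] = e[2];
        }
        return matrix;
    }
}
```

A dense array is only practical for small corpora; for larger ones the same parsed entries could be fed into a sparse matrix structure instead.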

WordCnt.txt -- A list of term count pairs with the format: term[whitespace]count. Each line displays a term count pair. The line index is the integer ID for the term.

WordList.txt -- A list of vocabulary terms, one term per line. The line index is the integer ID for the term.
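
Because the line index doubles as the term ID, a vocabulary lookup can be rebuilt by reading the file in order. The sketch below does this with a 1-based counter. VocabularyReader is a hypothetical name used only for this example and is not part of the TextProcessor API; reading the file as UTF-8 is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for WordList.txt; not part of the TextProcessor API.
final class VocabularyReader {

    /** Maps each term to its 1-based line index, which serves as the term ID. */
    static Map<String, Integer> termToId(List<String> lines) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int id = 1;
        for (String line : lines) {
            String term = line.trim();
            if (!term.isEmpty()) {
                ids.put(term, id);
            }
            id++; // every line consumes an ID, so IDs stay aligned with line numbers
        }
        return ids;
    }

    /** Reads the vocabulary file; UTF-8 encoding is an assumption. */
    static Map<String, Integer> load(Path wordListFile) {
        try {
            return termToId(Files.readAllLines(wordListFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting map can be combined with the termID column of DocTermCount.txt to translate matrix columns back into words.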

LDAInputData.txt -- A feature file with LDA required data input format.

LiBLINEARInputData.txt -- A LIBLINEAR formatted input data file.

LabelIDMap.txt -- If labels are available, this file stores the pair of label token and ID with format of label[whitespace]ID.

GroundTruth.txt -- If labels are available, this file stores the ground truth for the text data set. Each line displays a pair of docID and labelID with format of docID[:whitespace]labelID.

All IDs in this package start from 1. From these files, you can convert the data into whatever format is preferred in your programming language of choice.

For more details about the meaning of member variables and how to use the member functions of TextProcessor, please refer to the online documentation.

Dependencies:

The TextProcessor library depends on the Apache Commons Math library, OpenNLP, and the TopicMiner library.

Acknowledgment:

The Porter stemmer and LdaGibbsSampler are integrated into this package. We also provide interfaces for calling them conveniently.

----------------------------

Version:

1.1 11 Dec 2011

Author:

Mingjie Qian