Java Open source Text Mining and Information Extraction tools

Here are some of the open source tools for text mining: information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more.

Weka - a collection of machine learning algorithms for data mining. It is probably the most widely used text classification framework. It implements a wide variety of algorithms, including Naive Bayes and SVM (listed under SMO). [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch.] A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.
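
To give a feel for what a Naive Bayes text classifier does under the hood, here is a minimal plain-Java sketch. It is not Weka's API: the class name TinyNaiveBayes, its methods, and the toy training sentences are all made up for illustration, and it assumes a multinomial model with add-one (Laplace) smoothing and a crude regex tokenizer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; it does not use or mirror Weka's classes.
final class TinyNaiveBayes {
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Crude tokenizer (an assumption): lowercase, split on non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    // Record one labeled training document.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : tokenize(text)) {
            if (w.isEmpty()) continue;
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Return the label with the highest log prior plus summed log likelihoods.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String w : tokenize(text)) {
                if (w.isEmpty()) continue;
                int c = counts.getOrDefault(w, 0);
                // Add-one smoothing keeps unseen words from zeroing the score.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("spam", "buy cheap pills now");
        nb.train("spam", "cheap pills discount");
        nb.train("ham", "meeting at noon tomorrow");
        nb.train("ham", "lunch meeting notes");
        System.out.println(nb.classify("cheap pills"));
    }
}
```

A real toolkit adds proper tokenization, feature selection, and evaluation on top of this, which is most of the reason to use one.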

Mallet - a collection of tools in Java for statistical NLP, text classification, clustering, and IE, created by Andrew McCallum's group at UMass. (Note that Bow and Rainbow are precursors written in C while he was at CMU. Bow is fast and contains implementations of Naive Bayes, k-nearest neighbor, TFIDF, and probabilistic indexing.)
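
TFIDF comes up in almost every one of these toolkits, so here is a small plain-Java sketch of the weighting itself. It is not code from Mallet or Bow: the class TfIdf and its weights method are hypothetical names, and it assumes the simplest variant (raw term count times the natural log of N over document frequency) on documents that are already tokenized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mallet or Bow.
final class TfIdf {
    // For each document, map each of its terms to tf * ln(N / df).
    static List<Map<String, Double>> weights(List<List<String>> docs) {
        // Document frequency: in how many documents does each term appear?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency: raw count of each term within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weighted = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log(docs.size() / (double) df.get(e.getKey()));
                weighted.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weighted);
        }
        return result;
    }
}
```

With this variant, a term that appears in every document gets a weight of zero, which is the point: words like "the" carry no information for telling documents apart.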

LingPipe - Alias-i's LingPipe is a Java toolkit for information extraction and data mining (entity extraction, part-of-speech tagging, clustering, classification, etc.), not to mention string similarity. It is one of the most mature and widely used open source IE toolkits in industry. I recently noticed an informative post on their blog about Jaro-Winkler string comparison (developed at the Census Bureau, it is also useful for related "database linkage" problems). They also have a good list of links to competing tools, both academic and industrial.

GATE - one of the leading toolkits for text mining and information extraction. It has a nice GUI. One of the components it is distributed with is ANNIE, which stands for "A Nearly-New IE system." It is maintained by the NLP group at the University of Sheffield.

NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It was written by Steven Bird, from the University of Melbourne. (Note that, unlike the other tools on this list, NLTK is written in Python rather than Java.)

OpenNLP - hosts a variety of Java-based NLP tools that perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and co-reference analysis using the Maxent machine learning package.

Stanford Parser - a Java package for sentence parsing from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It is released under the full GNU GPL.

OpenEphyra - a state-of-the-art open framework for Question Answering. It is a full-featured, end-to-end system for QA written in Java and developed at CMU's Language Technologies Institute (LTI). It is released under the GNU GPL.

Carrot2 - Open source search result clustering software in Java. It is designed for Lucene and works as an add-on for Nutch. There is a commercial version called Lingo 3G.

String Similarity
SecondString - A collection of approximate string matching tools (for those record linkage problems). It also has an implementation of the Jaro-Winkler string distance metric. It was written by William Cohen from CMU.
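
Since Jaro-Winkler keeps coming up, here is a minimal plain-Java sketch of the metric as I understand it. It is not SecondString's or LingPipe's code: the class and method names are my own, and it assumes the common textbook parameters (prefix scale 0.1, at most 4 prefix characters) without the boost threshold that some implementations apply.

```java
// Hypothetical illustration class; not taken from SecondString or LingPipe.
final class JaroWinkler {
    // Jaro similarity in [0, 1]: based on matching characters and transpositions.
    static double jaro(String s, String t) {
        if (s.equals(t)) return 1.0;
        int ls = s.length(), lt = t.length();
        if (ls == 0 || lt == 0) return 0.0;
        // Characters only match if they are within this distance of each other.
        int window = Math.max(0, Math.max(ls, lt) / 2 - 1);
        boolean[] sMatched = new boolean[ls];
        boolean[] tMatched = new boolean[lt];
        int matches = 0;
        for (int i = 0; i < ls; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lt - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < ls; i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / ls + m / lt + (m - transpositions) / m) / 3.0;
    }

    // Winkler's adjustment: reward a shared prefix of up to 4 characters.
    static double similarity(String s, String t) {
        double j = jaro(s, t);
        int max = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < max && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        // The classic example pair; the result is roughly 0.961.
        System.out.println(similarity("MARTHA", "MARHTA"));
    }
}
```

The prefix bonus is what makes the metric a good fit for names in record linkage, where typos tend to fall later in the string.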

MinorThird - Another toolkit for text classification and entity extraction, by William Cohen at CMU. It has some notable differences from the other toolkits mentioned; see the page for details. (I'm not as familiar with this one, so I'm taking his word for it.)

SimMetrics - Another string similarity package. This is maintained by the University of Sheffield (the makers of the aforementioned GATE IE package).

The University of Sheffield, UMass Amherst, and CMU have active programs contributing Java toolkits in this area. Hopefully you found this list helpful; it was useful for organizing my bookmarks. I hope to write about some of these tools in more detail in future posts.