The Open Roget’s Project:
A freely available NLP-friendly implementation
of the 1911 Roget's Thesaurus
The Open Roget's Project provides a fully functional lexical resource for Natural Language Processing, based on Roget's Thesaurus. A Java implementation with the 1911 data now has a significantly updated lexicon. The process of updating Roget’s Thesaurus is documented in this paper:
Alistair Kennedy, Stan Szpakowicz (2014). Evaluation of Automatic Updates of Roget’s Thesaurus. Journal of Language Modelling 2(1), 1-49
(open access; download it at the JLM site)
To get Open Roget’s, visit Alistair Kennedy's resource page, or download the tarred and gzipped thesaurus directly. It is available under the Attribution-ShareAlike 4.0 International Licence (CC BY-SA 4.0; details are in the README file inside the archive).
Project Gutenberg offers the unedited 1911 Roget's Thesaurus, which is not quite NLP-friendly.
Please direct questions and comments to Alistair Kennedy or to Stan Szpakowicz.
Thanks to Mario Jarmasz, the author of the original system, which was built on limited-access data, and to Alyona Medelyan for retooling that system to work with the public-domain 1911 data.