SpeedRead

Abstract

Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entity extraction systems. However, progress in deploying these approaches at web scale has been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times faster than the Stanford NLP pipeline. The pipeline consists of a high-performance Penn Treebank-compliant tokenizer, a close to state-of-the-art part-of-speech (POS) tagger, and a knowledge-based named entity recognizer.

 Tokenizer

 Part of Speech Tagger

 Named Entity Chunker
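
The three stages listed above can be chained as in the toy sketch below. It is only an illustration of the tokenizer, POS tagger, and knowledge-based chunker flow, not SpeedRead's actual code or API: the regular expression, the tiny tag lexicon, the gazetteer, and every function name are hypothetical stand-ins chosen for this example.

```python
import re

# Hypothetical toy resources; SpeedRead's real models and knowledge base differ.
TAG_LEXICON = {"the": "DT", "in": "IN", "does": "VBZ", "n't": "RB",
               "live": "VB", ".": "."}
GAZETTEER = {("New", "York"): "LOC", ("Steven",): "PER"}
MAX_ENTITY_LEN = max(len(key) for key in GAZETTEER)

# Rough Penn Treebank-style splitting: "doesn't" -> "does", "n't";
# punctuation becomes its own token. Far from a complete PTB tokenizer.
TOKEN_RE = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")


def tokenize(text):
    """Stage 1: split raw text into tokens."""
    return TOKEN_RE.findall(text)


def pos_tag(tokens):
    """Stage 2: lexicon lookup with simple fallbacks for unknown words."""
    tagged = []
    for tok in tokens:
        tag = TAG_LEXICON.get(tok.lower())
        if tag is None:
            if tok[0].isupper():
                tag = "NNP"   # assume unknown capitalized words are proper nouns
            elif tok.isdigit():
                tag = "CD"
            else:
                tag = "NN"
        tagged.append((tok, tag))
    return tagged


def chunk_entities(tagged):
    """Stage 3: greedy longest-match gazetteer lookup starting at NNP tokens."""
    tokens = [tok for tok, _ in tagged]
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        if tagged[i][1] == "NNP":
            longest = min(MAX_ENTITY_LEN, len(tokens) - i)
            for n in range(longest, 0, -1):
                label = GAZETTEER.get(tuple(tokens[i:i + n]))
                if label is not None:
                    entities.append((" ".join(tokens[i:i + n]), label))
                    i += n
                    matched = True
                    break
        if not matched:
            i += 1
    return entities


def run_pipeline(text):
    """Chain the three stages: tokenize, tag, then chunk."""
    return chunk_entities(pos_tag(tokenize(text)))


print(run_pipeline("Steven doesn't live in New York."))
```

Each stage consumes only the output of the previous one, so any of them could be swapped for a stronger component without touching the others.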

Paper

http://www.aclweb.org/anthology/C12-1004

Code

https://bitbucket.org/aboSamoor/speedread

Bibtex

@InProceedings{alrfou-skiena:2012:PAPERS,
  author    = {Al-Rfou, Rami and Skiena, Steven},
  title     = {{S}peed{R}ead: A Fast Named Entity Recognition Pipeline},
  booktitle = {Proceedings of COLING 2012},
  month     = {December},
  year      = {2012},
  address   = {Mumbai, India},
  publisher = {The COLING 2012 Organizing Committee},
  pages     = {51--66},
  url       = {http://www.aclweb.org/anthology/C12-1004}
}