Polyglot-NER

Abstract

The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require human-annotated NER datasets or language-specific resources like treebanks, parallel corpora, and orthographic rules. The novelty of our approach lies in using only language-agnostic techniques while achieving competitive performance.

Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise.

Our evaluation is twofold. First, we demonstrate the system's performance on human-annotated datasets. Second, for languages where no gold-standard benchmarks are available, we propose a new method, distant evaluation, based on statistical machine translation.
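The two preprocessing stages named in the abstract (oversampling and exact surface form matching) can be illustrated with a small toy sketch. This is only one plausible reading of those terms, not the procedure from the paper: the function names `match_surface_forms` and `oversample`, the single-token matching, and the balance-to-the-largest-class rule are all assumptions made for illustration.

```python
import random
from collections import Counter


def match_surface_forms(tokens, labels, outside="O"):
    """Copy an entity label to unlabeled tokens with the exact same surface form.

    Assumption: matching is done per token; multi-word mentions are ignored here.
    """
    known = {}
    for token, label in zip(tokens, labels):
        if label != outside:
            # Keep the first label seen for each surface form.
            known.setdefault(token, label)
    return [
        known.get(token, label) if label == outside else label
        for token, label in zip(tokens, labels)
    ]


def oversample(examples, seed=0):
    """Duplicate random examples of rarer classes until all classes are equal in size.

    Assumption: every class is grown to the size of the largest class.
    """
    rng = random.Random(seed)
    by_label = {}
    for token, label in examples:
        by_label.setdefault(label, []).append((token, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Toy data: only the first "Paris" carries a label (e.g. from a wiki link).
tokens = ["Paris", "is", "nice", ".", "Paris", "rocks"]
labels = ["LOC", "O", "O", "O", "O", "O"]

matched = match_surface_forms(tokens, labels)
balanced = oversample(list(zip(tokens, matched)))
label_counts = Counter(label for _, label in balanced)
```

In this toy run the second "Paris" inherits the label of the first, and the entity class is then duplicated until it is as frequent as the non-entity class.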

Software

The software and the models are available as a Python package. You can try them today :)

$ sudo pip install polyglot

$ polyglot download TASK:ner2

$ polyglot download TASK:embeddings2

$ polyglot --lang en tokenize --input file.txt | polyglot --lang en ner
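The same annotation can also be run from Python. The sketch below assumes the package and the English models from the commands above are installed, and relies on `polyglot.text.Text`, its `entities` attribute, and the per-entity `tag` attribute as described in the package documentation. The helper names `format_entities` and `extract_entities` are hypothetical and exist only for this example.

```python
def format_entities(chunks):
    """Turn (tag, words) pairs into (tag, text) tuples."""
    return [(tag, " ".join(words)) for tag, words in chunks]


def extract_entities(blob, lang="en"):
    """Return (tag, text) tuples for the named entities found in blob."""
    # Imported lazily so format_entities works even without polyglot installed.
    from polyglot.text import Text

    text = Text(blob, hint_language_code=lang)
    # Each entity is a chunk of words carrying a tag such as "I-PER" or "I-LOC".
    return format_entities(
        (entity.tag, [str(word) for word in entity]) for entity in text.entities
    )
```

Calling `extract_entities(open("file.txt").read())` would then roughly mirror the command-line pipeline above, returning the entities as plain tuples rather than printing annotated tokens.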

For full documentation of the software, please refer to the official package documentation:

https://polyglot.readthedocs.org

Training Datasets

The training datasets used in this software are available for download.

Citation

@article{polyglotner,
  author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
  title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
  journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},
  month = {April},
  year = {2015},
  publisher = {SIAM},
}