Polyglot

Abstract

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.

Polyglot is joint work with Bryan Perozzi and Steven Skiena.

Presentation

Polyglot: CoNLL 2013

Online Demo

The demo shows the proximity of words in the embedding space. Given a word, we compute its nearest neighbours in the space according to Euclidean distance. If you are using Firefox 23.0 or later, the demo will be blocked by default; here are instructions on how to disable the protection and enable the demo. Alternatively, you can access the demo directly at <wordrepresentation.appspot.com>.
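The neighbour lookup described above can be sketched in a few lines of Python. This is an illustrative sketch, not the demo's actual code: the function name `nearest_neighbours`, the toy vocabulary, and the two-dimensional vectors are all made up for the example.

```python
import heapq
import math


def nearest_neighbours(word, vocab, vectors, k=5):
    """Return the k words closest to `word` by Euclidean distance."""
    query = vectors[vocab.index(word)]
    # Pair every other word with its distance to the query vector.
    distances = (
        (math.dist(query, vector), other)
        for other, vector in zip(vocab, vectors)
        if other != word
    )
    # Keep only the k smallest distances, closest first.
    return [other for _, other in heapq.nsmallest(k, distances)]


# Toy data, invented for illustration; real embeddings have many more dimensions.
vocab = ["cat", "dog", "car", "truck"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

print(nearest_neighbours("cat", vocab, vectors, k=2))  # ['dog', 'truck']
```

A linear scan like this is enough for a small vocabulary. For a full vocabulary, the distances would more commonly be computed with vectorised array operations.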

Download the Embeddings

Download Wikipedia Text Dumps

To aid researchers, we offer processed Wikipedia dumps containing tokenized text. This material is available under CC BY-SA 3.0. [Deprecated]

Please check the Wiki40B dumps instead, since they are newer and cleaner text dumps of Wikipedia, and do not forget to cite the Wiki40B paper.
https://www.tensorflow.org/datasets/catalog/wiki40b

Embeddings Tutorial

For each language, there is a directory that contains its data. The data is stored as a pickled Python object. A small script for extracting the data is available in the tutorial, hosted at <http://nbviewer.ipython.org/6046170>.
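As a rough sketch of what extracting the data might look like, the snippet below loads a pickle and builds a word-to-vector lookup. It assumes the pickled object is a `(words, vectors)` pair; the helper name `load_embeddings` and the file `toy_embeddings.pkl` are hypothetical, and the example writes its own tiny stand-in file so that it runs without the real download.

```python
import os
import pickle
import tempfile


def load_embeddings(path):
    """Load a pickle that is assumed to hold a (words, vectors) pair."""
    with open(path, "rb") as handle:
        # encoding="latin1" lets Python 3 read pickles written by Python 2.
        words, vectors = pickle.load(handle, encoding="latin1")
    return words, vectors


# Stand-in data, invented for illustration only.
toy_words = ("the", "cat")
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_embeddings.pkl")
    with open(toy_path, "wb") as handle:
        pickle.dump((toy_words, toy_vectors), handle)
    words, vectors = load_embeddings(toy_path)

# Map each word to its vector for direct lookup.
word_to_vector = dict(zip(words, vectors))
print(word_to_vector["cat"])  # [0.3, 0.4]
```

If the real files store the vectors as a NumPy array, NumPy must be installed for unpickling to succeed. Unpickling can execute arbitrary code, so only load files from a source you trust.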

Train Your Own Models

If the pre-trained models do not fit your problem, you can train your own with either of the two tools we developed:

word2embeddings

Supports CPU and GPU computation.
Requires Theano.
Bitbucket page

polyglot2

Supports CPU only.
Faster than word2embeddings on CPU (especially if compiled against OpenBLAS).
Requires Cython.
Project page

word2embeddings and polyglot2 are open source, licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but does not allow its incorporation into any type of distributed proprietary software, even in part or in translation. Commercial licensing is also available; please contact us if you are interested.

Citing Polyglot

If you use Polyglot for academic research, you are highly encouraged to cite the following paper:

Polyglot: Distributed Word Representations for Multilingual NLP,

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena.

In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL 2013).

BibTeX

@InProceedings{polyglot:2013:ACL-CoNLL,
  author    = {Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven},
  title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
  booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2013},
  address   = {Sofia, Bulgaria},
  publisher = {Association for Computational Linguistics},
  pages     = {183--192},
  url       = {http://www.aclweb.org/anthology/W13-3520}
}
