
Polyglot

Abstract

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.

Polyglot is joint work with Bryan Perozzi and Steven Skiena.

Online Demo

The demo shows word proximity in the embedding space. Given a word, we compute its nearest neighbours in the space according to Euclidean distance. If you are using Firefox 23.0 or later, the demo will be blocked by default; here are instructions on how to disable protection and enable the demo. Alternatively, you can access the demo directly at <wordrepresentation.appspot.com>.
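
As a rough illustration of the lookup the demo performs, the sketch below ranks words by Euclidean distance to a query word. The toy vocabulary, the 3-dimensional vectors, and the function name `nearest_neighbours` are all invented for this example; the demo's actual implementation is not described on this page.

```python
import math

# Toy embeddings, invented for illustration only (real vectors are larger).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
    "pear":  [0.2, 0.1, 0.8],
}


def nearest_neighbours(word, embeddings, k=2):
    """Return the k words closest to `word` by Euclidean distance."""
    query = embeddings[word]
    distances = [
        (math.dist(query, vector), other)
        for other, vector in embeddings.items()
        if other != word  # skip the query word itself
    ]
    distances.sort()  # smallest distance first
    return [other for _, other in distances[:k]]


print(nearest_neighbours("king", toy_embeddings))
```

A linear scan like this is adequate for small vocabularies; for a full vocabulary a vectorised or indexed search would likely be preferable.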

Download the Embeddings

Wikipedia Language Code Language name (English) Language name (native) Download Link Last Updated
af Afrikaans Afrikaans polyglot-af.pkl 2013-07-21
als polyglot-als.pkl 2013-07-21
am Amharic አማርኛ polyglot-am.pkl 2013-07-21
an Aragonese Aragonés polyglot-an.pkl 2013-07-21
ar Arabic العربية polyglot-ar.pkl 2013-07-21
arz polyglot-arz.pkl 2013-07-21
as Assamese অসমীয়া polyglot-as.pkl 2013-07-21
ast polyglot-ast.pkl 2013-07-21
az Azerbaijani azərbaycan dili polyglot-az.pkl 2013-07-21
ba Bashkir башҡорт теле polyglot-ba.pkl 2013-07-21
bar polyglot-bar.pkl 2013-07-21
be Belarusian Беларуская polyglot-be.pkl 2013-07-21
bg Bulgarian български език polyglot-bg.pkl 2013-07-21
bn Bengali বাংলা polyglot-bn.pkl 2013-07-21
bo Tibetan Standard, Tibetan, Central བོད་ཡིག polyglot-bo.pkl 2013-07-21
bpy polyglot-bpy.pkl 2013-07-21
br Breton brezhoneg polyglot-br.pkl 2013-07-21
bs Bosnian bosanski jezik polyglot-bs.pkl 2013-07-21
ca Catalan; Valencian Català polyglot-ca.pkl 2013-07-21
ce Chechen нохчийн мотт polyglot-ce.pkl 2013-07-21
ceb Cebuano Binisaya polyglot-ceb.pkl 2013-07-21
cs Czech česky, čeština polyglot-cs.pkl 2013-07-21
cv Chuvash чӑваш чӗлхи polyglot-cv.pkl 2013-07-21
cy Welsh Cymraeg polyglot-cy.pkl 2013-07-21
da Danish dansk polyglot-da.pkl 2013-07-21
de German Deutsch polyglot-de.pkl 2013-07-21
diq polyglot-diq.pkl 2013-07-21
dv Divehi; Dhivehi; Maldivian; ދިވެހި polyglot-dv.pkl 2013-07-21
el Greek, Modern Ελληνικά polyglot-el.pkl 2013-07-21
en English English polyglot-en.pkl 2013-07-21
eo Esperanto Esperanto polyglot-eo.pkl 2013-07-21
es Spanish; Castilian español, castellano polyglot-es.pkl 2013-07-21
et Estonian eesti, eesti keel polyglot-et.pkl 2013-07-21
eu Basque euskara, euskera polyglot-eu.pkl 2013-07-21
fa Persian فارسی polyglot-fa.pkl 2013-07-21
fi Finnish suomi, suomen kieli polyglot-fi.pkl 2013-07-21
fo Faroese føroyskt polyglot-fo.pkl 2013-07-21
fr French français, langue française polyglot-fr.pkl 2013-07-21
fy Western Frisian Frysk polyglot-fy.pkl 2013-07-21
ga Irish Gaeilge polyglot-ga.pkl 2013-07-21
gan polyglot-gan.pkl 2013-07-21
gd Scottish Gaelic; Gaelic Gàidhlig polyglot-gd.pkl 2013-07-21
gl Galician Galego polyglot-gl.pkl 2013-07-21
gu Gujarati ગુજરાતી polyglot-gu.pkl 2013-07-21
gv Manx Gaelg, Gailck polyglot-gv.pkl 2013-07-21
he Hebrew (modern) עברית polyglot-he.pkl 2013-07-21
hi Hindi हिन्दी, हिंदी polyglot-hi.pkl 2013-07-21
hif polyglot-hif.pkl 2013-07-21
hr Croatian hrvatski polyglot-hr.pkl 2013-07-21
hsb polyglot-hsb.pkl 2013-07-21
ht Haitian; Haitian Creole Kreyòl ayisyen polyglot-ht.pkl 2013-07-21
hu Hungarian Magyar polyglot-hu.pkl 2013-07-21
hy Armenian Հայերեն polyglot-hy.pkl 2013-07-21
ia Interlingua Interlingua polyglot-ia.pkl 2013-07-21
id Indonesian Bahasa Indonesia polyglot-id.pkl 2013-07-21
ilo polyglot-ilo.pkl 2013-07-21
io Ido Ido polyglot-io.pkl 2013-07-21
is Icelandic Íslenska polyglot-is.pkl 2013-07-21
it Italian Italiano polyglot-it.pkl 2013-07-21
ja Japanese 日本語 (にほんご/にっぽんご) polyglot-ja.pkl 2013-09-20
ja Japanese Characters 日本語 (にほんご/にっぽんご) polyglot-ja-char.pkl 2013-07-21
jv Javanese basa Jawa polyglot-jv.pkl 2013-07-21
ka Georgian ქართული polyglot-ka.pkl 2013-07-21
kk Kazakh Қазақ тілі polyglot-kk.pkl 2013-07-21
km Khmer ភាសាខ្មែរ polyglot-km.pkl 2013-07-21
kn Kannada ಕನ್ನಡ polyglot-kn.pkl 2013-07-21
ko Korean 한국어 (韓國語), 조선말 (朝鮮語) polyglot-ko.pkl 2013-07-21
ku Kurdish Kurdî, كوردی‎ polyglot-ku.pkl 2013-07-21
ky Kirghiz, Kyrgyz кыргыз тили polyglot-ky.pkl 2013-07-21
la Latin latine, lingua latina polyglot-la.pkl 2013-07-21
lb Luxembourgish, Letzeburgesch Lëtzebuergesch polyglot-lb.pkl 2013-07-21
li Limburgish, Limburgan, Limburger Limburgs polyglot-li.pkl 2013-07-21
lmo polyglot-lmo.pkl 2013-07-21
lt Lithuanian lietuvių kalba polyglot-lt.pkl 2013-07-21
lv Latvian latviešu valoda polyglot-lv.pkl 2013-07-21
mg Malagasy Malagasy fiteny polyglot-mg.pkl 2013-07-21
mk Macedonian македонски јазик polyglot-mk.pkl 2013-07-21
ml Malayalam മലയാളം polyglot-ml.pkl 2013-07-21
mn Mongolian монгол polyglot-mn.pkl 2013-07-21
mr Marathi (Marāṭhī) मराठी polyglot-mr.pkl 2013-07-21
ms Malay bahasa Melayu, بهاس ملايو‎ polyglot-ms.pkl 2013-07-21
mt Maltese Malti polyglot-mt.pkl 2013-07-21
my Burmese ဗမာစာ polyglot-my.pkl 2013-07-21
ne Nepali नेपाली polyglot-ne.pkl 2013-07-21
nl Dutch Nederlands, Vlaams polyglot-nl.pkl 2013-07-21
nn Norwegian Nynorsk Norsk nynorsk polyglot-nn.pkl 2013-07-21
no Norwegian Norsk polyglot-no.pkl 2013-07-21
oc Occitan Occitan polyglot-oc.pkl 2013-07-21
or Oriya ଓଡ଼ିଆ polyglot-or.pkl 2013-07-21
os Ossetian, Ossetic ирон æвзаг polyglot-os.pkl 2013-07-21
pa Panjabi, Punjabi ਪੰਜਾਬੀ, پنجابی‎ polyglot-pa.pkl 2013-07-21
pam polyglot-pam.pkl 2013-07-21
pl Polish polski polyglot-pl.pkl 2013-07-21
pms polyglot-pms.pkl 2013-07-21
ps Pashto, Pushto پښتو polyglot-ps.pkl 2013-07-21
pt Portuguese Português polyglot-pt.pkl 2013-07-21
qu Quechua Runa Simi, Kichwa polyglot-qu.pkl 2013-07-21
rm Romansh rumantsch grischun polyglot-rm.pkl 2013-07-21
ro Romanian, Moldavian, Moldovan română polyglot-ro.pkl 2013-07-21
ru Russian русский язык polyglot-ru.pkl 2013-07-21
sa Sanskrit (Saṁskṛta) संस्कृतम् polyglot-sa.pkl 2013-07-21
sah polyglot-sah.pkl 2013-07-21
scn polyglot-scn.pkl 2013-07-21
sco polyglot-sco.pkl 2013-07-21
se Northern Sami Davvisámegiella polyglot-se.pkl 2013-07-21
sh polyglot-sh.pkl 2013-07-21
si Sinhala, Sinhalese සිංහල polyglot-si.pkl 2013-07-21
sk Slovak slovenčina polyglot-sk.pkl 2013-07-21
sl Slovene slovenščina polyglot-sl.pkl 2013-07-21
sq Albanian Shqip polyglot-sq.pkl 2013-07-21
sr Serbian српски језик polyglot-sr.pkl 2013-07-21
su Sundanese Basa Sunda polyglot-su.pkl 2013-07-21
sv Swedish svenska polyglot-sv.pkl 2013-07-21
sw Swahili Kiswahili polyglot-sw.pkl 2013-07-21
szl polyglot-szl.pkl 2013-07-21
ta Tamil தமிழ் polyglot-ta.pkl 2013-07-21
te Telugu తెలుగు polyglot-te.pkl 2013-07-21
tg Tajik тоҷикӣ, toğikī, تاجیکی‎ polyglot-tg.pkl 2013-07-21
th Thai ไทย polyglot-th.pkl 2013-07-21
tk Turkmen Türkmen, Түркмен polyglot-tk.pkl 2013-07-21
tl Tagalog Wikang Tagalog polyglot-tl.pkl 2013-07-21
tr Turkish Türkçe polyglot-tr.pkl 2013-07-21
tt Tatar татарча, tatarça, تاتارچا‎ polyglot-tt.pkl 2013-07-21
ug Uighur, Uyghur Uyƣurqə, ئۇيغۇرچە‎ polyglot-ug.pkl 2013-07-21
uk Ukrainian українська polyglot-uk.pkl 2013-07-21
ur Urdu اردو polyglot-ur.pkl 2013-07-21
uz Uzbek Oʻzbek, Ўзбек, أۇزبېك‎ polyglot-uz.pkl 2013-07-21
vec polyglot-vec.pkl 2013-07-21
vi Vietnamese Tiếng Việt polyglot-vi.pkl 2013-07-21
vls polyglot-vls.pkl 2013-07-21
vo Volapük Volapük polyglot-vo.pkl 2013-07-21
wa Walloon Walon polyglot-wa.pkl 2013-07-21
war Waray-Waray Winaray polyglot-war.pkl 2013-07-21
yi Yiddish ייִדיש polyglot-yi.pkl 2013-07-21
yo Yoruba Yorùbá polyglot-yo.pkl 2013-07-21
zh Chinese 中文 (Zhōngwén), 汉语, 漢語 polyglot-zh.pkl 2013-07-21
zh_char Chinese Characters Model polyglot-zh_char.pkl 2013-07-21



Download Wikipedia Text Dumps

To aid researchers, we offer processed Wikipedia dumps containing tokenized text. This material is available under CC BY-SA 3.0.

Wikipedia Language Code Language name (English) Language name (native) Download Link
ar Arabic العربية ar_wiki_text.tar.lzma
bg Bulgarian български език bg_wiki_text.tar.lzma
ca Catalan; Valencian Català ca_wiki_text.tar.lzma
cs Czech česky, čeština cs_wiki_text.tar.lzma
da Danish dansk da_wiki_text.tar.lzma
de German Deutsch de_wiki_text.tar.lzma
el Greek, Modern Ελληνικά el_wiki_text.tar.lzma
en English English en_wiki_text.tar.lzma
es Spanish; Castilian español, castellano es_wiki_text.tar.lzma
et Estonian eesti, eesti keel et_wiki_text.tar.lzma
fa Persian فارسی fa_wiki_text.tar.lzma
fi Finnish suomi, suomen kieli fi_wiki_text.tar.lzma
fr French français, langue française fr_wiki_text.tar.lzma
he Hebrew (modern) עברית he_wiki_text.tar.lzma
hi Hindi हिन्दी, हिंदी hi_wiki_text.tar.lzma
hr Croatian hrvatski hr_wiki_text.tar.lzma
hu Hungarian Magyar hu_wiki_text.tar.lzma
id Indonesian Bahasa Indonesia id_wiki_text.tar.lzma
it Italian Italiano it_wiki_text.tar.lzma
ja Japanese 日本語 (にほんご/にっぽんご) ja_wiki_text.tar.lzma
ko Korean 한국어 (韓國語), 조선말 (朝鮮語) ko_wiki_text.tar.lzma
lt Lithuanian lietuvių kalba lt_wiki_text.tar.lzma
lv Latvian latviešu valoda lv_wiki_text.tar.lzma
ms Malay bahasa Melayu, بهاس ملايو‎ ms_wiki_text.tar.lzma
nl Dutch Nederlands, Vlaams nl_wiki_text.tar.lzma
no Norwegian Norsk no_wiki_text.tar.lzma
pl Polish polski pl_wiki_text.tar.lzma
pt Portuguese Português pt_wiki_text.tar.lzma
ro Romanian, Moldavian, Moldovan română ro_wiki_text.tar.lzma
ru Russian русский язык ru_wiki_text.tar.lzma
sk Slovak slovenčina sk_wiki_text.tar.lzma
sl Slovene slovenščina sl_wiki_text.tar.lzma
sr Serbian српски језик sr_wiki_text.tar.lzma
sv Swedish svenska sv_wiki_text.tar.lzma
th Thai ไทย th_wiki_text.tar.lzma
tl Tagalog Wikang Tagalog tl_wiki_text.tar.lzma
tr Turkish Türkçe tr_wiki_text.tar.lzma
uk Ukrainian українська uk_wiki_text.tar.lzma
vi Vietnamese Tiếng Việt vi_wiki_text.tar.lzma
zh Chinese 中文 (Zhōngwén), 汉语, 漢語 zh_wiki_text.tar.lzma
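
One possible way to read a downloaded dump without unpacking it to disk is sketched below using only the Python standard library. The internal layout of the archives is not documented on this page, so the sketch simply walks every regular file and yields its lines; the function name `iter_text_lines`, the stand-in archive name `xx_wiki_text.tar.lzma`, and its contents are invented for the example, and UTF-8 encoding is an assumption.

```python
import io
import lzma
import os
import tarfile
import tempfile


def iter_text_lines(archive_path):
    """Yield (member name, decoded line) for every regular file in the archive."""
    # Mode "r:xz" goes through Python's lzma module, which auto-detects
    # both the legacy .lzma container and the newer .xz container on read.
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            for raw_line in handle:
                # UTF-8 is assumed here, not confirmed by the page.
                yield member.name, raw_line.decode("utf-8").rstrip("\n")


# Build a tiny stand-in archive so the sketch runs without a download.
# The member name and its contents are invented for this example.
sample_text = "Hello , world !\nThis is tokenized text .\n".encode("utf-8")
tar_bytes = io.BytesIO()
with tarfile.open(fileobj=tar_bytes, mode="w") as archive:
    info = tarfile.TarInfo(name="xx/sample.txt")
    info.size = len(sample_text)
    archive.addfile(info, io.BytesIO(sample_text))

archive_path = os.path.join(tempfile.mkdtemp(), "xx_wiki_text.tar.lzma")
with open(archive_path, "wb") as out:
    out.write(lzma.compress(tar_bytes.getvalue(), format=lzma.FORMAT_ALONE))

lines = list(iter_text_lines(archive_path))
for name, line in lines:
    print(name, "->", line.split())
```

Because `iter_text_lines` is a generator, a large dump can be processed line by line rather than loaded into memory at once.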


Embeddings Tutorial

For each language there is a directory that contains its own data. The data is stored as a pickled Python object. A tutorial with a small script to extract the data is hosted at <http://nbviewer.ipython.org/6046170>.
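
As a minimal sketch of what extracting the data might look like, the code below assumes each pickle holds a `(words, vectors)` pair: a list of words and a parallel sequence of vectors. That layout is an assumption made for this example, as are the helper name `load_embeddings` and the stand-in file `polyglot-xx.pkl`; the linked tutorial remains the reference for the real files, which may also require NumPy to be installed if the vectors are stored as an array.

```python
import os
import pickle
import tempfile


def load_embeddings(path):
    """Load a pickle assumed to hold a (words, vectors) pair."""
    with open(path, "rb") as handle:
        # encoding="latin1" may be needed when Python 3 reads a pickle
        # written by Python 2; it has no effect on Python 3 pickles.
        words, vectors = pickle.load(handle, encoding="latin1")
    return words, vectors


# Write a tiny stand-in file so the sketch runs without a download.
toy_words = ["the", "cat", "sat"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
path = os.path.join(tempfile.mkdtemp(), "polyglot-xx.pkl")
with open(path, "wb") as handle:
    pickle.dump((toy_words, toy_vectors), handle)

words, vectors = load_embeddings(path)
word_to_vector = dict(zip(words, vectors))  # look up a vector by its word
print(len(words), "words,", len(vectors[0]), "dimensions")
print(word_to_vector["cat"])
```

Building the `word_to_vector` dictionary once makes repeated lookups cheap, at the cost of holding a second index in memory.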


Train Your Own Models

If the pre-trained models do not fit your problem, you can train your own with either of the two tools we developed:

word2embeddings

polyglot2

  • Supports CPU only.
  • Faster than word2embeddings on CPU (especially if compiled against OpenBLAS).
  • Requires Cython.
  • Project page
word2embeddings and polyglot2 are open source, licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses but does not allow their incorporation into any type of distributed proprietary software, even in part or in translation. Commercial licensing is also available; please contact us if you are interested.

Citing Polyglot

If you use Polyglot for academic research, you are highly encouraged to cite the following paper:

Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena.
In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL 2013).

Bibtex

@InProceedings{polyglot:2013:ACL-CoNLL,
  author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},
  title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
  booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2013},
  address   = {Sofia, Bulgaria},
  publisher = {Association for Computational Linguistics},
  pages     = {183--192},
  url       = {http://www.aclweb.org/anthology/W13-3520}
}
