Word Embeddings - Word2Vec

2018-04-06 14:14:36,269 : INFO : START - Read wamex reports in C:/wamex/data/wamex_xml_10k

2018-04-06 14:14:37,344 : INFO : END - Read wamex reports

2018-04-06 14:14:37,349 : INFO : START - Clean list of strings that contain wamex reports. Number of reports: 8933

2018-04-06 14:14:51,971 : INFO : END - Clean list of strings

2018-04-06 14:14:51,972 : INFO : START - Keywords will be extracted using TF-IDF.

2018-04-06 14:19:17,295 : INFO : END - Keywords are ready.

2018-04-06 14:19:18,425 : INFO : START - Word embeddings will be generated by gensim.models.Word2Vec.

2018-04-06 14:19:18,620 : INFO : collecting all words and their counts

2018-04-06 14:19:18,658 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2018-04-06 14:19:18,673 : INFO : collected 2907 word types from a corpus of 178660 raw words and 8933 sentences

2018-04-06 14:19:18,674 : INFO : Loading a fresh vocabulary

2018-04-06 14:19:18,697 : INFO : min_count=20 retains 1471 unique words (50% of original 2907, drops 1436)

2018-04-06 14:19:18,698 : INFO : min_count=20 leaves 166121 word corpus (92% of original 178660, drops 12539)
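
The two min_count lines above can be checked by hand. The sketch below is not the author's pipeline: `apply_min_count` is a hypothetical helper showing the idea of frequency pruning on a tokenised corpus, and the arithmetic underneath reuses only the figures printed in the log. The assumption that the logged percentages are truncated to whole numbers is mine; it is what makes 50.6% appear as 50%.

```python
from collections import Counter


def apply_min_count(sentences, min_count):
    """Count every token and keep only those seen at least min_count times.

    Hypothetical helper for illustration; not gensim's own code.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


# Figures copied from the log lines above.
orig_types, kept_types = 2907, 1471
orig_words, kept_words = 178660, 166121

dropped_types = orig_types - kept_types      # word types removed by pruning
dropped_words = orig_words - kept_words      # corpus tokens removed by pruning
type_pct = 100 * kept_types // orig_types    # whole-number percentage, truncated
word_pct = 100 * kept_words // orig_words    # whole-number percentage, truncated

print(dropped_types, dropped_words, type_pct, word_pct)
```

Roughly half of the word types are dropped, but they account for only about 7% of the running text, which is the usual pattern for rare words.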

2018-04-06 14:19:18,757 : INFO : deleting the raw counts dictionary of 2907 items

2018-04-06 14:19:19,124 : INFO : sample=0.001 downsamples 56 most-common words

2018-04-06 14:19:19,124 : INFO : downsampling leaves estimated 141451 word corpus (85.1% of prior 166121)
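
The `sample=0.001` setting randomly discards occurrences of very frequent words. A minimal sketch of the keep-probability rule follows, assuming gensim uses the same rule as the original word2vec C implementation; the function name `keep_probability` and the example counts 20 and 5000 are hypothetical. The last line only re-derives the 85.1% figure from the two numbers in the log.

```python
import math


def keep_probability(count, corpus_words, sample=0.001):
    """Probability of keeping one occurrence of a word seen `count` times.

    Assumes the word2vec-style rule: words whose count is small relative to
    sample * corpus_words are always kept; frequent words are thinned out.
    """
    threshold = sample * corpus_words
    probability = (math.sqrt(count / threshold) + 1) * (threshold / count)
    return min(1.0, probability)


retained_words = 166121      # corpus size after min_count pruning (from the log)
estimated_after = 141451     # estimated corpus size after downsampling (from the log)

rare = keep_probability(20, retained_words)        # a word at the min_count limit
frequent = keep_probability(5000, retained_words)  # a hypothetical common word
kept_share = round(100 * estimated_after / retained_words, 1)

print(rare, frequent, kept_share)
```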

2018-04-06 14:19:19,125 : INFO : estimated required memory for 1471 words and 100 dimensions: 1912300 bytes
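
One decomposition that reproduces the 1912300-byte estimate exactly is shown below. It assumes about 500 bytes of bookkeeping per vocabulary entry plus two float32 matrices of shape (vocabulary, dimensions): the input vectors and the negative-sampling output weights. The 500-byte constant is an assumption about gensim's heuristic, not something stated in the log.

```python
vocab_size = 1471     # from the log
vector_dims = 100     # from the log

BYTES_PER_VOCAB_ENTRY = 500   # assumed per-word overhead when hs=0
BYTES_PER_FLOAT32 = 4

vocab_bytes = vocab_size * BYTES_PER_VOCAB_ENTRY
input_matrix_bytes = vocab_size * vector_dims * BYTES_PER_FLOAT32
output_matrix_bytes = vocab_size * vector_dims * BYTES_PER_FLOAT32  # negative-sampling weights

total_bytes = vocab_bytes + input_matrix_bytes + output_matrix_bytes
print(total_bytes)
```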

2018-04-06 14:19:19,201 : INFO : resetting layer weights

2018-04-06 14:19:19,309 : INFO : training model with 4 workers on 1471 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5

2018-04-06 14:19:21,299 : INFO : PROGRESS: at 71.64% examples, 500793 words/s, in_qsize 7, out_qsize 0

2018-04-06 14:19:21,540 : INFO : worker thread finished; awaiting finish of 3 more threads

2018-04-06 14:19:21,544 : INFO : worker thread finished; awaiting finish of 2 more threads

2018-04-06 14:19:21,559 : INFO : worker thread finished; awaiting finish of 1 more threads

2018-04-06 14:19:21,569 : INFO : worker thread finished; awaiting finish of 0 more threads

2018-04-06 14:19:21,569 : INFO : training on 893300 raw words (706823 effective words) took 1.3s, 551718 effective words/s
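
The training summary is consistent with the earlier vocabulary lines. The arithmetic below uses only numbers from the log; the reading that the ratio of raw words corresponds to the number of passes over the corpus is an inference, not something the log states.

```python
corpus_raw_words = 178660       # raw words in the corpus (vocabulary scan line)
trained_raw_words = 893300      # raw words seen during training
trained_effective = 706823      # words left after pruning and downsampling
effective_per_second = 551718   # reported throughput

passes = trained_raw_words / corpus_raw_words        # how many times the corpus was read
effective_per_pass = trained_effective / passes      # compare with the estimate of 141451
elapsed_seconds = trained_effective / effective_per_second

print(passes, round(effective_per_pass), round(elapsed_seconds, 1))
```

The per-pass figure lands close to the 141451 estimate from the downsampling line; a small gap is expected because downsampling is random.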

2018-04-06 14:19:21,681 : INFO : saving Word2Vec object under C:/1WAMEX/word2vec_wamex_10k.model, separately None

2018-04-06 14:19:22,084 : INFO : saved C:/1WAMEX/word2vec_wamex_10k.model

2018-04-06 14:19:22,106 : INFO : END - Word embeddings generated and saved in C:/1WAMEX/word2vec_wamex_10k.model
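
For reference, the settings reported in the log could be passed to gensim roughly as follows. This is a sketch, not the script that produced the log: it assumes gensim 4.x keyword names (`vector_size`, `epochs`; releases before 4.0 used `size` and `iter`), `epochs=5` is inferred from the raw-word totals rather than stated, and `train_and_save`, `sentences` and `path` are hypothetical names.

```python
# Hyperparameters as they appear in the log; the keyword names are an assumption.
W2V_PARAMS = {
    "vector_size": 100,  # "100 features"
    "window": 5,
    "min_count": 20,
    "workers": 4,
    "sg": 1,             # skip-gram
    "hs": 0,             # no hierarchical softmax
    "negative": 5,       # negative sampling with 5 noise words
    "sample": 0.001,
    "epochs": 5,         # inferred: 893300 / 178660 raw words
}


def train_and_save(sentences, path):
    """Train a Word2Vec model on tokenised reports and save it to `path`.

    `sentences` is an iterable of token lists, one list per cleaned report.
    """
    from gensim.models import Word2Vec  # imported lazily so the settings load without gensim

    model = Word2Vec(sentences=sentences, **W2V_PARAMS)
    model.save(path)
    return model
```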

In [3]: word2vec_model.most_similar(positive=['iron','iron ore'], negative=['gold'], topn=3)

Out[3]:
[('detrital', 0.8199945688247681),
 ('hematite', 0.8168871998786926),
 ('iron formation', 0.7921037077903748)]
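
A query with `positive` and `negative` lists is, in essence, vector arithmetic followed by a cosine-similarity ranking. The sketch below illustrates that idea in plain Python. It is not gensim's implementation, and the 3-dimensional `toy_vectors` are invented for illustration; they are not taken from the trained model.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += x
    for word in negative:
        for i, x in enumerate(unit(vectors[word])):
            query[i] -= x
    query = unit(query)

    excluded = set(positive) | set(negative)  # query words are never returned
    scored = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Invented toy vectors, for illustration only.
toy_vectors = {
    "iron": [1.0, 0.1, 0.0],
    "iron ore": [0.9, 0.2, 0.0],
    "gold": [0.0, 1.0, 0.0],
    "hematite": [1.0, 0.0, 0.1],
    "quartz": [0.0, 0.9, 0.3],
    "lava": [0.0, 0.0, 1.0],
}

ranking = most_similar(toy_vectors, positive=["iron", "iron ore"], negative=["gold"])
print(ranking)
```

Subtracting 'gold' pushes the query away from words that co-occur with gold, which is why the real outputs here are dominated by iron-specific terms.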

In [4]: word2vec_model.most_similar(positive=['iron','pilbara'], negative=['gold'], topn=3)

Out[4]:
[('newman', 0.7953650951385498),
 ('detrital', 0.7779804468154907),
 ('fortescue', 0.7682926058769226)]

In [5]: word2vec_model.most_similar(positive=['nickel','komatiite'], negative=['gold'], topn=3)

Out[5]:
[('lava', 0.8143723011016846),
 ('flow', 0.7812291383743286),
 ('olivine', 0.769776463508606)]

In [6]: word2vec_model.most_similar(positive=['iron','banded iron formation'], negative=['gold'], topn=3)

Out[6]:
[('hematite', 0.8250837326049805),
 ('banded iron', 0.8139172792434692),
 ('goethite', 0.8059136271476746)]

