2018-04-06 14:14:36,269 : INFO : START - Read wamex reports in C:/wamex/data/wamex_xml_10k
2018-04-06 14:14:37,344 : INFO : END - Read wamex reports
2018-04-06 14:14:37,349 : INFO : START - Clean list of strings that contain wamex reports. Number of reports: 8933
2018-04-06 14:14:51,971 : INFO : END - Clean list of strings
2018-04-06 14:14:51,972 : INFO : START - Keywords will be extracted using TF-IDF.
2018-04-06 14:19:17,295 : INFO : END - Keywords are ready.
2018-04-06 14:19:18,425 : INFO : START - Word embeddings will be generated by gensim.models.Word2Vec.
2018-04-06 14:19:18,620 : INFO : collecting all words and their counts
2018-04-06 14:19:18,658 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-04-06 14:19:18,673 : INFO : collected 2907 word types from a corpus of 178660 raw words and 8933 sentences
2018-04-06 14:19:18,674 : INFO : Loading a fresh vocabulary
2018-04-06 14:19:18,697 : INFO : min_count=20 retains 1471 unique words (50% of original 2907, drops 1436)
2018-04-06 14:19:18,698 : INFO : min_count=20 leaves 166121 word corpus (92% of original 178660, drops 12539)
2018-04-06 14:19:18,757 : INFO : deleting the raw counts dictionary of 2907 items
2018-04-06 14:19:19,124 : INFO : sample=0.001 downsamples 56 most-common words
2018-04-06 14:19:19,124 : INFO : downsampling leaves estimated 141451 word corpus (85.1% of prior 166121)
2018-04-06 14:19:19,125 : INFO : estimated required memory for 1471 words and 100 dimensions: 1912300 bytes
2018-04-06 14:19:19,201 : INFO : resetting layer weights
2018-04-06 14:19:19,309 : INFO : training model with 4 workers on 1471 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2018-04-06 14:19:21,299 : INFO : PROGRESS: at 71.64% examples, 500793 words/s, in_qsize 7, out_qsize 0
2018-04-06 14:19:21,540 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-04-06 14:19:21,544 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-04-06 14:19:21,559 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-04-06 14:19:21,569 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-04-06 14:19:21,569 : INFO : training on 893300 raw words (706823 effective words) took 1.3s, 551718 effective words/s
2018-04-06 14:19:21,681 : INFO : saving Word2Vec object under C:/1WAMEX/word2vec_wamex_10k.model, separately None
2018-04-06 14:19:22,084 : INFO : saved C:/1WAMEX/word2vec_wamex_10k.model
2018-04-06 14:19:22,106 : INFO : END - Word embeddings generated and saved in C:/1WAMEX/word2vec_wamex_10k.model
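
The vocabulary and training figures in the log above can be cross-checked with a little arithmetic. The sketch below only re-derives numbers already printed in the log; the variable names are illustrative. Two readings are assumptions rather than facts stated in the log: that the logged percentages are truncated to whole numbers, and that the raw-word total in the final training line reflects five passes over the corpus (which would match gensim's default number of iterations at the time).

```python
# Figures copied from the gensim log lines above.
raw_word_types = 2907            # "collected 2907 word types"
raw_words = 178660               # "corpus of 178660 raw words"
kept_word_types = 1471           # retained by min_count=20
kept_words = 166121              # corpus size after min_count=20
downsampled_words = 141451       # estimate after sample=0.001
trained_raw_words = 893300       # "training on 893300 raw words"
trained_effective_words = 706823

# Word types and tokens dropped by min_count=20.
dropped_types = raw_word_types - kept_word_types
dropped_words = raw_words - kept_words

# Retention ratios; the log shows these as 50% and 92%,
# which is consistent with truncation rather than rounding.
pct_types = 100 * kept_word_types / raw_word_types
pct_words = 100 * kept_words / raw_words
pct_after_sampling = 100 * downsampled_words / kept_words

# Assumption: the training total is the corpus size times the number of passes.
passes = trained_raw_words / raw_words
effective_per_pass = trained_effective_words / passes

print(f"dropped: {dropped_types} types, {dropped_words} words")
print(f"kept: {pct_types:.2f}% of types, {pct_words:.2f}% of words")
print(f"after downsampling: {pct_after_sampling:.1f}% of kept words")
print(f"passes: {passes}, effective words per pass: {effective_per_pass:.1f}")
```

The effective words per pass come out slightly below the logged estimate of 141451, which is expected because frequent-word downsampling is random.
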
In [3]: word2vec_model.most_similar(positive=['iron','iron ore'], negative=['gold'], topn=3)
Out[3]:
[('detrital', 0.8199945688247681),
('hematite', 0.8168871998786926),
('iron formation', 0.7921037077903748)]
In [4]: word2vec_model.most_similar(positive=['iron','pilbara'], negative=['gold'], topn=3)
Out[4]:
[('newman', 0.7953650951385498),
('detrital', 0.7779804468154907),
('fortescue', 0.7682926058769226)]
In [5]: word2vec_model.most_similar(positive=['nickel','komatiite'], negative=['gold'], topn=3)
Out[5]:
[('lava', 0.8143723011016846),
('flow', 0.7812291383743286),
('olivine', 0.769776463508606)]
In [6]: word2vec_model.most_similar(positive=['iron','banded iron formation'], negative=['gold'], topn=3)
Out[6]:
[('hematite', 0.8250837326049805),
('banded iron', 0.8139172792434692),
('goethite', 0.8059136271476746)]
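
The queries above combine positive and negative terms. As a rough illustration of what such a query computes, here is a minimal pure-Python sketch of the usual additive analogy scheme: unit-normalise each word vector, add the positive ones, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The function `most_similar_sketch` and the three-dimensional toy vectors are hypothetical; they are not taken from the trained model and do not reproduce the scores shown above.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_sketch(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives).

    `vectors` maps each word to a list of floats. Query words are excluded
    from the results, as in the session above.
    """
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))

    # Build the query vector from weighted, unit-length word vectors.
    query = [0.0] * dim
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(normed[word]):
            query[i] += weight * x
    query = unit(query)

    # Cosine similarity reduces to a dot product for unit vectors.
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(a * b for a, b in zip(query, vec)))
        for word, vec in normed.items()
        if word not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "iron": [1.0, 0.1, 0.0],
    "gold": [0.1, 1.0, 0.0],
    "pilbara": [0.8, 0.0, 0.6],
    "hematite": [0.9, 0.0, 0.3],
    "quartz vein": [0.0, 0.9, 0.1],
}

result = most_similar_sketch(
    toy_vectors, positive=["iron", "pilbara"], negative=["gold"], topn=2
)
print(result)
```

With these toy values the iron-like vector ranks first and the gold-like vector last, mirroring how subtracting 'gold' steers the real queries towards iron-related terms.
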