Majiga - Document Similarity

Document Similarity using Stratigraphic Word Vectors

1. Loads the word2vec embeddings pre-trained on the WA stratigraphic units data (each word occurs at least 10 times)

2. Applies t-SNE to the embeddings (i.e. dimensionality reduction to 2D)

tsne_model = sklearn.manifold.TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)

3. Plots the 2D position of each word with a label
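
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the helper names project_words and plot_words are hypothetical, and random vectors stand in for the WA stratigraphic embeddings, which are not reproduced here. The iteration-count argument is left at its default because its name differs between scikit-learn releases (n_iter in older ones, max_iter in newer ones).

```python
import numpy as np
from sklearn.manifold import TSNE


def project_words(words, vectors, perplexity=40, random_state=23):
    """Reduce word vectors to 2D with t-SNE; returns {word: (x, y)}."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # perplexity must be smaller than the number of words being projected
    model = TSNE(perplexity=perplexity, n_components=2, init="pca",
                 random_state=random_state)
    coords = model.fit_transform(vectors)
    return {w: (float(x), float(y)) for w, (x, y) in zip(words, coords)}


def plot_words(positions):
    """Scatter each word at its 2D position and label it (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported here so projection works without it

    fig, ax = plt.subplots(figsize=(10, 10))
    for word, (x, y) in positions.items():
        ax.scatter(x, y)
        ax.annotate(word, xy=(x, y), xytext=(5, 2),
                    textcoords="offset points", ha="right", va="bottom")
    return fig


# Stand-in data (assumption): 60 random 20-dimensional "word vectors".
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(60)]
vectors = rng.normal(size=(60, 20))
positions = project_words(words, vectors)
```

With real embeddings, words and vectors would come from the loaded word2vec model instead of the random stand-ins.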

sim_scores = calculate_similarity(source_doc, target_docs)

Document similarity is found by:

1. Creating word vectors on the WA Strats data using Gensim Word2Vec

2. Calculating the document vector for each document from the WA stratigraphic word vectors (mean of the word vectors)

3. Calculating the similarity between the input document and each document to compare, using the cosine similarity between the document vectors

4. Showing the similarity scores as follows:

[{'score': 0.99226624, 'doc': 'Not showing as requested', 'docNum': 0},

{'score': 0.99226624, 'doc': 'Not showing as requested', 'docNum': 1},

{'score': 0.89561278, 'doc': 'Not showing as requested', 'docNum': 2},

{'score': 0.60469347, 'doc': 'Not showing as requested', 'docNum': 3}]
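
The mean-vector and cosine-similarity steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the original implementation: the word_vectors argument (a dict from word to vector) is added here so the example is self-contained, the toy 2D vectors stand in for the Word2Vec model, documents are tokenized by lower-casing and splitting on whitespace, docNum is taken to be the document's position in the input list, and a document with no known words scores 0.0.

```python
import math


def doc_vector(tokens, word_vectors):
    """Mean of the vectors of all tokens found in the vocabulary, or None."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def calculate_similarity(source_doc, target_docs, word_vectors):
    """Score each target document against the source, highest score first."""
    source_vec = doc_vector(source_doc.lower().split(), word_vectors)
    results = []
    for doc_num, doc in enumerate(target_docs):
        target_vec = doc_vector(doc.lower().split(), word_vectors)
        if source_vec is None or target_vec is None:
            score = 0.0  # assumption: no known words means no similarity
        else:
            score = cosine_similarity(source_vec, target_vec)
        results.append({"score": score, "doc": doc, "docNum": doc_num})
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Toy vectors (assumption) in place of the WA stratigraphic embeddings.
toy_vectors = {
    "granite": [1.0, 0.0],
    "basalt": [0.0, 1.0],
    "sandstone": [1.0, 1.0],
}
sim_scores = calculate_similarity(
    "granite", ["granite", "basalt", "sandstone granite"], toy_vectors
)
```

With a trained Gensim model, the dict could be replaced by the model's keyed vectors, which also support lookup by word and membership tests.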
