When you use Google Search, it shows you pages similar to the ones you are looking for. Have you ever wondered how a machine figures out that a page matches your interests? This document tries to explain that.
Document similarity (or distance between documents) is one of the central themes in Information Retrieval. Documents are usually treated as similar if they are semantically close and describe similar concepts.
One classic relevance score is TF-IDF: (term frequency) × (inverse document frequency).
In other words, the score multiplies two metrics: how many times a word appears in a document (term frequency), and how rare the word is across the set of documents (inverse document frequency). The higher the score, the better the chance that the word is relevant to the document.
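As a minimal sketch of how this is used for document similarity (assuming scikit-learn is available; the corpus and variable names are made up for illustration), TF-IDF vectors can be compared with cosine similarity:

```python
# Minimal sketch: TF-IDF vectors compared with cosine similarity.
# Assumes scikit-learn is installed; the corpus below is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()          # computes tf * idf per (word, document)
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: rows = docs, cols = words

# Pairwise cosine similarity between the TF-IDF vectors of all documents.
print(cosine_similarity(tfidf))         # docs 0 and 1 score higher than 0 and 2
```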
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.
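A rough sketch of this pipeline (again assuming scikit-learn, with an illustrative corpus; note that scikit-learn's convention is rows = documents, the transpose of the layout described above):

```python
# LSA sketch: reduce a TF-IDF matrix to a few latent "concepts" via
# truncated SVD. Assumes scikit-learn; the corpus is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Keep 2 latent dimensions ("concepts"); each document becomes a dense 2-D vector.
svd = TruncatedSVD(n_components=2)
lsa_vectors = svd.fit_transform(tfidf)

# Similarity is now measured in the reduced concept space.
print(cosine_similarity(lsa_vectors))
```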
Semantic hashing is a method that maps documents to short codes, e.g., 32-bit memory addresses, so that documents with semantically close content are mapped to nearby addresses.
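The original semantic hashing work learns these codes with a deep autoencoder, which is too much machinery for a short example. The sketch below substitutes a much simpler random-hyperplane (SimHash-style) scheme purely to illustrate the core idea that similar vectors can be mapped to binary codes with a small Hamming distance; all names and parameters are illustrative:

```python
# Illustrative sketch only: the original semantic hashing learns codes with a
# deep autoencoder; here a random-hyperplane (SimHash-style) scheme stands in
# to show how similar vectors land on nearby binary codes.
import numpy as np

rng = np.random.default_rng(0)

def binary_code(vec: np.ndarray, planes: np.ndarray) -> int:
    """Map a vector to an integer code: one bit per random hyperplane."""
    bits = (planes @ vec) > 0               # which side of each hyperplane?
    return int("".join("1" if b else "0" for b in bits), 2)

dim, n_bits = 8, 32                          # 32-bit codes, as in the text
planes = rng.standard_normal((n_bits, dim))  # fixed random hyperplanes

doc_a = rng.standard_normal(dim)
doc_b = doc_a + 0.05 * rng.standard_normal(dim)  # nearly identical document
doc_c = rng.standard_normal(dim)                 # unrelated document

code_a, code_b, code_c = (binary_code(v, planes) for v in (doc_a, doc_b, doc_c))

# Hamming distance between codes reflects similarity of the underlying vectors.
print(bin(code_a ^ code_b).count("1"))  # expected small: similar docs, close codes
print(bin(code_a ^ code_c).count("1"))  # expected large: unrelated docs, far codes
```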
https://monkeylearn.com/blog/what-is-tf-idf/
https://distributedalgorithm.wordpress.com/2017/06/24/semantic-hashing/
http://text2vec.org/similarity.html