Topic modeling is an unsupervised technique for analyzing large volumes of text data by clustering documents into groups. The text data carry no labels; instead, topic modeling groups the documents into clusters based on shared characteristics.
Several approaches are commonly used for topic modeling, with Latent Dirichlet Allocation and Non-Negative Matrix Factorization being the most popular:
Latent Dirichlet Allocation (LDA)
Non Negative Matrix Factorization (NMF)
Latent Semantic Analysis (LSA)
Parallel Latent Dirichlet Allocation (PLDA)
Pachinko Allocation Model (PAM)
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation is a statistical and graphical model used to discover relationships between the documents in a corpus. It is typically fitted with the Variational Expectation Maximization (VEM) algorithm, which obtains a maximum likelihood estimate from the whole corpus of text.
The LDA is based upon two general assumptions:
Documents that have similar words usually have the same topic
Documents that have groups of words frequently occurring together usually have the same topic.
These assumptions make sense because documents on the same topic, for instance business, will contain words such as "economy", "profit", "stock market", and "loss". The second assumption states that if these words frequently occur together in multiple documents, those documents likely belong to the same category.
Mathematically, the above two assumptions can be represented as:
Documents are probability distributions over latent topics
Topics are probability distributions over words
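The two distributions above can be seen directly in code. Below is a minimal sketch of LDA with scikit-learn's `LatentDirichletAllocation` (which is fitted with variational Bayes); the tiny corpus and the choice of two topics are illustrative assumptions.

```python
# Minimal LDA sketch: documents become probability distributions over topics.
# The corpus and n_components=2 are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "economy profit stock market loss",
    "stock market economy trading profit",
    "football match goal team player",
    "team player goal season league",
]

# LDA operates on raw term counts (a bag-of-words matrix)
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document

# Each row is a probability distribution over the 2 latent topics,
# and lda.components_ holds the topic-word weights.
print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` sums to 1, matching the first assumption (documents as distributions over topics), while the rows of `lda.components_` give the second (topics as distributions over words).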
Latent Semantic Analysis
Latent Semantic Analysis is also an unsupervised learning method, used to extract relationships between words across a collection of documents. It does this by applying singular value decomposition (SVD) to a term-document matrix.
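LSA is commonly implemented as TF-IDF weighting followed by truncated SVD. Here is a minimal sketch using scikit-learn; the corpus and the number of latent components are illustrative assumptions.

```python
# Minimal LSA sketch: TF-IDF followed by truncated SVD.
# The corpus and n_components=2 are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as the economy slowed",
    "the economy grew and stocks rose",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# SVD projects the term-document space down to 2 latent "concepts";
# words that co-occur across documents end up close in this space.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsa.fit_transform(tfidf)
print(doc_concepts.shape)  # (4, 2)
```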
Non Negative Matrix Factorization
NMF is a matrix factorization method in which the elements of the factorized matrices are constrained to be non-negative.
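For topic modeling, NMF factorizes a non-negative document-term matrix V into a document-topic matrix W and a topic-word matrix H, with V ≈ WH. A minimal sketch with scikit-learn follows; the corpus and topic count are illustrative assumptions.

```python
# Minimal NMF sketch: factorize TF-IDF matrix V into non-negative W and H.
# The corpus and n_components=2 are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "economy profit stock market",
    "profit loss stock economy",
    "goal team player match",
    "player team season match",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # non-negative input V

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(tfidf)  # document-topic weights
H = nmf.components_           # topic-word weights

# The defining property: every entry of both factors is non-negative.
assert (W >= 0).all() and (H >= 0).all()
```

The non-negativity constraint is what makes the factors interpretable: a document is an additive mixture of topics, and a topic is an additive mixture of words.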
Parallel Latent Dirichlet Allocation
It is also known as Partially Labeled Dirichlet Allocation. Here, the model assumes that there exists a set of n labels, each associated with a topic of the given corpus. The individual topics are then represented as probability distributions over the whole corpus, similar to LDA.
Pachinko Allocation Model
Pachinko Allocation Model (PAM) is an improvement over the Latent Dirichlet Allocation model. LDA brings out the correlations between words by identifying topics based on the thematic relationships between words in the corpus; PAM additionally models the correlations between the topics themselves.
Topic modeling can be used in graph-based models to obtain semantic relationships between words.
It can be used in text summarization to quickly discover what a document or book is about.
It can identify the keywords in a search query and recommend products to customers accordingly.