Topic Modeling

LDA: Latent Dirichlet allocation
LSA (LSI): Latent semantic analysis (also called latent semantic indexing)



LSA
Create a matrix X where rows are terms and columns are documents. Each entry is the count of the term in that document (or its tf-idf weight).
Take an SVD (singular value decomposition) of the matrix.
The SVD gives you X = U Σ V^T, where U and V are orthogonal matrices and Σ is a diagonal matrix.
The entries of Σ are the singular values (the square roots of the eigenvalues of X^T X), and the columns of U and V are the left and right singular vectors, corresponding to the term and document dimensions respectively.
This is simply a re-representation of the X matrix using orthogonal indexing dimensions.
LSA uses a truncated SVD, keeping only the k largest singular values and their associated vectors, so
X ≈ U_k Σ_k V_k^T
The rows of U_k are the term vectors in LSA space and the rows of V_k are the document vectors.
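To make this concrete, here is a minimal NumPy sketch under some assumptions: the toy count matrix and the choice k = 2 are made up for illustration, and in practice X would be built from a real corpus (raw counts or tf-idf).

```python
import numpy as np

# Toy term-document count matrix X (rows = terms, columns = documents).
# In practice this would come from counting (or tf-idf weighting) a corpus.
X = np.array([
    [2, 0, 1, 0],   # term 0, e.g. "cat"
    [1, 1, 0, 0],   # term 1, e.g. "dog"
    [0, 0, 3, 1],   # term 2, e.g. "market"
    [0, 1, 2, 2],   # term 3, e.g. "stock"
], dtype=float)

k = 2  # number of latent dimensions to keep

# Full SVD: X = U @ diag(s) @ Vt, with orthogonal U, V and singular values s.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to the k largest singular values and their vectors.
U_k = U[:, :k]             # rows are term vectors in LSA space
S_k = np.diag(s[:k])
V_k = Vt[:k, :].T          # rows are document vectors in LSA space

X_k = U_k @ S_k @ V_k.T    # rank-k approximation of X
```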
Since both passages and terms are represented as vectors in the same space, it is straightforward to compute passage-passage, term-term, and term-passage similarities. In addition, terms and/or passages can be combined to create new vectors in the space; the process by which new vectors are added to an existing LSA space is called folding-in.
The cosine between vectors is used as the similarity measure in many applications, both because of its relation to the dot-product criterion and because it has been found effective in practice, although other metrics, such as Euclidean or city-block distance, are sometimes used.
Accurately updating the SVD to take new term frequencies and/or new terms into account requires considerable computation; even minor perturbations to the original term-by-document matrix X can produce different term and passage vectors (and therefore affect precision and recall for query matching).
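Continuing the sketch above, cosine similarity and folding-in might look like the following; the new document's counts x_new are invented, and folding-in uses the usual projection x^T U_k Σ_k^{-1}.

```python
# Continues the NumPy sketch above (U_k, S_k, V_k assumed to be in scope).

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Document-document similarity in the reduced space.
print(cosine(V_k[0], V_k[2]))

# Folding-in: project a new document's raw term-count vector into the
# existing LSA space without recomputing the SVD: d_new = x @ U_k @ inv(S_k).
x_new = np.array([1.0, 0.0, 2.0, 1.0])      # counts over the same term vocabulary
d_new = x_new @ U_k @ np.linalg.inv(S_k)    # k-dimensional document vector
print(cosine(d_new, V_k[2]))
```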



Latent Dirichlet allocation

Say you have K topics
For each document, assign each word a topic at random.
This gives a random topic representation of all the documents and random word distributions for all the topics.
For each document d:
    For each word w in d:
        For each topic t, compute two things:
            1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t
            2) p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w
        Reassign w a new topic, choosing topic t with probability proportional to p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense to resample the current word's topic with this probability).



We’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
Resampling each word's topic from this probability, one word at a time, is called Gibbs sampling (a minimal sketch follows below).
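Here is a minimal collapsed Gibbs sampler sketch in NumPy, assuming documents are given as lists of integer word ids; the Dirichlet smoothing hyperparameters alpha and beta (left implicit in the description above) are added because they keep the sampling probabilities nonzero, and their values here are only illustrative.

```python
import numpy as np

def lda_gibbs(docs, K, vocab_size, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial topic assignments
    ndk = np.zeros((len(docs), K))   # count of words in doc d assigned to topic k
    nkw = np.zeros((K, vocab_size))  # count of word w assigned to topic k, over all docs
    nk = np.zeros(K)                 # total words assigned to topic k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment before resampling it.
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # For every topic: p(topic t | doc d) * p(word w | topic t), smoothed.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * vocab_size)
                t = rng.choice(K, p=p / p.sum())  # resample this word's topic
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return z, ndk, nkw
```

For example, with docs = [[0, 1, 1, 2], [2, 3, 3, 0]] and vocab_size = 4, the returned nkw counts (normalized per row) give the word distributions of the K topics, and ndk (normalized per row) gives each document's topic proportions.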


