A way of thinking about topic modeling is as a matrix factorization problem. For m documents and v words we want k topics
Dirichlet distribution:
Each topic
A single document
Generative model of LDA:
K # of topics V # of vocabs M # of documents α ∈ R ^{K}for i ∈ {1,..,M} θ ## Latent Dirichlet allocationSay you have K topics For each document, for each word assign a topic in random. This gives a random representations of all the documents and word distributions of all the topics For each document For each word For each topic Compute two things: 1) p(topic t | document d) = % of words with topic t in document d 2) p(word w | topic t) = % of word with topic t over all topic ts choose topic t with probability p(topic t | document d) * p(word w | topic t) for word. (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability) We’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated. That sampling of topics based on that prob. for each word is called Gibbs Sampling. LSA Create a matrix where rows are terms and columns are documents. The entry is the count of each term (or tf/idf) take a SVD (singular value decomposition) of the matrix. SVD gives you X = UΣV ^{T} where U and V are orthogonal matrices and Σ is a diagonal matrix. The
entries of the Σ matrix are the singular values (eigenvalue), and the U
and V matrices are the left and right singular vectors, corresponding
to term and documents vectors This is simply a re-representation of the X matrix using orthogonal indexing dimensions. LSA uses a truncated SVD, keeping only the k largest singular values and their associated vectors, so X=U _{k}∗Σ_{k}∗V^{T}_{k}The rows in U _{k} are the term vectors in LSA space and the rows in V_{k} are the document vectors.Since both passages and terms are represented as vectors, it is straightforward to compute the similarity between passage-passage, term-term, and term-passage. In addition, terms and/or passages can be combined to create new vectors in the space. The process by which new vectors can be added to an existing LSA space is called folding-in. The cosine distance between vectors is used as the measure of their similarity for many applications because of its relation to the dot-product criterion and has been found effective in practice, although other metrics, such as Euclidean or city-block distances are sometimes used. To accurately update the SVD and thereby take into account new term frequencies and/or new terms requires considerable computation; minor perturbations to the original term-by-document matrix X can produce different term and passage vectors (and therefore affect precision and recall for query matching). |