Latent Dirichlet allocationSay you have K topics
For each document, for each word assign a topic in random.
This gives a random representations of all the documents and word distributions of all the topics
For each document For each word
Compute two things: 1) p(topic t  document d) = % of words with topic t in document d 2) p(word w  topic t) = % of word with topic t over all topic ts choose topic t with probability p(topic
t  document d) * p(word w  topic t) for word. (according to our generative
model, this is essentially the probability that topic t generated word
w, so it makes sense that we resample the current word’s topic with this
probability)
We’re assuming that all topic assignments except for the current word in
question are correct, and then updating the assignment of the current
word using our model of how documents are generated. That sampling of topics based on that prob. for each word is called Gibbs Sampling.
LSA Create a matrix where rows are terms and columns are documents. The entry is the count of each term (or tf/idf)
take a SVD (singular value decomposition) of the matrix.
SVD gives you X = UΣV^{T} where U and V are orthogonal matrices and Σ is a diagonal matrix.
The
entries of the Σ matrix are the singular values (eigenvalue), and the U
and V matrices are the left and right singular vectors, corresponding
to term and documents vectors
This is simply a rerepresentation of the X matrix using orthogonal indexing dimensions.
LSA uses a truncated SVD, keeping only the k largest singular values and their associated vectors, so
X=U_{k}∗Σ_{k}∗V^{T}_{k}
The rows in U_{k} are the term vectors in LSA space and the rows in V_{k} are the document vectors. Since
both passages and terms are represented as vectors, it is
straightforward to compute the similarity between passagepassage,
termterm, and termpassage. In addition, terms and/or passages can be
combined to create new vectors in the space. The process by which new
vectors can be added to an existing LSA space is called foldingin. The
cosine distance between vectors is used as the measure of their
similarity for many applications because of its relation to the
dotproduct criterion and has been found effective in practice, although
other metrics, such as Euclidean or cityblock distances are sometimes
used. To accurately update the SVD and thereby take into account new
term frequencies and/or new terms requires considerable computation;
minor perturbations to the original termbydocument matrix X can
produce different term and passage vectors (and therefore affect
precision and recall for query matching).
