Naive Bayes is a conditional probability model based on Bayes' theorem: given a problem instance to be classified, represented by a vector X = (x1, x2, …, xn) of n features (independent variables), it assigns to this instance the probability P(Y|X) for each possible class Y.
Using Bayes' theorem, we know that: P(Y|X) = P(Y)*P(X|Y)/P(X)
Now the "naive" conditional independence assumptions come into play: assume that each feature xi is conditionally independent of every other feature xj for i ≠j in the given category Y. That means,
Naive Bayesian classification can be applied to the document classification problem. Consider the problem of classifying documents by their content, for example into spam and non-spam e-mails. Imagine that documents are drawn from a number of classes that can be modeled as sets of words, where the (independent) probability that the ith word of a given document occurs in a document from class C can be written as p(wi|C).
Then the probability that a given document D contains all of the words wi, given a class C, is
P(D|C) = p(w1|C)*p(w2|C)*p(w3|C)*…*p(wn|C)
The question we want to answer is: "what is the probability that a given document D belongs to a given class C?" In other words, what is P(C|D)?
Applying Bayes' theorem again, P(C|D) = P(C)*P(D|C)/P(D). The Naive Bayes classifier assigns D to the class C that maximizes this quantity; since the denominator P(D) is the same for every class, it can be ignored when comparing classes.
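The following is a minimal sketch of such a classifier, assuming a bag-of-words representation and an invented four-document training corpus; Laplace smoothing is added (not mentioned in the text) so that unseen words do not zero out the product.

```python
import math
from collections import Counter

# Tiny invented training corpus: (document text, class label).
train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]

classes = {"spam", "ham"}
doc_count = Counter(c for _, c in train)              # counts for P(C)
word_count = {c: Counter() for c in classes}          # counts for p(w|C)
for text, c in train:
    word_count[c].update(text.split())
vocab = {w for c in classes for w in word_count[c]}

def classify(document):
    """Return argmax_C log P(C) + sum_i log p(wi|C); the constant P(D) is dropped."""
    words = document.split()
    best, best_score = None, float("-inf")
    for c in classes:
        score = math.log(doc_count[c] / len(train))   # log P(C)
        total = sum(word_count[c].values())
        for w in words:
            # Laplace smoothing so unseen words keep a small nonzero probability
            score += math.log((word_count[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(classify("cheap meeting money"))   # expected to lean towards 'spam'
```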
Latent Dirichlet Allocation (LDA)
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model.
This model can be described as follows (a small simulation sketch appears after the steps):
Generative process:
1. Sample a topic distribution θi for document i from the Dirichlet distribution with parameter α
2. Sample the topic zij of the jth word of document i from the multinomial distribution θi
3. Sample a word distribution φ for topic zij from the Dirichlet distribution with parameter β
4. Sample the resulting word wij from that word distribution φ
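A minimal sketch of this generative process is given below, with invented sizes, vocabulary, and hyperparameters; following the usual smoothed LDA formulation, one word distribution φk is drawn once per topic rather than once per word occurrence.

```python
import numpy as np

# Toy LDA generative process (all sizes and hyperparameters are invented).
rng = np.random.default_rng(0)

K = 3                                   # number of topics
vocab = ["ball", "game", "vote", "law", "gene", "cell"]
V = len(vocab)
alpha = np.full(K, 0.5)                 # Dirichlet prior over topics per document
beta = np.full(V, 0.1)                  # Dirichlet prior over words per topic

# One word distribution phi_k per topic (step 3), drawn once per topic here.
phi = rng.dirichlet(beta, size=K)       # shape (K, V)

def generate_document(n_words):
    theta = rng.dirichlet(alpha)        # step 1: topic distribution of the document
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)      # step 2: topic of this word (multinomial draw)
        w = rng.choice(V, p=phi[z])     # step 4: word drawn from the topic's distribution
        words.append(vocab[w])
    return words

print(generate_document(8))
```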
Word2vec
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
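As an illustration, the sketch below trains embeddings on a toy corpus, assuming the gensim library (version 4.x API) is available; the corpus, parameter values, and probe word are invented for illustration.

```python
from gensim.models import Word2Vec

# Tiny invented corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,    # dimensionality of the embedding space
    window=5,           # context window size
    min_count=1,        # keep every word in this tiny corpus
    sg=1,               # 1 = skip-gram, 0 = CBOW
    epochs=50,
)

vec = model.wv["cat"]                    # the 100-dimensional vector for "cat"
print(model.wv.most_similar("cat"))      # words closest to "cat" in the vector space
```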