Naïve Bayes is a supervised learning technique rooted in Bayes' theorem, which quantifies the probability of an event based on prior knowledge of conditions related to it. The method is called "naïve" because it assumes that the features (or predictors) are conditionally independent given the class (or target variable): the presence or absence of one feature is assumed to have no effect on the presence or absence of any other. Despite this simplistic assumption, Naïve Bayes classifiers often perform remarkably well in practice, particularly on high-dimensional data such as text or images, because they can still capture the underlying relationships between the features and the target variable even when the independence assumption is violated. Naïve Bayes is widely used in text categorization, sentiment analysis, spam filtering, and even medical diagnosis, and it is especially advantageous when working with large datasets or when the relationships between features are uncertain or intricate.
Bayes’ Theorem is the backbone of NB models, used to compute the probability of a class given certain evidence (feature values):

P(C|X) = P(X|C) × P(C) / P(X)
Where:
• P(C|X): Posterior probability of class C given features X
• P(X|C): Likelihood of features X given class C
• P(C): Prior probability of class C
• P(X): Probability of features X
NB models calculate these probabilities and classify data into the class with the highest posterior probability.
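For concreteness, the short Python sketch below computes these posteriors for a hypothetical two-class problem; the prior and likelihood values are made-up numbers chosen only for illustration.

```python
# A minimal sketch of Bayes' theorem for a hypothetical two-class problem.
# The priors and likelihoods below are assumed, illustrative numbers.
priors = {"spam": 0.4, "ham": 0.6}        # P(C)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(X|C) for some observed features X

# P(X): total probability of the evidence across both classes
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(C|X) = P(X|C) * P(C) / P(X)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

print(posteriors)                           # {'spam': 0.842..., 'ham': 0.157...}
print(max(posteriors, key=posteriors.get))  # 'spam' -- the predicted class
```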
Naive Bayes (NB) algorithms are a family of straightforward probabilistic classifiers that apply Bayes' theorem under a strong (naïve) independence assumption between the features. They are especially well suited to classification problems with a high number of dimensions in the input space, such as text classification, but they can also be applied to a wide range of other problems, such as spam detection, sentiment analysis, and even weather prediction when framed as a classification task.
Multinomial Naive Bayes is often used when dealing with discrete data (e.g., word counts for text classification). It calculates the probability of each label, and then the label with the highest probability is output as the prediction. Smoothing techniques, such as Laplace smoothing, are applied to avoid the problem of zero probability for unseen features during training.
Bernoulli Naive Bayes is best suited for binary/boolean features. It works similarly to Multinomial NB but is designed for binary/boolean feature vectors rather than frequency counts.
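As a quick illustration of how these two variants might be used, the sketch below fits scikit-learn's MultinomialNB and BernoulliNB on a tiny, assumed set of documents; the documents, labels, and test phrase are invented purely to show the workflow.

```python
# A minimal sketch using scikit-learn; the toy documents and labels are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["free prize money", "meeting at noon", "win free money now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

# Multinomial NB works on word counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
multi = MultinomialNB(alpha=1.0).fit(X_counts, labels)   # alpha=1.0 applies Laplace smoothing

# Bernoulli NB works on binary presence/absence features.
bin_vec = CountVectorizer(binary=True)
X_binary = bin_vec.fit_transform(docs)
bern = BernoulliNB(alpha=1.0).fit(X_binary, labels)

print(multi.predict(count_vec.transform(["free money"])))   # likely ['spam']
print(bern.predict(bin_vec.transform(["free money"])))      # likely ['spam']
```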
Multinomial Naive Bayes
Multinomial Naive Bayes is a variant of the Naive Bayes algorithm designed specifically for text classification tasks in which the data is represented as a bag of words. This model treats every document as a collection of words, without regard to the words' context or sequence. During the training phase, the algorithm learns the likelihood that each word appears in each class.
In order to train the Multinomial NB model, the following steps are taken:
1. Compute the prior probability of each class, P(class), by counting the number of training instances in each class and dividing by the total number of instances.
2. For every word in the vocabulary and every class, estimate the likelihood P(word|class) by counting how many times the word appears in that class and dividing by the total number of word occurrences in that class.
3. Apply a smoothing technique such as Laplace smoothing so that words unseen in a class do not receive a probability of zero, which would otherwise make the overall probability zero.
At prediction time, the algorithm computes the probability of a new document belonging to each class by multiplying the prior probability of the class by the likelihoods of each word in the document given that class, and it assigns the document to the class with the highest resulting probability. Because it takes word occurrence frequencies into account, which can be important for identifying a document's topic or sentiment, the Multinomial NB model is especially well suited to text classification tasks.
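To make the training and prediction steps above concrete, here is a minimal from-scratch sketch of a Multinomial NB classifier; the toy documents and labels are assumptions, and the prediction sums log probabilities instead of multiplying raw probabilities, which is numerically safer but equivalent for ranking the classes.

```python
import math
from collections import Counter, defaultdict

# Assumed toy training data: (document, class) pairs.
docs = [("free prize money", "spam"),
        ("meeting at noon", "ham"),
        ("win free money now", "spam"),
        ("project meeting notes", "ham")]

# Step 1: prior P(class) = documents in class / total documents
class_counts = Counter(label for _, label in docs)
priors = {c: n / len(docs) for c, n in class_counts.items()}

# Step 2: word counts per class and the shared vocabulary
word_counts = defaultdict(Counter)
vocab = set()
for text, label in docs:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def likelihood(word, label, alpha=1.0):
    """Step 3: P(word|class) with Laplace (add-alpha) smoothing."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + alpha) / (total + alpha * len(vocab))

def predict(text):
    """Score each class with log prior + sum of log likelihoods, pick the best."""
    scores = {}
    for label in priors:
        score = math.log(priors[label])
        for word in text.split():
            score += math.log(likelihood(word, label))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money now"))   # expected: 'spam'
```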
What is smoothing?
In Naive Bayes models, smoothing is an essential technique for handling zero probabilities, which occur when a word never appears in the training documents of a given class. If the likelihood of a word for a class were zero, the overall probability for that class would be zero regardless of the probabilities of the other words, so any document containing that word could never be assigned to that class. Smoothing resolves this by adding a small value (often 1) to the numerator of each likelihood estimate and adjusting the denominator accordingly. This procedure, known as additive or Laplace smoothing, guarantees that no word has a probability of exactly zero, allowing the model to handle words it has never seen before. Without smoothing, the model would be strongly biased against any class whose training data did not contain every word of a new document, leading to poor performance, particularly with sparse data or when rare or novel words appear at prediction time. With smoothing, the model handles unknown words more gracefully and robustly, avoiding zero-probability problems and improving prediction accuracy.
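The small calculation below illustrates the effect of Laplace smoothing on a single likelihood estimate; the counts and vocabulary size are made-up numbers.

```python
# Assumed counts for one word that never appeared in a class during training.
count_word_in_class = 0     # occurrences of the word in this class
total_words_in_class = 100  # total word occurrences in this class
vocab_size = 50             # size of the vocabulary

# Without smoothing the likelihood is exactly zero, which would zero out
# the probability of any document containing this word.
unsmoothed = count_word_in_class / total_words_in_class          # 0.0

# Laplace (add-one) smoothing adds 1 to the numerator and adjusts the denominator.
alpha = 1.0
smoothed = (count_word_in_class + alpha) / (total_words_in_class + alpha * vocab_size)

print(unsmoothed, smoothed)   # 0.0 0.00666...
```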
Bernoulli Naive Bayes
Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm designed for binary or boolean data, where each word or feature in a document is represented as either present (1) or absent (0). Unlike the Multinomial NB model, which also takes word frequency into account, the Bernoulli NB model considers only word presence or absence.
The Bernoulli NB model assumes that each feature is generated by an independent Bernoulli trial, so the occurrence of one word in a document is independent of the occurrence of any other. The model estimates the likelihoods not from word frequencies, but from whether or not each term appears in the documents of each class.
The Bernoulli NB model is trained using the following steps:
1. Compute the prior probability of each class, P(class), just as in the Multinomial NB model.
2. For every word in the vocabulary and every class, estimate the likelihood P(word|class) from whether the word appears in that class's documents or not.
3. Apply smoothing to handle words or classes that were not observed during training.
When predicting, the algorithm determines the probability of a new document belonging to each class by multiplying the prior probability of the class by the likelihood of each word's presence or absence given that class, and it assigns the document to the most probable class. The Bernoulli NB model is appropriate for tasks involving short texts or sparse data, such as document classification, where the presence or absence of a feature matters more than its frequency. It is also useful in some biological or medical applications where the data is inherently binary or boolean.
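The sketch below shows how the Bernoulli likelihoods could be estimated from presence/absence counts rather than word frequencies; the toy documents are the same assumed examples used earlier.

```python
from collections import Counter, defaultdict

# Assumed toy training data: (document, class) pairs.
docs = [("free prize money", "spam"),
        ("meeting at noon", "ham"),
        ("win free money now", "spam"),
        ("project meeting notes", "ham")]

docs_per_class = Counter(label for _, label in docs)

# Count how many documents of each class contain each word (presence, not frequency).
doc_freq = defaultdict(Counter)
for text, label in docs:
    for word in set(text.split()):   # set(): each word counted once per document
        doc_freq[label][word] += 1

def presence_likelihood(word, label, alpha=1.0):
    """P(word present | class) with Laplace smoothing over the two outcomes."""
    return (doc_freq[label][word] + alpha) / (docs_per_class[label] + 2 * alpha)

print(presence_likelihood("free", "spam"))   # (2 + 1) / (2 + 2) = 0.75
print(presence_likelihood("free", "ham"))    # (0 + 1) / (2 + 2) = 0.25
```

At prediction time, a full Bernoulli NB would also multiply in 1 − P(word present | class) for every vocabulary word that is absent from the document, which is precisely where it differs from the Multinomial variant.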