Before going into Naïve Bayes, it is essential to understand conditional probability, Bayes theorem, and some common terms associated with it, like prior probability, posterior probability, and likelihood.
Probability, in layman's terms, is a way of measuring how likely something is to happen. It's like estimating the chances of an event occurring, such as winning a game, getting heads in a coin toss, or predicting the weather. It's often expressed as a number between 0 and 1, where 0 means impossible and 1 means certain. So, if something has a probability of 0.5, it's like saying there's a 50-50 chance of it happening.
Example:
Consider an unbiased coin toss. The probability of getting heads and the probability of getting tails are the same, so there's a 50-50 chance of either outcome. Mathematically it is expressed as:
P(Getting a Heads) = P(Getting a Tails) = 1/2 = 0.5
Conditional probability is the probability of an event happening given that another event has already happened. In other words, it is like figuring out the chance of something occurring when you have extra information.
Consider this example:
The chance that it will rain tomorrow is an ordinary probability, whereas the chance that it will rain tomorrow given that it is cloudy today is a conditional probability.
It is mathematically expressed as
P(B|A) = P(A, B)/P(A)
where P(B|A) is the conditional probability of event B occurring given event A occurs.
P(A, B) is the probability of events A and B occurring simultaneously.
P(A) is the probability that event A occurs.
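To make the definition concrete, here is a minimal Python sketch (the fair-die scenario and the choice of events A and B are assumptions made purely for illustration) that computes P(B|A) directly from the formula above:

```python
from fractions import Fraction

# Illustrative assumption: rolling a fair six-sided die.
# Event A: the roll is even           -> {2, 4, 6}
# Event B: the roll is greater than 3 -> {4, 5, 6}
outcomes = range(1, 7)
A = {o for o in outcomes if o % 2 == 0}
B = {o for o in outcomes if o > 3}

p_A = Fraction(len(A), 6)            # P(A) = 3/6
p_A_and_B = Fraction(len(A & B), 6)  # P(A, B) = 2/6, since A and B share {4, 6}

# Definition of conditional probability: P(B|A) = P(A, B) / P(A)
p_B_given_A = p_A_and_B / p_A
print(p_B_given_A)                   # 2/3
```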
Bayes Theorem is one of the most famous results in probability theory. It states that for two events A and B, the conditional probability of A given B can be computed from the conditional probability of B given A. In simple terms, Bayes theorem describes the probability of an event based on prior knowledge of conditions related to it.
It is mathematically expressed as
P(A|B) = (P(B|A) * P(A)) / P(B)
where P(A|B) is the conditional probability of event A given B occurs. This is also called Posterior probability.
P(B|A) is the conditional probability of event B occurring given event A occurs. This is also called the Likelihood.
P(A) is the probability that event A occurs. This is called the Prior probability.
P(B) is the probability that event B occurs. This is called the Evidence (the marginal probability of B, obtained by marginalization).
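To see the formula in action, here is a minimal Python sketch applying Bayes theorem to the earlier rain/cloud example; the numeric probabilities are purely illustrative assumptions, not real weather statistics:

```python
# Illustrative assumptions:
# A = "it rains tomorrow", B = "it is cloudy today"
p_A = 0.2          # prior:      P(rain)
p_B_given_A = 0.9  # likelihood: P(cloudy | rain)
p_B = 0.4          # evidence:   P(cloudy)

# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # ~0.45, the posterior probability of rain given that it is cloudy
```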
Naïve Bayes (NB) is a popular supervised machine learning algorithm based on Bayes theorem. It is a simple but effective algorithm that works well with large datasets. The term "Naïve" is used because the algorithm assumes that all features are independent of one another given the class, which is what allows Bayes theorem to be applied feature by feature; hence the name Naïve Bayes.
Naïve Bayes can be utilized for various purposes across multiple fields. Some of the common applications of Naïve Bayes include:
Email Spam Filtering
Text Classification
Medical diagnosis
Credit Scoring
There are three types of Naïve Bayes:
Multinomial Naïve Bayes
Gaussian Naïve Bayes
Bernoulli Naïve Bayes
Multinomial Naïve Bayes:
Multinomial Naïve Bayes is commonly used for data that follows the multinomial distribution. This includes text data which are represented by word frequencies. It is suitable for classification tasks where features represent the frequency of occurrences of different events. Its applications include text classification, such as spam email detection or sentiment analysis, and classification of DNA and RNA sequences.
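As a sketch of how this looks in practice, the snippet below trains scikit-learn's MultinomialNB on word-count features; the tiny spam/ham corpus and its labels are made-up assumptions used only for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels (illustrative assumption).
messages = [
    "congratulations you won a gift card",   # spam
    "claim your free money now",             # spam
    "meeting moved to 3 pm tomorrow",        # ham
    "please review the attached report",     # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Convert text into word-frequency features (the multinomial representation).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = MultinomialNB()      # alpha=1.0 (Laplace smoothing) by default
model.fit(X, labels)

test = vectorizer.transform(["congratulations gift card money"])
print(model.predict(test))   # expected to lean towards 'spam'
```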
Gaussian Naïve Bayes:
Gaussian Naive Bayes (GNB) is commonly used for datasets in which features are continuous. The major difference between Multinomial Naïve Bayes and Gaussian Naïve Bayes is that multinomial uses count frequencies to compute conditional probabilities whereas the latter uses Gaussian Distribution to compute likelihoods. Its applications include medical diagnoses and creditworthiness assessment.
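A minimal sketch with scikit-learn's GaussianNB on continuous features is shown below; the two numeric features and the class labels are made-up values meant only to illustrate the API:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up continuous features, e.g. [age, income in thousands] (assumption).
X = np.array([
    [25, 30.0],
    [32, 45.0],
    [47, 80.0],
    [52, 110.0],
])
y = np.array([0, 0, 1, 1])  # e.g. 0 = "declined", 1 = "approved" (assumption)

model = GaussianNB()        # fits a Gaussian per feature and class to compute likelihoods
model.fit(X, y)

print(model.predict([[45, 75.0]]))        # predicted class
print(model.predict_proba([[45, 75.0]]))  # class probabilities
```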
Bernoulli Naïve Bayes:
Bernoulli Naïve Bayes is another variant of the Naïve Bayes algorithm. It's suitable for binary feature data, where features represent the presence or absence of certain attributes. Bernoulli Naïve Bayes is often used in document classification tasks, where the presence or absence of specific words in a document is considered. Its applications include text classification, image classification, and Bioinformatics.
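The sketch below uses scikit-learn's BernoulliNB with binary presence/absence features; as before, the documents and labels are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up documents and labels (illustrative assumption).
docs = [
    "win a free gift card",        # spam
    "free money waiting for you",  # spam
    "lunch at noon tomorrow",      # ham
    "project status update",       # ham
]
labels = ["spam", "spam", "ham", "ham"]

# binary=True keeps only the presence or absence of each word (Bernoulli features).
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

model = BernoulliNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free gift card"])))  # likely 'spam'
```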
Example of Multinomial Naïve Bayes:
Consider the task of classifying mail as spam or ham (not spam). Our goal is to classify whether the message "Hooray Congratulations, Gift card Money" is spam or ham.
Let's assume that in spam mails the word frequencies are: Hooray 10, Congratulations 6, Gift card 5, and Money 9 (30 words in total), while in ham mails they are: Hooray 2, Congratulations 4, Gift card 8, and Money 6 (20 words in total).
Let's also assume there are 8 spam mails and 4 ham mails, 12 mails in total.
P(Spam) = 8/12 = 2/3
P(Ham) = 4/12 = 1/3
P(Hooray|Spam) = 10/30 = 1/3          P(Hooray|Ham) = 2/20 = 1/10
P(Congratulations|Spam) = 6/30 = 1/5  P(Congratulations|Ham) = 4/20 = 1/5
P(Gift card|Spam) = 5/30 = 1/6        P(Gift card|Ham) = 8/20 = 2/5
P(Money|Spam) = 9/30 = 3/10           P(Money|Ham) = 6/20 = 3/10
Text = "Hooray Congratulations, Gift card Money"
P(Ham|Text) ∝ P(Ham) * P(Text|Ham) = P(Ham) * P(Hooray|Ham) * P(Congratulations|Ham) * P(Gift card|Ham) * P(Money|Ham)
P(Ham|Text) ∝ 1/3 * 1/10 * 1/5 * 2/5 * 3/10 = 0.0008
P(Spam|Text) ∝ P(Spam) * P(Text|Spam) = P(Spam) * P(Hooray|Spam) * P(Congratulations|Spam) * P(Gift card|Spam) * P(Money|Spam)
P(Spam|Text) ∝ 2/3 * 1/3 * 1/5 * 1/6 * 3/10 ≈ 0.0022
Since the spam score (0.0022) is higher than the ham score (0.0008), the message is classified as spam. (The "∝" is used because we drop the common denominator P(Text), which does not affect the comparison.)
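The same hand calculation can be reproduced with a few lines of plain Python; this is just a sketch that reuses the counts assumed above, with no library beyond fractions:

```python
from fractions import Fraction

# Word counts and mail counts assumed in the example above.
spam_counts = {"hooray": 10, "congratulations": 6, "gift card": 5, "money": 9}
ham_counts  = {"hooray": 2,  "congratulations": 4, "gift card": 8, "money": 6}
n_spam_mails, n_ham_mails = 8, 4
n_total_mails = n_spam_mails + n_ham_mails

def score(counts, n_class_mails, words):
    """Unnormalized posterior: prior * product of per-word likelihoods."""
    total_words = sum(counts.values())
    s = Fraction(n_class_mails, n_total_mails)  # prior, e.g. P(Spam) = 8/12
    for w in words:
        s *= Fraction(counts[w], total_words)   # likelihood, e.g. P(word|Spam)
    return s

words = ["hooray", "congratulations", "gift card", "money"]
spam_score = score(spam_counts, n_spam_mails, words)
ham_score = score(ham_counts, n_ham_mails, words)
print(float(spam_score), float(ham_score))  # ~0.0022 vs 0.0008 -> classified as spam
```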
Smoothing is required in Naïve Bayes models to handle cases where the training data doesn't contain examples of certain feature-class combinations. Such combinations get a probability of zero, which zeroes out the whole product and causes the model to misclassify unseen data. Techniques like Laplace smoothing or Lidstone smoothing are commonly used to alleviate this issue by adding a small value to all feature counts during probability estimation. So, it is always important to use smoothing when working with Naïve Bayes.
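As a sketch of the idea, Laplace smoothing adds a constant α (typically 1) to every count before estimating a likelihood; the counts below are illustrative assumptions in which "lottery" never appears in ham mails. In scikit-learn's MultinomialNB this corresponds to the alpha parameter (alpha=1.0 by default).

```python
from fractions import Fraction

# Assumed counts: "lottery" never appears in ham, so its raw likelihood is 0.
ham_counts = {"meeting": 7, "report": 5, "lottery": 0}
alpha = 1                     # Laplace smoothing constant
vocab_size = len(ham_counts)  # vocabulary size V
total = sum(ham_counts.values())

# Without smoothing: P(lottery|Ham) = 0/12 = 0, which zeroes out the whole product.
unsmoothed = Fraction(ham_counts["lottery"], total)

# With smoothing: P(w|Ham) = (count(w) + alpha) / (total + alpha * V)
smoothed = Fraction(ham_counts["lottery"] + alpha, total + alpha * vocab_size)

print(unsmoothed, smoothed)   # 0 vs 1/15
```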
Advantages of Naïve Bayes:
It is simple to implement and computationally efficient, even on large datasets.
It is less prone to overfitting and can also be used for multi-class classification problems.
Limitations of Naïve Bayes:
Naïve Bayes assumes the features to be independent, which may result in inaccurate predictions when features are correlated.
It cannot capture complex relationships between features.