After completing this learning module, students will be able to:
Describe email spam and spam filtering
Explain why the Naive Bayes theorem is useful when detecting spam
Apply the Naive Bayes theorem to identify and classify spam emails using a dataset containing spam and ham emails
Email security concerns
Discovering threats hiding in an email server is a major security concern. Malicious emails carrying hacking attempts, malware, phishing, and scams may reach end users and lure them into clicking malicious links, opening attached malware, and submitting private personal data. Machine learning can help identify potentially malicious emails and filter them out, which helps prevent attacks and protect users.
What is spam?
Spam is unsolicited, unwanted electronic mail, similar to paper junk mail. It can advertise a range of goods and services or attempt to get you to reveal information about yourself. People who send spam generally harvest email addresses from a variety of sources and then craft their messages in specific ways to keep them from being recognized and blocked by spam filters. Messages range from well designed and legitimate looking to plain text that makes no sense at all. [1]
Most spam falls into the following categories:
Adult content
Advance fee scams and fraud
Medications and health
E-advertising services and business opportunities
IT, education and training
Here is an example spam email: [3]
Dear Mr. Aman,
You have won a lottery offer for $2,000,000!!! Click here to claim it now.
Clearly, this is a spam email in most cases. The sender of such an email wants you to click on the link so that he/she can fool you by:
Capturing your personal details.
Making you download malware or a virus.
The spam detection problem is therefore quite important to solve. More formally, we are given an email or an SMS and are required to classify it as spam or non-spam (often called ham).
The Naive Bayes Theorem
The Naive Bayes classifier is built on Bayes' theorem, which works on conditional probability: the probability that one event will occur given that another event has already happened. The theorem states:
P(A|B) = P(B|A) * P(A) / P(B)
In the above equation,
P(B) is the probability of the evidence
P(A) is the prior probability (the probability that hypothesis A is true)
P(B|A) is the probability of the evidence B given that hypothesis A is true
P(A|B) is the probability of the hypothesis A given that the evidence B is true.
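As a quick numeric illustration of the theorem, here is a minimal sketch with made-up numbers (all of the figures below are assumptions for demonstration): suppose 20% of emails are spam, the word "lottery" appears in 40% of spam emails, and it appears in only 1% of ham emails.

# A minimal sketch of Bayes' theorem with made-up numbers
# A = "the email is spam", B = "the email contains the word 'lottery'"
p_a = 0.20              # P(A): prior probability that an email is spam (assumed)
p_b_given_a = 0.40      # P(B|A): 'lottery' appears in spam (assumed)
p_b_given_not_a = 0.01  # P(B|not A): 'lottery' appears in ham (assumed)

# P(B) expanded with the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # about 0.909: an email containing 'lottery' is very likely spam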
Flowchart: detecting email spam using Naive Bayes
Naive Bayes Classifier
Bayes' theorem is used in the Naive Bayes classifier: the posterior probability of each class in the given dataset is computed, and the class with the highest probability becomes the prediction. This decision rule is called Maximum A Posteriori (MAP).
The "naive" part is that we assume every feature is independent of the others given the class label.
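To make the MAP rule concrete, here is a minimal sketch, assuming binary features and hypothetical probability values (a real implementation would typically sum log-probabilities instead of multiplying, to avoid numerical underflow):

def naive_bayes_map(priors, likelihoods, x):
    """Return the class with the highest posterior score (MAP).

    priors: dict mapping class -> P(class)
    likelihoods: dict mapping class -> list of P(feature_i = 1 | class)
    x: binary feature vector
    """
    scores = {}
    for c in priors:
        # naive assumption: features are independent given the class,
        # so the joint likelihood is the product of per-feature terms
        score = priors[c]
        for p, xi in zip(likelihoods[c], x):
            score *= p if xi == 1 else (1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# hypothetical numbers: two classes, three binary features
priors = {0: 0.5, 1: 0.5}
likelihoods = {0: [0.5, 0.5, 0.25], 1: [0.25, 0.5, 0.75]}
print(naive_bayes_map(priors, likelihoods, [0, 0, 1]))  # -> 1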
Naive Bayes algorithm
Naive Bayes is a popular statistical text-classification technique for spam filtering.
A Naive Bayes classifier correlates the use of word tokens in spam and non-spam emails, then uses Bayes' theorem to calculate the probability that an email is spam or ham.
Naive Bayes spam filtering can achieve a low false-positive rate.
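As a small illustration of this token-based approach, the sketch below turns a few invented emails into word-count vectors and fits a Naive Bayes classifier on them (the emails, labels, and test message are all made up for demonstration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# a tiny invented corpus: 1 = spam, 0 = ham
emails = [
    "win a free lottery prize now",       # spam
    "claim your free prize money",        # spam
    "meeting agenda for monday",          # ham
    "please review the attached report",  # ham
]
labels = [1, 1, 0, 0]

# convert each email into a vector of word-token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# fit Naive Bayes on the token counts
clf = MultinomialNB()
clf.fit(X, labels)

# classify a new (invented) message
test = vectorizer.transform(["free lottery prize waiting for you"])
print(clf.predict(test))  # expected: [1] (spam)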
Here is a simple example in which we will apply the Naive Bayes algorithm by hand.
The dataset below describes what 8 students did before an exam and whether they passed (the table is reconstructed from the training data used in the code later in this section):

Student | Go to party (B) | Play video games (C) | Study for the exam (D) | Pass the exam (A)
1 | 1 | 1 | 0 | 0
2 | 0 | 0 | 1 | 1
3 | 1 | 0 | 1 | 1
4 | 1 | 0 | 0 | 0
5 | 0 | 1 | 0 | 0
6 | 0 | 1 | 1 | 1
7 | 0 | 1 | 0 | 1
8 | 0 | 0 | 1 | 0
Assume we want to predict whether two students will pass the exam. The first student studies for the exam and does not go to a party or play video games. The second student does not study for the exam; instead, he goes to a party and plays video games.
So the first sample is (Go to party (B) = 0, Play video games (C) = 0, Study for the exam (D) = 1).
First, we calculate the probability that this student will pass the exam, using the naive independence assumption to factor the numerator:
P(A=1|B=0, C=0, D=1) = P(B=0|A=1) * P(C=0|A=1) * P(D=1|A=1) * P(A=1) / P(B=0, C=0, D=1)
= 0.75 * 0.5 * 0.75 * 0.5 / P(B=0, C=0, D=1)
= 0.140625 / P(B=0, C=0, D=1)
Then we calculate the probability that this student will fail the exam:
P(A=0|B=0, C=0, D=1) = P(B=0|A=0) * P(C=0|A=0) * P(D=1|A=0) * P(A=0) / P(B=0, C=0, D=1)
= 0.5 * 0.5 * 0.25 * 0.5 / P(B=0, C=0, D=1)
= 0.03125 / P(B=0, C=0, D=1)
As we can see, these two calculations have the same denominator, so we can compare the numerators directly. The value of P(A=1|B=0, C=0, D=1) is greater than P(A=0|B=0, C=0, D=1).
So, using the Naive Bayes algorithm, we predict that this student will pass the exam.
The second sample is (Go to party (B) = 1, Play video games (C) = 1, Study for the exam (D) = 0).
First, we calculate the probability that this student will pass the exam:
P(A=1|B=1, C=1, D=0) = P(B=1|A=1) * P(C=1|A=1) * P(D=0|A=1) * P(A=1) / P(B=1, C=1, D=0)
= 0.25 * 0.5 * 0.25 * 0.5 / P(B=1, C=1, D=0)
= 0.015625 / P(B=1, C=1, D=0)
Then we calculate the probability that this student will fail the exam:
P(A=0|B=1, C=1, D=0) = P(B=1|A=0) * P(C=1|A=0) * P(D=0|A=0) * P(A=0) / P(B=1, C=1, D=0)
= 0.5 * 0.5 * 0.75 * 0.5 / P(B=1, C=1, D=0)
= 0.09375 / P(B=1, C=1, D=0)
As we can see, these two calculations again have the same denominator. The value of P(A=1|B=1, C=1, D=0) is less than P(A=0|B=1, C=1, D=0).
So, using the Naive Bayes algorithm, we predict that this student will fail the exam.
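As a quick check of the arithmetic above, the numerators can be recomputed with a short script (a sketch; the columns follow the dataset table: Go to party (B), Play video games (C), Study for the exam (D)):

import numpy as np

# the dataset table: columns are B, C, D; y is Pass the exam (A)
X = np.array([[1,1,0],[0,0,1],[1,0,1],[1,0,0],[0,1,0],[0,1,1],[0,1,0],[0,0,1]])
y = np.array([0,1,1,0,0,1,1,0])

def numerator(sample, label):
    """Numerator of Bayes' rule: P(B|A) * P(C|A) * P(D|A) * P(A)."""
    rows = X[y == label]
    prior = np.mean(y == label)
    conds = [np.mean(rows[:, i] == v) for i, v in enumerate(sample)]
    return np.prod(conds) * prior

print(numerator([0,0,1], 1))  # 0.140625 -> pass wins for the first sample
print(numerator([0,0,1], 0))  # 0.03125
print(numerator([1,1,0], 1))  # 0.015625
print(numerator([1,1,0], 0))  # 0.09375  -> fail wins for the second sample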
Now let's confirm these two predictions in Python with scikit-learn.
Copy and paste the following code into Colab and run it.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# this is our dataset: columns are Go to party (B), Play video games (C),
# and Study for the exam (D)
X_train = [[1,1,0],[0,0,1],[1,0,1],[1,0,0],[0,1,0],[0,1,1],[0,1,0],[0,0,1]]
X_train = np.asarray(X_train)

# these are the labels of the dataset: Pass the exam (A)
y_train = [0,1,1,0,0,1,1,0]
y_train = np.asarray(y_train)

# these are the two samples we just calculated by hand
sample = [[0,0,1],[1,1,0]]
sample = np.asarray(sample)

# create the Naive Bayes classifier
classifier = MultinomialNB()

# fit the classifier to the dataset
classifier.fit(X_train, y_train)

# make predictions
y_pred = classifier.predict(sample)
print("Predictions for the two samples (1 means pass, 0 means fail):")
print("The first sample is predicted as:", y_pred[0])
print("The second sample is predicted as:", y_pred[1])
The output should show that the first sample is predicted as 1 (pass) and the second as 0 (fail), matching our hand calculations above.
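One note on this design choice: MultinomialNB treats the features as counts. Since our features are binary yes/no values, scikit-learn's BernoulliNB mirrors the hand calculation more closely; with smoothing effectively disabled it reproduces the unsmoothed frequencies we used above. A minimal sketch, continuing with the X_train, y_train, and sample variables from the code above:

from sklearn.naive_bayes import BernoulliNB

# alpha near 0 effectively disables Laplace smoothing, matching the
# unsmoothed frequencies used in the hand calculation
bnb = BernoulliNB(alpha=1e-10)
bnb.fit(X_train, y_train)
print(bnb.predict(sample))        # expected: [1 0], the same predictions
print(bnb.predict_proba(sample))  # posteriors proportional to the hand-computed numerators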
Pros and Cons of Naive Bayes algorithm [4]
Pros
Very simple and easy to use.
Needs less training data.
Makes probabilistic predictions.
Handles continuous and discrete data.
It is a generative model, i.e., it can make predictions even if some features are missing, by altering the decision rules.
Cons:
A subtle issue with the Naive Bayes classifier is that if a class label and a certain attribute value never occur together in the training data, the frequency-based probability estimate for that combination will be zero, which wipes out the entire product (see the sketch after this list).
A big dataset is required for making reliable estimates of the probability of each class. It can be used with small datasets, but the precision will suffer.
It assumes attribute independence, which rarely holds for real data.
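To illustrate the zero-frequency problem from the cons above: if the word "refund" never appears in any ham training email, the unsmoothed estimate of P("refund" | ham) is 0, and it zeroes out the entire product. Laplace (add-one) smoothing, exposed as the alpha parameter of scikit-learn's Naive Bayes classifiers, avoids this. A minimal sketch with made-up counts:

# made-up counts: how often the word 'refund' appears in ham training emails
count_in_ham = 0        # 'refund' never seen in ham (assumed)
total_ham_tokens = 500  # total word tokens in ham emails (assumed)
vocab_size = 1000       # vocabulary size (assumed)

# unsmoothed estimate: zero, which wipes out the whole product
print(count_in_ham / total_ham_tokens)  # 0.0

# Laplace (add-one) smoothing: add alpha to every count
alpha = 1
p_smoothed = (count_in_ham + alpha) / (total_ham_tokens + alpha * vocab_size)
print(p_smoothed)  # about 0.00067: small but nonzero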
References:
1. About Spam. (2015, July 20). Retrieved August 26, 2020, from https://tech.rochester.edu/about-spam/
2. Bahnsen, A. C., & Villegas, S. (2018, March 10). Machine Learning Algorithms Explained - Naive Bayes Classifier. Retrieved August 26, 2020, from https://blog.easysol.net/machine-learning-algorithms-4/
3. Krishnakumar. (2018, August 10). Spam Detection using Naive Bayes Algorithm. Retrieved August 26, 2020, from https://blog.eduonix.com/networking-and-security/spam-detection-naive-bayes-algorithm/
4. Pal, A. (2019, August 25). Spam Detection and filtering with Naive Bayes Algorithm. Retrieved August 26, 2020, from https://medium.com/secure-and-private-ai-math-blogging-competition/spam-detection-and-filtering-with-naive-bayes-algorithm-f6c2ac181174