The dataset in this module is used to detect spam emails.
The dataset has only one input attribute, the email text, plus a label that marks each email as spam or non-spam.
In this module, we will implement the Naive Bayes technique for spam email filtering in a new Google Colab notebook. The attributes in our dataset help us decide whether an email is spam or not. The attributes are as follows:
TEXT: This field represents the subject and body message of emails.
TYPE: This is a categorical variable that indicates whether the email's subject or body message is spam, where 1 means spam and 0 means non-spam.
Copy and paste the following link into your browser to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Then click File --> New notebook.
Click the area marked with a red box on the page and change the file name to Naive Bayes for email spam detection.ipynb
Next, click Runtime, choose Change runtime type, and set Hardware accelerator to GPU. This step is optional: the scikit-learn code used in this module runs on the CPU, so the GPU does not make it faster.
In the first code cell, copy and paste the following code to upload the dataset to Google Colab:
from google.colab import files
uploaded = files.upload()
Then click the run button (it looks like a play button) to run this code cell.
After the cell executes successfully, you should see a result like the following picture. Next, click Choose Files and upload the email spam dataset (emails.csv) to Google Colab.
This is the link for the email spam dataset. You need to agree to the terms and conditions, and a Gmail account is required.
The result should then look like the following picture, which means our dataset uploaded successfully.
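In Colab, files.upload() returns a dictionary that maps each uploaded file name to its raw bytes. The sketch below shows one way to confirm the upload by loading those bytes into pandas. So that it runs outside Colab, the `uploaded` dictionary here is a small stand-in with invented rows, the helper name `load_uploaded_csv` is hypothetical, and the column names `text` and `spam` are assumptions about the CSV header.

```python
import io
import pandas as pd

def load_uploaded_csv(uploaded, name="emails.csv"):
    """Read one uploaded CSV (file name -> bytes) into a DataFrame."""
    if name not in uploaded:
        raise KeyError(f"{name} was not uploaded; got {list(uploaded)}")
    return pd.read_csv(io.BytesIO(uploaded[name]))

# Stand-in for the dictionary returned by files.upload(); the rows are invented.
uploaded = {
    "emails.csv": b"text,spam\nSubject: win a prize now,1\nSubject: meeting at noon,0\n"
}

preview = load_uploaded_csv(uploaded)
print(preview.shape)   # number of rows and columns
print(preview.head())  # first few emails with their labels
```

In the notebook itself, the real `uploaded` variable from the first cell can be passed to the same helper.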
Next, create a new cell to import the libraries we will use.
Copy and paste the following code into that cell, then run the cell by clicking the run button.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
nltk.download('all-corpora')
The result should look similar to the following picture.
After that, we will read our data and do some data pre-processing.
Create a new cell, copy and paste the following code, and then run it. (Because the dataset has about 5731 samples, the cell might take around 2-4 minutes to run.)
dataset = pd.read_csv("emails.csv",sep = ',')
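# Data pre-processing
# Drop the first character of each email
dataset['text'] = dataset['text'].map(lambda text: text[1:])
# Replace non-alphanumeric characters with spaces, lowercase, and split into words
dataset['text'] = dataset['text'].map(lambda text: re.sub('[^a-zA-Z0-9]+', ' ', text)).apply(lambda x: x.lower().split())
ps = PorterStemmer()
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
# Remove stop words, stem the remaining words, and rejoin them into one string per email
corpus = dataset['text'].apply(lambda words: ' '.join(ps.stem(word) for word in words if word not in stop_words))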
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus.values).toarray()
y = dataset.iloc[:, 1].values
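To see what the cleaning and CountVectorizer steps produce, here is a minimal sketch on two invented messages. It assumes a tiny hand-written stop-word set in place of the NLTK list and skips stemming, so it only approximates the cell above.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Invented messages and a tiny stand-in stop-word set (the NLTK list is much longer).
messages = ["Win a FREE prize now!!!", "Are we meeting at noon?"]
stop_words = {"a", "are", "we", "at"}

def clean(text):
    """Keep letters and digits, lowercase, and drop stop words."""
    words = re.sub("[^a-zA-Z0-9]+", " ", text).lower().split()
    return " ".join(w for w in words if w not in stop_words)

corpus = [clean(m) for m in messages]

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(corpus)  # cleaned messages
print(vocab)   # one column per distinct word
print(X)       # one row per message, one count per word
```

Each row of X is the bag-of-words vector for one message. This is the kind of matrix the classifier is trained on.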
After data pre-processing, we create a new cell to split the data into a training set and a test set, and then fit the Naive Bayes algorithm on the training set.
Copy and paste the following code and then run it. The result should look similar to the following picture.
# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Fitting the Naive Bayes algorithm to the training set
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train , y_train)
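As a rough illustration of what fitting does, the sketch below computes the multinomial Naive Bayes quantities by hand on an invented count matrix and compares the prediction with scikit-learn. It assumes add-one (Laplace) smoothing, which corresponds to the default alpha=1.0 of MultinomialNB. The data and the `predict` helper are made up for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented word counts: rows are emails, columns are words.
X = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = non-spam

classes = np.unique(y)

# log P(class): fraction of training emails in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class) with add-one (Laplace) smoothing
counts = np.array([X[y == c].sum(axis=0) for c in classes]) + 1.0
log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))

def predict(x):
    """Pick the class with the highest log posterior score."""
    scores = log_prior + log_likelihood @ x
    return classes[np.argmax(scores)]

new_email = np.array([1, 1, 0])
model = MultinomialNB().fit(X, y)  # default alpha=1.0 applies the same smoothing

print(predict(new_email), model.predict([new_email])[0])
```

The score for each class is the log prior plus the sum of word log-probabilities weighted by how often each word appears in the email. The email is assigned to the class with the larger score.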
After training, we will predict the labels of our test set.
Finally, we calculate the accuracy and build the confusion matrix.
Copy and paste the following code and then run it. The result should look similar to the following picture.
# Predict the test set
y_pred = classifier.predict(X_test)
# Showing the accuracy rate and making the Confusion Matrix
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test, y_pred)
accuracy = np.trace(matrix) / float(np.sum(matrix))
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
As we can see, the accuracy in this run is 98.95%. Because train_test_split shuffles the data randomly, your number may differ slightly. The Naive Bayes machine learning algorithm did a good job on email spam detection.
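Accuracy alone can look good when most emails are non-spam, so it may help to read precision and recall from the same confusion matrix. The sketch below uses invented labels rather than the tutorial's results. In the notebook, the real y_test and y_pred could be used in their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 1 = spam, 0 = non-spam.
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of flagged emails that are really spam
recall = tp / (tp + fn)     # share of spam emails that were caught

print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

For a spam filter, low precision means legitimate emails end up flagged as spam, and low recall means spam reaches the inbox.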