Hands-on Lab Practice

The dataset in this module is used to detect spam emails.

The dataset has only one input attribute, the email text, plus a label.

In this module, we will implement the Naive Bayes technique for spam email filtering in a new Google Colab notebook. The dataset contains the attributes that will help us decide whether an email is spam or not. The attributes are the following:

TEXT: This field represents the subject and body message of emails.

TYPE: This is a categorical variable. Its value indicates whether the email's subject or body message is spam: 1 for spam email and 0 for non-spam email.
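To make the two attributes concrete, here is a minimal sketch of what such a file looks like once it is loaded with pandas. The three rows are made up for illustration and are not taken from emails.csv. The second column is called type here only to match the description above; the lab code reads the label column by position, so its actual name in the file does not matter.

```python
import io
import pandas as pd

# Made-up rows for illustration only; these are not taken from emails.csv.
sample_csv = io.StringIO(
    "text,type\n"
    "Subject: win a free prize now,1\n"
    "Subject: meeting agenda for monday,0\n"
    "Subject: cheap loans approved today,1\n"
)

sample = pd.read_csv(sample_csv, sep=',')

print(sample.shape)                      # (rows, columns)
print(sample.iloc[:, 1].value_counts())  # how many spam (1) vs non-spam (0)
```

Running the same two print lines on the real dataset after it is loaded is a quick way to check that the file uploaded correctly.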

Copy and paste the following link to open Google Colab:

https://colab.research.google.com/notebooks/welcome.ipynb

Then click File --> New notebook

Click the red box area on the page and change the file name to Naive Bayes for email spam detection.ipynb

Next, click Runtime, choose Change runtime type, and set Hardware accelerator to GPU. (The scikit-learn and NLTK code in this lab runs on the CPU, so this setting is optional here.)

In the first code cell, copy and paste the following code to upload the dataset to Google Colab:

from google.colab import files

uploaded = files.upload()

Then click the run button (it looks like a play button) to run this code cell.

After the cell runs successfully, you should see a result like the following picture. Next, click the choose button and upload the email spam dataset (emails.csv) to Google Colab.

This is the link for the email spam dataset. To download it, you need to agree to the terms and conditions, and you need a Gmail account.

https://www.kaggle.com/karthickveerakumar/spam-filter

The result should then look like the following picture (it means our dataset was uploaded successfully).

Next, create a new cell to import the libraries we will use.

Copy and paste the following code into that cell, then run the cell by clicking the run button.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import re

import nltk

from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')

nltk.download('all-corpora')

The result should look similar to the following picture.

After that, we will read our data and do some data pre-processing.

Create a new cell, copy and paste the following code, and then run it. (Because the dataset has about 5731 samples, it might take a few minutes to run that cell.)

dataset = pd.read_csv("emails.csv",sep = ',')

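# data pre-processing

# Drop the first character of each email text

dataset['text'] = dataset['text'].map(lambda text: text[1:])

# Replace non-alphanumeric characters with spaces, lowercase, and split into words

dataset['text'] = dataset['text'].map(lambda text: re.sub('[^a-zA-Z0-9]+', ' ', text)).apply(lambda x: x.lower().split())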

ps = PorterStemmer()

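# Build the stop-word set once instead of once per word

stop_words = set(stopwords.words('english'))

# Drop stop words, stem the remaining words, and join them back into one string

corpus = dataset['text'].apply(lambda text_list: ' '.join(ps.stem(word) for word in text_list if word not in stop_words))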

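# Convert the corpus into a matrix of word counts (one row per email)

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

X = cv.fit_transform(corpus.values).toarray()

# The second column of the dataset holds the labels (1 = spam, 0 = non-spam)

y = dataset.iloc[:, 1].values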
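To see what the pre-processing cell does to the text, here is a simplified sketch on three made-up emails. It uses only the Python standard library: the stemming step is left out, the stop-word set is a tiny hand-picked stand-in for the NLTK list, and the count matrix is built by hand to show the idea behind CountVectorizer rather than to reproduce it exactly.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stop-word list (much shorter).
STOP_WORDS = {"a", "the", "to", "for", "you", "now", "is"}

def clean(text):
    """Replace non-alphanumeric runs with a space, lowercase, split, drop stop words."""
    tokens = re.sub('[^a-zA-Z0-9]+', ' ', text).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up emails for illustration only.
emails = ["Win a FREE prize now!!!", "Agenda for the meeting", "Free meeting, free prize"]
docs = [clean(e) for e in emails]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({t for d in docs for t in d})

# One row per email, one column per vocabulary word, each cell a word count.
X_toy = [[Counter(d)[w] for w in vocab] for d in docs]

print(vocab)
for row in X_toy:
    print(row)
```

Each row of the matrix is the numeric form of one email. This is the kind of input the Naive Bayes classifier receives as X.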

After data pre-processing, we create a new cell to split the data into the training set and the test set, and then fit the Naive Bayes algorithm to the training set.

Copy and paste the following code and then run it. The result should look similar to the following picture.

# Splitting the dataset into the training set and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Fitting the Naive Bayes algorithm to the training set

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

classifier.fit(X_train , y_train)
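To give an idea of what happens inside fit and predict, here is a minimal from-scratch sketch of multinomial Naive Bayes on toy word counts. It is not the scikit-learn implementation. The function names train_nb and predict_nb, the toy vocabulary, and the toy counts are all made up for illustration. The smoothing value alpha=1.0 matches the default of MultinomialNB.

```python
import math

def train_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_words = len(X[0])
    log_prior, log_like = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word in class c, plus alpha (Laplace smoothing).
        counts = [sum(r[j] for r in rows) + alpha for j in range(n_words)]
        total = sum(counts)
        log_like[c] = [math.log(k / total) for k in counts]
    return log_prior, log_like

def predict_nb(x, log_prior, log_like):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_like[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy word-count rows over the made-up vocabulary [free, meeting, prize, report].
X_toy = [[2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1], [0, 1, 0, 2]]
y_toy = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

log_prior, log_like = train_nb(X_toy, y_toy)
pred_spam = predict_nb([1, 0, 1, 0], log_prior, log_like)  # a "free prize" email
pred_ham = predict_nb([0, 1, 0, 1], log_prior, log_like)   # a "meeting report" email
print(pred_spam, pred_ham)
```

The smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.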

After training, we will predict the labels of our test set.

Finally, we calculate the accuracy and build the confusion matrix.

Copy and paste the following code and then run it. The result should look similar to the following picture.

# Predict the test set

y_pred = classifier.predict(X_test)

# Showing the accuracy rate and making the Confusion Matrix

from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, y_pred)

accuracy = np.trace(matrix) / float(np.sum(matrix))

print("Confusion Matrix")

print(matrix)

print("The accuracy is: {:.2%}".format(accuracy))
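The accuracy line in the cell above divides the diagonal of the confusion matrix (the correct predictions) by the total number of test samples. The sketch below repeats that calculation on a made-up matrix and also derives precision and recall for the spam class, which are not part of the lab code but are often reported alongside accuracy. The counts are invented for illustration and are not results from this dataset.

```python
import numpy as np

# Made-up counts for illustration; rows are true labels, columns are predictions,
# in the order [non-spam (0), spam (1)], as confusion_matrix returns them.
matrix = np.array([[90, 2],
                   [3, 25]])

tn, fp, fn, tp = matrix.ravel()

accuracy = np.trace(matrix) / float(np.sum(matrix))  # (TN + TP) / all samples
precision = tp / (tp + fp)  # share of emails flagged as spam that really are spam
recall = tp / (tp + fn)     # share of real spam emails that were caught

print("accuracy:  {:.2%}".format(accuracy))
print("precision: {:.2%}".format(precision))
print("recall:    {:.2%}".format(recall))
```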

As we can see, the accuracy in this run is 98.95%. The Naive Bayes machine learning algorithm did a good job on email spam detection.
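Because train_test_split shuffles the data randomly, the accuracy can change a little from run to run. A minimal sketch of one way to make the split repeatable is shown below, using the random_state and stratify parameters of train_test_split. The toy arrays, the helper function split, and the seed value 42 are assumptions for illustration, and the lab code above does not set them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the X and y built earlier: 8 samples, 2 features.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def split(seed):
    # random_state fixes the shuffle; stratify keeps the spam ratio in both parts.
    return train_test_split(X_toy, y_toy, test_size=0.25,
                            random_state=seed, stratify=y_toy)

first = split(42)
second = split(42)

# Same seed, same split: every returned array matches.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
print(len(first[0]), len(first[1]))  # training rows, test rows
```

With a fixed seed, everyone who runs the notebook trains and tests on the same rows, which makes the reported accuracy easier to compare.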
