How could you improve the accuracy of email spam detection by using other pre-processing methods?
Try other algorithms for email spam detection and compare the results (see the sketch below for one possible starting point).
Try applying the Naive Bayes algorithm to other datasets.
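As one possible starting point for the last two exercises, the sketch below applies the same vectorize-then-classify recipe to a different, publicly available dataset (scikit-learn's 20 Newsgroups) and compares Multinomial Naive Bayes against Logistic Regression. The dataset and the second classifier are only example choices, not part of the original lab.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Use two newsgroups as a stand-in binary text-classification task
categories = ['sci.space', 'rec.autos']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# Same pre-processing recipe as in the lab: tf-idf features, English stop words removed
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
# Train both classifiers on identical features and compare their accuracy
for clf in [MultinomialNB(), LogisticRegression(max_iter=1000)]:
    clf.fit(X_train, train.target)
    accuracy = accuracy_score(test.target, clf.predict(X_test))
    print(type(clf).__name__, "accuracy: {:.2%}".format(accuracy))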
This add-on lab shows how to detect opinion spam and fake news using Naïve Bayes text classification.
Fake news may contain false or exaggerated claims and often spreads through social media. It is frequently used to push ideas with political agendas. Machine learning can be used to detect fake news in social media effectively.
The dataset we'll use for this Python project has columns for the news title and the news text, plus a target column labelling each article as FAKE or REAL. Our task is to classify each news item as real or fake.
Copy and paste the following link into your browser to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Click File → New notebook to create a new Python 3 notebook.
Click the file name at the top of the page and rename the notebook to Naive Bayes for fake news detection.ipynb.
Next, you can click Runtime → Change runtime type to pick a hardware accelerator; this lab uses scikit-learn, which runs on the CPU, so the default setting is fine.
In the first code cell, copy and paste the following code to import the libraries we will use.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
Then click the run button (it looks like a play button) to run the code cell.
Copy and paste the following code to load the dataset we will use
# Load the dataset directly from the URL
df = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")
# Keep the target labels (FAKE or REAL) in a separate variable
y = df.label
# Show the first five rows
df.head()
You should see the first five news items with their labels.
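Optionally (this check is not part of the original lab), you can look at the size of the dataset and how balanced the labels are before training:
# Optional sanity checks, assuming df and y from the cell above
print(df.shape)          # number of rows and columns
print(y.value_counts())  # how many FAKE vs. REAL articles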
Next, we will do some data pre-processing and split the data into training and test sets.
Create a new cell, copy and paste the following code and then run it.
# Drop the `label` column from the features (the labels are already stored in `y`)
df = df.drop("label", axis=1)
# Split the data into training and test sets (33% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33)
# Initialize the `count_vectorizer`, removing English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the training data
count_train = count_vectorizer.fit_transform(X_train)
# Transform the test set
count_test = count_vectorizer.transform(X_test)
# (The count features are not used below; they are kept so you can compare them with tf-idf.)
# Initialize the `tfidf_vectorizer`; max_df=0.7 drops terms that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Fit and transform the training data
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
# Transform the test set
tfidf_test = tfidf_vectorizer.transform(X_test)
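If you are curious, you can also inspect what the vectorizer produced (this step is optional and not part of the original lab; get_feature_names_out requires scikit-learn 1.0 or newer, older versions use get_feature_names instead):
# Optional: inspect the tf-idf output, assuming the variables from the cell above
print(tfidf_train.shape)                              # (number of documents, vocabulary size)
print(tfidf_vectorizer.get_feature_names_out()[:10])  # a few of the learned vocabulary terms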
After the pre-processing, we will fit a Naive Bayes classifier on the training set.
Copy and paste the following code and then run it.
clf = MultinomialNB()  # Multinomial Naive Bayes, suited to word-count and tf-idf features
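MultinomialNB also has a smoothing parameter alpha (the scikit-learn default is 1.0, i.e. Laplace smoothing). As an optional experiment, not part of the original lab, you can try other values and see how the accuracy changes, for example:
clf = MultinomialNB(alpha=0.1)  # less smoothing, relying more on the observed word counts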
After training, we will make predictions on the test set.
Finally, we calculate the accuracy and build the confusion matrix.
Copy and paste the following code and then run it. The result should be similar to the one discussed below.
from sklearn.metrics import confusion_matrix
import numpy as np
# Train the classifier on the tf-idf training features
clf.fit(tfidf_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(tfidf_test)
# Confusion matrix: rows are true labels, columns are predicted labels
matrix = confusion_matrix(y_test, y_pred)
# Accuracy = correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(matrix) / float(np.sum(matrix))
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
As we can see, the accuracy is 82.21% (your exact number may differ slightly, because the train/test split is random). The Naive Bayes algorithm did a good job on fake news detection.
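If you want a more detailed breakdown than a single accuracy number, scikit-learn's classification_report prints precision, recall, and F1-score for each class (this is an optional extension, not part of the original lab):
from sklearn.metrics import classification_report
# Per-class precision, recall, and F1-score for the FAKE and REAL labels
print(classification_report(y_test, y_pred))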