How could you improve the accuracy of email spam detection by using other pre-processing methods?
Try other algorithms for email spam detection and compare the results (see the sketch below for one possible starting point).
Try applying the Naive Bayes algorithm to other datasets.
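As one possible starting point for the last two exercises, the sketch below applies the same vectorize-then-classify recipe to a different, publicly available dataset (scikit-learn's 20 Newsgroups) and compares Multinomial Naive Bayes against Logistic Regression. The dataset and the second classifier are only example choices, not part of the original lab.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Use two newsgroups as a stand-in binary text-classification task
categories = ['sci.space', 'rec.autos']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# Same pre-processing recipe as in the lab: tf-idf features, English stop words removed
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
# Train both classifiers on identical features and compare their accuracy
for clf in [MultinomialNB(), LogisticRegression(max_iter=1000)]:
    clf.fit(X_train, train.target)
    accuracy = accuracy_score(test.target, clf.predict(X_test))
    print(type(clf).__name__, "accuracy: {:.2%}".format(accuracy))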
This add-on lab shows how to detect opinion spam and fake news using Naïve Bayes text classification.
Fake news may contain false or exaggerated claims and often spreads through social media. It is frequently used to push ideas with political agendas. Machine learning can be used to detect fake news in social media effectively.
The dataset we'll use for this Python project has columns for the news title and the news text, plus a target column labelling each article as FAKE or REAL. Our task is to classify each news item as real or fake.
Copy and paste the following link into your browser to open Google Colab:
https://colab.research.google.com/notebooks/welcome.ipynb
Click File → New notebook to create a new Python 3 notebook.
Click the file name at the top of the page and rename the notebook to Naive Bayes for fake news detection.ipynb.
Next, you can click Runtime → Change runtime type to pick a hardware accelerator; this lab uses scikit-learn, which runs on the CPU, so the default setting is fine.
In the first code cell, copy and paste the following code to import the libraries we will use.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
Then click the run button (it looks like a play button) to run the code cell.
Copy and paste the following code to load the dataset we will use
# Load the dataset directly from the URL
df = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")
# Keep the target labels (FAKE or REAL) in a separate variable
y = df.label
# Show the first five rows
df.head()
You should see the first five news items with their labels.
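Optionally (this check is not part of the original lab), you can look at the size of the dataset and how balanced the labels are before training:
# Optional sanity checks, assuming df and y from the cell above
print(df.shape)          # number of rows and columns
print(y.value_counts())  # how many FAKE vs. REAL articles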
Next, we will do some data pre-processing and split the data into training and test sets.
Create a new cell, copy and paste the following code and then run it.
# Drop the `label` column from the features (the labels are already stored in `y`)
df = df.drop("label", axis=1)
# Split the data into training and test sets (33% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33)
# Initialize the `count_vectorizer`, removing English stop words
count_vectorizer = CountVectorizer(stop_words='english')
# Fit and transform the training data
count_train = count_vectorizer.fit_transform(X_train)
# Transform the test set
count_test = count_vectorizer.transform(X_test)
# (The count features are not used below; they are kept so you can compare them with tf-idf.)
# Initialize the `tfidf_vectorizer`; max_df=0.7 drops terms that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Fit and transform the training data
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
# Transform the test set
tfidf_test = tfidf_vectorizer.transform(X_test)
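If you are curious, you can also inspect what the vectorizer produced (this step is optional and not part of the original lab; get_feature_names_out requires scikit-learn 1.0 or newer, older versions use get_feature_names instead):
# Optional: inspect the tf-idf output, assuming the variables from the cell above
print(tfidf_train.shape)                              # (number of documents, vocabulary size)
print(tfidf_vectorizer.get_feature_names_out()[:10])  # a few of the learned vocabulary terms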
After the pre-processing, we will fit a Naive Bayes classifier on the training set.
Copy and paste the following code and then run it.
clf = MultinomialNB()  # Multinomial Naive Bayes, suited to word-count and tf-idf features
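MultinomialNB also has a smoothing parameter alpha (the scikit-learn default is 1.0, i.e. Laplace smoothing). As an optional experiment, not part of the original lab, you can try other values and see how the accuracy changes, for example:
clf = MultinomialNB(alpha=0.1)  # less smoothing, relying more on the observed word counts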
After training, we will make predictions on the test set.
Finally, we calculate the accuracy and build the confusion matrix.
Copy and paste the following code and then run it. The result should be similar to the one discussed below.
from sklearn.metrics import confusion_matrix
import numpy as np
# Train the classifier on the tf-idf training features
clf.fit(tfidf_train, y_train)
# Predict the labels of the test set
y_pred = clf.predict(tfidf_test)
# Confusion matrix: rows are true labels, columns are predicted labels
matrix = confusion_matrix(y_test, y_pred)
# Accuracy = correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(matrix) / float(np.sum(matrix))
print("Confusion Matrix")
print(matrix)
print("The accuracy is: {:.2%}".format(accuracy))
As we can see, the accuracy is 82.21% (your exact number may differ slightly, because the train/test split is random). The Naive Bayes algorithm did a good job on fake news detection.
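If you want a more detailed breakdown than a single accuracy number, scikit-learn's classification_report prints precision, recall, and F1-score for each class (this is an optional extension, not part of the original lab):
from sklearn.metrics import classification_report
# Per-class precision, recall, and F1-score for the FAKE and REAL labels
print(classification_report(y_test, y_pred))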