Overview of Naive Bayes (NB):
Naive Bayes (NB) is a supervised machine learning algorithm used for classification. It assumes that features are conditionally independent given the class, which greatly simplifies probability calculations. Multinomial NB is commonly employed in text analysis, estimating word-count probabilities for document categorization. In contrast, Bernoulli NB suits binary data, modeling the presence or absence of features. Applications of NB include spam detection, sentiment analysis, recommendation systems, and medical diagnosis. Its versatility makes it valuable across domains, facilitating tasks such as language identification and customer-support routing. NB, particularly in its Multinomial and Bernoulli forms, is a powerful tool for classification problems.
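As a minimal sketch of the difference between the two variants, the toy data below (invented for illustration, not from this project) can be fed to both classifiers; Multinomial NB uses the raw counts, while Bernoulli NB only sees presence/absence:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

rng = np.random.default_rng(0)
# Multinomial NB expects non-negative counts (e.g., word frequencies)
X = rng.integers(0, 10, size=(200, 5))
y = (X[:, 0] > 4).astype(int)  # toy labels tied to the first feature

mnb = MultinomialNB().fit(X, y)
# Bernoulli NB binarizes features internally (binarize=0.0 by default),
# so it only models whether each feature is present or absent
bnb = BernoulliNB().fit(X, y)

print("Multinomial NB training accuracy:", mnb.score(X, y))
print("Bernoulli NB training accuracy:", bnb.score(X, y))
```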
Data Preparation:
In supervised machine learning, labeled data is a fundamental requirement. The first step in preparing the data for the Naive Bayes (NB) classifier was loading a dataset from a CSV file. This dataset included columns such as 'Likes,' 'Dislikes,' 'Views,' and 'Video_Duration (in seconds).' To keep the data suitable for analysis, only numeric columns were selected. The dataset was then divided into two distinct sets, a Training Set and a Testing Set, with an 80-20 split ratio. The Training Set is used to construct and train the NB model, while the Testing Set is used to evaluate the model's accuracy. This division is essential: assessing the model on previously unseen data gives a more robust picture of its generalization capabilities. I also applied bootstrapping to the data with the aim of improving the models' accuracy.
For the dataset, I have used the "youtube_data_final_1.csv" file.
Image of the dataset:
Image of the Training Set and Testing set:
Explanation:
Two Scatter Plots: The code generates two scatter plots to visually represent the data.
First Scatter Plot (Training Data):
plt.subplot(1, 2, 1): This line of code prepares to create the first scatter plot and places it in a 1x2 grid of subplots, indicating that it will be the first subplot.
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.Paired): This line creates the scatter plot for the training data.
X_train[:, 0] and X_train[:, 1] correspond to the first two columns of the training data X_train. Given the columns selected earlier (and with 'Likes' dropped from the features), these are 'Dislikes' and 'Views.'
c=y_train assigns different colors to data points based on their class labels. In this case, it's using the class labels from the training set y_train.
cmap=plt.cm.Paired specifies the color map to be used for the scatter plot.
Second Scatter Plot (Testing Data):
plt.subplot(1, 2, 2): This line of code prepares to create the second scatter plot and places it in the same 1x2 grid of subplots, indicating that it will be the second subplot.
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.Paired): This line creates the scatter plot for the testing data.
X_test[:, 0] and X_test[:, 1] represent the same two columns as in the training data but for the testing data.
c=y_test assigns colors to data points based on their class labels from the testing set y_test.
cmap=plt.cm.Paired specifies the same color map as in the first plot.
Title and Labels: For both scatter plots, the code adds titles to the plots. The first plot is labeled "Training Set," and the second is labeled "Testing Set."
Display: Finally, the code uses plt.tight_layout() to ensure that the plots are nicely arranged and not overlapping, and plt.show() to display the plots.
So we have images of the training and testing datasets in the form of scatter plots. Since 'Likes' is dropped from the feature set, the two plotted columns are 'Dislikes' and 'Views' (assuming the CSV columns follow the listed order).
Code:
Bootstrapping code:
import pandas as pd

# Replace 'your_file_path.csv' with the actual path to your CSV file
file_path = 'youtube_data_final_1.csv'

# Number of bootstrapped samples to create (adjust as needed)
num_samples = 100

# Read the original dataset into a DataFrame
original_df = pd.read_csv(file_path)

# Perform bootstrapping: draw each sample with replacement
samples = []
for _ in range(num_samples):
    # Randomly sample rows with replacement from the original dataset
    bootstrap_sample = original_df.sample(n=len(original_df), replace=True)
    samples.append(bootstrap_sample)

# Concatenate all samples into one DataFrame with a fresh index
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
bootstrapped_df = pd.concat(samples, ignore_index=True)

# Display the first few rows of the bootstrapped DataFrame
print(bootstrapped_df.head())
Naive Bayes:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
# Load your dataset from the CSV file
file_path = 'youtube_data_final_1_1.csv'
# Link to "youtube_data_final_1_1.csv" file : https://drive.google.com/file/d/15pKK_1ftSTwLWlLo06nHcDd6pR0xKHIw/view?usp=drive_link
# List of columns to read (exclude non-numeric columns)
numeric_columns = ['Likes', 'Dislikes', 'Views', 'Video_Duration (in seconds)'] # Replace with the actual column names
# Read the CSV file with specific columns
df_1 = pd.read_csv(file_path, usecols=numeric_columns)
print(df_1.head())
# 'Views' is used as the class-label column; 'Likes' is dropped from the features
# (note that 'Views' itself is not dropped here, so it also remains among the features)
X = df_1.drop('Likes', axis=1)  # Features
y = df_1['Views']  # Target variable
# Split the data into a training set and a testing set (e.g., 80% for training and 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Define the hyperparameter grid to search
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}
# Convert your data to a NumPy array
X_train = np.array(X_train)
X_test = np.array(X_test)
# Create a scatter plot of the training data
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.Paired)
plt.title('Training Set')
# Create a scatter plot of the testing data
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.Paired)
plt.title('Testing Set')
plt.tight_layout()
plt.show()
# Create the GridSearchCV object
grid_search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
# Fit the grid search to your training data
grid_search.fit(X_train, y_train)
# Get the best hyperparameters
best_alpha = grid_search.best_params_['alpha']
# Create a new Naive Bayes model with the best hyperparameters
best_nb_model = MultinomialNB(alpha=best_alpha)
best_nb_model.fit(X_train, y_train)
# Make predictions and evaluate the model
best_nb_predictions = best_nb_model.predict(X_test)
accuracy = accuracy_score(y_test, best_nb_predictions)
confusion = confusion_matrix(y_test, best_nb_predictions)
print("Best Accuracy:", accuracy)
print("Best Confusion Matrix:")
print(confusion)
Explanation:
This code is an example of applying a Multinomial Naive Bayes classifier to a dataset for classification. Here's a step-by-step explanation:
Libraries are imported:
matplotlib.pyplot for data visualization.
train_test_split from sklearn.model_selection for splitting the dataset into training and testing sets.
numpy as np for working with numerical data.
GridSearchCV for hyperparameter tuning.
MultinomialNB for the Multinomial Naive Bayes classifier.
pandas for data manipulation.
The dataset is loaded from a CSV file located at 'youtube_data_final_1_1.csv' using pd.read_csv(). This dataset is assumed to have numeric columns.
The target variable 'Views' and the features (excluding 'Likes') are defined.
The data is split into a training set (80%) and a testing set (20%) using train_test_split().
A hyperparameter grid param_grid for the Laplace smoothing parameter 'alpha' is defined.
The training and testing feature tables are converted to NumPy arrays so that their columns can be indexed positionally for plotting.
Two scatter plots are created to visualize the training and testing data. The first plot shows the training set, and the second plot shows the testing set. Data points are color-coded based on the 'Views' (target variable).
A GridSearchCV object is created to perform hyperparameter tuning on a Multinomial Naive Bayes classifier using 5-fold cross-validation.
The grid search is fitted to the training data, and the best hyperparameters are identified, specifically the best 'alpha' value.
A new Multinomial Naive Bayes model is created with the best 'alpha' value, and it's fitted to the training data.
Predictions are made on the testing set using the tuned model, and the accuracy of the model is calculated with accuracy_score.
The confusion matrix, which shows the true positive, true negative, false positive, and false negative values, is computed with confusion_matrix.
Finally, the best accuracy and the confusion matrix are printed to evaluate the model.
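For context on the 'alpha' grid: in Multinomial NB, alpha is the Laplace/Lidstone smoothing term added to every feature count, which prevents a feature never seen with a class during training from receiving zero probability. A rough arithmetic sketch with invented counts (not taken from the dataset):

```python
# Hypothetical counts: a feature never observed with class c during training
count, total, vocab_size = 0, 1000, 50

for alpha in [0.0, 0.01, 1.0, 10.0]:
    # Smoothed estimate used by Multinomial NB:
    # (count + alpha) / (total + alpha * number_of_features)
    p = (count + alpha) / (total + alpha * vocab_size)
    print(f"alpha={alpha}: P(feature | class) = {p:.6f}")
```

With alpha = 0 the estimate collapses to zero, which would wipe out the whole product of probabilities; larger alpha values pull the estimate toward a uniform distribution, which is why alpha is worth tuning via grid search.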
Results:
Explanation:
Best Accuracy: 0.9509475218658893: This line indicates the accuracy achieved by the machine learning model on a test dataset. The value 0.9509 (approximately 95.09%) is a measure of how many of the test samples were correctly classified by the model. An accuracy of 1.0 would mean a perfect classification, so 0.9509 is quite high and suggests that the model performed well.
Best Confusion Matrix: The confusion matrix is a table used to evaluate the performance of a classification model. In this case, the confusion matrix is displayed as a square matrix, where each row corresponds to the actual (true) class, and each column corresponds to the predicted class. The values within the matrix represent the counts of samples for each combination of actual and predicted classes.
Rows: The rows in the matrix represent the actual classes.
Columns: The columns represent the predicted classes.
In the matrix (scikit-learn orders the rows and columns by ascending class label):
The value at row 1, column 1 (top-left corner) is the number of instances of the first class that were predicted correctly.
The value at row 1, column 2 is the number of instances of the first class misclassified as the second class.
The value at row 2, column 1 is the number of instances of the second class misclassified as the first class.
The value at row 2, column 2 is the number of instances of the second class that were predicted correctly.
Treating the first class as the 'positive' class, these four cells correspond to the true positives, false negatives, false positives, and true negatives, respectively.
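The cell layout can be verified on a tiny made-up example (labels invented for illustration). With scikit-learn's ascending label order [0, 1], ravel() returns the four cells as TN, FP, FN, TP:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Invented labels purely for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm.ravel()            # valid for a 2x2 matrix with labels [0, 1]
print(cm)                              # [[3 1]
                                       #  [1 3]]
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)

# Accuracy is the diagonal (correct predictions) divided by all samples
print(np.trace(cm) / cm.sum())         # 0.75, matches accuracy_score
print(accuracy_score(y_true, y_pred))
```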
Conclusions:
The code provided demonstrates the use of a Multinomial Naive Bayes classifier for a specific dataset, optimizing it with grid search to find the best hyperparameter alpha. The accuracy score and confusion matrix reveal the model's performance.
Here are the key takeaways related to the topic:
Supervised Learning: This code illustrates the concept of supervised learning, which requires labeled data for training and testing a machine learning model. In this case, 'Views' is used as the target variable, with 'Likes' excluded from the feature set.
Hyperparameter Tuning: Grid search is employed to find the best hyperparameter (alpha) for the Multinomial Naive Bayes model. It demonstrates the importance of parameter optimization for improving model performance.
Model Evaluation: The accuracy score and confusion matrix provide insights into the model's accuracy and ability to predict 'Views.' This evaluation is vital in assessing the model's suitability for the given dataset.
Customization: The code can be customized to different datasets by changing the target variable, feature set, and hyperparameter grid. This adaptability is a fundamental aspect of applying machine learning techniques to various domains.
In conclusion, this code snippet showcases a practical example of supervised learning using Naive Bayes and emphasizes the significance of model optimization and evaluation in machine learning tasks.