We are going to use the Trending YouTube Video Statistics dataset from Kaggle. This dataset includes several months of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA and FR regions, among many others, with up to 200 listed trending videos per day. Each region's data is in a separate file. The data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count. It also includes a category_id field, which indicates the category of the trending video. There are 16 columns in total: video_id, trending_date, title, channel_title, category_id, publish_time, tags, views, likes, dislikes, comment_count, thumbnail_link, comments_disabled, ratings_disabled, video_error_or_removed and description. For our project, we are going to use only the FR and GB datasets.
Table 1: Data Description
Many companies that want to venture into making YouTube videos do not know what factors affect how popular a YouTube video will be. They do not know the minimum requirements, such as how many views, likes, dislikes and comments their videos need in order to trend on YouTube. Because of that, the videos they make do not trend and are not visible to other people. As a result, they earn less money, and the cost of making the videos exceeds the profit they get from them. This can cause a company to lose a lot of money and subsequently go bankrupt. By doing this project, we can suggest the minimum number of views, likes, dislikes and comments needed for a video to trend. When a company knows these minimum numbers, it can set targets to reach its goal of producing trending videos, and it is easier to work toward a goal once the direction is clear. Then, when its videos trend, for example when the company uses video marketing for its products, many people will view the videos, which increases visibility within the industry and attracts new customers. This helps our stakeholders earn more money and profit through increased product sales, and the YouTube videos themselves also generate income.
To identify the clusters of trending videos based on categories in France and the United Kingdom (UK), and to suggest the best metrics for a YouTube marketing strategy to companies that want to venture into YouTube.
Our stakeholders are companies that want to venture into YouTube marketing, specifically in the entertainment, music production and travel fields, with coverage in France and the United Kingdom.
We are going to use only the FRvideos and GBvideos datasets because we focus on the France and United Kingdom regions. Firstly, we add a column named country to both the FRvideos and GBvideos datasets to differentiate between them: FRvideos rows are labelled France, and GBvideos rows are labelled UK. After that, we combine the two datasets into one named YTvideos so that it is easier to build the dashboard later. To combine them, we just copy and paste one dataset below the other. We also convert the CSV files to XLSX format, because Microsoft Excel warns that we need to convert them to avoid possible data loss.
Figure 1: Add column country
Figure 2: After combine dataset
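The same combine step can also be scripted rather than done by hand in Excel. Below is a minimal pandas sketch, assuming the original Kaggle filenames FRvideos.csv and GBvideos.csv; the encoding argument is an assumption, since some of the regional files contain non-UTF-8 characters.

import pandas as pd

# Load the two regional files (filenames assumed from the Kaggle dataset)
fr = pd.read_csv('FRvideos.csv', encoding='latin-1')
gb = pd.read_csv('GBvideos.csv', encoding='latin-1')

# Tag each row with its country before combining
fr['country'] = 'France'
gb['country'] = 'UK'

# Stack the two datasets and save as XLSX (requires openpyxl)
yt = pd.concat([fr, gb], ignore_index=True)
yt.to_excel('YTvideos.xlsx', index=False)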
After that, we are going to use RapidMiner because it makes data pre-processing easier. RapidMiner is a data science software platform that provides an integrated environment for data preparation, machine learning, deep learning, text mining and predictive analytics. We just need to import the YTvideos dataset into RapidMiner. Then, we use Turbo Prep to transform publish_time and trending_date into date format, so that it is easier to reduce or filter the data later. After changing the types, we export the dataset and name it YTvideos-DC.
Figure 3: Change trending_date column to date type
Figure 4: Change publish_time column to date type
Figure 5: Export dataset to local repository and name it YTvideos-DC.
We can see that our dataset has 79,640 rows in total, which is too many. So, we need to filter the data based on our preferences. In this tutorial, I am going to choose just three categories: Entertainment, Music and People & Blogs. The category_id for Entertainment is 24, Music is 10 and People & Blogs is 22. We also choose only the latest trending videos, so we filter for publish times starting on 03/01/2018. To remove outliers, we filter the views, keeping only videos with between 100,000 and 1 million views. We also do not want zero values in the comment_count, likes and dislikes columns, because they would affect our machine learning performance, so we filter those columns to be non-zero. Finally, we export the data and replace the old YTvideos-DC with the new one.
Figure 6: Design to filter the data
Figure 7: Filter the category_id column
Figure 8: Filter comment_count, views, likes, dislikes and publish_time columns
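For reference, the same filters can be expressed in pandas. This is a sketch continuing from the combined yt DataFrame above, under the assumption that 03/01/2018 means 3 January 2018:

# Keep only the three chosen categories and the chosen value ranges
publish = pd.to_datetime(yt['publish_time'], utc=True)
mask = (
    yt['category_id'].isin([24, 10, 22])
    & (publish >= pd.Timestamp('2018-01-03', tz='UTC'))  # assumed day-first reading of 03/01/2018
    & yt['views'].between(100_000, 1_000_000)
    & (yt['comment_count'] != 0)
    & (yt['likes'] != 0)
    & (yt['dislikes'] != 0)
)
yt = yt[mask]
print(len(yt), 'rows remain after filtering')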
We do another transformation to turn the category_id values into their category names (24 to Entertainment, 10 to Music, 22 to People & Blogs), so that the dashboard is easier to read later.
Figure 9: After transform the category_id column
Finally, we remove the unnecessary columns video_id, tags, comments_disabled, ratings_disabled and video_error_or_removed, because they are not useful for our EDA and machine learning model. We export the data again to replace the old file. After pre-processing, our data has 6,824 rows, which is much smaller and easier to work with for EDA and machine learning later.
Figure 10: Dataset after data pre-processing
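In pandas, the renaming and column-dropping steps would look roughly like this (a sketch, continuing from the filtered yt DataFrame above):

# Replace numeric category ids with readable names
yt['category_id'] = yt['category_id'].replace(
    {24: 'Entertainment', 10: 'Music', 22: 'People & Blogs'})

# Drop columns that are not used for EDA or modelling
yt = yt.drop(columns=['video_id', 'tags', 'comments_disabled',
                      'ratings_disabled', 'video_error_or_removed'])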
The first step of our EDA process is data requirements. We generate questions about our dataset, picking four important columns (views, likes, dislikes and comment_count) as the basis for the questions. We want to know the relationships between views, likes, dislikes and number of comments. We generate seven questions for our EDA.
Questions:
1. Is it true that a category with a higher number of likes will have a higher number of comments?
2. Is it true that a category with a higher number of likes will have a higher number of views?
3. Is it true that a category with a higher number of likes will have a higher number of dislikes?
4. What is the average number of views, likes, comments and dislikes in the dataset?
5. Can a category with a high number of views have a high number of comments?
6. Can a category with higher views have a higher number of dislikes?
7. On average, will a category with a high number of comments have a high number of dislikes?
The dataset that we are going to use is Trending YouTube Video Statistics from Kaggle.
The steps for data pre-processing are similar to what I explained earlier when creating the YTvideos-DC dataset.
We use pivot tables in Excel to explore the data. We first use a pivot table to count the total of each category in France and the UK. We can see that Entertainment has the highest count in both countries, and that France has a higher count than the UK. In the UK, the second highest is Music and the lowest is People & Blogs, while in France the second highest is People & Blogs and the lowest is Music. Overall, France has a higher total count than the UK.
Figure 11: Pivot Table showing the count of each category in France and the UK
Next, we sum the views in each category for France and the UK. The highest sum of views is in the Entertainment category for both countries, followed by Music and lastly People & Blogs. We can say that most people prefer to watch the Entertainment category. The total views for both countries is around 2 billion.
Figure 12: Pivot Table showing the sum of views of each category in France and the UK
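If you prefer pandas over Excel, the same two pivots can be reproduced with pivot_table (a sketch, assuming the pre-processed yt DataFrame from earlier):

# Count of trending entries per category and country
print(yt.pivot_table(index='category_id', columns='country',
                     values='views', aggfunc='count'))

# Sum of views per category and country
print(yt.pivot_table(index='category_id', columns='country',
                     values='views', aggfunc='sum'))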
We also use the describe() function in Python to learn the statistical properties of each attribute. We can see that the number of non-null values for the views, likes, dislikes and comment_count columns is 6,823 rows. We can also see that the maximum number of views is around 999K, likes around 176K, dislikes around 28K and comments around 317K. In addition, we can see the mean, standard deviation and minimum values.
Figure 13: Descriptive statistics of dataset
Data Analysis
For the analysis, we use the Analysis ToolPak add-in in Excel. We first need to load the Analysis ToolPak add-in from the Options tab. After that, we can start analysing the dataset. We first use a histogram for the analysis. For the input range we use the views column. The bin range is 100,000; 200,000; 300,000; 400,000; 500,000; 600,000; 700,000; 800,000 and 900,000. We tick the chart output option to display the histogram visualization.
Figure 14: Set the analysis using histogram
We can see that the most frequent range of views is around 100,000 to 200,000. There are nearly 2,000 videos with views between 100,000 and 200,000.
Figure 15: Result for views analysis using histogram
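The same histogram can be drawn in Python with the Analysis ToolPak bin edges (a sketch, assuming the pre-processed yt DataFrame):

import matplotlib.pyplot as plt

# Same bin edges as in the Analysis ToolPak setup: 100,000 to 1,000,000 in steps of 100,000
bins = list(range(100_000, 1_000_001, 100_000))
plt.hist(yt['views'], bins=bins)
plt.xlabel('views')
plt.ylabel('frequency')
plt.title('Distribution of views')
plt.show()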
Then, we use correlation analysis to measure the strength of the relationship between the relative movements of two variables. We can see that likes and comment_count have the strongest relationship compared to the others. When likes increase, comment_count is also likely to increase.
Figure 16: Result for correlation analysis
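A pandas one-liner gives the equivalent correlation matrix (assuming the same yt DataFrame):

# Pearson correlation between the four engagement metrics
print(yt[['views', 'likes', 'dislikes', 'comment_count']].corr())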
Then, we use descriptive statistics analysis to give us a general idea of the trends in our data, including mean, mode, median, range, variance, standard deviation, skewness, count, maximum and minimum.
Figure 17: Result for descriptive statistics analysis
We are going to use Power BI for our dashboard. Firstly, we load the YTvideos-DC data into Power BI.
Figure 18: Load dataset
After that, we are going to visualize the data to answer the questions.
1. Is it true that a category with a higher number of likes will have a higher number of comments?
Figure 19: comment_count and likes by category_id
Answer: False. The category with 1.5 million comments has more likes (20.1 million) than the category with 2.0 million comments (16.7 million likes).
2. Is it true that a category with a higher number of likes will have a higher number of views?
Figure 20: views and likes by category_id
Answer: False. The category with 0.34 billion views has more likes (20.1 million) than the category with 0.57 billion views (16.7 million likes).
3. Is it true that a category with a higher number of likes will have a higher number of dislikes?
Figure 21: %GT likes and %GT dislikes by category_id
Answer: False. The category with 50.47% of the dislikes has fewer likes (40.60% of the likes) than the category with 29.33% of the dislikes (48.99% of the likes).
4. What is the average number of views, likes, comments and dislikes in the dataset?
Figure 22: Average of comment_count, views, likes and dislikes
Answer: The average number of comments is 1,180, the average number of views is 327,720, the average number of likes is 11,840 and the average number of dislikes is 589.40.
5. Can a category with a high number of views have a high number of comments?
Figure 23: %GT views and %GT comment_count by category_id
Answer: True, because when views increase, the number of comments also increases.
6. Can a category with higher views have a higher number of dislikes?
Figure 24: views and dislikes by category_id
Answer: True, because when views increase, the number of dislikes also increases.
7. On average, will a category with a high number of comments have a high number of dislikes?
Figure 25: Average of comment_count and Average of dislikes by category_id
Answer: True, because when the number of comments increases, the number of dislikes also increases.
Full Dashboard
Figure 26: Full dashboard
Data Pre-Processing
Data Transformation
We use the YTvideos-DC dataset for our descriptive model. We first do data pre-processing to convert categorical data to numerical values, because machine learning algorithms interpret only numeric values. The columns we want to convert are category_id and country. We replace Entertainment with 1, Music with 2 and People & Blogs with 3. We do the same for the country column, changing France to 1 and UK to 2.
Figure 27: Change category_id column to numerical value
Figure 28: Change country column to numerical value
After that, we change the data type of the category_id and country columns to number.
Figure 29: Change data type for category_id and country columns to number
Then, we export the pre-processed data to the local repository and name it YTvideos-DC (2). We also export it to an Excel file to use in Python.
Figure 30: Export data after pre-processing to local repository
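The same encoding can be done in pandas before exporting to Excel (a sketch, assuming the pre-processed yt DataFrame from the earlier steps):

# Encode the two categorical columns as integers for the clustering step
yt = yt.replace({'category_id': {'Entertainment': 1, 'Music': 2, 'People & Blogs': 3},
                 'country': {'France': 1, 'UK': 2}})

# Export for use in Python later (filename matching the later Colab upload)
yt.to_excel('YTvideos-DC.xlsx', index=False)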
K-Means
The first descriptive model that we are going to use is k-means. We first design the process for k-means clustering. We retrieve the YTvideos-DC (2) data that we pre-processed earlier. Then, we connect the data to a select attributes operator, because we only use certain attributes for clustering. We connect the select attributes operator to the clustering (k-means) operator.
Figure 31: Design for k-means clustering
In the select attributes operator, we select only the category_id, comment_count, country, dislikes, likes and views attributes for our clustering process.
Figure 32: Select attribute to cluster
In the clustering (k-means) operator, we tick add cluster attribute, and for the k value we choose 3, because the elbow method we used in Python showed that the best number of clusters is 3. For max runs we choose 10. Then we run the process.
Figure 33: Parameters for clustering
After we cluster the data, we can see the new data with an additional column named cluster.
Figure 34: Data after clustering
After that, we export the data to the local repository and name it YTvideos-DDM-KM, because we need this data for our predictive model in RapidMiner.
Figure 35: Export cluster data to local repository and named it YTvideos-DDM-KM
We also export the data to Excel with the same name, YTvideos-DDM-KM, to use it for the predictive model in Python.
Figure 36: Export data after pre-process to excel
We can click the visualization tab to show a visualization of our clusters. We can see that the clusters follow the views values, so we know that views has the most influence on our clustering. At the bottom is cluster_0, in the middle is cluster_2 and at the top is cluster_1. From this, we can decide that cluster_0 is the worst trending, cluster_2 is moderate trending and cluster_1 is the best trending.
Figure 37: Visualization of k-means clustering (cluster vs views)
We can also change the value column to likes, dislikes or comment_count. Below are examples of the visualization with other value columns.
Figure 38: Visualization of k-means clustering (cluster vs likes)
Figure: Visualization of k-means clustering (cluster vs comment_count)
Figure 39: Visualization of k-means clustering (cluster vs dislikes)
We can also extend our design with the cluster model visualizer operator for more visualizations.
Figure 40: Design k-means clustering with cluster model visualizer
In the result overview, we can see the average cluster distance and the average distance for every cluster. The Davies-Bouldin index with 3 clusters is 0.508. We can see that cluster_0 has 3,382 items, cluster_1 has 1,447 items and cluster_2 has 1,994 items. So, cluster_0 has the highest number of rows.
Figure 41: K-means analysis
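RapidMiner reports the Davies-Bouldin index directly; scikit-learn exposes the same measure if you want to cross-check it in Python. A sketch, assuming the youtube_data DataFrame prepared in the Python section below:

from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Fit k-means with k = 3 on the six clustering attributes
features = youtube_data[['category_id', 'comment_count', 'country',
                         'dislikes', 'likes', 'views']].values
km = KMeans(n_clusters=3, random_state=0).fit(features)

# Lower Davies-Bouldin values indicate better-separated clusters
print('Davies-Bouldin index: {:.3f}'.format(
    davies_bouldin_score(features, km.labels_)))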
From the heat map, we can see that the views value has the most influence on all the clusters.
Figure 42: Heat Map Analysis
Agglomerative Clustering
The second descriptive model that we are going to use is agglomerative clustering. We first design the process for agglomerative clustering. We retrieve data from YTvideos-DC (2). Then, we connect the data to a select attributes operator, because we only use certain attributes for clustering. We connect the select attributes operator to the clustering (agglomerative clustering) operator. After that, we connect it to the flatten clustering operator, which creates a flat clustering model from the given hierarchical clustering model.
Figure 43: Design for agglomerative clustering
In the select attributes operator, we select only the category_id, comment_count, country, dislikes, likes and views attributes for our clustering process.
Figure 44: Select attribute to cluster
In the clustering (agglomerative clustering) operator, we set the parameters. For the mode we choose CompleteLink, for measure types we choose MixedMeasures and for mixed measure we choose MixedEuclideanDistance.
Figure 45: Parameters for agglomerative clustering
In the flatten clustering operator, we set the number of clusters to 3. Then we run the process.
Figure 46: Parameters for flatten clustering
In the results, we can see that, similar to k-means, the agglomerative clustering is heavily driven by the views value. So we can say that the clusters are based on the views value.
Figure 47: Visualization of agglomerative clustering (cluster vs views)
We can change the value column to likes, dislikes or comment_count. Below are examples of the visualization with other value columns.
Figure 48: Visualization of agglomerative clustering (cluster vs likes)
Figure 49: Visualization of agglomerative clustering (cluster vs comment_count)
Figure 50: Visualization of agglomerative clustering (cluster vs dislikes)
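scikit-learn can reproduce this setup if you want a Python counterpart. A sketch with complete linkage and three flat clusters, assuming the youtube_data DataFrame from the Python section; note that sklearn uses plain Euclidean distance rather than RapidMiner's MixedEuclideanDistance:

from sklearn.cluster import AgglomerativeClustering

features = youtube_data[['category_id', 'comment_count', 'country',
                         'dislikes', 'likes', 'views']].values

# Complete linkage, flattened into three clusters as in the RapidMiner design
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(features)
print('cluster sizes:', [list(labels).count(c) for c in range(3)])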
Full code: Click here
1. Data Loading
First, we import the libraries that we are going to use.
#importing libraries for data manipulation
import numpy as np #library for numerical computing
import pandas as pd #library for data manipulation and analysis
Then, we upload the YTvideos-DC (2) dataset to Google Colab and load the data into a data frame called youtube_data. We look at the first 5 rows of data.
from google.colab import files
uploaded = files.upload()
#loading data from a file in the Colab folder named YTvideos-DC.xlsx into a data frame called youtube_data
youtube_data = pd.read_excel('YTvideos-DC.xlsx')
#read the first 5 rows in the youtube_data
youtube_data.head()
Figure 51: First 5 rows of data
Then, we get a quick overview of the dataset.
#read data types
youtube_data.dtypes
Figure 52: Dataset info
We can see that some of the columns are categorical and need to be converted into numerical values for the machine learning model.
Then, we generate descriptive statistics to get basic quantitative information about the features of our dataset.
#get descriptive summary (count, mean, std, min, quartiles and max) of youtube_data
youtube_data.describe()
Figure 53: Descriptive statistics of dataset
There are three aspects that usually catch my attention when I analyse descriptive statistics:
Min and max values: This can give us an idea about the range of values and is helpful to detect outliers.
Mean and standard deviation: The mean shows us the central tendency of the distribution, while the standard deviation quantifies its amount of variation.
Count: Gives us a first perception of the volume of missing data.
Then, we check whether the data has any missing values.
#print the number of missing values in each column
column_names = youtube_data.columns
for column in column_names:
    print(column + ' - ' + str(youtube_data[column].isnull().sum()))
Figure 54: Dataset missing values
We can see that none of the columns have missing values.
2. Data Visualization
Then, we visualize the data with a histogram and a scatter plot matrix.
import matplotlib.pyplot as plt
# histograms
youtube_data.hist(figsize=(20, 10))
plt.show()
Figure 55: Histogram Visualization
# scatter plot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(youtube_data, figsize=(40, 40))
plt.show()
Figure 56: Scatter Plot Matrix visualization
3. Data Pre-Processing
Then, we pre-process the data. We drop the unimportant columns trending_date, title, channel_title, publish_time, thumbnail_link and description, because they are not useful for our machine learning model.
#exclude columns
youtube_data.drop(['trending_date', 'title', 'channel_title', 'publish_time', 'thumbnail_link', 'description' ],axis=1,inplace=True)
Figure 57: Data after pre-processing for descriptive model
4. Descriptive Machine Learning Model
K-Means
Then, we use k-means as our first descriptive machine learning model. We use the elbow method to determine the best number of clusters.
x = youtube_data.iloc[:, [0, 1, 2, 3, 4, 5]].values

# Using the elbow method to identify the best k value
from sklearn.cluster import KMeans
Error = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i).fit(x)
    Error.append(kmeans.inertia_)

import matplotlib.pyplot as plt
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
Figure 58: Elbow method
By using the elbow method, we can find the best k value for clustering. We can see that the elbow point is at 2 to 3 clusters. We choose 3 as the number of clusters, so 3 is the best k value for our clustering.
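The notebook shown here stops at choosing k; below is a short sketch of the final fit that would reproduce the cluster column RapidMiner exported (an assumption, since the original code does not show this step):

# Fit the final model with k = 3, the elbow point identified above
final_km = KMeans(n_clusters=3, random_state=0).fit(x)
cluster_labels = final_km.labels_  # analogous to RapidMiner's cluster column
print('cluster sizes:', pd.Series(cluster_labels).value_counts().to_dict())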
Agglomerative Clustering
After that, we use agglomerative clustering as our second descriptive machine learning model. We build a dendrogram for our agglomerative clustering.
# calculate full dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage

# generate the linkage matrix
Z = linkage(youtube_data, 'ward')

# distance threshold for the horizontal cut-off line
max_d = 0.8 * 10 ** 7  # max_d as in max_distance

plt.figure(figsize=(25, 15))
plt.title('Youtube Hierarchical Clustering Dendrogram')
plt.xlabel('Sample')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=150,                  # try changing values of p
    leaf_rotation=90.,      # rotates the x axis labels
    leaf_font_size=8.,      # font size for the x axis labels
)
plt.axhline(y=max_d, c='k')
plt.show()
Figure 59: Dendrogram Graph
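To turn the hierarchy into flat cluster labels at the drawn cut-off, SciPy's fcluster can be applied to the same linkage matrix (a sketch; the original notebook does not show this step):

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at the max_d threshold to obtain flat cluster labels
flat_labels = fcluster(Z, t=max_d, criterion='distance')
print('number of flat clusters:', len(set(flat_labels)))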
Random Forest
Design Without HyperParameter Tuning
The first predictive model that we are going to use is Random Forest. We use the YTvideos-DDM-KM dataset that we obtained from k-means clustering. We first retrieve the Excel file YTvideos-DDM-KM and then connect it to a set role operator. In the set role operator, we choose cluster as the attribute name and label as the target role. After that, we use a split data operator to split the data into training and testing sets; click the edit enumeration button to set the training and testing ratios. Then, we use the random forest operator to generate a random forest model. We set the parameters for number of trees, criterion and maximal depth. Add an apply model operator and a performance operator to check the accuracy of the model. Run the process and look at the results table.
Figure 60: Design for Random Forest without HyperParameter Tuning
In the split data operator, we click edit enumeration to fill in the ratios for training and testing. The first ratio is for training; then click add entry for the testing ratio. For example, in the figure below, 0.7 and 0.3 mean a 70% training ratio and a 30% testing ratio.
Figure 61: Split Data into Training and Testing
We set the number of trees to 100, the criterion to gain_ratio and the maximal depth to 10.
Figure 62: Parameter for Random Forest Operator
Design With HyperParameter Tuning
The difference between the designs with and without hyperparameter tuning is the optimize parameters (grid) operator, which finds the optimal values of the selected parameters for the operators in its subprocess. We first retrieve the Excel file YTvideos-DDM-KM and then connect it to a set role operator. In the set role operator, we choose cluster as the attribute name and label as the target role. After that, we connect the set role operator to the optimize parameters (grid) operator.
Figure 63: Design for Random Forest with HyperParameter Tuning
Inside the optimize parameters (grid) operator, we use a split data operator to split the data into training and testing sets; click the edit enumeration button to set the ratios. Then, we use the random forest operator to generate a random forest model with number of trees 100, criterion gain_ratio and maximal depth 10. Add an apply model operator and a performance operator to check the accuracy of the model. Run the process and look at the results table.
Figure 64: Inside the process in Optimize Parameters (Grid) for Random Forest
Naive Bayes
Design Without HyperParameter Tuning
The second predictive model that we are going to use is Naive Bayes. We first retrieve the Excel file YTvideos-DDM-KM and connect it to a set role operator. In the set role operator, we choose cluster as the attribute name and label as the target role. After that, we use a split data operator to split the data into training and testing sets; click the edit enumeration button to set the ratios. Then, we use the naive bayes operator to generate a naive bayes model, ticking the laplace correction box. Add an apply model operator and a performance operator to check the accuracy of the model. Run the process and look at the results table.
Figure 65: Design for Naive Bayes without HyperParameter Tuning
Design With HyperParameter Tuning
For the Naive Bayes design with hyperparameter tuning, we add the optimize parameters (grid) operator to find the optimal values for the selected parameters. We first retrieve the Excel file YTvideos-DDM-KM and then connect it to a set role operator. In the set role operator, we choose cluster as the attribute name and label as the target role. We then connect the set role operator to optimize parameters (grid).
Figure 66: Design for Naive Bayes with HyperParameter Tuning
Inside optimize parameters (grid), we use a split data operator to split the data into training and testing sets; click the edit enumeration button to set the ratios. Then, we use the naive bayes operator to generate a naive bayes model, ticking the laplace correction box. Add an apply model operator and a performance operator to check the accuracy of the model. Run the process and look at the results table.
Figure 67: Inside the process in Optimize Parameters (Grid) for Naive Bayes
K-NN
Design Without HyperParameter Tuning
The third predictive model that we are going to use is k-NN. We first retrieve the Excel file YTvideos-DDM-KM and then connect it to a set role operator. In the set role operator, we choose cluster as the attribute name and label as the target role. After that, we use a split data operator to split the data into training and testing sets; click the edit enumeration button to set the ratios. Then, we use the k-NN operator to generate a k-NN model with the parameter k = 5. Add an apply model operator and a performance operator to check the accuracy of the model. Run the process and look at the results table.
Figure 68: Design for k-NN without HyperParameter Tuning
For the k-NN operator parameters, we set k = 5, tick weighted vote, choose MixedMeasures for measure types and MixedEuclideanDistance for mixed measure.
Figure 69: Parameters for k-NN operator
Design With HyperParameter Tuning
For the k-NN design with hyperparameter tuning, we add the optimize parameters (grid) operator to find the optimal values for the selected parameters. We first retrieve the Excel file YTvideos-DDM-KM and then connect it to a set role operator. In the set role operator, we choose cluster as the attribute name and label as the target role. We then connect the set role operator to optimize parameters (grid).
Figure 70: Design for k-NN with HyperParameter Tuning
Inside optimize parameters (grid), we use a split data operator to split the data into training and testing sets; click the edit enumeration button to set the ratios. Then, we use the k-NN operator to generate a k-NN model with k = 5. Add an apply model operator and a performance operator to check the accuracy of the model. Run the process and look at the results table.
Figure 71: Inside the process in Optimize Parameters (Grid) for k-NN
Full code: Click here
1. Data Loading
We first import the libraries for our predictive machine learning model.
#import the classifier
from sklearn import metrics
## Import Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB
## Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
## Import kNN Classifier
from sklearn.neighbors import KNeighborsClassifier
Then, we upload the YTvideos-DDM-KM dataset to Google Colab. We load the data into a data frame called youtube_data_cluster. Then, we look at the first 5 rows of data.
from google.colab import files
uploaded = files.upload()
#loading data from a file in the Colab folder named YTvideos-DDM-KM.xlsx into a data frame called youtube_data_cluster
youtube_data_cluster = pd.read_excel('YTvideos-DDM-KM.xlsx')
#read the first 5 rows in youtube_data_cluster
youtube_data_cluster.head()
Figure 72: First 5 rows of data
2. Data Pre-Processing
We can see that the cluster column is categorical, so we need to do a data transformation and convert it into numerical values. We replace cluster_0 with 0, cluster_1 with 1 and cluster_2 with 2.
youtube_data_cluster = youtube_data_cluster.replace({"cluster":{"cluster_0":0,"cluster_1":1,"cluster_2":2}})
youtube_data_cluster
Then, we split the data into training and testing set. We set our training and testing ratio.
#Lets split the data into train and test set
X = youtube_data_cluster.iloc[:,:-1]
y = youtube_data_cluster.iloc[:, -1].values
Experiment 1: 70% Training and 30% Testing Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
Experiment 2: 50% Training and 50% Testing Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.50, random_state = 0)
Experiment 3: 30% Training and 70% Testing Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.70, random_state = 0)
3. Predictive Machine Learning Model
Random Forest
The first predictive model we are going to use is Random Forest. We calculate the accuracy and the classification error of the model.
RandomF = RandomForestClassifier()
RandomF.fit(X_train, y_train)
y_pred = RandomF.predict(X_test)
#print('Random Forest Model:')
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Random Forest = {:.2f}%'.format(accuracy))
print('Classification Error for Random Forest = {:.2f}%'.format(100 - accuracy))
Then, we do hyperparameter tuning to choose the best parameters for our model. We use the GridSearchCV class to find the best parameters for the random forest model.
from sklearn.model_selection import (GridSearchCV, cross_val_score, cross_val_predict,
                                     StratifiedKFold, learning_curve)

K_fold = StratifiedKFold(n_splits=10)

# RFC parameter tuning
RFC = RandomForestClassifier()

## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
                 "min_samples_split": [2, 6, 20],
                 "min_samples_leaf": [1, 4, 16],
                 "n_estimators": [100, 200, 300, 400],
                 "criterion": ["gini"]}

gsRFC = GridSearchCV(RFC, param_grid=rf_param_grid, cv=K_fold,
                     scoring="accuracy", n_jobs=4, verbose=1)

gsRFC.fit(X_train, y_train)
RFC_best = gsRFC.best_estimator_
gsRFC.best_params_
After that, we apply the parameters from hyperparameter tuning to our model and check the accuracy and classification error again.
RandomF = RandomForestClassifier(criterion= 'gini', max_depth= None, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200)
RandomF.fit(X_train, y_train)
y_pred = RandomF.predict(X_test)
#print('Random Forest Model:')
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Random Forest = {:.2f}%'.format(accuracy))
print('Classification Error for Random Forest = {:.2f}%'.format(100 - accuracy))
Naive Bayes
The second predictive model we are going to use is Naive Bayes. We calculate the accuracy and the classification error of the model.
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
# Model Accuracy, how often is the classifier correct?
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Naive Bayes = {:.2f}%'.format(accuracy))
print('Classification Error for Naive Bayes = {:.2f}%'.format(100 - accuracy))
Then, we do hyperparameter tuning on the model using RepeatedStratifiedKFold to choose the best parameters for our naive bayes model.
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import PowerTransformer

cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)

from sklearn.naive_bayes import GaussianNB
np.random.seed(999)
nb_classifier = GaussianNB()

params_NB = {'var_smoothing': np.logspace(0, -9, num=100)}

gs_NB = GridSearchCV(estimator=nb_classifier,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')

Data_transformed = PowerTransformer().fit_transform(X)
gs_NB.fit(Data_transformed, y);
gs_NB.best_params_
After that, we apply the parameters to our model and calculate the accuracy and classification error again.
#Create a Gaussian Classifier
gnb = GaussianNB(priors=None, var_smoothing= 0.0001)
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
# Model Accuracy, how often is the classifier correct?
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Naive Bayes = {:.2f}%'.format(accuracy))
print('Classification Error for Naive Bayes = {:.2f}%'.format(100 - accuracy))
K-NN
The third predictive model we are going to use is k-NN. We calculate the accuracy and the classification error of the model.
knn = KNeighborsClassifier(n_neighbors= 20)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for kNN = {:.2f}%'.format(accuracy))
print('Classification Error for kNN = {:.2f}%'.format(100 - accuracy))
Then, we do hyperparameter tuning using GridSearchCV to find the best parameters for the k-NN model.
import numpy as np

params_KNN = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7],
              'p': [1, 2, 5]}

from sklearn.model_selection import GridSearchCV

gs_KNN = GridSearchCV(estimator=KNeighborsClassifier(),
                      param_grid=params_KNN,
                      cv=cv_method,
                      verbose=1,  # verbose: the higher, the more messages
                      scoring='accuracy',
                      return_train_score=True)

gs_KNN.fit(X, y);
gs_KNN.best_params_
After we know the best parameters, we apply them to the k-NN model and calculate the accuracy and classification error again.
#Create the k-NN classifier with the best parameters
knn = KNeighborsClassifier(n_neighbors=7, p=2)
#Train the model using the training sets
knn.fit(X_train, y_train)
#Predict the response for the test dataset
y_pred = knn.predict(X_test)
# Model Accuracy, how often is the classifier correct?
accuracy = (metrics.accuracy_score(y_pred, y_test))
accuracy = accuracy * 100
print('Accuracy for kNN = {:.2f}%'.format(accuracy))
print('Classification Error for kNN = {:.2f}%'.format(100 - accuracy))
For our experiment, we are going to use three machine learning models: random forest, naive bayes and k-NN. We use three different dataset splits: 70% training and 30% testing, 50% training and 50% testing, and 30% training and 70% testing. We investigate the results before and after hyperparameter tuning, in both RapidMiner and Python.
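As a compact alternative, all nine baseline runs (three models times three splits) can be scripted in one loop. This is a sketch, assuming the X and y arrays from the pre-processing step above:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

models = {'Random Forest': RandomForestClassifier(),
          'Naive Bayes': GaussianNB(),
          'k-NN': KNeighborsClassifier(n_neighbors=20)}

for test_size in (0.30, 0.50, 0.70):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    for name, model in models.items():
        # Fit on the training split and score on the held-out split
        acc = metrics.accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
        print('{} with {:.0%} testing ratio: {:.2f}%'.format(name, test_size, acc * 100))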
Random Forest
Firstly, we are going to do our experiment in RapidMiner. The first machine learning model we are going to use is random forest.
Before HyperParameter Tuning
We can see in the figure below that, using a 70% training and 30% testing ratio without hyperparameter tuning, we get an accuracy of 99.95%.
Figure 73: Accuracy for Random Forest without HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio without hyperparameter tuning, we get an accuracy of 99.94%.
Figure 74: Accuracy for Random Forest without HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio without hyperparameter tuning, we get an accuracy of 99.90%.
So, without hyperparameter tuning, the 70% training and 30% testing ratio has the best performance for random forest, followed by the 50% training and 50% testing ratio and lastly the 30% training and 70% testing ratio.
Figure 75: Accuracy for Random Forest without HyperParameter Tuning (30% Training and 70% Testing Ratio)
After HyperParameter Tuning
We do hyperparameter tuning to increase our model performance. The figure below shows the best parameters for our random forest model.
Figure 76: Result for Optimize Parameters (Grid) for Random Forest
We can see in the figure below that, using a 70% training and 30% testing ratio with hyperparameter tuning, we get an accuracy of 100%. The accuracy increases by 0.05%.
Figure 77: Accuracy for Random Forest with HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio with hyperparameter tuning, we also get an accuracy of 100%. The accuracy increases by 0.06%.
Figure 78: Accuracy for Random Forest with HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio with hyperparameter tuning, we also get an accuracy of 100%. The accuracy increases by 0.1%.
So, we can say that all split ratios for random forest reach the best model performance after hyperparameter tuning, because all the models get 100% accuracy.
Figure 79: Accuracy for Random Forest with HyperParameter Tuning (30% Training and 70% Testing Ratio)
Naive Bayes
The second machine learning model we are going to use is naive bayes.
Before HyperParameter Tuning
We can see in the figure below that, using a 70% training and 30% testing ratio without hyperparameter tuning, we get an accuracy of 93.06%.
Figure 80: Accuracy for Naive Bayes without HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio without hyperparameter tuning, we get an accuracy of 91.97%.
Figure 81: Accuracy for Naive Bayes without HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio without hyperparameter tuning, we get an accuracy of 91.42%.
So, without hyperparameter tuning, the 70% training and 30% testing ratio has the best performance for naive bayes, followed by the 50% training and 50% testing ratio and lastly the 30% training and 70% testing ratio.
Figure 82: Accuracy for Naive Bayes without HyperParameter Tuning (30% Training and 70% Testing Ratio)
After HyperParameter Tuning
Then, we do hyperparameter tuning to increase our model performance. The figure below shows the best parameters for our naive bayes model.
Figure 83: Result for Optimize Parameters (Grid) for Naive Bayes
We can see in the figure below that, using a 70% training and 30% testing ratio with hyperparameter tuning, we get an accuracy of 92.38%. The accuracy decreases by 0.68%.
Figure 84: Accuracy for Naive Bayes with HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio with hyperparameter tuning, we get an accuracy of 92.20%. The accuracy increases by 0.23%.
Figure 85: Accuracy for Naive Bayes with HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio with hyperparameter tuning, we get an accuracy of 91.14%. The accuracy decreases by 0.28%.
We can see that for the 70%/30% and 30%/70% ratios, the accuracy actually decreases after hyperparameter tuning; only the 50%/50% model improves. So, the hyperparameter tuning is not acceptable for those models, and we keep their pre-tuning accuracies. Thus, even with hyperparameter tuning, the 70% training and 30% testing ratio has the best performance for naive bayes, followed by the 50% training and 50% testing ratio and lastly the 30% training and 70% testing ratio.
Figure 86: Accuracy for Naive Bayes with HyperParameter Tuning (30% Training and 70% Testing Ratio)
K-NN
The third machine learning model we are going to use is k-NN.
Before HyperParameter Tuning
We can see in the figure below that, using a 70% training and 30% testing ratio without hyperparameter tuning, we get an accuracy of 99.80%.
Figure 87: Accuracy for k-NN without HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio without hyperparameter tuning, we get an accuracy of 99.74%.
Figure 88: Accuracy for k-NN without HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio without hyperparameter tuning, we get an accuracy of 99.69%.
So, without hyperparameter tuning, the 70% training and 30% testing ratio has the best performance for k-NN, followed by the 50% training and 50% testing ratio and lastly the 30% training and 70% testing ratio.
Figure 89: Accuracy for k-NN without HyperParameter Tuning (30% Training and 70% Testing Ratio)
After HyperParameter Tuning
Then, we do hyperparameter tuning to increase our model performance. The figure below shows the best parameters for our k-NN model.
Figure 90: Result for Optimize Parameters (Grid) for k-NN
We can see in the figure below that, using a 70% training and 30% testing ratio with hyperparameter tuning, we get an accuracy of 100%. The accuracy increases by 0.2%.
Figure 91: Accuracy for k-NN with HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio with hyperparameter tuning, we get an accuracy of 100%. The accuracy increases by 0.26%.
Figure 92: Accuracy for k-NN with HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio with hyperparameter tuning, we get an accuracy of 99.94%. The accuracy increases by 0.25%.
We can see that every model's accuracy increases after hyperparameter tuning. So, with hyperparameter tuning, the 70% training and 30% testing ratio and the 50% training and 50% testing ratio have the best performance for k-NN, because both reach 100% accuracy, followed by the 30% training and 70% testing ratio.
Figure 93: Accuracy for k-NN with HyperParameter Tuning (30% Training and 70% Testing Ratio)
Next, we are going to do our experiment in Python.
We first split the data into the various training and testing ratios. Below is the code to split our data.
#Lets split the data into train and test set
X = youtube_data_cluster.iloc[:,:-1]
y = youtube_data_cluster.iloc[:, -1].values
Experiment 1: 70% Training and 30% Testing Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
Experiment 2: 50% Training and 50% Testing Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.50, random_state = 0)
Experiment 3: 30% Training and 70% Testing Ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.70, random_state = 0)
Random Forest
The first machine learning model that we are going to use is random forest.
Before HyperParameter Tuning
Below is the code to calculate the accuracy of random forest model.
RandomF = RandomForestClassifier()
RandomF.fit(X_train, y_train)
y_pred = RandomF.predict(X_test)
#print('Random Forest Model:')
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Random Forest = {:.2f}%'.format(accuracy))
print('Classification Error for Random Forest = {:.2f}%'.format(100 - accuracy))
We can see in the figure below that, using a 70% training and 30% testing ratio without hyperparameter tuning, we get an accuracy of 99.85%.
Figure 94: Accuracy for Random Forest without HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio without hyperparameter tuning, we get an accuracy of 99.91%.
Figure 95: Accuracy for Random Forest without HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio without hyperparameter tuning, we get an accuracy of 99.92%.
So, without hyperparameter tuning, the 30% training and 70% testing ratio has the best performance for random forest, followed by the 50% training and 50% testing ratio and lastly the 70% training and 30% testing ratio. We can see that this is quite different from RapidMiner.
Figure 96: Accuracy for Random Forest without HyperParameter Tuning (30% Training and 70% Testing Ratio)
HyperParameter Tuning on Random Forest
We do hyperparameter tuning to increase the performance of our model. Below is the code for hyperparameter tuning on random forest.
from sklearn.model_selection import (GridSearchCV, cross_val_score, cross_val_predict,
                                     StratifiedKFold, learning_curve)

K_fold = StratifiedKFold(n_splits=10)

# RFC parameter tuning
RFC = RandomForestClassifier()

## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
                 "min_samples_split": [2, 6, 20],
                 "min_samples_leaf": [1, 4, 16],
                 "n_estimators": [100, 200, 300, 400],
                 "criterion": ["gini"]}

gsRFC = GridSearchCV(RFC, param_grid=rf_param_grid, cv=K_fold,
                     scoring="accuracy", n_jobs=4, verbose=1)

gsRFC.fit(X_train, y_train)
RFC_best = gsRFC.best_estimator_
gsRFC.best_params_
The figure below shows the best parameters for our random forest model.
Figure 97: Best Parameter for Random Forest
After HyperParameter Tuning
From the figure above, we know the best parameters for our random forest. We apply those parameters to the model and calculate the accuracy again. Below is the example code after applying the best parameters from hyperparameter tuning.
RandomF = RandomForestClassifier(criterion= 'gini', max_depth= None, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 200)
RandomF.fit(X_train, y_train)
y_pred = RandomF.predict(X_test)
#print('Random Forest Model:')
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Random Forest = {:.2f}%'.format(accuracy))
print('Classification Error for Random Forest = {:.2f}%'.format(100 - accuracy))
We can see in the figure below that, using a 70% training and 30% testing ratio with hyperparameter tuning, we get an accuracy of 99.85%. The accuracy does not increase.
Figure 98: Accuracy for Random Forest with HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio with hyperparameter tuning, we get an accuracy of 99.91%. The accuracy does not increase.
Figure 99: Accuracy for Random Forest with HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio with hyperparameter tuning, we get an accuracy of 99.92%. The accuracy does not increase.
We can see that none of the accuracies increase after hyperparameter tuning on the random forest model; maybe there is something wrong with the code. So, with hyperparameter tuning, the 30% training and 70% testing ratio has the best performance for random forest, followed by the 50% training and 50% testing ratio and lastly the 70% training and 30% testing ratio.
Figure 100: Accuracy for Random Forest with HyperParameter Tuning (30% Training and 70% Testing Ratio)
Naive Bayes
The second machine learning model that we are going to use is naive bayes.
Before HyperParameter Tuning
Below is the code to calculate the accuracy of naive bayes model.
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
# Model Accuracy, how often is the classifier correct?
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Naive Bayes = {:.2f}%'.format(accuracy))
print('Classification Error for Naive Bayes = {:.2f}%'.format(100 - accuracy))
We can see in the figure below that, using a 70% training and 30% testing ratio without hyperparameter tuning, we get an accuracy of 90.57%.
Figure 101: Accuracy for Naive Bayes without HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio without hyperparameter tuning, we get an accuracy of 91.79%.
Figure 102: Accuracy for Naive Bayes without HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio without hyperparameter tuning, we get an accuracy of 91.29%.
So, without hyperparameter tuning, the 50% training and 50% testing ratio has the best performance for naive bayes, followed by the 30% training and 70% testing ratio and lastly the 70% training and 30% testing ratio. We can see that this is also quite different from RapidMiner.
Figure 103: Accuracy for Naive Bayes without HyperParameter Tuning (30% Training and 70% Testing Ratio)
HyperParameter Tuning on Naive Bayes
We do hyperparameter tuning to increase the performance of our model. Below is the code for hyperparameter tuning on naive bayes.
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import PowerTransformer

cv_method = RepeatedStratifiedKFold(n_splits=5,
                                    n_repeats=3,
                                    random_state=999)

from sklearn.naive_bayes import GaussianNB
np.random.seed(999)
nb_classifier = GaussianNB()

params_NB = {'var_smoothing': np.logspace(0, -9, num=100)}

gs_NB = GridSearchCV(estimator=nb_classifier,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')

Data_transformed = PowerTransformer().fit_transform(X)
gs_NB.fit(Data_transformed, y);
gs_NB.best_params_
The figure below shows the best parameters for our naive bayes model.
Figure 104: Best Parameter for Naive Bayes
After HyperParameter Tuning
From the figure above, we know the best parameters for our naive bayes model. We apply those parameters and calculate the accuracy again. Below is the example code after applying the best parameters from hyperparameter tuning.
#Create a Gaussian Classifier
gnb = GaussianNB(priors=None, var_smoothing= 0.0001)
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
# Model Accuracy, how often is the classifier correct?
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for Naive Bayes = {:.2f}%'.format(accuracy))
print('Classification Error for Naive Bayes = {:.2f}%'.format(100 - accuracy))
We can see in the figure below that, using a 70% training and 30% testing ratio with hyperparameter tuning, we get an accuracy of 93.55%. The accuracy increases by 2.98%.
Figure 105: Accuracy for Naive Bayes with HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio with hyperparameter tuning, we get an accuracy of 94.20%. The accuracy increases by 2.41%.
Figure 106: Accuracy for Naive Bayes with HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio with hyperparameter tuning, we get an accuracy of 94.41%. The accuracy increases by 3.12%.
We can see that every model's accuracy increases after hyperparameter tuning. So, with hyperparameter tuning, the 30% training and 70% testing ratio has the best performance for naive bayes, followed by the 50% training and 50% testing ratio and lastly the 70% training and 30% testing ratio.
Figure 107: Accuracy for Naive Bayes with HyperParameter Tuning (30% Training and 70% Testing Ratio)
K-NN
The third machine learning model that we are going to use is k-NN.
Before HyperParameter Tuning
Below is the code to calculate the accuracy of k-NN model.
knn = KNeighborsClassifier(n_neighbors= 20)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for kNN = {:.2f}%'.format(accuracy))
print('Classification Error for kNN = {:.2f}%'.format(100 - accuracy))
We can see in the figure below that, using a 70% training and 30% testing ratio without hyperparameter tuning, we get an accuracy of 99.80%.
Figure 108: Accuracy for k-NN without HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio without hyperparameter tuning, we get an accuracy of 99.50%.
Figure 109: Accuracy for k-NN without HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio without hyperparameter tuning, we get an accuracy of 99.37%.
So, without hyperparameter tuning, the 70% training and 30% testing ratio has the best performance for k-NN, followed by the 50% training and 50% testing ratio and lastly the 30% training and 70% testing ratio.
Figure 110: Accuracy for k-NN without HyperParameter Tuning (30% Training and 70% Testing Ratio)
HyperParameter Tuning on k-NN
We do hyperparameter tuning to increase the performance of our model. Below is the code for hyperparameter tuning on k-NN.
import numpy as np

params_KNN = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7],
              'p': [1, 2, 5]}

from sklearn.model_selection import GridSearchCV

gs_KNN = GridSearchCV(estimator=KNeighborsClassifier(),
                      param_grid=params_KNN,
                      cv=cv_method,
                      verbose=1,  # verbose: the higher, the more messages
                      scoring='accuracy',
                      return_train_score=True)

gs_KNN.fit(X, y);
gs_KNN.best_params_
The figure below shows the best parameters for our k-NN model.
Figure 111: Best Parameter for k-NN
After HyperParameter Tuning
From the figure above, we know the best parameters for our k-NN model. We apply those parameters and calculate the accuracy again. Below is the example code after applying the best parameters from hyperparameter tuning.
knn = KNeighborsClassifier(n_neighbors= 7, p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = (metrics.accuracy_score(y_pred,y_test))
accuracy = accuracy * 100
print('Accuracy for kNN = {:.2f}%'.format(accuracy))
print('Classification Error for kNN = {:.2f}%'.format(100 - accuracy))
We can see in the figure below that, using a 70% training and 30% testing ratio with hyperparameter tuning, we get an accuracy of 99.85%. The accuracy increases by 0.05%.
Figure 112: Accuracy for k-NN with HyperParameter Tuning (70% Training and 30% Testing Ratio)
We can see in the figure below that, using a 50% training and 50% testing ratio with hyperparameter tuning, we get an accuracy of 99.77%. The accuracy increases by 0.27%.
Figure 113: Accuracy for k-NN with HyperParameter Tuning (50% Training and 50% Testing Ratio)
We can see in the figure below that, using a 30% training and 70% testing ratio with hyperparameter tuning, we get an accuracy of 99.73%. The accuracy increases by 0.36%.
We can see that every model's accuracy increases after hyperparameter tuning. So, with hyperparameter tuning, the 70% training and 30% testing ratio has the best performance for k-NN, followed by the 50% training and 50% testing ratio and lastly the 30% training and 70% testing ratio.
Figure 114: Accuracy for k-NN with HyperParameter Tuning (30% Training and 70% Testing Ratio)
Table 2: Result Table for Model Perfomance
From these results, we can see that the accuracies in RapidMiner and Python are slightly different; perhaps there is something wrong with the Python code. All models improve in accuracy after hyperparameter tuning, except the naive bayes model in RapidMiner, whose accuracy decreases, and the random forest model in Python, whose accuracy is the same before and after tuning. We can assume that either something is wrong with the design in RapidMiner, or the hyperparameter tuning code in Python has a mistake.
Before hyperparameter tuning in RapidMiner, the random forest model has the highest accuracy, followed by k-NN and Naive Bayes. In Python, random forest also has the highest accuracy, followed by k-NN and Naive Bayes. So, before hyperparameter tuning, RapidMiner and Python agree that random forest is the best-performing model. In RapidMiner before hyperparameter tuning, the 70% training and 30% testing ratio performs best, followed by the 50% training and 50% testing ratio and the 30% training and 70% testing ratio. In Python before hyperparameter tuning, the 50% training and 50% testing ratio performs best, followed by the 70% training and 30% testing ratio and the 30% training and 70% testing ratio. So, for the split ratio, RapidMiner and Python do not agree on the best-performing split.
After hyperparameter tuning in RapidMiner, random forest has the highest accuracy, followed by k-NN and Naive Bayes. In Python, random forest also has the highest accuracy, followed by k-NN and Naive Bayes. So, after hyperparameter tuning, RapidMiner and Python again agree that random forest is the best-performing model. In RapidMiner after hyperparameter tuning, the 70% training and 30% testing ratio performs best, followed by the 50% training and 50% testing ratio and the 30% training and 70% testing ratio. In Python after hyperparameter tuning, the 30% training and 70% testing ratio performs best, followed by the 50% training and 50% testing ratio and the 70% training and 30% testing ratio. So, for the split ratio, RapidMiner and Python do not agree on the best-performing split.
Overall, the best machine learning model is random forest, because it has the highest accuracy in both RapidMiner and Python. For the training and testing ratio, RapidMiner and Python do not agree on the best split: in RapidMiner the best ratio is 70% training and 30% testing, while in Python it is 30% training and 70% testing. We follow the RapidMiner result because RapidMiner achieved the higher accuracy: in RapidMiner, random forest with a 70% training and 30% testing ratio reached 100% accuracy, while in Python, random forest with a 30% training and 70% testing ratio reached 99.92%. Finally, we can say that random forest with a 70% training and 30% testing ratio gives the best model performance.
Summary:
Machine Learning Model
Before hyperparameter tuning: Random Forest (RapidMiner and Python)
After hyperparameter tuning: Random Forest (RapidMiner and Python)
Ratio
Before hyperparameter tuning: 70% training ratio and 30% testing ratio (RapidMiner), 50% training ratio and 50% testing ratio (Python)
After hyperparameter tuning: 70% training ratio and 30% testing ratio (RapidMiner), 30% training ratio and 70% testing ratio (Python)
Best model performance: Random forest with 70% training ratio and 30% testing ratio
To become a best trending video, the minimum numbers needed are 626,794 views, 346 likes, 67 dislikes and 3 comments.
To become a moderate trending video, the minimum numbers needed are 322,079 views, 363 likes, 38 dislikes and 6 comments.
To become a worst trending video, the minimum numbers needed are 100,053 views, 91 likes, 3 dislikes and 8 comments.
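These minimums can be turned into a simple rule-of-thumb check. The function below is a hypothetical helper (the name classify_trending_tier and the tier-checking logic are our own illustration, not part of the trained models); it only compares a video's counts against the cluster minimums listed above:
# hypothetical helper: checks counts against the cluster minimums above
def classify_trending_tier(views, likes, dislikes, comments):
    if views >= 626794 and likes >= 346 and dislikes >= 67 and comments >= 3:
        return 'Best Trending'
    if views >= 322079 and likes >= 363 and dislikes >= 38 and comments >= 6:
        return 'Moderate Trending'
    if views >= 100053 and likes >= 91 and dislikes >= 3 and comments >= 8:
        return 'Worst Trending'
    return 'Not Trending'

print(classify_trending_tier(700000, 400, 70, 10))  # Best Trending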
My YouTube Trending Prediction App: Click here
We are going to deploy our model in this section. We will build a YouTube trending prediction model based on views, likes, dislikes and comments, and then deploy it using Streamlit. We will use three machine learning models: random forest, k-NN and Naive Bayes. With this app, our stakeholders can easily predict whether their videos fall in the best, moderate or worst trending cluster, so they know how well their videos are doing. Stakeholders can target the minimum number of views, likes, dislikes and comments needed for their videos to become trending; with that target, it is easier to reach their goal and subsequently increase their profit. We will first save our models in pkl format. We import the required libraries and then read the YTvideos-DDM-KM Excel file.
Full code to make pkl file: Click here
from google.colab import files
uploaded = files.upload()
import pandas as pd
train = pd.read_excel('YTvideos-DDM-KM.xlsx')
We need to convert the cluster column into numbers.
train = train.replace({"cluster":{"cluster_0":0,"cluster_1":1,"cluster_2":2}})
Next, we will separate the dependent variable (cluster) and the independent variables. For this, I have picked the 6 variables that I think are most relevant: category_id, views, likes, dislikes, comment_count and country. These are stored in variable X, and the target variable is stored in y.
X = train[['category_id', 'views', 'likes', 'dislikes', 'comment_count', 'country']]
y = train.cluster
Here, we will first split our dataset into a training and validation set, so that we can train the model on the training set and evaluate its performance on the validation set.
from sklearn.model_selection import train_test_split
x_train, x_cv, y_train, y_cv = train_test_split(X,y, test_size = 0.3, random_state = 10)
Random Forest
We have split the data using the train_test_split function from the sklearn library, keeping the test_size at 0.3, which means 30 percent of the total dataset is kept aside for the validation set. Next, we will train the random forest model using the training set.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=4, random_state = 10)
model.fit(x_train, y_train)
Here, I have kept the max_depth at 4 for each of the trees in our random forest and stored the trained model in a variable named model. Now that our model is trained, let's check its performance on the validation set.
from sklearn.metrics import accuracy_score
pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)
Let’s also check the performance on the training set:
pred_train = model.predict(x_train)
accuracy_score(y_train,pred_train)
We save the model in pickle format as classifier.pkl. This stores the trained model, and we will use it when deploying the app.
# saving the model
import pickle
pickle_out = open("classifier.pkl", mode = "wb")
pickle.dump(model, pickle_out)
pickle_out.close()
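As an optional sanity check (our own addition, not part of the original notebook), the saved file can be loaded back and used for a prediction to confirm the round-trip works:
# reload the pickled model and verify it still predicts
with open("classifier.pkl", "rb") as f:
    reloaded = pickle.load(f)
print(reloaded.predict(x_cv[:5]))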
Naive Bayes
Then, we do the same for the Naive Bayes model.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB(priors=None, var_smoothing=1e-09)
model.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)
pred_train = model.predict(x_train)
accuracy_score(y_train,pred_train)
For the Naive Bayes model, we save it as classifier_second.pkl.
# saving the model
import pickle
pickle_out = open("classifier_second.pkl", mode = "wb")
pickle.dump(model, pickle_out)
pickle_out.close()
k-NN
We do the same for the k-NN model.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors= 20)
model.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)
pred_train = model.predict(x_train)
accuracy_score(y_train,pred_train)
For the k-NN model, we save it as classifier_third.pkl.
# saving the model
import pickle
pickle_out = open("classifier_third.pkl", mode = "wb")
pickle.dump(model, pickle_out)
pickle_out.close()
Next, we will have to create a separate session in Streamlit for our app. You can download the sessionstate.py file from here and store it in your current working directory. We load the required libraries: pickle to load the trained models and streamlit to build the app. Then we load the three trained models and save them in the variables classifier, classifier_second and classifier_third.
import pickle
import streamlit as st
# loading the trained model
pickle_in = open('classifier.pkl', 'rb')
classifier = pickle.load(pickle_in)
pickle_in_second = open('classifier_second.pkl', 'rb')
classifier_second = pickle.load(pickle_in_second)
pickle_in_third = open('classifier_third.pkl', 'rb')
classifier_third = pickle.load(pickle_in_third)
import base64

@st.cache(allow_output_mutation=True)
def get_base64_of_bin_file(bin_file):
    with open(bin_file, 'rb') as f:
        data = f.read()
    return base64.b64encode(data).decode()

def set_png_as_page_bg(png_file):
    bin_str = get_base64_of_bin_file(png_file)
    page_bg_img = '''
    <style>
    body {
    background-image: url("data:image/png;base64,%s");
    background-size: cover;
    }
    </style>
    ''' % bin_str
    st.markdown(page_bg_img, unsafe_allow_html=True)
    return

set_png_as_page_bg('background.jpg')
Next, we define the prediction functions. These functions take the data provided by the user as input and make the prediction using the models we loaded earlier. Each takes the category, views, likes, dislikes, comments and country as input, pre-processes that input so it can be fed to the model, and finally makes the prediction using the loaded classifier. We have three prediction functions, one for each machine learning model: random forest, Naive Bayes and k-NN. In the end, each returns whether the video is best, moderate or worst trending based on the output of the model.
@st.cache()
# defining the function which will make the prediction using the data which the user inputs
# Random Forest
def prediction(category_id, views, likes, dislikes, comment_count, country):
    # Pre-processing user input
    if country == "France":
        country = 1
    elif country == "United Kingdom":
        country = 2
    if category_id == "Entertainment":
        category_id = 1
    elif category_id == "Music":
        category_id = 2
    elif category_id == "People & Blogs":
        category_id = 3
    # Making predictions
    prediction = classifier.predict([[category_id, views, likes, dislikes, comment_count, country]])
    if prediction == 0:
        pred = 'Worst Trending'
    elif prediction == 1:
        pred = 'Best Trending'
    else:
        pred = 'Moderate Trending'
    return pred

# Naive Bayes
def prediction_second(category_id, views, likes, dislikes, comment_count, country):
    # Pre-processing user input
    if country == "France":
        country = 1
    elif country == "United Kingdom":
        country = 2
    if category_id == "Entertainment":
        category_id = 1
    elif category_id == "Music":
        category_id = 2
    elif category_id == "People & Blogs":
        category_id = 3
    # Making predictions
    prediction_second = classifier_second.predict([[category_id, views, likes, dislikes, comment_count, country]])
    if prediction_second == 0:
        pred = 'Worst Trending'
    elif prediction_second == 1:
        pred = 'Best Trending'
    else:
        pred = 'Moderate Trending'
    return pred

# k-NN
def prediction_third(category_id, views, likes, dislikes, comment_count, country):
    # Pre-processing user input
    if country == "France":
        country = 1
    elif country == "United Kingdom":
        country = 2
    if category_id == "Entertainment":
        category_id = 1
    elif category_id == "Music":
        category_id = 2
    elif category_id == "People & Blogs":
        category_id = 3
    # Making predictions
    prediction_third = classifier_third.predict([[category_id, views, likes, dislikes, comment_count, country]])
    if prediction_third == 0:
        pred = 'Worst Trending'
    elif prediction_third == 1:
        pred = 'Best Trending'
    else:
        pred = 'Moderate Trending'
    return pred
In the main function, we define the header of the app, which displays "YouTube Trending Prediction". To do that, we use the markdown function from streamlit. Next, we create three select boxes and four sliders in the app to take input from the users. These represent the features on which our models were trained.
# this is the main function in which we define our webpage
def main():
    # front end elements of the web page
    html_temp = """
    <div style ="background-color:Red;width: 700px; height: 70px; border-radius: 100px / 50px">
    <h1 style ="color:black;text-align:center;">Youtube Trending Prediction</h1>
    </div>
    """
    # display the front end aspect
    st.markdown(html_temp, unsafe_allow_html=True)
The first box is for the machine learning model. The user has three options, Random Forest, Naive Bayes and k-NN, and picks one of them. We create the dropdown using the selectbox function of streamlit. Similarly, for Country we provide two options, France and United Kingdom, and again the user picks one. Next, the user chooses a Category from three options: Entertainment, Music and People & Blogs. Then, the user inputs the views, likes, dislikes and comments using the sliders.
    # following lines create boxes in which user can enter data required to make prediction
    model = st.selectbox('Machine Learning Model', ("Random Forest", "Naive Bayes", "k-NN"))
    country = st.selectbox('Country', ("France", "United Kingdom"))
    category_id = st.selectbox('Category', ("Entertainment", "Music", "People & Blogs"))
    views = st.sidebar.slider("View", 100053, 999966)
    likes = st.sidebar.slider("Likes", 91, 176713)
    dislikes = st.sidebar.slider("Dislikes", 8, 28372)
    comment_count = st.sidebar.slider("Comments", 3, 31749)
At the end of the app, there is a Predict button. After filling in the details, the user clicks that button; the matching prediction function is then called, and the result of the YouTube trending prediction is displayed in the app.
result = ""
# when 'Predict' is clicked, make the prediction and store it
if st.button("Predict"):
if model == "Random Forest":
result = prediction(category_id, views, likes, dislikes, comment_count, country)
st.success('Your Video Is On The {}'.format(result))
elif model == "Naive Bayes":
result = prediction_second(category_id, views, likes, dislikes, comment_count, country)
st.success('Your Video Is On The {}'.format(result))
elif model == "k-NN":
result = prediction_third(category_id, views, likes, dislikes, comment_count, country)
st.success('Your Video Is On The {}'.format(result))
print("Done!")
if __name__ == '__main__':
main()
Next, we need to deploy the Streamlit app to Heroku so that we can access it on the internet.
First, we need to add some files that allow Heroku to install the needed requirements and run the application.
requirements.txt
setup.sh
Procfile
The requirements.txt file contains all the libraries that need to be installed for the project to work. This file can be created manually by going through all the files and noting which libraries are used. For our app we use the streamlit and scikit-learn libraries. The requirements file should look something like the following:
streamlit==0.75.0
scikit-learn==0.24.1
Using the setup.sh and Procfile files, you can tell Heroku the commands needed to start the application.
setup.sh
mkdir -p ~/.streamlit/
echo "\
[server]\n\
port = $PORT\n\
enableCORS = false\n\
headless = true\n\
\n\
" > ~/.streamlit/config.toml
The Procfile is used to first execute the setup.sh and then call streamlit run to run the application.
Procfile
web: sh setup.sh && streamlit run youtube.py
Next, we need to create a Heroku account. Then go to the dashboard and click Create new app. Write the app name and click the Create app button. After that, we need to connect it to our GitHub account: search for the repository of the Streamlit app and click Connect. Go to Manual deploy and click Deploy Branch. Finally, your Streamlit app is available on the internet.
Figure 115: Create new app in Heroku
Figure 116: Connect Github repository to Heroku
Figure 117: Deploy the app
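As an alternative to the dashboard flow shown above, the same deployment can be done from the terminal with the standard Heroku CLI; here your-app-name is a placeholder, and this assumes the repository's default branch is main:
heroku login
heroku create your-app-name
git push heroku main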
Figure 118: YouTube Trending Prediction App
Demonstration Video for YouTube Trending Prediction App
In conclusion, this project lets us give recommendations and suggestions to our stakeholders. It helps them understand what factors affect how popular their videos will be. For our project, the factors that affect video popularity are the numbers of views, likes, dislikes and comments, and the stakeholders can learn the minimum values needed for each. To become a best trending video, the minimums are 626,794 views, 346 likes, 67 dislikes and 3 comments. To become a moderate trending video, the minimums are 322,079 views, 363 likes, 38 dislikes and 6 comments. To become a worst trending video, the minimums are 100,053 views, 91 likes, 3 dislikes and 8 comments. When stakeholders know these minimums, they can set targets so that their videos become trending on YouTube, and it is easier for them to work toward a known target. When they reach the target, they can earn more profit from their videos. For example, when a company markets its product using YouTube videos, many people will learn about the product, which attracts new customers. Furthermore, when their videos are consistently trending, more people will know the company, watch its videos, and subsequently increase its profit.
For the reflection, this project gave me good experience even though it was quite hard and challenging. I learned many new things that are important for becoming a data scientist, from using tools like RapidMiner and Power BI to finally deploying the model using Streamlit. It also enhanced my coding skills, especially in Python. Data analytics matters because a company can use it to make better business decisions and optimize its performance; implementing it in the business model can help companies reduce costs by identifying more efficient ways of doing business.