Overview:
Decision Trees (DTs) are versatile machine learning models used for both classification and regression tasks. They form a tree-like structure in which internal nodes represent feature tests, branches indicate test outcomes, and leaves hold class labels or predicted values. DTs are applied in diverse fields, including medical diagnosis, finance, customer relationship management, anomaly detection, and recommendation systems. Split quality is evaluated with metrics such as Gini impurity, entropy, and information gain; information gain quantifies the reduction in impurity or entropy after a split and aids in feature selection. Because of the many possible feature combinations and split points, the space of possible DTs is effectively unbounded, but practical implementations constrain tree growth for better interpretability and generalization.
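The impurity measures mentioned above can be sketched in a few lines. This is a minimal illustration on toy labels, not the scikit-learn implementation used later in the code:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy: -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Reduction in entropy after splitting parent into left/right children
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

labels = np.array([0, 0, 1, 1])
print(gini(labels))     # 0.5 for a perfectly balanced binary node
print(information_gain(labels, labels[:2], labels[2:]))  # 1.0: the split is pure
```

A perfectly balanced binary node has the maximum Gini impurity (0.5) and entropy (1.0); a split that yields pure children recovers all of that entropy as information gain.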
Data Prep:
In supervised machine learning, labeled data is essential. The data is divided into two distinct subsets: the Training Set (typically 80% of the data) and the Testing Set (usually 20%). The Training Set is used to teach the model by exposing it to labeled examples, enabling it to learn patterns and associations. The Testing Set, kept separate from the Training Set, is used to evaluate the model's performance, allowing an unbiased assessment of its predictive capabilities. Maintaining this separation ensures the model does not merely memorize the training data but learns to generalize accurately to new, unseen data, reflecting its ability to perform in real-world applications. I have also applied bootstrapping to the data to improve the accuracy of my models.
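The 80/20 split described above can be sketched on toy data (hypothetical values, not the YouTube dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, alternating binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80% training / 20% testing; fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models trained on the same partition.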
For the dataset, I have used the "youtube_data_final_1_1.csv" file.
Image of the dataset:
Image of the Training Set and Testing set:
Explanation:
Two Scatter Plots: The code generates two scatter plots to visually represent the data.
First Scatter Plot (Training Data):
plt.subplot(1, 2, 1): This line of code prepares to create the first scatter plot and places it in a 1x2 grid of subplots, indicating that it will be the first subplot.
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.Paired): In this line, it creates the scatter plot using the training data.
X_train[:, 0] and X_train[:, 1] correspond to the first two columns of the training data X_train. These two columns likely represent two specific features of the data.
c=y_train assigns different colors to data points based on their class labels. In this case, it's using the class labels from the training set y_train.
cmap=plt.cm.Paired specifies the color map to be used for the scatter plot.
Second Scatter Plot (Testing Data):
plt.subplot(1, 2, 2): This line of code prepares to create the second scatter plot and places it in the same 1x2 grid of subplots, indicating that it will be the second subplot.
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.Paired): This line creates the scatter plot for the testing data.
X_test[:, 0] and X_test[:, 1] represent the same two columns as in the training data but for the testing data.
c=y_test assigns colors to data points based on their class labels from the testing set y_test.
cmap=plt.cm.Paired specifies the same color map as in the first plot.
Title and Labels: For both scatter plots, the code adds titles to the plots. The first plot is labeled "Training Set," and the second is labeled "Testing Set."
Display: Finally, the code uses plt.tight_layout() to ensure that the plots are nicely arranged and not overlapping, and plt.show() to display the plots.
So we have images of the training and testing datasets in the form of scatter plots. The axes correspond to the first two feature columns of X (likely 'Dislikes' and 'Views'), while the colors encode the target variable 'Likes'.
Code:
Bootstrapping
import pandas as pd
import numpy as np

# Replace 'your_file_path.csv' with the actual path to your CSV file
file_path = 'youtube_data_final_1.csv'

# Number of bootstrapped samples to create
num_samples = 100  # Adjust as needed

# Read the original dataset into a DataFrame
original_df = pd.read_csv(file_path)

# Perform bootstrapping: each sample draws len(original_df) rows with replacement
bootstrap_samples = []
for _ in range(num_samples):
    bootstrap_sample = original_df.sample(n=len(original_df), replace=True)
    bootstrap_samples.append(bootstrap_sample)

# Concatenate all samples into one DataFrame with a fresh index
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
bootstrapped_df = pd.concat(bootstrap_samples, ignore_index=True)

# Display the first few rows of the bootstrapped DataFrame
print(bootstrapped_df.head())
Decision Trees:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset from the CSV file
file_path = 'youtube_data_final_1_1.csv'
# Link to "youtube_data_final_1_1.csv" file : https://drive.google.com/file/d/15pKK_1ftSTwLWlLo06nHcDd6pR0xKHIw/view?usp=drive_link

# List of numeric columns to read (excludes non-numeric columns)
numeric_columns = ['Likes', 'Dislikes', 'Views', 'Video_Duration (in seconds)']

# Read the CSV file with only those columns
df_1 = pd.read_csv(file_path, usecols=numeric_columns)

# 'Likes' serves as the class label; the remaining columns are features
X = df_1.drop('Likes', axis=1)  # Features
y = df_1['Likes']               # Target variable

# Split the data into a training set (80%) and a testing set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Convert the feature data to NumPy arrays for positional indexing
X_train = np.array(X_train)
X_test = np.array(X_test)

# Create a scatter plot of the training data
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.Paired)
plt.title('Training Set')

# Create a scatter plot of the testing data
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=plt.cm.Paired)
plt.title('Testing Set')

plt.tight_layout()
plt.show()

# Define a list of dictionaries, each representing a different hyperparameter grid
param_grids = [
    {
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    {
        'criterion': ['gini'],
        'max_depth': [None, 15, 25],
        'min_samples_split': [3, 6, 9],
        'min_samples_leaf': [2, 3, 4]
    },
    {
        'criterion': ['entropy'],
        'max_depth': [None, 12, 22],
        'min_samples_split': [4, 7, 11],
        'min_samples_leaf': [3, 4, 5]
    }
]

# Loop through the parameter grids, tune each model, and keep the best one
best_models = []
for param_grid in param_grids:
    dt_model = DecisionTreeClassifier(random_state=123)
    grid_search = GridSearchCV(dt_model, param_grid, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_params = grid_search.best_params_
    best_model = DecisionTreeClassifier(**best_params, random_state=123)
    best_model.fit(X_train, y_train)
    best_models.append(best_model)

# Make predictions and evaluate each tuned model
for i, model in enumerate(best_models):
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    confusion = confusion_matrix(y_test, predictions)
    print(f"Decision Tree {i + 1} - Accuracy: {accuracy}")
    print(f"Decision Tree {i + 1} - Confusion Matrix:")
    print(confusion)
Explanation:
Libraries are imported:
matplotlib.pyplot for data visualization.
DecisionTreeClassifier from sklearn.tree for the Decision Tree classifier.
train_test_split for splitting the dataset.
GridSearchCV for hyperparameter tuning.
accuracy_score and confusion_matrix from sklearn.metrics.
pandas for data manipulation.
The dataset is loaded from a CSV file located at 'youtube_data_final_1_1.csv'.
The target variable 'Likes' and the features (excluding 'Likes') are defined.
The data is split into a training set (80%) and a testing set (20%) using train_test_split().
The training data is converted to NumPy arrays.
Two scatter plots are created to visualize the training and testing data. The first plot shows the training set, and the second plot shows the testing set. Data points are color-coded based on the 'Likes' (target variable).
A list of dictionaries (param_grids) representing different sets of hyperparameters for the Decision Tree classifier is defined. This is used to search for the best combination of hyperparameters.
An empty list best_models is created to store the best models after hyperparameter tuning.
A loop iterates through the param_grids, and for each set of hyperparameters, a Decision Tree model is trained and tuned using GridSearchCV. The best hyperparameters are identified.
The best Decision Tree model is created with the best hyperparameters and is fitted to the training data. This model is added to the best_models list.
Predictions are made on the testing set for each of the best models, and accuracy scores and confusion matrices are computed for each model.
The accuracy and confusion matrix for each Decision Tree model are printed to evaluate the models.
Results:
Explanation:
Fitting Process: The output begins with messages of the form "Fitting 3 folds for each of X candidates, totalling Y fits," indicating that the grid search performed 3-fold cross-validation over each set of candidate hyperparameters.
Decision Tree Models: Three Decision Tree models were trained and tuned with distinct sets of hyperparameters. These models are labeled as "Decision Tree 1," "Decision Tree 2," and "Decision Tree 3."
Accuracy: For each Decision Tree model, the accuracy is reported as "Accuracy: 1.0," meaning the model correctly classified every instance in the testing data. Perfect accuracy on held-out data is unusual, however, and can signal data leakage, for example when bootstrapping duplicates rows that then appear in both the training and testing sets, so this result should be interpreted with caution.
Confusion Matrix: The confusion matrix for each model is also displayed. The confusion matrix provides a detailed breakdown of how the model's predictions align with the actual class labels. In each confusion matrix, you can see the counts of true positives, true negatives, false positives, and false negatives for each class. However, the confusion matrix is too large to be fully displayed in the provided output.
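To make the matrix layout concrete, here is a toy example with hypothetical labels, unrelated to the YouTube data; rows correspond to true classes and columns to predicted classes:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a binary problem
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 2]]
```

Here the entry at row 0, column 1 counts the one instance whose true class is 0 but which was predicted as class 1 (a false positive for class 1).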
Conclusions:
The provided code demonstrates the use of Decision Trees for classification using a specific dataset. It includes hyperparameter tuning and model evaluation to optimize the Decision Tree model's performance. Here's what can be learned and predicted in the context of this code:
Supervised Learning with Decision Trees: The code exemplifies supervised learning, where the target variable 'Likes' is predicted from the remaining features. Decision Trees are a popular choice for classification tasks; note, though, that 'Likes' is a numeric column, so the classifier treats each distinct like count as a separate class, and a regression tree would be the more natural fit for a continuous target.
Hyperparameter Tuning: The code showcases the importance of hyperparameter tuning. It uses grid search to explore various combinations of hyperparameters, such as the criterion (gini or entropy), maximum depth, minimum samples for split, and minimum samples per leaf. This process can significantly impact the model's accuracy.
Model Evaluation: The code evaluates multiple Decision Tree models with different hyperparameters. It calculates accuracy and displays the confusion matrix for each model. This information is crucial for selecting the best-performing model.
Flexibility and Customization: Decision Trees offer flexibility in modeling complex relationships in data. By modifying hyperparameters and dataset features, this code can be adapted to different classification tasks.
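As a possible extension of the evaluation step, a fitted tree's feature_importances_ attribute shows which features drove the splits. This is a sketch on toy data (hypothetical values) in which only the second feature separates the classes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: feature 0 carries no signal, feature 1 perfectly separates y
X = np.array([[1, 0], [2, 0], [1, 1], [2, 1]])
y = np.array([0, 0, 1, 1])

model = DecisionTreeClassifier(random_state=123).fit(X, y)

# All impurity reduction comes from the single split on feature 1
print(model.feature_importances_)  # [0. 1.]
```

Comparing importances across the tuned models can reveal whether they all rely on the same features, which is useful context alongside accuracy and the confusion matrix.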