Overview:
Support Vector Machines (SVMs) are fundamentally linear separators: they seek the hyperplane that maximizes the margin between classes. The kernel trick extends SVMs by implicitly mapping data into a higher-dimensional space without ever computing that mapping explicitly; only dot products in the mapped space are needed. The polynomial kernel captures feature interactions through a polynomial of the dot product, while the Radial Basis Function (RBF) kernel, a Gaussian-shaped similarity measure, excels at discerning complex local patterns. For instance, a degree-2 polynomial kernel implicitly maps a 2D point into a 3D feature space, letting an SVM separate classes that are not linearly separable in the original space. This ability to navigate intricate data landscapes makes SVMs versatile classifiers.
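To make the kernel trick concrete, here is a minimal sketch (the points x and z and the feature map phi are illustrative choices, not taken from any dataset): a homogeneous degree-2 polynomial kernel evaluated on 2D points equals the dot product of their explicit 3D feature maps, so the classifier never needs to build the 3D vectors.

```python
import math

def poly_kernel(x, z, degree=2):
    # Homogeneous polynomial kernel: K(x, z) = (x . z)^degree,
    # computed entirely in the original 2D space
    return sum(a * b for a, b in zip(x, z)) ** degree

def phi(x):
    # Explicit degree-2 feature map for a 2D point:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly_kernel(x, z)                           # computed in 2D
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # computed in 3D
print(implicit, explicit)  # equal up to floating-point rounding
```

Because only K(x, z) ever appears in the SVM optimization, the same idea scales to feature spaces far too large to construct explicitly (the RBF kernel's implicit space is infinite-dimensional).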
Data Prep.
Supervised learning requires labeled data for model training and evaluation. The dataset is divided into training and testing sets: the former trains the model, and the latter assesses its performance on new, unseen data. Crucially, these sets must be disjoint to ensure unbiased evaluation; stratified and random splitting are common techniques for creating them. SVMs specifically demand labeled numeric data for effective classification, and handle non-linear relationships through kernels such as the polynomial and Radial Basis Function kernels. The dataset should therefore be numeric, labeled, and diverse. I have used the dataset youtube_data_final_1.csv because the SVM works efficiently on it.
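As a sketch of the stratified splitting mentioned above (the stratified_split helper and its parameters are hypothetical, written for illustration — in practice sklearn's train_test_split(..., stratify=y) does this), each class is shuffled and cut separately so the train/test sets are disjoint and preserve the class proportions:

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    """Return disjoint (train_indices, test_indices) that preserve
    per-class proportions - a sketch of stratified splitting."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # randomize within each class
        cut = int(round(len(idxs) * test_frac))
        test.extend(idxs[:cut])                # ~test_frac of each class
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)

labels = ['Low'] * 6 + ['High'] * 4
train, test = stratified_split(labels)
# Disjoint sets covering all rows, each class split ~70/30
print(len(train), len(test))
```

Stratification matters most when classes are imbalanced: a purely random split might leave a rare class underrepresented, or entirely absent, in the test set.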
Code.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
# Load your YouTube dataset
df = pd.read_csv('youtube_data_final_1.csv')
# Alternative: load the dataset from a shared Google Drive link
#df = pd.read_csv('https://drive.google.com/file/d/1Nzy4066pW8BNdFXIwFCKtn83I1w3NMGb/view?usp=drive_link')
df['Views_per_Like'] = df['Views'] / (df['Likes'] + 1) # Avoid division by zero
# Assuming 'Views_per_Like' is your target column
y = df['Views_per_Like'] # Target variable
# Convert 'Views_per_Like' to categorical if needed
y = pd.cut(y, bins=[-float('inf'), 0.5, 1.0, float('inf')], labels=['Low', 'Medium', 'High'])
# Drop non-numeric and unnecessary columns
non_numeric_columns = df.select_dtypes(exclude=['float64', 'int64']).columns
df = df.drop(columns=non_numeric_columns)
# Split the data into Features (X) and Target variable (y)
X = df.drop('Views_per_Like', axis=1)
# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into Training and Testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# SVM Modeling with hyperparameter tuning using GridSearchCV and StratifiedKFold
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'poly', 'rbf'], 'degree': [2, 3, 4]}
svm_model = SVC()
# Instantiate StratifiedKFold with shuffle for cross-validation
stratified_kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
# Run the grid search on the training data and report the best combination
grid_search = GridSearchCV(svm_model, param_grid, cv=stratified_kfold)
grid_search.fit(X_train, y_train)
print(f'Best parameters from GridSearchCV: {grid_search.best_params_}')
# Lists to store results
accuracies = []
confusion_matrices = []
# Loop through different kernels
for kernel in ['linear', 'poly', 'rbf']:
    # Loop through different costs
    for C_value in [0.1, 1, 10, 100]:
        # Create SVM model
        svm_model = SVC(kernel=kernel, C=C_value)
        # Train the model
        svm_model.fit(X_train, y_train)
        # Make predictions on the testing set
        y_pred = svm_model.predict(X_test)
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append((kernel, C_value, accuracy))
        # Create confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        confusion_matrices.append((kernel, C_value, cm))
# Visualize results
for kernel, C_value, accuracy in accuracies:
    print(f'Accuracy for {kernel} kernel with C={C_value}: {accuracy}')
for kernel, C_value, cm in confusion_matrices:
    print(f'Confusion Matrix for {kernel} kernel with C={C_value}:\n{cm}')
# Plot confusion matrices
for kernel, C_value, cm in confusion_matrices:
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {kernel} kernel with C={C_value}')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
Explanation:
This Python code conducts SVM modeling on a YouTube dataset, incorporating hyperparameter tuning through GridSearchCV and StratifiedKFold. The process involves:
Loading and Preprocessing:
- Reads the YouTube dataset.
- Introduces a new feature, 'Views_per_Like,' representing the ratio of Views to Likes (with +1 in the denominator to avoid division by zero).
- Converts 'Views_per_Like' into a categorical variable (Low, Medium, High) based on specified bins.
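The binning step listed above can be mirrored in plain Python; this hypothetical bin_ratio helper reproduces pd.cut's right-closed intervals for the thresholds used in the code (pd.cut uses right=True by default, so the upper edge of each bin belongs to that bin):

```python
def bin_ratio(value):
    # Mirrors pd.cut(y, bins=[-inf, 0.5, 1.0, inf],
    #                labels=['Low', 'Medium', 'High'])
    # Intervals are right-closed: 0.5 -> 'Low', 1.0 -> 'Medium'
    if value <= 0.5:
        return 'Low'
    if value <= 1.0:
        return 'Medium'
    return 'High'

print([bin_ratio(v) for v in (0.2, 0.5, 0.8, 1.0, 3.7)])
# ['Low', 'Low', 'Medium', 'Medium', 'High']
```

Discretizing a continuous ratio this way is what turns the problem into a three-class classification task suitable for SVC.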
Splitting Data:
- Segregates the data into features (X) and the target variable (y).
- Utilizes StandardScaler for feature scaling.
- Divides the data into training and testing sets.
SVM Modeling:
- Defines a parameter grid for hyperparameter tuning, encompassing various kernels ('linear', 'poly', 'rbf'), C values, and polynomial degrees.
- Implements StratifiedKFold for cross-validation with shuffling.
Model Training and Evaluation:
- Iterates through diverse kernels and C values, creating SVM models and training them on the training set.
- Assesses each model's performance on the testing set, computing accuracy and generating confusion matrices.
Results Visualization:
- Displays accuracy for each kernel and C value combination.
- Exhibits confusion matrices for each combination.
- Generates plots of confusion matrices for visual analysis.
This code provides an extensive evaluation of SVM performance under different kernels and costs, offering valuable insights into the YouTube dataset.
Results.
Output Explanation:
Accuracy Scores:
For each combination of kernel type and C value, the accuracy of the SVM model is printed. This indicates how well the model is performing on the test set.
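Accuracy here is simply the fraction of test examples whose predicted label matches the true label; a minimal sketch (this accuracy helper is illustrative — the code itself uses sklearn's accuracy_score):

```python
def accuracy(y_true, y_pred):
    # Fraction of examples where the predicted label matches the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(['Low', 'High', 'Medium', 'High'],
               ['Low', 'High', 'Low', 'High']))  # 0.75
```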
Confusion Matrices:
For each combination, the confusion matrix is printed. Since this is a three-class problem ('Low', 'Medium', 'High'), the matrix is 3x3: each row corresponds to a true class and each column to a predicted class, so the diagonal holds correct predictions and the off-diagonal cells hold misclassifications.
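A small sketch of how such a matrix is tallied (the helper and sample labels are illustrative, not sklearn's implementation):

```python
def confusion_matrix_3class(y_true, y_pred, labels=('Low', 'Medium', 'High')):
    # Rows index the true class, columns the predicted class
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm

y_true = ['Low', 'Low', 'Medium', 'High', 'High']
y_pred = ['Low', 'Medium', 'Medium', 'High', 'Low']
cm = confusion_matrix_3class(y_true, y_pred)
# cm[0] shows how the 'Low' examples were predicted, and so on
print(cm)  # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```

Summing the diagonal and dividing by the total count recovers the accuracy, which is why the two metrics always agree.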
Visualization - Confusion Matrices:
Heatmaps of confusion matrices are displayed. Each heatmap corresponds to a specific combination of kernel type and C value, providing a visual representation of the model's performance.
Conclusion:
The SVM modeling conducted on the YouTube dataset uncovered key insights:
Kernel Influence: The choice of kernel played a pivotal role in shaping model performance. Each kernel—linear, polynomial, and radial basis function—yielded distinct accuracy levels and confusion matrices.
Hyperparameter Optimization: The impact of the regularization parameter (C) and polynomial degree was thoroughly examined. Precise tuning of these parameters through cross-validation resulted in enhanced model performance.
Dataset Dynamics: Analysis of the YouTube dataset, considering features like 'Likes,' 'Views,' and 'Comments,' offered valuable insights into how these factors contribute to predicting the target variable, 'Views_per_Like.'
Feature Engineering Significance: The introduction of the 'Views_per_Like' feature underscored the potential importance of engineered features in refining model understanding and predictive capabilities.
Visualization Insights: Examination of confusion matrices and visualizations facilitated the interpretation of model behavior, identification of misclassifications, and informed the selection of the most effective kernel and parameters.