Overview:
Support Vector Machines (SVMs) are fundamentally linear separators: they seek the hyperplane that maximizes the margin between classes. The kernel trick extends SVMs by implicitly mapping data into a higher-dimensional space without ever computing that mapping explicitly; only dot products in the mapped space are needed. The polynomial kernel captures feature interactions through a polynomial of the dot product, while the Radial Basis Function (RBF) kernel, a Gaussian-shaped similarity measure, excels at discerning complex local patterns. For instance, a degree-2 polynomial kernel implicitly maps a 2D point into a 3D feature space, letting an SVM separate classes that are not linearly separable in the original space. This ability to navigate intricate data landscapes makes SVMs versatile classifiers.
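To make the kernel trick concrete, here is a minimal sketch (the points x and z and the feature map phi are illustrative choices, not taken from any dataset): a homogeneous degree-2 polynomial kernel evaluated on 2D points equals the dot product of their explicit 3D feature maps, so the classifier never needs to build the 3D vectors.

```python
import math

def poly_kernel(x, z, degree=2):
    # Homogeneous polynomial kernel: K(x, z) = (x . z)^degree,
    # computed entirely in the original 2D space
    return sum(a * b for a, b in zip(x, z)) ** degree

def phi(x):
    # Explicit degree-2 feature map for a 2D point:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly_kernel(x, z)                           # computed in 2D
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # computed in 3D
print(implicit, explicit)  # equal up to floating-point rounding
```

Because only K(x, z) ever appears in the SVM optimization, the same idea scales to feature spaces far too large to construct explicitly (the RBF kernel's implicit space is infinite-dimensional).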
Data Prep.
Supervised learning requires labeled data for model training and evaluation. The dataset is divided into training and testing sets: the former trains the model, and the latter assesses its performance on new, unseen data. Crucially, these sets must be disjoint to ensure unbiased evaluation; stratified and random splitting are common techniques for creating them. SVMs specifically demand labeled numeric data for effective classification, and handle non-linear relationships through kernels such as the polynomial and Radial Basis Function kernels. The dataset should therefore be numeric, labeled, and diverse. I have used the dataset youtube_data_final_1.csv because the SVM works efficiently on it.
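As a sketch of the stratified splitting mentioned above (the stratified_split helper and its parameters are hypothetical, written for illustration — in practice sklearn's train_test_split(..., stratify=y) does this), each class is shuffled and cut separately so the train/test sets are disjoint and preserve the class proportions:

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    """Return disjoint (train_indices, test_indices) that preserve
    per-class proportions - a sketch of stratified splitting."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                      # randomize within each class
        cut = int(round(len(idxs) * test_frac))
        test.extend(idxs[:cut])                # ~test_frac of each class
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)

labels = ['Low'] * 6 + ['High'] * 4
train, test = stratified_split(labels)
# Disjoint sets covering all rows, each class split ~70/30
print(len(train), len(test))
```

Stratification matters most when classes are imbalanced: a purely random split might leave a rare class underrepresented, or entirely absent, in the test set.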
Code.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
# Load your YouTube dataset
df = pd.read_csv('youtube_data_final_1.csv')
# Alternative: load the dataset from a shared Google Drive link
#df = pd.read_csv('https://drive.google.com/file/d/1Nzy4066pW8BNdFXIwFCKtn83I1w3NMGb/view?usp=drive_link')
df['Views_per_Like'] = df['Views'] / (df['Likes'] + 1) # Avoid division by zero
# Assuming 'Views_per_Like' is your target column
y = df['Views_per_Like'] # Target variable
# Convert 'Views_per_Like' to categorical if needed
y = pd.cut(y, bins=[-float('inf'), 0.5, 1.0, float('inf')], labels=['Low', 'Medium', 'High'])
# Drop non-numeric and unnecessary columns
non_numeric_columns = df.select_dtypes(exclude=['float64', 'int64']).columns
df = df.drop(columns=non_numeric_columns)
# Split the data into Features (X) and Target variable (y)
X = df.drop('Views_per_Like', axis=1)
# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into Training and Testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# SVM Modeling with hyperparameter tuning using GridSearchCV and StratifiedKFold
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'poly', 'rbf'], 'degree': [2, 3, 4]}
svm_model = SVC()
# Instantiate StratifiedKFold with shuffle for cross-validation
stratified_kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
# Run the grid search on the training data and report the best combination
grid_search = GridSearchCV(svm_model, param_grid, cv=stratified_kfold)
grid_search.fit(X_train, y_train)
print(f'Best parameters from GridSearchCV: {grid_search.best_params_}')
# Lists to store results
accuracies = []
confusion_matrices = []
# Loop through different kernels
for kernel in ['linear', 'poly', 'rbf']:
    # Loop through different costs
    for C_value in [0.1, 1, 10, 100]:
        # Create SVM model
        svm_model = SVC(kernel=kernel, C=C_value)
        # Train the model
        svm_model.fit(X_train, y_train)
        # Make predictions on the testing set
        y_pred = svm_model.predict(X_test)
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append((kernel, C_value, accuracy))
        # Create confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        confusion_matrices.append((kernel, C_value, cm))
# Visualize results
for kernel, C_value, accuracy in accuracies:
    print(f'Accuracy for {kernel} kernel with C={C_value}: {accuracy}')
for kernel, C_value, cm in confusion_matrices:
    print(f'Confusion Matrix for {kernel} kernel with C={C_value}:\n{cm}')
# Plot confusion matrices
for kernel, C_value, cm in confusion_matrices:
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {kernel} kernel with C={C_value}')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()
Explanation:
This Python code conducts SVM modeling on a YouTube dataset, incorporating hyperparameter tuning through GridSearchCV and StratifiedKFold. The process involves:
Loading and Preprocessing:
- Reads the YouTube dataset.
- Introduces a new feature, 'Views_per_Like,' representing the ratio of Views to Likes (with +1 in the denominator to avoid division by zero).
- Converts 'Views_per_Like' into a categorical variable (Low, Medium, High) based on specified bins.
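The binning step listed above can be mirrored in plain Python; this hypothetical bin_ratio helper reproduces pd.cut's right-closed intervals for the thresholds used in the code (pd.cut uses right=True by default, so the upper edge of each bin belongs to that bin):

```python
def bin_ratio(value):
    # Mirrors pd.cut(y, bins=[-inf, 0.5, 1.0, inf],
    #                labels=['Low', 'Medium', 'High'])
    # Intervals are right-closed: 0.5 -> 'Low', 1.0 -> 'Medium'
    if value <= 0.5:
        return 'Low'
    if value <= 1.0:
        return 'Medium'
    return 'High'

print([bin_ratio(v) for v in (0.2, 0.5, 0.8, 1.0, 3.7)])
# ['Low', 'Low', 'Medium', 'Medium', 'High']
```

Discretizing a continuous ratio this way is what turns the problem into a three-class classification task suitable for SVC.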
Splitting Data:
- Segregates the data into features (X) and the target variable (y).
- Utilizes StandardScaler for feature scaling.
- Divides the data into training and testing sets.
SVM Modeling:
- Defines a parameter grid for hyperparameter tuning, encompassing various kernels ('linear', 'poly', 'rbf'), C values, and polynomial degrees.
- Implements StratifiedKFold for cross-validation with shuffling.
Model Training and Evaluation:
- Iterates through diverse kernels and C values, creating SVM models and training them on the training set.
- Assesses each model's performance on the testing set, computing accuracy and generating confusion matrices.
Results Visualization:
- Displays accuracy for each kernel and C value combination.
- Exhibits confusion matrices for each combination.
- Generates plots of confusion matrices for visual analysis.
This code provides an extensive evaluation of SVM performance under different kernels and costs, offering valuable insights into the YouTube dataset.
Results.
Output Explanation:
Accuracy Scores:
For each combination of kernel type and C value, the accuracy of the SVM model is printed. This indicates how well the model is performing on the test set.
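Accuracy here is simply the fraction of test examples whose predicted label matches the true label; a minimal sketch (this accuracy helper is illustrative — the code itself uses sklearn's accuracy_score):

```python
def accuracy(y_true, y_pred):
    # Fraction of examples where the predicted label matches the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(['Low', 'High', 'Medium', 'High'],
               ['Low', 'High', 'Low', 'High']))  # 0.75
```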
Confusion Matrices:
For each combination, the confusion matrix is printed. Since this is a three-class problem ('Low', 'Medium', 'High'), the matrix is 3x3: each row corresponds to a true class and each column to a predicted class, so the diagonal holds correct predictions and the off-diagonal cells hold misclassifications.
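A small sketch of how such a matrix is tallied (the helper and sample labels are illustrative, not sklearn's implementation):

```python
def confusion_matrix_3class(y_true, y_pred, labels=('Low', 'Medium', 'High')):
    # Rows index the true class, columns the predicted class
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm

y_true = ['Low', 'Low', 'Medium', 'High', 'High']
y_pred = ['Low', 'Medium', 'Medium', 'High', 'Low']
cm = confusion_matrix_3class(y_true, y_pred)
# cm[0] shows how the 'Low' examples were predicted, and so on
print(cm)  # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```

Summing the diagonal and dividing by the total count recovers the accuracy, which is why the two metrics always agree.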
Visualization - Confusion Matrices:
Heatmaps of confusion matrices are displayed. Each heatmap corresponds to a specific combination of kernel type and C value, providing a visual representation of the model's performance.
Conclusion:
The SVM modeling conducted on the YouTube dataset uncovered key insights:
Kernel Influence: The choice of kernel played a pivotal role in shaping model performance. Each kernel—linear, polynomial, and radial basis function—yielded distinct accuracy levels and confusion matrices.
Hyperparameter Optimization: The impact of the regularization parameter (C) and polynomial degree was thoroughly examined. Precise tuning of these parameters through cross-validation resulted in enhanced model performance.
Dataset Dynamics: Analysis of the YouTube dataset, considering features like 'Likes,' 'Views,' and 'Comments,' offered valuable insights into how these factors contribute to predicting the target variable, 'Views_per_Like.'
Feature Engineering Significance: The introduction of the 'Views_per_Like' feature underscored the potential importance of engineered features in refining model understanding and predictive capabilities.
Visualization Insights: Examination of confusion matrices and visualizations facilitated the interpretation of model behavior, identification of misclassifications, and informed the selection of the most effective kernel and parameters.