Srimedha - SVM Data Prep

Data Preparation for Support Vector Machine (SVM) Algorithm:

Data preparation is a critical step before applying the Support Vector Machine (SVM) algorithm. SVM is a supervised learning method, so it requires labeled data in a specific format. Labeled data means each data point has a corresponding target or output value. For SVM, this means each data instance carries a defined class or category that the model will learn to predict.

Creating Training and Testing Sets:

Before training the SVM model, it's essential to split the data into two disjoint sets: a Training Set and a Testing Set. The Training Set is used to train or build the SVM model, while the Testing Set is used to evaluate the model's accuracy and performance.

Training Set: This set comprises a subset of the data (80% of the total data in this case) randomly selected for model training. It includes labeled instances used by the SVM algorithm to learn the patterns and relationships between features and labels.

Testing Set: The Testing Set is a separate subset of the data (20% of the total data in this case) that the model has not seen during training. It is used to assess the model's generalization and predictive accuracy on new, unseen data.

Disjointness of Training and Testing Sets: It's crucial for the Training and Testing Sets to be disjoint, meaning they share no data instances. This ensures that the model is evaluated on data it did not learn from, which gives a more realistic measure of its performance on unseen data. Disjointness does not prevent overfitting by itself, but it makes overfitting detectable: an overfit model performs very well on the training data and poorly on new data, and that gap is only visible when the test data is truly separate. If the sets overlap, the measured test accuracy is inflated and no longer reflects how well the model generalizes.
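As a minimal sketch of what disjointness means in practice, the snippet below shuffles row indices, cuts them 80/20, and checks that no index lands in both sets. The helper name `split_indices`, the sample count of 100, and the seed are illustrative assumptions, not part of the project code.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle row indices and cut them into two disjoint groups."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)      # each index appears exactly once
    n_test = int(round(n_samples * test_fraction))
    test_idx = shuffled[:n_test]               # first slice becomes the testing set
    train_idx = shuffled[n_test:]              # the remainder becomes the training set
    return train_idx, test_idx

train_idx, test_idx = split_indices(100)

# An empty intersection confirms that no instance is in both sets.
overlap = np.intersect1d(train_idx, test_idx)
print(len(train_idx), len(test_idx), len(overlap))  # 80 20 0
```

Because the two sets are slices of a single permutation, they cannot share an index, and together they cover every row.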

Numeric Labeled Data for SVM:

SVMs can only work with labeled numeric data: every feature must be a number, and the class labels are represented as numerical values as well. This is because SVM algorithms are designed to find optimal hyperplanes (decision boundaries) that separate different classes in the feature space. Numeric features allow the SVM to compute distances and dot products between data points and to determine the separating hyperplane with the largest margin.
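The sketch below shows one way raw data could be turned into the numeric form described here. It assumes the raw data is text with string class names, which this section does not confirm; the four toy sentences and their labels are invented for illustration. It uses scikit-learn's `CountVectorizer` for the features and `LabelEncoder` for the labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data, invented for illustration only.
texts = ["good movie", "bad movie", "good plot", "bad acting"]
labels = ["positive", "negative", "positive", "negative"]

# Turn each text into a row of word counts (numeric features).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Map each class name to an integer (numeric labels).
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

print(X.shape)                 # (4, 5): four texts, five distinct words
print(list(encoder.classes_))  # ['negative', 'positive']
print(y.tolist())              # [1, 0, 1, 0]
```

`LabelEncoder` assigns integers in sorted order of the class names, so `encoder.inverse_transform` can map predictions back to the original labels.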

By adhering to these data preparation steps and ensuring the availability of labeled numeric data, we can effectively train and evaluate the SVM model for accurate classification tasks.

The Dataset Before Preparation:

The Dataset After Preparation:

The Dataset After Vectorization:

Train-Test Split of the Data:

The `train_test_split` function from scikit-learn is used in this project to create the train-test split. It randomly divides the dataset into training and testing subsets in an 80 to 20 ratio, so the model is trained on 80% of the data (the training set) and evaluated on the remaining 20% (the testing set).
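For illustration, an 80/20 split with `train_test_split` might look like the sketch below. The placeholder arrays, the `random_state` value, and the use of `stratify` are assumptions made for this example; the project's actual arguments are not shown in this section.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with 2 features each and two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # 20% of the rows go to the testing set
    random_state=42,   # assumed seed, only to make the split reproducible
    stratify=y,        # assumed option: keep class proportions equal in both sets
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```

Each row ends up in exactly one of the two subsets, so the split is disjoint by construction.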

To properly assess the model's generalization ability, the split must be disjoint, meaning the training and testing sets do not overlap. If the same data were used for both training and testing, the model could simply memorize the training data and score well on it without learning the underlying patterns. That overfitting would go unnoticed, and the model would underperform on fresh, unseen data.

With a disjoint split, the testing set serves as a stand-in for fresh, unseen data, which lets us evaluate how well the model generalizes to real situations. It helps detect problems like overfitting and gives a more trustworthy estimate of the model's performance on unseen data, which in turn indicates whether the model will be useful in real-world applications.
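One common way to look for overfitting is to compare accuracy on the training set with accuracy on the held-out testing set. The sketch below does this on synthetic data from `make_classification`, not on the project dataset; the kernel, `C` value, and seeds are assumed purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the project's real dataset is not used here.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed hyperparameters, chosen only for this example.
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data
gap = train_acc - test_acc

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {gap:.3f}")
```

A training accuracy far above the testing accuracy suggests the model has memorized the training data. Similar scores on both sets suggest it generalizes.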

The code for all of the above can be found here.

Results and Conclusion