SVM
Overview
SVMs are, in essence, linear classifiers: they search for the hyperplane that separates the classes with the largest possible margin in the feature space. As shown in the image, the SVM constructs a maximum-margin hyperplane that lies between the two classes, the green circles and the blue diamonds. The decision boundary (solid line) is straight and maximizes the margin to the closest data points, which are called support vectors (marked in the figure). The two dashed lines, the positive and negative hyperplanes, run parallel to the decision boundary, and the support vectors lie exactly on them. Even when the data is complicated, an SVM still seeks a linear separator, either in the original feature space or after mapping the data into a higher-dimensional space with kernels.
When the data is not linearly separable in its original input space, Support Vector Machines (SVMs) rely on the kernel trick. In the picture, the points in the 2D input space are scattered so that no straight line can separate them. A feature transformation ϕ maps the data into a higher-dimensional feature space in which the classes can be separated by a hyperplane. This makes SVMs effective on non-linearly separable data while the classifier remains a linear separator in the transformed space. The key mathematical property exploited here is the dot product of the transformed vectors: instead of computing ϕ(x) and ϕ(x′) explicitly, a kernel computes K(x, x′) = ⟨ϕ(x), ϕ(x′)⟩ directly. The computation never projects the data into the high-dimensional feature space, yet it retains all of its benefits.
Below is an example of how two 2D vectors, a = (a1, a2) and b = (b1, b2), can be transformed into 6D:
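The original figure with this example is not reproduced here, but a standard mapping of exactly this shape comes from the degree-2 polynomial kernel K(a, b) = (a·b + 1)², whose explicit feature map sends a 2D vector into 6D. The short NumPy sketch below (an illustration, not the original code) verifies numerically that the kernel value equals the dot product of the explicitly transformed vectors:

```python
import numpy as np

def phi(x):
    """Explicit 2D -> 6D feature map for the degree-2 polynomial kernel (x.y + 1)^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(a, b):
    """Kernel trick: the same value as phi(a).phi(b), computed entirely in 2D."""
    return (np.dot(a, b) + 1) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(np.dot(phi(a), phi(b)))   # explicit 6D dot product -> 144.0
print(poly_kernel(a, b))        # kernel value computed in 2D -> 144.0
```

The two printed values are identical, which is exactly what the kernel trick promises: the 6D dot product is obtained without ever constructing the 6D vectors.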
Data Prep and Code
The data set initially contains 19 columns, but for SVM modeling seven key numerical features were selected: DURATION, RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS. Each describes a quantitative attribute of a film, such as its runtime, viewer rating, budget, and award nominations.
The target attribute is MOVIE_CATEGORY, which groups movies into success levels (e.g., Successful, Mediocre, Failure). The numerical features are already present as raw values in the data set, while MOVIE_CATEGORY is a string and therefore has to be encoded before it can be used as the classification target.
After feature selection and encoding of the target variable, the data was divided into training and test sets with an 80/20 stratified split so that both sets keep the same balance of movie categories. Because SVMs are sensitive to feature scale, the feature variables were standardized so that every feature has mean 0 and standard deviation 1. After this preparation, the training set was used to fit the SVM models, and the test set was held out to evaluate their performance on unseen data, giving a fair, unbiased estimate.
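As a rough illustration of these steps, a minimal scikit-learn sketch could look like the following; the file name movies.csv and the DataFrame handling are assumptions, only the column names come from the description above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("movies.csv")  # hypothetical file name

# The seven numerical features named above, plus the string target
features = ["DURATION", "RATING", "VOTES", "BUDGET",
            "GROSSWORLDWIDE", "NOMINATIONS", "OSCARS"]
X = df[features]
y = LabelEncoder().fit_transform(df["MOVIE_CATEGORY"])  # encode string labels as integers

# 80/20 stratified split keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Standardize features to mean 0, std 1: SVMs are sensitive to feature scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)   # use training statistics only
```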
Results and Conclusion
Three kernel types were evaluated: Linear, RBF (Radial Basis Function), and Polynomial, each with several cost (C) values. The linear kernel's accuracy stayed constant at 83% regardless of the cost value (0.01, 0.1, or 1). The RBF kernel improved as the cost increased, reaching its best accuracy of 85.80% at C = 10, which makes it the overall top-performing SVM configuration. Polynomial models were less accurate than the linear and RBF kernels, with accuracies ranging from 74.88% to 82.90% depending on degree and cost. The table clearly shows that the RBF kernel with C = 10 achieved the highest accuracy, indicating that a non-linear boundary with a high misclassification penalty (i.e., weak regularization) fits this movie classification data best.
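A sweep like the one summarized in the table could be produced with a loop such as the sketch below, reusing X_train, X_test, y_train, and y_test from the preparation sketch above; the exact grids of C values and degrees are inferred from the text and may differ from the original runs:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Kernel / cost combinations to try (degrees for the polynomial kernel are assumed)
configs = (
    [("linear", dict(C=c)) for c in (0.01, 0.1, 1)] +
    [("rbf", dict(C=c)) for c in (0.01, 1, 10)] +
    [("poly", dict(C=c, degree=d)) for c, d in ((0.01, 2), (1, 3), (10, 4))]
)

for kernel, params in configs:
    model = SVC(kernel=kernel, **params).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{kernel:7s} {params} -> accuracy {acc:.4f}")
```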
Linear SVM Models
The classification reports list precision, recall, and F1-score for each class ("Failure," "Mediocre," and "Successful") across several linear SVM models with different cost (C) values. Precision and recall for the "Failure" class are very high (around 0.86–0.87 precision and 0.97 recall) in all three models, showing that the models identify failures reliably. The "Successful" class also performs reasonably well, with precision around 0.80 and recall around 0.74–0.75. The "Mediocre" class, however, performs poorly in all three models, with recall of only about 0.06–0.08 and correspondingly tiny F1-scores; in other words, the models misclassify the large majority of mediocre films. All three linear SVMs reach roughly the same overall accuracy of about 84%, indicating consistent overall performance that is nonetheless dominated by the majority classes.
The confusion matrices give a more visual picture of these numbers. In all the models, most "Failure" and "Successful" instances are classified correctly, as shown by the large diagonal values, but "Mediocre" movies are frequently misclassified as either "Failure" or "Successful," which explains the low recall for that class. Across the three cost settings the differences are modest; still, lowering the cost to C=0.01 slightly reduces predictive accuracy and further hurts the already weak "Mediocre" label. Overall, the linear SVM is adequate for the "Failure" and "Successful" categories but consistently struggles with "Mediocre," illustrating the difficulty of handling imbalanced and overlapping categories in this data set.
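The per-class reports and confusion matrices discussed here are standard scikit-learn outputs; a sketch for one of the linear models (reusing the prepared train/test split and assuming the encoded classes sort alphabetically as Failure, Mediocre, Successful) might look like this:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Fit one of the linear models (C value shown only as an example)
linear_svm = SVC(kernel="linear", C=0.01).fit(X_train, y_train)
y_pred = linear_svm.predict(X_test)

# Per-class precision, recall, and F1-score
print(classification_report(
    y_test, y_pred, target_names=["Failure", "Mediocre", "Successful"]))

# Rows = true classes, columns = predicted classes
print(confusion_matrix(y_test, y_pred))
```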
RBF SVM Models
The classification reports show the performance of RBF SVM models trained with three values of the regularization parameter C (0.01, 1, and 10). As C increases, overall accuracy improves from 81% (C=0.01) to 85% (C=10). In particular, recall for the "Mediocre" class rises sharply from 52% to 78%, meaning the model becomes much better at detecting this minority class. Precision and F1-scores for the "Failure" and "Successful" classes remain strong in all the models, but the "Mediocre" class benefits most from tuning C. The increases in the macro-average and weighted-average F1-scores at C=10 confirm the more balanced and improved classification performance.
The corresponding confusion matrices confirm these findings. At C=0.01 there is considerable class confusion, with the majority of "Mediocre" and "Successful" instances classified as "Failure." When C is increased to 1 and then 10, classification of the "Mediocre" and "Successful" classes improves greatly and fewer misclassifications occur overall. The C=10 model gives the cleanest separation and is therefore the best of the three RBF models, improving precision and recall across all classes with a relatively balanced support.
Polynomial SVM Models
The classification reports show that as the complexity of the polynomial kernel changes (through different C values and degrees), the models' ability to recognize the "Mediocre" class varies considerably. With Degree=4 and C=10, the model reached an overall accuracy of 83%, but precision and recall for the "Mediocre" class were low compared with the "Failure" and "Successful" classes. At C=1 and Degree=3, identification of the weak class becomes even worse (very low recall and F1-score), suggesting the model is heavily biased towards the majority classes ("Failure" and "Successful"). Finally, at C=0.01 and Degree=2, the model fails almost completely to identify the "Mediocre" class (0 precision, 0 recall), severely dragging down the macro-averaged scores and indicating severe underfitting on the minority class.
The confusion matrices confirm this decline visually. With the most complex setting (Degree=4, C=10), a handful of "Mediocre" instances are classified correctly, though most are still confused with "Failure" or "Successful." As C and the degree decrease, the misclassification worsens, and at C=0.01 the model assigns virtually everything to the "Failure" and "Successful" classes and ignores "Mediocre" entirely. This illustrates how poorly polynomial kernels cope with imbalanced class distributions unless they are carefully tuned.
Best Model: RBF SVM (C=10) Decision Boundary
The RBF SVM with C=10 produces the best decision boundary of all the models tested. The non-linear regions wrap closely around the points, indicating a better fit to the data patterns. Class separation is cleaner, with less overlap, especially between the "Failure" and "Successful" groups. The high value of C lets the model prioritize correct classification over a wide margin, resulting in fewer misclassifications while still generalizing well to new data.
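A plot like this one can be drawn by projecting the standardized features onto two principal components and refitting the model on that projection; the sketch below is an assumption about how such a figure could be produced, not the original plotting code, and it reuses X_train and y_train from the preparation sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Project the scaled features to 2D so the decision regions can be visualized
pca = PCA(n_components=2)
X2_train = pca.fit_transform(X_train)

# Refit the best configuration (RBF, C=10) on the 2D projection
svm_2d = SVC(kernel="rbf", C=10).fit(X2_train, y_train)

# Evaluate the classifier on a dense grid covering the projected data
x_min, x_max = X2_train[:, 0].min() - 1, X2_train[:, 0].max() + 1
y_min, y_max = X2_train[:, 1].min() - 1, X2_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
zz = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions and overlay the training points
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X2_train[:, 0], X2_train[:, 1], c=y_train, s=15, edgecolors="k")
plt.title("RBF SVM (C=10) decision regions in PCA space")
plt.show()
```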
Second Best Model: RBF SVM (C=1) Decision Boundary
The RBF SVM with C=1 separates the classes better than the Linear SVM, but some class overlap remains. The decision boundary curves around some of the clusters, demonstrating the non-linear flexibility of the RBF kernel, yet it still does not form tight boundaries around the minority "Mediocre" class. The model captures the overall shape of the data far better than the linear model; a somewhat larger C would probably reduce misclassifications further without overfitting.
Worst Model: Linear SVM (C=0.01) Decision Boundary
In this plot, the Linear SVM with C=0.01 produces an extremely poor decision boundary after the PCA transformation. The boundary is essentially a straight line and cannot separate the three classes (Failure, Mediocre, Successful) in any reasonable way. Most of the points overlap, particularly those in classes 1 and 2, because the model failed to establish clear margins between the classes. The very low C value imposes heavy regularization and leads to underfitting: the model is over-simplified and misses important patterns in the data.
Across all SVM models tried on this data set, the RBF kernel models outperformed both the Linear and Polynomial SVMs. The top performer was the RBF SVM with C=10, which achieved the best classification performance by flexibly learning the non-linear structure of the data, as shown by its clear decision boundaries and high precision, recall, and F1-scores across all classes. In comparison, the Linear SVM models performed poorly, especially on the "Mediocre" class, because they could not capture the non-linear relationships, and very small C values such as 0.01 made things worse through underfitting. Polynomial SVMs showed some ability to model non-linear patterns but were weaker and less consistent than the RBF models, producing substantial class confusion in most configurations.
From this analysis it is clear that the kernel choice and the regularization parameter C strongly influence SVM performance, especially on imbalanced, multi-class problems such as this one. Linear SVMs provided a simple baseline, and Polynomial kernels were fairly successful at certain degrees and costs, but the RBF kernel was the most flexible and most effective choice. The non-linear decision boundaries produced by the RBF SVMs, particularly at higher C values, reflected the intricate data distribution more accurately, improving overall accuracy and the treatment of the minority class. For this data set, therefore, a well-tuned RBF SVM is recommended for the best predictive performance.