Support Vector Machines (SVMs) are supervised machine learning models primarily used for classification tasks. At their core, SVMs are linear separators, but they can be extended to perform nonlinear classification using kernels.
They attempt to find the optimal hyperplane that separates data points of different classes with the maximum margin. The "margin" is the distance between the hyperplane and the nearest data points of each class, which are known as support vectors. The goal is to separate the two classes with a line, plane, or hyperplane, which works well only if the data is linearly separable.
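As a quick reference, in the standard formulation a separating hyperplane is defined by a weight vector $w$ and bias $b$, and training maximizes the margin:

$$
w \cdot x + b = 0, \qquad \text{margin} = \frac{2}{\lVert w \rVert},
$$

subject to every training point $(x_i, y_i)$ lying on the correct side, $y_i (w \cdot x_i + b) \ge 1$. The points that meet this constraint with equality are the support vectors.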
As mentioned above, kernels let SVMs go beyond this limitation. The kernel trick allows SVMs to perform nonlinear classification by implicitly mapping data into a higher-dimensional space where it becomes linearly separable. Importantly, this is done without explicitly computing the transformation, which saves computational resources.
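For example, the radial basis function (RBF) kernel used later in this analysis measures the similarity of two points as

$$
K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right),
$$

where $\gamma$ controls how far the influence of a single training point reaches; the corresponding higher-dimensional feature space is never constructed explicitly.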
Our target variable is "patent_kind", as always. The feature variables are the numeric columns, with one exception: "patent_year". It is numeric, but it is really a date, and treating it as a plain number could hurt the model. The remaining columns describe coordinates and counts, which work well with an SVM.
The selected features are as follows.
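A minimal sketch of this selection step, assuming the data lives in a pandas DataFrame loaded from a file (the path and the DataFrame name are placeholders, not from the original analysis):

```python
import pandas as pd

# Placeholder path; the original data source is not specified.
patents = pd.read_csv("patents.csv")

# Keep numeric columns only, then drop patent_year (date-like)
# and the target in case it is encoded numerically.
numeric_cols = patents.select_dtypes(include="number").columns
feature_cols = [c for c in numeric_cols
                if c not in ("patent_year", "patent_kind")]

X = patents[feature_cols]
y = patents["patent_kind"]
```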
Next, we split the data into training and test sets. The two sets must be disjoint; this prevents the model from simply memorizing data it has already seen instead of learning to predict.
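A sketch of the split using scikit-learn; the test fraction, the stratification, and the random seed are assumptions, since the original split settings are not stated:

```python
from sklearn.model_selection import train_test_split

# Disjoint train/test split. stratify=y keeps the B1/B2 ratio the
# same in both sets; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```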
We run the SVM with three kernels: linear, polynomial, and radial basis function (RBF), increasing the cost parameter by a factor of 10 starting from 0.1 (i.e., 0.1, 1, and 10).
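One way to run this sweep, sketched with scikit-learn's SVC (where the cost parameter is named `C`); the feature-scaling step is an added assumption, since SVMs are sensitive to feature scale and the original preprocessing is not stated:

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Sweep three kernels and three cost values (0.1, 1, 10).
for kernel in ["linear", "poly", "rbf"]:
    for C in [0.1, 1, 10]:
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=C))
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(f"kernel={kernel}, C={C}, "
              f"accuracy={accuracy_score(y_test, pred):.4f}")
        print(classification_report(y_test, pred))
```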
Looking at the results, the polynomial and RBF kernels failed to identify B1-category patents. One possible reason is class imbalance: B2 has 5,586 samples while B1 has only 1,284. Another is that the data is not very complex and is mostly linearly separable.
The highest accuracy was 82.46%, achieved with the linear kernel at all three cost values (0.1, 1, and 10). All kernels did well at predicting B2-type patents.
Unlike the other analyses, the imbalanced dataset was a major problem here. SVMs aim to find a decision boundary that best separates the classes, and when the classes are imbalanced, the model can focus mostly on the majority class. In future work, resampling could help make this analysis more useful.
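As an illustration of one such remedy (not part of the original experiment), scikit-learn can also reweight the classes instead of resampling, so that errors on the rare B1 class are penalized more heavily:

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# class_weight="balanced" weights each class inversely to its
# frequency, which counteracts the B2 majority during training.
weighted = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=1, class_weight="balanced"),
)
weighted.fit(X_train, y_train)
```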