Identified the demographic variables most associated with default payment and determined which model best predicts the probability of credit cardholders' default.
Used Excel and SQL to clean the data and 5 machine learning methods (Logistic Regression, Linear Discriminant Analysis, K-NN Classifier, Classification Tree, and Support Vector Classifier) to build models; judged model performance by classification rate and ROC plot, and plotted a correlation heat map with ggplot2 to show the correlation coefficients among 24 variables.
CONCLUSION: accuracy rank is Classification Tree > LDA > Support Vector Classifier > KNN > Logistic Regression. Older cardholders are more inclined to default; repayment status in September 2005 is the most important variable; the Classification Tree is the best model, with a classification accuracy of 81.54%.
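A minimal sketch of the model-comparison step in Python with scikit-learn (the original analysis may have used R, given the ggplot2 mention; the data here is synthetic, standing in for the credit-card dataset, and all parameter choices are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned credit-card data (binary default label).
X, y = make_classification(n_samples=1000, n_features=23, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The five model families compared in the project.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Classification Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVC": SVC(),
}

# Classification rate (accuracy) on the held-out split for each model.
accuracy = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
            for name, m in models.items()}
```

Ranking the `accuracy` dict then reproduces the kind of comparison summarized in the conclusion above.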
Coded a model that returns the optimal bandwidth and the corresponding mean squared error under different kernels.
The kernel for the Asymptotically Optimal Bandwidth and for Bandwidth Selection by Cross-Validation is the Epanechnikov kernel; the kernel for the Plug-in Bandwidth is a normal density kernel.
Fed randomly generated Y and X into the self-coded cross-validation model and found that Bandwidth Selection by Cross-Validation yields the lowest MSE, 0.0004067, with a corresponding bandwidth of 0.0821.
CONCLUSION: Bandwidth Selection by Cross-Validation works best among the 3 optimal bandwidths; the Asymptotically Optimal Bandwidth ranks second, and the Plug-in Bandwidth performs worst.
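A minimal sketch of the cross-validation bandwidth selector, assuming a Nadaraya-Watson regression estimator with the Epanechnikov kernel (the specific estimator, data-generating process, and grid below are illustrative assumptions, not the project's actual setup):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75*(1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def nw_estimate(x0, x, y, h):
    """Nadaraya-Watson estimate of E[Y|X=x0] with bandwidth h."""
    w = epanechnikov((x0 - x) / h)
    s = w.sum()
    return np.dot(w, y) / s if s > 0 else np.nan

def loocv_mse(x, y, h):
    """Leave-one-out cross-validation MSE for bandwidth h."""
    errs = []
    for i in range(len(x)):
        pred = nw_estimate(x[i], np.delete(x, i), np.delete(y, i), h)
        if not np.isnan(pred):
            errs.append((y[i] - pred) ** 2)
    return float(np.mean(errs))

def select_bandwidth(x, y, grid):
    """Return the grid bandwidth minimizing LOOCV MSE, and that MSE."""
    scores = [loocv_mse(x, y, h) for h in grid]
    i = int(np.argmin(scores))
    return grid[i], scores[i]

# Randomly generated X and Y, as in the experiment described above.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)

grid = np.linspace(0.02, 0.3, 15)
h_best, mse_best = select_bandwidth(x, y, grid)
```

The same loop structure applies to density estimation; only the estimator and loss inside `loocv_mse` change.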
Implemented different statistical models to recognize handwritten digits.
The dataset is MNIST, consisting of 60,000 training images and 10,000 test images.
Coded Bayesian Decision Rule, KNN, Logistic Regression, K-Means Clustering, and PCA from scratch; called SVM, Random Forest, LDA, and VGG-16 from packages.
CONCLUSION: The accuracy rank is VGG-16 (99.99%) > KNN > Random Forest > SVM > PCA+SVM > Logistic Regression > PCA+KNN > LDA > PCA+BDR (84.09%) > K-Means Clustering (62.33%).
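A minimal sketch of the from-scratch KNN classifier on handwritten digits (using scikit-learn's small 8x8 `load_digits` set as a stand-in for MNIST; `k=3` and the split are illustrative assumptions, not the project's actual settings):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

def knn_predict(X_train, y_train, X_test, k=3):
    """Label each test point by majority vote of its k nearest training
    points under Euclidean distance, implemented from scratch."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Small stand-in for MNIST: 8x8 grayscale digit images.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

acc = (knn_predict(X_tr, y_tr, X_te, k=3) == y_te).mean()
```

The same interface lets the from-scratch models be scored side by side with the packaged ones, which is how the accuracy ranking above is produced.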