The Models

Support Vector Machines

SVC supports only one-vs-all multiclass classification, which drastically increases runtime compared to the other models. Linear kernels performed best, followed by RBF, polynomial, and then sigmoid, with 0.1 and 0.3 being the best values for the regularization parameter.
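For illustration, a minimal sketch of this configuration in scikit-learn, assuming the regularization values refer to SVC's C parameter and substituting synthetic data for the expression matrix:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Hypothetical stand-in for the cell-by-gene matrix and cell-type labels
    X, y = make_classification(n_samples=500, n_features=50, n_informative=20,
                               n_classes=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Linear kernel with C=0.1, one of the best-performing regularization values
    clf = SVC(kernel="linear", C=0.1)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))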

Logistic Regression

Logistic regression is designed for one-vs-all classification and has resource demands similar to SVM's. Feature reduction did not improve this model's performance, nor did weighting the classes or increasing the maximum number of iterations. Standardizing the features appeared to improve the classification of clusters with smaller numbers of cells, with little to no impact on the larger cell-type clusters. With the saga solver, the L2 penalty quite significantly outperformed the L1 and elastic-net penalties.
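A minimal sketch of that best-performing setup (standardized features, L2 penalty, saga solver), again on synthetic stand-in data; the max_iter value is an illustrative assumption:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=50, n_informative=20,
                               n_classes=6, random_state=0)

    # Standardize features, then fit an L2-penalized model with the saga solver
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l2", solver="saga", max_iter=1000),
    )
    model.fit(X, y)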

Random Forest

Random forest trains an ensemble of decision trees, each built from a bootstrap sample of the data with a random subset of the features considered at each split, and aggregates their votes to make predictions. Training a one-vs-all model first on the 6 major classifications and then on the granular classifications provided the best results, because certain genes are better at classifying general cell types while others better distinguish specific ones.
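One way to sketch this two-stage scheme: a major-type classifier followed by a granular classifier per major type. The toy label mapping and forest sizes below are assumptions, not the exact pipeline used here:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-ins: 12 granular types mapped onto 6 major types
    X, y_granular = make_classification(n_samples=600, n_features=50,
                                        n_informative=20, n_classes=12,
                                        random_state=0)
    y_major = y_granular // 2

    # Stage 1: classify the major cell type
    major_clf = RandomForestClassifier(n_estimators=500, random_state=0)
    major_clf.fit(X, y_major)

    # Stage 2: one granular classifier per major type
    granular_clfs = {}
    for major in np.unique(y_major):
        mask = y_major == major
        clf = RandomForestClassifier(n_estimators=500, random_state=0)
        clf.fit(X[mask], y_granular[mask])
        granular_clfs[major] = clf

    def predict(X_new):
        majors = major_clf.predict(X_new)
        return np.array([granular_clfs[m].predict(row.reshape(1, -1))[0]
                         for m, row in zip(majors, X_new)])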

Neural Networks

The number of neurons in each layer had a negligible effect on model performance (leading us to settle arbitrarily on 100); however, keeping the network at 5-6 total layers performed best. We used only dense layers, as inserting batch normalization between layers and their activation functions also had a negligible effect.
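A minimal sketch of such a network, assuming Keras; the layer count and width follow the description above, while the ReLU activation, optimizer, and input/output dimensions are illustrative assumptions:

    from tensorflow import keras
    from tensorflow.keras import layers

    n_genes, n_types = 2000, 6  # assumed input/output dimensions

    # Five dense layers of 100 neurons each, plus a softmax output layer
    model = keras.Sequential(
        [layers.Input(shape=(n_genes,))]
        + [layers.Dense(100, activation="relu") for _ in range(5)]
        + [layers.Dense(n_types, activation="softmax")]
    )
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])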

LightGBM

All models produced by LightGBM suffered from extreme overfitting. The parameters used in our best-performing model all help alleviate this issue, though the model still achieves near-perfect prediction on the training set.
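A hedged sketch of an LGBMClassifier constrained against overfitting; the specific parameter values below are illustrative assumptions rather than the exact settings of our best model:

    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=50, n_informative=20,
                               n_classes=6, random_state=0)

    # Constrain tree complexity and subsample rows/features to curb overfitting
    clf = LGBMClassifier(
        num_leaves=15,          # fewer leaves per tree
        max_depth=5,            # shallower trees
        min_child_samples=30,   # require more cells per leaf
        subsample=0.8,          # row subsampling
        subsample_freq=1,
        colsample_bytree=0.8,   # feature subsampling
        reg_alpha=0.1,          # L1 regularization
        reg_lambda=0.1,         # L2 regularization
    )
    clf.fit(X, y)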

Made by Beverly Peng