ML classifier

Initially, to test the entire workflow, a simple linear SVM model was trained on standardized data, as outlined in the Algorithm below.
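A minimal sketch of such a baseline is given here, assuming scikit-learn. The synthetic data, the train/test split, and the parameter values (C, max_iter) are illustrative assumptions, not the settings used in the actual workflow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix and labels (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize each feature to a Z-score, then fit a linear SVM.
# Putting both steps in one pipeline ensures the scaler is fit on
# the training data only and then reused unchanged on the test data.
baseline = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
baseline.fit(X_train, y_train)

test_accuracy = baseline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the classifier in a single pipeline object also makes it easy to swap in a different preprocessor or model later without changing the surrounding code.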

In the final computational pipeline of Super.Complex, the training dataset is run through an AutoML algorithm, TPOT (Olson et al.), which evaluates several preprocessors and machine learning models and yields a cross-validation (CV) score on the training dataset for each pipeline (a combination of one or more preprocessors and a machine learning model).
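To illustrate what scoring a pipeline by cross-validation means, the sketch below enumerates a few candidate pipelines by hand with scikit-learn. It does not call the AutoML tool itself, which searches a much larger space automatically; the candidate set, the synthetic data, and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the training dataset (assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# Each candidate is a pipeline: zero or more preprocessors plus one model.
candidates = {
    "StandardScaler + LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "MinMaxScaler + KNN": make_pipeline(
        MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "GaussianNB": make_pipeline(GaussianNB()),
}

# Mean 5-fold CV score (default classifier scoring: accuracy) on the training data only.
cv_scores = {
    name: cross_val_score(pipe, X_train, y_train, cv=5).mean()
    for name, pipe in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {name}")
```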

The preprocessors and ML models evaluated in our experiments are among the most commonly used in the sklearn ML library in Python. They are listed below as a short summary of the most common ML techniques:

Preprocessors:

Scaling: Each feature can be scaled by these methods:
1. Binarizer: Set to 0 or 1 based on threshold
2. MaxAbsScaler: Divide by max absolute value of the feature
3. MinMaxScaler: Subtract min and divide by range
4. Normalizer: Divide each sample's feature vector by its norm to get unit norm (operates per sample rather than per feature)
5. RobustScaler: Subtract the median and divide by the interquartile range, which makes the scaling robust to outliers
6. StandardScaler: Standardization to Z-score by subtracting mean and dividing by standard deviation of the feature
Feature selection:
Decomposition: PCA, FastICA (Independent Component Analysis)
Feature Agglomeration
Kernel Approximation methods: Nystroem, RBFSampler
Adding Polynomial Features
Zero count: Adds the count of zeros and non-zeros per sample as features.
OneHotEncoder for numeric categorical variables
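The scaling methods listed above can be compared on a toy example. The sketch below assumes scikit-learn and NumPy; the single-feature column, the two-feature row, and the Binarizer threshold are made-up values chosen so that the results are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import (
    Binarizer,
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# One feature observed on five samples; 10.0 acts as an outlier (toy data).
x = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

# Values strictly greater than the threshold become 1, the rest 0.
binarized = Binarizer(threshold=2.0).fit_transform(x)
# Divide by the maximum absolute value (10).
max_abs = MaxAbsScaler().fit_transform(x)
# Subtract the minimum (0) and divide by the range (10).
min_max = MinMaxScaler().fit_transform(x)
# Subtract the median (2) and divide by the interquartile range (3 - 1 = 2).
robust = RobustScaler().fit_transform(x)
# Subtract the mean and divide by the standard deviation.
standard = StandardScaler().fit_transform(x)

# Normalizer works on rows (samples), not columns (features):
# the row (3, 4) has L2 norm 5, so it becomes (0.6, 0.8).
row = np.array([[3.0, 4.0]])
unit_row = Normalizer(norm="l2").fit_transform(row)

print("binarized:", binarized.ravel())
print("max_abs:  ", max_abs.ravel())
print("min_max:  ", min_max.ravel())
print("robust:   ", robust.ravel())
print("standard: ", standard.ravel().round(3))
print("unit_row: ", unit_row.ravel())
```

Note how the outlier compresses the other values toward zero under MaxAbsScaler and MinMaxScaler, while RobustScaler keeps them spread around the median.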

ML models:

Naive Bayes: Gaussian, Bernoulli, Multinomial
Tree: Decision Tree
Ensemble: Extra Trees, Random Forest, Gradient Boosting, XGBoost
Neighbors: KNN
SVM: Linear
Linear models: Logistic Regression

The pipelines with high cross-validation (CV) scores are then evaluated on the test dataset, and the best-performing pipeline is used for prediction later in the sampling stage.
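This selection step can be sketched as follows, assuming scikit-learn. The candidate pipelines, the synthetic data, the shortlist size top_k, and the use of accuracy as the metric are all illustrative assumptions rather than the settings used in the actual experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data, split into train and test (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": make_pipeline(DecisionTreeClassifier(random_state=0)),
    "forest": make_pipeline(RandomForestClassifier(n_estimators=50, random_state=0)),
}

# Step 1: rank the pipelines by mean CV score on the training data.
cv_scores = {
    name: cross_val_score(pipe, X_train, y_train, cv=5).mean()
    for name, pipe in candidates.items()
}
top_k = 2  # hypothetical shortlist size
shortlist = sorted(cv_scores, key=cv_scores.get, reverse=True)[:top_k]

# Step 2: refit each shortlisted pipeline on the full training set
# and score it on the held-out test set.
test_scores = {}
for name in shortlist:
    fitted = candidates[name].fit(X_train, y_train)
    test_scores[name] = fitted.score(X_test, y_test)

# Step 3: keep the pipeline with the best test score for later prediction.
best_name = max(test_scores, key=test_scores.get)
best_pipeline = candidates[best_name]
print(f"best pipeline: {best_name} (test accuracy {test_scores[best_name]:.3f})")
```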

Results: Classifier performance ->

Move on to sampling ->