An SVM is a type of machine learning model that finds optimal decision boundaries, known as hyperplanes, to classify data points. The support vectors are the data points in each class that lie nearest to the hyperplane; their distance from the hyperplane defines the margin, which the SVM tries to maximize. SVMs are very good at handling non-linear data, which is why we chose to use one.
For our model, we utilized a radial basis function (RBF) kernel to map the data into a high-dimensional space.
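As a minimal sketch of this setup (the data here is synthetic and the variable names are ours; the actual pipeline also includes the PCA and tuning steps described below):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 914))   # placeholder for the 914-feature data
y_train = rng.integers(0, 2, size=100)  # placeholder binary labels, e.g. NC vs AD

# RBF-kernel SVM; probability=True exposes the class-probability
# estimates used later when combining the binary classifiers.
clf = SVC(kernel="rbf", probability=True)
clf.fit(X_train, y_train)
```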
To improve model performance and ensure accuracy, we utilized the following techniques and tools:
Binary Comparisons
Principal Component Analysis
Grid Search
5-Fold Cross Validation
Model Testing
Initially, we intended to build a single multiclass classifier that could distinguish between all three diagnoses (NC vs MCI vs AD). After performing some research and consulting with our advisor, we determined that three binary classifiers would be easier to build and would perform better overall.
The comparisons used are (a sketch of the pairwise setup follows this list):
NC vs MCI
NC vs AD
MCI vs AD
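One plausible way to build the three pairwise training sets is to filter the full dataset down to the two classes of each comparison. The labels and features below are illustrative, not taken from the report:

```python
import numpy as np

# Illustrative data; in the real dataset each subject has one diagnosis.
y = np.array(["NC", "MCI", "AD", "NC", "MCI", "AD"])
X = np.arange(len(y) * 2, dtype=float).reshape(len(y), 2)  # placeholder features

pairs = [("NC", "MCI"), ("NC", "AD"), ("MCI", "AD")]
subsets = {}
for a, b in pairs:
    mask = np.isin(y, [a, b])          # keep only the two classes of interest
    subsets[(a, b)] = (X[mask], y[mask])
```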
PCA is used to reduce the dimensionality of the data. This is necessary for our model because our dataset contains 914 features. By utilizing PCA, we reduce the data from 914 features to between 50 and 200 components. This reduction cuts down noise and helps the model find the most important patterns in each classification. It also makes training faster, since there are fewer features for the model to process.
For initial creation and testing, we used a fixed PCA value of 50 components. In the final version of the model, we included candidate values ranging from 50 to 200 so that the grid search could find the ideal value for each model.
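For concreteness, a minimal scikit-learn sketch of the reduction step (the data here is synthetic; in our pipeline, PCA is fit inside the grid search described next):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 914))     # placeholder for the 914-feature data

pca = PCA(n_components=50)          # initial fixed value; 50-200 searched later
X_reduced = pca.fit_transform(X)    # resulting shape: (200, 50)
```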
A grid search is a machine learning technique that tunes hyperparameters toward their ideal values. The grid is built by giving each hyperparameter a list of candidate values. The grid search then trains multiple versions of the model, one for each combination of hyperparameters. The best-performing combination is then saved and used for the complete model.
This tool is especially helpful because we aren't just building one model, we are building three, and each binary model may require different hyperparameter values. By utilizing grid search, we ensure that each model runs with ideal hyperparameters without having to manually determine which values work best.
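A sketch of this search with scikit-learn, chaining PCA and the SVM so the component count is tuned alongside the kernel parameters. The candidate values here are illustrative; the report's exact grids may differ:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 914))      # placeholder for the real feature matrix
y = rng.integers(0, 2, size=300)     # placeholder binary labels

# Chaining PCA and the SVM means PCA is re-fit on each training fold.
pipe = Pipeline([("pca", PCA()), ("svm", SVC(kernel="rbf"))])

# Candidate values for each hyperparameter; illustrative only.
param_grid = {
    "pca__n_components": [50, 100, 150, 200],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.01, 0.001],
}

# cv=5 performs the 5-fold cross validation described below.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)           # best combination found
```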
5-fold cross validation is a model evaluation technique that splits the dataset into five equal parts ("folds"). The model is then trained five times, each time using four folds for training and the remaining fold for validation. The final reported performance is the average of the results across the five iterations.
Our 5-fold cross validation returns and saves the average values and standard deviation of the following fields:
Accuracy (Acc)
Area under the Curve (AUC)
Sensitivity (Sens)
Specificity (Spec)
F1-Score (F1)
The average confusion matrix for each model is also generated, and the receiver operating characteristic (ROC) curve is plotted.
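A sketch of collecting these statistics with scikit-learn's cross_validate (recall is the same metric as sensitivity; specificity and the averaged confusion matrix require custom scorers, so only built-in metrics are shown here, and the data is synthetic):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 50))          # e.g. PCA-reduced features
y = rng.integers(0, 2, size=150)        # placeholder binary labels

scoring = {"acc": "accuracy", "auc": "roc_auc",
           "sens": "recall", "f1": "f1"}

results = cross_validate(SVC(kernel="rbf"), X, y, cv=5, scoring=scoring)
for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```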
To test our model, we split the dataset into 80% training and 20% testing. Each binary model is built on the training set, then overall performance is calculated on the testing set. The performance calculation is based on how well our model and prediction algorithms can predict a diagnosis of NC, MCI, or AD.
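A sketch of that split with scikit-learn (stratification is our assumption; the report does not state whether the split preserved class proportions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 914))                 # placeholder features
y = rng.choice(["NC", "MCI", "AD"], size=500)   # placeholder diagnoses

# 80% train / 20% test; stratify keeps class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```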
To make a final diagnosis prediction, the probability of each diagnosis is calculated for each binary comparison. Then, pairwise coupling is performed to calculate an overall diagnosis based on the results of all three comparisons.
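The report does not spell out the exact coupling algorithm; the following is a minimal sketch of the classic Hastie-Tibshirani pairwise coupling iteration, with equal weights on all pairs, as one concrete way to turn the three binary probabilities into a single distribution over diagnoses:

```python
import numpy as np

def pairwise_coupling(r, n_iter=100, tol=1e-8):
    """Estimate overall class probabilities p from pairwise probabilities.

    r[i][j] approximates P(class i | class i or j), so r[j][i] = 1 - r[i][j]
    off the diagonal. Simplified Hastie-Tibshirani iteration, equal weights.
    """
    k = r.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        mu = p[:, None] / (p[:, None] + p[None, :] + 1e-12)
        num = np.array([r[i].sum() - r[i, i] for i in range(k)])    # sum over j != i
        den = np.array([mu[i].sum() - mu[i, i] for i in range(k)])
        p_new = p * num / den
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p

# Example: classes ordered (NC, MCI, AD); the entries are illustrative only.
r = np.array([[0.0, 0.6, 0.8],
              [0.4, 0.0, 0.7],
              [0.2, 0.3, 0.0]])
print(pairwise_coupling(r))   # overall probability for each diagnosis
```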
Because MCI is difficult to predict due to its high similarity to NC, we also utilize boundary shifting and weight boosting to help the model better identify it.
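The report does not specify the exact mechanics of these adjustments; one common reading is class weighting during training plus a shifted probability threshold at prediction time, sketched below with illustrative values:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # placeholder features
y = rng.integers(0, 2, size=200)     # 1 = MCI (illustrative encoding)

# "Weight boosting": penalize misclassifying MCI samples more heavily.
clf = SVC(kernel="rbf", probability=True, class_weight={0: 1.0, 1: 2.0})
clf.fit(X, y)

# "Boundary shifting": lower the probability threshold for MCI so that
# borderline cases are more likely to be labeled MCI.
threshold = 0.4                      # illustrative; tuned in practice
proba_mci = clf.predict_proba(X)[:, 1]
pred = (proba_mci >= threshold).astype(int)
```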
Our model has an overall accuracy of 71%. The confusion matrix based on our 20% testing set can be seen below.