Linear regression assumes that there is a linear relationship between our multidimensional RNAseq data and gestational age.
Since real data rarely fall exactly on a line, the model finds the best-fit line by minimizing the distance of each individual point from the line (the residuals). We can then quantify how strong the linear relationship is using the Pearson or Spearman correlation coefficient.
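As a minimal sketch of this idea (using made-up one-dimensional data rather than our actual RNAseq matrix), ordinary least squares finds the slope and intercept that minimize the squared residuals:

```python
# Fit a best-fit line y = slope * x + intercept by ordinary least squares.
# The numbers are hypothetical, standing in for one gene's expression (x)
# versus gestational age in weeks (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.1, 12.2, 13.9, 16.1, 18.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS estimates.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

# Residuals: the vertical distance of each point from the best-fit line.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
```

A handy sanity check is that OLS residuals always sum to zero when an intercept is fitted.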
In order to understand RVM (Relevance Vector Machine), it is important to first understand SVM (Support Vector Machine). SVM is a model that draws a line between data points to separate them into classes.
RVM is similar to SVM; however, instead of just categorizing each data point on one side of the line or the other, RVM is a Bayesian model that also reports how confident it is in each prediction, giving each data point a probability of class membership rather than only a hard assignment.
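To make the "side of the line" versus "value relative to the line" distinction concrete, here is a toy sketch with scikit-learn's SVC standing in for the SVM described above (scikit-learn has no built-in RVM; the data are invented):

```python
from sklearn.svm import SVC

# Toy 2-D points from two well-separated classes.
X = [[0.0, 0.0], [0.5, 0.4], [3.0, 3.0], [3.5, 2.8]]
y = [0, 0, 1, 1]

clf = SVC(kernel="linear").fit(X, y)

# predict() gives only the side of the line each point falls on ...
labels = clf.predict(X)

# ... while decision_function() gives a signed value whose magnitude
# reflects how far each point sits from the separating line.
scores = clf.decision_function(X)
```

An RVM would go one step further and turn such scores into calibrated probabilities via Bayesian inference.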
Random forest was previously used as a feature-extraction method, where its purpose was to calculate feature importance. As a machine learning model, we use the decision trees of random forest for classification: to determine whether an observation falls into a category or not. In other words, we classify each observation into groups to determine how the variables relate to each other and to better understand our data.
A decision tree is a decision-making template in the form of "if this, then that". In our case, a decision might be "if this gene is highly expressed, then the woman is in her 32nd week of gestation".
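The "if this, then that" structure can be written directly as code. This toy function (the gene names and cutoffs are invented for illustration; a real tree would be learned from the data, not written by hand) mimics one path through a decision tree:

```python
def predict_week(expression):
    """Toy hand-written decision tree over a dict of gene-expression values.

    Gene names and thresholds are hypothetical placeholders.
    """
    if expression["GENE_A"] > 5.0:          # first split
        if expression["GENE_B"] > 2.0:      # second split
            return 32  # e.g. "32nd week of gestation"
        return 24
    return 12

week = predict_week({"GENE_A": 6.1, "GENE_B": 3.3})
```

A random forest simply trains many such trees on random subsets of the data and averages (or votes over) their answers.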
Root Mean Squared Error (RMSE) measures the difference between our model's predicted gestational age and the patient's true gestational age. The smaller the RMSE value, the better our model has performed.
In this project, the RMSE value was used to compare the different combinations of models and features that we tested.
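RMSE itself is simple to compute; a minimal version (with made-up predictions rather than our model's actual output):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Hypothetical predicted vs. true gestational ages (weeks).
error = rmse([30.0, 25.0, 38.0], [32.0, 24.0, 38.0])
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones.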
10-fold CV (cross-validation) is an experimental framework that we used to evaluate our results. It is a resampling procedure used to estimate the skill of a machine learning model on unseen data. In this case, the unseen data are held-out patient samples whose gestational age (GA) we are trying to predict.
We split our data set into a training set and a test set, and then split the training set further into 10 folds. In each of 10 rounds, we train the model on 9 folds, validate it on the remaining fold, and record the score; the model is then judged by its mean score across the 10 rounds (see below). We take the model with the best mean score, retrain it on the full training set, and use it to make the final predictions.
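The fold bookkeeping behind this procedure can be sketched in a few lines (a hand-rolled split over index lists; in practice a library routine such as scikit-learn's KFold does this):

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k (train, validation) pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        stop = n_samples if i == k - 1 else start + fold_size
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, val))
    return folds

splits = k_fold_indices(100, k=10)
```

Every sample appears in exactly one validation fold, so each observation is used for validation once and for training nine times.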
Here is our code with explanations:
In addition to the three models outlined above, we also tested the following models:
enet, foba, gaussprPoly, gaussprRadial, glmnet, icr, kernelpls, krlsRadial, lars2, lasso, leapBackward, leapForward, leapSeq, nnls, partDSA, pls, plsRglm, rbf, rpart, rqlasso, rvmPoly, rvmRadial, simpls, spikeslab, spls, svmPoly, svmRadial, svmRadialCost, svmRadialSigma, widekernelpls, rqnc, nodeHarvest, mlpML, xgbDART
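These names are caret (R) model tags, but the comparison loop itself is generic. A sketch of the same idea in Python with scikit-learn stand-ins (Lasso and ElasticNet here roughly correspond to the "lasso" and "enet" tags; the data are synthetic, not our RNAseq matrix):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 60 samples, 5 features, 2 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=60)

models = {
    "lm": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "enet": ElasticNet(alpha=0.1),
}

# Mean 10-fold cross-validated RMSE per model, for side-by-side comparison
# (sklearn reports negated RMSE, so we flip the sign).
results = {
    name: -cross_val_score(
        model, X, y, cv=10, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in models.items()
}
```

The model with the lowest mean RMSE across folds is the one we would retrain on the full training set.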