Linear regression assumes that there is a linear relationship between our multidimensional RNAseq data and gestational age.
Since real data rarely fall exactly on a line, the model finds the best-fit line by minimizing the distance of each individual point from the line (the residuals). We can then quantify how strong the linear relationship is using the Pearson or Spearman correlation coefficient.
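As a minimal sketch of this idea (using made-up one-dimensional data rather than our actual RNAseq matrix), ordinary least squares finds the slope and intercept that minimize the squared residuals:

```python
# Fit a best-fit line y = slope * x + intercept by ordinary least squares.
# The numbers are hypothetical, standing in for one gene's expression (x)
# versus gestational age in weeks (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.1, 12.2, 13.9, 16.1, 18.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS estimates.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

# Residuals: the vertical distance of each point from the best-fit line.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
```

A handy sanity check is that OLS residuals always sum to zero when an intercept is fitted.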
In order to understand RVM (Relevance Vector Machine), it is important to first understand SVM (Support Vector Machine). SVM is a model that draws a line between data points to separate them into classes.
RVM is similar to SVM; however, instead of just categorizing each data point on one side of the line or the other, RVM is a Bayesian model that also reports how confident it is in each prediction, giving each data point a probability of class membership rather than only a hard assignment.
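To make the "side of the line" versus "value relative to the line" distinction concrete, here is a toy sketch with scikit-learn's SVC standing in for the SVM described above (scikit-learn has no built-in RVM; the data are invented):

```python
from sklearn.svm import SVC

# Toy 2-D points from two well-separated classes.
X = [[0.0, 0.0], [0.5, 0.4], [3.0, 3.0], [3.5, 2.8]]
y = [0, 0, 1, 1]

clf = SVC(kernel="linear").fit(X, y)

# predict() gives only the side of the line each point falls on ...
labels = clf.predict(X)

# ... while decision_function() gives a signed value whose magnitude
# reflects how far each point sits from the separating line.
scores = clf.decision_function(X)
```

An RVM would go one step further and turn such scores into calibrated probabilities via Bayesian inference.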
Random forest was previously used as a feature-extraction method, where its purpose was to calculate feature importance. As a machine learning model, we use the decision trees of random forest for classification: to determine whether an observation falls into a category or not. In other words, we classify each observation into groups to determine how the variables relate to each other and to better understand our data.
A decision tree is a decision-making template in the form of "if this, then that". In our case, a decision might be "if this gene is highly expressed, then the woman is in her 32nd week of gestation".
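The "if this, then that" structure can be written directly as code. This toy function (the gene names and cutoffs are invented for illustration; a real tree would be learned from the data, not written by hand) mimics one path through a decision tree:

```python
def predict_week(expression):
    """Toy hand-written decision tree over a dict of gene-expression values.

    Gene names and thresholds are hypothetical placeholders.
    """
    if expression["GENE_A"] > 5.0:          # first split
        if expression["GENE_B"] > 2.0:      # second split
            return 32  # e.g. "32nd week of gestation"
        return 24
    return 12

week = predict_week({"GENE_A": 6.1, "GENE_B": 3.3})
```

A random forest simply trains many such trees on random subsets of the data and averages (or votes over) their answers.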
Root Mean Squared Error (RMSE) measures the difference between our model's predicted gestational age and the patient's true gestational age. The smaller the RMSE value, the better our model has performed.
In this project, the RMSE value was used to compare the different combinations of models and features that we tested.
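RMSE itself is simple to compute; a minimal version (with made-up predictions rather than our model's actual output):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Hypothetical predicted vs. true gestational ages (weeks).
error = rmse([30.0, 25.0, 38.0], [32.0, 24.0, 38.0])
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones.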
10-fold CV (cross-validation) is an experimental framework that we used to evaluate our results. It is a resampling procedure used to estimate the skill of a machine learning model on unseen data. In this case, the unseen data are held-out patient samples whose gestational age (GA) we are trying to predict.
We split our data set into a training set and a test set, and then split the training set further into 10 folds. In each of 10 rounds, we train the model on 9 folds, validate it on the remaining fold, and record the score; the model is then judged by its mean score across the 10 rounds (see below). We take the model with the best mean score, retrain it on the full training set, and use it to make the final predictions.
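The fold bookkeeping behind this procedure can be sketched in a few lines (a hand-rolled split over index lists; in practice a library routine such as scikit-learn's KFold does this):

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k (train, validation) pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        stop = n_samples if i == k - 1 else start + fold_size
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, val))
    return folds

splits = k_fold_indices(100, k=10)
```

Every sample appears in exactly one validation fold, so each observation is used for validation once and for training nine times.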
Here is our code with explanations:
In addition to the three models outlined above, we also tested the following models:
enet, foba, gaussprPoly, gaussprRadial, glmnet, icr, kernelpls, krlsRadial, lars2, lasso, leapBackward, leapForward, leapSeq, nnls, partDSA, pls, plsRglm, rbf, rpart, rqlasso, rvmPoly, rvmRadial, simpls, spikeslab, spls, svmPoly, svmRadial, svmRadialCost, svmRadialSigma, widekernelpls, rqnc, nodeHarvest, mlpML, xgbDART
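These names are caret (R) model tags, but the comparison loop itself is generic. A sketch of the same idea in Python with scikit-learn stand-ins (Lasso and ElasticNet here roughly correspond to the "lasso" and "enet" tags; the data are synthetic, not our RNAseq matrix):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 60 samples, 5 features, 2 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=60)

models = {
    "lm": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "enet": ElasticNet(alpha=0.1),
}

# Mean 10-fold cross-validated RMSE per model, for side-by-side comparison
# (sklearn reports negated RMSE, so we flip the sign).
results = {
    name: -cross_val_score(
        model, X, y, cv=10, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in models.items()
}
```

The model with the lowest mean RMSE across folds is the one we would retrain on the full training set.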