Datasets can be complex. The full RxRx19 dataset is 450 gigabytes in size and contains over 1,000 features. A common challenge when building machine learning models is unnecessary complexity in datasets, which can confuse models and reduce accuracy. This loss of accuracy is especially consequential for machine learning models dealing with COVID-19, where false-negative test results have drastic implications for public health and human lives. Dimensionality reduction transforms data from a high-dimensional space into a low-dimensional space, reducing complexity while retaining the important details of the original data that are needed to build an accurate model.
Neural network models, while accurate, can be very time-consuming. With complex models, neural networks can take days or weeks to produce results. This problem is especially serious for COVID-19 because healthcare resources are already strained and the virus can spread asymptomatically. Dimensionality reduction extracts a smaller set of variables, reducing complexity while maintaining approximately the same accuracy, which shortens training times. With shorter training times, we can focus our efforts on improving the model and deepening our understanding of SARS-CoV-2. A fast and accurate model can reduce waiting times and testing costs.
Overfitting can occur when a training set has too many features and not enough data. To account for every feature, the model increases its complexity and memorizes details specific to the training set that may not apply to new examples, which causes overfitting and reduces accuracy. This is a problem for SARS-CoV-2 because the virus does not present itself in the same way in different people, and a reliable model must remain accurate across a diverse selection of cases. Dimensionality reduction reduces the number of features, which can reduce overfitting.
Many characteristics of SARS-CoV-2 are still unknown. To truly understand SARS-CoV-2 and determine cause-and-effect relationships, we need to pinpoint specific features and control which variables are analyzed at a time. Dimensionality reduction selects and narrows down the relevant features to analyze, making it easier for researchers to interpret results and advance our understanding of SARS-CoV-2.
Dimensionality reduction also helps remove irrelevant features that unnecessarily complicate models and increase computing time. Removing this noise can improve model performance.
We tested three factor models to reduce dimensionality: principal component analysis (PCA), matrix factorization, and an autoencoder. We generated latent features with each model and saved the best feature set from each factor model for use in our classification model.
PCA uses linear algebra to perform a linear transformation of the variables, projecting them onto a smaller set of uncorrelated principal components. PCA is an unsupervised machine learning method, meaning it identifies patterns in unlabeled datasets with minimal human supervision. In a PCA, the data is first normalized to a 0-1 scale. Then the covariance matrix is calculated to identify correlations between variables, and an eigenvalue decomposition is performed; the eigenvectors with the largest eigenvalues become the principal components.
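As a rough illustration of those steps (a minimal sketch using a placeholder matrix X rather than the actual RxRx19 features), a manual PCA in NumPy might look like this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder matrix: 500 samples x 100 features standing in for the real data.
X = np.random.rand(500, 100)

# Step 1: normalize each feature to the 0-1 range.
X_scaled = MinMaxScaler().fit_transform(X)

# Step 2: compute the covariance matrix of the normalized features.
cov = np.cov(X_scaled, rowvar=False)

# Step 3: eigenvalue decomposition; the eigenvectors with the largest
# eigenvalues become the principal components.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: project the data onto the top k components (k is chosen later).
k = 10
X_reduced = X_scaled @ eigenvectors[:, :k]
```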
Because PCA performs a linear transformation, it might miss non-linear relationships between variables that matter for more complex data in fields such as genetics. The explained variance, highlighted in the code, measures how much of the variation in the original data the retained components capture; the higher the explained variance, the smaller the discrepancy between the model and the actual data. In general, a good explained variance score is above 0.60. The goal is to maximize the explained variance while minimizing the k-value, which is the number of features kept in the model.
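In practice, scikit-learn's PCA reports the explained variance directly, so choosing k could be sketched as follows (X is again a placeholder and k = 10 is only a hypothetical starting point):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(500, 100)                 # placeholder feature matrix
X_scaled = MinMaxScaler().fit_transform(X)

k = 10                                       # hypothetical number of components
pca = PCA(n_components=k)
X_reduced = pca.fit_transform(X_scaled)

# Total fraction of the original variance captured by the k components.
explained = pca.explained_variance_ratio_.sum()
print(f"k = {k}, explained variance = {explained:.3f}")
# Aim for a value above roughly 0.60 while keeping k as small as possible.
```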
Matrix factorization decomposes a matrix into two lower-dimensional matrices; multiplying these two matrices together approximately reconstructs the original matrix. Matrix factorization therefore finds a lower-dimensional representation of the original features. The average error value of a matrix factorization describes how closely the factorization reconstructs the original data. Increasing the number of latent variables (k2) can lower the average error, but it also increases the computing time. The goal is to minimize both the average error and the number of latent variables.
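As a stand-in for the project's matrix factorization, here is a minimal sketch using scikit-learn's non-negative matrix factorization (NMF) on a placeholder matrix; k2 = 10 is a hypothetical choice, and the average absolute reconstruction error plays the role of the average error value described above:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(500, 100)                   # placeholder feature matrix
X_scaled = MinMaxScaler().fit_transform(X)     # NMF requires non-negative values

k2 = 10                                        # hypothetical latent dimension
mf = NMF(n_components=k2, init="nndsvda", max_iter=500, random_state=0)

W = mf.fit_transform(X_scaled)   # samples x k2: the lower-dimensional features
H = mf.components_               # k2 x original features

# Multiplying the two factors approximately reconstructs the original matrix.
reconstruction = W @ H
avg_error = np.mean(np.abs(X_scaled - reconstruction))
print(f"k2 = {k2}, average reconstruction error = {avg_error:.4f}")
```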
Autoencoders are neural networks that learn to reconstruct data from a lower-dimensional representation. Like other neural networks, they can be time-consuming to train and are more complex than methods such as PCA; however, autoencoders can capture non-linear relationships that those methods cannot identify.
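A minimal autoencoder sketch in Keras might look like the following; it assumes X_scaled is the normalized feature matrix from the earlier sketches, and latent_dim = 10 is a hypothetical bottleneck size rather than the value used in the project:

```python
from tensorflow.keras import layers, models

# X_scaled is assumed to be the normalized feature matrix from the sketches above.
n_features = X_scaled.shape[1]
latent_dim = 10   # hypothetical size of the compressed representation

# The encoder compresses the input into the latent space; the decoder reconstructs it.
inputs = layers.Input(shape=(n_features,))
encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(latent_dim, activation="relu")(encoded)
decoded = layers.Dense(64, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, decoded)
encoder = models.Model(inputs, encoded)   # used afterwards to extract the latent features
autoencoder.compile(optimizer="adam", loss="mse")
```

The sigmoid output layer here matches the 0-1 normalization used earlier, so the network's outputs stay on the same scale as its inputs.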
After we construct the neural network, we split the dataset into a training set and a testing set.
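Continuing the sketch above (it reuses X_scaled and the autoencoder model), the split and training step might look like this, using scikit-learn's train_test_split and hypothetical training settings:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples so reconstruction can be checked on unseen data.
X_train, X_test = train_test_split(X_scaled, test_size=0.2, random_state=0)

# The autoencoder learns to reproduce its own input, so the input is also the target.
autoencoder.fit(X_train, X_train,
                epochs=20, batch_size=32,
                validation_data=(X_test, X_test),
                verbose=0)
```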
Here, we check how well the autoencoder reconstructs the original dataset by printing the mean squared error. The errors on the training and testing sets are similar and fairly low, which tells us that the autoencoder reconstructs the original dataset fairly well.
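A sketch of that check, continuing from the training step above, compares the mean squared error on the two sets:

```python
from sklearn.metrics import mean_squared_error

# Reconstruction error on data the model has seen (training) and has not seen (testing).
train_mse = mean_squared_error(X_train, autoencoder.predict(X_train, verbose=0))
test_mse = mean_squared_error(X_test, autoencoder.predict(X_test, verbose=0))
print(f"Training MSE: {train_mse:.4f}")
print(f"Testing MSE:  {test_mse:.4f}")
# Similar, low values on both sets suggest the autoencoder reconstructs the
# features well without overfitting to the training set.
```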