The Process

Pre-Processing

  1. Merge intron and exon snRNA-seq datasets.

  2. Apply log2cpm transformation.

  3. Remove genes with 0 expression across all samples.

  4. Filter out housekeeping genes by calculating the median expression of every gene within each granular cell cluster. Remove genes whose median expression has 0 variance across clusters.

  5. Determine a coefficient of variation threshold to further select for useful features in classification.

  6. Split the data into 60% training, 20% validation, and 20% testing.
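Steps 2, 3, 4, and 6 above could be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: it assumes a cells-by-genes count matrix, a pseudocount of 1 inside the log (log2(CPM + 1)), at least one count per cell, and a plain random split (the source does not say whether the split was stratified). The function names `log2cpm`, `filter_genes`, and `split_indices` are hypothetical.

```python
import numpy as np

def log2cpm(counts, pseudocount=1.0):
    """Scale each cell (row) to counts per million, then take log2.

    Assumption: rows are cells, columns are genes, and every cell has
    at least one count. The pseudocount of 1 is also an assumption.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=1, keepdims=True)
    cpm = counts / library_sizes * 1e6
    return np.log2(cpm + pseudocount)

def filter_genes(expr, clusters):
    """Return a boolean mask over genes (columns) to keep.

    A gene is dropped if it has 0 expression in every cell (step 3) or
    if its per-cluster median expression is identical in every cluster,
    i.e. has 0 variance across clusters (step 4).
    """
    expr = np.asarray(expr, dtype=float)
    clusters = np.asarray(clusters)
    expressed = expr.sum(axis=0) > 0
    medians = np.stack(
        [np.median(expr[clusters == c], axis=0) for c in np.unique(clusters)]
    )
    varies = medians.var(axis=0) > 0
    return expressed & varies

def split_indices(n_cells, seed=0):
    """Shuffle cell indices and split them 60% / 20% / 20% (step 6)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cells)
    n_train = (n_cells * 60) // 100
    n_val = (n_cells * 20) // 100
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test
```

With imbalanced cell types, a stratified split may be preferable so that rare clusters appear in all three subsets; the random split here is only the simplest option.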

Distribution of Cells by Major Classification

The dataset is highly imbalanced across clusters: excitatory neurons account for a much larger share of cells than any other major cell type.

Experimental Design

  1. Train the models on the training dataset.

  2. Predict the cell types of the validation dataset with each model.

  3. Tune the model hyperparameters based on the validation results.

  4. Predict the cell types of the testing dataset with the final model.

  5. Analyze the predictions and draw conclusions about the model based on F-beta scores.
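As an illustration of the last step, the sketch below computes a per-class F-beta score from true and predicted labels, using F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP). The function name `fbeta_per_class` and the labels in the usage example are hypothetical, and the source does not state which value of beta was used, so it is left as a parameter.

```python
from collections import Counter

def fbeta_per_class(y_true, y_pred, beta=1.0):
    """Return {label: F-beta score}, treating each label one-vs-rest.

    beta > 1 weights recall more heavily; beta < 1 weights precision.
    A class with no true positives, false positives, or false negatives
    gets a score of 0.0 by convention (an assumption of this sketch).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    b2 = beta ** 2
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = (1 + b2) * tp[label] + b2 * fn[label] + fp[label]
        scores[label] = (1 + b2) * tp[label] / denom if denom else 0.0
    return scores

# Hypothetical labels, for illustration only.
y_true = ["exc", "exc", "exc", "inh", "glia"]
y_pred = ["exc", "exc", "inh", "inh", "exc"]
scores = fbeta_per_class(y_true, y_pred, beta=1.0)
```

Reporting the score per class, rather than a single accuracy figure, keeps a dominant cell type from masking poor performance on rare ones.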


Coefficient of Variation Histogram

A histogram of the number of genes falling in each interval of the coefficient of variation (COV). The distribution peaks at COV = 0.52 and has a median above 1.
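For reference, the coefficient of variation of a gene is its standard deviation divided by its mean. The sketch below shows one way to compute it per gene, bin the values for a histogram, and apply a threshold. It assumes a cells-by-genes expression matrix and the population standard deviation; the source does not say which matrix the COV was computed on or which threshold was chosen, so the function names and the `threshold` argument are hypothetical.

```python
import numpy as np

def coefficient_of_variation(expr):
    """Per-gene COV = std / mean over cells (rows).

    Genes with a mean of 0 have an undefined COV and are returned as NaN.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    std = expr.std(axis=0)  # population standard deviation (assumption)
    cov = np.full(mean.shape, np.nan)
    np.divide(std, mean, out=cov, where=mean > 0)
    return cov

def cov_histogram(cov, bins=50):
    """Count genes per COV interval, ignoring undefined (NaN) values."""
    cov = np.asarray(cov, dtype=float)
    return np.histogram(cov[~np.isnan(cov)], bins=bins)

def select_by_cov(expr, threshold):
    """Boolean mask of genes whose COV is defined and >= threshold."""
    cov = coefficient_of_variation(expr)
    keep = np.zeros(cov.shape, dtype=bool)
    valid = ~np.isnan(cov)
    keep[valid] = cov[valid] >= threshold
    return keep
```

A threshold would typically be read off the histogram, for example by cutting below the main peak, but that choice is a judgment call rather than something this sketch determines.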

Made by Daniel Carrillo