The Process

Pre-Processing

Merge intron and exon snRNA-seq datasets.
Apply log2cpm transformation.
Remove genes with 0 expression across all samples.
Filter out housekeeping genes by calculating the median expression of every gene within each granular cell cluster. Remove genes whose cluster medians have zero variance, i.e. genes whose median expression is identical in every cluster.
Determine a coefficient of variation threshold to further select for useful features in classification.
Split the data into 60% training, 20% validation, and 20% testing.
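
The transformation, filtering, and splitting steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the function names, the log2(CPM + 1) pseudocount, the use of the population standard deviation for the coefficient of variation, and the unstratified random split are all assumptions. The intron/exon merge is left out because the page does not say how the two datasets are combined.

```python
import math
import random
import statistics


def log2_cpm(counts):
    """Convert a genes x cells matrix of raw counts to log2(CPM + 1)."""
    n_cells = len(counts[0])
    # Library size: total counts in each cell (column sums).
    lib = [sum(row[j] for row in counts) for j in range(n_cells)]
    return [
        [math.log2(row[j] / lib[j] * 1e6 + 1) if lib[j] else 0.0
         for j in range(n_cells)]
        for row in counts
    ]


def cluster_medians(row, clusters):
    """Median expression of one gene within each cluster label."""
    by_cluster = {}
    for value, label in zip(row, clusters):
        by_cluster.setdefault(label, []).append(value)
    return [statistics.median(v) for v in by_cluster.values()]


def coefficient_of_variation(row):
    """Population standard deviation divided by the mean (assumed definition)."""
    mean = statistics.fmean(row)
    return statistics.pstdev(row) / mean if mean else 0.0


def select_genes(genes, expr, clusters, cov_threshold):
    """Apply the three gene filters in order; return surviving gene names."""
    kept = []
    for gene, row in zip(genes, expr):
        if not any(row):
            continue  # zero expression in every cell
        if statistics.pvariance(cluster_medians(row, clusters)) == 0:
            continue  # identical median in every cluster
        if coefficient_of_variation(row) < cov_threshold:
            continue  # too little variation to be a useful feature
        kept.append(gene)
    return kept


def split_indices(n_cells, seed=0):
    """Shuffle cell indices and cut them 60/20/20 (not stratified)."""
    idx = list(range(n_cells))
    random.Random(seed).shuffle(idx)
    a, b = int(0.6 * n_cells), int(0.8 * n_cells)
    return idx[:a], idx[a:b], idx[b:]
```

Given how imbalanced the cell types are, a split stratified by cell type may be preferable in practice; the page does not state which kind of split was used.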

Distribution of Cells by Major Classification

The dataset is highly imbalanced across clusters: excitatory neurons account for a much larger share of cells than any of the other major cell types.

Experimental Design

Train the models on the training dataset.
Predict the cell types of the validation dataset with the model.
Tune the model hyperparameters using the validation results.
Predict the cell types of the testing dataset with the final model.
Analyze the predictions and draw conclusions about the model based on F-beta scores.
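
For reference, the F-beta score used in the last step combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R). A minimal per-class sketch is shown below. The function name and the labels "exc" and "inh" are hypothetical, and the value of beta is an assumption, since the page does not say which beta was chosen.

```python
def f_beta(y_true, y_pred, label, beta=1.0):
    """F-beta score for one class, computed from true/predicted label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    b2 = beta ** 2
    # Equivalent to (1 + b2) * P * R / (b2 * P + R), written in counts
    # so that it avoids dividing by zero when precision or recall is undefined.
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical labels: three excitatory cells and one inhibitory cell.
y_true = ["exc", "exc", "exc", "inh"]
y_pred = ["exc", "exc", "inh", "inh"]
score_exc = f_beta(y_true, y_pred, "exc")
score_inh = f_beta(y_true, y_pred, "inh")
```

Scoring each class separately matters with imbalanced data like this, because overall accuracy can look high even when the rare cell types are classified poorly. A beta above 1 weights recall more heavily, and a beta below 1 weights precision more heavily.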

Coefficient of Variation Histogram

A histogram showing how many genes fall within each interval of the coefficient of variation (COV). The distribution peaks at COV = 0.52, while the median lies above 1.
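
The sketch below shows one way such a histogram could be summarized, and why the peak and the median need not coincide. The bin width of 0.1, the function name, and the sample values are illustrative assumptions; they are not taken from the dataset.

```python
import statistics
from collections import Counter


def cov_histogram(cov_values, bin_width=0.1):
    """Count genes per COV bin; bin k covers [k * bin_width, (k + 1) * bin_width).

    Values that sit exactly on a bin edge may land in either neighbouring
    bin because of floating-point rounding.
    """
    counts = Counter(int(v // bin_width) for v in cov_values)
    return dict(sorted(counts.items()))


# Made-up COV values with a cluster of low values and a long right tail.
sample = [0.45, 0.52, 0.55, 1.25, 1.35, 1.65, 2.45]
hist = cov_histogram(sample)
peak_bin = max(hist, key=hist.get)        # most populated bin
median_cov = statistics.median(sample)    # middle value of all genes
```

In this toy sample the most populated bin starts at 0.5, while the median is 1.25, because the long right tail pulls the median above the peak.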

Made by Daniel Carrillo
