CLL Dataset
CNN model trained on scatter-plot images of UMAP-transformed features from CLL samples
Data Augmentation + CNN Model Training
Solution to the limited patient sample size: data augmentation with multiple-template embedding (Figure 3a)
Enumerate all pairs of samples in which one sample has a positive label and the other a negative label.
Sample UMAP template embedding: embed each transformed sample into the template pairs marked in the previous step, and save the resulting images as inputs for CNN model training.
Hyperparameter optimization with grid search
CNN model training
Split the data into training, testing, and validation datasets
Train model with augmented datasets and the optimal parameters
Obtain accuracy and predictions on testing datasets
Optionally, the ensemble learning algorithm XGBoost is used to combine multiple models for the best performance.
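The pair-enumeration step of the augmentation workflow above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `enumerate_template_pairs`, the sample IDs, and the 1/0 label encoding are all assumptions made for the example. Each returned pair would then serve as one template pair for the UMAP embedding step.

```python
from itertools import product


def enumerate_template_pairs(labels):
    """Return every (positive_id, negative_id) pair from a {sample_id: label} map.

    Assumption: labels are encoded as 1 (positive) and 0 (negative).
    """
    positives = sorted(s for s, y in labels.items() if y == 1)
    negatives = sorted(s for s, y in labels.items() if y == 0)
    # Cartesian product: each positive sample is paired with each negative one.
    return list(product(positives, negatives))


# Hypothetical sample IDs and labels, for illustration only.
labels = {"S01": 1, "S02": 0, "S03": 1, "S04": 0, "S05": 0}
pairs = enumerate_template_pairs(labels)
print(len(pairs))  # 2 positives x 3 negatives = 6 template pairs
```

With P positive and N negative samples this yields P x N template pairs, which is how a small cohort can be expanded into many more training images.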
Figure 3. Data Augmentation workflow visualization
ALL Dataset
1. UMAP Construction and feature transformation, Multiple templates + Single Positive Voting
ALL patients and non-patients with labels, 60 samples in total for training/cross-validation, 29 samples without labels for testing
a. To accommodate the varying endotypes among ALL samples, we construct 5 templates (each based on 2 different samples from ALL patients and 2 samples from non-patients) as a single training set (Figure 4), then apply HDBSCAN for cluster classification
b. Prediction: train and test via cross-validation. Get the prediction results for each template and check whether there are any outliers
c. Single-positive voting: consider the predictions of the 5 templates jointly and mark the sample as “ALL” as long as at least one of the models gives a positive prediction
d. Repeat the above steps for a new training set, following the 4-fold cross-validation procedure
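The single-positive voting rule in step c amounts to a logical OR across the per-template predictions. The sketch below assumes each template's model outputs a list of 0/1 predictions for the same samples in the same order. The function name and the example predictions are hypothetical.

```python
def single_positive_vote(template_predictions):
    """Combine per-template binary predictions with an OR rule.

    template_predictions: list of lists, one inner list per template,
    each holding 0/1 predictions for the same samples in the same order.
    A sample is marked 1 ("ALL") if at least one template predicts 1.
    """
    n_samples = len(template_predictions[0])
    if any(len(p) != n_samples for p in template_predictions):
        raise ValueError("all templates must predict the same samples")
    # zip(*...) regroups the predictions per sample instead of per template.
    return [int(any(preds)) for preds in zip(*template_predictions)]


# Hypothetical predictions from 5 templates for 4 samples.
predictions = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
votes = single_positive_vote(predictions)
print(votes)  # [0, 1, 1, 0]
```

An OR rule of this kind favors sensitivity: the second sample is flagged although only one of the five templates predicts it as positive, so it can also raise the number of false positives.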
2. UMAP Construction and feature transformation, 4 samples for building templates + Density Bias Approach
ALL patients and non-patients with labels, 60 samples in total for training/cross-validation, 29 samples without labels for testing
a. Build a template (based on 2 different samples from ALL patients and 2 samples from non-patients) as a single training set (Figure 5a), then apply HDBSCAN for cluster classification (Figure 5b)
b. Use Density-Bias Selection to build a new template (Figure 5c): select at most the same number of cells from each identified cluster, for example:
If number of cells in cluster i > 600, randomly select 600 cells from it;
otherwise, keep all the cells in cluster i and proceed
c. Prediction: train and test via cross-validation. Get the prediction results for each template and check whether there are any outliers
d. Repeat the above steps for a new training set, following the 4-fold cross-validation procedure
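The density-bias selection rule in step b (cap large clusters at 600 cells, keep small clusters whole) can be sketched as below. This is an illustrative version under stated assumptions: the function name, the fixed random seed, and the example cluster sizes are hypothetical, and the input is assumed to be one cluster label per cell, as produced by a clustering step such as HDBSCAN. The source does not say how HDBSCAN noise points (label -1) are handled. In this sketch they are treated like any other cluster.

```python
import random
from collections import defaultdict


def density_bias_select(cluster_labels, max_per_cluster=600, seed=0):
    """Return sorted indices of the cells kept after capping each cluster.

    cluster_labels: one cluster id per cell.
    Clusters with more than max_per_cluster cells are randomly downsampled
    to that size; smaller clusters are kept whole.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_labels):
        members[cluster].append(idx)

    kept = []
    for indices in members.values():
        if len(indices) > max_per_cluster:
            kept.extend(rng.sample(indices, max_per_cluster))
        else:
            kept.extend(indices)
    return sorted(kept)


# Hypothetical cluster assignment: 1000 cells in cluster 0, 200 in cluster 1.
cluster_labels = [0] * 1000 + [1] * 200
kept = density_bias_select(cluster_labels)
print(len(kept))  # 600 + 200 = 800
```

Capping the dense clusters reduces their dominance in the new template, so sparse populations contribute relatively more when the template is rebuilt from the selected cells.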
Figure 4. An example of a single UMAP training set, comprising 5 different templates, each built from 4 samples
Figure 5. An example of the procedure for constructing a single training set in the density-bias approach
Page Leader: Meixian Wu