Currently, the most common way to diagnose leukemia is to manually draw gates on a patient's protein marker expression data (obtained by flow cytometry) and then analyze the protein expression level within the designated cancerous region. However, these gates are often hard-cut boundaries, because in many cases the visualized expression patterns do not show explicit clusters of abnormal cells. We therefore want a more reliable method that detects the cancerous region with better accuracy and precision than a rigid classification by linear thresholds.
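To make the limitation of a hard-cut gate concrete, the following is a minimal sketch of a rectangular (linear-threshold) gate on two marker channels. The marker names `CD5` and `CD19`, the gate bounds, and the event values are illustrative assumptions, not our actual gating settings or patient data.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class RectangularGate:
    """Axis-aligned gate on two marker channels (hypothetical helper)."""
    x_marker: str
    y_marker: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, event: Mapping[str, float]) -> bool:
        # An event is inside the gate only if both markers fall within bounds.
        return (self.x_min <= event[self.x_marker] <= self.x_max
                and self.y_min <= event[self.y_marker] <= self.y_max)


def gated_fraction(events: Sequence[Mapping[str, float]],
                   gate: RectangularGate) -> float:
    """Fraction of events that fall inside the gate."""
    if not events:
        return 0.0
    inside = sum(1 for event in events if gate.contains(event))
    return inside / len(events)


# Toy, made-up events with normalized expression values in [0, 1].
events = [
    {"CD5": 0.90, "CD19": 0.80},
    {"CD5": 0.51, "CD19": 0.70},  # just inside the CD5 threshold
    {"CD5": 0.49, "CD19": 0.70},  # just outside the CD5 threshold
    {"CD5": 0.10, "CD19": 0.20},
]
gate = RectangularGate("CD5", "CD19", 0.5, 1.0, 0.5, 1.0)
fraction = gated_fraction(events, gate)
print(f"Fraction of events inside the gate: {fraction:.2f}")
```

The second and third events are almost identical, yet the hard threshold puts them on opposite sides of the gate. This is the kind of rigid behavior a cluster-based method is meant to avoid.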
Problem 1: Finding a way to build a valid UMAP template that can be used for ALL dataset analysis
The current automated method was developed on training data obtained from CLL patients. Compared with ALL and other subtypes of leukemia, CLL is less variable and more likely to follow specific patterns. Therefore, we first need to find features that are consistent among patients so that clustering and diagnosis work for ALL samples as well.
Problem 2: Availability of current Machine learning/Deep learning algorithms.
We currently use either a UMAP transform or a deep learning transform (a CNN model). The clustering step is therefore UMAP-enhanced clustering, implemented with HDBSCAN. However, a newly released method may outperform the current clustering algorithm. In that case, most of the work would need to be updated accordingly, which is very time-consuming.
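One possible way to limit the cost of such updates is to keep the transform and clustering steps behind small interchangeable functions, so that replacing one algorithm does not touch the rest of the pipeline. The sketch below is only an illustration of that idea. The names `run_pipeline`, `first_two_channels`, and `threshold_on_first_axis` are hypothetical, and the two stand-in steps are toy placeholders, not UMAP, a CNN, or HDBSCAN.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
# An embedder maps raw events to a low-dimensional representation;
# a clusterer assigns one integer label per embedded point.
Embedder = Callable[[List[Point]], List[Point]]
Clusterer = Callable[[List[Point]], List[int]]


def run_pipeline(events: List[Point],
                 embed: Embedder,
                 cluster: Clusterer) -> List[int]:
    """Embed the events, then cluster the embedding (hypothetical helper)."""
    embedding = embed(events)
    labels = cluster(embedding)
    if len(labels) != len(events):
        raise ValueError("clusterer must return one label per event")
    return labels


def first_two_channels(events: List[Point]) -> List[Point]:
    # Toy stand-in for a real transform such as UMAP or a CNN encoder.
    return [tuple(event[:2]) for event in events]


def threshold_on_first_axis(points: List[Point], cut: float = 0.5) -> List[int]:
    # Toy stand-in for a real clustering algorithm such as HDBSCAN.
    return [1 if point[0] > cut else 0 for point in points]


# Made-up events with three channels each.
events = [(0.9, 0.8, 0.1), (0.2, 0.1, 0.7), (0.7, 0.9, 0.3)]
labels = run_pipeline(events, first_two_channels, threshold_on_first_axis)
print(labels)
```

With this layout, trying a newly released clustering method would mean writing one new function with the `Clusterer` signature and passing it to `run_pipeline`. The embedding step and the downstream analysis would stay unchanged.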
Problem 3: Adaptability of the algorithm across ALL and CLL samples
The current automated method was developed on training data obtained from CLL patients. Compared with ALL and other subtypes of leukemia, CLL is less variable and more likely to follow specific patterns. Our approach currently relies on finding consistent features among patients for clustering and diagnosis, so if no such pattern exists biologically, the automated approach becomes much harder to apply.
Problem 4: Shortage of patient samples
One of our approaches is implemented with a deep learning model, a CNN. However, a CNN usually requires 1000+ samples for training and testing to produce effective results. We currently have data from only about 300 CLL patients, which is already above the average sample size, so implementing the CNN approach is challenging while we are short of samples from the hospitals.
Page Leader: Meixian Wu