Single-cell Label & Differential Cluster Expression With FlowSOM
To obtain single-cell labels for the FCM data, we used FlowSOM, a self-organizing map algorithm. To find differential heterogeneous cancer clusters between patients through FlowSOM, we first established a baseline by quantifying the proportion of each patient's cells assigned to each defined cluster.
For the number of meta-clusters, a value of k = 40 was chosen to represent the different cell types usually found within blood samples (B cells, red blood cells, T cells, etc.). After obtaining the proportions for both positive and negative patients, we performed differential expression analysis.
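The clustering itself is typically run through the FlowSOM Bioconductor package; the baseline proportion step can be sketched as follows, assuming per-cell meta-cluster labels are already available (the function and example values here are illustrative, not from our pipeline):

```python
import numpy as np

def cluster_proportions(labels, k=40):
    """Fraction of a patient's cells assigned to each of k meta-clusters."""
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()

# Hypothetical example: 10 cells distributed among k = 4 meta-clusters.
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 3])
props = cluster_proportions(labels, k=4)
```

Computing this vector once per patient gives the per-class proportion tables that the differential analysis below operates on.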
Method
To analyze and classify a "cancer cluster", we followed two criteria:
Performing a Wilcoxon rank-sum test between all cancer samples and all healthy samples:
a. If the p-value of the rank-sum test was below a threshold of α = 0.05, the cluster was considered differentially expressed between classes (healthy and cancerous).
i. For CLL samples, only the two clusters with the most significant p-values were chosen from this test.
ii. For ALL samples, to account for their increased heterogeneity, all clusters with significant p-values were chosen.
Obtaining individual-patient heterogeneous cancer clusters through comparison against the healthy standard deviation:
i. The mean proportion of each cluster across all healthy patients was computed.
ii. The proportion of each cluster in each individual cancer patient was then compared against it.
iii. If a cluster in a cancer patient was three or more standard deviations away from the healthy mean, it was classified as a cancer cluster.
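The two criteria above can be sketched with NumPy and SciPy; the arrays and thresholds are illustrative stand-ins for our per-patient proportion tables:

```python
import numpy as np
from scipy.stats import ranksums

def differential_clusters(cancer_props, healthy_props, alpha=0.05):
    """Per-cluster Wilcoxon rank-sum test, cancer vs. healthy.
    Inputs are (patients, clusters) proportion arrays."""
    pvals = np.array([
        ranksums(cancer_props[:, c], healthy_props[:, c]).pvalue
        for c in range(cancer_props.shape[1])
    ])
    return np.where(pvals < alpha)[0], pvals

def patient_cancer_clusters(patient_props, healthy_props, n_sd=3):
    """Clusters where this patient lies >= n_sd SDs from the healthy mean."""
    mean = healthy_props.mean(axis=0)
    sd = healthy_props.std(axis=0)
    return np.where(np.abs(patient_props - mean) >= n_sd * sd)[0]

# Toy data: 4 healthy and 4 cancer patients, 2 clusters; cluster 0 differs.
healthy = np.array([[0.10, 0.50], [0.12, 0.48], [0.11, 0.52], [0.13, 0.49]])
cancer = np.array([[0.30, 0.51], [0.35, 0.47], [0.32, 0.53], [0.31, 0.50]])
sig, pvals = differential_clusters(cancer, healthy)
flagged = patient_cancer_clusters(np.array([0.60, 0.49]), healthy)
```

For the CLL case, the two clusters with the smallest entries in `pvals` would then be kept, e.g. via `np.argsort(pvals)[:2]`.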
[Figure: heterogeneous CLL samples, homogeneous CLL samples, and healthy samples]
Using an image-based CNN for patient label prediction
Initially, we used an image-based CNN for patient label prediction. For this model, the inputs were the plot images shown in the previous section, but without any visible cluster-identification annotations. The goal was to determine whether a deep learning model could identify cell distributions within these biologically relevant plots, which are the plots traditionally used to "gate".
According to most publicly available research on image-based CNNs, such as ResNet models trained on ImageNet, around 1,000 images per class are needed for effective training. Given that we had 288 total patients, with only 60% (171) belonging to the training set, we used sampling to mitigate this issue. Patients were randomly sampled at 80% of their cells (80,000 for a typical CLL flow cytometry tube) three separate times. For the training set, we used the following parameter sets to create multiple UMAP models:
Number of Neighbors: 30
Minimum distance: 0.2, 0.3, 0.5
This allowed us to obtain similar models whose images had similar layouts in UMAP space, but with slight differences that added "noise" to the data and helped avoid overfitting. Training ran for up to 80 epochs with early stopping monitoring validation loss.
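The repeated 80% subsampling can be sketched as below; the parameter sets mirror those listed above (in practice each set would be passed to umap-learn's `UMAP(n_neighbors=..., min_dist=...)` constructor, and the function and seed here are illustrative):

```python
import numpy as np

# The three UMAP parameter sets described in the text.
param_sets = [{"n_neighbors": 30, "min_dist": d} for d in (0.2, 0.3, 0.5)]

def subsample_cells(cells, frac=0.8, n_draws=3, seed=0):
    """Draw n_draws random subsets, each keeping frac of the patient's
    cells (without replacement within each draw)."""
    rng = np.random.default_rng(seed)
    n_keep = int(len(cells) * frac)
    return [cells[rng.choice(len(cells), size=n_keep, replace=False)]
            for _ in range(n_draws)]

# e.g. a patient with 100,000 cells and 4 measured markers
draws = subsample_cells(np.zeros((100_000, 4)))
```

Each draw would then be embedded and rendered into a separate training image.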
Results for the Image-Based CNN
The results we obtained from this model were not fruitful. The areas under the curve (AUC) for the testing and validation sets were 0.54 and 0.45, respectively, making this model no better than a random classifier. This is most likely because the images had too many "features", given the amount of detail within our cell-distribution plots. Even when given the UMAP plots individually, as pictured in Fig. 9, the model performed similarly. This type of plotted data may be too complex for the model to evaluate properly. We did observe that the model performed differently when predicting across the different UMAP parameter sets, with some representations being "favored" over others in the absence of sampling.
CytoFlowDX Method: Numerical input CNN
This model was created with reference to previous attempts at using numerical CNNs to analyze flow cytometry data. Deep learning models are usually less data-efficient than traditional models, meaning they need more data to obtain the same performance. To compensate for the limited data available for this approach, padding and sampling were applied to the training set only, creating "pseudo-patients" to simulate more data. Padding was applied only to patients with fewer than 100,000 cells, and the padding value was negative infinity so that a padding mask could be created and the padded values ignored during training. For sampling, using the single-cell labels obtained from FlowSOM, at least 0.01% of the cells within a sampled cancer patient had to be marked as positive for the sample to be considered valid. This follows the logic of a 0.01% cancer-cell proportion for a positive diagnosis, as we did not want to trade quality for quantity in our training dataset.
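The padding and sample-validity rules can be sketched as follows (a sketch with illustrative names; the masking itself would be handled by the framework's padding-mask mechanism, e.g. a masking layer keyed to the pad value):

```python
import numpy as np

def pad_cells(cells, target=100_000, pad_value=-np.inf):
    """Pad a patient's (cells, markers) matrix to a fixed length with
    pad_value rows, which a downstream mask can ignore."""
    n, d = cells.shape
    if n >= target:
        return cells
    pad = np.full((target - n, d), pad_value)
    return np.vstack([cells, pad])

def is_valid_sample(cell_labels, min_positive_frac=1e-4):
    """Keep a sampled cancer patient only if at least 0.01% of its cells
    carry a positive (cancer-cluster) FlowSOM label (1 = positive)."""
    return cell_labels.mean() >= min_positive_frac
```

Invalid draws would simply be re-sampled until the 0.01% condition held.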
Furthermore, a decaying learning rate was used, starting at 0.01 and decaying by a factor of 10 every 150 epochs. The total number of epochs was 1,000. As with the previous model, validation loss was monitored for early stopping, and the weights with the best validation loss were saved.
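The schedule amounts to a simple step-decay function (a sketch; the deep learning framework used is not stated in the text, so in practice this would be wrapped in that framework's scheduler callback):

```python
def learning_rate(epoch, initial=0.01, factor=10, step=150):
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return initial / factor ** (epoch // step)
```

So epochs 0-149 train at 0.01, epochs 150-299 at 0.001, and so on down through the 1,000-epoch run.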
Baseline
To set a performance baseline, we first tested the model using only UMAP coordinates. For the training set in this scenario, the coordinates from the three main UMAP models described above were used.
Version 1
Version 2