Credit: Tian Liu
Task 1: Weighted NSForest
Table 3: Comparing NSForest Performance Using Different Numbers of Estimators
Table 4: Diagonal Scores for Different Weights Applied Through Different Approaches
Figure 6: Comparison Between Results of Original NSForest and Weighted NSForest Shown in Violin Plots
Table 5: d-score Evaluating the Performance of Overall Dataset and 3 Target Subclades
Task 2: Large Cluster Bias
Figure 8. Scatter plot of F-beta scores (x=500) vs. F-beta scores (x=50). Cluster size is indicated by a color gradient: the darker the marker, the larger the cluster.
Table 6: Average F-beta scores resulting from various cluster size sampling thresholds imposed on 75 MTG clusters
Table 7. P-values obtained from the t-test and the Wilcoxon test over 10 NSForest runs with random subsampling
Discussions and Conclusions
Among all the weights we developed, using two different schemes and several scales, weight2, developed through scheme 1 (1 - scaled pairwise distance) on a scale of 0.05 to 0.5, has the best performance. We therefore use weight2 as the default weight in the weighted NSForest workflow for the MTG dataset. The approach for producing all the weights will be made available on GitHub for further study.
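As a rough illustration of scheme 1, the sketch below computes a weight for every pair of clusters as 1 minus the min-max scaled pairwise distance, then maps the result onto the 0.05 to 0.5 range. It is not the code used in this work: the function name `pairwise_weights`, the use of Euclidean distance between cluster centroids, the toy centroids, and the reading that the final weights (rather than the distances) span 0.05 to 0.5 are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def pairwise_weights(centroids, lo=0.05, hi=0.5):
    """Return {(a, b): weight} for every unordered pair of clusters.

    centroids: dict mapping cluster name -> coordinate tuple (hypothetical input).
    lo, hi:    assumed bounds of the final weight scale.
    """
    pairs = list(combinations(sorted(centroids), 2))
    # Euclidean distance between centroids (an assumption for this sketch).
    distances = {p: dist(centroids[p[0]], centroids[p[1]]) for p in pairs}
    d_min, d_max = min(distances.values()), max(distances.values())
    span = (d_max - d_min) or 1.0  # avoid dividing by zero if all distances are equal

    weights = {}
    for pair, value in distances.items():
        scaled = (value - d_min) / span   # scaled pairwise distance in [0, 1]
        closeness = 1.0 - scaled          # scheme 1: 1 - scaled pairwise distance
        weights[pair] = lo + (hi - lo) * closeness
    return weights


# Toy centroids: A and B are close neighbours, C is far from both.
centroids = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
weights = pairwise_weights(centroids)
```

With these toy centroids, the closest pair (A, B) receives the largest weight and the most distant pair (A, C) the smallest, which matches the intent that confusing neighboring clusters should be penalized more heavily.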
From Table 3, we conclude that increasing the number of estimators used in the random forest improves the overall performance of the NSForest workflow in identifying distinctive molecular biomarkers for each cell type. This is because the random forest algorithm relies on bagging: each decision tree is trained on a random subset (a bootstrap sample) of the whole training dataset, and the final result aggregates the trees' decisions by majority vote. With more trees (estimators), we can build a more generalized model with greater accuracy, at the cost of more computation time.
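The bagging idea described above can be sketched in a few lines of plain Python. This is a toy illustration only, not NSForest or a full random forest: one-feature decision stumps stand in for decision trees, the per-split feature subsampling of a real random forest is omitted, and the names `fit_stump`, `fit_bagged_stumps`, and `predict` as well as the synthetic data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold on a 1-D feature that makes the fewest errors on `sample`."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        # The stump predicts label 1 when x > threshold.
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(data, n_estimators, rng):
    """Train each stump on a bootstrap sample (drawn with replacement) of `data`."""
    stumps = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(data) for _ in data]
        stumps.append(fit_stump(bootstrap))
    return stumps


def predict(stumps, x):
    """Aggregate the stumps' decisions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic data: label is 1 when x > 0.5, with 15% of labels flipped as noise.
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.15:
        label = 1 - label
    data.append((x, label))

# Ensembles of increasing size; training cost grows linearly with n_estimators.
forests = {n: fit_bagged_stumps(data, n, rng) for n in (1, 5, 25)}
```

Each additional estimator adds one more bootstrap sample and one more vote, which is why larger ensembles tend to be more stable but take proportionally longer to train.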
Table 5 compares the overall performance, and the performance on the 3 target subclades with many neighboring clusters, between the original NSForest workflow and the weighted NSForest workflow. The diagonal scores indicate that performance on the PVALB subclade increased by 50% without sacrificing overall performance. However, the weighted NSForest does not significantly improve performance on the other two target subclades, including L4. A possible explanation for why the weighted NSForest improves PVALB more than the other two subclades is that the clusters in the PVALB subclade have relatively greater pairwise weights. In Figure 7, lighter colors are assigned to clusters with shorter pairwise distances, and the 3 target subclades we want to improve are highlighted. The PVALB subclade has a relatively lighter color, indicating shorter pairwise cluster distances and thus greater pairwise weights. Because greater weights are assigned to PVALB, a larger penalty is applied when an incorrect decision is made in classifying clusters within the PVALB subclade. This might be why the weighted NSForest workflow performs better on PVALB than on the other 2 target subclades.