Single-cell/nuclei RNA sequencing (sc/sn-RNA-seq) has revolutionized the discovery and understanding of cell phenotypes and states, and marker genes are integral to identifying cell types. NSForest (Necessary and Sufficient Forest) is a machine learning-based program that uses the random forest ensemble learning method to find the minimal set of marker genes required to identify cell types. However, the current implementation of NSForest (v3.9.2) struggles with subclade specificity among hierarchically close clusters, suffers from cluster size bias, and lacks accessibility. To address subclade specificity, we introduced several weighting schemes based on cluster distance that bias marker selection toward hierarchically close clusters. We evaluated them with metrics such as the diagonal score, a weighted mean of binary scores that can be used to assess the overall quality of a set of NSForest-selected markers. The weighting schemes produced only a marginal improvement in overall marker selection quality, but a 50% improvement in the diagonal score for one of the three target subclades. To reduce cluster size bias and computational cost, we also tested subsampling larger clusters down to a size threshold. Thresholding to the median cluster size generally improved F-beta scores, from 0.699 to 0.716. However, removing samples from large clusters reduced the consistency of NSForest marker gene outputs, and the impact of this remains to be determined. The program has now been officially released on PyPI and conda for easy installation. Future work includes refining the weighting scheme to optimize performance, integrating cluster thresholding into the official NSForest workflow, and bringing the package into closer adherence to Python packaging standards.
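The median-size thresholding described above can be illustrated with a minimal sketch. This is not NSForest's implementation or API: the function name `subsample_to_median` is hypothetical, and uniform random subsampling without replacement is an assumption, since the abstract does not state how cells were chosen for removal.

```python
import numpy as np

def subsample_to_median(labels, seed=0):
    """Return sorted cell indices with each cluster capped at the median cluster size.

    Hypothetical helper for illustration only; assumes uniform random
    subsampling without replacement within each oversized cluster.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    clusters, counts = np.unique(labels, return_counts=True)
    cap = int(np.median(counts))  # threshold = median cluster size
    keep = []
    for cluster, count in zip(clusters, counts):
        idx = np.flatnonzero(labels == cluster)
        if count > cap:
            # Only clusters larger than the threshold lose cells.
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Toy example: three clusters with 10, 4, and 2 cells.
labels = ["A"] * 10 + ["B"] * 4 + ["C"] * 2
kept = subsample_to_median(labels)
kept_labels = np.asarray(labels)[kept]
```

In this toy example the median cluster size is 4, so only cluster "A" is reduced; smaller clusters are left untouched. Because the retained cells depend on the random seed, repeated runs with different seeds can select different cells, which is one plausible source of the reduced output consistency noted above.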
We would like to express our sincere gratitude to our mentors from JCVI, Drs. Renee Zhang and Richard H. Scheuermann, for their guidance and support throughout the project. Thank you to Dr. Wheeler and our wonderful TA Noah Mehringer!
Credit: All Team Members