My research interests have always centered on developing computational tools to aid the study of molecular biology or healthcare. During my PhD, I primarily worked on developing Machine Learning algorithms to enable the study of biological phenomena such as development and disease progression. After my PhD, my research has centered on two broad themes: i) development of algorithms to study and predict health outcomes, and ii) development of algorithms that enable collaborative science. A more granular view of my research can be found below.

Privacy preserving synthetic data sharing

Fairness implications of the practical use of private synthetic data for ML

  • M. Pereira*, M. Kshirsagar*, S. Mukherjee*, R. Dodhia, J.L. Ferres, "An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises", Submitted to ICML workshop on Machine Learning for Data: Automated Creation, Privacy, Bias [Arxiv].

Federated generative modeling with a central adversary

  • J-F. Rajotte, S. Mukherjee, et al. "Reducing bias and increasing utility by federated generative modeling of medical images using a centralized adversary", Submitted to ACM GoodIT [Arxiv].

Flexible membership privacy estimation framework for generative models

  • X. Liu, Y. Xu, S. Tople, J.L. Ferres*, S. Mukherjee*. "MACE: A Flexible Framework for Membership Privacy Estimation in Generative Models". Submitted to USENIX [Arxiv]. *co-senior author

Novel GAN architecture to prevent membership inference attacks

  • S. Mukherjee, Y. Xu, A. Trivedi, J. L. Ferres, 'privGAN: Protecting GANs from membership inference attacks at low cost'. [Arxiv]. (PETS 2021)

AI for social good

ML pipeline for school identification from child trafficking images

  • Sumit Mukherjee, et al. "A machine learning pipeline for aiding school identification from child trafficking images." Submitted to ACM GoodIT [Arxiv].

COVID symptom tracking survey

  • B. E. Dixon, S. Mukherjee, et al. "Capturing COVID-19 Symptoms At-Scale using Banner Ads: A Novel Survey Methodology Pilot using an Online News Platform". JMIR [Preprint].

Predicting drug overdose deaths using search trends

  • S. Mukherjee*, B. Weeks*, N. Becker, J.L. Ferres. "Using internet search trends to opioid deaths in Connecticut". Major revision at Journal of General Internal Medicine.

  • S. Mukherjee, N. Becker, B. Weeks, J.L. Ferres "Using internet search trends to forecast short term drug overdose deaths: A case study on Connecticut". Accepted at ICMLA 2020 [Pdf]

Issues with using open data from developing nations

  • S. Mukherjee*, A. Trivedi*, E. Tse*. 'Risks of Using Non-verified Open Data: A case study on using Machine Learning techniques for predicting Pregnancy Outcomes in India'. Accepted to NeurIPS workshop on ML for Developing World 2019 [Arxiv].

Computational tools & analysis to identify genetic drivers of Alzheimer's Disease

My work at Sage Bionetworks focused on developing novel Machine Learning tools to identify drivers of AD. I have been studying the use of manifold learning on RNA-Seq data from post-mortem brain samples of AD patients and controls to predict disease staging and subtypes. Downstream applications of this technique include studying the effect on biological pathways as a function of disease progression and subtype.

  • S. Mukherjee, ..., L. Mangravite, B. Logsdon. "Molecular estimation of neurodegeneration pseudotime in older brains". Accepted at Nature Communications [BioRxiv].

  • S. Mukherjee, T. Perumal, K. Daily, S. Sieberts, C. Preuss, G. Carter, L. Mangravite, B. Logsdon. "Identifying and ranking potential driver genes of Alzheimer’s disease using multiview evidence aggregation". In Bioinformatics (ISMB/ECCB 2019 issue) [BioRxiv].

  • B. Logsdon, ..., S. Mukherjee, ..., L. Mangravite. "Meta-analysis of the human brain transcriptome identifies heterogeneity across human AD coexpression modules robust to sample collection and methodological approach". Appearing in Cell Reports [Biorxiv].

Algorithmic tools for Single Cell RNA-Seq analysis

During my PhD, I developed a novel algorithmic tool for sampling-distribution-aware preprocessing of sparse Single Cell RNA-Seq (scRNA-Seq) data [1]. Our tool (UNCURL) can also incorporate qualitative prior biological information into the preprocessing pipeline and was shown to improve the performance of downstream analytic tasks such as clustering and lineage inference. Additionally, the algorithm was shown to scale easily and has led to the development of a web app for online scRNA-Seq data analysis. I also developed a novel algorithm (PIPER) for driver gene prediction from stage-wise scRNA-Seq data [2]. This tool used state-of-the-art network inference techniques for sparse data types and performed differential network analysis to predict the most likely drivers of biological processes.

  • S. Mukherjee, A. Carignano, G. Seelig, and S.I. Lee. "Identifying progressive gene network perturbation from single-cell RNA-seq data." In transactions of IEEE EMBC (2018). [Biorxiv]

  • S. Mukherjee*, Y. Zhang*, J. Fan, G. Seelig, and S. Kannan. "Scalable preprocessing for sparse scRNA-seq data exploiting prior knowledge." Bioinformatics 34, no. 13 (2018): i124-i132. [Pdf]

  • Y. Zhang, S. Mao, S. Mukherjee, S. Kannan, G. Seelig, 'UNCURL-App: Interactive Database-Driven Analysis of scRNA-Seq Data'. Under major revision at OUP Bioinformatics [Biorxiv].

Novel low-cost protocol for Single Cell RNA-Seq

During the early stages of my PhD, I worked on the development of a low-cost scRNA-Seq protocol (Split-Seq) [1]. I was involved in both the experimental protocol development and the initial data analysis for the project. Our method was shown to scale to hundreds of thousands of cells and had a lower cost-per-cell than most competing methods.

  • A. Rosenberg, ..., S. Mukherjee, ..., G. Seelig. "Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding". Science 360, no. 6385 (2018): 176-182. [Biorxiv]

Studying noise-rejection properties of microRNA mediated network motifs

I worked on a systematic mathematical analysis of the noise-rejection properties of microRNA-mediated incoherent feed-forward loops. Our analysis focused on elucidating the noise-rejection properties of this class of network motifs at the population level (extrinsic noise), which led to the identification of parameter spaces where the noise is most efficiently rejected [1].

  • A. Carignano*, S. Mukherjee*, A. Singh, and G. Seelig. "Extrinsic Noise Suppression in Micro RNA Mediated Incoherent Feedforward Loops." Accepted to IEEE CDC (2018). [Biorxiv]

A routing app for pedestrians with disabilities

I worked on the development of an algorithmic method for identifying a connected network of sidewalks from publicly curated municipal data from King County [1]. This connected network of sidewalks was then used in conjunction with sidewalk attributes, such as elevation gain/loss and the presence of curb ramps, to develop a customizable routing app for pedestrians with disabilities.

  • Bolten, N., S. Mukherjee, V. Sipeeva, A. Tanweer, and A. Caspi. "A pedestrian-centered data approach for equitable access to urban infrastructure environments." IBM Journal of Research and Development 61, no. 6 (2017): 10-1. [Pdf]