Research

We are interested in developing machine learning models for medical applications, building computational tools and software for analyzing high-throughput biological data (NGS, mass spectrometry), and using advanced computational approaches to understand human disease and biology.

Our interests include:

  • Big Data and Artificial Intelligence

  • Mass-spectrometry based proteomics and proteogenomics analysis

  • Comparative Genomics and Transcriptomics

  • Biological Databases development

Our ongoing research projects are summarized below.

We have developed several software tools and databases. The software includes "GenoCluster" for protein-coding gene identification, "PLHost" for protein function assignment, the machine learning tool "Pro-Gyan" for protein classification, and "MassWiz", "Proteostat", "Genosuite" and "EuGenosuite" for proteomics research. The databases include "VitiVar" for vitiligo disease biology, "HuBSProt" for brain proteoforms, and "IGVdb" for Indian genome variations.

Research Projects

RAPID-CT project for CT scan triage and diagnosis

RAPID-CT aims to reduce the turnaround time for triage of medical images and to automatically prioritize patients according to their condition in a clinical setting. As an initial case study, we selected the detection and diagnosis of intracranial hemorrhage (ICH) from computed tomography (CT) scans of the head. RAPID-CT also aims to provide software for assessing CT scans in remote settings. We used publicly available patient samples from RSNA for training and collected patient samples from an Indian radiologist as a test set. We then built modules for anonymization and standardization of CT scans, and for data collection and model inference, which we integrated into a web portal. The models we built for detecting ICH presence and subtypes, and for localizing ICH on CT slices, achieved accuracies of 95.6% and 98%, respectively, for ICH detection. In the future, we plan to (i) build better models for ICH detection, diagnosis and localization, including patient-level detection models; (ii) study the generalizability of the models across multiple hospitals in a clinical setting; and (iii) translate the skills and toolsets developed for ICH to other use cases.

CovBaseAI: COVID-19 detection and diagnosis

The CovBaseAI project aims to develop an AI classifier for the detection and diagnosis of COVID pneumonia from chest X-rays (CXRs). The classifier is an ensemble of three deep learning (DL) modules feeding into an expert decision system. The DL modules perform pathology classification, lung segmentation, and opacity detection, and are explainable to the extent of an activation map output. The expert decision system is a rule-based classifier that assigns each X-ray to one of three classes (COVID-unlikely, indeterminate, or COVID-likely); it is fully explainable and can be modified as needed. We validated the lung segmentation algorithm on a test set of 100 chest X-rays drawn from a pool of 1000 X-rays in the RSNA dataset, obtaining a Jaccard similarity index of 0.91. We validated the lung opacity detection module on 1012 test X-ray images from the RSNA Kaggle dataset. Object detection quality is commonly measured by mean average precision (mAP); the opacity detection module achieved an mAP of 0.34. On the quarantine-center data, CovBaseAI had an accuracy of 87% and a negative predictive value of 98% for COVID pneumonia. However, sensitivity varied from 0.66 to 0.90 depending on whether RT-PCR or radiologist opinion was used as ground truth. Since the CovBaseAI initiative began, several new datasets for SARS-CoV detection have become publicly available and need to be incorporated into the algorithm. We also plan to compare the efficacy of CXRs with that of CT scans for the detection of COVID-19 and to build a better version of the model.
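
The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the CovBaseAI code: the function names `jaccard_index` and `screening_metrics` are hypothetical, masks are assumed to be binary grids of 0/1 values, and the numbers in the usage lines are made-up toy values rather than project results.

```python
from typing import Dict, Sequence


def jaccard_index(pred: Sequence[Sequence[int]],
                  truth: Sequence[Sequence[int]]) -> float:
    """Jaccard similarity |A n B| / |A u B| between two binary masks."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity and negative predictive value from
    confusion-matrix counts (true/false positives and negatives)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity: fraction of truly positive cases that were flagged.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # NPV: fraction of negative predictions that were truly negative.
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,
    }


# Toy inputs for illustration only (not project data).
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
reference_mask = [[0, 1, 0],
                  [0, 1, 1]]
print(jaccard_index(predicted_mask, reference_mask))
print(screening_metrics(tp=8, fp=2, tn=85, fn=5))
```

Because sensitivity depends on which cases count as truly positive, recomputing `screening_metrics` against a different reference standard (for example RT-PCR versus radiologist opinion) can change it substantially, consistent with the range reported above.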

Interstitial Lung Disease (ILD) identification from CT scans

The ILD project aims to develop a system that acquires raw and processed CT data for texture-based classification and quantification of interstitial lung diseases (ILDs). The data would be processed in real time with texture analysis algorithms, and the results converted into a quantitative score or rating. The system would then be validated on a separate, held-out set of test data. We developed deep learning-based models for segmenting lung lobes and classifying ILD patterns using the publicly available multimedia database of ILDs from the University Hospital of Geneva, which consists of 109 HRCT scans of different ILDs. In addition, we have collaborated with AIIMS, Delhi, to collect and annotate lung CTs of different ILDs. We expect that training on this local data will improve the performance of the models and make the system more generalizable. We are also exploring data augmentation using GANs to address data scarcity, and we aim to make the models interpretable so that the patterns responsible for classification and quantification of the disease can be identified. The project also aims to deploy the developed algorithm online to ease access, even in remote areas.

VitiVar: A compendium of genes and variants associated with Vitiligo

Vitiligo is a complex autoimmune skin disorder characterized by patchy loss of pigmentation from the skin. Although vitiligo is non-fatal, it has a major psychosocial impact on patients' quality of life. With the advent of multi-omics datasets, we aim to understand vitiligo genetics and disease pathogenesis using computational and integrative genomic approaches.

To facilitate this, we systematically catalogued genetic studies on vitiligo and created a disease-centric web portal that compiles information from 202 genetic studies in the form of 322 genes and 254 variations, along with their associated details. To make the resource comprehensive and more useful, we integrated an in-house vitiligo transcriptomics dataset and skin cell type-specific information. Users can draw on these datasets to formulate new testable hypotheses or to prioritize their candidate gene lists.

Ongoing work aims to identify the genetic components of vitiligo and how they affect overall disease pathology, using multi-omics datasets generated in-house.

HuBSProt: The brain tissue-specific proteoforms

The human brain is a complex network of structural and functional systems. Its complexity is largely governed by the proteins it expresses. Alterations or mutations at different levels, from genome to proteome, give rise to various proteoforms. Various studies have shown that proteoforms not only show distinctive tissue specificity but also lead to variability in phenotypic traits. The peptide diversity that gives rise to different proteoforms is important in several neurological disorders, such as Alzheimer's disease. Proteoforms cannot be predicted directly from the genome, but proteogenomics, which integrates genome and transcriptome data with proteomics data, can reveal deeper insights into the identity and tissue-specific expression of novel proteoforms. Thus, a detailed proteogenomic study analyzing proteins across different tissues of the human brain can provide information on how proteoforms relate to human biology and disease. We conducted deep proteogenomic profiling of publicly available proteomics data from various regions of the brain, including the cerebrum, substantia nigra, pituitary, temporal lobe, corpus callosum and hippocampus, to create a comprehensive landscape of brain tissue-specific proteoforms. We developed a proteoform identification pipeline for bottom-up proteomics data that incorporates information from different proteomic and transcriptomic sources, such as neXtProt and GENCODE, in the search database. Using 25 brain MS datasets corresponding to different regions of the brain, we identified proteoforms that exhibited distinct patterns of expression in the different tissues. The resulting data were compiled as HuBSProt, a dedicated MS-level data resource for finding and comparing proteoforms, intended for the brain proteomics community. HuBSProt can be used as a healthy reference brain proteoform map to identify the proteoforms expressed in various neurological disorders.

Ayurgenomics: Uncovering patterns of genetic variation using ayurveda

Genetic variations are known to affect molecular traits and thereby the ultimate phenotype, in both healthy and diseased states. However, connecting genotype to phenotype in a structured manner is a challenging task. The deep phenotyping strategy of Ayurveda subgroups healthy individuals based on their physical, physiological, anatomical and psychological attributes. We explore cohorts of healthy individuals stratified using this deep phenotyping approach to understand patterns of genetic variation and their association with diseases and traits. We use statistical methods and machine learning approaches to integrate large-scale genomics data with phenotypic traits.

Protein Interaction Studies using computational approaches

India harbors rich biodiversity, accounting for approximately 7% of the world's plant species: 21907 species of plants, of which 4720 (~21%) are expected to be endemic to India. This immense variety of genes and metabolites is a rich universe in itself for biological exploration. Several studies have attempted to bring this information together in a single comprehensive resource, but both the number of such studies and the scope of the information they yield are limited. Our study proposes to explore the molecular mechanisms behind the anti-pathogenic effects of Indian plants and their extracts using computational approaches. Indian medicinal plants harbor many metabolites that can interact with pathogens, inhibiting their growth and minimizing the production of toxins. Knowledge of metabolite-protein interactions will accelerate plant extract-based drug development and improve our understanding of the mechanisms of anti-pathogenic action of Indian medicinal and non-medicinal plants, and can therefore benefit biomedical research. Our work focuses on understanding the 3D structures of pathogen proteins and predicting the plant components that interact with them. Such information can be extended to give additional insights into the geographical and taxonomic distribution of plants and of their metabolites with binding affinity for pathogen proteins. Our work will also create a platform for further research into drugs against infectious diseases caused by multidrug-resistant pathogen variants.