Research

My primary research interests include:

Leveraging Transformers for Personalized Biomedicine
Designing AI-driven applications for safe and efficient immunotherapies (CAR-T cells, TCR-T cells and cancer vaccines).
Role of inflammatory cell death in cancer and non-communicable diseases.

Past:

Sampling: Exploring Sampling Techniques for Representative Subset Selection from Large Scale Datasets
Big Data: Designing Tools based on Kernel Methods for Big Data Learning
Visualization: Designing Visualization Tools for Evolutionary Datasets like Time-Series.
Deep Learning for Protein Property Prediction: Solubility, Crystallization, Toxicity
Computational Network Biology: Differential Network Analysis, Gene Regulatory Network Reconstruction & Disease Module Identification in Cellular Networks

VISH-Pred: an ensemble of fine-tuned ESM2 models for protein toxicity prediction

Motivation: Peptide- and protein-based therapeutics are becoming a promising treatment regimen for myriad diseases. Toxicity of proteins is the primary hurdle for protein-based therapies. Thus, there is an urgent need for accurate in silico methods for determining toxic proteins to filter the pool of potential candidates. At the same time, it is imperative to precisely identify non-toxic proteins to expand the possibilities for protein-based biologics.

Methods: To address this challenge, we proposed an ensemble framework, called VISH-Pred, comprising models built by fine-tuning ESM2 transformer models on a large, experimentally validated, curated dataset of protein and peptide toxicities. The framework has three primary steps: it estimates protein toxicity efficiently from the protein sequence alone, employs an undersampling technique to handle the severe class imbalance in the data, and learns representations from fine-tuned ESM2 protein language models, which are then fed to machine learning techniques such as LightGBM and XGBoost.
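The class-imbalance step can be illustrated with a minimal sketch. The description above does not say which undersampling technique is used, so the code below assumes plain random undersampling of the majority class; the function name, the toy sequence identifiers and the seed are all hypothetical.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class is as small as the rarest class.

    Assumption: simple random undersampling; the actual VISH-Pred
    procedure may differ.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced

# Toy data: 2 "toxic" (label 1) versus 8 "non-toxic" (label 0) placeholders.
sequences = [f"SEQ{i}" for i in range(10)]
labels = [1, 1] + [0] * 8
balanced = random_undersample(sequences, labels)
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

In a pipeline of this kind, the balanced subset would be the input to embedding extraction and to the downstream gradient boosting classifiers.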

Results: The VISH-Pred framework correctly identifies both peptides/proteins with potential toxicity and non-toxic proteins, achieving a Matthews correlation coefficient of 0.737, 0.716 and 0.322 and an F1-score of 0.759, 0.696 and 0.713 on three non-redundant blind tests, respectively, outperforming other methods by over 10% on these quality metrics. Moreover, VISH-Pred achieved the best accuracy and area under the receiver operating characteristic curve scores on these independent test sets, highlighting the robustness and generalization capability of the framework.
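For readers unfamiliar with the two headline metrics, the sketch below computes the Matthews correlation coefficient and the F1-score from a binary confusion matrix using their standard definitions. The counts are made up to mimic an imbalanced test set and are not taken from the paper.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts: 50 toxic and 950 non-toxic test proteins.
tp, fp, tn, fn = 40, 10, 940, 10
score_mcc = mcc(tp, fp, tn, fn)
score_f1 = f1_score(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(score_mcc, score_f1, accuracy)
```

With such an imbalance, accuracy stays high even when a fifth of the toxic proteins are missed, which is why MCC and F1 are the more informative figures here.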

Availability and implementation: By making VISH-Pred available as an easy-to-use web server, we expect it to serve as a valuable asset for future endeavors aimed at discerning the toxicity of peptides and enabling efficient protein-based therapeutics.

Publication: Raghvendra Mall, Ankita Singh, Chirag N. Patel, Gregory Guirimand, and Filippo Castiglione. "VISH-Pred: an ensemble of fine-tuned ESM models for protein toxicity prediction." Briefings in Bioinformatics 25, no. 4 (2024).

Tools Used: Python, PyTorch, PyTorch Geometric

Web-server: http://ec2-35-170-123-194.compute-1.amazonaws.com:7860/

Comparative analysis identifies genetic & molecular factors associated with prognostic clusters of PANoptosis in cancer

Motivations: The importance of inflammatory cell death, PANoptosis, in cancer is increasingly being recognized. PANoptosis can promote or inhibit tumorigenesis in context-dependent manners, and a computational approach leveraging transcriptomic profiling of genes involved in PANoptosis has shown that patients can be stratified into PANoptosis High and PANoptosis Low clusters that have significant differences in overall survival for low grade glioma (LGG), kidney renal cell carcinoma (KIRC) and skin cutaneous melanoma (SKCM). However, the molecular mechanisms that contribute to the differential prognosis between PANoptosis clusters require further elucidation.

Methods: Therefore, we performed a comprehensive comparison of genetic, genomic, tumor microenvironment, and pathway characteristics between the PANoptosis High and PANoptosis Low clusters to determine the relevance of each component in driving the differential associations with prognosis for LGG, KIRC and SKCM.

Results: Across these cancer types, we found that activation of the proliferation pathway was significantly different between PANoptosis High and Low clusters. In LGG and SKCM, we also found that aneuploidy and immune cell densities and activations contributed to differences in PANoptosis clusters.

In individual cancers, we identified important roles for barrier gene pathway activation (in SKCM) and the somatic mutation profiles of driver oncogenes as well as hedgehog signaling pathway activation (in LGG).

Conclusion: Identifying these genetic and molecular factors may help improve the prognosis for at-risk patient populations stratified by the PANoptosis phenotype in LGG, KIRC and SKCM. This not only advances our mechanistic understanding of cancer but will also allow for the selection of optimal treatment strategies.

Data and Code Availability: All code and relevant data are publicly available on Mendeley (https://doi.org/10.17632/7x237xf2m3.1)

Tools Used: R

Pancancer Transcriptomic Profiling Identifies Key PANoptosis Markers as Therapeutic Targets for Oncology

Motivations: Resistance to programmed cell death (PCD) is a hallmark of cancer. While some PCD components are prognostic in cancer, the roles of many molecules can be masked by redundancies and crosstalk between PCD pathways, impeding the development of targeted therapeutics. Recent studies characterizing these redundancies have identified PANoptosis, a unique innate immune-mediated inflammatory PCD pathway that integrates components from other PCD pathways.

Methods: Here, we designed a systematic computational framework to determine the pancancer clinical significance of PANoptosis and identify targetable biomarkers.

Results: We found that high expression of PANoptosis genes was detrimental in low grade glioma (LGG) and kidney renal cell carcinoma (KIRC). ZBP1, ADAR, CASP2, CASP3, CASP4, CASP8 and GSDMD expression consistently had negative effects on prognosis in LGG across multiple survival models, while AIM2, CASP3, CASP4 and TNFRSF10 expression had negative effects for KIRC. Conversely, high expression of PANoptosis genes was beneficial in skin cutaneous melanoma (SKCM), with ZBP1, NLRP1, CASP8 and GSDMD expression consistently having positive prognostic effects. As a therapeutic proof-of-concept, we treated melanoma cells with combination therapy that activates ZBP1 and showed that this treatment induced PANoptosis.

Conclusion: Overall, through our systematic framework, we identified and validated key innate immune biomarkers from PANoptosis which can be targeted to improve patient outcomes in cancers.

Data and Code Availability: All code and relevant data are publicly available on Mendeley (doi: 10.17632/5drb9c5y9h.2)

Tools Used: R

DeepRepurpose: A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity

Motivation: A global effort is underway to identify compounds for the treatment of COVID-19. Since de novo compound design is an extremely time-consuming and expensive process, researchers are working to discover existing compounds that can be repurposed for COVID-19 and new viral diseases.

Results: Our consensus framework achieves a high mean Pearson correlation of 0.916, a mean R² of 0.840 and a low mean squared error of 0.313 for the task of compound-viral protein activity prediction on an independent test set. As a use case, we identify a ranked list of 47 compounds common to three main proteins of the SARS-CoV-2 virus (PL-PRO, 3CL-PRO and Spike protein) as potential targets, including 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds. We perform additional molecular docking simulations to demonstrate that the majority of these compounds have low binding energies and thus high binding affinity, with the potential to be effective against the SARS-CoV-2 virus.
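The three regression metrics quoted above can be computed as in the following sketch, which uses the standard definitions of Pearson correlation, the coefficient of determination and mean squared error. The activity values are invented for illustration; the toy case shows that a constant offset leaves the Pearson correlation perfect while lowering R².

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (Pearson r, R^2, MSE) for paired true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    pearson = cov / math.sqrt(var_t * var_p)
    r2 = 1.0 - sse / var_t
    mse = sse / n
    return pearson, r2, mse

# Hypothetical activity values; predictions are shifted by a constant 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
pearson, r2, mse = regression_metrics(y_true, y_pred)
print(pearson, r2, mse)
```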

Availability and implementation: A webserver is available for the prediction task. The standalone source code and models are available here.

Publication: Raghvendra Mall, Abdurrahman Elbasir, Hossam Almeer, Zeyaul Islam, Prasanna R. Kolatkar, Sanjay Chawla, and Ehsan Ullah. "A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity." Bioinformatics (2021).

Tools Used: Python, PyTorch, PyTorch Geometric

Source Code: DeepPurpose

DeepSol: a deep learning framework for sequence-based protein solubility predictions

Motivation: Protein solubility plays a vital role in pharmaceutical research and production yield. For a given protein, the extent of its solubility can represent the quality of its function, and is ultimately defined by its sequence. Thus, it is imperative to develop novel, highly accurate in silico sequence-based protein solubility predictors. In this work, we propose DeepSol, a novel deep learning-based protein solubility predictor. The backbone of our framework is a convolutional neural network that exploits frequent k-mers and additional sequence and structural features extracted from the protein sequence.
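To make the notion of frequent k-mers concrete, here is a minimal sketch that counts overlapping k-mers across a set of sequences and keeps those above a count threshold. The threshold, the toy sequences and the function names are hypothetical; how DeepSol actually selects k-mers and feeds them to the convolutional network is not specified above.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count all overlapping substrings of length k in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def frequent_kmers(sequences, k, min_count):
    """Keep k-mers whose total count over all sequences reaches min_count."""
    total = Counter()
    for seq in sequences:
        total.update(kmer_counts(seq, k))
    return {kmer: count for kmer, count in total.items() if count >= min_count}

# Two made-up amino-acid strings; real inputs would be full protein sequences.
toy_sequences = ["MKTAYIAK", "MKTLLAK"]
frequent = frequent_kmers(toy_sequences, k=2, min_count=2)
print(frequent)
```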

Results: DeepSol outperformed all known sequence-based state-of-the-art solubility prediction methods and attained an accuracy of 0.77 and a Matthews correlation coefficient of 0.55. The superior prediction accuracy of DeepSol allows screening for sequences with enhanced production capacity and more reliable prediction of the solubility of novel proteins.

Availability and implementation: DeepSol’s best performing models and results are publicly deposited at https://doi.org/10.5281/zenodo.1162886 (Khurana and Mall, 2018)

Publication: Sameer Khurana, Reda Rawi, Khalid Kunji, Gwo-Yu Chuang, Halima Bensmail, and Raghvendra Mall. "DeepSol: a deep learning framework for sequence-based protein solubility prediction." Bioinformatics (2018).

Tools Used: Python, TensorFlow

Source Code: DeepSol

Network-based identification of key master regulators associated with an immune-silent cancer phenotype

Motivation: A cancer immune phenotype characterized by an active T-helper 1 (Th1)/cytotoxic response is associated with responsiveness to immunotherapy and favorable prognosis across different tumors. However, in some cancers, such an intratumoral immune activation does not confer protection from progression or relapse. Defining mechanisms associated with immune evasion is imperative to refine stratification algorithms, to guide treatment decisions, and identify candidates for immune-targeted therapy. Molecular alterations governing mechanisms for immune exclusion are still largely unknown. The availability of large genomic datasets offers an opportunity to ascertain key determinants of differential intratumoral immune response.

Results: We follow a network-based protocol to identify transcription regulators (TRs) associated with poor immunologic antitumor activity. We use a consensus of 4 different pipelines consisting of two state-of-the-art gene regulatory network inference techniques, Regularized Gradient Boosting Machines (RGBM) and ARACNE to determine TR regulons, and three separate enrichment techniques, including fast gene-set enrichment analysis (FGSEA), gene set variation analysis (GSVA), and virtual inference of protein-activity by enriched regulon analysis (VIPER) to identify the most important TRs affecting immunologic anti-tumor activity. These TRs, referred to as Master Regulators (MRs), are unique to immune-silent and immune-active tumors respectively. We validated the MRs coherently associated with the immune-silent phenotype across cancers in The Cancer Genome Atlas (TCGA) and a series of additional datasets in the PREdiction of Clinical Outcomes from Genomic Profiles (PRECOG) repository.

Downstream analysis of MRs specific to the immune-silent phenotype resulted in the identification of several enriched candidate pathways, including NOTCH1, TGF-β, Interleukin-1, and TNF-α signaling pathways. TGFB1I1 emerged as one of the main negative immune modulators preventing the favorable effects of a Th1/cytotoxic response.

Availability and implementation: All code for this paper is available on GitHub and associated data at Mendeley.

Publication: Raghvendra Mall, Mohamad Saad, Jessica Roelands, Darawan Rinchai, Khalid Kunji, Hossam Almeer, Wouter Hendrickx, Francesco M Marincola, Michele Ceccarelli, and Davide Bedognetti. "Network-based identification of key master regulators associated with an immune-silent cancer phenotype." Briefings in Bioinformatics (2021).

Tools Used: R

Source Code: ICR

RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes

Motivation: Transcription factors (TF) that regulate gene expression are key determinants of cellular phenotypes. Reconstructing large-scale genome-wide networks capturing the influence of TFs on target genes is essential to understand and accurately model living cells.

Methods: In this paper, we propose a generic framework for gene regulatory network (GRN) inference that converts the problem into a feature selection problem. It is able to handle data from heterogeneous information sources including dynamic time-series, gene knockout or knockdown, DNA microarrays and RNA-Seq expression profiles. GRNs obtained using ML techniques are often dense, whereas real GRNs contain only a few interactions between the TFs and target genes. To this end, we propose a Tikhonov regularization inspired optimal L-curve criterion that utilizes the edge weight distribution for a given target gene to determine the optimal set of TFs associated with it.

Finally, we re-compute the subgraph for each target gene using the expression of the corresponding optimal set of TFs. Our proposed framework allows the incorporation of a priori networks, such as a mechanistic active binding network (ABN) based on cis-regulatory motif analysis between TFs and target genes. In the presence of an ABN, the resulting GRN is a subgraph of it.
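The idea of cutting a dense list of candidate TFs at a corner of the edge weight curve can be sketched as follows. This does not reproduce the Tikhonov-inspired L-curve criterion of RGBM; it substitutes a generic elbow heuristic (the point farthest from the chord joining the first and last sorted weights) purely for illustration. The TF names and weights are hypothetical.

```python
def elbow_cutoff(sorted_weights):
    """Number of leading items to keep before the elbow of a decreasing curve.

    Assumption: generic max-distance-to-chord heuristic, used here as a
    stand-in for the L-curve criterion described in the text.
    """
    n = len(sorted_weights)
    if n < 3:
        return n
    x0, y0 = 0, sorted_weights[0]
    x1, y1 = n - 1, sorted_weights[-1]
    best_index, best_distance = 0, -1.0
    for i, w in enumerate(sorted_weights):
        # Unnormalized distance from (i, w) to the line through the endpoints.
        distance = abs((y1 - y0) * i - (x1 - x0) * w + x1 * y0 - y1 * x0)
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index

# Hypothetical importance weights of five TFs for one target gene.
edge_weights = {"TF_A": 0.9, "TF_B": 0.8, "TF_C": 0.1, "TF_D": 0.05, "TF_E": 0.02}
ranked = sorted(edge_weights, key=edge_weights.get, reverse=True)
keep = elbow_cutoff([edge_weights[tf] for tf in ranked])
selected = ranked[:keep]
print(selected)
```

In the framework described above, the retained TFs would then be used to re-fit the model for that target gene, yielding a sparser subgraph.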

Results: We evaluate our regularization framework in conjunction with two non-linear ML techniques, namely gradient boosting machines (GBM) and random forests (GENIE), resulting in regularized feature-selection-based methods called RGBM and RGENIE, respectively. We show that the proposed methods outperform standard GRN inference techniques on synthetic RNA-Seq data, DREAM challenge data and real E. coli and yeast datasets. Moreover, they surpass the winners of DREAM competitions and other established methods.

Case Studies: RGBM has been used to identify the main transcription factors that are causally involved as master regulators of the gene expression signature activated in the FGFR3-TACC3-positive glioblastoma. Here we illustrate that RGBM can also identify the main regulators of the molecular subtypes of brain tumors. Our analysis reveals the identity and corresponding biological activities of the master regulators driving the transformation of G-CIMP-high into the G-CIMP-low subtype of glioma and PA-like into LGm6-GBM, thus providing a clue to the yet undetermined nature of the transcriptional events among these novel glioma subtypes.

Publication: Raghvendra Mall, Luigi Cerulo, Luciano Garofano, Veronique Frattini, Khalid Kunji, Halima Bensmail, Thais S Sabedot, Houtan Noushmehr, Anna Lasorella, Antonio Iavarone, Michele Ceccarelli. "RGBM: regularized gradient boosting machines for identification of the transcriptional regulators of discrete glioma subtypes", Nucleic Acids Research, gky015, https://doi.org/10.1093/nar/gky015

Tools Used: R

Source Code: RGBM

https://sites.google.com/site/raghvendramallmlresearcher/research/figure-mecc-gboost-3.png?attredirects=0

Schematic representation of the RGBM approach.

Differential Community Detection in Paired Biological Networks

Motivation: Biological networks unravel the inherent structure of molecular interactions, which can lead to the discovery of driver genes and meaningful pathways, especially in the context of cancer. Often, due to gene mutations, gene expression changes and the corresponding gene regulatory network undergoes some amount of localized re-wiring. The ability to identify significant changes in the interaction patterns caused by the progression of the disease can lead to the revelation of novel relevant signatures.

Methods: The task of identifying differential sub-networks in paired biological networks (A: control, B: case) can be re-phrased as one of finding dense communities in a single noisy differential topological (DT) graph constructed by taking the absolute difference between the topological graphs of A and B. In this paper, we propose a fast three-stage approach, namely Differential Community Detection (DCD), to identify differential sub-networks as differential communities in a denoised version of the DT graph. In the first stage, we iteratively re-order the nodes of the DT graph to determine approximate block diagonals present in the DT adjacency matrix using neighborhood information of the nodes and Jaccard similarity. In the second stage, the ordered DT adjacency matrix is traversed along the diagonal to remove all the edges associated with a node if that node has no immediate edges within a window. Finally, we apply community detection methods on this de-noised DT graph to discover differential sub-networks as communities.
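The two building blocks of the first stage, the differential graph and the Jaccard similarity between node neighborhoods, can be sketched as below. The sketch assumes the two inputs are already topological matrices of equal size and skips the iterative re-ordering, the windowed denoising and the community detection; the 4-node matrices are made up.

```python
def differential_graph(adj_a, adj_b):
    """Element-wise absolute difference of two equally sized matrices."""
    n = len(adj_a)
    return [[abs(adj_a[i][j] - adj_b[i][j]) for j in range(n)] for i in range(n)]

def neighbors(adj, node, eps=0.0):
    """Indices of nodes joined to `node` by an edge heavier than eps."""
    return {j for j, w in enumerate(adj[node]) if j != node and w > eps}

def jaccard(set_a, set_b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Toy control (A) and case (B) networks on 4 nodes.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
B = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
dt = differential_graph(A, B)
sim_01 = jaccard(neighbors(dt, 0), neighbors(dt, 1))
print(dt, sim_01)
```

Here the edge between nodes 2 and 3 is shared by both networks and vanishes from the DT graph, while the triangle on nodes 0, 1 and 2 survives as the differential sub-network.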

Results: Our proposed DCD approach can effectively locate differential sub-networks in several simulated paired random-geometric networks and various paired scale-free graphs with different power-law exponents. The DCD approach easily outperforms community detection methods applied on the original noisy DT graph and recent statistical techniques in simulation studies. We applied the DCD method on two real datasets: a) an ovarian cancer dataset to discover differential DNA co-methylation sub-networks in patients and controls; b) a glioma dataset to discover the difference between the regulatory networks of IDH-mutant and IDH-wild-type tumors. We demonstrate the potential benefits of DCD for finding network-inferred biomarkers or pathways associated with a trait of interest.

Conclusion: The proposed DCD approach overcomes the limitations of previous statistical techniques and the issues associated with identifying differential sub-networks by use of community detection methods on the noisy DT graph. This is reflected in the superior performance of the DCD method with respect to various metrics like Precision, Accuracy, Kappa and Specificity. The code implementing the proposed DCD method is available at https://sites.google.com/site/raghvendramallmlresearcher/codes.

Publication: Mall, R., Ullah, E., Kunji, K., Angelo, F., Bensmail, H. and Ceccarelli, M. "Differential Community Detection in Paired Biological Networks". Accepted in Proceedings of 8th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, Boston, MA, U.S.A, 2017.

Tools Used: R

Source Code: DCD

a) Control Differential Sub-network for Ovarian Cancer

b) Case Differential Sub-network for Ovarian Cancer

Detection of statistically significant changes in complex biological networks

Background

Biological networks contribute effectively to unveiling the complex structure of molecular interactions and to discovering driver genes, especially in the context of cancer. Due to gene mutations, for example as cancer progresses, the gene expression network can undergo some amount of localized re-wiring. The ability to detect statistically relevant changes in the interaction patterns induced by the progression of the disease can lead to the discovery of novel relevant signatures. Several procedures have been recently proposed to detect sub-network differences in pairwise labeled weighted networks.

Methods

In this paper, we propose an improvement over the state-of-the-art based on the Generalized Hamming Distance adopted for evaluating the topological difference between two networks and estimating its statistical significance. The proposed procedure exploits a more effective model selection criterion to generate p-values for statistical significance and is more efficient in terms of computational time and prediction accuracy than methods from the literature. Moreover, the structure of the proposed algorithm allows for a faster parallelized implementation.
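As a rough illustration of the distance itself, the sketch below computes a Generalized Hamming Distance between two networks on the same node set. It assumes one common formulation, the mean squared difference between mean-centered off-diagonal entries, and applies it directly to toy adjacency matrices. The estimation of p-values, which is the contribution described above, is not reproduced here.

```python
def generalized_hamming_distance(a, b):
    """Mean squared difference of mean-centered off-diagonal entries.

    Assumption: a and b are square matrices over the same ordered node set.
    """
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean_a = sum(a[i][j] for i, j in pairs) / len(pairs)
    mean_b = sum(b[i][j] for i, j in pairs) / len(pairs)
    total = sum(((a[i][j] - mean_a) - (b[i][j] - mean_b)) ** 2 for i, j in pairs)
    return total / len(pairs)

# Toy 3-node networks: A has the edge 0-1, B has the edge 1-2.
A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
ghd_ab = generalized_hamming_distance(A, B)
ghd_aa = generalized_hamming_distance(A, A)
print(ghd_ab, ghd_aa)
```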

Results

In the case of dense random geometric networks the proposed approach is 10-15x faster and achieves 5-10% higher AUC, Precision/Recall, and Kappa value than the state-of-the-art. We also report the application of the method to dissect the difference between the regulatory networks of IDH-mutant versus IDH-wild-type glioma cancer. In such a case our method is able to identify recently reported master regulators as well as novel important candidates.

Conclusions

We show that our network differentiating procedure can effectively and efficiently detect statistically significant network re-wirings in different conditions. When applied to detect the main differences between the networks of IDH-mutant and IDH-wild-type glioma tumors, it correctly selects sub-networks centered on important key regulators of these two different subtypes. In addition, its application highlights the role of novel candidates that are not detected by standard single network-based approaches.

Publication: Mall, R., Cerulo, L., Bensmail, H., Iavarone, A. and Ceccarelli, M., 2017. "Detection of statistically significant network changes in complex biological networks". BMC Systems Biology, 2017 Mar 4;11(1):32. doi: 10.1186/s12918-017-0412-6.

Tools Used: R

Source Code: Closed-Form

https://sites.google.com/site/raghvendramallmlresearcher/research/Figure6.jpg?attredirects=0

Netgram: Visualizing Communities in Evolving Networks

Netgram is a tool that can be used as a post-processing step for any evolutionary community detection/clustering technique to visualize and track the evolution of communities in dynamic networks and datasets. Netgram maintains the evolution of communities over two consecutive time stamps in tables, which are used to create a query database using the SQL outer-join operation. It uses a line-based visualization technique that adheres to certain design principles and aesthetic guidelines. Netgram uses a greedy solution to order the initial community information provided by the evolutionary clustering technique so that the visualization has fewer line cross-overs.
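The table-based tracking step can be illustrated with a small sketch using Python's built-in sqlite3 module. Netgram itself is implemented in Matlab, so the table names, column names and toy memberships below are all hypothetical; a left outer join is used to list how the members of each community at the first time stamp are distributed at the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (node TEXT PRIMARY KEY, community TEXT);
CREATE TABLE t2 (node TEXT PRIMARY KEY, community TEXT);
INSERT INTO t1 VALUES ('a', 'C1'), ('b', 'C1'), ('c', 'C2'), ('d', 'C2');
INSERT INTO t2 VALUES ('a', 'C1'), ('b', 'C1'), ('c', 'C1'), ('e', 'C3');
""")

# Each row: (community at t1, community at t2, number of shared nodes).
# A NULL (None) destination means the node is absent at the second time stamp.
rows = conn.execute("""
SELECT t1.community, t2.community, COUNT(*)
FROM t1 LEFT OUTER JOIN t2 ON t1.node = t2.node
GROUP BY t1.community, t2.community
""").fetchall()
conn.close()
print(rows)
```

Nodes that appear only at the second time stamp, such as 'e' here, are not returned by a left join; capturing them would need the mirrored join or a full outer join.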

Publication: R. Mall, R. Langone and J.A.K. Suykens, "Netgram: Visualizing Communities in Evolving Networks", PloS One, 10(9):e0137502, 2015.

Tools Used: Matlab

Source Code: Netgram

https://sites.google.com/site/raghvendramallmlresearcher/research/NIPS_Net_LineTrack_MKSC_1.jpg?attredirects=0

MHKSC: Multilevel Hierarchical Kernel Spectral Clustering

Multilevel Hierarchical Kernel Spectral Clustering (MHKSC) exploits the structure of the projections in the eigenspace to determine distance thresholds. Using these increasing distance thresholds, multiple levels of hierarchy are obtained for large-scale networks in a bottom-up fashion. MHKSC overcomes the resolution limit issues faced by the Louvain method and the large number of small-sized communities produced by the OSLOM method, and generates better quality communities than the Infomap method.

Publication: R. Mall, R. Langone and J.A.K. Suykens, "Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks", PloS One, 9(6):e99966, 2014.

Tools Used: Matlab, Python

Source Code: MHKSC

https://sites.google.com/site/raghvendramallmlresearcher/research/Figure2.jpg?attredirects=0

KSC-net: Kernel Spectral Clustering for Big Data Networks

We show the feasibility of the Kernel Spectral Clustering (KSC) method for community detection in big data networks. KSC employs a primal-dual framework and has a powerful out-of-sample extension property, which makes it possible to effectively infer community affiliation for unseen nodes. We also perform model selection and hierarchical community detection, and experiment on graphs with O(10^6) nodes and O(10^9) edges.

Publication: R. Mall, R. Langone and J.A.K. Suykens, "Kernel Spectral Clustering for Big Data Networks", Entropy, Special Issue: Big Data, Vol 15, No: 5, pp. 1567-1586, 2013.

Tools Used: Matlab

Source Code: KSC-net (Linux, Windows)

https://sites.google.com/site/raghvendramallmlresearcher/research/Simulation_EigenSpace.jpg?attredirects=0

VS-LSSVM: Very Sparse Least Squares Support Vector Machines

LSSVMs have been widely applied for classification and regression. However, the LSSVM model lacks sparsity and cannot handle large-scale data. A primal Fixed-Size LSSVM (PFS-LSSVM) was previously proposed to introduce sparsity using the Nyström approximation with a set of prototype vectors (PV). However, its solution is not the sparsest. We investigate the sparsity-error trade-off by introducing a second level of sparsity. This is done by means of L0-norm based reductions that iteratively sparsify the LSSVM and PFS-LSSVM models. The proposed method overcomes the problems of memory constraints and high computational costs, resulting in highly sparse reductions to LSSVM models. Experiments on real-world classification and regression datasets from the UCI repository illustrate that these approaches achieve sparse models without a significant trade-off in errors.

Publication: R. Mall and J.A.K. Suykens, "Very Sparse LSSVM Reductions for Large Scale Data", IEEE TNNLS, vol 26, no 5, pp. 1086-1097, 2015.

Tools Used: Matlab

Source Code: VS-LSSVM

https://sites.google.com/site/raghvendramallmlresearcher/research/Ripley_ALL_L0_norm_err.jpg?attredirects=0

SR-KSC: Sparse Reductions to Kernel Spectral Clustering

Kernel Spectral Clustering (KSC) performs model selection on a subset of the data, which is used for building the training model and for validation. It has a powerful out-of-sample extension property leading to good clustering generalization. The clustering dual model is expressed in terms of non-sparse kernel expansions where every point in the training set contributes. The goal is to find a reduced set of training points that best approximates the original solution. In this work, we investigate various reduced set techniques, including the Group Lasso, L0 and L1+L0 penalization, and compare the amount of sparsity gained w.r.t. a previous L1 penalization technique.

Publication: R. Mall, S. Mehrkanoon, R. Langone and J.A.K. Suykens, "Optimal Reduced Sets for Sparse Kernel Spectral Clustering", IJCNN 2014, Beijing, China.

Tools Used: Matlab

https://sites.google.com/site/raghvendramallmlresearcher/research/GroupLasso_RS_7.jpg?attredirects=0

https://sites.google.com/site/raghvendramallmlresearcher/research/GroupLass_BestResults.jpg?attredirects=0
