Research
I specialize in advancing methodologies for causal inference, particularly within the complex landscape of healthcare and public health. My work develops methodologies that are:
Accurate - to ensure accurate estimation of heterogeneous causal effects even when confronted with limited data, offering decision-makers a reliable foundation upon which to base their choices.
Trustworthy - to empower domain experts to comprehend the inner workings of the causal inference process. This not only enables experts to validate the underlying assumptions but also guarantees patients' safety.
Domain-conscious - to bridge the research-to-practice gap and yield solutions that are readily implementable. I leverage the context and domain knowledge to tailor solutions specific to a subject matter.
I firmly believe that the most impactful and implementable contributions arise when methodological advancements are deeply rooted in the relevant context. This belief is exemplified by my active collaboration with healthcare professionals, including neurologists at Massachusetts General Hospital (MGH) and Beth Israel Deaconess Medical Center (BIDMC), to improve treatment for critically ill patients, and with epidemiologists at Columbia University Medical College (CUMC) to design strategies for managing opioid use disorder.
Publications
Journal
Harsh Parikh*, Kentaro Hoffman*, Haoqi Sun, Sahar F Zafar, Wendong Ge, Jin Jing, Lin Liu, Jimeng Sun, Aaron Struck, Alexander Volfovsky, et al. Effects of epileptiform activity on discharge outcome in critically ill patients in the USA: a retrospective cross-sectional study. The Lancet Digital Health, 2023
Harsh Parikh, Alexander Volfovsky, and Cynthia Rudin. MALTS: Matching after learning to stretch. Journal of Machine Learning Research, 23(240), 2022
Harsh Parikh, Carlos Varjao, Louise Xu, and Eric Tchetgen Tchetgen. Validating causal inference methods. International Conference on Machine Learning, pages 17346–17358. PMLR, 2022
Harsh Parikh, Cynthia Rudin, and Alexander Volfovsky. An application of matching after learning to stretch (MALTS). Observational Studies, (5):118–130, 2019
Sarul Malik, Harsh Parikh, Neil Shah, Sneh Anand, and Shalini Gupta. Non-invasive platform to estimate fasting blood glucose levels from salivary electrochemical parameters. Healthcare Technology Letters, 6(4):87–91, 2019
Shayoni Dutta, Spandan Madan, Harsh Parikh, and Durai Sundar. An ensemble micro neural network approach for elucidating interactions between zinc finger proteins and their target DNA. BMC Genomics, 17(13):97–107, 2016
Harsh Parikh, Apoorvi Singh, Annangarachari Krishnamachari, and Kushal Shah. Computational prediction of origin of replication in bacterial genomes using correlated entropy measure (CEM). Biosystems, 128:19–25, 2015
Conference
Quinn Lanners, Harsh Parikh, Alexander Volfovsky, Cynthia Rudin, and David Page. Variable importance matching for causal inference. In Uncertainty in Artificial Intelligence, pages 1174–1184. PMLR, 2023
Harsh Parikh. Synthetic control as balancing score. In International Conference on Learning Representations (Tiny Papers), 2023
Harsh Parikh, Carlos Varjao, Louise Xu, and Eric Tchetgen Tchetgen. Validating causal inference methods. In International Conference on Machine Learning, pages 17346–17358. PMLR, 2022
Babak Salimi, Harsh Parikh, Moe Kayali, Lise Getoor, Sudeepa Roy, and Dan Suciu. Causal relational learning. In Proceedings of the 2020 ACM SIGMOD international conference on management of data, pages 241–256, 2020
Sarul Malik, Shalini Gupta, Harsh Parikh, and Sneh Anand. Gargling affect on salivary electrochemical parameters to predict blood glucose. In 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), pages 603–606. IEEE, 2016
Media Articles
Covid-19: Mitigating the risk from reverse migration, Ideas for India - 2020
Efficacy of India’s Covid-19 response, Center for Soft Power - 2020
Rents are driven by quality of public services, not proximity to transit, Urban Wire - 2017
Empowering women through international tourism, Urban Wire: International Development - 2017
Book Review: The Indian Economy- A Macroeconomic Perspective, ARTNeT UNESCAP - 2017
Work Experience
Applied Science Intern | Selling Partner Insights and Research Intelligence Team | Seattle, June 2020 - September 2020
Credence to Causal Estimates: Designed a framework for validating causal estimation methods. The framework learns the parameters of a simulator that generates data imitating the dynamics of the real-world data of interest. It then uses the learned simulator to generate a synthetic dataset with known ground-truth causal effects, so that causal estimation methods can be validated on their ability to recover the true treatment effects.
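The validation idea above can be illustrated with a minimal sketch. It is not the framework itself: the simulator here is hand-written rather than learned from real data, and the function names, the single confounder, and the effect size are all assumptions chosen for illustration. The sketch generates data with a known average treatment effect and scores two estimators by how closely they recover it.

```python
import numpy as np

def simulate(n, true_ate, rng):
    """Hypothetical simulator: one confounder x drives both treatment and outcome."""
    x = rng.normal(size=n)
    propensity = 1.0 / (1.0 + np.exp(-1.5 * x))
    t = rng.binomial(1, propensity)
    y = true_ate * t + 2.0 * x + rng.normal(size=n)
    return x, t, y

def naive_difference(x, t, y):
    """Difference in mean outcomes; ignores the confounder."""
    return y[t == 1].mean() - y[t == 0].mean()

def regression_adjustment(x, t, y):
    """Coefficient on t from a least-squares fit of y on (1, t, x)."""
    design = np.column_stack([np.ones_like(x), t, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

def validate(estimators, true_ate=1.0, n=5000, seed=0):
    """Absolute error of each estimator against the known ground truth."""
    rng = np.random.default_rng(seed)
    x, t, y = simulate(n, true_ate, rng)
    return {name: abs(est(x, t, y) - true_ate) for name, est in estimators.items()}

errors = validate({"naive": naive_difference, "regression": regression_adjustment})
for name, err in errors.items():
    print(f"{name}: absolute error {err:.3f}")
```

Because the true effect is fixed by the simulator, the ranking of estimators is directly observable, which is the property that real observational data lack.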
Research Intern | International Development and Governance | Washington DC, June 2017 - July 2017
Public Transport and Rental Market in Lahore: Performed a causal analysis of the impact of the metro-bus service on ridership patterns across occupations, income groups, and genders. Results showed the metro-bus to be the preferred mode of transport for low-income servicemen. Studied the variation in housing rents 3 years before and after the metro-bus service's introduction. Deduced that, in Lahore, expenses on amenities dictate rents more than access to a metro-bus station.
Women Empowerment and Labor Force Participation: Analyzed household survey data from Tanzania, Senegal, Nigeria, and Madagascar to understand the effect of women's labor force participation on decision-making power. Results showed a positive correlation, though of varying degree across cultures. Studied the impact of shared economy initiatives in international tourism on women's empowerment, highlighting the lack of empirical evidence.
Blue Scholar | Data Fusion & Graph Analytics | New Delhi, July 2015 - May 2016
Social network data analysis for law enforcement: Developed a computational method for suspect identification from the Twitter network based on characteristic matching, location mining, tweet analysis, and the network's graph structure, enabling law enforcement agencies to track a suspect's activities through the suspect's social-network updates.
Software Engineering Intern | Audio-Video Bridging team | Bangalore, May 2014 - July 2014
Timing and Synchronization: Worked as part of the Audio-Video Bridging (AVB) team to develop an IEEE 802.1AS implementation and a data-structure optimization for accessing Multiple Stream Reservation Protocol statistics for ESPN.
Academic Projects
Duke University
August 2016 - Present
Interpretable Dynamic Treatment Regime: Devising a methodology to estimate an optimal drug regime for epileptic patients in the ICU to reduce the mortality rate. The method uses a learned Bayesian pharmacological model of anti-epileptic drugs' interactions with the human body to generate and evaluate drug regimes. We perform off-policy constrained Q-learning on the simulated data to estimate the optimal drug regime, which provides a simple and interpretable treatment rule to health-care practitioners.
Intervention on Relational Data: Formulating a framework and a methodology for estimating the causal effect of interventions on relational links in a relational database that holds data on multiple entities across many tables. Our approach extends Pearl's framework by designing a language for describing a causal query on a relational skeleton, and uses the semantics of the language to assess the identifiability of the treatment effect of interest. This work extends our work in Causal Relational Learning to allow interventions on relational skeletons.
Distance metric learning for Causal Inference: We introduce a flexible framework for matching in causal inference that produces high quality almost-exact matches. Most prior work in matching uses ad hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates that degrade the distance metric. In this work, we learn an interpretable distance metric used for matching, which leads to substantially higher quality matches. The framework is flexible in that the user can choose the form of distance metric, the type of optimization algorithm, and the type of relaxation for matching. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for estimation of conditional average treatment effects.
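To make the role of the distance metric concrete, here is a minimal sketch of matching under a diagonally stretched distance. It is not the published method: the stretch weights are set by hand instead of being learned, the function names and the toy data-generating process are assumptions, and each unit is allowed to appear in its own matched group. The toy data contain one relevant covariate (which confounds treatment and outcome) and one irrelevant covariate.

```python
import numpy as np

def stretched_distance(X, query, stretch):
    """Euclidean distance after scaling each covariate by its stretch weight."""
    diff = (X - query) * stretch
    return np.sqrt((diff ** 2).sum(axis=1))

def matched_cate(X, t, y, stretch, k=3):
    """Per-unit effect estimate from the k nearest treated and k nearest control units."""
    treated = np.flatnonzero(t == 1)
    control = np.flatnonzero(t == 0)
    cate = np.empty(len(X))
    for i in range(len(X)):
        d = stretched_distance(X, X[i], stretch)
        near_treated = treated[np.argsort(d[treated])[:k]]
        near_control = control[np.argsort(d[control])[:k]]
        cate[i] = y[near_treated].mean() - y[near_control].mean()
    return cate

# Toy data (assumed): column 0 is relevant, column 1 is pure noise.
rng = np.random.default_rng(0)
n, true_effect = 600, 2.0
X = rng.normal(size=(n, 2))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
y = 3.0 * X[:, 0] + true_effect * t + rng.normal(scale=0.5, size=n)

# Hand-set stretches: one that keeps only the relevant covariate, one that keeps only the noise.
err_relevant = abs(matched_cate(X, t, y, np.array([1.0, 0.0])).mean() - true_effect)
err_irrelevant = abs(matched_cate(X, t, y, np.array([0.0, 1.0])).mean() - true_effect)
print(f"stretch on relevant covariate:   error {err_relevant:.3f}")
print(f"stretch on irrelevant covariate: error {err_irrelevant:.3f}")
```

The comparison shows why the choice of metric matters: a stretch that emphasizes the irrelevant covariate produces matched groups that do not balance the confounder.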
Causal Inference on Relational Data: Existing methods rely critically on restrictive assumptions, such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. We designed a formal framework for causal inference from such relational data. We propose a declarative language for capturing causal assumptions and specifying causal queries. We provide a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of our approach on academic and healthcare datasets.
Network based Economy: We designed a multi-agent, multi-round trading network that maximized sub-graphs' aggregate utility in each iteration. Our analysis of the equilibria, controlling for the centrality of nodes and the degree of connectivity of the network, showed that central nodes enjoy exponential benefits. The results also highlighted the lack of correlation between aggregate social welfare and the connectivity of the network. We further modelled each agent using a neural network able to perform policy search via stochastic gradient descent in each epoch. As a baseline, we performed a steady-state comparison with a network of rational agents for the two-agent case and analyzed the convergence rate. Lastly, we studied the conditions for the emergence of a steady state in graphs with more than two agents.
Correlation Clustering on Social Network Graphs: We expanded the literature surrounding correlation clustering by implementing various theoretical algorithms. In addition to creating our own algorithms for generating preferential attachment model graphs, we tested both Ailon et al.'s CC-Pivot algorithm and our own modification of it. Through a number of small modifications to CC-Pivot, we were able to obtain consistently better results with regard to the cost of clustering. In addition to our implementation, our theoretical results show that existing constant-time approximations of the modularity of unsigned power-law graphs can be modified to achieve constant-time approximations on signed power-law graphs as well.
Technological Advancement and Economic Growth: Empirically and theoretically studied the growth of the Indian economy alongside the development of information and communication technology infrastructure and the increased penetration of mobile phones and the internet in India. The analysis showed a significant impact on the exchange rate in the short term, while a strong positive effect on GDP and a negative relation to the trade balance were apparent only in the long term. Developed a two-sector endogenous growth model with monopolistic manufacturing firms, research firms, and utility-maximizing households. The non-competitive and partially excludable output of research firms leads to TFP growth and automation. The model predicts super-specialization of the labor force over time, along with linear growth in the short term and exponential growth in the long term. The results were corroborated using thirty years of macroeconomic data from five developed and developing nations.
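For readers unfamiliar with CC-Pivot, the basic algorithm of Ailon et al. can be sketched as follows: pick a random pivot among the unclustered nodes, group it with every unclustered node joined to it by a positive edge, and repeat. This is a minimal sketch of the baseline only, not of the modification mentioned above; the edge representation, the function names, and the toy graph are assumptions.

```python
import random
from itertools import combinations

def cc_pivot(nodes, positive, rng):
    """Baseline CC-Pivot: cluster a random pivot with its remaining positive neighbours."""
    remaining = set(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | {v for v in remaining if frozenset((pivot, v)) in positive}
        clusters.append(cluster)
        remaining -= cluster
    return clusters

def disagreements(nodes, positive, clusters):
    """Clustering cost: positive pairs that are split plus negative pairs that are merged."""
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    cost = 0
    for u, v in combinations(nodes, 2):
        same_cluster = label[u] == label[v]
        is_positive = frozenset((u, v)) in positive
        if same_cluster != is_positive:
            cost += 1
    return cost

# Toy signed graph (assumed): two positive cliques; every other pair is negative.
nodes = list(range(6))
positive = {frozenset(pair) for group in ([0, 1, 2], [3, 4, 5])
            for pair in combinations(group, 2)}

clusters = cc_pivot(nodes, positive, random.Random(0))
cost = disagreements(nodes, positive, clusters)
print(sorted(sorted(c) for c in clusters), "cost:", cost)
```

On a graph made of disjoint positive cliques, any pivot recovers its whole clique, so the cost is zero regardless of the random choices. On noisier graphs the result depends on the pivot order, which is where modifications to the pivot rule can change the clustering cost.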
Indian Institute of Technology Delhi
July 2011 - May 2015
Neural networks based noninvasive blood glucose level sensor for diabetes patients: Most existing approaches for measuring fasting blood glucose levels (FBGLs) are invasive. This work presents a proof-of-concept study in which saliva is used as a proxy biofluid to estimate FBGL. Saliva collected from 175 volunteers was analysed using portable, handheld sensors to measure its electrochemical properties, such as conductivity, redox potential, pH, and K⁺, Na⁺ and Ca²⁺ ionic concentrations. A set of mathematical algorithms was trained and tested on these data, along with each person's gender and age, after casewise annotation with the true FBGL values. An accuracy of 87.4 ± 1.7% and a mean relative deviation of 14.1% (R² = 0.76) were achieved using a mathematical algorithm. All parameters except gender were found to play a key role in the FBGL determination process. Finally, the individual electrochemical sensors were integrated into a single platform and interfaced with the authors’ algorithm through a simple graphical user interface. The system was revalidated on 60 new saliva samples and gave an accuracy of 81.67 ± 2.53% (R² = 0.71). This study paves the way for rapid, efficient and painless FBGL estimation from saliva.
Ocean health prediction by image analysis using convolutional neural networks: Plankton classification is a crucial problem that has gained significance in the current climate change scenario driven by anthropogenic interventions. Estimating plankton diversity and mitigating risk are key to checking environmental degradation. Automating the process with machine learning and image processing techniques reduces a year-long process to a job of a few minutes. The problem was initially approached from two fronts, machine learning and image processing, which were merged into a pipeline to produce better results. A breadth of techniques was tried to understand the direction of the problem and to improve performance. Domain knowledge of the plankton hierarchy was exploited to devise a novel hierarchically stacked classifier that mimics the naturally occurring phylogeny. The model was fine-tuned using boosting and training-data equalization so that classes with few data points were not under-represented.
Identifying origin of replication in bacterial genome: We have carried out an analysis on 500 bacterial genomes and found that the de facto GC skew method could predict the replication origin site for only 376 genomes. We also found that the auto-correlation and cross-correlation based methods have a similar prediction performance. In this paper, we propose a new measure called correlated entropy measure (CEM), which is able to predict the replication origin of all these 500 bacterial genomes. The proposed measure is context sensitive and thus a promising tool to identify functional sites. The process of identifying replication origins from the output of CEM and other methods has been automated to analyze a large number of genomes faster. We have also explored SVM-based classification to predict, from a genome's length and GC content, whether each of these methods works on it, across all 500 bacterial genomes.
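For context, the GC skew baseline mentioned above can be sketched in a few lines. This is a simplified per-base variant (the skew goes up by one at each G and down by one at each C, with the origin predicted at the minimum of the running total) and not the CEM measure; the function names and the toy sequence are assumptions, and the circularity of bacterial genomes is ignored.

```python
def cumulative_gc_skew(sequence):
    """Running total of (#G - #C) over each prefix of the sequence, starting at 0."""
    skew = [0]
    for base in sequence.upper():
        if base == "G":
            step = 1
        elif base == "C":
            step = -1
        else:
            step = 0
        skew.append(skew[-1] + step)
    return skew

def predict_origin(sequence):
    """Predicted origin: first position where the cumulative skew reaches its minimum."""
    skew = cumulative_gc_skew(sequence)
    return skew.index(min(skew))

# Toy sequence (assumed): C-rich at the start, G-rich afterwards.
toy_genome = "CCACCTGGAGGG"
print("cumulative skew:", cumulative_gc_skew(toy_genome))
print("predicted origin position:", predict_origin(toy_genome))
```

The heuristic only works when the skew has a single clear minimum, which is consistent with the observation above that it fails on a sizeable share of genomes.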