Research

Working Papers

Identification, Estimation, and Inference for Multivalued Endogenous Treatment Effect Models: A Control Function Approach (Previous Job Market Paper)

Abstract: Using a control function (CF) approach, I study average treatment effects (ATEs) in discrete multivalued treatments with endogeneity and heterogeneous counterfactual errors. In the discrete ATE literature, most of the attention has been devoted to binary treatment models with endogeneity. This paper extends the investigation of ATEs from binary treatments to discrete multivalued treatments with both endogeneity and heterogeneous counterfactual errors, and explores the behavior of CF and instrumental variables (IV) methods in this framework. Specifically, I offer identification strategies for the ATEs, suggest a consistent estimator for the ATEs, show the asymptotic properties of the CF parameter estimates, and derive a score test to draw inferences about the ATEs and other parameters of interest. Moreover, using a Monte Carlo simulation analysis, I compare the CF method with the IV method (a widely used method when endogeneity is present) in terms of asymptotic efficiency, asymptotic unbiasedness, and consistency. Simulation results suggest that the CF method is asymptotically up to 12% more efficient than the IV method, and that the asymptotic bias in the parameter estimates of the IV method can be as high as 43%. However, when misspecification is introduced, the simulation results favor the IV method. For the empirical illustration, I apply OLS, CF, IV, and nonparametric bound analysis to estimate how limited English proficiency (LEP) influences the wages of Hispanic workers in the USA. The data come from the 1% PUMS of the 1990 US Census. Utilizing age at arrival as an instrumental variable, both the OLS and CF methods indicate that LEP on average imposes a statistically significant wage penalty (up to 79% in some CF estimates) on the Hispanic community in the USA. The IV method mostly produces insignificant results, and the nonparametric bound analysis provides uninformative lower bounds.


Control Function Approach to Multivalued Endogenous Treatment Effects (to be submitted to Journal of Business and Economic Statistics, joint with Jeffrey M. Wooldridge)

Abstract: We incorporate a structure of correlated random coefficients (CRCs) into the framework of discrete multivalued treatments with endogeneity and heterogeneous counterfactual errors. In discrete endogenous multiple treatments with CRCs, the conventional IV method is generally inconsistent for the ATEs because of the existence of CRCs and/or heterogeneous counterfactual errors. In this paper, we propose a consistent CF estimation procedure for the ATEs, find the asymptotic distribution of the CF parameter estimates, and derive a score test to draw inferences about the ATEs and other parameters of interest. In addition, our Monte Carlo simulation analysis suggests that, in the absence of misspecification, the CF method is asymptotically unbiased and consistent (but not necessarily more efficient), whereas the IV method is generally asymptotically biased and inconsistent. In the presence of misspecification, the simulation results show that both the CF and IV methods produce biased estimates (more so for the CF estimates). With regard to efficiency, the simulation findings show that neither method clearly outperforms the other.


Estimation for Multivalued Endogenous Treatment Effect Models Using High Dimensional Methods: A Simulation Study (to be submitted to Journal of Applied Econometrics)

Abstract: Using a simulation study, I examine the finite sample performance of several machine learning (ML) methods and the CF method for discrete multivalued endogenous treatments in a particular setting: there exists an extra set of high dimensional variables, and a low dimensional (and unknown) subset of these variables has an impact on the outcome; however, all of these high dimensional variables are ignorable in the decision to undertake the treatment, given some instruments in the selection equation. I also allow for non-Gaussian and heterogeneous counterfactual errors in the model and use a CF approach to address endogeneity. To estimate the parameters of interest, I use the CF method and four different ML methods: least absolute shrinkage and selection operator (LASSO), post partial-out LASSO, post double selection LASSO, and double/debiased LASSO. Then, I compare their performance using measures such as the bias of the estimates, the standard deviation of the estimates, mean absolute prediction error, root mean square error, the mean number of correctly selected covariates, and the mean size of the selected set of covariates. The main Monte Carlo simulation finding is that, on top of being on par with the CF method in terms of finite sample bias when the high dimensional variables are orthogonal to the variables of interest already included, the LASSO-based methods can surpass the efficiency of the CF method in ATE estimation if there exist enough extra predictive variables that are ignorable in treatment selection among a set of high dimensional predictors of the outcome.
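As a rough illustration of how some of the performance measures named above (bias, standard deviation, and root mean square error of the estimates) are typically computed across Monte Carlo replications, here is a minimal sketch. It is not the code behind the paper; the function name `monte_carlo_summary` and the sample numbers are hypothetical.

```python
import math
import statistics


def monte_carlo_summary(estimates, true_value):
    """Summarize estimates of one parameter across simulation replications.

    `estimates` holds one point estimate per replication and `true_value`
    is the parameter value used to generate the simulated data.
    """
    estimates = list(estimates)
    # Bias: average estimate minus the true parameter value.
    bias = statistics.fmean(estimates) - true_value
    # Spread of the estimates across replications (population form, so that
    # rmse**2 == bias**2 + sd**2 holds exactly up to rounding).
    sd = statistics.pstdev(estimates)
    # Root mean square error around the true value.
    rmse = math.sqrt(statistics.fmean([(e - true_value) ** 2 for e in estimates]))
    return {"bias": bias, "sd": sd, "rmse": rmse}


# Hypothetical ATE estimates from four replications with a true ATE of 1.0.
summary = monte_carlo_summary([0.9, 1.1, 1.2, 1.0], true_value=1.0)
print(summary)
```

Using the population standard deviation keeps the familiar decomposition of mean squared error into squared bias plus variance exact, which makes it easy to see whether an estimator loses accuracy through bias or through dispersion.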


Work in Progress

An Application of Machine Learning Methods to Demand for Organic Fruits

Abstract: I estimate monthly household demand for organic fruit by household income class using machine learning methods (e.g., LASSO, support vector machines, bagging, and random forests) and standard methods (e.g., stepwise regression, forward stagewise regression, linear regression, and the conditional logit), and compare their predictive power. I use a sample of US household organic and conventional fruit purchases from 2011 through 2013, which comes from the Nielsen Corporation’s Consumer Panel Data.


Instrumental Variables Estimation for the Effectiveness of Peer Tutoring Programs with Self-selection Problem: Evidence from Economics Help Rooms and Integrative Studies Peer-Assisted Learning Sessions at Michigan State University

Abstract: I analyze the effectiveness of the MSU College of Social Sciences’ Economics Help Rooms and Peer-Assisted Learning (PAL) program. Over a period of six semesters from Fall 2017 through Spring 2020, I collected administrative and survey data for all students enrolled in classes served by the Economics Help Rooms and PAL program. The data have information on the final grade that students received for the class; gender; minority status; class; high school GPA; college grades thus far; composite ACT score; Pell grant eligibility; whether the student is an international student, a first generation college student, and/or an intercollegiate athlete; whether and how often these students visited the help rooms and/or PAL program for their class; and some other student characteristics. In a first-stage negative binomial regression that uses distance to the help rooms and/or PAL program locations as an identifying variable, I tackle the self-selection problem and generate fitted values for student visits. With these fitted values, in the second-stage instrumental variables ordered probit regression, I relate the use of the help rooms and/or PAL program to the final grade in the course. My empirical results suggest that the help rooms and PAL program contribute to higher grades for students.


Weak-instrument Robust Estimation and Inference in Linear Instrumental Variables Regression with Heteroskedasticity and a Single Multivalued Endogenous Explanatory Variable

Abstract: I investigate the detection of weak instruments and weak-instrument robust estimation and inference. I focus especially on the case where the errors in the reduced-form and first-stage regressions are heteroskedastic, and the linear IV regression has a single multivalued endogenous explanatory variable with weak instruments. I also analyze how nonlinear instruments (e.g., predicted probabilities from the first-stage regression) can help with a weak IV in this framework.
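One common diagnostic for weak instruments, sketched here under strong simplifying assumptions, is a heteroskedasticity-robust first-stage F statistic. The sketch covers only the special case of a single instrument and an intercept, where the robust F equals the squared robust t-statistic on the instrument. It is not taken from the paper; the function name `robust_first_stage_f`, the HC0 variance choice, and the toy data are hypothetical.

```python
def robust_first_stage_f(z, d):
    """First-stage slope and heteroskedasticity-robust (HC0) F statistic.

    Regresses the endogenous variable `d` on a single instrument `z` plus an
    intercept. With one instrument, F is the squared robust t-statistic.
    """
    n = len(z)
    z_bar = sum(z) / n
    d_bar = sum(d) / n
    # Demeaning both variables partials out the intercept.
    zc = [zi - z_bar for zi in z]
    dc = [di - d_bar for di in d]
    sxx = sum(v * v for v in zc)
    pi_hat = sum(a * b for a, b in zip(zc, dc)) / sxx
    resid = [b - pi_hat * a for a, b in zip(zc, dc)]
    # HC0 sandwich variance of the slope estimate.
    var_hc0 = sum((a * u) ** 2 for a, u in zip(zc, resid)) / sxx ** 2
    return pi_hat, pi_hat ** 2 / var_hc0


# Toy data: a multivalued treatment that tracks the instrument closely.
z = [0, 1, 2, 3, 4, 5]
d_strong = [0, 1, 1, 2, 2, 3]
pi_hat, f_stat = robust_first_stage_f(z, d_strong)
print(pi_hat, f_stat)
```

A small F statistic signals that the instrument explains little of the variation in the endogenous variable; the cutoff used to flag weakness is a modeling choice and is not specified in the abstract above.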


Control Function Method vs. Instrumental Variables: An Asymptotic Efficiency Comparison

Abstract: From an asymptotic efficiency standpoint, I compare the average treatment effect estimates of the instrumental variables method to those of the control function method. In this work, I specifically consider a discrete multivalued endogenous treatment and carry out a brute-force comparison of the asymptotic variance-covariance matrices of the two methods in the positive semidefinite sense.
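To make the comparison "in the positive semidefinite sense" concrete: estimator A is at least as efficient as estimator B when the difference of their asymptotic variance matrices, V_B minus V_A, is positive semidefinite. The sketch below checks this for small symmetric matrices using the fact that a symmetric matrix is positive semidefinite exactly when all of its principal minors are nonnegative. It is an illustration only; the function names and the example matrices are hypothetical and do not come from the paper.

```python
from fractions import Fraction
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination, using exact rational arithmetic."""
    a = [[Fraction(x) for x in row] for row in m]
    n = len(a)
    result = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result  # a row swap flips the sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def is_psd(m):
    """True if symmetric matrix `m` has only nonnegative principal minors."""
    n = len(m)
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            if det([[m[i][j] for j in idx] for i in idx]) < 0:
                return False
    return True


def at_least_as_efficient(v_a, v_b):
    """True if V_b - V_a is positive semidefinite."""
    diff = [[Fraction(b) - Fraction(a) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(v_a, v_b)]
    return is_psd(diff)


# Hypothetical asymptotic variance matrices for two estimators.
v_cf = [[2, 1], [1, 2]]
v_iv = [[3, 1], [1, 4]]
print(at_least_as_efficient(v_cf, v_iv))
```

Enumerating principal minors is exponential in the matrix dimension, so this is practical only for a handful of parameters. Note also that the ordering is partial: for some pairs of matrices neither difference is positive semidefinite, in which case neither estimator dominates.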


Breathomics: A Multidimensional Approach to Rapid Early Cancer Detection Using Artificial Intelligence Algorithms and Advanced Sensors for Breath-Molecular Biomarkers (with PI Talayeh Razzaghi and Co-PI Thirumalai Venkatesan), OU Big Idea Challenge 2.0 Competition

Summary: This research initiative aims to advance cancer diagnostics, with a primary focus on developing an artificial intelligence and machine learning model for early detection using Breathomics (the study of volatile organic compounds in human breath), with a specific emphasis on pancreatic cancer. I am part of the socioeconomic analysis team that explores the cost-effectiveness and other economic welfare implications of the proposed model in comparison to existing cancer diagnostics. This encompasses considerations such as healthcare expenditures, potential long-term treatment cost savings, and the overall economic feasibility of the technology. Additionally, I play a role in preparing and submitting grant proposals to secure funding for the research project.