What is a Labor Market? Classifying Workers and Jobs Using Network Theory (with Jamie Fogel) PAPER (honorable mention at the Urban Economics Association Conference, 2020) YOUTUBE (in Portuguese)
This paper develops a new data-driven approach to characterizing latent worker skill and job task heterogeneity by applying an empirical tool from network theory to large-scale Brazilian administrative data on worker-job matching. We micro-found this tool using a standard model of workers matching with jobs according to comparative advantage. Our classifications identify important dimensions of worker and job heterogeneity that standard classifications based on occupations and sectors miss. Moreover, a general equilibrium model based on our classifications more accurately predicts wage changes in response to the 2016 Olympics than a model based on occupations and sectors. Finally, we show that reduced-form estimates of the effects of labor market shock exposure on workers' earnings are as much as four times larger when workers and jobs are grouped by our classifications rather than by occupations and sectors.
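For intuition, here is a minimal, self-contained sketch of the kind of bipartite worker-job classification problem the paper tackles. It uses scikit-learn's spectral co-clustering on simulated matches as a stand-in for the paper's network-theoretic tool; the actual estimator and data differ.

```python
# Hypothetical sketch: recover latent worker and job classes from a
# simulated bipartite worker-job matching network via spectral co-clustering.
# This is a stand-in for the paper's network tool, not the authors' estimator.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
n_workers, n_jobs, n_types = 300, 120, 3
worker_type = rng.integers(n_types, size=n_workers)
job_type = rng.integers(n_types, size=n_jobs)

# Workers match disproportionately with jobs of their own latent type,
# a stylized version of matching according to comparative advantage.
match_prob = np.where(worker_type[:, None] == job_type[None, :], 0.20, 0.02)
adjacency = rng.binomial(1, match_prob).astype(float)  # worker-by-job matches

model = SpectralCoclustering(n_clusters=n_types, random_state=0)
model.fit(adjacency)

print("recovered worker classes:", np.bincount(model.row_labels_))
print("recovered job classes:   ", np.bincount(model.column_labels_))
```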
An Interdisciplinary Outlook on Large Language Models for Scientific Research (with MIDAS postdocs) PAPER
In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.
Local and Global Explainability Metrics for Machine Learning Predictions (with Cristian Muñoz, Kleyton da Costa, Adriano Koshiyama) PAPER POSTER
Rapid advancements in artificial intelligence (AI) technology have brought about a plethora of new challenges in terms of governance and regulation. AI systems are being integrated into various industries and sectors, creating a demand from decision-makers to possess a comprehensive and nuanced understanding of the capabilities and limitations of these systems. One critical aspect of this demand is the ability to explain the results of machine learning models, which is crucial to promoting transparency and trust in AI systems, as well as fundamental in helping machine learning models to be trained ethically. In this paper, we present novel quantitative metric frameworks for interpreting the predictions of classification and regression models. The proposed metrics are model-agnostic and are defined to quantify: (i) the interpretability factors based on global and local feature importance distributions; (ii) the variability of feature impact on the model output; and (iii) the complexity of feature interactions within model decisions. We employ publicly available datasets to apply our proposed metrics to various machine learning models focused on predicting customers' credit risk (classification task) and real estate price valuation (regression task). The results show how these metrics can provide a more comprehensive understanding of model predictions and facilitate better communication between decision-makers and stakeholders, thereby increasing the overall transparency and accountability of AI systems.
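As a toy illustration of the flavor of such metrics (not the paper's exact definitions), the sketch below computes a concentration measure over a global feature-importance distribution and a crude per-feature measure of local impact variability for a simulated classifier.

```python
# Toy illustration, not the paper's formulas: summarize how concentrated
# global feature importance is and how variable local feature impacts are.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# (i) Global importance distribution and how spread out it is
# (normalized entropy: 1 = evenly spread, 0 = a single feature dominates).
imp = permutation_importance(clf, X, y, n_repeats=10, random_state=0).importances_mean
p = np.clip(imp, 0, None)
p = p / p.sum()
spread = -(p[p > 0] * np.log(p[p > 0])).sum() / np.log(len(p))
print(f"global importance spread: {spread:.2f}")

# (ii) Variability of each feature's local impact, proxied here by the spread
# of changes in predicted probability when the feature is set to its mean.
base = clf.predict_proba(X)[:, 1]
for j in range(X.shape[1]):
    X_pert = X.copy()
    X_pert[:, j] = X[:, j].mean()
    delta = base - clf.predict_proba(X_pert)[:, 1]
    print(f"feature {j}: local-impact std = {delta.std():.3f}")
```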
Detailed Gender Wage Gap Decompositions: Controlling for Worker Unobserved Heterogeneity Using Network Theory (with Jamie Fogel) PAPER
This paper measures gender discrimination by decomposing male-female differences in average wages into a component explained by male and female workers having different productivity distributions and a component explained by equally productive male and female workers being paid differently. This requires us to build reliable counterfactuals by identifying all relevant controls such that male workers are compared to female workers who are identical in all aspects relevant to pay other than their gender, conditional on controls. To do this, we (i) develop a new economically principled network-based approach to control for unobserved worker skill and job task heterogeneity using the information revealed by detailed data on worker-job matching patterns, (ii) non-parametrically estimate counterfactual wage functions for male and female workers, (iii) introduce a correction for the possibility that the male and female productivity distributions do not overlap, and (iv) apply our methods by revisiting gender wage gap decompositions using improved counterfactuals based on (i), (ii) and (iii). We decompose the gender wage gap in Rio de Janeiro, Brazil and find that the gender wage gap is almost entirely explained by male and female workers who possess similar skills and perform similar tasks being paid different wages.
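A stripped-down sketch of the counterfactual decomposition idea on simulated data, with a random forest standing in for the paper's nonparametric wage functions and without the network-based skill controls or the support correction:

```python
# Simplified sketch: split a raw male-female wage gap into a composition
# component and a pay-structure component using a counterfactual in which
# female workers are paid according to the male wage function.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 5000
female = rng.binomial(1, 0.5, n)
skill = rng.normal(0, 1, n) - 0.1 * female                         # composition difference
wage = 1.0 + 0.8 * skill - 0.15 * female + rng.normal(0, 0.3, n)   # pay difference

X_m, y_m = skill[female == 0].reshape(-1, 1), wage[female == 0]
X_f, y_f = skill[female == 1].reshape(-1, 1), wage[female == 1]

wage_fn_male = RandomForestRegressor(random_state=0).fit(X_m, y_m)

raw_gap = y_m.mean() - y_f.mean()
cf_female = wage_fn_male.predict(X_f).mean()   # females paid like comparable males
explained = y_m.mean() - cf_female             # different skill composition
unexplained = cf_female - y_f.mean()           # similar skill, different pay
print(f"raw gap {raw_gap:.3f} = explained {explained:.3f} + unexplained {unexplained:.3f}")
```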
Advancing Distribution Decomposition Methods Beyond Common Supports: Applications to Racial Wealth Disparities PAPER
I generalize state-of-the-art approaches that decompose differences in the distribution of a variable of interest between two groups into a portion explained by covariates and a residual portion. The method I propose relaxes the overlapping-supports assumption, so the groups being compared need not share exactly the same covariate support. I illustrate my method by revisiting the Black-white wealth gap in the U.S. as a function of labor income and other variables. Traditionally used decomposition methods would trim (or assign zero weight to) observations that lie outside the common covariate support region. By instead allowing all observations to contribute to the existing wealth gap, I find that otherwise-trimmed observations contribute between 3% and 19% of the overall wealth gap, depending on the portion of the wealth distribution.
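To illustrate the support problem that the method addresses (this is not the proposed decomposition itself), the toy example below measures how much of a raw gap rests on observations that lie outside the common covariate support and would be trimmed by standard methods:

```python
# Toy illustration of the common-support problem: share of group A lying
# outside group B's income support, and how trimming those observations
# changes the measured wealth gap. Simulated data, not U.S. survey data.
import numpy as np

rng = np.random.default_rng(2)
n = 20000
group_b = rng.binomial(1, 0.5, n)                  # 1 = group B, 0 = group A
income = np.where(group_b == 1, rng.normal(40, 15, n), rng.normal(70, 25, n))
wealth = 5 + 3.0 * np.maximum(income, 0) + rng.normal(0, 30, n)

inc_a, inc_b = income[group_b == 0], income[group_b == 1]
lo, hi = max(inc_a.min(), inc_b.min()), min(inc_a.max(), inc_b.max())
off_support_a = (inc_a < lo) | (inc_a > hi)        # covariate values group B never reaches

raw_gap = wealth[group_b == 0].mean() - wealth[group_b == 1].mean()
trimmed_gap = wealth[group_b == 0][~off_support_a].mean() - wealth[group_b == 1].mean()
print(f"share of group A off the common support: {off_support_a.mean():.1%}")
print(f"raw gap {raw_gap:.1f} vs. gap after trimming off-support observations {trimmed_gap:.1f}")
```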
A network theory approach to imputing workers' occupations in the Longitudinal Employer-Household Dynamics (LEHD) (with Jamie Fogel and Dylan Nelson) PAPER
The Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) is the main source of labor market information in the United States. Despite containing records for virtually all workers in the country, it lacks a crucial piece of information for labor market analysis: workers' occupations. In this paper, we attempt to impute occupations in the LEHD by exploiting the information contained in the LEHD's rich set of worker-job matches, forming the labor market network using economic theory and network theory. We find that while the information contained in these matches is informative about economic outcomes like earnings, it is minimally informative about occupation. In particular, the information gleaned from worker-job matches has minimal predictive power for occupation when other variables like industry are included as predictors.
A Model Ensemble Approach to Individual Fairness in Machine Learning (funded by the Rocket Companies) POSTER
Individual fairness is based on the principle that similar observations should be treated similarly by a machine learning (ML) model, addressing the limitations of group fairness methods. Despite its intuitive appeal, implementing individual fairness algorithms is challenging due to difficulties in defining a metric for similarity between individuals. In this paper, we develop a model ensemble approach inspired by individual fairness to assess ML model fairness. Leveraging results from the double/causal ML literature and ML clustering techniques, our method requires considerably fewer assumptions than previous individual fairness methods, in addition to being model-agnostic and avoiding cherry-picking decisions in fairness assessment. Our data-driven method involves: (i) removing variation in the dataset related to sensitive attributes using causal ML; (ii) clustering observations using random forests and a Bayesian network algorithm; (iii) performing within-cluster inference to test if the model treats similar observations similarly, and applying multiple hypothesis test correction to aggregate the results. We provide a single statistical p-value for the null hypothesis that the model is unbiased based on individual fairness and create a scale to measure the extent of bias against minorities, enhancing the interpretability of the p-value for decision-makers. We apply our methodology to assess bias in the mortgage industry and provide an open-source Python package for our methods.
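A stylized sketch of steps (i)-(iii), with simple stand-ins for the paper's components: linear residualization in place of the causal-ML step, k-means in place of the random-forest/Bayesian-network clustering, and Bonferroni-corrected within-cluster t-tests of model scores across groups.

```python
# Stylized individual-fairness audit with simplified stand-ins for each step.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 4000
sensitive = rng.binomial(1, 0.4, n)
X = rng.normal(0, 1, (n, 4)) + 0.3 * sensitive[:, None]
model_score = X @ np.array([0.5, -0.2, 0.1, 0.3]) - 0.2 * sensitive  # scores being audited

# (i) Remove variation in the covariates related to the sensitive attribute.
s = sensitive.reshape(-1, 1)
X_resid = X - LinearRegression().fit(s, X).predict(s)

# (ii) Cluster "similar" observations in the residualized covariate space.
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X_resid)

# (iii) Within each cluster, test whether similar individuals receive similar
# scores across groups, then aggregate with a Bonferroni correction.
pvals = []
for k in np.unique(labels):
    in_k = labels == k
    s0 = model_score[in_k & (sensitive == 0)]
    s1 = model_score[in_k & (sensitive == 1)]
    if len(s0) > 1 and len(s1) > 1:
        pvals.append(stats.ttest_ind(s0, s1, equal_var=False).pvalue)

alpha = 0.05
n_reject = sum(p < alpha / len(pvals) for p in pvals)
print(f"{n_reject}/{len(pvals)} clusters show significant score differences across groups")
```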
Market Definition Bias in Studies of (Labor) Market Power (with Jamie Fogel and Benjamin Scuderi) SLIDES
This paper demonstrates two distinct and quantitatively important biases introduced by using an "incorrect" definition of market boundaries when attempting to make inferences about labor market power. The first source of bias, long recognized in the antitrust literature, stems from mismeasurement of relative firm size: the same firm will appear artificially dominant when markets are drawn too narrowly and artificially competitive when they are drawn too broadly. We derive a novel second source of bias, which we term elasticity bias, that generates statistical attenuation of estimates of key parameters that govern model-based conclusions about the size and distribution of markdowns across employers and markets. In simulations calibrated to Brazilian administrative data, we show that the second channel is an order of magnitude more important than the first. Further, we show that market definition bias can be large in empirically relevant cases where the relative rate of misclassification may be modest, as with the administrative labor market boundaries, such as industry/occupation-by-region cells, adopted by virtually all existing studies. We propose an alternative network-based procedure for defining labor market boundaries that extends the algorithm of Fogel and Modenesi (2022). Drawing upon the empirical strategy of Felix, we show that relative to using administrative market definitions, using network-based market definitions yields estimates with 40% larger markdown dispersion and overturns several qualitative conclusions about which workers are harmed by monopsony power. Finally, we propose a simple diagnostic that allows practitioners to pick among off-the-shelf classifications when using a data-driven one is infeasible.
This project aims to develop and validate an AI-based model to assess the quality of shared decision-making (SDM) in clinical encounters involving children with medical complexity (CMC), their parents, and orthopedic surgeons. Although CMC represent less than 5% of all children, they account for over 30% of pediatric healthcare expenditures, often receiving low-quality care and experiencing poor health outcomes. We will train an AI model ensemble—combining supervised learning and large language models—on real-world clinical data, leveraging the established DEEP-SDM coding scheme to provide clinicians with real-time feedback on SDM quality. The ultimate goal is to create a provider coaching tool that enhances SDM practices, addressing a critical gap in pediatric care for CMC. This project also serves as a pilot for a future R01 grant, aimed at expanding the tool to other high-stakes SDM contexts.
BMBench: an Empirical Bias Mitigation Benchmark for Multitask Machine Learning Predictions (with Kleyton da Costa, Cristian Muñoz, Franklin Fernandez, Adriano Koshiyama, Emre Kazim)
In the evolving landscape of algorithmic fairness, the development and assessment of bias mitigation methods require rigorous benchmarks. This paper introduces BMBench, a comprehensive benchmarking workflow designed to evaluate bias mitigation strategies across multitask machine learning predictions. Our benchmark leverages up-to-date datasets commonly used in fairness research, offering a broad spectrum of fairness metrics for a thorough evaluation for both classification and regression tasks. By incorporating a diverse set of fairness metrics, BMBench enables a nuanced assessment of bias mitigation methods, addressing the multifaceted nature of algorithmic bias. Additionally, we provide a publicly accessible code repository to empower researchers to test and refine their bias mitigation approaches with our workflow, fostering advancements in the creation of fair machine learning models.
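As an example of one metric of the kind such a benchmark aggregates, the snippet below computes a demographic parity difference on toy predictions; BMBench covers a much broader set of metrics, datasets, and tasks.

```python
# One toy fairness metric: demographic parity difference between two groups.
import numpy as np

rng = np.random.default_rng(8)
group = rng.binomial(1, 0.3, 1000)                           # protected-group indicator
y_pred = rng.binomial(1, np.where(group == 1, 0.35, 0.5))    # toy model predictions
dp_diff = y_pred[group == 0].mean() - y_pred[group == 1].mean()
print(f"demographic parity difference: {dp_diff:.3f}")
```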
A Statistical Framework for Quantifying Linear Extrapolation Bias in Regressions (with Lonjezo Sithole)
As ubiquitous as they are, linear regressions impose stringent functional forms, which may bias their coefficients, especially in causal inference applications. In this paper, we introduce a data-driven approach to quantify and mitigate incorrect linear extrapolation in linear regression models, which can bias coefficients relative to non-parametric methods. Our estimation strategy consists of a switching regression that lies between two extremes: (1) a conventional fully parametric linear regression; and (2) a fully nonparametric exact matching estimator. Our approach allows the researcher to choose how close the estimation is to approaches (1) or (2) by regulating a key hyperparameter. We provide a metric for optimal hyperparameter choice based on bias and a statistical test for the null hypothesis of no linear extrapolation bias in the regression.
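The toy example below conveys the idea of interpolating between the two extremes with a single hyperparameter lam; the paper's switching regression and its bias metric are more involved than this simple convex combination.

```python
# Toy contrast between the two extremes: a linear regression (which
# extrapolates linearly in x) and an exact-matching estimator (within-cell
# treated/control comparisons), blended by a hyperparameter lam.
import numpy as np

rng = np.random.default_rng(4)
n = 5000
x = rng.integers(0, 10, n)                       # discrete covariate, so exact matching is feasible
d = rng.binomial(1, 0.3 + 0.04 * x)              # treatment probability depends on x
y = 2.0 * d + 0.5 * x**2 + rng.normal(0, 1, n)   # outcome is nonlinear in x; true effect is 2

# Extreme (1): fully parametric linear regression of y on d and x.
Z = np.column_stack([np.ones(n), d, x])
beta_linear = np.linalg.lstsq(Z, y, rcond=None)[0][1]

# Extreme (2): fully nonparametric exact matching within each x cell.
cells = [(y[(x == v) & (d == 1)].mean() - y[(x == v) & (d == 0)].mean(), (x == v).sum())
         for v in np.unique(x)]
beta_match = np.average([c for c, _ in cells], weights=[w for _, w in cells])

for lam in (0.0, 0.5, 1.0):                      # lam = 1: pure linear, lam = 0: pure matching
    print(f"lam = {lam}: blended estimate {lam * beta_linear + (1 - lam) * beta_match:.3f}")
```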
Improving causal inference controls using network theory in discrete choice data
Many datasets in the social sciences result from agents making repeated choices over time, each choice leading to an observable outcome. Researchers often aim to model the causal impact of covariates on the outcome variable using various estimation strategies (e.g., fixed effects regression, difference-in-differences, instrumental variables). I propose a new way to strengthen the controls in these estimation procedures by applying network theory models motivated by a discrete choice framework. I suggest representing these datasets as a bipartite network, where agents are nodes on one side and choices are nodes on the other. Edges in this network represent a choice made by an agent at a certain time, stemming from a discrete choice problem. I argue that the structure of connections in this choice network allows the researcher to further improve controls when modeling the outcome variable. For instance, I use the choice network to project agents into a multidimensional latent space that captures each agent's choice profile. Distances between agents in this latent space provide a metric of similarity between them. By exploiting the high-dimensional choice profiles of agents, I propose several ways to enhance causal inference exercises.
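A minimal sketch of this idea on simulated data: embed agents using a truncated SVD of the agent-choice adjacency matrix and use distances in the embedding as a similarity metric; the SVD is an illustrative choice of projection, not a prescription of the paper.

```python
# Sketch: build an agent-by-choice adjacency matrix, project agents into a
# latent space, and find each agent's most similar peers by choice profile.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
n_agents, n_choices = 500, 80
latent = rng.normal(0, 1, (n_agents, 2))                   # unobserved agent tastes
loadings = rng.normal(0, 1, (n_choices, 2))
probs = 1 / (1 + np.exp(-(latent @ loadings.T)))           # discrete-choice-style probabilities
A = rng.binomial(1, probs).astype(float)                   # agent-by-choice adjacency matrix

embedding = TruncatedSVD(n_components=2, random_state=0).fit_transform(A)

# Agents close in this space have similar choice profiles and can serve as
# comparison units or additional controls in the outcome model.
nn = NearestNeighbors(n_neighbors=6).fit(embedding)
_, idx = nn.kneighbors(embedding[:1])
print("agents most similar to agent 0 by choice profile:", idx[0][1:])
```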
Matrix Learning, a New Tool in the Causal Inference Toolkit: omitted variable bias mitigation under a potential-outcomes similarity assumption
A common problem in treatment effect estimation is the existence of a confounding omitted variable, which biases estimates. Traditional approaches to handling this bias exploit exogenous variation, for example in the form of an instrumental variable or a discontinuity design. In the absence of exogenous variables, the researcher is left with covariate balancing techniques that do not resolve the omitted variable bias problem. In this work, I introduce a new causal inference approach that allows the researcher to learn a matrix that partials out any variation in the covariates related to the confounding variable, leading to unbiased estimates. To learn this crucial matrix, the researcher needs to select at least one subset of the dataset such that all observations in that subset possess the same potential outcomes. Under this potential-outcomes similarity assumption, I employ two alternative machine learning techniques to learn the desired matrix, which is then used to mitigate bias in treatment effect estimation.
Sparse modeling of wavelet features achieves high accuracy for fault classification and regression in spacecraft propulsion systems (with MIDAS postdocs) POSTER (honorable mention at the Asia Pacific Conference of the Prognostics and Health Management Society (PHMAP 2023))
Findings: We used all of the features in all of our tasks and split the sensor data into three segments, as shown in the image above. Complex models such as XGBoost and neural networks did not perform well, given how sparse the data was. The less complicated tasks, such as predicting the mode of failure, were better accomplished with random forest classifiers. The more complicated tasks, namely locating bubble and valve faults and predicting valve opening ratios, were better handled by simple models whose feature coefficients were regularized with the L1 norm. The simplicity and regularization of these models, logistic regression and LASSO, allowed them to filter through the high dimensionality of the features, focusing only on the variables that matter for each task at hand. One puzzling aspect of our analysis arose when the models fit to different data segments disagreed about the solenoid valve fault and its location. We learned from the data that when these disagreements happened, the first segment was disproportionately diagnostic, so we followed its predictions. We plan to explore this result further in a follow-up paper.
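The snippet below illustrates the modeling choice behind this finding on stand-in data (not the spacecraft sensor features): with far more features than informative signals, an L1-penalized model keeps only a handful of coefficients.

```python
# Illustration of why L1 regularization helps with high-dimensional,
# mostly uninformative features; the data here are simulated stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           n_redundant=0, random_state=0)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
n_kept = int(np.sum(clf.coef_ != 0))
print(f"L1-penalized model kept {n_kept} of {X.shape[1]} features")
```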
Inference-Based Instrumental Variable Estimation Choice
The instrumental variables (IV) estimation strategy is widespread in economics; however, a justification for choosing a particular functional relationship between the outcome y, the treatment d, and the control variables x is rarely given, nor is it validated by inference. Even under potential misspecification of the functional relationship between y and d, the 2SLS IV estimator can still provide a valid estimate of an average effect, taking into consideration specific weights. If the researcher is further interested in obtaining consistent estimators for other covariates in the y equation, in marginal effects of d, or in an average effect with respect to other weights, then conventional linear IV is no longer suited. Fortunately, parametric functional forms can be tested with recent (and more powerful) "goodness-of-fit" tests, corroborating the choice of certain specifications. Another approach is to move away from parametric estimation, making use of recent non-parametric IV tools or even deep neural network IV techniques, which suffer less from the curse of dimensionality and are thus more feasible in practice. Once more, the "goodness-of-fit" family of tests can provide tools for inferring the validity, or invalidity, of these non-parametric, data-driven approaches.
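To fix notation, here is a bare-bones 2SLS sketch on simulated data with outcome y, treatment d, instrument z, and a control x that enters y nonlinearly, so the linear specification is misspecified in x.

```python
# Bare-bones manual 2SLS on simulated data (illustration only).
import numpy as np

rng = np.random.default_rng(7)
n = 10000
x = rng.normal(0, 1, n)
z = rng.binomial(1, 0.5, n)                               # instrument
u = rng.normal(0, 1, n)                                   # unobserved confounder
d = 0.5 * z + 0.3 * x + 0.5 * u + rng.normal(0, 1, n)     # treatment
y = 1.5 * d + np.sin(x) + u + rng.normal(0, 1, n)         # note the nonlinearity in x

# First stage: project d on the instrument and controls.
Z1 = np.column_stack([np.ones(n), z, x])
d_hat = Z1 @ np.linalg.lstsq(Z1, d, rcond=None)[0]

# Second stage: regress y on fitted d and controls.
Z2 = np.column_stack([np.ones(n), d_hat, x])
beta = np.linalg.lstsq(Z2, y, rcond=None)[0]
print(f"2SLS estimate of the effect of d: {beta[1]:.3f}")
```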
This article investigates the impact of monetary policy on income distribution in Brazil. Income inequality affects both developed and underdeveloped economies, but its presence in the latter has a greater impact on vulnerable segments of society. The investigation of this phenomenon is critical for directing economic policies aimed at mitigating its adverse effects. We use macroeconomic variables and a Gini index calculated from microdata to measure income distribution. Our analysis employs vector autoregressive and Bayesian vector autoregressive approaches, regression analysis, and causality tests to find evidence of the impact of monetary policy on income distribution in the Brazilian case. The results show that a shock to the SELIC rate and to inflation positively impacts the Gini index, increasing inequality, at the 95% confidence level. In contrast, an increase in economic activity and job creation has a negative impact on the Gini index, reducing the income inequality observed in the economy.
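A minimal sketch of this type of VAR exercise on simulated stand-in series (not the Brazilian data used in the article): fit a two-variable VAR with statsmodels and read off the impulse response of an inequality proxy to an interest-rate shock.

```python
# Minimal VAR / impulse-response sketch on simulated series.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(6)
T = 200
selic = 10 + np.cumsum(rng.normal(0, 0.1, T))                 # stand-in policy rate
gini = 0.55 + 0.02 * (selic - 10) + rng.normal(0, 0.01, T)    # stand-in inequality index
data = pd.DataFrame({"selic": selic, "gini": gini})

res = VAR(data).fit(maxlags=2)
irf = res.irf(10)                       # impulse responses over 10 periods
print(irf.irfs[:, 1, 0])                # response of gini to a selic shock
```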
MODENESI, B. (2015) Bounds on Policy Relevant Parameters with Discrete Policy Variation. Fundação Getúlio Vargas. São Paulo, Brazil PAPER
When estimating policy parameters, also known as treatment effects, the mechanism of assignment to treatment almost always causes endogeneity and thus biases many of these policy parameter estimates. Additionally, heterogeneity in program impacts is more likely to be the norm than the exception for most social programs. In situations where these issues are present, estimation of the Marginal Treatment Effect (MTE) parameter makes use of an instrument to avoid assignment bias and simultaneously to account for heterogeneous effects across individuals. Although this parameter is point identified in the literature, the assumptions required for identification may be strong. Given that, I use weaker assumptions in order to partially identify the MTE, i.e., to establish a methodology for estimating bounds on the MTE, which I implement computationally and evaluate with Monte Carlo simulations. The partial identification requires the MTE to be a monotone function of the propensity score, which is a reasonable assumption in several economic settings, and the simulation results show that it is possible to obtain informative bounds even in restricted cases where point identification is lost.
ANDRADE, M. et al. (2012) Market definition and concentration in the private health care industry in Brazil. Pesquisa e Planejamento Econômico, IPEA/Rio de Janeiro, v. 42, p. 329-361, 2012 PAPER
In this paper we investigate concentration in the health insurance sector in Brazil. To conduct this analysis, it is necessary to define the relevant market in both the product and geographical dimensions. We apply a methodology based on gravitational models to define the geographical market. Until now, concentration analysis in Brazil has been performed using geopolitical boundaries as the market definition. Geopolitical boundaries may not be an adequate criterion, since Brazil is an especially large and heterogeneous country. We assume that health services are locally demanded and supplied, so the market area is defined by the flow of trade, conditioned on health services supply, potential demand, and friction variables. The empirical analysis uses a database from the National Health Insurance Agency in Brazil (Agência Nacional de Saúde Suplementar - ANS) for 2007 and 2010. We analyze the competition structure by computing concentration indexes. Our results indicate that the health insurance sector in Brazil is highly concentrated, with the largest firm, UNIMED, dominating the majority of markets.
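For reference, the concentration measure at the core of such an analysis, the Herfindahl-Hirschman index, is simply the sum of squared market shares within a defined market; the shares below are illustrative, not ANS data.

```python
# Illustrative Herfindahl-Hirschman index (HHI) for one defined market.
shares = [0.55, 0.20, 0.15, 0.10]        # hypothetical firms' market shares
hhi = sum(s ** 2 for s in shares)
print(f"HHI = {hhi:.3f} (about {1 / hhi:.1f} equivalent equal-sized firms)")
```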
MODENESI, B; ANDRADE, M.; MAIA, A. (2010) Private health care market definition and concentration in MG state. Proceedings of the 14th Seminar on the Economy of Minas Gerais. CEDEPLAR-UFMG. PAPER
Minas Gerais (MG) ranks third among Brazilian states in the number of private health plan beneficiaries, with approximately 10% of the country's market. This work proposes and operationalizes a methodology for market definition for health plans and health insurance in the state of MG, as opposed to the use of geopolitical borders as a guide for defining markets in the geographic dimension. We observe that all markets for private health plans and insurance in the state of MG present high concentration, regardless of the method used to define the market area. This result contrasts sharply with the market concentration measurements obtained using geopolitical borders, conventionally employed in the industrial organization literature.
ANDRADE, M. et al. (2010) Private Health Market Structure in Brazil. CEDEPLAR Discussion Paper 400. PAPER
In this paper we investigate concentration in the health insurance sector in Brazil. To conduct this analysis, it is necessary to define the relevant market in both the product and geographical dimensions. We apply a methodology based on gravitational models to define the geographical market. Until now, concentration analysis in Brazil has been performed using geopolitical boundaries as the market definition. Geopolitical boundaries may not be an adequate criterion, since Brazil is an especially large and heterogeneous country. We assume that health services are locally demanded and supplied, so the market area is defined by the flow of trade, conditioned on health services supply, potential demand, and friction variables. The empirical analysis uses a database from the National Health Insurance Agency in Brazil (Agência Nacional de Saúde Suplementar - ANS) for 2007 and 2010. We analyze the competition structure by computing concentration indexes. Our results indicate that the health insurance sector in Brazil is highly concentrated, with the largest firm, UNIMED, dominating the majority of markets.