Research

Working papers

This paper develops a new data-driven approach to characterizing latent worker skill and job task heterogeneity by applying an empirical tool from network theory to large-scale Brazilian administrative data on worker--job matching. We micro-found this tool using a standard model of workers matching with jobs according to comparative advantage. Our classifications identify important dimensions of worker and job heterogeneity that standard classifications based on occupations and sectors miss. Moreover, a general equilibrium model based on our classifications more accurately predicts wage changes in response to the 2016 Olympics than a model based on occupations and sectors. Finally, we show that reduced-form estimates of the effects of labor market shock exposure on workers' earnings are as much as 4 times larger when workers and jobs are grouped using our classifications rather than occupations and sectors.
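For intuition, the sketch below uses spectral co-clustering of a simulated worker-by-job match matrix as a stand-in for the network-theoretic classification tool described above; the matrix, cluster counts, and library choice are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative stand-in (not the paper's tool): co-cluster a worker-by-job match
# matrix so that workers with similar matching patterns share a latent type and
# jobs that hire similar workers share a latent class.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Hypothetical binary matrix: entry (w, j) = 1 if worker w ever matched with job j.
matches = (rng.random((500, 200)) < 0.10).astype(int)

model = SpectralCoclustering(n_clusters=10, random_state=0)
model.fit(matches)

worker_types = model.row_labels_     # latent worker classes
job_classes = model.column_labels_   # latent job classes
print(np.bincount(worker_types), np.bincount(job_classes))
```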

In this paper, we describe the capabilities and constraints of Large Language Models (LLMs) within disparate academic disciplines, aiming to delineate their strengths and limitations with precision. We examine how LLMs augment scientific inquiry, offering concrete examples such as accelerating literature review by summarizing vast numbers of publications, enhancing code development through automated syntax correction, and refining the scientific writing process. Simultaneously, we articulate the challenges LLMs face, including their reliance on extensive and sometimes biased datasets, and the potential ethical dilemmas stemming from their use. Our critical discussion extends to the varying impacts of LLMs across fields, from the natural sciences, where they help model complex biological sequences, to the social sciences, where they can parse large-scale qualitative data. We conclude by offering a nuanced perspective on how LLMs can be both a boon and a boundary to scientific progress.

This paper measures gender discrimination by decomposing male--female differences in average wages into a component explained by male and female workers having different productivity distributions and a component explained by equally productive male and female workers being paid differently. This requires building reliable counterfactuals by identifying all relevant controls, such that male workers are compared to female workers who, conditional on those controls, are identical in all aspects relevant to pay other than their gender. To do this, we (i) develop a new economically principled network-based approach to control for unobserved worker skill and job task heterogeneity using the information revealed by detailed data on worker--job matching patterns, (ii) non-parametrically estimate counterfactual wage functions for male and female workers, (iii) introduce a correction for the possibility that the male and female productivity distributions do not overlap, and (iv) apply our methods by revisiting gender wage gap decompositions using improved counterfactuals based on (i), (ii) and (iii). We decompose the gender wage gap in Rio de Janeiro, Brazil, and find that it is almost entirely explained by male and female workers who possess similar skills and perform similar tasks being paid different wages.
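As a rough illustration of the kind of decomposition involved, the sketch below splits a male-female mean log-wage gap into a composition component and a within-cell pay component, using hypothetical worker-skill/job-task cells and column names. It is a simplified stand-in for steps (i)-(iv): in particular, it simply drops cells without overlap rather than applying the paper's correction.

```python
# Simplified stand-in for the decomposition: "cells" proxy for groups of workers
# with similar skills performing similar tasks; column names ("cell", "log_wage",
# "male") are hypothetical.
import pandas as pd

def decompose_gap(df, cell_col="cell", wage_col="log_wage", male_col="male"):
    men, women = df[df[male_col] == 1], df[df[male_col] == 0]
    stats = pd.DataFrame({
        "share_m": men[cell_col].value_counts(normalize=True),
        "share_f": women[cell_col].value_counts(normalize=True),
        "wage_m": men.groupby(cell_col)[wage_col].mean(),
        "wage_f": women.groupby(cell_col)[wage_col].mean(),
    })
    overlap = stats.dropna()  # cells containing both men and women
    # Composition: men and women distributed differently across cells.
    composition = ((overlap["share_m"] - overlap["share_f"]) * overlap["wage_f"]).sum()
    # Within-cell: similarly skilled men and women paid differently.
    within = (overlap["share_m"] * (overlap["wage_m"] - overlap["wage_f"])).sum()
    # Cells without overlap are the target of the paper's non-overlap correction.
    n_off_support = int(stats.isna().any(axis=1).sum())
    return composition, within, n_off_support
```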

I generalize state-of-the-art approaches that decompose differences in the distribution of a variable of interest between two groups into a portion explained by covariates and a residual portion. The method that I propose relaxes the overlapping supports assumption, allowing the groups being compared to not share exactly the same covariate support. I illustrate my method by revisiting the black--white wealth gap in the U.S. as a function of labor income and other variables. Traditionally used decomposition methods trim (or assign zero weight to) observations that lie outside the common covariate support region. By instead allowing all observations to contribute to the estimated wealth gap, I find that otherwise-trimmed observations contribute from 3% to 19% of the overall wealth gap, depending on the portion of the wealth distribution considered.
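The sketch below illustrates only the common-support problem the method addresses: it flags, on simulated data, the observations that standard decompositions would trim, using a group-membership score. The proposed method instead keeps these observations; the code is an assumption-laden illustration of the problem, not of the solution.

```python
# Illustration of the common-support problem: flag observations outside the
# overlap region via a group-membership score. (The paper's method keeps them.)
import numpy as np
from sklearn.linear_model import LogisticRegression

def flag_off_support(X, g, eps=0.01):
    """Boolean mask of observations outside (an approximation of) the common support."""
    score = LogisticRegression(max_iter=1000).fit(X, g).predict_proba(X)[:, 1]
    lo = max(score[g == 0].min(), score[g == 1].min())
    hi = min(score[g == 0].max(), score[g == 1].max())
    return (score < lo + eps) | (score > hi - eps)

# Simulated covariates whose supports only partially overlap across groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(1.5, 1.0, (500, 2))])
g = np.repeat([0, 1], 500)
print(f"share of observations flagged off-support: {flag_off_support(X, g).mean():.1%}")
```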

The Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) is the main source of labor market information in the United States. Despite containing records for virtually every worker in the country, it lacks a crucial piece of information for labor market analysis: workers' occupations. In this paper, we attempt to impute occupations in the LEHD by exploiting the information contained in its rich set of worker–job matches, forming the labor market network using economic theory and network theory. We find that while the information contained in these matches is informative about economic outcomes like earnings, it is minimally informative about occupation. In particular, the information gleaned from worker–job matches has minimal predictive power for occupation once other variables, like industry, are included as predictors.
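A hedged sketch of the kind of horse race described above: compare the cross-validated accuracy of predicting occupation from industry alone versus industry plus network-derived worker features. The column names ("occupation", "industry") and the embedding columns are hypothetical placeholders, not the LEHD layout.

```python
# Hypothetical layout: df has an "occupation" target, an "industry" column, and
# network-based worker features listed in emb_cols.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_feature_sets(df, emb_cols, cv=5):
    y = df["occupation"]
    X_industry = pd.get_dummies(df["industry"])
    X_both = pd.concat([X_industry, df[emb_cols]], axis=1)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    acc_industry = cross_val_score(clf, X_industry, y, cv=cv).mean()
    acc_both = cross_val_score(clf, X_both, y, cv=cv).mean()
    # Similar accuracies indicate the network information adds little for occupation.
    return acc_industry, acc_both
```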

Work in progress

Individual fairness methods build on the idea that similar observations should be treated similarly by a machine learning model, circumventing some of the shortcomings of group fairness tools. Nevertheless, many existing individual fairness approaches are either tailored to specific models or rely on a series of ad hoc decisions to determine model bias. In this paper, we propose an individual fairness-inspired, inference-based bias detection pipeline. Our method is model-agnostic, suited for all data types, avoids commonly used ad hoc thresholds and decisions, and provides an intuitive scale indicating how biased the assessed model is. We propose a model ensemble approach for our bias detection tool, consisting of: (i) building a proximity matrix with random forests based on features and output; (ii) inputting it into a Bayesian network method to cluster similar observations; (iii) performing within-cluster inference to test the hypothesis that the model is treating similar observations similarly; and (iv) aggregating the cluster tests with a multiple hypothesis test correction. In addition to providing a single statistical p-value for the null hypothesis that the model is unbiased based on individual fairness, we further create a scale that measures the amount of bias against minorities carried by the model of interest, making the overall p-value more interpretable to decision-makers. We apply our methodology to assess bias in the mortgage industry, and we provide an open-source Python package for our methods.
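A minimal sketch of the four steps, under simplifying assumptions: inputs are numpy arrays, the protected attribute is binary, and agglomerative clustering stands in for the Bayesian network clustering step used in the paper.

```python
# Sketch of the pipeline: (i) random-forest proximities, (ii) clustering of similar
# observations, (iii) within-cluster tests that the assessed model scores similar
# observations similarly across the protected attribute, (iv) FDR correction.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.multitest import multipletests

def individual_fairness_pvalues(X, y, scores, protected, n_clusters=20):
    # (i) proximity: share of trees in which two observations land in the same leaf
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    leaves = rf.apply(X)                      # shape (n_samples, n_trees)
    n = leaves.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        prox += (leaves[:, [t]] == leaves[:, t])
    prox /= leaves.shape[1]
    # (ii) cluster similar observations on the proximity-based distance
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    ).fit_predict(1.0 - prox)
    # (iii) within-cluster tests on the assessed model's scores
    pvals = []
    for c in np.unique(labels):
        s, p = scores[labels == c], protected[labels == c]
        if min((p == 1).sum(), (p == 0).sum()) >= 5:
            pvals.append(mannwhitneyu(s[p == 1], s[p == 0]).pvalue)
    # (iv) aggregate with a multiple hypothesis test (false discovery rate) correction
    reject, pvals_corrected, _, _ = multipletests(pvals, method="fdr_bh")
    return reject, pvals_corrected
```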

In the evolving landscape of algorithmic fairness, the development and assessment of bias mitigation methods require rigorous benchmarks. This paper introduces BMBench, a comprehensive benchmarking workflow designed to evaluate bias mitigation strategies across multitask machine learning predictions. Our benchmark leverages up-to-date datasets commonly used in fairness research and offers a broad spectrum of fairness metrics for a thorough evaluation of both classification and regression tasks. By incorporating a diverse set of fairness metrics, BMBench enables a nuanced assessment of bias mitigation methods, addressing the multifaceted nature of algorithmic bias. Additionally, we provide a publicly accessible code repository to empower researchers to test and refine their bias mitigation approaches with our workflow, fostering advancements in the creation of fair machine learning models.

As ubiquitous as they are, linear regressions impose stringent functional forms, which may bias their coefficients, especially in causal inference applications. In this paper, we introduce a data-driven approach to quantify and mitigate the coefficient bias that arises when linear regression models extrapolate linearly where non-parametric methods would not. Our estimation strategy is a switching regression that lies between two extremes: (1) a conventional, fully parametric linear regression; and (2) a fully non-parametric exact matching estimator. Our approach allows the researcher to choose how close the estimation is to approach (1) or (2) by regulating a key hyperparameter. We provide a metric for optimal hyperparameter choice based on bias and a statistical test for the null hypothesis of no linear extrapolation bias in the regression.
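To fix ideas, the sketch below computes the two extremes that the proposed switching regression interpolates between: a global OLS coefficient and an exact-matching (within-cell) estimate on a coarsened scalar covariate. It assumes a binary treatment and is not the paper's estimator or its hyperparameter selection rule.

```python
# The two extremes only (illustrative): (1) fully parametric OLS, (2) exact
# matching on coarsened covariate cells. Assumes d is a 0/1 treatment and x is scalar.
import numpy as np
import pandas as pd

def two_extremes(y, d, x, n_bins=10):
    # (1) fully parametric extreme: OLS of y on d, controlling linearly for x
    X = np.column_stack([np.ones_like(y), d, x])
    ols_effect = np.linalg.lstsq(X, y, rcond=None)[0][1]
    # (2) fully non-parametric extreme: within-cell difference in means,
    #     with cells given by coarsened values of x (exact matching on bins)
    df = pd.DataFrame({"y": y, "d": d, "cell": pd.cut(x, bins=n_bins)})
    cell_means = (
        df.groupby(["cell", "d"], observed=True)["y"].mean().unstack("d").dropna()
    )
    weights = df.groupby("cell", observed=True).size().loc[cell_means.index]
    match_effect = np.average(cell_means[1] - cell_means[0], weights=weights)
    # A large gap between the two estimates signals linear-extrapolation bias.
    return ols_effect, match_effect
```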

Many datasets in the social sciences result from agents making repeated choices over time, with some observable outcome resulting from each choice. Researchers often want to model the causal impact of covariates on the outcome variable using different estimation strategies (e.g., fixed effects regression, difference-in-differences, or instrumental variables). I propose a way to increase control in these estimation procedures by using network theory models motivated by a discrete choice framework. I suggest a bipartite network representation of these datasets, with agents as nodes on one side of the network and choices as nodes on the other. Edges in this network represent a choice made by an agent at a certain time, resulting from a discrete choice problem. I argue that the structure of connections in this choice-network allows the researcher to further improve controls when modeling the outcome variable. For instance, I use the choice-network to project agents into a multidimensional latent space that captures each agent's choice profile, and distances between agents in this latent space provide a metric of their similarity. I propose exploring the high-dimensional choice profiles of agents to improve causal inference exercises in a series of ways.
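A small sketch of the projection step, assuming long-format data with hypothetical "agent_id" and "choice_id" columns: build the bipartite agent-choice matrix and embed agents with a truncated SVD so that distances in the latent space summarize similarity of choice profiles.

```python
# Sketch: bipartite agent-choice matrix -> low-dimensional agent embeddings that
# can be used as additional controls. Column names are hypothetical.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

def agent_embeddings(df, n_dims=10):
    agents = df["agent_id"].astype("category")
    choices = df["choice_id"].astype("category")
    # rows: agents, columns: choices; entries count how often each pair occurs
    A = csr_matrix((np.ones(len(df)), (agents.cat.codes, choices.cat.codes)))
    emb = TruncatedSVD(n_components=n_dims, random_state=0).fit_transform(A)
    return pd.DataFrame(emb, index=agents.cat.categories)
```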

A common problem in treatment effects estimation is the existence of a confounding omitted variable, which biases estimates. Traditional approaches to handling this bias exploit exogenous variation, for example in the form of an instrumental variable or of discontinuities. In the absence of such exogenous variables, the researcher is left with covariate balancing techniques that do not resolve the omitted variable bias problem. In this work, I introduce a new causal inference approach that allows the researcher to learn a matrix that partials out any variation in the covariates related to the confounding variable, leading to unbiased estimates. In order to learn this crucial matrix, the researcher needs to select at least one subset of the dataset such that all observations in that subset possess the same potential outcomes. Under the assumption of similar potential outcomes, I employ two alternative machine learning techniques that learn the desired matrix, which is later used to mitigate bias in treatment effects estimation.
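The heavily simplified sketch below conveys the intuition with a single learned direction and linear models rather than a general learned matrix; the subset indicator and all names are illustrative, and this is not the paper's procedure.

```python
# Single-direction, linear simplification of the idea; not the paper's estimator.
import numpy as np

def partial_out_confounder(y, d, X, subset_mask):
    # Within the selected subset S (same potential outcomes), any systematic
    # relation between the covariates and the outcome is attributed to the confounder.
    Xs = X[subset_mask] - X[subset_mask].mean(axis=0)
    ys = y[subset_mask] - y[subset_mask].mean()
    w = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    w /= np.linalg.norm(w)
    # Remove the learned confounding direction from the covariates everywhere.
    X_clean = X - np.outer(X @ w, w)
    # Re-estimate the treatment effect with the cleaned covariates.
    Z = np.column_stack([np.ones(len(y)), d, X_clean])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return beta[1]
```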

This paper develops a new approach to defining the scope of skill- and task-based labor markets and uses it to compute labor market power. Building upon tools from network theory, we classify workers into latent types and jobs into "markets" by exploiting the network structure of worker-job links and worker movement between jobs, inherent in linked employer-employee data. Intuitively, two workers belong to the same latent type if they have similar probabilities of working in the same market, and two jobs belong to the same market if they have similar probabilities of hiring the same workers. We use discrete choice methods to infer the productivity of each worker type when matched with each market, using the logic that worker-job matches that pay more and occur more frequently in equilibrium reveal themselves to be more productive. Using this high-dimensional productivity matrix, we compute a measure of market concentration, similar to an HHI, that accounts for the fact that workers may match with multiple markets and that some markets are "closer" to each other, in the sense of hiring more similar workers. Using the market concentration measure, we compute labor supply elasticities and markdowns for individual firms.
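The toy sketch below shows one way such a concentration measure can depart from a standard market-by-market HHI: firm shares are weighted by a worker type's probabilities of matching with each market, so a type that matches with several markets faces lower effective concentration. The arrays and the weighting scheme are illustrative assumptions, not the paper's formula.

```python
# Illustrative HHI-style measure that lets a worker type span several markets.
import numpy as np

def type_weighted_hhi(firm_shares_by_market, market_probs):
    """firm_shares_by_market: dict market -> firm employment shares (sum to 1);
       market_probs: dict market -> probability the worker type matches there (sum to 1)."""
    effective = {}
    for m, shares in firm_shares_by_market.items():
        for f, s in enumerate(shares):
            effective[(m, f)] = market_probs[m] * s
    total = sum(effective.values())
    return sum((v / total) ** 2 for v in effective.values())

# A type splitting its matches between two markets faces lower concentration
# than one confined to the more concentrated market alone.
shares = {"A": np.array([0.7, 0.3]), "B": np.array([0.25, 0.25, 0.25, 0.25])}
print(type_weighted_hhi(shares, {"A": 0.5, "B": 0.5}))   # ~0.21
print(type_weighted_hhi(shares, {"A": 1.0, "B": 0.0}))   # 0.58
```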

Findings: We used all of the features in all of our tasks and split the sensor data into three segments, as shown in the image above. Complex AI models like XGBoost or neural networks did not perform well, given how sparse the data were. Less complicated tasks, like predicting the mode of failure, were better accomplished with random forest classifiers. The complicated tasks, namely the bubble and valve fault locations and the valve opening ratios, were better handled by simple models whose feature coefficients were regularized with the L1 norm. The simplicity and regularization of these models, logistic regression and LASSO, allowed them to filter through the high dimensionality of the features, focusing only on the variables that matter for each task at hand. One puzzling feature of our analysis concerns the cases in which our models for each segment of the data disagreed on the solenoid valve fault and location. We learned from the data that when these disagreements happened, the first segment was disproportionately diagnostic, and we followed its predictions. This is a result that we plan to explore further in a follow-up paper.
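For concreteness, the sketch below shows the kind of L1-regularized models referred to above, with placeholder penalty strengths and without the actual feature matrices or targets.

```python
# Sketch of the regularized models: L1-penalized logistic regression for the
# categorical fault tasks, LASSO for the continuous valve-opening-ratio task.
# Penalty strengths and the commented fit calls are placeholders.
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

fault_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=5000),
)
ratio_reg = make_pipeline(StandardScaler(), Lasso(alpha=0.01, max_iter=50000))

# fault_clf.fit(X_train, fault_labels)
# ratio_reg.fit(X_train, opening_ratio)
```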

The instrumental variables (IV) estimation strategy is widespread in economics; however, a justification for choosing a certain functional relationship between the outcome y, the treatment d, and the control variables x is rarely given, nor is it validated by inference. Regardless of potential misspecification of the functional relationship between y and d, the 2SLS IV estimator can still provide a valid estimate of an average effect, taking into consideration specific weights. If the researcher is further interested in obtaining consistent estimates for other covariates in the y equation, in marginal effects of d, or in an average effect with respect to other weights, then conventional linear IV is no longer suited. Fortunately, parametric functional forms can be tested with recent (and more powerful) “goodness-of-fit” tests, corroborating the choice of a particular specification. Another approach is to move away from parametric estimation altogether, making use of recent non-parametric IV tools or even deep neural network IV techniques, which suffer less from the curse of dimensionality and would thus be more feasible in practice. Once more, the “goodness-of-fit” family of tests can provide tools for inferring the validity, or invalidity, of these non-parametric, data-driven approaches.
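For reference, a minimal numpy-only 2SLS sketch of the linear specification whose adequacy the tests discussed above are meant to probe; the variable names and shapes are generic, and the function is not tied to any particular test.

```python
# Minimal 2SLS sketch (numpy only): the linear specification that goodness-of-fit
# tests or non-parametric IV estimators would scrutinize.
import numpy as np

def two_stage_least_squares(y, d, x, z):
    """y: outcome, d: endogenous treatment, x: exogenous controls, z: instrument(s)."""
    X_exog = np.column_stack([np.ones(len(y)), x])
    Z = np.column_stack([X_exog, z])               # instruments incl. exogenous controls
    # first stage: fitted values of the treatment
    d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]
    # second stage: regress the outcome on the fitted treatment and controls
    X2 = np.column_stack([X_exog, d_hat])
    beta = np.linalg.lstsq(X2, y, rcond=None)[0]
    return beta[-1]                                # coefficient on the treatment
```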

Publications

Rapid advancements in artificial intelligence (AI) technology have brought about a plethora of new challenges in terms of governance and regulation. AI systems are being integrated into various industries and sectors, creating a demand from decision-makers for a comprehensive and nuanced understanding of the capabilities and limitations of these systems. One critical aspect of this demand is the ability to explain the results of machine learning models, which is crucial to promoting transparency and trust in AI systems, as well as fundamental to helping machine learning models be trained ethically. In this paper, we present novel quantitative metric frameworks for interpreting the predictions of classifier and regressor models. The proposed metrics are model-agnostic and are defined so as to quantify: (i) interpretability factors based on global and local feature importance distributions; (ii) the variability of feature impact on the model output; and (iii) the complexity of feature interactions within model decisions. We employ publicly available datasets to apply our proposed metrics to various machine learning models focused on predicting customers’ credit risk (classification task) and real estate price valuation (regression task). The results show how these metrics can provide a more comprehensive understanding of model predictions and facilitate better communication between decision-makers and stakeholders, thereby increasing the overall transparency and accountability of AI systems.
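As an illustration of the flavor of metric (i), the sketch below summarizes how concentrated a model's global permutation feature importances are; the specific summary statistic and names are assumptions for exposition, not the paper's exact definitions.

```python
# Illustrative interpretability summary: concentration of global permutation
# feature importances. Values near 1 mean predictions hinge on one feature;
# values near 1/n_features mean diffuse reliance across features.
import numpy as np
from sklearn.inspection import permutation_importance

def importance_concentration(model, X_test, y_test, random_state=0):
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=random_state
    )
    imp = np.clip(result.importances_mean, 0, None)
    p = imp / imp.sum()
    return float((p ** 2).sum())   # Herfindahl-style concentration in [1/k, 1]
```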

This article investigates the impact of monetary policy on income distribution in Brazil. Income inequality affects both developed and underdeveloped economies, but its presence in the latter has a greater impact on vulnerable segments of society. Investigating this phenomenon is critical for directing economic policies aimed at mitigating its adverse effects. We use macroeconomic variables and a Gini index calculated from microdata to measure income distribution. Our analysis employs vector autoregressive and Bayesian vector autoregressive approaches, regression analysis, and causality tests to find evidence of the impact of monetary policy on income distribution in the Brazilian case. The results show that shocks to the SELIC rate and to inflation raise the Gini index, increasing inequality, with effects significant at the 95% confidence level. In contrast, increases in economic activity and job creation lower the Gini index, reducing the income inequality observed in the economy.
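A compact sketch of the VAR/impulse-response step using statsmodels; the column names stand in for the series used in the article (the SELIC rate, inflation, activity, and the Gini index), and the lag and horizon choices are placeholders.

```python
# Sketch of the VAR exercise; column names and lag choices are placeholders.
from statsmodels.tsa.api import VAR

def gini_irf(df, horizon=12, lags=2):
    """df: DataFrame with columns such as ['selic', 'inflation', 'activity', 'gini']."""
    results = VAR(df).fit(lags)
    irf = results.irf(horizon)
    # response of the Gini index to a one-s.d. SELIC shock, with error bands
    irf.plot(impulse="selic", response="gini", orth=True)
    return irf
```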

When estimating policy parameters, also known as treatment effects, the assignment-to-treatment mechanism almost always causes endogeneity and thus biases many of these policy parameter estimates. Additionally, heterogeneity in program impacts is more likely to be the norm than the exception for most social programs. In situations where these issues are present, estimation of the Marginal Treatment Effect (MTE) parameter makes use of an instrument to avoid assignment bias while simultaneously accounting for heterogeneous effects across individuals. Although this parameter is point identified in the literature, the assumptions required for identification may be strong. Given that, I use weaker assumptions in order to partially identify the MTE, i.e., to establish a methodology for estimating bounds on the MTE, implementing it computationally and presenting results from Monte Carlo simulations. The partial identification I perform requires the MTE to be a monotone function of the propensity score, which is a reasonable assumption in several economic settings, and the simulation results show that it is possible to obtain informative bounds even in restricted cases where point identification is lost.
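For background, the sketch below estimates the object being bounded under the standard local IV logic, in which the MTE is the derivative of E[Y | propensity score] with respect to the propensity score, and checks the monotonicity assumption on the fitted curve; it is not the bounds estimation procedure itself, and the polynomial specification is an illustrative assumption.

```python
# Local IV sketch: approximate E[Y | p] with a polynomial in the propensity score p,
# take its derivative as the MTE, and check monotonicity over a grid.
import numpy as np

def mte_from_polynomial(y, pscore, degree=3, grid=None):
    coefs = np.polyfit(pscore, y, deg=degree)   # E[Y | p] approximated by a polynomial
    deriv = np.polyder(coefs)                   # MTE(p) = d E[Y | p] / dp
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    mte = np.polyval(deriv, grid)
    is_monotone = bool(np.all(np.diff(mte) >= 0) or np.all(np.diff(mte) <= 0))
    return grid, mte, is_monotone
```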

In this paper we investigate concentration in the health insurance sector in Brazil. Conducting this analysis requires defining the relevant market in both its product and geographic dimensions. We apply a methodology based on gravity models to define the geographic market. Until now, concentration analyses in Brazil have used geopolitical boundaries as the market definition. Geopolitical boundaries may not be an adequate criterion, since Brazil is an especially large and heterogeneous country. We assume that health services are locally demanded and supplied, so the market area is defined by the flow of trade, conditioned on health services supply, potential demand, and friction variables. The empirical analysis uses data from the National Health Insurance Agency in Brazil (Agência Nacional de Saúde Suplementar – ANS) for 2007 and 2010. We analyze the competitive structure by computing concentration indices. Our results indicate that the health insurance sector in Brazil is highly concentrated. The most important firm is UNIMED, which dominates the majority of markets.
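The sketch below illustrates only the concentration step once market areas have been delimited by trade flows: compute a Herfindahl-Hirschman index per market from insurers' beneficiary counts. The data layout and column names are illustrative assumptions.

```python
# Illustrative layout: one row per (market, insurer) with beneficiary counts,
# where "market" is the flow-defined market area.
import pandas as pd

def hhi_by_market(df):
    counts = df.groupby(["market", "insurer"])["beneficiaries"].sum()
    shares = counts / counts.groupby(level="market").transform("sum")
    return (shares ** 2).groupby(level="market").sum()   # HHI in (0, 1] per market
```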

Minas Gerais (MG) ranks third among Brazilian states in the number of private health plan beneficiaries, accounting for approximately 10% of the country's market. This work proposes and operationalizes a methodology for defining markets for health plans and health insurance in the state of MG, as opposed to using geopolitical borders to delimit markets in the geographic dimension. We observe that all markets for private health plans and insurance in the state present high concentration, regardless of the method used to define the market area. This result contrasts sharply with the market concentration measures obtained using geopolitical borders, as conventionally done in the industrial organization literature.
