Publications
AI-assisted teams outperform AI-led teams but not human-only teams in assessing research reproducibility in quantitative social science
(with Abel Brodeur, David Valenta, Alexandru Marcoci, Juan P. Aparicio, Derek Mikola, Bruno Barbarioli, Rohan Alexander, Lachlan Deer, Tom Stafford, Lars Vilhuber, Gunther Bensch et al.) (Proceedings of the National Academy of Sciences, 2026)
(WP)
Large Language Models (LLMs) such as ChatGPT are transforming how scientists conduct and validate research, offering promise as tools to improve scientific reproducibility. However, computational reproducibility and error detection remain expensive and labor-intensive. We experimentally test how collaboration between researchers and LLM assistants influences the reproduction of quantitative social science findings across different levels of AI autonomy. We randomly assigned 288 researchers to 103 teams working under three conditions: human-only, AI-assisted (using ChatGPT as a collaborative tool), or AI-led (ChatGPT operating with minimal human oversight). Teams reproduced published results from leading social science journals, detected coding errors, and proposed robustness checks. Human-only and AI-assisted teams achieved comparable reproduction rates (94% vs. 91%) and performed similarly on most outcomes, except human-only teams identified significantly more major coding errors. Both substantially outperformed AI-led teams, which achieved only a 37% reproduction rate, detected fewer errors across all categories, proposed weaker robustness checks, and required more time. This autonomous approach, however, likely represents only a lower bound of AI capabilities. Despite rapid model advances, expert human judgment currently remains indispensable for reliable empirical verification. While AI assistance did not degrade most outcomes, it provided no measurable advantages and was associated with reduced detection of major errors. However, the 37% autonomous reproduction rate indicates that AI could provide value in settings where scale or cost constraints preclude human review of papers, even though general-purpose LLMs offer no immediate advantages for human-supervised verification.
Living in the Gender Spectrum: Evidence from Non-Cisgender Applications in the Rental Housing Market
(with Sofia Fritzson) (Journal of Housing Economics, 2025)
(WP)
We present novel evidence from the first correspondence study investigating the effect of individual non-cisgender signals in the housing market. In a preregistered trial, 800 fictitious letters were sent to rental apartment landlords in Sweden. Cismale applicants received fewer positive responses compared to ciswomen, while non-cisgender applicants had response rates that fell between those of ciswomen and cismen. The effects were strongest for apartments located outside of major cities. Non-cisgender applicants were also more often asked to clarify their gender. Additionally, cismale applicants were more likely to be addressed by the wrong name and were less frequently asked if they would bring any cohabitants.
The Effect of an Anonymous Grading Reform for Male and Female University Students
(with Björn Tyrefors) (Economics Letters, 2025)
(WP)
This paper leverages a university-wide anonymous grading reform and presents evidence that female university students benefit from anonymous grading. Female grades improve by around 0.035 standard deviations relative to males. The effect is driven by smaller classes and male-dominated departments.
Grading Bias and the Leaky Pipeline in Economics: Evidence from Stockholm University
(with Björn Tyrefors) (Labour Economics, 2022)
(WP)
We estimate a substantial grade gain for female students relative to male students when exams are graded anonymously in 101-level macroeconomics courses. Females graded anonymously are more likely to continue with economics studies. This suggests that biased grading is a direct cause of the “leaky pipeline” phenomenon in economics. As male graders are the majority, we complement our analysis and evaluate the importance of same-sex bias using random assignment of graders. Although we estimate a substantial same-sex bias before anonymous exams were introduced, it cannot explain the overall effect of grading bias. Thus, same-sex bias is not the mechanism explaining the overall effect of grading bias.
Media coverage:
Dagens Næringsliv (in Norwegian)
Working papers
Sentence Length and Recidivism: Court Rulings based on BAC (with Mikael Priks, Per Pettersson-Lidbom and Björn Tyrefors) (submitted)
We study the effect of prison sentences on recidivism using a unique feature of sentencing for drunk driving in the Swedish court system. Below a blood alcohol concentration (BAC) of 1.0‰, individuals are never sentenced to prison, and above 1.0‰, the average number of days sentenced to prison increases essentially linearly with the BAC level. We find that being sentenced to prison for one month reduces reoffending in the next five years by approximately 80 percent.
Misogyny and Xenophobia Online: A Matter of Anonymity (with Emma von Essen) (submitted)
Social media platforms rapidly disseminate political content, shaping democratic discourse while enabling anonymity that both protects expression and limits accountability. This study combines large-scale text analysis with a difference-in-differences event study design to examine how reduced anonymity influences xenophobia, misogyny, and false information in online political discussions. Using data from a major Swedish discussion forum, we apply fine-tuned BERT models to classify xenophobic and misogynistic content across two user groups: those affected by an anonymity-reducing event and those not. Reduced anonymity is associated with a significant decrease in xenophobia, while levels of misogyny remain unchanged or increase slightly. In line with theoretical expectations, informational quality improves in discussions on immigration, where xenophobia was prevalent, but not in feminist discussions dominated by misogyny. These findings highlight the asymmetric role of anonymity in shaping online hate and suggest that identity exposure may curb certain forms of harmful speech while leaving others unaffected.
Popular science coverage:
Ikaros (in Swedish)
Nyfiken (podcast, in English)
IFN-podden (podcast, in Swedish)
Ekonomisk debatt (in Swedish)
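For readers unfamiliar with the difference-in-differences logic mentioned in the abstract above, the following is a minimal sketch of the simplest two-group, two-period version. It is not the event study specification used in the paper. The function name `did_estimate` and the toy numbers are hypothetical and chosen only for illustration; they are not data or results from the study.

```python
from statistics import mean


def did_estimate(rows):
    """Simple 2x2 difference-in-differences estimate.

    rows: iterable of (treated, post, outcome) tuples, where `treated`
    marks the group affected by the event and `post` marks observations
    after the event.
    """
    cells = {}
    for treated, post, outcome in rows:
        cells.setdefault((treated, post), []).append(outcome)
    m = {key: mean(values) for key, values in cells.items()}
    # Change in the affected group minus change in the unaffected group.
    change_treated = m[(True, True)] - m[(True, False)]
    change_control = m[(False, True)] - m[(False, False)]
    return change_treated - change_control


# Hypothetical toy data: share of posts classified as xenophobic.
rows = [
    (True, False, 0.30), (True, False, 0.34),    # affected users, before
    (True, True, 0.20), (True, True, 0.24),      # affected users, after
    (False, False, 0.28), (False, False, 0.32),  # unaffected users, before
    (False, True, 0.27), (False, True, 0.31),    # unaffected users, after
]
effect = did_estimate(rows)
```

The estimate attributes to the event only the part of the change among affected users that exceeds the change among unaffected users, which is valid under the assumption that both groups would otherwise have followed parallel trends.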
Anticipation Effects of a Board Room Gender Quota Law: Evidence from a Credible Threat in Sweden (with Björn Tyrefors) (submitted)
Boardroom quota laws have received an increasing amount of attention. However, firms typically anticipate laws and can respond to them before their effective date. This paper provides novel results on female board participation and firm performance in Sweden in response to a credible threat of the enactment of a quota law. The threat caused a substantial and rapid increase in the share of female board members among listed firms. We also observe increased board diversity in other dimensions. Moreover, we find a lower turnover rate for female board directors and higher turnover for male CEOs, consistent with mediocre male board members and CEOs being replaced. Interestingly, firm performance improved, which was related to higher sales and lower labor costs. The results highlight that it is possible to increase the share of women on corporate boards without resorting to quotas and that anticipatory effects of a law could be detrimental to the analysis of the law.
Popular science coverage:
Ekonomisk debatt (in Swedish)
Ekonomistas (in Swedish)
Work in progress
Differences Between Immigrants and Natives in Prison Sentencing: An Analysis of DUI Cases in Sweden (with Susan Niknami and Lucie Giorgi)
Do courts treat otherwise similar offenders equally? Using comprehensive Swedish data on police driving-under-the-influence (DUI) controls, we examine sentencing differences by immigrant background and gender. Comparing offenders within narrow blood alcohol concentration (BAC) intervals to hold crime severity constant, we find that immigrants are 10 percentage points and males 5 percentage points more likely to receive a prison sentence at a given BAC level. Controlling for criminal history and extensive socioeconomic characteristics selected via Lasso leaves the estimates unchanged. Exploiting a sentencing discontinuity at BAC = 1.0 in a difference-in-discontinuity design, we find that the additional imprisonment immigrants receive does not reduce subsequent crime further. The results document economically meaningful disparities that are not justified by differential deterrence, implying unequal treatment and inefficient use of costly prison resources. In the near future, we aim to investigate the courts' reasoning behind the harsher sentencing practice for immigrants and men by analyzing the court protocols from DUI cases with quantitative text analysis and language models.
Fake news under anonymity (with Emma von Essen)
Minimizing measurement error in outcomes in causal analysis from text data (with Emma von Essen and Yifan Yang)
Long term effects of early sorting in schools (with Björn Tyrefors and Christian Møller Dahl)
Long term effects of child allowance (with Björn Tyrefors, Linnea Karlsson, Louise Lorentzon and Christian Møller Dahl)
Replication reports
A comment on “Publishing while female” by Hengel (with Mark Granberg and Yifan Yang)
Hengel (2022) examines differences in readability between the published articles of female and male authors in the top four economics journals. The main findings include that female-authored abstracts are “1%–6% better written” than male-authored ones using five measures of readability, a descriptive claim. We first normalize the readability scores to facilitate interpretability and comparisons of the estimates. Next, we conduct a robustness replication where we recalculate the five readability scores Hengel used, but with a more commonly used R package. Differences from Hengel are substantial for some measures, with one measure dropping from a 10% significance level and another decreasing in magnitude to one third of the effect size reported in Hengel (2022). Lastly, we add 29 other readability scores, readily available in our preferred R package, and examine (1) their relation to articles’ scientific quality (proxied by asinh(citations)) and (2) the male-female difference in readability of abstracts. We find that several readability measures are more strongly correlated with our proxy of quality than the ones used in Hengel (2022). Furthermore, the main estimates of the gender differences in readability reported in Hengel (2022) are on average 0.11 standard deviations with an average p-value of 0.038. For the measures not used in Hengel, we obtain an average gender difference of 0.042 standard deviations with an average p-value of 0.42.
Note: Three responses to this report exist.
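The two transformations named in the abstract above can be sketched briefly. This assumes that "normalize" means converting each readability score to a z-score (mean 0, standard deviation 1), which would be consistent with the effects being reported in standard deviations, but the report's exact procedure is not stated here. The function names and example values are hypothetical, and the report itself works in R, whereas this sketch uses Python.

```python
import math
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw scores to z-scores (assumed meaning of 'normalize')."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sigma for s in scores]


def asinh_citations(citations):
    """Inverse hyperbolic sine of citation counts.

    Behaves like a log transform for large counts but is defined at zero,
    so uncited articles are kept.
    """
    return [math.asinh(c) for c in citations]


# Hypothetical example values, not data from the report.
z_scores = standardize([10.0, 12.0, 14.0])
quality_proxy = asinh_citations([0, 5, 250])
```

Putting all measures on a standard deviation scale is what makes estimates from different readability formulas comparable with one another.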
Master's Thesis
Local Human Capital and Immigrants: Complements, Substitutes and Externalities