Replication

As part of a PhD course, I replicated the paper by Danzer and Lavy, “Paid Parental Leave and Children’s Schooling Outcomes”, published in the Economic Journal in 2018. The paper studies an extension of the maximum duration of parental leave in Austria and concludes that it had a large positive effect on the educational performance of sons of highly educated mothers, but a large negative effect on sons of low educated mothers.

As I explain in more detail below, my replication uncovers a straightforward methodological error that invalidates the main results of the paper. However, getting the replication published proved more difficult than expected! I initially submitted my short comment to the Economic Journal: since I was pointing out a factual mistake in a published paper, it seemed only natural that the comment should appear in the same journal as the original. Unfortunately, the journal rejected my comment twice, despite positive independent referee reports. I am very grateful that the comment was eventually accepted for publication in the Journal of Applied Econometrics, which has a dedicated Replication Section.

I am sharing my experience publicly in the hope that more journals become open to replications. Despite all the challenges I faced, I still encourage researchers to do replications and share their findings. This is crucial for the progress of science: each study contributes a small piece to our collective knowledge on a given topic, and if that is the common aim, replication work should be just as welcome as original research.

Summary

Danzer and Lavy (2018) study how an extension of the maximum duration of parental leave in Austria affected children's educational performance, using data from PISA. They find no statistically significant effect on average, but highlight large and statistically significant heterogeneous effects whose sign varies with the mother's education and the child's gender. According to their estimates, the policy increased the scores of sons of highly educated mothers by 33% of a standard deviation (SD) in Reading (st. error=15%) and by 40% SD in Science (st. error=11%). In contrast, sons of low educated mothers experienced a decrease of 27% SD in Reading (st. error=13%) and 23% SD in Science (st. error=13%).


When replicating their study, I realized that the authors had not followed the recommended procedure for working with PISA data. As in other international large-scale assessments such as TIMSS, PIAAC and PIRLS, test scores in PISA are imputed: instead of a single score, the data provide five plausible values per student, each a random draw from the posterior distribution of proficiency, and all of them need to be used in the estimation. In addition, because of the stratified sampling design, standard errors should be computed with the Balanced Repeated Replication (BRR) replicate weights supplied with the data. Danzer and Lavy used only one of the five plausible values, and they did not take the BRR weights into account.
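For concreteness, here is a minimal sketch of that procedure on simulated stand-in data (the dataset, weights and variable names are illustrative, not the actual PISA files): estimate the model once per plausible value, compute each estimate's sampling variance with Fay's BRR variant (PISA supplies 80 replicate weights and uses a Fay factor of 0.5), and combine across plausible values with Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated stand-in for a PISA-style extract (hypothetical) ---
n, n_reps, n_pvs = 500, 80, 5
x = rng.normal(size=n)                    # regressor of interest
w_final = rng.uniform(0.5, 1.5, size=n)   # final student weight
# Stand-in replicate weights (real PISA files ship W_FSTR1..W_FSTR80)
w_reps = w_final[:, None] * rng.uniform(0.5, 1.5, size=(n, n_reps))
# Five plausible values of the score; true slope set to 1.0
pvs = [1.0 * x + rng.normal(size=n) for _ in range(n_pvs)]

def wols_slope(y, x, w):
    """Weighted OLS slope of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[1]

# Fay's BRR with k = 0.5: variance factor 1 / (R * (1 - k)^2)
fay_factor = 1.0 / (n_reps * (1 - 0.5) ** 2)

betas, samp_vars = [], []
for y in pvs:
    b_full = wols_slope(y, x, w_final)            # full-sample estimate
    b_reps = np.array([wols_slope(y, x, w_reps[:, r])
                       for r in range(n_reps)])   # one per replicate weight
    betas.append(b_full)
    samp_vars.append(fay_factor * np.sum((b_reps - b_full) ** 2))

# Rubin's rules: average the estimates; total variance = mean sampling
# variance plus (1 + 1/M) times the between-plausible-value variance
beta = np.mean(betas)
within = np.mean(samp_vars)
between = np.var(betas, ddof=1)
var = within + (1 + 1 / n_pvs) * between
se = np.sqrt(var)
print(f"beta = {beta:.3f}, SE = {se:.3f}")
```

Using a single plausible value skips the between-imputation term, and skipping the replicate weights misstates the sampling variance, which is why standard errors computed the shortcut way can be too small.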


The main results of the paper change substantially once the analysis is performed correctly. The large and statistically significant heterogeneous effects highlighted in the paper become substantially smaller in absolute magnitude, and the associated standard errors increase, leaving the estimates statistically insignificant and largely uninformative. According to the corrected results, the performance of sons of highly educated mothers increased by 13% SD in Reading (st. error=23%) and by 21% SD in Science (st. error=21%), while for sons of low educated mothers the point estimates are -21% SD in Reading (st. error=17%) and -21% SD in Science (st. error=16%). (See the paper for details.)


It would be unfair to single out Danzer and Lavy for this methodological error. I surveyed papers using data from international large-scale assessments published between 2000 and 2019 in top economics journals and found that 35 out of 56 do not follow the recommended estimation procedure. In large samples this is unlikely to affect the results but, when the sample is relatively small, as in this paper, it can have important consequences. The problem is likely exacerbated by a publication (or author) bias favoring large, significant estimates.

The Economic Journal (EJ) initially rejected my submission on the grounds that “the policy among editors is not to accept comments”. After I shared this story on Twitter in 2021, the EJ invited me to appeal, with the promise that if my findings stood, the comment would be published. However, the comment was eventually rejected again, apparently based on a report from the original authors that the editor asked me not to share. The authors acknowledged that their original analysis was not correct, but argued that if they slice the data in a different way they can still get some stars for the subsample of mothers with the lowest educational level (10% of the sample). That analysis does not adjust for multiple testing, does not justify why the data should be sliced in this particular way, and provides no evidence that this subset of women, with very low labor market attachment, actually qualified for the parental leave extension.

Paradoxically, the editor also remarked that I am not the first to point out to them that economists should use the correct procedure to analyse PISA data. A paper by Jerrim et al. (2017) that I cite raised precisely this issue, using as its example an article published by Lavy in the EJ in 2015.

Furthermore, the EJ decided not to retract Danzer and Lavy (2018) or publish a corrigendum. I think this is unfortunate and does a disservice to its readers, who deserve to be informed that (i) based on the evidence presented in the paper, we cannot conclude that parental leave affects children's schooling outcomes, and (ii) estimations with PISA data need to follow the appropriate procedure. The fact that the earlier comment by Jerrim et al. (2017) was ignored both by the EJ and by the authors suggests that, as a discipline, we need to do better at ensuring that errors in published papers are acknowledged and corrected.


Here is a timeline of the process: