PhD - Applications of Multilevel Modelling
Exploring the assumption of no correlation of explanatory variables with random effects.
Project aims and activities
The overarching aim of my PhD project is to improve understanding about the implications of using random effects models with social data where the random effects are correlated with explanatory variables.
In more detail
Random effects models are used in the social sciences to handle data where cases are clustered, either because of the way the data have been sampled, or because conceptually we believe that cases belong to groups whose differences might shape the processes affecting them.
For example, if we are interested in the relationship between parental income and school attainment, and the data we have contains groups of pupils within schools, we might suppose that there will be differences in the outcomes of different schools for various reasons we can't measure (or just haven't). Perhaps one school has a sporty ethos, and another is very focused on diversity and tolerance, while a third has suffered a lot of disruption because of a building which is poorly maintained.
These differences could mean that the children within any one of these schools have outcomes which differ from the overall average in similar ways. Some sort of multilevel model is needed to cope with that structure in the data.
Imagine a model which tries to describe how exam scores vary with parental income. One common approach would be to use a random intercepts model, which would allow a different 'baseline' exam score for each school, and estimate the effect of income on exam scores relative to that baseline.
This improves our analysis in two ways:
Firstly, we get a better estimate of the individual effect of parental income on exam performance;
Secondly, we can explore the balance of different forces within the model, e.g. by measuring how much difference in exam score is explained by individual differences in wealth, and how much is related to membership of a different school.
However, these models rely on an oft-violated underlying assumption of no correlation of explanatory variables with random effects (henceforth the 'NCRX assumption'). If this assumption is not met, the resulting estimates can be inaccurate.
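In notation, the random intercepts model from the exam score example, and the assumption it relies on, might be written as follows. This is a generic textbook form rather than the exact specification used in the project, and the symbols are labels chosen here for illustration: y_ij is the exam score of pupil i in school j, x_ij is parental income, u_j is the school effect and e_ij is the pupil-level residual.

```latex
\begin{align}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + u_j + e_{ij},
  \qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2) \\
\text{share of unexplained variance between schools} &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} \\
\text{NCRX assumption:} \quad \operatorname{Cov}(x_{ij}, u_j) &= 0
\end{align}
```

The first line is the model with a separate 'baseline' for each school, the second is one way of expressing the balance between school-level and pupil-level differences, and the third is the assumption that is violated when, for example, wealthier families cluster in higher-scoring schools.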
To continue the example above, we can easily imagine that parental income might tend to be higher in schools where exam scores are above average, perhaps because wealthier parents have the means to move to an area where their child can attend an apparently high-achieving school.
If this confounding factor confuses the model, we could draw the wrong conclusions about the way in which family wealth is involved with school attainment. The graphs above show a simulated (fake!) but plausible data pattern where exam scores do increase with income, but higher incomes are clustered in schools with higher exam results, and within those clusters the relationship between income and attainment is weaker. Looking at the same data in a multilevel way changes the story.
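A pattern of this kind can be generated with a few lines of Python. This is a minimal sketch and not the code behind the graphs: every number in it (30 schools, 50 pupils each, a within-school slope of 0.5, the spread of the school effects) is invented for illustration, and it uses plain least squares rather than a fitted multilevel model. It compares the slope from ignoring schools altogether with the slope seen inside schools.

```python
import numpy as np

rng = np.random.default_rng(42)

# All values below are invented for illustration.
n_schools, n_pupils = 30, 50
true_within_slope = 0.5  # assumed effect of income on score inside a school

# Unmeasured school effect u_j (ethos, buildings, and so on).
u = rng.normal(0.0, 5.0, size=n_schools)

# School mean income rises with u_j: this is the NCRX violation.
school_mean_income = 30.0 + 1.5 * u + rng.normal(0.0, 2.0, size=n_schools)

school = np.repeat(np.arange(n_schools), n_pupils)
income = school_mean_income[school] + rng.normal(0.0, 5.0, size=school.size)
score = (50.0 + true_within_slope * income + u[school]
         + rng.normal(0.0, 5.0, size=school.size))


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


def group_means(values, groups, n_groups):
    """Mean of `values` within each group, indexed by group number."""
    sums = np.bincount(groups, weights=values, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return sums / counts


# Slope when the school structure is ignored.
pooled_slope = ols_slope(income, score)

# Slope after subtracting each school's own mean from income and score,
# so that only differences between pupils in the same school remain.
income_within = income - group_means(income, school, n_schools)[school]
score_within = score - group_means(score, school, n_schools)[school]
within_slope = ols_slope(income_within, score_within)

print(f"pooled slope (schools ignored): {pooled_slope:.2f}")
print(f"within-school slope:            {within_slope:.2f}")
```

Because higher incomes are concentrated in schools with higher scores, the pooled slope overstates the relationship that holds between pupils in the same school.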
Such a violation does not always change the results, and there are corrections we can apply to address it, but we need to understand more (on an applied level) about when and how these models fail. My PhD project will address that need through three strands of activity:
A methodological review of literature across the social sciences. Researchers respond to potential violation of the NCRX assumption in different ways - from ignoring it completely to ruling out the method automatically. Are these variations in practice down to disciplinary habit, workflow preferences, varying understanding, or something else? I will collect data by reading and scoring research papers which use multilevel models, and use that data to analyse what drives the difference in response.
A simulation study to explore how results are affected by correlation of random effects with explanatory variables under different conditions. I will generate fake data which contains known relationships and hierarchical/nested/clustered structures, and test how estimates from different model specifications change depending on the size and shape of the data and the correlations within it.
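As a rough sketch of what one cell of such a simulation study could look like (not the project's actual design), the code below generates balanced two-level data with a known slope, varies how strongly the explanatory variable depends on the random effect, and compares two estimators. The function names, parameter values and the choice of estimators are all assumptions made for illustration. The random effects estimate is obtained by partial demeaning using the true variance components, which are known only because the data are simulated; a real analysis would estimate them. The second estimator uses full demeaning within groups, often called a fixed effects or within estimator.

```python
import numpy as np


def simulate(rng, n_groups, n_per_group, corr_strength, beta, sd_u=1.0, sd_e=1.0):
    """One fake dataset; corr_strength sets how strongly x depends on u."""
    u = rng.normal(0.0, sd_u, size=n_groups)
    g = np.repeat(np.arange(n_groups), n_per_group)
    x = corr_strength * u[g] + rng.normal(0.0, 1.0, size=g.size)
    y = beta * x + u[g] + rng.normal(0.0, sd_e, size=g.size)
    return x, y, g


def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def partial_demean(v, g, n_groups, theta):
    """Subtract theta times the group mean from each value."""
    means = (np.bincount(g, weights=v, minlength=n_groups)
             / np.bincount(g, minlength=n_groups))
    return v - theta * means[g]


def estimates(x, y, g, n_groups, n_per_group, sd_u=1.0, sd_e=1.0):
    # theta uses the TRUE variance components (an assumption for this sketch).
    theta = 1.0 - np.sqrt(sd_e**2 / (sd_e**2 + n_per_group * sd_u**2))
    re = slope(partial_demean(x, g, n_groups, theta),
               partial_demean(y, g, n_groups, theta))
    fe = slope(partial_demean(x, g, n_groups, 1.0),
               partial_demean(y, g, n_groups, 1.0))
    return re, fe


def run_study(corr_strengths=(0.0, 0.4, 0.8), n_groups=50, n_per_group=10,
              n_reps=200, beta=0.5, seed=1):
    """Average bias of each estimator for each level of correlation."""
    rng = np.random.default_rng(seed)
    results = {}
    for c in corr_strengths:
        re_est, fe_est = [], []
        for _ in range(n_reps):
            x, y, g = simulate(rng, n_groups, n_per_group, c, beta)
            re, fe = estimates(x, y, g, n_groups, n_per_group)
            re_est.append(re)
            fe_est.append(fe)
        results[c] = (np.mean(re_est) - beta, np.mean(fe_est) - beta)
    return results


results = run_study()
for c, (re_bias, fe_bias) in results.items():
    print(f"corr_strength={c:.1f}  RE bias={re_bias:+.3f}  within bias={fe_bias:+.3f}")
```

Changing n_groups, n_per_group and the variance parameters is one way to explore how the size and shape of the data alter the consequences of the violation.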
Two case studies, exemplifying use cases in occupational social mobility and geographical health inequalities where differences of model specification and response to NCRX violation might be consequential. Using secondary datasets from large social surveys, I will recreate two existing analyses and explore how the conclusions drawn might vary or persist depending on how we model the data.
Contact
Student:
Kate O'Hara, University of Stirling, k.a.ohara@stir.ac.uk @Kate_OHara_
Supervisors:
Paul Lambert, University of Stirling, paul.lambert@stir.ac.uk
Kevin Ralston, University of Edinburgh, kev.ralston@stir.ac.uk
Resources
'Three-minute Thesis' slide, as presented at University of Stirling Festival of Research, May 2023. Image credits and references for this slide are listed here.