October 2025
When we need expert judgements for policy decisions or scientific modelling, we typically turn to human experts. But what if we could use AI as a "straw man" expert: not to replace humans, but to provide a benchmark or spark discussion? That's exactly what I set out to test this summer.
Expert knowledge elicitation (EKE) has been around for over half a century. It's the art and science of systematically capturing expert judgements and turning them into quantitative information we can use for decision-making. The Sheffield Elicitation Framework (SHELF) is one such structured approach that brings groups of experts together to make judgements about uncertain quantities.
The process is rigorous and transparent, but it's also resource-intensive. With large language models (LLMs) becoming increasingly capable, I wondered: could they participate meaningfully in this process? There are numerous ways AI could support expert elicitation exercises:
Finding the right experts by analysing publications and profiles
Creating evidence dossiers so all experts start from the same baseline
Transcribing and summarising discussions in real-time
Detecting biases like anchoring or groupthink
Acting as a "straw man expert" whose judgements can be compared with human experts
It's this last application that I decided to investigate.
I tested ten different LLMs using the exact same questions we'd ask human experts in a SHELF exercise. The models included ChatGPT (GPT-4 and GPT-5), Claude Sonnet 4, Copilot, DeepSeek V3, Gemini (Flash and Pro 2.5), Grok 3, Mistral Le Chat, and Qwen3.
Each model was given two different challenges:
The Giant Squid Question: Estimating the length of the largest living giant squid (using SHELF's example evidence dossier)
The Air Quality Question: Estimating the percentage reduction in UK mortality from reducing PM2.5 pollution to WHO guidelines
The key constraint? Each LLM could only use information from a provided evidence dossier—no drawing on their training data. They had to think like a human expert working from the same baseline.
I ran each model five times on each question to account for the randomness in its responses.
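To give a concrete sense of what this setup looks like in practice, here is a minimal sketch of the repeated-run loop, assuming an OpenAI-compatible chat API. The model name, dossier filename, and prompt wording are placeholders for illustration, not the exact ones I used.

```python
# Minimal sketch of a dossier-constrained, repeated elicitation run.
# Assumes the `openai` Python client and an API key in the environment;
# model name, file name, and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
N_RUNS = 5  # repeat each question to capture run-to-run variability

with open("evidence_dossier.txt") as f:
    dossier = f.read()

system_prompt = (
    "You are an expert taking part in a SHELF elicitation exercise. "
    "Base your judgements ONLY on the evidence dossier provided; "
    "do not draw on any other knowledge."
)
question = (
    "Using only the dossier, give your plausible lower bound, lower quartile, "
    "median, upper quartile, and plausible upper bound for the quantity of "
    "interest, with a brief justification for each judgement."
)

responses = []
for run in range(N_RUNS):
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Evidence dossier:\n{dossier}\n\n{question}"},
        ],
    )
    responses.append(completion.choices[0].message.content)
```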
The good news: Every single model provided reasonable, justified, and coherent initial judgements. This is a dramatic improvement from where these models were even in late 2024.
The wrinkle: When the SHELF protocol asked models to check and revise their answers, some struggled. About half the models occasionally revised their numbers in ways that contradicted their own reasoning. One model consistently made this type of error.
Looking at the giant squid question, the models' final median estimates ranged from about 12 to 18 metres, a spread you might well see among human experts working from the same evidence. For the air quality question, most models adjusted their uncertainty ranges (interquartile ranges) during the checking step, typically by between 0 and 3 percentage points. Even these two small observations show that the models differ in how they handle probabilistic reasoning.
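The self-contradictory revisions mentioned above can be caught with a simple coherence check on the elicited quantiles. The sketch below is illustrative only; the field names and the example values are hypothetical and are not part of the SHELF protocol or of my actual analysis code.

```python
# Illustrative coherence check for one set of elicited judgements.
# A judgement is a dict of quantiles in the same units (e.g. metres);
# the keys used here are hypothetical.
def check_coherence(j: dict) -> list[str]:
    """Return a list of coherence problems found in one set of judgements."""
    problems = []
    order = ["lower_bound", "q1", "median", "q3", "upper_bound"]
    values = [j[k] for k in order]
    # Quantiles must be non-decreasing: L <= Q1 <= median <= Q3 <= U.
    pairs = zip(zip(order, values), zip(order[1:], values[1:]))
    for (name_a, a), (name_b, b) in pairs:
        if a > b:
            problems.append(f"{name_a} ({a}) exceeds {name_b} ({b})")
    return problems

# Example: a revision that pushed the median below the lower quartile.
revised = {"lower_bound": 10, "q1": 14, "median": 13, "q3": 16, "upper_bound": 20}
print(check_coherence(revised))  # ['q1 (14) exceeds median (13)']
```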
There was one standout issue: word count. While most models produced around 4,000 words in total across all their responses (manageable in a real-time session), one model regularly exceeded 20,000 words. That's simply too much text to process during a live elicitation exercise.
These results are promising. LLMs can provide coherent, evidence-based judgements when prompted appropriately. But before we actually use them in real elicitation exercises, we need to address some important questions:
Expert attitudes matter most. How will human experts feel about AI participation? Will they see it as a helpful tool or a gimmick? Buy-in from experts is crucial—the last thing we want is for them to feel they're being replaced.
Anchoring is a real risk. If experts see confident AI outputs before making their own judgements, they might anchor to those numbers. We might need "blind" approaches where experts don't see the LLM outputs until after their initial judgements.
Confidentiality concerns. Using online LLMs means sharing potentially sensitive information with third-party providers. Self-hosting could solve this but adds complexity.
Model stability. LLMs are constantly being updated. How do we ensure consistency when models might change mid-project?
Can LLMs act as straw man experts? The answer appears to be "yes, with caveats." They're not ready to replace human experts, nor should they. But they could serve as useful reference points, help calibrate discussions, or provide a baseline for comparison.
The technology has come remarkably far in a short time. What seemed implausible in late 2024 is now technically feasible. The real questions now are about implementation: how we integrate these tools thoughtfully and how human experts respond to them.
This summer's experiment was just a first step. But it's an exciting one that suggests AI might have a role to play in making expert elicitation more accessible and efficient, as long as we proceed carefully and keep humans firmly in the loop.