GGR Newsletter
November 2025
Why Calibration Matters in Medical AI
Sam Blechman, M.S.
November 2025
Imagine you’re admitted to the ER with a fever and low blood pressure. The hospital’s fancy new AI model flags you as having a 70% chance of sepsis. Sounds bad, right? But what if that 70% chance doesn’t mean what the providers think it means? What if, historically, only half of the people the model called “70% likely” actually had sepsis? That’s what we mean when we say a model is miscalibrated—its predicted probabilities don’t line up with reality.
Calibration is about whether a model’s predictions match reality. If you group all patients who were given a 70% predicted risk, about 70% of them should actually have the outcome. That’s perfect calibration. If it’s higher or lower, the model’s off. And when these models are being used to make treatment decisions—who gets antibiotics, who goes to the ICU—small calibration errors can make a big difference.
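If you want to see that grouping idea for yourself, here is a minimal sketch in Python (y_true and y_prob are placeholder names for observed outcomes and a model’s predicted risks; nothing here comes from any specific sepsis model):

```python
# A minimal calibration check: bin patients by predicted risk, then compare
# each bin's average predicted risk to the fraction who actually had the
# outcome. y_true and y_prob are placeholder names for your own data.
import numpy as np

def calibration_table(y_true, y_prob, n_bins=10):
    """Return (mean predicted risk, observed event rate, n) for each risk bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Last bin is closed on the right so probabilities of exactly 1.0 are counted.
        in_bin = (y_prob >= lo) & ((y_prob < hi) if i < n_bins - 1 else (y_prob <= hi))
        if in_bin.any():
            rows.append((y_prob[in_bin].mean(), y_true[in_bin].mean(), int(in_bin.sum())))
    return rows

# In a well-calibrated model, the first two numbers in each row track each other:
# patients given ~70% risk should have the outcome ~70% of the time.
```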
So, why does this happen? There are a few usual suspects.
Sometimes, the model just doesn’t fit the data very well—it assumes an overly simple relationship between the predictors and the outcome. For example, logistic regression forces everything into a nice smooth curve, but the real world is messy and nonlinear. More flexible models like random forests or XGBoost can learn those messy patterns, but they have their own issue: they’re often too confident. They’ll happily assign a 99% probability to something they only half understand.
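To see that overconfidence on a reliability curve, here is a rough sketch comparing a smooth logistic model with a flexible random forest using scikit-learn on synthetic data. The dataset, models, and settings are illustrative choices, not anything taken from a real clinical model:

```python
# Rough comparison of a smooth logistic model and a flexible random forest on
# synthetic data, using scikit-learn's reliability curve. All settings are
# illustrative; nothing here corresponds to a real clinical model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    frac_positive, mean_predicted = calibration_curve(y_te, probs, n_bins=10)
    # For a well-calibrated model these pairs hug the diagonal (mean_predicted ~ frac_positive).
    print(name, list(zip(mean_predicted.round(2), frac_positive.round(2))))
```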
On top of underfitting and overfitting, the data a model sees in training (where it learns what the world looks like) might not match what it sees in practice. Maybe it was built in a tertiary hospital where sepsis is common, then deployed in a community emergency room where it’s rare. Same model, different patient population (different disease prevalence, different clinical characteristics), totally different calibration.
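A toy simulation makes the point: fit a model where the outcome is common, apply it where the outcome is rare, and the average predicted risk drifts well above the observed rate. Everything below is synthetic and purely for illustration:

```python
# Toy illustration of a prevalence shift, with entirely synthetic numbers:
# a logistic model is fit where the outcome is common (30%) and then applied
# where it is rare (5%). The average predicted risk no longer matches reality.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def simulate(n, prevalence):
    y = (rng.random(n) < prevalence).astype(int)
    x = rng.normal(loc=y, scale=1.0, size=n).reshape(-1, 1)  # one predictor, shifted when y = 1
    return x, y

X_train, y_train = simulate(20000, prevalence=0.30)  # "tertiary hospital"
X_new, y_new = simulate(20000, prevalence=0.05)      # "community emergency room"

model = LogisticRegression().fit(X_train, y_train)
predicted = model.predict_proba(X_new)[:, 1]
print("mean predicted risk:", round(float(predicted.mean()), 3))
print("observed event rate:", round(float(y_new.mean()), 3))
```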
Here’s the tricky part: a model can be really good at separating sick from not-sick (that’s discrimination, measured with metrics like sensitivity, specificity, and the area under the receiver operating characteristic curve [see my 09/2025 article to learn more!]) and still be terribly miscalibrated. Those metrics don’t care whether a model says “99% chance” or “51% chance,” as long as it ranks patients correctly: a patient scored at 61% should be more likely to have the outcome than one scored at 60%, but without calibration that 1% gap doesn’t tell you much. Calibration is what makes the difference between 60% and 61% mean something probabilistically.
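Here is a small, made-up demonstration of that split: a monotone transform that inflates every prediction leaves the ranking, and therefore the AUC, unchanged, while a calibration-sensitive score like the Brier score gets worse. All numbers are simulated, and scikit-learn is just a convenient tool here:

```python
# Made-up example showing that discrimination ignores the probability scale.
# A monotone transform that inflates every prediction leaves the ranking, and
# therefore the AUC, unchanged, while a calibration-sensitive score (Brier) worsens.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Simulated probabilities: higher on average for true cases, clipped to (0, 1).
y_prob = np.clip(0.2 + 0.6 * y_true + rng.normal(0.0, 0.15, size=1000), 0.01, 0.99)

overconfident = y_prob ** 0.1  # strictly increasing, so the ranking is identical

print("AUC:  ", roc_auc_score(y_true, y_prob), roc_auc_score(y_true, overconfident))
print("Brier:", brier_score_loss(y_true, y_prob), brier_score_loss(y_true, overconfident))
```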
Calibration is about trust. If a clinician sees “70%,” they’ll treat that number like a fact. When it’s wrong, decisions can tilt toward unnecessary diagnostic tests and treatments, missed cases, and wasted resources.
This feels like a pretty big deal, since we want to be really careful about how computational methods inform clinical decision-making, right? I agree! Yet calibration is often an afterthought for the data scientists developing and validating these models.
Researchers have been calling this the “calibration crisis” for years (see Van Calster et al., Medical Decision Making, 2015; Niculescu-Mizil & Caruana, ICML, 2005). The good news is calibration can be fixed—simple post-hoc adjustments like Platt scaling or isotonic regression can make probabilities more honest. But the first step is recognizing that accuracy isn’t enough. In medicine, what matters most isn’t how good a model is at guessing who’s sick—it’s whether the model guides clinical decisions in a way that actually reflects what will happen to real patients.
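For readers who want to experiment, here is one sketch of post-hoc recalibration using scikit-learn’s CalibratedClassifierCV, where method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression. The data and base model below are synthetic placeholders, not a recipe for a clinical deployment:

```python
# Sketch of post-hoc recalibration with scikit-learn's CalibratedClassifierCV:
# method="sigmoid" is Platt scaling, method="isotonic" is isotonic regression.
# Data and base model are synthetic placeholders, not a clinical recipe.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic",  # or "sigmoid" for Platt scaling
    cv=5,
).fit(X_tr, y_tr)

# Lower Brier score = probabilities closer to what actually happened.
print("raw Brier:       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("calibrated Brier:", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```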