Coronavirus and Bayes’ Theorem

How to think about probabilities with Covid and Covid testing.

Answers to questions about what’s going on in the world are probabilistic at best. We have imperfect measuring devices: our eyes, our brains, our ideologies, our medical tests, and so on. And the truth out there is hard to uncover. Here’s some help on how to think about whether or not you’ve got covid and what a test result, positive or negative, means.

First, the complication here is that we’ve got two different probabilities that can work against each other: there’s the probability that a person has the disease, and there’s the probability that the test gives the right answer. When we combine those, we can get some counterintuitive results. Bayes’ Theorem is what we use to clarify.

Pr(H|O) = Pr(O|H)Pr(H) / [Pr(O|H)Pr(H) + Pr(O|~H)Pr(~H)]

Here, H stands for “hypothesis,” O stands for “observation,” and ~H stands for “the hypothesis is false.” And the upright line | stands for “given that.”
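If it helps to see the formula as something you can compute with, here it is as a tiny Python function. This is just a sketch; the language and the function and argument names are my own labels for the terms above, not anything official.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' Theorem: Pr(H|O) from Pr(H), Pr(O|H), and Pr(O|~H)."""
    numerator = p_obs_given_h * prior
    denominator = numerator + p_obs_given_not_h * (1 - prior)
    return numerator / denominator

# Example: a 40% prior and a test that gives the right answer 85% of the time
print(posterior(0.40, 0.85, 0.15))   # ≈ 0.79
```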

So suppose we want to know the probability that you’ve got covid (the hypothesis) given that you’ve got a positive test result (the observation): Pr(C|+). Here’s how you’d substitute the variables into Bayes’ Theorem.

Pr(C|+) = Pr(+|C)Pr(C) / [Pr(+|C)Pr(C) + Pr(+|~C)Pr(~C)]

I won’t go into an explanation of why Bayes’ Theorem is true. Let’s just understand that this formula, once you plug in the relevant values, gives you a probabilistic answer to the question “Do I have it?” in the case that you get a positive or negative test result.

Let’s try some values. The right-hand factor of the numerator, Pr(C), is just the base rate, or the prior probability: the probability that a person has covid prior to any observation or new evidence.

Pr(C|+) = Pr(+|C)Pr(C) / [Pr(+|C)Pr(C) + Pr(+|~C)Pr(~C)]

What’s the probability that some person chosen from a population has covid? At the moment, in Sacramento county, where I live, there are about 10,000 cases in a population of about 500,000. So that’s a 2% rate. Those are officially reported cases; the actual rate may be 10 times higher, but we will deal with that in a moment. But just being a person in the county might not be enough to lead you to test; it’s more likely that some exposure or some symptoms have made you worry about it and seek out a test. The rate of covid among people seeking out a test is surely higher than the overall rate in the county. It might be that you spent some time at a gathering with someone who turns out to have it, or you woke up with a sore throat and a fever, and so on. So let’s say that your prior probability of having covid, before you seek out a test, is 40%. We will see shortly what happens when we change that number for different reasons.
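Just to keep those numbers straight, here’s the same arithmetic as a quick Python sketch; the figures are the rough ones above, and the variable names are mine.

```python
reported_cases = 10_000       # roughly the official count in the county
population = 500_000
official_rate = reported_cases / population
print(official_rate)          # 0.02, i.e. 2%

prior_if_worried = 0.40       # assumed prior for someone with symptoms or a known exposure
```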

Now, what is Pr(+|C)?

Pr(C|+) = Pr(+|C)Pr(C) / [Pr(+|C)Pr(C) + Pr(+|~C)Pr(~C)]

That is, what is the probability that you would test positive given that you’ve got covid? This is the accuracy of the test; it’s also sometimes called the likelihood. When someone has it, what percentage of the time does the test give a true positive result, and what percentage of the time does it get it wrong? This number isn’t about having covid directly; it’s a measure of how good a test is at correctly identifying people who have it. Think of a drugstore pregnancy test that says “99% accurate.” What that means, roughly, is that if we took hundreds or thousands of pregnant and non-pregnant women and tested them, the test would correctly tell us whether they are pregnant 99% of the time. One of the problems we’ve been having through the pandemic is that our tests aren’t very accurate. Let’s suppose that we’re using a test that is 85% accurate, or that Pr(+|C) = .85.

The only other values we need to complete the formula now are Pr(+|~C) and Pr(~C).

Pr(C|+) = Pr(+|C)Pr(C) / [Pr(+|C)Pr(C) + Pr(+|~C)Pr(~C)]

For now, since we are assuming that our test is 85% accurate, what is the probability that a person would have a positive test result but not have it? Pr(+|~C) = .15. (It turns out that tests typically have different false positive and false negative rates, and that can matter. But we will deal with that later.) And if your initial probability for having covid, your prior probability, was .4, then the probability that the hypothesis is false is .6. So now, we can fill in all the values:

Pr(C|+) = (.85)(.4) / [(.85)(.4) + (.15)(.6)] = .34 / .43 ≈ .79

That is, with our assumed values, the probability that you've got covid, given a positive test result with 85% accuracy, is about 79%.
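Here’s that calculation spelled out in Python, matching the fraction above; a sketch, with variable names of my own choosing.

```python
accuracy = 0.85      # Pr(+|C)
prior    = 0.40      # Pr(C)

numerator   = accuracy * prior                           # (.85)(.4)  = .34
denominator = numerator + (1 - accuracy) * (1 - prior)   # .34 + .09  = .43
print(numerator / denominator)                           # ≈ 0.79
```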

But wait, how can that be? A positive test result means you’ve got it, right? Or, if the test is 85% accurate, then we should be 85% sure that you’ve got it now, right? No, in this case it just means that the probability of your having it went from 40% to 79% with the positive test. Think of it this way. The test here is a pretty poor test. It only gets the answer right 85% of the time. (If the test got it right only 50% of the time, it would be no better than flipping a coin to find out whether you’ve got it.) And the other problem is that in this example the prior probability, the probability that you had it before the test, was less than even: only .4, or 40%. It helps to see that the initial improbability of having it erodes the confidence we can place in the test result. At the outset, you probably didn’t have it (.6), and that reduced our confidence in a positive test result from 85% to 79%.

Suppose we took someone from the county at random and we didn't have the prior reasons for thinking they might have it. The rate in Sacramento county is 2%. What happens if this person gets a positive test result, with our 85% accurate test?

Pr(C|+) = (.85)(.02) / [(.85)(.02) + (.15)(.98)] = .017 / .164 ≈ .10

Now we can see how much the base rate, or the prior probability, affects the answer to the question. Even with a positive test result, this person, who started with a mere 2% prior probability of having it, now has just a 10% probability of having covid. That is, it’s far more probable that this person doesn’t have it, even with a positive test. In this case, the prior probability we started with was just the rate distributed over the whole population: we took the total number of cases and divided by the total number of people in the county. In the first case, we saw that if you had some symptoms, or had an exposure, and went to the doctor and asked for a test, then your prior probability would be higher. We had some prior reasons, some other evidence, to think that your base rate was .4, and as a result a positive test was pretty strong evidence that you’ve got it. Put another way, the prior probability of having covid is much higher among people who have a sore throat, a fever, and an exposure than it is for someone just chosen at random from the population.
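The same sketch as before, with the random person’s 2% prior plugged in:

```python
accuracy = 0.85
prior    = 0.02      # just the county-wide rate

p = (accuracy * prior) / (accuracy * prior + (1 - accuracy) * (1 - prior))
print(p)             # ≈ 0.10
```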

Now suppose we give this randomly selected person another test after they get the first positive one; that’s what a responsible doctor would do. Now we have a different equation, because the base rate, or the prior probability, is now the 10% from the first test. That is, on the basis of the test we just took, their probability of having covid has gone up from 2%, the rate in the general population, to 10%. So we insert that new base rate. Let’s suppose we are using the same 85% accurate test. And notice that since Pr(C) is now .1, Pr(~C) is .9:

Pr(C|+) = (.85)(.1) / [(.85)(.1) + (.15)(.9)] = .085 / .22 ≈ .38

So now, with two positive test results, the probability that this person has it is 38%. That’s still not probable; it’s more likely that they don’t have it than that they do. But it’s not negligible. The evidence is mounting that this randomly selected person has it.
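In code, the second test is just the same calculation again, with the (rounded) result of the first test used as the new prior. Again, just a sketch with names of my own choosing.

```python
accuracy = 0.85

def update(prior):
    """Posterior after one positive result from a test of the given accuracy."""
    return (accuracy * prior) / (accuracy * prior + (1 - accuracy) * (1 - prior))

first  = update(0.02)    # ≈ 0.10: the county-wide prior, updated by one positive test
second = update(0.10)    # ≈ 0.386: the rounded result of the first test, updated again
print(first, second)
```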

Suppose we apply another test to our first case. With that person, the base rate was .4, and a positive test gave us a posterior probability of 79% for covid. What if this person takes another test and gets a positive result? Note that I use .79 for the prior here, and .21 for Pr(~C):

Pr(C|+) = (.85)(.79) / [(.85)(.79) + (.15)(.21)] = .6715 / .703 ≈ .95

So now, our patient who came in with some symptoms, and who has taken two tests with positive results, has covid with a probability of about .95. That’s intuitive, and it’s what we’d expect. They went from .4 to .79 to .95 as we folded in the new information, updated our priors, and arrived at a conclusion. An interesting question to consider here: what would happen to the answers if we had tested these people a second time and gotten negative results instead of positives? The short answer is that the negative tests would effectively cancel out the evidence of the first tests and leave us back where we started.
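Here’s a sketch of that whole chain, including the cancelling-out case. The update function and its names are mine; it assumes, as we have so far, that the false positive and false negative rates are the same.

```python
accuracy = 0.85

def update(prior, positive=True):
    """Fold one test result into the prior, assuming equal false positive and false negative rates."""
    p_result_if_c     = accuracy if positive else 1 - accuracy
    p_result_if_not_c = 1 - accuracy if positive else accuracy
    return (p_result_if_c * prior) / (p_result_if_c * prior + p_result_if_not_c * (1 - prior))

p = 0.40            # the symptomatic patient's prior
p = update(p)       # first positive  -> ≈ 0.79
p = update(p)       # second positive -> ≈ 0.955
print(p)

# A positive followed by a negative takes us back to roughly the original prior:
print(update(update(0.40), positive=False))   # ≈ 0.40
```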

Now consider what happens when we use a more accurate test. Suppose we have a test that is 95% accurate, and only gives the wrong answer 5% of the time. And now, let’s set aside our contrived example of selecting someone randomly from the population. Suppose that our patient has been exposed to someone who has it, and they’ve got some symptoms that are consistent with covid. Now, before we’ve even tested, we’ve got some evidence to think that this person has it. Suppose we assign a 65% probability to their having it before the test. Now, what would a positive test result mean?

Pr(C|+) = (.95)(.65) / [(.95)(.65) + (.05)(.35)] = .6175 / .635 ≈ .97

So now, with a better test and some other evidence to suspect that this patient has it, it’s 97% probable. This patient very probably has it. And notice that our high prior probability pushed the answer above the accuracy of the test itself. That is, at the outset we thought this person had a better than even chance of having it: .65. That head start, in effect, raised our confidence in the answer beyond the test’s 95% accuracy.
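The better test with the higher prior, in the same sketch form:

```python
accuracy = 0.95
prior    = 0.65

p = (accuracy * prior) / (accuracy * prior + (1 - accuracy) * (1 - prior))
print(p)             # ≈ 0.97, a bit higher than the test's own accuracy
```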

What else can we say about the base rate? In our first examples, we were assuming that the official, reported rate of covid in the population is a good measure of the actual rate. But is that a good assumption? Probably not. There are a lot more people out there who have it than just the ones officially reported. Some people get only mildly sick and don’t go to the doctor, some people get it and don’t get tested, and some people get it and never know they’ve got it. The CDC has recently said that the real rate of people who have it may be 10 times higher than the official rate. And we’ve got some reasons to think that the CDC is even understating the truth here. If you talk to someone at a party, the difference between a 2% chance of being exposed and a 20% chance of being exposed is substantial.

So what happens to our calculation when you take the base rate of covid in the population to be 20% instead of 2%? And let’s assume a 90% accurate test this time.

Pr(C|+) = (.9)(.2) / [(.9)(.2) + (.1)(.8)] = .18 / .26 ≈ .69

So now, with the higher rate in the population, a single positive result from our 90% accurate test already makes it more likely than not that you have it: about 69%. And if you did another test now and got another positive result, Pr(C|+) would rise to about .95. Now you’ve almost certainly got it.
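And the 20% base rate with the 90% accurate test, for one and then two positive results (same sketch, new numbers):

```python
accuracy = 0.90

def update(prior):
    return (accuracy * prior) / (accuracy * prior + (1 - accuracy) * (1 - prior))

one_positive  = update(0.20)           # ≈ 0.69
two_positives = update(one_positive)   # ≈ 0.95
print(one_positive, two_positives)
```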

So far, we’ve dealt with positive test results. And we’ve seen that the relative rarity of the disease in the population erodes a positive test result. But what happens if you get a negative test result? The short answer is that when the disease is relatively rare in the population, a negative result is reinforced: the low base rate now works in the same direction as the test. Suppose that the base rate is 20% and that the test is 90% accurate, and now we’ve got a negative test result. What’s the probability that this person has it?

Pr(C|-) = Pr(-|C)Pr(C) / [Pr(-|C)Pr(C) + Pr(-|~C)Pr(~C)]

Pr(C|-) = (.1)(.2) / [(.1)(.2) + (.9)(.8)] = .02 / .74 ≈ .027

So before the test, this person’s probability of having covid was 20%, the same as the rest of the population. But they tested, got a negative result, and now the probability has gone down to .027. So they probably don’t have it.

Notice a couple of things here. The left-hand factor of the numerator, Pr(-|C), is now the probability that you’d get a negative test given that you’ve got covid. With a test that’s 90% accurate, we will assume that’s .1. And the right-hand term of the denominator contains Pr(-|~C), the probability that you’d get a negative test result given that you don’t have covid; that is, the probability that the test would return a correct negative result. With a 90% accurate test, that’s .9. And since the rate of covid in the population here is .2, the rate of people not having it is 80%, or .8.
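Here’s the negative-result calculation in the same sketch style; the only change is that the likelihoods flip.

```python
accuracy = 0.90
prior    = 0.20

# For a negative result, the numerator uses Pr(-|C) = 1 - accuracy,
# and the right-hand term of the denominator uses Pr(-|~C) = accuracy.
numerator   = (1 - accuracy) * prior                 # (.1)(.2) = .02
denominator = numerator + accuracy * (1 - prior)     # .02 + .72 = .74
print(numerator / denominator)                       # ≈ 0.027
```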

So there are a couple of lessons we can extract here. First, when you test and get a positive or negative result, the answer to the question “Do I have it?” is not as simple as whether the test came back positive or negative. The answer is modulated by how accurate the test is and how prevalent the disease is. If the disease is rare and the test is inaccurate, then a positive test result doesn’t tell us much. More tests, with more positive results, would fortify the answer and support the positive conclusion. If the disease is rare and the test is fairly accurate, then a negative test result is good news; a negative result doesn’t give us certainty, but it makes it substantially less probable that you have it. Again, more tests, with more negative results, would fortify that answer. If a person takes multiple tests and gets mixed results, then we are left in a curious epistemic situation. One positive result and one negative result would effectively cancel each other out; you’d have two competing pieces of evidence pushing in opposite directions, leaving you just about where you started.