Learning Gain

How do we measure student learning at Khan Academy?

Background and Motivation

Whenever we run an A/B test, we have a ton of conversions (e.g. problems attempted, problems correct, return visits) in the form of binary or count variables. It can be difficult to make decisions based on this very high-dimensional set of outcomes. For instance, if a change recommends easier exercises to students, their accuracy may go up, but are they actually learning more?

When students use the learning dashboard, they can do problems from two sources: practice tasks or mastery challenges. Most of the problems for the mastery challenges are chosen based on their previous activity. However, a small subset (currently 5%) of these mastery challenge problems are reserved as analytics cards. These are drawn uniformly at random from exercises in all of math.

Let’s focus on measuring student learning. We will try to extrapolate the state of a user’s knowledge from performance on these analytics cards.

Learning Gain Metric

The current metric for learning gain is rather simple! Let’s only look at users who have done at least two analytics cards since an experiment started. For each user, let’s just take the correctness value (either 0 for incorrect or 1 for correct) on the last card and subtract the first card. In other words, each user will have a value in {-1, 0, 1}.

Gain of -1: means the user got the first card right and the last card wrong (0 - 1 = -1)

Gain of 0: means the user got both cards right (1 - 1 = 0) or both cards wrong (0 - 0 = 0)

Gain of 1: means the user got the first card wrong and the last card right (1 - 0 = 1)

We average these across all the users in a given experiment alternative.

We use “last minus first” instead of just accuracy on the last card because there can be quite a bit of noise in the analytics cards. Subtracting the first card reduces the noise a fair amount. Surprisingly, this turned out to work much better than more complicated variants (and was easier to automate). Don’t get me wrong - there are still lots of improvements that can be made - see the variants section at the bottom!
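To make the definition concrete, here is a minimal sketch of the "last minus first" computation. The record layout (user_id, alternative, timestamp, correct) and the function name are made up for illustration; the real pipeline runs in BigQuery and may differ in the details.

```python
from collections import defaultdict


def learning_gain(cards):
    """Average "last minus first" correctness per experiment alternative.

    `cards` is an iterable of (user_id, alternative, timestamp, correct)
    tuples, one per analytics card done since the experiment started.
    This record layout is hypothetical.
    """
    by_user = defaultdict(list)
    for user_id, alternative, timestamp, correct in cards:
        by_user[(user_id, alternative)].append((timestamp, int(correct)))

    gains = defaultdict(list)
    for (user_id, alternative), answers in by_user.items():
        if len(answers) < 2:
            continue  # users need at least two analytics cards
        answers.sort()  # order each user's cards by timestamp
        first, last = answers[0][1], answers[-1][1]
        gains[alternative].append(last - first)  # value in {-1, 0, 1}

    return {alt: sum(g) / len(g) for alt, g in gains.items()}


cards = [
    ("u1", "control", 1, 0), ("u1", "control", 2, 1),  # gain +1
    ("u2", "control", 1, 1), ("u2", "control", 5, 1),  # gain 0
    ("u3", "test", 1, 1), ("u3", "test", 3, 0),        # gain -1
    ("u4", "test", 1, 0),                               # one card only: skipped
]
print(learning_gain(cards))
```

Multiplying the averages by 100 gives values on the same scale as the delta column described under Reading Results.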

Reading Results

The current learning gain results are updated weekly and live in BigQuery. To access, we just need to query a [learning_gain.experiment_name_results] table, e.g.

SELECT * FROM [learning_gain.cyclical_problem_ordering_results]

Now, what do all these values mean?

delta: this is just the average “last minus first” on analytics cards, in percentages (i.e. multiplied by 100)

stderr: the standard error of the delta

first: accuracy on the first analytics card, in percentages

last: accuracy on the last analytics card, in percentages

n: the number of participants in this alternative who have done at least two analytics cards

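p_value: NaN for the best alternative; otherwise the p-value from Welch’s t-test for this alternative being worse than the best one. Roughly, it is how likely we would be to see a gap at least this large by chance if the alternatives actually had the same learning gain, so smaller values mean stronger evidence of a real difference.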
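For intuition, the columns above can be roughly reproduced from per-user gains as sketched below. The function names are hypothetical, and the p-value uses a normal approximation to the t distribution (fine for large n, but not necessarily what the production job does). Whether the table's p_value is one-sided or two-sided is not stated here, so the two-sided choice is also an assumption.

```python
import math
from statistics import NormalDist, mean, variance


def summarize(gains):
    """delta and stderr in percentages, plus n, for a list of -1/0/1 gains."""
    n = len(gains)
    delta = 100 * mean(gains)
    stderr = 100 * math.sqrt(variance(gains) / n)  # sample std dev / sqrt(n)
    return {"delta": delta, "stderr": stderr, "n": n}


def welch(gains_a, gains_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(gains_a), len(gains_b)
    va, vb = variance(gains_a) / na, variance(gains_b) / nb
    t = (mean(gains_a) - mean(gains_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def approx_two_sided_p(t):
    """Assumption: normal approximation to the t distribution (large df only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Made-up gains for two alternatives, 400 users each.
best = [1, 0, 0, 1, 0, -1, 1, 0] * 50
other = [0, 0, -1, 1, 0, -1, 0, 0] * 50

t, df = welch(best, other)
p = approx_two_sided_p(t)
print(summarize(best), summarize(other))
print(t, df, p)
```

Unlike Student's t-test, Welch's version does not assume the two alternatives have equal variance, which is why each alternative's variance enters separately. The sketch assumes every alternative has some variation in its gains; if every gain were identical the division would fail.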

In general, a higher delta means more learning gain! However, there are a few caveats with exercises changing and engagement effects, which I’ll discuss later. For tests that don’t change how often mastery challenges are done, we can pretty confidently rely on this metric.

One notable exception is experiments that have a big effect on the first analytics card a user does. For instance, a growth mindset experiment might make the accuracy on their first card jump due to users putting in more effort. For this case, it might make more sense to look only at accuracy on the last analytics card, but this is open to interpretation.

These results also appear on bigbingo (at the bottom of each experiment page).


Q: Why is learning gain negative sometimes? Does this mean students are not learning?!

A: I hope not! There are a few reasons why a lot of “delta” values can be negative, the main one being that our pool of exercises (and hence, analytics cards) is changing over time. As we add new content, sometimes the average difficulty will increase and sometimes it will decrease.

For instance, if you look at average accuracy on analytics cards over time, it’s quite bumpy due to new exercises, weekends, and holidays. Depending on when an experiment started, negative learning gain can be expected.

Another effect is that less engaged users generally leave Khan Academy on a low note, after answering many questions incorrectly. Their last card is then more likely to be wrong, which translates to a negative learning gain.

The results may be negative, but the good news is that they can still definitely be compared across alternatives in the same experiment.

Q: Why are learning gain results changing week to week?

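A: It’s probably due to noise. If the p-values are fairly high, the results are not significant. Looking at them more often won’t make them more significant either. In fact, the more often we check and react to whichever look seems significant, the more likely we are to be fooled by a false positive, so each individual look deserves less trust!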
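To see why repeated looks hurt, here is a small simulation sketch of an A/A test, where both alternatives draw gains from the same distribution so any "significant" result is a false positive. The distribution of gains, the 1.96 threshold, and all the names are assumptions chosen for illustration.

```python
import math
import random


def z_stat(a, b):
    """Two-sample z statistic for the difference in mean gain."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    return (ma - mb) / se if se > 0 else 0.0


def false_positive_rates(trials=200, users=500, peeks=10, seed=0):
    """Return (rate when any peek counts, rate when only the final look counts)."""
    rng = random.Random(seed)
    step = users // peeks
    peek_hits = final_hits = 0
    for _ in range(trials):
        # A/A test: there is no real difference between the alternatives.
        a = [rng.choice((-1, 0, 0, 1)) for _ in range(users)]
        b = [rng.choice((-1, 0, 0, 1)) for _ in range(users)]
        looks = [abs(z_stat(a[:k], b[:k])) > 1.96
                 for k in range(step, users + 1, step)]
        peek_hits += any(looks)    # declared significant at some peek
        final_hits += looks[-1]    # declared significant at the final look
    return peek_hits / trials, final_hits / trials


peek_rate, final_rate = false_positive_rates()
print(peek_rate, final_rate)
```

The first number can never be lower than the second, because the final look is one of the peeks, and in practice it tends to be well above it.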

Q: Can I just pick the alternative with higher learning gain and be done?

A: I wish it were that easy. If different alternatives in an experiment don’t induce users to do more mastery challenges, then our decision becomes a lot easier. However, most of our experiments affect both engagement and accuracy. What happens if we have higher learning gain but significantly fewer return visits? Learning gain should only be seen as one part of the puzzle.

Moreover, this assumes that the effects of an experiment alternative are fairly smooth and monotonic. Examples of such experiments are the “6 versus 8 problems in a mastery challenge” and “prerequisites in missions” where the effects are gradual. However, for a growth mindset experiment, we can expect a huge spike initially when users put a lot more effort into their first analytics card. In these cases, we need to take more care in analyzing the results. One idea (discussed in the variants) is to look at analytics cards before the experiment started.

Q: Can I compare learning gain results across experiments?

A: Unfortunately, probably not (unless they were started at the same time). Because the pool of analytics cards changes over time, the absolute accuracy numbers are going to look very different depending on the dates. Plus, if you are really curious about effects across experiments, we can probably launch a new A/B test!

Q: Is there some absolute measure of learning gain, i.e. not across experiment alternatives?

A: Not yet, but this is in the works! With our pool of exercises constantly changing, new users signing up, etc., I haven’t found a good way to control for all these effects. This has definitely been on my mind and I’ll continue brainstorming ideas.

For more details, read on!


Variants

Here’s a growing list of ideas for how to improve the learning gain metric.

1. Conditional probabilities

Instead of using 0/1 for correctness values, we can use exercise information as well. Specifically, for each exercise and correctness value (either 0 or 1), we can use the probability that they get the next analytics card right - these values can be computed empirically. This should reduce noise somewhat. For more details, see here.

To get a sense for what these values look like, see the plot below. The x-axis consists of the different exercises, ordered by accuracy on analytics cards, a rough estimate of difficulty. The y-axis consists of the following.

p_global: this is the global probability that this exercise was answered correctly on analytics cards (so lower means more difficult)

p_0: this is the probability the user answers the next analytics card correctly, conditioned on answering the current exercise incorrectly.

p_1: this is the probability the user answers the next analytics card correctly, conditioned on answering the current exercise correctly.

As we expect, p_0 is generally lower than p_1, and both values are higher the harder the exercises get. These were generated in February 2014, when we had about 550 exercises in all of math.
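As a rough sketch of how p_global, p_0 and p_1 could be computed empirically, consider the following. The input layout (one time-ordered list of (exercise, correct) pairs per user) and the function name are hypothetical, and the exercise names in the example are made up.

```python
from collections import defaultdict


def conditional_probabilities(sequences):
    """Estimate p_global, p_0 and p_1 for each exercise.

    `sequences` holds one list per user of (exercise, correct) analytics
    cards in time order, with correct being 0 or 1 (hypothetical layout).
    A probability is None when there is no data to estimate it from.
    """
    seen = defaultdict(lambda: [0, 0])           # exercise -> [correct, total]
    nxt = defaultdict(lambda: [[0, 0], [0, 0]])  # exercise -> current 0/1 -> [next correct, total]
    for cards in sequences:
        for exercise, correct in cards:
            seen[exercise][0] += correct
            seen[exercise][1] += 1
        # Pair every card with the analytics card that follows it.
        for (exercise, correct), (_, next_correct) in zip(cards, cards[1:]):
            nxt[exercise][correct][0] += next_correct
            nxt[exercise][correct][1] += 1

    def ratio(pair):
        return pair[0] / pair[1] if pair[1] else None

    return {ex: {"p_global": ratio(seen[ex]),
                 "p_0": ratio(nxt[ex][0]),
                 "p_1": ratio(nxt[ex][1])} for ex in seen}


sequences = [
    [("fractions", 1), ("decimals", 1)],
    [("fractions", 0), ("decimals", 0)],
    [("fractions", 1), ("decimals", 0), ("fractions", 0)],
]
probs = conditional_probabilities(sequences)
print(probs["fractions"])
```

Under this variant, a card's 0/1 correctness would be replaced by the p_0 or p_1 of its exercise before taking "last minus first". With real data, exercises with few observations would probably need some smoothing.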

2. Look before an experiment started

We have the first and last analytics card right now. It may make sense for some experiments to look at the last card before the experiment started, e.g. for growth mindset, and compute the deltas based off of this card.
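A minimal sketch of this variant for a single user, assuming cards arrive as (timestamp, correct) pairs and that start is the experiment start time. Both the layout and the function name are hypothetical.

```python
def gain_from_baseline(cards, start):
    """Last card since `start` minus the last card before `start`.

    `cards` is a list of (timestamp, correct) pairs for one user, in any
    order. Returns None if the user has no card on one side of `start`.
    """
    before = [c for c in cards if c[0] < start]
    after = [c for c in cards if c[0] >= start]
    if not before or not after:
        return None
    baseline = max(before)[1]  # most recent card before the experiment
    last = max(after)[1]       # most recent card since the experiment started
    return last - baseline


cards = [(1, 1), (5, 0), (12, 1), (20, 1)]
print(gain_from_baseline(cards, start=10))
```

One trade-off to keep in mind: users with no analytics card before the experiment started drop out of the metric entirely.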

3. Look at per-user participation time

In the current model, we use a global start date for each experiment, and we keep only the analytics cards done after that date. It would be better to do this per user, keeping only the cards each user did after they entered the experiment.
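The per-user filter could look something like the sketch below. The entered_at mapping (user to the time they first participated in the experiment) and the card layout are assumptions for illustration.

```python
def cards_in_experiment(cards, entered_at):
    """Keep only the cards each user did at or after their own entry time.

    `cards` is an iterable of (user_id, timestamp, correct) tuples and
    `entered_at` maps user_id -> entry timestamp (hypothetical layout).
    Users who never entered the experiment are dropped.
    """
    return [
        (user_id, timestamp, correct)
        for user_id, timestamp, correct in cards
        if user_id in entered_at and timestamp >= entered_at[user_id]
    ]


cards = [
    ("u1", 3, 1), ("u1", 8, 0), ("u1", 15, 1),
    ("u2", 4, 0), ("u2", 6, 1),
    ("u3", 9, 1),
]
entered_at = {"u1": 7, "u2": 5}
print(cards_in_experiment(cards, entered_at))
```

With a global start date of, say, 3, user u1's card at time 3 would count as their first card even though they only entered the experiment at time 7.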