7/20/2025 update: For low-difficulty practice (e.g., fluency practice), I now think that modeling changes in response time and using those values to guide practice is more efficient.
Below I describe my work from this paper. See my blog post for npj Science of Learning for an informal introduction.
Spacing out practice (e.g., self testing) over time typically benefits learning. But how much space should there be? Should the interval always stay the same? In a similar vein, imposing some amount of difficulty tends to benefit learning, but the exact amount that should be imposed (e.g., via spacing) is rarely specified. The approach I describe below uses a computational model of student learning in combination with difficulty thresholds to optimally schedule practice.
Standard spacing methods (e.g., a fixed spacing of 3 intervening items between attempts) implicitly assume that items do not vary in difficulty and students do not vary in aptitude. For instance, if a student is highly accurate on every attempt on an item, should the spacing for that item be larger (increasing difficulty)? Should the item be practiced at all? One-size-fits-all methods also tend to ignore how efficiency changes over time - more practice leads to faster responding. So how much difficulty should be imposed on the learner, and how can we enforce a particular difficulty level?
Some amount of difficulty is good for learning, but how much? The simulation I describe below sought to answer this question for foreign vocabulary learning. Figure inspired by Hebb (1955).
Recently I developed a computational model (see the More tab for preprints) that can be used in tandem with a difficulty threshold to schedule practice based on a student's prior practice history (e.g., spacing, repetition, prior successes and failures). The general concept that I simulated was to track item difficulty over time with the learner model and, on each trial, practice whichever item is closest to the target difficulty threshold (the peak of the curve in the schematic). I ran a simulation to develop that curve (described below).
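To make the selection rule concrete, here is a minimal sketch, assuming a simple exponential-forgetting stand-in rather than the actual learner model (which accounts for spacing and the full practice history in more detail). The constants TARGET, DECAY, and BOOST and the strength updates are illustrative assumptions, not fitted values.

```python
import math
import random

# Minimal sketch of threshold-based scheduling (illustrative stand-in model).
TARGET = 0.94   # difficulty threshold (OET); assumed for this sketch
DECAY = 0.05    # assumed forgetting rate per trial of elapsed time
BOOST = 0.7     # assumed strength gained per practice attempt

# Each item has a memory strength and a record of when it was last practiced.
items = [{"strength": random.uniform(0.2, 1.0), "last_practiced": 0} for _ in range(10)]

def predicted_recall(item, now):
    """Predicted recall probability: strength eroded by time since last practice."""
    elapsed = now - item["last_practiced"]
    return math.exp(-DECAY * elapsed) * (1 - math.exp(-item["strength"]))

for trial in range(1, 101):
    # The scheduling rule: practice whichever item is closest to the threshold.
    chosen = min(items, key=lambda it: abs(predicted_recall(it, trial) - TARGET))
    recalled = random.random() < predicted_recall(chosen, trial)
    # Assumed update: successes strengthen memory more than failures.
    chosen["strength"] += BOOST if recalled else BOOST / 2
    chosen["last_practiced"] = trial
```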
My simulation investigated how much future recall (at some final test) is improved by being tested at various probabilities (difficulties) of remembering now, in the current study session. Easier items are recalled faster, but may provide less learning. Harder items may provide more learning, but take longer (especially if the student fails to remember). What amount of difficulty balances speed, learning gains, and failure risk?
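A toy calculation shows why there is a tradeoff to balance at all. The numbers and functional forms below are invented for illustration (they are not the functions from the simulation): successes are fast, failures are slow, and harder retrievals are assumed to yield larger gains, so learning gain per second peaks at an intermediate recall probability.

```python
# Toy illustration of the speed/learning/failure tradeoff (invented numbers).
T_SUCCESS = 4.0   # assumed seconds for a successful retrieval
T_FAILURE = 12.0  # assumed seconds for a failure plus restudy of the answer

def gain(p):
    # Assumed: lower recall probability -> larger learning gain per attempt.
    return 1.0 - p ** 3

def efficiency(p):
    """Expected learning gain per second when tested at recall probability p."""
    expected_time = p * T_SUCCESS + (1 - p) * T_FAILURE
    expected_gain = gain(p) * (p + 0.3 * (1 - p))  # assumed: failures teach less
    return expected_gain / expected_time

for p in [0.3, 0.5, 0.7, 0.9, 0.98]:
    print(f"recall probability {p:.2f}: gain per second = {efficiency(p):.3f}")
```

With these made-up numbers the peak happens to land near .7; where the peak actually sits depends entirely on the assumed functions, which is why the simulation relies on the learner model rather than hand-picked curves to locate it.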
In the simulation, I imposed a rule - "Practice whichever item is closest to the difficulty threshold X." The figure below shows how imposing this rule played out for one simulated student.
The blue line shows the recall probability of one item over the practice session. Notice that it was practiced (blue squares) when it was close to the difficulty threshold (here .90). Gray lines denote other items practiced in the same fashion.
The figure below shows how different Optimal Efficiency Thresholds (OETs) benefited later memory, alongside conventional schedules. Practicing according to model predictions and OETs (left side of the figure) provided dramatic benefits over conventional heuristics (right side of the figure).
The results of the simulation (3-day delayed final test). Optimal Efficiency Thresholds (OETs) represent different levels of difficulty. Conventional schedules represent simpler heuristics, such as dropping items from practice after a success (drop-1), fixed spacing intervals (vs30), and massed practice (mass-f); all did much worse than practicing according to difficulty thresholds. Error bars denote +/- 1 SE. Simulated N = 200 per condition.
My simulation had students practice whichever item was closest to the target Optimal Efficiency Threshold (OET). Lower thresholds (e.g., OET = .5) enforced harder practice with more spacing (and forgetting) between repetitions. I found that a fairly easy threshold (OET = .94) resulted in the best final-test memory. In other words, a small amount of difficulty struck the best balance between successes (which can be efficient) and failures (which can be time consuming). Because the model accounts for spacing and prior practice history, the schedule of practice is personalized based on student aptitude, item difficulty, and the interactions between them. This method naturally generated an expanding schedule in the simulation, but importantly one that was unique to student and item attributes.
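The expanding schedule falls out of the selection rule on its own. Here is a small sketch, again using an assumed stand-in forgetting model rather than the model from the paper: each practice adds strength, a stronger item takes longer to decay back down to the threshold, so successive intervals grow.

```python
import math

# Why threshold-based scheduling expands the spacing (illustrative stand-in model):
# recall is assumed to follow exp(-DECAY * elapsed / strength), so stronger items
# take longer to fall back to the threshold, and each interval is longer than the last.
THRESHOLD = 0.94
DECAY = 0.08       # assumed forgetting rate
strength = 1.0     # assumed strength after the first study

for rep in range(1, 6):
    # Elapsed time at which predicted recall drops to the threshold:
    # exp(-DECAY * interval / strength) == THRESHOLD.
    interval = -math.log(THRESHOLD) * strength / DECAY
    print(f"repetition {rep}: interval before next practice = {interval:4.1f} time units")
    strength += 1.0  # assumed: each successful practice adds strength
```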
Recent work with Dr. Pavlik tested the above simulations' predictions (forthcoming in npj Science of Learning). Our findings supported our predictions. Below is a plot comparing practice at several OETs (.40, .70, .86, .94, .98) as well as a control condition (repeat each item every 15 trials).
The plot above indicates several findings that are important for education:
1) Practicing adaptively according to a high OET beats traditional approaches by quite a bit (Cohen's d > .5).
2) Not just any OET is effective: practicing at a higher OET results in better memory after a delay than the control or the lower OET (OET40). The amount of difficulty that is desirable is less than typically thought!
3) We can reasonably predict optimal difficulty with only one prior dataset to parameterize our model, even though that dataset came from a different learning context (not adaptive) and from different students. (OET86 and OET94 were indistinguishable in this case.)
4) This model treated all students as having equivalent skill and was still effective. We could clearly make improvements by including parameters that adjust for each student's model error, further personalizing the approach (one possible adjustment is sketched below).
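As a rough, hypothetical sketch of what that per-student adjustment could look like (this is not part of the published model): track each student's prediction error and carry a student-level offset on the logit scale that nudges later predictions up or down as evidence about that student accumulates.

```python
import math

# Hypothetical per-student adjustment (not from the paper): a logit-scale offset
# that drifts toward the observed outcomes for this student.
LEARNING_RATE = 0.1  # assumed step size for the offset update

def adjusted_prediction(p_model, student_offset):
    """Shift the group-level prediction by a student-specific offset."""
    logit = math.log(p_model / (1 - p_model)) + student_offset
    return 1 / (1 + math.exp(-logit))

def update_offset(student_offset, p_adjusted, outcome):
    """Move the offset in the direction of the prediction error (outcome is 0 or 1)."""
    return student_offset + LEARNING_RATE * (outcome - p_adjusted)

# Example: the model predicts .85, but the student keeps failing, so the
# offset drifts downward and later predictions for this student are lower.
offset = 0.0
for outcome in [0, 0, 1, 0]:
    p = adjusted_prediction(0.85, offset)
    offset = update_offset(offset, p, outcome)
    print(f"prediction = {p:.2f}, updated offset = {offset:+.2f}")
```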
See my paper for more details.