As an instructor, I often wonder about the consistency of my quizzes. Are they measuring my students' knowledge equally across different semesters, or are there subtle differences that I need to account for? Recently, I found myself checking on the scores from two semesters of quizzes and realized it might be time to dig a little deeper into this question.
The challenge lies in a well-known limitation of Classical Test Theory (CTT): it’s population-dependent. This means that the characteristics of a quiz—like its difficulty or ability to differentiate between stronger and weaker students—can change depending on who takes it. So, even if I use the same quiz in two semesters, the scores might not be directly comparable because the groups of students differ.
To address this, I decided to try equating the quiz scores. Equating is a statistical method that adjusts scores on two tests (or the same test given at different times) so they can be compared on the same scale. It’s often used in standardized testing to ensure fairness across test administrations, but I thought, "Why not use it for my classroom quizzes?"
In this blog post, I start by comparing the quiz's item characteristics in each semester. Then, I try several test equating methods and compare their results.
Of course, we need to load the required R packages and our dataset first.
library("equate")
library("CTTvis")
df1 <- read.csv("df_winter24.csv", header = TRUE)
df1 <- df1[,1:30]
df2 <- read.csv("df_fall24.csv", header = TRUE)
> head(df1)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30
1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1
2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
3 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0
4 1 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1
5 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 0 1 1
6 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> head(df2)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30
1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
2 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 1 1
3 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1
4 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
6 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
To compare the item properties of the two quiz datasets, I used the CTTvis R package. This package visualizes CTT item properties to make them easier to interpret. I wrote it myself while working with standardized test data, in the hope that it would simplify the communication of results.
CTTvis::difficulty_plot(responses = df1, title = "Item Difficulty Plot: Winter 2024", easyFlag = .90, hardFlag = .50)
item difficulty
2 2 0.4803922
19 19 0.5980392
20 20 0.5980392
28 28 0.6568627
7 7 0.6764706
24 24 0.6960784
16 16 0.7156863
8 8 0.7254902
9 9 0.7254902
13 13 0.7254902
25 25 0.7352941
17 17 0.7745098
4 4 0.8039216
15 15 0.8235294
26 26 0.8333333
18 18 0.8431373
22 22 0.8431373
23 23 0.8529412
10 10 0.8627451
1 1 0.8823529
5 5 0.8921569
21 21 0.9019608
30 30 0.9215686
27 27 0.9411765
3 3 0.9509804
6 6 0.9509804
29 29 0.9607843
11 11 0.9705882
12 12 0.9705882
14 14 0.9803922
CTTvis::difficulty_plot(responses = df2, title = "Item Difficulty Plot: Fall 2024", easyFlag = .90, hardFlag = .50)
item difficulty
9 9 0.5441176
2 2 0.6029412
19 19 0.6323529
7 7 0.6764706
17 17 0.6764706
24 24 0.6911765
4 4 0.7058824
20 20 0.7205882
16 16 0.7500000
8 8 0.7794118
28 28 0.7794118
13 13 0.7941176
18 18 0.7941176
15 15 0.8235294
25 25 0.8382353
10 10 0.8529412
5 5 0.8676471
30 30 0.8676471
6 6 0.8823529
21 21 0.8823529
23 23 0.8823529
22 22 0.8970588
1 1 0.9117647
26 26 0.9117647
3 3 0.9411765
11 11 0.9411765
12 12 0.9411765
14 14 0.9411765
29 29 0.9411765
27 27 0.9558824
CTTvis::point_biserial_plot(responses = df1, title = "Item Discrimination Plot: Winter 2024", pBis_threshold = 0.20)
item point_biserial
7 7 -0.06355892
26 26 0.02904733
12 12 0.03273792
6 6 0.07458746
16 16 0.09358458
3 3 0.11401358
1 1 0.13196345
2 2 0.13934851
4 4 0.14284214
24 24 0.18586588
27 27 0.19471860
20 20 0.21113069
22 22 0.23163443
11 11 0.23436840
5 5 0.24464620
17 17 0.24993731
10 10 0.25172295
8 8 0.26785130
15 15 0.28557592
25 25 0.30646985
19 19 0.31051269
13 13 0.31559823
18 18 0.32186358
28 28 0.32543272
9 9 0.32935053
21 21 0.32960215
23 23 0.34233690
30 30 0.35391849
29 29 0.44917215
14 14 0.50771620
CTTvis::point_biserial_plot(responses = df2, title = "Item Discrimination Plot: Fall 2024", pBis_threshold = 0.20)
item point_biserial
4 4 0.1114014
10 10 0.1273124
26 26 0.1512393
9 9 0.1690738
7 7 0.1809274
18 18 0.2036338
1 1 0.2333230
24 24 0.2342623
11 11 0.2365007
16 16 0.2660486
27 27 0.3109581
19 19 0.3206996
20 20 0.3253837
28 28 0.3795062
5 5 0.3953602
6 6 0.3977969
29 29 0.4076268
17 17 0.4105264
13 13 0.4163146
2 2 0.4345626
25 25 0.4436627
21 21 0.4513741
23 23 0.4513741
8 8 0.4729187
3 3 0.4943550
15 15 0.5003408
12 12 0.5380173
14 14 0.5380173
30 30 0.5498302
22 22 0.5512179
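The next two tables report Cronbach's alpha if each item were dropped, following the same ordering as before (presumably Winter 2024 first, then Fall 2024). The CTTvis call that produced them is not shown; as a hedged alternative, the same alpha-if-dropped values can be obtained with the psych package (assuming it is installed):
# Alternative to the (omitted) CTTvis reliability call: alpha-if-item-dropped via psych.
# Assumes the psych package is installed; df1 and df2 hold only the 30 item columns here.
library("psych")
alpha_winter <- psych::alpha(df1)   # reliability analysis, Winter 2024
alpha_fall   <- psych::alpha(df2)   # reliability analysis, Fall 2024
alpha_winter$alpha.drop             # Cronbach's alpha if each item is dropped
alpha_fall$alpha.drop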
item alpha_if_dropped
9 9 0.6685584
28 28 0.6685940
23 23 0.6694850
13 13 0.6698523
19 19 0.6700038
25 25 0.6707590
18 18 0.6707750
29 29 0.6714002
30 30 0.6717746
21 21 0.6721334
15 15 0.6732702
14 14 0.6738117
8 8 0.6743101
17 17 0.6759520
10 10 0.6762716
5 5 0.6771712
22 22 0.6775833
20 20 0.6801310
11 11 0.6806577
27 27 0.6809355
24 24 0.6820674
1 1 0.6844041
4 4 0.6846892
3 3 0.6847147
6 6 0.6864200
12 12 0.6875058
2 2 0.6875671
16 16 0.6902958
26 26 0.6927449
7 7 0.7051945
item alpha_if_dropped
30 30 0.8137246
22 22 0.8145888
15 15 0.8146258
8 8 0.8152722
2 2 0.8166562
25 25 0.8168616
12 12 0.8169768
14 14 0.8169768
21 21 0.8172245
23 23 0.8172245
13 13 0.8175940
17 17 0.8177340
3 3 0.8179658
5 5 0.8187976
6 6 0.8188836
28 28 0.8190156
29 29 0.8199173
20 20 0.8213736
19 19 0.8219095
27 27 0.8224025
16 16 0.8236841
11 11 0.8237171
1 1 0.8237694
24 24 0.8254451
18 18 0.8257339
26 26 0.8259368
10 10 0.8275892
7 7 0.8278594
9 9 0.8291381
4 4 0.8304859
From the comparison above, we can see that the quiz's item difficulty, item discrimination, and test reliability differ slightly across the two semesters.
Test equating is a method used to ensure that scores from two or more different tests (or versions of a test) can be compared fairly.
Think of it like this: imagine two students take math tests from different schools. One test has harder questions, while the other is easier. If we simply compare their raw scores, it wouldn't be fair. One student might score lower just because their test was tougher.
To fix this, we use test equating to adjust the scores, so they reflect the same level of ability, no matter which test was taken. It’s like converting weights measured in kilograms and pounds into a common unit so they make sense together.
There are different ways to equate tests, but the goal is always the same: make the scores comparable and ensure that test difficulty doesn’t give an unfair advantage or disadvantage.
Test equating can also be applied to the same test given at two different time points, not just different versions of a test.
Even if the test questions are the same, the context in which it is administered may change. For example:
Students may prepare differently.
External factors (like teaching quality, curriculum changes, or test-taker anxiety) can affect performance.
When equating scores for the same test across time points, we’re ensuring that any differences in scores truly reflect differences in abilities, not changes in conditions or test difficulty over time.
This approach is common in longitudinal studies or standardized testing programs (e.g., SATs or PISA), where the goal is to track trends or progress. The equating adjusts for subtle shifts to keep comparisons meaningful and fair.
There are two broad approaches to test equating: linear and non-linear.
Both methods adjust scores so that tests can be fairly compared, but they handle differences in score distributions differently:
Linear Equating
What it does: Assumes the relationship between scores on the two test forms (or time points) is a straight line.
How it works: It matches the mean and standard deviation of scores from the two tests. This assumes that both test versions measure the same construct similarly but might have small differences in difficulty or variability.
When to use: Works well if the score distributions of the two tests are pretty similar (e.g., similar shape and spread).
Analogy: Imagine you’re resizing two photographs to match the same dimensions. You stretch or compress them equally without changing their overall shape.
Non-Linear Equating
What it does: Allows for a more flexible relationship between scores. It adjusts for differences in the shapes of the score distributions, not just their means and standard deviations.
How it works: It uses techniques like equipercentile equating, which aligns the percentile ranks of scores from the two tests. For instance, if a score of 70 was at the 80th percentile in the first test, it will be matched to the score at the 80th percentile in the second test.
When to use: Works well when the score distributions of the two tests are very different (e.g., one is skewed, while the other is normal).
Analogy: Think of reshaping two lumps of clay to make them the same size and shape. You might need more complex adjustments to match their contours.
Which approach should you use? It depends on the score distributions of your tests at the two time points:
Linear Equating is simpler and appropriate if the tests have similar distributions (e.g., symmetrical bell-shaped curves).
Non-Linear Equating is better when the tests have noticeable differences in distribution (e.g., one test has more high scores, while the other has a mix).
Steps to Decide
Plot the score distributions for the two time points (histogram or density plot).
If they look similar, try linear equating.
If they look very different, go for non-linear equating like equipercentile equating.
Tip: Non-linear methods are more flexible and generally preferred when in doubt.
To determine whether to use linear or non-linear equating for our two test sessions (df1 for Winter 2024 and df2 for Fall 2024), we can compare their score distributions visually and statistically in R as follows.
# Compute each student's total score across the 30 items
df1$score <- rowSums(df1[, grep("^Q", colnames(df1))], na.rm = TRUE)
df2$score <- rowSums(df2[, grep("^Q", colnames(df2))], na.rm = TRUE)
# Combine data for plotting
df1$session <- "Winter2024"
df2$session <- "Fall2024"
combined_df <- rbind(df1, df2)
library(ggplot2)
# Plot the distributions
ggplot(combined_df, aes(x = score, fill = session)) +
  geom_density(alpha = 0.5) +
  labs(title = "Score Distributions Between Two Semesters", x = "Score", y = "Density") +
  scale_fill_manual(values = c("Winter2024" = "blue", "Fall2024" = "orange")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
> summary(df1$score) # Winter 2024
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.00 22.00 25.00 24.13 26.25 30.00
> summary(df2$score) # Fall 2024
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 23.00 25.00 24.41 28.00 30.00
We can also perform an exact two-sample Kolmogorov-Smirnov (KS) test, which compares two samples to determine whether they come from the same underlying distribution.
> ks.test(df1$score, df2$score)
Exact two-sample Kolmogorov-Smirnov test
data: df1$score and df2$score
D = 0.12681, p-value = 0.2838
alternative hypothesis: two-sided
D = 0.12681: This is the KS test statistic. It represents the maximum difference between the cumulative distributions of the two samples.
p-value = 0.2838: This is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. A higher p-value indicates that the null hypothesis cannot be rejected.
Since the p-value (0.2838) is greater than the common significance level (e.g., 0.05), we fail to reject the null hypothesis. This suggests that there is no significant difference between the distributions of df1$score and df2$score.
For good measure, I also want to examine mean differences between the two datasets. An independent-samples t-test would be the default choice, but it assumes normally distributed scores, so I first check that assumption.
When analyzing data, it's crucial to determine if your data follows a normal distribution. This helps in choosing the right statistical tests. Here, we’ll walk through the results of normality tests and a non-parametric test for comparing two groups.
Shapiro-Wilk Normality Test
The Shapiro-Wilk test checks if your data is normally distributed. Here are the results for two datasets, df1$score and df2$score:
For df1$score:
Test Statistic (W): 0.87426
p-value: 6.635e-08
For df2$score:
Test Statistic (W): 0.85155
p-value: 8.69e-07
Interpretation:
The p-values for both tests are extremely small (much less than 0.05), indicating that we reject the null hypothesis. This means that neither df1$score nor df2$score follows a normal distribution.
> shapiro.test(df1$score)
Shapiro-Wilk normality test
data: df1$score
W = 0.87426, p-value = 6.635e-08
> shapiro.test(df2$score)
Shapiro-Wilk normality test
data: df2$score
W = 0.85155, p-value = 8.69e-07
Mann-Whitney U Test (Wilcoxon Rank-Sum Test)
Since our data is not normally distributed, we use the Mann-Whitney U test, a non-parametric test, to compare the two groups.
> # Mann-Whitney U Test
> wilcox.test(df1$score, df2$score)
Wilcoxon rank sum test with continuity correction
data: df1$score and df2$score
W = 3313, p-value = 0.3926
alternative hypothesis: true location shift is not equal to 0
> library(ggstatsplot)
> # Create the plot
> ggbetweenstats(data = combined_df, x = session, y = score, type = "nonparametric")
The p-value (0.3926) is greater than the common significance level (e.g., 0.05), so we fail to reject the null hypothesis. This suggests that there is no significant difference in location (central tendency) between df1$score and df2$score.
We begin by setting up the data for test equating.
> #Set up the datasets for test equating
> form1 <- df1[,31:32]
> form2 <- df2[,31:32]
>
> # Combine the forms
> form12 <- rbind(form1, form2)
> head(form12)
score session
1 26 Winter2024
2 28 Winter2024
3 20 Winter2024
4 24 Winter2024
5 23 Winter2024
6 28 Winter2024
>
> # Add score frequencies to the data
> data <- as.data.frame(table(form12$score, form12$session))
> names(data) <- c("total", "session", "count")
> head(data)
total session count
1 6 Fall2024 1
2 8 Fall2024 1
3 14 Fall2024 0
4 16 Fall2024 1
5 17 Fall2024 1
6 18 Fall2024 1
>
> # Restructure the data as a frequency table
> data_1 <- as.freqtab(data[data$session == "Winter2024", c("total", "count")])
> data_2 <- as.freqtab(data[data$session == "Fall2024", c("total", "count")])
> head(data_1)
total count
1 6 0
2 8 2
3 14 1
4 16 0
5 17 2
6 18 1
>
> # Descriptive summary of the forms
> rbind(form_1 = summary(data_1), form_2 = summary(data_2))
mean sd skew kurt min max n
form_1 24.13462 3.834077 -1.627154 7.327239 8 30 104
form_2 24.40580 4.541698 -1.651006 7.101548 6 30 69
What is Mean Equating?
Mean equating is one of the simplest methods used to equate scores from two different test forms or sessions. The method adjusts the scores of one distribution so that its mean matches the mean of the target distribution. This assumes the two distributions are identical in shape and variability, differing only in their central tendencies (means).
In this case:
data_1 represents scores from Winter 2024.
data_2 represents scores from Fall 2024.
The equated scores (yx) are the adjusted scores from data_1 that are transformed to align with the mean of data_2.
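Before calling equate(), we can sanity-check what mean equating will do: every Winter 2024 score should simply be shifted by the difference between the two semester means. A quick check using the raw score vectors created earlier:
# Mean equating adds a constant: mean(Fall scores) - mean(Winter scores)
shift <- mean(form2$score) - mean(form1$score)
shift   # about 0.27, matching the intercept reported by equate() below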
> mean_12 <- equate(x = data_1, y = data_2, type = "mean")
> mean_12$concordance
scale yx
1 6 6.271182
2 8 8.271182
3 14 14.271182
4 16 16.271182
5 17 17.271182
6 18 18.271182
7 19 19.271182
8 20 20.271182
9 21 21.271182
10 22 22.271182
11 23 23.271182
12 24 24.271182
13 25 25.271182
14 26 26.271182
15 27 27.271182
16 28 28.271182
17 29 29.271182
18 30 30.271182
> mean_12
Mean Equating: data_1 to data_2
Design: equivalent groups
Summary Statistics:
mean sd skew kurt min max n
x 24.13 3.83 -1.63 7.33 8.00 30.00 104
y 24.41 4.54 -1.65 7.10 6.00 30.00 69
yx 24.41 3.83 -1.63 7.33 8.27 30.27 104
Coefficients:
intercept slope cx cy sx sy
0.2712 1.0000 18.0000 18.0000 24.0000 24.0000
> equated_scores_mean <- mean_12$concordance$yx # Extract equated scores
>
> # Plot the original score scale vs. equated score (yx)
> plot(mean_12$concordance$scale, mean_12$concordance$yx,
+ type = "o", col = "blue", pch = 16,
+ xlab = "Original Scores (Winter 2024)",
+ ylab = "Equated Scores (Fall 2024)",
+ main = "Equated Scores vs Original Scores: Mean Equating")
>
> # Optionally, add a line of equality (where original score = equated score)
> abline(a = 0, b = 1, col = "gray", lty = 2)
> # Extract the relevant columns from the concordance table
> comparison_table <- mean_12$concordance[, c("scale", "yx")]
>
> # Rename the columns for clarity
> colnames(comparison_table) <- c("Original Score (Winter 2024)", "Equated Score (Fall 2024)")
>
> # Print the table
> print(comparison_table)
Original Score (Winter 2024) Equated Score (Fall 2024)
1 6 6.271182
2 8 8.271182
3 14 14.271182
4 16 16.271182
5 17 17.271182
6 18 18.271182
7 19 19.271182
8 20 20.271182
9 21 21.271182
10 22 22.271182
11 23 23.271182
12 24 24.271182
13 25 25.271182
14 26 26.271182
15 27 27.271182
16 28 28.271182
17 29 29.271182
18 30 30.271182
Interpreting the Output
The concordance table provides a mapping of scores from data_1 to their equated values (yx) in the context of data_2. Here's what each column represents:
scale: The original scores from data_1 (Winter 2024).
yx: The equated scores—how each score from data_1 would map onto data_2 (Fall 2024) based on mean equating.
Summary Statistics
The summary table compares the key metrics of the original and equated scores:
Original Scores (x): Scores from Winter 2024.
Target Scores (y): Scores from Fall 2024.
Equated Scores (yx): Transformed Winter 2024 scores.
Coefficients
The coefficients used for the linear transformation:
Intercept: The constant adjustment applied to the scores (0.27).
Slope: The scaling factor for adjusting score variability (1.00).
These values ensure that the means of the two distributions align.
Key Insights
The table shows that each score from data_1 is increased by a constant value (here, approximately 0.27) to align with the higher mean of data_2. For example:
A score of 6 in data_1 is equated to 6.27 in data_2.
A score of 30 in data_1 is equated to 30.27 in data_2.
This constant adjustment reflects the difference in means between the two distributions, without altering the shape or variability of the scores. The transformed scores are uniformly shifted upward.
What is Mean Equating For?
Mean equating is useful when:
The tests are similar in content and difficulty.
Differences between the two distributions are primarily in their central tendencies (e.g., test-takers in one session performed slightly better overall).
However, mean equating does not account for differences in variability or score distribution shapes. If the variability (spread) of scores differs significantly, other methods like linear equating may be more appropriate.
Imagine two classes take the same math test at different times. On average, students in one class scored 0.27 points higher than the other, even though the test content was identical. Mean equating shifts the scores of the lower-scoring class upward by 0.27 points across the board, so the two classes can be compared fairly. Each student's performance remains relative to their peers, but the average difference between the classes is removed.
What is Linear Equating?
Linear equating is a method that adjusts scores by aligning both the means and the variability (standard deviation) of two score distributions. Unlike mean equating, which only adjusts the central tendency, linear equating also accounts for differences in the spread of scores between the two groups.
In this case:
data_1 represents scores from Winter 2024.
data_2 represents scores from Fall 2024.
The equated scores (yx) represent the scores from data_1 transformed to match the mean and variability of data_2.
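As with mean equating, we can check the linear transformation by hand: the slope is the ratio of the two standard deviations and the intercept aligns the means, so yx = intercept + slope * x. A rough check with the raw score vectors (the result may differ slightly in the last decimals from equate(), which computes its moments from the frequency tables):
# Linear equating: yx = intercept + slope * x, where
#   slope     = sd(Fall scores) / sd(Winter scores)
#   intercept = mean(Fall scores) - slope * mean(Winter scores)
slope     <- sd(form2$score) / sd(form1$score)
intercept <- mean(form2$score) - slope * mean(form1$score)
c(intercept = intercept, slope = slope)   # roughly -4.18 and 1.18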
> linear_12 <- equate(x = data_1, y = data_2, type = "linear")
> linear_12$concordance
scale yx se
1 6 2.924239 21.5782593
2 8 5.293361 17.5636238
3 14 12.400727 8.0331088
4 16 14.769849 5.6940675
5 17 15.954410 4.6816338
6 18 17.138971 3.7739248
7 19 18.323532 2.9709404
8 20 19.508093 2.2726806
9 21 20.692654 1.6791455
10 22 21.877215 1.1903349
11 23 23.061776 0.8062491
12 24 24.246337 0.5268878
13 25 25.430898 0.3522512
14 26 26.615459 0.2823393
15 27 27.800020 0.3171519
16 28 28.984581 0.4566892
17 29 30.169142 0.7009512
18 30 31.353703 1.0499377
>
> linear_12
Linear Equating: data_1 to data_2
Design: equivalent groups
Summary Statistics:
mean sd skew kurt min max n
x 24.13 3.83 -1.63 7.33 8.00 30.00 104
y 24.41 4.54 -1.65 7.10 6.00 30.00 69
yx 24.41 4.54 -1.63 7.33 5.29 31.35 104
Coefficients:
intercept slope cx cy sx sy
-4.1831 1.1846 18.0000 18.0000 24.0000 24.0000
>
> # Plot the original score scale vs. equated score (yx)
> plot(linear_12$concordance$scale, linear_12$concordance$yx,
+ type = "o", col = "blue", pch = 16,
+ xlab = "Original Scores (Winter 2024)",
+ ylab = "Equated Scores (Fall 2024)",
+ main = "Equated Scores vs Original Scores: Linear Equating")
>
> # Optionally, add a line of equality (where original score = equated score)
> abline(a = 0, b = 1, col = "gray", lty = 2)
> # Extract the relevant columns from the concordance table
> comparison_table <- linear_12$concordance[, c("scale", "yx")]
>
> # Rename the columns for clarity
> colnames(comparison_table) <- c("Original Score (Winter 2024)", "Equated Score (Fall 2024)")
>
> # Print the table
> print(comparison_table)
Original Score (Winter 2024) Equated Score (Fall 2024)
1 6 2.924239
2 8 5.293361
3 14 12.400727
4 16 14.769849
5 17 15.954410
6 18 17.138971
7 19 18.323532
8 20 19.508093
9 21 20.692654
10 22 21.877215
11 23 23.061776
12 24 24.246337
13 25 25.430898
14 26 26.615459
15 27 27.800020
16 28 28.984581
17 29 30.169142
18 30 31.353703
Interpreting the Output
The concordance table (linear_12$concordance) provides three key columns:
scale: The original scores from data_1 (Winter 2024).
yx: The equated scores—how each score from data_1 is adjusted to match the distribution of data_2 (Fall 2024).
se: The standard error of the equated scores, which reflects the uncertainty in the equating process.
Example rows:
A score of 6 in data_1 equates to 2.92 in data_2, with a standard error of 21.58.
A score of 24 in data_1 equates to 24.25 in data_2, with a standard error of 0.53.
The variability in the standard error (high for extreme scores, low for scores near the center) indicates that scores closer to the mean are more reliably equated.
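This pattern is easy to see by plotting the standard error column of the concordance table against the score scale:
# Equating standard error across the score scale: large at the extremes,
# small near the center of the score distribution
plot(linear_12$concordance$scale, linear_12$concordance$se,
     type = "o", pch = 16,
     xlab = "Original Score (Winter 2024)",
     ylab = "Standard Error of Equating",
     main = "Linear Equating: Standard Error by Score")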
Summary Statistics
The summary table compares the key metrics of the original and equated scores:
Original Scores (x): Scores from Winter 2024.
Target Scores (y): Scores from Fall 2024.
Equated Scores (yx): Transformed Winter 2024 scores.
Coefficients
The coefficients used for the linear transformation:
Intercept: The constant adjustment applied to the scores (-4.18).
Slope: The scaling factor for adjusting score variability (1.18).
These values ensure that both the means and standard deviations of the two distributions align.
Key Insights
Linear equating ensures that:
Mean alignment: The average of data_1 is adjusted to match the average of data_2.
Variability alignment: The spread (standard deviation) of data_1 is scaled to match data_2.
For example:
A low score like 6 is mapped to 2.92, reflecting that low scores in Winter 2024 were "easier to achieve" compared to Fall 2024.
A mid-range score like 24 remains close at 24.25, indicating consistency near the mean.
A high score like 30 maps to 31.35, showing that high scores in Winter 2024 are adjusted upward to align with Fall 2024’s higher variability.
What is Linear Equating For?
Linear equating is ideal when:
The tests are slightly different, and their difficulty levels vary consistently across the score range.
The score distributions have similar shapes but differ in spread.
This method is often used in standardized testing to compare scores across different forms of a test.
Imagine two groups of students take a math test at different times, and one group's scores are more spread out than the other's. Linear equating adjusts not just the average score but also the range (variability) of scores so that the two groups match. This gives a fair comparison, ensuring both high and low scores are treated consistently.
Let's try performing non-linear equating for practice. I will be using equipercentile equating.
What is Equipercentile Equating?
Equipercentile equating is a non-linear method that aligns scores by matching their cumulative percentile ranks across two distributions. Unlike linear equating, it does not assume that the relationship between the two score distributions is straight or uniform. Instead, it accounts for varying relationships between the scores at different points of the scale.
In this case:
data_1 represents scores from Winter 2024.
data_2 represents scores from Fall 2024.
The equated scores (yx) represent data_1 scores transformed so their percentile ranks align with the percentile ranks of data_2.
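Conceptually, equipercentile equating finds a score's percentile rank in one distribution and returns the score at the same percentile rank in the other. A rough base-R sketch of the idea (this ignores the continuization and mid-percentile conventions that equate() applies, so the numbers will not match the concordance table exactly):
# Rough illustration only, not the exact algorithm used by equate():
# 1) percentile rank of a Winter 2024 score, 2) the Fall 2024 score at that rank
p_rank <- ecdf(form1$score)(24)                   # percentile rank of a 24 in Winter 2024
quantile(form2$score, probs = p_rank, type = 1)   # approximate Fall 2024 equivalent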
> equi_12 <- equate(x = data_1, y = data_2, type = "equipercentile")
> equi_12$concordance
scale yx se
1 6 5.000000 0.0000000
2 8 6.326923 0.9326041
3 14 8.975962 1.5709721
4 16 9.971154 1.7905021
5 17 17.798077 2.0046068
6 18 17.561298 2.3497311
7 19 18.144231 2.6412717
8 20 19.649038 1.6280642
9 21 21.657051 0.6589761
10 22 22.375801 0.7456687
11 23 22.884615 0.4249101
12 24 23.457605 0.4730608
13 25 25.040865 1.3155507
14 26 26.963141 0.8188069
15 27 28.455929 0.7182188
16 28 28.975561 0.2520001
17 29 29.417869 0.1634453
18 30 30.334135 0.2609886
>
> equi_12
Equipercentile Equating: data_1 to data_2
Design: equivalent groups
Smoothing Method: none
Summary Statistics:
mean sd skew kurt min max n
x 24.13 3.83 -1.63 7.33 8.00 30.00 104
y 24.41 4.54 -1.65 7.10 6.00 30.00 69
yx 24.45 4.43 -1.68 7.45 6.33 30.33 104
> # Plot the original score scale vs. equated score (yx)
> plot(equi_12$concordance$scale, equi_12$concordance$yx,
+ type = "o", col = "blue", pch = 16,
+ xlab = "Original Scores (Winter 2024)",
+ ylab = "Equated Scores (Fall 2024)",
+ main = "Equated Scores vs Original Scores: Equipercentile Equating")
>
> # Optionally, add a line of equality (where original score = equated score)
> abline(a = 0, b = 1, col = "gray", lty = 2)
> # Extract the relevant columns from the concordance table
> comparison_table <- equi_12$concordance[, c("scale", "yx")]
>
> # Rename the columns for clarity
> colnames(comparison_table) <- c("Original Score (Winter 2024)", "Equated Score (Fall 2024)")
>
> # Print the table
> print(comparison_table)
Original Score (Winter 2024) Equated Score (Fall 2024)
1 6 5.000000
2 8 6.326923
3 14 8.975962
4 16 9.971154
5 17 17.798077
6 18 17.561298
7 19 18.144231
8 20 19.649038
9 21 21.657051
10 22 22.375801
11 23 22.884615
12 24 23.457605
13 25 25.040865
14 26 26.963141
15 27 28.455929
16 28 28.975561
17 29 29.417869
18 30 30.334135
Interpreting the Output
The concordance table (equi_12$concordance) shows:
scale: Original scores from Winter 2024.
yx: Equipercentile-equated scores—what each score in data_1 would correspond to in data_2 based on matching percentiles.
se: Standard error of equating, indicating the reliability of the equated score.
Example rows:
A score of 6 in Winter 2024 equates to 5.00 in Fall 2024, with a standard error of 0 (perfect certainty due to matching at the extremes).
A score of 24 equates to 23.46, with a standard error of 0.47, reflecting low uncertainty.
A score of 29 equates to 29.42, with very low standard error (0.16).
The non-linear nature is evident in how the adjustments vary irregularly across the scale. For example, some scores are adjusted more significantly than others (e.g., 14 → 8.98), while others remain similar (e.g., 28 → 28.98).
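Some of this irregularity comes from equating the raw, unsmoothed sample frequencies; the summary above reports "Smoothing Method: none". The equate package can presmooth the score distributions before equating. A hedged sketch, assuming the smoothmethod argument accepts "loglinear" (check ?equate for the exact options):
# Hedged sketch: equipercentile equating with loglinear presmoothing.
# The smoothmethod argument and its option names are assumptions to verify in ?equate.
equi_12_smooth <- equate(x = data_1, y = data_2, type = "equipercentile",
                         smoothmethod = "loglinear")
equi_12_smooth$concordance   # the equated scores should now vary more smoothly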
Summary Statistics
The summary statistics table compares the original (x), target (y), and equated (yx) score distributions:
Mean and SD Alignment: The mean (24.45) and standard deviation (4.43) of yx closely align with those of y (Fall 2024 scores: mean = 24.41, SD = 4.54), showing successful equating.
Skewness and Kurtosis: These metrics indicate the shape of the distributions. The equated scores preserve some characteristics of x while aligning with y.
Key Insights
Equipercentile equating:
Matches scores based on their percentile ranks rather than their raw values or spread.
Provides more nuanced adjustments, especially useful when score distributions differ in shape.
Is ideal for tests where differences in difficulty or format affect various score ranges unevenly.
For example:
A low score like 6 maps to 5.00, showing that very low Winter 2024 scores are considered slightly easier compared to Fall 2024.
A mid-range score like 24 adjusts minimally to 23.46, indicating similarity around the central range.
A high score like 30 maps to 30.33, reflecting that high scorers face relatively consistent difficulty across the groups.
Why Use Non-Linear Equating?
Non-linear equating is beneficial when:
The shapes of the score distributions differ (e.g., one test has more extreme high or low scores than the other).
The relationship between scores in the two groups is not consistent across the entire range (e.g., low scores might differ more than high scores).
Equipercentile equating ensures that if a person scored in the top 10% in Winter 2024, their equated score would place them in the top 10% of Fall 2024, even if the tests differ in difficulty. Unlike linear equating, which assumes a uniform adjustment, equipercentile equating adjusts scores more flexibly to reflect differences across the entire score range.
For example:
A score of 6 in Winter 2024 is more comparable to 5 in Fall 2024 because low scores were easier to achieve in Winter.
A score of 30 maps closely to 30.33, as high scores were similarly challenging in both tests.
Finally, let's compare the results of the three equating methods side by side.
> # Compare equated scores from the three methods on the observed score scale
> round(cbind(scale = mean_12$concordance$scale,
+             mean = mean_12$concordance$yx,
+             linear = linear_12$concordance$yx,
+             equipercentile = equi_12$concordance$yx), 2)
      scale  mean linear equipercentile
 [1,]     6  6.27   2.92           5.00
 [2,]     8  8.27   5.29           6.33
 [3,]    14 14.27  12.40           8.98
 [4,]    16 16.27  14.77           9.97
 [5,]    17 17.27  15.95          17.80
 [6,]    18 18.27  17.14          17.56
 [7,]    19 19.27  18.32          18.14
 [8,]    20 20.27  19.51          19.65
 [9,]    21 21.27  20.69          21.66
[10,]    22 22.27  21.88          22.38
[11,]    23 23.27  23.06          22.88
[12,]    24 24.27  24.25          23.46
[13,]    25 25.27  25.43          25.04
[14,]    26 26.27  26.62          26.96
[15,]    27 27.27  27.80          28.46
[16,]    28 28.27  28.98          28.98
[17,]    29 29.27  30.17          29.42
[18,]    30 30.27  31.35          30.33
>
> # Plot the results
> plot(mean_12, linear_12, equi_12, lty=c(1,2,3),
+ col=c("blue", "red", "forestgreen"), addident = FALSE)
1. What Happens When the Distributions Match?
When the distributions of the two tests are identical (or very similar), non-linear equating essentially behaves similarly to linear equating because percentile ranks are already aligned. The equated scores will closely match the original scores, and any adjustments will be minimal or negligible. This is because there is no substantial shape difference to address.
For example:
A score in the 50th percentile in Test A already aligns with a score in the 50th percentile in Test B.
Non-linear equating won’t distort the alignment but may introduce minor variations due to technical adjustments.
2. Is There Any Negative Effect?
There is no direct harm in applying non-linear equating, but there are a few points to consider:
Added Complexity
Non-linear equating introduces more steps and complexity, making it harder to explain to stakeholders or audiences who may expect simpler linear methods when distributions align.
It might seem redundant and could raise questions about why a more complex approach was used when simpler methods suffice.
Risk of Overfitting
If there are small random fluctuations in the data, non-linear equating might over-adjust in places where no adjustment is necessary. This can make the results less interpretable, especially if the adjustments do not align with practical differences in test difficulty or score meaning.
Increased Standard Error
Non-linear methods generally have slightly higher standard errors compared to linear methods, especially at the extremes of the score range. If the distributions match closely, this might introduce unnecessary uncertainty.
3. What Does It Mean to Use Non-Linear Equating in This Case?
If you use non-linear equating when distributions match:
You are essentially double-checking alignment at all score levels and ensuring that any nuanced differences (even if small) are accounted for.
The results should align closely with linear equating, but small deviations might reflect random variations rather than meaningful differences.
This might be interpreted as over-precision, especially if:
The audience expects simple explanations, and the differences introduced by non-linear equating are negligible or hard to justify.
There is no practical reason to believe that scores at different points in the distribution would require differential adjustment.
4. When Might Non-Linear Equating Still Be Useful?
Non-linear equating could still have a role even if distributions match, for example:
Consistency: If non-linear methods are the standard in your context, using them ensures comparability with past analyses.
Validation: You might want to verify that linear assumptions hold true by comparing linear and non-linear results. If they match, it strengthens the case for using linear equating.
5. Recommendation
If distributions match closely, linear equating is simpler, more interpretable, and equally effective. Non-linear equating in this case might add unnecessary complexity without meaningful benefit.
If distributions differ even slightly, non-linear equating ensures precision and fairness across the score range.
Ultimately, using non-linear equating in this scenario won’t harm your results but might be seen as unnecessarily complicated. It’s a choice that should align with your audience’s expectations and the context of your analysis.
By using both linear and non-linear equating methods, I saw how statistical techniques could align scores from two semesters. This helps me understand whether the quizzes were functioning consistently. Linear equating showed how similar the quizzes were when adjusted for differences in means and variability, while non-linear equating highlighted where slight differences might arise due to more complex factors.
For anyone exploring educational data, equating is a powerful tool. Whether you’re an educator, a researcher, or just a curious learner, it offers a way to dig deeper into your data and make meaningful comparisons. As for me, I’m leaving this process with a clearer picture of how my quizzes perform. Thanks for following along! I hope it inspires you to explore your own data stories!