As an instructor, I often wonder about the consistency of my quizzes. Are they measuring my students' knowledge equally across different semesters, or are there subtle differences that I need to account for? Recently, I found myself checking on the scores from two semesters of quizzes and realized it might be time to dig a little deeper into this question.
The challenge lies in a well-known limitation of Classical Test Theory (CTT): it’s population-dependent. This means that the characteristics of a quiz—like its difficulty or ability to differentiate between stronger and weaker students—can change depending on who takes it. So, even if I use the same quiz in two semesters, the scores might not be directly comparable because the groups of students differ.
To address this, I decided to try equating the quiz scores. Equating is a statistical method that adjusts scores on two tests (or the same test given at different times) so they can be compared on the same scale. It’s often used in standardized testing to ensure fairness across test administrations, but I thought, "Why not use it for my classroom quizzes?"
In this blog post, I start by comparing the quiz's item characteristics in each semester. Then, I try several test equating methods and compare their results.
Of course, we need to load the required R packages and our dataset first.
library("equate")
library("CTTvis")
df1 <- read.csv("df_winter24.csv", header = TRUE)
df1 <- df1[,1:30]
df2 <- read.csv("df_fall24.csv", header = TRUE)
> head(df1)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30
1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1
2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
3 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 0
4 1 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1
5 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 0 1 1
6 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> head(df2)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30
1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
2 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 1 1
3 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1
4 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
6 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
To compare the item properties of the two quiz datasets, I used the CTTvis R package. This package visualizes CTT item properties to make them easier to interpret. I wrote it myself while working with standardized test data, in the hope that it would simplify the communication of results.
CTTvis::difficulty_plot(responses = df1, title = "Item Difficulty Plot: Winter 2024", easyFlag = .90, hardFlag = .50)
item difficulty
2 2 0.4803922
19 19 0.5980392
20 20 0.5980392
28 28 0.6568627
7 7 0.6764706
24 24 0.6960784
16 16 0.7156863
8 8 0.7254902
9 9 0.7254902
13 13 0.7254902
25 25 0.7352941
17 17 0.7745098
4 4 0.8039216
15 15 0.8235294
26 26 0.8333333
18 18 0.8431373
22 22 0.8431373
23 23 0.8529412
10 10 0.8627451
1 1 0.8823529
5 5 0.8921569
21 21 0.9019608
30 30 0.9215686
27 27 0.9411765
3 3 0.9509804
6 6 0.9509804
29 29 0.9607843
11 11 0.9705882
12 12 0.9705882
14 14 0.9803922
CTTvis::difficulty_plot(responses = df2, title = "Item Difficulty Plot: Fall 2024", easyFlag = .90, hardFlag = .50)
item difficulty
9 9 0.5441176
2 2 0.6029412
19 19 0.6323529
7 7 0.6764706
17 17 0.6764706
24 24 0.6911765
4 4 0.7058824
20 20 0.7205882
16 16 0.7500000
8 8 0.7794118
28 28 0.7794118
13 13 0.7941176
18 18 0.7941176
15 15 0.8235294
25 25 0.8382353
10 10 0.8529412
5 5 0.8676471
30 30 0.8676471
6 6 0.8823529
21 21 0.8823529
23 23 0.8823529
22 22 0.8970588
1 1 0.9117647
26 26 0.9117647
3 3 0.9411765
11 11 0.9411765
12 12 0.9411765
14 14 0.9411765
29 29 0.9411765
27 27 0.9558824
CTTvis::point_biserial_plot(responses = df1, title = "Item Discrimination Plot: Winter 2024", pBis_threshold = 0.20)
item point_biserial
7 7 -0.06355892
26 26 0.02904733
12 12 0.03273792
6 6 0.07458746
16 16 0.09358458
3 3 0.11401358
1 1 0.13196345
2 2 0.13934851
4 4 0.14284214
24 24 0.18586588
27 27 0.19471860
20 20 0.21113069
22 22 0.23163443
11 11 0.23436840
5 5 0.24464620
17 17 0.24993731
10 10 0.25172295
8 8 0.26785130
15 15 0.28557592
25 25 0.30646985
19 19 0.31051269
13 13 0.31559823
18 18 0.32186358
28 28 0.32543272
9 9 0.32935053
21 21 0.32960215
23 23 0.34233690
30 30 0.35391849
29 29 0.44917215
14 14 0.50771620
CTTvis::point_biserial_plot(responses = df2, title = "Item Discrimination Plot: Fall 2024", pBis_threshold = 0.20)
item point_biserial
4 4 0.1114014
10 10 0.1273124
26 26 0.1512393
9 9 0.1690738
7 7 0.1809274
18 18 0.2036338
1 1 0.2333230
24 24 0.2342623
11 11 0.2365007
16 16 0.2660486
27 27 0.3109581
19 19 0.3206996
20 20 0.3253837
28 28 0.3795062
5 5 0.3953602
6 6 0.3977969
29 29 0.4076268
17 17 0.4105264
13 13 0.4163146
2 2 0.4345626
25 25 0.4436627
21 21 0.4513741
23 23 0.4513741
8 8 0.4729187
3 3 0.4943550
15 15 0.5003408
12 12 0.5380173
14 14 0.5380173
30 30 0.5498302
22 22 0.5512179
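The next two tables report Cronbach's alpha if each item were dropped, following the same ordering as before (presumably Winter 2024 first, then Fall 2024). The CTTvis call that produced them is not shown; as a hedged alternative, the same alpha-if-dropped values can be obtained with the psych package (assuming it is installed):
# Alternative to the (omitted) CTTvis reliability call: alpha-if-item-dropped via psych.
# Assumes the psych package is installed; df1 and df2 hold only the 30 item columns here.
library("psych")
alpha_winter <- psych::alpha(df1)   # reliability analysis, Winter 2024
alpha_fall   <- psych::alpha(df2)   # reliability analysis, Fall 2024
alpha_winter$alpha.drop             # Cronbach's alpha if each item is dropped
alpha_fall$alpha.drop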
item alpha_if_dropped
9 9 0.6685584
28 28 0.6685940
23 23 0.6694850
13 13 0.6698523
19 19 0.6700038
25 25 0.6707590
18 18 0.6707750
29 29 0.6714002
30 30 0.6717746
21 21 0.6721334
15 15 0.6732702
14 14 0.6738117
8 8 0.6743101
17 17 0.6759520
10 10 0.6762716
5 5 0.6771712
22 22 0.6775833
20 20 0.6801310
11 11 0.6806577
27 27 0.6809355
24 24 0.6820674
1 1 0.6844041
4 4 0.6846892
3 3 0.6847147
6 6 0.6864200
12 12 0.6875058
2 2 0.6875671
16 16 0.6902958
26 26 0.6927449
7 7 0.7051945
item alpha_if_dropped
30 30 0.8137246
22 22 0.8145888
15 15 0.8146258
8 8 0.8152722
2 2 0.8166562
25 25 0.8168616
12 12 0.8169768
14 14 0.8169768
21 21 0.8172245
23 23 0.8172245
13 13 0.8175940
17 17 0.8177340
3 3 0.8179658
5 5 0.8187976
6 6 0.8188836
28 28 0.8190156
29 29 0.8199173
20 20 0.8213736
19 19 0.8219095
27 27 0.8224025
16 16 0.8236841
11 11 0.8237171
1 1 0.8237694
24 24 0.8254451
18 18 0.8257339
26 26 0.8259368
10 10 0.8275892
7 7 0.8278594
9 9 0.8291381
4 4 0.8304859
From the comparison above, we can see that the quiz's item difficulty, item discrimination, and test reliability differ slightly across the two semesters.
Test equating is a method used to ensure that scores from two or more different tests (or versions of a test) can be compared fairly.
Think of it like this: imagine two students take math tests from different schools. One test has harder questions, while the other is easier. If we simply compare their raw scores, it wouldn't be fair. One student might score lower just because their test was tougher.
To fix this, we use test equating to adjust the scores, so they reflect the same level of ability, no matter which test was taken. It’s like converting weights measured in kilograms and pounds into a common unit so they make sense together.
There are different ways to equate tests, but the goal is always the same: make the scores comparable and ensure that test difficulty doesn’t give an unfair advantage or disadvantage.
Test equating can also be applied to the same test given at two different time points, not just different versions of a test.
Even if the test questions are the same, the context in which it is administered may change. For example:
Students may prepare differently.
External factors (like teaching quality, curriculum changes, or test-taker anxiety) can affect performance.
When equating scores for the same test across time points, we’re ensuring that any differences in scores truly reflect differences in abilities, not changes in conditions or test difficulty over time.
This approach is common in longitudinal studies or standardized testing programs (e.g., SATs or PISA), where the goal is to track trends or progress. The equating adjusts for subtle shifts to keep comparisons meaningful and fair.
There are two broad approaches to test equating: linear and non-linear.
Both methods adjust scores so that tests can be fairly compared, but they handle differences in score distributions differently:
Linear Equating
What it does: Assumes the relationship between scores on the two test forms (or time points) is a straight line.
How it works: It matches the mean and standard deviation of scores from the two tests. This assumes that both test versions measure the same construct similarly but might have small differences in difficulty or variability.
When to use: Works well if the score distributions of the two tests are pretty similar (e.g., similar shape and spread).
Analogy: Imagine you’re resizing two photographs to match the same dimensions. You stretch or compress them equally without changing their overall shape.
Non-Linear Equating
What it does: Allows for a more flexible relationship between scores. It adjusts for differences in the shapes of the score distributions, not just their means and standard deviations.
How it works: It uses techniques like equipercentile equating, which aligns the percentile ranks of scores from the two tests. For instance, if a score of 70 was at the 80th percentile in the first test, it will be matched to the score at the 80th percentile in the second test.
When to use: Works well when the score distributions of the two tests are very different (e.g., one is skewed, while the other is normal).
Analogy: Think of reshaping two lumps of clay to make them the same size and shape. You might need more complex adjustments to match their contours.
Which approach should you use? It depends on the score distributions of your tests at the two time points:
Linear Equating is simpler and appropriate if the tests have similar distributions (e.g., symmetrical bell-shaped curves).
Non-Linear Equating is better when the tests have noticeable differences in distribution (e.g., one test has more high scores, while the other has a mix).
Steps to Decide
Plot the score distributions for the two time points (histogram or density plot).
If they look similar, try linear equating.
If they look very different, go for non-linear equating like equipercentile equating.
Tip: Non-linear methods are more flexible and generally preferred when in doubt.
To determine whether to use linear or non-linear equating for our two test sessions (df1 for Winter 2024 and df2 for Fall 2024), we can compare their score distributions visually and statistically in R as follows.
# Compute each student's total score across the 30 items
df1$score <- rowSums(df1[, grep("^Q", colnames(df1))], na.rm = TRUE)
df2$score <- rowSums(df2[, grep("^Q", colnames(df2))], na.rm = TRUE)
# Combine data for plotting
df1$session <- "Winter2024"
df2$session <- "Fall2024"
combined_df <- rbind(df1, df2)
library(ggplot2)
# Plot the distributions
ggplot(combined_df, aes(x = score, fill = session)) +
  geom_density(alpha = 0.5) +
  labs(title = "Score Distributions Between Two Semesters", x = "Score", y = "Density") +
  scale_fill_manual(values = c("Winter2024" = "blue", "Fall2024" = "orange")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
> summary(df1$score) # Winter 2024
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.00 22.00 25.00 24.13 26.25 30.00
> summary(df2$score) # Fall 2024
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 23.00 25.00 24.41 28.00 30.00
We can also perform an exact two-sample Kolmogorov-Smirnov (KS) test, which compares two samples to determine whether they come from the same underlying distribution.
> ks.test(df1$score, df2$score)
Exact two-sample Kolmogorov-Smirnov test
data: df1$score and df2$score
D = 0.12681, p-value = 0.2838
alternative hypothesis: two-sided
D = 0.12681: This is the KS test statistic. It represents the maximum difference between the cumulative distributions of the two samples.
p-value = 0.2838: This is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis. A higher p-value indicates that the null hypothesis cannot be rejected.
Since the p-value (0.2838) is greater than the common significance level (e.g., 0.05), we fail to reject the null hypothesis. This suggests that there is no significant difference between the distributions of df1$score and df2$score.
For good measure, I also want to examine mean differences between the two datasets. An independent-samples t-test would be the default choice, but it assumes normally distributed scores, so I first check that assumption.
When analyzing data, it's crucial to determine if your data follows a normal distribution. This helps in choosing the right statistical tests. Here, we’ll walk through the results of normality tests and a non-parametric test for comparing two groups.
Shapiro-Wilk Normality Test
The Shapiro-Wilk test checks if your data is normally distributed. Here are the results for two datasets, df1$score and df2$score:
For df1$score:
Test Statistic (W): 0.87426
p-value: 6.635e-08
For df2$score:
Test Statistic (W): 0.85155
p-value: 8.69e-07
Interpretation:
The p-values for both tests are extremely small (much less than 0.05), indicating that we reject the null hypothesis. This means that neither df1$score nor df2$score follows a normal distribution.
> shapiro.test(df1$score)
Shapiro-Wilk normality test
data: df1$score
W = 0.87426, p-value = 6.635e-08
> shapiro.test(df2$score)
Shapiro-Wilk normality test
data: df2$score
W = 0.85155, p-value = 8.69e-07
Mann-Whitney U Test (Wilcoxon Rank-Sum Test)
Since our data is not normally distributed, we use the Mann-Whitney U test, a non-parametric test, to compare the two groups.
> # Mann-Whitney U Test
> wilcox.test(df1$score, df2$score)
Wilcoxon rank sum test with continuity correction
data: df1$score and df2$score
W = 3313, p-value = 0.3926
alternative hypothesis: true location shift is not equal to 0
> library(ggstatsplot)
> # Create the plot
> ggbetweenstats(data = combined_df, x = session, y = score, type = "nonparametric")
The p-value (0.3926) is greater than the common significance level (e.g., 0.05), so we fail to reject the null hypothesis. This suggests that there is no significant difference in location (central tendency) between df1$score and df2$score.
We begin by setting up the data for test equating.
> #Set up the datasets for test equating
> form1 <- df1[,31:32]
> form2 <- df2[,31:32]
>
> # Combine the forms
> form12 <- rbind(form1, form2)
> head(form12)
score session
1 26 Winter2024
2 28 Winter2024
3 20 Winter2024
4 24 Winter2024
5 23 Winter2024
6 28 Winter2024
>
> # Add score frequencies to the data
> data <- as.data.frame(table(form12$score, form12$session))
> names(data) <- c("total", "session", "count")
> head(data)
total session count
1 6 Fall2024 1
2 8 Fall2024 1
3 14 Fall2024 0
4 16 Fall2024 1
5 17 Fall2024 1
6 18 Fall2024 1
>
> # Restructure the data as a frequency table
> data_1 <- as.freqtab(data[data$session == "Winter2024", c("total", "count")])
> data_2 <- as.freqtab(data[data$session == "Fall2024", c("total", "count")])
> head(data_1)
total count
1 6 0
2 8 2
3 14 1
4 16 0
5 17 2
6 18 1
>
> # Descriptive summary of the forms
> rbind(form_1 = summary(data_1), form_2 = summary(data_2))
mean sd skew kurt min max n
form_1 24.13462 3.834077 -1.627154 7.327239 8 30 104
form_2 24.40580 4.541698 -1.651006 7.101548 6 30 69
What is Mean Equating?
Mean equating is one of the simplest methods used to equate scores from two different test forms or sessions. The method adjusts the scores of one distribution so that its mean matches the mean of the target distribution. This assumes the two distributions are identical in shape and variability, differing only in their central tendencies (means).
In this case:
data_1 represents scores from Winter 2024.
data_2 represents scores from Fall 2024.
The equated scores (yx) are the adjusted scores from data_1 that are transformed to align with the mean of data_2.
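Before calling equate(), we can sanity-check what mean equating will do: every Winter 2024 score should simply be shifted by the difference between the two semester means. A quick check using the raw score vectors created earlier:
# Mean equating adds a constant: mean(Fall scores) - mean(Winter scores)
shift <- mean(form2$score) - mean(form1$score)
shift   # about 0.27, matching the intercept reported by equate() below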
> mean_12 <- equate(x = data_1, y = data_2, type = "mean")
> mean_12$concordance
scale yx
1 6 6.271182
2 8 8.271182
3 14 14.271182
4 16 16.271182
5 17 17.271182
6 18 18.271182
7 19 19.271182
8 20 20.271182
9 21 21.271182
10 22 22.271182
11 23 23.271182
12 24 24.271182
13 25 25.271182
14 26 26.271182
15 27 27.271182
16 28 28.271182
17 29 29.271182
18 30 30.271182
> mean_12
Mean Equating: data_1 to data_2
Design: equivalent groups
Summary Statistics:
mean sd skew kurt min max n
x 24.13 3.83 -1.63 7.33 8.00 30.00 104
y 24.41 4.54 -1.65 7.10 6.00 30.00 69
yx 24.41 3.83 -1.63 7.33 8.27 30.27 104
Coefficients:
intercept slope cx cy sx sy
0.2712 1.0000 18.0000 18.0000 24.0000 24.0000
> equated_scores_mean <- mean_12$concordance$yx # Extract equated scores
>
> # Plot the original score scale vs. equated score (yx)
> plot(mean_12$concordance$scale, mean_12$concordance$yx,
+ type = "o", col = "blue", pch = 16,
+ xlab = "Original Scores (Winter 2024)",
+ ylab = "Equated Scores (Fall 2024)",
+ main = "Equated Scores vs Original Scores: Mean Equating")
>
> # Optionally, add a line of equality (where original score = equated score)
> abline(a = 0, b = 1, col = "gray", lty = 2)
> # Extract the relevant columns from the concordance table
> comparison_table <- mean_12$concordance[, c("scale", "yx")]
>
> # Rename the columns for clarity
> colnames(comparison_table) <- c("Original Score (Winter 2024)", "Equated Score (Fall 2024)")
>
> # Print the table
> print(comparison_table)
Original Score (Winter 2024) Equated Score (Fall 2024)
1 6 6.271182
2 8 8.271182
3 14 14.271182
4 16 16.271182
5 17 17.271182
6 18 18.271182
7 19 19.271182
8 20 20.271182
9 21 21.271182
10 22 22.271182
11 23 23.271182
12 24 24.271182
13 25 25.271182
14 26 26.271182
15 27 27.271182
16 28 28.271182
17 29 29.271182
18 30 30.271182
Interpreting the Output
The concordance table provides a mapping of scores from data_1 to their equated values (yx) in the context of data_2. Here's what each column represents:
scale: The original scores from data_1 (Winter 2024).
yx: The equated scores—how each score from data_1 would map onto data_2 (Fall 2024) based on mean equating.
Summary Statistics
The summary table compares the key metrics of the original and equated scores:
Original Scores (x): Scores from Winter 2024.
Target Scores (y): Scores from Fall 2024.
Equated Scores (yx): Transformed Winter 2024 scores.
Coefficients
The coefficients used for the linear transformation:
Intercept: The constant adjustment applied to the scores (0.27).
Slope: The scaling factor for adjusting score variability (1.00).
These values ensure that the means of the two distributions align.
Key Insights
The table shows that each score from data_1 is increased by a constant value (here, approximately 0.27) to align with the higher mean of data_2. For example:
A score of 6 in data_1 is equated to 6.27 in data_2.
A score of 30 in data_1 is equated to 30.27 in data_2.
This constant adjustment reflects the difference in means between the two distributions, without altering the shape or variability of the scores. The transformed scores are uniformly shifted upward.
What is Mean Equating For?
Mean equating is useful when:
The tests are similar in content and difficulty.
Differences between the two distributions are primarily in their central tendencies (e.g., test-takers in one session performed slightly better overall).
However, mean equating does not account for differences in variability or score distribution shapes. If the variability (spread) of scores differs significantly, other methods like linear equating may be more appropriate.
Imagine two classes take the same math test at different times. On average, students in one class scored 0.27 points higher than the other, even though the test content was identical. Mean equating shifts the scores of the lower-scoring class upward by 0.27 points across the board, so the two classes can be compared fairly. Each student's performance remains relative to their peers, but the average difference between the classes is removed.
What is Linear Equating?
Linear equating is a method that adjusts scores by aligning both the means and the variability (standard deviation) of two score distributions. Unlike mean equating, which only adjusts the central tendency, linear equating also accounts for differences in the spread of scores between the two groups.
In this case:
data_1 represents scores from Winter 2024.
data_2 represents scores from Fall 2024.
The equated scores (yx) represent the scores from data_1 transformed to match the mean and variability of data_2.
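As with mean equating, we can check the linear transformation by hand: the slope is the ratio of the two standard deviations and the intercept aligns the means, so yx = intercept + slope * x. A rough check with the raw score vectors (the result may differ slightly in the last decimals from equate(), which computes its moments from the frequency tables):
# Linear equating: yx = intercept + slope * x, where
#   slope     = sd(Fall scores) / sd(Winter scores)
#   intercept = mean(Fall scores) - slope * mean(Winter scores)
slope     <- sd(form2$score) / sd(form1$score)
intercept <- mean(form2$score) - slope * mean(form1$score)
c(intercept = intercept, slope = slope)   # roughly -4.18 and 1.18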
> linear_12 <- equate(x = data_1, y = data_2, type = "linear")
> linear_12$concordance
scale yx se
1 6 2.924239 21.5782593
2 8 5.293361 17.5636238
3 14 12.400727 8.0331088
4 16 14.769849 5.6940675
5 17 15.954410 4.6816338
6 18 17.138971 3.7739248
7 19 18.323532 2.9709404
8 20 19.508093 2.2726806
9 21 20.692654 1.6791455
10 22 21.877215 1.1903349
11 23 23.061776 0.8062491
12 24 24.246337 0.5268878
13 25 25.430898 0.3522512
14 26 26.615459 0.2823393
15 27 27.800020 0.3171519
16 28 28.984581 0.4566892
17 29 30.169142 0.7009512
18 30 31.353703 1.0499377
>
> linear_12
Linear Equating: data_1 to data_2
Design: equivalent groups
Summary Statistics:
mean sd skew kurt min max n
x 24.13 3.83 -1.63 7.33 8.00 30.00 104
y 24.41 4.54 -1.65 7.10 6.00 30.00 69
yx 24.41 4.54 -1.63 7.33 5.29 31.35 104
Coefficients:
intercept slope cx cy sx sy
-4.1831 1.1846 18.0000 18.0000 24.0000 24.0000
>
> # Plot the original score scale vs. equated score (yx)
> plot(linear_12$concordance$scale, linear_12$concordance$yx,
+ type = "o", col = "blue", pch = 16,
+ xlab = "Original Scores (Winter 2024)",
+ ylab = "Equated Scores (Fall 2024)",
+ main = "Equated Scores vs Original Scores: Linear Equating")
>
> # Optionally, add a line of equality (where original score = equated score)
> abline(a = 0, b = 1, col = "gray", lty = 2)
> # Extract the relevant columns from the concordance table
> comparison_table <- linear_12$concordance[, c("scale", "yx")]
>
> # Rename the columns for clarity
> colnames(comparison_table) <- c("Original Score (Winter 2024)", "Equated Score (Fall 2024)")
>
> # Print the table
> print(comparison_table)
Original Score (Winter 2024) Equated Score (Fall 2024)
1 6 2.924239
2 8 5.293361
3 14 12.400727
4 16 14.769849
5 17 15.954410
6 18 17.138971
7 19 18.323532
8 20 19.508093
9 21 20.692654
10 22 21.877215
11 23 23.061776
12 24 24.246337
13 25 25.430898
14 26 26.615459
15 27 27.800020
16 28 28.984581
17 29 30.169142
18 30 31.353703
Interpreting the Output
The concordance table (linear_12$concordance) provides three key columns:
scale: The original scores from data_1 (Winter 2024).
yx: The equated scores—how each score from data_1 is adjusted to match the distribution of data_2 (Fall 2024).
se: The standard error of the equated scores, which reflects the uncertainty in the equating process.
Example rows:
A score of 6 in data_1 equates to 2.92 in data_2, with a standard error of 21.58.
A score of 24 in data_1 equates to 24.25 in data_2, with a standard error of 0.53.
The variability in the standard error (high for extreme scores, low for scores near the center) indicates that scores closer to the mean are more reliably equated.
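This pattern is easy to see by plotting the standard error column of the concordance table against the score scale:
# Equating standard error across the score scale: large at the extremes,
# small near the center of the score distribution
plot(linear_12$concordance$scale, linear_12$concordance$se,
     type = "o", pch = 16,
     xlab = "Original Score (Winter 2024)",
     ylab = "Standard Error of Equating",
     main = "Linear Equating: Standard Error by Score")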
Summary Statistics
The summary table compares the key metrics of the original and equated scores:
Original Scores (x): Scores from Winter 2024.
Target Scores (y): Scores from Fall 2024.
Equated Scores (yx): Transformed Winter 2024 scores.
Coefficients
The coefficients used for the linear transformation:
Intercept: The constant adjustment applied to the scores (-4.18).
Slope: The scaling factor for adjusting score variability (1.18).
These values ensure that both the means and standard deviations of the two distributions align.
Key Insights
Linear equating ensures that:
Mean alignment: The average of data_1 is adjusted to match the average of data_2.
Variability alignment: The spread (standard deviation) of data_1 is scaled to match data_2.
For example:
A low score like 6 is mapped to 2.92, reflecting that low scores in Winter 2024 were "easier to achieve" compared to Fall 2024.
A mid-range score like 24 remains close at 24.25, indicating consistency near the mean.
A high score like 30 maps to 31.35, showing that high scores in Winter 2024 are adjusted upward to align with Fall 2024’s higher variability.
What is Linear Equating For?
Linear equating is ideal when:
The tests are slightly different, and their difficulty levels vary consistently across the score range.
The score distributions have similar shapes but differ in spread.
This method is often used in standardized testing to compare scores across different forms of a test.
Imagine two groups of students take a math test at different times, and one group's scores are more spread out than the other's. Linear equating adjusts not just the average score but also the range (variability) of scores so that the two groups match. This gives a fair comparison, ensuring both high and low scores are treated consistently.
Let's try performing non-linear equating for practice. I will be using equipercentile equating.
What is Equipercentile Equating?
Equipercentile equating is a non-linear method that aligns scores by matching their cumulative percentile ranks across two distributions. Unlike linear equating, it does not assume that the relationship between the two score distributions is straight or uniform. Instead, it accounts for varying relationships between the scores at different points of the scale.
In this case:
data_1 represents scores from Winter 2024.
data_2 represents scores from Fall 2024.
The equated scores (yx) represent data_1 scores transformed so their percentile ranks align with the percentile ranks of data_2.
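Conceptually, equipercentile equating finds a score's percentile rank in one distribution and returns the score at the same percentile rank in the other. A rough base-R sketch of the idea (this ignores the continuization and mid-percentile conventions that equate() applies, so the numbers will not match the concordance table exactly):
# Rough illustration only, not the exact algorithm used by equate():
# 1) percentile rank of a Winter 2024 score, 2) the Fall 2024 score at that rank
p_rank <- ecdf(form1$score)(24)                   # percentile rank of a 24 in Winter 2024
quantile(form2$score, probs = p_rank, type = 1)   # approximate Fall 2024 equivalent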
> equi_12 <- equate(x = data_1, y = data_2, type = "equipercentile")
> equi_12$concordance
scale yx se
1 6 5.000000 0.0000000
2 8 6.326923 0.9326041
3 14 8.975962 1.5709721
4 16 9.971154 1.7905021
5 17 17.798077 2.0046068
6 18 17.561298 2.3497311
7 19 18.144231 2.6412717
8 20 19.649038 1.6280642
9 21 21.657051 0.6589761
10 22 22.375801 0.7456687
11 23 22.884615 0.4249101
12 24 23.457605 0.4730608
13 25 25.040865 1.3155507
14 26 26.963141 0.8188069
15 27 28.455929 0.7182188
16 28 28.975561 0.2520001
17 29 29.417869 0.1634453
18 30 30.334135 0.2609886
>
> equi_12
Equipercentile Equating: data_1 to data_2
Design: equivalent groups
Smoothing Method: none
Summary Statistics:
mean sd skew kurt min max n
x 24.13 3.83 -1.63 7.33 8.00 30.00 104
y 24.41 4.54 -1.65 7.10 6.00 30.00 69
yx 24.45 4.43 -1.68 7.45 6.33 30.33 104
> # Plot the original score scale vs. equated score (yx)
> plot(equi_12$concordance$scale, equi_12$concordance$yx,
+ type = "o", col = "blue", pch = 16,
+ xlab = "Original Scores (Winter 2024)",
+ ylab = "Equated Scores (Fall 2024)",
+ main = "Equated Scores vs Original Scores: Equipercentile Equating")
>
> # Optionally, add a line of equality (where original score = equated score)
> abline(a = 0, b = 1, col = "gray", lty = 2)
> # Extract the relevant columns from the concordance table
> comparison_table <- equi_12$concordance[, c("scale", "yx")]
>
> # Rename the columns for clarity
> colnames(comparison_table) <- c("Original Score (Winter 2024)", "Equated Score (Fall 2024)")
>
> # Print the table
> print(comparison_table)
Original Score (Winter 2024) Equated Score (Fall 2024)
1 6 5.000000
2 8 6.326923
3 14 8.975962
4 16 9.971154
5 17 17.798077
6 18 17.561298
7 19 18.144231
8 20 19.649038
9 21 21.657051
10 22 22.375801
11 23 22.884615
12 24 23.457605
13 25 25.040865
14 26 26.963141
15 27 28.455929
16 28 28.975561
17 29 29.417869
18 30 30.334135
Interpreting the Output
The concordance table (equi_12$concordance) shows:
scale: Original scores from Winter 2024.
yx: Equipercentile-equated scores—what each score in data_1 would correspond to in data_2 based on matching percentiles.
se: Standard error of equating, indicating the reliability of the equated score.
Example rows:
A score of 6 in Winter 2024 equates to 5.00 in Fall 2024, with a standard error of 0 (perfect certainty due to matching at the extremes).
A score of 24 equates to 23.46, with a standard error of 0.47, reflecting low uncertainty.
A score of 29 equates to 29.42, with very low standard error (0.16).
The non-linear nature is evident in how the adjustments vary irregularly across the scale. For example, some scores are adjusted more significantly than others (e.g., 14 → 8.98), while others remain similar (e.g., 28 → 28.98).
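Some of this irregularity comes from equating the raw, unsmoothed sample frequencies; the summary above reports "Smoothing Method: none". The equate package can presmooth the score distributions before equating. A hedged sketch, assuming the smoothmethod argument accepts "loglinear" (check ?equate for the exact options):
# Hedged sketch: equipercentile equating with loglinear presmoothing.
# The smoothmethod argument and its option names are assumptions to verify in ?equate.
equi_12_smooth <- equate(x = data_1, y = data_2, type = "equipercentile",
                         smoothmethod = "loglinear")
equi_12_smooth$concordance   # the equated scores should now vary more smoothly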
Summary Statistics
The summary statistics table compares the original (x), target (y), and equated (yx) score distributions:
Mean and SD Alignment: The mean (24.45) and standard deviation (4.43) of yx closely align with those of y (Fall 2024 scores: mean = 24.41, SD = 4.54), showing successful equating.
Skewness and Kurtosis: These metrics indicate the shape of the distributions. The equated scores preserve some characteristics of x while aligning with y.
Key Insights
Equipercentile equating:
Matches scores based on their percentile ranks rather than their raw values or spread.
Provides more nuanced adjustments, especially useful when score distributions differ in shape.
Is ideal for tests where differences in difficulty or format affect various score ranges unevenly.
For example:
A low score like 6 maps to 5.00, showing that very low Winter 2024 scores are considered slightly easier compared to Fall 2024.
A mid-range score like 24 adjusts minimally to 23.46, indicating similarity around the central range.
A high score like 30 maps to 30.33, reflecting that high scorers face relatively consistent difficulty across the groups.
Why Use Non-Linear Equating?
Non-linear equating is beneficial when:
The shapes of the score distributions differ (e.g., one test has more extreme high or low scores than the other).
The relationship between scores in the two groups is not consistent across the entire range (e.g., low scores might differ more than high scores).
Equipercentile equating ensures that if a person scored in the top 10% in Winter 2024, their equated score would place them in the top 10% of Fall 2024, even if the tests differ in difficulty. Unlike linear equating, which assumes a uniform adjustment, equipercentile equating adjusts scores more flexibly to reflect differences across the entire score range.
For example:
A score of 6 in Winter 2024 is more comparable to 5 in Fall 2024 because low scores were easier to achieve in Winter.
A score of 30 maps closely to 30.33, as high scores were similarly challenging in both tests.
Finally, let's compare the results of the three equating methods side by side.
> # Compare equated scores from the three methods on the observed score scale
> round(cbind(scale = mean_12$concordance$scale,
+             mean = mean_12$concordance$yx,
+             linear = linear_12$concordance$yx,
+             equipercentile = equi_12$concordance$yx), 2)
      scale  mean linear equipercentile
 [1,]     6  6.27   2.92           5.00
 [2,]     8  8.27   5.29           6.33
 [3,]    14 14.27  12.40           8.98
 [4,]    16 16.27  14.77           9.97
 [5,]    17 17.27  15.95          17.80
 [6,]    18 18.27  17.14          17.56
 [7,]    19 19.27  18.32          18.14
 [8,]    20 20.27  19.51          19.65
 [9,]    21 21.27  20.69          21.66
[10,]    22 22.27  21.88          22.38
[11,]    23 23.27  23.06          22.88
[12,]    24 24.27  24.25          23.46
[13,]    25 25.27  25.43          25.04
[14,]    26 26.27  26.62          26.96
[15,]    27 27.27  27.80          28.46
[16,]    28 28.27  28.98          28.98
[17,]    29 29.27  30.17          29.42
[18,]    30 30.27  31.35          30.33
>
> # Plot the results
> plot(mean_12, linear_12, equi_12, lty=c(1,2,3),
+ col=c("blue", "red", "forestgreen"), addident = FALSE)
1. What Happens When the Distributions Match?
When the distributions of the two tests are identical (or very similar), non-linear equating essentially behaves similarly to linear equating because percentile ranks are already aligned. The equated scores will closely match the original scores, and any adjustments will be minimal or negligible. This is because there is no substantial shape difference to address.
For example:
A score in the 50th percentile in Test A already aligns with a score in the 50th percentile in Test B.
Non-linear equating won’t distort the alignment but may introduce minor variations due to technical adjustments.
2. Is There Any Negative Effect?
There is no direct harm in applying non-linear equating, but there are a few points to consider:
Added Complexity
Non-linear equating introduces more steps and complexity, making it harder to explain to stakeholders or audiences who may expect simpler linear methods when distributions align.
It might seem redundant and could raise questions about why a more complex approach was used when simpler methods suffice.
Risk of Overfitting
If there are small random fluctuations in the data, non-linear equating might over-adjust in places where no adjustment is necessary. This can make the results less interpretable, especially if the adjustments do not align with practical differences in test difficulty or score meaning.
Increased Standard Error
Non-linear methods generally have slightly higher standard errors compared to linear methods, especially at the extremes of the score range. If the distributions match closely, this might introduce unnecessary uncertainty.
3. What Does It Mean to Use Non-Linear Equating in This Case?
If you use non-linear equating when distributions match:
You are essentially double-checking alignment at all score levels and ensuring that any nuanced differences (even if small) are accounted for.
The results should align closely with linear equating, but small deviations might reflect random variations rather than meaningful differences.
This might be interpreted as over-precision, especially if:
The audience expects simple explanations, and the differences introduced by non-linear equating are negligible or hard to justify.
There is no practical reason to believe that scores at different points in the distribution would require differential adjustment.
4. When Might Non-Linear Equating Still Be Useful?
Non-linear equating could still have a role even if distributions match, for example:
Consistency: If non-linear methods are the standard in your context, using them ensures comparability with past analyses.
Validation: You might want to verify that linear assumptions hold true by comparing linear and non-linear results. If they match, it strengthens the case for using linear equating.
5. Recommendation
If distributions match closely, linear equating is simpler, more interpretable, and equally effective. Non-linear equating in this case might add unnecessary complexity without meaningful benefit.
If distributions differ even slightly, non-linear equating ensures precision and fairness across the score range.
Ultimately, using non-linear equating in this scenario won’t harm your results but might be seen as unnecessarily complicated. It’s a choice that should align with your audience’s expectations and the context of your analysis.
By using both linear and non-linear equating methods, I saw how statistical techniques could align scores from two semesters. This helps me understand whether the quizzes were functioning consistently. Linear equating showed how similar the quizzes were when adjusted for differences in means and variability, while non-linear equating highlighted where slight differences might arise due to more complex factors.
For anyone exploring educational data, equating is a powerful tool. Whether you’re an educator, a researcher, or just a curious learner, it offers a way to dig deeper into your data and make meaningful comparisons. As for me, I’m leaving this process with a clearer picture of how my quizzes perform. Thanks for following along! I hope it inspires you to explore your own data stories!