“We’ve entered a new era of digital innovation — Explore how ABHS is transforming assessments with AI and advanced technologies.”
“Establish the Balance with Justice and Do Not Undermine the Scale”
Between Aʿrāf and Responsibility: Medical Education Assessment Between Mercy and Justice
In the Holy Qur’an, al-Aʿrāf is described as a symbolic place between Paradise and Hell where people whose good and bad deeds are equal await a divine verdict—one that is not influenced by whim or incomplete knowledge. It is a suspended space between two certainties, where destinies pause at the edge of a precise and sensitive scale. This powerful image evokes the concept of the “gray zone” in evaluation, especially in medical education, where a student stands at the pass-fail threshold, between acceptable practice and dangerous incompetence.
In this academic context, we do not possess God’s perfect knowledge, His complete justice, or His boundless mercy. Instead, we have human-made tools of measurement, educators making professional judgments, and institutions demanding that we balance patient safety with a fair opportunity for each student. Educational committees often face a student whose performance is neither a clear failure nor a confident success. This gray zone is not merely a statistical phenomenon but a reflection of a deeper human reality: performance cannot always be judged by a number alone—it sometimes calls for reflection on behavior, clinical skill, ethics, and the practical standards required in real-world settings.
Mercy in this context does not mean pardoning ignorance or excusing inadequacy. It means establishing a rigorous system for assessing competence—one that distinguishes between a learner who needs support and another who is not yet fit for clinical responsibility. Professional mercy is expressed through structured remediation and constructive feedback, not by lowering standards due to discomfort or favoritism. At this moment, we are not merely judging a student's academic standing; we are shaping the future of a physician who will hold the trust—and the lives—of others in their hands. There is no room for hesitation or kindness that compromises professional safety.
Divine mercy is unconditional, rooted in absolute knowledge. Educational mercy, however, is accountable to society, conditioned by safety and competence, and positioned at the crossroads between the educator’s compassion and their fear of allowing someone into the profession who cannot bear its weight.
This is why the “scale” of exams should not be a mechanical sorting tool but a moral framework governed by precise justice and grounded in clear scientific and professional criteria. We must preserve the integrity of the cut-off point without shutting the door to growth, and offer students fair opportunities for development—without sacrificing the safety of others.
In the end, medical education is not merely a path to academic success. It is a covenant of responsibility. Every evaluation decision reflects our understanding of mercy, our commitment to justice, and our belief that the balance of the profession must never be compromised—just as God has commanded: “Do not undermine the scale.”
Post-Hoc Cut Score: A Dual-Criteria Approach for Fairness and Psychometric Integrity
Abstract
In high-stakes assessment, the establishment of a valid and fair passing standard is critical to the credibility of the certification process. While the Angoff method remains the gold standard for standard-setting, it is not always feasible to implement due to logistical constraints. This paper proposes a dual-criteria alternative that combines a fixed classical score threshold with a minimum ability score (theta) from Item Response Theory (IRT). Using a practical example of 60% classical score and θ ≥ 0.9, the article demonstrates how this method provides defensible outcomes, mitigates variability in test difficulty, and supports longitudinal consistency. The dual-criteria model also allows for adaptability to future improvements in cohort performance. This approach is recommended for institutions seeking both fairness and psychometric rigor when Angoff is not feasible.
Keywords: standard setting, IRT, fixed cut score, assessment, medical education
Post-Hoc Cut Score: A Dual-Criteria Approach for Fairness and Psychometric Integrity
S. Mashhadani
Introduction
In the world of high-stakes examinations, establishing the appropriate pass mark—known as the cut score—is a foundational step in ensuring fairness, validity, and public trust. The Angoff method is a well-established approach that uses expert judgment to define the expected performance of a minimally competent candidate. However, in many settings, especially where there are logistical challenges, tight timelines, or lack of standard-setting expertise, this method is not applied before the examination is administered.
In such cases, a fixed numerical cut score, such as 60 out of 100 (i.e., 60%), is often used instead. While administratively simple, this fixed benchmark introduces a critical vulnerability: it does not account for changes in test difficulty from one administration to another. This article explores a rigorous, psychometrically sound alternative: combining a fixed score threshold with a minimum IRT-based ability score to protect standardization and ensure fair decisions.
The Risk of Using a Fixed Cut Score Alone
Relying only on a fixed cut score, such as 60/100, assumes that all test forms are equally difficult and that candidate populations are homogeneous. In reality, this is rarely the case. Some exam versions may contain easier questions, either by design or by oversight. In such cases, candidates may score 60 or more without demonstrating true competence, simply by accumulating enough easy marks.
This creates a serious risk: unqualified candidates might pass when the exam is easy, while competent candidates might fail when it is hard. As a result, public trust in the exam and the integrity of certification are undermined. Moreover, comparison of outcomes across years becomes unreliable, making it difficult to track improvement or decline in educational quality.
The Role of Ability Scores (IRT Theta)
To address this issue, we turn to Item Response Theory (IRT), which provides a deeper and more standardized way of assessing candidates. IRT estimates a candidate's 'theta' (ability) from the full pattern of responses, weighted by the difficulty and discrimination of each item. This is not simply a tally of right answers; it is a statistically calibrated estimate that reflects underlying competence.
For example, if a candidate answers mostly easy questions correctly and few difficult ones, their classical score might be high (say, 65), but their IRT theta score could still be low (e.g., 0.3), indicating weak underlying ability. In contrast, another candidate scoring 61 who got many hard questions right might have a theta of 0.9, indicating strong ability. This reveals the limitations of relying on raw scores alone.
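To make this concrete, the short Python sketch below estimates theta by maximum likelihood under a two-parameter logistic (2PL) likelihood (with all discriminations fixed at 1.0 for simplicity) for two hypothetical candidates who sat forms of different difficulty calibrated to a common scale. The item parameters, response patterns, and use of scipy are illustrative assumptions, not the operational scoring procedure; the point is only that similar raw scores can map to very different ability estimates once item difficulty is taken into account.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_theta(responses, a, b):
    """Maximum-likelihood theta under a 2PL model for a 0/1 response vector."""
    def neg_log_lik(theta):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL probability of a correct response
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical item parameters on a common scale:
# an easier form (negative difficulties) and a harder form (positive difficulties).
a_easy = np.ones(10)
b_easy = np.array([-2.0, -1.8, -1.5, -1.2, -1.0, -0.8, -0.5, -0.3, 0.0, 0.3])
a_hard = np.ones(10)
b_hard = np.array([0.0, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2])

# Candidate A answers 7/10 on the easy form; Candidate B answers 6/10 on the hard form.
resp_a = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
resp_b = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

theta_a = estimate_theta(resp_a, a_easy, b_easy)
theta_b = estimate_theta(resp_b, a_hard, b_hard)
print(f"Candidate A: 7/10 on the easy form, theta = {theta_a:+.2f}")  # roughly +0.1
print(f"Candidate B: 6/10 on the hard form, theta = {theta_b:+.2f}")  # roughly +1.6
```

In this toy setup, the candidate with the lower raw score on the harder form receives the higher ability estimate, which is exactly the distinction a fixed percentage cut score cannot capture.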
A Dual-Criteria Standard: Combining Fixed Scores with Ability Thresholds
To ensure defensibility, fairness, and cross-year comparability, we propose a dual-criteria model for passing:
• Criterion 1: Classical Score of at least 60 out of 100 (fixed)
• Criterion 2: IRT Ability Score (θ) of at least 0.9
Under this rule, a candidate must not only reach the minimum number of correct answers but also demonstrate sufficient ability once item difficulty is taken into account. It ensures that those who pass are not simply those who accumulate marks, but those who meet the intended standard of competence.
For instance, a candidate who scores 61/100 but has a theta score of only 0.25 would not pass. Meanwhile, a candidate with 64/100 and a theta of 0.95 would pass. The model therefore balances administrative simplicity with psychometric rigor.
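A minimal sketch of how the conjunctive rule could be applied to a set of results is shown below. The thresholds are the illustrative values used in this article (60% and θ ≥ 0.9), and the candidate records are invented for demonstration.

```python
from dataclasses import dataclass

# Illustrative thresholds taken from the worked example in this article.
CLASSICAL_CUT = 60.0   # minimum percent correct
THETA_CUT = 0.9        # minimum IRT ability estimate

@dataclass
class Candidate:
    candidate_id: str
    classical_score: float  # percent correct, 0-100
    theta: float            # IRT ability estimate

def passes(c: Candidate) -> bool:
    """A candidate passes only if BOTH criteria are met."""
    return c.classical_score >= CLASSICAL_CUT and c.theta >= THETA_CUT

cohort = [
    Candidate("A", 61.0, 0.25),  # meets the classical cut, weak ability  -> fail
    Candidate("B", 64.0, 0.95),  # meets both criteria                    -> pass
    Candidate("C", 58.0, 1.10),  # strong ability, below 60%              -> fail
]

for c in cohort:
    print(f"{c.candidate_id}: score={c.classical_score:.0f}%, "
          f"theta={c.theta:+.2f} -> {'PASS' if passes(c) else 'FAIL'}")
```

Because the rule is conjunctive, clearing either threshold alone, for example a high theta with a classical score below 60%, is not sufficient to pass.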
Why This Model Works
The dual-criteria approach serves multiple functions:
- It acts as a safety net against the influence of test form difficulty.
- It ensures that the standard of competence remains stable across different exam versions.
- It supports comparability across cohorts and years.
- It allows institutions to track true progress over time, instead of relying on fluctuating classical scores.
In effect, the theta score serves as a psychometric 'safety valve': while the classical score captures observable performance, the theta score adjusts for exam variability, offering a more reliable measure of ability.
Flexibility for Future Improvements
An added advantage of using IRT ability scores is that the system becomes responsive to genuine educational progress. If a future cohort performs significantly better—e.g., more candidates scoring θ > 1.0—it may be appropriate to raise the theta threshold. This makes the standard dynamic, evidence-based, and sensitive to institutional improvement.
In this way, the dual-criteria model avoids the trap of a static benchmark. It evolves with the educational context, while maintaining defensible consistency year after year.
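As a rough illustration of how a programme might monitor whether the theta threshold still reflects cohort ability, the snippet below summarises a cohort's theta distribution from year to year. The cohort values and the θ > 1.0 reference point are hypothetical, and any change to the standard would remain a judgmental, committee-level decision informed by such evidence.

```python
import numpy as np

def cohort_theta_summary(thetas, current_cut=0.9, reference=1.0):
    """Summarise a cohort's ability distribution for periodic standard review."""
    thetas = np.asarray(thetas, dtype=float)
    return {
        "n_candidates": int(thetas.size),
        "median_theta": round(float(np.median(thetas)), 2),
        "share_at_or_above_cut": round(float(np.mean(thetas >= current_cut)), 2),
        "share_above_reference": round(float(np.mean(thetas > reference)), 2),
    }

# Hypothetical cohorts from two successive years.
print(cohort_theta_summary([0.4, 0.7, 0.9, 1.1, 0.8, 1.0, 0.6, 0.95]))
print(cohort_theta_summary([0.9, 1.2, 1.4, 1.1, 1.3, 1.0, 1.25, 0.95]))
```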
Conclusion
In the absence of a pre-exam Angoff procedure, relying solely on a fixed cut score is insufficient and risky. A better solution is to combine a fixed performance threshold (e.g., 60%) with a minimum IRT ability score (e.g., θ ≥ 0.9). This dual-criteria approach upholds fairness, supports standardization, and strengthens the defensibility of certification decisions. It protects the exam process from being distorted by variation in difficulty and allows for year-to-year comparison and long-term benchmarking.
References
1. Cizek GJ, Bunch MB. Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. SAGE Publications; 2007.
2. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of Item Response Theory. SAGE Publications; 1991.
3. Downing SM, Yudkowsky R. Assessment in Health Professions Education. 2nd ed. Routledge; 2019.