AIED 2025 workshop | Palermo (Italy), Hybrid | July 26 (Full day)
The evaluation of learning and assessment content has always been important in the educational domain, regardless of the specific setting: K-12, MOOCs, language certification exams, and learning apps (to name some settings frequently studied in the literature) all require that learning and assessment content be evaluated to ensure high quality and good learning outcomes. In online settings and, more generally, in settings that involve many students or require large libraries of content (e.g., of questions), it is infeasible to rely exclusively on subject matter experts to perform such evaluations. There is therefore a need for accurate and reliable techniques for automatically evaluating learning and assessment content, either with human-in-the-loop approaches or in fully automated ones, depending on the needs and constraints of each use case.
Assessment content, such as questions and exams, is commonly evaluated both with traditional approaches based on students’ responses – e.g., Item Response Theory [12] – and with more modern approaches based on machine learning and Natural Language Processing [1, 5, 3]. More recently, some research has also attempted to perform this evaluation by simulating students’ responses with AI models and performing virtual pretesting [22, 4, 15, 16]. Another challenge in automated evaluation concerns distractors (i.e., the plausible incorrect options in multiple-choice questions): they are commonly evaluated with manual labelling [7, 9, 10], since the automatic evaluation approaches proposed so far have limitations [18, 21], but manual evaluation cannot scale to production settings such as online learning platforms with large libraries of questions.
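As a purely illustrative example of the item statistics mentioned above, the minimal Python sketch below shows the two-parameter logistic (2PL) IRT model and how virtual pretesting can recover an item's empirical difficulty from simulated responses; the ability distribution and item parameters are hypothetical and not tied to any particular system discussed at the workshop.

    import numpy as np

    def irt_2pl(theta, a, b):
        # Two-parameter logistic IRT model: probability that a test taker
        # with ability theta answers correctly an item with
        # discrimination a and difficulty b.
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    # Virtual pretesting sketch: simulate responses from a population of
    # (real or AI-simulated) test takers, then compute each item's
    # empirical difficulty as the proportion of correct responses.
    rng = np.random.default_rng(0)
    abilities = rng.normal(0.0, 1.0, size=1000)    # hypothetical ability distribution
    items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]  # hypothetical (a, b) pairs

    for a, b in items:
        p_correct = irt_2pl(abilities, a, b)
        responses = rng.random(1000) < p_correct
        print(f"a={a:.1f}, b={b:+.1f} -> proportion correct: {responses.mean():.2f}")

In practice, much of the research of interest to the workshop goes in the opposite direction: estimating such parameters (difficulty, discrimination, etc.) automatically, e.g., from the question text or from responses simulated by AI models, rather than from large-scale pretesting with real students.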
On the other hand, the evaluation of learning content – such as single lectures, whole courses, and curricula – still relies heavily on experts from the educational domain, but it could potentially be automated, at least partially and with a human-in-the-loop approach (e.g., flagging prerequisites that are not satisfied by the previous lectures in a course). Although not strictly “content”, Virtual Teaching Assistants (VTAs) also have to be evaluated, and this is particularly relevant with modern LLM-based VTAs [20, 2], since biases and behaviours from other tasks and domains could be unintentionally transferred to the educational domain. Indeed, older models were often rule-based, and thus more limited and less powerful but more controllable (and not subject to hallucinations) [11], while more modern assistants are more powerful and flexible but need to be carefully evaluated. LLMs as a whole need to be evaluated from a pedagogical perspective – this is usually referred to as the pedagogical alignment of LLMs [19, 17] – since they can be and are being used for a variety of educational tasks in addition to serving as virtual tutors, as shown in previous research (from feedback generation and automated assessment to question and content generation [14, 8, 13]). As an example, generated content must align with the learning objectives of the specific course (or exam) where it is used, and must use language of a complexity level suitable for the target students. Importantly, this evaluation must consider not only the educational requirements of the generated content, which are the primary focus of pedagogical alignment, but also the biases that might emerge from the generation models, to ensure EDIB (Equity, Diversity, Inclusion, & Belonging) appropriateness, as well as the factual accuracy of the generated text.
Building on the success of the First Workshop on the Automatic Evaluation of Learning and Assessment Content [6], which was held at AIED 2024 and attracted more than 80 participants, this workshop will focus on approaches for automatically evaluating learning and assessment content, offering an opportunity to discuss common challenges, share best practices, and explore promising new research directions.
Topics of interest include but are not limited to:
Question evaluation (e.g., in terms of the pedagogical criteria listed above: alignment to the learning objectives, factual accuracy, language level, cognitive validity, etc.).
Estimation of question statistics (e.g., difficulty, discrimination, response time, etc.).
Evaluation of distractors in Multiple Choice Questions.
Evaluation of reading passages in reading comprehension questions.
Evaluation of lectures and course material.
Evaluation of learning paths (e.g., in terms of prerequisites and topics taught before a specific exam).
Evaluation of educational recommendation systems (e.g., personalised curricula).
Evaluation of hints and scaffolding questions, as well as their adaptation to different students.
Evaluation of automatically generated feedback provided to students.
Evaluation of techniques for automated scoring.
Evaluation of pedagogical alignment of LLMs.
Evaluation of the ethical implications of using open-weight and commercial LLMs in education.
Evaluation of bias in educational content and LLM outputs.
Human-in-the-loop approaches are welcome, provided that the evaluation also includes an automated component and the scalability of the proposed approach is addressed. Papers on content generation are also very welcome, as long as there is an extensive focus on the evaluation step.
We invite papers in two different categories:
Research papers: We are accepting short papers (5 pages, excluding references) and long papers (10 pages, excluding references), formatted according to the workshop style (using either the LaTeX template or the DOCX template).
Ongoing work: We also accept extended abstracts (max 2 pages), to showcase work in progress and preliminary results. Papers should be formatted according to the workshop style (using either the LaTeX template or the DOCX template).
Submissions should consist mostly of novel work, but some overlap with work submitted elsewhere is allowed (e.g., summaries, or a focus on the evaluation phase of a broader work). Each submission will be reviewed by members of the Program Committee.
Submissions in the research papers category may be archival or non-archival, at the authors’ discretion. All archival papers will be included in the proceedings volume, which will be submitted for publication to CEUR Workshop Proceedings.
Submissions in the ongoing work category (i.e., extended abstracts) will be non-archival. All non-archival submissions may be submitted to any venue in the future, except for another edition of EvalLAC.
Submission URL: https://easychair.org/my/conference?conf=evallac2025 (you will need an EasyChair account to submit)
* All deadlines are calculated at 11:59 pm UTC-12 hours ("anywhere on Earth")
Submission deadline: May 25, 2025
Notification of acceptance: June 18–23, 2025
Camera ready: July 7, 2025
Workshop: July 26, 2025 👈
All accepted papers must be presented at the workshop to appear in the proceedings. The workshop will include both in-person and virtual presentation options. The presenting author should register for the AIED conference (at least a "Full Day Workshop" registration).
All papers will be presented as posters. Please follow the recommendations below:
Paper size: A0 (84.1 cm x 118.9 cm or 33.1 inches x 46.8 inches). Please ensure there is a margin from the edge of the paper on all four sides to prevent any misalignment that may occur during the cutting process.
Layout: Portrait
Color mode: CMYK
File Format: PDF
Minimum Resolution: 300 ppi (recommended)
In the camera-ready paper, you are allowed to include one additional page of content (up to 11 pages for long papers, up to 6 pages for short papers) to address reviewers’ comments and include an "Acknowledgements" section.
Camera-ready papers must not be anonymous. Kindly ensure that all author names and affiliations are included for publication. No changes to the order or composition of authorship may be made.
Make sure that the footer on the first page has the following text: EvalLAC'25: 2nd Workshop on Automatic Evaluation of Learning and Assessment Content, July 26, 2025, Palermo, Italy
If you used the LaTeX template, use the command: \conference{EvalLAC'25: 2nd Workshop on Automatic Evaluation of Learning and Assessment Content, July 26, 2025, Palermo, Italy}
Once your paper has been accepted, you're welcome to upload it to arXiv or other preprint servers.
AlKhuzaey, S., Grasso, F., Payne, T.R., Tamma, V.: Text-based question difficulty prediction: A systematic review of automatic approaches. International Journal of Artificial Intelligence in Education pp. 1–53 (2023)
Alsafari, B., Atwell, E., Walker, A., Callaghan, M.: Towards effective teaching assistants: From intent-based chatbots to LLM-powered teaching assistants. Natural Language Processing Journal 8, 100101 (2024)
Benedetto, L.: A quantitative study of NLP approaches to question difficulty estimation pp. 428–434 (2023)
Benedetto, L., Aradelli, G., Donvito, A., Lucchetti, A., Cappelli, A., Buttery, P.: Using LLMs to simulate students’ responses to exam questions. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 11351–11368 (2024)
Benedetto, L., Cremonesi, P., Caines, A., Buttery, P., Cappelli, A., Giussani, A., Turrin, R.: A survey on recent approaches to question difficulty estimation from text. ACM Computing Surveys (CSUR) (2022)
Benedetto, L., Taslimipoor, S., Caines, A., Galvan-Sosa, D., Dueñas, G., Loukina, A., Zesch, T.: Workshop on automatic evaluation of learning and assessment content. In: International Conference on Artificial Intelligence in Education. pp. 473–477. Springer (2024)
Bitew, S.K., Deleu, J., Develder, C., Demeester, T.: Distractor generation for multiple-choice questions with predictive prompting and large language models. arXiv preprint arXiv:2307.16338 (2023)
Caines, A., Benedetto, L., Taslimipoor, S., Davis, C., Gao, Y., Andersen, Ø., Yuan, Z., Elliott, M., Moore, R., Bryant, C., et al.: On the application of large language models for language teaching and assessment technology (2023)
Chamberlain, D.J., Jeter, R.: Creating diagnostic assessments: Automated distractor generation with integrity. Journal of Assessment in Higher Education 1(1), 30–49 (2020)
Ghanem, B., Fyshe, A.: DISTO: Evaluating textual distractors for multi-choice questions using negative sampling based approach. arXiv preprint arXiv:2304.04881 (2023)
Goel, A.K., Polepeddi, L.: Jill Watson: A virtual teaching assistant for online education. In: Learning engineering for online education, pp. 120–143. Routledge (2018)
Hambleton, R.K., Swaminathan, H.: Item response theory: Principles and applications. Springer Science & Business Media (2013)
Jeon, J., Lee, S.: Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies pp. 1–20 (2023)
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)
Maeda, H.: Field-testing multiple-choice questions with AI examinees: English grammar items. Educational and Psychological Measurement (2024)
Park, J.W., Park, S.J., Won, H.S., Kim, K.M.: Large language models are students at various levels: Zero-shot question difficulty estimation. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 8157–8177 (2024)
Razafinirina, M.A., Dimbisoa, W.G., Mahatody, T.: Pedagogical alignment of large language models (LLM) for personalized learning: A survey, trends and challenges. Journal of Intelligent Learning Systems and Applications 16(4), 448–480 (2024)
Rodriguez-Torrealba, R., Garcia-Lopez, E., Garcia-Cabot, A.: End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Systems with Applications 208, 118258 (2022)
Sonkar, S., Ni, K., Chaudhary, S., Baraniuk, R.G.: Pedagogical alignment of large language models. arXiv preprint arXiv:2402.05000 (2024)
Taneja, K., Maiti, P., Kakar, S., Guruprasad, P., Rao, S., Goel, A.K.: Jill Watson: A virtual teaching assistant powered by ChatGPT. In: International Conference on Artificial Intelligence in Education. pp. 324–337. Springer (2024)
Taslimipoor, S., Benedetto, L., Felice, M., Buttery, P.: Distractor generation using generative and discriminative capabilities of transformer-based models. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 5052–5063 (2024)
Uto, M., Tomikawa, Y., Suzuki, A.: Question difficulty prediction based on virtual test-takers and item response theory. Workshop on Automatic Evaluation of Learning and Assessment Content (2024)