Workshop on Automated Evaluation of Learning and Assessment Content
AIED 2024 workshop | Recife (Brazil), Hybrid
Call for Papers
The evaluation of learning and assessment content has long been a central concern in the educational domain. Assessment content, such as questions and exams, is commonly evaluated both with traditional approaches such as Item Response Theory [8] and with more modern approaches based on machine learning [1, 3, 2]. However, the evaluation of learning content – such as single lectures, whole courses, and curricula – still relies heavily on experts from the educational domain. The same is true for several other components of the educational pipeline: for instance, distractors (i.e., the plausible incorrect options in multiple-choice questions) are commonly evaluated with manual labelling [4, 6, 7], since the automatic evaluation approaches proposed so far have known limitations [11]. The need for accurate metrics to evaluate learning and assessment content has become even more pressing with the rapid growth and adoption of LLMs in the educational domain, both open (e.g., Gemma, Llama 2, and Vicuna) and closed (e.g., GPT-4). Previous research has shown that LLMs can be used for a variety of educational tasks – from feedback generation and automated assessment to question and content generation [10, 5, 9] – and accurately evaluating their output in an automated manner is therefore crucial to ensure the effectiveness of these applications, since traditional approaches based on human feedback do not easily scale to large amounts of data.
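For readers less familiar with the traditional psychometric approach mentioned above, a minimal illustrative sketch (assuming the common two-parameter logistic formulation of IRT, which is only one of several variants used in the literature) models the probability that a student of ability \theta answers item i correctly as

P_i(\theta) = \frac{1}{1 + \exp\left(-a_i(\theta - b_i)\right)},

where b_i is the item's difficulty and a_i its discrimination; the text-based approaches cited above typically aim to estimate such parameters directly from the item content rather than from response data.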
Importantly, the evaluation needs to consider both the educational requirements of the generated content and the biases that might emerge from the generation models. For instance, the generated content must align with the learning objectives of the specific course (or exam) where it is used, as well as with the language level suitable for the target students. Moreover, as when applying language models in other domains, the evaluation must assess the factual accuracy and the EDIB (Equity, Diversity, Inclusion, & Belonging) appropriateness of the generated text. This workshop focuses on approaches for automatically evaluating learning and assessment content.
We expect this workshop to attract professionals from both industry and academia, and to create a space for discussion of the common challenges in evaluating learning and assessment content in education. Through papers and debate, we aim to collect guidelines and best practices for the evaluation of educational content. We believe these will be a valuable contribution to the AIED community and a reference for future research on the evaluation (and generation) of learning and assessment content.
Topics of interest
Topics of interest include but are not limited to:
Question evaluation (e.g., in terms of the pedagogical criteria listed above: alignment with the learning objectives, factual accuracy, language level, cognitive validity, etc.).
Estimation of question statistics (e.g., difficulty, discrimination, response time, etc.).
Evaluation of distractors in multiple-choice questions.
Evaluation of reading passages in reading comprehension questions.
Evaluation of lectures and course material.
Evaluation of learning paths (e.g., in terms of prerequisites and topics taught before a specific exam).
Evaluation of educational recommendation systems (e.g., personalised curricula).
Evaluation of hints and scaffolding questions, as well as their adaptation to different students.
Evaluation of automatically generated feedback provided to students.
Evaluation of techniques for automated scoring.
Evaluation of bias in educational content and LLM outputs.
Human-in-the-loop approaches are welcome, provided that the evaluation also includes an automated component and the scalability of the proposed approach is addressed. Papers on generation are also very welcome, as long as there is an extensive focus on the evaluation step.
Submission guidelines
There are two tracks, with different submission deadlines.
Full and short papers: We are accepting short papers (5 pages, excluding references) and full papers (10 pages, excluding references), formatted according to the workshop style (using either the LaTeX template or the DOCX template).
Extended abstracts: We also accept extended abstracts (max. 3 pages) to showcase work in progress and preliminary results. Abstracts should be formatted according to the workshop style (using either the LaTeX template or the DOCX template).
Submissions should contain mostly novel work, but some overlap with work submitted elsewhere is allowed (e.g., summaries, or a focus on the evaluation phase of a broader work). Each submission will be reviewed by members of the Program Committee, and the proceedings volume will be submitted for publication to CEUR Workshop Proceedings. Due to CEUR-WS.org policies, only full and short papers will be submitted for publication, not the extended abstracts.
Submission URL: https://easychair.org/conferences/?conf=evallac2024
Important dates
Submission deadline (full and short papers): May 22, 2024 (extended from May 17, 2024)
Submission deadline (extended abstracts): May 27, 2024
Notification of acceptance: June 4, 2024
Camera ready: June 11, 2024
Workshop: July 8, 2024
References
[1] AlKhuzaey, S., Grasso, F., Payne, T.R., Tamma, V.: Text-based question difficulty prediction: A systematic review of automatic approaches. International Journal of Artificial Intelligence in Education, pp. 1–53 (2023)
[2] Benedetto, L.: A quantitative study of NLP approaches to question difficulty estimation. pp. 428–434 (2023)
[3] Benedetto, L., Cremonesi, P., Caines, A., Buttery, P., Cappelli, A., Giussani, A., Turrin, R.: A survey on recent approaches to question difficulty estimation from text. ACM Computing Surveys (CSUR) (2022)
[4] Bitew, S.K., Deleu, J., Develder, C., Demeester, T.: Distractor generation for multiple-choice questions with predictive prompting and large language models. arXiv preprint arXiv:2307.16338 (2023)
[5] Caines, A., Benedetto, L., Taslimipoor, S., Davis, C., Gao, Y., Andersen, Ø., Yuan, Z., Elliott, M., Moore, R., Bryant, C., et al.: On the application of large language models for language teaching and assessment technology (2023)
[6] Chamberlain, D.J., Jeter, R.: Creating diagnostic assessments: Automated distractor generation with integrity. Journal of Assessment in Higher Education 1(1), 30–49 (2020)
[7] Ghanem, B., Fyshe, A.: DISTO: Evaluating textual distractors for multi-choice questions using negative sampling based approach. arXiv preprint arXiv:2304.04881 (2023)
[8] Hambleton, R.K., Swaminathan, H.: Item response theory: Principles and applications. Springer Science & Business Media (2013)
[9] Jeon, J., Lee, S.: Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies, pp. 1–20 (2023)
[10] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023)
[11] Rodriguez-Torrealba, R., Garcia-Lopez, E., Garcia-Cabot, A.: End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Systems with Applications 208, 118258 (2022)