Talk 1: Mariano Felice
English vocabulary knowledge prediction across diverse L1 backgrounds
Mariano Felice, Lucy Skidmore and Karen Dunn (British Council)
Traditional test calibration is too slow and costly for the digital age, so we investigate how it can be automated using psychometrically validated vocabulary lists. We will describe how we use the British Council's KVL dataset to automate item difficulty estimation for English learners from Spanish, German and Chinese L1 backgrounds, and uncover properties of the items that influence these predictions. These insights not only challenge some of our assumptions about L2 vocabulary production but also inform more effective item design.
Talk 2: Denise Löfflad
Criterial Features in German: Operationalization and Validation in CEFR-aligned Graded Readers
Denise Löfflad and Detmar Meurers (Leibniz-Institut für Wissensmedien (IWM))
Criterial features have been proposed as pedagogically grounded measures for proficiency and readability assessment, offering an alternative to abstract measures of linguistic complexity. The German Grammar Profile, based on Profile Deutsch and aligned with CEFR levels A1 to B2, has been developed to describe such criterial measuring units for German. To facilitate computer-based assessment, we implemented PALME, an automatic feature detection system that detects 160 criterial features for German.
In this talk we present two evaluations building on this work: (1) an evaluation of the feature detection system, and (2) an empirical validation of the features' pedagogical alignment on a corpus of graded readers.
Talk 3: Gabrielle Gaudeau
Brewing Better Feedback: Criteria-based Assessment of Feedback quality (CAFe)
Gabrielle Gaudeau, Diana Galvan-Sosa, Suchir Salhan, Lily Goulder, Aoife O'Driscoll and Andrew Caines (ALTA Institute)
A central challenge in evaluating written feedback lies in the variety of definitions feedback has been given and the purposes it can serve.
This fragmentation makes it difficult to establish a common standard of feedback quality, as there is no consensus on what feedback should fundamentally achieve. In Natural Language Generation (NLG), research on Automated Feedback Generation (AFG) seeks to formalise feedback quality, yet feedback is often insufficiently specified and treated as a static textual artefact, resulting in evaluation practices that capture only its shallow properties.
In contrast, Education research conceptualises feedback as an integral part of the learning cycle, where quality is determined by how learners use feedback to reduce the gap between current and expected performance. Despite the prominence of such process-oriented theories, existing evaluation methods in NLP struggle to translate these high-level principles into systematic and practical assessments of written feedback.
Current approaches are often either too broad or too task-specific to support consistent comparison across systems. Consequently, there is an evident need for an assessment tool that operationalises these principles, providing a consistent way to measure whether feedback successfully moves learners towards their learning goals. To address this gap, we propose CAFe, a rubric for assessing feedback quality grounded in Hattie and Timperley's feedback model.
We apply CAFe to evaluate feedback generated in interactions between student and teacher large language models (LLMs), demonstrating how the rubric enables a rigorous, pedagogically grounded evaluation of feedback.
Talk 4: Walid El Hefny
Beyond First Attempts: Deep Learning Knowledge Tracing for Post-Feedback Performance in Language Learning
Walid El Hefny and Detmar Meurers (IWM)
Traditional Knowledge Tracing often overlooks knowledge acquisition during scaffolded feedback in language learning. We present a two-layer deep learning framework that models success as a conditional process: the first layer predicts first-attempt success, while the second predicts success after feedback. Comparing six Deep Learning Knowledge Tracing models across seven Knowledge Component models on data from 7th-grade English learners in Germany, we find a shift in predictive features: granular item identifiers best predict first attempts, while semantic (content-based) features better predict success after feedback. This approach identifies a learner's Zone of Proximal Development by modeling true potential beyond binary performance.
Talk 5: Kordula De Kuthy
Adaptivity in an Intelligent Tutoring System should reflect the diversity of students, not just their different levels of knowledge
Kordula De Kuthy and Detmar Meurers (IWM)
While learners differ substantially in their prior knowledge, language competence, cognitive characteristics, and personal interests and motivation, traditional Intelligent Tutoring Systems (ITS) adapt only to differences in domain knowledge. As a step towards taking the multi-dimensional nature of student heterogeneity seriously, we propose an approach that adaptively supports individual learning paths through a rich space of activities that are systematically varied in their language and cognitive demands, in addition to their subject-domain complexity.
We realized this approach in an ITS for economics education in German secondary schools together with our ALEE project partners at the IOB Oldenburg and the University of Lüneburg, and conducted two studies in authentic school contexts: one validating the linguistic and cognitive activity parameters, and a randomized controlled field study evaluating the adaptive learning paths against a traditional standard sequence of activities. The talk spells out the motivation and approach, and presents the results of the two studies.