The Workshop on Computational Linguistic Methods for Language Technology: Writing, Reading, Interaction, Content Creation and Evaluation (WRICE), organised by the University of Cambridge ALTA Institute, took place from the 12th to the 13th of March 2026. The workshop was offered in a hybrid format, with around 50 participants attending in person and 20 attending online.
The workshop included 17 half-hour talks by academics from the Universities of Cambridge, Tübingen, Ghent, Limerick and Gothenburg, and from the British Council. Below is a brief summary of each presentation.
The first day of the workshop was opened with a talk by Mariano Felice, Lucy Skidmore and Karen Dunn from the British Council. They explored English vocabulary knowledge prediction across diverse L1 backgrounds, utilising a large Knowledge-based Vocabulary List (KVL). This list includes each English target word alongside its L1 source word, an L1 context and an English target clue. It was revealed that a multilingual transformer model fine-tuned on all L1s offers the best prediction of English vocabulary difficulty. Moreover, item prompt choice and subword tokenisation also impact model predictions.
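As a rough illustration of this kind of setup, the sketch below formats a single KVL-style item as input to a multilingual transformer with a regression head for difficulty prediction. The field names, prompt template and checkpoint (xlm-roberta-base) are assumptions for illustration, not the authors' actual choices; notably, the talk suggests the prompt template itself is a consequential design decision.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical KVL-style item (field names are illustrative).
item = {
    "target_word": "reliable",                             # English target word
    "l1_source_word": "zuverlässig",                       # L1 source word (here German)
    "l1_context": "Er ist ein zuverlässiger Kollege.",     # L1 context sentence
    "target_clue": "can be trusted to do what is expected",  # English target clue
}

# The talk found that item prompt choice matters, so this template is one of
# the knobs an experimenter would vary.
prompt = (f"word: {item['target_word']} | translation: {item['l1_source_word']} | "
          f"context: {item['l1_context']} | clue: {item['target_clue']}")

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1)  # single-logit regression head

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
difficulty = model(**inputs).logits.item()  # meaningful only after fine-tuning
```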
The second talk, from Denise Löfflad of the University of Tübingen, covered the operationalisation and validation of criterial features in German. Criterial features are defined as linguistic features that emerge at specific stages of L2 development and operationalise CEFR descriptors. Löfflad utilised the PALME system to automatically extract criterial features from learner texts, based on the German Grammar Profile. Evaluation on a test suite shows high precision across features, but consistently lower recall. Furthermore, the German criterial features predict CEFR level comparably to English criterial features, though the results are lower than those of CEFR classification using linguistic complexity.
The next talk was given by Diana Galvan-Sosa, Gabrielle Gaudeau and Suchir Salhan of the University of Cambridge. The speakers presented CAFe, a task-agnostic rubric for evaluating the quality of both LM- and human-written feedback. Grounded in the pedagogical theory of Hattie and Timperley, the rubric considers the feedback in context and evaluates its textual and pedagogical quality. The rubric was operationalised to evaluate the quality of LLM-generated feedback provided to ESL learners, allowing a fine-grained analysis of how different models vary in their ability to satisfy pedagogical functions.
Talk four was given by Walid El Hefny from the University of Tübingen. El Hefny explored the potential of deep learning knowledge tracing (DLKT) for predicting post-feedback performance in language learning. DLKT can extract meaningful features and uncover complex patterns in students' learning, improving predictive performance. DLKT was applied to data from high school ESL learners using FeedBook, an interactive tutoring system. El Hefny reported that students' initial success depends on item-specific mechanics and phrasing rather than topic mastery, whereas error recovery (after feedback) is driven by conceptual knowledge that transfers across different exercises.
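The talk is summarised here without architectural detail, but the standard deep knowledge tracing formulation (Piech et al., 2015) conveys the core idea: a recurrent network reads a learner's sequence of (exercise, correct/incorrect) interactions and predicts the probability of success on each exercise at the next step. The sketch below is a minimal version of that setup; the sizes and names are illustrative assumptions, not details of the FeedBook system.

```python
import torch
import torch.nn as nn

class DKT(nn.Module):
    def __init__(self, n_exercises: int, hidden_size: int = 128):
        super().__init__()
        # Each interaction is one-hot over 2 * n_exercises:
        # one slot per (exercise id, correctness) pair.
        self.lstm = nn.LSTM(2 * n_exercises, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_exercises)

    def forward(self, interactions: torch.Tensor) -> torch.Tensor:
        # interactions: (batch, time, 2 * n_exercises)
        hidden, _ = self.lstm(interactions)
        # Per-exercise probability of a correct answer at the next step.
        return torch.sigmoid(self.out(hidden))

model = DKT(n_exercises=50)
batch = torch.zeros(1, 10, 100)          # one learner, ten interactions
next_step_probs = model(batch)           # shape: (1, 10, 50)
```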
The final talk of the morning was presented by Kordula de Kuthy, also from the University of Tübingen. This talk focused on the importance of Intelligent Tutoring Systems (ITS) adapting to heterogeneous students on individual learning paths. An adaptive algorithm (using multi-dimensional parameterisation) was evaluated against a control condition in a randomised controlled study of an economics tutoring system for German secondary school students. The adaptive system provided numerous benefits: students came closer to their learning goals, and results no longer depended on students' grades or on their socioeconomic background (as they did in the control condition). De Kuthy's talk emphasised the importance of considering learner heterogeneity as a multi-dimensional phenomenon in ITS.
Thursday's afternoon session began with a talk from Stefano Bannò from ALTA Engineering and the University of Cambridge. This talk covered the use of the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs. LLMs attempted three tasks: semantic understanding, word-level proficiency prediction and essay-level proficiency prediction. LLM performance was compared to a POS-based system and a random baseline. Semantic understanding was performed best by GPT-4o. However, it was found that non-ambiguous words do not require an LLM for word-level proficiency prediction; for ambiguous words, this task was performed best by an LLM (Qwen 2.5 32B). Similarly, features extracted by Qwen were better predictors of proficiency than POS-based models. Bannò concluded that combining POS- and LLM-based systems can maximise performance on L2 vocabulary assessment.
The next talk explored AI in the L2 Dutch learning context and was presented by Joni Kruijsbergen from the University of Ghent. This talk offered important insights into how L2 Dutch teachers use and view AI. It was found that most teachers do incorporate AI into their teaching practices, but that there are at least five areas in which it could be improved for pedagogical use. Firstly, AI should guide learners towards self-correction rather than providing the correct answer outright. Teachers also want AI to give explanations, but state that these should avoid cognitively overloading the student. Ideally, AI would also adapt to the learner's language background. Finally, AI should provide positive feedback that is concrete in nature, rather than sycophantic and hollow.
The eighth talk was given by Matthew Pattemore from the University of Tübingen. Pattemore's talk evaluated the impact of corrective feedback types in children's digital language games. Feedback can be divided into outcome feedback (giving learners information about the result of their action) and process feedback (giving learners information about the process of achieving their desired result). In a literacy game teaching 6th-grade Spanish students to read in English, outcome feedback proved better than process feedback. However, some learners seemed to ignore process feedback entirely; when these learners were removed from the analysis, process feedback proved significantly better than outcome feedback. This finding raises important questions about how to encourage listening and attention in language games to make feedback as useful as possible.
The next talk was from Luisa Ribeiro Flucht of the University of Tübingen. This presentation explored whether LLMs can be given pedagogical frameworks to target communicative practice. To give GenAI a 'pedagogical brain', it needs to consider the domain, the learner, and how to adapt. The L2 grammar domain can be modelled as a hierarchical structure available as a machine-readable domain model for GenAI. In this model, forms are combined with functions and mapped to related forms; these relations can be of types 'requires', 'precedes' or 'has_variant'. The talk introduced a novel graph-based way of representing grammatical knowledge and encoding it for use by LLMs.
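A minimal sketch of such a domain model, using a directed graph with typed edges, might look like the following. Only the three relation types are taken from the talk; the specific forms, functions and library choice (networkx) are illustrative assumptions.

```python
import networkx as nx

grammar = nx.DiGraph()

# Forms are nodes, annotated with the communicative functions they serve.
grammar.add_node("past_simple", functions=["narrate completed events"])
grammar.add_node("present_perfect", functions=["talk about life experience"])
grammar.add_node("present_perfect_continuous",
                 functions=["talk about recent ongoing activity"])

# Typed edges encode the three relation types named in the talk.
grammar.add_edge("present_perfect", "past_simple", relation="requires")
grammar.add_edge("past_simple", "present_perfect", relation="precedes")
grammar.add_edge("present_perfect", "present_perfect_continuous",
                 relation="has_variant")

# A tutoring system could then query prerequisites before targeting a form:
prereqs = [dst for _, dst, data in grammar.out_edges("present_perfect", data=True)
           if data["relation"] == "requires"]
print(prereqs)  # -> ['past_simple']
```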
The penultimate talk on the first day was presented by Øistein Andersen (University of Cambridge) and Geraldine Mark (University of Limerick). The speakers discussed their work on the English Grammar Profile, a tool for grammatical annotation and CEFR-level mapping based on the Cambridge Learner Corpus (CLC). Translating the 'can-do' descriptors into rules for automatisation is not simple: different rules apply at different levels, initial interpretations of rules may not always be correct, and two or more rules can match the same construction. Using the Write & Improve Corpus 2024, the authors annotated learners' texts for form-function mapping/grammatical polysemy and for the effect of exam versus spontaneous writing conditions. It was found that what looks like grammatical development through new structures in exam conditions may in fact be the same structure in a new use (a new function). Andersen and Mark's talk prompts a discussion on the impact of writing conditions, as exam contexts such as the CLC include more advanced-looking forms whilst practice conditions (W&I) show attempts at various structures across various proficiency levels.
Thursday's final talk was from Detmar Meurers of the University of Tübingen and covered how linguistic complexity and its development can be observed longitudinally. Using corpora of Chinese learners of German, Meurers analysed 450 German complexity features. It was found that not all developmentally informative features are relevant for grading, and vice versa, suggesting that grading also considers aspects beyond linguistic development, such as accuracy and appropriateness. Moreover, the interpretation of linguistic complexity must be task-dependent, as features are susceptible to task effects.
The second day of the workshop began with a talk from Elena Volodina from the University of Gothenburg. This talk introduced the idea of pseudonymization to protect learners' privacy in datasets. Unlike regular anonymisation, which replaces all personal information with the same filler, pseudonymization uses key mapping so that replacements consistently refer back to the original information. Volodina highlighted how automatic rule-based pseudonymization often violates semantic constraints and produces inaccurate replacements for detailed personal information. LLM-based pseudonymization also presents challenges: fewer than 25% of results were accurate, predictions were biased, and overlap between predictions and the original data suggested privacy leakage. This talk emphasises the importance of careful handling of personal information in learner corpora, especially in relation to downstream tasks.
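One simple realisation of key mapping is sketched below: each distinct piece of personal information receives a stable, typed placeholder, so repeated mentions stay consistent. This is an illustrative assumption about the mechanism, not Volodina's system; entity detection, and the semantic constraints the talk shows are hard to satisfy, are assumed to be handled elsewhere.

```python
class Pseudonymizer:
    """Replace personal information with stable, typed placeholders."""

    def __init__(self):
        self.mapping = {}    # original value -> placeholder
        self.counters = {}   # entity type -> running index

    def replace(self, value: str, entity_type: str) -> str:
        # Key mapping: the same original value always gets the same
        # placeholder, so co-references in the text stay intact.
        if value not in self.mapping:
            idx = self.counters.get(entity_type, 0) + 1
            self.counters[entity_type] = idx
            self.mapping[value] = f"{entity_type}_{idx}"
        return self.mapping[value]

p = Pseudonymizer()
print(p.replace("Anna", "PERSON"))   # PERSON_1
print(p.replace("Malmö", "CITY"))    # CITY_1
print(p.replace("Anna", "PERSON"))   # PERSON_1 again: same key, same pseudonym
```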
Friday's second talk explored approaches to short-answer grading in domain-specific educational settings and was presented by Kate Belcher from the University of Tübingen. Meaning assessment is at the core of short-answer assessment. Accordingly, textual similarity metrics aggregated as features in a random forest classifier were shown to grade short-answer questions with high accuracy. In-domain data is vital for accuracy here, so LLMs may be advantageous in limited-data contexts. However, LLMs struggle with semantically similar answers that have nuanced differences, and performance across LLMs was not consistent: some models improved when provided with a reference answer while others worsened. Belcher's talk highlights the importance of meaning in auto-grading short-answer assessments.
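The aggregation idea can be sketched as follows: several similarity scores between a student answer and a reference answer become the feature vector for a random forest. The two toy metrics here (token overlap and a character-level ratio) and the toy data are stand-ins for whatever metrics and training data the actual system uses.

```python
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def similarity_features(student: str, reference: str) -> list[float]:
    s_tokens = set(student.lower().split())
    r_tokens = set(reference.lower().split())
    token_overlap = len(s_tokens & r_tokens) / max(len(r_tokens), 1)
    char_ratio = SequenceMatcher(None, student.lower(), reference.lower()).ratio()
    return [token_overlap, char_ratio]

# Toy training data: (student answer, reference answer, 0/1 grade).
pairs = [
    ("photosynthesis makes glucose", "plants produce glucose via photosynthesis", 1),
    ("the sun is hot",               "plants produce glucose via photosynthesis", 0),
]
X = [similarity_features(s, r) for s, r, _ in pairs]
y = [label for _, _, label in pairs]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([similarity_features(
    "glucose is produced by plants",
    "plants produce glucose via photosynthesis")]))
```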
The next talk was presented by Suchir Salhan from the University of Cambridge. Salhan explored the potential of using bilingual language models as computational models of human language learners. Efforts in this direction include BabyLM and BabyBabelLM, which aim to model human-scale input for language acquisition using developmentally plausible data. However, BabyLMs lack a number of important capabilities that emerge with more data (e.g., instruction following), which limits their use as student models. Bilingual GPT models have been trained with simultaneous and sequential routines to mirror human bilinguals. Unlike humans, these models can show catastrophic forgetting of the L1 as the L2 is introduced. Salhan also presented Beetle, a new project aiming to mitigate catastrophic forgetting, which includes dense checkpoints for developmental analyses. Finally, Salhan emphasised the importance of evaluating bilingual language models on appropriate benchmarks covering both formal and functional competence.
Talk 15 was from Daniela Verratti Souto of the University of Tübingen and asked whether language forms practised in intelligent tutoring systems (ITS) transfer to functional language production tasks. An analysis of English essays written by 7th-grade German students revealed that grammatical features practised and tested more were indeed used more in free writing, with examples including comparatives, relative clauses and conditionals. Form-specific feedback in ITS also increased the use of comparatives and relative clauses. Verratti Souto argued that grammatical features such as the conditional can be divided into components; this allows a hierarchical approach in which a learner who produces only some of the components can still be considered to be aiming for the construction.
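A hypothetical sketch of this component-based view: treat a construction as a set of components and score a learner production by how many of them it realises. The component names and the choice of the second conditional are illustrative assumptions, not the talk's actual decomposition.

```python
# Components of the second conditional ("If I had time, I would travel").
SECOND_CONDITIONAL = {"if_clause", "past_tense_in_if_clause", "would_in_main_clause"}

def attempt_score(observed: set[str]) -> float:
    """Fraction of the construction's components realised by the learner."""
    return len(observed & SECOND_CONDITIONAL) / len(SECOND_CONDITIONAL)

# A learner who drops the past tense still counts as attempting the form:
print(attempt_score({"if_clause", "would_in_main_clause"}))  # ~0.67
```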
The penultimate talk of the workshop was from Soroosh Akef from the University of Tübingen. This talk looked at the potential for LLMs to conduct rule-grounded error correction. Accuracy, complexity and fluency can be considered to constitute proficiency, and accuracy, which requires identifying fine-grained errors at scale, is the hardest dimension to automate with LLMs. Correcting errors requires theory-of-mind reasoning to attribute motivations to learners' mistakes, which LLMs may not be able to achieve. SABER was used as a tool to extract grammatical properties from well-formed texts. The Needleman-Wunsch algorithm is then used to align forms and determine whether a property is correct (identical forms) or incorrect (different form and different property; or different form, same property and normalised Levenshtein distance below 0.15). Training models on LLM and human corrections, it was shown that LLMs can indeed be used as zero-shot error correctors, provided the prompt is detailed enough.
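The decision rule can be sketched as below, applied to form pairs assumed to be already aligned (e.g., via Needleman-Wunsch) between a learner text and its corrected counterpart. The 0.15 threshold is taken from the talk and treated here as a normalised edit distance; the helper, the property labels and the fallback label are assumptions for illustration.

```python
from difflib import SequenceMatcher

def norm_levenshtein(a: str, b: str) -> float:
    # difflib's ratio is a similarity in [0, 1]; 1 - ratio approximates a
    # normalised edit distance closely enough for a sketch.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def classify(learner_form: str, corrected_form: str,
             learner_prop: str, corrected_prop: str) -> str:
    if learner_form == corrected_form:
        return "correct"        # identical forms
    if learner_prop != corrected_prop:
        return "incorrect"      # different form, different property
    if norm_levenshtein(learner_form, corrected_form) < 0.15:
        return "incorrect"      # same property, near-identical form (e.g. a typo)
    return "unclassified"       # same property, distant form: outside the stated rule

print(classify("recieve", "receive", "VERB", "VERB"))  # -> 'incorrect'
```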
The final talk was presented by Shiva Taslimipoor from the University of Cambridge. This talk summarised ALTA's work to develop a dataset of responses and annotations for reading comprehension multiple-choice questions (MCQs). Rather than only capturing the final answer, this dataset includes the sections of the reading text highlighted by learners to show the 'why' behind their choice. Moreover, each question is labelled with what is being tested (e.g., main idea, inference, lexical meaning). Participants are being recruited through Prolific, covering CEFR levels B-C. Future work aims to release the dataset for public use and to benchmark difficulty and discrimination statistics against those reported in the original source dataset (the Cambridge MCQ reading dataset).