Venue: Lecture Theatre 7, Diamond Building
Note: all research talks will be delivered in-person. There will be no facility to join online.
Authors: Hend ElGhazaly (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Fairness evaluations of automatic speech recognition (ASR) systems are typically conducted along a single demographic axis and largely in English. In dialect-rich languages such as Arabic, disparities may arise across interacting sociolinguistic dimensions. We introduce a new Arabic speech dataset with rich metadata, in which speakers read identical content in Modern Standard Arabic and Egyptian Arabic. By keeping text content constant across speakers, the dataset enables separation of subgroup disparities from content-driven model behaviour and assessment of Arabic ASR systems' fairness. Evaluating two state-of-the-art multilingual ASR models, we show that performance gaps are multidimensional and model-dependent. Beyond WER, we observe register normalisation bias, whereby dialectal speech is occasionally rendered in the dominant standard register. Fairness evaluation should thus consider not only accuracy but also the system's ability to preserve linguistic variation in its outputs.
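As a rough illustration of the kind of subgroup comparison described above, the following minimal sketch computes word error rate (WER) per group when every speaker reads the same reference text. The function names, the group labels, and the sample format are assumptions made for this example; they are not taken from the authors' dataset or evaluation code.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) triples.

    The group labels are hypothetical metadata values (e.g. a dialect or
    gender tag); errors are pooled per group before dividing.
    """
    errors, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Both hypothetical groups read the same text, so a gap in WER cannot be
# attributed to differences in content.
result = wer_by_group([
    ("A", "a b c d", "a b c d"),
    ("B", "a b c d", "a x c"),
])
```

Because the reference is identical for both groups, the difference between `result["A"]` and `result["B"]` reflects how the recogniser handled the speakers rather than what they said.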
Authors: Atsuki Yamaguchi (University of Sheffield), Terufumi Morishita (Hitachi, Ltd.), Aline Villavicencio (University of Sheffield), Nikolaos Aletras (University of Sheffield)
Abstract: Expanding the linguistic diversity of instruct large language models (LLMs) is crucial for global accessibility but is often hindered by the reliance on costly specialized target language labeled data and catastrophic forgetting during adaptation. We tackle this challenge under a realistic, low-resource constraint: adapting instruct LLMs using only unlabeled target language data. We introduce Source-Shielded Updates (SSU), a selective parameter update strategy that proactively preserves source knowledge. Using a small set of source data and a parameter importance scoring method, SSU identifies parameters critical to maintaining source abilities. It then applies a column-wise freezing strategy to protect these parameters before adaptation. Experiments across five typologically diverse languages and 7B and 13B models demonstrate that SSU successfully mitigates catastrophic forgetting. It reduces performance degradation on monolingual source tasks to just 3.4% (7B) and 2.8% (13B) on average, a stark contrast to the 20.3% and 22.3% from full fine-tuning. SSU also achieves target-language performance highly competitive with full fine-tuning, outperforming it on all benchmarks for 7B models and the majority for 13B models.
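The general idea of scoring parameters and then freezing whole columns can be sketched as follows. This is only an illustration under stated assumptions, not the authors' SSU implementation: the importance scores are taken as given, column scores are aggregated by a simple sum, and `freeze_ratio` and both function names are hypothetical.

```python
def column_freeze_mask(importance, freeze_ratio):
    """Pick the columns with the highest total importance to freeze.

    importance: rows x cols nested list of non-negative scores
                (assumed to come from some scoring method run on source data).
    freeze_ratio: fraction of columns to freeze (hypothetical hyperparameter).
    Returns a list of booleans, True where the column is frozen.
    """
    n_cols = len(importance[0])
    col_scores = [sum(row[j] for row in importance) for j in range(n_cols)]
    n_frozen = int(freeze_ratio * n_cols)
    ranked = sorted(range(n_cols), key=lambda j: col_scores[j], reverse=True)
    frozen = set(ranked[:n_frozen])
    return [j in frozen for j in range(n_cols)]


def masked_sgd_step(weights, grads, frozen_cols, lr):
    """Plain SGD update that leaves frozen columns untouched."""
    return [
        [w if frozen_cols[j] else w - lr * g
         for j, (w, g) in enumerate(zip(w_row, g_row))]
        for w_row, g_row in zip(weights, grads)
    ]


# Toy example: column 1 carries most of the (made-up) importance.
importance = [[0.1, 5.0, 0.2],
              [0.3, 4.0, 0.1]]
mask = column_freeze_mask(importance, freeze_ratio=0.5)
weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
grads = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
updated = masked_sgd_step(weights, grads, mask, lr=0.5)
```

In the toy example only the unfrozen columns move during the update, which is the mechanism by which selected source-critical parameters are shielded during adaptation.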
Authors: Wenjie Peng (University of Sheffield), Chen Chen (University of Sheffield), Thomas Hain (University of Sheffield)
Abstract: Learning speech representations that are useful for a variety of downstream tasks has received considerable attention, due to the outstanding properties of Self-Supervised Learning (SSL) trained models. Despite advancements in modelling methods, understanding of the differences in task performance across representations remains limited. Mainly motivated by the no-free-lunch theorem and speech production, this work investigates changes in task performance in sparse speech representations, providing interpretability analysis under the Information Bottleneck (IB) framework. Autoencoders with varying sparsity levels were trained using three SSL features, and evaluated on six tasks of SUPERB: Speech Enhancement (SE), Speaker Identification (SID), Speech Emotion Recognition (SER), Phone Recognition (PR), Automatic Speech Recognition (ASR) and Slot Filling (SF). Experiments show that: 1) different tasks manifest different degrees of sensitivity to the sparsity levels; 2) the optimal sparsity level for task performance varies; 3) the choice of SSL features has a limited impact on most tasks, with the exception of PR; 4) overall, PR and ASR require more preservation of relevant information about the labels, while SID and SER demand more compression of irrelevant information, where the input quality can shift this trade-off to some degree. These findings can contribute to the design of a universal sparse speech representation learner.
Authors: Misbah Farooq (Loughborough University, London), Varuna De Silva (Loughborough University, London), Xiyu Shi (Loughborough University, London)
Abstract: Speech emotion recognition (SER) across languages remains a challenging problem due to variability in acoustic patterns, linguistic structure, and cultural expression of emotions. This study investigates the concept of emotional universality in speech by proposing a Cross-Attention Transformer framework for cross-linguistic emotion recognition. The model is designed to learn shared emotional representations across multiple languages, including English, Urdu, French, and German, while mitigating language-specific biases. To achieve this, we integrate transformer-based encoders with cross-attention mechanisms that enable effective interaction between language-invariant emotional cues and language-dependent acoustic features. In addition, handcrafted acoustic descriptors such as pitch, energy, MFCCs, and spectral features are fused with deep learned representations to enhance robustness. A transfer learning strategy is employed by pretraining the model on a high-resource English dataset and adapting it to low-resource languages through fine-tuning, enabling knowledge transfer of emotional patterns across linguistic boundaries. Experimental evaluation demonstrates improved generalization performance across unseen languages compared to baseline models. The proposed approach highlights that emotional cues in speech exhibit significant cross-linguistic consistency, supporting the hypothesis of emotional universality. This work contributes toward building scalable, language-agnostic emotion recognition systems suitable for real-world multilingual applications.
Authors: Aaron HA Fletcher (University of Sheffield), Mark Stevenson (University of Sheffield)
Abstract: Deciding when to stop searching for information before making a decision is a common problem with multiple applications. Existing stopping rules developed within Technology-Assisted Review (TAR) aim to achieve a pre-specified recall target, regardless of whether sufficient evidence has been gathered to support the downstream decision. This paper formulates TAR stopping for consequential screening tasks as a utility-aware decision problem and derives three practical stopping policies based on the Expected Value of Perfect Information. The approach is applied to two professional search tasks: patent examining and systematic reviewing. Experiments on CLEF-IP and medical systematic review datasets show that the proposed utility-aware policies generally achieve higher net utility than recall-centric baselines under the evaluated cost and payoff settings.
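For readers unfamiliar with the Expected Value of Perfect Information (EVPI), the standard decision-theoretic definition is the expected utility of acting with full knowledge of the true state minus the expected utility of the best action under current beliefs. The sketch below computes that quantity for a discrete problem. The stopping rule in `should_stop` is a hypothetical illustration of how EVPI could be compared with a review cost; it is not one of the three policies derived in the paper.

```python
def evpi(probs, utilities):
    """Expected Value of Perfect Information for a discrete decision.

    probs: probability of each state under current beliefs.
    utilities: utilities[action][state], the payoff of each action in each state.
    """
    # Best expected utility when acting on current beliefs only.
    best_without_info = max(
        sum(p * u for p, u in zip(probs, row)) for row in utilities
    )
    # Expected utility if the true state were revealed before acting.
    with_info = sum(
        p * max(row[s] for row in utilities) for s, p in enumerate(probs)
    )
    return with_info - best_without_info


def should_stop(evpi_value, review_cost):
    """Hypothetical rule: stop once further evidence cannot repay its cost."""
    return evpi_value <= review_cost


# Two states, two actions, with made-up payoffs.
uncertain = evpi([0.5, 0.5], [[10, 0], [0, 10]])
certain = evpi([1.0, 0.0], [[10, 0], [0, 10]])
```

When beliefs are already certain, EVPI is zero and additional reviewing cannot improve the decision, which is the intuition behind stopping on utility rather than on a fixed recall target.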
Authors: Owen Cook (University of Sheffield), Jake Vasilakes (University of Sheffield), Ian Roberts (University of Sheffield), Xingyi Song (University of Sheffield)
Abstract: As machine learning is becoming increasingly data-centric, we should continuously ask: “how much can we trust the data we are training and evaluating our model with?”. The models we train are a direct product of training data quality, which is itself a direct product of its annotators. The EffiARA (Efficient Annotator Reliability Assessment) framework primarily aims to understand the reliability of each individual annotator to help filter out unreliable workers or down-weight their less trustworthy labels during training. With the annotation process itself being extremely expensive and time-consuming, EffiARA also factors in cost and assists in managing annotation projects from start to finish. So far, the EffiARA framework has supported the creation of three datasets at the University of Sheffield: RUC-MCD, the Chinese News Framing dataset, and SCRum-9. The EffiARA Python package is available on PyPI and open-sourced on GitHub (https://github.com/MiniEggz/EffiARA); our publicly accessible webtool is also available at https://effiara.gate.ac.uk. By accounting for annotator reliability in our dataset creation, we have observed an increase of ~5% in F1-macro for misinformation detection, and increased overall dataset reliability in the news framing task, raising the average Krippendorff's alpha from 0.396 to 0.465.
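Krippendorff's alpha, the agreement measure quoted above, can be computed for nominal labels as one minus the ratio of observed to expected disagreement. The sketch below is a self-contained implementation of that textbook formula using a coincidence table; it does not use or reproduce the EffiARA package, and the function name and input format are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that different
           annotators gave to one item (missing annotations simply omitted).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with a single label carry no agreement information
        # Every ordered pair of labels within an item, weighted by 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1 - observed / expected


perfect = krippendorff_alpha_nominal([["a", "a"], ["b", "b"]])
mixed = krippendorff_alpha_nominal(
    [["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]
)
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, which gives some context for the reported rise from 0.396 to 0.465.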