Special Interest Group on Speech and Language Technology in Education (SLaTE) Webinar Series
This webinar series focuses on Speech and Language Technology in Education (SLaTE). SLaTE is a Special Interest Group (SIG) of the International Speech Communication Association (ISCA) and provides a platform to exchange ideas, present research, and discuss applications. Webinars take place on the first non-holiday Monday of every month at 16:00 CET. The talks are live-streamed and recorded, but please let us know if you do not feel comfortable being recorded. Links to the talks are shown below and on our YouTube channel: www.youtube.com/@ISCASIGSLaTE. You are welcome to share our webinar series!
Webinar Registration:
If you are interested in our topic and want to receive updates on our webinar series, please register via Eventbrite. You can find the registration page easily by following our Eventbrite account, SLaTE. You will receive the Zoom link two days before the event and one reminder three hours before the event.
Interested in our next webinar? Register now for the upcoming talks via https://www.eventbrite.com/cc/slate-talk-3581309
Visit SIG SLaTE Website
Join SIG SLaTE Mailing Group
https://groups.google.com/g/slate-isca
Webinar Schedule
Abstract: Child speech is characterized by larger inter- and intra-speaker variability than adult speech, partly due to vocal tract changes as children grow. In addition, there is a lack of large, publicly available datasets that can adequately train machine learning algorithms for various recognition tasks. As a result, the performance of automatic speech recognition (ASR) systems on child speech is worse than on adult speech. In this talk, I will summarize our efforts in data collection, developing data augmentation techniques, benchmarking children's speech recognition with supervised and self-supervised speech foundation models, and developing a framework for assessing children's narrative language abilities. Our studies point to the need to account for several factors when designing child speech processing systems: age (an ASR system that works well for a 9-year-old child would not necessarily work well for a 6-year-old), style (reading versus spontaneous speech), dialect (differences not only in pronunciation but also in word usage and grammar), and reading and/or language impairment. Moreover, for language assessments, a transliteration is sometimes more valuable to the teacher than a corrected transcription. As a result, data diversity, and not just quantity, is critical when designing child ASR systems. While significant progress has been made in child speech processing, several challenges remain and need to be addressed before spoken language systems are used in early literacy settings.
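The abstract above mentions data augmentation for child speech but does not specify the techniques. Purely as a hedged illustration of the general idea, the sketch below (assuming the librosa and soundfile packages and hypothetical file names) applies tempo and pitch perturbation to existing recordings, a common way to simulate some of the acoustic variability of child speech; it is not the speaker's actual method.

```python
# Illustrative sketch only: simple tempo/pitch perturbation for augmenting
# scarce child-speech data; not the speaker's actual augmentation method.
import librosa
import soundfile as sf

def perturb(in_path, out_path, rate=1.1, n_steps=2.0, sr=16000):
    """Stretch tempo by `rate` and shift pitch by `n_steps` semitones."""
    y, _ = librosa.load(in_path, sr=sr)                          # load, resample to 16 kHz
    y = librosa.effects.time_stretch(y, rate=rate)               # tempo perturbation
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)   # pitch perturbation
    sf.write(out_path, y, sr)

# Hypothetical usage: create a few perturbed variants per utterance.
for i, (r, s) in enumerate([(0.9, 1.0), (1.0, 2.0), (1.1, 3.0)]):
    perturb("adult_utt.wav", f"augmented_{i}.wav", rate=r, n_steps=s)
```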
Professor at Faculty of Science and Engineering, Department of Intelligent Information Engineering and Sciences, Doshisha University, Kyoto, Japan
Abstract: Dialogue-based CALL systems have various designs depending on learners' age, proficiency levels, social contexts, and learning goals. We designed a Joining-in-type robot-assisted language learning (JIT-RALL) system for students who seldom have opportunities for L2 communication. The JIT-RALL system shows a model conversation between two humanoid robots and invites a learner to join in, so that the learner can use specific forms of English expressions. We have explored effective training methods such as question-answering (QA) vs. repeating (RP). During the restrictive years of COVID-19, we developed a new JIT-CALL system that enabled a remote learner to converse with two characters on a server. We conducted a large-scale experiment that verified how training effectiveness depends on learners' proficiency levels: learners with low CEFR levels showed a significantly greater effect from QA training than from RP training. Although the original JIT-RALL design expected implicit learning through the model conversation, and was therefore not limited by the accuracy of ASR on learners' accented speech, recent advances in ASR and NLP technologies make it possible to give feedback on learner responses. We implemented a feedback function in the system with Whisper ASR and ChatGPT and conducted an experiment; the results showed a significantly greater learning effect with feedback than without. I will talk about the design, the performance, and the pedagogical impact.
[There is No Video Available for this Talk]
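The feedback function with Whisper ASR and ChatGPT mentioned in the abstract above is not described in detail here. The following minimal sketch (assuming the openai-whisper and openai Python packages, an API key in the environment, and illustrative model names, prompt, and file name) shows one plausible shape of such a loop; it is not the JIT-RALL implementation.

```python
# Hypothetical sketch of an ASR + LLM feedback loop; not the JIT-RALL system.
import whisper
from openai import OpenAI

asr = whisper.load_model("small")   # illustrative Whisper model size
llm = OpenAI()                      # reads OPENAI_API_KEY from the environment

def feedback(audio_path, target_form):
    """Transcribe a learner response and ask an LLM for brief corrective feedback."""
    transcript = asr.transcribe(audio_path)["text"].strip()
    prompt = (
        f"A learner of English was asked to answer using the form: '{target_form}'.\n"
        f"They said: '{transcript}'.\n"
        "Give one or two sentences of encouraging, corrective feedback."
    )
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",        # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return transcript, reply.choices[0].message.content

print(feedback("learner_response.wav", "I have been ... since ..."))  # hypothetical file
```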
Lecturer, Taif University, Saudi Arabia, and recent PhD graduate of the Speech and Hearing Research Group (SpandH) at the University of Sheffield, England
Abstract: Automatic proficiency assessment can be a useful tool in language learning, both for self-evaluation of language skills and to enable educators to tailor instruction effectively. Assessment methods often use categorisation approaches. In this work, an exemplar-based approach is chosen instead, and comparisons between utterances are made using different speech encodings. Such an approach has the advantage of avoiding formal categorisation of errors by experts. Aside from a standard spectral representation, pretrained model embeddings are investigated for their usefulness for this task. Experiments are conducted on the speechocean762 database, which provides three levels of proficiency. The data were clustered, and the performance of different representations was assessed in terms of cluster purity as well as categorisation correctness. Cosine distance with Whisper representations yielded better clustering performance than the alternatives.
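The exact encodings and clustering setup compared in this work are not reproduced here. As a rough sketch of one plausible variant (assuming the transformers, librosa, and scikit-learn packages and an illustrative Whisper checkpoint), one can mean-pool Whisper encoder states into utterance embeddings, cluster them, and compare them with cosine distance.

```python
# Rough sketch: utterance embeddings from the Whisper encoder, clustered with k-means.
# Checkpoint, pooling, and clustering details are illustrative, not the authors' setup.
import librosa
import torch
from transformers import WhisperProcessor, WhisperModel
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def embed(path):
    """Mean-pool Whisper encoder states into a single utterance vector."""
    audio, _ = librosa.load(path, sr=16000)
    feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        enc = model.get_encoder()(feats).last_hidden_state   # (1, frames, dim)
    return enc.mean(dim=1).squeeze(0).numpy()

paths = ["utt_low.wav", "utt_mid.wav", "utt_high.wav"]        # hypothetical files
X = [embed(p) for p in paths]

# Three clusters, mirroring the three proficiency levels in speechocean762.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels, cosine_distances(X))
```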
Abstract: For endangered languages, speech and language technologies offer untold opportunities – not least in the area of education, which is critical to language transmission, maintenance and revival. However, to achieve their potential impact, the development of core technologies such as ASR and TTS, and the building of educational applications based on them, need to be guided by important considerations. While these are not typical priorities for the major world languages, they are critical to ensure that the technologies are adequate, appropriate and useful to the endangered-language community. These considerations are discussed and illustrated in the light of the ABAIR project’s experience with Irish (Gaelic). Of importance are: (i) sociolinguistic awareness, such as addressing the fact that the endangered language is unlikely to have a spoken standard, but rather a number of widely different dialects; (ii) linguistic knowledge, given that the language structure may dictate how an educational application is built and that mirroring an application available for English may be highly inappropriate; (iii) clear pedagogical targeting that explores the acquisition process for the learner of the specific language; and above all, (iv) close collaboration with the communities and end-users at every stage of technology development and application building. Ultimately, a holistic, interdisciplinary approach is proposed. The local limitations confronting specific endangered languages can be very extreme, and time is running out. It is suggested that pooling guidelines, expertise, experiences and resources would benefit all. A practical proposal is to establish a SLaTE – SEAGUL joint initiative that embraces groups actively working with endangered languages, such as the Endangered Languages Documentation Programme (ELDP), the Network to Promote Linguistic Diversity (NPLD) and the Language Technology for All movement (LT4All), to promote collaborations that will harness the potential of the new technologies and educational applications for endangered languages.
Professor of Engineering and Language Education, the University of Tokyo, Bunkyō, Japan
Abstract: With recent advancements in speech technology, pronunciation training courseware is available and runs even on smartphones. Learners' speaking behaviors are measured and assessed automatically; in this talk, the lecturer focuses on how to measure and assess their listening behaviors. Researchers of second language acquisition claim that input (perception) training is much more important than, and should be given prior to, output (production) training. Since listening is a mental phenomenon, however, it seems possible to measure listening behaviors only with expensive brain-sensing techniques. In this talk, based on characteristics of the human brain, a pedagogically valid and inexpensive technique for "acoustic" measurement of listening behaviors, which can detect listening breakdown, is proposed. The technique is then applied to L2 "aural" training by measuring learners' behaviors and to L2 "oral" training by measuring raters' behaviors. Finally, the lecturer shows an interesting example of applying the technique to calculate the global communicability of individual learners talking with and listening to speakers of global Englishes.
Relevant information:
A project on listening disfluency measurement:
https://sites.google.com/g.ecc.u-tokyo.ac.jp/listening-disfluency
https://drive.google.com/file/d/1tQ4vlOurBmaax6HEomRIYx6T-RRGJ__R/view?usp=share_link
Senior ML Engineer at CluePoints, Belgium, and Scientific Collaborator at ISIA Lab, Numediart Institute, the University of Mons, Mons, Belgium
Abstract: In this talk, I will present two complementary approaches to advancing speech technology for educational applications, particularly in pronunciation training systems. The first approach, detailed in the paper "TIPAA-SSL", introduces a novel methodology for text-independent phone-to-audio alignment, leveraging self-supervised learning and phoneme recognition.
We build on top of a wav2vec2 model pre-trained on many languages and already fine-tuned for phoneme sequence prediction. A pipeline of shallow ML models and algorithms is used to predict phones and phone boundaries from the latent representations of this model, and it can be adapted with little data to a chosen phone set. This approach significantly improves alignment accuracy across different native English accents, a critical feature for unbiased pronunciation feedback in language learning applications.
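The TIPAA-SSL pipeline of shallow models is not reproduced here. As a loose sketch of the underlying idea (assuming the transformers and librosa packages and a publicly available phoneme-recognition checkpoint rather than the authors' own models), frame-level phoneme predictions from a fine-tuned wav2vec2 model can be collapsed into rough phone labels and boundaries on the model's ~20 ms frame grid.

```python
# Loose sketch: frame-level phoneme predictions and rough boundaries from a
# wav2vec2 model fine-tuned for phoneme recognition. The checkpoint is illustrative;
# the paper's shallow alignment models are not reproduced here.
import itertools
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

ckpt = "facebook/wav2vec2-lv-60-espeak-cv-ft"               # multilingual phoneme CTC model
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt).eval()

audio, _ = librosa.load("utterance.wav", sr=16000)          # hypothetical file
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits[0]           # (frames, vocab)

frame_ids = logits.argmax(dim=-1).tolist()
frame_dur = 0.02                                            # ~20 ms per wav2vec2 frame

# Collapse repeated frame labels into (phone, start, end) spans, dropping CTC blanks.
t = 0.0
for phone_id, group in itertools.groupby(frame_ids):
    n = len(list(group))
    if phone_id != processor.tokenizer.pad_token_id:        # pad token acts as CTC blank
        phone = processor.tokenizer.convert_ids_to_tokens(phone_id)
        print(phone, round(t, 2), round(t + n * frame_dur, 2))
    t += n * frame_dur
```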
The second approach, described in the "MUST&P-SRL" paper, focuses on the extraction of linguistic features, emphasizing automatic syllabification across multiple languages. This methodology ensures compatibility with existing forced-alignment tools like the Montreal Forced Aligner (MFA) and enables consistent segmentation of both text and phonetic data. The resulting unified syllabification and stress annotation techniques are essential for creating accurate and reliable speech content for educational tools.
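The MUST&P-SRL syllabification procedure itself is not detailed in the abstract. The toy sketch below, which syllabifies a sequence of ARPAbet phones (the phone set used by MFA's English models) with a deliberately simplistic one-consonant-onset rule, is shown only to illustrate the kind of output such a tool produces; it is not the paper's algorithm.

```python
# Toy syllabifier over ARPAbet phones, for illustration only;
# the onset rule is deliberately simplistic and is not the MUST&P-SRL algorithm.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

def syllabify(phones):
    """Group phones into syllables; each non-initial vowel takes one preceding consonant as onset."""
    nuclei = [i for i, p in enumerate(phones) if p.rstrip("012") in VOWELS]
    syllables, start = [], 0
    for prev, cur in zip(nuclei, nuclei[1:]):
        boundary = cur - 1 if cur - prev > 1 else cur   # leave one consonant as the onset
        syllables.append(phones[start:boundary])
        start = boundary
    syllables.append(phones[start:])
    return syllables

# 'permit': P ER1 M IH0 T  ->  [['P', 'ER1'], ['M', 'IH0', 'T']]
print(syllabify(["P", "ER1", "M", "IH0", "T"]))
```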
Abstract: Along with growing interest in applying social robots in the education sector, a new technology-based field of language education has emerged, called 'robot-assisted language learning' (RALL). RALL has developed rapidly in second language learning, especially driven by the need to compensate for the shortage of first-language tutors. There are many implementation cases and studies of social robots, from early government-led attempts in Japan and South Korea to increasing research interest in Europe and worldwide. Compared with RALL used for English as a foreign language (EFL), however, there are fewer studies on applying RALL to teaching Chinese as a foreign language (CFL). One potential reason is that RALL is not well known in the CFL field. This talk attempts to fill this gap by addressing the balance between classroom implementation and research frontiers of social robots. The review first introduces the technical tool used in RALL, namely the social robot, at a high level. It then presents a historical overview of real-life implementations of social robots in language classrooms in East Asia and Europe, and provides a summary of the evaluation of RALL from the perspectives of L2 learners, teachers and technology developers. The overall goal of this talk is to gain insights into RALL's potential and challenges and to identify a rich set of open research questions for applying RALL to CFL. It is hoped that the review may inform interdisciplinary analysis and practice for scientific research and front-line teaching in the future.
Research Associate, Cambridge University Institute for Automated Language Teaching and Assessment (ALTA).
Abstract: Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. Typically, this is done in a series of steps: first, spoken words are turned into text using automatic speech recognition (ASR); then any disfluencies in the speech (such as repetitions, hesitations and false starts) are removed; and finally, the grammatical errors are corrected. However, a potential problem with this method is that errors can propagate from one step to the next. In this presentation, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework (ASR, disfluency removal, and GEC) or only part of it, e.g., ASR only or disfluency removal only. These end-to-end approaches are compared to more standard cascaded approaches on data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the presentation discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.
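Neither the cascaded nor the end-to-end system from this presentation is specified in code here. The sketch below (using Hugging Face's Whisper classes, an illustrative checkpoint, and an invented training pair) conveys only the core idea of the end-to-end variant: fine-tune Whisper on pairs of learner audio and corrected, fluent transcripts so that the decoder emits corrected text directly.

```python
# Sketch of the end-to-end idea: train Whisper to emit corrected, fluent text
# directly from learner audio. Checkpoint, data, and training setup are illustrative.
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One invented training pair: disfluent, errorful speech -> corrected target text.
audio, _ = librosa.load("learner_utt.wav", sr=16000)          # hypothetical file
target = "I have lived in London for three years."            # corrected, fluent reference

feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
labels = processor.tokenizer(target, return_tensors="pt").input_ids

# A single gradient step; a real system would loop over a GEC-annotated speech corpus.
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
optim.zero_grad()
loss = model(input_features=feats, labels=labels).loss
loss.backward()
optim.step()

# After fine-tuning, generation yields corrected text rather than a verbatim transcript.
pred_ids = model.generate(input_features=feats)
print(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
```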
Abstract: With the advent of globalization, there is an increasing demand for foreign language learning. Computer-Aided Pronunciation Training (CAPT) technologies play a pivotal role in promoting self-directed language learning, offering constant and tailored feedback to second language learners. This talk will first explore an array of modeling techniques used for Mispronunciation Detection and Diagnosis (MDD) systems, a crucial component of CAPT. Next, the talk will highlight the effectiveness of making the MDD model more aware of learners' L1 background. Finally, it will explore how an L1-aware multilingual model improves detection performance, especially for low-resource target languages.