Special Interest Group on Speech and Language Technology in Education (SLaTE) Webinar Series
This webinar series focuses on Speech and Language Technology in Education (SLaTE). SLaTE is a Special Interest Group (SIG) of the International Speech Communication Association (ISCA) and provides a platform to exchange ideas, present research, and discuss applications. Webinars take place on the first non-holiday Monday of every month at 16:00 CET. The talks are live-streamed and recorded, but please let us know if you do not feel comfortable being recorded. Links to the talks are shown below and on our YouTube channel: www.youtube.com/@ISCASIGSLaTE. You are welcome to share our webinar series!
Webinar Registration:
If you are interested in our topic and want to receive updates on our webinar series, please register via Eventbrite. You can find the registration page easily by following our Eventbrite account, SLaTE. You will receive the Zoom link two days before the event and one reminder three hours before the event.
Interested in our next webinar? Register now for the upcoming talks via https://www.eventbrite.com/cc/slate-talk-3581309
Visit SIG SLaTE Website
Join SIG SLaTE Mailing Group
https://groups.google.com/g/slate-isca
Webinar Schedule
Abstract: Child speech is characterized by larger inter- and intra-speaker variability than adult speech, partly due to vocal tract changes as children grow. In addition, there is a lack of large, publicly available datasets that can adequately train machine learning algorithms for various recognition tasks. As a result, the performance of automatic speech recognition (ASR) systems on child speech is worse than on adult speech. In this talk, I will summarize our efforts in data collection, developing data augmentation techniques, benchmarking children's speech recognition with supervised and self-supervised speech foundation models, and developing a framework for assessing children's narrative language abilities. Our studies point to the need to account for several factors when designing child speech processing systems: age (an ASR system that works well for a 9-year-old child would not necessarily work well for a 6-year-old), style (reading versus spontaneous speech), dialect (differences not only in pronunciation but also in word usage and grammar), and reading and/or language impairment. Moreover, for language assessments, a transliteration is sometimes more valuable to the teacher than a corrected transcription. As a result, data diversity, and not just quantity, is critical when designing child ASR systems. While significant progress has been made in child speech processing, several challenges remain and need to be addressed before spoken language systems are used in early literacy settings.
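The abstract above mentions data augmentation for child speech but does not specify the techniques. Purely as a hedged illustration of the general idea, the sketch below (assuming the librosa and soundfile packages and hypothetical file names) applies tempo and pitch perturbation to existing recordings, a common way to simulate some of the acoustic variability of child speech; it is not the speaker's actual method.

```python
# Illustrative sketch only: simple tempo/pitch perturbation for augmenting
# scarce child-speech data; not the speaker's actual augmentation method.
import librosa
import soundfile as sf

def perturb(in_path, out_path, rate=1.1, n_steps=2.0, sr=16000):
    """Stretch tempo by `rate` and shift pitch by `n_steps` semitones."""
    y, _ = librosa.load(in_path, sr=sr)                          # load, resample to 16 kHz
    y = librosa.effects.time_stretch(y, rate=rate)               # tempo perturbation
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)   # pitch perturbation
    sf.write(out_path, y, sr)

# Hypothetical usage: create a few perturbed variants per utterance.
for i, (r, s) in enumerate([(0.9, 1.0), (1.0, 2.0), (1.1, 3.0)]):
    perturb("adult_utt.wav", f"augmented_{i}.wav", rate=r, n_steps=s)
```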
Professor at Faculty of Science and Engineering, Department of Intelligent Information Engineering and Sciences, Doshisha University, Kyoto, Japan
Abstract: Dialogue-based CALL systems have various designs depending on learners' age, proficiency levels, social contexts, and learning goals. We designed a Joining-in-type robot-assisted language learning (JIT-RALL) system for students who seldom have opportunities for L2 communication. The JIT-RALL system shows a model conversation between two humanoid robots and invites a learner to join in, so that the learner can use specific forms of English expressions. We have explored effective training methods such as question-answering (QA) vs. repeating (RP). During the restrictive years of COVID-19, we developed a new JIT-CALL system that enabled a remote learner to converse with two characters on a server. We conducted a large-scale experiment that verified how training effectiveness depends on learners' proficiency levels: learners with low CEFR levels showed a significantly greater effect from QA training than from RP training. Although the original JIT-RALL design expected implicit learning through the model conversation, and was therefore not limited by the accuracy of ASR on learners' accented speech, recent advances in ASR and NLP technologies make it possible to give feedback on learner responses. We implemented a feedback function in the system with Whisper ASR and ChatGPT and conducted an experiment; the results showed a significantly greater learning effect with feedback than without. I will talk about the design, the performance, and the pedagogical impact.
[There is No Video Available for this Talk]
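The feedback function with Whisper ASR and ChatGPT mentioned in the abstract above is not described in detail here. The following minimal sketch (assuming the openai-whisper and openai Python packages, an API key in the environment, and illustrative model names, prompt, and file name) shows one plausible shape of such a loop; it is not the JIT-RALL implementation.

```python
# Hypothetical sketch of an ASR + LLM feedback loop; not the JIT-RALL system.
import whisper
from openai import OpenAI

asr = whisper.load_model("small")   # illustrative Whisper model size
llm = OpenAI()                      # reads OPENAI_API_KEY from the environment

def feedback(audio_path, target_form):
    """Transcribe a learner response and ask an LLM for brief corrective feedback."""
    transcript = asr.transcribe(audio_path)["text"].strip()
    prompt = (
        f"A learner of English was asked to answer using the form: '{target_form}'.\n"
        f"They said: '{transcript}'.\n"
        "Give one or two sentences of encouraging, corrective feedback."
    )
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",        # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return transcript, reply.choices[0].message.content

print(feedback("learner_response.wav", "I have been ... since ..."))  # hypothetical file
```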
Lecturer, Taif University, Saudi Arabia, and recent PhD graduate of the Speech and Hearing Research Group (SpandH) at the University of Sheffield, England
Abstract: Automatic proficiency assessment can be a useful tool in language learning, both for self-evaluation of language skills and to enable educators to tailor instruction effectively. Assessment methods often use categorisation approaches. In this work, an exemplar-based approach is chosen instead, and comparisons between utterances are made using different speech encodings. Such an approach has the advantage of avoiding formal categorisation of errors by experts. Aside from a standard spectral representation, pretrained model embeddings are investigated for their usefulness for this task. Experiments are conducted on the speechocean762 database, which provides three levels of proficiency. The data were clustered, and the performance of different representations was assessed in terms of cluster purity as well as categorisation correctness. Cosine distance with Whisper representations yielded better clustering performance than the alternatives.
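The exact encodings and clustering setup compared in this work are not reproduced here. As a rough sketch of one plausible variant (assuming the transformers, librosa, and scikit-learn packages and an illustrative Whisper checkpoint), one can mean-pool Whisper encoder states into utterance embeddings, cluster them, and compare them with cosine distance.

```python
# Rough sketch: utterance embeddings from the Whisper encoder, clustered with k-means.
# Checkpoint, pooling, and clustering details are illustrative, not the authors' setup.
import librosa
import torch
from transformers import WhisperProcessor, WhisperModel
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def embed(path):
    """Mean-pool Whisper encoder states into a single utterance vector."""
    audio, _ = librosa.load(path, sr=16000)
    feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        enc = model.get_encoder()(feats).last_hidden_state   # (1, frames, dim)
    return enc.mean(dim=1).squeeze(0).numpy()

paths = ["utt_low.wav", "utt_mid.wav", "utt_high.wav"]        # hypothetical files
X = [embed(p) for p in paths]

# Three clusters, mirroring the three proficiency levels in speechocean762.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels, cosine_distances(X))
```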
Abstract: For endangered languages, speech and language technologies offer untold opportunities – not least in the area of education, which is critical to language transmission, maintenance and revival. However, to achieve their potential impact, the development of core technologies such as ASR and TTS, and the building of educational applications based on them, need to be guided by important considerations. While these are not typical priorities for the major world languages, they are critical to ensure that the technologies are adequate, appropriate and useful to the endangered-language community. These considerations are discussed and illustrated in the light of the ABAIR project’s experience with Irish (Gaelic). Of importance are: (i) sociolinguistic awareness, such as addressing the fact that the endangered language is unlikely to have a spoken standard, but rather a number of widely different dialects; (ii) linguistic knowledge, given that the language structure may dictate how an educational application is built and that mirroring an application available for English may be highly inappropriate; (iii) clear pedagogical targeting that explores the acquisition process for the learner of the specific language; and above all, (iv) close collaboration with the communities and end-users at every stage of technology development and application building. Ultimately, a holistic, interdisciplinary approach is proposed. The local limitations confronting specific endangered languages can be very extreme, and time is running out. It is suggested that pooling guidelines, expertise, experiences and resources would benefit all. A practical proposal is to establish a SLaTE – SEAGUL joint initiative that embraces groups actively working with endangered languages, such as the Endangered Languages Documentation Programme (ELDP), the Network to Promote Linguistic Diversity (NPLD) and the Language Technology for All movement (LT4All), to promote collaborations that will harness the potential of the new technologies and educational applications for endangered languages.
Professor of Engineering and Language Education, the University of Tokyo, Bunkyō, Japan
Abstract: With recent advancements in speech technology, pronunciation training courseware is available and runs even on smartphones. Learners' speaking behaviors are measured and assessed automatically; in this talk, the lecturer focuses on how to measure and assess their listening behaviors. Researchers of second language acquisition claim that input (perception) training is much more important than, and should be given prior to, output (production) training. Since listening is a mental phenomenon, however, it seems possible to measure listening behaviors only with expensive brain-sensing techniques. In this talk, based on characteristics of the human brain, a pedagogically valid and inexpensive technique for "acoustic" measurement of listening behaviors, which can detect listening breakdown, is proposed. The technique is then applied to L2 "aural" training by measuring learners' behaviors and to L2 "oral" training by measuring raters' behaviors. Finally, the lecturer shows an interesting example of applying the technique to calculate the global communicability of individual learners talking with and listening to speakers of global Englishes.
Relevant information:
A project on listening disfluency measurement:
https://sites.google.com/g.ecc.u-tokyo.ac.jp/listening-disfluency
https://drive.google.com/file/d/1tQ4vlOurBmaax6HEomRIYx6T-RRGJ__R/view?usp=share_link
Senior ML Engineer at CluePoints, Belgium, and Scientific Collaborator at ISIA Lab, Numediart Institute, the University of Mons, Mons, Belgium
Abstract: In this talk, I will present two complementary approaches to advancing speech technology for educational applications, particularly in pronunciation training systems. The first approach, detailed in the paper "TIPAA-SSL", introduces a novel methodology for text-independent phone-to-audio alignment, leveraging self-supervised learning and phoneme recognition.
We build on top of a wav2vec2 model pre-trained on many languages and already fine-tuned for phoneme sequence prediction. A pipeline of shallow ML models and algorithms is used to predict phones and phone boundaries from the latent representations of this model, and it can be adapted with little data to a chosen phone set. This approach significantly improves alignment accuracy across different native English accents, a critical feature for unbiased pronunciation feedback in language learning applications.
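The TIPAA-SSL pipeline of shallow models is not reproduced here. As a loose sketch of the underlying idea (assuming the transformers and librosa packages and a publicly available phoneme-recognition checkpoint rather than the authors' own models), frame-level phoneme predictions from a fine-tuned wav2vec2 model can be collapsed into rough phone labels and boundaries on the model's ~20 ms frame grid.

```python
# Loose sketch: frame-level phoneme predictions and rough boundaries from a
# wav2vec2 model fine-tuned for phoneme recognition. The checkpoint is illustrative;
# the paper's shallow alignment models are not reproduced here.
import itertools
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

ckpt = "facebook/wav2vec2-lv-60-espeak-cv-ft"               # multilingual phoneme CTC model
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt).eval()

audio, _ = librosa.load("utterance.wav", sr=16000)          # hypothetical file
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits[0]           # (frames, vocab)

frame_ids = logits.argmax(dim=-1).tolist()
frame_dur = 0.02                                            # ~20 ms per wav2vec2 frame

# Collapse repeated frame labels into (phone, start, end) spans, dropping CTC blanks.
t = 0.0
for phone_id, group in itertools.groupby(frame_ids):
    n = len(list(group))
    if phone_id != processor.tokenizer.pad_token_id:        # pad token acts as CTC blank
        phone = processor.tokenizer.convert_ids_to_tokens(phone_id)
        print(phone, round(t, 2), round(t + n * frame_dur, 2))
    t += n * frame_dur
```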
The second approach, described in the "MUST&P-SRL" paper, focuses on the extraction of linguistic features, emphasizing automatic syllabification across multiple languages. This methodology ensures compatibility with existing forced-alignment tools like the Montreal Forced Aligner (MFA) and enables consistent segmentation of both text and phonetic data. The resulting unified syllabification and stress annotation techniques are essential for creating accurate and reliable speech content for educational tools.
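The MUST&P-SRL syllabification procedure itself is not detailed in the abstract. The toy sketch below, which syllabifies a sequence of ARPAbet phones (the phone set used by MFA's English models) with a deliberately simplistic one-consonant-onset rule, is shown only to illustrate the kind of output such a tool produces; it is not the paper's algorithm.

```python
# Toy syllabifier over ARPAbet phones, for illustration only;
# the onset rule is deliberately simplistic and is not the MUST&P-SRL algorithm.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

def syllabify(phones):
    """Group phones into syllables; each non-initial vowel takes one preceding consonant as onset."""
    nuclei = [i for i, p in enumerate(phones) if p.rstrip("012") in VOWELS]
    syllables, start = [], 0
    for prev, cur in zip(nuclei, nuclei[1:]):
        boundary = cur - 1 if cur - prev > 1 else cur   # leave one consonant as the onset
        syllables.append(phones[start:boundary])
        start = boundary
    syllables.append(phones[start:])
    return syllables

# 'permit': P ER1 M IH0 T  ->  [['P', 'ER1'], ['M', 'IH0', 'T']]
print(syllabify(["P", "ER1", "M", "IH0", "T"]))
```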
Abstract: Along with growing interest in applying social robots in the education sector, a new technology-based field of language education has emerged, called 'robot-assisted language learning' (RALL). RALL has developed rapidly in second language learning, especially driven by the need to compensate for the shortage of first-language tutors. There are many implementation cases and studies of social robots, from early government-led attempts in Japan and South Korea to increasing research interest in Europe and worldwide. Compared with RALL used for English as a foreign language (EFL), however, there are fewer studies on applying RALL to teaching Chinese as a foreign language (CFL). One potential reason is that RALL is not well known in the CFL field. This talk attempts to fill this gap by addressing the balance between classroom implementation and research frontiers of social robots. The review first introduces the technical tool used in RALL, namely the social robot, at a high level. It then presents a historical overview of real-life implementations of social robots in language classrooms in East Asia and Europe, and provides a summary of the evaluation of RALL from the perspectives of L2 learners, teachers and technology developers. The overall goal of this talk is to gain insights into RALL's potential and challenges and to identify a rich set of open research questions for applying RALL to CFL. It is hoped that the review may inform interdisciplinary analysis and practice for scientific research and front-line teaching in the future.
Research Associate, Cambridge University Institute for Automated Language Teaching and Assessment (ALTA).
Abstract: Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. Typically, this is done in a series of steps: first, spoken words are turned into text using automatic speech recognition (ASR); then any disfluencies in the speech (such as repetitions, hesitations and false starts) are removed; and finally, the grammatical errors are corrected. However, a potential problem with this method is that errors can propagate from one step to the next. In this presentation, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework (ASR, disfluency removal, and GEC) or only part of it, e.g., ASR only or disfluency removal only. These end-to-end approaches are compared to more standard cascaded approaches on data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the presentation discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.
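Neither the cascaded nor the end-to-end system from this presentation is specified in code here. The sketch below (using Hugging Face's Whisper classes, an illustrative checkpoint, and an invented training pair) conveys only the core idea of the end-to-end variant: fine-tune Whisper on pairs of learner audio and corrected, fluent transcripts so that the decoder emits corrected text directly.

```python
# Sketch of the end-to-end idea: train Whisper to emit corrected, fluent text
# directly from learner audio. Checkpoint, data, and training setup are illustrative.
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# One invented training pair: disfluent, errorful speech -> corrected target text.
audio, _ = librosa.load("learner_utt.wav", sr=16000)          # hypothetical file
target = "I have lived in London for three years."            # corrected, fluent reference

feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
labels = processor.tokenizer(target, return_tensors="pt").input_ids

# A single gradient step; a real system would loop over a GEC-annotated speech corpus.
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
optim.zero_grad()
loss = model(input_features=feats, labels=labels).loss
loss.backward()
optim.step()

# After fine-tuning, generation yields corrected text rather than a verbatim transcript.
pred_ids = model.generate(input_features=feats)
print(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
```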
Abstract: With the advent of globalization, there is an increasing demand for foreign language learning. Computer-Aided Pronunciation Training (CAPT) technologies play a pivotal role in promoting self-directed language learning, offering constant and tailored feedback to second language learners. This talk will first explore an array of modeling techniques used for Mispronunciation Detection and Diagnosis (MDD) systems, a crucial component of CAPT. Next, the talk will highlight the effectiveness of making the MDD model more aware of learners' L1 background. Finally, it will explore how an L1-aware multilingual model improves detection performance, especially for low-resource target languages.