The Second Language University Speech Intelligibility corpus was created by Northern Arizona University, The Pennsylvania State University, and the University of Texas at Dallas. It consists of 10.5 hours of speech by 66 international faculty and university students (34 female, 32 male) from 15 different language backgrounds at 10 universities in North America.
Data were collected in 2021 and 2022, during which time speakers recorded 127 speech events at home and in classrooms. The recordings are all monologic and contain speakers’ presentations, descriptions, reflections, and microteaching tasks. Some include a practice recording and/or a final, in-class recording. Speakers were recruited from courses at Intensive English Programs (IEPs) and from oral skills courses for international graduate students seeking to become International Teaching Assistants (ITAs).
The corpus includes WAV audio files, orthographic transcriptions for all recordings, and intelligibility scores for 73 per cent of the files. Aligned Praat TextGrids with word-level segmentation and pauses greater than 0.4 seconds are also included. Each recording is labeled for four background variables: gender, L1, Type (IEP, ITA, Highly Intelligible), and two intelligibility scores.
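As an illustration of the 0.4-second pause criterion, the sketch below finds long gaps between consecutive word intervals. It assumes the word tier has already been read into a list of (start, end, word) tuples sorted by time; that input shape and the function name `long_pauses` are hypothetical, and the corpus TextGrids may mark pauses differently (for example, as their own labeled intervals).

```python
PAUSE_THRESHOLD_S = 0.4  # threshold stated in the corpus description


def long_pauses(words, threshold=PAUSE_THRESHOLD_S):
    """Return (pause_start, pause_end) pairs for gaps longer than `threshold`.

    `words` is a hypothetical list of (start, end, word) tuples in seconds,
    sorted by start time.
    """
    pauses = []
    for (_, prev_end, _), (next_start, _, _) in zip(words, words[1:]):
        # A pause is the silence between the end of one word
        # and the start of the next.
        if next_start - prev_end > threshold:
            pauses.append((prev_end, next_start))
    return pauses


if __name__ == "__main__":
    demo = [(0.0, 0.5, "hello"), (1.0, 1.4, "world"), (1.5, 2.0, "again")]
    print(long_pauses(demo))
```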
The first intelligibility score is transcription-based: five human listeners transcribed one sentence of 10-15 words and three phrases of 3-5 words. The listeners had only one chance to listen. The transcription scores were computed by comparing each listener transcription to a gold standard transcription using Bosker’s (2021) fuzzy string matching approach.
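The general idea of scoring a listener transcription against a gold standard can be sketched as follows. This is not the corpus authors' scoring code and not Bosker's implementation: it substitutes the similarity ratio from Python's standard `difflib` module as a stand-in fuzzy matcher, and the normalization step, the 0-100 scale, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from statistics import mean


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (assumed preprocessing)."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def fuzzy_score(listener, gold):
    """Similarity between a listener transcript and the gold transcript, 0-100.

    Uses difflib's ratio as a stand-in for a fuzzy string matching metric.
    """
    matcher = SequenceMatcher(None, normalize(listener), normalize(gold))
    return 100.0 * matcher.ratio()


def item_score(listener_transcripts, gold):
    """Average the fuzzy score over all listeners who transcribed one item."""
    return mean(fuzzy_score(t, gold) for t in listener_transcripts)


if __name__ == "__main__":
    gold = "the students presented their results"
    listeners = [
        "The students presented their results.",
        "the student presented the results",
    ]
    print(round(item_score(listeners, gold), 1))
```

A character-level ratio gives partial credit for near misses such as a missing plural ending, which is the main motivation for fuzzy matching over exact word matching.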
The second intelligibility score is the average of 15 listeners’ ratings on a 5-point scale (see the documentation for the exact scale).
Recordings are one- or two-channel, 48 kHz, 16-bit WAV files. Transcripts are UTF-8 encoded plain text.
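A user who wants to confirm that a downloaded recording matches the stated format could inspect its header with Python's standard `wave` module, as in this minimal sketch. The function names and the file path in the usage section are hypothetical; only the expected values (one or two channels, 48 kHz, 16-bit) come from the description above.

```python
import wave


def describe_wav(path):
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_rate_hz": wf.getframerate(),
            "bit_depth": wf.getsampwidth() * 8,  # sample width is in bytes
            "duration_s": wf.getnframes() / wf.getframerate(),
        }


def matches_corpus_format(info):
    """Check the header against the format stated in the corpus description."""
    return (
        info["channels"] in (1, 2)
        and info["sample_rate_hz"] == 48000
        and info["bit_depth"] == 16
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_wav.py path/to/recording.wav (path is hypothetical)
    for wav_path in sys.argv[1:]:
        info = describe_wav(wav_path)
        print(wav_path, info, "OK" if matches_corpus_format(info) else "MISMATCH")
```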