Corpus Repository

ALSL Spoken Corpora

The NAU Speech Lab has several corpora available to NAU students and to external researchers; access is granted on a per-user, per-purpose basis. Each corpus includes text transcripts and audio files. Corpora that also contain speech analysis (e.g., pauses, runs, intonation) are noted individually.

Find the Data-Driven Learning for Pronunciation concordancing tool here.

Are you a non-NAU researcher interested in using one or more of the corpora? Complete this application form for your access request to be considered by the NAU Speech Lab director.

Contact the director, Okim Kang (Okim.Kang@nau.edu), for more information.

Available Corpora

NAU Second Language University Speech Intelligibility (SLUSI) Corpus

Description:

The Second Language University Speech Intelligibility corpus was created by Northern Arizona University, The Pennsylvania State University, and University of Texas at Dallas. It consists of 10.5 hours of speech by 66 international faculty and university students (34 female, 32 male) from 15 different language backgrounds at 10 universities in North America.

 

Data were collected in 2021 and 2022, during which 127 speech events were recorded by speakers at home and in classrooms. The recordings are all monologic and contain speakers’ presentations, descriptions, reflections, and microteaching tasks. Some include a practice recording and/or a final, in-class recording. Speakers were recruited from courses at Intensive English Programs (IEPs) and from oral skills courses for international graduate students seeking to become International Teaching Assistants (ITAs).

 

The corpus includes WAV audio files, orthographic transcriptions for all recordings, and intelligibility scores for 73 per cent of the files. Aligned Praat TextGrids with word-level segmentation and pauses greater than 0.4 seconds are also included. Each recording is labeled with background variables (gender, L1, and type: IEP, ITA, or Highly Intelligible) and two intelligibility scores.

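For researchers who want to work with the aligned TextGrids programmatically, the sketch below shows one way the silent-pause intervals might be pulled out with the Python textgrid package. The tier index, the convention of marking pauses with empty interval labels, and the file name are assumptions for illustration; consult the corpus documentation for the actual annotation scheme.

    import textgrid  # pip install textgrid

    def extract_pauses(path, min_dur=0.4):
        """Return (start, end) times of silent intervals of at least min_dur seconds."""
        tg = textgrid.TextGrid.fromFile(path)
        word_tier = tg[0]  # assumption: the first tier holds the word-level segmentation
        pauses = []
        for interval in word_tier:
            duration = interval.maxTime - interval.minTime
            # assumption: pauses are unlabeled (empty-mark) intervals on the word tier
            if not interval.mark.strip() and duration >= min_dur:
                pauses.append((interval.minTime, interval.maxTime))
        return pauses

    # Hypothetical file name:
    # print(extract_pauses("speaker_001.TextGrid"))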

The transcription-based score is derived from five human listeners each transcribing one sentence of 10-15 words and three phrases of 3-5 words; listeners heard each item only once. Transcription scores were computed by comparing each listener transcription to a gold-standard transcription using Bosker’s (2021) fuzzy string match approach.

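As an illustration of the general idea only (not the exact procedure, text normalization, or averaging used for the released scores), a fuzzy comparison of a listener transcription against the gold standard can be sketched with the rapidfuzz library:

    from rapidfuzz import fuzz  # pip install rapidfuzz

    def transcription_match(listener_text, gold_text):
        """Illustrative fuzzy-match score (0-100) between a listener transcription
        and the gold-standard transcription; higher means closer agreement.

        The corpus scores follow Bosker (2021), which may differ in details such as
        text cleaning and how scores are combined across items and listeners.
        """
        return fuzz.token_sort_ratio(listener_text.lower(), gold_text.lower())

    # Similar transcriptions yield high scores, unrelated ones low scores:
    # transcription_match("the cat sat on a mat", "the cat sat on the mat")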

The second intelligibility score is the average of 15 listeners’ ratings on a 5-point scale (see the documentation for the exact scale).

 

Recordings are one- or two-channel, 48 kHz, 16-bit WAV files. Transcripts are UTF-8 encoded plain text.

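As a quick sanity check when working with the files, the audio format and transcript encoding can be verified with the Python standard library alone (file names below are hypothetical):

    import wave

    # Inspect a recording's format.
    with wave.open("speaker_001.wav", "rb") as w:
        print(w.getnchannels())       # 1 or 2 channels
        print(w.getframerate())       # expected 48000 Hz
        print(w.getsampwidth() * 8)   # expected 16 bits per sample

    # Transcripts are plain text; open them with an explicit UTF-8 encoding.
    with open("speaker_001.txt", encoding="utf-8") as f:
        transcript = f.read()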
NAU Cambridge ESOL Monologic Corpus

Description:

Cambridge English Language Assessment provided 120 examinee responses to a monologic task, produced by “average” test takers (i.e., those scoring 75 out of 100 or higher) at each of four proficiency levels. Responses include only the first minute of the monologic task. Tasks are largely picture description at all proficiency levels.


Participants:

Of the 120 participants, 76 were female and 44 were male, and L1s were diverse. The most frequent L1s were Spanish (16), Korean (11), Italian (8), Dutch (7), French (6), and Chinese (5).


Proficiency levels:


Transcriptions available:


Publication(s):

Johnson, D., Kang, O., & Ghanem, R. (2016). Improved automatic English proficiency rating of unconstrained speech with multiple corpora. International Journal of Speech Technology, 19(4), 755-768. DOI: 10.1007/s10772-016-9366-0

Johnson, D., Kang, O., & Ghanem, R. (2016). Language proficiency rating: Human versus Machine. Proceedings of the Pronunciation in Second Language Learning and Teaching, 7, 119-129.  

Kang, O. (2013). Linguistic analysis of speaking features distinguishing general English exams at CEFR levels B1 to C2 and examinee L1 backgrounds. Research Notes, 52, 40-48. http://www.cambridgeenglish.org/images/139525-research-notes-52-document.pdf

Kang, O. (2013). Relative impact of pronunciation features on non-native speakers’ oral proficiency. In J. Levis & K. LeVelle (Eds.), Proceedings of the Pronunciation in Second Language Learning and Teaching. Iowa State University.

Kang, O., & Johnson, D. (2018). Automated English proficiency scoring of unconstrained speech using prosodic features. Speech Prosody. https://www.isca-speech.org/archive/SpeechProsody_2018/pdfs/105.pdf

Kang, O., & Johnson, D. (2018). Contribution of suprasegmental to English speaking proficiency: Human rater and automated scoring system. Language Assessment Quarterly, 15(2), 150-168.

Kang, O., & Moran, M. (2014). Pronunciation features in non-native speakers’ oral performances. TESOL Quarterly, 48, 173-184.

Kang, O., & Yan, X. (2018). Linguistic features distinguishing examinees’ speaking performances at different proficiency levels. Journal of Language Testing and Assessment, 1, 24-39, DOI: 10.23977/langta.2018.11003.

NAU Cambridge ESOL Dialogic Corpus

Description:

Cambridge English Language Assessment provided 39 examinee responses to a dialogic task of about 2 minutes. Tasks include describing a picture (PET, FCE, and CAE) or expressing opinions on abstract topics (CPE).


Participants:

No participant data other than proficiency 


Proficiency levels:


Transcriptions available:


Publication(s):

Kang, O., & Wang, L. (2014). Impact of Different Task Types on Candidates’ Speaking Performances. Research Notes, 57, 40-49.

Kang, O., Larson, G., & Goo, S. (2019).  Interaction features predicting examinees’ speaking performances at different proficiency levels. Journal of Language Testing and Assessment, 2, 1-12. https://www.clausiuspress.com/article/294.html


NAU ETS TOEFL Speaking Task Response Corpus

Description:

Educational Testing Service provided 106 responses by 28 male L2 English examinees to four TOEFL “integrated” speaking tasks. Responses are one minute in length. Task prompts are not available.


Participants:

Available upon request


Transcriptions available:


Publication(s):

Kang, O., & Rubin, D. (2012). Intra-rater reliability of oral proficiency ratings. Journal of Educational and Psychological Assessment, 12, 43-61. https://sites.google.com/site/tijepa2012/vol-11-2/vol-12-1

Kang, O. & Pickering, L. (2013). Using acoustic and temporal analysis for assessing speaking. In A. Kunnan (Ed.), Companion to Language Assessment (pp.1047-1062). Wiley-Blackwell.

Kang, O., & Pickering, L. (2011). The role of objective measures of suprasegmental features in judgments of comprehensibility and oral proficiency in L2 spoken discourse. Speak Out, 44, 4-8.

Kang, O., Rubin, D., & Kermad, A. (2019). Effect of training and rater individual differences on oral proficiency assessment. Language Testing, 36 (4),  481-504.

Kang, O., Rubin, D., & Pickering, L. (2010). Judgments of ELL proficiency in oral English and acoustic measures of accentedness. Modern Language Journal, 94, 554-566.

Rubin, D., Kang, O., & Pickering, L. (2008). Relative impact of rater intercultural and language background, rater attitudes, rater training, and measurable elements of pronunciation on TOEFL iBT speaking proficiency scoring. ETS Research Reports. Princeton, NJ: Educational Testing Service.

NAU IELTS Longitudinal Corpus

Description:

52 learners completed the IELTS exam before and after a 3-month IELTS preparation course. Their speech was captured and analyzed by general proficiency level (beginner, intermediate, advanced) and by CEFR level (B1, B2, C1). The resulting 104 files are 14-20 minutes long and of varying quality.


Participants:

No participant data other than proficiency 


Proficiency levels:



Transcriptions available:


Publication(s):

(under review)

NAU Young Learners from Mexico Corpus

Description:

60 young learners were recorded listening to and retelling a story in English. The retelling portion of the audio files ranges from 15 to 60 seconds; the overall audio files, including the examiner’s reading of the prompt, are 3 to 5 minutes in length.


Participants:

No participant data other than proficiency 


Proficiency levels:

No participant data other than proficiency 


Transcriptions available:

Transcriptions are not available


Publication(s):

Kang, O., Ahn, H., Yaw, K., & Chung, S. (forthcoming). Impact of Test-Taker’s Background on Score Gain on IELTS. Language Testing.

Kang, O., Ahn, H., Yaw, K., & Chung, S. (2021). Investigation of relationship among learner background, linguistic progression, and score gain on IELTS.  The IELTS Research Report Series. https://www.ielts.org/-/media/research-reports/ielts-rr_2021-1_kang-et-al.ashx

NAU ITA Lecture Corpus

Description:

11 International Teaching Assistants (ITAs) and 4 US L1 English teaching assistants were recorded while giving undergraduate lectures at a US university. Recordings are 3-4 minutes in length and cover a wide variety of disciplines. The recordings contain only the speech of the teaching assistant.


Participants:

No participant data available


Transcriptions available:

Transcriptions are not available


Publication(s):

Kang, O., Rubin, D., & Lindemann, S. (2015). Using contact theory to improve US undergraduates’ attitudes toward international teaching assistants. TESOL Quarterly, 49, 681-706.

Kang, O. (2008). The effect of rater background characteristics on the rating of International Teaching Assistants Speaking Proficiency. Spaan Fellow Working Papers, 6, 181-205.

Kang, O. (2010). Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. System, 38, 301-315.

Kang, O. (2012). Impact of rater characteristics on ratings of international teaching assistants’ oral performance. Language Assessment Quarterly, 9, 249-269.

Corpora Licensed by the ALSL

TIMIT Corpus

CSLU: Foreign Accented English