A lot of research questions can be answered with existing data. A linguistic corpus is a large, structured collection of texts or spoken language samples that are used for linguistic research and analysis. These corpora serve as valuable resources for studying language patterns, usage, and evolution. By analyzing a corpus, linguists can uncover insights into grammar, vocabulary, and language variation across different contexts and populations. Whether it's for developing language models, creating dictionaries, or understanding language change over time, a linguistic corpus provides a rich foundation for exploring the complexities of human language.
The Speech Lab is the administrative home for many corpora of recorded speech. If you are interested in accessing these corpora, please follow the provided links or contact the appropriate Principal Investigator (PI).
Date Collected: 2009
Language: English
Location: Lafourche Parish, Louisiana
Speaker Information: 17 Speakers, all Cajun (white) males, aged 32-83. Monolingual English speakers, Semi Speakers of Cajun French (cf. Dorian 1981), Bilingual French-English speakers, and L2 speakers of English (L1 Cajun French). One additional female speaker's jokes are included in the corpus, but her interview is not transcribed.
Recording Information: Participants wore Shure SM 10A headset microphones and were recorded by a ZOOM H4 portable digital recorder; a separate Crown Audio Sound Grabber II microphone was set up to record other speech and interaction.
Tasks: Full interviews consisting of casual conversation lasted 30 to 90 minutes; 15-minute segments selected and transcribed in PRAAT and FAVE-aligned. Participants were also asked to tell Boudreaux and Thibodeaux jokes, which are told with exaggerated stereotypical Cajun English accents -- included in this corpus are recordings of 7 speakers telling 33 jokes. Only some are transcribed, but each joke is categorized with a general title and whether it is a Boudreaux and Thibodeaux joke (some more general jokes were also told).
Date Collected: 2006-2008
Language: French
Location: Terrebonne and Lafourche Parishes, Louisiana
Speaker Information: 28 Speakers, all Pointe-Au-Chien Indians, aged 28-73, split evenly according to gender across three speaker groups:
12 Older Fluent
8 Younger Fluent
8 Semi Speakers, or non-fluent speakers of French, common in situations of language death (cf. Dorian 1981)
Recording Information: Participants and interviewer wore Lavalier microphones. Interviews were conducted with PI, a local "insider" interviewer, and participants.
Tasks: Full interviews consisting of casual conversation lasted one to two hours; 15-45 minute segments selected and transcribed in MS Word. Participants were also asked to translate 50 short sentences in English to French.
Date Collected: 2012; 2016-2023
Language: English
Location: New Orleans, Louisiana and Chalmette, Louisiana
Speaker Information: Total of 192 speakers
New Orleans subcorpus: 135 Black, white, and Creole speakers. 64 men and 71 women. Birthyears: 1917-2002. All grew up in New Orleans, Louisiana. Collected 2016-2023 by Katie Carmichael, Nathalie Dajko (Tulane), Dana Serditova (Uni Freiburg), Lucia Paternostro (Tulane), Shawanda Marie (Independent).
Chalmatian subcorpus: 57 white, working-class speakers. 32 women and 25 men. Birthyears: 1927-1994. All grew up in and around the town of Chalmette in St. Bernard Parish, though half of the corpus had relocated after Katrina (most to St. Tammany Parish on the Northshore of Lake Pontchartrain, some elsewhere in Greater New Orleans). Collected 2012 by Katie Carmichael.
Recording Information: Participants wore Shure SM 10A headset microphones and were recorded by a ZOOM H4 portable digital recorder; a separate Crown Audio Sound Grabber II microphone was set up to record other speech and interaction.
Tasks: Full interviews consisting of casual conversation lasted one to four hours; 15-45 minute segments selected, transcribed as TXT files, and FAVE-aligned. At the end of the interview, metalinguistic commentary was elicited; some of it is transcribed in MS Word. Reading passage and word list data are available for most participants, both of which were also FAVE-aligned. Thus, for most participants, a WAV audio file, TXT transcription, and PRAAT TextGrid with time-aligned phonetic transcription are available for 3 speech conditions.
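For anyone planning to work with the time-aligned files, the following is a minimal Python sketch of how labeled intervals might be pulled out of a TextGrid. It assumes the TextGrids are saved in Praat's long text format and ignores point tiers and the short format. The function name `parse_intervals`, the `SAMPLE` text, and the tier names "word" and "phone" are illustrative assumptions, not part of the corpus documentation.

```python
import re

# One interval in Praat's long TextGrid text format.
# Quotes inside a label are written doubled ("") by Praat.
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.eE+-]+)\s*'
    r'xmax = ([\d.eE+-]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)
TIER_NAME = re.compile(r'name = "((?:[^"]|"")*)"')


def parse_intervals(textgrid_text):
    """Return (tier, start, end, label) tuples for every non-empty interval."""
    rows = []
    # Each tier starts with a line such as 'item [1]:'; chunk 0 is the file header.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name_match = TIER_NAME.search(chunk)
        tier = name_match.group(1) if name_match else ""
        for m in INTERVAL.finditer(chunk):
            label = m.group(3).replace('""', '"').strip()
            if label:  # skip unlabeled stretches (e.g. pauses)
                rows.append((tier, float(m.group(1)), float(m.group(2)), label))
    return rows


# Hypothetical miniature TextGrid, for illustration only.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "word"
        xmin = 0
        xmax = 1.5
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = ""
        intervals [2]:
            xmin = 0.5
            xmax = 1.0
            text = "HELLO"
        intervals [3]:
            xmin = 1.0
            xmax = 1.5
            text = "THERE"
    item [2]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0
        xmax = 1.5
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.5
            text = ""
        intervals [2]:
            xmin = 0.5
            xmax = 0.75
            text = "HH"
'''

for row in parse_intervals(SAMPLE):
    print(row)
```

For a real file, read its contents into a string first. Praat can save TextGrids as UTF-8 or UTF-16, so the encoding may need to be specified when opening the file.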
Date Collected: 2012-2013
Language: English
Location: Columbus, Ohio and London, England
Speaker Information: 97 speakers
19 English Expatriates in Columbus (12 male & 9 female, aged 20-71)
21 American Expatriates in London (4 male & 17 female, aged 23-74)
13 English Fans of NFL teams (13 male, aged 23-41)
16 American Fans of EPL teams (15 male & 1 female, aged 21-51)
11 English Controls (7 male & 4 female, aged 18-48)
14 American Controls (3 male & 11 female, aged 18-59)
Data breakdown:
Wordlist: 280 words/phrases (blocked by theme).
Listening in Noise: 128 sentences were presented in SBE and SAE; participants transcribed what they could hear.
Interviews: averaging 45 minutes (range: 12-92 minutes).
Equipment used: Shure 54 head-worn microphones; ZOOM H4N portable digital recorder (44100 Hz, 16 bit).
Recording quality: Varied. Mostly good, but some recordings occurred in noisy environments; occasional recording problems.
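Because recording quality varies, it may be worth confirming that a given file matches the stated format (44100 Hz, 16 bit) before running acoustic analyses. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files. The function name `describe_wav` and the file name `interview.wav` are hypothetical.

```python
import wave


def describe_wav(path):
    """Report the basic recording parameters of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


# Example call (hypothetical file name):
# info = describe_wav("interview.wav")
# matches_spec = info["sample_rate_hz"] == 44100 and info["bit_depth"] == 16
```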
This corpus is the first public corpus of African American Language (AAL) data. It features recorded speech from regional varieties of AAL and includes audio recordings along with time-aligned orthographic transcriptions from over 200 sociolinguistic interviews with speakers born between 1888 and 2005. Dr. Charlie Farrington is among the researchers who compiled the corpus.
If you're looking for other corpora to work with (maybe for your Language Sciences Minor capstone project!), here are links to some external corpora!
Note: If access to the corpus requires contacting a researcher personally, we strongly recommend that you have already discussed this project with a faculty member.
American English Dialectal Recordings
118 hours of recorded English from American speakers; a good general dialectal library for comparing US regions.
1 million words of time-aligned speech from the Appalachian English dialect. Additionally, the University of South Carolina has in-depth research details about Appalachian English.
100-million-word corpus of spoken and read British English from the late 20th century.
Free corpus of speakers from Columbus, Ohio, provided by Ohio State University.
Child Language Data Exchange System (CHILDES)
This corpus contains data from children around the world, focusing on child phonology. The data are grouped by language family, with bilingual speakers grouped separately, and the corpus is a great resource for language acquisition or syntax exploration.
Great for variation in English in North America. The archive is the same as the COCA archive.
Corpus of Contemporary American English (COCA)
The most-used American English corpus. It covers spoken language, fiction, magazines, newspapers, and academic sources from 1990-2017. To limit a search to one of these genres, use the "Sections" button to narrow the search, then enter a specific word or phrase. You may need an account to save certain things, but it is free.
Corpus of Historical American English (COHA)
This corpus contains over 400 million words of text from 1810-2000. It is organized by genre and decade, which can be useful for research in language change in America.
English Medical Corpus From the Web
A corpus with search engine tools to explore the language of the English medical field. Texts are collected from online medical resources, and the corpus has around 300 million words.
International Corpus of English
Corpus of global Englishes, mainly from places that have recognized national varieties.
International Corpus of Learner English
Corpus of accented English from speakers of various languages around the world. Some languages are complete and have links; for others, you may need to contact the people in charge of that language, who are all listed on the website. Great resource for those who want to do experiments on accented English.
A research repository for linguistic data, hosted by the University of Pennsylvania. Searchable by language, corpus name, year, and more.
A corpus based on Time Magazine articles from 1923-2006, around 100 million words. Great for language change in print.
This corpus takes 1 billion words from Wikipedia pages, which can be used to count word frequency or possible discourse about certain topics.
American Indian Studies Research Institute
A corpus of texts in endangered Native American languages. Includes Arikara, Skiri Pawnee, and South Band Pawnee texts -- all Northern Caddoan languages.
Bavarian Archive for Speech Signals
Primarily German corpus that also includes German Sign Language, telephone speech, drunk speech, and more. Good for multi-modal studies and research.
Child Language Data Exchange System (CHILDES)
This corpus contains data from children around the world, focusing on early child phonology. The data are grouped by language family, with bilingual speakers grouped separately, and the corpus is a great resource for language acquisition or syntax exploration.
Corpus of data in English and Spanish. Includes historical changes and passage reading. Helpful for those who would like to look at Spanish linguistics.
A research repository for linguistic data, hosted by the University of Pennsylvania. Searchable by language, corpus name, year, and more.
The archive contains various types of materials, including audio and video language corpus data from languages around the world; photographs, notes, experimental data, and other relevant information required to document and describe languages and how people use them; records of speech in everyday conversations from endangered and under-studied languages, and linguistic phenomena; experimental stimuli and data.