Corpora

Corpus Analysis

AntConc - widely used corpus access software ('concordancer')
Voyant - online tool for basic text and corpus analysis
Corpus Linguistics with R (LADAL, U Queensland)
KorPlus - interactive introduction to corpus basics (U Bamberg)
Text Mining with R (online textbook)

Corpus Databases

Corpus Resource Database (CoRD)

Learner Corpora

LEONIDE (Longitudinal lEarner cOrpus iN Italiano, Deutsch, English; Reference article): Third Language Acquisition (mainly L1 German, L2 Italian, L3 English; L1 Italian, L2 German, L3 English; 2500 texts from 163 learners at school level)
International Corpus of Learner English (ICLE): Essays written by university students of English with various L1s
Louvain International Database of Spoken English Interlanguage (LINDSEI): Interviews with university students of English with various L1s
The International Corpus Network of Asian Learners of English (ICNALE): 4,000+ dialogues, monologues and essays from university students in ten countries/regions in Asia (China, Hong Kong, Indonesia, Japan, Korea, Pakistan, the Philippines, Singapore/Malaysia, Taiwan, and Thailand) as well as from English native speakers
Open Cambridge Learner Corpus (2.9 million words from over 10,000 student responses taken from the Cambridge English Language Assessment suite of exams, 7 different L1s)
EFCAMDAT (33 million words from 85,000 learners; no L1 information, but nationality; 37% Brazil, 19% China, 9% Russia, plus Mexico, Germany, France, Italy, Saudi Arabia, Taiwan and Japan, and ~160 other nationalities below 2%; various levels of proficiency as well as some longitudinal data)
DESI (Deutsch Englisch Schülerleistungen International, 2003/4). Data from > 10,000 German 9th graders and classroom observation data (downloadable, but not necessarily in an easily usable corpus format)
Mehrsprachigkeitsentwicklung im Zeitverlauf (MEZ). Data from 2,103 school students in Germany; text production in German, English and heritage languages, as well as extensive background data
Hamburger Schulleistungsstudie zu Kompetenzen und Einstellungen von Schülerinnen und Schülern - Jahrgangsstufe 4-7 (KESS 4-7), Jahrgangsstufe 8 (KESS 8), Jahrgangsstufe 10-13 (KESS 10-13), Aspekte der Lernausgangslage und der Lernentwicklung (LAU) - data on performance in several subjects, incl. English, and a large number of background variables, for several thousand school students [not a learner corpus, but relevant for the analysis of learner language/SLA]

Corpora for Sociolinguistics

British National Corpus 2014 (metadata for speakers and texts can be downloaded)
[Hint: To download speaker codes, proceed as follows: Go to restricted query, click "choose action" in the top left corner and select "tabulate". In the drop-down menu under "attribute", choose "u_who (Speaker)" and download.]
Freiburg Corpus of English Dialects (FRED) (2.5 million words from 372 speakers; Sampler is publicly available)
Friends Corpus (all dialogues of the US TV series 'Friends')
Cornell Movie Dialogues Corpus (10,292 pairs of movie characters in 617 movies, with gender but without age)
CANDOR corpus (1650 video-chat conversations between strangers, with rich metadata including age, sex, political orientation, employment, education)
CaSiNo ('CampSite Negotiations') Corpus (1030 negotiation dialogues involving 846 speakers, with information on age, gender, ethnicity, education, big five personality traits and social values orientation)
US Supreme Court Oral Arguments Corpus (1.7 million utterances from 7,700 cases, 1955 - 2019, with speaker IDs linked to a database that probably enables retrieval of age and gender information)
Persuasion for Good Corpus (1285 speakers with information on age, sex, gender, education, big five personality traits, moral foundations and others)

Corpora for World Englishes

Corpus of British Isles Spoken English (CoBISE; geolocated automatic speech recognition (ASR) YouTube transcripts from the United Kingdom and Ireland; 38,680 ASR transcripts from 497 YouTube channels; 111,563,614 tokens)
Corpus of North American Spoken English (CoNASE; 1.29-billion-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts from the United States and Canada)
Corpus of Australian and New Zealand Spoken English (CoANZSE; 196-million-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts)
YouTube Corpus of Singapore English Podcasts (YCSEP; multi-million word corpus of Singapore podcasts)
Corpus of Singapore English Messages (CoSEM) (WhatsApp messages, includes information on age, gender and ethnicity)

Historical Corpora

Corpus of English Dialogues 1560–1760 (1.2 million words)

Datasets for Sentiment and Emotion Analysis

Yelp polarity reviews (560,000 highly polar Yelp reviews)
Sentiment140 (1.6 million tweets annotated for sentiment)
IMDB reviews (50k movie reviews)
Amazon US reviews (130 million reviews)
GoEmotions (58k Reddit comments annotated for 27 emotions)

Corpora for the Analysis of Toxicity

Wikipedia talk pages toxicity (more than 200,000 items)
CivilComments (2 million comments annotated for toxicity)
Bot Adversarial Dialogue Dataset (70k dialogues labelled for offensiveness)

Datasets for Human-Bot Communication

Schema-Guided Dialogue (over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant)
Bot Adversarial Dialogue Dataset (70k dialogues labelled for offensiveness)

Very Large Corpora

ParaCrawl (web-scale parallel corpora for official European languages, 2018)
Reddit Corpus (all subreddits from inception to Oct. 2018)
Wikipedia (entire Wikipedia, split by languages)
Wikipedia 40b (cleaned-up text for 40+ Wikipedia language editions)
CommonCrawl cleaned (cleaned version of the massive Common Crawl corpus comprising large amounts of material from the internet for multiple languages)

Very Large Speech Corpora

Common Voice (multi-language dataset of voices, around 100k voices)

Corpora of Conspiracy Discourse and the Alt-Right

(largely based on a list by Sviatlana Höhn)

LOCO - the 88-million-word Language of Conspiracy corpus (corpus; paper)
PushShift Telegram (dataset by Baumgartner et al. (2020), compiled from 27,800 mostly English channels and 2.2 million unique users)
Capitol riot (corpus of all messages from a Telegram channel supporting Donald Trump from the end of 2016 until January 2021)
4chan - Raiders of the Lost Kek (corpus of 3.5 years of 4chan posts from the Politically Incorrect Board)
Shouting into the Void - A Database of the Alternative Social Media Platform Gab (37,012,061 posts and 819,957 user profiles collected from Gab between 08/2016 and 12/2018)
An Early Look at the Parler Online Social Network (183M posts made by 4M users between August 2018 and January 2021 on Parler)

Corpora of Political Speeches

Corpus of Political Speeches (HKBU; ~ 6 million words from the US, Hong Kong, Taiwan, PRC)
German Political Speeches Corpus (1984 - 2017; 13 million words)
Small Corpus of Political Speeches (1545–2010; 2 million words, mainly US and UK)
US presidential speeches (all presidents; not a real corpus, speeches available for download)
Corpora of Parliamentary Proceedings
Corpus of Bundestag debates (German federal parliament)

Other Corpora and Datasets

Distant Reading/European Literary Text Collection (corpus of European novels from 1820 - 1940; 100+ novels for more than 10 languages)
SAMSum (16k chat dialogues with summaries)
Opinion abstracts (Rotten Tomatoes film review summaries and data from IDebate; around 5000 items)
14 million stolen passwords
WordNet (large lexical database of English; nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept)
Universal Dependencies (framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages, over 100 languages)
BLiMP (challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English; BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics)
Corpus of German Speech (CoGS; 51-million-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts)
