Corpora
Corpus Analysis
AntConc - widely used corpus access software ('concordancer')
Voyant - online tool for basic text and corpus analysis
Corpus Linguistics with R (LADAL, U Queensland)
KorPlus - interactive introduction to corpus basics (U Bamberg)
Text Mining with R (online textbook)
Corpus Databases
Corpus Resource Database (CoRD)
Learner Corpora
LEONIDE (Longitudinal lEarner cOrpus iN Italiano, Deutsch, English; Reference article): Third Language Acquisition (mainly L1 German, L2 Italian, L3 English, or L1 Italian, L2 German, L3 English); 2500 texts from 163 learners at school level
International Corpus of Learner English (ICLE): Essays written by university students of English with various L1s
Louvain International Database of Spoken English Interlanguage (LINDSEI): Interviews with university students of English with various L1s
Open Cambridge Learner Corpus (2.9 million words from over 10,000 student responses taken from the Cambridge English Language Assessment suite of exams, 7 different L1s)
Corpora for Sociolinguistics
British National Corpus 2014 (metadata for speakers and texts can be downloaded)
Friends Corpus (all dialogues of the US TV series 'Friends')
Cornell Movie-Dialogs Corpus (10,292 pairs of movie characters in 617 movies, with gender but without age)
CANDOR corpus (1650 video-chat conversations between strangers, with rich metadata including age, sex, political orientation, employment, education)
CaSiNo ('CampSite Negotiations') Corpus (1030 negotiation dialogues involving 846 speakers, with information on age, gender, ethnicity, education, big five personality traits and social values orientation)
US Supreme Court Oral Arguments Corpus (1.7 million utterances from 7,700 cases, 1955–2019, with speaker IDs linked to a database that probably enables retrieval of age and gender information)
Persuasion for Good Corpus (1285 speakers with information on age, sex, gender, education, big five personality traits, moral foundations and others)
Corpora for World Englishes
Corpus of British Isles Spoken English (CoBISE; geolocated automatic speech recognition (ASR) YouTube transcripts from the United Kingdom and Ireland; 38,680 ASR transcripts from 497 YouTube channels; 111,563,614 tokens)
Corpus of North American Spoken English (CoNASE; 1.29-billion-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts from the United States and Canada)
Corpus of Australian and New Zealand Spoken English (CoANZSE; 196-million-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts)
Corpus of Singapore English Messages (CoSEM) (WhatsApp messages, includes information on age, gender and ethnicity)
Datasets for Sentiment and Emotion Analysis
Yelp polarity reviews (560,000 highly polar Yelp reviews)
Sentiment140 (1.6 million tweets annotated for sentiment)
IMDB reviews (50k movie reviews)
Amazon US reviews (130 million reviews)
GoEmotions (58k Reddit comments annotated for 27 emotions)
Corpora for the Analysis of Toxicity
Wikipedia talk pages toxicity (more than 200,000 items)
CivilComments (2 million comments annotated for toxicity)
Bot Adversarial Dialogue Dataset (70k dialogues labelled for offensiveness)
Datasets for Human-Bot Communication
Schema-Guided Dialogue (over 20k annotated multi-domain, task-oriented conversations between a human and a virtual assistant)
Bot Adversarial Dialogue Dataset (70k dialogues labelled for offensiveness)
Very Large Corpora
ParaCrawl (web-scale parallel corpora for official European languages, 2018)
Reddit Corpus (all subreddits from inception to Oct. 2018)
Wikipedia (entire Wikipedia, split by languages)
Wikipedia 40b (cleaned-up text for 40+ Wikipedia language editions)
CommonCrawl cleaned (cleaned version of the massive Common Crawl corpus comprising large amounts of material from the internet for multiple languages)
Corpora of Conspiracy Discourse and the Alt-Right
(largely based on a list by Sviatlana Höhn)
LOCO - the 88-million-word Language of Conspiracy corpus (corpus; paper)
Pushshift Telegram (dataset by Baumgartner et al. (2020), compiled from 27,800 mostly English channels and 2.2M unique users)
Capitol riot (corpus of all messages from a Telegram channel supporting Donald Trump from the end of 2016 until January 2021)
4chan - Raiders of the Lost Kek (corpus of 3.5 years of 4chan posts from the Politically Incorrect Board)
Shouting into the Void - A Database of the Alternative Social Media Platform Gab (37,012,061 posts and 819,957 user profiles collected from Gab between 08/2016 and 12/2018)
An Early Look at the Parler Online Social Network (183M posts made by 4M users between August 2018 and January 2021 on Parler)
Corpora of Political Speeches
Corpus of Political Speeches (HKBU; ~6 million words from the US, Hong Kong, Taiwan, PRC)
German Political Speeches Corpus (1984–2017; 13 million words)
Small Corpus of Political Speeches (1545–2010; 2 million words, mainly US and UK)
US presidential speeches (all presidents; not a real corpus, speeches available for download)
Corpora of Parliamentary Proceedings
Other Corpora and Datasets
Distant Reading/European Literary Text Collection (corpus of European novels from 1820–1940; 100+ novels for more than 10 languages)
SAMSum (16k chat dialogues with summaries)
Opinion abstracts (Rotten Tomatoes film review summaries and data from IDebate; around 5000 items)
WordNet (large lexical database of English; nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept)
Universal Dependencies (framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across more than 100 human languages)
BLiMP (challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English; BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics)
Corpus of German Speech (CoGS; 51-million-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts)