Corpora

Organisations

ICAME (International Computer Archive of Modern and Medieval English)

CoRD (Corpus Resource Database) ['provides first-hand information about English language corpora']

Lists and Collections

CQPweb (Lancaster University) [Andrew Hardie's video guide is available here.]

CQPweb (CLARIN-D Service Centre, Saarland University) [There are several corpora which are not available in the Lancaster CQPweb.]

Mark Davies, 'English-Corpora.org'

Corpora of the Czech National Corpus project [includes EEBO and Old Bailey Corpus; eight lessons for each corpus are available here]

Martin Weisser, 'Historical Corpora or Collections (English)'

Martin Weisser, 'Concordancers'

Stuart Lee, 'Old and Middle English Corpora'

Corpus Tools

VARD 2 ['an interactive piece of software produced in Java designed to assist users of historical corpora in dealing with spelling variation, particularly in EModE texts']

AntConc [developed by Laurence Anthony: 'A freeware corpus analysis toolkit for concordancing and text analysis.']

KWIC Concordance for Windows ['a corpus analytical tool for making word frequency lists, concordances and collocation tables by using electronic files']

Diachronic

Helsinki Corpus of English Texts: Diachronic Part [The manual is available from the link. The corpus can be ordered from either ICAME or OTA. See also XML Helsinki Corpus Browser.]

Corpus of Early English Medical Writing (CEEM) [covers the period 1350-1800; available as separate CD-ROMs in Middle English Medical Texts (MEMT), Early Modern English Medical Texts (EMEMT) and Late Modern English Medical Texts (LMEMT)]

Corpus of English Religious Prose (COREP) [covers the period from 1150 to the end of the 18th century (in preparation)]

Corpus of Early English Recipes (CoER) [in preparation at the University of Las Palmas de Gran Canaria]

Seville Corpus of Northern English (SCONE) [in preparation at the University of Seville; covers the period 600-1500]

LEON: Leuven English Old to New, version 0.3 [compiled by Peter Petré (University of Leuven); covers the period from early Old English to 1640]

Corpus of Historical English Law Reports (CHELAR) [a diachronic (1535-1999) corpus compiled at the University of Santiago de Compostela]

The Salamanca Corpus: Digital Archive of English Dialect Texts [a diachronic (1500-1950) dialectal corpus currently being compiled at the University of Salamanca]

Transhistorical Corpus of Written English [a diachronic (15C-21C) corpus developed at Edge Hill University]

Old English

Dictionary of Old English Web Corpus [a comprehensive corpus of existing Old English texts; now freely available to the general public (though with restrictions on the number of log-ins)]

York-Helsinki Parsed Corpus of Old English Poetry (York Poetry Corpus) [a syntactically-annotated selection of poetry texts from the Old English section of the Helsinki Corpus; available from either Susan Pintzuk (University of York) or OTA]

York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) [a syntactically-annotated selection of prose texts from the Old English section of the Helsinki Corpus; available from OTA]

TOXIIC (Trinity Old English from the XIIth Century) ['a corpus of twelfth-century copies of Old English texts from manuscripts not well represented in existing resources like the DOE']

ENHIGLA ['a parallel corpus of Old English and Old High German translations and their source texts']

ParCorOEv2 ['an open access annotated parallel corpus Old English-English']

Middle English

Corpus of Narrative Etymologies (CoNE) [an AHRC-funded project at the University of Edinburgh, now published online with its associated Corpus of Changes (CC)]

Corpus of Middle English Prose and Verse [a free Web-based corpus consisting of 146 complete texts]

Penn-Helsinki Parsed Corpus of Middle English, Second edition (PPCME2) [a syntactically-annotated selection of prose texts from the Middle English section of the Helsinki Corpus]

The Parsed Corpus of Middle English Poetry (PCMEP) [a fully parsed and annotated corpus of 38 Middle English poems]

ICAMET (Innsbruck Computer Archive of Machine-Readable English Texts) [the Prose Corpus consists of 129 texts and the Letter Corpus contains 254 complete letters]

MEG-C (The Middle English Grammar Corpus) [currently under compilation at the University of Stavanger]

A Corpus of Middle English Local Documents (MELD) [another corpus from the University of Stavanger; downloadable as a ZIP file]

Corpus of Early English Correspondence (CEEC) [covers the period 1418-1681]

Corpus of Early English Correspondence Sampler (CEECS) [covers the period 1418-1680; can be ordered from either ICAME or OTA]

Corpus of Early English Correspondence Supplement (CEECSU) [covers the period 1402-1663 and aims to fill socio-regional gaps in the original CEEC]

The Parsed Corpus of Early English Correspondence (PCEEC) [covers the period 1410-1681; can be ordered from OTA; a revised version is available here]

The Parliament Rolls of Medieval England [an electronic edition of 'the official records of the meetings of the English parliament from the reign of Edward I (1272 - 1307) until the reign of Henry VII (1485 - 1509)']

The Málaga Corpus of Late Middle English Scientific Prose [an annotated corpus of Middle English scientific manuscripts housed in the Hunterian Collection, Glasgow University Library]

Early Modern English

Early English Books Online (EEBO) [also available on Lancaster's CQPweb]

Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) [a syntactically-annotated selection of texts from the Early Modern English section of the Helsinki Corpus]

The Penn-York Computer-annotated Corpus of a Large Amount of English (PYCCLE) ['a part-of-speech tagged version of the Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) corpora, as digitised by the Text Creation Partnership (TCP)']

Shakespeare Corpus [a corpus of Shakespeare's 37 plays, plus all the speeches of all the characters]

A Corpus of English Dialogues, 1560-1760 (CED) [can be ordered from ICAME]

Quaker Historical Corpus [173 texts written by Quakers between 1650 and 1690]

EMMA Corpus [EMMA stands for Early Modern Multiloquent Authors; a sample of 50 of the most prolific English writers born in the 17th century, who mostly belonged to the London-based elite]

Visualizing English Print ['plain text corpora of Early Modern English texts and visualization tools to explore them']

The Lampeter Corpus of Early Modern English Tracts [covers the period 1640-1710; can be ordered from either ICAME or OTA]

Newdigate Newsletters [manuscript newsletters dating from 13 January 1673/4 to 29 September 1715]

Emerging Voices Corpus ['a small corpus (47,481 words; 53,567 tokens including punctuation) of Early Modern English (1500-1800)']

Late Modern English

A Representative Corpus of Historical English Registers (ARCHER) [a British and American English corpus from 1650 to 1990; the latest version (ARCHER 3.2) is searchable online]

Zurich English Newspaper Corpus (ZEN) [covers the period 1661-1791; available on CD-ROM]

The Old Bailey Corpus [contains the proceedings of the Old Bailey, London's Central Criminal Court, from 1674 to 1913; see also Old Bailey Online]

Rostock Newspaper Corpus [RNC-1 comprises British news reports from 1700 to 2000; RNC-2, a systematically condensed version of the basic corpus, is in progress]

Penn Parsed Corpus of Modern British English (PPCMBE) [covers the period 1700-1914]

A Corpus of Late Modern British and American English Prose (COLMOBAENG) [compiled by Teresa Fanego (University of Santiago de Compostela); a 1,170,000-word database covering the period 1700-1879]

The English language of the north-west in the late Modern English period: A Corpus of late 18c Prose ['about 300,000 words of local English letters on practical subjects, dated 1761-90'; available from OTA]

Corpus of Early English Correspondence Extension (CEECE) [the 18th-century extension of the original CEEC, covering the period 1681-1800]

Corpus of Late Modern English Texts, version 3.0 (CLMET3.0) ['a genre-balanced 34-million-word corpus of Late Modern British English']

Bluestocking Corpus: Letters by Elizabeth Montagu, 1730s-1780s [the letters are both downloadable and browsable on the web]

University of Lausanne, The Language of the Labouring Poor in Late Modern England ['the pauper petition corpus ... will be made available to the academic community, including not only linguists, but also (cultural) historians and other related disciplines']

Coruña Corpus of English Scientific Writing [currently under compilation at the Research Group for Multidimensional Corpus-based Studies in English (MuStE)]

Hansard Corpus [the 1.6-billion-word corpus 'contains nearly every speech given in the British Parliament from 1803-2005']

Corpus of English Novels (CEN) ['a 25-million-word corpus of late nineteenth and early twentieth-century novels by British and North American novelists']

A Corpus of late Modern English Prose [covers the period 1861-1919; available from OTA]

University of Birmingham, CLiC Dickens [a web app, CLiC, is being developed, 'designed specifically for the analysis of literary texts']

The Corpus of Nineteenth-Century Newspaper English (CNNE) [an ongoing project at Uppsala University (principal investigator: Erik Smitterberg)]

HUM19K Corpus [19th-century British fiction corpus, with 100 novels, 100 authors, 100 years and 13 million words]

The Diachronic Corpus of Present-Day Spoken English (DCPSE) [a new parsed corpus of spoken English available on CD-ROM]

APU Writing and Reading Corpus 1979-1988 [a diachronic corpus of data from British English schoolchildren at Year 6 level]

Varieties of English

Corpus of Global Web-Based English (GloWbE) [c. 1.9 billion words of texts from 20 different countries]

International Corpus of English (ICE) ['Twenty-six research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989.']

Vienna-Oxford International Corpus of English (VOICE) ['transcripts of naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF)'; see also VOICE CLARIAH]

Freiburg Corpus of English Dialects (FRED) [The full version is available only by visiting Freiburg.]

Helsinki Corpus of Older Scots [a corpus of Middle Scots (1450-1700); can be ordered from either ICAME or OTA]

Corpus of Scottish Correspondence (CSC) [a corpus of early Scottish epistolary prose texts (1500-1715)]

Corpus of Modern Scottish Writing (CMSW) [a corpus of written Scottish English (1700-1945)]

Scottish Corpus of Texts & Speech (SCOTS) [a corpus of written and spoken Scottish English (1945-2007)]

A Corpus of Irish English [gathers together the main documents from the early 14th century up to the present day; published with Raymond Hickey's Corpus Presenter (John Benjamins, 2003); see also his Irish English Resource Centre]

The Parsed Old and Middle Irish Corpus (POMIC) [a corpus released by the Dublin Institute for Advanced Studies]

The Corpus of Early Ontario English, pre-Confederation Section (CONTE-pC) [covers the period 'from the earliest Ontarian English texts to the end of the 19th century (ca. 225,000 words)']

Corpus of Oz Early English (COOEE) [covers the period 1788-1900 and comprises c. 2 million words]

The Diachronic Electronic Corpus of Tyneside English (DECTE) [an ongoing project at Newcastle University]

Corpus of Historical American English (COHA) [covers the period 1810-2009 and comprises c. 400 million words]

Cleaned COHA (CCOHA) ['We cleaned the corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties.']

TIME Magazine Corpus of American English [covers the period 1923-2006]

The Movie Corpus ['contains 200 million words of data in more than 25,000 movies from the 1930s to the current time']

Corpus of US Supreme Court Opinions [c. 130 million words in 32,000 Supreme Court decisions from the 1790s to the present]

Santa Barbara Corpus of Spoken American English [all transcriptions can be downloaded for free]

Strathy Corpus of Canadian English [covers the period 1970s-2000s]

Wellington Corpus of Spoken New Zealand English [available on CD]

The Varieties of English for Specific Purposes dAtabase (VESPA) learner corpus ['available to students and researchers at the University of Oslo and to researchers developing other subcorpora of VESPA']
