Corpus Resources for Language Teachers

General reference sites

Corpora Around the World http://martinweisser.org/corpora_site/CBLLinks.html

This site provides an extensive collection of links to corpora, organized by corpus type. Most are English-language corpora, but the site also includes a number of non-English and multilingual corpora. Corpus analysis tools and other helpful materials (journals, corpus linguistics courses, conferences, etc.) are also listed, making the site a valuable resource for researchers and teachers.

Learner Corpora Around the World https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html

A list of approximately 145 learner corpora in a variety of languages (primarily English), arranged by a number of pertinent corpus considerations, including task/text type, proficiency level, and medium (spoken vs. written). Links to the corpora and information on their availability are provided for many entries, although some entries do not include this information. The list is open to updates.

CORPORA List-serv http://www.hit.uib.no/corpora/welcome.txt

A useful listserv if you are trying to find a corpus of a lesser-known language or of a less-represented register. You can join the listserv to find out about upcoming conferences and corpus releases.


English Corpora

Corpus.BYU.edu ~ http://corpus.byu.edu

This site links to the many corpora (e.g., COCA and TIME) that are searchable through an interface developed by Mark Davies. The format for searches is the same regardless of the corpus. The user-friendly interface allows for part-of-speech and wildcard searches. The site also offers one of the best interfaces to the BNC for word and phrase searches, with graphs and tables of search results by register.


Corpus of Contemporary American English (COCA) ~ http://corpus.byu.edu/coca/

An online, searchable corpus of more than 400 million words of American English arranged by register, including news, spoken, and academic texts. The texts in this corpus date from 1990 to the present. The site also allows the user to search by part of speech (POS).


Time Corpus ~ http://corpus.byu.edu/time/

This online corpus of Time Magazine from 1923 through 2006 is searchable through Mark Davies’ user-friendly interface. The Time corpus allows interesting explorations of how language changes over a relatively short period of time. It is also a useful resource for looking at written academic language that is accessible for language learners. The site also allows the user to search by part of speech (POS).


Word and Phrase ~ www.wordandphrase.info

This site allows users to browse academic vocabulary, either overall or by discipline. Users can also look up individual words, or submit a text and see its words highlighted.


Wikipedia ~ http://corpus.byu.edu/wiki/

A corpus of 1.9 billion words from 4.4 million pages that is now searchable and can be used to create a specialized corpus.


MICASE - Michigan Corpus of Academic Spoken English https://quod.lib.umich.edu/cgi/c/corpus/corpus?c=micase;page=simple

This free, online, searchable corpus of academic spoken language is a valuable resource. The online concordancer is user-friendly and has a number of search options. In addition to the transcripts, some of the sound files are also available. The corpus is available for purchase for a modest fee (use from the website is free). There are links to lesson material that has been prepared based on MICASE.


MICUSP - Michigan Corpus of Upper-Level Student Papers http://www.elicorpora.info/

This free, online, searchable corpus of student papers from a variety of disciplines provides teachers and students with many useful resources. Searches can be designed to target specific disciplines, types of writing, and/or parts of papers (e.g., conclusions, citations). The bar graph that displays results provides an easy-to-interpret visual.


English Learner Corpora


International Corpus of Learner English (ICLE) https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html

ICLE is a corpus of learner English designed for the study of interlanguage; its second version was published in 2009. It is publicly available on CD-ROM for around $250. It consists of 6,085 essays of approximately 700 words each, totalling 3.7 million words. The learners are university-age (about 20 years old) students of English in an EFL context, categorized as intermediate to advanced learners. Sixteen different L1 backgrounds are represented: Bulgarian, Chinese, Czech, Dutch, Finnish, French, German, Italian, Japanese, Norwegian, Polish, Russian, Spanish, Swedish, Tswana, and Turkish. The corpus comes with a built-in concordancer.


Louvain International Database of Spoken English Interlanguage (LINDSEI) https://uclouvain.be/en/research-institutes/ilc/cecl/lindsei.html

LINDSEI was launched in 1995 and is available on CD-ROM for around $250. It contains oral data produced by intermediate to advanced learners of English. Eleven nationalities are represented: Bulgarian, Chinese, Dutch, French, German, Greek, Italian, Japanese, Polish, Spanish, and Swedish, with approximately 50 learners from each nationality. LINDSEI currently contains over 1 million words and 130 hours of interviews between learners (L2 English speakers) and interviewers (in most cases native speakers of English). The interviews average 14 minutes and nearly 2,000 words each. Learner language accounts for approximately 792,000 words in total, or nearly 1,500 words per interview. Each interview consists of three tasks: set topic, free discussion, and picture description. The CD-ROM includes a built-in concordancer.

Non-English Corpora

French

French Learner Language Oral Corpora: http://www.flloc.soton.ac.uk/

FLLOC is a collection of 9 corpora totaling approximately 3 million words of oral learner French at varying levels. It contains both text and sound files.


Spanish

Corpus del Español (BYU) http://www.corpusdelespanol.org/

The Corpus del Español allows one to quickly and easily search more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s. One can search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these, as well as collocates within a 10-word window. One can also compare the frequency and distribution of two related words, phrases, or grammatical constructions across texts by register and historical period. Semantically based queries can also be conducted with this corpus.
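
For readers unfamiliar with the idea of collocates within a word window, the following is a minimal Python sketch of the general technique. It does not reproduce the Corpus del Español interface or its query syntax. The function name collocates, the span parameter (here assumed to mean the number of words on each side of the search word), and the sample sentence are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def collocates(tokens, node, span=10):
    """Count the words found within `span` tokens to the left or
    right of every occurrence of `node` (hypothetical helper)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - span)
        # Words before the node plus words after it, excluding the node itself.
        window = tokens[start:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts


# Invented sample sentence, split naively on whitespace.
text = "el perro negro corre y el gato negro duerme"
tokens = text.lower().split()

# A small span keeps the toy example readable; use span=10 for a 10-word window.
result = collocates(tokens, "negro", span=2)
print(result.most_common(3))
```

With real corpus data, the tokens would come from a properly tokenized text rather than a whitespace split, and raw counts are usually supplemented by an association measure before words are treated as collocates.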

CEDEL2 https://www.uam.es/proyectosinv/woslac/collaborating.htm

CEDEL2 is an L1 English - L2 Spanish learner corpus that is being collected and created by Cristóbal Lozano. CEDEL2 is part of the general WOSLAC project (Word Order in Second Language Acquisition Corpora) directed by Amaya Mendikoetxea at the Universidad Autónoma de Madrid.

CEDEL2 is in line with other European projects where large learner corpora are being created. Of particular interest is SPLLOC (Spanish Learner Language Oral Corpus) that is being created at the University of Southampton (UK).

Spanish Learner Language Oral Corpora (SPLLOC) http://www.splloc.soton.ac.uk/

SPLLOC 1 (April 2006-March 2008) and SPLLOC 2 (August 2008-January 2010) contain data from classroom learners of Spanish (with English as their first language), from beginner to advanced level, collected using specially designed elicitation tasks. For comparison purposes, native speakers were also recorded undertaking the same tasks. The resulting database of L2 Spanish contains digital sound files of learner speech, in varying genres, accompanied by transcripts in CHILDES format.


Portuguese

Corpus do Português (BYU): http://www.corpusdoportugues.org/


Russian

Russian National Corpus: http://www.ruscorpora.ru/en/

Concordancer for RNC: http://corpus.leeds.ac.uk/ruscorpora.html

Russian Learner Corpus: http://web-corpora.net/RussianLearnerCorpus/search/


Arabic

Arabic Corpus (BYU): http://arabicorpus.byu.edu/

Arabic Learner Corpus (Leeds): http://www.alcsearch.com/alcsearch/


Chinese

Lancaster Corpus of Mandarin Chinese http://www.lancaster.ac.uk/fass/projects/corpus/LCMC/

The Lancaster Corpus of Mandarin Chinese consists of one million words of written texts. It follows the sampling frame of the Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus and includes five hundred 2,000-word samples of written texts taken from 15 text categories in 1991-1992. It is accessible online and includes POS tagging. It is available in both a simplified Chinese character version and a romanized Pinyin version. The corpus can be accessed through CQPweb at Lancaster University, UK. https://cqpweb.lancs.ac.uk/


Japanese

Japanese Learner’s Conversation Database https://nknet.ninjal.ac.jp/nknet/ndata/opi/

A database of 339 publicly available OPI (Oral Proficiency Interview) audio recordings of learners of Japanese, which can be searched online using 9 different search criteria. Learners are from multiple L1 backgrounds, although more than half are Korean L1 speakers. Data was collected in 2007 and 2008. This is currently the largest L2 Japanese learner corpus available. *Site is in Japanese*

Learner’s Language Corpus of Japanese http://cblle.tufs.ac.jp/tag/ja/index.php?menulang=en

An online learner corpus of written Japanese collected from universities in Taiwan, the UK, and Ukraine, as well as from native speakers. It includes a part-of-speech tagger that can be used to search tags in the data from Taiwanese participants (200,000 words total) and native speakers (60,000 words total). Ideal for researchers looking to compare language use between native and non-native speakers.


German

KanDeL (The Kansas Developmental Learner corpus) https://www.linguistik.hu-berlin.de/en/institut-en/professuren-en/korpuslinguistik/research/kandel

KanDeL comprises developmental data collected from US students who enrolled in a basic German language program over four consecutive 16-week-long semesters at the University of Kansas (KU) and agreed to participate in this research. This instructional program completes the foreign language requirement for certain majors at KU, a large public US university. The writing samples are rough drafts of essays written by the students in response to curricular tasks every three to five weeks during each semester. The genres are personal narratives and personal accounts with argumentative tasks added at later time points. All learner texts have been tokenized, lemmatized, and automatically annotated for parts-of-speech. Next, they were manually annotated for target hypotheses by multiple annotators (see Falko-Handbuch). Finally, the target hypothesis layer of the corpus was also automatically lemmatized and annotated for parts-of-speech.