ELT Corpus

Short link to this page:

http://bit.ly/elt_corpus

The first reason for building this corpus is to have a source of words, phrases and example sentences that contains enough examples of all the words in the topics commonly studied in foreign language learning. Large general corpora do, in fact, meet this requirement.

The second reason is to have a corpus structured in sections (subcorpora) representing each topic, which can only be achieved by building it with this in mind. The metadata in general corpora mostly indicates text types, dates, regional varieties, etc. of the source data rather than topics.

The procedure involved:

    1. choosing a set of topics

    2. preparing extensive lists of nouns for each topic

    3. dividing the lists into elementary, intermediate and advanced levels, based mainly on the source of the lists but also on general frequency rankings and intuition

    4. breaking the lists into sets of 20, as this is the maximum number of words that WebBootCat can crawl. WebBootCat (WBC) is a web-based tool, part of the Sketch Engine suite of corpus tools, that crawls the web for sets of words and builds a corpus of the texts found.

    5. combining all of the data sets for each topic into subcorpora
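
Step 4 can be illustrated with a short sketch. This is not the script used to build the ELT Corpus, only one possible way of splitting a word list into sets of 20 before handing each set to WebBootCat. The function name chunk_words and the sample list animal_nouns are hypothetical, and the sketch does not call WebBootCat itself.

```python
from typing import Iterator

# Set size taken from step 4 of the procedure.
MAX_SEEDS = 20


def chunk_words(words: list[str], size: int = MAX_SEEDS) -> Iterator[list[str]]:
    """Yield consecutive sets of at most `size` words, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(words), size):
        yield words[start:start + size]


# Hypothetical noun list for one topic and level (placeholder words only).
animal_nouns = [f"animal_{i}" for i in range(1, 46)]

seed_sets = list(chunk_words(animal_nouns))
print([len(s) for s in seed_sets])  # [20, 20, 5]
```

Each resulting set would then be used for one crawl, and the texts collected for all sets of a topic combined as in step 5.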

The ELT Corpus now has 36 topics and c. 75 million words.

Animals

Arts

Body

Clothes

Cognition

Communication

Crime

Describing

Education

Environment

Food_Drink

Health

Home

Jobs

Language

Language_terms

Measurement

Media

Money

Natural_world

People_appearance

People_things

Places

Politics

Relationships

Shopping

Sport

Technology

Time

Town_City

Transport

Travel

University

Weather

Work

Work_jobs