ELT Corpus
Short link to this page:
http://bit.ly/elt_corpus
The reason for building this corpus is twofold. The first aim is to have a source of words, phrases and example sentences that contains enough examples of all the words in the topics commonly studied in foreign language learning, which, in fact, large general corpora also do.
The second aim is for the corpus to be structured in sections (subcorpora) representing each topic, which can only be achieved by building it with this in mind. The metadata in general corpora mostly indicates the text types, dates, regional varieties etc. of the source data, not the topic.
The procedure involved:
choosing a set of topics
preparing extensive lists of nouns for each topic
dividing the lists into elementary, intermediate and advanced, based mainly on the source of each list, supplemented by general frequency rankings and intuition
breaking the lists into sets of 20, as this is the maximum number of words that WebBootCat (WBC) can crawl for in one run. WBC is a web-based tool, part of the Sketch Engine suite of corpus tools, that crawls the web for sets of words and builds a corpus from the texts found.
combining all of the data sets for each topic into subcorpora
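The list-handling steps above can be sketched in a few lines of Python. This is only an illustration, not the script actually used: the nested dictionary layout, the function names and the sample words are all assumptions, and the sketch stops short of calling WebBootCat itself, which is run through its web interface.

```python
from collections import defaultdict

BATCH_SIZE = 20  # set size stated in the procedure above


def make_batches(words, size=BATCH_SIZE):
    """Split a word list into consecutive sets of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def plan_batches(topic_lists):
    """Map (topic, level) to its list of seed-word sets.

    `topic_lists` is a hypothetical structure: {topic: {level: [nouns]}}.
    """
    plan = {}
    for topic, levels in topic_lists.items():
        for level, words in levels.items():
            plan[(topic, level)] = make_batches(words)
    return plan


def group_by_topic(plan):
    """Gather every set belonging to a topic, mirroring the subcorpus step."""
    subcorpora = defaultdict(list)
    for (topic, _level), batches in plan.items():
        subcorpora[topic].extend(batches)
    return dict(subcorpora)


# Placeholder data only; these are not the real word lists.
example = {
    "Animals": {
        "elementary": [f"noun{i}" for i in range(45)],
        "advanced": [f"rare{i}" for i in range(10)],
    }
}

plan = plan_batches(example)
by_topic = group_by_topic(plan)
print([len(b) for b in plan[("Animals", "elementary")]])  # [20, 20, 5]
print(len(by_topic["Animals"]))  # 4 sets in total for the topic
```

Each set of up to 20 words would then be submitted as one WebBootCat run, and the resulting texts filed under the topic's subcorpus.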
The ELT Corpus now has 36 topics and c. 75 million words:
Animals
Arts
Body
Clothes
Cognition
Communication
Crime
Describing
Education
Environment
Food_Drink
Health
Home
Jobs
Language
Language_terms
Measurement
Media
Money
Natural_world
People_appearance
People_things
Places
Politics
Relationships
Shopping
Sport
Technology
Time
Town_City
Transport
Travel
University
Weather
Work
Work_jobs