Contributors: Melvin Johnson, Romain Egelé, add-your-name
Preliminary notes (Isabelle); overwrite/ignore
"Classical tasks": language understanding, summarization, translation, trivia question answering.
Problem: benchmarks quickly become obsolete.
New tasks:
go/languagedata; TensorFlow Datasets
GLUE (GLUE leaderboard; SuperGLUE leaderboard)
NLTK Data (classic set of corpora and models for NLP)
spaCy Language data (find the list of compatible languages)
Multilingual benchmarks:
XTREME: covers 9 tasks and 40 languages (similar to GLUE and SuperGLUE, but for multiple languages).
XGLUE: a similar aggregation of many multilingual benchmarks.
XTREME-R: an improved version of XTREME that covers 10 tasks and 50 languages. It removes some easy tasks and replaces them with harder ones.
Oxford English Corpus (OEC): a text corpus of 21st-century English, used by the makers of the Oxford English Dictionary and by Oxford University Press's language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words.
Historical evolution (Dipanjan Das):
1. Penn Treebank in the early 1990s.
2. Various machine translation benchmarks and the establishment of the WMT shared tasks.
3. Information retrieval competitions such as TREC.
4. RTE (recognizing textual entailment) challenges in the mid-2000s.
5. SNLI: the Stanford natural language inference benchmark and its leaderboard (2015).
6. The SQuAD question answering dataset and its leaderboard (2016).
7. The Natural Questions dataset from Google and its leaderboard (2019).
8. After that, benchmarking exploded: GLUE, SuperGLUE, GEM, BIG-bench, etc.
Very large corpora: