Home

We compiled corpora of language learners' texts from Lang-8, a language exchange social networking service (SNS).

These corpora are available only for research and educational purposes.

You can download the Lang-8 corpora from the download page.

If you would like to use them in commercial products, please contact the Lang-8 support desk directly for further information.

Lang-8 Learner Corpora list of URLs

This list contains the URLs of learners' blog entries as of December 2010. It has 334,379 multilingual entries written by 59,455 active users. We used the Japanese portion of the corpus for our IJCNLP 2011 paper.

Lang-8 Learner Corpora (raw format)

The corpora cover all 80 languages supported by Lang-8. The top 20 languages by number of entries are listed below, together with the overall total and the entries whose language is unknown:

580549 total
237843 English
185991 Japanese
 45289 Unknown
 28154 Mandarin
 21779 Korean
 12606 Spanish
 12392 French
 11111 German
  4069 Russian
  4052 Traditional Chinese
  3339 Italian
  1135 Portuguese(Brazil)
   944 Swedish
   906 Turkish
   892 Indonesian
   803 Thai
   737 Arabic
   712 Finnish
   655 Vietnamese
   588 Dutch
   574 Afrikaans
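
To get a feel for how skewed the language distribution is, the entry counts listed above can be turned into percentage shares. The following is only a minimal sketch: it hard-codes the five largest counts and the total from the list, and the names `counts`, `TOTAL`, and `share` are illustrative, not part of the corpus distribution.

```python
# Entry counts copied from the list above (five largest categories only).
counts = {
    "English": 237843,
    "Japanese": 185991,
    "Unknown": 45289,
    "Mandarin": 28154,
    "Korean": 21779,
}
TOTAL = 580549  # total number of entries reported in the list


def share(language):
    """Return the percentage of all entries attributed to `language`."""
    return 100.0 * counts[language] / TOTAL


for name in counts:
    print(f"{name:10s} {share(name):5.1f}%")

# Entries that fall outside the five categories listed in `counts`.
remaining = TOTAL - sum(counts.values())
print(f"{'Other':10s} {100.0 * remaining / TOTAL:5.1f}%")
```

English and Japanese together account for well over half of all entries, so experiments on the smaller languages will have far less data to work with.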

Lang-8 Corpus of Learner English

This corpus contains English learners' texts extracted from Lang-8. It has 100,051 English entries written by 29,012 active users. We also include the automatic tense/aspect annotation used in our ACL 2012 paper.

Contact

Contributors

References