Home

We compiled corpora of language learners' texts from Lang-8, a language exchange social networking site.
These corpora are available only for research and educational purposes.
You can download the Lang-8 corpora from the download page.
If you would like to use them in commercial products, please contact the Lang-8 support desk directly for further information.

Lang-8 Learner Corpora list of URLs

This list contains the URLs of learners' blog entries as of December 2010. It covers 334,379 multilingual entries written by 59,455 active users. We used the Japanese portion of the corpus for our IJCNLP 2011 paper.

Lang-8 Learner Corpora (raw format)

The corpora cover all 80 languages supported by Lang-8. The top 20 languages, counted by number of entries, are:

580549 total
237843 English
185991 Japanese
 45289 Unknown
 28154 Mandarin
 21779 Korean
 12606 Spanish
 12392 French
 11111 German
  4069 Russian
  4052 Traditional Chinese
  3339 Italian
  1135 Portuguese(Brazil)
   944 Swedish
   906 Turkish
   892 Indonesian
   803 Thai
   737 Arabic
   712 Finnish
   655 Vietnamese
   588 Dutch
   574 Afrikaans

  • 2013-04-30 Lang-8 Learner Corpora v2.0 (download page) (README)
  • 2012-12-07 Lang-8 Learner Corpora v1.0
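
As a rough illustration of how the entry counts listed above can be read, the following Python sketch parses lines of the form "count label" and computes each language's share of the total. The function names and the inlined text are assumptions made for this example only; they do not describe the format of the distributed corpus files, and only the first few rows of the list are copied in.

```python
# Entry counts copied from the list above (first rows only, for brevity).
COUNTS_TEXT = """\
580549 total
237843 English
185991 Japanese
 45289 Unknown
 28154 Mandarin
 21779 Korean
"""


def parse_counts(text):
    """Parse lines of the form '<count> <label>' into a dict of label -> count."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only once so labels containing spaces stay intact.
        number, label = line.split(maxsplit=1)
        counts[label] = int(number)
    return counts


def shares(counts, total_label="total"):
    """Return each label's share of the total entry count, in percent."""
    total = counts[total_label]
    return {
        label: 100.0 * n / total
        for label, n in counts.items()
        if label != total_label
    }


if __name__ == "__main__":
    for label, pct in shares(parse_counts(COUNTS_TEXT)).items():
        print(f"{label:<10} {pct:5.1f}%")
```

Computed this way, English and Japanese together account for roughly 73% of all entries, which is worth keeping in mind when working with the smaller languages in the list.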

Lang-8 Corpus of Learner English

This corpus contains texts written by learners of English, extracted from Lang-8. It has 100,051 English entries written by 29,012 active users. We also include the automatic tense/aspect annotation used in our ACL 2012 paper.


Contact

  • Tomoya Mizumoto (mizumoto.tomoya.mh7--at--is.naist.jp)

Contributors

  • Tomoya Mizumoto (Corpus of Learner Japanese)
  • Toshikazu Tajiri (Corpus of Learner English)
  • Takuya Fujino (Corpora of Learner Japanese)
  • Seiji Kasahara (Corpus of Romanized Learner Japanese)
  • Mamoru Komachi
  • Masaaki Nagata
  • Yuji Matsumoto

References

  • Tomoya Mizumoto, Yuji Matsumoto. Discriminative Reranking for Grammatical Error Correction with Statistical Machine Translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pp. 1133-1138, San Diego, California, USA, June 2016.
  • Ippei Yoshimoto, Tomoya Kose, Kensuke Mitsuzawa, Keisuke Sakaguchi, Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Yuji Matsumoto. NAIST at 2013 CoNLL Grammatical Error Correction Shared Task. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pp. 26-33, Sofia, Bulgaria, August 2013.
  • Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Short Papers, pp. 863-872, Mumbai, India, December 2012.
  • Toshikazu Tajiri, Mamoru Komachi and Yuji Matsumoto. Tense and Aspect Error Correction for ESL Learners Using Global Context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (oral), pp. 198-202, Jeju Island, Korea, July 2012.
  • Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pp. 147-155, Chiang Mai, Thailand, November 2011.