Home
We compiled corpora of language learners' texts from Lang-8, a language exchange social networking service (SNS).
These corpora are available only for research and educational purposes.
You can download the Lang-8 corpora from the download page.
If you would like to use them in commercial products, please contact the Lang-8 support desk directly for further information.
Lang-8 Learner Corpora list of URLs
This list contains the URLs of learners' blog entries as of December 2010. It covers 334,379 multilingual entries written by 59,455 active users. We used the Japanese portion of the corpus for our IJCNLP 2011 paper.
2011-01-25 Lang-8 Corpora list of URLs 201012 (gzipped)
Lang-8 Learner Corpora (raw format)
The corpora cover all 80 languages supported by Lang-8. The top 20 languages, counted by number of entries, are:
580549 total
237843 English
185991 Japanese
45289 Unknown
28154 Mandarin
21779 Korean
12606 Spanish
12392 French
11111 German
4069 Russian
4052 Traditional Chinese
3339 Italian
1135 Portuguese (Brazil)
944 Swedish
906 Turkish
892 Indonesian
803 Thai
737 Arabic
712 Finnish
655 Vietnamese
588 Dutch
574 Afrikaans
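To put these counts in proportion, the following minimal sketch computes each language's share of the total. The numbers are copied from the list above; the names `TOTAL`, `COUNTS`, and `share` are ours for illustration and are not part of the corpus distribution.

```python
# Entry counts copied from the list above (illustrative subset).
# TOTAL is the overall entry count across all languages.
TOTAL = 580549
COUNTS = {
    "English": 237843,
    "Japanese": 185991,
    "Unknown": 45289,
    "Mandarin": 28154,
    "Korean": 21779,
}


def share(count, total=TOTAL):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Percentage of all entries for each listed language.
shares = {lang: share(n) for lang, n in COUNTS.items()}

for lang, pct in shares.items():
    print(f"{lang:<10}{pct:5.1f}%")
```

English and Japanese together account for well over half of all entries, so the other languages are comparatively small.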
2013-04-30 Lang-8 Learner Corpora v2.0 (download page) (README)
2012-12-07 Lang-8 Learner Corpora v1.0
Lang-8 Corpus of Learner English
This corpus contains English learners' texts extracted from Lang-8. It has 100,051 English entries written by 29,012 active users. We also include the automatic tense/aspect annotation used in our ACL 2012 paper.
2012-08-20 Lang-8 Corpus of Learner English v1.0 (download page) (README)
Contact
Tomoya Mizumoto (mizumoto.tomoya.mh7--at--is.naist.jp)
Contributors
Tomoya Mizumoto (Corpus of Learner Japanese)
Toshikazu Tajiri (Corpus of Learner English)
Takuya Fujino (Corpora of Learner Japanese)
Seiji Kasahara (Corpus of Romanized Learner Japanese)
Mamoru Komachi
Masaaki Nagata
Yuji Matsumoto
References
Tomoya Mizumoto and Yuji Matsumoto. Discriminative Reranking for Grammatical Error Correction with Statistical Machine Translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pp. 1133-1138. San Diego, California, USA, June 2016.
Ippei Yoshimoto, Tomoya Kose, Kensuke Mitsuzawa, Keisuke Sakaguchi, Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi and Yuji Matsumoto. NAIST at 2013 CoNLL Grammatical Error Correction Shared Task. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pp. 26-33. Sofia, Bulgaria, August 2013.
Tomoya Mizumoto, Yuta Hayashibe, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Short Papers, pp. 863-872. Mumbai, India, December 2012.
Toshikazu Tajiri, Mamoru Komachi and Yuji Matsumoto. Tense and Aspect Error Correction for ESL Learners Using Global Context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (oral), pp. 198-202. Jeju Island, Korea, July 2012.
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pp. 147-155. Chiang Mai, Thailand, November 2011.