README-ja

11 January 2012

This README file describes the Lang-8 Corpus of Learner Japanese.


INTRODUCTION
============

We scraped correction data of learner Japanese from Lang-8
(http://lang-8.com), an SNS for language learning. The data was crawled
in September 2011. The scraping procedure is described in our
IJCNLP 2011 paper (Mizumoto et al., 2011).


DATA FORMAT
===========

Corrections are stored in two files, one sentence pair per line. The N-th
line of the learners' file corresponds to the N-th line of the corrected
file. The japanese_L1 directory contains correction files split by the
mother tongue of the Japanese learners.

1. japanese.incor

   This file contains sentences written by learners of Japanese.

   Example:
   祖母は猫の子にうれしかった。猫と一緒に住むのは楽しくなるだと行って楽に猫をもらった。

2. japanese.corr

   This file contains the corrected sentences. Anything inside
   [sline]...[/sline] tags was dropped. Color tags ([f-blue], [f-red], ...)
   were removed.

   Example:
   祖母は猫の子を見てよろこんだ。猫と一緒に住むのは楽しくなるだろうとよろこんで猫をもらった。


AUTHORS
=======

Takuya Fujino, Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and
Yuji Matsumoto

Any questions regarding the Lang-8 Corpus of Learner Japanese should be
directed to tomoya-m@is.naist.jp and komachi@is.naist.jp.


REFERENCES
==========

If you use the Lang-8 Corpus of Learner Japanese, please cite this paper:

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata. Mining Revision Log of
Language Learning SNS for Automated Japanese Error Correction of Second
Language Learners. In Proceedings of the 5th International Joint
Conference on Natural Language Processing, pp. 147-155. Chiang Mai,
Thailand, November 2011.


LICENSE
=======

The corpus is distributed for research or educational purposes only, and
is provided without any warranty.


ACKNOWLEDGMENTS
===============

We gratefully thank Yangyang Xi and Lang-8 contributors for sharing their
data.
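

EXAMPLE: READING THE FILES
==========================

The line-by-line alignment described under DATA FORMAT means the two files
can be read in parallel. The following is a minimal sketch, not part of the
corpus distribution: the function name read_pairs is hypothetical, and UTF-8
encoding is an assumption that should be checked against the actual files.

```python
from itertools import zip_longest


def read_pairs(incor_path, corr_path, encoding="utf-8"):
    """Yield (learner_sentence, corrected_sentence) pairs.

    Line N of the learners' file is paired with line N of the corrected
    file. The UTF-8 default is an assumption, not stated in the README.
    """
    with open(incor_path, encoding=encoding) as incor, \
         open(corr_path, encoding=encoding) as corr:
        # zip_longest pads the shorter file with None, which lets us
        # detect files whose line counts do not match.
        for n, (src, tgt) in enumerate(zip_longest(incor, corr), start=1):
            if src is None or tgt is None:
                raise ValueError(f"files differ in length at line {n}")
            yield src.rstrip("\n"), tgt.rstrip("\n")
```

A call such as read_pairs("japanese.incor", "japanese.corr") would then
yield one tuple per sentence pair. Raising on a length mismatch is a design
choice made here so that a truncated file does not silently misalign the
pairs.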
