README-romanized1.0

24 August 2012 (v1.0)

This README file describes the Lang-8 Corpus of Romanized Learner Japanese.

INTRODUCTION
============

We scraped correction data for learner Japanese written in Roman script from Lang-8 (http://lang-8.com), an SNS for language learning. The data was crawled in December 2010. The scraping procedure is described in our WTIM paper (Kasahara et al., 2011).

The dataset contains 5,000 pairs, each consisting of a learner's text and its corrected version, extracted from 350,000 Japanese sentences in the dataset used in our IJCNLP paper (Mizumoto et al., 2011). We included an entry if either the learner's text or the corrected text contains Roman script.

DATA FORMAT
===========

1.roman-ja-5k-learner.txt

This file contains romanized Japanese written by learners of Japanese. The first item on each line is the sentence ID, followed by the sentence.

e.g. 351 Watashi no shuutmatsu wa ii deshita.

2.roman-ja-5k-correct.txt

This file contains romanized Japanese corrected by native speakers of Japanese. The first item on each line is the sentence ID, followed by the sentence. The ID is the same as that of the corresponding learner sentence.

e.g. 351 Watashi no shuutmatsu wa yokatta desu.

AUTHORS
=======

Seiji Kasahara, Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto

Any questions regarding the Lang-8 Corpus of Romanized Learner Japanese should be directed to seijik42@gmail.com and komachi@is.naist.jp.

REFERENCES
==========

If you are interested in the Lang-8 Corpus of Romanized Learner Japanese, please cite these papers:

Seiji Kasahara, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. Error Correcting Romaji-Kana Conversion for Japanese Language Education. In Proceedings of the Workshop on Text Input Methods (WTIM 2011): Short papers (oral), pp.38-42. Chiang Mai, Thailand, November 2011.

Tomoya Mizumoto, Mamoru Komachi and Masaaki Nagata. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners.
In Proceedings of the 5th International Joint Conference on Natural Language Processing, pp.147-155. Chiang Mai, Thailand, November 2011.

LICENSE
=======

The corpus is distributed for research or educational purposes only, and is provided without any warranty.

ACKNOWLEDGMENTS
===============

We are grateful to Yangyang Xi and the Lang-8 contributors for sharing their data.
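
USAGE EXAMPLE
=============

Because the learner file and the corrected file share sentence IDs, the two can be joined into (ID, learner sentence, corrected sentence) triples. The following is a minimal Python sketch, not part of the corpus distribution. It assumes that the ID is separated from the sentence by whitespace (a space or a tab); this README does not state the delimiter or the file encoding, so both should be checked against the actual files. The function names parse_lines and pair_by_id are hypothetical helpers introduced for illustration only.

```python
from typing import Dict, Iterable, List, Tuple


def parse_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map sentence ID -> sentence for the lines of one corpus file.

    Assumption: the ID and the sentence are separated by whitespace.
    """
    entries: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        sent_id = parts[0]
        sentence = parts[1] if len(parts) > 1 else ""
        entries[sent_id] = sentence
    return entries


def pair_by_id(
    learner_lines: Iterable[str], correct_lines: Iterable[str]
) -> List[Tuple[str, str, str]]:
    """Join learner and corrected sentences on their shared ID."""
    learner = parse_lines(learner_lines)
    correct = parse_lines(correct_lines)
    # Keep the learner file's order; drop IDs missing from either side.
    return [(i, learner[i], correct[i]) for i in learner if i in correct]


if __name__ == "__main__":
    # The two example lines from the DATA FORMAT section.
    learner_sample = ["351 Watashi no shuutmatsu wa ii deshita."]
    correct_sample = ["351 Watashi no shuutmatsu wa yokatta desu."]
    for sent_id, src, tgt in pair_by_id(learner_sample, correct_sample):
        print(sent_id, src, tgt, sep="\t")
```

To run this on the real files, pass open file objects instead of the sample lists, for example pair_by_id(open(learner_path, encoding="utf-8"), open(correct_path, encoding="utf-8")), where the paths and the UTF-8 encoding are assumptions to adjust as needed.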