README-romanized1.1

The dataset contains 5,000 pairs of learner sentences and their corrections, extracted from the 350,000 Japanese sentences in the dataset used in our IJCNLP paper (Mizumoto et al., 2011). We manually re-annotated 500 of them to create a gold standard. An entry was included if either the learner's text or the corrected text contains Roman script.


DATA FORMAT
===========

1. roman-ja-5k-learner.txt

This file contains romanized Japanese written by learners of Japanese. The first item on each line is the sentence ID, followed by the sentence.

e.g.
351 Watashi no shuutmatsu wa ii deshita.

2. roman-ja-5k-correct.txt

This file contains romanized Japanese corrected by native speakers of Japanese. The first item on each line is the sentence ID, followed by the sentence. The ID is the same as that of the learner's counterpart.

e.g.
351 Watashi no shuutmatsu wa yokatta desu.

3. roman-ja-500-eval.txt

This file contains romanized learner Japanese accompanied by manual annotation in kana. Only spelling errors were corrected during the annotation. The two fields are delimited by "\t|||\t".

e.g.
hajimimashtei   |||     はじめまして


AUTHORS
=======

Seiji Kasahara, Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto

Any questions regarding the Lang-8 Corpus of Romanized Learner Japanese should be directed to seijik42@gmail.com and komachi@is.naist.jp.


REFERENCES
==========

If you use the Lang-8 Corpus of Romanized Learner Japanese, please cite these papers:

Seiji Kasahara, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. Error Correcting Romaji-Kana Conversion for Japanese Language Education. In Proceedings of the Workshop on Text Input Methods (WTIM 2011): Short papers (oral), pp.38-42. Chiang Mai, Thailand, November 2011.

Tomoya Mizumoto, Mamoru Komachi and Masaaki Nagata. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pp.147-155. Chiang Mai, Thailand, November 2011.


LICENSE
=======

The corpus is distributed for research or educational purposes only, and is provided without any warranty.


ACKNOWLEDGMENTS
===============

We gratefully thank Yangyang Xi and Lang-8 contributors for sharing their data.


HISTORY
=======

24 August 2012 (v1.1) Included roman-ja-500-eval.txt.
23 August 2012 (v1.0) Initial commit.
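

EXAMPLE: READING THE FILES
==========================

The layouts described under DATA FORMAT could be read along the following lines. This is only a minimal sketch, not an official loader. It assumes that the sentence ID is separated from the sentence by whitespace and that the files are UTF-8 encoded; neither detail is stated in this README. The function names (parse_id_lines, pair_by_id, parse_eval_lines) are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Field delimiter of roman-ja-500-eval.txt, as stated under DATA FORMAT.
EVAL_DELIMITER = "\t|||\t"


def parse_id_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map sentence ID -> sentence for the learner or correct file.

    Assumption: the ID is the first whitespace-separated token on the line.
    """
    sentences: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)
        sentences[parts[0]] = parts[1] if len(parts) > 1 else ""
    return sentences


def pair_by_id(learner: Dict[str, str],
               correct: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Join learner and corrected sentences on their shared ID."""
    return [(sid, learner[sid], correct[sid])
            for sid in learner if sid in correct]


def parse_eval_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each evaluation line into (romanized text, kana annotation)."""
    pairs: List[Tuple[str, str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if EVAL_DELIMITER not in line:
            continue  # skip lines that do not follow the two-field layout
        romaji, kana = line.split(EVAL_DELIMITER, 1)
        pairs.append((romaji, kana))
    return pairs


if __name__ == "__main__":
    # In-memory lines taken from the examples under DATA FORMAT.
    learner = parse_id_lines(["351 Watashi no shuutmatsu wa ii deshita.\n"])
    correct = parse_id_lines(["351 Watashi no shuutmatsu wa yokatta desu.\n"])
    print(pair_by_id(learner, correct))
    print(parse_eval_lines(["hajimimashtei\t|||\tはじめまして\n"]))
```

To read the actual files, pass an open file object instead of a list, for example open("roman-ja-5k-learner.txt", encoding="utf-8") (the encoding is an assumption).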