README-raw

v2.0 (30 Apr 2013)
v1.0 (07 Dec 2012)

This README file describes the Lang-8 Learner Corpora.


INTRODUCTION
============

We scraped correction data for learners' writing from Lang-8
(http://lang-8.com), an SNS for language learning. The data was crawled in
September 2011. After downloading the HTML from Lang-8, we used
lang8decode.py to extract correction pairs. Please note that we include only
the raw, unmodified data, which may contain noise. Please refer to our IJCNLP
2011 paper (Mizumoto et al., 2011) for cleaning and filtering.


DATA FORMAT
===========

The data is in JSON format. Each record has the following structure:

  [
    "journal_id",
    "sentence_id",
    "learning_language",
    "native_language",
    ["learner_sentence1", "learner_sentence2", ...],
    [["correction1_to_sentence1", "correction2_to_sentence1", ...],
     ["correction1_to_sentence2", "correction2_to_sentence2", ...],
     ...]
  ]

The list of corrections is parallel to the list of learner sentences. A
sentence that received no correction has an empty list.

Example:

["772869","227504","English","Spanish",["My prefer color","Hello people,","Today I didn't know to tell us.","My prefer color is red.","Because is funny and diferent.","The red can to pretend dangeruis and it's sexy.","The color red is chosen for people with self-confidence."],[[],[],["Today I didn't know how to say it this:"],["My favourite color is red."],["Because it is funny and different."],["Red can pretend to be dangerous and it's sexy."],["The color red is chosen by people with self confidence."]]]

Please note that corrections may contain tags like [f-red]...[/f-red],
[f-blue]...[/f-blue], and [sline]...[/sline] (meaning deletion).


AUTHORS
=======

Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto

Any questions regarding the Lang-8 Learner Corpora should be directed to
komachi@is.naist.jp.


REFERENCES
==========

If you are interested in the Lang-8 Learner Corpora, please cite this paper:

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto.
Mining Revision Log of Language Learning SNS for Automated Japanese Error
Correction of Second Language Learners.
In Proceedings of the 5th International Joint Conference on Natural Language
Processing, pp.147-155. Chiang Mai, Thailand, November 2011.


LICENSE
=======

The corpora are distributed for research or educational purposes only, and
are provided without any warranty.

If you would like to use them for commercial purposes, please contact
support@lang-8.com. Lang-8 changed its terms of use in order to sell licenses
for learner data created after January 2012.


ACKNOWLEDGMENTS
===============

We gratefully thank Yangyang Xi and Lang-8 contributors for sharing their
data.
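

READING A RECORD (ILLUSTRATIVE SKETCH)
======================================

The record layout described under DATA FORMAT can be read with a few lines of
Python. The sketch below is not the lang8decode.py script and is not the
cleaning procedure from the paper. The function names strip_tags and
correction_pairs are hypothetical, and the tag handling is an assumption:
text inside [sline]...[/sline] is dropped because it marks a deletion, the
[f-red] and [f-blue] markers are removed while their content is kept, and
runs of whitespace are collapsed. Any other tags in the data are left as-is.

```python
import json
import re

# Assumption: [sline]...[/sline] marks deleted text, so drop tag and content.
SLINE = re.compile(r"\[sline\].*?\[/sline\]", re.DOTALL)
# Assumption: colour tags only highlight text, so drop the markers alone.
COLOR = re.compile(r"\[/?f-(?:red|blue)\]")


def strip_tags(text):
    """Remove the markup tags named in the README from one correction."""
    text = SLINE.sub("", text)
    text = COLOR.sub("", text)
    # Collapse whitespace left behind by removed spans.
    return re.sub(r"\s+", " ", text).strip()


def correction_pairs(line):
    """Yield (learner_sentence, correction) pairs from one JSON record."""
    (journal_id, sentence_id, learning_language,
     native_language, sentences, corrections) = json.loads(line)
    # The two lists are parallel; uncorrected sentences have an empty list.
    for sentence, fixes in zip(sentences, corrections):
        for fix in fixes:
            yield sentence, strip_tags(fix)


# The example record from the DATA FORMAT section.
record = '''["772869","227504","English","Spanish",["My prefer color","Hello people,","Today I didn't know to tell us.","My prefer color is red.","Because is funny and diferent.","The red can to pretend dangeruis and it's sexy.","The color red is chosen for people with self-confidence."],[[],[],["Today I didn't know how to say it this:"],["My favourite color is red."],["Because it is funny and different."],["Red can pretend to be dangerous and it's sexy."],["The color red is chosen by people with self confidence."]]]'''

pairs = list(correction_pairs(record))
for learner, corrected in pairs:
    print(learner, "->", corrected)
```

Sentences without corrections produce no pairs, so the first two sentences of
the example record are skipped. A sentence with several corrections produces
one pair per correction.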