Analyzing the writing styles of non-native speakers is a challenging task. In this paper, we analyze the comments written in the discussion pages of the English Wikipedia. Using learning algorithms, we are able to detect native speakers' writing style with an accuracy of 74%. Given the diversity of the English Wikipedia users and the large number of languages they speak, we measure the similarities among their native languages by comparing the influence they have on their English writing style. Our results show that languages known to have the same origin and development path leave a similar footprint on their speakers' English writing style. To enable further studies, the dataset we extracted from Wikipedia will be made publicly available.
Comments database (data.json.txt.details):
- autoid: Index that is unique for each row
- text: The actual comment
- lang: The native language of the user.
- user_name: The Wikipedia username of the comment's author.
- page_id: The ID of the Wikipedia page that the comment appeared in.
- page_title: The title of the Wikipedia page that the comment appeared in.
- time_stamp: The time stamp of the comment, as given in the comment signature. May be null.
- level: English proficiency of the user.
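The fields above can be read into typed records along the following lines. This is only a minimal sketch: it assumes the comments file stores one JSON object per line, which this description does not confirm, and the field types, the `Comment` class, the `parse_comments` helper, and the sample rows are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Comment:
    """One row of the comments database (field types are assumed)."""
    autoid: int
    text: str
    lang: str
    user_name: str
    page_id: int
    page_title: str
    time_stamp: Optional[str]  # documented as possibly null
    level: str


def parse_comments(lines: Iterable[str]) -> Iterator[Comment]:
    """Yield one Comment per non-empty line.

    Assumption: each line holds a single JSON object with the listed fields.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        yield Comment(
            autoid=rec["autoid"],
            text=rec["text"],
            lang=rec["lang"],
            user_name=rec["user_name"],
            page_id=rec["page_id"],
            page_title=rec["page_title"],
            time_stamp=rec.get("time_stamp"),  # tolerate a missing or null value
            level=rec["level"],
        )


# Made-up sample rows, for illustration only (not taken from the dataset).
SAMPLE_LINES = [
    json.dumps({
        "autoid": 1, "text": "I agree with the change.", "lang": "DE",
        "user_name": "ExampleUserA", "page_id": 10, "page_title": "Example",
        "time_stamp": "12:00, 1 January 2010 (UTC)", "level": "3",
    }),
    json.dumps({
        "autoid": 2, "text": "Please add a source.", "lang": "EN",
        "user_name": "ExampleUserB", "page_id": 10, "page_title": "Example",
        "time_stamp": None, "level": "N",
    }),
]

comments = list(parse_comments(SAMPLE_LINES))
```

To read the real file, an open file handle can be passed in place of `SAMPLE_LINES`, provided the one-object-per-line assumption holds.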
Users database (users_props.json.txt):
- langs: The set of languages that the user claims knowledge of.
- comm_size: The total size of all the user's comments. The unit (characters or words) is not confirmed.
- comm_num: The number of comments that the user contributed.
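As an illustration of how the user fields relate to each other, the sketch below derives a mean comment size from comm_size and comm_num and checks membership in langs. It assumes a user record is available as a dictionary holding the three fields listed above; the helper names, the language codes, and the sample record are hypothetical, and the result is expressed in whatever unit comm_size uses.

```python
from typing import Any, Dict


def mean_comment_size(props: Dict[str, Any]) -> float:
    """Average size per comment, in the (unconfirmed) unit of comm_size."""
    num = props["comm_num"]
    if num == 0:
        # Avoid dividing by zero for users without comments.
        return 0.0
    return props["comm_size"] / num


def knows_language(props: Dict[str, Any], code: str) -> bool:
    """True if the user claims knowledge of the given language code."""
    # JSON has no set type, so langs is assumed to arrive as a list.
    return code in set(props["langs"])


# Made-up record, for illustration only (not taken from the dataset).
sample_user = {"langs": ["EN", "FR"], "comm_size": 1200, "comm_num": 8}

average = mean_comment_size(sample_user)
speaks_french = knows_language(sample_user, "FR")
```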