For each programming language, we remove projects that:
Serve educational purposes (e.g., tutorials);
Contain non-English commit messages.
For each project, we select commits with the following criteria:
A commit shall include at least three hunks;
A commit shall include only hunks with fewer than 15 changed lines of code (given the length limit of our model);
The commit message shall be in English and longer than 5 tokens;
The commit shall not contain automatically generated source files (e.g., Java files with @auto keywords) or non-source files (e.g., .bak, .log, and .pyc files).
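
The commit-level criteria above can be read as a single predicate over a commit. The following is a minimal sketch of such a filter, not the pipeline actually used. The `Hunk`, `ChangedFile`, and `Commit` types and all helper names are hypothetical. The sketch further assumes that the hunk-size limit applies to every hunk in the commit, that tokens are whitespace-separated words, that an ASCII-only check is an acceptable stand-in for English detection, and that generated files can be recognized by an `@auto` marker in their content.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Example extensions taken from the text; a real list would likely be longer.
NON_SOURCE_SUFFIXES = {".bak", ".log", ".pyc"}
MIN_HUNKS = 3            # a commit needs at least this many hunks
MAX_CHANGED_LINES = 15   # exclusive upper bound per hunk
MIN_MESSAGE_TOKENS = 5   # exclusive lower bound on message length


@dataclass
class Hunk:
    changed_lines: int  # number of added or deleted lines in the hunk


@dataclass
class ChangedFile:
    path: str
    content: str = ""
    hunks: list[Hunk] = field(default_factory=list)


@dataclass
class Commit:
    message: str
    files: list[ChangedFile]


def looks_english(message: str) -> bool:
    # Assumption: ASCII-only text approximates "English"; a language
    # identification model would be more reliable.
    return message.isascii()


def is_auto_generated(changed: ChangedFile) -> bool:
    # Assumption: generated files carry an "@auto" marker in their content.
    return "@auto" in changed.content


def is_non_source(changed: ChangedFile) -> bool:
    return PurePosixPath(changed.path).suffix.lower() in NON_SOURCE_SUFFIXES


def keep_commit(commit: Commit) -> bool:
    """Return True if the commit satisfies all four selection criteria."""
    hunks = [h for f in commit.files for h in f.hunks]
    if len(hunks) < MIN_HUNKS:
        return False
    # Assumption: every hunk must stay under the size limit.
    if any(h.changed_lines >= MAX_CHANGED_LINES for h in hunks):
        return False
    if not looks_english(commit.message):
        return False
    # Assumption: tokens are whitespace-separated words.
    if len(commit.message.split()) <= MIN_MESSAGE_TOKENS:
        return False
    if any(is_auto_generated(f) or is_non_source(f) for f in commit.files):
        return False
    return True
```

Under these assumptions, a commit with three small hunks across source files and a message such as "Fix null check in parser when input is empty" passes, while the same commit with an added .log file, a 15-line hunk, or a two-word message is rejected.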