Baselines

Random Baseline simply predicts a random author for each piece of code, drawn from the list of 1,000 authors (labeled 0 to 999). Its expected accuracy is 0.1%.
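The 0.1% figure follows from guessing uniformly among 1,000 labels. A minimal sketch of such a baseline is below; the function names and the `seed` parameter are illustrative assumptions, not part of the original setup.

```python
import random

NUM_AUTHORS = 1000  # author labels run from 0 to 999


def random_baseline(num_examples, seed=None):
    """Return one uniformly random author label per example."""
    rng = random.Random(seed)
    return [rng.randrange(NUM_AUTHORS) for _ in range(num_examples)]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Each guess is correct with probability 1/1000, whatever the true label is.
expected_accuracy = 1 / NUM_AUTHORS  # 0.001, i.e. 0.1%
```

On any single run the measured accuracy will fluctuate around this expected value.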

Characters Count Logistic Baseline converts each source file into a 100-dimensional vector holding the count of each of the 100 printable characters. It then fits a logistic regression model on these vectors. It achieves an accuracy of 29.252% on the development set.
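The vectorization step can be sketched as follows. This assumes that "the 100 printable characters" corresponds to Python's `string.printable`, which happens to contain exactly 100 characters; the names `ALPHABET`, `INDEX`, and `char_count_vector` are hypothetical. The resulting vectors would then be passed to a logistic regression model, which is not shown here.

```python
import string
from collections import Counter

# Assumption: the alphabet is Python's string.printable (100 characters).
ALPHABET = string.printable
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def char_count_vector(source):
    """Map a source string to a vector of per-character counts."""
    vector = [0] * len(ALPHABET)
    for ch, n in Counter(source).items():
        if ch in INDEX:  # characters outside the alphabet are ignored
            vector[INDEX[ch]] = n
    return vector


snippet = "int main() { return 0; }"
vec = char_count_vector(snippet)
```

Every source file maps to a vector of the same length regardless of its size, which is what lets a standard classifier be trained on top.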

TF-IDF KNN Baseline vectorizes each source file using TF-IDF with 10,000 features. These features are fed into a KNN classifier with k=25. Its accuracy on the development set is 62.128%, well above the previous baselines. Keep in mind that this baseline is very slow: predicting all examples in the development set takes about 4 hours using 6 threads.
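To make the idea concrete, here is a from-scratch sketch of TF-IDF vectorization followed by a k-nearest-neighbors vote, using only the standard library. It is not the original implementation: the tokenizer, the smoothed IDF formula, the use of cosine similarity, and all function names are assumptions chosen for illustration, and the toy data is made up.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Assumed tokenization: word-like runs, or single non-space symbols.
    return re.findall(r"\w+|[^\w\s]", source)


def fit_tfidf(train_sources, max_features=10_000):
    """Keep the most frequent tokens and compute a smoothed IDF for each."""
    docs = [Counter(tokenize(s)) for s in train_sources]
    totals, doc_freq = Counter(), Counter()
    for doc in docs:
        totals.update(doc)
        doc_freq.update(doc.keys())
    vocab = [tok for tok, _ in totals.most_common(max_features)]
    n = len(docs)
    return {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}


def transform(source, idf):
    """Return an L2-normalized sparse TF-IDF vector as a dict."""
    counts = Counter(tokenize(source))
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def cosine(a, b):
    """Dot product of two normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query, train_vecs, train_labels, k=25):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query, train_vecs[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: two hypothetical authors with two snippets each.
train_sources = [
    "for (int i = 0; i < n; i++) sum += i;",
    "for (int j = 0; j < m; j++) total += j;",
    "while True:\n    print(x)",
    "while x:\n    print(y)",
]
train_labels = [0, 0, 1, 1]

idf = fit_tfidf(train_sources)
train_vecs = [transform(s, idf) for s in train_sources]
query = transform("while y:\n    print(x)", idf)
prediction = knn_predict(query, train_vecs, train_labels, k=1)
```

In this sketch every query is compared against every training vector, which illustrates one reason a KNN baseline can be slow at prediction time on a large training set.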
