Due by: Friday night of the finals week
Late Policy: a 5% penalty for the first day late and 10% for each additional day.
This is a single-person project; no teamwork is allowed for the standard project. If you have a project idea of your own that matches the scope of the basic one, please discuss it with us beforehand.
Report format:
Write a report with >1,000 words (excluding references) including the main sections: a) abstract, b) introduction, c) method, d) experiment, e) conclusion, and f) references. You can follow the paper format of leading machine learning journals such as the Journal of Machine Learning Research (http://www.jmlr.org/) or IEEE Trans. on Pattern Analysis and Machine Intelligence (http://www.computer.org/web/tpami), or of leading conferences like NeurIPS (https://papers.nips.cc/) and ICML (https://icml.cc/). There is no page limit for your report.
Templates (using Google Docs or Word is fine too):
NeurIPS: https://neurips.cc/Conferences/2023/PaperInformation/StyleFiles
ICML: https://media.icml.cc/Conferences/ICML2023/Styles/icml2023.zip
ICLR: https://github.com/ICLR/Master-Template/raw/master/iclr2024.zip
Bonus points:
If you feel that your work deserves bonus points due to reasons such as: a) novel ideas and applications, b) large efforts in your own data collection/preparation, c) state-of-the-art classification results, or d) new algorithms, please create a "Bonus Points" section to specifically describe why you deserve bonus points.
In this project you will choose any three classifiers out of those tested by Caruana and Niculescu-Mizil and evaluate them on three datasets from the UCI repository http://archive.ics.uci.edu/ml/. Note that the same classifier type with different kernels or weak learners (e.g. SVM with a linear vs. RBF kernel; boosting with decision stumps vs. decision trees) is NOT considered a different classifier. Please read the paper by Caruana and Niculescu-Mizil carefully. In your experiments, you will train and test each classifier on at least three datasets, so there is a minimum of 3*3 = 9 individual training-and-testing runs. Each time, you will need to do cross-validation to find proper hyper-parameters for the type of classifier being used.
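For concreteness, here is a minimal sketch (in Python, assuming scikit-learn) of one such training-and-testing run with cross-validated hyper-parameter search. The RBF-kernel SVM and the Breast Cancer Wisconsin data (a UCI dataset bundled with scikit-learn) are only stand-ins for whichever classifier and dataset you actually choose.

# A minimal sketch of one of the 9 training-and-testing runs, assuming scikit-learn;
# swap in your own UCI dataset and your own choice of classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # UCI Breast Cancer Wisconsin, used here for illustration

# Hold out a test set; the hyper-parameter search is cross-validated on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy").fit(X_tr, y_tr)

print("best hyper-parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
print("test accuracy:", search.score(X_te, y_te))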
We have been discussing the classification problem in the form of two-class classifiers throughout the course. Some classifiers, such as decision trees, KNN, and random forests, are agnostic to the number of classes, while others, such as SVM and boosting, whose explicit objective functions are formulated for two classes, are not.
The basic requirement for the final project is the two-class classification problem. If you have additional bandwidth, you can also experiment with the multi-class classification setting. When preparing a dataset to train your (two-class) classifier, please merge the labels into two groups, positives and negatives, if the dataset happens to consist of multi-class labels.
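For example, a minimal sketch of collapsing multi-class labels into two groups (assuming NumPy; which classes you treat as positive is your own choice and should be stated in your report):

import numpy as np

y = np.array([0, 2, 1, 3, 2, 0])        # original multi-class labels (example values)
positive_classes = {2, 3}               # your chosen "positive" group; the rest become negatives
y_binary = np.isin(y, list(positive_classes)).astype(int)   # 1 = positive, 0 = negative
print(y_binary)                          # [0 1 0 1 1 0]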
Train your classifiers using the setting described in the empirical study by Caruana and Niculescu-Mizil (not all metrics are needed). You are expected to reproduce results consistent with the paper, though do expect some small variations. When evaluating the algorithms, you don't need to use all of the metrics reported in the paper; one metric, e.g. classification accuracy, is sufficient. Please report the cross-validated classification results together with the corresponding learned hyper-parameters.
Note that since you are choosing your own libraries for the classifiers, implementation details will affect the classification results. Even the same SVM, implemented in different libraries, will not produce identical results when trained on the same dataset. Therefore, don't expect results identical to those in the paper, especially if you are using a subset of the features rather than all of them. A slight difference in ranking is OK, but the overall trend should be consistent: e.g., random forests should do well, more training data leads to better results, and KNN is not necessarily very bad.
If you compute accuracy and follow the basic requirement of 3 classifiers and 3 datasets, you are looking at 3 trials X 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20). Each time, report the best accuracy under the chosen hyper-parameters. Since the accuracy is averaged over the 3 trials to rank-order the classifiers, you will report 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20) X 3 accuracies (train, validation, test). When debugging, always check the training accuracy first to see whether you can at least push it high (i.e., overfit the data) as a sanity check that your implementation is correct. The hyper-parameter heatmaps are details that do not need to be compared too carefully: the hyper-parameter search is internal, and the final conclusion about the classifiers is based on the best hyper-parameters you obtained in each run.
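A minimal sketch of this bookkeeping for one classifier on one dataset (assuming scikit-learn; the random forest and the bundled Breast Cancer Wisconsin data are stand-ins for your own choices):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = load_breast_cancer(return_X_y=True)   # a UCI binary dataset, used here for illustration

def make_search():
    # cross-validated hyper-parameter search for one classifier type (here a random forest)
    grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
    return GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5, scoring="accuracy")

for test_frac in [0.8, 0.5, 0.2]:            # 20/80, 50/50, 80/20 train/test partitions
    train_acc, val_acc, test_acc = [], [], []
    for trial in range(3):                   # 3 random trials per partition, then average
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_frac, random_state=trial)
        clf = make_search().fit(X_tr, y_tr)
        train_acc.append(clf.best_estimator_.score(X_tr, y_tr))  # sanity check: should be high
        val_acc.append(clf.best_score_)                          # cross-validated accuracy
        test_acc.append(clf.score(X_te, y_te))
    print(f"test fraction {test_frac}: train {np.mean(train_acc):.3f}, "
          f"val {np.mean(val_acc):.3f}, test {np.mean(test_acc):.3f}")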
Since the exact data setting might have changed, the specific parameters and hyper-parameters reported in the Caruana and Niculescu-Mizil paper serve as a guideline, but you don't need to try all of them. You can try a few standard ones, as long as your classification results are reasonable. If you pick the multi-layer perceptron as one of your classifiers, note that for some datasets you may need to increase the number of layers (e.g. to 3) and add more neurons per layer to attain good results.
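If you go the MLP route in Python, a plausible starting point (assuming scikit-learn's MLPClassifier; the layer widths below are illustrative choices, not values from the paper):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(
    StandardScaler(),                                 # MLPs are sensitive to feature scaling
    MLPClassifier(hidden_layer_sizes=(128, 64, 32),   # 3 hidden layers instead of the default 1
                  max_iter=2000,                      # allow enough iterations to converge
                  early_stopping=True,                # hold out part of training data for validation
                  random_state=0),
)

X, y = load_breast_cancer(return_X_y=True)            # illustrative UCI dataset
print(cross_val_score(mlp, X, y, cv=5).mean())        # quick check that the deeper MLP trains well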
You can alternatively or additionally adopt the datasets and classifiers reported in a follow-up paper, Caruana et al. ICML 2008.
You are encouraged to use Python, but other programming languages and platforms are OK. The candidate classifiers include (Python counterparts are sketched after the reference links below):
1. Boosting family classifiers
http://www.mathworks.com/matlabcentral/fileexchange/21317-adaboost
or
https://github.com/dmlc/xgboost
2. Support vector machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
3. Random Forests
http://www.stat.berkeley.edu/~breiman/RandomForests/
4. Decision Tree
http://www.rulequest.com/Personal/ (please also see the sample MATLAB code in the attachment)
5. K-nearest neighbors
http://www.mathworks.com/matlabcentral/fileexchange/19345-efficient-k-nearest-neighbor-searchusing-jit
6. Neural Nets
http://www.cs.colostate.edu/~anderson/code/
http://www.mathworks.com/products/neural-network/code-examples.html
7. Logistic regression classifier
8. Bagging family
The links above are for your reference. You can implement your own classifier or download other versions you like online (but you need to make sure the online code is reliable). You are expected to write a formal report describing the experiments you ran and the corresponding results (plus code).
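If you work in Python, most of the candidate classifiers above have standard scikit-learn (or xgboost) counterparts. The instantiations below are plausible starting points, not a required setup; the hyper-parameters shown are library defaults or common choices rather than values from the paper.

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
# from xgboost import XGBClassifier                   # alternative for the boosting family

candidates = {
    "boosting":      AdaBoostClassifier(),            # boosted decision stumps by default
    "svm_rbf":       SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(n_estimators=300),
    "decision_tree": DecisionTreeClassifier(),
    "knn":           KNeighborsClassifier(n_neighbors=5),
    "neural_net":    MLPClassifier(max_iter=1000),
    "logistic_reg":  LogisticRegression(max_iter=1000),
    "bagging":       BaggingClassifier(),
}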
Grading
Note that if you satisfy only the minimum requirement, e.g. 3 classifiers on 3 datasets with cross-validation, and do it well, you will receive a decent score but not the full 100 points. We are looking for something a bit more; please see the guidelines below.
When reporting the experimental results, there are two main sets of comparisons we are looking for (a sketch of how to organize them follows this list):
a. For each dataset and each partition, compare the different algorithms; the results should hopefully be consistent with the findings in the paper, with Random Forests being the best, etc.
b. For each classifier on each dataset, compare the different partitions; you should see the test accuracy increase (test error decrease) as more data is used for training and less for testing.
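One plausible way to organize these two comparisons (assuming pandas; the accuracy numbers below are placeholders, not real results):

import pandas as pd

rows = [
    # classifier,      dataset, partition, mean test accuracy (placeholder values)
    ("random_forest",  "adult", "80/20",   0.86),
    ("svm_rbf",        "adult", "80/20",   0.84),
    ("knn",            "adult", "80/20",   0.82),
    ("random_forest",  "adult", "20/80",   0.83),
    # ... one row per classifier x dataset x partition, averaged over the 3 trials
]
df = pd.DataFrame(rows, columns=["classifier", "dataset", "partition", "test_acc"])

# (a) per dataset and partition: compare the classifiers against each other
print(df.pivot_table(index="classifier", columns=["dataset", "partition"], values="test_acc"))

# (b) per classifier and dataset: compare partitions (accuracy should rise with more training data)
print(df.pivot_table(index="partition", columns=["classifier", "dataset"], values="test_acc"))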
Note that performance and function calls vary with the particular ML libraries you are using. For example, the same SVM classifier provided in different toolboxes might produce different errors even when trained on the same dataset, but the overall differences should be reasonable and interpretable. You may obtain a ranking that is somewhat different from the paper's, due to differences in the detailed implementation of the classifiers, different training sizes, features, etc., but the overall trend should be explainable. For example, random forests usually perform quite well; KNN might not be as bad as you had thought; kernel-based SVMs are sometimes sensitive to the hyper-parameters; and using more data in training leads to improvement, especially on difficult cases.
The merit and grading of your project can be judged from the aspects described below, which are common when reviewing a paper:
1. How challenging and large are the datasets you are studying? (10 points)
2. Any aspects that are new in terms of algorithm development, uniqueness of the data, or new applications? (10 points)
3. Is your experimental design comprehensive? Have you done thorough experiments in tuning hyper-parameters and performing cross-validation? You should also try different data partitions, e.g. 20% training and 80% testing, 50% training and 50% testing, and 80% training and 20% testing, for multiple rounds (e.g. 3 times for each of the three partitions) and compute average scores to reduce the chance of accidental results. Try to report both the training and testing errors after cross-validation; you are also encouraged to report the training and validation errors during cross-validation using classification error/accuracy curves w.r.t. the hyper-parameters (see the sketch at the end of this list). (50 points)
4. Is your report written in a professional way, with sections including abstract, introduction, data and problem description, method description, experiments, conclusion, and references? (30 points)
5. Bonus points will be assigned to projects in which new ideas have been developed and implemented, or in which thorough and extensive empirical studies have been carried out (e.g. evaluation on >=5 classifiers and >=4 datasets).
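For the accuracy curves w.r.t. the hyper-parameters mentioned in item 3, a minimal sketch (assuming scikit-learn's validation_curve and matplotlib; the RBF-SVM C parameter and the bundled Breast Cancer Wisconsin data are stand-ins for whichever classifier and hyper-parameter you actually tune):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)            # illustrative UCI binary dataset
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))

C_range = np.logspace(-2, 3, 6)                        # hyper-parameter values to sweep
train_scores, val_scores = validation_curve(
    pipe, X, y, param_name="svc__C", param_range=C_range, cv=5, scoring="accuracy")

# plot mean training and validation accuracy against the hyper-parameter
plt.semilogx(C_range, train_scores.mean(axis=1), "o-", label="training accuracy")
plt.semilogx(C_range, val_scores.mean(axis=1), "s-", label="validation accuracy")
plt.xlabel("C"); plt.ylabel("accuracy"); plt.legend()
plt.savefig("svm_validation_curve.png")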