Stats 202

Class Project

Objective
The goal of this class project is to give you experience in real-life data mining.  By the end of the project, you will have learned how to identify and interpret different types of attributes in a dataset, visualize each attribute individually, visualize relationships between attributes of different types, understand how those relationships could affect your model, and finally build a binary classification model.  Throughout the project, I will be available via e-mail to help out and provide advice.

The class project is optional.  If you choose not to take the Final Exam, your Class Project score replaces your Final Exam score.  If you choose to also take the Final Exam, your score for the Final Exam portion of your grade will be the maximum of your Final Exam and Class Project scores.

Data
You will be provided a training data set that includes 19 attributes and 94,682 observations from web transaction anomaly data.  In a separate file, the class labels (positive for being an anomaly, or negative for not being an anomaly) are provided for each observation.  Additionally, a test data set is provided that contains 36,019 observations.  Your goal is to order the observations in the test data set by how likely you think they are to be classified as positive for being an anomaly.  Please download the data from the links provided at the bottom of the page.

Your job is to create a text file containing one line per example in the test set.  On each line, give your predicted probability that the label is 1 (positive).  The probability should be a decimal number between 0 and 1 (inclusive) with up to 6 decimal places of precision.  So if you use all 6 decimal places, the format should be x.xxxxxx, where each x is a digit between 0 and 9.
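A minimal sketch of writing such a file in R, assuming your predicted probabilities are held in a numeric vector.  The helper name write_predictions, the vector name pred, and the file name predictions.txt are illustrative assumptions, not names required by the project.

```r
# Hypothetical helper: write one probability per line with 6 decimal places.
write_predictions <- function(pred, path) {
  # Refuse to write anything that is not a valid probability.
  stopifnot(is.numeric(pred), !anyNA(pred), all(pred >= 0 & pred <= 1))
  writeLines(sprintf("%.6f", pred), con = path)
  invisible(path)
}

# Made-up probabilities for illustration only; a real submission has one
# value per test observation, in the same order as the test file.
pred <- c(0.5, 0.123456789, 1, 0)
out_file <- file.path(tempdir(), "predictions.txt")
write_predictions(pred, out_file)
readLines(out_file)
```

Formatting with sprintf() keeps every line in the fixed x.xxxxxx form, so values such as 1 or 0 are written as 1.000000 and 0.000000 rather than in a shorter form.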

Be careful when submitting, because accidentally deleting one line may have large repercussions: every prediction after the missing line would likely be matched to the wrong observation.
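One way to guard against a missing line is to check the file before sending it.  The sketch below is only an illustration: the helper name check_predictions is an assumption, and the demonstration uses a small temporary file rather than a real submission.

```r
# Hypothetical checker: confirm a predictions file has the expected number of
# lines and that every line parses as a probability in [0, 1].
check_predictions <- function(path, n_expected) {
  lines <- readLines(path)
  values <- suppressWarnings(as.numeric(lines))
  c(
    line_count_ok = length(lines) == n_expected,
    all_numeric   = !anyNA(values),
    in_range      = !anyNA(values) && all(values >= 0 & values <= 1)
  )
}

# Small demonstration with a temporary file.  For the real submission,
# n_expected would be 36019, the number of test observations.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("0.250000", "0.900000", "0.010000"), demo_file)
check_predictions(demo_file, n_expected = 3)
check_predictions(demo_file, n_expected = 4)  # simulates a deleted line
```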

Final Report
In addition to the text file, you will provide a final report of up to 5 pages (not including figures or tables) explaining the steps you took throughout the data mining process (see Figure 1).

Figure 1.  Steps of the data mining process

  1. Selection - Explain which attributes you used to build the model, and why you chose those attributes.
  2. Preprocessing - Explain whether you pre-processed any of the attributes by modifying them in any way.
  3. Transformation - Explain whether you created new features from the existing attributes, or from pairs of the existing attributes.  Did you transform any of the attributes into another representation of data?  Remember, you do not need to use all of the attributes in your model.  Try to evaluate which attributes you think will be useful, and use those attributes.
  4. Data Mining - Explain how you built your classification model.  There are many kinds of models that may work for this problem.  In my experience, I would probably start by trying a naive Bayes model, but I would also explore k-nearest neighbor models, support vector machines, and maybe even decision trees.  You are welcome to use whatever classification approach you would like, but remember, you need to end up with a number that represents your confidence that the test observation belongs to the positive (anomaly) class.
  5. Interpretation/Evaluation - Understand what your model is doing and how it is performing.  This may require you to separate your training data into different groups so that you can test your model's performance on a "hold out" group.
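
As an illustration of the "hold out" idea in step 5, here is a minimal sketch in base R.  Everything in it is an assumption made for the example: the data are synthetic, the attribute names x1 and x2 are invented, the 70/30 split is arbitrary, and logistic regression via glm() is used only as a simple stand-in for whichever classifier you choose.

```r
set.seed(202)  # make the random split reproducible

# Synthetic stand-in for the training data: two numeric attributes and a
# binary label (1 = anomaly).  The real attributes will differ.
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
label <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * x1))
train_all <- data.frame(x1 = x1, x2 = x2, label = label)

# Hold out 30% of the rows; fit only on the remaining 70%.
holdout_idx <- sample(n, size = round(0.3 * n))
holdout <- train_all[holdout_idx, ]
fit_set <- train_all[-holdout_idx, ]

# Logistic regression, used here only as a simple base R stand-in model.
model <- glm(label ~ x1 + x2, data = fit_set, family = binomial)

# Predicted probability of the positive class for each held-out row.
holdout_prob <- predict(model, newdata = holdout, type = "response")

# Compare the mean predicted probability between the actual classes.
tapply(holdout_prob, holdout$label, mean)
```

Because the held-out rows were never used for fitting, the predictions on them give a more honest picture of how the model might behave on the test set.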

Explain the decisions you made and provide visualizations supporting those decisions.  Furthermore, provide visualizations in the form of tables and/or figures for each attribute in your model, and provide visualizations for pairs of attributes that you think may be related.  Remember, understanding your data is an important part of the data mining process, and visualizing that data can help you understand it.
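
The kinds of tables and figures described above can be produced with base R alone.  The sketch below uses synthetic data, and the attribute names amount and channel are invented for the example; they are not attributes of the project data set.

```r
set.seed(1)

# Synthetic stand-in data: one numeric attribute, one categorical attribute,
# and a binary label.
d <- data.frame(
  amount  = rexp(500, rate = 1 / 50),
  channel = sample(c("a", "b", "c"), 500, replace = TRUE),
  label   = rbinom(500, 1, 0.1)
)

# Tables: a numeric summary and a cross-tabulation against the label.
summary(d$amount)
counts <- table(channel = d$channel, label = d$label)
prop.table(counts, margin = 1)  # share of each label within each channel

# Figures, written to a temporary PDF so the script runs non-interactively.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(d$amount, main = "Distribution of amount", xlab = "amount")
boxplot(amount ~ label, data = d, main = "amount by class label")
barplot(t(counts), beside = TRUE, legend.text = TRUE,
        main = "Label counts by channel")
dev.off()
```

The histogram covers a single attribute, while the box plot and bar plot each relate one attribute to the class label, which is the pairwise view the report asks for.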

Due Date
The class project is due on Wednesday, August 12, 2009.  E-mail the report to stats202 [at] gmail [dot] com and include your predictions on the test set as an attachment.

Frequently Asked Questions
As the class project goes on, I will use this space to post questions I receive that I think may be relevant for the rest of the class.

Question 1.  Can I work with other students in the class?
Yes, you can work with up to one other student; however, you need to submit your own report.  You can share the same classification model, code, and predictions with another student, but your reports should be written independently.

Question 2.  Can I seek advice from friends or colleagues?
Yes, you can seek advice from anyone, as long as you submit your own report and either you or your optional partner writes the code to visualize the data and build the classification model.

Question 3.  How will I be evaluated for the project?
If you explain and understand each step of the data mining process, understand and visualize the data effectively, and are able to build a classification model from the data, you can expect to get full credit.  Partial credit will be given as well.  Your grade on this project can replace the final if you score better on this than on the final or if you choose not to take the final.

Question 4.  Can we create a team and submit our solutions to the UCSD website before we submit our final model?
Yes!  I will sponsor anyone who wants to submit a solution.  You are welcome to submit solutions on the UCSD site before the project is due.  I believe that UCSD will keep the submission process open even after the competition closes for teams to continue testing and developing their models.

Question 5.  Can we submit more than one classification model?
Yes!  You can submit as many as you like, but you need to explain how and why you built each one.

Question 6.  By which metric will the model be evaluated?  How good does it have to be to get full credit?
The model will be evaluated in the following manner.  We will take the 20% of test set observations that you have given the highest probability of belonging to the positive class.  The fraction of those observations that actually belong to the positive class is the score of the model.  I haven't set a quantitative bar on how good the model has to be to get full credit.  A good report, with good visualizations and a good thought process will get full credit even for a model that may not work very well.
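
To make the scoring rule concrete, here is a minimal sketch of it in R that you could apply to a hold-out set.  The function name top_fraction_score is an assumption, and so are the details the description leaves open: how 20% is rounded to a whole number of observations and how tied probabilities are ordered.

```r
# Hypothetical helper for the score described above: the fraction of actual
# positives among the observations given the highest predicted probabilities.
top_fraction_score <- function(prob, label, fraction = 0.2) {
  stopifnot(length(prob) == length(label))
  # Rounding to the nearest whole observation is an assumption.
  k <- max(1, round(fraction * length(prob)))
  # order() breaks ties by original position, which is also an assumption.
  top <- order(prob, decreasing = TRUE)[seq_len(k)]
  mean(label[top] == 1)
}

# Made-up example with 10 observations, so the top 20% is 2 observations.
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
label <- c(1,   0,   1,   0,   0,   0,   1,   0,   0,   0)
top_fraction_score(prob, label)  # one of the top two is positive: 0.5
```

Note that only the ordering of the probabilities matters for this score, not their absolute values.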

Question 7.  Will you be teaching most of the tools required to do the class project, or will most of it be self-learning? For example, will we be learning how to build our own classifier model and train it with data sets, or is that something for us to figure out on our own in R?
I will be teaching several types of classifiers between July 10th and July 31st; however, you will have to become familiar enough with R to be able to write the program for one of the types of classifiers.  R comes with packages that can be used to construct support vector machines, decision trees, and much more, so hopefully you don't have to write much code from scratch.  For example, four different packages are available to construct support vector machines; they are listed in the section below.

R packages that may help with the project

You can install packages into R by using the following command:

install.packages()

Then select a download mirror near you and select the package to download.

There are several R packages to help you build classification models.  Here are a few:

  1. The first implementation of SVM in R (R Development Core Team 2005) was introduced in the e1071 (Dimitriadou, Hornik, Leisch, Meyer, and Weingessel 2005) package. The svm() function in e1071 provides a rigid interface to libsvm along with visualization and parameter tuning methods.
  2. Package kernlab features a variety of kernel-based methods and includes an SVM method based on the optimizers used in libsvm and bsvm (Hsu and Lin 2002c). It aims to provide a flexible and extensible SVM implementation.
  3. Package klaR (Roever, Raabe, Luebke, and Ligges 2005) includes an interface to SVMlight, a popular SVM implementation, and additionally offers classification tools such as Regularized Discriminant Analysis.
  4. Finally, package svmpath (Hastie 2004) provides an algorithm that fits the entire path of the SVM solution (i.e., for any value of the cost parameter).
  5. A naive Bayes classifier is available in package e1071.
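
As a rough sketch of how the naive Bayes classifier from e1071 can produce the probabilities the project asks for, consider the example below.  It assumes the e1071 package is installed and uses synthetic data with invented attribute names x1 and x2; it is not a recommended model for the project data.

```r
if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(7)
  # Synthetic stand-in data; the label is a factor so that the model
  # treats the problem as classification.
  train <- data.frame(
    x1 = c(rnorm(100, mean = 0), rnorm(100, mean = 2)),
    x2 = rnorm(200),
    label = factor(rep(c(0, 1), each = 100))
  )
  new_obs <- data.frame(x1 = c(-1, 3), x2 = c(0, 0))

  nb <- e1071::naiveBayes(label ~ x1 + x2, data = train)

  # type = "raw" returns one column of posterior probabilities per class;
  # the column named "1" is the probability of the positive class.
  post <- predict(nb, newdata = new_obs, type = "raw")
  positive_prob <- post[, "1"]
  print(positive_prob)
} else {
  message("Package e1071 is not installed; run install.packages(\"e1071\") first.")
}
```

Asking for type = "raw" rather than the default class prediction matters here, because the submission needs a confidence value for each observation rather than a hard 0 or 1 label.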

Attachments (3)

  • test_attributes.csv - on Jun 27, 2009 8:35 AM by Stats 202 (version 1)
    2236k Download
  • training_attributes.csv - on Jun 27, 2009 8:34 AM by Stats 202 (version 1)
    5873k Download
  • training_labels.csv - on Jun 27, 2009 8:35 AM by Stats 202 (version 1)
    185k Download