Machine Learning for Mortals

In the news, machine learning, data mining and data science can seem like magic, leading us into either dystopia or utopia. But there is nothing magical about machine learning: any mortal can learn the core theory behind it and quickly start building models, without much computing knowledge.


The lecture slides can be found here.

Lab Session

In these exercises we will use the WEKA open source data mining tool. You can download WEKA here.

Please also complete this anonymous survey. We will use it in exercise 3.

Assignment 1: Animal Trees

In this assignment we use a data set of animals and their attributes. Using a decision tree classifier, the computer learns to classify animals into different categories (mammals, fish, reptiles etc.).

1.1 The data set can be found here. Without using the data mining tool, draw a decision tree three to five levels deep that classifies each animal as a mammal, bird, reptile, fish, amphibian, insect or invertebrate.

1.2 Now we are going to let the computer discover a decision tree by itself. First download this zip file with data sets to your desktop and unzip it. Open the zoo.arff data set in WEKA (choose start menu – weka – weka-3-4 – Weka Explorer – Open file).

1.2.1 How many attributes are known of each animal?

1.2.2 How many animals are there in the data set?

1.3 Let us build some classifiers. Go to the Classify tab. We will use 66% of the animals to build the models and the remaining 34% to evaluate their quality, so select percentage split – 66%. First we will build a ‘naïve’ model that just predicts the most frequent class in the data set for every animal. This corresponds to a decision tree of depth 0. Click start to build a model.

1.3.1 What % of animals is correctly classified?

1.3.2 Into what category are all these animals classified and why?
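
To see what the percentage split and the naïve model do, here is a minimal Python sketch. It is only an illustration: the ten labelled rows are made up (they are not the zoo data), the function names are our own, and the split is taken in file order with a rounded cut-off, whereas WEKA normally randomizes the order first. In WEKA itself the majority-class model is the ZeroR classifier.

```python
from collections import Counter

# Hypothetical toy data: (item id, class label) pairs, NOT the zoo.arff contents.
rows = [
    ("r1", "A"), ("r2", "B"), ("r3", "A"), ("r4", "C"), ("r5", "A"),
    ("r6", "B"), ("r7", "A"), ("r8", "C"), ("r9", "A"), ("r10", "B"),
]


def percentage_split(rows, train_fraction=0.66):
    """Split rows in their given order: the first part trains, the rest evaluates."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def majority_class(rows):
    """Return the class label that occurs most often in rows."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def accuracy(rows, predicted_label):
    """Fraction of rows whose true label equals the single predicted label."""
    hits = sum(1 for _, label in rows if label == predicted_label)
    return hits / len(rows)


train, test = percentage_split(rows)
baseline = majority_class(train)          # learned from the training part only
baseline_accuracy = accuracy(test, baseline)  # measured on the held-out part only

print("training rows:", len(train), "test rows:", len(test))
print("baseline always predicts:", baseline)
print("accuracy on the test rows:", baseline_accuracy)
```

The point of the baseline is that any more complex model should beat it; if it does not, the extra complexity has bought nothing.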

1.4 Now build a decision tree of depth 1 (a.k.a. a decision stump - select choose – trees – decision stump). Draw the discovered decision tree.

1.4.1 What % of animals is correctly classified?

1.4.2 Give an example of an animal that would not be classified correctly by this model.
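
A decision stump tests a single attribute and predicts one class per attribute value. The sketch below shows the idea on six made-up fruits (hypothetical data and names, not the zoo data). It picks the attribute with the most correctly classified training rows, which is a simplification and not necessarily the criterion WEKA's decision stump uses.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (attribute values, class label).
rows = [
    ({"color": "red", "size": "big"}, "apple"),
    ({"color": "red", "size": "small"}, "cherry"),
    ({"color": "yellow", "size": "big"}, "apple"),
    ({"color": "red", "size": "big"}, "apple"),
    ({"color": "red", "size": "small"}, "cherry"),
    ({"color": "yellow", "size": "small"}, "lemon"),
]


def fit_stump(rows):
    """Return (attribute, {value: predicted class}, training accuracy)."""
    best = None
    for attr in rows[0][0]:
        # Count the class labels seen for each value of this attribute.
        by_value = defaultdict(Counter)
        for features, label in rows:
            by_value[features[attr]][label] += 1
        # Each branch predicts the most frequent class among its rows.
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    attr, rule, correct = best
    return attr, rule, correct / len(rows)


stump_attr, stump_rule, stump_accuracy = fit_stump(rows)
# Rows the stump gets wrong, comparable to the animal asked for in 1.4.2.
misclassified = [label for features, label in rows
                 if stump_rule[features[stump_attr]] != label]

print("split on:", stump_attr)
print("rule:", stump_rule)
print("training accuracy:", stump_accuracy)
print("misclassified classes:", misclassified)
```

Because the stump has only one branch per attribute value, it can never predict more classes than the chosen attribute has values; every class beyond that is always misclassified.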

1.5 Now build a decision tree of any depth (a.k.a. a J48 tree). Draw the discovered decision tree.

1.5.1 What % of animals is correctly classified?

1.5.2 Give an example of an animal that would not be classified correctly by this model.
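
J48 is WEKA's implementation of the C4.5 algorithm, which grows a tree by repeatedly choosing the attribute that best separates the classes. As a rough illustration of that choice, the sketch below computes plain information gain on the same kind of made-up fruit data (hypothetical rows and function names). C4.5 actually uses a normalized variant, the gain ratio, and also prunes the tree, so treat this as a simplified picture rather than a description of J48.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy rows: (attribute values, class label).
rows = [
    ({"color": "red", "size": "big"}, "apple"),
    ({"color": "red", "size": "small"}, "cherry"),
    ({"color": "yellow", "size": "big"}, "apple"),
    ({"color": "red", "size": "big"}, "apple"),
    ({"color": "red", "size": "small"}, "cherry"),
    ({"color": "yellow", "size": "small"}, "lemon"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr):
    """Entropy of all labels minus the weighted entropy after splitting on attr."""
    labels = [label for _, label in rows]
    groups = defaultdict(list)
    for features, label in rows:
        groups[features[attr]].append(label)
    remainder = sum(len(group) / len(rows) * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


gains = {attr: information_gain(rows, attr) for attr in rows[0][0]}
best_attr = max(gains, key=gains.get)

print("information gain per attribute:", gains)
print("attribute chosen for the root:", best_attr)
```

A full tree learner would apply the same choice again inside each branch until the rows in a branch all share one class or no useful attribute is left.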

Assignment 2: Animal Rules

In this exercise you will use the association rule algorithm to discover interesting regularities in the zoo data set.

2.1 The association rule algorithm used here can only cope with non-numerical (‘nominal’) attributes, so you first have to transform the numerical attribute ‘legs’ into discrete bins (for example 0, 2, 4, >4 legs). This type of data preprocessing can be performed in the preprocess tab by applying the right filter (select Discretize or PKIDiscretize and then Apply). Check the results before and after applying the filter. Now run the association rule algorithm. You can change the numRules option to get more rules if needed.

2.1.1 List at least three interesting rules.

2.1.2 Give at least one example of a rule that is always true according to the algorithm (hint: look at the confidence).

2.1.3 Give a counterexample for a specific rule (an example for which the rule is not correct).
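
The confidence of a rule "IF antecedent THEN consequent" is the fraction of rows matching the antecedent that also match the consequent; a counterexample is a row that matches the antecedent but not the consequent. The sketch below computes both for two hand-written rules on made-up data (hypothetical rows, attribute names and function names; it does not search for rules the way the WEKA algorithm does).

```python
# Hypothetical toy rows with nominal attributes only.
rows = [
    {"name": "f1", "color": "red", "size": "big", "taste": "sweet"},
    {"name": "f2", "color": "red", "size": "small", "taste": "sweet"},
    {"name": "f3", "color": "yellow", "size": "big", "taste": "sweet"},
    {"name": "f4", "color": "yellow", "size": "small", "taste": "sour"},
    {"name": "f5", "color": "red", "size": "big", "taste": "sweet"},
]


def matches(row, conditions):
    """True if the row has every attribute value listed in conditions."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence, counterexamples) for one rule."""
    covered = [row for row in rows if matches(row, antecedent)]
    correct = [row for row in covered if matches(row, consequent)]
    counterexamples = [row for row in covered if not matches(row, consequent)]
    support = len(correct) / len(rows)
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence, counterexamples


# Rule 1: IF color=red THEN taste=sweet
support_1, confidence_1, counter_1 = rule_stats(rows, {"color": "red"}, {"taste": "sweet"})
# Rule 2: IF size=small THEN taste=sweet
support_2, confidence_2, counter_2 = rule_stats(rows, {"size": "small"}, {"taste": "sweet"})

print("rule 1:", support_1, confidence_1, [row["name"] for row in counter_1])
print("rule 2:", support_2, confidence_2, [row["name"] for row in counter_2])
```

A rule with confidence 1 has no counterexamples in the data it was mined from, which is all that "always true according to the algorithm" means; it says nothing about rows the algorithm has not seen.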

Assignment 3: Mine Yourself

At the beginning of this lab session you answered some questions about yourselves. In this exercise we will mine the survey answers of all participants to discover interesting, surprising and counterintuitive patterns in the data.

3.1 Build a decision tree to predict whether someone likes beer or wine. What is the predictive power of the model? What are important distinguishing characteristics?

3.2 Build classifiers for a selection of the other attributes. For each attribute note the classification accuracy and some distinguishing characteristics. Which attribute is easiest to predict and which one is hardest to predict?

3.3 Use the association rules algorithm to derive interesting rules from this data set. Pick the three rules that you find most interesting (funniest, most trivial, most counterintuitive).

We will discuss some of the patterns found with the group.

Assignment 4: Data Mining Case Projects

The zip file from assignment 1 contains a number of data sets from a variety of areas. Most data sets contain a short description in the header; to read it, open the file in a text editor such as Notepad. This exercise can be done in pairs.

Pick a data set that looks interesting and write it on the blackboard so that we don’t get two teams working on the same data set.

For your data set / data mining case note:

  1. The practical problem that is being solved here
  2. The goal of the classifier: what needs to be predicted
  3. A high level description of the data: kind of attributes available, number of attributes / instances etc.
  4. Examples of interesting patterns found by just analyzing individual attributes
  5. The classification accuracy for each classifier type – a decision stump, a decision tree and optionally another type of classifier
  6. The patterns discovered by at least one of the classifiers
  7. One or more interesting association rules
  8. A suggestion of how such a prediction can be used in practice

Repeat this process for a number of data sets until you find the one that interests you most. Additional data sets can be found at the Weka website, OpenML (already in Weka format), UCI, Kaggle, KDnuggets etc. If you have a lot of extra time, you can also try other tools, for example here.

For at least one of the projects, create a small PowerPoint presentation discussing your most interesting results. If you worked in a pair, one of you should act as the domain expert and present the beginning and the end; the other should act as the data mining expert and present the data mining approach and results. The rest of the group will ask questions after the presentation.