In this project you will build a decision tree classifier for a dataset of edible and poisonous mushrooms. You will have to parse observations, build a decision tree, and be able to both query the decision tree as well as print a text guide suitable for carrying through the woods. Please note this is a toy problem, and should not be used for actually determining if any real mushrooms you find are edible or not.
This project has fewer function stubs than previous projects: you are expected to decompose functions into logical sub-problems and define helper functions.
Due Date: Friday October 29th
Late Due Date: Monday November 1st
Create your Github Classroom Repository.
In-Class Activity and the Resulting Tree
Mushroom.hs - This is the file you will edit. Some types are defined for you, and some types will have to be defined. Functions are listed in roughly the suggested implementation order.
mushroom.csv - Your input file of observations about mushrooms.
Main.hs - This file contains the I/O actions that turn your code into an executable. You do not need to understand anything in this file.
README.md - This file contains a brief project description, as well as spots for your name and Trinity ID (i.e. sfogarty). Be certain to add your name and ID.
Makefile - The makefile supports make, make clean, and make test. The executable is output to fungi. Be certain make runs before you submit!
sampleInputs.hs - A few sample inputs for readObservation, readObservationFile, numCorrect, and buildTree.
The executable fungi can be built using make. If the executable does not build, use make setup to install any missing Haskell packages. The executable and compiled source files can be removed using make clean. By default fungi will print a guide constructed from the decision tree built from the observations in mushroom.csv.
Valid options are:
-h or --help Prints a help message and exits.
-n n Uses only first n mushrooms from mushroom.csv.
-p or --print Read in the mushrooms and print their representation to the screen, without building a decision tree.
Useful for debugging.
-t or --tree Print the decision tree, without any formatting. Useful for debugging.
-d m Decide whether or not the mushroom m is edible, printing a percentage estimate.
The mushroom should be given in the same format as the input file, without the first column.
-l l or --limit=l Use l as the threshold for edibility. Defaults to 0.9.
Note that there are no unit tests for this project. Test cases are provided in Project 4 Test Cases, and sample inputs are provided in sampleInputs.hs.
For the milestone you must define data-types for mushroom attributes and decision trees, as well as functions to read in observations, examine lists of observations, and evaluate a mushroom on a decision tree. Each mushroom is described by a list of attributes. An observation is the tuple of a mushroom and the edibility of that mushroom.
You are given the following data types and type aliases.
data Edible = Nom | NoNom deriving (Show, Eq, Ord)
type Mushroom = [Attribute]
type Observation = (Mushroom, Edible)
The input file has one line per observation, broken into 8 columns by commas as follows.
edibility,stalk-color,cap-color,spore-color,cap-shape,stalk-shape,gill-size,odor
The first column is either "edible" or "poison", describing the edibility of the mushroom. The remaining 7 columns describe the attributes of the mushroom. The possible values for each column are:
stalk-color: brown, green, purple, white, yellow
cap-color: brown, green, purple, white, yellow
spore-color: brown, green, purple, white, yellow
cap-shape: bell, conical, knobbed
stalk-shape: bulbous, club, missing
gill-size: broad, narrow
odor: almond, anise, foul, none, musty
You must define the algebraic data type Attribute. Each instance of Attribute will represent one of the 28 different possible attributes. You may not use strings or integers. You may decompose this problem by creating helper types for the attributes, similar to the use of the Operator type to decompose the different Tokens.
Note that each mushroom in the input file will have exactly seven attributes in precisely the listed order. The Mushroom type alias is more general, and could represent mushrooms with missing, or extra, attributes. You may use the fact that the attributes will be in the same order, for instance to simplify equality checking.
After defining the Attribute data type, implement the following functions.
readObservation translates a single line of the input file into an Observation.
readObservationFile translates the entire contents of the input file into a list of Observations.
Define the function numCorrect to compute how much edibility information can be gained by checking a specific attribute against a list of Observations. This implements the functionality of the in-class template.
The next step is to build a decision trees. A decision tree is used to evaluate one instance against a known corpus of instances. In this case, we want to evaluate the edibility of a new mushroom against our list of observations. Decision trees are a binary tree with two kinds of nodes:
Decision nodes are labeled with an attribute, and have two children. The left is used to to evaluate mushrooms that do not have that attribute, and the right is used to evaluate mushrooms that do have the attribute.
End nodes are leaves, and have no children. They must store enough information to estimate the edibility of a mushroom that ends up at that node. Do note store any observations, simply compute an appropriate summary.
Define an algebraic data type for decision trees and implement the buildTree function to turn a list of observations into a decision tree
buildTree constructs a decision tree is build from a list of observations. If the list of observations is coherent enough, it is turned into an end node. Otherwise, search for the attribute that gives us the most information about edibility, and create a decision node labeled with that attribute. Split the observations into those with and without that attribute, and recursively create the left and right children.
For the full project, we will write two functions to evaluate a decision tree.
rateMushroom evaluates a mushroom against a decision tree. It takes in a mushroom, a decision tree, and a rational describing the safety limit for edibility. It returns a string with instructions. For full credit, include a percentage estimate.
buildGuide turns a decision tree into a human-readable, step-by-step, guide to edibility. It takes a decision tree and a safety limit, and returns a list of strings representing the numbered steps. Each step represents a node in the tree, and takes one of two forms:
"n: Eat the mushroom." / "n: Do not eat the mushroom."
"n: If the mushroom has (attribute) go to step x, otherwise go to step y."
The guide produced for the core project may be quite ugly. For instance, the guide produced by ./fungi -n 4200 is:
1: If the mushroom has a musty odor go to step 2, otherwise go to step 3.
2: Do not eat the mushroom.
3: If the mushroom has a foul odor go to step 4, otherwise go to step 5.
4: Do not eat the mushroom.
5: If the mushroom has green spores go to step 6, otherwise go to step 7.
6: Do not eat the mushroom.
7: Eat the mushroom.
Notice the repeated lines. Further, your guide will likely display your internal representation of Attributes instead of "a musty odor." As long as your guide has the same steps in a logically consistent order, you have completed the core project. The complete guide should have 33 steps. See Project 4 Test Cases for the full tree and other test cases.
The first step to improving the guide is to make a custom instance of the Show typeclass for Attribute. You should print grammatically correct lines in your guide. While it would usually be acceptable to create a custom function to turn attributes into strings, for the purposes of this assignment you must use show and implement your function as an instance of the Show typeclass. If you have any sub-types for Attribute, they should also have custom instances of Show.
As an important note, using a custom instance of Show like this is considered bad practice. Why is left as an exercise to the reader.
The second step to improving the guide is to remove redundant strings from the guide. To do this, use common sub-expression elimination. Instead of directly translating each node into a string, keep an index that maps strings to integer locations. At first, we will use association lists for indexes, as defined by the following type alias:
type Index a = [(a, Int)]
The makeEntry function takes a new element to add to an existing index. It returns a tuple of the integer location for the string, and an index that is guaranteed to contain that string in that location. Most critically, if the string is already in the index, it will not add it again.
Define buildGuideCSE to build an index containing all the steps. Avoid adding the step number to the string: the index already maps each string to a step number, and the two can be combined after the index is built. To add an "If the mushroom has (attribute) go to step x, otherwise go to step y." step, you will have to have already added the strings for x and y to the index. this means x and y will have lower numbers, and your entire index will be "reversed." The output is described in Project 4 Test Cases
Finally, it is worth noting that association lists are slow. For the size of the decision trees we are creating (18-40 nodes) this is irrelevant. However, for a large enough tree the overhead of searching through a list can be considerable. Replace indexes with association binary search trees, similar to those implemented in class. You do not have to ensure the tree is balanced.
Subject to change
Milestone 45
Core Project 40
Full Credit 15