Can be used to predict a categorical or a continuous target (called regression trees in the latter case)
Unlike logistic regression and neural networks, no equations are estimated in decision trees
A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
The rules are of an IF-THEN form. For example:
If Risk = Low, then predict on-time payment of a loan
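As a rough illustration of such rules, the sketch below fits a shallow scikit-learn tree to a made-up loan data set (the risk_score input and the 10% label noise are assumptions, not data from this course) and prints the resulting IF-THEN rules.

```python
# Minimal sketch: fit a shallow decision tree and print its IF-THEN rules.
# The "risk_score" feature and the loan outcomes are simulated for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
risk_score = rng.uniform(0, 1, size=200).reshape(-1, 1)   # low score = low risk
on_time = (risk_score.ravel() < 0.5).astype(int)          # 1 = on-time payment
on_time ^= (rng.uniform(size=200) < 0.1).astype(int)      # flip 10% of labels as noise

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(risk_score, on_time)
print(export_text(tree, feature_names=["risk_score"]))    # rules printed as IF-THEN text
```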
Advantages:
Easy to interpret (Tree structured presentation)
Allow mixed input data types: Categorical, ordinal, & interval
Allow discrete (binary and categorical) or continuous target
Robust to outliers in inputs
No problem with missing values
Automatically:
Accommodates nonlinearity
Selects input variables
Disadvantages:
Error prone if training data too small
Can be time consuming to train
Builds rectangular regions which may not correspond well with actual data distribution
Decision Tree Terminology
Root Node: Represents the entire population or sample; it is further divided into two or more homogeneous sets
Splitting: The process of dividing a node into two or more sub-nodes
Decision Node: A sub-node that splits into further sub-nodes
Leaf / Terminal Node: A node that does not split
Pruning: The removal of sub-nodes from a decision node
Branch / Sub-Tree: A subsection of the entire tree
Parent and Child Node: A node that is divided into sub-nodes is called the parent node, and the sub-nodes are its child nodes
The split at the top (root) node is the most significant
Splits at the nodes immediately below are less significant
Splits at the child nodes of those less significant nodes are the least significant
Inputs that are not involved in any split are not significant
Variation of Decision Trees
Classification tree
The target is discrete (binary, nominal)
The leaves give the predicted class as well as the probability of class membership
Regression tree
The target is continuous
The leaves give the predicted value of the target
Tree with binary splits
Tree with multiway splits
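The sketch below illustrates the first two variations with scikit-learn on a toy data set (all names and data are assumptions); note that scikit-learn trees use binary splits only, so the multiway-split variant is not shown.

```python
# Sketch of the two tree variants, assuming scikit-learn and simulated data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))

# Classification tree: discrete target; leaves give the predicted class
# and the probability of class membership.
y_class = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
print(clf.predict(X[:2]), clf.predict_proba(X[:2]))

# Regression tree: continuous target; leaves give the predicted value of the target.
y_cont = 2 * X[:, 0] + rng.normal(scale=0.3, size=300)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_cont)
print(reg.predict(X[:2]))
```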
Modeling Essentials – Decision Trees
1. Predict new cases (Prediction rules)
The goal of the prediction rules is to route each new case to a leaf node, which supplies the prediction
If 0 (green) denotes staying with the company and 1 (yellow) denotes leaving the company:
A green leaf node may still contain 40% of cases that leave the company
2. Select useful inputs (Split search)
Calculate the logworth of every partition on input x1
Select the partition with the maximum logworth
Repeat for input x2
Compare partition logworth ratings
Create a partition rule from the best partition across all inputs
Repeat the process in each subset
Create a second partition rule
Repeat to form a maximal tree
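The following is a minimal sketch of a logworth-based split search for one interval input and a binary target. It assumes logworth is taken as -log10 of the chi-square p-value of the candidate partition; the data and the set of candidate split points are illustrative and do not reproduce the SAS implementation.

```python
# Sketch of a split search: rate every candidate partition of input x by logworth
# and keep the partition with the maximum logworth. Data are simulated.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
y = (x + rng.normal(scale=2.0, size=500) > 5).astype(int)   # binary target

def logworth(x, y, cut):
    """-log10(p) for the chi-square test of the partition x < cut vs x >= cut."""
    left = x < cut
    table = np.array([[np.sum(y[left] == 0), np.sum(y[left] == 1)],
                      [np.sum(y[~left] == 0), np.sum(y[~left] == 1)]])
    _, p, _, _ = chi2_contingency(table)
    return -np.log10(p)

candidates = np.quantile(x, np.linspace(0.05, 0.95, 19))     # candidate partitions
ratings = [(cut, logworth(x, y, cut)) for cut in candidates]
best_cut, best_lw = max(ratings, key=lambda t: t[1])
print(f"best partition: x < {best_cut:.2f}, logworth = {best_lw:.1f}")
```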
3. Optimize complexity (Pruning)
Predictive Model Sequence
Create a sequence of models with increasing complexity
Maximal Tree
A maximal tree is the most complex model in the sequence
Pruning One Split
The next model in the sequence is formed by pruning one split from the maximal tree
Each subtree’s predictive performance is rated on validation data
The subtree with the highest validation assessment is selected
Pruning Two Splits
Similarly, this is done for subsequent models
Prune two splits from the maximal tree
Rate each subtree using validation assessment and select the subtree with the best assessment rating
Subsequent Pruning
Continue pruning until all subtrees are considered
Selecting the Best Tree
Compare validation assessment between tree complexities
Validation Assessment
Choose the simplest model with the highest validation assessment
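A rough sketch of this sequence-and-select idea is shown below, using scikit-learn's cost-complexity pruning as a stand-in for the split-by-split pruning described above; the data set, the 50/50 split, and validation accuracy as the assessment measure are all assumptions.

```python
# Sketch of complexity optimization: build a maximal tree, generate a sequence of
# pruned subtrees, and keep the simplest one with the best validation assessment.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = ((X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)) > 0).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

# Sequence of subtrees with increasing pruning strength (alpha=0 gives the maximal tree).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best = None
for alpha in path.ccp_alphas:
    subtree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    assessment = subtree.score(X_valid, y_valid)        # validation accuracy
    leaves = subtree.get_n_leaves()
    # Prefer higher validation assessment; break ties toward the simpler tree.
    if best is None or (assessment, -leaves) > (best[0], -best[1]):
        best = (assessment, leaves, alpha)

print(f"selected subtree: {best[1]} leaves, validation accuracy {best[0]:.3f}")
```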
Assessment Statistics (Ratings depend on):
Target measurement (binary, continuous etc.)
Prediction type (decisions, rankings, estimates)
Binary Targets
Primary outcome
Secondary outcome
Binary Target Predictions
Decision Optimization
Accuracy
True positive
True negative
Maximize accuracy: Agreement between outcome and prediction
Misclassification
False negative
False positive
Minimize misclassification: disagreement between outcome and prediction
Ranking Optimization
Concordance
Target = 0 -> Low score
Target = 1 -> High score
Maximize concordance: proper ordering of primary and secondary outcomes
Discordance
Target = 0 -> High score
Target = 1 -> Low score
Minimize discordance: improper ordering of primary and secondary outcomes
Estimate Optimization
Squared error
(Target - Estimate)^2
Minimize squared error: squared difference between target and prediction
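The sketch below computes the three kinds of assessment statistics for a small set of illustrative binary outcomes and predicted scores (the numbers are made up): accuracy and misclassification for decisions, concordance and discordance for rankings, and average square error for estimates.

```python
# Sketch of decision, ranking, and estimate assessment statistics for a binary target.
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 0, 1])                    # observed outcomes
p = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8])    # predicted probabilities (scores)
d = (p >= 0.5).astype(int)                                 # decisions

accuracy = np.mean(d == y)                                 # maximize: agreement
misclassification = np.mean(d != y)                        # minimize: disagreement

# Rankings: compare scores of primary (1) and secondary (0) outcome cases pairwise.
scores_1, scores_0 = p[y == 1], p[y == 0]
pairs = len(scores_1) * len(scores_0)
concordance = sum(s1 > s0 for s1 in scores_1 for s0 in scores_0) / pairs   # maximize
discordance = sum(s1 < s0 for s1 in scores_1 for s0 in scores_0) / pairs   # minimize

ase = np.mean((y - p) ** 2)                                # minimize: average square error

print(accuracy, misclassification, concordance, discordance, ase)
```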
Pruning Criteria (Limiting the size of a decision tree)
Entropy Pruning Criterion:
Entropy is used to calculate the homogeneity of a sample
If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, the entropy is one
An entropy target can be set, and pruning is based on whether the tree meets that target
Gini Pruning Criterion:
As with entropy, a Gini target can be set, and pruning is based on the change in the global Gini statistic
Misclassification Rate Pruning Criterion:
The misclassification rate is simply the number of mispredictions divided by the number of predictions
Thus, a misclassification rate target can be set, and the tree is pruned back if it cannot meet that target
Average Square Error Pruning Criterion:
The average square error (ASE) is based on the sum of squares error (SSE)
Similarly, an ASE target can be set, and the tree is pruned back if it cannot meet the ASE target
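For reference, the sketch below evaluates the two impurity measures behind the entropy and Gini criteria at a few illustrative leaf proportions; it does not reproduce the exact pruning mechanics in SAS Enterprise Miner.

```python
# Sketch of the node impurity measures behind the entropy and Gini criteria,
# evaluated at illustrative proportions of target=1 cases in a node.
import numpy as np

def entropy(p):
    """Binary entropy: 0 for a pure node, 1 for a 50/50 split (log base 2)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(p):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 split."""
    return 2 * p * (1 - p)

for prop in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={prop:.1f}  entropy={entropy(prop):.3f}  gini={gini(prop):.3f}")
```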
SAS Applications
3. SAS - Decision Tree
3.1. Creating Training and Validation Data
Click the Sample tool tab to use the "Data Partition" tool
Select the "Data Partition" node and examine its Properties panel
Use the Properties panel to select the fraction of data devoted to the Training, Validation, and Test partitions
By default, less than half the available data is used for generating the predictive models
There is a trade-off in various partitioning strategies:
More data devoted to training results in more stable predictive models, but less stable model assessments, and vice versa
The Test partition is used only for calculating fit statistics after the modeling and model selection is complete
Enter 50 as the Training value in the Data Partition node
Enter 50 as the Validation value in the Data Partition node
Enter 0 as the Test value in the Data Partition node
Select Results in the Run Status window to view the results
The Results window provides a basic metadata summary of the raw data that feeds the node, and a frequency table that shows the distribution of the target variable in the raw, training, and validation data sets
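Outside of SAS Enterprise Miner, a comparable 50/50 stratified partition and target frequency table could be produced as sketched below; the file name pva_raw_data.csv is a placeholder, and only the target variable name TARGET_B comes from this example.

```python
# Sketch of a 50/50 training/validation partition with a target frequency table,
# assuming pandas/scikit-learn and a hypothetical input file.
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("pva_raw_data.csv")              # hypothetical file name
train, valid = train_test_split(
    raw, train_size=0.5, stratify=raw["TARGET_B"], random_state=0
)

# Distribution of the target in the raw, training, and validation data sets.
print(pd.DataFrame({
    "raw": raw["TARGET_B"].value_counts(normalize=True),
    "train": train["TARGET_B"].value_counts(normalize=True),
    "validation": valid["TARGET_B"].value_counts(normalize=True),
}))
```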
3.2.1. Preparing for Interactive Tree Construction
The Decision Tree tool can build predictive models autonomously or interactively.
To build a model autonomously, simply run the Decision Tree node.
Building models interactively, however, is more informative
Select "Interactive" from the Decision Tree node's Properties panel
The SAS Enterprise Miner Interactive Decision Tree application appears
Right-click the blue box and select Split Node from the menu
The Split Node 1 dialog box appears
3.2.2. Creating a Splitting Rule
The Split Node dialog box shows the relative value, -Log(p) or logworth of partitioning the training data using the indicated input
As the logworth increases, the partition better isolates cases with identical target values
"Gift Count 36 Months" has the highest logworth, followed by "Gift Amount Average Card 36 Months" and "Gift Amount Last"
This dialog box shows how the training data is partitioned using the input Gift Count 36 Months
Two branches are created.
The first branch contains cases with a 36-month gift count less than 2.5
The second branch contains cases with a 36-month gift count greater than or equal to 2.5
In other words, cases with a 36-month gift count of zero, one, or two branch left, and cases with a count of three or more branch right
In addition, any cases with a missing or unknown 36-month gift count are placed in the second branch
Click "Apply", and then click "OK" twice
This model assigns to all cases in the left branch a predicted TARGET_B value equal to 0.43 and to all cases in the right branch a predicted TARGET_B value equal to 0.56
The training data is partitioned into two subsets.
The first subset, corresponding to cases with a 36-month gift count less than 2.5, has a higher than average concentration of TARGET_B=0 cases.
The second subset, corresponding to cases with a 36-month gift count greater than or equal to 2.5, has a higher than average concentration of TARGET_B=1 cases
The second branch has slightly more cases than the first on training and validation data sets, which is indicated by the Count field
In general, decision tree predictive models assign all cases in a leaf the same predicted target value. For binary targets, this value equals the proportion of cases in the leaf with the target variable's primary outcome (target=1)
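The sketch below mimics this behavior on simulated gift-count data (the 0.43/0.56 response rates are taken from the example above; everything else is assumed): a one-split regression tree on a 0/1 target predicts each leaf's proportion of TARGET_B=1 cases.

```python
# Sketch: a one-split tree assigns each leaf's primary-outcome proportion as the prediction.
# The gift-count data are simulated, not the actual PVA data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
gift_count_36 = rng.poisson(2.2, size=5000).astype(float).reshape(-1, 1)
prob = np.where(gift_count_36.ravel() < 2.5, 0.43, 0.56)   # assumed response rates
target_b = (rng.uniform(size=5000) < prob).astype(int)

# A regression tree on a 0/1 target predicts the leaf mean,
# i.e. the proportion of TARGET_B=1 cases in that leaf.
stump = DecisionTreeRegressor(max_depth=1).fit(gift_count_36, target_b)
print(stump.tree_.threshold[0])                            # split point near 2.5
print(stump.predict([[1.0]]), stump.predict([[4.0]]))      # left and right leaf proportions
```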
3.2.3. Adding More Splits
It is important to remember that the logworth of each split, reported above, and the predicted TARGET_B values are generated using the training data only
The main focus is on selecting useful input variables for the first predictive model. Given the nature of the split search, a diminishing marginal usefulness of additional input variables is expected
3.2.4. Changing a Splitting Rule
3.2.5. Creating the Maximal Tree
Select the root node of the tree
Right-click and select "Train Node" from the menu
Right-click in the gray area behind the tree
Select View -> Fit to page
To view the information within each node, right-click in the gray area behind the tree and select View -> Chart tips
Save the tree by selecting File -> Save
Then select File -> Exit to close the Interactive Tree application
Select View -> Subtree Assessment Plot
Looking at the plot for the training data, the majority of the improvement in fit occurs over the first few splits; nevertheless, the maximal, fifteen-leaf tree appears to generate the lowest misclassification rate
The plot on training data seems to indicate that the maximal tree is preferred for assigning predictions to cases
However, because it reflects only the training data, this plot is misleading
Using the same sample of data both to evaluate input variable usefulness and to assess model performance commonly leads to overfit models
Looking at an assessment plot based on validation data provides the solution
4.1. Assessing a Decision Tree
Select the Decision Tree node
In the Decision Tree node's Train properties, change the Use Frozen Tree property’s value from No to Yes
The Frozen Tree property prevents the maximal tree from being changed by other property settings when the flow is run
Right-click the Decision Tree node and run it
Select Results
The Results window appears
The Results window contains a variety of diagnostic plots and tables, including a cumulative lift chart, a treemap, and a table of fit statistics
The diagnostic tools shown in the results vary with the measurement level of the target variable
Select View -> Model -> Subtree Assessment Plot
The plot shows the Average Square Error corresponding to each subtree as the data is sequentially split
This plot is similar to the one generated with the Interactive Decision Tree tool, and it confirms suspicions about the optimality of the 15-leaf tree
The performance on the training sample becomes monotonically better as the tree becomes more complex
However, the performance on the validation sample only improves up to a tree of, approximately, four or five leaves, and then diminishes as model complexity increases
The validation performance shows evidence of model overfitting
Over the range of one to approximately four leaves, the precision of the model improves with the increase in complexity.
A marginal increase in complexity over this range results in better accommodation of the systematic variation or signal in data
Precision diminishes as complexity increases past this range; the additional complexity accommodates idiosyncrasies in the training sample, and the model extrapolates less well.
The validation performance under Misclassification Rate is similar to the performance under Average Square Error
The optimal tree appears to have, approximately, four or five leaves
4.2. Pruning a Decision Tree
The default method used to prune the maximal tree is Assessment. This means that algorithms in SAS Enterprise Miner choose the best tree in the sequence based on some optimality measure
Alternative method options are Largest and N.
The Largest option provides an autonomous way to generate the maximal tree.
The N option generates a tree with N leaves. The maximal tree is the upper bound on N.
In the Result window, select View -> Model -> Subtree Assessment Plot
The five-leaf tree has the lowest associated misclassification rate on the validation sample
The maximal tree is then sequentially pruned so that the sequence consists of the best 15-leaf tree, the best 14-leaf tree, and so on
The tree in the sequence with the lowest overall validation misclassification rate is selected.
4.3. Alternative Assessment Measures
The procedure is the same as the steps above
Create a new Decision Tree node "Probability Tree"
Change the Assessment Measure property to Average Square Error
A five-leaf tree is also optimal under the Validation Average Square Error criterion
However, viewing the Tree plot reveals that, although the optimal Decision (Misclassification) Tree and the optimal Probability Tree have the same number of leaves, they are not identical trees
The tree shown below is a different five-leaf tree than the tree optimized under Validation Misclassification
In the first step, cultivating (or growing or splitting) the tree, the important measure is logworth
Splits of the data are made based on logworth to isolate subregions of the input space with high proportions of donors and non-donors
Splitting continues until a boundary associated with a stopping rule is reached. This process generates the maximal tree.
4.4. Additional Diagnostic Tools
Treemap
The Tree Map window is intended to be used in conjunction with the Tree window to gauge the relative size of each leaf
The size of the small rectangle in the treemap indicates the fraction of the training data present in the corresponding leaf. It appears that only a small fraction of the training data finds its way to this leaf
Leaf Statistic Bar Chart
This window compares the blue predicted outcome percentages in the left bars (from the training data) to the red observed outcome percentages in the right bars (from the validation data)
This plot reveals how well training data response rates are reflected in the validation data
Ideally, the bars should be of the same height
Differences in bar heights are usually the result of small case counts in the corresponding leaf
Variable Importance
Select View -> Model -> Variable Importance
The Variable Importance window provides insight into the importance of inputs in the decision tree
The magnitude of the importance statistic relates to the amount of variability in the target explained by the corresponding input relative to the input at the top of the table
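A comparable table can be sketched with scikit-learn as below, rescaling each importance relative to the top input; the data set and input names are assumptions, and scikit-learn's impurity-based importance is not identical to the SAS statistic.

```python
# Sketch of a variable-importance table for a fitted decision tree,
# rescaled relative to the most important input.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3, random_state=0)
names = [f"input_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
imp = pd.Series(tree.feature_importances_, index=names).sort_values(ascending=False)
print((imp / imp.iloc[0]).round(3))      # importance relative to the top input
```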
The Score Rankings Overlay Plot
It is commonly called a cumulative lift chart
Cases in the training and validation data are ranked, based on decreasing predicted target values
A fraction of the ranked data is selected (given by the decile value)
The proportion of cases with the primary target value in this fraction is compared to the proportion of cases with the primary target value overall (given by the cumulative lift value)
A useful model shows high lift in low deciles in both the training and validation data
For example, the Score Ranking plot shows that in the top 20% of cases (ranked by predicted probability), the training and validation lifts are approximately 1.26
This means that cases in this top 20% are about 26% more likely to have the primary outcome than cases in a randomly selected 20% of the data.
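The lift calculation itself can be sketched as below with made-up predicted probabilities and outcomes: rank cases by predicted probability, take the top 20%, and divide that group's response rate by the overall response rate.

```python
# Sketch of the cumulative-lift calculation behind the Score Rankings plot.
# The predicted probabilities and outcomes are simulated for illustration.
import numpy as np

rng = np.random.default_rng(5)
p_hat = rng.uniform(size=10000)                    # predicted probability of target=1
y = (rng.uniform(size=10000) < 0.3 + 0.4 * p_hat).astype(int)

order = np.argsort(-p_hat)                         # rank cases by decreasing prediction
top = order[: int(0.20 * len(y))]                  # top 20% of ranked cases
lift_20 = y[top].mean() / y.mean()                 # response rate in top 20% vs overall
print(f"cumulative lift at 20%: {lift_20:.2f}")
```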
Fit Statistics
The Fit Statistics window is used to compare various models built within SAS Enterprise Miner. The misclassification and average square error statistics are of most interest for this analysis
The Output Window
The Output window provides information generated by the SAS procedures that are used to generate the analysis results. For the decision tree, this information includes variable importance, tree leaf report, model fit statistics, classification information, and score rankings