Can be used to predict a categorical or a continuous target (called regression trees in the latter case)
Unlike logistic regression and neural networks, no equations are estimated in decision trees
A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
The rules are of an IF-THEN form. For example:
If Risk = Low, then predict on-time payment of a loan
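As a rough illustration of such rules, the sketch below fits a shallow scikit-learn tree to a made-up loan data set (the risk_score input and the 10% label noise are assumptions, not data from this course) and prints the resulting IF-THEN rules.

```python
# Minimal sketch: fit a shallow decision tree and print its IF-THEN rules.
# The "risk_score" feature and the loan outcomes are simulated for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
risk_score = rng.uniform(0, 1, size=200).reshape(-1, 1)   # low score = low risk
on_time = (risk_score.ravel() < 0.5).astype(int)          # 1 = on-time payment
on_time ^= (rng.uniform(size=200) < 0.1).astype(int)      # flip 10% of labels as noise

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(risk_score, on_time)
print(export_text(tree, feature_names=["risk_score"]))    # rules printed as IF-THEN text
```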
Advantages:
Easy to interpret (Tree structured presentation)
Allow mixed input data types: Categorical, ordinal, & interval
Allow discrete (binary and categorical) or continuous target
Robust to outliers in inputs
No problem with missing values
Automatically:
Accommodates nonlinearity
Selects input variables
Disadvantages:
Error prone if training data too small
Can be time consuming to train
Builds rectangular regions which may not correspond well with actual data distribution
Decision Tree Terminology
Root Node: Represents the entire population or sample; it is further divided into two or more homogeneous sets
Splitting: The process of dividing a node into two or more sub-nodes
Decision Node: A sub-node that splits into further sub-nodes
Leaf / Terminal Node: A node that does not split
Pruning: The removal of sub-nodes from a decision node
Branch / Sub-Tree: A subsection of the entire tree
Parent and Child Node: A node that is divided into sub-nodes is called the parent node, and the sub-nodes are its child nodes
The split at the top (root) node is the most significant
Splits at the nodes immediately below are less significant
Splits at the child nodes of those less significant nodes are the least significant
Inputs that are not involved in any split are not significant
Variation of Decision Trees
Classification tree
The target is discrete (binary, nominal)
The leaves give the predicted class as well as the probability of class membership
Regression tree
The target is continuous
The leaves give the predicted value of the target
Tree with binary splits
Tree with multiway splits
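The sketch below illustrates the first two variations with scikit-learn on a toy data set (all names and data are assumptions); note that scikit-learn trees use binary splits only, so the multiway-split variant is not shown.

```python
# Sketch of the two tree variants, assuming scikit-learn and simulated data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))

# Classification tree: discrete target; leaves give the predicted class
# and the probability of class membership.
y_class = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
print(clf.predict(X[:2]), clf.predict_proba(X[:2]))

# Regression tree: continuous target; leaves give the predicted value of the target.
y_cont = 2 * X[:, 0] + rng.normal(scale=0.3, size=300)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_cont)
print(reg.predict(X[:2]))
```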
Modeling Essentials – Decision Trees
1. Predict new cases (Prediction rules)
The goal of the prediction rules is to route each new case to a leaf node, which supplies the prediction
If 0 (green) denotes staying with the company and 1 (yellow) denotes leaving the company:
A green leaf node may still contain 40% of cases that leave the company
2. Select useful inputs (Split search)
Calculate the logworth of every partition on input x1
Select the partition with the maximum logworth
Repeat for input x2
Compare partition logworth ratings
Create a partition rule from the best partition across all inputs
Repeat the process in each subset
Create a second partition rule
Repeat to form a maximal tree
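The following is a minimal sketch of a logworth-based split search for one interval input and a binary target. It assumes logworth is taken as -log10 of the chi-square p-value of the candidate partition; the data and the set of candidate split points are illustrative and do not reproduce the SAS implementation.

```python
# Sketch of a split search: rate every candidate partition of input x by logworth
# and keep the partition with the maximum logworth. Data are simulated.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=500)
y = (x + rng.normal(scale=2.0, size=500) > 5).astype(int)   # binary target

def logworth(x, y, cut):
    """-log10(p) for the chi-square test of the partition x < cut vs x >= cut."""
    left = x < cut
    table = np.array([[np.sum(y[left] == 0), np.sum(y[left] == 1)],
                      [np.sum(y[~left] == 0), np.sum(y[~left] == 1)]])
    _, p, _, _ = chi2_contingency(table)
    return -np.log10(p)

candidates = np.quantile(x, np.linspace(0.05, 0.95, 19))     # candidate partitions
ratings = [(cut, logworth(x, y, cut)) for cut in candidates]
best_cut, best_lw = max(ratings, key=lambda t: t[1])
print(f"best partition: x < {best_cut:.2f}, logworth = {best_lw:.1f}")
```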
3. Optimize complexity (Pruning)
Predictive Model Sequence
Create a sequence of models with increasing complexity
Maximal Tree
A maximal tree is the most complex model in the sequence
Pruning One Split
The next model in the sequence is formed by pruning one split from the maximal tree
Each subtree’s predictive performance is rated on validation data
The subtree with the highest validation assessment is selected
Pruning Two Splits
Similarly, this is done for subsequent models
Prune two splits from the maximal tree
Rate each subtree using validation assessment and select the subtree with the best assessment rating
Subsequent Pruning
Continue pruning until all subtrees are considered
Selecting the Best Tree
Compare validation assessment between tree complexities
Validation Assessment
Choose the simplest model with the highest validation assessment
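A rough sketch of this sequence-and-select idea is shown below, using scikit-learn's cost-complexity pruning as a stand-in for the split-by-split pruning described above; the data set, the 50/50 split, and validation accuracy as the assessment measure are all assumptions.

```python
# Sketch of complexity optimization: build a maximal tree, generate a sequence of
# pruned subtrees, and keep the simplest one with the best validation assessment.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = ((X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)) > 0).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

# Sequence of subtrees with increasing pruning strength (alpha=0 gives the maximal tree).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best = None
for alpha in path.ccp_alphas:
    subtree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    assessment = subtree.score(X_valid, y_valid)        # validation accuracy
    leaves = subtree.get_n_leaves()
    # Prefer higher validation assessment; break ties toward the simpler tree.
    if best is None or (assessment, -leaves) > (best[0], -best[1]):
        best = (assessment, leaves, alpha)

print(f"selected subtree: {best[1]} leaves, validation accuracy {best[0]:.3f}")
```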
Assessment Statistics (Ratings depend on):
Target measurement (binary, continuous etc.)
Prediction type (decisions, rankings, estimates)
Binary Targets
Primary outcome
Secondary outcome
Binary Target Predictions
Decision Optimization
Accuracy
True positive
True negative
Maximize accuracy: Agreement between outcome and prediction
Misclassification
False negative
False positive
Minimize misclassification: disagreement between outcome and prediction
Ranking Optimization
Concordance
Target = 0 -> Low score
Target = 1 -> High score
Maximize concordance: proper ordering of primary and secondary outcomes
Discordance
Target = 0 -> High score
Target = 1 -> Low score
Minimize discordance: improper ordering of primary and secondary outcomes
Estimate Optimization
Squared error
(Target - Estimate)^2
Minimize squared error: squared difference between target and prediction
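The sketch below computes the three kinds of assessment statistics for a small set of illustrative binary outcomes and predicted scores (the numbers are made up): accuracy and misclassification for decisions, concordance and discordance for rankings, and average square error for estimates.

```python
# Sketch of decision, ranking, and estimate assessment statistics for a binary target.
import numpy as np

y = np.array([1, 1, 1, 0, 0, 0, 0, 1])                    # observed outcomes
p = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8])    # predicted probabilities (scores)
d = (p >= 0.5).astype(int)                                 # decisions

accuracy = np.mean(d == y)                                 # maximize: agreement
misclassification = np.mean(d != y)                        # minimize: disagreement

# Rankings: compare scores of primary (1) and secondary (0) outcome cases pairwise.
scores_1, scores_0 = p[y == 1], p[y == 0]
pairs = len(scores_1) * len(scores_0)
concordance = sum(s1 > s0 for s1 in scores_1 for s0 in scores_0) / pairs   # maximize
discordance = sum(s1 < s0 for s1 in scores_1 for s0 in scores_0) / pairs   # minimize

ase = np.mean((y - p) ** 2)                                # minimize: average square error

print(accuracy, misclassification, concordance, discordance, ase)
```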
Pruning Criteria (Limiting the size of a decision tree)
Entropy Pruning Criterion:
Entropy is used to calculate the homogeneity of a sample
If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, the entropy is one
An entropy target can be set, and pruning is based on whether the tree meets that target
Gini Pruning Criterion:
As with entropy, a Gini target can be set, and pruning is based on the change in the global Gini statistic
Misclassification Rate Pruning Criterion:
The misclassification rate is simply the number of mispredictions divided by the number of predictions
Thus, a misclassification rate target can be set, and the tree is pruned back if it cannot meet that target
Average Square Error Pruning Criterion:
The average square error (ASE) is based on the sum of squares error (SSE)
Similarly, an ASE target can be set, and the tree is pruned back if it cannot meet the ASE target
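For reference, the sketch below evaluates the two impurity measures behind the entropy and Gini criteria at a few illustrative leaf proportions; it does not reproduce the exact pruning mechanics in SAS Enterprise Miner.

```python
# Sketch of the node impurity measures behind the entropy and Gini criteria,
# evaluated at illustrative proportions of target=1 cases in a node.
import numpy as np

def entropy(p):
    """Binary entropy: 0 for a pure node, 1 for a 50/50 split (log base 2)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(p):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 split."""
    return 2 * p * (1 - p)

for prop in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={prop:.1f}  entropy={entropy(prop):.3f}  gini={gini(prop):.3f}")
```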
SAS Applications
3. SAS - Decision Tree
3.1. Creating Training and Validation Data
Click the Sample tool tab to use the "Data Partition" tool
Select the "Data Partition" node and examine its Properties panel
Use the Properties panel to select the fraction of data devoted to the Training, Validation, and Test partitions
By default, less than half the available data is used for generating the predictive models
There is a trade-off in various partitioning strategies:
More data devoted to training results in more stable predictive models, but less stable model assessments, and vice versa
The Test partition is used only for calculating fit statistics after the modeling and model selection is complete
Enter 50 as the Training value in the Data Partition node
Enter 50 as the Validation value in the Data Partition node
Enter 0 as the Test value in the Data Partition node
Select Results in the Run Status window to view the results
The Results window provides a basic metadata summary of the raw data that feeds the node, and a frequency table that shows the distribution of the target variable in the raw, training, and validation data sets
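Outside of SAS Enterprise Miner, a comparable 50/50 stratified partition and target frequency table could be produced as sketched below; the file name pva_raw_data.csv is a placeholder, and only the target variable name TARGET_B comes from this example.

```python
# Sketch of a 50/50 training/validation partition with a target frequency table,
# assuming pandas/scikit-learn and a hypothetical input file.
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("pva_raw_data.csv")              # hypothetical file name
train, valid = train_test_split(
    raw, train_size=0.5, stratify=raw["TARGET_B"], random_state=0
)

# Distribution of the target in the raw, training, and validation data sets.
print(pd.DataFrame({
    "raw": raw["TARGET_B"].value_counts(normalize=True),
    "train": train["TARGET_B"].value_counts(normalize=True),
    "validation": valid["TARGET_B"].value_counts(normalize=True),
}))
```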
3.2.1. Preparing for Interactive Tree Construction
The Decision Tree tool can build predictive models autonomously or interactively.
To build a model autonomously, simply run the Decision Tree node.
Building models interactively, however, is more informative
Select "Interactive" from the Decision Tree node's Properties panel
The SAS Enterprise Miner Interactive Decision Tree application appears
Right-click the blue box and select Split Node from the menu
The Split Node 1 dialog box appears
3.2.2. Creating a Splitting Rule
The Split Node dialog box shows the relative value, -Log(p) or logworth of partitioning the training data using the indicated input
As the logworth increases, the partition better isolates cases with identical target values
"Gift Count 36 Months" has the highest logworth, followed by "Gift Amount Average Card 36 Months" and "Gift Amount Last"
This dialog box shows how the training data is partitioned using the input Gift Count 36 Months
Two branches are created.
The first branch contains cases with a 36-month gift count less than 2.5
The second branch contains cases with a 36-month gift count greater than or equal to 2.5
In other words, cases with a 36-month gift count of zero, one, or two branch left, and cases with a count of three or more branch right
In addition, any cases with a missing or unknown 36-month gift count are placed in the second branch
Click "Apply", and then click "OK" twice
This model assigns to all cases in the left branch a predicted TARGET_B value equal to 0.43 and to all cases in the right branch a predicted TARGET_B value equal to 0.56
The training data is partitioned into two subsets.
The first subset, corresponding to cases with a 36-month gift count less than 2.5, has a higher than average concentration of TARGET_B=0 cases.
The second subset, corresponding to cases with a 36-month gift count greater than or equal to 2.5, has a higher than average concentration of TARGET_B=1 cases
The second branch has slightly more cases than the first on training and validation data sets, which is indicated by the Count field
In general, decision tree predictive models assign all cases in a leaf the same predicted target value. For binary targets, this value equals the proportion of cases in the leaf with the target variable's primary outcome (target=1)
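The sketch below mimics this behavior on simulated gift-count data (the 0.43/0.56 response rates are taken from the example above; everything else is assumed): a one-split regression tree on a 0/1 target predicts each leaf's proportion of TARGET_B=1 cases.

```python
# Sketch: a one-split tree assigns each leaf's primary-outcome proportion as the prediction.
# The gift-count data are simulated, not the actual PVA data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
gift_count_36 = rng.poisson(2.2, size=5000).astype(float).reshape(-1, 1)
prob = np.where(gift_count_36.ravel() < 2.5, 0.43, 0.56)   # assumed response rates
target_b = (rng.uniform(size=5000) < prob).astype(int)

# A regression tree on a 0/1 target predicts the leaf mean,
# i.e. the proportion of TARGET_B=1 cases in that leaf.
stump = DecisionTreeRegressor(max_depth=1).fit(gift_count_36, target_b)
print(stump.tree_.threshold[0])                            # split point near 2.5
print(stump.predict([[1.0]]), stump.predict([[4.0]]))      # left and right leaf proportions
```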
3.2.3. Adding More Splits
It is important to remember that the logworth of each split, reported above, and the predicted TARGET_B values are generated using the training data only
The main focus is on selecting useful input variables for the first predictive model. Given the nature of the split search, a diminishing marginal usefulness of additional input variables is expected
3.2.4. Changing a Splitting Rule
3.2.5. Creating the Maximal Tree
Select the root node of the tree
Right-click and select "Train Node" from the menu
Right-click in the gray area behind the tree
Select View -> Fit to page
To view the information within each node, right-click in the gray area behind the tree and select View -> Chart tips
Save the tree by selecting File -> Save
Then select File -> Exit to close the Interactive Tree application
Select View -> Subtree Assessment Plot
Looking at the plot for the training data, the majority of the improvement in fit occurs over the first few splits; nevertheless, the maximal, fifteen-leaf tree appears to generate the lowest misclassification rate
The plot on training data seems to indicate that the maximal tree is preferred for assigning predictions to cases
However, because it reflects only the training data, this plot is misleading
Using the same sample of data both to evaluate input variable usefulness and to assess model performance commonly leads to overfit models
Looking at an assessment plot based on validation data provides the solution
4.1. Assessing a Decision Tree
Select the Decision Tree node
In the Decision Tree node's Train properties, change the Use Frozen Tree property’s value from No to Yes
The Frozen Tree property prevents the maximal tree from being changed by other property settings when the flow is run
Right-click the Decision Tree node and run it
Select Results
The Results window appears
The Results window contains a variety of diagnostic plots and tables, including a cumulative lift chart, a treemap, and a table of fit statistics
The diagnostic tools shown in the results vary with the measurement level of the target variable
Select View -> Model -> Subtree Assessment Plot
The plot shows the Average Square Error corresponding to each subtree as the data is sequentially split
This plot is similar to the one generated with the Interactive Decision Tree tool, and it confirms suspicions about the optimality of the 15-leaf tree
The performance on the training sample becomes monotonically better as the tree becomes more complex
However, the performance on the validation sample only improves up to a tree of, approximately, four or five leaves, and then diminishes as model complexity increases
The validation performance shows evidence of model overfitting
Over the range of one to approximately four leaves, the precision of the model improves with the increase in complexity.
A marginal increase in complexity over this range results in better accommodation of the systematic variation or signal in data
Precision diminishes as complexity increases past this range; the additional complexity accommodates idiosyncrasies in the training sample, and the model extrapolates less well.
The validation performance under Misclassification Rate is similar to the performance under Average Square Error
The optimal tree appears to have, approximately, four or five leaves
4.2. Pruning a Decision Tree
The default method used to prune the maximal tree is Assessment. This means that algorithms in SAS Enterprise Miner choose the best tree in the sequence based on some optimality measure
Alternative method options are Largest and N.
The Largest option provides an autonomous way to generate the maximal tree.
The N option generates a tree with N leaves. The maximal tree is the upper bound on N.
In the Result window, select View -> Model -> Subtree Assessment Plot
The five-leaf tree has the lowest associated misclassification rate on the validation sample
The maximal tree is then sequentially pruned so that the sequence consists of the best 15-leaf tree, the best 14-leaf tree, and so on
The tree in the sequence with the lowest overall validation misclassification rate is selected.
4.3. Alternative Assessment Measures
The procedure is the same as the steps above
Create a new Decision Tree node "Probability Tree"
Change the Assessment Measure property to Average Square Error
A five-leaf tree is also optimal under the Validation Average Square Error criterion
However, viewing the Tree plot reveals that, although the optimal Decision (Misclassification) Tree and the optimal Probability Tree have the same number of leaves, they are not identical trees
The tree shown below is a different five-leaf tree than the tree optimized under Validation Misclassification
In the first step, cultivating (or growing or splitting) the tree, the important measure is logworth
Splits of the data are made based on logworth to isolate subregions of the input space with high proportions of donors and non-donors
Splitting continues until a boundary associated with a stopping rule is reached. This process generates the maximal tree.
4.4. Additional Diagnostic Tools
Treemap
The Tree Map window is intended to be used in conjunction with the Tree window to gauge the relative size of each leaf
The size of the small rectangle in the treemap indicates the fraction of the training data present in the corresponding leaf. It appears that only a small fraction of the training data finds its way to this leaf
Leaf Statistic Bar Chart
This window compares the blue predicted outcome percentages in the left bars (from the training data) to the red observed outcome percentages in the right bars (from the validation data)
This plot reveals how well training data response rates are reflected in the validation data
Ideally, the bars should be of the same height
Differences in bar heights are usually the result of small case counts in the corresponding leaf
Variable Importance
Select View -> Model -> Variable Importance
The Variable Importance window provides insight into the importance of inputs in the decision tree
The magnitude of the importance statistic relates to the amount of variability in the target explained by the corresponding input relative to the input at the top of the table
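A comparable table can be sketched with scikit-learn as below, rescaling each importance relative to the top input; the data set and input names are assumptions, and scikit-learn's impurity-based importance is not identical to the SAS statistic.

```python
# Sketch of a variable-importance table for a fitted decision tree,
# rescaled relative to the most important input.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3, random_state=0)
names = [f"input_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
imp = pd.Series(tree.feature_importances_, index=names).sort_values(ascending=False)
print((imp / imp.iloc[0]).round(3))      # importance relative to the top input
```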
The Score Rankings Overlay Plot
It is commonly called a cumulative lift chart
Cases in the training and validation data are ranked, based on decreasing predicted target values
A fraction of the ranked data is selected (given by the decile value)
The proportion of cases with the primary target value in this fraction is compared to the proportion of cases with the primary target value overall (given by the cumulative lift value)
A useful model shows high lift in low deciles in both the training and validation data
For example, the Score Ranking plot shows that in the top 20% of cases (ranked by predicted probability), the training and validation lifts are approximately 1.26
This means that cases in this top 20% are about 26% more likely to have the primary outcome than cases in a randomly selected 20% of the data.
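The lift calculation itself can be sketched as below with made-up predicted probabilities and outcomes: rank cases by predicted probability, take the top 20%, and divide that group's response rate by the overall response rate.

```python
# Sketch of the cumulative-lift calculation behind the Score Rankings plot.
# The predicted probabilities and outcomes are simulated for illustration.
import numpy as np

rng = np.random.default_rng(5)
p_hat = rng.uniform(size=10000)                    # predicted probability of target=1
y = (rng.uniform(size=10000) < 0.3 + 0.4 * p_hat).astype(int)

order = np.argsort(-p_hat)                         # rank cases by decreasing prediction
top = order[: int(0.20 * len(y))]                  # top 20% of ranked cases
lift_20 = y[top].mean() / y.mean()                 # response rate in top 20% vs overall
print(f"cumulative lift at 20%: {lift_20:.2f}")
```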
Fit Statistics
The Fit Statistics window is used to compare various models built within SAS Enterprise Miner. The misclassification and average square error statistics are of most interest for this analysis
The Output Window
The Output window provides information generated by the SAS procedures that are used to generate the analysis results. For the decision tree, this information includes variable importance, tree leaf report, model fit statistics, classification information, and score rankings