inTrees

Contributions are welcome. GitHub | CRAN

inTrees (interpretable trees) is a framework for extracting, measuring, pruning, selecting and summarizing rules from a tree ensemble (so far including random forest, RRF, gbm and xgboost). All algorithms for classification, and some for regression, have been implemented in the "inTrees" R package. For LaTeX users: these rules can be easily formatted as LaTeX code.

The inTrees package supports:

- random forests

- xgboost

- gbm

Deck

Demo code

Code/data for comparing STEL with RPART

More work on inTrees

Citation: Interpreting Tree Ensembles with inTrees, Houtao Deng, arXiv:1408.5456, 2014

Notes:

- You can format the rule metrics into LaTeX code using something like:

library(xtable)

print(xtable(ruleMetric), include.rownames=FALSE)

- For regression problems, rules with numeric outcomes may be hard to validate or interpret, particularly when presented to people without deep knowledge of the problem.

> regression rule example: X[,1] >= 3 => pred = 4.555

Instead, one could have a set of "approximate" classification rules.

> classification rule example: X[,1] >= 3 => pred = L1 (lowest level, i.e., smallest)

Here is an example of how to transform the regression rules into classification rules using inTrees.

library(inTrees); library(randomForest)

X <- iris[,-1]; target <- iris[,"Sepal.Length"]

rf <- randomForest(X,target,ntree=100) # random forest

ruleExec0 <- extractRules(RF2List(rf),X)

ruleExec <- unique(ruleExec0)

ruleMetric <- getRuleMetric(ruleExec,X,target) # regression rules

ruleMetric

len freq err condition pred

[1,] "2" "0.273" "0.121998810232005" "X[,2]<=3.95 & X[,3]<=0.35 " "4.95365853658537"

[2,] "4" "0.02" "0.00666666666666662" "X[,2]<=3.95 & X[,2]<=3.4 & X[,3]>0.35 & X[,4] %in% c('versicolor','virginica')" "5"

...

# transform regression rules to classification rules

target <- dicretizeVector(target) # discretize it into three equal-frequency levels by default (the number of levels can be customized)

# methods for classification rules can then be used for the conditions extracted from the regression trees

ruleMetric <- getRuleMetric(ruleExec,X,target)

ruleMetric <- pruneRule(ruleMetric,X,target) # prune each rule

# ruleMetric <- selectRuleRRF(ruleMetric,X,target) # rule selection

learner <- buildLearner(ruleMetric,X,target) # build the simplified tree ensemble learner

readableLearner <- presentRules(learner,colnames(X)) # present the rules in a more readable format

readableLearner

len freq err condition pred

[1,] "3" "0.206666666666667" "0" "Sepal.Width<=3.65 & Petal.Length<=3.3 & Petal.Length>1.35" "L1"

[2,] "2" "0.106666666666667" "0" "Petal.Length>5.65 & Petal.Width<=2.35" "L3"

...

where L1 indicates the lowest level and L3 the highest (for 3-level discretization).

Here is my reply to a post on Stack Overflow on how to extract the rules from a random forest.
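To illustrate what equal-frequency discretization into levels L1 to L3 means, here is a minimal base R sketch. The helper `equal_freq_levels` is hypothetical and not part of inTrees; the package's own function may handle ties and boundaries differently.

```r
# Base R only; equal_freq_levels is a hypothetical helper, not an inTrees function.
equal_freq_levels <- function(v, K = 3) {
  # Cut points at the 0, 1/K, ..., 1 quantiles; unique() guards against tied quantiles.
  breaks <- unique(quantile(v, probs = seq(0, 1, length.out = K + 1)))
  # Label the resulting intervals L1 (lowest) up to LK (highest).
  cut(v, breaks = breaks, include.lowest = TRUE,
      labels = paste0("L", seq_len(length(breaks) - 1)))
}

levels3 <- equal_freq_levels(iris[, "Sepal.Length"])
table(levels3)  # roughly balanced counts; ties in the data prevent exact thirds
```

Because the cut points are quantiles, every value labeled L1 is smaller than every value labeled L2, and so on, which is what allows "L1" to be read as "lowest level".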

Extract raw rules from a random forest:

> library(inTrees); library(randomForest)
> data(iris)
> X <- iris[,1:(ncol(iris)-1)]; target <- iris[,"Species"] # X: predictors; target: class
> rf <- randomForest(X,as.factor(target))
> treeList <- RF2List(rf) # transform rf object to an inTrees' format
> exec <- extractRules(treeList,X) # R-executable conditions
> exec[1:2,]
     condition
[1,] "X[,1]<=5.45 & X[,4]<=0.8"
[2,] "X[,1]<=5.45 & X[,4]>0.8"

Measure rules. "len" is the number of variable-value pairs in a condition, "freq" is the proportion of data satisfying a condition, "pred" is the outcome of a rule, i.e., "condition" => "pred", and "err" is the error rate of a rule.

> ruleMetric <- getRuleMetric(exec,X,target) # get rule metrics
> ruleMetric[1:2,]
     len freq    err     condition                  pred
[1,] "2" "0.3"   "0"     "X[,1]<=5.45 & X[,4]<=0.8" "setosa"
[2,] "2" "0.047" "0.143" "X[,1]<=5.45 & X[,4]>0.8"  "versicolor"
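To make the metric definitions concrete, the following base R sketch recomputes "len", "freq", "err" and "pred" for a single condition string. The helper `rule_metric_by_hand` is hypothetical, and taking the majority class as the outcome is an assumption about how a rule's prediction is chosen, not a description of the package internals.

```r
# Base R only; rule_metric_by_hand is a hypothetical helper, not an inTrees function.
data(iris)
X <- iris[, 1:4]
target <- iris[, "Species"]

rule_metric_by_hand <- function(cond, X, target) {
  # Evaluate the R-executable condition string against X (one logical per row).
  hit <- eval(parse(text = cond))
  # Assumption: the rule's outcome is the majority class among matching rows.
  counts <- table(target[hit])
  pred <- names(counts)[which.max(counts)]
  list(
    len  = length(strsplit(cond, " & ", fixed = TRUE)[[1]]),  # variable-value pairs
    freq = mean(hit),                   # share of all rows that satisfy the condition
    err  = mean(target[hit] != pred),   # share of matching rows with a different class
    pred = pred
  )
}

m <- rule_metric_by_hand("X[,1]<=5.45 & X[,4]<=0.8", X, target)
str(m)
```

The condition strings are ordinary R expressions over `X`, which is why they can be evaluated directly with `eval(parse(text = ...))`.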

Prune each rule:

> ruleMetric <- pruneRule(ruleMetric,X,target)
> ruleMetric[1:2,]
     len freq    err     condition                 pred
[1,] "1" "0.333" "0"     "X[,4]<=0.8"              "setosa"
[2,] "2" "0.047" "0.143" "X[,1]<=5.45 & X[,4]>0.8" "versicolor"
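The idea behind pruning can be sketched in base R: drop a variable-value pair when removing it barely increases the rule's error. This is a simplified illustration, not the package's implementation. The helper names, the left-to-right order in which pairs are tried, and the 0.05 threshold on the absolute error increase are all assumptions made for the example.

```r
# Base R only; both helpers are hypothetical and not inTrees functions.
data(iris)
X <- iris[, 1:4]
target <- iris[, "Species"]

# Error of a condition with respect to a fixed outcome.
rule_error <- function(cond, pred, X, target) {
  hit <- eval(parse(text = cond))
  mean(target[hit] != pred)
}

prune_by_hand <- function(cond, X, target, max_increase = 0.05) {
  # The outcome is fixed up front as the majority class of the full rule.
  hit <- eval(parse(text = cond))
  counts <- table(target[hit])
  pred <- names(counts)[which.max(counts)]
  pairs <- strsplit(cond, " & ", fixed = TRUE)[[1]]
  i <- 1
  while (length(pairs) > 1 && i <= length(pairs)) {
    base_err  <- rule_error(paste(pairs, collapse = " & "), pred, X, target)
    trial_err <- rule_error(paste(pairs[-i], collapse = " & "), pred, X, target)
    if (trial_err - base_err < max_increase) {
      pairs <- pairs[-i]  # pair i is not needed; retest the same position
    } else {
      i <- i + 1          # pair i matters; keep it and move on
    }
  }
  paste(pairs, collapse = " & ")
}

pruned1 <- prune_by_hand("X[,1]<=5.45 & X[,4]<=0.8", X, target)
pruned2 <- prune_by_hand("X[,1]<=5.45 & X[,4]>0.8", X, target)
```

On these two rules the sketch shortens the first one and leaves the second unchanged, because removing either pair from the second rule sharply increases its error for the fixed outcome.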

Select a compact rule set:

> ruleMetric <- selectRuleRRF(ruleMetric,X,target)
> ruleMetric
     len freq    err     condition                                             pred         impRRF
[1,] "1" "0.333" "0"     "X[,4]<=0.8"                                          "setosa"     "1"
[2,] "3" "0.313" "0"     "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65"               "versicolor" "0.806787615686919"
[3,] "4" "0.333" "0.04"  "X[,1]>4.95 & X[,3]<=5.35 & X[,4]>0.8 & X[,4]<=1.75"  "versicolor" "0.0746284932951366"
[4,] "2" "0.287" "0.023" "X[,1]<=5.9 & X[,2]>3.05"                             "setosa"     "0.0355855756152103"
[5,] "1" "0.307" "0.022" "X[,4]>1.75"                                          "virginica"  "0.0329176860493297"
[6,] "4" "0.027" "0"     "X[,1]>5.45 & X[,3]<=5.45 & X[,4]<=1.75 & X[,4]>1.55" "versicolor" "0.0234818254947883"
[7,] "3" "0.007" "0"     "X[,1]<=6.05 & X[,3]>5.05 & X[,4]<=1.7"               "versicolor" "0.0132907201116241"

Build an ordered rule list as a classifier:

> learner <- buildLearner(ruleMetric,X,target)
> learner
     len freq                 err                  condition                                             pred
[1,] "1" "0.333333333333333"  "0"                  "X[,4]<=0.8"                                          "setosa"
[2,] "3" "0.313333333333333"  "0"                  "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65"               "versicolor"
[3,] "4" "0.0133333333333333" "0"                  "X[,1]>5.45 & X[,3]<=5.45 & X[,4]<=1.75 & X[,4]>1.55" "versicolor"
[4,] "1" "0.34"               "0.0196078431372549" "X[,1]==X[,1]"                                        "virginica"
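An ordered rule list classifies a row with the first rule whose condition it satisfies; the final condition `X[,1]==X[,1]` is always true and acts as the default. The base R sketch below applies the four rules of the learner printed above by hand. `apply_rule_list` is a hypothetical helper written for illustration only; the package provides its own `applyLearner` function for this purpose.

```r
# Base R only; apply_rule_list is a hypothetical helper, not an inTrees function.
data(iris)
X <- iris[, 1:4]

# Conditions and outcomes copied from the learner output.
conditions <- c("X[,4]<=0.8",
                "X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65",
                "X[,1]>5.45 & X[,3]<=5.45 & X[,4]<=1.75 & X[,4]>1.55",
                "X[,1]==X[,1]")  # always TRUE: the default rule
preds <- c("setosa", "versicolor", "versicolor", "virginica")

apply_rule_list <- function(conditions, preds, X) {
  out <- rep(NA_character_, nrow(X))
  for (i in seq_along(conditions)) {
    hit <- eval(parse(text = conditions[i]))
    # Only rows not yet claimed by an earlier rule receive this rule's outcome.
    todo <- is.na(out) & hit
    out[todo] <- preds[i]
  }
  out
}

yhat <- apply_rule_list(conditions, preds, X)
table(predicted = yhat, actual = iris$Species)
```

Because rules are tried in order, the "freq" and "err" columns of the learner describe only the rows that reach each rule, not the whole data set.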

Make rules more readable:

> readableRules <- presentRules(ruleMetric,colnames(X))
> readableRules[1:2,]
     len freq    err condition                                                   pred
[1,] "1" "0.333" "0" "Petal.Width<=0.8"                                          "setosa"
[2,] "3" "0.313" "0" "Petal.Length<=4.95 & Petal.Length>2.6 & Petal.Width<=1.65" "versicolor"
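The readable form amounts to replacing each column reference `X[,i]` with the i-th column name. A minimal base R sketch of that substitution follows; `readable_condition` is a hypothetical helper, and the package function may apply further formatting.

```r
# Base R only; readable_condition is a hypothetical helper, not an inTrees function.
readable_condition <- function(cond, col_names) {
  for (i in seq_along(col_names)) {
    # fixed = TRUE treats "X[,i]" literally rather than as a regular expression.
    cond <- gsub(paste0("X[,", i, "]"), col_names[i], cond, fixed = TRUE)
  }
  cond
}

col_names <- colnames(iris)[1:4]
r1 <- readable_condition("X[,4]<=0.8", col_names)
r2 <- readable_condition("X[,3]<=4.95 & X[,3]>2.6 & X[,4]<=1.65", col_names)
```

Including the closing bracket in the search pattern keeps `X[,1]` from matching inside `X[,10]` when there are ten or more predictors.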

Extract frequent variable interactions (note the rules are not pruned or selected):

> rf <- randomForest(X,as.factor(target))
> treeList <- RF2List(rf) # transform rf object to an inTrees' format
> exec <- extractRules(treeList,X) # R-executable conditions
> ruleMetric <- getRuleMetric(exec,X,target) # get rule metrics
> freqPattern <- getFreqPattern(ruleMetric)
> freqPattern[which(as.numeric(freqPattern[,"len"])>=2),][1:4,] # interactions of at least two predictor variables
     len sup     conf    condition                  pred
[1,] "2" "0.045" "0.587" "X[,3]>2.45 & X[,4]<=1.75" "versicolor"
[2,] "2" "0.041" "0.63"  "X[,3]>4.75 & X[,4]>0.8"   "virginica"
[3,] "2" "0.039" "0.604" "X[,4]<=1.75 & X[,4]>0.8"  "versicolor"
[4,] "2" "0.033" "0.675" "X[,4]<=1.65 & X[,4]>0.8"  "versicolor"

One can also present these frequent patterns in a readable form using the presentRules function.

In addition, rules or frequent patterns can be formatted in LaTeX.

> library(xtable)
> freqPatternSelect <- freqPattern[which(as.numeric(freqPattern[,"len"])>=2),][1:4,] # the four interactions selected earlier
> print(xtable(freqPatternSelect), include.rownames=FALSE)
\begin{table}[ht]
\centering
\begin{tabular}{lllll}
  \hline
len & sup & conf & condition & pred \\
  \hline
2 & 0.045 & 0.587 & X[,3]$>$2.45 \& X[,4]$<$=1.75 & versicolor \\
  2 & 0.041 & 0.63 & X[,3]$>$4.75 \& X[,4]$>$0.8 & virginica \\
  2 & 0.039 & 0.604 & X[,4]$<$=1.75 \& X[,4]$>$0.8 & versicolor \\
  2 & 0.033 & 0.675 & X[,4]$<$=1.65 \& X[,4]$>$0.8 & versicolor \\
   \hline
\end{tabular}
\end{table}