[DecisionTree] Decision trees in R using C5.0

In this tutorial you will learn how to use decision trees with the C5.0 algorithm from the C50 package. This tutorial assumes that you know what a decision tree is and have basic knowledge of R.

NB: If you are interested in the inner workings of decision trees and what they are, I suggest you look at my tutorial in Python that describes how to build a decision tree from scratch.

We will :

  1. Explore the dataset
  2. Split it into training and test, which involves reshuffling the order of the observations
  3. Train the model
  4. Test the model by making predictions on the test set and evaluating the results
  5. Improve the model with (adaptive) boosting, i.e. combine multiple trees
  6. Assign different penalties to each type of error with a cost matrix, so that our model avoids certain types of misclassification more than others

This tutorial is based on chapter 5 of Machine Learning with R. It is really an excellent book that I can only recommend. The original code and datasets are available here.

Our question is: how can we assess the risk of credit default for people based on their individual characteristics as well as the characteristics of their loan? To answer this we have past records for 1000 individuals, with 17 features, including whether they defaulted on their credit or not.

1. Exploring the data 

First, let's look at the data:

In [2]:
credit <- read.csv("credit.csv")
str(credit)
'data.frame':	1000 obs. of  17 variables:
 $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
 $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
 $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
 $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
 $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
 $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

The "DM" refer to Deutsche Mark as the data is German (and was collected befor the Euro).

In [12]:
table(credit$checking_balance)
Out[12]:
    < 0 DM   > 200 DM 1 - 200 DM    unknown 
       274         63        269        394 

Let's have a very quick look at the number of defaults:

In [3]:
table(credit$default)
Out[3]:
 no yes 
700 300 

2. Splitting the data into training and test sets 

We split the dataset into training (90% of the sample) and test (10%) sets.
We first need to put the data in a random order (it's currently ordered).

In [4]:
set.seed(12345)
credit_rand <- credit[order(runif(1000)), ] # create a new dataframe with the same rows
# as the original, but reordered according to 1000 randomly generated numbers

summary(credit$amount) 
summary(credit_rand$amount) # we check we get the same data in both dataframes...

head(credit$amount)
head(credit_rand$amount) # we check the order of the two dataframes is different!

#splitting the dataset
credit_train <- credit_rand[1:900, ]
credit_test  <- credit_rand[901:1000, ]
Out[4]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18420 
Out[4]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    250    1366    2320    3271    3972   18420 
Out[4]:
[1] 1169 5951 2096 7882 4870 9055
Out[4]:
[1] 1199 2576 1103 4020 1501 1568

We quickly check that the training and test data are roughly similar. This is the case, as both have roughly 30% of defaults.

In [5]:
prop.table(table(credit_train$default))
prop.table(table(credit_test$default))
Out[5]:
       no       yes 
0.7022222 0.2977778 
Out[5]:
  no  yes 
0.68 0.32 
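NB: an equivalent and perhaps more idiomatic way to shuffle the rows is base R's sample(); a minimal sketch (not the book's original code):

set.seed(12345)
credit_rand <- credit[sample(nrow(credit)), ] # sample(n) returns a random permutation of 1:n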

3. Training the model 

We use the C5.0 algorithm, which requires the library C50.
The syntax is: C5.0(dataframe of predictors, vector of predicted classes). So here we select as input dataframe all columns except the one containing the credit default (the 17th column), and as class we use the credit default:

In [6]:
library(C50)
credit_model <- C5.0(credit_train[-17], credit_train$default)
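NB: C50 also offers a formula interface, which should be equivalent and avoids the column indexing (a sketch, assuming your package version supports it):

credit_model <- C5.0(default ~ ., data = credit_train) # 'default' as the class, all other columns as predictors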

Let's have a look.

In [7]:
credit_model
Out[7]:
Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Tree size: 67 

Non-standard options: attempt to group attributes

We can see that the tree has 67 decisions / branchings. What are they?

In [8]:
summary(credit_model)
Out[8]:
Call:
C5.0.default(x = credit_train[-17], y = credit_train$default)


C5.0 [Release 2.07 GPL Edition]  	Sun Oct 25 12:48:28 2015
-------------------------------

Class specified by attribute `outcome'

Read 900 cases (17 attributes) from undefined.data

Decision tree:

checking_balance = unknown: no (358/44)
checking_balance in {< 0 DM,> 200 DM,1 - 200 DM}:
:...credit_history in {perfect,very good}:
    :...dependents > 1: yes (10/1)
    :   dependents <= 1:
    :   :...savings_balance = < 100 DM: yes (39/11)
    :       savings_balance in {> 1000 DM,500 - 1000 DM,unknown}: no (8/1)
    :       savings_balance = 100 - 500 DM:
    :       :...checking_balance = < 0 DM: no (1)
    :           checking_balance in {> 200 DM,1 - 200 DM}: yes (5/1)
    credit_history in {critical,good,poor}:
    :...months_loan_duration <= 11: no (87/14)
        months_loan_duration > 11:
        :...savings_balance = > 1000 DM: no (13)
            savings_balance in {< 100 DM,100 - 500 DM,500 - 1000 DM,unknown}:
            :...checking_balance = > 200 DM:
                :...dependents > 1: yes (3)
                :   dependents <= 1:
                :   :...credit_history in {good,poor}: no (23/3)
                :       credit_history = critical:
                :       :...amount <= 2337: yes (3)
                :           amount > 2337: no (6)
                checking_balance = 1 - 200 DM:
                :...savings_balance = unknown: no (34/6)
                :   savings_balance in {< 100 DM,100 - 500 DM,500 - 1000 DM}:
                :   :...months_loan_duration > 45: yes (11/1)
                :       months_loan_duration <= 45:
                :       :...other_credit = store:
                :           :...age <= 35: yes (4)
                :           :   age > 35: no (2)
                :           other_credit = bank:
                :           :...years_at_residence <= 1: no (3)
                :           :   years_at_residence > 1:
                :           :   :...existing_loans_count <= 1: yes (5)
                :           :       existing_loans_count > 1:
                :           :       :...percent_of_income <= 2: no (4/1)
                :           :           percent_of_income > 2: yes (3)
                :           other_credit = none:
                :           :...job = unemployed: no (1)
                :               job = management:
                :               :...amount <= 7511: no (10/3)
                :               :   amount > 7511: yes (7)
                :               job = unskilled: [S1]
                :               job = skilled:
                :               :...dependents <= 1: no (55/15)
                :                   dependents > 1:
                :                   :...age <= 34: no (3)
                :                       age > 34: yes (4)
                checking_balance = < 0 DM:
                :...job = management: no (26/6)
                    job = unemployed: yes (4/1)
                    job = unskilled:
                    :...employment_duration in {4 - 7 years,
                    :   :                       unemployed}: no (4)
                    :   employment_duration = < 1 year:
                    :   :...other_credit = bank: no (1)
                    :   :   other_credit in {none,store}: yes (11/2)
                    :   employment_duration = > 7 years:
                    :   :...other_credit in {bank,none}: no (5/1)
                    :   :   other_credit = store: yes (2)
                    :   employment_duration = 1 - 4 years:
                    :   :...age <= 39: no (14/3)
                    :       age > 39:
                    :       :...credit_history in {critical,good}: yes (3)
                    :           credit_history = poor: no (1)
                    job = skilled:
                    :...credit_history = poor:
                        :...savings_balance in {< 100 DM,100 - 500 DM,
                        :   :                   500 - 1000 DM}: yes (8)
                        :   savings_balance = unknown: no (1)
                        credit_history = critical:
                        :...other_credit = store: no (0)
                        :   other_credit = bank: yes (4)
                        :   other_credit = none:
                        :   :...savings_balance in {100 - 500 DM,
                        :       :                   unknown}: no (1)
                        :       savings_balance = 500 - 1000 DM: yes (1)
                        :       savings_balance = < 100 DM:
                        :       :...months_loan_duration <= 13:
                        :           :...percent_of_income <= 3: yes (3)
                        :           :   percent_of_income > 3: no (3/1)
                        :           months_loan_duration > 13:
                        :           :...amount <= 5293: no (10/1)
                        :               amount > 5293: yes (2)
                        credit_history = good:
                        :...existing_loans_count > 1: yes (5)
                            existing_loans_count <= 1:
                            :...other_credit = store: no (2)
                                other_credit = bank:
                                :...percent_of_income <= 2: yes (2)
                                :   percent_of_income > 2: no (6/1)
                                other_credit = none: [S2]

SubTree [S1]

employment_duration in {< 1 year,1 - 4 years}: yes (11/3)
employment_duration in {> 7 years,4 - 7 years,unemployed}: no (8)

SubTree [S2]

savings_balance = 100 - 500 DM: yes (3)
savings_balance = 500 - 1000 DM: no (1)
savings_balance = unknown:
:...phone = no: yes (9/1)
:   phone = yes: no (3/1)
savings_balance = < 100 DM:
:...percent_of_income <= 1: no (4)
    percent_of_income > 1:
    :...phone = yes: yes (10/1)
        phone = no:
        :...purpose in {business,car0,education,renovations}: yes (3)
            purpose = car:
            :...percent_of_income <= 3: no (2)
            :   percent_of_income > 3: yes (6/1)
            purpose = furniture/appliances:
            :...years_at_residence <= 1: no (4)
                years_at_residence > 1:
                :...housing = other: no (1)
                    housing = rent: yes (2)
                    housing = own:
                    :...amount <= 1778: no (3)
                        amount > 1778:
                        :...years_at_residence <= 3: yes (6)
                            years_at_residence > 3: no (3/1)


Evaluation on training data (900 cases):

	    Decision Tree   
	  ----------------  
	  Size      Errors  

	    66  125(13.9%)   <<


	   (a)   (b)    <-classified as
	  ----  ----
	   609    23    (a): class no
	   102   166    (b): class yes


	Attribute usage:

	100.00%	checking_balance
	 60.22%	credit_history
	 53.22%	months_loan_duration
	 49.44%	savings_balance
	 30.89%	job
	 25.89%	other_credit
	 17.78%	dependents
	  9.67%	existing_loans_count
	  7.22%	percent_of_income
	  6.67%	employment_duration
	  5.78%	phone
	  5.56%	amount
	  3.78%	years_at_residence
	  3.44%	age
	  3.33%	purpose
	  1.67%	housing


Time: 0.0 secs

There is a lot going on here, but there are two important parts. First, let's have a look at the decision tree:

checking_balance = unknown: no (358/44)
checking_balance in {< 0 DM,> 200 DM,1 - 200 DM}:
:...credit_history in {perfect,very good}:
    :...dependents > 1: yes (10/1)
    :   dependents <= 1:
    :   :...savings_balance = < 100 DM: yes (39/11)
    :       savings_balance in {> 1000 DM,500 - 1000 DM,unknown}: no (8/1)
    :       savings_balance = 100 - 500 DM:
    :       :...checking_balance = < 0 DM: no (1)
    :           checking_balance in {> 200 DM,1 - 200 DM}: yes (5/1)
    credit_history in {critical,good,poor}:
    :...months_loan_duration <= 11: no (87/14)
        months_loan_duration > 11:

Lines that have the same indentation show different branches of a decision node. For instance,

  • the 1st decision node is for the variable checking_balance
    • checking_balance = unknown vs checking_balance in {< 0 DM,> 200 DM,1 - 200 DM}
    • numbers between brackets mean (# of observations / # of observations in the wrong class). Ex: checking_balance = unknown: no (358/44) means that if checking_balance = unknown, the observation was assigned the class no credit default. This was the case for 358 observations. Of these 358, 44 were incorrectly assigned no.
  • the 2nd decision node is for the variable credit_history
    • credit_history in {perfect,very good} vs credit_history in {critical,good,poor}

You can plot the tree using plot() (for some reason I unfortunately encounter a bug).
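For reference, the call would look like this (a sketch; C50's plot method relies on the partykit package, and a tree of this size may be hard to read):

plot(credit_model)              # plot the whole tree
plot(credit_model, subtree = 3) # plot only the subtree starting at node 3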

Now let's have a look at the confusion matrix:

Evaluation on training data (900 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

        66  125(13.9%)   <<


       (a)   (b)    <-classified as
      ----  ----
       609    23    (a): class no
       102   166    (b): class yes
  • Errors: 125/900 = 13.9% of cases were incorrectly classified.
  • These 125 cases are read on the diagonal from bottom left to top right (125 = 102 + 23).
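We can reproduce this number directly by predicting on the training data:

mean(predict(credit_model, credit_train) != credit_train$default) # 125/900 = 0.139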

4. Testing the model 

In [15]:
credit_pred <- predict(credit_model, credit_test)
In [16]:
library(gmodels)
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |        57 |        11 |        68 | 
               |     0.570 |     0.110 |           | 
---------------|-----------|-----------|-----------|
           yes |        16 |        16 |        32 | 
               |     0.160 |     0.160 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        73 |        27 |       100 | 
---------------|-----------|-----------|-----------|

 

This reads as follows:

  • On the diagonal from top left to bottom right we have our correctly classified observations (no/no: 57, yes/yes: 16) (NB: this is not the same diagonal as the one we used to read the errors previously!)
  • Hence we have an accuracy rate of 57 + 16 = 73%
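We could also compute the accuracy directly instead of reading it off the table:

mean(credit_pred == credit_test$default) # proportion of correct predictions: 0.73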

5. Improving the model with (adaptive) boosting 

Adaptive boosting (or "AdaBoost") refers to the idea of combining several weak learners to get a "synthesis" of their responses. It can be applied to decision trees, but also to other machine learning techniques. The main idea of AdaBoost is to adapt to the errors of one classifier to train the next one:

  • Imagine we have built a 1st decision tree.
  • We then build a 2nd one, focusing on the misclassified observations of the 1st one by giving them more weight, so that the 2nd tree "corrects" the 1st one. Of course, this 2nd tree also makes mistakes.
  • We build a 3rd one and give more weight to the mistakes of the 2nd tree.
  • etc.

An excellent explanation can be found on Youtube here.
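To make the reweighting idea concrete, here is a toy sketch of the classic AdaBoost update (illustrative only; C5.0 uses its own boosting variant, not exactly this scheme):

n <- 10
w <- rep(1 / n, n)                        # start with uniform observation weights
miss <- c(TRUE, TRUE, rep(FALSE, 8))      # hypothetical errors of the 1st tree
err <- sum(w[miss])                       # weighted error rate of the 1st tree
alpha <- 0.5 * log((1 - err) / err)       # weight of this tree in the final vote
w <- w * exp(ifelse(miss, alpha, -alpha)) # up-weight mistakes, down-weight the rest
w <- w / sum(w)                           # renormalise: the 2nd tree now focuses on the errors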

To conduct boosting on the decision tree with C50, we simply add trials = 10, meaning that 10 decision trees will be built. If we compare the results we can see that they have improved!

In [20]:
credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                       trials = 10)
credit_boost10

credit_boost_pred10 <- predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
Out[20]:
Call:
C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Number of boosting iterations: 10 
Average tree size: 56 

Non-standard options: attempt to group attributes
 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |        60 |         8 |        68 | 
               |     0.600 |     0.080 |           | 
---------------|-----------|-----------|-----------|
           yes |        15 |        17 |        32 | 
               |     0.150 |     0.170 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        75 |        25 |       100 | 
---------------|-----------|-----------|-----------|

 

NB: We could display all 10 trees using the command summary(credit_boost10) (output not included here as it is too long).
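As before, we can check the improvement directly:

mean(credit_boost_pred10 == credit_test$default) # 0.77, up from 0.73 with a single tree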

6. Penalty and cost matrix: making some misclassifications more costly than others 

Until now we have just tried to achieve the highest accuracy. But is the cost of refusing a credit to someone who will not default the same as the cost of giving a credit to someone who will default? Obviously not, as the latter case is much more expensive for the bank (credit default) than missing a customer.
Here comes a neat feature of C5.0 that enables us to define a cost matrix to reflect this fact. In the present case, we suppose that the error of giving a credit to someone who will default is 4 times more costly than missing a potential opportunity to give a loan. Hence we have:

In [26]:
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)
error_cost
Out[26]:
     [,1] [,2]
[1,]    0    4
[2,]    1    0

NB: the order of the cost matrix, as explained here, is the following: the cost matrix should be CxC, where C is the number of classes. Diagonal elements are ignored. Columns should correspond to the true classes and rows to the predicted classes.

In our case, the numbers for the columns and rows are based on the factor levels of the variable default, where "no" = 1 and "yes" = 2. For instance, the "4" in the matrix means: true class is "yes" (column 2) and predicted class is "no" (row 1).
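To make this orientation explicit (and avoid the warning shown below), we could name the dimensions when building the matrix; a minimal sketch:

matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = matrix_dimensions) # rows = predicted class, columns = actual class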

In terms of syntax, we simply have to add costs = and the name of our cost matrix:

In [27]:
credit_cost <- C5.0(credit_train[-17], credit_train$default,
                          costs = error_cost)
credit_cost_pred <- predict(credit_cost, credit_test)
Warning message:
In C5.0.default(credit_train[-17], credit_train$default, costs = error_cost): 
no dimnames were given for the cost matrix; the factor levels will be used
In [28]:
CrossTable(credit_test$default, credit_cost_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
               | predicted default 
actual default |        no |       yes | Row Total | 
---------------|-----------|-----------|-----------|
            no |        42 |        26 |        68 | 
               |     0.420 |     0.260 |           | 
---------------|-----------|-----------|-----------|
           yes |         6 |        26 |        32 | 
               |     0.060 |     0.260 |           | 
---------------|-----------|-----------|-----------|
  Column Total |        48 |        52 |       100 | 
---------------|-----------|-----------|-----------|

 

Looking at the matrix, we only achieve 68% (= 42 + 26) accuracy. However, if we look at the types of errors, we observe that we drastically reduced the number of cases where we predicted no default whereas there actually was one (from 16 down to 6).
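We can quantify this on the costly class directly, e.g. the share of actual defaulters that the model correctly flags:

# "recall" on the yes class
sum(credit_cost_pred == "yes" & credit_test$default == "yes") /
  sum(credit_test$default == "yes") # 26/32 = 0.81, versus 16/32 = 0.50 without costs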

In a nutshell....

In [31]:
# Importing data
credit <- read.csv("credit.csv")

# Reshuffling the observation order and splitting into training & test sets
set.seed(12345)
credit_rand <- credit[order(runif(1000)), ]

credit_train <- credit_rand[1:900, ]
credit_test  <- credit_rand[901:1000, ]

# Building the model
library(C50)
credit_model <- C5.0(credit_train[-17], credit_train$default)

# Predicting on the test set
credit_pred <- predict(credit_model, credit_test)

# Boosting the accuracy of decision trees
credit_boost10 <- C5.0(credit_train[-17], credit_train$default,
                       trials = 10)
credit_boost_pred10 <- predict(credit_boost10, credit_test)

# Making some mistakes more costly than others

error_cost <- matrix(c(0, 1, 4, 0), nrow = 2) # create a cost matrix

credit_cost <- C5.0(credit_train[-17], credit_train$default,
                          costs = error_cost) # Apply the cost matrix to the tree

credit_cost_pred <- predict(credit_cost, credit_test)
Warning message:
In C5.0.default(credit_train[-17], credit_train$default, costs = error_cost): 
no dimnames were given for the cost matrix; the factor levels will be used

I hope this tutorial was helpful! In another tutorial we will explore another package for decision trees, the rpart package. That's all folks!

Patrick
