Genetic Algorithms and Decision Trees

This research explores the use of genetic algorithms to directly evolve decision trees.

Genetic algorithms can search complex hypotheses spaces based on overall quality metrics, such as decision tree size and accuracy.

A software product has come out of this research, GATree, available from www.gatree.com.

It has been also completely re-written in Java.

Recently, the basic algorithm has also been re-written in R (and has been parallelized, too).


The initial theoretical and experimental aspects of this research are set out in the following papers (both available via www.gatree.com):


We've tried GATree extensively to experiment on deriving success/failures models for exams.


We have also experimented with a scheme to allow fitness to be piecewise computed in a lossless fashion and thus speed-up evolution substantially:


Now, having started dealing with lossless fitness inheritance, new issues are coming up:

- To obtain a 2- to 3-fold increase in speed (admittedly, with a raw implementation, so we might improve there a bit as well) we are sacrificing some space, O(n), where n is the training set size, to store training set data in the decision tree leaves. For relatively small problems, this is not an issue. For large ones, it will be a problem, but one can point out that the initial solution (which does not employ lossless fitness inheritance) will also become expensive when it evaluates each decision tree (since it will need to access the full training set). Can we quantify "large" and see whether some approach is better for distributed fitness evaluation? That looks more like an empirical problem, but could also have a theoretical flavour.

- Would fitness caching be useful for decision trees? Caching calls for overhead in terms of space and depends on which scheme you need to implement for cache manegement. If small decision tree parts come up again and again it may be helpful. Will it be better than fitness inheritance? Quite likely, this will also start as an empirical problem with a strong data structures component.

- Classical fitness inheritance was based on approximating the fitness of an individual based on parents' fitness. Would that work with decision trees? That looks like a purely empirical problem.

- Our approach to lossless fitness inheritance exploits the formulation of GATree's fitness function, which is of a divide-and-conquer nature. There is no reason to believe that this should not apply to other problems with fitness functions being calculated in a divide-and-conquer nature. What are these problems and their fitness functions; can they be engineered to allow lossless fitness inheritance?