Q. What is GPTIPS?
A. It is a free, open source MATLAB toolbox for genetic programming (GP) with many features specifically aimed at symbolic regression. GPTIPS is an acronym for Genetic Programming Toolbox for the Identification of Process Systems.
GPTIPS is a generic genetic programming toolbox for MATLAB - but the main intention of GPTIPS is to perform symbolic regression. That is, to allow you to automatically build empirical symbolic non-linear models - from data - of the form
y = f (X1, ..., XN)
where y is an output/response variable (the thing you are trying to predict), X1, ..., XN are input/predictor variables (the things you know and want to use to predict y) and f is a symbolic non-linear function (or a collection of non-linear functions). The form of the function f is determined automatically by GPTIPS.
Q. What license is GPTIPS distributed under?
A. GPTIPS is 'free' subject to the GPL (GNU General Public License) v3, which can be viewed at http://www.gnu.org/licenses/gpl-3.0.html.
Q. How do I cite GPTIPS?
A. If you use GPTIPS in any published work then please use the following citation.
GPTIPS: an open source genetic programming toolbox for multigene symbolic regression
Searson, D.P., Leahy, D.E. & Willis, M.J.
Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS 2010), Hong Kong, 17-19 March, 2010.
I work in an academic environment and citations are important for my career. :)
Q. What is symbolic regression?
A. In classical linear regression and non-linear regression, a pre-determined model structure is assumed (by you) and then the problem is to find the 'optimal' parameters of the model to minimise some prediction error metric of y (typically the sum of squared errors; SSE) over a set of known y and X data values (this known data set is usually called the 'training' data). Once these 'optimal' parameters have been found, you now have a model which can (you hope) predict unknown y values for new X values.
E.g. Consider a typical linear regression scenario:
There is an output (response) variable y and N input (predictor) variables X1, ..., XN. There are a number of observations of y and corresponding observations of the N inputs. These comprise the training data set. It is then assumed that y is an unknown linear function of the variables X1, ..., XN. The problem is now to find the values of the unknown parameters a0, ..., aN in the following expression that minimise a metric of the error E over the training data set.
y = a0 + a1 X1 + a2 X2 + ... + aN XN + E
For instance, the SSE error metric can be minimised by using the least squares normal equations to give optimal a0, ..., aN. The problem is that linear regression models will generally not capture non-linear relationships.
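To make the least squares step concrete, here is a minimal sketch in Python (it is not GPTIPS code and not MATLAB). It builds the normal equations (A'A) a = A'y for the linear model above and solves them by Gaussian elimination. The function names (solve_linear_system, fit_linear, sse) and the training data are made up for illustration.

```python
# Illustrative only: ordinary least squares via the normal equations
# (A^T A) a = A^T y. Not part of GPTIPS.

def solve_linear_system(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [M | b] so row operations apply to both sides.
    aug = [list(M[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Pivot on the largest entry in this column for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_linear(X, y):
    """Return [a0, a1, ..., aN] minimising the SSE of y = a0 + sum(ai * Xi)."""
    A = [[1.0] + list(row) for row in X]  # leading 1 carries the bias a0
    p = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve_linear_system(AtA, Aty)


def sse(X, y, coeffs):
    """Sum of squared errors of the linear model on the data (X, y)."""
    total = 0.0
    for row, yi in zip(X, y):
        pred = coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], row))
        total += (yi - pred) ** 2
    return total


# Made-up, noise-free training data generated from y = 1 + 2*X1 - 3*X2.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_train = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in X_train]
coeffs = fit_linear(X_train, y_train)
```

Because the data here really are linear, the fitted coefficients recover a0 = 1, a1 = 2 and a2 = -3 up to rounding. On data with a non-linear relationship the same procedure still returns the best linear fit, but the remaining SSE can be large.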
However, in symbolic regression a model structure is not explicitly assumed, and the problem is to find both the structure and parameters of a non-linear model that minimises the chosen error metric. In GPTIPS, a genetic programming algorithm is used to do this. A typical model generated by GPTIPS could be:
y = 0.23 X1 + 0.33(X1 - X5) + 1.23 X3^2 - 3.34 cos(X1) + 0.22
This model contains both linear and non-linear terms.
In GPTIPS, you don't specify the functional form of the evolved models; you only specify which "building blocks" the models can be constructed from, e.g. +, -, *, cos, tanh etc. The advantage is that symbolic regression can build non-linear models without having to know beforehand what the model structure should be. Another advantage is that collinearity of the inputs (which can cause severe problems in many linear regression methods) is not generally problematic in symbolic regression.
Q. Can I export GPTIPS models as standalone M files for use outside GPTIPS?
A. Yes, the GPTIPS utility function
Q. How are models represented in GPTIPS?
A. Like most GP implementations, each model is represented as one or more trees. E.g.
In this simple case, the tree can be evaluated, left to right, to give the equivalent model:
y = √X5 + X1
Internally, each GP tree is represented in GPTIPS by a compact coded string. Tree structures are used in GP because they facilitate the process of simulated evolution to create populations of better trees from existing ones using tree mutation and tree crossover operations.
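As a rough illustration of how a tree maps to a model, here is a small Python sketch that stores the tree for √X5 + X1 as nested tuples and evaluates it recursively. This is not how GPTIPS stores trees (it uses a coded string, as noted above); the nested tuple layout, the FUNCTIONS table and the names evaluate and depth are assumptions made for this example only.

```python
import math

# Illustrative function set ("building blocks"); the names are hypothetical.
FUNCTIONS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "sqrt": lambda a: math.sqrt(a),
    "cos": lambda a: math.cos(a),
}


def evaluate(node, inputs):
    """Recursively evaluate a tree given a dict of input values."""
    if isinstance(node, str):           # terminal: an input variable name
        return inputs[node]
    if isinstance(node, (int, float)):  # terminal: a numeric constant
        return float(node)
    name, *children = node              # internal node: (function, child, ...)
    args = [evaluate(child, inputs) for child in children]
    return FUNCTIONS[name](*args)


def depth(node):
    """Number of node levels in the tree (a single terminal has depth 1)."""
    if not isinstance(node, tuple):
        return 1
    return 1 + max(depth(child) for child in node[1:])


# The tree for y = sqrt(X5) + X1.
tree = ("plus", ("sqrt", "x5"), "x1")
value = evaluate(tree, {"x1": 2.0, "x5": 9.0})  # sqrt(9) + 2
tree_depth = depth(tree)
```

With X1 = 2 and X5 = 9 the tree evaluates to 5. Mutation and crossover then amount to replacing or swapping sub-tuples, which is why tree structures are convenient for GP.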
GPTIPS allows the use of multigene (multiple tree) models. Here each overall model is a linear combination of 1 or more trees. Each tree can be thought of as a partial model fragment which has a weighted contribution to the overall model. The tree encoding is referred to as the 'genotype' and the decoded (and simplified) model structure is referred to as the 'phenotype'.
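A multigene model can be pictured with the following Python sketch, in which each gene is a small function of the inputs and the overall prediction is a weighted sum of the gene outputs. It is not GPTIPS code. The genes, the weights and the inclusion of a constant bias term are all assumptions chosen for illustration, and how the weights are obtained is not shown.

```python
import math

# Each gene is one tree, written here directly as a Python function of the
# inputs. These four genes are made up for illustration.
genes = [
    lambda x: x["x1"],
    lambda x: x["x1"] - x["x5"],
    lambda x: x["x3"] ** 2,
    lambda x: math.cos(x["x1"]),
]

# One weight per gene plus a bias term (the bias is an assumption here).
weights = [0.23, 0.33, 1.23, -3.34]
bias = 0.22


def predict(inputs):
    """Overall model: bias + weighted sum of the individual gene outputs."""
    return bias + sum(w * gene(inputs) for w, gene in zip(weights, genes))


y_hat = predict({"x1": 0.0, "x3": 2.0, "x5": 1.0})
```

For these inputs the gene outputs are 0, -1, 4 and 1, so the prediction is 0.22 - 0.33 + 4.92 - 3.34 = 1.47. Each gene contributes only through its weight, which is what makes it easy to see which fragments matter.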
Q. How many genes do I need to model my data?
A. This really depends on the data and your expectations of the resulting model. More genes usually result in a more "accurate" model, but the complexity of the model may be high. That said - as a very broad guideline - start with 3 or 4 genes (with a maximum depth of 4 or 5 nodes) and work upwards from there. After a few runs you should begin to find the "sweet spot" that gives a decent trade off between model accuracy and complexity. As a side note: I would avoid having large ( > 7) maximum tree depths as this tends to encourage overfitting and bloated models. Finally, too many genes can also lead to bloated models (this is "horizontal bloat" in contrast to the usual "vertical bloat" found in single tree GP models). GPTIPS 2.0 will contain tools to explicitly identify horizontal bloat in multigene models.
Q. Where can I find out more about genetic programming?
A. I recommend the excellent, free to download Field Guide to Genetic Programming.
Q. Does GPTIPS do variable selection (feature selection)?
A. Yes, GPTIPS implicitly performs variable selection. The simulated evolutionary processes will "try" to pick the input variables that give the best overall prediction. However, how well it does this is dependent on several factors, like how many inputs there are, what population size is used, how many generations GPTIPS is run for etc.
Q. What GP selection methods does GPTIPS support?
A. Currently, only tournament selection is implemented (with the option of enabling lexicographic parsimony pressure). GPTIPS 2.0 will also support Pareto tournaments.
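For readers new to these terms, here is a minimal Python sketch of tournament selection with lexicographic parsimony pressure: a few individuals are drawn at random, the one with the lowest error wins, and ties on error are broken in favour of the smaller model. This is a generic illustration, not the GPTIPS implementation. The dictionary fields, the use of node count as the complexity measure and sampling without replacement are all assumptions.

```python
import random


def tournament_select(population, tournament_size, rng):
    """Pick one individual by tournament with lexicographic parsimony pressure.

    Each individual is a dict with an 'error' (lower is better) and a
    'nodes' count (smaller is simpler). Field names are hypothetical.
    """
    contestants = rng.sample(population, tournament_size)
    # Compare on error first; only if errors are equal does size decide.
    return min(contestants, key=lambda ind: (ind["error"], ind["nodes"]))


population = [
    {"name": "A", "error": 0.10, "nodes": 15},
    {"name": "B", "error": 0.10, "nodes": 7},
    {"name": "C", "error": 0.25, "nodes": 3},
]

rng = random.Random(0)
# With a tournament as large as the population, every individual competes,
# so the outcome does not depend on the random draw.
winner = tournament_select(population, len(population), rng)
```

Here A and B tie on error, so the smaller model B is selected. C is the smallest model of all but loses because size is only consulted when errors are equal.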
Q. Are GPTIPS models as good as neural net models?
A. It depends on (a) what you mean by good, (b) the problem at hand and (c) what you want the model for.
GPTIPS sometimes (usually when only a few input variables are involved) lags behind a neural net model in terms of raw predictive performance but the equivalent GP models are often simpler, shorter and may be open to physical interpretation. It's not always an easy question to answer. To put it another way: is the model y = 3 X1^2 + 2 X1 X2 (R^2 = 0.93) "better" or "worse" than a black box neural net model with R^2 = 0.95?
Q. Is GPTIPS better than PLS (partial least squares) regression?
A. Again, it depends on context but: yes, GPTIPS is better. In my opinion, PLS is an overused and overvalued analytical tool. PLS regression is also hard to understand (well, at least I find it hard to understand) and the models it produces are complex and fragile.
Q. What are the system requirements for GPTIPS?
A. MATLAB version 7.6 or higher. I have tried to make it as portable as possible across platforms and versions - however, I do not work for The Mathworks and I do not have access to the resources to ensure it works on everything.
Q. What Mathworks toolboxes are required?
A. None for the core functions of GPTIPS but the Mathworks Symbolic Math Toolbox is required for model simplification and conversion to standalone M file format. The Mathworks Statistics Toolbox is also required for a small number of GPTIPS features (such as computing the statistical significance of model terms).
Q. Will there be any further releases of GPTIPS?
Q. Does GPTIPS run on Octave?
A. No. Currently there are too many limitations in Octave to make it worthwhile.
Q. Who develops GPTIPS?
A. GPTIPS was written by Dominic Searson (me). However, it owes much to current and former colleagues at the School of Chemical Engineering and Advanced Materials, Newcastle University, UK. Notably: Dr Mark Willis, Dr Hugo Hiden, Dr Mark Hinchliffe and Dr Ben McKay. GPTIPS is an ongoing, open source project and any contributions, suggestions etc are welcome. Please email them to email@example.com.