What is GPTIPS?

It is a free, open source MATLAB toolbox for symbolic data mining (SDM). It uses a biologically inspired machine learning method called multigene genetic programming (MGGP) as the 'engine' that drives the automatic model discovery process. 

GPTIPS is an acronym for Genetic Programming Toolbox for the Identification of Physical Systems.

GPTIPS is a generic genetic programming toolbox for MATLAB, but its main purpose is symbolic data mining: automatically discovering empirical symbolic non-linear models from data.

These are typically of the form

y = f(X1, ..., XN)

where y is an output/response variable (the thing you are trying to predict), X1, ..., XN are input/predictor variables (the things you know and want to use to predict y) and f is a symbolic non-linear function (or a collection of non-linear functions). The form of f is determined automatically by GPTIPS.

What is symbolic data mining?

Symbolic data mining is the process of extracting hidden, meaningful relationships from data in the form of symbolic equations. Unlike the models produced by other data mining methods, the structure of these equations often gives new insight into the physical systems or processes that generated the data.

This is in stark contrast to existing data mining methods. For instance, in classical linear and non-linear regression, a pre-determined model structure is assumed (by you). The problem is then to find the 'optimal' parameters of the model that minimise some prediction error metric of y (typically the sum of squared errors, SSE) over a set of known y and X data values (this known data set is usually called the 'training' data). Once these 'optimal' parameters have been found, you have a model which can (you hope) predict unknown y values for new X values.

Consider a typical linear regression scenario:

There is an output (response) variable y and N input (predictor) variables X1, ..., XN. There are a number of observations of y and corresponding observations of the N inputs. These comprise the training data set. It is then assumed that y is an unknown linear function of the variables X1, ..., XN. The problem is now to find the values of the unknown parameters a0, ..., aN in the expression below that minimise a metric of the error E over the training data set.

y = a0 + a1 X1 + a2 X2 + ... + aN XN + E

For instance, the SSE error metric can be minimised by using the least squares normal equations to give optimal a0, ..., aN. The problem is that linear regression models will generally not capture non-linear relationships.
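
As a minimal sketch of the least squares step described above, the following fits a0, ..., aN on a small made-up data set, once via the normal equations and once via MATLAB's backslash operator (which is usually the numerically safer choice). The data values and variable names are illustrative assumptions, not taken from GPTIPS.

```matlab
% Made-up training data: 6 observations of N = 2 inputs
X = [1 2; 2 1; 3 5; 4 3; 5 8; 6 2];
a_true = [0.5; 2; -1];                      % assumed 'true' a0, a1, a2

% Regression matrix with a leading column of ones for the intercept a0
A = [ones(size(X,1),1) X];

% Outputs with a small deterministic disturbance standing in for E
y = A * a_true + 0.01 * sin((1:6)');

% Least squares via the normal equations: (A'A) a = A'y
a_normal = (A' * A) \ (A' * y);

% Same fit using backslash directly on the regression matrix
a_backslash = A \ y;

% Sum of squared errors (SSE) over the training data
E = y - A * a_normal;
sse = sum(E.^2);

fprintf('a0 = %.3f, a1 = %.3f, a2 = %.3f, SSE = %.2e\n', a_normal, sse);
```

Both routes should give essentially the same parameters here; the structure of the model (a weighted sum of the inputs) is fixed in advance and only the weights are found.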

However, in symbolic data mining a model structure is not explicitly assumed, and an algorithm is used to find the structure and parameters of a non-linear model that minimises the chosen error metric. In GPTIPS, a genetic programming algorithm is used to do this. A typical model generated by GPTIPS is

y = 0.23 X1 + 0.33(X1 - X5) + 1.23 X3^2 - 3.34 cos(X1) + 0.22

This model contains both linear and non-linear terms.

In GPTIPS, you don't specify the functional form of the evolved models; you only specify which 'building blocks' the models can be constructed from, e.g. plus, minus, times, cos, tanh etc. The advantage is that in symbolic data mining you build non-linear models without having to know beforehand what the model structure should be.

Another advantage is that collinearity and correlation of the inputs (which can cause severe problems in many linear regression methods) are not generally problematic in symbolic data mining.

Are GPTIPS models as good as neural net models?

It depends on (a) what you mean by good, (b) the problem at hand and (c) what you want the model for.

GPTIPS sometimes (usually when only a few input variables are involved) lags behind a neural net model (e.g. a feedforward neural network with a single hidden layer) in terms of raw predictive performance, but the equivalent GP models are often simpler, shorter and may be open to physical interpretation. It's not always an easy question to answer. To put it another way: is the model y = 3X1^2 + 2X1 X2 (R^2 = 0.93) "better" or "worse" than a black box neural net model with R^2 = 0.95?

GPTIPS generates models that are intended to be understood by humans, but neural networks - whilst very powerful when trained correctly - are not.

At any rate, on typical practical applications GPTIPS normally significantly outperforms feedforward neural net models in terms of model performance, interpretability, deployability and robustness.

Is GPTIPS better than PLS (partial least squares) regression?

Again, it depends on context but: yes, GPTIPS is better. In my opinion, PLS regression is an overused and overvalued analytical tool. PLS regression is also hard to understand (well at least I find it hard to understand) and the models it produces are complex, fragile, hard to deploy and difficult to interpret.

What license is GPTIPS distributed under?

GPTIPS is 'free' subject to the GNU General Public License (GPL) v3, which can be viewed at http://www.gnu.org/licenses/gpl-3.0.html.

How do I cite GPTIPS?

If you use GPTIPS in any published work then please use the following citation.

GPTIPS: an open source genetic programming toolbox for multigene symbolic regression

Searson, D.P., Leahy, D.E. & Willis, M.J.

Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS 2010), Hong Kong, 17-19 March, 2010.

I work in an academic environment and citations are important for my career. :)

How do I run GPTIPS?

GPTIPS is run from the MATLAB command line using a config file to specify the run settings and to load or generate user data. The config file is a standard MATLAB m file. For instance, in GPTIPS 2 a config file could be as simple as

function gp = myconfig(gp)

gp.userdata.xtrain = rand(100,10);

gp.userdata.ytrain = rand(100,1);

gp.nodes.functions.name = {'plus','minus','times'};

This example randomly generates input (X) and output (y) data and specifies that the trees should be built using plus, minus and times nodes. 

All other settings use the GPTIPS defaults, including the fitness function (regressmulti_fitfun.m) which performs 'multigene' symbolic model discovery. You can override any default setting by adding an appropriate line to your config file. For example, the default maximum number of genes is 4; to use up to 8 genes instead, add the following line

gp.genes.max_genes = 8;
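
Putting these pieces together, a slightly more realistic config file might look like the sketch below. It uses only the settings shown above, but swaps the purely random output for synthetic data generated from a known relationship, so that there is actually something for the run to discover. The relationship, the number of inputs and the choice of building blocks are illustrative assumptions.

```matlab
function gp = myconfig(gp)
% MYCONFIG Illustrative GPTIPS 2 config file (sketch, not a shipped example).

% Synthetic training data: 100 observations of 3 inputs
x = 10 * rand(100, 3);

% Assumed 'true' relationship, chosen purely for illustration
y = 2 * x(:,1) .* x(:,2) - 3.5 * cos(x(:,3)) + 1;

gp.userdata.xtrain = x;
gp.userdata.ytrain = y;

% Building blocks; cos is included so the assumed structure is reachable
gp.nodes.functions.name = {'plus','minus','times','cos'};

% Override the default maximum number of genes
gp.genes.max_genes = 8;
```

Save this as myconfig.m somewhere on the MATLAB path; everything not set here keeps its default value.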

To run GPTIPS use the rungp function from the command line with a function handle to your config file as a parameter. E.g.

>> gp = rungp(@myconfig);

When the run is complete the resulting gp data structure can then be visualised and processed using a variety of GPTIPS functions, e.g. to graphically browse the population in terms of goodness of fit and model complexity, the popbrowser function can be used as follows
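
A minimal call, assuming gp is the structure returned by rungp above and that the default display options are acceptable:

```matlab
% gp is assumed to be the output of gp = rungp(@myconfig);
popbrowser(gp);
```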


When using popbrowser, it's often a good idea to also generate an HTML report of the model equations that form the Pareto front (i.e. the green dots in the popbrowser figure window) using the paretoreport function. In this report the equations can be sorted by model complexity or model performance by clicking on the appropriate header. To generate the report use:
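
A minimal call, assuming gp is the structure returned by a completed run and that no optional arguments are needed:

```matlab
% gp is assumed to be the output of gp = rungp(@myconfig);
paretoreport(gp);
```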


I'm getting weird Java errors when running GPTIPS 2 in parallel mode. Why is this happening?

There is a known issue with the JVM in versions of MATLAB (all platforms) prior to version R2013b (6.3). This causes a failure of the Parallel Computing Toolbox in most cases. 

There is a fix/workaround for this here:


Please apply this fix if you are using a version prior to R2013b.

Can I export GPTIPS models as standalone M files for use outside GPTIPS?

Yes, the GPTIPS utility function gpmodel2mfile was specifically written for this purpose (it requires the Mathworks Symbolic Math Toolbox to create the standalone model file - but this toolbox is not required to run the standalone model file).

For example, to convert the 'best' (as evaluated on the training data) symbolic model in a population to a standalone M file use
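
A sketch of the call, assuming the arguments are the gp structure, a model selector ('best') and the output file name without its extension; check the function's help text for the exact signature in your version:

```matlab
% Assumed argument order: population structure, model selector, file name
gpmodel2mfile(gp, 'best', 'mymodel');
```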


This writes the model to the file mymodel.m

You can then run the model on a new data input matrix x using mymodel.m as follows:

>> yprediction = mymodel(x);

Additionally, if you want the vector of model predictions for (say) your training data you can use:

>> yprediction_train = mymodel(gp.userdata.xtrain);
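
Once you have a vector of predictions, you will usually want to score it against the known outputs. The sketch below computes the RMSE and R^2 for a stand-in model; the anonymous function is a hypothetical placeholder for an exported mymodel.m, and the data values are made up.

```matlab
% Stand-in for an exported model file (hypothetical; a real mymodel.m
% would be written by gpmodel2mfile)
mymodel = @(x) 3 * x(:,1).^2 + 2 * x(:,1) .* x(:,2);

% Made-up inputs and 'measured' outputs with a small disturbance
x = [0.1 0.9; 0.4 0.2; 0.7 0.5; 1.0 0.3; 1.3 0.8; 1.6 0.1];
y = mymodel(x) + 0.05 * [1; -1; 1; -1; 1; -1];

% Model predictions and residuals
yprediction = mymodel(x);
residuals = y - yprediction;

% Root mean squared error and coefficient of determination
rmse = sqrt(mean(residuals.^2));
r2 = 1 - sum(residuals.^2) / sum((y - mean(y)).^2);

fprintf('RMSE = %.4f, R^2 = %.4f\n', rmse, r2);
```

The same two lines for rmse and r2 apply unchanged to test data, which is the fairer measure of how well a model generalises.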

How are models represented in GPTIPS?

Like most GP implementations, each symbolic model is represented as one or more trees. E.g.

The tree can be evaluated, left to right, to give the equivalent symbolic equation

y = √X5 + X1

Internally, each GP tree is represented in GPTIPS by a compact coded string. Tree structures are used in GP because they facilitate the process of simulated evolution to create populations of better trees from existing ones using tree mutation and tree crossover operations.
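
To make the tree-to-equation idea concrete, the sketch below writes the example tree as a readable prefix string and evaluates it on a few made-up observations. It reads the equation above as sqrt(X5) + X1, and the string format is an illustrative assumption: it is not the compact internal encoding that GPTIPS actually uses.

```matlab
% Prefix (function-call) form of the tree: root node plus, with
% children sqrt(x5) and x1. Illustrative format only.
expr = 'plus(sqrt(x5),x1)';

% Turn the string into a callable function of the two inputs it uses
model = str2func(['@(x1,x5) ' expr]);

% Three made-up observations of x1 and x5
x1 = [2; 0; 7];
x5 = [9; 16; 4];

% Evaluate the tree for every observation at once
y = model(x1, x5);
disp(y');
```

Because the expression is just nested function calls, swapping a subtree (crossover) or replacing a node (mutation) amounts to editing part of the string, which is why tree structures suit simulated evolution.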

GPTIPS allows the use of multigene (multiple tree) models. Here each overall model is a linear combination of one or more trees. Each tree can be thought of as a partial model fragment which has a weighted contribution to the overall model. The tree encoding is referred to as the 'genotype' and the decoded (and simplified) model structure is referred to as the 'phenotype'.
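
The following sketch shows what 'a linear combination of trees' means in practice. Two hypothetical genes are evaluated on made-up data, and their weights (plus a bias term) are then estimated by ordinary least squares, which is the usual approach in multigene symbolic regression and is assumed here. The gene expressions and data are illustrative only.

```matlab
% Made-up inputs: 6 observations of 2 variables
x = [0.5 1.0; 1.0 0.2; 1.5 0.7; 2.0 1.1; 2.5 0.4; 3.0 0.9];

% Made-up outputs generated from an assumed relationship
y = 1 + 2 * x(:,1) .* x(:,2) - 0.5 * cos(x(:,1));

% Two hypothetical genes (trees), each a fixed non-linear expression
gene1 = x(:,1) .* x(:,2);
gene2 = cos(x(:,1));

% Gene outputs plus a column of ones (bias) form the regression matrix
G = [ones(size(x,1),1) gene1 gene2];

% Weights [bias; gene 1; gene 2] by least squares
b = G \ y;

% Overall multigene model prediction
ypred = G * b;

fprintf('bias = %.3f, w1 = %.3f, w2 = %.3f\n', b);
```

The evolutionary search only has to find useful tree structures; the weighting of each tree is then a linear problem.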

How many genes do I need to model my data?

This really depends on the data and your expectations of the resulting model. More genes usually result in a more "accurate" model, but the complexity of the model may be high. That said - as a very broad guideline - start with 3 or 4 genes (with a maximum depth of 4 or 5 nodes) and work upwards from there.

After a few runs you should begin to find the "sweet spot" that gives a decent trade off between model accuracy and performance. As a side note: I would avoid having large ( > 7) maximum tree depths as this tends to encourage overfitting and bloated models. It also makes GPTIPS run more slowly.

Finally, too many genes can also lead to bloated models (this is "horizontal bloat" in contrast to the usual "vertical bloat" found in single tree GP models) which may not generalise well. GPTIPS 2 contains tools to explicitly identify horizontal bloat in multigene models.
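
As a starting point, the broad guideline above could be expressed in a config file roughly as follows. The max_genes field is the one shown earlier; the tree depth field name is an assumption and should be checked against your GPTIPS version.

```matlab
% Lines to add to your config file (sketch)
gp.genes.max_genes = 4;      % start with 3 or 4 genes
gp.treedef.max_depth = 5;    % assumed field name; keep depth at 4 or 5 initially
```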

Where can I find out more about genetic programming?

I recommend the excellent, free to download Field Guide to Genetic Programming.

Does GPTIPS do variable selection (feature selection)?

GPTIPS implicitly performs variable selection. The simulated evolutionary processes will "try" to pick the input variables that give the best overall performance. However, how well it does this is dependent on several factors, like how many inputs there are, what population size is used, how many generations GPTIPS is run for etc.

What GP selection methods does GPTIPS support?

In GPTIPS 1 only tournament selection is implemented (with the option of lexicographic parsimony pressure enabled). 

GPTIPS 2 supports Pareto tournaments based on fitness and tree complexity. For instance, to use only Pareto tournaments add the following line to your config file

gp.selection.tournament.p_pareto = 1;

To set a quarter of all selection events to Pareto tournaments use the following (the remaining 3/4 will be regular tournaments based only on fitness).

gp.selection.tournament.p_pareto = 0.25;
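
A Pareto tournament favours individuals that are not beaten on both fitness and complexity at the same time. The exact tournament procedure in GPTIPS 2 is not described here, so the following is only a generic sketch of the underlying idea: finding the non-dominated members of a made-up population, where lower is better for both criteria.

```matlab
% Made-up population, one row per individual:
% [fitness (e.g. RMSE), complexity]; lower is better for both
pop = [0.90  3;
       0.50  8;
       0.55  6;
       0.50 12;
       0.30 20;
       0.95  5];

n = size(pop, 1);
isdominated = false(n, 1);

for i = 1:n
    for j = 1:n
        % j dominates i if it is no worse on both criteria
        % and strictly better on at least one
        if j ~= i && all(pop(j,:) <= pop(i,:)) && any(pop(j,:) < pop(i,:))
            isdominated(i) = true;
            break;
        end
    end
end

% Indices of the non-dominated individuals (the Pareto front)
front = find(~isdominated);
disp(front');
```

Here individual 4 loses to individual 2 (same fitness, lower complexity) and individual 6 loses to individual 1, so neither is on the front.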

What are the system requirements for GPTIPS?

For GPTIPS 1 you will need MATLAB version 7.6 or higher. I have tried to make it as portable as possible across platforms and versions - however, I do not work for The Mathworks and I do not have access to the resources to ensure it works on everything.

GPTIPS 2 has been tested on the following versions of MATLAB: R2010b on 64 bit Windows, and R2014a (v 8.3) and R2014b (v 8.4) on 64 bit Mac OSX.

For certain GPTIPS 2 functions, namely drawtrees, gpmodelreport and paretoreport, you will need a web browser (preferably not Internet Explorer) and an internet connection. This is to allow the automatic download of certain web based JavaScript visualisation APIs. However, your data is not sent to any servers: all processing is done locally in your browser.

What Mathworks toolboxes are required?

None for the core functions of GPTIPS but the Mathworks Symbolic Math Toolbox is very highly recommended. Amongst other things, it is required for model simplification and conversion to standalone M file format. The Mathworks Statistics Toolbox is also required for a small number of GPTIPS features (such as computing the statistical significance of model terms).

In GPTIPS 2 the Mathworks Parallel Computing Toolbox speeds up the execution time of runs significantly on multicore machines and is highly recommended for power users.

Will there be any further releases of GPTIPS?

Yes - GPTIPS 2 will be released in summer 2014. Sorry - running late again :-/ Should be soon.

Does GPTIPS run on Octave? 

No. Currently there are too many limitations in Octave to make it worthwhile.

Who develops GPTIPS?

GPTIPS was written by Dominic Searson (me). I am a senior researcher in the School of Computing Science, Newcastle University, UK and I have been working on GP and its applications intermittently for about 15 years. My general interests are machine learning, scientific computing, data-driven modelling, data analysis/mining and data visualisation.  

If you would like to connect professionally, my LinkedIn profile is at https://www.linkedin.com/in/domsearson

GPTIPS is an ongoing, open source project and any contributions, suggestions etc. are welcome. Please email them to searson@gmail.com.