FAQ

What is GPTIPS?

It is a free, open source MATLAB toolbox for symbolic data mining (SDM). 

It uses a biologically inspired machine learning method called multigene genetic programming (MGGP) as the 'engine' that drives the automatic model discovery process. 

GPTIPS is an acronym for Genetic Programming Toolbox for the Identification of Physical Systems.

GPTIPS is a generic genetic programming toolbox for MATLAB - but the main intention of GPTIPS is to perform symbolic data mining. That is, to allow you to automatically discover empirical symbolic non-linear models from data.

These are typically of the form

y = f(X1, ..., XN)

where y is an output/response variable (the thing you are trying to predict), X1, ..., XN are input/predictor variables (the things you know and want to use to predict y) and f is a symbolic non-linear function (or a collection of non-linear functions). This is called symbolic regression and the form of f is determined automatically by GPTIPS.


What is symbolic data mining?

Symbolic data mining is the process of extracting hidden, meaningful relationships from data in the form of symbolic equations. The structure of these equations often gives new insight into the physical systems or processes that generated the data.

This is in stark contrast to existing data mining methods. For instance, in classical linear regression and non-linear regression, a pre-determined model structure is assumed (by you) and the problem is then to find the 'optimal' parameters of the model that minimise some prediction error metric of y (typically the sum of squared errors; SSE) over a set of known y and X data values (this known data set is usually called the 'training' data). Once these 'optimal' parameters have been found, you have a model which can (you hope) predict unknown y values for new X values.

Consider a typical linear regression scenario:

There is an output (response) variable y and N input (predictor) variables X1, ..., XN. There are a number of observations of y and corresponding observations of the N inputs. These comprise the training data set. It is then assumed that y is an unknown linear function of the variables X1, ..., XN. The problem is now to find the values of the unknown parameters a0, ..., aN in the following expression that minimise a metric of the error E over the training data set.

y = a0 + a1 X1 + a2 X2 + ... + aN XN + E

For instance, the SSE error metric can be minimised by using the least squares normal equation to give optimal a0, ..., aN. The problem is that linear regression models will generally not capture non-linear relationships.
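As a rough illustration of the least squares step (this is plain MATLAB and does not use GPTIPS), the sketch below fits a0, ..., aN to synthetic data via the normal equation. The data, the 'true' parameter values and all variable names are made up for the example.

```matlab
% Synthetic training data: 100 observations of N = 3 inputs (illustrative values only).
rng(0);
X = rand(100,3);
atrue = [0.5; 1.2; -0.7; 2.0];                      % [a0; a1; a2; a3]
y = atrue(1) + X*atrue(2:end) + 0.01*randn(100,1);  % linear model plus small noise

% Design matrix with a leading column of ones for the bias term a0.
A = [ones(size(X,1),1) X];

% Normal equation (A'A) a = A'y, solved without forming an explicit inverse.
a = (A'*A) \ (A'*y);

% Sum of squared errors (SSE) on the training data.
sse = sum((y - A*a).^2);
```

Here the structure (a bias plus a weighted sum of the inputs) is fixed in advance and only the parameters are estimated.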

However, in symbolic data mining a model structure is not explicitly assumed, and an algorithm is used to find the structure and parameters of a non-linear model that minimises the chosen error metric. In GPTIPS, a genetic programming algorithm is used to do this. A typical (regression) model generated by GPTIPS is

y = 0.23 X1 + 0.33 (X1 - X5) + 1.23 X3^2 - 3.34 cos(X1) + 0.22

This model contains both linear and non-linear terms.

In GPTIPS, you don't specify the functional form of the evolved models, you only specify what 'building blocks' that model can be constructed from, e.g. plus, minus, times, cos, tanh etc. The advantage is that in symbolic data mining you build non-linear models without having to know beforehand what the model structure should be.

Models evolved by GPTIPS are usually much more accurate and comprehensible than those created using standard GP.

Another advantage is that collinearity and correlation of the inputs (which can cause severe problems in many linear regression methods) are not generally problematic in GPTIPS.


How do I run GPTIPS 2?

GPTIPS is run from the MATLAB command line using a config file to specify the run settings and to load or generate user data. The config file is a standard MATLAB m file. 

The easiest way to see how this is done is to run the symbolic regression demos (gpdemo1, gpdemo2, gpdemo3, gpdemo4) from the command line, e.g.


>>gpdemo3


To run GPTIPS 2 with your own data, use the rungp function from the command line with a function handle to your config file as a parameter, e.g.


>>gp = rungp(@myconfig);


In GPTIPS 2 a config file could be as simple as


function gp = myconfig(gp)

gp.userdata.xtrain = rand(100,10);

gp.userdata.ytrain = rand(100,1);

gp.nodes.functions.name = {'plus','minus','times','rdivide','cube','sqrt','abs'};


This example randomly generates input (X) and output (y) data and specifies that the trees should be built using plus, minus, times, divide (unprotected), cube, square root and abs nodes. Other nodes typically used in symbolic regression are square, sin, cos, exp, power, add3, mult3, log, negexp and neg.

All other settings use the GPTIPS defaults, including the fitness function (regressmulti_fitfun.m) which performs 'multigene' symbolic model discovery. You can overwrite any default setting by adding an appropriate line to your config file. For example, the default number of genes is 4 and to override this to use 8 genes add the following line 


gp.genes.max_genes = 8;
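A slightly fuller config file might look like the sketch below. The file name myconfig2 and the synthetic data are made up for illustration. The field names for the testing data, population size, number of generations and tree depth are as I recall them from the demo config files, so treat them as assumptions and check them against the demos and the help before relying on them.

```matlab
function gp = myconfig2(gp)
% Hypothetical config file: synthetic data with a train/test split.

% Synthetic data: y depends non-linearly on the first two of five inputs.
x = rand(300,5);
y = x(:,1).^2 + sin(3*x(:,2));

% Training data (first 200 rows) and testing data (remaining 100 rows).
gp.userdata.xtrain = x(1:200,:);
gp.userdata.ytrain = y(1:200);
gp.userdata.xtest  = x(201:end,:);   % assumed field name
gp.userdata.ytest  = y(201:end);     % assumed field name

% Building blocks that the trees can be constructed from.
gp.nodes.functions.name = {'plus','minus','times','sin','square'};

% Overrides of default run settings (field names assumed, except max_genes).
gp.runcontrol.pop_size = 250;   % individuals per generation
gp.runcontrol.num_gen  = 100;   % number of generations
gp.genes.max_genes     = 6;     % maximum number of genes per model
gp.treedef.max_depth   = 4;     % maximum depth of each tree
```

It would be run in the same way as any other config file, i.e. gp = rungp(@myconfig2);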


When the run is complete the resulting gp data structure can then be visualised and processed using a variety of GPTIPS functions, e.g. to run the best individual in the population (as evaluated on the training data) use:


>>runtree(gp, 'best');


To run the best on the testing data use:


>>runtree(gp, 'testbest');


To graphically browse the population in terms of goodness of fit and model complexity, the popbrowser function can be used as follows


>>popbrowser(gp);


When using popbrowser, it's often a good idea to also generate an HTML report of the model equations that form the Pareto front (i.e. the green dots on the popbrowser figure window) using the paretoreport function. In this report the equations can be sorted by model complexity and model performance by clicking on the appropriate header. To generate the report use:


>>paretoreport(gp)


Note

In addition to the demos provided with GPTIPS 2 (gpdemo1, gpdemo2, gpdemo3, gpdemo4) the following example config files are also included for reference purposes. 


cubic_config - Multigene regression on data from a cubic polynomial.


3.4 x^3 + 2.9 x^2 + 6.2 x + 0.75


e.g. to run this use


>>gp = rungp(@cubic_config)



uball_config - Multigene regression on data from the 5 dimensional Unwrapped Ball function.


e.g. to run this use


>>gp = rungp(@uball_config)



ripple_config -  Multigene regression on data from a mathematical function f(x1,x2) of two input variables

 

 f(x1,x2) = (x1 - 3)(x2 - 3) + 2 sin((x1 - 4) (x2 - 4))


e.g. to run this use


>>gp = rungp(@ripple_config)



salustowicz1d_config - Multigene regression on data from a mathematical function f(x) of a single input variable 

 

f(x) = exp(-x) x^3 cos(x) sin(x) (sin(x)^2 cos(x) - 1)


e.g. to run this use


>>gp = rungp(@salustowicz1d_config)


In addition, see the following tutorial for a step by step example on how to use GPTIPS to accurately model another 'difficult' synthetic nonlinear regression problem.


Is there a manual for GPTIPS 2?

Not yet, but there will be 'soon'. There is, however, a tutorial, a paper (PDF) containing usage examples, and fairly extensive documentation in the help for each of the files in GPTIPS 2.


How do I run GPTIPS 2 in parallel mode?

Add the following line to your config file:


gp.runcontrol.parallel.auto = true;


You must have the Parallel Computing Toolbox installed and licensed for this to work. 


Note

  • The first time you run it in a session there is a short delay whilst the parallel mode initialises.
  • GPTIPS will autodetect the number of cores you have on your machine.


I'm getting weird Java errors when running GPTIPS 2 in parallel mode. Why is this happening?

There is a known issue with the JVM in versions of MATLAB (all platforms) prior to version R2013b (6.3). This causes a failure of the Parallel Computing Toolbox in most cases. 

There is a fix/workaround for this here:

 http://www.mathworks.com/support/bugreports/919688

Please apply this fix if you are using a version prior to R2013b.


How do I run GPTIPS 2 for a fixed amount of time?

To perform a GPTIPS run that terminates after a set amount of time (in seconds) add the following line to your config file:


gp.runcontrol.timeout = 60;


In this case the run terminates after 60 seconds, regardless of how many generations it was set to run for.


This can be used effectively in combination with multiple runs. Because GP is a non-deterministic process, results vary from run to run, so it is usually a good idea to perform multiple runs. In GPTIPS 2 you can perform multiple runs of fixed time duration that are merged at the end to form a single population. For instance, to perform 5 runs of 30 seconds each use the following settings in your config file.


gp.runcontrol.runs = 5;

gp.runcontrol.timeout = 30;
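Put together, a complete config file for 5 merged runs of 30 seconds each might look like the sketch below. The file name timed_config and the random data are placeholders. The parallel line is optional and needs the Parallel Computing Toolbox.

```matlab
function gp = timed_config(gp)
% Hypothetical config file: 5 merged runs of 30 seconds each.

% Placeholder data; replace with your own inputs and outputs.
gp.userdata.xtrain = rand(100,10);
gp.userdata.ytrain = rand(100,1);

% Building blocks that the trees can be constructed from.
gp.nodes.functions.name = {'plus','minus','times'};

% Five runs, each terminated after 30 seconds (roughly 150 seconds in total).
gp.runcontrol.runs    = 5;
gp.runcontrol.timeout = 30;

% Optional: use parallel mode (requires the Parallel Computing Toolbox).
gp.runcontrol.parallel.auto = true;
```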


How do I export GPTIPS 2 multigene regression models as standalone M files for use outside GPTIPS?

The GPTIPS function gpmodel2mfile does this (it requires the Mathworks Symbolic Math Toolbox to create the standalone model file - but this toolbox is not required to run the standalone model file).


For example, to convert the 'best' (as evaluated on the training data) symbolic model in a population to a standalone M file use


>>gpmodel2mfile(gp,'best','mymodel');


This writes the model to the file mymodel.m


You can then run the model on a new data input matrix x using mymodel.m as follows:


>> yprediction = mymodel(x);


Additionally, if you want the vector of model predictions for (say) your training data you can use:


>>yprediction_train = mymodel(gp.userdata.xtrain);
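Once you have the predictions, simple error metrics can be computed in plain MATLAB. The sketch below assumes that mymodel.m has already been created with gpmodel2mfile and that gp is the structure returned by the run. The variable names are illustrative.

```matlab
% Predictions of the exported model on the training inputs.
ytrain = gp.userdata.ytrain;
ypred  = mymodel(gp.userdata.xtrain);

% Root mean squared error and coefficient of determination (R^2).
err  = ytrain - ypred;
rmse = sqrt(mean(err.^2));
r2   = 1 - sum(err.^2) / sum((ytrain - mean(ytrain)).^2);

fprintf('RMSE = %.4f, R^2 = %.4f\n', rmse, r2);
```

The same calculation applies to any other input matrix for which you know the corresponding output values, e.g. a held-out test set.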


How do I export a GPTIPS 2 multigene regression model as a Symbolic Math object?

This is done with gpmodel2sym (it requires the Mathworks Symbolic Math Toolbox). For instance, to convert the best model (as evaluated on the testing data) in the population use:


>>gpmodel2sym(gp,'testbest');


The symbolic model can then be manipulated like any other symbolic math object.


Example

Run GPTIPS on the supplied cubic polynomial function.


>>gp = rungp(@cubic_config);


Extract the best model (training data) to a symbolic math object.


>>modelsym = gpmodel2sym(gp,'best')


Set the display precision to 2 using MATLAB's vpa function (variable precision arithmetic).


>>modelsym = vpa(modelsym,2)


Plot the symbolic math object using MATLAB's ezplot function.


>>ezplot(modelsym)
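A few other things that can be done with the symbolic object are sketched below, using standard Symbolic Math Toolbox functions (symvar, diff, matlabFunction). This assumes modelsym is the symbolic model from the example above and that it contains a single input variable, which should be the case for the cubic polynomial data. The other variable names are made up.

```matlab
% Symbolic input variables present in the model.
vars = symvar(modelsym);

% Differentiate the model with respect to its first input variable.
dmodel = diff(modelsym, vars(1));

% Convert the symbolic model to an ordinary numeric function handle.
f = matlabFunction(modelsym);

% Evaluate the model at x = 0.5 (valid when the model has a single input).
y05 = f(0.5);
```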



What GP selection methods does GPTIPS support?

In GPTIPS 1 only tournament selection is implemented (with the option of lexicographic parsimony pressure enabled). 

GPTIPS 2 supports pareto tournaments based on fitness and tree complexity. For instance, to use only pareto tournaments add the following line to your config file


gp.selection.tournament.p_pareto = 1;


To set a quarter of all selection events to pareto tournaments use the following (the remaining 3/4 will be regular tournaments based only on fitness).


gp.selection.tournament.p_pareto = 0.25;


Note

  • In GPTIPS 2 lexicographic selection is enabled by default for regular tournament selection.
  • 'Selection' refers to the process of selecting individuals in the current population (based on their fitness and complexity) to create new individuals in the next generation of individuals.


How do I visualise the trees in a GPTIPS 2 model?

In GPTIPS 2, the trees in any multigene regression model can be drawn to an HTML file using the drawtrees function, e.g.




>> drawtrees(gp,'best')


draws the best model in the population (as evaluated on the training data).


>> drawtrees(gp,'valbest')


draws the best model in the population (as evaluated on the validation data - if it exists).


>> drawtrees(gp,'testbest')


draws the best model in the population (as evaluated on the test data - if it exists).


>> drawtrees(gp,5)


draws the model in the population with numerical ID 5 where the ID is an integer that can range from 1 to the population size.


You can control the formatting of the drawn trees (colour, line width, font etc.) using additional CSS arguments to the drawtrees function.

For instance, to change the font to 'Comic Sans MS' use:


>> drawtrees(gp,'best',[],'Comic Sans MS')


For further advanced formatting see the help for the drawtrees function:


>>help drawtrees


Note

  • You need an internet connection for this function as it uses the Google Charts Javascript API.
  • Internet Explorer does not render the trees well - so use another browser for best results.


Are GPTIPS multigene regression models as good as neural net models?

It depends on (a) what you mean by good and (b) the problem at hand and (c) what you want the model for.

GPTIPS sometimes (usually when only a few input variables are involved) lags behind a neural net model (e.g. a feedforward neural network with a single hidden layer) in terms of raw predictive performance but the equivalent GP models are often simpler, shorter and may be open to physical interpretation. It's not always an easy question to answer. To put it another way: is the model y = 3 X1^2 + 2 X1 X2 (R^2 = 0.93) "better" or "worse" than a black box neural net model with R^2 = 0.95?

GPTIPS  generates models that are intended to be understood by humans, but neural networks - whilst very powerful when trained correctly - are not.

At any rate, on typical practical applications GPTIPS normally significantly outperforms feedforward neural net models in terms of model performance, interpretability, deployability and robustness.


Is GPTIPS better than PLS (partial least squares) regression?

Again, it depends on context but: yes, GPTIPS is better. In my opinion, PLS regression is an overused and overvalued analytical tool. PLS regression is also hard to understand (well at least I find it hard to understand) and the models it produces are complex, fragile, hard to deploy and difficult to interpret.


What license is GPTIPS distributed under?

GPTIPS is 'free' subject to the GPL (GNU General Public License) v3, which can be viewed here: http://www.gnu.org/licenses/gpl-3.0.html.


How do I cite GPTIPS?

If you use GPTIPS in any published work then please use the following citations.


GPTIPS 2: an open-source software platform for symbolic data mining

Searson, D.P.

Chapter 22 in Handbook of Genetic Programming Applications, A.H. Gandomi et al., (Eds.), Springer, New York, NY, 2015 (in press).


GPTIPS: an open source genetic programming toolbox for multigene symbolic regression

Searson, D.P., Leahy, D.E. & Willis, M.J.

Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS 2010), Hong Kong, 17-19 March, 2010.


I work in an academic environment and citations are important for my career. :)


How are multigene regression models represented in GPTIPS?

Like most GP implementations, each symbolic model is represented as one or more trees. E.g.



The tree can be evaluated, left to right, to give the equivalent symbolic equation

y = √X5 + X1

Internally, each GP tree is represented in GPTIPS by a compact coded string. Tree structures are used in GP because they facilitate the process of simulated evolution to create populations of better trees from existing ones using tree mutation and tree crossover operations.

GPTIPS uses multigene genetic programming (MGGP). This allows the use of multigene (multiple tree) models which usually perform much better than standard GP models.

In MGGP regression, each overall model is a linear combination of one or more trees. Each tree can be thought of as a partial model fragment which has a weighted contribution to the overall model. The tree encoding is referred to as the 'genotype' and the decoded (and simplified) model structure is referred to as the 'phenotype'.
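To make the 'linear combination of trees' idea concrete, here is a small sketch in plain MATLAB (it does not call GPTIPS). Two hand-written functions stand in for evolved trees, and the gene weights plus a bias term are found by least squares on made-up data. The gene expressions, data and variable names are all illustrative assumptions.

```matlab
% Made-up training data with 5 inputs.
rng(1);
X = rand(200,5);
y = 1 + 2*sqrt(X(:,5)) + 0.5*X(:,1).*X(:,2);

% Two hand-written 'genes' standing in for evolved trees.
gene1 = @(X) sqrt(X(:,5));
gene2 = @(X) X(:,1).*X(:,2);

% Gene output matrix with a leading column of ones for the bias term.
G = [ones(size(X,1),1) gene1(X) gene2(X)];

% Least squares estimate of the weights [bias; w1; w2].
w = G \ y;

% Overall model prediction: a weighted sum of the gene outputs plus the bias.
ypred = G*w;
```

Each column of G is one partial model fragment and w holds its weighted contribution to the overall model.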





How many genes do I need to model my data in multigene regression?

This really depends on the data and your expectations of the resulting model. More genes usually result in a more "accurate" regression model, but the complexity of the model may be high. That said, as a very broad guideline, start with 3 or 4 genes (with a maximum depth of 4 or 5 nodes) and work upwards from there.

After a few runs you should begin to find the "sweet spot" that gives a decent trade off between model accuracy and complexity. As a side note: I would avoid having large ( > 7) maximum tree depths as this tends to encourage overfitting and bloated models. It also makes GPTIPS run more slowly.

Finally, too many genes can also lead to bloated models (this is "horizontal bloat" in contrast to the usual "vertical bloat" found in single tree GP models) which may not generalise well. GPTIPS 2 contains tools to explicitly identify horizontal bloat in multigene models.


Where can I find out more about genetic programming?

I recommend the excellent, free to download Field Guide to Genetic Programming.


Does GPTIPS do variable selection (feature selection)?

GPTIPS implicitly performs variable selection. The simulated evolutionary processes will "try" to pick the input variables that give the best overall performance. However, how well it does this is dependent on several factors, like how many inputs there are, what population size is used, how many generations GPTIPS is run for etc.


What are the system requirements for GPTIPS?

For GPTIPS 1 you will need MATLAB version 7.6 or higher. I have tried to make it as portable as possible across platforms and versions. However, I do not work for The Mathworks and I do not have access to the resources to ensure it works on everything.

GPTIPS 2 has been tested on 64 bit Windows (R2011b) and 64 bit Mac OSX (R2014a, R2014b and R2015a).

For certain GPTIPS 2 functions, namely drawtrees, gpmodelreport and paretoreport, you will need a web browser (preferably not Internet Explorer) and an internet connection. This is to allow the automatic download of certain web based JavaScript visualisation APIs. Your data is not sent to any servers, however; all processing is done locally in your browser.




What Mathworks toolboxes are required?

None for the core functions of GPTIPS but the Mathworks Symbolic Math Toolbox is very highly recommended. Amongst other things, it is required for model simplification and conversion to standalone M file format. The Mathworks Statistics Toolbox is also required for a small number of GPTIPS features (such as computing the statistical significance of model terms).

In GPTIPS 2 the Mathworks Parallel Computing Toolbox speeds up the execution time of runs significantly on multicore machines and is highly recommended for power users.


Does GPTIPS run on Octave? 

No. Currently there are too many limitations in Octave to make it worthwhile.


Who develops GPTIPS?

GPTIPS was written by Dominic Searson (me). I am a senior researcher in the School of Computing Science, Newcastle University, UK and I have been working on GP and its applications intermittently for about 15 years. My general interests are machine learning, scientific computing, data-driven modelling, data analysis/mining and data visualisation.  

If you would like to connect professionally, my LinkedIn profile is at https://www.linkedin.com/in/domsearson

GPTIPS is an ongoing, open source project and any contributions, suggestions etc. are welcome. Please email them to searson@gmail.com.