GPTIPS 


Genetic Programming and
Symbolic Data Mining Platform for MATLAB 

by

Dominic Searson

Chief Data Scientist

Synoptic Technologies Ltd., UK.

 


GPTIPS is a free symbolic data mining platform and interactive modelling environment for MATLAB. It enables you to:

  • Discover hidden, non-linear relationships in your data.

  • Use machine learning (genetic programming) to automatically create compact, accurate equations to predict the behaviour of physical systems.

  • Identify key predictive variables even when your data is noisy and highly correlated and there are a large number of superfluous input variables.

  • Build data-driven models when you don't know the 'true' underlying model structure.

  • Automatically generate a model portfolio containing models of different levels of complexity and predictive quality.

GPTIPS is widely used, both commercially and academically, across a diverse range of research and application areas. It has been shown to outperform other soft-computing and machine learning methods, such as neural networks and support vector machines, in many problem domains.

GPTIPS is particularly effective at automatically generating symbolic non-linear models of predictor-response (input/output) data.

GPTIPS is driven by multigene genetic programming (MGGP), which combines the flexibility of genetic programming, and its ability to capture non-linear behaviour, with the power of classical linear least squares parameter estimation. MGGP combines multiple GP trees to model data more effectively than standard GP.
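The least squares step can be illustrated with a minimal Python sketch. It is not GPTIPS code: the two "genes" are fixed by hand rather than evolved, the data are synthetic, and the names genes, design_matrix, solve_linear and fit_gene_weights are hypothetical. It only shows how the weight of each gene, plus a bias term, could be estimated by ordinary least squares once the gene outputs are known.

```python
import math

# Hypothetical "genes": in MGGP these would be evolved trees; here they are fixed by hand.
genes = [
    lambda x1, x2: math.tanh(x1 - x2),
    lambda x1, x2: x1 * x2,
]

def design_matrix(rows):
    """One row per data point: a bias column of 1s, then each gene's output."""
    return [[1.0] + [g(*r) for g in genes] for r in rows]

def solve_linear(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def fit_gene_weights(rows, y):
    """Least squares via the normal equations (G^T G) w = G^T y."""
    g = design_matrix(rows)
    k = len(g[0])
    gtg = [[sum(r[i] * r[j] for r in g) for j in range(k)] for i in range(k)]
    gty = [sum(r[i] * yi for r, yi in zip(g, y)) for i in range(k)]
    return solve_linear(gtg, gty)

# Synthetic, noise-free data generated from known weights:
# bias 3.0, gene weights 2.0 and -0.5.
rows = [(0.1 * i, 0.05 * i * i - 1.0) for i in range(20)]
y = [3.0 + 2.0 * math.tanh(a - b) - 0.5 * a * b for a, b in rows]
weights = fit_gene_weights(rows, y)  # [bias, weight of gene 1, weight of gene 2]
```

Because the model is linear in the gene weights, the fit is a single linear solve however non-linear the genes themselves are. The normal equations are used here only to keep the sketch dependency-free; a QR or SVD based solver is usually preferable numerically.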

GPTIPS promotes the creation of low-complexity models that capture non-linear behaviour but contain linearly separable terms. For instance:


y = 4.94 x2 - 1.08 x1 + 2.24 tanh(x3 - x1) - 57.8 tanh(x2 x3) + 0.041 x3 x4 - 0.053 x3^2 + 73.0


Unlike similar commercial products such as Eureqa and DataModeler, GPTIPS is completely open source, is written in standard MATLAB and has a pluggable architecture, so it is easy to write new functions to solve your own problems. It's also free.

GPTIPS is an enabling technology platform aimed at scientists, engineers and students. It was developed to make it easy to perform and understand symbolic data mining transparently, and to deploy the resulting models outside of GPTIPS and MATLAB. You do not have to be an expert in evolutionary computation, data mining or machine learning to use it effectively.


Why use GPTIPS?

No a priori assumptions of model structure

GPTIPS automatically evolves both the structure and the parameters of mathematical models using the supplied input variables and simple mathematical "building blocks" such as plus, minus, sqrt, log, exp -x, sin, cos, square, cube etc. to form one or more 'trees' representing the models. For instance, the simple model y = tanh(x3 - x1) can be represented as a tree with tanh at the root, a minus node as its single child, and the inputs x3 and x1 as leaves.



There are more than 30 building blocks in GPTIPS and it is very easy to add your own. Trees can be visualised in GPTIPS 2 using the drawtrees function (see the FAQ for details).
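To make the tree idea concrete, here is a small Python sketch of how such a model could be stored and evaluated. It is an illustration only, not how GPTIPS represents trees internally: the nested-tuple encoding, the FUNCTIONS table and the helpers evaluate and to_string are all hypothetical.

```python
import math

# Hypothetical building-block table: name -> (arity, implementation).
FUNCTIONS = {
    "plus": (2, lambda a, b: a + b),
    "minus": (2, lambda a, b: a - b),
    "times": (2, lambda a, b: a * b),
    "tanh": (1, math.tanh),
}

def evaluate(node, inputs):
    """Recursively evaluate a tree; leaves are input names or numeric constants."""
    if isinstance(node, str):
        return inputs[node]
    if isinstance(node, (int, float)):
        return float(node)
    name, *children = node
    arity, func = FUNCTIONS[name]
    if len(children) != arity:
        raise ValueError(f"{name} expects {arity} argument(s)")
    return func(*(evaluate(child, inputs) for child in children))

def to_string(node):
    """Render a tree in prefix notation, e.g. tanh(minus(x3,x1))."""
    if not isinstance(node, tuple):
        return str(node)
    name, *children = node
    return f"{name}({','.join(to_string(c) for c in children)})"

# y = tanh(x3 - x1): tanh at the root, minus beneath it, x3 and x1 as leaves.
tree = ("tanh", ("minus", "x3", "x1"))
value = evaluate(tree, {"x1": 0.5, "x3": 2.0})
```

In this sketch, adding a new building block amounts to adding one entry to the FUNCTIONS table, which mirrors the pluggable idea in spirit only.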


Extremely deployable, compact, portable models - what you see is what you get

GPTIPS models are simple mathematical equations. Unlike many soft-computing models - such as neural networks - no specialised modelling software environment is required to deploy the trained models. They can be easily and rapidly implemented in any modern computing language by a non-modelling expert.  
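As an illustration of how little is needed to deploy such a model, the example equation given earlier on this page can be written as a plain Python function. This is a hand-written sketch, not output from any GPTIPS export function; the name predict is hypothetical, and it assumes the final x3 term in that equation is x3 squared.

```python
import math

def predict(x1, x2, x3, x4):
    """Evaluate the example model; assumes the last x3 term is x3 squared."""
    return (4.94 * x2 - 1.08 * x1
            + 2.24 * math.tanh(x3 - x1)
            - 57.8 * math.tanh(x2 * x3)
            + 0.041 * x3 * x4
            - 0.053 * x3 ** 2
            + 73.0)

# With all inputs at zero, only the constant term remains.
baseline = predict(0.0, 0.0, 0.0, 0.0)
```

The only dependency is a tanh function, which is available in the standard maths library of most languages.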

GPTIPS considers both the model performance (i.e. goodness of fit) and model complexity in an attempt to evolve models that perform well but are not overly complex. The trade-off surface of models ('Pareto front') represents models that are not beaten by any other model in both performance and complexity. These are usually the models of most interest. An example of a typical Pareto front is shown as green circles below. The blue circles represent models not on the Pareto front.

Pareto front in a population of models
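The idea of a model "not beaten in both performance and complexity" can be sketched in a few lines of Python. This is a generic non-dominated filter, not the GPTIPS implementation. It assumes each model is summarised by a complexity score and an error value, with lower being better for both, and the population below is made up for illustration.

```python
def pareto_front(models):
    """Return the models that no other model dominates.

    Each model is a (name, complexity, error) tuple; lower is better for both.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both criteria and better on at least one.
        no_worse = a[1] <= b[1] and a[2] <= b[2]
        strictly_better = a[1] < b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [m for m in models if not any(dominates(o, m) for o in models)]

# Made-up population, for illustration only.
population = [
    ("A", 5, 0.90),
    ("B", 12, 0.40),
    ("C", 14, 0.55),  # dominated by B: more complex and less accurate
    ("D", 30, 0.10),
    ("E", 33, 0.25),  # dominated by D
]
front = pareto_front(population)
```

Here models A, B and D form the front: each is either simpler or more accurate than every other model. The pairwise comparison is quadratic in the population size, which is fine for a sketch.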

In GPTIPS 2, it is easy to select the model you want from the Pareto front using an HTML report generated by the paretoreport function. 

Pareto front models can be sorted by complexity or goodness of fit by clicking on the appropriate column header.



It is easy to export selected GPTIPS models. A variety of functions do this, including gpmodel2mfile, gpmodel2sym and gpmodel2func.


Automatic variable selection

GPTIPS automatically selects the input variables ('features') that best predict the output variable of interest. It can also be used as a variable selection method in its own right, to select variables or non-linear combinations of variables as inputs for any other modelling method.

GPTIPS has been shown to be effective at variable selection even when there are more than 1000 irrelevant input variables.


Detailed HTML model reports

In GPTIPS 2, for any multigene symbolic regression model, you can quickly generate a detailed, interactive HTML model report using the gpmodelreport function (see FAQ for details). 

The report includes details of the run configuration, the tree structures of the model, the performance of the model on the data and more.

A sample of a typical report is shown below.



Updated: 14th August 2017


© Dominic Searson 2009 - 2017

