## What is GPTIPS?

GPTIPS is a free, open source MATLAB toolbox. The name is an acronym for Genetic Programming Toolbox for the Identification of Physical Systems. GPTIPS is a generic genetic programming toolbox for MATLAB, but its main intention is to perform symbolic data mining: that is, to allow you to automatically discover empirical symbolic non-linear models from data. These are typically of the form y = f(X), where y is an output/response variable (the thing you are trying to predict) and X is the set of input/predictor variables.

## What is symbolic data mining?

Symbolic data mining is the process of extracting hidden, meaningful relationships from data in the form of symbolic equations. The structure of these equations often gives new insight into the physical systems or processes that generated the data. This is in stark contrast to existing data mining methods.

For instance, in classical linear and non-linear regression, a pre-determined model structure is assumed (by you) and the problem is then to find the 'optimal' parameters of the model that minimise some prediction error metric of y (typically the sum of squared errors, SSE) over a set of known y and X data values (this known data set is usually called the 'training' data). Once these 'optimal' parameters have been found, you have a model which can (you hope) predict unknown y values for new X values. Consider a typical linear regression scenario: there is an output (response) variable y and N input (predictor) variables X, and the model structure is fixed in advance as a weighted sum of those inputs.

However, in symbolic data mining a model structure is not explicitly assumed, and an algorithm is used to find both the structure and the parameters of a non-linear model that minimises the chosen error metric. In GPTIPS, a genetic programming algorithm is used to do this. You don't specify the functional form of the evolved models; you only specify what 'building blocks' the model can be constructed from, e.g. plus, minus, times, cos, tanh etc.
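For comparison with the structure search described above, the classical fixed-structure case can be sketched in a few lines of base MATLAB. This is only an illustration of ordinary least squares on synthetic, noise-free data; the variable names and the 'true' parameter values are made up for the example and nothing here uses GPTIPS.

```matlab
% Classical linear regression: the structure y = b0 + b1*x1 + b2*x2 + b3*x3
% is fixed by the modeller; only the parameters b are estimated.
rng(0);                           % reproducible synthetic data (illustrative only)
X = rand(100,3);                  % 100 observations of N = 3 inputs
btrue = [3; 2; -1; 0.5];          % hypothetical 'true' parameters [b0; b1; b2; b3]
y = [ones(100,1) X]*btrue;        % noise-free training outputs

A = [ones(size(X,1),1) X];        % design matrix with a bias column
b = A\y;                          % least squares: minimises the SSE over the training data
sse = sum((y - A*b).^2);          % sum of squared errors on the training data

Xnew = rand(5,3);                 % new, unseen inputs
ynew = [ones(5,1) Xnew]*b;        % predictions for the new inputs
```

In symbolic data mining the first step, choosing the structure, is handed to the algorithm instead of being fixed by hand as it is here.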
The advantage of symbolic data mining is that you build non-linear models without having to know beforehand what the model structure should be. Another advantage is that collinearity and correlation of the inputs (which can cause severe problems in many linear regression methods) are not generally problematic in symbolic data mining.

## How do I run GPTIPS?

GPTIPS is run from the MATLAB command line using a config file to specify the run settings and to load or generate user data. The config file is a standard MATLAB M file. For instance, in GPTIPS 2 a config file could be as simple as

    function gp = myconfig(gp)
    gp.userdata.xtrain = rand(100,10);
    gp.userdata.ytrain = rand(100,1);
    gp.nodes.functions.name = {'plus','minus','times','rdivide','cube','sqrt','abs'};

This example randomly generates input (X) and output (y) data and specifies that the trees should be built using plus, minus, times, divide (unprotected), cube, square root and abs nodes. Other nodes typically used in symbolic regression are square, sin, cos, exp, power, add3, mult3, log, negexp and neg.

All other settings use the GPTIPS defaults, including the fitness function (regressmulti_fitfun.m) which performs 'multigene' symbolic model discovery. You can override any default setting by adding an appropriate line to your config file. For example, the default maximum number of genes is 4; to use 8 genes instead, add the following line

    gp.genes.max_genes = 8;

To run GPTIPS, call the rungp function from the command line with a function handle to your config file as a parameter, e.g.

    >> gp = rungp(@myconfig);

When the run is complete, the resulting gp data structure can be visualised and processed using a variety of GPTIPS functions. For example, to graphically browse the population in terms of goodness of fit and model complexity, use the popbrowser function as follows

    >> popbrowser(gp);

When using popbrowser, it's often a good idea to also generate an HTML report of the model equations that form the Pareto front (i.e. the green dots on the popbrowser figure window) using the paretoreport function. In this report the equations can be sorted by model complexity and model performance by clicking on the appropriate header. To generate the report use:

    >> paretoreport(gp)
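As a slightly fuller illustration of the workflow above, here is a sketch of a config file that uses synthetic data with a known relationship, so that the evolved models have something meaningful to find. The file name `sketch_config`, the data and the target function are hypothetical. The fields `gp.runcontrol.pop_size` and `gp.runcontrol.num_gen` are not mentioned in this FAQ; they are assumed from the GPTIPS 2 defaults, so check them against your installed version.

```matlab
function gp = sketch_config(gp)
%SKETCH_CONFIG Hypothetical GPTIPS 2 config file (illustrative name and data).
%   Usage, assuming GPTIPS 2 is on the MATLAB path:
%       gp = rungp(@sketch_config);
%       popbrowser(gp);
%       paretoreport(gp);

% Run control. These two field names are assumptions based on the GPTIPS 2
% defaults; check them against your installed version.
gp.runcontrol.pop_size = 250;   % individuals per generation
gp.runcontrol.num_gen = 100;    % number of generations

% Synthetic training data with a known non-linear relationship.
x = 10*rand(200,2);                               % two inputs in [0,10]
gp.userdata.xtrain = x;
gp.userdata.ytrain = x(:,1).^2 + 3*sqrt(x(:,2));  % y = x1^2 + 3*sqrt(x2)

% Building blocks and model size.
gp.nodes.functions.name = {'plus','minus','times','sqrt'};
gp.genes.max_genes = 4;         % the default, stated explicitly
```

Because the target here can be written with the chosen building blocks, the Pareto report is a convenient place to see whether a compact model close to the generating equation was found.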
In addition to the demos provided with GPTIPS 2 (gpdemo1, gpdemo2, gpdemo3, gpdemo4) the following example config files are also included for reference purposes.

- cubic_config - Multigene regression on data from a cubic polynomial (3.4 x ...).
- uball_config - Multigene regression on data from the 5 dimensional Unwrapped Ball function.
- ripple_config - Multigene regression on data from a mathematical function f(x ...).
- salustowicz1d_config - Multigene regression on data from a mathematical function f(x) of a single input variable, f(x) = exp(-x) x ...

For example, to run cubic_config use

    >> gp = rungp(@cubic_config)

## How do I run GPTIPS 2 in parallel mode?

Add the following line to your config file:

    gp.runcontrol.parallel.auto = true;

You must have the Parallel Computing Toolbox installed and licensed for this to work.
- The first time you run it in a session there is a short delay whilst the parallel mode initialises.
- GPTIPS will autodetect the number of cores you have on your machine.
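If the same config file is shared between machines with and without the Parallel Computing Toolbox, one option is to guard the setting. The fragment below is a sketch of that idea, not something GPTIPS requires: it uses MATLAB's `license('test', ...)` call with the feature name `Distrib_Computing_Toolbox`, which only reports whether a license exists and does not prove the toolbox is installed. The variable name `hasPct` is made up for the example.

```matlab
% Inside a config function such as myconfig(gp):
% request parallel mode only when a Parallel Computing Toolbox license
% appears to be available; otherwise leave the run in serial mode.
hasPct = logical(license('test', 'Distrib_Computing_Toolbox'));
gp.runcontrol.parallel.auto = hasPct;
```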
## I'm getting weird Java errors when running GPTIPS 2 in parallel mode. Why is this happening?

There is a known issue with the JVM in versions of MATLAB (all platforms) prior to R2013b (Parallel Computing Toolbox 6.3). This causes a failure of the Parallel Computing Toolbox in most cases. There is a fix/workaround for this here: http://www.mathworks.com/support/bugreports/919688 Please apply this fix if you are using a version prior to R2013b.

## How do I run GPTIPS 2 for a fixed amount of time?

To perform a GPTIPS run that terminates after a set amount of time (in seconds) add the following line to your config file:

    gp.runcontrol.timeout = 60;

In this case the run terminates after 60 seconds, regardless of how many generations it was set to run for.

## Can I export GPTIPS multigene regression models as standalone M files for use outside GPTIPS?

Yes, the GPTIPS utility function gpmodel2mfile was specifically written for this purpose. It requires the MathWorks Symbolic Math Toolbox to create the standalone model file, but this toolbox is not required to run the standalone model file.

For example, to convert the 'best' (as evaluated on the training data) symbolic model in a population to a standalone M file use

    >> gpmodel2mfile(gp,'best','mymodel');

This writes the model to the file mymodel.m. You can then run the model on a new data input matrix x as follows:

    >> yprediction = mymodel(x);

Additionally, if you want the vector of model predictions for (say) your training data you can use:

    >> yprediction_train = mymodel(gp.userdata.xtrain);

## What GP selection methods does GPTIPS support?

In GPTIPS 1 only tournament selection is implemented (with the option of lexicographic parsimony pressure enabled). GPTIPS 2 also supports Pareto tournaments based on fitness and tree complexity. For instance, to use only Pareto tournaments add the following line to your config file

    gp.selection.tournament.p_pareto = 1;

To make a quarter of all selection events Pareto tournaments use the following (the remaining 3/4 will be regular tournaments based only on fitness).

    gp.selection.tournament.p_pareto = 0.25;

- In GPTIPS 2 lexicographic selection is enabled by default for regular tournament selection.
- 'Selection' refers to the process of selecting individuals in the current population (based on their fitness and complexity) to create new individuals in the next generation of individuals.
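To illustrate what a Pareto tournament does in principle, the following base MATLAB sketch selects one individual from a random tournament by keeping only the entrants that are not dominated on fitness and complexity. It shows the general technique and is not GPTIPS's own selection code; the variable names, the data and the lower-is-better convention for both criteria are assumptions made for the example.

```matlab
% Illustrative population: lower is better for both criteria (assumed convention).
fitness    = [0.10; 0.50; 0.90; 0.30];   % e.g. prediction error of each individual
complexity = [12;   5;    3;    20];     % e.g. a tree complexity measure
tourSize   = 3;                          % number of entrants per tournament

% Draw the tournament entrants at random (with replacement).
entrants = randi(numel(fitness), tourSize, 1);
f = fitness(entrants);
c = complexity(entrants);

% An entrant is dominated if another entrant is no worse on both criteria
% and strictly better on at least one.
nondominated = true(tourSize, 1);
for i = 1:tourSize
    dominates_i = (f <= f(i)) & (c <= c(i)) & ((f < f(i)) | (c < c(i)));
    nondominated(i) = ~any(dominates_i);
end

% Pick the winner at random from the non-dominated entrants.
candidates = entrants(nondominated);
winner = candidates(randi(numel(candidates)));
```

In this toy population individual 4 is both less accurate and more complex than individual 1, so it can only win a tournament in which individual 1 does not take part. A regular tournament would instead compare entrants on fitness alone.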
## How do I visualise the trees in a GPTIPS model?

In GPTIPS 2, the trees in any multigene regression model can be drawn to an HTML file using the drawtrees function, e.g.

    >> drawtrees(gp,'best')

draws the best model in the population (as evaluated on the training data).

    >> drawtrees(gp,'valbest')

draws the best model in the population (as evaluated on the validation data, if it exists).

    >> drawtrees(gp,'testbest')

draws the best model in the population (as evaluated on the test data, if it exists).

    >> drawtrees(gp,5)

draws the model in the population with numerical ID 5, where the ID is an integer that can range from 1 to the population size.

You can control the formatting of the drawn trees (colour, line width, font etc.) using additional CSS arguments to the drawtrees function. For instance, to change the font to 'Comic Sans MS' use:

    >> drawtrees(gp,'best',[],'Comic Sans MS')

For further advanced formatting see the help for the drawtrees function:

    >> help drawtrees

- You need an internet connection for this function as it uses the Google Charts JavaScript API.
- Internet Explorer does not render the trees well - so use another browser for best results.
## Are GPTIPS multigene regression models as good as neural net models?

It depends on (a) what you mean by good, (b) the problem at hand and (c) what you want the model for. GPTIPS sometimes (usually when only a few input variables are involved) lags behind a neural net model (e.g. a feedforward neural network with a single hidden layer) in terms of raw predictive performance, but the equivalent GP models are often simpler, shorter and may be open to physical interpretation. It's not always an easy question to answer. To put it another way: is the model y = 3X ...

GPTIPS generates models that are intended to be understood by humans, but neural networks, whilst very powerful when trained correctly, are not. At any rate, on typical practical applications GPTIPS normally significantly outperforms feedforward neural net models in terms of model performance, interpretability, deployability and robustness.

## Is GPTIPS better than PLS (partial least squares) regression?

Again, it depends on context but: yes, GPTIPS is better. In my opinion, PLS regression is an overused and overvalued analytical tool. PLS regression is also hard to understand (well, at least I find it hard to understand) and the models it produces are complex, fragile, hard to deploy and difficult to interpret.

## What license is GPTIPS distributed under?

GPTIPS is 'free' subject to the GNU General Public License (GPL) v3, which can be viewed here: http://www.gnu.org/licenses/gpl-3.0.html.

## How do I cite GPTIPS?

If you use GPTIPS in any published work then please use the following citations.

- Searson, D.P. GPTIPS 2: an open-source software platform for symbolic data mining. Chapter 22 in Handbook of Genetic Programming Applications, A.H. Gandomi et al. (Eds.), Springer, New York, NY, 2015 (in press).
- Searson, D.P., Leahy, D.E. & Willis, M.J. GPTIPS: an open source genetic programming toolbox for multigene symbolic regression. Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS 2010), Hong Kong, 17-19 March, 2010.

I work in an academic environment and citations are important for my career. :)

## How are multigene regression models represented in GPTIPS?

Like most GP implementations, each symbolic model is represented as one or more trees. E.g.
y = √X

Internally, each GP tree is represented in GPTIPS by a compact coded string. Tree structures are used in GP because they facilitate the process of simulated evolution: populations of better trees are created from existing ones using tree mutation and tree crossover operations.

GPTIPS uses multigene genetic programming (MGGP). This allows the use of multigene (multiple tree) models, which usually perform much better than standard GP models. In MGGP regression, each overall model is a linear combination of one or more trees. Each tree can be thought of as a partial model fragment which has a weighted contribution to the overall model. The tree encoding is referred to as the 'genotype' and the decoded (and simplified) model structure is referred to as the 'phenotype'.

## How many genes do I need to model my data in multigene regression?

This really depends on the data and your expectations of the resulting model. More genes usually result in a more "accurate" regression model, but the complexity of the model may be high. That said, as a very broad guideline, start with 3 or 4 genes (with a maximum depth of 4 or 5 nodes) and work upwards from there. After a few runs you should begin to find the "sweet spot" that gives a decent trade-off between model accuracy and complexity.

As a side note: I would avoid large (> 7) maximum tree depths, as this tends to encourage overfitting and bloated models. It also makes GPTIPS run more slowly. Finally, too many genes can also lead to bloated models (this is "horizontal bloat", in contrast to the usual "vertical bloat" found in single tree GP models) which may not generalise well. GPTIPS 2 contains tools to explicitly identify horizontal bloat in multigene models.

## Where can I find out more about genetic programming?

I recommend the excellent, free to download Field Guide to Genetic Programming.

## Does GPTIPS do variable selection (feature selection)?

GPTIPS implicitly performs variable selection. The simulated evolutionary processes will "try" to pick the input variables that give the best overall performance. However, how well it does this depends on several factors, such as how many inputs there are, what population size is used and how many generations GPTIPS is run for.

## What are the system requirements for GPTIPS?

For GPTIPS 1 you will need MATLAB version 7.6 or higher. I have tried to make it as portable as possible across platforms and versions. However, I do not work for The MathWorks and I do not have access to the resources to ensure it works on everything.

GPTIPS 2 has been tested on 64 bit Windows (R2011b) and 64 bit Mac OS X (R2014a, R2014b and R2015a).

For certain GPTIPS 2 functions, namely drawtrees, gpmodelreport and paretoreport, you will need a web browser (preferably not Internet Explorer) and an internet connection. This is to allow the automatic download of certain web based JavaScript visualisation APIs. Your data is not sent to any servers, however: all processing is done locally in your browser.

## What MathWorks toolboxes are required?

None for the core functions of GPTIPS, but the MathWorks Symbolic Math Toolbox is required by some utility functions (e.g. gpmodel2mfile, to create standalone model files) and the Parallel Computing Toolbox is required for parallel mode.

## Does GPTIPS run on Octave?

No. Currently there are too many limitations in Octave to make it worthwhile.

## Who develops GPTIPS?

GPTIPS was written by Dominic Searson (me). I am a senior researcher in the School of Computing Science, Newcastle University, UK and I have been working on GP and its applications intermittently for about 15 years. My general interests are machine learning, scientific computing, data-driven modelling, data analysis/mining and data visualisation. If you would like to connect professionally, my LinkedIn profile is at https://www.linkedin.com/in/domsearson

GPTIPS is an ongoing, open source project and any contributions, suggestions etc. are welcome. Please email them to searson@gmail.com.