G-REX is an open source programming framework for predictive modeling using genetic programming, but it also has a GUI that facilitates access to predefined functionality. The GUI consists of two main parts: a left pane containing all relevant settings and a right desktop pane where windows monitoring the evolution and the results are shown during execution. The settings pane is itself divided into two parts: the upper pane contains settings concerning the dataset and the representation, while the tabbed pane below contains settings regarding the evolution process, benchmarking, how the results should be reported and how larger experiments should be performed. Each of these parts is described in detail below.

Settings for dataset and representation

The basic settings of G-REX are always available in the top part of the settings pane of the G-REX GUI, see Figure 45 above. Here, a dataset can be selected using the topmost combo box, which contains all datasets present in the Dataset folder in G-REX's root folder. Files that reside in other locations may be selected using the Select Data button. A BNF for a specific representation can be selected in the same way using the button and combo box in the following row. In the standard G-REX package the following representations are present in G-REX's BNF folder and will hence be available in the combo box:

  • Boolean Tree – e.g. see Figure 33
  • Decision Tree – e.g. see Figure 32
  • Decision Tree with Boolean conditions – e.g. see Figure 35
  • Decision List – e.g. see Figure 36
  • Decision List with Boolean conditions
  • Fuzzy rules – e.g. see (Eggermont 2002)
  • Hybrid k-NN – e.g. see Figure 39

The third row contains two more combo boxes, which are used to select which parts of the dataset should be used as validation set and test set. NONE specifies that no validation set should be used and is the default, since G-REX's standard fitness functions do not utilize validation sets. If n-fold cross validation is used, ALL signifies that G-REX should evolve one program for each fold. Finally, there are also two buttons which open internal windows for manipulation of the dataset, see Figure 46, and the chosen representation, see Figure 47.

The last row contains two checkboxes which decide whether a predefined partitioning of the dataset should be used if available, or whether G-REX should be responsible for creating the partitions. To use a predefined partitioning, the last column of the dataset must specify the role of each instance using TRAIN, VAL and TEST.

Finally, the Start button starts the evolution using the selected dataset, BNF and the settings defined in the tabbed settings pane. The pdf icon present to the left of the Start button is used throughout the GUI to link to related research papers. Clicking on a pdf icon will open one of the articles residing in the Publications folder in G-REX's root folder in an external browser.

The lower part of G-REX's data manipulation window consists of a tabbed pane with one tab, Data, that presents the data and another, About File, that shows any comments included in the file. The upper pane divides the functionality into four distinct groups, where the first contains a button for reloading the dataset and another for stratifying the data. Stratification first sorts the instances based on the target variable and then randomizes (based on the fold seed) the order of instances with the same target value. Stratification is mainly used in combination with n-fold cross validation, which is specified by setting the number of folds and clicking the Create folds button. Folds are then created by simply assigning every nth instance to a specific fold. A simple holdout sample can also be created by pressing Train/Test, which assigns every nth instance to the test set and the remaining to the training set. Finally, it is possible to save the folds into n separate files.
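The stratified fold creation described above can be sketched as follows. This is a minimal illustration, not G-REX's actual code; the seeded shuffle before a stable sort is an assumed way of randomizing the order of instances that share the same target value.

```java
import java.util.*;

class FoldAssigner {
    // Stratified n-fold assignment: sort instances by target value,
    // randomize ties with the fold seed, then deal every nth instance
    // to the same fold so each fold gets a similar class mix.
    static int[] createFolds(double[] targets, int nFolds, long foldSeed) {
        Integer[] order = new Integer[targets.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // A seeded shuffle followed by a stable sort on the target value
        // randomizes only the order of instances with equal targets.
        Collections.shuffle(Arrays.asList(order), new Random(foldSeed));
        Arrays.sort(order, Comparator.comparingDouble(i -> targets[i]));
        int[] fold = new int[targets.length];
        for (int pos = 0; pos < order.length; pos++) {
            fold[order[pos]] = pos % nFolds;  // every nth instance -> same fold
        }
        return fold;
    }
}
```

With six instances of two classes and three folds, each fold receives exactly one instance of each class.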

A holdout set can also be created by marking instances using the mouse and then pressing Set Train, Set Val or Set Test to assign the corresponding role to a certain part of the dataset.

Finally, the dataset can be automatically fuzzified, as described by Eggermont (2002), by finding k medoids using the k-means clustering algorithm and then creating k triangular membership functions with the medoids as centers. When programs are evolved using the fuzzy rules BNF, the dataset is automatically fuzzified using this functionality.
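A triangular membership function of the kind described above can be sketched as follows, assuming the cluster centers have already been found and sorted. The open shoulders at the outermost centers are an assumption; the exact shape used by G-REX is not specified here.

```java
class Fuzzifier {
    // Degree of membership of value x in the k-th triangular function.
    // The function peaks (1.0) at centers[k] and falls linearly to zero
    // at the neighboring centers; the first and last functions are
    // open-ended, staying at 1.0 beyond their center.
    static double membership(double x, double[] centers, int k) {
        double c = centers[k];
        if (x == c) return 1.0;
        if (x < c) {
            if (k == 0) return 1.0;               // open left shoulder
            double left = centers[k - 1];
            return x <= left ? 0.0 : (x - left) / (c - left);
        } else {
            if (k == centers.length - 1) return 1.0; // open right shoulder
            double right = centers[k + 1];
            return x >= right ? 0.0 : (right - x) / (right - c);
        }
    }
}
```

For centers {0, 5, 10}, a value of 2.5 belongs to the middle function with degree 0.5 and any value below 0 belongs fully to the first function.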

The Edit BNF button in the basic settings opens a simple text editor which can be used to modify the currently selected BNF; see section 7.8 for more details on how to design a BNF.

Settings for Evolution

Settings related to the evolution and the reporting of the results are available in the tabbed pane below the basic settings. The first tab, Auto, selected in Figure 45, provides a simple way of obtaining reasonable settings for users who are not familiar with GP. Two sliders are used to control how thorough the search should be, i.e. fast to extensive, and whether accuracy or comprehensibility should be favored during evolution. The sliders affect the settings in the GP tab to make their effect transparent.

Basic GP Settings

The GP tab, shown in Figure 48, contains all settings related to the evolution of programs for the BNF, dataset and evaluation scheme selected in the basic settings.


The settings are divided into three groups: GP settings contain typical GP settings, Simplification settings regard the removal of introns, and Run settings concern general settings for the execution.

GP settings

In addition to the typical settings that could be expected of a GP framework, several predefined fitness functions are available for both classification and regression tasks:

Fitness functions for classification:

  • ACC_Fitness – optimizing accuracy as defined in equation 36.
  • AUC_Fitness – optimizing the area under the ROC curve as defined in equation 44.
  • BRI_Fitness – optimizing the Brier score as defined in equation 52.
  • BRE_Fitness – optimizing brevity as defined in equation 45.

Fitness functions for regression:

  • MAE_Fitness – optimizing the mean absolute error according to equation 49.
  • RMSE_Fitness – optimizing the root mean square error according to equation 49.
  • MAPE_Fitness – optimizing the mean absolute percentage error according to equation 48.
  • MUAPE_Fitness – optimizing the mean unbiased absolute percentage error according to equation 48.
  • CORR_Fitness – optimizing the Pearson correlation, i.e. (1-r).

Note that all fitness functions are minimized, which means that the fitness relates to the error of the optimized performance metric. Even if these fitness functions should be sufficient for most applications, it is simple to create new ones by extending the FitnessFunction class and simply overriding the calcPredictionError method.
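A custom fitness function could look something like the sketch below. The exact signature of calcPredictionError and the FitnessFunction base class shown here are assumptions for illustration, not taken from the G-REX source; only the extension mechanism itself is described above.

```java
// Stand-in for G-REX's FitnessFunction base class (assumed API).
abstract class FitnessFunction {
    public abstract double calcPredictionError(double[] actual, double[] predicted);
}

// Hypothetical custom fitness: the median absolute error of the
// predictions. Lower is better, matching G-REX's minimized fitness.
class MedianAEFitness extends FitnessFunction {
    @Override
    public double calcPredictionError(double[] actual, double[] predicted) {
        double[] errs = new double[actual.length];
        for (int i = 0; i < errs.length; i++)
            errs[i] = Math.abs(actual[i] - predicted[i]);
        java.util.Arrays.sort(errs);
        int n = errs.length;
        return (n % 2 == 1) ? errs[n / 2]
                            : (errs[n / 2 - 1] + errs[n / 2]) / 2.0;
    }
}
```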

It may also be noted that G-REX supports both tournament and roulette wheel selection, and that #Tourn. Mem. defines the number of programs that should be selected in each tournament when using tournament selection. Roulette wheel selection is used if the number of tournament members is set to zero.
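The two selection schemes can be sketched as follows. Since G-REX fitness is minimized, the roulette wheel here weights each program by its distance from the worst fitness; that inversion is an assumption about how the wheel handles minimized fitness, not a description of G-REX's internals.

```java
import java.util.Random;

class Selection {
    // Tournament selection: draw tournSize random programs and keep
    // the one with the lowest (best) fitness.
    static int tournament(double[] fitness, int tournSize, Random rnd) {
        int best = rnd.nextInt(fitness.length);
        for (int i = 1; i < tournSize; i++) {
            int cand = rnd.nextInt(fitness.length);
            if (fitness[cand] < fitness[best]) best = cand;
        }
        return best;
    }

    // Roulette wheel for minimized fitness: weight each program by
    // (worst - fitness) so a lower error gives a larger slice.
    static int roulette(double[] fitness, Random rnd) {
        double worst = 0, total = 0;
        for (double f : fitness) worst = Math.max(worst, f);
        for (double f : fitness) total += (worst - f);
        if (total == 0) return rnd.nextInt(fitness.length); // all equal
        double spin = rnd.nextDouble() * total, acc = 0;
        for (int i = 0; i < fitness.length; i++) {
            acc += (worst - fitness[i]);
            if (spin < acc) return i;
        }
        return fitness.length - 1;
    }
}
```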

Parsimony P. sets the parsimony pressure as defined in equation 36, with the addition that a maximum size can be set, i.e. Max Size. Since all G-REX fitness functions should be normalized to the range of zero to one, a reasonable parsimony pressure lies between 0.1 and 0.001, with a typical pressure of 0.01. Programs which become larger than Max Size receive an additional large parsimony pressure, which all but guarantees that they are not selected for the next generation.
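The combined effect can be sketched as below. The additive form (error plus pressure times size) and the magnitude of the over-size penalty are assumptions in the spirit of equation 36, not the exact G-REX formula.

```java
class Parsimony {
    // Assumed parsimonious fitness: normalized error plus a pressure
    // term proportional to program size, with a large extra penalty
    // once the program exceeds Max Size (fitness is minimized).
    static double fitness(double error, int size, double pressure, int maxSize) {
        double f = error + pressure * size;
        if (size > maxSize) f += 1000.0; // effectively excludes the program
        return f;
    }
}
```

With the typical pressure of 0.01, a program of size 10 with error 0.2 gets fitness 0.3, while the same program grown past Max Size is pushed far beyond any competitor.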

Finally, the Laplace check box decides whether probabilities should be calculated as the relative frequency of the class values of the training instances in a leaf, or whether this estimate should be corrected using the Laplace estimate.

Simplification settings

The next group of settings regards the removal of introns, i.e. simplification, of the final program of an evolution. There are two types of simplification, which can be used together or individually. Deep Simplification is an implementation of the technique suggested in section 6.1, which performs a second evolution with the predictions of the final program as target. Deep Simplification is applicable to arbitrary representations and can remove both unused and unnecessary nodes.

Fast Simplification is a fast way to remove unused nodes, i.e. nodes not reached by any training instances, from a program by recursively replacing all nodes that are parents of an unused leaf node with their other child node. Even if this technique is fast and effective, it is only applicable to BNFs based on the If node and cannot remove nodes that are used but unnecessary for the final prediction. It can be noted that the results reported in section 6.1 only use Deep Simplification, even if the techniques can be used together.
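On a minimal binary If-tree, the recursive replacement described above can be sketched as follows; the Node class here is a hypothetical stand-in for G-REX's program representation.

```java
class FastSimplifier {
    // Minimal binary tree: an If node has two children, a leaf has none.
    // 'used' marks whether any training instance reaches the node.
    static class Node {
        Node left, right;
        boolean used = true;
        Node() {}
        Node(Node l, Node r) { left = l; right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Recursively replace every If node that is parent to an unused
    // leaf with its other child, removing unreached branches.
    static Node simplify(Node node) {
        if (node == null || node.isLeaf()) return node;
        node.left = simplify(node.left);
        node.right = simplify(node.right);
        if (node.left.isLeaf() && !node.left.used) return node.right;
        if (node.right.isLeaf() && !node.right.used) return node.left;
        return node;
    }
}
```

An If node whose condition sends no training instance to one branch thus collapses into the other branch, shrinking the tree without changing any prediction on the training data.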

Run settings

The final group relates to settings for how a G-REX run should be performed. Batch Size decides how many evolutions should be performed for each fold. Each batch creates a new randomly initialized population, and if more than one batch is used, the program with the best training fitness is reported as the final program. More than one batch is normally used to reduce the risk that the evolution gets stuck in a local minimum. Persistence is a stopping criterion that stops the evolution if no improvement in fitness has been achieved in the specified number of generations. If set to zero, all generations specified in the GP settings will be performed. Finally, Save result to file decides whether the results of a run should be saved to files or just displayed in the GUI.
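The persistence criterion can be sketched as a small helper tracked across generations; this is an illustration of the behavior described above, not G-REX's implementation.

```java
class PersistenceStop {
    // Stop when the best fitness has not improved (fitness is minimized)
    // for 'persistence' consecutive generations; a persistence of zero
    // disables the criterion so all generations run.
    private final int persistence;
    private double best = Double.POSITIVE_INFINITY;
    private int stale = 0;

    PersistenceStop(int persistence) { this.persistence = persistence; }

    // Called once per generation; returns true if evolution should stop.
    boolean update(double generationBest) {
        if (generationBest < best) {
            best = generationBest;
            stale = 0;
        } else {
            stale++;
        }
        return persistence > 0 && stale >= persistence;
    }
}
```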