Neural Network Results in More Detail (2004)

Various plots and results using the networks from 2003 are given. Much of this material is additional to what is in the paper, e.g. because of plot size and the amount of discussion. In particular, biases in the assigned types are discussed, as these are important if the types are to be used for large-scale structure work. The plots are approximately in the order in which I first made them.

Terminology used:

run = one set of initial weights, randomised between -1 and 1, trained to a minimum in the error space

network = a particular network architecture, e.g. 22:8:1 (22 input parameters, an 8-neuron hidden layer and a 1-neuron output layer)

residual = network type minus target type, or one run minus another

correlation = correlation coefficient of network type versus target type

rms = root mean square difference between network type and target type
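For reference, these quantities are straightforward to compute from the network and target types; a minimal MATLAB sketch (variable names are illustrative):

    % netType, targetType: 1 x N vectors of network-assigned and target types
    residual = netType - targetType;     % residual as defined above
    c = corrcoef(netType, targetType);   % 2 x 2 correlation matrix
    correlation = c(1, 2);               % off-diagonal element is the coefficient
    rms = sqrt(mean(residual.^2));       % root mean square difference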

Many of the plots can be made for numerous target types and architectures. Here the 22:8:1 network with eClass target is often shown.

    • Input parameters versus target type and network type for training and test sets - it is useful to see what is being fed into the network. Each parameter versus target type is shown for the training set for eClass, eyeball type, redshift and star formation rate. The test sets look virtually identical in each case. The parameters are those listed on the Galaxy Types from Artificial Neural Networks page, and are numbered across then down on the figures. These plots make it clear why the star formation rate does not do very well.

    • Standard deviation of network type for a galaxy: any plot showing the network type for galaxies over more than one run can also show the median type +/- the standard deviation of the type for each galaxy over all the runs. This is an example using the 22:8:1 network with 10 runs and the eClass target. The network types are in blue and the 2-sigma standard deviations are in green. One can see that the standard deviation in type is much less than the RMS; this also shows up in the run-versus-run plots below. The eyeball plot also shows -1 (unassigned) types, which had not yet been removed when the data for this plot were run through the network.
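A minimal sketch of how the per-galaxy median and standard deviation over runs can be formed, assuming the outputs of all runs are stacked into a matrix (names illustrative):

    % types: nRuns x nGalaxies matrix, one row of network types per run
    medType = median(types, 1);          % median type for each galaxy
    stdType = std(types, 0, 1);          % standard deviation over the runs
    % e.g. plot the median type +/- 2 sigma against the target type
    errorbar(targetType, medType, 2 * stdType, '.');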

    • Residuals: sorted by target type (one could also sort by network type). These show features such as the 'S' shape in network type versus target type more clearly. (The network avoids assigning extreme types because it concentrates its effort on the vast majority of training examples.) Each plot here is one of the runs of the 22:8:1 network with eClass target.

    • One run versus another: the correlation is much higher (and the RMS much lower) between runs than between network type and target type. This suggests the network is limited by the noise and intrinsic spread in the training set. Again the 22:8:1 net, eClass target and 10 runs are shown; a sketch of how the matrices below are computed follows them.

Correlations:

1.0000 0.9911 0.9917 0.9915 0.9913 0.9926 0.9920 0.9908 0.9917 0.9919
0.9911 1.0000 0.9928 0.9950 0.9946 0.9939 0.9953 0.9939 0.9942 0.9938
0.9917 0.9928 1.0000 0.9935 0.9933 0.9940 0.9942 0.9930 0.9939 0.9938
0.9915 0.9950 0.9935 1.0000 0.9958 0.9940 0.9947 0.9945 0.9953 0.9933
0.9913 0.9946 0.9933 0.9958 1.0000 0.9939 0.9946 0.9932 0.9949 0.9934
0.9926 0.9939 0.9940 0.9940 0.9939 1.0000 0.9949 0.9939 0.9943 0.9937
0.9920 0.9953 0.9942 0.9947 0.9946 0.9949 1.0000 0.9945 0.9951 0.9950
0.9908 0.9939 0.9930 0.9945 0.9932 0.9939 0.9945 1.0000 0.9945 0.9939
0.9917 0.9942 0.9939 0.9953 0.9949 0.9943 0.9951 0.9945 1.0000 0.9943
0.9919 0.9938 0.9938 0.9933 0.9934 0.9937 0.9950 0.9939 0.9943 1.0000

RMS values:

0      0.0237 0.0229 0.0231 0.0234 0.0215 0.0225 0.0240 0.0228 0.0225
0.0237 0      0.0212 0.0178 0.0184 0.0196 0.0172 0.0195 0.0191 0.0197
0.0229 0.0212 0      0.0202 0.0206 0.0195 0.0191 0.0209 0.0196 0.0197
0.0231 0.0178 0.0202 0      0.0163 0.0194 0.0183 0.0186 0.0172 0.0206
0.0234 0.0184 0.0206 0.0163 0      0.0196 0.0185 0.0206 0.0180 0.0203
0.0215 0.0196 0.0195 0.0194 0.0196 0      0.0180 0.0195 0.0189 0.0199
0.0225 0.0172 0.0191 0.0183 0.0185 0.0180 0      0.0186 0.0175 0.0177
0.0240 0.0195 0.0209 0.0186 0.0206 0.0195 0.0186 0      0.0185 0.0196
0.0228 0.0191 0.0196 0.0172 0.0180 0.0189 0.0175 0.0185 0      0.0189
0.0225 0.0197 0.0197 0.0206 0.0203 0.0199 0.0177 0.0196 0.0189 0
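These matrices can be generated directly from the stacked run outputs; a sketch, assuming the same nRuns x nGalaxies matrix of types as above:

    nRuns = size(types, 1);
    corrMat = corrcoef(types');          % nRuns x nRuns correlations (runs as columns)
    rmsMat = zeros(nRuns);
    for i = 1:nRuns
        for j = 1:nRuns                  % RMS difference between runs i and j
            rmsMat(i, j) = sqrt(mean((types(i, :) - types(j, :)).^2));
        end
    end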

    • Residuals of one run versus another: one run minus another, both on its own and plotted against one of the runs. The types are sorted by network-assigned type. Sorting by target type instead shows a slightly increasing spread towards higher eClass values, but mostly the intrinsic spread of the training set. The y axis is cut at +/- 0.2; note that the vast majority of the residuals are much smaller than this.

    • Correlation and RMS by parameter set. Sets 1-22 correspond to the individual parameters given on the current results (now deprecated) page (sets 11 and 12 gave NaNs and have been removed, leaving 20). The rest are listed in the paper. The set of all 20 parameters without the magnitudes is set 37; with the magnitudes it is set 38.

    • Correlation and RMS by architecture. The plots show correlation and RMS for parameter sets which were run on several architectures, for the eyeball and eClass targets. The x axis shows the number of weights in the network on a logarithmic scale; the 22:8:1 network mentioned in other plots contains 193 weights (other counts are given below). The training sets contain ~600 galaxies for eyeball and ~15000 for eClass. The larger nets are overfitting the data for the eyeball set but not for the eClass set. The form of the eyeball graph may change if the network is trained using a validation set, with the weights taken at the minimum of the validation set error, but the limit on the intrinsic information in the set suggested here will not change - a larger training set is needed. If even larger nets were used, so that the number of weights became comparable to the training set size for eClass, the same pattern would presumably appear there too; it is only the size of the training set and the spread in the eClass which prevent overfitting. Thus using a validation set may be justified on the grounds of consistency.

Numbers of weights, by number of input parameters (rows) and architecture (columns):

No. params      1    2:1    4:1    8:1   16:1   32:1  4:4:1  8:8:1  16:16:1  8:8:8:1
 1              2      7     13     25     49     97     33     97      321      169
 2              3      9     17     33     65    129     37    105      337      177
 4              5     13     25     49     97    193     45    121      369      193
22             23     49     97    193    385    769    117    265      657      337
32             33     69    137    273    545   1089    157    345      817      417
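These counts follow from adding up weights and biases layer by layer: an architecture n:h1:...:hk:1 has (n+1)*h1 + (h1+1)*h2 + ... + (hk+1)*1 weights. A small MATLAB function to reproduce the table (illustrative, not part of the original code):

    function n = nweights(layers)
    % NWEIGHTS Number of weights (including biases) in a layered network.
    %   layers = [nInputs hidden1 ... nOutputs], e.g. nweights([22 8 1]) = 193.
    n = sum((layers(1:end-1) + 1) .* layers(2:end));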

This also shows that, because only small hidden layers are needed, the main factor in the number of weights (and hence the training time, since trainlm scales as the number of weights cubed) is the number of input parameters. Thus a preliminary PCA may be useful in enabling many more runs, to get a good distribution of assigned types for each galaxy. However, the correlation and RMS can never be improved in this way, just approached more quickly.

I found that applying prepca makes training faster once a run lasts roughly 100 epochs or more, but the correlation and RMS are a little worse (0.84 and 0.112, as opposed to 0.85 and 0.109). It may be worth doing if many training runs are needed (e.g. for binned redshifts).
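A sketch of the preprocessing, using the old Neural Network Toolbox functions (I believe these are the relevant calls, but the 0.001 variance threshold is illustrative rather than the value actually used):

    % P: nParams x nGalaxies inputs; Ptest: test-set inputs
    [Pn, meanp, stdp] = prestd(P);            % normalise to zero mean, unit std
    [Ptrans, transMat] = prepca(Pn, 0.001);   % drop components below 0.1% of variance
    % the test set must be passed through the same transformation:
    PtestTrans = trapca(trastd(Ptest, meanp, stdp), transMat);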

The plots for the redshift target are also interesting, and suggest that improvements may be made using an even larger network. Firth et al use large networks and obtain improvements in their delta z for photometric redshifts. However, the spurious correlation in the magnitudes from the r < 17.77 sample cut may also be a factor, particularly for the light blue circles (all parameters). It may also be interesting to run the 20 parameters used here instead of the 4 colours for the eClass plot above.

    • Number of galaxies versus type for the 22:8:1 training and test sets and 10 runs (eClass target), binned by target type. This shows that there is significant variation in the number of galaxies in the most populated bin (887 to 1193, 34%). The network tends to 'squash' the training set, so the peak is pushed higher and far fewer outlying types are assigned, as expected. Because of this variation, it is probably unwise to pick a single median network to assign types; a median type over all the runs, or some kind of mean type, is preferable. The median types in the net type versus eClass plot are computed in this way.

    • Values of initial and final weights: there are no extreme values (spikes or blades at e.g. +/- 10^6 or more), due to the regularisation. The final weights for the 10 runs of 22:8:1, eClass target are shown. The initial values are random, between -1 and 1.

Input weights: matrices; mostly small values, with occasional 'walls' of larger values, but none greater than +/- 20:

Layer weights: vectors here, but these can be matrices when more than one hidden layer is present or if multiple outputs are used:

Biases: vectors (for hidden layers with more than one neuron) or scalars (for the single output neuron):

    • Training records: mean squared error (RMS^2) versus number of epochs. Single neuron, eClass target, all parameter sets. Each plot shows all ten runs for the parameter set:

This shows that the Levenberg-Marquardt algorithm jumps straight to the minimum for a linear problem. The training usually stops here when mu_max, the ceiling on the Levenberg-Marquardt damping parameter mu, is reached.

8:1 network, all parameter sets: there is more spread in the minima reached for the larger sets (sets are 1-38, as in the correlation/RMS by parameter set plot). The maximum number of epochs (set by epochs = 100) is reached more often for larger networks, presumably because of the greater potential for slight improvements in the error function at each epoch.

Architectures: this shows the training records for the redshift target, as this was run on several architectures using sets including all the parameters except the magnitudes. Apart from the single neuron, the architecture seems to have little effect, presumably due to the large training set size (~15000) relative to the number of weights (up to 769). A sketch of the kind of training setup these records come from follows.
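This is a minimal sketch using the old toolbox API (the 8:1 architecture and epochs = 100 are as quoted above; other settings are the toolbox defaults):

    net = newff(minmax(P), [8 1], {'tansig', 'purelin'}, 'trainlm');
    net.trainParam.epochs = 100;      % maximum number of epochs, as above
    [net, tr] = train(net, P, T);     % tr is the training record
    semilogy(tr.epoch, tr.perf);      % mean squared error versus epoch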

    • Levenberg-Marquardt damping term (mu). mu_max is 1.00 x 10^10. Again this is for all parameters except magnitudes, redshift target, 10 architectures. Another instance had mu dropping momentarily to 10^-80, but other plots look similar to this one. Mu oscillating over one order of magnitude for many epochs indicates the weights are near a minimum; mu reaching the large value indicates the minimum has effectively been found.

    • Simulating the trained 22:8:1 network (eClass target) on the test set and on its own training set: it does slightly better on the training set (0.87 correlation as opposed to 0.85), and although this difference is quite small, it is about a factor of ten greater than the spread in correlations between runs. The RMS is similar. The 4:8:1 net is also similar, so this is not an effect of having many parameters.

Plots: 22:8:1 and 4:8:1, each simulated on the test set and on its own training set.

    • RMS as a function of target type: this may be worth taking into account when using the assigned types. The plot shows the RMS per bin for 20 bins, each containing an equal number of galaxies, for one run of the 22:8:1 net, eClass target. A sketch of the binning follows the plot.

RMS on the plot of net type versus target type:
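A sketch of the equal-count binning (names illustrative):

    [sorted, order] = sort(targetType);       % sort galaxies by target type
    res = netType - targetType; res = res(order);
    nBins = 20;
    binSize = floor(length(res) / nBins);     % equal number of galaxies per bin
    rmsPerBin = zeros(1, nBins);
    for b = 1:nBins
        idx = (b - 1) * binSize + 1 : b * binSize;
        rmsPerBin(b) = sqrt(mean(res(idx).^2));
    end
    % (any remainder after the last full bin is ignored in this sketch)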

    • Magnitudes contain more information than colours (e.g. g and r, as opposed to g-r alone), but a plot of colour versus type shows a correlation while plots of the individual magnitudes do not. So which does better?

Using Petrosian g and r, eyeball type target and the 8:1 net, the corr/RMS values are:

g r: 0.7323, 1.0312

g-r: 0.7249, 1.0457

Using all five magnitudes and all four model colours with the 8:1 net and eClass target:

u g r i z: 0.8459, 0.1102

u-g g-r r-i i-z: 0.8446, 0.1107

However, the magnitudes have the spurious correlation from the Petrosian r=17.77 cut and the colours do not.

e.g. with a redshift target (from the paper results):

Model:

u (spurious correlation): 0.6349, 0.0379

r (flattish-topped plot): 0.4188, 0.0445

g-r colour: 0.7514, 0.0324

Petrosian:

g (spurious correlation): 0.7116, 0.0345

r (flat-topped plot): 0.4825, 0.0430

g-r: 0.7121, 0.0344

So I'm not convinced that using the magnitudes is better than using the colours. Whether to use model or Petrosian magnitudes depends on the target (Petrosian are better for morphology, model for eClass). And of course, using both cannot be worse than using one or the other; it will just take longer to do the training.

    • Validation set error versus training epoch: even in the worst case of a 32:1 net, 30 parameters (1025 weights) and the eyeball training set (about 600 galaxies), the validation set error does not increase. The difference is smaller for more typical examples, e.g. eClass with the 4 colours. Thus even the worst case is not severely overfitting, and the more typical cases are not overfitting at all. (To generate these, max_fail, the number of times the validation set error can increase before training stops, was changed from the default 5 to Inf; see the sketch below.)
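A sketch of the setup, using the old train syntax with a validation structure (VV); max_fail = Inf as described above, other names illustrative:

    net.trainParam.max_fail = Inf;        % never stop early on validation failures
    VV.P = Pval; VV.T = Tval;             % validation inputs and targets
    [net, tr] = train(net, P, T, [], [], VV);
    plot(tr.epoch, tr.vperf);             % validation set error versus epoch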

    • The eyeball types are discrete (quantised to the nearest 0.5) rather than continuous targets, but trainlm with a single output is designed for regression of continuous variables. Thus smoothing the eyeball targets by adding +/- 0.25 of uniform random noise might improve the results; however, the results here are the same. It may be worth doing anyway to help the numerical conditioning.
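The smoothing is just uniform jitter on the targets; a one-line sketch:

    % eyeball types are quantised to the nearest 0.5; spread each by +/- 0.25
    Tsmooth = T + 0.5 * (rand(size(T)) - 0.5);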

    • Test data: the code has not been rigorously tested, but I have run some simple test data to check that it isn't doing anything wildly incorrect.

    • Jan 24th 2003: an evened sample of the training set (100 bins between the lowest and highest training eClass, without removing outlying targets, limiting the number of galaxies in each bin to 500 but otherwise leaving the set unchanged) mostly removes the sigmoid shape in net type versus target type (a sketch of the procedure follows the plots):

Using unevened training set, as used so far

Using evened training set
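A sketch of the evening procedure (100 bins, at most 500 galaxies per bin, as described above; names illustrative):

    nBins = 100; maxPerBin = 500;
    edges = linspace(min(T), max(T), nBins + 1);
    keep = [];
    for b = 1:nBins
        inBin = find(T >= edges(b) & T < edges(b + 1));
        if b == nBins                            % include the top edge in the last bin
            inBin = find(T >= edges(b) & T <= edges(b + 1));
        end
        if length(inBin) > maxPerBin             % random subsample of crowded bins
            inBin = inBin(randperm(length(inBin)));
            inBin = inBin(1:maxPerBin);
        end
        keep = [keep inBin];
    end
    Peven = P(:, keep); Teven = T(keep);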

    • Jan 29th 2003: also tried this for the star formation rate (now with 50 bins, limiting each bin to N times the mean bin occupancy, here N = 1) and removing targets outside the range -0.5 to 1. The results look better than previously but still aren't particularly exciting. This suggests that different galaxy parameters are needed to predict star formation directly.

    • Jan 29th 2003: it is assumed that the median and mean net types are similar. Checking this (results for 10 runs):

Type       Architecture   std(mean-median)   max          min       RMS      RMS/std
eclass     20:8:1 [1]     0.0028             0.0306       -0.0621   0.1080   39
eyeball    20:8:1         0.0728             0.2587       -0.4159   0.6652   9
redshift   20:8:1         0.0013             0.0153       -0.0178   0.0218   17
sfr        20:8:1         0.0383             1.0589 [2]   -0.2689   0.6964   18
tpauto     20:8:1         0.0151             0.1186       -0.1238   0.2930   19
eclass     4:8:1 [3]      0.0018             0.0263       -0.0462   0.1102   61
eyeball    4:8:1 [4]      0.0593             0.3947       -0.5260   0.9614   16
eyeball    1:8:1 [5]      0.0039             0.0095       -0.0079   1.0457   268

(RMS is the net type versus target type RMS for that architecture and target.)

etc.

[1] all parameters except magnitudes

[2] second highest is 0.62 then < 0.3

[3] model colours

[4] Petrosian colours

[5] Petrosian g*-r* (eyeball target: the RMS matches the g-r eyeball result above)

The mean is clearly not the correct value to use, because it is distorted by the values from the poorly performing runs, but as this shows, it makes little difference, so nothing more complex than the median is needed. Of course, more runs are still better, as the median will then be closer to the true type for the galaxy.
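The check itself is simple given the stacked run outputs; a sketch using the same types matrix as above:

    meanType = mean(types, 1);            % mean combined type over the runs
    medType = median(types, 1);           % median combined type
    d = meanType - medType;
    [std(d) max(d) min(d)]                % the first three columns tabulated above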

More possible plots:

Corr/rms versus internal scatter (Naim et al)

Cumulative percent of ANN types correct to within a given error versus residual (Naim et al)

Corr/rms versus training set size (Firth et al)

Time taken versus training set size and number of parameters

Cone plots by type