Support Vector Machines

[Page under construction.]

Within the Next Generation Virgo Cluster Survey (NGVS), we wish to determine which galaxies are in the Virgo cluster and which are in the background. This separation is needed for measurements such as the galaxy luminosity function. Traditional criteria such as magnitude, surface brightness, and resolution into structure, while somewhat effective, do not capture the full information available in the survey, in particular the galaxy colours.

In the regime for which spectra are available within the survey (g < 21 mag), a straightforward application of supervised learning provides a good separation. The trained model can then be applied to the full survey.

By modern standards the NGVS training set is not particularly large, numbering ... . The full survey catalogue to r < 21 contains ... objects, and to the full survey depth it contains 12.5 million. The final survey catalogue will contain ~20 million objects.

Nevertheless, we see ... [fill in]

Data preprocessing:

...
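The preprocessing steps themselves are still to be filled in above. Purely as an illustrative sketch, a common preparation for SVMs is to standardize each feature using statistics computed from the training set only. (Note that e1071's svm() already rescales inputs internally by default via its scale = TRUE argument, so an explicit step like this may be redundant.) The variable names follow the R session below:

# Illustrative only: standardize the photometric features to zero mean and
# unit variance, with centers and scales defined by the training set.
ctr <- colMeans(tr_features)
scl <- apply(tr_features, 2, sd)
tr_scaled <- scale(tr_features, center = ctr, scale = scl)
te_scaled <- scale(data_te, center = ctr, scale = scl)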

The commands run in R were:

library(e1071)

# Read the spectroscopic training set and a <frac> subsample of the full
# survey catalogue; the date() calls bracket the read to record its duration.
data_tr <- read.csv("data/scaling/ngvs/ngvs_spectro_giz_tr.csv", header=TRUE)
date(); data_te <- read.csv("data/scaling/ngvs/ngvs_<frac>.csv", header=TRUE); date()

# Columns 1-4 hold the photometric features; column 5 holds the class label.
# The label must be a factor for svm() to do classification, not regression.
tr_features <- data_tr[, c(1, 2, 3, 4)]
tr_classes <- as.factor(data_tr[, 5])

# Train the SVM, then time the prediction over the test catalogue.
model <- svm(tr_features, tr_classes)
date(); pred <- predict(model, data_te); date()
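To quantify how good the separation is before applying the model to the full survey, one option (a sketch, not necessarily the procedure used here) is e1071's built-in cross-validation and grid search:

# 10-fold cross-validation; summary() reports per-fold and total accuracy.
model_cv <- svm(tr_features, tr_classes, cross = 10)
summary(model_cv)

# Small grid search over the RBF kernel parameters gamma and cost.
tuned <- tune.svm(tr_features, tr_classes, gamma = 10^(-2:1), cost = 10^(0:2))
tuned$best.parameters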

The commands run in Skytree were:

# Train an SVM on the spectroscopic training set, writing the model out.
smo --references_in=../data/scaling/ngvs/ngvs_spectro_giz_tr.fl --parameters_out=params --weights_out=weights

# Apply the trained model to a <frac> subsample of the full survey catalogue.
smo --run_mode=eval --queries_in=../data/scaling/ngvs/ngvs_full_<frac>.fl --parameters_in=params --results_out=results --weights_in=weights

The runtimes for increasing training set size are:

... [table, plot]

The application to the full survey scales linearly with dataset size, as expected: once the model is trained, classifying each object costs a fixed amount of work (one kernel evaluation per support vector), so the total prediction time grows linearly with the number of objects.
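A quick way to check this behaviour in the R session above (illustrative; the actual benchmark setup is not recorded here) is to time predict() on increasing subsets of the test catalogue:

# Time prediction on 25%, 50%, and 100% of the test catalogue; the elapsed
# times should grow roughly linearly with the number of rows.
for (frac in c(0.25, 0.5, 1.0)) {
  idx <- seq_len(floor(frac * nrow(data_te)))
  print(system.time(predict(model, data_te[idx, ])))
}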