Star-Galaxy Separation in the SDSS DR3 (2006)

The Java package Data-To-Knowledge from the NCSA was used as the basis of a knowledge discovery in databases (KDD) approach to the classification of objects within the SDSS, as part of the Laboratory for Cosmological Data Mining at the Department of Astronomy and NCSA. The work described here used decision trees. The combination of peer-reviewed, nationally allocated supercomputing resources on the Linux cluster Tungsten, the expertise of the Automated Learning Group at NCSA, and the local expertise in handling large data sets placed us in a strong position to carry out this type of analysis.

The results are presented in more detail in Ball et al. 2006 (ApJ 650 497), or in preprint form at astro-ph/0606541. There is also a poster, which was displayed at the 208th meeting of the American Astronomical Society in June 2006.

In this paper, we provided, for the first time, an application of machine learning to an astronomical dataset of order 10^8 objects. The idea was to generalize the SDSS star-galaxy separation to classify objects as stars, galaxies, or, the novel part, 'neither star nor galaxy' (nsng) to uncover samples of potentially astrophysically interesting objects. Examples include quasars, objects that could be stars or galaxies but are unusual in some way, or completely unknown objects. We thus were able to provide full probabilities of the form p(galaxy,nsng,star) for each of the 143 million objects in the SDSS DR3.

Supercomputing resources enabled us to make an extensive exploration of the decision tree parameter space. This visualization, created with Partiview, shows one view of the classification error as a function of three of the adjustable parameters.

Figure 1: Decision tree classification error from training as a function of the minimum decomposition population (MDP) and minimum error reduction (MER) for the maximum tree depths (MTDs) shown (including 1, the front extension of the plane at the highest classification error). Each vertex of the mesh represents the result from a decision tree. The best MDP is 2 and the best MER is 0. From Ball et al. 2006 (ApJ 650 497).
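The parameter exploration summarized in Figure 1 can be thought of as a grid search: one decision tree is trained per combination of MTD, MDP and MER, and the combination with the lowest classification error is kept. The following is a minimal sketch of that idea only. The function `grid_search`, the stand-in `toy_error` and the grid values are hypothetical and are not taken from the paper or from the Data-To-Knowledge package; in practice `evaluate` would train a tree and return its measured error.

```python
from itertools import product


def grid_search(evaluate, max_depths, min_populations, min_error_reductions):
    """Evaluate every (MTD, MDP, MER) combination and return the best one.

    `evaluate` is any callable returning a classification error
    for one parameter combination (lower is better).
    """
    results = {}
    for mtd, mdp, mer in product(max_depths, min_populations, min_error_reductions):
        results[(mtd, mdp, mer)] = evaluate(mtd, mdp, mer)
    best = min(results, key=results.get)
    return best, results[best], results


def toy_error(mtd, mdp, mer):
    # Hypothetical stand-in for "train a tree and measure its error".
    # It only mimics a surface that improves with depth and worsens
    # with larger MDP and MER; it is not the error surface of Figure 1.
    return 0.5 / mtd + 0.001 * (mdp - 2) + 0.1 * mer


best, err, results = grid_search(
    toy_error,
    max_depths=[1, 4, 16],
    min_populations=[2, 8, 32],
    min_error_reductions=[0.0, 0.01],
)
print("best (MTD, MDP, MER):", best, "error:", err)
```

Each entry of `results` corresponds to one vertex of a mesh like the one in the figure. Because the combinations are independent of one another, a search of this kind parallelizes naturally across the nodes of a cluster.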

As in other work, we trained on a sample and set aside some of that sample for a blind test. This plot of the confusion matrix shows that a large fraction of the objects are classified correctly.

Figure 2: Confusion matrix for the overall assignment of the types galaxy, nsng and star for the SDSS DR3 blind testing set in the spectroscopic regime. The three columns in each panel show the numbers of objects assigned as galaxy, nsng and star, and indicate that the decision trees assign the types successfully. Note the logarithmic vertical axes. From Ball et al. 2006.
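As an illustration of how a confusion matrix of this kind can be tabulated from per-object probabilities, here is a minimal sketch. It assumes that each object is assigned the type with the highest probability, which is a simplification made for this example and not necessarily the assignment rule used in the paper. The names `assign_type` and `confusion_matrix` and the toy data are hypothetical.

```python
from collections import Counter

TYPES = ("galaxy", "nsng", "star")


def assign_type(probs):
    """Pick the most probable type from a (p_galaxy, p_nsng, p_star) tuple.

    Assumption for this sketch: assignment by maximum probability.
    """
    return TYPES[max(range(len(TYPES)), key=lambda i: probs[i])]


def confusion_matrix(true_types, probabilities):
    """Return matrix[true_type][assigned_type] = number of objects."""
    counts = Counter(
        (true, assign_type(p)) for true, p in zip(true_types, probabilities)
    )
    return {t: {a: counts[(t, a)] for a in TYPES} for t in TYPES}


# Toy blind-test sample (made up for illustration).
true_types = ["galaxy", "galaxy", "star", "nsng", "star"]
probabilities = [
    (0.90, 0.05, 0.05),
    (0.20, 0.10, 0.70),  # a galaxy misassigned as a star
    (0.00, 0.10, 0.90),
    (0.10, 0.80, 0.10),
    (0.05, 0.05, 0.90),
]

matrix = confusion_matrix(true_types, probabilities)
for true in TYPES:
    print(true, matrix[true])
```

The diagonal entries count correctly assigned objects, and the off-diagonal entries count contamination between types.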

An important advantage of providing full probabilities for each object is that the user is able to cut the sample according to the desired level of completeness (percentage of objects of a true type classified as such) and efficiency (percentage of objects classified as a type that are truly that type). This is important because different applications have varying requirements. For example, studying quasar properties would require a high completeness, but a study of clustering would require a high efficiency (so as not to be contaminated by, e.g., stars).
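The two definitions above translate directly into code. The sketch below computes completeness and efficiency for one target type when the sample is cut at a probability threshold. The function name, the threshold values and the toy data are hypothetical and are used only to illustrate the trade-off.

```python
def completeness_efficiency(true_types, p_target, target, threshold):
    """Completeness and efficiency of the cut p(target) >= threshold.

    completeness = correctly selected / all objects truly of the target type
    efficiency   = correctly selected / all objects selected by the cut
    """
    selected = [p >= threshold for p in p_target]
    n_true = sum(t == target for t in true_types)
    n_selected = sum(selected)
    n_correct = sum(s and t == target for s, t in zip(selected, true_types))
    completeness = n_correct / n_true if n_true else float("nan")
    efficiency = n_correct / n_selected if n_selected else float("nan")
    return completeness, efficiency


# Toy sample (made up): true types and the probability of being a galaxy.
true_types = ["galaxy", "galaxy", "galaxy", "star", "star", "nsng"]
p_galaxy = [0.95, 0.80, 0.40, 0.60, 0.10, 0.20]

loose = completeness_efficiency(true_types, p_galaxy, "galaxy", 0.5)
strict = completeness_efficiency(true_types, p_galaxy, "galaxy", 0.9)
print("threshold 0.5:", loose)
print("threshold 0.9:", strict)
```

In this toy sample, raising the threshold removes the contaminating star and so raises the efficiency, at the cost of losing true galaxies and lowering the completeness.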

Figure 3: Completeness and efficiency for galaxies as a function of p(galaxy,nsng,star) for the blind testing set. The optimal overall samples correspond to the points closest to the upper right of the plot for each object type. The curves extend outside the axes shown. From Ball et al. 2006.

Finally, we also performed a blind test on a different survey, and modified the nsng criterion to select quasars directly. This plot shows the resulting completeness and efficiency when the trained decision trees are applied to the 2dF Quasar Redshift Survey (2QZ). The efficiency drops at magnitudes fainter than those of the training set (r > 19), but, because we are using colors, such extrapolation is much more reasonable than it would be in, for example, redshift.

Figure 4: Completeness as a function of magnitude for 2QZ quasars. The upper panel shows the differential counts for matches with the SDSS DR3. The lower panel shows the completeness for the integrated counts. The error bars are Poisson. From Ball et al. 2006.
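A completeness curve of this kind can be sketched by binning the known quasars in magnitude and accumulating the counts from bright to faint. The example below is illustrative only. The function name, the bin edges and the toy data are hypothetical, and the error estimate sqrt(N_recovered) / N_total is a simple Poisson approximation assumed for this sketch, not necessarily the one used for the figure.

```python
import math


def binned_completeness(magnitudes, recovered, edges):
    """Per-bin and integrated (bright-to-faint) completeness.

    `recovered[i]` is True when known quasar i was selected by the classifier.
    """
    rows = []
    cum_total = 0
    cum_recovered = 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [r for m, r in zip(magnitudes, recovered) if lo <= m < hi]
        total = len(in_bin)
        n_recovered = sum(in_bin)
        cum_total += total
        cum_recovered += n_recovered
        rows.append({
            "bin": (lo, hi),
            "differential": n_recovered / total if total else None,
            "integrated": cum_recovered / cum_total if cum_total else None,
            # Assumed Poisson approximation: sqrt(N_recovered) / N_total.
            "integrated_error": (
                math.sqrt(cum_recovered) / cum_total if cum_total else None
            ),
        })
    return rows


# Toy matched sample (made up): magnitudes and whether each quasar was recovered.
magnitudes = [18.2, 18.7, 19.1, 19.4, 19.8, 20.3]
recovered = [True, True, True, False, True, False]

rows = binned_completeness(magnitudes, recovered, edges=[18, 19, 20, 21])
for row in rows:
    print(row)
```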