Fast Data Mining Algorithms on the CADC Infrastructure (2012)

Here, we describe the implementation of fast data mining algorithms on the cloud computing infrastructure of the Canadian Astronomy Data Centre (CADC). For those who would like a broad introduction to data mining in astronomy, with an emphasis on what it is good for scientifically, see the IVOA KDD-IG Guide to Data Mining in Astronomy.

CADC provides a cloud computing environment for astronomical projects, CANFAR, the world's first cloud computing system built specifically with astronomy in mind. Its aim is to provide the generic portions of a project, such as storage, processing, and dissemination of data, so that they do not need to be built from scratch each time, and are under the development and supervision of experts. Users remain free to install and run their own code on this infrastructure.

CADC's overall mission is not simply to provide a data archive, but to enable novel science to be done with that data. Therefore, by analogy with CANFAR, we demonstrate here fast data mining algorithms that provide the generic portions of a study of astronomical data using data mining techniques. The motivation is similar to that for legacy datasets, e.g., from wide-field sky surveys: the provision of a generically useful tool and/or dataset enables a great deal of science. This is exemplified by the Sloan Digital Sky Survey, which, thanks to both its data and the database infrastructure that made those data accessible, has become the most cited facility in astronomy.

To be scalable to astronomical datasets at the petascale (10^15 bytes) or beyond, the algorithms must scale as N log N or better, where N is the number of objects, either in the training set or in the whole dataset. It is therefore essential that the tools provided scale in this way.

We give results showing the scaling of data mining codes in the R language, which provides a large number of freely available and open source codes, and the Skytree software, which uses state-of-the-art methods to provide better speed and scaling. Each example uses a real science dataset and performs an investigation that is useful to the author and/or collaborators.
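As an illustration of how such scaling can be measured, the sketch below is not the actual benchmark code behind the results pages: it simply times a base R algorithm on synthetic data of increasing size, with placeholder parameters throughout.

    # Minimal sketch, with synthetic data: time an algorithm over increasing
    # subset sizes to measure its empirical scaling with N.
    set.seed(1)
    catalogue <- matrix(rnorm(1e5 * 5), ncol = 5)   # stand-in for a real catalogue
    sizes <- c(1e3, 1e4, 1e5)

    timings <- sapply(sizes, function(n)
      system.time(kmeans(catalogue[1:n, ], centers = 5,
                         nstart = 10, iter.max = 50))[["elapsed"]])

    print(data.frame(N = sizes, seconds = timings))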

Skytree recently launched; we have been using beta versions since 2010, in collaboration with the developers.

It claims the following speedups, for N objects or D-dimensional data:

Each of these methods is useful for a plethora of science investigations:

K nearest neighbours: Any dataset for which detailed data on properties are available for a representative subset of objects may have those properties predicted for the remaining objects using supervised learning. Many supervised learning algorithms are available, of which k nearest neighbours is conceptually the simplest. An example is training on a spectroscopic sample, e.g., a square degree of spectra to 22nd magnitude, and predicting properties, e.g., a redshift, for a much larger sample, perhaps 100 square degrees, for which only photometry is available. Such photometric redshifts are, when training data are available, generally superior to template-based redshifts.
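A minimal sketch of this in R, assuming the FNN package and synthetic stand-in colours and redshifts (not the actual NGVS or SDSS analysis):

    # k nearest neighbour regression for photometric redshifts, on toy data.
    library(FNN)  # provides knn.reg(), a tree-based kNN regression

    set.seed(42)
    n_spec <- 5000    # objects with spectroscopic redshifts (training set)
    n_phot <- 20000   # photometry-only objects to predict
    colours_spec <- matrix(runif(n_spec * 4), ncol = 4)   # toy colours, e.g. u-g, g-r, r-i, i-z
    z_spec <- rowMeans(colours_spec) + rnorm(n_spec, sd = 0.02)  # toy colour-redshift relation
    colours_phot <- matrix(runif(n_phot * 4), ncol = 4)

    # Predict each photometry-only object's redshift from its 10 nearest
    # neighbours (in colour space) among the spectroscopic training objects.
    fit <- knn.reg(train = colours_spec, test = colours_phot, y = z_spec, k = 10)
    z_phot <- fit$pred
    summary(z_phot)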

Kernel density estimation: A distribution made up of components can be decomposed into those components by estimating its density with a given kernel, such as a Gaussian. It is similar to a histogram, but more general. Examples are the selection of quasars by modeling the densities of different types of objects, or the description of photometric redshifts by full probability density functions (PDFs), which can give a signal-to-noise ratio equivalent to that of a survey many times larger than one analysed without PDFs.
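A minimal sketch of density-based selection in base R, with synthetic colours standing in for real stars and quasars (the population sizes, means, and the single colour are illustrative only):

    # Gaussian kernel density estimates of two toy populations ("stars" and
    # "quasars") in one colour, used to assign new objects a quasar probability.
    set.seed(1)
    colour_star   <- rnorm(5000, mean = 1.2, sd = 0.3)   # toy colours
    colour_quasar <- rnorm(1000, mean = 0.2, sd = 0.3)

    kde_star   <- density(colour_star,   kernel = "gaussian")
    kde_quasar <- density(colour_quasar, kernel = "gaussian")

    # Evaluate both densities at the colours of new, unlabelled objects
    new_colour <- c(0.1, 0.7, 1.1)
    d_star   <- approx(kde_star$x,   kde_star$y,   xout = new_colour, rule = 2)$y
    d_quasar <- approx(kde_quasar$x, kde_quasar$y, xout = new_colour, rule = 2)$y

    # Relative density, weighted by population size, as a simple quasar probability
    p_quasar <- (1000 * d_quasar) / (1000 * d_quasar + 5000 * d_star)
    round(p_quasar, 3)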

Linear regression: This well-known classic method finds the best-fitting linear relationship between two variables. Of the seven methods here, we currently do not demonstrate this one, since our only current application is the trivial one of fitting the slopes of the scaling relations for the code shown below! However, an example application would be showing that the supernova Hubble diagram deviates from a linear relation, i.e., that the expansion of the universe is accelerating.
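For completeness, that trivial application, fitting the slope of a scaling relation in log-log space, looks like this in base R; the timing numbers below are placeholders, not measured results.

    # Fit the slope of a (placeholder) time-versus-N scaling relation.
    N       <- c(1e4, 1e5, 1e6, 1e7)
    seconds <- c(0.8, 9.5, 110, 1300)     # illustrative values only

    fit <- lm(log10(seconds) ~ log10(N))
    coef(fit)["log10(N)"]   # a slope near 1 indicates roughly linear scaling in N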

Support vector machine: Like k nearest neighbours, this is a supervised learning algorithm that has become popular in astronomy for object classification, e.g., star-galaxy separation, or galaxy types. An example classification is reproducing in an automated manner the impressive and extensive results of the Galaxy Zoo project. While that project was able to generate over 30 million classifications in 6 months using a pool of roughly 100,000 volunteers, it is our contention that next-generation surveys such as LSST will not be able to employ the same method for their several billion galaxies, because gaining the 10 million volunteers that would be required is unlikely. To perform analogous science with LSST, it will therefore be necessary to classify the majority of galaxies automatically.
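A minimal sketch in R, assuming the e1071 package and two synthetic morphological parameters (not the actual Galaxy Zoo features):

    # A support vector machine classifying galaxies as "spiral" or "elliptical"
    # from two toy parameters, trained on 70% of the sample.
    library(e1071)  # provides svm()

    set.seed(7)
    n <- 1000
    concentration <- c(rnorm(n/2, 2.0, 0.3), rnorm(n/2, 3.0, 0.3))
    colour        <- c(rnorm(n/2, 0.5, 0.2), rnorm(n/2, 1.0, 0.2))
    morphology    <- factor(rep(c("spiral", "elliptical"), each = n/2))
    galaxies      <- data.frame(concentration, colour, morphology)

    train_idx <- sample(n, 0.7 * n)
    model <- svm(morphology ~ concentration + colour,
                 data = galaxies[train_idx, ], kernel = "radial")

    pred <- predict(model, galaxies[-train_idx, ])
    mean(pred == galaxies$morphology[-train_idx])   # fraction correctly classified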

Singular value decomposition: Many modern datasets are high-dimensional, i.e., each object is described by a large number of measurements. However, many of these may be redundant, correlated, and so on, so that the most useful information is represented by a much smaller subset of the measurements, or combinations thereof. Hence, a method for objective dimension reduction is of wide applicability. Singular value decomposition, which underlies principal component analysis, is one such widely used method. An example is finding the principal components of galaxy spectra as the basis of a classification system (used, e.g., in the Sloan Digital Sky Survey).
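A minimal sketch in base R, with a random matrix standing in for an objects-by-wavelengths array of spectra:

    # Singular value decomposition of a spectra matrix; keep the leading
    # components as a low-dimensional description of each spectrum.
    set.seed(3)
    n_spectra <- 500
    n_pixels  <- 1000
    spectra <- matrix(rnorm(n_spectra * n_pixels), nrow = n_spectra)

    # For PCA proper, subtract the mean spectrum first
    centred <- scale(spectra, center = TRUE, scale = FALSE)
    s <- svd(centred)                  # centred = U %*% diag(d) %*% t(V)

    variance_fraction <- s$d^2 / sum(s$d^2)
    k <- 5                             # number of components to keep
    coefficients <- s$u[, 1:k] %*% diag(s$d[1:k])  # k numbers per spectrum
    eigenspectra <- s$v[, 1:k]                     # the basis "eigenspectra"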

K-means clustering: If one has a set of data that one suspects contains particular subsets of objects, but the structure of the data is not known, the application of unsupervised learning to the data may elucidate that structure. K-means is one of the commonest and simplest algorithms, and finds clusters of objects in a parameter space. It has been used in astronomy, for example, to show that the fundamental plane of galaxies is made up of components with widely different evolutionary histories.
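A minimal sketch in base R, using synthetic parameters rather than real fundamental-plane measurements:

    # k-means clustering of objects in a toy three-dimensional parameter space.
    set.seed(5)
    group1 <- matrix(rnorm(300 * 3, mean = 0), ncol = 3)
    group2 <- matrix(rnorm(300 * 3, mean = 3), ncol = 3)
    params <- rbind(group1, group2)
    colnames(params) <- c("size", "surface_brightness", "velocity_dispersion")

    # Standardise so no single parameter dominates the distances, then cluster
    clusters <- kmeans(scale(params), centers = 2, nstart = 25)
    table(clusters$cluster)    # membership counts for each cluster
    clusters$centers           # cluster centres, in standardised units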

2-point correlation function: Is a set of points randomly distributed, or are there correlations? The 2-point correlation function measures the excess probability over random that a point (e.g., a galaxy in space) will be found within a given distance of another point. The 2-point, and its higher-order counterparts, have been used to show that, e.g., the universe appears to become statistically homogeneous on scales of greater than ~100 Mpc.
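A minimal sketch of the Landy-Szalay estimator in base R, using brute-force pair counting on synthetic 2D points; a fast implementation would use tree-based pair counting, as Skytree does.

    # Landy-Szalay estimator of the 2-point correlation function,
    # xi(r) = (DD - 2 DR + RR) / RR, with normalised pair counts DD, DR, RR.
    set.seed(9)
    n <- 2000
    data_pts   <- matrix(runif(2 * n), ncol = 2)    # "galaxies" in a unit box
    random_pts <- matrix(runif(2 * n), ncol = 2)    # matching random catalogue

    pair_counts <- function(a, b = NULL, breaks) {
      if (is.null(b)) {                   # distinct pairs within one catalogue
        d <- as.numeric(dist(a))
      } else {                            # all cross-pairs between two catalogues
        d <- as.numeric(sqrt(outer(a[, 1], b[, 1], "-")^2 +
                             outer(a[, 2], b[, 2], "-")^2))
      }
      d <- d[d >= min(breaks) & d <= max(breaks)]   # keep separations in range
      hist(d, breaks = breaks, plot = FALSE)$counts
    }

    breaks <- seq(0.01, 0.2, by = 0.01)                       # separation bins
    DD <- pair_counts(data_pts,   breaks = breaks) / (n * (n - 1) / 2)
    RR <- pair_counts(random_pts, breaks = breaks) / (n * (n - 1) / 2)
    DR <- pair_counts(data_pts, random_pts, breaks) / (n * n)

    xi <- (DD - 2 * DR + RR) / RR   # ~0 everywhere for these uncorrelated points
    round(xi, 3)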

The following pages give results on our datasets: the Next Generation Virgo Cluster Survey, and the Sloan Digital Sky Survey. Each shows why we wanted to do the investigation, and the speedup (if any) provided by the Skytree software.

[Due to the finite number of hours in the day, and the author's requirements for 8 hours of sleep, the results pages are under construction! We apologize for the inconvenience.]

K nearest neighbours

Kernel density estimation

Linear regression

Support vector machine

Singular value decomposition

K-means clustering

2-point correlation function

In our case, the true power of the software lies not simply in the speedups provided, but in the expertise gained by collaborating with its developers, who are world experts in machine learning and data mining. Over time, many of the most significant advances in astronomy have resulted from such interdisciplinary collaboration. Other CANFAR users would also benefit from such collaboration, because the deployment of Skytree includes support from its developers.

Skytree is a spinoff of the FastLab group at the Georgia Institute of Technology, whose research results are considerably more extensive than the methods described here.