CANFAR+Skytree

CANFAR + Skytree is a new capability at the National Research Council's Canadian Astronomy Data Centre, designed for those who wish to extract useful information from datasets. It is the world's first cloud computing and data mining system for astronomy.

If you are interested in using the system, please contact the author of this page, Nick Ball ( nick.ball at nrc-cnrc.gc.ca ).

Not got time to read the whole page? Download the flyer.

Want more? Check out the webinar.

Introduction
What are CANFAR and Skytree?
Why machine learning?
How to access CANFAR and install and use Skytree
How fast?
Technical specifications
Examples of improved science enabled by data mining
Documentation
Links
Media

Introduction

We have combined:

- CANFAR, the Canadian Astronomy Data Centre's cloud computing system, the first of its kind in the world for astronomy
- Skytree, the world's most advanced machine learning software, capable of scaling to petascale datasets
- Collaboration with the world's largest group dedicated to the development of machine learning algorithms for analyzing large datasets
- CADC's world-leading expertise in infrastructure for handling large astronomy datasets

and are using it to do science, e.g., with the Next Generation Virgo Cluster Survey, Canada-France-Hawaii Telescope Legacy Survey, and photometric redshifts (galaxy distances) for 13 billion galaxies.

What are CANFAR and Skytree?

CANFAR, led by the University of Victoria and implemented at the Canadian Astronomy Data Centre, aims

To develop an operational system that enables the effective delivery, processing, storage, analysis and distribution of very large datasets produced by astronomical surveys.

and

The CANFAR platform leverages CANARIE’s high-speed network and cloud computing technologies (in which dynamically scalable and virtual resources are provided as a service over the Internet). CANFAR provides a ‘survey-agnostic’ cloud environment that allows astronomers to effectively process the massive amounts of data generated by highly sophisticated and complex observatories using their own custom software.

The user interacts with the system via a virtual machine, which works just like one's own desktop or laptop. This means that you can install and run your own code, then run the same code replicated up to 500 times in parallel, for a speedup of up to 500x on work that divides into independent jobs.

Skytree is the world's most advanced machine learning software. It provides the best-known algorithms spanning the field of data mining, implemented so that they scale to very large datasets.

Why machine learning?

Supervised Learning

If you have one set of objects for which you know detailed information, and a larger set for which you have only a subset of the same measurements, supervised learning can predict the detailed information for the larger set.

E.g., galaxy classification 100,000x faster than humans; approximating steps that are too computationally intensive in radiative transfer simulations.
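As a rough illustration of the idea (not Skytree's implementation or API), here is a minimal nearest-centroid classifier in plain Python. The feature values, class names, and the function names `fit_centroids` and `predict` are made up for this sketch: a small labelled set trains the model, which then labels objects that have the measurements but no class.

```python
from math import dist

def fit_centroids(features, labels):
    """Return the mean feature vector of each class in the labelled set."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda y: dist(centroids[y], x))

# Toy labelled set: (colour, concentration) -> galaxy type. Values are invented.
train_x = [(0.9, 3.1), (1.0, 2.9), (0.3, 2.0), (0.4, 2.2)]
train_y = ["elliptical", "elliptical", "spiral", "spiral"]
centroids = fit_centroids(train_x, train_y)

# Objects with the same measurements but no known type.
unlabelled = [(0.95, 3.0), (0.35, 2.1)]
predicted = [predict(centroids, x) for x in unlabelled]
print(predicted)
```

Real classifiers are far more flexible than this, but the workflow is the same: fit on the objects you know, predict for the ones you do not.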

Unsupervised Learning

Suppose instead that you have a dataset, perhaps with many measurements per object, and you do not know what populations it contains. Unsupervised learning lets you objectively find naturally similar populations of objects.

E.g., components of the galaxy Fundamental Plane correspond to different evolutionary histories; discover outliers or new object classes.
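One common way to find such populations is k-means clustering. The sketch below is a bare-bones version in plain Python, assuming you supply the starting centres yourself; the data points and the function name `kmeans` are illustrative only and are not taken from Skytree.

```python
from math import dist

def kmeans(points, centres, n_iter=20):
    """Lloyd's algorithm: alternate nearest-centre assignment and centre update."""
    centres = [tuple(c) for c in centres]
    for _ in range(n_iter):
        # Assignment step: put each point in the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its members.
        # An empty cluster keeps its previous centre.
        centres = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centres[i]
            for i, members in enumerate(clusters)
        ]
    labels = [min(range(len(centres)), key=lambda i: dist(p, centres[i])) for p in points]
    return centres, labels

# Two well-separated groups of invented 2D measurements, with no labels attached.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centres, labels = kmeans(points, centres=[points[0], points[3]])
print(labels)
```

The result depends on the starting centres and on the number of clusters chosen, so in practice one would try several of each.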

Density Estimation

Density estimation is a more general version of a histogram that does not require binning the data.

E.g., selection of quasar candidates from large surveys; modeling the components of a population.
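To make the histogram comparison concrete, here is a minimal one-dimensional Gaussian kernel density estimate in plain Python. Each sample contributes a smooth bump instead of a count in a bin. The sample values, the bandwidth of 0.2, and the function name `gaussian_kde` are assumptions chosen for illustration.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a function giving the estimated density at any point x."""
    norm = len(samples) * bandwidth * sqrt(2.0 * pi)

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Invented measurements: a group near 1.0 and a smaller group near 3.0.
samples = [1.0, 1.2, 0.9, 1.1, 3.0, 3.1]
density = gaussian_kde(samples, bandwidth=0.2)

for x in (1.05, 2.0, 3.05):
    print(x, round(density(x), 4))
```

The bandwidth plays the role that bin width plays in a histogram: too small and the estimate is noisy, too large and real structure is smoothed away.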

Dimension Reduction

You have a set of objects described by a large number of parameters each. Which parameters are the most important? Dimension reduction provides an objective method of determining this.

E.g., principal component analysis for objective classification of galaxy spectra.
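For two measurements per object, principal component analysis can be written in closed form, which makes the idea easy to see. The sketch below (plain Python, invented data, hypothetical function name `pca_2d`) returns the direction of greatest variance and the fraction of the total variance it explains.

```python
from math import atan2, cos, sin, sqrt

def pca_2d(xs, ys):
    """Return (unit vector of first principal axis, fraction of variance explained)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    mean = (a + c) / 2.0
    half_gap = sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = mean + half_gap, mean - half_gap
    # Orientation of the major axis.
    theta = 0.5 * atan2(2.0 * b, a - c)
    return (cos(theta), sin(theta)), lam1 / (lam1 + lam2)

# Two strongly correlated, made-up measurements (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
axis, explained = pca_2d(xs, ys)
print(axis, explained)
```

Here nearly all the variance lies along a single direction, so one combined parameter describes the data almost as well as the original two.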

Nearest Neighbours

For each given object within a catalogue, which other objects are most similar to it, and how similar are they?

E.g., kNN and KDE give photometric redshifts.
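A brute-force sketch of the kNN idea, in plain Python: estimate the redshift of an object from the known redshifts of its k most similar neighbours in colour space. The colours, redshifts, and the function name `knn_predict` are invented for illustration. Scalable implementations use tree structures rather than this exhaustive search, and nothing here reflects Skytree's actual interface.

```python
from math import dist

def knn_predict(train_x, train_z, query, k=3):
    """Average the known redshifts of the k training objects closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    nearest = order[:k]
    estimate = sum(train_z[i] for i in nearest) / len(nearest)
    return estimate, nearest

# Made-up training set: two colours per galaxy, plus a spectroscopic redshift.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (1.0, 1.1), (1.1, 1.0), (0.9, 1.0)]
train_z = [0.10, 0.12, 0.11, 0.50, 0.52, 0.48]

# A galaxy with colours but no spectroscopic redshift.
z_est, neighbours = knn_predict(train_x, train_z, query=(0.12, 0.18), k=3)
print(z_est, neighbours)
```

The distances to the neighbours also answer the "how similar are they?" question, and can be used to weight the average or flag objects with no close match.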

Linear Regression

This finds the best-fit straight line relationship between one variable and another in a dataset.

E.g., a straight line relation in the supernova Hubble diagram is not a good fit: the expansion of the universe is accelerating.
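The least-squares fit itself is short enough to write out. This plain-Python sketch (invented data, hypothetical function name `fit_line`) fits a line to points that actually follow a curve, then inspects the residuals, which is how one notices that a straight line is not a good description.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Made-up data that follow y = x^2 rather than a straight line.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]

slope, intercept = fit_line(x, y)
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
print(slope, intercept, residuals)
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than random scatter, signals that the underlying relation is curved.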

2 Point Correlation

The 2 point correlation function is the excess probability over random that an object is within a given distance of another object. This allows characterization of the clustering of objects in a dataset.

E.g., galaxy clustering confirms the cosmological principle of large scale homogeneity of the universe.
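The definition above can be turned into a simple estimate by comparing pair counts in the data with pair counts in a random catalogue covering the same area. The sketch below uses the so-called natural estimator, DD/RR - 1, in a cumulative form (all pairs closer than a chosen distance) with brute-force pair counting in plain Python. The catalogues are synthetic, and the function names `pair_fraction` and `xi_natural` are hypothetical.

```python
import random
from math import dist

def pair_fraction(points, r_max):
    """Fraction of distinct pairs separated by less than r_max."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if dist(points[i], points[j]) < r_max
    )
    return close / (n * (n - 1) / 2)

def xi_natural(data, randoms, r_max):
    """Natural estimator DD/RR - 1 using normalised pair counts."""
    return pair_fraction(data, r_max) / pair_fraction(randoms, r_max) - 1.0

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic clustered catalogue: four tight clumps of ten points in the unit square.
clump_centres = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.7), (0.3, 0.9)]
clustered = [
    (rng.gauss(cx, 0.01), rng.gauss(cy, 0.01))
    for cx, cy in clump_centres
    for _ in range(10)
]

# Random catalogue: uniform points over the same unit square.
randoms = [(rng.random(), rng.random()) for _ in range(200)]

xi = xi_natural(clustered, randoms, r_max=0.05)
print(xi)
```

A value well above zero means many more close pairs than a random distribution would give. The brute-force count scales as the square of the number of objects, which is exactly the cost that tree-based algorithms are designed to avoid.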

How to access CANFAR and install and use Skytree

The process is described here.

How fast?

[Table not reproduced: speed comparison for N objects in D dimensions, where * = data-dependent.]

Skytree claims speedups of up to 10,000x compared to existing approaches.

Technical specifications

CANFAR

Operating system: Scientific Linux 5.5

Virtual machines: Xen Hypervisor, Nimbus

Job scheduling: Condor

Authentication: X.509 certificates

Storage virtualization: VOSpace

Processor cores: 500

Skytree

Operating system: Linux, Mac OS X

Hardware: x86 or 64-bit

Memory: 1GB or above

Disk space: 20GB or above

Examples of improved science enabled by data mining

Data mining has enabled a number of improved results in astronomy. The User Guide for Data Mining in Astronomy details numerous examples in its section 2.

Documentation

CANFAR wiki

Skytree Documentation

CANFAR+Skytree flyer

CANFAR+Skytree poster

Examples of improved science from data mining

Links

CANFAR

Skytree

KDD-IG guide: A User Guide for Data Mining in Astronomy

CADC

University of Victoria Physics & Astronomy

Next Generation Virgo Cluster Survey

Canada-France-Hawaii Telescope Legacy Survey

Media

Blog: CANFAR+Skytree on Astronomy Computing Today

Webinar: Exploring the Universe with Machine Learning

Astroinformatics on Slashdot (2011)