CANFAR+Skytree
CANFAR + Skytree is a new capability at the National Research Council's Canadian Astronomy Data Centre, designed for those who
wish to extract useful information from datasets. It is the world's first cloud computing and data mining system for astronomy.
If you are interested in using the system, please contact the author of this page, Nick Ball (nick.ball at nrc-cnrc.gc.ca).
Not got time to read the whole page? Download the flyer.
Want more? Check out the webinar.
Introduction
What are CANFAR and Skytree?
Why machine learning?
How to access CANFAR and install and use Skytree
How fast?
Technical specifications
Examples of improved science enabled by data mining
Documentation
Links
Media
Introduction
We have combined:
CANFAR, the Canadian Astronomy Data Centre's cloud computing system, the first of its kind in the world for astronomy
Skytree, the world's most advanced machine learning software, capable of scaling to petascale datasets
Collaboration with the world's largest group dedicated to the development of machine learning algorithms for analyzing large data
CADC's world leading expertise in infrastructure for handling large astronomy data
and are using it to do science, e.g., with the Next Generation Virgo Cluster Survey, Canada-France-Hawaii Telescope Legacy Survey, and photometric redshifts (galaxy distances) for 13 billion galaxies.
What are CANFAR and Skytree?
CANFAR, led by the University of Victoria and implemented at the Canadian Astronomy Data Centre, aims
To develop an operational system that enables the effective delivery, processing, storage,
analysis and distribution of very large datasets produced by astronomical surveys.
and
The CANFAR platform leverages CANARIE’s high-speed network and cloud computing technologies (in which dynamically scalable and virtual resources are provided as a service over the Internet). CANFAR provides a ‘survey-agnostic’ cloud environment that allows astronomers to effectively process the massive amounts of data generated by highly sophisticated and complex observatories using their own custom software.
The user interacts with the system via a virtual machine, which behaves just like one's own desktop or laptop. This means that you can install and run your own code, then run the same code replicated up to 500 times in parallel, for a speedup of up to 500x on work that divides into independent pieces.
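As a rough illustration of the replication idea, the sketch below splits a list of work items into near-equal chunks, one per replica. The function names (split_into_chunks, ideal_speedup) are hypothetical and are not part of CANFAR; how each chunk is actually handed to a virtual machine is not shown here.

```python
def split_into_chunks(items, n_workers):
    """Split items into at most n_workers contiguous chunks of near-equal size."""
    n_workers = max(1, min(n_workers, len(items)))
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def ideal_speedup(chunks):
    """Best-case speedup if every item costs the same and overheads are ignored."""
    total = sum(len(c) for c in chunks)
    largest = max(len(c) for c in chunks)
    # The largest chunk finishes last, so it sets the wall-clock time.
    return total / largest if largest else 1.0


# Hypothetical workload: 1000 independent items spread over 500 replicas.
chunks = split_into_chunks(list(range(1000)), 500)
speedup = ideal_speedup(chunks)
```

The ideal figure only holds when the items are independent and of similar cost; start-up time and data transfer will reduce it in practice.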
Skytree is the world's most advanced machine learning software. It provides the most well-known algorithms spanning the field of data mining, implemented so that they scale to very large datasets.
Why machine learning?
Supervised Learning
If you have detailed information for one set of objects, and only a subset of the same measurements for a further, larger set, supervised learning can predict the detailed information for the larger set.
E.g., galaxy classification 100,000x faster than humans; approximating steps that are too computationally intensive in radiative transfer simulations.
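A minimal sketch of the supervised idea, using a nearest-centroid classifier in plain Python. This is an illustrative stand-in, not Skytree's method or API; the two-feature training data and the "star"/"galaxy" labels are made up for the example.

```python
import math


def fit_centroids(features, labels):
    """Average the feature vectors of each labelled class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


# Hypothetical labelled set: two measurements per object plus a known class.
train_features = [[0.2, 0.1], [0.3, 0.2], [1.1, 0.9], [1.2, 1.0]]
train_labels = ["star", "star", "galaxy", "galaxy"]
centroids = fit_centroids(train_features, train_labels)

# Objects with measurements only: the class is predicted from the labelled set.
predicted = [predict(centroids, x) for x in [[0.25, 0.15], [1.0, 1.0]]]
```

The labelled set plays the role of the objects with detailed information, and the prediction step fills in the missing class for the rest.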
Unsupervised Learning
Suppose instead that you have a dataset, with perhaps many measurements per object, and you do not know what populations it contains. Unsupervised learning allows one to objectively find naturally similar populations of objects.
E.g., components of the galaxy Fundamental Plane correspond to different evolutionary histories; discover outliers or new object classes.
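One common unsupervised technique is k-means clustering. The sketch below is a bare-bones version in plain Python, given as an assumption about what such a grouping might look like rather than as Skytree's implementation; the points and the starting centres are invented.

```python
import math


def kmeans(points, initial_centres, n_iter=20):
    """Alternate between assigning points to the nearest centre and
    moving each centre to the mean of its assigned points."""
    centres = [list(c) for c in initial_centres]
    assignment = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centre for every point.
        assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Update step: a centre with no members is left where it is.
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assignment


# Hypothetical data with two well-separated groups and no labels.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centres, assignment = kmeans(points, [points[0], points[3]])
```

The result depends on the number of centres and on where they start, so in practice several starting configurations are usually tried.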
Density Estimation
Density estimation is a more general version of a histogram that does not require binning the data.
E.g., selection of quasar candidates from large surveys; modeling the components of a population.
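To make the contrast with a histogram concrete, here is a minimal one-dimensional Gaussian kernel density estimate in plain Python. The sample values and the bandwidth are arbitrary choices for illustration, and the brute-force sum is not how a scalable implementation would do it.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function giving the estimated density at any x.

    Each sample contributes a Gaussian bump of width `bandwidth`,
    so no bin edges are needed.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical measurements forming two clumps.
density = gaussian_kde([1.0, 1.2, 0.9, 3.0, 3.1], bandwidth=0.3)

# A crude Riemann sum over a wide range should come out close to 1.
total = sum(density(-5.0 + 0.01 * i) * 0.01 for i in range(1500))
```

The bandwidth plays the role that bin width plays in a histogram: too small gives a spiky estimate, too large smooths real structure away.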
Dimension Reduction
Suppose you have a set of objects, each described by a large number of parameters. Which parameters are the most important? Dimension reduction provides an objective method of determining this.
E.g., principal component analysis for objective classification of galaxy spectra.
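As a sketch of principal component analysis, the code below finds the first principal component of a small table by power iteration on the covariance matrix. Power iteration is chosen here only because it needs no external libraries; it assumes the starting vector is not orthogonal to the leading component and that the data are not all identical. The numbers are invented.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, fraction of total variance) for the leading component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalise to unit length.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v, compared with the total variance (trace of cov).
    variance = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    total = sum(cov[a][a] for a in range(d))
    return v, variance / total


# Hypothetical two-parameter data lying close to the line y = 2x.
rows = [[1, 2.1], [2, 3.9], [3, 6.2], [4, 7.8], [5, 10.1]]
direction, explained = first_principal_component(rows)
```

When one component carries nearly all of the variance, as in this toy case, the two parameters can be replaced by a single combined one with little loss of information.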
Nearest Neighbours
For each given object within a catalogue, which other objects are most similar to it, and how similar are they?
E.g., k nearest neighbours (kNN) and kernel density estimation (KDE) give photometric redshifts.
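The nearest-neighbour question can be sketched with a brute-force search in plain Python. The catalogue of two-colour measurements and the redshift values below are invented, and averaging the neighbours' redshifts is just one simple way to turn neighbours into an estimate; a scalable implementation would typically use a tree structure rather than sorting every distance.

```python
import math


def k_nearest(catalogue, query, k):
    """Return the k (distance, index) pairs closest to query, nearest first."""
    ranked = sorted(
        (math.dist(row, query), i) for i, row in enumerate(catalogue)
    )
    return ranked[:k]


def knn_estimate(catalogue, targets, query, k):
    """Estimate a quantity for query as the mean over its k nearest neighbours."""
    neighbours = k_nearest(catalogue, query, k)
    return sum(targets[i] for _, i in neighbours) / len(neighbours)


# Hypothetical catalogue: two colours per galaxy, with known redshifts.
colours = [[0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [1.1, 1.1]]
redshifts = [0.05, 0.07, 0.50, 0.52, 0.60]

neighbours = k_nearest(colours, [1.0, 1.0], k=3)
estimate = knn_estimate(colours, redshifts, [1.0, 1.0], k=3)
```

The returned distances answer the "how similar" part of the question, and the indices identify which catalogue objects the neighbours are.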
Linear Regression
This finds the best-fit straight line relationship between one variable and another in a dataset.
E.g., a straight line relation in the supernova Hubble diagram is not a good fit: the expansion of the universe is accelerating.
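A minimal ordinary least-squares fit in plain Python, together with the residuals that reveal when a straight line is not a good description. The data are invented: the second set follows y = x squared, so its residuals show a systematic pattern instead of scattering around zero.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(x, y, slope, intercept):
    """Observed minus fitted values; structure here signals a poor model."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]


# Data that really are linear: the fit recovers the line exactly.
line_fit = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Curved data: the best straight line leaves a U-shaped residual pattern.
x_curved, y_curved = [0, 1, 2, 3, 4], [0, 1, 4, 9, 16]
slope, intercept = fit_line(x_curved, y_curved)
curved_residuals = residuals(x_curved, y_curved, slope, intercept)
```

Positive residuals at both ends and negative ones in the middle are the signature of curvature that a straight line cannot absorb.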
2 Point Correlation
The 2 point correlation function is the excess probability over random that an object is within a given distance of another object. This allows characterization of the clustering of objects in a dataset.
E.g., galaxy clustering confirms the cosmological principle of large scale homogeneity of the universe.
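The "excess probability over random" definition can be sketched with the simple pair-count estimator, xi = (DD / RR) x normalisation - 1, where DD and RR are the numbers of data-data and random-random pairs in each separation bin. This is a brute-force illustration in plain Python with invented points; it is not Skytree's algorithm, and more robust estimators exist.

```python
import math
import random


def pair_counts(points, bin_edges):
    """Count unique pairs whose separation falls in each [edge, next edge) bin."""
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = math.dist(points[i], points[j])
            for b in range(len(counts)):
                if bin_edges[b] <= r < bin_edges[b + 1]:
                    counts[b] += 1
                    break
    return counts


def two_point_correlation(data, randoms, bin_edges):
    """Pair-count estimator: normalised DD / RR minus one, per bin."""
    dd = pair_counts(data, bin_edges)
    rr = pair_counts(randoms, bin_edges)
    # Rescale for the different total numbers of pairs in the two sets.
    norm = (len(randoms) * (len(randoms) - 1)) / (len(data) * (len(data) - 1))
    return [
        norm * d / r - 1.0 if r > 0 else float("nan")
        for d, r in zip(dd, rr)
    ]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
randoms = [(rng.random(), rng.random()) for _ in range(200)]
# Hypothetical clustered sample: every point packed into one small corner.
clustered = [(0.05 * rng.random(), 0.05 * rng.random()) for _ in range(20)]
edges = [0.0, 0.1, 0.2, 0.4]

xi_random = two_point_correlation(randoms, randoms, edges)
xi_clustered = two_point_correlation(clustered, randoms, edges)
```

A sample compared against itself gives zero in every bin, while the clustered sample shows a strong excess at small separations and a deficit at larger ones. The double loop costs of order N squared operations, which is why scalable implementations matter for large catalogues.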
How to access CANFAR and install and use Skytree
The process is described here.
How fast?
For N objects in D dimensions:
* = data-dependent
Skytree claims speedups of up to 10,000x compared to existing approaches.
Technical specifications
CANFAR
Operating system: Scientific Linux 5.5
Virtual machines: Xen Hypervisor, Nimbus
Job scheduling: Condor
Authentication: X.509 certificates
Storage virtualization: VOSpace
Processor cores: 500
Skytree
Operating system: Linux, Mac OS X
Hardware: x86 or 64-bit
Memory: 1GB or above
Disk space: 20GB or above
Examples of improved science enabled by data mining
Data mining has enabled a number of improved results in astronomy. The User Guide for Data Mining in Astronomy details numerous examples in its section 2.
Documentation
CANFAR+Skytree flyer
CANFAR+Skytree poster
Examples of improved science from data mining
Links
KDD-IG guide: A User Guide for Data Mining in Astronomy
University of Victoria Physics & Astronomy
Next Generation Virgo Cluster Survey
Canada-France-Hawaii Telescope Legacy Survey
Media
Blog: CANFAR+Skytree on Astronomy Computing Today