Source Classification with Machine Learning

Associated code on Github: https://github.com/informationcake/Astro-Machine-Learning

The aim is to automatically classify an astronomical source using whatever information describes it. As a basic example, I took photometry for Stars, Galaxies and Quasars from the Sloan Digital Sky Survey (SDSS) and the Wide-field Infrared Survey Explorer (WISE). What are all the dots in the image at the top of this page?

Is it a Star?

Is it a Galaxy?

Is it a Quasar?

I took 2.5 million sources and used a Random Forest as a supervised machine learning classifier. I trained the Random Forest on 10% of the samples (250,000) and then asked it to classify the remaining 2.25 million. It got 97% of them correct, giving a precision, recall and F1 score of 0.97. In fact, training on 1% of the samples (25,000) achieves the same scores! This suggests that the features used separate the three classes very well. Below you can see plots showing the True vs Predicted labels, and also the ranking of the features used. In this run I also included radio photometry from FIRST and TGSS, but very few sources have a radio counterpart, so these features were ranked very low and were basically not used for this classification task. A significant number of sources would need corresponding radio data points for these features to be useful.
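As a rough illustration of the workflow described above, here is a minimal sketch using scikit-learn. It is not the code from the linked repository: the data are synthetic stand-ins for the photometry (three well-separated blobs in 9 made-up features), and the hyperparameters such as n_estimators=100 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the photometry table (assumption: NOT real SDSS/WISE data).
# Each class is a Gaussian blob in a 9-dimensional feature space.
class_names = ["STAR", "GALAXY", "QSO"]
n_per_class, n_features = 2000, 9
centres = rng.normal(0.0, 3.0, size=(3, n_features))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_class, n_features)) for c in centres])
y = np.repeat(np.arange(3), n_per_class)

# Train on 10% of the sources and hold out the remaining 90% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall and F1, plus the True vs Predicted table.
accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred, target_names=class_names))
print(confusion_matrix(y_test, y_pred))

# Rank the features by importance, highest first.
ranking = np.argsort(clf.feature_importances_)[::-1]
for i in ranking:
    print(f"feature_{i}: {clf.feature_importances_[i]:.3f}")
```

With real data, a feature that is missing for most sources (like the radio photometry here) would be expected to land near the bottom of this ranking. Changing train_size to 0.01 is a quick way to check how the scores hold up with a smaller training set.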

Unsupervised Learning

I used t-Distributed Stochastic Neighbour Embedding (t-SNE) on my 9-dimensional data set (dropping the radio data) to reduce the number of dimensions to 2. This means I can plot each object in a 2D plane and see whether similar objects cluster together. This is a useful visual check of whether supervised learning is likely to be successful.
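The dimensionality reduction above can be sketched with scikit-learn's TSNE. This is an illustration only: the 9-dimensional input is synthetic, and the feature scaling step, perplexity=30.0 and init="pca" are assumptions rather than the settings used for the real data.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic 9-dimensional stand-in for the photometric features (assumption).
n_per_class, n_features = 200, 9
centres = rng.normal(0.0, 3.0, size=(3, n_features))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_class, n_features)) for c in centres])
labels = np.repeat(np.arange(3), n_per_class)  # only used to colour a plot afterwards

# Put every feature on a comparable scale before computing neighbour distances.
X_scaled = StandardScaler().fit_transform(X)

# Embed the 9-dimensional points in 2 dimensions.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=1)
embedding = tsne.fit_transform(X_scaled)

print(embedding.shape)
```

Each row of embedding is the 2D position of one source, so a scatter plot of embedding[:, 0] against embedding[:, 1], coloured by labels, shows whether the known classes form separate clumps. t-SNE is slow on large samples, so in practice it is usually run on a random subset.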

More complex source types will likely not have appropriate labels (e.g. specific types of Stars, Galaxies or AGN). Unsupervised learning such as this could therefore be a useful way of labelling data sets that have no class labels, in preparation for supervised learning on larger samples.


I'm now looking at using many more features for each object to help with more advanced classifiers, with a focus on including radio data for a large sample of sources.