By Niharika Shrivastava, BS-MS 2021
Machine Learning In Astronomy
Astronomy, as we know it, is the scientific study of celestial objects such as stars, planets, and galaxies, and of the universe as a whole. It encompasses observing, analyzing, and understanding the physical properties, movements, origins, and evolution of objects and phenomena beyond Earth's atmosphere. A question may arise, however: why does astronomy need a modern and sophisticated technique like machine learning? In today's digital world, humans have begun sending telescopes into space to extend their horizons and learn more about the universe than they ever could from observatories on Earth alone. We now have telescopes collecting data across different wavelengths of the electromagnetic spectrum.
As a consequence, petabytes of data now arrive from space telescopes and ground-based observatories, and dealing with astronomical data falls under the field of 'Big Data.' What, besides machines, can help humans mine this data and draw scientific inferences from it?
Machine learning is formally defined as a field of study that gives computers the capability to learn without being explicitly programmed. This far-reaching capability has made machine learning increasingly popular among astronomers. Broadly, the field is divided into three types based on the way the algorithms work:
Supervised learning
Unsupervised learning
Reinforcement learning
This article covers supervised machine learning techniques, their pros and cons, and their uses in astronomy.
Supervised learning algorithms are programs that use a set of examples to learn the relationship between the attributes of a dataset and a target variable. Once established, the relationship can be used to predict the target variable for unseen data. These algorithms can describe complex, non-linear relations between the attributes and the target variable, which makes them more flexible than traditional model-fitting techniques, where the form of the model is fixed in advance.
Astronomical datasets contain attributes such as the spectra or light curves of physical entities like stars and galaxies, and the task at hand defines the target variable. If the target variable is discrete, the problem is a classification task; if it is continuous, that is, it can take any value within a given range, the problem is a regression task. Depending on what we want to predict, we apply these methods to find the prediction with the least error.
Before we go into how supervised learning can help in astronomy, we'll briefly discuss some major algorithms. All of them can be implemented in programming languages like Python or R.
Support Vector Machine: It is a popular learning model used for both classification and regression. In classification problems, it separates the data points by constructing a hyperplane, which acts as a decision boundary between the classes.
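The difference between the two tasks can be sketched in a few lines of Python with scikit-learn. This is only an illustration under stated assumptions: the two features and both targets are synthetic stand-ins rather than real catalogue data, and the linear models are chosen purely for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a catalogue: two made-up features per object.
X = rng.normal(size=(500, 2))

# Classification: the target is discrete (0 or 1, e.g. two object classes).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)

# Regression: the target is continuous (any real value within a range).
y_reg = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=500)

# Hold out 30% of the objects to play the role of "unseen" data.
X_tr, X_te, yc_tr, yc_te, yr_tr, yr_te = train_test_split(
    X, y_class, y_reg, test_size=0.3, random_state=0
)

clf = LogisticRegression().fit(X_tr, yc_tr)
reg = LinearRegression().fit(X_tr, yr_tr)

# Classification error is counted in wrong labels, regression error in distance.
class_accuracy = clf.score(X_te, yc_te)
reg_mse = np.mean((reg.predict(X_te) - yr_te) ** 2)

print(f"classification accuracy: {class_accuracy:.3f}")
print(f"regression mean squared error: {reg_mse:.5f}")
```

Both models are scored on held-out objects, which is what "predicting the target variable for unseen data" means in practice.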
In a two-dimensional space, the hyperplane is a line that divides the space into two parts, each containing one class. If the classes are not linearly separable, SVM employs a technique called the kernel trick. This involves mapping the dataset into a higher-dimensional feature space where linear separation might be possible. Once the decision boundary is found in this transformed space, it is back-projected to the original input space, resulting in a non-linear decision boundary. SVM is a versatile and robust classification method available in libraries like scikit-learn.
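The effect of the kernel trick can be seen on a toy dataset of two concentric rings, which no straight line can separate. The sketch below uses scikit-learn's `SVC`; the dataset and the parameter values are illustrative assumptions, not taken from any astronomical survey.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel can only draw a straight decision boundary.
linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)

# The RBF kernel implicitly maps the points to a higher-dimensional space,
# where a separating hyperplane exists.
rbf_svm = SVC(kernel="rbf").fit(X_tr, y_tr)

linear_acc = linear_svm.score(X_te, y_te)
rbf_acc = rbf_svm.score(X_te, y_te)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

The only change between the two models is the `kernel` argument, yet the RBF model can trace the circular boundary that the linear one cannot.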
Decision Trees and Random Forest: Decision trees are represented as a tree-like graph with consecutive nodes, each of which represents a condition on a feature in the dataset. The tree is constructed during the training stage, with the root node initially containing the entire training set. The final decision tree can predict the class of new objects by following the conditions in the nodes.
Random Forest, on the other hand, is a collection of decision trees. Different trees are trained on randomly selected subsets of the training set, and random subsets of features are used in each tree. This process reduces correlations between trees and results in better generalization to new datasets through an aggregate of individual tree predictions. Random Forest can handle datasets with many features and is a popular machine-learning algorithm in astronomy.
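A minimal sketch of the two models side by side, using scikit-learn on a synthetic dataset (an assumption made for illustration; the number of trees and features are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features, only some of which carry information.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single tree: one chain of feature conditions from root to leaf.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A forest: each tree sees a bootstrap sample of the training set
# (bootstrap=True) and considers a random subset of the features at each
# split (max_features="sqrt"). Predictions are aggregated over all trees.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", bootstrap=True, random_state=0
).fit(X_tr, y_tr)

tree_acc = tree.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)

print(f"single tree accuracy:   {tree_acc:.3f}")
print(f"random forest accuracy: {forest_acc:.3f}")
```

On data like this the forest usually, though not always, generalizes better than the single tree, because the randomization decorrelates the errors of the individual trees.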
Artificial Neural Networks: These algorithms lie at the heart of deep learning; their name and structure are inspired by the human brain, mimicking the way biological neurons signal to one another. They consist of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to others and has an associated weight and threshold. If the output of an individual node is above its threshold value, that node is activated and sends data to the next layer of the network. Otherwise, no data is passed along to the next layer.
Neural networks rely on training data to learn and improve their accuracy over time.
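The layer-and-threshold picture can be made concrete with a tiny network written in NumPy. The weights below are hand-picked for illustration (a hypothetical setup, not learned from data) so that the network computes the XOR function; in a real application the weights are learned from training data, and smooth activation functions are typically used in place of the hard threshold.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where the input is above zero, else 0."""
    return (z > 0).astype(float)

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Pass inputs through one hidden layer and one output layer."""
    hidden = step(x @ w_hidden + b_hidden)   # which hidden nodes fire
    return step(hidden @ w_out + b_out)      # which output nodes fire

# Hand-picked weights (illustrative). Each bias is minus the node's threshold.
w_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])     # one column per hidden node
b_hidden = np.array([-0.5, -1.5])     # node 1 acts as OR, node 2 as AND
w_out = np.array([[1.0], [-1.0]])     # output fires for "OR but not AND"
b_out = np.array([-0.5])

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = forward(inputs, w_hidden, b_hidden, w_out, b_out).ravel()

for x, out in zip(inputs, outputs):
    print(x, "->", int(out))
```

No single node can compute XOR on its own; it is the hidden layer that makes the non-linear decision possible.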
These are among the many popular algorithms that researchers around the world use for data analysis in classification and regression tasks. Next are some examples that illustrate how these methods are used.
These algorithms can be utilized in many different ways to help us analyze astronomical data in a more robust and time-efficient manner. In time-domain astronomy, where we observe how objects change on timescales from seconds to decades, both in photometric and spectroscopic datasets, we can use these algorithms to identify and classify objects and transient events.
In solar observations, computer algorithms can be used to predict solar weather phenomena such as flares. This predictive capability is crucial for mitigating the effects of solar storms on Earth's technology and infrastructure.
Another domain where machine learning plays a pivotal role is exoplanet hunting. By analyzing how long and how much of a star's light a potential exoplanet blocks as it passes in front of the star, we can gather information about the planet's size and orbit. Several exoplanets have been identified using machine learning, including a few in multiple-planet systems, where the signals are too complex for a human to distinguish easily.
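To see how the amount of blocked light relates to a planet's size, the standard first-order approximation is that the fractional dip in brightness equals the ratio of the two disc areas, depth ≈ (R_planet / R_star)². The sketch below inverts that relation. It assumes a central transit and ignores limb darkening and other refinements; the function name and the example depth are made up for illustration.

```python
import math

R_SUN_KM = 695_700.0      # nominal solar radius
R_JUPITER_KM = 69_911.0   # mean radius of Jupiter

def planet_radius_km(transit_depth, star_radius_km):
    """Estimate a planet's radius from the fractional dip in starlight.

    Uses depth = (R_planet / R_star)**2, i.e. the planet is treated as a
    dark disc crossing a uniformly bright stellar disc.
    """
    if not 0.0 < transit_depth < 1.0:
        raise ValueError("transit_depth must be a fraction between 0 and 1")
    return star_radius_km * math.sqrt(transit_depth)

# Example: a 1% dip in the light of a Sun-sized star.
radius = planet_radius_km(0.01, R_SUN_KM)
radius_in_jupiters = radius / R_JUPITER_KM

print(f"estimated radius: {radius:.0f} km")
print(f"in Jupiter radii: {radius_in_jupiters:.2f}")
```

A 1% dip therefore points to a planet roughly the size of Jupiter, while an Earth-sized planet around the same star would produce a dip about a hundred times shallower, which is part of why automated methods are useful for finding such signals.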
-> The figure above shows a simple decision tree built with the J48 algorithm, which creates decision trees by recursively partitioning the data based on attribute values. J48 uses information gain or gain ratio to select the best attribute for each split. It handles categorical and numeric attributes, supports pruning to prevent overfitting, and is widely used for classification tasks due to its simplicity and effectiveness. This tree was trained on 50,000 objects from the spectroscopic sample, with a minimum of 50 objects per leaf.
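The two splitting criteria named above, information gain and gain ratio, can be written out for a categorical attribute in plain Python. This is a from-scratch sketch of the formulas, not the code of J48 itself (J48 is the Weka implementation of the C4.5 algorithm); the toy attribute values and class labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(items).values()
    )

def information_gain(values, labels):
    """Drop in label entropy after splitting the data on an attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # the attribute takes a single value: nothing to split on
    return information_gain(values, labels) / split_info

# Toy data: four objects, one useful attribute and one useless attribute.
labels = ["galaxy", "galaxy", "star", "star"]
extended = ["yes", "yes", "no", "no"]   # perfectly separates the classes
field = ["a", "b", "a", "b"]            # unrelated to the class

print(information_gain(extended, labels), gain_ratio(extended, labels))
print(information_gain(field, labels), gain_ratio(field, labels))
```

A tree builder would evaluate every candidate attribute this way and split on the one with the highest score, then repeat the procedure within each branch.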
-> A recent study indicates that machine learning (ML) algorithms such as artificial neural networks can perform morphological classification of galaxies reliably, reproducing human classifications with better than 90% accuracy. This study used a training sample from Galaxy Zoo, which categorized the dataset into three classes: early types, spirals, and point sources/artifacts. A subset of these objects was used to train the artificial neural network (ANN), revealing that increasing the number of input parameters improves the accuracy of the results. Consequently, machine learning algorithms look promising for morphological classification in the next generation of wide-field imaging surveys, and the Galaxy Zoo catalog stands out as an invaluable training set for such endeavors.
Despite the ease and efficiency they offer, applying supervised learning algorithms to astronomical datasets presents challenges in handling uncertainty, transferring knowledge, and ensuring model interpretability. Most existing algorithms are not tailored to astronomical datasets: they assume uniform feature quality and treat the provided labels as ground truth. Astronomical datasets, however, are characterized by noise and gaps, and their labels, assigned by human experts, are often ambiguous. Supervised learning excels on datasets with a high signal-to-noise ratio or with uniform noise, but its performance depends on the noise characteristics, which hinders generalization to datasets with different noise profiles. It therefore becomes necessary to adapt existing tools and create new algorithms that account for dataset uncertainties during model construction. Advanced algorithms should also report prediction uncertainties that reflect both the intrinsic properties of the objects and the measurement uncertainties in the dataset, leading to more robust and accurate modeling.
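One simple way to let measurement uncertainties show up in a prediction is to perturb each measured feature by its error bar many times and average the class probabilities the model returns. The sketch below is a generic Monte Carlo illustration, not a method taken from the references: it assumes Gaussian, independent measurement errors, uses a synthetic training set, and the helper name `predict_with_noise` is made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic training set: the class depends only on the first feature.
X_train = rng.normal(size=(600, 2))
y_train = (X_train[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

def predict_with_noise(model, x, sigma, n_draws=500, rng=rng):
    """Average class probabilities over noisy copies of one measurement.

    x     : measured feature values of a single object
    sigma : 1-sigma measurement uncertainty of each feature
    """
    draws = rng.normal(loc=x, scale=sigma, size=(n_draws, len(x)))
    return model.predict_proba(draws).mean(axis=0)

x_obs = np.array([0.3, 0.0])

# The same measured values, once with small and once with large error bars.
p_precise = predict_with_noise(model, x_obs, sigma=np.array([0.01, 0.01]))
p_noisy = predict_with_noise(model, x_obs, sigma=np.array([1.0, 1.0]))

print("well-measured object:  ", p_precise)
print("poorly measured object:", p_noisy)
```

The poorly measured object receives a less confident class probability than the well-measured one, even though both have the same measured values, which is the kind of behaviour the paragraph above asks of uncertainty-aware algorithms.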
References:
‘Machine Learning in Astronomy: A Practical Overview’ by Dalya Baron
‘Data Mining and Machine Learning in Astronomy’ by Nicholas M. Ball and Robert J. Brunner
‘Galaxy Zoo: Reproducing Galaxy Morphologies via Machine Learning’ by Manda Banerji
IBM Machine Learning Article
‘Machine Learning in Astronomy’, 365 Data Science
Wikipedia