AI and Machine Learning

Astrophysics has entered a new era of Big Data surveys, like LSST, which promise to revolutionise our understanding of how the Universe evolves over cosmic time. However, the prodigious data volumes expected from such surveys will likely render traditional analysis methods inadequate, making it necessary to augment or even replace classical techniques with artificial intelligence (AI) and machine-learning algorithms. Here, we consider an example of such a data-analysis problem: the automated morphological analysis of tens of thousands of galaxies, in images that are similar to what will be available from the LSST.


Example - morphological analysis of galaxies using unsupervised machine learning

Galaxy morphology is not only critical for the full spectrum of galaxy evolution studies, but is also a key parameter for a wide array of astrophysical science, e.g. as a prior for photometric redshift pipelines and as contextual data for transient light-curve classification. LSST offers an unprecedented opportunity for the morphological analysis of galaxies (and structures like low-surface-brightness tidal features), down to lower stellar masses and fainter surface brightnesses than ever before.

However, the unprecedented LSST data volumes will make traditional methods of classification, like visual inspection (Willett et al. 2015), challenging, even using massively-distributed systems like Galaxy Zoo (Lintott et al. 2011). The short cadence of LSST may pose an additional hurdle, because repeatedly producing the training sets required for supervised machine learning also becomes impractical. For Big Data surveys like LSST, unsupervised machine learning (UML) offers an ideal method for morphological analysis.

An effective UML algorithm autonomously groups objects of similar morphology, thus compressing an arbitrarily large population into a small number of 'morphological clusters'. If these clusters exhibit high purity, they can then be (collectively) benchmarked using visual classification. Fig 1 shows an example of such a UML algorithm, implemented on both space-based HST (Hocking et al. 2018) and ground-based HSC (Martin et al. 2020) images.

Fig 1: Top two rows: the unsupervised machine-learning algorithm implemented on HST data produces clean separation of objects that are composed of pixels with different properties, e.g. colour and texture. Bottom two rows: an implementation on ground-based images from the Ultradeep layer of the HSC-SSP DR1 (Martin et al. 2020). The HSC-SSP DR1 Ultradeep images have a resolution similar to that of LSST and a depth similar to that of one of the LSST commissioning surveys and of its final Wide survey.

The algorithm extracts image patches from multi-band data. Each patch is transformed into a rotationally-invariant representation of a small region of the survey data, which efficiently encodes colour, intensity and spatial frequency information. Using growing neural gas and hierarchical clustering algorithms, the algorithm groups the patches into a library of patch types, based on their similarity. It then assembles a 'feature vector' for each object, which describes the frequency of each patch type within that object. Finally, a k-means algorithm separates the objects into 160 morphological clusters, based on the similarity of their feature vectors (see Martin et al. 2020 for details).
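
The last two steps (feature vectors and k-means) can be illustrated with a minimal sketch. This is not the implementation of Martin et al. (2020): it assumes the patch-type labels have already been produced by the earlier stages, uses toy data with 4 patch types and 2 clusters instead of 160, and uses a simple deterministic farthest-first initialisation for k-means. The function names and toy arrays are hypothetical.

```python
import numpy as np

def build_feature_vectors(object_ids, patch_types, n_objects, n_types):
    """Normalised histogram of patch types for each object (assumed form)."""
    counts = np.zeros((n_objects, n_types))
    # Add 1 to counts[object, type] for every patch, including repeats.
    np.add.at(counts, (object_ids, patch_types), 1)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)  # avoid division by zero

def kmeans(X, k, n_iter=20):
    """Plain Lloyd's k-means with farthest-first initialisation (illustrative)."""
    centres = [X[0]]
    for _ in range(1, k):
        # Squared distance of each point to its nearest existing centre.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[d2.argmax()])
    centres = np.array(centres, dtype=float)
    for _ in range(n_iter):
        # Assign each object to its nearest centre, then update the centres.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres

# Toy input: 6 objects; each patch carries an object id and a patch type.
# Objects 0-2 contain only patch types 0 and 1, objects 3-5 only types 2 and 3.
object_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5])
patch_types = np.array([0, 1, 0, 0, 1, 1, 0, 0, 2, 3, 3, 3, 2, 2, 3, 3])

X = build_feature_vectors(object_ids, patch_types, n_objects=6, n_types=4)
labels, centres = kmeans(X, k=2)
print(labels)
```

In this toy case the two groups of objects share no patch types, so they fall into separate clusters. Real survey data would involve a much larger patch-type library and far less clean separation.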

These clusters can then be benchmarked via visual inspection, to establish both the morphology of the average object and the morphological purity of the cluster. If the purity of the clusters is high, then the benchmarking exercise reduces to visual inspection of 160 clusters, rather than hundreds of thousands of individual galaxies, making the problem tractable even for individual researchers. Fig 1 shows a few morphological clusters produced by the algorithm (individual columns represent morphological clusters). The top two rows show an implementation on HST data, while the bottom two rows show an implementation on ground-based HSC data.
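
As an illustration of the benchmarking step, the sketch below computes a purity for each cluster from a visually classified subsample. It assumes one common definition of purity, the fraction of objects in a cluster that share the cluster's most frequent visual class; the text above does not specify the exact measure used, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def cluster_purity(cluster_ids, visual_classes):
    """Return {cluster: (dominant visual class, purity)}.

    Purity is taken here to be the fraction of objects in the cluster
    that carry the dominant visual class (an assumed definition).
    """
    by_cluster = {}
    for cluster, visual in zip(cluster_ids, visual_classes):
        by_cluster.setdefault(cluster, []).append(visual)
    result = {}
    for cluster, classes in by_cluster.items():
        dominant, count = Counter(classes).most_common(1)[0]
        result[cluster] = (dominant, count / len(classes))
    return result

# Toy benchmark: 10 visually inspected objects drawn from 3 clusters.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
visual_classes = [
    "elliptical", "elliptical", "elliptical", "disc",
    "disc", "disc", "disc",
    "S0/Sa", "S0/Sa", "elliptical",
]

purity = cluster_purity(cluster_ids, visual_classes)
for cluster, (dominant, value) in sorted(purity.items()):
    print(cluster, dominant, round(value, 2))
```

A cluster whose purity is high can be assigned the dominant class as a whole, which is what makes inspecting clusters rather than individual galaxies worthwhile.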

Fig 2: Physical properties of galaxies -- rest-frame colours, absolute magnitudes and the star-formation main sequence (left) -- in broad morphological classes (ellipticals, S0/Sa and discs; shown in the right-hand panel) in the low and intermediate redshift Universe (z<1). Galaxies classified into these broad morphological classes reproduce known trends in galaxy properties as a function of morphology.

The discrimination between broad morphological classes (e.g. elliptical galaxies, S0/Sa systems and spiral galaxies) is very accurate, with high purity within morphological clusters. Galaxies classified into these classes reproduce known trends in physical properties (e.g. stellar masses, absolute magnitudes, rest-frame colours and star formation rates) with redshift at z<1 (Fig 2). Given the ground-based resolution of HSC and future LSST images, it will be difficult to go beyond this redshift. Such work can, however, be extended to earlier epochs using images from Euclid and JWST.

Fig 3: Examples of two morphological clusters, produced by the unsupervised machine-learning algorithm described above, which involve low-surface-brightness (LSB) galaxies or structures. The first two images (from the left) show examples of objects from a morphological cluster populated by relatively massive LSB dwarfs. The remaining images show examples of objects from a morphological cluster populated by elliptical galaxies that exhibit LSB shells (which are indicative of recent minor mergers).

Since it leverages colour, intensity and spatial frequency information, the algorithm is able to identify sub-populations in the HSC images, such as massive LSB dwarfs and galaxies with faint LSB shells (Fig 3). Finally, since the characteristics of different objects are encoded by their feature vectors, comparison of these vectors makes it possible to find similar systems, given an archetype. Fig 4 shows an example, where an object with known LSB tidal features can be used to identify other similar objects in the galaxy sample.
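
A similarity search of this kind can be sketched as a nearest-neighbour ranking in feature-vector space. The sketch below assumes cosine similarity as the comparison measure, which is a choice made here for illustration and not necessarily the metric used for Fig 4; the function name and the toy feature vectors are hypothetical.

```python
import numpy as np

def most_similar(features, archetype_index, n=2):
    """Rank objects by cosine similarity to the archetype (assumed metric)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero vectors
    sims = unit @ unit[archetype_index]
    order = np.argsort(-sims)                   # most similar first
    order = order[order != archetype_index]     # drop the archetype itself
    return order[:n], sims[order[:n]]

# Toy feature vectors: each row holds patch-type frequencies for one object.
features = np.array([
    [0.6, 0.3, 0.1, 0.0],   # object 0: the archetype
    [0.0, 0.1, 0.4, 0.5],
    [0.5, 0.4, 0.1, 0.0],
    [0.1, 0.1, 0.4, 0.4],
    [0.7, 0.3, 0.0, 0.0],
])

matches, scores = most_similar(features, archetype_index=0, n=2)
print(matches)
```

Here objects 4 and 2 are returned, since their patch-type frequencies most closely resemble those of the archetype. Candidates found this way would still need visual confirmation.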

Fig 4: An example of a 'similarity search' where an archetypal object can be used to find similar systems in survey images. The similarity is established via the galaxy feature vectors. In this case an object with known tidal features has been used to find similar tidally-disturbed objects in HST CANDELS images.