Basics of Machine Learning with HSC-SSP PDR

OVERVIEW

This page was made for ML beginners to learn how to download and preprocess the HSC-SSP data for ML analyses.

First, you need to sign up here to access the HSC-SSP PDR data.

Download

There are multiple ways to cut out imaging data in the HSC-SSP PDR.

Preprocess

Preprocessing is crucial for data science and must be tailored to your objectives.

Implementation

This page provides an example of a TensorFlow/Keras implementation.

How to Download HSC Imaging Data

  1. Downloading the whole HSC-SSP data (the most flexible option if you don't mind the storage volume)
    You can search and download all HSC-SSP data through the Image Query Form.
    Or, you may want to access the data directories on the HSC-SSP page directly.

  2. Image cutout tool (the most recommended if the sample size is less than ~1M)
    The image cutout tool is a convenient way to cut out the target images without downloading the whole dataset.
    More convenient command-line tools (found at the bottom of the page) are also provided by the pipeline team.

  3. Interactive cutout on the hscMap sky explorer (the easiest, but not recommended especially for low-z science)
    You can cut out FITS images interactively, as you wish, on hscMap.
    Select "Tool -> Rectangle Selection" on the menu bar and select the region.
    Then you can cut out the available image by clicking the button in the corner.

  4. Use of other tools
    There are also some tools to download images directly (as regular image files rather than FITS).
    You can find these tools at the bottom of the HSC-SSP page.

Note

  • In the image cutout tool, you can define the output file name by adding a "name" column (see the manual for details).

  • If your targets are bright extended sources at low redshift, you should use the data produced before "the aggressive sky subtraction".
    You can select such data by changing the file type from "coadd" (default) to "coadd/bg". See the example below.
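
As a rough illustration, a coordinate list for the cutout tool might look like the sketch below. The column names, units, and values here are assumptions made for illustration only; the exact format (including the header line) is defined in the cutout tool manual.

    # schematic coordinate list -- check the cutout tool manual for the real column names and units
    #? ra        dec      sw      sh      filter  type      rerun      name
       150.0931  2.2058   10asec  10asec  HSC-I   coadd     pdr3_wide  obj001
       150.2540  1.9881   10asec  10asec  HSC-I   coadd/bg  pdr3_wide  obj002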

Preprocessing: How to Create Images?

First of all, you may need to install the following libraries and import them:

import numpy as np
from PIL import Image
from astropy.io import fits
from astropy.nddata import Cutout2D
from scipy.ndimage import gaussian_filter

  1. Import FITS data
    Here I assume that you use the Python library Astropy. The code would look like this:
    hdu = fits.open(fitsname)[1] # note that the HSC image is a multi-extension FITS; extension 1 is the image
    img = hdu.data

  2. PSF matching (if necessary)
    The SciPy Gaussian filter allows you to apply Gaussian smoothing to the data:
    img = gaussian_filter(img, sigma=sgm)

  3. Cutout images (if necessary)
    If you need to further cut out the data, Cutout2D in Astropy would be useful:
    def center(data):
        ny, nx = data.shape  # numpy arrays are indexed as (y, x)
        return (nx / 2, ny / 2)
    cut = Cutout2D(img, position=center(img), size=(size, size))

  4. Flux normalization
    Refer to the Color Postage tool developed by the HSC-SSP pipeline team.
    This tool enables you to create SDSS- or hscMap-type images in RGB or grayscale.
    See Lupton et al. (2004) for more details on color normalization.
    A combined sketch that strings steps 1-4 together is given after this list.
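
Below is a minimal end-to-end sketch of steps 1-4 for a single object observed in three bands, using make_lupton_rgb from astropy.visualization for the Lupton et al. (2004) asinh color normalization. The file names, seeing values, target PSF width, and cutout size are placeholders for illustration, and the PSFs are approximated as Gaussians (sigma = FWHM / 2.355).

import numpy as np
from PIL import Image
from astropy.io import fits
from astropy.nddata import Cutout2D
from astropy.visualization import make_lupton_rgb
from scipy.ndimage import gaussian_filter

def load_band(fitsname, fwhm_img, fwhm_target):
    # step 1: import the FITS data (extension 1 holds the coadd image)
    img = fits.open(fitsname)[1].data
    # step 2: smooth to the target PSF, approximating both PSFs as Gaussians
    sgm = np.sqrt(fwhm_target**2 - fwhm_img**2) / 2.355  # FWHMs in pixels
    return gaussian_filter(img, sigma=sgm)

# placeholder file names and PSF FWHMs (pixels) for the three bands
bands = {"i": ("obj001_HSC-I.fits", 3.2),
         "r": ("obj001_HSC-R.fits", 3.5),
         "g": ("obj001_HSC-G.fits", 3.8)}
fwhm_target = 4.0  # match all bands to a common (worst) PSF
imgs = {b: load_band(f, fwhm, fwhm_target) for b, (f, fwhm) in bands.items()}

# step 3: cut out a square region around the image center
size = 128  # cutout size in pixels
ny, nx = imgs["i"].shape
cuts = {b: Cutout2D(im, position=(nx / 2, ny / 2), size=(size, size)).data
        for b, im in imgs.items()}

# step 4: Lupton et al. (2004) asinh color normalization, then save as PNG
rgb = make_lupton_rgb(cuts["i"], cuts["r"], cuts["g"], stretch=1.0, Q=8)
Image.fromarray(rgb).save("obj001.png")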

Note

  • About image size: If your targets are identified by SDSS, the Petrosian R90 radius from SDSS would be a useful reference.
    In this case, an image size of 4-8 times PetroR90 works well in most cases (see the sketch after this note for the conversion to pixels).

  • About PSF matching: If you plan to conduct a PCA-type analysis, PSF matching is necessary, because otherwise the variation of the PSF across the sample also shows up as one of the principal components.
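
As a minimal sketch of the size rule of thumb above: HSC coadds have a pixel scale of 0.168 arcsec per pixel, so a PetroR90 given in arcsec converts to a cutout size in pixels as follows (the factor of 6 and the example R90 value are placeholders within the suggested 4-8 range).

PIXEL_SCALE = 0.168  # arcsec per pixel for HSC coadd images

def cutout_size_pix(petro_r90_arcsec, factor=6):
    # cutout side length of (factor x PetroR90), converted from arcsec to HSC pixels
    return int(round(factor * petro_r90_arcsec / PIXEL_SCALE))

print(cutout_size_pix(5.0))  # e.g. PetroR90 = 5 arcsec -> ~179 pixels on a side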

TensorFlow/Keras Implementation

  1. For Apple Silicon Mac users: Installation of TensorFlow
    Install TensorFlow for macOS, following the instructions on the Apple website (note: your Mac must have the latest OS version).

  2. ImageDataGenerator
    This class helps you to easily import and augment the data (see the sketch after this list).

  3. Running machine-learning models
    Keras also provides a number of pre-trained models.
    You can find example code on the TensorFlow tutorial page.
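
To make items 2 and 3 concrete, here is a minimal transfer-learning sketch using ImageDataGenerator together with a pre-trained ResNet50. The directory layout, image size, and augmentation choices are assumptions for illustration; flips and rotations are usually safe for galaxy images, while strong color or brightness jitter may not be if your science depends on colors.

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

# augment the training images; flips and rotations preserve galaxy morphology
datagen = ImageDataGenerator(rescale=1.0 / 255,
                             horizontal_flip=True,
                             vertical_flip=True,
                             rotation_range=90,
                             validation_split=0.2)

# assumes PNG cutouts sorted into one sub-directory per class, e.g. images/spiral, images/elliptical
train = datagen.flow_from_directory("images", target_size=(128, 128),
                                    batch_size=32, class_mode="categorical",
                                    subset="training")
valid = datagen.flow_from_directory("images", target_size=(128, 128),
                                    batch_size=32, class_mode="categorical",
                                    subset="validation")

# pre-trained ResNet50 as a frozen feature extractor, plus a small classification head
base = ResNet50(include_top=False, weights="imagenet",
                input_shape=(128, 128, 3), pooling="avg")
base.trainable = False
model = models.Sequential([base,
                           layers.Dense(64, activation="relu"),
                           layers.Dense(train.num_classes, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train, validation_data=valid, epochs=10)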

Note

  • If you have already installed conda, please update conda before starting the installation:
    conda update conda

  • If you encounter a RuntimeError about the NumPy version during the TensorFlow installation (e.g., module compiled against API version 0x10 but this version of numpy...), please update NumPy with the following command:
    pip install numpy --upgrade

  • Data augmentation is essential for the training to generalize, especially if you use a Transformer-based model.
    At the same time, you should carefully decide which transformations are appropriate for your science goal.

  • Note that PNG is lossless, whereas JPEG introduces compression artifacts.
    If you train the model with JPEG images, you should also use JPEG images at test time.
