Basics of Machine Learning with HSC-SSP PDR
OVERVIEW
This page was made for ML beginners to learn how to download and preprocess the HSC-SSP data for ML analyses.
First, you need to sign up for an account to access the HSC-SSP PDR data.
Download
There are multiple ways to cut out imaging data from the HSC-SSP PDR.
Preprocess
Preprocessing is crucial in data science and must be tailored to your objectives.
Implementation
This page provides an example implementation in TensorFlow/Keras.
How to Download HSC Imaging Data
Downloading the whole HSC-SSP data (the most flexible option if storage volume is not a concern)
You can search for and download all HSC-SSP data through the Image Query Form.
Alternatively, you can access the data directories on the HSC-SSP page directly.
Image cutout tool (recommended if the sample size is less than ~1M)
The image cutout tool is a convenient way to cut out target images without downloading all the data.
More convenient command-line tools (found at the bottom of the page) are also provided by the pipeline team.
Interactive cutout on the hscMap Sky Explore (the easiest, but not recommended, especially for low-z science)
You can cut out FITS images interactively on hscMap.
Select "Tool -> Rectangle Selection" on the menu bar, and select the region.
Then you can cut out the available image by clicking the button on the corner.
Use of other tools
There are also ways to download images directly in non-FITS formats.
You can find these tools at the bottom of the HSC-SSP page.
Note
In the image cutout tool, you can define the output name by adding a column of "name" (see manual for details).
If your targets are bright extended sources at low redshift, you should use the data before "the aggressive sky subtraction".
You can select such data by changing the file type from "coadd" (default) to "coadd/bg". See the example below.
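As a rough illustration of the two notes above, the sketch below builds a coordinate list with a "name" column and the "coadd/bg" file type. Only the "name" column and the "coadd"/"coadd/bg" values come from this page. The header syntax, the other column names, the rerun label, and the helper build_cutout_list are assumptions made for illustration, so please check the cutout tool manual before uploading such a file.

```python
def build_cutout_list(targets, rerun="pdr3_wide", band="HSC-I",
                      half_size="10asec", file_type="coadd/bg"):
    """Return the text of a coordinate list, one target per row.

    targets: iterable of (name, ra_deg, dec_deg) tuples.
    """
    # ASSUMPTION: header syntax and column names are illustrative only.
    # Only "name" and the "coadd" / "coadd/bg" values appear on this page.
    rows = ["#? rerun filter ra dec sw sh type name"]
    for name, ra, dec in targets:
        rows.append(
            f"{rerun} {band} {ra:.6f} {dec:.6f} "
            f"{half_size} {half_size} {file_type} {name}"
        )
    return "\n".join(rows) + "\n"


# Hypothetical targets: (output name, RA in degrees, Dec in degrees)
targets = [("gal_001", 150.0931, 2.2051), ("gal_002", 34.5000, -5.1200)]
list_text = build_cutout_list(targets)
print(list_text)
```

Writing list_text to a plain text file gives one row per target, with the file type set to the data before the aggressive sky subtraction.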
Preprocessing: How to Create Images?
First, install (if necessary) and import the following libraries:
import numpy as np
from PIL import Image
from astropy.io import fits
from astropy.nddata import Cutout2D
from scipy.ndimage import gaussian_filter
Import fits data
Here I assume that you use the Python library Astropy. The code would look like this:
hdu = fits.open(fitsname)[1]  # note that the HSC image is a multi-extension FITS file
img = hdu.data
PSF matching (if necessary)
The SciPy Gaussian filter smooths the data with a Gaussian kernel of width sgm (in pixels):
img = gaussian_filter(img, sigma=sgm)
Cutout images (if necessary)
If you need to cut out the data further, Cutout2D in Astropy is useful:
def center(data):
    # NumPy shape is (ny, nx), whereas Cutout2D expects position as (x, y)
    ny, nx = data.shape
    return (nx // 2, ny // 2)
cut = Cutout2D(img, position=center(img), size=(size, size))  # pixel array is in cut.data
Flux normalization
Refer to the Color Postage tool developed by the HSC-SSP pipeline team.
This tool enables you to create SDSS- or hscMap-type images in RGB or grayscale.
See Lupton et al. 2004 for more details on color normalization.
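For readers who want to see the idea behind this kind of color normalization, here is a minimal NumPy sketch of an asinh stretch in the spirit of Lupton et al. 2004. It is a simplified illustration, not the implementation used by the Color Postage tool. The function name lupton_rgb and the default values of stretch, Q, and minimum are assumptions chosen for the synthetic example.

```python
import numpy as np


def lupton_rgb(r, g, b, stretch=0.5, Q=8.0, minimum=0.0):
    """Map three flux images to an 8-bit (H, W, 3) cube with an asinh stretch."""
    r, g, b = (np.asarray(x, dtype=float) - minimum for x in (r, g, b))
    intensity = (r + g + b) / 3.0
    # Scale every band by the same factor so that colour ratios are preserved.
    safe_i = np.where(intensity > 0, intensity, 1.0)
    factor = np.arcsinh(Q * intensity / stretch) / (Q * safe_i)
    factor = np.where(intensity > 0, factor, 0.0)
    rgb = np.stack([r * factor, g * factor, b * factor], axis=-1)
    # Where a channel exceeds 1, rescale the pixel by its brightest channel
    # instead of clipping each channel separately (keeps the colour).
    peak = rgb.max(axis=-1, keepdims=True)
    rgb = np.where(peak > 1.0, rgb / np.maximum(peak, 1e-12), rgb)
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)


# Synthetic example: three random "bands" standing in for e.g. i, r, g images
rng = np.random.default_rng(0)
band_r, band_g, band_b = rng.random((3, 8, 8))
rgb_image = lupton_rgb(band_r, band_g, band_b)
gray_image = lupton_rgb(band_r, band_r, band_r)  # identical bands give a gray image
```

The resulting uint8 array can be passed to Image.fromarray from PIL to write a PNG file.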
Note
About image size: If your targets are identified in SDSS, the Petrosian R90 radius (PetroR90) from SDSS is a useful reference.
In this case, an image size of 4-8 times PetroR90 works well in most cases.
About PSF matching: If you plan to conduct a PCA-type analysis, PSF matching is likely necessary because variation in the PSF can itself emerge as one of the principal components.
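The two notes above can be turned into numbers with a short sketch. It assumes the HSC pixel scale of 0.168 arcsec per pixel, a factor of 6 (within the 4-8 range), and PSFs that are well approximated by Gaussians, so that the smoothing kernel follows from the quadrature difference of the two FWHMs. The helper names cutout_size_pix and matching_sigma_pix and the example seeing values are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

PIXEL_SCALE = 0.168  # arcsec per pixel (HSC)
FWHM_TO_SIGMA = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))


def cutout_size_pix(petro_r90_arcsec, factor=6.0):
    """Side length in pixels of a cutout spanning `factor` times PetroR90."""
    size = int(np.ceil(factor * petro_r90_arcsec / PIXEL_SCALE))
    return size + (size + 1) % 2  # force an odd size so a central pixel exists


def matching_sigma_pix(fwhm_native_arcsec, fwhm_target_arcsec):
    """Gaussian sigma in pixels that broadens the native PSF to the target FWHM.

    ASSUMPTION: both PSFs are Gaussian, so their widths add in quadrature.
    """
    if fwhm_target_arcsec < fwhm_native_arcsec:
        raise ValueError("target PSF must not be sharper than the native PSF")
    diff = fwhm_target_arcsec**2 - fwhm_native_arcsec**2
    return FWHM_TO_SIGMA * np.sqrt(diff) / PIXEL_SCALE


size = cutout_size_pix(petro_r90_arcsec=5.0)   # 6 x 5 arcsec across
sgm = matching_sigma_pix(0.6, 1.0)             # 0.6 arcsec seeing degraded to 1.0 arcsec
img = np.random.default_rng(0).normal(size=(size, size))  # stand-in for real data
smoothed = gaussian_filter(img, sigma=sgm)
```

The values of size and sgm computed here play the role of the size and sgm variables used in the Cutout2D and gaussian_filter calls earlier on this page.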
TensorFlow/Keras Implementation
For Apple Silicon Mac users: Installation of TensorFlow
Install TensorFlow for Mac, following the instructions on the Apple website (Note: your Mac must have the latest OS version).
ImageDataGenerator
This class helps you easily import and augment the data.
Running machine-learning models
Keras also provides a number of pre-trained models.
You can find example code on the TensorFlow tutorial page.
Note
If you have already installed conda, please update conda before starting the installation:
conda update conda
If you encounter a RuntimeError about the NumPy version during the TensorFlow installation (e.g., module compiled against API version 0x10 but this version of numpy...), please update NumPy with the following command:
pip install numpy --upgrade
Augmentation is essential for making the trained model generalize, especially if you use a Transformer.
Meanwhile, you should carefully decide how to transform the dataset in light of your science goal.
Note that PNG differs from JPEG, whose lossy compression introduces artifacts.
If you train the model with JPEG images, you should also use JPEG images at test time.
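To make the augmentation note concrete, the following is a minimal NumPy sketch, a plain stand-in rather than the Keras ImageDataGenerator itself. It uses only flips and 90-degree rotations, which need no interpolation and so leave pixel values untouched. Whether these transforms suit your data is an assumption to check against the science goal: for example, a mirror flip reverses the winding direction of spiral arms. The function name augment and the toy image are hypothetical.

```python
import numpy as np


def augment(image, rng):
    """Return a randomly rotated and flipped copy of an (H, W, C) image."""
    k = int(rng.integers(0, 4))               # number of 90-degree rotations
    out = np.rot90(image, k=k, axes=(0, 1))   # rotate in the image plane only
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)            # mirror left-right
    return np.ascontiguousarray(out)


rng = np.random.default_rng(42)
# Toy square image with 3 channels standing in for a preprocessed cutout
image = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
batch = np.stack([augment(image, rng) for _ in range(8)])
```

Because the toy image is square, every augmented copy keeps the same shape and the copies can be stacked into one training batch.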
References
Passive spiral galaxies deeply captured by Subaru Hyper Suprime-Cam, 2022, PASJ, 74, 612
Where's Swimmy?: Mining unique color features buried in galaxies by deep anomaly detection using Subaru Hyper Suprime-Cam data, 2022, PASJ, 74, 1
Subaru Hyper Suprime-Cam revisits the large-scale environmental dependence on galaxy morphology over 360 deg^2 at z=0.3-0.6, 2021, PASJ, 73, 1575
Third data release of the Hyper Suprime-Cam Subaru Strategic Program, 2022, PASJ, 74, 247