Dimensionality Reduction

Dimensionality reduction is the process of transforming a high-dimensional dataset into a more concise, lower-dimensional one. More concretely, we want to reduce the number of features required to describe each data point.

For example, consider the following dataset. Each data point is defined by two features, x1 and y1. Plotting the data points in (x1, y1) space shows that they lie close to a straight line.

If we define a new coordinate system with axes (x2, y2), as in the figure above, the points lie largely along the x2 axis, with minimal scatter (variance) along the y2 axis. We can interpret this as the x2 axis carrying most of the information needed to describe each data point, while the y2 axis carries very little. So if we wanted to approximate each data point using fewer features, we could ‘ignore’ the y2 axis and describe each data point by a single feature: its x2 coordinate.

This, in a very simplified way, is the idea of dimensionality reduction: finding a new set of ‘latent features’ that captures as much information as possible in as few dimensions as possible.

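The rotation into a new (x2, y2) coordinate system can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic data (a line of slope 0.5 with small Gaussian scatter), the function names and the sample size are all made up for the example and are not taken from the figure. The approach shown is the two-dimensional special case of principal component analysis (PCA), a linear method, where the direction of largest variance has a closed-form angle.

```python
import math
import random


def rotate_to_principal_axes(points):
    """Rotate 2-D points so the first new axis follows the direction of largest variance.

    Returns (theta, rotated_points), where theta is the rotation angle in radians.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix of the centred data.
    sxx = sum(x * x for x, _ in centred) / n
    syy = sum(y * y for _, y in centred) / n
    sxy = sum(x * y for x, y in centred) / n

    # Angle of the direction of maximum variance (closed form for the 2-D case).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # Coordinates of each centred point in the rotated (x2, y2) system.
    rotated = [(x * c + y * s, -x * s + y * c) for x, y in centred]
    return theta, rotated


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical data: points scattered tightly around the line y1 = 0.5 * x1 + 1.
random.seed(0)
x1 = [random.uniform(0.0, 10.0) for _ in range(200)]
points = [(x, 0.5 * x + 1.0 + random.gauss(0.0, 0.1)) for x in x1]

theta, rotated = rotate_to_principal_axes(points)
var_x2 = variance([p[0] for p in rotated])
var_y2 = variance([p[1] for p in rotated])
fraction_x2 = var_x2 / (var_x2 + var_y2)

print(f"slope of the x2 axis in (x1, y1) space: {math.tan(theta):.3f}")
print(f"fraction of total variance along x2: {fraction_x2:.4f}")

# Keeping only the x2 coordinate reduces each point from two features to one.
reduced = [p[0] for p in rotated]
```

With data this close to a line, almost all of the variance ends up along x2, so dropping y2 loses very little information. For real, higher-dimensional data one would normally use a library implementation rather than this hand-written 2-D version.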

Applications to Astrophysics

Datasets in astrophysics tend to be very high-dimensional, which can make them computationally expensive to analyse and prone to the ‘curse of dimensionality’. In such cases, dimensionality reduction can help us prepare a dataset for easier analysis.

In addition, dimensionality reduction methods allow us to find the underlying lower-dimensional space (manifold) on which the data lie. This can help us better understand underlying physics, such as that of galaxy evolution, which might not have been obvious at first glance.

For example, in Cooray et al., 2022, manifold learning (a non-linear method) was used to find a set of only two parameters that efficiently represent a galaxy’s spectral energy distribution.

Papers: https://arxiv.org/abs/2210.05862