Each multivariate time series has 60 timesteps, and each timestep has 33 parameters:
TOTUSJH, TOTBSQ, TOTPOT, TOTUSJZ, ABSNJZH, SAVNCPP, USFLUX, TOTFZ, MEANPOT, EPSZ, SHRGT45,
MEANSHR, MEANGAM, MEANGBT, MEANGBZ, MEANGBH, MEANJZH, TOTFY, MEANJZD, MEANALP, TOTFX, EPSY,
EPSX, R_VALUE, RBZ_VALUE, RBT_VALUE, RBP_VALUE, FDIM, BZ_FDIM, BT_FDIM, BP_FDIM, PIL_LEN, XR_MAX.
The plot shows the distribution of solar flare time series data in partition 1 across 5 classes. The per-class frequencies are listed below:
Class Q: 63400,
Class B: 6010,
Class C: 6531,
Class M: 1157,
Class X: 172.
The plot shows the distribution of solar flare time series data in partition 1 across 2 classes. The per-class frequencies are listed below:
Positive class : 1329,
Class M: 1157; Class X: 172.
Negative class: 75941,
Class Q: 63400; Class B: 6010; Class C: 6531.
Partition 1 has 77270 samples. The pet dataset samples 1000 out of these 77270 using a climatology-preserving strategy: 500 flares and 500 no-flares are kept. The sampling ratios are listed below:
Positive class (M and X): 500/1329 ≈ 0.376.
M_after_sampling = 1157 * 0.376 ≈ 435.
X_after_sampling = 500 - 435 = 65.
Negative class (Q, B, and C): 500/75941 ≈ 0.0066.
B_after_sampling = 6010 * 0.0066 ≈ 40.
C_after_sampling = 6531 * 0.0066 ≈ 43.
Q_after_sampling = 500 - 40 - 43 = 417.
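The sampling arithmetic above can be written out as a short Python sketch (the class counts are those of partition 1; rounding the non-dominant classes by the group ratio and assigning the remainder to the dominant class of each group is an assumption that reproduces the listed numbers):

```python
# Class counts in partition 1 (from the tables above).
counts = {"Q": 63400, "B": 6010, "C": 6531, "M": 1157, "X": 172}
n_pos = n_neg = 500  # 500 flares and 500 no-flares out of 1000

pos_total = counts["M"] + counts["X"]                # 1329
neg_total = counts["Q"] + counts["B"] + counts["C"]  # 75941

# Round the smaller classes by the group ratio, then assign the
# remainder to the dominant class so each group sums exactly to 500.
m_sampled = round(counts["M"] * n_pos / pos_total)   # -> 435
x_sampled = n_pos - m_sampled                        # -> 65
b_sampled = round(counts["B"] * n_neg / neg_total)   # -> 40
c_sampled = round(counts["C"] * n_neg / neg_total)   # -> 43
q_sampled = n_neg - b_sampled - c_sampled            # -> 417
```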
Ref: Ahmadzadeh, Azim, et al. "Challenges with extreme class-imbalance and temporal coherence: A study on solar flare data." 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2019.
The plot shows the distribution of solar flare time series data across 5 classes after sampling. The count within each class is listed below:
Class Q: 417,
Class B: 40,
Class C: 43,
Class M: 435,
Class X: 65.
The plot shows the distribution of solar flare time series data across 2 classes after sampling. The count within each class is listed below:
Positive class : 500,
Class M: 435; Class X: 65.
Negative class: 500,
Class Q: 417; Class B: 40; Class C: 43.
Ref: https://bitbucket.org/gsudmlab/yang_3861/src/master/src/preprocessing/
A min_max_normalizer for processing multivariate time series data with shape [n, timesteps, num_features], e.g. [1000, 60, 33].
|- n: the total number of records in a MVTS dataset.
|- timesteps: the total steps in a single time sequence, e.g. 60.
|- num_features: the number of features of one time step.
NOTE that this is a global normalization method: each value of a given feature is scaled across all n * timesteps values of that feature.
Meanwhile, the scaler information is saved so the normalization can be inverted (undone) later; each feature has one global scaler.
(** This method is extended from sklearn's preprocessing package; see `sklearn.preprocessing.MinMaxScaler`.)
Example: ( Data format description: 3-d array, [n, timesteps, num_features] )
data = [ [ [ 1 2 3]
[ 0 10 4] ]
[ [-1 18 2]
[ 4 1 1] ] ]
data_norm = [ [ [-0.2 -0.88235294 0.33333333]
[-0.6 0.05882353 1. ] ]
[ [-1. 1. -0.33333333]
[ 1. -1. -1. ] ] ]
We can see that each value is scaled across the different multivariate time series.
e.g. the first feature of all MVTS: [[1, 0], [-1, 4]] --> [[-0.2, -0.6], [-1, 1]].
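A minimal sketch of such a global normalizer, built on `sklearn.preprocessing.MinMaxScaler` with `feature_range=(-1, 1)` to match the example values above (an illustrative stand-in, not the repository's implementation):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def min_max_normalize(data, feature_range=(-1, 1)):
    """Globally scale each feature across all n * timesteps values.

    data: array of shape [n, timesteps, num_features].
    Returns the scaled array and the fitted scaler (kept for inversion).
    """
    n, timesteps, num_features = data.shape
    flat = data.reshape(n * timesteps, num_features)  # one column per feature
    scaler = MinMaxScaler(feature_range=feature_range)
    return scaler.fit_transform(flat).reshape(data.shape), scaler

data = np.array([[[1, 2, 3], [0, 10, 4]],
                 [[-1, 18, 2], [4, 1, 1]]], dtype=float)
data_norm, scaler = min_max_normalize(data)
# Undo the normalization with the saved scaler:
data_back = scaler.inverse_transform(
    data_norm.reshape(-1, data.shape[2])).reshape(data.shape)
```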
Introduction:
A Gramian Angular Field is an image obtained from a time series, representing temporal correlations between pairs of time points.
Two methods are available: Gramian Angular Summation Field and Gramian Angular Difference Field.
Usage in python:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
from pyts.image import GramianAngularField

# Transform the time series into Gramian Angular Fields
gasf = GramianAngularField(image_size=24, method='summation')
X_gasf = gasf.fit_transform(X)
gadf = GramianAngularField(image_size=24, method='difference')
X_gadf = gadf.fit_transform(X)
# Show the images for the first time series
fig = plt.figure(figsize=(8, 4))
grid = ImageGrid(fig, 111, nrows_ncols=(1, 2), axes_pad=0.15, share_all=True,
                 cbar_location="right", cbar_mode="single", cbar_size="7%", cbar_pad=0.3)
images = [X_gasf[0], X_gadf[0]]
titles = ['Summation', 'Difference']
for image, title, ax in zip(images, titles, grid):
    im = ax.imshow(image, cmap='rainbow', origin='lower')
    ax.set_title(title, fontdict={'fontsize': 12})
    ax.cax.colorbar(im)
    ax.cax.toggle_label(True)
plt.suptitle('Gramian Angular Fields', y=0.98, fontsize=16)
plt.show()
An example of GAF with different settings of image_size:
The figure above shows GAFs generated with different sizes. A smaller size means fewer bins are used to compute the GAF, and therefore less computation. Choosing an appropriate size is thus a trade-off between representation quality and computational cost. In this case, sizes between 10 and 50 give good representations since the series is a simple line, but ratios of 0.25 and 0.5 are more suitable for more complex cases.
Usage in python:
from src.data.data_reader import DataReader
from src.data.ts_imaging import TSImaging
from pyts.image import MarkovTransitionField, GramianAngularField
from src.plotting.plotter import Plotter
# Read MVTS and corresponding labels.
X = DataReader().read_npy('.' + PATH_TO_PET_NORM_TOP_5_FEATURE_DATA)
y = DataReader().read_npy('.' + PATH_TO_PET_LABEL)
# GAF transformation, saved in image format.
tsi = TSImaging()
plotter = Plotter()
for i in range(len(X)):  # mvts index
    gaf_summ = tsi.transform_mvts(X[i], GramianAngularField, image_size=28)
    gaf_diff = tsi.transform_mvts(X[i], GramianAngularField, image_size=28, method='difference')
    label = y[i]
    for j in range(len(TOP_5_FEATURES)):  # param index
        plotter.save_gaf_with_image(gaf_summ[j], gaf_diff[j], label, i, j)
A workflow for processing and generating the GAF image dataset:
Observations:
(1) The GAF summation images are symmetric about the diagonal running from lower left to upper right (the matrix's main diagonal when displayed with origin='lower').
(2) The GAF difference images are antisymmetric about that diagonal: values mirrored across it have opposite signs.
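Both observations follow from the standard GAF definitions, GASF[i, j] = cos(φ_i + φ_j) and GADF[i, j] = sin(φ_i - φ_j), where φ is the arccos of the series rescaled to [-1, 1]. A minimal numpy sketch (not the pyts implementation) makes them easy to verify:

```python
import numpy as np

def gaf(x, method="summation"):
    """Gramian Angular Field of a 1-D series (minimal sketch)."""
    x = np.asarray(x, dtype=float)
    # Rescale to [-1, 1], then map each value to an angle in [0, pi].
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x_scaled, -1, 1))
    if method == "summation":
        return np.cos(phi[:, None] + phi[None, :])  # GASF: symmetric
    return np.sin(phi[:, None] - phi[None, :])      # GADF: antisymmetric

x = np.sin(np.linspace(0, 4 * np.pi, 60))  # toy 60-step series
gasf_img = gaf(x, "summation")
gadf_img = gaf(x, "difference")
# GASF equals its transpose; GADF equals the negative of its transpose.
```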
GAF summation images dataset examples:
GAF images transformed from the corresponding MVTS with the summation method. Each image in the table above is obtained from a univariate time series, e.g. of shape (60, 1). Each multivariate time series produces 5 images when the top-5 features are used. In this way, we can generate an image dataset for all MVTS.
GAF difference images dataset examples:
GAF images transformed from the corresponding MVTS with the difference method. Each image in the table above is obtained from a univariate time series, e.g. of shape (60, 1). Each multivariate time series produces 5 images when the top-5 features are used. In this way, we can generate an image dataset for all MVTS.
(1) How many images:
In total, 5000 images are generated by GASF and 5000 by GADF; 5000 = 1000 MVTS × 5 features.
(2) distribution of classes:
Negative class:
Class Q: 417,
Class B: 40,
Class C: 43.
Positive class:
Class M: 435,
Class X: 65.
(3) image sizes: 28 * 28 pixels.
(4) RGB or grayscale: both grayscale (recommended) and RGB versions are included.
(5) image type: *.jpg files.
(6) data architecture: images that can be read via PIL (Python Imaging Library).
ref: https://bitbucket.org/gsudmlab/yang_3861/src/master/data/gaf_dataset/
|- grayscale version: image_gaf_diff and image_gaf_summ directories.
|- RGB version: image_gaf_diff_rgb and image_gaf_summ_rgb directories.
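A sketch of how one 28 × 28 GAF matrix might be written as a grayscale *.jpg and read back via PIL; `save_gaf_as_jpg` is a hypothetical helper, not the repository's `plotter.save_gaf_with_image`:

```python
import numpy as np
from PIL import Image

def save_gaf_as_jpg(gaf_img, path):
    """Rescale a GAF matrix from [-1, 1] to [0, 255] and save as grayscale JPEG."""
    pixels = np.round((gaf_img + 1.0) / 2.0 * 255.0).astype(np.uint8)
    Image.fromarray(pixels, mode="L").save(path, format="JPEG")

# Round-trip a toy 28 x 28 field.
angles = np.linspace(0, np.pi, 28)
toy = np.cos(angles[:, None] + angles[None, :])  # values in [-1, 1]
save_gaf_as_jpg(toy, "gaf_example.jpg")
loaded = Image.open("gaf_example.jpg")  # 28 x 28, mode "L" (grayscale)
```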
Introduction:
A model extended from LeNet-5 for image classification.
CNN experiments applying different numbers of features:
|- Top-1 features: 'TOTUSJH'.
|- Top-2 features: 'TOTUSJH' and 'TOTBSQ'.
|- Top-3 features: 'TOTUSJH', 'TOTBSQ' and 'TOTPOT'.
|- Top-5 features: 'TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ' and 'ABSNJZH'.
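Slicing the MVTS array down to a top-k feature subset can be sketched as below, assuming the last axis follows the ranked parameter order listed above; `take_features` is a hypothetical helper:

```python
import numpy as np

# Assumed: the last axis follows the ranked parameter order (first 5 of 33 shown).
FEATURE_NAMES = ["TOTUSJH", "TOTBSQ", "TOTPOT", "TOTUSJZ", "ABSNJZH"]

def take_features(mvts, names, wanted):
    """Slice an [n, timesteps, num_features] array down to the named features."""
    idx = [names.index(w) for w in wanted]
    return mvts[:, :, idx]

rng = np.random.default_rng(0)
X = rng.random((1000, 60, len(FEATURE_NAMES)))  # stand-in MVTS dataset
X_top3 = take_features(X, FEATURE_NAMES, ["TOTUSJH", "TOTBSQ", "TOTPOT"])
```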