SolarFlare

BDML Summer Code Sprint 2020: Pavan Ajit Babajiyavar

This is an attempt at forecasting solar flares from the SWAN-SF (Space Weather ANalytics for Solar Flares) dataset using a multivariate time series analysis approach. Among the five non-overlapping partitions of the dataset, the first partition is used for data pre-processing and training. In this approach, the time series data is converted to image-like objects and a CNN is employed for classification. The PyTS library is used to work with the dataset.

Dataset Exploration:

The dataset contains 33 features of interest and a class to be predicted. Each multivariate time series consists of 60 readings of each feature taken at 12-minute intervals. One of the challenges of the dataset is the huge imbalance in the class to be predicted. Out of the 77270 instances, the majority (63400) belong to class 'Q', which corresponds to a quiet region. The classes we are interested in predicting are 'X' and 'M', which are potentially disruptive.

A common measure taken to handle such class imbalance is undersampling, wherein a small subset of the data is used instead of the entire dataset to train the models (refer to the Initial Dataset Preparation section).


The dataset also has a large number of features, 33 to be precise. The challenge is to identify only the informative features and discard the rest when training the models. Computing the correlation between features is known to be an effective way to understand the informative value of features in a time series dataset. A correlation value of 1, -1, or close to either suggests that the feature has a high positive or negative correlation: in the former case the feature is directly proportional to the output, and in the latter it is inversely proportional. A correlation value close to 0 suggests that the feature has negligible or no impact on the output. A heat map representing the correlation between features of the SWAN-SF dataset can be found below. In this analysis, the top 5 features identified by Bobra et al. (2015) [B4] are used for training the model. These top 5 features will be compared against the feature correlations seen in the heat map at a later stage.
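As a rough illustration, the heat map can be produced along these lines. This is a minimal sketch, not the project's actual code: it assumes each time series is summarized by its per-instance mean before computing Pearson correlations, and `X` and `feature_names` are hypothetical placeholders for the loaded data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical placeholders: X is the (n_instances, 33, 60) feature array,
# feature_names the list of 33 SWAN-SF feature names.
X = np.random.rand(1000, 33, 60)
feature_names = [f"feature_{i}" for i in range(33)]

# Summarize each time series by its mean so that an ordinary
# feature-by-feature Pearson correlation matrix can be computed.
X_summary = X.mean(axis=2)                                   # (n_instances, 33)
corr = pd.DataFrame(X_summary, columns=feature_names).corr()

sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation heat map")
plt.show()
```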

Initial Dataset Preparation:

As this approach involves converting the multivariate time series into image-like objects using PyTS library functions, the existing dataset, which is in JSON format, needs to be modified into a format accepted by the PyTS functions. The JSON dataset was split into two arrays holding the features and the classes, respectively. The feature array that is fed to the PyTS image transformers is a 3-dimensional array of size (1000, 33, 60), where 1000 is the number of instances, 33 is the number of features, and 60 is the length of each time series.
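A minimal loading sketch is shown below. The record layout is hypothetical (a 33x60 "values" matrix and a "class" label per record, in a file named `swan_sf_partition1.json`); the actual SWAN-SF JSON schema may differ, and the keys would need to be adjusted accordingly.

```python
import json
import numpy as np

# Hypothetical file name and record layout; adjust to the real schema.
with open("swan_sf_partition1.json") as f:
    records = json.load(f)

X = np.array([rec["values"] for rec in records], dtype=float)  # (1000, 33, 60)
y = np.array([rec["class"] for rec in records])                # (1000,)
print(X.shape, y.shape)
```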

The dataset was found to have missing values in the multivariate time series, which would make the time series incomplete. The missing values were imputed using linear interpolation with the InterpolationImputer function from the PyTS library. Further, to ensure that no feature intrinsically dominates the model, the dataset was normalized using standard normalization with the StandardScaler function from the PyTS library. In standard normalization, each time series is standardized by removing the mean and scaling the data to unit variance.
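A sketch of this pre-processing step using the two PyTS transformers named above is given below. Since the PyTS transformers expect 2D input of shape (n_samples, n_timestamps), each of the 33 features is processed separately; `X` is assumed to be the (1000, 33, 60) array from the previous step.

```python
import numpy as np
from pyts.preprocessing import InterpolationImputer, StandardScaler

imputer = InterpolationImputer(missing_values=np.nan, strategy='linear')
scaler = StandardScaler()   # removes the mean and scales to unit variance

# Process one feature channel at a time: impute missing values by linear
# interpolation, then standardize each time series.
X_clean = np.empty_like(X)
for j in range(X.shape[1]):
    X_clean[:, j, :] = scaler.transform(imputer.transform(X[:, j, :]))
```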

As seen before, the dataset is dominated by instances corresponding to weak flares (Q, C, and B), while the classes of interest, and the intent of this research, are the strong flares (X and M). The significant class imbalance in the dataset would inject into any model a bias towards the majority class, i.e. the weak flares in this case. Following the results of Azim et al. (2019), undersampling while retaining the climatology of the original dataset was chosen.

Accordingly, two datasets have been created, representing the two approaches to be taken for time series analysis. The first smaller dataset represents binary classification, in which the classes have been mapped to '0' and '1', representing weak flares ('C', 'B', and 'Q') and strong flares ('X', 'M') respectively. The second dataset represents a multiclass classification problem, in which the original 5 classes 'X', 'M', 'C', 'B', and 'Q' have been retained. For each dataset, 1000 instances were sampled randomly.
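A possible sketch of the climatology-preserving undersampling is shown below; the exact sampling procedure of Azim et al. (2019) may differ. The arrays `X_clean` and `y` are carried over from the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, n_total=1000):
    """Randomly sample ~n_total instances while preserving class ratios."""
    classes, counts = np.unique(y, return_counts=True)
    idx = []
    for c, cnt in zip(classes, counts):
        n_c = max(1, round(n_total * cnt / len(y)))   # class share of the sample
        idx.extend(rng.choice(np.where(y == c)[0], size=n_c, replace=False))
    idx = np.asarray(idx)
    return X[idx], y[idx]

# Binary labels: 1 for strong flares ('X', 'M'), 0 for weak flares ('C', 'B', 'Q').
y_binary = np.isin(y, ['X', 'M']).astype(int)
X_small, y_small = undersample(X_clean, y_binary)
```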

[Figure: Binary class distribution]

[Figure: Multiclass distribution]

Classification of Time Series by Imaging and CNN

Image Transformation:

For converting the existing time series instances into images, the following two algorithms were explored:

  • Gramian Angular Field

  • Markov Transition Field

[Figure: GAF representation of a feature of the MVTS]

Gramian Angular Field (GAF) is a framework in which the time series is represented in a polar co-ordinate system instead of Cartesian co-ordinates. The image is built from a Gramian matrix wherein each element represents the temporal correlation between pairs of time points.

The Gramian matrix can be formed by computing either the trigonometric sum, giving the Gramian Angular Summation Field (GASF), or the trigonometric difference, giving the Gramian Angular Difference Field (GADF), as shown below.
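For reference, the standard GASF/GADF definitions are reproduced below: the series is first rescaled to [-1, 1] and each value is encoded as an angle in polar co-ordinates.

```latex
% Rescale the series to [-1, 1] and encode each value as an angle:
\tilde{x}_k = \frac{(x_k - \max(x)) + (x_k - \min(x))}{\max(x) - \min(x)},
\qquad \phi_k = \arccos(\tilde{x}_k)

% Trigonometric sum (GASF) and difference (GADF):
\mathrm{GASF}_{i,j} = \cos(\phi_i + \phi_j), \qquad
\mathrm{GADF}_{i,j} = \sin(\phi_i - \phi_j)
```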

The visual representation of a single feature of an MVTS from the SWAN-SF dataset can be seen in the figure above.
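A minimal sketch of the GAF transform for a single feature with PyTS is shown below; `X_small` is assumed from the undersampling sketch above, and `image_size=24` matches the default image size discussed later.

```python
from pyts.image import GramianAngularField

gasf = GramianAngularField(image_size=24, method='summation')
gadf = GramianAngularField(image_size=24, method='difference')

feature_0 = X_small[:, 0, :]                 # (n_instances, 60) time series
img_gasf = gasf.fit_transform(feature_0)     # (n_instances, 24, 24)
img_gadf = gadf.fit_transform(feature_0)
```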

[Figure: MTF representation of a feature]

Markov Transition Field (MTF) is a framework in which the image/matrix is given by a sequential representation of Markov transition probabilities, which preserves information in the time domain. The time series is split into quantile bins, and a weighted adjacency matrix is constructed by counting the transitions among quantile bins of a first-order Markov chain along the temporal axis.

The figure above is an example of what a Markov Transition Field looks like; it was built from the first instance of the first feature of the SWAN-SF dataset. While binning the data, it was found that at least one sample among the time series was constant (its value was observed to be 0), which breaks quantile binning. Possible solutions, including excluding such non-informative features when building the image, are being explored.
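A sketch of the MTF transform, with one possible workaround for constant series (simply excluding them before binning), is shown below; this is not necessarily the solution the project will adopt.

```python
import numpy as np
from pyts.image import MarkovTransitionField

mtf = MarkovTransitionField(image_size=24, n_bins=8)

feature_0 = X_small[:, 0, :]
# Quantile binning fails for constant series, as noted above, so they are
# filtered out before transforming.
constant = feature_0.std(axis=1) == 0
img_mtf = mtf.fit_transform(feature_0[~constant])   # (n_valid, 24, 24)
```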

The Gramian matrices generated by the Gramian Angular Summation/Difference Field for the whole dataset form a 4-dimensional array (n_samples, n_features, img_width, img_height). For a single SWAN-SF MVTS, the Gramian matrix is of shape (33, 24, 24), where 33 is the number of features of the SWAN-SF dataset and 24 is the default image size set. Noting the impact of image size reported by Zhang et al., the image size will also be experimented with in this analysis.

Finally, the Gramian matrices of all 1000 instances were stored as a 4-dimensional numpy array, which will be fed to a CNN classifier.
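One way to assemble that array is to transform each feature channel separately and stack the results, as sketched below (reusing `gasf` and `X_small` from the earlier sketches; the output file name is hypothetical).

```python
import numpy as np

# Stack per-feature GASF images into a (n_instances, 33, 24, 24) array.
images = np.stack(
    [gasf.fit_transform(X_small[:, j, :]) for j in range(X_small.shape[1])],
    axis=1,
)
print(images.shape)                     # (1000, 33, 24, 24)
np.save("gramian_images.npy", images)   # hypothetical file name
```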

CNN Classification:

  • A CNN classifier accepts data in the form of multi-dimensional arrays or images themselves.

  • A 4-dimensional array of Gramian matrices of 1000 MVTS instances has been prepared to be fed to a Convolutional Neural Network (CNN).

  • As a second approach, these Gramian matrices would need to be stored as images and then fed to a CNN. Conventional colour images are of shape (3, image_width, image_height), where 3 is the number of channels (RGB). A grayscale image has a single channel, (1, image_width, image_height), and the largest common channel count is 4 (RGBA), where A represents the opacity of the colour. The Gramian matrix dataset, however, is a 4-D numpy array with 33 channels/features, and a 33-channel dataset cannot be converted and saved as a conventional image.


Convolutional Neural Network:

A convolutional neural network (CNN) is a class of deep neural network that has proven to be quite effective for image recognition and classification. CNNs are regularized versions of multilayer perceptrons: shared weights (filters) are applied across the input and optimized to reduce the loss function.

A convolutional neural network is usually made up of three basic layers:

  • Convolution Layer

  • Pooling Layer

  • Output Layer

Convolution Layer

[Figures: convolution, stride, and padding]

The convolution layer is used to extract features from an image by convolving it with a weight matrix, i.e. taking the summation of the element-wise multiplication between an image patch and the weight matrix. If the image is of size (n x n) and the weight matrix is of size (f x f), then the resulting matrix is of size (n-f+1) x (n-f+1). The size of the resulting matrix also depends on the stride of the weight matrix, i.e. by how many points the weight matrix moves over the image between element-wise multiplications.

To retain the dimensions of the image post convolution, padding is employed. Padding is the addition of layers of zeroes around the input image.

With p = padding, n = size of the input matrix, and f = size of the weight matrix, dimension-preserving padding is given by:

p = (f - 1) / 2

[(n + 2p) x (n + 2p) image] * [(f x f) filter] —> [(n x n) image]
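As a quick check of these formulas, the general output-size rule floor((n + 2p - f) / s) + 1 can be evaluated for a 3x3 filter on a 24x24 image:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# Dimension-preserving padding for f=3: p = (f - 1) / 2 = 1.
print(conv_output_size(24, 3))        # 22 -> image shrinks without padding
print(conv_output_size(24, 3, p=1))   # 24 -> image size preserved
```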


Pooling Layer

When the size of the images is too large, pooling is introduced to reduce computation time. Pooling is similar to undersampling a dataset. MaxPooling is a popular pooling layer often used to reduce the image to a manageable size for quicker computation. In MaxPooling, a kernel/filter matrix is passed over the image after convolution, and the maximum pixel/element inside the kernel is picked. For example, when a 2x2 MaxPool kernel is passed over a 4x4 image with a stride of 2, the maximum element of each 2x2 region is picked, and the resulting matrix is of size 2x2 (the size of the kernel itself).

Average Pooling and L2-norm pooling are a few of the other techniques used to reduce the image. A minimal MaxPooling example is shown below.
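The PyTorch snippet below (PyTorch is chosen since the classification stage of this project uses it) reproduces the 4x4 -> 2x2 reduction described above:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 7., 8.],
                    [3., 2., 1., 0.],
                    [1., 2., 3., 4.]]]])   # one 4x4 single-channel image

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))   # 2x2 result: [[6., 8.], [3., 4.]]
```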

Output Layer

The convolution and pooling layers only extract features and parameters from the images. A fully connected layer has to be applied after multiple layers of convolution and pooling to generate the final output, i.e. predict the class. The output layer uses a loss function as the metric for classification accuracy, i.e. it quantifies the error in prediction: the higher the loss, the higher the error. As defined before, CNNs are multilayer perceptrons whose weights are optimized against this loss function to increase the accuracy of the network. Once a forward pass is completed, a backward propagation is kicked off, based on the prediction error, to update the weights and reduce the loss. A minimal sketch of such a network is shown below.
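This sketch targets the 33-channel 24x24 Gramian images; the layer sizes are illustrative assumptions, not the project's final architecture.

```python
import torch.nn as nn

class FlareCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(33, 64, kernel_size=3, padding=1),   # (64, 24, 24)
            nn.ReLU(),
            nn.MaxPool2d(2),                               # (64, 12, 12)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # (128, 12, 12)
            nn.ReLU(),
            nn.MaxPool2d(2),                               # (128, 6, 6)
        )
        self.classifier = nn.Linear(128 * 6 * 6, n_classes)  # fully connected output

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = FlareCNN()
loss_fn = nn.CrossEntropyLoss()   # loss driving the backward propagation
```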


Testing the data on a CNN model and pre-trained PyTorch models:

The 4D Gramian matrix obtained during the transformation of the data into image-like objects needs to be fed to the CNN or PyTorch classifier models in batches using PyTorch's DataLoader. PyTorch accepts two types of dataset: map-style datasets and iterable datasets. A custom dataset/dataloader function that works in tandem with the PyTorch DataLoader has been designed; it picks random samples from the 4D Gramian matrix and returns both inputs and targets in batches, as sketched below.
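One way to realize this is a map-style wrapper implementing `__len__` and `__getitem__`, which lets DataLoader handle shuffling and batching; `images` and `y_small` are assumed from the earlier sketches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GramianDataset(Dataset):
    """Wraps the 4D Gramian array and labels for PyTorch's DataLoader."""
    def __init__(self, images, labels):
        self.images = torch.as_tensor(images, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

loader = DataLoader(GramianDataset(images, y_small), batch_size=32, shuffle=True)
for inputs, targets in loader:
    print(inputs.shape, targets.shape)   # (32, 33, 24, 24), (32,)
    break
```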

The top 5 features identified by Bobra et al. (2015) [B4] as the most informative have been selected for further analysis. The 4D Gramian matrix has to be further transformed to fit the CNN or PyTorch classifier models.
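If the feature names are available, selecting the top-5 channels reduces to slicing the 4D array. The names below are an assumption about which five features Bobra et al. (2015) rank highest and must be checked against the actual SWAN-SF column ordering.

```python
# Assumed top-5 feature names (to be verified against Bobra et al. 2015
# and the SWAN-SF column order); feature_names from the earlier sketch.
top5_names = ['TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH']
top5_idx = [feature_names.index(name) for name in top5_names]

images_top5 = images[:, top5_idx, :, :]   # (n_instances, 5, 24, 24)
```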


Roadmap:

Transformation:

  1. Test dataset on pretrained models in PyTorch.

  2. Build CNN for SWAN-SF dataset.


Classification:

Explore classifier architectures for the CNN.