CIA


The benchmark dataset that we are using is Space Weather ANalytics for Solar Flares (SWAN-SF), which is extracted from solar photospheric vector magnetograms in the Space-weather HMI Active Region Patch (SHARP) series. The SWAN-SF dataset consists of five partitions that together cover the period from May 2010 through August 2018.

See here for more information about the dataset.

GLANCE OF THE DATA

SWAN-SF Partition I contains 77270 multivariate time series. The following frequency diagrams are built with Seaborn (a Python data visualization library based on matplotlib). They show the flare distributions of Partition I for the five flare classes and for the grouped binary classes. A brief look at these diagrams shows that a class imbalance is present. The majority class is 'Q' (the quiet class), which appears to be the least threatening flare class because it has little to no effect on Earth. Our research pays more attention to the 'M' and 'X' class flares, the more powerful flares that can cause brief radio blackouts at the poles and minor radiation storms that might endanger astronauts.
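For reference, a frequency diagram like the ones below can be produced with a single Seaborn call. This is only a minimal sketch: the DataFrame df and its 'label' column are placeholders for our actual data.

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Count the instances of each flare class ('label' is a hypothetical column name).
    sns.countplot(x='label', data=df, order=['Q', 'B', 'C', 'M', 'X'])
    plt.show()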


Frequency distribution diagrams of the data with 5 class labels
Frequency distribution diagrams of the data with grouped binary class labels

Data Sampling

There are many ways to handle an imbalanced dataset; one of the most popular approaches is to balance it out, typically by undersampling the majority classes or oversampling the minority classes. To handle the unequal distribution across classes in our study, we create sub-datasets under both an undersampling and an oversampling methodology. This balancing approach helps us overcome the class imbalance issue and produces smaller sample datasets that significantly reduce the time needed to run the classification algorithms later on.

The first sample dataset is built by undersampling, which reduces the overall sample size of each class, mainly the majority 'C', 'B', and 'Q' (non-flaring) classes. This method draws around 2000 instances and preserves both the proportion of each non-flaring class among all non-flaring classes and the ratio of each flaring class among all flaring classes. The second sample dataset is built by oversampling, which expands the minority 'X' and 'M' (flaring) classes. In this method, we treat the C class as the "base" class and suppress the much larger Q class. That is, we keep all C and B instances, |C| and |B|, and shrink the Q class to 3|C| − (|C| + |B|), so the non-flaring side totals 3|C|. We then sample 3|C| instances from the X and M classes, split evenly between the two flaring classes.

In both cases above, a balanced ratio between the flaring and non-flaring classes of SWAN-SF Partition I is maintained.
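To make the scheme concrete, here is a minimal undersampling sketch. It assumes the Partition I labels are available in a pandas DataFrame column (the column name 'label' is hypothetical); our actual sampling code differs in its details.

    import pandas as pd

    def undersample(df, n_total=2000, flare_classes=('M', 'X'), seed=0):
        """Draw ~n_total instances, keeping flaring vs. non-flaring balanced 50/50
        while preserving the class proportions within each of the two groups."""
        is_flare = df['label'].isin(flare_classes)
        parts = []
        for group in (df[is_flare], df[~is_flare]):
            ratios = group['label'].value_counts(normalize=True)
            for cls, frac in ratios.items():
                n_cls = min(int(round(frac * n_total / 2)), (group['label'] == cls).sum())
                parts.append(group[group['label'] == cls].sample(n=n_cls, random_state=seed))
        return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle the result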


Sampling visualization with 5 class labels:

Undersample frequency distribution diagrams of the 5-class sample datasets extracted from the main SWAN-SF dataset
Oversample frequency distribution diagrams of the 5-class sample datasets extracted from the main SWAN-SF dataset

Sampling visualization with binary class labels:

Undersample frequency distribution diagrams of the binary-class sample datasets extracted from the main SWAN-SF dataset
Oversample frequency distribution diagrams of the binary-class sample datasets extracted from the main SWAN-SF dataset


LIBRARY & FUNCTIONALITY

A machine learning toolkit named tslearn is the main library used for the analysis in our project. This package builds on (and hence depends on) the scikit-learn, numpy, and scipy libraries. Other essential packages that we use include pyts, pandas, h5py, and matplotlib.

To build a tslearn model, the input must be formatted as a 3-D numpy array. Therefore, the original data format (i.e., a list of dicts) is converted into a 3-D numpy array representing all the multivariate time series, along with a 1-D array representing all the prediction labels. The three dimensions of the 3-D array correspond to the number of time series, the number of measurements per time series, and the number of dimensions (variables), respectively.
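A minimal sketch of this conversion is shown below. It assumes each record is a dict whose values are equal-length lists of measurements for one variable (the exact record layout in our pipeline differs slightly); tslearn's to_time_series_dataset stacks the series into the required 3-D array.

    import numpy as np
    from tslearn.utils import to_time_series_dataset

    def dicts_to_dataset(records, labels):
        """Convert a list of dicts into a (n_ts, sz, d) array plus a 1-D label array."""
        series = [np.column_stack([rec[key] for key in sorted(rec)]) for rec in records]
        X = to_time_series_dataset(series)   # shape: (n_ts, sz, d)
        y = np.asarray(labels)               # shape: (n_ts,)
        return X, y

    X, y = dicts_to_dataset(records, labels)  # 'records' and 'labels' are placeholders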

The figures on the right are snapshots of the data format before and after the transformation.


A demonstration of the data format required before and after transformation


Data Preprocessing

Real-world data is usually incomplete, inconsistent, and likely to contain errors. Likewise, the SWAN-SF dataset contains some missing values. To handle these inconsistent and missing values, a major functionality called imputation is required. This function fills in all the missing values and returns a more complete dataset, on which we can then perform normalization by applying either min-max normalization or standardization. The three major functionalities that transform our raw data are explained below, followed by a short usage sketch:

Min-Max Normalization scales the data into a certain range, usually from 0 to 1; a custom maximum and minimum can also be specified. The TimeSeriesScalerMinMax module from the tslearn library is applied to handle this scaling.

Standardization is similar to normalization but uses the mean and standard deviation for scaling. For this, we use another tslearn module, TimeSeriesScalerMeanVariance.

Imputation replaces all the missing data with substituted values. Because the tslearn library does not provide a dedicated module for this task, we implemented our own multivariate imputation algorithms to handle the missing data.


The pictures above show how we define our methods for the three major preprocessing functionalities
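A minimal usage sketch of the three steps is shown below; the per-series mean imputation here is only a simple stand-in for our own multivariate imputation logic.

    import numpy as np
    from tslearn.preprocessing import TimeSeriesScalerMinMax, TimeSeriesScalerMeanVariance

    def impute_mean(X):
        """Replace NaNs in a (n_ts, sz, d) array with each series' per-dimension mean."""
        means = np.nanmean(X, axis=1, keepdims=True)   # mean over time, per series and dimension
        return np.where(np.isnan(X), means, X)

    X_imputed = impute_mean(X)                                            # X from the conversion above
    X_minmax = TimeSeriesScalerMinMax().fit_transform(X_imputed)          # scale each series to [0, 1]
    X_standard = TimeSeriesScalerMeanVariance().fit_transform(X_imputed)  # zero mean, unit variance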


Classification

Resources for multivariate time series classification are very limited. Among the available time series classification algorithms, the Support Vector Machine (SVM) is, in our experience, a state-of-the-art classification method. SVM falls into the general category of kernel methods, which depend on the data only through dot products. The biggest advantage of using an SVM is that we can generate non-linear decision boundaries using methods that were designed for linear classifiers.
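As a rough illustration, the sketch below flattens the preprocessed 3-D array into fixed-length feature vectors and fits scikit-learn's SVC with an RBF kernel; this is a simplified stand-in for our full training pipeline, and the train/test split settings are only assumptions.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X_flat = X_standard.reshape(len(X_standard), -1)        # (n_ts, sz * d)
    X_train, X_test, y_train, y_test = train_test_split(
        X_flat, y, test_size=0.3, random_state=0, stratify=y)

    clf = SVC(kernel='rbf', C=10, gamma=0.1)                # values found by the tuning below
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)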


Hyperparameter Tuning

A hyperparameter is a parameter whose value is set before the learning process starts. Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm, in our case the SVM classification algorithm.

The SVM classifier has a set of hyperparameters: the soft margin constant C, and the parameters the kernel function depends on (such as the width of a Gaussian kernel or the degree of a polynomial kernel). The first hyperparameter, the soft margin constant C, acts as a regularization parameter in our SVM classifier. It controls the influence of each individual support vector and allows some examples to be "ignored" or placed on the wrong side of the margin. A larger value of C encourages a smaller margin if the decision function is then better at correctly classifying all the training samples. Another important hyperparameter in our tuning process is gamma, the free parameter of the Gaussian radial basis function. The gamma parameter defines how far the influence of a single training example reaches. If gamma is large, the variance of the Gaussian is small, implying that each support vector has only a localized rather than wide-spread influence. Technically speaking, a large gamma value therefore tends to produce models with low bias and high variance, which can easily overfit the training data.

The main score measurements implemented in our tuning process are TSS, HSS, and F1.


True Skill Statistic (TSS)

The True Skill Statistic compares the probability of detection (recall) with the probability of false detection (the fraction of actual negative, non-flaring instances that are wrongly predicted as flaring). Because it combines rates computed separately on the flaring and non-flaring classes, incorrect predictions on the minority class still have a clear impact on the final value. TSS ranges from -1 to +1, where the closer the value is to +1, the more accurate our prediction is.

The formula used to define True Skill Statistic score

Method that is defined for implementing the TSS measurement
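For reference, a hedged sketch of a TSS implementation from a binary confusion matrix is shown below (our own method is shown in the screenshot above and may differ in detail; the flaring class is assumed to be the positive label).

    from sklearn.metrics import confusion_matrix

    def tss(y_true, y_pred):
        """TSS = TP / (TP + FN) - FP / (FP + TN), i.e. recall minus the false alarm rate."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return tp / (tp + fn) - fp / (fp + tn)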

The formula used to define Heidke Skill Score
Method that is defined for implementing the HSS2 measurement



Heidke Skill Score (HSS2)

The Heidke Skill Score measures the improvement of the forecast over a random forecast. HSS ranges from -1 to +1, where a value of +1 indicates a perfect forecast and a value of -1 indicates a completely opposite forecast with no correct predictions. The closer the value is to 0, the less power the model has to distinguish between labels. In other words, when a model has HSS2 = 0, it is no better than predicting the labels randomly.
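Below is a sketch of an HSS2 implementation, using the form of the Heidke Skill Score commonly used in flare forecasting; our own method, shown in the screenshot earlier, may differ in detail.

    from sklearn.metrics import confusion_matrix

    def hss2(y_true, y_pred):
        """HSS2 = 2 (TP*TN - FN*FP) / ((TP + FN)(FN + TN) + (TP + FP)(FP + TN))."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))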



F1 Score

The F-1 score is defined as the harmonic mean of precision and recall. It ranges between 0 and 1; the closer the value is to 1, the better the model has performed.

The formula used to define F-1 Score
Method that is defined for implementing the F-1 measurement
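For the F-1 score we can simply rely on scikit-learn's built-in implementation, assuming the flaring class is encoded as the positive label 1.

    from sklearn.metrics import f1_score

    f1 = f1_score(y_test, y_pred, pos_label=1)   # harmonic mean of precision and recall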


Hyperparameter Tuning Results

The scikit-learn toolkit contains a built-in hyperparameter tuning model called GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator. Using this model, we trained our classifier with different C and gamma values. The range that we chose for the parameter C is [0.0001, 1000], and the range for the parameter gamma is also [0.0001, 1000]. During the search, the best-fitting parameters that we have found so far are {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}, with an average tuning score of 0.7069.
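A minimal sketch of such a search is shown below; the exact grid, the cross-validation setting, and wiring the TSS function in through make_scorer are illustrative rather than our exact configuration.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from sklearn.metrics import make_scorer

    param_grid = {
        'kernel': ['rbf', 'sigmoid'],
        'C': np.logspace(-4, 3, 8),        # 0.0001 ... 1000
        'gamma': np.logspace(-4, 3, 8),    # 0.0001 ... 1000
    }
    search = GridSearchCV(SVC(), param_grid, scoring=make_scorer(tss), cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)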

The heatmaps of the classifier shown below present the grid search hyperparameter tuning accuracy as a function of C and gamma. Each color block inside a heatmap has a meaning: a deeper color represents a higher tuning score, so the deepest color in the graph marks the best hyperparameters found during the tuning process. If the best parameters lie on the boundary of the heatmap grid, we can consider extending the grid in that direction with a subsequent search.




Model performance based on accuracy score before and after hyperparameter tuning
All the pre-defined hyperparameters in our tuning process

Results from TSS Measurements

Based on the TSS measuring scores, we can see that the 'rbf' kernel performs significantly better than the 'sigmoid' kernel. Across the two heatmaps, the same colors correspond to the same range of score values. For example, the top-right area of each heatmap, where the color is lightest, represents the hyperparameters {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'} and {'C': 1000, 'gamma': 0.0001, 'kernel': 'sigmoid'}, both of which have a grid search tuning accuracy of 0.


Tuning results based on TSS score with kernel 'rbf'
Tuning results based on TSS score with kernel 'sigmoid'

Results for HSS2 Measurements

Similarly for the HSS2 measuring score, the 'rbf' kernel performs significantly better than the 'sigmoid' kernel. Our best parameters still lie toward the bottom-left edge of the tuning grid, as with the TSS measurement. As we stated earlier, we can consider extending the grid in that direction with a subsequent search.


Tuning results based on HSS2 score with kernel 'rbf'
Tuning results based on HSS2 score with kernel 'sigmoid'

Results for F-1 Measurements

From the F-1 score heatmap, we can see that the best parameter result is very close to the results found with the other two measurements. As with the other two measurements, we can consider extending the grid toward the bottom left with a subsequent search.


Tuning results based on F-1 score with kernel 'rbf'
Tuning results based on F-1 score with kernel 'sigmoid'

Detailed Hyperparameter Tuning Results


hyperparameter_results


Experimental Evaluation Analysis

Two experimental evaluations are conducted in this section. The experiments train an SVM classifier based on the top 5 features identified in the paper by Bobra et al. (2015). The top 5 parameters are ['TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH']. See here for more information about these parameters and the paper.

In experiment A, each of the top 5 parameters is trained individually using the optimal hyperparameters {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}, and then tested on the testing dataset. In experiment B, the five parameters are grouped together and used as the inputs to the model. We keep the hyperparameter settings the same for experiments A and B so that the comparison is relatively fair. A sketch of both setups is shown below.
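The sketch below outlines both experiments. It assumes a dictionary param_index that maps each SHARP parameter name to its dimension index in the 3-D array X (the mapping is hypothetical and depends on how the data were assembled), and it uses a simple random train/test split rather than our actual partition-based split.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    top5 = ['TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH']
    best = dict(kernel='rbf', C=10, gamma=0.1)

    def run_experiment(dims):
        """Train and test an SVC on the selected time series dimensions only."""
        X_sel = X[:, :, dims].reshape(len(X), -1)     # in practice the preprocessed array is used
        X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3,
                                                  random_state=0, stratify=y)
        clf = SVC(probability=True, **best).fit(X_tr, y_tr)
        return clf, X_te, y_te

    # Experiment A: one model per parameter; Experiment B: all five parameters together.
    results_a = {p: run_experiment([param_index[p]]) for p in top5}
    clf_b, X_test_b, y_test_b = run_experiment([param_index[p] for p in top5])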


Evaluation - Receiver Operating Characteristic (ROC) Curve

The ROC curve is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different threshold values between 0.0 and 1.0. In other words, it summarizes the trade-off between the true positive rate and the false positive rate of a predictive model over different probability thresholds. The shape of the curve contains a lot of information: smaller values on the x-axis indicate lower false positives and higher true negatives, while larger values on the y-axis indicate higher true positives and lower false negatives.

In addition, the area under the curve (AUC) is another important measurement that can be used as a summary of the model's skill. The larger the AUC value, the better the model performs. In our experiment A, each individual parameter achieves an AUC score lower than 80%. However, in experiment B (where we combine all the top 5 parameters), the performance improves, with the AUC score rising by about 5%.
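A minimal plotting sketch with scikit-learn and matplotlib is shown below, assuming the binary labels are encoded as 0/1 with 1 marking the flaring class and that clf_b, X_test_b, and y_test_b come from the experiment sketch above.

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    y_score = clf_b.predict_proba(X_test_b)[:, 1]       # probability of the flaring class (label 1)
    fpr, tpr, _ = roc_curve(y_test_b, y_score, pos_label=1)
    plt.plot(fpr, tpr, label='ROC (AUC = %.3f)' % auc(fpr, tpr))
    plt.plot([0, 1], [0, 1], linestyle='--', label='no skill')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.show()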



Evaluation - Precision Recall (PR) Curve

A precision-recall curve is a plot of the precision (y-axis) against the recall (x-axis) for different thresholds, very similar to the ROC curve. It shows the trade-off between precision and recall across these thresholds. Reviewing both precision and recall is useful when there is an imbalance in the observations between the two classes, as in our case, where the non-flaring (CBN) classes form the majority and the flaring (XM) classes form the minority. With a large number of CBN instances, we are less interested in the model's skill at predicting the non-flaring instances correctly, which would give us relatively high true negative counts. The main advantage of the precision and recall measurements is that their calculation avoids the use of true negatives and is only concerned with the correct prediction of our minority class.
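A corresponding precision-recall sketch, under the same assumptions as the ROC example above:

    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve

    precision, recall, _ = precision_recall_curve(y_test_b, y_score, pos_label=1)
    plt.plot(recall, precision, label='PR curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend()
    plt.show()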



Evaluation - Reliability Diagram (Calibration Plots)

A reliability diagram plots the calibrated probabilities of the class labels produced by a classification model rather than just the predicted class labels. It shows the observed frequency of an event as a function of its forecast probability. A perfectly calibrated model lies along the diagonal of the graph (the dotted line shown below).
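A minimal reliability-diagram sketch using scikit-learn's calibration_curve, again assuming 0/1 labels and the predicted flaring probabilities from the ROC example:

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve

    prob_true, prob_pred = calibration_curve(y_test_b, y_score, n_bins=10)
    plt.plot(prob_pred, prob_true, marker='o', label='SVC')
    plt.plot([0, 1], [0, 1], linestyle=':', label='perfectly calibrated')
    plt.xlabel('Forecast probability')
    plt.ylabel('Observed frequency')
    plt.legend()
    plt.show()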



Evaluation - Measurement Score

As we can see below in the measurement scores, for each individual parameter trained in the SVM classifier, the two skill score measurements are relatively low. Even for the best-performing parameter in the experiments, 'ABSNJZH', the TSS and HSS2 are still below 40%. However, when all the top 5 parameters are grouped together, the skill scores increase by around 10%. These results show that these parameters work well for SVM classification and that there is room for further improvement of this model. It is also worth noting that the nearly identical TSS and HSS2 values are caused by the balance of our sampled dataset. We believe a future improvement could be to explore different sampling schemes that extend our current sampled dataset.