The SWAN-SF dataset has been partitioned into five sets. The work shown below uses partition 1, which spans May 1, 2010 to February 28, 2012 and amounts to a total of 7,270 multivariate time series (MVTS).
For insight into the data partition, a histogram and the count of each flare class are shown.
The Q class dominates the count, revealing the imbalance in the dataset.
This imbalance severely hampers the ability of machine learning models to learn and be evaluated fairly on each flare class.
Bar plot showing the count of each flare class in the partition 1 dataset of SWAN-SF.
To counter this imbalance, the data will be resampled so as to balance the 'strong' and 'weak' flare classes. The strong class consists of X and M flares, and the weak class is the combination of Q, C and B flares. For resampling, the idea is to undersample the data while trying to preserve the climatology of the initial dataset.
Original distribution of combined strong and weak flares, showing the contribution of each class to the total of 77k+ time series.
Undersampled distribution of the XM and BCQ flare classes, showing each class's contribution to a balanced dataset of 1,000 time series.
For further analysis we create a binary-class dataset that merges the BCQ and XM flares into two individual classes, then resample to 1,000 time series while preserving the climatology.
Bar plots showing the count of each flare class after undersampling, creating a 'pet' dataset of 1,000 time series.
For undersampling we take 500 flares in each of the two binary classes ('XM' and 'BCQ') and scale them down from the original distribution to a proportionate distribution. For example, we scale down the Q flares from the original distribution while maintaining their ratio with the C and B classes.
For analysis we also make another pet dataset with a binary class distribution of strong and weak flare classes.
tslearn is a Python package that provides machine learning tools for the analysis of time series. It builds on other commonly used Python libraries.
tslearn.utils provides helpers for transforming time series.
Dynamic Time Warping (DTW) is a similarity measure between time series. Given two time series x = (x_0, ..., x_{n-1}) and y = (y_0, ..., y_{m-1}), DTW finds the temporal alignment (warping path) between them that minimizes the cumulative distance between aligned elements.
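A minimal sketch of computing DTW with tslearn (note the two series need not have the same length):

```python
import numpy as np
from tslearn.metrics import dtw

# Two short univariate series of different lengths.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.5, 2.5, 3.5, 4.0])

# dtw returns the DTW similarity score between the two series.
print(dtw(x, y))
```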
In tslearn, a time series is nothing more than a two-dimensional numpy array with its first dimension corresponding to the time axis and the second one being the feature dimensionality (1 by default).
Then, if we want to manipulate sets of time series, we can cast them to three-dimensional arrays, using to_time_series_dataset.
Our pet dataset is processed with tslearn using to_time_series_dataset, which converts the collection of individual time series into a numpy 3D array of shape (n, T, d), where n is the number of time series, T is their length and d is the dimensionality.
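For example, series of unequal lengths are padded with NaN up to the longest length:

```python
from tslearn.utils import to_time_series_dataset

# Two univariate series of lengths 3 and 4.
X = to_time_series_dataset([[1, 2, 3], [1, 2, 3, 4]])
print(X.shape)  # (n, T, d) -> (2, 4, 1); the shorter series is NaN-padded
```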
Diving deeper into the dataset, the major roadblocks to processing the data with ML algorithms are the varied scales of the parameters, NaN values and limited computational power.
Normalization handles the irregular scales of the parameters, standardization makes sure the mean of the data is 0 and the standard deviation is 1, imputation handles non-numeric values, and resampling lets us work with a smaller dataset to ease the computational load. These steps are built on functionality from the tslearn library and are sketched after the list below.
Usually, when splitting the dataset into training and testing data, we prefer that the testing data remain untouched and completely hidden from the training data, to avoid bias creeping into the classification. An extension of this error can arise when we balance the testing dataset along with the training dataset: balancing can introduce into the testing set a distribution similar to that of the training set.
To avoid this issue, the testing dataset is instead drawn exclusively from a different partition (partition 3) and then used to test classifier accuracy.
Normalization: min-max normalization to (-1, 1)
Imputation: replace NaN values with the mean
Resampling: taking a smaller sample to ease the computational load
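A sketch of these steps with tslearn and numpy. The array shapes, label strings and random stand-in data are assumptions for illustration; the real X comes from to_time_series_dataset above:

```python
import numpy as np
from tslearn.preprocessing import TimeSeriesScalerMinMax

rng = np.random.default_rng(0)

# Stand-in data: 5000 series, 60 time steps, 33 parameters (shapes assumed).
X = rng.normal(size=(5000, 60, 33))
X[rng.random(X.shape) < 0.01] = np.nan          # sprinkle NaNs for the demo
y = rng.choice(["XM", "BCQ"], size=5000)        # assumed binary labels

# Imputation: replace NaN values with the (global) mean.
X = np.where(np.isnan(X), np.nanmean(X), X)

# Normalization: min-max scale each series to the range (-1, 1).
X = TimeSeriesScalerMinMax(value_range=(-1.0, 1.0)).fit_transform(X)

# Resampling: draw a smaller random subset of series to ease computation.
idx = rng.choice(len(X), size=1000, replace=False)
X, y = X[idx], y[idx]
```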
Note: going ahead, we will resample the dataset to include only 20 of the existing 33 parameters.
Distribution of partition 3 without balancing the data.
The dataset of 1,000 time series is randomly split into training and testing sets in a ratio of 3:1 respectively. This is done using sklearn's model selection utilities.
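A minimal sketch, assuming X and y are the 1,000 preprocessed series and labels from above; test_size=0.25 yields the 3:1 ratio (stratify is an addition here that keeps the class balance in both halves):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```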
The k-Nearest Neighbors algorithm classifies a sample based on its 'k' nearest neighbors in the training data; the resulting model predicts the class of any given sample and can be evaluated with an accuracy measure.
The tslearn package offers a KNeighbors classifier for time series data which takes parameters such as the number of neighbors, the weights and the distance metric. As an initial example, I am taking 2 nearest neighbors and DTW as the metric, with the default uniform weight distribution.
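A sketch with tslearn's KNeighborsTimeSeriesClassifier, reusing the split from above:

```python
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

# 2 nearest neighbors under the DTW metric, default uniform weights.
knn = KNeighborsTimeSeriesClassifier(n_neighbors=2, metric="dtw")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # mean accuracy on the test set
```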
In their basic formulation, SVMs are supervised learning models that find a linear decision function (whose sign gives the predicted class) that both minimizes the prediction error on the training set and promises the best generalization performance.
The objective of a Linear SVC (Support Vector Classifier) is to fit the data you provide, returning a "best fit" hyperplane that divides, or categorizes, the data. After obtaining the hyperplane, you can feed features to the classifier to see what the predicted class is.
Kernels: the main function of the kernel is to take a low-dimensional input space and transform it into a higher-dimensional space. It is mostly useful in non-linear separation problems.
C (regularisation): C is the penalty parameter, which weights the misclassification (error) term. It tells the SVM optimisation how much error is bearable, and thereby controls the trade-off between a smooth decision boundary and misclassification. When C is high the optimiser tries to classify all training points correctly, which also brings a chance of overfitting.
Gamma: it defines how far the influence of a single training example reaches in the calculation of a plausible line of separation. When gamma is high, only nearby points have a strong influence; when gamma is low, far-away points are also considered in fitting the decision boundary.
tslearn provides a TimeSeriesSVC classifier which takes the hyperparameters C, kernel and gamma. The support vectors for each class are shown for the "rbf" kernel.
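A sketch of fitting TimeSeriesSVC and inspecting the per-class support vectors; probability=True is an addition here so that predict_proba is available for the reliability diagrams later:

```python
from tslearn.svm import TimeSeriesSVC

svc = TimeSeriesSVC(C=1.0, kernel="rbf", gamma="auto", probability=True)
svc.fit(X_train, y_train)

# support_vectors_ is a list with one array of support series per class.
for sv in svc.support_vectors_:
    print(sv.shape)
```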
A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric.
GridSearchCV
We use GridSearchCV from sklearn to find better values for the C, gamma and kernel hyperparameters. The gamma hyperparameter affects the curvature of the decision boundary and is directly proportional to it. The soft-margin constant C, on the other hand, affects the margin of the hyperplane: a greater value of C implies a tighter margin, and vice versa. For our analysis we choose between the polynomial and RBF kernels. The degree of the polynomial kernel also affects the curvature of the boundary.
GridSearchCV is preferred over RandomizedSearchCV here because of its simplicity and interpretability.
A hyperparameter is a parameter whose value is set before the learning process begins. Hyperparameter tuning is choosing a set of optimal hyperparameters for a learning algorithm. The major techniques for choosing hyperparameters are grid search and random search.
Grid search is a traditional way to perform hyperparameter optimization: it searches exhaustively through a specified subset of hyperparameter values. I will use sklearn's GridSearchCV to run a grid search on the KNN classifier, as sketched below.
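A sketch of that grid search, with the candidate values taken from the results reported below:

```python
from sklearn.model_selection import GridSearchCV
from tslearn.neighbors import KNeighborsTimeSeriesClassifier

param_grid = {
    "n_neighbors": [2, 5, 25],
    "weights": ["uniform", "distance"],
}

# tslearn estimators follow the sklearn API, so GridSearchCV applies directly.
knn_search = GridSearchCV(
    KNeighborsTimeSeriesClassifier(metric="dtw"), param_grid, cv=3
)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_, knn_search.best_score_)
```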
Alternating between 2, 5 and 25 nearest neighbors and between the two weight distributions, the best hyperparameter combination is k = 2 with distance-weighted neighbors.
Best Parameter Evaluation with 'Poly' and 'RBF' Kernels:
Radial Basis Function (RBF) Kernel:
The radial basis kernel measures the similarity of two examples simply by their Euclidean distance.
GridSearch hyperparameter tuning is done over 'C': [0.1, 1, 10, 100, 1000] and 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], as sketched below.
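A sketch of this search using TimeSeriesSVC; cv=3 is an assumed fold count, and swapping in ["poly"] for the kernel reproduces the polynomial-kernel search further down:

```python
from sklearn.model_selection import GridSearchCV
from tslearn.svm import TimeSeriesSVC

param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf"],  # use ["poly"] for the polynomial-kernel search
}

svc_search = GridSearchCV(TimeSeriesSVC(probability=True), param_grid, cv=3)
svc_search.fit(X_train, y_train)
print(svc_search.best_params_, svc_search.best_score_)
```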
We get a classification score of 0.61, which improves on the default SVM values.
The RBF kernel nonlinearly maps samples into a higher-dimensional space, so unlike the linear kernel it can handle the case when the relation between class labels and attributes is nonlinear.
Polynomial Kernel:
The polynomial kernel looks not only at the given features of the input samples to determine their similarity, but also at combinations of these features.
GridSearch hyperparameter tuning is done over the same grid, 'C': [0.1, 1, 10, 100, 1000] and 'gamma': [1, 0.1, 0.01, 0.001, 0.0001].
For the best parameters we get a classification score of 0.52.
A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers.
The ROC curve shows the trade-off between sensitivity (TPR) and specificity (1 − FPR). Classifiers that give curves closer to the top-left corner indicate better performance. As a baseline, a random classifier is expected to give points lying along the diagonal.
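A sketch of plotting the ROC curve for the tuned classifier; it assumes the binary labels are the strings 'XM' and 'BCQ', so that predict_proba's second column (classes sorted alphabetically) is the 'XM' probability:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability of the positive ('XM') class from the tuned classifier.
y_score = svc_search.best_estimator_.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score, pos_label="XM")
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```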
Compared with the RBF kernel, the 'TN' and 'FP' counts are higher for the polynomial kernel.
Both of these values feed into the false positive rate and push the poly ROC curve towards the left edge.
RBF Kernel: Area Under the Curve = 0.88
Poly Kernel: Area Under the Curve = 0.90
Reliability diagrams allow checking if the predicted probabilities of a binary classifier are well calibrated. For perfectly calibrated predictions, the curve in a reliability diagram should be as close as possible to the diagonal/identity.
We bin the predicted probabilities into 10 bins, as sketched below.
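A sketch of the reliability diagram with sklearn's calibration_curve, reusing y_score from the ROC sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Fraction of positives vs. mean predicted probability, over 10 bins.
prob_true, prob_pred = calibration_curve(y_test == "XM", y_score, n_bins=10)
plt.plot(prob_pred, prob_true, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], "k--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```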
From our diagram we can infer that the poly kernel is better calibrated than the RBF kernel.
Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. For the SVC estimator, score returns the mean accuracy on the given test data and labels. Plain accuracy is not a good measure for imbalanced problems, so we look at alternative scoring strategies:
F1 score: a classic metric, varying within the range [0, 1], widely used for imbalanced datasets in all domains. It is the harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall).
True Skill Statistic (TSS): a metric that is simply the True Positive Rate minus the False Positive Rate (recall − FPR). TSS ranges within [−1, 1]; random or constant forecasts score 0, perfect forecasts score 1, and forecasts that are always wrong score −1.
Heidke Skill Score (HSS): it quantifies the performance of a model by comparing it with a model that predicts randomly. It ranges within the interval [−1, 1].
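A sketch computing all three scores from the binary confusion matrix; the 'BCQ'/'XM' label strings are the assumption carried through from earlier:

```python
from sklearn.metrics import confusion_matrix, f1_score

def skill_scores(y_true, y_pred):
    # With labels=[negative, positive], ravel() yields TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(
        y_true, y_pred, labels=["BCQ", "XM"]
    ).ravel()
    f1 = f1_score(y_true, y_pred, pos_label="XM")
    tss = tp / (tp + fn) - fp / (fp + tn)  # recall minus FPR
    hss = 2 * (tp * tn - fn * fp) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    )
    return f1, tss, hss

print(skill_scores(y_test, svc_search.best_estimator_.predict(X_test)))
```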
Bobra et al. (2015) worked with GOES flare data to classify and forecast solar flares.
They applied feature selection to identify the top 5 features useful in discriminating between flaring and non-flaring active regions.
The 5 important features suggested were ['TOTUSJH', 'TOTBSQ', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH'].
Applying GridSearchCV to identify the best hyperparameter combination for each of these parameters, and then comparing with her work, lets us check whether the importance of these parameters carries across datasets.
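A sketch of the per-feature searches: slice the (n, T, d) cube down to one parameter at a time and rerun the grid search from above. The params list of ordered column names is a hypothetical stand-in for the SWAN-SF column metadata:

```python
top5 = ["TOTUSJH", "TOTBSQ", "TOTPOT", "TOTUSJZ", "ABSNJZH"]

for name in top5:
    j = params.index(name)           # params: ordered d-axis names (assumed)
    X_tr = X_train[:, :, j:j + 1]    # keep the 3D shape (n, T, 1)
    X_te = X_test[:, :, j:j + 1]
    svc_search.fit(X_tr, y_train)    # rerun the same grid search per feature
    print(name, svc_search.best_params_, svc_search.score(X_te, y_test))
```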
Per-feature grid search results for TOTUSJH, TOTBSQ, TOTPOT, TOTUSJZ and ABSNJZH.
Implementing GridSearchCV on all five features combined gives {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'} as the best hyperparameter combination for the RBF kernel.
This combination gives an accuracy of 0.616, which is almost equal to the mean accuracy of the 5 features evaluated separately (0.624).
Since the scores are almost identical, we can continue with these 5 features and work on tuning the hyperparameters.
Feature selection can be extended to other parameters to find more such combinations.
For this reliability curve we plot the 5 single-feature classifiers suggested by Bobra et al., each with its own posterior probabilities. Along with these 5 classifiers we also include the combined classifier to compare their calibrations.
The variance in all the plots is similar, suggesting consistency with the results from Bobra et al. We should look into tuning hyperparameters with these 5 features.