Data

SWAN-SF: Space Weather ANalytics for Solar Flares


Data: [Xa]

The SWAN-SF dataset was recently accepted (May 2020) for publication in Scientific Data (by Nature). Students in this course are the first to work on the official release of this large dataset.

High-quality time series data for solar flare prediction have only recently become accessible. Many previous and current projects use point-in-time measurements, and it is possible that a time-series approach will enable new progress in applying machine learning to flare prediction. Here we will use a benchmark dataset named Space Weather ANalytics for Solar Flares (SWAN-SF), released by Angryk et al. [Xa] and made entirely of multivariate time series, with the aim of carrying out unbiased flare forecasting and, hopefully, settling whether a time-series approach actually helps.

The SWAN-SF dataset is made of five partitions (see Fig. 2). These partitions are temporally non-overlapping and divided in such a way that each contains approximately the same number of X- and M-class flares (see Fig. 1). The data points in this dataset are time series slices of physical (magnetic field) parameters extracted from flaring and flare-quiet regions in a sliding fashion. That is, for a particular flare with a unique id, k equal-length multivariate time series are collected from a fixed period of time in the history of that flare. This period is called an observation window, denoted by T_obs, and spans over 24 hours. Given that t_i indicates the starting point of the i-th slice of the multivariate time series, the (i+1)-th slice starts at t_i + τ, where T_obs = 8τ. [Xb]
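The sliding scheme described above can be sketched in a few lines. This is only an illustration, assuming that T_obs is exactly 24 hours (so that τ = T_obs / 8 = 3 hours); the function name slice_starts and the example start date are hypothetical and not part of SWAN-SF.

```python
from datetime import datetime, timedelta

# Assumption: the observation window is taken to be exactly 24 hours.
T_OBS = timedelta(hours=24)
# From the relation T_obs = 8τ.
TAU = T_OBS / 8


def slice_starts(t_0, k):
    """Return the start times t_0, t_0 + τ, ..., t_0 + (k-1)τ of k consecutive slices."""
    return [t_0 + i * TAU for i in range(k)]


# Hypothetical start time, used only to show the spacing between slices.
starts = slice_starts(datetime(2011, 1, 1), 3)
for t in starts:
    print(t)
```

Under this assumption, consecutive slices start 3 hours apart.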

Figure 1. Counts of the five flare classes across five partitions. (The numbers correspond to v0.7 of SWAN-SF data, and may vary in the most recent version.)
Figure 2. Time span considered for each partition of the SWAN-SF dataset, shown with the monthly average number of sunspots per day and the daily variance, represented as a gray ribbon (Source: WDC-SILSO, Royal Observatory of Belgium, Brussels).

Magnetic Field Parameters






SWAN-SF contains a collection of 82 physical parameters derived from the vector magnetic field data. These parameters are potentially important for analyzing and forecasting solar flares and coronal mass ejections.

The following file [Xa] lists 24 of those magnetic-field parameters, with their formulas and units:

magnetic_field_parameters.pdf

The version of SWAN-SF that we work with includes the following 33 physical parameters:

['TOTPOT', 'MEANGBT', 'RBT_VALUE', 'PIL_LEN', 'TOTBSQ', 'MEANSHR', 'TOTFY', 'TOTUSJZ', 'ABSNJZH', 'TOTFZ', 'RBZ_VALUE', 'SHRGT45', 'MEANJZD', 'TOTUSJH', 'BT_FDIM', 'BP_FDIM', 'MEANALP', 'SAVNCPP', 'TOTFX', 'MEANPOT', 'EPSZ', 'MEANGBZ', 'MEANJZH', 'MEANGBH', 'R_VALUE', 'FDIM', 'MEANGAM', 'RBP_VALUE', 'EPSY', 'EPSX', 'BZ_FDIM', 'XR_MAX', 'USFLUX']

Get Started with Data

Click the button on the right and download partition I:

  • train_partition1_data.json (3.3GiB)

This file contains 77270 multivariate time series, which is more than enough for almost all of our experiments. At the end, for a proper evaluation, we may use other partitions as well.

The snippet of code below shows the proper way of reading the data.

Note: This is a large file, and having gigabytes of data loaded in memory is usually not the best idea. For this reason, instead of storing one big dictionary (json), I created small dictionaries and separated them with \n. This way, you can load as many mvts instances as you need for different tasks.

import json

fname = 'path/to/train_partition1_data.json'

mvts_list = []

with open(fname) as infile:
    for line in infile:  # each line is a dictionary
        d: dict = json.loads(line)  # each dictionary is a mvts
        for k, v in d.items():
            mvts_list.append(v)
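Because every line of the file is a self-contained dictionary, it is also possible to stop reading early and keep memory usage low. The following is a minimal sketch of that idea, assuming the same line-delimited layout as above; the helper name load_mvts and its parameter n are hypothetical and are not part of the course material.

```python
import json
from itertools import islice


def load_mvts(fname, n=None):
    """Read at most n lines of the file (all lines if n is None)
    and collect the values of each line's dictionary."""
    mvts_list = []
    with open(fname) as infile:
        # islice stops after n lines, so the rest of the file is never parsed.
        for line in islice(infile, n):
            line = line.strip()
            if not line:
                # Blank lines are skipped (they still count toward n).
                continue
            d = json.loads(line)
            mvts_list.extend(d.values())
    return mvts_list
```

For example, load_mvts('path/to/train_partition1_data.json', n=1000) would parse only the first 1000 lines, which may be enough for quick experiments before moving to the full partition.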