Many interesting approaches have been proposed for utilizing machine learning algorithms to forecast solar flares [A-1:8]. While there is much room for improvement in the achieved forecast performance, the real challenge seems to lie in finding a way to compare the models fairly. The differences between these strategies are not limited to the choice of machine learning algorithm or the architecture of DNNs. Dataset collection, data preprocessing, sampling methods, training and validation strategies, and the verification metrics used are among the major differences that make these studies simply incomparable.
To address this very issue, in 2020, DMLab created a benchmark dataset, named Space Weather Analytics for Solar Flares (SWAN-SF) [A-1]. Using this dataset as a test bed for flare forecast models, while avoiding the bad practices we previously highlighted [A-2], can indeed mitigate the comparability issue.
We have conducted several preliminary studies on this dataset [A-3:9] to understand the challenges facing the flare forecasting task and to explore more innovative avenues. One challenge that has yet to be investigated is ranking the physical parameters in order of their usefulness for predicting flare activity. A reliable ranking of these parameters is highly valuable both to the heliophysics community, which is interested in the formation of solar flares, and to the machine learning community, which can then work with a manageable dataset of important parameters and utilize more computationally demanding algorithms.
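As a rough illustration of what such a ranking could look like, the sketch below scores each physical parameter against the flare labels using a simple univariate criterion. The synthetic arrays, the use of the series mean as a summary statistic, and the choice of mutual information (via scikit-learn) are illustrative assumptions, not the SWAN-SF data loading or DMLab's actual ranking method.

```python
# Minimal sketch: ranking multivariate time-series parameters by a univariate score.
# Shapes, labels, and the mean-based summary are stand-ins for the real SWAN-SF data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Stand-in data: n_samples slices, n_params physical parameters, n_steps time steps.
n_samples, n_params, n_steps = 500, 24, 60
X = rng.normal(size=(n_samples, n_params, n_steps))
y = rng.integers(0, 2, size=n_samples)          # 1 = flaring, 0 = non-flaring

# Collapse each time series to one summary value per parameter (here, the mean),
# then score each parameter against the labels with mutual information.
X_summary = X.mean(axis=2)                      # shape: (n_samples, n_params)
scores = mutual_info_classif(X_summary, y, random_state=0)

# Report the parameters in decreasing order of estimated usefulness.
ranking = np.argsort(scores)[::-1]
for rank, idx in enumerate(ranking[:5], start=1):
    print(f"rank {rank}: parameter {idx} (MI = {scores[idx]:.4f})")
```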
Since the data points in SWAN-SF are multivariate time series, in this summer code sprint we will explore feature subset selection algorithms for high-dimensional data. This will require efficient programming as well as familiarity with Docker containers and Unix systems for connecting to DMLab's server and using our computing resources.
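One family of candidate methods is wrapper-style subset selection, sketched below with scikit-learn's greedy forward selector. The synthetic per-parameter summary features, the decision-tree estimator, and the balanced-accuracy score are placeholder assumptions chosen only to keep the example self-contained; they are not the algorithms the sprint will necessarily settle on.

```python
# Minimal sketch: greedy forward feature subset selection over per-parameter
# summary features. All data and model choices here are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n_samples, n_params = 500, 24
X_summary = rng.normal(size=(n_samples, n_params))   # one summary value per parameter
y = rng.integers(0, 2, size=n_samples)               # 1 = flaring, 0 = non-flaring

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    n_features_to_select=5,       # keep a small, manageable subset of parameters
    direction="forward",          # greedily add the parameter that helps most
    scoring="balanced_accuracy",  # flare datasets are heavily class-imbalanced
    cv=5,
)
selector.fit(X_summary, y)
print("selected parameter indices:", selector.get_support(indices=True))
```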