Summer Code Sprint 2021: Multivariate Time Series Feature Selection on Heliophysics Big Data
Duration: 7 weeks (June 7 - July 26, 2021)
Place: Online [virtual]
Application deadline: June 1, 2021
Course: Directed Readings -- CSC 6999 -- 4 Credit Hours
Grade: The final grade will count towards your GPA.
Schedule: Mondays 10:30 - 13:30 (EDT)
Prerequisite: Machine Learning (CSC 6850 / CSC 8850) and/or Deep Learning (CSC 8851).
"Solar flares are a sudden explosion of energy caused by tangling, crossing or reorganizing of magnetic field lines near sunspots. Solar flares release a lot of radiation into space. If a solar flare is very intense, the radiation it releases can interfere with our radio communications here on Earth." [NASA]
There have been many interesting approaches to using machine learning algorithms for forecasting solar flares [A-1:8]. While there is much room for improvement in the achieved forecast performance, the real challenge seems to be in finding a way to compare the models fairly. The differences between the strategies are not limited to the choice of a machine learning algorithm or the architecture of DNNs. The collection of datasets, preprocessing of data, sampling methods, training and validation strategies, and the verification metrics used are some of the major differences that make these studies simply incomparable.
To address this very issue, in 2020, DMLab created a benchmark dataset, named Space Weather Analytics for Solar Flares (SWAN-SF) [A-1]. Using this dataset as a test bed for flare forecast models, while avoiding the bad practices we previously highlighted [A-2], can indeed mitigate the comparability issue.
We have conducted several preliminary studies on this dataset [A-3:9] in order to understand the challenges of the flare forecasting task and to explore more innovative avenues. One challenge that is yet to be investigated is ranking the physical parameters in order of their usefulness for predicting flare activity. A reliable ranking of these parameters is highly valuable both for the heliophysics community, which is interested in the formation of solar flares, and for the machine learning community, which can then work with a manageable dataset of important parameters and apply more computationally demanding algorithms.
Since the data points in SWAN-SF are multivariate time series, in this summer code sprint we will explore feature subset selection algorithms for high-dimensional data. This will require efficient programming, as well as familiarity with Docker containers and Unix systems for connecting to DMLab's server and using our computing resources.
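To make the idea of ranking parameters concrete, here is a minimal sketch of one simple filter-style approach: each parameter's time series is collapsed to its temporal mean, and the parameters are then ranked by a Fisher score (between-class variance over within-class variance). This is only an illustration under stated assumptions, not the method used in the sprint. The data are synthetic, and the parameter names (param_a, param_b, param_c) and all helper functions are hypothetical.

```python
import random
from statistics import mean, pvariance


def summarize(sample):
    """Collapse one multivariate time series (dict: parameter -> list of values)
    into one scalar per parameter, here the temporal mean."""
    return {name: mean(series) for name, series in sample.items()}


def fisher_score(values, labels):
    """Between-class variance divided by within-class variance for one feature."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")


def rank_parameters(samples, labels):
    """Return (parameter, score) pairs sorted from most to least discriminative."""
    summaries = [summarize(s) for s in samples]
    names = list(summaries[0])
    scores = {n: fisher_score([s[n] for s in summaries], labels) for n in names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def make_toy_data(n_per_class=50, length=60, seed=0):
    """Synthetic two-class data; the parameter names are hypothetical placeholders."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            samples.append({
                # strongly class-dependent mean
                "param_a": [rng.gauss(2.0 * label, 1.0) for _ in range(length)],
                # weakly class-dependent mean
                "param_b": [rng.gauss(0.5 * label, 1.0) for _ in range(length)],
                # pure noise, independent of the class
                "param_c": [rng.gauss(0.0, 1.0) for _ in range(length)],
            })
            labels.append(label)
    return samples, labels


if __name__ == "__main__":
    samples, labels = make_toy_data()
    for name, score in rank_parameters(samples, labels):
        print(f"{name}: {score:.3f}")
```

A univariate filter like this scores each parameter independently, so it cannot detect redundancy or interactions between parameters, and the temporal mean discards the shape of the series. Those limitations are part of what makes feature subset selection on multivariate time series an open problem.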
This Summer Code Sprint is organized by DMLab at Georgia State University to provide practical training in Machine Learning on Big Data while offering some insight into the ambitious task of flare forecasting.
This sprint is planned for graduate students currently enrolled in the M.S. programs in Computer Science or Data Science and Analytics. This 7-week program is a project-based course during which students are closely guided through different avenues toward a shared objective: feature subset selection (FSS) on a large multivariate time series dataset. Each student or team will take on a particular task and complete it during the 7 weeks of the Summer semester. Students will be exposed to the complexity of multi-class and high-dimensional data, implement different algorithms, and build upon their theoretical knowledge of Machine Learning and Data Mining.
Each student's final grade will be calculated as the sum of the following four components: Active Participation (10%), Project Implementation (40%), Project Maintenance (20%), and Final Report (30%).
During this sprint ...
Students will obtain:
hands-on experience in preprocessing a real-world benchmark dataset,
skills needed for dimensionality reduction through feature selection,
a new and practical perspective in interdisciplinary Machine Learning/Data Science.
At the end ...
All participants will present their work to the computer scientists and solar physicists of DMLab.
Upon successful completion of the course, students will receive a grade for 4 credit hours.
Interested students will be closely guided to turn their quality work into a scientific paper and submit it to a peer-reviewed conference.
Upon acceptance of the paper by the conference, DMLab will cover the students' registration fees (~$800 per paper).
The projects of the highest scientific quality will each be awarded peripheral devices as gifts.
How to Apply
Application deadline: June 1, 2021
Applicant notification: June 2, 2021
We highly encourage all eligible students who are passionate about Machine Learning and enjoy teamwork to apply.
Please email your (unofficial) transcript and resume to us at:
with the subject line "Code Sprint Application".