On this documentation page, we provide a technical description of our software design and logic, and link to the GitHub repository containing our code.
Our codebase consists of the scripts/ directory and a README.md file. The scripts/ folder contains all the Python scripts needed for our study, and the README contains a short description of our project, similar to the one on this website.
scripts/ consists of six Python files, which are described below. It also contains a preprocessing directory with a single file, modify_stocks.py. As described on the Data page, this script simply adds a "Name" column to the stock CSV files and renames the "Close Adjusted" column to "CloseAdjusted". It is not relevant to the research questions of our study; it is just a cleansing step that simplifies further analysis.
The provided code is an interactive script that lets the user automatically generate a PySpark command with the desired arguments to conduct the analysis, along with a plotting command to visualize the results. It does so by guiding the user through the options for analyzing stock market and COVID-19 data. It first prompts the user to choose a stock metric (Volume, Low, High, Open, Close, AdjustedClose) and a COVID-19 metric (daily_covid_deaths, daily_covid_cases). Then, it lets the user select the area (world, region, or country) for which to analyze the COVID-19 data.
Afterwards, the user chooses whether to analyze the stock markets in general or the extreme-performing stocks. If they choose the latter, they select a stock market (or all of them) to analyze. If they choose the former, they can additionally decide whether to analyze stocks in the context of stock markets or of economic sectors. Three sectors are available: Healthcare, Technology, and Industrials (more details on the Data page).
Based on these choices, the script generates two commands: one for running a PySpark job (`spark-submit`) and another for executing a Python plot script to visualize the results.
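The command assembly can be sketched as follows. This is an illustrative, simplified version, not the actual script: the function name, argument flags, and file names (`analyze.py`, `plot.py`) are assumptions made for the example.

```python
# Illustrative sketch of generating the paired spark-submit and plot commands
# from the user's choices. Flag names and script names are hypothetical.
def build_commands(stock_metric, covid_metric, area, mode, target):
    """Build a spark-submit command and a matching plotting command."""
    args = (f"--stock {stock_metric} --covid {covid_metric} "
            f"--area {area} --mode {mode} --target {target}")
    spark_cmd = f"spark-submit analyze.py {args}"  # runs the PySpark job
    plot_cmd = f"python plot.py {args}"            # plots the job's results
    return spark_cmd, plot_cmd

spark_cmd, plot_cmd = build_commands(
    "Close", "daily_covid_cases", "world", "markets", "NASDAQ")
print(spark_cmd)
print(plot_cmd)
```

Generating both commands from the same argument list keeps the analysis run and its plot in sync, so the user never has to retype matching parameters.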
This Python script starts the data analysis using Apache Spark through the PySpark library. It takes command-line arguments that specify the analysis parameters and dispatch execution to the function needed to analyze the desired data. It also defines the paths to the buckets holding the source data and the results, and suppresses PySpark INFO logs.
process_corona():
Reads COVID-19 data from a specified file path, filters it by the chosen area (world, region, or country), and calculates the average number of cases grouped by year and week.
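The grouping step can be illustrated with a minimal pure-Python sketch (the actual process_corona() works on a PySpark DataFrame): daily counts are bucketed by ISO year and week, then averaged per bucket.

```python
# Pure-Python sketch of the weekly aggregation in process_corona().
from collections import defaultdict
from datetime import date
from statistics import mean

def weekly_averages(records):
    """records: iterable of (date, count) pairs -> {(year, week): average}."""
    buckets = defaultdict(list)
    for day, count in records:
        iso = day.isocalendar()            # (ISO year, ISO week, weekday)
        buckets[(iso[0], iso[1])].append(count)
    return {key: mean(values) for key, values in buckets.items()}

data = [(date(2020, 3, 2), 10), (date(2020, 3, 3), 20), (date(2020, 3, 9), 30)]
print(weekly_averages(data))  # {(2020, 10): 15, (2020, 11): 30}
```

Using the ISO calendar avoids ambiguity at year boundaries, where the first days of January can belong to the last ISO week of the previous year.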
merge_markets_covid():
Iterates over specified stock markets, reading and cleansing stock market data for each market.
For each market, calls merge_by_group() function to merge stock and COVID-19 data.
Unions the per-market DataFrames into one DataFrame and groups it by year, week, and the COVID-19 column.
Writes the merged data to a CSV file.
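The union-and-group step above can be sketched in pure Python (the actual merge_markets_covid() operates on PySpark DataFrames; the field names here are assumptions): all markets' weekly rows are pooled, then the stock metric is averaged per (year, week) across markets.

```python
# Simplified sketch of unioning per-market weekly rows and grouping them.
from collections import defaultdict
from statistics import mean

def union_and_group(market_rows):
    """market_rows: {market: [{'Year': .., 'Week': .., 'Value': ..}, ...]}
    -> {(year, week): average value across all markets}."""
    grouped = defaultdict(list)
    for rows in market_rows.values():
        for row in rows:                   # union all markets' rows
            grouped[(row["Year"], row["Week"])].append(row["Value"])
    return {key: mean(values) for key, values in grouped.items()}
```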
merge_sectors_covid():
Reads stock categories and corresponding stock names.
Reads stock data for each market and filters by sectors.
Merges COVID-19 data with stock data for individual sectors.
Groups the result DataFrames by year, week, and COVID-19 column.
Writes the merged data to CSV files.
cleanse_stocks():
Selects relevant columns in a given DataFrame.
Filters the data for the period between January 2018 and December 2022.
Adds a 'Group' column with a constant value (market or sector name).
Calculates weekly averages or totals based on the specified column.
Removes rows with null values in any column.
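A simplified pure-Python sketch of these cleansing steps (the actual cleanse_stocks() works on PySpark DataFrames, and the weekly aggregation is omitted here for brevity): rows outside the study period or with missing values are dropped, and a constant 'Group' column is added.

```python
# Sketch of the date filter, null-row removal, and 'Group' column from
# cleanse_stocks(). Field names mirror the description above.
from datetime import date

def cleanse_rows(rows, group_name):
    """Keep 2018-2022 rows without nulls and tag them with a group name."""
    start, end = date(2018, 1, 1), date(2022, 12, 31)
    cleansed = []
    for row in rows:
        if row.get("Date") is None or not (start <= row["Date"] <= end):
            continue                        # outside the study period
        if any(value is None for value in row.values()):
            continue                        # drop rows with null values
        cleansed.append({**row, "Group": group_name})
    return cleansed
```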
merge_by_group():
Calls cleanse_stocks() to cleanse the stock data for the given market or sector.
Merges the cleansed stock DataFrame with the COVID-19 DataFrame based on the 'Year' and 'Week' columns.
Writes the merged DataFrame to a CSV file at the specified path.
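The (Year, Week) join performed here can be sketched in pure Python (the real merge_by_group() joins two PySpark DataFrames): the COVID-19 rows are indexed by (year, week), and each stock row is matched against that index.

```python
# Pure-Python sketch of an inner join on the 'Year' and 'Week' columns.
def merge_on_week(stock_rows, covid_rows):
    """Inner-join two lists of dicts on their 'Year' and 'Week' keys."""
    covid_index = {(r["Year"], r["Week"]): r for r in covid_rows}
    merged = []
    for row in stock_rows:
        match = covid_index.get((row["Year"], row["Week"]))
        if match is not None:               # keep only weeks present in both
            merged.append({**row, **match})
    return merged
```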
The code includes functions for visualizing stock market trends alongside COVID-19 data. plot_market() generates a time-series plot for a specified market metric, while plot_stocks_corona() creates a dual-axis plot comparing stock data with COVID-19 statistics.
Based on the command-line arguments passed, it determines which file should be read and plotted. The script processes data for the chosen markets, generating visualizations that highlight the correlation between market movements and pandemic trends, and uploads these plots to Google Cloud Storage.
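A minimal sketch of a dual-axis plot like the one plot_stocks_corona() produces is shown below. The data, labels, and output file name are illustrative, and the upload to Google Cloud Storage is omitted.

```python
# Illustrative dual-axis plot: stock metric on the left axis, COVID-19
# metric on the right axis, sharing the same weekly x-axis.
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

weeks = list(range(1, 11))
stock_avg = [100 + w * 2 for w in weeks]      # made-up weekly averages
covid_cases = [w ** 2 * 50 for w in weeks]    # made-up weekly case counts

fig, ax_stock = plt.subplots()
ax_stock.plot(weeks, stock_avg, color="tab:blue")
ax_stock.set_xlabel("Week")
ax_stock.set_ylabel("Stock metric", color="tab:blue")

ax_covid = ax_stock.twinx()  # second y-axis sharing the same x-axis
ax_covid.plot(weeks, covid_cases, color="tab:red")
ax_covid.set_ylabel("COVID-19 metric", color="tab:red")

fig.savefig("stocks_vs_covid.png")
```

The twin-axis layout lets two series with very different scales (prices vs. case counts) be compared week by week on a single figure.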
This Python code defines functions for identifying the highest- and lowest-performing stocks in a stock market dataset and merging the results with COVID-19 data using PySpark.
calculate_extremes():
Calculates weekly returns for each stock based on the specified column.
Aggregates the returns to get the overall weekly performance for each stock.
Identifies the worst and best-performing stocks based on their total return.
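The selection step can be sketched in pure Python (the real calculate_extremes() works on a PySpark DataFrame): each stock's weekly returns are summed into a total return, and the minimum and maximum totals give the worst and best performers.

```python
# Pure-Python sketch of picking the extreme performers by total return.
def find_extremes(weekly_returns):
    """weekly_returns: {stock: [return per week, ...]} -> (worst, best)."""
    totals = {name: sum(returns) for name, returns in weekly_returns.items()}
    worst = min(totals, key=totals.get)
    best = max(totals, key=totals.get)
    return worst, best

returns = {"AAA": [0.02, -0.01, 0.03],
           "BBB": [-0.05, -0.02, 0.01],
           "CCC": [0.01, 0.01, 0.0]}
print(find_extremes(returns))  # -> ('BBB', 'AAA')
```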
cleanse_stocks():
Filters the stock DataFrame for a particular group by relevant dates.
Calculates weekly averages or totals based on the specified column.
Removes empty rows and filters out stocks with incomplete data (fewer than 200 rows).
find_for_market():
Reads and cleanses stock market data, calculating the extremes for each market. When multiple stock markets are selected, it finds the highest- and lowest-performing stocks across all of them.
Unions each market's DataFrame into one and merges the result with the COVID-19 data.
Writes the merged DataFrame to a CSV file.
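The cross-market step can be sketched as follows. This is a rough pure-Python illustration with assumed identifiers (the real find_for_market() uses PySpark): each market's per-stock total returns are pooled, and the overall worst and best performers are re-selected from the pool.

```python
# Sketch of selecting the overall extremes across several markets.
def overall_extremes(per_market_totals):
    """per_market_totals: {market: {stock: total_return}} -> (worst, best)."""
    pooled = {stock: total
              for totals in per_market_totals.values()
              for stock, total in totals.items()}
    return min(pooled, key=pooled.get), max(pooled, key=pooled.get)
```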