Data

On this documentation page, we describe the steps taken on the Stock Market and Covid Deaths datasets to prepare the data for analyzing.

ACCESS THE DATA

Stock Market Dataset

Size: 10.23 GB
Source: Kaggle
Content: Daily stock market prices for NASDAQ, S&P500, and NYSE listed companies from March 1980 to December 2022 in CSV format. Also includes companies from the Forbes Global 2000 ranking. Columns: Date, Volume, High, Low, Closing Price.

COVID-19 Global Excess Deaths Dataset

Size: 10.54 GB
Source: Kaggle
Content: Daily Covid-related death numbers in CSV format classified by area, income group and calculation method. Also provides daily infection numbers by country. Data spans from January 2020 to November 2023.

1. Find & Download from Kaggle

Using Kaggle search, we found the two datasets related to our project's problem, containing data in an easily analyzable format (CSV) and fulfilling the size requirements for this project.
We set up a Kaggle account and downloaded each of the datasets to our local disk. Since we have limited local memory, we only could download one dataset at a time.

2. Minimal Cleansing

The Stock Market dataset contains the tables both in CSV and JSON format. Since we are only going to analyze the data in CSV format, we removed the JSON folders from the dataset.
Apart from the death toll data, the COVID-19 dataset also contains scripts, plots and other intermediate files needed for visualization or retrieving the contained data. So, we removed all the extra directories apart from output-data, as well as all non-CSV files and output-for-interactive folder in it.

3. Upload to Google Cloud Storage

We created a GCS bucket called marketquake_data located in europe-southwest1 region.
Through the Google Cloud Platform's GUI, we uploaded the cleansed datasets into the covid_death_data and stock_market_data folders.
To store the future results of our study, we also created a GCS bucket called marketquake_results in the same region.

4. Explore the Data and IDENTIFY Peculiarities

Exploring the data, we identified the following problems in our stock_market_data dataset:

Stock CSVs do not contain the name of the stock as a column. The only way to find out the stock name is by looking at its file name, which is not going to be available after merging all stocks into one DataFrame.
The 'Adjusted Close' column name contains a space, which could be problematic if we use this column for file naming.
Nearly every stock file lacks rows for certain dates (typically 2-3 days in a week). There seems to be no pattern of when the information is missing. Also, the data in covid_death_data is captured weekly.
Some files (e.g.: ACTA.csv) contain empty values starting from a particular date (e.g.: 20-04-2018,,,,,,).

5. Come Up with Solutions

Using a simple Python script, add the "Name" column with the bear file name at the end of each line in each stock CSV. Execute the script for all markets as part of the cleansing phase.
In the same script, change the 'Adjusted Close' column name to 'AdjustedClose'. You can view the cleansing script on GitHub.
Since the information is only missing for 2-3 day periods, view data in weekly, not daily iterations. For this, calculate average stock prices and total volume for each week. More details on implementation can be found in our Code documentation.
Empty values are interpreted as None by PySpark and are automatically ignored when doing calculations on a DataFrame. However, not to overload a DataFrame with them, simply remove all rows with None values during information processing.

6. Stock Categorization

We classified stocks into certain categories that relevant for the analysis of the stock market during a pandemic. We identified Healthcare, Technology and Industrials as the most interesting stock market sectors and manually categorized stocks by these sectors after conducting some research. The classification, however, may contain imprecise or incomplete assignments due to the manual evaluation. With this classification, we cover the stocks from FORBES2000, NASDAQ, and NYSE datasets.

The categorization is based on the following factors:

Healthcare: includes pharmaceuticals, biotechnology, healthcare equipment, and healthcare services.
Technology: companies in this sector are involved in software, hardware, semiconductors, and IT businesses.
Industrials: covers machinery, construction, aerospace, defence, and transportation.

The table schema is: Name (of the stock), Category, Description (of the stock). The corresponding CSV file can be found in Google Cloud Storage: stock_market_data/CategorisedStocks.csv.

Page updated

Report abuse