BTHO COVID-19

Data Mining Class Project

Wangyang He, Jinhao Pan, Lu Zhang, Sicong Huang

Mission of the project

Since 2019, the ongoing COVID-19 pandemic has affected everyone’s daily life. We picked COVID-19 as our project topic with the goal of helping to track the severity of the virus and of uncovering deep hidden features in the existing data. Our results could benefit governments, people researching the pandemic, and people who are affected by or curious about the data trends of the disease.

COVID-19 Dataset from JHU

The dataset we chose is from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. (Link to Dataset) This project used three CSV files provided by Johns Hopkins University: the global confirmed data, the global deaths data, and the global recovered data. The dataset covers 196 countries, and the date range is 01/22/20 to present.

TOOLS

DATA PREPROCESSING

The first step of data preprocessing is to read the three CSV files into pandas data frames. Since the dataset is recorded by region, each data frame is grouped by country and the values of regions belonging to the same country are summed. We then take the last entry of “Confirmed Cases”, “Recovered Cases”, and “Deaths” and calculate “Active Cases”, “Daily Increase”, and “Mortality Rate” from them. The calculations are:
Active Cases = Confirmed Cases - (Recovered Cases + Deaths)
Daily Increase = Today’s Confirmed Cases - Yesterday’s Confirmed Cases
Mortality Rate = Deaths / Confirmed Cases
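
The preprocessing steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: it assumes the wide JHU layout (one row per region, the columns "Province/State", "Country/Region", "Lat", "Long", then one column per date), and the helper names by_country, summarize, and toy, as well as the countries and counts, are made up for the example.

```python
import pandas as pd

def by_country(df):
    """Sum regional rows into one row per country, keeping only the date columns."""
    dates_only = df.drop(columns=["Province/State", "Lat", "Long"])
    return dates_only.groupby("Country/Region").sum()

def summarize(confirmed, deaths, recovered):
    """Build the per-country summary from the three wide time-series tables."""
    c, d, r = by_country(confirmed), by_country(deaths), by_country(recovered)
    out = pd.DataFrame({
        "Confirmed Cases": c.iloc[:, -1],   # last date column = latest entry
        "Recovered Cases": r.iloc[:, -1],
        "Deaths": d.iloc[:, -1],
    })
    out["Active Cases"] = out["Confirmed Cases"] - (out["Recovered Cases"] + out["Deaths"])
    out["Daily Increase"] = c.iloc[:, -1] - c.iloc[:, -2]
    # Countries with zero confirmed cases yield NaN here rather than an error.
    out["Mortality Rate"] = out["Deaths"] / out["Confirmed Cases"]
    return out

def toy(rows):
    """Hypothetical helper: wrap made-up rows in the assumed JHU column layout."""
    cols = ["Province/State", "Country/Region", "Lat", "Long", "1/22/20", "1/23/20"]
    return pd.DataFrame(rows, columns=cols)

# Made-up numbers: Country A has two regions, Country B has one.
confirmed = toy([["A1", "Country A", 0.0, 0.0, 10, 14],
                 ["A2", "Country A", 1.0, 1.0, 5, 6],
                 [None, "Country B", 2.0, 2.0, 100, 120]])
deaths = toy([["A1", "Country A", 0.0, 0.0, 1, 1],
              ["A2", "Country A", 1.0, 1.0, 0, 1],
              [None, "Country B", 2.0, 2.0, 4, 6]])
recovered = toy([["A1", "Country A", 0.0, 0.0, 2, 3],
                 ["A2", "Country A", 1.0, 1.0, 1, 2],
                 [None, "Country B", 2.0, 2.0, 30, 40]])

summary = summarize(confirmed, deaths, recovered)
print(summary)
```

With real data, the three toy tables would be replaced by pd.read_csv calls on the three JHU files.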

DATA VISUALIZATION

METHODS

K-Means

One of the methods we chose to implement is the K-Means algorithm from scikit-learn, which divides the dataset into clusters based on “confirmed cases”, “recovered cases”, and “deaths”. We chose k=3 since there are 3 features in the raw dataset, and we used sklearn.cluster (KMeans) and sklearn.neighbors (NearestNeighbors) to compute the clusters. PCA is then used to reduce the data to two dimensions for plotting the clusters, since the data has three features. The visualization of the K-Means result on this dataset is shown on the left.
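
A minimal sketch of the clustering and projection step with scikit-learn is shown below. It is not the project's code: the input is synthetic random data standing in for the per-country (confirmed, recovered, deaths) values, and the log transform and standardization are assumptions added because raw case counts span several orders of magnitude. The NearestNeighbors step is omitted because the text does not describe how it was used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 60 countries x (confirmed, recovered, deaths).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=8, sigma=2, size=(60, 3))

# Assumption: compress the heavy tail and put the features on a common scale.
X_scaled = StandardScaler().fit_transform(np.log1p(X))

# Cluster in the full 3-feature space with k=3, as in the text.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project to 2 dimensions only for plotting; clustering itself used all 3 features.
coords = PCA(n_components=2).fit_transform(X_scaled)

for k in range(3):
    print(f"cluster {k}: {np.sum(labels == k)} points")
```

The columns of coords can be passed to a scatter plot colored by labels to produce a 2D cluster view.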

Outlier Detection (DeepLog)

Another method we implemented is outlier detection using the DeepLog algorithm from TODS. DeepLog is a deep neural network model that uses Long Short-Term Memory (LSTM) to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution and to detect anomalies when log patterns deviate from the model trained on log data under normal execution.

We first calculated the active rate, recovered rate, and death rate for each country as the input data to the neural network model. The model detected 20 countries (shown on the left) as outliers based on the three rates and assigned each country an outlier score.
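
The three input rates could be computed as in the sketch below. The text does not give the exact definitions, so dividing each count by the confirmed cases is an assumption, and the countries and counts are made up. Only the feature construction is shown; the DeepLog model from TODS is not reproduced here.

```python
import pandas as pd

# Made-up per-country totals for illustration.
counts = pd.DataFrame(
    {"confirmed": [1000, 500, 200],
     "recovered": [700, 100, 150],
     "deaths": [50, 25, 10]},
    index=["Country A", "Country B", "Country C"],
)

active = counts["confirmed"] - (counts["recovered"] + counts["deaths"])

# Assumed definitions: each rate is a share of confirmed cases.
rates = pd.DataFrame({
    "active_rate": active / counts["confirmed"],
    "recovered_rate": counts["recovered"] / counts["confirmed"],
    "death_rate": counts["deaths"] / counts["confirmed"],
})

# One row per country, three columns: the matrix handed to the detector.
features = rates.to_numpy()
print(rates)
```

Under these definitions the three rates of each country sum to 1, which gives a quick sanity check on the input.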

The visualization plots below show the outlier points; each peak on each graph represents an outlier. The same country may appear as an outlier on more than one graph, which means that the country is abnormal on more than one rate. We can obtain a list of countries that show abnormal behaviour by looking up the country index in the original data.

The 2D and 3D visualization graphs below give a better understanding of why each outlier was picked out.

GAN

The last method we chose to implement is the Generative Adversarial Network (GAN). GANs are known for their ability to learn underlying layout and deep features through a contest between a generator and a discriminator, and they have been widely used on research challenges involving strong sparsity. We therefore propose to address the novel and difficult challenge of continuous infection rate projection with a modified version of a known technique, the Recurrent Conditional GAN (RCGAN). The plot to the left compares our results with one deep learning (DL) baseline and two statistical baselines on the same dataset.

CONCLUSION

Overall, we used data preprocessing and data visualization techniques as well as three methods (k-means, outlier detection, and GAN) to analyze the chosen COVID-19 dataset. The results from the three methods present different aspects of the patterns and trends in the given dataset.

As the COVID-19 pandemic continues to interfere with social activities and global trade, every part of the world still lives under the fear of an outbreak. With the results of our project, we hope to help the world better understand the behavior of the virus and to help guide authorities and citizens around the globe to plan accordingly.

Check out the "Example Code" and "Project Poster" pages for more project details.

Check out the "Related Work" page for more information on the topic.

Questions?

Contact [hewangyang@yahoo.com] for more information on the project.