Our model uses both China-related air-traffic data and population data to simulate the spread of COVID-19 within China. All air-traffic data was gathered from openflights.org, which provides comprehensive lists of airports, plane models, and, most importantly, air routes (including departure airport, destination airport, and plane model). We also obtained population data for 112 Chinese cities from simplemaps.com. Simple Python functions were written to scrape data from these sources.
To estimate the volume of passengers traveling from one airport to another, we first had to find the passenger capacity of the planes that flew domestic air routes in China. Once these plane models were identified, we manually looked up their passenger capacities through quick Google searches, which allowed us to assign a passenger volume to each route. It is important to note that some air routes listed multiple plane models. In these cases, since openflights.org did not explain exactly what this meant (or the time period over which data for these routes was collected), we decided that the passenger volume for the route would be the average passenger capacity of all plane models listed for that route.
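The averaging rule for routes with several plane models can be sketched roughly as follows. The model codes and seat counts here are illustrative placeholders, not the figures we looked up, and the function name route_passenger_volume is hypothetical.

```python
from statistics import mean

# Placeholder seat capacities keyed by plane-model code (illustrative values only).
PLANE_CAPACITY = {"MODEL_A": 150, "MODEL_B": 180, "MODEL_C": 300}

def route_passenger_volume(plane_codes, capacity=PLANE_CAPACITY):
    """Average the seat capacity over every plane model listed for a route."""
    seats = [capacity[code] for code in plane_codes if code in capacity]
    if not seats:
        raise ValueError("no known plane model for this route")
    return mean(seats)

print(route_passenger_volume(["MODEL_A"]))             # route with a single model
print(route_passenger_volume(["MODEL_A", "MODEL_C"]))  # two models, averaged
```

A route with one listed model simply takes that model's capacity, so the same function covers both cases.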
We had unique airport data for 255 Chinese airports; however, since we only had population data for 112 Chinese cities, we only used airport data that pertained to those 112 cities. We used the population of each city to create a Barabási–Albert network structure for that city. The size of each city's Barabási–Albert network was set to the city's population divided by a COMMUNITY_SIZE variable that we defined. This was done to keep each city network as manageable as possible while maintaining both efficiency and accuracy (for example, Wuhan has a population of 11.8 million, which would otherwise equate to 11.8 million nodes in its Barabási–Albert network structure, which is far too large). The value of COMMUNITY_SIZE was chosen by researching the population density of China, which we found to be 153 people per km^2. The number of new edges formed by each node added to the city network was set to 100. Another step was taken to ensure the efficiency and speed of our model: rather than creating the network structure of every city up front and consuming valuable memory, we only create the Barabási–Albert network of a city once the virus has reached that city. A similar strategy is used again later on.
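As a quick worked example of the sizing rule, the sketch below computes how many nodes a city network would contain. Integer division and the floor of one node are assumptions made for illustration, and city_network_size is a hypothetical helper name.

```python
COMMUNITY_SIZE = 153  # individuals represented by one city-network node (from the text)

def city_network_size(population, community_size=COMMUNITY_SIZE):
    """Number of community nodes used to represent a city of the given population."""
    # Assumption: round down, but never return fewer than one node.
    return max(1, population // community_size)

wuhan_nodes = city_network_size(11_800_000)
print(wuhan_nodes)  # roughly 77 thousand nodes instead of 11.8 million
```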
While the strategy above is useful, it is incomplete and inaccurate when used on its own because it does not effectively keep track of individual infections. Our solution was to add a "community" attribute to each node within each city's Barabási–Albert network structure. The "community" attribute of each node holds a Watts–Strogatz network structure of 153 nodes (COMMUNITY_SIZE). In this way, each node in a city's Barabási–Albert network represents a community, and within the "community" attribute of that node, each inner node represents an individual. As before, however, we only create the community-level Watts–Strogatz network structure for a node once the virus reaches that node.
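A minimal sketch of this two-level, create-on-demand structure using Networkx is given below. The Watts–Strogatz parameters k and p, the helper names build_city and get_community, and the small new_edges value in the usage lines are assumptions chosen to keep the example fast; they are not the values used in our model.

```python
import networkx as nx

COMMUNITY_SIZE = 153  # individuals per community (from the text)

def build_city(population, new_edges, seed=None):
    """Barabási–Albert graph with one node per community; inner graphs are not built yet."""
    # Assumption: guard so the generator's requirement n > new_edges always holds.
    n = max(new_edges + 1, population // COMMUNITY_SIZE)
    city = nx.barabasi_albert_graph(n, new_edges, seed=seed)
    nx.set_node_attributes(city, None, "community")  # placeholder until the virus arrives
    return city

def get_community(city, node, k=4, p=0.1, seed=None):
    """Return the node's Watts–Strogatz graph, creating it on first access only."""
    if city.nodes[node]["community"] is None:
        city.nodes[node]["community"] = nx.watts_strogatz_graph(
            COMMUNITY_SIZE, k, p, seed=seed
        )
    return city.nodes[node]["community"]

city = build_city(population=153 * 50, new_edges=3, seed=1)
inner = get_community(city, 0, seed=1)  # only community 0 is materialized
print(city.number_of_nodes(), inner.number_of_nodes())
```

Communities that the virus never reaches keep their None placeholder, so they cost almost no memory.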
The SEIR model was also implemented at the community level. Each node in the Watts–Strogatz network structure has 7 attributes: "Susceptible", "Exposed", "Infected", "Recovered", "Deceased", "Incubation period", and "Infectious Period". When a node is initialized, these values are set to True, False, False, False, False, random.randint(1,7), and random.randint(1,7), respectively. The "Incubation period" determines how long an individual remains "Exposed", while the "Infectious Period" determines how long an individual remains "Infected". The distinction between the two states is that an "Exposed" node has contracted the virus but is not yet able to spread it, while an "Infected" node is actively spreading the virus. According to the CDC, the incubation period for COVID-19 is 1-14 days, so for our model we decided that a node passes through both the "Exposed" and "Infected" states within this 14-day period. We introduced the random generation of these values to account for the differing immune responses of different individuals.
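The initial attribute values listed above could be set up along these lines. The function name new_individual and the optional rng argument are hypothetical conveniences for illustration.

```python
import random

def new_individual(rng=random):
    """Attribute dictionary for a freshly created (susceptible) individual."""
    return {
        "Susceptible": True,
        "Exposed": False,
        "Infected": False,
        "Recovered": False,
        "Deceased": False,
        "Incubation period": rng.randint(1, 7),  # days spent as "Exposed"
        "Infectious Period": rng.randint(1, 7),  # days spent as "Infected"
    }

person = new_individual(random.Random(0))
print(person)
```

Since each period is at most 7 days, the two periods together never exceed the 14-day window.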
The simulation starts by infecting a random individual within a random "community" of a given city (in this case, Wuhan). The infected node then begins to spread the infection to other nodes according to a PATHOGEN_TRANSMISSION probability. The PATHOGEN_TRANSMISSION probability was calculated using the formula R = PK/M, where R is the reproductive number, P is the PATHOGEN_TRANSMISSION probability, K is the average number of people infected, and M is the average number of people infected. Through this formula and data available on COVID-19, the PATHOGEN_TRANSMISSION probability was derived to be 0.7 (that is, transmission succeeds 70% of the time). There are 3 layers of transmission in our model: community-wide, city-wide, and nation-wide. Community-wide transmission concerns the spread of the virus within the Watts–Strogatz network structure. This is done by looking at the neighbors of every infected node and "rolling the dice" to see if they can beat the PATHOGEN_TRANSMISSION probability. Every node follows the same state path: "Susceptible", then "Exposed", then "Infected", and finally either "Recovered" or "Deceased".
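The community-wide transmission step described above can be sketched as follows. For brevity, this version stores each node's state as a single letter rather than as separate boolean attributes, and it reads "beating" the probability as a uniform random draw falling below PATHOGEN_TRANSMISSION. Both are simplifying assumptions, and spread_in_community is a hypothetical name.

```python
import random

PATHOGEN_TRANSMISSION = 0.7  # from the text

def spread_in_community(neighbors, state, rng=random):
    """One transmission pass: each infected node may expose its susceptible neighbors.

    neighbors: dict mapping node -> iterable of adjacent nodes
    state:     dict mapping node -> "S", "E", "I", "R", or "D"
    Returns the set of newly exposed nodes and updates state in place.
    """
    newly_exposed = set()
    for node, status in state.items():
        if status != "I":
            continue
        for other in neighbors[node]:
            if state[other] == "S" and rng.random() < PATHOGEN_TRANSMISSION:
                newly_exposed.add(other)
    # Apply changes after the pass so new exposures cannot spread in the same step.
    for node in newly_exposed:
        state[node] = "E"
    return newly_exposed

ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
state = {0: "I", 1: "S", 2: "S", 3: "R"}
print(spread_in_community(ring, state, random.Random(42)))
```

Only node 1 can be exposed in this example, because node 2 is not adjacent to the infected node and node 3 has already recovered.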
City-wide transmission concerns the spread of the virus within the Barabási–Albert network structure. This is done by looking at the neighbors of every infected community (a node in the network) and "rolling the dice" to see if they can beat (PATHOGEN_TRANSMISSION * number of active infected in the community / COMMUNITY_SIZE). At this level of transmission, we also implemented a "social distancing" feature that, when activated, reduces the PATHOGEN_TRANSMISSION probability by 50%. Nation-wide transmission concerns the spread of the virus from one city to another. This is done by looking at the flight routes leaving every infected city and "rolling the dice" to see if they can beat ((PATHOGEN_TRANSMISSION * number of active infected in the city / city population) * flight route volume). At this level of transmission, we also implemented a "lock-down" feature that, when activated, stops the transmission of the virus at the national level.
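The two probability expressions above translate fairly directly into code. In the sketch below, the function names are hypothetical, and clamping the nation-wide value to 1.0 is an added assumption, since multiplying by the route volume can otherwise push the value above 1.

```python
PATHOGEN_TRANSMISSION = 0.7  # from the text
COMMUNITY_SIZE = 153         # from the text

def city_level_probability(active_infected, social_distancing=False):
    """Chance that an infected community spreads the virus to a neighboring community."""
    p = PATHOGEN_TRANSMISSION * (0.5 if social_distancing else 1.0)
    return p * active_infected / COMMUNITY_SIZE

def nation_level_probability(active_infected, city_population, route_volume,
                             lockdown=False):
    """Chance that an infected city spreads the virus along one flight route."""
    if lockdown:
        return 0.0  # lock-down stops all transmission at the national level
    p = PATHOGEN_TRANSMISSION * active_infected / city_population * route_volume
    return min(1.0, p)  # assumption: clamp so the result is a valid probability

print(city_level_probability(153))
print(city_level_probability(153, social_distancing=True))
print(nation_level_probability(100, 1_000_000, 200))
```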
The visualization for this project was built using Networkx and Basemap, which we imported from mpl_toolkits (a matplotlib toolkit). Using the flight routes between different airports, we created a Networkx directed graph. The airport nodes serve as proxies for the cities themselves, since the majority of cities have only one primary airport. The airport/city nodes were then populated with important data such as the population, city name, and number of infections. These data were later used to determine the color and size of the nodes. The next step was overlaying the whole city/airport network on a Basemap of China, using latitude and longitude coordinates scraped from OpenFlights. Using the infection count of each city, we were then able to color the airport nodes to indicate the severity of the spread at each airport. Blue nodes indicate that there are more than 5000 cases in that city. Red nodes indicate that there are fewer than 5000 cases. Lastly, green nodes indicate that a city has no infections. The size of each node is also directly proportional to the number of infections in that city. In other words, the size of a node grows with the number of cases (with a cap around 10,000, since we don't want nodes to cover the entire screen).
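The color and size rules can be expressed as two small helpers, sketched below. The scale factor is a hypothetical display constant, and treating exactly 5000 cases as red is an assumption, since the text only specifies "more than" and "fewer than" 5000.

```python
def node_color(infections):
    """Map a city's infection count to the color used on the map."""
    if infections == 0:
        return "green"
    # Assumption: exactly 5000 cases falls in the red bucket.
    return "blue" if infections > 5000 else "red"

def node_size(infections, scale=0.05, cap=10_000):
    """Marker size grows linearly with infections, up to the cap."""
    return scale * min(infections, cap)

for cases in (0, 120, 7_500, 50_000):
    print(cases, node_color(cases), node_size(cases))
```

Lists built from these helpers can then be passed as the node_color and node_size arguments when drawing the Networkx graph.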
The inspiration for this visualization came from an article on TowardsDataScience.com called "Catching that flight: Visualizing social network with Networkx and Basemap" by Tuan Doan Nguyen. Similar to the article, our model offers two kinds of visualizations: a Basic visual that offers simplicity while still providing some insight into the propagation of the virus, and an Advanced visual that provides a higher level of detail, includes a legend, and gives the option of displaying the names of the cities.
Some pie charts depicting the percentage of individuals who are infectious, recovered, or dead were also generated for each simulation.
Because our model has a relatively large number of moving parts, it is quite computationally expensive to run. In particular, handling the spread of the disease among nodes in a sub-community requires four for-loops, with this number decreasing by one at each larger scale (sub-community, community, province). To combat this, we forced our model to run exclusively in areas that harbor active nodes. Our model may therefore loop through a province if it contains active nodes, but it will not loop through any community in that province that does not contain an active node.
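The idea of visiting only areas that contain active nodes can be illustrated with a small generator. The nested-dictionary layout (province name to community id to active count) and the name active_communities are assumptions made for this sketch, not our actual data structures.

```python
def active_communities(provinces):
    """Yield (province, community) pairs that hold at least one active node.

    provinces: dict mapping province name -> dict of community id -> active node count
    """
    for province, communities in provinces.items():
        if not any(count > 0 for count in communities.values()):
            continue  # no active nodes anywhere here, so skip the whole province
        for community, count in communities.items():
            if count > 0:
                yield province, community

provinces = {
    "P1": {"c1": 0, "c2": 3},
    "P2": {"c1": 0, "c2": 0},
}
print(list(active_communities(provinces)))
```

Here province P2 is skipped entirely, and within P1 only community c2 is visited.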
Our model also involves a fair amount of randomness. Infecting a node and transmitting the disease between provinces are both probabilistic outcomes that depend heavily on Python's built-in random number generator. Therefore, depending on the random numbers that are generated, our model may have to loop through many communities and sub-communities in a province, or only a few.
We ran a number of different experiments, all with varying computation times, to better understand the propagation of the coronavirus through China. At its quickest, the simulation needed only about thirty seconds to run, when a lock-down was placed on Wuhan ten days after the first case. At its longest, the simulation took about twenty minutes to run, when modeling the spread of the disease with no preventative actions taken. In each case, the vast majority of the computation time is spent in the simulation; the visualization only takes about eight seconds. It is also important to note that computation time is machine-dependent, so we can only provide a rough estimate of how long our program takes to run.