Lauren McLane '23
On December 30, 2019, nine days before the World Health Organization alerted the world to the threat of the coronavirus, BlueDot, a Canadian artificial intelligence startup, detected the outbreak of “unusual pneumonia” cases in Wuhan, China. These pneumonia cases would later be identified as the novel coronavirus and lead to the eruption of a global health crisis.
Throughout the pandemic, AI tools and technologies have played a critical role in detecting, diagnosing, and preventing the spread of the virus and informing the efforts of policymakers, health organizations, and society at large to manage every aspect of the crisis.
In this article, I will examine the functions and processes behind a simple machine learning algorithm that tracks the origins of Covid strains. Models like these have played a crucial role in monitoring the spread of the highly infectious Delta and Omicron variants.
To understand the role of AI in the surveillance of Covid variants, there are a few key questions to answer: What are Covid mutations, and what causes them? What is machine learning? How do you use a machine learning model to predict the origins of a given Covid mutation?
What are Covid mutations and what causes them?
SARS-CoV-2, the virus that causes COVID-19, is an RNA virus, which means that its genetic material is made up of RNA. RNA, or ribonucleic acid, is a long single-stranded molecule whose backbone alternates ribose sugar and phosphate units. Attached to each ribose sugar is one of four nucleotide bases: adenine (A), uracil (U), cytosine (C), or guanine (G). The order of these four bases stores the virus’s genetic information, which in turn determines the proteins the virus makes and the traits it exhibits.
Since viruses are acellular, they cannot reproduce on their own and must hijack the replication machinery of a host in order to spread. When viruses infect a host, they attach to the host’s cells, enter them, and replicate their RNA. Copying errors often occur during replication, changing the sequence of RNA bases. These changes are mutations. SARS-CoV-2, for example, has been hypothesized to originate from the bat coronavirus RaTG13, because only about 1,000 of its 29,903 nucleotide bases differ from those of RaTG13. The most notable of these differences occur in the gene for the spike (S) glycoprotein and are thought to have allowed the virus to transmit to humans more effectively.
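Comparing two genomes in this way comes down to counting the positions at which two aligned sequences carry different bases. The Python sketch below illustrates the idea under simplifying assumptions: the sequences are already aligned and of equal length, and the two short fragments are made up for illustration rather than taken from any real genome. The function names are my own.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length, aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Share of aligned positions that carry the same base, as a percentage."""
    return 100.0 * (1 - count_differences(seq_a, seq_b) / len(seq_a))


# Made-up 12-base fragments for illustration only (not real viral sequence).
reference = "AUGGCUUACGGA"
variant = "AUGGCAUACGGU"

print(count_differences(reference, variant))           # 2
print(round(percent_identity(reference, variant), 1))  # 83.3
```

Applied to the figures quoted above, 1,000 differing bases out of 29,903 corresponds to roughly 96.7% identity between the two genomes.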
What is machine learning and how does a machine learning algorithm work?
Machine learning is a subset of artificial intelligence that allows a machine to learn from past data without being explicitly programmed for the task. In most instances, a machine learning algorithm uses “training data”, a large dataset used to teach the model to predict an outcome. Separate “testing data”, which the model has not seen during training, is then used to measure the accuracy of its predictions.
Among the most common machine learning algorithms is logistic regression, a classification algorithm that uses statistical probabilities to predict a category. In a logistic regression model, input variables are multiplied by “weights”, values set by the machine learning algorithm during the “training” process. The weighted inputs are then summed and passed through the logistic (sigmoid) function, which converts the sum into the probability of a category.
Logistic regression models are popular in the field of biology because they are simple and easy to interpret, and they are cheaper to develop because they require less data than more complex machine learning algorithms such as neural networks.
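To make the distinction between training and testing data concrete, here is a small Python sketch that shuffles a dataset, holds part of it out for testing, and scores predictions against the true labels. The helper names, the 25% test fraction, and the placeholder records are all assumptions made for illustration.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical (sequence, label) pairs; placeholders, not real data.
records = [(f"seq{i}", "A" if i % 2 == 0 else "B") for i in range(8)]
train, test = train_test_split(records)

print(len(train), len(test))                                  # 6 2
print(accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))   # 0.75
```

Keeping the test records out of training is what makes the accuracy figure an honest estimate of how the model will behave on sequences it has never seen.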
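The calculation a trained logistic regression model performs can be sketched in a few lines of Python. The weights below are hypothetical values chosen only to show the arithmetic; in a real model they would be learned from training data.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through the sigmoid."""
    z = bias + sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(z)


# Hypothetical weights; in practice these are set during training.
weights = [2.0, -1.0, 0.5]
inputs = [1, 0, 1]  # three binary input variables

p = predict_probability(inputs, weights)
print(round(p, 3))  # 0.924
```

Here the weighted sum is 2.0 + 0.5 = 2.5, and the sigmoid turns it into a probability of about 0.92 that the input belongs to the category in question.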
How do you use a machine learning model to predict the origins of a given Covid mutation?
Training a logistic regression model to determine the origin of a Covid mutation requires a large pool of data in which each record pairs a Covid genome sequence with the country where that sequence was recorded.
Below is a chart from the National Center for Biotechnology Information (NCBI) comparing the genome sequences of the bat coronavirus and SARS-CoV-2. The colors red, grey, green, and blue represent the nucleotide bases uracil (U), guanine (G), adenine (A), and cytosine (C) respectively. The “consensus” row outlines the similarities and differences between the two sequences, highlighting mutations with a grey box and a “Y” value. Inputting a large pool of similar genomic data for different countries allows us to train a machine learning algorithm.
However, for a machine learning algorithm to interpret this genomic data, the data must first be converted into numeric values. One way of doing so is with a one-hot encoding matrix, as shown below. The one-hot encoding matrix uses ones and zeroes to indicate the presence or absence of each base at each location in the genome.
After the data has been converted into zeroes and ones, it can be fed into the logistic regression algorithm to train the model. Because there are more than two possible countries, the model performs multinomial logistic regression, learning its own parameters for predicting the origin of any given sequence. Inputting a new Covid sequence then yields a prediction of where that sequence originated.
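One-hot encoding can be sketched in Python as follows. Each position in the sequence becomes a row of four values, with a single 1 marking which base is present. The column order (A, U, C, G) and the function names are arbitrary choices for this illustration.

```python
BASES = "AUCG"  # column order is an arbitrary choice for this sketch


def one_hot_encode(sequence: str) -> list:
    """Return one row per position, with a 1 in the column of the base found there."""
    rows = []
    for base in sequence.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


def flatten(matrix) -> list:
    """Concatenate the rows into a single feature vector for a model."""
    return [value for row in matrix for value in row]


matrix = one_hot_encode("AUGC")
for row in matrix:
    print(row)
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 0, 1]
# [0, 0, 1, 0]
```

Flattening the matrix gives one long list of zeroes and ones per genome, which is the form a logistic regression model expects as input. A full genome of 29,903 bases would produce a vector of 119,612 values under this scheme.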
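The whole pipeline can be sketched end to end in plain Python. This is a from-scratch toy version of multinomial (softmax) logistic regression written for illustration, not the code behind any real surveillance model. The six-base fragments, the labels “Country A” through “Country C”, the learning rate, and the number of training passes are all made up; real work would use full genomes, far more data, and an established machine learning library.

```python
import math

BASES = "AUCG"


def encode(sequence):
    """One-hot encode a sequence and flatten it into a feature vector."""
    return [1.0 if base == b else 0.0 for base in sequence for b in BASES]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(weights, biases, features):
    """Score each class as a weighted sum of the features, then apply softmax."""
    scores = [b + sum(w * x for w, x in zip(row, features))
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train(samples, labels, classes, epochs=200, learning_rate=0.1):
    """Fit one weight vector per class by gradient descent on cross-entropy loss."""
    n_features = len(samples[0])
    weights = [[0.0] * n_features for _ in classes]
    biases = [0.0] * len(classes)
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            probs = class_probabilities(weights, biases, features)
            for k, cls in enumerate(classes):
                # Gradient of the loss with respect to this class's score.
                error = probs[k] - (1.0 if cls == label else 0.0)
                biases[k] -= learning_rate * error
                for j, x in enumerate(features):
                    weights[k][j] -= learning_rate * error * x
    return weights, biases


def predict(weights, biases, classes, sequence):
    """Return the class with the highest predicted probability."""
    probs = class_probabilities(weights, biases, encode(sequence))
    return classes[probs.index(max(probs))]


# Made-up 6-base fragments and placeholder labels, for illustration only.
data = [
    ("AUGGCU", "Country A"), ("AUGGCA", "Country A"),
    ("CUGACU", "Country B"), ("CUGACA", "Country B"),
    ("GAGUCU", "Country C"), ("GAGUCA", "Country C"),
]
classes = ["Country A", "Country B", "Country C"]
samples = [encode(seq) for seq, _ in data]
labels = [label for _, label in data]

weights, biases = train(samples, labels, classes)
print(predict(weights, biases, classes, "AUGGCG"))
```

The new fragment shares its first five bases with the Country A examples, so on this toy data the model is expected to assign it to Country A. The learned weights can also be inspected directly to see which bases at which positions push a prediction toward each country, which is part of what makes logistic regression easy to interpret.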
AI models like these have served as a powerful tool in the battle against COVID-19 and have accelerated the overall development of machine learning and deep learning technologies. Since the outbreak of the pandemic, the National Institutes of Health has launched the Medical Imaging and Data Resource Center (MIDRC), which uses artificial intelligence and medical imaging to diagnose, treat, and monitor COVID-19 patients; Amazon Web Services has created CORD-19 Search, a machine learning website that helps researchers quickly access research papers and documents related to the virus; and Parkland Center for Clinical Innovation has developed a machine learning risk index that generates a risk score for COVID-19 patients. These innovations, among countless others, will help to pave the way for the development and application of AI technology beyond the pandemic and in the decade to come.
References
Brown, Sara. “Machine Learning, Explained.” MIT Sloan. Accessed February 22, 2022. https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained.
Khan, Muzammil, Muhammad Taqi Mehran, Zeeshan Ul Haq, Zahid Ullah, Salman Raza Naqvi, Mehreen Ihsan, and Haider Abbass. “Applications of Artificial Intelligence in COVID-19 Pandemic: A Comprehensive Review.” Expert Systems with Applications 185 (December 15, 2021): 115695. https://doi.org/10.1016/j.eswa.2021.115695.
McClean, Toby. “Covid-19 Has Accelerated Digital Transformation — With AI Playing A Key Role.” Forbes. Accessed February 22, 2022. https://www.forbes.com/sites/forbestechcouncil/2020/11/04/covid-19-has-accelerated-digital-transformation---with-ai-playing-a-key-role/?sh=1472f3665a58.
Stieg, Cory. “How This Canadian Start-up Spotted Coronavirus before Everyone Else Knew about It.” CNBC, March 3, 2020. https://www.cnbc.com/2020/03/03/bluedot-used-artificial-intelligence-to-predict-coronavirus-spread.html.
OECD. “Using Artificial Intelligence to Help Combat COVID-19.” Accessed February 22, 2022. https://www.oecd.org/coronavirus/policy-responses/using-artificial-intelligence-to-help-combat-covid-19-ae4c5c21/.
Wrobel, Antoni G., Donald J. Benton, Pengqi Xu, Chloë Roustan, Stephen R. Martin, Peter B. Rosenthal, John J. Skehel, and Steven J. Gamblin. “SARS-CoV-2 and Bat RaTG13 Spike Glycoprotein Structures Inform on Virus Evolution and Furin-Cleavage Effects.” Nature Structural & Molecular Biology 27, no. 8 (August 2020): 763–67. https://doi.org/10.1038/s41594-020-0468-7.