Covid

Covid genome revisited with pattern matching and Joint Complexity


Using Pattern matching and Joint complexity (see my IT page), I have investigated the covid SAR-2 genome in order to track similarities with ancient virus. The study does not need any knowledge in mede3cine and biology and the genome sequence can be found on the very good site NCBI the world genomic public (and free) database. The programs in order to run pattern matching are simple to implement (see IT page) are of low complexity and need little processing time on any regular laptop..

I have run these similarity program into two possible context:

Weak pattern matching: when the sequences to compare look like a pair of random sequence. This happens when the virus are not significantly related.

Strong pattern matching: when the sequence to compare are very similar., and differ by few elements. This happens when the sequence are strongly related.


Weak Pattern matching analysis

When two sequences X and Y are random uniform on the alphabet {A,C,G,T} and independent, the average joint complexity J(X,Y) has asymptotic expression

E[J(X,Y)]=1/log(4)*((|X|+|Y|) log(|X|+|Y|)-|X|log|X|-|Y|log|Y|)+o(|X|+|Y|)

where |X| and |Y| respectively denotes the length of the sequences. When the sequences have identical length the expression simplifies in

E[J(X,Y)]=|X|+o(|X|).

The following figure (left) display the joint complexity value of the SARs-2 genome of 2019 with the Bat_coronavirus_HKU2 , the oldest known ancestor of the virus, indeed a bat coronavirus discovered in 2007. Both are of length of 30,000 bases., we read the figure from right to left. On the abscissa x we compute the joint complexity of the portion of the SARS-2 genome between x and 30,000 with the full genome of Bat coronavirus. In dashed we display the theoretical plot of the joint complexity if the sequences were purely random. As we can see the plot of the joint complexity with the bat coronavirus is slightly above, indicating the filiation between the two genomes (which are separated by 12 years). The figure on the right is indeed the same figure but after normalization with the random joint complexity. In brown the joint complexity with the genome of HIV_1_isolate_060SE_Sweden of 1993. We see that there is a negative correlation, proving that the genomes are not affiliated.


Despite this absence of filiation between the full genomes, Perez et al in a non published article have been able to detect in the SARS-2 genome 19 segments of around 20 bases long issued from 19 HIV and Simian IV (the ape equivalent of HIV) full genomes. The segments appear sometimes with no error, sometimes with one error, two or three errors. The following figure display the position of those segments. The segments are enumerated from 1 to 19, the 20th is a segment from the bat coronavirus. The figure has been built the following way: the SARS-2 genome has been sliced in 24 base long slots. For each of the 19 candidate genomes sequence Y1, Y2,…,Y19 we compute the joint complexity between each SARS-2 slot, then we substract to this value the slot average value and express it in multiple of the standard deviation. We mark the position where the maximum matching occurs for the candidate genome. There should be 19 figures but we compress them in a single figure by displaying the maximum of the values for each slot. We see that matchings peak at high values, up to 15 multiple of standard deviation which is remarkable.

The matchings are remarkable and the authors were trying to use them to support the hypothesis that the SARS-2 genome would be of artificial origin. Indeed, in the word sequence problem the probability that a random sequence of length N contains exactly 19 such factors is N19 4-19*20

which in our case is of order 2*10-144; better play to Powerball! Why this is not fully correct? Two remarks, (1) the segments copied in the SARS-2 genome should not be decisive in the mode of operations of the virus, (2) the authors seem to have reviewed many genomes which span over many versions of HIV and Simian IV, some of them have even been reversed in order to give better fits, Therefore the problem look rather on how many full genomes we have to test in order to get a dozens of matching of size around 20 bases. We can estimate that the total database related to VIH and related genomes is around one million base. The probability of each matching is of order 4-20 . Then by the Tchebytchev inequality the probability to have k matching of size 20 over a database of size M is upper-bounded by the formula 30,000*M*4-20/k, if we accept a tolerance of three error max we have to multiply this value by (203). With M=1,000,000 and k=19 we get the value 1.6, larger than 1. Thus the event of 19 matchings is not that exceptional.


Strong Pattern Matching analysis

When X=Y we have J(X,X)=|X|2/2+O(|X|*log|X|) when |X|=|Y| and with only k mismatches between X and Y with k<<|X| J(X,Y)=(|X|-2k)2/(k+1)/(k+2)-O(|X|log|X|).


The figure below gives the recent hypothesis about the filiation of SARS-2. We see the ancestor Bat coronavirus HKU2, discovered in 2007 in Yunnan. The closest known ancestor is RATG93 discovered in 2013, also in Yunnan. A close cousin is the bat coronavirus RaCCS203, discovered in 2020 in Yunnan. All these genomes have same length of little less than 30,000.


Our aim is to compare the genome of the middle guy RATG13 with all the other genomes cited so far. For this end we will slice the genome in slots of length 50 bases and compute the joint complexity of each slot with the full other genomes. Using the formula about weak and strong pattern matching, we notice that the average random (weak) pattern matching will be around 271.2482570 and the average maximal pattern matching (what would be given if we compare the slot with itself) is 1222.406850. Thus the area between 271 and 1222 covers the weak and strong pattern matchings.

In the figure below we display the slot joint complexity of pivot genome RATG13 with its (putative) ancestor HKU2 (left), and with RATG93 itself (right). The ancestor genome seems to be genetically too far from the pivot, thus the joint complexity ranges in the weak pattern matching area as if it were purely random. On the other hand, the joint complexity of the pivot with itself confirms the above formula.

The next figures show the slot joint complexity of the pivot with its putative descendant the SARS-2 (left) and its cousin RaCCS203 (right). Both are in strong pattern matching. In fact it is stronger for the SARS-2 than for the RaCCS203. This is surprising since there is the same lasp of time between the ancestor HKU2 and RATG13 and between RATG13 and SARS-2. Regarding the cousin genome RACCS203 the pattern matching is less strong, in fact it oscillates between strong and weak. In particular the segment between base 21,000 and 24,000 looks like a exogenous insertion.


The similarities between the three genomes needs deeper investigations. In the next figures we use a variation in the joint complexity algorithm where instead enumerating the common factors, we search the largest common factor. The figures below displays for each slot of RATG13 the offset between the position of the largest factor in the RATG13 genome and the position of the same factor in the SARS-2 genome (left) and in the RaCCS203 genome left. In both case (but mostly for the SARS-2) genome, the offset is surprisingly constant. The changes are indicated by small triangles. The blue triangles when the change is an increase, in this case this an indication of a mutation by insertion: four in SARS-2 genome, one in RaCCS203. The red triangles when the change is a decrease, thus a mutation by deletion: none in SARS-2, four in RaCCS203. The changes are only of few bases, a total of 30 bases for SARS-2, 50 bases for RaCCS203. The dashed vertical lines indicates when the slot is so corrupted that the largest common factor is anywhere in the 30,000 position in the target position (the figures have been cropped in order keep visibility). For SARS-2, the length of the slot (50) is the largest length where we have this kind of mismatch. Regarding RaCCS203, we again notice that the area between base 21000 and 24000 is very corrupted with crazy offsets, indicating a large mutated area.

The offset analysis seems to confirm that both genomes, and mostly SARS-2 are extremely close to RATG13 despite the important laps of time between their discovery. Regarding SARS-2 we see that the mutation is mostly by substitution and insertion, while in general there is a general understanding that most virus mutation occur by deletion as we can see on RaCCS203.


The Ariadne Covid project

We show some of the work done with Liubov Tupikana about an application, called Ariadne Covid, whose aim is to help the user to limit exposure to virus during outdoor excursion by building dedicated path. The COVID-19 pandemic has forced more than half of mankind into lock-down situation. During lock-down in some countries, outdoor excursion were authorized under strict restrictions; in France, no more than 1 hour less than 1km away from home. These restrictions had an aim to reduce the exposure rate to virus through outdoor contact with other persons. This problem is crucial in urban places, since in rural areas the virus exposure is lower. The Ariadne Covid application has two main modes of application:

  • The in-lockdown mode, where outdoor excursion have main purpose to offer physical exercises during 1 hour and therefore has no other aim than returning home after the authorized period;

  • The off-lockdown mode, where outdoor excursion have a aim, mainly commuting from home to work. We will address the case where the excursion is done by bike, since this transportation mode has become very popular, since the public transportation was considered to be too exposed to virus.

The time of the lockdown, it was already established that most Covid infection was due to airborn propagation. In short staying at less than 1 meter of an infected person during several minutes were giving a significant probability of being infected. In fact it depends of the degree of infection of the propagator (more 50% chance when close to a highly infected person during 15mn). Thus the aim of application is to find path which limit the total time of close vicinity with other people.

The in-lockdown mode of Ariadne Covid.

Paris has 2900 km of pathwalks for 2.2million inhabitants. Therefore it is possible that all Parisians walk 24 hour per day at safe distance of each other. But as an obvious statement, people have social trends which make them to gather more likely in the same places as the same time. And this was confirmed even of absence of attraction centers (shopping mails and recreation areas where closed). Using Google StreetView, I randomly sampled 55 km of Paris street (left) and enumerated the walkers at every 100 m stop points (right). The histogram of the various density encountered is not flat (figure below left), it corresponds to a Zipf distribution of coefficient 0.75. The next picture on the right is similar but display the path walk densities in Manhattan, and the last picture on the right is for Madrid.

The data was collected in situation prior to the pandemic crisis, but we expect that the lockdown worsened the situation in the streets since the shopping center were closed. The bords de Seine or other city parks and river side became crowded to the point that parks, forests were closed by the authorities, which was not in theory an happy decision since it reduced the path linear devoted to walkers. Similarly outdoor excursion dedicated to jogging could only be practiced in a five hour dayly interval and we will see below that was neither unhappy . Anyhow if we convert this data in exposure time and by letting varying the average excursion time, we find that with a 1 hour excursion time, social walkers stay in average around 15 minutes at less one meter of another walker, in the hypothesis that walker choose their excursion slot uniformly during the 10 hour of daylight. If the selection of the slot is restricted to a five hour interval, the average exposure time rises to 30mn! The Ariadne Application in its most basic settings selects the path like a random walk, since it is well known that such walks optimize the balance of street density. Of course, after 30 mn of excursion delay the walker must reverse its course. Reversing the path back home maintains the street density balance. The figure below left shows the exposure time (in minute) as a function of the outdoor excursion slot duration. In red solid is when all walkers are social, but with the their interval slot selection of 10 hour, red dashed when the selection interval is reduced to 5 hour, blue solid when all walkers are random (i.e. follows the Ariadne application) and in this case the average exposure time is below 10 mn. On the right he figure shows the exposure time of a single walker when all other walkers are social and walk during one hour. The exposure time of the single walker is function of the duration of its slot, assuming (s)he is the only one to have the possibility to extend its excursion. In bown is when the single walker is social, and in green when (s)he is random. We see that for one hour of outdoor excursion the exposure time of the random walker drops to 6 minute smaller than when all walker are social. This proves that even with a single user, the Ariadene application gives a strong benefit to this user.

In off-lockdown mode

In this mode the outdoor excursion are aimed as they have a destination (commuting to work or back to home). The issue is that a simple shortest path algorithm will not give a satisfsctory answer because the shortest path lack the path diversity which will reduce the virus exposure. Paris authorities decided to create bike lane parallel to underground lines, in order to keep commuting travel off of locked wagon . This give a loose network of preferred road. We have simulated the effect of the preferred road by using a shortest path algorithm after applying a discount of 10% of the distance weight of the concerned edges. We compare with other better balanced routing algorithm called georouting routing which we will describe later. We model Paris as a grid and the preferential roads by 14 roads in order to mimic the 14 bike lane parallel to the 14 paris suburban lines. We randomly select source and destination and repeat 3,000 times. We display the density map on the two figure below (left: preferred roads, right georouting). The color of the edges indicates the density of user per street: black above 200, red between 50 and 200, blue between 25 and 50, green between 1 and 25. The figure below left shows the density histogram if the streets in red the preferential routing, in blue the georouting. The figure below right shows the virus average exposure duration (in minute) when the peak traffic lasts 2 hour as the function of the total bike traveller population. In dashed if one user take the georouting and all the other commuters are in preferential routing. It should be noted that the safe distance has been raised to 10 m, due to the suspension of the droplet in the air as recommended by sanitary authorities.



The georouting algorithm takes the advantage of the quasi isotropy in the road orientations in the city map. Therefore a routing based on the heading toward the destination will work as well as a short path routing. To make it diversified (and avoid concentration on same streets) we introduce a random component, which consists at an intersection into selecting between the two best exit streets toward the destination with a bias which makes the average orientation to coincide with the heading toward the destination. This is illustrated in the figure below.

We have tested the georouting algorithm on a more challenging map than a grid map. We have choosen the map of Koenigsberg/Kaliningrad. This city is the place of the famous bridge problem of Leonard Euler. The intricacy of the rivers and other obstacles make the street network very challenging for the georouting. The figure below shows tbe map in 1938 on the left and the map of the modern city extracted from OpenStreet map.

The georouting algorithm is limited in the map of Kaliningrad because the numerous obstacles may create local loops. In case of loops we apply the georouting on the extended graph of the street map, extending streets up to two or four intersections if necessary. The use of georouting allows a path diversity and a better balance of the street density with the consequence of considerably reducing the exposure to the virus during commuting. In fact there exist a version of georouting, called the isotropic routing which together with a isotropic condition provides a perfect balance between streets. The isotropic condition tells that in most intersection that regardless of the entrance street, the relative heading toward the destination follows the same distribution. The isotropic condition holds in the grid map when the source and destination are randomly uniformly chosen. The two figures below show on the left the performance of georouting and on the right the performance of isotropic routing. To show the diversity we have run 5000 times the routing algorithms between one fixed pair of source and destination chosen in Central Kaliningrad. green means at least one passage, blue, more than 15 passages, red more than 25 passages, brown more than 50 passages, black more than 200 passages. The figures under show on the left the average path length of both algorithms in kilometer (brown is georouting, blue is isotropic routing), on the left the exposure duration as function of the total population commuting atthe same time. The advantage of georouting in term of kilometer is significantly attenuated when considering exposure.