The Concept

STRUCTURAL VARIATIONS

Structural variations fall into two main categories: balanced and unbalanced. The basic variation types are insertions, deletions, duplications, translocations and inversions. Balanced variations are genome rearrangements that do not change the total DNA content; these are mainly inversions and intra- or inter-chromosomal translocations. Unbalanced variations, on the other hand, are rearrangements that change the total DNA content, such as insertions, deletions and duplications.

SEQUENCING - Read Depth

Sequencing reads are mapped to different positions on the reference genome, so reads mapped to nearby positions overlap heavily. Such mappings are usually visualized by stacking the reads on top of the reference sequence. Sequencing depth is the number of reads mapped to a given genomic position; a higher depth means more independent measurements of that position, which increases confidence in the base call in the presence of sequencing errors. Depth of coverage (DOC) is an important signal for detecting gains or losses in a donor sample compared to the reference genome: a region that has been deleted in the donor will have fewer reads mapped to it, and a region that has been gained (for example duplicated) will have more.
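As an illustration of how depth can be derived from mapped reads, here is a minimal sketch. It assumes each read is reduced to a half-open (start, end) interval on the reference; the function name and the toy reads are hypothetical and not part of Meander.

```python
from itertools import accumulate

def depth_of_coverage(reads, ref_length):
    """Per-base read depth for reads given as half-open (start, end) intervals."""
    diff = [0] * (ref_length + 1)
    for start, end in reads:
        diff[start] += 1  # depth rises where a read begins
        diff[end] -= 1    # and drops just past where it ends
    # The running sum of the differences is the depth at each position.
    return list(accumulate(diff[:-1]))

# Hypothetical toy data: no read covers positions 8-11.
reads = [(0, 6), (2, 8), (4, 8), (12, 18), (14, 20)]
depth = depth_of_coverage(reads, 20)
print(depth)
```

Positions 8-11 come out with depth 0, which is the kind of drop a deleted region would produce.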

 

SEQUENCING - Paired-End

In paired-end sequencing, both ends of each DNA fragment are sequenced (paired-end reads or mate pairs) and then mapped back to the reference sequence. Notably, the two end reads of each fragment are long enough to be mapped uniquely to the reference genome. The idea behind this strategy is that the two ends should map at an expected distance from each other, determined by the fragment size of the sequencing library, and in an expected orientation. If the mapped distance differs from the expected one, or if the mapping has a different orientation, this is a clear indicator of a possible structural variation. For example, a mapped distance larger than expected suggests a deletion in the donor, while a smaller one suggests an insertion.
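The check described above can be sketched as follows. This is only an illustration: the function name, the tolerance parameter and the "FR" (forward/reverse) label for the expected orientation are assumptions, not something the text specifies.

```python
def classify_pair(left_start, right_end, expected, tolerance, orientation="FR"):
    """Label one mapped read pair as concordant or discordant.

    expected  : expected mapping distance of the library (assumed known)
    tolerance : allowed deviation from `expected` (hypothetical parameter)
    orientation: "FR" is assumed to be the expected orientation
    """
    if orientation != "FR":
        return "discordant_orientation"
    span = right_end - left_start  # observed mapping distance on the reference
    if abs(span - expected) > tolerance:
        return "discordant_distance"
    return "concordant"

print(classify_pair(1000, 1310, expected=300, tolerance=50))
print(classify_pair(1000, 2500, expected=300, tolerance=50))
print(classify_pair(1000, 1300, expected=300, tolerance=50, orientation="FF"))
```

Pairs flagged as discordant are the candidates that hint at a structural variation near their mapping positions.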

THE PROBLEM

While a human chromosome consists of millions of bases (roughly 50-250 million), a typical screen offers only a limited number of pixels (900-1200 pixels in width). A first approach would be to draw the chromosome linearly and overlay related information on it, such as the coverage histogram or the paired ends that map to different positions. Such an approach is not efficient, though, because ~1000 pixels is far too little space to show these millions of bases linearly. Even if we split the chromosome into 1000 buckets and assign each bucket the average coverage of its bases so that the chromosome fits the space, we lose information due to the low resolution.

Chromosome 22, for example, consists of roughly 49.000.000 bases. With only 1000 pixels available, each bucket would have to hold 49.000.000/1000=49.000 bases.

To visualize the data at a higher resolution on a single screen, more space is required. Therefore, we use a 2D plane of 512x512=262.144 pixels, roughly 262 times more than the 1000 pixels we had before.

The next step is to use a Hilbert curve to visualize one chromosome. A Hilbert curve (also known as a Hilbert space-filling curve) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891.

As the fold level of the Hilbert curve increases, more of the available area is covered. At fold level 9, the Hilbert curve visits every single pixel of our panel, as 2^9x2^9=512x512=262.144 pixels.
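To make the mapping concrete, the sketch below converts a position along the curve into pixel coordinates using the widely known iterative index-to-(x, y) conversion. The function name is hypothetical, and the starting corner and orientation of the curve are assumptions; Meander's actual layout may be rotated or mirrored.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve of the given fold level to (x, y)."""
    side = 1 << order  # the grid is side x side, e.g. 512 for fold level 9
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant so sub-curves connect
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# At fold level 9 every one of the 512x512 pixels is visited exactly once.
points = [hilbert_d2xy(9, d) for d in range(4 ** 9)]
print(len(set(points)) == 512 * 512)
```

Bucket number d of the chromosome would then be drawn at pixel `hilbert_d2xy(9, d)`.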

WHY USE SPACE-FILLING CURVES?

If we followed a naive approach of breaking the chromosome into multiple lines, two points that are vertically aligned but belong to different lines would appear very close to each other, which is misleading.

Similarly, the end of one line and the beginning of the next would appear very far away from each other, although they are adjacent on the chromosome.

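One of the main advantages of a Hilbert curve is that it largely overcomes this problem: two points that are close to each other on the chromosome are also drawn close to each other on the Hilbert curve, so neighbouring regions of the plot generally correspond to neighbouring regions of the sequence.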
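The difference between the two layouts can be checked numerically. The sketch below, with hypothetical helper names and a small 16x16 grid chosen only for speed, measures the largest on-screen jump between two consecutive buckets for the line-by-line layout and for a Hilbert layout. It tests only one direction of the claim: neighbours on the sequence stay neighbours on the plane.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve of the given fold level to (x, y)."""
    side = 1 << order
    x = y = 0
    t, s = d, 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def row_major(width, d):
    """Naive layout: fill one line after another, left to right."""
    return d % width, d // width

def max_step(points):
    """Largest Manhattan distance between consecutive positions."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

order = 4
width = 1 << order
n = width * width
rows = [row_major(width, d) for d in range(n)]
curve = [hilbert_d2xy(order, d) for d in range(n)]
print("line-by-line:", max_step(rows))
print("Hilbert:", max_step(curve))
```

In the line-by-line layout the largest jump occurs at every line break, while consecutive buckets on the Hilbert curve always land on adjacent pixels.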

 

THE MEANDER APPROACH

Having a 2D space of 512x512=262.144 pixels, we assign one coverage value to each pixel. To do that, we sequentially split the chromosome into 262.144 buckets and, for each bucket, calculate the average coverage of the bases that fall into it.
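The bucketing step can be sketched as follows, assuming the per-base coverage is available as a plain list. The function name is hypothetical, and how Meander handles bucket boundaries when the length is not evenly divisible is not stated, so the integer split used here is an assumption.

```python
def bucket_means(coverage, n_buckets):
    """Split per-base coverage sequentially into n_buckets and average each."""
    n = len(coverage)
    means = []
    for b in range(n_buckets):
        lo = b * n // n_buckets        # first base of bucket b
        hi = (b + 1) * n // n_buckets  # one past its last base
        chunk = coverage[lo:hi]
        means.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return means

# Hypothetical toy coverage for 10 bases, reduced to 4 buckets.
coverage = [4, 4, 6, 6, 0, 0, 10, 12, 9, 3]
print(bucket_means(coverage, 4))
```

For a whole chromosome, `n_buckets` would be 262144 and each resulting mean would be assigned to one pixel.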

Chromosome 22, for example, consists of roughly 49.591.432 bases. With 262.144 pixels available, each bucket holds 49.591.432/262.144≈189 bases. In a linear representation (see above) on a 900-pixel-wide screen, each bucket would hold 49.591.432/900≈55.102 bases. The Hilbert curve therefore visualizes the data at a much higher resolution than the linear representation, as 189<<55.102 bases/bucket.

 

 MEANDER Visualization

This is a typical representation of chromosome 22 using the Hilbert curve. Chromosome 22 was split into 262.144 buckets, and each pixel represents one bucket. The Hilbert curve shows the signal intensities mapped to a gray scale (RGB values 0-255) with adjusted transparency. The higher the signal intensity, the darker the pixel; the lower the intensity, the lighter the pixel. White areas represent coverage=0 or absence of data (bottom left corner).
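One possible way to turn a bucket's coverage into a gray level is sketched below. The linear scaling, the clipping at a chosen maximum and the function name are assumptions for illustration; the text does not specify the exact mapping, and transparency is left out.

```python
def coverage_to_gray(value, max_value):
    """Map coverage to an 8-bit gray level: 255 is white, 0 is black."""
    if max_value <= 0 or value <= 0:
        return 255  # no signal or no data is drawn white
    scaled = min(value, max_value) / max_value  # clip, then scale to 0..1
    return round(255 * (1.0 - scaled))          # higher coverage, darker pixel

for cov in (0, 10, 30, 40):
    level = coverage_to_gray(cov, max_value=40)
    print(cov, (level, level, level))  # same value for R, G and B
```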

The red axis indicates the coordinate system as if we were looking at the chromosome linearly.