Post date: Nov 19, 2019 11:59:57 PM
The dataset
The dataset we focused on comprises 296 samples that contain Pando and neighboring clones.
They are Genotyping By Sequencing data, obtained after digestion by EcoRI and MseI enzymes (protocol here).
Pre-processing
The reads are 100bp before processing.
The barcodes are used to re-assign each reads to a sample name.
Then the barcodes, that are, 8,9 or 10 base-pair long, are removed. After removing cut sites and barcode, the reads are 80-85 long.
Alignement
The reads were aligned against Populus tremuloides current version of the reference genome. The genome that is available is still highly fragmented.
Mem algorithm was used to produce the sam file (alignment), which was then converted to bam (binary version of the sam human readable file).
Step 1 - Variant calling among all samples
SNVs in a first step were called using samtools/bcftools algorithms, and we obtained 497 617 SNVs (vcf file).
Filtering
The first set of SNVs were filtered using stringent conditions (refer to script for details) and we were left with 39-164SNVs.
The allele frequencies and genotype point estimates were calculated based on the genotype likelihoods of the vcf file.
Preliminary analysis
This set of SNVs were then used to have a first peak at the variation between individuals, in the whole plot sampled. The goal of this first part was to determine which trees among the 296 sequenced were Pando, and which ones were the nieghboring clones.
A PCA for all samples (Pando and surroundings) was revealed 17% of the variation was explained by the first axis (PC1). We then plotted spatially each tree and colored following their respective PC1 score, with no a priori knowledge of which clone was which. We realized the samples described as Pando previously clearly separated from the rest with a positive PC1 score (look up here!! really cool map ahead).
We used this observation to cluster the samples according the PC1 score (k-means algorithm with k=2). We obtained 109 Pando samples and 117 non Pando samples (there are 226 individuals studied after filtering for read depth > 4).
For each location, a tall tree and a smaller (younger) tree was sampled. The small and big trees tend to cluster together (here PC1 for the small tree of the pair against PC1 of the big tree of the pair tend to fall on the diagonal).
Step 2 - Variant calling among all samples
We were then pretty confident about what was Pando from the trees that were not Pando, and ready to actually call the somatic mutations inside the Pando clone.
We used MuTect2 haplotype caller to call somatic mutations, which is designed to call mutations in tumors as compared to normal samples.
As we do not have a precise idea of which tree is the ancestor tree, we do not have a reference, in other words, a "normal" tree. We thus tried 6 different combinations of the same algorithm, and using two options of the algorithm: three well covered, well spatially distributed trees within Pando, and three well covered, well spatially distributed trees in the nearby clones, with or without using a panel of normals. The panel of normal is a combination of mutations regularly seen in "normal" trees, here we chose 100 samples from Utah, Colorado, Nevada and Wyoming states (taken from the same GBD dataset).