Post date: Nov 13, 2019 10:23:4 PM
Email from Zach about the PCA (found here):
#############################
Hmm, some things are as expected, but others are not.
First, it looks like everything went well folding the allele frequencies, and as expected you mostly have rare variants.
The PCA patterns are more complicated. When you focus on very rare variants (p < 0.01) less variation comes out on each PC and only a few trees pull off of the 0,0 point in the center. This is mostly to be expected. In other data sets, with (very) rare variants like this you tend to have different populations separate from the main cloud of points on each PC. You could be pulling out groups of trees that share somatic mutations. Follow this down a few more PC axes and see if the trees pulling off of the 0,0 point are clustered in space. This would be cool.
For 0.01 < p < 0.05 (i.e., moderatly rare variants) the first axis pulls the nearby clone from the rest. This is what I predicted for common variants (p > 0.05) but makes some sense here too if these mostly are not somatic mutations (which mostly fall in the 0.01 or less range?). But there are some misplaced yellow points. What does PC 3 for this set look like?
The actual common variants (p > 0.05) look more or less like the combined data set. Not quite what I expected, but not crazy.
All of this was for all individuals, right? You haven't yet tried this all with a minimum depth of say 4?
I think you want to try that and dig in some on spatial patterns and on the small vs. big tree comparisons.
#############################
To do today:
- p<0.01: follow down more PC axes - ok
- pc# for 0.01<p<0.05 - ok
- use different depths -
- color by small/big
Testing two read depths thresholds and redoing the allele frequency separation.
Keep 283 individuals for mean read depth > 1, and 226 individuals for read depth >4
############### Filter by read depth and separate by allele frequency ###########
#### Filter by read depth
dp <- as.matrix(read.table("depth_matrix.txt"))
# transpose to have individuals as lines and variables (SNVs) as columns
dp <- t(dp)
indMean <- rowMeans(dp)
plot(sort(indMeans))
# Individuals with less than 5 reads total
threshVec = c(1,4) # number of thresholds
keepInd = matrix(0,nrow = 296, ncol = 2 ) # matrix to store the indices boolean value
for (i in seq(1:2))
{
thresh = threshVec[i]
temp <- t(indMean > thresh)
keepInd[,i] <- temp
}
# KeepInd has the indices of the individuals we want to keep in each condition
#### Separate by allele frequency
# Allele frequency estimates
mle <- as.matrix(read.table("mle_p_pando_plus.txt"))
high <- mle>.5
# Store in a new vector to keep the original safe
mirror <- mle
mirror[high==TRUE] <- 1-mirror[high==TRUE]
# Use the indices from keepInd and mirror to only plot the "true" values
data_noQual <- data[,-c(1,2)]
# Common allele PCA
keepCommon <- data_noQual[,mirror>0.05]
keepCommon <- keepCommon[keepInd[,2]==1,]
res.pca <- prcomp(keepCommon, center = TRUE)
fviz_pca_ind(res.pca,
label = "none", # hide individual labels
habillage = data[keepInd[,2]==1,2], # color by groups
palette = c("#af8dc3", "#E7B800", "#5ab4ac"),
point.size = 2,
title = "p>0.05, mean read depth>4"
#addEllipses = TRUE # Concentration ellipses
)
# Rare allele PCA
allFilter <- mirror<0.05
rdpFilter <- keepInd[,2]==1
keepRare <- data_noQual[,allFilter]
keepRare <- keepRare[rdpFilter,]
res.pca <- prcomp(keepRare, center = TRUE)
fviz_pca_ind(res.pca,
label = "none", # hide individual labels
habillage = data[rdpFilter,2], # color by groups
palette = c("#af8dc3", "#E7B800", "#5ab4ac"),
point.size = 2,
title = "p<0.05, mean read depth>4"
#addEllipses = TRUE # Concentration ellipses
)
#########################################################################
plots can be found here.
#########################################################################
## Color by tree pairs BS together
## Color by size S versus B
# Matrix storing the size (age?) for small and big trees
size <- as.matrix(read.table("size.txt"))
size <- as.factor(size)
data_2 <- cbind(size,data)
# I would like to bind it to the type (Pando or other clones)
size_type <- rbind(track_type,size, sep = ",")
res.pca <- prcomp(data_2[,-c(1,2,3)], center = TRUE)
fviz_pca_ind(res.pca,
axes=c(3,4),
label = "none", # hide individual labels
habillage =data_2[,1], # color by groups
palette = c("#af8dc3", "#E7B800", "#5ab4ac"),
point.size = 3,
title = "all individuals"
#addEllipses = TRUE # Concentration ellipses
)
#########################################################################
## Color by tree pairs BS together
pairs <- as.matrix(read.table("pairs.txt"))
pairs <- as.factor(pairs)
data_3 <- cbind(pairs,data)
res.pca <- prcomp(data_3[,-c(1,2,3)], center = TRUE)
fviz_pca_ind(res.pca,
axes=c(1,2),
label = "none",
col.ind = data_3[,1], # hide individual labels
#habillage =data_3[,1], # color by groups
#palette = c("#af8dc3", "#E7B800", "#5ab4ac"),
point.size = 10,
title = "all individuals",
pch = 19,
#addEllipses = TRUE # Concentration ellipses
) +
theme(legend.position = "none")
To grab the different informations from the name column.
perl -p -i -e s'/potr-//g' test.txt
Quick test:
There is 11592544 squares in the SNV matrix.
There is 997691 read depths equal to 0.
Which is 8.6% of the total reads. (not too bad!)
If we think each individual has at least one read per SNV, 39164 SNV and approx. 80 bp per SNVs then we covered 3 133 120 bp (3.1 Mbp).
The genome is 360 Mb : we covered 0.83% of the genome!
This is a very rough calculation.