Post date: May 07, 2015 10:56:54 PM
As a first pass I called variants and calculated genotype likelihoods under the assumption that all individuals were diploid. This isn't perfect, but its a necessary first step to see if I can get the ploidy of the unknowns. This also gave me a chance to check overall patterns in the data. The files described below are in /labs/evolution/data/aspen/gbs/Variants/.
The vcf file from GATK is aspenvariants.vcf and includes 724,501 SNVs. I used a vcf filter script to filter variants based on the following criteria (from the local vcfFilter.pl script):
$minCoverage = 384; # minimum number of sequences; DP
$minAltRds = 8; # minimum number of sequences with the alternative allele; AC
$notFixed = 1.0; # removes loci fixed for alt; AF
$dels = 0.1; # maximum prop. of reads spanning an indel; DEL
$mq0 = 5; # maximum number of mapping quality 0 reads; MQ0
$bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
$mqrs = 2; # maximum absolute value of the mapping quality rank sum test; MQRankSum
$qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD
$mq = 30; # minimum mapping quality; MQ
$miss = 38; # maximum number of individuals with no data
The output is in filtered2x_aspenvariants.vcf, which contains 55,474 SNVs. I then converted this to a gl file, and only kept variants with MAF >= 1%. This left 54,203 SNVs which are in filt2xAspenMaf1.gl. Finally, I converted these to genotype point estimates using the global MAF for a prior. This was done with gl2genest.pl and produced the file pntest_filt2xAspenMaf1.txt.
I ran a PCA on these data, here is the code (note that GG and Gcov are similar but not identical):
ids<-read.table("ids.txt",header=F)
G<-as.matrix(read.table("pntest_filt2xAspenMaf1.txt",header=F))
GG<- t(G) %*% G
Gcov<-matrix(NA,nrow=190,ncol=190)
for(i in 1:190){for(j in i:190){
Gcov[i,j]<-cov(G[,i],G[,j])
Gcov[j,i]<-Gcov[i,j]
}}
pcout<-prcomp(Gcov,scale=FALSE,center=TRUE)
#Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6 PC7
#Standard deviation 0.1648 0.05753 0.03475 0.02870 0.02770 0.02660 0.02407
#Proportion of Variance 0.4689 0.05714 0.02085 0.01422 0.01324 0.01222 0.01000
#Cumulative Proportion 0.4689 0.52600 0.54685 0.56107 0.57432 0.58653 0.59653
prcol<-c("darkblue","orange","forestgreen","red","darkgray","lightblue","brown")
uids<-unique(as.character(ids[,1]))
pdf("aspenPcGcovDT.pdf",width=6,height=6)
par(m=c(5.5,5.5,0.5,0.5))
plot(pcout$x[,1],pcout$x[,2],type='n',xlab="PC1 (46.8%)",ylab="PC2 (5.7%)",cex.lab=1.4,cex.axis=1.1)
for(i in 1:7){points(pcout$x[ids[,1]==uids[i],1],pcout$x[ids[,1]==uids[i],2],col=prcol[i],pch=as.character(ids[ids[,1]==uids[i],3]))}
for(i in 1:7){
mn1<-mean(pcout$x[ids[,1]==uids[i],1])
mn2<-mean(pcout$x[ids[,1]==uids[i],2])
text(mn1,mn2,uids[i])
}
legend(0.3,0.3,uids,prcol)
dev.off()
There is some population structure, but lots of variation within populations. Here are Karen's initial thoughts:
"Kind of makes sense for the x-axis. Southern & Central UT populations are together (CMI,FSL,GUT). NV was likely sampled over a larger area than the rest (perhaps explaining its breadth), but I'm surprised CSS (Steamboat Spgs CO) is not further apart."
Note that the triploid FLS samples stand out on PC2, but these could all (mostly) come from one clone. I need to look into this among other things (what do replicates look like, can I really tell diploids and triploids apart).