Post date: May 25, 2015 7:44:59 PM
We gave data from populations on the west and east side of the great basin, but only the former are relevant for the host-plant variation paper Matt is working on. So, for now I restrict my analyses mostly to these populations. Here is a description of what I have done (files are in labs/evolution/projects/alfalfaGen/popgen/forMatt):
Methods
DNA was isolated and purified from dessicated leaf tissue sampled from 132 alfalfa plants using Qiagen's DNAeasy 96 Plant Kit (Qiagen Inc.). We then generated DNA fragment libraries for genotyping-by-sequencing using our established protocol (Gompert et al. 2012, Parchman et al. 2012, Gompert et al. 2014). Specifically, genomic DNA was first enzymatically digested with the restriction enzymes EcoRI and MseI. Double-stranded adaptor oligonucleotides were then ligated onto the sticky-ends of the digested DNA fragments. These adaptors included 8-10 base pair (bp) barcode sequences that could later be used to assign DNA sequences to individual plants. Fragment libraries were amplified using PCR and fragments between 250 and 350 bps in length were isolated and purified using a BluePippin (Sage Science). The fragment libraries were then sequenced on an Illumina HiSeq 2500 (one lane, 1 x 100 base pair reads) at the University of Texas Genome Sequence and Analysis Facility.
We aligned the DNA sequences (236 million DNA sequences total) to a draft genome sequence generated from the diploid progenitor of alfalfa (total scaffold length = 673 Mbp, N50 scaffold size = 37 kbp, number of scaffolds = 41319; we will fully describe this genome sequence in a future publication). Sequences were aligned using the aln algorithm in bwa (Li and Durbin 2009) with a maximum of 5 miss-matches, and a 20 bp seed with only two allowed miss-matches in the seed. Bases with quality scores lower than 10 were excluded from the alignment. We used samtools to compress, sort and index the alignments. We then used the Unified Genotyper in GATK to identify variable nucleotide positions and calculate genotype likelihoods for each individual and variable position (this is a Bayesian genotype and variant caller). We assumed a ploidy of 4 as alfalfa is a tetraploid. We set the minimum base quality to 20 and set the prior expectation for heterozygosity to 0.001. A custom Perl script was then used to filter the initial set of variants. We retained those variants that met the following criteria: 2x minimum average coverage, 8 or more sequences containing the non-reference allele, less than 10% of sequences spanning an insertion-deletion, no more than 5 mapping quality 0 sequences, a minimum mapping quality of 30, at least one sequence in 72% of individuals, a maximum absolute value of the mapping quality rank sum test of 2, a maximum absolute value of the base quality rank sum test of 3, and a minimum ratio of variant confidence to non-reference sequence coverage of 2. We also excluded alfalfa samples with mean sequence coverage less than 2x (i.e. we dropped low coverage nucleotide positions and low coverage individuals). 71 plants (five populations) and 16,920 single nucleotide variants (SNVs) were retained for population genetic analysis.
We estimated the posterior probabilities of each genotype for each individual at each locus using a Bayesian approach. To do this we took the genotype likelihoods from GATK and multiplied them by the prior probabilities of each genotype assuming Hardy-Weinberg genotype frequencies and a maximum likelihood estimate of global non-reference allele frequency, which was also obtained using GATK. We then took the mean of the posterior distribution for each locus and individual as the genotype estimate for downstream analysis (this value is between 0 and 4, is not constrained to be an integer, and is an estimate of the number of non-reference allele copies at a locus). This general approach, which we have used previously (e.g. Gompert et al. 2014, Gompert et al. 2015), allowed us to make better use of the information in low to moderate coverage population genomic data than we would have been able to if we had simply called genotypes directly from the sequence data for each individual.
We used several methods to quantify and summarize patterns of genetic variation within and among populations. We generated a genetic covariance matrix where each element in the matrix measured the genetic similarity (covariance in genotypes) for a pair of individuals. We then used principal components analysis (PCA) to visualize patterns of genetic similarity based on this matrix. Next, to quantify genetic variation within populations, we calculated the average expected heterozygosity (1 - [p^2 + (1 - p)^2]) and variance in PC 1 and 2 scores. Genome-average pairwise and global Fst were then estimated from the sample allele frequencies to assess the extent of population genetic structure. Finally, we tested for a positive correlation between genome-average Fst and the geographic distance between pairs of populations to determine whether alfalfa showed signs of isolation-by-distance (a Mantel test with 1000 permutations was used to test whether the observed correlation was significantly different from 0).
Results
PCs 1 and 2 explained 16.2 and 5.3% of the variation in pairwise genetic similarities (Figure X). VUH and GVL differed most with respect to PC1, whereas PC2 separated these populations from AWFS. The AWFS population was notably more variable with respect to both PC1 (var = 0.011) and PC2 (var = 0.0052) scores (var PC1: AFAL = 0.003, GVL = 0.006, SCC = 0.005, VUH = 00.8; var PC2: AFAL < 0.0001, GVL = 0.0008, SCC = 0.0004, VUH = 0.0011). However, all five populations exhibited similar levels of expected heterozygosity (0.27-0.29). Overall, little genetic variation was partitioned among populations (global Fst = 0.020, 95% bootstrap CIs 0.020-0.021). While all pairs of populations were genetically similar, population pairs varied in the degree to which they differed from each other (mean pairwise Fst = 0.013, minimum = 0.008, maximum = 0.019). Finally, we found no evidence that genetic differences were greater between more geographically distant populations (Mantel r = -0.463, p = 0.95).
And here is the R code:
## Alfalfa initial analyses
library(RColorBrewer)
nl<-17237
N<-238
G<-matrix(scan("pntest_filtered2x_alfalfavariants.txt",n=N*nl,sep=" "),nrow=nl,ncol=N,byrow=TRUE)
depth<-matrix(scan("depth_filtered2x_alfalfavariants.txt",n=N*nl,sep=" "),nrow=nl,ncol=N,byrow=TRUE)
ids<-read.table("ids.txt",header=FALSE)
## identify low-coverage individuals
lDepth<-apply(depth,1,mean)
mnDepth<-apply(depth,2,mean)
x<-which(mnDepth < 2)
y<-which(lDepth < 2 | lDepth > 25)
Gs<-G[-y,-x]
IDs<-ids[-x,]
Ns<-N-length(x)
pops<-as.character(unique(IDs[,1]))
### subset for wester pops with data ###
tapply(X=Gs[1,] > -1,INDEX=IDs[,1],sum)
#AFAL ALP APLL AWFS BWP CFS CKV DBS GRO GVL HWR SCC VUH
# 6 17 1 17 1 18 17 16 14 19 19 19 10
popsw<-pops[c(1,4,10,12,13)]
x<-which(IDs[,1] %in% popsw)
Nw<-length(x)
## 71 ## no. inds
IDw<-IDs[x,]
Gw<-Gs[,x]
dim(Gw)
## 16920 71
## genetic covariance matrix
Gcov<-matrix(NA,nrow=Nw,ncol=Nw)
for(i in 1:Nw){for( j in 1:Nw){
Gcov[i,j]<-cov(Gw[,i],Gw[,j])
}}
## pca
out<-prcomp(Gcov,center=TRUE,scale=FALSE)
myc<-brewer.pal(5,"Set2")
pdf("alfalfaPCAWest.pdf",width=6,height=6)
par(mar=c(5.5,5.5,0.5,0.5))
plot(out$x[,1],out$x[,2],xlab="PC1 (16.2%)",ylab="PC2 (5.3%)",cex.lab=1.5,cex.axis=1.2,type='n')
for(i in 1:5){
a<-which(as.character(IDw[,1])==popsw[i])
mn1<-mean(out$x[a,1])
mn2<-mean(out$x[a,2])
for(j in 1:length(a)){
lines(c(mn1,out$x[a[j],1]),c(mn2,out$x[a[j],2]),col=myc[i])
}
}
for(i in 1:5){
a<-which(as.character(IDw[,1])==popsw[i])
mn1<-mean(out$x[a,1])
mn2<-mean(out$x[a,2])
text(mn1,mn2,popsw[i])
}
legend(0.1,0.21,popsw,col=myc,lty=1,lwd=1.5,cex=1)
dev.off()
## var pc scores
tapply(X=out$x[,1],INDEX=as.character(IDw[,1]),var)
# AFAL AWFS GVL SCC VUH
#0.002664695 0.011070755 0.006162062 0.004566879 0.007897510
tapply(X=out$x[,2],INDEX=as.character(IDw[,1]),var)
# AFAL AWFS GVL SCC VUH
#0.0000102663 0.0052433932 0.0007520867 0.0003628479 0.0011236866
## allele frequencies
npw<-5
nlw<-dim(Gw)[1]
P<-matrix(NA,nrow=nlw,ncol=npw)
for(j in 1:npw){
P[,j]<-apply(Gw[,IDw[, 1] == popsw[j]], 1, mean)/4
}
Hs<-1 - (P^4 + (1-P)^4)
mnHs<-apply(Hs,2,mean)
mnHs
## [1] 0.2860762 0.2903586 0.2907043 0.2938623 0.2737054
## Gst overall and pairwise
Pbar<-apply(P,1,mean)
Gt<-Pbar^4 + (1-Pbar)^4
Gs<-apply(P^4,1,mean) + apply((1-P)^4,1,mean)
Ht<-1-Gt
mHs<-1-Gs
Gst<-mean(Ht-mHs)/mean(Ht)
## bootstrap Gst
bsGst<-rep(NA,1000)
for(i in 1:1000){
x<-sample(1:nlw,nlw,replace=TRUE)
bsGst[i]<-mean(Ht[x]-mHs[x])/mean(Ht[x])
}
quantile(bsGst,probs=c(0.025,0.975))
# 2.5% 97.5%
#0.02004969 0.02076918
## pairwise Gst
GstMat<-matrix(0,nrow=5,ncol=5)
for(j in 1:4){for(k in (j+1):5){
Pbar<-(P[,j]+P[,k])/2
Gt<-Pbar^4 + (1-Pbar)^4
Gs<-apply(P[,c(j,k)]^4,1,mean) + apply((1-P[,c(j,k)])^4,1,mean)
Ht<-1-Gt
mHs<-1-Gs
GstMat[j,k]<-mean(Ht-mHs)/mean(Ht)
GstMat[k,j]<-GstMat[j,k]
}
}
rownames(GstMat)<-popsw
colnames(GstMat)<-popsw
library(fields)
latlong<-read.table("latLong.txt",header=FALSE,sep=",")
geoD<-rdist.earth(as.matrix(latlong[,-1]),miles=FALSE)
r<-cor(geoD[lower.tri(geoD)],GstMat[lower.tri(GstMat)])
null<-rep(NA,1000)
for(k in 1:1000){
x<-sample(1:5,5,replace=FALSE)
gnull<-geoD[x,]
null[k]<-cor(gnull[lower.tri(gnull)],GstMat[lower.tri(GstMat)])
}
mean(null >= r)
# p = 0.95
# r = -0.463
## pop Gst table
write.table(GstMat,"gstTable.csv",row.names=TRUE,col.names=TRUE,quote=FALSE,sep=",")
## individual genetic distance, avg. difference in allele counts
GenDist<-matrix(NA,nrow=Nw,ncol=Nw)
for(i in 1:Nw){for( j in 1:Nw){
GenDist[i,j]<-mean(abs(Gw[,i]-Gw[,j]))
}}
rownames(GenDist)<-IDw[,2]
colnames(GenDist)<-IDw[,2]
GenDist<-GenDist/4
write.table(GenDist,"genDistTable.csv",row.names=TRUE,col.names=TRUE,quote=FALSE,sep=",")
GenDistHat<-matrix(NA,nrow=Nw,ncol=Nw)
for(i in 1:Nw){for( j in 1:Nw){
GenDistHat[i,j]<-mean(abs(round(Gw[,i],0)-round(Gw[,j],0)))
}}