Post date: Sep 24, 2018 7:39:51 PM
We used these whole genome sequence data to characterize patterns of diversity and differentiation for the identified SVs vs. other parts of the genome.
1. First, from the variants sub-directory (/uufs/chpc.utah.edu/common/home/u6000989/projects/timema_SV/popgen/Variants), I split the filtered gl file, filtered1X_tcr_wgs_variants.gl into one gl file per population (note the 128 in the names isn't relevant).
perl ../Scripts/splitPops.pl filtered1X_tcr_wgs_variants_x.gl
HVA: 20 3252350
HVC: 20 3252350
LA: 19 3252350
MR1A: 20 3252350
MR1C: 20 3252350
PRC: 19 3252350
R12A: 21 3252350
R12C: 21 3252350
2. Next, I used the EM algorithm and my own script to estimate population allele frequencies for the 3,167,725 SNPs. Not that I ran 100 EM starts with a tolerance of 0.001.
perl ../Scripts/runestpEM.pl tcr_*gl
Reading data from tcr_HVA.gl
Number of loci: 3252350
Number of individuals: 20
This makes the ptcr* files allele frequency files, which I moved up a directory.
3. Next, and now working in /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_SV/popgen/, I extracted the ML allele frequency estimates (P* files) and locus tags (L* files) from the ptcr files.
perl SplitP.pl ptcr__*
4. As a first pass, I ran this version of popgen.R to estimate Fst for all 13 LGs. These now make more sense.
# population genetic analyses for pop. paris with SVs
L<-3169725
files<-c("P_ptcr_HVA.txt","P_ptcr_HVC.txt","P_ptcr_LA.txt","P_ptcr_MR1A.txt","P_ptcr_MR1C.txt",
"P_ptcr_PRC.txt","P_ptcr_R12A.txt","P_ptcr_R12C.txt")
files<-files[c(1:2,4:5,7:8,3,6)]
P<-matrix(NA,nrow=L,ncol=8)
for(j in 1:8){
P[,j]<-scan(files[j])
}
H<-2 * P * (1-P)
pB<-apply(P,1,mean)
snps<-matrix(scan("snpIds.txt",n=3*L,sep=" "),nrow=L,ncol=3,byrow=TRUE)
## genome Fsts
genomeFsts<-matrix(NA,nrow=4,ncol=3)
prs<-matrix(1:8,ncol=2,byrow=TRUE)
for(i in 1:4){
a<-prs[i,1]
b<-prs[i,2]
Hs<-(H[,a]+H[,b])/2
pbar<-(P[,a]+P[,b])/2
Ht<-2 * pbar * (1-pbar)
genomeFsts[i,1]<-mean(Ht-Hs)/mean(Ht)
genomeFsts[i,2]<-mean((Ht-Hs)/Ht,na.rm=T)
genomeFsts[i,3]<-median((Ht-Hs)/Ht,na.rm=T)
}
genomeFsts
# [,1] [,2] [,3]
#[1,] 0.03082957 0.003061511 3.334334e-05
#[2,] 0.03331073 0.015301258 1.859359e-02
#[3,] 0.03854495 0.007955549 5.441040e-05
#[4,] 0.07105843 0.006177644 8.004002e-05
## genome Fsts, plus bs
genomeFsts<-matrix(NA,nrow=4,ncol=3)
prs<-matrix(1:8,ncol=2,byrow=TRUE)
for(i in 1:4){
a<-prs[i,1]
b<-prs[i,2]
Hs<-(H[,a]+H[,b])/2
pbar<-(P[,a]+P[,b])/2
Ht<-2 * pbar * (1-pbar)
genomeFsts[i,1]<-mean(Ht-Hs)/mean(Ht)
bs<-rep(NA,100)
for(j in 1:100){
a<-sample(1:L,L,replace=TRUE)
bs[a]<-mean(Ht[a]-Hs[a])/mean(Ht[a])
}
genomeFsts[i,2:3]<-quantile(bs,probs=c(0.05,0.95))
}
genomeFsts
## point estimate plus 90% bs intervals
# [,1] [,2] [,3]
#[1,] 0.03082957 0.03059924 0.03096821
#[2,] 0.03331073 0.03317753 0.03338025
#[3,] 0.03854495 0.03853262 0.03880212
#[4,] 0.07105843 0.07069285 0.07123855
save(list=ls(),file="popgen.rdat")