Population genomic analysis of sequences withing SVs

Post date: Sep 24, 2018 7:39:51 PM

We used these whole genome sequence data to characterize patterns of diversity and differentiation for the identified SVs vs. other parts of the genome.

1. First, from the variants sub-directory (/uufs/chpc.utah.edu/common/home/u6000989/projects/timema_SV/popgen/Variants), I split the filtered gl file, filtered1X_tcr_wgs_variants.gl into one gl file per population (note the 128 in the names isn't relevant).

perl ../Scripts/splitPops.pl filtered1X_tcr_wgs_variants_x.gl

HVA: 20 3252350

HVC: 20 3252350

LA: 19 3252350

MR1A: 20 3252350

MR1C: 20 3252350

PRC: 19 3252350

R12A: 21 3252350

R12C: 21 3252350

2. Next, I used the EM algorithm and my own script to estimate population allele frequencies for the 3,167,725 SNPs. Not that I ran 100 EM starts with a tolerance of 0.001.

perl ../Scripts/runestpEM.pl tcr_*gl

Reading data from tcr_HVA.gl

Number of loci: 3252350

Number of individuals: 20

This makes the ptcr* files allele frequency files, which I moved up a directory.

3. Next, and now working in /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_SV/popgen/, I extracted the ML allele frequency estimates (P* files) and locus tags (L* files) from the ptcr files.

perl SplitP.pl ptcr__*

4. As a first pass, I ran this version of popgen.R to estimate Fst for all 13 LGs. These now make more sense.

# population genetic analyses for pop. paris with SVs

L<-3169725

files<-c("P_ptcr_HVA.txt","P_ptcr_HVC.txt","P_ptcr_LA.txt","P_ptcr_MR1A.txt","P_ptcr_MR1C.txt",

"P_ptcr_PRC.txt","P_ptcr_R12A.txt","P_ptcr_R12C.txt")

files<-files[c(1:2,4:5,7:8,3,6)]

P<-matrix(NA,nrow=L,ncol=8)

for(j in 1:8){

P[,j]<-scan(files[j])

}

H<-2 * P * (1-P)

pB<-apply(P,1,mean)

snps<-matrix(scan("snpIds.txt",n=3*L,sep=" "),nrow=L,ncol=3,byrow=TRUE)

## genome Fsts

genomeFsts<-matrix(NA,nrow=4,ncol=3)

prs<-matrix(1:8,ncol=2,byrow=TRUE)

for(i in 1:4){

a<-prs[i,1]

b<-prs[i,2]

Hs<-(H[,a]+H[,b])/2

pbar<-(P[,a]+P[,b])/2

Ht<-2 * pbar * (1-pbar)

genomeFsts[i,1]<-mean(Ht-Hs)/mean(Ht)

genomeFsts[i,2]<-mean((Ht-Hs)/Ht,na.rm=T)

genomeFsts[i,3]<-median((Ht-Hs)/Ht,na.rm=T)

}

genomeFsts

# [,1] [,2] [,3]

#[1,] 0.03082957 0.003061511 3.334334e-05

#[2,] 0.03331073 0.015301258 1.859359e-02

#[3,] 0.03854495 0.007955549 5.441040e-05

#[4,] 0.07105843 0.006177644 8.004002e-05

## genome Fsts, plus bs

genomeFsts<-matrix(NA,nrow=4,ncol=3)

prs<-matrix(1:8,ncol=2,byrow=TRUE)

for(i in 1:4){

a<-prs[i,1]

b<-prs[i,2]

Hs<-(H[,a]+H[,b])/2

pbar<-(P[,a]+P[,b])/2

Ht<-2 * pbar * (1-pbar)

genomeFsts[i,1]<-mean(Ht-Hs)/mean(Ht)

bs<-rep(NA,100)

for(j in 1:100){

a<-sample(1:L,L,replace=TRUE)

bs[a]<-mean(Ht[a]-Hs[a])/mean(Ht[a])

}

genomeFsts[i,2:3]<-quantile(bs,probs=c(0.05,0.95))

}

genomeFsts

## point estimate plus 90% bs intervals

# [,1] [,2] [,3]

#[1,] 0.03082957 0.03059924 0.03096821

#[2,] 0.03331073 0.03317753 0.03338025

#[3,] 0.03854495 0.03853262 0.03880212

#[4,] 0.07105843 0.07069285 0.07123855

save(list=ls(),file="popgen.rdat")

Page updated

Google Sites

Report abuse