Genetic differentiation for candidate color genes, chumash vs. podura

Post date: Nov 21, 2017 5:22:11 PM

We are using the data from the whole genomes sequenced with the sperm donor.

Alignment and variant calling have been described, but the variant calling entry is not complete, here is the file script and filtering (this is all in king:/uufs/chpc.utah.edu/common/home/u6000989/data/timema/combind_wgs_dovetailV3/variants_tsp/):

Variant calling (jointVarCall.sh):

#!/bin/sh

#SBATCH --time=168:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=20

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=gatk

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/combind_wgs_dovetailV3/variants_tsp/

java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/tcrDovetail/version3/mod_map_timema_06Jun2016_RvNkF702.fasta --variant bart_JL_IC_2320.g.vcf --variant bart_JL_IC_2321.g.vcf --variant bart_JL_IC_2322.g.vcf --variant bart_JL_IC_2323.g.vcf --variant bart_JL_IC_2324.g.vcf --variant bart_JL_PP_2315.g.vcf --variant bart_JL_PP_2316.g.vcf --variant bart_JL_PP_2317.g.vcf --variant bart_JL_PP_2318.g.vcf --variant bart_JL_PP_2319.g.vcf --variant bart_JL_WF_2307.g.vcf --variant bart_JL_WF_2308.g.vcf --variant bart_JL_WF_2309.g.vcf --variant bart_JL_WF_2310.g.vcf --variant bart_JL_WF_2311.g.vcf --variant bart_JL_WP_870.g.vcf --variant bart_JL_WP_871.g.vcf --variant bart_JL_WP_872.g.vcf --variant bart_JL_WP_873.g.vcf --variant bart_JL_WP_874.g.vcf --variant bart_PCT_JP_2886.g.vcf --variant bart_PCT_JP_2887.g.vcf --variant bart_PCT_JP_2888.g.vcf --variant bart_PCT_JP_2889.g.vcf --variant bart_PCT_JP_2890.g.vcf --variant bart_PCT_WF_370.g.vcf --variant bart_PCT_WF_371.g.vcf --variant bart_PCT_WF_372.g.vcf --variant bart_PCT_WF_373.g.vcf --variant bart_PCT_WF_374.g.vcf --variant bart_PCT_WP_2906.g.vcf --variant bart_PCT_WP_2907.g.vcf --variant bart_PCT_WP_2908.g.vcf --variant bart_PCT_WP_2909.g.vcf --variant bart_PCT_WP_2910.g.vcf --variant chum_BS_C_447.g.vcf --variant chum_BS_C_448.g.vcf --variant chum_BS_C_449.g.vcf --variant chum_BS_C_450.g.vcf --variant chum_BS_C_451.g.vcf --variant chum_GR806_MM_1501.g.vcf --variant chum_GR806_MM_1502.g.vcf --variant chum_GR806_MM_1503.g.vcf --variant chum_GR806_MM_1504.g.vcf --variant chum_GR806_MM_1505.g.vcf --variant chum_GR806_Q_1514.g.vcf --variant chum_GR806_Q_1515.g.vcf --variant chum_GR806_Q_1516.g.vcf --variant chum_GR806_Q_1517.g.vcf --variant chum_GR806_Q_1518.g.vcf --variant curi_CRH_C_996.g.vcf --variant curi_CRH_C_997.g.vcf --variant curi_CRH_C_998.g.vcf --variant curi_CRH_M_995.g.vcf --variant curi_CRL_A_961.g.vcf --variant curi_CRL_A_962.g.vcf --variant curi_CRL_A_963.g.vcf --variant curi_CRL_A_964.g.vcf --variant curi_CRL_A_965.g.vcf --variant curi_CRL_M_967.g.vcf --variant curi_CRL_M_968.g.vcf --variant curi_CRL_M_969.g.vcf --variant curi_CRL_M_970.g.vcf --variant curi_CRL_M_971.g.vcf --variant curi_CRL_MM_984.g.vcf --variant curi_CRL_MM_985.g.vcf --variant curi_CRL_MM_986.g.vcf --variant curi_CRL_MM_987.g.vcf --variant curi_CRL_MM_988.g.vcf --variant land_BCHC_M_609.g.vcf --variant land_BCHC_M_610.g.vcf --variant land_BCHC_M_611.g.vcf --variant land_BCHC_M_612.g.vcf --variant land_BCHC_M_613.g.vcf --variant land_BCHC_Q_617.g.vcf --variant land_BCHC_Q_618.g.vcf --variant land_BCHC_Q_619.g.vcf --variant land_BCHC_Q_620.g.vcf --variant land_BCHC_Q_621.g.vcf --variant podu_BS_C_454.g.vcf --variant podu_BS_C_455.g.vcf --variant podu_BS_C_456.g.vcf --variant podu_BS_C_458.g.vcf --variant podu_PCT_WP_2934.g.vcf -ploidy 2 -o tcrSpWgsVariants.vcf

Filtering (vcfFilter.pl, also removes multi-allelic SNPs and INDELs, and non-LG SNPs):

#### stringency variables, edits as desired

my $minCoverage = 168; # minimum number of sequences; DP

my $minCount = 4; ## minmum number of alternative alleles

my $bqrs = -8; # minimum absolute value of the base quality rank sum test; BaseQRankSum

my $mqrs = -12.5; # minimum absolute value of the mapping quality rank sum test; MQRankSum

my $rprs = -8; # minimum absolute value of the read position rank sum test; ReadPosRankSum

my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD

my $mq = 40; # minimum mapping quality; MQ

my $fish = 60; #Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation b

eing seen on only the forward or only the reverse strand) in the reads. More bias is indicative of fa

lse positive calls.

The specific bugs sequenced include only green T. chumash and only brown T. podura. So, this is the comparison I will focus on for now.

Here is what I am doing (from the king:/uufs/chpc.utah.edu/common/home/u6000989/projects/timema_genarch/popgen_gendif/variants/ directory):

1. Convert the vcf file to gl, with maf = 0.01 cut-off to retain:

perl ../scripts/subvcf2gl.pl 0.01 filtered1X_tcrSpWgsVariants.vcf

This generates filtered1X_tcrSpWgsVariants.gl with 3,297,072 SNPs.

2. Estimate allele frequencies. First I split populations, and then used the EM algorithm to estimate allele frequencies for each population.

perl ../scripts/splitPops.pl filtered1X_tcrSpWgsVariants.gl

perl ../scripts/runestpEM.pl timema_*

Example command:

estpEM -i timema_bart.gl -o ptimema_bart.txt -e 0.0001 -m 20

This generates ptimema* files, which I moved to the popgen sub-directory.

3. Subset SNPs and calculate measures of differentiation/divergence.

First, get the Indel SNPs, and write to files (indel_p*) with the position (on scaffold 128) and allele frequency (this are in the popgen sub-directory).

perl ../scripts/getIndel.pl ptim*txt

Calculate Fst and Dxy in R with 20kb windows and make plots. The plots include vertical read lines for SNPs with PIPs > 0.4 in chumash (for lateral color RG or GB). Here is the code (commands.R). Nothing really stands out.

## differentiation between

p_chumash<-as.matrix(read.table("indel_ptimema_chum.txt",header=FALSE))

p_podura<-as.matrix(read.table("indel_ptimema_podu.txt",header=FALSE))

## het

het_chumash<-2 * p_chumash[,2] * (1-p_chumash[,2])

het_podura<-2 * p_podura[,2] * (1-p_podura[,2])

## fst

pbar<-(p_chumash[,2]+p_podura[,2])/2

hT<-2*pbar*(1-pbar)

hS<-(het_chumash + het_podura)/2

fst<-(hT-hS)/hT

## 20 kb windows

lb<-seq(5000000,5980000,20000)

ub<-seq(5020000,6000000,20000)

L<-length(lb)

fst_win<-rep(NA,L)

dxy_win<-rep(NA,L)

for(i in 1:L){

a<-lb[i]

b<-ub[i]

x<-which(p_chumash[,1] >= a & p_chumash[,1] <=b)

if(length(x) > 1){

fst_win[i]<-mean(hT[x]-hS[x])/mean(hT[x])

dxy<-(p_chumash[x,2] * (1-p_podura[x,2])) +

((1-p_chumash[x,2]) * p_podura[x,2])

dxy_win[i]<-sum(dxy)/20000

}

else{

fst_win[i]<-NA

}

pos<-seq(5010000,5990000,20000)

pips<-c(5177853,5763047,5796836,5034274,5034292,5466657,5708266,5763074,5763074)

pdf("fst20kb_chumashXpodura.pdf",width=8,height=6)

par(mar=c(5,5,.5,.5))

plot(pos,fst_win,pch=19,col="darkgray",xlab="position on scaffold 128 (bp)",ylab="genetic differentiation (Fst)",cex.lab=1.5)

abline(v=pips,col="red")

dev.off()

pips<-c(5177853,5763047,5796836,5034274,5034292,5466657,5708266,5763074,5763074)

pdf("dxy20kb_chumashXpodura.pdf",width=8,height=6)

par(mar=c(5,5,.5,.5))

plot(pos,dxy_win,pch=19,col="darkgray",xlab="position on scaffold 128 (bp)",ylab="genetic divergence (Dxy)",cex.lab=1.5)

abline(v=pips,col="red")

dev.off()

4. Add and compare the PIP data.

Here is my summary sent to Patrik

"Genetic differentiation/divergence between the chumash and podura genomes doesn't coincide with chumash or podura top PIP SNPs. The results I have are based on 20kb windows (could adjust this some, but not much without making them too big or having too few SNPs).

i. The top PIP windows, top Fst windows and top Dxy windows are all distinct.

ii. There is not overall correlation between Fst or Dxy and PIPs (podura, chumash, or the sum of both). Here are the actual correlations and p-values (the comparisons should be mostly clear from the cor.test commands, ch=chumash, po=podura, and the last two are for the sum PIPs across both):

cor.test(pip_ch_win,dxy_win)

# r = -0.204, p = 0.485

cor.test(pip_ch_win,fst_win)

# r = 0.0615, p = 0.835

cor.test(pip_po_win,dxy_win)

# r = -0.179, p = 0.426

cor.test(pip_po_win,fst_win)

# r = -0.277, p = 0.212

cor.test(apply(cbind(pip_po_win,pip_ch_win),1,sum,na.rm=TRUE),fst_win)

# r = 0.151, p = 0.387

cor.test(apply(cbind(pip_po_win,pip_ch_win),1,sum,na.rm=TRUE),dxy_win)

# r = -0.0700, p = 0.691

iii. There really isn't anything in terms of distance from the candidate genes either. We have 50, 20kb windows, so for this to be significant you would need the gene to be the top PIP, Dxy or Fst window (as soon as you are one window off you have p = 0.06, being right on would be p = 0.02). The only close case is for the chumash top PIP window being one window away from 5750 (note that playing with window boundaries could perhaps make these overlap, but that it pretty circular and of course we identified the top candidates in part based on top PIP SNPs anyway). And finally, it is not even true that the two genes are in the top several 20kb windows.

There is of course still a chance we will find something with bartmani... I think part of the issue here is that chumash and podura are sufficiently divergent to mask any signal, particularly when you combine this with the inversion in podura."

And here is the updated commands.R:

## differentiation between

p_chumash<-as.matrix(read.table("indel_ptimema_chum.txt",header=FALSE))

p_podura<-as.matrix(read.table("indel_ptimema_podu.txt",header=FALSE))

## het

het_chumash<-2 * p_chumash[,2] * (1-p_chumash[,2])

het_podura<-2 * p_podura[,2] * (1-p_podura[,2])

## fst

pbar<-(p_chumash[,2]+p_podura[,2])/2

hT<-2*pbar*(1-pbar)

hS<-(het_chumash + het_podura)/2

fst<-(hT-hS)/hT

## 20 kb windows

lb<-seq(5000000,5980000,20000)

ub<-seq(5020000,6000000,20000)

L<-length(lb)

fst_win<-rep(NA,L)

dxy_win<-rep(NA,L)

for(i in 1:L){

a<-lb[i]

b<-ub[i]

x<-which(p_chumash[,1] >= a & p_chumash[,1] <=b)

if(length(x) > 1){

fst_win[i]<-mean(hT[x]-hS[x])/mean(hT[x])

dxy<-(p_chumash[x,2] * (1-p_podura[x,2])) +

((1-p_chumash[x,2]) * p_podura[x,2])

dxy_win[i]<-sum(dxy)/20000

}

else{

fst_win[i]<-NA

}

pos<-seq(5010000,5990000,20000)

pips<-c(5177853,5763047,5796836,5034274,5034292,5466657,5708266,5763074,5763074)

pdf("fst20kb_chumashXpodura.pdf",width=8,height=6)

par(mar=c(5,5,.5,.5))

plot(pos,fst_win,pch=19,col="darkgray",xlab="position on scaffold 128 (bp)",ylab="genetic differentiation (Fst)",cex.lab=1.5)

abline(v=pips,col="red")

dev.off()

pips<-c(5177853,5763047,5796836,5034274,5034292,5466657,5708266,5763074,5763074)

pdf("dxy20kb_chumashXpodura.pdf",width=8,height=6)

par(mar=c(5,5,.5,.5))

plot(pos,dxy_win,pch=19,col="darkgray",xlab="position on scaffold 128 (bp)",ylab="genetic divergence (Dxy)",cex.lab=1.5)

abline(v=pips,col="red")

dev.off()

## direct comparison with pips for lateral

pip<-vector("list",4)

pip[[1]]<-read.table("../../mapping/latGB/output/Tchumash_GR8.06.latGB.bial.noindel.qs20.cov50.mdp265Mdp5310.maf0.01.lgNA.excluded_sparse_effects_all_SNPs_except_MAFcutoff_formatted.txt",header=TRUE)

pip[[2]]<-read.table("../../mapping/latRG/output/Tchumash_GR8.06.latRG.bial.noindel.qs20.cov50.mdp265Mdp5310.maf0.01.lgNA.excluded_sparse_effects_all_SNPs_except_MAFcutoff_formatted.txt",header=TRUE)

pip[[3]]<-read.table("../../mapping/latGB/output/Tpodura_BSC.latGB.bial.noindel.qs20.cov50.mdp25Mdp420.maf0.01.lgNA.excluded_sparse_effects_all_SNPs_except_MAFcutoff_formatted.txt",header=TRUE)

pip[[4]]<-read.table("../../mapping/latRG/output/Tchumash_GR8.06.latRG.bial.noindel.qs20.cov50.mdp265Mdp5310.maf0.01.lgNA.excluded_sparse_effects_all_SNPs_except_MAFcutoff_formatted.txt",header=TRUE)

## sum of pips for two traits, and subset to indel

x<-which(pip[[1]]$scaffold == 128 & pip[[1]]$PS >= 5000000 & pip[[1]]$PS <= 6000000)

pip_chumash<-cbind(pip[[1]]$PS[x],pip[[1]]$gamma[x]+pip[[2]]$gamma[x])

x<-which(pip[[3]]$scaffold == 128 & pip[[3]]$PS >= 5000000 & pip[[3]]$PS <= 6000000)

pip_podura<-cbind(pip[[3]]$PS[x],pip[[3]]$gamma[x]+pip[[4]]$gamma[x])

## 20 kb windows

lb<-seq(5000000,5980000,20000)

ub<-seq(5020000,6000000,20000)

L<-length(lb)

pip_ch_win<-rep(NA,L)

pip_po_win<-rep(NA,L)

for(i in 1:L){

a<-lb[i]

b<-ub[i]

x<-which(pip_chumash[,1] >= a & pip_chumash[,1] <=b)

if(length(x) >= 1){

pip_ch_win[i]<-sum(pip_chumash[x,2])

}

else{

pip_ch_win[i]<-NA

}

x<-which(pip_podura[,1] >= a & pip_podura[,1] <=b)

if(length(x) >= 1){

pip_po_win[i]<-sum(pip_podura[x,2])

}

else{

pip_po_win[i]<-NA

}

## nothing with overall correlations

cor.test(pip_ch_win,dxy_win)

# r = -0.204, p = 0.485

cor.test(pip_ch_win,fst_win)

# r = 0.0615, p = 0.835

cor.test(pip_po_win,dxy_win)

# r = -0.179, p = 0.426

cor.test(pip_po_win,fst_win)

# r = -0.277, p = 0.212

cor.test(apply(cbind(pip_po_win,pip_ch_win),1,sum,na.rm=TRUE),fst_win)

# r = 0.151, p = 0.387

cor.test(apply(cbind(pip_po_win,pip_ch_win),1,sum,na.rm=TRUE),dxy_win)

# r = -0.0700, p = 0.691

## top windows are all different

save(list=ls(),file="popgen.rdat")

Page updated

Google Sites

Report abuse