Xip, filtering variants, ancestry inference and null sims

Post date: Aug 03, 2017 7:44:17 PM

Here I am working with the lab hybrids in lab-cross.vcf. There are 116721 variants, which have already been filtered for mapping quality. But, there is a lot of missing data.

1. So, I first applied my own filter for missing data (and mult-allelic loci, can't do much else as there is not much information in the vcf file).

perl ../scripts/vcfFilter.pl lab-cross.vcf

I then moved the filtered file tofiltered_variants/filtered_lab-cross.vcf. It contains 41,258 variants.

2. Next I converted the file to a genotype format (similar to my gl format, but just genotypes). I would have used genotype likelihoods, but they were only given for the heterozygous genotype (really weird/crappy vcf format).

perl ../scripts/vcf2gen.pl 0.0 filtered_lab-cross.vcf

This produces geno_xip_lab_cross.txt and a corresponding (minor) allele frequency file p_xip_lab_cross.txt. The geno file has header rows with number of SNPs and individuals (41,258 and 43 respectively) a line with the ids and then one row per SNP (beginning with the SNP id).

3. My initial plan had been to generate input for popanc. However, given the very limited sampling of the parents (2 or 3 inds.) and the fact that the parents are inbred with small N for a long time (and thus actually should be mostly fixed for alleles), I decided against this. Instead I took an approach similar to what Axel had already done and filtered based on those SNPs that mostly likely were fixed and then converted these to ancestry estimates by just polarizing them. The code is in commands.R in filtered_variants. Here is the first bit:

## read in the xip genotype data skipping header rows

dat<-read.table("geno_xip_lab_cross.txt",skip=2,header=FALSE)

## create the genotype matrix

G<-as.matrix(dat[,-1])

## get id information

ids<-read.table("group_ids.txt",header=FALSE)

## get LG information

lgs<-scan("lgs.txt")

## sample allele freqs.

nl<-dim(G)[1]

p<-matrix(NA,nrow=nl,ncol=7)

for(i in 1:nl){

p[i,]<-tapply(X=G[i,],INDEX=ids[,2],mean,na.rm=TRUE)/2

}

## sample allele counts

cnt0<-matrix(NA,nrow=nl,ncol=7)

cnt1<-matrix(NA,nrow=nl,ncol=7)

cnt2<-matrix(NA,nrow=nl,ncol=7)

N<-matrix(NA,nrow=nl,ncol=7)

for(i in 1:nl){

cnt0[i,]<-tapply(X=G[i,]==0,INDEX=ids[,2],sum,na.rm=TRUE)

cnt1[i,]<-tapply(X=G[i,]==1,INDEX=ids[,2],sum,na.rm=TRUE)

cnt2[i,]<-tapply(X=G[i,]==2,INDEX=ids[,2],sum,na.rm=TRUE)

N[i,]<-tapply(X=is.na(G[i,])==FALSE,INDEX=ids[,2],sum,na.rm=TRUE)

}

## seg ratio in BC1s

ps<-matrix(NA,nrow=nl,ncol=2)

for(i in 1:nl){

if(cnt0[i,1] > cnt2[i,1]){

pr<-c(0,0.25,0.75)

}

else{

pr<-c(0.75,0.25,0)

}

ps[i,1]<-dmultinom(x=c(cnt0[i,4],cnt1[i,4],cnt2[i,4]),size=N[i,4],prob=pr)

ps[i,2]<-dmultinom(x=c(cnt0[i,5],cnt1[i,5],cnt2[i,5]),size=N[i,5],prob=pr)

}

mps<-apply(ps,1,min)

## get putatively fixed differences

x<-which(abs(p[,1]-p[,2])==1)

## 27,167 with allele freq. difference of 1, but note small sample sizes, these are not necessarily fixed

xx<-which(abs(p[,1]-p[,2])==1 & p[,3]==0.5)

## 14,405 that with allele freq. difference of 1, 0.5 in F1 and 3:1 seg ratio in BCs

xxx<-which(abs(p[,1]-p[,2])==1 & p[,3]==0.5 & mps > 0.01)

## convert to ancestry by polarizing

anc<-p[xxx,]

ancLgs<-lgs[xxx]

a<-which(anc[,1]!=1)

anc[a,]<-1-anc[a,]

## bounds for LGs

bnds<-0.5+which(ancLgs[-1] != ancLgs[-14405])

bndsPlus<-c(0.5,bnds,14405.5)

mids<-(bndsPlus[-1]+bndsPlus[-length(bndsPlus)])/2

## ancestry plots

plot(anc[order(ancLgs),7],type='l',axes=FALSE,ylim=c(0,1),ylab="ancestry freq.",xlab="linkage group")

axis(2)

axis(1,at=mids,1:length(mids))

box()

abline(v=bnds,col="lightblue",lty=2)

plot(anc[order(ancLgs),6],type='l',axes=FALSE,ylim=c(0,1),ylab="ancestry freq.",xlab="linkage group")

axis(2)

axis(1,at=mids,1:length(mids))

box()

abline(v=bnds,col="lightblue",lty=2)

4. I then ran null simulations with 10-15 female parents to see what was expected by drift. The answer is that all loci should be fixed for the BC ancestry (p<0.0001). Thus, any case of segregating ancestry suggests selection/drive/something.

## null sims

xipsim<-function(N=10,gens=100,reps=1000){

nAnc<-rep(NA,reps)

for(x in 1:reps){

p<-0.5 ## base ancestry

for(i in 1:gens){

p<-(rbinom(n=1,size=N,prob=p)/N + 0)/2

}

nAnc[x]<-p

}

return(nAnc)

}

o<-xipsim(N=10,gens=100,reps=100000)

mean(o>0)

#[1] 0

o<-xipsim(N=15,gens=100,reps=100000)

mean(o>0)

#[1] 0

Here are Manfred's comments that were the basis of these simulations:

thank you for your offer to do more analyses with our dataset, I am looking forward to exiting results. The artificial hybrids have been maintained in each generation with about 10-15 hybrid females (which had the selected phenotypes) and 1-3 males of X. hellerii, the backcross parent.

just one remark: there was a single F1 hybrid from which this line was established, so the numbers apply from the first backcross generation onwards only.

And here is the summary I sent Axel:

Sorry for the delay, I wasn't able to get much done before field work and then was without internet. However, I am now back and have been looking at analyses I started and summarizing things. Here are the results/comments I have thus far:

1. There was a good deal of missing data in the lab vcf file. This made me nervous, so I applied additional filtering to remove SNPs with missing data for more than ~10% of the individuals (by missing I mean no reads for a SNP for that individual). This dropped the number of SNPs from 116,721 to 41,258.

2. My plan had then been to extract the genotype likelihoods and run the data through a program I developed for inferring ancestry frequencies (popanc). However, the genotype likelihoods were not fully encoded in the vcf files, and upon further reflection the approach would not be optimal here. Specifically, the very low number of parents included would result in high uncertainty in parental allele frequencies and thus in ancestry frequencies. Normally this would be fine/correct, but in this case the small populations of the parents that have been maintained suggest that most SNPs should be fixed. Thus, we have better estimates of allele frequencies (mostly 0 or 1) than the genetic data alone imply. Thus, I instead followed the approach of filtering to find those that looked like they were (likely) fixed between parents based and behaving properly based on patterns of segregation in the F1 and early generation BC.

3. I then generated simple ancestry frequency estimates for each SNP. These are in the attached plots and basically mirror what you had already (which is nice, but doesn't add to what you have done).

4. The novel thing I have done so far is develop null simulations to ask what ancestry frequencies are expected given the history of backcrossing and drift (using data/information from Manfred). The answer is that all SNPs should be fixed for X. hellerii alleles/ancestry. For the SNPs where this is not true (see plots) one can reject the null drift hypothesis with p < 0.0001 (probably much less than that, this just reflects the precision based on the number of simulations). Thus you have pretty convincing evidence of selection favoring (and thus retaining) maculatus ancestry at various places in the genome (including the ones you highlight). This doesn't mean all of these SNPs are directly under selection of course, much of this is probably indirect selection via LD with causal variants, but still it implicates selection as a substantial contributor to patters of ancestry in the hybrids.

5. The next step would be to infer actual selection coefficients. I can have these this weekend if it is interest.