Post date: Jun 29, 2015 10:59:0 PM
As an empirical example, I am estimating population ancestry frequencies in the Uygur human population. The data I used are from admixtools. The data are derived from the 'allmap.*' set which includes sequences ascertained from multiple individual combined. I am using 'French' and 'Han' as references for 'Uygur' following the admixtools paper. In this paper Patterson et al 2012 find evidence of admixture in Uygur with an admixture proportion (western Europe = French) of 0.452-0.525.
I generate one dataset per chromosome using the mkPopAnc.R script in /labs/evolution/projects/popanc_sims/humanUygur directory (see below for script). I further thinned each data set to every third SNP as I was having issues with identical SNPs resulting in singular matrixes for the LDA (this doesn't completely fix the problem and I should tweak my program to gaurd against
this).
## mkPopAnc.R
N<-836
L<-621038
gen<-matrix(scan("allmapxsp.geno",n=N*L,sep=NULL,what=integer()),nrow=L,ncol=N,byrow=TRUE)
gen[gen==9]<-NA
genX<-gen
mns<-apply(genX,1,mean,na.rm=TRUE)
for(i in 1:L){
x<-which(is.na(genX[i,])==TRUE)
genX[i,x]<-round(mns[i],0)
}
snps<-read.table("allmapx.snp",header=FALSE)
ids<-read.table("allmapx.ind",header=FALSE)
uygur<-which(ids[,3]=='Uygur')
han<-which(ids[,3]=='Han')
french<-which(ids[,3]=='French')
nu<-length(uygur)
nh<-length(han)
nf<-length(french)
a<-apply(genX[,c(han,french)],1,sd) > 0 & apply(genX[,c(han,french)],1,mean) > 0.2 & apply(genX[,c(han,french)],1,mean) < 1.8
for(i in 1:22){
x<-which(snps$V2==i & a)
x<-x[seq(1,length(x),3)]
sprint<-paste(snps$V2[x],snps$V3[x]*100,sep=":")
nl<-length(x)
## han
fh<-paste("genHan",i,".txt",sep="")
write.table(t(c(nl,nh)),file=fh,row.names=FALSE,col.names=FALSE,quote=FALSE)
write.table(cbind(sprint,genX[x,han]),fh,append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
## french
ff<-paste("genFrench",i,".txt",sep="")
write.table(t(c(nl,nf)),ff,row.names=FALSE,col.names=FALSE,quote=FALSE)
write.table(cbind(sprint,genX[x,french]),ff,append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
## uygur
fu<-paste("genUygur",i,".txt",sep="")
write.table(t(c(nl,nu)),fu,row.names=FALSE,col.names=FALSE,quote=FALSE)
write.table(cbind(sprint,genX[x,uygur]),fu,append=TRUE,row.names=FALSE,col.names=FALSE,quote=FALSE)
}
## make snp file to lookup
for(i in 1:22){
x<-which(snps$V2==i & a)
if(i != 12){x<-x[seq(1,length(x),3)]}
else{x<-x[seq(2,length(x),3)]}
sprint<-cbind(snps$V2[x],snps$V4[x])
file<-paste("chromPos",i,".txt",sep="")
write.table(sprint,file,row.names=FALSE,col.names=FALSE,quote=FALSE)
}
I ran two to four MCMC analysis with popanc for each chromosome (the variation is due to some runs failing because of the singular matrix problem, hence my need to fix this before release).
Here is an example submission command:
popanc -o estpanc_uygur15snpChrom12Chain1.hdf5 -m 30000 -b 10000 -t 5 -n 15 -d 10 -s 1 -c 1.3 -u 20 -l 0.05 -a 0.1 -z 1 /labs/evolution/projects/popanc_sims/humanUygur/genHan12.txt /labs/evolution/projects/popanc_sims/humanUygur/genFrench12.txt /labs/evolution/projects/popanc_sims/humanUygur/genUygur12.txt