Post date: Sep 17, 2017 3:29:14 AM
I simulated wing pattern (trait) data to see how the BSLMM in gemma performs with different degrees of structure.
1. Simulations
I am using the actual Lycaeides genotype data as the basis for phenotype simulations. 10 or 100 causal SNPs are chosen from the actual set of SNPs, and I am working with four genetic data sets (all bugs, anna, melissa, and SIN). Heritabilities are all 0.5 and their are 50 simulations per set of conditions. The relevant in/outfiles are in /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/infiles/.
Here is the simulation code from simPheno.R:
## all simulations have a heritability of 0.5
## with 10 or 100 causal variants (effect draft from standard normal)
## phenotypes than normalized to have mean 0 and sd 1, before adding second random deviate from N(0,1) (this yields heritability of 0.5)
## causal variants are in the data set
## use: all data, anna, melissa and SIN
## 50 reps each = 50 * 2 * 4 = 400 data sets
simp<-function(Gmat=NA,h2=0.5,nsnps=NA,nreps=50){
N<-dim(Gmat)[2];L<-dim(Gmat)[1]
phm<-matrix(NA,nrow=N,ncol=nreps)
snps<-matrix(NA,nrow=nsnps,ncol=nreps) ## numeric id of causal snps
effects<-snps ## effects of snps
for(i in 1:nreps){
snps[,i]<-sample(1:L,nsnps,replace=FALSE)
effects[,i]<-rnorm(nsnps,mean=0,sd=1)
for(j in 1:N){
phm[j,i]<-sum(Gmat[snps[,i],j] * effects[,i])
}
phm[,i]<-(phm[,i]-mean(phm[,i]))/sd(phm[,i]) + rnorm(N,mean=0,sd=1)
}
simo<-list(phm,snps,effects)
return(simo)
}
writeRes<-function(o,prefix=NA){
write.table(o[[1]],paste("ph_",prefix,".txt",sep=""),row.names=FALSE,col.names=FALSE,sep=" ",quote=FALSE)
write.table(o[[2]],paste("snp_",prefix,".txt",sep=""),row.names=FALSE,col.names=FALSE,sep=" ",quote=FALSE)
write.table(o[[3]],paste("effect_",prefix,".txt",sep=""),row.names=FALSE,col.names=FALSE,sep=" ",quote=FALSE)
}
## all, 10 and 100 snps
G<-read.table("mod_geno_size.txt",sep=", ",header=FALSE)
o_all_10<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=10,nreps=50)
o_all_100<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=100,nreps=50)
writeRes(o_all_10,"sim_all_10")
writeRes(o_all_100,"sim_all_100")
## Anna, 10 and 100 snps
G<-read.table("AN_mod_geno_size.txt",sep=",",header=FALSE)
o_AN_10<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=10,nreps=50)
o_AN_100<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=100,nreps=50)
writeRes(o_all_10,"sim_AN_10")
writeRes(o_all_100,"sim_AN_100")
## Melissa, 10 and 100 snps
G<-read.table("ME_mod_geno_size.txt",sep=",",header=FALSE)
o_ME_10<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=10,nreps=50)
o_ME_100<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=100,nreps=50)
writeRes(o_all_10,"sim_ME_10")
writeRes(o_all_100,"sim_ME_100")
## SIN, 10 and 100 snps
G<-read.table("SIN_mod_geno_size.txt",sep=",",header=FALSE)
o_SIN_10<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=10,nreps=50)
o_SIN_100<-simp(Gmat=as.matrix(G[,-c(1:3)]),nsnps=100,nreps=50)
writeRes(o_all_10,"sim_SIN_10")
writeRes(o_all_100,"sim_SIN_100")
2. Analysis
I ran gemma with perl ../scripts/wrap_qsub_slurm_gemma_sims.pl
This actually gets the list of files to analyze from sim_files.txt and then runs a fork wrapper over the 50 traits for each set of simulation conditions (different chains, of which there are 5, are different jobs), e.g.,
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/infiles/
perl forkRunGemmaPop.pl SIN_mod_geno_size.txt ph_sim_SIN_10.txt 0 50
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/infiles/
perl forkRunGemmaPop.pl SIN_mod_geno_size.txt ph_sim_SIN_10.txt 1 50
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/infiles/
perl forkRunGemmaPop.pl SIN_mod_geno_size.txt ph_sim_SIN_10.txt 2 50
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/infiles/
perl forkRunGemmaPop.pl SIN_mod_geno_size.txt ph_sim_SIN_10.txt 3 50
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/infiles/
perl forkRunGemmaPop.pl SIN_mod_geno_size.txt ph_sim_SIN_10.txt 4 50
Here is the fork script that is called and that has the MCMC details (these are the same conditions I used for the empirical data):
#!/usr/bin/perl
#
# creates a child process per phenotype
#
use Parallel::ForkManager;
my $pm = Parallel::ForkManager->new(12); ## 12 concurrent processes
$g = shift(@ARGV); ## genotype file
$p = shift(@ARGV); ## phenotype file
$ch = shift(@ARGV); ## chain number number
$nph = shift(@ARGV); ## number of phenotypes
$out = $p;
$out =~ s/\.txt// or die "failed sub $p\n";
TRAITS:
foreach $n (1..$nph){
$pm->start and next TRAITS; ## fork
$outch = "o_$n"."_$out"."_$ch";
system "gemma-0.94.1 -g $g -p $p -bslmm 1 -n $n -o $outch -maf 0.001 -w 200000 -s 1000000\n";
$pm->finish;
}
$pm->wait_all_children;