Post date: Oct 27, 2019 2:18:47 AM
We now know that including the epistatic interactions from MAPIT increase PVE in gemma for color, especially RG. However, that doesn't necessarily mean it makes a better (i.e., more predictive) model. To assess this I am running 10-fold cross-validation on RG and GB in gemma with and without the epistatic SNPs.
1. In R, I made phenotype data where each of 10 folds were converted to NA for RG and GB. This is in MkCvSets.R within the Gemma_mel directory.
## MkCvSets.R
# set up phenotype files for 10 fold cross-validation
ph<-read.table("pheno_color.txt",header=FALSE)
rg_cv<-matrix(ph[,1],nrow=437,ncol=10)
gb_cv<-matrix(ph[,2],nrow=437,ncol=10)
grp<-sample(rep(1:10,each=44),437,replace=FALSE)
for(i in 1:10){
xna<-which(grp==i)
rg_cv[xna,i]<-NA
gb_cv[xna,i]<-NA
}
write.table(rg_cv,file="pheno_rg_cv.txt",row.names=FALSE,col.names=FALSE)
write.table(gb_cv,file="pheno_gb_cv.txt",row.names=FALSE,col.names=FALSE)
2. I then ran gemma on these data sets, basically treating each training set as a different phenotype. Note here I am only running 5 (not 10) chains. Also, I am using the pre-computed kinship matrix and notsnp option for the epistasis data, but not for the fully additive SNP data sets.
I ran,
sbatch RunGemmaCv.sh
which runs,
#!/usr/bin/perl
#
# run gemma jobs
#
use Parallel::ForkManager;
my $max = 24;
my $pm = Parallel::ForkManager->new($max);
my $ph;
my @phf = ("pheno_rg_cv.txt","pheno_gb_cv.txt");
my $trt;
foreach $g (@ARGV){
foreach $ph (@phf){
$base = $g;
$base =~ s/mod_g_//;
$base =~ s/\.txt//;
$ph =~ m/pheno_([a-z]+)/;
$trt = $1;
foreach $ch (0..4){
system "sleep 2\n";
foreach $set (1..10){
$pm->start and next; ## fork
$out = "o_cv_$base"."_ph$trt"."_set$set"."_ch$ch";
if($base eq "tchum"){
system "gemma -g $g -p $ph -bslmm 1 -n $set -maf 0.0 -w 200000 -s 1000000 -o $out\n";
}
elsif($base eq "epi_tchum"){
system "gemma -g $g -p $ph -bslmm 1 -n $set -k output/tchum_epi.cXX.txt -notsnp -maf 0.0 -w 200000 -s 1000000 -o $out\n";
}
$pm->finish;
}
}
}
}
$pm->wait_all_children;
3. Summarizing the results/comparing model performance via cross-validation genomic prediction.
I ran k (10)-fold cross-validation to ask whether including the epistatic terms from MAPIT in the gemma analysis improved our ability to predict color phenotype. To do this, the data set was split into 10 equal sized groups. Then, we ran 10 iterations where nine of the ten groups were used to fit the gemma model and then that model was used to predict the phenotype for the other group (the one left out). This was done for RG and GB with and without including the epistatic terms. I then compared the genome estimated breeding values (predicted colors) from the cross-validation (i.e., for the test sets) to the true phenotypes. Fit was summarized in terms of the Pearson correlation and r2. Note that cross-validation inherently penalizes for over fitting, so the better model is the one that predicts things better, period. Here is what we got:
RG, additive only
r = 0.82, 95 CI = 0.78-0.84, P < 0.001, r2 = 0.66
RG, with epistasis
r = 0.89, 95 CI = 0.87-0.91, P < 0.001, r2 = 0.79
GB, additive only
r = 0.76, 95 CI = 0.71-0.79, P < 0.001, r2 = 0.57
GB, with epistasis
r = 0.78, 95 CI = 0.74-0.81, P < 0.001, r2 = 0.60
First, these numbers all rock. We can predict color remarkably well. The model with epistasis gives a clear, small, but non-trivial bump to performance for RG. There is a bit of an improvement for GB, but it is more negligible. In short, the improvement for epistasis is real, we aren't just getting a bump in PVE from over fitting.
This was all done with cvSummary.R in /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_genarch/color_exp_2019_gbs/Gemma_mel/cv_output/
## calculate r2 for obs. vs. predicted from CV analysis
G_add<-as.matrix(read.table("../g_tchum.txt",header=FALSE))
G_epi<-as.matrix(read.table("../g_epi_tchum.txt",header=FALSE))
for(k in 1:dim(G_add)[1]){
G_add[k,]<-G_add[k,]-mean(G_add[k,])
}
for(k in 1:dim(G_epi)[1]){
G_epi[k,]<-G_epi[k,]-mean(G_epi[k,])
}
## drop fixed SNPs
fixSnp_add<-which(apply(G_add,1,sd)==0)
fixSnp_epi<-which(apply(G_epi,1,sd)==0)
## get list of files from GEMMA
f_bv_rg<-list.files(pattern=glob2rx("o_cv_tchum_phrg_set*bv.txt"))
f_bv_gb<-list.files(pattern=glob2rx("o_cv_tchum_phgb_set*bv.txt"))
f_eff_rg<-list.files(pattern=glob2rx("o_cv_tchum_phrg_set*param.txt"))
f_eff_gb<-list.files(pattern=glob2rx("o_cv_tchum_phgb_set*param.txt"))
f_bv_epi_rg<-list.files(pattern=glob2rx("o_cv_epi_tchum_phrg_set*bv.txt"))
f_bv_epi_gb<-list.files(pattern=glob2rx("o_cv_epi_tchum_phgb_set*bv.txt"))
f_eff_epi_rg<-list.files(pattern=glob2rx("o_cv_epi_tchum_phrg_set*param.txt"))
f_eff_epi_gb<-list.files(pattern=glob2rx("o_cv_epi_tchum_phgb_set*param.txt"))
## function to calcualte GEBVs for subset of individuals not in each training set,
calcGEBV<-function(bvf=NA,efff=NA,G=NA){
N<-length(bvf)
gebv<-vector("list",N)
for(j in 1:N){
bv<-scan(bvf[j])
eff<-read.table(efff[j],header=TRUE)
L<-dim(eff)[1]
I<-length(bv)
gebv[[j]]<-rep(NA,I)
x<-which(is.na(bv)==TRUE)
mav<-eff$alpha + eff$beta * eff$gamma
for(i in x){
gebv[[j]][i]<-sum(mav*G[,i])
}
}
gebv<-matrix(unlist(gebv),nrow=I,ncol=N)
return(gebv)
}
## loop over sets
gebv_rg_add<-matrix(NA,nrow=437,ncol=10)
gebv_gb_add<-matrix(NA,nrow=437,ncol=10)
gebv_rg_epi<-matrix(NA,nrow=437,ncol=10)
gebv_gb_epi<-matrix(NA,nrow=437,ncol=10)
for(s in 1:10){
## compute GEBVs for each test set, averaged over 5 chains
gebv_rg_add[,s]<-apply(calcGEBV(bvf=f_bv_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_rg,perl=TRUE)],
efff=f_eff_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_rg,perl=TRUE)],
G=G_add[-fixSnp_add,]),1,mean)
gebv_gb_add[,s]<-apply(calcGEBV(bvf=f_bv_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_gb,perl=TRUE)],
efff=f_eff_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_gb,perl=TRUE)],
G=G_add[-fixSnp_add,]),1,mean)
gebv_rg_epi[,s]<-apply(calcGEBV(bvf=f_bv_epi_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_epi_rg,perl=TRUE)],
efff=f_eff_epi_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_epi_rg,perl=TRUE)],
G=G_epi[-fixSnp_epi,]),1,mean)
gebv_gb_epi[,s]<-apply(calcGEBV(bvf=f_bv_epi_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_epi_gb,perl=TRUE)],
efff=f_eff_epi_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_epi_gb,perl=TRUE)],
G=G_epi[-fixSnp_epi,]),1,mean)
}
## read in trait data
ph<-read.table("../pheno_color.txt",header=FALSE)
## calc cross validation r and r2
cor.test(ph[,1],apply(gebv_rg_add,1,mean,na.rm=TRUE))
#data: ph[, 1] and apply(gebv_rg_add, 1, mean, na.rm = TRUE)
#t = 29.297, df = 433, p-value < 2.2e-16
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# 0.7811340 0.8445607
#sample estimates:
# cor
#0.8152783
# r2 = 0.6646787
cor.test(ph[,1],apply(gebv_rg_epi,1,mean,na.rm=TRUE))
#data: ph[, 1] and apply(gebv_rg_epi, 1, mean, na.rm = TRUE)
#t = 40.672, df = 433, p-value < 2.2e-16
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# 0.8689661 0.9082504
#sample estimates:
# cor
#0.8902523
# r2 = 0.7925492
cor.test(ph[,2],apply(gebv_gb_add,1,mean,na.rm=TRUE))
#data: ph[, 2] and apply(gebv_gb_add, 1, mean, na.rm = TRUE)
#t = 23.983, df = 433, p-value < 2.2e-16
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# 0.7118561 0.7930264
#sample estimates:
# cor
#0.7553234
# r2 = 0.5705134
cor.test(ph[,2],apply(gebv_gb_epi,1,mean,na.rm=TRUE))
#data: ph[, 2] and apply(gebv_gb_epi, 1, mean, na.rm = TRUE)
#t = 25.587, df = 433, p-value < 2.2e-16
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# 0.7354577 0.8107145
#sample estimates:
# cor
#0.7758309
# r2 = 0.6019136