Chumash color mapping CV with and w/o epistasis

Post date: Oct 27, 2019 2:18:47 AM

We now know that including the epistatic interactions from MAPIT increase PVE in gemma for color, especially RG. However, that doesn't necessarily mean it makes a better (i.e., more predictive) model. To assess this I am running 10-fold cross-validation on RG and GB in gemma with and without the epistatic SNPs.

1. In R, I made phenotype data where each of 10 folds were converted to NA for RG and GB. This is in MkCvSets.R within the Gemma_mel directory.

## MkCvSets.R

# set up phenotype files for 10 fold cross-validation

ph<-read.table("pheno_color.txt",header=FALSE)

rg_cv<-matrix(ph[,1],nrow=437,ncol=10)

gb_cv<-matrix(ph[,2],nrow=437,ncol=10)

grp<-sample(rep(1:10,each=44),437,replace=FALSE)

for(i in 1:10){

xna<-which(grp==i)

rg_cv[xna,i]<-NA

gb_cv[xna,i]<-NA

}

write.table(rg_cv,file="pheno_rg_cv.txt",row.names=FALSE,col.names=FALSE)

write.table(gb_cv,file="pheno_gb_cv.txt",row.names=FALSE,col.names=FALSE)

2. I then ran gemma on these data sets, basically treating each training set as a different phenotype. Note here I am only running 5 (not 10) chains. Also, I am using the pre-computed kinship matrix and notsnp option for the epistasis data, but not for the fully additive SNP data sets.

I ran,

sbatch RunGemmaCv.sh

which runs,

#!/usr/bin/perl

# run gemma jobs

use Parallel::ForkManager;

my $max = 24;

my $pm = Parallel::ForkManager->new($max);

my $ph;

my @phf = ("pheno_rg_cv.txt","pheno_gb_cv.txt");

my $trt;

foreach $g (@ARGV){

foreach $ph (@phf){

$base = $g;

$base =~ s/mod_g_//;

$base =~ s/\.txt//;

$ph =~ m/pheno_([a-z]+)/;

$trt = $1;

foreach $ch (0..4){

system "sleep 2\n";

foreach $set (1..10){

$pm->start and next; ## fork

$out = "o_cv_$base"."_ph$trt"."_set$set"."_ch$ch";

if($base eq "tchum"){

system "gemma -g $g -p $ph -bslmm 1 -n $set -maf 0.0 -w 200000 -s 1000000 -o $out\n";

}

elsif($base eq "epi_tchum"){

system "gemma -g $g -p $ph -bslmm 1 -n $set -k output/tchum_epi.cXX.txt -notsnp -maf 0.0 -w 200000 -s 1000000 -o $out\n";

}

$pm->finish;

}

$pm->wait_all_children;

3. Summarizing the results/comparing model performance via cross-validation genomic prediction.

I ran k (10)-fold cross-validation to ask whether including the epistatic terms from MAPIT in the gemma analysis improved our ability to predict color phenotype. To do this, the data set was split into 10 equal sized groups. Then, we ran 10 iterations where nine of the ten groups were used to fit the gemma model and then that model was used to predict the phenotype for the other group (the one left out). This was done for RG and GB with and without including the epistatic terms. I then compared the genome estimated breeding values (predicted colors) from the cross-validation (i.e., for the test sets) to the true phenotypes. Fit was summarized in terms of the Pearson correlation and r2. Note that cross-validation inherently penalizes for over fitting, so the better model is the one that predicts things better, period. Here is what we got:

RG, additive only

r = 0.82, 95 CI = 0.78-0.84, P < 0.001, r2 = 0.66

RG, with epistasis

r = 0.89, 95 CI = 0.87-0.91, P < 0.001, r2 = 0.79

GB, additive only

r = 0.76, 95 CI = 0.71-0.79, P < 0.001, r2 = 0.57

GB, with epistasis

r = 0.78, 95 CI = 0.74-0.81, P < 0.001, r2 = 0.60

First, these numbers all rock. We can predict color remarkably well. The model with epistasis gives a clear, small, but non-trivial bump to performance for RG. There is a bit of an improvement for GB, but it is more negligible. In short, the improvement for epistasis is real, we aren't just getting a bump in PVE from over fitting.

This was all done with cvSummary.R in /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_genarch/color_exp_2019_gbs/Gemma_mel/cv_output/

## calculate r2 for obs. vs. predicted from CV analysis

G_add<-as.matrix(read.table("../g_tchum.txt",header=FALSE))

G_epi<-as.matrix(read.table("../g_epi_tchum.txt",header=FALSE))

for(k in 1:dim(G_add)[1]){

G_add[k,]<-G_add[k,]-mean(G_add[k,])

}

for(k in 1:dim(G_epi)[1]){

G_epi[k,]<-G_epi[k,]-mean(G_epi[k,])

}

## drop fixed SNPs

fixSnp_add<-which(apply(G_add,1,sd)==0)

fixSnp_epi<-which(apply(G_epi,1,sd)==0)

## get list of files from GEMMA

f_bv_rg<-list.files(pattern=glob2rx("o_cv_tchum_phrg_set*bv.txt"))

f_bv_gb<-list.files(pattern=glob2rx("o_cv_tchum_phgb_set*bv.txt"))

f_eff_rg<-list.files(pattern=glob2rx("o_cv_tchum_phrg_set*param.txt"))

f_eff_gb<-list.files(pattern=glob2rx("o_cv_tchum_phgb_set*param.txt"))

f_bv_epi_rg<-list.files(pattern=glob2rx("o_cv_epi_tchum_phrg_set*bv.txt"))

f_bv_epi_gb<-list.files(pattern=glob2rx("o_cv_epi_tchum_phgb_set*bv.txt"))

f_eff_epi_rg<-list.files(pattern=glob2rx("o_cv_epi_tchum_phrg_set*param.txt"))

f_eff_epi_gb<-list.files(pattern=glob2rx("o_cv_epi_tchum_phgb_set*param.txt"))

## function to calcualte GEBVs for subset of individuals not in each training set,

calcGEBV<-function(bvf=NA,efff=NA,G=NA){

N<-length(bvf)

gebv<-vector("list",N)

for(j in 1:N){

bv<-scan(bvf[j])

eff<-read.table(efff[j],header=TRUE)

L<-dim(eff)[1]

I<-length(bv)

gebv[[j]]<-rep(NA,I)

x<-which(is.na(bv)==TRUE)

mav<-eff$alpha + eff$beta * eff$gamma

for(i in x){

gebv[[j]][i]<-sum(mav*G[,i])

}

gebv<-matrix(unlist(gebv),nrow=I,ncol=N)

return(gebv)

}

## loop over sets

gebv_rg_add<-matrix(NA,nrow=437,ncol=10)

gebv_gb_add<-matrix(NA,nrow=437,ncol=10)

gebv_rg_epi<-matrix(NA,nrow=437,ncol=10)

gebv_gb_epi<-matrix(NA,nrow=437,ncol=10)

for(s in 1:10){

## compute GEBVs for each test set, averaged over 5 chains

gebv_rg_add[,s]<-apply(calcGEBV(bvf=f_bv_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_rg,perl=TRUE)],

efff=f_eff_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_rg,perl=TRUE)],

G=G_add[-fixSnp_add,]),1,mean)

gebv_gb_add[,s]<-apply(calcGEBV(bvf=f_bv_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_gb,perl=TRUE)],

efff=f_eff_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_gb,perl=TRUE)],

G=G_add[-fixSnp_add,]),1,mean)

gebv_rg_epi[,s]<-apply(calcGEBV(bvf=f_bv_epi_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_epi_rg,perl=TRUE)],

efff=f_eff_epi_rg[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_epi_rg,perl=TRUE)],

G=G_epi[-fixSnp_epi,]),1,mean)

gebv_gb_epi[,s]<-apply(calcGEBV(bvf=f_bv_epi_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_bv_epi_gb,perl=TRUE)],

efff=f_eff_epi_gb[grep(pattern=paste("set",s,"_",sep=""),x=f_eff_epi_gb,perl=TRUE)],

G=G_epi[-fixSnp_epi,]),1,mean)

}

## read in trait data

ph<-read.table("../pheno_color.txt",header=FALSE)

## calc cross validation r and r2

cor.test(ph[,1],apply(gebv_rg_add,1,mean,na.rm=TRUE))

#data: ph[, 1] and apply(gebv_rg_add, 1, mean, na.rm = TRUE)

#t = 29.297, df = 433, p-value < 2.2e-16

#alternative hypothesis: true correlation is not equal to 0

#95 percent confidence interval:

# 0.7811340 0.8445607

#sample estimates:

# cor

#0.8152783

# r2 = 0.6646787

cor.test(ph[,1],apply(gebv_rg_epi,1,mean,na.rm=TRUE))

#data: ph[, 1] and apply(gebv_rg_epi, 1, mean, na.rm = TRUE)

#t = 40.672, df = 433, p-value < 2.2e-16

#alternative hypothesis: true correlation is not equal to 0

#95 percent confidence interval:

# 0.8689661 0.9082504

#sample estimates:

# cor

#0.8902523

# r2 = 0.7925492

cor.test(ph[,2],apply(gebv_gb_add,1,mean,na.rm=TRUE))

#data: ph[, 2] and apply(gebv_gb_add, 1, mean, na.rm = TRUE)

#t = 23.983, df = 433, p-value < 2.2e-16

#alternative hypothesis: true correlation is not equal to 0

#95 percent confidence interval:

# 0.7118561 0.7930264

#sample estimates:

# cor

#0.7553234

# r2 = 0.5705134

cor.test(ph[,2],apply(gebv_gb_epi,1,mean,na.rm=TRUE))

#data: ph[, 2] and apply(gebv_gb_epi, 1, mean, na.rm = TRUE)

#t = 25.587, df = 433, p-value < 2.2e-16

#alternative hypothesis: true correlation is not equal to 0

#95 percent confidence interval:

# 0.7354577 0.8107145

#sample estimates:

# cor

#0.7758309

# r2 = 0.6019136