Post date: Nov 14, 2013 8:50:22 PM
I would like to identify individual SNVs associated with survival or weight and quantify the percent of phenotypic variance (PVE) explained by the SNVs. At present, I think the best approach for this is to fit polygenic models using Bayesian Sparse Linear Mixed Models. This approach is described by Zhou et al 2013 (doi:10.1371/journal.pgen.1003264) and implemented in the computer program gemma. The key model parameters are,
h = inferred prior PVE
rho = inferred prior PGE (percent of the PVE attributable to sparse rather than polygenic effects); this is not usually estimated well
PVE
PGE
pi = inferred prior probability of a sparse effect for a SNP
alpha = polygenic effect of the locus
beta = sparse effect of the locus given gamma = 1
gamma = posterior inclusion probability for a sparse effect
The overall effect of a locus is thus alpha_i + beta_i gamma_i.
These are the steps I followed to prepare the data for analysis:
I used espost to infer mean genotypes for each individual and locus averaged over the two MCMC chains from popmod (see previous post). The results are in the ghat_* files in projects/lycaeides_hostplant/. These are counts of the non-reference allele, and have individuals as rows and loci as columns.
I then used a R script (transposeGhat.R, same directory) to extract the genotypes for experimental individuals, transpose the genotype matrix, and convert to the count of the reference allele. The resulting files are ~/Analyses/melGemma/matrixes/tran_ghat_* (the matrixes directory is compressed). I will move this directory to projects later.
Finally, I combined the genotype data with locus ids (~/Analyes/melGemma/snpIds.txt) to generate bimbam formatted genotype (geno*) files. Note, the bimbam format allows imputed (non-integer) genotypes.
I then prepared the phenotypic data. I began with SelectExpMain.csv and use an R script (formatPheno.R) to grab subsets of individuals from specific populations and treatments, and remove the affect of sex and normal quantile transform adult weight. The output from this is in the file phenotypeData.txt in the directory projects/lycaeides_hostplant/experiment. I then used the formatPheno.R script in ~/Analyses/melGemma/ to split this file by treatment and population and to remove extraneous information. This generated the pheno_* files for gemma.
Gemma requires a kinship matrix that it can estimate. The preferred option is to use centered but not scaled genotypes and to simply estimate the genotype covariance matrix. This is done by default if one is not provided. Gemma also allows binary traits to be modelled as quantitative traits (-bslmm 1) or with a probit link function (-bslmm 3). The authors suggest the former is probably better in practice in part because it mixes much better (it is also much faster). The also argue that " This approach can be justi.ed partly by recognizing the linear model as a .rst order Taylor approximation to a generalized linear model, and partly by the robustness of the linear model to model mis-speci.cation". Thus, this is what I have done for now. Here are the specific gemma commands. The phenotype is specified with -n, 1 = weight, 2 = survival.
gemma -g geno_glaTrtAc.txt -p pheno_glaTraAc.txt -bslmm 1 -n 2 -o surv_glaTrtAc -w 100000 -s 1000000
gemma -g geno_glaTrtAc.txt -p pheno_glaTraAc.txt -bslmm 1 -n 1 -o wgt_glaTrtAc -w 100000 -s 1000000
gemma -g geno_glaTrtMs.txt -p pheno_glaTraMs.txt -bslmm 1 -n 2 -o surv_glaTrtMs -w 100000 -s 1000000
gemma -g geno_glaTrtMs.txt -p pheno_glaTraMs.txt -bslmm 1 -n 1 -o wgt_glaTrtMs -w 100000 -s 1000000
gemma -g geno_slaTrtAc.txt -p pheno_slaTraAc.txt -bslmm 1 -n 2 -o surv_slaTrtAc -w 100000 -s 1000000
gemma -g geno_slaTrtAc.txt -p pheno_slaTraAc.txt -bslmm 1 -n 1 -o wgt_slaTrtAc -w 100000 -s 1000000
gemma -g geno_slaTrtMs.txt -p pheno_slaTraMs.txt -bslmm 1 -n 2 -o surv_slaTrtMs -w 100000 -s 1000000
gemma -g geno_slaTrtMs.txt -p pheno_slaTraMs.txt -bslmm 1 -n 1 -o wgt_slaTrtMs -w 100000 -s 1000000
The results are in an output directory in ~/Analyses/melGemma/. I have only begun to investigate the results and I want to run additional chains, but here are a few initial observations or thoughts:
We do a poor job estimating most parameters, in the sense that the credible intervals are quite large. This again reflects the difficulty of polygenic modelling. Interestingly, this same issue is seen even in the example data from Zhou et al's paper (note table 2, and the intervals are sd's not 95% CI's).
Interestingly, we see the clearest signal that we are explaining some of the trait variation for survival when individuals are reared on their natal host plant (particularly SLA on Ac). Here we estimate PVE = 0.844 (0.73-0.90) with 18 (13-23) SNVs contributing sparse effects. This is pretty cool. There are individual SNPs with large posterior inclusion probabilities (15 > 0.5, 6 = 1.0). I looked at one of these and it appears that the few individuals that died included a greater number of heterozygotes than expected. This does not appear to be driven by variation in coverage, but I need to watch out for this!
The PVE for binary traits is on the observed scale. One can also consider a latent scale. The latent scale model posits a continuous character with a threshold response. Individuals with continuous character values greater than t exhibit the response phenotype. You can convert PVE from the observed to latent as PVE_latent = PVE_observed * K (1 - K)/z^2, where K is the proportion of individuals with case phenotype (or one minus this if it is greater than 0.5), t is the value of a standard normal distribution at the Kth quantile (use qnorm(K)), and z is the density of the standard normal distribution at value t (use dnorm(t)). This is described in more detail by Lee et al. 2011 (see figure 1) and in supplemental text 3 from Zhou et al. I tried this with PVE for survival for SLA on Ac but obtained a value greater than 1. For now I will work on the observed scale.
Next I want to run a second chain for each analysis.