Post date: Nov 02, 2014 9:20:44 PM
I used the BSLMM model in gemma (probit and linear) to estimate no. snps and pve from the newly filtered genome data (with modal and mean genotypes). The linear model once again explains too much of the variance to be believable (> 95%). The probit model results in PVE posteriors that are mostly indistinguishable from the prior. Null pve estimates are close to 0 with massive standard errors. In short, the BSLMM models (and polygenic approach?) appear to fail with this data set. I still don't know why for sure, but I suspect this is due to a mix of low coverage and n <<< p. See my comments to Patrik about this below:
Some of the MCMC runs for gemma finished, but some are still running. I ran a separate analysis for the bugs transplanted to A and C, and I used either the posterior mean genotype (any value between 0 and 2) or the mode (mostly likely integer value). The mean is generally a better choice but given the low coverage of this data set the mode might minimize the effect of variation in uncertainty. I then ran three MCMC replicates of each. So far two reps of A with the mode, one of A with the mean, and one of C with the mode are done. The posterior for PVE for each is shown in the attached violin plot. The most important thing to note at the moment is that the two replicates for A with the mode give very different posteriors. This means that given the chains have not converged (the same is thus likely true for other combos too). I ran quite long chains (hence the long runtime), but this suggests that the probit model simply isn't mixing well (for this data set), presumably because it doesn't mix as well in general and we have many many loci and we have low coverage. I am going to let the rest finish and give the linear model a shot (just to see what we get). The linear model will be really quick. We can then decide rather to try even longer probit runs (the actual length of runs we might need could be prohibitive) or to switch to a different analytical framework based on allele frequencies (which gets around the coverage issue).
I used gemma to obtain MLE PVE just from the polygenic term (i.e. with no loci with measurable individual effects). We end up with PVE < 1% and big standard errors (using the modal or mean genotypes). I then ran single SNP analyses that control for the (mostly absent) polygenic effect. I did this for both the mean and modal genotypes. The plots are attached. the 'int' plot is for modal genotypes and in both cases Adenostoma is the top plot and Caenothus is the bottom one. Only SNPs with p < 0.001 are shown (to keep the file size down, this is why there is a bunch of white space at the bottom of the plots). The solid red line is the threshold for strict genome-wide significance, i.e., p < 0.05/(no. of snps). The dashed red line is a bit less stringent and instead uses the effective number of tests accounting for the fact that some nearby SNPs exhibit LD and thus don't really constitute independent tests. Depending on the host plant, threshold, and use of modal or mean genotypes we end up with 0, 1, or two significant SNPs. This is something but not much. We would probably get a few more with a FDR correction (but this would of course be less stringent).