more thoughts on gemma analyses

Post date: Aug 29, 2014 5:32:17 PM

Here are my thoughts from an e-mail I sent Patrik:

"I have finished a series of gemma analyses with the probit model now. In particular, I have used mean genotypes (our normal practice) or modal genotypes (something I tried because of the lower coverage whole genome data) and various chain lengths (mixing actually looks quite good).

The results so far are disappointing. Basically, no matter what I try the posterior distributions for PVE and #SNPs are nearly identical to their prior distribution. In other words, I keep getting posteriors on PVE that are essentially uniform from 0 to 1. This is troubling as it means we are basically learning nothing from these data. I would be happy in a sense regardless of what value we were getting for PVE as long as the posterior was narrowing down on a region of values, but this isn't what is happening. This is markedly different from my results with Lycaeides and Aaron's results with GBS Timema data.

At the moment, I have two explanations for what we are seeing. One possibility is that this data set represents the worst case of P >>>> N that we (maybe anyone) has ever encountered. In other words, while gwas always has more parameters (P) than samples (N), we are now in the extreme. We have 8.5 million SNPs vs a few hundred thousand in previous data sets, but still about the same number of individuals (250 per analysis). There are of course human data sets with as many SNPs, but these have 10-100x more individuals. This could be pushing these methods to their limit and causing analytical problems. I have tried cutting down the number of SNPs by using only those with MAF > 5%. This isn't ideal, but I just wanted to see what we get. We still would have 2.5 million SNPs, and these analyses are still running, but so far the results look almost identical. We could cut even more, but at some point that defeats the whole point of whole genome sequencing. I have also tried the single SNP analyses just to see what we get. We end up with a single significant SNP for C with a very stringent genome-wide significance level (p < 0.5/#ofLoci). I also generated maximum likelihood estimates of PVE just for the kinship matrix (i.e. basically BSLMM with no individual effects) and they come out to 0 (as close to 0 as the algorithm searches). But even the kinship matrix could result in weird things with P so much greater than N.

The other possibility is that there is something funky about the new way I am processing the data with GATK. Unfortunately this is somewhat tricky to check with this data set as we don't really have any expectations of population structure, etc. that we could use to verify we don't have garbage. But I am using the same exact approach with Moe's data and there we do have strong expectations. I am running the indel realignment algorithm right now and should start on variant calling proper over the weekend. So, we should have his data ready for analysis next week and I can make some quick PC plots to make sure that look reasonable.

I will let the current analyses finish to see if anything interesting comes up. If not, I will see what Moe's data look like to evaluate which of the explanations above seems more likely and go from there. We could then look into alternative methods if needed. "

Page updated

Google Sites

Report abuse