Post date: Sep 26, 2016 1:31:22 AM
The current status of things follows. Given the results thus far, it is clear that we want the Refugio covariate. The next steps are:
(1) to compare models with elevation, climate PC1 and climate PCs 1 and 2
(2) limit cross-validation to samples of size > N (where N is something like 20?)
The model is essentially a fancy GLM, where the expected value of the logit freq. of melanistic bugs is:
alpha_i + beta_i * temp + theta * year + epsilon,
where alpha_i and beta_i are population-specific intercepts and slopes (for the effect of temp.), epsilon is an observation-specific error term (to account for overdispersion relative to a binomial), and theta is the effect of year (independent of temperature). I included theta to account for any possible trends in the frequency of melanistic bugs across all populations and through time (i.e., an overall increase or decrease in freq.). I think this actually decreases our ability to detect effects of temperature, as temperatures are warming overall (thus there is a general trend in temp. that could be partially accounted for by year), but I think we would get dinged if we didn't include it.
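To make the first level concrete, here is a minimal sketch of the linear predictor and of how counts would arise from it. It is not the fitted model: the function names, the made-up parameter values, and the choice of a normal distribution for epsilon are all assumptions for illustration.

```python
import numpy as np

def inv_logit(x):
    """Map a logit-scale value back to a frequency in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def linear_predictor(alpha, beta, theta, temp, year, pop):
    """alpha_i + beta_i * temp + theta * year, one value per observation.

    pop holds the population index i of each observation.
    """
    return alpha[pop] + beta[pop] * temp + theta * year

def simulate_counts(rng, alpha, beta, theta, sigma, temp, year, pop, n):
    """Draw melanistic counts out of n bugs sampled per observation."""
    # Assumption: epsilon ~ Normal(0, sigma), one draw per observation.
    eps = rng.normal(0.0, sigma, size=temp.shape)
    p = inv_logit(linear_predictor(alpha, beta, theta, temp, year, pop) + eps)
    return rng.binomial(n, p)

# Made-up values: two populations, two observations each.
rng = np.random.default_rng(1)
alpha = np.array([-1.0, 0.5])            # population-specific intercepts
beta = np.array([0.3, 0.1])              # population-specific temp. slopes
pop = np.array([0, 0, 1, 1])
temp = np.array([-1.0, 1.0, -1.0, 1.0])  # standardized temperature
year = np.array([0.0, 1.0, 0.0, 1.0])
n = np.array([10, 50, 25, 40])           # bugs sampled per observation
counts = simulate_counts(rng, alpha, beta, 0.05, 0.2, temp, year, pop, n)
```

Because epsilon is added on the logit scale before the binomial draw, the counts vary more than a plain binomial with the same mean would, which is the overdispersion the error term is there to absorb.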
The coolest part of the model is that we then have a second level (hierarchical) linear model (really two of them) for the population specific intercepts and slopes (for temperature). Specifically, we model these as,
alpha_i = a1 + b1 * elevation + c1 * host + d1 * mountain
beta_i = a2 + b2 * elevation + c2 * host + d2 * mountain
where a1, b1, c1, d1, a2, b2, c2, and d2 are all regression coefficients. Thus, variation among sites in the overall freq. of melanistic bugs and in the effect of temp. on them is not just random; rather, we assume it is (potentially) a function of elevation, host (coded as 0 or 1) and mountain range (also 0 or 1). a1 and a2 give the average intercept and average effect of temp. for a site on host A, not on Refugio, and at the average elevation.
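The second level can be sketched the same way. This assumes elevation is centered on its mean (which is what makes a1 and a2 interpretable at the average elevation), and it leaves out any site-level residual term. The function name and the numbers are hypothetical.

```python
import numpy as np

def site_level_mean(coefs, elevation, host, mountain):
    """a + b * elevation + c * host + d * mountain for each site.

    Assumption: elevation is centered here, so `a` is the value for a
    site at the average elevation with host = 0 and mountain = 0.
    """
    a, b, c, d = coefs
    elev_centered = elevation - elevation.mean()
    return a + b * elev_centered + c * host + d * mountain

# Made-up covariates for three sites.
elevation = np.array([100.0, 200.0, 300.0])
host = np.array([0, 1, 0])        # 0 = host A, 1 = host C
mountain = np.array([0, 0, 1])    # 1 = on Refugio

alpha_i = site_level_mean((1.0, 0.01, -0.5, 2.0), elevation, host, mountain)
beta_i = site_level_mean((0.4, 0.0, -0.3, 0.0), elevation, host, mountain)
```

The same function serves both equations; only the coefficient set changes, (a1, b1, c1, d1) for the intercepts and (a2, b2, c2, d2) for the temperature slopes.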
So, at this point, the model with Refugio as a covariate beats the one without it based on DIC and cross-validation (more on the latter below). Focusing on that model we see:
1. A borderline effect of year.
2. A clear effect of being on Refugio. There are fewer melanistic bugs, but this does not clearly modify the effect of temp (i.e., d1 is 'significant', but not d2).
3. An effect of host on the relationship between temperature and % melanistic (c2 is significantly negative). This means that being on C makes the slope less positive (best interpretation) or more negative (maybe sometimes).
4. Cases where the effect of temperature for individual sites is significant, and many cases where it is not. Significant effects are always positive.
5. An average negative effect of elevation on melanism (less at higher elevations, i.e., b1 is significant), but at most marginal evidence that temperature and elevation interact (95% ETPI on b2 = -0.15 to 0.028).
6. An overall positive effect of temperature (i.e., a2 is significant and positive).
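For reference, 'significant' in the list above is read off the posterior intervals. A minimal sketch of that check, assuming ETPI means an equal-tail probability interval computed from posterior samples (the function names are hypothetical, and the samples would come from the MCMC output):

```python
import numpy as np

def etpi(samples, prob=0.95):
    """Equal-tail interval: cut (1 - prob) / 2 from each tail."""
    tail = (1.0 - prob) / 2.0
    lo, hi = np.quantile(samples, [tail, 1.0 - tail])
    return lo, hi

def excludes_zero(samples, prob=0.95):
    """True when the whole interval lies on one side of zero."""
    lo, hi = etpi(samples, prob)
    return bool(lo > 0 or hi < 0)
```

An interval like the one for b2 (-0.15 to 0.028) spans zero, so the check returns False even though most of the interval is negative.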
This all seems pretty cool, albeit a bit complicated. Let me know if any of it doesn't make sense (which could just be me not summarizing things well).
As for CV, I think things look ok, but not awesome. With 5-fold CV (i.e., dropping 20% of the data at a time), we get a correlation between true and estimated frequencies of melanism of around 0.25 (0.21 for the model without Refugio as a covariate). This gives a cross-validated r^2 of ~ 6% (that is, our model can predict frequencies that explain about 6% of the variation in the true frequencies). One thing to keep in mind is that the model fits the actual count data (# of melanistic out of total), not the frequencies. Thus, sites with more data drive things more. But for measuring the success of CV, I am using the frequencies. If you use the counts the correlation shoots up above 0.9, but for a mostly trivial reason: larger sample sizes give larger counts. But using the frequencies could, I think, be driving things artificially low, as we don't necessarily expect to do well for sites with very low sample sizes. I will think more about this, and about whether we want to change the model in any way with the hopes of increasing its performance. Once I have thought this through, I will run it all again (maybe the last time, but maybe not).
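A minimal sketch of the CV scoring, including the option of scoring only samples above a minimum size. The function names and the `min_n` argument are hypothetical, and the refitting of the model on each training fold is not shown.

```python
import numpy as np

def kfold_ids(rng, n_obs, k=5):
    """Assign each observation to one of k folds of near-equal size."""
    return rng.permutation(n_obs) % k

def cv_score(true_freq, pred_freq, n, min_n=0):
    """Correlation and r^2 between observed and held-out predicted
    frequencies, scored only on samples with more than min_n bugs."""
    keep = n > min_n
    r = np.corrcoef(true_freq[keep], pred_freq[keep])[0, 1]
    return r, r ** 2

# A correlation of 0.25 corresponds to r^2 = 0.0625, i.e. about 6%.
r2_reported = 0.25 ** 2
```

Filtering on n changes only which held-out samples are scored, not what the model is fit to, so the small samples still contribute to the estimates.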