Post date: Aug 22, 2013 4:1:55 PM
I began a series of test runs with entropy on 20viii13. See the summary below from the notes file:
Test runs of entropy, k=2-8, 2 chains each, 4000 mcmc, 1000 burnin, thin 3, q-init from lda with -s 50, outfiles = entk($k)ch($j).hdf5. These are running on lycaeides via the batch queue with a 72 hour wall time.
Second test, regarding long-running jobs. I tried to submit jobs to lycaeides via the long queue, but they sat in the queue for at least a day (maybe indefinitely). I don't think lycaeides accepts 'long' queue jobs. When I omitted the cluster from the submission, I was able to submit the following, which is running on sawtooth: k=2, 2 chains, 10000 mcmc, 5000 burnin, thin 5, outfiles start with long, 120 hour wall time.
Most of the runs finished successfully in fewer than 48 hours, and I think 72 hours will be sufficient for all of them. Here are a few things I learned from these:
Estimates of q or Q are highly correlated between replicate chains (correlations of 0.99 or higher), although at higher K there are a few instances where q_k is near 0 in one run but moderate in the other.
Effective sample sizes calculated in coda were modest, with means on the order of 50, but were better for higher K.
Even at K=8 some individuals have nearly pure ancestry, and pure-ancestry individuals exist for each population. This is somewhat less true for the Q model, but the results are still much more believable than before.
Given the above, I think initializing q based on linear discriminant analysis greatly speeds convergence to the stationary distribution.
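The two diagnostics above (between-chain correlation and effective sample size) can be sketched in plain Python. This is only an illustration on simulated values, not the analysis I ran: the function names and the simulated chains are made up for the example, it does not read the hdf5 outfiles, and the ESS estimator sums positive-lag autocorrelations, which is simpler than the spectral-density estimate coda uses, so the numbers will not match coda exactly.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of estimates."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def effective_sample_size(chain):
    """Rough ESS: n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    n = len(chain)
    m = mean(chain)
    dev = [v - m for v in chain]
    var = sum(d * d for d in dev) / n
    if var == 0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0:  # stop once the autocorrelation estimate reaches zero
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


random.seed(1)

# Hypothetical posterior mean q for 200 individuals from two replicate chains.
q_chain1 = [random.random() for _ in range(200)]
q_chain2 = [min(1.0, max(0.0, v + random.gauss(0, 0.01))) for v in q_chain1]
r = pearson(q_chain1, q_chain2)

# Hypothetical traces of 1000 retained samples: one independent, one autocorrelated.
iid = [random.gauss(0, 1) for _ in range(1000)]
ar = []
x = 0.0
for _ in range(1000):
    x = 0.9 * x + random.gauss(0, 1)  # AR(1) trace that mixes slowly
    ar.append(x)

ess_iid = effective_sample_size(iid)
ess_ar = effective_sample_size(ar)
print(f"between-chain r = {r:.4f}")
print(f"ESS independent = {ess_iid:.0f}, ESS autocorrelated = {ess_ar:.0f}")
```

Note that this correlation compares chains column by column, so it assumes the cluster labels match between chains. A near-0 q_k in one run against a moderate q_k in the other is the pattern worth inspecting by hand.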
I am still a bit surprised by the large estimates of Fst from these models. This could be a bug in the software, or it could reflect population structure apparent in common variants. I have a short test running now (outfile aftest.hdf5) that will include allele frequency output. I want to verify that allele frequencies are mixing well and are consistent with the genotypes.
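One way to check consistency with genotypes, sketched here under assumptions: if an individual's expected genotype at a locus is twice the q-weighted average of the cluster allele frequencies, then the estimated frequencies can be compared against the observed genotypes directly. The names and the toy values below are hypothetical, and the layout of aftest.hdf5 is not assumed, so the inputs are plain lists.

```python
def expected_genotype(q_i, p_l):
    """Expected allele count (0 to 2) given admixture proportions q_i and cluster frequencies p_l."""
    return 2.0 * sum(qk * pk for qk, pk in zip(q_i, p_l))


def mean_abs_discrepancy(genotypes, q, p):
    """Mean |observed - expected| genotype over all individuals and loci.

    genotypes[i][l] is the allele count for individual i at locus l,
    q[i][k] is the admixture proportion of individual i for cluster k,
    p[l][k] is the allele frequency at locus l in cluster k.
    """
    total = 0.0
    count = 0
    for i, row in enumerate(genotypes):
        for l, g in enumerate(row):
            total += abs(g - expected_genotype(q[i], p[l]))
            count += 1
    return total / count


# Toy example: two pure-ancestry individuals, one locus fixed differently in each cluster.
q = [[1.0, 0.0], [0.0, 1.0]]
p = [[1.0, 0.0]]
consistent = mean_abs_discrepancy([[2], [0]], q, p)
inconsistent = mean_abs_discrepancy([[0], [2]], q, p)
print(consistent, inconsistent)
```

A large discrepancy concentrated at particular loci would point toward a problem with the frequency estimates rather than real structure, though what counts as large depends on genotype uncertainty.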