Post date: Nov 25, 2013 11:16:49 PM
I want to quantify population genetic structure in melissa to assess the independence, or lack thereof, of alfalfa feeding in melissa (we have already looked at this for the populations in the admixture paper, but we now have additional populations). To this end I am running the admixture proportion model in entropy to estimate genotypes for the melissa individuals.
I first created a data set for this analysis. I used the script vcf2glWild.pl in the variants directory to filter varMelissaAll.vcf, retaining the 525 individuals with non-'SELEXP' id names and the loci with maf > 5%. These data are in filtered_varMelissaAll.vcf. I then converted this file to genotype likelihood format as melissaWild.gl and moved it to the entropy directory. Next I used an R script to identify a subset of loci that are not near (within 3 bp of) other variable loci, with no more than one locus every 1000 bp. This set of 14051 loci is in sub_melissaWild.gl (the R script is sampleVariants.R and the list of loci is in retained_locids, with TRUE = retain this locus). I generated starting values using lda based on pca, with clusters defined by k-means clustering (see startingValues.R).
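The locus thinning step can be sketched as follows. This is a minimal Python illustration, not the contents of sampleVariants.R. The function name thin_loci, the (scaffold, position) input format, and the greedy left-to-right spacing rule are all assumptions made for the example; "within 3 bp" is taken to mean a distance of 3 bp or less, and "no more than one locus every 1000 bp" is taken to mean that retained loci on the same scaffold are at least 1000 bp apart.

```python
from collections import defaultdict


def thin_loci(loci, near_bp=3, min_spacing_bp=1000):
    """Return a list of booleans (True = retain this locus), aligned with `loci`.

    loci: list of (scaffold, position) tuples, in any order (assumed format).
    """
    # Group loci by scaffold, remembering each locus's original index.
    by_scaffold = defaultdict(list)
    for idx, (scaffold, pos) in enumerate(loci):
        by_scaffold[scaffold].append((pos, idx))

    retain = [False] * len(loci)
    for entries in by_scaffold.values():
        entries.sort()
        positions = [pos for pos, _ in entries]
        last_kept = None
        for i, (pos, idx) in enumerate(entries):
            # Step 1: drop any locus with another variable locus within near_bp.
            close_left = i > 0 and pos - positions[i - 1] <= near_bp
            close_right = (i + 1 < len(positions)
                           and positions[i + 1] - pos <= near_bp)
            if close_left or close_right:
                continue
            # Step 2 (assumed greedy rule): keep the locus only if it is at
            # least min_spacing_bp from the last retained locus on this scaffold.
            if last_kept is not None and pos - last_kept < min_spacing_bp:
                continue
            retain[idx] = True
            last_kept = pos
    return retain


# Toy example with made-up positions.
example = [("s1", 100), ("s1", 102), ("s1", 500),
           ("s1", 900), ("s1", 1600), ("s2", 50)]
print(thin_loci(example))  # [False, False, True, False, True, True]
```

The returned TRUE/FALSE vector has the same role as the retained_locids list described above. A window-based rule (one locus per fixed 1000 bp bin) would be an equally plausible reading and would retain a slightly different set.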
I then analyzed these data in entropy with K=1..5, 15000 steps, a 5000 step burnin, a thinning interval of 5, and starting values sampled based on the lda with a Dirichlet scalar of 20. These jobs are in the long queue with 336 hours requested and have ids 47139-47141 and 47567-47578. The results will be written to scratch/melent/.
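As a quick check on how many posterior samples each chain should yield, here is a small sketch of the arithmetic. It assumes that the 5000 burnin steps are counted within the 15000 total steps and that every 5th post-burnin step is saved; retained_samples is a hypothetical helper, not part of entropy.

```python
def retained_samples(steps, burnin, thin):
    """Number of saved MCMC samples, assuming burnin is part of `steps`."""
    return (steps - burnin) // thin


print(retained_samples(15000, 5000, 5))  # 2000
```

Under these assumptions each chain saves 2000 samples per parameter.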