Post date: Mar 11, 2016 2:14:28 AM
I obtained this from Victor:
variants.flt.100kmreads_10kdepth_20QS.50cov.phylo.bcf -> 1505 individuals, 324,355 SNVs
used for ML tree with all 1505 individuals
filters: individuals with <100,000 mapped reads removed
variants <50% sampling coverage, max. depth>10000, QS<20 removed
CMD: bcftools view -vd 0.5 variants.raw.bcf | vcfutils.pl varFilter -D 10000 | awk '/^#/ || $6>=20'
I then converted variants.flt.100kmreads_10kdepth_20QS.50cov.phylo.bcf to variants.flt.100kmreads_10kdepth_20QS.50cov.phylo.vcf with bcftools:
bcftools view -N -c -e -g -I -p 0.01 -P full -v variants.flt.100kmreads_10kdepth_20QS.50cov.phylo.bcf > variants.flt.100kmreads_10kdepth_20QS.50cov.phylo.vcf
Hereafter, everything is in /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_moehmm/ at the UofU CHPC.
Next I converted the vcf to gl format (with only MAF > 1% SNPs retained) and split the gl by population = species, host and locality:
perl subvcf2gl.pl 0.01 variants.flt.100kmreads_10kdepth_20QS.50cov.phylo.vcf
perl splitPops.pl modgbs50.gl
Some of the files (populations) have no or very few individuals. Next I need to infer allele frequencies, but I am doing this only for populations with at least five individuals.
I iterated over the 85 files (populations) with 5+ individuals (in files5plus.txt) with runestpEM.pl. Here is an example command:
estpEM -i timemabart_BMP90_LP.gl -o ptimemabart_BMP90_LP.txt -e 0.0001 -m 20
This creates p_* files with EM allele frequencies in column 3. Note that some allele frequencies appear to be at 0.5, which suggests no data (I am sure this is true for most, but of course not all). Also, it looks like there are runs of SNPs with identical allele frequencies.
I calculated pi, Dxy, Da, and Fst in 20kb windows for all pairs of populations (regardless of species) that were in the same locality (73 comparisons, listed in filesMoe.txt).
R scripts for this were generated with,
perl ../scripts/calcNeiMoe.pl
And then run with,
perl wrap_slurm_chpc_example.pl source_*
I used sortNei.pl to order scaffolds by LG and order within LGs, and to drop any non-LG scaffolds (based on genome 0.3). The resulting files are sort_nei* and the originals are in the popgen/raw subdirectory.
I then generated scripts to run the 2-state HMM with scripts/makeHmmScripts.pl, which defines states based on the mean and 90th quantile logit Fst. Summaries are number of highh regions, total length of high state, mean length, mean background Fst, and mean high state Fst. I ran the scripts in R with runhmm.pl and the results are in the hmm*txt files and rwsp files.
With the odd exception of the first 20kb window, there are essentially no high HMM states.