To run treemix, I had to redo variant calling to get the same variants for all species. Thankfully, I had saved the alignments for all species where I had aligned the files to the T.cristinae genome. So I started a run to redo variant calling which includes all the individuals, from all the species.
The bam files were prefixed with species name and I moved them out of the species folder into the main folder:
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles
I created a list of all the sorted bam files to supply to the program for variantcalling:
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/bamfilelist.txt
The variant calling script and output folder is here:
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/variantcalling_allfiles
Number of individuals = 1420
1. Did variant calling as follows:
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/fasta_genome/re_mod_map_timema_06Jun2016_RvNkF702.fasta -q 20 -Q 15 -g -I -t DP,DPR -u -b bamfilelist.txt -o variantsTimema_allspecies.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsTimema_allspecies.vcf variantsTimema_allspecies.bcf
Total variants called: 1860088
get summary of allele frequencies from the vcf file
grep -o -E 'AF1=[0-9\.]+' variantsTimema_allspecies.vcf > allelefreq.txt
wc -l af_chum.txt #np. of snps prefilter
2. I then ran:
perl vcfFilter.pl variantsTimema_allspecies.vcf
This left me with 23303 SNPs.
I set the parameters in the script as follows:
#### stringency variables, edits as desired
my $minCoverage = 2840; # minimum number of sequences; DP, no. of individuals = X; mincoverage = 2X
my $minAltRds = 4; # minimum number of sequences with the alternative allele; AC #10
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 284; # maximum number of individuals with no data
my $minMaf = 14.2; # minimum MAF, as allele count ~ 0.005
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
****** extra stuff************
awk '$1 < 0.005' af_cali.txt | wc -l
awk '$1 > 0.995' af_cali.txt | wc -l
3. Run filter some more script
#get depth across sites for filtersomemore script
grep ^Sc filtered2x_variantsTimema_allspecies.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > filt2X_depth.txt
#calculate max coverage for each species in R
depth = read.table("filt2X_depth.txt", header=FALSE)
cov = mean(depth$V1) + 2*(sd(depth$V1)) #21672.11
To create the SNPs.txt file for filtersomemore.pl script:
#to get scaffold and position
grep ^Sc filtered2x_variantsTimema_allspecies.vcf | cut -d ':' -f 3 > scaffolds.txt
open the scaf file in vim and remove the scaf- part.
grep ^Sc filtered2x_variantsTimema_allspecies.vcf | cut -f 2 > position.txt
#create final file
paste scaffolds.txt position.txt > snps_mappos.txt
After this step I am left with 8787 SNPs. These are what I will use for treemix.
4. Converted the vcf file to gl
perl vcf2gl.pl morefilter_filtered2x_variantsTimema_allspecies.vcf
#Number of loci: 8787; number of individuals 1420
5. Then modified the ids in the glfile to get population name.
head -2 timema_allspecies.gl | tail -1 > glfile_ids
python modifySampleIDglfile.py glfile_ids mod_glfile_ids
cp timema_allspecies.gl timema_allspecies_copy.gl
Then open this file in vim and replace the second line with modified ids from mod_glfile_ids by simple copy paste.
6. Split into file per population
perl splitPops.pl timema_allspecies_copy.gl
mkdir glfiles
mv Lmpop_* glfiles/
#convert to p files
for f in Lmpop_*.gl; do /uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i $f -o "p_$f.txt" -h 2; done
I copy pasted the output from above loop in estpemout. Then:
grep "individuals" estpemout | cut -d ":" -f 2 > num_individuals
grep "^Reading" estpemout | cut -d ' ' -f 4 | cut -d '_' -f 2 | cut -d '.' -f 1 > populations
paste populations num_individuals > population_numindividuals