treemix

To run treemix, I had to redo variant calling to get the same variants for all species. Thankfully, I had saved the alignments for all species where I had aligned the files to the T.cristinae genome. So I started a run to redo variant calling which includes all the individuals, from all the species.

The bam files were prefixed with species name and I moved them out of the species folder into the main folder:

/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles

I created a list of all the sorted bam files to supply to the program for variantcalling:

/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/bamfilelist.txt

The variant calling script and output folder is here:

/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/variantcalling_allfiles

Number of individuals = 1420

1. Did variant calling as follows:

samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/fasta_genome/re_mod_map_timema_06Jun2016_RvNkF702.fasta -q 20 -Q 15 -g -I -t DP,DPR -u -b bamfilelist.txt -o variantsTimema_allspecies.bcf

bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsTimema_allspecies.vcf variantsTimema_allspecies.bcf

Total variants called: 1860088

get summary of allele frequencies from the vcf file

grep -o -E 'AF1=[0-9\.]+' variantsTimema_allspecies.vcf > allelefreq.txt

wc -l af_chum.txt #np. of snps prefilter

2. I then ran:

perl vcfFilter.pl variantsTimema_allspecies.vcf

This left me with 23303 SNPs.

I set the parameters in the script as follows:

#### stringency variables, edits as desired

my $minCoverage = 2840; # minimum number of sequences; DP, no. of individuals = X; mincoverage = 2X

my $minAltRds = 4; # minimum number of sequences with the alternative allele; AC #10

my $notFixed = 1.0; # removes loci fixed for alt; AF

my $mq = 30; # minimum mapping quality; MQ

my $miss = 284; # maximum number of individuals with no data

my $minMaf = 14.2; # minimum MAF, as allele count ~ 0.005

my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation

****** extra stuff************

awk '$1 < 0.005' af_cali.txt | wc -l

awk '$1 > 0.995' af_cali.txt | wc -l

3. Run filter some more script

#get depth across sites for filtersomemore script

grep ^Sc filtered2x_variantsTimema_allspecies.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > filt2X_depth.txt

#calculate max coverage for each species in R

depth = read.table("filt2X_depth.txt", header=FALSE)

cov = mean(depth$V1) + 2*(sd(depth$V1)) #21672.11

To create the SNPs.txt file for filtersomemore.pl script:

#to get scaffold and position

grep ^Sc filtered2x_variantsTimema_allspecies.vcf | cut -d ':' -f 3 > scaffolds.txt

open the scaf file in vim and remove the scaf- part.

grep ^Sc filtered2x_variantsTimema_allspecies.vcf | cut -f 2 > position.txt

#create final file

paste scaffolds.txt position.txt > snps_mappos.txt

After this step I am left with 8787 SNPs. These are what I will use for treemix.

4. Converted the vcf file to gl

perl vcf2gl.pl morefilter_filtered2x_variantsTimema_allspecies.vcf

#Number of loci: 8787; number of individuals 1420

5. Then modified the ids in the glfile to get population name.

head -2 timema_allspecies.gl | tail -1 > glfile_ids

python modifySampleIDglfile.py glfile_ids mod_glfile_ids

cp timema_allspecies.gl timema_allspecies_copy.gl

Then open this file in vim and replace the second line with modified ids from mod_glfile_ids by simple copy paste.

6. Split into file per population

perl splitPops.pl timema_allspecies_copy.gl

mkdir glfiles

mv Lmpop_* glfiles/

#convert to p files

for f in Lmpop_*.gl; do /uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i $f -o "p_$f.txt" -h 2; done

I copy pasted the output from above loop in estpemout. Then:

grep "individuals" estpemout | cut -d ":" -f 2 > num_individuals

grep "^Reading" estpemout | cut -d ' ' -f 4 | cut -d '_' -f 2 | cut -d '.' -f 1 > populations

paste populations num_individuals > population_numindividuals

Page updated

Google Sites

Report abuse