I have converted sam files to bam files for variant calling. The bamfiles are in the folder:
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles
I have to do variant calling for each species separately. Right now all the files are together, so I need to separate them by species and then run variant calling on each set.
So right now...
I have an excel file with the population and species information, and all the files are named by their SRR accession number. So I should write a script (or something) to sort the files by this number into a folder for each species.
Here are some details about the data:
I have 8 species included in the data. Here are the species and the number of populations:
**I have removed population PCTCR from the variant calling analysis.
To separate files by species, I created a list of SRR numbers for each species, saved by species name in the bamfiles folder as: bart.txt, cali.txt...etc
Then in linux I implemented the following loop to move the files from the list to respective species folder:
for i in $(cat cali.txt); do mv "$i" ./cali/; done ##change for each species
In the list files I added the .bam, .sorted.bam, and .sorted.bam.bai suffixes to the SRR numbers, so that the loop moves all three files for each sample into the species folder.
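A self-contained sketch of the move step, run in a throwaway temp directory with made-up SRR names (the real lists live in the bamfiles folder). Using `while read` instead of `$(cat ...)` avoids word-splitting surprises if a list line ever contains spaces:

```shell
#!/bin/sh
# Sketch of the per-species move in a sandbox; SRR names are fabricated.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Mock files: two cali samples (three files each, as in the real data) and
# one file belonging to another species.
for s in SRR0001 SRR0002; do touch "$s.bam" "$s.sorted.bam" "$s.sorted.bam.bai"; done
touch SRR9999.sorted.bam
printf '%s\n' SRR0001.bam SRR0001.sorted.bam SRR0001.sorted.bam.bai \
              SRR0002.bam SRR0002.sorted.bam SRR0002.sorted.bam.bai > cali.txt

mkdir -p cali
while read -r f; do mv "$f" ./cali/; done < cali.txt

ls cali   # all six cali files moved; SRR9999.sorted.bam is left behind
```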
*****************************************************************************************************************************************************************************************************************
Folder for each species:
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/bart
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/cali
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/chum
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/cris
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/knul
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/land
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/podu
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/alignments/bamfiles/popp
To determine the total number of reads and the number of mapped reads in the study, I ran samtools flagstat on each bam file, species by species: in each species folder I ran the loop below and then a python script to sum the mapped reads. Each species folder contains an out.txt file and the python script.
Here is the unix for loop to save the output for each bam file in a file
for f in *sorted.bam; do samtools flagstat "$f"; done >> out.txt
The python script to determine the total number of reads across all bam files is called findreadcounts.py and is saved in the same folder. The script prints out the sum of all read counts (total and mapped) by adding each of the numbers in a sequence. The FINAL number printed out is the total number of reads across all bam files.
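Since findreadcounts.py just sums the flagstat numbers, the same totals can be sketched with awk. The out.txt below is fabricated (two tiny flagstat-style reports); the real file is the concatenated flagstat output from the loop above:

```shell
#!/bin/sh
# Fabricated out.txt: concatenated `samtools flagstat` reports.
cat > out.txt <<'EOF'
100 + 0 in total (QC-passed reads + QC-failed reads)
80 + 0 mapped (80.00% : N/A)
200 + 0 in total (QC-passed reads + QC-failed reads)
150 + 0 mapped (75.00% : N/A)
EOF

# Sum the "in total" and "mapped" lines across all reports.
awk '/in total/ {t += $1} / mapped \(/ {m += $1} \
     END {print "total:", t, "mapped:", m}' out.txt   # prints: total: 300 mapped: 230
```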
Here are the numbers of mapped reads for each species using bwa aln and bwa mem (we are using mem for the analysis):
*************************************************************************************************************************************************************
I created a shell script to run variant calling for each species separately. The script is called timemaVariantCalling.sh and is saved in the following directory:
/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/scripts/
This script creates a bcf and a vcf variant file for each species; these files are in the folder: /uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/variantcalling
This directory contains VCF files for each species. I used the script vcf2gl.pl to convert vcf files to the genotype likelihood files. I think something is up with these files and I need to figure this out. Mainly I need to make sure that the number of loci is consistent in the vcf and gl file.
I used samtools version 1.5 and bcftools version 1.6 for variant calling. Note that bcftools view has been replaced by bcftools call. Here are the variant calling commands in the shell script, which takes the genome and a list of bam files as input (a file listing the names of all the bam files, not the path to them):
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/fasta_genome/re_mod_map_timema_06Jun2016_RvNkF702.fasta -q 20 -Q 15 -g -I -t DP,DPR -u -b bartBam.txt -o variantsTimemaBart.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsTimemaBart.vcf variantsTimemaBart.bcf
OPTIONS
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c = the original consensus calling method (conflicts with -m); -m = alternative model for multiallelic and rare-variant calling (conflicts with -c)
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = --prior FLOAT; mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF (used here)
#o = write output to a file [standard output]
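Rather than editing the two commands by hand for each species, the species loop can be sketched as below. It only prints the command pairs (samtools/bcftools are not invoked); file-name casing follows the bartBam.txt / variantsTimemaBart pattern above:

```shell
#!/bin/sh
# Print the mpileup/call command pair for every species; nothing is executed.
REF=/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/fasta_genome/re_mod_map_timema_06Jun2016_RvNkF702.fasta
for sp in bart cali chum cris knul land podu popp; do
  # Capitalize the species name for the output-file part (bart -> Bart).
  cap=$(echo "$sp" | awk '{ print toupper(substr($0,1,1)) substr($0,2) }')
  echo "samtools mpileup -C 50 -d 250 -f $REF -q 20 -Q 15 -g -I -t DP,DPR -u -b ${sp}Bam.txt -o variantsTimema${cap}.bcf"
  echo "bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsTimema${cap}.vcf variantsTimema${cap}.bcf"
done > cmds.txt
cat cmds.txt
```

The printed lines can be pasted into the sbatch script, or the `echo`s replaced by the real commands once the per-species bam lists are in place.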
Now that I have the variant files for each species, I will do further filtering to make sure the data is clean. Before that, I wanted some basic stats such as the number of SNPs, number of indels, Ts/Tv, etc.
For this I ran bcftools stats for each species.
Example usage: bcftools stats variantsTimemaBart.vcf > bart.stats
The other way to find the number of variants is simply to grep the data lines (here the scaffold names start with Scvr...):
grep ^Sc variantsTimemaCali.vcf | wc -l
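A sanity check of the grep count on a fabricated 3-record VCF (scaffold names start with "Scvr" as in the real assembly; header lines start with # and are skipped):

```shell
#!/bin/sh
# Tiny fabricated VCF: one header line plus three variant records.
{
  echo '##fileformat=VCFv4.2'
  printf 'Scvr123\t101\t.\tA\tG\t50\t.\tDP=40\n'
  printf 'Scvr123\t250\t.\tC\tT\t60\t.\tDP=35\n'
  printf 'Scvr456\t77\t.\tG\tA\t45\t.\tDP=28\n'
} > toy.vcf
grep '^Sc' toy.vcf | wc -l   # 3: only the data lines are counted
```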
Now I want to do additional filtering. For this I can use Zach's filtering scripts; the other option is vcftools. Here are the stringency criteria used for filtering in the script vcfFilter.pl:
my $minCoverage = 232; # minimum number of sequences; DP, no. of individuals = X; mincoverage = 2X #change for each species
my $minAltRds = 4; # minimum number of sequences with the alternative allele; AC #10 #copies
my $notFixed = 1.0; # removes loci fixed for alt; AF #check this
my $mq = 30; # minimum mapping quality; MQ
my $miss = 154; # maximum number of individuals with no data #20% of the number of individuals #change for each species
my $minMaf = 5; # minimum MAF, as allele count ~ 0.005 # 0.005% of 2X, check AF in the sequences #frequency rare vs common #change for each species
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
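A rough awk version of two of the vcfFilter.pl checks (minimum total depth DP and minimum mapping quality MQ) on fabricated records; the real script also checks alt reads, missingness, MAF, and read orientation:

```shell
#!/bin/sh
# Fabricated VCF records (tab-delimited; INFO is field 8).
printf 'Scvr1\t100\t.\tA\tG\t50\t.\tDP=240;MQ=35\n'  > toy.vcf
printf 'Scvr1\t200\t.\tC\tT\t60\t.\tDP=150;MQ=45\n' >> toy.vcf
printf 'Scvr2\t300\t.\tG\tA\t45\t.\tDP=300;MQ=25\n' >> toy.vcf

# Keep sites with DP >= 232 and MQ >= 30 (the defaults listed above).
awk -F'\t' '{
  n = split($8, info, ";")
  dp = mq = 0
  for (i = 1; i <= n; i++) {
    if (sub(/^DP=/, "", info[i])) dp = info[i]
    else if (sub(/^MQ=/, "", info[i])) mq = info[i]
  }
  if (dp + 0 >= 232 && mq + 0 >= 30) print
}' toy.vcf > kept.vcf
wc -l < kept.vcf   # only the first record passes both thresholds
```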
## to make the table below for allele freqs:
grep -o -E 'AF1=[0-9.]+' variantsTimemaChum.vcf | sed 's/AF1=//' > af_chum.txt ## strip the AF1= prefix, otherwise the awk counts below compare the whole string rather than the number
wc -l af_chum.txt # no. of snps prefilter
perl vcfFilter.pl variantsTimemaChum.vcf 358 71.6
awk '$1 < 0.005' af_cali.txt | wc -l
awk '$1 > 0.995' af_cali.txt | wc -l
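A toy version of the two awk counts on fabricated frequencies. Note the AF1= prefix has to be stripped first: if $1 is the literal string "AF1=0.25", awk does a string comparison against 0.005/0.995 and both counts come out wrong.

```shell
#!/bin/sh
# Fabricated allele frequencies: 0.001 is "rare" (< 0.005) and
# 0.999 is effectively fixed (> 0.995).
printf 'AF1=0.001\nAF1=0.25\nAF1=0.999\nAF1=0.5\n' > af_toy.txt

# Strip the prefix so $1 is a plain number awk can compare numerically.
sed 's/^AF1=//' af_toy.txt > af_vals.txt
echo "rare:  $(awk '$1 < 0.005' af_vals.txt | wc -l)"
echo "fixed: $(awk '$1 > 0.995' af_vals.txt | wc -l)"
```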
#get depth across sites for filtersomemore script
grep ^Sc filtered2x_variantsTimemaPopp.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > filt2Xpopp_depth.txt
#calculate max coverage for each species in R
bart = read.table("filt2Xbart_depth.txt", header=FALSE)
bartcov = mean(bart$V1) + 2*(sd(bart$V1))
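The same max-coverage cutoff (mean + 2*SD) can be computed without leaving the shell; an awk sketch with a fabricated depth file (sample SD with n-1, matching R's sd()):

```shell
#!/bin/sh
# Fabricated per-site depths.
printf '10\n20\n30\n40\n50\n' > depth_toy.txt
awk '{ s += $1; ss += $1 * $1; n++ }
     END {
       m = s / n
       sd = sqrt((ss - n * m * m) / (n - 1))   # sample SD, like R sd()
       printf "%.2f\n", m + 2 * sd
     }' depth_toy.txt > cutoff_toy.txt
cat cutoff_toy.txt   # 61.62 for this toy data (mean 30, SD ~15.81)
```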
To create the SNPs.txt file for filtersomemore.pl script:
#to get scaffold and position
grep ^Sc filtered2x_variantsTimemaBart.vcf | cut -d ':' -f 3 > scafBart.txt
open the scaf file in vim and remove the scaf- part (or skip vim: sed -i 's/scaf-//' scafBart.txt).
grep ^Sc filtered2x_variantsTimemaBart.vcf | cut -f 2 > posBart.txt
#create final file
paste scafBart.txt posBart.txt > snpsBart.txt
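The three steps above as one toy pipeline, with sed standing in for the vim edit. The colon-delimited first column here is invented to mimic the description above (the third colon field carries the scaf- part), so it needs an extra `cut -f 1` to drop the rest of the line:

```shell
#!/bin/sh
# Fabricated filtered-VCF lines; column 1 mimics a
# scaffold:linkageGroup:scaf-N identifier.
printf 'Scvr1:LG1:scaf-128\t5012\t.\tA\tG\t50\t.\tDP=40\n'  > toy_filtered.vcf
printf 'Scvr2:LG3:scaf-77\t914\t.\tC\tT\t60\t.\tDP=35\n'   >> toy_filtered.vcf

# Scaffold: third colon field, trimmed to its first tab field, scaf- removed.
grep '^Sc' toy_filtered.vcf | cut -d ':' -f 3 | cut -f 1 | sed 's/^scaf-//' > scaf_toy.txt
# Position: second tab field.
grep '^Sc' toy_filtered.vcf | cut -f 2 > pos_toy.txt
paste scaf_toy.txt pos_toy.txt > snps_toy.txt
cat snps_toy.txt
```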
Converting vcf file to gl and then splitting populations for estimating allele frequencies:
After completing variant calling and the variant filtering, the vcf file needs to be converted to a genotype likelihood (gl) file.
To do this I used the script vcf2gl.pl. The script was modified for the timema data, since we wanted to pick up scaffold, linkage group, and position instead of just scaffold and position. I need to create a gl file for each of the 8 species.
Now, since the individual IDs used so far are not population-specific, I modified them by adding the population as a prefix to each individual ID. Once the IDs are modified, I split the genotype likelihood file by population within each species, for the next step of estimating allele frequencies.
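A sketch of the ID-prefixing step, assuming a plain list of individual IDs and a two-column sample-to-population table (both fabricated, including the population names; the real gl header layout may differ):

```shell
#!/bin/sh
# Fabricated inputs: one SRR ID per line, and a popmap mapping ID -> population.
printf 'SRR0001\nSRR0002\nSRR0003\n' > ids_toy.txt
printf 'SRR0001 HV\nSRR0002 HV\nSRR0003 LA\n' > popmap_toy.txt

# First pass (NR==FNR) loads the popmap; second pass rewrites each ID as pop_ID.
awk 'NR == FNR { pop[$1] = $2; next } { print pop[$1] "_" $1 }' \
    popmap_toy.txt ids_toy.txt > ids_prefixed_toy.txt
cat ids_prefixed_toy.txt   # HV_SRR0001, HV_SRR0002, LA_SRR0003
```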
#transitions vs transversions after this filtering
Example: grep ^Sc filtered2x_variantsTimemaBart.vcf | cut -f 4,5 | sort | uniq -c
1). Bart
Transitions: 6325
1420 A G
1731 C T
1714 G A
1460 T C
Transversions: 3963
533 A C
552 A T
708 C A
253 C G
201 G C
696 G T
526 T A
494 T G
2). Cali
1604 A C
3873 A G
1549 A T
1798 C A
600 C G
3414 C T
3515 G A
612 G C
1791 G T
1568 T A
3797 T C
1530 T G
3). Chum
405 A C
1286 A G
595 A T
721 C A
199 C G
1819 C T
1823 G A
232 G C
689 G T
581 T A
1348 T C
392 T G
4). Cris
14094 A C
29302 A G
23303 A T
28318 C A
7159 C G
51922 C T
52016 G A
7132 G C
28416 G T
23548 T A
29296 T C
14014 T G
5). Knul
1315 A C
3440 A G
1418 A T
1749 C A
608 C G
3458 C T
3670 G A
544 G C
1811 G T
1361 T A
3453 T C
1319 T G
6). Land
1025 A C
2473 A G
1054 A T
1273 C A
377 C G
2529 C T
2605 G A
399 G C
1333 G T
1121 T A
2463 T C
983 T G
7). Podu
621 A C
1798 A G
852 A T
1081 C A
334 C G
2609 C T
2663 G A
310 G C
1117 G T
844 T A
1911 T C
639 T G
8). Popp
1234 A C
3078 A G
1102 A T
1349 C A
510 C G
2810 C T
2802 G A
472 G C
1360 G T
1122 T A
3116 T C
1236 T G
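The Ts/Tv ratio can be computed straight from this uniq -c style output with awk; here using the Bart counts above (A<->G and C<->T are transitions, everything else a transversion):

```shell
#!/bin/sh
# Bart REF/ALT counts from `uniq -c` (order does not matter for the sums).
cat > bart_tstv.txt <<'EOF'
533 A C
1420 A G
552 A T
708 C A
253 C G
1731 C T
1714 G A
201 G C
696 G T
526 T A
1460 T C
494 T G
EOF
awk '{
  pair = $2 $3
  if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts += $1
  else tv += $1
} END { printf "Ts=%d Tv=%d Ts/Tv=%.2f\n", ts, tv, ts / tv }' bart_tstv.txt > tstv_toy.txt
cat tstv_toy.txt   # Ts=6325 Tv=3963 Ts/Tv=1.60, matching the Bart totals above
```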