Feb 21, 2018.
I am starting alignments for the Lycaeides populations today.
Step 1: Fastq files and bwa alignment
Folder: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Alignments/fastqfiles
This folder contains all the fastq files: mine plus those for all the other populations and individuals that I will use in the analysis. Note that my fastq files (after splitting) are saved in the Parsed folder.
Here are the details of the populations we will use in the study and the number of individuals for each population; this is also the number of fastq files per population.
The files in this folder are:
fastq files: one fastq file per individual from each population, 835 files in total (see table below).
melissa_blue_21Nov2017_GLtS4.fasta = Dovetail genome to which reads from all files are aligned.
melissa_blue_21Nov2017_GLtS4.fasta.amb, melissa_blue_21Nov2017_GLtS4.fasta.ann, melissa_blue_21Nov2017_GLtS4.fasta.bwt, melissa_blue_21Nov2017_GLtS4.fasta.pac, melissa_blue_21Nov2017_GLtS4.fasta.sa = index files for the Dovetail genome above. For alignments, these files *need* to be present in the same folder as the genome.
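These five index files (.amb/.ann/.bwt/.pac/.sa) are the standard output of bwa index, so if they are ever missing they can be regenerated from the genome with:
bwa index melissa_blue_21Nov2017_GLtS4.fasta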
wrap_qsub_slurm_bwa_mem.pl = perl script to run alignments using bwa mem.
slurm_bwamem (folder) = slurm output files for each individual fastq file while running bwa alignments.
runbwamem.sh = shell script to run the perl script on the cluster.
Here are the details of the perl script used for alignment using bwa mem.
bwa mem -t 12 -k 15 -r 1.3 -T 30 $genome $file > mem$file.sam
Options used:
-t = number of threads [1]
-k = minimum seed length [19]
-r = look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-T = minimum score to output [30]
$genome = dovetail genome
$file = fastq files
This script created a sam file for each fastq file (i.e., for each individual). I moved these files to the samfiles folder (see Step 2).
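For reference, here is a minimal shell sketch of what the wrapper does for each file (the actual Perl script also writes and submits a SLURM job per file; the *.fastq naming is an assumption about how the parsed files are named):
genome=melissa_blue_21Nov2017_GLtS4.fasta
for file in *.fastq; do
    # align one individual's reads and write mem<file>.sam
    bwa mem -t 12 -k 15 -r 1.3 -T 30 $genome $file > mem$file.sam
done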
Step 2: Converting sam to bam
Folder: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Alignments/samfiles
This folder contains the sam files created in the step above.
.sam = sam files created from the script/step above.
wrap_qsub_slurm_sam2bam.pl = perl script to convert sam files to bam files.
slurm_sam2bam = slurm output files for each individual sam-to-bam conversion.
I used this perl script to convert the sam files to bam files, then moved the bam files to the bamfiles folder. I used samtools version 1.5 with the following commands to create the bam files, sort them, and index them.
samtools view -b -S -o $out.bam $file
samtools sort $out.bam -o $out.sorted.bam
samtools index $out.sorted.bam
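As with the alignment step, here is a minimal shell sketch of the per-file conversion (the real Perl wrapper submits these as SLURM jobs; the *.sam globbing is an assumption):
for file in *.sam; do
    out=${file%.sam}    # strip the .sam extension for the output names
    samtools view -b -S -o $out.bam $file
    samtools sort $out.bam -o $out.sorted.bam
    samtools index $out.sorted.bam
done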
The folder with bam files:
/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Alignments/bamfiles
Files in this folder:
.bam, .sorted.bam, .sorted.bam.bai = bam files, sorted bam files, and indexes of the sorted bam files, respectively.
samtoolsFlagstat.sh = shell script to run the for loop below for each population over individual bam files for that population.
out*.txt = mapped reads information from the above file for each population.
findreadcounts_v2.py = python script to calculate the total mapped reads for each population. Use the out*.txt file for each population from the shell script above; when the script is run, the last value listed for the file is the total number of mapped reads for that file/population.
Before moving on to variant calling, I calculated the number of mapped reads for each population; these are listed in the table below in the column "no. of mapped reads". To do this I created a shell script that runs a for loop to write the mapped-read information to a text file. I then ran the python script findreadcounts_v2.py on this text file to calculate the total mapped reads for each population group.
The for loop for this is (this is an example for DBS but this should be repeated for each population):
for f in memDBS*.sorted.bam; do samtools flagstat $f; done >> outDBS.txt
I then ran the python script on outDBS.txt.
Usage: python findreadcounts_v2.py outDBS.txt
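As a quick cross-check of the value findreadcounts_v2.py reports, the mapped-read total can also be summed directly from the flagstat output with awk (assuming the samtools 1.5 flagstat format, where the mapped line reads "N + 0 mapped (...)"):
# sum the first field of every "mapped (...)" line across all individuals
awk '/ mapped \(/ {total += $1} END {print total}' outDBS.txt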
Step 3: Variant calling for bam files using samtools and bcftools.
I created a shell script to run variant calling. The script is called lycaeidesVariantcalling.sh and is saved in the following directory:
/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Scripts
This script created variant files in bcf and vcf format; both are in the folder: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Variants
I used samtools version 1.5 and bcftools version 1.6 for variant calling. Note that bcftools view is now replaced by bcftools call. Here are the commands for variant calling in the shell script, which takes the genome and a list of bam files as input (not the path to the bam files, but a file listing the names of all the bam files):
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Alignments/fastqfiles/melissa_blue_21Nov2017_GLtS4.fasta -q 20 -Q 15 -g -I -t DP,DPR -u -b lycaeidesBam.txt -o variantsLycaeides.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsLycaeides.vcf variantsLycaeides.bcf
OPTIONS
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST, optional tags to output: DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR [] (here DP,DPR)
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c = the original calling method (conflicts with -m); -m = alternative model for multiallelic and rare-variant calling (conflicts with -c). Here -c is used.
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = --prior FLOAT, mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF (here it is 'v')
#o = write output to a file [standard output]
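One prerequisite worth noting: the reference passed to -f must be faidx-indexed (see the -f option above). If the .fai index is missing, it can be created with:
samtools faidx melissa_blue_21Nov2017_GLtS4.fasta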
Step 4: Filtering of variants from the vcf file.
Directory: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Variants
Once I got the variants file, I did further filtering to make sure the data were clean. Before doing this, I wanted some basic stats such as the number of SNPs, number of indels, Ts/Tv ratio, etc.
For this I ran bcftools stats for each species.
Example usage: bcftools stats variantsLycaeides.vcf > lycaeides.stats
Number of variants called = 1113814
Another way to find the number of variants is simply to use grep to count the variant record lines (here these lines start with Scaffold_):
grep ^Scaffold variantsLycaeides.vcf | wc -l
Now I want to do additional filtering. To do this I can use Zach's filtering scripts. The other option is using vcftools filter.
Filtering step 1: using the vcfFilter_af.pl script
Here are the stringency criteria used for filtering in Zach's vcfFilter.pl script, which I saved as vcfFilter_af.pl in the Scripts directory. I used the bash script runVcfFilter_af.sh to run the perl script on the cluster.
Here are the options used for the filtering:
my $minCoverage = 1670; # minimum number of sequences; DP, no. of individuals = X; mincoverage = 2X #change for each species
my $minAltRds = 4; # minimum number of sequences with the alternative allele; AC #10 #copies
my $notFixed = 1.0; # removes loci fixed for alt; AF #check this
my $mq = 30; # minimum mapping quality; MQ
my $miss = 167; # maximum number of individuals with no data #20% of the number of individuals #change for each species
my $minMaf = 5; # minimum MAF, as allele count ~ 0.005 # 0.005% of 2X, check AF in the sequences #frequency rare vs common #change for each species
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
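As a sanity check on any single criterion, the corresponding INFO tag can be pulled straight out of the vcf. For example, a one-liner to count loci that would fail the mapping-quality cutoff (this assumes the INFO field carries an MQ= tag, as samtools mpileup output normally does; it mirrors the AF1 check below):
grep ^Sc variantsLycaeides.vcf | cut -f 8 | grep -o "MQ=[0-9]*" | awk -F= '$2 < 30' | wc -l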
I did an additional check to find out how many variants lie below or above the MAF threshold. For this I ran the following unix commands:
grep ^Sc variantsLycaeides.vcf |cut -f 8 | grep -o "AF1=[0-9\.]*" > alleleFreqs.txt
awk '$1 < 0.005' alleleFreqs.txt | wc -l #428030
awk '$1 > 0.995' alleleFreqs.txt | wc -l #87601
Now I know how many variants have been filtered after the first round of filtering.
Number of variants retained after this round of filtering = 91266
Filtering step 2: using the filterSomeMore.pl script
To run the second script for further filtering, I needed two inputs: a) a file with the scaffold and position for each SNP, and b) the maximum coverage depth across SNPs.
To get the depth across sites for the filterSomeMore.pl script:
grep ^Sc filtered2x_variantsLycaeides.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > depth_filtered2X_lycaeides.txt
Then I calculated the max coverage for each species in R; this gave mean + 2 sd = 17162.16, which is used as $maxCoverage below:
x = read.table("depth_filtered2X_lycaeides.txt", header=FALSE)
cov = mean(x$V1) + 2*sd(x$V1)
To create the snpsLycaeides.txt file for the filterSomeMore.pl script:
#to get scaffold and position
grep ^Sc filtered2x_variantsLycaeides.vcf | cut -f 1 | cut -d '-' -f 2 > scaffold.txt
#open scaffold.txt in vim and remove any extra text.
grep ^Sc filtered2x_variantsLycaeides.vcf | cut -f 2 > positions.txt
#create final file
paste scaffold.txt positions.txt > snpsLycaeides.txt
Here are the details of the filtering criteria used in this script:
my $maxCoverage = 17162; #maximum depth to avoid repeats, mean + 2sd = 17162.16
my $mind = 3; #don't call if SNPs are closer together than this
I checked how many SNPs are retained if we change the $mind filter. Here are the numbers:
$mind = 5: 41544
$mind = 4: 48717
$mind = 3: 59013 <--- we decided to go with 3
$mind = 2: 71063
Number of variants retained after this round of filtering = 59013. These will be used for downstream analysis.
Additional filtering after finding systematic coverage errors while starting the Entropy analysis.
When I started running entropy, I noticed systematic errors in my newly sequenced data compared to Zach's old GBS sequence data. The depth of coverage at some specific SNPs was pushing the data in a different direction, so we decided to filter the SNPs further. We took the filtered_variantsLycaeides.gl file generated in the previous step (59013 SNPs) and filtered it again to get 39193 SNPs, keeping a SNP only if the difference in mean coverage between old and new samples was <= half the mean coverage of old and new combined, i.e., keep SNP i if |mean_old(i) - mean_new(i)| <= 0.5 * mean_combined(i). Here "old" means samples from the previously sequenced data and "new" means samples from the data I generated.
This new file is called subset_filtered_variantsLycaeides.gl; it has 39193 SNPs and will be used for all downstream analysis.
Step 5: Split the gl file into population files (23 pops, 23 gl files)
Use splitPops.pl to split the main gl file into individual population files.
Usage: perl ../Scripts/splitPops.pl filtered_variantsLycaeides.gl
This will create 23 gl files, one for each population used in this analysis. I moved these gl files to the folder pop_glfiles.
The same was done with the subset filtered file (subset_filtered_variantsLycaeides.gl); the resulting gl files are in the folder pop_glfiles_subset.
Step 6: Estimate population allele frequencies from the gl files
Use estpEM to estimate allele frequencies from the gl file for each population.
I estimated population allele frequencies using Zach's estpEM program. It takes a gl file as input, computes maximum-likelihood allele frequency estimates (via EM), and writes the estimates to a text file; it outputs only the population allele frequencies for each population.
Here is the usage and options for this program:
Usage: popmod -i infile.txt [options]
-i Infile with genetic data for the population
-o Outfile for allele frequency estimates [default = out.txt]
-e Tolerance for EM convergence [default = 0.001]
-m Maximum number of EM iterations [default = 20]
-h Number of header lines [default = 0]
Here is the example command I use (I modified this for each population):
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i Lmpop_BIC.gl -o p_BIC.txt -h 2
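To avoid editing this command by hand for all 23 populations, a small loop can run it over every population gl file (a sketch, assuming the per-population files in pop_glfiles end in .gl; the p_<name>.txt output naming is illustrative):
for f in pop_glfiles/*.gl; do
    pop=$(basename $f .gl)    # e.g. Lmpop_BIC
    /uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i $f -o p_${pop}.txt -h 2
done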
The output file has 3 columns:
Scaffold:position MLE MLE-foranalysis
0:33945 0.0658 0.0000
0:33977 0.0296 0.0000
0:33997 0.1872 0.1667
0:34052 0.0677 0.0000
0:50620 0.1239 0.0002
0:50681 0.1528 0.0709
0:68856 0.1586 0.0005