Post date: Jun 25, 2016 10:11:36 PM
GBS data from 2013 and 2015 (or 2014 for RNV) from the 10 focal Lycaeides populations were generated to obtain initial estimates of contemporary Ne for my CAREER grant (libraries gomp012-015, HiSeq4000). The data are in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/. Here are my initial analyses.
1. Files were parsed and split using my standard scripts. Note that fewer reads than normal had usable barcodes (about 80% were good); a quick way to recheck that figure is sketched below.
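(A minimal sketch for rechecking the ~80% figure by comparing read counts before and after parsing; lane1.fastq for the raw reads and the parsed per-individual fastqs are hypothetical names here, not the actual files.)
# fraction of reads surviving barcode parsing (4 fastq lines per read)
raw=$(( $(wc -l < lane1.fastq) / 4 ))
parsed=$(( $(cat Parsed/*.fastq | wc -l) / 4 ))
echo "scale=3; $parsed / $raw" | bc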
2. Sequences were aligned to the L. melissa reference genome using bwa 0.7.10-r789 (the aln and samse algorithms). This was done for all 768 individuals using my Perl wrapper script; example commands for one individual are below. The resulting SAM files were then sorted and compressed (see the sketch after the example commands).
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnUSL-15-50.sai /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/melissa_genome/final.assembly.fasta USL-15-50.fastq
bwa samse -n 1 -r '@RG\tID:lyc-USL-15-50\tPL:ILLUMINA\tLB:lyc-USL-15-50\tSM:lyc-USL-15-50' -f alnUSL-15-50.sam /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/melissa_genome/final.assembly.fasta alnUSL-15-50.sai USL-15-50.fastq
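The wrapper handles the per-individual loop; the sort-and-compress step afterward was roughly as follows (a minimal sketch using samtools, assuming the SAM files follow the aln*.sam naming above):
# convert each SAM to a sorted, compressed BAM
for s in aln*.sam; do
    b=${s%.sam}
    samtools view -b -o $b.bam $s                    # SAM -> BAM (compression)
    samtools sort -T tmp_$b -o $b.sorted.bam $b.bam  # coordinate sort
done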
3. Variant calling was conducted with samtools 1.2 (htslib 1.2.1) and bcftools 1.3 (htslib 1.3). If this turns out to be too slow, I will switch to GATK. Here is the script I ran (callvar.sh):
#!/bin/sh
#SBATCH --time=120:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=samtools
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Alignments
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/melissa_genome/final.assembly.fasta -q 20 -Q 30 -I -u -g -t DP,DPR -o lycTimeseries.bcf
bcftools call -v -m -p 0.01 -P 0.001 -O v -o lycTimeseries.vcf lycTimeseries.bcf
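Note that mpileup reads its BAM paths from the file bams (via -b bams). A minimal sketch of how that list and the BAM indexes might be generated beforehand (the aln*.sorted.bam naming is an assumption carried over from the alignment step):
ls aln*.sorted.bam > bams                          # one BAM path per line for mpileup -b
while read b; do samtools index $b; done < bams    # index each BAM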
4. This resulted in 1,152,985 SNVs.
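As a quick sanity check, that count can be recovered straight from the VCF; since indels were skipped with mpileup -I, every record should be an SNV:
grep -vc "^#" lycTimeseries.vcf    # count non-header (variant) lines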
I am filtering variants with my own vcfFilter.pl script. Here are the key parameters:
#### stringency variables, edit as desired
my $minCoverage = 1536; # minimum number of sequences; DP, 1536 = 2X
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 154; # maximum number of individuals with no data
my $minMaf = 5; # minimum MAF, as allele count ~ 0.005
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
Beyond this, I also drop any INDELs (there shouldn't be any, given the mpileup -I flag) and multi-allelic SNPs. Note that allowing 154 individuals with missing data corresponds to 20% of the 768 individuals. Filtering left me with 132,396 SNVs, which are in filtered2x_lycTimeseries.vcf in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants/.
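The INDEL and multi-allelic restrictions are applied inside vcfFilter.pl; for verification, an equivalent stand-alone check with bcftools 1.3 (just a sketch, not part of the pipeline) would be:
# keep only biallelic SNPs and count what survives
bcftools view -m2 -M2 -v snps filtered2x_lycTimeseries.vcf | grep -vc "^#"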
As with my last Lycaeides data set, I decided that additional filtering was warranted:
1. Remove very high coverage (putative repeat region) SNPs. Mean depth (across all inds.) was 7,603 reads (9.9X per individual). I plan to drop all SNPs with coverage greater than the mean + 2 s.d., that is, the 1.7% of SNPs with > 19,124 reads (see the depth sketch at the end of this post).
2. Drop all SNPs within X base pairs of each other (where X is 4 in this case).
Both filters are implemented in filterSomeMore.pl. The resulting file is morefilter_filtered2x_lycTimeseries.vcf and includes 55,379 SNPs. Of these, 6,276 have a MAF > 5% (common variants), 14,499 have a MAF between 1% and 5% (uncommon variants), and 34,604 have a MAF < 1% (rare variants). These counts are based on the allele frequencies reported in the VCF output.
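For reference (the depth sketch mentioned in point 1), the mean + 2 s.d. coverage cutoff can be recomputed from the per-SNP total depth; this is a minimal sketch that assumes total depth is stored in the INFO DP field, as it is with the mpileup settings above:
# mean and mean + 2 s.d. of per-SNP depth, taken from the INFO column (column 8)
awk '!/^#/ && match($8, /DP=[0-9]+/) {
        d = substr($8, RSTART + 3, RLENGTH - 3)  # depth as a number
        s += d; ss += d * d; n++
     }
     END { m = s / n; print "mean:", m, "cutoff:", m + 2 * sqrt(ss / n - m * m) }' \
    filtered2x_lycTimeseries.vcf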