variant calling

Post date: Nov 09, 2013 11:1:19 PM

I am using samtools and bcftools to identify variable sites in the melissa sequence data. I am not sure whether I want to use all 1573 individuals, or just wild-caught individuals (most individuals were in the selection experiment and come from just two populations). Also, some of the individuals with 'SELEXP' ids are actually wild-caught (this includes more individuals than were sampled from the other natural populations). I will first contrast two variant calling runs with all individuals or those without 'SELEXP' ids. The specific commands are below. Note I also added the -C 50 option to downgrade the quality scores of reads with many miss-matches (the value 50 is recommended for alignments with bwa), and I added the -d 50 option to only use 50 reads per individual and locus. Otherwise this options mainly match what I have used before.

wild-caught only, no 'SELEXP'

samtools mpileup -C 50 -d 50 -f final.assembly.fasta -q 10 -Q 15 -D -g -I aln[A-Z][A-Z][A-Z][0-9]*bam aln[A-Z][A-Z][0-9]*bam > varMelissaWild.bcf

bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v varMelissaWild.bcf > varMelissaAll.vcf

all individuals

samtools mpileup -C 50 -d 50 -f final.assembly.fasta -q 10 -Q 15 -D -g -I *bam > varMelissaAll.bcf

bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v varMelissaAll.bcf > varMelissaAll.vcf

I used samtools v 0.1.19-44428cd, which has these command options for mpileup,

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:

-6 assume the quality is in the Illumina-1.3+ encoding

-A count anomalous read pairs

-B disable BAQ computation

-b FILE list of input BAM filenames, one per line [null]

-C INT parameter for adjusting mapQ; 0 to disable [0]

-d INT max per-BAM depth to avoid excessive memory usage [250]

-E recalculate extended BAQ on the fly thus ignoring existing BQs

-f FILE faidx indexed reference sequence file [null]

-G FILE exclude read groups listed in FILE [null]

-l FILE list of positions (chr pos) or regions (BED) [null]

-M INT cap mapping quality at INT [60]

-r STR region in which pileup is generated [null]

-R ignore RG tags

-q INT skip alignments with mapQ smaller than INT [0]

-Q INT skip bases with baseQ/BAQ smaller than INT [13]

--rf INT required flags: skip reads with mask bits unset []

--ff INT filter flags: skip reads with mask bits set []

Output options:

-D output per-sample DP in BCF (require -g/-u)

-g generate BCF output (genotype likelihoods)

-O output base positions on reads (disabled by -g/-u)

-s output mapping quality (disabled by -g/-u)

-S output per-sample strand bias P-value in BCF (require -g/-u)

-u generate uncompress BCF output

SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):

-e INT Phred-scaled gap extension seq error probability [20]

-F FLOAT minimum fraction of gapped reads for candidates [0.002]

-h INT coefficient for homopolymer errors [100]

-I do not perform indel calling

-L INT max per-sample depth for INDEL calling [250]

-m INT minimum gapped reads for indel candidates [1]

-o INT Phred-scaled gap open sequencing error probability [40]

-p apply -m and -F per-sample to increase sensitivity

-P STR comma separated list of platforms for indels [all]

Notes: Assuming diploid individuals.

And I used bcftools view version 0.1.19-44428cd, which has these options,

Usage: bcftools view [options] <in.bcf> [reg]

Input/output options:

-A keep all possible alternate alleles at variant sites

-b output BCF instead of VCF

-D FILE sequence dictionary for VCF->BCF conversion [null]

-F PL generated by r921 or before (which generate old ordering)

-G suppress all individual genotype information

-l FILE list of sites (chr pos) or regions (BED) to output [all sites]

-L calculate LD for adjacent sites

-N skip sites where REF is not A/C/G/T

-Q output the QCALL likelihood format

-s FILE list of samples to use [all samples]

-S input is VCF

-u uncompressed BCF output (force -b)

Consensus/variant calling options:

-c SNP calling (force -e)

-d FLOAT skip loci where less than FLOAT fraction of samples covered [0]

-e likelihood based analyses

-g call genotypes at variant sites (force -c)

-i FLOAT indel-to-substitution ratio [-1]

-I skip indels

-m FLOAT alternative model for multiallelic and rare-variant calling, include if P(chi^2)>=FLOAT

-p FLOAT variant if P(ref|D)<FLOAT [0.5]

-P STR type of prior: full, cond2, flat [full]

-t FLOAT scaled substitution mutation rate [0.001]

-T STR constrained calling; STR can be: pair, trioauto, trioxd and trioxs (see manual) [null]

-v output potential variant sites only (force -c)

Contrast calling and association test options:

-1 INT number of group-1 samples [0]

-C FLOAT posterior constrast for LRT<FLOAT and P(ref|D)<0.5 [1]

-U INT number of permutations for association testing (effective with -1) [0]

-X FLOAT only perform permutations for P(chi^2)<FLOAT [0.01]