Post date: Nov 09, 2013 11:1:19 PM
I am using samtools and bcftools to identify variable sites in the melissa sequence data. I am not sure whether I want to use all 1573 individuals, or just wild-caught individuals (most individuals were in the selection experiment and come from just two populations). Also, some of the individuals with 'SELEXP' ids are actually wild-caught (this includes more individuals than were sampled from the other natural populations). I will first contrast two variant calling runs with all individuals or those without 'SELEXP' ids. The specific commands are below. Note I also added the -C 50 option to downgrade the quality scores of reads with many miss-matches (the value 50 is recommended for alignments with bwa), and I added the -d 50 option to only use 50 reads per individual and locus. Otherwise this options mainly match what I have used before.
wild-caught only, no 'SELEXP'
samtools mpileup -C 50 -d 50 -f final.assembly.fasta -q 10 -Q 15 -D -g -I aln[A-Z][A-Z][A-Z][0-9]*bam aln[A-Z][A-Z][0-9]*bam > varMelissaWild.bcf
bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v varMelissaWild.bcf > varMelissaAll.vcf
all individuals
samtools mpileup -C 50 -d 50 -f final.assembly.fasta -q 10 -Q 15 -D -g -I *bam > varMelissaAll.bcf
bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v varMelissaAll.bcf > varMelissaAll.vcf
I used samtools v 0.1.19-44428cd, which has these command options for mpileup,
Usage: samtools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6 assume the quality is in the Illumina-1.3+ encoding
-A count anomalous read pairs
-B disable BAQ computation
-b FILE list of input BAM filenames, one per line [null]
-C INT parameter for adjusting mapQ; 0 to disable [0]
-d INT max per-BAM depth to avoid excessive memory usage [250]
-E recalculate extended BAQ on the fly thus ignoring existing BQs
-f FILE faidx indexed reference sequence file [null]
-G FILE exclude read groups listed in FILE [null]
-l FILE list of positions (chr pos) or regions (BED) [null]
-M INT cap mapping quality at INT [60]
-r STR region in which pileup is generated [null]
-R ignore RG tags
-q INT skip alignments with mapQ smaller than INT [0]
-Q INT skip bases with baseQ/BAQ smaller than INT [13]
--rf INT required flags: skip reads with mask bits unset []
--ff INT filter flags: skip reads with mask bits set []
Output options:
-D output per-sample DP in BCF (require -g/-u)
-g generate BCF output (genotype likelihoods)
-O output base positions on reads (disabled by -g/-u)
-s output mapping quality (disabled by -g/-u)
-S output per-sample strand bias P-value in BCF (require -g/-u)
-u generate uncompress BCF output
SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):
-e INT Phred-scaled gap extension seq error probability [20]
-F FLOAT minimum fraction of gapped reads for candidates [0.002]
-h INT coefficient for homopolymer errors [100]
-I do not perform indel calling
-L INT max per-sample depth for INDEL calling [250]
-m INT minimum gapped reads for indel candidates [1]
-o INT Phred-scaled gap open sequencing error probability [40]
-p apply -m and -F per-sample to increase sensitivity
-P STR comma separated list of platforms for indels [all]
Notes: Assuming diploid individuals.
And I used bcftools view version 0.1.19-44428cd, which has these options,
Usage: bcftools view [options] <in.bcf> [reg]
Input/output options:
-A keep all possible alternate alleles at variant sites
-b output BCF instead of VCF
-D FILE sequence dictionary for VCF->BCF conversion [null]
-F PL generated by r921 or before (which generate old ordering)
-G suppress all individual genotype information
-l FILE list of sites (chr pos) or regions (BED) to output [all sites]
-L calculate LD for adjacent sites
-N skip sites where REF is not A/C/G/T
-Q output the QCALL likelihood format
-s FILE list of samples to use [all samples]
-S input is VCF
-u uncompressed BCF output (force -b)
Consensus/variant calling options:
-c SNP calling (force -e)
-d FLOAT skip loci where less than FLOAT fraction of samples covered [0]
-e likelihood based analyses
-g call genotypes at variant sites (force -c)
-i FLOAT indel-to-substitution ratio [-1]
-I skip indels
-m FLOAT alternative model for multiallelic and rare-variant calling, include if P(chi^2)>=FLOAT
-p FLOAT variant if P(ref|D)<FLOAT [0.5]
-P STR type of prior: full, cond2, flat [full]
-t FLOAT scaled substitution mutation rate [0.001]
-T STR constrained calling; STR can be: pair, trioauto, trioxd and trioxs (see manual) [null]
-v output potential variant sites only (force -c)
Contrast calling and association test options:
-1 INT number of group-1 samples [0]
-C FLOAT posterior constrast for LRT<FLOAT and P(ref|D)<0.5 [1]
-U INT number of permutations for association testing (effective with -1) [0]
-X FLOAT only perform permutations for P(chi^2)<FLOAT [0.01]