SNP and Indel Detection

Detection of Single Nucleotide Polymorphisms (SNPs) and short insertions and deletions (indels) is a common goal in high-throughput sequencing experiments.

Below is a simple workflow for variant discovery using single end Illumina high throughput sequencing data.

1 - Get data into Galaxy: Get Data -> Princeton HTSEQ or Upload File

2 - If necessary, split datasets(s) using barcode splitter: HTS: FASTQ Manipulation and QC -> Barcode Splitter

3 - Run FastQC on fastq file(s) for each sample/replicate. Look for low quality, bias in sequence, etc.

HTS: FASTQ Manipulation and QC -> FastQC Read Quality reports

4 - Map reads to reference using BWA (reads <100bp) or BWA-MEM (reads >=100bp)

HTS: Mapping -> Map with BWA (Map with BWA-MEM)

  1. Select the appropriate reference genome (e.g. saccharomyces_cerevisiae_s288c_sgd_r64-1-1_20110203)
  2. Select input type (single end, paired end, etc.)
  3. Choose "Set read groups (SAM/BAM specification)
    • Choose a unique read group identifier (e.g. wildtype_lib1)
    • Enter the sample name (e.g. wildtype)
    • Choose platform (e.g. ILLUMINA)
    • Enter a library name (e.g. lib1)

5 - Evaluate mapping results

  1. Mapping percentage: HTS: SAM/BAM Manipulation -> Flagstat
  2. Typically, this should be relatively high for genomic samples (~90%)
  3. Depth of coverage histogram: BEDTools -> Genome Coverage in bedGraph or histogram format
  4. This helps to determine if there are large regions of poor coverage. Typically, one would like at least 30x coverage.
  5. For this tool:
    • Output type: Data suitable for histogram
    • Max depth: 100 (this depends on the depth of coverage in your sample)
    • Genome file: you will need to import a fasta file with the reference genome you used into your history.
    • These are typically available in Shared Data -> Data Libraries -> Reference Genomes and Annotations
  6. Depth of coverage plot: Convert Format -> Bam to BigWig
  7. This can be visualized in Galaxy (using Trackster), via the UCSC Genome Browser, or IGV.

6 - Call Variants: Variant Detection and Analysis -> FreeBayes

  • Select appropriate reference genome (e.g. saccharomyces_cerevisiae_s288c_sgd_r64-1-1_20110203)
  • Add additional BAM files (this can aid in comparisons, since data is output for all samples at each loci)
  • Choose parameter selection level
    • Simple diploid calling: The simples possible FreeBayes application. Equvalent of using FreeBayes with only a BAM input and no other parameter options.
    • Simple diploid calling with filtering and coverage: Same as #1 plus two additional options: -0 (standard filters: --min-mapping-quality 30 --min-base-quality 20 --min-supporting-allele-qsum 0 --genotype-varinat-threshold 0) and --min-coverage.
    • Frequency-based pooled calling: This is equivalent to using FreeBayes with the following options: --haplotype-length 0 --min-alternate-count 1 --min-alternate-fraction 0 --pooled-continuous --report-monomorphic. This is the best choice for calling varinats in mixtures such as viral, bacterial, or organellar genomes.
    • Frequency-based pooled calling with filtering and coverage: Same as #3 but adds -0 and --min-coverage like in #2.
    • Complete list of all options: Gives you full control by exposing all FreeBayes options as Galaxy widgets.

7 - Filter Variants based on quality or other attributes

Some options will output many variants of low quality in the VCF file. One can easily filter these variants using a number of tools such as:

  • VCF Manipulation -> VCFfilter
  • VCF Manipulation -> Filter a VCF file
  • SNPSift: Filter and manipulate annotated VCF files -> SnpSift Filter

8 - Annotate Variants

Variants may be annotated based on their overlap with other annotation, variants, or even a prediction of their effect using:

  • VCF Manipulation -> VCFannotate: Intersect VCF records with BED annotations
  • SNPEff: SNP Effect Predictor -> SnpEff Variant effect and annotation

Visualization

Both BAM and VCF files can be visualized on most Genome Browsers. Three commonly used ones are the build in Trackster tool in Galaxy, the UCSC Genome Browser, and IGV.

To use Trackster, go to the Visualization menu and select New Visualization. Select the genome reference version you are using and give the visualization a name (e.g. Yeast Mutant Variation). Follow the prompts to add new data sets to the visualization. See the Trackster documentation for more information or just explore on your own. You can save these visualization, make bookmarks for various regions of interest, and even share the visualizations with other users.

The UCSC genome browser is another online tool that can be used to visualize your data alongside other types of annotation or data made available publicly at UCSC. To use this, click the "Display at UCSC main or test" link on the data set you wish to visualize. Alternatively, you can download the files to you computer and upload them to UCSC separately (though this is not recommended for large files).

Finally, you can use IGV, which is software you run on your computer, to view the files. You can download them and import them into IGV or you can view them directly from Galaxy. See the instructions for using IGV and Galaxy together.

Working with VCF files

There are various tools for comparing and analyzing VCF files. The Princeton Galaxy has them categorized in the groups: "Variant Detection and Analysis", "VCF Manipulation", "SNPSift: Filter and manipulate annotated VCF files", and "SNPEff: SNP Effect Predictor". There are also a number of GATK Tools (Genome Analysis Toolkit from the Broad Institute) that work with VCF files.

Other things to consider

Preprocessing the reads

Generally, the output of high-throughput sequencing runs is one or more FASTQ files. Theses files encode the sequence of each read (usually each 50-100 bases in length) as well as a quality value associated with each base that estimates the confidence the sequencer had in determining that particular nucleotide.

Removing low quality bases and sequencing artifacts and reduce the noise and false positives in your data.

Even if you choose not to preprocess the reads it is advisable to get some summary statistics on the reads in order to identify any sequencing or library quality issues. An excellent tool to QC a Fastq or BAM file is FastQC.

Mapping or aligning the reads to a reference genome

There are many alignment programs available. One excellent one is the Burrows-Wheeler Aligner (BWA). BWA is fast, accurate, and allows for short insertions and deletions in alignments. Another excellent option is Bowtie, which is very fast but does not allow gaps.

Alignment software produces alignment files that are typically in a SAM or BAM format. SAM files are text files, while BAM files are compressed binary versions of SAM files. Generally, output will be in the SAM format and be converted into a BAM file which is more compact. Samtools is a piece of software that will allow you to convert between these formats as well as perform a number of other useful manipulations on SAM/BAM files.

Calling SNPs and Indels

Once the reads are aligned to the reference, differences can be identified and filtered to select only those difference which are most likely evidence of a SNP or Indel. If you have heterozygous individuals, or multiple heterozygous individuals, you can use the samtools mpileup command. If you have a pooled sample, homozygous individuals, or some other experimental setup you may wish to use the somewhat more flexible Freebayes software for calling SNPs and Indels. The Naive Variant Caller is a good option if you simply want read counts of various alleles.

FreeBayes, samtools mpileup, and NVC will allow you to call variants for multiple samples all at once. This has the advantage of using information from all samples during variant calling and produces a file with genotype calls for each sample at each position in the genome that was evaluated. This can be a very powerful technique, but does require that you put read groups in your BAM files.