Detection of Single Nucleotide Polymorphisms (SNPs) and short insertions and deletions (indels) is a common goal in high-throughput sequencing experiments.
Below is a simple workflow for variant discovery using single end Illumina high throughput sequencing data.
1 - Get data into Galaxy: Get Data -> Princeton HTSEQ or Upload File
2 - If necessary, split datasets(s) using barcode splitter: HTS: FASTQ Manipulation and QC -> Barcode Splitter
3 - Run FastQC on fastq file(s) for each sample/replicate. Look for low quality, bias in sequence, etc.
HTS: FASTQ Manipulation and QC -> FastQC Read Quality reports
4 - Map reads to reference using BWA (reads <100bp) or BWA-MEM (reads >=100bp)
HTS: Mapping -> Map with BWA (Map with BWA-MEM)
5 - Evaluate mapping results
6 - Call Variants: Variant Detection and Analysis -> FreeBayes
7 - Filter Variants based on quality or other attributes
Some options will output many variants of low quality in the VCF file. One can easily filter these variants using a number of tools such as:
8 - Annotate Variants
Variants may be annotated based on their overlap with other annotation, variants, or even a prediction of their effect using:
Both BAM and VCF files can be visualized on most Genome Browsers. Three commonly used ones are the build in Trackster tool in Galaxy, the UCSC Genome Browser, and IGV.
To use Trackster, go to the Visualization menu and select New Visualization. Select the genome reference version you are using and give the visualization a name (e.g. Yeast Mutant Variation). Follow the prompts to add new data sets to the visualization. See the Trackster documentation for more information or just explore on your own. You can save these visualization, make bookmarks for various regions of interest, and even share the visualizations with other users.
The UCSC genome browser is another online tool that can be used to visualize your data alongside other types of annotation or data made available publicly at UCSC. To use this, click the "Display at UCSC main or test" link on the data set you wish to visualize. Alternatively, you can download the files to you computer and upload them to UCSC separately (though this is not recommended for large files).
Finally, you can use IGV, which is software you run on your computer, to view the files. You can download them and import them into IGV or you can view them directly from Galaxy. See the instructions for using IGV and Galaxy together.
There are various tools for comparing and analyzing VCF files. The Princeton Galaxy has them categorized in the groups: "Variant Detection and Analysis", "VCF Manipulation", "SNPSift: Filter and manipulate annotated VCF files", and "SNPEff: SNP Effect Predictor". There are also a number of GATK Tools (Genome Analysis Toolkit from the Broad Institute) that work with VCF files.
Generally, the output of high-throughput sequencing runs is one or more FASTQ files. Theses files encode the sequence of each read (usually each 50-100 bases in length) as well as a quality value associated with each base that estimates the confidence the sequencer had in determining that particular nucleotide.
Removing low quality bases and sequencing artifacts and reduce the noise and false positives in your data.
Even if you choose not to preprocess the reads it is advisable to get some summary statistics on the reads in order to identify any sequencing or library quality issues. An excellent tool to QC a Fastq or BAM file is FastQC.
There are many alignment programs available. One excellent one is the Burrows-Wheeler Aligner (BWA). BWA is fast, accurate, and allows for short insertions and deletions in alignments. Another excellent option is Bowtie, which is very fast but does not allow gaps.
Alignment software produces alignment files that are typically in a SAM or BAM format. SAM files are text files, while BAM files are compressed binary versions of SAM files. Generally, output will be in the SAM format and be converted into a BAM file which is more compact. Samtools is a piece of software that will allow you to convert between these formats as well as perform a number of other useful manipulations on SAM/BAM files.
Once the reads are aligned to the reference, differences can be identified and filtered to select only those difference which are most likely evidence of a SNP or Indel. If you have heterozygous individuals, or multiple heterozygous individuals, you can use the samtools mpileup command. If you have a pooled sample, homozygous individuals, or some other experimental setup you may wish to use the somewhat more flexible Freebayes software for calling SNPs and Indels. The Naive Variant Caller is a good option if you simply want read counts of various alleles.
FreeBayes, samtools mpileup, and NVC will allow you to call variants for multiple samples all at once. This has the advantage of using information from all samples during variant calling and produces a file with genotype calls for each sample at each position in the genome that was evaluated. This can be a very powerful technique, but does require that you put read groups in your BAM files.