syqada
Introduction
The reproducibility of large-scale, complex analyses is one of the paramount problems of bioinformatics. It is a non-trivial engineering problem that must be addressed to perform high-quality research. The System for Quality-Assured Data Analysis (SyQADA), the workflow automation system described here, seeks to address reproducibility while imposing minimal intellectual overhead on the user. SyQADA differs from other workflow systems in that its only dependencies are a Unix operating system providing the bash shell and a standard installation of Python 2.7.
A SyQADA workflow is simply a list of task definitions. To create a new workflow, a user must write a bash script template that uses a simple syntax for specifying parameters; these parameters are substituted with input and output filenames, sample names, and other values that can vary with each invocation of the script. Writing templates is not always necessary, because SyQADA comes bundled with common next-generation sequencing (NGS) analysis pipelines, including those for sequence alignment, coverage profiling, variant calling, mutation detection, copy number profiling, and variant annotation/reporting.
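The idea of a script template whose parameters are filled in per invocation can be illustrated with a minimal sketch. It uses Python's standard `string.Template` as a stand-in: the `${name}` placeholder syntax, the parameter names (`sample`, `input_dir`, `output_dir`, `reference`), and the `my_aligner` command are all assumptions made for illustration and are not SyQADA's actual template syntax.

```python
from string import Template

# Hypothetical bash script template. The "${name}" placeholders stand in
# for whatever parameter syntax the workflow system really uses, and
# "my_aligner" is a made-up command.
SCRIPT_TEMPLATE = Template(
    "#!/bin/bash\n"
    "my_aligner --ref ${reference} ${input_dir}/${sample}.fastq "
    "> ${output_dir}/${sample}.sam\n"
)


def render_script(sample, input_dir, output_dir, reference):
    """Fill in one invocation's values and return the finished script text."""
    # substitute() raises KeyError if a placeholder is left without a value,
    # so an incomplete parameter set fails early instead of producing a
    # broken script.
    return SCRIPT_TEMPLATE.substitute(
        sample=sample,
        input_dir=input_dir,
        output_dir=output_dir,
        reference=reference,
    )


if __name__ == "__main__":
    print(render_script("tumor_exome", "input", "output", "reference.fa"))
```

Rendering the same template once per sample yields one concrete script per sample, which is the property that makes a templated workflow repeatable.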
Download/installation
Download syqada-0.1.2.tar.gz, then extract and build it with the following three commands:
> tar -xvzf syqada-0.1.2.tar.gz
> cd syqada-0.1.2
> bin/package queue skip-tests
Quick start example
The example directory of the SyQADA download contains an example workflow, haplohseq.protocol, that you can process. To run the complete example, execute example_run.sh from the example directory of the extracted archive.
> cd syqada/example/haplohseq
> ./example_run.sh
This script executes the three main steps of an NGS allelic imbalance (AI) analysis workflow:
1. Phases the heterozygous sites in tumor_exome.vcf using a utility phasing script (simple_phaser.py) provided with the haplohseq bundle. You can instead use MACH, fastPHASE, BEAGLE, or your phasing software of choice; however, its two output files must be formatted as they are for MACH. Examples of such files are generated by our simple_phaser.py script and can be found in the example_output directory after step 1 is executed. See tumor_example.hap and tumor_exome.pos.
2. Runs haplohseq to assign AI probabilities across the genome for the test sample. The detailed report that includes this information (tumor_exome_haplohseq.posterior.dat) can be found in the example_output directory.
3. Calls R to generate a plot for visualizing the haplohseq AI probabilities.
The files used to configure the workflow can be found in the example/haplohseq/control directory.
HAPLOHSEQ.protocol: the file that specifies the steps in the workflow.
HAPLOHSEQ.config: specifies the dependencies and parameters of the workflow.
HAPLOHSEQ.samples: the names of samples to process through the workflow.
These files are sufficient to use an existing NGS workflow. See below for a brief description of each file. To generate detailed documentation for the framework, run:
> syqada manual
To create a new workflow, however, you need to write your own task and template definitions for its steps. Examples for this workflow are provided in the example/haplohseq/tasks and example/haplohseq/templates directories.
syqada parameters
> syqada --help
File descriptions and formats
input files
protocol file
Lists sequential steps to be executed by the workflow.
config file
Specifies software and data dependencies for the workflows. These variables can be referenced in protocol, task and template files so that workflows can easily be ported to other platforms by simply modifying config files.
samples file
Lists names of samples to process through workflows. These names are used as prefixes for intermediate and output files of the workflow.
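As a rough illustration of how sample names can act as prefixes for output files, the sketch below reads a list of names and derives file names from them. The one-name-per-line layout, the `#` comment convention, the `sample.step.extension` naming pattern, and the `normal_exome` sample are assumptions for the example, not SyQADA's documented format.

```python
def read_sample_names(text):
    """Return sample names from text, assuming one name per line.

    Blank lines and lines starting with '#' are skipped (an assumed
    convention, not a documented one).
    """
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def output_name(sample, step, extension):
    """Build a file name that starts with the sample name."""
    return "{0}.{1}.{2}".format(sample, step, extension)


if __name__ == "__main__":
    # Hypothetical samples file content.
    samples = read_sample_names("tumor_exome\nnormal_exome\n")
    for sample in samples:
        print(output_name(sample, "aligned", "bam"))
```

Because every intermediate and output file starts with the sample name, the files belonging to one sample can be found with a simple prefix match.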
tasks and templates
These are definition files for steps in the workflow. Tasks define resources needed for a step in the workflow to be executed on an LSF or PBS cluster (or on a local server or desktop). Tasks also allow the user to split jobs based on chromosomes. Template files define the actual step to be executed.
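Splitting a job by chromosome amounts to expanding one step into one independent job per chromosome, which a cluster scheduler can then run in parallel. The sketch below shows that expansion only. The `chr` naming, the job-name pattern, and the `my_caller` command are hypothetical, and it does not reproduce how SyQADA task files express the split or submit jobs to LSF or PBS.

```python
# Human chromosome labels. The "chr" prefix is an assumption; it depends on
# the naming used by the reference genome.
CHROMOSOMES = ["chr{0}".format(i) for i in range(1, 23)] + ["chrX", "chrY"]


def split_by_chromosome(sample, command_template, chromosomes=CHROMOSOMES):
    """Return one (job_name, command) pair per chromosome.

    command_template is a str.format pattern with {sample} and {chrom}
    fields.
    """
    jobs = []
    for chrom in chromosomes:
        job_name = "{0}.{1}".format(sample, chrom)
        command = command_template.format(sample=sample, chrom=chrom)
        jobs.append((job_name, command))
    return jobs


if __name__ == "__main__":
    # "my_caller" is a made-up command used only to show the expansion.
    template = "my_caller --region {chrom} {sample}.bam > {sample}.{chrom}.vcf"
    for name, command in split_by_chromosome("tumor_exome", template):
        print(name, command)
```

Each pair could be handed to a scheduler as a separate job. The per-chromosome outputs would then need to be merged in a later step.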
output reports
log files
Log files are generated for every step in the workflow, including logs for console output, errors, and job completion status.
FAQ