IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st, 2019. For the latest documentation and forum, click here.
created by GATK_Team on 2018-07-26
The gatk-workflows git organization houses a set of repositories containing workflows contributed by the Broad Institute, along with optimized versions of these workflows contributed by Intel that take advantage of recent technologies such as FPGAs to improve speed and performance. The available workflows cover several types of genomic analysis based on GATK’s Best Practices, such as Data Preprocessing for Variant Discovery and Somatic Sequence Analysis using Mutect, as well as simpler workflows used for sequence format conversion.
Each provided workflow has an accompanying JSON file containing the references, resources, default parameters, and input bam files used to test the workflow on the user's platform. The document below guides users through executing an example workflow on the Google Cloud Platform as well as running the workflow locally.
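An inputs JSON simply maps fully qualified workflow input names to values. The sketch below shows the general shape; the key names are illustrative and may not match the exact contents of validate-bam.inputs.json:
```
{
  "ValidateBamsWf.bam_array": [
    "gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam"
  ],
  "ValidateBamsWf.validation_mode": "SUMMARY"
}
```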
Please note that Broad is moving towards a cloud-centric computing environment, thus the provided workflows are designed and intended to work on the cloud. Some of these workflows may need to be modified by the user before executing on a local environment.
Key Google Cloud Buckets
- Broad References
- Broad Public Datasets
- GATK Test Data
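These buckets are publicly readable; if you have the Google Cloud SDK installed, you can also browse them from the command line with gsutil, for example:
```
gsutil ls gs://gatk-test-data/
```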
Running Workflows Using Google Cloud Platform
General Prerequisites:
- A Google Cloud project that will pay for the run, and a Google Cloud Storage bucket for workflow outputs (both are referenced in the steps below).
Tool Prerequisites:
- Java (to run the Cromwell jar), Git, wget, and the Google Cloud SDK (gcloud and gsutil); all of these are used in the steps below.
Instructions:
- Set up your working directory.
- Make a directory to test workflows, then change into that directory:
```
mkdir gatk-workflows
cd gatk-workflows
```
- Download the latest release of Cromwell, the Java executable that will run the WDL:
```
wget https://github.com/broadinstitute/cromwell/releases/download/33.1/cromwell-33.1.jar
```
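- As a quick sanity check that the jar downloaded correctly and Java is available, you can ask Cromwell for its version (assumes Java 8 is on your PATH):
```
java -jar cromwell-33.1.jar --version
```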
- Clone the repository you would like to execute. In this example we will be executing the validate-bam workflow from the seq-format-validation repository:
```
git clone https://github.com/gatk-workflows/seq-format-validation.git
```
- Once you’ve successfully cloned the repository, a seq-format-validation directory will appear in your gatk-workflows working directory. The seq-format-validation directory contains multiple files, but we are only concerned with the WDL and its JSON. We’ll be running the validate-bam.wdl workflow using its accompanying JSON file, validate-bam.inputs.json. The JSON contains the required and optional parameters needed to run the workflow, including the path to a test input file located in a Google Cloud bucket.
- We have our WDL and our JSON file, but we need one more file to run on Google Cloud: a configuration file that tells Cromwell to execute our workflow on the cloud. You can create your own configuration using the instructions found in the Cromwell documentation. In this example we'll name our conf file google-adc.conf and copy the contents below into it.
- Create and edit the conf file:
```
vim google-adc.conf
```
- Copy the contents below into the file:
```
include required(classpath("application"))

google {
  application-name = "cromwell"
  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
  }
}

backend {
  default = "JES"
  providers {
    JES {
      actor-factory = "cromwell.backend.impl.jes.JesBackendLifecycleActorFactory"
      config {
        // Google project
        project = "<google-project-id>"

        compute-service-account = "default"

        // Base bucket for workflow executions
        root = "<google-bucket>/cromwell-execution"

        // Polling for completion backs-off gradually for slower-running jobs.
        // This is the maximum polling interval (in seconds):
        maximum-polling-interval = 600

        // Optional Dockerhub Credentials. Can be used to access private docker images.
        dockerhub {
          // account = ""
          // token = ""
        }

        genomics {
          // A reference to an auth defined in the `google` stanza at the top. This auth is used to create
          // Pipelines and manipulate auth JSONs.
          auth = "application-default"
          // Endpoint for APIs, no reason to change this unless directed by Google.
          endpoint-url = "https://genomics.googleapis.com/"
        }

        filesystems {
          gcs {
            // A reference to a potentially different auth for manipulating files via engine functions.
            auth = "application-default"
          }
        }
      }
    }
  }
}

system {
  input-read-limits {
    lines = 1280000
    bool = 7
    int = 19
    float = 50
    string = 1280000
    json = 1280000
    tsv = 1280000
    map = 1280000
    object = 1280000
  }
}
```
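- The configuration above authenticates via Google Application Default Credentials. If you have not set these up yet, one way to do so with the Google Cloud SDK is:
```
gcloud auth application-default login
```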
- At this point your directory structure should look like this:
```
|-gatk-workflows/
  |-cromwell-33.1.jar
  |-google-adc.conf
  |-seq-format-validation/
    |-LICENSE
    |-README.md
    |-Generic.google-papi.options.json
    |-validate-bam.inputs.json
    |-validate-bam.wdl
```
- Before you execute the workflow you'll need two pieces of information: 1) the project that will pay for the run, and 2) where to store your output files.
- The currently set project name can be determined by entering gcloud info in your terminal. The project name will be listed under "Current Properties":
```
Current Properties:
  [core]
    project: [your-project-name]
    account: [your-account@gmail.com]
    disable_usage_reporting: [True]
  [compute]
    region: [us-central1]
    zone: [us-central1-a]
```
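- Alternatively, you can print just the project id:
```
gcloud config get-value project
```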
- The location of your bucket is completely up to you. It can be one you create or one that was designated to you by the project owner. An example would be gs://my-bucket/
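- If you do not have a bucket yet and have permission to create one in the project, a minimal gsutil example is:
```
gsutil mb gs://my-bucket/
```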
- It's time to execute the workflow:
```
java -Dconfig.file=google-adc.conf \
  -Dbackend.providers.JES.config.project=<your-project-name> \
  -Dbackend.providers.JES.config.root=gs://<my-bucket>/ \
  -jar cromwell-33.1.jar \
  run ./seq-format-validation/validate-bam.wdl \
  --inputs ./seq-format-validation/validate-bam.inputs.json
```
- While the workflow is running, Cromwell will print logs to your screen (a lot of them). Once it completes, it will print a message indicating the run was successful, along with the Google bucket location of the output files generated by your workflow.
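- The exact log lines vary by Cromwell version, but the tail of a successful run looks roughly like the sketch below; the output name and bucket path are illustrative, not literal output of validate-bam.wdl:
```
[info] SingleWorkflowRunnerActor workflow finished with status 'Succeeded'.
{
  "outputs": {
    "ValidateBamsWf.validation_reports": ["gs://my-bucket/cromwell-execution/ValidateBamsWf/<workflow-id>/..."]
  },
  "id": "<workflow-id>"
}
```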
- You can copy the output file to a local directory using gsutil cp. For example:
```
gsutil cp gs://my-bucket/path/to/output /path/to/local/directory/
```
Running Workflows Locally
Tool Prerequisites:
- Docker
- Git
Instructions:
- Set up your working directory.
- Make a working directory to test workflows, then change into that directory:
```
mkdir gatk-workflows
cd gatk-workflows
```
- Make a directory to store input files:
```
mkdir inputs
```
- Download the latest release of Cromwell, the Java executable that will run the WDL:
```
wget https://github.com/broadinstitute/cromwell/releases/download/33.1/cromwell-33.1.jar
```
- Clone the repository you would like to execute. In this example we will be executing the validate-bam workflow from the seq-format-validation repository:
```
git clone https://github.com/gatk-workflows/seq-format-validation.git
```
- Once you’ve successfully cloned the repository, a seq-format-validation directory will be in your gatk-workflows working directory. The seq-format-validation directory contains multiple files, but we are only concerned with the WDL and its JSON. We’ll be running the validate-bam.wdl workflow using its accompanying JSON file, validate-bam.inputs.json. The JSON contains the required and optional parameters needed to run the workflow, including the paths to the input files, which are located in a Google Cloud bucket. Since we're running this locally, we’ll first need to download any files mentioned in the JSON. In this case we only need to download the input files, but the same instructions can be used for reference and resource files.
- Special note: because this is a local demo and the medium bam file is 18 GB, we’ll only download and work with the small bam file. The input files listed in the JSON are in the following Google bucket locations:
```
gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam
gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_med.hg38.bam
```
- The base Google bucket name is gs://gatk-test-data; a web link to this bucket is provided in this document under the subtitle Key Google Cloud Buckets. Use this web link to open the Google bucket of interest in your web browser, then use the file path provided in the JSON (e.g. /wgs_bam/NA12878_24RG_hg38/NA12878_24RG_med.hg38.bam) to locate the files in the bucket. The following files can be downloaded by clicking on the file names, or fetched with gsutil as shown below.
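- If you have the Google Cloud SDK installed, you can pull the small bam straight into the inputs directory from the command line instead of using the browser:
```
gsutil cp gs://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam inputs/
```
- After downloading, the input entry in validate-bam.inputs.json needs to point at the local copy rather than the gs:// path, e.g. "inputs/NA12878_24RG_small.hg38.bam" (the exact key name depends on the workflow's inputs).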