079. Can a biochemist with one Python course under her belt start running analyses in the cloud

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by shlee

on 2016-11-30

Find out and learn some practical steps to cloud debugging.

Specifically, I tested the alpha release of Google Genomics Pipelines API that uses the command-line. Down the road, we will post similarly for the UI-driven systems FireCloud and Workbench. In this particular challenge, my aim is to first genotype a trio and then a cohort of 17 whole genome BAMs that are available in the cloud. I need the resulting VCF callsets within a week.

I approach this like an experiment.

As a biochemistry PhD, I approach this challenge much like a complicated protocol whose chemistry I do not fully understand, nor need to understand, but that I need to get working for its end results.

I start with the following pieces of information.

GATK commands I must meld together into WDL workflow(s), an idea on how I should parallelize the variant calling step and novice WDL scripting skills. My laptop is set up to run GATK commands and WDL scripts locally ¹.
A publicly available Docker image containing all the tools I need ².
A draft version of the now published GATK-Google Genomics tutorial that tells you what to install and that uses the gcloud alpha genomics pipelines run command to run our PairedEndSingleSampleWf WDL script on publicly available example data and using the same Docker container ³.
A Google Cloud bucket associated with a cloud billing project and the assurance that my runs will be drops in the ocean of our development budget ^4,5.
And not trivially, the location of 17 public-arena indexed BAMs in the cloud that are each over 100GB in size ⁶.

Given the time constraint, I parallelize in two ways.

One of the attractions of the cloud is parallelizable compute and in WDL workflows this is embodied by the scatter function. Given my time constraint, parallelism is a feature I must use. I can parallelize at two levels. First, I can process each sample concurrently for the first part of the pipeline. In the cloud, if it takes an hour to process one sample, then it takes an hour to process a hundred. Second, for some steps I can scatter over genomic territories, where the smallest parallelizable unit is a region bookmarked by gaps, i.e. Ns.

I learn that nested scatters (a.k.a. scatter of scatters) is yet to be supported. Because of this, I break my workflow into two WDL scripts. The first script takes a single sample BAM as input and generates a GVCF in HaplotypeCaller’s ERC mode. The second script takes all the GVCFs and generates a cohort VCF callset. In terms of scattering over genomic territories, I decide to stick with a 17-way scatter over the human genome. I will spend the chunk of time this genome scattering entails judiciously by ensuring my scripts work.

My script testing is iterative.

I manage my script development by incrementally testing locally and in the cloud using a small dataset and then a large unit of data, e.g. a full BAM (see Table). The scripts run my small datasets on the cloud in minutes so the turnaround between job submission and feedback is quick. I find myself resorting to list of lists (a.k.a. array of arrays) to group the scatter intervals and also to group the different sample GVCFs together per genomic territory. The Figure below summarizes my process. For where you can ask for help for different sources of error, see 7.

The tutorial’s example command (shown below) and its .yaml file are what I start with to submit a pipeline job to Google Cloud. This sets in motion a number of processes, described here, including the WDL Runner Python script that then launches Cromwell.

gcloud alpha genomics pipelines run \ --pipeline-file $my_dir/wdl_pipeline_GotC.yaml \ --zones us-central1-f \ --logging gs://shlee-dev/quicktest/logs \ --inputs-from-file WDL=$my_dir/GenotypeScatteredGVCFs_cloud_quicktest.wdl \ --inputs-from-file WORKFLOW_INPUTS=$my_dir/GenotypeScatteredGVCFs_cloud_quicktest.json \ --inputs-from-file WORKFLOW_OPTIONS=$my_dir/empty.json \ --inputs WORKSPACE=gs://shlee-dev/quicktest \ --inputs OUTPUTS=gs://shlee-dev/quicktest/callsets/

Once I submit a pipeline job, I get a pipelines operations ID:

Running: [operations/EKLyxLLwKhi7mbayutDpvl4gw7vetLsXKg9wcm9kdWN0aW9uUXVldWU]

I can poll the status of the pipeline job using the operations ID in the command:

gcloud alpha genomics operations describe EKLyxLLwKhi7mbayutDpvl4gw7vetLsXKg9wcm9kdWN0aW9uUXVldWU

And get a status message, e.g. running or failed, as well as some additional information. Alternatively, I can add --format='yaml(done, error, metadata.events, name)' to the command for a summary of the status. If a job is successful, then I look for the outputs in the directory I had previously specified with --inputs OUTPUTS=. If a job has failed, the status message may or may not include details helpful towards debugging. If it does not, then I turn to a web browser and navigate to my Google bucket ⁴.

Here are roughly the steps I take in looking for the source of error.

Per run, I get three operation log files: *.log, *-stderr.log and *-stdout.log, where * is the operation ID. These I find in the logging directory I had specified with --logging. I mouse-click on each log file from the bucket UI to view each of them in a new tab within the browser. I look for the first instance of ERROR or Failed lines within the logs and then examine these and the surrounding lines for informative messages.
If the error source is ambiguous, then I look for the Cromwell job ID. This should be listed in the meta status message and/or in an operation log if the job got to the point of launching Cromwell. If it is not, then the job likely errored at the Pipelines API level.
1. Let’s examine an example directory structure from a log.
2. gs://shlee-dev/platinum/ScatterHaplotypeCaller/699f1f77-be1c-4392-b3b1-5d103a43b232/call-HaplotypeCaller/shard-0/NA12882_2_S1.chrM.vcf.gz.tbi
3. Here, ScatterHaplotypeCaller is my WDL workflow name, what precedes it is the gcloud workspace directory I had specified with --inputs WORKSPACE= and the long alphanumeric string that follows corresponds to the Cromwell job ID. The Cromwell job folder in turn contains directories for the launched WDL workflow tasks, e.g. call-HaplotypeCaller. Knowing the order of the tasks in my WDL workflow, I can identify the last task that ran. I drill down into the task folders to look for expected results files, and if I don’t see them, then I view the Cromwell log files, *.log, *-stdout.log and *-stderr.log. Again, I search for ERROR or Failed lines and their surrounding context. GATK tool errors are usually found in the stderr.log file.
If the error is at a scattered task, and all expected shards appear present, then well, I can repeat step (2) for each shard. Or, I can use the following bash for loop ⁸ that iterates through all the shards and files ending in stderr.log using globbing. The cat function in conjunction with gsutil ⁹ probes the cloud bucket file system. I grep for ERROR as an example because GATK tools use this convention. The script returns the name of each file then any lines with the word ERROR, if any, within the file.
1. for FILE in `gsutil ls gs://shlee-dev/platinum/ScatterHaplotypeCaller/699f1f77-be1c-4392-b3b1-5d103a43b232/*/*/*stderr.log` ; \ do echo "got $FILE" ; \ gsutil cat $FILE | grep 'ERROR' ; done ;

Here is an example output of the script.

``` got gs://shlee-dev/filtering_tutorial/ScatterHaplotypeCaller/699f1f77-be1c-4392-b3b1-5d103a43b232/call-HaplotypeCaller/shard-9/HaplotypeCaller-9-stderr.log

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.6-2-g69af359):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: The GATK cannot process compressed (.gz) reference sequences. Please unzip the file and try again. Sorry for the inconvenience.

ERROR ------------------------------------------------------------------------------------------

```

Knowing the nature of the error, I can modify my WDL workflow accordingly and go through another iteration. In the case of the example error, I make sure to provide uncompressed reference sequences.

Using the cloud makes me feel the meaning of elastic.

It feels refreshing that I can run as many samples and as many experiments as I can submit with my fingers on the computer. This contrasts with e.g. my experience with fractionating polyribosomes for their associated transcripts. I would run two staggered 100K x g ultracentrifuge runs in a day because the rotor could only hold ten samples at a time and I needed duplicates per variable because per run at least one tube would shatter. In cloud computing, that busted tube doesn’t set you back in terms of more physical toil. Its existence just gets buried in the billing ⁵.

Also, the fact that the bucket address alone enables me to use public data directly, without copying files or the need for additional authentication by Illumina, hits home how seamless sharing on the cloud is.

I come away with the aimed-for callsets.

Debugging was definitely a process and I assume its extent is proportional to my inexperience. Once my script ran successfully for one BAM, running the remaining samples took little time. Because I was developing and running two workflows in tag-team style (first sample-level variant calling with HaplotypeCaller, then cohort genotyping with GenotypeGVCFs), I was too preoccupied to investigate why two of the 17 public-arena samples gave errors for the first workflow. I came away with the aimed-for trio callset and a 15-sample cohort variant callset in time for our team to prepare for the Basel workshop’s hands-on tutorial.

back to top

Notes

[1] At the time of testing in September 2016, I had a run-of-the-mill MacBook Pro with the following specs: 16GB of memory, 250GB of storage, OS X v10.10.5, Java v1.8.0_40 and the latest stable releases of GATK (v3.6), Picard (v2.5.0), Cromwell (v0.19.3) and Samtools (v1.3.1) as of September 2016. My Google Cloud SDK was v124.0.0.

[2] The Docker container is broadinstitute/genomes-in-the-cloud:2.2.3-1469027018. A newer iteration has come out called broadinstitute/genomes-in-the-cloud:2.2.4-1469632282. See the Docker repo for version details. To see system and tool versions in a container, first install Docker. Then e.g. use docker pull broadinstitute/genomes-in-the-cloud:2.2.3-1469027018 to download the container and docker inspect broadinstitute/genomes-in-the-cloud:2.2.3-1469027018 to list its contents.

[3] As Geraldine outlines in the Going public with pipelines scripts blogpost, what I present is only one of three ways to run WDL script workflows in the cloud. Also, the Google Genomics Pipelines API tutorial that uses the wdlrunner is one of several Google Genomics tutorials. You can find the others in the parent Pipelines-api-examples_ repository.

[4] Here’s what my bucket path looks like: https://console.cloud.google.com/storage/browser/shlee-dev/?project=broad-dsde-dev. Notice the URL lists the associated billable project after the bucket name. This is because I set this association up at some point. Browse buckets associated with your Google email at https://console.cloud.google.com/storage/browser/.

[5] We will be posting estimated costs of running the PairedEndSingleSampleWf in the cloud, so be on the lookout. In the meantime here are links to Google’s pricing calculator and the underlying data used by the calculator. Also, this site explains the cost-savings of preemptible instances. In general, processes that take a long time to run are more likely to be preempted compared to shorter processes. If a certain process preempts regularly, consider (i) refactoring runtime parameters to reduce the probability of preemption, e.g. switch from a high to low demand zone, and (ii) whether you can shorten the time a process takes, e.g. remove a dependency that is causing a bottleneck that elongates processing time. An example of such a bottleneck is fetching information from an external server as described in this github issue and solved here.

[6] The Google Genomics Cookbook, hosted on Read the Docs, lists publicly available cloud data, e.g. 1000 Genomes Project data and the Illumina Platinum Genomes. Each data page describes the data and provides the location of the data, e.g. the Illumina Platinum Genomes data bucket. On a side note, figuring out how to examine the headers of these cloud bucket BAMs was not trivial. Thankfully, I have helpful colleagues, one of whom gave me gsutil cp gs://directories/NA12878.bam - | samtools view -H and another who gave me gsutil cat gs://directories/NA12878.bam | samtools view -H -.

[7] Here are three distinct forums to search and ask questions on Google Cloud use, WDL scripting and GATK tools, respectively:

Search the following Google Cloud sites:
If the WDL forum does not answer your question, post it to http://gatkforums.broadinstitute.org/wdl/post/question/wdl.
If the GATK forum does not answer your question, post it to http://gatkforums.broadinstitute.org/gatk/post/question/gatk-support-forum.

[8] Many thanks to esalinas from the FireCloud team for sharing this script.

[9] Google SDK v135.0.0 installation includes gsutils, gcloud and bq command-line tools. See here for gsutil documentation.

back to top

Updated on 2016-12-06

Report abuse