created by Geraldine_VdAuwera
on 2018-02-28
Ah, the eternal question of how to deal with readgroups in a world where the data model does not acknowledge their existence. In a not-so-distant future where rainbows and unicorns roam the clouds, we'll have a more flexible data model where you can explicitly put read groups one level under the samples entity table.
In the meantime, here's how we recommend dealing with readgroup-level files:
In your samples table, declare individual samples as the actual samples you expect to have once your readgroup data has been merged.
For each sample, provide a "file of file names" (which we commonly call a FoFN) containing a list of paths to the readgroup files (typically FASTQs or uBAMs) in your bucket. We typically use the `gsutil` command-line utility (admittedly outside of FC) to generate the FoFN of readgroup files.
For example, if your readgroup file paths all contain the sample name in their filename, you could run a command like this to get a list of all file paths containing a particular sample name within a shared folder:

```
gsutil ls gs://bucket/path/to/readgroup_bams_folder/*sampleName* > sampleName.RG_bams.list
```
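The resulting FoFN is simply a plain-text file with one readgroup file path per line; for example (bucket and file names here are made up for illustration):

```
gs://my-bucket/readgroup_bams/sampleName.lane1.unmapped.bam
gs://my-bucket/readgroup_bams/sampleName.lane2.unmapped.bam
gs://my-bucket/readgroup_bams/sampleName.lane3.unmapped.bam
```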
Adapt the WDL you want to run to take in a FoFN input, then use the `read_lines()` function to convert the contents of the FoFN into an array of readgroup file paths. In FireCloud, edit your Method Configuration to run on `sample` as the root entity.
The additions to your WDL would look something like this:
```
File file_of_filenames
Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)
```
Within that WDL, you can then run a scatter across the readgroup files.
The scatter block would look something like this:
```
# run on the readgroup files in parallel
scatter (unmapped_bam in flowcell_unmapped_bams) {
  # stick your per-readgroup calls here
  call something_that_maps_bams {
    input:
      input_bam = unmapped_bam
  }
}
```
Optionally, you can run something that merges readgroup files per-sample. The output of the scattered tasks will be arrays of whatever the task produces, so you can easily feed that to a merge operation that takes an array.
So your call would look something like this:
```
# output from something_that_maps_bams is automatically gathered into an array
# when the call's output is referenced from outside of the scatter block.
Array[File] mapped_bams = something_that_maps_bams.output_mapped_bam

# merge processed readgroup files
call something_that_merges_bams {
  input:
    input_bams = mapped_bams
}
```
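Putting the pieces together, a minimal workflow skeleton might look like the following. This is a sketch, not a drop-in pipeline: the task names and output names are placeholders, and the task definitions themselves are omitted.

```
workflow per_sample_readgroup_processing {
  # FoFN listing one readgroup file path per line
  File file_of_filenames
  Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)

  # map each readgroup file in parallel
  scatter (unmapped_bam in flowcell_unmapped_bams) {
    call something_that_maps_bams {
      input:
        input_bam = unmapped_bam
    }
  }

  # the scattered outputs are gathered into an array and merged
  # into a single per-sample file
  call something_that_merges_bams {
    input:
      input_bams = something_that_maps_bams.output_mapped_bam
  }
}
```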
In FireCloud, link the final output of the WDL as a sample attribute, e.g. call it `this.output` in the method configuration. If the output of your WDL is a sample-level aggregate (e.g. a per-sample BAM), you should be all set to proceed from there. If the output is not yet aggregated at the sample level (e.g. it's an intermediate per-readgroup file), you can glob it and run whatever step is next on the glob, or something to that effect. We can advise you on the particulars if needed; please ask questions in the comment thread.
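If you do end up with per-readgroup intermediates, one way to collect them is WDL's `glob()` function in a task's output block. A hedged sketch, with made-up task and file names:

```
task collect_per_readgroup_outputs {
  Array[File] per_readgroup_files
  command {
    mkdir outputs
    cp ${sep=' ' per_readgroup_files} outputs/
  }
  output {
    # gather everything the command wrote into an array
    Array[File] collected = glob("outputs/*")
  }
}
```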
You can find an example of what this basic setup looks like in this workspace: https://portal.firecloud.org/#workspaces/help-gatk/five-dollar-genome-analysis-pipeline
Updated on 2018-03-05