created by Geraldine_VdAuwera
on 2018-02-28
Ah, the eternal question of how to deal with readgroups in a world where the data model does not acknowledge their existence. In a not-so-distant future where rainbows and unicorns roam the clouds, we'll have a more flexible data model where you can explicitly put read groups one level under the samples entity table.
In the meantime, here's how we recommend dealing with readgroup-level files:
In your samples table, declare individual samples as the actual samples you expect to have once your readgroup data has been merged.
For each sample, provide a "file of file names" (which we commonly call a FoFN) containing a list of paths to the readgroup files (typically FASTQs or uBAMs) in your bucket. We typically use the `gsutil` command-line utility (admittedly outside of FC) to generate the FoFN of readgroup files.
For example, if your readgroup file paths all contain the sample name in their filename, you could run a command like this to get a list of all file paths containing a particular sample name within a shared folder:

```
gsutil ls gs://bucket/path/to/readgroup_bams_folder/*sampleName* > sampleName.RG_bams.list
```
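The resulting FoFN is simply a plain-text file with one readgroup file path per line; for example (bucket and file names here are made up for illustration):

```
gs://my-bucket/readgroup_bams/sampleName.lane1.unmapped.bam
gs://my-bucket/readgroup_bams/sampleName.lane2.unmapped.bam
gs://my-bucket/readgroup_bams/sampleName.lane3.unmapped.bam
```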
Adapt the WDL you want to run to take in a FoFN input, then use the `read_lines()` function to convert the contents of the FoFN into an array of readgroup file paths. In FireCloud, edit your Method Configuration to run on `sample` as the root entity.
The additions to your WDL would look something like this:
```
File file_of_filenames
Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)
```
Within that WDL, you can then run a scatter across the readgroup files.
The scatter block would look something like this:
```
# run on the readgroup files in parallel
scatter (unmapped_bam in flowcell_unmapped_bams) {
  # stick your per-readgroup calls here
  call something_that_maps_bams {
    input:
      input_bam = unmapped_bam
  }
}
```
Optionally, you can run something that merges readgroup files per-sample. The output of the scattered tasks will be arrays of whatever the task produces, so you can easily feed that to a merge operation that takes an array.
So your call would look something like this:
```
# output from something_that_maps_bams is automatically gathered into an array
# when the call's output is referenced from outside of the scatter block.
Array[File] mapped_bams = something_that_maps_bams.output_mapped_bam

# merge processed readgroup files
call something_that_merges_bams {
  input:
    input_bams = mapped_bams
}
```
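Putting the pieces together, a minimal workflow skeleton might look like the following. This is a sketch, not a drop-in pipeline: the task names and output names are placeholders, and the task definitions themselves are omitted.

```
workflow per_sample_readgroup_processing {
  # FoFN listing one readgroup file path per line
  File file_of_filenames
  Array[File] flowcell_unmapped_bams = read_lines(file_of_filenames)

  # map each readgroup file in parallel
  scatter (unmapped_bam in flowcell_unmapped_bams) {
    call something_that_maps_bams {
      input:
        input_bam = unmapped_bam
    }
  }

  # the scattered outputs are gathered into an array and merged
  # into a single per-sample file
  call something_that_merges_bams {
    input:
      input_bams = something_that_maps_bams.output_mapped_bam
  }
}
```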
In FireCloud, link the final output of the WDL as a sample attribute, e.g. call it `this.output` in the method configuration. If the output of your WDL is a sample-level aggregate (e.g. a per-sample BAM), you should be all set to proceed from there. If the output is not yet aggregated at the sample level (e.g. it's an intermediate per-readgroup file), you can glob it and run whatever step is next on the glob, or something to that effect. We can advise you on the particulars if needed; please ask questions in the comment thread.
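If you do end up with per-readgroup intermediates, one way to collect them is WDL's `glob()` function in a task's output block. A hedged sketch, with made-up task and file names:

```
task collect_per_readgroup_outputs {
  Array[File] per_readgroup_files
  command {
    mkdir outputs
    cp ${sep=' ' per_readgroup_files} outputs/
  }
  output {
    # gather everything the command wrote into an array
    Array[File] collected = glob("outputs/*")
  }
}
```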
You can find an example of what this basic setup looks like in this workspace: https://portal.firecloud.org/#workspaces/help-gatk/five-dollar-genome-analysis-pipeline
Updated on 2018-03-05