What is uBAM and why is it better than FASTQ for storing unmapped sequence data?

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by Geraldine_VdAuwera

on 2015-08-13

Most sequencing providers generate FASTQ files containing the raw unmapped read sequences, so FASTQ is the most common input format for the mapping step of the pre-processing pipeline. This is not ideal: among other flaws, FASTQ files cannot store much of the metadata associated with a sequencing run, whereas BAM files can. See [this blog post](http://blastedbio.blogspot.co.uk/2011/10/fastq-must-die-long-live-sambam.html) for an overview of the many problems associated with the FASTQ format.

At the Broad Institute, we generate unmapped BAM (uBAM) files directly from the Illumina basecalls in order to keep all metadata in one place, and we do not write the data to FASTQ files at any point. This involves a slightly more complex workflow than is shown in the general Best Practices diagram. See [this presentation](https://www.broadinstitute.org/gatk/events/slides/1506/GATKwr8-A-3-GATK_Best_Practices_and_Broad_pipelines.pdf) for more details of how this works.
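To make "all metadata in one place" concrete, here is a minimal Python sketch of what an unmapped read looks like in SAM text form (BAM is the binary encoding of the same records). This is not the Broad production pipeline. The helper names and the sample, library, and flowcell values are hypothetical; only the `@RG` tags (`ID`, `SM`, `LB`, `PL`, `PU`), the column layout, and the flag bits come from the SAM specification.

```python
# Illustrative sketch: all identifiers and values below are made up.

def read_group_header(rg_id, sample, library, platform, platform_unit):
    """Build a SAM @RG header line carrying run-level metadata."""
    fields = [
        ("ID", rg_id),          # read group identifier
        ("SM", sample),         # sample name
        ("LB", library),        # library name
        ("PL", platform),       # sequencing platform
        ("PU", platform_unit),  # platform unit, e.g. flowcell and lane
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


def unmapped_record(name, seq, qual, rg_id, first_of_pair):
    """Build one SAM line for an unmapped read from a paired-end run."""
    # 0x1 paired, 0x4 read unmapped, 0x8 mate unmapped,
    # 0x40 first in pair or 0x80 second in pair
    flag = 0x1 | 0x4 | 0x8 | (0x40 if first_of_pair else 0x80)
    # Mapping columns are placeholders because nothing is aligned yet.
    columns = [name, str(flag), "*", "0", "0", "*", "*", "0", "0",
               seq, qual, f"RG:Z:{rg_id}"]
    return "\t".join(columns)


header = read_group_header("FC1.2", "sampleA", "libA", "ILLUMINA", "FC1.2.ACGT")
read1 = unmapped_record("FC1:2:1101:1000:2000", "ACGT", "IIII", "FC1.2", True)
read2 = unmapped_record("FC1:2:1101:1000:2000", "TTGA", "IIII", "FC1.2", False)
print(header)
print(read1)
print(read2)
```

Each read points at its read group through the `RG:Z:` tag, so the sample, library, and run information travels in the same file as the sequences.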

In case you’re wondering, we still show the FASTQ-based workflow as the default in most of our documentation because it is by far the most commonly used workflow, and we want to keep the documentation accessible for our more novice users.

Updated on 2016-12-22

From Brian_Bushnell on 2016-12-22

Unmapped bam files are larger than gzipped fastq files. They also contain less information. Specifically, anything after the first whitespace in a read name is truncated, so any program expecting the original Illumina names will have trouble and will probably treat paired data as single-ended. The read names are mutilated, as required by the sam format, to force read 1 and read 2 to have identical names even though they originally had different names.

Gzipped fastq files compress faster and smaller than your so-called ubam files. They decompress faster. And by faster… I mean, it’s like twice as fast. Why are you recommending a lossy compression format over a lossless compression format that is twice as fast and smaller?
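For reference, in CASAVA 1.8 and later Illumina headers, the part of the read name after the first space holds the read number, the filter flag, the control number, and the index sequence. The Python sketch below splits such a header and shows where each piece could be carried in a SAM record: the read number and filter status as flag bits, and the index in the `BC` tag. It assumes paired-end data and illustrates the SAM fields only; it does not describe what any particular converter does. The function name and the example headers are made up.

```python
def illumina_header_to_sam_fields(header):
    """Map a CASAVA 1.8+ style FASTQ header onto a SAM name, flag, and tags."""
    text = header[1:] if header.startswith("@") else header
    # Everything before the first space becomes the SAM read name (QNAME).
    name, _, comment = text.partition(" ")
    # Comment layout: <read number>:<is filtered Y/N>:<control number>:<index>
    read_no, is_filtered, control, index = comment.split(":", 3)
    flag = 0x1 | 0x4 | 0x8                    # paired, read and mate unmapped
    flag |= 0x40 if read_no == "1" else 0x80  # first or second in pair
    if is_filtered == "Y":
        flag |= 0x200                         # read failed the quality filter
    # The control number has no dedicated SAM field in this sketch.
    tags = [f"BC:Z:{index}"] if index else []
    return name, flag, tags


r1 = illumina_header_to_sam_fields("@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG")
r2 = illumina_header_to_sam_fields("@SIM:1:FCX:1:15:6329:1045 2:Y:0:ATCACG")
print(r1)
print(r2)
```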

From Geraldine_VdAuwera on 2016-12-22

What can I say, we like to have our metadata attached to the reads from as early on as possible. It helps keep things under control when you’re processing a whole genome’s worth of data every ten minutes.

From sklages on 2017-03-22

I’d go for BAM as well, especially when it comes to metadata. File size is not really an issue. And if it takes longer to decompress BAM, who cares? I think samtools now supports multithreaded decompression. On one point I agree: the Illumina FASTQ headers should at least be stored completely in one way or another. That way we could reconstruct the original FASTQ files if needed. PacBio’s Sequel stores read data in BAM as well.

We could invent yet another format, but this would probably be counterproductive.

From myourshaw on 2018-08-22

The recently published [Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines](https://jmd.amjpathol.org/article/S1525-1578(17)30373-2/fulltext), which applies to clinical laboratories, requires that laboratory, run, and patient identifiers “must be present within the file’s metadata, and … recommends that the identifiers are also present in the file name itself”. We find that uBAMs are well suited to meet this requirement; we are not so sure how one could store internal metadata in a FASTQ.

From mglclinical on 2018-09-28

@myourshaw, I am also reading this 2018 paper regarding the preservation of sample identity in file metadata, and it seems uBAMs are better suited than FASTQ files to meet Recommendation #10 in the guidelines.

From mglclinical on 2018-09-28

@Geraldine_VdAuwera, the original link on generating unmapped BAM (uBAM) files directly from the Illumina basecalls is dead. This link (https://software.broadinstitute.org/gatk/events/slides/1506/GATKwr8-A-3-GATK_Best_Practices_and_Broad_pipelines.pdf) seems to have moved.

Could you please point me to the correct location for this presentation?

From jaideepjoshi on 2019-02-27

@Geraldine_VdAuwera: May I please know how you created the unmapped.bam files from bcl (or fastq) for the five-dollar-analysis-pipeline? We are really trying to follow your best practices but cannot find details for this step. None of the links given in various places for this step seem to work. Thanks