created by Geraldine_VdAuwera
on 2013-07-03
Objective
Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.
Prerequisites
Steps
Action
Shuffle the reads in the bam file so they are not in a biased order before alignment by running the following HTSlib command:
htscmd bamshuf -uOn 128 aln_reads.bam tmp > shuffled_reads.bam
Expected Result
This creates a new BAM file containing the original reads, which still retain their mapping information, but now they are no longer sorted.
The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked.
Action
Revert the BAM file to FastQ format by running the following HTSlib command:
htscmd bam2fq -a shuffled_reads.bam > interleaved_reads.fq
Expected Result
This creates an interleaved FastQ file called interleaved_reads.fq
containing the now-unmapped paired reads.
Interleaved simply means that for each pair of reads in your paired-end data set, both the forward and the reverse reads are in the same file, as opposed to having them in separate files.
Action
Compress the FastQ file to reduce its size using the gzip utility:
gzip interleaved_reads.fq
Expected Result
This creates a gzipped FastQ file called interleaved_reads.fq.gz
. This file is ready to be used as input for the Best Practices workflow.
BWA handles gzipped fastq files natively, so you don’t need to unzip the file to use it later on.
If you’re feeling adventurous, you can do all of the above with this beautiful one-liner, which will save you a heap of time that the program would otherwise spend performing I/O (loading in and writing out data to/from disk):
htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz
Updated on 2016-06-10
From caddymob on 2013-09-25
Hi Geraldine,
I found the bamshuf command and had a question about this and the overall philosophy of doing so… In the past I have query-sorted bams prior to make FASTQs from bam files (illumina-produced BAMs are notorious for failing SAM spec).
The idea was query sorting should prevent the pitfalls of storing information large rearrangements. So, is `htscmd bamshuf` doing sort-of the same thing or is there a reason to shuffle vs query sort?
Thanks as always!
Jason
From Geraldine_VdAuwera on 2013-09-26
Hi Jason,
I think it’s similar; the idea of using bamshuf is to randomize the order of reads to avoid biases in the mapping process that come from coordinate-sorted data.
From splaisan on 2013-10-09
Thanks for this page Geraldine, I never heard of this bias before and do not find it mentioned elsewhere. Is there a consensus reference text about this issue?
I know that the remaining part is not a direct GATK matter but hopefully it will help others adventurous.
After some level of trial and ERRORs, I found that BWA mem can handle interleaved fastQ provided one adds -p to the bwa mem command. By contrast, I did not find a way to feed BWA aln with interleaved fastQ data (is there a way?).
I then discovered that BWA aln can take bam as input, we can therefore feed it with the shuffled bam instead of going the long way of de-interleaving the FastQ?
~~~~~~~
bwa aln -t 8 ref.fa -b1 shuffled_reads.bam > 1.sai
bwa aln -t 8 ref.fa -b2 shuffled_reads.bam > 2.sai
Stephane
From Geraldine_VdAuwera on 2013-10-09
I’m not aware of any reference text on this, it is a part of the shared knowledge in our group and other collaborators.
Thanks for your comments on the bwa options and the script, I’m sure they will be useful for others.
From ecuenca on 2013-11-18
Hello,
could you please clarify if it is the same to feed BWA with the shuffled bam file or with the interleaved fast?
It is not needed to revert to a fast to lose previous alignment info? Can I skip that part and still have the same result?
Thanks a lot,
Ester
From Geraldine_VdAuwera on 2013-11-19
Hi Ester,
I believe that both approaches will give the same results. But if you want to be absolutely sure, check the BWA documentation — since it is not our software, we can’t give you any guarantees.
From adaywill on 2014-02-03
I have a question about best practices when reverting BAMs to fastq. If I’m going to completely re-analyze (mapping all the way to calling variants) the reverted fastq and there are multiple read-groups in the BAM, is it best to create a fastq for each read group?
From Geraldine_VdAuwera on 2014-02-03
Yes, you should separate the data by read group into individual FASTQ files. As I recall, the reverting process will discard read group information, so you’ll need to add it back in afterward before merging the data back together. If you don’t separate them out you’ll lose the ability to distinguish between read groups, which would be bad.
From adaywill on 2014-02-03
@Geraldine_VdAuwera Thanks. That is what I thought and what I was doing just wanted a sanity check.
From ecuenca on 2014-02-04
Hi Geraldine, to realign a processed bam file, I used to do:
1. RevertSam, 2.MArkIlluminaAdapters, 3.SamToFastq, 4.bwa aln, 5.bwa sampe, 6.MergeBamAlignment
Should I incorporate now as a first step the ‘Shuffled bam files’ step?
Can I continue with my pipeline steps instead of ‘Revert the BAM file to FastQ’ described here?
Thanks,
Ester
From Geraldine_VdAuwera on 2014-02-04
Hi @ecuenca,
The bam file shuffling is meant to eliminate biases (due to the bam being sorted) that affect the mapping step, so yes, I would recommend adding that step to your pipeline.
From Trakhtenberg on 2014-09-02
Is there a parameter which could instruct the command to output two fastQ files, read1 and read2 (for the pairend BAM), instead of having them merged? Also, so that they would still be paired and ready for tophat.
I also have found on forms versions of this command (below), and wonder if they may have some advantages:
htscmd bamshuf -Ou input.bam tmp-prefix | htscmd bam2fq -s se.fq.gz – | gzip > pe.fq.gz
htscmd bamshuf -uO in.bam tmp-prefix | htscmd bam2fq -as se.fq.gz – | bwa mem -p ref.fa -
From Geraldine_VdAuwera on 2014-09-04
As that is not our tool, we don’t provide usage recommendations beyond the specific use case shown here. For other possibilities, please see the tool’s documentation or ask your question in a more general forum such as SeqAnswers.
From Kurt on 2014-09-04
@Trakhtenberg,
If you would like an alternate way to convert bam to fastq, resort them based on query name and have a fastq file outputs for read 1 and read 2. You can check out RevertSam.jar and SamToFastq.jar from the Picard’s suite of tools. Because I am typically converting bam files that have recalibrated base call Q scores, I use RevertSam first and pipe it to SamToFastq. If your bam files don’t have that (i.e. if there is no OQ tag in your bam files) then you can just use SamToFastq.
http://picard.sourceforge.net/index.shtml
Below is the basic cmd line that I (almost) always use, to give you an idea; you can look at the other options to tailor them based on what you have and what you want in the end.
`java -jar $PICARD_DIR/RevertSam.jar \
INPUT=$IN_BAM \
OUTPUT=/dev/stdout \
SORT_ORDER=queryname \
COMPRESSION_LEVEL=0 \
VALIDATION_STRINGENCY=SILENT | java -jar $PICARD_DIR/SamToFastq.jar \
INPUT=/dev/stdin \
OUTPUT_PER_RG=true \
OUTPUT_DIR=$CORE_PATH/FASTQ/ \
VALIDATION_STRINGENCY=SILENT`
From caddymob on 2014-09-04
Try `bedtools bamtofastq` — https://code.google.com/p/bedtools/
Better yet, do it all on a pipe and send to bwa mem like you posted. Great example here in the bcbio code:
https://github.com/chapmanb/bcbio-nextgen/blob/01a6d99c7a8bb7a73ee35313c8af4c6b4d8c66fe/bcbio/ngsalign/bwa.py#L41-L43
From vasilis on 2015-01-27
Hello,
I’m trying to do the shuffle step but I cannot find the htscmd bamshuf command.
Could you help me with that or to recommend me another software that I can use for this step?
I know that its not in the GATK package but I will appreciate your help a lot.
From Geraldine_VdAuwera on 2015-01-29
@vasilis, the usage / command name has changed. See https://github.com/samtools/htslib/issues/32
From vasilis on 2015-02-02
Thank you very much, I found it!
From vasilis on 2015-02-02
Another thing that I noticed is that the -a option doesn’t work for bam2fq tool
From vasilis on 2015-02-06
I’m sorry for my naive question,
but when I am using the shuffle step the shuffled bam file is quite smaller and finally my fq files are smaller, too.
Is there any specific reason for that?
Thank you very much in advance.
From Sheila on 2015-02-06
@vasilis
Hi,
These processing steps discard information, so it is normal for the resulting files to be smaller since they contain less information.
-Sheila
From cpatterson on 2016-06-08
Could we get a corrected example of proper usage since the htscmd command has been changed? I’ve been looking for documentation in both the HTSlib and SamTools, and have come up empty. Any assistance would be greatly appreciated.
From shlee on 2016-06-10
Hi @cpatterson,
We have a new tutorial, [(How to) Generate an unmapped BAM from FASTQ or aligned BAM](http://gatkforums.broadinstitute.org/firecloud/discussion/6484/how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam#latest). Is this the type of documentation you are looking for?
From cpatterson on 2016-06-17
Yes. That is what I needed. Many thanks.
From mglclinical on 2017-08-19
Hi ,
I have read all comments, and I could not still figure out where the tools “htscmd” and/or “bamshuf” are located ? I did not find it in samtools, or bcftools or htslib.
Has the name of the tool “htscmd” been changed to something else ? If so, what is the new name ? Or did the usage has changed ?
Should I just use picard’s SamToFastq instead ?
I don’t think that this tutorial (https://gatkforums.broadinstitute.org/firecloud/discussion/6484/how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam#latest) will be helpful for me, because it doesn’t talk about creating fastq files from BAM files.
From mglclinical on 2017-08-21
Hi,
Based on the README.md(https://sourceforge.net/projects/samtools/files/samtools/1.3/) file for samtools 1.3, I learnt that the command “bamshuf” is renamed to “collate” . However, in samtools-1.4, both commands seems to work :
[sgajja@genlabcs samtools-1.4]$ samtools collate
Usage: samtools collate [-Ou] [-n nFiles] [-l cLevel]
[sgajja@genlabcs samtools-1.4]$ samtools bamshuf
Usage: samtools collate [-Ou] [-n nFiles] [-l cLevel]
thanks,
mglclinical
From shlee on 2017-08-21
Hi @mglclinical,
[Tutorial#6483](https://gatkforums.broadinstitute.org/gatk/discussion/6483)’s step 3A (https://gatkforums.broadinstitute.org/gatk/discussion/6483#step3A) gives instructions on using Picard SamToFastq. If you search the upper right search box with “how to samtofastq”, the tutorial is the third hit.
For Samtools questions, please post to their repo: https://github.com/samtools/samtools.
From mglclinical on 2017-08-21
Hi @shlee
Thank you for pointing me to the Tutorial#6483 .
Thanks,
mglclinical