created by GATK_Team
on 2017-12-30
You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile of lines in the output log that the program outputs when there is a problem) may contain the following: java.lang.String, Error Type Count, NullPointerException -- or maybe something else that doesn't mean anything to you.
The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. In addition, the SAM/BAM index file can be a source of errors as well.
These errors are usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.
To fix these problems, you first have to know what's wrong. Fortunately there's a handy Picard tool that can test for (almost) all possible SAM/BAM format errors, called ValidateSamFile.
We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.
In some cases, it may not be possible to fix some problems that are too severe, and you may need to redo the genome alignment/mapping from scratch! Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!
First, run ValidateSamFile in SUMMARY mode in order to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY explicitly because by default the tool would just emit details about the first 100 problems it finds, then quit. If you have some minor formatting issues that don't really matter but affect every read record, you won't get to see more important problems that occur later in the file.
$ java -jar picard.jar ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY
If this outputs No errors found, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and can proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require Read Group tags like SM that are not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM tags in the @RG lines in their BAM header. That is the most common culprit.
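One way to make that check is to print the header with Samtools and look for @RG lines that lack an SM tag. The following is a minimal sketch, assuming samtools is installed; the helper name rg_lines_missing_sm is hypothetical and not part of Picard, GATK, or Samtools.

```shell
# Hypothetical helper: reads SAM header text on stdin and prints every
# @RG line that has no SM tag. Header fields are tab-separated.
rg_lines_missing_sm() {
    awk -F'\t' '
        $1 == "@RG" {
            has_sm = 0
            for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) has_sm = 1
            if (!has_sm) print
        }'
}

# Usage (assumes samtools is on your PATH; input.bam is a placeholder):
#   samtools view -H input.bam | rg_lines_missing_sm
```

If the helper prints nothing, every @RG line carries an SM tag; if the header has no @RG lines at all, it also prints nothing, so it is worth eyeballing the header once as well.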
However, if the command above outputs one or more of the 8 possible WARNING or 48 possible ERROR messages (see tables at the end of this document), you must proceed to the next step in the diagnostic workflow.
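If you run this check proactively inside a pipeline, the two outcomes above can be turned into a simple pass/fail gate. This is only a sketch: the helper name summary_is_clean is hypothetical, and it relies solely on the No errors found message described above.

```shell
# Hypothetical helper: succeeds only if the ValidateSamFile output read
# from stdin contains the "No errors found" message.
summary_is_clean() {
    grep -q 'No errors found'
}

# Usage in a pipeline step (input.bam is a placeholder):
#   java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY | summary_is_clean || exit 1
```

Because the gate treats anything other than a clean report as a failure, WARNINGs you have decided to tolerate would also stop the pipeline unless you exclude them from the validation run.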
When run in SUMMARY mode, ValidateSamFile outputs a table that differentiates between two levels of error: ERROR proper and WARNING, based on the severity of the problems they would cause in downstream analysis. All problems that fall in the ERROR category must be addressed in order to proceed with other Picard or GATK tools, while those that fall in the WARNING category may often be ignored for some, if not all, subsequent analyses.
Example of error summary
This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP error, 4 MISMATCH_MATE_ALIGNMENT_START errors, 894,289 MATES_ARE_SAME_END errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP warnings and 33 MISSING_TAG_NM warnings.
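For a long summary table it can help to total the two severity levels before deciding where to start. The sketch below assumes that each row of the SUMMARY table has the form ERROR:TYPE or WARNING:TYPE followed by a tab and a count; that layout is an assumption about the output, and the helper name tally_summary is hypothetical.

```shell
# Hypothetical helper: totals the ERROR and WARNING counts in a SUMMARY
# table read from stdin. Assumes rows look like "ERROR:TYPE<TAB>count";
# any other line (headers, blank lines) is ignored.
tally_summary() {
    awk -F'\t' '
        /^ERROR:/   { errors   += $2 }
        /^WARNING:/ { warnings += $2 }
        END { printf "errors=%d warnings=%d\n", errors, warnings }'
}

# Usage (input.bam is a placeholder):
#   java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY | tally_summary
```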
Since ERRORs are more severe than WARNINGs, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:
$ java -jar picard.jar ValidateSamFile \
    I=input.bam \
    IGNORE_WARNINGS=true \
    MODE=VERBOSE
Note that we invoked the MODE=VERBOSE and the IGNORE_WARNINGS=true arguments.
The former is technically not necessary as VERBOSE is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. This produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR than is given in the SUMMARY output.
The IGNORE_WARNINGS option enables us to specifically examine only the records with ERRORs. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs that are not immediately important, and we don't want them flooding the log output.
Example of VERBOSE report for ERRORs only
These ERRORs are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs can be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools.
For example, MISSING_READ_GROUP errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool. Most mate pair information errors can be fixed with FixMateInformation.
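As a rough illustration of those two fixes, the sketch below wraps each tool in a small shell function. The function names, the file names, and every read group value (group1, lib1, unit1, sample1) are placeholders chosen for this example, not values from a real dataset; substitute the ones that describe your sample.

```shell
# Picard launcher; set PICARD beforehand if your jar lives elsewhere.
# $PICARD is expanded unquoted below so that "java -jar picard.jar"
# splits into separate words (bash/sh behavior).
PICARD=${PICARD:-java -jar picard.jar}

# Add (or overwrite) read group information.
# All RG values here are placeholders for illustration only.
add_read_groups() {
    $PICARD AddOrReplaceReadGroups \
        I="$1" O="$2" \
        RGID=group1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=sample1
}

# Repair mate pair information.
fix_mates() {
    $PICARD FixMateInformation I="$1" O="$2"
}

# Usage (file names are placeholders):
#   add_read_groups input.bam with_rg.bam
#   fix_mates with_rg.bam fixed.bam
```

Writing each fix to a new output file keeps the original BAM untouched, which makes it easier to compare validation results before and after.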
Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs, and/or any of the original WARNINGs, and/or introduced any new ERRORs or WARNINGs (sadly, this does happen).
If you still have ERRORs, you'll have to loop through this part of the workflow until no more ERRORs are detected.
If you have no more ERRORs, congratulations! It's time to look at the WARNINGs (assuming there are still some -- if not, you're off to the races).
To obtain more detailed information about the warnings, we invoke the following command:
$ java -jar picard.jar ValidateSamFile \
    I=input.bam \
    IGNORE=type \
    MODE=VERBOSE
At this stage we often use the IGNORE option to tell the program to ignore a specific type of WARNING that we consider less important, in order to focus on the rest. In some cases we may even decide not to address some WARNINGs at all because we know they are harmless (for example, MATE_NOT_FOUND warnings are expected when working with a small snippet of data). But in general we do strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.
Example of VERBOSE report for WARNINGs only
ValidateSamFile (VERBOSE) | Warning Description
WARNING: Read name H0164ALXX140820:2:1204:13829:66057 | A record is missing a read group
WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776 | NM tag (nucleotide differences) is missing
Here we see a read group-related WARNING which would probably be fixed when we fix the MISSING_READ_GROUP error we encountered earlier, hence the prioritization strategy of tackling ERRORs first and WARNINGs second.
We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file.
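If you end up ignoring more than one type, repeating the IGNORE argument by hand gets tedious. The following is a sketch of a wrapper that builds one IGNORE= argument per type; the function name validate_ignoring is hypothetical, and it assumes IGNORE may be given more than once on the same command line.

```shell
# Picard launcher; set PICARD beforehand if your jar lives elsewhere.
PICARD=${PICARD:-java -jar picard.jar}

# Hypothetical helper: validate_ignoring <bam> <TYPE> [<TYPE> ...]
# Runs ValidateSamFile in VERBOSE mode with one IGNORE= per given type.
validate_ignoring() {
    bam=$1
    shift
    ignore_args=""
    for t in "$@"; do
        ignore_args="$ignore_args IGNORE=$t"
    done
    # $ignore_args is left unquoted on purpose so that each IGNORE=...
    # becomes its own command-line argument.
    $PICARD ValidateSamFile I="$bam" MODE=VERBOSE $ignore_args
}

# Usage (input.bam is a placeholder):
#   validate_ignoring input.bam MISSING_TAG_NM MATE_NOT_FOUND
```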
Once you have attempted to fix all the WARNINGs that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY mode. Again, we check that no new ERRORs have been introduced and that the only WARNINGs that remain are the ones we feel comfortable ignoring. If that's not the case we run through the workflow again. If it's all good, we can proceed with our analysis.
The following two tables describe WARNING (Table I) and ERROR (Table II) cases, respectively.
Updated on 2017-12-30