Errors in SAMBAM files can be diagnosed with ValidateSamFile

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by dekling

on 2016-05-04

The problem

You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile of lines in the output log that the program outputs when there is a problem) may contain the following: java.lang.String, Error Type Count, NullPointerException -- or maybe something else that doesn't mean anything to you.

Why this happens

The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. In addition, the SAM/BAM index file can be a source of errors as well.

The source of these errors is usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.

The solution

To fix these problems, you first have to know what's wrong. Fortunately there's a handy Picard tool that can test for (almost) all possible SAM/BAM format errors, called ValidateSamFile.

We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.

In some cases, it may not be possible to fix some problems that are too severe, and you may need to redo the genome alignment/mapping from scratch! Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!

Workflow for diagnosing SAM/BAM file errors with ValidateSamFile

1. Generate summary of errors

First, run ValidateSamFile in SUMMARY mode in order to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY explicitly because by default the tool would just emit details about the 100 first problems it finds then quit. If you have some minor formatting issues that don't really matter but affect every read record, you won't get to see more important problems that occur later in the file.

$ java -jar picard.jar ValidateSamFile \ I=input.bam \ MODE=SUMMARY

If this outputs No errors found, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require Read Group tags like SM that not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM tags in the @RG lines in their BAM header. That is the most common culprit.

However, if the command above outputs one or more of the 8 possible WARNING or 48 possible ERROR messages (see tables at the end of this document), you must proceed to the next step in the diagnostic workflow.

When run in SUMMARY mode, ValidateSamFile outputs a table that differentiates between two levels of error: ERROR proper and WARNING, based on the severity of problems that they would cause in downstream analysis. All problems that fall in the ERROR category must be addressed to in order to proceed with other Picard or GATK tools, while those that fall in the WARNING category may often be ignored for some, if not all subsequent analyses.

Example of error summary

This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP error, 4 MISMATCH_MATE_ALIGNMENT_START errors, 894,289 MATES_ARE_SAME_END errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP warnings and 33 MISSING_TAG_NM warnings.

2. Generate detailed list of ERROR records

Since ERRORs are more severe than WARNINGs, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:

$ java -jar picard.jar ValidateSamFile \ I=input.bam \ IGNORE_WARNINGS=true \ MODE=VERBOSE

Note that we invoked the MODE=VERBOSE and the IGNORE_WARNINGS=true arguments.

The former is technically not necessary as VERBOSE is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. This produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR than is given in the SUMMARY output.

The IGNORE_WARNINGS option enables us to specifically examine only the records with ERRORs. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs that are not immediately important, and we don't want them flooding the log output.

Example of VERBOSE report for ERRORs only

These ERRORs are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs can typically be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools.

For example, MISSING_READ_GROUP errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool. Most mate pair information errors can be fixed with FixMateInformation.

Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs, and/or any of the original WARNINGs, and/or introduced any new ERRORs or WARNINGs (sadly, this does happen).

If you still have ERRORs, you'll have to loop through this part of the workflow until no more ERRORs are detected.

If you have no more ERRORs, congratulations! It's time to look at the WARNINGs (assuming there are still some -- if not, you're off to the races).

3. Generate detailed list of WARNING records

To obtain more detailed information about the warnings, we invoke the following command:

$ java -jar picard.jar ValidateSamFile \ I=input.bam \ IGNORE=type \ MODE=VERBOSE

At this time we often use the IGNORE option to tell the program to ignore a specific type of WARNING that we consider less important, in order to focus on the rest. In some cases we may even decide to not try to address some WARNINGs at all because we know they are harmless (for example, MATE_NOT_FOUND warnings are expected when working with a small snippet of data). But in general we do strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.

Example of VERBOSE report for WARNINGs only

ValidateSamFile (VERBOSE)

WARNING: Read name H0164ALXX140820:2:1204:13829:66057

WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776

Warning Description

A record is missing a read group

NM tag (nucleotide differences) is missing

Here we see a read group-related WARNING which would probably be fixed when we fix the MISSING_READ_GROUP error we encountered earlier, hence the prioritization strategy of tackling ERRORs first and WARNINGs second.

We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file.

Once you have attempted to fix all the WARNINGs that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY mode. Again, we check that no new ERRORs have been introduced and that the only WARNINGs that remain are the ones we feel comfortable ignoring. If that's not the case we run through the workflow again. If it's all good, we can proceed with our analysis.

Appendix: List of all WARNINGs and ERRORs emitted by ValidateSamFile

We are currently in the process of updating the Picard website to include the following two tables, describing WARNING (Table I) and ERROR (Table II) cases. Until that's done, you can find them here.

Updated on 2016-05-05

From mglclinical on 2016-05-06

Hi @dekling,

I have run ValidateSamFile in a summary mode on my bam file and it outputs a "MATECIGARSTRINGINVALIDPRESENCE" error for 59409 reads/count. The reads that picard points to as ERRORs have a mapping quality of 0, and I don't understand why picard throes this error ? Should I have removed reads with 0 mapping quality from my bam file before running the ValidateSamFile ?

[sgajja@genlabcs bwa]$ java -jar ~/algorithms/picard/20151013/picard-tools-1.140/picard.jar ValidateSamFile I="bwaAlignMDup.bam" MODE= SUMMARY [Fri May 06 14:36:38 EDT 2016] picard.sam.ValidateSamFile INPUT=bwaAlignMDup.bam MODE=SUMMARY MAXOUTPUT=100 IGNOREWARNINGS=false VALIDATEINDEX=true ISBISULFITESEQUENCED=false MAXOPENTEMPFILES=8000 VERBOSITY=INFO QUIET=false VALIDATIONSTRINGENCY=STRICT COMPR ESSIONLEVEL=5 MAXRECORDSINRAM=500000 CREATEINDEX=false CREATEMD5FILE=false GA4GHCLIENTSECRETS=clientsecrets.json [Fri May 06 14:36:38 EDT 2016] Executing as sgajja@genlabcs on Linux 3.10.0-229.el7.x8664 amd64; OpenJDK 64-Bit Server VM 1.7.099-moc kbuild201603232352-b00; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater INFO 2016-05-06 14:37:35 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:56s. Time for last 10,000,000: 56s. Last read position: 8:38,110,502 INFO 2016-05-06 14:38:29 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:50s. Time for last 10,000,000: 54s. Last read position: 19:44,988,530

HISTOGRAM java.lang.String

Error Type Count ERROR:MATECIGARSTRINGINVALIDPRESENCE 59409

I re-ran the ValidateSamFile in verbose mode and got the list of reads with the above error. One of the reads with the above error is .

I did a grep search on my bam file and here are the records :

[sgajja@genlabcs bwa]$ ~/algorithms/samtools/20150929/samtools-1.2/samtools view bwaAlignMDup.bam | grep "NS500510:11:H2T5HAFXX:1:11102:4034:19007" NS500510:11:H2T5HAFXX:1:11102:4034:19007 121 1 24559 0 151M = 24559 0 TCCCAGTAGTAAAAGCTTCCAAGTTGGGCTCTCACTTCAGCCCCTCCCACACAGGGAAGCCAGATGGGTTCCCCAGGACCGGGATTCCCCAAGGGGGCTGCTCCCAGAGGGTGTGTTGCTGGGATTGCCCAGGACAGGGATGGCCCTCTCA /EAA<NS5005100099L001 XA:Z:9,-24672,151M,1;19,-66167,151M,1;2,+114346309,151M,1;15,+102506455,151M,1;12,+78968,151M,3; MC:Z:* MQ:i:0 NS500510:11:H2T5HAFXX:1:11102:4034:19007 181 1 24559 0 * = 24559 0 TTTGGTTGTTTTTTTTTTAAATATTTTTTTTTTCCTTCCATTTGGTGGGCCTATCCTTTATGTTTTTCCTGAACCAAAAAAATATTTCCCCTTTTTTGGTGGTTTGAACTCCTTGGTGTTATTCTCAATATACCCCCCCCCTTATACGCC ///////A/////////NS5005100099_L001 MC:Z:151M MQ:i:0

Both the reads have a Mapping Quality 0, and one of the read is unmapped(flag 181) and other one is mapped(flag 121). This format seems to be genuine and I don't understand why Picard reports this as an error ?

Thanks: mglclinical

From dekling on 2016-05-09

Hello @mglclinical:

Let us start with the easy stuff. First update your Picard to the most recent version. Then try running again and see what happens. Let me know.

-David

From dekling on 2016-05-09

@mglclinical:

Your Samtools appears to be out of date as well.

From mglclinical on 2016-05-10

Hello @dekling ,

thank you for quick response. I will ask our system administrator to install java 8 in order for me to use latest picard, and then see how the results look like with the new picard

Thanks

mglclinical

From mglclinical on 2016-05-12

Hi @dekling

On our linux server, we have java 1.7, and not java 1.8. I learnt from my system administrator that if we update to java 1.8, then he may remove the existing java 1.7 as recommended by Oracle .

Most recent version of picard 2.2.4 can be run only with java 1.8, where as the most recent version of GATK 3.5 can be only run with java 1.7, hence I am in dilemma . Please advice .

I learnt that GATK_3.6 is going to be released soon which requires java 1.8, and not 1.7 . Any estimates on release dates for GATK3.6 ?

Thanks,

mglclinical

From Sheila on 2016-05-12

@mglclinical

Hi mglclinical,

Ah, I see the issue. We do have an article that helps users easily switch between Java versions, but it seems that will not help you :neutral:

However, if you are able to use the latest nightly build, that requires Java 1.8. So, you can upgrade to Java 1.8 and use it for both GATK and Picard.

If you absolutely cannot use the latest nightly build, I suspect the next stable release (3.6) will be out soon (we are finalizing things). But, Geraldine @Geraldine_VdAuwera may be able to comment more :smile:

-Sheila

From mglclinical on 2016-05-12

@Sheila

Thank you for the quick reply

Could you please point me to that article that helps “users easily switch between Java versions” ? I know it may not be completely useful, but I like to look into it.

Thanks,

mglclinical

From Sheila on 2016-05-13

@mglclinical

Hi mglclinical,

[Here](https://www.broadinstitute.org/gatk/guide/article?id=6841) it is.

-Sheila

From mglclinical on 2016-05-13

Thank you @Sheila

From mglclinical on 2016-05-20

Hi @dekling

You have suggested me to upgrade picard and run it. I did upgrade picard and I got the same error "MATECIGARSTRINGINVALIDPRESENCE" . All those reads with errors are the ones with mapping quality 0

C:\sgajjalaprojects\testvalidateBAM\picard20160520\picard-tools-2.3.0> C:\sgajjalaprojects\testvalidateBAM\picard20160520\picard-tools-2.3.0> C:\sgajjalaprojects\testvalidateBAM\picard20160520\picard-tools-2.3.0>java -Xmx6G -jar picard.jar ValidateSamFile I="C:\sgajjalaprojects\testvalidateBAM\novoalignbam\SampleReadsFinal.bam" MODE=SUMMARY [Fri May 20 17:02:57 EDT 2016] picard.sam.ValidateSamFile INPUT=C:\sgajjalaprojects\testvalidateBAM\novoalignbam\SampleReadsFinal.bam MODE=SUMMARY MAXOUTPUT=100 IGNOREWARNINGS=false VALIDATEINDEX=true INDEXVALIDATIONSTRINGENCY=EXHAUSTIVE ISBISULFITESEQUENCED=false MAXOPENTEMPFILES=8000 VERBOSITY=INFO QUIET=false VALIDATIONSTRINGENCY=STRICT COMPRESSIONLEVEL=5 MAXRECORDSINRAM=500000 CREATEINDEX=false CREATEMD5FILE=false GA4GHCLIENTSECRETS=clientsecrets.json [Fri May 20 17:02:57 EDT 2016] Executing as sgajja@B-544469 on Windows 7 6.1 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.065-b17; Picard version: 2.3.0(9a00c87b7ffdb01cfb5a0d6e76556146196babb81463071327) JdkDeflate INFO 2016-05-20 17:03:47 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:50s. Time for last 10,000,000: 50s. Lastread position: 2:43,053,844 INFO 2016-05-20 17:04:37 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:40s. Time for last 10,000,000: 49s. Lastread position: 3:169,343,211 INFO 2016-05-20 17:05:30 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:33s. Time for last 10,000,000: 53s. Lastread position: 6:28,648,000 INFO 2016-05-20 17:06:23 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:03:26s. Time for last 10,000,000: 52s. Lastread position: 8:61,156,723 INFO 2016-05-20 17:07:16 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:04:18s. Time for last 10,000,000: 52s. Lastread position: 11:824,441 INFO 2016-05-20 17:08:06 SamFileValidator Validated Read 60,000,000 records. Elapsed time: 00:05:09s. Time for last 10,000,000: 50s. Lastread position: 13:78,221,271 INFO 2016-05-20 17:08:57 SamFileValidator Validated Read 70,000,000 records. Elapsed time: 00:06:00s. Time for last 10,000,000: 51s. Lastread position: 16:75,192,511 INFO 2016-05-20 17:09:50 SamFileValidator Validated Read 80,000,000 records. Elapsed time: 00:06:53s. Time for last 10,000,000: 52s. Lastread position: 19:57,065,257

HISTOGRAM java.lang.String

Error Type Count ERROR:MATECIGARSTRINGINVALIDPRESENCE 149067

[Fri May 20 17:16:05 EDT 2016] picard.sam.ValidateSamFile done. Elapsed time: 13.14 minutes. Runtime.totalMemory()=82837504 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Here are my grep results for one of the reads that caused "MATECIGARSTRINGINVALIDPRESENCE" error by Picard :

[sgajja@genlabcs novoalign]$ [sgajja@genlabcs novoalign]$ ~/algorithms/samtools/20150929/samtools-1.2/samtools view SampleReadsFinal.bam | grep "NS500510:11:H2T5HAFXX:4:11508:2247:4270" NS500510:11:H2T5HAFXX:4:11508:2247:4270 77 * 0 0 * * 0 0 TGCCACCTCTACTGAACCCACAACCACAACAACAGCCAGCACTACAACTGAACAGTCTGCCACAACAACTGGAGGTACTTCAACTGGAGTTACAACCACAACAGCAACCACTAGCACCTCTACTGAACCCAGCACTACGACAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEENS5005100099L004 PU:Z:6L004 LB:Z:Lib1 ZS:Z:NMMC:Z:* MQ:i:0 NS500510:11:H2T5HAFXX:4:11508:2247:4270 141 * 0 0 * * 0 0 GCCTGCAGTTGTGGTGGCAGACTGGTCAGTCACAGTGCCAATTGTCGTAGTGCTGGGTTCAGTAGAGGTGCTAGTGGTTGCTGTTGTGGTAGTAACTCCAGTACAAGTACCTCCAGTTGTTGTGGCAGACTGTTCAG AAAAAEEEEEEEEEEEEEEEEEEEAEEEEEAE/EEEEE/EE/EEE/EEEEEEAE/E/AE6E/6EAA/E/E/E/ENS5005100099L004 PU:Z:6_L004 LB:Z:Lib1 ZS:Z:NM MC:Z:* MQ:i:0 [sgajja@genlabcs novoalign]$

So, the above reads seems to be genuine and I don't understand why Picard calls them out as "MATECIGARSTRINGINVALIDPRESENCE" ?

From Geraldine_VdAuwera on 2016-05-21

Hi @mglclinical,

I’m not familiar with this case figure so I looked up the test for this error in the Picard source code. This is the read pair in the test file, which is considered illegal by the SAM specification:

pair_read 73 chr7 3 9 6M = 3 0 CAACAG )’..+ MC:Z: PG:Z:0 RG:Z:0 NM:i:4 UQ:i:33 pair_read 133 chr7 3 0 * = 3 0 NCGCGG &/1544 MC:Z:6M PG:Z:0 RG:Z:0

As far as I can tell this is the same case figure as what you have, ins’t it? If so the tool is correct and these read pairs were formatted incorrectly by the aligner you used. My understanding is that the mapped read should not have an MC tag. I think Picard FixMate may be able to clean this up; I would recommend you give it a try on a snippet that contains some of the offending reads.

From Geraldine_VdAuwera on 2016-05-21

Sorry I meant [FixMateInformation](https://broadinstitute.github.io/picard/command-line-overview.html#FixMateInformation).

From mglclinical on 2016-05-23

@Geraldine_VdAuwera

Thank you so much for a quick reply over the weekend.

In my current state of my pipeline, I run HaplotypeCaller on my previous bam file(unclean version) and my results are good enough (good sensitivity and specificity for NA12878 ; and also picked up positive expected variants from positive control test samples)

So, If I clean up my bam file with Picard’s FixMateInformation, would it affect my results in vcf file when I run with HaplotypeCaller ?

Thanks,

mglclinical

From Geraldine_VdAuwera on 2016-05-23

Hi @mgclinical,

No, I wouldn’t expect this to affect the results of HaplotypeCaller, as it doesn’t use or even read in mate cigar information. This would only be a problem if you wanted to run some other tool that needs to access MC and can’t handle this case figure.

From mglclinical on 2016-05-23

@Geraldine_VdAuwera thank you

From mglclinical on 2016-06-08

This is just an information for future Picard users

Source – https://sourceforge.net/p/picard/wiki/Main_Page/

Q: Why am I getting errors from Picard like “MAPQ should be 0 for unmapped read” or “CIGAR should have zero elements for unmapped read?”

A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-”*” CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.

This is what happened in my case, one of the reads had a valid CIGAR string when mapping quality was 0, and that is how BWA assigns the CIGAR strings for reads found at the end of the reference sequences.

Thanks,

mglclinical

From dekling on 2016-06-09

Thank you @mglclinical.

From MUHAMMADSOHAILRAZA on 2016-07-27

@Geraldine_VdAuwera

Hi,

I run ValidateSamFile tool in summary mode:

java -Xmx10g -jar $PICARD ValidateSamFile \

I=$INPUT/Sample.modRG.bam \

O=$OUTPUT/Sample.modRG.bam.txt \

MODE=SUMMARY

It prompts error messages i.e. Error Type Count ERROR:INVALID_PLATFORM_VALUE 2

However, when i checked the RG tag in header section, the PL looks like this:

RG ID:HS2000-1259_212.L2 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L2 CN:Illumina-ltd

RG ID:HS2000-1259_212.L3 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L3 CN:Illumina-ltd

My questions are:

1. Here the PLATFORM_VALUE (PL) is “HiSeq2000”, is it wrong or am i missing something??

2. I am curious either is due to change in PU value format or something else??

3. can i proceed without giving serious attention towards this error, if not, how to resolve it?

Thanks!

From dekling on 2016-07-27

@MUHAMMADSOHAILRAZA: According to the SAM specification sheet, there are a limited number of PL values which include: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO. HiSeq2000 is not one of them. Use Picard’s AddOrReplaceReadGroups to fix your SAM/BAM files and then try ValidateSamFile again. Let us know if you have more questions.

From MUHAMMADSOHAILRAZA on 2016-07-27

@dekling

hi,

If i dont wanna change PL “HiSeq2000”, is it okay for downstream analysis?

From dekling on 2016-07-27

@MUHAMMADSOHAILRAZA: I do not know for sure. You are welcome to try it, but I strongly recommend against it.

From MUHAMMADSOHAILRAZA on 2016-07-27

@dekling

Actually, i received merged BAM file with multiple RGs, i think, if i am not missing anything, Picard’s AddOrReplaeReadGroups tool work on single lane of sample.. Is there any way to change PL values for multiple RGs in merged SAM/BAM by picards. i.e. replacing only PL value below,

RG ID:HS2000-1259_212.L2 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L2 CN:Illumina-ltd

RG ID:HS2000-1259_212.L3 LB:LP6005592-DNA_E05-PE PL:HiSeq2000 SM:LP6005592-DNA_E05 PU:E05_HS2000-1259_212.L3 CN:Illumina-ltd

i know it can be done by text editing, but want to find any other good way:

From dekling on 2016-07-27

@MUHAMMADSOHAILRAZA: I am not sure, but you can try experimenting with Picard’s ReplaceSamHeader tool.

From MUHAMMADSOHAILRAZA on 2016-07-28

Thanks! @dekling

From junhou on 2017-07-03

Hi All,

I am using picard-2.9.4. When I run ValidateSamFile \ I=input.bam \ MODE=SUMMARY

I got an error message: ERROR: Unrecognized option:

….

followed by options, including

OPTIONS_FILE=File File of OPTION_NAME=value pairs. No positional parameters allowed. Unlike command-line options, unrecognized options are ignored. A single-valued option set in an options file may be overridden by a subsequent command-line option. A line starting with ‘#’ is considered a comment. Required.

is it true that a OPTIONS_FILE has to be referred under this version of picard to run this function?

Thanks for answering.

From Sheila on 2017-07-03

@junhou

Hi,

I just tried running this exact command:

`java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY` and I get no errors.

Can you post what exactly comes after `ERROR: Unrecognized option:`?

Thanks,

Sheila

From pschnepp on 2017-10-12

I’ve been trying to run Picard’s tool AddorReplaceReadGroups but I keep getting the following message:

Code: java -jar $PICARD_JARS/AddOrReplaceReadGroups.jar INPUT=aligned_reads.merged.bam OUTPUT=aligned_reads.correct.bam RGID=1 RGLB=lib1 RGPL=illumina_nextra RGPU=unit1 RGSM=ERR1211180

Error message:

Exception in thread “main” net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Record 1, Read name ERR1211180.544729, RG ID on SAMRecord not found in header: group1 at net.sf.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448) at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:540) at net.sf.samtools.BAMFileReader$BAMFileIterator.(BAMFileReader.java:499) at net.sf.samtools.BAMFileReader$BAMFileIterator.(BAMFileReader.java:487) at net.sf.samtools.BAMFileReader.getIterator(BAMFileReader.java:289) at net.sf.samtools.SAMFileReader.iterator(SAMFileReader.java:319) at net.sf.samtools.SAMFileReader.iterator(SAMFileReader.java:37) at net.sf.picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:98) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119) at net.sf.picard.sam.AddOrReplaceReadGroups.main(AddOrReplaceReadGroups.java:66)

I’m not sure what I’m doing wrong. The only error message for this bam file is that the read group info is not found.

Thanks for any help!

From Sheila on 2017-10-16

@pschnepp

Hi,

It seems you already have read group tags in your individual reads and you may be trying to add conflicting read group information to the header. If you look at your individual reads, you should be able to determine the RGID and add that manually to your header (no need to use AddOrReplaceReadGroups).

-Sheila

P.S. You may be able to find some extra help if you do a google search for your error message :smiley:

From pschnepp on 2017-10-17

I was hoping there would be an easier way than manually changing the read group info. I’ve tried doing a google search but hadn’t come up with any answers that solved my problem. Guess brute force and manually changing is the way to go. Thanks for the help.

From bhaas on 2018-01-26

Is there a way to get it to generate a ‘clean’ bam file containing only the records that pass validation?

From Sheila on 2018-01-29

@bhaas

Hi,

There are some Picard tools that can help, however you may need to use a few tools to clean your file. What are the errors that you are getting? Also, if you just have a few offending records, and you know which ones they are, you can manually remove them.

-Sheila

From bhaas on 2018-01-29

Thanks, Sheila. I ended up just running the validator first to identify all offending records, and then wrote a separate little tool to remove the offending records from the bam. That’ll suffice for now. Ideally, there’d be a flag somewhere in the picard tools to tell it to output just the records that validate, though. Would have saved me some effort. ;-)

From Sheila on 2018-01-31

@bhaas

Hi,

Yes, I agree that would be nice. But, people may abuse that and move on with just 50% of their data when they could recover 100% of their data with a fix. We have to make it “kind of” tough for people to move on without proper fixes :smiley: Glad to hear you solved this on your own.

-Sheila

From bhaas on 2018-01-31

I appreciate the sentiment, but think it would still be better to make it easy to do this and just properly inform the user. You could add a parameter so that if less than some percent of reads are output as ‘valid’, then it would crash. Having just 0.0001% of your records flagged as invalid for whatever reason just makes it difficult to move on with the existing setup, and if the user isn’t a programmer, it’s of course difficult to impossible for them to move on.

cheers,

~brian

From Sheila on 2018-02-02

@bhaas

Hi Brian,

Thanks for sharing. I will bring this up to the team, but something tells me this is never going to happen :smile:

-Sheila

From bhaas on 2018-02-02

when I looked at the code for this, I came to pretty much the same conclusion. ;-) We can dream.

From CarolineJudy on 2018-06-27

Hi everyone,

Recently I ran ValidateSamFile and received the following errors for my bam files.

```

ERROR:MATE_NOT_FOUND 1457087

ERROR:MISSING_READ_GROUP 1

WARNING:RECORD_MISSING_READ_GROUP 32944969

```

To fix the offending bam files, I ran them through AddOrReplaceReadGroups and samtools fixmate. But it appears that doing so introduced some new problems. Re-running the new bam files through the ValidateSamFile script revealed the following new errors:

```

ERROR:INVALID_FLAG_FIRST_OF_PAIR 473730

ERROR:INVALID_FLAG_SECOND_OF_PAIR 983357

ERROR:RECORD_OUT_OF_ORDER 424595

```

Are there guidelines for how I might go about fixing these specific errors?

Many thanks,

Caroline

From Sheila on 2018-06-29

@CarolineJudy

Hi Caroline,

How did you align your reads? Using BWA? Did you use the very latest version of the aligner?

-Sheila

From CarolineJudy on 2018-06-29

Hi @Sheila,

Thanks,

I used bwa version 0.7.17 which is the latest version, to my knowledge.

Run BWA mem

```

/bin/sh
————————Parameters——————————— #

#$ -S /bin/sh #$ -pe mthread 14 #$ -q mThC.q #$ -l mres=2G,h_data=2G,h_vmem=2G #$ -cwd #$ -j y #$ -N aln_de_novo #$ -o aln_de_novo.log #$ -M judyc@si.edu #

————————Modules————————————- #
module load bioinformatics/bwa

————————Your Commands—————————- #

echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME

echo + NSLOTS = $NSLOTS #

bwa mem -M -t 14 /prepare_ref/Trinity.fasta /2993_L008_R1_001_val_1.fq /S2993_L008_R2_001_val_2.fq > results/sam/2993_denovo.sam #

echo = `date` job $JOB_NAME done

```

From CarolineJudy on 2018-06-29

Here are some additional details of my troubleshooting attempts:

ValidateSamFile

```

/bin/sh

----------------Parameters----------------------

$ -S /bin/sh

$ -q mThC.q

$ -cwd

$ -j y

$ -N valmaxvalue_set

$ -o val.maxvalueset.log

----------------Modules-------------------------

module load bioinformatics/picard-tools #

----------------Your Commands-------------------

# echo + date job $JOBNAME started in $QUEUE with jobID=$JOBID on $HOSTNAME # runpicard ValidateSamFile \ I=my.bam \ MODE=SUMMARY MAXOPENTEMPFILES=3050 # echo = date job $JOBNAME done

output

HISTOGRAM java.lang.String

Error Type Count ERROR:MATENOTFOUND 1457087 ERROR:MISSINGREADGROUP 1 WARNING:RECORDMISSINGREAD_GROUP 32944969 ```

AddOrReplaceReadGroups

```

/bin/sh

----------------Parameters----------------------

$ -S /bin/sh

$ -q mThC.q

$ -l mres=6G,hdata=6G,hvmem=6G

$ -cwd

$ -j y

$ -N addreadgroups

$ -o addreadgroups.log

$ -M judyc@si.edu

----------------Modules-------------------------

module load bioinformatics/picard-tools #

----------------Your Commands-------------------

# echo + date job $JOBNAME started in $QUEUE with jobID=$JOBID on $HOSTNAME # runpicard AddOrReplaceReadGroups \ I=/pool/genomics/judyc/variantcalling/results/bam/2993denovo.cleanedsorteddeduplicated.bam \ O=2993denovo.cleanedsorteddeduplicatedreadgroupsadded.bam \ RGID=HG5Y5BBXX8ACATTGGC \ RGLB=CVD245 \ RGPL=illumina \ RGPU=HG5Y5BBXX8ACATTGGC \ RGSM=CVD245 # echo = date job $JOB_NAME done ```

Re-validate using ValidateSamFile

HISTOGRAM java.lang.String Error Type Count ERROR:MATE_NOT_FOUND 1457087

Run samtools fixmate

```

/bin/sh

----------------Parameters----------------------

$ -S /bin/sh

$ -q sThC.q

$ -l mres=2G,hdata=2G,hvmem=2G

$ -cwd

$ -j y

$ -N matefix_2993

$ -o matefix_2993.log

----------------Modules-------------------------

module load bioinformatics/samtools #

----------------Your Commands-------------------

# echo + date job $JOBNAME started in $QUEUE with jobID=$JOBID on $HOSTNAME # samtools sort -n ../../results/bam/2993denovo.cleanedsorteddeduplicatedreadgroupsadded.bam -o ../../results/bam/2993namesorted.bam -O bam | samtools fixmate -O bam ../../results/bam/2993namesorted.bamø 2993matefix.bam # echo = date job $JOBNAME done ```

Re-validate using ValidateSamFile

ERROR:INVALID_FLAG_FIRST_OF_PAIR 473730 ERROR:INVALID_FLAG_SECOND_OF_PAIR 983357 ERROR:RECORD_OUT_OF_ORDER 424595

Try sorting (by coordinates), then re-run validation

```

HISTOGRAM java.lang.String

Error Type Count ERROR:INVALIDFLAGFIRSTOFPAIR 378322 ERROR:INVALIDFLAGSECONDOFPAIR 786354 ```

From CarolineJudy on 2018-06-29

Seems most of the errors are generated using samtools fixmate. I’m using samtools vs. 1.6.

From Sheila on 2018-07-02

@CarolineJudy

Hi Caroline,

Okay. Let’s go back through your steps. What tools have you run on your BAM file? Perhaps [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/11457/gatherbamfiles-fixmateinformation-validatesamfile) may help.

-Sheila

From CarolineJudy on 2018-07-02

Hi @Sheila,

Thanks for your response! I’m currently in the pre-processing steps for the [GATK Best Practices for variant calling on RNAseq](https://gatkforums.broadinstitute.org/gatk/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail “GATK Best Practices for variant calling on RNAseq”). After performing the alignments, I performed sort/deduplication using samtools and picard-tools.

```

/bin/sh
————————Parameters——————————— #

#$ -S /bin/sh #$ -q mThC.q #$ -l mres=6G,h_data=6G,h_vmem=6G #$ -cwd #$ -j y #$ -N sort_deduplicate_loop #$ -o sort_deduplicate_loop.log #$ -M judyc@si.edu #

————————Modules————————————- #
module load bioinformatics/picard-tools
module load bioinformatics/samtools

————————Your Commands—————————- #

echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME

echo + NSLOTS = $NSLOTS #

sam=$1

set —”“ newfile=”$(basename $sam .sam)“ samtools view -bS $sam | samtools sort – -o $sam.sorted.bam runpicard MarkDuplicates INPUT=$sam.sorted.bam OUTPUT= ../bam/${newfile}.cleaned_sorted_deduplicated.bam METRICS_FILE=${newfile}.metricsfile MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=250 ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=True #

echo = `date` job $JOB_NAME done

```

From CarolineJudy on 2018-07-02

After performing the sort/deduplication step, I ran ValidateSamFile, and then undertook the trouble shooting steps detailed in my comment from June 29th.

From mshruti on 2018-07-06

Hi,

I ran bwa mem on paired end fastq files to generate sam file. I get the following error when I run

FixMateInformation on it:

sing GATK jar /home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar

Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar FixMateInformation -I /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam -O /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.bam —ADD_MATE_CIGAR true —ASSUME_SORTED false —IGNORE_MISSING_MATES true —SORT_ORDER coordinate

14:20:45.697 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/home/groups/euan/apps/gatk/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

[Fri Jul 06 14:20:46 PDT 2018] FixMateInformation —INPUT /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam —OUTPUT /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.bam —SORT_ORDER coordinate —ASSUME_SORTED false —ADD_MATE_CIGAR true —IGNORE_MISSING_MATES true —VERBOSITY INFO —QUIET false —VALIDATION_STRINGENCY STRICT —COMPRESSION_LEVEL 2 —MAX_RECORDS_IN_RAM 500000 —CREATE_INDEX false —CREATE_MD5_FILE false —GA4GH_CLIENT_SECRETS client_secrets.json —help false —version false —showHidden false —USE_JDK_DEFLATER false —USE_JDK_INFLATER false

[Fri Jul 06 14:20:46 PDT 2018] Executing as mshruti@sh-17-11.int on Linux 3.10.0-862.3.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.0.4.0

INFO 2018-07-06 14:20:46 FixMateInformation Sorting input into queryname order.

[Fri Jul 06 14:21:12 PDT 2018] picard.sam.FixMateInformation done. Elapsed time: 0.45 minutes.

Runtime.totalMemory()=4734320640

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

htsjdk.samtools.SAMFormatException: Error parsing text SAM file. length(QUAL) != length(SEQ); File /scratch/PI/euan/common/udn/externalUdnData/from_wu/bwa_DCM/06042018/dcm141/DCM141_USD16060084_HNNVYCCXX_L6.sam; Line 1604998

Line: E00489:44:HNNVYCCXX:6:1101:8389:51324 147 2 115826625 60 150M = 115826414 -361 TTGTGGTTTCATATGCTTTAGGATTTCTTTTGTATTTCTGTAAATAATACCATTGGATTTTTATGGGGCTTGCATTGATTCTGTATTTGTTTTGAATTTTTAACAATATAGATTCTTCCAATATATAAACATAGATTTTTTTGTTTATTTJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:457) at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:338) at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268) at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255) at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548) at picard.sam.FixMateInformation.doWork(FixMateInformation.java:200) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282) at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289)

I get the following error when I run ValidateSamFile:

Error Type Count

ERROR:MISMATCH_READ_LENGTH_AND_QUALS_LENGTH 1

ERROR:MISMATCH_SEQ_QUAL_LENGTH 1

WARNING:MISSING_TAG_NM 1

WARNING:RECORD_MISSING_READ_GROUP 1

I understand that Length of sequence string and length of base quality string do not match in line ‘E00489:44:HNNVYCCXX:6:1101:8389:51324’ but how should I fix this?

I am using gatk4 tools.

Thanks,

Shruti

From CarolineJudy on 2018-07-09

Hi @Sheila

Just checking back regarding my post from July 2. Any ideas on what might be going wrong?

From Geraldine_VdAuwera on 2018-07-10

@CarolineJudy Sorry for the lag; Sheila is on vacation so I’m looking through your issue now. To be frank, we don’t have the bandwidth to provide further one-on-one support since we don’t develop bwa, samtools and other such pre-processing tools. I would recommend you reach out to the samtools developers, who have their own mailing list, and see if they can help you. Good luck!

From Geraldine_VdAuwera on 2018-07-10

@mshruti I believe RevertSam has a SANITIZE option that should allow you to exclude these malformed reads.

From mshruti on 2018-07-10

Thanks @Geraldine_VdAuwera. I will check it out.

From CarolineJudy on 2018-07-13

Okay, thanks @Geraldine_VdAuwera!

From CarolineJ on 2018-07-16

Hi @Geraldine_VDAuwera,

To correct a missing mate problem, I ran the picard tool FixMateInformation as recommended above. However, running SamFileValidation shows that over 1M reads are still missing mate information.

ERROR:MATE_NOT_FOUND 1457087

Is there a way to filter these 1457087 “malformed” reads using picard tools?

Many thanks in advance for any help you can offer,

From CarolineJ on 2018-07-16

Hi everyone – I resolved this problem – the missing mate issue was generated when I used MarkDuplicates with option “REMOVE_DUPLICATES=TRUE”, which appears to have created orphan reads in my outputted bam file. A [Biostars discussion](https://www.biostars.org/p/83806/ “Biostars discussion”) hypothesizes that the picard tool MarkDuplicates leaves unmapped mates in the BAM for pairs where only one of the two paired reads map. I think this might be what happened to me.

From Sheila on 2018-07-17

@CarolineJ

Hi,

Thank you for posting your findings.

-Sheila

From CarolineJ on 2018-07-18

In terms of best practices, do you have any suggestions? Currently, I plan to proceed with variant calling without performing the deduplication step, but I’m concerned with the bias that may create.

Thanks for any input you can provide.

From Sheila on 2018-07-24

@CarolineJ

Hi,

Can you try not using `REMOVE_DUPLICATES=TRUE` in MarkDuplicates?

Thanks,

Sheila

From CarolineJ on 2018-07-25

Hi Sheila,

That works! Thanks!

From clayf on 2018-09-30

Greetings,

I have run into a similar problem as CarolineJ. When running ValidateSamFile on a specific .bam file, I get an extremely high number of MATE_NOT_FOUND errors (>80 milllion, >50% of the total number of reads). My protocol so far:

HISAT2 to align
FixMateInfo and AddOrReplaceReadGroups to process
ValidateSamFile to check

All the other data sets processed simultaneously using this protocol and these tools validated fine with no errors. I have not marked/removed duplicates yet. samtools flagstat suggests that the majority of reads in the problematic data set are mapped with a mate.

Thanks!

From shlee on 2018-10-02

Hi @clayf,

ValidateSamFile’s MATE_NOT_FOUND error is explicitly about not being able to find the mate’s read physically in the data file itself. Is there any possibility the file was truncated, data was subset to certain contigs or the second of pair sequencing run was problematic?

From clayf on 2018-10-03

Hi @shlee,

I’m not sure, but I think it’s unlikely. The original alignment file produced by HISAT2 (before running FixMateInfo or AddOrReplaceReadGroups) looks fine, and as I said, samtools seems to think the read mates are mapped correctly. HISAT2 also seems to think that the majority of mates are mapping correctly. I did run ValidateSamFile on the original alignment file and got the same error. Is there some reason that samtools flagstat would think that the mates are okay, but ValidateSamFile would not?

From shlee on 2018-10-03

Yes, @clayf, the validation stringency between samtools and GATK is different. I suspect this has something to do with pair/proper pair designations of secondary/supplementary alignments, which MergeBamAlignment solves for our workflows. Do you have some idea how HISAT2 assigns pair designations for secondary/supplementary alignments, e.g. reads split across introns? Are you by chance working on RNA-Seq data? In this case, our workflows require an additional processing step, SplitNCigarReads.

From clayf on 2018-10-03

@shlee, ah yes this is indeed RNA-seq data, so it could definitely be related to that. However, I would then wonder why the rest of these RNA-seq samples would pass ValidateSamFile with no errors or problems as they are, but this particular data set fails.

From shlee on 2018-10-03

Indeed that is odd @clayf. I’ve seen from other users certain pipelines do not handle single-end reads well in that the pipelines expect paired-end reads. In some cases, it turned out the preprocessing was done incorrectly and the SAM flags are misassigned. If you pull out some of the reads that cause the `MATE_NOT_FOUND` warning (with e.g. `samtools view xyz.bam | grep ‘’`; you can get a list of reads with ValidateSamFile by changing the `MODE` to be more verbose), do both pairs come out? Then if you examine their SAM flags (see [here](https://broadinstitute.github.io/picard/explain-flags.html)), do they make sense? My recommendation is to see if your caller of choice then also chokes on this particular sample or is able to call variants on it and if the variant calls are of the same quality as your other samples.

Report abuse