Collected FAQs about input files for sequence read data BAMCRAM

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2012-08-11

1. What file formats do you support for sequence data input?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). Starting with version 3.5, the CRAM format is supported as well. SAM format is not supported but can be easily converted with Picard tools.

2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.

3. What are the formatting requirements for my BAM file(s)?

All BAM/CRAM files must satisfy the following requirements:

It must be aligned to one of the references described here.
It must be sorted in coordinate order (not by queryname and not "unsorted").
It must list the read groups with sample names in the header.
Every read must belong to a read group.
The BAM file must pass Picard ValidateSamFile validation.

See the official BAM specification for more information on what constitutes a valid BAM file.

4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...

5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 @SQ SN:MT LN:16571 @SQ SN:NT_113887 LN:3994 ...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.

6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.

7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG' @RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC ...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG' EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85 EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 ...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).

8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. You can use Picard's AddOrReplaceReadGroups tool to add read group information.

11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -R reference.fasta -I full_input.bam -T PrintReads -L chr1:10-20 -o subset_input.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Updated on 2015-11-20

From Geraldine_VdAuwera on 2014-09-01

Questions and comments up to August 2014 have been moved to an archival thread here:

http://gatkforums.broadinstitute.org/discussion/4558/questions-about-bam-files

From everestial007 on 2016-03-03

HI Geraldine_VdAuwera Sheila

After we do the bwa alignment > sorting of the bam > and mark-duplicates, should we get rid of unmapped reads and/or reads with low mapQ (based on SAM specification) for downstream analyses.

I hunch is both yes and no based on what my overall goal is (that is what best practices recommend). But, how would it affect my choice if my main goal is SNP vs. Indel calling.

Also, https://www.broadinstitute.org/gatk/guide/article?id=2799 link is down.

Thanks,

From Geraldine_VdAuwera on 2016-03-03

You can get rid of those reads if you want, it won’t hurt the analysis. The downside is that those reads will be gone so you can’t change your mind. We don’t remove anything.

That document has been replaced by https://www.broadinstitute.org/gatk/guide/article?id=6747

I’ll fix the link so that it doesn’t just break.

From everestial007 on 2016-03-07

Thanks much @Geraldine_VdAuwera !

From sqanbar on 2016-11-16

I am feeding a list of BAM.gz files into PrintReads to get a subset of BAM files for a region. I get this error message !

ERROR MESSAGE: SAM/BAM/CRAM file BL01.sorted.dedup.rlgn.bam.gz is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

I do not see any problem with BAM files. the question is can I feed "bam.gz" files into PrintReads ?

From Geraldine_VdAuwera on 2016-11-16

sqanbar The error message says that the header section of your bam file does not have Read Group information (lines starting with RG). If that is true, you must add these before proceeding with analysis. See Picard AddOrReplaceReadGroups. If this is not true and you can show us the RG lines, I will check why you would get this error.

From Jutta on 2016-12-20

Hi Geraldine,

i run again into a new problem with my RNA-seq data set. The following is my program code for HaplotypeCaller according to the best practice pipeline. The previous steps run without any problems for this sample.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Zeamays.AGPv3.31.dna.genome.edit.fa -I sampleSplitNCigar.bam -dontUseSoftClippedBases -standcallconf 20.0 -standemitconf 20.0 -o sample_HaploCaller.vcf

I obtained the following error message:

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: SAM/BAM/CRAM file /media/storage/Serverdaten/GATK-SNPcalling/SplitNTrim/A554-II-2SplitNCigar.bam is malformed. Please see https://www.broadinstitute.org/gatk/guide/article?id=1317for more information. Error details: Unrecognized tag type: :

ERROR ------------------------------------------------------------------------------------------

As described on the website, I checked my reorder.bam file, the one which I reorder using ReorderSam after removing duplicates using MarkDuplicate, with ValidateSamFile: java -jar /opt/picard/picard.jar ValidateSamFile I=sample_reordered.bam MODE=SUMMARY

HISTOGRAM java.lang.String

Error Type Count ERROR:MATENOTFOUND 4378094

Furthermore, I re-run MarkDuplicates and reorderSam and checked my .bam file again. Still I obtained the same ERROR message.

I also checked ValidateSamFile for another sample for which Haplotype-calling step worked fine without any error messages. For this sample I obtained the same SUMMARY output of ValidateSAMFile as for the first sample, were I run into the problem with the "unrecognized tag type".

Now, I am wondering, if the I should correct the error "MATENOTFOUND", although it seems, that this error is not the cause for the failure of HaplotypeCaller. In case it is, how do I correct for this error? And otherwise, how can I remove the "Unrecognized tag type: : ". What is the cause for this error message?

I am happy for any recommendations :) Best wishes Jutta

From Geraldine_VdAuwera on 2016-12-21

Hmm, that’s one I have not seen before. What did you use to align your data?

From Jutta on 2016-12-21

I used TopHat2

From Geraldine_VdAuwera on 2016-12-22

I would recommend using STAR instead, as recommended in our best practices. In our hands it is the aligner that is most well-behaved.

From Jutta on 2017-01-03

Hm, I have in total 180 RNA-samples and for 179 the whole pipeline worked quite well, just for this one sample not. To redo the mapping with STAR for all the samples would take to much time and I have to do a lot of other analyses again. I guess, just mapping the one error-prone sample with STAR and keeping the other samples mapped with TopHat is not recommendable?!

So, I would very appreciate it, if anyone has a solution how to fix the problem with the “unrecognized tag type”

Thanks for your help,

Jutta

From shlee on 2017-01-03

Hi @Jutta,

If I understand correctly, only one of 180 files is causing an error for you. All other samples processed successfully. So, let’s try to figure out what is different about this one file compared to the other 179 and then figure out what may be the actual cause of error—missing mates or unrecognized tag types. I think it may be more likely that the former is the case since I would assume all your samples run with the same pipeline would give you the same set of tags. Let’s confirm.

First, have you tried rerunning this particular sample through your pipeline? Do you get the same exact output? We should rule out whether the data was truncated due to some random glitch that would cause mates to go missing.

If you can reproduce the error, then we should figure out why this one file is different from your other 179.

To list all tags within a BAM, use this command:

```

samtools view input.bam | cut ~~f 12~~ | tr ‘\t’ ‘\n’ | cut -d ‘:’ -f 1 | awk ‘{ if(!x[$1]++) { print }}’

```

Do you see any unusual tags not in your other files?

To obtain summary metrics of properly paired mates, use this command:

```

samtools flagstat input.bam

```

Do you see an unexpected proportion of unpaired mates?

If unpaired mates is the problem, then you should know that HaplotypeCaller does not use mate information in genotyping. So one quick solution I think is for you to unpair your reads by removing the 0×1 bitwise flag. The following command will create SE data:

```

samtools view -h input_pe.bam | gawk ‘{printf “%s\t”, $1; if(and($2,0×1))

{t=$2-0×1}else{t=$2}; printf “%s\t” , t; for (i=3; i printf “%s\n”,$NF}’| samtools view -Sb – > output_se.bam

```

Let us know what you find and any solutions that work or don’t work.

From shlee on 2017-01-04

@Jutta,

I need to clarify that HaplotypeCaller does use mate information to determine whether it will use reads in genotyping. If reads are paired, then the mates must align to the same contig for HaplotypeCaller to consider them in genotyping. If the mates align to different contigs, HaplotypeCaller ignores the reads in genotyping. What I meant to say above was that the mate information does not contribute to better genotyping, e.g. phasing. The last tentative solution I suggest above is one way for you to test whether unpaired mates is the reason for the error.

From Jutta on 2017-01-05

Hi Shlee,

First, yes I rerun the sample with the same pipeline again and I could reproduce the mentioned error message. Therefore, i performed all the three suggestions you made:

1.check the tags within the bam file: I applied to provided code to each bam file I produced within the pipeline based on the GATK Best practice. The output for the raw mapped reads (direct unprocessed output of TopHat2) is following: samtools view accepted_hits.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

AS XM XO XG MD NM XS NH XN YT CC CP HI

This output is the same as for other samples which I checked randomly. The output for the deupped and reordered bam file of this samples looks similar and is also the same for other samples, which I checked: samtools view dedupped.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

MD PG RG XG NH NM XM XN XO AS YT HI CC CP XS

A problem occurred, when I checked the tags for the bam file resulted from SplitNCigarReads: samtools view SplitNCigar.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

MD PG RG XG NH NM XM XN XO AS YT HI CC CP XS XX 1 20 �Y ZU �Y Y Here I observed these last 7 "tags", which I didn't observe in other samples, for which the whole pipeline worked fine! Seems like there is a problem with SplitNCigar? The code I used for all samples is: java -jar .../GenomeAnalysisTK.jar -T SplitNCigarReads -R Zeamays.AGPv3.31.dna.genome.edit.fa -I reordered.bam -o SplitNCigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOWNCIGARREADS

2.unexpected proportion of unpaired reads? samtools flagstat SplitNCigar.bam

84282659 + 0 in total (QC-passed reads + QC-failed reads) 8821999 + 0 secondary 0 + 0 supplementary 29552683 + 0 duplicates 84282659 + 0 mapped (100.00% : N/A) 74719435 + 0 paired in sequencing 37500927 + 0 read1 37218508 + 0 read2 63033090 + 0 properly paired (84.36% : N/A) 69735729 + 0 with itself and mate mapped 4983706 + 0 singletons (6.67% : N/A) 1355867 + 0 with mate mapped to a different chr 863143 + 0 with mate mapped to a different chr (mapQ>=5)

I can't observe here any unexpected proportions? Or do you?

3.I unpaired the mates with the code you provided and used the output file as input for Haplotype caller, but I again obtained the following error. Not sure, if I used the right input file or if I can somehow index the file prior to inputting it to Haplotypecaller?! samtools view -h SplitNCigar.bam | gawk '{printf "%s\t", $1; if(and($2,0x1)) {t=$2-0x1}else{t=$2}; printf "%s\t" , t; for (i=3; i

[E::samparse1] incomplete aux field [W::samread1] parse error at line 12778196 [main_samview] truncated file.

I don't unterstand what this output tells me and if the "unpairing" of the mates was successfully. Nevertheless, I just used SplitNCigar_se.bam as input for Haplotypecaller with the following output: ... INFO 11:08:12,732 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:08:12,810 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08 INFO 11:08:12,847 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 11:08:13,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAM/CRAMs in --unsafe mode, but this feature is unsupported -- use it at your own risk!

Sorry, for this long post!

From shlee on 2017-01-05

Hi @Jutta,

All this information is helpful so no need to apologize for the long post. Thanks for going through all the steps and also figuring out where in the pipeline the odd tags come from. Your SplitNCigarReads command must be fine because it works for the other samples without error. For the single problem sample, the extra tags (XX, 1, 20, �Y, ZU, �Y, Y) are definitely unexpected and likely the cause of error. Since you can recapitulate the error with the same sample file, my guess is that this is something different about the original sample data file that is causing SplitNCigarReads to produce strangely tagged records.

Results from (2) are less interesting so let’s consider the first results from (3). Samtools is saying the file is truncated (`incomplete aux field`, `truncated file`) and let’s assume for now that this relates to the odd tags. If I google `[E::sam_parse1] incomplete aux field`, I come upon this github thread: . To quote from it:

> It’s typically due to some sort of memory issue during the [pipeline] run.

The thread suggests this message appears due to a memory allocation issue in a pipeline. Errors due to memory constraints is not an area I am familiar with so please let me know if this makes sense to you. Is it possible your particular sample requires more memory to process than the other samples? Is the file size larger than the others or is there reason to believe it may have other unusual properties that necessitate more compute (for SplitNCigarReads or prior steps)? If you process this sample through the same steps manually, outside the pipeline, with more memory, or within the pipeline with more memory, do you get the same odd tagging?

From shlee on 2017-01-05

P.S. @Jutta. Can you also run your problem BAM through [ValidateSamFile](https://software.broadinstitute.org/gatk/guide/article?id=7571)? This is an easier way to suss out potential issues with BAMs. Hopefully, our tool catches such odd files as yours.

From Gossie on 2017-11-16

Hi @Geraldine_VdAuwera

In the section ”7. How can I tell if my BAM file has read group and sample information?“，you give a command `samtools view /path/to/my.bam | grep ‘^@RG`to print the read wich include “RG”. I think it is useless to match “RG” in reads, because reads don’t start with “RG”. May you want to type `samtools view /path/to/my.bam | grep ‘RG’`, right?

If I am wrong, Please tell me.

yours, gossie

From Geraldine_VdAuwera on 2017-11-16

No, the command looks specifically for lines that start with @RG, which is specific of header lines that define read groups. You could restrict the samtools view command to just the header, for efficiency, but aside from that the command is correct.

From Gossie on 2017-11-17

@Geraldine_VdAuwera,

Thanks for your reply. Yes, the command is good for header lines, but the same command is ok for read lines?

Following content from your original:

> In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

```

$ samtools view /path/to/my.bam | grep ‘^@RG‘

EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?><==?=?=>>?>><=?<:=?>?<==?=>:; EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85

```

From Geraldine_VdAuwera on 2017-11-17

Ah, well that’s a different purpose. What we’re talking about here is a quick check to see if read group definitions are present or not. If you actually want to verify that each read has a read group, you’re better off running ValidateSamFile which will report the results more sanely.

From Gossie on 2017-11-17

It is you that use the command `samtools view /path/to/my.bam | grep ‘^@RG` to match read lines with “RG” and It’s wrong, although it’s a small mistake. I advise you to read the section in your original.

From Sheila on 2017-11-20

@Gossie

Hi,

I just re-read the section, and it looks fine. I am not sure where you are seeing “match read lines with “RG” “?

-Sheila

From taoyan on 2017-12-16

@Geraldine_VdAuwera @Sheila Hi, when I run GATK HaplotypeCaller command I got an issue,but I got a result raw.vcf contain some information: ```

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://software.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: SAM/BAM/CRAM file /media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: Did not inflate expected amount

ERROR ------------------------------------------------------------------------------------------

And my command is:

!/bin/bash

java -Xmx32g -jar /home/taoyan/biosoft/gatk/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R /media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa \ -I /media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam \ --genotypingmode DISCOVERY \ -o /media/taoyan/cotton-zhangxianlong/data/seqs/SRR1580583.raw.snps.indels.vcf

first I aligh the sequence with BWA,then sort 、reorder and Mrkduplicate with picard.My bam is like that:

$ samtools view -H SRR1580583.marked.bam|grep '^@RG' @RG ID:583 LB:583 PL:Illumina PU:583 SM:583 $ samtools view -H SRR1580583.marked.bam |head @HD VN:1.5 SO:coordinate @SQ SN:A01 LN:99884700 M5:12e28e0311e760f4fcee0d72006a2ab3 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A02 LN:83447906 M5:8c578a1af12657fa12427cf7f9048498 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A03 LN:100263045 M5:e152719f9d4abdb6754e69e0c83b5478 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A04 LN:62913772 M5:a8356a76049c178286d25786388baab9 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A05 LN:92047023 M5:2e21804703461bdad8f79727d964ca4a UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A06 LN:103170444 M5:68d43f50938bd83b112c446b9a4dc410 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A07 LN:78251018 M5:41f31563d8dc05dcd7fdf31000c2d393 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A08 LN:103626341 M5:aaea5b5a9ca8f8ad6b894b381bcfee1f UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa @SQ SN:A09 LN:74999931 M5:eb20c9cea068cd638a2e35e4226cada8 UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBIGossypiumhirsutumv1.1.fa

So is there any suggestion how to fix it? I have run the command to validate it, and here it's the result:

$ java -jar ~/biosoft/picard/picard.jar ValidateSamFile I=/media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam MODE=SUMMARY 14:19:22.327 INFO NativeLibraryLoader - Loading libgklcompression.so from jar:file:/home/taoyan/biosoft/picard/picard.jar!/com/intel/gkl/native/libgklcompression.so [Sat Dec 16 14:19:22 CST 2017] ValidateSamFile INPUT=/media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam MODE=SUMMARY MAXOUTPUT=100 IGNOREWARNINGS=false VALIDATEINDEX=true INDEXVALIDATIONSTRINGENCY=EXHAUSTIVE ISBISULFITESEQUENCED=false MAXOPENTEMPFILES=8000 SKIPMATEVALIDATION=false VERBOSITY=INFO QUIET=false VALIDATIONSTRINGENCY=STRICT COMPRESSIONLEVEL=5 MAXRECORDSINRAM=500000 CREATEINDEX=false CREATEMD5FILE=false GA4GHCLIENTSECRETS=clientsecrets.json USEJDKDEFLATER=false USEJDKINFLATER=false [Sat Dec 16 14:19:22 CST 2017] Executing as taoyan@davinci on Linux 4.10.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Picard version: 2.15.0-SNAPSHOT INFO 2017-12-16 14:20:05 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:41s. Time for last 10,000,000: 41s. Last read position: A04:24,798,585 INFO 2017-12-16 14:20:45 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:21s. Time for last 10,000,000: 40s. Last read position: A07:76,082,514 INFO 2017-12-16 14:21:26 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:02s. Time for last 10,000,000: 40s. Last read position: A11:35,688,146 INFO 2017-12-16 14:22:06 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:02:42s. Time for last 10,000,000: 40s. Last read position: D02:35,617,652 INFO 2017-12-16 14:22:47 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:03:23s. Time for last 10,000,000: 40s. Last read position: D08:24,990,584 ERROR 2017-12-16 14:23:18 ValidateSamFile java.util.zip.DataFormatException: invalid distance too far back [Sat Dec 16 14:23:18 CST 2017] picard.sam.ValidateSamFile done. Elapsed time: 3.93 minutes. Runtime.totalMemory()=6561464320 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp ``` Thanks!

From Sheila on 2017-12-16

@taoyan

Hi,

Did you ever compress your file? It seems like something went wrong during compression or decompression. Can you go back through your steps and find the step where the error may have occurred? You can try running ValidateSamFile on each output after each step.

-Sheila

From taoyan on 2017-12-17

Sheila said: >

taoyan

> Hi,

> Did you ever compress your file? It seems like something went wrong during compression or decompression. Can you go back through your steps and find the step where the error may have occurred? You can try running ValidateSamFile on each output after each step.

> -Sheila

@Sheila yeah,I am going back through every step to see where cause the error.And anyway thank you very much for your answer!

Report abuse