How can I prepare a FASTA file to use as a reference?

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2012-10-02

This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial.

Why these steps are necessary

The GATK uses two files to access the reference and to safety-check that access: a .dict dictionary of the contig names and sizes, and a .fai fasta index file that allows efficient random access to the reference bases. You have to generate both files before you can use a Fasta file as reference.

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.
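One way to spot problem headers before building the companion files is sketched below. The helper name `headers_with_spaces` and the demonstration sequences are made up for illustration; only standard `grep` is used.

```shell
# List FASTA header lines that contain whitespace after the '>'.
# Hypothetical helper; it is not part of Picard, samtools, or GATK.
headers_with_spaces() {
    # -n prefixes each match with its line number in the FASTA file
    grep -n '^>.*[[:space:]]' "$1"
}

# Tiny demonstration file (made-up sequences)
demo=$(mktemp)
printf '>chr1\nACGT\n>2 dna:chromosome extra description\nGGCC\n' > "$demo"
headers_with_spaces "$demo"
rm -f "$demo"
```

Any header it reports is a candidate for renaming or trimming before you generate the .dict and .fai files.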

Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2% 0+0k 0+0io 2pf+0w

This produces a SAM-style header file describing the contents of our fasta file.

> cat Homo_sapiens_assembly18.dict
@HD VN:1.0 SO:unsorted
@SQ SN:chrM LN:16571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ SN:chr1 LN:247249719 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ SN:chr2 LN:242951149 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d
@SQ SN:chr3 LN:199501827 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ SN:chr4 LN:191273063 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2
@SQ SN:chr5 LN:180857866 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858
@SQ SN:chr6 LN:170899992 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583
@SQ SN:chr7 LN:158821424 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb
@SQ SN:chr8 LN:146274826 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ SN:chr9 LN:140273252 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b
@SQ SN:chr10 LN:135374737 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ SN:chr11 LN:134452384 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ SN:chr12 LN:132349534 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716
@SQ SN:chr13 LN:114142980 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587
@SQ SN:chr14 LN:106368585 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ SN:chr15 LN:100338915 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ SN:chr16 LN:88827254 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3e81884229e8dc6b7f258169ec8da246
@SQ SN:chr17 LN:78774742 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2a5c95ed99c5298bb107f313c7044588
@SQ SN:chr18 LN:76117153 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3d11df432bcdc1407835d5ef2ce62634
@SQ SN:chr19 LN:63811651 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2f1a59077cfad51df907ac25723bff28
@SQ SN:chr20 LN:62435964 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ SN:chr21 LN:46944323 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ SN:chr22 LN:49691432 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2041e6a0c914b48dd537922cca63acb8
@SQ SN:chrX LN:154913754 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ SN:chrY LN:57772954 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ SN:chr1_random LN:1663265 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cc05cb1554258add2eb62e88c0746394
@SQ SN:chr2_random LN:185571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ SN:chr3_random LN:749256 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cc571e918ac18afa0b2053262cadab6
@SQ SN:chr4_random LN:842648 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cab2949ccf26ee0f69a875412c93740
@SQ SN:chr5_random LN:143687 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:05926bdbff978d4a0906862eb3f773d0
@SQ SN:chr6_random LN:1875562 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d62eb2919ba7b9c1d382c011c5218094
@SQ SN:chr7_random LN:549659 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ SN:chr8_random LN:943810 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ SN:chr9_random LN:1146434 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ SN:chr10_random LN:113275 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:50be2d2c6720dabeff497ffb53189daa
@SQ SN:chr11_random LN:215294 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfc93adc30c621d5c83eee3f0d841624
@SQ SN:chr13_random LN:186858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:563531689f3dbd691331fd6c5730a88b
@SQ SN:chr15_random LN:784346 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bf885e99940d2d439d83eba791804a48
@SQ SN:chr16_random LN:105485 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ SN:chr17_random LN:2617613 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ SN:chr19_random LN:301858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:420ce95da035386cc8c63094288c49e2
@SQ SN:chr21_random LN:1679693 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:a7252115bfe5bb5525f34d039eecd096
@SQ SN:chr22_random LN:257318 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4f2d259b82f7647d3b668063cf18378b
@SQ SN:chrX_random LN:1719168 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f
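To check a dictionary at a glance, the contig names and lengths can be pulled out of the @SQ records. The sketch below assumes the fields are tab-separated, as they are in SAM-style headers on disk (the listing above shows them with spaces). The helper name `dict_contigs` is hypothetical.

```shell
# Print "name<TAB>length" for every @SQ record of a sequence dictionary.
# Hypothetical helper; it is not part of Picard or GATK.
dict_contigs() {
    awk -F '\t' '$1 == "@SQ" {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4)   # strip the "SN:" tag
            if ($i ~ /^LN:/) len = substr($i, 4)    # strip the "LN:" tag
        }
        print name "\t" len
    }' "$1"
}

demo=$(mktemp)
# Two records taken from the listing above (UR and M5 fields omitted for brevity)
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:chrM\tLN:16571\n@SQ\tSN:chr1\tLN:247249719\n' > "$demo"
dict_contigs "$demo"
rm -f "$demo"
```

The names and lengths printed this way should agree with the first two columns of the .fai index described in the next section.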

Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.

> samtools faidx Homo_sapiens_assembly18.fasta
108.446u 3.384s 2:44.61 67.9% 0+0k 0+0io 0pf+0w

This produces a text file with one record per line for each of the fasta contigs. Each record is of the form: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:

> cat Homo_sapiens_assembly18.fasta.fai
chrM 16571 6 50 51
chr1 247249719 16915 50 51
chr2 242951149 252211635 50 51
chr3 199501827 500021813 50 51
chr4 191273063 703513683 50 51
chr5 180857866 898612214 50 51
chr6 170899992 1083087244 50 51
chr7 158821424 1257405242 50 51
chr8 146274826 1419403101 50 51
chr9 140273252 1568603430 50 51
chr10 135374737 1711682155 50 51
chr11 134452384 1849764394 50 51
chr12 132349534 1986905833 50 51
chr13 114142980 2121902365 50 51
chr14 106368585 2238328212 50 51
chr15 100338915 2346824176 50 51
chr16 88827254 2449169877 50 51
chr17 78774742 2539773684 50 51
chr18 76117153 2620123928 50 51
chr19 63811651 2697763432 50 51
chr20 62435964 2762851324 50 51
chr21 46944323 2826536015 50 51
chr22 49691432 2874419232 50 51
chrX 154913754 2925104499 50 51
chrY 57772954 3083116535 50 51
chr1_random 1663265 3142044962 50 51
chr2_random 185571 3143741506 50 51
chr3_random 749256 3143930802 50 51
chr4_random 842648 3144695057 50 51
chr5_random 143687 3145554571 50 51
chr6_random 1875562 3145701145 50 51
chr7_random 549659 3147614232 50 51
chr8_random 943810 3148174898 50 51
chr9_random 1146434 3149137598 50 51
chr10_random 113275 3150306975 50 51
chr11_random 215294 3150422530 50 51
chr13_random 186858 3150642144 50 51
chr15_random 784346 3150832754 50 51
chr16_random 105485 3151632801 50 51
chr17_random 2617613 3151740410 50 51
chr18_random 4262 3154410390 50 51
chr19_random 301858 3154414752 50 51
chr21_random 1679693 3154722662 50 51
chr22_random 257318 3156435963 50 51
chrX_random 1719168 3156698441 50 51
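To illustrate how these five columns locate a base, here is a minimal sketch of the offset arithmetic. It assumes 1-based positions and that every sequence line of the contig (except possibly the last) has the same length, which is what the basesPerLine and bytesPerLine columns describe. The function name `fai_offset` is made up for this example.

```shell
# Byte offset of a 1-based position within a contig, from its .fai record.
# Arguments: location basesPerLine bytesPerLine pos
fai_offset() {
    location=$1; bases_per_line=$2; bytes_per_line=$3; pos=$4
    zero_based=$(( pos - 1 ))
    full_lines=$(( zero_based / bases_per_line ))    # complete lines before pos
    within_line=$(( zero_based % bases_per_line ))   # bases into the current line
    echo $(( location + full_lines * bytes_per_line + within_line ))
}

# chr1 record from the index above: location 16915, 50 bases and 51 bytes per line
fai_offset 16915 50 51 1     # first base of chr1
fai_offset 16915 50 51 51    # first base on the second line of chr1
```

With the chr1 record this prints 16915 and 16966: each full line of 50 bases costs 51 bytes because of the newline character.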

Tags:

fastareference, intro, official, inputs, basic, analyst

Updated on 2013-09-13

From weihua on 2013-02-14

Hi. I generated a fasta file (sequences of a gene) and followed what you say. Finally, I did get the whole pipeline through (got the VCF file). However, I lost the coordinates. This is the fasta format:

>1 dna:chromosome chromosome:GRCh37:1:196621008:196716634:1
ACAGCATTAACATTTAGTGGGAGTGCAGTGAGAATTGGGTTTAACTTCTGGCATTTCTGGGCTTGTGGCT….

I loaded it into IGV, and the reads started from coordinate 1 instead of 196621008. Am I missing something? I think so, but I could not find the answer on Google. Does anyone have similar problems?

From Geraldine_VdAuwera on 2013-02-14

If you generated a custom reference with the sequence of just your gene, then this is normal. All the position counting will be done from the start of the sequence in the file, not from the original coordinates of the gene in the genome. If you want the calls in the VCF to have the true genome position coordinates, you should call them using the full genome of your organism. Otherwise you can simply calculate what they should be by adding the call position to the original start position of your gene in the genome. Make sense?
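A small sketch of that conversion follows. It assumes both the call position and the gene start are 1-based, in which case local position 1 is the gene start itself and the plain sum has to be reduced by one. The function name is hypothetical; the start coordinate is the one in the FASTA header quoted above.

```shell
# Convert a position on a gene-only reference to a genome position.
# Assumes both coordinate systems are 1-based, so local position 1 maps to the gene start.
to_genome_pos() {
    gene_start=$1; local_pos=$2
    echo $(( gene_start + local_pos - 1 ))
}

# Gene start taken from the FASTA header quoted above (GRCh37, chromosome 1)
to_genome_pos 196621008 1
to_genome_pos 196621008 100
```

This prints 196621008 and 196621107. If your start coordinate is 0-based instead, drop the `- 1`.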

From weihua on 2013-02-15

Thank you for the reply. It is quite helpful.

From weihua on 2013-02-15

> @weihua said:

> Thank you for the reply. It is quite helpful.

And I assume that, without coordinate data, I cannot do local realignment (or any procedure that involves coordinates) using files in the bundle.

From Sophia on 2013-05-31

What can be done about references containing N’s? Can they be used in GATK, e.g. with the variant calling and variant annotation walkers?

From Geraldine_VdAuwera on 2013-05-31

N’s should be fine, they will just be skipped.

From frankib on 2014-01-16

Once you created the two files (.dict and .fai) which one do you input in the command line to use the Realigner target creator?

From Geraldine_VdAuwera on 2014-01-16

@frankib, neither of those files need to be specified in the command line. Please see the example commands given in the documentation for the tools you want to run.

From frankib on 2014-01-17

Ok Thank you.

From timwartewig on 2014-01-24

Hello! Please could you tell me how to get a sorted *.dict for my mm9.fasta reference file? I got it from http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/

Thanks a lot in advance.

From Geraldine_VdAuwera on 2014-01-24

@timwartewig, you can do it with Picard tools as described above. For more details, please see the Tutorials section of the Guide, or check out the Picard project documentation website.

From timwartewig on 2014-01-24

Thank you Geraldine. Sorry that I had not written what I had already tried. I used Picard: `java -jar CreateSequenceDictionary.jar R=mm9.fa O=mm9.dict`, which produced the dict file. My BAM file headers/contigs were sorted with Picard SortSam followed by samtools reheader. However, this is the error: ##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

ERROR reads contigs = [chr10, chr11, chr12, chr13, chr13_random, chr14, chr15, chr16, chr16_random, chr17, chr17_random, chr18, chr19, chr1, chr1_random, chr2, chr3, chr3_random, chr4, chr4_random, chr5, chr5_random, chr6, chr7, chr7_random, chr8, chr8_random, chr9, chr9_random, chrM, chrUn_random, chrX, chrX_random, chrY, chrY_random]

ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chrX, chrY, chrM, chr13_random, chr16_random, chr17_random, chr1_random, chr3_random, chr4_random, chr5_random, chr7_random, chr8_random, chr9_random, chrUn_random, chrX_random, chrY_random]

From Geraldine_VdAuwera on 2014-01-27

Oh I see. Picard’s ReorderSam should fix that for you, see http://www.broadinstitute.org/gatk/guide/article?id=58
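To see this kind of mismatch before running GATK, the two contig orders can be compared directly. The sketch below is a hypothetical helper working on two plain lists of contig names, one per line. It checks only the shared contigs, in the spirit of the error message, and is not GATK's own validation code.

```shell
# Succeed if the contigs shared by two lists appear in the same relative order.
# Each input file holds one contig name per line. Hypothetical helper.
same_relative_order() {
    a_shared=$(grep -Fxf "$2" "$1")   # shared names, in the order of list 1
    b_shared=$(grep -Fxf "$1" "$2")   # shared names, in the order of list 2
    [ "$a_shared" = "$b_shared" ]
}

reads=$(mktemp); ref=$(mktemp)
printf 'chr10\nchr11\nchr1\nchr2\n' > "$reads"   # lexicographic, as in the BAM above
printf 'chr1\nchr2\nchr10\nchr11\n' > "$ref"     # karyotypic, as in the reference
if same_relative_order "$reads" "$ref"; then
    echo "contig order matches"
else
    echo "contig order differs"
fi
rm -f "$reads" "$ref"
```

For real files, the reference list can be taken from the first column of the .fai index (`cut -f1`), and the BAM list from the SN fields of the @SQ lines printed by `samtools view -H`.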

From frankib on 2014-02-04

I don’t understand why I’m able to run the CreateSequenceDictionary without problem but when I run the faidx tool I got the following error:

open: No such file or directory

[_razf_open] fail to open human_g1K_v37.fasta

[fai_build] fail to open the FASTA file human_g1K_v37.fasta

From Geraldine_VdAuwera on 2014-02-04

@frankib, what’s your command line?

From huilin on 2015-02-11

> @Geraldine_VdAuwera said:

> frankib, neither of those files need to be specified in the command line. Please see the example commands given in the documentation for the tools you want to run.

If neither .fai nor .dict is needed in the command line, why did we generate them in the first place?

From Geraldine_VdAuwera on 2015-02-11

@huilin Those files are needed by the tools. You don’t write them in the command line because GATK automatically finds them.
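As a rough illustration of what "automatically finds them" means in practice: judging from the error messages quoted elsewhere in this thread, the .dict name replaces the FASTA extension, the .fai name is appended to the full FASTA file name, and both are expected next to the reference. The helper names and the example path below are hypothetical.

```shell
# Companion file names, as inferred from the error messages in this thread:
# the .dict replaces the FASTA extension, the .fai is appended to the full file name.
expected_dict() { echo "${1%.*}.dict"; }
expected_fai()  { echo "$1.fai"; }

# Report whether both companion files sit next to the reference.
check_reference() {
    status=0
    for f in "$(expected_dict "$1")" "$(expected_fai "$1")"; do
        if [ -f "$f" ]; then echo "found:   $f"; else echo "missing: $f"; status=1; fi
    done
    return $status
}

expected_dict /no/such/dir/genome.fasta
expected_fai  /no/such/dir/genome.fasta
check_reference /no/such/dir/genome.fasta || echo "generate the missing files first"
```

Running a check like this before a long job is a cheap way to catch a misplaced or misnamed .dict or .fai file.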

From aishin88 on 2015-06-10

Hi,

I am trying to use GATK to convert HapMap release #28 data to VCF. I performed the steps you mentioned here and checked the resulting files, and they look like the examples here. Then I wrote the command in the same way as on the site, but it gives this error: I/O error loading or writing tribble index file

The files are in text format, and it seems the program tries to make an index of the input data but can't. My command: java -jar /softw/GenomeAnalysisTK.jar -T VariantsToVCF -R /mnt/NAS/share/gatk_bundle/2.8/hg18/Homo_sapiens_assembly18.fasta -o output.vcf --variant:RawHapMap /mnt/NAS/projects/2015_ayshin_sift/HAPMAP/hapmap#28/genotypes_chr8_CHB_r28_nr.b36_fwd.txt

thanks

From Sheila on 2015-06-10

@aishin88

Hi,

It looks like the --variant argument cannot accept text files as input. Have a look at the documentation for acceptable file formats: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToVCF.php#--variant

-Sheila

From aishin88 on 2015-06-10

Thank you @Sheila. I looked at that part; the format is the same, just in gzip form. I tried the command with something like genotypes_chr1_ASW_r27_nr.b36_fwd.txt.gz (in gz form) and it gives an error saying: an index is required, but not found. Does this mean that I should use samtools to make an index for the input file, the way I did for the reference?

thank you

From aishin88 on 2015-06-12

Hi again,

I have a problem using hapmap data as input. I use the RawHapMap in gzip form as shown in the site : https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_utils_codecs_hapmap_RawHapMapCodec.php

It gives this error: An index is required, but none found. How am I supposed to make an index of HapMap data?

thank you

From Sheila on 2015-06-12

@aishin88

Hi,

You can use Tabix to generate .gz file indices. http://www.htslib.org/doc/tabix.html

-Sheila

From aishin88 on 2015-06-12

ok, I will try it now

thank you

From d3abb7c9 on 2015-06-24

When I run this command:

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R /genomes/glaberrima/Oryza_glaberrima.fa -nt 24 -I Oryza_glaberrima-deduped.bam -o target_intervals.list

I get this error message:

ERROR MESSAGE: Fasta dict file /genomes/glaberrima/Oryza_glaberrima.dict for reference /genomes/glaberrima/Oryza_glaberrima.fa does not exist.

I have created the dictionary file from the fasta and it is in my current working directory, but GATK is looking for it in the `/genomes/` directory. Can you make it so that you can specify the location of the dictionary on the command line? Or alternatively you could make it so that it looks for the dictionary in the current working directory.

From Sheila on 2015-06-24

@d3abb7c9

Hi,

Unfortunately, the .dict file has to be in the same directory as the .fa file.

-Sheila

From d3abb7c9 on 2015-06-24

Thanks for your reply. Our genomes are stored in `/genomes` so that they’re not duplicated in everyone’s `/home` directory wasting space. The `/genomes` directory is not user writable. Forcing the .dict file to be in the same directory as the .fa seems pretty inflexible. I hope you would consider changing this.

From Geraldine_VdAuwera on 2015-06-24

@d3abb7c9 That’s not up to us — this functionality comes from the htsjdk library, which is a project that falls under the samtools organization.

You can solve this problem easily by creating a sequence dictionary in your /genomes directory so that it will be available for all users of your system. If you don’t have admin rights to this directory, just let your sysadmin know that the .dict file is required for analysis and should be included in that directory. This is not an exotic requirement; other tools also make use of the sequence dictionary.

From mbxat1 on 2015-08-04

Hello,

I tried to run RealignerTargetCreator but encountered this error message (Fasta dict file /home/mbxat1/African.cattle.project/Mapping/Reference.data/Bos_taurus.UMD3.1.dna.toplevel.dict for reference /home/mbxat1/African.cattle.project/Mapping/Reference.data/Bos_taurus.UMD3.1.dna.toplevel.fa does not exist). My .fa and .dict files are in one directory, so I don't know what could be wrong. Your help will be appreciated.

Thank you

From Sheila on 2015-08-04

@mbxat1

Hi,

Can you tell me which version of GATK you are using? Also, please post your exact command line.

Thanks,

Sheila

From mbxat1 on 2015-08-05

@ Sheila, thank you for your response.

I am using the current GATK version (GenomeAnalysisTK-3.4-0). my command line as follows:

logfile=${DATOUT}/SampleKN002_RealignerTargetCreator.error

java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/mbxat1/African.cattle.project/Bos_taurus.UMD3.1.dna.toplevel.fa -I ${DATOUT}/SampleKN002_mkdup.bam -o ${DATOUT}/SampleKN002_mkdup_intervals.list --filter_mismatching_base_and_quals --fix_misencoded_quality_scores -nt 4 2> >(tee "$logfile")

Thank you

From Geraldine_VdAuwera on 2015-08-05

Are you sure the dict file you have follows the required naming convention? It should have exactly the same name as the name mentioned in the error. If no, change the name to that.

If yes, the other possibility is that the file is somehow damaged. Just delete it and create a new one.

From mbxat1 on 2015-08-05

@ Geraldine,

thank you for your response. i might have been able to overcome the initial problem, i had to delete my reference file and download a new one from ensembl.org.

the error now is (Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool)

any ideas please?

Thanks

From Geraldine_VdAuwera on 2015-08-05

The mis-encoded base qualities error has been covered many times in the forum… Are you using the `--fix_misencoded_quality_scores` argument for a specific reason? Or is this a command you inherited from someone else?

From mbxat1 on 2015-08-05

You are right Geraldine, I inherited the command, but I have removed the "--fix_misencoded_quality_scores" argument and I am able to proceed without any error.

Thank you for your time.

From Geraldine_VdAuwera on 2015-08-05

My advice is to always check the purpose of every argument when someone else gives you commands, to avoid problems like this, or version-related problems. Trust no one ;)

From mbxat1 on 2015-08-06

well noted, thank you

From carrigj on 2016-02-11

Hi, I keep getting this error and i don’t know why, any help would be greatly appreciated

“ ERROR MESSAGE: Fasta index file /Users/joannecarrig1/Fabianii/combined_ref.fasta.fai for reference /Users/joannecarrig1/Fabianii/combined_ref.fasta does not exist.”

note the fasta file was indexed

From Geraldine_VdAuwera on 2016-02-12

@carrigj Make sure the fasta index is in the same directory.

From sumedhagarg on 2016-07-02

I have managed to create .fai file for my ref sequence but not .dict file, despite samtools running the command. What could be going wrong?

Thanks

Sumedha

cmd screenshot.png

From Sheila on 2016-07-02

@sumedhagarg

Hi Sumedha,

You will need to use Picard’s [CreateSequenceDictionary](https://broadinstitute.github.io/picard/command-line-overview.html#CreateSequenceDictionary).

-Sheila

From sumedhagarg on 2016-07-03

@Sheila

Thanks Sheila. Yes, it worked for a small reference file, but getting an error for much bigger file with whole human gDNA fasta file, at a particular line, as attached.

Sumedha

CreateSeqDict error.png

From Sheila on 2016-07-03

@sumedhagarg

Hi Sumedha,

It looks like an issue with your index file. Can you try deleting it and re-indexing the reference?

Thanks,

Sheila

From sumedhagarg on 2016-07-03

@Sheila

Hi Sheila

I tried that but had the same error again. My reference file is from ensembl (ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz). Is there way to address this issue?

Thanks again!

Sumedha

From Sheila on 2016-07-03

@sumedhagarg

Hi Sumedha,

Can you try unzipping the FASTA file? Maybe the issue is that the tools are not working with .gz files.

-Sheila

From sumedhagarg on 2016-07-03

@Sheila

Hiya,

I am using unzipped file already.

Thanks

Sumedha

From sumedhagarg on 2016-07-04

@Sheila @Geraldine_VdAuwera

Would it be possible for you to index this file for me that I could download? I am really stuck as can’t proceed until I have it working.

Many thanks

Sumedha

From Geraldine_VdAuwera on 2016-07-05

Sorry, we can’t provide that level of support for a reference file that we didn’t produce ourselves.

Try doing both the indexing and dictionary creation using Picard tools instead of samtools.

From sumedhagarg on 2016-07-05

Thanks a lot for your advice. Please could you tell me the tool name for indexing fasta file in picard?

From Will_Gilks on 2016-07-06

@sumedhagarg

I use this code for making the various reference genome helper-files. Ideally the helper-files would only be made once by the same group that assembled the genome. This would prevent errors caused by people using different methods, and prevent someone having to spend time making their own files.

## Define variables
my_fasta=Spinius_nastius.fa
full_path=god/baby_jesus/work/secret_plans/reference_sequences/Spinius_nastius/

## Make index with BWA http://bio-bwa.sourceforge.net/
module load bio/1.15
bwa index -a bwtsw ${my_fasta}
module unload bio/1.15

## Create index with SAMtools http://www.htslib.org/
module load samtools/1.0
samtools faidx ${full_path}${my_fasta}
module unload samtools/1.0

## Build Genome and Hash files with Stampy http://www.well.ox.ac.uk/~gerton/README.txt
module load stampy/1.0.23
stampy.py -G ${my_fasta} ${my_fasta}
stampy.py -g ${my_fasta} -H ${my_fasta}
module unload stampy/1.0.23

## Create sequence dictionary with Picard tools. Note, this assumes fasta file suffix is .fa
module load picard-tools/1.77
CreateSequenceDictionary \
R= ${full_path}${my_fasta} \
O= ${full_path}${my_fasta%.fa}.dict \
TMP_DIR=${full_path}
module unload picard-tools/1.77

From sumedhagarg on 2016-07-06

@Will_Gilks

Thanks a lot for responding. I have a different issue now. Would you be able to help with that please? I have posted it at : http://gatkforums.broadinstitute.org/gatk/discussion/1328

Sumedha

From twotwo on 2017-09-29

Hi, I want to create the reference, but cannot do it.

module load gatk/3.6

module load java/8.121

java -jar CreateSequenceDictionary.jar R= N1_DHE02016-1_HW5WCCCXX_L2_1.fq.gz O= Homo_sapiens_assembly18.dict

Error: Unable to access jarfile CreateSequenceDictionary.jar

Do you have any comments on that?

From Sheila on 2017-10-02

@twotwo

Hi,

Do you have [Picard](http://broadinstitute.github.io/picard/) installed? CreateSequenceDictionary is a Picard tool, not a GATK tool.

-Sheila

From Kishor_Tribhuvan on 2019-03-09

I want to generate a statistics file for the GATK pipeline using parse_metrics.sh (in house). Can anyone give details of this command?

Step 22 Compile Statistics

Tool parse_metrics.sh (in house)

Input alignment_metrics.txt,

insert_metrics.txt,

raw_snps.vcf,

filtered_snps.vcf,

raw_snps_recal.vcf,

filtered_snps_final.vcf,

depth_out.txt

Output report.csv

Notes A single report file is generated with summary statistics for all libraries processed containing the following pieces of information:

# of Reads
# of Aligned Reads
% Aligned
# Aligned Bases
Read Length
% Paired
Mean Insert Size
# SNPs, # Filtered SNPs
# SNPs after BQSR, # Filtered SNPs after BQSR
Average Coverage
