created by Geraldine_VdAuwera
on 2012-07-26
NOTE: we recently made some changes to the bundle on the FTP server; see the Resource Bundle page for details. In a nutshell: minor directory structure changes, and Hg38 bundle now mirrors the cloud version.
See the Resource Bundle page. In a nutshell, there's a Google Cloud bucket and an FTP server. The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too.
This contains all the resource files needed for Best Practices short variant discovery in whole-genome sequencing data (WGS). Exome files and itemized resource list coming soon(ish).
All resources below this are available only on the FTP server, not on the cloud.
Additionally, these files all have supplementary indices, statistics, and other QC data available.
Includes the UCSC-style hg19 reference along with all lifted over VCF files.
Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.
Also includes a chain file to lift over to b37.
Includes the 1000 Genomes pilot b36 formatted reference sequence (humanb36both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. We apologize for any inconvenience this may cause.
Also includes a chain file to lift over to b37.
Updated on 2016-12-01
From Geraldine_VdAuwera on 2016-08-17
Questions and comments up to August 2016 have been moved to an archival thread here:
http://gatkforums.broadinstitute.org/discussion/4561/questions-about-the-resource-bundle
http://gatkforums.broadinstitute.org/gatk/discussion/8175/questions-about-the-resource-bundle-continued
From shorebean on 2016-10-13
I’m trying to download the resource bundle as below.
$ lftp -u gsapubftp-anonymous ftp.broadinstitute.org
But I cannot access the files on the FTP server. Is there a problem with the FTP server?
From Sheila on 2016-10-14
@shorebean
Hi,
I think our server was down for a bit, but it should be up and running now :smiley:
-Sheila
From heskett on 2016-10-19
Once you have the bundle:
What do I need to do to make each file usable by GATK? Decompress, sort, index, compress?
From Sheila on 2016-10-21
@heskett
Hi,
Usually the files are usable “as is” from the bundle. GATK accepts compressed .gz files. However, I am not sure what is going on with your other dbSNP issue. The tools should accept the .gz VCF.
-Sheila
P.S. On second thought, you may just try re-downloading the index file. Couldn’t hurt! :smile:
From shorebean on 2016-10-27
@Sheila
Thank you. I can access it now.
From Rash on 2016-11-25
@Sheila @Geraldine_VdAuwera Is there any resource bundle available for the mouse genome (mm9 or mm10)? Many thanks in advance,
Rahel
From Eugenie on 2016-11-28
> Rash said:
> @Sheila @Geraldine_VdAuwera Is there any resource bundle available for the mouse genome (mm9 or mm10)? Many thanks in advance,
> Rahel
I am also interested :)
From kerbs on 2016-11-29
The server is down. Please make the files accessible again.
From Sheila on 2016-11-29
@Rash @Eugenie
Hi,
Sorry, we only provide resources for humans.
-Sheila
From Sheila on 2016-11-29
@kerbs
Hi,
Can you try again? The server is finicky, but should work if you keep trying. Sorry for the inconvenience.
-Sheila
From kerbs on 2016-11-30
Thanks, I was now able to download the necessary files.
From suhye on 2017-01-04
@Sheila
Hello, I’m working on this step with mouse (GRCm38, mm10) WES data.
But I can’t find known INDEL/SNP sites for mouse to run GATK RealignerTargetCreator.
How do I get these VCF files? (I have the dbSNP VCF for Mus musculus.)
From Sheila on 2017-01-05
@suhye
Hi,
You will need to do some research on your own to find those, as we provide help with human resource files. Perhaps someone in the mouse field will jump in here. However, if you have the dbSNP VCF, you can use that in RealignerTargetCreator. Also note, Indel Realignment is no longer needed if you are using the latest Best Practices. Have a look at [this blog post](https://software.broadinstitute.org/gatk/blog?id=7847) for more information.
-Sheila
From suhye on 2017-01-07
@Sheila
Sheila, thank you so much for your reply. But I got an error when running Mutect2 with the mouse dbSNP VCF. My customized mouse dbSNP VCF file has only chromosomes 1~19, X and Y, but I got this error message (captured picture attached). So I added some contig information like this > "##contig="
But I can't solve this problem. How can I solve it? Can I remove some unknown sites (e.g. chrJH~~, chrGL~~) from the reference fasta file?
From Geraldine_VdAuwera on 2017-01-07
The extra contigs present in the reference are not the cause of the problem. As stated in the error message, the contigs are not in the same order in the reference and in the dbsnp file. So you must reorder the dbsnp file. See the link provided in the error message; it goes to a document that explains how to do it.
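For readers who want to see what "same contig order" means in practice, here is a minimal Python sketch. It is an illustration only, not the procedure from the linked document: the function names `read_dict_contigs` and `sort_vcf_lines` are hypothetical, it assumes an uncompressed, tab-delimited VCF that fits in memory, and it does not rewrite any `##contig` header lines.

```python
def read_dict_contigs(dict_lines):
    """Return contig names, in order, from the @SQ lines of a sequence dictionary (.dict)."""
    contigs = []
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    contigs.append(field[3:])
    return contigs


def sort_vcf_lines(vcf_lines, contig_order):
    """Keep header lines first, then order records by reference contig order and position.

    Raises KeyError if a record uses a contig that is absent from contig_order.
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    header = [line for line in vcf_lines if line.startswith("#")]
    records = [line for line in vcf_lines if line.strip() and not line.startswith("#")]

    def sort_key(line):
        chrom, pos = line.split("\t", 2)[:2]
        return (rank[chrom], int(pos))

    return header + sorted(records, key=sort_key)
```

Note that the order comes from the dictionary, not from alphabetical or numeric sorting of the names, so `chr10` follows `chr2` only if the reference lists them that way.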
From suhye on 2017-01-10
@Geraldine_VdAuwera
Thank you very much for the advice!
I made a fasta dict file with Picard tools and ran Picard SortVcf with this fasta dict,
but I got a similar error message.
I used the same reference fasta file, and the .dict file was made from that reference fasta…
How can I solve this problem?
From Geraldine_VdAuwera on 2017-01-10
Hi there, the Picard SortVcf does not regenerate the index file correctly, so you need to delete the index of the dbsnp vcf file. GATK will regenerate it for you, then it should work.
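A small Python sketch of the "delete the index" step, for anyone scripting this. It assumes the index sits next to the VCF and is named by appending `.idx` (or `.tbi` for a bgzipped file) to the VCF file name; the helper name `remove_stale_indexes` is hypothetical, so check the actual file names in your directory first.

```python
from pathlib import Path


def remove_stale_indexes(vcf_path, suffixes=(".idx", ".tbi")):
    """Delete index files that sit next to a VCF; return the paths that were removed.

    The VCF itself is never touched. Suffixes are an assumption about index naming.
    """
    removed = []
    for suffix in suffixes:
        index_path = Path(str(vcf_path) + suffix)
        if index_path.is_file():
            index_path.unlink()
            removed.append(index_path)
    return removed
```

Calling `remove_stale_indexes("dbsnp.sorted.vcf")` (a made-up file name) would remove `dbsnp.sorted.vcf.idx` if it exists and leave the VCF in place.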
From suhye on 2017-01-12
@Geraldine_VdAuwera
Oh, thank you so much! With your advice I was able to troubleshoot it :smile:
From mmokrejs on 2017-01-27
Hi, I just realized that the contents of ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz are modified. For example, I am just guessing that somebody intentionally masked the homologous portions of chrY with N’s (just compare to CM000686.2 from https://www.ncbi.nlm.nih.gov/nuccore/CM000686.2 )? What other modifications should I expect in the file? I thought I was getting a plain GRCh38 build with:
[chr1, chr2, …, chrX, chrY, chrM, chr1_KI270706v1_random, …, chr1_KI270762v1_alt, chrEBV, chrUn_KN707606v1_decoy, HLA-A*01:01:01:01, HLA-DRB1*16:02:01]
instead of original
[1, 2, …, X, Y, MT]
Please document the contents of the bundle, both on the GATK website and in a README file inside the tarball. I am especially curious to read which regions were masked and why/how. From a quick glance it does not seem it was a simple low-complexity based masking approach, but who knows … Also, it would be nice if you commented on the interpretation of the user’s alignment results. IMHO reads from male samples will be mapped to the X chromosome, causing false/distorted SNPs and failing in-silico sex checks. I wonder what I missed in reading the docs and why I ever picked ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz as my reference.
Thank you
From shlee on 2017-01-31
Hi @mmokrejs,
The [Reference Genome Components](http://gatkforums.broadinstitute.org/gatk/discussion/7857/reference-genome-components) article explains briefly the difference between analysis set references and the reference set, e.g. that IGV displays. The section on PAR regions should interest you.
The reference set provided in the resource bundle is an analysis set and should include the HLA, decoy, alt and EBV contigs.
The contig nomenclature you refer to, e.g. chr1 vs. 1, are vestiges of reference build 37 (GRCh37 and hg19). GRCh38 consolidates the nomenclature to use `chr`. I’m not sure which file you are referring to in our bundle that shows the [1, 2, …, X, Y, MT] naming. Can you be more specific?
Finally, I believe folks here who were originally involved in preparing the bundle’s reference set are preparing a README file to include in the bundle as well as to update the resource doc.
From mmokrejs on 2017-01-31
Hi @shlee, thank you for your answer. I will answer just the easy part for the moment.
> I’m not sure which file you are referring to in our bundle that shows the [1, 2, …, X, Y, MT] naming. Can you be more specific?
Well, none in your bundle. I meant that the original GRCh38 contains these (I put a link to CM000686.2 from https://www.ncbi.nlm.nih.gov/nuccore/CM000686.2 because that seems to be the original sequence).
> GRCh38 consolidates the nomenclature to use chr.
Aha, that I did not realize. I had to edit the chromosome names back and forth when using hg38 but that was because of other VCF files used for annotations still using [1, 2, MT] namings although based on hg38?
From shlee on 2017-02-01
Hi again @mmokrejs,
If resource VCFs are using [1, 2, MT] nomenclature, then that is an effect from simply lifting over the coordinates to the new assembly without regard to the naming nomenclature. I don’t believe this should be the case for any of our resource bundle files. For information on GRCh38, please refer to the original [Genome Reference Consortium (GRC)](http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/) site. Please use our forum to ask about the resources GATK provides. For resources from other sites, please ask your questions to those sites.
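For third-party VCFs that are already on GRCh38 coordinates but still use the [1, 2, MT] names, the renaming discussed above can be sketched in Python as follows. This is an illustration under assumptions, not a GATK-provided tool: the helper names `to_chr_name` and `rename_vcf_line` are hypothetical, only chromosomes 1-22, X, Y and MT are mapped, and every other contig name is left untouched. It changes names only and does nothing about coordinates.

```python
import re

# Primary chromosomes that simply gain a "chr" prefix; MT is handled separately.
_PRIMARY = {str(number) for number in range(1, 23)} | {"X", "Y"}


def to_chr_name(name):
    """Map a [1, 2, ..., X, Y, MT] style name to the chr-prefixed style; leave others as-is."""
    if name == "MT":
        return "chrM"
    if name in _PRIMARY:
        return "chr" + name
    return name


def rename_vcf_line(line):
    """Rename the contig in a ##contig header line or in a tab-delimited record line."""
    if line.startswith("##contig=<"):
        return re.sub(r"ID=([^,>]+)", lambda m: "ID=" + to_chr_name(m.group(1)), line, count=1)
    if not line.strip() or line.startswith("#"):
        return line
    chrom, rest = line.split("\t", 1)
    return to_chr_name(chrom) + "\t" + rest
```

Any index file for the renamed VCF would need to be regenerated afterwards.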
From CarlosBorroto on 2017-05-03
> The cloud bucket only has Hg38 resources; the resources for other builds are currently only available through the FTP server. Let us know if you want them on the Cloud too.
Would it be possible to get b37 in the cloud bucket? We are heavily dependent on ExAC/gnomAD and until those resources are available on Hg38 we won’t be able to migrate.
Thanks!
Carlos
From shlee on 2017-05-03
Hi @CarlosBorroto, I’ll get back to you on this.
From shlee on 2017-05-03
@CarlosBorroto, you are in luck! Someone on the team has already placed this in a cloud bucket (for public sharing) and I’ve found the addresses for you.
- The bucket address, e.g. for use with gsutil is `gs://gatk-legacy-bundles/b37`.
- The console view is at https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37.
From CarlosBorroto on 2017-05-25
@shlee thanks!
From obigbando on 2017-05-29
Hi,
I noticed that the HG38 bundle doesn’t include 1000G_phase1.indels.vcf.gz, that was included in HG19 bundle. For genome version HG19, we use dbsnp_138.hg19.vcf, Mills_and_1000G_gold_standard.indels.hg19.vcf, and 1000G_phase1.indels.hg19.sites.vcf as knownSites datasets for running BaseRecalibrator. So would you provide the corresponding indels.hg38.sites.vcf.gz in the future?
Thanks
From Sheila on 2017-06-02
@obigbando
Hi,
You can use the Mills_and_1000G_gold_standard.indels.hg38.vcf.gz and Homo_sapiens_assembly38.known_indels.vcf.gz as a replacement for the three original indel files.
-Sheila
From rpandya on 2017-10-04
Hi,
To ensure compatibility over time in our pipelines, we’d like to know what are the specific versions/patches of each of the references included in each bundle. Also, how often they are updated, and how updates are announced so that we can track them. Is this information available somewhere?
Also, hg38 doesn’t seem to include md5 hashes for the files so we can verify our downloads – could you add those?
Thanks,
Ravi
From Sheila on 2017-10-05
@rpandya
Hi Ravi.
I am pretty sure our bundle files are provided “as is” for now. This may change in GATK4, but I need to check with the team. Someone will get back to you soon.
-Sheila
From rpandya on 2017-10-11
Thanks Sheila. We understand that they are “as-is”, but it would be good to be able to tell users of our pipeline exactly what they are getting (e.g. patch release), and know when the bundle changes. Also, the md5 is a good sanity check to ensure our download was correct – you already provide those on hg19 & b37, so hopefully it should be easy to add on hg38 when you get a chance.
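As an illustration of the md5 sanity check described above, here is a minimal Python sketch using the standard `hashlib` module. It assumes the accompanying `.md5` file starts with the hex digest (optionally followed by a file name); that layout and the helper names `md5_of_file` and `matches_md5` are assumptions, not something the bundle documents.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, md5_file_text):
    """Compare a file against the first whitespace-separated token of an .md5 file's text.

    Raises IndexError if md5_file_text is empty.
    """
    expected = md5_file_text.split()[0].lower()
    return md5_of_file(path) == expected
```

A mismatch most often points to an interrupted or corrupted download, in which case re-downloading the file is the first thing to try.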
Ravi
From Sheila on 2017-10-12
@rpandya
Hi Ravi,
Thanks for the suggestions. I will see what the team says about prioritizing this while we prepare for the GATK4 release :smile:
-Sheila
From cdz512 on 2017-10-19
Hi,
I am trying to access the ftp server for HapMap data in VCF format. However, I am having difficulties logging in via the instructions on the Resource Bundle page (and this link ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/). Is the server down?
Thanks
From jianxinwang on 2017-10-20
I’m having the same problem as of today.
From pranitha on 2017-10-20
I have the same problem too. I’m trying to connect for hg19 annotations.
From pranitha on 2017-10-23
@Sheila @Geraldine_VdAuwera Is there any location other than FTP to download the GATK resource bundle for hg19?
From cdz512 on 2017-10-24
@Sheila @shlee @Geraldine_VdAuwera Can you help us with accessing the resource bundle through FTP? I’d like to get the data for HapMap genotypes and sites VCFs. Thanks!
From Sheila on 2017-10-24
@cdz512 @pranitha @jianxinwang
Hi everyone,
Sorry for the issues. I am having trouble too. I will make a note for the team and get back to you. Unfortunately, if you are using hg19, the only place to find the resources is through the FTP. If you are using hg38 or b37, you can try the Google Cloud.
-Sheila
From Sheila on 2017-10-25
@cdz512 @pranitha @jianxinwang
Hi everyone,
Sorry again about that. The IT team has fixed the issue :smile:
-Sheila
From tony on 2017-10-28
Hi,
Is there any documentation on how the various indel files in the bundle are built and which sources feed their content?
For instance, are the 1000G phase3.v4 indels included in the Mills_and_1000G_gold_standard file?
Is the “known_indels” file in hg38 the equivalent of the 1000G phase1 indels in hg37?
Many thanks
Anthony
From cdz512 on 2017-10-28
Thank you!
From Sheila on 2017-10-28
@tony
Hi Anthony,
Unfortunately, right now the files are provided “as-is”, and we don’t have much documentation on the contents of them. However, I have a ticket for improving the documentation for this in GATK4.
-Sheila
P.S. I will add your questions to the ticket, and see if I can get a concrete answer for you, as I am not 100% sure of the answers right now.
From Sheila on 2017-10-30
@rpandya @tony
Hi again,
Unfortunately, this is not a high priority for us, so we cannot provide much help with your questions. If you need specific versions/data, it may be best to download the data from the specific websites you are interested in.
-Sheila
From KyuriChoi on 2018-01-12
Could I get a b38 file lifted over from the hg38 file? When will you finish the process of making the hg38 bundle and the b38 bundle? Please reply to my comment! :)
From shlee on 2018-01-16
@KyuriChoi,
The split reference assemblies are now merged in GRCh38. That is, if before we had hg19 and GRCh37, we now only have GRCh38.
From igor on 2018-01-16
There used to be a resource bundle guide at https://software.broadinstitute.org/gatk/user%20guide/article.php?id=1213
It looks like that page may have gotten lost in the GATK 4 transition. Is it possible to restore it? Or is there a new location for it?
From shlee on 2018-01-16
Hi @Igor,
You can still find the mirrored article on the forum at https://gatkforums.broadinstitute.org/gatk/discussion/1213.
From KyuriChoi on 2018-01-22
@shlee Thank you for your reply. So you mean that GATK now uses only one reference bundle, not two kinds of bundle, right? Then can I use the new, single resource bundle (GRCh38/hg38) with GATK version 3.8? And is the new bundle completely finished now? Is it okay to use it in our pipeline now?
From Sheila on 2018-01-23
@KyuriChoi
Hi,
We are now supporting hg38, if that is what you are asking. You can indeed use hg38 with version 3.8 and GATK4. The bundle files are [here](https://software.broadinstitute.org/gatk/download/bundle).
-Sheila
From KyuriChoi on 2018-01-25
@Sheila Thank you, but I received an error message like “Input files known2 and reference have incompatible contigs.” when I ran GATK3 RealignerTargetCreator. My input known2 file is Mills_and_1000G_gold_standard.indels.hg38.vcf from the new GRCh38 bundle, and the reference is Homo_sapiens_assembly38.fasta, also from the new GRCh38 bundle. I will try again while waiting for your reply! :)
From KyuriChoi on 2018-01-25
ddddd
From KyuriChoi on 2018-01-25
I ran it again, and it’s fine. Thanks for your help!
From hashish on 2018-04-01
Hello,
Is Mills_and_1000G_gold_standard.indels.b37.sites.vcf
the same as
Mills_and_1000G_gold_standard.indels.b37.vcf?
BaseRecalibrator and VariantRecalibrator request the file with ‘sites’ in the name while the bundle has the one without it.
Please advise.
From Sheila on 2018-04-01
@hashish
Hi,
You can use Mills_and_1000G_gold_standard.indels.b37.vcf for Mills_and_1000G_gold_standard.indels.b37.sites.vcf. I suspect the names changed over time for the same file, and our documents did not get updated. If you let us know which docs contain issues, we can fix them.
Thanks,
Sheila
From hashish on 2018-04-02
@Sheila
Thank you for your reply.
I came across it in the following pages:
https://software.broadinstitute.org/gatk/documentation/article.php?id=1247
https://software.broadinstitute.org/gatk/documentation/article.php?id=1259
https://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr
From Sheila on 2018-04-04
@hashish
Hi,
Fixed. Thank you for letting us know :smile:
-Sheila
From Begali on 2018-08-02
@Geraldine_VdAuwera @Sheila
Hi,
Could you please provide all resources for hg19 on Google Cloud as well?
I need to get Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz and dbsnp_138.hg19.vcf.gz, but
ftp://ftp.broadinstitute.org/bundle/
has been giving me problems loading for three days.
Moreover, there is only a fasta file here:
https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/hg19/?pli=1
With best regards
From Sheila on 2018-08-06
@Begali
Hi,
I don’t think the team has any plans to move hg19 files to the cloud. But, the ftp is working now; I just tested it.
-Sheila
From Begali on 2018-08-07
@Sheila
Hi,
Thanks, but it is still not working with a different browser and a different operating system; it may be related to my internet network.
It gives me different errors:
directory does not exist
cannot load this page
I tried to fix the proxy, but without success. I also tried this link:
https://github.com/snewhouse/ngs_nextflow/wiki/GATK-Bundle
I was trying to download via wget, but the connection timed out.
However, thanks so much for your reply :smiley:
From Sheila on 2018-08-10
@Begali
Hi,
Not sure if this helps, but have you tried ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/? That works for me. If it does not, I will see if I can transfer the files to the cloud myself.
-Sheila
From Begali on 2018-08-12
@Sheila
Hi,
Thanks so much for your help.
It is working now that I am downloading from a different internet network :)
best regards