Where can I get a gene list in RefSeq format

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2012-08-11

1. About the RefSeq Format

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

2. In the GATK

The GATK uses RefSeq in a variety of walkers, from indel calling to variant annotations. There are many file format flavors of ReqSeq; we've chosen to use the table dump available from the UCSC genome table browser.

3. Generating RefSeq files

Go to the UCSC genome table browser. There are many output options, here are the changes that you'll need to make:

clade: Mammal genome: Human assembly: ''choose the appropriate assembly for the reference you're using'' group: Genes abd Gene Prediction Tracks track: RefSeq Genes table: refGene region: ''choose the genome option''

Choose a good output filename, something like geneTrack.refSeq, and click the get output button. You now have your initial RefSeq file, which will not be sorted, and will contain non-standard contigs. To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order.

4. Running with the GATK

You can provide your RefSeq file to the GATK like you would for any other ROD command line argument. The line would look like the following:

-[arg]:REFSEQ /path/to/refSeq

Using the filename from above.

Warning:

The GATK automatically adjusts the start and stop position of the records from zero-based half-open intervals (UCSC standard) to one-based closed intervals.

For example:

The first 19 bases in Chromosome one: Chr1:0-19 (UCSC system) Chr1:1-19 (GATK)

All of the GATK output is also in this format, so if you're using other tools or scripts to process RefSeq or GATK output files, you should be aware of this difference.

Updated on 2016-10-03

From vivekdas_1987 on 2013-11-26

I have some doubts, I have been using the GATK 2.3 for last one and half months and formulated a pipeline to analyze the exome datasets for our samples, I am having 2 tumor samples and its 4 IPSC lines which I am after reconsidering the Best Practices and being the first time user did it on single samples and then proceeded with annotation. I am indeed grateful to the developers of the tool and the people who have been constantly upgrading the document keeping in line with the latest findings and incorporating them in the forums and the docs page. Some things which I missed out while I was developing the pipeline. I would like to perform is that I want to find out the DepthofCoverage of my aligned bam files , this would help me getting a statistics of number of bases being read each time. Please correct me if am wrong. From the posts I read this can be be done. However I see have to generate a refSeq file from the UCSC genome browser. Is this not the hg19 reference genome fasta file which I have been using for the analysis or do I have to generate another one? The command states

java -Xmx2g -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T DepthOfCoverage \ -o file_name_base \ -I input_bams.list [-geneList refSeq.sorted.txt] \ [-pt readgroup] \ [-ct 4 -ct 6 -ct 10] \ [-L my_capture_genes.interval_list]

I am a bit confused how to generate the -geneList refSeq.sorted.txt , this can be generate from the UCSC right with the instructions provided above? but then the output format is in bed format right so what should I do there? also the my_capture_genes.interval_list. Can you please guide me about this? Is this some customized interval list? Or if I say that I have the below data with me then can I generate the coverage statistics?

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam

What is the input bam list referring to. If I want to do for single samples every time, I have not encountered much problem with your toll and the sailing have been quite smooth with your guidance but here am a bit confused. It would be nice if you can guide me through.

From Geraldine_VdAuwera on 2013-11-26

Hi @vivekdas_1987,

You don’t necessarily have to provide a RefSeq file to run DepthOfCoverage, that’s only needed if you want to get coverage metrics aggregated by gene. What you do have to provide, since you’re working with exome data, is the list of capture intervals that was used to generate your data. In fact, you should also be using that interval list in other steps of the Best Practices (especially base recalibration and variant calling). If you do not have this list, you should contact the lab that generated the sequence data and ask them what exome intervals were used and where you can get a copy of the intervals file.

From vivekdas_1987 on 2013-11-26

Yes I have that list of intervals its a bed file and the version is V4 sure select interval list. This is the file SureSelect_XT_Human_All_Exon_V4.bed. Please let me know if this is correct or not? Then if I reframe the code it should look like this below:

java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T DepthOfCoverage -o output_file_base -I SRR062634.sorted.bam -L SureSelect_XT_Human_All_Exon_V4.bed

Please let me know if this looks correct or not? I hope this would serve the purpose now?

From Geraldine_VdAuwera on 2013-11-26

Yes, that looks correct.

From vivekdas_1987 on 2013-11-26

Thanks I have prepared the scripts for all my 6 samples and now gave it for run. Once It finishes will update here about the run and if I face any problem in understanding any features. Once again thanks a lot for all the help and the advices.

From rglowe on 2014-10-17

I would like to know the recommended method to make a refseq file in the “UCSC table dump” format for a genome that isn’t on the UCSC genome browser. My genome of interest isn’t on the UCSC browser and my initial searches haven’t yet found a specification for that table layout. Can I use a GTF or GFF to supply genes to CalculateDepthOfCoverage, for example? I guess I could calc coverage per gene afterwards…

regards,

Rohan

From Geraldine_VdAuwera on 2014-10-17

@rglowe I’m not aware of a formal spec sheet, so my recommendation would be to pick an organism that is in the UCSC db, do a table dump, and then use it as a template to build yours. Primitive but it should work.

If there is a kind soul out there who wants to be helpful, do that and derive a spec sheet from it that we can share with the world, we’re happy to add it to the docs.

From pdexheimer on 2014-10-17

I believe that the “spec”, such as it is, is provided by the .sql files associated with each table. I personally find it easier to use the “describe table schema” button in the UCSC Table Browser

From rglowe on 2014-10-19

Thanks for the advice. I’ve had a look at the table for Human and I might be able to replicate it with my rudimentary parsing skills. Hopefully I can leave out the ‘exonFrames’ column otherwise I’ll have to generate that data.

One further question: in the original post you mention

“To run with the GATK, contigs other than the standard 1-22,X,Y,MT must be removed, and the file sorted in karyotypic order.”

I am guessing this only applies to Human? I would be using different naming convention and don’t have karyotype information.

Thanks again.

From rglowe on 2014-10-20

I found a short description of the Refseq table format in the document describing RefSeqCodec.

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_utils_codecs_refseq_RefSeqCodec.php

From Biocyberman on 2016-10-02

@Geraldine_VdAuwera

A technical problem: I could not find the sortByRef.pl anywhere.

Update (after 2 mins): I found it in gatk/public/perl directory of my local github repo. But still can’t find it online. You may need to update the link to the file.

From Geraldine_VdAuwera on 2016-10-03

We deprecated that script some time ago, and removed it from the codebase. We now recommend using Picard SortVcfs for this purpose.

From Biocyberman on 2016-10-03

Good to know. Then it is also a good idea to update the tutorial to reflex the change.

From Geraldine_VdAuwera on 2016-10-03

Ah, fair point, will do.

From Geraldine_VdAuwera on 2016-10-03

Actually, my bad — SortVcf doesn’t run on refseq files; I was thinking of the VCF sorting use case. We deprecated the perl script thinking that all use cases were covered but it looks like we didn’t account for the refseq case. That does mean we no longer provide an official recommendation for generating properly sorted refseq files. I’ll look into options for fixing that (aside from bringing back the perl script, which we’d rather avoid) but in the meantimes we’d be happy to take recommendations from the community.

From Biocyberman on 2016-10-04

Well, I am not sure why the old Perl file is deprecated. I tried it and it does not handle every thing completely and produce in compatible output to GATK. I wrote a short Linux-based, bash script to do the job. With the ucsc.refseq.txt prepared the way you described in the tutorial, the script’s output can be run with GATK’s DepthOfCoverage.

makerefseq.sh.txt

From Geraldine_VdAuwera on 2016-10-04

@Biocyberman Thanks for sharing your solution! As far as I can remember the Perl script broke and no one wanted to put in the effort to fix it.

From mglclinical on 2016-10-05

Dear GATK Team,

On UCSC’s Table Browser website, the parameter “output format” has the following 8 options :

all fields from selected table

selected fields from primary and related tables

sequence

GTF- gene transfer format

CDS FASTA alignment from multiple alignment

BED – browser extensible data

custom track

hyperlinks to Genome Browser

Could you please specify which “output format” should I select in order to make that output file to be given as input to my -geneList parameter of the DepthOfCoverage tool ?

Thanks,

mglclinical

From mglclinical on 2016-10-05

Based on the format posted on this link : http://gatkforums.broadinstitute.org/gatk/discussion/40/performing-sequence-coverage-analysis

I suspect that the answer is “all fields from selected table” .

From shlee on 2016-10-05

Hi @mglclinical, the GTF format sounds familiar but I’d have to double-check for this specific tool what this is used for and if it is appropriate. Can you try this and let us know if your output is as expected?

`P.S. 10/6`: Turns out we have an article related to this [here](https://software.broadinstitute.org/gatk/guide/article?id=1329). Also, check out [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/2421/depth-of-coverage-genelist) and [this slightly digressive thread](http://gatkforums.broadinstitute.org/gatk/discussion/1831/depth-of-coverage-only-first-gene-summary-output).

From Biocyberman on 2016-10-06

I can fix the Perl script if you prefer. But it can be only in December when I have a little free time.

From Geraldine_VdAuwera on 2016-10-06

Thanks, but I think we’d do better to make a proper GATK tool that can sort simple non-vcf reference-ordered data formats, rather than rely on external scripts. I’ll see if we can get a new dev to take it on as an onboarding exercise.

From mglclinical on 2016-10-11

Hi @shlee

I have downloaded the refseq file with the output format “all fields from selected table” and that worked for me, it is the same format that is specified in this link http://gatkforums.broadinstitute.org/gatk/discussion/40/performing-sequence-coverage-analysis . After downloading the file I have processed it (sorted, removed unnecessary contigs) in order to make it work with DepthOfCoverage tool

however, I did not try the GTF format.

Thanks,

mglclinical

From mglclinical on 2016-10-11

Hi @Geraldine_VdAuwera

I believe I have successfully used the DepthOfCoverage tool(GATK 3.5) for calculating gene level coverage by using the -geneList parameter .

I have performed some manual QC on the gene level coverage(example human genes : ABCD1, LRRC8C, and GNG10) and compared those values to the “_mean_cvg” column values in the “.sample_gene_summary” file to make sure that the gene level depth of coverage is accurately being calculated by DepthOfCoverage tool.

I observed that my manual calculations(total_coverage & mean_coverage) for genes ABCD1 and LRRC8C matched identically against the calculations from DepthOfCoverage tool. Where as for gene GNG10 they did not match.

My total_coverage calculation for GNG10 was ~10,000, where as DepthOfCoverage’s calculation of total_coverage for GNG10 is 2623, which is a gross under representation. I went back and looked into my calculations and observed that 1st exon of GNG10 has a total_coverage of 2623 and that matched the over all DepthOfCoverage’s calculation for gene GNG10. And this GNG10 has multiple transcripts(NM_001017998, NM_001198664) in my refseq file. My other example genes ABCD1 and LRRC8C that I considered for my manual QC have single transcripts in my geneList file. So, it seems to suggest to me that coverage for genes with multiple transcripts in the geneList file are being be underrepresented by the DepthOfCoverage tool.

I am not sure if this is a bug in the DoC tool, and please let me know if you have any advice on this.

Thanks,

mglclinical

From Sheila on 2016-10-13

@mglclinical

Hi mglclinical,

Ah, yes. Only the first transcript is represented in the output of DepthOfCoverage.

-Sheila

From mglclinical on 2016-10-14

Hi @Sheila

GNG10 gene has at least 2 CDS exons in humans.

DepthOfCoverage tool calculated the coverage based on the reads in 1st exon region, and it completely ignored the reads in the 2nd exon region , hence I believe it is a bug in the software(DepthOfCoverage)

For exon1 : total_coverage & mean_coverage are ~2000 and ~17

For exon2 : total_coverage & mean_coverage are ~9000 and ~77

So, for overall gene_level calculation the total_coverage & mean_coverage should be around ~11000 and ~50.

But the DepthOfCoverage tool’s gene_level calculation of total_coverage & mean_coverage is 2623 and 16.39

Hence I think the DepthOfCoverage tool has considered only the 1st exon and ignored the 2nd exon for gene level coverage calculation

Thanks,

mglclinical

From Sheila on 2016-10-17

@mglclinical

Hi mglclinical,

Would you be able to submit a bug report? Instructions are [here](http://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report).

Thanks,

Sheila

From mglclinical on 2016-10-20

Hi @Sheila , I will submit a bug report, Thanks mglclinical

From zhiyewang1 on 2018-04-01

Hello! I am a plant guy. There is no RefSeq of Arabidopsis in UCSC genome table browser. Where can I get a gene list of Arabidopsis in the RefSeq format?

From Sheila on 2018-04-01

@zhiyewang1

Hi,

No need to post twice. We are helping [here](https://gatkforums.broadinstitute.org/gatk/discussion/11725/where-can-i-get-a-gene-list-of-arabidopsis-in-refseq-format).

-Sheila

From mLalonde on 2018-07-03

Hi,

I think the “track” and “table” and “region” options in UCSC have changed. I think this article should be slated for a re-write, and maybe could include a description (or downloadable example) of what the output should look like?

In any case, this tutorial is no longer up-to-date b/c of changes to the UCSC table browser UI.

Cheers,

Matt

Edit: I just read other topics that indicate tools such as DepthOfCoverage are not yet ported to GATK4, so I’d imagine the documentation is even further down the list of priorities. In that case, I just want to say that I appreciate all your hard work, and I look forward to GATK4 being fully built out and documented.

From shlee on 2018-07-05

Hi @mLalonde,

Thanks for letting us know that the options in UCSC have changed. I’ve let the team know we should consider updating this document.

Report abuse