dbNSFP

INTRODUCTION:

    dbNSFP is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. Its current version is based on Gencode release 22 / Ensembl version 79 and includes a total of 82,832,027 nsSNVs and ssSNVs (splicing-site SNVs). It compiles prediction scores from eleven prediction algorithms (SIFT, Polyphen2, LRT, MutationTaster, MutationAssessor, FATHMM, VEST3, CADD, MetaLR, MetaSVM, PROVEAN), four conservation scores (PhyloP, phastCons, GERP++ and SiPhy) and other related information, including allele frequencies observed in the 1000 Genomes Project phase 3 data, the UK10K cohorts data, the ExAC consortium data and the NHLBI Exome Sequencing Project ESP6500 data, various gene IDs from different databases, functional descriptions of genes, gene expression and gene interaction information, etc.
    Some dbNSFP content (which may not be up to date) can also be accessed through variant tools, ANNOVAR, KGGSeq, UCSC Genome Browser's Variant Annotation Integrator, Ensembl Variant Effect Predictor, SnpSift and HGMD. Please cite our papers (see below) if you use dbNSFP content through those tools.
    Please note that some component scores/contents of dbNSFP have specific requirements or licences for non-academic usage. dbNSFP does not grant non-academic usage of those scores/contents, so please contact the original score/content provider for that purpose.

    We thank Dr. Chunlei Wu from The Scripps Research Institute and Dr. CS (Jonathan) Liu from Softgenetics for providing hosting space.
 
    We welcome developers of functional prediction methods to provide their predictions and scores to the database. Please contact Dr. Xiaoming Liu (xmliu.uth{at}gmail.com). 

CITATION:

1. Liu X, Jian X, and Boerwinkle E. 2011. dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions. Human Mutation 32:894-899.
2. Liu X, Jian X, and Boerwinkle E. 2013. dbNSFP v2.0: A Database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations. Human Mutation 34:E2393-E2402.

If you used dbNSFP v1.x, please cite our paper 1. If you used dbNSFP v2.x, please cite our papers 1 & 2.

If you used our ensemble scores (MetaSVM and MetaLR), which are based on 10 component scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations, please cite:

1. Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K* and Liu X*. (2015) Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human Molecular Genetics 24(8):2125-2137. *corresponding author [preprint]


CURRENT VERSION:

    UPDATE (April 12, 2015): dbNSFP v3.0 beta2 is released. This update fixed the issues due to inconsistent mitochondrial reference sequences used by different resources. I thank Dr. Lishuang Shen at MEEI for helping to solve the issues. For mitochondrial SNVs, the pos (i.e. hg38) refers to the rCRS (GenBank: NC_012920) and hg19_pos refers to a YRI sequence (GenBank: AF347015). The ancestral allele of mitochondrial SNVs now comes from the Reconstructed Sapiens Reference Sequence (RSRS, doi:10.1016/j.ajhg.2012.03.002). The affected content includes ancestral alleles, Neanderthal/Denisova genotypes and MutationTaster columns of the chrM file. The rankscores of MutationTaster have also been updated to reflect the update of its chrM scores. dbscSNV has been updated to v1.1, with hg38 positions added by lifting over its hg19 positions. Using search_dbNSFP30b2a or search_dbNSFP30b2c you can search dbscSNV1.1 along with dbNSFP v3.0b2 with either hg19 coordinates or hg38 coordinates. If you find any bugs/issues or have questions/comments please feel free to contact me.

    Two branches of dbNSFP are provided: dbNSFP3.0b2a suitable for academic use, which includes all the resources, and dbNSFP3.0b2c suitable for commercial use, which does not include VEST3 and CADD. dbNSFP3.0b2a can be downloaded from softgenetics ftp, googledrive or onedrive. The md5sum of the zip file can be found here. A README file is here. dbNSFP3.0b2c can be downloaded from softgenetics ftp, googledrive or onedrive. The md5sum of the zip file can be found here. A README file is here.
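
    A downloaded zip file can be checked against its published md5sum before unpacking. The following is a minimal Python sketch and not part of the dbNSFP distribution; the function names md5_of_file and matches_md5 are illustrative, and the file path and expected digest must be supplied by the user.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with a published value, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

    If matches_md5 returns False, the download is probably incomplete or corrupted and should be repeated.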

    REMINDER: if your SNP coordinates are based on hg19, remember to add the option "-v hg19" when using the search program, because the default position is now in hg38.

    TIP: For onedrive: To download with wget, use wget --no-check-certificate  [the_link_address] -O [filename.zip]. 


ATTACHED DATABASE:
    
    dbscSNV includes all potential human SNVs within splicing consensus regions (−3 to +8 at the 5’ splice site and −12 to +2 at the 3’ splice site), i.e. scSNVs, related functional annotations and two ensemble prediction scores for predicting their potential of altering splicing. 
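
    As an illustration of the consensus windows just described, the sketch below tests whether an offset relative to a splice site falls inside them. It is not taken from dbscSNV: the names CONSENSUS_WINDOWS and in_splicing_consensus are hypothetical, and the sketch assumes the common numbering convention in which there is no position 0.

```python
# Windows as given in the text: -3 to +8 at the 5' splice site and
# -12 to +2 at the 3' splice site. Offsets skip 0 (an assumption here).
CONSENSUS_WINDOWS = {"5prime": (-3, 8), "3prime": (-12, 2)}


def in_splicing_consensus(offset, site):
    """Return True if the offset lies inside the consensus window of a site.

    site must be "5prime" or "3prime"; offset is a signed integer.
    """
    if offset == 0:
        raise ValueError("offset 0 is not defined in this numbering")
    low, high = CONSENSUS_WINDOWS[site]
    return low <= offset <= high


# Number of positions covered by each window under this convention.
window_sizes = {
    site: sum(1 for off in range(low, high + 1) if off != 0)
    for site, (low, high) in CONSENSUS_WINDOWS.items()
}
```

    Under these assumptions the windows cover 11 positions at the 5' splice site and 14 positions at the 3' splice site.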

    UPDATE (April 12, 2015): dbscSNV has been updated to v1.1, with hg38 positions added by lifting over its hg19 positions. dbscSNV1.1 is available for download from the softgenetics ftp or googledrive. Since v3.0b2 the companion search program supports searching dbNSFP along with dbscSNV1.1 using option "-s".

    dbscSNV v1.0 is available for download from the softgenetics ftp or googledrive. From v2.6 to v3.0b1 the companion search program supports searching dbNSFP along with dbscSNV using option "-s".

    CITATION

    1. Jian X, Boerwinkle E and Liu X. 2014. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Research 42(22):13534-13544.


UPDATE HISTORY:

     UPDATE (April 6, 2015): dbNSFP v3.0 beta1 is released. The core set of nsSNVs and ssSNVs has been rebuilt based on Gencode 22/Ensembl 79 with human reference sequence hg38. Putative genes have been included. Genes with incomplete 5' have been excluded (I thank Chris Gillies for reporting the issues for genes with incomplete 5' end). Genes on mitochondrial DNA have been included. Allele frequencies from the UK10K cohorts and genotypes of two Neanderthals have been added. Some resources have been updated, including the MutationTaster (I thank Dr. Dominik Seelow for kindly providing the scores), allele frequencies from the 1000 Genomes Project populations, ancestral alleles, dbSNP, ClinVar and InterPro. The presentation of the prediction scores has been improved by adding columns for the corresponding transcript/protein ids. PhyloP and PhastCons conservation scores based on hg19 have been replaced by the scores based on hg38. Some resources have been dropped due to various reasons, including SLR test statistic, UniSNP ids, allele frequencies from the ARIC cohorts and allele counts in COSMIC. dbNSFP_gene has also been completely rebuilt using the up-to-date resources. Residual Variation Intolerance Scores (RVIS) have been added. GO Slim terms have been replaced by full GO terms. If you find any bugs/issues or have questions/comments please feel free to contact me. 
    Two branches of dbNSFP are now provided: dbNSFP3.0b1a suitable for academic use, which includes all the resources, and dbNSFP3.0b1c suitable for commercial use, which does not include VEST3 and CADD. dbNSFP3.0b1a can be downloaded from softgenetics ftp or googledrive. The md5sum of the zip file can be found here. A README file is here. dbNSFP3.0b1c can be downloaded from softgenetics ftp or googledrive. The md5sum of the zip file can be found here. A README file is here.

    UPDATE (February 3, 2015): dbNSFP v2.9 is released. The SIFT score has been updated to the ensembl66 version. PROVEAN (Protein Variation Effect Analyzer) score v1.1 has been added. I thank Dr. Yongwook Choi from J. Craig Venter Institute for providing the SIFT and PROVEAN scores. The CADD score has been updated to version 1.3. Please note the following copyright statement for CADD: "CADD scores (http://cadd.gs.washington.edu/) are Copyright 2013 University of Washington and Hudson-Alpha Institute for Biotechnology (all rights reserved) but are freely available for all academic, non-commercial applications. For commercial licensing information contact Jennifer McCullar (mccullaj@uw.edu)." Allele frequency v0.3 of ~60,706 unrelated individuals from the Exome Aggregation Consortium (ExAC) has been added. ExAC data are released under a Fort Lauderdale Agreement. Please refer to http://exac.broadinstitute.org/terms for terms of use. The zipped database (7.8 Gb in size) can be downloaded from Scripps ftp, softgenetics ftp, googledrive or onedrive. The md5sum of the zip file can be found here. A README file is here. I thank Dr. CS (Jonathan) Liu from Softgenetics for providing hosting space.
    TIP: Depending on the method you choose to download dbNSFP from the Scripps ftp, you may be asked for a username and password. If that is the case, you can use "Anonymous" as the username and your own email address as the password, e.g. "wget --user=Anonymous --password=your_email_address ftp://ftp.scripps.edu/incoming/asu/dbNSFPv3.0b1a.zip" (I thank Mihail Halachev). Downloading with Chrome does not seem to require a username or password.

    UPDATE (December 16, 2014): Some rows in the dbNSFP2.8_gene and dbNSFP2.8_gene.complete were truncated. I thank Jocob Hsu for identifying this issue. If you have already downloaded dbNSFPv2.8 you can download and replace the old files with the updated files: dbNSFP2.8_gene and dbNSFP2.8_gene.complete. The updated complete database can be downloaded with the links below.
    UPDATE (November 21, 2014): dbNSFP v2.8 is released. COSMIC (Catalogue Of Somatic Mutations In Cancer) annotations have been added. Pathway information from BioCarta and KEGG (old version) has been added to the dbNSFP2.8_gene. A bug causing inconsistency between MutationTaster scores and MutationTaster_pred, which affects v2.5 to v2.7, has been fixed. I thank Adam Novak for reporting this bug. The zipped database (6.8 Gb in size) can be downloaded from softgenetics ftp or googledrive. The md5sum of the zip file can be found here. A README file is here.

    UPDATE (September 12, 2014): dbNSFP v2.7 is released. Chromosomes and positions of human reference hg38 have been added. search_dbNSFP27.class now supports querying dbNSFP using positions based on hg38 with the "-v hg38" option. ClinVar (freeze 20140902) annotations have been added. Allele frequencies from 2303 exomes of African Americans and 3203 exomes of European Americans from the Atherosclerosis Risk in Communities (ARIC) cohort study have been added. As the columns for gene interactions in the dbNSFP_gene table contain very long strings, especially for gene UBC, which may cause problems when viewing the results in Excel, we now report only the number of interacting genes in those columns. Full information is retained in the dbNSFP_gene.complete table. The zipped database (6.8 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here.
    
    UPDATE (July 26, 2014): dbNSFP v2.6 is released. rs numbers from dbSNP 141 have been added to the variant database files. Mouse and zebrafish homolog genes and phenotypes have been added to the gene database file (I thank Alex Li for his suggestion and help). Trait_association(GWAS) was also updated. The zipped database (6.7 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here.
    CORRECTION (September 9, 2014): the rs numbers in v2.6 are from the latest dbSNP 141 (not 138 as previously noted). The README file has been updated accordingly. I thank Jason J. Corneveaux for pointing this out. 

    UPDATE (June 1, 2014): dbNSFP v2.5 is released. A new functional score VEST 3.0 has been added. We thank Dr. Karchin for kindly providing the score. Non-commercial use of VEST is free. Commercial users of VEST please contact the Johns Hopkins Technology Transfer office. A bug that has caused MutationTaster score errors since v2.1 for variants with a prediction of "Polymorphism_automatic" has been fixed. We thank John McGuigan and James Ireland for reporting this bug. As MutationTaster can also predict splicing change and other functional effects, in case a variant has multiple predictions based on its different models, we took the most damaging score and prediction for dbNSFP. The zipped database (7.3 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here.

    UPDATE (March 5, 2014): dbNSFP v2.4 is released. A whole genome functional prediction score called CADD was added, along with five more conservation scores (phyloP46way_primate, phyloP100way_vertebrate, phastCons46way_primate, phastCons46way_placental, phastCons100way_vertebrate). Please note the following copyright statement for CADD: "CADD scores (http://cadd.gs.washington.edu/) are Copyright 2013 University of Washington and Hudson-Alpha Institute for Biotechnology (all rights reserved) but are freely available for all academic, non-commercial applications. For commercial licensing information contact Jennifer McCullar (mccullaj@uw.edu)." To facilitate comparison between scores, we added rank scores for most functional prediction scores and conservation scores, replacing the "converted" scores of the previous versions. In short, for a given type of prediction/conservation score, all its scores in dbNSFP were first ranked, and the rankscore is the rank divided by the total number of scores. Roughly speaking, the rankscore ranges from 0 to 1; the larger the rankscore, the higher the score ranks in dbNSFP and the more likely the SNP is to have a damaging effect. The zipped database (6.9 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here.
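
    The rank-then-divide idea behind the rankscore can be sketched in a few lines of Python. This is an illustration only, not the code used to build dbNSFP. It assumes that a larger raw score means more damaging (scores with the opposite direction would need to be flipped first) and that tied scores share their mean rank; neither detail is specified in the text above, and the function name rankscores is hypothetical.

```python
def rankscores(scores):
    """Return rank / N for each score; larger scores get larger rankscores."""
    n = len(scores)
    # Indices of the scores in ascending order of score.
    order = sorted(range(n), key=lambda i: scores[i])
    result = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the group while the next sorted score is tied with this one.
        while end + 1 < n and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Ranks are 1-based; tied scores share the mean rank (assumption).
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            result[order[k]] = mean_rank / n
        start = end + 1
    return result
```

    With this definition the largest score always receives a rankscore of 1, and the smallest possible rankscore is 1/N rather than exactly 0, which is consistent with "roughly" ranging from 0 to 1.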

    UPDATE (February 12, 2014): A bug was fixed in dbNSFP v2.2 and v2.3, which caused missing delimiters in columns aapos_SIFT, SIFT_score_converted and SIFT_pred. For those who need to use information from those columns, please re-download the database(s) using the above links.

    UPDATE (January 26, 2014): dbNSFP v2.3 is released. In collaboration with Dr. Kai Wang's lab at USC, we constructed two ensemble scores (MetaSVM and MetaLR) based on 10 component scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. Based on our comparison, the two ensemble scores outperform all their component scores. A manuscript describing the ensemble scores has been published in Human Molecular Genetics. This release added the two ensemble scores and their predictions. The zipped database (4.4 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here.

    UPDATE (January 23, 2014): dbNSFP v2.2 is released. SIFT and FATHMM now have multiple scores corresponding to different Ensembl ENSP ids and amino acid positions (aapos_SIFT and aapos_FATHMM). Accordingly, our companion search program now supports SNP searches based on Ensembl ENSP ids and amino acid positions. A bug is fixed for a small proportion of MutationTaster scores. The zipped database (4 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here.

    UPDATE (October 3, 2013): dbNSFP v2.1 is released. MutationTaster and FATHMM scores have been updated. To facilitate interpretation of the prediction scores, converted scores of SIFT, LRT, MutationTaster, MutationAssessor and FATHMM have been added. The converted scores are all scaled to 0~1 with the larger number indicating more likely to be damaging. Columns of SIFT and FATHMM predictions have been added. The gene database has also been updated. Database IDs are updated. GO Slim terms, pathway and protein interaction information from the ConsensusPathDB, and list of essential and non-essential genes (based on phenotypes of mouse homologs) have been added. The zipped database (3.3 Gb in size) can be downloaded from here. The md5sum of the zip file can be found here. A README file is here

    UPDATE (August 12, 2013): The java search program is updated with an option for users to choose whether to output all columns from the input vcf file to the output file. You can download it from here.
    
    UPDATE (May 31, 2013): The source code of the companion Java search program is now available under the RECEX SHARED SOURCE LICENSE. You can download it from here.  

    UPDATE (March 22, 2013): A bug which caused a lot of missing FATHMM scores has been fixed. The database files have been updated. Please use the above link (February 25, 2013) to download the database. The alternative companion java search program (March 12, 2013) is now the default search program included in the zip file. 

    UPDATE (March 12, 2013): Here is an alternative companion java search program, which outputs queries that are not found into an error file instead of the system output. It can be downloaded from here. You can just replace the companion search program packed with the database file. 

    NEW VERSION (February 25, 2013): Finally dbNSFP v2.0 is released. A new functional prediction score FATHMM is added.  It can be downloaded from here. A README file is here.

    UPDATE (October 27, 2012): dbNSFP v2.0b4 is released. A new functional prediction score MutationAssessor is added. Allele frequencies from ESP 5400 data set are replaced by ESP 6500 data set. It can be downloaded from here. A README file is here
    UPDATE (November 19, 2012): a bug was found in the companion java search program search_dbNSFP20b4, which causes missing output when only position queries are included in the input file. The fixed program can be downloaded from here. The program in the database zip file linked above has been replaced too.

    UPDATE (August 28, 2012): The companion java search program search_dbNSFP20b3 is updated. Added features include support for vcf files as input and options for output contents (columns). It can be downloaded from here. A README file is here. Simply replace the old search_dbNSFP20b3.class file with the new file.

    UPDATE (July 2, 2012): dbNSFP v2.0b3 is released. To facilitate filtering, an additional 2.2 million splicing site SNPs have been added to dbNSFP_variant. In the table those SNPs have missing values (".") in aaref and aaalt, and "-1" in aapos. There's no change to the format of the search input file. It can be downloaded from here. A README file is here. Bug reports are very welcome.
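
    The missing-value convention described above makes it easy to separate splicing site SNPs from non-synonymous ones. The Python sketch below is illustrative and not part of the dbNSFP tools. It assumes a tab-delimited table with a header row whose column names include aaref, aaalt and aapos exactly as written; the function names and the toy table (including its "pos" column) are hypothetical.

```python
import csv
import io


def is_splice_site_row(row):
    """True when a row carries the splice-site markers described above."""
    return row["aaref"] == "." and row["aaalt"] == "." and row["aapos"] == "-1"


def splice_site_rows(handle):
    """Yield the splice-site rows of a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if is_splice_site_row(row):
            yield row


# Toy table for demonstration only; real files have many more columns.
example = io.StringIO(
    "pos\taaref\taaalt\taapos\n"
    "100\tA\tT\t12\n"
    "200\t.\t.\t-1\n"
)
splice_positions = [row["pos"] for row in splice_site_rows(example)]
```

    Inverting the test in is_splice_site_row keeps only the non-synonymous rows instead.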

    UPDATE (June 2, 2012): dbNSFP v2.0 beta 2 is released, which includes both the dbNSFP_variant and dbNSFP_gene sub-databases. Slight changes have been made to the Ensembl gene and transcript ids of dbNSFP_variant in order to be compatible with other database sources. For each gene, dbNSFP_gene includes various ids of the gene for different databases, function description, gene expression information, gene interaction information, diseases or traits the gene causes or is associated with, estimated probability of haploinsufficiency, estimated probability of causing recessive disease, etc. It can be downloaded from here. A README file is here. Bug reports are very welcome.

    UPDATE (April 11, 2012): The long-awaited dbNSFP v2.0 is on the horizon now. The new database is rebuilt based on Gencode release 9 / Ensembl version 64. The default coordinate is hg19, but hg18 is still supported. There will be two parts of the database: one focuses on variant annotation and the other focuses on gene annotation. The variant sub-database is now open for beta testing and can be downloaded from here. A README file is here. SIFT, Polyphen-2 and MutationTaster scores are updated. Please note that all scores are now RAW scores, without imputation and transformation. One more conservation score, SiPhy, is added along with other new annotations such as the protein functional domains, the allele frequencies observed in the 1000 Genomes phase 1 data and the NHLBI's Exome Sequencing Project data, etc. Bug reports are very welcome.
    
    dbNSFP_light is a light version of dbNSFP, which contains fewer annotation entries but an additional 9,285,316 NSs that are not in CCDS version 20090327.
    dbNSFP_light v1.0 can be downloaded from here. A README file is here. Scores of PhyloP, SIFT, Polyphen2, LRT and MutationTaster are included but missing data are not imputed. Prediction of LRT and MutationTaster are also included, as well as the omega estimated by LRT. A companion Java program called search_dbNSFP_light.class can be downloaded from here and used for local queries. 
    dbNSFP_light v1.1 added GERP++ neutral rates and RS scores. It can be downloaded from here (including readme and the corresponding java search program). A README file is here.
    dbNSFP_light v1.2 added Uniprot ID, accession number and amino acid position based on the Polyphen-2 annotations. Users can now search amino acid change directly referring to a Uniprot ID or accession number. dbNSFP_light v1.2 can be downloaded from here (including readme and the corresponding java search program). A README file is here.
    dbNSFP_light v1.3 updated SIFT scores (August, 2011 version) and Polyphen-2 scores (May, 2011 version). SIFT: 7,097,009 scores added, 48,011,111 updated. Polyphen-2: 2,136,757 scores added, 53,712,654 updated. Uniprot ID, accession number and amino acid position based on the Polyphen-2 annotations have been updated too. It can be downloaded from here (including readme and the corresponding java search program). A README file is here.

    dbNSFP v1.3 added Uniprot ID, accession number and amino acid position based on the Polyphen-2 annotations. Users can now search amino acid change directly referring to a Uniprot ID or accession number. dbNSFP v1.3 can be downloaded from here (including readme and the corresponding java search program). A README file is here.
    Update (Nov. 10, 2011): A bug was found in the companion search program for dbNSFP v1.3, which causes invalid searches when using AA mutations with a Uniprot ID or accession number. Please use the updated search program. The search program in the dbNSFP v1.3 zip file has been updated.

    dbNSFP v1.2 added GERP++ neutral rates and RS scores. It can be downloaded from here (including readme and the corresponding java search program). A README file is here.

    dbNSFP v1.1 added the following entries: rs numbers from UniSNP (a cleaned version of dbSNP build 129), allele frequency recorded in dbSNP, allele frequency reported by 1000 Genomes Project, alternative gene names, descriptive gene name, database cross references (gene IDs of HGNC, MIM, Ensembl and HPRD). The unzipped database is 18 Gb.
    dbNSFP v1.1 can be downloaded from here. A README file is here.
    A companion Java program called search_dbNSFP11.class can be downloaded from here and used for local queries.
    
    dbNSFP v1.0 can be downloaded from here. A README file is here. More details about the database can be found in our paper.
    A companion Java program called search_dbNSFP.class can be downloaded from here and used for local queries.
    
     