CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT was originally developed by Dr. Weizhong Li in Dr. Adam Godzik's lab at the Burnham Institute (now Sanford-Burnham Medical Research Institute).
CD-HIT is very fast and can handle extremely large databases. It significantly reduces the computational and manual effort in many sequence analysis tasks, and it aids in understanding the structure of the data and correcting bias within a dataset.
The CD-HIT package includes CD-HIT, CD-HIT-2D, CD-HIT-EST, CD-HIT-EST-2D, CD-HIT-454, CD-HIT-PARA, PSI-CD-HIT, CD-HIT-OTU, CD-HIT-LAP, CD-HIT-DUP and over a dozen scripts.
CD-HIT (CD-HIT-EST) clusters similar proteins (DNAs) into clusters that meet a user-defined similarity threshold.
CD-HIT-2D (CD-HIT-EST-2D) compares two datasets, db1 and db2, and identifies the sequences in db2 that are similar to sequences in db1 above a threshold.
CD-HIT-454 identifies natural and artificial duplicates from pyrosequencing reads.
CD-HIT-OTU clusters rRNA tags into OTUs.
CD-HIT-DUP identifies duplicates from single or paired Illumina reads.
CD-HIT-LAP identifies overlapping reads.
The usage of the other programs and scripts can be found in the CD-HIT user's guide.
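The two core ideas described above, clustering under a similarity threshold and comparing db2 against db1, can be illustrated with a minimal Python sketch. This is not CD-HIT's implementation: the function names (identity, greedy_cluster, matches_in_db2) are hypothetical, the greedy longest-first strategy is an assumption made for illustration, and the identity measure is a toy one built on Python's difflib rather than a real sequence alignment.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy identity (assumption): matched characters / length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Map each representative name to the names of the sequences in its cluster."""
    clusters: dict[str, list[str]] = {}
    # Visit the longest sequences first, so each representative is the
    # longest member of its cluster.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # join the first matching cluster
                break
        else:
            clusters[name] = [name]  # no match: start a new cluster
    return clusters


def matches_in_db2(db1: dict[str, str], db2: dict[str, str], threshold: float) -> list[str]:
    """Names of db2 sequences that reach the threshold against any db1 sequence."""
    return [
        name
        for name, seq in db2.items()
        if any(identity(ref, seq) >= threshold for ref in db1.values())
    ]


if __name__ == "__main__":
    # Made-up example sequences, for illustration only.
    seqs = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSR",
        "c": "GGGGGGGGGGWWWWWWWWWW",
    }
    print(greedy_cluster(seqs, 0.9))
    print(matches_in_db2({"a": seqs["a"]}, {"b": seqs["b"], "c": seqs["c"]}, 0.9))
```

The sketch compares every sequence against every representative, which is only practical for small inputs; for real datasets, use the programs in the package.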
CD-HIT is currently maintained by Dr. Li's group (http://weizhongli-lab.org/) at the J. Craig Venter Institute. We are grateful for support from the National Center for Research Resources (Grant # 1R01RR025030, 2008-2011). We thank all users who report bugs and send us suggestions and comments.
NEWS
(June 2015) CD-HIT moves to GitHub.
(June 2013) We were awarded an "AWS in Education Research Grant Award" by Amazon. This will support our development of cloud-based cd-hit applications.
(May 2013) We invite users to help test psi-cd-hit before public release.
(October 2012) The new CD-HIT paper was just published in Bioinformatics.
(July 2012) A paper was just published in Briefings in Bioinformatics. This paper describes several clustering applications in metagenomic data analysis.
(June 2011) cd-hit-otu is a special cd-hit extension for clustering rRNA tags into OTUs. It is very fast and very accurate.
(July 2010) cdhit@GoogleCode is a new Google Code project created for releasing the latest development version of CDHIT. New minor versions will usually be released as soon as bug fixes or improvements become available.
(Oct. 2009) CDHIT-454 is a new program to identify exact duplicates and near-identical duplicates in pyrosequencing reads: CDHIT-454 (webserver), CDHIT-454 (standalone).
(September 2009) The CD-HIT web server is now available for running cd-hit or downloading some pre-calculated clusters.
(December 2006) I made some major updates, including several very useful new clustering options such as alignment coverage control and switching between local and global sequence identity. Please check the newest release and give it a try.
(February 2006) I recently developed several new programs based on CD-HIT's algorithm: CD-HIT-2D, CD-HIT-EST and CD-HIT-EST-2D. CD-HIT-2D compares two protein sets and reports similar matches between them. CD-HIT-EST and CD-HIT-EST-2D are the nucleotide versions.