News (December 2006) I made some major updates, including several useful new clustering options such as alignment coverage control and a switch between local and global sequence identity. Please check the newest release and give it a try. Weizhong Li.
News (February 2006) I recently developed several new programs based on CD-HIT's algorithm: CD-HIT-2D, CD-HIT-EST and CD-HIT-EST-2D. CD-HIT-2D compares two protein sets and reports similar matches between them. CD-HIT-EST and CD-HIT-EST-2D are the nucleotide versions. Weizhong Li.
CD-HIT is a program for clustering large protein databases at a high sequence identity threshold. The program removes redundant sequences and generates a database containing only the representatives. It can be applied to protein family classification, domain analysis, organizing large protein databases, improving the performance of database searches, and much more.
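As a rough illustration of what removing redundancy at an identity threshold means, here is a toy Python sketch of greedy clustering. Sequences are processed from longest to shortest, and each one either joins the first representative it matches at or above the threshold or becomes a new representative. This is not CD-HIT's code: the identity function is a crude stand-in built on Python's difflib rather than a real alignment, and the names identity and cluster, the 0.9 threshold, and the example sequences are all made up for illustration.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude stand-in for sequence identity (an assumption, not CD-HIT's measure):
    number of characters in difflib's matching blocks divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def cluster(seqs, threshold=0.9):
    """Greedy clustering of a {name: sequence} dict.

    Returns {representative_name: [member_names]}; each representative
    is listed as the first member of its own cluster.
    """
    clusters = {}
    # Longest sequences first, so each representative is the longest member.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical example input: "b" is "a" with its last residue removed.
example = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSR",
    "c": "GGGGSSSSPPPPLLLLWWWW",
}
result = cluster(example, threshold=0.9)
representatives = list(result)
```

In this sketch the keys of the returned dictionary play the role of the non-redundant output database, and the value lists record which sequences each representative stands for.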
The program was written by Weizhong Li (liwz@sdsc.edu) at Adam Godzik's lab.
CD-HI was the first version, and CD-HIT is a modification of it. CD-HIT is much faster than CD-HI, but users have to tolerate a very small amount of redundant sequence in the output database. Since the amount of redundancy is so small, I suggest using CD-HIT for all applications. I am only maintaining CD-HIT now.

The CD-HIT manual and downloads are available from bioinformatics.org. If you have a special request, please discuss it with the author.

If you find CD-HIT useful, please cite:
1. "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik, Bioinformatics (2001) 17:282-283.
2. "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, Lukasz Jaroszewski & Adam Godzik, Bioinformatics (2002) 18:77-82.
3. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik, Bioinformatics (2006) 22:1658-9.

Who is using CD-HIT
The CD-HIT program is currently used by hundreds of research and educational groups, including some of the world's best-known institutions, such as UniProt, PDB, EBI, and TIGR.

UniProt is the world's most comprehensive catalog of information on proteins. At UniProt, CD-HIT is used to generate the UniRef reference data sets UniRef90 and UniRef50. CD-HIT is also used at the PDB to handle redundant sequences. Google CD-HIT.

Related resources:
NRDB90 and nrdb90.pl, a non-redundant sequence database and the Perl script used to generate it.
RSDB, Representative protein Sequence DataBases.