Computational Biology Tools and Databases



Microbial genome databases
Protein Information Resource
Comparative genome analysis in P. Bork laboratory
TIGR: The Comprehensive Microbial Resource Home Page—the omniome
Genome databases other than NCBI
Genome list at NIH
Mitochondrial DNA Database MitBASE
E. coli genome project
E. coli genome and proteome database GenProtEC
E. coli index
Organelle genome sequences
Parasite genome databases and genome research resources
Retroviral genotyping and analysis site
GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD accessible from:
European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England
DNA DataBank of Japan (DDBJ) at Mishima, Japan
Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC
The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne
The Sequence Retrieval System (SRS) at the European Bioinformatics Institute allows both simple and complex concurrent searches of one or more sequence databases. The SRS system may also be used on a local machine to assist in the preparation of local sequence databases.
Protein Data Bank (PDB) at Rutgers, the State University of New Jersey: atomic coordinates of structures as PDB files, models, viewers, and links to many other Web sites for structural analysis and classification
COG (Clusters of Orthologous Groups)
DOGS: Database of genome sizes A comprehensive gene index (catalog) derived from ESTs and predicted genes
GeneCensus Genome Comparisons by encoded protein structures
GeneQuiz: An integrated system for large-scale biological sequence analysis and data management (Andrade et al. 1999; Hoersch et al. 2000)
Genes and disease: Map location on human chromosomes
Genome channel at Oak Ridge National Laboratories
GOLD™: Genomes OnLine Database (Kyrpides 1999)
IMGT ImMunoGeneTics Database specializing in Immunoglobulins, T-cell receptors, and Major Histocompatibility Complex (MHC) of all vertebrate species
KEGG: Kyoto Encyclopedia of Genes and Genomes (Kanehisa and Goto 2000)
MIA Molecular Information Agent: A Web server that searches biological databases for information on a macromolecule
Orthologous gene alignments at TIGR
PEDANT: A protein extraction, description, and analysis tool
STRING Search Tool for Recurring Instances of Neighboring Genes http://www.Bork.EMBL-Heidelberg.DE/STRING/
Taxonomy browser at the NCBI arranges genomes taxonomically for sequence retrieval
UniGene System gene-oriented clusters of GenBank sequences useful for gene identification
2D gel analysis of proteins: List of organisms
AlignAce for promoter analysis of coordinately regulated genes, e.g., microarrays by Gibbs sampling (Roth et al. 1998; Hughes et al. 2000; McGuire et al. 2000)
ArrayExpress database at European Bioinformatics Institute for microarray analysis
BRITE: Database of protein-protein interactions and cross-reference links
EcoCyc electronic encyclopedia of genes and metabolism of E. coli (Karp et al. 2000)
Expression Profiler tools for analysis and clustering of gene expression and sequence data
Functional genomics sites
GENECLUSTER; Tamayo et al. (1999)
GeneX: A Collaborative Internet Database and Toolset for Gene Expression Data
MetaCyc metabolic encyclopedia (see EcoCyc)
Microarray guide: P. Brown lab
Microarray project at NIH
Microarray software
SMART: For the study of genetically mobile protein domains (Schultz et al. 2000)
SWISS-2DPAGE: Two-dimensional polyacrylamide gel electrophoresis database (Hoogland et al. 2000)
TIGR: Annotation and gene indexing resources, including analysis of the transcribed sequences represented in the public EST databases.
WIT (What is there?): Interactive metabolic reconstruction on the Web (Overbeek et al. 2000)
GFF (Gene-Finding Features): Specification for describing genes and other features of genomics
GO (gene ontology) controlled vocabulary
MAGPIE: Multipurpose Automated Genome Project Investigation Environment
TAMBIS: A conceptual model of molecular biology and bioinformatics and methods for querying the model (Baker et al. 1999)
RDP: The Ribosomal Database Project (RDP) provides ribosome-related data services to the scientific community, including online data analysis, rRNA-derived phylogenetic trees, and aligned and annotated rRNA sequences
GO: a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells accumulates and changes.


Miscellaneous Tools For Bioinformatics Analysis On the WWW  

Pairwise Sequence Alignment  
Global alignment programs (GAP, NAP)  Huang (1994)
BLAST 2 sequence alignment (BLASTN, BLASTP)  Altschul et al. (1990)
Bayes block aligner 
BCM Search Launcher: Pairwise sequence alignment 
SIM—Local similarity program for finding alternative alignments 
FASTA program suite  Pearson and Miller (1992); Pearson (1996)
Likelihood-weighted sequence alignment (lwa) 

Multiple Sequence Alignment  
CLUSTALW or CLUSTALX (latter has graphical interface), available by FTP  Thompson et al. (1994a, 1997); Higgins et al. (1996)
MSA, available by FTP  Lipman et al. (1989); Gupta et al. (1995)
PRALINE  Heringa (1999)

DIALIGN segment alignment  Morgenstern et al. (1996)
MultAlin  Corpet (1988)
Parallel PRRN progressive global alignment  Gotoh (1996)
SAGA genetic algorithm  Notredame and Higgins (1996)
 Protein Profile Generation Tools Based on MSA 
Aligned Segment Statistical Evaluation Tool (Asset), available by FTP  Neuwald and Green (1994)
BLOCKS Web site   Henikoff and Henikoff (1991, 1992)
eMOTIF Web server  http://dna.Stanford.EDU/emotif/  Nevill-Manning et al. (1998)
GIBBS, the Gibbs sampler statistical method, available by FTP  Lawrence et al. (1993); Liu et al. (1995); Neuwald et al. (1995)
HMMER hidden Markov model software  Eddy (1998)
MACAW, a workbench for multiple alignment construction and analysis, available by FTP  Schuler et al. (1991)
MEME Web site, expectation maximization method  Bailey and Elkan (1995); Grundy et al. (1996, 1997); Bailey and Gribskov (1998)
Profile analysis at UCSD  Gribskov and Veretnik (1996)
SAM hidden Markov model Web site  Krogh et al. (1994); Hughey and Krogh (1996)

RNA Tools  
MFOLD minimum energy RNA configuration  Zuker et al. (1991)
RNA editing Web site, UCLA  Simpson et al. (1998)
RNA editing, uridine insertion/deletion  Simpson et al. (1998)
RNA modification database  Limbach et al. (1994); Rozenski et al. (1999)
RNA secondary structures, Group I introns, 16S rRNA, 23S rRNA  Gutell (1994); Schnare et al. (1996 and references therein)
tRNAscan-SE search server  Lowe and Eddy (1997)
Vienna RNA package for RNA secondary structure prediction and comparison  Hofacker et al. (1998); Wuchty et al. (1999)

DATABASE SEARCHES

Sequence similarity search with a query sequence: searches a protein sequence database (or genomic sequences) for database sequences that can be aligned with the query sequence. The query is a single sequence, e.g., DAHQSNGA.


PROFILESEARCH: alignment search with a profile (a scoring matrix with gap penalties) against a protein sequence database. A profile is prepared from a multiple sequence alignment (Profilemake) and aligned with each database sequence. The query is a profile representing a gapped multiple sequence alignment, e.g., D-HQSNGA, ESHQ-YTM, EAHQSN-L, EGVQSYSL.

MAST: search with a position-specific scoring matrix (PSSM) representing an ungapped sequence alignment (BLOCK) against a protein sequence database. The PSSM is prepared from an ungapped region of a multiple sequence alignment, or by searching for patterns of the same length in unaligned sequences, and is then used for the database search. The query is a PSSM representing an ungapped alignment, e.g., DAHQSN, ESHQSY, EAHQSN, EGVQSY.
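
The PSSM idea in the MAST entry can be sketched in a few lines of Python. This is only an illustration built from the four example sequences above; the uniform background frequency (1/20), the pseudocount of 1, and the function names are assumptions made for the sketch, not the scoring scheme that MAST or BLOCKS actually uses.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0, background=1.0 / 20):
    """Return one {residue: log-odds score} dict per column of an ungapped block."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in an ungapped block must have the same length")
    pssm = []
    for col in range(width):
        # Start every residue at the pseudocount so unseen residues get a finite score.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in block:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment of the same width as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_hit(pssm, sequence):
    """Slide the PSSM along a sequence; return (best score, start position)."""
    width = len(pssm)
    return max((score_segment(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


block = ["DAHQSN", "ESHQSY", "EAHQSN", "EGVQSY"]
pssm = build_pssm(block)
score, start = best_hit(pssm, "MKKEAHQSNLL")
print(start, round(score, 2))
```

A database search repeats `best_hit` for every database sequence and ranks the results; real programs also estimate the statistical significance of each score, which this sketch omits.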

PSI-BLAST: iterative alignment search for similar sequences that starts with a query sequence, builds a gapped multiple alignment, and then uses the alignment to augment the search. It uses the initial matches to the query sequence to build a type of scoring matrix and searches for additional matches to the matrix by an iterative search method. Matches are built up from the query sequence, e.g., DAHQSNGA; iteration 1: H-SNGA, EAHQSN-L; followed by further iterations. PSI-BLAST finds a set of sequences related to each other by the presence of common patterns (not every sequence may have the same patterns).

PROSITE: search of a query sequence for patterns representative of protein families, using a database of patterns found in protein families. The search is for patterns represented by a scoring matrix or hidden Markov model (profile HMM). The query is a single sequence, e.g., DAHQSNGA.
BCM Search Launcher (with programming links to several servers) 
bic-swa Bic server European Bioinformatics Institute 
MPsearch National Institute of Agrobiological Resources, Tsukuba, Japan 

Scanps G. Barton, European Bioinformatics Institute

SSEARCH E-mail server DNA Databank of Japan 

Swat Phil Green, University of Washington 

Programs and Web sites for database similarity searches with a regular expression, motif, block, or profile  
 Regular Expression and Motifs 
EMOTIF Scan  SwissProt and Genpept
Prosite patterns  SwissProt and TrEMBL
ISREC pattern-finding service  SwissProt and non-redundant EMBL database
fpat  PDB, SwissProt, Genpept
PHI-BLAST  BLAST databases
MOTIF  SwissProt, PDB, PIR, PRF, Genes
BLOCKS  most databases
MAST  most databases
BLIMPS  locally available databases; anonymous FTP
Probe  BLAST databases; anonymous FTP
Genefind  PIR
 PROFILE Programs 
Profilesearch  locally available databases; anonymous FTP
Profile-SS  most databases

Search Genes and Coding Regions  

FGENES and related programs that use linear discriminant analysis or hidden Markov models  Solovyev et al. (1995)
GeneFinder access site at the Sanger Centre: collection of methods
Genehacker for microbial genomes based on HMMs  Hirosawa et al. (1997)
GeneID-3 Web server using rule-based models, and GeneID+ mail server
GeneMark and GeneMark.hmm use hidden Markov models 
GeneParser Web page, uses a combination of neural network and dynamic programming methods  Snyder and Stormo (1993, 1995)
Genescan using Fourier transform of DNA sequences to find characteristic patterns  Tiwari et al. (1997)
Genetic code variations 
GenLang using linguistic methods  Dong and Searls (1994)
GenScan based on probabilistic model of gene structure for vertebrate, Drosophila, and plant genes  Burge and Karlin (1998)
GeneSeqer for aligning genomic and EST sequences; closely related to SplicePredictor
Glimmer uses interpolated Markov models for prokaryotic translation  Salzberg et al. (1998)
GrailII prediction by neural networks based on scores of characteristic sequence patterns and composition  Uberbacher and Mural (1991); Uberbacher et al. (1996)
Initiation codon analysis 
Microbial genome coding region identification based on Markov chains of order 5  Audic and Claverie (1998)
Procrustes based on comparison of related genomic sequences  Gelfand et al. (1996)
Push-button Gene Finder for gene identification using Markov and hidden Markov models 
Translate tool at ExPASy 
Translation machine on the Web at EBI 
Translation of large genome sequences on the Web 
Veil (Viterbi exon-intron locator) uses hidden Markov models for vertebrate DNA  Henderson et al. (1997)
Webgene, a set of gene prediction tools and concurrent database similarity searches 
Webgenemark and Webgenemark.hmm  see GeneMark; Lukashin and Borodovsky (1998)

Promoter Prediction Program  

ConsInspector: see Transfac database 
FastM for transcription factor binding sites  Klingenhoff et al. (1999)
GeneExpress analysis of transcriptional regulations with TRRD database Kolchanov et al. (1999a, b) 
Genome inspector for combined analysis of multiple signals in genomes  Quandt et al. (1997)
GrailII prediction of TSS by neural networks based on scores of characteristic sequence patterns and composition
MAR-FINDER for finding matrix attachment regions  Kramer et al. (1997); Singh et al. (1997)
MatInspector: Transfac database (for downloading)
Mirage (Molecular Informatics Resource for the Analysis of Gene Expression) 
NNPP Promoter Prediction by Neural Network for prokaryotes or eukaryotes  Reese et al. (1996)
NSITE: search for TF binding sites or other consensus regulatory sequences 
OOTFD Object-Oriented Transcription Factor Database  Ghosh (1998)
Pol3scan for RNAP III/tRNA promoter sequences using pattern scoring matrices  Pavesi et al. (1994)
Promoter element weight matrices and HMMs  Bucher (1990)
Promoter II for recognition of PolII sequences by neural networks  Knudsen (1999)
PromoterScan  Prestridge (1995); see also the Web site
RegScan for promoter classification  Babenko et al. (1999)
Sequence walkers for graphical viewing of the interaction of regulatory protein with DNA binding site  Schneider (1997)
Signal scan for transcriptional elements Prestridge (1991, 1996)  
TargetFinder for promoter searching in selected annotated sequences  Lavorgna et al. (1999)
TESS for searching for transcription factor binding sites Schug and Overton (1997a, b)  
Tfbind for transcription factor binding sites  Tsunoda and Takagi (1999)
Transfac programs providing search for TF binding sites. MatInd for making scoring matrices and MatInspector for searching for matches to matrices  Knüppel et al. (1994); Quandt et al. (1995); Heinemeyer et al. (1999); Klingenhoff et al. (1999)
Wentian Li's Web site for multiple analysis

Protein Structure Analysis  

The PredictProtein server at the European Molecular Biology Laboratory at Heidelberg, Germany: important site for secondary structure prediction by PHD, Predator, TOPITS, and Threader 
Swiss Institute of Bioinformatics, Geneva: basic types of protein analysis, databases, the Swiss-Model resource for prediction of protein models, and Swiss-PdbViewer 

Protein Structure Viewer  
Chime  A Web browser plug-in that can be used to display and manipulate structures inside a Web page. There are many mouse-driven controls. Excellent for lecture presentations.
Cn3D  (Hogue 1997) Provides viewing of three-dimensional structures from Entrez and MMDB. Cn3D runs on Windows, MacOS, and Unix; simultaneously displays structural and sequence alignments; can show multiple superimposed images from NMR studies.
Mage  (see Richardson and Richardson 1994) Standard molecular viewing features with animation and kaleidoscope effects.
RasMol  (Sayle and Milner-White 1995) Most commonly used viewer for Windows, MacOS, UNIX, and VMS operating systems. Performs many functions.
Swiss 3D viewer, Spdbv  (Guex and Peitsch 1997) Protein models can be built by structural alignments; calculates atomic angles and distances, performs threading and energy minimization, and interacts with the Swiss Model server.

Protein Secondary Structure Prediction  

Modeller  dynamic programming alignment of sequences and structures and molecular dynamics methods Sali et al. (1995)
Swiss-model  sequence alignment of query with sequences of known structure Peitsch (1996)
Whatif  flexible molecular graphics rendering of models Rodriguez et al. (1998)
Baylor College of Medicine (BCM)  collection of methods and linked to other servers
DSC  linear discrimination King et al. (1997)
J-Pred structure prediction server  NNSSP, DSC, Predator, Mulpred, Zpred, Jnet, and PHD Cuff et al. (1998)
NNPRED  neural networks enhanced to detect sequence periodicity Kneller et al. (1990)
NPS@ server, MLR combination for secondary structure prediction  combination of prediction methods using multivariate linear regression to optimize the predictions Guermeur et al. (1999)
Protein Sequence Analysis (PSA) System  discrete space models (hidden Markov models) for patterns of α helices, β strands, tight turns, and loops in specific structural classes Stultz et al. (1993, 1997); White et al. (1994)
PREDATOR  based on analysis of long- and short-range amino acid interactions and alignments of sequence pairs Frishman and Argos (1995, 1996, 1997)
PredictProtein server (see also mirror sites)  neural networks applied to multiple sequence alignments Rost and Sander (1994); Rost (1996)
PSSP nearest neighbor enhanced by non-intersecting local and multiple sequence alignments Salamov and Solovyev (1995, 1997)  
Simpa96  nearest-neighbor method Levin (1997)
SOPM, SOPMA  nearest-neighbor method based on sequence alignments Geourjon and Deleage (1994, 1995)
SSP  linear discriminant analysis based on amino acid composition of local and adjacent regions see H option for this program on Web page
UCLA-DOE structure prediction server  collection of methods and linked to other servers Fischer and Eisenberg (1996)

Threading servers and programs   
123D  contact potentials between amino acid side groups Alexandrov et al. (1996)
3D-PSSM  sequence-structure using position-specific scoring matrices Russell et al. (1997)
Honig lab  threading methods using biophysical properties
Libra I  target sequence and 3D profile are aligned by dynamic programming Ota and Nishikawa (1997)
NCBI structure site  Gibbs sampling algorithm used to align sequence and structure Bryant (1996)
Profit  fold recognition by the contact potential method M. Sippl
Threader 2  prediction by recognition of the correct fold from a library of alternatives Jones et al. (1995)
TOPITS detects similar motifs of secondary structure and accessibility between a sequence of unknown structure and a known fold Rost (1995a,b)
UCLA-DOE structure prediction server  fold-recognition using 3D profiles and secondary structure prediction methods Fischer and Eisenberg (1996)
CASP  overall assessment of the methods


EMBOSS (downloadable source code)

alignment consensus FUNCTION AUTHOR
cons Creates a consensus from multiple alignments HGMP
megamerger Merge two large overlapping nucleic acid sequences HGMP
merger Merge two overlapping sequences HGMP

alignment differences  
diffseq Find differences between nearly identical sequences HGMP

alignment dot plots  
dotmatcher Produces a dotplot of two sequences. Sanger
dotpath Displays a non-overlapping wordmatch dotplot of two sequences HGMP
dottup DNA sequence dot plot Sanger
polydot Multiple dotplot Sanger

alignment global  
est2genome Align EST and genomic DNA sequences Sanger
needle Needleman-Wunsch global alignment. HGMP
stretcher Global alignment of two sequences. Sanger
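
The Needleman-Wunsch method named in the needle entry fills a dynamic programming table and traces back from the bottom-right corner. The following is a minimal sketch of that idea, not the EMBOSS implementation: the scores used here (match +1, mismatch -1, and a single linear gap penalty of -2) are assumptions chosen to keep the example short, whereas needle itself scores with a substitution matrix and separate gap-open and gap-extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, row1, row2 = needleman_wunsch("DAHQSNGA", "DHQSNGA")
print(total)
print(row1)
print(row2)
```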

alignment local  
matcher Local alignment of two sequences Sanger
seqmatchall Does an all-against-all comparison of a set of sequences Sanger
supermatcher Finds a match of a large sequence against one or more sequences Sanger
water Smith-Waterman local alignment. HGMP
wordmatch Finds all exact matches of a given size between 2 sequences Sanger
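
The Smith-Waterman method named in the water entry differs from global alignment in two ways: table cells are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. A minimal sketch follows; the scoring values (match +2, mismatch -1, linear gap -2) are assumptions for illustration and are not the defaults of water, which uses a substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere (local alignment).
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXXHQSNYYY", "ZZHQSNWW"))
```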

alignment multiple  
emma Multiple alignment program HGMP
infoalign Displays some simple information about sequences HGMP
plotcon Plots the quality of conservation of a sequence alignment HGMP
prettyplot Displays aligned sequences, with colouring and boxing. Sanger
showalign Display a multiple sequence alignment HGMP
tranalign Align nucleic coding regions given the aligned proteins HGMP

display  
cirdna Draws circular maps of DNA constructs Norway
lindna Draws linear maps of DNA constructs Norway
pepnet Protein helical net plot HGMP
pepwheel Shows protein sequences as helices HGMP
prettyseq Output sequence with translated ranges HGMP
remap Display a sequence with restriction cut sites, translation etc. HGMP
seealso Finds programs sharing group names HGMP
showdb Displays information on the currently available databases HGMP
showfeat Show features of a sequence. HGMP
showseq Display a sequence with features, translation etc HGMP
sixpack Display a DNA sequence with 6-frame translation and ORFs LION
textsearch Search sequence documentation text. SRS and Entrez are faster! HGMP

edit  
biosed Replace or delete sequence sections HGMP
cutseq Removes a specified section from a sequence. HGMP
degapseq Removes gap characters from sequences HGMP
descseq Alter the name or description of a sequence. HGMP
entret Reads and writes (returns) flatfile entries HGMP
extractfeat Extract features from a sequence HGMP
extractseq Extract regions from a sequence. HGMP
listor Writes a list file of the logical OR of two sets of sequences HGMP
maskfeat Mask off features of a sequence HGMP
maskseq Mask off regions of a sequence. HGMP
newseq Type in a short new sequence. HGMP
noreturn Removes carriage return from ASCII files HGMP
notseq Excludes a set of sequences and writes out the remaining ones HGMP
nthseq Writes one sequence from a multiple set of sequences HGMP
pasteseq Insert one sequence into another. HGMP
revseq Reverse and complement a sequence. HGMP
seqret Reads and writes (returns) a sequence. Sanger
seqretsplit Reads and writes (returns) sequences in individual files HGMP
skipseq Reads and writes (returns) sequences, skipping the first few HGMP
splitter Split a sequence into (overlapping) smaller sequences. HGMP
trimest Trim poly-A tails off EST sequences HGMP
trimseq Trim ambiguous bits off the ends of sequences HGMP
union Reads sequence fragments and builds one sequence LION
vectorstrip Strips out DNA between a pair of vector sequences HGMP
yank Reads a range from a sequence, appends the full USA to a list file LION
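
Of the sequence-editing programs listed above, the operation performed by revseq (reverse and complement) is simple enough to show directly. The sketch below handles only the unambiguous DNA letters plus N, which is an assumption made for brevity; it does not reproduce the full IUPAC handling or the command-line options of the EMBOSS program.

```python
# Translation table for the complement of each base (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCCN"))
```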

enzyme kinetics  
findkm Calculates Km and Vmax for an enzyme reaction HGMP

feature tables  
coderet Extract CDS, mRNA and translations from feature tables HGMP
twofeat Finds neighbouring pairs of features in sequences HGMP

information  
infoseq Displays some simple information about sequences HGMP
tfm Displays a program's help documentation manual HGMP
whichdb Search all databases for an entry HGMP
wossname Finds programs by keywords in their one-line documentation. HGMP

nucleic codon usage  
cai CAI codon usage statistic HGMP
chips Codon usage statistics HGMP
codcmp Codon usage table comparison HGMP
cusp Create a codon usage table HGMP
syco Synonymous codon usage Gribskov statistic plot HGMP

nucleic composition  
banana Bending and Curvature Plot in B-DNA Sanger
btwisted Calculates the twisting in a B-DNA sequence HGMP
chaos Create a chaos plot for a sequence. Sanger
compseq Counts the composition of dimer/trimer/etc words in a sequence HGMP
dan Plot melting temperatures for DNA. HGMP
freak Residue/base frequency table or plot HGMP
isochore Plots isochores in large DNA sequences Sanger
sirna Finds siRNA duplexes in mRNA HGMP
wordcount Counts words of a specified size in a DNA sequence. Sanger

nucleic cpg islands  
cpgplot Plot CpG rich areas HGMP
cpgreport Reports CpG rich regions HGMP
geecee Calculates the fractional GC content of nucleic acid sequences Sanger
newcpgreport Report CpG rich areas EBI
newcpgseek Reports CpG rich regions EBI
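
The quantities behind the programs in this group can be sketched briefly: the fractional GC content reported by geecee, and an observed/expected CpG ratio of the kind used to flag CpG-rich regions. The expected count used here, (number of C × number of G) / sequence length, is a commonly used definition and is stated as an assumption; the window sizes and thresholds that the EMBOSS programs apply are not reproduced.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected ratio of CpG dinucleotides in a sequence."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")        # "CG" cannot overlap itself, so count() is exact
    expected = c * g / len(seq)
    return observed / expected


print(gc_fraction("ATGCGCGT"), cpg_obs_exp("ATGCGCGT"))
```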

nucleic gene finding  
getorf Finds and extracts open reading frames (ORFs) HGMP
marscan Finds MAR/SAR sites in nucleic sequences HGMP
plotorf Plot potential open reading frames HGMP
showorf Pretty output of DNA translations HGMP
wobble Wobble base plot HGMP

nucleic motifs  
dreg Regular expression search of a nucleotide sequence Sanger
fuzznuc Nucleic acid pattern search HGMP
fuzztran Protein pattern search after translation HGMP

nucleic mutation  
msbar Mutate sequence beyond all recognition HGMP
shuffleseq Shuffles a set of sequences maintaining composition HGMP

nucleic primers  
eprimer3 Picks PCR primers and hybridization oligos HGMP
primersearch Searches DNA sequences for matches with primer pairs HGMP
stssearch Searches a DNA database for matches with a set of STS primers Sanger

nucleic profiles  
profit Scan a sequence or database with a matrix or profile HGMP
prophecy Creates matrices/profiles from multiple alignments HGMP
prophet Gapped alignment for profiles HGMP

nucleic repeats  
einverted Finds DNA inverted repeats Sanger
equicktandem Finds tandem repeats Sanger
etandem Looks for tandem repeats in a nucleotide sequence. Sanger
palindrome Looks for inverted repeats in a nucleotide sequence. HGMP

nucleic restriction  
recoder Find and remove restriction sites but maintain the same translation HGMP
redata Isoschizomers, references and Suppliers for Restriction Enzymes HGMP
restover Finds restriction enzymes that produce a specific overhang Sloan-Kettering Cancer Center
restrict Finds Restriction Enzyme Cleavage Sites HGMP
silent Silent mutation restriction enzyme scan HGMP

nucleic transcription  
tfscan Scans DNA sequences for transcription factors. HGMP

nucleic translation  
backtranseq Back translate a protein sequence HGMP
transeq Translates nucleic acid sequences. HGMP

phylogeny  
distmat Creates a distance matrix from multiple alignments HGMP

protein 2d structure  
garnier Predicts protein secondary structure EBI
helixturnhelix Finds nucleic acid binding domains. HGMP
hmoment Hydrophobic moment calculation HGMP
pepcoil Predicts coiled coil regions HGMP
tmap Predict transmembrane proteins Sanger

protein composition  
charge Protein charge plot HGMP
checktrans ORF property statistics EBI
emowse Protein identification by mass spectrometry HGMP
iep Calculates the isoelectric point of a protein HGMP
mwfilter Filter noisy molwts from mass spec output HGMP
octanol Displays protein hydropathy Sanger
pepinfo Plots simple amino acid properties in parallel HGMP
pepstats Protein statistics HGMP
pepwindow Displays protein hydropathy Sanger
pepwindowall Displays protein hydropathy of a set of sequences Sanger

protein motifs  
antigenic Finds antigenic sites in proteins HGMP
digest Protein proteolytic enzyme or reagent cleavage digest HGMP
fuzzpro Protein pattern search HGMP
oddcomp Finds protein sequence regions with a biased composition. Norway
patmatdb Matching a Prosite motif against a Protein Sequence Database. HGMP
patmatmotifs Compares a protein sequence to the PROSITE motif database. HGMP
pestfind Finds PEST motifs as potential proteolytic cleavage sites Austria
preg Regular expression search of a protein sequence Sanger
pscan Locates fingerprints (multiple motif features) in a protein sequence. HGMP
sigcleave Predicts signal peptide cleavage sites HGMP

utils database creation  
aaindexextract Extract data from AAINDEX HGMP
cutgextract CUTG: Codon Usage Tabulated from GenBank by organism HGMP
printsextract Preprocesses the PRINTS database for use with the program PSCAN HGMP
prosextract Extracts ID, AC, and PA lines from the PROSITE motif database. HGMP
rebaseextract Extract data from REBASE HGMP
tfextract Extract data from TRANSFAC HGMP

utils database indexing  
dbiblast Database indexing for BLAST 1 and 2 indexed databases Sanger
dbifasta Index a fasta database HGMP
dbiflat Database indexing for flat file databases Sanger
dbigcg Database indexing for GCG formatted databases Sanger

utils misc  
embossdata Finds or fetches the data files read in by the EMBOSS programs HGMP
embossversion Writes the current EMBOSS version number HGMP

PHYLIP TOOLS (downloadable source code)

Heuristic search for best tree 

PROTPARS Estimates phylogenies from protein sequences (input using the standard one-letter code for amino acids) using the parsimony method, in a variant which counts only those nucleotide changes that change the amino acid, on the assumption that silent changes are more easily accomplished.

DNAPARS. Estimates phylogenies by the parsimony method using nucleic acid sequences. Allows use of the full IUB ambiguity codes, and estimates ancestral nucleotide states. Gaps are treated as a fifth nucleotide state.

DNACOMP. Estimates phylogenies from nucleic acid sequence data using the compatibility criterion, which searches for the largest number of sites which could have all states (nucleotides) uniquely evolved on the same tree. Compatibility is particularly appropriate when sites vary greatly in their rates of evolution, but we do not know in advance which are the less reliable ones.

DNAML.  Estimates phylogenies from nucleotide sequences by maximum likelihood. The model employed allows for unequal expected frequencies of the four nucleotides, for unequal rates of transitions and transversions, and for different (prespecified) rates of change in different categories of sites, with the program inferring which sites have which rates.

DNAMLK. Same as DNAML but assumes a molecular clock. The use of the two programs together permits a likelihood ratio test of the molecular clock hypothesis to be made.

RESTML. Estimation of phylogenies by maximum likelihood using restriction sites data (not restriction fragments but presence/absence of individual sites). It employs the Jukes-Cantor symmetrical model of nucleotide change, which does not allow for differences of rate between transitions and transversions. This program is VERY slow.

FITCH. Estimates phylogenies from distance matrix data under the "additive tree model" according to which the distances are expected to equal the sums of branch lengths between the species. Uses the Fitch-Margoliash criterion and some related least squares criteria. Does not assume an evolutionary clock. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.

KITSCH. Estimates phylogenies from distance matrix data under the "ultrametric" model which is the same as the additive tree model except that an evolutionary clock is assumed. The Fitch-Margoliash criterion and other least squares criteria are assumed. This program will be useful with distances computed from DNA sequences, with DNA hybridization measurements, and with genetic distances computed from gene frequencies.

NEIGHBOR An implementation by Mary Kuhner and John Yamato of Saitou and Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage clustering) method. Neighbor Joining is a distance matrix method producing an unrooted tree without the assumption of a clock. UPGMA does assume a clock. The branch lengths are not optimized by the least squares criterion but the methods are very fast and thus can handle much larger data sets.
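
The UPGMA (average linkage) method mentioned in the NEIGHBOR entry can be sketched compactly: repeatedly merge the two closest clusters, place the new node at half their distance (this is where the clock assumption enters), and average the distances to the remaining clusters weighted by cluster size. The code below is an illustrative sketch with hypothetical names and a made-up three-species matrix; it is not the PHYLIP implementation, and Neighbor Joining is not shown.

```python
def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left_subtree, right_subtree, node_height),
    with leaves represented by their names.
    """
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters (keys are always ordered, i < j).
        (i, j), d = min(dist.items(), key=lambda item: item[1])
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average of the distances to the merged clusters.
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, d / 2), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree)
```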

CONTML.  Estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations. Does not assume a molecular clock. An alternative method of analyzing this data is to compute Nei's genetic distance and use one of the distance matrix programs.

MIX.  Estimates phylogenies by some parsimony methods for discrete character data with two states (0 and 1). Allows use of the Wagner parsimony method, the Camin-Sokal parsimony method, or arbitrary mixtures of these. Also reconstructs ancestral states and allows weighting of characters.

DOLLOP Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0 and 1). Also reconstructs ancestral states and allows weighting of characters. Dollo parsimony is particularly appropriate for restriction sites data; with ancestor states specified as unknown it may be appropriate for restriction fragments data.

Branch-and-bound exact search for best tree 
DNAPENNY.  Finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search. This may not be practical (depending on the data) for more than 10 or 11 species.
PENNY.  Finds all most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.
DOLPENNY.  Finds all most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.
CLIQUE.  Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states. The largest clique (or all cliques within a given size range of the largest one) is found by a very fast branch and bound search method. The method does not allow for missing data. For such cases the T (Threshold) option of MIX may be a useful alternative. Compatibility methods are particularly useful when some characters are of poor quality and the rest of good quality, but when it is not known in advance which ones are which.
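
The compatibility idea behind CLIQUE can be illustrated for two-state characters: two characters are compatible when fewer than four of the state combinations (0,0), (0,1), (1,0), (1,1) occur across the species, and a set of pairwise compatible two-state characters can all evolve on one tree. The sketch below finds a largest such set by brute-force enumeration, which is an assumption made for clarity and is practical only for small inputs; CLIQUE itself uses a branch and bound search. The function names and the example data are hypothetical.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Two binary characters are compatible unless all four state pairs occur."""
    return len(set(zip(char_a, char_b))) < 4


def largest_compatible_set(characters):
    """Return indices of one largest set of mutually compatible characters."""
    n = len(characters)
    for size in range(n, 0, -1):          # try the largest sets first
        for subset in combinations(range(n), size):
            if all(compatible(characters[i], characters[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []


# Each string holds the state of one character in five species.
characters = ["11000", "00110", "10100", "11100"]
print(largest_compatible_set(characters))
```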

Distances or bootstrap samples 

DNADIST Computes four different distances between species from nucleic acid sequences. The distances can then be used in the distance matrix programs. The distances are the Jukes-Cantor formula, one based on Kimura's 2-parameter method, Jin and Nei's distance, which allows for rate variation from site to site, and a maximum likelihood method using the model employed in DNAML. The last of these methods can be very slow.
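
To make the first two distances concrete, here is a minimal sketch of the textbook closed-form Jukes-Cantor and Kimura 2-parameter formulas for one pair of aligned sequences. It is an assumption that these closed forms are close enough for illustration; DNADIST's own implementation may differ in detail, and the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) over ungapped A/C/G/T sites."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    n, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) fractions."""
    n, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


print(jukes_cantor("ACGTACGTAC", "ACGTACGTAT"))
print(kimura_2p("ACGTACGTAC", "ACGTACGTAT"))
```

Both functions fail with a math domain error when the sequences are too divergent for the formula (the logarithm's argument becomes non-positive), which is a known limit of these corrections.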

PROTDIST Computes a distance measure for protein sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on the genetic code plus a constraint on changing to a different category of amino acid. The distances can then be used in the distance matrix programs.

SEQBOOT Reads in a data set, and produces multiple data sets from it by bootstrap resampling. Since most programs in the current version of the package allow processing of multiple data sets, this can be used together with the consensus tree program CONSENSE to do bootstrap (or delete-half-jackknife) analyses with most of the methods in this package. This program also allows the Archie/Faith technique of permutation of species within characters.
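
The bootstrap resampling step can be sketched as follows: alignment columns (sites) are drawn with replacement, and the same columns are used for every species. This is a simplified illustration under the assumption of a plain dictionary of equal-length sequences; it does not read or write PHYLIP file formats, and the names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    `alignment` maps species name -> aligned sequence (all the same length).
    The same sampled columns are used for every species, so each replicate
    keeps the original number of sites.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Produce a list of resampled data sets from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


alignment = {"human": "ACGTAC", "chimp": "ACGTAT", "mouse": "ATGTTT"}
replicates = bootstrap_replicates(alignment, n_replicates=100)
print(replicates[0])
```

Each replicate would then be analyzed with a tree-building program, and the resulting trees summarized with a consensus method.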

GENDIST Computes one of three different genetic distance formulas from gene frequency data. The formulas are Nei's genetic distance, the Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al. The first is appropriate for data in which new mutations occur in an infinite isoalleles neutral mutation model, the latter two for a model without mutation and with pure genetic drift. The distances are written to a file in a format appropriate for input to the distance matrix programs.

FACTOR Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham

Tree manipulation, plotting, consensus 

DRAWGRAM Plots rooted phylogenies, cladograms, and phenograms in a wide variety of user-controllable formats. The program is interactive and allows previewing of the tree on PC graphics screens, and Tektronix or DEC graphics terminals. Final output can be on a laser printer (such as the Apple Laserwriter or HP Laserjet), on graphics screens or terminals, in files readable by drawing programs such as PC Paintbrush, MacDraw, Idraw, and Xfig, on pen plotters (Hewlett-Packard or Houston Instruments) or on dot matrix printers capable of graphics

DRAWTREE Similar to DRAWGRAM but plots unrooted phylogenies

CONSENSE Computes consensus trees by the majority-rule consensus tree method, which also allows one to easily find the strict consensus tree. Does NOT compute the Adams consensus tree. Trees are input in a tree file in standard nested-parenthesis notation, which is produced by many of the tree estimation programs in the package. This program can be used as the final step in doing bootstrap analyses for many of the methods in the package
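
A minimal sketch of the majority-rule idea: count how often each clade appears across the input trees and keep those present in more than half of them. As a simplifying assumption, each tree is supplied directly as a collection of clades (frozensets of species names) rather than parsed from nested-parenthesis notation; the function name is hypothetical.

```python
from collections import Counter


def majority_rule_clades(trees):
    """Return {clade: frequency} for clades found in more than half the trees.

    Each tree is a collection of clades, and each clade is a frozenset of
    species names (a simplification: no tree-file parsing is done here).
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    threshold = len(trees) / 2
    return {
        clade: n / len(trees)
        for clade, n in counts.items()
        if n > threshold
    }


# Three rooted trees on species A, B, C, D, listed by their non-trivial clades.
trees = [
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABD")],
    [frozenset("AC"), frozenset("ABC")],
]
consensus = majority_rule_clades(trees)
print(consensus)
```

When the input trees come from bootstrap replicates, the stored frequencies correspond to bootstrap support for each clade. Raising the threshold so that only clades present in every tree survive gives the strict consensus.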

RETREE Reads in a tree (with branch lengths if necessary) and allows you to reroot the tree, to flip branches, to change species names and branch lengths, and then write the result out. Can be used to convert between rooted and unrooted trees.

Interactive tree manipulation 

DNAMOVE Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility and the display of reconstructed ancestral bases. This can be used to find parsimony or compatibility estimates by hand.

MOVE Interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

DOLMOVE Interactive construction of phylogenies from discrete character data with two states (0 and 1) using the Dollo or polymorphism parsimony criteria. Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

RETREE Reads in a tree (with branch lengths if necessary) and allows you to reroot the tree, to flip branches, to change species names and branch lengths, and then write the result out. Can be used to convert between rooted and unrooted trees. Does not refer to any data.

List of Other Phylogenetic Analysis Tools

EBI Tools 

Homology & Similarity programs can be used to look for sequence similarity - BLAST or Fasta
Protein Functional Analysis can be used to search for motifs in your protein sequence - InterProScan
Structural Analysis can be used to query your protein structure and compare it to those in the Protein Data Bank (PDB) - MSDfold or DALI
Sequence Analysis - ClustalW, a sequence alignment tool
Miscellaneous Tools - Expression Profiler: A set of tools for clustering, analysis and visualization of gene expression and other genomic data


Proteomics and sequence analysis tools  
Proteomics PeptIdent PeptideMass
DNA -> Protein Translate
Similarity searches BLAST
Pattern and profile searches ScanProsite
Post-translational modification and topology prediction  
Primary structure analysis ProtParam,  pI/MW ProtScale
Secondary and tertiary structure prediction SWISS-MODEL Swiss-PdbViewer
Alignment T-COFFEE SIM

Biological text analysis Software for 2-D PAGE analysis
Roche Applied Science's Biochemical Pathways  

RCSB-Developed Software  

mmCIF Resources  
CIFLIB  C language application program interface
CIFOBJ  A class library of mmCIF dictionary access tools
CIFPARSE  A library of access tools for mmCIF
CIFPARSE-OBJ  A library of access tools for mmCIF in C++
CIFTABLE (SSTable)  A class library of table access tools (old version)
CIFTABLE (ISTable)  A class library of table access tools
mmCIF loader  An application to load mmCIF data into relational databases and XML
OpenMMS Toolkit A suite of Java source code that includes an mmCIF parser, RDBMS loader, XML translator, and Corba server 
STAR (CIF) parser  Several object-oriented Perl modules for parsing mmCIF files and other STAR-compliant files without nested loops
Deposition Resources  
ADIT - Workstation Version (alpha release)  A package for editing and checking structure data entries
MAXIT  An application for processing and curation of macromolecular structure data
PDB_EXTRACT  (download) Tools and examples for extracting mmCIF data from structure determination applications
PDB Validation Suite (beta version)  A tool for processing and checking structure data
FTP Archive Resources  
bnl2rcsb  Perl script to convert a BNL FTP directory structure to an RCSB FTP directory structure
getPdbUpdate  Perl script to retrieve files from any update found at

Other Software Links*  

mmCIF software tools  
A library of ANSI-C functions providing a simple mechanism for accessing Crystallographic Binary Files (CBF files) and Image-supporting CIF (imgCIF) files   
cif2pdb Program to convert mmCIF to pseudo-PDB format
Extended CIF Tool Box (Fortran) with CYCLOPS and cif2cif   
Applications to manipulate STAR files (Objective-C)   
Scripts to filter a PDB entry and produce mmCIF   


ARP/wARP A system for the refinement of protein structures via automatic updating and re-building of the model and solvent structure

CCP4 suite of programs covering all aspects of crystallographic structure determination, refinement and analysis  

CNS A system for structure determination from crystallographic or NMR data
MAIN interactively driven suite of programs for molecular modeling, density modification, model refinement and structure analysis An interactive system for building and manipulating models in electron density maps

SHELX A set of programs for direct structure solution and refinement with high resolution diffraction data

SOLVE An automated system for phase determination from MIR and MAD data

X-PLOR 3.851 A program for structure determination from crystallographic or NMR data (Yale version)

X-PLOR/CNX A program for structure determination from crystallographic or NMR data (Accelrys version)

XtalView An interactive system for building and manipulating models in electron density map and for phase determination from MIR or MAD data.


CNS A system for structure determination from crystallographic or NMR data

CYANA A program for the structure calculation of biological macromolecules on the basis of conformational constraints from NMR

Fantom A program for structure calculation and refinement using torsion angle minimization with NMR data

X-PLOR 3.851 A program for structure determination from crystallographic or NMR data (Yale version)
Structure Analysis and Verification  

CE/CL Software for structure comparison by Combinatorial Extension (CE) and Compound Likeness (CL)
A Web server for searching homologous sequences and giving information on secondary structure elements, accessibility, hydropathy and protein-protein contacts   

ESPript Easy Sequencing in Postscript

Non-covalent bond finder Software for finding non-covalent interactions for use with Chime 2 or higher

PASS A fast cavity-detection program for the identification and visualization of possible protein binding sites

Procheck A program that checks the stereochemical quality of a protein structure

ProFit A program for fitting protein structures on to each other

SARF2 A program which searches for similar structural motifs (via an analysis of backbone fragments) in protein structures
Surface Racer 
A program that calculates exact accessible surface area, molecular surface area and average curvature of molecular surface, and analyzes cavities in the protein interior inaccessible from the outside.   

SURFNET A program which generates surfaces and void regions between molecular surfaces

WHAT_CHECK A system for protein structure validation derived from the WHAT IF program

WHAT IF protein structure analysis program that may be used for mutant prediction, structure verification and molecular graphics  

Modeling and Simulation  

ANALYZE Cornell Theory Center program to classify and analyze conformations obtained from global searches; includes capability to compare NMR intensities and coupling constants to experimental data

AMBER Assisted Model Building with Energy Refinement - a molecular dynamics and energy minimization program
A suite of automated docking tools designed to predict how small molecules, such as substrate or drug candidates, bind to a receptor of known 3D structure   

CHARMM Chemistry at HARvard Molecular Mechanics - a molecular dynamics and energy minimization program

ECEPPAK Cornell Theory Center package to carry out global conformational searches using the ECEPP/3 force field

FTDOCK A program for carrying out rigid-body docking between biomolecules

GROMOS A general-purpose molecular dynamics computer simulation package for the study of biomolecular systems

GROMACS modelling package for proteins, membrane systems and more, including fast molecular dynamics, normal mode analysis, essential dynamics analysis and many trajectory analysis utilities  

ICM ICM programs and modules for applications including for structure analysis, modeling, docking, homology modeling and virtual ligand screening  
Suite of tools for model building, structure prediction and refinement, reconstruction, and minimization; for SGI, Linux, and Sun Solaris   

LOOPP Linear Optimization of Protein Potentials. Cornell Theory Center program for potential optimization and alignments of sequences and structures
MAtching Molecular Models Obtained from THeory - a program for automated pairwise and multiple structural alignments; for SGI, Linux, and Sun Solaris   

MidasPlus program for displaying, manipulating and analysing macromolecules  

MODELLER A program for automated protein homology modeling

MOIL Cornell Theory Center package for molecular dynamics simulation of biological molecules

NAMD A parallel object-oriented molecular dynamics simulation program

WAM - Web Antibody Modelling A server for automated structure modeling from antibody Fv sequences

123D program which threads a sequence through a set of structures using substitution matrix, secondary structure prediction and contact capacity potential  

Molecular Graphics  


Shockwave 3D PDB Viewer A tool for creating and viewing dynamic, formatted structure annotations; for Windows
Free, easy to use tool for viewing molecular structures through a Web page--streams data directly from PDB on PC's and Mac; developed in Ireland   
Chemscape Chime 
From MDL Information Systems. This program allows visualisation of structures within WWW browser pages. For further information about Chime see the UMass Chime Resources Page 
Java3D Molecular Visualisation System 
Free Java/Java3D program and source code   

Mage and Kinemages molecular display for research and educational uses. Free, open source for Macintosh, PC, Unix, and Linux. A Java version does 3-D Web display without plug-ins.  
A program for displaying, analyzing, and manipulating the 3-D structure of biological macromolecules, with special emphasis on the study of protein or DNA structures determined by NMR   

RasMol free viewing system for PDB coordinate files that runs on Macintosh, PC and UNIX systems. Open source versions

Raster3D set of tools for generating high quality raster images of proteins or other molecules. Freeware for UNIX, LINUX and PC.  

RasTop (v. 2.0) free user-friendly graphical interface to RasMol molecular visualization software, available for Windows platforms  

Ribbons A program for molecular illustration and error analysis
A Tcl/Tk script responsible to redirect PDB files or RasMol scripts to multiple RasMol sessions; can be used as a Web browser helper application or as a standalone program.   
Swiss PDB viewer available from Switzerland  | Australia
A 3D graphics and molecular modeling program for the simultaneous analysis of multiple models and for model-building into electron density maps. The software is available for Macintosh or PC  
Uppsala Electron Density Server Generated density maps

MolScript A program for displaying structures in both detailed and schematic formats and writing images in various formats

MolView and MolView Lite Free molecular visualization programs for the Macintosh
Free, user-friendly server that converts PDB files to animated gif files that can be used in Web pages and presentations. Simple step-by-step instructions can be found here .
Program to view and manipulate PDB files on a PocketPC   
Free viewer to display and manipulate PDB files and create animations and slides of proteins   
A free and open-source molecular graphics system for visualization, animation, editing, and publication-quality imagery. PyMOL is scriptable and can be extended using the Python language. Supports Windows, Mac OSX, and Unix   
A lightweight OpenGL based molecular viewer for Windows 95/NT/00 and X Windows   
ViewerLite and ViewerPro (Discovery Studio) Molecular visualization programs for Macintosh and PC from Accelrys
VMD (Visual Molecular Dynamics) runs on many platforms including MacOS X, and several versions of Unix and Windows. VMD provides visualization, analysis, and Tcl/Python scripting features, and has recently added sequence browsing and volumetric rendering features. VMD is distributed free of charge.  
WebMol A Java PDB Viewer. WebMol was designed to display and analyze structural information contained in the Protein Data Bank (PDB). It can be run as an applet or as a stand-alone application.
World Index of Molecular Visualization Resources
A Visitor-Maintained Indices (VMI)TM Site by Eric Martz and Trevor D. Kramer. Contains many links to visualization tools, tutorials, and other resources.  

TIGR Tools 

Gene Finding/Annotation  

MANATEE  is a  web-based gene evaluation and genome annotation tool. Manatee can store and view annotation for prokaryotic and eukaryotic genomes. The Manatee interface allows biologists to quickly identify genes and make high quality functional assignments, such as GO classifications, using search data, paralogous families, and annotation suggestions generated from automated analysis.

GlimmerM  A gene finder derived from Glimmer, but developed specifically for eukaryotes. It is based on a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of these combinations. The decision about which gene model is best is based on a combination of the strength of the splice sites and the score of the exons generated by an interpolated Markov model (IMM). The system has been trained for Arabidopsis thaliana, Oryza sativa (rice), and Plasmodium falciparum (the malaria parasite), and should work well on closely related organisms.

Glimmer   A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.

GeneSplicer: A computational method for splice site prediction. A fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana and human genomes. Training data sets for human and Arabidopsis thaliana are included. It is fully described in Pertea M, Lin X, Salzberg SL. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001 Mar 1;29(5):1185-90.

TransTerm is a program that finds rho-independent transcription terminators in bacterial genomes. Each terminator found by the program is assigned a confidence value that provides an estimate of its probability of being a true terminator. TransTerm has been published: Prediction of Transcription Terminators in Bacterial Genomes. Ermolaeva, M.D., Khalak, H.G., White, O., Smith, H.O., Salzberg, S.L. Journal of Molecular Biology 301, 27-33 (2000)

EXONomy  is a  new gene finder based on the Generalized Hidden Markov Model (GHMM) framework, similar to Genscan and Genie. It is highly reconfigurable and includes software for retraining. The replaceable submodels of the GHMM include homogeneous and inhomogeneous Markov models of selectable order, nonstationary Markov chains, windowed and non-windowed Weight Array Matrices (WWAM/WAM/WMM), Maximal Dependence Decomposition (MDD) trees, and codon bias. An EXONomy Web Interface is available.

Unveil   is a new gene finder based on a 283-state Hidden Markov Model (HMM) similar to that described in [Henderson,J., Salzberg,S., and Fasman,K.H. (1997) J. Comput. Biol. 4, 127-141]. An Unveil Web Interface is available.

ELPH    is a  general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences. The program takes as input a set containing anywhere from a few dozen to thousands of sequences, and searches through them for the most common motif, assuming that each sequence contains one copy of the motif.

RepeatFinder is a  computational system for analysis of repetitive structure of genomic sequences. The method uses suffix trees for efficient computation of exact repeats and organizes those repeats into classes. The method can be applied to individual genome sequences or sets of sequences. The output is multi-fasta file of found repeat sequences that can be used as the target of searches.

RBSfinder is a  Perl script that implements an algorithm to find ribosome binding sites for genes in bacterial and archaeal genomes. It is normally run as a post-processor to the Glimmer gene finder or to other prokaryotic gene finders.

Combiner is a  program that predicts gene models using the output from other annotation software. It uses a statistical algorithm to identify patterns of evidence corresponding to gene models.

HBQCM:  Hexamer Based Quality Control Method as described in White O., Dunning T., Sutton G., Adams M., Venter J.C., and Fields C. (1993) A quality control algorithm for DNA sequencing projects. Nucleic Acids Research 21:3829-3838.


MUMmer  A system for aligning whole genome sequences. Using an efficient data structure called a suffix tree, the system can rapidly align sequences containing millions of nucleotides. It is fully described in: A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, 27:11 (1999), 2369-2376. A graphical viewer for the MUMmer output is also available.

AAT  A tool for analyzing and annotating genomic sequences. Huang, X., Adams, M.D., Zhou, H. and Kerlavage, A.R. (1997) Genomics 46, 37-45. The AAT package includes two sets of programs, one set (DPS/NAP) for comparing the query sequence with a protein database, and the other (DDS/GAP2) for comparing the query with a cDNA database.

Assembler:  A tool for assembly of large sets of overlapping sequence data such as ESTs, BACs, or small genomes. This updated assembly tool delivers better performance and results than the previous version, assembling EST, BAC, and genome data with greater care given to repeat detection and contig-level overlapping. TIGR Assembler has been published (Sutton G., White, O., Adams, M., and Kerlavage, A. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science & Technology 1:9-19). Also available, without a license, is the utility ta2ace for converting TIGR Assembler output into the "new" .ACE format used by Consed and other sequence assembly editors.

BAMBUS is the  first publicly available genome sequence scaffolding program. It orders and orients contigs into scaffolds based on various types of linking information. Additionally, BAMBUS allows users to build scaffolds in a hierarchical fashion by prioritizing the order in which links are used. BAMBUS runs on Unix systems.

Lucy:  A Sequence Cleanup Program. Lucy is a utility that prepares raw DNA sequence fragments for sequence assembly, possibly using the TIGR Assembler. The cleanup process includes quality assessment, confidence reassurance, vector trimming and vector removal. The primary advantage of Lucy over other similar utilities is that it is a fully integrated, stand alone program. You can view the Program Requirements. The Windows version of Lucy is available from Hui-Hsien Chou's webpage. Lucy is fully described in: DNA sequence quality trimming and vector removal. H.-H. Chou and M.H. Holmes. Bioinformatics, 17:12, pp. 1093-1104, 2001


TM4: A package of Open Source software programs for Microarray analysis

TIGR Microarray Data Analysis System (MIDAS) is a microarray data quality filtering and normalization tool that allows raw experimental data to be processed through various data normalizations, filters, and transformations via a user-designed analysis pipeline. Currently implemented normalization and data analysis algorithms include total-intensity normalization, Lowess (Locfit) normalization, flip-dye consistency checking, replicates analysis, and intensity-dependent z-score filtering (slice analysis). MIDAS is implemented in Java and is therefore platform-independent. It requires JDK v1.3 or higher. Refer to the included manual for details.
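
Of the normalizations listed, total-intensity normalization is the simplest to illustrate. The sketch below assumes a two-channel experiment (the channel names cy3 and cy5 are assumptions) in which most genes are not differentially expressed, so the summed intensities of the two channels should match. It is not MIDAS code.

```python
import math


def total_intensity_normalize(cy3, cy5):
    """Scale Cy5 intensities so both channels have the same total intensity.

    Assumption: most genes are not differentially expressed, so the summed
    intensities of the two channels should be equal after normalization.
    Returns the rescaled Cy5 values and the scaling factor used.
    """
    factor = sum(cy3) / sum(cy5)
    return [value * factor for value in cy5], factor


def log2_ratios(cy3, cy5):
    """Per-spot log2(Cy5/Cy3) expression ratios."""
    return [math.log2(r / g) for g, r in zip(cy3, cy5)]


cy3 = [100.0, 200.0, 300.0, 400.0]
cy5 = [200.0, 400.0, 600.0, 800.0]  # uniformly twice as bright: a dye/scanner bias
cy5_norm, factor = total_intensity_normalize(cy3, cy5)
print(factor, log2_ratios(cy3, cy5_norm))
```

A single global factor cannot correct intensity-dependent bias, which is the motivation for the Lowess normalization also mentioned above.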

MADAM (MicroArray DAta Manager)   Microarray experiments produce large amounts of data for even the simplest of experiments. In order to analyze data from many experiments, that data must be stored in an accessible form, such as in a database. MADAM is a Java-based application designed to load and retrieve microarray data to and from a database (also supplied with the software). MADAM provides data entry forms, data report forms and additional applications necessary to maintain microarray data for further analysis. MADAM requires JRE 1.3.1.

TIGR MultiExperiment Viewer (MEV) is a   Java application designed to allow the analysis of microarray data to identify patterns of gene expression and differentially expressed genes. Numerous normalization, clustering and distance algorithms have been implemented, along with a variety of graphical displays to best present the results. MEV was written to be flexible and expandable, and supports a variety of input and output formats. MEV requires version 1.2 or higher of Sun's JRE and J3D package.

TIGR Spotfinder   is a software tool designed for Microarray image processing using the TIFF image files generated by most microarray scanners. TIGR Spotfinder was written in C/C++ for PCs running Windows NT/2000/ME/XP.

ArrayViewer  A software tool designed to facilitate the presentation and analysis of microarray expression data, leading to the identification of genes that are differentially expressed. ArrayViewer is written in Java for cross-platform compatibility and reads and writes data using flat files or a database through stored procedures. An ArrayViewer Overview is available as an Adobe Acrobat PDF file. Machines that lack the requirements for the MultiExperiment Viewer may use ArrayViewer for single-experiment analysis.

TIGR McCoder is a  software package designed for a portable scanner with Palm OS to collect bar codes and then transfer the bar codes to PC as a plain text file. The package includes two programs: one that runs on the handheld scanner and one that runs on a regular PC with Windows 95/98/2000/NT. Transferred to PC, the scanned bar codes could be manipulated easily with McCoder.
Scheduler  is a web based tool that provides an efficient reservation method to manage lab instruments and office facilities. The Scheduler is designed as a two-tier system running on the Internet and can be configured to meet a variety of requirements.

NCBI Tools 

The Basic Local Alignment Search Tool (BLAST), for comparing gene and protein sequences against others in public databases, now comes in several flavors including PSI-BLAST, PHI-BLAST, and BLAST 2 sequences. Specialized BLASTs are also available for human, microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.

Clusters of Orthologous Groups (COGs)  currently covers 21 complete genomes from 17 major phylogenetic lineages. A COG is a cluster of very similar proteins found in at least three species. The presence or absence of a protein in different genomes can tell us about the evolution of the organisms, as well as point to new drug targets.

Map Viewer  shows integrated views of chromosome maps for 17 organisms. Used to view the NCBI assembly of complete genomes, including human, Map Viewer is a valuable tool for the identification and localization of genes, particularly those that contribute to diseases.  

LocusLink  combines descriptive and sequence information on genetic loci through a single query interface. LocusLink covers information on official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, OMIM numbers, UniGene clusters, homology, map information, and related web sites.

UniGene  Each UniGene cluster is a non-redundant set of sequences that represents a unique gene. Well-characterized genes, as well as thousands of expressed sequence tag (EST) sequences, have been included. Each cluster record also contains information such as the tissue types in which the gene has been expressed and map location. UniGene can assist in gene discovery, gene mapping projects, and large-scale expression analysis.

ORF finder   identifies all possible ORFs in a DNA sequence by locating the standard and alternative stop and start codons. The deduced amino acid sequences can then be used to BLAST against GenBank. ORF finder is also packaged in the sequence submission software Sequin. 
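
The scanning idea can be sketched in a few lines. This simplified version is an assumption-laden illustration, not NCBI's implementation: it recognizes only ATG as a start codon and only the three standard stop codons (no alternative genetic codes), and it reports reverse-strand coordinates relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons=1):
    """Yield (start, end) of ATG...stop ORFs in the three frames of one strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons counts codons from the ATG up to, but excluding, the stop.
    """
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield start, pos + 3
                start = None


def find_orfs(seq, min_codons=1):
    """Return (strand, start, end) for ORFs on both strands."""
    seq = seq.upper()
    forward = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    reverse = [
        ("-", s, e)
        for s, e in orfs_in_strand(reverse_complement(seq), min_codons)
    ]
    return forward + reverse


print(find_orfs("CCATGAAATGACC"))
```

ORFs that run off the end of the sequence without a stop codon are not reported by this sketch.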

Electronic PCR   allows you to search your DNA sequence for sequence tagged sites (STSs), which have been used as landmarks in various types of genomic maps. It compares the query sequence against data in NCBI's UniSTS, a unified, non-redundant view of STSs from a wide range of sources.

VAST Search   is a structure-structure similarity search service. It compares 3D coordinates of a newly determined protein structure to those in the MMDB/PDB database. VAST Search computes a list of similar structures that can be browsed interactively, using molecular graphics to view superimpositions and alignments.  

The Cancer Chromosome Aberration Project (CCAP)  compiles information on the distinct chromosome aberrations that are associated with different cancers. The identification of chromosomal abnormalities by clinicians can enable the diagnosis of, classification of, and treatment selection for a given cancer. 

Human-Mouse Homology Maps  compare genes in homologous segments of DNA from human and mouse sources, sorted by position in each genome. A total of 1793 loci are presented, most of which are genes. This map should be interpreted as a reflection of probable, not confirmed, homology relationships because of the lack of further information available for about half the loci. 

VecScreen is a tool for identifying segments of a nucleic acid sequence that may be of vector, linker or adapter origin prior to sequence analysis or submission. VecScreen was developed to combat the problem of vector contamination in public sequence databases.

dbMHC provides an open, publicly accessible platform for DNA and clinical data related to the human Major Histocompatibility Complex (MHC). In addition, dbMHC will provide tools for further submission and analysis of research data linked to the MHC.

The Cancer Genome Anatomy Project (CGAP)  aims to decipher the molecular anatomy of cancer cells. CGAP develops profiles of cancer cells by comparing gene expression in normal, precancerous, and malignant cells from a wide variety of tissues. 

mRNA to Genomic Alignments: Spidey  aligns one or more mRNA sequences to a single genomic sequence. Spidey will try to determine the exon/intron structure, returning one or more models of the genomic structure, including the genomic/mRNA alignments for each exon. 

Biology WorkBench

Protein Tools 

Ndjinn  Multiple Database Search
BL2SEQ  Compare proteins to each other with BLAST
BL2SEQX  Compare a protein to nucleotide sequences with BLAST
BLASTP  Compare a PS to a PS DB
TBLASTN  Compare a PS to a translated DB
PSIBLASTP  Position Specific Iterative BLAST
FASTA  Heuristic Sequence Similarity Search (PS Or DB)
TFASTA  Compare a PS to a NS, PS DB
TFASTX  Comp PS to Trans DNA (NS Or DB)
TFASTY  Comp PS to Trans DNA (NS Or DB)
SSEARCH  Smith Waterman Local Alignment of Proteins
CLUSTALW  Multiple Sequence Alignment
CLUSTALWPROF  Align Sequences to Existing Alignment (Profile)
ALIGN  Optimal Global Alignment of Two PS
MSA  Multiple Sequence Alignment (Sum of Pairs Criterion)
LALIGN  Calculate N Best Local PS Alignments
LFASTA  Local Alignment of Two PS
ROBUST  Global alignment of Two PS (Show Robust Pairs)
SIM  N Best Local Similarities Using Affine Weights
BESTSCOR  Calculate the Best Self Comparison Score
CTREE  Align protein sequences with confidence estimates
PRSS  Compare a PS to a Shuffled PS
SAPS  Statistical Analysis of PS
AASTATS  Statistics Based on Amino Acid Abundance, including weight and specific volume
GREASE  Kyte Doolittle Hydropathy Profile
RPSBLAST  Compare a PS to a Conserved Domain DB
FINGERPRINTSCAN  PRINTS fingerprint identification
PROSEARCH  Search Prosite DB for Patterns in a PS
PPSEARCH  Search Prosite DB for Patterns in a PS
PFSCAN  Sequence Search Against a Set of Profiles (PROSITE and PFAM)
HMMPFAM  Search against Pfam HMM database
BLIMPS  Sequence Search Against a Set of Profiles (BLOCKS)
PATTERNMATCHDB  Search for Regular Expressions (Patterns) in a protein sequence DB
PATTERNMATCH  Search for Regular Expressions (Patterns) in a protein sequence
GOR4  Predict Secondary Structure of PS
RANDSEQ  Randomize a Sequence
CHOFAS  Predict Secondary Structure of PS(s) (Chou Fasman)
HTH  Predict HTH Motifs in Protein Chains
PELE  Protein Structure Prediction
DSSP  Secondary Structure/Solvent Exposure of PDB Proteins
TMAP  Prediction of Transmembrane Segments
TMHMM  Predict location of transmembrane helices and location of intervening loop regions
EXTCOEF  Extinction coefficient calculation
PI  Isoelectric point determination
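
As an illustration of what a tool such as GREASE reports, the following is a minimal Python sketch of a Kyte Doolittle hydropathy profile: each residue is mapped to its Kyte Doolittle hydropathy value, and the values are averaged over a sliding window. The function name hydropathy_profile and the default window of 9 residues are assumptions made for this example; they are not taken from GREASE itself, whose options are not described here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence.

    The window size is an assumption of this sketch, not a GREASE default.
    """
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    # One averaged value per window start position.
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical peptide, used only to show the call.
    for position, value in enumerate(hydropathy_profile("MKTLLLTLVVVTIVCLDLGYT")):
        print(position + 1, round(value, 2))
```

Positive window averages indicate hydrophobic stretches and negative averages hydrophilic ones; longer windows smooth the profile at the cost of positional detail.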

Nucleic Acid Tools 

 BL2SEQ  Compare nucleotides to each other with BLAST
 BL2SEQX  Compare a nucleotide to protein sequences with BLAST
 BLASTN  Compare a NS to a NS DB
 BLASTX  Compare a PS Derived from NS to a PS DB
 TBLASTX  Compare a translated NS to a translated DB
 FASTA  Nucleic Acid Sequence Comparisons (NS or DB)
 FASTX  Compare Translated NS to PS DB
 FASTY  Compare Translated NS to PS DB
 CLUSTALW  Multiple Sequence Alignment
 CLUSTALWPROF  Align Sequences to Existing Alignment (Profile)
 ALIGN  Optimal Global Sequence Alignment
 LALIGN  Calculate Optimal Local Sequence Alignments
 LFASTA  Calculate Local Sequence Alignments (Heuristic)
 PATTERNMATCHDB  Search for Regular Expressions (Patterns) in a nucleic sequence DB
 PATTERNMATCH  Search for Regular Expressions (Patterns) in a nucleic sequence
 TACG  Analyze a NS for Restriction Enzyme Sites
 PRIMER3  Design Primer Pairs and Probes
 NASTATS  Nucleic Acid Statistics
 BESTSCOR  Calculate the Best Self Comparison Score
 PFSCAN  Sequence Search Against a Set of Profiles (PROSITE)
 PRIMERCHECK  Calculates melting point, length, %GC for a primer sequence
 PRIMERTM  Designs end primers based on a minimum Tm
 SIXFRAME  Generate & Import 6 Frame Translations on a NS
 REVCOM  Generate Reverse Complement of NS
 RANDSEQ  Randomize a Sequence
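
Several of the nucleic acid utilities above perform simple string operations. The following Python sketch shows one way to compute a reverse complement (as REVCOM does), a %GC value (one of the figures PRIMERCHECK reports), and six-frame translations (as SIXFRAME does). It assumes plain A/C/G/T input and the standard genetic code; the function names and the frame labels "+1" to "-3" are choices made for this example and do not reflect the output format of the WorkBench tools.

```python
from itertools import product

# Complement map for unambiguous DNA bases (IUPAC ambiguity codes not handled).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def translate(seq):
    """Translate complete codons; unrecognized codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate the three forward and three reverse-strand reading frames."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    dna = "ATGGCCTAA"  # hypothetical input sequence
    print(reverse_complement(dna), gc_percent(dna))
    for label, protein in sorted(six_frames(dna).items()):
        print(label, protein)
```

Stop codons are written as "*" and trailing bases that do not fill a codon are ignored.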

Alignment Tools 

 Ndjinn  Multiple Database Search
 SPLIT  Split Alignment Into Component Sequences
 DEGAP_SPLIT  Split Alignment Into Component Sequences and Remove Gap Characters
Download Aligned Sequences 
 TEXSHADE  Color Coded Plots of Pre Aligned Sequences
 BOXSHADE  Color Coded Plots of Pre Aligned Sequences
 CLUSTALWPROF  Align Two Existing Alignments (Profiles)
 TMAP  Prediction of Transmembrane Segments
 DRAWTREE  Draw Unrooted Phylogenetic Tree from Alignment
 DRAWGRAM  Draw Rooted Phylogenetic Tree from Alignment
 CLUSTALDIST  Generate Distance Matrix with Clustal W
 CLUSTALTREE  Phylogenetic Analysis with Clustal W
 DNADIST  Compute Evolutionary Distance Matrix from NS Alignment
 PROTDIST  Compute Evolutionary Distance Matrix from PS Alignment
 DNAPARS  Infer an Unrooted Phylogeny from NS Alignment
 PROTPARS  Infer an Unrooted Phylogeny from PS Alignment
 MVIEW  Multiple Alignment Display
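
To make the idea of an evolutionary distance matrix concrete, the following Python sketch computes pairwise distances from aligned nucleotide sequences. It uses the Jukes Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the fraction of differing sites, and it skips columns containing a gap. This is only one simple model chosen for illustration: the listing above does not state which substitution models DNADIST applies, and the function names here are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, ignoring columns with a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in columns) / len(columns)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; infinite when p reaches saturation."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Map every ordered pair of sequence names to its corrected distance."""
    names = list(alignment)
    return {
        (first, second): jukes_cantor(
            p_distance(alignment[first], alignment[second])
        )
        for first in names
        for second in names
    }


if __name__ == "__main__":
    # Hypothetical three-sequence alignment, used only to show the call.
    example = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "AC-TACTA"}
    for pair, distance in distance_matrix(example).items():
        print(pair, round(distance, 4))
```

A matrix of this kind is the usual input to distance-based tree-building methods; parsimony programs such as DNAPARS work from the alignment directly instead.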

Structure Tools 

 PDF  PDF Knowledge
 CONVERT  File format conversion utility
 TNT  Macromolecular Refinement Package