
An Evaluation of Protein Database Searching Programs

Members: Stephen Williams


Abstract

This project investigates current programs for finding similar protein sequences in protein databases. Speed and accuracy are the two most important elements of a search. Each program has strengths and weaknesses: it solves earlier problems and introduces new complications. The goal of this evaluation is to determine where each program excels and where it falls short. I also hope to explore the relationship between the score returned for each result and the differences in shape between the matched sequences. When determining similarity to the query sequence, the weighting of protein characteristics should rely heavily on the elements that dictate a protein's purpose: shape and sequence.

Plan of Action

  • Determine 5-10 database search engines that target protein sequences
  • Gather the following information about each search engine:
    • Method used
    • Method of scoring relevancy
    • Which standard method it is a variation of
  • Select partial sequences of 10, 25, and 50 residues from a single chain of each of 10 different proteins, along with the complete chain sequence
  • Determine whether each protein exists in the search engine's database
  • Run a search for each of the lengths on each of the databases. Document:
    • The returned list of proteins and the locations of the matches
    • Any scores returned for each of the found items
    • The location of the actual protein in the returned search items
    • The shape of the first 5-10 returned proteins at the matched sequence
    • The number of perfect matches versus close matches
  • Evaluate
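
The fragment selection and bookkeeping steps in the plan above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the function names (select_fragments, rank_of_target, count_matches) are hypothetical, fragments are taken at a random position within the chain, a "perfect match" is assumed to mean that the hit contains the query fragment exactly, and the hit lists are assumed to have already been copied out of each search engine's results by hand or by a separate script.

```python
import random

# Fragment lengths taken from the plan of action.
FRAGMENT_LENGTHS = (10, 25, 50)


def select_fragments(chain, lengths=FRAGMENT_LENGTHS, seed=0):
    """Return (label, start, sequence) queries for one protein chain.

    One contiguous fragment is drawn per requested length, followed by
    the complete chain. The random start position is an assumption; the
    plan does not say where in the chain a fragment should come from.
    """
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    queries = []
    for n in lengths:
        if n > len(chain):
            continue  # chain is too short for a fragment of this length
        start = rng.randrange(len(chain) - n + 1)
        queries.append((str(n), start, chain[start:start + n]))
    queries.append(("full", 0, chain))
    return queries


def rank_of_target(hit_ids, target_id):
    """1-based position of the source protein in a ranked hit list.

    Returns None when the source protein was not returned at all.
    """
    for rank, hit_id in enumerate(hit_ids, start=1):
        if hit_id == target_id:
            return rank
    return None


def count_matches(query, hit_sequences):
    """Return (perfect, close) counts for one search.

    A hit is counted as perfect when it contains the query exactly;
    every other returned hit is counted as a close match.
    """
    perfect = sum(1 for seq in hit_sequences if query in seq)
    return perfect, len(hit_sequences) - perfect
```

For each of the 10 proteins, select_fragments would be called once on the chosen chain, and rank_of_target and count_matches would be called once per search engine and per query, so that the documented values can be compared across engines in the evaluation step.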

Papers to be read

A Krause and M Vingron
A set-theoretic approach to database searching and clustering
Bioinformatics, Jun 1998; 14: 430 - 438.

T Rognes and E Seeberg
SALSA: improved protein database searching by a new algorithm for assembly of sequence fragments into gapped alignments
Bioinformatics, Nov 1998; 14: 839 - 845.

Mehdi Pirooznia, Tanwir Habib, Edward J. Perkins, and Youping Deng
GOfetcher: a database with complex searching facility for gene ontology
Bioinformatics, 1 November 2008; 24: 2561 - 2563.

David R. Schreiber
The bioinformaticist's toolbox in the post-genomic age : applications and developments
University of Florida, 2007

Ingvar Eidhammer, Inge Jonassen, and William R. Taylor
Protein bioinformatics : an algorithmic approach to sequence and structure analysis
New York ; Chichester : J. Wiley & Sons, 2004

Tsung-Lu Lee
BAXQL_BLAST : an enhanced BLAST bioinformatics homology search tool with batch and structured query support 
University of Florida, 2002