SGD Help: BLAST
BLAST stands for Basic Local Alignment Search Tool and was developed by Altschul et al. (1990). It is a very fast algorithm for searching protein or DNA sequence databases. BLAST is best used for sequence similarity searching, rather than for motif searching. For searches using a query sequence of fewer than 20 residues, PatMatch is the best choice.
More information about BLAST searching can be found in the NCBI BLAST Help Manual.
BLAST searches offered by SGD allow users to compare any query sequence to S. cerevisiae sequence datasets. To search other (non-yeast) datasets, NCBI BLAST can be used. To search fungal sequences, use SGD's Fungal BLAST tool.
- Using BLAST
- BLAST Results
- Graphic Display
- One-line Descriptions
- Sequence Alignments
- Parameters and Statistics
- Using the BLAST Options to Refine Your Results
The query page has several options as described below.
Step 1: Enter the query sequence
Sequences can be submitted for a BLAST search in two different ways. The sequence can be uploaded from a local text file with FASTA, GCG, or RAW formatting, or the sequence can be typed or pasted into the Query Sequence window. (Note: The contents of an uploaded sequence file will not be displayed in the Query Sequence window of the search page.) To use the Upload Local File option:
- Macintosh - Click on the Browse button; choose the desired file.
- PC - Click on the Browse button; change the file type from "HTML" to "all files"; choose the desired file for upload
- UNIX - Click on the Browse button; change *.html to * at the end of the string in the Filter box; click on a folder and then the Filter button to open the folder; choose the desired file and click OK to upload it
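As an illustration of the difference between FASTA and RAW input, the following minimal Python sketch separates header lines from sequence lines. The function name `read_fasta` is hypothetical, and this is not the parser SGD uses; GCG format is not handled.

```python
def read_fasta(text):
    """Parse FASTA or RAW text into a list of (header, sequence) pairs.

    Assumption: RAW input has no '>' line, so it is returned with an
    empty header. GCG-formatted files are not handled by this sketch.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                header = ""  # RAW sequence without a header line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    print(read_fasta(">seq1 test\nACGT\nTTGA\n"))
```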
Step 2: Choose the appropriate BLAST program
SGD offers five BLAST programs to accommodate different types of searches:
- BLASTN compares a nucleotide query sequence against a nucleotide sequence dataset
- BLASTP compares an amino acid query sequence against a protein sequence dataset
- BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence dataset
- TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence dataset
- TBLASTN compares a protein query sequence against a nucleotide sequence dataset dynamically translated in all six reading frames (both strands)
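The "six-frame conceptual translation" used by BLASTX, TBLASTX, and TBLASTN can be sketched in Python as follows. This is a minimal illustration that assumes the standard genetic code and uppercase A, C, G, T input; the function names and the frame labels ("+1" to "-3") are hypothetical and are not taken from BLAST itself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three frames on each strand of a nucleotide sequence."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six protein strings is then compared as if it were a separate amino acid sequence, which is why translated searches can find similarity that is not apparent at the nucleotide level.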
Step 3: Choose one or more Sequence Datasets
SGD offers a selection of sequence databases that can be searched, including sequences from a large variety of Saccharomyces cerevisiae strains.
Step 4: Run BLAST
BLAST search results are shown in the user's web browser.
The results of a BLAST query are reported in roughly the same format, regardless of the program selected. The first section is a graphical overview of the results, the second is a series of one-line descriptions of matching database sequences, the third is a set of the actual alignments of the query sequence with database sequences, and the last section lists the parameters used and the statistics generated during the search.
The graphical display and one-line descriptions give information about database sequences that form a High-Scoring Segment Pair (HSP) with the query sequence. An HSP is created when two sequence fragments (one from the query sequence and the other from a database sequence) show a locally maximal alignment whose score exceeds a pre-defined cutoff score. BLAST uses HSPs to identify hits.
- How HSPs are Shown
Each hit may contain one or more high-scoring segment pairs (HSPs). Each HSP is drawn as a line, and is aligned with the query sequence. This figure shows two short HSPs and three long ones running off the right edge. The smallest HSP begins at 185 bp and ends at 233 bp along the query sequence.
- HSPs are Directional
In the full text BLAST results, each HSP is either plus or minus. If the query and HSP strands are the same, the HSP is termed forward. If they differ, the HSP is termed reverse.
- HSPs Share a Background Color
All HSPs for a displayed hit are drawn. They share a single background color to signify their relationship. Here are two hits, each containing multiple HSPs. For the first hit, YAL029C, the background is white. For the second, YHR023W, the background is gray.
- Hits are Color Coded
The hits are color coded according to their P-value. A set of five fixed ranges is used to determine a color for each hit. These ranges, from "worst" to "best," are:
- 1.0 to 1e-10
- 1e-10 to 1e-50
- 1e-50 to 1e-100
- 1e-100 to 1e-200
- 1e-200 to 0.0
The key shows these colors, and notes the value of the negative exponents in each range. It progresses from "worst" on the left to "best" on the right. Note that some ranges might not contain any hits, since the ranges are fixed while the hit P-values are not. When two ranges share a boundary value (e.g., 1e-50), that value falls in the "better" range and is colored accordingly (e.g., green).
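The boundary rule can be made concrete with a short Python sketch. It assigns each P-value to one of the five fixed ranges, numbered here from 0 ("worst") to 4 ("best"); the numbering and the function name `range_index` are assumptions made for illustration, and the actual colors are left out.

```python
# Shared boundaries between the five fixed ranges, from "worst" to "best".
BOUNDARIES = [1e-10, 1e-50, 1e-100, 1e-200]


def range_index(p_value):
    """Return the range of a P-value: 0 is "worst", 4 is "best".

    A value equal to a shared boundary counts as having crossed it,
    so it lands in the "better" of the two ranges.
    """
    index = 0
    for boundary in BOUNDARIES:
        if p_value <= boundary:
            index += 1
    return index


if __name__ == "__main__":
    for p in (1.0, 1e-10, 3e-75, 1e-50, 0.0):
        print(p, range_index(p))
```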
- How Hits are Chosen for Display
Often, there will be more data available than can be displayed in the graphic. The current system takes a particular approach to selecting data to include, biased in favor of giving a complete overview of the data rather than showing only the top hits. The rationale is that it can be important to show results further away from identity.
First, the hits are sorted into color-coded ranges. Next, the top hit from each range is picked, starting with the "best" range. The system keeps track of how much space each hit will take up when drawn; if, after including those hits, there is still room left over, it iterates once more, picking the next top hit from each range. This process continues until there are either no more hits, or there is no room left in the display.
Note that the final drawing of the hits will be in proper order, even though hits have been selected in an interleaved fashion: all of the best hits are drawn at the top of the image.
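The selection procedure described above can be sketched in Python. This is only an approximation under stated assumptions: each hit is a hypothetical `(name, p_value, height)` tuple, `capacity` is the total drawing space in the same units as `height`, and selection stops as soon as the next candidate does not fit. None of these details are confirmed by SGD's implementation.

```python
BOUNDARIES = [1e-10, 1e-50, 1e-100, 1e-200]


def range_index(p_value):
    """Return 0 ("worst") to 4 ("best"); boundary values go to the better range."""
    return sum(1 for boundary in BOUNDARIES if p_value <= boundary)


def select_hits_for_display(hits, capacity):
    """Pick hits round-robin across ranges, then return them best first.

    hits: list of (name, p_value, height) tuples (hypothetical layout).
    capacity: total space available in the graphic (assumed unit: rows).
    """
    # Group hits by range, with the best hit of each range first.
    ranges = {}
    for hit in sorted(hits, key=lambda h: h[1]):
        ranges.setdefault(range_index(hit[1]), []).append(hit)

    chosen, used, room = [], 0, True
    while room and any(ranges.values()):
        # One pass: take the next top hit from each range, best range first.
        for idx in sorted(ranges, reverse=True):
            if not ranges[idx]:
                continue
            hit = ranges[idx][0]
            if used + hit[2] > capacity:
                room = False  # assumption: stop once a hit does not fit
                break
            chosen.append(ranges[idx].pop(0))
            used += hit[2]

    # Hits were picked in interleaved order but are drawn in proper order.
    return sorted(chosen, key=lambda h: h[1])


if __name__ == "__main__":
    example = [("A", 0.0, 2), ("B", 1e-120, 1), ("C", 1e-130, 1),
               ("D", 1e-140, 1), ("E", 1e-5, 1)]
    print([name for name, _, _ in select_hits_for_display(example, 5)])
```

In this example the weak hit E is shown even though the stronger hit B is omitted, which reflects the stated preference for an overview of all ranges over showing only the top hits.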
- Range Counts
If not all hits are shown, range counts will appear at the right side of the graph. In our example, all hits from the top range are shown and thus the annotation says "All." However, not all hits in the next range could be displayed, so "1/3" indicates that one of three hits is shown and two are omitted.
Note that if a range contains no hits, no count is shown (thus, there are no green or cyan notations in our example). If all of the BLAST results fit into the graph, no range counts are displayed at all.
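The rules for range counts can be summarized in a small Python sketch. The function name and the dictionaries keyed by range are hypothetical; only the "All" and "shown/total" labels come from the description above.

```python
def range_counts(total_per_range, shown_per_range):
    """Return the annotation for each range, or {} if everything fits.

    total_per_range: {range_key: number of hits in that range}
    shown_per_range: {range_key: number of hits drawn in the graphic}
    """
    everything_shown = all(
        shown_per_range.get(key, 0) == total
        for key, total in total_per_range.items()
    )
    if everything_shown:
        return {}  # no range counts are displayed at all

    labels = {}
    for key, total in total_per_range.items():
        if total == 0:
            continue  # a range without hits gets no count
        shown = shown_per_range.get(key, 0)
        labels[key] = "All" if shown == total else f"{shown}/{total}"
    return labels


if __name__ == "__main__":
    print(range_counts({"best": 2, "next": 3, "empty": 0},
                       {"best": 2, "next": 1}))
```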
Hit names and P-values are displayed at the left side of the graph.
p=0.0e0 s=7741 YOR326W|MYO2, Chr XV from 925712-930436
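A label like the one above can be split into its parts with a regular expression. The Python sketch below is based only on this single example: it assumes that `p=` holds the P-value, that `s=` holds an integer score, and that everything after them is the hit description. Other label layouts may exist and are not handled.

```python
import re

# Pattern inferred from the single example label; treat it as an assumption.
LABEL_PATTERN = re.compile(r"p=(?P<p>\S+)\s+s=(?P<s>\d+)\s+(?P<desc>.+)")


def parse_hit_label(label):
    """Split a hit label into P-value, score, and description."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized hit label: {label!r}")
    return {
        "p_value": float(match["p"]),
        "score": int(match["s"]),
        "description": match["desc"],
    }


if __name__ == "__main__":
    print(parse_hit_label(
        "p=0.0e0 s=7741 YOR326W|MYO2, Chr XV from 925712-930436"))
```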
The one-line descriptions summarize information about the database sequences that form HSPs with the query sequence. At the left end of each one-line description is the name of the database sequence that forms an HSP with the query sequence. Each description also includes the score and P-value for the hit.
The sequence alignments show the query sequence at the top, with the aligned database sequence (Sbjct, or subject) at the bottom. The starting and ending coordinates of the areas of similarity are shown at the left and right of the aligned sequences. When nucleotide sequences are being aligned, vertical lines between the bases signify identities. Amino acid identities are shown by the repetition of the one-letter code for that amino acid between the residues. Conservative amino acid changes are shown by a "+" sign between the aligned residues. Places where gaps had to be introduced to achieve the alignment are signified by a "-" in the query or subject sequences.
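For nucleotide alignments, the line of vertical bars between the two sequences can be reproduced with a few lines of Python. This sketch assumes that both sequences are already aligned to the same length, with "-" marking gaps; the function name is hypothetical, and protein alignments (which need a scoring matrix to decide where "+" applies) are not covered.

```python
def nucleotide_midline(query, subject):
    """Return the match line: '|' at identities, a space elsewhere."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if q == s and q != "-" else " "
        for q, s in zip(query.upper(), subject.upper())
    )


if __name__ == "__main__":
    query, subject = "ACG-TA", "ACGTTT"
    print("Query:", query)
    print("      ", nucleotide_midline(query, subject))
    print("Sbjct:", subject)
```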
Parameters and Statistics
For amino acid sequences, the default filter setting is "seg." This filter removes repetitive sequences. Removed residues are indicated by Xs. For nucleic acid sequences, the default filter setting is "dust." The removed residues are represented as Ns. To turn off this filter, return to the BLAST search page and select "Off" as a filter option.
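The effect of filtering on the displayed query can be illustrated with a short Python sketch. It does not implement "seg" or "dust"; it only shows how residues in regions that a filter has already flagged would be replaced by Xs or Ns. The function name and the `(start, end)` region format (0-based, end exclusive) are assumptions.

```python
def mask_regions(sequence, regions, is_protein):
    """Replace residues in the given regions with 'X' (protein) or 'N' (DNA).

    regions: list of (start, end) pairs, 0-based with end exclusive.
    The regions themselves would come from a filter such as seg or dust,
    which this sketch does not implement.
    """
    mask_char = "X" if is_protein else "N"
    residues = list(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


if __name__ == "__main__":
    print(mask_regions("ACGTAAAAAAAACGT", [(4, 12)], is_protein=False))
    print(mask_regions("MKQQQQQQLV", [(2, 8)], is_protein=True))
```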
Using the BLAST Options to Refine Your Results
If the BLAST search results don't look optimal, you can experiment with several of the parameters, as follows:
- Change the database searched.
- Change the protein comparison matrix.
- Change the number of alignments to show.
- Change the Expect threshold. The Expect threshold (E threshold) reflects the number of matches expected to be found by chance. If the Expect value (E value) of a match is greater than the E threshold, the match will not be reported. The default E threshold is 10. Decreasing the E threshold increases the stringency of the search, so fewer matches are reported. Increasing the E threshold decreases the stringency of the search, so more matches are reported.
- Change the Cutoff Score (S value). If a query sequence is short (less than about 30 residues), adjusting the Cutoff Score to a lower value will result in a less stringent criterion for reporting matches.
- Change the word length (W). BLAST first searches for a perfect match of at least the word length. Once a match is found, it then tries to extend the high-scoring segment pair (HSP). The default W value for BLASTN is 11; for all other programs the default is 3. If the word length is less than 11, the query sequence must be shorter than 5000 bp.
- Change the filter option. The filter removes repetitive sequences from a query, so that the results of the BLAST search will be less numerous and, ideally, more informative. For nucleic acid query sequences, the "dust" filter is used as the default. For all other searches, the "seg" filter is the default.
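Two of the options above, the Expect threshold and the word length, can be illustrated with a short Python sketch. It follows the description on this page rather than the BLAST source code: `filter_by_expect` drops matches whose E value exceeds the threshold, and `exact_word_seeds` finds perfect word matches of length W that could serve as starting points for extension. Both function names and the `(name, e_value)` tuple layout are hypothetical, and the extension step itself is not shown.

```python
def filter_by_expect(hits, e_threshold=10.0):
    """Keep only hits whose E value does not exceed the threshold.

    hits: list of (name, e_value) tuples (hypothetical layout).
    """
    return [hit for hit in hits if hit[1] <= e_threshold]


def exact_word_seeds(query, subject, word_length=11):
    """Return (query_pos, subject_pos) for every shared word of length W."""
    # Index every word of the query by its starting position.
    index = {}
    for i in range(len(query) - word_length + 1):
        index.setdefault(query[i:i + word_length], []).append(i)

    # Scan the subject and look each word up in the query index.
    seeds = []
    for j in range(len(subject) - word_length + 1):
        for i in index.get(subject[j:j + word_length], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    hits = [("a", 1e-30), ("b", 5.0), ("c", 25.0)]
    print(filter_by_expect(hits))             # default threshold of 10
    print(filter_by_expect(hits, 0.001))      # more stringent
    print(exact_word_seeds("ACGTACGT", "TTACGTAA", word_length=4))
```

A smaller word length produces more seeds and therefore a more sensitive but slower search, which is consistent with the restriction on query length for short word lengths noted above.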