SGD Help: Pattern Matching

PatMatch permits the identification of patterns or motifs within the collection of all S. cerevisiae protein or DNA sequences. The pattern can be either a simple string or a regular expression. Standard substitutions are allowed in the string, such as using "R" for any purine base when performing a nucleotide search. Pattern matching offers an alternative to sequence alignment techniques such as BLAST for identifying nucleotide or peptide sequences with conserved or biologically interesting regions.
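
The idea of a degenerate nucleotide pattern can be sketched in a few lines of Python. This is only an illustration of the concept, not PatMatch's own code: the source confirms just that "R" stands for any purine; the remaining entries are the standard IUPAC nucleotide codes, and it is an assumption that PatMatch accepts all of them. The names IUPAC_NUCLEOTIDES and iupac_to_regex are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
# Assumption: only "R" (purine) is confirmed by the help text above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate a degenerate nucleotide string into a regular expression."""
    parts = []
    for ch in pattern.upper():
        if ch not in IUPAC_NUCLEOTIDES:
            raise ValueError(f"not an IUPAC nucleotide code: {ch!r}")
        parts.append(IUPAC_NUCLEOTIDES[ch])
    return "".join(parts)


# "GATR" matches GATA or GATG.
match = re.search(iupac_to_regex("GATR"), "CCGATAGG")
print(match.group(), match.start())
```

A simple string such as "GATR" and the regular expression "GAT[AG]" describe the same set of sequences; the one-letter code is just a shorter way to write it.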

Contents

  1. Using PatMatch
  2. Tips for Pattern Matching
  3. Results

Using PatMatch

The query page has several options as described below.

  • Step 1: Choose a genome to search
  • The pull-down contains the list of S. cerevisiae strains to choose from.
  • Step 2: Enter a peptide or nucleotide sequence
  • See the lower half of the interface for examples of the types of sequences that PatMatch can search. If your sequence is greater than 20 residues (either amino acids or nucleotides) and has no degenerate positions, use BLAST rather than PatMatch.
  • Step 3: Choose a Sequence Database
  • SGD offers a selection of sequence datasets that can be searched, depending on the user's requirements. For nucleotide patterns, any of these databases can be selected:
    • GenBank is the subset of DNA sequences submitted to GenBank that have been derived from S. cerevisiae DNA. It includes results of the systematic sequencing project as well as results from individual laboratories.
    • genoSc is the complete, up-to-date Saccharomyces cerevisiae genome sequence maintained at SGD.
    • ORF-Coding consists of ORF sequences from the initial ATG to the stop codon but without upstream or downstream sequences, intron sequences, or bases not translated due to translational frameshifting. Contains all ORFs except those classified as Dubious and pseudogenes.
    • ORF-Genomic consists of ORF sequences from the initial ATG to the stop codon including intron sequences and any bases not translated due to translational frameshifting, but not including upstream or downstream sequences. Contains all ORFs except those classified as Dubious and pseudogenes.
    • ORF-Genomic-1000 consists of ORF sequences from the initial ATG to the stop codon including intron sequences and any bases not translated due to translational frameshifting, plus 1000 bp upstream and downstream of each ORF. Contains all ORFs except those classified as Dubious and pseudogenes.
    • NotFeature includes those portions of the systematic sequence that are not an ORF, ARS, centromere, rRNA gene, tRNA gene, snRNA gene, snoRNA gene, LTR, telomeric element or Ty element.

For peptide patterns, either of these databases can be selected:

    • ORF-Trans is a dataset containing protein translations of all systematically named ORFs except those classified as Dubious and pseudogenes.
    • NRSC is a non-redundant set of S. cerevisiae protein sequences from GenBank. For example, while there may be 10 GenBank entries for a particular sequence, it will be represented only once in the NRSC.
  • Step 4: Start pattern search

You may want to change the default settings under "More Options" (see Tips for Pattern Matching below) before you start the search.

Note: To abort a search, click the button labeled "Click here to abort the search". This stops the process running on the SGD server. Using the browser's "Back" button instead leaves the process running on the server.
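
The guidance in Step 2 on when to use BLAST instead of PatMatch can be written as a small check. The following Python sketch is hypothetical: the function name suggest_tool is invented for illustration, and it assumes that a "degenerate position" means any character outside A, C, G, T for nucleotides or outside the 20 standard amino acid letters for peptides.

```python
# Assumption: anything outside these alphabets counts as a degenerate position.
NUCLEOTIDES = set("ACGT")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def suggest_tool(query: str, kind: str) -> str:
    """Return "BLAST" or "PatMatch" following the Step 2 rule of thumb."""
    if kind not in ("nucleotide", "peptide"):
        raise ValueError("kind must be 'nucleotide' or 'peptide'")
    alphabet = NUCLEOTIDES if kind == "nucleotide" else AMINO_ACIDS
    residues = query.upper()
    has_degenerate = any(ch not in alphabet for ch in residues)
    # Longer than 20 residues and fully specified: BLAST is the better fit.
    if len(residues) > 20 and not has_degenerate:
        return "BLAST"
    return "PatMatch"


print(suggest_tool("ACGT" * 6, "nucleotide"))   # 24 exact bases
print(suggest_tool("ACGTR" * 5, "nucleotide"))  # 25 bases, contains R
```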

Tips for Pattern Matching

  1. The pattern may be lowercase or uppercase. There is no maximum or minimum pattern size.
  2. A description of the allowed syntax of the pattern is provided at the bottom of the Pattern Matching page.
  3. The Strand option is used for restricting NUCLEOTIDE searches to only one strand of the specified dataset. The default is that both strands are searched. If the "Strand in dataset" option is chosen, then only the strand that is actually present in the dataset will be searched. The table lists the ramifications of choosing this option for various datasets:

Choosing "Reverse complement of strand in dataset" restricts the PatMatch search to the reverse complement of the strands described above. Please note that in the displayed sequence, only the Watson strand will be shown, regardless of which strand option is chosen. If your pattern has a match on the Crick strand, the reverse complement of the pattern will be highlighted in the Watson sequence.

4. The Mismatch, Deletion or Insertion options will permit matches to sequences that contain a defined number of substitutions, deletions or insertions relative to the input pattern. This number can range from 1 to 3. At this time, patterns containing regular expressions do not support the mismatch, deletion and insertion options.

5. When searching for patterns near the beginning or end of a sequence, bear in mind that nucleotide sequences will include the stop codon (TAA, TAG, or TGA) and start codon (5' ATG). Peptide sequences will include the initiator methionine, whether or not it is removed in vivo.

6. At this time, PatMatch will not find overlapping hits.

7. If a PatMatch search results in no or few matches, try to increase the number of matches by:

    • changing the database searched (for example, from genoSc to GenBank)
    • using a less selective pattern
    • increasing the number of allowed mismatches, deletions or insertions.
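
Tips 3, 4 and 6 can be illustrated together with a short Python sketch. It is not PatMatch's implementation, and all function names are hypothetical. It assumes plain A/C/G/T patterns, handles substitutions only (no insertions or deletions), and assumes that "no overlapping hits" means the scan resumes after the end of each hit. A match on the Crick strand is found by searching the Watson sequence for the reverse complement of the pattern, which is also the text that would be highlighted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a plain A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def scan(sequence: str, pattern: str, max_mismatches: int = 0):
    """Left-to-right scan returning non-overlapping (start, text) hits."""
    sequence, pattern = sequence.upper(), pattern.upper()
    width = len(pattern)
    hits = []
    if width == 0:
        return hits
    i = 0
    while i + width <= len(sequence):
        window = sequence[i:i + width]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append((i, window))
            i += width  # jump past the hit, so overlapping hits are skipped
        else:
            i += 1
    return hits


def find_hits(sequence, pattern, max_mismatches=0, both_strands=True):
    """Return (start, strand, text) tuples in 0-based Watson coordinates."""
    hits = [(s, "W", t) for s, t in scan(sequence, pattern, max_mismatches)]
    rc = reverse_complement(pattern)
    # A palindromic pattern would report the same positions twice.
    if both_strands and rc != pattern.upper():
        hits += [(s, "C", t) for s, t in scan(sequence, rc, max_mismatches)]
    return sorted(hits)


print(find_hits("AAAA", "AA"))         # two hits, not three
print(find_hits("CCCTTTCC", "AAAG"))   # Crick-strand hit shown as CTTT
print(find_hits("ACGTACCT", "ACGT", max_mismatches=1, both_strands=False))
```

In the first call, the occurrence starting at the second base is skipped because it overlaps the first hit. In the last call, allowing one mismatch adds ACCT as a second hit, which is the kind of widening suggested in tip 7.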

Results

The results page displays a chromosome graphic and a table of the full results in cases where the genoSc, ORF-Coding, ORF-Genomic, ORF-Genomic-1000, ORF-Trans or NotFeature dataset is searched. If the GenBank or NRSC dataset is used, only the results table is shown.

The chromosome graphic displays all hits on the 16 yeast chromosomes; click any region of a chromosome bar to open the Features Map and view the hits there. The table shows the name of each sequence containing a match, the number of hits, the matching pattern, the matching positions, a link to the DNA or protein sequence, and any available information about the sequence. Matching positions are given relative to the entire sequence matched (listed in the Sequence Name column); that sequence may be an entire chromosome, an ORF (DNA or amino acid sequence), or a region of untranslated DNA.

By default, a PatMatch search returns a maximum of 500 hits. If a short query sequence is entered, the actual number of hits may exceed 500. Therefore, if the total number of hits returned is exactly 500, select a higher value for "Maximum hits" in the "More Options" section and rerun the search to ensure that hits from the entire genome are returned.
