SGD Help: Glossary

The Glossary contains definitions of terms that are commonly used in SGD.

2-point data

This refers to data generated by tetrad analysis of a cross in which the segregation of two genetic markers is followed. These data yield the distance between the two markers (usually mutant alleles of genes) on the genetic map.

5' UTR intron

An intron located in the 5' untranslated region (SO:0000447).

Accession number

This refers to the unique GenBank identifier assigned to a sequence. This number can be used to search GenBank records for a specific sequence.

AceDB

AceDB was the database software previously used by SGD before the move to an ORACLE relational database. More information on AceDB is available here.

Affinity capture-MS

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred when a "bait" protein is affinity captured from cell extracts by either polyclonal antibody or epitope tag and the associated interaction partner is identified by mass spectrometric methods.

Affinity capture-RNA

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred when a "bait" protein is affinity captured from cell extracts by either polyclonal antibody or epitope tag and the associated interaction partner is identified by specific RNA binding.

Affinity capture-Western

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred when a "bait" protein is affinity captured from cell extracts by either polyclonal antibody or epitope tag and the associated interaction partner is identified by Western blotting with a specific polyclonal antibody or second epitope tag.

Affinity Chromatography

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is detected by chromatographic purification (for example, GST fusions purified with glutathione-Sepharose beads).

Affinity Precipitation

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a "bait" protein is affinity captured from cell extracts by either polyclonal antibody or epitope tag and the associated interaction partner is identified by immunoblot with a specific polyclonal antibody or second epitope tag. This category is also used if an interacting protein is visualized directly by dye stain or radioactivity.

Alias

'Alias' refers to a non-standard name for a locus. When multiple names are published for a locus, one name is designated the standard name (following the Gene naming guidelines) and the other published names are retained under 'Alias'. If a name has been reserved for a gene for a significant period of time, it will also be retained as an alias even if it has not been published.

Alignment

A presentation of two compared sequences that shows the regions of greatest statistical similarity.

Annotation

At SGD, annotation refers to information that has been extracted from the literature and associated, on the database pages, with various aspects of an S. cerevisiae gene or chromosomal feature. SGD makes several types of annotations, such as GO, Sequence, and Literature Guide annotations.

Anonymous FTP

A method of sharing files on the Internet. A variety of software that can provide FTP function is available in most networking software packages. Anonymous FTP simply means a computer will allow anyone using the FTP software access to a special directory of files on its disk drive. This service is called Anonymous FTP because the user name used is "anonymous." When asked for a password, simply enter your e-mail address.

AmiGO

A web application developed by the Gene Ontology (GO) Consortium that can be used to search, browse and visualize Gene Ontology data. AmiGO displays detailed information related to GO terms and the gene products annotated to those terms. Using AmiGO, it is possible to access GO annotations for the many different species for which GO annotations have been submitted to the GO Consortium.

Aromaticity score (Aromo)

This index is the frequency of aromatic amino acids (Phe, Tyr, Trp) in the hypothetical translated gene product. The hydropathicity and aromaticity protein scores are indices of amino acid usage. The strongest trend in the variation in the amino acid composition of E. coli genes is correlated with protein hydropathicity, the second trend is correlated with gene expression, while the third is correlated with aromaticity (Lobry and Gautier 1994). The variation in amino acid composition can have applications for the analysis of codon usage. If total codon usage is analyzed, a component of the variation will be due to differences in the amino acid composition of genes.
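
As a minimal sketch of the definition above, the aromaticity score can be computed as the fraction of Phe (F), Tyr (Y), and Trp (W) residues in a translated sequence. The function name and the choice to ignore non-letter characters such as a trailing "*" are assumptions made for illustration, not a description of how SGD or CodonW computes the value.

```python
AROMATIC = frozenset("FYW")  # Phe, Tyr, Trp in one-letter code


def aromaticity(protein: str) -> float:
    """Return the frequency of aromatic residues in a protein sequence."""
    # Assumption: non-letter characters (e.g. a stop symbol "*") are ignored.
    residues = [aa for aa in protein.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    return sum(aa in AROMATIC for aa in residues) / len(residues)
```

For example, a four-residue peptide "FYWA" contains three aromatic residues, so its score is 0.75.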

ARS Consensus Sequence (ACS)

The ACS is an 11-bp sequence of the form 5'-WTTTAYRTTTW-3' which is at the core of every yeast ARS, and is necessary but not sufficient for recognition and binding by the origin recognition complex (ORC). Functional ARSs require an ACS, as well as other cis elements in the 5' (C domain) and 3' (B domain) flanking sequences of the ACS.
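
The consensus above uses IUPAC ambiguity codes (W = A or T, Y = C or T, R = A or G). The sketch below, with hypothetical function names, expands the consensus into a regular expression and scans the given strand of a DNA string. Because the ACS is necessary but not sufficient, a match is only a candidate site, not evidence of a functional ARS.

```python
import re

# IUPAC codes needed for the ACS consensus only.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}
ACS = "WTTTAYRTTTW"


def iupac_to_regex(pattern: str) -> str:
    """Expand an IUPAC consensus into a regular expression."""
    return "".join(IUPAC[base] for base in pattern)


def find_acs(dna: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each ACS match on the given strand."""
    # The lookahead lets overlapping matches be reported.
    regex = re.compile("(?=(" + iupac_to_regex(ACS) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(dna.upper())]
```

To cover both strands, the same function could be applied to the reverse complement of the sequence.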

Associate

In Colleague information, "Associate" refers to coworkers or collaborators.

ATCC

American Type Culture Collection; maintains collections of yeast strains and clones.

Author

An author of a paper or personal communication included in SGD. When searching for an individual's name, use the "*" wildcard character (e.g., Johnson*) to achieve the best results.

Autonomously Replicating Sequence (ARS)

A DNA sequence element occurring on average every 40 kb in yeast and originally defined by its ability to confer replication on extrachromosomal circular DNA molecules. ARS elements correspond to chromosomal origins of replication (ORIs), tend to be A/T rich, and have been implicated in the binding of the primosome complex.

Binding site

Consensus sequence to which a specific molecule binds.

Biochemical Activity

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred from the biochemical effect of one protein upon another, for example, GTP-GDP exchange activity or phosphorylation of a substrate by a kinase.

BioGRID

BioGRID (General Repository for Interactions database) is a database of genetic and physical interactions developed by the Tyers Group at Mount Sinai Hospital, Toronto, Canada.

Biological process

One of the three categories used by the Gene Ontology project, biological process describes broad biological goals, such as mitosis or purine metabolism.

BioSci

BIOSCI is a set of internet newsgroups and e-mail lists for biologists. SGD maintains an archive of the yeast BIOSCI list.

BLAST

Basic Local Alignment Search Tool is a search algorithm developed by Altschul et al. (1990). It is a very fast search algorithm that is used by the blastn, blastp, and blastx programs to separately search protein or DNA databases. BLAST is best used for sequence similarity searching, rather than for motif searching.

blastn

A BLAST program that compares a nucleotide query sequence against a nucleotide sequence database. The user must enter a NUCLEOTIDE sequence and select a DNA database to search.

blastp

A BLAST program that compares an amino acid query sequence against a protein sequence database. The user must submit an AMINO ACID sequence and select a PROTEIN database for the search.

blastx

A BLAST program that compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. The user must enter a NUCLEOTIDE sequence and select a PROTEIN database for the search.
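
To illustrate what "six-frame conceptual translation" means, the sketch below translates a nucleotide sequence in three reading frames on each strand. It assumes the standard genetic code and uses hypothetical function names; it is not the code used by any BLAST implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict[str, str]:
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings would then be compared against the protein database.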

BLOSUM100

An alternative scoring matrix for BLAST searches.

BLOSUM30

An alternative scoring matrix for BLAST searches.

BLOSUM50

A scoring matrix that is used as the default in FASTA searches.

BLOSUM62

A scoring matrix that is used as the default in blastp, blastx, and tblastn BLAST searches.

CDS

CoDing Sequence, region of nucleotides that corresponds to the sequence of amino acids in the predicted protein. The CDS includes start and stop codons, therefore coding sequences begin with an "ATG" and end with a stop codon. In SGD, unexpressed sequences, including the 5'-UTR, the 3'-UTR, introns, or bases not expressed due to frameshifting, are not included within a CDS. Note that the CDS does not correspond to the actual mRNA sequence.

centimorgan

The unit of linkage that refers to the distance between two gene loci determined by the frequency with which recombination occurs between them. Two loci are said to be one centimorgan (cM) apart if recombination is observed between them in 1% of meioses (definition from the Genetics Home Reference at the NIH). In yeast, recombination frequency is assayed by tetrad analysis. A centimorgan is equivalent to a map unit (m.u.). The centimorgan is named after the geneticist Thomas Hunt Morgan.
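
As a rough illustration of how tetrad counts relate to centimorgans, the sketch below converts counts of parental ditype (PD), nonparental ditype (NPD), and tetratype (TT) tetrads into a recombination frequency and expresses it in cM. The tetrad classes and the formula RF = (NPD + TT/2) / total are standard genetics but are not stated in this glossary, and the function name is hypothetical. This simple estimate does not correct for multiple crossovers.

```python
def map_distance_cm(pd: int, npd: int, tt: int) -> float:
    """Estimate map distance in cM from two-point tetrad counts."""
    total = pd + npd + tt
    if total == 0:
        raise ValueError("no tetrads scored")
    # All four spores of an NPD tetrad are recombinant; half of a TT tetrad's are.
    recombination_frequency = (npd + 0.5 * tt) / total
    return 100.0 * recombination_frequency
```

With 70 PD, 2 NPD, and 28 TT tetrads, 16% of spores are recombinant, giving an estimate of about 16 cM.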

Centromere

This term refers to the portion of a chromosome where the kinetochore assembles. The kinetochore attaches chromosomes to mitotic and meiotic spindles. Thus, the centromere is critical for the proper segregation of chromosomes during mitosis and meiosis. In S. cerevisiae, the centromeres (CENs) are comprised of specific DNA sequences (CDEI, CDEII, and CDEIII), though in most eukaryotes this is not the case. While the physical position of a gene is given in kilobase pairs, with 1 bp located at a telomere, the genetic position of a gene is given relative to the centromere.

Centromere DNA Element I (CDEI)

Smallest of three adjacent centromeric domains, CDEI is an 8-11 bp consensus sequence that is bound by centromere binding factor 1 (Cbf1p).

Centromere DNA Element II (CDEII)

Central of three adjacent centromeric domains, CDEII is AT-rich and ~ 75-100 bp in length.

Centromere DNA Element III (CDEIII)

Most essential of three adjacent centromeric domains, CDEIII consists of a 25-bp consensus sequence and provides the binding site for the centromere DNA binding factor 3 (CBF3) complex.

Cellular Component

One of the three categories used by the Gene Ontology project, cellular component encompasses subcellular structures, locations, and macromolecular complexes. Examples include nucleus, telomere, and origin recognition complex.

Child Term

This term is used in the context of the Gene Ontology. It refers to a controlled vocabulary term that describes a more specific, or granular, aspect of biology than its one or more parent terms. Child terms are placed lower in the ontology than their parent terms. For example, "endoplasmic reticulum" and "Golgi apparatus" are child terms of the parent term "cytoplasm".

Chr_Basepair_Coord

Chromosome basepair coordinates consist of two numbers that specify the beginning and ending location of the sequence as positioned on the chromosomal sequence.

Chromosome

Chromosome refers to the structure in the cell composed of a very long molecule of DNA and associated proteins called histones. At SGD, if a locus has been physically mapped, the chromosomal coordinates will appear on the Locus Summary page. There are 16 chromosomes in S. cerevisiae. The Genomic View is a graphic representation of the entire yeast genome that allows you to display a chromosomal features map, physical map, or combined physical and genetic map.

Chromosome arm

The part of a chromosome that includes the DNA sequence from one telomere to the centromere. Usually one arm is physically longer than the other arm. In humans the short arm is designated as 'p' (petite) and the long arm is called 'q' (the letter following p in the Latin alphabet). Before the S. cerevisiae genome sequence was determined, yeast chromosomal arms were designated "left" or "right", where the left arm was the shorter one based on genetic position and recombination frequencies of the genes it carried. Subsequent sequence information showed that a genetically short arm may be physically longer; however, the genetic designations are still used today in yeast gene names. For nomenclature information, see ORF-naming conventions.

Clone

Clone is the term used for any physical piece of DNA that has been localized to a particular region of a chromosome. A prime clone is any piece of DNA that is available from the ATCC; these are mostly the Olson-Riles set of cosmid and lambda clones, as well as many of the cosmid and lambda clones sequenced by the systematic sequencing groups.

ClustalW

Clustal W is an alignment program for DNA and proteins with improved sensitivity for the alignment of divergent protein sequences. (See Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-80.)

Co-crystal Structure

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is directly demonstrated at the atomic level by X-ray crystallography.

Coding Exon

An exon that directs the production of a peptide sequence.

Coding Sequence

See CDS.

Codon Adaptation Index (CAI)

Codon adaptation index is a measurement of the relative adaptiveness of the codon usage of a gene towards the codon usage of highly expressed genes. The relative adaptiveness (w) of each codon is the ratio of the usage of each codon to that of the most abundant codon for the same amino acid. The CAI is defined as the geometric mean of these relative adaptiveness values. Non-synonymous codons and termination codons (dependent on genetic code) are excluded. CAI values range from 0 to 1, with higher values indicating a higher proportion of the most abundant codons. (See Sharp, P. M., and W. H. Li (1987). The codon adaptation index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research 15: 1281-1295; also see Jansen R., Bussemaker H.J., and Gerstein M. (2003) Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. Nucleic Acids Res. 31(8):2242-51.)
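
The definition above can be sketched in two steps: compute w for each codon within a synonymous family, then take the geometric mean of w over the codons of a gene. The function names and the toy counts are hypothetical, and this sketch does not handle codons whose w is zero; actual tools may substitute a small positive value in that case.

```python
import math


def relative_adaptiveness(family_counts: dict[str, int]) -> dict[str, float]:
    """w for each codon in one synonymous family, from reference-set counts."""
    most_abundant = max(family_counts.values())
    return {codon: n / most_abundant for codon, n in family_counts.items()}


def cai(codons: list[str], w: dict[str, float]) -> float:
    """Geometric mean of w over the codons of a gene.

    Codons absent from w (non-synonymous and termination codons,
    by assumption) are skipped.
    """
    logs = [math.log(w[c]) for c in codons if c in w]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))
```

For instance, if a reference set uses AAG 80 times and AAA 20 times for lysine, then w(AAG) = 1.0 and w(AAA) = 0.25, and a gene with one of each lysine codon has a CAI of 0.5.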

Codon Bias Index (CBI)

Codon bias index is another measure of directional codon bias; it measures the extent to which a gene uses a subset of optimal codons. CBI is similar to Fop, with expected usage used as a scaling factor. In a gene with extreme codon bias, CBI will equal 1.0; in a gene with random codon usage, CBI will equal 0.0. Note that it is possible for the number of optimal codons to be less than expected by random chance. This results in a negative value for CBI. (See Bennetzen, J. L., and B. D. Hall (1982). Codon selection in yeast. Journal of Biological Chemistry 257: 3026-3031.)
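
A small sketch may make the scaling concrete. It assumes the commonly cited form CBI = (N_opt - N_rand) / (N_tot - N_rand), where N_opt is the number of optimal codons observed, N_rand the number expected under random usage, and N_tot the number of codons counted; this formula and the function names are not given in this glossary and are stated here as assumptions.

```python
def fop(n_optimal: int, n_total: int) -> float:
    """Frequency of optimal codons: optimal codons / codons counted."""
    return n_optimal / n_total


def cbi(n_optimal: int, n_total: int, n_expected: float) -> float:
    """Codon bias index, scaled by the expected number of optimal codons."""
    return (n_optimal - n_expected) / (n_total - n_expected)
```

With these definitions a gene using only optimal codons scores 1.0, a gene using exactly the expected number scores 0.0, and a gene using fewer than expected scores below zero.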

CodonW

CodonW is a software program, written by John Peden in the lab of Paul Sharp (Dept of Genetics, University of Nottingham), that analyzes the correspondence between amino acids and codon usage in a set of protein sequences, based on a given genetic code, to calculate values such as Codon Adaptation Index and Codon Bias Index. Decisions regarding whether an amino acid is synonymous or non-synonymous, the translation of a codon, the number of codons in a codon family, how many synonyms a codon has, are all determined at run time. Seven alternatives to the universal genetic code, including S. cerevisiae chromosomal and S. cerevisiae mitochondrial, have been built into the program.

Co-fractionation

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred from the presence of two or more protein subunits in a partially purified protein preparation.

Colleagues

Researchers with an interest in yeast may add their contact information to SGD to be listed as SGD Colleagues. Colleague information may include addresses, phone and fax numbers, research interests, web pages, and links to other Colleague entries for lab members, lab heads, or collaborators.

Co-localization

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred from co-localization of two proteins in the cell, including co-dependent association of proteins with promoter DNA in chromatin immunoprecipitation experiments.

Comparison Matrix

Programs used to align and identify regions of sequence similarity.

Computational GO Annotations

Computational GO annotations are made by a variety of computational methods, such as sequence similarity methods, including protein domains and motifs, and keyword mapping files. When annotations based on computational methods are NOT reviewed by a curator, they are placed in the Computational GO annotations section. Note that the criterion for including a GO annotation in this section is whether or not it was reviewed by a curator; when annotations made by a computational method, such as sequence analysis, are reviewed by a curator, they may be found in the Manually curated section.

Co-purification

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred from the identification of two or more protein subunits in a purified protein complex, as obtained by classical biochemical fractionation or affinity purification and one or more additional fractionation steps.

Contact

If a gene name is reserved for a feature, then SGD provides the name of the researcher who has reserved it as the 'Contact' under the 'Locus History' section. The name of the "contact" is linked to the address information for that person under the Colleague section.

Contained_Loci

A list of loci that are contained within the clone.

Correspondence Analysis (COA)

Correspondence analysis is an ordination technique that identifies the major trends in the variation of the data and distributes genes along continuous axes in accordance with these trends. Correspondence analysis has the advantage that it does not assume that the data fall into discrete clusters and therefore can represent continuous variation accurately.

Crick Strand ORF

An open reading frame (ORF) encoded on the Crick or bottom strand of the chromosome, which runs 5' to 3' from the right to left ends of the chromosome.

Curator

A keeper of the Saccharomyces Genome Database information, responsible for collecting and compiling data about yeast genetic loci and DNA sequences and providing online assistance to users of the database. The SGD Staff page lists all current yeast curators.

DAG

Directed Acyclic Graph (DAG) refers to a way of arranging objects based on their relationships and allows a child to have multiple parents.
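
A minimal sketch of the idea: each term maps to a list of its parents, and a child may list more than one. The term names below are placeholders rather than real ontology terms, and the function name is hypothetical.

```python
# Placeholder graph: "C" has two parents, "A" and "B".
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["C"],
}


def ancestors(term: str, parents: dict[str, list[str]]) -> set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: set[str] = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen
```

Because "C" has two parents, the ancestors of "D" include both "A" and "B", which a strict tree could not represent.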

DB_info

Identifies the database source of information.

DDBJ

DNA DataBase of Japan. DDBJ is a repository of DNA sequences. DDBJ is produced in collaboration with GenBank and EMBL.

Deleted Feature

A chromosomal feature that has been removed from the yeast genome catalog. Typically, features are "Deleted" because they are effectively destroyed by a sequence or annotation change (e.g., YCL006C), or because the original annotation was in error or inappropriate (e.g., YCRX03C). For record keeping, the "Deleted" feature is not removed from SGD, but is instead given "Deleted" status as a flag. Note that "Deleted" features are distinct from "Dubious" features in that "Deleted" features have been demonstrated to be incorrect and have been officially withdrawn.

Dendrogram

A branching tree-like diagram that illustrates the hierarchical relationships among items in a dataset; for example, the relationships among protein sequences of different organisms can be represented by a dendrogram.

Description

A brief description of the role that the gene plays in the cell, or a general description of the gene product.

Dosage Growth Defect

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a genetic interaction is inferred when overexpression or increased dosage of one gene causes a growth defect in a strain that is mutated or deleted for another gene.

Dosage Lethality

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, overexpression or increased dosage of one gene causes lethality in a strain that is mutated or deleted for another gene.

Dosage Rescue

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a genetic interaction is inferred when overexpression or increased dosage of one gene rescues the lethality or growth defect of a strain that is mutated or deleted for another gene.

Dubious ORF

A Dubious open reading frame (ORF) is one that is unlikely to encode an expressed protein. Dubious ORFs may meet some or all of the following criteria: 1) the ORF is not conserved in other Saccharomyces species; 2) there is no well-controlled, small-scale, published experimental evidence that a gene product is produced; 3) a phenotype caused by disruption of the ORF can be ascribed to mutation of an overlapping gene; and 4) the ORF does not contain an intron. Many ORFs classified as "Dubious" are small and overlap a larger ORF of the class "Verified" or "Uncharacterized"; however, overlap with another ORF does not mandate that an ORF be classified as "Dubious."

Epistatic Mini Array Profile (e-map)

A method that creates and quantifies high density genetic interaction maps. In this method, observed double mutant colony sizes are compared to those that would be expected from a distribution of typical double mutant colonies of each strain. Each interaction is assigned a score which indicates the magnitude of the difference from the expected value and the certainty of the score. A negative (or aggravating) score < -3 would imply synthetic sick/lethal interaction and a positive (alleviating) score > +3 would imply suppressor interaction (Schuldiner M, et al, 2005).
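
The score thresholds described above can be expressed as a small classifier. The function name and the "no call" label for scores between the cutoffs are assumptions made for illustration.

```python
def classify_emap_score(score: float, cutoff: float = 3.0) -> str:
    """Interpret an E-MAP score using the +/-3 cutoffs described above."""
    if score < -cutoff:
        return "synthetic sick/lethal"  # negative (aggravating)
    if score > cutoff:
        return "suppressor"  # positive (alleviating)
    return "no call"  # hypothetical label for scores within the cutoffs
```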

EC number

The number assigned by the Enzyme Commission for a particular enzyme activity. Currently, SGD contains EC assignments to individual proteins, made by UniProtKB/Swiss-Prot curators. EC numbers assigned to individual proteins are displayed in the "External Classifications" section of Protein Information pages, and protein-specific links to the Enzyme nomenclature database are listed in the external links sections of both the Locus Summary and Protein Information pages. These assignments are also included in the dbxref.tab file in our Download Data directories.

EMBL

European Molecular Biology Labs. The EMBL Nucleotide Sequence database is a comprehensive database of DNA and RNA sequences. The database is produced in collaboration with GenBank and the DNA Database of Japan (DDBJ).

Entrez

The Entrez Search System was developed by NCBI. Entrez allows you to retrieve molecular biology data and bibliographic citations from integrated nucleotide (GenBank, DDBJ, EMBL), protein (Swiss-Prot, PIR, PRF, PDB), and bibliographic (PubMed) databases. Within SGD database pages, external links are provided to one or more of these databases.

Epistasis

A type of genetic interaction: the nonreciprocal interaction of nonallelic genes in which the expression of one gene masks the expression of another. For example, if the expression of Gene A masks that of Gene B, Gene A is said to be epistatic to Gene B, whereas Gene B is hypostatic to Gene A.

Epistatic gene

See Epistasis

Exon

A portion of a split gene that is included in the transcript of a gene and survives processing of the RNA to become part of the spliced messenger or a structural RNA. Exons generally occupy three distinct regions of genes that encode proteins. Exons in the first region are not translated into protein, but signal the beginning of RNA transcription and contain sequences that direct the mRNA to ribosomes for protein synthesis. Exons in the second region contain the information that is translated into the amino acid sequence of the protein, and are sometimes referred to as coding exons. Exons in the third region are transcribed into the part of the mRNA that contains the signals for the termination of translation and for the addition of a polyadenylate tail.

Expect threshold

The Expect threshold ("E") is a BLAST parameter that reflects the number of matches expected to be found by chance. If the E-value of a match is greater than the Expect threshold, the match will not be reported. Decreasing the E threshold will increase the stringency of the search: fewer matches will be reported. On the other hand, increasing the E threshold will decrease the stringency of the search and result in more matches being reported. The E threshold default is set to 10 specifically for the SGD WU-BLAST tool. The E-value cutoff used for other resources and tools at SGD is documented in their respective help pages.
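
The effect of the threshold can be sketched as a simple filter over (hit name, E-value) pairs. The function name, the hit names, and the data layout are hypothetical; BLAST applies the threshold internally.

```python
def filter_hits(hits: list[tuple[str, float]],
                expect: float = 10.0) -> list[tuple[str, float]]:
    """Keep hits whose E-value does not exceed the Expect threshold."""
    return [(name, e_value) for name, e_value in hits if e_value <= expect]


# Hypothetical hits for illustration only.
example_hits = [("hit_1", 1e-50), ("hit_2", 0.5), ("hit_3", 25.0)]
```

With the default of 10, "hit_3" is dropped; lowering the threshold to 0.01 would also drop "hit_2".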

External Transcribed Spacer (ETS)

The ETS is a region of DNA in the rDNA repeat which flanks the 18S-5.8S-25S gene cluster and is included as part of its transcription unit. The 5' ETS is immediately upstream of the 18S gene and includes the A0 processing site. The 3' ETS is immediately downstream of the 25S gene.

Far Western

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is detected between a protein immobilized on a membrane and a purified protein probe.

FASTA

Program used to search simultaneously both protein and DNA sequence databases (Pearson and Lipman, 1988). FASTA uses a fast search to initially identify sequences with a high degree of similarity to the query sequence and then conducts a second comparison on the selected sequences. FASTA is slower than BLAST, but can be more sensitive and sometimes yields different results.

FDR

The False Discovery Rate (FDR) is a multiple-hypothesis testing error measure indicating the expected proportion of false positives among the set of significant results. For example, if 100 genes are reported as differentially expressed at a maximum FDR of 0.10, then at most 10 of those genes are expected to be false positives. FDR calculation is distinct from p-value calculation, in that a p-value tests an individual hypothesis rather than multiple hypotheses. The FDR is particularly useful in the analysis of high-throughput data such as microarray gene expression data.
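
One widely used way to control the FDR is the Benjamini-Hochberg procedure. The glossary does not say which procedure any SGD tool uses, so the sketch below is offered only as an illustration of the concept, with a hypothetical function name. It converts a list of p-values into adjusted values that can be compared directly against a chosen FDR.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Results whose adjusted value is at or below 0.10 would then form a set with an expected false-positive proportion of at most 10%.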

Filter options

Filtering masks portions of a query sequence that have low compositional complexity (such as short internal repeats or poly-A sequences) to reduce the frequency of statistically significant but biologically uninteresting BLAST results.

Frequency of Optimal Codons (Fop)

This index is the ratio of optimal codons to synonymous codons (genetic code dependent). Fop values for the original index are always between 0 (where no optimal codons are used) and 1 (where only optimal codons are used). When calculating the modified Fop index, negative values are adjusted to zero. [Ikemura, T. (1981). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli system. Journal of Molecular Biology 151:389-409]

FRET

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is inferred when close proximity of interaction partners is detected by fluorescence resonance energy transfer between pairs of fluorophore-labeled molecules, such as occurs between CFP (donor) and YFP (acceptor) fusion proteins.

Function

See molecular function.

GBrowse

Developed by the Generic Model Organism Database (GMOD) project, GBrowse is an interactive genome browser that can be customized to show selected chromosomal features as well as display user provided annotations.

GenBank

GenBank is the DNA sequence database sponsored by the US National Institutes of Health. GenBank is produced in collaboration with EMBL and DDBJ. There is also a searchable DNA sequence database maintained by SGD (Yeast GenBank) that contains the subset of DNA sequences submitted to GenBank that have been derived from S. cerevisiae DNA. It includes results of the systematic sequencing as well as results from individual laboratories.

Gene

The definition of a gene changes as more properties are revealed. Two classes are generally recognized: (1) genes that are transcribed into mRNAs, which enter ribosomes and are translated into polypeptide chains, and (2) genes whose transcripts are used directly (tRNAs, rRNAs, snRNAs, etc.). Class 1 genes are also known as structural genes, and have been referred to as cistrons in earlier literature. There are also other shorter DNA segments that are not transcribed but instead serve as recognition sites for enzymes and other proteins that function during replication or transcription. These types of elements are generally referred to as regulatory sequences, and should not be confused with regulatory genes, which encode proteins that bind to regulatory sequences.

Gene_Info

The guide to the literature formerly called Gene_Info is now called the Literature Guide.

Gene name

With respect to S. cerevisiae genetic nomenclature, a "gene name" refers to a name for a specific genetic marker; S. cerevisiae gene names follow a standardized format consisting of three letters (the gene symbol) followed by an integer (e.g. ADE2). Dominant alleles of the gene (most often wild-type) are denoted by all uppercase letters, while recessive alleles are denoted by all lowercase letters. Within SGD, "gene name" is synonymous with Locus. For more information please refer to the guide to S. cerevisiae nomenclature published in Trends in Genetics (download pdf).
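
The naming format described above (three letters followed by an integer, uppercase for dominant and lowercase for recessive alleles) can be checked with a short sketch. The function name and return shape are hypothetical, and the sketch covers only this basic format, not the full nomenclature guidelines.

```python
import re

# Three letters (the gene symbol) followed by an integer.
GENE_NAME = re.compile(r"([A-Za-z]{3})([0-9]+)")


def parse_gene_name(name: str):
    """Return (symbol, number, allele type), or None if the format does not fit."""
    match = GENE_NAME.fullmatch(name)
    if match is None:
        return None
    symbol, number = match.group(1), int(match.group(2))
    if symbol.isupper():
        allele = "dominant"
    elif symbol.islower():
        allele = "recessive"
    else:
        return None  # mixed case does not follow the convention
    return symbol.upper(), number, allele
```

For example, "ADE2" parses as a dominant allele and "ade2" as a recessive allele of the same gene, while a systematic ORF name such as "YEL009C" does not match this format.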

Gene Registry

At the request of the yeast community, SGD maintains a list of accepted ("standard") gene names and a system for the approval and reservation of new names. See the Gene Registry Help page for more information.

Gene Ontology (GO)

The Gene Ontology (GO) project was established to provide a common language to describe aspects of a gene product's biology. The use of a consistent vocabulary allows genes from different species to be compared based on their GO annotations. For each of three categories of biological information--molecular function, biological process, and cellular component--a set of terms has been selected and organized. Each set of terms uses a controlled vocabulary, and parent-child relationships between terms are defined. This combination of a controlled vocabulary with defined relationships between items is referred to as an ontology. Within an ontology, a child may be a "part of" or an example ("instance") of its parent. There are three independently organized controlled vocabularies, or gene ontologies, one for molecular function, one for biological process, and one for cellular component. Many-to-many parent-child relationships are allowed in the ontologies. A gene may be annotated to any level in an ontology, and to more than one item within an ontology.

Gene product

A description of the protein or RNA product (and its function, if relevant) that is coded for by the gene.

Gene/Sequence Resources

This is a resource at SGD that allows one to retrieve a list of options for accessing information available for 1) a named gene or sequence, 2) a specified chromosomal region, or 3) a raw DNA or protein sequence. This information includes biological information, table/map displays, and sequence analysis and retrieval options.

Gene Summary Paragraphs

A Gene Summary Paragraph is a summary of published biological information for a gene and its product which is designed to familiarize both yeast and non-yeast researchers with the general facts and important subtleties regarding a locus. SGD curators compose Gene Summary Paragraphs using natural language and a controlled vocabulary based on the Gene Ontology (GO). Gene Summary Paragraphs contain references and links to further information, and highlight connections between genes from yeast and other species wherever possible.

Gene symbol

S. cerevisiae gene names consist of three letters (the gene symbol) followed by an integer (e.g. ADE2). The 3-letter gene symbol is almost always a mnemonic, standing for a description of a phenotype, gene product, or function. Most (but not all) gene symbols have only one associated description, i.e., all the genes which share that 3-letter gene symbol have a related phenotype, gene product or gene function.

Genetic Map

The S. cerevisiae Genetic Map was originally known as the Mortimer Map. The last such Genetic Map was Edition 12 released in January 1995. It is a representation of the order of and distances between genetic markers (usually mutant alleles of genes) along each of the 16 different chromosomes. It is generated using the two-point data submitted from laboratories world-wide. On the map, the genetic position of a gene is given relative to the centromere, and is expressed in centimorgans.

Genetic Position

This term refers to the genetic distance between the gene and the centromere, as derived from two-point data, and is expressed in centimorgans (cM). Locations to the left of the centromere are represented as negative numbers, and locations to the right of the centromere are represented as positive numbers. For example, GCN4/YEL009C has a genetic position of -3 cM. This means the gene is 3 cM (also called map units) to the left of the centromere (on the left arm of the chromosome). TRP2/YER090W has a genetic position of 76 cM. This means it is 76 cM (map units) to the right of the centromere (on the right arm of the chromosome). Early yeast geneticists denoted the shorter arm of each chromosome, in terms of genetic distance, as the left arm and the longer arm as the right arm. However, later physical mapping efforts and sequencing of the genome showed that for some chromosomes, the arm historically called "left" is physically longer than the "right" arm. The relationship between physical distance (kilobase pairs) and genetic distance (cM) can vary greatly within and between chromosomes.
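
The sign convention described above can be sketched in Python. This is only an illustration: the function name describe_genetic_position is hypothetical and not part of any SGD tool, and the handling of a position of exactly 0 is an assumption.

```python
def describe_genetic_position(position_cm: float) -> str:
    """Turn a signed genetic position (in cM) into a plain-language description.

    Hypothetical helper: negative values lie to the left of the centromere,
    positive values to the right, as described in the entry above.
    """
    if position_cm < 0:
        return f"{abs(position_cm):g} cM to the left of the centromere (left arm)"
    if position_cm > 0:
        return f"{position_cm:g} cM to the right of the centromere (right arm)"
    # Assumption: a position of exactly 0 is reported as the centromere itself.
    return "at the centromere"


print(describe_genetic_position(-3))  # GCN4/YEL009C
print(describe_genetic_position(76))  # TRP2/YER090W
```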

genoSc

A searchable DNA sequence database maintained by SGD that contains the complete Saccharomyces cerevisiae genome sequence as revealed by the international systematic sequencing effort.

GO

See Gene Ontology.

GO Annotation

GO Annotations are statements generated from published literature about the function(s) and biological role(s) of a gene product in the cell, and where (location) in the cell the gene product carries out its functions. These statements consist of four mandatory components: a gene product, a term from one of the three Gene Ontology (GO) controlled vocabularies, a reference, and an evidence code. A gene product is typically a protein or a gene but can also be a functional RNA.

GO Annotation Method

Used to identify the methods used in the cited reference and the curation method used to make a GO annotation. Current methods include Manually curated, High-throughput, and Computational.

GO Annotation Source

Refers to the Annotating/Database group that made the GO annotation.

GO Slim

A GO Slim is a selection of high-level terms from the Biological Process, Molecular Function, and Cellular Component ontologies. These are more general terms that represent major branches in each ontology. For example, the GO term nucleus is a GO Slim term from the Cellular Component ontology. Its children (perinuclear space, nuclear matrix, etc) are more detailed GO terms and not GO Slim terms. The GO Slim Mapper identifies the GO Slim terms for a list of genes based on their annotation to detailed GO terms. The go_slim_mapping.tab file available on the SGD Download page maps all gene products to a yeast-specific GO Slim. The yeast-specific GO Slim contains a set of GO terms that best represent the major biological processes, functions, and cellular components that are found in S. cerevisiae.

High score

In the results of a BLAST search, the score of the highest-scoring HSP found with each database sequence is listed in the "high score" column.

High Scoring Segment Pairs (HSPs)

In a BLAST search, an HSP is two sequence fragments (one from the query sequence and the other from a database sequence) that show a locally maximal alignment for which the alignment exceeds a pre-defined cutoff score.

High-throughput GO Annotations

Refers to the GO annotation method that includes annotations made from published experiments performed on a high-throughput or genome-wide basis, where the annotations are not individually reviewed by curators. A curator reviews the evidence for only a subset of the results from a high-throughput or genome-wide study, not for each result.

Hydropathicity of protein (GRAVY score)

This index is the general average hydropathicity (GRAVY) score for the hypothetical translated gene product. It is calculated as the arithmetic mean of the hydropathic indices of all amino acids in the sequence, i.e. their sum divided by the sequence length (Kyte and Doolittle 1982). This index has been used to quantify the major COA trends in the amino acid usage of E. coli genes.
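
As a rough illustration of the calculation, the sketch below averages Kyte-Doolittle hydropathy values over a protein sequence. The function name gravy and the stripping of trailing "*" stop symbols are assumptions made for this example; SGD's own implementation may differ.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids
# (Kyte and Doolittle 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein: str) -> float:
    """Return the mean hydropathy index of a protein sequence.

    Hypothetical helper. Trailing '*' stop symbols are dropped (an
    assumption); any other non-standard residue raises KeyError.
    """
    residues = protein.upper().rstrip("*")
    if not residues:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


print(round(gravy("MKVLAAGIW"), 3))
```

Positive scores indicate a more hydrophobic protein overall, and negative scores a more hydrophilic one.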

Hypostatic gene

See Epistasis

Identity

An alternative comparison matrix for FASTA searches.

Identity-weighted

An alternative comparison matrix for FASTA searches.

Indel

A hybrid term (combining the words "insertion" and "deletion") used to describe a difference in sequence due to either an insertion or a deletion event; especially used when the evolutionary direction of the change is unspecified.

Interactions Database

See BioGRID.

Internal Transcribed Spacer (ITS)

The ITS is a region of DNA in the rDNA repeat which flanks the 5.8S gene and is included as part of the transcription unit of the 18S-5.8S-25S gene cluster. ITS1 is immediately upstream of the 5.8S gene and ITS2 is immediately downstream of the 5.8S gene.

Intron

A portion of a split gene that is transcribed into RNA, but subsequently removed from within the transcript prior to translation.

JBrowse

Developed by the Generic Model Organism Database (GMOD) project, JBrowse is a genome browser with a fully dynamic AJAX interface, developed as the successor to GBrowse. It is very fast and scales well to large datasets. JBrowse is JavaScript-based and does almost all of its work directly in the user's web browser, with minimal requirements for the server.

Keyword

A keyword is a word identified as particularly informative about an object. In a sequence, a keyword often relates to the identity of a gene or the function of the gene product. References often have a list of keywords that are Medline MeSH terms. Keywords are good to use in text searches.

Kyoto

An external link in the Locus or Clone page to the Kyoto Encyclopedia of Genes and Genomes. The link goes directly to the information for that specific enzyme.

Last_update

"Last_update" in the GO annotations page indicates the most recent date that information was entered into the database for a given locus.

Literature Guide

The Literature Guide (formerly called Gene_Info) is a guide to the literature for a given locus and is derived from journal articles. SGD performs a search through all PubMed literature for all papers mentioning that locus and any aliases. SGD curators read the abstract or full text of those papers and assign the papers to one or more Topics that describe the kind of biological information they contain. The Literature Guide is thus designed to help the user easily find the papers relevant to a given locus. See the Literature Guide Help page for more information.

Literature Guide Annotation

At SGD, Literature guide annotations are topics that are associated with papers in order to categorize them, to facilitate searching by users for specific types of information. These annotations may be linked to genes or not, depending on the information in the paper. A complete list of literature guide topics is available in the Literature Guide Help document.

Locus

A "locus" most often is a gene, characterized by a mutant phenotype or by a DNA sequence, which has been either genetically mapped or otherwise localized (e.g. by DNA sequence comparison or hybridization) to a particular spot in the yeast genome. A locus may also be a DNA sequence feature such as a centromere. A very small number of "loci" which are contained in the database have not been genetically mapped or otherwise localized, but instead have only been shown to be a mutant phenotype that segregates as a single gene. Therefore these are not "loci" in the strict sense of the word, but they are included in the database because the names and information about these putative "loci" have been published.

Locus history

Locus history records any comments of interest associated with the gene, such as mapping information, other names that the gene has been called (especially in the case where the other name is used in the database for yet a different locus), etc., and can be viewed by clicking the Locus History link from the bottom of each locus page. It includes update information from the Locus_notes category as well as notes added since the conversion to Oracle. For reserved gene names, the Locus history includes the reservation date and expiration date.

Locus_notes

The Locus_notes section of a locus history page is used to document any comments of interest associated with the gene, such as mapping information, other names that the gene has been called (especially in the case where the other name is used in the database for yet a different locus), etc. The number that precedes the comment refers to the edition of Mortimer et al. (i.e., the yeast genetic and physical map publication) in which the comment first appears.

Long Terminal Repeat (LTR)

Identical sequences, typically several hundred nucleotides in length, that are located both at the ends of intact Ty retrotransposons and as solo elements present in multiple copies throughout the genome. There are several types of LTR elements in yeast: delta, tau, sigma and omega.

Manually curated GO Annotations

Refers to the GO Annotation Method that includes annotations made by curators reading the literature for each gene and making annotations from published papers when available. When published literature is available, such annotations may include those based on experiments, sequence similarity, or other computational analyses described in the paper, or on statements made by the authors. When no published literature is available for a gene, annotations may be made on the basis of curatorial judgements.

Map

If a locus has been genetically mapped, the "ORF Map" and "Genetic position" under the Sequence Coordinates section of the locus page will display details of the locus/feature. The Roman numeral to the right of "Map" indicates the chromosome to which the locus maps. The number to the right of "Genetic Position" indicates the map position of the locus (in centimorgans) from the centromere, where negative numbers indicate distances to the left of the centromere (the left arm) and positive numbers correspond to right arm distances.

Mapping_data

This displays links to all of the 2-point cross tetrad data where the locus was used as one of the markers.

Medline

Medline is the National Library of Medicine's database of biomedical papers; it contains all citation information for each paper, as well as abstracts for most of the papers.

Medline UID

The "Medline" tag that appears within the listed information for a paper contains the Medline unique identifying number (UID) for the paper; the first 2 numbers usually (but not always) indicate the year of publication.

Merged Feature

A chromosomal feature that was once annotated as a distinct entity, but that has now been subsumed by another feature. Typically, features become "Merged" because of a change in chromosomal sequence or annotation (e.g., YAR004W). For record keeping, the "Merged" feature is not removed from SGD, but is instead given the "Merged" status as a flag.

Minimal Tiling Path

A map or table showing placement and order of a set of clones that completely, contiguously cover some segment of DNA in which you are interested.

Molecular Function

One of the three categories used by the Gene Ontology project, molecular function describes the tasks performed by individual gene products; examples are transcription factor and DNA binding.

motif

A meaningful pattern of nucleotides or amino acids that is shared by two or more molecules.

Multigene Locus

A closely linked cluster of functionally related genes.

N

In the results of a BLAST search, the number of HSPs that are present in the set that was assigned the lowest P-value is reported in the "N" column.

NCBI

The National Center for Biotechnology Information (NCBI) is part of the National Library of Medicine (NLM) in the National Institutes of Health (NIH). Its mission is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. NCBI developed and maintains the Entrez Search System and PubMed database.

Negative Genetic

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, mutations/deletions in separate genes, each of which alone causes a minimal phenotype, result in a more severe fitness defect or lethality under a given condition when combined in the same cell.

NiceZyme

An external link on the Locus or Clone page to the Enzyme nomenclature database maintained by SwissProt. The link goes directly to the information for that specific enzyme.

Nomenclature Note

This note is used by SGD on the locus page to clarify gene naming issues, such as two different ORFs being referred to by the same gene name in the literature.

Nontranscribed Spacer (NTS)

The NTS is a region of DNA flanking the 5S rRNA gene within the ribosomal DNA repeat. NTS1 lies between the 25S and 5S genes, and NTS2 lies between the 5S and 18S genes. Note that NTS regions are not included in the 35S pre-rRNA transcript, and the 5S gene is transcribed independently of the other rRNA genes and in the opposing direction.

NotFeature

A BLAST dataset that includes only those sequences that are NOT included in a genetic feature. This dataset DOES NOT include ORFs, RNA genes, centromeres, transposons, etc.

Non-coding Exon

An exon that does not direct the production of a peptide sequence.

NRSC

A searchable DNA sequence database maintained by SGD that contains a non-redundant set of S. cerevisiae proteins. For example, while there may be 10 individual DNA sequences or protein sequences for a particular gene, it will only be represented once in the NRSC.

Null mutation

A mutation that results in the complete loss of function of a gene product. Often used to refer to the complete deletion of a protein-coding sequence; however, point mutations, partial deletions, and insertions may also result in a null phenotype.

Olson_Map

On a Clone page, the position of the clone on the physical map is given under this tag. The first term of this tag indicates the chromosome (in Roman numerals) on which the clone resides. The subsequent numeric entries are the beginning and ending chromosomal basepair coordinates of the clone's DNA sequence.

Olson_Rest_Data

The EcoRI-HindIII restriction pattern of an Olson-Riles clone.

Olson-Riles clones

The Olson-Riles clones are a set of overlapping cosmid and lambda clones covering the entire yeast genome that were ordered on the basis of EcoRI and HindIII restriction mapping (see Riles, L. et al. (1993) Genetics 134:81-150). Only a subset of these clones (the ones available from the ATCC) are presented on the GBrowse display; note that ATCC and Washington Univ. clone numbers are cross-referenced for these.

ORF

'ORF' refers to a stretch of DNA that could potentially be translated into a polypeptide or RNA: i.e., it begins with an ATG "start" codon and terminates with one of the 3 "stop" codons. For an ORF to be considered as a good candidate for coding a bona fide cellular protein, a minimum size requirement has often been set, e.g., during the yeast genome sequencing project an ORF was defined as a stretch of DNA that would encode a protein of 100 amino acids or more. An ORF is not usually considered equivalent to a gene or locus until there has been shown to be a phenotype associated with a mutation in the ORF, and/or an mRNA transcript or a gene product generated from the ORF's DNA has been detected. See ORF naming conventions for how ORFs are named in Saccharomyces cerevisiae. What SGD and the Saccharomyces community typically call an ORF is generally called a Coding Sequence (CDS) elsewhere.

ORF-coding

This is a dataset of DNA sequences searchable by BLAST or FASTA. This dataset consists of the standard ORFs defined by the yeast genomic sequencing project. The DNA sequences include stop codons, but do not include introns or any upstream or downstream sequences.

ORF name

Also called "systematic name." If a locus (gene) has been sequenced and placed onto the yeast genome sequence, it has an ORF name. On the Locus page, the "ORF_name" appears under the Systematic Name category, and the "Feature Type" category indicates that it is an ORF. (See "ORF naming conventions" for how ORFs are named in Saccharomyces cerevisiae.)

ORF naming conventions

All S. cerevisiae ORFs are designated by a symbol consisting of three uppercase letters followed by a number and then another letter, as follows: Y (for "Yeast"); A - P for the chromosome upon which the ORF resides (where "A" is chromosome I, up to "P" for chromosome XVI); L or R (for Left or Right arm); a 3-digit number corresponding to the order of the open reading frame on the chromosome arm (starting from the centromere and counting out to the telomere); and W or C for whether the open reading frame is on the "Watson" or "Crick" strand (where "Watson" runs 5' to 3' from left telomere to right telomere). Most ORF designations by the systematic sequencing groups use a predicted 100 amino acid polypeptide as the minimum size limit, except when a smaller gene has already been characterized and localized to the chromosomal sequence. If a new ORF is discovered between two existing ORFs, the new ORF will usually be named by taking the name of an adjacent ORF and adding an "A" or "B" to the end of it (this avoids re-numbering all the distal ORFs). See the Nomenclature Conventions page for more information.
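
The naming scheme above lends itself to a small parser. The following Python sketch decodes a systematic name into its parts. The function name parse_orf_name, the returned dictionary keys, and the hyphenated form of the optional trailing letter (e.g. "-A") are assumptions made for illustration; only the pattern described in this entry is handled.

```python
import re

# Chromosome letters A-P map to chromosomes I-XVI.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]

# Y + chromosome letter (A-P) + arm (L/R) + 3-digit order + strand (W/C).
# Assumption: the extra letter added for a newly inserted ORF is written
# with a hyphen, as in "-A".
ORF_NAME = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])(?:-([A-Z]))?$")


def parse_orf_name(name: str) -> dict:
    """Split a systematic ORF name into chromosome, arm, order, and strand."""
    match = ORF_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"not a systematic ORF name: {name!r}")
    chrom_letter, arm, order, strand, suffix = match.groups()
    return {
        "chromosome": ROMAN[ord(chrom_letter) - ord("A")],
        "arm": "left" if arm == "L" else "right",
        "order_from_centromere": int(order),
        "strand": "Watson" if strand == "W" else "Crick",
        "suffix": suffix,  # None unless an extra letter was appended
    }


print(parse_orf_name("YEL009C"))
```

For YEL009C this reports chromosome V, left arm, ninth ORF out from the centromere, Crick strand.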

ORF_sequence

If sequence data are available for a locus, the sequence may be retrieved via the Locus Summary page. (See "ORF naming conventions" for how ORFs are named in Saccharomyces cerevisiae.) The ORF_sequence tag gives the ORF name for the locus, and connects to the text entry in the Sequence class for that ORF.

ORF-Trans

This is a dataset of protein sequences searchable using BLAST or FASTA. This dataset consists of protein translations of the standard ORFs defined by the yeast genomic sequencing project. The protein sequences include stops.

Orthologs

Genes that have evolved directly from the same ancestral locus. See Paralog.

Parent Term

This term is used in the context of the Gene Ontology. It refers to a controlled vocabulary term that represents a less specific, general aspect of biology than its one or more child terms and is placed at a higher level in the ontology. Parent terms have one or more child terms. For example, the term "cytoplasm" is a parent term that has several child terms, including "endoplasmic reticulum" and "Golgi apparatus".

PathCalling

See Two Hybrid Portal PathCalling.

P(N)

In the results of a BLAST search, the lowest P-value given to any set of HSPs found in a database is listed in the "P(N)" column.

P-value For BLAST

In a BLAST search, a P-value refers to the probability of obtaining, by chance, a pairwise sequence comparison of the observed similarity given the length of the query sequence and size of the database searched. Thus, low P-values indicate sequence similarities of high significance.

P-value For GO Term Finder

To determine the statistical significance of the association of a particular GO term with a group of genes in the list, GO Term Finder calculates the p-value: the probability or chance of seeing at least x number of genes out of the total n genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO Term. That is, the GO terms shared by the genes in the user's list are compared to the background distribution of annotation. The closer the p-value is to zero, the more significant the particular GO term associated with the group of genes is (i.e. the less likely the observed annotation of the particular GO term to a group of genes occurs by chance).
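
The probability described above can be sketched as a hypergeometric tail sum. The entry does not name the distribution, so this sketch assumes that the gene list is drawn without replacement from the genome. The function and parameter names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def go_term_p_value(x: int, n: int, annotated_in_genome: int, genome_size: int) -> float:
    """P(at least x of the n listed genes are annotated to a GO term).

    Assumes a hypergeometric model: n genes drawn without replacement from
    genome_size genes, of which annotated_in_genome carry the term.
    """
    M, N = annotated_in_genome, genome_size
    total = comb(N, n)  # number of ways to choose any n genes
    # Sum the probability of every outcome with k >= x annotated genes.
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(x, min(n, M) + 1))
    return tail / total


# Toy numbers: 10-gene genome, 5 annotated to the term, both genes in a
# 2-gene list annotated.
print(go_term_p_value(x=2, n=2, annotated_in_genome=5, genome_size=10))
```

A smaller return value means the observed number of annotated genes is less likely to have arisen by chance.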

PAM120

Sequence alignment matrix that allows 120 accepted point mutations per 100 amino acids. A higher PAM is more suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing closely related sequences (Schwartz and Dayhoff, 1978).

PAM250

Sequence alignment matrix that allows 250 accepted point mutations per 100 amino acids. PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences (Schwartz and Dayhoff, 1978).

PAM250-Gonnet

Sequence alignment matrix that allows 250 accepted point mutations per 100 amino acids using scoring tables recalculated since the creation of PAM250 (Gonnet et al., 1992). PAM250-Gonnet is better than PAM250 for comparing distantly related sequences.

PAM40

Sequence alignment matrix that allows 40 accepted point mutations per 100 amino acids. PAM40 is suitable for comparison of closely related sequences, while a higher PAM is suitable for comparison of more distantly related sequences (Schwartz and Dayhoff, 1978).

Paralog

A gene that originated by duplication and then diverged from the parent copy by mutation and selection or drift. See orthologs.

PatMatch

This is a pattern matching program that permits the identification of patterns or motifs within the collection of all S. cerevisiae protein or DNA sequences. It offers an alternative to sequence alignment techniques such as BLAST and FASTA for identifying nucleotide or peptide sequences with conserved or biologically interesting regions.

PCA

Protein-fragment Complementation Assay (PCA): any of a family of protein-protein interaction assays in which a bait protein is expressed as a fusion to either the N- or C-terminal peptide fragment of a reporter protein and a prey protein is expressed as a fusion to the complementary N- or C-terminal fragment of the same reporter protein. Interaction of the bait and prey proteins brings together the complementary fragments, which can then fold into an active reporter (e.g. PMID: 15048128, Fig. 1a, Fig. 2a). Reporter protein examples include dihydrofolate reductase (DHFR), green fluorescent proteins and different emission variants, beta-lactamase, and luciferases (e.g. PMID: 17599086).

PDB

The Protein Data Bank (PDB) is an archive of experimentally determined three-dimensional structures of biological macromolecules, based at the Brookhaven National Laboratory.

Phenotype

In the Locus page, "phenotype" refers to the observable traits of strains that carry a mutation at that locus.

Phenotypic Enhancement

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a genetic interaction is inferred when mutation or overexpression of one gene results in enhancement of any phenotype (other than lethality/growth defect) associated with mutation or overexpression of another gene.

Phenotypic Suppression

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a genetic interaction is inferred when mutation or overexpression of one gene results in suppression of any phenotype (other than lethality/growth defect) associated with mutation or overexpression of another gene.

Physical map

The physical map was originally defined as the Olson-Riles clones, a set of overlapping cosmid and lambda clones covering the entire yeast genome that were ordered on the basis of EcoRI and HindIII restriction mapping. However, with the completion of the entire yeast genome's DNA sequence, the physical map is now equivalent to the genomic sequence. SGD provides an on-the-fly generated Genome Browser that includes the locations of ATCC clones, ORFs and other sequence features.

PIR

PIR (Protein Information Resource) is a protein database. The PIR database has three sites, PIR-DE based in Germany, PIR-JP based in Japan, and PIR-US in the United States.

Point mutation

A single nucleotide change that substitutes one nucleotide for another. A point mutation in the coding sequence of a gene affects a single codon, and often allows expression of an intact but nonfunctional or partially functional protein.

Position

In the Colleague information, "Position" refers to the job title held. Assistant professor, graduate student, staff scientist, university president are all examples of positions.

Positive Genetic

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, mutations/deletions in separate genes, each of which alone causes a minimal phenotype, result in a less severe fitness defect than expected under a given condition when combined in the same cell.

Positive_locus

This gives the locus name associated with the clone (usually an ORF clone).

Prime clone

The term "prime clone", as used in the SGD database, refers to any piece of DNA that is available from the ATCC; these are mostly the Olson-Riles set of cosmid and lambda clones as well as many of the cosmid and lambda clones sequenced by the systematic sequencing groups.

Process

See biological process.

Profession

In the colleague information, this refers to the type of work done by the researcher. Examples of Professions are molecular biologist, biochemist, instructor, winemaker, doctor, lawyer etc.

Protein Info

Protein Info refers to information such as molecular weight, protein length, and pI pertaining to the protein produced by the gene. Users can reach this page from the locus page or by searching for a gene/ORF name from the Search Protein Info option. The Protein Info page for each gene also provides links to the Literature guide, PDB homologs/Motifs, Interactions, and Comparison resources.

Protein-peptide

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is detected between a protein and a peptide derived from an interaction partner. This includes phage display experiments.

Protein-RNA

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, an interaction is detected between a protein and an RNA.

PubMed

PubMed is a database of bibliographic information developed by NCBI.

Purified Complex

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, interaction is detected when proteins are co-purified in some manner (co-elution, co-fractionation, co-sedimentation, etc.) where there is not an actual bait-prey type interaction.

Query sequence

A sequence, either amino acid or nucleotide, chosen by the user to use in a database search. Two utilities, BLAST and FASTA, use a query sequence to perform searches. A query sequence can be typed or pasted into the query window on the search form. Lengthy sequences can be copied after retrieval from the SGD Sequence form and pasted into the query window using the Netscape EDIT commands. BLAST searches require a minimum query sequence length of 15 nucleotides or amino acids. FASTA searches require a minimum sequence length of 8 nucleotides or amino acids. PATMATCH can also be used to query SGD for short peptide or nucleotide sequences.

RAW format

A format in which the nucleotide sequence appears without headers or comments. RAW format must be used when performing an S. cerevisiae search in BLAST or FASTA.

Reconstituted Complex

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, interaction is detected between purified proteins in vitro.

Reference

Within SGD, a "reference" is most often a published article in a scientific journal or book; however some references are unpublished results or personal communications to SGD. A comprehensive list of references may be obtained for a given locus within its Literature Guide section.

Regulatory Region

A region that is involved in controlling the expression of a gene. Regulatory regions can include, but are not limited to, transcription factor binding sites that regulate the transcription of a gene or untranslated regions that regulate the protein levels of a gene product.

Related Sequences

A feature of Entrez that finds related nucleotide (GenBank) or protein (GenPept) sequences using similarity searches.

Repeat Region

A region containing some type of tandemly repeated sequence. For example, in the case of the telomeric Y' elements, these are known as "36-bp repeats".

Research_interest

In the Colleague information section, Research_interest refers to the broad areas of study the colleague is pursuing. Examples might be: protein translocation, DNA replication, histones or cytoskeleton.

Reserved gene name

Gene names that are soon to be published can be reserved with SGD.

Retired name

A 'Retired name' is a gene name that was reserved for an ORF by a member of the yeast community, but never published. A gene name reservation is good for one year. After this time, if SGD is unable to determine that the gene name has been published and unable to contact the person who made the reservation or if the submitter of the reserved gene name requests that SGD discontinue/delete the gene name reservation, such gene names become Retired names. SGD retains such gene names rather than deleting them since these names have existed in the database for a significant period of time (usually more than 2 months). When this occurs, it is documented with a note in the Locus History Page of the relevant ORF.

Retrotransposon

Transposable element that mobilizes via an RNA intermediate. Each DNA segment in the host chromosome is transcribed into RNA and then reverse-transcribed via a reverse transcriptase into a DNA segment. This is reinserted into the host genome, usually at a new site. Retroposon is a shortened form of retrotransposon, and also appears in the literature.

Ribosomal RNA (rRNA)

The abundant RNA component of both cytosolic and mitochondrial ribosomes. 5S, 5.8S, 18S and 25S rRNAs are components of the cytosolic ribosomes encoded in a 1-2 Mb region, known as the ribosomal DNA repeat, on the right arm of chromosome XII. This region contains 100-200 tandem copies of a 9.1 kb repeat that is transcribed as a precursor 35S pre-rRNA which is later processed to yield the mature form. Mitochondrial ribosomes contain 15S and 21S rRNAs that are encoded by the mitochondrial chromosome.

RNA gene

The non-coding RNA components of ribonucleoproteins (for example see NME1, RPR1, RUF1-8, SCR1 and TLC1).

SAGE class

Each SAGE tag is put into one of four "classes" based on its location relative to known ORFs and is assigned a color in graphic displays:

1 - within an ORF (orange);

2 - within 500 bp 3' of an ORF (violet);

4 - on the strand opposite an ORF (yellow);

3 - none of the above (bright pink).

SAGE tag

The SAGE technique (Serial Analysis of Gene Expression) has been used to analyze the expression profile of thousands of genes across the yeast genome, i.e. the yeast "transcriptome" (Velculescu, et al., (1997) Cell 88:243-251). A SAGE tag is a 14-nucleotide sequence that has been found within a mRNA. The relative abundance of a particular SAGE tag within a pool of tags gives some indication of the level of expression of the gene(s) containing that tag. The SGD SAGE data can be viewed on the GBrowse Genome Browser.

Segment

To ease handling of the large amount of DNA sequence, the genomic sequences have been divided into 10 kb segments that overlap their neighbors by 5 kb. The segment's name shows from which chromosome it was generated and where on that chromosome it is located. For example, segment G165 is a segment of chromosome VII ("G" is the seventh letter of the alphabet) that extends from coordinate 165,001 to 175,000.
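
The naming scheme can be sketched in a few lines of Python. This is only an illustration: the function name parse_segment is hypothetical, and the rule that the number in the name is the segment's start offset in kb is inferred from the single G165 example above, not from a documented specification.

```python
import string

SEGMENT_LENGTH = 10_000  # each segment is 10 kb, as stated in the entry above


def parse_segment(name):
    """Return (chromosome_number, start, end) for a segment name such as 'G165'.

    Assumption: the digits after the chromosome letter give the start
    offset in kb, so 'G165' starts at coordinate 165,001.
    """
    letter, offset_kb = name[0].upper(), int(name[1:])
    chromosome = string.ascii_uppercase.index(letter) + 1  # 'A' -> 1, 'G' -> 7
    start = offset_kb * 1000 + 1
    return chromosome, start, start + SEGMENT_LENGTH - 1


print(parse_segment("G165"))  # (7, 165001, 175000)
```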

Sequence

Sequence in SGD consists not only of all S. cerevisiae sequences that are publicly available via GenBank/EMBL/DDBJ, but also each of the 16 complete chromosome sequences generated by the systematic sequencing effort. Sequence for a named or uncharacterized ORF can be retrieved in FASTA or GCG format via the 'Retrieve Sequence' pull-down menu on the right-hand side of each locus page. Sequence analysis tools such as BLAST, Gene/Seq Resources, Genome Restriction map, and Design Primers are also available.

Sequence Annotation

Sequence annotations refer to information regarding the positions of genetic elements on a chromosome.

Sequence Coordinates

Sequence coordinates on each locus page refer to the start and stop coordinates of the ORF on the chromosome with information on exons and introns. A link to GBrowse is also available as a visual aid.

Sequence features

A sequence feature is defined as any gene or other genetic element that resides on a chromosomal sequence, including ORFs, tRNAs, snRNAs, and CENs (centromeres).

SGD

Saccharomyces Genome Database. The SGD project collects information and maintains a database of the molecular biology of the yeast Saccharomyces cerevisiae. This database includes a variety of genomic and biological information. SGD is funded by the National Center for Human Genome Research (NCHGR) at the U.S. National Institutes of Health. SGD is in the Department of Genetics at the School of Medicine, Stanford University.

SGD curated paper

An "SGD curated paper" is any reference (published or unpublished) that is relevant to SGD and may or may not be manually curated.

SGD gene naming guidelines

The rules used by SGD curators to give genes names conforming to standard yeast nomenclature. The guidelines also include the recommendations used for resolving gene name conflicts.

SGDID

A unique identifying number within SGD which is specific for a single item such as a feature name.

SGDID_Secondary

If a locus is found to be identical to a previously-named locus, the information for the two loci will be merged under the chosen Standard name, and the other name will be listed as an alias (Alias). The SGDID number that was previously used for the other name will then be listed under the SGDID_Secondary ID.

Signal Sequences

These refer to a continuous stretch of amino acid residues that is removed from the mature protein once the protein has been sorted. These are also referred to as Sorting Signal Sequences. Predicted cleavage sites of sorting signal sequences are displayed at SGD on the protein pages.

Signal Patches

These refer to amino acid residues that are distant from one another in the primary sequence of the protein but come close to each other in three-dimensional space when the protein is properly folded. Signal patches are not cleaved from the mature protein after sorting. This category of signal sequences is very difficult to predict and is not displayed on the protein pages at SGD.

Small nuclear RNA (snRNA)

The small RNA component of small nuclear ribonucleoproteins (snRNPs) or snurps. They are located in the nucleus and are important for splicing of hnRNAs and telomeric maintenance.

Small nucleolar RNA (snoRNA)

The stable RNA component (numbering 75-100) of ribonucleoproteins that are located in the nucleolus and are typically involved in nucleotide modification of rRNA, cleavage of precursor rRNA, rRNA folding and assembly of ribosomal subunits. They generally fall into two groups: the box C/D family and box H/ACA family.

Smith-Waterman alignment

An amino acid sequence alignment that illustrates sequence similarity. The alignment is generated using the Smith-Waterman algorithm (T. F. Smith and M. S. Waterman, (1981) J. Mol. Biol. 147:195-197 and W.R. Pearson (1991) Genomics 11:635-650).
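
As a rough illustration of the dynamic-programming idea behind the Smith-Waterman algorithm, the sketch below computes only the best local alignment score. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions chosen for the example; alignments displayed by real tools typically use an amino acid substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: identical residues score `match`, different
    residues score `mismatch`, and each gap position costs `gap`.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
```

Because every cell is floored at zero, a conserved stretch embedded in otherwise unrelated sequences still yields a high score, which is what distinguishes a local alignment from a global one.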

Source Exons

Source exons identifies the size of each exon encoded by a larger sequence. If the coding region contains introns, then multiple exons will be listed. Each exon is indicated by beginning and ending basepair coordinates relative to the start of the coding sequence. The exon sizes are obtained from the GenBank sequence file.

Source range

In a BLAST alignment, the range of the query sequence (expressed as nucleotide or amino acid coordinates) over which it is aligned with the target sequence.

Split Gene

Gene containing coding regions (exons) that are interrupted by one or more non-coding regions (introns).

Standard locus name

The "standard locus name" refers to the name that the SGD has decided to use as the primary name for a given locus (gene), based on the SGD gene-naming guidelines. All information in the database concerning this gene will be listed within the standard name's locus window. Any other names that have been used for this gene are listed as "Alias" within the standard name locus window; these "other names" also have separate locus windows associated with them, but they serve merely as links to the standard locus name window.

Subsequence

"Subsequence" in the Sequence page indicates that the sequence is part of a larger GenBank sequence. The beginning and ending basepair coordinates of the subsequence relative to the entire GenBank entry are given after the subsequence name.

SwissProt

SwissProt is an annotated protein sequence database. Within a Locus page, an external link is provided (at the "SwissProt" tag) to the SwissProt entry for the gene, which includes the amino acid sequence for the protein encoded by the gene.

Synteny

Location of genes on the same chromosome, e.g., genes with a common chromosomal location are said to be part of the same syntenic group. Also used to refer to conservation of gene order across species, e.g., if orthologous genes are located together and in the same relative order in different species, then the block of genes is said to be syntenic between the species.

Synthetic genetic interaction

A genetic interaction in which a combination of mutations in two or more genes of a single strain results in a phenotype that is different in degree or nature from the phenotypes conferred by the individual mutations. The most common type of synthetic interaction is synthetic lethality, in which two mutations, neither of which causes inviability when present individually, cause inviability when present together in the same strain. Synthetic interactions may result in other phenotypes, e.g., slow growth or respiratory growth defect. Synthetic interaction can be an indication that the genes involved participate in the same pathway or process; synthetic interaction between two point mutant alleles may be an indication that the gene products physically interact.

Synthetic growth defect

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a genetic interaction is inferred when mutations in separate genes, each of which alone causes a minimal phenotype, result in a significant growth defect under a given condition when combined in the same cell.

Synthetic lethality

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, mutations or deletions in separate genes, each of which alone causes a minimal phenotype, result in lethality when combined in the same cell under a given condition.

Synthetic rescue

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, mutation/deletion of one gene rescues the lethality or growth defect of a strain mutated/deleted for another gene.

Target range

In a BLAST alignment, the range of the target sequence (expressed as nucleotide or amino acid coordinates) over which it is aligned with the query sequence.

tblastn

A BLAST program that compares a protein query sequence against a nucleotide sequence dataset dynamically translated in all six reading frames (both strands). The user must enter an AMINO ACID sequence and select one of the NUCLEOTIDE datasets (i.e., genoSc or GenBank) for the search.

tblastx

A BLAST program that compares the six-frame translations of a nucleotide sequence to the six-frame translations of a nucleotide sequence dataset. The user must enter a NUCLEOTIDE sequence and select one of the NUCLEOTIDE datasets (i.e., genoSc or GenBank) for the search.
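
Both tblastn and tblastx depend on translating a nucleotide sequence in all six reading frames. The following sketch shows what that means in practice; it is not BLAST's own code. It assumes the standard nuclear genetic code (not the yeast mitochondrial code), and the function names and frame labels ("+1" to "-3") are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels '+1'..'+3' and '-1'..'-3' to proteins."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA")["+1"])  # MA*
```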

TC number

The number assigned by the Transporter Classification System for a transporter protein family, subfamily or individual example. This system was developed by Milton Saier (Saier MH Jr (2000) Microbiol Mol Biol Rev 64(2):354-411) and is maintained at the Transport Classification Database (TCDB). Currently, SGD contains TC assignments to individual proteins, made by the Yeast Transporter Information (YETI) project, part of Genolevures (De Hertogh B, et al. (2006) Genetics 172(2):771-81). The TC numbers assigned to individual proteins are displayed in the "External Classifications" section of Protein Information pages, and links to TCDB are listed in the external links sections of both the Locus Summary and the Protein Information pages. These assignments are also included in the dbxref.tab file available from our Download Data site.

Telomeric region

DNA sequence located at the very ends of a chromosome. The telomeric regions are complex mosaics of several different types of telomeric and subtelomeric elements known as X element core sequences, X element combinatorial repeats, telomeric repeats, and Y' elements. Possible functions include roles in chromosomal segregation, maintenance of chromosome stability, recombinational sequestering, or as a barrier to transcriptional silencing.

Telomeric repeat

The telomeric repeat is a G-rich terminal DNA sequence of the form (TG(1-3))n or more precisely ((TG)(1-6)TG(2-3))n. The repeats are maintained by telomerase and there are generally 300 +/- 75 bp of TG(1-3) at a given end. Telomeric repeats function in completing chromosome replication and protecting the ends from degradation and end-to-end fusions.

Tetrad analysis

When diploid S. cerevisiae cells undergo meiosis, four haploid spores are produced which are enclosed within a sac called an ascus. Using a micromanipulator, the 4 spores (the "tetrad") can be "dissected" out of the ascus and separated on an agar plate. The spores then germinate, and the colonies that arise can be tested for the presence or absence of phenotypes characteristic of the genetic markers present in the original diploid strain. In this way the segregation pattern of the genetic markers can be assessed, allowing determination of whether the genes are linked and the estimation of the genetic distance between the markers. When 2 genetic markers are analyzed in this way, the data are referred to as 2-point data.
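
The entry above does not say how the genetic distance is estimated. One widely used estimate for 2-point tetrad data is Perkins' formula, which uses the counts of parental ditype (PD), nonparental ditype (NPD) and tetratype (TT) asci. The sketch below assumes that formula; the function name is hypothetical, and SGD's own map distances may have been computed differently.

```python
def perkins_distance_cm(pd, npd, tt):
    """Estimate map distance in centimorgans from 2-point tetrad counts.

    Perkins' formula: cM = 100 * (TT + 6 * NPD) / (2 * (PD + NPD + TT)).
    pd, npd and tt are the numbers of parental ditype, nonparental ditype
    and tetratype tetrads observed.
    """
    total = pd + npd + tt
    if total == 0:
        raise ValueError("at least one tetrad is required")
    return 100.0 * (tt + 6 * npd) / (2 * total)


print(perkins_distance_cm(80, 0, 20))  # 10.0
```

Linkage is suggested when PD tetrads greatly outnumber NPD tetrads; for unlinked markers the two classes are expected in roughly equal numbers.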

Topic

Published biological information is categorized by SGD curators into pre-determined "topics". These topics comprise the Literature Guide (formerly called Gene_Info), a way of grouping the literature into related categories in SGD. Descriptions are available for the current Literature Topics.

Transmembrane Domain

Refers to the domains in amphipathic membrane proteins where the hydrophobic regions traverse the lipid bilayers of the membranes, while the hydrophilic regions extend on either side of the membrane and interact with water.

Transposon

Any of the five classes (Ty1 through Ty5) of mobile genetic elements in yeast that contain long terminal repeats flanking a central epsilon element that encodes two gene products, TyA (structural component) and TyB (reverse transcriptase). Ty elements are retrotransposons that move about the genome via an RNA intermediate.

Tree-View

Tree-View refers to the display of parent-child relationships of GO terms within an ontology.

tRNA gene

A gene encoding one of the approximately 300 transfer RNA molecules containing triplet codon-amino acid adaptor activity that are involved in bonding to and transferring of amino acids to the ribosomes where proteins are translated.

tRNA naming conventions

The tRNAs in SGD have been systematically named. The names are in the format tX(anticodon)Z, where X is the one-letter code for the cognate amino acid and Z is the one-letter code for the chromosome (A is chromosome I, B is chromosome II, C is chromosome III, etc.). For an example, see tD(GUC)D.
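
A name in this format can be taken apart with a regular expression. The sketch below is a minimal illustration covering only the pattern described above: the function name is hypothetical, the anticodon is assumed to be three letters from A, C, G and U, and the chromosome letter is assumed to run from A to P (chromosomes I to XVI). Names carrying anything beyond this pattern are rejected.

```python
import re
import string

# tX(anticodon)Z: amino acid letter, three-base anticodon, chromosome letter.
TRNA_NAME = re.compile(r"^t([A-Z])\(([ACGU]{3})\)([A-P])$")


def parse_trna_name(name):
    """Return (amino_acid, anticodon, chromosome_number) for a tRNA name."""
    m = TRNA_NAME.match(name)
    if m is None:
        raise ValueError(f"not in tX(anticodon)Z format: {name!r}")
    amino_acid, anticodon, chrom_letter = m.groups()
    chromosome = string.ascii_uppercase.index(chrom_letter) + 1  # 'A' -> 1
    return amino_acid, anticodon, chromosome


print(parse_trna_name("tD(GUC)D"))  # ('D', 'GUC', 4)
```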

Two Hybrid

This term is used to identify and describe interaction data displayed at SGD. In this type of experiment, a "bait" protein fused to a DNA-binding domain is tested against a "prey" protein (often a library of proteins) fused to a transcriptional activation domain. An interaction between the "bait" and "prey" is indicated by transcriptional activation of a test gene; often, a prey protein that results in a positive interaction is called a "hit". This term is also used for variations of the Two Hybrid system such as split-ubiquitin.

Two Hybrid Portal PathCalling

PathCalling is a tool offered by Curagen Corporation to identify protein pathways and protein-protein interactions.

Ty naming conventions

In collaboration with the Dan Voytas group and colleagues, the Ty retrotransposons were given systematic names in the format "YX(L/R)(W/C)Ty?-n", where "X" is the letter that stands for the chromosome number, "L" or "R" is used to signify the left or right arm of the chromosome, "W" or "C" is used to show whether the element starts on the Watson or Crick strand, "Ty?" indicates which type of Ty it is, and "n" is a unique number for that type of element on the chromosome. For an example, see YJRWTy1-1. The Ty LTRs are also named. The format is "YX(L/R)(W/C)element-n." The only difference is that the type of Ty LTR ("element") is indicated by a Greek letter. Ty1 and Ty2 LTRs are indicated "delta," Ty3 LTRs are "sigma," Ty4 LTRs are "tau," and Ty5 LTRs are "omega." Two examples can be seen in YFLWdelta2 and YORWtau3.
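
The two name formats can be decoded with a single regular expression, as in the sketch below. The function name and the returned field names are hypothetical. The hyphen before the number is treated as optional because the LTR examples given above (YFLWdelta2, YORWtau3) are written without one; that is an assumption drawn from those examples only.

```python
import re
import string

TY_NAME = re.compile(
    r"^Y([A-P])([LR])([WC])(Ty[1-5]|delta|sigma|tau|omega)-?(\d+)$"
)
# Greek-letter LTR names and the Ty types they belong to (from the entry above).
LTR_TO_TY = {"delta": "Ty1 or Ty2", "sigma": "Ty3", "tau": "Ty4", "omega": "Ty5"}


def parse_ty_name(name):
    """Decode a systematic Ty element or Ty LTR name into its parts."""
    m = TY_NAME.match(name)
    if m is None:
        raise ValueError(f"not a systematic Ty or LTR name: {name!r}")
    chrom, arm, strand, element, number = m.groups()
    return {
        "chromosome": string.ascii_uppercase.index(chrom) + 1,  # 'J' -> 10
        "arm": "left" if arm == "L" else "right",
        "strand": "Watson" if strand == "W" else "Crick",
        "kind": "LTR" if element in LTR_TO_TY else "Ty element",
        "ty_type": LTR_TO_TY.get(element, element),
        "number": int(number),
    }


print(parse_ty_name("YJRWTy1-1")["chromosome"])  # 10
```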

Uncharacterized ORF

An Uncharacterized open reading frame (ORF) is one that is likely to encode an expressed protein, as suggested by the existence of orthologs in one or more other species, but for which there are no specific experimental data demonstrating that a gene product is produced in S. cerevisiae. While most Uncharacterized ORFs have systematic names only (e.g., YKL100C), a few have been given genetic names (e.g., PAU8). Evidence from large-scale analyses that indicates an ORF may be biologically relevant is sometimes but not always enough to upgrade an ORF from "Uncharacterized" to "Verified", depending on the individual case. Also see the description of "Dubious" ORFs.

uORF

Small upstream open reading frame (uORF) that precedes the major open reading frame (ORF). uORFs usually inhibit downstream translation by blocking ribosomal scanning to promote efficient termination. In some cases, uORFs stimulate translation of the major ORF by allowing scanning ribosomal subunits to proceed via leaky scanning and reinitiation to the major ORF.

UTR5_sc

utr5_sc_500, utr5_sc_1000, and utr5_sc_2000 are datasets consisting of the 5' untranslated regions 500, 1000, and 2000 basepairs upstream, respectively, of the start codons of the yeast coding sequences (ORFs) defined by the systematic sequencing effort. These sequences do not include the 5' ATG and correspond to the sense strand (i.e., sequences for "C" ORFs are reverse complemented). Some of these UTRs will be erroneous if the predicted ORF is not expressed or if the true 5' end starts at a different ATG. For ORFs near the chromosome termini, the 5' UTRs may be shorter than the nominal length. For example, there are only 334 bp between the left end of chrI and the start of YAL069W.
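
The rules above (exclude the ATG, reverse complement "C" ORFs, truncate at chromosome ends) can be sketched as follows. This is an illustration of the description rather than the script used to build the datasets. The function names are hypothetical, and atg_position is assumed to be the 1-based chromosomal coordinate of the A of the start codon.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def upstream_region(chromosome, atg_position, strand, length):
    """Return up to `length` bases upstream of a start codon, sense strand.

    chromosome   -- full chromosome sequence (Watson strand, 5' to 3')
    atg_position -- 1-based coordinate of the A of the ATG (assumption)
    strand       -- 'W' for Watson ORFs, 'C' for Crick ORFs
    The result is shorter than `length` near a chromosome end.
    """
    if strand == "W":
        start = max(0, atg_position - 1 - length)
        return chromosome[start:atg_position - 1]
    if strand == "C":
        # For a Crick ORF, upstream sequence lies at higher coordinates.
        region = chromosome[atg_position:atg_position + length]
        return reverse_complement(region)
    raise ValueError("strand must be 'W' or 'C'")


# A toy chromosome whose first ORF starts at coordinate 335, as for YAL069W.
toy = "A" * 334 + "ATG" + "C" * 100
print(len(upstream_region(toy, 335, "W", 500)))  # 334
```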

Verified ORFs

ORFs for which experimental evidence exists that a gene product is produced in S. cerevisiae. Generally these have obvious orthologs in one or more other Saccharomyces species. Most named genes are in this class. Evidence from large-scale analyses that indicates an ORF may be biologically relevant is sometimes but not always enough to upgrade an ORF from "Uncharacterized" to "Verified", depending on the individual case.

Virtual-Library yeast

The Virtual Library document has been replaced by a General Yeast Topics wiki page.

W region

Leftmost segment of homology in the HML and MAT mating loci (not present in HMR).

WashU

Each member of the set of Olson-Riles clones (generated at Washington Univ.) not only has an ATCC number associated with it, but also a WashU number, which is crosslinked with the ATCC number in the SGD database.

Watson Strand ORF

An open reading frame (ORF) encoded on the Watson or top strand of the chromosome, which runs 5' to 3' from the left to right ends of the chromosome.

Wildcard character

SGD uses an asterisk "*" as a wildcard symbol. In a search, the wildcard character shows where any text can be tolerated. For example, searching for the locus "cdc*" will produce all cdc genes. Searching for the Author "Johns*" will produce all authors whose last name begins with those letters.
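
The behaviour described above can be imitated by turning a pattern into a regular expression in which only the asterisk is special. This sketch is not SGD's search code; the function names are hypothetical, and case-insensitive matching is an assumption suggested by the "cdc*" example.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern in which '*' stands for any text (including none)."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE)


def wildcard_search(pattern, names):
    """Return the names that match the wildcard pattern, in input order."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.match(name)]


print(wildcard_search("cdc*", ["CDC28", "ACT1", "CDC42"]))  # ['CDC28', 'CDC42']
```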

Word length

The Word Length (W) is a BLAST parameter that determines the minimum length of a match. BLAST first searches for a perfect match of at least the word length. Once a match is found then it tries to extend the HSP.
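
The two steps described above, finding exact word matches and then extending them, can be sketched as follows. This is a simplified illustration, not BLAST's implementation: the function names are hypothetical, and the extension here simply continues while residues are identical, whereas BLAST extends using alignment scores.

```python
def find_seeds(query, subject, word_length):
    """Return (query_pos, subject_pos) pairs where a word of length W matches exactly."""
    index = {}
    for j in range(len(subject) - word_length + 1):
        index.setdefault(subject[j:j + word_length], []).append(j)
    seeds = []
    for i in range(len(query) - word_length + 1):
        for j in index.get(query[i:i + word_length], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_length):
    """Extend an exact seed in both directions while residues stay identical."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + word_length, j + word_length
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return query[start_i:end_i]


query, subject = "TTACGTACGG", "CCACGTACAA"
print(find_seeds(query, subject, 4))         # [(2, 2), (3, 3), (4, 4)]
print(extend_seed(query, subject, 2, 2, 4))  # ACGTAC
print(find_seeds(query, subject, 7))         # []
```

The last call shows the effect of the parameter: with a word length of 7 the shared 6-base stretch produces no seed, so no match would be reported.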

X region

One of two segments of homology found at all three mating loci (HML, MAT, and HMR).

X element combinatorial repeats

Formerly known as subtelomeric repeats (STRs), X element combinatorial repeats are located between the X element core sequence and the telomere or adjacent Y' element, and are usually present as a combination of one or more of several types of smaller elements (designated A, B, C, or D). X element combinatorial repeats contain Tbf1p binding sites, and possible functions include a role in telomerase-independent telomere maintenance via recombination or as a barrier against transcriptional silencing.

X element core sequence

The only region shared by all chromosome ends, the X element core sequence is a small conserved element (~475 bp) that contains an ARS sequence and in most cases an Abf1p binding site. Between these is a GC-rich region nearly identical to the meiosis-specific regulatory sequence URS1. Possible functions include roles in chromosomal segregation, maintenance of chromosome stability, recombinational sequestering, or as a barrier to transcriptional silencing.

Y region

Segment of nonhomology between the a and alpha mating alleles, found at all three mating loci (HML, MAT, and HMR). It has two forms (Ya and Yalpha).

Y' element

A repetitive sequence found in many but not all subtelomeric regions, the Y' element is located next to the telomeric repeats, or adjacent X element combinatorial repeats, either as a single copy or tandem repeat of two to four copies. Two types of Y' elements are known, Y'-L and Y'-S, and any particular array will consist of only one type, not a combination of both. Y' elements contain helicase-encoding ORFs which are expressed only during meiosis and in telomerase-deficient cells. Possible functions include rescue of telomeres when the telomeric repeats are no longer present and a role in telomere maintenance during meiosis.

Yeast GenBank

A collection of all GenBank sequences that were derived from Saccharomyces cerevisiae.

Yeast Swiss-Prot

The collection of Swiss-Prot protein sequences that are derived from Saccharomyces cerevisiae.

YPD

The Yeast Proteome Database maintained by BIOBASE. YPD contains physical, functional and some genetic information about Saccharomyces cerevisiae. YPD was originally developed and maintained by Proteome Inc. Access now requires a paid subscription.

Z1 region

One of two segments of homology found at all three mating loci (HML, MAT, and HMR).

Z2 region

Rightmost segment of homology in the HML and MAT mating loci (not present in HMR).

Zymogen

Some proteins are synthesized as zymogens, which are enzymatically inactive precursors of proteolytic enzymes. Zymogens usually become activated by posttranslational modifications, such as cleavage at a particular peptide sequence.