SGD Help: Nomenclature Conventions

This page provides information on genetic and systematic nomenclature for S. cerevisiae genes and chromosomal features.

Gene Name Assignment

Gene names, also referred to as genetic names (for example, COX2 or CDC28), are conferred upon genes by researchers on the basis of genetic, biochemical, or molecular characterization. Most genes that have gene names are ORFs, but tRNAs and other non-protein-coding RNAs have also received gene names. In addition, there are named genes in SGD that have not yet been mapped to a physical location on the chromosome. Gene names are optional, and chromosomal features that are completely uncharacterized generally do not have gene names, only systematic names (see below).

The official name of an S. cerevisiae gene is referred to as the Standard Name on an SGD locus page, and generally becomes the standard name based on its publication in a peer-reviewed paper describing characterization of that gene. A gene name may also be reserved for a locus when publication of the name is upcoming, and is called a Reserved Name. A Reserved Name, if it remains unique and is the first published name, becomes a Standard Name upon its publication. In cases where it is not clear what name should be the standard name, the Standard Name is determined by an amalgam of 1) consensus of the research community, 2) literature usage, 3) clarity relative to function, and 4) priority in the literature. Any alternative Gene Name is referred to as an Alias.

When naming a gene, the full text of the Gene Naming Guidelines for Saccharomyces cerevisiae should be consulted. An explanation of the conventions for Saccharomyces cerevisiae nomenclature was published in the Trends in Genetics gene nomenclature guide, and the conventions are also detailed below.

Gene Name Format

The accepted format for gene names in S. cerevisiae consists of three uppercase letters followed by a number. Generally, the letters signify a phrase (referred to as the "Name Description" in SGD) that provides information about a function, mutant phenotype, or process related to that gene, for example "ADE" for "ADEnine biosynthesis" or "CDC" for "Cell Division Cycle". Gene names for many types of chromosomal features follow this basic format regardless of the type of feature named, whether an ORF, a tRNA, another type of non-coding RNA, an ARS, or a genetic locus.

Some S. cerevisiae gene names that pre-date the current nomenclature standards do not conform to this format: for example, RPL1A and RPL1B, or OM45. Although non-standard historical names such as these are maintained in SGD, any new names for yeast genes must conform to the standard format.

Systematic Name Assignment

The Systematic Name is the name generated by the systematic sequencing project, or conferred later according to the appropriate guidelines for systematic nomenclature for that type of feature or gene. Every gene or feature annotated on the genomic sequence receives a unique systematic name, whether or not it has a genetic name.

There are guidelines for designating a Systematic Name for a new feature, i.e. one not originally named by the systematic sequencing project. The specifics (detailed below) depend on the type of feature (ORF, tRNA, etc.). If you have a newly discovered feature, please contact SGD in order to have the proper systematic name assigned.

Systematic Nomenclature Conventions and Formats

Open Reading Frames

Nuclear ORFs

Systematic names for nuclear-encoded ORFs begin with the letter 'Y' (for 'Yeast'); the second letter denotes the chromosome number ('A' is chr I, 'B' is chr II, etc.); the third letter is either 'L' or 'R' for left or right chromosome arm; next is a three-digit number indicating the order of the ORFs on that arm of the chromosome, starting from the centromere and irrespective of strand; finally, there is an additional letter indicating the strand, either 'W' for Watson (the strand with its 5' end at the left telomere) or 'C' for Crick (the complement strand, with its 5' end at the right telomere).
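The format described above can be captured in a short parser. The following Python sketch is illustrative only: the function name parse_nuclear_orf, the regular expression, and the dictionary keys are assumptions made for this example and are not part of any SGD tool. It also accepts the optional -[letter] suffix seen in names such as YAL034W-A, on the assumption that the suffix is a single uppercase letter.

```python
import re

# Assumed pattern: Y + chromosome letter (A-P) + arm (L/R) + three digits
# + strand (W/C), optionally followed by "-" and one uppercase letter.
NUCLEAR_ORF = re.compile(r"Y([A-P])([LR])(\d{3})([WC])(?:-([A-Z]))?")


def parse_nuclear_orf(name):
    """Return the parts of a nuclear ORF systematic name, or None if no match."""
    match = NUCLEAR_ORF.fullmatch(name)
    if match is None:
        return None
    chrom, arm, number, strand, suffix = match.groups()
    return {
        "chromosome": ord(chrom) - ord("A") + 1,  # 'A' -> 1 (chr I)
        "arm": "left" if arm == "L" else "right",
        "order_from_centromere": int(number),
        "strand": "Watson" if strand == "W" else "Crick",
        "suffix": suffix,  # None when there is no -[letter] suffix
    }


print(parse_nuclear_orf("YGR116W"))
print(parse_nuclear_orf("YAL034W-A"))
```

Names that do not follow the nuclear ORF pattern, such as the mitochondrial name Q0010, yield None.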

Examples:

On an ongoing basis, any nuclear ORFs that are newly annotated receive a systematic name based on that of the centromere proximal ORF plus an additional letter to indicate the order between previously assigned ORFs. When multiple new open reading frames are identified between previously assigned ORFs, the letter designation assigned to each is based on the order in which they were discovered, and is independent of strand. The following steps are used to determine the correct systematic name.

1. Researchers contact SGD with the coordinates of a new ORF.

2. The base name of the new ORF is the same as the closest centromere proximal ORF. The correct base names for the example new ORFs are indicated in green below. Note that the closest centromere proximal ORF does not have to be on the same strand, although it can be. The new ORF may overlap an existing ORF. When this occurs, if any portion of the existing overlapping ORF is closer to the centromere than the new ORF, then the existing overlapping ORF is "centromere proximal" relative to the new ORF.

3. The W/C suffix indicates the strandedness of the new ORF. The W/C suffix of the new ORF is independent of the strandedness of the centromere proximal ORF. The correct suffixes for the example new ORFs are indicated in green below.

4. An additional suffix, -[letter], is appended to the name of the new ORF. This distinguishes the new ORF from ORFs named in the original annotation. The letters are assigned in alphabetical order, per base name, in order of discovery (see additional examples below). The correct suffixes for the example new ORFs are indicated in green. If several neighboring new ORFs are added simultaneously, then the -[letter] suffix is assigned in alphabetical order, from the centromere to the telomere. However, since neighboring new ORFs are not necessarily discovered simultaneously, the -[letter] suffix does not always indicate relative position.

Examples:

In the rare event that a new ORF is discovered at the extreme end of a chromosome, the new ORF is given the next number in the sequence and does not require a -[letter] suffix. This is only applicable in cases where there are no existing ORFs between the new ORF and the end of the chromosome.
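The naming steps above can be sketched in code. This is a simplified illustration rather than SGD's actual procedure: the function name name_new_orf and its input format are hypothetical, each ORF is reduced to the distance of its centromere-proximal edge from the centromere, and the "closest centromere proximal ORF" is assumed to be the one whose edge lies nearest to the new ORF on the centromere side. The extreme-end case described in the preceding paragraph is not handled.

```python
import string


def name_new_orf(existing, new_distance, new_strand):
    """Suggest a systematic name for a new ORF on one chromosome arm.

    existing: dict mapping ORF name -> distance (bp) from the centromere to
              the ORF's centromere-proximal edge (hypothetical input format).
    new_distance: the same measure for the new ORF.
    new_strand: 'W' or 'C'.
    """
    # Step 2: the closest existing ORF that begins nearer the centromere.
    proximal = {n: d for n, d in existing.items() if d < new_distance}
    if not proximal:
        raise ValueError("no centromere-proximal ORF on this arm")
    closest = max(proximal, key=proximal.get)
    base = closest[:6]  # e.g. 'YAL034' from 'YAL034C' or 'YAL034W-A'

    # Step 4: next unused letter for this base name, in order of discovery.
    used = {n[-1] for n in existing if n.startswith(base) and "-" in n}
    letter = next(c for c in string.ascii_uppercase if c not in used)

    # Step 3: the strand suffix comes from the new ORF itself.
    return f"{base}{new_strand}-{letter}"


orfs = {"YAL034C": 1000, "YAL035W": 3000}  # made-up distances
first = name_new_orf(orfs, 2000, "W")
orfs[first] = 2000
second = name_new_orf(orfs, 2500, "C")
print(first, second)
```

Because letters are tracked per base name and not per strand, the second new ORF receives the letter B even though it lies on the opposite strand from the first.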

Mitochondrial ORFs

Systematic names for mitochondrially encoded ORFs start with the letter 'Q' to designate the mitochondrial chromosome; the rest consists of a four-digit number. Examples are Q0010 and Q0032.

2-micron Plasmid ORFs

Systematic names for ORFs encoded in the 2-micron plasmid start with the letter 'R' to designate the 2-micron plasmid; the rest consists of a four-digit number followed by the letter 'W' or 'C' for Watson or Crick. Examples are R0010W and R0020C.
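The mitochondrial and 2-micron formats are simple enough to check with two regular expressions. The sketch below is a minimal illustration; the function name classify_organellar_orf and the returned labels are assumptions made for this example.

```python
import re

MITO_ORF = re.compile(r"Q\d{4}")         # 'Q' + four digits
PLASMID_ORF = re.compile(r"R\d{4}[WC]")  # 'R' + four digits + strand


def classify_organellar_orf(name):
    """Label a name as mitochondrial or 2-micron, or return None."""
    if MITO_ORF.fullmatch(name):
        return "mitochondrial ORF"
    if PLASMID_ORF.fullmatch(name):
        return "2-micron plasmid ORF"
    return None


for name in ("Q0010", "R0020C", "YAL001C"):
    print(name, classify_organellar_orf(name))
```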

RNA-Coding Genes

ncRNA Genes

All annotated S. cerevisiae ncRNAs are designated by a symbol consisting of four uppercase letters, a four-digit number, and another letter, as follows: Y for “Yeast”, NC for “noncoding”, A-Q for the chromosome on which the ncRNA gene resides (where “A” is chromosome I, “B” is chromosome II, etc., up to “P” for chromosome XVI, and lastly “Q” for the mitochondrial chromosome), a four-digit number corresponding to the sequential order of the ncRNA gene on the chromosome (starting from the left telomere and counting toward the right telomere), and W or C indicating whether the ncRNA gene is encoded on the “Watson” or “Crick” strand (where “Watson” runs 5′ to 3′ from left telomere to right telomere, and “Crick” runs 3′ to 5′).

Example:  

When evidence is published pointing to new ncRNA genes, they will be added to the annotation using the next sequential number available for the specific chromosome on which the ncRNA gene resides. In cases in which more than one ncRNA gene is added to any particular chromosome during the same annotation update (i.e., same genome revision), they will be named using the next sequential number starting with the leftmost ncRNA gene and proceeding to the right of the chromosome.
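The rule for naming ncRNA genes added in a single annotation update can be sketched as follows. This is an illustration under stated assumptions: the function name name_new_ncrnas and its input format are hypothetical, the "next sequential number available" is taken to be one more than the highest number already used on that chromosome, and the names in the usage example are constructed from the rule, not taken from the SGD annotation.

```python
def name_new_ncrnas(chrom_letter, existing_names, new_genes):
    """Name ncRNA genes added to one chromosome in the same update.

    chrom_letter: 'A'-'P' for chromosomes I-XVI, 'Q' for mitochondrial.
    existing_names: ncRNA systematic names already in use.
    new_genes: list of (start_coordinate, strand) tuples, strand 'W' or 'C'.
    """
    prefix = "YNC" + chrom_letter
    # Highest number already used on this chromosome (0 if none).
    last = max(
        (int(n[4:8]) for n in existing_names if n.startswith(prefix)),
        default=0,
    )
    names = []
    # Leftmost gene first, proceeding to the right of the chromosome.
    for offset, (_start, strand) in enumerate(sorted(new_genes), start=1):
        names.append(f"{prefix}{last + offset:04d}{strand}")
    return names


existing = ["YNCB0001W", "YNCB0002C"]  # constructed example names
print(name_new_ncrnas("B", existing, [(5000, "C"), (1200, "W")]))
```

Note that, under this rule, numbers assigned after the original annotation reflect the order of addition and need not match left-to-right position across the whole chromosome.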

tRNA Genes

Systematic names of nuclear-encoded tRNA genes begin with a lowercase 't'; the second letter corresponds to the single letter code for the appropriate amino acid, e.g., A = alanine, C = cysteine, etc.; next the sequence of the anticodon of the tRNA is given in the 5' -> 3' direction within parentheses, e.g., (AGC) or (GUC); finally, there is an indication of which chromosome the tRNA gene resides on using the letters 'A' through 'P' to designate nuclear chromosomes (in the same way as for nuclear-encoded ORFs). If a given nuclear chromosome contains more than one copy of a tRNA gene, individual copies of the same tRNA family (those of identical sequence, including the anticodon sequence) are distinguished from each other by the addition of a single number, starting with '1', after the letter designating the chromosome.

Examples:

Mitochondrially-encoded tRNAs are named the same way as nuclear-encoded tRNAs, using the letter 'Q' to designate the mitochondrial chromosome, except that the presence of a number indicates that two or more tRNAs encode the same amino acid, though they do not necessarily contain the same anticodon sequence.
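A tRNA systematic name can be split into its components with a single pattern. The sketch below is illustrative: the function name parse_trna and the dictionary keys are assumptions, the anticodon is assumed to use only the letters A, C, G, and U, and the names in the usage example are constructed from the rules above.

```python
import re

# 't' + amino acid letter + (anticodon) + chromosome letter + optional number
TRNA = re.compile(r"t([A-Z])\(([ACGU]{3})\)([A-Q])(\d+)?")


def parse_trna(name):
    """Return the parts of a tRNA systematic name, or None if no match."""
    match = TRNA.fullmatch(name)
    if match is None:
        return None
    amino_acid, anticodon, chrom, number = match.groups()
    if chrom == "Q":
        location = "mitochondrial"
    else:
        location = f"chromosome {ord(chrom) - ord('A') + 1}"
    return {
        "amino_acid": amino_acid,
        "anticodon": anticodon,  # written 5' -> 3'
        "location": location,
        "number": int(number) if number else None,
    }


print(parse_trna("tA(AGC)D"))
print(parse_trna("tM(CAU)Q1"))
```

The parser reports the trailing number without interpreting it, because its meaning differs between nuclear tRNAs (copies of an identical gene on one chromosome) and mitochondrial tRNAs (tRNAs for the same amino acid).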

Examples:

snRNA and snoRNA Genes

The systematic name of a small nuclear RNA (snRNA) or small nucleolar RNA (snoRNA) starts with the lowercase letters 'sn'; next is a capital 'R'; this is followed by a number. The number is unique, but does not convey any positional information. Frequently, the Gene Name of an snRNA or snoRNA is the same as the Systematic Name, but in all caps, e.g. 'SNR'. Different copies of duplicated genes may be indicated by adding a letter, e.g. 'A' or 'B', to the end of the name.

Examples:

Note: SNR7 is an exception in that its transcript is alternatively processed, yielding two products: SNR7-S (short form) and SNR7-L (long form).

rRNA Genes

The systematic names and gene names of loci representing the nuclear encoded rRNA genes are identical to each other. The "loci" representing the rDNA repeats, the rRNA transcripts, and the mature rRNAs are named with the three letter acronym 'RDN' for Ribosomal DNA. While S. cerevisiae contains multiple repeats of the ribosomal DNA (rDNA), only two rDNA repeats were sequenced as part of the systematic sequencing project.

Examples:

A more complete explanation of the representation and naming of the rDNA repeats and the rRNAs within them is available on the RDN1 locus page, which represents the entire rDNA region on Chromosome XII.

Other Features

Autonomously Replicating Sequences

Autonomously Replicating Sequences (ARS) are named with the three letters ARS followed by a number. ARS features added after October 2000 are named systematically using the three letters ARS followed by one or two digits to represent the chromosome, e.g. chromosome I = 1, chromosome II = 2, chromosome X = 10. This is followed by an additional whole number to designate the particular ARS on that chromosome in the order named, starting with the digits '01'. Note that the number merely indicates the order in which the ARS elements were reported and named, and does not necessarily denote any location information relative to other ARS features. Note also that decimal points are NOT used. Some "historical" ARS features were given Gene Names prior to the establishment of this systematic naming system, e.g. ARS1, ARS2, ARS120. In these cases, an ARS-based Gene Name does not indicate the chromosomal location. 
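The systematic ARS format can be illustrated with a small name builder. This is a sketch under stated assumptions: the function name ars_name is hypothetical, and the order number is assumed to be zero-padded to at least two digits, following the statement that numbering starts with the digits '01'.

```python
def ars_name(chromosome, order):
    """Build a systematic ARS name from a chromosome number and naming order."""
    if not 1 <= chromosome <= 16:
        raise ValueError("chromosome must be between 1 and 16")
    if order < 1:
        raise ValueError("order starts at 1")
    # Chromosome digits, then the order number; no decimal point is used.
    return f"ARS{chromosome}{order:02d}"


print(ars_name(1, 1))
print(ars_name(10, 1))
```

A builder is shown instead of a parser because the chromosome and order digits are simply concatenated, so a name may not be reliably split back into its parts, and historical names such as ARS120 carry no chromosome information at all.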

Examples: 

Centromeres

Centromeres are named with the three letters 'CEN' followed by one or two digits to represent the chromosome. 

Examples: 

Non-reference genes

Non-reference genes are designated by a symbol consisting of three uppercase letters and a four-digit number, as follows: Y for “Yeast”, SC for “Saccharomyces cerevisiae”, and a four-digit number corresponding to the sequential order in which the gene was added to SGD. 

Examples: 

Currently, SGD has 55 such genes. As evidence is published pointing to other S. cerevisiae genes not present in the S288C reference, they will be added to the annotation using the next sequential number available.

Ty Elements

The systematic name of a full length Ty element starts with a 'Y'; the second letter corresponds to the chromosome number (given in Roman numerals, e.g., chr I is 'A', chr VIII is 'H'); the third letter is either 'L' or 'R' for left or right of the centromere; the fourth letter is either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere); next are the letters 'Ty' followed by a number, 1-5, to indicate the type of Ty element. The first Ty element of a given type is indicated with -1; additional full length Ty elements of the same type on the same chromosome are given a number incremented by one from the previous one.

Examples:

Ty Long Terminal Repeat Elements

The systematic name of a Ty Long Terminal Repeat (LTR) element starts with a 'Y'; the second letter corresponds to the chromosome number (given in Roman numerals, e.g., chr I is 'A', chr VIII is 'H'); the third letter is either 'L' or 'R' for left or right of the centromere; the fourth letter is either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere); next is a word for a Greek letter indicating the type of LTR element, e.g. 'delta', 'sigma', 'tau', 'omega'. The first Ty LTR element of a given type is given the number '1'; additional Ty LTR elements of the same type on the same chromosome are given a number incremented by one from the previous one.

Examples:

Note: there are four systematic names (YCLWdelta2a, YCLWdelta2b, YDRCdelta6a, and YDRCdelta6b) that do not conform to the nomenclature rules. Please contact SGD if you need to use this nomenclature.
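The Ty element and Ty LTR formats share a common prefix and can be checked with two related patterns. The sketch below is illustrative: the function name parse_ty_feature and the dictionary keys are assumptions, the list of LTR type words is limited to the four given above and may not be exhaustive, and the names in the usage example are constructed from the rules.

```python
import re

PREFIX = r"Y([A-P])([LR])([WC])"  # 'Y' + chromosome + arm + strand
TY = re.compile(PREFIX + r"Ty([1-5])-(\d+)")
LTR = re.compile(PREFIX + r"(delta|sigma|tau|omega)(\d+)")


def parse_ty_feature(name):
    """Return the parts of a Ty or Ty LTR systematic name, or None."""
    for kind, pattern in (("Ty element", TY), ("Ty LTR", LTR)):
        match = pattern.fullmatch(name)
        if match:
            chrom, arm, strand, ty_type, index = match.groups()
            return {
                "kind": kind,
                "chromosome": ord(chrom) - ord("A") + 1,
                "arm": arm,
                "strand": strand,
                "type": ty_type,
                "index": int(index),
            }
    return None


print(parse_ty_feature("YARCTy1-1"))
print(parse_ty_feature("YALWdelta1"))
print(parse_ty_feature("YCLWdelta2a"))
```

The last call returns None, which is consistent with the note above that YCLWdelta2a does not conform to the nomenclature rules.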

Telomeric Elements

SGD currently annotates several different types of features at the ends of chromosomes, listed below. When there are multiple examples of a type of telomeric element at a single chromosome end (e.g. more than one Telomeric Repeat), the elements are numbered after the suffix, with number 1 being the closest to the end of the chromosome.

Telomeric Region: "TEL" followed by a two digit number indicating the chromosome number, then "L" or "R" to indicate the left or right arm of the chromosome. 

Example:

X element Combinatorial Repeats: The same base name used for the Telomeric Region feature, appended with a suffix of "-XR". 

Example: 

X element Core sequence: The same base name used for the Telomeric Region feature, appended with a suffix of "-XC". 

Example: 

Y' element: The same base name used for the Telomeric Region feature, appended with a suffix of "-YP". 

Example: 

Telomeric Repeat: The same base name used for the Telomeric Region feature, appended with a suffix of "-TR". 

Example: 
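The telomeric naming rules can be summarized in a small name builder. This is a sketch, not an SGD utility: the function name telomeric_name and the element labels used as dictionary keys are hypothetical, and the number for repeated elements is assumed to follow the suffix directly, with no separator.

```python
ELEMENT_SUFFIXES = {
    "X element combinatorial repeats": "XR",
    "X element core sequence": "XC",
    "Y' element": "YP",
    "telomeric repeat": "TR",
}


def telomeric_name(chromosome, arm, element=None, index=None):
    """Build the name of a telomeric region or of an element within it."""
    if not 1 <= chromosome <= 16:
        raise ValueError("chromosome must be between 1 and 16")
    if arm not in ("L", "R"):
        raise ValueError("arm must be 'L' or 'R'")
    name = f"TEL{chromosome:02d}{arm}"  # two-digit chromosome number
    if element is None:
        return name
    name += "-" + ELEMENT_SUFFIXES[element]
    if index is not None:
        # 1 is the element closest to the end of the chromosome.
        name += str(index)
    return name


print(telomeric_name(1, "L"))
print(telomeric_name(4, "R", "Y' element"))
print(telomeric_name(12, "R", "Y' element", 2))
```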

Correlation between Gene Names and Systematic Names

While all ORFs identified in the genome sequence have a Systematic Name, e.g. YAL001C, YGR116W, YAL034W-A, or Q0010, many ORFs have not been given a Gene Name, e.g. a name such as COX2 or CDC28. In addition, Gene Names have been conferred on non-ORF features such as tRNAs, other non-coding RNAs, and on genetic loci which have not yet been mapped to a specific position on a chromosome. In this last case, because the chromosomal location is not known, there will not be a systematic name associated with the Gene Name.

An ORF, or other chromosomal feature, with a systematic name may have been associated with more than one common usage name, or Gene Name. Only one of these will be designated as the Standard Name; any other associated name is referred to as an Alias.