SGD Help: Nomenclature Conventions

The nomenclature conventions for Saccharomyces cerevisiae were published in Trends in Genetics in 1998 (download pdf of gene nomenclature guide). These conventions are also detailed below.

Gene Name Assignment
Gene Name Format
Protein Name Format
Systematic Name Assignment
Systematic Nomenclature Conventions and Formats
1. Open Reading Frames
  1. Nuclear ORFs
  2. Mitochondrial ORFs
  3. 2-micron Plasmid ORFs
2. RNA-Coding Genes
  1. ncRNA Genes
  2. tRNA Genes
  3. snRNA and snoRNA Genes
  4. rRNA Genes
3. Other Features
  1. Autonomously Replicating Sequences
  2. Centromeres
  3. Non-reference genes
  4. Ty Elements
  5. Ty Long Terminal Repeat Elements
  6. Telomeric Elements
Correlation between Gene Names and Systematic Names

Gene Name Assignment

Gene names, also referred to as genetic names (for example, COX2 or CDC28), are conferred upon genes by researchers on the basis of genetic, biochemical, or molecular characterization. Most genes having Gene Names are ORFs, but tRNAs and other non-protein coding RNAs have also received Gene Names. In addition, there are named genes in SGD that have not yet been mapped to a physical location on the chromosome. Gene names are optional, and chromosomal features that are completely uncharacterized generally do not have gene names, only systematic names (see below).

The official name of an S. cerevisiae gene is referred to as the Standard Name on an SGD locus page, and generally becomes the standard name based on its publication in a peer-reviewed paper describing characterization of that gene. A gene name may also be reserved for a locus when publication of the name is upcoming, and is called a Reserved Name. A Reserved Name, if it remains unique and is the first published name, becomes a Standard Name upon its publication. In cases where it is not clear what name should be the standard name, the Standard Name is determined by an amalgam of 1) consensus of the research community, 2) literature usage, 3) clarity relative to function, and 4) priority in the literature. Any alternative Gene Name is referred to as an Alias.

When naming a gene, the full text of the Gene Naming Guidelines for Saccharomyces cerevisiae should be consulted.

Gene Name Format

The accepted format for gene names in S. cerevisiae consists of three uppercase letters followed by a number. Generally, the letters signify a phrase (referred to as the "Name Description" in SGD) that provides information about a function, mutant phenotype, or process related to that gene, for example "ADE" for "ADEnine biosynthesis" or "CDC" for "Cell Division Cycle". Gene names for many types of chromosomal features follow this basic format regardless of the type of feature named, whether an ORF, a tRNA, another type of non-coding RNA, an ARS, or a genetic locus.

Some S. cerevisiae gene names that pre-date the current nomenclature standards do not conform to this format: for example, RPL1A and RPL1B, or OM45. Although non-standard historical names such as these are maintained in SGD, any new names for yeast genes must conform to the standard format.
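
The format rule above can be checked mechanically. The following is a minimal sketch in Python, assuming the rule is exactly "three uppercase letters followed by one or more digits"; the function name is_standard_gene_name is hypothetical and is not part of any SGD tool.

```python
import re

# Assumed reading of the rule: three uppercase letters, then a number.
GENE_NAME_RE = re.compile(r"[A-Z]{3}[0-9]+")

def is_standard_gene_name(name: str) -> bool:
    """Return True if `name` fits the three-letters-plus-number format."""
    return GENE_NAME_RE.fullmatch(name) is not None

# COX2 and CDC28 fit the format; the historical names RPL1A and OM45 do not.
for name in ["COX2", "CDC28", "RPL1A", "OM45"]:
    print(name, is_standard_gene_name(name))
```

A check like this only tests the shape of a name. It says nothing about whether the name is unique, reserved, or already in use.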

Protein Name Format

Proteins are referred to by the relevant gene symbol, non-italic, initial letter uppercase and with the suffix ‘p’ (for example, Ade5p). The suffix can be omitted if the word 'protein' is appended (for example, the Ade5 protein).
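
As an illustration of the protein naming rule, the sketch below derives a protein name from a gene symbol. It covers only the capitalization and the 'p' suffix (italics are a typesetting matter), and the function name protein_name is hypothetical.

```python
def protein_name(gene_name: str, suffix: bool = True) -> str:
    """Initial letter uppercase, remaining characters lowercase, optional 'p'."""
    base = gene_name.capitalize()
    return base + "p" if suffix else base

print(protein_name("ADE5"))                              # Ade5p
print("the", protein_name("ADE5", suffix=False), "protein")
```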

Systematic Name Assignment

The Systematic Name is the name generated by the systematic sequencing project, or conferred later according to the appropriate guidelines for systematic nomenclature for that type of feature or gene. Every gene or feature annotated on the genomic sequence receives a unique systematic name, whether or not it has a genetic name.

There are guidelines for designating a Systematic Name for a new feature, i.e. one not originally named by the systematic sequencing project, depending on the feature type. The specifics (detailed below) depend on the type of feature, i.e. ORF, tRNA, etc. If you have a newly discovered feature, please contact SGD in order to have the proper systematic name assigned.

Systematic Nomenclature Conventions and Formats

Open Reading Frames

Nuclear ORFs

Systematic names for nuclear-encoded ORFs begin with the letter 'Y' (for 'Yeast'); the second letter denotes the chromosome number ('A' is chr I, 'B' is chr II, etc.); the third letter is either 'L' or 'R' for left or right chromosome arm; next is a three digit number indicating the order of the ORFs on that arm of a chromosome starting from the centromere, irrespective of strand; finally, there is an additional letter indicating the strand, either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere).

Examples:

YAL001C first ORF to the left of the centromere on chromosome I (A is the 1st letter of the English alphabet), on the complement or Crick strand
YGR116W 116th ORF right of the centromere on chromosome VII (G is the 7th letter of the English alphabet), on the Watson strand
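
The fields of a nuclear ORF name can be pulled apart with a regular expression. This is a minimal sketch, assuming chromosome letters 'A' through 'P' for the sixteen nuclear chromosomes; NuclearOrf and parse_nuclear_orf are hypothetical names, and names carrying an extra -[letter] suffix are deliberately not handled here.

```python
import re
from typing import NamedTuple

class NuclearOrf(NamedTuple):
    chromosome: int  # 1 to 16, decoded from the letters 'A' to 'P'
    arm: str         # 'L' or 'R'
    index: int       # order on the arm, counted from the centromere
    strand: str      # 'W' (Watson) or 'C' (Crick)

ORF_RE = re.compile(r"Y([A-P])([LR])([0-9]{3})([WC])")

def parse_nuclear_orf(name: str) -> NuclearOrf:
    m = ORF_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a plain nuclear ORF name: {name!r}")
    chrom, arm, index, strand = m.groups()
    return NuclearOrf(ord(chrom) - ord("A") + 1, arm, int(index), strand)

print(parse_nuclear_orf("YAL001C"))
print(parse_nuclear_orf("YGR116W"))
```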

On an ongoing basis, any nuclear ORFs that are newly annotated receive a systematic name based on that of the centromere proximal ORF plus an additional letter to indicate the order between previously assigned ORFs. When multiple new open reading frames are identified between previously assigned ORFs, the letter designation assigned to each is based on the order in which they were discovered, and is independent of strand. The following steps are used to determine the correct systematic name.

1. Researchers contact SGD with the coordinates of a new ORF.

2. The base name of the new ORF is the same as the closest centromere proximal ORF. The correct base names for the example new ORFs are indicated in green below. Note that the closest centromere proximal ORF does not have to be on the same strand, although it can be. The new ORF may overlap an existing ORF. When this occurs, if any portion of the existing overlapping ORF is closer to the centromere than the new ORF, then the existing overlapping ORF is "centromere proximal" relative to the new ORF.

3. The W/C suffix indicates the strandedness of the new ORF. The W/C suffix of the new ORF is independent of the strandedness of the centromere proximal ORF. The correct suffixes for the example new ORFs are indicated in green below.

4. An additional suffix, -[letter], is appended to the name of the new ORF. This distinguishes the new ORF from ORFs named in the original annotation. The letters are assigned in alphabetical order, per base name, in order of discovery (see additional examples below). The correct suffixes for the example new ORFs are indicated in green. If several neighboring new ORFs are added simultaneously, then the -[letter] suffix is assigned in alphabetical order, from the centromere to the telomere. However, since neighboring new ORFs are not necessarily discovered simultaneously, the -[letter] suffix does not always indicate relative position.

Examples:

YAL034W-A a new ORF on the Watson strand of the left arm of chromosome I, farther from the centromere than YAL034C
YHR214C-E a new ORF on the Crick strand of the right arm of chromosome VIII, farther from the centromere than YHR214W
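
The suffix rule in the steps above can be sketched as a small function. This is only an illustration, not SGD's actual procedure: it assumes the caller has already determined the base name (the closest centromere proximal ORF, without its strand letter) and the strand of the new ORF, and it picks the letter after the highest one already used for that base name. The name next_suffixed_name and the list of existing names are hypothetical.

```python
import re
from typing import Iterable

# Matches names such as YAL034W-A: base name, strand letter, '-', suffix letter.
SUFFIXED_ORF_RE = re.compile(r"(Y[A-P][LR][0-9]{3})[WC]-([A-Z])")

def next_suffixed_name(base: str, strand: str, existing: Iterable[str]) -> str:
    """Build the name for a new ORF that shares `base` with existing ORFs."""
    used = [m.group(2) for m in map(SUFFIXED_ORF_RE.fullmatch, existing)
            if m is not None and m.group(1) == base]
    if not used:
        letter = "A"
    elif max(used) == "Z":
        raise ValueError(f"no suffix letters left for {base}")
    else:
        letter = chr(ord(max(used)) + 1)
    return f"{base}{strand}-{letter}"

# Hypothetical annotation state: only YAL034C exists for this base name.
print(next_suffixed_name("YAL034", "W", ["YAL034C"]))  # YAL034W-A
```

Letters are counted per base name regardless of strand, which follows the statement that the letter designation is independent of strand.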

In the rare event that a new ORF is discovered at the extreme end of a chromosome, the new ORF is given the next number in the sequence and does not require a -[letter] suffix. This is only applicable in cases where there are no existing ORFs between the new ORF and the end of the chromosome.

Mitochondrial ORFs

Systematic names for mitochondrially encoded ORFs start with the letter 'Q' to designate the mitochondrial chromosome; the rest consists of a four digit number. Examples are Q0010 and Q0032.

2-micron Plasmid ORFs

Systematic names for ORFs encoded in the 2-micron plasmid start with the letter 'R' to designate the 2-micron plasmid; the rest consists of a four digit number followed by the letter 'W' or 'C' for Watson and Crick. Examples are R0010W and R0020C.
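
The mitochondrial and 2-micron plasmid formats are simple enough to tell apart by pattern alone. The sketch below assumes exactly the formats described in this section and the one before it; classify_non_nuclear_orf is a hypothetical helper name.

```python
import re
from typing import Optional

MITO_ORF_RE = re.compile(r"Q[0-9]{4}")          # e.g. Q0010
PLASMID_ORF_RE = re.compile(r"R[0-9]{4}[WC]")   # e.g. R0010W

def classify_non_nuclear_orf(name: str) -> Optional[str]:
    """Return 'mitochondrial', '2-micron', or None if neither pattern fits."""
    if MITO_ORF_RE.fullmatch(name):
        return "mitochondrial"
    if PLASMID_ORF_RE.fullmatch(name):
        return "2-micron"
    return None

for name in ["Q0010", "Q0032", "R0010W", "R0020C", "YAL001C"]:
    print(name, classify_non_nuclear_orf(name))
```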

RNA-Coding Genes

ncRNA Genes

All annotated S. cerevisiae ncRNAs are designated by a symbol consisting of four uppercase letters, a four-digit number, and another letter, as follows: Y for “Yeast”, NC for “noncoding”, A-Q for the chromosome on which the ncRNA gene resides (where “A” is chromosome I, “B” is chromosome II, etc., up to “P” for chromosome XVI, and lastly “Q” for the mitochondrial chromosome), a four-digit number corresponding to the sequential order of the ncRNA gene on the chromosome (starting from the left telomere and counting toward the right telomere), and W or C indicating whether the ncRNA gene is encoded on the “Watson” or “Crick” strand (where “Watson” runs 5′ to 3′ from left telomere to right telomere, and “Crick” runs 3′ to 5′).

Example:

YNCP0002W: is the second ncRNA gene from the left end of chromosome XVI and is encoded on the Watson strand.
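
The ncRNA format can be decoded field by field. The following sketch assumes the format exactly as described above; parse_ncrna, the ROMAN lookup table, and the "Mito" label for the mitochondrial chromosome are illustrative choices, not SGD identifiers.

```python
import re

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]

NCRNA_RE = re.compile(r"YNC([A-Q])([0-9]{4})([WC])")

def parse_ncrna(name: str) -> dict:
    m = NCRNA_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not an ncRNA systematic name: {name!r}")
    letter, number, strand = m.groups()
    # 'Q' marks the mitochondrial chromosome; 'A' to 'P' map to I to XVI.
    chromosome = "Mito" if letter == "Q" else ROMAN[ord(letter) - ord("A")]
    return {"chromosome": chromosome, "index": int(number), "strand": strand}

print(parse_ncrna("YNCP0002W"))
```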

When evidence is published pointing to new ncRNA genes, they will be added to the annotation using the next sequential number available for the specific chromosome on which the ncRNA gene resides. In cases in which more than one ncRNA gene is added to any particular chromosome during the same annotation update (i.e., same genome revision), they will be named using the next sequential number starting with the leftmost ncRNA gene and proceeding to the right of the chromosome.

tRNA Genes

Systematic names of nuclear-encoded tRNA genes begin with a lowercase 't'; the second letter corresponds to the single letter code for the appropriate amino acid, e.g., A = alanine, C = cysteine, etc.; next the sequence of the anticodon of the tRNA is given in the 5' -> 3' direction within parentheses, e.g., (AGC) or (GUC); finally, there is an indication of which chromosome the tRNA gene resides on using the letters 'A' through 'P' to designate nuclear chromosomes (in the same way as for nuclear-encoded ORFs). If a given nuclear chromosome contains more than one copy of a tRNA gene, individual copies of the same tRNA family (those of identical sequence, including the anticodon sequence) are distinguished from each other by the addition of a single number, starting with '1', after the letter designating the chromosome.

Examples:

tC(GCA)B: a tRNA for cysteine, with the anticodon sequence 'GCA', located on chromosome II
tS(AGA)D1: a tRNA for serine, with the anticodon sequence 'AGA', one of two or more tRNAs from this family (containing the AGA anticodon) located on chromosome IV

Mitochondrially-encoded tRNAs are named the same way as nuclear-encoded tRNAs, using the letter 'Q' to designate the mitochondrial chromosome, except that the presence of a number indicates that two or more tRNAs encode the same amino acid, though they do not necessarily contain the same anticodon sequence.

Examples:

tR(UCU)Q1: a tRNA for arginine, with the anticodon sequence (UCU), one of two or more tRNAs for arginine on the mitochondrial chromosome
tR(ACG)Q2: a tRNA for arginine, with the anticodon sequence (ACG), one of two or more tRNAs for arginine on the mitochondrial chromosome
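
A tRNA systematic name, nuclear or mitochondrial, can be split into its parts as sketched below. The sketch assumes anticodons are written with the letters A, C, G, and U only, as in the examples here; names using other characters would need a wider pattern. parse_trna and the "Mito" label are hypothetical.

```python
import re

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]

# t, amino acid letter, (anticodon), chromosome letter, optional copy number.
TRNA_RE = re.compile(r"t([A-Z])\(([ACGU]{3})\)([A-Q])([0-9]*)")

def parse_trna(name: str) -> dict:
    m = TRNA_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a tRNA systematic name: {name!r}")
    amino_acid, anticodon, letter, number = m.groups()
    chromosome = "Mito" if letter == "Q" else ROMAN[ord(letter) - ord("A")]
    return {
        "amino_acid": amino_acid,
        "anticodon": anticodon,
        "chromosome": chromosome,
        "number": int(number) if number else None,  # None when no number is given
    }

print(parse_trna("tC(GCA)B"))
print(parse_trna("tS(AGA)D1"))
print(parse_trna("tR(UCU)Q1"))
```

Note that the trailing number means different things for nuclear and mitochondrial tRNAs, as described above, so code consuming the "number" field should check the chromosome first.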

snRNA and snoRNA Genes

The systematic name of a small nuclear RNA (snRNA) or small nucleolar RNA (snoRNA) starts with the lowercase letters 'sn'; next is a capital 'R'; this is followed by a number. The number is unique, but does not convey any positional information. Frequently, the Gene Name of snRNAs and snoRNAs is the same as the Systematic Name, but with all caps, e.g. 'SNR'. Different copies of duplicated genes may be indicated by adding a letter, e.g. 'A' or 'B', to the end of the name.

Examples:

snR6 a snRNA, produces the U6 spliceosomal RNA
snR17a a snoRNA, one of two copies of snoRNA U3
snR17b a snoRNA, one of two copies of snoRNA U3
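
The relationship between these systematic names and their all-caps Gene Names can be sketched as follows. The pattern assumes a lowercase copy letter, as in the examples above, and snr_gene_name is a hypothetical helper. Since the text says the names are only "frequently" the same, the result is a likely Gene Name rather than a guaranteed one.

```python
import re

# 'sn', capital 'R', a number, and an optional lowercase copy letter.
SNR_RE = re.compile(r"snR([0-9]+)([a-z]?)")

def snr_gene_name(systematic_name: str) -> str:
    """Return the all-caps form of an snRNA/snoRNA systematic name."""
    if SNR_RE.fullmatch(systematic_name) is None:
        raise ValueError(f"not an snRNA/snoRNA systematic name: {systematic_name!r}")
    return systematic_name.upper()

print(snr_gene_name("snR6"))    # SNR6
print(snr_gene_name("snR17a"))  # SNR17A
```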

Note: SNR7 is an exception in that its transcript is alternatively processed, yielding two products: SNR7-S (short form) and SNR7-L (long form).

rRNA Genes

The systematic names and gene names of loci representing the nuclear encoded rRNA genes are identical to each other. The "loci" representing the rDNA repeats, the rRNA transcripts, and the mature rRNAs are named with the three letter acronym 'RDN' for Ribosomal DNA. While S. cerevisiae contains multiple repeats of the ribosomal DNA (rDNA), only two rDNA repeats were sequenced as part of the systematic sequencing project.

Examples:

RDN1 the entire 1-2 Mb rDNA region on Chromosome XII, consisting of 100-200 tandem copies of a 9.1 kb repeat which contains the genes for 5S, 5.8S, 25S and 18S rRNAs
RDN18-1 represents a specific copy of a region which encodes an 18S ribosomal RNA
RDN37-2 represents a specific copy of a region which encodes a primary rRNA transcript which is processed into the 25S, 18S and 5.8S rRNAs

A more complete explanation of the representation and naming of the rDNA repeats and rRNAs within it is present on the RDN1 locus page which represents the entire rDNA region on Chromosome XII.

Other Features

Autonomously Replicating Sequences

Autonomously Replicating Sequences (ARS) are named with the three letters ARS followed by a number. ARS features added after October 2000 are named systematically using the three letters ARS followed by one or two digits to represent the chromosome, e.g. chromosome I = 1, chromosome II = 2, chromosome X = 10. This is followed by an additional whole number to designate the particular ARS on that chromosome in the order named, starting with the digits '01'. Note that the number merely indicates the order in which the ARS elements were reported and named, and does not necessarily denote any location information relative to other ARS features. Note also that decimal points are NOT used. Some "historical" ARS features were given Gene Names prior to the establishment of this systematic naming system, e.g. ARS1, ARS2, ARS120. In these cases, an ARS-based Gene Name does not indicate the chromosomal location.
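
To make the systematic ARS format concrete, the sketch below builds a name from a chromosome number and an order of naming. It assumes the order is zero-padded to at least two digits, which is one reading of "starting with the digits '01'"; ars_name is a hypothetical helper.

```python
def ars_name(chromosome: int, order: int) -> str:
    """Build a systematic ARS name: 'ARS' + chromosome number + order (01, 02, ...)."""
    if not 1 <= chromosome <= 16:
        raise ValueError("chromosome must be between 1 and 16")
    if order < 1:
        raise ValueError("order must be 1 or greater")
    return f"ARS{chromosome}{order:02d}"

print(ars_name(1, 1))   # ARS101
print(ars_name(10, 1))  # ARS1001
```

The sketch only builds names. Reading a name back is not attempted, because the chromosome digits and the order digits are not separated by any delimiter, and because historical names such as ARS120 do not encode a chromosome at all.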

Centromeres

Centromeres are named with the three letters 'CEN' followed by one or two digits to represent the chromosome.
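
A centromere name can be validated and decoded with a short pattern. This sketch assumes chromosome numbers 1 through 16 written without leading zeros; parse_centromere is a hypothetical helper.

```python
import re

# 'CEN' followed by a chromosome number from 1 to 16.
CEN_RE = re.compile(r"CEN([1-9]|1[0-6])")

def parse_centromere(name: str) -> int:
    """Return the chromosome number encoded in a centromere name."""
    m = CEN_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a centromere name: {name!r}")
    return int(m.group(1))

print(parse_centromere("CEN1"))
print(parse_centromere("CEN12"))
```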

Non-reference genes

Non-reference genes are designated by a symbol consisting of three uppercase letters and a four-digit number, as follows: Y for “Yeast”, SC for “Saccharomyces cerevisiae”, and a four-digit number corresponding to the sequential order in which the gene was added to SGD.

Examples:

MAL21/YSC0004
MATA/YSC0046
XDH1/YSC0051

Currently, SGD has 55 such genes. As evidence is published pointing to other S. cerevisiae genes not present in the S288C reference, they will be added to the annotation using the next sequential number available.
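
The non-reference gene format is a fixed prefix plus a zero-padded counter, which the following sketch reads and writes. Both function names are hypothetical, and the sketch makes no claim about which numbers are currently assigned.

```python
import re

YSC_RE = re.compile(r"YSC([0-9]{4})")

def parse_non_reference(name: str) -> int:
    """Return the sequential number encoded in a YSC name."""
    m = YSC_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a non-reference gene name: {name!r}")
    return int(m.group(1))

def format_non_reference(number: int) -> str:
    """Build a YSC name from a sequential number (1 to 9999)."""
    if not 1 <= number <= 9999:
        raise ValueError("number must fit in four digits")
    return f"YSC{number:04d}"

print(parse_non_reference("YSC0046"))  # 46
print(format_non_reference(4))         # YSC0004
```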

Ty Elements

The systematic name of a full length Ty element starts with a 'Y'; the second letter corresponds to the chromosome number (e.g., 'A' for chr I, 'H' for chr VIII); the third letter is either 'L' or 'R' for left or right of the centromere; the fourth letter is either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere); next are the letters 'Ty' followed by a number, 1-5, to indicate the type of Ty element. The first Ty element of a given type is indicated with -1; additional full length Ty elements of the same type on the same chromosome are given a number incremented by one from the previous one.

Examples:

YARCTy1-1 a Ty element of type 1 on the right arm of Chromosome I, on the Crick strand
YCLWTy5-1 a Ty element of type 5 on the left arm of Chromosome III, on the Watson strand
YDRCTy1-1 a Ty element of type 1 on the right arm of Chromosome IV, on the Crick strand
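
The Ty element format can be decoded in the same way as ORF names. This is a minimal sketch assuming the format exactly as described above; parse_ty and the ROMAN lookup table are illustrative names.

```python
import re

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]

# Y, chromosome letter, arm, strand, 'Ty', type 1-5, '-', copy number.
TY_RE = re.compile(r"Y([A-P])([LR])([WC])Ty([1-5])-([0-9]+)")

def parse_ty(name: str) -> dict:
    m = TY_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a Ty element systematic name: {name!r}")
    letter, arm, strand, ty_type, copy = m.groups()
    return {
        "chromosome": ROMAN[ord(letter) - ord("A")],
        "arm": arm,
        "strand": strand,
        "type": int(ty_type),
        "copy": int(copy),
    }

print(parse_ty("YARCTy1-1"))
print(parse_ty("YCLWTy5-1"))
```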

Ty Long Terminal Repeat Elements

The systematic name of a Ty Long Terminal Repeat (LTR) element starts with a 'Y'; the second letter corresponds to the chromosome number (e.g., 'A' for chr I, 'H' for chr VIII); the third letter is either 'L' or 'R' for left or right of the centromere; the fourth letter is either 'W' for Watson (the strand with 5' end at the left telomere) or 'C' for Crick (the complement strand, 5' end is at the right telomere); next is a word for a Greek letter indicating the type of LTR element, e.g. 'delta', 'sigma', 'tau', 'omega'. The first Ty LTR element of a given type is given the number '1'; additional Ty LTR elements of the same type on the same chromosome are given a number incremented by one from the previous one.

Examples:

YARCdelta8 a Ty LTR of the delta type on Chromosome I
YARWsigma1 a Ty LTR of the sigma type on Chromosome I
YBLCtau1 a Ty LTR of the tau type on Chromosome II
YCLWomega1 a Ty LTR of the omega type on Chromosome III

Note: there are four systematic names (YCLWdelta2a, YCLWdelta2b, YDRCdelta6a, and YDRCdelta6b) that do not conform to the nomenclature rules. Please contact SGD if you need to use this nomenclature.
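
A parser for the regular LTR format might look like the sketch below. It assumes the four Greek-letter words listed above are the only types, and it intentionally rejects the four non-conforming names mentioned in the note. parse_ltr and the ROMAN table are hypothetical.

```python
import re

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]

# Y, chromosome letter, arm, strand, Greek-letter word, number.
LTR_RE = re.compile(r"Y([A-P])([LR])([WC])(delta|sigma|tau|omega)([0-9]+)")

def parse_ltr(name: str) -> dict:
    m = LTR_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a regular Ty LTR systematic name: {name!r}")
    letter, arm, strand, ltr_type, number = m.groups()
    return {
        "chromosome": ROMAN[ord(letter) - ord("A")],
        "arm": arm,
        "strand": strand,
        "type": ltr_type,
        "number": int(number),
    }

print(parse_ltr("YARCdelta8"))
print(parse_ltr("YBLCtau1"))
```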

Telomeric Elements

SGD currently annotates several different types of features at the ends of chromosomes, listed below. When there are multiple examples of a type of telomeric element at a single chromosome end (e.g. more than one Telomeric Repeat), the elements will be numbered after the suffix, with number 1 being the closest to the end of the chromosome.

Telomeric Region: "TEL" followed by a two digit number indicating the chromosome number, then "L" or "R" to indicate the left or right arm of the chromosome.

Example:

TEL08L

X element Combinatorial Repeats: The same base name used for the Telomeric Region feature, appended with a suffix of "-XR".

Example:

TEL08R-XR

X element Core sequence: The same base name used for the Telomeric Region feature, appended with a suffix of "-XC".

Example:

TEL08L-XC

Y' element: The same base name used for the Telomeric Region feature, appended with a suffix of "-YP".

Example:

TEL12L-YP1

Telomeric Repeat: The same base name used for the Telomeric Region feature, appended with a suffix of "-TR".

Example:

TEL08R-TR1
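
All of the telomeric names above share one base format with an optional element suffix, so a single pattern can cover them. The sketch below assumes the four suffixes listed here are the only ones and that the trailing number is optional; parse_telomeric is a hypothetical helper.

```python
import re

# TEL, two-digit chromosome, arm, then optionally '-' + element code + number.
TEL_RE = re.compile(r"TEL([0-9]{2})([LR])(?:-(XR|XC|YP|TR)([0-9]*))?")

def parse_telomeric(name: str) -> dict:
    m = TEL_RE.fullmatch(name)
    if m is None or not 1 <= int(m.group(1)) <= 16:
        raise ValueError(f"not a telomeric feature name: {name!r}")
    chromosome, arm, element, number = m.groups()
    return {
        "chromosome": int(chromosome),
        "arm": arm,
        "element": element,                         # None for the Telomeric Region itself
        "number": int(number) if number else None,  # 1 is closest to the chromosome end
    }

print(parse_telomeric("TEL08L"))
print(parse_telomeric("TEL12L-YP1"))
```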

Correlation between Gene Names and Systematic Names

While all ORFs identified in the genome sequence have a Systematic Name, e.g. YAL001C, YGR116W, YAL034W-A, or Q0010, many ORFs have not been given a Gene Name, e.g. a name such as COX2 or CDC28. In addition, Gene Names have been conferred on non-ORF features such as tRNAs, other non-coding RNAs, and on genetic loci which have not yet been mapped to a specific position on a chromosome. In this last case, because the chromosomal location is not known, there will not be a systematic name associated with the Gene Name.

An ORF, or other chromosomal feature, with a systematic name may have been associated with more than one common usage name, or Gene Name. Only one of these will be designated as the Standard Name; any other associated name is referred to as an Alias.
