SGD Help: Protein Information
The Protein page provides detailed information about the protein encoded by a particular gene. This page contains locus-specific nomenclature and protein product information; a brief description of the role of the protein within the cell; the predicted primary protein sequence; detailed domain/motif information; and basic information derived from the protein sequence, including physico-chemical properties and other values. Links provide access to detailed prediction-based information, manually curated and referenced information, and various external resources.
- Contents of the Protein Page
- Protein Overview
- Experimental Data
- Domains and Classification
- Amino Acid Sequence
- Protein Modifications
- Sequence Based Calculations
- Amino Acid Composition
- Physico-Chemical Properties
- Coding Region Translation Calculations
- Extinction Coefficient
- Atomic Composition
- External Identifiers
- Analyzing and Downloading Protein Data
- Downloading Data from Tables on the Protein Page
- Retrieving Protein Data from YeastMine
Contents of the Protein Page
The Protein Overview section contains several fields of nomenclature for the protein in question. If a field doesn't have a value, it will not be listed. Potential fields that may be listed under this heading include:
- Aliases: a list of all the alternative names in the literature given to the gene/protein.
- Protein Product: The name of the protein; compliant with both NCBI and UniProt protein nomenclature guidelines.
- Feature Type: The type of gene encoding the protein product (ORF, transposable element gene) or potential protein product (blocked reading frame, pseudogene).
- Description: a concise summary of the biological role and molecular function of the protein and/or gene.
- Paralog: if the protein has an identified paralog, a link is provided to the Locus Summary page of the paralog in SGD.
- EC Number: the Enzyme Commission number, for known enzymes. The value is a link to a dedicated page within SGD for the EC number, with additional information and links.
The Experimental Data section contains two subsections, one for protein half-life and the other for protein abundance. Proteome-wide, steady-state protein turnover rates (i.e. protein half-lives) were calculated under standard growth conditions in synthetic medium using pulse stable isotope labeling by amino acids in cell culture (SILAC; Christiano et al. 2014). The rate of decay of native proteins was analyzed using high-resolution mass spectrometry-based proteomics. The resulting distribution of half-lives spans two orders of magnitude, ranging from a few minutes to more than 100 hr, and defines three classes of proteins. Class I contains very short-lived proteins (~2%), including many that drive the cell cycle; Class III contains very stable proteins (86%), many of which drive growth and mass accumulation; and Class II proteins (12.5%) have intermediate half-lives and include those that mediate regulated processes, such as nutrient transport.
The Protein Abundance subsection contains reanalyzed data from 21 quantitative analyses of the proteome, visualized using a variety of experimental methods (mass spectrometry, GFP microscopy, GFP flow cytometry and tandem affinity purification/immunoblot). Data from the primary studies were mode-shift normalized and scaled to the intuitive abundance unit of molecules per cell (Ho et al. 2018). The normalized abundance measurements and associated metadata (media, visualization and strain background) from untreated cells are displayed in the protein abundance table. For some GFP-based studies, changes in protein abundance for cells treated with various environmental stressors, including DNA replication stress (hydroxyurea, methyl methanesulfonate), oxidative stress (hydrogen peroxide), reductive stress (dithiothreitol), nutritional stress (nitrogen starvation), quiescence and rapamycin treatment, were also normalized and converted to molecules per cell. These values are also displayed in this table along with additional metadata including the perturbation, the treatment time and the concentration of chemicals, if applicable. When the protein abundance in stressed cells is more than two standard deviations from the untreated average abundance, a fold-change value is also displayed. By default, rows are sorted alphabetically by original reference; the order can be changed using the arrows in the table header, and the table can be filtered by entering keyword(s) into the text box. Note that some low-abundance proteins were not visualized in untreated cells because of the autofluorescence filtering of the data but became visible after treatment; these proteins therefore have treated value(s) but no untreated value.
Finally, for a given protein, all values from untreated datasets were used to calculate a median abundance and a median absolute deviation (MAD) using a constant of C=1. These values are displayed on the Locus Summary page in the Protein section. Note, there are a few cases where the median value displayed is based on data from a single study and as such a median absolute deviation could not be calculated.
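As an illustration of the summary statistics described above, the following minimal sketch computes a median and a median absolute deviation with a scaling constant of 1. The function name and the example values are hypothetical; this is not SGD's own code, only one way the calculation could be written.

```python
from statistics import median


def median_and_mad(values, constant=1.0):
    """Return (median, MAD) for a list of untreated abundance values.

    MAD = constant * median(|x - median(x)|). With a single value the
    MAD is not defined, so None is returned in its place.
    """
    if not values:
        raise ValueError("at least one abundance value is required")
    center = median(values)
    if len(values) < 2:
        return center, None
    mad = constant * median(abs(v - center) for v in values)
    return center, mad


# Hypothetical molecules-per-cell values from five untreated datasets.
abundances = [1200.0, 1500.0, 1350.0, 4100.0, 1280.0]
med, mad = median_and_mad(abundances)
print(med, mad)
```

Because both statistics are medians, the single high value in the example has little influence on either result.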
Domains and Classification
The Domains and Classification section displays the results of domain predictions for yeast protein sequences using the InterProScan program (Jones et al. 2014). InterProScan is a tool that combines different protein signature recognition methods into one resource. The InterPro database integrates motif, domain and protein family HMM information from the following member databases: Gene3D, PANTHER, Pfam, Phobius, PIR Superfamily, PRINTS, SignalP, SMART, Superfamily, TIGRFAM and TMHMM. The domain predictions are refreshed every 3 months to keep them up to date. The predictions are shown in both tabular and graphical form.
- Domains Table: The table shows the coordinates of each domain, along with the accession ID, the description of the domain, the source of the domain information and the number of yeast genes that contain that domain. The domain accession ID is linked to a page in SGD that provides additional information about the domain. The contents of the table can be downloaded as a tab-delimited text file.
- Domain Locations Graph: Hovering the mouse above a domain on the graphical representation will show an info box with the description and precise coordinates of the domain within the protein.
- Shared Domains: This section of the page shows a Cytoscape network visualization of yeast proteins (grey circles) that share domains (colored squares) with the selected protein (yellow circle). The visualization shows the proteins that share the largest number of domains with the central protein and is limited to a maximum of 100 nodes and 250 edges. The nodes of the graph link to Locus Summary pages and domain pages within SGD.
Amino Acid Sequence
This section of the page has several subsections that show sequence based information about the protein.
The amino acid sequence is displayed in 60-residue blocks. Residues are numbered on the left side. The sequence shown by default is that of the reference strain, but the pull-down menu allows selection of the sequence from any of the other S. cerevisiae strains whose sequence is available in SGD. Also included is a button to Download the sequence, which loads a flat-text browser page with the amino acid sequence displayed in FASTA format. Known modification sites (currently phosphorylation) are highlighted on the sequence by color. (See next section for further information on modifications.)
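The display and download formats described above can be sketched as follows. This is only an illustration: the function names, the header text and the 60-character FASTA line width are assumptions, not a description of how SGD generates its pages.

```python
def format_blocks(sequence, width=60):
    """Return the sequence as numbered lines of `width` residues each.

    The number at the left of each line is the 1-based position of the
    first residue on that line.
    """
    lines = []
    for start in range(0, len(sequence), width):
        lines.append(f"{start + 1:>6} {sequence[start:start + width]}")
    return "\n".join(lines)


def to_fasta(header, sequence, width=60):
    """Return a FASTA record: a '>' header line, then wrapped sequence."""
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{header}"] + body)


# Hypothetical 130-residue sequence, used only to show the wrapping.
example = "MKT" * 43 + "A"
print(format_blocks(example))
print(to_fasta("EXAMPLE hypothetical protein", example))
```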
Protein Modifications
If there are protein modification data available for the protein, this section of the page shows this information in tabular format. Protein phosphorylation data are taken from PhosphoGRID. Additional phosphorylation data, and other modification types, are curated by SGD. The modified sites shown match the sequence for the selected strain displayed above - changing the selected strain may change the listed sites. Annotations are assumed to be valid for each strain, unless the indicated residue is not present - polymorphic, mutated, etc. - in a given strain.
Sequence Based Calculations
Data in this section are calculated from the protein sequence using BioPerl Seq libraries and CODONW software.
Amino Acid Composition
The Amino Acid Composition is based on the primary sequence. The table contains three columns: the first lists the one letter designations for the twenty amino acids, the second column lists the number of amino acids present in one molecule, and the third contains the composition expressed as a percentage.
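The three columns of the composition table can be reproduced from a sequence with a short calculation. The sketch below is illustrative only; the function name is hypothetical and it is not the BioPerl code used by SGD.

```python
from collections import Counter

# One-letter codes for the twenty standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a list of (residue, count, percent) rows for a sequence.

    Percentages are relative to the number of standard residues, so any
    other characters (for example a trailing '*') are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    rows = []
    for aa in AMINO_ACIDS:
        percent = 100.0 * counts[aa] / total if total else 0.0
        rows.append((aa, counts[aa], percent))
    return rows


for residue, count, percent in amino_acid_composition("MKKA"):
    if count:
        print(f"{residue}\t{count}\t{percent:.2f}")
```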
Physico-Chemical Properties
This section contains various physico-chemical properties of the protein calculated from the sequence, including:
- Length (a.a.): the predicted full length of the translated gene product.
- Molecular Weight (Da): the predicted molecular weight of the full length protein in daltons (Da).
- Isoelectric Point (pI): the theoretical isoelectric point (pI) is the pH at which the protein carries no net charge.
- Formula: molecular formula of the protein.
- Instability Index: The instability index was developed based on a statistical analysis of 12 unstable and 32 stable proteins (Guruprasad et al., 1990). This analysis revealed the presence of certain dipeptides that occurred with significantly different frequencies between stable and unstable proteins. A dipeptide instability weight value (DIWV) was assigned to each of 400 different dipeptides. These weight values were then used to calculate an instability index (II) defined as:
II = (10/L) * Sum DIWV(x[i]y[i+1])
where: L is the length of sequence
DIWV is the instability weight value
and x[i]y[i+1] is a dipeptide starting at position i.
Proteins with an instability index less than 40 are predicted to be stable, whereas those with a value greater than 40 are predicted to be unstable.
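The formula above translates directly into a sum over adjacent residue pairs. In the minimal sketch below, the published DIWV table is not reproduced: TOY_DIWV is a made-up placeholder with invented weights, and a real calculation would need all 400 dipeptide values from Guruprasad et al. The function name is also hypothetical.

```python
def instability_index(sequence, diwv):
    """Compute II = (10 / L) * sum of DIWV over all adjacent dipeptides.

    `diwv` maps two-letter dipeptide strings to instability weights; a
    KeyError is raised if a dipeptide is missing from the table.
    """
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    total = sum(diwv[seq[i:i + 2]] for i in range(len(seq) - 1))
    return 10.0 / len(seq) * total


# Placeholder weights for illustration only; NOT the published values.
TOY_DIWV = {"MK": 10.0, "KK": -5.0, "KA": 20.0}

ii = instability_index("MKKA", TOY_DIWV)
print(ii, "stable" if ii < 40 else "unstable")
```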
- Aliphatic Index: The aliphatic index is the relative volume of a protein that is occupied by aliphatic side chains (alanine, isoleucine, leucine and valine) and contributes to the increased thermostability observed for globular proteins. The aliphatic index of a protein is calculated according to the following formula (Ikai, 1980):
Aliphatic index = X(Ala) + a * X(Val) + b * ( X(Ile) + X(Leu) )
where X(Ala), X(Val), X(Ile), and X(Leu) are the mole percentages (100 × mole fraction) of alanine, valine, isoleucine, and leucine. The coefficients a and b are the volumes of valine side chains (a = 2.9) and of Leu/Ile side chains (b = 3.9) relative to that of alanine side chains.
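A worked version of the aliphatic index formula, using the coefficients given above, might look like the following sketch. The function name is hypothetical and the code is an illustration rather than the implementation used by SGD.

```python
def aliphatic_index(sequence, a=2.9, b=3.9):
    """Aliphatic index = X(Ala) + a * X(Val) + b * (X(Ile) + X(Leu)).

    X(...) is the mole percent of each residue, i.e. 100 times the
    fraction of positions in the sequence occupied by that residue.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue):
        return 100.0 * seq.count(residue) / len(seq)

    return (mole_percent("A")
            + a * mole_percent("V")
            + b * (mole_percent("I") + mole_percent("L")))


# Each of the four aliphatic residues makes up 25% of this toy peptide,
# so the index is 25 + 2.9 * 25 + 3.9 * 50.
print(aliphatic_index("AVIL"))
```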
Coding Region Translation Calculations
Values for Codon Bias Index (CBI), Codon Adaptation Index (CAI), Frequency of Optimal Codons (Fop), Hydropathicity of Protein (GRAVY score), and Aromaticity Score (AROMO) are calculated based on the specific genetic code and codon usage of a given organism and organelle. These values were calculated using the CodonW software program written by John Peden.
CodonW analyzes the correspondence between amino acids and codon usage in a set of protein sequences, based on a given genetic code (i.e. that used in the S. cerevisiae nucleus versus that used in its mitochondrion). CodonW was designed to work with any genetic code. Whether an amino acid is synonymous or non-synonymous, the translation of a codon, the number of codons in a codon family, and how many synonyms a codon has are all determined at run time. Seven alternatives to the universal genetic code have been built in to the program, including S. cerevisiae chromosomal codon usage and S. cerevisiae mitochondrial codon usage. In SGD, we have used these two built-in options, as appropriate, to perform codon usage-based calculations for chromosomally-encoded or mitochondrially-encoded ORFs. Note that codon usage-based calculations are not currently performed for ORFs present within transposable elements (Ty elements), because the codon usage of transposable element genes differs from that of chromosomal genes (see the CodonW tutorial).
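Two of the values listed above, the GRAVY score and the aromaticity score, can be computed from the translated protein sequence alone. The sketch below is not CodonW's code; it assumes the usual definitions, namely that GRAVY is the mean Kyte-Doolittle hydropathy value per residue and that aromaticity is the fraction of residues that are Phe, Tyr or Trp. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy value for each standard amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = frozenset("FWY")


def gravy(sequence):
    """Mean hydropathy per residue; positive values are hydrophobic."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aromaticity(sequence):
    """Fraction of residues that are Phe, Tyr or Trp."""
    seq = sequence.upper()
    return sum(aa in AROMATIC for aa in seq) / len(seq)


print(gravy("MKKA"), aromaticity("FYWA"))
```

The codon-based indices (CBI, CAI and Fop) additionally need the coding sequence and a reference set of optimal codons, so they are not covered by this sketch.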
Extinction Coefficient
The extinction coefficient (epsilon) is the wavelength-dependent molar absorptivity coefficient with units of M⁻¹ cm⁻¹. The extinction coefficient provides an indication of the amount of light that a given protein will absorb at a certain wavelength (usually 280 nm). During protein purification, a spectrophotometer can be used to follow the protein of interest if the extinction coefficient is known. The molar extinction coefficient of a protein can be estimated based on its amino acid composition. The extinction coefficient of the native protein in water can be calculated based on the molar extinction coefficients of tyrosine, tryptophan and cystine (cysteine does not absorb much at wavelengths greater than 260 nm, while cystine does) using the following equation:
E(Prot) = Numb(Tyr)*Ext(Tyr) + Numb(Trp)*Ext(Trp) + Numb(Cystine)*Ext(Cystine)
where: Ext(Tyr) = 1490
Ext(Trp) = 5500
Ext(Cystine) = 125
The absorbance (optical density) can then be calculated using the following formula:
Absorb(Prot) = E x l x C
where: E = extinction coefficient
l = pathlength (cm)
C = protein concentration (M)
Two extinction coefficient values are calculated by ProtParam: the first is based on the assumption that all cysteine residues appear as half cystines, and the second assumes that no cysteines appear as half cystines. The computation has been demonstrated to be quite reliable for proteins that contain Trp residues, but for proteins without Trp residues there may be more than a 10% error.
These calculations are based on the method developed by Edelhoch, 1967, using extinction coefficients for Trp and Tyr, as determined by Pace et al., 1995. The values used in the calculation of extinction coefficients for denatured proteins were also found to be accurate for calculating coefficients for the native protein (Gill and von Hippel, 1989). In general, since Trp residues contribute much more to the overall extinction coefficient than Tyr and cystine residues, the calculations tend to be much closer to measured values for proteins that contain Trp residues.
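The two extinction coefficient values and the absorbance formula can be sketched as follows, using the coefficients listed above. The function names are hypothetical, and the sketch assumes that when all cysteines are paired the number of cystines is the number of Cys residues divided by two, rounded down.

```python
# Molar extinction coefficients (M^-1 cm^-1) quoted in the text.
EXT_TYR = 1490
EXT_TRP = 5500
EXT_CYSTINE = 125


def extinction_coefficients(sequence):
    """Return (all_cys_paired, no_cys_paired) extinction coefficients.

    Assumption: with every cysteine paired, the number of cystines is
    the number of Cys residues divided by two, rounded down.
    """
    seq = sequence.upper()
    base = seq.count("Y") * EXT_TYR + seq.count("W") * EXT_TRP
    cystines = seq.count("C") // 2
    return base + cystines * EXT_CYSTINE, base


def absorbance(epsilon, path_cm, conc_molar):
    """Absorbance = extinction coefficient * pathlength * concentration."""
    return epsilon * path_cm * conc_molar


paired, unpaired = extinction_coefficients("WYYCC")
print(paired, unpaired)
print(absorbance(unpaired, path_cm=1.0, conc_molar=1e-4))
```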
The Atomic Composition Table displays the composition of the protein, with respect to the number of atoms of carbon, hydrogen, nitrogen, oxygen, and sulfur that it contains as well as the total number of atoms and the resulting formula.
This section of the page provides access to a compendium of Saccharomyces cerevisiae sequence entries for alleles and strains that are located in various external databases including GenBank/EMBL/DDBJ, NCBI, EBI, and MIPS. Sequence entries are listed by accession and/or version numbers according to the source. Additional information is available in the All Associated Sequences help page.
This section provides access to a number of external resources relevant to the query protein. This includes sequence entries located at various homolog related resources, interaction databases, protein databases, and localization resources.
- Homologs: provide access to several sources of homolog information, when available for the requested protein.
- Ashbya (AGD): provides a direct link between the S. cerevisiae protein and the Ashbya gossypii ortholog at the Ashbya Genome Database (AGD) located at the University of Basel.
- AspGD Homologs: provides a direct link between the S. cerevisiae protein and the Aspergillus nidulans ortholog at the Aspergillus Genome Database (AspGD) located at Stanford University.
- CGD Homologs: provides a direct link between the S. cerevisiae protein and the Candida albicans ortholog at the Candida Genome Database (CGD) located at Stanford University.
- Fungal Orthogroups Repository: provides a direct link between the S. cerevisiae protein and its fungal orthologs at the Fungal Orthogroups Repository located at the Broad Institute.
- P-POD: provides a direct link between the S. cerevisiae protein and its protein orthologs at the Princeton Protein Orthology Database located at Princeton University.
- PDB Homologs: provides a link to an SGD page where homologs of the current S. cerevisiae protein are shown - proteins with 3D structures available at the RCSB Protein Data Bank.
- PhylomeDB: provides a link between the S. cerevisiae protein and its phylogenetic tree as provided by PhylomeDB at CRG in Barcelona.
- PomBase: provides a direct link between the S. cerevisiae protein and the Schizosaccharomyces pombe ortholog at the PomBase database located at University of Cambridge.
- YGOB (Yeast Gene Order Browser): a tool used to visualize the syntenic context of protein coding genes from S. cerevisiae, S. castellii, C. glabrata, A. gossypii, K. lactis, K. waltii, and S. kluyveri. YGOB was developed by Kevin Byrne and Ken Wolfe (Trinity College, Dublin, Ireland), as described in Byrne and Wolfe.
- YOGY: the eukarYotic OrtholoGY (YOGY) tool is used to view orthologous proteins from eukaryotic organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, Drosophila melanogaster, Caenorhabditis elegans, Plasmodium falciparum, Schizosaccharomyces pombe, and Saccharomyces cerevisiae). YOGY provides information from KOGs, Inparanoid, Homologene, OrthoMCL, and manually curated orthologs between S. cerevisiae and S. pombe. YOGY was developed by the Fission Yeast Functional Genomics Team at the Wellcome Trust Sanger Institute, Cambridge, UK.
- Protein databases/Other: These links provide access to structural assignments for protein sequences at the superfamily level from the SCOP Superfamily database, mass spectrometry data at GPMdb, Pfam domains from the Wellcome Trust Sanger Institute, and YeastRC Structure Prediction from the YRC Public Data Repository.
- Localization Resources: These links provide access to external databases that contain localization data for many yeast proteins, including LoQAtE at the Weizmann Institute of Science, YPL+ at the University of Graz, Austria, and the Yeast GFP Fusion Localization Database, originally at the University of California, San Francisco, but now hosted at SGD.
- Post-translational Modifications: These links provide access to external databases that contain post-translational modification data for many yeast proteins, including the PhosphoGRID and PhosphoPep databases.
Analyzing and Downloading Protein Data
Downloading Data from Tables on the Protein Page
All data presented in tables on the Protein page can be downloaded by clicking the download button at the bottom left of each table. All data are downloaded in tab-delimited text format, except the sequence data which is provided in FASTA format.
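Once a table has been downloaded, the tab-delimited text can be read with standard tools. The sketch below uses Python's csv module; the column names and the two example rows are hypothetical placeholders, and the actual header line of a downloaded file may differ.

```python
import csv
import io


def read_table(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical content standing in for a downloaded domains table.
EXAMPLE = (
    "Start\tEnd\tAccession ID\tDescription\tSource\n"
    "10\t120\tEXAMPLE0001\tHypothetical domain A\tPfam\n"
    "150\t300\tEXAMPLE0002\tHypothetical domain B\tSMART\n"
)

for row in read_table(EXAMPLE):
    length = int(row["End"]) - int(row["Start"]) + 1
    print(row["Accession ID"], row["Source"], length)
```

To read a saved file instead of a string, the same reader can be given an open file handle in place of io.StringIO.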
Retrieving Protein Data from YeastMine
All of the data displayed on the Protein page, plus additional data, are available from YeastMine. You can search for and download data, or create gene lists and analyze them further using additional YeastMine queries.
YeastMine templates (pre-composed queries) for protein data include: