Signal peptides (SPs) are subsequences of proteins that help direct them to their target subcellular localisation, either co-translationally or post-translationally. Eukaryotic proteins sorted by the secretory pathway are co-translationally recruited to the ER via the SP and are then translocated through the Sec61 channel. Proteins destined for other organelles (e.g. mitochondria, nucleus, peroxisome) are synthesized by free ribosomes in the cytosol and are then translocated post-translationally.
SPs are canonically N-terminal, but some proteins have (additional) internal or C-terminal SPs. They are usually around 25-30 AAs long (von Heijne 1990), although some are as short as 16 AAs and others as long as 140 AAs. Long SPs are mainly used in eukaryotic organelle targeting. Signal peptides can be thought of as tripartite (von Heijne 1990, Figure 1):
N-region: Tends to be positively charged, and may help orient membrane proteins through the positive-inside rule.
H-region: Consists of more hydrophobic AAs, and often forms a single alpha-helix.
C-region: Can contain a cleavage site for a signal peptidase.
Figure 1 | Typical structure of an N-terminal signal peptide.
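The tripartite picture above can be made concrete with a small sketch. It is only a rough heuristic and not how SignalP or any other predictor works: it counts charged residues in a chosen N-terminal stretch and slides a window along the sequence to locate the most hydrophobic segment, using the Kyte-Doolittle hydropathy scale. The function names, the window width of 8 and the toy sequence are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(seq):
    """Crude net charge: +1 per K/R, -1 per D/E (histidine and termini ignored)."""
    return sum((aa in "KR") - (aa in "DE") for aa in seq)


def most_hydrophobic_window(seq, width=8):
    """Return (start index, mean hydropathy) of the most hydrophobic window.

    If the sequence is shorter than `width`, no window exists and the
    function returns (0, -inf).
    """
    best_start, best_score = 0, float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(KD[aa] for aa in seq[start:start + width]) / width
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Made-up toy sequence, not a real signal peptide.
toy = "MKKRSLLLLLLLLAVALASAADE"
print("net charge of first 5 residues:", net_charge(toy[:5]))
print("most hydrophobic window:", most_hydrophobic_window(toy))
```

A positive charge in the first few residues followed by a strongly hydrophobic window is consistent with an n-region/h-region arrangement, but it is far from sufficient evidence for a signal peptide.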
COMPARTMENTS is a web resource that presents the user with a subcellular localisation confidence score for their queried protein. This confidence score is generated through integration of 4 types of information sources (Figure 2):
Knowledge: manually annotated database entries
Experiments: Human Protein Atlas staining patterns
Text-mining: co-mentioning of protein + compartment
Predictions: sequence-based prediction (WoLF PSORT and YLoc)
This produces a confidence score between 0 and 5.
Figure 2 | The four sources of information that COMPARTMENTS integrates over to compute confidence scores for subcellular localisation. Numbers indicate the number of proteins covered by each pillar. Sequence-based predictions are not shown since they can produce predictions for any sequence.
My initial hypothesis for this project was that, since signal peptides control the localisation of many proteins to different compartments, there must be something intrinsic to these signal peptides that causes recognition, e.g. by transporters that translocate the protein to those compartments.
Thus, in this project, motif finders are used in an attempt to find a sequence motif for each of 23 (sub)compartments known to use signal peptides. Similarly, multiple sequence alignment (MSA) is used to probe for chemical/structural conservation. Finally, a dedicated signal peptide predictor (SignalP 6.0) is used as well.
Figure 3 | Workflow of this project.
The integrated confidence scores for all human proteins in the COMPARTMENTS database (Binder et al. 2014) were downloaded from their website and linked to their respective protein sequences through matching the Ensembl transcript IDs with UniProtKB. The code utilised for this can be found in the Downloads section.
Download the human COMPARTMENTS data (integrated from all 4 information sources) from the website
Download the human proteome from the UniProtKB database, with additional filters (GO, protein sequence, manually confirmed, Ensembl cross-reference, GO IDs)
Iterate row by row through the COMPARTMENTS data: for each entry, find the matching protein sequence in the UniProtKB data via the ENSP/T identifier and append the sequence to the table
Sort the final table by GO ID (i.e. by subcellular compartment)
(Filter for entries with a confidence score of 4+)
Create a FASTA file for each compartment
Feed each compartment's FASTA file into SignalP
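The steps above can be sketched roughly as follows. This is not the code from the Downloads section: the tuple layout of the score rows, the function name and the placeholder identifiers are assumptions made for illustration, and a dictionary lookup stands in for the row-by-row search. Only the two GO IDs (GO:0005739 for mitochondrion, GO:0005634 for nucleus) are real.

```python
from collections import defaultdict


def build_compartment_fasta(score_rows, sequences, min_score=4.0):
    """Group high-confidence proteins by compartment and render FASTA text.

    score_rows: iterable of (ensembl_id, go_id, confidence) tuples
                (hypothetical layout, not the real download format).
    sequences:  dict mapping Ensembl ID -> protein sequence.
    Returns a dict mapping GO ID -> FASTA-formatted string.
    """
    grouped = defaultdict(list)
    for ensembl_id, go_id, confidence in score_rows:
        seq = sequences.get(ensembl_id)
        if seq is None or confidence < min_score:
            continue  # no matching sequence, or below the confidence cut-off
        grouped[go_id].append((ensembl_id, seq))
    # Sort by GO ID (i.e. by compartment) and by protein ID within each one.
    return {
        go_id: "".join(f">{pid}\n{seq}\n" for pid, seq in sorted(entries))
        for go_id, entries in sorted(grouped.items())
    }


# Placeholder IDs and sequences, invented for this example.
rows = [
    ("ENSP_A", "GO:0005739", 4.5),
    ("ENSP_B", "GO:0005739", 2.0),  # dropped: confidence below 4
    ("ENSP_C", "GO:0005634", 5.0),
    ("ENSP_D", "GO:0005634", 5.0),  # dropped: no sequence available
]
seqs = {"ENSP_A": "MAAA", "ENSP_B": "MCCC", "ENSP_C": "MDDD"}

for go_id, fasta_text in build_compartment_fasta(rows, seqs).items():
    print(go_id)
    print(fasta_text, end="")
```

Each returned string could then be written to its own file and submitted to SignalP.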
The human COMPARTMENTS database was filtered for 23 compartments of interest. All of these, except for the cytosol (which was added for comparison), are part of the Sec pathway and/or are organelles. The compartments are listed in the table below.
For each compartment, only the high-confidence entries (4 stars or more) were extracted (15692 high-confidence entries out of 76752 in total). To limit the search, only the first 40 AAs of each sequence were then fed into various sequence aligners (ClustalW, BlastP, MAFFT) and motif finders (MEME suite, GibbsCluster 2.0). The motif finders did not find any motifs, apart from the fact that proteins typically start with M. This is likely because signal peptides are not conserved in sequence, but rather structurally and chemically. ClustalW seemed able to weakly align the h-region for a small portion of entries, but I could not see any compartment-specific patterns.
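The truncation step and the "proteins start with M" observation can be illustrated with a naive per-position tally. This is not what MEME or GibbsCluster do (there is no alignment and no statistical model); it is only a minimal sketch, with a hypothetical function name and made-up sequences, of how a trivially conserved first position shows up while later positions do not.

```python
from collections import Counter


def position_consensus(seqs, n_residues=40):
    """Most common residue and its frequency at each of the first n positions.

    Sequences are truncated to `n_residues`; positions beyond the end of a
    shorter sequence are simply skipped for that sequence.
    """
    truncated = [s[:n_residues] for s in seqs]
    consensus = []
    for pos in range(n_residues):
        column = [s[pos] for s in truncated if len(s) > pos]
        if not column:
            break  # no sequence is long enough to cover this position
        residue, count = Counter(column).most_common(1)[0]
        consensus.append((residue, count / len(column)))
    return consensus


# Made-up sequences for illustration only.
example = ["MKLAVLG", "MRLVTWS", "MKAVPQE"]
for pos, (residue, freq) in enumerate(position_consensus(example, n_residues=4), 1):
    print(pos, residue, round(freq, 2))
```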
In eukaryotes, SignalP 6.0 predicts signal peptides of the Sec pathway. The same high-confidence sequences used in the tests above were submitted to SignalP 6.0. Depending on the compartment, 0-42% of proteins were predicted to have a signal peptide, which seems to be quite a low hit rate. This may in part be explained by proteins that are delivered to their compartments through non-secretory pathways, or that are transported along with proteins they associate with.
What is more surprising is that organelles which do not receive proteins through the Sec pathway (mitochondrion, peroxisome and nucleus) still had relatively high hit rates (10-17%). These hits might be proteins that are delivered both to compartments of the Sec pathway and to those organelles, or they might simply be false positives.
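The per-compartment hit rates discussed above amount to a simple tally. The sketch below assumes that the SignalP calls have already been reduced to one boolean per protein (True if a signal peptide was predicted); it does not parse SignalP's actual output files, and the compartment names and values are invented rather than taken from the project's results.

```python
def hit_rates(predictions):
    """Percentage of proteins with a predicted signal peptide per compartment.

    predictions: dict mapping compartment name -> list of booleans.
    Compartments without any proteins are left out to avoid dividing by zero.
    """
    return {
        compartment: 100.0 * sum(calls) / len(calls)
        for compartment, calls in predictions.items()
        if calls
    }


# Invented toy data, not results from this project.
toy_calls = {
    "compartment_a": [True, False, False, False],
    "compartment_b": [False, False],
}
for compartment, rate in hit_rates(toy_calls).items():
    print(f"{compartment}: {rate:.1f}%")
```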
Since this classification seems to be rather weak, the SignalP predictions were integrated into the classifier described in project B, along with other data.
Table 1 | Signal peptide predictions via SignalP 6.0 of COMPARTMENTS high localisation confidence entries (>= 4*).