Humans and yeast are estimated to have diverged from a common ancestor about 1 billion years ago. Yet yeast remains a well-known model organism that we still use for human genetics, drug testing, and the study of cellular processes. Humans and yeast share many features. Cell division in yeast is similar to cell division in human cells, which makes yeast of great interest for cancer research. Many human genes involved in disease have functional equivalents in yeast, and many studies have introduced and observed mutations in a yeast gene to hypothesize how the equivalent mutation would affect humans. We can relate yeast functions and mutations to human functions because homologous genes, and orthologous genes in particular, tend to retain similar functions across diverged species.
Despite how extensively we use yeast to explain human mechanisms and processes, we still do not understand why some human orthologs can replace their yeast counterpart while others cannot. For example, in Figure 1, it is puzzling that CDK2 is the only human ortholog in the figure that cannot replace the yeast protein YBR160W. The CDK proteins and YBR160W are all cyclin-dependent kinases, which are very important for cell division.
For my project, I wanted to computationally analyze why certain human orthologs can replace their yeast counterpart while others cannot, and to form hypotheses about the cause. To do this, I compared the amino acid sequence of the yeast protein to that of each human ortholog, and compared the protein structures of the yeast protein and the human orthologs.
Figure 1. Superimposed structures of YBR160W, CDK1, CDK2, and CDK3. The yeast protein (YBR160W) is colored green, CDK1 tan, CDK2 light blue, and CDK3 lavender. CDK2 is the only ortholog that could not replace YBR160W.
To analyze and compare the yeast protein with the human orthologs, I:
used HHPred with Modeller, and I-TASSER, to create the homology models.
created a sequence alignment between the yeast protein and the human orthologs.
did a full protein sequence comparison between the yeast protein and each human ortholog to calculate percent identity and percent similarity.
identified the surface accessible amino acids.
compared the surface accessible amino acid sequence of the yeast protein to that of each human ortholog.
did a full protein structure comparison between the yeast protein and the human orthologs.
compared the surface accessible amino acids of the protein structure between the yeast protein and each human ortholog.
used R to plot the distributions of the complement and non-complement groups.
ran a t-test to test whether the means of the two groups (complement and non-complement) differed significantly.
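The last step in the list above can be sketched in code. The report does not say which form of the t-test was run; the sketch below assumes Welch's unequal-variance test, which is what R's t.test uses by default. It computes only the t statistic and the degrees of freedom with the Python standard library. The function name welch_t and the sample values are hypothetical and are not data from this project.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variance divided by group size, one term per group.
    se_a = variance(group_a) / n_a
    se_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Made-up values for illustration only (assumption, not project data).
complement = [0.62, 0.71, 0.68, 0.80, 0.74]
non_complement = [0.79, 0.88, 0.91, 0.76, 0.85]
t_stat, dof = welch_t(complement, non_complement)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Turning the t statistic into a p-value requires the cumulative distribution of Student's t, which a statistics package would normally supply.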
Figure 2. Diagram of the methods. The upper-left picture is the FASTA sequence of the full protein. The upper-right picture is the FASTA sequence of the surface accessible amino acids. The lower-right picture is a Chimera image of the human orthologs superimposed onto the yeast protein. The lower-left picture is a Chimera image of the surface accessible amino acids of the human orthologs superimposed onto the surface accessible amino acids of the yeast protein.
Several analyses were run to test different features of both the amino acid sequence and the protein structure, and to hypothesize whether a given feature is associated with the ability of a human ortholog to replace its yeast counterpart. The full amino acid sequence was analyzed first, by calculating the percent similarity between the yeast protein sequence and each human ortholog sequence. Percent similarity seemed the most informative sequence measure: over the course of evolution some amino acids have changed, but their physical properties have often stayed the same, so similarity seemed the most appropriate way to measure how close two sequences are. The mean percent similarity was 56.39929 for the complement group and 56.56770 for the non-complement group. A t-test was used to test whether the two group means differed, and the p-value was 0.9598. This large p-value indicates that full-sequence similarity is not a good indicator of whether a human ortholog can replace the yeast protein.
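As an illustration of how percent identity and percent similarity can be computed from a pairwise alignment, here is a minimal sketch. The report does not state how similar residues were defined, so the property groups in SIMILAR_GROUPS are an assumption made for this example (alignment tools commonly use a substitution matrix instead), as is the choice to skip gap columns. The function name and the two short aligned sequences are hypothetical.

```python
# Hypothetical similarity groups by side-chain property (an assumption;
# not necessarily the definition used for the results in this report).
SIMILAR_GROUPS = [
    set("AVLIMC"),  # aliphatic / hydrophobic
    set("FWY"),     # aromatic
    set("ST"),      # hydroxyl-containing
    set("NQ"),      # amide
    set("DE"),      # acidic
    set("KRH"),     # basic
]


def percent_identity_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) over non-gap columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns (one of several possible conventions)
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared


# Toy aligned sequences, for illustration only.
identity, similarity = percent_identity_similarity("MKV-LD", "MRVALE")
print(identity, similarity)
```

In the toy alignment, three of the five compared columns are identical and the other two (K/R and D/E) fall in the same property group, which is why similarity can be much higher than identity.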
After looking at the full amino acid sequence, the next step was to extract the surface accessible amino acids from the sequence and analyze whether there is a relationship between the yeast surface accessible amino acids and those of the human orthologs. Surface accessible amino acids were chosen because some of them are involved in protein-protein interactions or in active sites on the surface. Both active sites and protein-protein interaction sites are very important for a protein to carry out its cellular function, so the surface accessible amino acids could point to a reason why certain human orthologs can replace the yeast proteins. Once the surface accessible amino acids had been extracted from the amino acid sequence, the percent similarity was calculated between the yeast protein and each human ortholog. The mean percent similarity was 50.55641 for the complement group and 49.98788 for the non-complement group. A t-test was used to test whether the two group means differed, and the p-value was 0.8968. This large p-value indicates that similarity of the surface accessible amino acid sequence is not a good indicator of whether a human ortholog can replace the yeast protein.
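The extraction step can be pictured with a short sketch. The report does not describe the exact criterion used to call a residue surface accessible, so this example assumes that a per-residue relative solvent accessibility value (between 0 and 1) is already available from a structure analysis tool, and it uses a hypothetical cutoff of 0.25. The function name, the cutoff, and the toy values are all assumptions.

```python
def surface_subsequence(sequence, rel_accessibility, cutoff=0.25):
    """Return (1-based positions, residues) with accessibility >= cutoff.

    The cutoff of 0.25 is a hypothetical default, not a value from the report.
    """
    if len(sequence) != len(rel_accessibility):
        raise ValueError("need one accessibility value per residue")
    kept = [
        (position, residue)
        for position, (residue, acc) in enumerate(
            zip(sequence, rel_accessibility), start=1
        )
        if acc >= cutoff
    ]
    positions = [position for position, _ in kept]
    residues = "".join(residue for _, residue in kept)
    return positions, residues


# Toy input, for illustration only.
positions, residues = surface_subsequence(
    "MKVLD", [0.60, 0.10, 0.30, 0.05, 0.25]
)
print(positions, residues)
```

Keeping the original positions alongside the residues makes it possible to map the surface subset back onto the alignment or the structure later.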
Once the full sequence and the surface accessible amino acid sequence had been analyzed, the next step was to analyze the structure. First, each full human ortholog was superimposed onto the full yeast protein and the RMSD score was recorded. RMSD stands for root-mean-square deviation, a value that scores the deviation between the two proteins being compared; the smaller the RMSD, the more similar the structures. In contrast to the sequence measures, where a higher value indicates a closer match, the ideal outcome here would be for the replaceable human orthologs to have a lower RMSD than the non-replaceable ones. The average RMSD was 0.7589048 for the complement group and 0.8555172 for the non-complement group. A t-test was used to test whether the two group means differed, and the p-value was 0.2343. This p-value is still large and the means are not significantly different, but it is much smaller than the p-values calculated for the sequences, so further analysis of the structure of the human orthologs could help explain why certain human orthologs are replaceable in yeast.
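To make the RMSD score concrete, here is a minimal sketch of the calculation for two sets of atom coordinates. It assumes the structures have already been superimposed and that the atoms are paired one-to-one in the same order; the superposition itself, which the structure viewer performs, is not shown. The function name and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes both structures are already superimposed and atom-paired.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates: every atom in the second set is shifted by 1 along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rmsd(structure_a, structure_b))
```

Because the deviations are squared before averaging, a few badly matched atoms raise the score more than many small mismatches do.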
After analyzing the full protein structure, I analyzed the surface accessible amino acids of the structure and compared the yeast protein with the human orthologs. The average RMSD was 0.7092857 for the complement group and 0.8175977 for the non-complement group. A t-test was used to test whether the two group means differed, and the p-value was 0.0384. This p-value is below the conventional 0.05 threshold, so the two means are significantly different. From this result, we can hypothesize that the reason certain human orthologs are replaceable while others are not could lie in their surface structure. Even though the difference is significant, it does not establish that the surface is responsible for replaceability, and it is not enough to accurately predict replaceability.
Figure 3. This graph plots counts (y-axis) against the percent similarity of the full protein sequence between the yeast protein and each human ortholog, split into two groups: complement and non-complement. The mean was 56.39929 for the complement group and 56.56770 for the non-complement group. A t-test of whether the two group means differed gave a p-value of 0.9598.
Figure 4. This graph plots counts (y-axis) against the percent similarity of the surface accessible amino acids (5 angstroms) between the yeast protein and each human ortholog, split into two groups: complement and non-complement. The mean was 50.55641 for the complement group and 49.98788 for the non-complement group. A t-test of whether the two group means differed gave a p-value of 0.8968.
Figure 5. This graph plots counts (y-axis) against the RMSD between the yeast and human ortholog structures, split into two groups: complement and non-complement. The mean was 0.7589048 for the complement group and 0.8555172 for the non-complement group. A t-test of whether the two group means differed gave a p-value of 0.2343.
Figure 6. This graph plots counts (y-axis) against the RMSD between the surface structures of the yeast protein and each human ortholog, split into two groups: complement and non-complement. The mean was 0.7092857 for the complement group and 0.8175977 for the non-complement group. A t-test of whether the two group means differed gave a p-value of 0.0384.
Laurent JM, Garge RK, Teufel AI, Wilke CO, Kachroo AH, Marcotte EM (2020) Humanization of yeast genes with multiple human orthologs reveals functional divergence between paralogs. PLoS Biol 18(5): e3000627. https://doi.org/10.1371/journal.pbio.3000627
Kachroo AH, Laurent JM, Yellman CM, Meyer AG, Wilke CO, Marcotte EM. Evolution. Systematic humanization of yeast genes reveals conserved functions and genetic modularity. Science. 2015; 348 (6237):921–5. https://doi.org/10.1126/science.aaa0769 PMID: 25999509; PubMed Central PMCID: PMC4718922.
Protein Sequence Analysis Using the MPI Bioinformatics Toolkit. Gabler F, Nam SZ, Till S, Mirdita M, Steinegger M, Söding J, Lupas AN, Alva V. Curr Protoc Bioinformatics. 2020 Dec;72(1):e108. doi: 10.1002/cpbi.108.
J Yang, R Yan, A Roy, D Xu, J Poisson, Y Zhang. The I-TASSER Suite: Protein structure and function prediction. Nature Methods, 12: 7-8 (2015).
A Roy, A Kucukural, Y Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, 5: 725-738 (2010)
Y Zhang. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics, vol 9, 40 (2008).
A. Šali and T. L. Blundell. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815, 1993.
Van Rossum, G., & Drake Jr, F. L. (2009). Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam.
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/