protein design for dummies

In my quest to understand protein design, I am charting some of the issues I currently face. Mostly I am just a grumpy grad student grappling with the choices of past designers and looking towards a more accurate future. Sure, protein design is crazy difficult, but why has design been special in that it obscures (and often neglects) the hypotheses that are the backbone of scientific pursuit? In general, engineering is a rather binary pursuit: the object works as intended or it doesn't. At some point, we have to identify key features of what we are doing right and wrong so we can do more of the former and correct the latter. Importantly, we should allow everyone access to the knowledge we have gained, especially as machine learning is becoming commonplace in protein design and modelling. Shoot me a message if you have any thoughts!

Primer

Jan, 2022

The reason for these posts has to do with using Rosetta. When reading protein design papers among peers, I commonly hear, "The authors used Rosetta to find the top scoring model and chose this model for gene synthesis." These sorts of statements are misguided. Rosetta is sort of the gold standard for modelling protein energetic landscapes, but one doesn't simply use Rosetta. Rosetta doesn't choose the best model; users and developers choose the best model. As they say, garbage in, garbage out.

The power of Rosetta lies in its score function; it dictates the trajectory of any task, regardless of how you explore that space. Developers have chosen the physicochemical contributions that make Rosetta's score function the most accurate protein folding/design platform available at present. The underlying benefit of Rosetta is the ease of sampling degrees of freedom to minimize a structure's energy, and the score function is the heart of this. Anyone using the function should understand the assumptions that are being made and bake them into their output. If you haven't yet, you owe it to yourself to do your due diligence. In any case, follow me down the rabbit hole!

The function is a highly optimized linear function with an ever-growing hodgepodge of physical forces and statistical representations as parameters. A description of the REF2015 score function (the most recent score function as of 2020), "The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design", lays out the comprehensive methodology for calculating the terms involved, and references articles about optimizing the associated weights (importantly this one). It is clear that the score function is a compromise between computational efficiency and accuracy. Most terms have been truncated with the goal of decreasing computation time over the course of many hundreds of design runs (trajectories). This choice is understandable, as sampling design space hundreds of times provides an entropic perspective that is not obtainable with the score function itself. However, in a world of sub-optimal models, the important point is that the choice of score function parameters leaves many details, still argued among experts, veiled from the understanding of users.
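To make "linear function" concrete, here is a minimal sketch of a score computed as a weighted sum of energy terms. The term names are the ones Rosetta reports, but every number below (energies and weights alike) is a made-up placeholder, and `total_score` is a hypothetical helper, not a Rosetta call.

```python
# Minimal sketch of a linear score function: total = sum(weight * term).
# All numbers are illustrative placeholders, NOT real REF2015 values.

def total_score(term_energies, weights):
    """Return the weighted sum of energy terms (terms without a weight count as 0)."""
    return sum(weights.get(name, 0.0) * value
               for name, value in term_energies.items())

# Hypothetical unweighted term values for one structure.
term_energies = {"fa_atr": -10.0, "fa_rep": 4.0, "fa_sol": 6.0, "ref": 1.5}

# Hypothetical weights; the real weight sets ship with Rosetta.
weights = {"fa_atr": 1.0, "fa_rep": 0.5, "fa_sol": 1.0, "ref": 1.0}

print(total_score(term_energies, weights))
```

Because the function is linear, changing a single weight shifts the total by exactly that term's value times the change in weight, which is why the fitted weights matter as much as the physics inside each term.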

This affects published results that use Rosetta, obscuring from readers the reasons for success or failure. So, how does the function really work, and what might we still be missing, either by negligence or expedience? How can these gaps inform the models we are making and, better yet, tell us about the ones we have failed to access?

As with many questions in research, our knowledge of protein design is biased by success, specifically, human-led notions about success (I once wrote an entire paper about this word. Still have no idea what it truly means). However, ProtaBank is a new resource that might make reporting failures more commonplace, maybe even something we can draw conclusions from. Until then, I am trying to understand what is working.

Conclusions:

  • The score function is a weird hybrid of terms

  • Protein design is 40% science, 40% engineering, 10% hard work, and 10% luck (these percentages are 100% wrong)

Decomposing Rosetta's score function - ref

As a first example, I wanted to dive into one component deep on the list of terms to bring up some interesting questions.

ref

The 'reference energy' is included to distinguish between the folded and unfolded states of a protein. This is accomplished by assigning each residue of the protein a baseline energy for the corresponding amino acid type in an unfolded state. Alright, so this provides some baseline about the upper energies that our protein could access solely given its polymeric composition. Upon further investigation, one finds that baked into ref are additional weights which reflect the amino acid frequencies present in extant protein space. So where Leucine and Tryptophan are the most and least common amino acids, respectively, the ref term takes their abundance into account.
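As a rough sketch of what that means in practice, ref can be pictured as a lookup table of one constant per amino acid type, summed over the sequence. The two values in the table are the ref entries for Asp and Ile from the scoring-tutorial output quoted in this post; the other eighteen amino acids are deliberately left out rather than guessed, and `reference_energy` is a hypothetical helper name.

```python
# Sketch: ref as a per-amino-acid constant summed over the sequence.
# Only Asp (D) and Ile (I) are filled in, using the ref values from the
# tutorial output quoted in this post; other amino acids are omitted on purpose.
REF_ENERGY = {"D": -1.630, "I": 1.081}

def reference_energy(sequence, ref_table=REF_ENERGY):
    """Sum per-residue reference energies; raises KeyError for unlisted amino acids."""
    return sum(ref_table[aa] for aa in sequence)

# ref depends only on composition, never on residue order or conformation.
print(round(reference_energy("DI"), 3))
print(round(reference_energy("ID"), 3))
```

Nothing about the folded structure enters this sum, so two sequences with the same composition get the same ref contribution no matter how they pack.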

So what are ref weights and how are they affecting protein structures? As a more concrete example, I pulled from the scoring tutorial the following scores contributing to a per residue energy (truncating zero energy contributions and some info for clarity).

pdb_id                3        4
fa_atr           -2.666   -5.618
fa_rep            0.270    0.237
fa_sol            2.416    2.802
fa_intra_rep      0.025    0.026
fa_elec          -0.269   -0.123
hbond_lr_bb       0.000   -0.564
hbond_sc         -0.324    0.000
rama              0.000   -0.262
omega             0.010    0.007
fa_dun            1.894    0.814
p_aa_pp           0.000   -0.348
ref              -1.630    1.081
residue_score    -0.273   -1.949

Residue 3 (pdb_id) is an Aspartate, and we can see that it has a couple of terms contributing a magnitude of more than 1 Rosetta energy unit, for a total score (residue_score) of -0.273. Of these scores, ref contributes a favorable -1.630 units. Without it, Asp3 would contribute a positive energy (about +1.36) to the structure, implying that a substitution could be equally favorable in this environment. A similar analysis for Isoleucine 4 shows that the attractive potential from fa_atr is so beneficial that ref's 1.081 makes a relatively smaller contribution. Still, ref accounts for roughly 10% of the summed magnitude of Ile4's terms, no small contribution.
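The arithmetic behind those statements can be checked with a few lines. This is a minimal sketch using only the numbers from the tutorial output above; the "share" is defined here as |ref| divided by the sum of absolute term values, which is my assumption about how to express the ~10% figure, and `ref_report` is a hypothetical helper.

```python
# Sketch: how much does ref contribute to each residue's score?
# Values are copied from the tutorial output quoted in the post.
COLUMNS = ("fa_atr fa_rep fa_sol fa_intra_rep fa_elec hbond_lr_bb hbond_sc "
           "rama omega fa_dun p_aa_pp ref").split()

ROWS = {
    "Asp3": [-2.666, 0.270, 2.416, 0.025, -0.269, 0.000,
             -0.324, 0.000, 0.010, 1.894, 0.000, -1.630],
    "Ile4": [-5.618, 0.237, 2.802, 0.026, -0.123, -0.564,
             0.000, -0.262, 0.007, 0.814, -0.348, 1.081],
}

def ref_report(values):
    """Return (total, total without ref, ref share of summed absolute terms)."""
    terms = dict(zip(COLUMNS, values))
    total = sum(terms.values())
    without_ref = total - terms["ref"]
    # Assumed definition of "share": |ref| over the sum of |term| values.
    share = abs(terms["ref"]) / sum(abs(v) for v in terms.values())
    return total, without_ref, share

for name, values in ROWS.items():
    total, without_ref, share = ref_report(values)
    print(f"{name}: total={total:.3f} without_ref={without_ref:.3f} "
          f"ref_share={share:.1%}")
```

The recomputed totals agree with the reported residue_score values to within rounding of the three-decimal inputs, and removing ref flips Asp3 from slightly favorable to clearly unfavorable while Ile4 stays favorable.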

Taking this example and the overall success of Rosetta in mind, why do these scores accurately contribute to our representation of protein structures?

Well, as we have seen, ref accounts for unfolded states and biases toward amino acid abundance. Another interesting fact: the ref energies change in magnitude and value with each published score function! In fact, the second linked publication above (optimizing score function weights) continually optimized the ref weights to fit the designed sequence as closely as possible to the target sequence, given the other terms in the score function. So ref is actually providing sequence abundance by fitting sequences to benchmarked crystallized proteins (sigh.. a topic for another date). In this case, ref is providing hidden weights to arrive at the correct sequences, something that isn't fully captured by the other score terms. That being the case, if the score function accurately captured the energy of a protein in all of its other terms, there would be no need for this corrective measure. Maybe ref hides within it missing chemical or physical energy values contributed by each amino acid!
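A toy sketch shows how a per-amino-acid offset can absorb whatever the other terms miss. This is not the published fitting procedure, which is far more involved; every energy below is a hypothetical placeholder, and `design` and `recovery` are made-up helper names. The idea is just that "design" picks the lowest-energy amino acid at each position, and the ref offsets are judged by how often that pick matches the native residue.

```python
# Toy sketch of judging ref offsets by native sequence recovery.
# Every energy here is a hypothetical placeholder, NOT Rosetta output.

def design(position_energies, ref):
    """Pick the lowest-energy amino acid at each position, including ref."""
    return [min(energies, key=lambda aa: energies[aa] + ref.get(aa, 0.0))
            for energies in position_energies]

def recovery(position_energies, native, ref):
    """Fraction of positions where the designed amino acid equals the native one."""
    designed = design(position_energies, ref)
    return sum(d == n for d, n in zip(designed, native)) / len(native)

# Hypothetical energies from "all other terms" for three candidate amino acids
# at three positions.
position_energies = [
    {"D": 1.4, "I": 0.9, "L": 1.0},
    {"D": 2.0, "I": -3.0, "L": -2.5},
    {"D": 0.5, "I": 0.2, "L": 0.1},
]
native = ["D", "I", "D"]

no_ref = {}
fitted_ref = {"D": -1.0, "I": 0.0, "L": 0.0}  # hypothetical offset favoring Asp

print(recovery(position_energies, native, no_ref))
print(recovery(position_energies, native, fitted_ref))
```

In this toy, a single constant for Asp is enough to move recovery from one position in three to all three, without saying anything about why Asp was underscored in the first place.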

Given the intention behind ref, it seems haphazardly constructed, yet integral to the function itself. Taking ref for what it actually does, it provides a necessary but unknown contribution that makes Rosetta's score function so useful. It fits the score function to the data as well as possible, and the data are real proteins that are super stable (again.. next post). I argue that the choices around ref are misguided as they are currently described. If we want to take amino acid abundance into account, we should reflect this during Monte Carlo sampling. Proteins evolve through mutation, where sampling new amino acid types is accomplished by their coding sequence. Intuitively, there is no difference in the final energy of a protein with regard to the composition of amino acids. The physical forces that the mutation carries and the local environment of its slated conformational destination dictate whether the mutation will be accommodated. Sure, the production of the protein will be affected by its codon choice, synthesis rates, and metabolic expense on the cell, but that isn't what Rosetta's score function should do! The score function should take a protein and identify states along its energy landscape given modifications to the polymeric degrees of freedom.
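Here is a minimal sketch of what moving abundance into the sampling step could look like: a Metropolis step whose mutation proposals are drawn in proportion to amino acid abundance, while the energy itself carries no composition term. This is plain Python, not Rosetta's mover API. The abundances are hypothetical placeholders for a three-letter alphabet (the only property borrowed from the post is that Leu is common and Trp is rare), and `propose`, `metropolis_step`, and `flat_energy` are made-up names.

```python
import math
import random

# Hypothetical relative abundances for a three-letter alphabet. Only the
# ordering (Leu common, Trp rare) comes from the post; the numbers are invented.
ABUNDANCE = {"L": 0.6, "D": 0.3, "W": 0.1}

def propose(rng, abundance=ABUNDANCE):
    """Draw a candidate amino acid with probability proportional to abundance."""
    letters = list(abundance)
    return rng.choices(letters, weights=[abundance[a] for a in letters], k=1)[0]

def metropolis_step(current, energy, rng, kT=1.0):
    """One Metropolis step: abundance-biased proposal, composition-free energy."""
    candidate = propose(rng)
    delta = energy(candidate) - energy(current)
    if delta <= 0 or rng.random() < math.exp(-delta / kT):
        return candidate
    return current

def flat_energy(aa):
    """Placeholder energy with no preference between amino acids."""
    return 0.0

rng = random.Random(0)
state = "D"
counts = {a: 0 for a in ABUNDANCE}
for _ in range(10_000):
    state = metropolis_step(state, flat_energy, rng)
    counts[state] += 1
print(counts)
```

With a flat energy, the visited states simply follow the proposal abundances, so the bias lives entirely in the sampler. One design caveat: with this plain Metropolis acceptance the proposal bias carries through to what gets sampled, whereas adding a Hastings correction would cancel it, so the acceptance rule is part of the choice being made.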


Conclusions:

  • The description of ref is simply inaccurate

  • ref is an extra parameter to best fit the score function to a benchmarked set of structure sequences

  • Residue abundance is better handled in Monte Carlo sampling within Rosetta's mover environment than as a term in the score function