Last month, in the previous update to the ProtoSyn blog, I said that support for non-canonical amino acids (NCAAs) was a planned feature for 2022. Well, I was wrong. It turns out that supporting NCAAs was much more straightforward than expected, and over the past month I had the chance to make most of the changes required to reach this goal. You can find a much more detailed (and technical) discussion on the topic in this GitHub issue: https://github.com/sergio-santos-group/ProtoSyn.jl/issues/31. In any case, here's a quick overview of the changes introduced.
1- Adding NCAAs to an L-Grammar
I've identified two types of NCAAs. On one hand, we have the completely new and alien NCAAs, such as coronamic acid, while on the other hand we find the natural amino acids with post-translational changes (this group, I believe, forms the majority of NCAAs of interest). Since ProtoSyn's peptide building and mutation methods use L-Grammars as the source of the definitions for the amino acids they use, all we need to do to introduce new amino acids is to define them correctly in such a grammar. To do that, there's actually only one requirement: correct names for the backbone atoms. I was quite surprised when I realized this myself! The backbone atoms (identified by name) are used to:
Define what the sidechain is;
Define which atoms to connect between residues (in the case of the peptide bond, ProtoSyn looks for the C atom in residue 1 and the N atom in residue 2).
As such, the only requirement for a new NCAA to be added to the Peptides default grammar is that the backbone atoms are named N, H, CA, C and O. All other atoms must be named something else (such as C1, C2, N1, etc). Using ProtoSyn, a user can obtain a "somewhat stable" structure for a new NCAA through energy minimization, save it as a .YML file and add it to the correct Grammar folder. This new Grammar can also be "appended" to the existing default grammar, thereby extending support to both natural amino acids and NCAAs simultaneously.
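The naming rule above can be illustrated with a small, standalone check. This is only a sketch in plain Python, not ProtoSyn code: the function name `split_residue` and the example atom list are hypothetical, and treating all five backbone names as mandatory is an assumption taken directly from the rule as stated here.

```python
# Illustrative sketch only: this is NOT ProtoSyn code.
# Backbone atom names as listed in the text above.
BACKBONE_NAMES = ("N", "H", "CA", "C", "O")


def split_residue(atom_names):
    """Split a residue's atom names into (backbone, sidechain).

    Raises ValueError if any backbone name is missing or duplicated,
    since the backbone is identified purely by atom name.
    """
    backbone = []
    for name in BACKBONE_NAMES:
        count = atom_names.count(name)
        if count != 1:
            raise ValueError(
                f"expected exactly one backbone atom named {name!r}, found {count}"
            )
        backbone.append(name)
    # Everything that is not a backbone atom is treated as sidechain.
    sidechain = [name for name in atom_names if name not in BACKBONE_NAMES]
    return backbone, sidechain


# Hypothetical NCAA: backbone atoms plus arbitrarily named sidechain atoms.
ncaa_atoms = ["N", "H", "CA", "C", "O", "C1", "C2", "C3", "H1", "H2"]
backbone, sidechain = split_residue(ncaa_atoms)
print("backbone:", backbone)
print("sidechain:", sidechain)
```

A structure that fails this kind of check (for example, one whose carbonyl carbon is named C1 instead of C) would need its atoms renamed before being added to the grammar.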
2- Rotating chi dihedrals in NCAAs
Mutating a residue to a new NCAA is only part of the issue. To actually explore conformations and employ NCAAs in design protocols, users need to rotate the new chi dihedral angles in the NCAA (and even explore rotamer libraries). Identifying chi dihedral angles automatically would be possible, but would likely fail for the more exotic structures. As such, a more direct and simple procedure is to manually define which dihedrals are rotatable in the Grammar entry itself. This strategy proved to be quite simple and foolproof to use.
By default, ProtoSyn uses the Dunbrack Rotamer Library (containing the most energetically favourable / common rotamers for the natural amino acids). Such information is, unfortunately, not available for most NCAAs. For this reason, the next best thing is to develop custom rotamer libraries. Two types of rotamer libraries can be used in ProtoSyn: backbone dependent and backbone independent. As the name implies, backbone dependent rotamer libraries (such as the default Dunbrack rotamer library) include a knowledge-based bias towards certain rotamers based on the current backbone phi and psi dihedral angle values.
When attempting to create custom rotamer libraries for NCAAs, more often than not, the most practical path is to simply define backbone independent rotamer libraries: rotate each chi dihedral by a certain step and evaluate the desired energy function (for example, the TorchANI machine learning model combined with an all-atom clash restraint). The probability of adopting a given set of chi angles (a rotamer) therefore decreases as the evaluated energy increases, which filters out clashing or energetically unfavourable rotamers. The created rotamer library can then be saved in the correct folder for use in design and sidechain packing efforts.
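The grid-scan idea behind a backbone independent rotamer library can be sketched in a few lines of plain Python. This is not ProtoSyn's implementation: the function names are hypothetical, the energy function is a toy placeholder standing in for something like TorchANI plus a clash restraint, and the Boltzmann weighting exp(-E / kT) is an assumption about how "lower energy means higher probability" could be made concrete.

```python
import itertools
import math


def build_rotamer_library(n_chis, step_deg, energy_fn, kT=1.0):
    """Scan every chi angle on a regular grid and weight each rotamer.

    Returns a list of (chi_angles, probability) sorted by decreasing
    probability. Weights are Boltzmann factors, exp(-E / kT), which is
    an assumption made for this sketch.
    """
    angles = range(-180, 180, step_deg)
    rotamers = list(itertools.product(angles, repeat=n_chis))
    energies = [energy_fn(chis) for chis in rotamers]
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    library = [(chis, w / total) for chis, w in zip(rotamers, weights)]
    library.sort(key=lambda entry: entry[1], reverse=True)
    return library


def toy_energy(chis):
    """Placeholder energy (arbitrary units) favouring staggered angles.

    Each chi contributes 1 + cos(3 * chi), which is lowest at
    -180, -60 and 60 degrees and highest at -120, 0 and 120 degrees.
    """
    return sum(1.0 + math.cos(math.radians(3 * chi)) for chi in chis)


# Two rotatable chi dihedrals, scanned in 60 degree steps.
library = build_rotamer_library(n_chis=2, step_deg=60, energy_fn=toy_energy)
for chis, probability in library[:3]:
    print(chis, round(probability, 4))
```

A smaller step gives a finer library at the cost of many more energy evaluations: the number of rotamers grows as (360 / step) raised to the number of chi dihedrals.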
3- Adding post-translational modifications (PTMs)
As I stated above, most useful cases of NCAAs in existing proteins and peptides consist of post-translational modifications (PTMs). These include, among others, cases of methylation and phosphorylation. To introduce PTMs in ProtoSyn, a set of the required fragments needs to exist, similarly to the amino acids L-Grammar (such as -CH3 or -PO4 groups, for methylation and phosphorylation, respectively). After correctly defining these fragments, users can call ProtoSyn.replace_by_fragment! to replace a hydrogen with the new fragment, thus introducing a PTM.
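Conceptually, replacing a hydrogen with a fragment is a small graph operation: remove the hydrogen and bond the fragment's root atom to the hydrogen's former parent. The sketch below shows that idea on a plain list of bonds. It is not ProtoSyn.replace_by_fragment! and says nothing about that method's signature; the function name `replace_hydrogen`, the bond-list representation and the atom names are all hypothetical, and coordinates are ignored entirely.

```python
def replace_hydrogen(bonds, hydrogen, fragment_bonds, fragment_root):
    """Return a new bond list with `hydrogen` swapped for a fragment.

    bonds and fragment_bonds are lists of (atom_a, atom_b) name pairs.
    Atom names are assumed to be unique across residue and fragment.
    """
    # Find the single atom the hydrogen is bonded to.
    parents = [a if b == hydrogen else b for a, b in bonds if hydrogen in (a, b)]
    if len(parents) != 1:
        raise ValueError(
            f"{hydrogen!r} must have exactly one bond, found {len(parents)}"
        )
    parent = parents[0]
    # Drop the hydrogen, then attach the fragment root to its old parent.
    kept = [bond for bond in bonds if hydrogen not in bond]
    return kept + [(parent, fragment_root)] + list(fragment_bonds)


# Terminal part of a lysine-like sidechain (hypothetical names).
lysine_tail = [("CE", "NZ"), ("NZ", "HZ1"), ("NZ", "HZ2"), ("NZ", "HZ3")]

# Methyl fragment (-CH3) with root atom CM.
methyl = [("CM", "HM1"), ("CM", "HM2"), ("CM", "HM3")]

methylated = replace_hydrogen(lysine_tail, "HZ1", methyl, fragment_root="CM")
print(methylated)
```

In a real modelling tool the new atoms would also need sensible internal coordinates (bond length, angle and dihedral relative to the parent), which this sketch leaves out.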
In conclusion, the past month was quite busy with new changes to ProtoSyn. Hopefully, native and direct support for NCAAs is a feature that allows ProtoSyn to stand out and become a useful tool in the toolbox of any computational chemist. Any feedback is more than welcome! Feel free to send an e-mail or open a new Issue on the GitHub page. ProtoSyn is a tool by everyone, for everyone.
José Pereira
2022 will be the first year ProtoSyn experiences as a fully fledged open-source platform. There is much work to be done before it is adopted as a production solution, but early feedback has been promising. A couple of features have been requested and are planned to be introduced in the upcoming months:
Extension to non-canonical amino acids: ProtoSyn uses a default Peptides grammar (with the 20 natural amino acids) as a basis to build and design peptides. This grammar is, in essence, a database of .YML files containing the internal coordinates and bond information for each of these amino acids. In practice, adding new residue types is as simple as adding new .YML files to this database. In fact, experimental support for certain sugars has been implemented in the Sugars module, where a separate grammar was compiled. Non-canonical amino acids, however, have certain specific requirements: some, for example, bond to more than one amino acid and are therefore ramified (branched) rather than linear. ProtoSyn was developed with this constraint in mind, and implements a stochastic L-Grammar as the topology resolution system. This means that ramification is supported, and new and exciting opportunities for design and fold exploration are open! Implementing non-canonical amino acids is, therefore, straightforward and supported by the native code. Certain modifications to other parts of the code might be necessary, for example, including the correct parameters in the Doolittle hydrophobicity index. Still, we expect this feature to be implemented in ProtoSyn in the early months of 2022.
Adding pre-existing energy functions (such as the Rosetta Ref15): implementing new energy function components is not only supported but also encouraged in ProtoSyn (the documentation even has a section specifying how to go about adding new energy function components). As such, replicating common energy functions used elsewhere is possible and easy. Implementing the Ref15 energy function, from the Rosetta software, is planned for mid-2022, given the large scope and number of necessary energy function components: Ref15 has more than 20 components!
Support for a Molecular Dynamics Driver: in earlier versions of ProtoSyn, a working prototype of a molecular dynamics driver was implemented and briefly tested, and the legacy code can currently be found in the src/ directory of ProtoSyn. This Driver is no longer supported and needs to be revamped and reworked, but support for Molecular Dynamics in ProtoSyn is planned. Expect this Driver to be re-added in late 2022 (or, who knows, perhaps sooner!).
Hopefully this dev log post sheds some light on the upcoming future of ProtoSyn and what to expect. Users are always encouraged to participate and take on any (or all) of these challenges: ProtoSyn is an open-source project, built by the community for the community. All the best, and happy coding,
José Pereira
ProtoSyn v1.0 was released just 3 months ago. Since then, development has mainly focused on fixing bugs and introducing slight changes. Version 1.01 is scheduled for release later this year, with some quality-of-life changes:
Revamped the apply_potential function, with more user freedom and customization: potentials now follow a more streamlined workflow, allowing easier design of new potentials;
Revamped the Caterpillar solvation energy component: new methods for burial degree calculation (such as Neighboring Vectors) and better access to fine-tuning parameters allow for a much more in-depth exploration of the solvation topology of a Pose;
Added the download method: downloads the requested PDB file directly from the RCSB Protein Data Bank;
Added the load_trajectory method: loads PDBs with multiple frames as a vector of Pose structs;
Added the fixate_masks method: automates the task of fixing the masks for non-design efforts. This dramatically increases performance by not employing dynamic masks;
Added the hydrogen bond energy component: identifies and quantifies hydrogen bonding pairs in Poses based on geometric criteria.
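As a rough illustration of what "geometric criteria" for hydrogen bonds usually means, the sketch below accepts a donor, hydrogen and acceptor triplet when the hydrogen-acceptor distance is short enough and the donor-hydrogen-acceptor angle is close enough to linear. This is plain Python, not ProtoSyn's energy component: the function names are hypothetical, and the 2.5 Å distance and 120 degree angle cutoffs are assumed values chosen only for the example.

```python
import math


def angle_deg(a, b, c):
    """Angle (in degrees) at point b formed by the points a-b-c."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def is_hydrogen_bond(donor, hydrogen, acceptor, max_dist=2.5, min_angle=120.0):
    """Geometric hydrogen bond test on 3D coordinates (in angstrom).

    max_dist and min_angle are assumed cutoffs for this sketch only.
    """
    close_enough = math.dist(hydrogen, acceptor) <= max_dist
    linear_enough = angle_deg(donor, hydrogen, acceptor) >= min_angle
    return close_enough and linear_enough


donor = (0.0, 0.0, 0.0)
hydrogen = (1.0, 0.0, 0.0)

print(is_hydrogen_bond(donor, hydrogen, (2.9, 0.0, 0.0)))  # in line, 1.9 A away
print(is_hydrogen_bond(donor, hydrogen, (1.0, 1.9, 0.0)))  # 90 degree angle
print(is_hydrogen_bond(donor, hydrogen, (5.0, 0.0, 0.0)))  # too far away
```

A yes/no test like this only identifies pairs; quantifying them would additionally require a score that varies with distance and angle.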
With this small update, we hope ProtoSyn becomes sturdier and easier to use, with a focus on giving users access to all the parameters necessary to fine-tune their experiments and prototypes. Until next time,
José Pereira
If you'd like to know more, please contact:
jose.manuel.pereira@ua.pt