Sequence Comparison and Amino Acid Locations
A sequence alignment comparing GsI RT and Tph RT shows a high degree of similarity between the two proteins. The Tph RT shares ~67% sequence identity with the GsI RT. In total there are 135 differences between the two sequences. This total does not include the potential 50 extra amino acids at the N-terminus of the Tph RT. The locations of the differences can be seen in the sequence alignment, where they are marked by asterisks. Individual differences are color-coded: purple represents a favorable substitution, defined by a positive BLOSUM50 score, and brown denotes an unfavorable substitution, defined by a negative BLOSUM50 score.
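The tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used for the alignment: the function name `compare_aligned`, the decision to skip gapped columns, and the "neutral" class for zero scores are all assumptions, and the scores in `TOY_SCORES` are placeholders rather than real BLOSUM50 entries. With Biopython installed, `Bio.Align.substitution_matrices.load("BLOSUM50")` could supply the real scores instead.

```python
def compare_aligned(seq_a, seq_b, score):
    """Tally identities and substitution classes over two aligned sequences.

    seq_a, seq_b: equal-length aligned strings; '-' marks a gap.
    score: callable (a, b) -> number, for example a BLOSUM50 lookup.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"compared": 0, "identical": 0, "favorable": 0,
              "unfavorable": 0, "neutral": 0}
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are not counted
        counts["compared"] += 1
        if a == b:
            counts["identical"] += 1
            continue
        s = score(a, b)
        if s > 0:
            counts["favorable"] += 1    # purple in the alignment figure
        elif s < 0:
            counts["unfavorable"] += 1  # brown in the alignment figure
        else:
            counts["neutral"] += 1      # assumption: zero is its own class
    compared = counts["compared"]
    counts["percent_identity"] = 100.0 * counts["identical"] / compared if compared else 0.0
    return counts


# Toy scores for illustration only; these are NOT real BLOSUM50 entries.
TOY_SCORES = {frozenset("KR"): 1, frozenset("VI"): 1, frozenset("EG"): -1}


def toy_score(a, b):
    return TOY_SCORES.get(frozenset((a, b)), 0)


# Hypothetical seven-column alignment, unrelated to the real RT sequences.
result = compare_aligned("MKLVE-A", "MRLIGSA", toy_score)
print(result)
```

Passing the scoring function as an argument keeps the tally independent of where the substitution matrix comes from, so the same loop works with any matrix lookup.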
To the right of the aligned sequences is the structure of the GsI RT with the amino acid substitutions highlighted in cyan. The majority of the substitutions are at solvent-exposed positions, and most of these exposed substitutions had an unfavorable substitution score. The locations of the substitutions, and their absence in or near the active site, reflect the high degree of conservation found in the core of group II intron RTs. The core of the enzyme is crucial for activity, whereas surface residues are relatively more mutable and have lower sequence conservation. Since most substitutions are located on the surface, it is possible that they have affected the solvation of the protein and its ability to bind its encoding intron. The Tph RT sequence was then folded using structure prediction programs.
Model Comparisons
Models were aligned to the published GsI RT structure. Image A shows the AlphaFold2 model in red aligned to the published GsI RT structure in cyan. The AlphaFold2 model agreed closely with the published GsI RT structure, even in more flexible regions like the fingers loop that encloses the active site. Image B shows all the models aligned to the GsI RT structure in dark blue. All of the models aligned well in the highly conserved regions in and around the active site of the RT binding pocket. This agrees with the previous observation that most of the substitutions were solvent-exposed surface residues, which are least likely to affect the structure of the RT.
Image C once again shows all the models aligned to the GsI RT structure, shown in dark blue. The predicted models are all colored gray to highlight their deviations from the established structure. Image D focuses on the top of the thumb of the RT and the D domain, which is composed of several helices attached to the triple-helix thumb. The models agree less in this region, which is also the region of RTs with the greatest sequence variability. Image D shows a close view of the active site of the RT. Relative to the thumb and D domain, the models here align much more closely to the GsI RT structure. This region of the protein also has stronger sequence conservation. Of the structures seen in this image, the fingers deviate most from the GsI RT structure. During catalysis the fingers are mobile, moving from an open to a closed conformation to enclose the incoming dNTP substrate. Deviation from the GsI RT structure also correlates with the B-factors of the available structure, which increase radially outward from the enzyme core.
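One way to put a number on the relationship between model deviation and B-factor is sketched below. It assumes the model has already been superposed on the reference and that matching C-alpha coordinates have been extracted in the same residue order. The function names and all of the coordinates and B-factors are hypothetical illustrations, not values from the GsI RT structure.

```python
import math


def per_residue_deviation(ref_coords, model_coords):
    """Distance (in Angstroms) between matching atoms of two superposed structures."""
    if len(ref_coords) != len(model_coords):
        raise ValueError("coordinate lists must have equal length")
    return [math.dist(r, m) for r, m in zip(ref_coords, model_coords)]


def rmsd(deviations):
    """Root-mean-square deviation from a list of per-residue distances."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: four C-alpha positions and made-up reference B-factors.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 1.0, 0.0), (3.0, 1.5, 0.0)]
b_factors = [10.0, 20.0, 30.0, 40.0]

deviations = per_residue_deviation(ref, model)
print(deviations, rmsd(deviations), pearson(deviations, b_factors))
```

A correlation near 1 would indicate that residues with high B-factors are also the ones where the models stray furthest from the reference, which is the pattern described above.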
Electrostatic Surfaces
This image is a compilation of the electrostatic surfaces, calculated in vacuum, of the GsI RT and of the Tph RT in each of the models used. The areas of positive and negative charge largely agree; however, the Tph RT seems to have some exposed negative charge in areas that are hypothesized to be involved in intron binding. Negative charges might discourage binding to RNA without hampering protein solvation. For example, arrows A-D indicate areas of stronger negative charge relative to the GsI RT structure. The distribution of negative charge is not identical across all the models; however, there is a trend of negative charge intruding into areas of positive charge. Such a distribution of charges could help the RT dissociate from the intron it is bound to while remaining soluble, allowing it to bind another folded intron copy to aid in splicing.
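For intuition about what a vacuum electrostatic calculation computes, the sketch below sums Coulomb potentials from point charges at a probe position. It is a simplified stand-in and not the program used to generate the surfaces, which is not named here; real tools assign partial charges to every atom and evaluate the potential over the whole molecular surface. The function name and the two-charge example are hypothetical.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2); relative permittivity of 1 (vacuum).
COULOMB_KCAL = 332.0636


def vacuum_potential(point, charges):
    """Electrostatic potential at `point`, in kcal/(mol*e).

    charges: iterable of (position, q) pairs, positions in Angstroms and
    q in units of the elementary charge.
    """
    total = 0.0
    for position, q in charges:
        r = math.dist(point, position)
        if r == 0.0:
            raise ValueError("probe point coincides with a charge")
        total += COULOMB_KCAL * q / r
    return total


# Hypothetical example: one positive and one negative unit charge, 4 Angstroms apart.
charges = [((0.0, 0.0, 0.0), +1.0), ((4.0, 0.0, 0.0), -1.0)]

midpoint = vacuum_potential((2.0, 0.0, 0.0), charges)       # contributions cancel
near_positive = vacuum_potential((-1.0, 0.0, 0.0), charges)  # dominated by the + charge
print(midpoint, near_positive)
```

The second probe illustrates the trend discussed above: a nearby negative charge lowers the potential of an otherwise positive patch, here from 332.06 to about 265.65 kcal/(mol*e).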
Structure Prediction and N-Terminal Extension
The Tph RT has two potential translation start sites. Translation from the first start codon would add 50 amino acids to the N-terminus of the Tph RT. It is unclear whether these amino acids are included during translation of the coding sequence. To help resolve this uncertainty, I submitted the longer Tph RT sequence to the same protein structure prediction programs used previously. The Aligned Models image is a compilation of the predicted structures, all aligned to the GsI RT structure. Similar to the results above, the models agreed in the main body of the protein. However, only three of the programs returned a structure that included the N-terminal addition; these are showcased above. Even though AlphaFold2, I-TASSER, and RaptorX all returned predicted structures for the N-terminal addition, they did not agree on its tertiary structure. Since many protein structure prediction programs depend on sequence homology and available structures, I used BLAST to search for homologous sequences and determine the prevalence of this N-terminal addition.
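The idea of an upstream, in-frame start codon lengthening the N-terminus can be illustrated with a short scan. This is a sketch under stated assumptions: it considers only ATG as a start codon (alternative starts such as GTG or TTG are ignored), it stops at the first in-frame stop codon, and the function name and DNA string are hypothetical and unrelated to the real Tph RT gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def n_terminal_extensions(dna, annotated_start):
    """Find in-frame ATG codons upstream of an annotated start position.

    Returns a list of (position, extra_residues) pairs, where extra_residues
    is the number of amino acids the upstream start would add. The scan walks
    backwards codon by codon and stops at the first in-frame stop codon.
    """
    dna = dna.upper()
    hits = []
    pos = annotated_start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # an upstream start beyond a stop cannot read through
        if codon == "ATG":
            hits.append((pos, (annotated_start - pos) // 3))
        pos -= 3
    return hits


# Hypothetical toy gene: TAA ATG GCT GCT ATG AAA TAA, annotated start at index 12.
toy_dna = "TAAATGGCTGCTATGAAATAA"
extensions = n_terminal_extensions(toy_dna, 12)
print(extensions)
```

In this toy case the upstream ATG at index 3 would add three residues. Applied to the real coding sequence, the same scan would report the 50-residue extension, but it cannot say which start site is actually used.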
The BLAST homology search returned only six candidate sequences. All the returned sequences were reverse transcriptases, and within this collection the amino acid sequence was conserved. The conserved sequence is shown above as a WebLogo. However, none of the returned proteins had associated structural data. This, combined with the low prevalence of the sequence, likely explains why the structure modeling programs could not reach consensus on the structure. Few homologous sequences and a lack of structural data do not provide enough information to confidently fold the sequence.
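The letter heights in a sequence logo reflect per-column information content, which can be sketched as follows. This is a simplified version of the calculation: it omits the small-sample correction that WebLogo applies, which matters with only six sequences, and the six short sequences below are hypothetical placeholders rather than the BLAST hits.

```python
import math
from collections import Counter


def column_information(aligned_seqs, alphabet_size=20):
    """Information content in bits for each column of an alignment.

    Computed as log2(alphabet_size) minus the Shannon entropy of the column.
    Gaps ('-') are ignored, and no small-sample correction is applied.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    max_bits = math.log2(alphabet_size)
    bits = []
    for i in range(length):
        column = [s[i] for s in aligned_seqs if s[i] != "-"]
        if not column:
            bits.append(0.0)
            continue
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
        bits.append(max_bits - entropy)
    return bits


# Six hypothetical aligned fragments standing in for the BLAST hits.
toy_hits = ["MKA", "MKA", "MKS", "MRS", "MRA", "MKS"]
bits = column_information(toy_hits)
print(bits)
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column split evenly between two residues loses exactly one bit.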