Results and Discussion

Results

Training of the Hidden Markov Model (HMM)

To estimate the parameters of our HMM, we trained the model on a database of E. coli proteins containing both their amino acid sequences and their secondary structure sequences. The resulting emission probabilities are given in Table 1 and Figure 1. In helices, Alanine (A), Glutamic Acid (E), Leucine (L) and Methionine (M) are more commonly found [Skipper, 2005], while in turns, Proline (P) and Glycine (G) are most often observed [Marcelino and Gierasch, 2008]. Indeed, especially high frequencies of Alanine, Glutamic Acid and Leucine are observed in Fig. 1a, whilst higher frequencies of Glycine in particular are found in the turns (Fig. 1b). Meanwhile, in strands, Valine (V), Threonine (T) and Isoleucine (I) are preferred due to the presence of β-branched side chains [Boyle, 2018], and an especially high proportion of Valine and Isoleucine can be seen in Fig. 1c. Thus, the emission frequencies obtained from the E. coli database are consistent with what is expected. Transition probabilities from one state to another were also obtained, as shown in Table 2 and Figure 2. The BN frequency (beginning to non-structured) was especially high, suggesting that most proteins begin with a non-structured region. Furthermore, each state showed a high propensity to remain in itself, as indicated by the comparatively high frequencies for HH, NN, SS and TT.
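The training step described above amounts to counting transitions and emissions in labelled sequences and normalising the counts. The following is a minimal sketch of that idea, not the code used to produce Tables 1 and 2: the function name `estimate_hmm`, the input format (pairs of equal-length amino acid and state strings) and the toy records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_hmm(pairs, begin="B"):
    """Estimate transition and emission probabilities by counting.

    pairs: iterable of (amino_acid_sequence, state_sequence) strings,
           where both strings in a pair have the same length.
    begin: label of the artificial beginning state (hypothetical default).
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq, states in pairs:
        if not seq or len(seq) != len(states):
            continue  # skip empty or malformed records
        prev = begin
        for aa, st in zip(seq, states):
            trans_counts[prev][st] += 1  # e.g. B->N, N->N, N->H, ...
            emit_counts[st][aa] += 1     # amino acid observed in state st
            prev = st

    def normalise(table):
        # Turn each row of counts into a row of relative frequencies.
        result = {}
        for key, counts in table.items():
            total = sum(counts.values())
            result[key] = {x: c / total for x, c in counts.items()}
        return result

    return normalise(trans_counts), normalise(emit_counts)


# Toy records, invented purely to show the call.
transitions, emissions = estimate_hmm([("AAG", "HHT"), ("GA", "TH")])
print(transitions["B"])
print(emissions["H"])
```

In this sketch, state pairs that never occur in the training data are simply absent from the returned dictionaries, so a decoder using them would need to treat missing entries as zero or apply some form of smoothing.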

Table 1: Emission probabilities of all amino acids for each state (helix, turn, strand and non-structured), based on the E. coli UniProt dataset of known protein sequences and secondary structures.



Figure 1: Bar charts depicting emission frequencies of each amino acid in Table 1, for the following HMM states: a) Helix, b) Turn, c) Strand and d) Non-Structured Region

Figure 2: Bar chart depicting transition frequencies found in Table 2 between the beginning (B), helix (H), turn (T), strand (S) and non-structured (N) regions.

Table 2: Transition probabilities from the beginning (B), helix (H), strand (S), turn (T) or non-structured (N) state to each other possible state. Probabilities were generated from the E. coli dataset from UniProt.


Testing of Hidden Markov Model

To test our HMM, we used the Viterbi algorithm, performed in log-space, to obtain the most likely sequence of hidden states for different E. coli proteins, using their amino acid sequences as the test sequences. However, for the proteins tested (Table 3), the HMM could only predict strings of ‘N’, indicating that the entire protein sequence was predicted to be unstructured. Given the transition frequencies, this is not unexpected: the likelihood of beginning in ‘N’ is high, and the likelihood of staying in the ‘N’ state is also high, at 96.8%.
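To make the decoding step concrete, the following is a minimal sketch of the Viterbi algorithm in log-space, together with a toy two-state example of how a very high self-transition probability can lock the prediction into ‘N’. It is not the code used for Table 3. The function name `viterbi_log`, the probability floor used to avoid log(0), and all toy parameters are assumptions; only the 0.968 N-to-N value is taken from the text.

```python
import math


def viterbi_log(seq, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state path for seq, computed in log-space.

    Probabilities are nested dicts; missing entries count as zero and are
    clipped to `floor` before taking logs (an illustrative choice).
    """
    if not seq:
        return ""

    def lg(p):
        return math.log(max(p, floor))

    # scores[i][s]: best log-probability of any path ending in s at position i
    scores = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(seq[0], 0.0))
               for s in states}]
    back = []  # back[i][s]: best predecessor of s at position i + 1
    for aa in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lg(trans_p[r].get(s, 0.0)))
            col[s] = (prev[best] + lg(trans_p[best].get(s, 0.0))
                      + lg(emit_p[s].get(aa, 0.0)))
            ptr[s] = best
        scores.append(col)
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


# Toy model: emissions slightly favour H for 'A', but N is far "stickier".
states = "HN"
start_p = {"N": 0.9, "H": 0.1}
trans_p = {"N": {"N": 0.968, "H": 0.032}, "H": {"H": 0.9, "N": 0.1}}
emit_p = {"H": {"A": 0.6, "G": 0.4}, "N": {"A": 0.5, "G": 0.5}}
print(viterbi_log("AAAA", states, start_p, trans_p, emit_p))  # -> NNNN
```

In this toy case the emission advantage of ‘H’ (0.6 against 0.5 per residue) is too small to pay for the 0.032 cost of leaving ‘N’, so the best path never leaves it.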

Table 3: Known amino acid sequences of E. coli proteins with their true state sequences, predicted state sequences and accuracy (defined as the % of the true state sequence that was correctly predicted).
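The accuracy measure defined in the caption above can be written as a short function. This is an illustrative sketch; the name `per_residue_accuracy` and the example strings are assumptions and are not taken from Table 3.

```python
def per_residue_accuracy(true_states, predicted_states):
    """Percentage of positions where the predicted state equals the true state."""
    if len(true_states) != len(predicted_states):
        raise ValueError("state sequences must have the same length")
    if not true_states:
        raise ValueError("state sequences must not be empty")
    matches = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * matches / len(true_states)


# Invented example: an all-'N' prediction against a half-structured protein.
print(per_residue_accuracy("NNHHHHNN", "NNNNNNNN"))  # -> 50.0
```

Note that under this definition a constant all-‘N’ prediction scores exactly the percentage of ‘N’ residues in the true sequence, so a non-zero accuracy does not by itself show that the model has detected any structure.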


Control: Hidden Markov Model for Transmembrane and Soluble Protein Regions


As a control to test that our HMM and Viterbi algorithm code was running correctly, we used the code to train a simpler HMM, designed to find transmembrane regions in a given amino acid sequence. This model is trained to predict which sections of a given sequence encode soluble regions of a protein and which encode transmembrane regions. Running very similar code, modified in some sections to accommodate the differences in HMM topology, we found that the model could find the transmembrane regions with much higher accuracy, as shown in Table 4. The majority of transmembrane regions were found, and the start and end points of the transmembrane sections were predicted with relatively high accuracy. These results indicate that our HMM code was correctly implemented and was able to accurately predict the state sequence from an amino acid sequence when given a simpler HMM topology.

Table 4: Table showing different known proteins of yeast, specifically their amino acid sequences and corresponding state sequences. Our modified HMM was trained on a database of yeast proteins and used to predict state sequences of other yeast proteins, given the amino acid sequence. Accuracy of the HMM’s predictions was also found (defined as the % of the true state sequence that was correctly predicted).


Modification of the E. coli Dataset


Based on our previous results, where the HMM only output strings of ‘N’, we decided to investigate whether modifying the E. coli dataset could lead to better predictions of the state sequence. To achieve this, we removed the terminal ‘N’ regions at the start and end of each protein sequence. This removal allows the HMM to begin in a state other than ‘N’, thus potentially allowing other states to be predicted. Additionally, when inputting test protein sequences, only the ‘core’ of the protein (i.e. a middle section) was tested, as these sections are more likely to correspond to folded regions of the protein. This gave us new transition probabilities, as shown in Table 5 and Fig. 4. Using these modified probabilities, we ran our HMM on amino acid sequences from the ‘core’ regions of different proteins, as shown in Table 6. Unfortunately, although the HMM prediction no longer started in the ‘N’ state, it was unable to leave the ‘H’ state. To try to overcome this, we once again modified the inputs, using synthesized, manually curated amino acid sequences which did not encode any helices. However, the HMM was then only able to predict strings of ‘S’ and could not leave this state, as shown in Table 7.
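The trimming of terminal ‘N’ regions described above can be sketched as follows. This is an illustrative version under stated assumptions rather than the preprocessing code actually used: the function name `trim_terminal_unstructured` and the example sequence are hypothetical, and each record is assumed to be a pair of equal-length strings.

```python
def trim_terminal_unstructured(seq, states, unstructured="N"):
    """Remove leading and trailing unstructured residues from a labelled protein.

    Internal 'N' regions are kept; only the runs at the very start and very
    end of the state sequence are removed, together with their amino acids.
    """
    if len(seq) != len(states):
        raise ValueError("sequence and state string must have the same length")
    start, end = 0, len(states)
    while start < end and states[start] == unstructured:
        start += 1
    while end > start and states[end - 1] == unstructured:
        end -= 1
    return seq[start:end], states[start:end]


# Invented example: two leading and one trailing 'N' are removed.
print(trim_terminal_unstructured("MKTAYIAK", "NNHHHNSN"))  # -> ('TAYIA', 'HHHNS')
```

A protein annotated entirely as ‘N’ becomes an empty record under this scheme and would need to be dropped before the transition probabilities are re-estimated.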

Figure 4: Bar chart showing transition frequencies found in Table 5 between the beginning (B), helix (H), turn (T), strand (S) and non-structured (N) regions. These frequencies were obtained from truncated versions of E. coli proteins, with any non-structured regions at the very start or end of the protein being trimmed.

Table 5: Table showing updated transition probabilities for modified dataset, from either beginning state (B), helix (H) state, strand (S) state, turn (T) state or non-structured (N) state to other possible states. Modifications to dataset consist of truncation of each protein state sequence such that terminal ‘N’ segments (i.e. N regions at the start or at the end) were removed.

Table 6: Known amino acid sequences of core regions of E. coli proteins with their true state sequences, predicted state sequences and accuracy (defined as the % of the true state sequence that was correctly predicted). Transition probabilities used were as shown in Table 5, generated from a modified protein database with ‘N’ regions at the very start or end removed.


















Table 7: Amino acid sequences of regions of E. coli proteins encoding only strands, turns or non-structured regions, along with their true state sequences, predicted state sequences and accuracy (defined as the % of the true state sequence that was correctly predicted). Transition probabilities used were as shown in Table 5, generated from a modified protein database with ‘N’ regions at the very start or end removed.


Discussion

From our results, we conclude that the currently implemented HMM is unable to accurately predict secondary structures from a given amino acid sequence. There are several possible reasons for this, one being that each secondary structure itself contains multiple sub-types with varying chemical properties. As shown in the figure below, taken from Martin et al., 2006, alpha helices can be further broken down into 15 different sub-groups, some of which are hydrophilic in nature whilst others are hydrophobic. In our control experiment, the amino acids typically found in transmembrane and soluble segments are easily distinguished, because amino acids in transmembrane segments are typically more hydrophobic and those in soluble segments less so, which may explain the higher accuracy of that HMM's predictions. In contrast, if alpha helices comprise several sub-types with differing chemical properties, a single helix state may have been an oversimplification, rendering our HMM unable to accurately predict the secondary structure.

Future Directions

There are a number of methods that could be explored to improve the accuracy of our HMM in predicting secondary structures. A more in-depth analysis of the amino acids and amino acid digrams typically seen in each secondary structure, carried out before settling on an HMM topology, may prove useful in the future. Following on from this, different HMM topologies could be attempted, and different algorithms for determining the most likely state sequence could be implemented.


Alternate Model Topology

There are several ways in which the model's ability to predict the secondary structure of proteins could be improved. One of the fundamental issues with hidden Markov models is that the topology of the model must be well matched to the probabilities it is expected to capture, and therefore needs a rational design [Bystroff & Krogh, 2008]. Future iterations of this model could use different topologies that may provide better insight into the structure of proteins. One such approach would be to reduce the model to only two states, contributing and non-contributing (i.e. structured and non-structured). While this model would provide less information about the specific secondary structures, it should at least be better placed to detect the presence of secondary structure within a given protein. Once structured regions have been distinguished from unstructured ones, a second HMM could be used to determine the most likely secondary structure of each region already identified as structured. A variation of this approach would be a two-state model that includes only one specific type of secondary structure. This could allow the model to predict one type of secondary structure more accurately, rather than trying to predict all three.
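Both two-state variants described above only require relabelling the existing state sequences before retraining. A minimal sketch is given here; the function names and the placeholder labels ‘X’ (any structured state) and ‘O’ (any state other than the target) are assumptions introduced for illustration.

```python
def to_structured_vs_unstructured(states, unstructured="N", structured="X"):
    """Collapse H, S and T into a single structured label, keeping N as is."""
    return "".join(unstructured if s == unstructured else structured
                   for s in states)


def to_one_vs_rest(states, target, rest="O"):
    """Keep one secondary structure label and merge every other state."""
    return "".join(s if s == target else rest for s in states)


# Invented state sequence, relabelled for each two-state topology.
example = "NNHHTSSN"
print(to_structured_vs_unstructured(example))  # -> NNXXXXXN
print(to_one_vs_rest(example, "H"))            # -> OOHHOOOO
```

The relabelled sequences could then be used to re-estimate transition and emission probabilities for the smaller model in the same way as for the original four-state model.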

Alternatively, the model could be made more specific by adding more states corresponding to different sub-types of each structure. The image below illustrates a more complex hidden Markov model in which each type of secondary structure contains multiple states [Martin et al., 2006]. However, this approach would require access to a much more detailed dataset, one that identifies the amino acids belonging to each different structure.

Other Algorithms to Determine the Most Likely State Sequence

The Viterbi algorithm finds the single most probable state path for an observed sequence, together with the joint probability of the sequence and that path. Like many other algorithms, it does this using dynamic programming. Another algorithm that could be employed with this dataset is the forward algorithm, which also uses dynamic programming. However, unlike the Viterbi algorithm, the forward algorithm sums over all possible state paths rather than keeping only the best one. On its own it gives the total probability of the observed sequence under the model; combined with the backward algorithm, it gives the posterior probability of each state at each position, which may prove useful in finding states that the single best Viterbi path misses. The forward algorithm has the same asymptotic cost as the Viterbi algorithm (proportional to the sequence length multiplied by the square of the number of states), so re-evaluating the model in this way should not be substantially more expensive. Lastly, the Baum-Welch algorithm could be useful with a complex model such as the one in the image below, as it estimates the model parameters when the state paths are unknown.
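To illustrate how summing over all paths differs from taking the best one, the following is a minimal log-space sketch of the forward and backward recursions and of posterior decoding built on them. Posterior decoding is a suggested alternative, not something implemented for the results above. The function names, the probability floor and the toy two-state parameters are all assumptions.

```python
import math


def _lg(p, floor=1e-12):
    # Clip zero probabilities before taking logs (illustrative choice).
    return math.log(max(p, floor))


def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log(seq, states, start_p, trans_p, emit_p):
    """Return the forward table and log P(seq), summed over all state paths."""
    f = [{s: _lg(start_p.get(s, 0.0)) + _lg(emit_p[s].get(seq[0], 0.0))
          for s in states}]
    for aa in seq[1:]:
        prev = f[-1]
        f.append({s: _lg(emit_p[s].get(aa, 0.0))
                  + _logsumexp([prev[r] + _lg(trans_p[r].get(s, 0.0))
                                for r in states])
                  for s in states})
    return f, _logsumexp(list(f[-1].values()))


def backward_log(seq, states, trans_p, emit_p):
    """Return the backward table (log-probability of the remaining sequence)."""
    b = [{s: 0.0 for s in states}]  # log(1) at the final position
    for aa in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {r: _logsumexp([_lg(trans_p[r].get(s, 0.0))
                                    + _lg(emit_p[s].get(aa, 0.0)) + nxt[s]
                                    for s in states])
                     for r in states})
    return b


def posterior_decode(seq, states, start_p, trans_p, emit_p):
    """Pick, at each position, the state with the highest posterior probability."""
    f, _ = forward_log(seq, states, start_p, trans_p, emit_p)
    b = backward_log(seq, states, trans_p, emit_p)
    return "".join(max(states, key=lambda s: f[i][s] + b[i][s])
                   for i in range(len(seq)))


# Toy symmetric model, invented for illustration.
toy_states = "HN"
toy_start = {"H": 0.5, "N": 0.5}
toy_trans = {"H": {"H": 0.9, "N": 0.1}, "N": {"H": 0.1, "N": 0.9}}
toy_emit = {"H": {"A": 0.8, "G": 0.2}, "N": {"A": 0.2, "G": 0.8}}
_, log_p = forward_log("AG", toy_states, toy_start, toy_trans, toy_emit)
print(math.exp(log_p))
print(posterior_decode("AG", toy_states, toy_start, toy_trans, toy_emit))
```

In this toy example the two best single paths, HH and NN, are equally probable, whereas the per-position posteriors favour ‘H’ at the first residue and ‘N’ at the second. One known caveat of posterior decoding is that the resulting state string is chosen position by position and may include transitions that the model itself considers very unlikely.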

Complex Hidden Markov Model Topology [Martin et al., 2006]