A Hidden Markov Model (HMM) can be trained to detect changes in states for a new test sequence if given the correct training data. The HMM used for this project was a four-state model (H,S,T,N) with states for the three secondary structures and an additional state for amino acids that did not contribute to a secondary structure. Each of these states requires the transition probability of moving from one state to another as well as the emission probabilities for each amino acid in the various states. Additionally, the model also requires the probability that the sequence will start in a specific state. The emission probabilities can be calculated from files that contains all the amino acid sequences present for each of the secondary structures. The transition probabilities and beginning probabilities can be calculated from a file that contains all of the state sequences for all the known proteins. These data files would need to be curated from a database that contains all the protein sequences and their associated secondary structures for a given species.
Hidden Markov Model for Predicting Protein Secondary Structure
Note: Non-contributing state not pictured for figure simplicity
The Uniprot database was used to obtain the amino acid sequences and the corresponding secondary structures for all proteins for E.coli. Each protein was separated by a “//” and had the entire amino acid sequence as well as the indexes of the specific amino acids that were part of each secondary structure. A python script was created to parse through the entire file and generate 4 different text files for calculating the emission probabilities. Each file contains lines of amino acids, each line corresponding to one instance of a secondary structure. The four files that were generated represent the amino acid sequences for creating alpha helices, beta strands, random turns, and non-contributing regions. The emission probability tables for the various states were calculated by determining the frequencies for each amino acid in these four files. Additionally, a state sequence file was generated that contained a line for each protein with the secondary state sequences instead of the amino acids. The transition probabilities were calculated by determining the number of occurrences of each digram in the state sequences file and the beginning probability was calculated by determining the frequencies of beginning a sequence in each of the four states. All of these probabilities can then be used to create the HMM and be used with an algorithm to predict the most likely states for a given protein.
For this HMM, the Viterbi algorithm was used to predict the most likely state sequence for a given protein. The Viterbi algorithm begins with initializing the path matrix with the first column having a probability of 1 for the beginning state and a 0 probability for any of the other states. Additionally, the probability across the first row should all be zeros as the beginning state (B) cannot be used as a state for the subsequent amino acids. The rest of the Viterbi matrix is built out to determine the most likely state sequence. For each element in the matrix, we try to maximize the product of the prior score and the transition probability of moving from the previous state to the current state. We then multiply that maximum by the emission probability for the current amino acid. Throughout this process, we have a pointer that keeps track of the previous entry that resulted in the current best score. Once we have completed the entire matrix, we work backwards on the scoring path with the maximum probability to determine the most likely state sequence for the given protein.
During the development of our code for the Viterbi algorithm, we discovered that the length of the amino acid sequence we were using to test our model was resulting in extremely small probabilities. The continuous multiplication of probabilities across the matrix was resulting in an underflow error. To circumvent this issue, we had to make a few modifications to the algorithm to allow for us to accurately calculate the probabilities. We decided to take the log transform of all the emission and transition probabilities to allow us to simply add the probabilities and thus lead to larger values that could be represented. However, this transformation led to all the values being negative so initializing the B state with a probability of 1 would result in the model predicting B for the entire state sequence. This issue was resolved by initializing B to 0 and all other entries in the first column and row as -∞ or -100,000.
Once the HMM and Viterbi algorithm are set up, we can use another set of proteins for the test data to determine how accurately the model can predict the secondary structures of these proteins. The test data would have the experimentally determined secondary structures as a comparison to see how well the HMM can predict the secondary structure for a given protein. The predicted sequence and the actual sequence can be compared directly to determine the percentage of correctly predict structures. Ideally, the model should also be able to predict how a change in the sequence of the given protein can alter the secondary structure.