Hidden Markov Models for Predicting Protein Secondary Structures

Group Members: Anant Beechar, Grace Hu, Rayna Taniguchi

Introduction


The structure of a protein is central to understanding its biological function. Secondary structures are the locally folded regions along the polypeptide backbone. The two main secondary structures are the alpha helix and the beta strand; a third is the random coil region.



Being able to predict the secondary structure of a protein from a given amino acid sequence offers multiple benefits. Understanding the secondary structure of proteins could improve the prediction of protein binding and interactions, as well as tertiary structure prediction. Adding to this need, there are currently many orphan proteins, which lack any detectable sequence similarity to other known proteins [Martin et al., 2006]. In addition, many neurodegenerative diseases result from protein misfolding caused by mutations in the genome and, consequently, the proteome [Jumper et al., 2021]. Thus, accurately predicting the structure of a mutated protein and assessing any changes to its secondary structure would provide a greater understanding of disease pathogenesis. Trends in protein sequence mutations and structural changes could then inform research into and treatments for these diseases, including preventing specific interactions among altered secondary structures. Furthermore, knowledge of protein structure could allow protein sequences to be altered to change their function and interactions with other proteins, target molecules, or drugs [Ma et al., 2018], all of which are especially useful in industry and in finding therapeutics for diseases and disorders.


Multiple attempts have been made to predict secondary structure from an amino acid sequence. The advent of the AlphaFold system has made it possible to accurately predict the structure of a protein given its amino acid sequence. This technology, while revolutionary, has limitations when predicting protein misfolding [Pak et al., 2021]. If an amino acid sequence with a mutation is given to AlphaFold, it will predict the original structure that the protein was meant to fold into. However, certain mutations can cause the protein to misfold, and the specific information about how the protein is misfolded can be important in studying disease pathology. Thus, there is a need for a method that can be used for secondary structure prediction.


Determining the secondary structure of proteins with experimental methods such as NMR and crystallography can be time consuming, expensive, and resource intensive [Darnell, 2020]. Because the number of known protein sequences is much greater than the number of known protein structures and is still growing, it is important to establish non-experimental methods of protein structure prediction [Darnell, 2020]. However, accurate and reliable prediction of protein structure from protein sequence has been a challenge. Predicting the function of a protein based only on sequence similarity is not reliable and should be supplemented with other methods. A hidden Markov model may be useful in determining how a mutation alters the secondary structure of a given protein if the model contains states that characterize misfolded proteins.


Hidden Markov Models (HMMs) are often used to model biological sequences by describing the probabilities of different hidden states and of the transitions between them. Several algorithms can be used to find the most likely state sequence, including the Viterbi algorithm, which uses dynamic programming to recursively compute the path with the most likely sequence of hidden states. Coupled with their ability to allow for explicit modeling of the data [Martin et al., 2006], this makes an HMM well suited to predicting secondary structure from an amino acid sequence. Thus, we propose to use a trained Hidden Markov Model with the Viterbi algorithm to predict the secondary structure from a protein sequence.

The above figure represents an example of a four-state Hidden Markov Model, created using the following website: https://setosa.io/blog/2014/07/26/markov-chains/
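
To make the Viterbi decoding described in the introduction concrete, the following is a minimal Python sketch, not the implementation used in this project. It assumes three hidden states (H for alpha helix, E for beta strand, C for random coil) and a toy alphabet of four residues (A, L, V, G). All start, transition, and emission probabilities are hypothetical, hand-picked values chosen only for illustration; a real model would estimate them from training data over all twenty amino acids.

```python
import math

STATES = ("H", "E", "C")  # alpha helix, beta strand, random coil

# Hypothetical, hand-picked parameters for illustration only (not trained).
START = {"H": 0.3, "E": 0.2, "C": 0.5}
TRANS = {
    "H": {"H": 0.80, "E": 0.05, "C": 0.15},
    "E": {"H": 0.05, "E": 0.75, "C": 0.20},
    "C": {"H": 0.20, "E": 0.20, "C": 0.60},
}
# Emission probabilities over a toy four-residue alphabet (assumed values).
EMIT = {
    "H": {"A": 0.40, "L": 0.40, "V": 0.10, "G": 0.10},
    "E": {"A": 0.10, "L": 0.20, "V": 0.60, "G": 0.10},
    "C": {"A": 0.15, "L": 0.10, "V": 0.15, "G": 0.60},
}


def viterbi(sequence, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most likely hidden-state path for `sequence` as a string."""
    if not sequence:
        return ""

    # score[s]: best log-probability of any state path ending in state s.
    # Log space avoids numerical underflow on long sequences.
    score = {
        s: math.log(start[s]) + math.log(emit[s][sequence[0]]) for s in states
    }
    backpointers = []  # one dict per position after the first

    for residue in sequence[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            best_prev = max(
                states, key=lambda p: score[p] + math.log(trans[p][s])
            )
            new_score[s] = (
                score[best_prev]
                + math.log(trans[best_prev][s])
                + math.log(emit[s][residue])
            )
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("AALLGGVV"))
```

With these assumed parameters, a run of glycines such as viterbi("GGGG") decodes to "CCCC", because coil is the only state that emits G with high probability. The strong self-transition probabilities mean the decoded path tends to stay in one state across several residues rather than switching at every position.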