Materials:
Our fitness landscape was taken from "Robust Sequence Determinants of α-Synuclein Toxicity in Yeast Implicate Membrane Binding" [5]. In that study, a library of α-synuclein (α-syn) missense variants was transformed into S. cerevisiae. The fitness landscape quantifies yeast cell death for each variant, with wild-type α-syn as the baseline score.
Evolutionary Scale Modeling:
Starting from the raw fitness landscape data, we apply transfer learning with ESM-2 (Evolutionary Scale Modeling), a state-of-the-art protein language model, to extract embeddings that capture the relationships between amino acid sequence properties and protein function. These embeddings serve as input features for a downstream model trained on our specific task, fitness prediction.
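The embedding step can be sketched as follows. This is a minimal illustration, not the exact pipeline used here: it assumes the open-source `fair-esm` package, the `esm2_t33_650M_UR50D` checkpoint, the final (33rd) layer, and mean pooling over residues, none of which are specified above. The helper names `apply_missense` and `embed_sequences` are hypothetical.

```python
from typing import List


def apply_missense(wild_type: str, position: int, new_aa: str) -> str:
    """Return wild_type with the residue at 1-based `position` replaced by `new_aa`."""
    if not 1 <= position <= len(wild_type):
        raise ValueError("position out of range")
    return wild_type[: position - 1] + new_aa + wild_type[position:]


def embed_sequences(sequences: List[str]) -> "torch.Tensor":
    """Return one mean-pooled ESM-2 embedding per sequence.

    Assumptions (not stated in the text): fair-esm is installed, the
    650M-parameter checkpoint is used, and residues are mean-pooled.
    """
    # Imported lazily so the helper above works without torch/esm installed.
    import torch
    import esm

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()  # disable dropout for deterministic embeddings
    batch_converter = alphabet.get_batch_converter()

    data = [(f"seq{i}", seq) for i, seq in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    # Token counts per sequence, including the BOS and EOS tokens.
    lengths = (tokens != alphabet.padding_idx).sum(1)

    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    reps = out["representations"][33]

    # Average over real residues only (skip BOS at index 0 and EOS at the end).
    pooled = [reps[i, 1 : n - 1].mean(0) for i, n in enumerate(lengths)]
    return torch.stack(pooled)
```

The resulting matrix (one row per variant) could then be paired with the measured fitness scores and passed to any regressor. The choice of downstream model is left open here because the text does not name one.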
Application for Double Mutation Landscape:
We consider every possible pair of positions within the protein. To reduce the computational burden of substituting all 20 amino acids at each position, we restrict each position to the top 5 amino acids, ranked by the probability outputs of the ESM-2 model. This approach lets us build double mutants from the substitutions the model ranks most favorably, since ESM-2 was trained to capture the relationships among amino acids within protein sequences.
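The enumeration described above can be sketched in plain Python. The sketch assumes the per-position amino acid probabilities have already been obtained from ESM-2 (for example, by a softmax over its output logits) and are supplied as one dictionary per position. It also assumes the wild-type residue is excluded from the top-k ranking, which the text does not state. The function names and the toy inputs are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, Iterator, List, Tuple


def top_k_substitutions(
    wild_type: str, position_probs: List[Dict[str, float]], k: int = 5
) -> List[List[str]]:
    """For each position, keep the k most probable non-wild-type amino acids."""
    top = []
    for wt_aa, probs in zip(wild_type, position_probs):
        ranked = sorted(
            (aa for aa in probs if aa != wt_aa),  # assumption: skip the wild-type residue
            key=lambda aa: probs[aa],
            reverse=True,
        )
        top.append(ranked[:k])
    return top


def double_mutants(
    wild_type: str, top: List[List[str]]
) -> Iterator[Tuple[Tuple[str, str], str]]:
    """Yield ((label_i, label_j), sequence) for every pair of positions."""
    for i, j in combinations(range(len(wild_type)), 2):
        for a, b in product(top[i], top[j]):
            seq = list(wild_type)
            seq[i], seq[j] = a, b
            labels = (f"{wild_type[i]}{i + 1}{a}", f"{wild_type[j]}{j + 1}{b}")
            yield labels, "".join(seq)


# Toy input: a 3-residue sequence with made-up probabilities, for illustration only.
toy_wt = "MDV"
toy_probs = [
    {"M": 0.7, "L": 0.2, "I": 0.1},
    {"D": 0.6, "E": 0.3, "N": 0.1},
    {"V": 0.5, "I": 0.3, "A": 0.2},
]
toy_top = top_k_substitutions(toy_wt, toy_probs, k=2)
toy_variants = list(double_mutants(toy_wt, toy_top))
# 3 position pairs x 2 x 2 substitutions = 12 variants
```

For a sequence of length L this yields L(L-1)/2 position pairs with k² variants each, so limiting each position to k = 5 gives 25 variants per pair rather than the 361 produced by all 19 alternative residues at both positions.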