Training the Model:
Figure 3. PCA of the embeddings. There is a meaningful separation between the sequences with high and low fitness scores, indicating that the ESM-2 model can appropriately distinguish the two.
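A projection like the one in Figure 3 can be sketched as follows. This is a minimal illustration, not the pipeline used here: the embeddings are synthetic stand-ins for per-sequence ESM-2 vectors, and the function name `pca_project`, the embedding width of 16, and the group sizes are all assumptions made for the example.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project row-vector embeddings onto their top principal components."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Synthetic stand-in for embeddings (assumption: not real ESM-2 output).
rng = np.random.default_rng(0)
low_fitness = rng.normal(loc=0.0, scale=1.0, size=(50, 16))
high_fitness = rng.normal(loc=3.0, scale=1.0, size=(50, 16))
embeddings = np.vstack([low_fitness, high_fitness])

coords = pca_project(embeddings)  # shape (100, 2), ready for a scatter plot
```

Coloring the rows of `coords` by fitness group would then show whether the two groups separate along the first components.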
Figure 4. ESM model trained on 80% of the fitness landscape data. When applied to the held-out 20%, the model predicted fitness score with 80% accuracy.
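The 80/20 evaluation can be sketched as below. The regression head and the accuracy metric are not specified in this section, so the sketch assumes a closed-form ridge regression on embedding features and reports the coefficient of determination (R^2) on the held-out 20%. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Shuffle the rows, then return an 80% train / 20% test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = (len(X) * 4) // 5
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic features and fitness scores (assumption: stand-ins for real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = X @ np.linspace(-1.0, 1.0, 8) + 0.1 * rng.normal(size=200)

X_train, X_test, y_train, y_test = split_80_20(X, y)
w, b = fit_ridge(X_train, y_train)
test_r2 = r_squared(y_test, X_test @ w + b)
```

Scoring only on rows the model never saw during fitting is what makes the held-out figure a fair estimate of predictive performance.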
Applying the model to predict double mutation fitness landscape
Figure 5. Fitness distribution for the landscape of predicted double mutant variants. The predicted distribution falls in the same range as the distribution from the raw data.
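One way to build the set of double mutant variants to be scored is to enumerate every pair of positions and every pair of non-wild-type substitutions. The sketch below assumes the 20 standard amino acids and exhaustive enumeration; how the variants were actually generated is not stated here, and `double_mutants` is a hypothetical helper.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def double_mutants(wild_type):
    """Yield (mutations, sequence) for every variant with exactly two substitutions."""
    for i, j in combinations(range(len(wild_type)), 2):
        for a, b in product(AMINO_ACIDS, repeat=2):
            # Skip choices that leave either position unchanged.
            if a == wild_type[i] or b == wild_type[j]:
                continue
            seq = wild_type[:i] + a + wild_type[i + 1:j] + b + wild_type[j + 1:]
            yield ((i, a), (j, b)), seq

# Toy wild-type sequence (assumption: for illustration only).
variants = list(double_mutants("MKT"))
```

For a sequence of length L made of standard residues, this yields C(L, 2) x 19^2 variants, so the landscape grows quadratically with sequence length.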
(a)
(b)
Figure 6. Heatmap depicting the five amino acid variants at each position that most significantly increase (a) and decrease (b) the fitness score in the double mutant predictive model. Numerical values represent the number of times that amino acid appears at that position among the most significant mutations.
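The counts behind a heatmap like Figure 6 can be tallied as in the sketch below. It assumes that "most significant" means the `n_top` variants ranked by predicted fitness (highest for panel a, lowest for panel b). That cutoff, the input layout, and the function name `top_residue_counts` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def top_residue_counts(scored_variants, n_top, per_position=5, highest=True):
    """Count residues per position among the n_top best (or worst) variants.

    scored_variants: iterable of (mutations, score), where mutations is a
    tuple of (position, residue) pairs.
    """
    ranked = sorted(scored_variants, key=lambda item: item[1], reverse=highest)[:n_top]
    counts = defaultdict(Counter)
    for mutations, _score in ranked:
        for position, residue in mutations:
            counts[position][residue] += 1
    # Keep only the most frequent residues at each position.
    return {pos: c.most_common(per_position) for pos, c in counts.items()}

# Toy predictions (assumption: invented scores, not model output).
scored = [
    (((0, "A"), (1, "C")), 0.9),
    (((0, "A"), (2, "D")), 0.8),
    (((1, "C"), (2, "E")), 0.1),
]
increasing = top_residue_counts(scored, n_top=2, highest=True)
decreasing = top_residue_counts(scored, n_top=1, highest=False)
```

Each `(residue, count)` pair corresponds to one heatmap cell, with the count as the numerical value shown.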