For each domain, several residues exhibited substantial polymorphism. For domain I, 32 of 207 residues (15.46%) were polymorphic; for domain II, 13 of 137 residues (9.49%) were polymorphic; and for domain III, 9 of 106 residues (8.49%) were polymorphic. Figure 1 below shows the frequency of each amino acid at each polymorphic site and the confidence in the correct residue. A majority of the polymorphic residues in each domain are dimorphic (>70% in all domains), and many of the polymorphisms involved substitutions between amino acids with different characteristics. Interestingly, domain I contains several highly polymorphic residues whose substitutions span different amino acid properties. For example, residue 187 has polymorphisms that include polar, acidic, and basic amino acid substitutions. Residue 197 is the most polymorphic, with amino acid substitutions of every category (polar, acidic, basic, hydrophobic). Of note, polymorphic sites in domain II tend to have one amino acid at a higher frequency than the other(s), whereas dimorphic sites in domain III tend to have roughly even frequencies.
Figure 1 – Amino Acid Polymorphism for A) Domain I, B) Domain II, and C) Domain III. Amino acids are color coded by properties, where polar amino acids are green or pink, basic are blue, acidic are red, and hydrophobic are black. Residue positions are indicated on the x-axes and bits are indicated on the y-axes.
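The per-site quantities summarized above (number of distinct residues, residue frequencies, and information content in bits) can be illustrated with a minimal sketch. It assumes the sequences are already aligned and gap-free, and it uses the uncorrected Shannon information content with a maximum of log2(20) bits; the function names and the toy sequences are hypothetical, and logo-generation tools may apply additional small-sample corrections that are not reproduced here.

```python
import math
from collections import Counter


def column_stats(aligned_seqs):
    """Per-position residue counts, number of alleles, and information content."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")
    stats = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        total = sum(counts.values())
        # Shannon entropy of the residue distribution at this position
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        stats.append({
            "position": pos + 1,               # 1-based residue index
            "counts": dict(counts),
            "n_alleles": len(counts),          # 2 = dimorphic, 3 = trimorphic, ...
            "bits": math.log2(20) - entropy,   # assumes a 20-letter alphabet
        })
    return stats


def polymorphic_fraction(stats):
    """Return (polymorphic sites, total sites, percentage polymorphic)."""
    n_poly = sum(1 for s in stats if s["n_alleles"] > 1)
    return n_poly, len(stats), 100.0 * n_poly / len(stats)


# Toy alignment (hypothetical, not AMA1 data)
toy = ["MKDE", "MKNE", "MRDE", "MKDE"]
stats = column_stats(toy)
summary = polymorphic_fraction(stats)
```

A fully conserved position reaches the maximum of log2(20) bits, and a site with evenly split residues scores lower than one dominated by a single residue.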
The evolutionary relationships among the polymorphisms are displayed in the phylogenetic trees for each of the domains, as shown in Figure 2. Domain I is highly polymorphic and produces a highly complicated tree, whereas domains II and III have substantially less complicated trees. This suggests that not only does domain I have high polymorphism, but there are also no clear higher-order relationships between substitutions at any given locus in the protein. For domain II, and even more so for domain III, the phylogenetic trees show a clearer branching relationship between the individual sequences.
Figure 2 – Phylogenetic trees. A) The tree for Domain I has a high number of nodes and few closely related sequences. B) The tree for Domain II exhibits two main families of sequences with less variation within each. C) The tree for Domain III exhibits two main families with close relationships inside of each family.
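Distance-based tree methods start from a matrix of pairwise distances between sequences. The sketch below computes the uncorrected p-distance (fraction of differing positions) for a set of aligned sequences. This is an illustrative assumption only: the text does not state which tree-building method or distance measure was used, and the function names and toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = p_distance(seqs[i], seqs[j])
    return matrix


# Toy alignment (hypothetical, not AMA1 data)
toy = ["MKDE", "MKNE", "MRDE"]
dm = distance_matrix(toy)
```

When many sequences differ from each other at scattered positions, most entries of the matrix are of similar size, which is one way a tree can end up with many nodes and few closely related sequences.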
This interpretation of the results is supported by the hierarchical clustering results in Figure 3. For Domain I, clustering failed to find any significant clusters and instead found highly unordered variance across the entire dataset. Although some visible bands suggest some level of higher-order organization, the highly striated pattern illustrates that amino acid substitutions occur frequently and chaotically. For domains II and III, clustering was more successful; although there were striations, there were also large clusters with identical amino acid frequencies. Overall, the phylogenetic trees and clustering failed to produce an easy way to quantify polymorphism in the data, but did help illustrate the extreme polymorphism in Domain I.
Figure 3 – Hierarchical clustering results. A) For Domain I, there were few large clusters. Within the heat map for amino acid frequency, several bands are visible, but overall, the clustering is dominated by disordered striations in the amino acid frequencies. B) For Domain II, clustering produced several large clusters, and overall the pattern was not as significantly striated. Interestingly, a variation in amino acid frequency for one residue in a row was marked by variation in other amino acids. C) For Domain III, several substantial clusters were found, and there was not a clear striated pattern. Overall, there appears to be much more substantial similarity in this domain.
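As an illustration of how agglomerative hierarchical clustering groups sequences, the following sketch applies average linkage to Hamming distances between aligned sequences. The linkage criterion, the distance measure, and the toy sequences are assumptions made for this example; the text does not specify the settings used to produce Figure 3.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(seqs):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(seqs))]  # each sequence starts alone
    merges = []

    def dist(c1, c2):
        # Mean pairwise Hamming distance between members of two clusters
        total = sum(hamming(seqs[i], seqs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the closest pair of clusters
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy alignment (hypothetical, not AMA1 data)
toy = ["MKDE", "MKDE", "MRNE", "MRNQ"]
merges = average_linkage(toy)
```

Identical sequences merge at distance zero and form large clusters early, whereas sequences that differ at many scattered positions only merge at large distances.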
Another method we employed to visualize polymorphism clustering was sequence alignment, as seen in Figure 4. These results confirm what was seen in the hierarchical clustering of Figure 3: disordered clusters are much more abundant in Domain I, while Domains II and III have many large clusters encompassing major branches of the clustering tree.
Figure 4 – Sequence alignment and clustering of AMA1 domain I (A), domain II (B), and domain III (C). Each subfigure is a small snippet of the entire results. (A) Domain I had many small, divided clusters characterized by many polymorphic residues that did not align with each other. (B) Domain II and (C) Domain III have much larger branches, indicating increased sequence alignment among polymorphic residues.
To assess the effect of the polymorphic residues, we highlighted those residues on the crystal structure in Figure 5. The most highly polymorphic residues were close to the hydrophobic pocket known to be the binding site of RON2. This diversity increases the chance of escaping immune surveillance and hinders vaccine development.
Figure 5 – Polymorphic residues are highlighted. Green residues are dimorphic, yellow residues are trimorphic, orange residues have four possible amino acids, and the red residue has five possible amino acids. Magenta indicates hydrophobic trough residues.
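The color scheme in Figure 5 amounts to a lookup from the number of distinct amino acids observed at a residue to a color. A minimal sketch of that mapping follows; the function and variable names are hypothetical, and the magenta hydrophobic trough residues are not covered because they are defined by structure rather than by allele count.

```python
# Colors from the Figure 5 caption, keyed by number of distinct amino acids
COLOR_BY_ALLELES = {2: "green", 3: "yellow", 4: "orange", 5: "red"}


def residue_color(observed_residues):
    """Return the highlight color for a site, or None if it is not in the scheme."""
    n_alleles = len(set(observed_residues))
    return COLOR_BY_ALLELES.get(n_alleles)


# Toy columns (hypothetical, not AMA1 data)
dimorphic = residue_color("KKRK")
conserved = residue_color("DDDD")
five_way = residue_color("DNKEA")
```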
To understand the impact of the polymorphism on structure, we predicted the protein structures using AlphaFold. Two vaccine structures and four haplotype structures were aligned in Figure 6. Based on the predicted structures, the polymorphisms did not affect the structure of the protein, which is consistent with a previous publication (23).
Figure 6 – Superimposed predicted structures. Two vaccine structures (pink and cyan) and four haplotype structures were aligned.
Our results show that despite Domain I being the primary site of interest for vaccine development, substantial polymorphism disproportionately affects Domain I. This polymorphism involves unpredictable substitution of residues and affects the folding of the protein around the domain. For these reasons, developing strong vaccines that target domain I will face substantial issues. However, domains II and III exhibit less polymorphism and less impact on protein folding and thus are better candidates for a vaccine.
Discussion
Overall, our investigation yielded an overview of polymorphism that spanned all three defined domains in P. falciparum AMA1. To date, few studies have utilized the same or similar bioinformatics approaches to evaluate AMA1 diversity (25,26). Those that have done so uncovered results similar to ours: Domain I is incredibly complex, and properly defining antigenic epitopes that would induce a polyvalent antibody response capable of neutralizing multiple P. falciparum strains is challenging. Based on the structure prediction, immune escape mostly arises from the diversity of amino acid properties in the polymorphic regions, not from structural changes.
To our knowledge, no study has simultaneously evaluated domains I, II, and III across the available AMA1 amino acid sequences. Our results indicate that domains II and III are much more conserved than domain I, and therefore are potentially better vaccine antigen targets, assuming they can be expressed properly. Similar ideas have been proposed for the HIV-1 proteins (Gag, Env, Nef) (27) and the SARS-CoV-2 spike protein S2 subunit (28), because these antigens can elicit broadly neutralizing mAbs that can recognize multiple viral strains and/or contain greater amino acid conservation.