The dataset used in this project was downloaded from UniProt by searching for the gene name AMA1 and filtering the results to those from P. falciparum; in total, 2,008 protein sequences were found (17). These sequences ranged from small fragments of the protein to the full 622-residue sequence. To analyze each domain individually, all sequences were aligned to a full-length reference. This alignment standard was created by extracting all full-length protein sequences from the data and constructing a consensus sequence consisting of the most prevalent residue at each position. All of the shorter protein fragments were then aligned to this standard using Biopython (18). Specifically, the Align module from Biopython was used to perform a dynamic programming search with the BLOSUM62 substitution matrix and a heavy gap-opening penalty to find the best possible alignment of each fragment to the standard (18).
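The construction of the alignment standard can be illustrated with a minimal sketch. It assumes that "full" sequences are identified simply by their length and that ties between equally common residues are broken by first occurrence; the function name `build_consensus` and the toy sequences are hypothetical and are not taken from the original analysis.

```python
from collections import Counter


def build_consensus(sequences, full_length):
    """Build a consensus from the full-length sequences only.

    Assumption: a sequence counts as "full" when its length equals
    full_length (622 for AMA1 in the text); shorter fragments are ignored.
    """
    full = [s for s in sequences if len(s) == full_length]
    if not full:
        raise ValueError("no full-length sequences available")

    consensus = []
    # zip(*full) walks the full-length sequences column by column.
    for column in zip(*full):
        # most_common(1) returns the most prevalent residue at this position.
        residue, _count = Counter(column).most_common(1)[0]
        consensus.append(residue)
    return "".join(consensus)


# Toy example with 5-residue "full" sequences and one fragment.
toy = ["MRKLY", "MRKLY", "MKKLY", "MRK"]
standard = build_consensus(toy, full_length=5)
```

In this toy case the fragment `MRK` does not contribute to the standard; it would only be aligned to it afterwards.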
Each domain was then extracted from the aligned protein sequences for further analysis. However, not all protein sequences contained all three domains in their entirety. Therefore, only 706 sequences were extracted for domain I, 1,277 sequences for domain II, and 859 sequences for domain III. Polymorphism was then quantified in each domain. Within each domain, any position that was polymorphic in more than 2% of the sequences was considered significant (11). These residues were then analyzed to determine the frequency and types of polymorphism using WebLogo (19).
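One way to express the 2% criterion is sketched below. It assumes that a position is "polymorphic" in a sequence when that sequence carries a residue other than the most common one at that position, and that gap characters are excluded from the counts; both are interpretations for illustration rather than details stated in the text. The function name `polymorphic_positions` is hypothetical.

```python
from collections import Counter


def polymorphic_positions(aligned, threshold=0.02, gap="-"):
    """Return {position: {residue: frequency}} for significant positions.

    Assumption: a position is significant when the fraction of sequences
    carrying a non-majority residue exceeds the threshold (2% by default).
    Positions are 0-based indices into the aligned domain.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")

    significant = {}
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned if s[pos] != gap)
        total = sum(counts.values())
        if total == 0:
            continue  # position covered only by gaps
        major = counts.most_common(1)[0][1]
        if (total - major) / total > threshold:
            significant[pos] = {res: n / total for res, n in counts.items()}
    return significant


# Toy domain of 100 aligned sequences, 3 residues each.
toy = ["ACD"] * 96 + ["AGD"] * 3 + ["ACE"]
result = polymorphic_positions(toy)
```

Here only the middle position passes the threshold (3% of sequences differ from the majority), while the last position (1%) does not.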
Phylogenetic trees were generated for each domain to determine the relationships between the polymorphic sequences. These trees were built using the Phylo module from Biopython and were based on pairwise distance calculations between sequences, with BLOSUM62 again being used to weight the different amino acid substitutions (18). The trees were then visualized using the ETE Toolkit library for Python (20). Interpretation of the phylogenetic trees was aided by performing hierarchical clustering on the amino acid frequencies in each of the sequences for each domain. Hierarchical clustering was performed in Morpheus using one minus Pearson correlation as the distance metric and average linkage (21). Sequence alignment was conducted using Clustal Omega from EMBL-EBI (22).
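The clustering settings named above (one minus Pearson correlation, average linkage) can be illustrated with a small standalone sketch. This is not the Morpheus implementation; it is a naive stand-in that assumes each sequence is summarized as a 20-element amino acid composition vector, which is one possible reading of "amino acid frequencies". All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def pearson_distance(x, y):
    """One minus the Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant vectors")
    return 1.0 - cov / (sx * sy)


def average_linkage(vectors):
    """Naive agglomerative clustering; returns merges as (a, b, distance)."""
    n = len(vectors)
    dist = {(i, j): pearson_distance(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)}

    def between(a, b):
        # Average linkage: mean pairwise distance between cluster members.
        pairs = [dist[min(i, j), max(i, j)] for i in a for j in b]
        return sum(pairs) / len(pairs)

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        a, b = min(((c1, c2) for k, c1 in enumerate(clusters)
                    for c2 in clusters[k + 1:]),
                   key=lambda pair: between(*pair))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


toy = ["AAAC", "AAACAAAC", "WWWY"]
merges = average_linkage([composition(s) for s in toy])
```

In the toy example the first two sequences have identical compositions, so they are merged first at a distance of essentially zero, and the third sequence joins last.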
The crystal structure of AMA1 from FVO (PDB entry 4R1A) was used to analyze the positions of the polymorphic amino acids (23). PyMOL was used to analyze and align the protein structures (24).