Phylogeny is one of the more interesting tools available because it places a species within the wider collection of evolutionary relationships, shedding new light on how similar the molecular machinery is across all organisms. As discussed in the introduction, this work was motivated by an attempt to place the isoprenyl production that uniquely makes up archaeal membranes in an evolutionary context and to see where it may have come from. The result leads to some surprising conclusions. Because eukaryotes are much more closely tied to archaea than to bacteria, the assumption was that the two eukaryotes compared would show greater similarity to archaea in their geranylgeranyl diphosphate synthase, which in archaea is referred to as bifunctional short chain isoprenyl diphosphate synthase. It would have been difficult to make a worse prediction, as can be seen in Figure 3 from the dramatically long branches leading to Homo sapiens and the amoeba Entamoeba histolytica. This indicates that archaeal membrane synthesis is more closely connected to bacterial metabolism than to that of the archaea's closer genetic relatives in Eukarya. This does make some sense: even though archaea and eukaryotes share much of their small-subunit (16S/18S) rRNA gene sequence, the eukaryotic membrane does not have the cyclic isoprenyl.
Figure 5: Archaea 16S ribosomal RNA phylogeny[7]
Figure 6: Initial "Gap" in aligned genes which propagated errors
The other area explored in this experiment was the prediction of ancestral gene sequences. Using the Maximum Likelihood algorithm, it was possible to work backwards from the archaea to generate the most likely ancestral sequences. These clearly had several issues: the error compounds dramatically, to the point where the likelihood of the sequence being correct was too small to calculate, and the built-in error analyzer reported it as 0. This is not necessarily of concern, since the value is expected to be extremely small; to improve on it, a new plugin will have to be coded with arbitrary-precision arithmetic. The largest problem with the dataset was that the nucleotide coding sequences did not align along the start codon, which accounted for some odd errors when the values were propagated forward and is the reason the theoretical DNA sequence starts with “GTG” instead of “ATG”. This is doubly unfortunate because the computer puts the probability of starting with a G at over 90%.
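The likelihood collapsing to 0 is consistent with ordinary floating-point underflow: multiplying many per-site probabilities together quickly falls below the smallest value a double can hold. The sketch below is only an illustration of that effect and of two common workarounds (summing log-probabilities, or using arbitrary-precision decimals). It is not the software or plugin used in this experiment, and the per-site probabilities, the sequence length, and all function names are hypothetical.

```python
import math
from decimal import Decimal, localcontext


def naive_likelihood(site_probs):
    """Multiply per-site probabilities directly (underflows for long sequences)."""
    result = 1.0
    for p in site_probs:
        result *= p
    return result


def log_likelihood(site_probs):
    """Sum natural-log probabilities instead of multiplying the raw values."""
    return sum(math.log(p) for p in site_probs)


def decimal_likelihood(site_probs, digits=50):
    """Multiply with arbitrary-precision decimals, keeping `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        result = Decimal(1)
        for p in site_probs:
            result *= Decimal(str(p))
        return result


# Hypothetical input: 2000 aligned sites, each reconstructed with probability 0.5.
site_probs = [0.5] * 2000

print(naive_likelihood(site_probs))    # double precision underflows to 0.0
print(log_likelihood(site_probs))      # a finite negative number
print(decimal_likelihood(site_probs))  # a tiny but non-zero value
```

Working in log space is usually enough for comparing candidate sequences, since only the relative size of the likelihoods matters; the decimal version is shown for the case where the raw probability itself needs to be reported.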