By Stephen Prata - January 2014
This page discusses creating a network (Figure 5) that visually represents the evolutionary relationships among the haplotypes for 121 members of the I-L38+ Project. It describes some of the concepts involved in creating and interpreting the network and outlines the many choices made in leading to this specific result.
The I2a2b L38+ Project lists Y-DNA STR data for over 200 project members, with 12 to 111 STR values (each set a haplotype) per entry. That’s a lot of information. The Project provides some order by sorting the members into nine subgroups based on the values of a few key STR markers. Another method of exploring the connections amongst members is to apply network analysis, as Hans de Beule has done in several papers (listed at I-L38 Papers & References). I thought it would be interesting to apply this technique to a large sample including many STR markers to see what it might reveal about the organization of the L38+ haplotypes. A graphical representation of the information makes it easier to visualize relationships. I used the free Network program available from Fluxus Technology Ltd. for the analysis. For data, I used the 121 project members that, as of autumn 2013, had at least 67 STR markers listed; 49 of those markers were used.
Understanding the results does require some background. Since some readers may lack parts of this background (as I did), I’ll try to supply some now.
First, a disclaimer: my understanding of these matters is a journey in progress, and if I’ve gone astray, I would appreciate some course correction (preferably gentle). Second: Much of my discussion represents an attempt to explain things to a novice, in particular, myself. Third: you can find a fuller discussion of some of these matters on the L38+ Project background page and on the other pages of this website.
The beginning, of course, is the Y chromosome, the chromosome that a man inherits from his father, the only chromosome in the human genome that is passed in its entirety from parent to offspring. If it weren’t for mutations, every male human today would have the identical Y chromosome. And it would be the same as that of the most recent common ancestor of all men today, currently estimated to have lived 120,000 or more years ago.
But mutations of various kinds have occurred. Geneticists use one particular type of mutation, the Single-Nucleotide Polymorphism (SNP), to divide the male human population into different Y-chromosome classes called haplogroups, which they have attempted to arrange in an evolutionary tree. For example, the M170 mutation, among others, indicates membership in the I haplogroup. Mutations that occurred after M170 serve to divide the I haplogroup into various subgroups. For instance, the P215 mutation indicates membership in the I2 subgroup. And the L38, L39, L40, L65, and L272 mutations indicate membership in a group labeled by ISOGG as I2b2 in 2008 and relabeled as I2a2b in 2011. This shouldn’t be confused with the former I2a2b group, relabeled as I2a1b2 in 2011. To avoid that confusion, people have begun using the final mutation in the series to identify groups. So the 2011 I2a2b subgroup can be labeled L38+ (Geno 2.0 and the L38+ Project) or L39+ (FTDNA). Recent discoveries (see the L38+ project News page) indicate the L533+, L69+, and, perhaps, F780, and PF112 mutations indicate L38+ subgroups, albeit ones with few members.
SNPs measure evolution on the time scale of thousands or tens of thousands of years. A second sort of mutation, the Short-Tandem-Repeat, or STR, has higher mutation rates and provides a view on a finer time scale. Each STR location is characterized by the number of times a short segment of DNA is repeated, and mutations change this number, either increasing or decreasing it. The original L38+haplogroup man would have had a particular set of counts for his STR markers, and that would constitute his haplotype. Since then, that original haplotype has mutated along various branches to the large set of haplotypes we have today within the L38+ haplogroup. (The most common mutation changes a count by 1. That is, a count of 20 most often mutates to 19 or 21. Mutations that change the count by 2 or 3 are possible but much rarer.)
The value of any single STR marker in a haplotype doesn’t indicate any particular haplogroup. For example, most L38+ members have a value of 13 for the STR labeled DYS393. But a few have values of 12, 14, or 15. And there are many individuals with a value of 13 who belong to other haplogroups. But values within certain ranges for ensembles of STR markers do correlate with haplogroup. For example, the FTDNA 12-panel STR test classifies participants into broad haplogroups.
Haplotypes can differ not only in the STR values but in the number of STRs used. For example, common Family Tree DNA (FTDNA) haplotypes are the 12-marker, 25-marker, 37-marker, 67-marker, and 111-marker versions. The more markers used in the haplotype, the more precise the correlation is to a haplogroup. Also, a 37-marker test can resolve individuals that have the same 12-marker haplotype into different 37-marker haplotypes. So we can regard increasing the marker count as increasing the resolution of our haplotype view.
Suppose we knew the identity and relationships of all the I-L38+ men since the first. Then it would be a simple matter to construct a family tree with each node representing a person and each link representing a father-son relationship.
Now suppose we also knew the haplotype for each of these men. Then we could construct a tree showing the evolution of these haplotypes. Each node would be a haplotype, generally containing several individuals, and each link would be the mutation that converted one haplotype to the other. This would be what is termed a phylogenetic tree.
But we don’t have all that information. Instead, we have some challenges.
Given the present populations of Europe, North American, and Australia (known habitats for I-L38+) and that I-L38+ is estimated to constitute about 0.4% of that population, Hans de Beule estimates that the current population of I-L38+ individuals is over two million. The I-L38+ project has data for over 200 individuals, or about 0.01% of that total population. The vast majority of I-L38+ haplotypes have a single representative, at least at the 67-marker level, so it's clear that adding many more individuals would add many more haplotypes. That could make analysis more difficult, but it also could clarify the relationships amongst the haplotypes.
Another problem with the present sample is that it's biased in a couple of ways. First, sampling rates vary widely over different geographical regions, which may obscure geographical differences. Second, many of the project members tested to explore family relationships, so the data is biased towards family groupings.
We are trying to reconstruct a tree based mostly on observations about the ends of the branches, and we don't have much data on the older haplotypes. That's because with the passage of time a haplotype tends to evaporate as its members die without issue or else produce new haplotypes. Consider the MRCA for the group. The likelihood that a 10th-generation descendent of his would have the same haplotype is about 34%. The likelihood that the same person has a genetic distance of 1 is about 37%. These numbers are based on the following:
The last point means the genetic distance for a given STR from the MCRA value could be different from the number of mutations for that STR. For example, three mutations in the same direction produce a G.D. of 3, but if two are in opposite directions, the G.D. is 1.
Table 1 shows the likelihood for various G.D. values for several generation values. This simulation assumes that all branches experience similar demographic growth. Different assumptions can lead to different numbers, but the main point is how the distribution of haplotypes changes with time. The highlighted ranges in each column indicate values that add to approximately 70%. So, as time passes, the most common haplotypes get further from the original haplotype and spread over a broader range. Keep in mind that a genetic distance of, say, 5, can describe many different haplotypes, as the differences can be distributed in many ways among the 49 STRs.
It's not impossible that old haplotypes are present in the modern population, but they are not likely to show up in a sample of 200 individuals.
But missing haplotypes aren’t the only problem in constructing a phylogenetic tree for haplotypes. It turns out that the supposed tree may not necessarily be a tree at all. That’s because two individuals with the same haplotype can descend from different haplotypes. That is, two twigs from different branches might merge into a single branch. So the result can be a tree with closed loops. The technical term for this merging is homoplasy. It can come about in more than one way. Let’s look at some cases. For simplicity, assume the haplotype is for just two markers.
1. Suppose a haplotype H1 consists of the values 10 and 20; we can represent that as (10,20). Suppose Alf sires a son Bob whose haplotype mutates, making Bob the founder of a new haplotype H2 with values (11,22). Later, Carl, an H2 descendant of Bob, sires Don, who becomes the founder of haplotype H4 with values (11,21). Meanwhile, back in H1, Ed, who lives a few generations later than his cousin Alf, sires Fred, who becomes the founder of a new haplotype H3 with values (10,21). Then Gus, an H3 descendent of Fred, sires Hal, who winds up with values (11,21) and hence is part of H4. Figure 1 illustrates this state of affairs, with the center color of a circle indicating the haplotype and the border color indicating the haplotype of the direct ancestor. One result is that both Doug and Hal have the identical haplotype H4 but they have different haplotype histories. This is one route to homoplasy.
2. We could have a similar situation, but instead of Hal descending from Gus, Hal could descend from Don, and Gus could descend from Hal, making H3 the haplogroup with homoplasy.
3. Another possibility is that Curtis, another H2 descendant of Bob, has a son Doug in which the mutation is from (11,20) back to (10,20). Now H1 would have a mixture of members with different haplotype histories.
4. Starting with the two haplotypes (10,20) and (10,24), each type might eventually have a descendant of type (10,22), a case of convergent evolution.
5. Starting with the (10,20) and (10,21) pair of haplotypes, we could have a chain of parallel mutations: (10,20) goes to (11,20), (11,20) goes to (12, 20), and (12,20) goes to (13,20). Then (13,20) goes to (13,21). Meanwhile, a second chain of mutations (Figure 2) leads from (10,21) to (13,21. Here haplogroup H8 exhibits homoplasy
Let’s examine this last point. Consider Figure 1. Suppose that instead of representing a two-marker haplotype, it represents two out of a total of 25 STR markers. For both Don and Hal to have the same haplotype, all 25 markers should have the same value. This means that either none of the other 23 markers underwent a mutation while the two pictured markers did or else exactly the same additional markers for Don and Hal followed the exactly the same mutation paths. The more time elapses, the less likely these possibilities become. For instance, it becomes very unlikely that all the mutations shown in Figure 2 would have taken place without any of the other 23 markers going unchanged.
Going to more markers further reduces the importance of homoplasy, and that is one reason for using as many markers as feasible.
We face tough problems when attempting to construct a phylogenetic tree for the I-L38+ haplotypes. Most important is that we have only a scant sampling of haplotypes, and they just portray the contemporary state of the haplogroup. This makes it possible to hypothesize many possible trees that might fit the data. Then there is the problem of homoplasy. Network analysis can help with both of these issues. The idea is to create a network of links between nodes. Each node represents a haplotype, and each link represents a mutation or set of mutations needed to transform one haplotype to another. A tree provides just one path leading from one node to another, but a network can provide multiple paths. A network can have closed loops representing homoplasy. It also can represent the superposition of many possible trees, allowing us to visualize several equally good trees rather than pick one as being “the” tree. This serves to remind us of the limitations and uncertainties in our knowledge.
To visualize the advantage of a network over a tree, suppose a database contains individuals with the H1 and H4 haplotypes of Figure 1. If you were constructing a tree you might infer a hypothetical haplotype H2 and connect the types as shown in Figure 3, case a. Or you might infer haplotype H3 and connect the types as shown for case b. In reality, there’s really no basis to prefer one to the other, and in a network you can connect them as in case c. This visual representation tells you that either construction is equally plausible. It also reflects the possibility that, due to homoplasy, both representations might simultaneously be true. It is a more complete and more honest representation of what we know and of the uncertainty in what we know.
In this figure the circles representing the haplotypes are the nodes and the lines connected them are the links.
The project to be presented here used Network 22.214.171.124, which is free software from Fluxus Technology. It’s intended for a variety of applications, including analyzing Y-DNA STR data, DNA nucleotides, and amino acids, and it provides several pathways for analysis. Its primary goal is to create median networks, so let’s take a look at what that means.
Let’s use just a three-marker haplotype. Suppose we have three haplotype nodes: A = (9,20,30), B = (10,21,30), and C = (10,20,31). There are various ways we can link them together with single-mutation links by assuming suitable intermediate haplotype nodes. For instance, you could hypothesize two intermediate nodes and create a chain of five nodes. But the most efficient way to tie A, B, and C together is to assume a single intermediate haplotype Z = (9,20,30). Each of A, B, and C is exactly a single one-step mutation away from Z, and you can link them together as shown in Figure 4.
Z is termed a median vector or consensus vector, and it provides the shortest combined distance for linking each pair (i.e., AB, AC, and BC) together. In this case the median vector is a hypothesized node needed to link three nodes together to create that shortest combined distance, but it could also be an existing node corresponding to a known haplotype.
The links in themselves do not imply the direction of the relationship. In Figure 4, any of the four nodes could be ancestral to the other three. For instance, B and C could descend from Z, which descends from A. Or A, B, and C could descend from Z.
A median network is one for which a unique median vector exists for any three nodes. The significance of a median network in phylogenetics is that it will contain all possible maximum parsimony trees for the data. And what that means, I think, is this:
Consider all the possible evolutionary tree structures you could create to connect a set of haplotypes. Every link between nodes represents one or more mutations. For each tree compute the total number of mutations represented by all the links. Find the trees that use the least number of mutations. These are maximum parsimony trees. There’s no guarantee that nature actually uses the most parsimonious approach, but these trees are thought to be a good beginning for recreating the evolutionary structure.
The Network program undertakes to create a median network for a data set. It supplies median vectors as needed to allow joining all the original haplotypes into a single median network. These hypothesized median vectors may represent haplotypes that exist but are not part of the sample, and they can represent older ancestral haplotypes no longer present in the population. The Network program calls its basic network-creating procedure MJ, for median-joining.
Some of the STR markers are not suitable for network analysis. This includes all those markers with multiple values, for instance, CDY, which lists values such as 34-35. The problem with this type of marker is that the values are sorted by numerical order rather than by the sub-unit involved. So the two orderings 34,35 and 35,34 both would be reported as 34,35. This is just a result of the testing procedure, and further tests would be needed to get the correct order. The information is still useful for general classification because 34-36 is definitely a different haplotype from 34-35. Also, the Network User Guide advises not to use DSY389i and DSY389ii unless two other components of the DSY389 group are known. These omissions still leave 49 STR markers in play.
The Network program requires fairly short labels for the haplotypes, so I used the simple scheme in Table 2 to represent subgroups as identified on the I-L38+ Project STR data page as of autumn 2013. (Since then, another small group has been added.)
Note that two of the subgroups (G3 and G9) have associated SNP mutations (L533+ and L69+), making them haplogroup subgroups.
I labelled the first member of G1 G1-01, the next G1-02, and so on. Appendix A shows the members used and the corresponding group IDs.
What is a Subgroup?
This article uses the term subgroup in three different ways.
A new SNP defines a new haplogroup. So those members that test positive for L533 can be considered members of a descendant haplogroup labeled I2a2b1 or I-L533+.
The I-L38+ Project divides its members into subgroups based on STR values. These are the groups labeled G1, etc., in the table above. Two of these groups, G1 and G6, are fairly broad compared to the others and can be thought of as the results of a first sorting pass.
Network analysis can reveal subgroups based on how haplotypes appear to cluster in the graphical representation of the haplotypes.
These uses aren’t entirely independent. As discussed on the I-L38+ Project News page, the L533+ SNP is associated with the G3 group and the L69+ SNP is associated with the G9 group. And the mutations F780+ and PF112+ may define other subgroups.
The main impetus for this study is to see how network subgroups correlate with the other two definitions.
Figure 5 shows an example result obtained by running Network with the I-L38+ data.
Figure 5 - click to enlarge
In this figure the length of a link is proportional to the number of mutations between the two ends of the link. The small red nodes are the median vectors created by the program. The haplotype nodes are color-coded as indicated, as are the links that tie together nodes of one group.
In general terms, I’ve positioned the nodes, keeping link lengths unchanged, so that the periphery represents contemporary haplotypes and the center of the network, where the median vectors concentrate, represents the reconstructed past. The outer parts of the network are tree-like, with only one possible pathway linking two nodes. The network aspect appears at the center, where multiple pathways, shown in red, link two nodes. This part is termed the torso.
The smaller project groups (G2, G3, G4, G5, G8, and G9) also appear as grouped nodes in this network. G7 doesn’t appear to be a simple group, and G1 and G6, the two largest groups, appear to be reservoirs from which some smaller groups may be drawn. A few haplotypes appear candidates for reclassification. (Indeed, some have been reclassified on that basis.) There are indications of some small subgroupings within G1 and G6. For example, G6-42 through G6-46 form a tight cluster, as one might expect from a family group. G6-16, G6-35, G6-36, and G6-37 form another subgroup, one for which the individual haplotypes diverged from one another further back in time.
Note how these two G6 subgroups, as well as G4 and G5, are structured. Each has a long, multi-mutation link from near the center to a more distant median vector that appears to be the base for the group. In contrast, the G8 and G9 groups show branching extending much further into the past. Consider the first sort of structure. Each mutation within a link corresponds to a father that has produced a son with a mutated haplotype and thus is a branching point. So there should have been several links branching off, much as is the case for G8. There are a couple of explanations why we don’t see them. One is that none of these branches have present-day survivors. Given the European history of warfare, plague, famine, and other disasters, it’s not difficult to believe this could happen. The second explanation is sample size. If we had data for 10,000 I-L38+ members, we could expect more haplotypes to show up, and perhaps some might be those missing from these groups.
Recall that two SNPs seem to indicate I-L38+ subgroups and that two more SNPs are candidates for indicating new subgroups. How does the network fit in with these facts?
The network appears to radiate from a central point that I’ll call the hub (Figure 6).
Figure 6 - click to enlarge
This picture is consistent with the central haplotype existing at a time of a rapid population expansion. Note how many different branches connect to the node I’ve called the hub. Each represents a mutation, and the only reasonable way for one haplotype to generate so many mutations is that many individuals have that haplotype. Haplotypes tend to evaporate with time as the biologic descendants experience mutations, so the simplest way to build up a large population of a particular haplotype is to have a rapid population growth while the haplotype still exists. Note that several of the nearby descendant nodes also have several branches. Given that those nearby haplotypes began with a single individual, population growth would have had to continue long enough for those haplotypes to reach a population sufficient to engender multiple mutations.
Thus, the network suggests a period of rapid population growth that persisted for several generations, and that most likely would have to be accompanied by geographic growth to provide the resources for the growing population.
The hub is not necessarily the oldest node. Links are not in themselves directional. So one of the links radiating out from the hub may actually connect to an ancestor node rather than a descendant node. But, as it is extremely unlikely for multiple nodes to be ancestral to the hub, most of the links would connect to nodes that evolved from the hub. In short, the hub doesn’t signify the oldest haplotype; it signifies a haplotype involved in a rapid population expansion.
The torso consists of several diamond-shaped elements with one vertex at the hub. Each portrays two alternative paths from the hub to the node at the opposite end of the diamond. As noted earlier, there are two ways to interpret this topology. One is that only one alternative actually occurred, but we have no way of choosing one over the other. The second (homoplasy) is that both occurred, so that the haplotype at the tip contained some individuals with one ancestry and some with the other.
Going from the raw data to a particular outcome, such as that of Figure 5, requires many choices, choices that shape the result. Figure 7 demonstrates that point by showing the result if one accepts the default processing. Once again, the links in red represent the torso.
Figure 7 - click to enlarge
Not only is this network confusingly more complex, it has problems tying together members with the L533+ SNP mutation (G3).
Knowing the process is crucial to understanding the result, so let’s look more closely at the process.
It’s often said that one shouldn’t visit a sausage factory if one wants to enjoy the output. But the situation for studies like this is different. The results depend on many decisions made in the processing, so to understand the results and to appreciate the uncertainties that always exist, one really should visit the factory. In this spirit, I’ll lead up to the network of Figure 5 in a sequence of steps that, I hope, illustrate how the Network model works and how to read the graphics it produces, and which also show and justify the choices made.
First, let’s look at some of the Network options. Figure 8 shows part of the Calculate Network menu. The sole Optional Pre-Processing choice is Star Contraction. The two Network Calculations choices are Reduced Median, or RM, and Median Joining, or MJ. The sole Optional Post-Processing choice is MP (for Most Probable) Calculation.
Figure 8 - click to enlarge
What do they do?
The Star Contraction can group closely related haplotypes into a single node. This option did not prove particularly useful for the data set used here, so I didn’t use it.
The Reduced Median (RM) and the Median Joining (MJ) use two different algorithms to create a median network. The default form of each typically doesn’t produce a median network but works as a starting point for an iterative approach. For each method one can increase a parameter that brings the result closer to a median network.
The Optional Post-Processing choice is called MP (for Most Probable), and it simplifies networks by removing nodes and links not needed for the more probable trees in the network.
Network has somewhat contradictory goals. One is to produce a median network. It will contain all the maximum parsimony trees, and the hope is that one of those represents the actual evolution of the haplotypes. The other goal is to present the information in an intelligible fashion. Figure 7, which was generated using the default settings for MJ, fails on both counts. The defaults are equal weights of 10 for all STRs and a value of 0 for what’s termed the ε parameter. The advantage of using these settings for the I-L38 data set is that the calculation is fast, about ten seconds on my desktop computer. The first disadvantage is that the result typically is not a median network, hence not all possibilities are shown. The second disadvantage is that interpretation may not be clear. So let’s examine the Network choices.
Recall that the aim of the Network median-joining (MJ) routine is to construct a median network because that would contain all possible maximum parsimony trees (trees that required the fewest number of mutations). And recall that a median network is one containing a unique median vector for any three nodes. The default setting for MJ is intended to be a fast first attempt that just looks at closely spaced node triplets. It’s possible that the resulting network might fall short of the median network.
The suggested next step is to progressively increase a parameter ε, which, by default, is set to 0. Increasing this parameter causes the program to look further than single-mutation steps when constructing median vectors. The initial pass may not find all the median vectors needed for the complete median network. Increasing ε can find additional maximum parsimony trees. This, in turn, can produce networks with too much complexity to easily understand. So the user guide recommends running the MP post-processor, as described earlier, to eliminate the less probable trees.
Unfortunately, using MJ with non-zero ε places a much greater strain on computing resources. For example, an MJ calculation with the full data sample and an ε of 0 took about ten seconds on my computer, but using an ε (the next recommended step) of 10 led to an out-of-memory message after nearly a half-day of computing. It’s a difference readily apparent even to a novice such as myself.
Network provides an alternative computation (RM) that produces a reduced median network. This computation attempts to remove less-likely links as it creates a network. Thus the final network may not contain all possible maximum parsimony trees. But, like MJ, RM does have an adjustable parameter that allows one to increase the number of median vectors. Unfortunately, the default RM computation failed with the L38 data. More precisely, after nearly two days of processing, it yielded an output file with nearly 120,000 median vectors, and neither MP nor the plotting module could process that file.
It turns out that large STR data bases are particularly stressful for Network. But the User Guide has a remedy: using phase 1 of the RM computation to generate a preliminary file (.rmf suffix) that then is used as input for MJ. This reduces complexity and produces better results than MJ alone, even than MJ with large values for ε. (Michael Forster, Fluxus (private communication) and the Network User Guide)
Another suggestion is to assign a weight to each STR. By default, the program assigns equal weights of 10. The suggestion is to assign higher weights, say 20, to those STRs with few mutations, and lower weights, 5 or even 0, to those with more frequent mutations. Consider a mutation that occurred just once in L38 a hundred generations ago. That would divide the haplogroup into two subgroups, and individuals in one subgroup would be more closely related to other individuals in the same subgroup than those in the other subgroup. This mutation would represent an important organizing event. Now consider a mutation that has occurred 50 times during that same period. In this case, whether or not two individuals share that mutation doesn’t tell us much about relationships.
From the standpoint of the network, unequal weights downplay the importance of some markers, simplifying the network.
In short, at least as I understand, using the MJ calculation with increasing values of the ε parameter aims to construct a median network, which will contain all maximum parsimony trees. Using the RM calculation, either alone or with MJ, using weights, and using MP post-processing all act to eliminate less likely maximum parsimony trees and to create a more intelligible network. The result of applying these last three techniques is to produce a plausible, but not exhaustive, reconstruction of possible evolutionary pasts.
All these points may become clearer if we first look at a reduced data set, one small enough illustrate several processing options.
For a reduced data set, I used just the smallest groups: G2, G3, G4, G5, G7, G8, and G9. Together they comprise less than half the total sample, making the results easier to visualize. Let’s begin with the STRs from the 12-marker panel. That will give us reason to appreciate the value of using more markers. After eliminating the multi-value sites as mentioned earlier, 8 markers are left. Running the default Fluxus Network MJ (median-joining) routine on this sample produces the graphic representation shown in Figure 9.
Figure 9 - click to enlarge
At first, this may appear impressive but remarkably uninformative. Here are some important elements:
The size of the nodes represents the number of individuals having that haplotype. For example, G7-03 includes G7-04, G7-05, G7-06, and G7-07, while G7-01 is just G7-01.
I added color coding to facilitate identifying subgroups.
The small red nodes are the constructed median vectors.
The length of the links is proportional to the number of mutations needed convert one node to the other. In this particular case, most of the links represent a single mutation. The G2-03 to G2-04 link represents two mutations.
The physical distance on the page between two nodes doesn’t mean anything. What counts is the distance as measured along links connecting the two. For example, G7-09 is close to G4-04 in the figure. But the mutation distance would be measured along the links leading from G7-09 up and across to G7-16, then back again via G7-12 to G4-04.
The network links drawn in red (a user option) form what is termed the torso. The torso consists of all the parts of the network that are closed loops plus links to connect the different loops together. (A post-processing routine called MP may remove some of the loops, but the remaining nodes still count as being in the torso.) As you can see, nearly the entire network is part of the torso. This indicates ambiguity and possible homoplasy. Consider, for example, the G7-11 haplotype. Supposing that G7-11 is the end product of a mutation chain, it could descend in two mutations from G4-04 by the violet path or the green path shown in Figure 10. Or it could descend from G7-05 by the magenta path or via G7-12 (orange and violet). If there is just one person with the G7-11 haplotype, the figure describes the ambiguity about the actual origin of that individual. If several individuals have the G7-11 haplotype, homoplasy comes into play; there might be some individuals with each of these origin stories.
Figure 10 - click to enlarge
The network doesn’t really do much to increase our understanding of the relationships among subgroups; we failed to give the program enough information to do a good job.
Next, let’s increase the number of STRs to 49, again using the default setting for the MJ option. Figure 11 shows the result.
Figure 11 - click to enlarge
Notice the many differences.
The torso (red links) is much less extensive and less complex.
Nearly all the nodes represent single individuals; haplotypes that were identical for 8 markers no longer are identical. This is one result of higher resolution.
More markers results in more mutations, putting greater distance (longer links) between nodes. So higher resolution creates greater separation between nodes.
For the most part, the network separates the haplotypes into distinct clusters corresponding to the I-L38+ Project groups. This is another result of higher resolution.
The clustering is indicated by the links, but I’ve dragged the nodes around, maintaining link length, so as to make the clustering more visual.
There are some anomalies. First, G5-11 doesn’t seem to have a close connection to the other G5s. (Recently the I-L38+ Project reclassified that haplotype into a new subgroup.) More importantly, the four G3s appear to have no close relationship. But we know they must be related because they form the new subgroup marked by the L533+ SNP mutation. The G2s are a bit iffy also, looking as if they derive from G3 even though it is G3 that has the mutation. However, G2-04 is the only G2 that has tested negative for L533; the other two haven’t been tested and thus could be L533+.
There are also some points to notice about networks, trees, and other Network options.
Network Versus Tree
A network can have more than one path linking to nodes, while a tree has just one path linking two nodes. You can obtain a tree from a network by selectively removing links. Because the torso contains all the loops, any reduction of a network to a tree involves removing links from just the torso. For example, Figure 12 shows the original torso and the central part of the network, while Figure 13 and Figure 14 show two possible trees.
Figure 12 - click to enlarge
Figure 13 - click to enlarge
Figure 14 - click to enlarge
While a median network contains trees of equal (and maximal) parsimony, not all are equally probable. MP post-processing retains the most probable links and removes the less likely ones. The G8 grouping is the only part of the network affected by this processing .Figure 15 shows the original network for this group, and Figure 16 shows the network after MP processing. Note that the retained median vectors all have links to existing haplotypes, whereas the dropped median vectors don’t.
Figure 16 - click to enlarge
Links that were part of the torso remain identified as part of the torso even though they no longer are part of closed loops.
How can we deal with the failure to render the L533+ subgroup adequately? As mentioned earlier, using MJ with increasing ε values isn’t an option for the full data set, nor is using RM alone. The two main suggestions are to use weights for the markers and to precede the MJ calculation with an RM (Reduced Median) calculation.
Applying Network Techniques
Taking these points together, I decided to use the combined RM-MJ processing followed by the MP post-processer. Working with the full data set, I developed a set of weights; I’ll discuss that process later. For the present, I’ll use those weights for the remaining examples, beginning withFigure 17, which uses the same haplotypes as Figure 11
Figure 17 - click to enlarge
What can we read from this network?
Note how the various branches seem to stem from a central point (a median vector shown here as red with a blue rim, at least if you can zoom in on the center). Also note how the various median vectors tend to be close to this hub while the nodes for existing individuals are further away. This fits with the notion of the median vectors being reconstructions of the past. We can view this figure as showing an evolution of haplotypes from a somewhat uncertain past in the center to present-day haplotypes towards the perimeter.
It’s important to keep in mind that although most of the nodes represent data for a single individual, each node really represents a haplotype, not a person. This gives each node an extended time dimension. A node such as G9-04, for example, represents the tested individual, all his paternal ancestors back to the first instance of that haplotype, and all their unmutated descendants. Thus it probably spans many generations. A haplotype has a lifespan. It begins with one individual, it’s populated by that individual’s unmutated descendants, and, perhaps, by homoplasy. It’s depopulated by the deaths of its members and also by the descendants themselves eventually having mutations. (The estimated combined mutation rate for the 49 STRs used is roughly one mutation per nine generations.) So it shouldn’t be surprising that the past is represented by median vectors rather than by current haplotypes.
The G2/G3 situation looks better than in the preceding graph. There is a clear relationship, shown in green, among all four G3 samples. G2-05 seems anomalous, as it’s closely linked to G3 nodes rather than to G2 nodes. But G2-05 hasn’t been tested yet for L533, so G2-05 very well could be G3 rather than G2. (If not, that’s a problem.) Also, the figure would conform better to expectations if the G2-02, G2-03 branch connected closer to the center of the graph, between where the G3-03/G3-04 branch connects and the G4 branch connects, making the G2 branch predate the G3 branches, which are thought to represent a mutation of G2.
As always seems the case, G5-11 doesn’t seem to belong with the other G5s. What will differ from model to model is where G5-11 does get placed.
This model has no torso; introducing RM and weights has removed some of the ambiguity from the representation, although probably not from the reality.
The tree shown here is termed unrooted, meaning it shows the connections but not necessarily the direction of evolution. It is clear that the outer parts evolved from the inner parts, but the central location of the hub median vector doesn’t necessarily mean it’s the oldest. There could be older haplotypes that weren’t required for linking the current ones. Or we can assume the median vector immediately above the hub is older and rearrange the graph as shown in Figure 18.
Figure 18 - click to enlarge
This emphasizes the tree-like structure of this particular loop-free network. However, graphically it crowds the nodes together, which would have made the labels (omitted here) hard to read.
In either representation, the network represent an evolution of haplotypes from a past represented by the hub and median vectors near the hub to the present population of haplotypes.
Both the RM procedure and using unequal weight reduce the complexity of the network, concealing some of the inherent uncertainty. The optional ε parameter can restore some of that lost uncertainty. As mentioned earlier, this parameter is not that useful for larger data sets, but it is useable with the smaller data set used so far. And the result does shed some light on the G3 problem. The manual suggests starting with an ε of 10 for the case when the weights are all 10. For the model shown, the average weight used was 11, so I used that value for ε. Figure 19 shows the result.
Figure 19 - click to enlarge
The overall associations of haplotypes is not much changed, but now there is a torso expressing alternative ways for linking the nodes. Note in particular the part of the network with G2 and G3, as shown in Figure 20.
Figure 20 - click to enlarge
Starting from the hub node, the green path shows a subtree containing all four G3 samples. It also includes G2-05, but a positive L533 test, which this model predicts, would fix that. The cyan branch connects the hub to G2-04, the known L533- member of G2. The two paths cross at the purple hub. This would be an example of homoplasy, with some individuals at that node having an ancestry that traces the green path back to the hub and others having an ancestry that traces the cyan path back to the hub. G2-03, you may recall, hasn’t been tested for L533. And this network leaves open either possible result, as there is one link path from a green branch to G2-03 and one link path from the cyan branch to G2-03.
This is a good example of how a network can explain data that don’t quite fit a tree. Other closed loops in the torso may also reflect homoplasy, or they may simply indicate that there is more than one possible set of connections that can explain the observed data.
Also, the example illustrates the sort of behavior one might expect if non-zero ε models could be run with the larger data samples.
Earlier, it was suggested that the RM-MJ combination works better than using MJ and varying ε. We can examine that claim for this reduced sample. Figure 21 shows the result of using MJ with uniform weights of 10 and an epsilon of 10.
Figure 21 - click to enlarge
Clearly, the result is not clear. Running the MP calculation simplifies the network quite a bit, as shown in Figure 22. However, the network still is less than simple, and it doesn’t resolve the L533+ problem as well as the RM-MJ calculation does.
Notice that the main complication of the network takes place with the median vectors representing past haplogroups. For example, the G5 representatives still group together, but instead of being tied to the center of the network by a single long link, they are connected by a set of alternative paths. In general, as one might expect, the network paints a more definitive picture of recent relationships between haplotypes than it does of relationships in the past.
One reason the recent networks divide the data so cleanly into separate groups is that these groups are already well defined by the I-L38+ Project classification scheme, which uses from two up to six STR markers to define them. The remaining two groups, G1 and G6, are defined by the value of a single STR (DYS448), and adding them to the analysis creates a more disordered picture. This brings us back to Figure 5, which, for convenience, is displayed again here as Figure 23. In this more complete network, the G7 members remain well separated from the other small groups, but they do seem to overlap with some of the G1 and G6 members.
Figure 23 - click to enlarge
The results for this data set depend strongly upon the weights used. The easiest way to see this is to look at the network produced using the same data and the same processing (RM followed by MJ followed by MP) but using equal weights of 10 (Figure 24).
Figure 24 - click to enlarge
Omitting the MP stage gives an even more complex torso, whereas with unequal weights the torso was the same whether or not MP was used, so it’s clear that weight choice can simplify the output. It’s also true that while some features are not much affected (the correlation of the network with G4, G5, G8, and G9, for example), others, such as the hub form and the representations of G2 and G3, are affected.
This raises two important questions:
I looked at three plausible methods of choosing weights.
In a perfect world, one might expect all three approaches to yield similar weights, but apparently the world is not perfect. Table 3 shows those initial weights. The mutation count and variance boxes are empty for DYS426, DSY590, DSY472, and DSY425 because the sample data set had no mutations for those STRs.
A second factor is the randomness built into the mutation process. A mutation rate of, say, one mutation per hundred generations does not mean that one mutation shows up every hundred generations, like clockwork. It’s similar to tossing a coin—the fact that there’s a 50% probability of heads doesn’t mean that tossing a coin a thousand times yields the sequence HTHTHTHT and so on for 500 HT pairs. There will be stretches with many heads in a row and stretches with many tails in a row. So with STRs you may get some mutations when you expect none and few mutations when you expect several. This behavior results in variances that can be quite different from what one would expect from the mutation rates. Ken Nordtvedt investigated this with numerical simulations and concluded the following:
“When the many STRs of extended haplotypes are evaluated after a clade's descendant from the common ancestor, it is found that the individual STR ratios of Variance/mutation rate vary all over the place from way above the expected value … to significantly below.” (http://knordtvedt.home.bresnan.net/FirstMutation-A.txt)
Generally speaking, the earlier a particular mutation first occurs in a trial, the larger the variance is for that mutation.
A third factor is the human history of the haplotype. If some branches are selectively thinned or cut short and other branches grow disproportionally quickly, diversity will be reduced and so will the variances.
To compare the effects of the three weighting schemes, let’s look at the resulting torsos. Figure 25 shows the result obtained using weights based on the internal mutation counts for Network.
Figure 25 - click to enlarge
Figure 26 shows the results using weights based on mutation rates.
Figure 26 - click to enlarge
And, finally, Figure 27 shows the result based on the variance. The long extension looks similar to the one in Figure 25, but it involves different haplotypes.
Figure 27 - click to enlarge
Although different from one another, all three are considerably less complex that the uniform weight torso.
None of these models was internally consistent in this sense: the mutation counts generated by the models were not consistent with the weights used to create the models. I decided to take the mutation count approach further by using the resulting mutation counts to generate new weights, then using the new weights to generate a new model and new mutation counts. My hope was that this iterative approach would lead to weights for successive models converging and to a self-consistent model. I applied this approach using each of the three previous starting points (first-pass mutation counts, mutation rates, and sample variance). In fact, each model did converge. However, although the three final models were much closer to each other than were the starting models, they did not match precisely. Given that the models were attempting to generate a network spanning a few thousand years with just a scattering of contemporary data, it’s not surprising that one cannot recover an unambiguous answer from the data. Each is a plausible subset of possible models.
Not having a basis for choosing any one result, I averaged the weights and ran two further iterations on that sample. The results, shown in Table 4, are the weights used in Figure 16 through Figure 23. The Network model ignores markers such as DYS426 that exhibit no mutations, so any weights can be used for those four markers.
We’ve seen three methodologies for generating networks for the full L38 49 marker data set.
Clearly, Method 3 produces the most intelligible and aesthetic representation. Also, the resulting network is the only internally consistent model, having weights consistent with the mutation counts. And that network is the most consistent with the information for the L533+ and the L69+ SNP values. So it is fair to say the Method 3 network is better than the other two.
But it also is true that Method 3 oversimplifies the situation. Comparing Figure 23 to Figure 24 shows that using unequal weights simplifies the torso. Comparing Figure 19 to Figure 21 and Figure 22 shows that using RM as a preprocessor for MJ excludes a great number of possibilities. We can restore some of these possibilities by using a non-0 ε with MJ. This wasn’t possible using MJ in the default mode, but RM simplifies the network so greatly that it becomes possible to use some none-0 ε values with the RM-MJ sequence. Figure 28 shows the result using an ε of 9.
Figure 28 - click to enlarge
This process adds much more intricacy to the center of the network. But it doesn’t change the basic picture much. The G3 group does look divided, but if one magnifies the center and follows the links, the relationship is not much changed. There are many more median vectors and links in the central region, but the overall pattern still is consistent with the idea a hub of rapid population growth surrounded by nodes acting as secondary sources of growth (Figure 29).
Figure 29 - click to enlarge
In short, the network of Figure 5 (and Figure 23) can be regarded as a simplified and non-unique network capturing the essential relationships in the data base. In particular, the torso can be regarded as a lower-resolution view of a more complex past.
One question that has bothered me is this: is the hub feature of the network of Figure 5 a valid reconstruction, or is it just an artifact of the particular procedure I followed? To get some insight, I applied the same procedure, including the iterative weight-selection, to a different data set. The Family Tree Y-Haplogroup I2a project has 67-STR data for over 1100 individuals divided among 40 subgroups. This haplogroup is thought to be older than I-L38. I randomly selected a few haplotypes from each subgroup to obtain 120 individuals, matching the original size of the L38 group. (One was added to the latter later.) Figure 30 shows the resulting network. The structure is quite different from the L38 network. In particular, rather than exhibiting a hub, it has several regions that seem to indicate periods of rapid population expansion, as indicated by several links emanating from a few closely linked nodes. This doesn’t prove that the hub form for L38 isn’t an artifact, but it does make it plausible that it isn't an artifact.
Figure 30 - click to enlarge
About the Author
Questions, Suggestions and Thoughts
Feedback is welcome. You can reach me at the following email address: