I-L38 Median Networks (1) - STRs only

By Stephen Prata - January 2014

This page discusses creating a network (Figure 5) that visually represents the evolutionary relationships among the haplotypes for 121 members of the I-L38+ Project. It describes some of the concepts involved in creating and interpreting the network and outlines the many choices made in leading to this specific result.

Introduction

The I2a2b L38+ Project lists Y-DNA STR data for over 200 project members, with 12 to 111 STR values (each set a haplotype) per entry. That’s a lot of information. The Project provides some order by sorting the members into nine subgroups based on the values of a few key STR markers. Another method of exploring the connections amongst members is to apply network analysis, as Hans de Beule has done in several papers (listed at I-L38 Papers & References). I thought it would be interesting to apply this technique to a large sample including many STR markers to see what it might reveal about the organization of the L38+ haplotypes. A graphical representation of the information makes it easier to visualize relationships. I used the free Network program available from Fluxus Technology Ltd. for the analysis. For data, I used the 121 project members that, as of autumn 2013, had at least 67 STR markers listed; 49 of those markers were used.
Understanding the results does require some background. Since some readers may lack parts of this background (as I did), I’ll try to supply some now.


Background

First, a disclaimer:  my understanding of these matters is a journey in progress, and if I’ve gone astray, I would appreciate some course correction (preferably gentle). Second: Much of my discussion represents an attempt to explain things to a novice, in particular, myself. Third: you can find a fuller discussion of some of these matters on the L38+ Project background page  and on the other pages of this website.
The beginning, of course, is the Y chromosome, the chromosome that a man inherits from his father, the only chromosome in the human genome that is passed in its entirety from parent to offspring. If it weren’t for mutations, every male human today would have the identical Y chromosome. And it would be the same as that of the most recent common ancestor of all men today, currently estimated to have lived 120,000 or more years ago.
But mutations of various kinds have occurred. Geneticists use one particular type of mutation, the Single-Nucleotide Polymorphism (SNP), to divide the male human population into different Y-chromosome classes called haplogroups, which they have attempted to arrange in an evolutionary tree. For example, the M170 mutation, among others, indicates membership in the I haplogroup. Mutations that occurred after M170 serve to divide the I haplogroup into various subgroups. For instance, the P215 mutation indicates membership in the I2 subgroup. And the L38, L39, L40, L65, and L272 mutations indicate membership in a group labeled by ISOGG as I2b2 in 2008 and relabeled as I2a2b in 2011. This shouldn’t be confused with the former I2a2b group, relabeled as I2a1b2 in 2011. To avoid that confusion, people have begun using the final mutation in the series to identify groups. So the 2011 I2a2b subgroup can be labeled L38+ (Geno 2.0 and the L38+ Project) or L39+ (FTDNA). Recent discoveries (see the L38+ project News page) indicate the L533+, L69+, and, perhaps, F780, and PF112 mutations indicate L38+ subgroups, albeit ones with few members.
SNPs measure evolution on the time scale of thousands or tens of thousands of years. A second sort of mutation, the Short-Tandem-Repeat, or STR, has higher mutation rates and provides a view on a finer time scale. Each STR location is characterized by the number of times a short segment of DNA is repeated, and mutations change this number, either increasing or decreasing it. The original L38+haplogroup man would have had a particular set of counts for his STR markers, and that would constitute his haplotype. Since then, that original haplotype has mutated along various branches to the large set of haplotypes we have today within the L38+ haplogroup. (The most common mutation changes a count by 1. That is, a count of 20 most often mutates to 19 or 21. Mutations that change the count by 2 or 3 are possible but much rarer.)
The value of any single STR marker in a haplotype doesn’t indicate any particular haplogroup. For example, most L38+ members have a value of 13 for the STR labeled DYS393. But a few have values of 12, 14, or 15. And there are many individuals with a value of 13 who belong to other haplogroups. But values within certain ranges for ensembles of STR markers do correlate with haplogroup. For example, the FTDNA 12-panel STR test classifies participants into broad haplogroups.
Haplotypes can differ not only in the STR values but in the number of STRs used. For example, common Family Tree DNA (FTDNA) haplotypes are the 12-marker, 25-marker, 37-marker, 67-marker, and 111-marker versions. The more markers used in the haplotype, the more precise the correlation is to a haplogroup. Also, a 37-marker test can resolve individuals that have the same 12-marker haplotype into different 37-marker haplotypes. So we can regard increasing the marker count as increasing the resolution of our haplotype view.


A Phylogenetic Tree

Suppose we knew the identity and relationships of all the I-L38+ men since the first. Then it would be a simple matter to construct a family tree with each node representing a person and each link representing a father-son relationship.
Now suppose we also knew the haplotype for each of these men. Then we could construct a tree showing the evolution of these haplotypes. Each node would be a haplotype, generally containing several individuals, and each link would be the mutation that converted one haplotype to the other. This would be what is termed a phylogenetic tree.
But we don’t have all that information. Instead, we have some challenges.


The Missing Present

Given the present populations of Europe, North American, and Australia (known habitats for I-L38+) and that I-L38+ is estimated to constitute about 0.4% of that population, Hans de Beule estimates  that the current population of I-L38+ individuals is over two million. The I-L38+ project has data for over 200 individuals, or about 0.01% of that total population. The vast majority of I-L38+ haplotypes have a single representative, at least at the 67-marker level, so it's clear that adding many more individuals would add many more haplotypes. That could make analysis more difficult, but it also could clarify the relationships amongst the haplotypes.
Another problem with the present sample is that it's biased in a couple of ways. First, sampling rates vary widely over different geographical regions, which may obscure geographical differences. Second, many of the project members tested to explore family relationships, so the data is biased towards family groupings.



The Missing Past

We are trying to reconstruct a tree based mostly on observations about the ends of the branches, and we don't have much data on the older haplotypes. That's because with the passage of time a haplotype tends to evaporate as its members die without issue or else produce new haplotypes. Consider the MRCA for the group. The likelihood that a 10th-generation descendent of his would have the same haplotype is about 34%. The likelihood that the same person has a genetic distance of 1 is about 37%. These numbers are based on the following:
  • The haplotype is described using the same 49 STRs used later for tree construction.
  • The mutation rates are as given by Marko Heinila (found at http://knordtvedt.home.bresnan.net/Generations111T.xlsx).
  • A mutation changes a value by just 1.
  • Mutations are assumed equally likely for the up and for down directions.
The last point means the genetic distance for a given STR from the MCRA value could be different from the number of mutations for that STR. For example, three mutations in the same direction produce a G.D. of 3, but if two are in opposite directions, the G.D. is 1.
Table 1 shows the likelihood for various G.D. values for several generation values. This simulation assumes that all branches experience similar demographic growth. Different assumptions can lead to different numbers, but the main point is how the distribution of haplotypes changes with time. The highlighted ranges in each column indicate values that add to approximately 70%. So, as time passes, the most common haplotypes get further from the original haplotype and spread over a broader range. Keep in mind that a genetic distance of, say, 5, can describe many different haplotypes, as the differences can be distributed in many ways among the 49 STRs.

Generations

10

30

50

70

90

110

average G.D.

1.06

3.03

4.82

6.48

8.01

9.46

Genetic Distance

likeli-hood

likeli-hood

likeli-hood

likeli-hood

likeli-hood

likeli-hood

0

34.1%

4.3%

0.6%

0.1%

0.02%

0.001%

1

37.1%

14.4%

3.4%

0.7%

0.2%

0.04%

2

20.0%

22.3%

8.7%

2.6%

0.8%

0.2%

3

6.8%

23.0%

15.1%

6.4%

2.3%

0.8%

4

1.7%

17.7%

18.7%

11.2%

5.0%

2.0%

5

0.3%

10.3%

18.3%

15.1%

8.9%

4.4%

6

0.1%

5.1%

14.8%

16.6%

12.5%

7.4%

7

0.0%

2.0%

9.9%

15.5%

14.6%

10.7%

8

0.0%

0.7%

5.7%

12.3%

14.9%

13.0%

9

0.0%

0.2%

2.8%

8.7%

13.4%

14.0%

10

0.0%

0.1%

1.3%

5.4%

10.4%

13.3%

11

0.0%

0.0%

0.5%

2.9%

7.3%

11.4%

12

0.0%

0.0%

0.2%

1.5%

4.7%

8.5%

Table 1
It's not impossible that old haplotypes are present in the modern population, but they are not likely to show up in a sample of 200 individuals.


The Tree That Isn't a Tree

But missing haplotypes aren’t the only problem in constructing a phylogenetic tree for haplotypes. It turns out that the supposed tree may not necessarily be a tree at all. That’s because two individuals with the same haplotype can descend from different haplotypes. That is, two twigs from different branches might merge into a single branch. So the result can be a tree with closed loops. The technical term for this merging is homoplasy. It can come about in more than one way. Let’s look at some cases. For simplicity, assume the haplotype is for just two markers.

1. Suppose a haplotype H1 consists of the values 10 and 20; we can represent that as (10,20). Suppose Alf sires a son Bob whose haplotype mutates, making Bob the founder of a new haplotype H2 with values (11,22). Later, Carl, an H2 descendant of Bob, sires Don, who becomes the founder of haplotype H4 with values (11,21). Meanwhile, back in H1, Ed, who lives a few generations later than his cousin Alf, sires Fred, who becomes the founder of a new haplotype H3 with values (10,21). Then Gus, an H3 descendent of Fred, sires Hal, who winds up with values (11,21) and hence is part of H4. Figure 1 illustrates this state of affairs, with the center color of a circle indicating the haplotype and the border color indicating the haplotype of the direct ancestor. One result is that both Doug and Hal have the identical haplotype H4 but they have different haplotype histories. This is one route to homoplasy.

Figure 1

2. We could have a similar situation, but instead of Hal descending from Gus, Hal could descend from Don, and Gus could descend from Hal, making H3 the haplogroup with homoplasy.

3. Another possibility is that Curtis, another H2 descendant of Bob, has a son Doug in which the mutation is from (11,20) back to (10,20). Now H1 would have a mixture of members with different haplotype histories.

4. Starting with the two haplotypes (10,20) and (10,24), each type might eventually have a descendant of type (10,22), a case of convergent evolution.

5. Starting with the (10,20) and (10,21) pair of haplotypes, we could have a chain of parallel mutations:  (10,20) goes to (11,20), (11,20) goes to (12, 20), and (12,20) goes to (13,20). Then (13,20) goes to (13,21). Meanwhile, a second chain of mutations (Figure 2) leads from (10,21) to (13,21. Here haplogroup H8 exhibits homoplasy


Figure 2

Observations
  • A haplotype contains its founding individual plus all his unmutated descendants. If homoplasy occurs, the haplotype may also contain individuals with different haplotype histories.
  • Mutations are rare events, so, to get multiple haplotypes originating from one haplotype, as with H1 in the examples above, it helps to have a relatively large number of individuals with that haplotype. It is extremely unlikely that a single individual will sire two separate mutations.
  • Adding markers to a haplotype description reduces the role of homoplasy.
Let’s examine this last point. Consider Figure 1. Suppose that instead of representing a two-marker haplotype, it represents two out of a total of 25 STR markers. For both Don and Hal to have the same haplotype, all 25 markers should have the same value. This means that either none of the other 23 markers underwent a mutation while the two pictured markers did or else exactly the same additional markers for Don and Hal followed the exactly the same mutation paths. The more time elapses, the less likely these possibilities become. For instance, it becomes very unlikely that all the mutations shown in Figure 2 would have taken place without any of the other 23 markers going unchanged.
Going to more markers further reduces the importance of homoplasy, and that is one reason for using as many markers as feasible.




Network Analysis

We face tough problems when attempting to construct a phylogenetic tree for the I-L38+ haplotypes. Most important is that we have only a scant sampling of haplotypes, and they just portray the contemporary state of the haplogroup.  This makes it possible to hypothesize many possible trees that might fit the data. Then there is the problem of homoplasy. Network analysis can help with both of these issues. The idea is to create a network of links between nodes. Each node represents a haplotype, and each link represents a mutation or set of mutations needed to transform one haplotype to another. A tree provides just one path leading from one node to another, but a network can provide multiple paths. A network can have closed loops representing homoplasy. It also can represent the superposition of many possible trees, allowing us to visualize several equally good trees rather than pick one as being “the” tree. This serves to remind us of the limitations and uncertainties in our knowledge.
To visualize the advantage of a network over a tree, suppose a database contains individuals with the H1 and H4 haplotypes of Figure 1. If you were constructing a tree you might infer a hypothetical haplotype H2 and connect the types as shown in Figure 3, case a. Or you might infer haplotype H3 and connect the types as shown for case b. In reality, there’s really no basis to prefer one to the other, and in a network you can connect them as in case c. This visual representation tells you that either construction is equally plausible. It also reflects the possibility that, due to homoplasy, both representations might simultaneously be true. It is a more complete and more honest representation of what we know and of the uncertainty in what we know.

Figure 3
In this figure the circles representing the haplotypes are the nodes and the lines connected them are the links.
The project to be presented here used Network 4.6.1.1, which is free software from Fluxus Technology. It’s intended for a variety of applications, including analyzing Y-DNA STR data, DNA nucleotides, and amino acids, and it provides several pathways for analysis. Its primary goal is to create median networks, so let’s take a look at what that means.



Median Vectors and Median Networks

Let’s use just a three-marker haplotype. Suppose we have three haplotype nodes:  A = (9,20,30), B = (10,21,30), and C = (10,20,31). There are various ways we can link them together with single-mutation links by assuming suitable intermediate haplotype nodes. For instance, you could hypothesize two intermediate nodes and create a chain of five nodes. But the most efficient way to tie A, B, and C together is to assume a single intermediate haplotype Z = (9,20,30). Each of A, B, and C is exactly a single one-step mutation away from Z, and you can link them together as shown in Figure 4.
Figure 4
Z is termed a median vector or consensus vector, and it provides the shortest combined distance for linking each pair (i.e., AB, AC, and BC) together. In this case the median vector is a hypothesized node needed to link three nodes together to create that shortest combined distance, but it could also be an existing node corresponding to a known haplotype.
The links in themselves do not imply the direction of the relationship. In Figure 4, any of the four nodes could be ancestral to the other three. For instance, B and C could descend from Z, which descends from A. Or A, B, and C could descend from Z.
A median network is one for which a unique median vector exists for any three nodes. The significance of a median network in phylogenetics is that it will contain all possible maximum parsimony trees for the data. And what that means, I think, is this: 
Consider all the possible evolutionary tree structures you could create to connect a set of haplotypes. Every link between nodes represents one or more mutations. For each tree compute the total number of mutations represented by all the links. Find the trees that use the least number of mutations. These are maximum parsimony trees. There’s no guarantee that nature actually uses the most parsimonious approach, but these trees are thought to be a good beginning for recreating the evolutionary structure.
The Network program undertakes to create a median network for a data set. It supplies median vectors as needed to allow joining all the original haplotypes into a single median network. These hypothesized median vectors may represent haplotypes that exist but are not part of the sample, and they can represent older ancestral haplotypes no longer present in the population. The Network program calls its basic network-creating procedure MJ, for median-joining.



The Data

The FTDNA I-L38+ Project has STR data for over 200 individuals. In order to take advantage of the abundance of information available, I decided to use 67-marker data. As mentioned earlier, using more markers provides a higher resolution view and also reduces some forms of homoplasy. Using 111-marker data would give even more resolution, but that was available for only 20 people in the autumn of 2013, whereas 67-marker information was available for 121 people, including representatives of all the Project subgroups.
Some of the STR markers are not suitable for network analysis. This includes all those markers with multiple values, for instance, CDY, which lists values such as 34-35. The problem with this type of marker is that the values are sorted by numerical order rather than by the sub-unit involved. So the two orderings 34,35 and 35,34 both would be reported as 34,35. This is just a result of the testing procedure, and further tests would be needed to get the correct order. The information is still useful for general classification because 34-36 is definitely a different haplotype from 34-35. Also, the Network User Guide advises not to use DSY389i and DSY389ii unless two other components of the DSY389 group are known. These omissions still leave 49 STR markers in play.
The Network program requires fairly short labels for the haplotypes, so I used the simple scheme in Table 2 to represent subgroups as identified on the I-L38+ Project STR data page as of autumn 2013. (Since then, another small group has been added.)

STR Page Description

Background Page Details

Group ID

 (19 basic)

DSY448=19

G1

 (19-11)

DSY448=19, DSY454=11, DYS385a,b= 15/16, L533- or untested

G2

(19-11) L533+

DSY448=19, DSY454=11, L533+

G3

(19-8)

DSY448=19, DSY438=8, DSYH4=10

G4

(20-Scot)

DSY448=20, DYS385a,b=11/16, DYS537=12, DYS640=11, DYS617=12, DYS568=11

G5

(21 basic)

DSY448=21

G6

(21-14)

DSY448=21, DYS388=14, DYS461=12, DYS452=30

G7

(21-9)

DSY448=21, DYS446=9, DYS522=13, DYS463=23, DYS452=30, DYS497=14

G8

(22-EE)

DSY448=22, DYS393=14, DYS640=11, L69+

G9

Unclassified

 

UC

Ungrouped

 

UG

Table 2
Note that two of the subgroups (G3 and G9) have associated SNP mutations (L533+ and L69+), making them haplogroup subgroups.
I labelled the first member of G1 G1-01, the next G1-02, and so on. Appendix A shows the members used and the corresponding group IDs.

What is a Subgroup?

This article uses the term subgroup in three different ways.

A new SNP defines a new haplogroup. So those members that test positive for L533 can be considered members of a descendant haplogroup labeled I2a2b1 or I-L533+.

The I-L38+ Project divides its members into subgroups based on STR values. These are the groups labeled G1, etc., in the table above. Two of these groups, G1 and G6, are fairly broad compared to the others and can be thought of as the results of a first sorting pass.

Network analysis can reveal subgroups based on how haplotypes appear to cluster in the graphical representation of the haplotypes.

These uses aren’t entirely independent. As discussed on the I-L38+ Project News page, the L533+ SNP is associated with the G3 group and the L69+ SNP is associated with the G9 group. And the mutations F780+ and PF112+ may define other subgroups.

The main impetus for this study is to see how network subgroups correlate with the other two definitions.




A First Look

Figure 5 shows an example result obtained by running Network with the I-L38+ data.


Figure 5 - click to enlarge

Commentary
In this figure the length of a link is proportional to the number of mutations between the two ends of the link. The small red nodes are the median vectors created by the program. The haplotype nodes are color-coded as indicated, as are the links that tie together nodes of one group.
In general terms, I’ve positioned the nodes, keeping link lengths unchanged, so that the periphery represents contemporary haplotypes and the center of the network, where the median vectors concentrate, represents the reconstructed past. The outer parts of the network are tree-like, with only one possible pathway linking two nodes. The network aspect appears at the center, where multiple pathways, shown in red, link two nodes. This part is termed the torso.
The smaller project groups (G2, G3, G4, G5, G8, and G9) also appear as grouped nodes in this network. G7 doesn’t appear to be a simple group, and G1 and G6, the two largest groups, appear to be reservoirs from which some smaller groups may be drawn. A few haplotypes appear candidates for reclassification. (Indeed, some have been reclassified on that basis.) There are indications of some small subgroupings within G1 and G6. For example, G6-42 through G6-46 form a tight cluster, as one might expect from a family group. G6-16, G6-35, G6-36, and G6-37 form another subgroup, one for which the individual haplotypes diverged from one another further back in time.
Note how these two G6 subgroups, as well as G4 and G5, are structured. Each has a long, multi-mutation link from near the center to a more distant median vector that appears to be the base for the group. In contrast, the G8 and G9 groups show branching extending much further into the past. Consider the first sort of structure. Each mutation within a link corresponds to a father that has produced a son with a mutated haplotype and thus is a branching point. So there should have been several links branching off, much as is the case for G8. There are a couple of explanations why we don’t see them. One is that none of these branches have present-day survivors. Given the European history of warfare, plague, famine, and other disasters, it’s not difficult to believe this could happen. The second explanation is sample size. If we had data for 10,000 I-L38+ members, we could expect more haplotypes to show up, and perhaps some might be those missing from these groups.



SNP Haplogroups

Recall that two SNPs seem to indicate I-L38+ subgroups and that two more SNPs are candidates for indicating new subgroups. How does the network fit in with these facts?
  1. The L69+ SNP is thought to correspond to the G9 group because all three cases of L69+ are in the group (G9-02, G9-04, and G9-07) and no one in the group that has been tested is L69-. All trace back to the same median vector in the torso. So do G1-08 and G1-32, but not in a way that would require them to be L69+.
  2. The L533+ SNP is used to define membership in G3. G3-03 and G3-04 trace back to one torso median vector and G3-05 and G3-06 trace back to a second torso median vector, and those two median vectors are adjacent. So it is possible to fit these G3s to a single point of origin. However the two branches separated a long time ago. The network appears to require that the untested G2-05 test positive for L533+. Either a positive or negative test for G2-03 can be fitted to the network. Also, G1-16 appears to come off of the G3 branches in the distant past, but subsequent testing revealed G1-16 to be L533-. Thus, the network representation is imprecise in this instance.
  3. The PF112+ SNP has been found in G8-04 and not found in G8-01. The other G8s haven’t been tested. G8-01 is not part of the sample used here. G8-04 is at the end of a branch, so the network doesn’t require any other G8 to be PF112+. However, G8-07 shares a node with G8-04, so, if the mutation occurred before that node, the network implies that G8-07 would be PF112+.
  4. The F780+ SNP occurs in G1-17, which has a seven-mutation link to a torso median vector. So the F780 mutation could have occurred anywhere along that link without implying any other F780+ members. If the mutation occurred earlier, it could affect four haplotypes stemming from a median vector one mutation removed from the one originating G1-17. These are G1-11, G1-22, G1-39, and G1-57. Also, G1-01 and G1-19 could be connected via a different median vector, as could G1-03, G1-05, G1-18, G1-21, and G6-21.
 


The Hub

The network appears to radiate from a central point that I’ll call the hub (Figure 6).

Figure 6 - click to enlarge
This picture is consistent with the central haplotype existing at a time of a rapid population expansion. Note how many different branches connect to the node I’ve called the hub. Each represents a mutation, and the only reasonable way for one haplotype to generate so many mutations is that many individuals have that haplotype. Haplotypes tend to evaporate with time as the biologic descendants experience mutations, so the simplest way to build up a large population of a particular haplotype is to have a rapid population growth while the haplotype still exists. Note that several of the nearby descendant nodes also have several branches. Given that those nearby haplotypes began with a single individual, population growth would have had to continue long enough for those haplotypes to reach a population sufficient to engender multiple mutations.
Thus, the network suggests a period of rapid population growth that persisted for several generations, and that most likely would have to be accompanied by geographic growth to provide the resources for the growing population.
The hub is not necessarily the oldest node. Links are not in themselves directional. So one of the links radiating out from the hub may actually connect to an ancestor node rather than a descendant node. But, as it is extremely unlikely for multiple nodes to be ancestral to the hub, most of the links would connect to nodes that evolved from the hub. In short, the hub doesn’t signify the oldest haplotype; it signifies a haplotype involved in a rapid population expansion.


The Torso

The torso consists of several diamond-shaped elements with one vertex at the hub. Each portrays two alternative paths from the hub to the node at the opposite end of the diamond. As noted earlier, there are two ways to interpret this topology. One is that only one alternative actually occurred, but we have no way of choosing one over the other. The second (homoplasy) is that both occurred, so that the haplotype at the tip contained some individuals with one ancestry and some with the other.


Caveat


Going from the raw data to a particular outcome, such as that of Figure 5, requires many choices, choices that shape the result. Figure 7 demonstrates that point by showing the result if one accepts the default processing. Once again, the links in red represent the torso.

Figure 7 - click to enlarge
Not only is this network confusingly more complex, it has problems tying together members with the L533+ SNP mutation (G3).
Knowing the process is crucial to understanding the result, so let’s look more closely at the process.



A Rationale

It’s often said that one shouldn’t visit a sausage factory if one wants to enjoy the output. But the situation for studies like this is different. The results depend on many decisions made in the processing, so to understand the results and to appreciate the uncertainties that always exist, one really should visit the factory. In this spirit, I’ll lead up to the network of Figure 5 in a sequence of steps that, I hope, illustrate how the Network model works and how to read the graphics it produces, and which also show and justify the choices made.



Fluxus Network Processing Options

First, let’s look at some of the Network options. Figure 8 shows part of the Calculate Network menu. The sole Optional Pre-Processing choice is Star Contraction. The two Network Calculations choices are Reduced Median, or RM, and Median Joining, or MJ. The sole Optional Post-Processing choice is MP (for Most Probable) Calculation.

Figure 8 - click to enlarge
What do they do?
The Star Contraction can group closely related haplotypes into a single node. This option did not prove particularly useful for the data set used here, so I didn’t use it.
The Reduced Median (RM) and the Median Joining (MJ) use two different algorithms to create a median network. The default form of each typically doesn’t produce a median network but works as a starting point for an iterative approach. For each method one can increase a parameter that brings the result closer to a median network.
The Optional Post-Processing choice is called MP (for Most Probable), and it simplifies networks by removing nodes and links not needed for the more probable trees in the network.
Network has somewhat contradictory goals. One is to produce a median network. It will contain all the maximum parsimony trees, and the hope is that one of those represents the actual evolution of the haplotypes. The other goal is to present the information in an intelligible fashion. Figure 7, which was generated using the default settings for MJ, fails on both counts. The defaults are equal weights of 10 for all STRs and a value of 0 for what’s termed the ε parameter. The advantage of using these settings for the I-L38 data set is that the calculation is fast, about ten seconds on my desktop computer. The first disadvantage is that the result typically is not a median network, hence not all possibilities are shown. The second disadvantage is that interpretation may not be clear. So let’s examine the Network choices.


Working with Network

Recall that the aim of the Network median-joining (MJ) routine is to construct a median network because that would contain all possible maximum parsimony trees (trees that required the fewest number of mutations). And recall that a median network is one containing a unique median vector for any three nodes. The default setting for MJ is intended to be a fast first attempt that just looks at closely spaced node triplets. It’s possible that the resulting network might fall short of the median network.
The suggested next step is to progressively increase a parameter ε, which, by default, is set to 0. Increasing this parameter causes the program to look further than single-mutation steps when constructing median vectors. The initial pass may not find all the median vectors needed for the complete median network. Increasing ε can find additional maximum parsimony trees. This, in turn, can produce networks with too much complexity to easily understand. So the user guide recommends running the MP post-processor, as described earlier, to eliminate the less probable trees.
Unfortunately, using MJ with non-zero ε places a much greater strain on computing resources. For example, an MJ calculation with the full data sample and an ε of 0 took about ten seconds on my computer, but using an ε (the next recommended step) of 10 led to an out-of-memory message after nearly a half-day of computing. It’s a difference readily apparent even to a novice such as myself.
Network provides an alternative computation (RM) that produces a reduced median network. This computation attempts to remove less-likely links as it creates a network. Thus the final network may not contain all possible maximum parsimony trees. But, like MJ, RM does have an adjustable parameter that allows one to increase the number of median vectors. Unfortunately, the default RM computation failed with the L38 data. More precisely, after nearly two days of processing, it yielded an output file with nearly 120,000 median vectors, and neither MP nor the plotting module could process that file.
It turns out that large STR data bases are particularly stressful for Network. But the User Guide has a remedy: using phase 1 of the RM computation to generate a preliminary file (.rmf suffix) that then is used as input for MJ. This reduces complexity and produces better results than MJ alone, even than MJ with large values for ε. (Michael Forster, Fluxus (private communication) and the Network User Guide) 
Another suggestion is to assign a weight to each STR. By default, the program assigns equal weights of 10. The suggestion is to assign higher weights, say 20, to those STRs with few mutations, and lower weights, 5 or even 0, to those with more frequent mutations. Consider a mutation that occurred just once in L38 a hundred generations ago. That would divide the haplogroup into two subgroups, and individuals in one subgroup would be more closely related to other individuals in the same subgroup than those in the other subgroup. This mutation would represent an important organizing event. Now consider a mutation that has occurred 50 times during that same period. In this case, whether or not two individuals share that mutation doesn’t tell us much about relationships.
From the standpoint of the network, unequal weights downplay the importance of some markers, simplifying the network.
In short, at least as I understand, using the MJ calculation with increasing values of the ε parameter aims to construct a median network, which will contain all maximum parsimony trees. Using the RM calculation, either alone or with MJ, using weights, and using MP post-processing all act to eliminate less likely maximum parsimony trees and to create a more intelligible network. The result of applying these last three techniques is to produce a plausible, but not exhaustive, reconstruction of possible evolutionary pasts.
All these points may become clearer if we first look at a reduced data set, one small enough illustrate several processing options.


Round 1: Familiarization


For a reduced data set, I used just the smallest groups: G2, G3, G4, G5, G7, G8, and G9. Together they comprise less than half the total sample, making the results easier to visualize.  Let’s begin with the STRs from the 12-marker panel. That will give us reason to appreciate the value of using more markers. After eliminating the multi-value sites as mentioned earlier, 8 markers are left. Running the default Fluxus Network MJ (median-joining) routine on this sample produces the graphic representation shown in Figure 9.


Figure 9 - click to enlarge
At first, this may appear impressive but remarkably uninformative. Here are some important elements:
The size of the nodes represents the number of individuals having that haplotype. For example, G7-03 includes G7-04, G7-05, G7-06, and G7-07, while G7-01 is just G7-01.
I added color coding to facilitate identifying subgroups.
The small red nodes are the constructed median vectors.
The length of the links is proportional to the number of mutations needed convert one node to the other. In this particular case, most of the links represent a single mutation. The G2-03 to G2-04 link represents two mutations.
The physical distance on the page between two nodes doesn’t mean anything. What counts is the distance as measured along links connecting the two. For example, G7-09 is close to G4-04 in the figure. But the mutation distance would be measured along the links leading from G7-09 up and across to G7-16, then back again via G7-12 to G4-04.
The network links drawn in red (a user option) form what is termed the torso. The torso consists of all the parts of the network that are closed loops plus links to connect the different loops together. (A post-processing routine called MP may remove some of the loops, but the remaining nodes still count as being in the torso.) As you can see, nearly the entire network is part of the torso. This indicates ambiguity and possible homoplasy. Consider, for example, the G7-11 haplotype. Supposing that G7-11 is the end product of a mutation chain, it could descend in two mutations from G4-04 by the violet path or the green path shown in Figure 10. Or it could descend from G7-05 by the magenta path or via G7-12 (orange and violet). If there is just one person with the G7-11 haplotype, the figure describes the ambiguity about the actual origin of that individual. If several individuals have the G7-11 haplotype, homoplasy comes into play; there might be some individuals with each of these origin stories.

Figure 10 - click to enlarge
The network doesn’t really do much to increase our understanding of the relationships among subgroups; we failed to give the program enough information to do a good job.




Round 2: Increasing the Resolution

Next, let’s increase the number of STRs to 49, again using the default setting for the MJ option. Figure 11 shows the result.

Figure 11 - click to enlarge
Notice the many differences.
The torso (red links) is much less extensive and less complex.
Nearly all the nodes represent single individuals; haplotypes that were identical for 8 markers no longer are identical. This is one result of higher resolution.
More markers results in more mutations, putting greater distance (longer links) between nodes. So higher resolution creates greater separation between nodes.
For the most part, the network separates the haplotypes into distinct clusters corresponding to the I-L38+ Project groups. This is another result of higher resolution.
The clustering is indicated by the links, but I’ve dragged the nodes around, maintaining link length, so as to make the clustering more visual.
There are some anomalies. First, G5-11 doesn’t seem to have a close connection to the other G5s. (Recently the I-L38+ Project reclassified that haplotype into a new subgroup.) More importantly, the four G3s appear to have no close relationship. But we know they must be related because they form the new subgroup marked by the L533+ SNP mutation. The G2s are a bit iffy also, looking as if they derive from G3 even though it is G3 that has the mutation. However, G2-04 is the only G2 that has tested negative for L533; the other two haven’t been tested and thus could be L533+.
There are also some points to notice about networks, trees, and other Network options.

Network Versus Tree
A network can have more than one path linking to nodes, while a tree has just one path linking two nodes. You can obtain a tree from a network by selectively removing links. Because the torso contains all the loops, any reduction of a network to a tree involves removing links from just the torso. For example, Figure 12 shows the original torso and the central part of the network, while Figure 13 and Figure 14 show two possible trees.
Figure 12 - click to enlarge

Figure 13 - click to enlarge

Figure 14 - click to enlarge

MP Post-Processing
While a median network contains trees of equal (and maximal) parsimony, not all are equally probable. MP post-processing retains the most probable links and removes the less likely ones. The G8 grouping is the only part of the network affected by this processing .Figure 15 shows the original network for this group, and Figure 16 shows the network after MP processing. Note that the retained median vectors all have links to existing haplotypes, whereas the dropped median vectors don’t.

Figure 15 - click to enlarge
Figure 16 - click to enlarge

Links that were part of the torso remain identified as part of the torso even though they no longer are part of closed loops.


Round 3: Increasing the Sophistication

How can we deal with the failure to render the L533+ subgroup adequately? As mentioned earlier, using MJ with increasing ε values isn’t an option for the full data set, nor is using RM alone. The two main suggestions are to use weights for the markers and to precede the MJ calculation with an RM (Reduced Median) calculation.

Applying Network Techniques
Taking these points together, I decided to use the combined RM-MJ processing followed by the MP post-processer. Working with the full data set, I developed a set of weights; I’ll discuss that process later. For the present, I’ll use those weights for the remaining examples, beginning withFigure 17, which uses the same haplotypes as Figure 11


Figure 17 - click to enlarge

Observations
What can we read from this network?
Note how the various branches seem to stem from a central point (a median vector shown here as red with a blue rim, at least if you can zoom in on the center). Also note how the various median vectors tend to be close to this hub while the nodes for existing individuals are further away. This fits with the notion of the median vectors being reconstructions of the past. We can view this figure as showing an evolution of haplotypes from a somewhat uncertain past in the center to present-day haplotypes towards the perimeter.
It’s important to keep in mind that although most of the nodes represent data for a single individual, each node really represents a haplotype, not a person. This gives each node an extended time dimension. A node such as G9-04, for example, represents the tested individual, all his paternal ancestors back to the first instance of that haplotype, and all their unmutated descendants. Thus it probably spans many generations. A haplotype has a lifespan. It begins with one individual, it’s populated by that individual’s unmutated descendants, and, perhaps, by homoplasy. It’s depopulated by the deaths of its members and also by the descendants themselves eventually having mutations. (The estimated combined mutation rate for the 49 STRs used is roughly one mutation per nine generations.) So it shouldn’t be surprising that the past is represented by median vectors rather than by current haplotypes.
The G2/G3 situation looks better than in the preceding graph. There is a clear relationship, shown in green, among all four G3 samples. G2-05 seems anomalous, as it’s closely linked to G3 nodes rather than to G2 nodes. But G2-05 hasn’t been tested yet for L533, so G2-05 very well could be G3 rather than G2. (If not, that’s a problem.) Also, the figure would conform better to expectations if the G2-02, G2-03 branch connected closer to the center of the graph, between where the G3-03/G3-04 branch connects and the G4 branch connects, making the G2 branch predate the G3 branches, which are thought to represent a mutation of G2.
As always seems the case, G5-11 doesn’t seem to belong with the other G5s. What will differ from model to model is where G5-11 does get placed.
This model has no torso; introducing RM and weights has removed some of the ambiguity from the representation, although probably not from the reality.
The tree shown here is termed unrooted, meaning it shows the connections but not necessarily the direction of evolution. It is clear that the outer parts evolved from the inner parts, but the central location of the hub median vector doesn’t necessarily mean it’s the oldest. There could be older haplotypes that weren’t required for linking the current ones. Or we can assume the median vector immediately above the hub is older and rearrange the graph as shown in Figure 18.


Figure 18 - click to enlarge
This emphasizes the tree-like structure of this particular loop-free network. However, graphically it crowds the nodes together, which would have made the labels (omitted here) hard to read.
In either representation, the network represent an evolution of haplotypes from a past represented by the hub and median vectors near the hub to the present population of haplotypes.


Round 4: Reintroducing More Possibilities

Both the RM procedure and using unequal weight reduce the complexity of the network, concealing some of the inherent uncertainty. The optional ε parameter can restore some of that lost uncertainty. As mentioned earlier, this parameter is not that useful for larger data sets, but it is useable with the smaller data set used so far. And the result does shed some light on the G3 problem. The manual suggests starting with an ε of 10 for the case when the weights are all 10. For the model shown, the average weight used was 11, so I used that value for ε. Figure 19 shows the result.

Figure 19 - click to enlarge
The overall associations of haplotypes is not much changed, but now there is a torso expressing alternative ways for linking the nodes. Note in particular the part of the network with G2 and G3, as shown in Figure 20.


Figure 20 - click to enlarge
Starting from the hub node, the green path shows a subtree containing all four G3 samples. It also includes G2-05, but a positive L533 test, which this model predicts, would fix that. The cyan branch connects the hub to G2-04, the known L533- member of G2. The two paths cross at the purple hub. This would be an example of homoplasy, with some individuals at that node having an ancestry that traces the green path back to the hub and others having an ancestry that traces the cyan path back to the hub. G2-03, you may recall, hasn’t been tested for L533. And this network leaves open either possible result, as there is one link path from a green branch to G2-03 and one link path from the cyan branch to G2-03.
This is a good example of how a network can explain data that don’t quite fit a tree. Other closed loops in the torso may also reflect homoplasy, or they may simply indicate that there is more than one possible set of connections that can explain the observed data.
Also, the example illustrates the sort of behavior one might expect if non-zero ε models could be run with the larger data samples.
Earlier, it was suggested that the RM-MJ combination works better than using MJ and varying ε. We can examine that claim for this reduced sample. Figure 21 shows the result of using MJ with uniform weights of 10 and an epsilon of 10.

Figure 21 - click to enlarge
Clearly, the result is not clear. Running the MP calculation simplifies the network quite a bit, as shown in Figure 22. However, the network still is less than simple, and it doesn’t resolve the L533+ problem as well as the RM-MJ calculation does.
Notice that the main complication of the network takes place with the median vectors representing past haplogroups. For example, the G5 representatives still group together, but instead of being tied to the center of the network by a single long link, they are connected by a set of alternative paths. In general, as one might expect, the network paints a more definitive picture of recent relationships between haplotypes than it does of relationships in the past.
Figure 22 - click to enlarge


Round 5: The Full Data Set

One reason the recent networks divide the data so cleanly into separate groups is that these groups are already well defined by the I-L38+ Project classification scheme, which uses from two up to six STR markers to define them. The remaining two groups, G1 and G6, are defined by the value of a single STR (DYS448), and adding them to the analysis creates a more disordered picture. This brings us back to Figure 5, which, for convenience, is displayed again here as Figure 23.  In this more complete network, the G7 members remain well separated from the other small groups, but they do seem to overlap with some of the G1 and G6 members.

Figure 23 - click to enlarge


The Importance of Weights

The results for this data set depend strongly upon the weights used. The easiest way to see this is to look at the network produced using the same data and the same processing (RM followed by MJ followed by MP) but using equal weights of 10 (Figure 24).

Figure 24 - click to enlarge
Omitting the MP stage gives an even more complex torso, whereas with unequal weights the torso was the same whether or not MP was used, so it’s clear that weight choice can simplify the output. It’s also true that while some features are not much affected (the correlation of the network with G4, G5, G8, and G9, for example), others, such as the hub form and the representations of G2 and G3, are affected.
This raises two important questions:
  • How were the weights chosen?
  • Why should one prefer the weighted network to the unweighted network?

Choosing Weights
I looked at three plausible methods of choosing weights.
  • I used weights based on internal data from the Network program. The User Guide suggests looking at the number of mutations of each sort that the model used to create the final network and to give higher weight to less frequent events, perhaps 20 to uncommon events and 5 or 0 to common events. I implemented this philosophy by using weights that were inversely proportional to the counts obtained from the equal-weight trial. I used 100 as a scale factor but capped weights at 20. This resulted in weights ranging from 1 to 20 and averaging 11.
  • I used weights inversely proportional to the mutation rates, capping at 20 and scaling so the average was about 11 in order to match the preceding method. I used Marco Heinila’s mutation rates as given in Ken Nordtvedt’s Generations111T.xlsx spread sheet .
  • I used weights based on the statistics of the original data. If the various branches of haplotypes grow in population in the same way, then one would expect the statistical value called the variance to be proportional to the product of the mutation rate and time. So I used weights inversely proportional to the variance for each STR, again capping at 20 and scaled to average about 11.
In a perfect world, one might expect all three approaches to yield similar weights, but apparently the world is not perfect. Table 3 shows those initial weights. The mutation count and variance boxes are empty for DYS426, DSY590, DSY472, and DSY425 because the sample data set had no mutations for those STRs.
 


Marker

DYS393

DYS390

DYS19

(DSY394)

DYS391

DYS426

DYS388

DYS439

based on

mutation counts

9

4

4

5


7

6

based on

mutation rates

11

4

6

4

20

18

2

based on

variance

10

4

3

6


6

8


Marker

DYS392

DYS458

DYS455

DYS454

DYS447

DYS437

DYS448

based on

mutation counts

17

3

20

17

5

8

8

based on

mutation rates

17

1

20

20

3

13

8

based on

variance

20

2

20

20

4

17

1


Marker

DYS449

DYS460

Y-GATA-H4

DYS456

DYS607

DYS576

DYS570

based on

mutation counts

3

7

11

6

14

3

3

based on

mutation rates

1

3

4

2

4

1

1

based on

variance

1

9

13

6

20

3

2


Marker

DYS442

DYS438

DYS531

DYS578

DYS590

DYS537

DYS641

based on

mutation counts

7

20

20

20


6

20

based on

mutation rates

3

20

18

20

20

8

20

based on

variance

8

6

20

20


9

20


Marker

DYS472

DYF406S1

DYS511

DYS425

DYS557

DYS594

DYS436

based on

mutation counts


6

14


3

20

20

based on

mutation rates

20

7

8

20

3

20

20

based on

variance


4

20


2

20

20


Marker

DYS490

DYS534

DYS450

DYS444

DYS481

DYS520

DYS446

based on

mutation counts

20

3

20

7

5

8

6

based on

mutation rates

20

1

20

3

2

6

3

based on

variance

20

2

20

8

3

12

3


Marker

DYS617

DYS568

DYS487

DYS572

DYS640

DYS492

DYS565

based on

mutation counts

11

14

13

20

14

20

20

based on

mutation rates

17

20

13

8

20

20

15

based on

variance

10

11

20

20

16

20

10

Table 3



Why Is the World Not Perfect?

Let’s confine that inquiry to a subset of the world -- the three weighting schemes. One key factor in the differences is that each is based on a different population. The mutation rates are based on a large random sample drawn from a range of haplogroups. The variance is based on a small, probably biased sample drawn from just one small haplogroup. The internal Network statistics are based on that same small sample of haplotypes plus the hypothesized haplotypes used to construct the network.
A second factor is the randomness built into the mutation process. A mutation rate of, say, one mutation per hundred generations does not mean that one mutation shows up every hundred generations, like clockwork. It’s similar to tossing a coin—the fact that there’s a 50% probability of heads doesn’t mean that tossing a coin a thousand times yields the sequence HTHTHTHT and so on for 500 HT pairs. There will be stretches with many heads in a row and stretches with many tails in a row. So with STRs you may get some mutations when you expect none and few mutations when you expect several. This behavior results in variances that can be quite different from what one would expect from the mutation rates. Ken Nordtvedt investigated this with numerical simulations and concluded the following:
“When the many STRs of extended haplotypes are evaluated after a clade's descendant from the common ancestor, it is found that the individual STR ratios of Variance/mutation rate vary all over the place from way above the expected value … to significantly below.” (http://knordtvedt.home.bresnan.net/FirstMutation-A.txt)
Generally speaking, the earlier a particular mutation first occurs in a trial, the larger the variance is for that mutation.
A third factor is the human history of the haplotype. If some branches are selectively thinned or cut short and other branches grow disproportionally quickly, diversity will be reduced and so will the variances.



Comparing Weighting Schemes

To compare the effects of the three weighting schemes, let’s look at the resulting torsos. Figure 25 shows the result obtained using weights based on the internal mutation counts for Network.

Figure 25 - click to enlarge
Figure 26 shows the results using weights based on mutation rates.

Figure 26 - click to enlarge
And, finally, Figure 27 shows the result based on the variance. The long extension looks similar to the one in Figure 25, but it involves different haplotypes.

Figure 27 - click to enlarge
Although different from one another, all three are considerably less complex that the uniform weight torso.




Mutation Count Iteration

None of these models was internally consistent in this sense: the mutation counts generated by the models were not consistent with the weights used to create the models. I decided to take the mutation count approach further by using the resulting mutation counts to generate new weights, then using the new weights to generate a new model and new mutation counts. My hope was that this iterative approach would lead to weights for successive models converging and to a self-consistent model. I applied this approach using each of the three previous starting points (first-pass mutation counts, mutation rates, and sample variance). In fact, each model did converge. However, although the three final models were much closer to each other than were the starting models, they did not match precisely. Given that the models were attempting to generate a network spanning a few thousand years with just a scattering of contemporary data, it’s not surprising that one cannot recover an unambiguous answer from the data. Each is a plausible subset of possible models.
Not having a basis for choosing any one result, I averaged the weights and ran two further iterations on that sample. The results, shown in Table 4, are the weights used in Figure 16 through Figure 23. The Network model ignores markers such as DYS426 that exhibit no mutations, so any weights can be used for those four markers.

marker

DYS393

DYS390

DYS19

(DSY384)

DYS391

DYS426

DYS388

DYS439

weight

10

3

3

4


8

9

marker

DYS392

DYS458

DYS455

DYS454

DYS447

DYS437

DYS448

weight

17

2

20

20

3

9

1

marker

DYS449

DYS460

Y-GATA-H4

DYS456

DYS607

DYS576

DYS570

weight

2

6

14

5

20

2

2

marker

DYS442

DYS438

DYS531

DYS578

DYS590

DYS537

DYS641

weight

5

20

20

20


7

20

marker

DYS472

DYF406S1

DYS511

DYS425

DYS557

DYS594

DYS436

weight


4

17


2

20

20

marker

DYS490

DYS534

DYS450

DYS444

DYS481

DYS520

DYS446

weight

20

3

20

6

4

11

5

marker

DYS617

DYS568

DYS487

DYS572

DYS640

DYS492

DYS565

weight

11

20

20

20

20

20

20

Table 4


Choosing a Network

We’ve seen three methodologies for generating networks for the full L38 49 marker data set.
  1. The default calculation, which uses the MJ calculation with equal weights of 10 and an ε of 0 (Figure 7)
  2. The combined RM-MJ computation with equal weights of 10 and an ε of 0 and followed with MP (Figure 24)
  3. The combined RM-MJ computation with weights iteratively based on mutation counts and an ε of 0 and followed with MP (Figure 5 and Figure 23)
Clearly, Method 3 produces the most intelligible and aesthetic representation. Also, the resulting network is the only internally consistent model, having weights consistent with the mutation counts. And that network is the most consistent with the information for the L533+ and the L69+ SNP values. So it is fair to say the Method 3 network is better than the other two.
But it also is true that Method 3 oversimplifies the situation. Comparing Figure 23 to Figure 24 shows that using unequal weights simplifies the torso. Comparing Figure 19 to Figure 21 and Figure 22 shows that using RM as a preprocessor for MJ excludes a great number of possibilities. We can restore some of these possibilities by using a non-0 ε with MJ. This wasn’t possible using MJ in the default mode, but RM simplifies the network so greatly that it becomes possible to use some none-0 ε values with the RM-MJ sequence. Figure 28 shows the result using an ε of 9.
 

Figure 28 - click to enlarge
This process adds much more intricacy to the center of the network. But it doesn’t change the basic picture much. The G3 group does look divided, but if one magnifies the center and follows the links, the relationship is not much changed. There are many more median vectors and links in the central region, but the overall pattern still is consistent with the idea a hub of rapid population growth surrounded by nodes acting as secondary sources of growth (Figure 29).

Figure 29 - click to enlarge
In short, the network of Figure 5 (and Figure 23) can be regarded as a simplified and non-unique network capturing the essential relationships in the data base. In particular, the torso can be regarded as a lower-resolution view of a more complex past.


Reconstruction or Artifact?

One question that has bothered me is this: is the hub feature of the network of Figure 5 a valid reconstruction, or is it just an artifact of the particular procedure I followed? To get some insight, I applied the same procedure, including the iterative weight-selection, to a different data set. The Family Tree Y-Haplogroup I2a project has 67-STR data for over 1100 individuals divided among 40 subgroups. This haplogroup is thought to be older than I-L38.  I randomly selected a few haplotypes from each subgroup to obtain 120 individuals, matching the original size of the L38 group. (One was added to the latter later.)   Figure 30 shows the resulting network. The structure is quite different from the L38 network. In particular, rather than exhibiting a hub, it has several regions that seem to indicate periods of rapid population expansion, as indicated by several links emanating from a few closely linked nodes. This doesn’t prove that the hub form for L38 isn’t an artifact, but it does make it plausible that it isn't an artifact.

Figure 30 - click to enlarge



Appendix


Table 5 lists the I-L38 Project members whose haplotypes were used.

Kit Number

Paternal Ancestor Name

Country

GID

255515

Hans Andersen 1784-1843, Odense, Denmark

Denmark

G1-01

225182

Henry Davis, b1849, Tennessee

Unknown Origin

G1-02

20622

Patrick Connolly b.c 1808 Bulgaden Co Limerick IRE

Ireland

G1-03

19744

Ragan

Canada

G1-05

169558

Owen Ragon, 1797

Ireland

G1-07

271168

Thomas Daniels, b.ca.1780

Ireland

G1-08

145107

Andrew Wolfe

Unknown Origin

G1-09

84502

Peter Lawrence 1774 (NC) -1856 (IN)

Unknown Origin

G1-10

184403

Michal Szwarc, b.c.1793, Motule, Poland

Poland

G1-11

N91527

Jonathan Bunting 1753-1825 Sussex Co, DE

England

G1-13

131801

Jens Abrahamsen 1754-1813, VFO

Norway

G1-16

N74871

Christian Deterding, born between 1792 and 1799, d

Netherlands

G1-17

113095

Fridolin Hurbi 1767 - ?, Kienberg, Switzerland

Switzerland

G1-18

N17977

Robert Thadeus McClellan, 1895 TN - 1986 CA

United Kingdom

G1-19

216134

David Regan, b. 1803, Castlelyons, Cork, Ireland

Ireland

G1-21

N31707

Georg Simon Wehr, b.c. 1720, Heidelberg, Germany

Germany

G1-22

24288

Thomas Boucher ca 1780 western VA-1830 AL

Unknown Origin

G1-23

24376

Thomas Boucher, ca 1780 western VA-1830 AL

Unknown Origin

G1-25

133768

William Vance, b.c. 1715

Ireland

G1-28

20263

James W. Evans, 1829-1900?, USA

Unknown Origin

G1-29

61703

William Vance b. ca 1715 d. ca 1750

Ireland

G1-30

140136

William Vance b.ca 1715 d.ca 1750

Ireland

G1-31

161255

Samuel Robison about 1765 - 1826

Scotland

G1-32

85848

William Barker

United Kingdom

G1-33

N25287

Hialti De Hauteville b.852 Hauteville-la-Guichard

France

G1-34

199864

Gustaaf De Beul, Lokeren, Belgium

Belgium

G1-36

228971

John Graham b.1671 near Langholm, Scotland

Scotland

G1-37

90841

Hugh Bullock

Scotland

G1-38

167285

Wendelin Stehle 1717-1785 Bittelbronn,Hohenzollern

Germany

G1-39

75895

Matthew Weakley, b.c. 1695, Martock SOM UK

England

G1-41

53868

Jean Guittard, 1614 FR -1706 Bellemagny, Alsace

France

G1-43

139077

Horatio Huggins, Gingerland, c1830's

United Kingdom

G1-44

N77819

Loran White, August 20,1952

Unknown Origin

G1-50

134370

Jacob R Brooks>James>John Jr.>John Sr 1700 [VA]

Unknown Origin

G1-51

223210

Thomas Aldersea 1800s Staffordshire

England

G1-52

197368

Samuel Firth, 1811 - 1875, Lightcliffe, Yorkshire

England

G1-53

8202

Thomas Rix 1622-1718 Brancaster, Norfolk, England

England

G1-56

69862

Thomas Hester, b. 1715 and d. 1779

England

G1-57

115222

Samuel Dale b. 1801 Ballycloghan, Antr, N. Ireland

Northern Ireland

G1-58

57392

Thomas Cullen, b.c.1690, Upton, Nottinghamshire

England

G1-59

25903

Thomas Bond, Sr., Bertie Co., NC

Unknown Origin

G1-60

142308

John Fortner/Faulkner b. 1678 Kent County MD,USA

Germany

G1-62

N46014

William Robinson, 1614-1668

England

G2-03

133439

James Smith, abt. 1792-1868, Birmingham

United Kingdom

G2-04

109456

Joseph Price, 1796-

Unknown Origin

G2-05

114418

Johan Fuchs (Fox) 1784, Prussia, Germany

Germany

G3-03

17821

Jacob Brion, 1619, Burbach, Alsace

France

G3-04

59981

Elijah (Wm E.) Butler b. 1819, Colerne, Wiltshire

United Kingdom

G3-05

182932

Rawlings, Virginia, 1800 or earlier

England

G3-06

17218

Reece Vandever Morrel, b.1795, Camden Dist, SC

Unknown Origin

G4-04

194114

Shirly Lyle White 1934, West Virgina

Unknown Origin

G4-05

67750

Richard Wootten of Warwick, VA b ca 1614 England

England

G4-10

175549

Richard Green Wooten, 1800-1859, Catahoula, LA

England

G4-13

124739

Unknown, United Kingdom

United Kingdom

G4-16

265273

Samuel Houlder, b.1781 Wales d. 1848 Sweden

England

G4-17

149696

Oliver McCratic b1887 res. Arkansas

Unknown Origin

G5-01

33255

John MacKenzie bc1700 m. Dicey res. So Carolina

Scotland

G5-02

159027

Williamson

Scotland

G5-03

101984

Samuel McKinney bc 1840 Tennessee

Scotland

G5-04

49547

Wilson McKenny (McKinney) c.1758 of VA

Unknown Origin

G5-05

27772

Johnson McKinney, 1764 - 1852

Unknown Origin

G5-06

114416

Nicholas McKinney 1822 Franklin Co. Alabama

Unknown Origin

G5-07

165591

Donald McPherson b1799 SCT>Alexander bNovaScotia

Scotland

G5-10

149381

Samuel Watson, b. 1731

Scotland

G5-11

88756

Henry Hutchison, b. 1777 (unknown death date)

Scotland

G6-06

133132

Patrick Bellew, b 1820, MountBellew, Ireland

Ireland

G6-10

158069

John Garrison (1799/1883 MO) m.1826 Eliz.Thompson

Unknown Origin

G6-11

198755

Donald More, c1680, Caithness, Scotland

Scotland

G6-12

N79293

John White , 1862-1907

Ireland

G6-16

141638

Abner H. Marlin b. 1809 and d. 1886 Franklin Co TN

Unknown Origin

G6-19

178665

Jakob Mesterton, 1625-1689

Scotland

G6-20

168374

 

Unknown Origin

G6-21

109024

Enoch Cornelius Seaver, born 1834, North Carolina

Germany

G6-23

N17917

Salve Ommundson Fjellestad,b.1812 Norway.

Norway

G6-24

94391

Jesse Campbell (1820) of South Carolina, USA

Unknown Origin

G6-27

E7258

Oscar Jung, Luzern, Switzerland

Switzerland

G6-28

166268

Samuel Shivers 1814-1861

Ireland

G6-34

215272

John Hogg c 1750

United Kingdom

G6-35

215273

Tristram Hogg b. 1739

United Kingdom

G6-36

86953

Thomas Sivers 1680-1714

England

G6-37

60758

Stone

England

G6-39

61374

Evans, New Zealand

England

G6-40

N25230

Reginald Foster, b ca 1595 in England, d 1681 MA

United Kingdom

G6-42

237903

Reginald Foster b.1602-d.1681 Massachusetts

England

G6-43

61086

Reginald Forster 1795

England

G6-44

129114

Elisha Foster, Sr, 1766-1833, b: VT/MA - USA

England

G6-45

166815

Reginald Foster, 1595-1681

England

G6-46

161257

Peder Andersen, b.c. 1742, Tonder, Denmark

Denmark

G6-49

51138

Henry Zimmer, b.abt 1835

Germany

G7-01

199283

Ginwright

Unknown Origin

G7-03

240150

Charley Rauls, b.1862 GA and d. 1935 FL

Unknown Origin

G7-04

66739

Hezekiah Haney , abt 1770, Halifax Co. NC

Unknown Origin

G7-05

70501

James Rawls, b. ca. 1734, Nansemond Co., Virginia

Unknown Origin

G7-06

213187

James Rawls b. 1734 Nansemond Co., VA, d. aft 1791

Unknown Origin

G7-07

N79341

Suárez, Garrovillas de Alconétar, Spain

Spain

G7-09

115875

Johannes Böhly (b.1702 d 1765) from Germany

Switzerland

G7-11

216062

Johann Friederich Wigand (1759-1804) Bad Durkheim

Germany

G7-12

E11642

Ernst Steinhilber 1904 - 1975 Germany

Germany

G7-14

137613

Grantner

Slovakia

G7-15

51788

Joseph Seiler b.c.1687, Willenstein, Germany

Unknown Origin

G7-16

181894

Nelson Gibson 1745-1823

Scotland

G8-02

88422

William A. Hill b 1817-1818

England

G8-03

N43230

John Holmes, b. 1706 and d. 1775

Wales

G8-04

253397

John Stanley b 1827 TN d 1904 MO

Unknown Origin

G8-05

160148

James Stanley, 1785 SC - 1836 MO

England

G8-06

88037

William Sebron Standley b.1834 d.1914

Unknown Origin

G8-07

140263

Claude Reynaud,b.1620,Bâtie Montgascon,Rhône-Alpes

France

G8-08

210581

 

Unknown Origin

G8-09

E2623

Adalbertus/Wojciech Tatucha,1732,Warta,Lodz,Poland

Poland

G9-02

183324

Adalbertus Tatucha ,about 1732 ,Warta,Poland

Poland

G9-03

N102603

George Prata, 1880-1932, Abrud, Romania

Romania

G9-04

B4910

Andrea Marotta (father of Tommaso Antonio Marotta)

Italy

G9-07

N69049

Melchoir Sauberli Moriken Switzerland

Switzerland

UC-01

147740

William Chaffee Shannon, b. 6 Jun1876 ,,New York

Unknown Origin

UC-05

55374

Michael Weathers b. c1733 s/o Thomas, Surry Co. VA

England

UC-07

68367

Joachim Fox, New Orleans ~1800

Unknown Origin

UC-13

171867

William Richardson, Nash County, NC 1759-1850(abt)

Unknown Origin

UC-15

144546

Edward Richardson, 1701-1764

Germany

UC-16

188871

Nicholas Rimer, c.1660-1717

England

UC-17

86603

Henry Hainer (Hanor) lived Ulster Co. NY 1765

Germany

UC-21

263539

David Nicolson, b abt 1731

Scotland

UG-01


About the Author

Stephen Prata
  • was born in the early forties
  • lives in California
  • studied astronomy (BS, PhD)
  • taught physics and astronomy at a community college
  • is retired
Questions, Suggestions and Thoughts 
Feedback is welcome. You can reach me at the following email address: