By Stephen Prata - June 2015
This article discusses using SNPs to provide greater resolution to phylogenetic trees and networks used to depict relationships among members of the I-L38+ haplogroup.
The I2a2b L38+ Project lists Y-DNA STR data for over 300 project members, with 12 to 111 STR values (each set a haplotype) per entry. This article examines the relationships within the I-L38+ haplogroup by using a Fluxus Network combining SNP and STR data for a subset of thirty-six I-L38 Project members.
Scientists and genealogists have used two kinds of DNA mutations to examine the story of Y-DNA: Single Nucleotide Polymorphisms (SNPs) and Short Tandem Repeats (STRs). You can read more about these mutations on the Background page at the I-L38 project site (www.familytreedna.com/public/I2b2/) and at the I-L38 Haplogroup web page (https://sites.google.com/site/haplogroupil38/home). For now, let’s look at some of the differences and at what can be gained by combining the two.
SNP and STR Differences
One distinction is that of mutation rates. SNPs have relatively low mutation rates, making them suitable for studying changes that take place on a time scale of tens of thousands of years. They are used to define Y-DNA haplogroups and subgroups. For instance, someone with the mutated version of the M170 SNP belongs to the I haplogroup. If he additionally has the M438 mutation, he belongs to the I2 subgroup of I. Similarly, an L460 mutation defines the I2a subgroup, L35 defines I2a2, and L38 defines I2a2b. (Often, more than one SNP defines a haplogroup, so the I-L38 group also is characterized by L39, L40, L65, and L272.) Haplogroups can be used, for example, to help untangle the histories of human migrations.
Note: the initial letter (or letters) in a SNP name identify the research group that found it. Some were found by multiple groups and have multiple names. For instance, L38 is the same as S154 and L39 is the same as S155.
STRs have relatively higher mutation rates, making them suitable for studying short-term developments in genetics, for example, in forensic studies to identify an individual or in genealogical studies investigating family relationships.
Another distinction is that a SNP typically has two possible values – the ancestral value and the mutated value. In principle, given that there are just four nucleotides involved (A, T, C, and G), there are three possible mutations for any one nucleotide, but the rarity of mutations ensures that most often just one has occurred. STRs, on the other hand, are characterized by the number of repeats, and this number may show a range of values. Furthermore, an STR mutation may either increase or decrease the count, so two people with the same count for a particular STR may have descended from ancestors with different counts.
STRs, Haplogroups, and Haplotypes
SNPs define haplogroups. A set of STR values defines a haplotype. The relationship of STRs to haplogroups is a bit fuzzier. The count value for a particular STR does not identify a haplogroup. If you scan the Classic page in the Results section of the I-L38 Project, you’ll see multiple values for most of the STRs. If you look at similar pages for other haplogroups, you’ll find members of different haplogroups having the same value for a particular STR as an I-L38 member. Nonetheless, by examining a panel of several STR values, one can establish correlations between sets of STR values (a haplotype) and the SNP-defined haplogroups. In short, a haplotype that includes enough STRs can be used to infer a haplogroup. Each distinct set of STR values on the Project page is a distinct haplotype, but all belong to the same haplogroup, I-L38.
The Project administrators noticed that the haplotypes seem to fall into subgroups, which they have characterized by key STR values. A natural question is, do these haplotype subgroups correspond to a haplogroup subgroup? That is, do these groupings trace the effect of some SNP mutation? From the limited sample studied here, the answer would appear to be “somewhat.”
Recent developments in genetic testing allow for cheaper and more extensive SNP testing, and this has led to more and more individuals having both SNP and STR results. With their slow mutation rates, SNPs provide a coarse-grained picture of Y-DNA relationships, and the faster mutating STRs provide a fine-grained picture. Combining the two should give a more complete and accurate picture than either does alone. In particular, one would expect STRs to better portray recent genetic changes and SNPs to mark more basic genetic variations that occurred further back in the past.
Note: In fact, both SNPs and STRs exhibit ranges of mutation rates. For example, some STRs are identical for all members of the I-L38 project, while others exhibit up to five distinct values. Similarly, while most SNP mutations used to classify haplogroups seem to have been unique events, at least as far as the surviving population is concerned, some have occurred independently in different haplogroups. The faster mutating SNPs, such as L69, may be unsuitable for classifying major haplogroups but prove useful for identifying subgroups within a major haplogroup.
The STR data used in this article comes from the public FTDNA I-L38+ Project. FTDNA, with its emphasis on family genealogy, traditionally has used STR testing, and the project originally was populated with individuals whose STR haplotypes correlated with the I-L38+ haplogroup and who volunteered to join the project.
Note: In the past this group was classified as I2b2, then later classified as I2a2b. This same label had previously been used for a different haplogroup, so to reduce confusion, it’s now more common to identify the group by the identifying mutation, hence the I-L38+ label. However, while the Project and National Geographic Geno 2 use I-L38+, FTDNA uses I-L39+, as L39 also is a defining mutation.
The SNP data comes from a variety of sources.
The Emerging Picture
Several SNPs appear to divide L38+ into subgroups. These include the following:
The SNP tests suggested on the project Background page reflect the following deduced history:
The Role of BritainsDNA
In 2014 BritainsDNA released a spreadsheet of SNP results for 1999 subjects, including 14 members of I-L38+, which BritainsDNA calls S155+, with S155 being an alternative name for L39, another marker for the I-L38+ haplogroup. As shown in the listing above, the data revealed several SNPs (S2606, S24121, S2488, S4556, S27697, and S25490) as subdividing the group. (The BritainsDNA S155+ group did not include any L533+ or L69+ individuals.)
This database doesn’t include STR data. Subsequently, five members of the L38+ project also tested with BritainsDNA and made their results available to Hans de Beule. This provided a basis for SNP-STR comparisons and guidance for suggesting further SNP tests for project members. The only S4556+ and S27697+ individuals we have belong to the original BritainsDNA sample, and we don’t have STR data for these individuals. Hence this study excludes those SNPs.
Fluxus Network Analysis
The Fluxus Network program constructs phylogenetic trees and networks. A phylogenetic tree is similar to a genealogical tree, but with the basic unit being a haplotype instead of an individual. The basic link, instead of joining two individuals a generation apart, joins two haplotypes a mutation apart. However, because of missing haplotype data, links can be two or more mutations long. And some of the nodes representing haplotypes may be inferred nodes, called median vectors. These branching points represent haplotypes not included in the database, perhaps because we’ve sampled only a small fraction of the whole haplogroup, perhaps because they are types no longer extant. In more complex cases the data often is too ambiguous to allow the program to construct a unique tree. In such cases, the program produces a network that superimposes several possible trees. Even if the program does produce a unique tree, it typically is a best guess, not the only possibility, and additional data could change the structure significantly.
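To make the idea of a link "a mutation apart" concrete, here is a minimal sketch that counts the one-repeat steps separating two haplotypes. It assumes a simple stepwise model in which each change of one repeat counts as one mutation; the marker names and repeat counts are invented for illustration and are not project data, and this is not a description of how Fluxus Network itself computes distances.

```python
def step_distance(hap_a, hap_b):
    """Count one-repeat steps separating two haplotypes over their shared markers."""
    shared = hap_a.keys() & hap_b.keys()
    return sum(abs(hap_a[m] - hap_b[m]) for m in shared)

# Hypothetical haplotypes: marker names and repeat counts are illustrative only.
hap_x = {"STR_A": 14, "STR_B": 22, "STR_C": 10}
hap_y = {"STR_A": 14, "STR_B": 23, "STR_C": 10}
hap_z = {"STR_A": 15, "STR_B": 21, "STR_C": 10}

print(step_distance(hap_x, hap_y))  # 1: could be joined by a basic link
print(step_distance(hap_x, hap_z))  # 2: a longer link, or one with a median vector between
```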
The article I-L38 Median Networks (1) (https://sites.google.com/site/haplogroupil38/median-networks) discusses this program further in the context of STR networks. One important aspect is that one can provide weights to emphasize some STRs over others. The justification is that it is more significant when a low-mutation-rate event occurs, so the weights are based on mutation rates, often judged by sample variance in the repeat values.
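One way such variance-based weights might be derived is sketched below. The linear mapping from variance to an integer weight, the function name, and the sample values are all assumptions made for illustration; the weights actually used in the Median Networks article may have been chosen differently.

```python
from statistics import pvariance

def variance_weights(samples, low=1, high=20):
    """Map each marker's sample variance to an integer weight in [low, high].

    The least variable informative marker gets `high`, the most variable
    gets `low`. Markers with zero variance are skipped as uninformative.
    """
    variances = {m: pvariance(vals) for m, vals in samples.items()}
    informative = {m: v for m, v in variances.items() if v > 0}
    if not informative:
        return {}
    v_min, v_max = min(informative.values()), max(informative.values())
    weights = {}
    for marker, var in informative.items():
        if v_max == v_min:
            weights[marker] = high
        else:
            # Linear interpolation: v_min maps to high, v_max maps to low.
            frac = (v_max - var) / (v_max - v_min)
            weights[marker] = round(low + frac * (high - low))
    return weights

# Hypothetical repeat counts for three markers across four individuals.
samples = {
    "STR_A": [10, 10, 10, 11],   # rarely changes: high weight
    "STR_B": [20, 21, 22, 23],   # changes often: low weight
    "STR_C": [15, 15, 15, 15],   # identical for everyone: skipped
}
print(variance_weights(samples))
```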
This mutation-rate dependence of weights suggests adding SNPs to the data as if they were very high-weight STRs. I used this approach following a suggestion from Hans de Beule, but it turns out that professionals have begun doing similar work without needing our guidance. See, for example, the following:
This article uses data for project members having both SNP and STR data. For SNPs, this article uses the sources mentioned earlier: public FTDNA tests, transferred National Geographic Geno 2 data, and values that Hans de Beule has gathered from others. The STR data comes from FTDNA testing. This study uses 36 individuals with appropriate SNP data and with 67-marker STR results, as more STRs yield better resolution. I used data for 51 STRs out of the 67, as some STRs are not suitable for network analysis. I used STR weights similar to those in the I-L38 Median Networks article; they range from 1 through 20.
The SNPs used are these: S2606, S24121, L533, L69, S2488, and F780. For these I used the maximum possible weight of 99. Not every individual was tested for every SNP, so some individuals were assigned values based on tested values for similar haplotypes. Table 1 describes the sample.
In this table a minus sign (–) represents ancestral values (as in S2606-), and a plus sign (+) represents derived values (as in S2606+). (The Fluxus Network program expects numerical values. Any two integers with a difference of one would work; I used 10 and 11.) Values in red represent results of SNP testing.
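The numeric encoding just described can be sketched as follows. Which of the two integers stands for the ancestral value is an assumption here (10 for ancestral, 11 for derived), as are the function name and the sample calls; the code uses an ASCII hyphen for the minus sign.

```python
# Assumption: 10 = ancestral, 11 = derived. The text says only that 10 and 11 were used.
ANCESTRAL, DERIVED = 10, 11
SNP_WEIGHT = 99  # the maximum weight, as used for the SNPs in this article

def encode_snps(calls):
    """Turn calls such as {'S2606': '+'} into numeric pseudo-STR values."""
    codes = {"+": DERIVED, "-": ANCESTRAL}
    return {snp: codes[sign] for snp, sign in calls.items()}

# Hypothetical individual, not a row from Table 1.
row = encode_snps({"S2606": "+", "S24121": "-", "L533": "-"})
weights = {snp: SNP_WEIGHT for snp in row}

print(row)      # {'S2606': 11, 'S24121': 10, 'L533': 10}
print(weights)  # {'S2606': 99, 'S24121': 99, 'L533': 99}
```

Because the two codes differ by exactly one, a SNP change looks to the program like a single-step change in a very heavily weighted STR.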
The SNP values are better documented than the number of red entries in the table might indicate. For example, K02, a cousin of K04, was tested for all but F780; however, he had fewer than 67 STR markers, so he wasn’t included in the sample. The table shows L02’s values for L04. Also, examination of 19 BritainsDNA I-L38 samples (of which just 5 are part of the I-L38 project) shows that S24121+ is a subset of S2606+ and that S2488+ is a subset of S24121-. Thus, for example, S2488+ implies both S24121- and S2606+. In a similar manner, F780+ is a subset of S24121+, so S24121- implies F780-.
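The subset relations in this paragraph can be applied mechanically to fill in untested values. The sketch below encodes only the implications stated above; the rule table layout and the function name are illustrative assumptions, not a tool used in the study.

```python
# Each entry reads: (snp, state) implies the listed (snp, state) pairs,
# taken from the subset relations stated in the text.
RULES = {
    ("S24121", "+"): [("S2606", "+")],
    ("S2488", "+"): [("S24121", "-"), ("S2606", "+")],
    ("F780", "+"): [("S24121", "+")],
    ("S24121", "-"): [("F780", "-")],
}

def infer(tested):
    """Extend tested SNP calls with every call the rules imply."""
    calls = dict(tested)
    pending = list(tested.items())
    while pending:
        fact = pending.pop()
        for snp, state in RULES.get(fact, []):
            if snp not in calls:
                calls[snp] = state
                pending.append((snp, state))
            elif calls[snp] != state:
                raise ValueError(f"{snp}{calls[snp]} conflicts with {fact[0]}{fact[1]}")
    return calls

print(infer({"S2488": "+"}))
print(infer({"F780": "+"}))
```

A single S2488+ result thus yields S24121-, S2606+, and F780- without further testing, and a contradictory pair of tested values raises an error rather than passing silently.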
Results Using Seven SNPs
Figure 1 shows the network that Fluxus Network generates using the SNP values in Table 1 and STR values taken from the I-L38+ project STR page. The figure identifies individual haplotypes by an ID and by color code.
Each node (L04, M08, etc.) represents a particular haplotype. A haplotype corresponds to a particular set of STR values and can represent more than one individual. The node size increases with the number of individuals. For example, the Q12 node represents both Q12 and Q15. The length of a link is proportional to the number of mutations needed to go from one node to the next. The small unlabeled nodes are median vectors, haplotypes whose existence is inferred in order to generate the network. What counts in determining the genetic closeness of two nodes in the network is the distance between the two nodes as measured along the path following the connecting links. For example, although L05 and L08 are about equidistant from L04 in the figure, the genetic distance between L04 and L08 is much greater than the distance between L04 and L05. Since the number of mutations increases with the passage of time, we can conclude the two haplotypes L04 and L05 diverged from one another relatively recently compared to when L04 and L08 diverged. In general, as one follows branches from the periphery of the network towards the center, one is looking at older haplotypes. The more central median vectors represent haplotypes deduced to have once existed in order to give rise to the newer haplotypes present today.
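The difference between distance on the page and distance along the links can be illustrated with a small shortest-path calculation. The link lengths below are invented, and `mv1` and `mv2` are hypothetical median vectors; only the node labels L04, L05, and L08 come from the figure.

```python
import heapq

def path_length(links, start, goal):
    """Shortest distance from start to goal, summing link lengths along the path."""
    graph = {}
    for a, b, length in links:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, length in graph.get(node, []):
            cand = dist + length
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return None  # no connecting path

# Invented link lengths (in mutations); mv1 and mv2 stand for median vectors.
links = [("L04", "mv1", 2), ("L05", "mv1", 1), ("mv1", "mv2", 9), ("L08", "mv2", 3)]

print(path_length(links, "L04", "L05"))  # 3
print(path_length(links, "L04", "L08"))  # 14
```

However the nodes happen to be placed in the drawing, the sum along the connecting links is what measures genetic closeness.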
The picture conforms to what one would expect from assigning high weights to the SNP data. The 36 haplotypes are organized unambiguously into seven haplogroup subdivisions, the ones labeled GK, GL, GM, GN, GO, GP, and GQ in Table 1 and in the Figure 1 legend. (Here G stands for Group.)
Let’s look again at the significance of link lengths. Note that L03, L04, and L05 form one subgroup within GL while L06, L07, and L08 form a second subgroup. The long links back to an originating median vector indicate that these two subgroups separated from each other long ago, longer, say, than the time during which the GM group has been diverging. This agrees with the picture that L533+ is the oldest mutation represented here. (You may have to zoom in to see figure details.)
Results Using No SNPs
Is this structure supported by STR data alone? To investigate, let’s first redo the network using no SNPs.
Figure 2 shows the zero-SNP result.
Superficially, Figure 2 looks similar to Figure 1, but, without the SNP information, the program no longer clearly separates the subgroups from one another. There is partial separation but also some mingling. For example, L03, L04, and L05 form a clear and isolated grouping, as do L06, L07, and L08. But this network doesn’t reveal that these two subgroups really are a single subgroup. And members of the GQ group don’t show the cohesion they do in Figure 1. The GM group still coheres, but there is the false suggestion that Q08 might be related. Similarly, there is the false suggestion that Q01 belongs to Group O. Trees using just STRs become fuzzier as one goes back in time (note the closed loops in the past of GO, etc.). In general, this figure still delineates more recent developments (nodes connected by shorter links) but becomes inaccurate for longer links. So it reveals the recent history of Group L but fails to show the deeper relationship. It’s the slower evolutionary pace of SNPs that allows them to provide a better picture of the deep structure of the past than do STRs.
Note: The links shown in red comprise the torso of the network. The torso shows where the data is too ambiguous to suggest a unique tree; it encompasses multiple choices for development. That is, a closed loop indicates more than one possible path from one node to another.
Adding a SNP
To illustrate how SNPs help clear up the deep structure, we can include S2606 in the mix. Figure 3 shows the result of adding it to the STRs of Figure 2.
Including this SNP separates out the GK and GL groups (the two with S2606-) from the remaining groups, all S2606+. However, lacking the L533 information, this model fails to separate GK from GL.
The Data Deficit
The original BritainsDNA sample is small (14 individuals) and geographically biased (United Kingdom). The FTDNA STR sample used here also is small (36 individuals) and biased in multiple ways -- for example, including groupings of related individuals and including individuals being encouraged to test for particular SNPs. So it’s not surprising that there are statistical differences between the two groups.
The Frequency of L533+
Six of the seven S2606- individuals in the network sample are L533+. The original BritainsDNA sample has two S2606- individuals, neither of whom is L533+. Although the statistics are weak, they suggest the S2606-/L533- sample is greatly underrepresented in this article.
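One way to put a number on "weak" is a Fisher exact test on the two counts just given (6 of 7 versus 0 of 2). This is an illustrative calculation that was not part of the original analysis, and it assumes both samples can be treated as random draws from a common population, which the sampling biases described above make doubtful.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with margins fixed.
        return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    observed = prob(a)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed + 1e-12)

# Rows: network sample, BritainsDNA sample. Columns: L533+, L533-.
p = fisher_two_sided(6, 1, 0, 2)
print(round(p, 3))  # 0.083
```

A value near 0.08 is suggestive but falls short of the conventional 0.05 threshold, in line with the caution expressed above.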
The Frequency of L69
Four of the twenty-eight S2606+ individuals in the network sample are L69+. None of the twelve S2606+ individuals in the original BritainsDNA sample are L69+. This may suggest that this group is much rarer than might appear from the network sample or that it has a much different geographical distribution – or both.
The Frequency of S2606+/S24121-/S2488-
Of the nine S2606+/S24121- individuals in the BritainsDNA sample, seven are S2488- and two are S2488+, which suggests S2606+/S24121-/S2488- is more common than S2606+/S24121-/S2488+. But the network sample has only one example of the former. In the I-L38 project there is a group of six individuals testing as S2606+/S24121-, but, without further testing, we don’t know how to classify them. We can, however, add them to the network sample, drop the S2488 SNP, and see how these six compare to the known S2488+ group. Table 2 shows the new additions (note that S2606+ implies L533-), and Figure 4 shows the result.
Two of the new samples (S04 and S05) appear to associate with some GQ members, so it’s likely these two will prove to be S2488+. Similarly, but less strongly, the network suggests S03 is S2488+. The trio S01, S02, and S06 are less closely associated with GQ members, so they might be S2488-, but they don’t seem as distant from the main body of GQ as are, say, Q03, Q09, and Q10. In short, the STR data is suggestive, but not conclusive.
S4556, S27697, and S25490
As mentioned earlier, the S4556, S27697, and S25490 SNPs divide the BritainsDNA sample into subgroups. So they most likely also divide the I-L38+ project database into subgroups. But, as of yet, we don’t have the relevant SNP data for the I-L38+ project.
Using More Partial Information
Figure 4 was the result of adding S2606+, S24121- samples that have not been tested for other SNPs. We can expand the data by adding S2606+ samples that have not been tested for S24121. Thus they could, in principle, eventually prove to be part of any of the S2606+ groups: GM, GN, GO, GP, or GQ. The point in expanding the sample is to see what, if anything, just the STR data tells us about these partially tested samples. Table 3 shows the added data. It includes one individual, T01, for which there is no SNP testing but whose STR data suggests a GM connection.
Figure 5 shows the corresponding network. It uses just the S2606 and L533 information in addition to the STR data.
Notice how all the S2606+ groups seem to radiate from a single median vector. This suggests that the S2606+ mutation was associated with a population expansion. Having a large number of S2606+ individuals creates conditions for multiple additional mutations. Also, the relatively short link from this central median vector to the median vector with the L533 branch suggests the S2606+ mutation took place relatively soon after the L533+ mutation.
Next, how well do the STR data predict the SNP groups? As with the preceding example, the answer is mixed. Again, the GM group is well defined, and T01 does seem to fit in as a member. Most of GO and individual parts of GQ cohere. J01, J02, and especially S04 and S05 look well integrated into GQ. On the other hand, the apparent association of Q01 with GO serves as a warning not to rely too strongly on associations suggested by the figure.
When Fluxus Network is used with STR data, it constructs plausible trees and networks that are likely accurate and well resolved for recently developed haplotypes. But as the STR genetic distance from present haplotypes increases, the trees and networks become fuzzier and less reliable. Adding SNP data to the mix helps resolve the network for these earlier times. One reason for this is that the slower mutation rates for SNPs mean we are dealing with much smaller genetic distances for SNPs than for STRs over the same period of time.
As SNP testing becomes cheaper and more extensive, we can expect to develop better ideas about the history of the I-L38+ haplogroup.