SNP Analysis - Klyosov's Comment
STR and SNP Analysis
Comment to “SNP Analysis”
Recently a paper titled “SNP Analysis” was posted at the above link. Granted, the material is rather complicated and needs careful study. However, the cited paper sends an inappropriate message right in its first sentence, hence my response. That sentence contains false statements. Firstly, it says “SNPs mutate less often and more predictably than STRs”.
Well, SNPs are not “predictable”, since they occur randomly. It is a statistical matter. They are as “predictable” as a coin toss, whether it will show heads or tails. In particular, SNPs are neither predictable nor reliable on short time spans, such as the time span of the R1a Ashkenazi lineage from its common ancestor, who lived, by different accounts (that is, calculated using various haplotype datasets), between 1150 and 1300 years before the present (ybp). As will be shown below, this timespan corresponds to about 8 SNPs if a fragment of the Y chromosome of about 10 Mb is sequenced. However, 8 SNPs is an average figure. In reality, it is between 6 and 11 SNPs. So much for predictability. Try tossing a coin 6, 8 or 11 times and predicting which side will come up next, and what the average of those heads and tails will be.
So, please, no more wishful thinking about “predictability” of SNPs.
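The spread described above can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the number of SNPs accumulated over a fixed timespan follows a Poisson distribution with a mean of 8; the Poisson model and the function names are assumptions of this sketch and are not taken from the cited paper.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def prob_between(low, high, mean):
    """Probability that the count falls in the inclusive range low..high."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))

MEAN_SNPS = 8  # average SNP count quoted in the text for about 10 Mb

# Chance that a single lineage shows between 6 and 11 SNPs.
p_in_range = prob_between(6, 11, MEAN_SNPS)

# For a Poisson count the variance equals the mean.
std_dev = math.sqrt(MEAN_SNPS)

print(f"P(6 <= SNPs <= 11) = {p_in_range:.2f}")
print(f"standard deviation = {std_dev:.1f} SNPs")
```

Under this assumed model a noticeable share of lineages would fall outside even the 6 to 11 range, which is the point made above about short time spans.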
Secondly, the same phrase says – “SNPs will ultimately be more reliable than STRs in identifying the genealogical relationships among individual R1a1a Ashkenazi Levites (assuming that a sufficient sample of R1a1a Ashkenazi Levites do full Y-DNA sequencing)…”
On the one hand, the statement is rather strange, as strange as asking which is more useful and reliable, a telescope or a microscope. Both are useful, for different purposes. Do not forget binoculars as well; there are many of them, of different magnifications, shapes and sizes. Why? Because they all serve different purposes. Similarly, both SNPs and STRs are useful, to be employed in different situations. They are complementary to each other. I do not know why one would even compare them in terms of which one is better. It only shows that some folks do not realize how useful both tools are, and that each of them has its own limitations and benefits.
That is why SNPs and STRs should always go together. They both provide useful information, each of its own kind. Examples will be given below.
Finally, in this introductory part, please notice “assuming that a sufficient sample of …Levites do full Y-DNA sequencing”. We can assume whatever we want. Why not assume that a sufficient sample of Levites take a 500-marker haplotype test? Or a 1000-marker one? It certainly would be beneficial. Why not assume that all people on Earth will be healthy and wealthy? This would be great too. However, we have to consider what we have right now, not empty, albeit great, projects. To expect that a sufficient number of Levites will have all 58 million nucleotides of their Y chromosome sequenced is no small thing. Some advanced tests analyze 10 million nucleotides (the Big Y test), some analyze 30 million, and this was a great achievement. However, the point is not just to determine all SNPs in each individual. STRs give a different angle on the DNA analysis. Which one, we will consider here as well.
So, the main conclusion of this introductory part is that STR and SNP analysis should go hand in hand. They are synergistic. It would be a great mistake to ignore or diminish the significance of either of them.
Let us first summarize what was written in the cited paper on SNPs and their contribution to unfolding the history of R1a Ashkenazi Levites.
Below is an SNP tree of R1a Ashkenazi Levites (as of April 22, 2014), which accompanies the cited paper. The image is not very clear, so I will repeat what is written on the horizontal “levels”, from top to bottom: CTS6, Y2619, Y2630, YP264, and “Private”.
Article © 2014 by Anatole A. Klyosov. Posted on LeviteDNA.org after Professor Klyosov granted permission to Meir G. Gover to do so.
Please note that since Professor Klyosov wrote this article: (1) some of the analysis upon which he commented has been augmented to reflect more recent test results; and (2) the webpage on which he commented has been broken into three webpages for ease of reference: (a) a general discussion of SNPs found among R1a1a Ashkenazi Levites; (b) SNP-based calculations of a time to a Most Recent Common Ancestor for R1a1a Ashkenazi Levites and some clusters thereof; and (c) an SNP-based tree for R1a1a Ashkenazi Levites.
Fig. 1. An SNP tree for Ashkenazi Levites of R1a-L342.2-CTS6 and downstream subclades. Taken from the cited article “SNP Analysis”. The horizontal “levels” from top to bottom are CTS6, Y2619, Y2630, YP264, and “Private”.
The cited paper explains that all R1a Ashkenazi Levites (and not only them) are characterized by the following string of SNPs: Z93 – Z94 – Z2124 – Z2122 – F1345 – CTS6. This is fine; there are no problems with that. Some other people who are not Ashkenazi Levites have F1345 and/or CTS6 (and the SNPs above them). Overall, 12 SNPs have been found thus far at and below F1345/Z2472 (among them M582/Z2474). Therefore, the paper sets the next “borderline” for Ashkenazi Levites at the second level in the diagram above, that is, Y2619. As many as 19 SNPs have been identified at and below Y2619. All of them are described in the cited paper. The next “borderline” is set in the cited paper at Y2630, and finally (thus far) there is an apparent level for the branch of the Horowitz rabbinical family at YP264 (which is far from certain, as shown below using STRs). Lastly, there are “private” SNPs, which are observed in individual Y chromosomes and are not shared with other Ashkenazi Levites.
The above is essentially what the cited paper says about assigning certain SNPs to several groups of Ashkenazi Levites. “Separate groups” here means that they are split according to the set of SNPs in their Y chromosomes.
Is there any real significance in such a split of R1a Ashkenazi Levites into different “groups”? Yes and no, as I see it. The thing is that a new SNP appears in the Y chromosome about once per generation, on average. It means that each family with a man in it has a unique set of SNP mutations in his Y chromosome. In other words, SNPs will eventually split R1a Ashkenazi Levites, since we are talking about them, down to every single family. So, what is the big deal about it? We know that every human being is unique, in a way. That is, I personally do not quite understand that chase after every new SNP. There should be a certain purpose, a goal in it, shouldn’t there? What is the goal? What is the purpose here? I believe a criminalist would appreciate it. A police detective would appreciate it too. But why do we need to split Ashkenazi Levites into dozens and hundreds of “clusters”? Into thousands of clusters, down to single families? In order to find relatives? And then what? What if they want to protect their privacy and do not want to be drawn into your family circle? Go figure.
Apparently, everyone should answer those questions for him/herself.
The cited paper also considers using SNPs for calculating the TMRCA, that is, the Time to the Most Recent Common Ancestor, for all R1a Ashkenazi Levites and for their “subclusters”. Again, we see an erroneous statement in the first sentence of that section: “SNPs mutate at a regular rate”. It is the same mistake as writing “when you toss a coin, heads and tails occur at a regular rate”. They do not. They are random. They occur with a certain probability, and we know that only after dozens, or better hundreds, of tosses does the observed frequency approach “fifty-fifty”. The same is true for SNPs: only after dozens and hundreds of them does their frequency approach rather stable values, and those values are hotly debated. Furthermore, the frequency depends on the size of the Y-chromosomal fragment used for sequencing, and the size varies, slightly or not so slightly, from one tested person to another. That is probably why the number of SNPs accumulated over the same timespan varies between individuals.
Here is an example. Below is an SNP tree, provided by Vladimir Tagankin (in modification by Dr. Alexander Zolotarev), for 27 individuals having R1a-Z280 (based upon the Big Y test). The bottom line shows their branches within Z280 (NEA – Northern Eurasian, CEA – Central Eurasian, WC – Western Carpathian, BC – Balto-Carpathian, WEA – Western Eurasian, NE – Northern European, NC – Northern Carpathian, EC – Eastern Carpathian, and some SNPs). As we see, the same timespan (for Z280 it is about 4900 years) produces a different number of SNPs, from 29 to 43, a variation of 48%. Based on the figure of 4900 years for Z280 (obtained, by the way, with STRs), the average timespan per SNP varies from 114 to 169 years.
Therefore, the statement of the cited paper (with a reference to Michal Milewski) that “a (SNP) mutation may occur once … every 150 years on the SNPs reported by FTDNA’s Big Y test” is not exactly correct; unfortunately, a margin of error is missing here. If we employ 150 years per SNP, the 29 to 43 SNPs of R1a-Z280 would give a TMRCA anywhere from 4350 to 6450 years. One can see that there is not much hope for a more or less accurate determination of the TMRCA for Ashkenazi Levites based on SNP counts, since the timespan for their “cluster” is only 1150-1300 years, that is, about 7 to 11 SNPs. Again, as we have said above, toss a coin 7 or 11 times and see how reproducible the results are. So much for “SNPs mutate less often and more predictably than STRs”, with which we began this paper. “Less often” is also not exactly correct. The 111-marker haplotype mutates overall, on average, once in about 125 years. The 67-marker haplotype mutates, on average, once in 208 years. An SNP occurs, on average, once in 114 to 169 years. We see the same order of magnitude in the frequencies of mutations.
Fig. 2. An SNP tree for the subclade R1a-Z280, provided by Vladimir Tagankin (in modification by Dr. Alexander Zolotarev), for 27 individuals (based upon the Big Y test).
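The arithmetic behind the Z280 example can be retraced in a few lines. This is only a sketch using the figures quoted in the text (4900 years, 29 to 43 SNPs, 150 years per SNP, and the haplotype mutation rates of 0.198 and 0.12 per 25-year generation that appear later in this comment); the variable and function names are illustrative.

```python
Z280_AGE_YEARS = 4900        # STR-based age of Z280 quoted in the text
MIN_SNPS, MAX_SNPS = 29, 43  # fewest and most SNPs among the 27 individuals
QUOTED_YEARS_PER_SNP = 150   # figure quoted from the cited paper
YEARS_PER_GENERATION = 25    # "conditional generation" used in this comment

def years_per_snp(age_years, snp_count):
    """Average interval between SNPs for one lineage."""
    return age_years / snp_count

# Fewest SNPs means the longest interval per SNP, and vice versa.
longest_interval = years_per_snp(Z280_AGE_YEARS, MIN_SNPS)
shortest_interval = years_per_snp(Z280_AGE_YEARS, MAX_SNPS)
variation_percent = (MAX_SNPS / MIN_SNPS - 1) * 100

# TMRCA range implied by a fixed 150 years per SNP.
tmrca_low = MIN_SNPS * QUOTED_YEARS_PER_SNP
tmrca_high = MAX_SNPS * QUOTED_YEARS_PER_SNP

# Average interval between mutations in a whole haplotype (rates per generation).
interval_111 = YEARS_PER_GENERATION / 0.198
interval_67 = YEARS_PER_GENERATION / 0.12

print(f"years per SNP: {shortest_interval:.0f} to {longest_interval:.0f}")
print(f"variation in SNP count: {variation_percent:.0f}%")
print(f"TMRCA at 150 years/SNP: {tmrca_low} to {tmrca_high} years")
print(f"111-marker haplotype: one mutation per {interval_111:.0f} years")
print(f"67-marker haplotype: one mutation per {interval_67:.0f} years")
```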
Having said that, let’s see how the TMRCAs for Ashkenazi Levites were determined based on SNPs, and what those TMRCAs are. For the string of SNPs
F1345 – CTS6 – Y2619 – Y2630 – YP264
the TMRCAs were obtained as follows (based on SNPs from about 30 Mb and 10 Mb of Y-chromosome coverage, respectively):
4224-3975 – 3168-2925 – 1496-1725 – 1056-1425 – 440-675 years
The figures obtained are certainly meaningful (it would be strange if they were not), and reasonably close to those determined from mutations in haplotypes (STRs). Actually, the cited paper uses the STR-based TMRCAs as reference data, not vice versa. Let me quote: “Z645 is commonly considered to be about 5,800 years old” (the figure was obtained using STRs), “…Y2619 is indeed slightly older… than expected based on the previously known STR-based estimates”, and so on. Was there any need to bash STRs at the beginning of the cited paper, and then use STRs as reference material?
Now, it is time to move to STRs, and consider them.
Currently, the IRAKAZ database (67- and 111-marker R1a haplotypes), which has 3997 entries, contains 181 67-marker haplotypes of Ashkenazi Levites. Their haplotype tree is shown below.
Fig. 3. A 67 marker haplotype tree of 181 Ashkenazi Levites of R1a-L342.2-CTS6 subclade and downstream subclades. The tree consists of two halves, the upper part of the tree has DYS459b=10, the lower part has DYS459b=11. The borderline dissects the tree between haplotype 14 on the left and 83 on the right.
Among those 181 haplotypes, three form a branch with distinct alleles and the following base haplotype:
13 25 17 10 11 14 12 12 10 13 11 30—14 9 10 11 11 24 14 20 30 12 12 15 15—11 11 19 23 14 16 19 20 35 40 14 11—11 8 17 17 8 12 10 8 11 10 12 22 22 15 10 12 12 14 8 14 23 21 12 12 12 13 10 11 12 13
It contains three mutations (alleles 17, 40 and 12 in place of 16, 38 and 11) compared with the base haplotype for the whole dataset (see below), and those three haplotypes contain only seven mutations from their base haplotype, shown above. This gives 7/3/0.12 = 19 conditional generations (25 years each), that is, 19x25 = 475±185 years from their common ancestor (the margin of error is calculated as described in [Klyosov, 2009; Klyosov, 2012]).
The remaining 179 haplotypes contain 1002 mutations from their base haplotype (in its first 67 markers)
13 25 16 10 11 14 12 12 10 13 11 30—14 9 10/11 11 11 24 14 20 30 12 12 15 15—11 11 19 23 14 16 19 20 35 38 14 11—11 8 17 17 8 12 10 8 11 10 12 22 22 15 10 12 12 14 8 14 23 21 12 12 11 13 10 11 12 13—32 15 9 17 12 27 27 19 12 12 12 12 10 9 12 11 10 11 11 30 12 12 25 13 9 10 20 15 20 11 23 15 12 15 25 12 23 19 10 15 17 9 11 11
which gives 1002/179/0.12 = 47 → 50 conditional generations, or 1250±130 years to their common ancestor (the arrow shows a correction for back mutations). This is in agreement with an earlier figure of 1300±150 years, obtained with a smaller number of haplotypes (Rozhanskii and Klyosov, 2012). The automatic calculator gives 1123±168 years, which is practically the same value within the margin of error, even though the calculator employs a different method of calculation, based on individual mutation rate constants for each marker (http://aklyosov.home.comcast.net/Kilin-Klyosov TMRCA 111 ver 1.xlsb; Kilin and Klyosov, 2014).
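The calculation used here and for the three-haplotype branch above follows one pattern: mutations divided by the number of haplotypes, divided by the mutation rate per haplotype per generation, then multiplied by 25 years. A minimal sketch of that pattern is given below. The function names are illustrative, and the correction for back mutations is not reproduced: the corrected value of 50 generations is simply taken from the text, as are the margins of error.

```python
YEARS_PER_GENERATION = 25  # one "conditional generation"
RATE_67_MARKERS = 0.12     # mutations per 67-marker haplotype per generation

def uncorrected_generations(mutations, haplotypes, rate):
    """Generations to the common ancestor before any back-mutation correction."""
    return mutations / haplotypes / rate

def to_years(generations):
    return generations * YEARS_PER_GENERATION

# Branch of three haplotypes with seven mutations.
branch_generations = round(uncorrected_generations(7, 3, RATE_67_MARKERS))
branch_years = to_years(branch_generations)

# Remaining haplotypes: 1002 mutations in 179 haplotypes.
main_generations = round(uncorrected_generations(1002, 179, RATE_67_MARKERS))
CORRECTED_MAIN_GENERATIONS = 50  # value after back-mutation correction, from the text
main_years = to_years(CORRECTED_MAIN_GENERATIONS)

print(f"small branch: {branch_generations} generations, {branch_years} years")
print(f"main set: {main_generations} generations uncorrected, {main_years} years after correction")
```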
The 179 haplotypes themselves fall into two halves. The upper part of the tree has DYS459b=10, the lower part has DYS459b=11. The borderline dissects the tree between haplotype number 14 on the left and 83 on the right (Fig. 3). These alleles are shown in the base haplotype above as 10/11 in the first (top) line. Such a variation in one marker cannot reflect a random mutation; it shows two main branches in the dataset, separated by only 1/0.12 = 8 generations, or 200 years. It means that the split occurred almost immediately (within 200 years) after the common ancestor of Ashkenazi Levites lived.
Three mutations between the two base haplotypes shown above set the two apart by 3/0.12 = 25 → 26 conditional generations, that is, by about 650 years. It means that their common ancestor lived about (650+1250+475)/2 = 1200 years ago. This is the “age” of the main base haplotype above. In other words, the younger branch split from the older one, and this occurred about 650 years ago.
Let us take a closer look at the two halves of the haplotype tree. There are 83 haplotypes with DYS459b=10 (including a few with allele 11 that are located in the upper part of the tree; they reflect random mutations 10 → 11). All 83 haplotypes contain 404 mutations from the base haplotype, which gives 404/83/0.12 = 41 → 43 generations, that is, 1075±120 years from their common ancestor. The 93 haplotypes with DYS459b=11 (including a few with allele 10 that are located in the lower half of the tree; they reflect random mutations 11 → 10) contain 430 mutations from the base haplotype, which gives 430/93/0.12 = 39 → 41 generations, that is, 1025±115 years from their common ancestor. As was indicated, there is only one mutation between these two base haplotypes (in DYS459b), which separates their common ancestors by 200 years. This places the common ancestor of the two halves of the tree at (200+1075+1025)/2 = 1150 ybp.
Finally, there is a small branch of 5 haplotypes having DYS459b=11 but located in the upper half of the tree. All five contain 17 mutations, which gives 17/5/0.12 = 28 → 29 generations, that is, 725±190 years to their common ancestor. Their base haplotype differs by 2.5 mutations from that of the lower half of the tree, which is equivalent to 525 years, and places the common ancestor of the whole tree at 1150 ybp. As we see, the pattern is very consistent. It shows a common ancestor of Ashkenazi Levites at approximately 1200 ybp, give or take a century; almost immediately afterwards two branches split apart, after the progenitor of one of them acquired a mutation in DYS459b; and there are a few young branches which split from those two branches approximately 725 and 475 years ago. They probably have their own SNPs.
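The three estimates of the common ancestor above all use the same averaging step: the time separating two base haplotypes is added to the ages of the two branches, and the sum is halved. One way to read it is that the path from one branch's ancestor up to the common ancestor and down to the other branch's ancestor equals the separation, so twice the age of the common ancestor equals the separation plus the two branch ages. The sketch below only retraces the numbers quoted in the text; the function name and the case labels are illustrative.

```python
def common_ancestor_age(separation_years, branch_age_a, branch_age_b):
    """Age of the ancestor shared by two branches whose base haplotypes
    are separated by `separation_years` of mutational distance."""
    return (separation_years + branch_age_a + branch_age_b) / 2

# (separation, age of branch A, age of branch B), all in years, from the text.
cases = {
    "main set vs. three-haplotype branch": (650, 1250, 475),
    "DYS459b=10 half vs. DYS459b=11 half": (200, 1075, 1025),
    "five-haplotype branch vs. lower half": (525, 725, 1025),
}

ages = {label: common_ancestor_age(*values) for label, values in cases.items()}

for label, age in ages.items():
    print(f"{label}: about {age:.0f} years before present")
```

All three results fall between roughly 1140 and 1190 years, which the text rounds to 1150 to 1200 ybp.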
One interesting set contains five 111-marker haplotypes from apparent descendants of the Horowitz rabbinical family, whose common ancestor, according to the documented genealogy, lived in 1507-1572, that is, 442-507 years ago. Their base haplotype is exactly the main base haplotype shown above, with DYS459b=10. The five haplotypes together contain 17 mutations from the 111-marker base haplotype, which gives 17/5/0.198 = 17 conditional generations, or 425±110 years to their common ancestor. The Calculator gives 439±137 years, which is practically the same thing. The same five haplotypes in the 67-marker format contain 11 mutations, which gives 11/5/0.12 = 18 conditional generations, or 450±140 years to their common ancestor. The same five haplotypes in the 37-marker format contain 8 mutations, which gives 8/5/0.09 = 18 conditional generations, or 450±165 years to their common ancestor.
One can see how reliable the calculations are. For the 111-, 67- and 37-marker datasets of the five Horowitz haplotypes, the number of conditional generations to their common ancestor was 17, 18 and 18, respectively.
As it was indicated, the documented genealogy gave a timespan to the common ancestor of 442-507 years, the manual calculations gave 425±110 and 450±140 years (111- and 67-marker haplotypes), the automatic Calculator gave 439±137 years. All these figures are within the margins of error with each other and with the documented genealogy data.
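The consistency check across the three marker panels can be retraced as follows. This is a sketch using only the mutation counts, rates, margins of error and documented dates quoted above; no back-mutation correction is applied at these short time spans, in line with the figures in the text, and all names are illustrative.

```python
YEARS_PER_GENERATION = 25
HAPLOTYPES = 5
DOCUMENTED_RANGE = (442, 507)  # years ago, from the documented genealogy

# panel size: (mutations, rate per haplotype per generation, margin of error from the text)
PANELS = {
    111: (17, 0.198, 110),
    67: (11, 0.12, 140),
    37: (8, 0.09, 165),
}

def estimate(mutations, haplotypes, rate):
    """Return (conditional generations, years) to the common ancestor."""
    generations = round(mutations / haplotypes / rate)
    return generations, generations * YEARS_PER_GENERATION

def overlaps(center, margin, interval):
    """True if center ± margin intersects the given (low, high) interval."""
    low, high = interval
    return center - margin <= high and center + margin >= low

results = {}
for markers, (mutations, rate, margin) in PANELS.items():
    generations, years = estimate(mutations, HAPLOTYPES, rate)
    agrees = overlaps(years, margin, DOCUMENTED_RANGE)
    results[markers] = (generations, years, agrees)
    print(f"{markers} markers: {generations} generations, {years}±{margin} years, "
          f"agrees with genealogy: {agrees}")
```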
Finally, let us reconcile the STR and SNP data. In a recent paper “Thoughts about Jewish DNA genealogy” (https://sites.google.com/site/levitedna/origins-of-r1a1a-ashkenazi-levites/2014-klyosov-article-on-jewish-dna-genealogy) a timespan to a common ancestor of Jewish and Arabic R1a haplotypes was determined as 3990 years before the present. The SNP-based calculations showed that TMRCA for F1345 was 3975-4224 years (see above). Those are essentially the same dates.
The common ancestor in whom CTS6 originated (2925-3168 years ago, SNP-based data, see above) has left no extant Jewish R1a haplotypes known today. It seems that the bearers did not survive; hence, they are absent among contemporary Jewish R1a haplotypes.
The TMRCA for Y2619 (1496-1725 years, SNP-based data, see above) is either overestimated (it should be around 1200 years) or the descendants of that common ancestor also vanished, not having passed a population bottleneck around 289-518 AD, that is, in the 3rd-6th centuries. They could have been the Khazars; however, there is no other support for this suggestion.
The TMRCA of 1056-1425 years (SNP Y2630) mirrors that of 1200±125 ybp, which embraces practically all R1a Ashkenazi Levites known today. This is their main SNP.
Finally, there is the TMRCA of 440-675 years (YP264). There are at least three candidates among Ashkenazi Levite branches for that SNP, with TMRCAs of 475, 425, and 450 years, determined using STRs (the last one belongs to the alleged Horowitz descendants). In terms of STRs, the latter do not form a branch, since their base haplotype is exactly that of the upper half of the tree, with DYS459b=10. However, since the branch is rather young, the SNP (YP264) and the STRs might be independent of each other. So, the Horowitz descendants might or might not have YP264. Only direct testing will show the presence or absence of YP264 in them.
As a conclusion, the author hopes that this essay illustrates the usefulness of both SNPs and STRs for DNA genealogy. They indeed complement each other, working in synergism.
Kilin, V.V., Klyosov, A.A. (2014) A novel TMRCA calculator working in all practically possible time ranges between hundreds of years to millions of years before present, based on the random walk model. Proc. Academy of DNA Genealogy, 7, No. 3, 438-478. [A 2016 article about the TMRCA calculator is linked here.]