Using SNPs for Genealogy

Here is a report on a successful genealogical use of SNPs: ADVANCED Y-DNA TESTING FOR THE ACREE ONE-NAME STUDY

For an excellent (but detailed and long) video, "Using SNP Testing and STRs to Enhance Genealogy Research", go to this YouTube site.

Overview - Revised 12/28/2015 By Doug Phelps, assisted by John Phelps

Discussion:

You may have found YDNA matches at FTDNA that suggest a common ancestor within a number of generations, with a range of probabilities for each. For example, if you match someone at 65 of 67 STR markers, you are told here that you have a 95% probability that the Most Recent Common Ancestor (MRCA) was WITHIN 14 generations (at 30 years per generation, that's WITHIN 420 years) IF you both have the same surname. At a match of 107 of 111 markers, you are told here that you have a 95% probability of an MRCA at 14 generations (420 years). Notice the "within" limitation: the common ancestor could be very recent, or could be at the most distant time.

With such a range, you turn to carefully comparing STR values, hoping to see patterns that might identify family branches. In practice, this is often inconclusive and possibly misleading because of back-mutating markers, mutation rates that vary from marker to marker, and markers that seem not to mutate at all.

Wouldn't it be better to have mutations that clearly identify family branches and suggest TMRCA (time to most recent common ancestor) more reliably? SNPs may well be the answer.

SNPs mutate far less often than STRs, and back mutations are extremely uncommon. Most authorities consider the average time between SNP mutations to be 150 years, although Yfull.com uses 144 years plus 60 (as the age of the man tested). Like STR mutations, SNPs are handed down from father to son.

A few illustrative examples of using SNPs in genealogical time:

Assume two men, Person A and Person B, are SNP tested with FTDNA's Big Y. An analysis by a qualified person/company compares their SNPs to ALL other SNPs and is able to place them in the most recently named subgroup or subclade. "New SNPs" may be identified.

In the following scenarios, “new SNPs” means SNPs not previously found in others in the same subclade. When viewing these, remember that newly found SNPs are identified by their DNA position. Once discovered, most are given “names” with a letter prefix and a new number. Most family or private SNPs are never given a “name”.

Case #1: Person A and Person B have the same newly found SNPs. There was no mutation within the past 150* years, thus the common ancestor was alive less than 150* years ago. Others could test for those SNPs to see if they were in that line during part or all of that time.

Case #2: Person A and Person B have shared SNP#1 and each person has one unshared SNP. Then the common ancestor was alive about 150*-300 years ago. Others could test for those SNPs to see where they connect.

Case #3: Person A and Person B have shared SNP#1 and each has two unshared SNPs. Then the common ancestor was alive about 300*-450 years ago. Others could test for those SNPs to see where they connect.
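The pattern in the three cases can be put into a few lines of Python. This is only a rough sketch of the rule of thumb used above: the function name is made up for illustration, and the 150 years per SNP is the rough estimating value from the text, not a measured constant.

```python
# Rough value used in the cases above; an estimating figure, not a constant.
YEARS_PER_SNP = 150


def mrca_window(unshared_snps_each: int) -> tuple:
    """Return (low, high) years ago for the common ancestor.

    unshared_snps_each: number of SNPs each man has that the other lacks.
    Hypothetical helper for illustration only.
    """
    low = unshared_snps_each * YEARS_PER_SNP
    high = (unshared_snps_each + 1) * YEARS_PER_SNP
    return low, high


for case, n in enumerate([0, 1, 2], start=1):
    low, high = mrca_window(n)
    print(f"Case #{case}: {n} unshared SNP(s) each -> about {low}-{high} years ago")
```

With zero unshared SNPs the window starts at 0, which matches the "less than 150 years" wording of Case #1.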

Another way of looking at this, based on a known genealogy: About 1700, ancestor Joe had 3 sons, Doug, Jack, and John. Direct paternal descendants of Doug and Jack are tested. Comparing the SNPs, Doug’s descendant is shown to have new SNP mutations X and Y. Jack’s descendant has new mutations A and B. John was not tested. These SNPs identify two main family branches, and others can test for them. A man who is positive for just one of the SNPs belongs to that main branch but on another branch that split from it, probably about 150* years ago. If a man is negative for all 4, it is likely he is of John’s main branch or not related at all.

Thus individual SNP testing at YSEQ.com (or at FTDNA if the SNP test is available) might suffice without STR or additional Big Y testing at FTDNA. Of course, to give finer definition to a family tree, additional Big Y testing will be needed.

These are of course simple examples of many variations. Yfull.com will analyze the results of a Big Y test as part of a name family group (if requested) and provide the valid new/unique and shared SNPs.
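The Joe/Doug/Jack/John example amounts to a small lookup, sketched here in Python. The SNP labels X, Y, A and B come from the example itself. The function name and the handling of a man who tests positive for SNPs from both branches are assumptions added for illustration, since the text does not cover that situation.

```python
# SNPs found in the tested descendants, as in the example above.
DOUG_SNPS = {"X", "Y"}
JACK_SNPS = {"A", "B"}


def classify(positive: set) -> str:
    """Place a newly tested man from the set of SNPs he is positive for.

    Hypothetical helper; the 'conflicting' outcome is an assumption,
    not something the example discusses.
    """
    doug_hits = positive & DOUG_SNPS
    jack_hits = positive & JACK_SNPS
    if doug_hits and jack_hits:
        return "conflicting result: recheck the tests"
    for name, hits, full in (("Doug", doug_hits, DOUG_SNPS),
                             ("Jack", jack_hits, JACK_SNPS)):
        if hits == full:
            return f"{name}'s main branch"
        if hits:
            # Only one of the two SNPs: same main branch, earlier split.
            return f"{name}'s main branch, on a separate sub-branch"
    return "John's main branch, or not related"


print(classify({"X", "Y"}))  # both of Doug's SNPs
print(classify({"A"}))       # only one of Jack's SNPs
print(classify(set()))       # negative for all four
```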

*150 years is the typical rough value for estimations. YFULL.com will provide a "confidence interval" (CI) range of TMRCA at 95%. Their age estimation is based on the Q&A from Yfull.com given below.

YFULL.com has analyzed a number of Phelps/Pond M44 Big Y tests. Considerably more men of other M44 surnames were tested.

Yfull.com provides an experimental E tree HERE and shows member results by a YFnumber. The M44 testing is ongoing and is being discussed in the related FTDNA forum. Below, find the approach YFULL uses to age the SNPs/subgroups.

YFULL's Age Estimation This FAQ was created by YFull customer William F. Archerd. Last updated on Nov. 17, 2015.

Q: What is YFull's age estimation methodology?

A: YFull uses a methodology based on the research and analysis discussed in Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data by Adamov, Guryanov, Korzhavin, Tagankin, Urasin (2015).

The methodology is reflected in the Age Estimation table for each analyzed sample and in the subclade age pop-up tables linked to the YTree.

The first step is to select and count reliable derived Known SNPs for a sample. The number of counted SNPs appears in both tables.

The following five criteria are used to select reliable SNPs:

1. The coordinates of the SNPs must fall within the combBED regions designed to select X-degenerate segments. The combBED area borders were formed by mutual overlapping BED files taken from the work of Poznik et al. (2013) (total length of 10.45 Mbp) and by the generalized BigY BED file (11.38 Mbp long), published in the BigY White Paper (2014). The result was 857 continuous segments of the Y-chromosome with a total length of 8,473,821 base pairs.

2. Insertions and deletions (called "Indels") are excluded, as are multiple nucleotide polymorphisms (SNPs with more than one base position).

3. Variants detected in more than five different "localizations" are excluded. "Localization" means a group of samples from the YFull database belonging to the same subclade and having derived allele nomination. In some cases, the same derived variants may be found in different subclades or different haplogroups because of mapping errors or because the standard reference sequence is based mainly on haplogroup R1b data and to a lesser extent on haplogroup G data. This causes some variants in some haplogroups to be ancestral instead of derived. Although YFull established the "five different localizations" criterion empirically, the criterion is soft but believed to be effective.

4. SNPs with only one or two "reads" are excluded.

5. SNPs are excluded if the "read quality" is less than 90%. Quality is determined pursuant to YFull's proprietary SNP rating system. See the FAQ How does YFull determine the quality rating for my SNPs?
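The five criteria can be read as a simple filter. The Python sketch below is only an illustration of that reading, not YFull's actual code: the Variant record, its field names, the toy region list and the inclusive region boundaries are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record for one called variant (field names assumed)."""
    position: int        # Y-chromosome coordinate
    is_indel: bool       # insertion or deletion
    is_mnp: bool         # multiple nucleotide polymorphism
    localizations: int   # number of different "localizations" it appears in
    reads: int           # number of reads supporting the call
    quality_pct: float   # read quality rating, in percent


def in_regions(position, regions):
    """Criterion 1: inside one of the combBED segments (ends assumed inclusive)."""
    return any(start <= position <= end for start, end in regions)


def is_reliable(v, regions):
    return (in_regions(v.position, regions)        # 1. within combBED
            and not v.is_indel and not v.is_mnp    # 2. no indels or MNPs
            and v.localizations <= 5               # 3. at most five localizations
            and v.reads >= 3                       # 4. more than two reads
            and v.quality_pct >= 90.0)             # 5. quality of at least 90%


# Toy data only; real combBED has 857 segments.
toy_regions = [(1_000, 2_000), (5_000, 9_000)]
variants = [
    Variant(1_500, False, False, 1, 10, 99.0),  # passes every criterion
    Variant(1_600, True, False, 1, 10, 99.0),   # indel
    Variant(3_000, False, False, 1, 10, 99.0),  # outside the regions
    Variant(6_000, False, False, 1, 2, 99.0),   # only two reads
]
reliable = [v for v in variants if is_reliable(v, toy_regions)]
print(len(reliable), "of", len(variants), "variants counted")
```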

The Age Estimation table for each sample provides a high level of detail about the application of the selection criteria. Reliable Known and Novel SNPs are listed in the "+Known SNPS" and "+Novels" columns of the table, and SNPs not selected are listed in the "x Known SNPs" and "x Novels" columns, with details related to the five criteria.

The second step of the sample age determination methodology is explained in the YTree "info" pop-up tables for the YTree subclades. For each sample in a table, two formulas are applied to the number of SNPs for the sample. The first formula corrects the SNP count to an assumed (or corrected) count from the combBED bp coverage area, and the second formula establishes the age of a sample based on the corrected count. The second formula uses an assumed mutation rate of 144.41 years per SNP (0.8178 × 10^-9), which is the average of the mutation rates of the ancient Anzick-1 sample and of a group of known genealogies, and an assumed age of 60 years for living providers of YFull samples.
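The FAQ does not spell out the two formulas, so the following Python sketch is a guess at their general shape. The proportional scaling of the SNP count to the full combBED length is an assumption, and so are the function names. The second step (corrected count times 144.41, plus 60) follows the constants given in the FAQ.

```python
COMBBED_BP = 8_473_821   # total combBED length from criterion 1
YEARS_PER_SNP = 144.41   # assumed mutation rate from the FAQ
SAMPLE_AGE = 60          # assumed age of a living sample provider


def corrected_snp_count(snps, covered_bp):
    """First formula (assumed form): scale the count to full combBED coverage."""
    return snps * COMBBED_BP / covered_bp


def sample_age_ybp(snps, covered_bp):
    """Second formula: age in years before present from the corrected count."""
    return corrected_snp_count(snps, covered_bp) * YEARS_PER_SNP + SAMPLE_AGE


# Illustrative numbers only: 20 reliable SNPs, test covering all of combBED.
print(round(sample_age_ybp(20, COMBBED_BP), 1))
# The same 20 SNPs found with only 90% coverage imply a higher corrected count.
print(round(sample_age_ybp(20, 0.9 * COMBBED_BP), 1))
```

The point of the correction is that a test covering less of the combBED area will find fewer SNPs, so the raw count is scaled up before it is converted to years.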

See also: How does YFull determine "formed" age and "TMRCA", and the related confidence intervals, of the subclades in its Experimental YTree?

Q: How does YFull determine "formed" age and "TMRCA", and the related confidence intervals, of the subclades in its Experimental YTree?

A: The following definitions and methodologies relate to the subclades in the Experimental YTree:

Subclade name: Each subclade name is highlighted in green.

SNPs "defining" a subclade: These are listed to the right of the subclade name (by SNP name, with additional SNP names in the grey-shaded pop-up: "X (a number) SNPs"). The SNP list for a subclade may change in the future as more samples are added to the YFull database and new branches are added.

Subclade "formed" age: The TMRCA (time to most recent common ancestor) of a subclade is used as the "formed" age of each branch of the subclade. Stated otherwise, the formed age of a branch is the same as the TMRCA of the "parent" subclade of that branch.

Determination of TMRCA for a subclade: The general rule is that the TMRCA of a subclade is equal to the average age (after rounding) shown in the yellow bar of the YTree "info" pop-up table for the subclade. In the situations where the general rule is not followed YFull will add an explanatory note at the bottom of the table. For an example, see the table for the I1-Z63 subclade.

Rounding rules: An age of less than 500 ybp is rounded to the nearest "25" (e.g., 381 becomes 375); an age of 500 to 1999 is rounded to the nearest "50" (e.g., 1477 becomes 1500); and an age of 2000 or more is rounded to the nearest "100" (e.g., 3160 becomes 3200).
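The rounding rules are easy to check with a short Python sketch. The function name is made up, and the handling of exact ties (rounding halves up) is an assumption, since the FAQ does not say how ties are treated.

```python
import math


def round_ybp(age):
    """Round an age in ybp following the FAQ's three bands."""
    if age < 500:
        step = 25
    elif age < 2000:
        step = 50
    else:
        step = 100
    # Round to the nearest multiple of step; halves go up (assumption).
    return int(math.floor(age / step + 0.5)) * step


for age in (381, 1477, 3160):
    print(age, "->", round_ybp(age))
```

The three test values are the FAQ's own examples.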

Formed CI xx% yyyy <-> zzzz ybp, TMRCA CI aa% bbbb <-> cccc ybp: CI means "Confidence Interval". A confidence interval is an indicator of the precision of the YFull "formed" age and "TMRCA" data in the Experimental YTree. YFull developed its own statistical analysis computer script in order to calculate its confidence intervals.

Yellow Bar in "info" pop-up table: The "ybp" (years before the present) for the subclade is the average of the ages of the branches and samples (if any) highlighted in green in the Branch ID column, as shown in the yellow bar "Formula".

Number of SNPs column in "info" pop-up table: For a branch, the number in this column is the average of the numbers reported for the samples in the branch. For a sample, the number in this column is the total of the Known SNPs and Novel SNPs located between the subclade and the present. These SNPs are identified in the Age Estimation table.

Other columns in "info" pop-up table: Branch numbers are averages of the numbers given for the samples in the branch. The two formulas used in the table are discussed in the FAQ: What is YFull's age estimation methodology?

Last updated on Nov. 18, 2015.

The following "best effort explanation" by D Phelps was in response to a question about the "play" in the range of years shown by YFULL.com for their TMRCA years before the present.

When we roll over, for example, "formed 3100 ybp formed 3100 ybp" (on the tree), we see "CI 95% 2600<->700 ybp". So that is the play. Some time ago I spent a lot of time trying to understand their “CI”.

I found in their FAQ: Formed CI xx% yyyy <-> zzzz ybp, TMRCA CI aa% bbbb <-> cccc ybp: CI means "Confidence Interval". A confidence interval is an indicator of the precision of the YFull "formed" age and "TMRCA" data in the Experimental YTree. YFull developed its own statistical analysis computer script in order to calculate its confidence intervals.

I studied the concept of confidence interval, but came away not very clear on it. It has to do with the level of confidence that future individual tests will actually have a number of unique SNPs back to the subclade that computes to an age within that range. The number of years is determined by the number of novel SNPs a tested man shows back to his terminal SNP. No doubt that CI is a big range. The more men tested in the haplogroup, the more it would narrow. I am seeing, thankfully, that the Phelps are not varying much at all. One subgroup of 3 Phelps has exactly 3 SNPs each. But perhaps they are more similar due to similar family environments? The use of 144 years per SNP also affects the ybp, but it appears 144 is the current thinking. What makes your situation more complex is that they don’t show the novel SNPs for the Sardinian.

For what it is worth, here is a response I got on the subject:

For simple subclades (as E-A6108) we use the Poisson distribution.

https://en.wikipedia.org/wiki/Poisson_distribution

Observed mutations is 3.16+2.1 = 5.26

CI 95% for 5.26 is 1.77-12.03. We need to divide it by 2 (number of samples). 0.885-6.015. Multiply by 144.41 plus 60. 929-188 years before present.

I didn't find online calculation of CI of Poisson distribution for fractional means but you can use some of approximating formulas

http://www.ine.pt/revstat/pdf/rs120203.pdf

For integers you can use online calculator (try for mean=5 and for mean=6)

http://statpages.org/confint.html

For composite subclades (as E-Z31503) we use more complicated algorithm. There is some kind of sum of Poisson distributions.

Best regards,

Vadim Urasin

YFull Team
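The arithmetic in the response above can be retraced in Python. This is a sketch under stated assumptions, not YFull's script: it computes the exact 95% Poisson limits for the whole numbers 5 and 6 by bisection and then interpolates linearly to 5.26. The interpolation is a guess prompted by the "try for mean=5 and for mean=6" remark, and the function names are made up. The result should land close to the 1.77-12.03 quoted in the response.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def solve(h, lo, hi):
    """Bisection for a root of h between lo and hi (h changes sign once)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if (h(lo) > 0) == (h(mid) > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, alpha=0.05):
    """Exact two-sided Poisson limits for a whole-number count k."""
    top = 10 * (k + 5)  # generous upper search bound
    lower = 0.0 if k == 0 else solve(
        lambda mu: (1 - poisson_cdf(k - 1, mu)) - alpha / 2, 0.0, top)
    upper = solve(lambda mu: poisson_cdf(k, mu) - alpha / 2, 0.0, top)
    return lower, upper


def interpolated_ci(mean):
    """Assumed approach: interpolate between the two neighbouring integers."""
    k = math.floor(mean)
    frac = mean - k
    lo0, hi0 = exact_ci(k)
    lo1, hi1 = exact_ci(k + 1)
    return lo0 + frac * (lo1 - lo0), hi0 + frac * (hi1 - hi0)


observed = 3.16 + 2.1   # observed mutations, from the response
samples = 2
low, high = interpolated_ci(observed)
youngest = low / samples * 144.41 + 60
oldest = high / samples * 144.41 + 60
print(f"CI for {observed:.2f}: {low:.2f}-{high:.2f}")
print(f"TMRCA range: {oldest:.0f}-{youngest:.0f} years before present")
```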