Yfull.com and SNP Aging

YFull.com has analyzed a number of Phelps/Pond M44 Big Y tests.

Considerably more tests from other M44 surnames were also analyzed.

Topics on this page:

  • What is YFull's age estimation methodology?
    • TMRCA and "formed age"
  • Yfull.com's definitions and methodologies relating to the subclades

Yfull.com provides an experimental E tree HERE and identifies member results by a YFnumber. The M44 testing is ongoing and is being discussed in the related M44/E1a1 FTDNA forum.

What is YFull's age estimation methodology? See the official YFull FAQ (revised).

A: YFull uses a methodology based on the research and analysis discussed in Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data by Adamov, Guryanov, Korzhavin, Tagankin, Urasin (2015).

The methodology is reflected in the Age Estimation table for each analyzed sample and in the subclade age "info" pop-up tables linked to the YTree. [Note: VCF files are not used for age estimation purposes.]

The first step is to select and count reliable derived Known and Novel SNPs for a sample. The number of counted SNPs appears in both tables.

The following five criteria are used to select reliable SNPs:

1. The coordinates of the SNPs must fall within the combBED regions designed to select X-degenerate segments. The combBED area borders were formed by mutual overlapping BED files taken from the work of Poznik et al. (2013) (total length of 10.45 Mbp) and by the generalized BigY BED file (11.38 Mbp long), published in the BigY White Paper (2014). The result was 857 continuous segments of the Y-chromosome with a total length of 8,473,821 base pairs.

2. Insertions and deletions (called "Indels") are excluded, as are multiple nucleotide polymorphisms (SNPs with more than one base position).

3. Variants detected in more than five different "localizations" are excluded. A "localization" is a group of samples from the YFull database that belong to the same subclade and are designated as having the derived allele. In some cases, the same derived variants may be found in different subclades or different haplogroups because of mapping errors or because the standard reference sequence is based mainly on haplogroup R1b data and to a lesser extent on haplogroup G data. This causes some variants in some haplogroups to be ancestral instead of derived. YFull established the "five different localizations" criterion empirically; it is a soft criterion but is believed to be effective.

4. SNPs with only one or two "reads" are excluded.

5. SNPs are excluded if the "read quality" is less than 90%. Quality is determined pursuant to YFull's proprietary SNP rating system. See the FAQ How does YFull determine the quality ratings for my Known SNPs and for my Novel SNPs?
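The five criteria above can be read as a single filter applied to each candidate variant. The following Python sketch is only an illustration of that reading, not YFull's code: the Variant fields, the function names, and the representation of combBED regions as sorted, non-overlapping, inclusive (start, end) pairs are all assumptions made for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record for one candidate variant (field names are assumptions)."""
    position: int        # Y-chromosome coordinate
    is_snp: bool         # False for indels and multiple nucleotide polymorphisms
    localizations: int   # number of different "localizations" showing the variant
    reads: int           # number of reads covering the position
    quality: float       # read quality as a percentage (0-100)


def in_regions(position, regions):
    """True if position falls inside one of the sorted, inclusive (start, end) regions."""
    starts = [start for start, _ in regions]
    i = bisect_right(starts, position) - 1
    return i >= 0 and regions[i][0] <= position <= regions[i][1]


def is_reliable(variant, regions):
    """Apply the five selection criteria in the order listed above."""
    return (
        in_regions(variant.position, regions)  # 1. inside the combBED regions
        and variant.is_snp                     # 2. no indels or multi-base variants
        and variant.localizations <= 5         # 3. not in more than five localizations
        and variant.reads >= 3                 # 4. not only one or two reads
        and variant.quality >= 90.0            # 5. read quality not below 90%
    )
```

For example, with two made-up regions `[(100, 200), (500, 600)]`, a variant at position 150 with three reads and 95% quality passes, while the same variant at position 300 is rejected by the first criterion.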

The Age Estimation table for each sample provides a high level of detail about the application of the selection criteria. Reliable Known and Novel SNPs are listed in the "+Known SNPS" and "+Novels" columns of the table, and SNPs not selected are listed in the "x Known SNPs" and "x Novels" columns, with details related to the five criteria.

The second step of the sample age determination methodology is explained in the YTree "info" pop-up tables for the YTree subclades. For each sample in a table, two formulas are applied to the number of SNPs for the sample. The first formula corrects the SNP count to an assumed (or corrected) count based on the combBED bp coverage area, and the second formula establishes the age of a sample based on the corrected count. The second formula uses an assumed mutation rate of one SNP per 144.41 years (0.8178 × 10^-9 per base pair per year), which is the average of the mutation rates of the ancient Anzick-1 sample and of a group of known genealogies, and an assumed age of 60 years for living providers of YFull samples.
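A minimal sketch of the two formulas may help. The text does not spell out either formula, so the forms below are assumptions: the first scales the counted SNPs by the ratio of the full combBED length to the base pairs actually covered in the sample, and the second multiplies the corrected count by 144.41 years and adds the 60-year offset. The function names are hypothetical, and the combBED length is the figure quoted under the first selection criterion.

```python
COMBBED_BP = 8_473_821   # total combBED length quoted in the first criterion
YEARS_PER_SNP = 144.41   # assumed years per SNP, as quoted in the text
PROVIDER_AGE = 60        # assumed age of living sample providers, in years


def corrected_snp_count(snp_count, coverage_bp, combbed_bp=COMBBED_BP):
    """Assumed first formula: scale the count up to the full combBED area."""
    if coverage_bp <= 0:
        raise ValueError("coverage_bp must be positive")
    return snp_count / coverage_bp * combbed_bp


def sample_age(corrected_count):
    """Assumed second formula: years before present for one sample."""
    return corrected_count * YEARS_PER_SNP + PROVIDER_AGE
```

Under these assumptions, a sample with 20 counted SNPs and full combBED coverage has a corrected count of 20 and an age of 20 × 144.41 + 60 = 2948.2 years.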

See also: How does YFull determine "formed" age and "TMRCA", and the related confidence intervals, of the subclades in its Experimental YTree?

Yfull.com's definitions and methodologies relating to the subclades

A: The following definitions and methodologies relate to the subclades in the Experimental YTree:

Subclade name: Each subclade name is highlighted in green.

SNPs "defining" a subclade: These are listed to the right of the subclade name (by SNP name, with additional SNP names in the grey-shaded pop-up: "X (a number) SNPs"). The SNP list for a subclade may change in the future as more samples are added to the YFull database and new branches are added.

Subclade "formed" age: The TMRCA (time to most recent common ancestor) of a subclade is used as the "formed" age of each branch of the subclade. Stated otherwise, the formed age of a branch is the same as the TMRCA of the "parent" subclade of that branch.
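The relationship between "formed" age and TMRCA can be pictured as a lookup on the parent. The sketch below is illustrative only; the Subclade class, its fields, and the example names and ages are all invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subclade:
    """Hypothetical tree node: a name, a TMRCA in ybp, and a link to the parent."""
    name: str
    tmrca: int
    parent: Optional["Subclade"] = None


def formed_age(subclade):
    """The formed age of a branch is the TMRCA of its parent subclade."""
    if subclade.parent is None:
        return None  # the root has no parent in this sketch
    return subclade.parent.tmrca


# Invented example values, not taken from the YTree.
parent = Subclade("Parent", tmrca=4200)
child = Subclade("Child", tmrca=3100, parent=parent)
```

Here `formed_age(child)` returns 4200, the TMRCA of the parent, while the child's own TMRCA of 3100 becomes the formed age of any branch beneath it.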

Determination of TMRCA for a subclade: The general rule is that the TMRCA of a subclade is equal to the average age (after rounding) shown in the yellow bar of the YTree "info" pop-up table for the subclade. Where the general rule is not followed, YFull adds an explanatory note at the bottom of the table. For an example, see the table for the I1-Z63 subclade.

Rounding rules: An age of less than 500 ybp is rounded to the nearest "25" (e.g., 381 becomes 375); an age of 500 to 1999 is rounded to the nearest "50" (e.g., 1477 becomes 1500); and an age of 2000 or more is rounded to the nearest "100" (e.g., 3160 becomes 3200).
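The rounding rules translate directly into a small function. This is a sketch assuming whole-number ages; the function name is hypothetical, and the handling of exact ties (rounded up here) is an assumption, since the rules do not address it.

```python
def round_age(age_ybp: int) -> int:
    """Round an age in ybp to the nearest 25, 50, or 100 depending on its size."""
    if age_ybp < 500:
        step = 25
    elif age_ybp < 2000:
        step = 50
    else:
        step = 100
    # Integer round-to-nearest; exact ties round up (an assumption).
    return (age_ybp + step // 2) // step * step
```

Applied to the examples in the rules, `round_age(381)` gives 375, `round_age(1477)` gives 1500, and `round_age(3160)` gives 3200.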

Formed CI xx% yyyy <-> zzzz ybp, TMRCA CI aa% bbbb <-> cccc ybp: CI means "Confidence Interval". A confidence interval is an indicator of the precision of the YFull "formed" age and "TMRCA" data in the Experimental YTree. YFull developed its own statistical analysis computer script in order to calculate its confidence intervals.

Yellow Bar in "info" pop-up table: The "ybp" (years before the present) for the subclade is the average of the ages of the branches and samples (if any) highlighted in green in the Branch ID column, as shown in the yellow bar "Formula".
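The yellow-bar calculation amounts to averaging only the highlighted rows. The sketch below assumes each table row can be represented as a (label, age, highlighted) tuple; that representation, the function name, and the example rows are invented for illustration.

```python
from statistics import mean


def yellow_bar_age(rows):
    """Average the ages of the rows highlighted in green.

    rows: iterable of (label, age_ybp, highlighted) tuples (assumed layout).
    """
    ages = [age for _, age, highlighted in rows if highlighted]
    if not ages:
        raise ValueError("no highlighted rows to average")
    return mean(ages)


# Invented example rows, not taken from a YTree table.
example_rows = [
    ("branch-1", 3100, True),
    ("branch-2", 3300, True),
    ("sample-3", 3200, True),
    ("sample-4", 9000, False),  # not highlighted, so not averaged
]
```

For these rows the average is 3200; the non-highlighted row does not affect the result.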

Number of SNPs column in "info" pop-up table: For a branch, the number in this column is the average of the numbers reported for the samples in the branch. For a sample, the number in this column is the total of the Known SNPs and Novel SNPs located between the subclade and the present. These SNPs are identified in the Age Estimation table.

Other columns in "info" pop-up table: Branch numbers are averages of the numbers given for the samples in the branch. The two formulas used in the table are discussed in the FAQ: What is YFull's age estimation methodology?

Last updated on Nov. 18, 2015.