Benchmarking Datasets

Ideal data for benchmarking the tools used to call genomic variants would consist of several perfectly characterised, real genomes. However, such ideal verification data are not easy to come by.

This site aims to serve as a catalogue of data that may be useful for benchmarking. In each case the primary data were generated on a next-generation sequencing platform, most frequently the Illumina HiSeq, and must be available to download (though access may require approval from a data access committee). In addition, there will be some form of data generated on an orthogonal platform from the same biological samples. This may take the form of microarray data, whole-sample sequencing on an alternative NGS platform, or targeted sequencing of specific genomic loci.

Each of these verification approaches has its own limitations - e.g. alternative NGS approaches are potentially susceptible to the same biases that produced incorrect calls in the original data, while microarrays and targeted approaches are limited in the number of loci they can interrogate. Targeted verification approaches are particularly problematic because the choice of target regions was, in most cases, driven by calls made from the original sequencing data. Thus it is unlikely that a novel call generated by the method being benchmarked will have any corresponding verification data. This problem is compounded by the fact that verification status is often published simply as a binary field in a table of calls, and calls that fail verification are often silently excluded.

Nonetheless, these relatively short lists of verified variants are often used to provide some measure of the accuracy of variant-calling software. The intention is for this site to make it easier to locate such data, particularly in fields such as cancer genomics, and to bring together complementary data that were generated in different experiments and might otherwise be missed.
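To make the scoring caveats above concrete, the sketch below shows one way a call set might be compared against a list of verified variants. The tab-separated format and the simple chrom/pos/ref/alt key are illustrative assumptions, not a format prescribed by any particular dataset; the key point is that calls falling outside the verified set must be reported as "unknown" rather than counted as false positives.

```python
# Sketch: scoring a variant caller's output against verified variants.
# Assumes simple tab-separated records (chrom, pos, ref, alt); real
# datasets will typically be VCF and need a proper parser.

def load_sites(path):
    """Read tab-separated variant records into a set of site keys."""
    sites = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, ref, alt = line.rstrip("\n").split("\t")[:4]
            sites.add((chrom, int(pos), ref, alt))
    return sites

def benchmark(called, verified_true, verified_false=frozenset()):
    """Compare a call set with verified sites.

    Only calls that overlap a verified site can be scored; novel calls
    outside the verified set are counted as 'unknown', since targeted
    verification data rarely cover them.
    """
    return {
        "TP": len(called & verified_true),        # verified real variants found
        "FP": len(called & verified_false),       # calls known to be wrong
        "FN": len(verified_true - called),        # verified variants missed
        "unknown": len(called - verified_true - verified_false),
    }
```

Note that if the published list silently drops calls that failed verification (as discussed above), `verified_false` is empty and the false-positive count is systematically understated.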

Each dataset is accompanied by a brief description, in which we have tried to detail the types of samples studied and the technologies used. Where appropriate, we have highlighted the locations of relevant information (e.g. lists of validated variant calls) within a publication, and in some cases simplified forms of such data are provided directly. Any ambiguities, such as whether sites that failed verification are still included in lists of potential variants, are also discussed.