A non-exhaustive review of medical and biology datasets/benchmarks (by Magali Richard).
An expanded evaluation of protein function prediction methods shows an improvement in accuracy
Jiang et al, Genome Biology, 2016
"Availability of data and materials :
Data The benchmark data and the predictions are available on FigShare https://dx.doi.org/10.6084/m9.figshare.2059944.v1. Note that according to CAFA rules, all but the top-ten methods are anonymized. However, methods are uniquely identified by a code number, so use of the data for further analysis is possible.
Software The code used in this study is available at https://github.com/yuxjiang/CAFA2. "
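For illustration only (not from the paper): the FigShare deposit can also be listed programmatically via FigShare's public v2 API. The article id below is inferred from the DOI above; treat the endpoint and field names as assumptions to verify.

import requests

# FigShare article id inferred from the DOI 10.6084/m9.figshare.2059944.v1
ARTICLE_ID = 2059944

resp = requests.get(f"https://api.figshare.com/v2/articles/{ARTICLE_ID}", timeout=30)
resp.raise_for_status()
# Print each deposited file with its direct download URL
for f in resp.json().get("files", []):
    print(f.get("name"), f.get("download_url"))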
"CASP (Critical Assessment of Structure Prediction) is a community wide experiment to determine and advance the state of the art in modeling protein structure from amino acid sequence. Every two years, participants are invited to submit models for a set of proteins for which the experimental structures are not yet public. Independent assessors then compare the models with experiment. Assessments and results are published in a special issue of the journal PROTEINS. In the most recent CASP round, CASP14, nearly 100 groups from around the world submitted more than 67,000 models on 90 modeling targets (see Critical assessment of methods of protein structure prediction (CASP) - Round XIII). "
" CAPRI (Critical Assessment of PRedicted Interactions) is a community wide initiative for testing computational algorithms in blind predictions of experimentally determined 3D structures of protein complexes, the “targets”, provided to CAPRI prior to publication.
This page provides links to various software tools, databases and web servers which might be useful to CAPRI predictors."
Multiple Myeloma DREAM Challenge reveals epigenetic regulator PHF19 as marker of aggressive disease
Mason et al, Leukemia, 2020
" Challenge model submission architecture: training datasets are fully available to Challenge participants , while blinded validation datasets are sequestered in the cloud.
The Challenge includes five microarray and three RNA-seq expression datasets, annotated with clinical characteristics such as gender, age, International Staging System stage (ISS), and cytogenetics (Table 1) [9,10,11,12,13,14]. In all datasets, expression assays were performed on CD138+PCs isolated from bone marrow aspirates or blood of newly diagnosed patients. Data were split into training and validation datasets. "
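Purely as a sketch of what a participant's submission might look like under this architecture (all file names, column names and the risk definition below are hypothetical, not the Challenge's actual schema): train on the open data, then score whatever validation table the sequestered environment provides.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training inputs: expression matrix (genes x patients) and a clinical table
expr = pd.read_csv("training_expression.csv", index_col=0)
clin = pd.read_csv("training_clinical.csv", index_col=0)   # e.g. age, ISS stage, high-risk label

X = expr.T.join(clin[["age", "iss_stage"]])                # features = genes + clinical covariates
y = clin["high_risk"]                                      # e.g. early progression, however the Challenge defines it

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# In the real Challenge the validation data stay sequestered in the cloud; a container
# would be handed a path like this at scoring time and must write predictions to disk.
val = pd.read_csv("validation_features.csv", index_col=0)
pd.Series(model.predict_proba(val[X.columns])[:, 1],
          index=val.index, name="predicted_risk").to_csv("predictions.csv")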
Salcedo et al, Nature Biotechnology, 2020
" Code availability : BAMSurgeon is available at: https://github.com/adamewing/bamsurgeon. The framework for subclonal mutation simulation is available at http://search.cpan.org/~boutroslb/NGS-Tools-BAMSurgeon-v1.0.0/. The PhaseTools BAM phasing toolkit is available at https://github.com/mateidavid/phase-tools. Scripts providing the complete scoring harness are available at: https://github.com/asalcedo31/SMC-Het_Scoring/smc_het_eval.
Data availability: Sequence files are available at EGA under study accession no. EGAD00001003971. "
Creason et al, Cell Systems, 2021
" The data and various quality estimates are available on Synapse(https://www.synapse.org/Synapse:syn22344794). "
Decamps et al, BMC Bioinformatics, 2021
" DECONbench is hosted on the open source Codalab competition platform. It is freely available at: https://competitions.codalab.org/competitions/27453. Further documentation (online demo) is available at: https://deconbench.github.io/. "
A set of machine learning baselines and datasets reflective of cancer-related problems.
Bohnert et al, PLOS ONE, 2017
" Data Availability .
The synthetic data sets are available from https://doi.org/10.5281/zenodo.556347. All other relevant data is provided in the Supporting Information files. "
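For illustration (not from the paper), the Zenodo deposit can be listed via Zenodo's public REST API; the record id is inferred from the DOI, and the JSON field names may differ across API versions.

import requests

# Zenodo record id inferred from the DOI 10.5281/zenodo.556347
rec = requests.get("https://zenodo.org/api/records/556347", timeout=30)
rec.raise_for_status()
# Print each deposited file name with its download link (fields as in the current API)
for f in rec.json().get("files", []):
    print(f.get("key"), f.get("links", {}).get("self"))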
Hunt et al, Genome Biology, 2014
" Data Availability.
The ENA accession numbers of the P. falciparum reads are ERR034295 and ERR163027-9 and the reference genome can be downloaded from ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/falciparum/3D7/3D7.latest_version/version3/Pf3D7_v3.fasta.gz. All wrapper and analysis scripts are freely available from https://github.com/martinghunt/Scaffolder-evaluation. The simulated data can be generated using those scripts. The remaining data were all from the GAGE project and can be downloaded from http://gage.cbcb.umd.edu/data/index.html."
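As a hedged sketch (not from the paper), the FASTQ locations of the listed runs can be looked up through the ENA Portal API; the same query works for the other accessions.

import requests

# Ask ENA for the FASTQ FTP paths of one of the P. falciparum runs
params = {
    "accession": "ERR034295",
    "result": "read_run",
    "fields": "run_accession,fastq_ftp",
    "format": "tsv",
}
r = requests.get("https://www.ebi.ac.uk/ena/portal/api/filereport", params=params, timeout=30)
r.raise_for_status()
print(r.text)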
Sczyrba et al, Nature Methods, 2017
" Data availability.
A Life Sciences Reporting Summary for this paper is available. The plasmid assemblies, raw data and metadata have been deposited in the European Nucleotide Archive (ENA) under accession number PRJEB20380. The challenge and toy data sets including the gold standard, the assembled genomes used to generate the benchmark data sets (Supplementary Table 10), NCBI and ARB public reference sequence collections without the benchmark data and the NCBI taxonomy version used for taxonomic binning and profiling are available in GigaDB under data set identifier (100344) and on the CAMI analysis site for download and within the benchmarking platform (https://data.cami-challenge.org/participate). Further information on the CAMI challenge, results and scripts is provided at https://github.com/CAMI-challenge/. Supplementary Tables 2 and 9 specify the Docker Hub locations of bioboxes for the evaluated programs and used metrics. Source data for Figures 1, 2 and 3 are available online. "
Łabaj et al, Biology Direct, 2016
" Availability of supporting data.
This study builds on the main synthetic benchmark data set of the SEQC consortium [doi:10.1038/nbt.2957]. The datasets analysed during the current study are available in the GEO repository with series accession number GSE47792. "
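If helpful, the series metadata can be pulled with the GEOparse Python package; a minimal sketch (the full GSE47792 series is large, so this only downloads the series SOFT file and prints a few sample titles):

import GEOparse

gse = GEOparse.get_GEO(geo="GSE47792", destdir="./geo_cache")
# Print the titles of the first few samples in the series
for gsm_name, gsm in list(gse.gsms.items())[:5]:
    print(gsm_name, gsm.metadata.get("title"))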
For more information on genomic challenges and benchmarks, you can check out:
The CAMDA challenges, or the following article: "Systematic benchmarking of omics computational tools"
Email from Gustavo:
You can take a look at this review paper (now a little dated) that we wrote a few years ago: https://pubmed.ncbi.nlm.nih.gov/27418159/. More recently there was a review on benchmarking of omics computational tools: https://pubmed.ncbi.nlm.nih.gov/30918265/. There are more, but those are a good start.
Probably the most influential benchmarking exercise to date in the field of computational biology has been the structure prediction competition CASP (Critical Assessment of Structure Prediction), which in its 14th iteration allowed AlphaFold to shine in all its glory: https://pubmed.ncbi.nlm.nih.gov/34599769/. However, it is not a true benchmarking dataset in the style of ImageNet, as the gold standard for CASP is only revealed after each biennial competition.
There was an interesting challenge we ran on screening mammograms, which offered an incredibly interesting dataset (around 650K images) for training and testing, plus an independent test set. The paper is here: https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2761795. I can provide a ppt with a few slides on this.
There is also a society called the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society, which amongst other things organizes challenges. See here: http://www.miccai.org/special-interest-groups/challenges/
There is an effort called Critical Assessment of Genome Interpretation: https://genomeinterpretation.org that organizes challenges in genomics.
About benchmarking in the healthcare space, I am attaching a recent paper published in Nature Machine Intelligence which I am not sure you saw (https://www.nature.com/articles/s42256-022-00559-4). It’s written by colleagues of yours from Google Research London (I know Subhrajit Roy, the last author, as he was at IBM before). It’s about developing end-to-end benchmarks in healthcare. I am most interested in the last section about Clinical Deployment. I think that is the ultimate bottleneck…
And while we are at it, I am not sure you saw a paper I published last year. It’s not related to benchmarking, but to ensemble learning. I would love to hear your opinion of it, if you have a chance to glance at it: https://www.pnas.org/doi/10.1073/pnas.2100761118
From: jake.albrecht@sagebase.org
There are quite a few datasets released for benchmarking, notably MIMIC for clinical records and the NCBI Disease corpus of PubMed abstracts that led to BioBERT. There are other datasets like Tox21 for drug discovery, which elevated graph convolutional network models for representing chemicals, and the BraTS dataset for MRI image segmentation, which showed the capabilities of U-Net models for tumor measurements.
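As a hedged illustration of the Tox21 point (not from the email), DeepChem's MoleculeNet loader exposes Tox21 with a graph featurization, roughly the setup that popularized graph convolutional models for chemistry; package versions may change the exact API.

import deepchem as dc

# Load Tox21 with a graph featurization and train a small graph convolutional model
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="GraphConv")
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="classification")
model.fit(train, nb_epoch=10)
print(model.evaluate(test, [dc.metrics.Metric(dc.metrics.roc_auc_score)], transformers))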