ICML 2022 Workshop

DataPerf

Benchmarking Data for Data-Centric AI

July 22, 2022 in Baltimore, MD

About DataPerf

While building and using datasets has been critical to AI successes, the endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. This has triggered a recent shift in focus from modeling algorithms to the underlying data used to train and evaluate ML models. In this context, DataPerf aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems. It aims to provide clear evaluation and encourage rapid innovation aimed at conferences and workshops such as the NeurIPS Datasets and Benchmarks track. Similar to the MLPerf effort, we’ve brought together the leaders of these motivating efforts to build DataPerf.

DataPerf is a benchmark suite for ML datasets and data-centric algorithms. Historically, ML research has focused primarily on models and simply used the largest existing dataset for common ML tasks, without considering the dataset’s breadth, difficulty, and fidelity to the underlying problem. This under-focus on data has led to a range of issues, from data cascades in real applications to the saturation of existing dataset-driven benchmarks for model quality, which impedes research progress. In order to catalyze increased research focus on data quality and foster data excellence, we created DataPerf: a suite of benchmarks that evaluate the quality of training and test data, and the algorithms for constructing or optimizing such datasets, such as core set selection or labeling error debugging, across a range of common ML tasks such as image classification. We plan to leverage the DataPerf benchmarks through challenges and leaderboards.

DataPerf White Paper

DataPerf Challenges

This workshop builds on a tradition of workshops focusing on the role of data in AI:

Data-Centric AI (DCAI) @ NeurIPS2021
Data Excellence (DEW) @ HCOMP2020
Machine Learning for Data – Automated Creation, Privacy, Bias @ ICML 2021
Economics of Privacy and Data Labor @ ICML 2020
Evaluating Evaluation of AI Systems (Meta-Eval) @ AAAI 2020
Rigorous Evaluation of AI Systems (REAIS) @ HCOMP 2020 and 2019
Subjectivity, Ambiguity and Disagreement (SAD) @ TheWebConf (WWW) 2019 and HCOMP 2018

Important Dates

Submission Deadline

May 9 → May 25, 2022

Notification of Acceptance

June 6, 2022

Workshop

July 22, 2022

FAQ

For questions, please check the FAQ.

Call for Papers

Creating reliable and scalable data pipelines is often the biggest challenge in applying machine learning in practice. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The development of tools to make repeatable and systematic adjustments to datasets has lagged. While dataset quality is a top concern, how to rigorously evaluate data is underexplored. Data challenges include fairness and bias issues in labeled datasets, quality issues in datasets, limitations of benchmarks, reproducibility concerns in machine learning, lack of documentation and replication of data, and unrealistic performance metrics.

This workshop focuses on benchmarking ML datasets and data-centric algorithms, reflecting the recent transition from focusing on modeling to focusing on the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The data-centric AI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems. In order to catalyze increased research focus on data quality and foster data excellence, we created DataPerf: a suite of benchmarks that evaluate the quality of training and test data, and the algorithms for constructing or optimizing such datasets, such as core set selection or labeling error debugging, across a range of common ML tasks such as image classification. We plan to leverage the DataPerf benchmarks through challenges and leaderboards.

This workshop will bring together perspectives from the wider data-centric AI community and will focus on diverse aspects of benchmarking datasets. It will highlight recent advances, case studies, and methodologies for excellence in data engineering. Our goal is to build an active research community focused on discussing the core problems and creating ways to measure progress in machine learning through data quality tasks. Please see our call for papers below to take an active role in shaping that future!

Submission Instructions

We invite submissions in the form of short papers (1-2 pages) and long papers (4 pages), excluding references. Feel free to add appendices as supporting information; however, the review of submissions will focus on the main text, up to the 4-page maximum.
All submissions should address one or more of the topics of interest below.
All submissions should contain author names and be formatted according to the ICML 2022 Formatting Instructions.
Submissions will be single-blind peer-reviewed by the program committee.
Accepted papers will be presented as lightning talks during the workshop.
Submission link: https://easychair.org/conferences/?conf=dataperf2022

If you have any questions about submission, please first check the FAQ link. Contact us by email only if your question is not answered in the FAQ, or if you experience any problems with the submission site. Please email us at Dataperf-ws-org@googlegroups.com

Topics of Interest

The DataPerf workshop invites position papers from researchers and practitioners on topics that include, but are not limited to, the following:

New datasets in areas such as:

Speech, vision, manufacturing, medical, recommendation/personalization, science

Tools & methodologies for accelerating open-source dataset iteration:

Tools that quantify and accelerate the time to source and prepare high-quality data
Tools that ensure that the data is labeled consistently, such as label consensus
Tools that make improving data quality more systematic
Tools that automate the creation of high-quality supervised learning training data from low-quality resources, such as forced alignment in speech recognition
Tools that produce consistent and low-noise data samples, or remove labeling noise or inconsistencies from existing data
Tools for controlling what goes into the dataset and for making high-level edits efficiently to very large datasets, e.g. adding new words, languages, or accents to speech datasets with thousands of hours
Search methods for finding suitably licensed datasets based on public resources
Tools for creating training datasets for small data problems, or for rare classes in the long tail of big data problems
Tools for timely incorporation of feedback from production systems into datasets
Tools for understanding dataset coverage of important classes, and editing them to cover newly identified important cases
Dataset importers that allow easy combination and composition of existing datasets
Dataset exporters that make the data consumable for models and interface with model training and inference systems, such as WebDataset
System architectures and interfaces that enable composition of dataset tools, such as MLCube, Docker, and Airflow

Algorithms for working with limited labeled data and improving label efficiency:

Data selection techniques such as active learning and core-set selection for identifying the most valuable examples to label
Semi-supervised learning, few-shot learning, and weak supervision methods for maximizing the power of limited labeled data
Transfer learning and self-supervised learning approaches for developing powerful representations that can be used for many downstream tasks with limited labeled data
Novelty and drift detection to identify when more data needs to be labeled

Responsible AI development:

Fairness, bias, and diversity evaluation and analysis for datasets and models/algorithms
Tools for green AI hardware-software system design and evaluation
Scalable, reliable training methods and systems
Tools, methodologies, and techniques for private, secure machine learning training
Efforts toward reproducible AI, such as data cards and model cards