ICML 2022 Workshop

DataPerf

Benchmarking Data for Data-Centric AI

July 22, 2022 in Baltimore, MD


About DataPerf

While building and using datasets has been critical to AI successes, the endeavor is often artisanal: painstaking and expensive. The community lacks the high-productivity, efficient open data engineering tools needed to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. This has triggered a recent shift in focus from modeling algorithms to the underlying data used to train and evaluate ML models. In this context, DataPerf aims to address the lack of tooling, best practices, and infrastructure for managing data in modern ML systems. It aims to provide clear evaluation criteria and to encourage rapid innovation, complementing conferences and workshops such as the NeurIPS Datasets and Benchmarks track. As with the MLPerf effort, we have brought together the leaders of these motivating efforts to build DataPerf.


DataPerf is a benchmark suite for ML datasets and data-centric algorithms. Historically, ML research has focused primarily on models, simply using the largest existing dataset for each common ML task without considering the dataset’s breadth, difficulty, or fidelity to the underlying problem. This underemphasis on data has led to a range of issues, from data cascades in real applications to the saturation of existing dataset-driven benchmarks for model quality, which impedes research progress. To catalyze increased research focus on data quality and foster data excellence, we created DataPerf: a suite of benchmarks that evaluate the quality of training and test data, as well as the algorithms for constructing or optimizing such datasets (e.g., coreset selection or labeling-error debugging), across a range of common ML tasks such as image classification. We plan to leverage the DataPerf benchmarks through challenges and leaderboards.
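To make this concrete, below is a minimal sketch (not the official DataPerf harness) of how a data-centric benchmark can score a training-set selection algorithm: the model and held-out evaluation are fixed, and only the submitted subset of training data varies. The dataset, budget, and baseline selector here are illustrative, and scikit-learn is assumed.

```python
# Minimal sketch of a data-centric benchmark task: score a training-set
# selection algorithm with a *fixed* model and evaluation set, so the
# data, not the model, is the variable. Illustrative only; not the
# official DataPerf harness. Assumes numpy and scikit-learn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def score_selection(select_fn, X_pool, y_pool, X_test, y_test, budget=200):
    """Train the fixed model on the subset chosen by select_fn and
    report held-out accuracy."""
    idx = select_fn(X_pool, y_pool, budget)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_pool[idx], y_pool[idx])
    return model.score(X_test, y_test)

def random_selector(X, y, budget):
    """Baseline: a uniform random subset of the training pool."""
    rng = np.random.default_rng(0)
    return rng.choice(len(X), size=budget, replace=False)

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, random_state=0)
print("random-subset accuracy:",
      score_selection(random_selector, X_pool, y_pool, X_test, y_test))
```

A submission would replace random_selector with, for example, a coreset method; higher held-out accuracy at the same budget indicates a better-chosen training subset.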

This workshop builds on a series of prior workshops focusing on the role of data in AI.


Important Dates

Submission Deadline

May 25, 2022 (extended from May 9)

Notification of Acceptance

June 6, 2022

Workshop

July 22, 2022

FAQ

For questions, please check the FAQ.


Call for Papers

Creating reliable and scalable data pipelines is often the biggest challenge in applying machine learning in practice. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovation efforts have mostly focused on models and code; the development of tools to make repeatable and systematic adjustments to datasets has lagged. While dataset quality is a top concern, how to rigorously evaluate data remains underexplored. Common data challenges include: fairness and bias issues in labeled datasets, quality issues in datasets, limitations of benchmarks, reproducibility concerns in machine learning, lack of documentation and replication of data, and unrealistic performance metrics.


Benchmarking ML datasets and data-centric algorithms. The field has recently begun transitioning from a focus on modeling to a focus on the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The data-centric AI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems. To catalyze increased research focus on data quality and foster data excellence, we created DataPerf: a suite of benchmarks that evaluate the quality of training and test data, as well as algorithms for constructing or optimizing such datasets (e.g., coreset selection or labeling-error debugging), across common ML tasks such as image classification. We plan to leverage the DataPerf benchmarks through challenges and leaderboards.
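As an illustration of the labeling-error debugging task mentioned above, the sketch below flags training examples whose out-of-fold predicted probability for their given label is low, in the spirit of confident learning. This is a simple heuristic for illustration, not DataPerf's official debugging benchmark; scikit-learn is assumed and the label noise is injected synthetically.

```python
# Minimal labeling-error debugging sketch: flag examples whose
# cross-validated confidence in their *given* label is low.
# Illustrative heuristic only; assumes numpy and scikit-learn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# Synthetically corrupt 50 labels so we can check what gets flagged.
rng = np.random.default_rng(0)
flipped = rng.choice(len(y), size=50, replace=False)
y_noisy = y.copy()
y_noisy[flipped] = (y_noisy[flipped] + 1) % 10

# Out-of-fold probabilities: each example is scored by a model that
# never saw its (possibly wrong) label during training.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")
self_confidence = probs[np.arange(len(y_noisy)), y_noisy]

# The examples least confident in their given label are label-error suspects.
suspects = np.argsort(self_confidence)[:50]
print("truly corrupted among 50 flagged:",
      np.intersect1d(suspects, flipped).size)
```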


This workshop will bring together perspectives from the broader data-centric AI community and will focus on diverse aspects of benchmarking datasets. It will highlight recent advances, case studies, and methodologies for excellence in data engineering. Our goal is to build an active research community focused on discussing the core problems and creating ways to measure progress in machine learning through data-quality tasks. Please see our call for papers below to take an active role in shaping that future!

Submission Instructions

If you have any questions about submission, please first check the FAQ. Contact us by email only if your question is not answered there, or if you experience problems with the submission site: Dataperf-ws-org@googlegroups.com.

Topics of Interest

The DataPerf workshop invites position papers from researchers and practitioners on topics that include, but are not limited to, the following:


New datasets

Tools & methodologies for accelerating open-source dataset iteration

Algorithms for working with limited labeled data and improving label efficiency

Responsible AI development


Keynote Talks

Andrew Ng

Stanford & Landing AI

Besmira Nushi

Microsoft

Invited Talks

Matei Zaharia

Stanford & Databricks

Mona Diab

Meta

Kurt Bollacker

The Long Now Foundation

Jordi Pont-Tuset

Google

Baharan Mirzasoleiman

UCLA

Ehsan Valavi

Harvard

Yiling Chen

Harvard

Sharon Yixuan Li

UW-Madison

Xavier Bouthillier

Mila, Université de Montréal


Organizing Committee

Newsha Ardalani

Facebook AI Research

Lora Aroyo

Google

Colby Banbury

Harvard University

Greg Diamos

Landing AI

Tzu-Sheng Kuo

Carnegie Mellon University

Peter Mattson

Google

Mark Mazumder

Harvard University

Praveen Paritosh

Google

William Gaviria Rojas

Coactive AI

James Zou

Stanford University & AWS

Vijay Janapa Reddi

Harvard University

Carole-Jean Wu

Facebook AI Research

Cody Coleman

Coactive AI


Program Committee

Shelby Heinecke

Cynthia Freeman

Sarah Luger

Tristan Thrush

Urmish Thakker

Matt Lease

Panos Ipeirotis

José Hernández-Orallo

Ka Wong

Kurt Bollacker

Ian Beaver

David Kanter

Bilge Acun

Bojan Karlaš

Chris Welty

Anish Athalye

William Gaviria Rojas

Praveen Paritosh

Mark Mazumder

Greg Diamos

James Zou

Colby Banbury

Newsha Ardalani

Carole-Jean Wu

Tzu-Sheng Kuo

Lora Aroyo


The Venue

The Baltimore Convention Center

1 W Pratt St, Baltimore, MD 21201

We look forward to seeing you here!


Contact

For general inquiries, please email Dataperf-ws-org@googlegroups.com.