Data Collection, Curation, and Labeling (DCCL) for Mining and Learning


Data is a critical and defining component for all data mining and machine learning based approaches. Thus, collecting and curating a proper dataset is of great importance to anyone who wishes to leverage these methods in practice. This workshop focuses on all aspects of how to define a "good" dataset for various learning tasks and also seeks to showcase different techniques for efficiently constructing such a dataset. More specifically, we are targeting an audience that is interested in the areas of efficient (1) data collection, (2) data curation, and (3) data labeling.

Research topics that touch on these three tasks include, for example, crowdsourcing and active learning. Crowdsourcing seeks to build mechanisms and processes for leveraging large groups of human participants to efficiently and accurately collect data for a particular task (often for building supervised training sets). Active learning aims to make the labeling process required for supervised learning more efficient by intelligently and selectively deciding which data points to label. Of course, these areas of research are not all-encompassing, and we welcome submissions from other areas such as fairness in ML, semi-supervised learning, and reinforcement learning. One approach to fairness focuses on creating methods to better curate data in order to address historical biases; semi-supervised learning directly tailors solutions to heterogeneous data that is potentially both selectively labeled and unlabeled; and reinforcement learning intelligently interacts with the environment to generate data.

We wish to attract both practitioners and theoreticians who work in these fields in an effort to foster additional collaboration across groups. We strive to identify the core hurdles in these topics and to develop solutions that are both practical and theoretically sound. For example, one positive outcome would be the development and empirical study of novel practical settings that inspire novel theoretical analyses. In addition to submissions that solve a particular problem within this topic, we also strongly encourage submissions that present well-developed and important open questions.


Despite the growing costs of collecting and labeling vast amounts of data, efforts to tackle this problem are spread across several different research areas, often with a large disconnect between theory and practice. The goal of this workshop is to bring together researchers across these various areas in order to identify points of overlap and foster collaboration, as well as to bridge the gap between theory and practice. We expect that the large crowd of KDD attendees, with their relatively diverse set of research areas, will also provide a great audience for this workshop, and we anticipate that collaborations started at this workshop will result in novel research.

Location and Dates

This workshop will take place on August 5th, 2019, in Anchorage, Alaska.

Venue: Summit 4 - Ground Level, Egan