ECMLPKDD Workshop on Automating Data Science (ADS2021)
Virtual, 17 September 2021
Progress in data science automation can have important implications for the democratisation of data science and related disciplines such as machine learning and statistics. This is especially critical as the diversity of data and techniques in these areas is growing rapidly, and data scientists are in urgent need of more powerful tools to help them in the data science process. While there has been significant progress in the core stages of the process, exemplified by the success of AutoML, other areas, such as data understanding, data preparation, and deployment, still need fundamental research breakthroughs to make a significant impact on data science automation overall. The workshop will cover all areas of data science automation, but will especially welcome research that focuses on steps before and after modelling, deals with “messy data”, or extends the AutoML paradigm beyond supervised tasks.
The program will consist of invited talks, a panel discussion, contributed talks and spotlights, and a poster session. As the meeting will be virtual, we will encourage interaction via Q&A after talks, in the panel discussion, and in the poster session. The workshop is the third in a series, following on from Dagstuhl (2018) and ECML-PKDD (2019).
This ECMLPKDD workshop aims to bring together researchers from all areas concerned with data science in order to study whether, to what extent, and how data science can be automated. It will focus on the following Data Science topics:
Automating data wrangling
Data integration via AI techniques (e.g., NLP)
Merging the preparation of data into the statistical learning
Handling missing and anomalous values semi-automatically
Using NLP for generating explanations and reports
Incorporating domain knowledge into the automation of data science
Semi-automated machine learning
Learning with non-normalized data
Impact of data science automation on the work of data scientists
Schedule (all times Bilbao, CEST)
11:10-11:55 Invited talk: Neil Lawrence (University of Cambridge): "Access, Assess and Address: A Pipeline for (Automated?) Data Science" (see abstract below)
11:55-12:40 Contributed session 1 (3 orals) (15 mins each incl. handover & questions)
"Can language models automate data wrangling?" Gonzalo Jaimovitch-Lopez, Cesar Ferri, Jose Hernandez-Orallo, Fernando Martínez-Plumed and María José Ramírez-Quintana
"ptype-cat: Inferring the Type and Values of Categorical Variables". Taha Ceritli and Christopher K. I. Williams
"NeuMiss network classifiers: deep learning for classifying with missing values". Alexandre Perez-Lebel, Marine Le Morvan and Gaël Varoquaux
12:40-13:25 Invited talk: Luc De Raedt (KU Leuven): "On the automation of data science" (see abstract below)
13:25-14:30 Lunch break
14:30-15:15 Invited talk: Madeleine Udell (Cornell University): "Structured Models for Automated Machine Learning" (see abstract below)
15:15-16:00 Panel discussion. "Messy data: More wrangling and cleaning, or more flexible modelling techniques?" with Michael Betancourt (Stan developer), Zachary Lipton (CMU) and Madeleine Udell (Cornell). (see description below)
16:00-16:50 Contributed session 2 (2 orals, 6 spotlights) (orals: 15 mins each incl. handover & questions) (spotlights: 3 mins each incl. handover - no questions)
"Automated Computational Energy Minimization of ML Algorithms using Constrained Bayesian Optimization". Pallavi Mitra and Felix Biessmann
"SpLyCI: Integrating Spreadsheets by Recognising and Solving Layout Constraints". Dirk Petrus Coetsee, Rodney Stephen Kroon, Mcelory Hoffmann and Luc De Raedt
"Identifying the Units of Measurement in Tabular Data". Taha Ceritli and Christopher K. I. Williams
"Automated architecture search with model complexity control". Konstantin Yakovlev, Olga Grebenkova, Oleg Bakhteev and Vadim Strijov
"Automatic Componentwise Boosting: An Interpretable AutoML System". Stefan Coors, Daniel Schalk, Bernd Bischl and David Ruegamer
"Democratizing Constraint Satisfaction Problems through Machine Learning". Adem Kikaj, Mohit Kumar, Gust Verbruggen, Samuel Kolb, Luc De Raedt and Clement Gautrais
"From strings to data science: a practical framework for automated string handling". John W. van Lith and Joaquin Vanschoren
"Automated Machine Learning, Bounded Rationality, and Rational Metareasoning". Eyke Hüllermeier, Felix Mohr, Alexander Tornede and Marcel Wever
17:00-18:00 Poster session for all contributed papers
18:00-18:45 Invited talk: Joe Hellerstein (UC Berkeley and Trifacta): "From Wrangler to Trifacta and Beyond: Human/AI Interfaces for Data Science and Engineering" (see abstract below)
18:45-19:00 Wrap up
About the Panel
Panel discussion: Messy data: More wrangling and cleaning, or more flexible modelling techniques?
Some modelling techniques are very powerful but require highly curated data (no missing values, full numerization, scaling, outlier elimination, consistency, data enhancement, etc.), while others are more versatile, dealing with low-quality data but still producing reasonably good models. In some areas, such as NLP, some architectures (e.g., transformers) are able to deal with data that is noisy and non-structured, and still display some good functionality (although limited robustness). In areas of machine learning dealing with images, audio, tabular data or multimodal data, what is the best trade-off for automation: more data wrangling tools or more flexible models? Does this trade-off depend on the desired quality of the models and the expertise of the data scientists? We will ask panelists and attendees to discuss the pros and cons of the two suggested approaches (with emphasis on the automation of data wrangling and cleaning, or on the automation of more flexible modelling techniques).
List of panelists: Michael Betancourt (Stan developer), Zachary Lipton (CMU), Madeleine Udell (Cornell).
Neil Lawrence (Cambridge): Access, Assess and Address: A Pipeline for (Automated?) Data Science
Data Science is an emerging discipline that is being promoted as a universal panacea for the world's desire to make better informed decisions based on the wealth of data that is available in our modern interconnected society. In practice, data science projects often find it difficult to deliver. In this talk we will review efforts to drive data-informed decision making in real world examples, e.g., the UK's early Covid19 pandemic response. We will introduce a framework for categorising the stages and challenges of the data science pipeline and relate it to the challenges we see when giving data driven answers to real world questions. We will speculate on where automation may be able to help but emphasise that automation in this landscape is challenging when so many issues remain for getting humans to do the job well.
Luc De Raedt (KU Leuven): On the automation of data science
Inspired by recent successes towards automating highly complex jobs like automatic programming and scientific experimentation, I want to automate the task of the data scientist when developing intelligent systems. In this talk, I shall introduce some of the involved challenges and some possible approaches and tools for automating data science.
More specifically, I shall discuss how automated data wrangling approaches can be used for pre-processing and how both predictive and descriptive models can in principle be combined to automatically complete spreadsheets and relational databases. I will argue that autocompleting spreadsheets is a simple yet highly challenging setting for the automation of data science. Special attention will be given towards the induction of constraints in spreadsheets and in an operations research context.
Madeleine Udell (Cornell): Structured Models for Automated Machine Learning
Automated machine learning (AutoML) seeks algorithmic methods for finding the best machine learning pipeline and hyperparameters to fit a new dataset. The complexity of this problem is astounding: viewed as an optimization problem, it entails search over an exponentially large space, with discrete and continuous variables. An efficient solution requires a strong structural prior on the optimization landscape of this problem.
In this talk, we survey some of the most powerful techniques for AutoML on tabular datasets. We will focus in particular on techniques for meta-learning: how to quickly learn good models on a new dataset given good models for a large collection of datasets. We will see that remarkably simple structural priors, such as the low-dimensional structure used by the AutoML method Oboe, produce state-of-the-art results. The success of these simple models suggests that AutoML may be simpler than was previously understood.
Joe Hellerstein (UC Berkeley & Trifacta): From Wrangler to Trifacta and Beyond: Human/AI Interfaces for Data Science and Engineering
Over the last decade, I have worked with colleagues in academia and industry developing techniques for making data engineering more approachable and productive. The theme of the work is to knit together a trifecta of technical fields: Database Management, AI and HCI. In this talk I will describe and illustrate one of the key design principles of our work, which we call Predictive Interaction---a methodology for bringing AI assistance to user experiences for challenging data-centric tasks. We introduced this work in academic papers via the open source Data Wrangler project, and brought it to market as Trifacta, and as Google Dataprep. Time permitting, I will also cover more recent efforts to "democratize" data engineering in the sense of building a "big tent", encompassing both technical users and domain experts, in both Trifacta and the academic open-source B2 extension for Jupyter notebooks.
Joint work with Jeffrey Heer, Sean Kandel, Yifan Wu, Arvind Satyanarayan, and the Trifacta team.
Call for Contributions
Types and format
The workshop will welcome submissions in the following formats:
Extended abstracts that report on novel and preliminary ideas. Extended abstracts can be at most 6* pages in LNCS format.
Short position statements on automating data science, at most 6* pages in LNCS format.
Presentations of relevant work that has recently been published or has already been accepted for publication in journals such as DMKD, MLJ, JMLR, AIJ, JAIR, and major conferences such as SIGKDD, NeurIPS, ICML, IJCAI, etc. The submission should in this case only consist of a copy of the paper.
(*) References and optional supplementary material following the references do not count toward the page limit.
The program committee will review all submissions. It will also decide which accepted submissions can be presented orally, as spotlights, and/or as posters. Submissions of types 1. and 2. are intended as non-archival.
We have also had discussions with the editor-in-chief of Machine Learning Journal about a special issue on the topic of Automating Data Science. More information about this soon.
Submissions via easychair.
Submission deadline (EXTENDED): Wed June 23, 2021 (Title and abstract), Mon June 28, 2021 (Full paper*)
Acceptance notification: Fri July 23, 2021
Camera-ready deadline: September 6, 2021
(*) All deadline times are AoE (GMT-12). All papers must have their title and abstract submitted on easychair by June 23.
Tijl De Bie (UGent, Belgium)
Jose Hernandez-Orallo (Universitat Politecnica de Valencia, Spain)
Joaquin Vanschoren (Eindhoven University of Technology)
Gaël Varoquaux (INRIA)
Chris Williams (University of Edinburgh)
Marcos Bueno (Eindhoven University of Technology)
Oana Balalau (INRIA)
Felix Biessmann (Beuth Hochschule fuer Technik, Berlin),
Pavel Brazdil (U Porto),
Marcos Bueno (Eindhoven University of Technology),
Remco Chang (Tufts),
Jesse Davis (KU Leuven),
Luc De Raedt (KU Leuven),
Cèsar Ferri (Universitat Politècnica de València),
Peter Flach (U Bristol),
Ernesto Jimenez-Ruiz (City University),
Jefrey Lijffijt (Ghent University),
Pierre-Alexandre Mattei (INRIA),
Marine le Morvan (INRIA),
Tomas Petricek (U Kent),
Padhraic Smyth (UC Irvine),
Isabel Valera (Saarland University),
Gerrit van den Burg (Alan Turing Institute).