The 1st Workshop on Data Science with Human in the Loop (DaSH)

8:00 – 8:05am: Workshop introduction

Session Chair: Lucian Popa, IBM Almaden

8:05 – 8:50am: Invited talk 1 (Marti Hearst)

Title: Human-in-the-Loop from the Human Perspective

Abstract: As stated in this workshop’s call for participation, in order to unleash the full potential of data science, we need to improve our understanding about the best modalities of human and computer cooperation along the data science pipeline. The accepted papers in this workshop advance the future of human-machine interaction in data analysis, and include new algorithms for active learning, new user interfaces for allowing analysts to augment algorithms, algorithms to automate parts of the analysis, and trenchant forecasts of the future of work in the field of data science.

My contribution to this conversation will be twofold. I will first share results of a survey of professional information analysts, relating their views about the role of machine automation in the process of exploratory data analysis. I will then discuss results in peer learning in online education, and how these ideas might be applicable to advanced human-machine analysis tasks.

Session Chair: Yunyao Li, IBM Almaden

8:50 – 9:20am: Session 1 (Human-in-the-Loop Techniques)

Session Chair: Slobodan Vucetic, Temple University

9:25 – 9:55am: Session 2 (Model Analysis and Applications)

Session Chair: Eduard Dragut, Temple University

10:00 – 10:30am: Session 3 (Impact of Data Science and Automation)

Session Chair: Lucian Popa, IBM Almaden

10:30 – 11:15am: Invited talk 2 (AnHai Doan)

Title: Human-in-the-Loop Challenges for Entity Matching: A Report from the Trenches

Abstract: Entity matching (EM) is a fundamental problem in data science. Many data science projects must integrate multiple data sources before analysis can be carried out to extract insights, and such integration often requires EM. In the past five years, we have been building Magellan, a general platform that uses machine learning, big data processing, and effective user interaction to solve EM problems. Magellan has been deployed at 12 companies and domain science groups, recently commercialized by GreenBay Technologies, and pushed into commercial EM platforms at Informatica, the world-leading data integration company. In this talk, I will discuss human-in-the-loop (HIL) challenges we faced in Magellan, and how we designed Magellan from scratch using HIL principles. Specifically, I will discuss how we identify the end-to-end process that a user must follow to perform EM, then develop semi-automatic tools to support the various steps in the process. I will also discuss why we designed tools to be atomic, highly interoperable, and built into popular ecosystems of data science tools. Finally, I will discuss lessons learned that can potentially be applied to other problem settings in data science. It is my hope that more researchers will investigate EM, as it can be a rich “playground” for HIL research.

Session Chair: Eduard Dragut, Temple University

11:15am – 12:00pm: Panel on Open Challenges in Human-Computer Cooperation in Data Science

Session Chair: Yunyao Li, IBM Almaden