Invited Speakers

Invited keynote speaker: Prof Felix Naumann, Hasso Plattner Institute and University of Potsdam.

Title: Data Profiling for Data Integration

Data profiling comprises a broad range of methods to efficiently analyze a given dataset. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and various data dependencies. The talk highlights the key insights behind recent state-of-the-art methods and presents various use cases in the areas of data cleaning and data integration: violations of dependencies point to errors in the data; key discovery identifies the core entities of a data source; inclusion dependencies are candidates for joining multiple sources; and, in general, data profiling results can be used to organize data lakes.

Bio:

Prof Felix Naumann studied mathematics, economics, and computer science at the University of Technology in Berlin. After receiving his diploma (MA) in 1997, he completed his PhD thesis in the area of data quality at Humboldt University of Berlin in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on data integration topics. From 2003 to 2006 he was assistant professor for information integration, again at Humboldt University of Berlin. Since 2006 he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany. He has been a visiting researcher at QCRI in Qatar, AT&T Research in New York, and IBM Research in California. His research interests include data profiling, data cleansing, and text mining. In addition to numerous PC memberships for international conferences, he has organized several conferences in various roles, and he is editor-in-chief of the Information Systems journal and a trustee of the VLDB Endowment.


Invited industry talk by Amazon Development Centre Scotland

Title: Record Linkage At Amazon Scale Using Deep Siamese Networks

Discovering relationships between products can greatly improve a customer’s search and discovery experience over Amazon’s product catalog. As such, product relationship data powers many features on Amazon’s website, including author pages, search deduplication, automated pricing and product substitutions. Finding relationships between items can be framed as a Record Linkage task, where the goal is to cluster products corresponding to the same real-world entity.

Solving this problem over the Amazon catalog is challenging as the data is complex and noisy, containing different types of fields with variable quality. In addition, the scale of the catalog is large and continuously growing, with the number of entities close to the number of products.

Amazon's Inherent Relationships team leverages Deep Learning and an incremental clustering algorithm to solve these problems. In this talk, we will present the record linkage task, its challenges and the methods we use to overcome them at Amazon scale.

Bios:

Yoni Lev is a Machine Learning Scientist at the Amazon Development Centre in Edinburgh, Scotland. Over the past four years, Yoni has worked on a variety of record linkage problems as part of his scientific role with the Inherent Relationships team. Prior to joining Amazon, Yoni held several roles as a researcher and applied scientist working on different machine learning problems, including speech recognition, information extraction and text summarization. Yoni holds a bachelor’s degree in Mathematics and Computer Science and a master’s degree in Computer Science and Natural Language Processing from Ben Gurion University in Israel, where he worked on statistical methods for error correction in modern Hebrew.

Grant Galloway joined the Amazon Inherent Relationships team in 2016. Previously, Grant completed a PhD with the Institute for Energy and Environment at the University of Strathclyde. There, Grant's research focussed on using machine learning to predict mechanical failures in renewable energy technologies, primarily tidal turbines.