ELLIS workshop on Representation Learning and Generative Models for Structured Data

Thursday 27 February, 2025

@ Science Park Congress Centre, Amsterdam

NOTE: everyone who registered through the form is counted in! You should have received a confirmation email, possibly in your spam folder.

Abstract

Attendance and Logistics

Abstract

The use of Machine Learning is well established for modalities such as text, images, audio and even video. In the past years, a less studied modality is structured data, such as relational tables and knowledge graphs. Recent works attempt to use this modality as part of, or in combination with, ML models. This workshop will host a program focused on representation learning and generative models for structured data, such as relational tables and spreadsheets, as well as knowledge graphs. The workshop will also engage researchers focusing on the intersection of learning over structured data and information retrieval, for example, in retrieval augmented generation (RAG) and question answering (QA) systems. The aim of the workshop is to connect researchers working on this topic and surface novel research ideas and collaboration opportunities by bringing views from the NLP, ML, DB, and IR disciplines together.

Organizers

Madelon Hulsebos (CWI, main contact)
Iacer Calixto (UvA, Amsterdam UMC)
Michael Cochez (VU)
Andrew Yates (UvA)

Program

The workshop on 27 February, starts at 9:00AM with a walk-in and coffee, and lasts until roughly 5:30-6:00PM CET.

Schedule

Invited talks
Poster session
Short talks (randomly selected accepted papers)

Invited Speakers

Paolo Papotti (Eurecom): SQL and Large Language Models: A Marriage Made in Heaven?

Abstract With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of documents. However, for data-intensive tasks over structured data, relational DBs and SQL queries are at the core of countless applications. While these two technologies may appear distant, in this talk we will see that they can interact effectively and with promising results. LLMs can help users express SQL queries (Semantic Parsing), but SQL queries can be used to evaluate LLMs (Benchmarking). Their combination can be further advanced, with opportunities to query with a unified SQL interface both LLMs and DBs. We present recent results on these topics and then conclude with an overview of the research challenges in effectively leveraging the combined power of SQL and LLMs.

Bio Paolo Papotti got his Ph.D. degree from the University of Roma Tre (Italy) in 2007 and is an associate professor in the Data Science department at EURECOM (France) since 2017. Before joining EURECOM, he has been a scientist in the data analytics group at QCRI (Qatar) and an assistant professor at Arizona State University (USA). His research is in the broad areas of scalable data management and NLP, with a focus on data integration and information quality.

Effy Xue Li (University of Amsterdam): Efficient Use of LLMs for Data Preparation

Abstract Data preparation accounts for 80% of a data scientist’s time and remains one of the least enjoyable yet essential tasks in the workflow. While LLMs offer new opportunities for automating structured data preparation, challenges persist in efficiency, scalability, and adaptability. In this talk, we explore the efficient use of LLMs for data preparation, including (1) generating transformation code for data wrangling tasks and (2) fine-tuning small LLMs for entity matching. We highlight how LLMs can be leveraged not just for reasoning but as scalable, cost-effective automation tools in data-cleaning pipelines. Finally, we discuss future research opportunities in the area, paving the way for more adaptable and interpretable AI-driven data science workflows.

Bio Effy Xue Li is completing her PhD at the University of Amsterdam and will soon be joining CWI as a postdoctoral researcher. Her research focuses on knowledge graph construction in complex domains and broader questions around automating data science workflows, drawing from her background in NLP, ML, and Knowledge Graphs. Previously, she was an AI Resident at Microsoft Research Cambridge and obtained her MSc from the University of Edinburgh.

Frank Hutter (University of Freiburg / ELLIS Institute Tübingen): Accurate predictions on small data (and time series) with the tabular foundation model TabPFN

Abstract Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories, gradient-boosted decision trees have dominated tabular data for the past 20 years. Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. In 2.8 s, TabPFN outperforms an ensemble of the strongest

baselines tuned for 4h in a classification setting. As a generative transformer-based foundation model, this model also allows fine-tuning, data generation, density estimation and learning reusable embeddings. TabPFN is a learning algorithm that is itself learned across millions of synthetic datasets, demonstrating the power of this approach for algorithm

development. By improving modeling abilities across diverse fields, TabPFN has the potential to accelerate scientific discovery and enhance important decision-making in various domains. Likewise, TabPFN has enormous potential for related tabular applications such as time series or relational data. We show that TabPFN already excels at time series forecasting, outperforming foundation models only built for time series.

Bio Frank Hutter has been a Full Professor for Machine Learning at the University of Freiburg (Germany) since 2016, and an Emmy Noether Research Group Lead since 2013. Before that, he did a PhD (2004-2009) and postdoc (2009-2013) at the University of British Columbia (UBC) in Canada. He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, as well as several best paper awards and prizes in international ML competitions. He is a Fellow of ELLIS and EurAI, Director of the ELLIS unit Freiburg, and the recipient of 3 ERC grants. Frank is best known for his research on automated machine learning (AutoML), including neural architecture search, efficient hyperparameter optimization, and meta-learning. He co-authored the first book on AutoML and the prominent AutoML tools Auto-WEKA, Auto-sklearn and Auto-PyTorch, won the first two AutoML challenges with his team, is co-teaching the first MOOC on AutoML, co-organized 15 AutoML-related workshops at ICML, NeurIPS and ICLR, and founded the AutoML conference as general chair in 2022. In recent years, his focus has been on the intersection of foundation models and AutoML, including the first foundation model for tabular data, TabPFN, and improving pretraining and fine-tuning with AutoML.

Julian Eisenschlos (Google DeepMind): Visual language: how generation can drive understanding in data visualizations

Abstract Large amounts of content both online and offline relies on structure to organize and communicate information more effectively. While natural image and language understanding and generation has been studied extensively, visually situated language such as tables, charts, plots, and infographics, continues to be a challenge for models large and small. In this talk we will show how teaching models to generate visually situated language can improve downstream reading and reasoning on this data modality for tasks such as question answering, entailment and summarization through multi-step reasoning and tool use.

Bio Julian is a staff Research Scientist at Google DeepMind tackling problems in visual language understanding and generation, previously co-founder of Botmaker and ML at Meta and ASAPP.

Pasquale Minervini (University of Edinburgh): Integrating Algorithms and Neural Models

Abstract: Neural models, including LLMs, can exhibit remarkable abilities; paradoxically, they also struggle with algorithmic tasks where much simpler models excel. To solve these issues, we propose Implicit Maximum Likelihood Estimation (IMLE), a framework for end-to-end learning of models combining algorithmic combinatorial solvers and differentiable neural components, which allows us to incorporate planning and reasoning algorithms in neural architectures by just adding a simple decorator [1, 2].

[1] Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions. https://arxiv.org/abs/2106.01798 , NeurIPS 2021

[2] Adaptive Perturbation-Based Gradient Estimation for Discrete Latent Variable Models. https://arxiv.org/abs/2209.04862 , AAAI 2023

Bio: Pasquale is a Lecturer in Natural Language Processing at the School of Informatics, University of Edinburgh; co-founder and CTO of the generative AI start-up Miniml.AI; and an ELLIS Scholar – Edinburgh Unit. His research interests include NLP and ML, focusing on relational learning and learning from graph-structured data, solving knowledge-intensive tasks, hybrid neuro-symbolic models, compositional generalisation, and designing data-efficient and robust deep learning models. Pasquale routinely collaborates with researchers across both academia and industry. https://neuralnoise.com/

Accepted Abstracts

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Michael Cochez, M.C. Schut

-> Also randomly selected for a short talk

Rethinking Feature Augmentation In Graph Contrastive Learning
Congfeng Cao, Pengyu Zhang, Zeyu Zhang, Jelke Bloem

-> Also randomly selected for a short talk

A Resolution-Alignment-Completeness System for Data Imputation over Tabular Clinical Records
Shervin Mehryar

-> Also randomly selected for a short talk

Integrating domain constraints in tabular data generation process
Salijona Dyrmishi, Mihaela C. Stoian, Eleonora Giunchiglia, Maxime Cordy

-> Also randomly selected for a short talk

Author Instructions

Submission Instructions

We welcome extended abstracts that will be presented as a poster. We might invite a few selected abstracts for a spotlight talk.

Submission format: we invite extended abstracts of 2 pages (excl. references).
Submission template: LNCS template (https://www.overleaf.com/latex/templates/springer-lecture-notes-in-computer-science/kzwwpvhwnvfj)
Submission portal: https://openreview.net/group?id=ELLIS.eu/2024/Workshop/RLGMSD

Presentation Instructions

All abstracts will be presented during the poster session. Please print your poster in A0 or A1 size, with portrait layout.

Important dates

Extended Abstract submission: 31 January 2025 -> due to the ICML deadline, extended to 3 February 2025 (6PM CET)
Notification: 4 February 2025
Workshop: 27 February 2025

Attendance and Logistics

Registration

Please use this link to apply for in-person attendance of the workshop: https://forms.gle/z27pzg1nLuQxhxH86

Location

The workshop will be held at: Science Park Congress Centre, Science Park 125, Amsterdam, Netherlands

Getting there

You can find all instructions at https://www.cwi.nl/en/about/contact/ (the entrance of the Science Park Congress Centre is next to CWI). If you are coming by car and want to park on the premise, please share your license plate through M.Anholt.Gunzeln@cwi.nl. Limited options are available.

ELLIS workshop on Representation Learning and Generative Models for Structured Data

Abstract

Organizers

Program

Accepted Abstracts

Author Instructions

Attendance and Logistics

Sponsors