Towards Scalable Schema Mapping using Large Language Models
Christopher Buss, Oregon State University
Mahdis Safari, Oregon State University
Arash Termehchy, Oregon State University
David Maier, Portland State University
Stefan Lee, Oregon State University
Abstract: The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we propose to address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLAV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we propose to mitigate through strategies such as data-type prefiltering.
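To make the sampling-and-aggregation and prefiltering ideas concrete, here is a minimal sketch, assuming a caller-supplied propose() function that wraps one LLM call and returns candidate (source, target) attribute correspondences; the helper names, the type-equality prefilter, and the majority-vote threshold are illustrative stand-ins, not the authors' implementation.

    import random
    from collections import Counter

    def aggregate_matches(src_attrs, tgt_attrs, types, propose,
                          n_samples=5, threshold=0.6):
        # Data-type prefilter: never ask the model about pairs whose
        # declared types are incompatible, cutting the number of LLM calls.
        candidates = [(s, t) for s in src_attrs for t in tgt_attrs
                      if types[s] == types[t]]
        votes = Counter()
        for _ in range(n_samples):
            # Shuffle the prompt ordering so that sensitivity to input
            # phrasing and structure averages out across samples.
            random.shuffle(candidates)
            votes.update(propose(candidates))  # one LLM call per sample
        # Keep only correspondences proposed by a majority of samples.
        return {pair for pair, count in votes.items()
                if count / n_samples >= threshold}

In practice propose() would also be sampled at nonzero temperature and given table context, so that repeated calls genuinely vary rather than returning the same proposal five times.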
Unveiling Challenges for LLMs in Enterprise Data Engineering
Carsten Binnig, TU Darmstadt
Jan-Micha Bodensohn, TU Darmstadt
Ulf Brackmann, TU Darmstadt
Liane Vogel, TU Darmstadt
Anupam Sanghi, TU Darmstadt
Abstract: Large Language Models (LLMs) have demonstrated significant potential for automating data engineering tasks on tabular data, giving enterprises a valuable opportunity to reduce the high costs associated with manual data handling. However, the enterprise domain introduces unique challenges that existing LLM-based approaches for data engineering often overlook, such as large table sizes, more complex tasks, and the need for internal knowledge. In this talk, we present the results of a study that systematically examines the performance of LLMs for enterprise data engineering. As our main results, we identify key enterprise-specific challenges related to data, tasks, and background knowledge and show their impact on recent LLMs for data engineering. Our analysis reveals that LLMs face substantial limitations in real-world enterprise scenarios, resulting in significant accuracy drops. We believe our findings contribute to a systematic understanding of LLMs for enterprise data engineering and can support their adoption in industry.
Optimizing Open-Domain Question Answering with Graph-Based Retrieval-Augmented Generation
Joyce Cahoon, Microsoft - Gray Systems Lab
Nick Litombe, Microsoft - Gray Systems Lab
Jonathan Larson, Microsoft
Yiwen Zhu, Microsoft - Gray Systems Lab
Andreas Mueller, Microsoft - Gray Systems Lab
Fotis Psallidas, Microsoft - Gray Systems Lab
Carlo Curino, Microsoft - Gray Systems Lab
Abstract: In this work, we benchmark various graph-based retrieval-augmented generation (RAG) systems across a broad spectrum of query types, including OLTP-style (fact-based) and OLAP-style (thematic) queries, to address the complex demands of open-domain question answering (QA). Traditional RAG methods often fall short in handling nuanced, multi-document synthesis tasks. By structuring knowledge as graphs, we can facilitate the retrieval of context that captures greater semantic depth and enhances language model operations. We explore graph-based RAG methodologies and introduce TREX, a novel, cost-effective alternative that combines graph-based indexing and vector-based retrieval techniques. Our benchmarking across four diverse datasets highlights the strengths of different RAG methodologies, demonstrates TREX’s ability to handle multiple open-domain QA types, and reveals the limitations of current evaluation methods. We publicly release these datasets to facilitate further research and benchmarking at https://github.com/microsoft/graphrag-benchmarking-datasets. Our findings underscore the potential of augmenting large language models with advanced retrieval capabilities and scalable graph-based AI solutions.
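The abstract does not spell out TREX's internals, but the general pattern it names, pairing a vector index with graph-based structure, can be sketched as follows; the node-summary embeddings, cosine ranking, and one-hop expansion are assumptions for illustration, not TREX itself.

    import numpy as np
    import networkx as nx

    def hybrid_retrieve(graph, node_vecs, query_vec, k=5, hops=1):
        # Vector step: rank graph nodes by cosine similarity between
        # their summary embeddings and the query embedding.
        def cos(a, b):
            return float(np.dot(a, b) /
                         (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        hits = sorted(graph.nodes,
                      key=lambda n: cos(node_vecs[n], query_vec),
                      reverse=True)[:k]
        # Graph step: expand each hit through its neighborhood so the
        # generator sees related entities together, which is what helps
        # thematic (OLAP-style) queries that flat chunk retrieval misses.
        context = set(hits)
        for n in hits:
            context |= set(nx.single_source_shortest_path_length(
                graph, n, cutoff=hops))
        return context

The split matters for cost: the vector step is cheap and scales to many nodes, while the graph step adds semantic depth only around the few nodes that survive ranking.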
Semantic Knowledge Graphs for High-Precision, Low-Latency NL2SQL
Wangda Tan, Waii
Abstract: Enterprise adoption of natural language to SQL interfaces is often blocked by complex database schemas and unclear user intents. We present a metadata-driven knowledge graph that automatically infers table relationships, categorizes columns, and links technical documentation into a unified semantic framework, achieving translation accuracy above 95 percent on challenging schemas. At query time, graph-derived concepts are dynamically retrieved and ranked to resolve ambiguity in user requests.
To optimize performance, our system employs model right-sizing (routing simple intent detection to lightweight models and complex SQL generation to larger ones) and compresses schema references to reduce token usage. A multi-tier, graph-aware cache combined with speculative parallel execution and streamed intermediate artifacts (such as entity extractions and draft queries) cuts end-to-end latency by up to 50 percent without compromising accuracy. This talk will share the design and implementation of our unified semantic and performance framework, along with practical lessons for building scalable, responsive NL2SQL systems.
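As a hedged sketch of the right-sizing idea only, the router below sends one cheap classification call to a small model and reserves the large model for requests it judges complex; small_llm and large_llm are hypothetical callables, and the two-way SIMPLE/COMPLEX split is a simplification of the system described in the talk.

    def nl2sql(question, schema_summary, small_llm, large_llm):
        # Cheap intent/complexity check on the lightweight model.
        verdict = small_llm(
            "Answer SIMPLE or COMPLEX: how hard is this request?\n"
            + question)
        # Route generation: small model for simple lookups, large model
        # for multi-join or ambiguous requests.
        model = large_llm if "COMPLEX" in verdict.upper() else small_llm
        # A compressed schema reference keeps token usage down.
        prompt = f"Schema: {schema_summary}\nTranslate to SQL: {question}"
        return model(prompt)

The caching and speculative-execution layers the abstract mentions would sit around this function, reusing earlier verdicts and draft queries rather than recomputing them.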
Advancing Workload Management with Foundational Models: Challenges in Time Series Similarity and Interpretability
Tiemo Bang, Microsoft - Gray Systems Lab
Sergiy Matusevych, Microsoft - Gray Systems Lab
Yuanyuan Tian, Microsoft - Gray Systems Lab
Georgia Christofidi, IMDEA Software Institute
Giannis Roumpos, IMDEA Software Institute
Thaleia Dimitra Doudali, IMDEA Software Institute
Abstract: Workload management (WLM) is essential for cloud providers to balance performance, reliability, and cost. Many WLM tasks rely on understanding workload behavior through time series similarity analysis, but traditional approaches face scalability challenges due to manual feature engineering and computational overheads. Foundational time series models promise to address these limitations by learning reusable representations with minimal supervision. This paper evaluates their practical potential for WLM through a focused case study on time series similarity. We present concrete use cases, characterize a real-world query arrival dataset from Microsoft Fabric Warehouse, and compare the foundational model MOMENT against conventional similarity methods. Our findings reveal that while foundational models offer computational efficiency, they produce overly generalized similarities with limited interpretability compared to hand-engineered features. We identify key challenges and research directions needed to make foundational models practical for workload management.
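For contrast with the paper's findings, the following sketch shows both sides of the comparison: hand-engineered features versus embeddings from a pretrained encoder, with the same cosine similarity on top. The encode callable stands in for a foundation model such as MOMENT, and the specific feature set is illustrative, not the paper's.

    import numpy as np

    def handcrafted_features(ts):
        # Conventional route: interpretable summary statistics.
        return np.array([ts.mean(), ts.std(), ts.max(), ts.min(),
                         np.abs(np.diff(ts)).mean()])

    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def similarity(ts_a, ts_b, encode=None):
        # With encode=None we fall back to hand-engineered features;
        # passing a foundation-model encoder swaps in learned embeddings
        # while the similarity computation stays identical.
        featurize = encode if encode is not None else handcrafted_features
        return cosine(featurize(ts_a), featurize(ts_b))

The interpretability gap the paper reports shows up directly in this framing: every handcrafted dimension has a name a workload engineer can reason about, while the embedding dimensions do not.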
Reimagining Databases in the Age of LLMs
Georgia Koutrika, Athena Research Center, Greece
Abstract: The rise of large language models (LLMs) is transforming how we interact with and manage data. Traditional databases, designed for structured, schema-bound querying, are now being challenged by the fluid, context-aware capabilities of LLMs. From natural language interfaces for querying, which allow users to interact with data using everyday language instead of rigid SQL syntax, to semantic query operators that enable reasoning over the meaning of data, LLMs are extending the expressive power of traditional query languages. These semantic capabilities allow databases to interpret intent and operate across structured and unstructured data in novel ways. In parallel, we are witnessing the emergence of learned query optimizers that leverage LLMs and other neural models to predict efficient execution plans, rewrite queries, and adapt indexing strategies. Together, these innovations are driving a new generation of hybrid data architectures redefining the boundaries of what databases can do. This talk will explore these emerging paradigms at the intersection of databases and AI, and what they mean for the future of data systems.
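As one concrete instance of the semantic query operators the talk describes, here is a minimal sketch of an LLM-backed selection, assuming a generic llm callable; it illustrates the operator class, not any particular system's API.

    def semantic_filter(rows, predicate, llm):
        # A semantic analogue of SQL's WHERE: the condition is stated in
        # natural language and judged by a model for each record.
        kept = []
        for row in rows:
            answer = llm(f"Record: {row}\n"
                         f"Does it satisfy: '{predicate}'? Answer YES or NO.")
            if answer.strip().upper().startswith("YES"):
                kept.append(row)
        return kept

    # e.g. semantic_filter(tickets, "the customer sounds frustrated", llm)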