Amogh Jahagirdar & Szehon Ho I Lakehouse Room I 9:00 AM
Anoop Johnson, Oussama Saoudi, Allison Portis, Bart Samwel, Anton Okolnychyi I Lakehouse Room I 10:00 AM
Ryan Blue, Daniel Weeks I Lakehouse Room I 11:00 AM
[LanceDB] Weston Pace, Jasmine Wang, [Daft] Sammy Sidhu, Jay Chia, [Meta]: Sundaram Narayanan I Lakehouse Room I 1:00 PM
LanceDB's origins as a vector database have given us extensive experience with the workloads faced by AI & ML solutions. We've used this experience to create an enterprise-capable lakehouse from Arrow-native components that tackles the unique challenges associated with these workloads. In this talk we will describe the challenges we have had to solve so far, such as eliminating row groups, improving support for multi-modal data, managing secondary indexes, and building a compute engine that can handle search workloads in addition to traditional OLAP workloads. We will describe the user workloads we encounter and show where traditional lakehouse design has struggled.
This solution spans several layers of lakehouse design. We have created a custom storage format, a custom table format, our own compute engine, and a variety of secondary indexes. Fortunately, this work has been accelerated by Arrow-native utilities such as Apache DataFusion, pyarrow, arrow-rs, and Arrow Flight. For discussion, we would like to highlight how these new components could help other lakehouses and open opportunities for collaboration by adding new utilities to the Arrow-native ecosystem.
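For a concrete sense of the workloads described above, here is a minimal sketch using the lancedb Python package to store embedding vectors alongside regular columns and run a similarity search; the table name, schema, and data are illustrative, and API details may differ across versions.

```python
# Minimal sketch: vector search over Arrow-backed data with the lancedb Python
# package. Table name, schema, and data are illustrative only.
import lancedb

db = lancedb.connect("./ai_lakehouse_demo")  # local directory acting as the database

# Multi-modal style rows: an embedding vector plus ordinary scalar columns.
rows = [
    {"vector": [0.10, 0.20, 0.30], "caption": "a red bicycle", "source": "img_001.jpg"},
    {"vector": [0.90, 0.10, 0.40], "caption": "a snowy mountain", "source": "img_002.jpg"},
]
tbl = db.create_table("images", data=rows)

# Nearest-neighbor search expressed through the same table handle used for scans.
results = tbl.search([0.12, 0.21, 0.29]).limit(1).to_list()
print(results[0]["caption"])
```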
Daniel Weeks & Bart Samwel I Lakehouse Room I 3:00 PM
This session is about managing large binary data such as images, audio and video fragments, PDF documents, and so on. This type of data is becoming ever more important in the age of AI, yet the main open lakehouse formats have no native support for managing it. In this session we will discuss how we could add support for a BLOB data type to Delta and Iceberg. We will cover the basics, such as data storage and metadata formats, as well as how to query and process BLOBs efficiently. In addition, we will look at broader integration topics, including the interaction with governance, catalogs, and Delta Sharing.
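As background for the discussion, the sketch below shows the two workarounds commonly used today in the absence of a native BLOB type: storing bytes inline in a BINARY column or storing a URI that points to an external object. It is a hedged illustration of the status quo, not the proposed design; the table and column names are made up.

```python
# Status-quo sketch (not the proposal): binary payloads either inline as BINARY
# (which bloats data files) or as a URI reference to an external object.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, BinaryType, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("doc_id", LongType()),
    StructField("content", BinaryType()),      # inline bytes; workable for small payloads
    StructField("content_uri", StringType()),  # pointer to an external object for large payloads
])

rows = [
    (1, bytearray(b"%PDF-1.7 ..."), None),
    (2, None, "s3://bucket/docs/report-2.pdf"),
]

spark.createDataFrame(rows, schema).write.format("delta").mode("append").save("/tmp/docs_table")
```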
DB Tsai, Xiao Li | Spark-it-up Room | 9:00 AM
Apache Spark has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. All this has been possible thanks to the global Spark community of contributors and committers, and the dedicated users who have adopted it for their respective workloads.
To continue this successful effort, in this short introduction we will share and discuss the Apache Spark™ roadmap beyond Spark 4.0, including PySpark. This will cover the SPIPs, JIRAs, pain points, and concerns that have been submitted thus far.
More importantly, to foster productive and positive discussions and consider salient features for the future roadmap, we want to frame open discussions on the many topics that concern the Spark community, including:
Community logistical goals and concerns
Use of LLMs with Spark for debugging & coding
Python Data Sources & DataSource V2 APIs and interoperability
Apache Arrow UDFs and Arrow serialization for UD(T)Fs
Important SPIPs under consideration
Martin Grund, Jules Damji | Spark-it-up Room | 9:00 AM
Spark Connect introduces a client-server architecture for Apache Spark, enabling remote connectivity to Spark clusters via the DataFrame API. This separation of concerns allows Spark to be integrated into various environments, including data applications, IDEs, and notebooks. It addresses many operational issues, makes Spark accessible from non-JVM clients (Rust, Go, .NET, Swift, etc.), provides isolation from conflicting dependencies on the Spark driver and cluster, and allows easy upgrades. In this session we will share how easily different platforms can connect to Spark through Spark Connect, foster community discussion, encourage community involvement, discuss any concerns, and share what the future holds for Spark Connect as an integral part of Apache Spark 4.x and beyond.
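A minimal sketch of what this looks like from PySpark, which has shipped Spark Connect support since Spark 3.4; the endpoint URL is a placeholder.

```python
# Connect to a remote Spark cluster over Spark Connect (gRPC); only session
# creation changes, the DataFrame API is used exactly as before.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://spark-connect-host:15002")  # placeholder Spark Connect endpoint
    .getOrCreate()
)

df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
df.show()
```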
Steve Russo, Renjie Liu, Yuanjian Li | Spark-it-up Room | 9:30 AM
The Spark Connect Rust client will be proposed as a Spark 4.x subproject in the main Spark GitHub repository. This will give it more exposure, encourage contributions from the community, and extend its functionality. This session will cover the following topics to foster discussion and engage the community:
Overview of Existing Work
Release & Maintenance Planning
Ecosystem Integration & Key Dependencies
Technical Deep Dive (optional)
Cross‑Language Client Governance (optional/open topics)
Sandy Ryza, Anton Okolnychyi, Andreas Neumann | Spark-it-up Room | 10:00 AM
Jerry Peng, Anish Shrigondekar | Spark-it-up Room | 1:00 PM
SPIP: Basic introduction to current state - what & why & API
Real-time mode is a proposed mode in Spark Structured Streaming that dramatically lowers end-to-end latency for processing streams of data. In plain terms, our goal is to make Spark capable of handling streaming jobs that need results almost immediately (within a few tenths of a second). We want to achieve this without changing the high-level DataFrame/Dataset API that users already use, so existing streaming queries can run in this new ultra-low-latency mode simply by turning it on, without rewriting their logic (see the sketch after this list). In short, we're trying to enable Spark to power real-time applications (like instant anomaly alerts or live personalization) that today cannot meet their latency requirements with Spark's current streaming engine.
Release Planning - Part of the Spark 4.x release schedule
Support needed from Spark Committers/PMC - Code review, release handling?
Contribution - How to contribute: features, enhancements, documentation, tutorials, etc.
Roadmap - what is coming now and in the future
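The sketch below is a hedged illustration of the key claim above: the streaming query itself uses only the existing DataFrame API and would stay unchanged. How the proposed real-time mode is actually switched on is not settled in the SPIP, so no real config or trigger name is shown.

```python
# An ordinary Structured Streaming query (anomaly-alert style). Under the
# proposal, the same query would run in real-time mode once that mode is
# enabled; the enabling mechanism is hypothetical here and intentionally
# left as a commented placeholder.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")
    .load()
)

alerts = (
    events.selectExpr("CAST(value AS STRING) AS payload")
    .filter(col("payload").contains("ERROR"))
)

query = (
    alerts.writeStream
    .format("console")
    # .trigger(...)  # hypothetical: where a real-time mode toggle might go
    .start()
)
query.awaitTermination()
```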
Becket Qin (LinkedIn), Sijie Guo (Ursa Streams), Matthew Schumpert (Redpanda), Yingjun Wu, Rayees Pasha (RisingWave) | Spark-it-up Room | 2:00 PM
Jeff Shute, Daniel Tenedorio, Serge Rielau | Spark-it-up Room | 3:00 PM
SQL has succeeded by allowing users to declare what data they want, while leaving the how to the engine. But as queries grow complex—especially with deeply nested subqueries—readability and maintainability suffer. To address this, we have introduced a new, flexible SQL syntax in Apache Spark 4.0 that lets users compose logic through independent, re-orderable clauses—similar to DataFrame-style programming.
In this short talk, we will discuss the following big rocks, foster discussion, and hear what the community wants to see in the future (a brief syntax sketch follows the takeaways below).
Key Takeaways:
Traditional SQL emphasizes declarative logic but can become hard to manage
Complex subqueries reduce readability and maintainability
New syntax improves clarity by structuring queries as modular, flexible clauses
Inspired by DataFrame APIs, enabling easier learning and authoring
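The abstract does not spell out the syntax, but it appears to describe the SQL pipe syntax added in Spark 4.0, where each |> clause is an independent, re-orderable step read top to bottom much like a DataFrame chain. A minimal sketch, with illustrative table and column names:

```python
# Pipe-syntax query submitted through PySpark; each |> operator applies one
# clause to the result of the previous step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    FROM orders
    |> WHERE order_date >= DATE '2024-01-01'
    |> AGGREGATE SUM(amount) AS total GROUP BY customer_id
    |> ORDER BY total DESC
    |> LIMIT 10
""").show()
```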
JB Onofre & Daniel Weeks I Catalogs Room I 9:00 AM
Jason Reid, Ryan Blue, Scott Sandre, Michelle Leon I Catalogs Room I 10:00 AM
JB Onofre, Jack Ye, Ben Wilson, TD Das I Catalogs Room I 11:00 AM
JB Onofre I Catalogs Room I 1:00 PM
R. Tyler Croy, Robert Pack, Zach Schuermann, Kevin Liu I Quiver Room I 9:00 AM
Matt Topol I Quiver Room I 10:00 AM
Apache Arrow has become the de facto standard for in-memory columnar analytics, underpinning pandas, Polars, and Apache DataFusion, and supported by DuckDB, ClickHouse, Velox, Spark, and more. Beyond the promise of high-performance analytics, Arrow also enables efficient data transport and interoperability for databases, RPC services, and even ML workflows. Given the breadth of the Arrow community, come hear about the roadmaps for some of the language implementations (Go, C++, PyArrow, etc.) and subprojects (like ADBC), followed by a Q&A session for any and all of your burning questions.
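A minimal sketch of the interoperability mentioned above: one Arrow table shared by pyarrow, pandas, and DuckDB without converting the data into engine-specific formats. The data is illustrative.

```python
# The same Arrow-backed table flows between libraries: pandas consumes it,
# and DuckDB scans the in-scope pyarrow table directly by variable name.
import pyarrow as pa
import duckdb

weather = pa.table({"city": ["Oslo", "Lima", "Pune"], "temp_c": [4.0, 19.5, 31.2]})

df = weather.to_pandas()                          # Arrow -> pandas
hot = duckdb.sql(
    "SELECT city FROM weather WHERE temp_c > 10"  # DuckDB reads the Arrow table in place
).arrow()                                         # ...and returns Arrow again
print(hot)
```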
Robert Pack, Zach Schuermann | Quiver Room | 11:00 AM
Andrew Lamb | Quiver Room | 1:00 PM
It is a common misconception that querying Apache Parquet data is constrained to the basic metadata built into the format itself and is thus slower than querying proprietary formats. Parquet contains standard min/max statistics, a Page Index, and Bloom filters, and using open-source composable systems such as Apache DataFusion, it is possible to build sophisticated caches and specialized, system-specific indexes while retaining broad ecosystem compatibility.
In this talk Andrew will review the structures built into Parquet for query acceleration, and demonstrate how to use a cache for parsed metadata, push row group and page pruning into a metadata store, and build a specialized index for multi-column primary keys.
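As a small illustration of the built-in structures the talk covers, the sketch below uses pyarrow to write a Parquet file and read back the per-row-group min/max statistics that engines rely on for pruning; the file path and data are placeholders.

```python
# Inspect the row-group statistics stored in Parquet footers.
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(
    pa.table({"id": list(range(100_000))}),
    "example.parquet",
    row_group_size=25_000,  # force several row groups so pruning has something to skip
)

meta = pq.ParquetFile("example.parquet").metadata
for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(0).statistics
    print(f"row group {rg}: min={stats.min} max={stats.max}")
```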
Binwei Yang (Gluten), Konstantinos Karanasos (Velox) | Quiver Room | 3:00 PM
Apache Gluten (incubating) is an emerging open-source project in the Apache software ecosystem. It is designed to offload the Apache Spark SQL engine from the JVM to native libraries like Velox or to hardware accelerators. It uses Substrait as the standard query plan representation and Apache Arrow as the standard data format to enable interaction between Spark and native libraries. By leveraging cutting-edge technologies like vectorized execution, columnar data formats, and advanced memory management techniques, Apache Gluten aims to deliver significant improvements in data processing speed and efficiency.
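A hedged sketch of what enabling Gluten in a Spark session typically looks like; the plugin class and config keys follow the pattern in Gluten's documentation but vary by release (older builds use io.glutenproject.GlutenPlugin), and the Gluten bundle jar must be on the classpath, so treat every key below as an assumption to verify against the docs for your version.

```python
# Turn on the Gluten plugin plus the settings its native backends generally need
# (columnar shuffle, off-heap memory); values are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "4g")
    .getOrCreate()
)

# Ordinary Spark SQL; operators supported by the native backend are offloaded.
spark.range(1_000_000).selectExpr("sum(id)").show()
```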