One-Hop Sub-Query Result Caches
Authors: Hieu Nguyen (eBay), Jun Li (eBay), Shahram Ghandeharizadeh (USC)
Abstract: This poster introduces a novel one-hop sub-query result cache for processing graph read transactions, gR-Txs, in a graph database system. The one-hop navigation is from a vertex using either its incoming or outgoing edges with selection predicates that filter edges and vertices. A cache entry identifies a unique one-hop sub-query (the key) and its result set consisting of immutable vertex ids (the value). When processing a gR-Tx, the query processor identifies its sequence of individual one-hop sub-queries and looks up their results in the cache. A cache hit fetches less data from the storage manager and eliminates the need to process the one-hop sub-query. A cache miss populates the cache asynchronously and in a transactional manner, maintaining the separation of the read and write paths of our transactional storage manager. A graph read and write transaction, gRW-Tx, identifies the impacted cache entries and either deletes or updates them. Our implementation of the cache sits inside the graph query processing engine and is transparent to the user application. We evaluate the cache using our eCommerce production workload. On average, the cache improves the 95th and 99th percentile query response times by at least 2.79x and 3.09x, respectively. An interesting result is the significant performance improvement observed for the indirect beneficiaries of the cache: gRW-Txs and gR-Txs that do not reference one-hop sub-queries. The cache frees system resources, expediting their processing significantly.
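A minimal sketch of the cache interface described above, written in Python with hypothetical names (the key structure and invalidation hooks are assumptions drawn from the abstract, not the authors' implementation):

from dataclasses import dataclass

@dataclass(frozen=True)
class OneHopKey:
    """Identifies a unique one-hop sub-query: start vertex, edge direction, and predicate."""
    vertex_id: int
    direction: str   # "in" or "out"
    predicate: str   # canonicalized selection predicate over edges/vertices

class OneHopCache:
    def __init__(self):
        self._entries: dict[OneHopKey, frozenset[int]] = {}

    def lookup(self, key: OneHopKey):
        """A hit returns the immutable result set of vertex ids; a miss returns None."""
        return self._entries.get(key)

    def populate(self, key: OneHopKey, vertex_ids):
        """Called asynchronously after a miss, once the storage manager computes the result."""
        self._entries[key] = frozenset(vertex_ids)

    def invalidate(self, vertex_id: int):
        """A gRW-Tx deletes (or updates) entries whose one-hop navigation starts at the written vertex."""
        stale = [k for k in self._entries if k.vertex_id == vertex_id]
        for k in stale:
            del self._entries[k]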
LIMAO: A Framework for Lifelong Modular Learned Query Optimization
Authors: Qihan Zhang (USC), Shaolin Xie (USC), Ibrahim Sabek (USC)
Abstract: Query optimizers are crucial for the overall performance of database systems. Recently, many learned query optimizers (LQOs) have demonstrated significant performance improvements over traditional ones. However, most of these optimizers operate under a limiting assumption: a static query environment. This limitation prevents them from effectively handling complex, dynamic query environments in real-world scenarios. Extensive retraining can lead to the well-known catastrophic forgetting problem when learning the model parameters, which reduces an LQO's generalizability over time. In this work, we address this limitation and introduce LIMAO (Lifelong Modular Learned Query Optimizer), a framework for lifelong learning of plan cost prediction that can be seamlessly integrated into existing LQOs. LIMAO leverages a modular lifelong learning technique, an attention-based neural network composition architecture, and an efficient training paradigm designed to retain prior knowledge while continuously adapting to new data and workloads.
POLY2VEC: Polymorphic Fourier-Based Encoding of Geospatial Objects for GeoAI Applications
Authors: Maria Despoina Siampou (USC), Jialiang Li (Roskilde University), John Krumm (USC), Cyrus Shahabi (USC), Hua Lu (Roskilde University)
Abstract: Encoding geospatial objects is fundamental for geospatial artificial intelligence (GeoAI) applications, which leverage machine learning (ML) models to analyze spatial information. Common approaches transform each object into known formats, such as images and text, for compatibility with ML models. However, this process often discards crucial spatial information, such as the object's position relative to the entire space, reducing downstream task effectiveness. Alternative encoding methods that preserve some spatial properties are often devised for specific data objects (e.g., point encoders), making them unsuitable for tasks that involve different data types (i.e., points, polylines, and polygons). To address this, we propose POLY2VEC, a polymorphic Fourier-based encoding approach that unifies the representation of geospatial objects while preserving the essential spatial properties. POLY2VEC incorporates a learned fusion module that adaptively integrates the magnitude and phase of the Fourier transform for different tasks and geometries. We evaluate POLY2VEC on five diverse tasks, organized into two categories. The first empirically demonstrates that POLY2VEC consistently outperforms object-specific baselines in preserving three key spatial relationships: topology, direction, and distance. The second shows that integrating POLY2VEC into a state-of-the-art GeoAI workflow improves performance on two popular tasks: population prediction and land use inference.
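For intuition, a minimal sketch of Fourier-based encoding for a single 2D point (illustrative only; POLY2VEC's actual transform handles polylines and polygons, chooses frequencies systematically, and fuses magnitude and phase with a learned module):

import numpy as np

def fourier_point_features(point, freqs):
    """Evaluate a 2D point's Fourier transform exp(-i * w . p) at sampled frequencies,
    returning magnitude and phase channels that a fusion module could combine."""
    point = np.asarray(point)             # shape (2,)
    freqs = np.asarray(freqs)             # shape (K, 2), sampled frequency vectors
    values = np.exp(-1j * (freqs @ point))  # shape (K,)
    return np.abs(values), np.angle(values)

magnitude, phase = fourier_point_features([3.0, 4.0], np.random.randn(16, 2))

Note that for a single point the magnitude channel is identically 1 and position is carried entirely by the phase; for polylines and polygons the transform integrates over the geometry, so both channels become informative.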
TrajGPT: Controlled Synthetic Trajectory Generation Using a Multitask Transformer-Based Spatiotemporal Model
Authors: Shang-Ling Hsu (USC), Emmanuel Tung (Novateur Research Solutions), John Krumm (USC), Cyrus Shahabi (USC), Khurram Shafique (Novateur Research Solutions)
Abstract: Human mobility modeling from GPS trajectories and synthetic trajectory generation are crucial for various applications, such as urban planning, disaster management, and epidemiology. Both of these tasks often require filling gaps in a partially specified sequence of visits, a new problem that we call "controlled" synthetic trajectory generation. Existing methods for next-location prediction or synthetic trajectory generation cannot solve this problem as they lack the mechanisms needed to constrain the generated sequences of visits. Moreover, existing approaches (1) frequently treat space and time as independent factors, an assumption that fails to hold in real-world scenarios, and (2) suffer from poor temporal-prediction accuracy as they fail to deal with mixed distributions and the inter-relationships of different modes with latent variables (e.g., day-of-the-week). These limitations become even more pronounced when the task involves filling gaps within sequences instead of solely predicting the next visit. We introduce TrajGPT, a transformer-based, multi-task, joint spatiotemporal generative model to address these issues. Taking inspiration from large language models, TrajGPT poses the problem of controlled trajectory generation as that of text infilling in natural language. TrajGPT integrates the spatial and temporal models in a transformer architecture through a Bayesian probability model that ensures that the gaps in a visit sequence are filled in a spatio-temporally consistent manner. Our experiments on public and private datasets demonstrate that TrajGPT not only excels in controlled synthetic visit generation but also outperforms competing models in next-location prediction: relative to these models, TrajGPT achieves a 26-fold improvement in temporal accuracy while retaining more than 98% of spatial accuracy on average.
Exploiting Polygon Metadata to Understand Raster Maps - Accurate Polygonal Feature Extraction
Authors: Fandel Lin (USC), Craig A. Knoblock (USC), Basel Shbita (USC), Binh Vu (USC), Zekun Li (University of Minnesota), Yao-Yi Chiang (University of Minnesota)
Abstract: Locating undiscovered deposits of critical minerals requires accurate geological data. However, most of the 100,000 historical geological maps of the United States Geological Survey (USGS) are in raster format, which hinders critical mineral assessment. We target the problem of extracting geological features represented as polygons from raster maps. To extract these features, we exploit polygon metadata that provides information on the geological features, such as the map keys indicating how the polygon features are represented. We present a metadata-driven machine-learning approach that encodes the raster map and map key into a series of bitmaps and uses a convolutional model to learn to recognize the polygon features. We evaluated our approach on USGS geological maps; it achieves a median F1 score of 0.809 and outperforms state-of-the-art methods by 4.52%.
LLM-based Declarative Knowledge Base Construction
Authors: Shaolin Xie (USC), Ibrahim Sabek (USC)
Abstract: Knowledge base construction traditionally relies on structured rule-based frameworks such as DeepDive, which require expert-defined inference rules in DDlog. This dependence on domain expertise and formal specification creates a barrier to broader adoption. We propose a novel approach that leverages large language models (LLMs) to enable a fully declarative, natural language-driven knowledge base construction pipeline. Our system translates user-provided natural language descriptions of inference rules and entity relationships into executable DDlog representations, removing the need for specialized programming expertise. Additionally, we incorporate LLM-based fact-checking to validate the inferred knowledge base, improving reliability and reducing the risk of propagated errors. By replacing rigid, expert-driven rule specification with an intuitive, language-based interface, our approach makes knowledge base construction more accessible and adaptable across domains.
Q2O: Quantum-augmented Query Optimizer
Authors: Hanwen Liu (USC), Ibrahim Sabek (USC), Federico Spedalieri (USC)
Abstract: The join order (JO) problem is one of the key challenges in query optimization. Many heuristic methods trade off plan quality to reduce the exponential search space. Recently, quantum-based methods were proposed to leverage quantum mechanisms to accelerate exploration. However, most of these approaches treat the JO problem in isolation and primarily execute the algorithms offline. In this work, we present the first (to the best of our knowledge) Quantum-augmented Query Optimizer (Q2O) that integrates quantum computing into the end-to-end database workflow.
Conformal Prediction for Verifiable Learned Query Optimization
Authors: Hanwen Liu (USC), Shashank Giridhara (Amazon), Ibrahim Sabek (USC)
Abstract: Query optimization is critical in relational databases. Recently, numerous Learned Query Optimizers (LQOs) have been proposed, demonstrating superior performance over traditional hand-crafted query optimizers after short training periods. However, the opacity and instability of machine learning models have limited their practical adoption. To address this issue, we are the first to formulate LQO verification as a Conformal Prediction (CP) problem. We first construct the CP model and obtain user-controlled bounded ranges for the actual latency of LQO plans before execution. Then, we introduce CP-based runtime verification along with violation handling to ensure performance during execution. For both scenarios, we further extend our framework to handle distribution shifts in dynamic environments using adaptive CP approaches. Finally, we present a CP-guided plan search, which uses actual latency upper bounds from CP to heuristically guide query plan construction. We integrated our verification framework into three LQOs (Balsa, Lero, and RTOS) and conducted evaluations on the JOB and TPC-H workloads. Experimental results demonstrate that our method is both accurate and efficient. Our CP-based approaches achieve tight upper bounds and reliably detect and handle violations. Adaptive CP maintains accurate confidence levels even in the presence of distribution shifts, and the CP-guided plan search improves both query plan quality (up to 9.84x) and planning time, with a reduction of up to 74.4% for a single query and 9.96% across all test queries from trained LQOs.
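A minimal sketch of split conformal prediction for a one-sided latency bound, assuming a held-out calibration set of (predicted, actual) latencies (illustrative only; the paper's score functions, runtime verification, and adaptive variants are more involved):

import numpy as np

def calibrate_upper_bound(predicted, actual, alpha=0.1):
    """Fit a conformal correction q so that actual <= predicted + q holds with
    roughly (1 - alpha) probability on exchangeable future queries."""
    scores = np.asarray(actual) - np.asarray(predicted)   # nonconformity: under-prediction error
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))                # conformal quantile index
    return np.sort(scores)[min(k, n) - 1]

def latency_upper_bound(predicted_latency, q):
    return predicted_latency + q

# Calibrate on held-out (predicted, actual) latencies, then bound a new plan before execution.
q = calibrate_upper_bound(predicted=[10, 12, 8, 20], actual=[11, 15, 8, 19], alpha=0.25)
print(latency_upper_bound(predicted_latency=14.0, q=q))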
Optimizing Big Active Data Management Systems
Authors: Shahrzad Haji Amin Shirazi (UC Riverside), Xikui Wang (UC Irvine), Michael Carey (UC Irvine and Couchbase), Vassilis J. Tsotras (UC Riverside)
Abstract: Within the dynamic world of Big Data, traditional systems typically operate in a passive mode, processing and responding to user queries by returning the requested data. However, this methodology falls short of meeting the evolving demands of users who not only wish to analyze data but also to receive proactive updates on topics of interest. To bridge this gap, Big Active Data (BAD) frameworks have been proposed to support extensive data subscriptions and analytics for millions of subscribers. As data volumes and the number of interested users continue to increase, it is imperative to optimize BAD systems for enhanced scalability, performance, and efficiency. To this end, this paper introduces three main optimizations, namely, strategic aggregation, intelligent modifications to the query plan, and early result filtering, all aimed at reinforcing a BAD platform’s capability to actively manage and efficiently process soaring rates of incoming data and distribute notifications to larger numbers of subscribers.
Indexing in the World of Document Databases
Authors: Shahrzad Haji Amin Shirazi (UC Riverside), Ali Alsuliman (Couchbase), Michael Carey (UC Irvine and Couchbase), Vassilis J. Tsotras (UC Riverside)
Abstract: Document store databases have gained popularity for managing large volumes of semi-structured data. However, their flexible data model often lacks the advanced indexing features found in relational systems, leading to performance bottlenecks. Traditional indexing techniques don’t easily adapt to document-oriented structures, making efficient query processing a challenge. In our work, we identify key indexing limitations in document stores and propose practical solutions to address them. We implement and evaluate our methods in Apache AsterixDB, an open-source system for big data management. Our results show significant query performance improvements without compromising data ingestion speed.
SOLAR: Scalable Distributed Spatial Joins through Learning-based Optimization
Authors: Yongyi Liu (UC Riverside), Amr Magdy (UC Riverside)
Abstract: Modern spatial applications generate massive datasets that often require frequent and repetitive join queries. However, existing distributed spatial systems repeatedly partition these datasets from scratch, leading to significant overhead. We introduce SOLAR, a novel framework that accelerates distributed spatial joins by reusing previously computed partitioners. SOLAR operates in two phases. In an offline phase, it trains a Siamese Neural Network to learn dataset similarities based on inexpensive metadata embeddings, thereby approximating more complex distribution statistics. In an online phase, SOLAR uses this learned similarity to identify whether a new query’s datasets resemble previously processed ones, enabling it to retrieve and reuse a stored partitioner rather than creating a new one. This reuse avoids costly repartitioning and substantially reduces query response time. Extensive experiments on real-world datasets show that SOLAR achieves up to 3.6× faster overall join runtime and 3.14× faster partitioning compared to state-of-the-art systems. Our results demonstrate that incorporating learned similarities and partitioner reuse offers an effective and practical solution for handling repetitive spatial queries at scale.
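A minimal sketch of the online reuse decision (the embedding function, similarity threshold, and store layout here are assumptions; SOLAR's Siamese network is trained offline on dataset metadata, and its partitioner retrieval is more elaborate):

import numpy as np

class PartitionerStore:
    """Maps stored dataset embeddings to previously computed spatial partitioners."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # metadata-embedding function learned offline
        self.threshold = threshold
        self.entries = []           # list of (embedding, partitioner)

    def get_or_build(self, dataset_metadata, build_partitioner):
        e = self.embed(dataset_metadata)
        best, best_sim = None, -1.0
        for stored_e, partitioner in self.entries:
            sim = float(np.dot(e, stored_e) / (np.linalg.norm(e) * np.linalg.norm(stored_e)))
            if sim > best_sim:
                best, best_sim = partitioner, sim
        if best is not None and best_sim >= self.threshold:
            return best                      # reuse a stored partitioner: skip repartitioning
        partitioner = build_partitioner()    # fall back to partitioning from scratch
        self.entries.append((e, partitioner))
        return partitioner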
ReNUP: A Recursive Approach to Non-Uniform Graph Partitioning
Authors: AlHassan AlShareedah (UC Riverside), Amr Magdy (UC Riverside)
Abstract: Graph partitioning is crucial for managing spatial relationships in urban planning, environmental management, and similar domains. Traditional minimum k-way partitioning divides graphs into equal-sized partitions to minimize edge cuts, but this uniformity fails when regions require heterogeneous capacities (e.g., socioeconomic criteria like income or education levels). We address the Non-Uniform Graph Partitioning (NUGP) problem, where partitions follow application-specific capacity constraints, and introduce ReNUP, a recursive algorithm that adapts recursive bipartitioning for non-uniform cases. ReNUP allows a user-defined violation threshold 𝜖 to balance constraint adherence and edge-cut minimization. Experiments on nine real-world datasets demonstrate ReNUP’s effectiveness in addressing practical spatial partitioning challenges.
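A sketch of the recursive driver only, with the two-way split abstracted behind a placeholder (the bipartitioning step and its ε handling are the substance of ReNUP and are not reproduced here):

def renup(graph_nodes, capacities, bipartition, eps=0.05):
    """Recursively split nodes to match a list of non-uniform capacity targets.
    `bipartition(nodes, target_a, target_b, eps)` is a placeholder that returns two node
    sets whose sizes stay within the eps violation threshold of the targets while
    minimizing the edge cut."""
    if len(capacities) == 1:
        return [graph_nodes]
    mid = len(capacities) // 2
    target_a, target_b = sum(capacities[:mid]), sum(capacities[mid:])
    part_a, part_b = bipartition(graph_nodes, target_a, target_b, eps)
    return (renup(part_a, capacities[:mid], bipartition, eps)
            + renup(part_b, capacities[mid:], bipartition, eps))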
DiStash: A Disaggregated Multi-Stash Transactional Key-Value Store
Authors: Yiming Gao (USC), Ziqi Fang (USC), Hieu Nguyen (eBay), Jun Li (eBay), Shahram Ghandeharizadeh (USC)
Abstract: A stash is a storage medium such as dynamic random access memory (DRAM), solid state disk (SSD), hard disk drive (HDD), or non-volatile memory (NVM). This poster presents a disaggregated transactional key-value (KV) store, DiStash, that governs KVs across N stashes. Its main benefit is its use of transactions to preserve the consistency of key-value pairs across the different stashes. It simplifies the application logic by preventing (a) undesirable race conditions that may cause the content of different stashes to diverge from one another and (b) failures from resulting in the loss of key-value pairs. An application may configure a DiStash to use a stash for either temporary (cache) or permanent storage of KV pairs. A cache may be realized using either a volatile or a non-volatile stash. Similarly, permanent storage of KV pairs may be realized using either a volatile or a non-volatile stash. The application defines whether the content of its participating stashes is inclusive (replicated) or exclusive (tiered). We implement DiStash by extending FoundationDB, quantifying the tradeoffs of its design decisions using a variety of benchmarks and eBay’s production workload.
Batch-Enhanced kNN Spatial-Keyword Queries Supporting Negative Keyword Predicates
Authors: Yiyang Bian (UC Riverside), Yongyi Liu (UC Riverside), Amr Magdy (UC Riverside)
Abstract: Spatial-textual k-Nearest Neighbor (kNN) queries, which retrieve the top-k objects based on spatial and textual proximity, are popular and important in spatial databases. However, existing frameworks lack support for negative keywords (e.g., retrieving tweets containing "Chipotle" but not "Chipotle sauce") and also rely on specialized indexing. To address these limitations, we propose U-ASK. U-ASK features the TEQ index and the POWER query processor, enabling unified kNN spatial-keyword queries with negative predicates. To enhance performance, we define temporal and spatial batching strategies and introduce BPOWER, a batch-optimized variant of the query processor. Experiments on real tweet datasets demonstrate up to 80x faster runtime compared to state-of-the-art algorithms, and a 3x speedup with BPOWER over POWER.
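A simplified illustration of how a negative-keyword predicate acts as a hard filter inside spatial-keyword top-k ranking (the scoring function and linear scan here are assumptions for exposition; U-ASK answers such queries over the TEQ index rather than by scanning):

import heapq, math

def knn_with_negatives(objects, query_loc, positives, negatives, k=10, alpha=0.5):
    """objects: iterable of (location, text). Rank by a weighted mix of spatial and
    textual proximity, excluding any object whose text contains a negative keyword or phrase."""
    ranked = []
    for loc, text in objects:
        lowered = text.lower()
        if any(neg.lower() in lowered for neg in negatives):
            continue                                   # negative predicate: hard filter
        dist = math.dist(loc, query_loc)
        text_score = sum(p.lower() in lowered for p in positives) / max(len(positives), 1)
        score = alpha / (1.0 + dist) + (1 - alpha) * text_score
        ranked.append((score, loc, text))
    return heapq.nlargest(k, ranked, key=lambda r: r[0])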
Efficient Hotspot Detection in Spatial Networks Using GNNs
Authors: Ahmed Abdelmaguid (UC Riverside), Amr Magdy (UC Riverside)
Abstract: Recently, the availability of spatial data has increased due to the widespread usage of geospatial-integrated applications and location-based services. As a result, spatial datasets are growing rapidly in both volume and complexity. This growth creates a serious challenge for existing analytical algorithms, many of which struggle to scale efficiently to such large datasets. One important task impacted by this trend is hotspot detection in spatial networks, a process that plays an effective role in decision-making across various domains and applications. Current solutions typically fall short in one of two ways: some provide high accuracy with strong statistical evidence but do not adapt to large-scale data, while others scale efficiently but often lack precision, creating a trade-off between scalability and accuracy. In this work, we introduce a scalable and efficient hotspot detection approach that leverages graph machine learning, offering robust performance and scalability with only a minor loss in accuracy, thereby balancing this trade-off.
Learning from Uncertain Data: From Possible Worlds to Possible Models
Authors: Jiongli Zhu (UC San Diego), Su Feng (Nanjing Tech University), Boris Glavic (University of Illinois Chicago), Babak Salimi (UC San Diego)
Abstract: We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data.
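For intuition, a standard zonotope form and the abstract gradient step it enables are shown below (the paper's exact formulation, soundness conditions, and fixed-point construction differ):

Z = \left\{\, c + \textstyle\sum_{i=1}^{m} \epsilon_i \, g_i \;:\; \epsilon_i \in [-1, 1] \,\right\},
\qquad
Z_{t+1} \supseteq \left\{\, w - \eta \, \nabla_w L(w; D) \;:\; w \in Z_t,\; D \in \mathcal{D} \,\right\},

where c is the center, the g_i are generator vectors, and \mathcal{D} denotes the set of possible worlds (dataset variations). Iterating the over-approximating step until it stabilizes yields the fixed point for which the paper derives closed-form solutions.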
Graph Machine Learning-based Doubly Robust Estimator for Network Causal Effects
Authors: Baharan Khatami (UC San Diego), Harsh Parikh (Johns Hopkins University), Haowei Chen (UC San Diego), Sudeepa Roy (Duke University), Babak Salimi (UC San Diego)
Abstract: Estimating causal effects in social network data presents unique challenges due to the presence of spillover effects and network-induced confounding. While much of the existing literature addresses causal inference in social networks, many methods rely on strong assumptions about the form of network-induced confounding. These assumptions often fail to hold in high-dimensional networks, limiting the applicability of such approaches. To address this, we propose a novel methodology that integrates graph machine learning techniques with the double machine learning framework, facilitating accurate and efficient estimation of both direct and peer effects in a single observational social network. Our estimator achieves semiparametric efficiency under mild regularity conditions, enabling consistent uncertainty quantification. Through extensive simulations, we demonstrate the accuracy, robustness, and scalability of our method. Finally, we apply the proposed approach to examine the impact of Self-Help Group participation on financial risk tolerance, highlighting its practical relevance.
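For reference, the classical (non-network) doubly robust form that double machine learning builds on is shown below; the proposed estimator adapts this template to a single observational network so that both direct and peer (spillover) effects can be estimated:

\hat{\tau}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i \left( Y_i - \hat{\mu}_1(X_i) \right)}{\hat{e}(X_i)} - \frac{(1 - T_i)\left( Y_i - \hat{\mu}_0(X_i) \right)}{1 - \hat{e}(X_i)} \right],

where \hat{\mu}_t is the outcome model under treatment t and \hat{e} is the propensity model; the estimator remains consistent if either nuisance model is correctly specified.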
From Ground to Future Analysis: Efficient Processing and Analysis of Remote Sensing Data
Authors: Zhuocheng Shang (UC Riverside)
Abstract: Advancements in remote sensing technology have resulted in a rapid and significant increase in the volume of geospatial data available. Today, researchers have access to petabytes of Earth observational data, significantly impacting critical research across various fields including disaster response and monitoring, wildfire detection, energy and natural resource management, agricultural monitoring, and marine biology. Efficient analysis of raster data is crucial yet presents several significant challenges. Complex zonal statistical analyses, linear algebra and map algebra computations, and advanced convolution-based analyses often encounter bottlenecks, particularly due to inefficient format conversions between vector and raster data, slow data ingestion, and memory issues. Additionally, managing large-scale datasets from initial loading to query execution poses considerable difficulties, hindering timely and efficient processing. This poster presents optimized solutions designed to address these challenges, providing enhanced methods for the entire data processing pipeline. This includes optimized solutions for zonal statistical problems, efficient parallel raster query processing, and enhanced querying capabilities for large-scale geospatial datasets. The integration of machine learning methodologies has significantly enhanced remote sensing applications, enabling more sophisticated tasks such as object detection, classification, predictive modeling, road and building segmentation, and data imputation. Despite these advancements, preprocessing pipelines still require substantial improvements to integrate seamlessly with machine learning models, allowing these models to ingest data directly without extensive preprocessing steps. Looking forward, the future of remote sensing data analysis lies in achieving smoother integration with machine learning and active learning frameworks. Efforts will focus on streamlining preprocessing pipelines to enable direct data ingestion by models and developing scalable, distributed processing solutions to manage the continually increasing volumes of geospatial data. This poster comprehensively outlines current advancements, addresses existing challenges, and highlights future opportunities in the efficient, integrated analysis of remote sensing data.
Graph Theoretical Optimization for Acyclic Joins
Authors: Zheng Luo (UCLA), Wim Van den Broeck (University of Bergen), Guy Van den Broeck (UCLA), Remy Wang (UCLA)
Abstract: Finding and enumerating plans is the first step of query optimization. In this project, we first generalize Kirchhoff’s matrix-tree theorem to devise a polynomial-time algorithm that computes the exact number of join trees an acyclic query admits, which can be exponentially many. From the inductive proof of the generalized matrix-tree theorem, we derive an algorithm that efficiently enumerates all the join trees. Through experiments on real-world queries, we observe that the vast majority of predicates join two relations on only one attribute, despite the increasing number of relations involved. We name this observation the singleton property. The singleton property guarantees that any subquery of an acyclic query remains acyclic. Lastly, we propose a linear-time algorithm to find the shallowest join tree of an acyclic singleton query, which facilitates highly efficient execution of Yannakakis’ join algorithm.
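For context, the classical Kirchhoff matrix-tree theorem that the project generalizes counts spanning trees via a determinant:

t(G) = \det\!\left( L^{(v)} \right), \qquad L = D - A,

where L is the graph Laplacian (degree matrix D minus adjacency matrix A), L^{(v)} is L with the row and column of any single vertex v deleted, and t(G) is the number of spanning trees of G. The generalization replaces spanning trees of a graph with join trees of an acyclic query, counted in polynomial time even when there are exponentially many.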
Udon: Efficient Debugging of User-Defined Functions in Big Data Systems with Line-by-Line Control
Authors: Yicong Huang (UC Irvine), Zuozhi Wang (UC Irvine), Chen Li (UC Irvine)
Abstract: Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python to conduct data wrangling, statistical analysis, and machine learning. User-defined functions (UDFs) are commonly used in these systems to bridge the gap between the two ecosystems. In this paper, we propose Udon, a novel debugger to support fine-grained debugging of UDFs. Udon encapsulates modern line-by-line debugging primitives, such as the ability to set breakpoints, perform code inspections, and make code modifications while executing a UDF on a single tuple. It includes a novel debug-aware UDF execution model to ensure the responsiveness of the operator during debugging. It utilizes advanced state-transfer techniques to satisfy breakpoint conditions that span multiple UDFs. It incorporates various optimization techniques to reduce the runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show its high efficiency and scalability.
Efficient Mouse Brain Image Processing Using Collaborative Data Workflows on Texera
Authors: Yunyan Ding (UC Irvine), Yicong Huang (UC Irvine), Pan Gao (UC Irvine), Atchuth Naveen Chilaparasetti (UC Irvine), Andy Thai (UC Irvine), Gopi Meenakshisundaram (UC Irvine), Xiangmin Xu (UC Irvine), Chen Li (UC Irvine)
Abstract: In the realm of neuroscience, mapping the three-dimensional (3D) neural circuitry and architecture of the brain is important for advancing our understanding of neural circuit organization and function. This study presents a novel pipeline that transforms mouse brain samples into detailed 3D brain models using a collaborative data analytics platform called Texera. The user-friendly Texera platform allows for effective interdisciplinary collaboration between team members in neuroscience, computer vision, and data processing. Our pipeline takes tile images from a serial two-photon tomography (TissueCyte) system, stitches them into brain section images, and constructs 3D whole-brain image datasets. The resulting 3D data supports downstream analyses, including 3D whole-brain registration, atlas-based segmentation, cell counting, and high-resolution volumetric visualization. Using this platform, we implemented specialized optimization methods and obtained significant performance improvements in workflow operations. We expect that the neuroscience community can adopt our approach for large-scale image-based data processing and analysis.
Demonstration of Collaborative and Interactive Workflow-based Data Analytics in Texera
Authors: Xiaozhen Liu (UC Irvine), Zuozhi Wang (UC Irvine), Shengquan Ni (UC Irvine), Sadeem Alsudais (UC Irvine), Yicong Huang (UC Irvine), Avinash Kumar (UC Irvine), Chen Li (UC Irvine)
Abstract: Collaborative data analytics is becoming increasingly important due to the higher complexity of data science, more diverse skills from different disciplines, more common asynchronous schedules of team members, and the global trend of working remotely. In this work, we will show how Texera supports this emerging computing paradigm to achieve high productivity among collaborators with various backgrounds. Based on our active joint projects on the system, we use a scenario of social media analysis to show how a data science task can be conducted on a user-friendly yet powerful platform by a multi-disciplinary team, including domain scientists with limited coding skills and experienced machine learning experts. We will present how to do collaborative editing of a workflow and collaborative execution of the workflow in Texera. We will focus on data-centric features such as synchronization of operator schemas among the users during the construction phase, and monitoring and controlling the shared runtime during the execution phase.
SqlRewriter: Optimizing SQL Queries with Crowd-Sourced Rewriting Knowledge
Authors: Yihong Yu (UC Irvine), Jessie He (UC Irvine), Jun Xia (UC Irvine), Hartley Tran (UC Irvine), Qiushi Bai (Microsoft), Chen Li (UC Irvine)
Abstract: SQL performance optimization is critical for modern database applications, yet manually identifying and applying effective rewrites remains a challenge, especially for complex and machine-generated queries. We are developing SqlRewriter (https://sqlrewriter.io/), an online community for sharing and contributing SQL query-rewriting knowledge. SqlRewriter allows users to discover, share, and apply optimization techniques that transform poorly performing SQL queries into equivalent, highly optimized forms. It provides a language called VarSQL, which extends SQL with variables and allows users to create rewriting rules as easily as writing standard SQL queries. SqlRewriter offers a rule-generalization tool that can transform user-provided rewriting examples into formal, reusable rules. Beyond knowledge sharing, SqlRewriter also automatically applies user-contributed rewriting rules to input queries, rewriting them into more efficient forms without requiring manual intervention. Through these capabilities, SqlRewriter empowers users to collaboratively build a rich repository of optimization knowledge and enhance SQL performance across diverse applications.
SynopsesDB: A Distributed Data System Supporting In-System Data Exploration
Authors: Xin Zhang (UC Riverside)
Abstract: In the era of big data, domain experts often begin by exploring large datasets to gain insights, typically through Online Analytical Processing (OLAP), approximate, and progressive queries over hundreds of data files. However, existing distributed data processing platforms, such as Spark, are isolated from storage systems such as data warehouses and data lakes, forcing data scientists to build custom connectors to export data for analysis. We present SynopsisDB, a distributed system that enables exploration queries to run directly on storage systems without data export. SynopsisDB supports four categories of data summaries: histograms, wavelets, sketches, and samples. It enables in-system exploration by querying the compact summaries or accessing the original data when needed. It stores raw data in the data lake and maintains data synopses in the data lakehouse, bridging the gap between storage and exploration. Furthermore, SynopsisDB introduces a novel framework for maintaining and efficiently combining hundreds of synopses at query time, ensuring scalable, high-performance analytics over massive datasets.
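As a concrete example of one of the four synopsis categories, the sketch below maintains a uniform reservoir sample over a data stream in bounded memory (Vitter's Algorithm R); the histogram, wavelet, and sketch summaries follow the same build-once, query-many pattern described above:

import random

def reservoir_sample(stream, k, seed=42):
    """Maintain a uniform random sample of size k over a data stream (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)   # each of the i+1 items seen so far stays with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=100))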
Quantum Annealing on Spatial Regionalization
Authors: Yunhan Chang (UC Riverside), Amr Magdy (UC Riverside), Federico Spedalieri (USC)
Abstract: Quantum computing has demonstrated potential for solving complex optimization problems. Regionalization, which is recognized as a spatial query in the database community, remains underexplored in quantum computing. Spatial contiguity, a fundamental constraint requiring spatial entities to form connected components, significantly increases the complexity of regionalization problems, which are typically challenging for quantum modeling. We propose novel quantum formulations based on a flow model that enforces spatial contiguity constraints, and a hybrid quantum-classical approach to manage larger-scale problems within existing hardware limitations. This work establishes a foundational framework for integrating quantum methods into practical spatial optimization tasks.
MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning
Authors: Yifan Jiang (USC), Jiarui Zhang (USC), Kexuan Sun (USC), Zhivar Sourati (USC), Kian Ahrabian (USC), Kaixin Ma (Tencent AI Lab), Filip Ilievski (Vrije Universiteit Amsterdam), Jay Pujara (USC)
Abstract: While multi-modal large language models (MLLMs) have shown significant progress on popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints on numbers) that control the input shapes (e.g., digits) in a specific task configuration (e.g., a matrix). However, existing AVR benchmarks only consider a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 × 3 matrices), and they fail to capture all abstract reasoning patterns in human cognition necessary for addressing real-world tasks, such as geometric properties and object boundary understanding in real-world navigation. To evaluate MLLMs’ AVR abilities systematically, we introduce MARVEL, a multidimensional AVR benchmark founded on the core knowledge system in human cognition, with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether model performance is grounded in perception or reasoning, MARVEL complements the standard AVR questions with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with ten representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all MLLMs show near-random performance on MARVEL, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of the perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance). Although closed-source MLLMs, such as GPT-4V, show a promising understanding of reasoning patterns (on par with humans) after adding textual descriptions, this advantage is hindered by their weak perception abilities. We release our entire code and dataset at https://github.com/1171-jpg/MARVEL_AVR.
Meaningful Data Erasure in the Presence of Data Dependencies
Authors: Vishal Chakraborty (UC Irvine), Sharad Mehrotra (UC Irvine)
Abstract: Data regulations like GDPR require systems to support data erasure but leave the definition of "erasure" open to interpretation. This ambiguity makes compliance challenging, especially in databases where data dependencies can lead to erased data being inferred from remaining data. In this paper, we formally define a precise notion of data erasure that ensures any inference about deleted data, through dependencies, remains bounded to what could have been inferred before its insertion. We design erasure mechanisms that enforce this guarantee at minimal cost. Additionally, we explore strategies to balance cost and throughput, batch multiple erasures, and proactively compute data retention times when possible. We demonstrate the practicality and scalability of our algorithms using both real and synthetic datasets.