NorCal DBDay 2018 - Posters

Posters

1. Build, run, and reproduce experiments with Flor by Rolando Garcia (UC Berkeley)

Abstract: Data scientists try many ideas quickly using development environments and toolchains with which they are familiar. The open and free creative process can be at odds with the formal and systematic practice of tracking experiments. As a result, it is easy for data scientists to neglect the necessary steps for ensuring experiment reproducibility, and compromise a sound scientific method. At the other extreme, an overly bureaucratic and mentally strenuous process can hamper innovation and hurt the creation of tomorrow’s ML applications. This poster will give an overview of Flor, a system we have been developing in the RISE Lab to help manage the machine learning lifecycle. We will discuss how Flor tracks, executes, and reproduces workflows, as well as ongoing efforts to integrate Flor with Jupyter and Git to improve the pipeline development experience and scientific process.
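
Below is a minimal sketch of the kind of lightweight experiment tracking that a system like Flor automates; the `track` decorator and the append-only JSON log here are hypothetical illustrations, not Flor's actual API.

```python
# Hypothetical sketch of automated experiment tracking (not Flor's API):
# record parameters, code version, results, and runtime for each run.
import functools, json, subprocess, time

def track(log_path="experiments.jsonl"):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**params):
            try:  # capture the code version alongside the run, if git is available
                commit = subprocess.check_output(
                    ["git", "rev-parse", "HEAD"], text=True).strip()
            except Exception:
                commit = None
            start = time.time()
            result = fn(**params)
            record = {"name": fn.__name__, "params": params, "commit": commit,
                      "result": result, "seconds": round(time.time() - start, 3)}
            with open(log_path, "a") as f:  # append-only log supports later replay
                f.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator

@track()
def train(lr=0.1, epochs=5):
    return {"accuracy": 0.9}  # placeholder for a real training loop

train(lr=0.05, epochs=10)  # parameters and results land in experiments.jsonl
```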


2. ExpoDB: An Exploratory Data Platform by Suyash Gupta, Domenic Cianfichi, Patrick Liao, Shreenath Iyer, Mohammad Sadoghi (UC Davis)

Abstract: ExpoDB is a distributed ledger that unifies secure transactional and real-time analytical processing, all centered around a democratic and decentralized computational model. The goals of ExpoDB are to facilitate the design of new database principles, algorithms, and architectures, and to develop a variety of efficient transactional capabilities, storage architectures, and access methods. We envision ExpoDB serving as a platform to foster "creativity".


3. L-Store: A Real-time OLTP and OLAP System by Mohammad Sadoghi (UC Davis)

Abstract: To derive real-time actionable insights from data, it is important to bridge the gap between managing data that is being updated at a high velocity (i.e., OLTP) and analyzing a large volume of data (i.e., OLAP). However, there has been a divide in which specialized solutions are often deployed to support either OLTP or OLAP workloads but not both, thus limiting analysis to stale and possibly irrelevant data. In this work, we present Lineage-based Data Store (L-Store), which combines the real-time processing of transactional and analytical workloads within a single unified engine by introducing a novel update-friendly lineage-based storage architecture. By exploiting the lineage, we develop a contention-free and lazy staging of columnar data from a write-optimized form (suitable for OLTP) into a read-optimized form (suitable for OLAP) in a transactionally consistent manner that supports querying and retaining both current and historic data.
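
The following toy sketch (our own illustration, not the authors' code) shows the core lineage idea: updates are appended as tail records against an untouched base record, reads replay the lineage, and a lazy merge later folds the tails into a new read-optimized base.

```python
# Toy sketch of lineage-based storage: write-optimized tail records over a
# read-optimized base, with a lazy background merge.
class LineageStore:
    def __init__(self):
        self.base = {}    # key -> dict of column values (read-optimized)
        self.tails = {}   # key -> list of partial updates (write-optimized)

    def insert(self, key, record):
        self.base[key] = dict(record)
        self.tails[key] = []

    def update(self, key, **cols):
        self.tails[key].append(cols)      # contention-free append; base untouched

    def read(self, key):
        merged = dict(self.base[key])
        for delta in self.tails[key]:     # replay the lineage of updates
            merged.update(delta)
        return merged

    def lazy_merge(self, key):
        self.base[key] = self.read(key)   # consolidate into a new base version
        self.tails[key] = []

store = LineageStore()
store.insert("r1", {"qty": 10, "price": 2.5})
store.update("r1", qty=12)
assert store.read("r1") == {"qty": 12, "price": 2.5}
```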


4. EasyCommit: A Non-blocking Two-phase Commit Protocol by Suyash Gupta, Mohammad Sadoghi (UC Davis)

Abstract: To ensure an efficient commit process, the database community has mainly relied on the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures. This necessitated the development of the non-blocking three-phase commit (3PC) protocol. However, the database community remains reluctant to use the 3PC protocol, as it acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present Easy Commit, which leverages the best of both worlds (2PC and 3PC): it is non-blocking (like 3PC) yet requires only two phases (like 2PC). Easy Commit achieves these goals through two key design principles: (i) first transmit and then commit, and (ii) message redundancy. We present the design of the Easy Commit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of the Easy Commit protocol and show that it is nearly as efficient as the 2PC protocol.
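
A simplified, single-process sketch of the "first transmit and then commit" rule is shown below; it is illustrative only (message passing is simulated with method calls), but it conveys why forwarding the decision before applying it keeps the protocol non-blocking.

```python
# Illustrative sketch: each participant forwards the coordinator's decision to
# its peers before applying it locally, so a copy of the decision survives
# coordinator failure and no participant is left blocked.
class Participant:
    def __init__(self, name, peers):
        self.name, self.peers, self.state = name, peers, "READY"

    def receive_decision(self, decision):
        if self.state != "READY":
            return
        for peer in self.peers:          # (i) transmit the decision first
            peer.deliver(decision)
        self.state = decision            # (ii) then commit/abort locally

    def deliver(self, decision):
        if self.state == "READY":        # (ii) redundant messages fill any gaps
            self.state = decision

nodes = [Participant(f"p{i}", []) for i in range(3)]
for n in nodes:
    n.peers = [m for m in nodes if m is not n]
nodes[0].receive_decision("COMMIT")      # coordinator's message reaches only p0
assert all(n.state == "COMMIT" for n in nodes)
```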


5. EmbedS: Scalable and Semantic-Aware Knowledge Graph Embeddings by Gonzalo I. Diaz (Oxford U.), Achille Fokoue (IBM Research), Mohammad Sadoghi (UC Davis)

Abstract: While the growing corpus of knowledge is now being encoded in the form of knowledge graphs with rich semantics, current graph embedding models do not incorporate ontology information into the modeling. We propose a scalable and ontology-aware graph embedding model, EmbedS, which is able to capture RDFS ontological assertions. EmbedS models entities, classes, and properties differently in an RDF graph, allowing for a geometrical interpretation of ontology assertions such as type inclusion, subclassing, and the like.
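
As a rough illustration of the geometric interpretation, the sketch below models classes as balls and entities as points; this specific sphere-based parameterization is an assumption for illustration, not necessarily the exact EmbedS formulation.

```python
# Illustrative geometric reading of RDFS assertions: entities are points,
# classes are balls, and ontology assertions become containment tests.
import numpy as np

def is_instance_of(entity_vec, class_center, class_radius):
    # rdf:type holds geometrically if the entity point lies inside the class ball
    return np.linalg.norm(entity_vec - class_center) <= class_radius

def is_subclass_of(center_a, radius_a, center_b, radius_b):
    # rdfs:subClassOf holds if ball A is entirely contained in ball B
    return np.linalg.norm(center_a - center_b) + radius_a <= radius_b

cat = np.array([0.2, 0.1])
mammal_center = np.array([0.0, 0.0])
print(is_instance_of(cat, mammal_center, 0.5))                       # True
print(is_subclass_of(mammal_center, 0.5, np.array([0.0, 0.0]), 1.0)) # True: Mammal within Animal
```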


6. A Scalable Circular Pipeline Design for Multi-Way Stream Joins in Hardware by Mohammadreza Najafi (TUM), Hans-Arno Jacobsen (TUM), Mohammad Sadoghi (UC Davis)

Abstract: Efficient real-time analytics are an integral part of a growing number of data management applications such as computational targeted advertising, algorithmic trading, and the Internet of Things. In this paper, we primarily focus on accelerating stream joins, arguably one of the most commonly used and resource-intensive operators in stream processing. We propose a scalable circular pipeline design (Circular-MJ) in hardware to orchestrate multi-way joins while minimizing data flow disruption. In this circular design, each new tuple (given its origin stream) starts its processing at a specific join core and passes through all respective join cores in a pipeline sequence to produce final results. We further present a novel two-stage pipeline stream join (Stashed-MJ) that uses a best-effort buffering technique (stash) to maintain intermediate results. If an overwrite is detected in the stash, our design automatically resorts to recomputing the intermediate results.
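
The sketch below is a software analogue (not the hardware design itself) of the circular pipeline: a tuple arriving on stream i is stored in its own join core and then probes the remaining cores in pipeline order, expanding partial results as it goes.

```python
# Illustrative software analogue of a circular multi-way stream join pipeline.
from collections import defaultdict

class JoinCore:
    def __init__(self, stream_id, key):
        self.stream_id, self.key = stream_id, key
        self.window = defaultdict(list)   # join-key value -> stored tuples

    def store(self, tup):
        self.window[tup[self.key]].append(tup)

    def probe(self, partials):
        out = []
        for p in partials:                # expand each partial result
            for match in self.window.get(p[self.key], []):
                out.append({**p, **match})
        return out

cores = [JoinCore(i, "k") for i in range(3)]

def process(tup, origin):
    cores[origin].store(tup)              # start at the tuple's own core
    partials = [tup]
    for step in range(1, len(cores)):     # then pass through the other cores
        partials = cores[(origin + step) % len(cores)].probe(partials)
    return partials                       # fully joined results, if any

process({"k": 1, "a": "x"}, 0)
process({"k": 1, "b": "y"}, 1)
print(process({"k": 1, "c": "z"}, 2))     # one result joining streams 0, 1, and 2
```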


7. Distributed Caching for Processing Raw Arrays by Weijie Zhao (UC Merced)

Abstract: We introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format.


8. Adaptive Online Aggregation with Randomness Detection by Weijie Zhao (UC Merced)

Abstract: We propose an adaptive online aggregation framework that does not require heavy preprocessing (data shuffling). The proposed initialization-free method is scalable and easy to apply to big data analytics.


9. Asynchronous Stochastic Gradient Descent on GPU: Is It Really Better than CPU? by Yujing Ma (UC Merced)

Abstract: Many data analytics frameworks, e.g., TensorFlow and BIDMach, implement their compute-intensive primitives in two flavors: as multi-threaded routines for multi-core CPUs and as highly parallel kernels executed on the GPU. Stochastic gradient descent (SGD) is the most popular optimization method for model training and is implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well known, there is an intense debate on which of the many SGD variants is better in practice. We perform a comprehensive study of parallel SGD for training generalized linear models. We consider the impact of three factors (computing architecture: multi-core CPU or GPU; synchronous or asynchronous model updates; and data sparsity) on three measures: hardware efficiency, statistical efficiency, and time to convergence. In the process, we design an optimized asynchronous SGD algorithm for the GPU that leverages warp shuffling and cache coalescing for data and model access. Our CPU and GPU implementations always outperform TensorFlow and BIDMach in time to convergence, sometimes by several orders of magnitude.
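
For reference, here is a minimal CPU-side sketch of asynchronous (Hogwild-style) SGD for logistic regression; it illustrates the synchronous-versus-asynchronous update distinction discussed above and is not the authors' optimized GPU implementation.

```python
# Minimal asynchronous SGD sketch: worker threads update a shared model
# without locks, so updates may be computed against slightly stale weights.
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X @ rng.normal(size=20) > 0).astype(float)
w = np.zeros(20)                          # shared model, updated without locks

def worker(rows, lr=0.1):
    for i in rows:
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w[:] += lr * (y[i] - p) * X[i]    # asynchronous, possibly stale, update

threads = [Thread(target=worker, args=(range(t, len(X), 4),)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("training accuracy:", np.mean(((X @ w) > 0) == y))
```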


10. High Throughput Push Based Storage Manager by Ye Zhu ( UC Merced)

Abstract: The storage manager, as a key component of the database system, is responsible for organizing, reading, and delivering data to the execution engine for processing. Depending on the data serving mechanism, existing storage managers are either pull-based, which incurs high latency, or push-based, which leads to a high number of I/O requests when the CPU is busy. To address these shortcomings, we propose a push-based pre-fetching strategy in a column-wise storage manager. We implement an efficient cache layer that stores data shared among queries to reduce the number of I/O requests. The capacity of the cache is maintained by an access-time-aware eviction mechanism. Our strategy enables the storage manager to coordinate multiple queries by merging their requests and dynamically generating an optimal reading order that maximizes the overall I/O throughput. We evaluate our storage manager over both a disk-based RAID array and an NVMe SSD. With the high read performance of the SSD, we successfully minimize the total reading time and the number of I/O accesses.
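
The sketch below illustrates, in a deliberately simplified and hypothetical form, two of the ideas above: merging overlapping column requests from concurrent queries, and a cache whose eviction is driven by access recency.

```python
# Illustrative sketch: request merging across queries and an access-time-aware cache.
from collections import OrderedDict

def merge_requests(requests):
    # requests: (query_id, column) pairs; each distinct column is read only once
    wanted = {}
    for qid, col in requests:
        wanted.setdefault(col, []).append(qid)
    return wanted        # column -> queries to push the data to

class ColumnCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, col, load_fn):
        if col in self.data:
            self.data.move_to_end(col)        # refresh the access time
        else:
            if len(self.data) >= self.capacity:
                self.data.popitem(last=False) # evict the least recently used column
            self.data[col] = load_fn(col)     # single shared read for all queries
        return self.data[col]

print(merge_requests([("q1", "price"), ("q2", "price"), ("q2", "qty")]))
# {'price': ['q1', 'q2'], 'qty': ['q2']}
```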


11. Novel Selectivity Estimation Strategy for Modern DBMS by Jun Hyung Shin (UC Merced)

Abstract: Selectivity estimation is important in query optimization; however, accurate estimation is difficult when predicates are complex. Since existing database synopses and statistics are not helpful in such cases, we introduce a new approach that computes the exact selectivity by running an aggregate query during the optimization phase. Exact selectivity can be obtained without significant overhead for in-memory and GPU-accelerated databases by adding extra query execution calls. We implement selection push-down based on this novel selectivity estimation strategy in the MapD database system. Our approach incurs a constant overhead of less than 30 milliseconds in all circumstances when running on the GPU. The novel strategy successfully generates better query execution plans, which result in up to 68.2 times faster execution for queries with complicated predicates from the TPC-H benchmark SF-50 dataset.
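
To make the idea concrete, the snippet below uses SQLite as a stand-in for an in-memory or GPU-accelerated engine: during optimization, a cheap COUNT(*) aggregate over the predicate yields the exact selectivity instead of an estimate from statistics. The schema and predicate are made up for illustration.

```python
# Illustrative example: compute exact selectivity with an aggregate query
# during the optimization phase, instead of relying on statistics.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (l_quantity REAL, l_discount REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [(q, d / 100) for q in range(1, 51) for d in range(0, 10)])

predicate = "l_quantity < 24 AND l_discount BETWEEN 0.05 AND 0.07"
matching = conn.execute(
    f"SELECT COUNT(*) FROM lineitem WHERE {predicate}").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM lineitem").fetchone()[0]
selectivity = matching / total          # exact, not an estimate
print(f"exact selectivity: {selectivity:.3f}")
```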


12. Towards Not Re-Inventing the Wheel: Managing Data Management Tools by Kathryn Dahlgren (UC Santa Cruz)

Abstract: NoSQL databases offer powerful abstractions for querying non-relational data. However, NoSQL products generally pursue superior flexibility, customizability, scalability, and performance goals while neglecting support for generally useful data management tools. In particular, products typically ship without integrated support for management features rendered conventional by the long history of RDBMSs, such as sophisticated query processing systems, join operations, aggregate functions, and integrity constraints. This design decision forces users of NoSQL technologies to find alternative methods for providing the missing tools, engaging either directly or indirectly in a suboptimal k-implementation cycle as developers re-invent the same data management tools across NoSQL products. This work articulates the problem associated with the lax regard for data management support currently defining the class of NoSQL databases and introduces the Piper package index and management system as an exploratory solution.


13. Distributed Query-Aware Quantization for High-Dimensional Similarity Searches by Gheorghi Guzun (SJSU)

Abstract: In this project we design a Query-dependent Equi-Depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time: localized scores are generated for the closest p fraction of the points, while a constant penalty is applied to the rest of the points. QED not only improves the quality of the distance metric but also improves query time performance by filtering out non-relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential-scan nearest neighbor queries over datasets with hundreds of dimensions.
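
The sketch below gives our reading of the per-dimension scoring rule in simplified form (rank-based scores for the closest p fraction per dimension, a flat penalty otherwise); the exact scoring and penalty choices in QED may differ.

```python
# Simplified sketch of query-dependent per-dimension scoring: the closest
# p fraction of points gets localized rank scores, the rest a constant penalty.
import numpy as np

def qed_scores(data, query, p=0.1, penalty=None):
    n, d = data.shape
    k = max(1, int(p * n))
    penalty = penalty if penalty is not None else k   # flat cost for far points
    total = np.zeros(n)
    for j in range(d):
        dist = np.abs(data[:, j] - query[j])          # per-dimension distance
        order = np.argsort(dist)
        score = np.full(n, penalty, dtype=float)
        score[order[:k]] = np.arange(k)               # ranks for the closest p fraction
        total += score
    return total                                      # lower means more similar

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 100))
print(np.argsort(qed_scores(data, data[42]))[:5])     # point 42 ranks first
```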


14. MacroBase SQL: An Optimized OLAP Engine for Prioritizing Human Attention by Firas Abuzaid, Peter Kraft (Stanford University)

Abstract: As data volumes continue to grow, manual inspection has become increasingly untenable for the typical data analyst. In response, systems such as MacroBase, Data XRay, and SliceFinder have been proposed to automate the process of searching through large datasets and highlighting important subsets relevant to the user. These specialized systems, however, are designed explicitly for this goal, without providing other functionalities commonly found in OLAP databases, such as selections, projections, and JOINs. MacroBase SQL is a high-performance OLAP engine that unifies these two capabilities: with our proposed DIFF operator, users can seamlessly integrate the anomaly explanation queries found in MacroBase with other standard SQL operators. To evaluate DIFF queries efficiently, we implement several logical and physical optimizations in MacroBase SQL, allowing it to scale to tens of millions of rows on a single node and billions of rows in the distributed setting.
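
As a toy illustration of what a DIFF-style query computes, the function below surfaces attribute values that are over-represented among outliers relative to inliers; it is our own simplified approximation, not MacroBase SQL's actual operator or syntax.

```python
# Toy "diff"-style explanation: rank attribute values by how over-represented
# they are in the outlier set compared to the inlier set.
from collections import Counter

def diff(outliers, inliers, attribute, min_ratio=2.0):
    out_counts, in_counts = Counter(), Counter()
    for row in outliers:
        out_counts[row[attribute]] += 1
    for row in inliers:
        in_counts[row[attribute]] += 1
    results = []
    for value, o in out_counts.items():
        out_rate = o / len(outliers)
        in_rate = in_counts[value] / len(inliers) if in_counts[value] else 1 / len(inliers)
        if out_rate / in_rate >= min_ratio:           # risk-ratio style ranking
            results.append((value, round(out_rate / in_rate, 2)))
    return sorted(results, key=lambda r: -r[1])

outliers = [{"app_version": "v2.1"}] * 8 + [{"app_version": "v2.0"}] * 2
inliers  = [{"app_version": "v2.0"}] * 90 + [{"app_version": "v2.1"}] * 10
print(diff(outliers, inliers, "app_version"))          # v2.1 explains the outliers
```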


15. Ground: A Data Context Service by Vikram Sreekanti (UC Berkeley)

Abstract: Modern big data ecosystems are sorely lacking context surrounding the use of data in an organization; this includes what data an organization has, who is using that data, when and how the data is changing, and where the data is moving. Data context services are broadly applicable, with use cases in data inventory, machine learning lifecycle management, workflow reproducibility, and collective governance.


16. Anna: A KVS For Any Scale by Chenggang Wu (UC Berkeley)

Abstract: Modern cloud providers offer dense hardware with multiple cores and large memories, hosted in global platforms. This raises the challenge of implementing high-performance software systems that can effectively scale from a single core to multicore to the globe. Conventional wisdom says that software designed for one scale point needs to be rewritten when scaling up by 10−100×. In contrast, we explore how a system can be architected to scale across many orders of magnitude by design. We explore this challenge in the context of a new key-value store system called Anna: a partitioned, multi-mastered system that achieves high performance and elasticity via wait-free execution and coordination-free consistency. Our design rests on a simple architecture of coordination-free actors that perform state update via merge of lattice-based composite data structures. We demonstrate that a wide variety of consistency models can be elegantly implemented in this architecture with unprecedented consistency, smooth fine-grained elasticity, and performance that far exceeds the state of the art.
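
A toy example of the lattice-based merging is sketched below using a last-writer-wins register; Anna composes richer lattices, but the key property is the same: merge is associative, commutative, and idempotent, so replicas can exchange state in any order without coordination.

```python
# Last-writer-wins register lattice: a minimal example of coordination-free merging.
class LWWRegister:
    def __init__(self, timestamp=0, value=None):
        self.timestamp, self.value = timestamp, value

    def merge(self, other):
        # keep whichever write carries the larger timestamp
        if other.timestamp > self.timestamp:
            return LWWRegister(other.timestamp, other.value)
        return LWWRegister(self.timestamp, self.value)

replica_a = LWWRegister(5, "foo")
replica_b = LWWRegister(9, "bar")
# either merge order converges to the same state, with no locks or quorums
assert replica_a.merge(replica_b).value == replica_b.merge(replica_a).value == "bar"
```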


17. UX Consistency in Interactive Visualizations, a Distributed Systems Perspective by Yifan Wu (UC Berkeley)

Abstract: Interactive visualizations are increasingly connected to large datasets with real-time streaming updates, and they afford fluid interactions that frequently request data from remote sources and update the UI in sophisticated ways. Each of these processes (data requests, streaming updates, user reactions, and user interactions) occurs over a span of time. Visualizations with multiple interactive timespans can confuse the user about what they have selected and what results they are seeing, breaking the entire data analysis system. This poster describes a model of how visualization state changes in response to new interaction and data events, which provides not only a foundation for understanding design considerations but also programming abstractions to implement these designs efficiently. We share some initial explorations of the design space with large-scale user studies and a prototype DSL embedded in SQL running in the browser.
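
One concrete policy such a model can express is sketched below: each data request is tagged with the interaction version that produced it, and responses for an older version are dropped rather than rendered, so the display always corresponds to the latest selection. The names and structure are illustrative, not the prototype DSL.

```python
# Illustrative sketch: version-tag asynchronous data requests so stale
# responses are discarded instead of overwriting the latest selection's view.
class VisState:
    def __init__(self):
        self.interaction_version = 0   # bumped on every brush/selection
        self.rendered = None

    def on_interaction(self, selection):
        self.interaction_version += 1
        return {"selection": selection, "version": self.interaction_version}

    def on_response(self, response):
        if response["version"] < self.interaction_version:
            return False               # stale result: the user has moved on
        self.rendered = response["data"]
        return True

vis = VisState()
req1 = vis.on_interaction("2017")
req2 = vis.on_interaction("2018")     # user brushes again before req1 returns
assert not vis.on_response({**req1, "data": [1, 2, 3]})   # dropped as stale
assert vis.on_response({**req2, "data": [4, 5, 6]})       # rendered
```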