The following posters will be displayed and presented by the students during the coffee and lunch breaks. A few more posters, whose authors opted out of being listed here, will also be presented.
A Parallel and Scalable Processor for JSON Data. Christina Pavlopoulou (UCR), E. Preston Carman, Jr (UCR), Till Westmann (Couchbase), Michael J. Carey (UCI), Vassilis J. Tsotras (UCR)
Abstract: Increasing interest in JSON data has created a need for its efficient processing. Although JSON is a simple data exchange format, querying it is not always effective, especially in the case of large repositories of data. This work aims to integrate the JSONiq extension to the XQuery language specification into an existing query processor (Apache VXQuery) to enable it to query JSON data in parallel. VXQuery is built on top of Hyracks (a framework that generates parallel jobs) and Algebricks (a language-agnostic query algebra toolbox) and can process data on the fly, in contrast to other well-known systems, which need to load data first; the extra cost of data loading is thus eliminated. In this paper, we implement three categories of rewrite rules which exploit the features of the above platforms to handle path expressions efficiently and to introduce intra-query parallelism. We evaluate our implementation using a large (803GB) dataset of sensor readings. Our results show that the proposed rewrite rules lead to efficient and scalable parallel processing of JSON data.
A Pipeline for Political Network Analytics. Amarnath Gupta (UCSD SDSC), Haoran Sun (UCSD), Jennifer De La Cruz (UCSD), Pavan Jahagirdhar (UCSD), Susram Kanamarlapudi (UCSD), Subhasis Dasgupta (UCSD SDSC)
Abstract: A political network is a graph representing relationships among politicians and their opinions toward different topics. We present a political network analytics pipeline that processes user queries, applies NLP techniques to build a political network graph from political news articles, and performs data ingestion.
ASTRO: A Datalog System for Advanced Stream Reasoning. Ariyam Das (UCLA), Sahil M. Gandhi (UCLA), Carlo Zaniolo (UCLA)
Abstract: The rise of the Internet of Things (IoT) and the recent focus on a gamut of ‘Smart City’ initiatives world-wide have pushed for new advances in data stream systems to (1) support complex analytics and evolving graph applications as continuous queries, and (2) deliver fast and scalable processing on large data streams. Unfortunately, current continuous query languages (CQLs) lack the features and constructs needed to support the more advanced applications. For example, recursive queries are now part of SQL, Datalog, and other query languages, but they are not supported by most CQLs, a fact that causes a significant loss of expressive power, further aggravated by the limitation that only non-blocking queries can be supported. To overcome these limitations, we have developed ASTRO, an advanced stream reasoning system that builds on recent advances in supporting aggregates in recursive queries. We will primarily show that the formal Streamlog semantics, when combined with the Pre-Mappability (PreM) concept, allows the declarative specification of many complex continuous queries, which are then efficiently executed in real-time by the portable ASTRO architecture. Using different case studies, we demonstrate (i) the ease-of-use, (ii) the expressive power, and (iii) the robustness of our system, as compared to other state-of-the-art declarative CQL systems.
AWESOME: Data Exploration Platform for Heterogeneous Social Data. Junan Guo (UCSD), Amarnath Gupta (UCSD SDSC), Subhasis Dasgupta (UCSD SDSC)
Abstract: AWESOME is a data exploration platform for heterogeneous social data that gives the user a comprehensive interface to gain insight into the data. AWESOME is powered by a middleware we developed called Boutique, which not only provides data for the visualizations but also periodically acquires data from the database to keep the visualizations current. In addition, Boutique computes aggregate data to show an overview and trends of the data. From the interface, the user can chain visualizations together by adding items to a shopping cart and exporting the data to other visualization and computational analytic tools such as Zenvisage. AWESOME also takes advantage of speculative query processing as the user interacts with the interface to guarantee a smooth and responsive user experience. Currently, we are working with users to better understand the interactions and improve the functionality as well as the performance.
Best Practices For Main-Memory Top-K Selection on Modern Multi-core Platforms. Vasileios Zois (UCR), Vassilis J. Tsotras (UCR), and Walid A. Najjar (UCR)
Abstract: Efficient Top-k query evaluation entails balancing random access against the cost of maintaining auxiliary information that is necessary for pruning candidate objects and reducing object evaluations. Parallel and in-memory Top-k selection presents a novel challenge, considering that computation is shifted higher up in the memory hierarchy. In this environment, cache contention restricts the availability of auxiliary information, which in turn negatively affects algorithmic efficiency. Likewise, naïvely partitioning the data to enable parallelism may result in a higher number of object evaluations. This occurs when the partition objects are sparsely positioned in the search space, necessitating more evaluations to resolve score uncertainty for low-ranking objects with few high attribute scores. In this work, we present three Top-k processing models and focus on quantifying score uncertainty in order to identify relevant practices suited for parallel in-memory execution. In conjunction with these efforts, we develop two algorithms (VTA and SLA) which showcase the disadvantages of previous work in our new setting. This realization motivates the design of a new algorithm (PTA) optimized to reduce score uncertainty by leveraging angle-space partitioning. We show theoretically and experimentally that this method exhibits superior performance when combined with optimizations targeted at improving the stopping threshold. Our extensive experimental evaluation using both real and synthetic data indicates that the proposed method achieves between 2 and 4 orders of magnitude better query latency and throughput when compared to state-of-the-art methods refitted for efficient parallel in-memory execution.
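As background for readers unfamiliar with threshold-style Top-k processing, the sketch below shows a minimal Fagin-style threshold algorithm in Python; the list layout, summation scoring function, and names are illustrative assumptions, not the paper's VTA, SLA, or PTA implementations.

```python
import heapq

def top_k(lists, k):
    """lists: one list of (score, obj_id) pairs per attribute, each sorted
    in descending score order; the total score is assumed to be the sum."""
    ra = [{obj: s for s, obj in lst} for lst in lists]   # random-access maps
    seen, heap = set(), []                               # min-heap of top-k
    for depth in range(max(len(lst) for lst in lists)):
        frontier = []                # per-attribute scores at this depth
        for lst in lists:
            if depth >= len(lst):
                continue
            score_i, obj = lst[depth]
            frontier.append(score_i)
            if obj not in seen:      # random access: compute the full score
                seen.add(obj)
                total = sum(m.get(obj, 0.0) for m in ra)
                heapq.heappush(heap, (total, obj))
                if len(heap) > k:
                    heapq.heappop(heap)
        # Threshold: best total score any still-unseen object could achieve.
        if len(heap) == k and heap[0][0] >= sum(frontier):
            break                    # safe to stop early
    return sorted(heap, reverse=True)
```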
Big Active Data: from Petabytes to Megafolks in Milliseconds. Xikui Wang (UCI), Yusuf Sarwar (UCI), Hang Nguyen (UCI), Michael Carey (UCI), Nalini Venkatasubramanian (UCI), Steven Jacobs (UCR), Vassilis Tsotras (UCR), Vagelis Hristidis (UCR)
Abstract: While most data systems are passive in nature, serving data only in response to user queries, we envision a move towards an active paradigm, called Big Active Data (BAD), where data is continuously ingested, matched, and delivered to a set of interested users. Unlike publish/subscribe systems, where users subscribe to messages, in BAD users subscribe to data: data in a single incoming stream as well as data in relationships to other data streams or previously stored data. We are creating a BAD platform that can ingest data continuously and reliably, support declarative subscriptions over the incoming data, and provide timely delivery of relevant events to a large number of interested users. By utilizing modern techniques from distributed databases and publish/subscribe systems, we hope this platform can provide users a simpler and more efficient way of developing scalable BAD applications.
Cloudberry: A Middleware Solution for Big Data Visualization. Qiushi Bai (UCI), Taewoo Kim (UCI), Chen Li (UCI)
Abstract: We are developing a general-purpose middleware system called Cloudberry to support big data visualization. We will demonstrate an application called "Twittermap" that utilizes Cloudberry and Apache AsterixDB to support interactive exploration of 1.3 billion tweets with spatial, temporal, and keyword conditions.
Data Stream Processing: Models and Applications. Mohammad Javad Amiri (UCSB), Vaibhav Arora (UCSB), Sujaya Maiyya (UCSB), Victor Zakhary (UCSB), Divyakant Agrawal (UCSB), Amr El Abbadi (UCSB)
Abstract: Data stream processing is crucial for diverse Big Data, IoT, and Business Process Management applications. In this poster, we expound on different works developed in DSL at UCSB that model and use data stream processing to enhance efficiency, expressibility, and privacy. First, on query processing efficiency, distributed caches are widely deployed to serve social networks and web applications at billion-user scales. We present Cache on Track (CoT), an adaptive and predictive cache at the edge of a distributed cloud-based caching infrastructure. CoT uses stream processing techniques to cache heavy hitters at edge datacenters to enhance the efficiency of query answering. On a related efficiency note, workflows are typically used to model the data stream processing pipeline. We propose VEWO, an efficient verification technique for evolving workflows that incrementally checks and verifies properties such as soundness and Linear Temporal Logic (LTL). On expressibility, one of the biggest challenges of the IoT ecosystem is getting real-time insights from the diverse sensor data continuously transmitted by IoT devices. We propose model-based operators as data management abstractions that help IoT application developers integrate data from different sensors. Finally, on privacy, analysis of social media streams can reveal the characteristics of people who engage with or write about different topics. We present Locborg, a location-privacy-preserving cyborg that exploits social media stream processing to hide users’ locations while maintaining their online persona.
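As an illustration of the kind of heavy-hitter tracking that CoT applies at edge datacenters, here is a standard Misra-Gries summary in Python; it is offered as background only and is not the CoT implementation.

```python
def misra_gries(stream, k):
    """One-pass summary keeping at most k-1 counters; any item with
    frequency > len(stream)/k is guaranteed to survive in the result."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            for key in list(counters):   # decrement-all step
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# e.g. misra_gries("abacabadab", 4) keeps 'a' among the candidates,
# since 'a' occurs 5 times in a stream of length 10 (> 10/4).
```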
Efficient Data Ingestion and Query Processing for LSM-Based Storage Systems. Chen Luo (UCI), Michael J. Carey (UCI)
Abstract: In recent years, the Log Structured Merge (LSM) tree has been widely adopted by NoSQL and NewSQL systems for its superior write performance. Despite its popularity, however, most existing work has focused on LSM-based key-value stores with only a primary LSM-tree index; auxiliary structures, which are critical for supporting ad-hoc queries, have received much less attention. In this work, we focus on efficient data ingestion and query processing for general-purpose LSM-based storage systems. We first propose and evaluate a series of optimizations for efficient batched point lookups, significantly improving the range of applicability of LSM-based secondary indexes. We then present several new and efficient maintenance strategies for LSM-based storage systems. Finally, we have implemented and experimentally evaluated the proposed techniques in the context of the Apache AsterixDB system, and we present the results here.
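To make the batched point lookup idea concrete, the following minimal Python sketch sorts the primary keys returned by a secondary-index search before probing the primary index, so that consecutive probes land on the same pages; the page-cache model and function names are illustrative assumptions, not AsterixDB code.

```python
def batched_lookup(secondary_hits, page_of, fetch_page):
    """secondary_hits: primary keys returned by a secondary-index search.
    page_of(pk) names the primary-index page holding pk; fetch_page(pid)
    reads that page and returns a mapping from primary key to record."""
    records, cached_id, cached_page = [], None, None
    for pk in sorted(secondary_hits):      # sorting is the key step:
        pid = page_of(pk)                  # consecutive keys share pages,
        if pid != cached_id:               # so far fewer pages are fetched
            cached_id, cached_page = pid, fetch_page(pid)
        records.append(cached_page[pk])
    return records
```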
Enabling & Accelerating Zero-shot Recognition on Video Datasets. Yuhao Zhang (UCSD), Arun Kumar (UCSD)
Abstract: Convolutional neural network (CNN)-based object recognition systems have quickly gained popularity in the past few years. Although these systems achieve very high accuracy, they are usually very expensive due to GPU costs, limiting actual deployment and applications at scale, especially for video analytics. Moreover, because these models are primarily designed for computer vision challenges, they are not specialized for a given dataset and usually require extra effort at deployment time. Furthermore, not all of these models are capable of zero-shot recognition with an unbounded vocabulary, and model re-training must take place every time a new class is added to the vocabulary, which is strongly unfavorable. We present a unified system to enable and also accelerate zero-shot recognition with controllable trade-offs between throughput and accuracy. Given a video feed and a reference model, the system automatically sources training data from the video feed and customizes a deep cascade (a CNN with multiple intermediate output layers that can short-circuit itself to avoid overkill) to substitute for the reference model. The system features high throughput (over 300x faster than state-of-the-art models), adjustable accuracy, zero-shot readiness, and short-circuit processing.
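The short-circuiting behavior of such a deep cascade can be sketched in a few lines of Python; the stage interface and confidence threshold below are illustrative assumptions, not the paper's architecture.

```python
def cascade_predict(x, stages, threshold=0.9):
    """stages: (features_fn, classifier_fn) pairs of increasing cost;
    classifier_fn returns (label, confidence) for the features so far."""
    feats = x
    for features_fn, classifier_fn in stages:
        feats = features_fn(feats)          # run the next (cheaper) block
        label, conf = classifier_fn(feats)  # intermediate output layer
        if conf >= threshold:
            return label                    # short-circuit: skip later stages
    return label                            # fell through to the full model
```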
End-to-End Machine Learning with Apache AsterixDB. Wail Alkowaileet (UCI), Sattam Alsubaiee (Center for Complex Engineering Systems at KACST and MIT), Michael J. Carey (UCI), Chen Li (UCI), Heri Ramampiaro (Norwegian University of Science and Technology), Phanwadee Sinthong (UCI), Xikui Wang (UCI)
Abstract: Recent developments in machine learning and data science provide a foundation for extracting underlying information from Big Data. Unfortunately, current platforms and tools often require data scientists to glue together and maintain custom-built platforms consisting of multiple Big Data component technologies. In this paper, we explain how Apache AsterixDB, an open source Big Data Management System, can help to reduce the burdens involved in using machine learning algorithms in Big Data analytics. In particular, we describe how AsterixDB’s support for user-defined functions (UDFs), the availability of UDFs in data ingestion pipelines and queries, and the provision of machine learning platform and notebook inter-operation capabilities can enable data analysts to more easily create and manage end-to-end analytical dataflows.
Fault-Tolerant Global-Scale Data Management. Mohammad Javad Amiri (UCSB), Sujaya Maiyya (UCSB), Faisal Nawab (UCSC), Victor Zakhary (UCSB), Divyakant Agrawal (UCSB), Amr El Abbadi (UCSB)
Abstract: Global-Scale Data Management (GSDM) empowers systems by providing higher levels of fault-tolerance, read availability, and efficiency in utilizing cloud resources. In this poster, we expound on different works developed in DSL at UCSB that address different aspects of data geo-replication, such as transaction latency, fault-tolerance, and the pedagogical understanding of distributed consensus and commitment. First, on transaction latency, we propose Dynamic Paxos (DPaxos), a Paxos-based consensus protocol to manage access to partitioned data across globally-distributed datacenters and edge nodes. DPaxos is intended to implement State Machine Replication to facilitate data management systems for the edge. Also on transaction latency, we present GPlacer, a data placement framework that answers the question: at which datacenters should data be placed? GPlacer embeds the transaction protocol constraints into an optimization problem to derive both the data placement and the transaction protocol configuration that minimize overall transaction latency. On fault-tolerance, we present SeeMore, a hybrid replication protocol that addresses both crash and malicious failures. Finally, we propose the Consensus and Commitment (C&C) framework, which unifies consensus and commitment in a single framework. The C&C framework can model existing, well-known data management protocols as well as derive new ones. In addition, C&C is a pedagogical framework that helps in understanding different aspects of distributed consensus and commitment.
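As a toy illustration of the placement question GPlacer optimizes, the brute-force Python sketch below picks the replica set minimizing the worst client-to-quorum latency; the latency model and majority-quorum rule are simplifying assumptions, not GPlacer's formulation.

```python
from itertools import combinations

def place_replicas(latency, datacenters, clients, n_replicas):
    """latency[(client, dc)] -> round-trip time in ms."""
    def quorum_latency(client, replicas):
        # A majority quorum responds once the median-fastest replica answers.
        rtts = sorted(latency[(client, dc)] for dc in replicas)
        return rtts[len(replicas) // 2]
    # Exhaustively try every replica set; GPlacer instead solves an
    # optimization problem that also picks the protocol configuration.
    return min(
        combinations(datacenters, n_replicas),
        key=lambda reps: max(quorum_latency(c, reps) for c in clients),
    )
```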
Gemini: A Distributed Crash Recovery Protocol for Persistent Caches. Shahram Ghandeharizadeh (USC), Haoyu Huang (USC)
Abstract: Gemini is a distributed crash recovery protocol for persistent caches. When a cache instance fails, Gemini assigns other cache instances to process its reads and writes. Once the failed instance recovers, Gemini starts to recover its persistent content while using it to process reads and writes immediately. Gemini does so while guaranteeing read-after-write consistency. It also transfers the working set of the application to the recovering instance to maximize its cache hit ratio. Our evaluation shows that Gemini restores hit ratio two orders of magnitude faster than a volatile cache. Working set transfer is particularly effective with workloads that exhibit an evolving access pattern.
IoT Notary for Retroactive Consents, Compliance and Audits. Nisha Panwar (UCI), Mamadou Diallo (UCI), Sharad Mehrotra (UCI), Ardalan Amiri Sani (UCI)
Abstract: Contemporary IoT environments, such as smart buildings, require end-users to trust data collection policies published by the systems. There are several reasons why such trust may be misplaced: IoT systems may inadvertently collect data without users' knowledge, or may fall victim to cyberattacks that hijack IoT devices and transfer user data to a malicious third party, leading to the loss of individuals' privacy. To address such concerns, we propose IoT Notary, a framework to ensure trust and confidence in IoT systems and applications. IoT Notary provides secure log sealing along with a verifiable proof-of-order that an auditor can attest to validate the policy actuation, device state, or the purpose of sensing from the infrastructure's perspective. A verifier at a remote location collects the sealed logs and the paired proof via the SIGMA Authenticated Key Exchange (AKE) protocol. In addition, IoT Notary has been integrated with the IoT testbed System T under development. Experimental results on the integrated system illustrate that retroactive IoT Notary imposes nominal overhead for policy verification.
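To illustrate the flavor of log sealing with a verifiable proof-of-order, here is a minimal hash-chain sketch in Python; IoT Notary's actual sealing and SIGMA-based exchange are more involved, and all names below are illustrative.

```python
import hashlib
import json

def seal(entries):
    """Chain each log entry to its predecessor's digest."""
    sealed, prev = [], b"\x00" * 32
    for e in entries:
        payload = prev + json.dumps(e, sort_keys=True).encode()
        digest = hashlib.sha256(payload).digest()
        sealed.append({"entry": e, "digest": digest.hex()})
        prev = digest
    return sealed

def verify(sealed):
    """An auditor recomputes the chain; any edit or reordering breaks it."""
    prev = b"\x00" * 32
    for rec in sealed:
        payload = prev + json.dumps(rec["entry"], sort_keys=True).encode()
        expect = hashlib.sha256(payload).digest()
        if expect.hex() != rec["digest"]:
            return False
        prev = expect
    return True
```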
Materialization Trade-offs for Feature Transfer from Deep CNNs for Multimodal Data Analytics. Supun Nakandala (UCSD), Arun Kumar (UCSD)
Abstract: Deep Convolutional Neural Networks (CNNs) achieve near-human accuracy on many image understanding tasks. Thus, they are now increasingly used to integrate images with structured data for multimodal analytics applications. Since training deep CNNs from scratch is expensive, transfer learning has become popular: using a pre-trained CNN, one reads off a certain layer of features to represent images and combines them with other features for a downstream ML task. Since no single layer will always offer the best accuracy in general, such feature transfer requires comparing many CNN layers. The current dominant approach to this process, running deep learning toolkits such as TensorFlow on top of scalable analytics systems such as Spark and Ignite, is fraught with inefficiency due to redundant CNN inference and the potential for system crashes due to mismanaged memory. We present Vista, the first data system to mitigate such issues by elevating the feature transfer workload to a declarative level and formalizing the data model of CNN inference. Vista enables automated optimization of feature materialization trade-offs, distributed memory management, and system configuration. Real-world experiments show that apart from enabling seamless feature transfer, Vista helps improve system reliability and also reduces runtimes by up to 90%.
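The feature-transfer workload that Vista optimizes can be sketched directly against PyTorch (Vista itself runs on top of systems like Spark); the layer choice and downstream model below are illustrative assumptions, not Vista's interface.

```python
import torch
import torchvision.models as models

# Load a pre-trained CNN and read features off one layer (here: the output
# of the final pooling layer, i.e. everything before the classifier head).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
feature_extractor = torch.nn.Sequential(*list(cnn.children())[:-1])

def layer_features(images):          # images: float tensor (N, 3, 224, 224)
    with torch.no_grad():
        return feature_extractor(images).flatten(1).numpy()

# Downstream, the CNN features are concatenated with structured features
# for the ML task; X_struct, images, and y stand for the analyst's data:
#   X = numpy.hstack([layer_features(images), X_struct])
#   sklearn.linear_model.LogisticRegression(max_iter=1000).fit(X, y)
# Comparing layers means repeating this for many choices of cut point,
# which is the redundant-inference cost Vista is designed to avoid.
```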
Online Provenance for Big Graph Analytics. Vicky Papavasileiou (UCSD), Ken Yocum (UCSD), Alin Deutsch (UCSD)
Abstract: Data provenance is a powerful tool for debugging large-scale analytics on batch processing systems. This paper presents Ariadne, a first approach to capturing and querying provenance from Vertex-Centric graph processing systems. While current provenance tracing procedures support explicit debugging scenarios, such as crash-culprit determination, devising graph analytics is an iterative process in which developers assess the quality of data and algorithms in an exploratory manner even when no crash occurs. Currently, provenance tracing is performed offline, incurring large overheads, and only in the form of imperative function calls, limiting the usability of provenance. To address these challenges, Ariadne provides developers with a concise declarative query language to customize the capture and analysis of provenance. Taking advantage of the formal semantics of this Datalog-based language, we identify useful query subclasses that can run online while an analytic computes. Experiments with various analytics and real-world datasets show that the overhead of online capturing and querying is 2x over the baseline (vs. 8x for the traditional approach). These experiments also illustrate how Ariadne’s query language supports non-crashing provenance querying for graph analytics.
Outlier Detection in Non-stationary Data Streams. Luan Tran (USC), Liyue Fan (University at Albany, SUNY), Cyrus Shahabi (USC)
Abstract: Continuous outlier detection in data streams has important applications in many domains, such as climate science, transportation, and energy. The non-stationary character of real-world data streams brings the challenge of updating the outlier detection model in a timely and accurate manner. In this paper, we propose a framework for outlier detection in non-stationary data streams (O-NSD) which detects changes in the underlying data distribution to trigger a model update. We propose an improved distance metric based on Kullback-Leibler distance, two accurate change detection algorithms, and new evaluation metrics that quantify the timeliness of detected changes. We demonstrate the outlier detection framework with two outlier models, i.e., Principal Component Analysis and One-class Support Vector Machine. Our extensive experiments with real-world and synthetic datasets show that our change detection algorithms outperform the state-of-the-art PHDT, which employs the Page-Hinkley test for change detection. Our O-NSD framework offers higher accuracy and requires a much shorter running time than retrain-based and incremental-update approaches.
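For background, the core ingredient of the PHDT baseline, the Page-Hinkley test, can be written in a few lines of Python; the parameters below are illustrative, and this is not the paper's improved KL-based detector.

```python
def page_hinkley(stream, delta=0.005, lam=50.0):
    """Detect upward shifts in the stream mean; returns change indices.
    delta: tolerated magnitude of change; lam: detection threshold."""
    n, mean, m_t, m_min = 0, 0.0, 0.0, 0.0
    changes = []
    for t, x in enumerate(stream):
        n += 1
        mean += (x - mean) / n        # running mean of the current segment
        m_t += x - mean - delta       # cumulative deviation statistic
        m_min = min(m_min, m_t)
        if m_t - m_min > lam:         # deviation exceeds the threshold
            changes.append(t)
            n, mean, m_t, m_min = 0, 0.0, 0.0, 0.0  # restart after detection
    return changes
```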
Polygraph. Yazeed Alabdulkarim (USC), Marwan Almaymoni (USC), Shahram Ghandeharizadeh (USC)
Abstract: Polygraph is a tool to quantify anomalies attributable to system behavior that violates the atomicity, isolation, or linearizability properties of transactions. It is a plug-n-play framework that includes visualization tools to empower an experimentalist to (a) quickly incorporate Polygraph into an existing application or benchmark and (b) quantify the number of anomalies. We demonstrate Polygraph using existing benchmarks, including TPC-C, SEATS, TATP, YCSB, and BG. We highlight Polygraph as an online tool by showing that, for almost all benchmarks, it scales to process their transaction log records in real time.
RamSQL and its System: Superior Expressive Power, Performance and Scalability. Jiaqi Gu (UCLA), Yugo Watanabe (UCLA), William Andrea Mazza (University of Naples Federico II), Alexander Shkapsky (Workday), Ling Ding (UCLA), Carlo Zaniolo (UCLA)
Abstract: Thanks to a simple SQL extension, RamSQL can express very powerful queries and declarative algorithms, such as classical graph algorithms and data mining algorithms. A novel optimization technique allows RamSQL to map declarative queries into one basic fixpoint operator supporting aggregates in recursive queries. A fully optimized implementation of this basic operator over multiple platforms leads to superior performance, scalability, and portability. Thus, our RamSQL system, which extends Spark SQL with the aforementioned new constructs and implementation techniques, matches and often surpasses the performance of other systems, including Apache Giraph, GraphX, and Myria.
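To illustrate the query class RamSQL targets, the sqlite3 example below computes shortest paths with a standard recursive CTE: because standard SQL cannot place MIN inside the recursion, all paths are enumerated and the aggregate is applied afterwards, which is exactly the inefficiency a PreM-style fixpoint operator avoids by pushing the aggregate into the recursion. The SQL shown is plain SQL for contrast, not RamSQL syntax.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edge(src, dst, cost);
    INSERT INTO edge VALUES ('a','b',1), ('a','c',4), ('b','c',2), ('c','d',1);
""")
rows = db.execute("""
    WITH RECURSIVE path(dst, total) AS (
        SELECT dst, cost FROM edge WHERE src = 'a'
        UNION ALL
        SELECT e.dst, p.total + e.cost      -- enumerates every path
        FROM path p JOIN edge e ON e.src = p.dst
    )
    SELECT dst, MIN(total)                  -- aggregate applied outside
    FROM path GROUP BY dst ORDER BY dst
""").fetchall()
print(rows)  # [('b', 1), ('c', 3), ('d', 4)]
```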
Scaling Cryptographic Techniques by Exploiting Data Sensitivity. Sharad Mehrotra (UCI), Shantanu Sharma (UCI), Jeffrey D. Ullman (Stanford University)
Abstract: Several researchers have proposed solutions for secure data outsourcing on public clouds based on encryption, secret-sharing, and trusted hardware. Existing approaches, however, exhibit many limitations, including high computational complexity, imperfect security, and information leakage. We continue along the emerging trend in secure data processing that recognizes that an entire dataset may not be sensitive, and hence that the non-sensitivity of data can be exploited to overcome some of the limitations of existing encryption-based approaches. In particular, data and computation can be partitioned into sensitive and non-sensitive datasets. While the sensitive dataset is outsourced using any existing secure approach, the non-sensitive dataset is outsourced in cleartext. While partitioned computing can bring new efficiencies, since it does not incur (expensive) encrypted data processing costs on non-sensitive data, it can lead to information leakage. We sketch out a new approach to secure partitioned computing, which we refer to as query binning (QB), and show how QB can be used to support selection queries. Interestingly, we show that QB, in addition to scaling cryptographic solutions, can also be exploited to strengthen the security properties of weaker (but more efficient) cryptographic approaches that are prone to size, frequency-count, and workload attacks.
Smart Data Integration for IoT/Buildings. Jason Koh (UCSD), Rajesh Gupta (UCSD)
Abstract: Individual devices in an IoT environment generate a tremendous amount of data. A mid-sized building contains 5,000 to 15,000 data points that continuously generate time-stamped data. While this data has the potential to benefit building occupants through energy optimization and space utilization, the heterogeneity of its metadata is a major hurdle to the adoption of smart applications. We thus introduce Brick, a standard metadata schema for buildings, and algorithms to convert unstructured data into Brick.
SpeakQL: Towards Speech-driven Multi-modal Querying. Vraj Shah (UCSD), Side Li (UCSD), Arun Kumar (UCSD)
Abstract: Natural language and touch-based interfaces are making data querying significantly easier. But typed SQL remains the gold standard for query sophistication, although typing it is painful in many querying environments. Recent advancements in automatic speech recognition raise the tantalizing possibility of bridging this gap by enabling spoken SQL queries. In this work, we outline our vision of one such new query interface and system for regular SQL that is primarily speech-driven. We propose an end-to-end architecture for making spoken SQL querying effective and efficient and present empirical results to understand the feasibility of such an approach.
Texera: Big text analytics as a Web-based service using declarative GUI-based interactive workflows. Avinash Kumar (UCI), Zuozhi Wang (UCI), Chen Li (UCI)
Abstract: Texera is an open source system we are developing that allows data scientists and domain experts to use workflows interactively to conduct analytics on large amounts of text data. It exploits the new opportunities enabled by recent changes, including large amounts of data calling for scalable and parallel solutions, the trend toward cloud-based services, new browser-based frontend techniques, faster networks, and the latest advances in machine learning and natural language processing. Analysts can use Texera to easily formulate a workflow of predefined operators and run the job on a cluster of machines. They can also interact with the workflow during its execution by pausing the execution, investigating the state of operators, and resuming the computation. We will give an overview of the system, discuss its goals and challenges, report its current status, and present the future development plan.
VigilaDE: Avoiding False Discoveries with Hierarchical Data in Data Exploration Systems. Nikos Koulouris (UCSD), Yannis Papakonstantinou (UCSD)
Abstract: More data means more opportunities for a researcher to test more hypotheses until an interesting finding is discovered. This increases the chance of making a false discovery purely by chance and has been identified as one of the key reasons why many published research results are not reproducible. The advent of big data exploration systems only makes this problem worse. We present VigilaDE, a data exploration system that utilizes the hierarchical structure of the input data in order to control false discoveries. VigilaDE guides the exploration towards interesting discoveries, while controlling false discoveries and increasing statistical power. Through extensive experiments with real-world data and simulations, we show that a user can find up to 3.4 times more true discoveries in the data compared to the baseline.
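For background on the underlying multiple-testing problem, here is a standard (flat) Benjamini-Hochberg procedure in Python; VigilaDE's hierarchy-aware procedure is more involved, and this sketch is not its implementation.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected at false discovery rate alpha."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m, cutoff = len(p_values), 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank                 # largest rank passing the BH test
    # Reject the hypotheses with the `cutoff` smallest p-values.
    return sorted(order[:cutoff])
```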