Research Projects

NoSQL Datastores:

1- Rafiki: A middleware for parameter tuning of NoSQL datastores for dynamic metagenomics workloads

High-performance computing (HPC) applications, such as metagenomics and other big-data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL datastores, and optimizing these databases is a challenging endeavor: Cassandra alone exposes over 50 configuration parameters. As the application executes, the database workload can shift rapidly from read-heavy to write-heavy, and a system tuned with a read-optimized configuration becomes suboptimal once the workload turns write-heavy.
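
To make the read-versus-write tension concrete, here is a minimal Python sketch of switching between two Cassandra parameter presets based on the observed operation mix. The parameter names are real cassandra.yaml settings, but the values, the two presets, and the 50% threshold are illustrative assumptions, not Rafiki's actual tuning method.

```python
# Illustrative sketch only: choosing between two hypothetical Cassandra
# parameter presets based on the observed read/write mix. The parameter
# names are real cassandra.yaml settings; the values are made up.

READ_OPTIMIZED = {
    "memtable_cleanup_threshold": 0.11,      # flush memtables early, keep caches warm
    "compaction_throughput_mb_per_sec": 16,
    "concurrent_reads": 64,
}
WRITE_OPTIMIZED = {
    "memtable_cleanup_threshold": 0.50,      # batch more writes in memory
    "compaction_throughput_mb_per_sec": 64,
    "concurrent_writes": 128,
}

def pick_preset(reads: int, writes: int) -> dict:
    """Return a parameter preset for the dominant operation type."""
    read_ratio = reads / max(reads + writes, 1)
    return READ_OPTIMIZED if read_ratio >= 0.5 else WRITE_OPTIMIZED

if __name__ == "__main__":
    # A workload that shifted from read-heavy to write-heavy:
    print(pick_preset(reads=90_000, writes=10_000))
    print(pick_preset(reads=10_000, writes=90_000))
```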

2- SOPHIA: Online reconfiguration of clustered NoSQL databases for time-varying workloads

Reconfiguring NoSQL databases under changing workload patterns is crucial for maximizing database throughput. This is challenging because of the large configuration-parameter search space with complex interdependencies among the parameters. While state-of-the-art systems can automatically identify close-to-optimal configurations for static workloads, they perform poorly for dynamic workloads because they overlook three fundamental challenges: (1) estimating the performance degradation during the reconfiguration process (such as due to database restarts); (2) predicting how transient the new workload pattern will be; and (3) respecting the application's availability requirements during reconfiguration. Our solution, SOPHIA, addresses all these shortcomings using an optimization technique that combines workload prediction with a cost-benefit analyzer. SOPHIA computes the relative cost and benefit of each reconfiguration step and determines an optimal reconfiguration plan for a future time window. The plan specifies when to change configurations, and to what, to achieve the best performance without degrading data availability. We demonstrate its effectiveness for three workloads with varying levels of complexity and dynamic workload changes: a multi-tenant, global-scale metagenomics repository (MG-RAST), a bus-tracking application (Tiramisu), and an HPC data-analytics system. We compare SOPHIA's throughput and tail latency against various baselines for two popular NoSQL databases, Cassandra and Redis.
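
A minimal sketch of the cost-benefit reasoning, assuming a deliberately simple linear model: a reconfiguration pays off only if the extra throughput over the predicted workload window outweighs the work lost while servers restart. The model and all numbers are illustrative placeholders, not SOPHIA's actual analyzer.

```python
# Minimal sketch of a reconfiguration cost-benefit check. The linear model
# and the numbers below are illustrative, not SOPHIA's actual cost model.

def reconfigure_benefit(tput_new: float, tput_old: float,
                        window_s: float) -> float:
    """Extra operations served over the window if we switch configs."""
    return (tput_new - tput_old) * window_s

def reconfigure_cost(tput_old: float, restart_s: float,
                     degraded_fraction: float = 0.5) -> float:
    """Operations lost during a rolling restart: data stays available,
    but the cluster runs degraded for restart_s seconds."""
    return tput_old * restart_s * degraded_fraction

def should_reconfigure(tput_new, tput_old, window_s, restart_s) -> bool:
    return (reconfigure_benefit(tput_new, tput_old, window_s)
            > reconfigure_cost(tput_old, restart_s))

if __name__ == "__main__":
    # A transient workload shift (60 s) does not justify a 120 s restart:
    print(should_reconfigure(12_000, 10_000, window_s=60, restart_s=120))
    # A long-lived shift (1 hour) does:
    print(should_reconfigure(12_000, 10_000, window_s=3600, restart_s=120))
```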

3- OPTIMUSCLOUD: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud

Achieving cost and performance efficiency for cloud-hosted databases requires exploring a large configuration space, spanning the parameters exposed by the database and the variety of VM configurations available in the cloud. Even small deviations from an optimal configuration have significant consequences for performance and cost. Existing systems that automate cloud deployment configuration can select near-optimal instance types for homogeneous clusters of virtual machines and for stateless, recurrent data-analytics workloads. We show that to find optimal performance-per-$ cloud deployments for NoSQL database applications, it is important to (1) consider heterogeneous cluster configurations, (2) jointly optimize database and VM configurations, and (3) dynamically adjust the configuration as workload behavior changes. We present OPTIMUSCLOUD, an online reconfiguration system that can efficiently perform such joint and heterogeneous configuration for dynamic workloads. We evaluate our system with two clustered NoSQL systems, Cassandra and Redis, using three representative workloads, and show that OPTIMUSCLOUD provides 40% higher throughput/$ and 4.5× lower 99th-percentile latency on average compared to state-of-the-art prior systems CherryPick, Selecta, and SOPHIA.
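
The joint, heterogeneous search can be illustrated with a small Python sketch: enumerate (VM type, database configuration) pairs for each server group and pick the combination with the best predicted throughput/$. The VM prices, the lookup-table "performance model", and the two-group cluster are hypothetical stand-ins for OPTIMUSCLOUD's learned models.

```python
# Illustrative sketch of the joint search over VM types and database
# configurations for the best predicted throughput/$. All prices and
# throughput numbers are hypothetical stand-ins for learned models.
from itertools import product

VM_PRICE = {"c5.large": 0.085, "r5.large": 0.126}   # $/hour, illustrative
DB_CONFIGS = ("read-opt", "write-opt")

# Stand-in performance model: predicted ops/s for a server group, given its
# VM type, database configuration, and the shard it hosts (hypothetical).
PREDICTED_TPUT = {
    ("r5.large", "read-opt",  "hot"): 14_000,
    ("r5.large", "write-opt", "hot"):  9_000,
    ("c5.large", "read-opt",  "hot"): 10_000,
    ("c5.large", "write-opt", "hot"):  7_000,
    ("r5.large", "read-opt",  "cold"):  3_000,
    ("r5.large", "write-opt", "cold"):  4_500,
    ("c5.large", "read-opt",  "cold"):  3_200,
    ("c5.large", "write-opt", "cold"):  5_000,
}

def best_deployment():
    """Jointly pick (VM, DB config) per group; the groups may differ."""
    def perf_per_dollar(c):
        vm_hot, cfg_hot, vm_cold, cfg_cold = c
        tput = (PREDICTED_TPUT[(vm_hot, cfg_hot, "hot")]
                + PREDICTED_TPUT[(vm_cold, cfg_cold, "cold")])
        return tput / (VM_PRICE[vm_hot] + VM_PRICE[vm_cold])
    return max(product(VM_PRICE, DB_CONFIGS, VM_PRICE, DB_CONFIGS),
               key=perf_per_dollar)

if __name__ == "__main__":
    # With these numbers a heterogeneous cluster wins: r5/read-opt for the
    # hot shard, c5/write-opt for the cold one.
    print(best_deployment())
```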

4- Suitability of NoSQL systems (Cassandra and ScyllaDB) for IoT workloads

Motivated by the increasing trend of storing data for web applications in fast NoSQL systems, in this paper we experiment with the leading NoSQL datastore, Cassandra, and a latest-generation redesign of it, ScyllaDB, which is meant to deliver bleeding-edge performance on modern multicore machines. We evaluate ScyllaDB's scalability claim in terms of the number of clients and provide diagnostic evidence through OS-level metrics such as disk utilization and cache-miss rates. Specifically, we are motivated by the need to store large amounts of IoT-generated data in nearby datastores. Our evaluation is the first in the line of objective benchmarking of these two technologies, which are finding widespread adoption in data centers and other modern computing platforms. For example, we find hitherto unreported performance instability of ScyllaDB when the servers are replicated, and we identify the root cause of ScyllaDB's improved read performance relative to Cassandra.
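
As an example of the kind of OS-level diagnostics mentioned above, the sketch below samples disk utilization with the psutil library while a benchmark runs. The sampling loop and the busy-time approximation are our own illustration, not the paper's exact tooling.

```python
# Sketch of OS-level diagnostics: approximate disk utilization from the
# time the disk spends servicing I/O per sampling interval. psutil is a
# real library; the loop and interpretation are illustrative.
import time
import psutil

def sample_disk_busy(interval_s: float = 1.0, samples: int = 5):
    """Fraction of each interval the disk spent doing I/O (capped at 1.0,
    since concurrent reads and writes can overlap)."""
    readings = []
    prev = psutil.disk_io_counters()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = psutil.disk_io_counters()
        busy_ms = (cur.read_time + cur.write_time) - (prev.read_time + prev.write_time)
        readings.append(min(busy_ms / (interval_s * 1000.0), 1.0))
        prev = cur
    return readings

if __name__ == "__main__":
    # Run this alongside a load generator to correlate throughput dips
    # with disk saturation.
    print(["%.0f%%" % (u * 100) for u in sample_disk_busy()])
```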

Serverless Computing:

1- SONIC: Application-aware data passing for chained serverless applications

Data-analytics applications are increasingly leveraging serverless execution environments for their ease of use and pay-as-you-go billing. Such applications are usually composed of multiple functions chained together to form a workflow. The current approach to exchanging intermediate (ephemeral) data between functions is through remote storage (such as S3), which introduces significant performance overhead. We compare three data-passing methods, which we call VM-Storage, Direct-Passing, and the state-of-practice Remote-Storage. Crucially, we show that no single data-passing method prevails under all scenarios: the optimal choice depends on dynamic factors such as the size of input data, the size of intermediate data, the application's degree of parallelism, and network bandwidth. We propose SONIC, a data-passing manager that optimizes application performance and cost by transparently selecting the optimal data-passing method for each edge of a serverless workflow DAG and implementing communication-aware function placement. SONIC monitors application parameters and uses simple regression models to adapt its hybrid data passing accordingly. We integrate SONIC with OpenLambda and evaluate the system on Amazon EC2 with three analytics applications popular in the serverless environment. SONIC provides lower latency (raw performance) and higher performance/$ across diverse conditions, compared to four baselines: SAND, vanilla OpenLambda, OpenLambda with Pocket, and AWS Lambda.
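
The per-edge selection idea can be sketched as follows: predict the latency of each data-passing method from the edge's properties and pick the minimum. The three cost formulas and their coefficients are illustrative assumptions standing in for SONIC's fitted regression models.

```python
# Sketch of per-edge data-passing selection: predict each method's latency
# with a simple model and pick the minimum. Formulas and coefficients are
# illustrative assumptions, not SONIC's fitted regressions.

def predict_latency_s(method: str, data_mb: float, fanout: int,
                      bw_mbps: float) -> float:
    if method == "vm-storage":       # consumers run on the producer's VM:
        return 0.005 * data_mb * fanout   # fast local reads, lost parallelism
    if method == "direct-passing":   # point-to-point copy to each consumer VM
        return data_mb * 8 / bw_mbps * fanout
    if method == "remote-storage":   # one upload, consumers download in parallel
        return data_mb * 8 / bw_mbps * 2
    raise ValueError(method)

def pick_method(data_mb: float, fanout: int, bw_mbps: float) -> str:
    methods = ("vm-storage", "direct-passing", "remote-storage")
    return min(methods, key=lambda m: predict_latency_s(m, data_mb, fanout, bw_mbps))

if __name__ == "__main__":
    print(pick_method(data_mb=5, fanout=1, bw_mbps=1000))     # -> vm-storage
    print(pick_method(data_mb=500, fanout=16, bw_mbps=1000))  # -> remote-storage
```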

2- Janus: Benchmarking commercial and open-source cloud and edge platforms for object and anomaly detection workloads

With diverse IoT workloads, placing compute and analytics close to where data is collected is becoming increasingly important. We seek to understand the performance and cost implications of running analytics on IoT data on the various available platforms. These workloads can be compute-light, such as outlier detection on sensor data, or compute-intensive, such as object detection in video feeds obtained from drones. In our paper, JANUS, we profile the performance/$ and the compute-versus-communication cost for a compute-light IoT workload and a compute-intensive one. In addition, we weigh the pros and cons of proprietary deep-learning object-detection services, such as Amazon Rekognition, Google Vision, and Azure Cognitive Services, against open-source and tunable solutions such as Faster R-CNN (FRCNN). We find that AWS IoT Greengrass delivers at least 2X lower latency and 1.25X lower cost compared to all other cloud platforms for the compute-light outlier-detection workload. For the compute-intensive streaming video-analytics task, an open-source object-detection solution running on cloud VMs saves on dollar cost compared to the proprietary solutions from Amazon, Microsoft, and Google, but loses out on latency (up to 6X). If it runs on a low-powered edge device instead, the latency is up to 49X lower.
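
The performance/$ comparisons above boil down to simple arithmetic, sketched below. The platform names, frame rates, and prices in this snippet are made-up placeholders, not measurements from the paper.

```python
# Sketch of the performance/$ metric used to compare platforms. All names
# and numbers here are illustrative placeholders.

def perf_per_dollar(frames_per_s: float, dollars_per_hour: float) -> float:
    """Frames processed per dollar spent: (frames/s) / ($/s)."""
    return frames_per_s / (dollars_per_hour / 3600.0)

PLATFORMS = {
    # name: (frames/s, $/hour) -- illustrative values only
    "managed-vision-api": (8.0, 3.60),
    "cloud-vm-frcnn":     (4.0, 0.40),
    "edge-device-frcnn":  (1.5, 0.02),
}

if __name__ == "__main__":
    for name, (fps, price) in PLATFORMS.items():
        print(f"{name:20s} {perf_per_dollar(fps, price):10.0f} frames/$")
```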

Natural Language Processing:

1- SIMVECS: Similarity-based Vectors for Utterance Representation in Conversational AI Systems

Conversational AI systems have recently been gaining a lot of attention in both industrial and scientific domains, providing a natural way of interaction between customers and adaptive intelligent systems. A key requirement in these systems is the ability to understand the user's intent and provide adequate responses. One of the greatest challenges for language understanding (LU) services is efficient utterance (sentence) representation in vector space, an essential step for most ML tasks. In this paper, we propose a novel approach for generating vector-space representations of utterances using pairwise similarity metrics. The proposed approach uses only a few corpora to tune the weights of the similarity metric, without relying on external general-purpose ontologies. Our experiments confirm that the generated vectors can improve the performance of LU services in unsupervised, semi-supervised, and supervised learning tasks.
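
One way to picture the approach: represent each utterance as a vector of its similarities to a fixed set of anchor utterances. The sketch below uses a simple Jaccard token overlap as a stand-in for the tuned similarity metric, and the anchor utterances are invented examples.

```python
# Minimal sketch: an utterance becomes the vector of its similarities to a
# fixed set of anchors. The Jaccard overlap is a placeholder for the tuned
# pairwise similarity metric; the anchors are invented examples.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets: a simple placeholder metric."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

ANCHORS = [
    "book a flight to new york",
    "what is the weather today",
    "play some jazz music",
]

def utterance_vector(utt: str) -> list[float]:
    return [similarity(utt, anchor) for anchor in ANCHORS]

if __name__ == "__main__":
    print(utterance_vector("book me a flight to boston"))
    print(utterance_vector("is it going to rain today"))
```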

2- Word representations in vector space and their applications for Arabic

Much work has been done on giving the individual words of a language adequate representations in vector space, such that these representations capture the semantic and syntactic properties of the language. In this paper, we compare different techniques for building vector-space representations for Arabic and test these models via intrinsic and extrinsic evaluations. Intrinsic evaluation assesses the quality of the models using benchmark semantic and syntactic datasets, while extrinsic evaluation assesses their quality by their impact on two natural language processing applications: information retrieval and short-answer grading. Finally, we map the Arabic vector space to its English counterpart using a cosine-error regression neural network and show that it outperforms standard mean-squared-error regression neural networks on this task.
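
A minimal sketch of the mapping step, assuming random vectors in place of real Arabic/English embedding pairs: fit a linear map with closed-form least squares (the MSE regression), then fine-tune it by gradient descent on a cosine-distance loss.

```python
# Sketch of mapping one embedding space onto another with a linear map:
# a closed-form MSE fit, then fine-tuning under a cosine-distance loss.
# Random data stands in for the Arabic/English embedding pairs.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))                                  # source vectors
Y = X @ rng.normal(size=(d, d)) + rng.normal(size=(n, d))    # noisy targets

def mean_cos(W):
    P = X @ W
    return np.mean(np.sum(P * Y, axis=1) /
                   (np.linalg.norm(P, axis=1) * np.linalg.norm(Y, axis=1)))

# Baseline: closed-form least squares (the MSE regression).
W = np.linalg.lstsq(X, Y, rcond=None)[0]
print(f"mean cosine, MSE fit:    {mean_cos(W):.4f}")

# Fine-tune W by gradient descent on the cosine loss 1 - cos(P_i, Y_i).
lr = 0.5
for _ in range(300):
    P = X @ W
    pn = np.linalg.norm(P, axis=1, keepdims=True)
    yn = np.linalg.norm(Y, axis=1, keepdims=True)
    cos = np.sum(P * Y, axis=1, keepdims=True) / (pn * yn)
    grad_P = -(Y / (pn * yn) - cos * P / pn**2)   # d(1 - cos)/dP
    W -= lr * (X.T @ grad_P) / n
print(f"mean cosine, cosine fit: {mean_cos(W):.4f}")
```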

3- Semantic query expansion for Arabic information retrieval

Traditional keyword-based search suffers from limitations such as word-sense ambiguity and query-intent ambiguity, which can hurt precision. Semantic search uses the contextual meaning of terms, along with semantic matching techniques, to overcome these limitations. This paper introduces a query expansion approach that uses an ontology built from Wikipedia pages, in addition to other thesauri, to improve search accuracy for Arabic. Our approach outperforms the traditional keyword-based approach in terms of both F-score and NDCG.
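
A toy sketch of the expansion step, with an invented two-entry ontology and English terms in place of the Wikipedia-derived Arabic ontology: each query term is augmented with related terms for its intended sense, which is also how expansion sidesteps the word-sense ambiguity noted above.

```python
# Toy sketch of ontology-based query expansion. The ontology entries and
# English terms are invented placeholders for the Wikipedia-derived
# Arabic ontology described above.

ONTOLOGY = {
    "jaguar": {"animal": ["big cat", "panthera"], "car": ["luxury vehicle"]},
    "bank":   {"finance": ["financial institution"], "river": ["riverbank"]},
}

def expand_query(terms: list[str], sense_hints: dict[str, str]) -> list[str]:
    """Add related terms for each query term, using a hinted sense to avoid
    expanding into the wrong meaning (the word-sense ambiguity problem)."""
    expanded = list(terms)
    for t in terms:
        senses = ONTOLOGY.get(t, {})
        sense = sense_hints.get(t)
        if sense in senses:
            expanded.extend(senses[sense])
    return expanded

if __name__ == "__main__":
    print(expand_query(["jaguar", "speed"], sense_hints={"jaguar": "animal"}))
```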

Dependability of High Performance Computing Clusters:

1- The mystery of the failing jobs: Insights from operational data from two university-wide computing systems

Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction. Therefore, understanding why nodes and jobs fail in HPC clusters is essential. This paper provides analyses of node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed approximately 3.0M job-execution records from System A and 2.2M from System B, with data sources comprising accounting logs, resource usage for all primary local and remote resources (memory, IO, network), and node-failure data. We observe different kinds of correlations of failures with resource usage and propose a job-failure prediction model to trigger event-driven checkpointing and avoid wasted work. Additionally, we present user-history-based models for resource usage and runtime prediction. These models have the potential to avoid system-related issues such as contention and to improve quality of service, such as lowering mean queue time, if their predictions are used to make more informed scheduling decisions. As a proof of concept, we simulate an EASY backfill scheduler that uses the runtime predictions of one of these models and show the improvements in terms of lower mean queue time. Arising from these observations, we provide generalizable insights for cluster management to improve reliability; for example, in some execution environments local contention dominates, while in others system-wide contention dominates.
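
The sketch below illustrates the event-driven checkpointing idea, assuming synthetic data and a plain logistic-regression classifier: jobs whose predicted failure risk crosses a threshold are selected for checkpointing. The features, data, and threshold are illustrative, not the paper's fitted model.

```python
# Sketch of event-driven checkpointing: a classifier over resource-usage
# features flags jobs likely to fail, and only those are checkpointed.
# Features, synthetic data, and the threshold are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
# Features per job: [memory utilization, local I/O wait, network retries]
X = rng.uniform(size=(n, 3))
# Synthetic ground truth: high memory use plus I/O contention drive failures.
y = ((0.7 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=n)) > 0.6).astype(int)

clf = LogisticRegression().fit(X[:800], y[:800])   # train on 800 finished jobs
risk = clf.predict_proba(X[800:])[:, 1]            # score 200 running jobs

CHECKPOINT_THRESHOLD = 0.5   # illustrative operating point
to_checkpoint = np.where(risk > CHECKPOINT_THRESHOLD)[0]
print(f"{len(to_checkpoint)} of 200 running jobs selected for checkpointing")
```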