Smartpick: Workload Prediction for Serverless-enabled Scalable Data Analytics Systems

Smartpick is a serverless-enabled scalable data analytics system that exploits serverless (SL) and virtual machine (VM) resources together to realize composite benefits, i.e., agility from SL and better performance at reduced cost from VM.

Project Motivation

Many Internet applications, e.g., Facebook, Twitter and Google, run on cloud environments and generate large-scale data. For these applications, analyzing high volumes of data is one of the most important workloads. To meet performance goals, data analytics systems may deploy redundant compute resources, e.g., VMs, a priori, which incurs additional cost ($) for idle VMs. To avoid paying for unused compute resources, many previous works focused on determining optimal configurations, e.g., the number of VM instances and their types, and storage types, by predicting the compute resources that workloads require. These systems, however, may not handle latency-sensitive queries promptly because of the unavoidable overhead of VMs, i.e., bootup latency (> 55 seconds). If queries cause a peak workload that exceeds the available compute resources, they must wait until additional VM instances are fully deployed before they can be processed.

Other recent works focused on adopting SL to avoid the cold-boot latency problem. These systems, unfortunately, may still encounter cost and performance bottlenecks under data analytics workloads because SL offers worse performance at a higher cost than VM. The table below demonstrates these differences between SL and VM, with the benefits of each compute resource highlighted in bold-green.

Determining compute resource configurations, e.g., how many SL and VM instances to use, is challenging because of the following complexities:

heterogeneous compute resource characteristics
workload prediction (how long a query will be executed)
diverse cost-performance goals
dynamics from workloads

Some recent works have tried to exploit SL and VM together, but they could not address these challenges because they focused on either simple workloads (independent tasks) or simple assumptions without workload prediction. Thus, they may not work well for data analytics.

Therefore, we introduce Smartpick, a serverless-enabled data analytics (SEDA) system that helps data analytics applications achieve their desired cost-performance goals by addressing the aforementioned challenges. Smartpick uses a machine learning prediction scheme, a decision-tree-based Random Forest with a Bayesian Optimizer, to determine SL and VM configurations, i.e., how many SL and VM instances to use for queries, that meet cost-performance goals.
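To make the idea of choosing an SL and VM configuration concrete, here is a minimal Python sketch. It is not Smartpick's implementation: it swaps the Bayesian Optimizer for a plain exhaustive search, and the prices, the function names (pick_configuration, toy_predictor) and the toy duration model are all assumptions made for illustration only.

```python
from typing import Callable, Optional, Tuple

# Hypothetical per-instance prices in $/second; not taken from the source.
SL_PRICE_PER_SEC = 0.00005
VM_PRICE_PER_SEC = 0.00002


def pick_configuration(
    predict_duration: Callable[[int, int], float],
    deadline_sec: float,
    max_sl: int = 8,
    max_vm: int = 8,
) -> Optional[Tuple[int, int, float]]:
    """Return the cheapest (n_sl, n_vm, cost) whose predicted duration
    meets the deadline, or None if no configuration does.

    Exhaustive search is used here for clarity; it stands in for the
    Bayesian Optimizer mentioned in the text.
    """
    best = None
    for n_sl in range(max_sl + 1):
        for n_vm in range(max_vm + 1):
            if n_sl + n_vm == 0:
                continue  # at least one instance is needed
            duration = predict_duration(n_sl, n_vm)
            if duration > deadline_sec:
                continue  # misses the performance goal
            cost = duration * (n_sl * SL_PRICE_PER_SEC + n_vm * VM_PRICE_PER_SEC)
            if best is None or cost < best[2]:
                best = (n_sl, n_vm, cost)
    return best


def toy_predictor(n_sl: int, n_vm: int) -> float:
    """Stand-in for a trained model (assumption): 600 s of work split
    across instances, with an SL instance at 80% of a VM's speed."""
    return 600.0 / (0.8 * n_sl + 1.0 * n_vm)


if __name__ == "__main__":
    print(pick_configuration(toy_predictor, deadline_sec=100.0))
```

In a real system the toy predictor would be replaced by the trained workload prediction model, and the search would also have to account for effects this sketch ignores, such as VM bootup latency.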

Smartpick Characteristics and Architecture

Smartpick's contributions are as follows:

A scalable data analytics system that predicts data analytics workloads while considering SL and VM together to determine optimal compute resource configurations.
Flexibility that allows unmodified data analytics applications and other SEDA systems to reap the benefits.
A simple way to explore the cost-performance tradeoff space using diverse mechanisms embedded within the workload prediction.
Event-driven re-training of the prediction model to handle workload dynamics, e.g., varying data size and new queries.
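As an illustration of the last point, event-driven re-training could be triggered roughly as follows. This is only a sketch under stated assumptions: the RetrainTrigger class, its should_retrain method, and the 20% relative-error threshold are hypothetical and not taken from Smartpick.

```python
from typing import Set


class RetrainTrigger:
    """Hypothetical trigger: signal re-training when a query is seen for
    the first time or when the prediction error grows too large."""

    def __init__(self, error_threshold: float = 0.2) -> None:
        # Relative-error threshold (assumed value, 20% by default).
        self.error_threshold = error_threshold
        self.known_queries: Set[str] = set()

    def should_retrain(
        self, query_id: str, predicted_sec: float, actual_sec: float
    ) -> bool:
        # Event 1: a query that the model has never seen before.
        is_new = query_id not in self.known_queries
        self.known_queries.add(query_id)
        # Event 2: the workload drifted (e.g., the data size changed), so
        # the observed duration deviates strongly from the prediction.
        relative_error = abs(predicted_sec - actual_sec) / max(actual_sec, 1e-9)
        return is_new or relative_error > self.error_threshold
```

The point of the design is that the model is re-trained only when an event indicates that it is stale, rather than on a fixed schedule.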

For workload prediction, Smartpick uses a Random Forest-based machine learning algorithm along with Bayesian Optimization. The model uses several system-level features, such as instances, input-size, start-time-epoch, total-memory, available-memory, memory-per-executor, num-waiting-apps, total-available-cores and query-duration.
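The features listed above can be pictured as one record per executed query. The Python sketch below shows one possible layout; the field types, the units, and the choice of query-duration as the prediction target (rather than an input) are assumptions, as are the names QuerySample and to_training_pair.

```python
from dataclasses import astuple, dataclass
from typing import List, Tuple


@dataclass
class QuerySample:
    """One observation of an executed query (types and units assumed)."""
    instances: int              # number of compute instances used
    input_size: float           # input data size
    start_time_epoch: int       # query start time as a Unix epoch
    total_memory: float
    available_memory: float
    memory_per_executor: float
    num_waiting_apps: int
    total_available_cores: int
    query_duration: float       # assumed to be the prediction target


def to_training_pair(sample: QuerySample) -> Tuple[List[float], float]:
    """Split a sample into (feature vector, target duration)."""
    values = astuple(sample)
    features = [float(v) for v in values[:-1]]
    target = float(values[-1])
    return features, target


if __name__ == "__main__":
    sample = QuerySample(4, 10.0, 1700000000, 64.0, 32.0, 4.0, 0, 16, 42.5)
    print(to_training_pair(sample))
```

Pairs in this shape could then be fed to any Random Forest regressor implementation.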

Publication

[Middleware'23] Smartpick: Workload Prediction for Serverless-enabled Scalable Data Analytics Systems
Anshuman Das Mohapatra and Kwangsung Oh
In Proceedings of the 24th ACM/IFIP International Middleware Conference. Dec 2023.
[pdf] [project] [slides] [bibtex]

Research Sponsors

CSR-2153422