Smartpick: Workload Prediction for Serverless-enabled Scalable Data Analytics Systems
Smartpick is a serverless-enabled scalable data analytics system that uses serverless (SL) and virtual machine (VM) resources together to combine their benefits: agility from SL, and better performance at lower cost from VM.
Project Motivation
Many Internet applications, e.g., Facebook, Twitter and Google, run in cloud environments and generate large-scale data. For these applications, analyzing high volumes of data is one of the most important workloads. To meet performance goals, data analytics systems may deploy redundant compute resources, e.g., VMs, in advance, which incurs additional cost ($) for idle VMs. To avoid paying for unused compute resources, many previous works focused on determining optimal configurations, e.g., the number and types of VM instances and the storage types, by predicting the compute resources that workloads require. These systems, however, may not handle latency-sensitive queries promptly because of the unavoidable overhead of VMs, i.e., bootup latency (> 55 seconds). When queries create a peak in the workload and compute resources are insufficient, the queries must wait until additional VM instances are fully deployed before they can be processed.
Other recent works focused on adopting SL to avoid the cold-boot latency problem. Unfortunately, these systems may still encounter cost and performance bottlenecks, depending on the data analytics workload, because SL offers lower performance at a higher cost than VM. The table below shows these differences between SL and VM, with the benefits of each compute resource highlighted in bold green.