Kimchi: Network Cost-aware Geo-distributed Data Analytics System
Project Motivation
Geo-distributed data analytics (GDA) has become a popular method for mining valuable information from globally distributed big data generated by users and systems in a multi-cloud environment, in areas as diverse as global trend detection on social network data and log monitoring of geo-distributed CDN servers.
Many GDA systems have focused on the network performance bottleneck, namely inter-data center (DC) network bandwidth, to improve query performance. Unfortunately, these systems may encounter a cost bottleneck because they do not consider data transfer cost ($), one of the most expensive and heterogeneous resources in a multi-cloud environment.
Figure: Large-scale geo-distributed data
One may think that minimizing WAN usage also minimizes cost. Yet this is not always true because cloud pricing policies are heterogeneous: there is up to an 8X inter-DC transfer cost difference even within the same cloud provider (AWS), and a 12.5X cost difference across cloud providers (AWS and Azure). Because data transfer cost is heterogeneous, minimizing data transfer size may not lead to the minimum data transfer cost.
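The point can be checked with a small arithmetic sketch. The two DC names, the prices (in US cents per GB), and the data sizes below are hypothetical; only the 8X price ratio echoes the figure quoted above.

```python
# Hypothetical egress prices in US cents per GB; only the 8x ratio echoes the text.
PRICE_CENTS_PER_GB = {
    ("dc_a", "dc_b"): 2,
    ("dc_b", "dc_a"): 16,
}

# Hypothetical intermediate data sizes (GB) held at each DC.
DATA_GB = {"dc_a": 100, "dc_b": 40}


def plan(aggregate_at):
    """Return (GB moved over the WAN, cost in cents) when all data is gathered at one DC."""
    moved_gb, cost_cents = 0, 0
    for src, size_gb in DATA_GB.items():
        if src != aggregate_at:
            moved_gb += size_gb
            cost_cents += size_gb * PRICE_CENTS_PER_GB[(src, aggregate_at)]
    return moved_gb, cost_cents


for dc in DATA_GB:
    moved_gb, cost_cents = plan(dc)
    print(f"aggregate at {dc}: {moved_gb} GB moved, ${cost_cents / 100:.2f}")
# aggregate at dc_a: 40 GB moved, $6.40
# aggregate at dc_b: 100 GB moved, $2.00
```

Under these assumed numbers, gathering the data at dc_a moves the fewest bytes (40 GB) but costs more than three times as much as gathering it at dc_b.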
To consider cost, we are motivated by the following questions.
What is the minimal query execution time given a target cost budget ($)?
What is the feasible cost range to execute a query?
How can a GDA system achieve the desired cost-performance tradeoff in a multi-cloud environment?
How can a GDA system handle dynamics for better performance during query execution without additional cost?
To answer these questions, we have designed and implemented Kimchi, a cost-aware GDA system. The goal of Kimchi is to explore a richer cost-performance tradeoff space and to achieve the best performance within a desired cost budget. To this end, Kimchi solves a constrained MIP (mixed integer programming) task placement problem that meets a desired tradeoff preference.
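As a rough illustration of budget-constrained task placement, the sketch below enumerates every placement of a few reduce tasks and keeps the fastest one whose transfer cost fits the budget. It is not Kimchi's MIP formulation: it swaps in exhaustive search on a toy instance, and the DC names, data sizes, prices, bandwidths, and the assumption that each task reads an equal share of data from every DC are all hypothetical.

```python
from itertools import product

# All numbers below are hypothetical and chosen only for illustration.
DCS = ["us", "eu", "asia"]
DATA_GB = {"us": 60, "eu": 30, "asia": 10}             # intermediate data per DC
EGRESS_CENTS_PER_GB = {"us": 2, "eu": 2, "asia": 16}   # price charged at the source DC
BW_GB_PER_S = {
    frozenset(("us", "eu")): 1.0,
    frozenset(("us", "asia")): 0.25,
    frozenset(("eu", "asia")): 0.5,
}
NUM_TASKS = 4  # reduce tasks; each is assumed to read an equal share from every DC


def evaluate(assignment):
    """Return (transfer_time_s, cost_cents) for a tuple giving each task's DC."""
    time_s, cost_cents = 0.0, 0.0
    for src in DCS:
        for dst in DCS:
            if src == dst:
                continue
            moved_gb = DATA_GB[src] * assignment.count(dst) / NUM_TASKS
            cost_cents += moved_gb * EGRESS_CENTS_PER_GB[src]
            # Links are assumed to transfer in parallel, so the slowest link sets the time.
            time_s = max(time_s, moved_gb / BW_GB_PER_S[frozenset((src, dst))])
    return time_s, cost_cents


def cost_range():
    """Return (min, max) transfer cost in cents over all placements."""
    costs = [evaluate(p)[1] for p in product(DCS, repeat=NUM_TASKS)]
    return min(costs), max(costs)


def best_placement(budget_cents):
    """Return (placement, time_s, cost_cents) of the fastest placement within budget, or None."""
    feasible = []
    for placement in product(DCS, repeat=NUM_TASKS):
        time_s, cost_cents = evaluate(placement)
        if cost_cents <= budget_cents:
            feasible.append((time_s, cost_cents, placement))
    if not feasible:
        return None
    time_s, cost_cents, placement = min(feasible)
    return placement, time_s, cost_cents
```

Calling `cost_range()` gives the feasible cost range for this toy query, and sweeping `budget_cents` across that range traces out a cost-performance tradeoff curve. A real system would replace the enumeration with a solver, since the number of placements grows exponentially with the number of tasks.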
Architecture
To determine the optimized task placement in a multi-cloud environment, Kimchi requires the following inputs from the underlying GDA system:
Application's desired goals, i.e., cost-performance preference
Data transfer cost information for each data center (DC) location
Shuffle stage information, i.e., intermediate data sizes and locations
Network bandwidth information among DCs
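The four inputs above could be bundled into a single record and checked for completeness before placement is computed. The sketch below is one hypothetical shape for such a record. The class name, field names, units, and the encoding of the cost-performance preference as a weight in [0, 1] are assumptions for illustration, not Kimchi's actual interface.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class PlacementInputs:
    cost_weight: float        # assumed encoding: 0.0 = fastest, 1.0 = cheapest
    egress_usd_per_gb: dict   # DC name -> data transfer price
    shuffle_data_gb: list     # one dict per shuffle stage: DC name -> intermediate GB
    bandwidth_mbps: dict      # (src DC, dst DC) -> network bandwidth

    def problems(self):
        """Return a list of human-readable gaps in the inputs (empty if complete)."""
        found = []
        if not 0.0 <= self.cost_weight <= 1.0:
            found.append("cost_weight must be within [0, 1]")
        # Every DC that holds intermediate data needs a price and bandwidth entries.
        dcs = sorted({dc for stage in self.shuffle_data_gb for dc in stage})
        for dc in dcs:
            if dc not in self.egress_usd_per_gb:
                found.append(f"missing transfer price for {dc}")
        for src, dst in permutations(dcs, 2):
            if (src, dst) not in self.bandwidth_mbps:
                found.append(f"missing bandwidth for {src} -> {dst}")
        return found


inputs = PlacementInputs(
    cost_weight=0.5,
    egress_usd_per_gb={"us": 0.02, "eu": 0.02},
    shuffle_data_gb=[{"us": 60, "eu": 30}],
    bandwidth_mbps={("us", "eu"): 800},
)
print(inputs.problems())  # ['missing bandwidth for eu -> us']
```

Bandwidth is keyed by ordered pair here because upload and download rates between two DCs may differ; a symmetric key would also be a reasonable choice.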