Kimchi: Network Cost-aware Geo-distributed Data Analytics System

Project Motivation

Geo-distributed data analytics (GDA) has become a popular method for mining valuable information from globally distributed big data generated by users and systems in a multi-cloud environment, in areas as diverse as global trend detection on social network data and log monitoring of geo-distributed CDN servers.

Many GDA systems improve query performance by targeting the network performance bottleneck: inter-data center (DC) network bandwidth. Unfortunately, these systems may encounter a cost bottleneck ($) because they do not consider data transfer cost ($), one of the most expensive and heterogeneous resources in a multi-cloud environment.

Large-scale geo-distributed data

One may think that minimizing WAN usage also minimizes cost. Yet this is not always true because cloud pricing policies are heterogeneous, e.g., up to an 8X inter-DC transfer cost difference even within the same cloud provider (AWS), and a 12.5X cost difference across cloud providers (AWS and Azure). Because of this heterogeneity in data transfer cost, minimizing data transfer size may not lead to the minimum data transfer cost.
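The point above can be illustrated with a small worked example. This is only a sketch: the DC names, the per-GB prices, and the two transfer plans are hypothetical, chosen so that the price gap matches the 8X ratio quoted above. They are not actual AWS or Azure prices.

```python
# Hypothetical per-GB inter-DC transfer prices in USD (illustrative only).
# The 8x gap mirrors the ratio quoted in the text, not real price sheets.
PRICE_PER_GB = {
    ("dc_a", "dc_c"): 0.16,
    ("dc_b", "dc_c"): 0.02,
}


def transfer_size(plan):
    """Total gigabytes moved by a plan given as (src, dst, gigabytes) tuples."""
    return sum(gb for _, _, gb in plan)


def transfer_cost(plan):
    """Total dollar cost of a plan under the hypothetical price table."""
    return sum(PRICE_PER_GB[(src, dst)] * gb for src, dst, gb in plan)


# Two hypothetical ways to run the same query:
smaller_plan = [("dc_a", "dc_c", 100)]  # moves less data over a pricey link
cheaper_plan = [("dc_b", "dc_c", 150)]  # moves more data over a cheap link

for name, plan in [("smaller", smaller_plan), ("cheaper", cheaper_plan)]:
    print(name, transfer_size(plan), "GB", round(transfer_cost(plan), 2), "USD")
```

Under these assumed prices, the plan that moves less data is the more expensive one, which is why a size-minimizing placement and a cost-minimizing placement can differ.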

To consider cost, we are motivated by the following questions.

To answer these questions, we have designed and implemented Kimchi, a cost-aware GDA system. The goal of Kimchi is to explore a richer cost-performance tradeoff space and to achieve the best performance within a desired cost budget. To this end, Kimchi solves a constrained MIP (mixed integer programming) task placement problem that meets a desired tradeoff preference.
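To make "best performance within a desired cost budget" concrete, the following toy sketch picks a task placement by exhaustive search. It is not Kimchi's MIP formulation: the brute-force enumeration stands in for a MIP solver, and the DC names, data sizes, bandwidths, prices, and the time and cost models are all assumptions made for illustration.

```python
from itertools import product

# All values below are hypothetical inputs for illustration.
DCS = ["dc_a", "dc_b"]
DATA_GB = {"dc_a": 80, "dc_b": 20}  # input data held at each DC
BANDWIDTH = {("dc_a", "dc_b"): 1.0, ("dc_b", "dc_a"): 0.25}  # GB/s per link
PRICE = {("dc_a", "dc_b"): 0.10, ("dc_b", "dc_a"): 0.02}  # USD/GB per link
NUM_TASKS = 4


def evaluate(placement):
    """Return (transfer_time_s, transfer_cost_usd) for a tuple of task DCs.

    Assumed model: each task reads an equal share of every DC's data, and
    the transfer time is set by the slowest link.
    """
    moved = {}  # (src, dst) -> GB sent over that link
    for dst in placement:
        for src, gb in DATA_GB.items():
            if src != dst:
                moved[(src, dst)] = moved.get((src, dst), 0.0) + gb / NUM_TASKS
    time_s = max((gb / BANDWIDTH[link] for link, gb in moved.items()), default=0.0)
    cost = sum(gb * PRICE[link] for link, gb in moved.items())
    return time_s, cost


def best_placement(budget):
    """Fastest placement whose cost fits the budget, or None if none fits."""
    best = None
    for placement in product(DCS, repeat=NUM_TASKS):
        time_s, cost = evaluate(placement)
        if cost <= budget and (best is None or time_s < best[0]):
            best = (time_s, cost, placement)
    return best


for budget in (5.0, 3.0, 1.0):
    print(budget, best_placement(budget))
```

In this toy model, tightening the budget pushes the search toward cheaper but slower placements, which is the cost-performance tradeoff space described above. Exhaustive search grows exponentially with the number of tasks, so it only serves to illustrate the objective and the constraint.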

Architecture

To evaluate the optimized task placement in a multi-cloud environment, Kimchi requires inputs from underlying GDA systems.