MapReduce
Optimizing Large-scale Distributed Data Processing
Stephanie (Thy) Le
Capstone Supervisor: Paul Cantrell
Teammate: Ling Ma
Context
The growth of web services and big data calls for a change in how data is stored and processed. Traditional relational databases are falling out of fashion in "big data" companies because of their complexity and rigid structure. Amazon noticed that when using SQL, 70% of their operations used only <key,value> data, where a primary key is looked up to obtain a value in a single row. In other words, most of their operations did not use the benefits of relational data, while updating such databases becomes more expensive as the amount of data scales.
With that context, in the first half of our independent project, we explored DynamoDB, Amazon's NoSQL database, which stores data in <key,value> form. DynamoDB is built on a networked architecture in which many storage nodes communicate with each other to maintain data, making it well suited to the storage needs of big data companies.
Built on top of this kind of architecture, the MapReduce model is a simple and elegant solution to data processing that captures many real-world applications.
What is MapReduce?
MapReduce is a programming model for breaking data down and processing it on a distributed system (multiple machines). Its high scalability makes it suitable for very large datasets.
How is MapReduce?
It's great! But how does it work?
There are two parts to MapReduce...map and reduce.
Map: Input data is partitioned and distributed to a group of worker machines. Each machine applies the user's Map function to its share of the data, emitting intermediate <key,value> pairs; an optional "combiner" can pre-aggregate those pairs locally before they are handed off.
Reduce: Intermediate <key,value> pairs are grouped by key, and the user's Reduce function combines, totals, or otherwise aggregates each group's values to produce the final output.
Here's an example
You have a text file and want to know how many times each word appears in it. Here is how MapReduce handles this problem:
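To make this concrete, here is a minimal, runnable sketch in Go (the language of our implementation) of what the two user-supplied functions might look like for word count. The KeyValue type and the in-process "shuffle" in main are simplifications; a real framework does the grouping across machines.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// KeyValue is the intermediate pair emitted by Map.
type KeyValue struct {
	Key, Value string
}

// Map splits a file's contents into words and emits <word, "1">
// for each occurrence.
func Map(contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: strings.ToLower(w), Value: "1"})
	}
	return kvs
}

// Reduce sees every value emitted for one key and returns the total count.
func Reduce(key string, values []string) string {
	return fmt.Sprint(len(values))
}

func main() {
	intermediate := Map("the quick brown fox jumps over the lazy dog")
	// The "shuffle" step: group intermediate values by key
	// before handing each group to Reduce.
	groups := make(map[string][]string)
	for _, kv := range intermediate {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, Reduce(k, groups[k]))
	}
}
```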
Fun fact! MapReduce was developed by Google, but rumor has it that it was actually invented by Julius Caesar, who also invented the term "divide and conquer." You know what, now I know why Brutus stabbed Caesar. His "brute force" approach is simply inferior.
Implementation
Our team implemented MIT's MapReduce lab in Go. See our project GitHub for the code.
In this implementation, the user program forks one "master" machine that holds all tasks (files) in a queue. Each task is marked as ready, queued, processing, finished, or errored. "Worker" servers continually request tasks from the master via Remote Procedure Call (RPC) and process them independently. Each worker then reports its result back. If the master's reply sets the "tasksAllDone" property to true, the worker exits.
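Here is a hedged sketch of that worker loop using Go's net/rpc package. The RPC method names (Master.RequestTask, Master.ReportResult) and the message fields are illustrative stand-ins rather than our exact code; TasksAllDone mirrors the "tasksAllDone" property described above.

```go
package main

import (
	"log"
	"net/rpc"
)

// TaskArgs and TaskReply stand in for the messages exchanged with the
// master. The names are illustrative, except TasksAllDone, which mirrors
// the property our master actually sets.
type TaskArgs struct{ WorkerID int }

type TaskReply struct {
	Filename     string
	TasksAllDone bool
}

type ResultArgs struct {
	Filename string
	Err      string // empty if the task finished without error
}

// processTask stands in for running Map or Reduce work on one file.
func processTask(filename string) ResultArgs {
	return ResultArgs{Filename: filename}
}

// workerLoop keeps asking the master for tasks over RPC, processes each
// task independently, reports the result, and exits once the master
// replies that all tasks are done.
func workerLoop(masterAddr string, id int) {
	client, err := rpc.DialHTTP("tcp", masterAddr)
	if err != nil {
		log.Fatal("cannot reach master: ", err)
	}
	defer client.Close()

	for {
		var reply TaskReply
		if err := client.Call("Master.RequestTask", TaskArgs{WorkerID: id}, &reply); err != nil {
			log.Fatal("RequestTask failed: ", err)
		}
		if reply.TasksAllDone {
			return // no tasks left; the worker shuts down
		}
		result := processTask(reply.Filename)
		var ack bool
		if err := client.Call("Master.ReportResult", result, &ack); err != nil {
			log.Fatal("ReportResult failed: ", err)
		}
	}
}

func main() {
	workerLoop("localhost:1234", 1) // placeholder master address
}
```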
You've run into... The Unevenness Problem
We wanted to figure out how to optimize runtime given this problem: when the input files are uneven in size, some workers carry more load than others, so lightly loaded workers have to wait for heavily loaded ones to finish. This drives up total runtime.
Load-balancing by Chunking
Based on the assumption that uneven input files lead to suboptimal performance, we implemented a basic load balancer. The idea is that the load balancer splits input files into evenly sized chunks, so that each worker takes on a similar amount of work. This makes the situation where a few idle workers wait on one overloaded worker less frequent.
Chunking implementation (see the sketch after this list):
1. Scan the input files and calculate their total size in bytes.
2. Split the files into chunks, each described by a start offset and a chunk size.
3. Pass that chunk info as part of a Task. When a worker receives the task, it processes only its chunk, using the offset and chunk size.
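Here is a minimal sketch of those steps in Go. It assumes the chunk size is supplied as a parameter (32,000 bytes in the example, matching the optimum we found later); a real implementation also has to avoid cutting a word in half at a chunk boundary, which we gloss over here.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

// Chunk describes one byte range of an input file; a Task carrying a
// Chunk tells the worker to process only that range.
type Chunk struct {
	Filename string
	Offset   int64
	Size     int64
}

// splitIntoChunks scans the input files and cuts each one into pieces of
// at most chunkSize bytes, so every worker picks up a similar load.
func splitIntoChunks(filenames []string, chunkSize int64) []Chunk {
	var chunks []Chunk
	for _, name := range filenames {
		info, err := os.Stat(name)
		if err != nil {
			log.Fatal(err)
		}
		for off := int64(0); off < info.Size(); off += chunkSize {
			size := chunkSize
			if off+size > info.Size() {
				size = info.Size() - off // final chunk may be smaller
			}
			chunks = append(chunks, Chunk{Filename: name, Offset: off, Size: size})
		}
	}
	return chunks
}

// readChunk shows how a worker would load only its assigned byte range.
func readChunk(c Chunk) ([]byte, error) {
	f, err := os.Open(c.Filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, c.Size)
	n, err := f.ReadAt(buf, c.Offset)
	if err == io.EOF && int64(n) == c.Size {
		err = nil // the chunk ends exactly at end of file
	}
	return buf[:n], err
}

func main() {
	// Example: go run chunk.go input-0.txt input-1.txt ...
	for _, c := range splitIntoChunks(os.Args[1:], 32000) {
		fmt.Printf("%s offset=%d size=%d\n", c.Filename, c.Offset, c.Size)
	}
}
```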
How we generated custom input files to test our load balancer (if you're interested)
We split the source file into 4 uneven files, which are fed to the master.
The ratio of the four files is decided by four consecutive terms of a Fibonacci-style sequence, where the last number is determined by user input. The four files always conform to the ratio a : b : a+b : a+2b. However, the larger the user-supplied input, the more disproportionate a and b become, and so the more uneven the file sizes are.
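A sketch of that splitting step in Go is below. How our generator derives (a, b) from the user's unevenness input is simplified away here, and the source file name is a placeholder.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// writeUnevenFiles splits src into four files whose sizes follow the
// Fibonacci-style ratio a : b : a+b : a+2b.
func writeUnevenFiles(src []byte, a, b int) {
	ratios := []int{a, b, a + b, a + 2*b}
	total := 3*a + 4*b // sum of the four ratio terms
	off := 0
	for i, r := range ratios {
		size := len(src) * r / total
		if i == len(ratios)-1 {
			size = len(src) - off // last file absorbs rounding leftovers
		}
		name := fmt.Sprintf("input-%d.txt", i)
		if err := os.WriteFile(name, src[off:off+size], 0o644); err != nil {
			log.Fatal(err)
		}
		off += size
	}
}

func main() {
	src, err := os.ReadFile("source.txt") // placeholder source file
	if err != nil {
		log.Fatal(err)
	}
	writeUnevenFiles(src, 1, 500) // very uneven: ratio 1 : 500 : 501 : 1001
}
```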
We score unevenness on the following scale: [4, 15, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500], with 4 being the most even file size distribution and 500 the most uneven. For each unevenness score, we ran 10 trials and recorded the median time. We chose the median over the mean to keep irregular runtimes from skewing the results.
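For completeness, here is the shape of such a timing harness as a sketch; our actual measurement script may differ in detail. With an even number of trials, one common convention is to average the two middle values.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// medianRuntime times the given job over ten trials and returns the
// median duration; the median, unlike the mean, is not skewed by a
// single irregular run.
func medianRuntime(job func()) time.Duration {
	const trials = 10
	times := make([]time.Duration, trials)
	for i := range times {
		start := time.Now()
		job()
		times[i] = time.Since(start)
	}
	sort.Slice(times, func(i, j int) bool { return times[i] < times[j] })
	// Even trial count: take the midpoint of the two middle values.
	return (times[trials/2-1] + times[trials/2]) / 2
}

func main() {
	// Stand-in job; in our experiments this would be one full MapReduce run.
	fmt.Println(medianRuntime(func() { time.Sleep(10 * time.Millisecond) }))
}
```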
Results
First, we recorded total runtime of MapReduce without load balancing for each file size distribution. Below is the median of the results of 10 trials.
Then, we ran MapReduce with load balancing, in which we ran 10 trials for each chunk size and calculated the median runtime. Results with chunk size 0 are taken from Graph 1 (no load balancing) for comparison.
Aggregating runtime for all chunk sizes and unevenness scores
Discussion
When running MapReduce without a load balancer, runtime varies across file size distributions. At the most evenly sized input (4), the program takes the least time, while unevenness scores of 150, 300, and 400 tend to be local maxima in runtime. We hypothesize that when file sizes are unbalanced, workers assigned shorter files sit idle waiting for workers assigned longer files; when file sizes are naturally balanced, no worker is idle for long and all finish around the same time.
When we add load balancing, the runtime of the program decreases significantly. With each successive increase in chunk size, the runtime of MapReduce drops until it levels off around 5.01 seconds at a chunk size of 32,000 bytes, beyond which no further improvement appears. At the optimal chunk size, runtime is up to four times shorter than MapReduce without load balancing.
Runtime also becomes uniform across all file size distributions with load balancing. Since the total amount of data is the same in every case, chunking successfully distributed the work evenly across all workers. This experiment shows that load balancing by chunking input files is a good strategy for optimizing a distributed system.
Thank you...
To my teammate Ling Ma for bringing up the idea for this project and for supporting me through its completion,
To my teacher and capstone advisor Paul Cantrell for giving us critical advice and direction,
To my advisors, Susan Fox and Lauren Milne, for their support throughout my time in MSCS,
To all my friends and teachers at Macalester, who have helped me through ups and downs,
and You, whoever You are, for being part of my capstone experience :)