⚙️ Implemented a complete MapReduce runtime with Mapper, Reducer, MR Master, and HDFS integration
🧵 Multi-threaded execution with configurable mapper threads and synchronization barriers
📊 Performance metrics: job completion time, scalability trends, and variance analysis
☁️ Deployed and evaluated on AWS multi-node clusters
🛠 Tooling: C/C++ system programming, RPC, HDFS, automated benchmarking
Quantified the impact of barrier synchronization on end-to-end job completion time
Demonstrated strong scaling benefits with increased mapper units, validating parallel execution efficiency
Identified application-dependent bottlenecks (CPU-bound Grep vs. reduce-/network-bound Word Count)
Analyzed how mapper thread over-subscription leads to CPU saturation and diminishing returns
Designed thread-safe mapper and reducer pipelines with bounded in-memory buffers
Implemented sender threads to flush mapper buffers based on utilization thresholds
Built benchmarking workflows with repeated trials and statistical analysis (average, standard deviation)
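The bounded-buffer-plus-sender-thread design described above can be sketched with a mutex and two condition variables: mappers block when the buffer is full (back-pressure), and the sender wakes once utilization crosses a threshold. Class and method names here are illustrative, not the project's actual API.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Minimal sketch of a bounded in-memory mapper buffer drained by a sender
// thread once utilization reaches `flush_at` (a fraction of capacity).
class MapperBuffer {
public:
    MapperBuffer(size_t capacity, double flush_at)
        : capacity_(capacity), flush_at_(flush_at) {}

    // Mapper side: blocks while the buffer is full (back-pressure).
    void emit(std::string kv) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return buf_.size() < capacity_ || closed_; });
        if (closed_) return;
        buf_.push_back(std::move(kv));
        if (buf_.size() >= static_cast<size_t>(capacity_ * flush_at_))
            need_flush_.notify_one();   // wake the sender thread
    }

    // Sender side: waits for the utilization threshold (or shutdown),
    // then drains and returns the batch destined for a reducer.
    std::vector<std::string> wait_and_flush() {
        std::unique_lock<std::mutex> lk(m_);
        need_flush_.wait(lk, [&] {
            return buf_.size() >= static_cast<size_t>(capacity_ * flush_at_)
                   || closed_;
        });
        std::vector<std::string> out;
        out.swap(buf_);
        not_full_.notify_all();         // release any blocked mappers
        return out;
    }

    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        need_flush_.notify_all();
        not_full_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, need_flush_;
    std::vector<std::string> buf_;
    size_t capacity_;
    double flush_at_;
    bool closed_ = false;
};
```

The threshold trades off RPC overhead (small, frequent flushes) against memory pressure and flush latency (large, rare flushes), which is exactly what the buffer-size sweep below probes.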
Evaluated system behavior across:
- Mapper units ∈ {1, 4}
- Mapper threads ∈ {4, 8, 12, 16}
- Buffer size ∈ {10 KB, 100 KB, 1 MB}
- Barrier: enabled vs. disabled
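Each configuration in the sweep above is run for repeated trials and summarized by mean and standard deviation. A sketch of that summary step (population standard deviation; the actual analysis may use the n−1 sample form):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Summary statistics for one benchmark configuration's repeated trials
// (e.g. job completion times in seconds).
struct Stats {
    double mean;
    double stddev;
};

inline Stats summarize(const std::vector<double>& trials) {
    double mean = std::accumulate(trials.begin(), trials.end(), 0.0) /
                  static_cast<double>(trials.size());
    double var = 0.0;
    for (double t : trials) var += (t - mean) * (t - mean);
    var /= static_cast<double>(trials.size());  // population variance
    return {mean, std::sqrt(var)};
}
```

Reporting the standard deviation alongside the mean is what supports the variance analysis noted above, e.g. distinguishing genuine scaling gains from run-to-run noise on shared AWS nodes.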
Large-scale data processing and analytics frameworks
Distributed systems and cloud infrastructure design
Performance modeling of parallel workloads
Systems-level trade-off analysis for synchronization and buffering
This project provides a practical, systems-level understanding of how parallelism, synchronization, and buffering interact in real-world distributed frameworks. The experimental results highlight why modern MapReduce systems carefully tune barriers and buffer sizes to maximize throughput while avoiding resource contention, making the project directly relevant to industry-scale data platforms.