999 Skyway Road, Suite 100
San Carlos, CA 94070
Phone: (765) 491-0424
I received my Ph.D. from the Department of Electrical and Computer Engineering at Purdue University, where I worked in the Computer Systems Architecture group under my advisor T. N. Vijaykumar. I got my Bachelor's degree from the University of Engineering and Technology, Lahore, Pakistan.
My research interests include cloud computing, data mining, statistical analytics, data center architectures, big-data and energy-aware computing, distributed systems and computer architecture. My current research focuses on analytical models for big-data analytics.
My past projects from graduate school include ShuffleWatcher, shuffle-aware scheduling in multi-tenant MapReduce clusters (USENIX ATC 2014); Tarazu, which optimizes MapReduce on heterogeneous clusters (ASPLOS 2012); PowerTrade, a joint optimization of idle power and cooling power to reduce overall data center power (ASPLOS 2010); and MaRCO, a runtime performance optimization for MapReduce, the well-known programming model for large-volume data analysis in data centers (Tech Report 2007). During this work, I also developed a benchmark suite for MapReduce (details below). I have also worked on providing architectural support for debugging multithreaded programs on multicores (TimeTraveler, ISCA 2010).
MapReduce is a well-known programming model, developed within Google, for processing large amounts of raw data, for example, crawled documents or web request logs. This data is usually so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. The ease of programmability, automatic data management, and transparent fault tolerance have made MapReduce a favorable choice for large-scale batch processing in data centers. Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The library groups together all intermediate values associated with the same intermediate key and passes them to the reduce function through an all-map-to-all-reduce communication called Shuffle. Reduce, also written by the user, receives an intermediate key along with its set of values from Map and merges these values together to produce the final output. Hadoop is an open-source implementation of MapReduce that is regularly improved and extended by software developers and researchers and is maintained by the Apache Software Foundation. Despite the vast effort put into developing Hadoop MapReduce, relatively little rigorous work has been done on the benchmark side.
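The map/shuffle/reduce flow above can be sketched in a few lines of Python. This is a toy illustration of the model's semantics (a single-machine word-count example, not Hadoop code): map emits (word, 1) pairs, the shuffle groups values by key, and reduce sums the values for each key.

```python
from collections import defaultdict

# Map: take one input record (a line of text) and emit
# intermediate (key, value) pairs -- here, (word, 1).
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Shuffle: group all intermediate values by key. In a real cluster
# this is the all-map-to-all-reduce communication step.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: merge the set of values for one key into the final output.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
intermediate = [pair for line in lines for pair in map_fn(line)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())
print(counts["the"])  # -> 2
```

In Hadoop, the user supplies only the equivalents of map_fn and reduce_fn; the framework handles the shuffle, data distribution, and fault tolerance transparently.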
During our work on MapReduce, we developed a benchmark suite that represents a broad range of MapReduce applications, exhibiting combinations of high/low computation and high/low shuffle volumes. The details of the applications, their code (compatible with Hadoop-0.20 and Hadoop-1.0.0), and details about the input datasets can be found below.