About my research

I have been working on high-level parallel programming with MapReduce since 2009. The motivation of my study is to construct a calculational framework that makes programming more algorithmic and productive. Our approach, so-called program calculation, derives programs from suitable specifications through a process of equational reasoning.

Writing a parallel program has proven over the years to be a difficult task, requiring specialized knowledge and skills. Many parallel programming models and frameworks have been developed to make things simpler, for example Pthreads, the Message Passing Interface (MPI), Bulk Synchronous Parallel (BSP), and parallel skeletons. Some of these, e.g., Pthreads and MPI, are considered low-level models, and programming with them directly often leads to overly complex, non-portable, and frequently unscalable and unreliable code. To overcome such problems, high-level parallel programming mechanisms have been studied and are preferred by many users. For example, using algorithmic skeletons, users can write parallel programs in a structured way under deterministic patterns, so they do not need to worry about many low-level concerns such as data races and read/write locks.
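To make the skeleton idea concrete, here is a minimal sketch in Python of the two skeletons this page keeps returning to, Map and Reduce. The function names are illustrative only; they are written as sequential reference implementations, while a real skeleton library would execute them in parallel behind exactly this interface.

```python
from functools import reduce

def map_skel(f, xs):
    """Apply f independently to every element.

    Element-wise independence is what makes this skeleton
    embarrassingly parallel: a framework may run each f(x)
    on a different worker."""
    return [f(x) for x in xs]

def reduce_skel(op, xs, unit):
    """Combine elements with an associative operator op.

    Associativity is what lets a framework evaluate this as a
    balanced tree across workers instead of a left-to-right loop."""
    return reduce(op, xs, unit)

# Example: sum of squares, expressed only with the two skeletons.
result = reduce_skel(lambda a, b: a + b,
                     map_skel(lambda x: x * x, [1, 2, 3, 4]),
                     0)
# result == 30
```

The user never writes thread or lock code: choosing the skeleton fixes the parallel structure, and the framework handles the low-level execution.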

In recent years, the MapReduce programming model [link] has become very popular in big-data processing and analysis. It is actually based on two specialized algorithmic skeletons, the Map and Reduce functions. Many researchers and programmers have implemented a variety of algorithms on it, in areas such as data mining, machine learning, and natural language processing. Despite the success of MapReduce, developing efficient MapReduce programs remains a challenge for many problems. The main difficulties are as follows:
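The dataflow of the model can be sketched on a single machine: user-written map and reduce functions, with the framework's shuffle (group-by-key) phase in between. The example below is the standard word count; the function names are illustrative and not part of any real MapReduce API.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Emit one intermediate (key, value) pair per word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine all values observed for one key.
    return (key, sum(values))

lines = ["to be or not to be"]
grouped = shuffle(chain.from_iterable(mapper(l) for l in lines))
counts = dict(reducer(k, vs) for k, vs in grouped.items())
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Everything except `mapper` and `reducer` is supplied by the framework, which is both the model's appeal and, as discussed below, the source of its rigidity.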

  1. The programming model is limited to two functions, so only a small domain of problems fits it well. For some graph problems, for example, MapReduce is not the best choice; Google therefore developed a new framework named Pregel [link], which is based on the BSP model.
  2. There is a gap between the MapReduce programming model and the algorithms for individual problems. Programming with MapReduce requires programmers to develop divide-and-conquer algorithms that fit its massively parallel execution model. For algorithm design, the MapReduce model is considered too low-level and rigid [7, 8, 9, 10]; designing algorithms on it is therefore difficult, and programmers need to write a great deal of code that is hard to maintain and reuse.
  3. Optimizing a given MapReduce program is difficult and tedious. The performance of a MapReduce program is determined first by algorithmic efficiency and second by configuration/tuning parameters. Measuring and adjusting performance in an ad hoc way usually costs a lot of effort, not only because of the complexity of MapReduce algorithms but also because of the size of the input data. Static analysis and optimization are also hard, because the Map/Reduce processes are black boxes and the environments of large computer clusters are complex.
Program-calculation-based high-level parallel programming

Our approach, so-called program calculation, derives programs from suitable specifications by a process of equational reasoning. The figure on the left shows the basic content of our approach. Specifications are given as structural recursion functions. For example, MapReduce programs can be derived from structural recursions on lists (also called list homomorphisms). By using program calculation together with a set of expressive deterministic parallel skeletons (including Map and Reduce), we can constructively and automatically produce parallel (in the sense of MapReduce) programs.
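The key property can be shown in a short sketch (Python here for readability; the helper names are mine, not our framework's API). A list homomorphism is determined by a function `f` on single elements and an associative operator `combine`, and these two equations are exactly what licenses splitting the list into chunks (Map tasks) and merging the partial results (the Reduce step):

```python
from functools import reduce

# A list homomorphism hom satisfies:
#   hom([a])     == f(a)
#   hom(xs + ys) == combine(hom(xs), hom(ys))
# Associativity of combine makes any split of the input valid.

def hom(f, combine, xs):
    # Sequential specification: structural recursion on a
    # non-empty list.
    return reduce(combine, (f(x) for x in xs))

def hom_parallel(f, combine, xs, chunks=4):
    # Parallel evaluation sketch: each chunk could be a Map task;
    # combining the partial results is the Reduce task.
    size = max(1, len(xs) // chunks)
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]
    partials = [hom(f, combine, p) for p in parts]  # map phase
    return reduce(combine, partials)                # reduce phase

xs = list(range(1, 101))
# The derived parallel program agrees with the sequential spec:
assert hom_parallel(lambda x: x, lambda a, b: a + b, xs) == hom(
    lambda x: x, lambda a, b: a + b, xs)
assert hom_parallel(lambda x: x, max, xs) == 100
```

The user writes only the sequential specification (`f` and `combine`); the equational theory guarantees that the chunked, MapReduce-style evaluation computes the same result.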

The benefits of our approach are:

  1. Algorithmic. A structural recursion function naturally expresses an algorithm together with its data structure, in a sequential manner, so the algorithm is easier to define and its correctness easier to prove than in the parallel case.
  2. Efficient. We automatically transform a structural recursion into a parallel program. The optimization is based on the semantics of the structural recursion function and requires no attention from users; optimal solutions are embedded in our framework, which guarantees the efficiency of the generated parallel programs.
  3. Productive. Users only need to focus on defining the specification at a high level, without any concern for low-level implementation or parallel-execution issues. User code is usually compact and short, which leads to high productivity.