About my research

I have been working on high-level parallel programming with MapReduce since 2009. The motivation of my study is to construct a calculational framework that makes programming more algorithmic and productive. Our approach, so-called program calculation, derives programs from suitable specifications through a process of equational reasoning.

Writing a parallel program has proven over the years to be a difficult task, requiring specialized knowledge and skills. Many parallel programming models and frameworks have been developed to make things simpler, for example Pthreads, the Message Passing Interface (MPI), Bulk Synchronous Parallel (BSP), and parallel skeletons. Some of these, e.g., Pthreads and MPI, are considered low-level models, and programming with them directly often leads to overly complex, non-portable, and frequently unscalable and unreliable code. To overcome such problems, high-level parallel programming mechanisms have been studied and are preferred by many users. For example, using algorithmic skeletons, users can write parallel programs in a structured way under deterministic patterns, so they do not need to worry about many low-level concerns such as data races and read/write locks.
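To make the skeleton idea concrete, here is a minimal sketch in Python of the two skeletons this page keeps returning to, Map and Reduce. The function names are illustrative only; they are written as sequential reference implementations, while a real skeleton library would execute them in parallel behind exactly this interface.

```python
from functools import reduce

def map_skel(f, xs):
    """Apply f independently to every element.

    Element-wise independence is what makes this skeleton
    embarrassingly parallel: a framework may run each f(x)
    on a different worker."""
    return [f(x) for x in xs]

def reduce_skel(op, xs, unit):
    """Combine elements with an associative operator op.

    Associativity is what lets a framework evaluate this as a
    balanced tree across workers instead of a left-to-right loop."""
    return reduce(op, xs, unit)

# Example: sum of squares, expressed only with the two skeletons.
result = reduce_skel(lambda a, b: a + b,
                     map_skel(lambda x: x * x, [1, 2, 3, 4]),
                     0)
# result == 30
```

The user never writes thread or lock code: choosing the skeleton fixes the parallel structure, and the framework handles the low-level execution.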

In recent years, the MapReduce programming model [link] has become very popular in big-data processing and analysis. It is actually based on two specialized algorithmic skeletons, the Map and Reduce functions. Many researchers and programmers have implemented a variety of algorithms on it, in areas such as data mining, machine learning, and natural language processing. Despite the success of MapReduce, developing efficient MapReduce programs remains a challenge for many problems. The main difficulties are as follows:
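The dataflow of the model can be sketched on a single machine: user-written map and reduce functions, with the framework's shuffle (group-by-key) phase in between. The example below is the standard word count; the function names are illustrative and not part of any real MapReduce API.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Emit one intermediate (key, value) pair per word.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine all values observed for one key.
    return (key, sum(values))

lines = ["to be or not to be"]
grouped = shuffle(chain.from_iterable(mapper(l) for l in lines))
counts = dict(reducer(k, vs) for k, vs in grouped.items())
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Everything except `mapper` and `reducer` is supplied by the framework, which is both the model's appeal and, as discussed below, the source of its rigidity.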

  1. The programming model is limited to two functions, so only a small domain of problems fits it well. For some graph problems, for example, MapReduce is not the best choice; Google therefore developed a new framework named Pregel [link], which is based on the BSP model.
  2. There is a gap between the MapReduce programming model and the algorithms for individual problems. Programming with MapReduce requires programmers to develop divide-and-conquer algorithms that fit its massively parallel execution model. For algorithm design, the MapReduce model is considered too low-level and rigid [7, 8, 9, 10]; designing algorithms on it is therefore difficult, and programmers need to write a great deal of code that is hard to maintain and reuse.
  3. Optimizing a given MapReduce program is difficult and tedious. The performance of a MapReduce program is determined first by algorithmic efficiency and second by configuration/tuning parameters. Measuring and adjusting performance in an ad hoc way usually costs a lot of effort, not only because of the complexity of MapReduce algorithms but also because of the size of the input data. Static analysis and optimization are also hard, because the Map/Reduce processes are black boxes and the environments of large computer clusters are complex.
Program-calculation-based high-level parallel programming

Our approach, so-called program calculation, derives programs from suitable specifications by a process of equational reasoning. The figure on the left shows the basic content of our approach. Specifications are given as structural recursion functions. For example, MapReduce programs can be derived from structural recursions on lists (also called list homomorphisms). By using program calculation together with a set of expressive deterministic parallel skeletons (including Map and Reduce), we can constructively and automatically produce parallel (in the sense of MapReduce) programs.
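The key property can be shown in a short sketch (Python here for readability; the helper names are mine, not our framework's API). A list homomorphism is determined by a function `f` on single elements and an associative operator `combine`, and these two equations are exactly what licenses splitting the list into chunks (Map tasks) and merging the partial results (the Reduce step):

```python
from functools import reduce

# A list homomorphism hom satisfies:
#   hom([a])     == f(a)
#   hom(xs + ys) == combine(hom(xs), hom(ys))
# Associativity of combine makes any split of the input valid.

def hom(f, combine, xs):
    # Sequential specification: structural recursion on a
    # non-empty list.
    return reduce(combine, (f(x) for x in xs))

def hom_parallel(f, combine, xs, chunks=4):
    # Parallel evaluation sketch: each chunk could be a Map task;
    # combining the partial results is the Reduce task.
    size = max(1, len(xs) // chunks)
    parts = [xs[i:i + size] for i in range(0, len(xs), size)]
    partials = [hom(f, combine, p) for p in parts]  # map phase
    return reduce(combine, partials)                # reduce phase

xs = list(range(1, 101))
# The derived parallel program agrees with the sequential spec:
assert hom_parallel(lambda x: x, lambda a, b: a + b, xs) == hom(
    lambda x: x, lambda a, b: a + b, xs)
assert hom_parallel(lambda x: x, max, xs) == 100
```

The user writes only the sequential specification (`f` and `combine`); the equational theory guarantees that the chunked, MapReduce-style evaluation computes the same result.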

The benefits of our approach are:

  1. Algorithmic. A structural recursion function naturally expresses an algorithm together with its data structure, in a sequential manner, so the algorithm is easier to define and its correctness easier to prove than in the parallel case.
  2. Efficient. We automatically transform a structural recursion into a parallel program. The optimization is based on the semantics of the structural recursion function and requires no attention from users; optimal solutions are embedded in our framework, which guarantees the efficiency of the generated parallel programs.
  3. Productive. Users only need to focus on defining the specification at a high level, without any concern for low-level implementation or parallel-execution issues. User code is usually compact and short, which leads to high productivity.