2015 Conference on Advanced Topics and Auto Tuning in

High-Performance Scientific Computing

February 27-28, 2015

Room 101, Mathematics Research Center, National Taiwan University

Plenary Speaker

  • Toshio Endo (Global Scientific Information and Computing Center, Tokyo Institute of Technology)

Invited Speakers

  • Ray-Bing Chen (Department of Statistics, National Cheng Kung University)

  • Chau-Yi Chou (National Center for High-Performance Computing)

  • Cheng-Han Du (Institute of Applied Mathematical Sciences, National Taiwan University)

  • Ryusuke Egawa (Cyberscience Center, Tohoku University)

  • Takeshi Fukaya (Advanced Institute for Computational Science, RIKEN)

  • Shoichi Hirasawa (Graduate School of Information Sciences, Tohoku University)

  • Feng-Nan Hwang (Department of Mathematics, National Central University)

  • Toshiyuki Imamura (Advanced Institute for Computational Science, RIKEN)

  • Takahiro Katagiri (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

  • Chia-Chen Kuo (National Center for High-Performance Computing)

  • Tsung-Lin Lee (Department of Applied Mathematics, National Sun Yat-sen University)

  • Kengo Nakajima (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

  • Satoshi Ohshima (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

  • Simon See (Director and Chief Solution Architect, APJ, NVIDIA)

  • Reiji Suda (Department of Computer Science, The University of Tokyo)

  • Tomohiro Suzuki (Department of Computer Science and Engineering, University of Yamanashi)

  • Daisuke Takahashi (Center for Computational Sciences, University of Tsukuba)

  • Hiroyuki Takizawa (Graduate School of Information Sciences, Tohoku University)

  • Jengnan Tzeng (Department of Mathematical Sciences, National Chengchi University)

  • Weichung Wang (Institute of Applied Mathematical Sciences, National Taiwan University)

  • Yuan-Sen Vincent Yang (Department of Civil Engineering, National Taipei University of Technology)

Organizing Committee

Aims and Scope

The Conference on Advanced Topics and Auto Tuning in High-Performance Scientific Computing focuses on the scientific impacts of the latest computer architectures and on approaches to achieving high-performance computing on these leading-edge computers. Advances in many-core architectures and high-end computers have demonstrated their significance for scientific discoveries and engineering achievements. The complexity of these newly developed computers, however, also poses contemporary challenges in achieving the best efficiency of their highly promising computational capabilities. The conference encourages interdisciplinary communication between researchers from applied mathematics, statistics, computer science, the physical sciences, engineering, and industry to promote innovations and breakthroughs in this exciting field. The main themes include, but are not limited to, simulations, numerical methods, applications, hardware, and particularly software and algorithm auto-tuning via statistical methods.

Contact Person

Ms. Ying-Pei Liu (tassist4@tims.ntu.edu.tw).

February 27, 2015 (Friday)

Registration

08:40-09:00

Opening

09:00-09:10 Reiji Suda

Plenary Talk

09:10-10:00 [Chair: T. Katagiri] Toshio Endo

Invited Talks

10:10-11:00 [Chair: F.N. Hwang] Simon See, Takahiro Katagiri

11:10-12:00 [Chair: S. Ohshima] Tomohiro Suzuki, Yuan-Sen Yang

13:30-14:20 [Chair: T.-L. Lee] Toshiyuki Imamura, Ray-Bing Chen

14:30-15:20 [Chair: T. Fukaya] Hiroyuki Takizawa, Feng-Nan Hwang

15:40-16:55 [Chair: R.B. Chen] Reiji Suda, Chau-Yi Chou, Shoichi Hirasawa

February 28, 2015 (Saturday)

Registration

08:40-09:10

Invited Talks

09:10-10:00 [Chair: T. Suzuki] Daisuke Takahashi, Chia-Chen Kuo

10:10-11:00 [Chair: C.H. Du] Ryusuke Egawa, Jengnan Tzeng

11:10-12:00 [Chair: K. Nakajima] Takeshi Fukaya, Tsung-Lin Lee

13:30-14:20 [Chair: Y.S. Yang] Kengo Nakajima, Cheng-Han Du

14:30-15:20 [Chair: T. Imamura] Satoshi Ohshima, Weichung Wang

Group Discussion

15:40-16:30 [Chair: S. See] All participants

Closing

16:30-16:40 Takahiro Katagiri, Weichung Wang

February 27, 2015 (Friday)

Opening (09:00-09:10)

(09:10-10:00)

Harnessing Memory Hierarchy towards Extreme Fast and Big Simulations

by Toshio Endo (Global Scientific Information and Computing Center, Tokyo Institute of Technology)

Abstract. Toward the exascale era, there is a need to develop technology that realizes bigger, finer, and faster simulations. However, the "memory wall" problem will be one of the largest obstacles. Our approach to this issue is to combine architectures equipped with a deeper memory hierarchy, locality improvement of application algorithms, and system software for utilizing the memory hierarchy. Taking stencil-type applications as targets, and through evaluation using the petascale GPGPU supercomputer TSUBAME, we demonstrate that this "co-design" approach is promising for extremely fast and big simulations.

(10:10-11:00)

Accelerator technology and Tuning

by Simon See (Director and Chief Solution Architect, APJ, NVIDIA)

Abstract. In recent years, accelerators have been widely adopted by high-performance computing communities. However, due to the complexity of the architecture, tuning is not an easy task. In this talk, the author discusses these trends and the role of tuning and auto-tuning with respect to GPUs.

Towards Auto-tuning of Scientific Codes for Many-core Architectures in Era of Exa-flops

by Takahiro Katagiri (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

Abstract. In the next generation of supercomputing environments, many-core architectures with 200+ way parallelism will be pervasive. It is hard to achieve high performance without modifying legacy codes to exploit the high parallelism of many-core architectures. Hence, one of the critical issues is the productivity of high-performance software. In this talk, we show a framework of auto-tuning (AT) that establishes high performance for legacy scientific codes at a low cost of performance tuning. Although the optimization strategies we use for AT are well-known loop transformations, such as loop fusion and loop splitting, the effect of AT is not small. The AT effect is evaluated with codes from ppOpen-HPC, an open-source software suite based on major numerical discretization methods. ppOpen-AT, a code generator for AT, is also used to optimize codes for a current many-core architecture, the Intel Xeon Phi. A factor of 5.6x speedup by ppOpen-AT for an FDM code on a cluster of 8 Xeon Phi nodes is obtained by applying automatic loop transformations.

(11:10-12:00)

Implementation of Tile Algorithms for Matrix Decomposition on CPU/GPU Systems

by Tomohiro Suzuki (Department of Computer Science and Engineering, University of Yamanashi)

Abstract. Algorithms for matrix decomposition, such as the Cholesky, LU, and QR factorizations, are of fundamental importance to numerical linear algebra. LAPACK is the best-known numerical linear algebra library that includes matrix decomposition routines. It uses block algorithms, which first decompose a panel of width $b$ (the block width) and then update the trailing submatrix using the transformation matrices generated by the preceding decomposition; this process is repeated until the matrix is fully decomposed. High-performance Level-3 BLAS operations can be introduced into these block algorithms. However, a block algorithm follows a fork-join parallel computing model, which cannot achieve high performance in a parallel computing environment because some threads/processes stall during the sequential panel decomposition. In a tile algorithm (also known as algorithm-by-blocks) for matrix decomposition, the target matrix is divided into many equal-sized submatrices (tiles). The matrix is decomposed or updated one or two tiles at a time, and these steps generate many fine-grained tasks. Their suitability for recent multicore CPU architectures has therefore attracted much attention from the high-performance computing (HPC) community. Meanwhile, because GPUs have high computing capability, GPU-accelerated computing has spread widely in recent years. In this computing style, GPUs commonly handle coarse-grained data-parallel tasks, so introducing GPU acceleration into tile algorithms for matrix decomposition is difficult. In this talk, we will introduce our implementation of tile algorithms for matrix decomposition on CPU/GPU systems and its performance compared with the routines of other numerical linear algebra libraries.
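To illustrate the contrast between panel-based and tile-based factorizations described in the abstract, the following NumPy sketch implements a right-looking tiled Cholesky factorization. It is a hypothetical serial sketch for exposition only (the function name and structure are ours, not the speaker's implementation); the comments mark where the independent fine-grained tasks arise that a task-based runtime would schedule on CPU cores or GPUs.

```python
import numpy as np

def tiled_cholesky(A, b):
    """Right-looking tiled Cholesky A = L L^T with tile size b (n divisible by b).

    Each triangular solve and each trailing-tile update below is an
    independent fine-grained task; this abundance of small tasks is what
    makes tile algorithms attractive for multicore and hybrid systems.
    """
    n = A.shape[0]
    L = A.copy()
    nt = n // b
    for k in range(nt):
        kk = slice(k * b, (k + 1) * b)
        # Factor the diagonal tile (POTRF).
        L[kk, kk] = np.linalg.cholesky(L[kk, kk])
        for i in range(k + 1, nt):
            ii = slice(i * b, (i + 1) * b)
            # Panel tile: L_ik = A_ik L_kk^{-T} (TRSM); independent over i.
            L[ii, kk] = np.linalg.solve(L[kk, kk], L[ii, kk].T).T
        for i in range(k + 1, nt):
            ii = slice(i * b, (i + 1) * b)
            for j in range(k + 1, i + 1):
                jj = slice(j * b, (j + 1) * b)
                # Trailing update (SYRK/GEMM); independent over (i, j).
                L[ii, jj] -= L[ii, kk] @ L[jj, kk].T
    return np.tril(L)
```

Unlike the fork-join block algorithm, none of the inner-loop tasks at a given step depend on each other, so they can be dispatched to different cores or devices as soon as their inputs are ready.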

Experiences on GPU Parallel Performance on Nonlinear Dynamic Structural Analysis

by Yuan-Sen Yang (Department of Civil Engineering, National Taipei University of Technology)

Abstract. Located in the Pacific Ring of Fire, Japan and Taiwan are among the most earthquake-prone places in the world. Earthquake engineering experts and structural engineers strive to safeguard the principle of collapse-free behavior under severe major earthquakes. However, neither current seismic design codes nor emerging performance-based design guidelines provide a reliable simulation method to represent the complicated behaviors of structural collapse. One of the bottlenecks is that the practical application of refined structural models is impeded by the large amount of finite element computation they require.

Lunch Break (12:00-13:30)

(13:30-14:20)

Automatic Tuning of CUDA-BLAS Kernel Parameters by Multi-stage d-Spline

by Toshiyuki Imamura (Advanced Institute for Computational Science, RIKEN)

Abstract. We are developing a CUDA-BLAS library for which we investigate the automatic-tuning technique of multi-stage d-Spline. Since obtaining high and stable performance from CUDA-BLAS is important for HPC, we have to find better parameter sets, whether off-line or on-line. In this talk, we will present the latest performance results of the present project.

Performance Tuning of Next-Generation Sequencing Assembly via Gaussian Process Model with Branching and Nested Factors

by Ray-Bing Chen (Department of Statistics, National Cheng Kung University)

Abstract. For next-generation sequencing data, the selection of the assembly tool and the corresponding parameters has a great impact on the quality of de novo assembly. In practice, three different assembly tools, Velvet, SOAPdenovo, and ABySS, are considered. Among these three tools, some parameters are shared and some are specific to particular tools. Thus, choosing the proper assembly tool is a challenging problem due to this complex structure. Here we recast the assembly tool selection problem as an optimization problem in computer experiments by treating the assembly tool as a branching factor and the parameters within the different tools as nested factors. Then, based on the Gaussian process model with branching and nested factors, a sequential procedure is proposed to select the proper assembly tool and tune the corresponding parameters simultaneously. The performance of the proposed procedure is demonstrated via several numerical experiments. Finally, a real example is used to illustrate the usefulness of the proposed procedure.

(14:30-15:20)

Autotuning with User-defined Code Transformations

by Hiroyuki Takizawa (Graduate School of Information Sciences, Tohoku University)

Abstract. In this talk, a case study of using custom code transformations for autotuning is reported. In autotuning, it is often assumed that an application code is optimized with different parameters to generate various code variants, and then the best variant for a given system is empirically selected based on performance profiling. This assumption implies that code optimizations have been predefined and parameterized, though code optimizations cannot always be predefined and parameterized. Indeed, some code optimizations needed in real-world applications cannot be represented as a combination of predefined code transformations. Therefore, in this case study, a code transformation framework, Xevolver, is used to enable autotuning to work with user-defined code transformation rules. As a result, autotuning can achieve parameter tuning of not only predefined code transformations but also custom ones. Autotuning is also applicable to a custom code transformation that does not require any parameters; in such a case, autotuning is used to decide whether the transformation should be applied or not. Xevolver has been developed on top of the ROSE compiler infrastructure and hence can easily collaborate with ROSE. If a code transformation is general enough, and hence useful in many applications, it could be implemented using either ROSE or Xevolver. However, code transformations required in practice might be application-specific, system-specific, or domain-specific, and general-purpose compilers and tools will not provide such transformations. Therefore, to generate code variants from an application code, not only compiler experts but also ordinary programmers need to define and customize code transformations for the special demands of individual applications, systems, and/or domains. Motivated by this, Xevolver has been designed to help ordinary programmers define their own code transformations.
In this case study, Xevolver makes an application code "auto-tunable" without major code modifications even if the application needs special code optimizations for high performance. Its benefits and current limitations are discussed in the talk.

A New Framework of Iteratively Adaptive Multiscale Finite Element Methods

by Feng-Nan Hwang (Department of Mathematics, National Central University)

Abstract. We propose a new framework of multiscale finite element methods (MsFEM) for solving second-order partial differential equations (PDEs) exhibiting multiscale behavior. Our target applications include elliptic interface problems, convection-diffusion equations, and Helmholtz equations. The key ingredient of MsFEM is a set of multiscale basis functions, constructed by solving the original PDE problem locally with proper boundary conditions. The selection of these boundary conditions plays an important role in the overall performance of MsFEM, and finding an appropriate boundary condition setting for a particular application is a current topic in MsFEM research. Methods using purely local information and methods using purely global information are two popular classes of MsFEM in the literature. In the proposed framework, namely the iteratively adaptive MsFEM (i-ApMsFEM), local and global information is exchanged through updated local boundary conditions for the multiscale basis functions. Once the multiscale solution is recovered from the solution of the global numerical formulation on coarse grids, which couples these multiscale basis functions, it provides feedback for updating the local boundary conditions on each coarse element. As the approach iterates, the quality of the MsFEM solution improves, since the adaptive basis functions are expected to capture the multiscale features of the solution more accurately. Some numerical results for convection-diffusion and interface problems are reported, and some suggested research topics along this direction are also included.

(15:40-16:55)

Noise-reducing Collective Communication Algorithms

by Reiji Suda (Department of Computer Science, The University of Tokyo)

Abstract. Operating systems consume a small amount of CPU time for various housekeeping operations, known as "OS noise". Such noise is not problematic on sequential and small-scale parallel systems, but it can incur an unexpectedly large overhead on large-scale parallel systems. In this presentation, we show some collective communication algorithms that can reduce the influence of OS noise.

Implementation and Applications of PETSc Linear Solver Selection

by Chau-Yi Chou (National Center for High-Performance Computing)

Abstract. The linear system solver plays a critical role in scientific computing. PETSc (Portable, Extensible Toolkit for Scientific Computation) is a well-known open-source, parallel, and scalable package of linear solvers. This study implemented the PETSc Linear Solver Selection (LSS) to speed up users' computations with a shorter learning curve and a shorter cycle of numerical experiments. The PETSc LSS was developed with two types of user interface: libraries and a command-line interface (CLI). We systematically present the results of several problems solved by PETSc LSS; they show around a 10x speedup compared with traditional solvers. Moreover, we hope to help users of NCHC (National Center for High-Performance Computing) select a suitable linear solver via our PETSc LSS in the future.

A Correctness Checking Framework for Empirical Auto-tuning

by Shoichi Hirasawa (Graduate School of Information Sciences, Tohoku University)

Abstract. Empirical auto-tuning is attracting much attention in HPC because it greatly reduces the programmers' burden of improving execution performance by automatically tuning applications to specific target platforms. However, a tuned application needs to run correctly, and the burden of checking the correctness of execution across a series of performance evaluations with different code variants still remains with the programmers. Programmers need to manually check whether the result data of every code variant are as expected, in order to ensure that each variant, and thereby the tuned application, runs correctly. This problem becomes worse when the programmer who needs to tune an application does not fully know what result data the application users expect, while those users do not fully know how to tune their application for specific platforms. In this presentation, a framework for automatically checking the result data of executions of different code variants is proposed. The framework uses information specified by the application users to automatically check the execution results of the variants, mitigating the programmers' burden of assuring the correctness of application executions during auto-tuning.

February 28, 2015 (Saturday)

(09:10-10:00)

Automatic Tuning for Parallel FFTs on GPU Clusters

by Daisuke Takahashi (Center for Computational Sciences, University of Tsukuba)

Abstract. In this talk, we propose an implementation of a parallel fast Fourier transform (FFT) with automatic performance tuning on GPU clusters. Because parallel FFTs require all-to-all communication, one goal for parallel FFTs on GPU clusters is to minimize the PCI Express transfer time and the MPI communication time. Performance results of FFTs on a GPU cluster are reported.

Simulations of Smoke Haze Effects Refresh A Vision for Modern Dance

by Chia-Chen Kuo (National Center for High-Performance Computing)

Abstract. With its mixture of Eastern and Western cultures, Taiwan is full of amazing artistic performances, such as those of the Cloud Gate Dance Theatre, which performs modern dance. Inspired by their fascinating performances, this study uses dynamics effects, computed on the render farm of the Formosa series supercomputers in Taiwan, to bring a new vision to modern dance. In this study, the relationship between the two major parameters of the particle system, diffusion and opacity, was elucidated to determine how long it takes for particles to disappear. Furthermore, the non-contact force between the particles was defined by adjusting the cohesion and repulsion forces, so that the spread and cohesion phenomena were represented naturally. Finally, a vortex flow was added to simulate wind blowing through the smoke haze system. In conclusion, we successfully applied dynamics effects through supercomputer rendering to bring new life to modern dance. These efforts toward a new fusion of art and science launch a vision for potential future scenarios.

(10:10-11:00)

Overcoming Performance Portability Issues on Modern HPC Systems

by Ryusuke Egawa (Cyberscience Center, Tohoku University)

Abstract. In this presentation, based on the lessons learned from the installation of the SX-ACE system at Tohoku University, performance portability issues across system generations and research activities to overcome these issues are discussed.

Teaching Mathematics in the Parallel Vision

by Jengnan Tzeng (Department of Mathematical Sciences, National Chengchi University)

Abstract. There are many wonderful mathematical algorithms for sequential computing. As data grow rapidly and parallel machines come into wide use, we try to implement these methods in parallel versions. We will see that some of these methods or algorithms cannot be improved in this way. Hence, what counts as a "good" method or a "good" algorithm should be redefined today: the properties of the computational hardware, not only its speed but also its quantity of memory, should be considered in this day and age. I will share some experiences of teaching parallel computing to mathematics students, and I will propose a concept of memoryless computing that is more natural for students constructing a parallel solver for a big system.

(11:10-12:00)

CholeskyQR2: An Algorithm for the Cholesky QR Factorization with Reorthogonalization

by Takeshi Fukaya (Advanced Institute for Computational Science, RIKEN)

Abstract. The Cholesky QR factorization is an algorithm for computing the QR factorization of a matrix, and it has excellent properties from the viewpoint of high-performance computing. However, it is well known that the Cholesky QR factorization is numerically unstable: the loss of orthogonality of the computed Q factor grows rapidly with the condition number of the input matrix. Thus, the Cholesky QR factorization has rarely been used in practice. Recently, we have pointed out that the instability of the Cholesky QR factorization can be remedied simply by repeating the process twice, which we call the CholeskyQR2 algorithm. We verify that CholeskyQR2 computes the QR factorization as accurately as the traditional Householder QR algorithm as long as the condition number is smaller than a threshold of around $10^8$. In this talk, we first introduce the suitability of the Cholesky QR factorization for high-performance computing. We then present our recent results on the numerical stability of the CholeskyQR2 algorithm. We finally give some performance results which indicate that CholeskyQR2 retains an advantage in computation time.
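The "repeat the process twice" idea described in the abstract can be sketched in a few lines. This is an illustrative NumPy sketch of the general scheme, not the authors' implementation; in particular, a general solve stands in for the triangular solve (TRSM) a tuned code would use.

```python
import numpy as np

def cholesky_qr(A):
    """One Cholesky QR step: A^T A = R^T R, then Q = A R^{-1}."""
    G = A.T @ A                      # Gram matrix (a SYRK plus one reduction in parallel)
    R = np.linalg.cholesky(G).T      # upper-triangular Cholesky factor
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1}; a TRSM in a tuned code
    return Q, R

def cholesky_qr2(A):
    """CholeskyQR2: apply Cholesky QR twice to recover orthogonality."""
    Q1, R1 = cholesky_qr(A)
    Q2, R2 = cholesky_qr(Q1)
    return Q2, R2 @ R1               # A = Q2 (R2 R1), with R2 R1 upper triangular
```

The appeal for HPC is visible in the sketch: the only communication-heavy step is the Gram matrix reduction, and everything else is tall-skinny BLAS3 work.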

Hom4PS-3: A Parallel Numerical Solver for Systems of Polynomial Equations

by Tsung-Lin Lee (Department of Applied Mathematics, National Sun Yat-sen University)

Abstract. Hom4PS-3 implements many different numerical homotopy methods including the polyhedral homotopy continuation method. It is capable of carrying out computation in parallel on a wide range of hardware architectures including multi-core systems, computer clusters, distributed environments, and GPUs with great efficiency and scalability. Designed to be user-friendly, it includes interfaces to a variety of existing mathematical software and programming languages such as Python, Ruby, Octave, Sage and Matlab.

Lunch Break (12:00-13:30)

(13:30-14:20)

Parallel Preconditioning Methods on the Intel Xeon Phi

by Kengo Nakajima (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

Abstract. In this talk, the performance and robustness of a preconditioned iterative solver for many-core architectures are evaluated on the Intel Xeon Phi with 240 threads. The preconditioning method is an ILU-based one with CM-RCM reordering. The target application is a static 3D linear-elastic problem discretized by the finite element method. An ELL-based format for the storage of sparse coefficient matrices is proposed. Finally, a strategy for automatic tuning is also described.

Three-Dimensional Photonic Device Analysis: How Linear Algebra Library Tuning Helps

by Cheng-Han Du (Institute of Applied Mathematical Sciences, National Taiwan University)

Abstract. We introduce a photonic simulation tool based on a direct matrix solver. The development of the algorithm and the considerations behind it are explained. With our proposed technique, matrix operations such as ZGEMM in the computation procedure become larger and more efficient to execute in a multicore environment. With heavy use of BLAS and LAPACK, the algorithm can easily be tuned for various computing environments with modern CPUs and/or accelerators.

(14:30-15:20)

Performance Evaluation of Preconditioned Iterative Linear Solver Using OpenMP and OpenACC

by Satoshi Ohshima (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

Abstract. OpenMP is widely used for accelerating many applications on CPUs and MIC, and some applications are accelerated by using OpenACC on GPUs. These parallel programming environments make writing parallel programs easy in a similar fashion: users only have to insert pragmas into existing source code. However, in order to obtain sufficient performance, users may have to write different programs according to the characteristics of both the target architectures and the applications. For example, there are large differences in the numbers of computing cores: traditional multi-core CPUs have around 10 cores and MIC has around 200 cores, whereas GPUs have over 1000 cores that are separated into subgroups. Moreover, in order to obtain sufficient memory access performance, GPUs require "coalesced" memory access. Therefore, users are expected to choose the best algorithm and implementation depending on the situation. In this talk, we show the implementation and performance of the ICCG method using OpenMP on CPU/MIC and OpenACC on GPU.

Hierarchical Schur Method for Solving Linear Systems on GPU Cluster

by Weichung Wang (Institute of Applied Mathematical Sciences, National Taiwan University)

Abstract. We propose a direct hierarchical Schur method to solve sparse symmetric positive-definite linear systems on a GPU cluster. The coefficient matrix and the corresponding elimination tree are reordered by nested dissection without overlapping. Our method factors the diagonal blocks by GPU-based Cholesky factorization in parallel. For the Schur complement matrix, we exploit the structure of the submatrices to develop a scheme for distributing data and scheduling tasks. We factorize these submatrices by dense BLAS3 operations to gain performance acceleration on GPUs. We also study the factors affecting scalability.
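One elimination level of the Schur approach described above can be sketched as follows. This is a hypothetical NumPy illustration with a single interior block and a single separator (the function name and block split are ours); the talk's method applies the same step hierarchically over a nested-dissection tree, with the dense BLAS3 steps offloaded to GPUs.

```python
import numpy as np

def schur_solve(A11, A12, A22, b1, b2):
    """Solve the SPD block system [[A11, A12], [A12^T, A22]] [x1; x2] = [b1; b2]
    by eliminating the interior block A11 through its Schur complement.
    """
    # Cholesky-factor the interior block. In nested dissection there are
    # many such independent diagonal blocks, factorizable in parallel.
    L = np.linalg.cholesky(A11)

    def a11_solve(B):
        # Apply A11^{-1} via two triangular solves with L and L^T.
        return np.linalg.solve(L.T, np.linalg.solve(L, B))

    W = a11_solve(A12)
    S = A22 - A12.T @ W                                  # Schur complement: dense BLAS3 update
    x2 = np.linalg.solve(S, b2 - A12.T @ a11_solve(b1))  # separator unknowns
    x1 = a11_solve(b1) - W @ x2                          # back-substitute interior unknowns
    return x1, x2
```

The dense update forming S is exactly the kind of BLAS3 workload that maps well onto GPUs, which is why the submatrix structure and task scheduling matter for scalability.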

(15:40-16:30)

Group Discussion

Closing (16:30-16:40)

Sponsors