
2014 Conference on Advanced Topics and Auto Tuning

in High-Performance Scientific Computing

March 14-15, 2014

Room 101, Mathematics Research Center, National Taiwan University

Download the Program Booklet PDF

Plenary Speakers and Talks

  • I-Hsin Chung (T. J. Watson Research Center, IBM)

    • Auto-tuning for Data Centric Computing

    • Application Performance Measurement on IBM Blue Gene/Q

  • Edmond Chow (School of Computational Science and Engineering, Georgia Institute of Technology)

    • A Paradigm for Very Fine-Grained Parallel Matrix Computations

    • Tuning and Optimization of Particle Simulations with Long-Range Interactions

  • Jakub Kurzak (Innovative Computing Laboratory, EECS, University of Tennessee)

    • Principles of CUDA Development and Optimization

    • Bench-testing Environment for Automated Software Tuning (BEAST)

Invited Speakers

  • Hsi-Ya Chang (National Center for High-performance Computing)

  • Jen-Hao Chen (Department of Applied Mathematics, National Hsinchu University of Education)

  • Ray-Bing Chen (Department of Statistics, National Cheng Kung University)

  • Takeshi Fukaya (AICS, RIKEN)

  • Shoichi Hirasawa (Graduate School of Information Sciences, Tohoku University)

  • Feng-Nan Hwang (Department of Mathematics, National Central University)

  • Toshiyuki Imamura (AICS, RIKEN)

  • Takahiro Katagiri (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

  • Che-Rung Lee (Department of Computer Science, National Tsing Hua University)

  • Yu-Tuan Lin (Institute of Mathematics, Academia Sinica)

  • Horng-Shing Lu (Institute of Statistics, National Chiao Tung University)

  • Kengo Nakajima (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

  • Satoshi Ohshima (Supercomputing Research Division, Information Technology Center, The University of Tokyo)

  • Katsuhisa Ozaki (Shibaura Institute of Technology)

  • Reiji Suda (Department of Computer Science, The University of Tokyo)

  • Tomohiro Suzuki (University of Yamanashi)

  • Daisuke Takahashi (Center for Computational Sciences, University of Tsukuba)

  • Hiroyuki Takizawa (Graduate School of Information Sciences, Tohoku University)

  • Teruo Tanaka (Department of Computer Science, Kogakuin University)

  • Weichung Wang (Institute of Applied Mathematical Sciences, National Taiwan University)


Aims and Scope

The 2014 Conference on Advanced Topics and Auto Tuning in High-Performance Scientific Computing focuses on the scientific impact of the latest computer architectures and on approaches to achieving high-performance computing on these leading-edge machines. Advances in many-core architectures and high-end computers have demonstrated their significance for scientific discovery and engineering achievement. The complexity of these newly developed computers, however, also poses contemporary challenges in extracting the best efficiency from their highly promising computational capabilities. The conference encourages interdisciplinary communication among researchers from applied mathematics, statistics, computer science, the physical sciences, engineering, and industry to prompt innovations and breakthroughs in this exciting field. The main themes include, but are not limited to, simulations, numerical methods, applications, hardware, and particularly software and algorithm auto-tuning via statistical methods.

Contact Person

For any further information, please consult the above menu links or contact Ms. Wei-Jhen Tsai (tassist3[AT]tims.ntu.edu.tw).


March 14, 2014 (Friday)

Opening

09:00-09:10 [K. Nakajima, W. Wang] Opening

Matrix Computation

09:10-10:00 [W. Wang] PT 1 Edmond Chow

10:10-11:10 [F.N. Hwang] Takeshi Fukaya, Che-Rung Lee, Satoshi Ohshima

11:20-12:00 [T. Imamura] Kengo Nakajima, Feng-Nan Hwang

GPU

13:30-14:20 [D. Takahashi] PT 2 Jakub Kurzak

14:30-15:30 [H.Y. Chang] Daisuke Takahashi, Yu-Tuan Lin, Tomohiro Suzuki

Auto-Tuning and Data

16:00-16:50 [T. Katagiri] PT 3 I-Hsin Chung

17:00-17:40 [H. Takizawa] Hsi-Ya Chang, Henry Horng-Shing Lu

March 15, 2014 (Saturday)

Performance

09:10-10:00 [R. Suda] PT 4 I-Hsin Chung

10:10-11:10 [R.B. Chen] Takahiro Katagiri, Weichung Wang, Hiroyuki Takizawa

Applications

11:20-12:00 [S. Hirasawa] Katsuhisa Ozaki, Jen-Hao Chen

Auto-Tuning

13:30-14:20 [C.R. Lee] PT 5 Jakub Kurzak

14:30-15:30 [J.H. Chen] Toshiyuki Imamura, Teruo Tanaka, Shoichi Hirasawa

Auto-Tuning with Applications

16:00-16:50 [K. Nakajima] PT 6 Edmond Chow

17:00-17:40 [T. Tanaka] Ray-Bing Chen, Reiji Suda

Closing

17:40-17:50 [R. Suda, T. Katagiri] Closing

March 14, 2014 (Friday)

Opening (09:00-09:10)

Matrix Computation (09:10-10:00)

Plenary Talk 1

A Paradigm for Very Fine-Grained Parallel Matrix Computations

by Edmond Chow (School of Computational Science and Engineering, Georgia Institute of Technology, USA)

Abstract. This talk addresses the problem of the massive concurrency that scientific and engineering algorithms require in order to run efficiently on current and future computer architectures. We propose a counter-intuitive paradigm for parallelizing certain matrix computations that leads to much more parallelism than conventional approaches. Our approach is to express a matrix factorization as a large number of scalar equations, and then to solve these equations approximately via an asynchronous iterative method. The approach is driven by computer architectural considerations, rather than by modifying the best existing algorithms, and is conceivable only in the presence of massive amounts of parallelism within a node. The paradigm of transforming a problem into a more parallel, iterative one could be applied in many situations. The matrix algorithms that benefit the most are those where an approximation is suitable, e.g., in preconditioning or when the matrix data has redundancies or noise. The paradigm, in effect, trades accuracy for concurrency. Our main example will be incomplete factorizations, but we will also discuss how the ideas extend to other sparse matrix problems such as matrix completion and the solution of sparse triangular systems.
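The scalar-equation formulation the abstract describes can be illustrated with a toy fixed-point iteration. The sketch below is an assumption-laden illustration, not the speaker's implementation: it applies Jacobi-style sweeps to the factorization equations of a small dense matrix, and every entry update within a sweep reads only old values, which is where the massive (asynchronous) parallelism comes from.

```python
import numpy as np

def fine_grained_lu(A, sweeps=30):
    # Jacobi-style sweeps over the scalar factorization equations:
    #   i > j:  L[i,j] = (A[i,j] - sum_{k<j} L[i,k]*U[k,j]) / U[j,j]
    #   i <= j: U[i,j] =  A[i,j] - sum_{k<i} L[i,k]*U[k,j]
    # Every update in a sweep uses only old values, so all entries
    # can be updated concurrently (asynchronously in practice).
    n = A.shape[0]
    L, U = np.eye(n), np.triu(A).astype(float)
    for _ in range(sweeps):
        Ln, Un = L.copy(), U.copy()
        for i in range(n):
            for j in range(n):
                s = sum(L[i, k] * U[k, j] for k in range(min(i, j)))
                if i > j:
                    Ln[i, j] = (A[i, j] - s) / U[j, j]
                else:
                    Un[i, j] = A[i, j] - s
        L, U = Ln, Un
    return L, U
```

With the full nonzero pattern, as here, the sweeps converge to the exact LU factors; restricting updates to an incomplete sparsity pattern yields an incomplete factorization, trading accuracy for concurrency as the abstract describes.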

Matrix Computation (10:10-11:10)

A Communication-Avoiding Algorithm for the Gram-Schmidt Orthogonalization

by Takeshi Fukaya (RIKEN, Japan)

Abstract. Recently, the importance of Communication-Avoiding (CA) algorithms has been widely acknowledged in the field of high-performance computing. In this talk, drawing on the idea behind the TSQR algorithm for the Householder QR factorization, we present a CA algorithm for classical Gram-Schmidt orthogonalization. We then evaluate its performance on the K computer.
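The communication pattern at stake can be seen in a serial sketch. The block variant below is a generic illustration (not the speaker's algorithm): it projects a whole panel against all previously computed vectors with two matrix products, so the number of global reductions scales with the number of panels rather than the number of columns.

```python
import numpy as np

def block_cgs(A, b=2):
    # Classical Gram-Schmidt by panels: each panel is orthogonalized
    # against ALL previous columns with two matrix products (one
    # global reduction per panel instead of one per column), then
    # orthonormalized locally.
    m, n = A.shape
    Q = np.zeros((m, n))
    for j in range(0, n, b):
        P = A[:, j:j + b].copy()
        if j > 0:
            P -= Q[:, :j] @ (Q[:, :j].T @ P)   # block projection
        Q[:, j:j + b], _ = np.linalg.qr(P)     # local orthonormalization
    return Q

rng = np.random.default_rng(0)
Q = block_cgs(rng.standard_normal((100, 8)))
```

In a distributed setting the two matrix products become a single reduction per panel; TSQR-style tree reductions replace the local `qr` for tall skinny panels.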

Minimal Split Checkerboard Method for Matrix Exponentials

by Che-Rung Lee (Department of Computer Science, National Tsing Hua University, Taiwan)

Abstract. Multiplication of matrix exponentials is one of the computational kernels in simulations of quantum statistical mechanics, in which the matrices are symmetric and sparse. Although the matrix is not particularly large, its multiplication needs to be performed millions of times, and thus becomes a performance bottleneck. For a sparse symmetric matrix $A$, the checkerboard method splits $A=A_1+A_2+\cdots + A_k$ and approximates $e^A$ by $e^{A_1}e^{A_2}\cdots e^{A_k}$, in which each $e^{A_i}$ is sparse. When combined with sparse matrix techniques, the checkerboard method can significantly reduce the expense of storing and multiplying $e^A$. However, the accuracy of the checkerboard method degrades as the number of split matrices increases. In this talk, the Minimal Split Checkerboard method (MSCKB) is introduced with two enhancements: the Block Checkerboard method (BlkCKB) and the Low-Rank Checkerboard method (LRCKB). All the ideas can be extended to exponentiate skew-symmetric matrices and general matrices. Experiments based on simulations of quantum statistical mechanics demonstrate the effectiveness of the proposed methods.
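As a minimal illustration of the splitting idea (not the MSCKB method itself), the snippet below splits a small symmetric matrix as $A=A_1+A_2$ and compares $e^A$ with $e^{A_1}e^{A_2}$; the product is only an approximation, and its error grows with the commutator of the parts, which is why accuracy degrades as the number of split matrices increases.

```python
import numpy as np

def expm_taylor(A, terms=30):
    # Truncated Taylor series for e^A; adequate for the tiny-norm
    # matrices in this toy, not a production matrix exponential.
    E = np.eye(A.shape[0])
    T = np.eye(A.shape[0])
    for k in range(1, terms):
        T = T @ A / k
        E = E + T
    return E

# Split a small symmetric A into A1 + A2, each with a cheap exponential.
A1 = np.array([[0.0, 0.1], [0.1, 0.0]])      # off-diagonal part
A2 = np.array([[0.05, 0.0], [0.0, -0.05]])   # diagonal part
A = A1 + A2
err = np.linalg.norm(expm_taylor(A) - expm_taylor(A1) @ expm_taylor(A2))
# err is on the order of ||[A1, A2]|| / 2: the splitting trades
# accuracy for cheap, sparse factors.
```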

Implementation and Performance Evaluation of SpMV on Modern Parallel Processors

by Satoshi Ohshima (Supercomputing Research Division, Information Technology Center, The University of Tokyo, Japan)

Abstract. Various parallel processors, such as CPUs, MICs, and GPUs, are widely used today. Sparse matrix-vector multiplication (SpMV) is a key kernel in numerical applications, and there are high expectations that these current processors can deliver high SpMV performance. In this talk, implementations and performance evaluation results of SpMV are shown.
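As a reference point for what is being tuned, here is a minimal CSR (compressed sparse row) SpMV in plain Python; the row loop is embarrassingly parallel, while the irregular, indirect access to `x` is what makes SpMV performance so sensitive to the processor and the matrix structure.

```python
import numpy as np

def spmv_csr(data, indices, indptr, x):
    # y = A @ x with A in CSR format: data holds the nonzeros row by
    # row, indices their column numbers, indptr the row boundaries.
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):                       # rows are independent
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]  # indirect access to x
    return y

# A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form:
data, indices, indptr = [2., 1., 3., 4., 5.], [0, 2, 1, 0, 2], [0, 2, 3, 5]
y = spmv_csr(data, indices, indptr, np.ones(3))
```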

Matrix Computation (11:20-12:00)

Parallel Preconditioning Methods for Ill-conditioned Problems

by Kengo Nakajima (Supercomputing Research Division, Information Technology Center, The University of Tokyo, Japan)

Abstract. We evaluated the performance and robustness of parallel preconditioning methods for ill-conditioned problems based on BILUT(p,d,t). Two types of parallel implementation, LBJ (Localized Block Jacobi) and HID (Hierarchical Interface Decomposition), are applied. The developed methods are applied to the Hetero3D code, a parallel finite-element benchmark program, and the code achieved excellent scalability up to 240 nodes (3,840 cores) of the Fujitsu PRIMEHPC FX10 (Oakleaf-FX) at the University of Tokyo. In this talk, various aspects of optimization and automatic tuning for this type of problem will be discussed.

A Parallel Adaptive Nonlinear Elimination Preconditioned Inexact Newton Method for Transonic Full Potential Equation

by Feng-Nan Hwang (Department of Mathematics, National Central University, Taiwan)

Abstract. We propose and study a right preconditioned inexact Newton method for the numerical solution of large sparse nonlinear system of equations. The target applications are nonlinear problems whose derivatives have some local discontinuities such that the traditional inexact Newton method suffers from slow or no convergence even with globalization. The proposed adaptive nonlinear elimination preconditioned inexact Newton method consists of three major ingredients: the subspace correction, the global update, and the determination of a new partition. The key idea is to remove the local high nonlinearity before performing the global Newton update. The partition used to define the subspace nonlinear problem is chosen adaptively based on the information derived from the intermediate Newton solution. Some numerical experiments are presented to demonstrate the robustness and efficiency of the algorithm compared to the classical inexact Newton method. Some parallel performance results obtained on a cluster of PCs are reported.

Lunch Break (12:00-13:30)

GPU (13:30-14:20)

Plenary Talk 2

Principles of CUDA Development and Optimization

by Jakub Kurzak (Innovative Computing Laboratory, EECS, University of Tennessee, USA)

Abstract. The objective of this short tutorial-style talk is to familiarize the audience with the range of software libraries and tools provided by Nvidia in the CUDA SDK, and to outline basic procedures for performance optimizations. The presentation provides a brief overview of Nvidia GPU architectures, and their capabilities, and a short introduction of the CUDA programming model, followed by the discussion of the Nvidia software stack with emphasis on high performance numerical libraries. Then, performance assessment / diagnostics using the NVPROF tool is introduced. Finally, techniques for low-level performance optimizations are discussed: writing inline PTX code, assembly using PTXAS, and disassembly using NVDISASM and CUOBJDUMP. Pointers to appropriate documentation are provided throughout the presentation.

GPU (14:30-15:30)

Implementation of Parallel FFTs on GPU Clusters

by Daisuke Takahashi (Center for Computational Sciences, University of Tsukuba, Japan)

Abstract. In this talk, an implementation of parallel fast Fourier transforms (FFTs) on GPU clusters is presented.

Because parallel FFTs require all-to-all communications, one goal for parallel FFTs on GPU clusters is to minimize the PCI Express transfer time and the MPI communication time. Performance results of parallel FFTs on a GPU cluster are reported.
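The reason all-to-all communication is unavoidable can be seen in the four-step FFT formulation that underlies most parallel FFTs. In the serial sketch below (illustrative only, not the speaker's implementation), an FFT of length N = N1*N2 becomes row FFTs, a twiddle multiplication, and column FFTs; with the array distributed by rows, switching from the first FFT phase to the second requires a global transpose, i.e., an all-to-all exchange.

```python
import numpy as np

def four_step_fft(x, N1, N2):
    # Four-step FFT of length N = N1*N2 with index split
    # n = n1 + N1*n2 on input and k = k2 + N2*k1 on output.
    w = np.exp(-2j * np.pi / (N1 * N2))
    A = x.reshape(N2, N1).T        # A[n1, n2] = x[n1 + N1*n2]
    A = np.fft.fft(A, axis=1)      # step 1: FFTs over n2
    n1 = np.arange(N1)[:, None]
    k2 = np.arange(N2)[None, :]
    A = A * w ** (n1 * k2)         # step 2: twiddle factors
    # Switching FFT direction here is, in a distributed setting,
    # the global transpose (all-to-all exchange).
    B = np.fft.fft(A, axis=0)      # step 3: FFTs over n1
    return B.reshape(-1)           # step 4: X[k2 + N2*k1]
```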

Parallel GPU Implementation of Local Radial Basis Approximation for Elliptic Equations

by Yu-Tuan Lin (Institute of Mathematics, Academia Sinica, Taiwan)

Abstract. The meshfree method is an effective numerical scheme for solving partial differential equations and has been applied to many engineering problems. In meshfree discretization the nodes may be placed arbitrarily in the problem domain, whereas a mesh structure is required in the finite element or finite difference method. In this talk, we introduce meshfree approximation formulations based on radial basis functions over a finite set of collocation points. In constructing the approximation function, the only geometrical data needed is the local configuration of nodes falling within its influence domain. Since only a small number of local influence points is selected, each local approximated system is small, and the resulting small linear systems can be solved independently by GPU threads in parallel. Our implementation also includes a truncation algorithm that reduces the condition numbers of the global mass and stiffness matrices. The implementation confirms the efficiency and stability of the meshfree scheme.
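A minimal sketch of the locality idea, using a hypothetical 1-D interpolation example rather than the speakers' PDE solver: each evaluation point uses only its k nearest centers (its influence domain), so every evaluation reduces to an independent k-by-k solve that could be assigned to its own GPU thread.

```python
import numpy as np

def local_rbf_interp(centers, fvals, x, k=5, c=0.1):
    # Gaussian RBF interpolation over only the k nearest centers:
    # one small, independent k-by-k linear solve per evaluation.
    idx = np.argsort(np.abs(centers - x))[:k]   # the influence domain
    xs, fs = centers[idx], fvals[idx]
    phi = lambda r: np.exp(-(r / c) ** 2)       # Gaussian RBF
    M = phi(np.abs(xs[:, None] - xs[None, :]))
    w = np.linalg.solve(M, fs)                  # small local system
    return w @ phi(np.abs(xs - x))

centers = np.linspace(0.0, 1.0, 41)
fvals = np.sin(2 * np.pi * centers)
approx = local_rbf_interp(centers, fvals, 0.3125)
```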

Implementation of Tile QR Factorization on CPU-GPU system

by Tomohiro Suzuki (University of Yamanashi, Japan)

Abstract. Tile algorithms for matrix factorization have the coarse-grained parallelism required by recent hardware. We have implemented the tile QR factorization on a CPU-GPU heterogeneous system. There are two main steps in tile matrix factorization: the factorization step and the update step. Because the update steps are rich in Level 3 BLAS operations, they can be accelerated well by GPUs. In this talk, we will introduce our implementation of the tile QR factorization on the CPU-GPU system.

Auto-Tuning and Data (16:00-16:50)

Plenary Talk 3

Auto-tuning for Data Centric Computing

by I-Hsin Chung (T. J. Watson Research Center, IBM, USA)

Abstract. There is an explosion of data in terms of volume, velocity, veracity and variety. To harness this data, large and complicated data-centric computer system architectures are being invented. From the system design point of view, minimizing data motion, modularized and application-driven design are the guiding principles to build a cost-effective system that is adaptive to dynamic workloads. In this presentation, the examples illustrate how auto-tuning can be leveraged for software-hardware system co-design of data centric computing systems.

Auto-Tuning and Data (17:00-17:40)

NCHC pCDR Implementation

by Hsi-Ya Chang (National Center for High-performance Computing, Taiwan)

Abstract. The NCHC pCDR Library (parallel Convection–Diffusion–Reaction) is a set of codes for solving a convection–diffusion–reaction scalar transport equation in parallel by GPU computing. The CDR equation is important because it is one of the most frequently used models in science and engineering: it describes substances distributed in a medium and influenced by the three CDR processes. The main purpose of the pCDR library is to provide CFD scientists with an easy interface for speeding up their programs on GPU machines, so that they can enjoy the benefits of GPU computing even without much knowledge of CUDA. The library has been implemented on the NCHC SUN GPU Cluster (NVIDIA). The latest version provides two-dimensional CFD schemes for solving real transient and steady-state problems. This is joint work with Chau-Yi Chou, Sheng-Hsiu Kuo, Chih-Wei Hsieh and Yu-Fen Cheng.

Phenomenal investigations on mixed big data using the density functional theory

by Henry Horng-Shing Lu (Institute of Statistics, National Chiao Tung University, Taiwan)

Abstract. This study proposes a new methodology to investigate the physical phenomena of high-dimensional mixed data from the perspective of density functional theory in Hilbert space. Inspired by the Hohenberg-Kohn-Sham theorem, a dimension-reducing scheme and the associated mathematical approaches are adopted to process the data density distribution in specified spaces. By associating the methodology with Noether's and Kato's theorems, the informative features of the mixed big data, the effect of uniform coordinate scaling, and the corresponding clustering morphologies can be successfully and visually elucidated by evaluating the Lagrangian density functional (LDF) and energy density functional (EDF) of the system of interest. The barriers of density of action rate (DAR) formed between data groups in the LDF morphology indicate the threshold of data mixtures and the strength of data affinity, while the DAR trenches depict the enclosures of data groups as well as information on data connectivity. By carefully considering these extracted features within the data groups, the evolution of data migration under diverse circumstances can also be visually illustrated. Simulated results illustrate that the proposed methodology provides an alternative route for analyzing data characteristics with abundant physical insights. As a further demonstration on a non-constructed dataset without ground truth, the developed methodology is also applied to the post-processing of MRI, where better tumor recognition is achieved in the T1 post-contrast and T2 modes. The post-processing of MRI using this DFT treatment may help scientists in the judgment of clinical pathology and in applications of high-dimensional biomedical image processing.

March 15, 2014 (Saturday)

Performance (09:10-10:00)

Plenary Talk 4

Application Performance Measurement on IBM Blue Gene/Q

by I-Hsin Chung (T. J. Watson Research Center, IBM, USA)

Abstract. One major part of performance tuning is understanding the characteristics of applications and the utilization of the various system components. This tutorial provides an overview of the support and techniques commonly exercised on the IBM Blue Gene/Q supercomputer. Similar methodologies can be applied on other high-performance computing (HPC) systems. The presentation uses examples to show how performance is measured and which related issues, such as overhead, should be considered. I will discuss how automated measurement provides a solid foundation for auto-tuning on large-scale HPC systems.

Performance (10:10-11:10)

Towards Auto-tuning Facilities in Supercomputers in Operation - The FIBER Approach and Minimizing Software-stack Requirements -

by Takahiro Katagiri (Supercomputing Research Division, Information Technology Center, The University of Tokyo, Japan)

Abstract. Although several auto-tuning (AT) frameworks have been proposed, few can be utilized on supercomputers in operation. One of the reasons is that the software stacks required by AT frameworks do not fit the constraints of supercomputers in operation, such as batch job schedulers and prohibitions on modifying the OS kernel. In this presentation, we explain the FIBER framework, which enables us to provide AT libraries with fully user-level execution. We introduce the ppOpen-AT system, an AT language based on the FIBER framework. ppOpen-AT is the core AT system for ppOpen-HPC, a project developing free numerical middleware for post-peta-scale environments. The performance evaluation of the AT effect of ppOpen-AT includes real simulation codes from the ppOpen-HPC project. The AT effect of ppOpen-AT on current CPUs, such as the Ivy Bridge, the Xeon Phi, and the Sparc64 IV-fx, will be shown.

Performance Modelling for Qualitative and Quantitative Factors in Auto-Tuning

by Weichung Wang (Institute of Applied Mathematical Sciences, National Taiwan University, Taiwan)

Abstract. Auto-tuning problems involving both qualitative and quantitative (Q&Q) factors are common in scientific computing software, but few model-based tuning methods handle such Q&Q factors. To minimize the total runtime, we propose several Kriging-based surrogate methods that manage the Q&Q factors separately or jointly. The proposed approaches are applied to a parallel algebraic multigrid linear system solver simulating bubbles in liquid. Numerical results identify the advantages of these surrogate schemes and show their efficiency.
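The quantitative half of such a surrogate can be sketched with ordinary Kriging (Gaussian-process) regression. The toy below fits a 1-D GP with a squared-exponential kernel to sampled runtimes and predicts at new parameter values; handling qualitative factors jointly, as in the talk, needs a more elaborate kernel and is not shown.

```python
import numpy as np

def kriging_predict(X, y, Xs, ell=1.0, noise=1e-8):
    # Gaussian-process (Kriging) mean prediction with a
    # squared-exponential kernel; noise is a small jitter term.
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / ell) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return k(Xs, X) @ alpha

X = np.array([0.0, 1.0, 2.5, 4.0])   # sampled tuning-parameter values
y = np.array([3.1, 2.0, 2.6, 4.2])   # e.g. measured runtimes
pred = kriging_predict(X, y, np.array([0.5, 3.0]))
```

The surrogate interpolates the measured points and can be minimized cheaply in place of running the actual solver at every candidate configuration.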

An Extensible Programming Framework for Custom Code Transformations

by Hiroyuki Takizawa (Graduate School of Information Sciences, Tohoku University, Japan)

Abstract. In general, different system architectures require different performance optimizations. Thus, we need to optimize an application code for a particular system to achieve high performance, which degrades performance portability across system architectures. To improve performance portability, we are exploring an effective way to isolate system-specific optimizations from an application code. In this talk, I will present a programming framework named Xevolver, which is extensible through custom translation rules written in XML. Xevolver allows programmers to define custom compiler directives and the associated code transformations in an external file. As a result, programmers do not need to complicate the application codes themselves to achieve high performance portability, i.e., to apply performance optimizations for multiple system architectures.

Applications (11:20-12:00)

Block Matrix Computations in Terms of a Priori Error Analysis

by Katsuhisa Ozaki (Shibaura Institute of Technology, Japan)

Abstract. This talk is concerned with the balance between computational performance and the accuracy of numerical results; basically, there is a tradeoff between them. An a priori error bound is used to discuss worst-case accuracy. Let A and I be a triangular matrix and the identity matrix, respectively, and let X denote an approximate inverse of A. In the talk, we discuss block computations that balance computational performance against the a priori error bound for the residual AX - I (and likewise XA - I).
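The residual the abstract analyzes is easy to measure experimentally. The toy below (an illustration, not the speaker's blocked scheme) forms an approximate inverse X of a well-conditioned triangular matrix A and evaluates ||AX - I||, the quantity the a priori bound controls.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# A well-conditioned upper triangular matrix (dominant diagonal).
A = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)
X = np.linalg.inv(A)            # approximate inverse computed in floating point
R = A @ X - np.eye(n)           # the residual discussed in the abstract
r = np.linalg.norm(R, np.inf)   # small, but not exactly zero
```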

An Automatic Clustering Algorithm for Distribution-Valued Data

by Jen-Hao Chen (Department of Applied Mathematics, National Hsinchu University of Education, Taiwan)

Abstract. We propose an intuitive clustering algorithm for distribution-valued data. The proposed algorithm is computationally simple and can determine a suitable neighborhood of the target data with a data-driven learning mechanism. Numerical experiments show that the proposed algorithm is reliable and provides correct classifications. We also apply the algorithm to categorize a subset of the COREL image database; some algebraic operations are accelerated with the cuBLAS library. The results reveal that the proposed clustering algorithm performs well in color image categorization.

Lunch Break (12:00-13:30)

Auto-Tuning (13:30-14:20)

Plenary Talk 5

Bench-testing Environment for Automated Software Tuning (BEAST)

by Jakub Kurzak (Innovative Computing Laboratory, EECS, University of Tennessee, USA)

Abstract. The goal of BEAST is to create a framework for exploring and optimizing the performance of computational kernels on hybrid processors that 1) applies to a diverse range of computational kernels, 2) (semi-)automatically generates better performing implementations on various hybrid processor architectures, and 3) increases developer insight into why given kernel / processor combinations have the performance profiles they do. We call this form of optimization “bench-tuning” because it builds on the model used for traditional benchmarking by combining an abstract kernel specification and corresponding verification test with automated testing and data analysis tools to achieve this threefold goal. The novelty of the BEAST approach lies in the following components: A) generation of a large search space without any arbitrary / artificial constraints, B) powerful pruning using latent (derived) parameters, such as occupancy, C) including implementation constraints in the pruning process, D) including potentially correctness-violating optimizations combined with user-supplied validation code, and E) providing a harness for massively parallel profiling and benchmarking.
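Components A-C of the approach can be caricatured in a few lines: enumerate an unconstrained tuning space, derive a latent parameter from each point, and prune with it before any benchmarking. The parameter names and bounds below are invented for illustration and are not BEAST's.

```python
from itertools import product

# A toy tuning space: tile sizes and an unrolling depth.
# Step 1: generate the full space with no artificial constraints.
space = list(product([8, 16, 32], [4, 8, 16], [1, 2, 4]))

def threads_per_block(cfg):
    # Step 2: a derived ("latent") parameter, standing in for an
    # occupancy-style metric computed from the raw parameters.
    bx, by, _ = cfg
    return bx * by

# Step 3: prune by the latent parameter and hardware-style limits,
# leaving far fewer configurations to benchmark.
pruned = [c for c in space if 64 <= threads_per_block(c) <= 256]
```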

Auto-Tuning (14:30-15:30)

Automatic-tuning for CUDA-BLAS kernels by Multi-stage d-Spline Pruning Strategy

by Toshiyuki Imamura (RIKEN, Japan)

Abstract. Performance tuning of a CUDA-BLAS kernel demands a large number of data samples over the various combinations of code segments that make up the parameter space. To reduce the cost of sampling this huge parameter space, we take advantage of the incremental d-Spline approximation method. Furthermore, a multi-stage projection technique that reduces the number of parameters is applied in this work. This talk covers the outline of these topics and the latest CUDA-BLAS performance results on Kepler GPU processors.

A Study on Enhancement of Data Fitting Function “d-Spline”

by Teruo Tanaka (Department of Computer Science, Kogakuin University, Japan)

Abstract. "d-Spline" is a fitting function which has high flexibility to adapt given data and can be easily computed. It has the best fit to a series of data points using Akaike's Bayesian Information Criterion (ABIC). As enhancements of d-Spline, prior information such as monotonic function, fixing data points and others can be easily appended. Also, it can be applied to two phase regression problem to find a intersection point. In this presentation, some results of experiments on enhancement of d-Spline will be introduced.

A Light-weight Rollback Mechanism for Testing Code Variants in Auto-tuning

by Shoichi Hirasawa (Graduate School of Information Sciences, Tohoku University, Japan)

Abstract. In auto-tuning, we often need to test various code versions to find the best one. In this presentation, a light-weight rollback mechanism is proposed to test many code variants at low cost. The rollback mechanism can execute one version of the target code block and then undo the execution to test other versions, so that the code block of a large-scale HPC application can be executed repeatedly for auto-tuning. As a result, we significantly reduce the timing overhead of auto-tuning by finding the best code version without executing the whole application many times.
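The mechanism can be mimicked at the script level: snapshot the state a code block touches, time one variant, and roll the state back before trying the next. The sketch below uses a deep copy as the checkpoint, which is far heavier than the proposed light-weight mechanism, but it shows the control flow.

```python
import copy
import time

def pick_best_variant(state, variants):
    # Try each variant on the SAME input state: checkpoint, run,
    # time, roll back -- so the surrounding application never reruns.
    best_name, best_time = None, float("inf")
    for name, fn in variants:
        snapshot = copy.deepcopy(state)   # checkpoint (heavyweight stand-in)
        t0 = time.perf_counter()
        fn(state)                         # execute one code variant
        elapsed = time.perf_counter() - t0
        state.clear()
        state.update(snapshot)            # rollback: undo the execution
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name
```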

Auto-Tuning with Applications (16:00-16:50)

Plenary Talk 6

Tuning and Optimization of Particle Simulations with Long-Range Interactions

by Edmond Chow (School of Computational Science and Engineering, Georgia Institute of Technology, USA)

Abstract. In dynamical particle simulations where the forces are long-ranged, algorithms faster than O(n^2) for n particles are necessary for scalability. This tutorial talk will introduce and explain one such fast algorithm, called particle-mesh Ewald (PME), widely used in computational physics and chemistry, for example, in molecular dynamics. The parameter tuning of this algorithm to minimize computation time is important, as simulations may take weeks or months. The choice of parameters affects accuracy as well as run time, and a minimum run time is sought that satisfies a given accuracy requirement. Some theoretical models of accuracy exist, but these are difficult to use when choosing parameters. To choose parameters, practitioners generally use "trial and error" rather than an automated approach. We have developed a particle simulation code based on PME designed for multicore nodes with Intel Xeon Phi co-processors. Balancing the computation on the CPU with that on accelerators requires a yet more complex parameter choice. The purpose of the talk is to lay the groundwork for a potential automated approach for tuning the particle-mesh Ewald algorithm for modern architectures.
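The tuning problem stems from the Ewald decomposition at the heart of PME: 1/r is split exactly into a fast-decaying short-range part, summed directly within a cutoff, and a smooth long-range part, handled on the mesh with FFTs. The splitting parameter (beta below) shifts work between the two sums, which is precisely the kind of parameter whose choice trades accuracy against run time.

```python
import math

def ewald_split(r, beta):
    # Exact decomposition 1/r = erfc(beta*r)/r + erf(beta*r)/r:
    # the erfc term decays rapidly (direct-space sum within a cutoff),
    # the erf term is smooth (computed on the mesh via FFTs in PME).
    short_range = math.erfc(beta * r) / r
    long_range = math.erf(beta * r) / r
    return short_range, long_range
```

Larger beta shrinks the short-range term (cheaper direct sum) but demands a finer mesh for the same accuracy, and vice versa.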

Auto-Tuning with Applications (17:00-17:40)

Contour estimation via two fidelity computer simulators under limited resources

by Ray-Bing Chen (Department of Statistics, National Cheng Kung University, Taiwan)

Abstract. The utilization of multiple-fidelity simulators for the design and analysis of computer experiments has received increased attention in recent years. In this work, we study the contour estimation problem for complex systems by considering two fidelity simulators. Our goal is to design a methodology for choosing the best-suited simulator and input location for each simulation trial, so that the overall estimation of the desired contour is as good as possible under limited simulation resources. The proposed methodology is sequential and based on the construction of a Gaussian process surrogate for the output measure of interest. We illustrate the methodology on a canonical queueing system and evaluate its efficiency via a simulation study.

Autotuning with a Nuisance Parameter: A Case Study for Power Optimization

by Reiji Suda (Department of Computer Science, The University of Tokyo, Japan)

Abstract. Several kinds of variables affect computing performance. Tuning parameters and cost functions are the variables directly related to autotuning. Sometimes autotuning must be executed under various conditions, which come in three kinds: unobservable; observable and controllable; and observable and uncontrollable. Observable but uncontrollable variables are also called nuisance parameters, and autotuning with a nuisance parameter is not a trivial problem. In this talk we show our method for power optimization, where temperature is a nuisance parameter.

Closing (17:40-17:50)

