2021 Conference on Advanced Topics and Auto Tuning in

High-Performance Scientific Computing

March 19-20, 2021

Room M107, 1st floor, Hong-Jing Building (Math Department Building)

National Central University

(with online participation)

Aims and Scope

The Conference on Advanced Topics and Auto Tuning in High-Performance Scientific Computing focuses on the scientific impact of the latest computer architectures and on approaches to achieving high performance on these leading-edge computers. Advances in many-core architectures and high-end computers have demonstrated their significance for scientific discovery and engineering achievement. The complexity of these newly developed computers, however, also poses contemporary challenges in achieving the best efficiency of their highly promising computational capabilities. The conference encourages interdisciplinary communication among researchers from applied mathematics, statistics, computer science, the physical sciences, engineering, and industry to promote innovations and breakthroughs in this exciting field. The main themes include, but are not limited to, simulations, numerical methods, applications, hardware, and particularly software and algorithm auto-tuning via statistical methods.

Plenary Speaker

  • Trista Chen (Chief AI Officer, Inventec Corporation)

Invited Speakers

  • Meng-Huo Chen (National Chung Cheng University)

  • Ray-Bing Chen (National Cheng Kung University)

  • Takeshi Fukaya (Hokkaido University)

  • Toshiyuki Imamura (RIKEN)

  • Takeshi Iwashita (Hokkaido University)

  • Takahiro Katagiri (Nagoya University)

  • Jyh-Miin Lin (University of Cambridge)

  • Jenq-Kuen Lee (National Tsing Hua University)

  • Kengo Nakajima (The University of Tokyo)

  • Satoshi Ohshima (Nagoya University)

  • Maxim Solovchuk (National Health Research Institutes)

  • Hiroyuki Takizawa (Tohoku University)

  • Hui-Hsu Gavin Tsai (National Central University)

  • Eh Tan (Academia Sinica)

  • Chin-Tien Wu (National Yang Ming Chiao Tung University)

  • Shu-Chih Yang (National Central University)

  • Chun-Chen Yeh (National Kaohsiung Normal University)

  • Yung-Yu Zhuang (National Central University)

Organizing Committee

Contact Person

Ms. Jing-Ru Lu ( jessica@math.ncu.edu.tw )

Download the Program Booklet PDF

Program

March 19, 2021 (Friday) (UTC +8)

March 20, 2021 (Saturday) (UTC +8)

Conference Registration

You are welcome to join the conference. Registration is free and available at the following link.

(Virtual participants will receive an email with the Zoom meeting ID and password.)

NCU Map

Transportation

  • Taiwan Railways Administration

    • Get off at Zhongli Train Station. Take a city bus or a taxi (approx. NT$250) to NCU within 20~30 minutes.

  • Taiwan High Speed Rail

    • Get off at THSR Taoyuan Station. Take direct bus No. 132 or No. 172 at Bus Platform 8 to reach NCU (approx. one per hour, a 15- to 20-minute ride). You may also take other buses heading for Zhongli City, where a connection is needed.

    • City buses No. 132 and No. 172 operate between Zhongli City and NCU, with an extended line to THSR Taoyuan Station at certain hours.

    • Buses depart from THSR Taoyuan Station to downtown Zhongli: No. 170, 171, 5089. Fare: NT$26.

    • Driving directions: From Gaotie South Road, take Provincial Highway No. 31, turn left onto Jhongjheng Rd., and after a few minutes turn right onto Zhongda Road.

More information on NCU

Hotels

Tourism

Sponsors

Title and Abstract

Plenary Speaker

  • Dr. Trista Chen (Chief AI Scientist, AI Center at Inventec)

    • Learning at Scale: From Protein Folding to Federated Learning in Smart Manufacturing

    • Abstract:

    • It has been said that "those long divided shall be united; those long united shall be divided: such is the way of the universe." Distributed learning has gained much popularity for training ever more complex deep-learning models. Recently, a new revolution in distributed learning, called federated learning, promises to deliver more without compromising each computing unit's privacy and is able to train in heterogeneous environments. However, significant challenges arise when each computing unit in the federated learning setting has access to only a small amount of labeled data. In this talk, we share our real-world experience working with first-tier electronics manufacturing facilities on adopting a new deep-learning framework to reduce large data requirements and to learn and tune deep-learning models at scale, and we touch upon open problems and future directions.

Invited Speakers

  • Meng-Huo Chen (National Chung Cheng University)

    • Fluid-structure interactions: one-field monolithic fictitious domain method and its parallelization

    • Abstract:

    • In this research we parallelize the one-field monolithic fictitious domain (MFD) method, an algorithm for the simulation of general fluid-structure interactions (FSI). In this algorithm only one velocity field is solved in the whole domain (one-field), based upon the use of an appropriate L2 projection. "Monolithic" means the fluid and solid equations are solved simultaneously (rather than sequentially). For 3D fluid-structure interaction simulations on moderately resolved meshes, the computations often take several weeks or even months. We parallelize the finite element discretization and the linear system solver in order to reduce the simulation time from several months to a few days. At the initial stage of the research we focus on parallelizing the algorithm on uniform meshes. The implemented parallel algorithm is then extended to simulations on nonuniform meshes, where an adaptive mesh refinement scheme is used to improve accuracy and robustness. Our goal is to provide an efficient, robust algorithm that can handle difficult fluid-structure interactions, such as the collision of multiple immersed solids in a fluid, where a high-resolution mesh is necessary to resolve the phenomena near the collision and the fluid-structure interfaces.

  • Ray-Bing Chen (National Cheng Kung University)

      • Tree-based Gaussian Process with Many Qualitative Factors for Computer Experiments

      • Abstract:

      • In computer experiments, Gaussian process models are commonly used for emulation. However, when both qualitative and quantitative factors are present in the experiments, emulation using Gaussian process models becomes challenging. In particular, when many qualitative factors are involved, existing methods in the literature become cumbersome due to the curse of dimensionality. Motivated by computer simulations for the design of a cooling system, we propose a new tree-based Gaussian process for emulating computer experiments with many qualitative and quantitative factors. The proposed method incorporates tree structures to model the qualitative factors, with Gaussian process models in the leaf nodes for modeling the quantitative factors. Numerical experiments show that the proposed method achieves good performance in model fitting.

  • Takeshi Fukaya (Hokkaido University)

      • Exploiting lower precision computing in the GMRES(m) method

      • Abstract:

      • Lower precision computing has recently attracted much interest in the HPC community due to its advantages on modern and future computer systems. In this research, we focus on the GMRES(m) method, namely the GMRES method with the restart technique, which is widely used for solving linear systems of equations with a general (nonsymmetric) sparse coefficient matrix. Based on the structure induced by the restart technique, mixed precision computing using lower precision arithmetic and data can easily be introduced into GMRES(m). In this talk, we investigate, mainly through numerical experiments, the possibility that mixed precision GMRES(m) using FP64 and FP32 will outperform the conventional GMRES(m) using only FP64. We also consider the possibility of introducing precision even lower than FP32 into GMRES(m).
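The restart structure that makes this mixing natural can be sketched in a few lines: the outer loop keeps the solution and residual in FP64, while each restart cycle (Arnoldi plus the small least-squares solve) runs in a lower precision. This is only an illustrative sketch of the general idea, not the speaker's implementation; the function name and parameters are our own.

```python
import numpy as np

def gmres_restart(A, b, m=20, cycles=50, inner_dtype=np.float32, tol=1e-8):
    """GMRES(m) sketch: outer residual kept in FP64, each restart cycle
    (Arnoldi + small least-squares solve) carried out in lower precision."""
    n = len(b)
    x = np.zeros_like(b)
    Al = A.astype(inner_dtype)                 # low-precision copy of A
    for _ in range(cycles):
        r = b - A @ x                          # FP64 residual
        beta = np.linalg.norm(r)
        if beta < tol * np.linalg.norm(b):
            break
        V = np.zeros((n, m + 1), dtype=inner_dtype)
        H = np.zeros((m + 1, m), dtype=inner_dtype)
        V[:, 0] = (r / beta).astype(inner_dtype)
        for j in range(m):                     # Arnoldi, modified Gram-Schmidt
            w = Al @ V[:, j]
            for i in range(j + 1):
                H[i, j] = V[:, i] @ w
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-30:            # happy breakdown
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(m + 1, dtype=inner_dtype)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H, e1, rcond=None)
        x = x + (V[:, :m] @ y).astype(np.float64)  # FP64 correction update
    return x
```

Because every cycle restarts from a freshly computed FP64 residual, the low-precision cycle acts like the inner solve of iterative refinement, so the attainable accuracy is not limited to FP32 working precision.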

  • Toshiyuki Imamura (RIKEN)

      • HPL-AI benchmark on Fugaku

      • Abstract:

      • The HPL-AI benchmark on the supercomputer Fugaku was awarded the #1 spot in June and November 2020. Our first submission reported a performance of 1.42 EFlop/s, benchmarked on 126,720 nodes (6,082,560 cores), five-sixths of the Fugaku system, in normal mode (2.0 GHz). It was the world's first result to break the exascale barrier in a floating-point arithmetic benchmark. The second submission improved the score with the full system configuration of Fugaku (152,064 nodes, 7,299,072 cores) in boost mode (2.2 GHz), exceeding 2.0 EFlop/s, an outstanding performance value. I will introduce some of the technical issues and the performance analysis.

  • Takeshi Iwashita (Hokkaido University)

    • Hierarchical Block Multi-Color Ordering for Vectorized and Multithreaded ICCG Solver

    • Abstract:

    • In this talk, we introduce the equivalence condition for parallel orderings in the context of IC(0) preconditioning. Based on this condition, we propose a new parallel ordering method to vectorize and parallelize IC(0) preconditioning, called hierarchical block multi-color ordering. In this method, the parallel forward and backward substitutions can be vectorized while preserving the advantages of block multi-color ordering, that is, fast convergence and fewer thread synchronizations. To evaluate the proposed method in a parallel ICCG solver, numerical tests were conducted using seven test matrices on three types of computational nodes. The numerical results indicate that the proposed method outperforms the conventional block and nodal multi-color ordering methods in most of the test cases, which confirms the effectiveness of the method.
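A two-color (red-black) ordering is the simplest member of the multi-color family this talk builds on: unknowns of one color have neighbors only of the other color, so each half-sweep of a Gauss-Seidel-style substitution becomes one fully vectorized update. The sketch below, with a 1D model problem and names of our own choosing, illustrates only that basic idea, not the hierarchical block method itself.

```python
import numpy as np

def redblack_gauss_seidel(f, h, sweeps=2000):
    """Two-color (red-black) Gauss-Seidel for -u'' = f on (0,1), u(0)=u(1)=0.
    Unknowns of the same color are mutually independent, so each half-sweep
    is a single vectorized update -- the core idea of multi-color orderings."""
    u = np.zeros_like(f)
    for _ in range(sweeps):
        # red points (even interior indices): depend only on black neighbors
        u[2:-1:2] = 0.5 * (u[1:-2:2] + u[3::2] + h * h * f[2:-1:2])
        # black points (odd indices): depend only on freshly updated red neighbors
        u[1:-1:2] = 0.5 * (u[0:-2:2] + u[2::2] + h * h * f[1:-1:2])
    return u
```

With more colors (and blocks of unknowns per color, as in the talk), the same pattern trades a larger number of sweeps per iteration for longer, better-vectorizable updates and fewer synchronizations.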

  • Takahiro Katagiri (Nagoya University)

      • Dynamic Preconditioner Selection with Right-Hand-Side Vector by Deep Learning

      • Abstract:

      • In this talk, a preconditioner selection method for sparse iterative solvers is presented. The selection of a preconditioner is one of the crucial tuning procedures in solving linear equations.

      • In this research, an auto-tuning (AT) method with deep learning (DL) for the selection of preconditioners at run time is presented. We use color feature images generated from the input sparse matrices as the DL training data set to predict the best preconditioner for a GMRES solver with restarting. In addition, to handle dynamic changes in the numerical conditions of a computational fluid simulation, we need additional information obtained at run time. The key idea is to utilize information from the right-hand-side vector.

      • The proposed AT method is evaluated with a three-dimensional unsteady incompressible thermal flow simulator on a Cartesian grid system, named Frontflow/violet Cartesian (FFVC). Evaluation results indicate that the prediction accuracy of the proposed method is more than 90% in F1 score.

  • Jenq-Kuen Lee (National Tsing Hua University)

      • Support Scheduling Methods for Sparse Computations on TVM/Halide Compilers with AI Applications

      • Abstract:

      • Compiler infrastructures built around the Halide language and the TVM IR represent image algorithms and AI models in a high-level Halide-style format. Programmers and developers can then schedule these programs with loop tiling, loop vectorization, loop splitting, loop fusion, loop reordering, etc. The scheduling capability of these programming models facilitates the optimization of algorithms for a variety of architectures. In this work, we extend the scheduling policies from loop-based schedulers to deal with sparse applications. We will illustrate how the sparse scheduler can be extended in Halide environments. Directions will be given for potential AI models and applications to use these technologies. In addition, important developments in the area will be highlighted in a broader picture.

  • Jyh-Miin Lin (University of Cambridge, Development and Alumni Relations)

    • High-performance medical imaging reconstruction and PyNUFFT: practical considerations

    • Abstract:

    • Dedicated medical imaging reconstruction methods, such as linear and non-linear algorithms, have posed a challenge to engineers in practical applications, as many mathematical algorithms are computationally expensive. In this talk, we discuss practical considerations in the design of a high-performance system. For instance, one crucial component in medical imaging is the non-uniform fast Fourier transform (NUFFT), which could take several seconds on a single-core CPU before the era of parallel computing. Using modern graphics processing units (GPUs), run times have been reduced to several milliseconds. Another strategy is the alternative Toeplitz method, which can also achieve high performance. In the implementation of PyNUFFT, we achieved a highly accurate NUFFT by choosing the optimal interpolator and by using Reikna/PyCUDA/PyOpenCL scripting. Performance is portable to OpenCL accelerators with minimal changes to the base code. Last, some recent developments will be discussed.
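As a reference point for what NUFFT libraries accelerate: the underlying transform can be written down directly, but the naive version below costs O(MN) for M frequencies and N nonuniform samples, which gridding-plus-FFT implementations such as PyNUFFT reduce to roughly O(N log N). The function name is our own, for illustration only.

```python
import numpy as np

def ndft(x, c, K):
    """Direct (naive) nonuniform DFT: F[k] = sum_j c[j] * exp(-1j * k * x[j])
    for k = 0..K-1, sample locations x in [0, 2*pi).  This O(M*N) evaluation
    is the baseline that NUFFT algorithms approximate much faster."""
    k = np.arange(K)
    # (K, N) matrix of complex exponentials applied to the coefficients
    return np.exp(-1j * np.outer(k, x)) @ c
```

When the sample locations happen to lie on the uniform grid x_j = 2*pi*j/N, this reduces exactly to the ordinary FFT, which is a convenient correctness check.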

  • Kengo Nakajima (The University of Tokyo)

    • Integration of 3D Earthquake Simulation & Real-Time Data Assimilation on h3-Open-BDEC

    • Abstract:

    • Towards the end of Moore's law, we need to develop new algorithms and applications. We are developing h3-Open-BDEC, innovative software for the sustainable promotion of scientific discovery by supercomputers in the exascale era that combines simulation, data, and learning (S+D+L), introducing ideas from data science and machine learning into computational science. h3-Open-BDEC is designed to extract the maximum performance of exascale-era supercomputers, based on a hierarchical, hybrid and heterogeneous (h3) architecture for Big Data & Extreme Computing (BDEC), with minimum energy consumption, focusing on (1) innovative methods for numerical analysis based on the new principle of computing by adaptive precision, and (2) a hierarchical data-driven approach (hDDA) based on machine learning. Integration of (S+D+L) by h3-Open-BDEC enables a significant reduction of computation and power consumption compared with conventional simulations. We have applied the prototype of h3-Open-BDEC to Seism3D/OpenSWPC-DAF (Data-Assimilation-Based Forecast), which was developed by ERI/U.Tokyo for the integration of simulation and data assimilation. In this talk, we will demonstrate real-time data assimilation with the developed code on the Oakbridge-CX system at the University of Tokyo, using real-time measured data from 2,000+ nationwide observation points delivered through JDXnet.

  • Satoshi Ohshima (Nagoya University)

      • Effectiveness of Low-/Mixed-Precision Computation on Parareal Method

      • Abstract:

      • The parareal method is a parallel-in-time method that divides the target time domain into multiple regions, computes them in parallel, and combines the results through correction iterations. In this method, multiple levels of iterative solvers are used repeatedly. However, some of these iterative solvers do not require high-precision computation, because their results are corrected by the other iterative solvers. Therefore, in this study, we investigate the use of low-precision and mixed-precision computation in the parareal method. In this talk, we show the current results of our investigations and evaluations.
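The structure being exploited can be sketched with a cheap coarse propagator G and an expensive fine propagator F for the model problem y' = -y; the names and parameters below are illustrative, not the speaker's code. The fine solves over the time windows are mutually independent, which is where both the parallelism and, in the talk's setting, the tolerance for lower-precision computation live.

```python
import numpy as np

def coarse(y, t0, t1):
    # coarse propagator G: a single explicit Euler step
    return y + (t1 - t0) * (-y)

def fine(y, t0, t1, substeps=100):
    # fine propagator F: many small Euler steps (the expensive solver)
    dt = (t1 - t0) / substeps
    for _ in range(substeps):
        y = y + dt * (-y)
    return y

def parareal(y0, T=2.0, n_windows=10, iters=5):
    t = np.linspace(0.0, T, n_windows + 1)
    # initial guess from one cheap serial coarse sweep
    U = np.empty(n_windows + 1)
    U[0] = y0
    for n in range(n_windows):
        U[n + 1] = coarse(U[n], t[n], t[n + 1])
    for _ in range(iters):
        # fine solves per window: independent, i.e. parallelizable
        F = [fine(U[n], t[n], t[n + 1]) for n in range(n_windows)]
        G_old = [coarse(U[n], t[n], t[n + 1]) for n in range(n_windows)]
        # serial correction sweep: U[n+1] = G(new) + F(old) - G(old)
        for n in range(n_windows):
            U[n + 1] = coarse(U[n], t[n], t[n + 1]) + F[n] - G_old[n]
    return t, U
```

After k iterations the first k windows agree exactly with the serial fine solution, so the iteration converges in at most n_windows steps; the practical gain comes from stopping much earlier.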

  • Maxim Solovchuk (National Health Research Institutes)

    • Parallel computing for the treatment planning of the focused ultrasound ablation of liver tumor

    • Abstract:

    • High-intensity focused ultrasound (HIFU) is a very promising new technology with many therapeutic applications, among them the treatment of cancer in different organs without major side effects. The main difficulties that limit the further development of HIFU are the very difficult treatment planning and the unpredictable ablation area in heterogeneous media. Computational fluid dynamics can greatly help in the development of this technology. Without high-performance computing, the calculation of ultrasound propagation in a patient-specific geometry is a very time-consuming process. We are working on the development of a surgical planning platform for non-invasive HIFU tumor ablation therapy in a real liver geometry based on CT/MRI images. This task requires the coupling of different physical fields: acoustic, thermal, and hydrodynamic. The cavitation model is also coupled with the acoustic and thermal models in a viscoelastic medium. These physical fields can influence one another. The surgical planning simulation platform is currently under development on multiple GPUs. The predicted results are compared with in-vivo and ex-vivo experimental data.

  • Hiroyuki Takizawa (Tohoku University)

      • Offload programming on a modern heterogeneous vector system

      • Abstract:

      • The supercomputer AOBA installed at Tohoku University employs NEC SX-Aurora TSUBASA, a "modern" vector system equipped with x86 processors as well as NEC's vector processors. Unlike previous-generation vector systems, non-vectorizable parts of an application can be executed on the x86 processors, so the vector processors execute only the vectorizable parts. In this talk, I will introduce our research activities on offload programming models for such a heterogeneous computing system with vector processors. The performance evaluation results clearly indicate that assigning the right processors to the right tasks leads to higher performance.

  • Eh Tan (Academia Sinica)

      • Numerical algorithms for modeling the deformation within Earth interior

      • Abstract:

      • The Earth's crust and mantle deform slowly and continuously despite their solid state. The deformation is manifested at the Earth's surface as plate motion and earthquakes. The deformation can be described as buoyancy-driven Stokes flow coupled with thermal diffusion and advection. The buoyancy source is mainly due to the temperature contrast between the top (surface) and the bottom (core-mantle boundary) and partly due to chemical contributions. Hot mantle rises from the core-mantle boundary, while cold crust and mantle (lithosphere) sink from the surface. The stress-deformation response (rheology) of Earth materials is highly nonlinear and can be a function of temperature, pressure, and deformation history. How to solve the nonlinear Stokes equations accurately and efficiently is a big challenge faced by geodynamists. I will introduce the past and current strategies and algorithms deployed by geodynamists.

  • Hui-Hsu Gavin Tsai (National Central University)

      • Algorithms for Promoting the Efficiency of Phase Space Sampling by All-atom

      • Abstract:

      • Molecular dynamics (MD) simulations numerically integrate the classical equations of motion of large molecular systems over time and are widely employed to investigate macromolecular structure-function relationships. MD simulations are applied to aid drug design in the pharmaceutical industry, to investigate the structure and transport phenomena of new liquids, and to study the dynamics of biological molecules. The efficiency of phase-space sampling and the speed of MD simulation are the bottlenecks retarding its wider application, due to energy trapping and long-range non-bonded interactions, respectively. In this talk, two advanced algorithms, umbrella sampling and replica-exchange MD (REMD), will be discussed, showing how they can effectively sample the phase space of molecular systems. In practice, conventional MD or MC simulations do not frequently sample high-energy regions of configuration space, preventing the energy of the transition state from being calculated with statistical significance. An advanced sampling technique called "umbrella sampling" has been developed to overcome this difficulty. In this approach, an umbrella potential V_U is added to the potential energy of the system, so that the high-energy configuration space can be sampled effectively by MD simulations. The biased probability P*(q) of the system under this biasing function can be calculated from the MD simulation, and the unbiased configuration probability P(q) can then be recovered from P*(q) by statistical mechanics. A series of separate calculations along the predefined reaction coordinate can be performed in parallel.
      • The replica-exchange MD (REMD) method couples MD trajectories with temperature-exchange Monte Carlo processes and provides an effective conformational sampling method. The essential idea of REMD is to run MD simulations of the same system on different replicas simultaneously, but at various temperatures. Periodically, the configurations of two neighboring replicas are exchanged, depending on whether the potential energy gap between them is comparable with their temperature difference. The high-temperature replicas are designed to cross energy barriers more quickly than the low-temperature ones, and the replica swapping allows the high-temperature replicas to help the low-temperature replicas cross energy barriers quickly. Thus, an REMD simulation can sample the conformational space more effectively than constant-temperature MD simulations. In particular, each replica is only weakly coupled to the others (through temperature exchange), which makes REMD well suited to parallel computing.
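The periodic exchange step described above reduces to a Metropolis test on the energies and inverse temperatures of the two replicas: accept the swap with probability min(1, exp(Δ)), where Δ = (1/k_B·T_i − 1/k_B·T_j)(E_i − E_j). A minimal sketch in reduced units, with a function name of our own choosing:

```python
import numpy as np

def remd_swap_accepted(E_i, E_j, T_i, T_j, kB=1.0, rng=None):
    """Metropolis acceptance test for exchanging the configurations of two
    replicas with potential energies E_i, E_j at temperatures T_i, T_j.
    delta = (beta_i - beta_j) * (E_i - E_j); accept with prob min(1, e^delta)."""
    if rng is None:
        rng = np.random.default_rng()
    delta = (1.0 / (kB * T_i) - 1.0 / (kB * T_j)) * (E_i - E_j)
    if delta >= 0.0:
        return True   # e.g. the colder replica holds the higher energy
    return rng.random() < np.exp(delta)
```

Because the replicas interact only through this occasional scalar comparison, the MD trajectories themselves can run on separate nodes with minimal communication, which is what makes REMD scale well in parallel.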

  • Chin-Tien Wu (National Yang Ming Chiao Tung University)

    • A Weakly Kernel-Supervised Neural Network for Image Blind Deconvolution

    • Abstract:

    • Blind deconvolution (BD) is an important issue in image processing. The BD process generally involves two steps: kernel estimation and image deconvolution. To obtain a satisfactory recovered image using BD, much parameter tuning is required. Each set of parameters results in a different blur kernel and recovered image. There are no optimal parameters in general, due to the variation of image priors, and parameter tuning remains essential for BD to succeed. In this talk, we shall review some well-known BD methods, such as Levin's MAP_{x,k}, Fergus's variational Bayes mean field, Bronstein's optimal sparse representation, and Jiaya's two-phase estimation based on the L0 sparsity of the kernel. We then propose a kernel-supervised cycle-consistent adversarial network (KS-CCAN) for BD, where the kernel supervisor is obtained from the kernel estimation of traditional BD methods. KS-CCAN can be considered an auto-tuning network for BD. Our preliminary results show that KS-CCAN is robust for various blurred images.

  • Chun-Chen Yeh (National Kaohsiung Normal University)

    • Computational Forensics for a Bombing Terrorism Case

    • Abstract:

    • The bombing of airliners has been a tactic used by terrorists over the past 40 years and has become a major issue for homeland security. To aid the investigation of such bombings, this talk presents results from the development of mathematical modeling and computer simulation for the study of aircraft bombings and the associated forensics.

    • The speaker will discuss photographic evidence, introduce mathematical modeling, and present computational modeling of a particular case, the Daallo Airlines bombing of February 2, 2016, in which only a small amount of explosives was used, to demonstrate that event reconstruction can be accomplished for the purpose of forensic investigations. Most of our supercomputer results are visualized as video animations in order to show the dynamic effects and phenomena of the explosives and the associated event reconstruction.

  • Shu-Chih Yang (National Central University)

      • The application of high-performance computing to convective-scale data assimilation and severe weather prediction

      • Abstract:

      • Data assimilation for numerical weather prediction aims to optimally combine a numerical weather model with observation data. For high-impact weather prediction, such as heavy rainfall forecasting, high-resolution, cloud-permitting models are essential to represent the development of the weather system, and observations with high temporal and spatial resolution are another key source of information. This talk will cover the developments and challenges of a convective-scale ensemble data assimilation system under a high-performance computing framework.

      • The convective-scale data assimilation system couples the Weather Research and Forecasting model with the Local Ensemble Transform Kalman Filter to assimilate high-resolution radar data (WLRAS). Although the WLRAS system is skillful in terms of precipitation prediction, its performance is affected by the sampling errors of the ensemble, errors in the larger-scale environment, and un-optimized error covariance matrices. I will present strategies to cope with these challenges and show how they can improve the performance of the convective-scale data assimilation system and heavy rainfall prediction.

  • Yung-Yu Zhuang (National Central University)

    • Revising a scientific computing program to benefit from high-performance computing techniques

    • Abstract:

    • Many scientific research activities rely heavily on the results of computing programs, while scientists might not be programming experts. When a scientific computing program has been implemented and used for a long time, modifying it to benefit from high-performance computing techniques becomes challenging. In this talk, we share our experiences in revising such a program. The target program was developed for seismic tomography using finite-difference travel time computation and has been modified by many scientists. It is composed of seven modules written in Fortran and C, along with parameter files and shell scripts. We first rewrote everything in a single language, C, due to maintenance concerns. Then we applied OpenMP/MPI programming to improve performance and scale to extensive data. Finally, we built a domain-specific language embedded in Python to hide implementation details. In every modification, ensuring correctness is always the main issue.