2024 Titles and Abstracts

2024 Conference on Advanced Topics and Auto Tuning in High-Performance Scientific Computing

Plenary Speaker

Chao-An Lin 林昭安 (National Tsing Hua University, Taiwan)

Title: Fluid dynamics simulations on GPU cluster

Abstract:

The present presentation offers a comprehensive exploration into the realm of fluid dynamics simulations, specifically conducted within the framework of a GPU cluster infrastructure. To capitalize on the formidable computational prowess of GPUs, a meticulously crafted suite of explicit formulations has been developed and implemented. These formulations are engineered to exploit the parallel processing capabilities intrinsic to GPU architectures, thereby augmenting the performance and scalability of the simulations. A wide spectrum of test cases has been judiciously selected to assess the efficacy of the simulations executed on the GPU cluster. Through these meticulously chosen scenarios, the study not only aims to underscore the remarkable computational acceleration facilitated by harnessing GPU resources but also to validate the precision and dependability of the resultant data. By subjecting the simulations to rigorous scrutiny across diverse scenarios and benchmarking against established standards, the presentation endeavors to furnish a comprehensive understanding of both the advantages and constraints inherent in the utilization of GPU clusters for fluid dynamics simulations.

Invited Speakers

Takahiro Katagiri (RIKEN R-CCS / Nagoya University, Japan)

Title: Adaptation of Auto-tuning for Quantum Annealer

Authors: Takahiro Katagiri (RIKEN R-CCS / Nagoya University, Japan), Makoto Morishita (Nagoya University, Japan)

Abstract: 

Given the expected speed advantage over classical computers in specific calculations, there is considerable anticipation for the development of quantum computers. Notably, quantum annealers designed for combinatorial optimization are available in various hardware configurations, ranging from those harnessing quantum properties to quantum-inspired types operating at room temperature. 

This presentation aims to elucidate the necessity and provide examples of applying software auto-tuning (AT) in "quantum-related technologies," including quantum annealers and quantum-inspired annealers, from the vantage point of high-performance computing. Specifically, we will show instances of applying AT technology to address combinatorial optimization problems such as vertex cover and support vector machines (SVM) utilizing quantum annealer technology.

Satoshi Ohshima (Kyushu University, Japan)

Title: Considering multi process calculations on current GPU

Abstract: 

Current GPUs are equipped with massively parallel compute cores, so a high degree of parallelism is required to fill the cores and obtain reasonable performance. However, many programs do not have enough parallelism. In particular, many block matrix calculations consist of multiple small calculations, even though the overall parallelism is high enough.

To obtain good performance in such programs, we focus on multi-process calculation on GPUs. The Multi-Process Service (MPS) and Multi-Instance GPU (MIG) features of the latest GPUs may improve performance without heavy effort. We have confirmed positive results for the QR decomposition of Block Low-Rank matrices (BLR-QR). We show the latest results and analysis.

Katsuhisa Ozaki (Shibaura Institute of Technology, Japan)

Title: Alternative Algorithms to Triple-Word and Quad-Word Arithmetic

Authors: Katsuhisa Ozaki (Shibaura Institute of Technology, Japan), Toshiyuki Imamura (RIKEN R-CCS)

Abstract:

If ordinary floating-point arithmetic cannot produce accurate numerical results, multiple-precision arithmetic is a natural alternative. When only the length of the significand is an issue, Double-Word arithmetic (DW), which expresses a number as the sum of two floating-point numbers and defines arithmetic on such pairs, is widely used. Bailey’s QD library is a well-known implementation. Triple-Word arithmetic (TW) and Quad-Word arithmetic (QW) are extensions of Double-Word arithmetic. These methods are known to be fast because they are implemented using floating-point arithmetic supported by hardware. Recently, Pair Arithmetic (PA) was developed by Lange and Rump. The main difference between DW and PA is whether overlapping of the floating-point numbers is allowed. PA accepts overlapping of the floating-point numbers, trading some accuracy for a lower computational cost than DW.

In this paper, we propose algorithms that adopt the idea of PA for Triple-Word and Quad-Word arithmetic. We call these algorithms Quasi Triple-Word (QTW) and Quasi Quad-Word (QQW) arithmetic, respectively. We also discuss the efficient application of the proposed algorithms to avoid degradation of the accuracy of the computed results. Numerical experiments show the efficiency of the proposed methods. The variety of methods (PA, DW, QTW, TW, QQW, and QW) will provide topics for automatic accuracy tuning in the future.

This is joint work with Toshiyuki Imamura (RIKEN R-CCS).
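The Double-Word idea described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' QTW/QQW algorithm and not the QD library's code: it uses the classical TwoSum and FastTwoSum error-free transformations and a simplified double-word addition, with Python floats standing in for hardware doubles. The function names `two_sum`, `fast_two_sum`, and `dw_add` are chosen here for illustration only.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: s is the rounded sum, e the rounding error, s + e == a + b."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def fast_two_sum(a, b):
    """Dekker's FastTwoSum; assumes |a| >= |b| (or a == 0)."""
    s = a + b
    e = b - (s - a)
    return s, e

def dw_add(x, y):
    """Add two double-word numbers stored as (high, low) pairs.

    Simplified variant: the low words are added with ordinary rounding,
    then the pair is renormalized with FastTwoSum.
    """
    sh, sl = two_sum(x[0], y[0])
    sl += x[1] + y[1]
    return fast_two_sum(sh, sl)

# 1 + 1e-17 is not representable in one double, but is as a (high, low) pair.
one_plus_tiny = (1.0, 1e-17)
total = dw_add(one_plus_tiny, one_plus_tiny)

# Check against exact rational arithmetic.
exact = 2 * (Fraction(1.0) + Fraction(1e-17))
print(total, Fraction(total[0]) + Fraction(total[1]) == exact)
```

A plain double addition `1.0 + 1e-17` returns `1.0` and loses the small term, whereas the pair representation keeps it in the low word. Whether the two words are allowed to overlap is the design choice that separates DW from PA in the abstract.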

Toshiyuki Imamura (RIKEN, Japan)

Title: tmBLAS: Multiple- and Mixed-Precision BLAS with C++ Template

Abstract: 

We propose a new design for BLAS that can handle multiple- and mixed-precision computations. Our templated mixed-precision BLAS (tmBLAS) addresses weaknesses in existing BLAS implementations by decoupling the data types of each operand and operator using C++ generic programming with explicit descriptions of operators and type-castings. We demonstrate a release version of tmBLAS that instantiates routines with several FP formats {16, 32, 64, 128}, DD, and MPFR data types, and with operations that use an internal precision one level higher than the input/output data precision.

Koki Masui (Osaka University, Japan)

Title: Parallelization of ICCG Method with high precision calculation in electromagnetic field analysis

Abstract: 

In solving the complex symmetric linear equations arising from electromagnetic field analysis with the edge finite element method, iterative methods such as the conjugate orthogonal conjugate gradient (COCG) method suffer from slow convergence. Although previous studies have improved the convergence rate by applying double-double (DD) precision complex arithmetic to the equations, the total calculation time increased. Therefore, in this study, we developed a parallelization method based on the block ICCG method for electromagnetic field analysis and implemented an iterative method applying DD precision. Specifically, in order to extract high parallelism with fewer rejections, we divided the preconditioning matrix into several regions and applied the corresponding parallelization method to each. We show numerical examples using sparse matrices from the SuiteSparse Matrix Collection and a high-frequency electromagnetic field analysis of a whole-body cavity resonator.

Kengo Nakajima (The University of Tokyo / RIKEN R-CCS)

Title: Integration of Simulation/Data/Learning and Beyond 

Abstract: 

Recently, supercomputing has been changing dramatically. Integration/convergence of Simulation/Data/Learning (S+D+L) is important for Society 5.0, proposed by the Japanese Government, which enables integration of cyber space and physical space. In 2015, we started the BDEC project (Big Data & Extreme Computing) for the development of supercomputers and software for integration of (S+D+L). In May 2021, we started operation of the Wisteria/BDEC-01. It is the first BDEC system, and it consists of computing nodes for computational science and engineering with A64FX (Odyssey) and nodes for Data Analytics/AI with NVIDIA A100 GPUs (Aquarius). We are also developing a software platform, "h3-Open-BDEC", for integration of (S+D+L) on the Wisteria/BDEC-01. It is designed to extract the maximum performance of the supercomputers with minimum energy consumption, focusing on (1) innovative methods for numerical analysis by adaptive precision, accuracy verification, and automatic tuning, (2) a hierarchical data-driven approach based on machine learning, and (3) software for heterogeneous systems. Integration of (S+D+L) by h3-Open-BDEC enables a significant reduction of computations and power consumption compared to conventional simulations. In this talk, achievements in this project and future perspectives towards the next stage will be described.

Kenji Ono (Kyushu University, Japan)

Title: Equation discovery using genetic programming

Abstract:

I would like to introduce the discovery of equations using Genetic Programming. In this approach, the elements that constitute the terms of a differential equation are treated as genes. By evolving combinations of these genes, we explore the underlying laws, or governing equations, hidden within the data. So far, we have successfully estimated the original equations that generated the data from various datasets, including one-dimensional nonlinear Burgers equations, Lorenz equations, and coupled equations like the Brusselator. Furthermore, we have been able to replicate the form of terms even with very limited data, demonstrating a unique feature that, unlike other machine learning methods, results in low estimation errors even when data is scarce.

Hiroyuki Takizawa (Tohoku University, Japan)

Title: ML-based Autotuning of Quantum Annealing Schedule

Abstract: 

In commercially available quantum annealing devices, the anneal schedules are tunable and significantly affect the quality of solutions. In particular, halting the anneal process for a period at some point in the schedule, called pausing, is known to be effective in increasing the probability of obtaining a valid solution to the target problem. However, pauses are effective only within a certain region of the anneal. Meanwhile, given a problem, there is no established way of determining its optimal pause location at an affordable cost. Therefore, this work first discusses whether a simulation model of quantum annealing can find an optimal pause location for the target problem, and then demonstrates the feasibility of simulating the effect of pausing. After that, as the simulation itself is time-consuming, machine learning is used to construct a surrogate model of the simulation. This is a good example of the marriage of quantum computing and machine learning, both of which are promising in the extreme-scale computing era.
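To make the notion of a pause location concrete, the following Python sketch builds a piecewise-linear anneal schedule with a single pause and enumerates candidate pause locations. It is only an illustration: the representation as (time, s) points with s as a normalized anneal fraction, the time unit, and the names `paused_schedule` and `candidate_schedules` are assumptions made here, not the device interface or the tuning method used in this work.

```python
def paused_schedule(anneal_time, pause_s, pause_length):
    """Return (time, s) points of a linear anneal with one pause.

    s is the normalized anneal fraction rising from 0 to 1. The ramp reaches
    pause_s, holds there for pause_length, then resumes at the same slope.
    All times share the unit of anneal_time (the unit itself is an assumption).
    """
    if not 0.0 < pause_s < 1.0:
        raise ValueError("pause_s must lie strictly between 0 and 1")
    if anneal_time <= 0 or pause_length < 0:
        raise ValueError("anneal_time must be positive and pause_length non-negative")
    t_pause = pause_s * anneal_time
    return [
        (0.0, 0.0),
        (t_pause, pause_s),
        (t_pause + pause_length, pause_s),
        (anneal_time + pause_length, 1.0),
    ]

def candidate_schedules(anneal_time, pause_length, steps):
    """Sweep the pause location over evenly spaced interior points of (0, 1)."""
    return [
        paused_schedule(anneal_time, k / (steps + 1), pause_length)
        for k in range(1, steps + 1)
    ]

for schedule in candidate_schedules(anneal_time=20.0, pause_length=100.0, steps=3):
    print(schedule)
```

An exhaustive sweep like `candidate_schedules` is the costly baseline; the abstract's point is that a simulation model, and then a learned surrogate of it, can replace repeated evaluation of every candidate.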

Teruo Tanaka (Kogakuin University, Japan)

Title: Acceleration techniques for software auto-tuning to hyperparameters on machine learning software

Abstract: 

This is a study of the search for optimal combinations of performance parameters that determine the performance of user programs in software auto-tuning (AT). In this research, optimization of hyperparameters of machine learning software is performed in AT. As subjects, we use (a) a program for predicting human movement in robot control and (b) a super-resolution program for natural image enlargement. Machine learning requires an enormous amount of training time. In response to this, we developed an AT tool to control multiple executions effectively using many GPUs of a supercomputer. For subject (a), we achieved further reduction of AT execution time and improvement of accuracy of learning results by using similar data that has already been learned. For subject (b), the two-stage learning method and parallel processing achieved a speed-up of 100-200 times, which is within the practical range of AT. As a result of the above, we were able to obtain the same level of results as those obtained by conventional time-consuming learning by experts.

ChungGang Li 李崇綱 (National Cheng Kung University, Taiwan)

Title: Converging-Diverging Nozzle Simulation Based on Hierarchical Cartesian Meshes in Supercomputer Environment

Authors: ChungGang Li (National Cheng Kung University, Taiwan Space Agency), Weng Chien-Chou (Taiwan Space Agency), Chen Li-Cheih (Taiwan Space Agency), Chou Chi-Chian (Taiwan Space Agency)

Abstract: 

The simulation of a Converging-Diverging (CD) nozzle presents one of the most formidable challenges in Computational Fluid Dynamics (CFD) due to the intricate fluid phenomena involved, including shock waves, turbulence, and significant heat transfer. To leverage the computational power of supercomputers for CFD simulations of CD nozzles, we have developed a numerical framework based on hierarchical Cartesian meshes coupled with the Immersed Boundary Method (IBM) [1]. This framework enables accurate capturing of fluid phenomena at high Mach numbers. Furthermore, a tailored wall model has also been developed for our IBM framework to ensure accurate prediction of the velocity profile near the wall. Benchmark simulations employing billions of meshes have been conducted on Taiwan's latest supercomputer, Forerunner. A comparison between CFD and experimental results demonstrates good agreement, affirming the reliability of our approach. These results underscore the method's efficacy in the design and analysis of CD nozzles.

Wei-Hsiang Wang 王威翔 (National Chung Hsing University, Taiwan)

Title: A Multi-physics CFD Framework for Practical Industrial Applications

Abstract: 

Simulating multi-physics phenomena, including fluid flow, chemical reactions, and phase changes, poses significant challenges in industrial applications due to the complexity of integrating diverse solvers. This study introduces a unified solution through a CFD framework that incorporates numerical models for these phenomena into a single solver. Utilizing a Cartesian-based approach, our framework employs a meshing system that combines a Hierarchical Cartesian grid with the Building Cube method and an immersed boundary method to accommodate arbitrary and complex geometries typical of industrial applications. The solver integrates an all-speed compressible flow solver, leveraging a preconditioned Roe scheme and a 5th order MUSCL scheme, with species transport equations for modeling flow and energy fields in reacting species. Chemical reactions and combustion processes are addressed by integrating the CANTERA chemical reaction library with a level-set flame front tracking method. Additionally, the parcel model-based Particle-Source-In-Cell (PSI-Cell) method is introduced for modeling liquid fuel spray and evaporation. To support the computational demands, a High-Performance Computing (HPC) system utilizing MPI facilitates massively parallel computations. The framework's efficacy is demonstrated through various qualitative and quantitative validation cases, underscoring its potential for addressing the complexities of multi-physics simulations in industrial settings.

Keywords: Building-Cube method, immersed boundary method, compressible flow, combustion, high performance computing (HPC)


Yu-Heng Tseng 曾于恒 (National Taiwan University, Taiwan)

Title: Development of Multi-scale Ocean-atmosphere Coupled Modelling System

Abstract: 

A novel Multi-scale Ocean and Atmosphere Coupled Modeling System (MUSOACS) has been recently developed to enhance the predictability of extended-range forecasts. MUSOACS comprises four distinct components, seamlessly integrating global (23 km resolution) and regional (5 km resolution) ocean-atmosphere coupled models that run concurrently. The regional coupled model is online-driven by the global coupled model. Skill evaluations of the model predictability indicate that the global atmospheric forecast in MUSOACS surpasses the operational atmospheric forecast system at the Central Weather Administration up to 16 days in geopotential height, temperature, and wind vector below 200 hPa. The predictability of the Madden-Julian Oscillation, measured by the RMM index, extends from a 6-day to a 20-day lead time. Furthermore, the regional prediction of the 2016/2018 cold surge events in East Asia notably improves, extending the predictability lead time from 5 to 7 days in the regional component of MUSOACS. These improvements highlight the superior performance of MUSOACS compared to uncoupled models, attributed to its enhanced representation of ocean-atmosphere interaction. These findings suggest that MUSOACS holds promising potential to deliver high-performance, higher-quality extended-range forecasts from both global and regional perspectives.

Meng-Huo Chen 陳孟豁 (National Chung Cheng University, Taiwan)

Title: Enhancing ETD Computational Efficiency with a Hybrid MPI/OpenMP Parallel Algorithm

Abstract: 

This study tackles the computational bottleneck of matrix exponential calculations in Exponential Time Differencing (ETD) methods through a novel parallel algorithm that combines MPI and OpenMP. By partitioning the matrix among MPI processes and leveraging OpenMP for localized matrix multiplication, our approach significantly enhances the ability to address diffuse-interface problems at mesh sizes previously unattainable with serial computation. Through comprehensive performance evaluations, we demonstrate the algorithm's substantial improvements in efficiency and speed, showcasing a major leap forward in the computational capabilities of ETD methods for complex applications.
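The role of partitioned matrix multiplication in a matrix exponential can be sketched in a few lines of serial Python. This is a hedged illustration, not the algorithm of the talk: the abstract does not say how the exponential is evaluated, so a Taylor series with scaling and squaring is assumed here, and a loop over row blocks stands in for the MPI processes (each block product is where a threaded local multiply would run). The names `block_matmul` and `expm` are hypothetical.

```python
import math

def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B (the local work)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a_rows]

def block_matmul(a, b, nblocks):
    """Row-partitioned product: each row block could belong to one process."""
    n = len(a)
    step = max(1, -(-n // nblocks))  # ceiling division
    c = []
    for start in range(0, n, step):
        c.extend(matmul_rows(a[start:start + step], b))
    return c

def expm(a, nblocks=2, terms=12, squarings=8):
    """Approximate exp(A) by a truncated Taylor series with scaling and squaring."""
    n = len(a)
    scale = 2.0 ** squarings
    s = [[x / scale for x in row] for row in a]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes S^k / k!
        term = [[x / k for x in row] for row in block_matmul(term, s, nblocks)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = block_matmul(result, result, nblocks)
    return result

# exp of the rotation generator is a rotation by one radian.
E = expm([[0.0, 1.0], [-1.0, 0.0]])
print(E)
print(math.cos(1.0), math.sin(1.0))
```

Every step of the exponential reduces to matrix products, which is why distributing and threading the multiplication addresses the bottleneck named in the abstract.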

YungYu Zhuang 莊永裕 (National Central University, Taiwan)

Title: Static code analysis for dynamic typing in scientific computing

Abstract: 

Although Python is unsuitable for high-performance computing, it has been widely used as a host language for calling libraries implemented in C or Fortran. Programmers can quickly write Python programs as a frontend to perform scientific computing. By calling library functions, programmers can manipulate data in C/Fortran implementation efficiently rather than directly handling memory on the Python side. However, as a consequence, detecting errors may become complicated since all data manipulation is performed by function calls, which cannot be checked statically, especially in a dynamic typing language. To address this issue, we develop static code analyzers to help programmers detect errors before running programs. These include column label typing problems in data frames, invalid strings in function calls, and shape mismatches in arrays.
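One of the error classes mentioned above, invalid strings in function calls, can be illustrated with Python's standard `ast` module. The sketch below is not the analyzer developed by the speaker: the checked function name `solve`, its `method` keyword, and the set of allowed option strings are all hypothetical, chosen only to show how a string literal can be validated without running the program.

```python
import ast

# Hypothetical set of option strings accepted by a library function.
ALLOWED_METHODS = {"lu", "qr", "cholesky"}

def find_invalid_method_strings(source, func_name="solve", keyword="method"):
    """Return sorted (line, value) pairs for string literals not in ALLOWED_METHODS."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Accept both solve(...) and module.solve(...).
        if isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = getattr(callee, "id", None)
        if name != func_name:
            continue
        for kw in node.keywords:
            is_literal = isinstance(kw.value, ast.Constant) and isinstance(kw.value.value, str)
            if kw.arg == keyword and is_literal and kw.value.value not in ALLOWED_METHODS:
                problems.append((node.lineno, kw.value.value))
    return sorted(problems)

example = '''
x = solve(A, b, method="lu")
y = linalg.solve(A, b, method="choleski")
'''
print(find_invalid_method_strings(example))
```

The misspelled option is reported from the syntax tree alone, so the error surfaces before any library call executes. Non-literal arguments are skipped here; handling them would need the kind of data-flow reasoning a full analyzer provides.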

Marco Sutti (National Center for Theoretical Sciences, Taiwan)

Title: Numerical simulations using low-rank approximation

Abstract: 

When dealing with large-scale problems, it can be beneficial to exploit, if available, the low-rank structure of the problems under consideration. We start this talk by providing some motivation for working in the low-rank format over a dense format. We then propose two implicit numerical schemes for the low-rank time integration of stiff nonlinear partial differential equations. Our approach uses the preconditioned Riemannian trust-region method of Absil, Baker, and Gallivan, 2007. We demonstrate the efficiency and accuracy of our method for solving the Allen–Cahn and the Fisher–KPP equations on the manifold of fixed-rank matrices. Our method allows us to reduce the computational cost and avoid the restriction on the time step typical of methods that use the fixed-point iteration to solve the inner nonlinear equations.

Maxim Solovchuk (National Health Research Institutes, Taiwan)

Title: High Performance Computing for focused ultrasound treatment planning and Biomedical applications

Abstract: 

High-intensity focused ultrasound (HIFU) is a very promising new technology with many therapeutic applications, among them the treatment of cancer in different organs without major side effects. The main difficulties that limit further development of HIFU are the very difficult treatment planning and the unpredictable behavior of the necrosed area in a heterogeneous medium. The efficiency still requires improvement due to the long treatment time and incomplete ablation. Calculation of ultrasound propagation in a patient-specific geometry is a very time-consuming process. Therefore, to speed up the simulations, High Performance Computing on multiple GPUs will be used for modeling the nonlinear acoustic field. The mathematical model consists of different physical fields coupled with each other: acoustic, thermal, hydrodynamic, and cavitation. We are working on the development of a surgical planning platform for non-invasive HIFU tumor ablative therapy in a real liver geometry based on CT/MRI images. It still remains quite challenging to ablate a tumor close to a blood vessel. The temperature elevation and necrosed area formation have been predicted and compared with experimental data. This study demonstrated that, with the help of numerical simulations, a large uniform area (with a cross-section of about 2×2 cm^2 or 3×3 cm^2) can be ablated by HIFU within a short time period (1-3 minutes), which is much faster than 1-2 hours of surgery. The agreement between the simulated and ex vivo results is very good.

Chun-Yu Lin 林俊鈺 (National Center for High-Performance Computing, Taiwan)

Title: Computational challenges in the collective neutrino dynamics

Abstract: 

Neutrinos are incredibly light, weakly interacting fundamental particles ubiquitous in extreme astrophysical conditions such as supernova explosions and neutron star mergers. They have three “flavor” types that oscillate spontaneously and collectively due to their nonlinear interaction with matter and themselves, triggering flavor instability and enriching the neutrino microphysics and the transport mechanism. The collective neutrino oscillation is intrinsically a quantum many-body problem and poses a computational challenge even after being reduced to a quantum kinetic equation under a mean-field description. The talk summarizes our efforts on this scientific computing project and observations on the asymptotic behavior under various conditions. I will also review some techniques, such as tensor network approximations, for exploring the many-body effects of collective oscillation.

Student Speakers

Ping-Kong Huang 黃品康 (National Yang Ming Chiao Tung University, Taiwan)

Title: Control of Two-Degree-of-Freedom UAV Camera Gimbal on UAV for Target Tracking

Abstract: 

This research addresses the control of a two-degree-of-freedom (2DoF) camera gimbal mounted on a UAV (Unmanned Aerial Vehicle) for target tracking. It begins by constructing a 2DoF gimbal dynamics model and introduces a control method based on the SDRE (State-Dependent Riccati Equation) to mitigate vertical oscillation induced by the UAV. To solve the SDRE in servo motor control, the SDA (structure-preserving doubling algorithm) is applied. Moreover, inverse kinematics using quaternion representation is employed to determine the desired angles of the servo motors, ensuring precise alignment of the camera's image center with the target. Finally, some numerical results and experiments will be shown.

Qi Ji (Kyushu University, Japan)  

Title: Optimizing Dynamic Buffer Management for Non-blocking MPI Communication between GPUs

Authors: Ji Qi, Kenji Ono (Kyushu University, Japan)

Abstract: 

Managing buffer memory is a key aspect of MPI-based parallelization, and the situation becomes more complicated with non-blocking communication for variable stencil computation on GPUs. With non-blocking communication, multiple communications can be initiated at the same time, and the corresponding buffers must be kept until the communications complete and the data are written back to the original array. In stencil computation, the amount of data to be transferred depends on the length of the stencil, which is subject to change. Therefore, both the number and the size of the buffers may vary during program execution, making dynamic management a favorable choice.

However, before MPI communications are initiated, data must first be transferred from GPU memory to CPU memory, resulting in extra latency. The use of pinned (or page-locked) memory on the CPU side may reduce GPU-CPU transfer overhead, but allocating and releasing pinned memory incurs far more overhead than doing so for ordinary CPU memory, which makes dynamic management a dilemma.

In this research, we implement a dynamic buffer management method in our multi-GPU CFD code, in which used buffers are kept unreleased for further use and a buffer is reallocated only when no buffer of the required size can be found. We also propose an algorithm for choosing which buffer to use or reallocate, aiming to minimize buffer reallocations. We then compare the performance of our code using this method with the baseline method and analyze this method’s influence on the performance of our code.
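The reuse-before-reallocate policy described above can be sketched as a small buffer pool. This is an illustrative Python model only, not the authors' implementation: `bytearray` stands in for pinned host memory, and the selection rules (best fit when reusing, replacing the largest idle buffer when reallocating) are assumptions made here, since the abstract does not specify the proposed selection algorithm.

```python
class BufferPool:
    """Keep released buffers for reuse and count how often reallocation is needed."""

    def __init__(self):
        self.free = []          # idle buffers, kept instead of being released
        self.reallocations = 0  # number of expensive (re)allocations performed

    def acquire(self, size):
        """Return a buffer of at least `size` bytes, reusing an idle one if possible."""
        best = None
        for i, buf in enumerate(self.free):
            # Assumed rule: pick the smallest idle buffer that is large enough.
            if len(buf) >= size and (best is None or len(buf) < len(self.free[best])):
                best = i
        if best is not None:
            return self.free.pop(best)
        # No idle buffer fits, so an expensive allocation is unavoidable.
        self.reallocations += 1
        if self.free:
            # Assumed rule: replace the largest idle buffer rather than growing the pool.
            largest = max(range(len(self.free)), key=lambda i: len(self.free[i]))
            self.free.pop(largest)
        return bytearray(size)

    def release(self, buf):
        """Mark a buffer as idle once its communication has completed."""
        self.free.append(buf)


pool = BufferPool()
first = pool.acquire(100)    # allocation 1
pool.release(first)
second = pool.acquire(80)    # reuses the 100-byte buffer
third = pool.acquire(200)    # allocation 2, since the other buffer is busy
pool.release(second)
pool.release(third)
fourth = pool.acquire(150)   # reuses the 200-byte buffer
print(pool.reallocations, second is first, fourth is third)
```

Counting reallocations separately from acquisitions gives a simple metric for comparing selection rules, which matters because pinned allocations are the costly step noted in the abstract.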

Amber, Tien-Chi Liu 劉天祺 (National Central University, Taiwan)

Title: Observation and simulation of Traveling Ionospheric Disturbances induced by the 2022 Tonga Volcano Eruption

Abstract: 

On January 15th, 2022, the Tonga volcano erupted violently, generating atmospheric pressure disturbances that propagated as Lamb waves and were detected worldwide along with the associated oscillations. The perturbing atmospheric waves induce traveling atmospheric disturbances (TADs) near the Earth’s surface and traveling ionospheric disturbances (TIDs) in the upper atmosphere. In this study, the global propagation of TADs is demonstrated with Himawari-8 satellite images. Doppler sounding systems, ionosondes, ground-based barometers, and tide gauges are also utilized to observe TADs and TIDs over Japan and Taiwan. To better understand the characteristics of the observed TIDs, GITM-R, a three-dimensional non-hydrostatic model of the upper atmosphere that features a two-way coupling between the coarser grid layer and locally refined layers, is used for numerical simulations, which are compared and discussed alongside observational data.

Makoto Morishita (Nagoya University, Japan) 

Title: Auto-tuning of ScaLAPACK by GPTune 

Authors: Makoto Morishita (Nagoya University, Japan), Osni Marques, Yang Liu (Lawrence Berkeley National Laboratory, USA), Takahiro Katagiri (RIKEN R-CCS / Nagoya University, Japan)

Abstract: 

Numerical libraries often have many parameters that affect their performance. To obtain high performance from the libraries, tuning these parameters is required. However, it is difficult to tune such parameters without special knowledge of them. Software auto-tuning (AT) is therefore one of the promising approaches to achieving high performance for numerical libraries. In this talk, we explain the methodology of GPTune, an AT framework developed by DOE’s Exascale Computing Project. In addition, we show an example of applying GPTune to a ScaLAPACK routine.

Ivan Luthfi Ihwani 盧斯非 (National Central University, Taiwan)

Title: Unveiling Nonlinear Wave Propagation in Finite Gratings: A Multiscale Finite Element Approach

Abstract: 

Accurately simulating wave behavior in intricate structures like finite gratings is crucial for various scientific and engineering applications. This study introduces the multiscale finite element method (MsFEM) as a potent approach for addressing nonlinear Helmholtz problems in intricate geometries, such as finite gratings. MsFEM employs a coarse mesh to capture the overall structure of the grating while simultaneously constructing specialized basis functions within each element. These functions resolve the intricate details of the grating, including its periodic nature and localized material properties. This enables the incorporation of fine-scale features without necessitating an excessively refined mesh across the entire domain, thereby maintaining computational efficiency. We present the MsFEM framework tailored for the nonlinear Helmholtz equation and demonstrate its effectiveness in analyzing finite gratings. The obtained numerical results are compared with established methods, revealing the accuracy and efficiency of the proposed approach. This study paves the way for utilizing MsFEM in diverse applications involving nonlinear wave propagation in complex structures, offering valuable insights into wave behavior in various fields.