Parallelization of Sparse-Particle-In-Cell (S-PIC) methods
Shared memory parallelization on CPU:
Standard PIC methods are not well suited to scalable implementations on shared-memory architectures, primarily because PIC methods are globally memory-bound: multiplying the cores without a significant increase in memory bandwidth yields poor speed-ups. This issue has garnered considerable attention over the years. Strategies have been proposed for Non-Uniform Memory Access (NUMA) platforms, including various workarounds to enhance data locality, increase cache reuse, and consequently reduce the number of requests to main memory. The goal is to leverage memory segments local to a subset of cores to improve the scalability of the algorithm. This is achieved through a decomposition of the particle population into samples, each assigned to a NUMA domain, with the related operations performed by the local subset of cores. Sparse grid reconstructions involve interactions of each particle with numerous (tens of) anisotropic grids, featuring coarse discretizations and different sparsity patterns. These grids, referred to as component grids, have memory footprints reduced to a few kilobytes (KB), small enough to fit into the highest level of the cache hierarchy (the L1 cache private to each core on any modern CPU). Furthermore, the interactions of the particles with two different grids are independent and can be processed concurrently. This defines a new level of parallelism, specific to sparse-PIC methods, which is crucial for achieving good scalability.
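As a rough illustration of this footprint argument, the sketch below estimates the memory occupied by a single anisotropic component grid; the helper name and the level vector are illustrative assumptions, not taken from the referenced implementations.

    #include <cstddef>
    #include <cstdio>

    // Illustrative estimate of the memory footprint (in bytes) of one 3D
    // component grid with per-dimension refinement levels (l1, l2, l3),
    // i.e. (2^l1 + 1) x (2^l2 + 1) x (2^l3 + 1) nodes holding one
    // double-precision value each.
    std::size_t component_grid_bytes(int l1, int l2, int l3)
    {
        std::size_t nodes = ((std::size_t)1 << l1) + 1;
        nodes *= ((std::size_t)1 << l2) + 1;
        nodes *= ((std::size_t)1 << l3) + 1;
        return nodes * sizeof(double);
    }

    int main()
    {
        // Example (illustrative level vector): a strongly anisotropic grid of
        // levels (7, 1, 1) holds 129 x 3 x 3 = 1161 nodes, i.e. about 9 KB in
        // double precision -- small enough for the L1 cache of a CPU core.
        std::printf("%zu bytes\n", component_grid_bytes(7, 1, 1));
        return 0;
    }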
To sum up, two strategies tailored to uniform (UMA) and non-uniform (NUMA) memory architectures have been conceived and combined into a two-level parallelization strategy, featuring coarse-grain and fine-grain parallelism and achieving close-to-optimal scalability:
Particle sample work sharing (coarse-grain) strategy: the particle population is divided into independent samples (one per NUMA domain) and distributed onto the NUMA domains. Each core of a NUMA domain executes the operations related to the sample stored in the RAM local to that domain. This strategy takes full advantage of the (RAM) memory bandwidth, which increases with the number of NUMA domains.
Component grid work sharing (fine-grain) strategy: the component grids are distributed onto the cores within a NUMA domain. The operations (charge accumulation, resolution of the Poisson equation, etc.) on different component grids are independent and executed concurrently by the cores of the NUMA domain. This strategy takes full advantage of the tiny memory footprint of the component grids, whose arrays are stored in the L1 cache of the cores.
A load-balancing strategy has been developed for UMA architectures to preserve the high scalability on general hardware configurations. Inside each NUMA domain, the particle sample associated with the domain is subdivided into as many clusters as there are cores in the domain. The number of tasks, defined as the product of the number of component grids and the number of clusters, is then a multiple of the number of cores in the domain, so the workload is equally distributed over the cores of the NUMA domain.
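A minimal C++/OpenMP sketch of this two-level strategy is given below; the data structures, names and scheduling details are assumptions made for illustration, not the implementation of [2].

    #include <omp.h>
    #include <cstddef>
    #include <vector>

    // Illustrative data structures (layout assumed, not taken from [2]).
    struct ComponentGrid  { std::vector<double> rho; };        // nodal charge density, a few KB
    struct ParticleSample { std::vector<double> x, y, z, w; }; // particles owned by one NUMA domain

    // Placeholder deposition of one particle sample onto one (pre-allocated)
    // component grid; the actual method scatters B-spline shape factors onto
    // the neighbouring nodes of the anisotropic grid.
    void deposit(const ParticleSample& p, ComponentGrid& g)
    {
        for (std::size_t i = 0; i < p.w.size(); ++i)
            g.rho[0] += p.w[i];   // placeholder accumulation
    }

    // Two-level charge accumulation:
    //  - coarse grain: one particle sample per NUMA domain, processed by threads
    //    pinned to that domain (placement controlled via OMP_PLACES/OMP_PROC_BIND);
    //  - fine grain: the component grids of the domain are shared by its cores.
    void accumulate(std::vector<ParticleSample>& samples,            // one per NUMA domain
                    std::vector<std::vector<ComponentGrid>>& grids)  // per-domain grid copies
    {
        omp_set_max_active_levels(2);  // allow nested thread teams

        #pragma omp parallel for num_threads((int)samples.size()) proc_bind(spread)
        for (std::size_t d = 0; d < samples.size(); ++d)             // coarse grain
        {
            #pragma omp parallel for proc_bind(close) schedule(static)
            for (std::size_t g = 0; g < grids[d].size(); ++g)        // fine grain
                deposit(samples[d], grids[d][g]);
            // Load balancing (cf. above): the sample can further be split into
            // clusters so that (#grids x #clusters) is a multiple of the core
            // count of the domain; the duplicated per-grid densities (a few KB
            // each) are then reduced at negligible cost.
        }
        // A reduction of the per-domain grid copies follows (not shown).
    }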
The parallel efficiency is close to optimal (126/128) for the most costly step of the method, namely the projection (Proj) of the charge density, and the global implementation achieves a speed-up of 100 on 128 cores.
GPGPU parallelization:
The NUMA parallelization implementation, referred to as CPU-inherited, offers poor performance on GPGPU architectures. This is explained by fundamental differences in the organization of the compute units of CPUs and GPGPUs. First, the CPU-inherited implementation is designed to optimize the use of the L1 cache, private to each core, whereas on GPUs a large number of compute units share the same L1 cache. Second, the NUMA implementation relies on a decomposition of the particle population into samples distributed onto the NUMA domains; this strategy proves inefficient for organizing the work sharing within the pool of Streaming Multiprocessors (SMs) of a GPGPU. This stems from the non-coalesced memory accesses inherent to PIC algorithms. To mitigate the drop in performance, the SMs should be fed with a massive number of streams (independent tasks) to hide the memory latency. Porting the sample work-sharing strategy to GPGPUs would require reduction operations over thousands of arrays, penalizing the performance.
An implementation tailored to GPGPU architectures, with parallelization strategies specific to the sparse-PIC method, has been developed. The key points of the charge accumulation implementation are the following:
The implementation takes advantage of the significant reduction of the memory footprint achieved by the sparse-PIC method: all the data required during the simulation are stored on the GPU, so that the only memory transfers between the host and the device occur at initialization and at the end of the computations. This is a crucial feature of the implementation, since many GPGPU applications are limited by host-device memory transfers (because of the limited bandwidth).
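This host/device traffic pattern can be sketched as follows; the array names and sizes are illustrative assumptions, not the code of [3].

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const size_t n_particles   = 1 << 20;  // illustrative population size
        const size_t n_grid_values = 1 << 16;  // all component grids, flattened

        // Initialization: all simulation data are allocated on the device once.
        double *d_part = nullptr, *d_rho = nullptr;
        cudaMalloc(&d_part, 6 * n_particles * sizeof(double));   // x, y, z, vx, vy, vz
        cudaMalloc(&d_rho, n_grid_values * sizeof(double));
        cudaMemset(d_rho, 0, n_grid_values * sizeof(double));
        // (omitted) single host-to-device copy of the initial particle state.

        // Time loop: pusher, charge accumulation, field solve, ... are all
        // launched on device data -- no host/device transfer inside the loop.
        // for (int it = 0; it < n_steps; ++it) { /* launch kernels on d_part, d_rho */ }

        // Finalization: single device-to-host copy of the quantities of interest.
        double* h_rho = new double[n_grid_values];
        cudaMemcpy(h_rho, d_rho, n_grid_values * sizeof(double), cudaMemcpyDeviceToHost);
        std::printf("rho[0] = %f\n", h_rho[0]);

        delete[] h_rho;
        cudaFree(d_part);
        cudaFree(d_rho);
        return 0;
    }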
The parallelization strategy consists of two levels of parallelism (coarse-grain parallelism and fine-grain parallelism with a Single Instruction Multiple Thread (SIMT) execution model); a schematic kernel sketch is given after the two items below:
Particle work sharing (coarse-grain) strategy: the particle population is divided into clusters. A massive number of clusters (larger than the number of Streaming Multiprocessors (SMs)), which can be as large as 60k, is created. The clusters are distributed onto the SMs in a Single Program Multiple Data (SPMD) execution model, but are not bound to the SMs. The randomness of the memory accesses is mitigated thanks to the massive number of clusters, which masks the latency of the non-coalesced memory accesses with computations: when a thread is stalled because data (component grid nodes) are unavailable, a context switch is performed to execute a different instruction stream whose data are already loaded (interleaved stream strategy).
Component grid work sharing (fine-grain) strategy inside the SMs: the component grids are processed at the same time by the different threads of an SM (SIMT execution model). The randomness of the memory accesses is further mitigated by reducing the memory access latency thanks to an optimization of the use of the L2 cache of the GPU. Contrary to the CPU implementation, the interaction of one particle is computed with all the component grids at once, and this is repeated for all the particles. The data structure storing all the component grids is small enough to be kept in the L2 cache, which contributes to extracting good performance from the platform.
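The following minimal CUDA kernel sketches these two levels; the names, the flattened grid layout and the index computation are assumptions made for illustration, not the implementation of [3]. Thread blocks work through particle clusters (coarse grain) in a grid-stride fashion, and within a block each thread deposits the current particle onto a different component grid (fine grain, SIMT).

    // Placeholder: maps a particle position to a node index of component grid g.
    // The real code evaluates B-spline shape factors on the anisotropic grid.
    __device__ int nearest_node(double xp, int g) { (void)xp; (void)g; return 0; }

    // Charge accumulation kernel (requires sm_60+ for atomicAdd on doubles).
    __global__ void accumulate_charge(const double* __restrict__ x,   // particle positions (one coordinate shown)
                                      const double* __restrict__ w,   // particle weights
                                      const int* cluster_first,       // first particle of each cluster
                                      const int* cluster_size,        // particles per cluster
                                      int n_clusters,
                                      double* rho,                    // all component grids, flattened
                                      const int* grid_offset,         // start of each grid inside rho
                                      int n_grids)
    {
        // Coarse grain: clusters are not bound to SMs; blocks pick them up in a
        // grid-stride loop, and the massive number of clusters hides the latency
        // of the non-coalesced accesses to the grid nodes.
        for (int c = blockIdx.x; c < n_clusters; c += gridDim.x)
        {
            const int first = cluster_first[c];
            for (int p = first; p < first + cluster_size[c]; ++p)
            {
                // Fine grain (SIMT): each thread of the block deposits the same
                // particle onto a different component grid; the flattened grid
                // array is small enough to stay resident in the L2 cache.
                for (int g = threadIdx.x; g < n_grids; g += blockDim.x)
                    atomicAdd(&rho[grid_offset[g] + nearest_node(x[p], g)], w[p]);
            }
        }
    }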
The GPGPU-specific implementation achieves speed-ups of 100 on a Tesla V100 compared to the CPU of the host.
Publications:
[2] F. Deluzet, G. Fubiani, L. Garrigues, C. Guillet, and J. Narski. Efficient parallelization for 3d-3v sparse grid Particle-In-Cell: Shared memory architectures. Journal of Computational Physics 480 (2023), 112022. Open access: https://www.sciencedirect.com/science/article/pii/S0021999123001171; preprint: https://hal.science/hal-03773216v1.
[3] F. Deluzet, G. Fubiani, L. Garrigues, C. Guillet, and J. Narski. Efficient parallelization for 3d-3v sparse grid Particle-In-Cell: Single GPU architectures. Computer Physics Communications (2023), 108755. Preprint: https://hal.science/hal-03773226v2.