A Massively Parallel and Scalable
Multi-GPU Material Point Method
Harnessing the power of modern multi-GPU architectures, we present a massively parallel simulation system based on the Material Point Method (MPM) for simulating physical behaviors of materials undergoing complex topological changes, self-collision, and large deformations. Our system makes three critical contributions. First, we introduce a new particle data structure that promotes coalesced memory access patterns on the GPU and eliminates the need for complex atomic operations on the memory hierarchy when writing particle data to the grid. Second, we propose a kernel fusion approach using a new Grid-to-Particles-to-Grid (G2P2G) scheme, which efficiently reduces GPU kernel launches, improves latency, and significantly reduces the amount of global memory needed to store particle data. Finally, we introduce optimized algorithmic designs that allow for efficient sparse grids in a shared memory context, enabling us to best utilize modern multi-GPU computational platforms for hybrid Lagrangian-Eulerian computational patterns. We demonstrate the effectiveness of our method with extensive benchmarks, evaluations, and dynamic simulations with elastoplasticity, granular media, and fluid dynamics. In comparisons against an open-source and heavily optimized CPU-based MPM codebase on an elastic sphere collision scene with particle counts ranging from 5 to 40 million, our GPU MPM achieves over 100X per-time-step speedup on a workstation with an Intel 8086K CPU and a single Quadro P6000 GPU, exposing exciting possibilities for future MPM simulations in computer graphics and computational science. Moreover, compared to the state-of-the-art GPU MPM method, we not only achieve 2X acceleration on a single GPU, but our kernel fusion strategy and Array-of-Structs-of-Arrays (AoSoA) data structure design also generalize to multi-GPU systems.
Our multi-GPU MPM exhibits near-perfect weak and strong scaling with 4 GPUs, enabling performant and large-scale simulations on a 1024x1024x1024 grid with close to 100 million particles with less than 4 minutes per frame on a single 4-GPU workstation and 134 million particles with less than 1 minute per frame on an 8-GPU workstation.
Crushing Concrete (93.8M particles, 1024x1024x1024 grid, 3.9 min/frame, running on 4 GPUs)
Bomb Falling (134M particles, 6688 bombs, 512x2048x512 grid, 1 min/frame, running on 8 GPUs)
Candy Bowl (23M particles, 6786 candies, 1024x1024x512 grid, 4 s/frame, running on 4 GPUs)
Soil Falling (53M particles, 512x512x512 grid, 57 s/frame, running on 4 GPUs)
Sand Armadillo (55M particles, 512x512x1024 grid, 34 s/frame, running on 4 GPUs)
Improved Single-GPU Algorithm
When parallelizing MPM algorithms, the chief performance concern is the transfer operations between particles and grids, i.e., P2G and G2P. These sub-steps become even more crucial to the performance of implicit schemes, where significantly more transfer operations are required. We present two techniques to accelerate the transfer operations: 1) Grid-to-Particles-to-Grid (G2P2G), a novel fused algorithmic kernel, and 2) Array-of-Structs-of-Arrays (AoSoA), a particle data structure newly applied to MPM, together with an associated parallel loop strategy.
MPM pipeline reformulation
We start by analyzing the data dependencies among adjacent MPM sub-steps. As shown in the left column of this figure, we observe the following constraints on the data dependencies and execution order of the sub-steps:
1) The P2G must be finished before the grid update, and the G2P is performed after all the grid states have been evolved.
2) The partition update, wherein the particle-grid mapping and the sparse grid data structure are maintained, depends only on the results of G2P, i.e., the advected particle positions.
3) The P2G transfer relies on the particle-grid mapping, i.e., particles need to know which grid nodes they should rasterize to. This leads to a dependency between the partition update in the current time step k and the P2G in the next time step k+1.
The first two observations exhibit strict data dependencies that cannot be changed without compromising correctness. The third, however, is a weak dependency, since the particle-grid mapping can be staggered differently. We can therefore reformulate the execution order of the sub-steps for better performance.
Following the above analysis, we devise a novel G2P2G kernel by grouping the G2P in time step k and the P2G in time step k+1 together.
When grouping the G2P and the P2G together, the velocity and higher-order velocity modes on particles can be interpolated from grids and referenced immediately for both the particle updates and the next P2G transfer, converting these quantities to temporary variables within the kernel instead of arrays allocated in GPU global memory; the only particle attributes that need to be preserved are the mass, positions, and deformation gradients.
Such a G2P2G reformulation not only eliminates two transfer kernel launches and two particle data accesses per time step, which significantly improves performance, but also reduces particle storage.
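The fusion idea above can be sketched in a minimal 1D toy (hypothetical CPU code, not the paper's CUDA kernel; linear rather than quadratic B-spline weights for brevity): the gathered velocity lives only as a local temporary between the G2P read and the subsequent P2G scatter, so it never needs a global-memory array.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int N = 16;      // grid nodes (toy domain; particles assumed interior)
constexpr double dx = 1.0; // cell width

struct Particle { double mass, x; };  // only mass + position persist in memory

// Fused G2P2G: gather velocity from the step-k grid, advect the particle,
// and immediately scatter mass/momentum to the step-(k+1) grid. The
// velocity `v` is a temporary, never stored per particle.
void g2p2g(Particle& p, const std::array<double, N>& vel_k,
           std::array<double, N>& mass_k1, std::array<double, N>& mom_k1,
           double dt) {
    int base = static_cast<int>(std::floor(p.x / dx));
    double frac = p.x / dx - base;
    // G2P: linear interpolation of grid velocity
    double v = (1.0 - frac) * vel_k[base] + frac * vel_k[base + 1];
    p.x += dt * v;  // particle advection
    // P2G for step k+1, using the post-advection weights
    int nbase = static_cast<int>(std::floor(p.x / dx));
    double nfrac = p.x / dx - nbase;
    mass_k1[nbase]     += (1.0 - nfrac) * p.mass;
    mass_k1[nbase + 1] += nfrac * p.mass;
    mom_k1[nbase]      += (1.0 - nfrac) * p.mass * v;
    mom_k1[nbase + 1]  += nfrac * p.mass * v;
}

// Run one fused step for a single particle; the scattered mass must sum
// back to the particle's mass (conservation check).
double demo_total_mass() {
    std::array<double, N> vel_k{}; vel_k.fill(1.0);
    std::array<double, N> mass_k1{}, mom_k1{};
    Particle p{2.0, 3.25};
    g2p2g(p, vel_k, mass_k1, mom_k1, 0.5);
    double total = 0.0;
    for (double m : mass_k1) total += m;
    return total;  // equals p.mass = 2.0
}
```

In the actual GPU pipeline this routine is one CUDA kernel per particle block; the toy only illustrates which quantities become kernel-local temporaries.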
(Left) The conventional GPU MPM pipeline adopts an off-by-one staggered mapping between blocks and particles for more efficient use of shared memory. (Right) The G2P2G pipeline adopts an off-by-two strategy: particles from a block must be located at least a two-cell distance from the border of the arena. This design ensures that particles stay in the same blocks after CFL-bounded advection in G2P2G.
Since the state of other particles cannot be inferred during the execution of a single G2P2G kernel thread, we use the maximum velocity of the grid nodes, which can be computed before entering the G2P2G kernel, to estimate dt. Experimental results show little difference in the computed dt (less than 1%) between the computation performed with the maximum velocities of grid nodes and particles.
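A minimal sketch of this dt estimate (illustrative constants, not the paper's exact values): take the maximum grid-node speed, which is available before the G2P2G kernel launches, and apply a CFL-style bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// CFL-style time-step estimate from the maximum grid-node speed.
// `cfl` and `dt_max` are hypothetical defaults for illustration.
double estimate_dt(const std::vector<double>& grid_speeds, double dx,
                   double cfl = 0.5, double dt_max = 1e-3) {
    double vmax = 0.0;
    for (double v : grid_speeds) vmax = std::max(vmax, std::abs(v));
    if (vmax == 0.0) return dt_max;        // nothing is moving
    return std::min(dt_max, cfl * dx / vmax);
}
```

Because grid velocities are mass-weighted averages of nearby particle velocities, the grid-based maximum slightly underestimates the particle-based one, which matches the observed sub-1% difference in dt.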
To exploit the advantages of both SoA and AoS layouts without compromising performance, we devise an AoSoA data structure to store particle attributes. The particles are grouped according to their positions, such that particles mapping to the same block can be gathered together in memory. We adopt an SoA structure to store the particle attributes inside each group, while the particle groups are organized using an AoS structure. With such a design, the proposed AoSoA particle data structure has the following advantages:
As long as the SoA group size is a multiple of the CUDA warp size, each warp of threads can access (read and write) particle data in a coalesced manner to ensure bandwidth efficiency.
The particles are grouped according to their positions, and the particle groups are organized in an AoS layout. Therefore, each block (4x4x4 cells in our pipeline) of particles resides in contiguous memory, which enables faster migration across multiple GPUs. Note that the SoA layout does not possess this property, as particle attributes are strided across GPU memory. This design also better suits the proposed G2P2G pipeline, wherein each CUDA block handles exactly one particle block.
By organizing particles inside each particle block with a finer granularity, we can reduce memory usage by making each particle block occupy only the minimal amount of memory needed to accommodate the particles inside.
The internal layout of each particle bin is SoA. The number of properties is flexible according to the material; we used 4 for fluids and 13 for solids.
Particles are often unequally distributed in space. Hence, uniformly allocating fixed memory space for each particle block would result in a significant amount of unused memory. Instead, a more efficient strategy is to allocate just enough particle bins for each block according to the current distribution of particles.
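The addressing scheme and the per-block bin allocation described above can be sketched as follows (hypothetical helper functions; the bin size of 32 and channel count of 4 are the fluid configuration from the text, and the exact buffer layout is an assumption for illustration):

```cpp
#include <cassert>
#include <cstddef>

constexpr int BIN_SIZE = 32;     // particles per bin; a multiple of the warp size
constexpr int NUM_CHANNELS = 4;  // attribute channels, e.g. 4 for fluids

// Flat offset of particle `pid`'s channel `c` in the AoSoA buffer:
// bins are laid out one after another (AoS), and inside a bin each
// channel occupies BIN_SIZE contiguous entries (SoA). A warp reading
// one channel of 32 consecutive particles touches consecutive words,
// i.e., the access is coalesced.
size_t aosoa_offset(size_t pid, int c) {
    size_t bin = pid / BIN_SIZE;   // which bin the particle lives in
    size_t lane = pid % BIN_SIZE;  // its slot inside the bin
    return bin * BIN_SIZE * NUM_CHANNELS + c * BIN_SIZE + lane;
}

// Allocate just enough bins for a block holding `particle_count`
// particles, instead of a fixed worst-case amount per block.
size_t bins_needed(size_t particle_count, size_t bin_size = BIN_SIZE) {
    return (particle_count + bin_size - 1) / bin_size;  // ceiling division
}
```

For example, particles 0 and 1 reading the same channel land on adjacent offsets (coalesced), while a sparsely filled block with 33 particles consumes only two bins rather than a fixed maximum.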
Using multiple GPUs for MPM simulations affords significantly larger simulations and shortens the overall simulation time. To extend from a single GPU to multiple GPUs, we divide the whole simulation domain into partitions according to the number of devices and assign one partition to each GPU. Load balancing is one of the essential considerations when distributing partitions in multi-GPU applications. Depending on the dynamics of the simulation, the same partitioning scheme can yield drastically different performance on different problems.
Halo Block Tagging
We tag grid blocks with three labels: assigned, transferred, and advected.
Assigned grid blocks are blocks in which particles reside. Transferred grid blocks contain the grid nodes that particles may write to; e.g., with the quadratic B-spline kernel, each particle may write to neighboring nodes within a three-cell range. Advected grid blocks are blocks that particles may move into after the G2P transfer and particle advection. Transferred and advected grid blocks may contain no particles in the current time step.
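A toy 1D version of this tagging (hypothetical helper, not the paper's CUDA code) makes the three label sets concrete: each particle marks its own block as assigned, the blocks its B-spline writes can reach as transferred, and the blocks its CFL-bounded advection can reach as advected.

```cpp
#include <cassert>
#include <set>
#include <vector>

constexpr int BLOCK_W = 4;  // cells per block along one axis

struct BlockTags { std::set<int> assigned, transferred, advected; };

// Tag 1D blocks from particle cell indices (assumed non-negative).
// Write reach of 2 cells approximates the quadratic B-spline stencil;
// advection reach of 1 cell reflects the CFL bound of under one cell
// per step. Both radii are illustrative simplifications.
BlockTags tag_blocks(const std::vector<int>& particle_cells) {
    BlockTags t;
    for (int cell : particle_cells) {
        t.assigned.insert(cell / BLOCK_W);
        for (int d = -2; d <= 2; ++d)          // cells the particle may write to
            t.transferred.insert((cell + d) / BLOCK_W);
        for (int d = -1; d <= 1; ++d)          // cells it may advect into
            t.advected.insert((cell + d) / BLOCK_W);
    }
    return t;
}
```

A particle near a block border (e.g., cell 7 with 4-cell blocks) tags its neighbor block as transferred and advected even though no particle resides there yet; those are exactly the blocks that must be shared across devices when they fall in a halo region.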
In the context of multi-GPU MPM, the partition from one device can overlap partitions from other devices. These overlapping regions (i.e., halo regions) may include all three types of grid blocks. The grid data in the halo regions must be shared and synchronized by the corresponding GPUs to ensure correct execution.
Multi-GPU Static Partitioning by Particles - MGSP
MGSP is an ideal option for solid simulations, including elastic jellos, sand, and other granular materials, due to the stable halo distribution of solids. Since the overall shape of solid models remains intact even under large deformations, the halo regions typically reside on the model surfaces. Even when significant fractures happen, the halo regions still only occupy a small portion of the whole partition.
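One simple way to realize a static particle-based split (a hypothetical simplification of MGSP, shown here as a greedy prefix-sum cut over spatially ordered blocks) is to assign contiguous runs of particle blocks to devices so that each device owns a roughly equal share of particles.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assign each spatially ordered particle block to one of `ndev` devices
// so devices receive roughly equal particle counts. `block_counts[i]`
// is the number of particles in block i. Returns the owning device per
// block. A sketch, not the paper's exact partitioning algorithm.
std::vector<int> partition_by_particles(const std::vector<int>& block_counts,
                                        int ndev) {
    long total = 0;
    for (int c : block_counts) total += c;
    std::vector<int> owner(block_counts.size(), 0);
    if (total == 0) return owner;  // degenerate case: everything on device 0
    long prefix = 0;
    for (size_t i = 0; i < block_counts.size(); ++i) {
        prefix += block_counts[i];
        // device index grows with the fraction of particles seen so far
        int dev = static_cast<int>((prefix - 1) * ndev / total);
        owner[i] = std::min(dev, ndev - 1);
    }
    return owner;
}
```

Because the split follows the spatial block order, each device owns a contiguous region, and only blocks near the cut boundaries become halo blocks.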
Instruction pipelines. The additional operations in the multi-GPU MPM compared to the single-GPU MPM are displayed in red. By masking these halo-region-related data transfers with the execution of the G2P2G kernel, one can achieve more optimized scaling results with multi-GPUs.
Multi-GPU Static Partitioning by Space - MGSS
In an MPM simulation, the size of the halo regions among partitions can grow beyond a threshold such that the latency of the non-halo G2P2G kernel is no longer sufficient to mask the device-to-device memory copies. This situation is especially common in fluid simulations, where fluids from different partitions can mix freely, causing halo sizes to increase dramatically over time. In such cases, re-partitioning particles is necessary for load balancing, and static partitioning by space is a simple yet efficient strategy.
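Spatial partitioning can be sketched as a one-line ownership rule (a hypothetical helper assuming the domain is sliced into equal slabs along one axis, which is one simple instantiation of the strategy):

```cpp
#include <algorithm>
#include <cassert>

// Owner of a grid block under a static spatial split: slice the domain
// into `ndev` equal slabs along the x axis. Blocks keep the same owner
// for the whole simulation regardless of where particles move, so no
// particle re-distribution bookkeeping is needed per device.
int device_of_block(int block_x, int grid_blocks_x, int ndev) {
    return std::min(block_x * ndev / grid_blocks_x, ndev - 1);
}
```

Unlike the particle-based split, this keeps each device's spatial extent fixed, trading possible particle-count imbalance for stable, predictable halo boundaries.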
Please refer to our paper for a detailed discussion.
Single-GPU Performance Comparison
All candidate single-GPU MPM methods use the MLS-MPM transfer method in explicit time integration.
The per-time-step timings are measured on an NVIDIA RTX 2080 and a Quadro P6000, and are gathered after the objects hit the ground for a fairer evaluation.
The Hu et al. [2019a] benchmark disabled the initial reordering.
The dragons* scene reduces particles per cell by half.
Performance comparison between a SIMD CPU implementation and our GPU pipeline
We compare the timing against an open-source, heavily optimized CPU-based MPM codebase (a SIMD vectorized implementation provided by its authors).
The experiment is conducted in an elastic sphere colliding scene with particle counts ranging from 5 to 40 million. On a workstation with an Intel 8086K CPU and a single Quadro P6000 GPU, our GPU MPM achieves 110 to 120x per-time-step speedup.
Both the bomb falling and dragons scenes use irregular geometries; all dragons scenes share the same geometry but are sampled with different numbers of particles per cell, and the bomb falling scene is much denser in space. The cube scene is a uniformly sampled cube with ordered particles. All timings are measured on an NVIDIA RTX 2080 graphics card.
The first timing column is the sum of the timings of the P2G and G2P kernels. The second timing column is measured by replacing the P2G and G2P kernels with the proposed G2P2G kernel. The third timing column is measured by additionally replacing the AoS layout with the proposed AoSoA layout on top of the G2P2G kernel. The speedup is calculated relative to the reference time of Hu et al. [2019a].
Weak scaling; compact layout.
(Left) The per-time-step wall time remains steady with an increasing number of GPUs, for both the G2P2G kernel and the overall performance.
(Right) Both the G2P2G kernel and the overall efficiencies still stay around 95% even when employing 4 GPU devices.
Strong scaling; side-by-side layout.
(Left) Both the G2P2G kernel and the overall performance scale well with an increasing number of GPUs.
(Right) The loss of strong scaling for G2P2G is negligible even when employing 4 GPU devices, while the overall strong scaling drops to around 90%.