We are going to implement a parallel version of the Barnes-Hut algorithm on NVIDIA GPUs using CUDA to simulate galaxy evolution over time. Our goal is to accelerate the computation of gravitational forces between particles by leveraging a tree-based approximation scheme and mapping the force evaluation to individual CUDA threads. We will explore challenges such as divergent tree traversal, irregular memory access, and floating-point precision tradeoffs.
Galaxy simulation is a classical example of the N-body problem, where a large number of particles—typically stars—interact with each other through gravitational forces. This type of simulation is widely used in astrophysics to study large-scale structure formation, but similar formulations appear in other fields like fluid dynamics and molecular modeling. The naive algorithm for solving this problem computes the pairwise force between every pair of particles, leading to O(N²) time complexity. As the number of particles increases, this approach becomes infeasible due to its poor scalability.
The Barnes-Hut algorithm reduces this computational burden to O(N log N) by organizing the simulation space into a hierarchical tree structure—a quadtree in two dimensions or an octree in three. Instead of computing direct interactions between every pair, the algorithm groups distant particles and approximates their influence using a single aggregated mass positioned at their center of mass. Whether a node may be aggregated is decided by comparing the ratio L / D against a user-chosen threshold θ, where L is the width of the region represented by the node and D is the distance from the current particle to that region: if L / D < θ, the node's aggregate mass is used; otherwise its children are examined.
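As a concrete illustration, a minimal sketch of this acceptance test in 2D (the Node fields and function name are placeholders of our own, not a committed design):

#include <cmath>

// Sketch of the Barnes-Hut acceptance test for one tree node.
// A node may stand in for all of its particles when L / D < theta.
struct Node {
    float comX, comY;   // center of mass of the region
    float mass;         // total mass contained in the region
    float width;        // side length L of the region
};

bool canApproximate(const Node& n, float px, float py, float theta) {
    float dx = n.comX - px;
    float dy = n.comY - py;
    float dist = std::sqrt(dx * dx + dy * dy);   // D: distance from particle to region
    return (n.width / dist) < theta;             // accept the approximation if L / D < theta
}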
The simulation progresses in time steps, with each step following this basic pseudocode structure:
for each time step {
    build tree structure from particles
    compute center of mass for internal nodes
    for each particle {
        traverse the tree
        accumulate gravitational forces using approximated nodes
        update particle position and velocity
    }
}
Our implementation focuses on accelerating this algorithm entirely on the GPU using CUDA. All simulation phases, including force computation and particle updates, will be handled on the device. Tree structures and particle data will be stored in GPU memory, and tree traversal will be implemented as a parallel CUDA kernel, with one thread per particle. This model allows us to leverage the parallel nature of force computation, as each particle can be processed independently.
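In launch terms this mapping is straightforward; a sketch (kernel and variable names are placeholders):

// One CUDA thread per particle: round the grid size up so every particle is covered.
int threadsPerBlock = 256;
int blocks = (numParticles + threadsPerBlock - 1) / threadsPerBlock;
computeForcesKernel<<<blocks, threadsPerBlock>>>(dTree, dPositions, dForces, numParticles);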
However, this parallelism is not without challenges. While the outer loop over particles is embarrassingly parallel, each particle traverses the tree differently depending on its position. This results in irregular control flow across threads, causing warp divergence on the GPU. Additionally, the tree structure leads to non-coalesced memory access, which reduces effective memory bandwidth. Efficient data layout (e.g., structure-of-arrays) and shared memory reuse strategies are needed to mitigate this.
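For instance, switching the particle arrays from array-of-structures to structure-of-arrays (a sketch; field names are illustrative) lets threads in a warp read consecutive addresses when they all load the same field:

// Array-of-structures: the fields of one particle are adjacent, so a warp reading
// only posX strides through memory and its loads cannot be coalesced.
struct ParticleAoS { float posX, posY, velX, velY, mass; };

// Structure-of-arrays: thread i reads posX[i], so the warp's loads of posX
// fall in consecutive addresses and coalesce into few transactions.
struct ParticlesSoA {
    float *posX, *posY;
    float *velX, *velY;
    float *mass;
};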
The algorithm also benefits from spatial and temporal locality. Nearby particles tend to traverse overlapping regions of the tree, making shared memory and caching more effective when used appropriately. Moreover, because particles move slowly relative to the spatial structure of the simulation, the workload per particle does not change drastically across time steps. This property allows us to adopt semi-static workload balancing, where we track traversal costs and reassign particles to balance computational load.
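One way this could look in practice (a sketch of the idea, assuming the force kernel records a per-particle node-visit count) is to sort particle indices by the previous step's traversal cost with Thrust, so threads in the same warp receive similarly expensive particles in the next step:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// cost[i] = number of tree nodes particle i visited in the last force pass
// (written by the force kernel through a raw device pointer).
void rebalance(thrust::device_vector<int>& cost,
               thrust::device_vector<int>& order) {
    thrust::sequence(order.begin(), order.end());      // order = 0, 1, 2, ...
    thrust::sort_by_key(cost.begin(), cost.end(),      // group particles with
                        order.begin());                // similar traversal cost
    // Next time step: thread t processes particle order[t].
}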
By targeting a full CUDA implementation, we aim to explore the limits and optimizations possible when mapping an irregular, tree-based algorithm to massively parallel hardware. The project will serve as a case study in how to manage divergence, memory pressure, and floating-point precision tradeoffs in a high-performance scientific application.
Accelerating the Barnes-Hut algorithm on the GPU using CUDA introduces a series of non-trivial parallelization challenges. The workload is hierarchical, spatially irregular, and highly data-dependent. Unlike dense, structured problems, this algorithm features fine-grained control flow variation, dynamic data structures, and non-uniform per-thread workloads—all of which are difficult to map efficiently onto CUDA's SIMD model. Below, we describe the key components of the workload and the specific challenges they pose on a GPU.
1. Tree Construction (Quadtree/Octree Generation)
The tree construction step organizes particle positions into a hierarchical spatial data structure. Each node represents a region in space, and recursive subdivision occurs whenever multiple particles occupy the same region. Over time, this structure evolves as particles move.
Concurrency and Synchronization:
Tree construction involves inserting many particles into a shared spatial structure. On CUDA, managing concurrent insertions is problematic. If multiple threads attempt to insert into overlapping regions, synchronization is required, but global atomic operations or fine-grained locking are costly on GPUs. These conflicts are especially common in dense areas where recursive subdivision is triggered frequently.
Irregular Control Flow:
The insertion process involves multiple conditionals based on spatial coordinates and existing node contents. This leads to divergent control flow across threads within a warp, which lowers CUDA efficiency. Since thread behavior becomes data-dependent, some threads may complete insertions quickly while others proceed through deep recursive calls, degrading parallel throughput.
Dynamic Memory Allocation:
Tree construction requires dynamic allocation of internal and leaf nodes. CUDA provides limited support for fast, safe dynamic memory allocation within kernels. Managing custom memory pools on the device introduces implementation complexity and can still suffer from fragmentation or contention, especially when many threads are allocating nodes simultaneously.
Due to these challenges, we will precompute the tree on the host and upload it to the GPU for traversal and force evaluation.
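A sketch of what the flattened, pointer-free layout and its upload might look like (the TreeNode fields are our working assumption, not a committed format):

#include <cuda_runtime.h>
#include <vector>

// Flattened quadtree node: children are indices into one contiguous node array
// instead of pointers, so the structure survives the host-to-device copy unchanged.
struct TreeNode {
    float comX, comY;    // center of mass of the region
    float mass;          // total mass under this node
    float width;         // side length of the region
    int   child[4];      // indices of the four children, -1 when absent
};

// The host builds the nodes into a std::vector<TreeNode>, then uploads them once per step.
TreeNode* uploadTree(const std::vector<TreeNode>& nodes) {
    TreeNode* dNodes = nullptr;
    cudaMalloc(&dNodes, nodes.size() * sizeof(TreeNode));
    cudaMemcpy(dNodes, nodes.data(), nodes.size() * sizeof(TreeNode),
               cudaMemcpyHostToDevice);
    return dNodes;
}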
2. Center-of-Mass Computation (Internal Node Aggregation)
After building the tree, internal nodes must compute their center of mass and total mass by aggregating the values from their children, moving from leaves upward.
Dependency Structure:
Each internal node depends on its children being processed first. This creates a strict dependency graph that is difficult to parallelize using standard CUDA threads. Unlike GPU-friendly flat arrays or grids, the tree’s depth and topology vary per region, which prevents straightforward parallel traversal.
Workload Imbalance:
Regions of the tree with deep branching require more operations than shallow regions. Threads assigned to different parts of the tree will complete at different times, leading to load imbalance across warps and thread blocks.
Traversal Ordering Constraints:
A post-order traversal is necessary to ensure that a parent node’s mass is computed only after all of its children. Enforcing this ordering on the GPU is difficult because device-side recursion is limited in depth and building a dynamic execution schedule within CUDA adds considerable complexity.
For these reasons, center-of-mass computation will be handled outside the main CUDA kernel and reused across time steps where possible.
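Because this step stays on the host, a plain recursive post-order pass suffices; a minimal sketch over the flattened TreeNode layout sketched above:

// Post-order pass: every child is aggregated before its parent is finalized.
// Leaf nodes are assumed to already carry the position and mass of their particle.
void computeCenterOfMass(std::vector<TreeNode>& nodes, int idx) {
    TreeNode& n = nodes[idx];
    bool isLeaf = n.child[0] < 0 && n.child[1] < 0 &&
                  n.child[2] < 0 && n.child[3] < 0;
    if (isLeaf) return;                        // leaf: values already set
    n.mass = 0.0f; n.comX = 0.0f; n.comY = 0.0f;
    for (int c = 0; c < 4; ++c) {
        if (n.child[c] < 0) continue;
        computeCenterOfMass(nodes, n.child[c]); // recurse into the child first
        const TreeNode& ch = nodes[n.child[c]];
        n.mass += ch.mass;
        n.comX += ch.mass * ch.comX;            // mass-weighted position sum
        n.comY += ch.mass * ch.comY;
    }
    n.comX /= n.mass;                           // divide by total mass to get the centroid
    n.comY /= n.mass;
}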
3. Tree Traversal for Force Computation
This step computes the gravitational force acting on each particle by traversing the tree. For each node encountered during traversal, the algorithm decides whether to treat the node as a single mass or to recurse further, based on a user-defined approximation threshold θ.
Divergent Execution Paths:
Each thread traverses a unique path through the tree, depending on the position of the particle and the surrounding mass distribution. This leads to divergence within CUDA warps, as threads execute different instructions at different times. As a result, execution becomes serialized, and overall throughput is reduced.
Per-Thread Stack Requirements:
Tree traversal is typically implemented using recursion or an explicit stack. CUDA does not natively support deep recursion in device code, and maintaining per-thread stacks in global memory is inefficient. Allocating stacks in shared memory is also difficult due to limited space, especially when working with thousands of threads.
Irregular Memory Access Patterns:
Each thread accesses different nodes of the tree, often jumping across memory addresses with little spatial locality. This results in poor memory coalescing and low cache efficiency, as neighboring threads in a warp are unlikely to access nearby memory locations.
Workload Variability:
Particles located near dense clusters of other bodies require deeper traversal and more force contributions. This causes variation in execution time per thread, leading to underutilization of some warps and imbalance in computation.
These constraints make the force computation kernel the most complex and performance-critical part of the CUDA implementation.
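To make the discussion concrete, here is a minimal sketch of such a kernel, using a small fixed-size per-thread stack in local memory, the flattened TreeNode layout from above, and a softening term to avoid division by zero (the gravitational constant is omitted and all names are placeholders):

__global__ void computeForcesKernel(const TreeNode* nodes, int root,
                                    const float* posX, const float* posY,
                                    float* accX, float* accY,
                                    int numParticles, float theta, float soft2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float px = posX[i], py = posY[i];
    float ax = 0.0f, ay = 0.0f;

    int stack[64];                    // explicit per-thread stack; depth is bounded
    int top = 0;
    stack[top++] = root;

    while (top > 0) {
        TreeNode n = nodes[stack[--top]];
        float dx = n.comX - px, dy = n.comY - py;
        float d2 = dx * dx + dy * dy + soft2;    // softened squared distance
        float d  = sqrtf(d2);

        bool isLeaf = n.child[0] < 0 && n.child[1] < 0 &&
                      n.child[2] < 0 && n.child[3] < 0;
        if (isLeaf || (n.width / d) < theta) {
            // Far enough away (or a leaf): treat the node as one aggregated mass.
            float invD3 = 1.0f / (d2 * d);
            ax += n.mass * dx * invD3;
            ay += n.mass * dy * invD3;
        } else {
            // Too close: push the children and keep descending.
            for (int c = 0; c < 4; ++c)
                if (n.child[c] >= 0) stack[top++] = n.child[c];
        }
    }
    accX[i] = ax;
    accY[i] = ay;
}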
4. Position and Velocity Update
Once the net force on each particle is known, the particle’s velocity and position are updated using Newton’s second law of motion. This update occurs independently for each particle.
Floating-Point Precision Tradeoffs:
Using single-precision floating point is fast but can introduce significant rounding errors, especially over many time steps. Accumulated drift can lead to violations of energy conservation and system instability. Using double-precision improves numerical accuracy but greatly reduces performance on many consumer and mid-range GPUs.
Safe Parallel Updates:
Although each thread updates one particle, care must be taken to avoid race conditions if intermediate results are stored in shared buffers or reused across time steps. Improper indexing or concurrent writes can lead to silent data corruption or hard-to-debug logic bugs.
Numerical Integration Accuracy:
The choice of integration method (e.g., explicit Euler vs. Verlet) affects both the stability and accuracy of the simulation. More accurate methods may require reading from and writing to multiple versions of the particle state, increasing memory demand and bandwidth pressure on the GPU.
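As one example of what this stage could look like, a sketch of a semi-implicit Euler update kernel (one of the simpler integrators we may start from; the real alias is there so the same code can be rebuilt in double precision for the accuracy comparison):

using real = float;   // switch to double when measuring the precision tradeoff

__global__ void integrateKernel(real* posX, real* posY,
                                real* velX, real* velY,
                                const real* accX, const real* accY,
                                int numParticles, real dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    // Each thread owns exactly one particle, so no synchronization is needed
    // as long as no other kernel writes these arrays at the same time.
    velX[i] += accX[i] * dt;      // update velocity from the accumulated acceleration
    velY[i] += accY[i] * dt;
    posX[i] += velX[i] * dt;      // then advance the position with the new velocity
    posY[i] += velY[i] * dt;
}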
The combination of irregular data access, divergent control flow, memory layout complexity, and numerical sensitivity makes the Barnes-Hut algorithm a challenging but instructive problem to parallelize on the GPU. Through this project, we hope to better understand how to adapt tree-based hierarchical algorithms to CUDA, and to evaluate tradeoffs between precision, performance, and implementation complexity.
Week 1: March 26 – March 31
Set up project repository, development environment, and CUDA toolchain (nvcc, Nsight, etc.)
Implement the serial version of the Barnes-Hut algorithm in C++:
Tree construction
Center-of-mass computation
Force calculation
Position updates
Validate correctness of the serial version on small datasets (e.g., 100–1000 particles)
Begin designing data structures for a GPU-friendly tree format
Generate test datasets with varied particle distributions (uniform, clustered)
Week 2: April 1 – April 7
Begin CUDA implementation of the Barnes-Hut algorithm:
Design flattened tree structure suitable for device memory
Implement GPU-friendly tree layout and transfer it from host
Write initial per-thread tree traversal kernel with θ-based approximation
Test GPU memory transfers and traversal correctness for small inputs
Begin profiling: measure memory usage, thread divergence, and occupancy using Nsight Compute
Week 3: April 8 – April 14
Finalize CUDA force computation kernel
Complete CUDA implementation of:
Tree traversal logic
Force accumulation
Position and velocity updates
Compare CUDA output with the serial version for correctness
Benchmark performance on datasets of increasing size
Start work on visualization GUI: design file format for saving particle positions over time
Draft intermediate milestone report (due April 15)
Week 4: April 15 – April 21
Submit milestone report (April 15)
Complete and test GUI playback tool for visualizing particle motion
Begin CUDA kernel optimization:
Tune memory layout (SoA vs. AoS)
Experiment with shared memory for caching subtree nodes
Reduce warp divergence through traversal restructuring
Profile and compare float vs. double precision for accuracy and performance
Week 5: April 22 – April 28 (Final Week)
Final tuning of CUDA kernel (if needed)
Polish visualization for possible live demo
Finalize:
Full project report (technical design, analysis, profiling results)
Final poster (figures, performance charts, key insights)
Prepare demo walkthrough (if presenting live)
Rehearse final presentation
Submit all project deliverables before the deadline
Planned Deliverables
1. A complete serial CPU reference implementation
We will build a single-threaded CPU implementation of the full Barnes-Hut algorithm, including tree construction, center-of-mass computation, tree traversal, and force accumulation. This baseline will ensure correctness and serve as a point of comparison for measuring GPU performance improvements.
2. A fully working CUDA implementation of the Barnes-Hut algorithm
Our main deliverable is a CUDA implementation that offloads the performance-critical force computation and position update phases to the GPU. Tree construction and center-of-mass calculation will be performed on the CPU, and the tree structure will be uploaded to the GPU. Each thread will be responsible for computing the force on one particle by traversing the tree in parallel.
3. A flexible particle dataset generator
We will write a tool or script to generate synthetic datasets with adjustable size (e.g., 1,000 to 1,000,000+ particles), mass distributions, and spatial configurations. This will allow us to test the scalability and correctness of our implementation under various simulation conditions.
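A sketch of how the clustered case could be generated (plain C++ on the host; all parameters and names are illustrative):

#include <random>
#include <vector>

struct Particle { float x, y, mass; };

// Generate 'count' particles split evenly across 'numClusters' Gaussian blobs
// whose centers are drawn uniformly from the [0, extent) x [0, extent) box.
std::vector<Particle> makeClusteredDataset(int count, int numClusters,
                                           float extent, float spread,
                                           unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uni(0.0f, extent);
    std::normal_distribution<float> gauss(0.0f, spread);

    std::vector<Particle> centers(numClusters);
    for (auto& c : centers) c = {uni(rng), uni(rng), 0.0f};

    std::vector<Particle> particles(count);
    for (int i = 0; i < count; ++i) {
        const Particle& c = centers[i % numClusters];   // round-robin over clusters
        particles[i] = {c.x + gauss(rng), c.y + gauss(rng), 1.0f};
    }
    return particles;
}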
4. A graphical user interface (GUI)
We will develop a lightweight visualization tool that shows the evolution of particles across time steps. The GUI will load the output data and display particle movement over time to help with debugging, performance analysis, and presentation. If feasible, we will demonstrate this live at the poster session.
5. A detailed final report
We will submit a full project report that explains our design decisions, CUDA implementation strategy, optimization attempts, profiling results, and lessons learned. The report will include figures showing performance vs. problem size, numerical accuracy comparisons (float vs. double), and GPU utilization breakdowns using profiling tools such as Nsight Compute.
6. A technical poster for the final presentation
We will create a clear and well-organized poster that summarizes the project goals, implementation details, evaluation results, and open questions. It will be aimed at a technical audience familiar with parallel systems.
Optional Goals
1. CUDA kernel optimization for performance
If time permits, we will experiment with memory coalescing, shared memory tiling, and warp-level control to improve performance. We will test the effect of different data layouts (e.g., SoA vs. AoS) and identify the key bottlenecks that limit GPU throughput. Our target is to achieve a 10× speedup over the CPU baseline for the force computation step. We believe this is achievable based on parallelism in the problem and expected GPU occupancy.
2. Explore real-time performance with large particle sets
As an additional goal, we will try to simulate and render small to mid-sized systems (e.g., ~10,000 particles) in real time or near-real time. This would require optimizing both computation and memory access patterns, and may involve limiting tree depth or pruning traversal steps based on screen resolution.
For this project, we will use the NVIDIA RTX 2080 GPUs available on the GHC lab machines to develop and test our CUDA implementation. We are starting from scratch and writing our own codebase for both the serial and CUDA versions of the Barnes-Hut algorithm, as we want to design the memory layout, tree structure, and traversal strategy ourselves based on our understanding of the algorithm and the GPU architecture.
We are using several external resources as references to guide our design and implementation. The Arbor.js Barnes-Hut visualization provides an intuitive demonstration of how the tree structure behaves over time in a dynamic system. For GPU-specific considerations, we are referencing the following paper:
Burtscher, M., & Pingali, K. (2011). An Efficient CUDA Implementation of the Tree-Based Barnes Hut N-Body Algorithm. In GPU Computing Gems Emerald Edition, Morgan Kaufmann.
This paper offers a high-level discussion of CUDA-specific challenges and performance tips for hierarchical N-body simulations, and it will help guide our approach to optimizing tree traversal and memory access.
We do not require any additional special machines or proprietary software at this time.
We have chosen to implement our project using CUDA on the NVIDIA RTX 2080 GPUs available on the GHC lab machines. This platform is well-suited for our workload because the force computation phase of the Barnes-Hut algorithm is highly parallel and maps naturally to the GPU's data-parallel execution model. Each particle’s force can be computed independently, allowing us to assign one CUDA thread per particle and take advantage of thousands of GPU cores operating in parallel.
The RTX 2080 offers a large number of CUDA cores and high memory bandwidth, which is important for accelerating memory-intensive tasks like traversing the tree and updating particle positions. Additionally, CUDA provides mature tools such as Nsight Compute and Nsight Systems for kernel profiling and memory access analysis, which will help us optimize memory layout, manage shared memory usage, and minimize thread divergence. The combination of compute power, tooling support, and programming flexibility makes this platform a strong fit for exploring the performance characteristics and parallelization challenges of the Barnes-Hut algorithm.