The Game Assembly — Specialization Project
Building a GPU-Driven Particle System from Scratch
Eight weeks. A bare-bones engine. No particle system, no UAV experience, and a codebase that had never seen a Compute Shader.
Here's how that went.
//Overview
For my specialization at The Game Assembly, I built a GPU-driven particle system on top of our school's in-house DX11 engine, an engine we've been building and extending together since August.
The idea came from a gap I noticed in our pipeline. Our Technical Artists are genuinely skilled at writing HLSL, but they had almost nowhere to use it for real-time effects. The engine had no particle system. No flipbook support. Nothing. The closest thing was a small shader-handled flipbook one of the TAs wrote themselves just to fill the void.
I wanted to fix that, and I wanted to do it in a way that gave artists real power without needing to involve a programmer every time they wanted to try something new.
The goal wasn't just a particle system. It was a particle system where a Technical Artist could drop in a new compute shader and have a completely different effect — without touching a single line of C++.
I also had never worked with Compute Shaders or UAVs before. So this was very much going in blind, which made every milestone feel genuinely rewarding.
//Core Technology
Most shader stages in a graphics pipeline have a fixed purpose: the Vertex Shader moves geometry, the Pixel Shader colours fragments. A Compute Shader is different. It's a general-purpose GPU program that runs completely outside the normal render pipeline, letting you use the GPU for arbitrary parallel computation.
The reason this matters for particles is scale. A CPU updates objects one at a time, or at best across a handful of threads. A GPU can run tens of thousands of threads simultaneously. With a Compute Shader, every single particle gets its own thread. They all update in parallel, in the same frame, without waiting for each other.
In my system, the Compute Shader is dispatched once per frame. Each thread picks up one particle, checks if it's alive or dead, updates or respawns it, and writes the result back. Simple per-particle logic that scales to hundreds of thousands of particles with almost no overhead.
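To make the per-thread logic concrete, here is a minimal CPU-side sketch in C++ of what one thread does for one particle. It is not the engine's shader: the `Particle` layout, the function names, and the respawn-at-origin behaviour are assumptions made for illustration, and the loop in `updateAll` stands in for work that the GPU would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle layout; the real struct lives in HLSL and has more fields.
struct Particle {
    float position[3];
    float velocity[3];
    float age;       // seconds since spawn
    float lifetime;  // seconds until the particle dies
};

// What one GPU thread would do for the particle at `index`.
void updateParticle(std::vector<Particle>& particles, std::size_t index, float dt) {
    if (index >= particles.size()) return;  // guard for a partially filled last group
    Particle& p = particles[index];
    if (p.age >= p.lifetime) {
        // Dead: respawn at the origin with its age reset (emission gating omitted).
        p.position[0] = p.position[1] = p.position[2] = 0.0f;
        p.age = 0.0f;
        return;
    }
    // Alive: integrate position and advance age.
    for (int i = 0; i < 3; ++i) p.position[i] += p.velocity[i] * dt;
    p.age += dt;
}

// Stand-in for Dispatch: on the GPU these iterations run in parallel.
void updateAll(std::vector<Particle>& particles, float dt) {
    assert(dt >= 0.0f);
    for (std::size_t i = 0; i < particles.size(); ++i) updateParticle(particles, i, dt);
}
```

No iteration reads another particle's data, which is what lets each one map onto its own thread.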
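The [numthreads(1024,1,1)] attribute tells the GPU to launch groups of 1024 threads at a time. 1024 is the maximum allowed per group in Shader Model 5, and it divides into exactly 32 warps of 32 threads each. The warp is the unit Nvidia hardware schedules and executes together, so a full group leaves no lanes idle. Only the final group can have spare threads, when the particle count isn't a multiple of 1024.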
//Rendering Pipeline
GPU resources can be bound to shader stages in different ways depending on what that stage needs to do with them. The two relevant ones here are the SRV (Shader Resource View) and the UAV (Unordered Access View).
An SRV is read-only. A UAV allows both reading and writing. The Compute Shader needs to modify particles (kill them, move them, respawn them), so it binds the particle buffer as a UAV. The Vertex Shader only needs to read particle positions to place them in the world, so it binds the same buffer as an SRV.
It's the same buffer underneath. Just bound differently depending on who needs it and what they need to do with it.
One detail worth calling out: particles are rendered as billboards, flat textured quads that always rotate to face the camera. This is handled in the Vertex Shader by aligning the quad to the camera's right and up vectors, so particles always look correct from any angle without needing actual 3D geometry.
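The billboard maths is small enough to sketch. The C++ below mirrors what the Vertex Shader would compute for one corner of a quad. The names `Vec3`, `billboardCorner`, and `halfSize` are illustrative assumptions, and the camera vectors are assumed to be unit length and in world space.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 scale(Vec3 v, float s) { return {v.x * s, v.y * s, v.z * s}; }

// cornerX and cornerY are each -1 or +1 and select which corner to build.
// The quad is spanned by the camera's right and up vectors, so it always
// faces the camera regardless of where the particle is.
Vec3 billboardCorner(Vec3 center, Vec3 cameraRight, Vec3 cameraUp,
                     float cornerX, float cornerY, float halfSize) {
    assert(halfSize >= 0.0f);
    Vec3 offset = add(scale(cameraRight, cornerX * halfSize),
                      scale(cameraUp, cornerY * halfSize));
    return add(center, offset);
}
```

With a camera looking down the Z axis (right = +X, up = +Y), a particle at (1, 2, 3) with a half size of 0.5 gets its bottom-right corner at (1.5, 1.5, 3).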
//Emission
Emission is simply the logic that decides when a dead particle gets to respawn. Every particle has a lifetime and once it expires, it's eligible to be reborn. The emission mode controls how that rebirth is paced.
Burst mode fires all particles at once. Every particle spawns simultaneously and dies at roughly the same time, creating that sharp pop of a full emission followed by silence, then another pop. This is what you'd want for an explosion, a shockwave, or anything that needs a clear on/off rhythm.
Continuous mode staggers the spawns so there's always a stream of new particles being born while old ones die. Fire, smoke, sparks — anything that should feel like a constant flow rather than a wave.
Burst mode:
Continuous mode:
Getting continuous emission to feel even was one of the trickier problems to get right. The current implementation does the gating inside the shader itself, which means the spawn window is tied to delta time — on a slow frame too many particles spawn at once, on a fast frame none do. It works well enough at stable framerates, but a cleaner solution would be a CPU-side accumulator that carries the fractional remainder between frames and writes a precise spawn count to the constant buffer each frame. That's something I'd improve with more time.
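The accumulator idea can be sketched in a few lines of C++. This is a hypothetical helper, not code from the engine: `SpawnAccumulator`, `tick`, and the particles-per-second rate are names and parameters assumed for illustration. The value returned by `tick` is what would be written to the constant buffer each frame.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical CPU-side helper that turns a spawn rate into a whole number
// of particles per frame while carrying the fractional remainder forward.
class SpawnAccumulator {
public:
    explicit SpawnAccumulator(float particlesPerSecond) : rate_(particlesPerSecond) {
        assert(particlesPerSecond >= 0.0f);
    }

    // Returns how many particles to spawn this frame.
    std::uint32_t tick(float dt) {
        pending_ += rate_ * dt;
        float whole = std::floor(pending_);
        pending_ -= whole;  // keep only the fraction for the next frame
        return static_cast<std::uint32_t>(whole);
    }

private:
    float rate_;
    float pending_ = 0.0f;
};
```

At 10 particles per second and 0.25-second frames, the first frame spawns 2 and carries 0.5 forward, and the second frame spawns 3. Over time the total matches the rate regardless of how uneven the frame times are.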
The system also supports a Burst Count parameter. Setting it to 3, for example, splits the particles into three waves offset evenly in time — so a new burst fires every third of a lifespan. There are always overlapping waves alive at once, giving a continuous layered burst feel rather than a single pop.
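One way to get evenly spaced waves is to assign each particle to a wave by its index and delay its first spawn accordingly. The C++ sketch below assumes a round-robin assignment; that choice, and the names `waveOf` and `waveDelay`, are assumptions for illustration rather than a description of the shader's actual logic.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: particles are assigned to waves round-robin by index.
std::uint32_t waveOf(std::uint32_t particleIndex, std::uint32_t burstCount) {
    assert(burstCount > 0);
    return particleIndex % burstCount;
}

// Delay in seconds before a particle's first spawn. Waves are spaced
// lifespan / burstCount apart, so a new wave fires at that interval.
float waveDelay(std::uint32_t particleIndex, std::uint32_t burstCount, float lifespan) {
    float wave = static_cast<float>(waveOf(particleIndex, burstCount));
    return wave * lifespan / static_cast<float>(burstCount);
}
```

With a Burst Count of 3 and a 3-second lifespan, the three waves start at 0, 1, and 2 seconds.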
When a particle spawns, it needs an initial direction to travel. The emission shape determines where that direction comes from. Each shape is a small HLSL function that takes the particle index, some parameters, and the emitter's transform, and returns a normalized direction vector.
That direction gets multiplied by the particle speed and stored as the initial velocity. From there the particle just moves in a straight line — it's the shape of the spawn that gives the effect its character.
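As an illustration of what a shape function looks like, here is a C++ sketch of a sphere shape. The real functions are written in HLSL and also take the emitter's transform, which is left out here. The integer hash is a stand-in for whatever random source the shader uses, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Cheap integer hash mapped to [0, 1). A stand-in for the shader's RNG.
float hash01(std::uint32_t n) {
    n = (n ^ 61u) ^ (n >> 16);
    n *= 9u;
    n ^= n >> 4;
    n *= 0x27d4eb2du;
    n ^= n >> 15;
    return static_cast<float>(n >> 8) / 16777216.0f;  // top 24 bits -> [0, 1)
}

// Sphere shape: a unit direction picked uniformly over the sphere,
// derived only from the particle index so it is stable per particle.
Vec3 sphereDirection(std::uint32_t particleIndex) {
    const float pi = 3.14159265f;
    float z = 2.0f * hash01(particleIndex * 2u) - 1.0f;       // cosine of the polar angle
    float phi = 2.0f * pi * hash01(particleIndex * 2u + 1u);  // azimuth
    float r = std::sqrt(1.0f - z * z);
    return {r * std::cos(phi), r * std::sin(phi), z};
}

// Direction times speed gives the initial velocity.
Vec3 initialVelocity(std::uint32_t particleIndex, float speed) {
    Vec3 d = sphereDirection(particleIndex);
    return {d.x * speed, d.y * speed, d.z * speed};
}
```

A cone or disc shape would keep the same signature and only change how the two random values are mapped to a direction.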
//C++ Integration
The engine already had a render loop and a graphics state stack — my job was to hook into that without breaking anything that already existed. Each frame, when the renderer encounters a ParticleProperty component, it looks up or creates a ParticleEmitter object, updates its state, dispatches the Compute Shader, then hands the result straight to the Vertex Shader for drawing.
The ParticleEmitter class itself is relatively small by design. It owns the three GPU resources that the particle system lives and dies on — the structured buffer, the UAV, and the SRV — and exposes just enough interface for the render loop to drive it.
The render-side code that actually drives a particle emitter each frame is more involved. It handles looking up the emitter by ID, syncing property state, binding textures to slots 16–19, choosing the right blend mode, dispatching the reset shader if needed, dispatching the update shader, then drawing with DrawInstanced. Each particle gets 6 vertices — one billboard quad built entirely in the Vertex Shader.
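Six vertices per particle means two triangles per quad, with the vertex index selecting a corner. The C++ sketch below shows one possible mapping. The winding order and the names are assumptions, and it presumes a draw of the form DrawInstanced(6, particleCount, 0, 0), where the vertex ID runs from 0 to 5 and the instance ID is the particle index.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Corner { float x, y; };

// Two triangles covering the quad: (bottom-left, top-left, top-right) and
// (bottom-left, top-right, bottom-right). The winding order is an assumption.
constexpr std::array<Corner, 6> kQuadCorners = {{
    {-1.0f, -1.0f}, {-1.0f, 1.0f}, {1.0f, 1.0f},
    {-1.0f, -1.0f}, {1.0f, 1.0f}, {1.0f, -1.0f},
}};

// Maps a per-instance vertex index (0..5) to a quad corner in [-1, 1].
Corner cornerForVertex(std::uint32_t vertexId) {
    assert(vertexId < 6);
    return kQuadCorners[vertexId];
}

// Total vertices the Vertex Shader processes for one emitter.
std::uint64_t totalVertices(std::uint32_t particleCount) {
    return 6ull * particleCount;
}
```

The corner values feed straight into the camera-aligned offset, so no vertex buffer is needed at all.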
//GPU Threading
When you dispatch a Compute Shader, you don't just run it once — you run it in bulk. The [numthreads(X,Y,Z)] attribute defines how many threads launch per thread group, and your Dispatch(gX, gY, gZ) call defines how many groups to launch. The total threads are the product of both.
GPUs execute threads in fixed-size bundles called warps on Nvidia (32 threads wide) or wavefronts on AMD (64 wide on GCN, 32 or 64 on RDNA). If your group size isn't a multiple of that width, the GPU still pays for a full bundle but leaves some lanes idle and wasted. So you want your thread count to be a clean multiple of 32, and ideally of 64 so it fits both vendors.
For a 1D workload like a particle buffer, where every particle is just one entry in a flat array, [numthreads(1024,1,1)] is the right choice. 1024 is the maximum allowed per group; it's exactly 32 warps with zero waste, and it minimises the number of groups you need to dispatch for large particle counts.
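The group count for a given particle count is a ceiling division. A small C++ sketch, with `groupCount` and `idleThreads` as illustrative names:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kThreadsPerGroup = 1024;  // matches [numthreads(1024,1,1)]

// Number of groups to pass as the X argument of Dispatch: ceil(count / 1024).
constexpr std::uint32_t groupCount(std::uint32_t particleCount) {
    return (particleCount + kThreadsPerGroup - 1) / kThreadsPerGroup;
}

// Threads in the last group that have no particle to work on.
constexpr std::uint32_t idleThreads(std::uint32_t particleCount) {
    return groupCount(particleCount) * kThreadsPerGroup - particleCount;
}
```

For 100,000 particles this gives 98 groups with 352 spare threads in the last one, which is why the shader needs a bounds check on the thread index. Ten million particles need 9,766 groups, well under D3D11's limit of 65,535 groups per dimension.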
You'd use 2D groups ([numthreads(8,8,1)]) if you were working on a texture and wanted to index by pixel coordinate, or 3D for a voxel volume. The shape of the group should match the shape of your data. Particles are a flat array — one index, one thread.
There's also occupancy to consider: if your shader uses group-shared memory heavily, large groups can prevent the GPU from scheduling multiple groups on the same compute unit simultaneously, since they're competing for a fixed pool of shared memory. My particle shader uses none — every thread works independently — so 1024 is purely a win.
//The Default Compute Shader
The goal with the default shader was to pack in as much useful behaviour as possible — not just for our Technical Artists, but for level designers too. If a designer wants a quick-fire effect, some floating dust, or a burst of sparks on an event trigger, they shouldn't need a programmer or a TA to set it up. I used Unity's particle system as a reference for what that kind of expressiveness looks like in practice. I knew I wasn't going to match it, but it gave me a clear target to work towards.
The result is a shader that handles both emission modes, multiple shape types, burst waves, speed, lifespan, and colour — all driven by a constant buffer that the editor exposes directly. Every frame, one thread handles one particle: check if alive or dead, update or respawn, write back.
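On the C++ side, a constant buffer like this is usually mirrored by a plain struct. The sketch below is hypothetical: the field names, their order, and the mode encoding are assumptions, and only the list of parameters comes from the description above. The one hard rule it demonstrates is that D3D11 constant buffers must be sized in multiples of 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU mirror of the emitter constant buffer.
struct alignas(16) EmitterConstants {
    float deltaTime;
    float lifespan;
    float speed;
    std::uint32_t emissionMode;   // 0 = burst, 1 = continuous (assumed encoding)
    std::uint32_t shapeType;
    std::uint32_t burstCount;
    std::uint32_t particleCount;
    float padding;                // fills out the second 16-byte row
    float colour[4];              // starts on a 16-byte boundary, like an HLSL float4
};

static_assert(sizeof(EmitterConstants) % 16 == 0,
              "D3D11 constant buffers must be a multiple of 16 bytes");
```

Keeping four-component values on 16-byte boundaries matters because HLSL packs constant buffers into 16-byte rows, and a float4 that straddled two rows would not line up with the C++ layout.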
The shader is also designed to be swapped out entirely. Drop a new .hlsl in the shaders folder, and it appears in the editor dropdown — no C++ needed.
//Process
How I actually built this
The real timeline
Week one was planning, no actual implementation. Weeks two through four didn't go to plan for personal reasons. By the time I could actually start it was already week five. That left four weeks, and I used two of them building something I ultimately scrapped.
The week-six scrapping
My original plan was to build a CPU particle system first and migrate it to the GPU piece by piece. It seemed like the safe, incremental approach. It wasn't. Partway through I realized that doing it that way meant writing code for the CPU version, then deleting it and rewriting it to bridge the next step — and repeating that for every piece I moved over. Every migration would break the seam between what had moved and what hadn't. There was no version of that plan that didn't create a mess.
So at the end of week six I scrapped everything and started over with a clean GPU-first design. What's documented here is what I built in the two weeks that followed.
The memory problem
A few days after finishing I realized that each emitter owns its own particle buffer, meaning all that GPU memory is allocated whether the emitter is active or not. In a game, most emitters are dormant most of the time. That's a real cost.
I did expose the allocation size in the editor: anything from a single particle (~124 bytes) up to 10 million particles is valid. That only helps if you know how many particles you'll need on screen at once, though, and a change requires restarting the editor to take effect. The memory is also still allocated either way; a tuned size just means the buffer fits the effect better when it's actually playing.
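To put numbers on that cost, here is a small C++ sketch that works out buffer sizes from the approximate 124-byte stride mentioned above. The function names and the example emitter counts are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kParticleStride = 124;  // approximate bytes per particle, from the text

// Size of one emitter's structured buffer.
constexpr std::uint64_t bufferBytes(std::uint64_t particleCount) {
    return particleCount * kParticleStride;
}

// Total allocated across emitters, whether or not they are currently playing.
constexpr std::uint64_t totalBytes(std::uint64_t emitterCount, std::uint64_t particlesPerEmitter) {
    return emitterCount * bufferBytes(particlesPerEmitter);
}
```

A single emitter at the 10 million maximum comes to 1,240,000,000 bytes, roughly 1.24 GB. A more modest scene of 50 emitters at 100,000 particles each still holds 620 MB, even if only a few of them are ever active at once.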
The cleaner solution is a shared pool: one structured buffer for alive particles and one for dead ones, where emitters draw from the pool when active and return particles when done. I started rebuilding it this way. I split spawn and update into two separate compute shaders, got those working, then began implementing a ping-pong buffer approach to move particles between the alive and dead lists. That's where I got stuck. I couldn't get the update pass to behave correctly, and at that point I needed to shift focus to the portfolio site. The system is unfinished, but I want to, and am going to, keep developing it until I'm fully satisfied with its architecture.
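The core of the pool idea is a list of free slot indices. The C++ sketch below is a CPU stand-in for that bookkeeping and not the unfinished GPU implementation: on the GPU the dead list would be a buffer of indices that the spawn pass consumes from and the update pass appends to. `ParticlePool`, `acquire`, and `release` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU stand-in for a shared particle pool backed by a dead list.
class ParticlePool {
public:
    explicit ParticlePool(std::uint32_t capacity) {
        deadList_.reserve(capacity);
        // Every slot starts dead. Push in reverse so slot 0 is handed out first.
        for (std::uint32_t i = 0; i < capacity; ++i) deadList_.push_back(capacity - 1 - i);
    }

    // Spawn: take a free slot if one remains. Returns false when the pool is exhausted.
    bool acquire(std::uint32_t& outIndex) {
        if (deadList_.empty()) return false;
        outIndex = deadList_.back();
        deadList_.pop_back();
        return true;
    }

    // Death: hand the slot back so any emitter can reuse it.
    void release(std::uint32_t index) { deadList_.push_back(index); }

    std::size_t freeCount() const { return deadList_.size(); }

private:
    std::vector<std::uint32_t> deadList_;
};
```

Memory is sized once for the peak number of particles alive across all emitters, so a dormant emitter costs nothing beyond its own parameters.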
Worth noting: there's a genuine argument for keeping the per-emitter model in some contexts. If you want a single high-fidelity explosion at 10 million particles, a shared pool makes that allocation harder to reason about. Per-emitter probably makes more sense for cinematic or offline tools. Pooling makes more sense for a game with many small, frequently toggled emitters.
Bugs that took longer than they should have
Most issues I ran into were architectural and often resolved the same day with the help of RenderDoc. But there were three small bugs that really stuck with me.
#01 — ~4 hours
Invisible illegal character in HLSL. An illegal character had made its way into a shader file. Rider, Visual Studio, and Notepad all rendered the file cleanly — the character was invisible to every editor I tried. I ended up diffing against an earlier version in Perforce to isolate it, then deleted it manually. The compiler error gave no indication of where to look.
#02 — ~3 hours
Default case in a switch statement breaks CSO compilation. Having a default case in an HLSL switch statement would either cause the shader to silently fall back to an older compiled version or crash outright — no clear error either way. Removing the default case fixed it, but finding that as the cause took a while because the failure mode looked like an unrelated runtime issue.
#03 — ~5 hours (concurrent with architecture work)
ConsumeStructuredBuffer index failing validation. When using the Consume() intrinsic on a ConsumeStructuredBuffer, the HLSL compiler and GPU driver couldn't guarantee the returned index was valid, so the shader would fail validation or crash. The fix was to immediately copy the result into a local variable (I called it safeIndex) after consuming. That forces the compiler to treat it as a plain local rather than tracing it back through the consume intrinsic. Trivial once you know it, but this only manifests on some compilers, which is part of what made it hard to pin down. I was also mid-redesign on the architecture at the time, so isolating whether the bug was structural or a shader issue took extra time.
//Reflection
#01
Compute Shaders from zero. I had genuinely never touched a Compute Shader or a UAV before this project. Going in blind made the first time particles actually rendered feel disproportionately satisfying — and gave me a solid understanding of how GPU parallel execution actually works in practice, not just in theory.
#02
Thinking in parallel. Writing GPU code means abandoning the assumption that things happen in order. Particles don't update one after another — they all update simultaneously. Getting comfortable with that mental model changed how I reason about GPU workloads in general.
#03
Shipping under constraint. Two weeks to ship a working particle system for my technical artists. It forced fast, clear decisions about scope. While I'm proud of what I accomplished, I can't help but feel incredibly dense for overlooking the insane memory problem.
There's obviously a lot more I took away from this project; these are just the three that stuck with me most.
//Future Work
There's a lot I'd still like to add. The system works well, but it's nowhere near as expressive as something like Unity's VFX Graph yet. Here's what I'd tackle first with more time:
The architecture. As I've already discussed, the current architecture has a serious memory flaw when used in a game. My number one priority is to get the new architecture working before I move on to anything else.
Perlin noise. Adding Perlin or Curl noise to give particles organic, unpredictable movement. Currently, every particle follows a straight path from spawn — layering noise onto the velocity each frame would make effects like fire and smoke feel genuinely alive rather than ballistic.
More artist-facing parameters. Size over lifetime, colour over lifetime, and noise controls, all exposed directly through the editor's constant buffer. The goal is that an artist should never need to write a new shader unless they want to do something genuinely novel.
Scene collisions. Having particles respond to level geometry — bouncing sparks, smoke that fills a room, debris that lands on surfaces. Not just particle-on-particle but particle-on-world.
3D mesh particles. Right now, everything is a billboarded 2D sprite. Supporting actual 3D models as particle instances would open up destruction effects, object scatter, and a lot more.
This was one of the most genuinely fun things I've built at TGA. There's still so much I want to get in — and because of that, this isn't really finished. I'll definitely keep developing it in my free time.
Edvin Nordangård · TGA DX11 Engine · HLSL / C++
2026