At the 2004 Game Developers Conference, Matt Pritchard (one of my coworkers at the now closed Ensemble Studios, who wrote Age of Empires 2's graphics code), John Brooks (CTO and my boss at Blue Shift Inc., the graphics programmer on "Super Mario Wacky Worlds") and I came out of the cold and gave a fairly rushed but groundbreaking 1-hour presentation on real-time deferred rendering techniques to an audience of 300-400 people. At the time, deferred shading was a pretty much unknown real-time rendering technique, and whenever I brought the topic up with developers or video card vendors I was typically treated like I had three heads. (I'm a bit hardheaded, and a very independent sort of person, so I didn't really care much. I knew the idea worked because I had already shipped a fully deferred original Xbox game by that time.) I had no idea what the response would be to this presentation, or how much deferred rendering would eventually catch on in the industry. After the talk I went back to Ensemble Studios and continued plugging away on the graphics engine and tools for one of Ensemble's early prototype games named "Wrench", so I didn't put this presentation on the web or do anything special to make it easy to find.
I didn't really realize that the work we were doing was breaking new ground. The main alternative (the pass per object per light approach used on Doom 3) seemed laughably inefficient and primitive by comparison. By this time, Atman Binstock (who later moved on to Rad Game Tools, then eventually to Oculus as Chief Architect) and I had already shipped a fully deferred shaded 3rd person platformer way back in 2001. The game was named Shrek for the original Xbox (or "Xbox 1" on this page), created at Sandbox Studios/DICE Canada in London, ON. It was an Xbox launch title with very pretty graphics (but sadly, weak gameplay):
We didn't realize it, but Shrek Xbox was the first deferred shaded game. A while after shipping Shrek, and before this talk, I had already spent around a year contracting for Microsoft's ATG (Advanced Technology Group - the team behind Xbox) researching deferred rendering on ATI's early prototype shader model 2.0 hardware, so I was extremely confident this approach to rendering scenes was the right path forward.
Rewind to A Bit of Deferred Shading History, Way Back in 2001
If it wasn't for Atman I wouldn't have started down this path in the first place. Atman originally suggested that we try deferred shading, and without his very deep math skills there's no way I could have gotten omnidirectional lights to work efficiently on the original Xbox's groundbreakingly powerful (for the time) but constraining NVidia GPU. At the time Atman suggested it, I already had a bunch of real-time rendering experience after writing the software and Direct3D 7 renderers and 3DS Max exporters for an annoying and quirky little 3D PC kids game named Matchbox Emergency Patrol. We figured the Xbox had these new amazing programmable shader units, so how hard could it be?
I had read every NVidia white paper/presentation/etc. I could get my hands on about per-pixel lighting and normal mapping, especially every one written by Mark Kilgard at NVidia. Rendering the various attribute buffers (what we called the "C", "N", etc. buffers) and computing a deferred directional light was easy, but computing proper and efficient deferred omni lights on Xbox 1 was extremely challenging. I had a deferred omni light solution which used the NV2A's vertex pipeline to process pixels, but at ~20 megaverts/sec. (at best) it was just too slow. Early on I figured out how to alias the NV2A's depth/stencil surface to a 32bpp texture, so that the 24 bits of depth and 8 bits of stencil data could be read as regular RGBA color channels and processed in the texture or combiner units. The NV2A could fetch from a texture, perform three 3D or 4D dot products (at floating point precision as far as we could tell - not fixed point!) against the fetched/filtered results using interpolated texture coordinates, then look up the transformed results into another cubemap or 3D texture. (This corresponded to the texm3x3tex, etc. "instructions" in the Xbox's shader assembler.) The results could then be further processed in the powerful fixed point combiner units, where you could rotate the vector again, approximately normalize it, compute stuff like N.L or N.H, etc.
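As a rough illustration of the depth aliasing idea (not the actual NV2A register setup), the sketch below splits a 32-bit depth/stencil word into four 8-bit channels and then rebuilds the depth from the three depth channels with a single weighted dot product. The bit layout (depth in the top 24 bits, stencil in the low 8) and the function names `split_d24s8` and `depth_from_channels` are assumptions made for this example.

```python
def split_d24s8(word):
    """Split a 32-bit depth/stencil word into (hi, mid, lo, stencil) bytes.

    Assumed layout: 24 bits of depth in the top bits, 8 bits of stencil below.
    """
    depth = (word >> 8) & 0xFFFFFF
    stencil = word & 0xFF
    return (depth >> 16) & 0xFF, (depth >> 8) & 0xFF, depth & 0xFF, stencil


def depth_from_channels(r, g, b):
    """Rebuild a [0, 1] depth from three color channels already scaled to [0, 1].

    Each channel reaches the shader as byte / 255, so the weights undo that
    scale and restore each byte's place value in one 3D dot product.
    """
    max_depth = float(0xFFFFFF)
    weights = (255.0 * 65536.0 / max_depth,
               255.0 * 256.0 / max_depth,
               255.0 / max_depth)
    return r * weights[0] + g * weights[1] + b * weights[2]
```

The point of the sketch is that once the depth bytes show up as ordinary color channels, recovering a usable depth value is just a dot product, which is the kind of operation the texture and combiner units were good at.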
Atman figured out what vectors we needed to interpolate in order to properly unproject the depth values from the Z buffer into viewspace coordinates, then from there effectively transform the viewspace coords into normalized light space. Once we were in normalized light space we could then use the vector as texcoords to fetch from a precomputed 3D texture. This 3D texture lookup returned a (mostly) normalized vector in RGB and the omni light's attenuation value in A. Once Atman figured out the equations I stayed up over 24 hours straight trying to coax the NV2A into the right texture+combiner setup to implement them. We both doubted it was actually possible (due to worries about precision and whether we could actually get a "1.0" into the dot products) but I kept at it. There was nothing like PIX available at the time so I just had to keep running experiments and studying the output bits. At the time the only references I had were some NVidia papers and an Xbox Direct3D header with cryptic, undocumented combiner structs and enums. We couldn't use MS's early pixel "shader" assembler because it didn't support the obscure but crucial texture mode that we needed ("HILO_1") in order to get a "1.0" into the dot products. (Without the 1 we couldn't add in the translation column of the interpolated matrices.) I had to roll the whole combiner setup thing by hand. I first got the method working on a single axis, then visually verified the results in a single test room with a huge omni. Once I got one axis of the omni working and stable the other two were trivial.
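The math behind this can be sketched on the CPU as follows. This is a simplified illustration, not the equations Atman derived or the interpolated-vector formulation the hardware used: it assumes a standard Direct3D-style perspective depth in [0, 1], computes the viewspace position per pixel, and uses a made-up quadratic falloff in place of the precomputed 3D texture. All function names and parameters are hypothetical. Note how the appended 1.0 is what lets the three 4D dot products apply the translation column.

```python
import math


def view_z_from_depth(d, near, far):
    """Invert a D3D-style perspective depth d in [0, 1] back to viewspace z."""
    return near * far / (far - d * (far - near))


def view_pos(ndc_x, ndc_y, d, near, far, tan_half_fov_x, tan_half_fov_y):
    """Unproject a pixel (NDC x, y in [-1, 1]) and its depth into viewspace."""
    z = view_z_from_depth(d, near, far)
    return (ndc_x * tan_half_fov_x * z, ndc_y * tan_half_fov_y * z, z)


def light_space_rows(light_pos, radius):
    """Rows of a 3x4 matrix mapping viewspace into normalized light space.

    The fourth column is the translation; it only takes effect because the
    position is extended with a 1.0 before the dot products.
    """
    s = 1.0 / radius
    lx, ly, lz = light_pos
    return ((s, 0.0, 0.0, -lx * s),
            (0.0, s, 0.0, -ly * s),
            (0.0, 0.0, s, -lz * s))


def to_light_space(p, rows):
    """Three 4D dot products against (x, y, z, 1)."""
    p1 = (p[0], p[1], p[2], 1.0)
    return tuple(sum(a * b for a, b in zip(row, p1)) for row in rows)


def volume_lookup(u):
    """Stand-in for the 3D texture: (direction to light, attenuation).

    The falloff 1 - |u|^2 is an assumption chosen for this example.
    """
    dist = math.sqrt(sum(c * c for c in u))
    atten = max(0.0, 1.0 - dist * dist)
    direction = tuple(-c / dist for c in u) if dist > 0.0 else (0.0, 0.0, 0.0)
    return direction, atten
```

In normalized light space the light sits at the origin and its radius of influence is 1, so anything outside the unit sphere gets zero attenuation and the coordinate can be used directly as a texture address after remapping to [0, 1].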
Shrek's omni lights required two separate screenspace passes because I couldn't fit the entire diffuse/spec lighting equation into a single combiner setup. I spent a lot of time working on minimizing the number of pixels that needed to be processed for each omni in the scene. Like most devs who start down the deferred shading path, I started with full-screen quads to get things working, but of course that was brutally slow. The final shipping game rendered 2D n-gons at constant Z depths that were guaranteed to conservatively surround each omni light's screenspace projection. (I didn't render 3D tessellated spheres or whatever, these 2D polygons were computed completely on the CPU.) We enabled Z-testing to discard pixels that couldn't possibly have a non-zero attenuation value. (Or for omni pixels that were behind walls and obviously wouldn't impact the scene - I don't remember anymore.) From memory, the effective omni fillrate was around 65 megapixels/sec. on Xbox 1 which felt kinda slow and limiting at 640x480, even at 30 Hz.
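One simple way to build a conservative 2D polygon on the CPU is to circumscribe a regular n-gon around a bounding circle, as sketched below. This is an assumption-laden illustration, not the code Shrek shipped: it treats the light's screenspace footprint as a circle (a projected sphere is generally not one), and `circumscribed_ngon` and its parameters are hypothetical names.

```python
import math


def circumscribed_ngon(cx, cy, radius, sides=8):
    """Return the vertices of a regular n-gon that fully contains a circle.

    Pushing the vertices out to radius / cos(pi / sides) makes every edge
    tangent to the circle, so no pixel inside the circle is missed.
    """
    outer = radius / math.cos(math.pi / sides)
    step = 2.0 * math.pi / sides
    return [(cx + outer * math.cos(i * step), cy + outer * math.sin(i * step))
            for i in range(sides)]
```

More sides waste fewer pixels outside the circle at the cost of more vertices. With 8 sides the polygon covers roughly 5% more area than the circle, compared with about 27% for a bounding square.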
While walking to lunch one day during development, we were discussing how to accomplish transparent water surfaces in a deferred renderer, without having to resort to writing a separate forward shading pipeline. Atman realized that stippling was the answer. For water splashes, I blended splash decals into the diffuse and normal buffers, which looked great over time because the splashes interacted with lighting in a natural looking way. (This was a sort of crude deferred decal.)
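For readers unfamiliar with stippling (also called screen-door transparency), here is a minimal sketch of the idea: the transparent surface is written only to a fixed subset of pixels, so the opaque deferred pipeline still sees exactly one surface per pixel. The 2x2 ordered pattern and the names `BAYER_2X2`, `stipple_keep` and `coverage` are assumptions for this example; I'm not claiming this is the pattern Shrek used.

```python
# 2x2 ordered-dither pattern; each entry ranks a pixel within its 2x2 block.
BAYER_2X2 = ((0, 2),
             (3, 1))


def stipple_keep(x, y, opacity):
    """Return True if the surface should be written at pixel (x, y)."""
    threshold = (BAYER_2X2[y & 1][x & 1] + 0.5) / 4.0
    return opacity > threshold


def coverage(opacity, width=4, height=4):
    """Fraction of pixels in a width x height block that the surface covers."""
    kept = sum(stipple_keep(x, y, opacity)
               for y in range(height) for x in range(width))
    return kept / float(width * height)
```

At 50% opacity the water lands on two of every four pixels and the geometry underneath shows through on the other two, with both sets of pixels going through the same deferred lighting passes.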
To get all this technology into a shippable state, I wrote a custom real-time timeline profiler using the Xbox 1's asynchronous DPC (Deferred Procedure Call) GPU callback mechanism to measure when, and for how long, the CPU and GPU processed each major operation. I would take an RDTSC snapshot when the events started/stopped on the CPU, and do RDTSC's in the DPC callbacks. The latency of these callbacks was surprisingly low. Instead of the usual PIX-like horizontal timeline view, my CPU/GPU timeline profiler displayed each event in two vertical columns, with lines used to connect the CPU and GPU events so I could easily visualize the amount of parallelism between the two processors. This profiler lived completely within the game itself so we had to use the Xbox 1's controller to zoom, pan, etc. I used this profiler to tune the shaders and rearrange the frame's rendering order to ensure the GPU was busy as much as possible.
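The kind of number such a profiler makes visible can be sketched in a few lines. Given (start, end) timestamp pairs for GPU events within a frame, the function below merges overlapping intervals and reports the fraction of the frame the GPU was busy. This is a hypothetical reconstruction for illustration only; the name `busy_fraction` and the data layout are assumptions, and the real profiler worked from RDTSC values captured in DPC callbacks.

```python
def busy_fraction(events, frame_start, frame_end):
    """Fraction of [frame_start, frame_end] covered by any (start, end) event."""
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(events):
        # Clamp each event to the frame and skip anything that falls outside.
        start, end = max(start, frame_start), min(end, frame_end)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Gap before this event: close out the previous merged interval.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / float(frame_end - frame_start)
```

For example, GPU events spanning ticks 0-4, 2-6 and 8-10 in a 10-tick frame leave a 2-tick bubble between 6 and 8, which is exactly the sort of gap that rearranging the frame's rendering order is meant to close.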
Anyhow, a few years later I wrote and gave a GDC talk based on this and my later Xbox 1 deferred shading work. In this talk I covered such topics as: the differences between deferred shading versus deferred lighting (apparently rediscovered and called "Light Pre-Pass Rendering"), attribute compression, "G-Buffer" channels (called the "C Buffer" and "N Buffer" in the talk), depth buffer aliasing, normal compression/packing, and other related topics.
The GDC Vault seems to be in flux and the old URL to the presentation doesn't work anymore, so I've placed a copy here:
Here's the Xbox 1 attribute packing "pixel shader" used by Gladiator missing from the slides.
At this presentation I showed my second deferred engine powering an Xbox deferred lighting (aka "light prepass") demo named "Gladiator". This demo showed what was probably the first and only single scene pass attribute compressing deferred renderer created on Xbox 1. (To put this in perspective, Xbox 1 did not support multiple render targets, and only had a series of complex fixed point combiners, not true pixel shaders as commonly understood today. Shrek had to use two full scene passes to output the same g-buffer data that the Gladiator demo could do in a single scene pass, allowing us to have more lights and/or meshes.)
John Brooks (CTO of Blue Shift) inspired me to try attribute compression while creating my second deferred renderer, because he thought it was very inefficient and limiting that Shrek had to render the scene twice per frame. At this talk he demonstrated a PS2 deferred shading demo that used the PS2's VU (Vector Units) to compute per-pixel lighting.
I've placed a 5 Mbps WMV-encoded video of the demo here (warning: this file is ~100MB):
Here's the video on Youtube:
This interactive demo was shown running in real-time on a devkit in early 2003, at Blue Shift's GDC booth. It was then recycled for my GDC 2004 talk. Fast forward to 2:00 to see the lighting accumulation buffer.
During 2004-2005 I wrote my third (and last) deferred shaded engine for the new Shader Model 2 hardware. This code would eventually be used at Ensemble Studios, in their "Wrench" game prototype demo, which was shown to Bill Gates as part of some larger "Xenon" demo. We had access to ATI prototype hardware to make this demo doable in real-time at 800x600. This deferred engine featured dynamically shadowed omni and spot lights, multi-pass ray marching (with multiple shadow buffer lookups per pass) for participating media effects on shadowed spotlights, deferred transparent surfaces via stippling/dithering (with an "opacity" G-buffer channel and neighborhood blending done in the post-process pass), "medium" dynamic range (>8 bits per RGB component) lighting accumulation via two 32-bit MRTs, and camera bloom/glow effects inspired by Kawase (graphics programmer on Wreckless). The team at Ensemble made this demo with the Wrench prototype code in approximately 7-10 days, where it was known inside Ensemble as "SevenDemo":
Much of Wrench's graphics and tools code was used to bootstrap Halo Wars (Xbox 360), Ensemble Studios' last product. At the beginning of Halo Wars I had decided to move on from deferred shading and try out some alternative approaches.
Shrek, this GDC presentation and the Gladiator deferred lighting demo were later cited by several articles and papers on deferred rendering:
GPU Gems 2, Chapter 9, "Deferred Shading in S.T.A.L.K.E.R."
"A bit more deferred – CryEngine3"
"Interactive Massive Lighting for Virtual 3D City Models"
Real-Time Rendering, "Deferred lighting approaches"
"Hybrid Deferred Rendering", Marries van de Hoef