Early on in the development of our own deferred shading architecture, we set about implementing a dynamic 'local' lighting solution for VFX that provided lighting for explosions, headlights and street lights.
We had already implemented a global lighting pass: a full-screen pass that shades the directionally lit view of the scene and combines it with deferred shadowing.
Our first local lighting implementation was based on common techniques: rasterize light volume geometry on the GPU (e.g. spheres for point lights) and calculate the lighting values per pixel in a lighting fragment shader.
The local light shading takes place in screen space; we shade the screen pixels that are covered by the light volume.
We also intersection-test the light volumes against the scene using the depth and stencil buffers, so that we shade only the areas where the light volume intersects scene geometry.
While shading each pixel we shoot a ray down the camera frustum and sample the Depth buffer; this reconstructs the world position of the surface under that pixel, information we need to attenuate the light. We also sample the Normal buffer to determine whether the surface is facing the light.
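The reconstruction step can be sketched in plain C++ (a minimal sketch, not the engine's code; `ReconstructViewPos` and its parameters are our illustration). Scaling a view ray with a unit z component by the pixel's linear depth lands exactly on the surface; transforming the result by the inverse view matrix would then give the world position used for attenuation.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Hypothetical helper: rebuild the view-space position under a pixel from a
// linear depth sample. (u, v) are normalized device coordinates in [-1, 1];
// tanHalfFovY and aspect describe the camera frustum.
Vec3 ReconstructViewPos(float u, float v, float linearDepth,
                        float tanHalfFovY, float aspect)
{
    // Ray through the pixel with a unit z component...
    Vec3 ray = { u * tanHalfFovY * aspect, v * tanHalfFovY, 1.0f };
    // ...scaled by the linear depth, lands on the surface under the pixel.
    return Vec3{ ray.x * linearDepth, ray.y * linearDepth, linearDepth };
}
```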
The resulting light value is then combined with GBuffer Albedo and accumulated into the Global Light Pass buffer, where the final lit image of the scene is generated.
We found that the PS3 (in comparison to the Xbox 360) quickly became fill-rate bound - large explosions were very expensive.
We decided to move the GPU local lighting to the SPU processors.
We needed the SPUs to generate local lighting in parallel with the GPU doing other work, like shadow cascades.
To calculate the local light values for a single screen space pixel we need the GBuffer Normal and the Depth buffers. We also need access to data that describes the lights covering that screen pixel; for example light position, color, type.
The PS3's Cell processor has 8 SPUs (6 of them available to game code), each with 256KB of fast local memory.
With a screen resolution of 1280x720 pixels, the 32-bit GBuffer Normal and Depth buffers are roughly 3.5MB each - clearly too much to fit on a single SPU and process in one job.
We split the screen into 64x64 pixel tiles, which happened to be the size of one PS3 main memory frame buffer tile.
We fed the SPUs with a queue of jobs; each job shaded a 64x64 pixel tile.
A screen resolution of 1280x720 would require 20x12 tiles or 240 SPU jobs.
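The sizes above work out as follows (plain C++ arithmetic, assuming the 32-bit buffer formats stated earlier):

```cpp
#include <cassert>

constexpr int kScreenW = 1280, kScreenH = 720, kTileDim = 64;

// Full-screen 32-bit buffer: far too big for one SPU's 256KB local store.
constexpr int kBufferBytes = kScreenW * kScreenH * 4;          // 3,686,400 B, ~3.5MB

// Tile grid: 720 is not a multiple of 64, so the vertical count rounds up.
constexpr int kTilesX = (kScreenW + kTileDim - 1) / kTileDim;  // 20
constexpr int kTilesY = (kScreenH + kTileDim - 1) / kTileDim;  // 12
constexpr int kJobs   = kTilesX * kTilesY;                     // 240 SPU jobs

// Per-tile slice of one 32-bit buffer: comfortably fits in local store.
constexpr int kTileBytes = kTileDim * kTileDim * 4;            // 16KB
```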
The SPU lighting takes place after the GBuffer has finished rendering - the GPU calls back to the PPU, and this triggers the job queue to send the jobs to the SPUs.
With each SPU job we send the tile's GBuffer Normal and Depth buffers to the SPU in a DMA call. We also send the list of lights touching that tile, classified in screen space.
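A hypothetical layout for one job's DMA'd input is sketched below; the names and exact fields are our illustration, not the engine's structures. The point is that the whole working set - two 64x64 32-bit tiles plus a small light list - fits easily within an SPU's 256KB local store:

```cpp
#include <cassert>
#include <cstdint>

constexpr int kTileDim       = 64;
constexpr int kMaxTileLights = 16;   // per-tile light cap

// Illustrative per-light data, mirroring the fields the shading code reads.
struct LightInfo {
    float posX, posY, posZ;       // world-space position
    float colR, colG, colB;       // diffuse colour
    float rangeReciprocal;        // 1 / light range
    float falloffScale;
};

// Everything one tile job needs, DMA'd into local store up front.
struct TileJobInput {
    uint32_t  normal[kTileDim * kTileDim];   // 32-bit GBuffer Normal tile
    uint32_t  depth [kTileDim * kTileDim];   // 32-bit Depth tile
    LightInfo lights[kMaxTileLights];
    uint32_t  numLights;
    uint32_t  tileX, tileY;                  // tile position on screen
};
```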
Each SPU job executes and shades the light values for its 64x64 pixel tile, and then DMAs the result to the equivalent tile region in a dedicated Local Light buffer.
While these jobs are executing, the GPU goes on to render the shadow cascade, after which it sits on a JTS (Jump To Self) in the command buffer, waiting for a signal that the SPU jobs have finished.
The final SPU job clears the JTS (wait) in the GPU command buffer and the GPU goes on to render the Global Light Pass, sampling the Local Light buffer.
As noted above, each SPU job receives the list of lights touching its tile; these lists are built by classifying the lights in screen space.
To classify, we built a list of the lights inside the camera frustum and depth sorted it, so that the nearest lights took priority in each tile.
We then found the screen space bounds of each light, and stepped over this to determine which tiles the light intersected. Each tile was tagged with the lights that touched it.
We also set a maximum limit of 16 lights per tile; this clamped the cost of shading.
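The classification loop can be sketched as follows (a sketch under assumptions: `ScreenRect` stands in for a light's projected screen-space bounds, whose computation is omitted). Because the lights were depth sorted before classification, the nearest lights claim the capped slots first:

```cpp
#include <cassert>
#include <vector>
#include <algorithm>

constexpr int kTileSize = 64, kTilesX = 20, kTilesY = 12, kMaxLightsPerTile = 16;

struct ScreenRect { int minX, minY, maxX, maxY; };   // inclusive pixel bounds

void ClassifyLight(int lightIndex, const ScreenRect& bounds,
                   std::vector<int> (&tileLights)[kTilesX * kTilesY])
{
    // Convert the light's pixel bounds to tile bounds, clamped to the screen.
    int tx0 = std::max(bounds.minX / kTileSize, 0);
    int ty0 = std::max(bounds.minY / kTileSize, 0);
    int tx1 = std::min(bounds.maxX / kTileSize, kTilesX - 1);
    int ty1 = std::min(bounds.maxY / kTileSize, kTilesY - 1);

    // Step over the covered tiles and tag each one with this light.
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx) {
            std::vector<int>& list = tileLights[ty * kTilesX + tx];
            if (list.size() < static_cast<size_t>(kMaxLightsPerTile))
                list.push_back(lightIndex);
        }
}
```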
Each job executes SPU code which shades the tile. The SPU code is written using intrinsic instructions, which gives optimal performance. We shade 4 lights at a time to take advantage of SIMD parallelization. (In the listing below, ZERO and ONE are splatted vector constants defined elsewhere.)
void ShadePointLights(
JobLight::cLightData & lightData,
const vec_float4 worldPosX, const vec_float4 worldPosY, const vec_float4 worldPosZ,
const vec_float4 normalX, const vec_float4 normalY, const vec_float4 normalZ,
vec_float4& outputR, vec_float4& outputG, vec_float4& outputB)
{
//
// float3 lightToPixel = lightPos - worldPos;
//
vec_float4 lightToPixelX = spu_sub( (vec_float4&)lightData.positionX, worldPosX );
vec_float4 lightToPixelY = spu_sub( (vec_float4&)lightData.positionY, worldPosY );
vec_float4 lightToPixelZ = spu_sub( (vec_float4&)lightData.positionZ, worldPosZ );
//
// float dist = length(lightToPixel);
//
// LengthReciprocal is a helper returning 1/sqrt(x*x + y*y + z*z); spu_re
// then gives an estimate of the distance itself, accurate enough for the
// attenuation term.
vec_float4 dist_reciprocal_0 = LengthReciprocal( lightToPixelX, lightToPixelY, lightToPixelZ );
vec_float4 dist_0 = spu_re( dist_reciprocal_0 );
//
// lightToPixel /= dist;
//
lightToPixelX = spu_mul( lightToPixelX, dist_reciprocal_0 );
lightToPixelY = spu_mul( lightToPixelY, dist_reciprocal_0 );
lightToPixelZ = spu_mul( lightToPixelZ, dist_reciprocal_0 );
//
// float attenuation = light_falloffScale[ in_ID ].x - ( light_falloffScale[ in_ID ].x * dist ) /lightRange;
//
vec_float4 attenuation = spu_sub( (vec_float4&)lightData.falloffScale,
spu_mul( spu_mul( (vec_float4&)lightData.falloffScale, dist_0 ),
(vec_float4&)lightData.rangeReciprocal ) );
//
// float lambert = dot(normal, lightToPixel);
//
vec_float4 lambert = spu_madd( lightToPixelX, normalX,
spu_madd( lightToPixelY, normalY,
spu_mul( lightToPixelZ, normalZ ) ) );
//
// attenuation = max(attenuation, 0);
//
// ANDing with the comparison mask zeroes every lane where attenuation <= 0.
attenuation = (vec_float4)spu_and( (vec_uint4)attenuation , spu_cmpgt(attenuation, ZERO) );
//
// attenuation = min( attenuation,1);
//
attenuation = (vec_float4)spu_sel( ONE, attenuation, spu_cmpgt(ONE, attenuation) );
//
// lambert = max(lambert, 0);
//
lambert = (vec_float4)spu_and( (vec_uint4)lambert, spu_cmpgt(lambert, ZERO) );
//
// float3 finalLight = lightDiff.rgb * lambert * attenuation;
//
vec_float4 attenuation_lambert = spu_mul( lambert, attenuation );
//
// accumulate output
//
outputR = spu_madd( (vec_float4&)lightData.diffColorR, attenuation_lambert, outputR );
outputG = spu_madd( (vec_float4&)lightData.diffColorG, attenuation_lambert, outputG );
outputB = spu_madd( (vec_float4&)lightData.diffColorB, attenuation_lambert, outputB );
}
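For reference, here is the same per-light maths in scalar C++ (our paraphrase, not engine code); it is handy for unit-testing the SIMD version one lane at a time:

```cpp
#include <cassert>
#include <cmath>
#include <algorithm>

struct Vec3 { float x, y, z; };

// Scalar equivalent of one SIMD lane of ShadePointLights.
Vec3 ShadePointLightScalar(Vec3 lightPos, Vec3 lightCol,
                           float falloffScale, float rangeReciprocal,
                           Vec3 worldPos, Vec3 normal)
{
    // float3 lightToPixel = lightPos - worldPos;
    Vec3 toLight = { lightPos.x - worldPos.x,
                     lightPos.y - worldPos.y,
                     lightPos.z - worldPos.z };

    // float dist = length(lightToPixel); lightToPixel /= dist;
    float dist = std::sqrt(toLight.x * toLight.x +
                           toLight.y * toLight.y +
                           toLight.z * toLight.z);
    toLight = { toLight.x / dist, toLight.y / dist, toLight.z / dist };

    // attenuation = falloffScale - falloffScale * dist / range, clamped to [0, 1]
    float attenuation = falloffScale - falloffScale * dist * rangeReciprocal;
    attenuation = std::min(std::max(attenuation, 0.0f), 1.0f);

    // lambert = max(dot(normal, lightToPixel), 0)
    float lambert = std::max(toLight.x * normal.x +
                             toLight.y * normal.y +
                             toLight.z * normal.z, 0.0f);

    // finalLight = lightCol * lambert * attenuation
    float scale = attenuation * lambert;
    return Vec3{ lightCol.x * scale, lightCol.y * scale, lightCol.z * scale };
}
```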