As someone writing a graphics engine on the GPU, I came across a problem that it appears others [1], [2] have come across as well. The two cases are whether to:
1. put as much as possible into one shader and read a uniform condition, branching to decide which code path to run, or
2. use glUseProgram to switch shaders for each object without incurring the branching cost.
There was no conclusive answer, so I thought I should investigate a little myself.
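To make the two cases concrete, here is a minimal sketch of what each render loop might look like. It assumes a current OpenGL context, a function loader such as glad, and already-compiled programs and VAOs; the Object struct, the objectType uniform, and the handle names are hypothetical and used only for illustration.

/* Sketch of the two approaches. Assumes a current OpenGL context and a
 * loader such as glad; 'Object', 'objectType', and the handles are
 * hypothetical names used only for illustration. */
#include <glad/glad.h>

typedef struct {
    GLuint  vao;        /* vertex array object for this object's mesh     */
    GLsizei indexCount; /* number of indices to draw                      */
    GLint   type;       /* which code path / which shader variant to use  */
} Object;

/* Case 1: one "uber" shader, branch on a uniform set per object. */
void drawBranching(GLuint uberProgram, GLint objectTypeLoc,
                   const Object *objects, int n)
{
    glUseProgram(uberProgram);                       /* bound once per frame */
    for (int i = 0; i < n; ++i) {
        glUniform1i(objectTypeLoc, objects[i].type); /* cost 'u'             */
        glBindVertexArray(objects[i].vao);
        glDrawElements(GL_TRIANGLES, objects[i].indexCount,
                       GL_UNSIGNED_INT, 0);
    }
}

/* Case 2: one specialized shader per object type, switched per object. */
void drawSwitching(const GLuint *programs, const Object *objects, int n)
{
    for (int i = 0; i < n; ++i) {
        glUseProgram(programs[objects[i].type]);     /* cost 'u' + 'm'       */
        glBindVertexArray(objects[i].vao);
        glDrawElements(GL_TRIANGLES, objects[i].indexCount,
                       GL_UNSIGNED_INT, 0);
    }
}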
Both Nvidia GPUs [3] and AMD's GCN [1] use a SIMT architecture. Intel's GPUs use an SMT architecture, roughly akin to 2 threads/core. The threads in a warp (32 lanes out of 3072 Nvidia cores), a wavefront (64 lanes out of 4096 AMD cores), or an Execution Unit (one EU out of the 40 in an Intel Graphics part, i.e. 80 hardware threads in total) share the same program counter and execute instructions in lockstep [1]. In the best case, this means only one warp/wavefront, or one thread of an EU, gets blocked when a uniform condition changes, provided the other units have other work to do.
The core counts above are taken from each vendor's top-of-the-line GPU products.
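Using this article's simplified model, where 'w' is just the core count divided by the SIMT width, those figures work out as follows. (Real GPUs keep many more warps resident per SM to hide latency, so treat this purely as back-of-the-envelope arithmetic.)

/* Back-of-the-envelope estimate of 'w' from the core counts quoted above.
 * This is the article's simplified model, not a hardware query. */
#include <stdio.h>

int main(void)
{
    int nvidiaCores = 3072, warpWidth      = 32;  /* top-end Nvidia part */
    int amdCores    = 4096, wavefrontWidth = 64;  /* top-end GCN part    */
    int intelEUs    = 40;                         /* 2 threads per EU    */

    printf("Nvidia: w = %d warps\n",      nvidiaCores / warpWidth);      /* 96 */
    printf("AMD:    w = %d wavefronts\n", amdCores / wavefrontWidth);    /* 64 */
    printf("Intel:  w = %d EU threads\n", intelEUs * 2);                 /* 80 */
    return 0;
}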
Scenario A:
Let's set up a scenario and estimate the cost of one frame when branching inside a single vertex shader versus switching vertex shaders. There are 'n' objects, each with 'v' vertices. Setting a uniform from the CPU costs 'u' to deliver to GPU memory, stalling one warp/wavefront or half of an EU. Each branch added to the shader code adds a branch cost 'b'. Switching shaders also costs 'u', because the CPU still needs to signal the GPU the same way it does when setting a uniform, plus an additional memory access latency 'm' for the additional data the shader requires. 'm' should cost about the same as 'u': although a shader is larger than a uniform, the GPU's wide memory bus should be able to fetch everything with a single fetch-from-memory command. The MULT/ADD number-crunching time is 'c' if the GPU had only a single warp/wavefront/EU thread. Let 'w' be the total number of warps/wavefronts/EU threads.
To draw n objects uniquely, there are either 'n-1' branches (case 1) or 'n' different shaders (case 2).
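For illustration, here is what the case 1 "uber" vertex shader might look like with n = 3 object types, with the GLSL source held in a C string; the objectType uniform and the per-type transforms are hypothetical. Every vertex of every object walks the if/else chain of n-1 = 2 conditions, which is where the per-vertex branch cost in the formulas below comes from.

/* Hypothetical "uber" vertex shader for case 1, held in a C string.
 * With n = 3 object types there are n-1 = 2 branch conditions, and every
 * vertex of every object evaluates this chain against the uniform. */
#include <stdio.h>

static const char *uberVertexShaderSrc =
    "#version 330 core\n"
    "layout(location = 0) in vec3 position;\n"
    "uniform mat4 mvp;\n"
    "uniform int objectType;        // set from the CPU per object, cost 'u'\n"
    "void main() {\n"
    "    vec3 p = position;\n"
    "    if (objectType == 0) {          // branch 1, cost 'b' per vertex\n"
    "        p *= 2.0;                   // hypothetical path for type 0\n"
    "    } else if (objectType == 1) {   // branch 2, cost 'b' per vertex\n"
    "        p += vec3(0.0, 1.0, 0.0);   // hypothetical path for type 1\n"
    "    } else {\n"
    "        p.y = -p.y;                 // hypothetical path for type 2\n"
    "    }\n"
    "    gl_Position = mvp * vec4(p, 1.0);\n"
    "}\n";

int main(void)
{
    /* Just prints the source; compiling it for real requires a GL context
     * and glShaderSource/glCompileShader. */
    puts(uberVertexShaderSrc);
    return 0;
}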
Case 1: using branching
Best case time complexity <= max(n*c/(w-1) + n*v*(n-1)*b, n*u)
If the hardware/scheduler implementation is optimal, only one warp/wavefront/EU unit waits for the condition while the other w-1 units are at work, which gives the c/(w-1) term. This is plausible because OpenGL accumulates these calls before delivering them to the GPU, so the scheduler can arrange the work such that, more often than not, only one unit is waiting on the condition while all the other units are executing the previous draw call(s). n*c/(w-1) is the time it takes to do c amount of work spread across w-1 warp/wavefront/EU units over n objects. n*v*(n-1)*b accounts for the n-1 branch conditions that each unit must evaluate for every vertex, over n objects. If 'u' takes longer than the MULT/ADD work across the warp/wavefront/EU units, take the longer of the two. (Summary: the cores in the SIMD/SIMT unit are not fully utilized, since all cores in the waiting unit evaluate the one condition and then NO-OP, while the scheduler schedules the next work on the remaining cores.)
While a CPU uses branch prediction to make a single thread run as fast as possible, a GPU simply lets a thread wait for the condition to change while it does work on the other threads.
Worst case time complexity = max(n*2*c/(w-1) + n*v*(n-1)*b, n*u)
In the worst case, if the hardware implementation of a state update forces a partial flush, then updating a uniform requires all pending work to finish before continuing. The '2' accounts for the waiting warp/wavefront/EU unit taking another c/(w-1) of work delay.
Case 2: switching shaders
Time complexity = max(n*2*c/(w-1), n*(u+m))
If the hardware/scheduler implementation allows it, it is possible to have w-1 warps/wavefronts/EU threads work on the current shader while one unit waits to change to the new shader. But all pending work on the current shader needs to finish before the switch occurs, hence the '2': a warp/wavefront/EU unit that has not yet started its task takes another c/(w-1) of work delay while the other units drain.
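To compare the two expressions for a given workload, they can be transcribed directly into code. This is nothing more than the formulas above; the values in main() are arbitrary placeholders that only show how the functions are called, not measurements.

/* The cost formulas above, transcribed directly. All times are in the same
 * (arbitrary) unit; the values in main() are placeholders only. */
#include <stdio.h>

static double maxd(double a, double b) { return a > b ? a : b; }

/* Case 1, best case:  max(n*c/(w-1) + n*v*(n-1)*b, n*u) */
static double case1Best(double n, double v, double u, double b,
                        double c, double w)
{
    return maxd(n * c / (w - 1) + n * v * (n - 1) * b, n * u);
}

/* Case 1, worst case: max(n*2*c/(w-1) + n*v*(n-1)*b, n*u) */
static double case1Worst(double n, double v, double u, double b,
                         double c, double w)
{
    return maxd(n * 2 * c / (w - 1) + n * v * (n - 1) * b, n * u);
}

/* Case 2:             max(n*2*c/(w-1), n*(u+m)) */
static double case2(double n, double u, double m, double c, double w)
{
    return maxd(n * 2 * c / (w - 1), n * (u + m));
}

int main(void)
{
    /* Placeholder parameters; substitute measurements from your own workload. */
    double n = 100, v = 1000, u = 1100, b = 4, m = 6400, c = 1.0e6, w = 96;
    printf("case 1 best : %.0f\n", case1Best(n, v, u, b, c, w));
    printf("case 1 worst: %.0f\n", case1Worst(n, v, u, b, c, w));
    printf("case 2      : %.0f\n", case2(n, u, m, c, w));
    return 0;
}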
Some numbers to plug in
'u' costs about 100 ns (assuming a 1 GHz clock and an L2 cache hit [4]) to retrieve the smallest data type from GPU memory, plus <1000 ns to send it from CPU to GPU over PCIe.
'm' should cost 400-800 ns [2] * (size of the currently needed data (i.e. textures, vertex data) in bits / 384 bits), assuming the GPU has a 384-bit wide memory bus.
'b' should be on the order of a few cycles (including fetch, decode, and execute).
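Plugging these figures into the memory side of the formulas, n*u for case 1 versus n*(u+m) for case 2: the draw-call count and the per-switch data size below are hypothetical, chosen only to show the arithmetic.

/* Rough arithmetic for the memory-side terms using the latencies quoted
 * above. The draw-call count and per-switch data size are hypothetical. */
#include <stdio.h>

int main(void)
{
    double u_ns = 100.0 + 1000.0;                   /* L2 fetch + CPU->GPU signal  */
    double switchDataBits = 512.0 * 8.0;            /* assume ~512 bytes per switch */
    double m_ns = 600.0 * (switchDataBits / 384.0); /* midpoint of 400-800 ns       */
    double n = 1000.0;                              /* hypothetical draw calls/frame */

    printf("case 1 memory side, n*u:     %.1f us\n", n * u_ns / 1000.0);
    printf("case 2 memory side, n*(u+m): %.1f us\n", n * (u_ns + m_ns) / 1000.0);
    return 0;
}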
Conclusions
For compute-heavy work with a large 'c', when the hardware/scheduler implementation is optimal, it is better to use case 1 and waste less compute time. For a program that spends more time on memory traffic than on the n*c/(w-1) + n*v*(n-1)*b computation, case 1 is also better, since it pays n*u rather than n*(u+m).
For a program with lots of objects, little shader computation, and very little GPU memory use, case 2 is better, since the n*v*(n-1)*b branching term grows with the object count while 'm' stays small.
In general, for today's modern game engines, which use lots of memory bandwidth and lots of shader effects and use batching to limit 'n' (the number of shaders or branches), case 1 has better performance.
References
[1] A Case for Flexible Scalar Unit in SIMT Architecture, 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014
[2] https://fgiesen.wordpress.com/2011/07/02/a-trip-through-the-graphics-pipeline-2011-part-2/
[3]http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
[4] https://devtalk.nvidia.com/default/topic/496975/cuda-programming-and-performance/fermi-l2-cache-how-fast-is-the-l2-cache-how-do-i-access-it-/#entry1289120