Post date: Aug 26, 2013 5:52:36 PM
I spent much time this summer working on https://bugs.launchpad.net/ubuntu/+source/nux/+bug/1167018 and learned a great deal. I'll use this page to document much of what I learned. The problem is that when Unity is set to use a blur rather than a plain background, an ugly color pattern appears instead on the Radeon X1250 GPU. The answer I found was to improve the blur's performance on the GPU; I gather that the existing blur is timing out and producing various (ugly) results.
Nux's gaussian blur is used by Unity to produce a dialog box without completely covering the underlying information. The Dash and HUD as well as the shutdown and logout dialogs use the blur. The blur can also be seen under the indicators at the very top of the screen, depending on how transparency is set for that panel.
There's good reason blurs can be performance problems. The blurred image is a set of pixels, each affected by a number of pixels all around it. For the standard current Nux gaussian blur a pixel is influenced by the other pixels within a 15 pixel radius. Computing perhaps a million pixels by reading 31x31 pixels each time would be slow, so several tricks are used to reduce this. For a true gaussian blur it is only necessary to blur using the pixels in a horizontal line and then, in another pass, the pixels in a vertical line. Since each pixel is affected by both passes it all adds up correctly, thanks to the special (separable) properties of the gaussian blur. In a blur a particular pixel is affected most by the color that was there in the first place, then a little less by the color of the next pixel over, and so on out to the least influential pixel furthest away. These influences, or weights, are calculated from the sigma of the blur, which also determines roughly how many surrounding pixels need to be considered. The 15 pixel radius above corresponds to a sigma of 5. A sigma of 3 is used on less powerful GPUs like the X1250 and has a radius of 9 rather than 15. For a gaussian blur the sigma therefore specifies the number of pixels sampled and thus the strength of the blur. Sigmas are related to statistical distributions, discrete distributions, and binomial coefficients. The underlying image to be blurred is copied to memory termed a texture and the result is written to a different texture... and back and forth, eventually composited and sent to the computer's screen.
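As a rough illustration (not Nux's actual code), here is a minimal sketch of how the one-dimensional weights can be derived from a sigma; the 3-sigma radius and the normalization step are my own choices based on the description above.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch: build one-dimensional gaussian weights for a given sigma.
// A radius of 3*sigma (sigma 5 -> radius 15, sigma 3 -> radius 9) matches
// the radii mentioned above; the weights are normalized so they sum to 1
// and the blur neither brightens nor darkens the image.
std::vector<float> GaussianWeights(float sigma)
{
  int radius = static_cast<int>(std::ceil(3.0f * sigma));
  std::vector<float> weights(radius + 1);  // weights[0] is the centre pixel
  float sum = 0.0f;
  for (int i = 0; i <= radius; ++i)
  {
    weights[i] = std::exp(-(i * i) / (2.0f * sigma * sigma));
    // Off-centre weights are applied twice, once at +i and once at -i.
    sum += (i == 0) ? weights[i] : 2.0f * weights[i];
  }
  for (float& w : weights)
    w /= sum;
  return weights;
}

int main()
{
  for (float w : GaussianWeights(3.0f))
    std::printf("%f\n", w);
  return 0;
}
```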
If you want to speed up a blur you want to reduce the number of reads you have to do. The horizontal-then-vertical trick does wonders for this--for the standard blur, 31x31 reads (samples) per pixel is replaced by 2x31. You also want to reduce the time the hardware needs to read the memory. More on that later.
Another trick to reduce the number of memory accesses and speed things up is to use the Linear Sampling available in GPU hardware to sample (read) half as many pixels. By reading between the pixel 1 away from the target pixel and the pixel 2 away (the exact position is computed from the influence, or weight, of each), then moving on to a point between pixels 3 and 4, 5 and 6, 7 and 8, and so on, you can get the influence of all of the pixels you want without reading each and every one. The GPU returns a value between the two pixels that reflects each pixel's color value. (This hardware feature is how we got rid of the "jaggies" in text, i.e. anti-aliasing. Modern text isn't just white or black, it's also gray because of it.)
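To see why that works, here is a tiny self-contained check (not Nux code, just arbitrary numbers): two weighted texel reads and one linearly filtered read scaled by the combined weight give the same result.

```cpp
#include <cstdio>

// Sketch of why one linear sample can replace two texel reads.
// t1 and t2 are the colours of the texels at offsets 1 and 2 (one channel
// only, for brevity); w1 and w2 are their gaussian weights. All values here
// are made up for the demonstration.
int main()
{
  float t1 = 0.8f, t2 = 0.2f;  // hypothetical texel values
  float w1 = 0.3f, w2 = 0.2f;  // hypothetical gaussian weights

  // Two separate reads, weighted individually.
  float two_reads = t1 * w1 + t2 * w2;

  // One read at a point between the texels: the hardware's linear filter
  // blends t1 and t2 by the fractional position, so sampling at
  // offset 1 + w2/(w1+w2) and scaling by (w1 + w2) gives the same result.
  float frac = w2 / (w1 + w2);
  float linear_fetch = t1 * (1.0f - frac) + t2 * frac;  // what the GPU returns
  float one_read = linear_fetch * (w1 + w2);

  std::printf("%f %f\n", two_reads, one_read);  // both print 0.280000
  return 0;
}
```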
Shader programs are tough to write because you want them to run on just about any hardware. For older and less expensive hardware, shaders have to be kept pretty restricted in several ways.
Last winter a revision was made to Unity to start to use the above Linear Sampling trick, but it broke the blur for low-end hardware, specifically the ATI X1250 line of integrated GPU's. That's what is on my laptop--when I bought it I chose ATI because they had started sharing information with the Linux kernel developers. My research was to determine why the blur broke and what to do about it.
It turns out that the blur broke because it was simply too complicated for the low-end hardware to perform--it probably took too long to be tolerable. The new blur used the hardware's ability to read (or sample) between texture pixels, but it sampled between pixels 1 and 2, between 2 and 3, between 3 and 4, etc. Thus it didn't really save any memory reads.
The new blur wasn't gaussian, because it used the weights of the pixels at offsets 2...n-1 twice each. It wasn't actually an example of linear sampling either, because it didn't reduce the number of samples for a given strength. What it was was strong--stronger than a standard gaussian blur, so it could be run at a lower sigma and a lower number of pixels sampled. Of course that was the very result that was desired from the change.
The X1250 attempted to run at a sigma of 3 (specified by Unity--see unity-shared/BackgroundEffectHelper.cpp around line 165). The patch I first proposed produced a true Gaussian blur with Linear Sampling, with every texel within the limits computed in Unity. It would run, but the blur didn't match the strength of the current one, and the patch was therefore unacceptable. Multiplying the sigma by 1.5 would approximately match the blur, but the X1250 wouldn't run that either with the proposed true LS Gaussian blur.
After several weeks of experimentation it seemed to me that the best way to proceed was to improve the shader and its setup so that the current blur will run successfully on lower-capability hardware such as this.
The current horizontal and vertical LS Gaussian blurs in NuxGraphics/RenderingPipeGLSL.cpp pass the width or height to the shader, as well as offsets and weights. The shader then constructs a texture coordinate for each sample.
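Schematically, and with made-up names rather than the ones actually used in RenderingPipeGLSL.cpp, the per-fragment work looks roughly like this:

```cpp
#include <string>

// Rough schematic of the current per-fragment work, written as a GLSL
// string the way Nux builds its shaders. The names (TexCoord, TexObject,
// TexSize, Offsets, Weights, NUM_SAMPLES) are placeholders, not the real ones.
const std::string kCurrentGaussSchematic = R"(
  vec4 sum = vec4(0.0);
  for (int i = 0; i < NUM_SAMPLES; i++)          // unrolled by the compiler
  {
    // Build the coordinate for this sample; for the horizontal pass the
    // offset is divided by the texture width, for the vertical pass by
    // its height. Offsets[] holds 0, 1, -1, 2, -2, ...
    vec2 coord = TexCoord.st + vec2(Offsets[i] / TexSize.x, 0.0);
    sum += texture2D(TexObject, coord) * Weights[i];
  }
  gl_FragColor = sum;
)";
```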
At the end it writes the accumulated result to the shader's output variable.
It does this in a pseudo do loop (unrolled by the GLSL compiler). Because the hardware provides only primitive facilities, the number of iterations must be a constant. There is no abs() or sign() function at this level of shader, so the iterations visit the offsets in this order: 0, 1, -1, 2, -2, ..., n, -n. That's a very poor order if you want to find memory values in a cache as often as possible.
My proposal is to change the distribution of work--doing as little processing as possible in the shader for each pixel, and doing that processing in an order that improves the locality of reference of the points being sampled. It seems to me that both aspects are very important.
At the top of NuxGraphics/RenderingPipeGLSL.cpp's GraphicsEngine::QRP_GLSL_HorizontalLSGauss and GraphicsEngine::QRP_GLSL_VerticalLSGauss, after the number of samples needed is determined (importantly, including sample zero, a problem with the old code), an array of [n*2-1][3] is defined. The rows correspond to the offsets and weights -n...0...n. The first two columns are the normalized delta x and delta y values needed to access the sample, and the third column is the weight. Either the delta x or the delta y is 0; the other is the pixel offset divided by the width or height of the whole texture (that's what is meant by normalized--it runs from 0..1).
Thus as much information is precomputed as possible.
It's tempting to keep the weights in a separate vector, but indexing into this vector causes a "too many constants" error during shader compilation. I assume this is a bug in mesa gallium for Radeon.
In the shader the array is read in order (-n...0...n), the vec2 (delta x, delta y) is added to the target pixel's coordinate, the resulting texel is sampled, and the result is accumulated. In my implementation I zeroed gl_FragColor (the result of the shader) and incremented it with each sample. It's possible that some hardware won't allow gl_FragColor to be read or incremented, so this could be a problem--but I've seen other shaders do this. The texsize and weights vectors are no longer used and are removed from the shader and from its setup in the above functions.
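A sketch of that loop, again with placeholder names rather than the exact code in my implementation:

```cpp
#include <string>

// Sketch of the reworked per-fragment loop (placeholder names). Each row of
// Kernel[] holds the precomputed normalized delta x, delta y and weight, in
// order from -n to +n, so the shader only adds, samples and accumulates.
const std::string kReworkedGaussSchematic = R"(
  gl_FragColor = vec4(0.0);
  for (int i = 0; i < NUM_SAMPLES; i++)                           // unrolled by the compiler
  {
    vec2 coord = TexCoord.st + Kernel[i].xy;                      // precomputed offset
    gl_FragColor += texture2D(TexObject, coord) * Kernel[i].z;    // precomputed weight
  }
)";
```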
In the last tested version of my shader I changed the shader setup in the above two functions to dynamically construct a vector of structures (Nux uses the latest version of C++). The structure holds the two relative normalized coordinates and the weight. The rows are constructed and appended from the [-n] offset and weight up to 0, and then from [0] to [n], skipping [0] the second time if its offset is 0. The address of the very first row's x value is passed to the shader, as is the number of rows in the vector. The previous code, the one in my merge request, used an ugly mapping of a two-dimensional array onto one-dimensional memory. Using a vector to represent the shader's work (the kernel) is much clearer to follow.
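In outline, and again with hypothetical names and a guessed signature rather than the actual Nux code, the setup side looks something like this for the horizontal pass:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the setup side (hypothetical names). Each row holds the
// normalized x and y deltas plus the weight for one sample; for the
// horizontal pass dy is always 0, and for the vertical pass dx is always 0.
struct KernelRow
{
  float dx;      // offset / texture width  (0 in the vertical pass)
  float dy;      // offset / texture height (0 in the horizontal pass)
  float weight;
};

std::vector<KernelRow> BuildHorizontalKernel(const std::vector<float>& offsets,
                                             const std::vector<float>& weights,
                                             float texture_width)
{
  std::vector<KernelRow> kernel;

  // Append rows from the -n offset up to 0...
  for (int i = static_cast<int>(offsets.size()) - 1; i >= 0; --i)
    kernel.push_back({-offsets[i] / texture_width, 0.0f, weights[i]});

  // ...and then from 0 up to +n, skipping the duplicate centre row.
  for (std::size_t i = 0; i < offsets.size(); ++i)
  {
    if (offsets[i] == 0.0f)
      continue;  // the centre sample was already appended above
    kernel.push_back({offsets[i] / texture_width, 0.0f, weights[i]});
  }

  // The address of kernel[0].dx and kernel.size() are what then get handed
  // to the shader.
  return kernel;
}
```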
With these changes the current blur can be run at sigma 3, as desired for this version of OpenGL. Each texel is sampled and its value accumulated in just 3 Radeon hardware instructions.
The hardware in question has limitations on how many stages of processing it can do (samples can be done in 3 stages). The mesa gallium software limits the number of samples that can be collected per stage--by default 8, so a total of 24 samples can be taken, including the (0,0) point. By changing the environment variable RADEON_TEX_GROUP this can be increased significantly. There are other variables that provide shader program input and output and shader program statistics, so one can see where one stands compared to the hardware limits, e.g. MESA_GLSL="dump" RADEON_DEBUG='info,pstat,fp'. The references below contain some information on shader limits for different levels of hardware and shader standards.
Note that you can actually use the same shader for both the vertical and horizontal blur passes; the zeros in the offsets just trade places in the shader setup. Note also that other blurs can be implemented by the same shader--just pass their integer offsets and their weights to the shader in the input array.
If mesa can see as it generates code that a limit is exceeded it will substitute a dummy shader that will return all black. So there's no blur, but at least you get a nice black background rather than ugliness.
Looking at this I found a problem--the maximum index of a shader program is set in NuxGraphics/GraphicsEngine.h as 11, for a shader whose samples run from -11, -10, ..., 0, 1, ..., 10, 11. That's enough for a sigma of 3.66, but a sigma of 5 is now used for higher-level GPUs. I tried changing this maximum to 16 (3x5 plus the zero sample) and got a segfault from some code somewhere unrelated to blurring. One simple partial solution is to index by sigma rather than by number of samples: the shader creator GraphicsEngine::InitSLHorizontalLSGaussFilter (and the vertical variant) is passed both the number of passes and an index, so make the integer index correspond to the sigma being used, and use that index to both store and retrieve the compiled shader for the number of passes given in the other parameter. Better would be to address the reason the maximum can't be changed. Someone probably knows why, but I don't.
Note that it is important to go through samples in order rather than skipping around as the current shader does. This maximizes the number of samples coming from cache rather than slowly coming all of the way from memory.
Since this bug is fixed by treating it as a performance problem, the fix has the side effect of reducing the time the GPU needs to get through the blur. The dynamic blur setting in Unity allows changes to the desktop to be reflected in the produced blur; a fix like this would probably improve the frame rate of that effect.
To make the current blur save texture samples, the following can be done to NuxGraphics/NuxGraphicsEngine.cpp's GraphicsEngine::LinearSampleGaussianWeights (a sketch follows the list):
Go through the loop incrementing each time by 2
Instead of referring to weights[i] refer to w2 + w1 (two places)
Make the final adjustment loop run through weights.size() times
Return weights.size() rather than support (see above).
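Here is a sketch of those four changes; the signature and variable names below are guesses rather than the actual Nux code, and only the pairing and final normalization steps are shown.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the pairing step described in the list above (guessed names and
// signature, not the actual Nux function). 'weights' and 'offsets' start out
// as the discrete gaussian taps for offsets 0, 1, 2, ..., support-1; the loop
// below collapses each pair (i, i+1) into one linearly-sampled tap. A leftover
// unpaired tap (when support is even) is ignored here for brevity.
int LinearSampleGaussianWeights(std::vector<float>& weights,
                                std::vector<float>& offsets,
                                int support)
{
  std::vector<float> new_weights;
  std::vector<float> new_offsets;
  new_weights.push_back(weights[0]);  // keep the centre tap as-is
  new_offsets.push_back(0.0f);

  for (int i = 1; i + 1 < support; i += 2)  // go through the loop by 2
  {
    float w1 = weights[i];
    float w2 = weights[i + 1];
    new_weights.push_back(w1 + w2);         // refer to w1 + w2, not weights[i]
    new_offsets.push_back((offsets[i] * w1 + offsets[i + 1] * w2) / (w1 + w2));
  }

  weights = new_weights;
  offsets = new_offsets;

  // Final adjustment loop runs weights.size() times, counting the off-centre
  // weights twice since they are applied at both + and - offsets.
  float sum = weights[0];
  for (std::size_t i = 1; i < weights.size(); ++i)
    sum += 2.0f * weights[i];
  for (std::size_t i = 0; i < weights.size(); ++i)
    weights[i] /= sum;

  return static_cast<int>(weights.size());  // rather than 'support'
}
```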
Adjusting the blur
To adjust the standard blur to produce one similar to the current blur, multiply the sigma by 1.5. It should come pretty close.
With both changes in effect, at sigma 5 a total of 50 samples for the horizontal and vertical passes replaces 62. This may not be worth the effort.
With the blur algorithm adjusted and RADEON_TEX_GROUP increased significantly you can achieve a sigma of 7.5 on the X1250 hardware.
It may be a good idea to choose sigmas such as 2.999, 3.3333, and 3.66666 rather than 3.0, 3.3334, and 3.6667. The idea is to use the highest sigma possible before generating another 4 texel samples (horizontal and vertical, negative and positive).
There's unused code in Nux to allow multiple passes. I've read that an additional pass of a gaussian blur produces a new blur about equal to the old blur times about 1.4 (the square root of two)--the variances of the two passes add, so two identical passes behave like a single blur at sigma times sqrt(2). It may be more effective to clear the initial gl_FragColor by multiplying by a vec3 parameter of (0.0, 0.0, 0.0); by later setting it to (1.0, 1.0, 1.0) one may be able to pass additional sample parameters. By doing the horizontal in several stages, followed by the vertical in stages, it may be possible to directly produce a stronger sigma blur using fewer samples than by merely repeating a smaller sigma blur. The size of each stage in samples would be determined by the capabilities of the GPU.
I doubt that R300 GPUs could produce a blur with only the above changes--I'd expect all black--as I've read that an R300 can only do 32 instructions in a shader program. (The X1250 is an R400 type.)
https://bugs.launchpad.net/ubuntu/+source/nux/+bug/1167018
http://en.wikipedia.org/wiki/High-level_shader_language
http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/
http://www.gamasutra.com/view/feature/131511/four_tricks_for_fast_blurring_in_.php
http://www.librow.com/articles/article-9
http://xissburg.com/faster-gaussian-blur-in-glsl/
http://www.gamerendering.com/2008/10/11/gaussian-blur-filter-shader/
http://xorg.freedesktop.org/wiki/RadeonFeature/
http://www.opengl.org/registry/specs/ARB/fragment_program.txt
http://www.mesa3d.org/faq.html
http://dri.freedesktop.org/wiki/TestingAndDebugging/
http://www.opengl.org/wiki/Sampler_%28GLSL%29#Offset_texture_access
http://msdn.microsoft.com/en-us/library/windows/desktop/bb219846(v=vs.85).aspx
http://en.wikipedia.org/wiki/OpenGL_Shading_Language
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter25.html