Experiments in Luma-Optimized and Mipmapped DXT1 Compression


Over the past few years, I've had somewhat of an obsession with the DXTc texture compression formats (for example, see my crunch project here), and other related things like clusterization techniques and VQ (vector quantization). DXTc's elegant simplicity can trick the casual observer into thinking that the encoding format is easily mastered, and that whatever power it possesses has surely already been exposed by existing encoders years ago. (Most people I've met who express this opinion have not actually attempted to write and release a competitive DXTc encoder of their own. But I digress.)

DXTc's most useful and deepest format is probably DXT1, which is a 4 bits per pixel (bpp) format suitable for encoding 24bpp RGB, or 24bpp RGB with 1-bit alpha (but transparent pixels are also always black, so in practice it doesn't seem to be used a whole lot). Note that 4bpp is stupendously massive for an image compression format. Even the old JPEG image compression format does pretty well down to approximately 1bpp, and newer codecs can reach down to ~.5bpp. But still, of the full-color texture formats commonly available on desktop/console GPU's, it's the smallest widely supported texture format we have to work with. (Unfortunately, the new D3D11 formats aren't very interesting in practice, because their memory footprint is twice DXT1's at 8bpp, they aren't supported by all desktop GPU's or consoles yet, are very complex, and there's not much information about or tools available to work with these formats in the open source world.)

This page presents a few fun DXT1/DXT5A tricks that I haven't seen covered elsewhere. They won't light the world on fire, may not be new, and are probably only interesting to 5 people out there. But it's fun to explore the "side alleys" of DXTc compression. I think this stuff is valuable because most of the texels read from memory in modern 3D desktop/console games come from some sort of DXTc compressed images. This work was originally inspired by the High Quality Texture Compression Demo by Humus, and this cbloom blog post.

A Win32/x64 D3D9 viewer app and experimental command line tool can be downloaded here which demonstrates the described approaches (see the very bottom for details).

[5/3/12] - Fixed the .PNG file for "Luma-Optimized DXT1 Compressed using crnlib v1.03 with the cCRNCompFlagGrayscaleSampling option" - the posted image had some weird purple artifacts, no idea why yet. (Getting all these ever so slightly different looking PNG images onto this page over a ~1mb net connection, using Google Sites as an editor, has been just a little frustrating. I've just double checked most of the images on this page.)

Luma-Optimized DXT1 Compression

Here's the first idea: What if your DXT1 encoder knew that the input image was grayscale (only contains luma), and (most importantly) it knew that the shader would immediately convert the fetched+filtered RGB results back to luma using a cheap ALU operation in the shader? Is it possible to achieve higher quality (measured in Y' or luma PSNR, or MSE, etc.) this way? (Note, I'm using Rec. 601 Luma coefficients exclusively here.)

It turns out that the answer is yes. The quality difference is not massive (around .5-1 dB Luma PSNR, sometimes higher in the experiments I've tried), but this method is so trivial to implement in DXT1 encoders and shaders that it's probably worth the trouble. I've experimented with several ways of going about this type of encoding. crnlib's initial release actually supports this type of compression out of the box (using the cCRNCompFlagGrayscaleSampling compression parameter).

This type of encoding is useful for things like: Heightfield compression, grayscale-only texture masks, or to hold the Luma of Luma/Chroma encoded mipmapped textures (more on this later).

Here's an example grayscale image (1024x1024) I got somewhere containing various test patterns and Lena at the center. (Note: There are some issues with using this particular image to demonstrate this approach. I think the large, easily compressed solid color blocks tend to skew the statistics, either lessening or amplifying how much quality gain is possible on real textures.)

Above: Original Image (testpat.1k.png)

And here it is converted to vanilla DXT1 using crnlib v1.03. Here are the stats:

RGB  MSE: 3.697, PSNR: 42.452
Luma MSE: 1.050, PSNR: 47.919

By comparison, in case you don't trust crnlib's DXT1 compressor, the results with ATI Compressonator v1.50 x64 are worse by almost 1 dB (Luma):

RGB  MSE: 5.325, PSNR: 40.867 
Luma MSE: 1.321, PSNR: 46.922

Here's crnlib's DXT1 output:

DXT1 Compressed Image

Above: DXT1 Compressed using crnlib v1.03 (Luma MSE: 1.050, PSNR: 47.919)

Here's the 8x RGB delta image between the original and crnlib's DXT1 version:

Above: 8x RGB Delta Between Original and crnlib DXT1 Compressed

Notice how the 8X delta image is not completely grayscale -- it contains pixels with a slight saturation due to the inherent nature of DXT1 compression (such as 565 endpoint quantization, 1/3 and 2/3 interpolation in 888 space, and the optimizer's freedom to choose whatever endpoint and interpolated colors it wants in order to reduce MSE).
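To get a feel for where that slight saturation comes from, consider what the 565 round trip alone does to a pure gray pixel: the R, G and B channels are quantized to different bit depths, so they come back slightly unequal. Here's a throwaway C++ snippet (mine, not from any of the tools mentioned here) showing mid-gray 128 returning as (132, 130, 132), a faintly magenta color:

#include <cstdio>
#include <cstdint>

static uint8_t expand5(uint8_t v) { return (uint8_t)((v << 3) | (v >> 2)); } // 5-bit -> 8-bit
static uint8_t expand6(uint8_t v) { return (uint8_t)((v << 2) | (v >> 4)); } // 6-bit -> 8-bit

int main()
{
   // The nearest 5-bit and 6-bit codes for 128 are 16 and 32; expanding them back gives:
   printf("%u %u %u\n", (unsigned)expand5(16), (unsigned)expand6(32), (unsigned)expand5(16)); // prints "132 130 132"
   return 0;
}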

Anyway, the above results are from plain DXT1 compression. Now let's try introducing Luma conversion in the pixel shader.

Initial Attempt

Here's a simple HLSL function that converts from RGB to Y:

const float F_YR = 19595/65536.0f, F_YG = 38470/65536.0f, F_YB = 7471/65536.0f;

float RGB_to_Y(float3 rgb)
{
   return rgb.r * F_YR + rgb.g * F_YG + rgb.b * F_YB;
}

The weird RGB->Y constants are fixed point versions of the Rec. 601 coefficients (e.g. 19595/65536 ≈ 0.299, the red weight). They are exactly the same constants crnlib uses, which is important - otherwise the shader and crnlib would compute slightly different results, which can be disastrous for quality in some situations.

Next, let's modify an existing DXT1 compressor to optimize the results assuming the shader converts to Luma. The easiest way for me to do this is to modify crnlib's DXT1 endpoint optimizer's color distance metric function to convert the input colors to luma before computing distance. (This mode actually shipped in crnlib v1.00 - see the cCRNCompFlagGrayscaleSampling compression option.) The result is an output that has slight RGB colorization, like this:

Above: Luma-Optimized DXT1 Compressed using crnlib v1.03 with the cCRNCompFlagGrayscaleSampling option

Note: The cCRNCompFlagUseTransparentIndicesForBlack flag was also enabled when compressing the above DXT1 image, which allows the compressor to try transparent (3-color + transparent black) blocks in order to gain access to black (independent of the other 3 colors used in the block). The examples in the upcoming section (brute force evaluation of all endpoints) don't use this optimization.

Next, here's the above image (DXT1 compressed with cCRNCompFlagGrayscaleSampling) converted to Rec. 601 Luma using the above RGB->Y HLSL function:

Above: Compressed using experimental luma-optimized DXT1 compression, then converted to Rec. 601 Luma (Luma MSE: 0.926, PSNR: 48.464)

And here's the 8x RGB delta between the original and the above image (i.e. original vs. DXT1 compressed using cCRNCompFlagGrayscaleSampling, then converted to luma in the shader):

Above: 8x RGB Delta Between original vs. Luma-optimized DXT1 compressor's output converted to Rec. 601 luma

Notice that the 8x delta image looks noticeably better than the original DXT1 delta image above. Here are the new stats:

RGB  MSE: 2.779, PSNR: 43.693
Luma MSE: 0.926, PSNR: 48.464

So plain crnlib DXT1 got a Luma PSNR of 47.919 dB (ATI's was 46.922), while this initial attempt at luma-optimized DXT1 compression gets 48.464 dB (higher is better with this metric), for a .545 dB improvement. Not great, but not bad for a few lines of code either. 

Note that this method is available and can be tested by anyone right now in crunch.
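For reference, the heart of those "few lines of code" is just the per-pixel error function used by the endpoint optimizer: compare candidate colors to source pixels in luma space instead of RGB space. Here's a minimal C++ sketch of the idea (this is not crnlib's actual code - the names and integer math are mine, but the fixed point weights are the same Rec. 601 ones used by the shader):

#include <cstdint>

// Rec. 601 luma using the same 16-bit fixed point weights as the shader.
static inline uint32_t rgb_to_y(uint32_t r, uint32_t g, uint32_t b)
{
   return (r * 19595U + g * 38470U + b * 7471U) >> 16;
}

// Luma-space "distance" between a source pixel and a candidate DXT1 color.
// The endpoint optimizer sums this over the 16 pixels of a block instead of an RGB distance.
static inline uint32_t luma_color_distance(uint32_t r0, uint32_t g0, uint32_t b0,
                                           uint32_t r1, uint32_t g1, uint32_t b1)
{
   int32_t d = (int32_t)rgb_to_y(r0, g0, b0) - (int32_t)rgb_to_y(r1, g1, b1);
   return (uint32_t)(d * d);
}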

Second Attempt (Or Performance be Damned, Just Try all Possibilities)

crnlib's DXT1 endpoint optimizer doesn't "stray" extremely far from the principal axis of the input vectors (colors). It does vary the trial endpoints on the 565 lattice more aggressively than most compressors (such as ATI_Compressor or squish). However, it has no inherent knowledge of the RGB->Y conversion going on in the shader, so it's bound to miss some useful endpoint pair possibilities that could result in lower Luma error. Basically, crnlib's DXT1 compressor was designed to optimize for minimum RGB error (either uniform or perceptual), not Luma error alone, so it'll inevitably waste time on useless endpoint pair trials.

One possibility is to compress to luma-optimized DXT1 by trying all 2^32 endpoint pairs on every 4x4 image block, then comparing the results to see if the Luma error was reduced vs. crnlib. But this would take forever, even using a GPU to accelerate the tests. It seems like there should be many RGB endpoint pairs (and their interpolates) that actually result in exactly the same Luma values, so most of the effort would be wasted anyway.

To verify this assertion, I hacked together a small Win32 console app that computes the set of "interesting" luma endpoint pairs. The idea is: there are bound to be RGB 565 endpoints whose 888 interpolates, when converted to Luma, amount to the same set of Luma values as another, completely different set of RGB endpoints. To find the unique "Luma endpoints", the tool iterates through each possible low/high 565 endpoint pair (all 2^32 of them) and expands the endpoints to either three or four 888 colors. (DXT1 transparent blocks have 3 colors + transparent black, and DXT1 opaque blocks have 4 colors.) It converts these colors to Luma, then checks the resulting Luma values against a hash table to detect and discard collisions. (It checks not only for direct collisions, but also for "reversed" luma value collisions.)

You can see the computed list of 195,841 32-bit DXT1 "luma endpoint" pairs here. So only about .005% of the possible endpoint pairs (expressible in ~17.6 bits) seem "interesting" for luma-optimized DXT1 compression.

Note that some endpoint pairs are much more desirable than others, because how GPU's interpolate the endpoints is not well defined. So when an RGB endpoint pair is discovered that results in a luma collision, the tool chooses the endpoint pair that covers the least distance in RGB space. If the distances are equal, it chooses the pair that is the "purest" in the grayscale sense.
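Here's a rough C++ sketch of the enumeration, with a couple of stated simplifications: it only handles the opaque 4-color mode with the "ideal" 1/3 and 2/3 interpolation, and it keeps the first endpoint pair seen for each luma tuple instead of applying the distance/purity tie-breaks described above. The structure and names are mine, not the actual tool's:

#include <cstdint>
#include <unordered_set>
#include <vector>

static uint32_t rgb_to_y(uint32_t r, uint32_t g, uint32_t b)
{
   return (r * 19595U + g * 38470U + b * 7471U) >> 16;
}

static void expand_565(uint32_t c, uint32_t& r, uint32_t& g, uint32_t& b)
{
   r = (c >> 11) & 31; r = (r << 3) | (r >> 2);
   g = (c >> 5) & 63;  g = (g << 2) | (g >> 4);
   b = c & 31;         b = (b << 3) | (b >> 2);
}

int main()
{
   std::unordered_set<uint32_t> seen;       // packed 4x8-bit luma tuples already covered
   std::vector<uint32_t> luma_endpoints;    // one representative 32-bit endpoint word per tuple

   // Warning: 2^32 iterations - this takes a while.
   for (uint32_t lo = 0; lo < 65536; lo++)
      for (uint32_t hi = 0; hi < 65536; hi++)
      {
         uint32_t r0, g0, b0, r1, g1, b1;
         expand_565(lo, r0, g0, b0);
         expand_565(hi, r1, g1, b1);

         // The four luma values this pair can produce (endpoints plus 2/3 and 1/3 interpolants).
         uint32_t y0 = rgb_to_y(r0, g0, b0);
         uint32_t y1 = rgb_to_y(r1, g1, b1);
         uint32_t y2 = rgb_to_y((r0 * 2 + r1) / 3, (g0 * 2 + g1) / 3, (b0 * 2 + b1) / 3);
         uint32_t y3 = rgb_to_y((r0 + r1 * 2) / 3, (g0 + g1 * 2) / 3, (b0 + b1 * 2) / 3);

         uint32_t key = (y0 << 24) | (y2 << 16) | (y3 << 8) | y1;
         uint32_t rev = (y1 << 24) | (y3 << 16) | (y2 << 8) | y0;  // swapped-endpoint ("reversed") collision

         if (!seen.count(key) && !seen.count(rev))
         {
            seen.insert(key);
            luma_endpoints.push_back((hi << 16) | lo);
         }
      }

   // luma_endpoints now holds one endpoint pair per unique set of reachable luma values.
   return 0;
}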

Here's the test image compressed to luma-optimized DXT1 using these endpoint pairs, using a simple brute force compressor that tries all luma-endpoints for each 4x4 block:

Above: Luma-Optimized DXT1 Compression using brute force endpoint evaluation

And here's the above image converted to Rec. 601 luma:

Above: Luma-Optimized DXT1 Compression using brute force endpoint evaluation, converted to Rec. 601 Luma (Luma MSE: 0.885, PSNR: 48.663)

And here's the 8x RGB delta between the original and the above image (i.e. original vs. luma-optimized DXT1 compressed using brute force endpoint evaluation):

Above: 8x RGB Delta Between original vs. Luma-optimized DXT1 compression using brute force endpoint evaluation converted to Rec. 601 Luma

And now the new stats:

RGB  MSE: 2.654, PSNR: 43.891
Luma MSE: 0.885, PSNR: 48.663

So 48.663 dB using brute force optimization, vs. 48.464 dB (an improvement of .2 dB) using a DXT1 compressor modified to use a Luma distance metric. Not impressive, especially compared to the leap going from plain DXT1 to luma-optimized DXT1 (which was .545 dB), but it was a gain. (And to make these results seem better, this is a whole 1.741 dB better compared to vanilla ATI DXT1.)
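For reference, the brute force compressor itself is conceptually simple. Here's a sketch of the per-block search (my structure, not the actual tool's; it reuses rgb_to_y() and expand_565() from the enumeration sketch above, assumes the opaque 4-color mode, and leaves out packing the selectors into an actual DXT1 block):

#include <cstdint>
#include <cstring>
#include <vector>

struct BlockResult { uint32_t endpoint_word; uint8_t selectors[16]; uint64_t luma_error; };

BlockResult brute_force_luma_block(const uint8_t block_luma[16], const std::vector<uint32_t>& luma_endpoints)
{
   BlockResult best;
   best.endpoint_word = 0;
   best.luma_error = UINT64_MAX;
   std::memset(best.selectors, 0, sizeof(best.selectors));

   for (uint32_t word : luma_endpoints)
   {
      // Decode the candidate pair's four luma values (same math as the enumeration sketch).
      uint32_t r0, g0, b0, r1, g1, b1;
      expand_565(word & 0xFFFF, r0, g0, b0);
      expand_565(word >> 16, r1, g1, b1);
      uint32_t y[4];
      y[0] = rgb_to_y(r0, g0, b0);
      y[1] = rgb_to_y(r1, g1, b1);
      y[2] = rgb_to_y((r0 * 2 + r1) / 3, (g0 * 2 + g1) / 3, (b0 * 2 + b1) / 3);
      y[3] = rgb_to_y((r0 + r1 * 2) / 3, (g0 + g1 * 2) / 3, (b0 + b1 * 2) / 3);

      // Give each pixel its closest candidate luma value, and total up the squared luma error.
      uint64_t total = 0;
      uint8_t sel[16];
      for (int i = 0; i < 16; i++)
      {
         uint32_t best_d = UINT32_MAX;
         sel[i] = 0;
         for (int s = 0; s < 4; s++)
         {
            int d = (int)block_luma[i] - (int)y[s];
            uint32_t dd = (uint32_t)(d * d);
            if (dd < best_d) { best_d = dd; sel[i] = (uint8_t)s; }
         }
         total += best_d;
      }

      if (total < best.luma_error)
      {
         best.luma_error = total;
         best.endpoint_word = word;
         std::memcpy(best.selectors, sel, sizeof(sel));
      }
   }
   return best;
}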

Interestingly, DXT1 images compressed in this way really only need ~18 bits to represent the block endpoints, vs. the usual 32 (because there are only 195,841 possible endpoints that can be utilized by the encoder). The potential transmission or disc storage savings is ~14 bits per DXT1 block (~22%).
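As a concrete example of how a repacker could exploit this, each block's 32-bit endpoint word could be replaced with an index into the table of valid pairs. A minimal sketch, assuming luma_endpoints holds the 195,841 valid endpoint words sorted ascending (a hypothetical setup, not part of crunch):

#include <algorithm>
#include <cstdint>
#include <vector>

// Maps a block's raw 32-bit endpoint word to its index in the valid-pair table.
// Since the encoder only ever emits words from the table, the lookup always succeeds,
// and the result (0..195,840) fits in 18 bits.
uint32_t endpoint_word_to_index(const std::vector<uint32_t>& luma_endpoints, uint32_t endpoint_word)
{
   auto it = std::lower_bound(luma_endpoints.begin(), luma_endpoints.end(), endpoint_word);
   return (uint32_t)(it - luma_endpoints.begin());
}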

I'm not sure if I'm going to add this luma optimization mode to crunch's DXT1 compressor just yet.

Mipmapped Luma-Chroma DXT1 Compression and Texture Sampling

The next idea builds off the previous one. So now we have a way of encoding luma-only images into DXT1 at a higher quality level than can be done using vanilla DXT1 compression. Is there a way to efficiently add color to this format, without resorting to two separate textures like Humus did in his High Quality Texture Compression Demo, and still fully preserve GPU texture filtering?

Turns out there is at least one way, if we combine Chroma Subsampling with Mipmapping, and use D3DSAMP_MAXMIPLEVEL to clamp the largest mipmap level a texture sampler is permitted to fetch from. Here's the basic approach:
  • Create a single DXT1 texture with a full mipmap chain.
  • Convert the input image to Luma and encode it into the first (largest) mipmap of the DXT1 texture using Luma-Optimized DXT1 compression (or even plain DXT1 compression, but at lower quality).
  • Now downsample the original (color) image by 1/2 on each axis, encode this to vanilla DXT1 and place the bits into the second (half-res) mipmap level. Continue to do this for all subsequent mipmap levels.
The end result is a single DXT1 mipmapped texture that contains a Luma-only image in mip0, with all the other mipmaps containing plain color images. There are some improvements to the basic idea possible (you can tack on a feedback optimization pass that selectively alters the luma pixels based off the unpacked/upsampled chroma mipmap, in order to optimize the overall error in the unpacked/filtered mip0), but this is the basic idea.
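Here's a loose C++ sketch of the construction side. All of the helper names (image_to_luma, downsample_2x, compress_dxt1, compress_dxt1_luma) are placeholders for whatever resampler/encoder you actually use - they are not crnlib API calls - and the optional feedback pass mentioned above is omitted:

#include <cstdint>
#include <vector>

struct Image { int width, height; std::vector<uint8_t> rgb; };    // 24bpp RGB pixels
struct DXT1Level { std::vector<uint8_t> blocks; };                // raw DXT1 block data for one mip

Image image_to_luma(const Image& src);           // replicate Rec. 601 luma into R, G and B
Image downsample_2x(const Image& src);           // halve each axis (clamping at 1)
DXT1Level compress_dxt1(const Image& img);       // plain DXT1 (e.g. crnlib)
DXT1Level compress_dxt1_luma(const Image& img);  // luma-optimized DXT1 (see the section above)

std::vector<DXT1Level> build_luma_chroma_mips(const Image& src)
{
   std::vector<DXT1Level> levels;

   // Mip 0: the full-res image converted to luma, encoded with the luma-optimized compressor.
   levels.push_back(compress_dxt1_luma(image_to_luma(src)));

   // Mips 1..N: progressively downsampled *color* images, encoded as plain DXT1.
   Image cur = downsample_2x(src);
   for (;;)
   {
      levels.push_back(compress_dxt1(cur));
      if ((cur.width <= 1) && (cur.height <= 1))
         break;
      cur = downsample_2x(cur);
   }
   return levels;
}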

To efficiently sample from this strange Luma-Chroma DXT1 texture in a pixel shader, you need to bind it twice to two separate samplers (in D3D9 - there may be a simpler way that doesn't require two samplers in D3D10/11). The first bind is nothing special, and the second bind is the same except you set the D3DSAMP_MAXMIPLEVEL setting to 1. So the second sampler (which retrieves texture chroma) cannot fetch from the first mipmap level. Here's code for D3D9:
   const int nMaxAniso = 6;

   // Bind texture to sampler 0 with filtering, no LOD bias, MAXMIPLEVEL set to 0 (i.e. nothing special) 
   pDevice->SetTexture(0, pTex);
   pDevice->SetSamplerState(0, D3DSAMP_ADDRESSU, D3DTADDRESS_CLAMP);
   pDevice->SetSamplerState(0, D3DSAMP_ADDRESSV, D3DTADDRESS_CLAMP);
   pDevice->SetSamplerState(0, D3DSAMP_ADDRESSW, D3DTADDRESS_CLAMP);
   pDevice->SetSamplerState(0, D3DSAMP_MINFILTER, filtering ? D3DTEXF_ANISOTROPIC : D3DTEXF_POINT);
   pDevice->SetSamplerState(0, D3DSAMP_MAGFILTER, filtering ? D3DTEXF_ANISOTROPIC : D3DTEXF_POINT);
   pDevice->SetSamplerState(0, D3DSAMP_MIPFILTER, mip_filtering ? D3DTEXF_LINEAR : D3DTEXF_POINT);
   float mipBias = 0.0f;
   pDevice->SetSamplerState(0, D3DSAMP_MIPMAPLODBIAS, *(DWORD*)&mipBias);
   pDevice->SetSamplerState(0, D3DSAMP_MAXMIPLEVEL, 0);
   pDevice->SetSamplerState(0, D3DSAMP_MAXANISOTROPY, nMaxAniso);
   pDevice->SetSamplerState(0, D3DSAMP_SRGBTEXTURE, FALSE);

    // Bind texture again to sampler 1, this time setting MAXMIPLEVEL to 1
   pDevice->SetTexture(1, pTex);
   pDevice->SetSamplerState(1, D3DSAMP_ADDRESSU, D3DTADDRESS_CLAMP);
   pDevice->SetSamplerState(1, D3DSAMP_ADDRESSV, D3DTADDRESS_CLAMP);
   pDevice->SetSamplerState(1, D3DSAMP_ADDRESSW, D3DTADDRESS_CLAMP);
   pDevice->SetSamplerState(1, D3DSAMP_MINFILTER, filtering ? D3DTEXF_ANISOTROPIC : D3DTEXF_POINT);
   pDevice->SetSamplerState(1, D3DSAMP_MAGFILTER, filtering ? D3DTEXF_ANISOTROPIC : D3DTEXF_POINT);
   pDevice->SetSamplerState(1, D3DSAMP_MIPFILTER, mip_filtering ? D3DTEXF_LINEAR : D3DTEXF_POINT);
   mipBias = 0.0f;
   pDevice->SetSamplerState(1, D3DSAMP_MIPMAPLODBIAS, *(DWORD*)&mipBias);
   pDevice->SetSamplerState(1, D3DSAMP_MAXMIPLEVEL, 1);
   pDevice->SetSamplerState(1, D3DSAMP_MAXANISOTROPY, nMaxAniso);
   pDevice->SetSamplerState(1, D3DSAMP_SRGBTEXTURE, FALSE);

Now here's how to recover the high-resolution, colored pixels, with full GPU filtering: You can use the first sampler (which can access all mipmap levels with no restrictions) to fetch Luma, and the second sampler (which is forced to only sample the smaller mipmap levels 1 or higher) to only sample Chroma. The shader immediately converts the sampled RGB results from sampler0 to Y, and the RGB results from sampler1 to Chroma.

Next, the shader transforms the resulting Y+CbCr color values to plain RGB. All the usual hardware features (trilinear filtering, anisotropic filtering, etc.) work as usual because these are linear operations, and no distortion is introduced when sampling from mipmap levels 1 and above because the two samplers always fetch from the same mipmap levels after mip0. When the textured primitive is large enough in screenspace to sample exclusively from mipmap level 0, you get Luma from mip0 and chroma from upsampled mip1 (exactly like the usual Chroma Subsampling scheme used in JPEG or MPEG). As the textured primitive's projection gets smaller, the GPU begins to sample exclusively from the other mipmap levels, and the whole operation is effectively a no-op. (I know this whole idea probably sounds dubious - that's why I created and tested it in a working D3D9 demo - see below.)

Here's the fully tested shader code, using the YCbCr colorspace:

const float F_YR = 19595/65536.0f, F_YG = 38470/65536.0f, F_YB = 7471/65536.0f;
const float F_CB_R = -11059/65536.0f, F_CB_G = -21709/65536.0f, F_CB_B = 32768/65536.0f, F_CR_R = 32768/65536.0f, F_CR_G = -27439/65536.0f, F_CR_B = -5329/65536.0f;
const float F_R_CR = 91881/65536.0f, F_B_CB = 116130/65536.0f, F_G_CR = -46802/65536.0f, F_G_CB = -22554/65536.0f;
float RGB_to_Y(float3 rgb)
{
   return rgb.r * F_YR + rgb.g * F_YG + rgb.b * F_YB;
}

float3 RGB_to_YCC(float3 rgb)
{
   float3 ycc;
   ycc.r = rgb.r * F_CR_R + rgb.g * F_CR_G + rgb.b * F_CR_B;
   ycc.g = rgb.r * F_YR   + rgb.g * F_YG   + rgb.b * F_YB;
   ycc.b = rgb.r * F_CB_R + rgb.g * F_CB_G + rgb.b * F_CB_B;
   return ycc;
}

float3 YCC_to_RGB(float3 ycc)
{
   float y = ycc.g;
   float cb = ycc.b;
   float cr = ycc.r;
   float3 rgb;
   rgb.r = y + F_R_CR * cr;
   rgb.g = y + F_G_CR * cr + F_G_CB * cb;
   rgb.b = y               + F_B_CB * cb;
   return rgb;
}

float4 PixQuadLumaChroma(float2 Tex : TEXCOORD0) : COLOR0
{
   // Sample luma (MAXMIPLEVEL is 0 on this sampler)
   float flLuma = RGB_to_Y(tex2D(g_samQuad0, Tex).rgb);

   // Sample chroma (MAXMIPLEVEL is 1 on this sampler)
   float3 vChromaRGB = tex2D(g_samQuad1, Tex).rgb;
   float3 vChromaYCC = RGB_to_YCC(vChromaRGB);

   // Combine Luma and Chroma, and convert back to RGB.
   // This is a no-op unless the MAXMIPLEVEL clamp setting is kicking in.
   vChromaYCC.g = flLuma;
   float3 vFinalRGB = YCC_to_RGB(vChromaYCC);

   // Return the RGB results
   return float4(vFinalRGB.r, vFinalRGB.g, vFinalRGB.b, 1.0f);
}

Here's kodim05.png expressed as a Luma-Chroma DXT1 mipmapped texture (with each mipmap level appearing to the right of the previous, larger one). The top miplevel (the largest/leftmost) uses Luma-Optimized DXT1 compression (brute force endpoint evaluation), and the other mipmap levels were packed to plain DXT1 with crnlib v1.03. An optional, two-step process with feedback was used to generate the Luma image in this example (the luma pixels in mip0 were adjusted prior to compression to partially compensate for the error introduced by using upsampled, DXT1-compressed chroma from mip1).

Above: Mipmapped Luma-Chroma DXT1 Texture

Above: Mipmap level 0 Converted to Rec. 601 Luma (sampler0 after conversion to Y in the pixel shader)

Above: Final Mipmap level 0 of a Mipmapped Luma-Chroma DXT1 Texture, converted back to RGB in the pixel shader (Y is from mip0, CbCr is from upsampled mip1)

Blocky chroma artifacts around areas with harsh chroma transitions seem to be the primary issue with this encoding/sampling method, because the 4x4 DXT1 block patterns present in mipmap level 1 are magnified into somewhat filtered 8x8 block shapes at mipmap level 0.

If you're wondering how this format compares to plain DXT1: At mipmap level 0, this format trades off chroma and RGB quality (and shader cycles) for slightly higher Luma quality (see the above section on Luma-Optimized DXT1 compression to get a feel for how much more). At mipmap levels 1 and above there are no differences between this format and plain DXT1 from a quality perspective (ignoring the extra shader cost, which may be skippable via a dynamic jump). Whether or not this particular tradeoff actually makes sense to do is up in the air, but it's an interesting scheme which could lead to other, even more interesting things.

TODO: Add MSE/PSNR statistics, comparison against plain DXT1

Command Line Conversion Tool and D3D9 Viewer App

You can download a Win32/x64 demo demonstrating this approach here. This archive contains d3d9view.exe and d3d9view_x64.exe, 32-bit and x64 builds of a modified D3D9 sample that supports loading .CRN/.DDS/.TGA/.BMP/.JPG/etc. textures. There are options to view the sampled texture as RGB, Luma, Luma/Chroma (using the conversion pixel shader above), or just chroma. d3d9view uploads the mipmapped luma/chroma DXT1 texture bits loaded from disk to the GPU, and exclusively uses the GPU to convert the texture to RGB. The anisotropically filtered textured quad may be translated/rotated, and MIN/MAG and MIP filtering can be separately toggled on/off. The shader source code is in d3d9view.fx, and I've included a few already-converted sample textures in the archive. Please email me if you would like the full source to any of these EXE's.

Important: To use the viewer, run it with the path to the texture to view on the command line, like this:

d3d9view_x64.exe kodim05.png_lumachroma_y_optimized.dds

Above: D3D9View.exe, viewing a mipmapped luma/chroma DXT1 texture

I've also included the totally hacky experimental x64 command line tool I've been using to create these images. Use this command line to create your own mipmapped luma/chroma DXT1 textures:

crunch_x64 kodim03.png unpacked.bmp /exp4

Note that the output DDS file will be "kodim03.png.dds" in this example; the final output .DDS file is always placed in the same directory as the source image. "2.dds" will contain mipmap level 0, and unpacked.bmp will be the unpacked version of the first mipmap level (simulating what the pixel shader will do). The /exp option uses crnlib's Luma-optimized DXT1 compressor instead of the brute force compressor (much faster, but slightly lower quality).

A Possible Improvement, and Other Uses for Mipmapped Luma/Chroma Textures

It would be nice to use DXT5A (aka BC4, or the alpha-only portion of a DXT5 texture) to encode mipmap 0, instead of trying to cram Luma-only data into DXT1. DXT5A is much more effective at representing grayscale images than DXT1. If the GPU allowed us to place DXT5A luma data in the first mipmap level, and DXT1 color/chroma data in the subsequent mipmap levels, and automatically sampled from each level using the right decoder logic, we would have the best of both worlds.

To support this, vendors would need to allow textures that have different formats in mipmap level 0 compared to the other levels. DXT5A is 4bpp, the same as DXT1, so this doesn't seem like a big deal. Also, the GPU would need the ability to decode DXT5A blocks from mip0, and DXT1 blocks from mip1/mip2/etc. This also seems like a minimal request (after all, the gates are already there to decode DXTc format blocks - we just want to select which decoder gets applied to each level).

I've already experimented with this format in software, and will report results with statistics over the next week or two. It'll probably be argued that the minor improvement in quality doesn't justify the additional complexity of this scheme, but there are other interesting uses for this technique. For example, another potential advantage to both DXT1 and DXT5A+DXT1 mipmapped Luma/Chroma formats is faster software decoding of JPEG compressed textures into GPU-friendly DXTc compressed textures. Currently, several modern game engines (see here for an example) unpack JPEG (or some other highly compressed image format) to DXT5 (8bpp) using streamlined/simplified real-time compressors that typically pack Luma into alpha and Chroma into R and G (or G and B). They convert from YCbCr (or some variant, like YCoCg) back to RGB in the pixel shader. With Luma-Chroma DXT textures, approx. 75% of the total number of pixels in a mipmapped texture (mip0 holds roughly 3/4 of the pixels in a full mip chain) only need to be packed into a format containing just a single sample (luma), the chroma samples do not need to be upsampled to full resolution by the CPU, and the total texture cache footprint of a mipmapped texture can be cut in half (4bpp vs. 8bpp).

Alternately, if only a single image is actually needed for rendering (not a full texture mipmap, like with 2D sprites), a truncated texture mipmap containing just 2 levels can be used for a total memory footprint of 5bpp (mip0=luma, mip1=chroma, or 4bpp+1bpp) to represent the image.

Initial Release: Apr. 29, 2012, Page Last Modified: May 3, 2012