We shall build a block-DCT-based encoding scheme and study its compression performance on two images in Python. Performance is judged on metrics such as PSNR, compression ratio and bits per pixel.
The compression pipeline is divided into two stages: preprocessing followed by channel-wise compression.
First, the input image is read. If it is a grayscale (single-channel) image, no preprocessing is required. If it is an RGB image, it is first converted to YUV, and the U and V channels are downsampled by 2, so the three channels together hold 2/3 of the original samples (the chroma-subsampling idea behind formats such as 4:2:0). This step is lossy, as the discarded information is irretrievably lost, but since the U and V channels carry little fine detail there is not much visual loss. This step alone gives a compression ratio of 1.5.
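A minimal sketch of this preprocessing step, assuming BT.601 conversion coefficients and simple pairwise averaging for the chroma downsampling (neither choice is specified in the post):

```python
import numpy as np

def rgb_to_yuv_sub(img):
    """Convert an RGB image (H x W x 3) to Y, U, V planes, with each
    chroma plane downsampled by 2 (a sketch of the preprocessing step)."""
    # BT.601 conversion coefficients (an assumption; the post does not name the matrix)
    r, g, b = (img[..., k].astype(np.float64) for k in range(3))
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.14713 * r - 0.28886 * g + 0.436 * b + 128.0
    v = 0.615 * r - 0.51499 * g - 0.10001 * b + 128.0
    # halve each chroma plane by averaging horizontal pairs of samples
    def half(ch):
        w = ch.shape[1] - ch.shape[1] % 2
        return 0.5 * (ch[:, 0:w:2] + ch[:, 1:w:2])
    return y, half(u), half(v)
```

For a mid-gray input, Y stays at 128 while the halved U and V planes sit near the 128 offset, as expected for a neutral colour.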
Preprocessing block
Channel wise compression
At this point we have three channels, which need to be encoded separately. In block-based coding, we divide the image into blocks of size 8x8, 16x16 or 32x32. Some schemes use a fixed block size, that is, the whole image is divided into equal-sized blocks. Here we shall experiment with variable block sizes.
First we divide the channel into 32x32 blocks. If a block's variance is very high, we subdivide it into sixteen 8x8 blocks; if the variance is medium, we subdivide it into four 16x16 blocks. This subdivision gives us large blocks in regions of low variance and smaller blocks in regions of high variance.
In Python, scipy.fftpack offers 1-D DCT and IDCT functions, so the 2-D transforms are built from the 1-D ones using the following separable method.
from scipy import fftpack

def dct2d(x):
    return fftpack.dct(fftpack.dct(x.T, norm='ortho').T, norm='ortho')

def idct2d(x):
    return fftpack.idct(fftpack.idct(x.T, norm='ortho').T, norm='ortho')
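A quick sanity check (a usage sketch) that this separable pair really inverts and behaves as an orthonormal transform:

```python
import numpy as np
from scipy import fftpack

def dct2d(x):
    return fftpack.dct(fftpack.dct(x.T, norm='ortho').T, norm='ortho')

def idct2d(x):
    return fftpack.idct(fftpack.idct(x.T, norm='ortho').T, norm='ortho')

# the pair should invert exactly (up to floating-point error), and with
# norm='ortho' the DC coefficient of an all-ones block equals the block size
blk = np.random.rand(8, 8)
assert np.allclose(idct2d(dct2d(blk)), blk)
assert np.isclose(dct2d(np.ones((8, 8)))[0, 0], 8.0)
```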
First we calculate the variance of all 32x32 blocks and divide them into three groups (low, mid and high) based on variance. The variances are sorted, and the t1 and t2 percentiles are used as the thresholds. Then the DCT is applied using block sizes dependent on variance.
def doDCT(ch, th1, th2, display):
    sh = ch.shape
    v = getVarTh(ch) # get variance of each 32x32 block
    thresholds = [np.percentile(v, th1), np.percentile(v, th2)]
    dctSize = [[-1 for j in range(sh[1]//32)] for i in range(sh[0]//32)]
    blkSize = {0: 8, 1: 16, 2: 32}
    coefflist = [[None for j in range(sh[1]//32)] for i in range(sh[0]//32)]
    for i in range(sh[0]//32): # outer loop over block rows
        for j in range(sh[1]//32): # inner loop over block columns
            blk = ch[32*i:32*(i+1), 32*j:32*(j+1)].astype('float32')
            if v[i][j] > thresholds[1]: # depending on variance, choose DCT block size: use 8x8
                dctSize[i][j] = 0
            elif v[i][j] > thresholds[0]: # use 16x16
                dctSize[i][j] = 1
            else: # use 32x32
                dctSize[i][j] = 2
            subblocksize = blkSize[dctSize[i][j]]
            numsubblocks = 32 // subblocksize
            coeffs = []
            for kk in range(numsubblocks): # DCT on one 32x32, four 16x16 or sixteen 8x8 sub-blocks
                for mm in range(numsubblocks):
                    coeffs += [dct2d(blk[kk*subblocksize:(kk+1)*subblocksize, mm*subblocksize:(mm+1)*subblocksize])]
            coefflist[i][j] = coeffs
    return coefflist, dctSize
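The helper getVarTh used above is not listed in the post; a minimal sketch consistent with how it is called (the per-32x32-block variance) might look like:

```python
import numpy as np

def getVarTh(ch):
    """Variance of each 32x32 block of a channel. (A sketch of the helper
    used by doDCT; the original implementation is not shown in the post.)"""
    h, w = ch.shape[0] // 32, ch.shape[1] // 32
    return [[float(np.var(ch[32*i:32*(i+1), 32*j:32*(j+1)]))
             for j in range(w)]
            for i in range(h)]
```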
Left: Moon image with variable block sizes. Variances are sorted for all 32x32 blocks. Then blocks with less than 50 percentile use 32x32 DCT. Blocks with variance between 50 to 80 percentile use 16x16, and the top 20 percentile use 8x8 DCT
Bottom: Similar thresholding (50,80 percentile) for the baboon image. All three channels are shown. U and V channels are downsampled.
We observe that regions of more details have smaller block sizes, while near constant regions have large block sizes.
We can choose the thresholds for the block sizes (that is, at which percentile of variance we shift to a different size). Some samples are provided below. If we use th1=th2=0, only 8x8 blocks are used. Similarly, for th1=th2=100, only 32x32 blocks are used. For further analysis we use th1=50 and th2=80, whose blocking results are shown above.
th1 = 0, th2 = 0
th1 = 0, th2 = 50
th1 = 100, th2 = 100
For quantization, we use JPEG's standard quantization matrix and scale it using JPEG's Q factor mechanism. Quantization is a lossy step.
The Q factor is an integer between 1 and 100. A scaling factor s is calculated from it as shown below. Each element of the base quantization matrix is then multiplied by s/1000 to get the new quantization matrix. Finally, each DCT coefficient is divided elementwise by the corresponding matrix entry and rounded.
def getS(Q):
    return 5000 // Q if Q < 50 else (200 - 2*Q if Q != 100 else 1)
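A few sample values of the scaling factor, to show the shape of the mapping (low Q means coarse quantization, Q=100 nearly none):

```python
def getS(Q):
    return 5000 // Q if Q < 50 else (200 - 2*Q if Q != 100 else 1)

# s falls quickly as quality rises
print(getS(10), getS(50), getS(90), getS(100))  # → 500 100 20 1
```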
JPEG's base quantization matrix is 8x8. There are two matrices, one for the Y channel and one for the U and V channels; the U and V channels are more coarsely quantized. Since we also use block sizes of 16x16 and 32x32, the original 8x8 matrix is upsampled to 16x16 or 32x32 using the following code:
def expandQMtx(q, factor): # expand the 8x8 quantization matrix to 16x16 or 32x32
    newQ = np.zeros((8*(2**factor), 8*(2**factor)))
    for i in range(8):
        for j in range(8):
            newQ[i*(2**factor):(i+1)*(2**factor), j*(2**factor):(j+1)*(2**factor)] = q[i][j]
    return newQ
The following code applies quantization.
def quantize(coefflist, baseMtx, Q):
    s = getS(Q)
    qMtx = baseMtx * (s/1000.)
    qDict = {1: expandQMtx(qMtx, 2), 4: expandQMtx(qMtx, 1), 16: qMtx} # expand the base matrix for the 32x32 and 16x16 cases
    roundedCoeff = [[None for j in range(len(coefflist[0]))] for i in range(len(coefflist))]
    for i in range(len(coefflist)): # for each of the 32x32 blocks
        for j in range(len(coefflist[0])):
            cff = coefflist[i][j]
            q = qDict[len(cff)] # one 32x32, four 16x16 or sixteen 8x8 sub-blocks
            roundedCoeff[i][j] = [np.round(np.divide(k, q)).astype(int) for k in cff] # quantization step
    return roundedCoeff
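The decoder reverses this step by multiplying back by the same matrices before applying the inverse DCT. A minimal sketch of that inverse (not shown in the post; qDict maps sub-block count to quantization matrix, as in quantize above):

```python
import numpy as np

def dequantize(roundedCoeff, qDict):
    """Inverse of the quantization step above: multiply each rounded
    coefficient sub-block by its quantization matrix. (A sketch.)"""
    out = [[None for j in range(len(roundedCoeff[0]))] for i in range(len(roundedCoeff))]
    for i in range(len(roundedCoeff)):
        for j in range(len(roundedCoeff[0])):
            cff = roundedCoeff[i][j]
            q = qDict[len(cff)]  # pick the matrix matching the sub-block count
            out[i][j] = [np.multiply(k, q) for k in cff]
    return out
```

The rounding in quantize is where the information is actually lost; dequantize only undoes the scaling.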
Run-length encoding (RLE) is a lossless compression technique that compresses long runs of zeros effectively. After quantization, we expect the high-frequency components to have become zero, so RLE can help here.
To perform run length encoding, we first need to perform zigzag scan on the 2-D DCT coefficients to make them 1-D. This scan order ensures we go from low to high frequencies gradually, thus enabling long runs of zeros.
Zigzag scan of 8x8 DCT coefficients. From Wikipedia.
The function to get the zigzag scan order is given below:
def getZigzagScanOrder(sz):
    scanOrder = [(0, 0)]
    r = 0; c = 0
    dir = 'b' # 'b' = border, 'd' = down, 'u' = up
    while len(scanOrder) < sz*sz:
        if dir == 'b': # at a border: step along it and reverse the diagonal direction
            if r == 0: # top border
                c += 1; dir = 'd'
            elif r == sz-1: # bottom border
                c += 1; dir = 'u'
            elif c == 0: # left border
                r += 1; dir = 'u'
            elif c == sz-1: # right border
                r += 1; dir = 'd'
            else:
                assert False
        elif dir == 'u': # direction is 'up'
            if r == 0 or c == sz-1: # if direction is 'up' and we are at a border, change mode to 'border'
                dir = 'b'; continue
            else: # if direction is 'up' and we are not at a border, go 'up'
                r -= 1; c += 1
        else: # direction is 'down'
            if c == 0 or r == sz-1: # if direction is 'down' and we are at a border, change mode to 'border'
                dir = 'b'; continue
            else: # if direction is 'down' and we are not at a border, go 'down'
                r += 1; c -= 1
        scanOrder.append((r, c))
    return scanOrder
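An equivalent, more compact way to generate the same scan order (a cross-check sketch, not from the post) is to sort the indices by anti-diagonal:

```python
def zigzag_order(sz):
    # sort all (row, col) indices by anti-diagonal r+c; within a diagonal,
    # even diagonals run bottom-left to top-right, odd ones the reverse
    idx = [(r, c) for r in range(sz) for c in range(sz)]
    return sorted(idx, key=lambda rc: (rc[0] + rc[1],
                                       rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

# flatten an 8x8 block of quantized coefficients in zigzag order:
# flat = [blk[r][c] for r, c in zigzag_order(8)]
```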
Run-length encoding in JPEG is done using the convention ((runlength, size), amplitude). Runlength is the number of zeros preceding a non-zero value, size is the number of bits needed to code the amplitude, and amplitude is the actual value. Runlength and size together form one symbol, and amplitude forms another. After performing RLE, we could apply Huffman coding to those symbols and bring the size down even more. However, since Python does not have a readily available built-in Huffman coding function (unlike MATLAB), we skip entropy coding here.
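The encoding side of this convention can be sketched as follows (end-of-block and runlength-overflow handling are omitted for brevity):

```python
def rle_encode(seq):
    """Run-length encode a 1-D coefficient sequence using the
    ((runlength, size), amplitude) convention described above. (A sketch;
    trailing zeros are dropped rather than marked with an end-of-block symbol.)"""
    out, run = [], 0
    for x in seq:
        if x == 0:
            run += 1
        else:
            size = int(abs(x)).bit_length()  # bits needed to code the amplitude
            out.append(((run, size), x))
            run = 0
    return out
```

For example, the zigzag-scanned sequence 0, 0, 5, 0, -1, 0, 0, 0 encodes as ((2, 3), 5), ((1, 1), -1).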
We plot the number of bits per pixel, the compression ratio and the PSNR against Q and obtain the following graphs. We observe the expected trends: bits per pixel and PSNR increase with quality, while the compression ratio decreases. For the same quality factor, the baboon image needs more bits per pixel than the moon image as it is a colour image, and correspondingly the moon image achieves higher PSNR.
The moon image is grayscale, hence needs 8 bits per pixel uncompressed. The baboon image is RGB, hence needs 8x3 = 24 bits per pixel; after downsampling the U and V channels it needs only 2/3 of that, i.e. 16 bits per pixel. We observe that as Q approaches 100, the bits per pixel for moon approach 8, while those for baboon approach 16, and the compression ratio drops towards 1.
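The metrics themselves are straightforward; a sketch of the formulas used (the post does not show its exact measurement code):

```python
import numpy as np

def psnr(orig, recon, peak=255.0):
    """PSNR in dB between the original and reconstructed images."""
    mse = np.mean((orig.astype(np.float64) - recon.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

def bits_per_pixel(total_bits, h, w):
    """Bits per pixel of the encoded stream for an h x w image."""
    return total_bits / (h * w)
```

The compression ratio is then simply the uncompressed bits per pixel (8 or 24) divided by the encoded bits per pixel.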
Number of bits per pixel vs Q
PSNR vs Q
Compression ratio vs Q
In the run-length coding, we must specify the number of bits used to encode the runlength. If this number is too large or too small, bits are wasted. We can plot bits per pixel against the number of runlength bits to find the optimal allocation.
For two Q-factor values (5 and 10), we observe that 2 bits for the runlength gives the best performance.
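The trade-off behind this choice can be sketched: with b bits, runs longer than 2^b - 1 must be split with extra escape symbols, so short-run streams favour small fields and long-run streams favour large ones (a rough cost model; the post's exact accounting is not shown):

```python
def stream_bits(symbols, run_bits, size_bits=4):
    """Estimate total bits for a list of ((runlength, size), amplitude)
    symbols when run_bits bits are allotted to the runlength field.
    Over-long runs cost extra escape symbols. (A rough sketch.)"""
    total = 0
    for (run, size), _amp in symbols:
        escapes = run // (2**run_bits)  # extra symbols for over-long runs
        total += (escapes + 1) * (run_bits + size_bits) + size
    return total

# short runs favour a narrow runlength field, long runs a wide one
short_runs = [((1, 3), 5)] * 10
long_runs = [((40, 3), 5)] * 10
```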