Sound Compression using DCT
Discrete Fourier Transform
The numbers X0,...,XN-1 described in the first formula are called the Discrete Fourier Transform (DFT) of the sequence x0,...,xN-1.
The Inverse Discrete Fourier Transform (IDFT) of the sequence X0,...,XN-1 is given by the numbers x0,...,xN-1 from the second formula.
In the literature, there is no standard definition of the DFT; instead, you can find other versions with different signs and normalization factors.
For example, here in Scipy for Python.
The Discrete Cosine Transform
Similarly, the Discrete Cosine Transform (DCT) of the sequence x0,...,xN-1 has different versions, as does the Inverse Discrete Cosine Transform (IDCT) of the sequence.
Sound Waves
The sound wave representing the audio is a continuous function defined on the finite interval [0,T]. Analog sound is digitized by taking a certain number of samples per second and storing these samples in digital form with a certain precision ("bit depth").
For example, in an audio CD we have
Sample rate: 44100 (44100 samples are taken per second, or 44.1 kHz)
Sound quality (or Bit Depth): 16 bits (each sample is stored in a 16-bit variable)
Bitrate = 44100*16*2 = 1,411,200 bit/s = 1.41 Mbit/s if the sound is stereo.
That is, 1.41/8 = 0.1764 MB/s o 10.58 MB/min.
Here, the Python code for the example download.
import matplotlib.pyplot as plt #paquete para graficar
import numpy as np #paquete de operaciones matemáticas
from scipy.io import wavfile #paquete para cargar archivos wav
from scipy.fftpack import dct, idct #paquete para dct y idct
frec, x = wavfile.read('audio1.wav') # leemos el archivo wav
x = x[:,1] # el audio es estéreo, para simplificar nos quedamos con un solo canal
# calculamos la DCT # cuidado, sino normalizamos idct(dct(x)) no da x
X = dct(x, norm="ortho")
# en F guardo las frecuencias ordenadas de manera ascendente
F = np.sort(np.abs(X))
# en indices[k] guardo la posición que ocupaba originalmente en X en elemento F[k]
indices = np.argsort(np.abs(X))
# quiero que F esté ordenado descendientemente (sort ordena ascendente)
F = F[::-1]
indices = indices[::-1]
# X es del tipo "list" lo paso al tipo "array de numpy"
Xantes = X
X = np.array(X)
F = np.array(F)
# nos quedamos con "cuanto" de las frecuencias (1=100%)
cuanto = 0.5
cuantasFrecuencias = int(len(X)*cuanto)
cuantaEnergia = np.linalg.norm( X[ indices[range(0,cuantasFrecuencias)] ] )/np.linalg.norm(X)
# al resto de las frecuencias las ponemos en 0
X[ indices[range(cuantasFrecuencias+1, len(X))] ] =0
# reconstruimos el sonido con las nuevas frecuencias
xx= idct(X, norm="ortho")
print("Frecuencias totales:", len(X))
print("Frecuencias que consideramos:",cuantasFrecuencias, "que conforman el", cuanto*100,"% de la energia")
# guardamos el nuevo sonido
wavfile.write("Naudio.wav", frec, xx)
#graficamos la señal/frecuencias viejas/nuevas
fig1 = plt.subplots()
plt.plot(Xantes, 'r,')
plt.plot(X,'g,')
fig2 = plt.subplots()
plt.plot(x, 'r,')
plt.plot(xx, 'g,')
plt.show()
A particular example:
We take Q = 0.8. Here you can see the first 2500 coefficients of the vector X, with those necessary to reach 80% of the energy shown in red. All the blue points are discarded
With Q = 0.8. In blue, we see the values of the original sound x. In red, we see the values reconstructed from the coefficients sufficient to reach 80% of the original energy.
Histogram of the original sound. The reds correspond to high coefficients, decreasing with orange, yellow, green, cyan, and blue, which represents null coefficients.
Histogram of the "compressed" sound. The reds correspond to high coefficients, decreasing with orange, yellow, green, cyan, and blue, which represents null coefficients.