The zero-crossing rate measures how often a signal changes sign, that is, the number of times the speech signal shifts from positive to negative or from negative to positive within a given frame. Typically, a higher zero-crossing rate indicates that the signal is dominated by higher frequencies. Thus, the zero-crossing rate can serve as a coarse frequency feature of the signal.
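As a minimal sketch of this definition, the following function counts sign changes between adjacent samples of one frame and normalizes by the number of sample pairs. The function name, the normalization, and the convention that a sample equal to zero counts as positive are illustrative assumptions, not details given in the text.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to zero is treated as positive.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Two hypothetical test tones sampled at 8 kHz: 100 Hz and 1000 Hz.
sample_rate = 8000
low_tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(400)]
high_tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(400)]

zcr_low = zero_crossing_rate(low_tone)
zcr_high = zero_crossing_rate(high_tone)
```

Under these assumptions, the 1000 Hz tone yields a larger rate than the 100 Hz tone, which is the relationship the paragraph describes.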
Short-term energy represents the mean energy of each short frame, that is, the average of the squared sample values within the frame.
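A possible way to compute this feature is sketched below. The frame length, hop length, and the choice to drop a trailing partial frame are assumptions made for illustration; the text does not specify them.

```python
def short_term_energy(signal, frame_length, hop_length):
    """Mean of squared samples for each full frame of the signal.

    Assumption: a trailing partial frame is discarded.
    """
    energies = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = signal[start:start + frame_length]
        energies.append(sum(x * x for x in frame) / frame_length)
    return energies


# Non-overlapping frames of two samples each.
example_energies = short_term_energy([1, 1, 2, 2], frame_length=2, hop_length=2)
```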
The spectral centroid is a crucial parameter for describing timbre, as it represents the center of gravity of the frequency components. It is calculated as the energy-weighted average of the frequencies within a specific range and is measured in Hz. The spectral centroid thus summarizes how energy is distributed across frequency in an audio signal.
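The energy-weighted average can be illustrated with a small sketch. It uses a naive discrete Fourier transform over the non-negative frequency bins and weights each bin frequency by its squared magnitude. The function name, the use of the full range from 0 Hz to half the sampling rate, and the test tone are assumptions for illustration only.

```python
import cmath
import math


def spectral_centroid(frame, sample_rate):
    """Energy-weighted mean frequency (Hz) of one frame, via a naive DFT."""
    n = len(frame)
    weighted_sum = 0.0
    total_energy = 0.0
    for k in range(n // 2 + 1):
        # DFT coefficient for bin k (O(n^2) overall; fine for a short sketch).
        coeff = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        energy = abs(coeff) ** 2
        weighted_sum += (k * sample_rate / n) * energy
        total_energy += energy
    return weighted_sum / total_energy if total_energy > 0 else 0.0


# A hypothetical 1000 Hz tone that falls exactly on a DFT bin (8000 / 64 * 8).
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(64)]
centroid_hz = spectral_centroid(tone, sample_rate=8000)
```

For a pure tone whose frequency lies exactly on a bin, nearly all energy sits in that bin, so the centroid coincides with the tone frequency.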
Humans are comparatively less sensitive to the high-frequency components of speech signals. Although humans can perceive sounds up to 20,000 Hz, the human ear processes sound at different frequencies non-linearly. Therefore, to align with the auditory properties of the human ear, researchers proposed the Mel scale, which is typically implemented as a bank of triangular bandpass filters spaced densely at low frequencies and sparsely at high frequencies.
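The dense-to-sparse arrangement can be sketched by placing filter centers at equal steps on the Mel axis and mapping them back to Hz. The conversion formula used here, m = 2595 log10(1 + f / 700), is one widely used variant and is an assumption, since the text does not state which formula is applied; the function names are also illustrative.

```python
import math


def hz_to_mel(freq_hz):
    """Hz to Mel, using the common 2595 * log10(1 + f / 700) variant."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of filters equally spaced on the Mel axis."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=8000.0)
# Spacing in Hz between neighboring filter centers.
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

With this formula the gaps between neighboring centers grow with frequency, so the filters are closely packed at the low end and spread out at the high end.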
The log-Mel spectrogram is obtained by applying the Mel filter bank to the linear spectrogram and taking the logarithm. The MFCCs are obtained by applying the Discrete Cosine Transform (DCT) to the log-Mel spectrogram and retaining the 2nd to the 13th coefficients, which gives 12 coefficients per frame. These features capture the key spectral characteristics of the speech signal while remaining consistent with human auditory perception.
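The final step, from one log-Mel frame to 12 coefficients, might look like the sketch below. It assumes an unnormalized DCT-II, which is the conventional choice but is not stated in the text, and a hypothetical frame of 26 Mel bands; the function names are illustrative.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II of a sequence."""
    n = len(values)
    return [
        sum(values[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
        for k in range(n)
    ]


def mfcc_from_log_mel(log_mel_frame, n_keep=12):
    """Keep the 2nd to the (n_keep + 1)th DCT coefficients of one frame."""
    coeffs = dct_ii(log_mel_frame)
    # Index 0 is the first coefficient, so the slice starts at index 1.
    return coeffs[1:1 + n_keep]


# A hypothetical flat log-Mel frame with 26 bands.
flat_frame = [1.0] * 26
flat_mfcc = mfcc_from_log_mel(flat_frame)
```

Skipping the first coefficient removes the term that tracks the overall level of the frame, which is why a flat frame produces retained coefficients that are numerically zero.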
GFCC is similar to MFCC, but it uses a GammaTone filter bank in place of the Mel scale and logarithm. GammaTone filters are a set of filter models designed to simulate the frequency decomposition characteristics of the cochlea. When external speech signals reach the basilar membrane of the cochlea, they are decomposed according to frequency, generating traveling vibrations that stimulate the auditory receptor cells. Compared with the Mel filter, the GammaTone filter is more sensitive to high-frequency information and reduces energy leakage. It also offers greater robustness and noise immunity.
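For illustration, a single GammaTone filter can be sketched through its impulse response, t^(n-1) exp(-2 pi b t) cos(2 pi f_c t). The filter order of 4, the bandwidth b = 1.019 ERB(f_c), and the Glasberg and Moore ERB approximation are commonly used values that are assumed here; none of them comes from the text, and the function names are hypothetical.

```python
import math


def erb(center_hz):
    """Equivalent rectangular bandwidth (Hz), Glasberg-Moore approximation."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_impulse_response(center_hz, sample_rate, duration=0.025, order=4):
    """Unnormalized impulse response of one GammaTone filter."""
    bandwidth = 1.019 * erb(center_hz)  # assumed bandwidth scaling
    n_samples = round(duration * sample_rate)
    response = []
    for i in range(n_samples):
        t = i / sample_rate
        # Gamma-shaped envelope multiplied by a tone at the center frequency.
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    return response


impulse = gammatone_impulse_response(center_hz=1000.0, sample_rate=16000)
```

A filter bank would repeat this for a set of center frequencies; how those frequencies are chosen is left open here because the text does not specify it.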
In CQCC feature extraction, the constant Q transform (CQT) is first applied to the input audio signal. The frequency bins of the CQT are geometrically spaced, so its non-linear frequency resolution matches the structure of the musical scale.
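The geometric bin spacing of the CQT can be sketched as follows. The choice of 12 bins per octave, the 440 Hz starting frequency, and the definition of Q as the center frequency divided by the spacing to the next bin are illustrative assumptions rather than settings taken from the text.

```python
def cqt_center_frequencies(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced center frequencies: f_k = f_min * 2^(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def q_factor(bins_per_octave=12):
    """Ratio of center frequency to bin spacing, identical for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


# Thirteen bins starting at 440 Hz span exactly one octave.
cqt_freqs = cqt_center_frequencies(f_min=440.0, n_bins=13)
ratios = [b / a for a, b in zip(cqt_freqs, cqt_freqs[1:])]
```

With 12 bins per octave, adjacent bins differ by a fixed frequency ratio, which corresponds to one semitone in equal temperament.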