Learnable spectrogram
The model takes the raw audio waveform as input and extracts features with 1-D convolution layers. The resulting feature representation has roughly the same dimensions as a spectrogram computed from the same audio.
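As a rough illustration of why a 1-D convolution over the waveform yields something spectrogram-shaped, the sketch below applies a single strided convolution layer in NumPy. Everything specific here is an assumption made for the example and not taken from the model described above: the function name `conv1d_frontend`, the 16 kHz sample rate, the 400-sample kernel (25 ms), the 160-sample stride (10 ms), the 80 output channels, and the ReLU nonlinearity. The filters are random placeholders; in a trained model they would be learned.

```python
import numpy as np

def conv1d_frontend(waveform, filters, stride):
    """Strided 1-D convolution without padding.

    waveform: (num_samples,), filters: (num_filters, kernel_size)
    returns:  (num_filters, num_frames)
    """
    num_filters, kernel_size = filters.shape
    num_frames = (len(waveform) - kernel_size) // stride + 1
    # Index matrix that slices the waveform into overlapping windows,
    # one window per output frame.
    idx = np.arange(kernel_size)[None, :] + stride * np.arange(num_frames)[:, None]
    frames = waveform[idx]  # (num_frames, kernel_size)
    # Apply every filter to every window (cross-correlation, as in most
    # deep learning libraries), then a ReLU as a stand-in nonlinearity.
    return np.maximum(filters @ frames.T, 0.0)

# Assumed settings: 1 s of 16 kHz audio, 25 ms kernels, 10 ms hop, 80 channels.
rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)
filters = rng.standard_normal((80, 400)) * 0.01  # placeholder for learned weights
features = conv1d_frontend(waveform, filters, stride=160)
print(features.shape)  # (80, 98)
```

With these assumed values the output has 80 channels and 98 frames, which matches the layout of an 80-band spectrogram computed with a 25 ms window and a 10 ms hop. The kernel size plays the role of the analysis window, the stride plays the role of the hop length, and the number of filters plays the role of the number of frequency bands.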