Musical source separation (MSS) is the task of extracting the audio of a single source (an instrument or a voice) from a signal in which several sources play simultaneously: for example, extracting the vocals from a recording that has drums, guitar, keyboard, and so on in the background.
If the mixture contains as many channels as there are sources and the mixing process is known and fixed, the task is relatively straightforward: the mixture can be unmixed by inverting the known mixing matrix.
However, this is not the case in most practical scenarios, and MSS then involves understanding the unique properties and structure of the individual sources and using these to break the mixture down into its constituent parts.
The Texas Instruments TMS320C6748 is a floating-point DSP quite capable of advanced audio and image processing, but it was never meant for deep learning inference. The multi-core TMS320C6678 is probably a slightly better fit, but still not intended for very deep networks.
Be that as it may, we still set out, quite ambitiously, to implement the model architecture in C, load the learned weights onto the C6748, and perform inference on audio received through the board's audio-in port. The network we used was a slightly pruned version of the one detailed at https://github.com/MTG/DeepConvSep, which its authors describe as low-latency in comparison to several other state-of-the-art models. It is a deep convolutional encoder-decoder architecture that can separate multiple sources simultaneously. Our key modifications were reducing the sizes of the filters and weight matrices in the various layers and training the model on audio downsampled to 16 kHz.
The inference did not, of course, turn out to run in real time: the processor took a whopping 15 minutes to separate a 10-second clip. But it was good to see it working!
Rohit M. A. and Mohammed Niyas, "Musical Source Separation on the TI TMS320C6748", Technical Report (PDF).