In a crowded room, people can easily follow a conversation of interest, a phenomenon known as the "cocktail-party effect." Humans are well suited to this task: they listen with two ears (which enables localization), use non-audio cues (such as watching moving lips), and combine all of this information through higher-order cognitive processing. Our single-channel source separation research starts from a much more difficult problem: we have only one channel and no visual cues.
One of the latest and most promising methods of source separation is non-negative matrix factorization (NMF). NMF decomposes a non-negative matrix into a product of two non-negative matrices, one holding the bases and the other holding the weights, so that each column of the original matrix is approximated by a linear combination of the bases. For source separation, NMF is applied to the magnitude of the short-time Fourier transform (STFT). If the resulting bases correspond to different talkers, each source is recovered by multiplying together the bases and weights that correspond to it.
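The factorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual implementation: it assumes a Euclidean cost with the standard multiplicative update rules, and the names nmf, reconstruct_source, rank, and n_iter are hypothetical. V stands for an STFT magnitude matrix (frequency bins by frames).

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (freq x frames) as W @ H.

    Assumption: Euclidean cost with multiplicative updates; the text
    does not say which cost function or update rule is used.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps    # bases (one per column)
    H = rng.random((rank, n_frames)) + eps  # weights (one row per basis)
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def reconstruct_source(W, H, basis_idx):
    """Magnitude estimate for one source from the bases assigned to it."""
    idx = list(basis_idx)
    return W[:, idx] @ H[idx, :]

# Toy example: a rank-2 non-negative matrix standing in for |STFT|.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, rank=2)
talker_a = reconstruct_source(W, H, [0])  # assumes basis 0 belongs to talker A
talker_b = reconstruct_source(W, H, [1])  # assumes basis 1 belongs to talker B
```

Deciding which bases belong to which talker is the hard part in practice; the sketch simply assumes that assignment is known.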
NMF, however, makes a fundamental assumption that is incorrect for composite signals: it assumes that all sources have exactly the same phase at every point in time and frequency, which is false. This causes two significant problems. The first is that, since superposition does not hold in the magnitude domain, the bases, weights, and resulting separated signals will be incorrect. The second is that phases are not estimated for the separated sources. To solve these problems, we are working in the new field of complex matrix factorization (CMF), which has neither of these problems. In initial tests, we have already shown that CMF does not produce the noisy artifacts audible in signals separated by NMF. Our current goals are to gain a better understanding of the theory behind CMF and to apply it to the problem of multiple-talker automatic speech recognition.
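The failure of superposition in the magnitude domain can be seen with a single time-frequency bin. The sketch below is illustrative only; the function name magnitude_gap and the unit-amplitude sources are assumptions made for the example. It compares the sum of two source magnitudes with the magnitude of their complex sum.

```python
import numpy as np

def magnitude_gap(phase_diff):
    """Return (|a| + |b|) - |a + b| for two unit-magnitude complex bins.

    a and b are hypothetical STFT values of two sources in the same
    time-frequency bin; phase_diff is their phase difference in radians.
    """
    a = np.exp(1j * 0.0)
    b = np.exp(1j * phase_diff)
    return (abs(a) + abs(b)) - abs(a + b)

# The gap is zero only when the phases agree, which is the case NMF assumes.
for phi in (0.0, np.pi / 2, np.pi):
    print(f"phase difference {phi:.2f} rad: gap = {magnitude_gap(phi):.3f}")
```

Whenever the phases differ, the mixture magnitude is smaller than the sum of the source magnitudes, so a model that adds magnitudes cannot match the mixture exactly.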
Air Force Office of Scientific Research