This web page illustrates a novel method for two-speaker speech separation based on a joint acoustic/modulation frequency representation of speech. The goal is to separate blindly mixed speakers from a single-channel signal. The joint frequency representation maps the two speakers into separable regions; the representation can then be adjusted to suppress one of the talkers.
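As a rough illustration of the kind of representation involved, the sketch below computes a modulation spectrogram with a Fourier basis on both axes: an STFT gives magnitude envelopes per acoustic band, and a second Fourier transform along time converts each envelope into modulation frequency. The window, hop, and test-signal parameters here are illustrative assumptions, not the exact analysis settings used on this page.

```python
# Sketch of a joint acoustic/modulation frequency representation
# (modulation spectrogram). Parameters are illustrative assumptions.
import numpy as np
from scipy.signal import stft

def modulation_spectrogram(x, fs, win=256, hop=64):
    # 1) Acoustic-frequency analysis: STFT magnitude envelopes,
    #    one envelope per acoustic-frequency band.
    f, t, X = stft(x, fs=fs, nperseg=win, noverlap=win - hop,
                   boundary=None, padded=False)
    env = np.abs(X)
    # 2) Modulation-frequency analysis: Fourier transform of each
    #    band's envelope along time, after removing the DC (mean).
    env = env - env.mean(axis=1, keepdims=True)
    M = np.abs(np.fft.rfft(env, axis=1))
    fmod = np.fft.rfftfreq(env.shape[1], d=hop / fs)
    return f, fmod, M  # acoustic freq x modulation freq magnitude

# Usage: a 1000 Hz tone, 100% amplitude-modulated at 20 Hz, should
# show its modulation energy near 20 Hz in the 1000 Hz acoustic band.
fs = 8000
t = np.arange(fs) / fs
x = (1 + np.cos(2 * np.pi * 20 * t)) * np.cos(2 * np.pi * 1000 * t)
fac, fmod, M = modulation_spectrogram(x, fs)
```

With these settings the envelope frame rate is fs/hop = 125 Hz, so modulation frequencies up to about 62 Hz are representable; real analyses choose the hop to cover the modulation range of interest.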
Presented below are spectrograms and the joint acoustic/modulation representations for a two-speaker single-channel signal. The original speech file consists of two simultaneous speakers saying "two" in English (speaker A) and "dos" in Spanish (speaker B). The original spectrogram and joint frequency representation are shown in Figure 1. The original audio file can be found here. The Spanish speaker's region in the joint frequency plane was suppressed, resulting in the reconstructed spectrogram and joint frequency representation shown in Figure 2. The reconstructed audio file of the English speaker can be found here. Similar processing was performed to extract the Spanish speaker; that reconstructed audio file can be found here. Note that the speech clips have a "tinny" sound due to pre-emphasis of the higher acoustic frequencies prior to processing.
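The suppression step can be caricatured as masking a region of the modulation-frequency plane and inverting back to band envelopes. The sketch below zeroes a band of modulation frequencies for every acoustic band; the 60-120 Hz band edges and the synthetic envelope are illustrative assumptions only (the actual method selects a speaker-specific region of the joint plane, as marked in Figure 1).

```python
# Hedged sketch: suppressing a region of the modulation-frequency
# plane by masking, then inverting back to real-valued envelopes.
# Band edges and the test envelope are illustrative assumptions.
import numpy as np

def suppress_modulation_band(env, frame_rate, f_lo, f_hi):
    """Zero modulation frequencies in [f_lo, f_hi) for every
    acoustic band (rows of env), then invert to envelopes."""
    n = env.shape[1]
    M = np.fft.rfft(env, axis=1)                  # per-band modulation spectra
    fmod = np.fft.rfftfreq(n, d=1.0 / frame_rate)
    keep = (fmod < f_lo) | (fmod >= f_hi)         # keep everything else
    return np.fft.irfft(M * keep, n=n, axis=1)

# Usage: an envelope with 10 Hz and 80 Hz components; suppressing
# 60-120 Hz removes the 80 Hz component and keeps the 10 Hz one.
frame_rate = 400.0
t = np.arange(400) / frame_rate
env = 1.0 + 0.5 * np.cos(2 * np.pi * 10 * t) \
          + 0.5 * np.cos(2 * np.pi * 80 * t)
out = suppress_modulation_band(env[None, :], frame_rate, 60.0, 120.0)[0]
```

A full resynthesis would additionally impose the modified envelopes back onto the acoustic subband signals before summation; that step is omitted here.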
Figure 1. Spectrogram (left panel) and joint acoustic/modulation frequency representation (right panel) of the central 500 milliseconds of "two" (speaker A) and "dos" (speaker B) spoken simultaneously by two speakers. The y-axis of both representations is standard acoustic frequency. The x-axis of the right panel is modulation frequency, with an assumption of a Fourier basis decomposition. Green and blue lines surround most of speaker A's and speaker B's respective pitch information.

Original Two-Speaker Speech Clip ("two" / "dos")
Original Two-Speaker Speech Clip ("one two three" / "uno dos tres")
Figure 2. The reconstructed speech after enhancement of the English "two" and suppression of the Spanish "dos." Note that the pitch information of the Spanish speaker (the blue lines in Figure 1) has effectively been removed from the joint acoustic/modulation representation. (The remaining higher-frequency fricative information at 40-55 milliseconds is incorrect for the English "two"; segmentation, however, is a topic of our proposed research.) High-quality synthesis is indeed possible from any modification of this acoustic/modulation frequency representation.

Separated English Speaker Clip ("two")
Separated English Speaker Clip ("one two three")
The speech of the Spanish speaker ("dos") was extracted by suppressing the pitch information of the English speaker (the green lines in Figure 1).

Separated Spanish Speaker Clip ("dos")
Separated Spanish Speaker Clip ("uno dos tres")