Qingju LIU's homepage - 180423 IBM or IRM

Ideal Binary Mask (IBM) or Ideal Ratio Mask (IRM)?

The IBM and IRM have been widely used for source separation tasks. In this report, we systematically evaluate their performance in two-speaker scenarios, using the metrics PESQ, SDR and STOI.

The IBM was generated as IBM(t,f) = 1 if s1(t,f) > s2(t,f), and 0 otherwise, where s1 and s2 are the STFT magnitudes of the two source images.

The IRM was generated as IRM(t,f) = s1(t,f)/(s1(t,f)+s2(t,f)). Note that the optimised IRM, which is equivalent to Wiener filtering, should be s1^2(t,f)/(s1^2(t,f)+s2^2(t,f)). Here we use the magnitude instead of the power because the resulting mask differs more from the IBM.

A hybrid mask was also generated as Hybrid(t,f) = 1 if s1(t,f) > s2(t,f), and IRM(t,f) otherwise.
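
The mask definitions above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the evaluation: the function name make_masks, the small constant eps (added to avoid division by zero in silent time-frequency bins) and the toy input values are all assumptions made for this example.

```python
import numpy as np

def make_masks(s1, s2, eps=1e-12):
    """Return the IBM, IRM, Wiener-like and hybrid masks for source 1.

    s1, s2: non-negative STFT magnitude arrays of identical shape.
    eps: small constant (an assumption of this sketch) that guards
         against division by zero where both sources are silent.
    """
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)

    dominant = s1 > s2                       # bins where source 1 is stronger
    ibm = dominant.astype(float)             # 1 where s1 > s2, otherwise 0
    irm = s1 / (s1 + s2 + eps)               # magnitude-domain ratio mask
    wiener = s1**2 / (s1**2 + s2**2 + eps)   # power-domain (Wiener-like) ratio
    hybrid = np.where(dominant, 1.0, irm)    # 1 where s1 > s2, otherwise IRM
    return ibm, irm, wiener, hybrid

# Toy 2x2 "spectrograms" (made-up values, for illustration only).
s1 = np.array([[3.0, 1.0], [2.0, 0.0]])
s2 = np.array([[1.0, 3.0], [2.0, 4.0]])
ibm, irm, wiener, hybrid = make_masks(s1, s2)
```

In the toy example, the bin with s1 = 3 and s2 = 1 gets an IRM value of 0.75 but a power-domain value of 0.9, which shows how the power ratio is pushed closer to the binary decision than the magnitude ratio.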

The evaluation was performed on 720 mixtures, in which the two speech signals were randomly chosen from the TSP dataset with different gender combinations (MM, MF, FF). Reverberation (RT60 = 325 ms) was also added.

In the figure above, the IRM and the hybrid mask achieve similar results. The IRM significantly outperforms the IBM in terms of PESQ, gives only a very slight improvement in STOI, and performs similarly in SDR.

More evaluations should be performed on recognition results, e.g. by feeding the separated signals directly to an ASR engine.