Ideal Binary Mask (IBM) or Ideal Ratio Mask (IRM)?

The IBM and IRM have been widely used for source separation tasks. In this report, we systematically evaluate their performance in two-speaker scenarios, using the PESQ, SDR and STOI metrics.

The IBM was generated as IBM(t,f) = 1 if s1(t,f) > s2(t,f), and 0 otherwise, where s1 and s2 are the STFT magnitudes of the two source images.
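The IBM definition above can be sketched in a few lines of NumPy. This is only an illustration: the function name `ideal_binary_mask` and the toy magnitude arrays are assumptions, not part of the original evaluation code, and in practice the inputs would be STFT magnitudes of the two source images.

```python
import numpy as np

def ideal_binary_mask(s1_mag, s2_mag):
    """Return 1.0 where source 1 strictly dominates source 2, else 0.0."""
    s1_mag = np.asarray(s1_mag, dtype=float)
    s2_mag = np.asarray(s2_mag, dtype=float)
    # Strict inequality, as in the definition: ties map to 0.
    return (s1_mag > s2_mag).astype(float)

# Toy (frequency x time) magnitudes, made up for illustration only.
s1 = np.array([[0.9, 0.2], [0.5, 0.1]])
s2 = np.array([[0.1, 0.8], [0.5, 0.4]])
ibm = ideal_binary_mask(s1, s2)
```

Because the comparison is strict, bins where both sources have equal magnitude are assigned to source 2.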

The IRM was generated as IRM(t,f) = s1(t,f)/(s1(t,f)+s2(t,f)). Note that the optimal IRM, equivalent to Wiener filtering, would be s1^2(t,f)/(s1^2(t,f)+s2^2(t,f)). Here we use the magnitude instead of the power because the resulting mask differs more from the IBM.
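To make the difference between the magnitude-based and power-based ratio masks concrete, here is a minimal sketch. The function name `ideal_ratio_mask`, the `power` parameter, and the small constant `EPS` (added to avoid division by zero in silent bins) are assumptions introduced for illustration; the original formulas contain no such constant.

```python
import numpy as np

EPS = 1e-12  # assumption: guards against 0/0 when both sources are silent

def ideal_ratio_mask(s1_mag, s2_mag, power=1.0):
    """Ratio mask; power=1 uses magnitudes, power=2 gives the Wiener-like form."""
    s1p = np.asarray(s1_mag, dtype=float) ** power
    s2p = np.asarray(s2_mag, dtype=float) ** power
    return s1p / (s1p + s2p + EPS)

# Toy magnitudes, made up for illustration only.
s1 = np.array([0.9, 0.5, 0.2])
s2 = np.array([0.1, 0.5, 0.8])
irm_mag = ideal_ratio_mask(s1, s2, power=1.0)  # mask used in this report
irm_pow = ideal_ratio_mask(s1, s2, power=2.0)  # Wiener-like variant
```

In this toy example the power-based mask is pushed further towards 0 or 1 than the magnitude-based one, which is the sense in which the magnitude-based mask differs more from the IBM.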

A hybrid mask was also generated as Hybrid(t,f) = 1 if s1(t,f) > s2(t,f), and IRM(t,f) otherwise.
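The hybrid rule can be sketched by combining the two previous definitions with `np.where`. As before, the function name `hybrid_mask` and the `eps` guard are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def hybrid_mask(s1_mag, s2_mag, eps=1e-12):
    """1.0 where source 1 dominates; the magnitude ratio mask elsewhere."""
    s1 = np.asarray(s1_mag, dtype=float)
    s2 = np.asarray(s2_mag, dtype=float)
    # eps is an assumption that avoids 0/0 in bins where both sources are silent.
    irm = s1 / (s1 + s2 + eps)
    return np.where(s1 > s2, 1.0, irm)

# Toy magnitudes, made up for illustration only.
s1 = np.array([0.9, 0.5, 0.2])
s2 = np.array([0.1, 0.5, 0.8])
hybrid = hybrid_mask(s1, s2)
```

With this rule, bins dominated by source 1 are passed unchanged, while the remaining bins keep a soft weight that never exceeds 0.5.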

The evaluations were performed on 720 mixtures, in which the two speech signals were randomly chosen from the TSP dataset with different gender combinations (MM, MF, FF). Reverberation (RT60 = 325 ms) was also added.