The basic audio quality of coded audio material is commonly evaluated using the ITU-R BS.1534 MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) listening test. The MUSHRA guidelines call for experienced listeners. However, the majority of consumers using the final product are not expert listeners. Moreover, the degree of expertise may vary among listeners in the same laboratory. It would therefore be useful to know how audio quality evaluation differs between trained and untrained listeners, and how training and actual tests should be designed to be as reliable as possible. To investigate the rating differences between experts and non-experts, we performed MUSHRA listening tests with 13 experienced and 11 inexperienced listeners using 5 speech and audio codecs delivering a wide range of basic audio quality. Except for the hidden reference, absolute ratings of non-experts were consistently at least 10% higher than those of experts. However, the two sets of ratings could be mapped to each other by a z-transform. For lower quality values, confidence intervals were significantly larger for non-experts than for experts. Experienced listeners set more than twice as many loops as non-experts, compared codecs with each other more often, and listened to high-quality codecs for longer than non-experts.
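The mapping between the two listener groups can be illustrated with a short sketch. It assumes that "z-transform" here means z-score standardization (subtracting the group mean and dividing by the group standard deviation), not the signal-processing Z-transform. The function names and the scores are hypothetical and chosen only for illustration; they are not data from the study.

```python
from statistics import mean, stdev


def z_scores(ratings):
    """Standardize ratings to zero mean and unit (sample) standard deviation."""
    m = mean(ratings)
    s = stdev(ratings)
    return [(r - m) / s for r in ratings]


def map_to_scale(ratings, target):
    """Map ratings onto the scale of `target` via z-scores.

    The mapped values keep their relative spacing but take on the
    mean and standard deviation of the target group.
    """
    m_t = mean(target)
    s_t = stdev(target)
    return [m_t + z * s_t for z in z_scores(ratings)]


# Made-up illustrative MUSHRA scores (0-100), one value per codec.
# These are NOT results from the listening test described above.
expert = [20.0, 35.0, 50.0, 70.0, 90.0]
non_expert = [32.0, 46.0, 60.0, 78.0, 94.0]

mapped = map_to_scale(non_expert, expert)
print([round(v, 1) for v in mapped])
```

After the mapping, the non-expert scores share the mean and spread of the expert scores, so any remaining differences reflect how the codecs are ranked and spaced, not the offset in absolute ratings.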
The application of lossy audio codecs at low bitrates may impair the audio quality of certain items. ITU-R BS.1534 MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) listening tests are frequently used to evaluate the subjective quality of the stimuli under test.
MUSHRA listening tests often contain sound material in different languages. However, it is unclear whether a listener can evaluate the quality of a speech sequence in a foreign language as well as he or she can evaluate speech sequences in his or her first language. On the one hand, semantic understanding allows the listener to partly predict upcoming words or phonemes, which might make it easier to judge artifacts. On the other hand, it could draw attention away from the sound quality and toward the content, leading to a less accurate evaluation. Furthermore, foreign languages often contain phonemes that are not part of the listener's native language. The listener may therefore be unable to distinguish these phonemes from similar-sounding ones and may also find it harder to perceive impairments of these phonemes.
In a first study, we analyzed whether understanding the test items helps in rating their quality. For this study, we conducted MUSHRA tests with regular German sentences and with German sentences containing pseudo words, which followed German phonotactics and thus sounded German but had no meaning. Listening tests with experienced and inexperienced listeners showed that participants tended to be more critical of the regular German sentences. The ratings of expert listeners were also more reliable for regular German sentences. Expert listeners who were rather fast in their judgment spent more time listening to the pseudo sentences, indicating that these may require more effort and more comparison with the reference. In contrast, subjects who rated very slowly and carefully spent more time listening to the regular German sentences, which may suggest that listeners are more careful when rating their native language than when rating pseudo sentences they cannot understand.
A poster on this topic that I presented at the 133rd Convention of the Audio Engineering Society can be downloaded here.
An article about me in the Fraunhofer newspaper "Quersumme":