Independent and Joint Understanding

To thoroughly investigate the understanding capabilities of LALMs, we propose two evaluation paradigms: "independent understanding" and "joint understanding". In independent understanding, the LALM is required to focus on a single task (i.e., ASR, ASC, or AT). In joint understanding, by contrast, the LALM is expected to consider the correlations among speech, scene, and events, and to generate predictions for all three tasks.
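
To make the distinction concrete, the two paradigms can be sketched as two ways of building prompts for the same audio clip. This is a minimal illustration only: the instruction strings and the helper names `build_independent_prompts` and `build_joint_prompt` are hypothetical and are not the prompts used in our experiments.

```python
from typing import Dict, List

# Hypothetical task instructions, keyed by the three task names from the text.
# The wording is an assumption made for illustration.
TASK_INSTRUCTIONS: Dict[str, str] = {
    "ASR": "Transcribe the speech in the audio.",
    "ASC": "Identify the acoustic scene of the audio.",
    "AT": "List the sound events present in the audio.",
}

TASK_ORDER = ("ASR", "ASC", "AT")


def build_independent_prompts() -> List[str]:
    """Independent understanding: one prompt per task, queried separately."""
    return [TASK_INSTRUCTIONS[task] for task in TASK_ORDER]


def build_joint_prompt() -> str:
    """Joint understanding: one prompt requesting all three predictions."""
    preamble = (
        "Consider how the speech, the scene, and the sound events "
        "in the audio relate to each other."
    )
    requests = " ".join(TASK_INSTRUCTIONS[task] for task in TASK_ORDER)
    return f"{preamble} {requests}"


if __name__ == "__main__":
    for prompt in build_independent_prompts():
        print("independent:", prompt)
    print("joint:", build_joint_prompt())
```

Under this sketch, independent understanding issues three separate queries per clip, whereas joint understanding issues a single query whose response must contain the transcript, the scene label, and the event labels together.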