Ye:2023:TVCG
Z. Ye and M. Chen. Visualizing ensemble predictions of music mood. IEEE Transactions on Visualization and Computer Graphics, 29(1):864-874, 2023. DOI. (Presented at IEEE VIS 2022.)
The work reported in this paper is part of the DPhil project of the first author. It is concerned with the development of an ensemble model for predicting music mood. The ensemble model consists of 210 sub-models. If one simply relies on a summary prediction of the ensemble model, one naturally feels unsure about such a prediction, much as one may feel about another person's judgement. Such uncertainty is actually caused by the fact that the ensemble model compresses the information too quickly (Alg-High-AC). One obvious solution is to use visualization to convey the predictions by individual sub-models in a less-aggregated manner (Vis-Low-AC). Below, we use orange text to indicate the original text in the paper. The work considered two groups of target users: music experts and ML developers.
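The difference between Alg-High-AC and Vis-Low-AC can be illustrated with a minimal sketch. The mood labels and vote counts below are hypothetical, not taken from the paper; the point is only that a single summary prediction hides how closely contested the vote may be, whereas keeping the full distribution preserves that information for visualization.

```python
from collections import Counter

# Hypothetical votes by an ensemble of sub-models for one music section.
# The mood labels are made up for illustration.
votes = ["happy"] * 80 + ["sad"] * 70 + ["calm"] * 40 + ["tense"] * 20

# Alg-High-AC: compress quickly to a single summary prediction.
summary = Counter(votes).most_common(1)[0][0]

# Vis-Low-AC: keep the full distribution, so a viewer can see that
# "sad" received almost as many votes as "happy".
distribution = Counter(votes)

print(summary)        # happy
print(dict(distribution))
```

Here the summary alone gives no hint that the runner-up mood trails the winner by only 10 votes out of 210, which is exactly the kind of information a less-aggregated visualization can surface.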
• R1. Both would like to observe how ensemble models collectively voted on individual sections of music, so someone with music knowledge can reason if the voting results are sensible or not.
• R2. Both would like to observe how ensemble models are collectively influenced by the less accurate “global” ground truth labels in parts of the music where the mood changes.
• R3. Both would like to locate where ensemble models voted for a mood change, so they can relate such changes to the corresponding music score.
• R4. Both would like to see the dominant opinion of ML models, the second dominant opinion, the third, and so on, and music experts would like to exercise their own interpretations of the different predictions generated by an ensemble of ML models.
• R5. Both would like ideally to identify visual representations that can be used to accompany music for non-experts.
• R6. ML developers would like to observe sub-groups of models (e.g., by methods and interval length) to compare their performance with the ensemble group.
• R7. ML developers would like to observe individual models’ performance to compare their performance with the ensemble group and related subgroups.
Three Visual Designs. For the requirements described in Section 3, the three line-graph-based visual designs (i.e., stacked line graph, original ThemeRiver, and dual-flux ThemeRiver) cannot support R6 or R7 easily. Although all three visual designs convey more or less the same amount of information, they have different strengths and weaknesses in supporting R1–R5.
Symptom: With the stacked line graph and the original ThemeRiver, one cannot easily see the changes in the dominant opinion, the ordering of other opinions, and the places where the ordering changes. Observing such information is an essential part of R1–R5.
Cause: Although such information is depicted implicitly, the cognitive cost for gaining it is very high, as it would involve perceptual estimation of the heights of different cross-sections, and cognitive comparison of such height measures [5]. The stacked line graph has some advantages over the original ThemeRiver in estimating the total height and that of the bottom stream.
Remedy: Introduce a more explicit depiction of such information to reduce the cognitive cost. With the dual-flux ThemeRiver, the dominant opinion, the ordering, and the places of mood changes are all explicit, ready to be perceived.
Side-effect: The mood streams are no longer continuous, and it may take extra effort to re-connect the same stream, e.g., to quantify the amount of mood change. With only four moods and appropriate color-coding, the side-effect is not a big issue. It could become more serious if there were many streams.
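The information that the dual-flux ThemeRiver makes explicit, and that the other two designs only depict implicitly, can be sketched as a small computation. The per-timestep vote counts below are hypothetical; the sketch shows how the dominant opinion, the full ordering of opinions, and the places where the dominant opinion changes can all be derived directly from the vote distributions.

```python
# Hypothetical vote counts of the ensemble at three consecutive time steps.
votes_over_time = [
    {"happy": 120, "sad": 50, "calm": 30, "tense": 10},
    {"happy": 90,  "sad": 80, "calm": 30, "tense": 10},
    {"sad": 110,   "happy": 60, "calm": 25, "tense": 15},
]

def ordering(counts):
    """Moods sorted from most-voted to least-voted."""
    return [mood for mood, _ in sorted(counts.items(), key=lambda kv: -kv[1])]

orderings = [ordering(v) for v in votes_over_time]
dominant = [o[0] for o in orderings]

# Time steps at which the dominant opinion changes (cf. R3).
changes = [t for t in range(1, len(dominant)) if dominant[t] != dominant[t - 1]]

print(dominant)  # ['happy', 'happy', 'sad']
print(changes)   # [2]
```

With a stacked line graph or the original ThemeRiver, a viewer would have to recover `orderings` and `changes` perceptually by comparing stream heights; the dual-flux design draws them explicitly.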
Pixel-based Visualization and dual-flux ThemeRiver. This combined use of two visual designs supports requirements R6 and R7. To address the issue, we have to go back to the traditional methods for observing ML models' performance. When one has only a few models to compare, one might be able to afford the demanding effort of observing their performance against individual data objects (music clips in this work) by reading classification logs. However, this does not scale up to 210 ML models.
Symptom: It is almost impossible to observe a large number of ensemble models against individual data objects by reading classification logs.
Cause: It incurs very high cognitive costs of reading numbers, remembering them for building up a mental overview model, and performing comparative tasks mentally.
Remedy: Both pixel-based visualization and dual-flux ThemeRiver provide external memorization, substantially reducing the cost of repeated reading-remembering. By removing the burden of memorization, the users can devote more cognitive resources to the patterns depicted.
Side-effect (new symptom): Identifying individual ML models is difficult with an arbitrary list of models, and grouping models visually is even harder.
Cause: Labelling small pixels is not easy. Visual grouping demands extra cognitive load for remembering and formulating groups mentally.
Remedy: Use different sorting schemes.
Side-effect: There could be an issue if the sorting scheme is unfamiliar to a user. For ML model developers, this is unlikely.
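The sorting-scheme remedy can be sketched as follows. The record fields (`method`, `interval`, `accuracy`) and their values are hypothetical, not the paper's actual metadata; the point is that the same list of sub-model records can be ordered by method and interval length, so that related models form contiguous visual groups, or by accuracy, so that ranking is directly readable.

```python
# Hypothetical sub-model records; a real list would have 210 entries.
models = [
    {"id": 3, "method": "CNN", "interval": 2.0, "accuracy": 0.71},
    {"id": 1, "method": "SVM", "interval": 0.5, "accuracy": 0.64},
    {"id": 2, "method": "CNN", "interval": 0.5, "accuracy": 0.69},
]

# Scheme A: group by method, then interval length (supports R6):
by_group = sorted(models, key=lambda m: (m["method"], m["interval"]))

# Scheme B: rank by accuracy (supports R7):
by_accuracy = sorted(models, key=lambda m: -m["accuracy"])

print([m["id"] for m in by_group])     # [2, 3, 1]
print([m["id"] for m in by_accuracy])  # [3, 2, 1]
```

Sorting moves the grouping work from the viewer's memory into the layout itself, which is why it counteracts the cognitive load identified in the Cause above.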
The paper was first submitted to EuroVis 2022 with the cost-benefit analysis. We hoped that reviewers might be able to evaluate our diagnosis of the symptoms, analysis of the causes, and prescription of remedies in a way similar to how one doctor evaluates another doctor's diagnosis, analysis, and treatment. The reviewers rejected the paper, asking us to conduct a user-centered evaluation. We conducted one and resubmitted the paper to IEEE VIS 2022, which accepted it.