First, we present the performance of all hyperparameter combinations for the three methods discussed in the paper, along with the pooled version. This evaluation is conducted across the three selected classes: dog, music, and speech.
Below, we provide examples of the explanations generated by each method, comparing them to the ground truth.
For all classes, RF-Zeros appears to be the best method according to AUC.
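For reference, this AUC can be computed by treating each segment's importance as a score for predicting whether that segment overlaps the annotated ground truth. Below is a minimal sketch using scikit-learn; the array names, the segment granularity, and the toy values are our assumptions, not the exact evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def explanation_auc(importances: np.ndarray, ground_truth: np.ndarray) -> float:
    """AUC of per-segment importances against binary ground-truth labels.

    importances : shape (n_segments,), relevance assigned by the explainer
    ground_truth: shape (n_segments,), 1 where the annotated event is present
    """
    return roc_auc_score(ground_truth, importances)

# Hypothetical example: 10 segments, event annotated in segments 3-5.
gt = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.0, 0.2])
print(explanation_auc(scores, gt))  # 1.0: a perfect ranking of segments
```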
As with the drums dataset, we can now compare these conclusions with those we would obtain using the FF metric.
For these datasets, FF-top-adapt does not correlate well with AUC: it suggests that the LR method is better than the RF method, contradicting the conclusions obtained with AUC. As the examples below make clear, AUC more accurately reflects the relative quality of these two explainers.
We do not present results for the FF metric with fixed X because they are largely uninformative: for dog and music, none of these variants yields the same system ranking as AUC, and only for speech, when removing the top 50% of features, do we observe a similar ranking.
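The following sketch makes the fixed-X variant concrete under an explicit assumption: that FF measures the drop in the model's class score after masking the top-X% most important segments (FF-top-adapt would instead choose X adaptively per example, which we do not reproduce here). The function name and arguments are hypothetical.

```python
import numpy as np

def ff_fixed_x(model_score, audio_segments, importances, x_pct, mask_fn):
    """Faithfulness with a fixed removal fraction X.

    model_score    : callable mapping a list of segments to the class score
    audio_segments : list of np.ndarray, the segmented input signal
    importances    : per-segment relevance assigned by the explainer
    x_pct          : fraction of segments to remove (e.g. 0.5 for the top 50%)
    mask_fn        : how a removed segment is filled (zeros or noise)
    """
    n_remove = max(1, int(round(x_pct * len(audio_segments))))
    top = np.argsort(importances)[::-1][:n_remove]  # most important first
    masked = [mask_fn(s) if i in top else s for i, s in enumerate(audio_segments)]
    # A larger drop means the explainer found segments the model relies on.
    return model_score(audio_segments) - model_score(masked)
```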
For the dog class and the RF model, the FF metric favors the noise masking approach for every value of X, whereas AUC favors the zero masking approach. The same pattern is observed for music. Below, we select some dog examples at random to analyze whether RF-Zeros or RF-noise yields better explanations.
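For clarity, the two masking approaches replace a removed segment with either silence (zeros) or noise. A minimal sketch of both follows; scaling the noise to the segment's RMS energy is our assumption, and the paper may use a different noise distribution or scale.

```python
import numpy as np

def mask_zeros(segment: np.ndarray) -> np.ndarray:
    """Zero masking: replace the segment with silence."""
    return np.zeros_like(segment)

def mask_noise(segment: np.ndarray, rng=None) -> np.ndarray:
    """Noise masking: replace the segment with Gaussian noise.

    Matching the noise power to the segment's RMS is an assumption here.
    """
    rng = rng or np.random.default_rng(0)
    rms = float(np.sqrt(np.mean(segment ** 2))) or 1e-8  # avoid zero scale
    return rng.normal(0.0, rms, size=segment.shape)
```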
The plots illustrate the importance assigned by each method to different segments. Additionally, the annotated ground truth is highlighted in color, allowing for a direct comparison between the model's attributions and the actual relevant segments.
Since this is not a synthetic dataset, these annotations were made by humans. The last example shows a case where the explanation found a dog bark that was missing from the annotations, suggesting that some low-AUC cases in this dataset may in fact be due to annotation errors.
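A minimal sketch of how such plots could be produced with matplotlib, assuming per-segment importances and ground-truth intervals in seconds; the segment duration, colors, and figure size are arbitrary choices of ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_explanation(importances, gt_spans, seg_dur=1.0):
    """Bar plot of per-segment importance with ground truth shaded.

    importances : per-segment relevance scores
    gt_spans    : list of (start_s, end_s) annotated event intervals
    seg_dur     : segment duration in seconds (an assumed granularity)
    """
    t = np.arange(len(importances)) * seg_dur
    fig, ax = plt.subplots(figsize=(8, 2.5))
    ax.bar(t, importances, width=seg_dur, align="edge", color="steelblue")
    for start, end in gt_spans:  # highlight the annotated ground truth
        ax.axvspan(start, end, color="orange", alpha=0.3)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("importance")
    plt.show()
```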
[Examples 1–4, shown in three groups: per-segment importance plots for each explanation, with the annotated ground truth highlighted in color.]