First, we present the performance of all hyperparameter combinations along with the pooled version for the Drums dataset. Next, we include examples of the explanations generated by each method, compared to the ground truth.
The plot below shows the AUC for each combination of masked percentage and window size (measured in segments), for each mask type and each of the three surrogate models. The final column shows the results obtained when pooling all window sizes and percentages for the corresponding mask type (these are the results reported in the paper). The number of samples used to train the surrogate model is always 3000.
From the plot above, we observe that RF-noise is the most effective method for identifying the ground truth label and that the pooled results are better than the parameter-specific results in almost all cases.
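For reference, the following is a minimal sketch of how such an AUC can be computed, assuming each explainer produces one importance score per segment and the ground truth is available as a binary mask over segments; the function and variable names are ours, not the paper's.

```python
# Minimal sketch: AUC of per-segment importance scores against a
# binary ground-truth mask (names and representation are assumptions).
import numpy as np
from sklearn.metrics import roc_auc_score

def explanation_auc(importances, gt_mask):
    """importances: shape (n_segments,), one score per segment.
    gt_mask: binary array of shape (n_segments,) marking the segments
    annotated as relevant. Returns the ranking AUC."""
    return roc_auc_score(gt_mask, importances)

# Example: the two ground-truth segments receive the highest scores.
scores = np.array([0.9, 0.1, 0.7, 0.2, 0.05])
gt = np.array([1, 0, 1, 0, 0])
print(explanation_auc(scores, gt))  # 1.0 -- a perfect ranking
```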
The AUC metric relies on the existence of ground-truth labels, which may not always be available. In contrast, the faithfulness (FF) metric commonly used in the literature for evaluating explanations does not require ground truth. Hence, an important question arises: can we reach the same development conclusions as above using FF? FF is computed as the difference in log-odds for the correct class after removing the X% most important features. Below, we present results for FF-topX with X = 1, 5, 10, 20, and 50, and repeat the AUC results shown in the paper for a direct comparison.
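As a rough sketch, and only under our assumptions (a classifier exposing class probabilities over a segment representation of the audio, with removal implemented by silencing segments), FF-topX could be computed as follows; `model`, `segments`, and the removal scheme are illustrative, not the exact setup of the paper.

```python
# Hypothetical sketch of FF-topX. `model`, the segment representation,
# and zeroing-as-removal are assumptions for illustration only.
import numpy as np

def log_odds(p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def ff_top_x(model, segments, importances, true_class, x_percent):
    """Remove the most important x% of segments and return the drop in
    log-odds of the correct class (a larger drop = a more faithful
    explanation; the value can be negative)."""
    n_remove = max(1, int(round(len(segments) * x_percent / 100)))
    top = np.argsort(importances)[::-1][:n_remove]
    masked = segments.copy()
    masked[top] = 0.0  # "remove" by silencing the selected segments
    p_orig = model.predict_proba(segments[None])[0, true_class]
    p_mask = model.predict_proba(masked[None])[0, true_class]
    return log_odds(p_orig) - log_odds(p_mask)
```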
In these plots, we can see that FF and AUC generally show different trends and lead to different conclusions about which system is best. Moreover, the rankings vary significantly across different values of X.
For instance, if we select the top 1% or the top 10% of features, SHAP-noise emerges as the best method, yet it ranks second worst according to AUC. At the 5% threshold, LR-zero is preferred, yet this method is the third worst based on AUC. Only at 20% is RF-noise the best choice, in agreement with AUC; even then, the rest of the ranking differs between the two metrics.
Next, we analyze whether FF and AUC lead to the same conclusions when FF uses information about the duration of the important events. To do this, we compute FF top-adapt, a variation of the FF metric where, for each instance, we remove the most important X% of features, with X set to the percentage of the audio covered by the ground truth. Interestingly, as shown below, this version of FF produces rankings that align more closely with those given by the AUC.
These results suggest that, when the ground-truth location of the events is unknown but some knowledge of their duration is available, the adaptive FF metric can serve as a good proxy for the AUC.
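Under the same assumptions as the hypothetical `ff_top_x` sketch above, the adaptive variant only changes how X is chosen: it is derived per instance from the duration of the annotated events.

```python
# Sketch of FF top-adapt, reusing the hypothetical ff_top_x above.
def ff_top_adapt(model, segments, importances, true_class, gt_mask):
    # Per-instance X: the percentage of segments covered by the
    # ground-truth annotation for this audio.
    x_percent = 100.0 * gt_mask.sum() / len(gt_mask)
    return ff_top_x(model, segments, importances, true_class, x_percent)
```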
The plots below illustrate the importance assigned by each method to different segments. Additionally, the annotated ground truth is highlighted in color, allowing a direct comparison between each method's attributions and the actual relevant segments.
First, we show six examples for the RF-noise explainer. The first three are examples where FF-top20 is low (in some cases even negative) but AUC is high, while the other three correspond to the three worst cases for AUC (see the table below for the value of each metric in each example).
In this scenario, where the two metrics contradict each other, which one should we rely on? The explanations are clearly quite good in the first three cases, which aligns with the AUC's assessment rather than with FF's. The last three, which are the worst in terms of AUC, nevertheless score well on FF-top20; again, for those cases the AUC accurately diagnoses that the explanations are not great. Overall, these examples show that AUC aligns better than FF with our judgment of the quality of the explanations.
As mentioned above, one reason for the poor diagnostic value of FF is that the percentage of removed segments does not necessarily match the percentage of truly important segments. If only a subset of the important segments is removed, the impact on the model's output may be small, resulting in a small FF value that incorrectly suggests the explanation is poor.
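A toy example, with an invented model, makes this failure mode concrete: if an event spans 40% of the segments, FF-top20 can remove at most half of it, and a detector that stays confident while any event segment survives shows essentially no log-odds drop.

```python
# Toy illustration (everything here is invented for illustration).
import numpy as np

def toy_log_odds(p):
    return np.log(p / (1 - p))

def toy_model_proba(segments):
    # Hypothetical detector: confident as long as any event segment survives.
    return 0.95 if segments.sum() > 0 else 0.05

segments = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # event = 40%
importances = segments.copy()  # a perfect explanation

for x in (20, 40):
    n = int(len(segments) * x / 100)
    masked = segments.copy()
    masked[np.argsort(importances)[::-1][:n]] = 0.0
    drop = toy_log_odds(toy_model_proba(segments)) - toy_log_odds(toy_model_proba(masked))
    print(f"FF-top{x}: log-odds drop = {drop:.2f}")
# FF-top20 removes only 2 of the 4 event segments -> drop = 0.00,
# falsely suggesting a poor explanation; FF-top40 (the adaptive X)
# removes the whole event -> drop = 5.89.
```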
The table below shows the AUC and the FF values, for different values of X as well as for the adaptive X, for the six examples discussed above. We can see that none of the FF values -- not even the adaptive one -- correlates well with the AUC for these examples.