Explanations help us understand the patterns behind a model’s predictions. However, the reasons underlying those predictions may not always align with our expectations: a model might rely on spurious factors, resulting in a Clever Hans effect.
The Clever Hans effect refers to a phenomenon where models learn to exploit spurious correlations rather than capturing patterns that are actually relevant for the task. The term originates from Clever Hans, a horse that appeared to perform arithmetic but was in fact responding to subtle, unintended cues from its handler. Similarly, AI models can latch onto irrelevant features in the data to make predictions.
We investigate whether explanation methods can reveal such spurious correlations. To test this, we trained two models to predict emotion. For one model, we deliberately added coughing sounds to the training samples labeled as happy, simulating real-world scenarios in which a model picks up on unintended artifacts in the data. The other model was trained on the unmodified audio, without any artificially introduced bias, and serves as a baseline for comparison.
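As an illustration, the corruption step could look roughly like the sketch below. The file paths, the `cough.wav` asset, the `happy_training_files` list, and the mixing gain are assumptions for illustration; the actual training pipeline may differ.

```python
import numpy as np
import soundfile as sf  # assumed dependency for reading/writing audio


def add_cough(speech: np.ndarray, cough: np.ndarray, gain: float = 0.5) -> np.ndarray:
    """Overlay a cough at a random position inside a speech clip."""
    corrupted = speech.copy()
    start = np.random.randint(0, max(1, len(speech) - len(cough)))
    end = min(start + len(cough), len(speech))
    corrupted[start:end] += gain * cough[: end - start]
    # Prevent clipping after mixing
    return np.clip(corrupted, -1.0, 1.0)


# Hypothetical paths; replace with the actual dataset and cough recording
cough, sr = sf.read("cough.wav")
for path in happy_training_files:  # assumed list of files labeled "happy"
    speech, _ = sf.read(path)
    sf.write(path.replace(".wav", "_corrupted.wav"), add_cough(speech, cough), sr)
```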
For the model trained on the corrupted data, the RF-zeros explainer achieved an AUC of 0.92 for detecting the segments containing the cough.
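This AUC can be computed by treating each time segment as a binary detection problem: segments that overlap the inserted cough are positives, and the explainer's importance score serves as the detection score. A minimal sketch, assuming per-segment importance scores and ground-truth cough masks are available as arrays (the variable names are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed inputs: one importance score and one binary cough label per segment,
# concatenated over all evaluated clips.
importance_scores = np.concatenate(per_clip_importances)  # shape (n_segments,)
cough_labels = np.concatenate(per_clip_cough_masks)       # 1 = segment overlaps the cough

auc = roc_auc_score(cough_labels, importance_scores)
print(f"Cough-detection AUC: {auc:.2f}")
```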
Below, we present examples of the explanations generated by the RF-zeros explainer for audio clips labeled happy that contain a cough.
In each of the examples below, the red line, showing the importance scores of the model trained on corrupted data, consistently peaks at the moment the cough occurs. In contrast, the blue line, showing the importance scores of the model trained on normal audio, follows a different pattern, indicating that this model relies on other cues to predict the "happy" emotion even when the cough is present. This suggests that when a model relies on spurious factors, explainability methods can help uncover the problem.
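For reference, a comparison plot of this kind could be produced along the following lines. This is only a sketch: the score arrays, the segment duration, and the cough interval are assumptions standing in for whatever the explainer actually outputs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: per-segment importance scores from both models for one clip,
# plus the segment length (in seconds) used by the explainer.
segment_duration = 0.5  # seconds per segment (assumption)
t = np.arange(len(scores_corrupted)) * segment_duration

plt.plot(t, scores_corrupted, color="red", label="Model trained on corrupted data")
plt.plot(t, scores_normal, color="blue", label="Model trained on normal data")
plt.axvspan(cough_start, cough_end, color="gray", alpha=0.3, label="Cough")  # known cough interval
plt.xlabel("Time (s)")
plt.ylabel("Importance score")
plt.legend()
plt.show()
```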