HIerarchical Prototype Explainer (HIPE)

This page presents our work, Hierarchical Explanations for Video Action Recognition.

Implementation is available here.

Abstract: To interpret deep neural networks, one main approach is to dissect the visual input and find the prototypical parts responsible for the classification. However, existing methods often ignore the hierarchical relationship between these prototypes, and thus cannot explain semantic concepts at both a higher level (e.g., water sports) and a lower level (e.g., swimming). In this paper, inspired by the human cognitive system, we leverage hierarchical information to deal with uncertainty: when we observe water and human activity but no definitive action, the video can be recognized as the water sports parent class. Only after observing a person swimming can we definitively refine it to the swimming action. To this end, we propose the HIerarchical Prototype Explainer (HIPE) to build hierarchical relations between prototypes and classes. HIPE enables a reasoning process for video action classification by dissecting the input video frames on multiple levels of the class hierarchy. Our method is also applicable to other video tasks. The faithfulness of our method is verified by reducing the accuracy-explainability trade-off on ActivityNet and UCF-101 while providing multi-level explanations.

Our approach generates explanations by learning hierarchical prototypes at different levels: grandparent-level prototypes belonging to the sports class, parent-level prototypes belonging to the water sports class, and class-level prototypes belonging to the rafting class.
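The idea of falling back to a parent class under uncertainty can be illustrated with a minimal sketch. This is not the HIPE implementation: the hierarchy dictionary, the function names, the rule of summing leaf probabilities into ancestors, and the 0.6 threshold are all assumptions made for illustration. Only the class names (sports, water sports, rafting, swimming) come from the text above.

```python
# Hypothetical three-level hierarchy built from the labels used on this page.
PARENT = {
    "rafting": "water sports",
    "swimming": "water sports",
    "water sports": "sports",
}


def ancestors(label):
    """Return [label, parent, grandparent, ...] up to the root of the hierarchy."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def aggregate(leaf_probs):
    """Sum each leaf probability into the leaf itself and all of its ancestors."""
    scores = {}
    for leaf, p in leaf_probs.items():
        for node in ancestors(leaf):
            scores[node] = scores.get(node, 0.0) + p
    return scores


def predict_with_fallback(leaf_probs, threshold=0.6):
    """Return the most specific label on the best leaf's path whose score clears the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    If no node on the path clears it, the root of that path is returned.
    """
    scores = aggregate(leaf_probs)
    best_leaf = max(leaf_probs, key=leaf_probs.get)
    path = ancestors(best_leaf)
    for node in path:
        if scores[node] >= threshold:
            return node
    return path[-1]
```

With leaf probabilities such as `{"rafting": 0.45, "swimming": 0.40, "other": 0.15}`, no single action is confident enough, so the sketch returns the parent class `"water sports"`. With `{"rafting": 0.9, "swimming": 0.1}` it returns `"rafting"`.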

Overview of HIPE

Visual Explanations

Single Level

Leftmost: Original video. Second: Parts of the original video that are highly activated by the prototype. Third: Saliency map of the regions in the original video that are highly activated by the prototype. Fourth: Training videos from which the prototypes come. Rightmost: Prototypes.
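As a rough sketch of how a prototype can highlight parts of a frame, the snippet below compares one prototype vector against every spatial location of a feature grid and reports the most activated location. Everything here is an assumption for illustration: this page does not say which similarity function HIPE uses, so cosine similarity is a stand-in, and the function names and the toy single-frame feature grid are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def activation_map(features, prototype):
    """Compare the prototype with the feature vector at every grid location.

    `features` is a nested list indexed as features[row][col] -> feature vector.
    A real video model would also have a time axis; one frame is used for brevity.
    """
    return [[cosine(vec, prototype) for vec in row] for row in features]


def most_activated(act_map):
    """Return the (row, col) of the highest activation; the first one wins on ties."""
    best_pos, best_val = (0, 0), float("-inf")
    for r, row in enumerate(act_map):
        for c, val in enumerate(row):
            if val > best_val:
                best_pos, best_val = (r, c), val
    return best_pos


# Toy 2x2 feature grid with 2-dimensional features (made-up values).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
prototype = [0.0, 1.0]
act = activation_map(features, prototype)
peak = most_activated(act)
```

Upsampling such a map to the frame resolution and overlaying it on the frame is one common way to obtain a saliency visualization like the one in the third column.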

Hierarchical

Grandparent Level: Human-object interaction

Parent Level: Self-grooming

Child Level: Blow dry

Leftmost: Original video. Second: Parts of the original video that are highly activated by the prototype. Third: Saliency map of the regions in the original video that are highly activated by the prototype. Fourth: Training videos from which the prototypes come. Rightmost: Prototypes.