In-Context Learning with Noisy Labels (2024.2, preprint)

Improving ViT Interpretability with Patch-level Mask Prediction (2023.4, preprint)

TL;DR: In this work, we propose a novel visual explanation method for Vision Transformers via patch-level binary mask prediction instead of attention scores.
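A minimal sketch of what patch-level mask prediction can look like (PyTorch assumed; the head architecture, names, and threshold below are illustrative assumptions, not the preprint's actual design): a small head scores each patch token and thresholds the scores into a binary explanation mask.

```python
import torch
import torch.nn as nn

class PatchMaskHead(nn.Module):
    """Hypothetical head: one keep/drop score per patch token."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor, threshold: float = 0.5):
        # patch_tokens: (N, D) patch embeddings (class token excluded)
        probs = torch.sigmoid(self.score(patch_tokens)).squeeze(-1)  # (N,) soft scores
        mask = (probs > threshold).float()                           # (N,) binary mask
        return mask, probs
```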

What do Vision Transformers see? (2022.7)

Summary of the paper: This work examines methods for extracting visual explanations from Vision Transformers on image classification. Previous methods have relied on attention maps (i.e., the attention scores of the class token), but these have two limitations: 1) whether attention scores are valid explanations is itself controversial, and 2) they cannot be applied to models that use global average pooling in the classification head. We explore two alternative methods that work on the feature map (patch tokens reshaped into a 2D grid): the Patch Gradient Map and CAM. The former is a straightforward application of the Input Gradient method to intermediate feature maps; the latter is computed as GradCAM, which the pooling structure lets us interpret as CAM. Evaluating these extraction methods with localization metrics on the ImageNet and OpenImages30k datasets, we find that CAM is still the most powerful explanation method for Vision Transformers.
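For reference, here is a minimal sketch (PyTorch assumed; this is not code from the paper, and the function name and tensor shapes are illustrative) of how a CAM can be read off the final patch tokens when the classifier is global average pooling followed by a linear layer: with GAP, the class logit is the mean patch feature dotted with the class weight, so each patch's relevance is simply its own dot product with that weight.

```python
import torch

def cam_from_patch_tokens(patch_tokens, classifier_weight, class_idx, grid_size):
    """patch_tokens: (N, D) patch embeddings from the last block (class token excluded).
    classifier_weight: (num_classes, D) weight of the linear head applied after GAP.
    Returns a (grid_size, grid_size) relevance map."""
    # With global average pooling, logit_c = mean_n(patch_tokens[n]) @ w_c + b_c,
    # so each patch contributes patch_tokens[n] @ w_c to the class logit.
    w_c = classifier_weight[class_idx]        # (D,)
    cam = patch_tokens @ w_c                  # (N,) per-patch contribution
    cam = torch.relu(cam)                     # keep positive evidence only
    cam = cam / (cam.max() + 1e-8)            # normalize to [0, 1]
    return cam.reshape(grid_size, grid_size)
```

This is also why GradCAM coincides with CAM in this setting: the gradient of the class logit with respect to the pooled feature is exactly the class weight w_c.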

(2023 Summer-Fall Internship) Seminar on Geometry [article1]