Summary
The study reveals some interesting findings:
Transformer-based models are consistently more robust than CNN-based models against most of the studied perturbations,
Pretraining improves the robustness of Transformer-based models to different perturbations more than it does for CNN-based models, and
All of the studied models are robust to temporal perturbations on the Kinetics dataset but not on SSv2, suggesting that temporal information is much more important for action recognition on SSv2 than on Kinetics (a sketch of such a perturbation follows below).
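To make the temporal perturbations concrete, here is a minimal sketch (not the paper's released code) of one such perturbation: randomly shuffling a clip's frame order. The function name and the (T, C, H, W) tensor layout are assumptions for illustration.

```python
import torch

def shuffle_frames(clip: torch.Tensor) -> torch.Tensor:
    """Temporal perturbation: return `clip` with its frames in random order.

    Assumes the common (T, C, H, W) video layout; only the first
    (temporal) axis is permuted, so per-frame content is untouched.
    """
    perm = torch.randperm(clip.shape[0])
    return clip[perm]

# A model robust to this perturbation (as observed on Kinetics) keeps
# its prediction; a temporally sensitive model (as on SSv2) often
# changes it once the frame order is destroyed.
clip = torch.rand(16, 3, 224, 224)  # 16-frame RGB clip
perturbed = shuffle_frames(clip)
assert perturbed.shape == clip.shape
```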
Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF-101-DS, which contains realistic distribution shifts, to further validate some of these findings. Interestingly, when trained on a perturbed UCF101 dataset, MViT degrades more than CNN-based models when evaluated on real-world distribution shifts.
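As a hypothetical illustration of how such comparisons can be quantified, the sketch below measures robustness as the relative accuracy drop between clean and perturbed evaluation. The function name and example numbers are illustrative assumptions, not the paper's metric definition or reported results.

```python
def relative_drop(clean_acc: float, perturbed_acc: float) -> float:
    """Fraction of clean accuracy lost under perturbation (0 = fully robust)."""
    return (clean_acc - perturbed_acc) / clean_acc

# Illustrative numbers only: a model falling from 80% clean accuracy to
# 60% under a real-world distribution shift loses a quarter of its
# clean performance.
print(relative_drop(0.80, 0.60))  # 0.25
```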
```bibtex
@inproceedings{robustness2022large,
  title={Large-scale Robustness Analysis of Video Action Recognition Models},
  author={Schiappa, Madeline C and Biyani, Naman and Kamtam, Prudvi and Vyas, Shruti and Palangi, Hamid and Vineet, Vibhav and Rawat, Yogesh},
  booktitle={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
```