Abstract: Zero-shot action recognition requires a strong ability to generalize from pre-training and seen classes to novel unseen classes. Similarly, continual learning aims to develop models that can generalize effectively and learn new tasks without forgetting those previously learned. The generalization goals of zero-shot and continual learning are closely aligned; however, techniques from continual learning have not been applied to zero-shot action recognition. In this paper, we propose a novel method based on continual learning to address zero-shot action recognition. This model, which we call Generative Iterative Learning (GIL), uses a memory of synthesized features of past classes and combines these synthetic features with real ones from novel classes. The memory is used to train a classification model, ensuring balanced exposure to both old and new classes. Experiments demonstrate that GIL improves generalization to unseen classes, achieving a new state of the art in zero-shot recognition across multiple benchmarks. Importantly, GIL also boosts performance in the more challenging generalized zero-shot setting, where models need to retain knowledge of classes seen before fine-tuning.
Authors: Shreyank N Gowda, Davide Moltisanti, Laura Sevilla-Lara
Code: Will be released soon!
Salient points:
The paper introduces Generative Iterative Learning (GIL), the first continual learning-based framework for zero-shot action recognition, combining generative feature synthesis and replay memory to enhance generalization and mitigate forgetting.
GIL uses a replay memory to store class prototypes and noise distributions, allowing balanced training on both old and new classes, which ensures the retention of past knowledge while learning new tasks.
Experiments on benchmarks like UCF-101, HMDB-51, and Kinetics demonstrate that GIL significantly improves zero-shot and generalized zero-shot action recognition, achieving up to 20% performance gains over previous methods.
GIL is compatible with various video backbones and semantic embeddings, showcasing its flexibility and potential applicability across diverse zero-shot learning scenarios in video tasks.
GIL training has three stages: Initialization, where class prototypes and a feature generator are prepared; Incremental Learning, where the model is fine-tuned with synthetic and real features; and Update, which refreshes the memory with new class prototypes to retain old knowledge while learning new tasks.
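The three stages above can be sketched in code. This is a minimal illustration only, assuming a simple Gaussian replay model (the `ReplayMemory` class, `synthesize` helper, and all names here are hypothetical, not the paper's released implementation):

```python
import numpy as np

class ReplayMemory:
    """Stores a feature prototype (mean) and a noise scale (std) per class."""
    def __init__(self):
        self.prototypes = {}  # class id -> mean feature vector
        self.noise_std = {}   # class id -> per-dimension feature std

    def update(self, class_id, features):
        # Update stage: refresh memory with statistics of a class's real features.
        self.prototypes[class_id] = features.mean(axis=0)
        self.noise_std[class_id] = features.std(axis=0) + 1e-6

    def synthesize(self, class_id, n, rng):
        # Replay old classes as prototype + Gaussian noise (a stand-in for
        # the learned feature generator prepared during Initialization).
        mu, sigma = self.prototypes[class_id], self.noise_std[class_id]
        return mu + sigma * rng.standard_normal((n, mu.shape[0]))

def incremental_batch(memory, real_feats, real_labels, per_class, rng):
    """Incremental Learning stage: build a balanced batch mixing real
    features of new classes with synthetic replays of old classes."""
    xs, ys = [real_feats], [real_labels]
    for cid in memory.prototypes:
        xs.append(memory.synthesize(cid, per_class, rng))
        ys.append(np.full(per_class, cid))
    return np.concatenate(xs), np.concatenate(ys)

# Toy walk-through: two old classes in memory, one new class arriving.
rng = np.random.default_rng(0)
memory = ReplayMemory()
for cid in (0, 1):  # Initialization: seed memory with seen-class prototypes
    memory.update(cid, rng.standard_normal((32, 16)) + cid)

new_feats = rng.standard_normal((32, 16)) + 2.0  # real features of class 2
x, y = incremental_batch(memory, new_feats, np.full(32, 2),
                         per_class=32, rng=rng)  # balanced training batch
memory.update(2, new_feats)  # Update: fold the new class into memory
```

A classifier fine-tuned on `(x, y)` sees old and new classes in equal proportion, which is the mechanism the paper credits for retaining past knowledge while learning new tasks.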
The paper has been accepted at ACCV-24! The full paper can be found here.
If you find our work useful please cite:
@article{gowda2024continual,
title={Continual Learning Improves Zero-Shot Action Recognition},
author={Gowda, Shreyank N and Moltisanti, Davide and Sevilla-Lara, Laura},
journal={arXiv preprint arXiv:2410.10497},
year={2024}
}