Check out the project webpage here.
Published in the NeurIPS Datasets and Benchmarks Track 2022, this work is a collaboration with Dr. Yogesh S. Rawat, Dr. Vibhav Vineet, Dr. Hamid Palangi, and Dr. Shruti Vyas. It evaluates text-to-video retrieval on the YouCook2 and MSRVTT datasets under perturbations applied to both the text input and the video input. The models evaluated are video-language models, meaning they take video and text as input and output a feature for each in a shared "joint" space. During training, each model learns to project these features so that matching text and video pairs land closer together, and this alignment is measured with text-to-video retrieval.
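As a rough illustration of how text-to-video retrieval is scored, here is a minimal sketch assuming precomputed text and video embeddings from the joint space, where row i of each matrix is a matching text/video pair; the names `text_emb`, `video_emb`, and `recall_at_k` are illustrative, not taken from any model's released code.

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, k: int = 5) -> float:
    """Fraction of text queries whose matching video (same row index)
    appears among the k most similar videos."""
    # Cosine similarity reduces to a dot product on unit-norm embeddings.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    video_emb = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = text_emb @ video_emb.T                    # (num_texts, num_videos)
    top_k = np.argsort(-sims, axis=1)[:, :k]         # indices of the k best videos
    hits = (top_k == np.arange(len(text_emb))[:, None]).any(axis=1)
    return float(hits.mean())
```

A robust model should keep Recall@K high even when the text or video input is perturbed; the drop relative to clean inputs is what the benchmark measures.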
This project involved:
Reading, understanding, and debugging open-source code, then integrating new code.
Data engineering: creating perturbed versions of videos using 80 different perturbation types across multiple datasets while optimizing for storage, which mattered especially for the minutes-long YouCook2 videos (see the perturbation sketch after this list).
Data preprocessing that is unique to each model.
Managing separate virtual environments to satisfy each open-source model's Python package requirements.
Building organizational tooling to run a large variety of experiments across models that each have their own workflow.
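To make the perturbation work concrete, here is a minimal sketch of one perturbation applied frame by frame, assuming OpenCV for video I/O; the specific perturbation (Gaussian noise at five severities) and the helper names are illustrative stand-ins for the 80 perturbation types in the actual pipeline.

```python
import cv2
import numpy as np

def gaussian_noise(frame: np.ndarray, severity: int) -> np.ndarray:
    """Add zero-mean Gaussian noise; higher severity -> larger std dev."""
    std = [8, 16, 24, 32, 40][severity - 1]
    noisy = frame.astype(np.float32) + np.random.normal(0, std, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def perturb_video(src_path: str, dst_path: str, severity: int) -> None:
    reader = cv2.VideoCapture(src_path)
    fps = reader.get(cv2.CAP_PROP_FPS)
    w = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Re-encode with a compressed codec so dozens of perturbed copies
    # of long videos stay storable.
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    ok, frame = reader.read()
    while ok:
        writer.write(gaussian_noise(frame, severity))
        ok, frame = reader.read()
    reader.release()
    writer.release()
```

Re-encoding rather than storing raw frames is the storage optimization hinted at above: for minutes-long videos, uncompressed perturbed copies would be orders of magnitude larger.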
Accepted at CVPR 2023. Check out the project webpage here.
Published in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023, this work is a collaboration with students Naman Biyani and Prudvi Kamtam and supervisors Dr. Yogesh S. Rawat, Dr. Vibhav Vineet, Dr. Hamid Palangi, and Dr. Shruti Vyas. In this work, we perform a large-scale robustness analysis of existing models for video action recognition, focusing on robustness against real-world distribution shifts rather than adversarial perturbations. We propose four benchmark datasets, HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P, and use them to study the robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings:
Transformer-based models are consistently more robust than CNN-based models.
Pretraining improves robustness for transformer-based models more than for CNN-based models.
All of the studied models are robust to temporal perturbations on every dataset except SSv2, suggesting that the importance of temporal information for action recognition varies with the dataset and the activities involved.
Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF101-DS, which contains realistic distribution shifts, to further validate some of these findings. We believe this study will serve as a benchmark for future research in robust video action recognition.
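One common way to summarize results like these is a relative robustness score that normalizes the accuracy drop under a perturbation by the model's clean accuracy. The snippet below is a sketch of that general idea, not the exact metric definition from the paper.

```python
def relative_robustness(clean_acc: float, perturbed_acc: float) -> float:
    """1.0 means no degradation under the perturbation;
    0.0 means the model lost all of its clean accuracy."""
    return 1.0 - (clean_acc - perturbed_acc) / clean_acc

# Example: a model at 80% clean accuracy dropping to 60% under a perturbation.
print(relative_robustness(0.80, 0.60))  # 0.75
```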
This project involved:
Reading, understanding, and debugging open-source code, then integrating new code.
Data engineering: creating perturbed versions of videos using 80 different perturbation types across multiple datasets while optimizing for storage.
Building organizational tooling to run a large variety of experiments across models that each have their own workflow.
Data visualization to present the results in a concise, digestible way.
Helped organize the data and evaluation for the Robustness in Sequential Data (ROSE) challenge. Generated the challenge's test dataset using a variety of perturbations and created the leaderboard evaluation website here. This work was with Dr. Yogesh S. Rawat, Dr. Vibhav Vineet, Dr. Hamid Palangi, and Naman Biyani.
This work led to a paper submission that is in progress.
The research re-implemented existing models and evaluated them on the Something-Something-P (Perturbed) dataset. This project involved:
Reading, understanding, and debugging open-source code, then integrating new code.
Data engineering: creating perturbed versions of videos using 80 different perturbation types across multiple datasets while optimizing for storage.
Visualizing high-dimensional data and other results to communicate research questions and answers.
Building organizational tooling to run a large variety of experiments (sketched below).
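A minimal sketch of that organizational tooling: a config-driven runner that expands every (model, perturbation, severity) combination into one job. The model names, the `WORKFLOWS` registry, and the command-line flags are hypothetical stand-ins for each open-source model's own entry point.

```python
import itertools
import subprocess

# Hypothetical registry: each open-source model ships its own entry point,
# so every (model, perturbation, severity) triple becomes one job.
WORKFLOWS = {
    "model_a": ["python", "model_a/eval.py"],
    "model_b": ["python", "model_b/run_eval.py"],
}
PERTURBATIONS = ["gaussian_noise", "freeze", "rotation"]
SEVERITIES = [1, 2, 3, 4, 5]

def launch_all() -> None:
    for model, pert, sev in itertools.product(WORKFLOWS, PERTURBATIONS, SEVERITIES):
        cmd = WORKFLOWS[model] + ["--perturbation", pert, "--severity", str(sev)]
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    launch_all()
```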
An example of visualizing this high-dimensional feature space, showing results for three popular deep learning models. The videos were perturbed to simulate a live-stream video freezing, where greater severity means the video "freezes" longer and more often. These videos were fed to the models, and the extracted features were visualized using PCA and t-SNE to project the high-dimensional space down to 2D.
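A sketch of that dimensionality-reduction step with scikit-learn, assuming `features` is a (num_videos, feature_dim) array of precomputed model features and `severities` holds each video's freeze severity; the `.npy` file names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Assumed precomputed: per-video model features and their freeze severities.
features = np.load("features.npy")      # (num_videos, feature_dim)
severities = np.load("severities.npy")  # (num_videos,)

# PCA first tames the dimensionality so t-SNE runs faster and more stably.
reduced = PCA(n_components=50).fit_transform(features)
embedded = TSNE(n_components=2, perplexity=30).fit_transform(reduced)

plt.scatter(embedded[:, 0], embedded[:, 1], c=severities, cmap="viridis", s=8)
plt.colorbar(label="freeze severity")
plt.title("t-SNE of video features under the freeze perturbation")
plt.show()
```

If a model is robust to the freeze perturbation, points at different severities stay mixed together in the 2D plot; a robustness failure shows up as severity-ordered clusters drifting apart.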