Abstract
Human action recognition is the process of automatically identifying and categorizing human activities in video sequences, utilizing computer vision and machine learning techniques to analyze and interpret temporal patterns of movements. It finds applications in surveillance, sports analysis, healthcare, and human-computer interaction.
Types of Human Action Recognition from Videos
Low-resolution videos exhibit poor quality settings, including blurriness, shaky camera motion, cluttered backgrounds, and occlusion. These issues contribute to a diminished visual experience, hindering the clarity and overall appeal of the video content.
Sports analysts use action recognition to track and analyze player movements during games. It provides valuable insights into player performance, strategies, and game dynamics. This can be applied in various sports, including soccer, basketball, and tennis.
Human action recognition is extensively used in video surveillance systems for identifying suspicious or abnormal activities. It helps in detecting intruders, monitoring crowded places, and enhancing overall security.
In healthcare, action recognition can be used to monitor patients' movements and activities. This is especially useful for elderly care, rehabilitation, and ensuring patient safety within healthcare facilities.
Motivation
Existing frameworks do not treat video quality as a problem; they are designed for processing high-quality videos.
To facilitate video forensic investigators in case of incidents involving anomalous activities.
To automate the person action recognition process for surveillance in airports, metro stations, shopping malls, roads, parks, etc.
To reduce the workload of human operators by automating monitoring tasks that once required continuous attention.
Research Gap
Addressing Real-world Complexity
Researchers need to develop methodologies that account for the variability and unpredictability of real-world environments
Focus should be on creating and utilizing datasets that authentically represent the challenges posed by low-quality videos in everyday life
Action recognition from low-quality videos while dealing with real-world problems such as camera sensor noise, grain, blurriness, and multi-view variation, using a pre-trained generative adversarial network for super-resolution and a custom 3D model for action prediction.
Model Pipeline
Methodology Steps
Input Low-quality Video: Although there are several commonly used standard datasets for human action recognition, the proposed framework concentrates on the TinyVIRAT-v2 dataset. It was chosen because its vast number of classes, numerous short clips, and unconstrained nature make it currently the most difficult dataset for real-time action recognition in low-quality videos.
Pre-Processing: The input videos then undergo pre-processing. The most frequently used techniques employed here are data augmentation and scaling the videos to a fixed resolution. Cropping video frames is a commonly used augmentation approach that has proven effective in improving action recognition performance.
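A minimal sketch of this step, assuming PyTorch and torchvision; the fixed resolution (112x112) and crop size (96x96) are illustrative values rather than the exact settings of our pipeline:

```python
import torch
from torchvision import transforms

# Illustrative pre-processing: rescale each frame to a fixed resolution,
# then apply cropping as augmentation (random crop for training, center crop for testing).
train_transform = transforms.Compose([
    transforms.Resize((112, 112)),   # scale low-quality frames to a fixed size (assumed value)
    transforms.RandomCrop(96),       # cropping augmentation (assumed value)
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.CenterCrop(96),
    transforms.ToTensor(),
])

def preprocess_clip(frames, transform):
    """Apply the same transform to every PIL frame of a clip and
    stack the result into a (C, T, H, W) tensor."""
    return torch.stack([transform(f) for f in frames], dim=1)
```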
Feature Extraction: The processed input then goes to a Convolutional Neural Network-based feature extractor. From both test and training videos, a sequence of spatio-temporal features describing the action in the video is derived.
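A minimal sketch of the feature-extraction stage, assuming PyTorch; torchvision's r3d_18 is used here only as a readily available stand-in for the 3D CNN backbone:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in 3D CNN backbone with its classification head removed,
# leaving a spatio-temporal feature extractor.
backbone = r3d_18(weights=None)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
with torch.no_grad():
    features = feature_extractor(clip).flatten(1)  # (1, 512) spatio-temporal descriptor
print(features.shape)
```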
Super Resolution Network: The extracted features are then fed into the super-resolution network. We use a GAN-based super-resolution network: generative adversarial networks (GANs) produce new data samples that resemble the training data, and to generate higher-resolution videos the GAN pairs a deep generator network with an adversarial discriminator. Most state-of-the-art methods have used SRGAN as the super-resolution network; we use pre-trained SRGAN weights for inference only.
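A hedged sketch of inference-only use of a pre-trained SRGAN-style generator, applied frame by frame; the `SRGANGenerator` class below is a heavily simplified stand-in (the real SRGAN uses many residual blocks with batch normalization) and the checkpoint path is hypothetical:

```python
import torch
import torch.nn as nn

class SRGANGenerator(nn.Module):
    """Simplified SRGAN-style generator: feature extraction, a small conv body,
    and pixel-shuffle upsampling. Illustrative only."""
    def __init__(self, channels=64, upscale=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, channels, 9, padding=4), nn.PReLU())
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
        )
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * upscale * upscale, 3, padding=1),
            nn.PixelShuffle(upscale),  # rearranges channels into a 4x larger image
        )

    def forward(self, x):
        feat = self.head(x)
        return self.tail(self.body(feat) + feat)  # residual connection around the body

generator = SRGANGenerator()
# generator.load_state_dict(torch.load("srgan_generator.pth"))  # hypothetical pre-trained checkpoint
generator.eval()

with torch.no_grad():
    lr_frames = torch.rand(16, 3, 32, 32)   # 16 low-resolution frames of a clip
    sr_frames = generator(lr_frames)        # (16, 3, 128, 128) super-resolved frames
print(sr_frames.shape)
```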
Action Classifier: The super-resolved videos are then fed into the action classifier, which predicts the action class of the video.
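An illustrative sketch of the classification stage, assuming PyTorch; TinyVIRAT-v2 is multi-label with 26 classes, so each class receives an independent sigmoid decision. torchvision's r3d_18 stands in for the 3D ResNet50 backbone described later:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_CLASSES = 26  # TinyVIRAT-v2 action classes

# Stand-in 3D CNN classifier; replace the head for the multi-label setting.
classifier = r3d_18(weights=None)
classifier.fc = nn.Linear(classifier.fc.in_features, NUM_CLASSES)
classifier.eval()

sr_clip = torch.randn(1, 3, 16, 112, 112)  # super-resolved clip (batch, C, T, H, W)
with torch.no_grad():
    logits = classifier(sr_clip)
    predicted = (torch.sigmoid(logits) > 0.5).int()  # one 0/1 decision per action class
print(predicted)
```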
Output Class: Upon classification, every test video is assigned a predicted category, and the cumulative outcomes are evaluated using performance measures.
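Since TinyVIRAT is multi-label, a typical performance measure aggregates per-class binary decisions; the sketch below computes micro-averaged precision, recall, and F1 and is illustrative rather than the challenge's official evaluation script:

```python
import torch

def multilabel_f1(preds, targets, eps=1e-8):
    """Micro-averaged precision, recall and F1 over binary multi-label predictions.
    `preds` and `targets` are 0/1 tensors of shape (num_videos, num_classes)."""
    tp = (preds * targets).sum()
    precision = tp / (preds.sum() + eps)
    recall = tp / (targets.sum() + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision.item(), recall.item(), f1.item()

# Toy usage with 3 videos and 4 classes
preds   = torch.tensor([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]])
targets = torch.tensor([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]])
print(multilabel_f1(preds, targets))
```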
Dataset
To evaluate quantitative and qualitative effectiveness, the human action classifier must be assessed. An effective analysis requires video datasets that capture activities under multiple contexts, but no single database covers every possible circumstance; as a result, many researchers opt to produce their own data. The most popular publicly available datasets for human action recognition are explored here.

Early human activity recognition datasets gathered around 10 distinct action types under very controlled circumstances, and modern performance on such datasets is very close to its limit, requiring the development of new benchmarks. To address this, Kuehne et al. gathered what was then the broadest action-recording dataset, comprising 51 action classes. They assess the effectiveness of two representative computer vision algorithms for action recognition on this data and investigate how reliable these techniques are under various conditions, such as image acquisition, viewing angle, video quality, and occlusion.

Kay et al. created the Kinetics human action video dataset, which contains roughly 400 videos for each human activity type. Each ten-second clip originally comes from a separate YouTube video. The actions are human-centered and span a variety of types, including both human-to-human and human-to-object interactions, such as playing a musical instrument or shaking hands.

Yang et al. investigated the challenge of action detection in low-light recordings. They gathered a new dataset, Activity Recognition in the Dark (ARID), to fill the data shortage for this task; it contains more than 3.8K video clips divided into eleven activity categories. Another work created a new video dataset, PA-HMDB51, which annotates each frame with both the expected task (activity) and specific private attributes (skin color, expression, identity, gender, and relationship).

Further work introduced two benchmark datasets, TinyVIRAT-v1 and TinyVIRAT-v2, for activity detection in low-quality tiny videos, collected by gathering short clips from CCTV footage. Because of their vast number of classes, numerous short clips, and unconstrained nature, they are currently the most difficult datasets for real-time action recognition in low-quality videos. They are considerably harder to analyze since each clip may contain multiple actions, making the datasets multi-label. TinyVIRAT-v1 includes over 13K recordings and TinyVIRAT-v2 includes over 26K recordings from 26 unique action classes, all recorded at 30 frames per second. The actions vary in length from sample to sample, with a duration of around 3 seconds. The clips are low quality and of varying sizes, ranging from 10x10 pixels to 128x128 pixels. The train and test splits of TinyVIRAT-v1 contain more than 7K and 5K clips respectively, whereas the train, validation, and test splits of TinyVIRAT-v2 contain more than 16K, 3K, and 6K clips respectively. The recordings are inherently poor quality and depict difficulties encountered in everyday life. TinyVIRAT-v2 is an expansion of the original TinyVIRAT-v1 dataset and comprises authentic low-quality action recordings taken from CCTV footage in the TinyVIRAT-v1 and MEVA datasets. TinyVIRAT-v2 also raises the difficulty by including both indoor and outdoor recordings.
Hence, we will be using these two datasets, as they are publicly available. The datasets can be accessed via the following links.
1. https://www.crcv.ucf.edu/data/UCF101.php
2. https://www.crcv.ucf.edu/tiny-actions-challenge-cvpr2021/
3. https://github.com/vyzuer/Tiny-VIRAT
4. https://paperswithcode.com/dataset/hmdb51
5. https://paperswithcode.com/dataset/kinetics
Backbone Network - 3D ResNet50
Residual 3D-CNN
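Since the backbone is a residual 3D CNN, the following is a minimal PyTorch sketch of a bottleneck-style residual 3D block of the kind used in 3D ResNet50-style networks; the channel widths and strides are illustrative assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn

class Residual3DBlock(nn.Module):
    """Bottleneck-style residual block with 3D convolutions (illustrative)."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm3d(out_channels),
        )
        # Project the skip connection when the shape changes
        self.shortcut = (nn.Identity() if stride == 1 and in_channels == out_channels
                         else nn.Sequential(
                             nn.Conv3d(in_channels, out_channels, 1, stride=stride, bias=False),
                             nn.BatchNorm3d(out_channels)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x) + self.shortcut(x))

block = Residual3DBlock(64, 64, 256)
out = block(torch.randn(1, 64, 16, 28, 28))  # -> (1, 256, 16, 28, 28)
```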
Per-Class Performance of the Model
Evaluation Measure
Tiny Action Challenge
Introduction of Challenge:
TinyVIRAT-v2 challenge published by Mubarak Shah and team.
Open to teams worldwide.
Global Participation:
Numerous teams from around the world took part.
Evaluation Process:
Rigorous assessment of submissions.
Criteria included innovation, performance, and relevance.
Publication of Results:
Top three scores highlighted in the official report.
Detailed insights into the winning entries.
Significance of the Challenge:
Showcased global collaboration in the field.
Demonstrated diverse talents and innovations.
Recommendation for Information:
For the latest and most detailed information, refer to official publications or announcements by Mubarak Shah and his team via the challenge page: https://tinyactions-cvpr22.github.io/
Future Work
Integrating Attention Mechanisms:
Future research should focus on incorporating attention mechanisms into our custom 3D framework to enhance the recognition of activities in real-world CCTV video.
Exploring spatial and temporal attention techniques will help the model focus on crucial regions and temporal dynamics, improving the identification of important features; a minimal sketch of one such module follows.
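As one possible direction, the sketch below shows a simple temporal attention module that re-weights the frames of a (B, C, T, H, W) feature map; it is an illustrative assumption of how such a mechanism could be attached to the 3D backbone, not a finalized design:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal attention: scores each time step of a (B, C, T, H, W)
    feature map and re-weights frames so informative time steps dominate."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x):
        # Score every position, average over space, normalize scores over time
        weights = torch.softmax(self.score(x).mean(dim=(3, 4)), dim=2)  # (B, 1, T)
        return x * weights.unsqueeze(-1).unsqueeze(-1)                  # broadcast over H, W

attn = TemporalAttention(channels=256)
features = torch.randn(2, 256, 16, 7, 7)
reweighted = attn(features)  # same shape, frames re-weighted by attention
```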
Useful Reads
[1] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre, "HMDB: A large video database for human motion recognition," in International Conference on Computer Vision, 2011.
[2] "activity-net," 2016. [Online]. Available: http://activity-net.org/.
[3] K. Soomro, A. R. Zamir and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild," 2012.
[4] M. Vrigkas, C. Nikou and I. A. Kakadiaris, "A Review of Human Activity Recognition Methods," Frontiers in Robotics and AI, 2015.
[5] T. Özyer, D. S. Ak and R. Alhajj, "Human action recognition approaches with video datasets—A survey," Knowledge-Based Systems, vol. 222, 2021.
[6] J. K. Aggarwal and M. S. Ryoo, "Human Activity Analysis: A Review," ACM Computing Surveys, vol. 43, p. 1–43, 2011.
[7] I. Jegham, A. B. Khalifa, I. Alouani and M. A. Mahjoub, "Vision-based human action recognition: An overview and real world challenges," Forensic Science International: Digital Investigation, vol. 32, 2020.
[8] U. Nadeem, S. A. A. Shah, M. Bennamoun, R. Togneri and F. Sohel, "Real time surveillance for low resolution and limited data scenarios: An image set classification approach," Multimedia Tools and Applications, vol. 580, 2021.
[9] C. de Boor, A Practical Guide to Splines, vol. 27, Springer New York, 2001.
[10] M. Elad and A. Feuer, "Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images," 1997.
[11] M. Elad and A. Feuer, "Image Super-Resolution Using Sub-Band Weighted Convolutional Networks," in IEEE Transactions on Image Processing, 1997.
[12] "Image Super-Resolution as Sparse Representation of Raw Image Patches," in IEEE Transactions on Image Processing, 2010.
[13] K. Dabov, A. Foi, V. Katkovnik and K. Egiazarian, "Image Denoising by Sparse 3D Transform-Domain Collaborative Filtering," in IEEE Transactions on Image Processing, 2007.
[14] C. Dong, C. C. Loy, K. He and X. Tang, "Image Super-Resolution Using Deep Convolutional Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[15] J. Kim, J. K. Lee and K. M. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[17] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta and W. Shi, "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] B. Lim, S. Son, H. Kim, S. Nah and K. M. Lee, "Enhanced Deep Residual Networks for Single Image Super-Resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[19] S. Woo, J. Park, J.-Y. Lee and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[20] I. Laptev, "On Space-Time Interest Points," International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107-123, 2005.
[21] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[22] H. Wang, A. Kläser, C. Schmid and C.-L. Liu, "Action Recognition by Dense Trajectories," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1482-1495, 2013.
[23] P. Scovanner, S. Ali and M. Shah, "A 3-dimensional SIFT Descriptor and Its Application to Action Recognition," in Proceedings of the 15th International Conference on Multimedia, 2007.
[24] I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld, "Learning Realistic Human Actions from Movies," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[25] M. Brand and N. Oliver, "Statistical Approaches to 3D and 2D Model-Based Object Recognition," in International Journal of Computer Vision, 1997.
[26] K. Simonyan and A. Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," in Advances in Neural Information Processing Systems, 2014.
[27] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, "Learning Spatiotemporal Features with 3D Convolutional Networks," in IEEE International Conference on Computer Vision, 2015.
[28] H. Wang, L. Torresani, J. Ray and D. Tran, "A Closer Look at Spatiotemporal Convolutions for Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[29] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[30] Y. Xiong, L. Wang, Z. Wang, Y. Qiao, D. Lin and X. Tang, "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition.," in European Conference on Computer Vision, 2016.
[31] T. Lin, X. Zhao and Z. Shou, "Towards High Performance Video Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition.
[32] A. Amini, P. St-Charles, A. Nair and D. Fox, "A Specialized RNN for Disambiguating and Predicting Trajectories," in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[33] A. Diba, M. Fayyaz, V. Sharma, A. Pazandeh and L. Van Gool, "Temporal Action Transformer Network," in IEEE International Conference on Computer Vision, 2019.
[34] A. Kappeler, S. Yoo, Q. Dai and A. K. Katsaggelos, "Video Super-Resolution With Convolutional Neural Networks," in IEEE Transactions on Computational Imaging, 2016.
[35] C. Dong, C. C. Loy, K. He and X. Tang, "Image Super-Resolution Using Deep Convolutional Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[36] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.
[37] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang and W. Shi, "Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation," in Computer Vision and Pattern Recognition (cs.CV), 2017.
[38] X. Tao, H. Gao, R. Liao, J. Wang and J. Jia, "Detail-Revealing Deep Video Super-Resolution," in International Conference on Computer Vision (ICCV), 2017.
[39] Y. Jo, S. W. Oh, J. Kang and S. J. Kim, "Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation," in CVPR, 2018.
[40] M. Xu, A. Sharghi, X. Chen and D. Crandall, "Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
[41] J. Chen, J. Wu, J. Konrad and P. Ishwar, "Semi-coupled two-stream fusion convnets for action recognition at extremely low resolutions," in IEEE Winter Conference on Applications of Computer Vision, 2017.
[42] I. R. Dave, C. Chen and M. Shah, "SPAct: Self-supervised Privacy Preservation for Action Recognition," in Computer Vision and Pattern Recognition (cs.CV), 2022.
[43] A. Otani, R. Hashiguchi, K. Omi, N. Fukushima and T. Tamaki, "On the Performance Evaluation of Action Recognition Models," 2022.
[44] H. Kim, M. Jain, J.-T. Lee, S. Yun and F. Porikli, "Efficient Action Recognition via Dynamic Knowledge Propagation," in IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[45] T. Li, B. Yang and T. Zhang, "Human Action Recognition Based on State," in 16th Conference on Industrial Electronics And Applications, 2021.
[46] Q. Shi, H.-B. Zhang, Z. Li, J.-X. Du, Q. Lei and J.-H. Liu, "Shuffle-invariant Network for Action Recognition in Videos," in ACM Transactions on Multimedia Computing, Communications, and Applications, 2022.
[47] K. Gkountakos, D. Touska, K. Ioannidis, T. Tsikrika, S. Vrochidis and I. Kompatsiaris, "Spatio-Temporal Activity Detection and Recognition in Untrimmed Surveillance Videos," in Proceedings of the 2021 International Conference on Multimedia Retrieval, 2021.
[48] H. Zhang, D. Liu and Z. Xiong, "Two-Stream Action Recognition-Oriented Video Super-Resolution," in IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[49] P. Russo, S. Ticca, E. Alati and F. Pirri, "Learning to See Through a Few Pixels: Multi," 2021.
[50] D. Purwanto, R. R. Adhi Pramono, Y. T. Chen and W. H. Fang, "Three-Stream Network With Bidirectional Self-Attention for Action Recognition in Extreme Low Resolution Videos," in IEEE Signal Processing Letters, 2019.
[51] Z. Wang, S. Chang, D. Liu and Y. Yang, "Studying Very Low Resolution Recognition Using Deep Networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[52] Y. Bai, Y. Zhang, M. Ding and B. Ghanem, "Finding tiny faces in the wild with generative adversarial network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[53] U. Demir, Y. S. Rawat and M. Shah, "TinyVIRAT: Low-resolution Video Action Recognition," in 25th International Conference on Pattern Recognition (ICPR), 2021.
[54] J. He, Z. Zhang, Z. Xu and Z. Luo, "Delving into High Quality Action Recognition for Low Resolution Videos," 2021.
[55] P. Tirupattur, A. J. Rana, T. Sangam, S. Vyas, Y. S. Rawat and M. Shah, "TinyAction Challenge: Recognizing Real-world Low-resolution Activities in Videos," in Computer Vision and Pattern Recognition (cs.CV), 2021.
[56] M. Yang, Y. Guo, F. Zhou and Z. Yang, "TS-D3D: A novel two-stream model for Action Recognition," in International conference on image processing, computer vision and Machine Learning, 2022.
[57] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre, "HMDB: A large video database for human motion recognition," in International Conference on Computer Vision, 2011.
[58] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman and A. Zisserman, "The Kinetics Human Action Video Dataset," in Computer Vision and Pattern Recognition (cs.CV), 2017.
[59] Y. Xu, J. Yang, H. Cao, K. Mao, J. Yin and S. See, "ARID: A New Dataset for Recognizing Action in the Dark," in Computer Vision and Pattern Recognition (cs.CV), 2020.
[60] Z. Wu, H. Wang, Z. Wang, H. Jin and Z. Wang, "Privacy-Preserving Deep Action Recognition: An Adversarial Learning Framework and A New Dataset," in Computer Vision and Pattern Recognition (cs.CV), 2019.
[61] C. Sun, "Mini Kinetics-200 dataset," 2020. [Online]. Available: https://github.com/s9xie/Mini-Kinetics-200.
[62] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," in Computer Vision and Pattern Recognition (CVPR), 2016.
[63] "Wikipedia," 2022. [Online]. Available: https://en.wikipedia.org/wiki/Video_super-resolution.
[64] "ChOOCH," 24 August 2021. [Online]. Available: https://chooch.ai/computer-vision/what-is-action-recognition/.
Team
usamabajwa@cuilahore.edu.pk
Co-PI, Video Analytics lab, National Centre in Big Data and Cloud Computing,
Program Chair (FIT 2019),
HEC Approved PhD Supervisor,
Assistant Professor & Associate Head of Department
Department of Computer Science,
COMSATS University Islamabad, Lahore Campus, Pakistan
Samia Akram
samakram96@gmail.com
Research Scholar (RCS),
Department of Computer Science,
COMSATS University Islamabad, Lahore Campus