Title: "Auto-Highlights Extraction from Match Video"
Introduction: Watching the highlights of a sports match is a long-standing tradition and a common activity everywhere. The obvious way to create a "highlights" video is to follow the live match continuously and manually record the time intervals where all the criteria for a highlight scene are satisfied. After identifying these intervals, the highlights video is created in a video editing application by performing operations such as trimming. Although this manual workflow is not especially complex, it is prone to human error and becomes a problem when many matches must be processed in a given time, since a person has to be allotted to the live stream at all times to identify the key intervals.
Project Objective: The goal is to automatically extract highlight clips from a full-length video of a match. Taking the full match video as input, a model is developed to automatically detect the exciting content that occurred during the match. The whole extraction process depends on the video and audio files of the match.
Github : https://github.com/Srialokam/Project_606
Dataset: The dataset for this project is a combination of video and audio files.
Source link: https://arpane4c5.github.io/CricketStrokesDataset.html
Description: Size: 1.03 GB
Type : Audio (.mp3), Video (.avi)
The dataset is a collection of highlights videos of previous matches from the "ICC World Cup T20" event. All of the videos include audio such as commentary and crowd cheering, which are important assets for processing. Along with the video files, a few audio (.mp3) files are downloaded and converted for additional feature training of the model.
Methodology: In the beginning, video files are loaded using OpenCV and the audio is extracted for each clip, with appropriate labels included (a loading sketch is given at the end of this section).
Audio files are processed to train the model to identify the sounds of certain events such as a six, a four, or a wicket (or, in other sports, a goal, a smash, etc.).
Audio signal analysis is used to detect the time intervals of high-pitch events such as crowd cheering or players' shouts during celebrations.
For the available video clips, which are labeled appropriately, a Linear Support Vector Machine classifier is used to classify the events that occurred into highlight content and non-highlight content.
Overlapping of multiple audio sources is the major challenge for the audio-pitch-variation classifier. To deepen the filtering, speech recognition of the commentary associated with the match is introduced.
A pronunciation lexicon model or Dynamic Bayesian Networks can handle speech recognition of the commentators by spotting key words in the available speech.
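As a rough sketch of the first step, loading a clip with OpenCV and pulling out its audio track could look like the following (the use of moviepy for audio extraction, the file paths and the frame-sampling step are my assumptions, not details from the dataset description):

```python
import cv2
from moviepy.editor import VideoFileClip  # assumption: moviepy handles the audio track

def load_frames_and_audio(video_path, wav_path, frame_step=10):
    """Read every `frame_step`-th frame with OpenCV and save the audio track as WAV."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            frames.append(frame)
        idx += 1
    cap.release()

    # Extract the commentary / crowd audio for the signal-energy analysis.
    VideoFileClip(video_path).audio.write_audiofile(wav_path)
    return frames

# Example (hypothetical paths):
# frames = load_frames_and_audio("videos/match_clip.avi", "audio/match_clip.wav")
```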
Literature / Industry research: Live streaming of any sport reaches only a limited audience, since various factors prevent people from attending or watching live. Demand for highlights videos is very high on video streaming platforms, and customers have high expectations for highlight clips. The sports live-streaming industry gives the highlights video the same priority as the live stream of the match itself. Manual editing of highlights videos is the common practice. Complexity appears when multiple matches are conducted at once, or when multiple games run in parallel, as in the Olympics or other competitive tournaments. Different machine learning models have contributed techniques to perform the task automatically, but each misses important parameters. In this project I try to introduce additional processing methods such as speech recognition and make the models as accurate as possible at automatically detecting the maximum number of highlights present in a streamed match video.
YouTube link of the Phase-1 presentation
PHASE II
Work done:
Exploring, cleaning and visualizing the data.
Processing audio files into signal data to calculate the energy level of the signal for every sample and extract the time stamps where significant variation in energy is observed (a sketch is given after this list).
Developing a Linear SVM classifier model to classify video frames into umpire and non-umpire classes.
A module that takes an audio file as input and extracts the commentators' speech as text for identification of highlight events.
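A minimal sketch of the energy-based timestamp extraction mentioned above (the window length and the mean-plus-two-standard-deviations threshold rule are my assumptions; the report only states that windows with significantly higher energy are flagged):

```python
import numpy as np
from scipy.io import wavfile  # assumption: the extracted audio is stored as WAV

def high_energy_timestamps(wav_path, window_s=0.5, k=2.0):
    """Return (start, end) times in seconds of windows whose short-time energy
    exceeds mean + k * std of all window energies."""
    rate, samples = wavfile.read(wav_path)
    if samples.ndim > 1:                      # stereo -> mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)

    win = int(window_s * rate)
    n_windows = len(samples) // win
    energies = np.array([np.sum(samples[i * win:(i + 1) * win] ** 2)
                         for i in range(n_windows)])

    threshold = energies.mean() + k * energies.std()
    flagged = np.where(energies > threshold)[0]
    return [(i * window_s, (i + 1) * window_s) for i in flagged]
```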
Challenges:
Handling the waveform data of the audio files.
Feature extraction from the image data files, which also includes attaching labels manually.
Creating an appropriate configuration for the speech-to-text model, which was intended as another layer of filtering between regular and highlight content in an event.
Results:
Time frames of the audio file where the energy levels of continuous audio are significantly higher than the threshold value.
Successfully extracted features of the image files, along with labels, to create a classifier model.
Train accuracy: 94%
Test accuracy: 88.9%
Approaches :
Using the sample rate and amplitude levels of the audio signal, energy-range data is extracted, and the time interval of each variation is also recorded and saved. This creates a dataframe holding the energy level and the time stamps of each significant level, which helps identify scenes of the match with a high probability of containing highlight content.
Large audio files are broken into small chunks and transcribed to convert the commentators' speech to text, using the Google speech recognizer. Since the returned confidence levels are not robust, a change of approach for the speech-to-text processing is planned (a sketch of the current chunk-and-transcribe step follows this list).
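A minimal sketch of the chunk-and-transcribe step, assuming the SpeechRecognition and pydub packages (the 30-second chunk size and the temporary file name are my choices for illustration):

```python
import speech_recognition as sr
from pydub import AudioSegment  # assumption: pydub splits the audio into chunks

def transcribe_in_chunks(wav_path, chunk_s=30):
    """Split the commentary audio into fixed-length chunks and transcribe each
    chunk with the Google Web Speech API."""
    audio = AudioSegment.from_wav(wav_path)
    recognizer = sr.Recognizer()
    texts = []
    for start_ms in range(0, len(audio), chunk_s * 1000):
        chunk = audio[start_ms:start_ms + chunk_s * 1000]
        chunk.export("chunk.wav", format="wav")        # temporary file (hypothetical name)
        with sr.AudioFile("chunk.wav") as source:
            data = recognizer.record(source)
        try:
            texts.append(recognizer.recognize_google(data))
        except sr.UnknownValueError:                    # chunk with only crowd noise
            texts.append("")
    return texts
```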
Changes from previous phase:
Introducing new datasets, which help in several ways, such as additional training of the classifier model and validation of the signal-energy concept for extracting timestamps from the match video.
Vision-based analysis is carried out based on the frames where the umpire or referee is present and on his gestures or signs.
Next Steps:
So far, we have successfully obtained the time frames of high-energy moments in the match. The next steps are to develop a vision-based model capable of detecting umpire signs, so that events can be categorized by the type of sign acted by the umpire, and to identify an appropriate model to convert the commentators' speech to text.
Develop the project to an end-to-end level that automatically extracts short highlight videos from a regular match video.
PHASE III:
Work Done:
Exploring the image data and pre-processing the images to extract features for model building.
Developing two different models for event-based analysis:
Model 1: Binary classification [each video frame is classified as a frame with or without an umpire in it]
Model 2: Multi-class classification [umpire action detection: signals for Six, Out, No Ball, Wide, No Action]
Saving the extracted features and the best-performing models for use on the actual data.
Applying the saved models and features to the match video to obtain results.
Comparing the number of images present for each class in the binary classification:
Umpire images: 240; non-umpire images: 213.
Comparing the number of images for each class in the multi-class classification:
Each class contains an equal number of images for training the model.
Pre-Processing of Image Data:
To make the image data consumable by the libraries used to extract features, pre-processing of each image has to be performed.
Image data is converted into NumPy array format with the help of Keras' built-in 'image' module, which handles image data processing: `from keras.preprocessing import image` (a sketch of this step is given below).
The new array format of the data is used to generate features with respect to the class of each image. The same pre-processing technique is applied for the binary and multi-class classification models.
Once an image is pre-processed, features are extracted and saved for it based on its class.
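A minimal sketch of this pre-processing step (the target size, pixel scaling and flattening into a plain vector are my assumptions; the report only states that Keras' `image` utilities convert each image to a NumPy array):

```python
from keras.preprocessing import image  # module named above

def image_to_feature_vector(img_path, size=(224, 224)):
    """Load an image, resize it, scale pixel values and flatten it into a 1-D feature vector."""
    img = image.load_img(img_path, target_size=size)
    arr = image.img_to_array(img)   # NumPy array of shape (224, 224, 3)
    arr = arr / 255.0               # scale pixel values to [0, 1] (assumption)
    return arr.flatten()
```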
Feature Extraction:
As each image is converted to array form with the help of Keras preprocessing, its label is extracted from the image file name saved in the directory.
Immediately after pre-processing of an image is completed, its label is also created and appended as part of the feature data.
Labels are generated based on the content of the image; each class of image array data is assigned its associated label value.
Labels for Binary-class Classification:
Labels for Multi-class Classification:
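These label values, as described in the prediction rules under Model Construction below, can be summarized as simple dictionaries:

```python
# Binary classification labels (Model 1 below)
BINARY_LABELS = {1: "No Umpire", 2: "Umpire"}

# Multi-class classification labels (Model 2 below)
ACTION_LABELS = {1: "No Ball", 2: "Out", 3: "Six", 4: "Wide", 5: "No Action"}
```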
Model Construction:
Model 1: Binary classification using 'LinearSVC' [C=10]: This model classifies images on the basis of whether an umpire is present or not. The defined model predicts the presence of an umpire in each frame based on the label value.
If the Model 1 prediction value is 1: the image/frame has no umpire.
If the Model 1 prediction value is 2: the image/frame has an umpire in it.
The model is trained with the features and labels of the available image data. After adjusting the number of iterations appropriately, model performance is evaluated.
Accuracy: train accuracy 95.30%, test accuracy 95.60%.
The accuracy values were obtained with the iteration limit set to 5000 (the default is 1000).
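A sketch of how Model 1 could be trained with scikit-learn (the `features` and `labels` arrays are assumed to come from the feature extraction step above; the train/test split ratio and random seed are my assumptions, while C=10 and max_iter=5000 are the values stated above):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# `features`: 2-D array of flattened image vectors; `labels`: 1 (no umpire) or 2 (umpire).
# The 80/20 split and random seed are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

model_1 = LinearSVC(C=10, max_iter=5000)  # C=10 and max_iter=5000 as stated above
model_1.fit(X_train, y_train)

print("Train accuracy:", model_1.score(X_train, y_train) * 100)
print("Test accuracy:", model_1.score(X_test, y_test) * 100)
```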
Model 2: Multi-class classification using 'LinearSVC': After an image is successfully classified as containing an umpire, it is processed with the multi-class classifier to identify the class of action occurring in it. Depending on the predicted outcome of Model 2, each image is further classified into one of 5 different classes of umpire signal or gesture. Similar to Model 1, Model 2 is also trained with the data generated during the feature extraction phase for multi-class classification.
If the Model 2 prediction value is 1: the image/frame has an umpire signaling No Ball.
If the Model 2 prediction value is 2: the image/frame has an umpire signaling Out.
If the Model 2 prediction value is 3: the image/frame has an umpire signaling Six.
If the Model 2 prediction value is 4: the image/frame has an umpire signaling Wide.
If the Model 2 prediction value is 5: the image/frame has an umpire performing no action (idle).
Model 2 Performance:
Accuracy: train accuracy 81.95%, test accuracy 80.95%.
The accuracy values were obtained with the iteration limit set to 10000 (the default is 1000).
Validation of the models with one image as input:
Binary classification classifies each frame as umpire/non-umpire.
Multi-class classification predicts umpire signals of Six, Out, No Ball, Wide, or No Action.
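Putting the two models together on a single frame might look like the following (the helper `image_to_feature_vector`, the `ACTION_LABELS` dictionary and the file name come from the sketches above and are illustrative; `model_2` is assumed to be trained the same way as `model_1`):

```python
# Validate both saved models on a single frame (file name is hypothetical).
vec = image_to_feature_vector("frames/sample_frame.jpg").reshape(1, -1)

if model_1.predict(vec)[0] == 2:                     # 2 = umpire present
    signal = ACTION_LABELS[model_2.predict(vec)[0]]  # map 1..5 to the umpire signal
    print("Umpire detected, signal:", signal)
else:
    print("No umpire in this frame")
```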
Implementation:
Input video: a regular cricket match of duration t [minutes/hours]
Output: small video chunks that include highlight content such as Out, Six, No Ball, or Wide.
Pipeline of Video processing :
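The report does not list the pipeline code itself, so the following is only a sketch of how the pieces above could be chained together (the one-frame-per-second sampling, the ±10-second padding around each detection and the use of moviepy for cutting are all my assumptions; the per-frame pre-processing must match whatever was used during training):

```python
import cv2
import numpy as np
from moviepy.editor import VideoFileClip  # assumption: moviepy cuts the output clips

def extract_highlight_clips(video_path, out_dir, sample_every_s=1.0, pad_s=10.0):
    """Sample roughly one frame per second, run the two classifiers, and cut a
    short clip around every frame in which an umpire signal is detected."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(fps * sample_every_s))
    hit_times, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # Pre-processing must mirror the training pipeline (RGB, 224x224, scaled, flattened).
            rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
            vec = (rgb.astype(np.float32) / 255.0).flatten().reshape(1, -1)
            # Keep the frame if an umpire is present and signalling something other than "No Action".
            if model_1.predict(vec)[0] == 2 and model_2.predict(vec)[0] != 5:
                hit_times.append(idx / fps)
        idx += 1
    cap.release()

    clip = VideoFileClip(video_path)
    for i, t in enumerate(hit_times):
        start, end = max(0, t - pad_s), min(clip.duration, t + pad_s)
        clip.subclip(start, end).write_videofile(f"{out_dir}/highlight_{i}.mp4")
```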
Results and Conclusion:
Successfully extracted highlight clips from an input video of a cricket match.
Github-link for sample output: https://github.com/Srialokam/Capstone_spring2021/tree/main/Output_sample
CUT CLIPS:
Changes from the last phase: Since the audio-based analysis involved no machine learning algorithm, it was observed to be inconsistent and incompatible with the true outputs of the video-based analysis. Considering the accuracy of Model 1 and Model 2 and validating their outputs, the video-based analysis alone is fit enough to carry out the project goals.
Alternative Ways and Ideas:
During the course of this project I found that, similar to the classification of the umpire and his gestures, one could also work with the scoreboard that is always displayed alongside the match video. A change in the scoreboard indicates that a highlight occurred at that moment of the match. This alternative way of building the project would also work for other sports, and the dependency on the umpire and his signals could be made low priority.
With appropriate computing resources, this video-based approach to extracting highlights could also be applied to a live-streamed match video.
References:
P. Shukla et al., "Automatic Cricket Highlight Generation Using Event-Driven and Excitement-Based Features," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1881-18818, doi: 10.1109/CVPRW.2018.00233.
Hao Tang, V. Kwatra, M. E. Sargin and U. Gargi, "Detecting highlights in sports videos: Cricket as a test case," 2011 IEEE International Conference on Multimedia and Expo, 2011, pp. 1-6, doi: 10.1109/ICME.2011.6012139. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37109.pdf
A. Ravi, H. Venugopal, S. Paul and H. R. Tizhoosh, "A Dataset and Preliminary Results for Umpire Pose Detection Using SVM Classification of Deep Features," 2018 IEEE Symposium Series on Computational Intelligence (SSCI), 2018, pp. 1396-1402, doi: 10.1109/SSCI.2018.8628877. https://ieeexplore.ieee.org/abstract/document/8628877
"sklearn.svm.LinearSVC — scikit-learn 0.24.2 documentation." https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html. Accessed 13 May 2021.
Barbhuiya, Abul Abbas. "CNN Based Feature Extraction and Classification for Sign Language." Multimedia Tools and Applications, 18 Sept. 2020. https://link.springer.com/article/10.1007/s11042-020-09829-y
opencv-python. PyPI. (n.d.). https://pypi.org/project/opencv-python/
Franky. "Using Keras' Pre-Trained Models for Feature Extraction in Image Clustering." Medium, 5 Apr. 2018. https://franky07724-57962.medium.com/using-keras-pre-trained-models-for-feature-extraction-in-image-clustering-a142c6cdf5b1