Sports Video Classification Framework Using Enhanced Threshold Based Keyframe Selection Algorithm and Customized CNN on UCF101 and Sports1-M Dataset

The computer vision community has taken a keen interest in recent developments in activity recognition and classification in sports videos. Advancements in sports have broadened the technical interest of the computer vision community in performing various types of research. Images and videos are the most frequently used components in computer vision. There are numerous models and methods that can be used to classify videos. At the same time, there is no specific framework or model for classifying and identifying sports videos. Hence, we propose a framework based on deep learning to classify sports videos with their appropriate class label. The framework performs sports video classification on two benchmark datasets, UCF101 and Sports1-M. The objective of the framework is to help sports players and trainers identify specific sports from a large data source, analyze them, and perform well in the future. The framework takes a sports video as input and produces the class label as output, with several intermediary processes in between. Preprocessing is the first step in the proposed framework and includes frame extraction and noise reduction. Keyframe selection, the second step, is carried out by candidate frame extraction and an enhanced threshold-based frame difference algorithm. The final step of the sports video classification framework is feature extraction and classification using a CNN. The results of the proposed framework are compared with those of pretrained neural networks such as AlexNet and GoogleNet. Three different evaluation metrics are used to measure the accuracy and performance of the framework.


Introduction
The processing of images and videos is the core focus of computer vision research. A dynamic area of computer vision is video classification (VC) [1]. Due to its widespread use in automatic video analysis, video retrieval, and similar applications, activity recognition is an essential topic [2]. The classification of sports videos based on semantic information refers to the automatic classification of various sports using machine vision techniques. Due to the huge demand for sports videos for training and development, the classification of sports training videos using machine learning and vision technologies has significant potential for commercial use [3]. However, video classification is still challenging work since it involves a large amount of data and many processing steps [1]. After deep learning models became a popular technique for automatically identifying videos, this subject attracted greater attention [4]. The importance of accurate video classification is underlined by the huge amount of data available both online and in repositories. Convolutional neural networks (CNNs) have been investigated and proven to be useful tools for the categorization and analysis of image content [2]. CNNs have been widely used for object segmentation, detection, and so on. Similarly, video content analysis, video processing, and classification are also implemented with convolutional neural networks. CNNs have been used with the functions and techniques of deep learning and have achieved good results in computer vision applications.
In recent years, the video categorization task has seen tremendous success. This study gives a detailed technical strategy for sports video classification in order to acknowledge the significance of the video classification task and to highlight the accomplishments of deep learning models for this work. People all over the world generate and handle huge amounts of video and share it through social media such as Facebook, WhatsApp, Signal, and so on. On YouTube alone, over one billion hours of video are watched every single day [4]. Businesses like Google AI are investing in several challenges to find creative solutions to difficult issues with limited resources. Google AI has released a public dataset called YouTube-8M, featuring millions of video attributes and over 3700 labels, to promote the development of autonomous video categorization tasks. All of these initiatives highlight the need for an effective video categorization model [4].
The ultimate aim of this work is to classify sports videos based on their content using the proposed model. The model begins by preprocessing the input sports video. Preprocessing includes both frame extraction and noise reduction. Then, the keyframes are extracted using the proposed enhanced keyframe extraction algorithm. Finally, the keyframes are given to the CNN, which extracts feature sets based on its trained knowledge. The input sports video is then assigned to a specific class.
The primary contributions of this work are as follows: (a) frame extraction and noise reduction are carried out in preprocessing; (b) a keyframe selection technique is used to extract keyframes; (c) the framework for classifying sports videos is built using convolutional neural networks and deep learning techniques; (d) the framework is run on the test data.
The efficiency of the suggested technique is then confirmed by comparing the findings to an existing model. The suggested framework for classifying sports video is novel in that it uses a specially designed keyframe selection method, a fuzzy adaptive window-based mean filter (FAWMF) to eliminate noise, and hyper-parameters that are adjusted depending on the two datasets stated above. There are many frameworks or models available for categorizing videos in general, for example, the classification of moving objects, the recognition of human movement, and action recognition. They are discussed in detail in the literature review section. Recent advancements in deep learning models have shown how effective these methods can be at categorizing videos. However, the majority of the popular deep learning models for video classification have been adapted from those in the image and speech domains. Existing works are stated with the name of the model, the optimization techniques or algorithms, the dataset, and the outcome of the model. The framework proposed in this research is specifically for sports video classification; it delivers good performance and accuracy, and its results are compared with the existing work.
The rest of this paper is organized as follows: video classification and the literature review are covered in Section 2. The deep learning architecture, including CNN and its layers, is given in Section 3. The datasets and their characteristics are discussed in Section 4. Section 5 presents the proposed method with its framework architecture; the experimental findings and comparisons to related works are presented in Section 6; the conclusion and future work are presented in Section 7.

Literature Review
Atiqur Rehman and Samir Brahim Belhaouari reviewed video classification across different categories of approaches, such as hand-crafted approaches, 2D-CNNs, 3D-CNNs, spatiotemporal convolutional networks, and recurrent spatial networks [4]. Moumita Sen Sarma et al. categorized traditional sports videos from Bangladesh by extracting both the spatial and temporal characteristics from the recordings [5]. A fundamental contribution of that paper is the development of a model from scratch using the two most popular deep learning techniques, the convolutional neural network (CNN) and long short-term memory (LSTM) [5]. Malik Tabish et al. developed a convolutional neural network (CNN)-based model for recognizing sports activities with similar content. The pretrained VGG16, VGG19, ResNet50, and Inception V3 models are used to train the model, and clustered cricket video frames from a specifically produced dataset are used to test it [2]. Na Feng et al. created the Soccer Dataset for Shot, Event, and Tracking (SSET), a soccer dataset that may be used for research on player tracking, shot segmentation, and soccer event detection [6].
Yunjun Xu et al. presented an event matching method that matches the convolutional neural network output to complete the sports training video classification [3]. Hana et al. present an efficient keyframe extraction method in which keyframe selection is carried out by applying the modularity concept to graph clustering. Their tests demonstrated that the method is effective in extracting keyframes that maintain the pertinent video content without duplication [7]. Shahil et al. propose a sports identification system using a more complex CNN model that includes fine-tuning and a fully connected layer. Five different sports groups are categorized using photographs and videos. Their study employs an image-based video categorization approach [8]. The notable papers are listed in Table 1. Based on the related research, this framework takes sports video classification as its core concept.

Deep Learning Architectures
Machine learning is a subset of artificial intelligence, and deep learning is a subset of machine learning. It is an important element of data science, which includes statistical approaches and predictive models. Data scientists often prefer deep learning architectures for collecting, analyzing, and interpreting very large amounts of data. The biggest advantage of deep learning is that it makes this process smarter, faster, and easier. Deep learning (DL) is a class of algorithms and topologies, not a single method, and can be used to solve a variety of problems. Many architectures and algorithms are used in deep learning. Generally, deep learning architectures are classified into two categories: supervised and unsupervised learning. LSTM and CNN fall under supervised learning. These two are among the oldest and most widely used architectures in various applications. In this paper, we experimented with the pretrained networks AlexNet and GoogleNet, and their results are compared with those of the proposed model.
Computational Intelligence and Neuroscience

CNN: Convolutional Neural Network.
A CNN is a class of artificial neural network (ANN) in deep learning. It is especially useful for analyzing visual content such as images and videos. AlexNet, DenseNet, GoogleNet, LeNet, ResNet, and VGGNet are predefined and widely used architectures for image and video content analysis [4]. The CNN model comprises one or more convolution and subsampling layers, followed by one or more fully connected layers and an output layer [2]. CNNs have proven to be among the most efficient models for classification problems [16] and work well for both image and video analysis. CNNs have become more popular in the past few years and form the basis of modern computer vision applications such as video embedding, encryption, classification, and so on. A CNN can be partitioned into different types of layers, each of which performs a different task. Among these layers, one of the most important is the convolutional layer, which handles feature extraction with the support of convolution maps. The remaining layers are the input layer, the ReLU layer, the pooling layer, and the fully connected layer.

Input Layer.
The input layer of a CNN works in the same way as the input to any model. The only difference is that it takes three-dimensional values. The width and height of the layer represent the horizontal and vertical pixels of the image, respectively, while the depth represents the RGB color channel values.

Convolution Layer.
The major component of a CNN is the convolution layer. This layer is where the convolution occurs, which means the layer tries to find features in the input, such as images or, in the case of videos, frames. Frames and images are stationary, which means that the statistics of one part of the image are similar to those of any other part [2]. As a result, a feature learned in one region can be reused in another region. The features most important for classification are found with the help of filters, which are passed over the image or frame. The result of this process is known as feature extraction. The following variables are used to extract features: A CNN may also contain more than one convolution layer. When we need high-level features, we need to use more than one convolution layer. In the first layer, the network could detect simple edges; in the next layer, those edges could be combined into simple shapes, and so on. Formal convolution layer activity is shown in Figure 1.
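To make the filter-passing step concrete, the following is a minimal NumPy sketch of a single filter sliding over a single-channel frame. It assumes no padding and a stride of 1, and the 3x3 edge filter and the toy frame are illustrative choices that do not come from the paper. As in most CNN libraries, the operation shown is technically cross-correlation (the filter is not flipped).

```python
import numpy as np

def conv2d_valid(frame, kernel):
    """Slide `kernel` over a 2-D `frame` (no padding, stride 1) and return the feature map."""
    fh, fw = frame.shape
    kh, kw = kernel.shape
    out = np.zeros((fh - kh + 1, fw - kw + 1), dtype=float)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Element-wise product of the filter and the patch under it, then sum.
            out[y, x] = np.sum(frame[y:y + kh, x:x + kw] * kernel)
    return out

# Toy 5x5 frame with a vertical edge between columns 1 and 2 (illustrative data).
frame = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Hypothetical 3x3 vertical-edge filter (Sobel-like); a trained CNN learns its own filters.
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(frame, kernel)
print(feature_map)  # strong responses where the filter overlaps the edge, zero elsewhere
```

Stacking several such layers, each followed by a nonlinearity, is what lets later layers respond to shapes built from the edges found by earlier ones.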

The ReLU Layer.
The ReLU layer applies the rectified linear unit activation function to the output of the convolution layer. It helps the network keep its computational cost to a minimum.

Pooling Layer.
A pooling layer performs subsampling, or downsampling. The pooling layer sits between two consecutive convolution layers. The method of downsampling an image is known as pooling. A small region of the convolution layer (CL) output is taken as input and subsampled to a single output value [2]. Popular pooling methods are average pooling, max-pooling, mixed pooling, Lp pooling, stochastic pooling, spatial pyramid pooling, and region of interest pooling. In general, pooling reduces the number of parameters to be calculated and makes the network invariant to translations in form, size, and scale [2]. An average or mean pooling layer achieves downsampling by separating the input into rectangular pooling regions and computing the average value of each region. An average pooling example is shown in Figure 2. SPP (spatial pyramid pooling) removes the fixed-size constraint of the network: it pools the features and generates fixed-length outputs that are then fed into the fully connected layers. The working principle of spatial pyramid pooling is shown in Figure 3. As Figure 3 illustrates, SPP uses several pooling scales that can be applied to convolutional layer features of any size and ultimately produce feature vectors with fixed dimensions [18].
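The two pooling ideas described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the function names, the pooling size, and the pyramid levels `(1, 2)` are assumptions, and max pooling inside each pyramid bin is one common choice among several.

```python
import numpy as np

def avg_pool2d(x, size):
    """Non-overlapping average pooling over size x size regions (input trimmed to a multiple of size)."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.mean(axis=(1, 3))

def spatial_pyramid_pool(x, levels=(1, 2)):
    """Max-pool x into an n x n grid for each n in levels and concatenate the results."""
    feats = []
    for n in levels:
        # Bin edges are scaled to the input, so the n bins always cover the whole map.
        ys = np.linspace(0, x.shape[0], n + 1).astype(int)
        xs = np.linspace(0, x.shape[1], n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                feats.append(x[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max())
    return np.array(feats)

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = avg_pool2d(fmap, 2)                               # 4x4 map -> 2x2 map of region means
spp_small = spatial_pyramid_pool(fmap)                     # 1 + 4 = 5 values
spp_large = spatial_pyramid_pool(np.random.rand(6, 7))     # still 5 values
```

The point of the second function is the last two lines: feature maps of different sizes yield vectors of the same length, which is what allows a fully connected layer to follow.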

Fully-Connected Layer.
In this layer, every unit is connected to all outputs of the previous layer, and it is the last layer in a CNN. It classifies the feature data extracted and downsampled in the previous layers. It takes its input from the previous layer and produces the output.

Dataset
UCF101 is a dataset of realistic action videos with 101 action categories that were gathered from YouTube. It was created by expanding the UCF50 dataset, which has 50 activities. The UCF101 data collection includes 13,320 videos from 101 activity categories. With large differences in camera movement, object appearance and position, object scale, viewpoint, cluttered backgrounds, illumination conditions, and other factors, UCF101 provides the most diverse range of sports in terms of actions. The major purpose of UCF101 is to promote more action recognition research by learning and exploring new realistic action categories. The Sports-1M dataset contains over a million YouTube videos, divided into 487 sports-related categories with 1,000 to 3,000 videos each. The YouTube Topics API is used to automatically categorize the videos into the 487 sports classes by examining the text metadata associated with the videos.

Methodology
A video is a three-dimensional signal in which the horizontal axis corresponds to the frame width and the vertical axis corresponds to the frame height; the third axis depicts the evolution of frame content over time. Figure 4 shows the framework for the proposed sports video classification. Data collection and preprocessing precede the keyframe extraction task. Then, the dataset is divided into training and test sets for the CNN in the ratio of 80:20. The suggested model accepts sports videos as input and generates class labels as output. Two benchmark datasets, UCF101 and Sports1-M, are used to train and test the suggested framework.
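As a small illustration of the 80:20 division mentioned above, the sketch below shuffles a list of items and cuts it at 80%. The function name, the fixed seed, and the use of a plain random shuffle are assumptions; the paper does not say how the split was drawn (for example, whether it was stratified by class).

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical video identifiers standing in for dataset files.
video_ids = [f"video_{i:03d}" for i in range(100)]
train_ids, test_ids = split_train_test(video_ids)
```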

Preprocessing.
Preprocessing is the first step of every research implementation; hence, sports video classification also begins with preprocessing. It involves frame extraction, which converts the given sports video into frames. As mentioned in Section 4, UCF101 has 101 different categories of sports videos, with the sports names used as folder names. In equation (1), $SV_1$ is sports video category 1, and $F_1, F_2, F_3, \ldots, F_{101}$ are the folders for the different sports categories. Each folder holds many videos in different styles.
Frame Extraction.
In this model, frame extraction is one of the major steps in video preprocessing. A video is a collection of pictures that are taken and then shown in sequence, and a single video frame, or image, is obtained by pausing the sequence at a particular point. The mathematical approach for frame extraction is given in equation (3). Frame conversion is the first step in the sports video classification model: the model acquires a sports video as input and converts it into frames. The converted frames are then passed to the next stage, keyframe selection or extraction. Table 2 shows the properties of an input sports video and their respective values. All frames of the original video file are extracted and stored in a specific location. In equation (3), $V_i$ represents each video from the dataset, and $f_i$ denotes the frames of the video, which are indexed from 1 to $n$, where $n$ is the number of frames. A large number of sports video datasets are available on social media and from other sources, but very few are considered standard. Table 3 lists some sample benchmark datasets for sports video.
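The frame-extraction equation itself does not survive in the text above. One way to write it that is consistent with the stated symbol definitions is shown below; this is an assumed form offered for orientation, not the authors' exact formula.

```latex
% Assumed form: video V_i viewed as the ordered set of its n extracted frames.
V_i = \{ f_1, f_2, \ldots, f_n \}, \qquad n = \text{number of frames in } V_i
```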

Noise Reduction Using FAWMF.
When video signals are acquired, transmitted, and received, noise is a significant element that can considerably reduce the quality of the signals [19]. A noisy image or frame may prevent an application from producing an exact outcome, so noise reduction is an attractive process for obtaining better video quality. Since inter-frame noise reduction is efficient only for areas of video frames where there is no motion, we cannot use it in the proposed framework. We can apply a spatial-temporal filter, which is successful in removing noise [20]. In general, such filters can decrease noise efficiently, but they may cause blurring effects on the input frames. The fuzzy adaptive median filtering (FAMF) technique is useful in the preprocessing stage, where it removes the noise in the video frames [21]. FAMF is primarily used to perform spatial processing and identify the pixels that are affected by impulse noise [21]. In this paper, a fuzzy adaptive window-based mean filter (FAWMF) is used for preprocessing the sports video after the frames are extracted. Pixels are categorized as "noise" based on how each pixel in the picture compares to its neighbors. After this noise test, these pixels are replaced by the mean pixel value of their neighbors. FAWMF improves the quality of video frames and removes impulse noise.
Algorithm 1, fuzzy adaptive window-based mean filtering, applies the filter to each frame in a sports video in the following steps. Initialize the window size w(i, j), where i = j = 5. Then, move the filter matrix over the video frame f so that w(0, 0) is aligned with the current frame position (x, y). Multiply each frame element by the corresponding filter coefficient w(x, y). Find the average of the sum of products. Finally, replace the value at the current position with this average.
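The window-based replacement step can be sketched as follows. This is a simplified stand-in rather than the authors' FAWMF: a crisp threshold replaces the fuzzy noise test (whose membership functions are not given in the text), uniform filter coefficients are assumed, the frame is assumed to be grayscale, and the `threshold` value is hypothetical.

```python
import numpy as np

def window_mean_denoise(frame, window=5, threshold=40.0):
    """Replace pixels that differ strongly from their neighbourhood by the mean of the neighbours."""
    pad = window // 2
    padded = np.pad(frame.astype(float), pad, mode="edge")  # repeat edge pixels at the borders
    out = frame.astype(float).copy()
    for y in range(frame.shape[0]):
        for x in range(frame.shape[1]):
            patch = padded[y:y + window, x:x + window]       # window centred on (y, x)
            centre = patch[pad, pad]
            # Mean of the neighbours only (centre pixel excluded).
            neighbour_mean = (patch.sum() - centre) / (patch.size - 1)
            # Assumed noise test: a crisp threshold stands in for the fuzzy rule.
            if abs(centre - neighbour_mean) > threshold:
                out[y, x] = neighbour_mean
    return out

# Flat grey frame with one impulse ("salt") pixel in the middle.
noisy = np.full((7, 7), 100.0)
noisy[3, 3] = 255.0
denoised = window_mean_denoise(noisy)
```

Only the outlier is rewritten; pixels that already agree with their neighbourhood are left untouched, which limits the blurring that a plain mean filter would introduce.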

Keyframe Selection.
Keyframe selection is one of the most crucial tasks in the video classification model, since processing all the frames in a video increases both time and space complexity, which may degrade the performance of the model. Figure 5 shows subsequent frames with similar content. The content of the frames and their features are similar, so we do not want to train and test on all the similar frames. A 00.05-minute sports video may have a minimum of 50 to 55 frames. This model was trained and tested with the UCF101 and Sports-1M datasets, each of which has at least 101 different sports categories. The rationale for keyframe extraction is that there is no significant variation between consecutive frames. The proposed approach for keyframe extraction is composed of two steps. The first step identifies the candidate frames using the skip factor (SF) [1], as stated in Algorithm 2. Not all the frames in a video need to be processed, because consecutive frames may share common objects and features. The second step applies an enhanced threshold-based frame difference (ETFD) algorithm, given in Algorithm 3, to identify keyframes. The algorithm extracts 24% of the frames from a video. As a result, processing time is reduced while model performance improves. A mathematical formula for candidate frame selection is given as follows: Once candidate frames are extracted, the enhanced threshold-based frame difference algorithm is applied for keyframe extraction. Algorithm 3 takes candidate frames as input and applies the frame difference method to extract keyframes. A mathematical formula for keyframe selection is given as follows:
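The two-step selection described above can be sketched as follows. The formulas for candidate and keyframe selection do not survive in the text, so this is an assumed reading: candidates are taken every `skip_factor` frames, and a candidate becomes a keyframe when its mean absolute pixel difference from the most recent keyframe exceeds `threshold`. Both parameter values and both function names are hypothetical.

```python
import numpy as np

def candidate_frames(frames, skip_factor=2):
    """Step 1: keep every skip_factor-th frame as a candidate."""
    return frames[::skip_factor]

def select_keyframes(candidates, threshold=10.0):
    """Step 2: keep a candidate when it differs enough from the last selected keyframe."""
    if len(candidates) == 0:
        return []
    keyframes = [candidates[0]]  # the first candidate is always kept
    for frame in candidates[1:]:
        # Frame difference: mean absolute pixel difference from the last keyframe.
        diff = np.mean(np.abs(frame.astype(float) - keyframes[-1].astype(float)))
        if diff > threshold:
            keyframes.append(frame)
    return keyframes

# Toy clip: four dark frames followed by four bright frames (one scene change).
frames = [np.zeros((4, 4))] * 4 + [np.full((4, 4), 100.0)] * 4
candidates = candidate_frames(frames, skip_factor=2)
keyframes = select_keyframes(candidates, threshold=10.0)
```

In this toy clip, eight frames reduce to four candidates and then to two keyframes, one per distinct scene.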

CNN.
The convolutional layer is the most important element of the convolutional neural network (CNN). We applied a CNN to classify objects in all the frames from the sports video dataset into various classes. We trained on over 1,800 videos in the UCF101 dataset, which are identified with 91% accuracy. The working model of the CNN is shown in Figure 6. A CNN is always composed of multiple layers, one after another. The network begins with convolution and pooling layers, which are mainly used for breaking down the input frames into features and studying them autonomously. The features are then transferred to the appropriate classes in the classification process [23]. The outcome of this process is fed to the next layer, the fully connected layer, which makes the final classification. Figure 7 displays the training progress of the proposed model. For this purpose, we apply ReLU, which sets all negative values in a frame (a two-dimensional matrix) to zero. A value of zero means that the specific picture element carries no value. The max-pooling layer takes the maximum value from the matrix. The softmax function is then used to assign decimal probabilities to all classes using the output of the fully connected layer. Figure 8 illustrates the comparison of various optimizers and their accuracy when applied to the UCF101 and Sports1-M datasets.

Table 3: Sample sports video benchmark datasets.
Year   Name of the dataset   Number of videos   Number of classes/actions
2009   UCF11                 1600               11
2009   UCF sports            150                10
2010   UCF50                 50                 10
2010   Olympic sports        800                16
2012   UCF101                13320              101
2014   Sports1-M             1133158            487
2018   Youtube8-M            6.1M               3862

Algorithm 1: Fuzzy adaptive window-based mean filtering.
Input: Frame f_i
Output: Enhanced frame f_i'
Step 1: Initialize the window w(i, j), where i = 5 and j = 5
Steps 2 to 4: Move the filter window over the frame so that w(0, 0) is aligned with the current position (x, y); multiply each frame element by the corresponding filter coefficient; average the sum of products and replace the value at the current position with this average
Step 5: Repeat Steps 2 to 4 until the entire frame f_i is processed, then stop the loop
Step 6: Display "De-noised or enhanced frame f_i"
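The ReLU, max-pooling, and softmax operations described in the CNN paragraph above can be illustrated with a short NumPy sketch. The pooling size of 2 and the sample values are assumptions chosen for readability; they are not the settings of the proposed network.

```python
import numpy as np

def relu(x):
    """Set every negative value to zero and keep positive values unchanged."""
    return np.maximum(x, 0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over size x size regions."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

fmap = np.array([[-1.0, 2.0, -3.0, 4.0],
                 [5.0, -6.0, 7.0, -8.0],
                 [-9.0, 10.0, -11.0, 12.0],
                 [13.0, -14.0, 15.0, -16.0]])
activated = relu(fmap)            # negatives become 0
pooled = max_pool2d(activated)    # 4x4 -> 2x2, keeping the largest value per region
probs = softmax(np.array([1.0, 2.0, 3.0]))  # hypothetical scores for three classes
```

The class with the highest probability in `probs` would be reported as the predicted label.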

Experimental Results
Experiments were performed on the UCF101 and Sports1-M datasets for sports video classification. Table 4 shows the experimental setup, the hyper-parameters, and their respective values used for the implementation. These datasets were chosen because they cover almost all kinds of sports. The UCF101 and Sports1-M datasets have 13,320 videos with 101 different classes and 1,133,158 videos with 487 classes, respectively. Initially, we evaluate the performance of pretrained models such as AlexNet and GoogleNet on the UCF101 dataset. Then, we fine-tune these models with the Sports1-M dataset. Four sports videos are randomly selected from the UCF101 dataset and correctly classified by the proposed model; the result is shown in Figure 9. Figures 10 and 11 show the training loss of a pretrained model using various optimizers with the UCF101 and Sports1-M datasets, respectively. Figures 12 and 13 show the training loss of the proposed model for the UCF101 and Sports1-M datasets.

Evaluation Metrics.
Space complexity, time complexity, precision, recall, F-measure, and compression ratio are general metrics for measuring the performance of an algorithm. Accuracy, F1-score, precision, and recall are frequently used evaluation indices for multilabel classification [25]. We used four of the six general metrics to measure the performance of the proposed model. The compression ratio, which is based on keyframe selection, is determined by the numbers of uncompressed and compressed frames of a video:
$$\text{Compression Ratio} = \frac{N_f}{K_f}, \tag{7}$$
where $N_f$ is the total number of frames in a video and $K_f$ is the number of keyframes selected for further processing. Recall and precision are employed in the fields of image classification, information retrieval, video classification, and segmentation.
Here, NVCA is the number of videos classified accurately, TNVC is the total number of videos classified, and TNVDS is the total number of videos in a dataset. A benchmark metric known as the F-measure uses the harmonic mean to combine the precision and recall values into a single value.
These measurements are determined by categorizing shots correctly or incorrectly for each category [12]. Accuracy is calculated as the ratio of true positive and true negative samples to the total number of samples [2], where TNV is the total number of videos and TNVCA is the total number of videos classified accurately.

Algorithm 3 (keyframe selection). Input: CF. Output: KF. Notation: CF, candidate frame; KF, keyframe; FD, frame difference; T, threshold value. Procedure: keyframe.
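The metrics above can be computed with a few one-line functions. Only the compression ratio formula is given explicitly in the text; the precision, recall, and accuracy formulas below are assumed from the symbol definitions (NVCA, TNVC, TNVDS, TNVCA, TNV), and the F-measure is the standard harmonic mean. The counts in the example are made up for illustration and are not the paper's results.

```python
def compression_ratio(total_frames, keyframes):
    """Equation (7): N_f / K_f."""
    return total_frames / keyframes

def precision(nvca, tnvc):
    """Assumed form: accurately classified videos / all videos classified."""
    return nvca / tnvc

def recall(nvca, tnvds):
    """Assumed form: accurately classified videos / all videos in the dataset."""
    return nvca / tnvds

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def accuracy(tnvca, tnv):
    """Assumed form: accurately classified videos / total number of videos."""
    return tnvca / tnv

# Illustrative numbers only: keeping 24 of 100 frames, and made-up classification counts.
cr = compression_ratio(100, 24)
p = precision(90, 95)
r = recall(90, 100)
f = f_measure(p, r)
print(f"CR={cr:.2f}  precision={p:.3f}  recall={r:.3f}  F={f:.3f}")
```

With 24% of the frames retained, as the keyframe selection section reports, the compression ratio works out to roughly 4.17.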

Results.
The proposed framework is assessed using the evaluation metrics. The framework is trained and tested on the UCF101 and Sports1-M benchmark datasets. Table 5 lists the accuracy obtained with the various optimizers that were used to investigate performance. According to the analysis of these datasets, the UCF101 dataset with the SGDM optimizer produced high accuracy in both training and testing. Comparatively, the combination of the keyframe selection method and the fuzzy adaptive window-based mean filter performs better. The framework offers improved performance at a lower computational expense. As long as there is sufficient training data, feature extraction architectures using a CNN (convolutional neural network) can outperform those using hand-crafted features.

Performance Comparison with Other Works.
The majority of current architectures are still unable to handle the more complex nature of video data, which contains a wealth of information in the form of spatial, temporal, and audio cues [4]. Tables 6 and 7 show the results of the proposed model alongside a few pretrained network models. According to the investigation and the experimental results, the proposed framework produced better accuracy than the existing architectures with various optimizers. The results of the proposed framework are given in Table 8. The training and test accuracy of the proposed framework is 92.77% and 93.59%, respectively, for the UCF101 dataset, and 82.52% and 89.75%, respectively, for the Sports1-M dataset. The investigation also showed that the proposed framework obtained its best precision, recall, and F-measure at 96%, 94%, and 94%, respectively. Based on the results of training and testing, the suggested framework performs effectively in terms of both time and cost.

Conclusion and Future Work
In this research, we proposed a framework for classifying sports videos using the UCF101 and Sports1-M datasets. The framework takes a sports video as its input and uses a number of intermediary processes to obtain the appropriate class label. It begins with frame extraction, keyframe selection using the skip factor, and noise reduction, which are intermediate steps that are followed by the custom CNN. The customized CNN was trained and tested using various optimizers, including SGDM, ADAM, NADAM, ADADelta, and ADAGrad. The CNN is employed to extract the features and categorize the data in accordance with the objective of the research. The output of the suggested framework is then compared with the output of pretrained neural networks such as AlexNet and GoogleNet, and the results are stated. The effectiveness and performance of the framework are evaluated using three separate indicators. Only the two benchmark datasets are used for training and testing the proposed system. Therefore, with the stated experimental setup, the framework can only give results that are adequate for these two benchmark datasets, which can be a drawback. In the future, we may use effective keyframe extraction algorithms, different optimizers, and/or improved noise removal techniques to obtain better classification results for the same sports video classification problem [28] [29] [30].

Data Availability
The datasets are available in the UCF101 and Sports1-M repositories.

Conflicts of Interest
The authors declare that they have no conflicts of interest.