Keyframe Extraction Algorithm for Continuous Sign-Language Videos Using Angular Displacement and Sequence Check Metrics

Dynamic signs in sentence form are conveyed in continuous sign-language videos, where a series of frames depicts a single sign or phrase. Most of these frames are noninformational and hardly affect sign recognition. By removing them from the frameset, a recognition algorithm needs to process only a minimal number of frames for each sign, which reduces the time and space complexity of such systems. The proposed algorithm deals with the challenge of identifying tiny-motion frames, such as tapping, stroking, and caressing, as keyframes in continuous sign-language videos with a high reduction ratio and accuracy. Unlike previous studies, the proposed method maintains the continuity of sign motion instead of isolating signs; it also supports the scalability and stability of the dataset. The algorithm measures angular displacements between adjacent frames to identify potential keyframes; noninformational frames are then discarded using a sequence check technique. Phoenix14, a German continuous sign-language benchmark dataset, is reduced by 74.9% with an accuracy of 83.1%, and the American Sign Language (ASL) How2Sign dataset is reduced by 76.9% with 84.2% accuracy. A low word error rate (WER) is also achieved on the Phoenix14 dataset.


Introduction
Sign language, a visual language, is used by the majority of hard-of-hearing people. Both static and dynamic gestures are used to represent words and phrases in sign language. Continuous sign language (CSL) is a collection of sign expressions that can be expressed as a sequence of motions in both space and time. The continuous sign language recognition and translation (CSLRT) task aims to bridge the gap between sign and spoken language by recognizing a series of continuous gestures and translating them into natural language expressions. One sign sentence can contain 100-250 frames (approximately 9 words), depending on the frame rate of the recording device. Not all of these frames are required for sign interpretation: transition frames and noninformative frames can be removed from the frameset, leaving about 1-5 keyframes per word. Keyframes are the most informative frames in CSL, containing extensive sign gesture and motion information. Removing the rest reduces storage and execution overheads, and with a proper keyframe set, neural models can extract spatial and temporal features more precisely. With applications in fields such as action detection [1], video summarization [2], educational video summarization [3], video segmentation [4, 5], and video copyright protection [6], keyframe extraction from videos is one of the thoroughly investigated topics that keep the scientific community interested. Keyframe extraction from sign language videos is considered challenging because there are no indicators identifying the start and end frames of signs in the video, and small motions that may be part of a sign must be identified. Most existing keyframe extraction methods are not suitable for sign language video representation, as they do not meet requirements such as multielement identification, minor motion detection, continuity, scalability, and stability.
1.1. Keyframe. Keyframe extraction can be defined as follows: let a video F be represented as a set of frames, F = {f_1, f_2, ..., f_n}, where n is the total number of frames in F.

The keyframe set is then represented by K such that K ⊂ F and abstractly represents the original video with a frameset of length m less than n.

The keyframe extraction algorithm can be represented as follows. Let H be a keyframe extraction algorithm; then the keyframe set K of m elements can be defined as

$$K = H(F),$$

where K is the reduced video. The efficiency of the algorithm H depends on the reduction rate it attains and the accuracy with which the signs can be recognised. The reduction rate can be expressed as

$$R = \frac{n - m}{n} \times 100\%.$$

Let $K = \{K_1, K_2, \ldots, K_p\}$ be a keyframe sign sequence and $S = \{S_1, S_2, \ldots, S_q\}$ be the ground-truth sign frames of a video F of size r frames; then the accuracy is the proportion of frames correctly recognised against the ground truth:

$$A = \frac{n(K \cap S)}{n(S)},$$

where n(·) gives the total number of frames in a set. The concept of keyframe extraction is depicted in Figures 1 and 2. The sign language representation of the word "LIEB" from the Phoenix14 dataset [7] is illustrated in Figure 1. The keyframes for the same word can be thought of as in Figure 2(a), with two frames conveying the gesture structure and motion. A graphical depiction of the word "LIEB" based on the SignWriting [8] technique, taken into consideration for evaluation, is shown in Figure 2(b). It is noteworthy that a word of 14-frame length may be reduced to a two-frame representation, which acts as the keyframe set and properly communicates the sign word concept. The lack of distinct word breaks and the continuous gesture transitions make keyframe extraction from CSL video challenging. The gesture position, orientation, and direction of movements must be considered when finding keyframes or eliminating uninformational frames. Minor and substantial variations in hand forms, motions, positions, nonmanual elements, context, and signer speed all pose challenges to the keyframe extraction process.
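For concreteness, the two quantities above can be computed directly from frame index sets. The sketch below is a minimal illustration, assuming keyframes and ground-truth sign frames are available as sets of frame indices; the function names are ours, not the paper's, and taking the ground truth as the accuracy denominator is our reading of the definition above.

```python
# A minimal sketch of the two metrics, assuming keyframe and ground-truth
# sign frames are given as sets of frame indices (hypothetical inputs).

def reduction_rate(n: int, m: int) -> float:
    """R = (n - m) / n * 100: percentage of frames removed (n frames in, m kept)."""
    return (n - m) / n * 100.0

def accuracy(keyframes: set[int], ground_truth: set[int]) -> float:
    """A = n(K ∩ S) / n(S): share of ground-truth sign frames retained as
    keyframes (the ground-truth denominator is our assumption)."""
    return len(keyframes & ground_truth) / len(ground_truth)

# Example with the figures reported in the results section: 176 frames -> 48 keyframes.
print(f"R = {reduction_rate(176, 48):.1f}%")  # ~72.7
```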
CSL videos contain two types of frames: informational and noninformational. The suggested method aims to find a set of keyframes that accurately and efficiently reflects the maximum sign information from a continuous sign-language video at a good reduction rate. The orientation information contained in sign words must also be preserved for a sign to be correctly recognized.
By integrating all keyframes, an abstract of a specific sign can be obtained. The motivation for using continuous sign-language keyframes is strong, since they reduce processing time and storage requirements for representation learning models and other related computer vision tasks.
Keyframe extraction strategies employed in motion analysis, video summarization, or compression cannot directly serve CSL videos. The spatial, temporal, and directional characteristics of gesture frames in CSL must be evaluated to determine whether they are informative. Certain signs differ solely in the direction of motion of the sign elements, so the direction of motion is important information for fully interpreting the link between movements and hence the gesture. This is the first time the concept of gesture orientation has been examined in the keyframe extraction task.
The majority of current sign language keyframe extraction research focuses on dynamic gesture videos (word level), with a few attempts on continuous sign language (sentence level) using the hand as the region of interest [19, 20], leaving the nonmanual elements unresolved. A combination of image entropy and density clustering is used to obtain the keyframes of hand gesture videos in [21]; minor motions and motion directions cannot be taken into account by this method due to its static threshold value, making it ineffective for CSL videos. The research in [22] identifies significant frames and treats each gesture as a separate, isolated gesture using a gradient-based keyframe extraction technique; the direction of motion continuity and minute motions are left unresolved. Most sequential approaches use static thresholds, as in [20, 23], which make it difficult to record small, repetitive movements. Specifically, tapping or rubbing does not propagate data over successive frames, preventing static thresholds from distinguishing movements between such frames. Solutions based on threshold values such as entropy or sampling do not address scalability or signer independence [14, 24]. This work handles these sign gestures effectively and consistently throughout a huge dataset, which had never been studied before. The proposed work offers an interesting, simple, and efficient approach for extracting successive keyframes from CSL video, which may be fed into a CSLR system for speedy decisions, while taking into account the hurdles and flaws of earlier works. The following contributions make up this work:

(1) This study proposes a new approach for choosing keyframes from continuous sign video, which significantly reduces computation overhead in the time and space dimensions.

(2) An angular displacement metric is used to evaluate the motion between frames.

(3) The decision of keyframe selection is based on the whole frame; thus, all sign elements are considered.

(4) A sequence check metric and frame pixel difference with an adaptive threshold are used to reduce the candidate keyframe set.

(5) To analyse and visualise the suggested technique, this work utilises the sign representation method SignWriting [8].

(6) WER is calculated in conjunction with existing sign language recognition systems to analyse the performance of the reduced dataset.
The remaining sections are organized as follows. Section 2 reviews keyframe extraction techniques used in sign language recognition systems. The proposed FSC2 (frame sequence count check) keyframe extraction algorithm is described in Section 3. The experimental results are presented in Section 4. Lastly, a summary of the proposed work and some suggestions for further research are presented.
1.2. Related Work. This section discusses the keyframe extraction techniques that were employed in prior research on sign language recognition tasks.
Keyframe extraction utilising time-varying parameter detection was proposed in [25]. The authors used statistical analysis of variables such as position, posture, orientation, and motion to detect discontinuities in frames, considering only the major motion elements. In [26], gesture motions such as preparation motion and unnecessary movements between sign phrases were deleted using fuzzy partitioning and state automata. For filtering uninformative frames, the authors of [27, 28] employed a gradient-based keyframe extraction method. In [29], the authors randomly sampled 10-50 keyframes from each video and translated the sign video representations directly to spoken language. A method for extracting keyframes from a trajectory density curve using a sliding window is proposed in [19]. In [30], an online low-rank approximation of sign videos is employed to choose keyframes. A method for locating video frames representing single signs in a one-hand finger alphabet is provided in [20], which uses a combination of object tracking and visual attention. In [31], the angular and distance metrics of a 3D trajectory skeleton are used for keyframe detection.
The ARSS approach for optimal sampling and alignment of RGB and depth input is proposed in [32], acquiring a relatively complete keyframe set of the video. In [33], a new sampling approach called keyframe-centred clips (KCCs) sampling was given, with the goal of selecting a specific number of frames to describe the entire sign language video. In comparison with other sampling methods, KCC has greater recognition performance. To improve KCC sampling, a method termed optimised keyframe-centred clip (OptimKCC) sampling was proposed in [14], which optimises KCC sampling using the DTW distance. In all of the preceding studies, signs are treated as isolated.
The authors of [34] proposed two types of distances, interkeyframe distances and model set distances; the sum of the distances to other keyframes and the average distance from the model set are used to pick the keyframe set K. In [35], Zernike's moments (ZMs) were used to detect keyframes in a dynamic gesture video clip: a keyframe is one in which the ZM difference between neighbouring frames is greater than a fixed value (set to 50). In [36], a random sampling method is applied. A sequence technique based on statistics of elements such as colour, picture difference, and weighted frames is proposed in [13] to detect keyframes from dynamic sign-language videos. Edge detection and the discrete wavelet transform are used in [37] to extract keyframes. A hybrid clustering approach is provided in [38], yielding two sets of keyframes: the spliced original keyframe image represents the spatial dimension feature, and the optical flow keyframe image represents the time dimension feature. The author of [24] proposed the median of entropy of mean frames (MME) approach for keyframe extraction, which uses the mean of consecutive k frames of video data with a sliding window of size k/2 to select the frame that satisfies the median entropy value. The methodology used in [39] considers multiple evaluation factors to select critical frames from raw videos: for creating high-quality video clips, essential frames are chosen based on hand height, hand movements, and frame blurriness levels. In [40], hand coordinates were the parameter used for sampling keyframes. In [41], the author proposed a clip summary approach to choose the important video clips. In [42], the author used DTW for keyframe extraction.
In comparison with other computer vision tasks that use keyframe extraction, such as video summarization and compression, there are few works on keyframe extraction from sign language videos, and it remains a challenging research subject. The majority of the work focuses on word-level or small-phrase extraction, which falls under isolated signing. Owing to its complexity, there is very little literature in the realm of continuous sign-language videos. A continuous sign-language sentence stream can have over 250 frames, with a few keyframes functioning as representative frames and the rest as transitional or noninformational frames. Due to the little variation between two consecutive frames and the long length of the input, the demands of modelling the temporal sequence of signs at the sentence level are rather stringent.
The majority of early techniques used threshold settings that varied depending on the dataset, which reduced stability and scalability. Repetitive signs and signs with little momentum were disregarded, resulting in information loss. Most early research treats the principal hand structure as a single region of interest, retrieved using a segmentation method, in order to condense the gesture space. In addition, each sign phrase's beginning and ending frames were manually chosen, and continuous signs were transformed into isolated frames to control the motion. When designing an algorithm for a continuous sign video challenge that heavily relies on continuous data, such restrictions must be minimized.
This work proposes an algorithm that addresses the significant difficulty of keyframe extraction in CSL videos, based on the difference in angular displacement of pixels between frames and a sequence check metric.

Frame Sequence Count Check (FSC2) Keyframe Extraction Algorithm

The FSC2 keyframe extraction algorithm is designed in simple, statistical steps to keep it lightweight and effective. The proposed FSC2 keyframe extraction architecture is shown in Figure 3. The algorithm has three phases of execution: motion analysis, wrapper, and reduction. Motion analysis uses the Gunnar Farnebäck optical flow algorithm [43] to obtain optical flow data between adjacent frames. These data are fed to the wrapper, where the α value, the mean of the angular displacement differences obtained from the optical flow data, is calculated. The selector then places the frames into two boxes depending on the α value and weighs them; these form the candidate keyframes. The sequencer receives these frames, counts how many of them occur in sequence, and updates the weights accordingly. Inside the reducer, the frames are sequence-checked and then reduced using the S-reduction algorithm. S-reduction counts the number of frames in each sequence. For a sequence of 3, if the middle element has a positive α value, it is kept and the other frames are discarded; otherwise, the middle element is discarded. For a count of 2, if one frame is from the box with a negative α value, it is rejected; otherwise, both are kept. If the sequence is longer than 3, the mean pixel difference is used as the threshold for reduction. The output is a collection of keyframes that forms the abstract of the signs in the CSL video.
The FSC2 keyframe extraction algorithm evaluates a second-order frame difference by employing a two-frame optical flow calculator (Gunnar Farnebäck) as the first-order difference and the difference of two successive optical flows as the second-order difference. To analyse the motion in frames, the algorithm therefore relates each decision to three subsequent frames, which aids in capturing minute interframe motions.

Motion Analysis.
The optical flow algorithm calculates the motion of each pixel between two consecutive frames. The Gunnar Farnebäck optical flow method [43], a two-frame motion estimation algorithm developed to produce dense optical flow results, was employed in this study to determine the optical flow information between two successive sign frames. The algorithm is broken down into four steps. In the first step, optical flow is determined by quadratic polynomials representing the local neighbourhood of an image. In the second step, these quadratic polynomials are used to generate a new signal from a global displacement. The following step involves equating the quadratic polynomials to calculate the global displacement. The coefficients are then calculated using a weighted least-squares estimate over the pixels.

Gunnar Farnebäck Optical Flow.
The Gunnar Farnebäck two-frame method was chosen for this study because it can examine the displacement of each individual pixel between subsequent frames, which suits the observation that sign-language frames have a lot of small motion embedded in neighbouring frames.
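As an illustration of this step, the dense flow field can be obtained with OpenCV's implementation of the Farnebäck method; the parameter values below are common defaults, not settings reported by the paper.

```python
# An illustrative call to OpenCV's Farnebäck implementation for the dense
# optical flow between two consecutive sign frames.
import cv2

def dense_flow(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Returns an H x W x 2 float array of per-pixel (dx, dy) displacements.
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```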
The mathematical representation of the algorithm is as follows. The image intensity of the first frame, modelled by a quadratic function in the neighbourhood of pixel location x, can be represented as

$$f_1(x) = x^{\top} A x + b^{\top} x + c,$$

where A is a symmetric matrix, b is a vector, and c is a scalar. The coefficients are obtained by fitting a weighted least squares to the intensity values in the neighbourhood. For the second frame, obtained by a global displacement d,

$$f_2(x) = f_1(x - d).$$

On expanding and substituting, the linear coefficient of $f_2$ becomes $b_2 = b - 2Ad$, so the displacement can be recovered as

$$d = -\frac{1}{2} A^{-1} (b_2 - b).$$

Further details can be found in the original paper [43].
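As a quick sanity check of the displacement identity, the relation $b_2 = b - 2Ad$ can be verified numerically; the matrix and vectors below are arbitrary illustrative values, not data from the paper.

```python
# Numeric check: recovering the displacement d from the polynomial coefficients.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric matrix of the quadratic model
b = np.array([1.0, -2.0])
d_true = np.array([0.3, -0.7])          # global displacement between frames

b2 = b - 2.0 * A @ d_true               # linear coefficient of the shifted signal
d_est = -0.5 * np.linalg.solve(A, b2 - b)
print(np.allclose(d_est, d_true))       # True
```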

Wrapper.
The Gunnar Farnebäck optical flow algorithm generates an optical flow vector for each pixel between two adjacent frames. By polarising the vector data, the angular displacement $\vec{A}$ is calculated. The next step is to determine the difference in angular displacement between adjacent pairs of flow data, which corresponds to the angular motion across three successive frames, as shown in equation (8):

$$\Delta \vec{A}_i = \vec{A}_{i+1} - \vec{A}_i. \tag{8}$$

The parameter utilised for first-level candidate keyframe selection, α, is then derived as the mean of the angular displacement difference of the flow data, as represented in equation (9):

$$\alpha_i = \operatorname{mean}\left(\Delta \vec{A}_i\right). \tag{9}$$

This process discards a small number of frames.
Thus, the wrapper selects candidate keyframes that may become part of the final keyframe set. The selector and the sequencer are the two components that determine the rating of the frames. The selector checks the α value, distributes the frames into the appropriate boxes, and assigns each frame a weight accordingly. Let $f_w = w_g$ be the weight assigned to frames in the box with α > 0 and $f_w = w_l$ be the weight assigned to the others. This work gives greater priority to frames in the box with α > 0, i.e., $w_g > w_l$.
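A minimal sketch of the α computation and the selector follows. The weight values are illustrative, since the paper only requires $w_g > w_l$, and angle wrap-around at 2π is ignored for simplicity.

```python
# A sketch of the α computation and the selector described above.
import numpy as np
import cv2

W_G, W_L = 2.0, 1.0  # assumed weights for the α > 0 box and the other box

def alpha(flow_prev, flow_next):
    """Mean difference of per-pixel flow angles across three successive frames."""
    _, ang_prev = cv2.cartToPolar(flow_prev[..., 0], flow_prev[..., 1])
    _, ang_next = cv2.cartToPolar(flow_next[..., 0], flow_next[..., 1])
    return float(np.mean(ang_next - ang_prev))

def select(flows):
    """Box and weigh each interior frame i+1 from the flow pair (i, i+1)."""
    boxed = {}
    for i in range(len(flows) - 1):
        a = alpha(flows[i], flows[i + 1])
        boxed[i + 1] = (a, W_G if a > 0 else W_L)  # (α value, frame weight f_w)
    return boxed
```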
The sequencer uses these weighted frames to determine the sequence check, i.e., the number of frames that follow each other, and divides them into three boxes, designated S1, S2, and S3, with score values s of 2, 3, and 4, respectively. Frames with sequence number two are kept in S1, frames with sequence number three are kept in S2, and frames with sequence number greater than three are kept in S3. Single frames without any adjacent frames are discarded in this step, as any abrupt change in motion is considered uninformational.
For example, consider a scenario in which a box contains three runs of consecutive frames: a run longer than three frames is put in the S3 box (for all counts > 3, s is set to 4), frames $f_{10}$, $f_{11}$, $f_{12}$ are placed in the S2 box, and $f_{20}$, $f_{21}$ go to the S1 box. Then, each frame's weight in each box is updated in accordance with equation (9). These weighted frames are what make up the candidate keyframes. In this way, the wrapper performs an initial frame reduction. The frames are then combined and sent to the reduction procedure.
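The sequencer's run counting can be sketched as follows, assuming the candidate frames are given as a sorted list of indices; the scores 2, 3, and 4 correspond to the S1, S2, and S3 boxes described above.

```python
# A sketch of the sequencer: group consecutive candidates into runs and score them.
def sequence_runs(candidates):
    if not candidates:
        return []
    runs, run = [], [candidates[0]]
    for idx in candidates[1:]:
        if idx == run[-1] + 1:
            run.append(idx)          # extend the current consecutive run
        else:
            runs.append(run)
            run = [idx]
    runs.append(run)
    scored = []
    for run in runs:
        if len(run) == 1:
            continue                 # isolated frame: abrupt motion, discarded
        s = 2 if len(run) == 2 else 3 if len(run) == 3 else 4
        scored.append((run, s))      # s is the box score (S1=2, S2=3, S3=4)
    return scored

# Example mirroring the scenario above:
# sequence_runs([1, 2, 3, 4, 5, 10, 11, 12, 20, 21]) -> runs scored 4, 3, and 2.
```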

Reduction.
Upon receiving the candidate keyframes, the reduction unit starts the reduction process. S-reduction and P-reduction are performed based on the sequence count and the pixel difference, respectively. The approach is based on the assumption that a significant number of informational frames is kept in the box with α > 0.

S-Reduction.
There are two types of reductions involved in S-reduction, or sequence-check reduction. The first step in determining potential keyframes is to count the continuous frame sequences in the candidate set. For a sequence count of two, if either frame is from box 2, that frame is discarded; otherwise, both frames are kept. For a set $\{f_i, f_{i+1}, f_{i+2}\}$ with sequence count 3, if the middle frame $f_{i+1} \notin$ box 1, it is discarded; otherwise, $f_i$ and $f_{i+2}$ are discarded. Frames with a sequence count greater than three are sent for P-reduction.
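A sketch of these rules follows, with box membership supplied by a hypothetical predicate `in_box1` (True when a frame's α > 0) and runs longer than three handed off to P-reduction; the fallback when every frame of a 2-run would be dropped is our assumption, since the text only defines the case where one frame is from box 2.

```python
# A sketch of the S-reduction rules over a single run of consecutive frames.
def s_reduce(run, in_box1, p_reduce):
    if len(run) == 2:
        # Drop a frame that falls in box 2 (α <= 0); otherwise keep both.
        kept = [f for f in run if in_box1(f)]
        return kept if kept else run   # assumed fallback: keep both
    if len(run) == 3:
        mid = run[1]
        # Keep the middle frame when it lies in box 1, else keep the ends.
        return [mid] if in_box1(mid) else [run[0], run[2]]
    return p_reduce(run)  # sequence count > 3: defer to P-reduction
```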

P-Reduction.
A keyframe is chosen by comparing the pixel differences between succeeding frames to an adaptive threshold. The mean pixel difference of the current sequence set is used as the adaptive threshold. The final output is the key frameset that represents the sign video abstractly. The algorithmic representation of FSC2 keyframe extraction is given in Algorithm 1. It takes in the frame sequence from the sign video and outputs the keyframe set K; $f_i$ represents the frame index and $f_w$ the frame weight. The number of keyframes for a given sign is chosen by the FSC2 algorithm with no reference to any specific parameters: each sign's motion dictates how many keyframes the algorithm selects for it. For small signs, one or two keyframes are chosen; if a sign moves a lot, the algorithm selects more keyframes to identify it.
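The P-reduction step might look as follows. The adaptive threshold is the mean pixel difference of the current run, as stated above; keeping the frames whose difference exceeds it is our assumption, since the text does not state the comparison direction.

```python
# A sketch of P-reduction for runs longer than three frames.
import numpy as np
import cv2

def p_reduce(run, frames):
    """run: consecutive frame indices; frames: index -> image array."""
    diffs = [float(np.mean(cv2.absdiff(frames[a], frames[b])))
             for a, b in zip(run, run[1:])]
    threshold = float(np.mean(diffs))      # adaptive: mean diff of this run
    kept = [b for b, d in zip(run[1:], diffs) if d > threshold]
    return kept if kept else [run[0]]      # keep at least one representative
```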

Experimental Results and Analysis
Two datasets were used to test the FSC2 keyframe extraction algorithm: the RWTH-PHOENIX-Weather 2014 dataset [7] and the How2Sign dataset [44]. RWTH-PHOENIX-Weather 2014 includes German sign language weather data captured at 210 × 260 pixels per frame at 25 frames per second. Extracting keyframes exactly from this dataset is an important research goal, since it serves as the baseline for current sign language research studies. How2Sign, a multimodal and multiview continuous American sign language dataset, contains more than 80 hours of sign language videos recorded in parallel by 11 signers. The backgrounds of both datasets are static. Three sentences of varying length and signer are taken from the datasets for analysis and visualization. Table 1 details the sentences used for evaluation and analysis: two sentences are from the Phoenix14 dataset and one is from the How2Sign dataset.
Figure 4 demonstrates the output achieved for the 176-frame recording "LIEB ZUSCHAUER ABEND WINTER GESTERN loc-NORD SCHOTTLAND loc-REGION UEBERSCHWEMMUNG AMERIKA IX" and the corresponding SignWriting notation for each word. The suggested approach reduces the frameset from 176 frames to 48 frames, and the figure shows that all informational frames are effectively captured while the directional information is preserved. The sign for the word "LIEB" is well captured as a rubbing gesture, as notated in SignWriting. The "WINTER" gesture, a modest forward and backward motion of both hands, is also captured well with a low frame count.

Analysis of the α Value.
The α value is the difference between two consecutive angular displacement fields obtained from the Gunnar Farnebäck optical flow algorithm. Figure 5 depicts a trace of α over the ground-truth frames from the original videos for the sentences in Table 1. Sentences with varying word lengths, signers, and finger signs were taken at random from the datasets; the ground truth was estimated manually. The α-ground-truth mapping chart illustrates that most signs appear at α > 0; thus, this study, by prioritizing box 1, is capable of identifying the informational frames of signs. As can be seen from Figure 6, the α value can capture both small and large displacements in a sign, benefiting the wrapper and reduction algorithms.

Experimental Setup.
The procedure is divided into two sections. The main contribution is the generation of keyframes from continuous sign-language videos. An AMD Ryzen 5000 series CPU system was used for this study, and the algorithm was implemented in Python 3.10. Google Colab was utilised for training and testing in the second task, which involves sign language recognition.

Performance Analysis.
Three metrics have been used to evaluate the effectiveness of the proposed algorithm: (1) reduction rate (R), (2) accuracy (A), and (3) word error rate (WER). Table 2 shows the reduction rate and accuracy obtained for the two datasets under different keyframe extraction methods. As the values imply, FSC2 performs well on both datasets, capturing the majority of significant frames while eliminating unimportant frames.
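WER here is the standard word-level edit distance (substitutions, deletions, and insertions over the reference length) between recognised and reference glosses; the following is a minimal reference implementation, not the evaluation code of SAN or VAC.

```python
# Minimal WER: Levenshtein distance over word sequences, normalised by
# the reference length, i.e., (S + D + I) / N.
def wer(reference: list[str], hypothesis: list[str]) -> float:
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(reference)
```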
Figure 6 presents the accuracy chart for the different sentences: the keyframes obtained from the FSC2 keyframe extraction algorithm are traced across the ground-truth sign frames. A Venn diagram of the same data is plotted in Figure 7 to demonstrate the reduction and accuracy rates. Table 3 demonstrates that the approach is scalable and stable by providing the representation across different sentences, signers, and sentence lengths.
From the figures, it is evident that the FSC2 keyframe extraction algorithm efficiently captures almost all the major and minor gestures in the continuous sign video. WER is evaluated by giving the reduced frameset to two recognition systems. This work chooses SAN [45] and VAC [46] as recognition systems, and the obtained results are shown in Table 4. SAN [45] is a transformer-based architecture; with some data augmentation, the network attains a better WER when trained with keyframes. VAC [46] uses an iterative training scheme on a CNN framework. Both SAN and VAC were trained and tested with the different datasets obtained from methods such as pixel difference, the gradient-based approach, Zernike's moments, and the FSC2 algorithm. The outcomes demonstrate that the proposed algorithm efficiently collects informational frames while eliminating transitional frames, which is effective on both global and local receptive fields. Figure 8 shows the percentage variance of the WER value obtained after keyframe extraction, based on Table 4. The findings indicate that, compared with the previous methods, FSC2 keyframe extraction reduces WER more successfully; the proposed algorithm reduces the WER relative to the baseline.

Computational Complexity.
The computational complexity of neural network models is commonly assessed using train-time complexity, run-time complexity, and space complexity. With the FSC2 keyframe extraction algorithm, each of these is reduced because the input shrinks from n frames to m keyframes, where m is the size of the reduced dataset.

3.4.1. Space Complexity. Space complexity can be assessed by the amount of space required to store the model input. Worst-case space complexity = O(n), where n is the total number of frames without reduction; average or best-case space complexity = O(m), where m is the total number of keyframes after reduction, with m < n. Table 2 shows that the algorithm reaches a reduction rate of approximately 75%. As a result, the space requirement decreases from n to m, a roughly 75% reduction and a cost-effective solution. Because the number of input frames is 75% smaller than in the original dataset, train-time and run-time complexity are likewise reduced, allowing the network to extract and learn features faster.

3.4.2. Time Complexity. Time complexity can be estimated as the train-time or run-time complexity when the keyframe set is fed as input to the neural network model. Worst-case time complexity = O(n), where n is the total number of frames without reduction; average or best-case time complexity = O(m), where m is the total number of keyframes after reduction, with m < n.

Algorithm 1: FSC2 keyframe extraction. Input: F, the set of all frames in a CSL video; output: K, the keyframe set.
3.5. Scalability and Stability. Scalability refers to the capacity of keyframe extraction methods to run on various kinds of datasets captured under a variety of circumstances and yield exact results. The scalability of keyframe extraction techniques can be affected by variables including data independence, signer independence, and phrase or word length. The FSC2 keyframe extraction algorithm can be used to reduce any sign language dataset, regardless of the type of sign language or the frame rate of the video, and is thus data-independent and scalable. Table 2 shows the reduction rate and accuracy obtained on the two datasets; both the design statistics and the absence of a threshold value lend credence to this benefit. On four distinct signs executed by three distinct signers, the algorithm offers the best and most precise reduction, as shown in Table 3. The signs "MORGEN," "GESTERN," "LIEB," and "KNIFE" were taken into consideration for analysis, and the signs are accurately reduced when performed by various signers. The algorithm was, therefore, determined to be accurate and stable, regardless of signer and language. The word length of the 5670 videos in the Phoenix14 dataset is less than 9 words on average. The How2Sign dataset includes finger signing as well, which the algorithm reduces without the signs becoming isolated signs. A qualitative analysis of the keyframe extraction algorithms can be found in Table 5. The analysis shows that the algorithm successfully meets the abovementioned three significant qualities when extracting keyframes from CSL videos.

Ablation Study
4.1. Changing the α Criteria. The main notion of FSC2 is that keyframes may be identified at α > 0. A study was carried out with α < 0. When compared with the original criterion, the obtained result is less precise: there was inadequate similarity between the ground truth and the keyframes. Figure 9 shows the keyframe count m, reduction rate R, and accuracy A for the two values of α when applied to three sentences. The keyframe count and reduction rate are depicted by bars, while the accuracy is represented by a line. Figure 9 shows that the choice of α > 0 gives better results.

Motion Analysis Using the Lucas-Kanade Method.
The Gunnar Farnebäck algorithm was replaced by the Lucas-Kanade method, and the results demonstrate that the GF algorithm is superior to the LK method because GF captures per-pixel motion between two successive frames and thereby all motions in the signs. Figure 10 shows the performance of both optical flow algorithms when applied to three sentences. Accuracy is represented by the line chart, and it is clear that Gunnar Farnebäck gives a better value, as it can capture the small motions between two frames.

Conclusion
The proposed FSC2 keyframe extraction method is developed to extract keyframes from videos of continuous signs. The extraction process successfully retains every informational frame while achieving a high reduction rate. This enables researchers to complete CSL-related tasks in less time, with less sophisticated computational hardware, and with less storage. In contrast to previous works, the algorithm extracts gesture information from videos while maintaining properties such as continuity and motion direction. Despite the computationally expensive nature of optical flow techniques, FSC2 keyframe extraction is efficient for both long and short sign sequences in terms of accuracy and stability. With statistical methods on optical flow data that run on basic hardware, the algorithm design is kept simple. The results showed that the suggested strategy produced highly competitive outcomes when compared with state-of-the-art approaches. Thus, the algorithm addresses six major problems related to keyframe extraction from CSL videos: stability, scalability, preservation of direction information, detection of small and repeated movements in signs, low information loss with high accuracy, and a good reduction rate. An evaluation of the algorithm's performance on existing systems confirms that it performs the task efficiently. All datasets included in this study have static backgrounds. The angular displacement and optical flow data are affected by background object movement; as a result, the motion estimate employed in the FSC2 approach cannot precisely determine the sign when the background is changing, and the proposed algorithm performs worse than it does on static data. Additional investigation of real-time sign language with different static backgrounds is necessary.

Figure 1: Sign representation of the word "LIEB" without frame reduction (initial transition frames also included).

Figure 2: (a) Keyframe representation of the word "LIEB". (b) Graphical depiction of "LIEB" based on the SignWriting technique.


Figure 5: Graphical representation of the relation between the α value and the ground-truth sign frames for the sentences in Table 1. (a) Ground truth-α mapping: sentence 1. (b) Ground truth-α mapping: sentence 2. (c) Ground truth-α mapping: sentence 3.

Figure 6: A comparison of the ground-truth frames and the keyframes produced by the FSC2 algorithm for the sign videos in Table 1. (a) Ground truth-keyframe mapping to show accuracy for sentence 1. (b) Ground truth-keyframe mapping to show accuracy for sentence 2. (c) Ground truth-keyframe mapping to show accuracy for sentence 3.

Figure 7: Venn diagram showing the algorithm's accuracy and reduction rate for the three videos.

Table 1: The sentences taken for the purpose of analysis and visualization.

Table 2: Performance analysis of the FSC2 keyframe extraction algorithm and existing algorithms in terms of reduction rate and accuracy on the Phoenix14 and How2Sign datasets. S denotes the input video, n the total frame count before reduction, m the keyframe count after reduction, R the reduction rate, and A the average accuracy. The highlighted values show that the FSC2 algorithm gives a higher reduction rate and accuracy.

Table 3: The stability of the algorithm is assessed by comparing the number of keyframes extracted for four sign words from the Phoenix14 and How2Sign datasets, performed by different signers. n denotes the number of frames in the original video and m denotes the number of frames in the reduced set.

Table 4: A comparison of the WER metrics obtained by different systems on sign language recognition tasks, tested on the RWTH-PHOENIX-Weather 2014 dataset. A lower WER value is better. The highlighted values suggest that FSC2 performs well when integrated with existing systems.

Figure 8: Percentage variation of WER calculated for four keyframe extraction algorithms based on Table 4. The SAN and VAC systems are given the new dataset with the maximum accuracy obtained, and WER is calculated; the percentage variance in WER is shown in each case.

Table 5: Comparison of the FSC2 keyframe extraction algorithm with other algorithms on the parameters static threshold dependency, DI: data independency, S: stability, and C: continuity.

Figure 9: Ablation study on the FSC2 algorithm by altering the α value.

Figure 10: Comparison study of the Gunnar Farnebäck optical flow algorithm and the Lucas-Kanade method when applied to FSC2 on three sentences. Metrics considered are the keyframe count, reduction rate, and accuracy.

4.3. Changing the Sequence Value. For keyframe extraction, the FSC2 algorithm examines three sequence values, i.e., 2, 3, and more than 3, to capture both long and short signs. The sequence values were altered in various orders in an attempt to capture the important frames, but the outcome fell short of the standard set by FSC2 in reduction rate and accuracy.