An Improved Sign Language Translation Model with Explainable Adaptations for Processing Long Sign Sentences

Sign language translation (SLT) is an important application for bridging the communication gap between deaf and hearing people. In recent years, research on SLT based on neural translation frameworks has attracted wide attention. Despite this progress, SLT research is still at an early stage. In particular, current systems perform poorly on long sign sentences, which often involve long-distance dependencies and consume substantial resources. To tackle this problem, we propose two explainable adaptations to traditional neural SLT models using optimized tokenization-related modules. First, we introduce a frame stream density compression (FSDC) algorithm that detects and removes redundant similar frames, which effectively shortens long sign sentences without losing information. Then, we replace the traditional encoder in the neural machine translation (NMT) module with an improved architecture, which applies a temporal convolution (T-Conv) unit followed by a dynamic hierarchical bidirectional GRU (DH-BiGRU) unit. The improved component takes temporal tokenization information into account to extract deeper information with reasonable resource consumption. Our experiments on the RWTH-PHOENIX-Weather 2014T dataset show that the proposed model outperforms the state-of-the-art baseline, with BLEU-4 gains of up to about 1.5+.


Introduction
Sign languages are visual natural languages used by deaf people for communication. Since most hearing people cannot understand sign language, sign language translation (SLT) has become an important application for bridging the communication gap between deaf and hearing people. In recent years, researchers have successively proposed deep learning models for neural SLT (e.g., [1][2][3][4][5][6]). The existing SLT models basically follow a multimodal architecture in which a convolutional neural network (CNN) and a neural machine translation (NMT) module are connected sequentially. The CNN module extracts image-level features, reduces the fine-grained input, and generates a tokenization layer as the input to the NMT module; the NMT module is the main translation module, which encodes and decodes to generate target sentences. This basic SLT architecture was first proposed by Camgoz et al. [1]. The tokenization layer serves as a hub layer in this architecture. Hence, optimizing it can improve the performance of both the CNN and the NMT modules.
However, most current SLT works improve only the CNN or the NMT module separately, resulting in a poor connection between the two modules, which causes two serious problems. (1) Poor interpretability: most of the improvements rely on common tricks rather than considering the uniqueness of SLT. The characteristics of SLT make it a special NMT task, although its input form differs from that of conventional spoken language. Therefore, analyzing the input form may help us find interesting SLT phenomena and obtain better interpretability. For a spoken sentence, the input is usually a series of words. Although there are semantic connections between words, they are expressed in a discrete form. For a sign sentence, the input is usually a video signal. In practice, the video needs to be split into consecutive frame images. Intuitively, we can compare each video frame to the basic word element of sign language. Unlike spoken language, the video frames of any sign sentence are continuous, and their order is closely related. In other words, the order of any two frames cannot be reversed. Specifically, we found that there are many similar frames in the temporal neighborhood, and these frames repeatedly express the same meaning, which causes redundant information and long sentences. However, no existing work uses this visual phenomenon to design optimization algorithms tailored to sign language. (2) Poor performance for long sentences: longer sentences result in long-distance dependencies, large resource consumption, and low evaluation scores. This shows that both the CNN and the NMT modules need to be improved. However, the visual CNN module has received more attention than innovations in the NMT module. Besides, improvement from the perspective of model interpretation is also a very important aspect.
Longer sentences mean more frames. The longer the sentence is, the more complicated the relationships between video frames become, which leads to insufficient connection between frames. In theory, the amount of calculation may increase exponentially. Hence, SLT models generally specify a maximum number of input frames for the CNN module. For longer sentences, how to express more effective information within a certain window size is a meaningful research point. However, no existing work considers reducing useless frames based on understandable visual features. Especially for longer sentences, the CNN is under more pressure and is less efficient. If we can reduce the number of sign language frames according to surface-level visual image features, then we may still obtain the same sentence meaning with fewer frames (like turning long sentences into short sentences), which can not only reduce the convolution pressure but also generate a higher-quality tokenization layer. Moreover, the tokenization layer is then input into the NMT module, so optimizing at the tokenization level plays a key role in improving the subsequent NMT.
To solve the above-mentioned issues, we propose a novel SLT model with better interpretability for longer sentences, as shown in Figure 1.
There are two improvements with tokenization-related units.
First, we propose a frame-level frame stream density compression (FSDC) algorithm, which compares pixels at the image level in an unsupervised manner and removes redundant frames in the temporal neighborhood. Intuitively, it can be understood as retaining high-density information by comparing the similarity of input image frames in the neighborhood. The reduced convolution information generates a smaller tokenization layer, which allows more information to be transmitted within the limited window length. Besides, for the NMT module, reducing the number of input frames means a shorter input. Overall, this is a visually interpretable optimization for sign language that converts long sentences into short sentences.
Second, we replace the traditional encoder in the NMT module with an improved architecture to further strengthen the association between the video frames of long sentences. Inspired by the study of FairSeq [7], we propose a hybrid model. The model applies a temporal convolution (T-Conv) unit followed by a dynamic hierarchical bidirectional GRU (DH-BiGRU) unit. It first convolves the input in the time domain and then encodes the semantic information in the subsequent deep hierarchical RNNs. We can still treat the tokenization layer as a vector representation layer of the dimensionality-reduced frames. As an improvement, 3DCNN/C3D was used in the CNN module [8,9] to strengthen the association between frames in the time domain. However, it requires more resources and does not always work well when sign language resources are scarce. We observed that, if the NMT module convolves the sign sentences at the tokenized level in the time domain using a 2DCNN, it can approach both the function of 3DCNN/C3D and the speed of a 2DCNN. All in all, this also shortens long sentences in the time domain and deepens the RNN structure in a hierarchical way. In this case, the NMT structure can handle longer sentences as easily as short sentences.
The main contributions of this paper are as follows: (1) We propose a novel SLT model with tokenization-related units, which handles longer sentences better with lower resource consumption and has better interpretability. (2) We introduce, for the first time, an unsupervised FSDC algorithm that compresses the density of the input frames without removing key information. This method is suitable for many similar video tasks.

Our Proposed Approaches
As a special language, sign language has its own specific linguistic rules as well [10], so the SLT model follows the NMT framework, as shown in Figure 1. Now suppose that $y = (y_1, y_2, \ldots, y_{T_y})$ is an output sentence that corresponds to the sign video frame sequence $x = (x_1, x_2, \ldots, x_{T_x})$ in the training set. At the very beginning, we use the unsupervised FSDC module to optimize the frame-level input sentences. Then, a spatial CNN convolves the frames to obtain the tokenization layer, which is then input into the NMT module for encoding and decoding. In this section, we introduce the proposed approaches in detail.

Unsupervised FSDC Module.
As shown in Figure 2(a), the spatial CNN is mainly used to reduce the fine-grained input of video frames. In SLT, the video frame is the most basic input unit. The compression of video frames directly affects the processing efficiency of the CNN and the quality of the tokenization layer. Therefore, optimizing the number of frames also means optimizing the tokenization layer. For any video dataset, all videos must be split into frames at a fixed number of frames per second (FPS), which leads to many similar redundant frames in the temporal neighborhood. As an illustration, suppose a signer signs the same sentence once at a fast speed and once at a slow speed. Although the two express the same meaning, they produce videos of different lengths. Obviously, a video signed at a slower speed contains more redundant similar frames in the temporal neighborhood.
To reduce this effect, we propose the FSDC algorithm. We delete the less important frames by comparing a similarity index while keeping the order of the remaining frames unchanged. In theory, this helps reduce the amount of training data as well as the errors caused by differences in signing speed and FPS.
We use the SSIM algorithm [11] to calculate the similarity between two images, which is close to the intuitive perception of the human eye. The calculation flow for the structural similarity of frame $f_i$ and frame $f_j$ is shown in Figure 3. The formula of the SSIM algorithm is as follows:
$$\mathrm{SSIM}(f_i, f_j) = \left[L(f_i, f_j)\right]^{x} \cdot \left[C(f_i, f_j)\right]^{y} \cdot \left[S(f_i, f_j)\right]^{z},$$
where $L(\ast)$ denotes the luminance comparison, $C(\ast)$ denotes the contrast comparison, and $S(\ast)$ denotes the structure comparison. The exponents satisfy $x > 0$, $y > 0$, and $z > 0$ (they are unrelated to the sequences $x$ and $y$ above), and we initialize $x = y = z = 1$. $\mathrm{SSIM}(\ast)$ is a decimal between 0 and 1. At the extremes, $\mathrm{SSIM} = 1$ means that the two images are completely identical, while $\mathrm{SSIM} = 0$ means that they are completely different.
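As a rough illustration of the luminance, contrast, and structure comparisons described above, the following Python sketch computes the three terms from whole-image statistics. It is a simplification and not the authors' code: standard SSIM [11] averages these terms over local windows, whereas the hypothetical `global_ssim` below uses a single global window, and the stabilizing constants `c1`, `c2`, `c3` follow commonly used defaults that are assumed here.

```python
import numpy as np


def global_ssim(a, b, data_range=255.0, x=1.0, y=1.0, z=1.0):
    """Single-window SSIM sketch: luminance^x * contrast^y * structure^z.

    Assumption: statistics are taken over the whole image rather than
    over sliding local windows as in the usual SSIM implementation.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    # Small constants that keep the ratios stable (assumed defaults).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    c3 = c2 / 2.0

    mu_a, mu_b = a.mean(), b.mean()
    sd_a, sd_b = a.std(), b.std()
    cov = ((a - mu_a) * (b - mu_b)).mean()

    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    contrast = (2 * sd_a * sd_b + c2) / (sd_a ** 2 + sd_b ** 2 + c2)
    structure = (cov + c3) / (sd_a * sd_b + c3)

    return luminance ** x * contrast ** y * structure ** z


# Usage: identical frames score 1; an inverted frame scores lower.
frame = np.arange(64, dtype=np.float64).reshape(8, 8)
same_score = global_ssim(frame, frame)
inverted_score = global_ssim(frame, 255.0 - frame)
```

With all exponents set to 1, as in the text, the score is simply the product of the three comparison terms.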
The FSDC algorithm calculates the SSIM index between each frame and the frames in its temporal neighborhood. If the SSIM index is greater than a certain threshold $\delta$ ($0 < \delta < 1$), only one of the frames is retained, while the rest are discarded as redundant frames. A running example of Algorithm 1 is shown in Figure 2. Formally, we explore frame-level input tokenization as shown in Figure 2(a) and map the feature vectors to the tokenization layer as $\Gamma = \mathrm{SpatialCNN}(\mathrm{FSDC}(x))$.
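The thresholding step can be sketched as follows. This is a minimal illustration under assumptions, not the authors' Algorithm 1: each incoming frame is compared with the most recently retained frame, and the hypothetical `mean_abs_similarity` is only a cheap stand-in for SSIM so that the example stays self-contained. Any function returning a similarity in [0, 1] can be passed as `sim`.

```python
import numpy as np


def mean_abs_similarity(a, b, data_range=255.0):
    """Hypothetical stand-in for SSIM: 1 minus the normalized mean
    absolute pixel difference, so identical frames score 1.0."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.abs(a - b).mean() / data_range


def fsdc(frames, delta=0.95, sim=mean_abs_similarity):
    """Return the indices of retained frames, in their original order.

    A frame is dropped when its similarity to the last retained frame
    exceeds the threshold delta; otherwise it is kept and becomes the
    new reference frame.
    """
    if len(frames) == 0:
        return []
    kept = [0]  # the first frame is always retained
    for idx in range(1, len(frames)):
        if sim(frames[kept[-1]], frames[idx]) <= delta:
            kept.append(idx)
    return kept


# Usage: three near-identical dark frames, two bright ones, one dark.
shape = (4, 4)
video = [np.full(shape, v, dtype=np.float64) for v in (0, 1, 2, 200, 201, 0)]
kept_indices = fsdc(video, delta=0.95)
```

Because only indices are returned, the temporal order of the surviving frames is preserved, which matches the requirement that frame order stays fixed.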

TC-DHBG-Net for Encoding Stage.
Figure 4 shows the improved NMT module we propose. Specifically, we improve the encoder in two ways. The first is a T-Conv unit for the tokenization layer, and the second is the DH-BiGRUs for mining semantic information.
The T-Conv unit is inspired by the work of Bérard et al. [12] on the end-to-end speech task. It takes as input the sequence of features from the tokenization layer. These features are given as input to two nonlinear (tanh) layers, which output new features of size $n$. In order to enhance the optical flow feature capture, we concatenate the positional encoding [13] to obtain feature vectors with position information. Like [14], this new set of features is then passed to a stack of two convolutional layers. Each layer applies 16 convolution filters of shape (3, 3, depth) with a stride of (2, 2) w.r.t. the time and feature dimensions; depth is 1 for the first layer and 16 for the second layer. We get features of shape $(T_x/2, n/2, 16)$ after the first layer and $(T_x/4, n/4, 16)$ after the second layer. This latter tensor is flattened to shape $(T'_x = T_x/4, 4n)$ before being passed to a stack of three-level DH-BiGRUs. This set of features has one quarter of the time length of the initial features, which speeds up training because the complexity of the model is quadratic with respect to the source length.
The DH-BiGRU unit computes a sequence of annotations $h = (h_1, \ldots, h_{T'_x})$, where each annotation $h_i$ is a concatenation of the corresponding forward and backward states. The hidden state of the last GRU layer in each hierarchy is inserted into the next hierarchy. Formally, we first insert the tokenized vectors into a recurrent neural structure to obtain the semantic information of the context sequence. For the recurrent unit type, we choose GRU [15] instead of LSTM [16] because the former has fewer gate structures. The hierarchical structure [2,12,17] and the bidirectional structure can extract deeper relevant information. Suppose that the hierarchy of HGRU is $n$; then
$$\xi_{\mathrm{encoder}} = \varphi_{\mathrm{en\_rnn}_n,\, \mathrm{en\_rnn}_{n-1},\, \ldots,\, \mathrm{en\_rnn}_1}(\ast) = (h_1, h_2, \ldots, h_{n'}),$$
where $(h_1, h_2, \ldots, h_{n'})$ are the hidden states of the last GRU layer, $n'$ is a variable, and $\varphi_{\mathrm{en\_rnn}}(\ast)$ indicates the processing of the RNN in the encoder.
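The shape bookkeeping of the T-Conv unit described above can be checked with a small sketch. The helper name `tconv_output_shape` is hypothetical, and the code assumes "same"-style padding, so that each stride-2 layer divides the time and feature lengths by two, rounding up; the padding scheme is not stated in the text.

```python
import math


def tconv_output_shape(t_x, n, filters=16, stride=2, layers=2):
    """Return (time_steps, flattened_feature_size) after the conv stack.

    Assumption: each layer uses padding that yields ceil(length / stride)
    outputs along both the time and the feature dimensions.
    """
    t, f = t_x, n
    for _ in range(layers):
        t = math.ceil(t / stride)
        f = math.ceil(f / stride)
    # Flatten the (feature, channel) axes into one vector per time step.
    return t, f * filters


# Usage: 200 frames with feature size 256 give 50 steps of size 1024,
# that is, (T_x / 4, 4n).
flat_shape = tconv_output_shape(200, 256)
```

If the downstream cost really is quadratic in the source length, as stated above, quartering the number of time steps reduces that term by roughly a factor of 16.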

Decoder and Attention
In the decoder, a word embedding maps the vectors of spoken language words to a denser space as follows:
$$\omega_i = \mathrm{Embedding}(y_i),$$
where $\omega_i$ is the embedded version of the spoken word $y_i$. In the decoding stage, we aim at maximizing the probability $p(y \mid x)$. The decoder computes the probability of the translation $y$ by decomposing the joint probability into the ordered conditional probabilities as follows:
$$p(y \mid x) = \prod_{i=1}^{T_y} p(y_i \mid y_1, \ldots, y_{i-1}, x).$$

Attention Mechanism.
Like other SLT models, ours may also suffer from long-term dependencies, vanishing gradients, and performance deterioration with many input frames. To address these issues, we utilize attention mechanisms, which have proved useful in various tasks, including but not limited to machine translation. The most common attention mechanisms are those of Bahdanau et al. [18] and Luong et al. [19]. Based on hyperparameter experiments, we adopt Bahdanau attention. Given the input $x$, we define each conditional probability at time $i$ depending on a dynamically computed context vector $c_i$ as follows:
$$p(y_i \mid y_1, \ldots, y_{i-1}, x) = \mathrm{softmax}(g(s_i)),$$
where $s_i$ is the hidden state of the decoder at time $i$ and $g$ is a linear transformation that outputs a vocabulary-sized vector.
Note that the hidden state $s_i$ is computed as
$$s_i = \varphi_{\mathrm{de\_rnn}}(\omega_{i-1}, s_{i-1}, c_i),$$
where $\varphi_{\mathrm{de\_rnn}}(\ast)$ indicates the processing of the RNN in the decoder, $\omega_{i-1}$ is the word embedding of the previously predicted word $y_{i-1}$, and $s_{i-1}$ is the last hidden state of the decoder. The context vector $c_i$ is computed as a weighted sum of the hidden states from the encoder as
$$c_i = \sum_{j} \alpha_{ij} h_j,$$
where $\alpha_{ij}$ is the weight of each annotation $h_j$.
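The weighted sum over encoder annotations can be illustrated with a small NumPy sketch of additive (Bahdanau-style) attention. The parameter names `W_s`, `W_h`, and `v`, their shapes, and the function name are assumptions made for illustration; the text does not specify how the weights $\alpha_{ij}$ are parameterized in this model.

```python
import numpy as np


def additive_attention(s_prev, H, W_s, W_h, v):
    """Compute the context vector c_i and the weights alpha_i.

    s_prev: previous decoder state, shape (d_s,)
    H:      encoder annotations h_j stacked by row, shape (T, d_h)
    W_s:    (d_s, d_a), W_h: (d_h, d_a), v: (d_a,)  (hypothetical)
    """
    # Additive score for every annotation h_j.
    scores = np.tanh(s_prev @ W_s + H @ W_h) @ v  # shape (T,)
    # Numerically stable softmax over the source positions.
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Context vector: weighted sum of the annotations.
    context = alpha @ H  # shape (d_h,)
    return context, alpha


# Usage with random parameters.
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 6, 4, 3
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
context, alpha = additive_attention(
    s_prev,
    H,
    rng.normal(size=(d_s, d_a)),
    rng.normal(size=(d_h, d_a)),
    rng.normal(size=d_a),
)
```

The weights are non-negative and sum to one, so the context vector is a convex combination of the encoder annotations.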

Experiments
In this section, we report a series of experiments on the RWTH-PHOENIX-Weather 2014T dataset that compare our improved SLT model with tokenization-related units against the baseline.

Baseline.
As described above, the baseline is an attention-based structure that combines a 2DCNN and Seq2Seq sequentially. The spatial 2DCNN is an AlexNet [20] whose parameters are pretrained on ImageNet [21]. The encoder and decoder of Seq2Seq are nonhierarchical GRUs. For a fair comparison with the baseline, all experiments run on the same dataset and in the same GPU environment. Except for the differences mentioned in the paper, all other configurations are the same for all models by default.

Dataset.
The RWTH-PHOENIX-Weather 2014T is the most popular continuous SLT dataset. It was collected by extending the German sign language recognition (SLR) dataset, the RWTH-PHOENIX-Weather 2014 Corpus [22]. Compared with other SLT datasets, this dataset is larger and of higher quality. It contains a vocabulary of 4,839, 8,257 video clips, 947,756 frames, and 113,717 words in total, as shown in Table 1. Each video corresponds to a translation sentence. Although the dataset includes a sign language gloss corpus, where the glosses give the meaning and the order of signs [1,23,24], our model is trained without gloss-level alignment. The use of glosses requires that the order of the word labels in a sentence be consistent with the order of the corresponding visual content. In other words, if the words are out of order, sequential frame-level classification under such word labels is unsuitable. In fact, most datasets do not include gloss annotations. Although we do not use glosses in this work, we conducted NMT experiments using glosses to obtain optimal settings, as in [1].

Settings.
Based on the baseline conclusions and our experience, we preset some important hyperparameters. We use GRU as the recurrent unit for both the encoder and the decoder, where each recurrent layer contains 1,000 hidden units. During training, we use the Adam optimizer [25] with a learning rate of 0.00001, a decay factor of 0.98, and a batch size of 1. During decoding, we use beam search with a beam width of 3 to generate sentences.
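To illustrate the decoding setting, here is a minimal beam search sketch with a beam width of 3. The `step_fn` interface, the toy probability table, and the token names are hypothetical and stand in for the real decoder; scores are plain sums of log-probabilities without length normalization, which is an assumption of this sketch.

```python
import math


def beam_search(step_fn, bos, eos, width=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(sequence) must return a dict mapping each candidate next
    token to its log-probability given the sequence so far.
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the `width` best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy model in which the greedy first choice ("a") is not the best path.
TOY = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"z": math.log(1.0)},
    "x": {"</s>": 0.0},
    "y": {"</s>": 0.0},
    "z": {"</s>": 0.0},
}


def toy_step(seq):
    return TOY[seq[-1]]


best = beam_search(toy_step, "<s>", "</s>", width=3)
greedy = beam_search(toy_step, "<s>", "</s>", width=1)
```

In this toy table, a width of 1 commits to the locally best first token, while a width of 3 keeps the alternative alive long enough to find the sequence with the higher overall probability.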

Evaluation.
We use BLEU [26] and ROUGE [27] as the evaluation metrics, which are the most widely used in machine translation tasks. Note that the BLEU score is reported as BLEU-1, 2, 3, and 4, and the ROUGE score refers to the ROUGE-L F1 score. In training, the BLEU-4 score on the development set is used to select the best model.

Comparison to Existing Approaches.
Table 2 shows the performance comparison between our proposed systems and the existing baseline systems. The existing baseline systems use different attention mechanisms, of which the Bahdanau mechanism performs best. It is worth mentioning that although the transformer performs well in many NMT tasks, it does not achieve good results on the SLT dataset because of the small data size.
Our proposed systems contain innovations in multiple places, so we added the different improved modules to the baseline one at a time for comparison. We can see that after using the unsupervised FSDC algorithm (#2h), the model achieves better performance. As for the improvement of the encoder in the NMT module, both the T-Conv and the DH-BiGRUs units have a promoting effect, as shown in Table 2 (#2e and #2f). The complete improved encoder module, which uses both the T-Conv and the DH-BiGRUs units (i.e., TC-DHBG-Net), improves more significantly, as shown in Table 2 (#2g). From these results, we can see that the improved encoder in the NMT module contributes the most, and the FSDC algorithm adds a slight further improvement on that basis, as shown in Table 2 (#2i). Overall, the proposed tokenization-related units significantly improve SLT without using extra information.

Validation on TC-DHBG-Net.
In order to validate the role of the T-Conv unit of the TC-DHBG-Net, we add only T-Conv units to the encoder of the baseline, while the recurrent neural unit remains unchanged. In Table 2, #2e exceeds the baseline moderately, which shows the positive role of the T-Conv unit.
The DH-BiGRUs unit is another important component of the TC-DHBG-Net. We replace the original GRUs of the baseline with our DH-BiGRUs unit, using 3 levels by default. As shown in Table 2 (#2f), introducing the multilevel structure moderately improves the performance, which shows the effect of the hierarchical structure.
Although the T-Conv unit and the DH-BiGRUs unit have each been validated by the above experiments, this does not mean that the combination of the two will be better. Therefore, it is necessary to examine Table 2 (#2g). Compared with the baseline, #2g improves significantly and is better than either single module (#2e or #2f).

Ablation on the Levels of DH-BiGRUs.
The DH-BiGRU has an important hyperparameter, the number of RNN levels. To test the scores for different levels of DH-BiGRU in the recurrent neural unit, we set the number $N_{\mathrm{level}}$ to 1, 2, 3, and 4, respectively. We conducted these experiments on top of the previous experiment shown in Table 2 (#2g). Table 3 illustrates that the hierarchical structure has a significant impact on the scores. When $N_{\mathrm{level}}$ is less than 3, the scores increase as the number of levels increases; when $N_{\mathrm{level}} = 3$, the score reaches its peak; but when $N_{\mathrm{level}} > 3$, the score starts to drop. In conclusion, a larger number of levels does not mean a higher score. Therefore, we take $N_{\mathrm{level}} = 3$ as the optimal setting of this hyperparameter.

Validation on FSDC Algorithm.
At the very beginning, we analyze the structural similarity of all frames in the dataset. Figure 5(a) shows that the number or proportion of separable redundant frames varies with the threshold. Even if the threshold is set to 75%, the proportion of frames with a similar frame in their temporal neighborhood exceeds 85%. The lower the threshold, the greater this proportion. This indicates that the relationship between the frames is tight. A reasonable initial threshold is crucial to the model, but the threshold is an empirical and experimental hyperparameter. If the threshold is set too low, much useful frame information may be lost; if it is set too high, the optimization will not work at all. Based on Figure 5(a), we think that the similarity threshold should be set to at least 94%.
To validate the FSDC algorithm, we set the threshold from 94% to 99% to control the percentage of redundant frames. We conducted the experiment on the baseline (Table 4 (#4a)) and on the structure we proposed (Table 4 (#4b)), respectively. Figure 5(b) shows that, within a reasonable range, the FSDC algorithm improves on the baseline, especially when the threshold is set to 95%. However, the negative relative values in Table 4 (#4b) also show that not all thresholds improve performance.
Moreover, it is worth mentioning that the size of the training data is reduced by 9.28% when the threshold is set to 95%. The optimized dataset saves not only storage space but also processing time (about a 10% reduction).

About Length.
Figure 6(a) shows the distribution of the number of sentences with respect to the different lengths of source sentences (frames) on the test set. Since most sentences have fewer than 100 frames, we regard sentences with more than 100 frames as long sentences.
Figure 6(b) shows the BLEU scores of the generated translations on the test set with respect to the lengths of the source sentences. In particular, we split the translations into different bins according to the length of the source sentences (frames) and then test the BLEU scores for the translations in each bin separately, with the results reported in Figure 6(b). Our approach achieves large improvements over the baseline system in almost all bins, especially for the long sentences with more than 117 frames. The performance comparison intuitively shows that our model adapts better to the translation of long sentences, which benefits from the FSDC algorithm and the improved encoder.
The translations of our model in Table 5 are closer to the ground truth. Note that the translation results closer to the target in Table 5 are marked in bold.

ALGORITHM 1: FSDC algorithm for temporal neighborhood. Input: input F; threshold δ (0 ≤ δ ≤ 1); number of video frames N. ... Retain $f_x$, $f_{x+i}$; update $x = x + i$, $i = 1$; end if; end for.

Figure 5: (a) Numbers and percentage of redundant frames with respect to different similarity thresholds. (b) The increased absolute values of BLEU compared to the baseline after using the FSDC algorithm. When the threshold is around 95%, both models reach the peak.

Related Work
According to a recent review [28], sign language research has been ongoing for decades. SLR systems can be classified into three types: (1) fingerspelling recognition; (2) isolated word recognition; (3) continuous sign sentence recognition. SLT is a more advanced task that further understands the semantic information of sign language.
In earlier work, SLR systems employed traditional recognition methods. For instance, Gao et al. [29] used an HMM to recognize SLR words; the authors of [30,31] used an SVM to classify continuous sign language alphabets and isolated words; Baccouche et al. [32] performed trajectory matching to classify isolated words. More recently, deep learning-based models have been employed. CNNs [33,34], LSTMs [2,35,36,37], or hybrid models [3,38] have been used for continuous sentence recognition.
When it comes to SLT, few research results have been published up to now. However, the development of SLR has laid a foundation for SLT. Camgoz et al. [1] released the first available continuous SLT dataset and proposed a neural SLT model. They combined a CNN with the classic machine translation model Seq2Seq. Their work maintains the state of the art on the RWTH-PHOENIX-Weather 2014T dataset. Later, Ko et al. [4] proposed a neural SLT model based on human pose estimation that converts a video frame to keypoints, which simplifies recognition but ignores much important semantic information, e.g., facial expressions. We believe that this is still under consideration. Guo et al. [2] proposed a hierarchical LSTM model and performed both SLR and SLT experiments on a Chinese dataset. They used a 3DCNN for feature extraction and compared it with the video captioning model S2VT [39]. The critical problem with their dataset is that it includes only 100 sentences, which is inappropriate for translation tasks. Overall, SLT still underperforms, limited by a lack of large-scale datasets and better translation models.

Figure 6: (a) Numbers and percentage of redundant frames with respect to different similarity thresholds. (b) The increased absolute values of BLEU compared to the baseline after using the FSDC algorithm. When the threshold is around 95%, both models reach the peak.

Conclusion
In this work, we propose a novel weakly supervised SLT model with improved tokenization-related modules to adapt to longer sentences. We first propose an FSDC algorithm for the temporal neighborhood, which optimizes the limited training data by removing redundant frames and compresses the sentence length in an interpretable way. Then we introduce a T-Conv and DH-BiGRU-mixed NMT module, which can consider the temporal information with reasonable resource consumption as well as succeed in extracting deeper information. To evaluate our approaches, we conducted experiments on the public RWTH-PHOENIX-Weather 2014T dataset. Compared with the existing state-of-the-art baseline, our model reduces the size of the training data by 9.3% and outperforms the baseline by up to about 1.5+ BLEU-4 points on the sign-to-text translation task. Moreover, we conducted a series of comparison and ablation experiments and analyzed the translation performance qualitatively. Despite the improved performance, SLT still leaves a lot of room for further study. In future work, we will explore better interpretable methods to translate longer sentences.

Table 5

Example (a)
Target: der wind weht mäßig bis frisch mit starken bis stürmischen böen im bergland teilweise schwere sturmböen im südosten mitunter nur schwacher wind. (The wind blows moderately to fresh with strong to stormy gusts in the mountains, sometimes severe gusts in the southeast, sometimes only weak winds.)
BASE: der wind weht mäßig im norden frisch mit frisch mit stürmischen böen an der nordsee schwere sturmböen. (The wind blows moderately in the north fresh with fresh with stormy gusts at the north sea heavy gusts of wind.)
OURS: der wind weht mäßig bis frisch bei schauern und gewittern kann es stürmische böen auf den bergen sturmböen. (The wind blows moderately to fresh during showers and thunderstorms, it can be stormy gusts on the mountains.)
Frames: From 192 to 182

Example (b)
Target: und morgen wird es dann in der südosthälfte nochmal ähnlich werden wie heute allerdings im nordwesten bereits dichtere wolken. (and tomorrow it will be similar again in the southeast half of the day as in the northwest, however, with thicker clouds.)
BASE: morgen im süden und süden bleibt es allerdings schon wolkenlücken und gewitter das wird es schon schon werden werden aus den westen. (Tomorrow in the south and south there will be cloud gaps and thunderstorms it will be from the west.)
OURS: und morgen wird es dann in der südosthälfte nochmal ähnlich am alpenrand wieder mal südwestwind und gewitter. (and tomorrow it will be similar in the south-east half again on the edge of the alps again south-west wind and thunderstorm.)
Frames: From 196 to 169

BASE: baseline model; OURS: the optimal model mentioned above; the texts in parentheses are the English translations of the German sentences.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.