Fingerspelling Identification for Chinese Sign Language via AlexNet-Based Transfer Learning and Adam Optimizer

Nanjing Normal University of Special Education, Nanjing 210038, China; Joint Accessibility Key Laboratory, China Disabled Persons’ Federation, Nanjing 210038, China; School of Computer Engineering, KIIT Deemed to University, Bhubaneswar, India; School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China; School of Mathematics and Actuarial Science, University of Leicester, Leicester LE1 7RH, UK; School of Architecture Building and Civil Engineering, Loughborough University, Loughborough LE11 3TU, UK; School of Informatics, University of Leicester, Leicester LE1 7RH, UK; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China; Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia


Introduction
Hearing-impaired people make up a large population, especially in China, where there are about 27.9 million deaf people [1]. Their lives, learning, and communication all face great challenges. Sign language (SL) is their means of expression. However, sign language combines many elements, such as hand shape, movement, posture, and emotion, which makes it a relatively complicated system that is not easy to learn and master. Sign language translation and sign language recognition are two important solutions to these challenges. The former requires arranging staff in advance and is expensive, whereas the latter is receiving more and more attention owing to the rapid development of intelligent technology.
Chinese Sign Language (CSL) is designed specifically for Chinese deaf and hearing-impaired groups [2]. It has distinct characteristics: it is rich in semantics, wide in coverage, and varied in expression. Chinese finger sign language is an important part of it. First, it is simple and easy to learn, consisting of 26 monosyllabic letters (A to Z) and 4 double-syllable letters (ZH, CH, SH, NG). The content is clear, unique, and easy to remember. Second, it maps onto pinyin, so it can serve as the basis for learning gesture sign language. Fingerspelling in Chinese Sign Language differs from the letter expression of American Sign Language (ASL). ASL fingerspelling is suited to the direct expression of words or meanings, whereas a Chinese finger sign represents a pinyin element most of the time. Several finger signs together constitute a complete meaning, which leaves little room for ambiguity. In addition, Chinese finger sign language has advantages in representing abstract concepts and terminology.
Sign language recognition (SLR) refers to the use of computer technology to translate sign language into text, images, audio, video, and natural language that can be understood and accepted. Generally, it can be divided into two types: sensor-based and computer-vision-based. Recognition based on computer vision [3][4][5][6] is a popular trend because it is flexible to operate, easy to implement, low in cost, and technically reliable. Many researchers have contributed to this area with improved classification and recognition algorithms and their variants. Using four different support vector machine kernels, a recognition system for Thai fingerspelling sign language achieved average accuracies of 91.20%, 86.40%, 80.00%, and 54.67%, respectively [7], but it used only 375 character pictures in total. Based on HMM, K-means, and an ant colony algorithm, Li et al. [8] recognized Taiwan Sign Language with an accuracy of 91.3%; however, their database included just 11 Taiwan Sign Language words. Dynamic time warping (DTW) was applied by Lichtenauer et al. [9] and reached a recognition rate of 92.3%. They evaluated sign classification on a set of 120 diverse Dutch Sign Language signs. In addition, the hierarchical conditional random field (HCRF) method was proposed by Yang and Lee [10]. Rao et al. [11] trained an ANN classifier to match words. Kumar et al. [12] used the hidden Markov model (HMM) method. Lee et al. [13] combined SVM with HMM.
Some other research studies of sign language recognition utilizing deep learning were introduced in [14][15][16][17]. Jiang [18] employed the Gray-Level Co-occurrence Matrix (GLCM) and Parameter-Optimized Medium Gaussian Support Vector Machine (MGSVM) method to identify isolated Chinese Sign Language. A 6-layer convolutional neural network with the leaky rectified linear unit (LReLU) technique for Chinese Sign Language fingerspelling recognition was proposed by Jiang [19].
Although these techniques and methods have achieved favorable results, they still have their own problems and shortcomings. For instance, HMM and DTW need an effective template to be established first and are not suitable for real-time systems. SVM depends heavily on feature extraction, which consumes a large amount of resources. The accuracy of HCRF still has considerable room for improvement. In contrast, neural networks, with their strong self-learning and self-organizing abilities, open up new possibilities. In particular, transfer learning makes it possible to reuse mature and efficient neural networks.
In this paper, we propose an AlexNet-based transfer learning method equipped with data augmentation and the Adam optimizer for fingerspelling identification of Chinese Sign Language. It achieved a stable average accuracy of 89.48 ± 1.16%, which can be considered effective.
Our contributions are as follows. (i) AlexNet-based transfer learning was introduced, which includes advanced techniques such as the rectified linear unit function (ReLU), local response normalization (LRN), and dropout. (ii) The Adam optimization algorithm was used to accelerate learning and enhance performance. (iii) Data augmentation (DA) was used to provide a sufficiently large training set, which strongly alleviated overfitting and raised the accuracy. (iv) Our study offers a new approach to lowering the communication barriers between hearing-impaired people and hearing people.

Dataset
A self-built, computer-vision-based dataset of Chinese finger sign language was established, consisting of 1,320 images. According to the universal sign language standard issued by the state, Chinese finger sign language contains 30 letters in total: 26 single letters (A to Z) and 4 double-syllable letters (ZH, CH, SH, and NG). Figure 1 shows the thirty categories of Chinese finger sign language, cropped from sample images. These samples were preprocessed and normalized to 256 × 256 background-optimized images. Of the 1,320 images, 1,050 were used for training and the remaining 270 for testing. The fingerspelling data used to support the findings of this study are available from the corresponding author upon e-mail request.

Transfer Learning.
An important reason why transfer learning has attracted attention is that traditional machine learning applications require a large amount of labeled data, and these datasets may suffer from problems such as distribution differences and outdated training data [20]. Transfer learning can address these issues. It makes full use of previously labeled data while maintaining the accuracy of the model on new tasks, which is why it is receiving more and more attention.
Transfer learning is defined as the ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks. It can be divided into instance-based transfer, feature-based transfer, and shared-parameter-based transfer. The key to transfer learning is to find out which shared knowledge can be moved between domains, to design appropriate algorithms that extract and transfer this common knowledge, and to avoid negative transfer [21]. There are two general strategies for applying transfer learning in deep learning. One is fine-tuning, which takes a network pretrained on the base dataset and trains all of its layers on the target dataset. The other is freezing and replacing: for instance, freezing all layers except the last one (the frozen weights are not updated) and training only the last layer, or freezing the first N layers (N is chosen by the user) and training the remaining layers. Figure 2 shows the fine-tuning idea of transfer learning. It can be seen that a mapping from a big dataset to a small target dataset can be achieved quickly through reasonable fine-tuning, so that a new suitable model is built from a pretrained model [22].
In image processing, models pretrained on ImageNet are frequently chosen for initialization. In this study, we started from a model pretrained on a subset of ImageNet with 1,000 categories and then transferred the learned knowledge to the recognition of 30 categories of Chinese Sign Language using a small amount of private data, which is a comparatively simpler task. The entire structure of the neural network was retrained. Thus, AlexNet-based transfer learning brought advantages for fingerspelling identification of Chinese Sign Language.

AlexNet: Structure and Layers.
AlexNet attracted much attention because it finished far ahead of the runner-up in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), which surprised academic and industrial researchers at the time. It can be thought of as a deeper and wider version of LeNet. LeNet contains the basic modules of convolutional neural networks, providing the basis for other deep learning models [23]. From a holistic perspective, AlexNet retains the original idea but introduces more techniques, such as ReLU, LRN, and dropout. Rapid technological developments, such as the use of GPUs and the lifting of limits on data sample size, provided an opportunity for AlexNet.
In terms of layers with learnable weights, AlexNet consists of eight layers in total: five convolutional layers and three fully connected layers [24]. The structure of AlexNet is illustrated in Figure 3. Although AlexNet is similar to traditional neural networks, it has its own characteristics. Its main advanced techniques can be summarized as follows:
(a) Convolutional layers contain N learnable filters, and each filter generates one feature map. Suppose that we have a 3D input of size $I_W \times I_H \times L$ and filters of size $F_W \times F_H \times L$. The layer then outputs an activation map of size $N \times O_W \times O_H$, which implements feature extraction, with
$$O_W = 1 + \left\lfloor \frac{I_W - F_W + 2P}{S} \right\rfloor, \qquad O_H = 1 + \left\lfloor \frac{I_H - F_H + 2P}{S} \right\rfloor, \tag{1}$$
where N is the number of filters, S indicates the stride size, and P denotes the padding size. The entire process of convolution is shown in Figure 4: it starts from the input, applies a series of filters, and finally outputs an activation map.
(b) Replacing the traditional sigmoid function with the rectified linear unit function (ReLU) is one of AlexNet's successful practical strategies. Because ReLU is essentially a half-wave rectifier, it significantly speeds up the training process and helps prevent overfitting. In contrast, the sigmoid function is susceptible to vanishing gradients. It has been reported that ReLU converges about 7 times faster than the traditional activation function in deep neural networks. The two functions are expressed as follows:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \mathrm{ReLU}(x) = \max(0, x). \tag{2}$$
(c) Another of AlexNet's practical strategies is local response normalization (LRN), which was proposed by Krizhevsky et al. [25] to promote convergence. Let $\alpha_i$ be the activity of a neuron computed by applying kernel i followed by the ReLU nonlinearity. The response-normalized neuron $\beta_i$ is computed as
$$\beta_i = \frac{\alpha_i}{\left(k + \lambda \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \alpha_j^2\right)^{\gamma}},$$
where n denotes the number of contiguous kernel maps, which is the size of the channel window, N represents the total number of kernels in that layer, and k, $\lambda$, and $\gamma$ are constant hyperparameters. These hyperparameters are set according to the usual practice.
(d) Dropout should be considered a major innovation in AlexNet, and it has become one of the must-have structures of neural networks. Dropout was introduced primarily to prevent overfitting. It is implemented as follows: each neuron of a given layer is set to 0 with probability P. Generally, P = 0.5; in this case, dropout randomly generates the largest number of network structures. The zeroed neurons do not participate in forward and backward propagation, as if they were frozen or discarded.
At the same time, the number of neurons in the input and output layers is kept unchanged, and the parameters are updated according to the usual neural network learning method. This process is repeated until the end of training. In this way, the network trained at each iteration is reduced in scale and streamlined. Dropout can also be seen as a form of model combination, because a different network structure is generated at each iteration. Thus, it can effectively reduce overfitting. Figure 5 represents a plain neural network, in which gray dotted circles denote dropped neurons and blue solid circles denote retained neurons.
(e) The fully connected layer (FC) linearly transforms one feature space into another, that is, it maps the learned "distributed feature representation" to the sample label space. Essentially, it implements the "classifier" function. FC layers tend to appear at the end of the network to weight the previously extracted features. The output is obtained by multiplying the input by the corresponding weights and adding a bias.
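The building blocks in items (a) to (d) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the LRN default hyperparameters are taken from the original AlexNet paper rather than from this text, and dropout is written exactly as described here (zeroing only, without the rescaling that many frameworks add).

```python
import math
import random


def conv_output_size(in_size, filter_size, stride, padding):
    """Spatial output size of a convolution: 1 + floor((I - F + 2P) / S)."""
    return 1 + (in_size - filter_size + 2 * padding) // stride


def sigmoid(x):
    """Traditional squashing activation."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Half-wave rectifier: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def local_response_norm(a, k=2.0, n=5, lam=1e-4, gamma=0.75):
    """Normalize each activity by the squared activity of its n neighbours.

    `a` holds the ReLU outputs of all N kernels at one spatial position.
    The default hyperparameters are ASSUMED (original AlexNet paper values).
    """
    total = len(a)
    out = []
    for i in range(total):
        lo = max(0, i - n // 2)
        hi = min(total - 1, i + n // 2)
        energy = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + lam * energy) ** gamma)
    return out


def dropout(activations, p=0.5, rng=random):
    """Training-time dropout: set each activation to 0 with probability p."""
    return [0.0 if rng.random() < p else x for x in activations]
```

For example, `conv_output_size(227, 11, 4, 0)` returns 55, assuming a 227 × 227 input, 11 × 11 filters, a stride of 4, and no padding (a commonly cited configuration for the first AlexNet layer that is not stated in this text).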
The softmax layer (SL), which implements the softmax function, usually follows the final fully connected layer. The SL works in a multiple-input multiple-output mode. The softmax function converts the input values into probabilities, and the node with the highest probability is selected as the prediction. The softmax function is defined as follows:
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}},$$
where $x_i$ is the i-th input of the layer and C is the number of classes.

Transfer Setting.
In this study, only one GPU was used to implement AlexNet because the dataset is of manageable size. Table 1 provides the numbers of learnable weights and biases of AlexNet. In total, AlexNet has 60,965,224 weights and biases (60,954,656 + 10,568). In Matlab, the single-precision float type is used to store all variables. As every variable occupies four bytes, about 233 MB of memory is allocated in total. The neural network structure needs to be modified, especially the last fully connected layers. The original fully connected output was designed for 1,000 classes and is not suitable for our classification task. Thus, we redesigned the relevant layers by fine-tuning. As shown in Table 2, a randomly initialized fully connected layer containing 30 neurons was introduced, followed by a softmax layer and a classification layer for 30 categories. Here, 23, 24, and 25 are layer indexes in the Matlab deep neural network model, which also counts the ReLU and pooling layers, so these indexes are larger than those of a Python model. The training options were set empirically. Because transfer learning generally needs relatively few training epochs, the number of epochs was set to 10. The global learning rate was set to $10^{-4}$. Considering that the new layers were randomly initialized with weights and biases while the transferred layers were pretrained, the learning rate of the new layers was set to 10 times that of the transferred layers.
Different transfer learning settings were tried. As shown in Figure 6, setting M indicates that the layers following M are replaced, while the layers before it are retained as transferred layers. Transferred layers keep a learning rate of $1 \times 10^{-4}$, whereas replaced layers start with a learning rate of $10 \times 10^{-4}$. We tested four transfer learning configurations, M1 to M4: M1 replaces the last layer, FCL8; M2 replaces FCL7 and FCL8; M3 replaces FCL6 to FCL8; M4 replaces CL5 and FCL6 to FCL8.
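The parameter total and the storage figure quoted above can be checked with simple arithmetic. The sketch below assumes the layer shapes of the standard two-group AlexNet, since Table 1 is not reproduced here; the variable names are illustrative only.

```python
# (name, number of weights, number of biases) for the eight learnable layers.
# Shapes are ASSUMED from the standard two-group AlexNet, not read from Table 1.
LAYERS = [
    ("conv1", 96 * 11 * 11 * 3, 96),
    ("conv2", 256 * 5 * 5 * 48, 256),
    ("conv3", 384 * 3 * 3 * 256, 384),
    ("conv4", 384 * 3 * 3 * 192, 384),
    ("conv5", 256 * 3 * 3 * 192, 256),
    ("fc6", 4096 * 9216, 4096),
    ("fc7", 4096 * 4096, 4096),
    ("fc8", 1000 * 4096, 1000),
]

total_weights = sum(w for _, w, _ in LAYERS)
total_biases = sum(b for _, _, b in LAYERS)
total_params = total_weights + total_biases

# Single-precision floats occupy four bytes each.
megabytes = total_params * 4 / 1024 ** 2

# Size of a replacement last layer with 30 outputs instead of 1,000.
new_fc8_params = 30 * 4096 + 30

print(total_weights, total_biases, total_params, round(megabytes))
```

Under these assumed shapes the totals agree with the figures in the text (60,954,656 weights, 10,568 biases, roughly 233 MB), and a 30-way last layer would hold 122,910 parameters.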
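The four configurations amount to choosing how many trailing layers are replaced and giving those layers a tenfold learning rate. A minimal sketch follows, assuming the layer names used in the text (CL1 to CL5, FCL6 to FCL8); the function and constant names are hypothetical and do not correspond to any Matlab API.

```python
LAYERS = ["CL1", "CL2", "CL3", "CL4", "CL5", "FCL6", "FCL7", "FCL8"]
BASE_LR = 1e-4          # learning rate of transferred (pretrained) layers
NEW_LAYER_FACTOR = 10   # replaced layers learn 10 times faster

# Number of trailing layers replaced under each configuration.
REPLACED = {"M1": 1, "M2": 2, "M3": 3, "M4": 4}


def layer_learning_rates(config):
    """Map each layer name to its learning rate under configuration `config`."""
    cut = len(LAYERS) - REPLACED[config]
    rates = {}
    for idx, name in enumerate(LAYERS):
        factor = NEW_LAYER_FACTOR if idx >= cut else 1
        rates[name] = BASE_LR * factor
    return rates


for config in REPLACED:
    replaced = [n for n, lr in layer_learning_rates(config).items() if lr > BASE_LR]
    print(config, "replaces", replaced)
```

Keeping the configuration as a single integer makes it easy to sweep over M1 to M4 in the same training script.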

Data Augmentation on Training Set.
Since the deep neural network model has many parameters, it needs a considerable number of sample images to achieve optimal performance. Data augmentation (DA) can extend the dataset, which helps to enhance the performance of deep learning and improve classification accuracy. For each of the original images selected, we applied the following six DA techniques: PCA color augmentation, affine transform, noise injection, scaling, random shift, and gamma correction.
Take the "j" image as an example, which is shown in the fourth column, middle row of Figure 1. As each method produced 30 new images, we obtained 180 new augmented "j" images in total, which can be seen in Figure 7.
The parameters of each DA technique were set as follows: PCA color augmentation shifted the most prominent color values in the original images. The affine transform deformed the images while preserving straight lines. Zero-mean Gaussian noise with a variance of 0.01 was added to every sign language image to generate 30 new noisy images. Images were scaled with scaling factors from 0.7 to 1.3.

As shown in Table 3, which lists the number of images in the three partitions (training, augmented training, and test), augmentation enlarged the training set by a factor of 181: every original training image is kept together with its 180 augmented versions. The experiment was repeated ten times, with the data split randomly reset each time.
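The partition sizes and the noise-injection step can be illustrated as follows. This is a sketch under assumptions that the text does not state: pixel intensities are taken to lie in [0, 1], noisy values are clipped back to that range, and the augmented training size is simply the product of the training size and the augmentation factor. All names are hypothetical.

```python
import random

N_TOTAL, N_TRAIN = 1320, 1050
N_METHODS, PER_METHOD = 6, 30

n_test = N_TOTAL - N_TRAIN
factor = N_METHODS * PER_METHOD + 1     # 180 new images plus the original
n_augmented_train = N_TRAIN * factor


def add_gaussian_noise(pixels, variance=0.01, rng=random):
    """Add zero-mean Gaussian noise to intensities in [0, 1] and clip the result."""
    sigma = variance ** 0.5
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


print(n_test, factor, n_augmented_train)
print(add_gaussian_noise([0.0, 0.5, 1.0], rng=random.Random(0)))
```

Calling `add_gaussian_noise` 30 times on the same image with different random states would yield the 30 noisy variants described for noise injection.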

Training Algorithms.
Deep learning often requires a lot of time and computing resources for training, so optimization algorithms receive wide attention. The Adam (adaptive moment estimation) algorithm uses fewer resources and makes the model converge faster [26], which accelerates learning and improves the result. Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent process. It adds a second moment estimate to the first moment estimate used by momentum, which can be viewed as adding momentum to Adadelta. The learning rate of each parameter is adjusted dynamically using the first and second moment estimates of the gradient. Bias correction is also added, which keeps the parameters relatively stable. The iterative formulas are as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t,$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
$$\tilde{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \tilde{v}_t = \frac{v_t}{1 - \beta_2^t},$$
$$\theta_{t+1} = \theta_t - \alpha\, \frac{\tilde{m}_t}{\sqrt{\tilde{v}_t} + \varepsilon},$$
where $g_t$ is the calculated gradient, $m_t$ indicates the first moment of the gradient, which is also the expectation of the gradient, $v_t$ denotes the second moment of the gradient, $\beta_1$ represents the first moment attenuation coefficient, $\beta_2$ represents the second moment attenuation coefficient, $\theta$ stands for the parameter that needs to be solved (or updated), $\tilde{m}_t$ and $\tilde{v}_t$ indicate the bias corrections of $m_t$ and $v_t$, respectively, $\alpha$ is the learning rate, and $\varepsilon$ is a small constant that keeps the denominator away from zero.
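The Adam update described above can be sketched for a single scalar parameter. The decay coefficients and epsilon below are the commonly used defaults and are assumptions, since the text does not list them; the quadratic test function is purely illustrative.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration index."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps=2000, lr=0.01):
    """Minimize f(theta) = (theta - 3)^2 with Adam, starting from theta = 0."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (theta - 3)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta


print(minimize_quadratic())
```

Note that on the very first step the bias corrections make the update size approximately equal to the learning rate regardless of the gradient magnitude, which is one reason Adam is relatively insensitive to gradient scale.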
Two other popular training algorithms are stochastic gradient descent with momentum (SGDM) and root mean square propagation (RMSProp). In this study, we used these two algorithms as baselines for comparison.
SGDM introduces first-order momentum on top of SGD. The first-order momentum is the exponential moving average of the gradient direction at each moment, which is approximately equal to the average of the gradient vectors over the most recent $T_j$ moments.
The first-order momentum is calculated as in the Adam formulas above, $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$, and $T_j$ can be represented as follows:
$$T_j = \frac{1}{1 - \beta_1}.$$
In other words, the descent direction at time t is determined not only by the gradient direction at the current point but also by the descent direction accumulated before it. The empirical value of $\beta_1$ is 0.9, which means that the descent direction is dominated by the previously accumulated direction.
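The averaging window of the first-order momentum and a single SGDM update can be sketched as below. The update uses the exponential-moving-average form of the momentum described in the text; the default learning rate and the function names are illustrative assumptions.

```python
def effective_window(beta1):
    """Approximate number of recent gradients averaged: 1 / (1 - beta1)."""
    return 1.0 / (1.0 - beta1)


def sgdm_step(theta, grad, m, lr=1e-4, beta1=0.9):
    """One SGD-with-momentum update for a scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad   # exponential moving average of gradients
    theta = theta - lr * m               # step along the accumulated direction
    return theta, m


print(effective_window(0.9))
print(sgdm_step(1.0, 2.0, 0.0))
```

With a coefficient of 0.9 the window is about 10 steps, so the current gradient contributes only about one tenth of the update direction.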
On the other hand, RMSProp (root mean square propagation) is an optimization algorithm proposed by Geoffrey E. Hinton in his Coursera course. To reduce excessive oscillation of the loss function during updates and to speed up convergence, RMSProp keeps an exponentially weighted average of the squared gradients of the weight W and the bias b. As a result, it makes greater progress in directions where the parameter space is gentler: the accumulated squared gradient is smaller in a gentler direction, so the learning rate is reduced less. At iteration t, the updates are as follows:
$$s_{dw} = \beta s_{dw} + (1 - \beta)\, dW^2, \qquad s_{db} = \beta s_{db} + (1 - \beta)\, db^2,$$
$$W = W - \alpha\, \frac{dW}{\sqrt{s_{dw}} + \varepsilon}, \qquad b = b - \alpha\, \frac{db}{\sqrt{s_{db}} + \varepsilon},$$
where $s_{dw}$ and $s_{db}$ on the right-hand side are the squared-gradient averages accumulated by the loss function up to the previous iteration t − 1, $\beta$ is the exponential decay rate of the accumulation, and $\alpha$ is the learning rate. To prevent the denominator from becoming 0, $\varepsilon$ is set to a very small number. RMSProp damps the directions with large oscillation so that the swing in each dimension becomes smaller, and it also makes the network converge faster. RMSProp is similar to momentum in that it reduces the wobble of gradient descent, including minibatch gradient descent, and it allows a higher learning rate $\alpha$ to be used to speed up learning.
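A single RMSProp update for one scalar parameter can be sketched as follows. The decay rate, learning rate, and epsilon are assumed illustrative defaults, not values given in the text.

```python
import math


def rmsprop_step(w, dw, s_dw, lr=1e-4, beta=0.9, eps=1e-8):
    """One RMSProp update: scale the gradient by the root of its running mean square."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    w = w - lr * dw / (math.sqrt(s_dw) + eps)
    return w, s_dw


# A steep direction (gradient 100) and a gentle one (gradient 0.01),
# both starting from the same parameter value and an empty accumulator.
steep_w, _ = rmsprop_step(0.0, 100.0, 0.0)
gentle_w, _ = rmsprop_step(0.0, 0.01, 0.0)
print(abs(steep_w), abs(gentle_w))
```

The two printed step sizes are nearly identical even though the gradients differ by four orders of magnitude, which illustrates how the normalization evens out the swing across dimensions.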

Experiment Results and Discussion
We ran the experiments on a personal computer with a 2.5 GHz Core i7 CPU, 16 GB of memory, and a 2 GB AMD Radeon graphics processor, running Windows 10. Different transfer learning settings were tried to find the optimal hyperparameters, and then we carried out 10 runs on the test set using the final model. Average accuracy (AA) is used to evaluate the results. AA is the mean of the per-category accuracies, that is, the proportion of correctly classified samples in each category averaged over all categories. Table 4 provides the average accuracy of the 10 runs over the test set. In each run, 270 images were randomly selected from the entire dataset to construct the test set. As shown in Table 4, the highest average accuracy over the 10 runs is 91.48% and the lowest is 87.78%. The highest AA is marked in bold. Additionally, AA exceeds 90% in 4 runs. Finally, the mean and standard deviation are 89.48 ± 1.16%, which can be regarded as effective.
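The evaluation measure can be sketched as below. The labels in the usage example are toy values, not data from this study, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def average_accuracy(y_true, y_pred):
    """Mean of the per-category accuracies, in percent."""
    per_class = {}
    for truth, pred in zip(y_true, y_pred):
        hits, count = per_class.get(truth, (0, 0))
        per_class[truth] = (hits + (truth == pred), count + 1)
    return 100.0 * mean(h / c for h, c in per_class.values())


def summarize(runs):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(runs), stdev(runs)


# Toy example with two categories (hypothetical labels).
print(average_accuracy(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
print(round(100 * 247 / 270, 2), round(100 * 237 / 270, 2))
```

When every category has the same number of test images, AA equals the overall fraction of correct predictions; under that assumption, the reported extremes of 91.48% and 87.78% would correspond to 247 and 237 correct images out of 270.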

Training Algorithm Comparison.
In this experiment, we compared the Adam algorithm with the SGDM and RMSProp algorithms. The comparison is shown in Figure 8.
As can be seen from Table 5, the mean and standard deviation of SGDM, RMSProp, and Adam are 88.33 ± 2.03%, 89.04 ± 0.93%, and 89.48 ± 1.16%, respectively. Adam is therefore superior to SGDM and RMSProp. On the one hand, Adam is more stable than SGDM. The highest average accuracy of SGDM reaches 92.22%, while its lowest AA drops to 84.81%. This large gap means that the gradient smoothing of SGDM is not enough to achieve a stable transition; that is, the inertia maintained by SGDM is insufficient to accommodate an unstable objective function. On the other hand, the accuracy of Adam is higher than that of RMSProp. Over the 10 runs, RMSProp surpassed Adam only three times.

Setting of Transfer Learning Comparison.
Four different transfer learning settings (configurations M1, M2, M3, and M4) were run on the test set. Table 6 and Figure 9 show the results of the different settings. The mean and standard deviation values of configurations M1, M2, M3, and M4 are 89.48 ± 1.16%, 88.93 ± 0.75%, 88.59 ± 0.90%, and 86.85 ± 0.94%, respectively. The best performance was obtained by configuration M1 (replacing the last layer, FCL8). Overall, performance declines as the number of replaced layers increases. Configuration M4 (replacing CL5 and FCL6-8) achieved a mean ± SD value of 86.85 ± 0.94%, which is the lowest of the four configurations.
This situation indicates that retaining as many transferred layers as possible from a pretrained model is more effective and practical. Moreover, data augmentation expanded our dataset into a sufficiently large training set, which strongly reduces overfitting and improves the accuracy.

Effect of Data Augmentation.
In this experiment, we compared the performance of using data augmentation (DA) against not using DA. The experiment settings were the same as in Section 4.1, except that DA on the training set was removed. The results are shown in Table 7. We tested four different augmentation factors. First, we generated 10 new images with each of the six DA methods. Therefore, we obtained 60 new images; together with the original image, this gives a 61x augmentation of the training set. Second, we set the augmentation factor to 121x. Afterwards, we also checked the performance of 181x and 241x augmentation.
As can be seen from Figure 10, not using DA performs clearly worse than using DA. When DA is used, the performance increases with the augmentation factor up to 181x. However, 241x augmentation leads to a drop in average accuracy, and it also requires the most computation resources. Thus, setting the augmentation factor to 181x, that is, the purple bar in the diagram, gives the best performance.
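The four augmentation factors follow from the number of new images generated per DA method. The sketch below assumes 10, 20, 30, and 40 images per method; only the 10-image case (61x) is spelled out in this section, and the 30-image case (181x) matches the setting used on the training set, so the other two values are inferred.

```python
N_METHODS = 6  # PCA color, affine, noise, scaling, random shift, gamma correction


def augmentation_factor(images_per_method):
    """New images from all methods plus the original image itself."""
    return N_METHODS * images_per_method + 1


factors = [augmentation_factor(k) for k in (10, 20, 30, 40)]
print(factors)
```

This prints `[61, 121, 181, 241]`, the four factors compared in this experiment.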

Comparison to State-of-the-Art Approaches.
Five other state-of-the-art approaches, HCRF [10], HMM [12], SVM-HMM [13], GLCM-MGSVM [18], and 6-layer CNN-LReLU [19], were compared with the proposed method in this experiment. The average accuracies of HCRF, HMM, SVM-HMM, GLCM-MGSVM, and 6-layer CNN-LReLU are 78.00%, 83.77%, 85.14%, 85.3%, and 88.10 ± 1.48%, respectively, as shown in Figure 11. It can be observed that our method is superior to all the compared approaches. Three attributes help to enhance the performance of our method. (i) With fine-tuning, the mature pretrained model can quickly transfer knowledge from the previous task to our task, which is comparatively simpler. Therefore, the accuracy of our method is close to the level of the AlexNet model. (ii) The Adam algorithm boosts the retraining of the AlexNet model, which accelerates learning and improves the result. (iii) Data augmentation extended the dataset and improved the classification accuracy. Thus, our method achieved a satisfactory result. The shortcoming of our method is that AlexNet needs a massive amount of computing resources. In the future, we may try to move our algorithm to cloud computing [27,28] and mobile edge computing [29,30] environments.

Conclusion
In this study, an AlexNet-based transfer learning method was proposed, equipped with data augmentation and the Adam optimizer. Our method was applied to fingerspelling identification for Chinese Sign Language. The experiment results demonstrated that our method achieved an average accuracy of 89.48 ± 1.16%, the best among the six approaches compared. We compared three training algorithms, Adam, RMSProp, and SGDM, and found that Adam is more accurate and stable. We tested four different transfer learning settings and found that configuration M1 (replacing the last layer, FCL8) achieved the best performance. In addition, this strategy is very practical: it greatly reduces the amount of network training and makes full use of existing models. We also observed that using DA is more effective than not using DA and that different augmentation factors lead to distinct performance, with 181x augmentation achieving the best average accuracy.
In the future, we shall try to verify different transfer learning models, such as ResNet, GoogleNet, and SqueezeNet. We also need to solve the issue of setting an appropriate learning rate factor for each individual layer. The dataset needs to be further expanded to obtain higher accuracy. We shall try to apply this method to other suitable areas and test other feasible methods.

Data Availability
The fingerspelling data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.