Intelligent Malaysian Sign Language Translation System Using Convolutional-Based Attention Module with Residual Network

The deaf-mute population often feels helpless when they are not understood by others and vice versa. This is a significant humanitarian problem that needs a localised solution. To address it, this study implements a convolutional neural network (CNN) with a convolutional-based attention module (CBAM) to recognise Malaysian Sign Language (MSL) from images. Two different experiments were conducted on MSL signs using CBAM-2DResNet (2-Dimensional Residual Network), implementing the "Within Blocks" and "Before Classifier" methods. Metrics such as accuracy, loss, precision, recall, F1-score, confusion matrix, and training time were recorded to evaluate the models' efficiency. The experimental results showed that the CBAM-ResNet models achieved good performance in MSL sign recognition tasks, with accuracy rates of over 90% with little variation. The CBAM-ResNet "Before Classifier" models are more efficient than the "Within Blocks" models. Thus, the best trained CBAM-2DResNet model was chosen to develop a real-time sign recognition system that translates sign language to text and text to sign language, easing communication between deaf-mutes and other people. All experimental results indicated that the "Before Classifier" CBAM-ResNet models are more efficient at recognising MSL and merit future research.


Introduction
Malaysian Sign Language (MSL), or Bahasa Isyarat Malaysia in Malay, was founded in 1998 when the Malaysian Federation of the Deaf (MFD) was established [1]. It is the primary sign language in Malaysia and is used for daily communication by the deaf-mute community, including deaf people, people with hearing impairments, and those physically unable to speak. MSL has grown in popularity among deaf leaders and participants.
American Sign Language (ASL) has had a significant influence on Malaysian Sign Language. Although there are a few similarities between MSL and Indonesian Sign Language, the two are perceived differently; in fact, Indonesian Sign Language was founded on the basis of MSL. Communication is accomplished by interpreting the meaning of the signer's hand gestures and, on occasion, appropriate facial expressions. In 2013, about 58700 people in the Malaysian population used MSL [2]. CBAM [6] consists of channel and spatial attention submodules, which are used here to extend the structure and enhance the performance of the Residual Network (ResNet) in image recognition.
This study emphasises the method's performance, efficiency, and practicability in producing a robust sign language translation system that benefits Malaysian deaf-mutes. CBAM-ResNet with 2D convolutions is implemented using two methods, known as "Within Blocks" and "Before Classifier." The efficiency of CBAM-ResNet was evaluated using multiple metrics: classification accuracy, loss, precision, recall, F1-score, confusion matrix, and training time.

Significance and Contribution.
The main objective of this research is to test and evaluate the efficiency of the CBAM-ResNet method on MSL.
The CBAM-ResNet neural network was initially applied by the School of Automation and Electrical Engineering (USTB), Beijing, China, to Chinese Sign Language recognition, with a network architecture different from this study's [7]. As sign languages vary from place to place, it is crucial to determine the diversification and performance of the model across multiple metrics before it can be widely implemented for MSL. Therefore, this study designed a new model combining CBAM with ResNet to extend the structure and enhance the performance of ResNet. The subobjectives of this research are outlined as follows: (i) to implement the new CBAM-ResNet method on MSL recognition, increasing the efficiency of the sign language recognition mechanism; (ii) to further investigate the differences between CBAM-ResNet "Within Blocks" and "Before Classifier" in terms of the efficiency of recognising MSL; (iii) to develop a real-time MSL recognition system through human gesture recognition using the CBAM-ResNet method.
This is the first study to adopt the CBAM-ResNet method in the context of MSL. It introduces the CBAM-ResNet neural network to resolve problems, such as accuracy and applicability, in previous MSL recognition technology. By evaluating the efficiency of the CBAM-ResNet method on MSL recognition, the proposed method can meet the researchers' expectations and help identify potential improvements to the sign language translation mechanism. This study is also crucial to improving communication between the deaf-mute population and other people in Malaysia so that they understand each other throughout their conversations. Once the efficiency of CBAM-ResNet on MSL recognition is proven, it can be implemented to develop a robust MSL translation system. Thus, deaf-mutes will enjoy equal access to the same privileges as others, encouraging social harmony.

Organisation.
The remainder of the work is organised as follows: Section 2 briefly discusses related past studies on the recognition of other sign languages. The methodology of CBAM-ResNet implementation on MSL is explained in Section 3. Section 4 presents the experimental settings, results, and discussion comparing CBAM-ResNet "Within Blocks" and "Before Classifier" in MSL sign recognition. Finally, Section 5 provides the conclusion of this paper.

Literature Review
From the past until now, various computational algorithms and machine learning methods have been applied to sign language recognition, such as the Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Support Vector Machine (SVM)-based machine learning, Hidden Markov Model (HMM)-based machine learning, fuzzy rule-based algorithms, the Back-Propagation algorithm, the Recurrent Neural Network (RNN), the 3D Residual Convolutional Network (3D-ResNet), and the CBAM-ResNet neural network. These methods have their respective strengths and limitations in recognising sign languages. Generally, researchers in past studies used two main streams of sign language recognition methods: vision-based and glove-based techniques. The vision-based method is relatively more convenient than the glove-based method as it does not require any wearable device, making it a hassle-free solution. However, the vision-based method still has limitations, such as the quality of the camera and image used, capturing distance and direction from the camera, lighting of the surroundings, accessories worn by signers, and overlapping hands in presenting sign language [8][9][10]. These factors may affect the performance of the model. Critical evaluation parameters such as accuracy, speed of recognition, response time, applicability, and accessibility are used to measure the efficiency of sign language recognition algorithms.

Relevant Past Studies on Different Sign Languages around the World.
As the world becomes more concerned with the deaf-mute community's welfare, there has been positive development and a gradual increase in research associated with sign languages in recent times. Researchers worldwide have proposed different machine learning algorithms for sign language recognition. Meanwhile, the methods implemented for sign language recognition also change with advancements in technology, which can boost the performance of those machine learning algorithms.

2.1.1. Artificial Neural Network (ANN). An artificial neural network (ANN) consists of nodes simulating the interconnections of neurons in a biological brain [11]. It is usually applied to solve problems that require data processing and knowledge representation. For example, Tangsuksant et al. [12] researched American Sign Language static alphabet recognition using a feedforward backpropagation ANN. Their research returned an average accuracy of 95% across repeated experiments. Another study used the same method and achieved a higher average accuracy of 96.19%, with 42 letters of Thai Sign Language examined [13]. López-Noriega et al. selected gloves with built-in sensors for sign alphabet recognition using ANNs with Backpropagation, Quick propagation, and Manhattan propagation [14]. Mehdi and Khan [15] carried out a study with gloves equipped with seven sensors and an ANN architecture, which achieved an accuracy rate of 88%. Finally, Allen et al. [16] developed a fingerspelling recognition system using MATLAB for American Sign Language alphabets. The chosen neural network was a perceptron that received a matrix with 18 rows and 24 columns as input from the 18 sensors on a CyberGlove during training. Their model achieved an accuracy of 90%.

2.1.2. Convolutional Neural Network (CNN). A Convolutional Neural Network is a subtype of the Feed-Forward Neural Network (FNN) suitable for image and video processing [17]. Jalal et al. [18] proposed an American Sign Language translator that did not rely on pre-trained models. They proved that their model has up to 99% recognition accuracy, which was higher than the modified CNN model from Krizhevsky [19]. Another study employing a CNN on an American Sign Language dataset of around 35000 images was carried out in India [20].
That study adopted a CNN with a topology of three convolutional layers with 32, 64, and 128 filters, max-pooling layers, and the Rectified Linear Unit (ReLU) activation function [21]. Through experimental testing, the proposed system was able to achieve 96% recognition accuracy.
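The three-convolutional-layer topology described above can be sketched in PyTorch as follows. This is a hedged illustration: only the filter counts (32, 64, 128), the max-pooling layers, and the ReLU activation are given in the source, so the kernel sizes, pooling configuration, and classifier head here are assumptions.

```python
import torch
import torch.nn as nn

class SignCNN(nn.Module):
    """Sketch of a three-conv-layer sign classifier (32/64/128 filters)."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.mean(dim=(2, 3))  # global average pooling before the classifier
        return self.classifier(x)
```

Feeding a batch of RGB images of any reasonable resolution produces one logit per class.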

2.1.3. Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). A Recurrent Neural Network (RNN) is a neural network equipped with internal memory, whose output is mapped back into the RNN. As an RNN depends on inputs from previous steps in the sequence, the carried-over antecedent elements merge with the new input to complete decision-making tasks [22]. However, RNNs usually suffer from the vanishing gradient problem during training. Therefore, Long Short-Term Memory (LSTM) was introduced as a refined version of the RNN that can deal with this problem.
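As a minimal illustration of how an LSTM consumes a sequence and classifies it from its final hidden state, the sketch below builds a toy sign-sequence classifier. The input size (e.g. per-frame hand-keypoint features), hidden size, and class count are all hypothetical, not taken from the studies cited.

```python
import torch
import torch.nn as nn

class SignLSTM(nn.Module):
    """Toy LSTM sequence classifier: per-frame features in, class logits out."""
    def __init__(self, input_size=63, hidden_size=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):            # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])   # classify from the last time step
```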
Liu et al. [23] suggested an LSTM-based Chinese sign language system with their self-built sign language vocabulary datasets recorded using Microsoft Kinect 2.0. Their study returned an accuracy rate of 63.3%. Besides, RNN and LSTM have also been applied to the sign language of Bahasa Indonesia using TensorFlow [24]. They extended recognition to root words attached with affixes, which vary from the original meaning and parts of speech such as "noun" or "verb." A study from Indonesia implemented 1-layer, 2-layer, and bidirectional LSTMs and achieved 78.38% and 96.15% recognition accuracy on inflectional words and root words, respectively.
All efforts contributed by researchers in previous studies on exploring robust sign language recognition mechanisms are much appreciated.

Relevant Past Studies on MSL.
The timeline diagram of some published studies on MSL over the past 13 years is depicted in Figure 1, showing the research trends in the field of this study. For example, Akmeliawati et al. [25] proposed an automatic sign language translator that recognises only fingerspelling and sign gestures. Another gesture recognition system for Malay sign language collected inputs from 24 sensors, consisting of accelerometers and flexure sensors connected wirelessly via a Bluetooth module [26]. These studies could not provide a real-time translation system. A gesture recognition system was developed for Kod Tangan Bahasa Melayu (KTBM) in 2009; it captured images through a webcam, which were then processed with the Discrete Cosine Transform (DCT) to produce feature vectors [27].
This system obtained an 81.07% classification rate using an ANN model. In 2012, researchers built a well-organised MSL database with different classifications [28]. In the subsequent year, Karabasi et al. [29] proposed a model for a real-time sign recognition system on a mobile device. Majid et al. [1] implemented an ANN with Backpropagation to classify skeleton data of signs obtained from Kinect sensors. They trained the network with a learning rate of 0.05 using 225 samples and achieved 80.54% accuracy on 15 dynamic signs. In 2017, Karbasi et al. [30] demonstrated the development of an MSL dataset consisting of alphabets and ten (10) dynamic signs using Microsoft Kinect. In 2019, Fahmi et al. [31] proposed a hand sign translator system based on the fuzzy logic method. The system translates hand patterns into the A-Z English alphabet. The fuzzy logic method has the advantage of dealing with uncertain inputs and unknown system parameters [32,33]. However, all these studies could not solve the problem of translating gestures into text and voice.
Researchers have favoured ANNs for recognising MSL signs. Therefore, this study implemented a CNN-based neural network called CBAM-ResNet, introducing a new classification method in MSL recognition to solve the gesture translation problem.

Convolutional-Based Attention Module (CBAM).
The strength of CNNs in image and video recognition is the availability of different convolutional kernels capable of extracting varied features from the image. In this research, CBAM is adopted; it has two submodules, channel and spatial attention, focused on detection tasks. The two attention submodules serve different functions. The channel attention submodule gives prominence to the representative information provided by an input image, while the spatial attention submodule focuses on the representative region that contributes to the meaningfulness of the image. Together, the submodules capture the concepts of "what" and "where." Figure 2 shows the sequential order of these two submodules when processing the information flow in a convolution block of the neural network. The sequential order is chosen over a parallel structure: input features are directed to channel attention followed by spatial attention, as the sequential order was proven to generate better results [6].
In channel attention, average pooling and maximum pooling are applied separately. The pooled features are directed into a multilayer perceptron (MLP) with only one hidden layer to generate channel attention maps, and element-wise summation combines the two output maps to compute the channel attention submodule. Equation (1) shows the representation of the channel attention Mc:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))), (1)

where σ refers to the sigmoid function, MLP is the multilayer perceptron, and AvgPool and MaxPool represent average pooling and maximum pooling, respectively. Unlike channel attention, the spatial attention submodule applies both average pooling and maximum pooling along the channel axis and passes the concatenated result through a convolution layer to produce a spatial attention map; no MLP is involved. The spatial attention Ms is shown in equation (2):

Ms(F) = σ(f([AvgPool(F); MaxPool(F)])), (2)

where f denotes the convolution layer computation.
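Equations (1) and (2) and the sequential ordering of the submodules can be expressed directly in PyTorch. The sketch below follows the original CBAM formulation [6]; the reduction ratio and the spatial convolution kernel size are assumed hyperparameters, not values reported in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Equation (1): shared MLP over average- and max-pooled channel descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # MaxPool branch
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale

class SpatialAttention(nn.Module):
    """Equation (2): convolution over channel-wise average and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAM(nn.Module):
    """Sequential order: channel attention first, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.spatial(self.channel(x))
```

Both submodules return a refined feature map of the same shape as their input, which is what lets CBAM be dropped into an existing network.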

Integration of CBAM into ResNet-18 Architecture Within Blocks and Before Classifier.
A residual block in ResNet-18 has a depth of two convolutional layers. "Within Blocks" refers to a method that plugs CBAM into every residual block of the network architecture [6]. The middle 16 convolutional layers in ResNet-18 form 8 residual blocks, so the "Within Blocks" method integrates CBAM eight times across these consecutive residual blocks. With CBAM, the residual network can refine intermediate feature maps into vital information that better represents the input. The "Before Classifier" technique instead integrates CBAM at the end of the whole residual network, right before the average pooling and fully connected (FC) layers. With this implementation, CBAM is applied only once per forward pass, which lowers network complexity and consumes less computational cost than the "Within Blocks" method. A given input tensor passes along all the convolutions of the residual blocks of CBAM-ResNet before the final feature map reaches the average pooling and FC layers; only this last feature map undergoes refinement by CBAM. The refined outcome is then classified to predict the label of the input. Figure 3(a) shows a single residual block in CBAM-ResNet, visualising the exact place of the integrated attention module in the "Within Blocks" method: CBAM is implemented at the end of the residual function F in its block. Figure 3(b) shows the exact location of CBAM, at the bottom of the CBAM-ResNet architecture, in the "Before Classifier" method.

Experiment Settings and Results
This study implemented the modified CBAM-2DResNet for Malaysian static sign image recognition. The experiment was carried out to compare and evaluate the classification performance of the two CBAM integration methods in 2DResNet on static sign image recognition. A real-time static sign image recognition system using a webcam was then built with the best CBAM-2DResNet trained model resulting from the comparison.

Experiment Settings on Malaysian Static Sign Image Recognition.
The development phase used Python version 3.6 with the Anaconda Spyder integrated development environment and utilised essential Python deep learning libraries such as PyTorch, Torchvision, and the CUDA Toolkit.
The experiment was conducted in Google Colab with a Tesla K80 GPU for both CBAM-2DResNet "Within Blocks" and "Before Classifier." Figure 4 shows a summarised flow diagram of the experimental procedures prepared for MSL static sign image recognition. Before starting classification model training, data preprocessing and augmentation steps were set up on the sign image data, along with several crucial neural network parameters. A collection of 96800 sign images was resized to 112 × 112 resolution and normalised using z-score normalisation. The normalised image data were further processed with other image transformation operations: random horizontal flip with 50% probability, random brightness and contrast adjustment in the range 0.5 to 1.5, and random rotation and shear alteration within ±10°. These data augmentation techniques can significantly improve the variation and diversity of the data available for training.
Random data splitting then separated these sign images into training and validation subsets with a ratio of 8 : 2; the training subset took up 77440 images, and the remaining 19360 images formed the validation subset. Next, the sign images in the training subset were transformed into 4D tensors and loaded into both CBAM-2DResNet "Within Blocks" and "Before Classifier" to train for 15 epochs. The same network parameters were configured for training: learning rate = 0.0001, momentum = 0.9, CBAM kernel size of 3 × 3, and batch size of 64. The Stochastic Gradient Descent (SGD) optimiser was implemented, and Cross-Entropy Loss was adopted as the function to compute the training and validation losses over epochs. Validation followed training to choose the best-trained classification model; a small validation batch size of 4 was utilised. The validation results required for the model's efficiency evaluation based on the performance metrics were recorded and analysed.

Figure 5 shows the comparison between training and validation accuracy of CBAM-2DResNet "Within Blocks" over 15 epochs. The training accuracy increased rapidly from 6.69% to 86.73% between epoch 1 and epoch 4 and rose at a slower rate from epoch 5 to epoch 15. The minimal gap between training and validation accuracies implied that the trained CBAM-2DResNet "Within Blocks" was a well-fitted image recognition model.
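Under the reported settings (SGD with learning rate 0.0001, momentum 0.9, cross-entropy loss), a single training epoch could be sketched as below. The stand-in model and synthetic data are hypothetical and merely demonstrate the loop, not the actual CBAM-2DResNet training run.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimiser, criterion):
    """Run one epoch; return mean loss and accuracy over the loader."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        optimiser.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels)
        loss.backward()
        optimiser.step()
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen

# Stand-in for CBAM-2DResNet, with a tiny synthetic dataset for illustration.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 22))
optimiser = torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
data = TensorDataset(torch.randn(8, 3, 112, 112), torch.randint(0, 22, (8,)))
loader = DataLoader(data, batch_size=64)
```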

Comparison of Training and Validation for CBAM-2DResNet "Before Classifier".
The comparison of training and validation loss for CBAM-2DResNet "Before Classifier" over 15 epochs is plotted in Figure 6. Both loss curves showed similar decreasing trends, starting with a rapid decline followed by gradual convergence. Figure 6 also displays the training and validation accuracy curves of CBAM-2DResNet "Before Classifier" over the 15 epochs. A rapid increase was observed in training accuracy, from 18.80% to 95.62% between epoch 1 and epoch 5, followed by a slower increment in later epochs. Validation accuracy also increased quickly, from 4.02% to 93.65% between epoch 1 and epoch 4, and slowed in the remaining epochs. A common increasing trend was noticed between the training and validation curves, with the accuracy converging to a stable value at a very low increment rate from epoch 6 onwards. The highest accuracies achieved were 99.37% for training at epoch 15 and 99.39% for validation at epoch 14. The comparable training and validation accuracies implied that the trained CBAM-2DResNet "Before Classifier" was a good-fit model with high predictive capacity on this dataset.
Table 1 lists the precision, recall, and F1-score of CBAM-2DResNet "Within Blocks" and "Before Classifier" for each alphabet class on the validation subset, along with the macro and weighted averages across classes, generated with the classification report function of Scikit-learn, the Python machine learning library. All values lie in the range 0.97 to 1. The proportion of instances belonging to each class is taken into consideration when calculating the weighted average, whereas it is excluded from the macro average. The recall, precision, and F1-score, in both macro and weighted averages, were reported as 0.99 after rounding.
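The macro/weighted distinction mentioned above can be reproduced with Scikit-learn; the handful of labels below is purely illustrative and is not the paper's validation data.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy labels for illustration only.
y_true = ["A", "A", "B", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "B", "C"]

# Macro average: unweighted mean over classes.
# Weighted average: each class scaled by its support (true-instance count).
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted", zero_division=0)
report = classification_report(y_true, y_pred, zero_division=0)
```

With an imbalanced class distribution, the weighted average shifts toward the well-populated classes, while the macro average treats every class equally, which is why both are worth reporting.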
A multiclass confusion matrix was plotted for the classification results of CBAM-2DResNet "Within Blocks" and "Before Classifier" on the validation subset, as shown in Figure 7. The confusion matrix gives a closer look at the incorrect predictions the classification models made. The correct predictions for each alphabet class are the green- and grey-coloured cells oriented along the diagonal of the confusion matrix, while the off-diagonal cells are predictions wrongly assigned to other alphabet classes. The "Within Blocks" model had its worst predictions on two classes, the alphabets "R" and "V," with 21 misclassified instances each. Taking the alphabet "V" for further illustration, it had correct predictions for 855 instances, with 19 instances misclassified as "K" and 2 as "W," out of its total of 876 sample images. Meanwhile, this model correctly classified every instance of another two classes, including "H," out of the 22 classes in the validation subset. Figure 8 illustrates the F1-score bar chart for each alphabet class using CBAM-2DResNet "Within Blocks" and "Before Classifier" on the validation subset. The F1-scores of the classes in "Within Blocks" were relatively high: 11 of the 22 classes reached the best value of 1.0, namely the alphabets "B," "C," "D," "F," "H," "L," "O," "P," "Q," "W," and "Y," while the alphabet "V" had the lowest F1-score at 0.97. The F1-scores among classes in "Before Classifier" were all approximately at the best value of 1. Fourteen classes, the alphabets "B," "C," "D," "E," "F," "H," "I," "L," "O," "P," "Q," "W," "X," and "Y," ranked highest with an F1-score of 1, while another four classes, the alphabets "K," "R," "U," and "V," achieved 0.98. The "Before Classifier" model most often misclassified the alphabet "R," which had 876 true positives: thirty-eight (38) instances were misclassified as "U" and 3 as "X," out of its total of 917 sample images.
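A confusion matrix like the one in Figure 7 can be computed with Scikit-learn. The three-class toy example below mimics the "V"/"K" confusion discussed above and is not the actual validation data.

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; the diagonal holds
# the correct predictions and off-diagonal cells the misclassifications.
y_true = ["V", "V", "V", "K", "K", "W"]
y_pred = ["V", "K", "V", "K", "K", "W"]
cm = confusion_matrix(y_true, y_pred, labels=["K", "V", "W"])
```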
Among the 22 alphabets in the validation subset, 7 classes contributed 100% correct predictions on all their instances: the alphabets "C," "E," "H," "Q," "U," "W," and "Y."

Real-Time Malaysian Sign Language Recognition Using the Image Recognition Technique.
Figures 9(a)-9(i) present the correct real-time classifications of certain MSL alphabet signs, each shown with its predicted class and confidence score. The best-trained CBAM-2DResNet "Before Classifier" was chosen as the classification model for building the real-time sign alphabet recognition application. This real-time application, implemented with the OpenCV library, provided a direct platform to evaluate the trained model on images extracted from webcam frames. Real-time sign images were extracted from the blue box region of every fourth frame captured through the webcam and fed as test inputs, and the corresponding classification result was returned to the user if the confidence score was higher than 0.5.

Discussion
The implemented CBAM-2DResNet was capable of extracting important features, such as the hand and fingers, from sign images. Table 2 compares the two CBAM implementation methods by extracting the important results presented, including training duration, lowest validation loss, highest validation accuracy, F1-score achieved, and generalisation performance.
From the comparison table, both CBAM-2DResNet "Within Blocks" and "Before Classifier" show good performance in sign image classification tasks on this MSL alphabet dataset. Both models achieved an F1-score of 0.99 computed with the classification report and reflected good fits in their generalisation performance. Only a minor difference of 0.0004 existed between the lowest validation loss values achieved by the two models, and their highest validation accuracies varied by only 0.1%. These insignificant differences do not distinguish the two models much in terms of classification efficiency. The comparison of validation loss and accuracy between the "Within Blocks" and "Before Classifier" models of CBAM-2DResNet is given in Figure 10. It shows that the validation loss of the "Before Classifier" model tended to decrease and converge faster than that of the other model throughout all 15 epochs. Correspondingly, the validation accuracy of the "Before Classifier" model also increased faster than that of the "Within Blocks" model.

(Figure 9: (a) "Null" classification result; (b) "C" correctly classified with 0.897 confidence score; (c) "D" with 0.948; (d) "I" with 0.873; (e) "K" with 0.806; (f) "L" with 0.976; (g) "V" with 0.941; (h) "W" with 0.957; (i) "Y" with 0.943. Figure 10: Comparison between CBAM-2DResNet "Within Blocks" and "Before Classifier" models in validation loss and accuracy.)

Regarding classification performance on the 22 alphabet classes, both CBAM-2DResNet "Within Blocks" and "Before Classifier" had their most significant number of wrongly classified instances on class "R." In real-time testing, it was observed that "Before Classifier" sometimes made classifications with low confidence or misclassified certain alphabet signs with high similarity, such as the hand signs "V" and "K," and "R" and "U." The same misclassification issues can be traced in the confusion matrix in Figure 7. Overall, CBAM-2DResNet "Before Classifier" is more efficient than CBAM-2DResNet "Within Blocks" at recognising static sign images.

Conclusion
This study pioneers the design and implementation of the CBAM-ResNet model for Malaysian Sign Language. Two experiments were conducted, for static signs and dynamic signs, using image recognition and video recognition techniques, respectively. A Malaysian Sign Language video dataset consisting of 19 dynamic signs was recorded. Two different CBAM integration approaches were applied in this research, known as the "Within Blocks" and "Before Classifier" methods. The models achieved accuracy of more than 90% with some variation. The CBAM-ResNet "Before Classifier" overall excels in recognition tasks on the image dataset. The CBAM-ResNet "Before Classifier" is the better choice because it has a lower computational cost and trains 2.52 times faster than CBAM-ResNet "Within Blocks" in the video recognition experiments. This new approach to MSL recognition can be applied in real-time systems to help Malaysian signers in their daily communications.
During dynamic sign video recognition and classification, an overfitting issue was observed. The overfitting may be due to the small dataset; generally, a dataset of over 100k samples is required to successfully optimise the convolution kernels in a CNN architecture. The concept of transfer learning can be applied in future research to cope with the minor overfitting issues of CBAM-3DResNet in sign video recognition. Another branch of artificial intelligence, natural language processing (NLP), can extend this research to the next level by constructing sentences with complete meaning from the signs recognised in video or in real-time. These interpretable sentences, in either written or audio output, could enhance the effectiveness of communication between deaf-mutes and others. Finally, this model is also suitable for exploring human action recognition.
Data Availability
The data that support the findings of this study are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.