Voice Recognition and Inverse Kinematics Control for a Redundant Manipulator Based on a Multilayer Artificial Intelligence Network

This study presents the construction of a Vietnamese voice recognition module and inverse kinematics control of a redundant manipulator using artificial intelligence algorithms. The first deep learning model is built to recognize voice information and convert it into input signals for the inverse kinematics problem of a 6-degrees-of-freedom robotic manipulator. The inverse kinematics problem is solved through the construction and training of a second deep learning model, which is built using data determined from the mathematical model of the system's geometric structure, the limits of the joint variables, and the workspace. The deep learning models are built in the PYTHON language. The efficient operation of the deep learning networks demonstrates the reliability of the artificial intelligence algorithms and the applicability of the Vietnamese voice recognition module to various tasks.


Introduction
In recent years, control system designs have trended toward intelligent control systems that still ensure a fast, flexible response in real time to constantly changing control requirements and allow high-precision human interaction.
Among intelligent control systems, research on voice-based control is attracting many scientists thanks to its user-friendly interaction. With voice-based control systems for industrial robots, users can have the robots perform a variety of tasks through simple commands that carry control information related to the motion direction and the characteristics of the target object.
In essence, the voice commands are used as the input of the control system to solve the inverse kinematics (IK) problem and are then converted into various operations of the manipulator. Due to the diverse nature of voice commands, the manipulator tasks change constantly, requiring the control system to respond quickly. IK-solving algorithms such as analytic methods [1] or numerical methods such as AGV [2], CLIK [3], and the Jacobian transpose [4] are hardly suitable, especially for redundant manipulator systems.
The results of recent research on artificial intelligence (AI) show that neural networks (NN), deep learning, and reinforcement learning algorithms are extremely useful and effective for dealing with complex nonlinear problems, with savings in computation time and system resources [5]. The most important point when applying these algorithms is to have a good understanding of the network structure and its functioning. The quality and performance of the network are used as criteria to evaluate the effectiveness of the algorithms. In terms of programming languages, AI networks can be built in different languages such as PYTHON, C++, and Java [6]. However, the PYTHON language has recently become more suitable for building deep learning (DL) network structures thanks to efficient support libraries such as Tensorflow, PyTorch, Numpy, Keras, and Sklearn. More importantly, these libraries support optimization problems in data science, machine learning, and control [7]. Based on the outstanding advantages of AI techniques, many intelligent control systems have been built to solve IK problems for redundant manipulation systems. Furthermore, these AI techniques are well suited to control systems whose motion is constantly changed by voice commands that may not be preprogrammed.
Many solutions to apply voice control systems based on AI algorithms for industrial machines are mentioned in [8].
To determine the direction of an emitted sound source, Hwang et al. [9] designed an intelligent ear for a robot arm. To control fabrication machines and industrial robotic arms, Rogowski [10] designed a voice control system (VCS) solution with good noise resistance. For serving tasks, multiple manipulators designed for friendly human interaction through gesture recognition and voice feedback are introduced in [11][12][13]. The manipulator in [14], which serves household chores, is controlled by a VCS to increase usability and entertainment. An enhanced version of DL algorithm-based speech recognition is proposed in [15]. The medical robot arm in [16] is designed with a VCS that allows nurses and patients to easily interact with the robot. The manipulator in [17] uses a VCS with visible light communication. An autonomous manipulator controlled by voice through the Google Assistant application on the basis of IoT technology is shown in [18]. A voice-controlled application that uses IoT technology in combination with an adaptive NN is proposed in [19] to improve the efficiency of solving IK problems for 6-degrees-of-freedom (DOF) robots. Differently, a Bayesian-BP NN is built in [20] to create an efficient control system with fast and precise learning, evaluated by the root mean square (RMS) error. The simulation results show that the error of the method is extremely small. The IK problem using an NN is presented for a 2DOF manipulator in [21], a 3DOF robot in [22], and a 4DOF robot with a hybrid NN and genetic algorithm IK control system in [23]. An NN with output feedback to solve the IK problem of a 6DOF manipulator is proposed in [24]; this is a technique with very high control efficiency. A new real-time control algorithm for a 5DOF manipulator on the basis of an NN is proposed in [25].
This study presents the setup of two deep learning networks, DL1 and DL2, which process voice signals to produce the input of a 6DOF redundant manipulator and solve the IK control problem. Control information in the voice command includes the direction of movement and the object whose attributes are given in the speech. The robot then conducts image recognition to determine the object whose attributes match the voice recognition results. The image recognition is performed through the computer's built-in vision module and is not analyzed in depth in this study. The center coordinates of the object represent the position to which the end-effector point of the manipulator needs to move. Training data for model DL2 are taken from the results of the forward kinematics problem based on kinematics modeling according to Denavit-Hartenberg (DH) theory. The DL network models are built using the PYTHON language. Successfully solving these two problems has a wide range of potential applications in responding to the constantly changing trajectory of the manipulator without preprogramming.

The Diagram of the Voice-Based Controller.
The manipulator receives voice commands from the operator through the voice recognition module. Then, the control system automatically analyzes, calculates, and gives the control signals for the motors at the joints of the manipulator (Figure 1).
Specifically, the voice recognition module converts the human voice containing control information into text in the program. The manipulator control information contained in the voice includes the direction of movement of the manipulator (turn to the left or right), the action the manipulator needs to perform (grabbing or dropping), the identity of the object (wheels, trays, boxes, etc.), and the distinguishing features of objects (color, shape, size, etc.). The input voice and the output control signal must be defined to meet the manipulator control target. In essence, the voice recognition module is a natural language processing problem, and a DL model is built so that the network learns how to convert information from voice to text. The steps to perform the VCS are depicted in Figure 2.

Preprocessing the Input Voice.
This problem is solved through the following substeps: noise filtering, word separation, converting sound oscillation into sound energy in the frequency domain, and converting this energy into input data for the DL1 model. The noise-filtering step can be handled through a number of methods, such as noise reduction based on the hardware design of the receiver microphone, the electronic elements of the recording circuit, or program adjustment. Voices include the main expected sounds we need to record and noises (unwanted sounds with no control information). These acoustic noises can come from the sounds of outside environments such as traffic and industrial noise. They often negatively affect the accuracy of speech recognition results. To significantly reduce audio noise, a noise-reduction transceiver is used in this study.
Each human sentence often consists of many words combined. Each word contains one or several syllables. Thus, the speech recognition program must perform two basic tasks: separating words in sentences and separating syllables in each word.
Interestingly, every Vietnamese word has only one syllable. Therefore, this study only needs to focus on the first task, separating words in sentences. To better understand this problem, let us consider the following example.
Consider a Vietnamese voice command to control the manipulator: "Quay bên phải, lấy bánh xe màu vàng" ("turn right, grab the yellow wheel" in English). Notice that the Vietnamese sentence has 8 syllables, while the English one has 7 syllables, of which "yellow" has 2 syllables.
The voice is received through the microphone and recorded with the standard Voice Recorder application available on the Microsoft Windows operating system. The audio file can be read and written with the Scipy library in PYTHON programming. The acoustic oscillation amplitude values are normalized so that the input signal does not contain many suboscillations, making the separation process more efficient and making it easy to set a useful filter threshold. After normalization, word decomposition is performed in the DL1 model, with network node parameters that can be adjusted through a learning process on the samples to improve accuracy. The acoustic oscillation amplitudes after normalization are shown in Figure 3. It can be seen that the normalized amplitudes while speaking and not speaking can be clearly distinguished. This difference is used as the key feature to separate words in sentences.
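As a minimal sketch of these two preprocessing steps (amplitude normalization and amplitude-based word separation), the following Python fragment operates on a synthetic waveform. The frame length and threshold are illustrative assumptions, and the threshold test is a simplified stand-in for the gradient method actually used in the study.

```python
import numpy as np

def normalize_amplitude(signal):
    """Scale raw oscillation amplitudes into [-1, 1] so one filter
    threshold works regardless of the recording volume."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def split_words(signal, frame=400, threshold=0.1):
    """Simplified word separation: the frame-wise mean absolute amplitude
    is compared with a threshold, and contiguous loud frames form one word.
    Returns (start, end) sample indices of each detected word."""
    n = len(signal) // frame
    loud = [np.mean(np.abs(signal[i * frame:(i + 1) * frame])) > threshold
            for i in range(n)]
    words, start = [], None
    for i, is_loud in enumerate(loud):
        if is_loud and start is None:
            start = i * frame
        elif not is_loud and start is not None:
            words.append((start, i * frame))
            start = None
    if start is not None:
        words.append((start, n * frame))
    return words

# Two synthetic "words" (int16-scale amplitudes) separated by silence,
# standing in for a waveform read with scipy.io.wavfile.
raw = np.concatenate([12000 * np.ones(2000), np.zeros(2000), 7500 * np.ones(2000)])
norm = normalize_amplitude(raw)
segments = split_words(norm)
```

On this synthetic input, the two loud regions are returned as separate word segments, illustrating why normalization first makes a single threshold usable across recordings.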
However, it should be noted that regions with exceptionally large sound-oscillation amplitudes relative to other regions while speaking are considered acoustic noise in the speech. In addition, oscillation regions with small and fairly equal amplitudes are also considered noise signals that can be ignored. Therefore, if a user suddenly screams a word or speaks all words in a sentence at a low volume, the system may not understand the voice command. The change in the amplitude of the sound oscillation is determined to separate the words using the gradient method [26]. After separating the words in the spoken sentence, the sound oscillation is analyzed for the sound energy in the frequency domain through the Fourier transform. This sound energy value is used to build the Input Tensor for the DL model. The sound of the human voice is actually a combination of many signals with different frequencies. The oscillation function can be described through the following Fourier series [17].
f(t) = a_0 + Σ_{n=1}^{N} [a_n cos(nωt) + b_n sin(nωt)],    (1)

where a_0 is the original sound amplitude, a_n and b_n are the Fourier coefficients, n is the frequency ratio coefficient, ω is the angular frequency, and t is the time variable. From equation (1), the sound energy value in the frequency domain can be determined [17]. Figure 4 shows the sound energy of the two words "Quay" (turn) and "Phải" (right) in the frequency domain.
A fundamental characteristic of the sound is its energy value, which is used to build the input data for the DL model. The energy value of the sound is considered at frequencies spaced 1 Hz apart, up to a limit of half the sampling frequency. The Tensor Input is a vector of the sound energy values in increasing order of frequency (Figure 5(a)). The values of the Tensor Input after being created are usually very large. For the DL model to learn better, the data in the Tensor Inputs need to be normalized by dividing all components by a value greater than the maximum energy obtained. The Tensor Input for the DL model after normalization is described in Figure 5(b).
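The construction of the normalized energy vector can be sketched as follows, assuming a 1 s analysis window sampled at 8 kHz (illustrative values, not taken from the paper), so that each FFT bin corresponds to a 1 Hz step:

```python
import numpy as np

def energy_tensor(word_signal, fs=8000, norm_const=None):
    """Convert one separated word into a vector of spectral energies at
    1 Hz spacing (up to half the sampling frequency), then divide by a
    constant larger than the maximum energy so all entries fall in [0, 1]."""
    spectrum = np.fft.rfft(word_signal, n=fs)   # 1 Hz bins for a 1 s window
    energy = np.abs(spectrum) ** 2
    if norm_const is None:
        norm_const = 1.1 * energy.max()         # slightly above the maximum
    return energy / norm_const

fs = 8000
t = np.arange(fs) / fs
word = np.sin(2 * np.pi * 440 * t)              # a tone with a 440 Hz peak
tensor_in = energy_tensor(word, fs)
```

The resulting vector has one entry per integer frequency up to the 4 kHz limit, with its largest (normalized) entry at the dominant frequency of the word.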

Building the DL1 Model. After building the Tensor Inputs, the DL1 model is built with many inputs and many outputs (Figure 6), similar to the multilayer AI network in [27]. The number of inputs depends on the number of parameters in the Tensor Input vector. The output layer of network DL1 includes different nodes, each of which represents a certain word. The output words have probability values in the range [0, 1]. The word with the highest probability value is chosen as the result of the voice-to-text conversion. The hidden layers of the DL1 model determine the probability values with which the words produce the correct output. The elements inside the Tensor Input and Tensor Output are scalar quantities, so nonlinear activation functions are used. According to [28], nonlinear functions such as Sigmoid, Tanh, and Relu can be used, and the output layer uses the Softmax activation function to calculate the probability distribution across the classes. The DL model simulates how human biological neurons work, so it needs to be trained to reproduce the outputs for the corresponding inputs and to predict the results for other inputs. To train the DL model, the limiting criteria must be defined, and how the model learns to distinguish right from wrong must be outlined. According to [29], the Sparse Categorical Crossentropy (SCC) function is used as the loss: after each learning step, the DL model updates its parameters so that the actual output converges gradually to the desired value or, in other words, so that the loss function value decreases toward 0.
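The Softmax output activation and the SCC loss just described can be illustrated numerically; the following is a plain-NumPy sketch of the math (not the study's Tensorflow implementation), using a toy 4-word output layer:

```python
import numpy as np

def softmax(z):
    """Probability distribution across the candidate words (output layer)."""
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

def scc_loss(logits, true_index):
    """Sparse Categorical Crossentropy: the negative log-probability the
    model assigns to the correct word, given an integer class label
    (no one-hot encoding of the target is needed)."""
    return -np.log(softmax(logits)[true_index])

logits = np.array([0.2, 3.1, -1.0, 0.5])   # toy raw scores for 4 words
probs = softmax(logits)                     # sums to 1; word 1 dominates
loss = scc_loss(logits, true_index=1)       # small, since word 1 is correct
```

Driving this loss toward 0 during training is exactly what makes the probability of the correct word approach 1.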
To update the DL model, the ADAM optimization function [30] is used; it combines the momentum method and RMSprop, its learning rate changes with respect to time, and it can approach the global minimum instead of getting stuck in a local minimum. Model DL1 is built with the Tensorflow library in PYTHON (Figure 7).
Line 47 declares the output layer with 17 nodes and the Softmax activation function. This output number represents the 17 common words in the voice command framework. The Softmax activation function selects the sample with the highest probability, separating words and phrases from each other. A dictionary with the words or phrases and the number of times each appears in the sentence is constructed and encoded as a vector. As such, network DL1 can ensure voice recognition, converting the recognition data into text containing specific control information.
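The mapping between the 17 output nodes and words can be sketched as a vocabulary dictionary; the entries below are hypothetical examples consistent with the command framework, not the authors' actual word list:

```python
import numpy as np

# Hypothetical 17-word command vocabulary (illustrative entries only);
# each word corresponds to one node of the DL1 output layer.
vocab = ["quay", "bên", "trái", "phải", "lấy", "thả", "bánh", "xe",
         "màu", "vàng", "đỏ", "xanh", "hộp", "khay", "lên", "xuống", "dừng"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def decode(output_probs):
    """Map the 17 Softmax outputs back to the most probable word."""
    return vocab[int(np.argmax(output_probs))]

# An output where the node for "quay" dominates decodes back to "quay".
probs = np.zeros(17)
probs[word_to_index["quay"]] = 0.93
```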

Extracting Control Information Using the Machine Learning Model. Technically, the Vietnamese sentence, after being separated into single words, is classified according to the DL1 model to form a set of words that are combined into an equivalent complete text, free of noise and other redundant words. This complete text (meaningful Vietnamese words and phrases) is used as the input to the machine learning (ML) model.
The TF-IDF algorithm is used to extract features of the text. Then, the Naive Bayes algorithm is used to classify the feature words and phrases of the text into control information classes. The ML model is built in the PYTHON language in combination with the Sklearn and Pyvi libraries. The extracted information fields are encoded numerically and transmitted to the manipulator control circuit via SERIAL communication.
The output of the model is the manipulator control information, such as the motion direction, the robot's action, and the object color.
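A minimal sketch of this TF-IDF plus Naive Bayes stage using Sklearn (the library the study names) is shown below; the category labels and training phrases are illustrative assumptions, not the authors' dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative training phrases mapped to hypothetical control-information
# classes (direction / action / object color).
train_texts = ["quay bên phải", "quay bên trái",
               "lấy bánh xe", "thả bánh xe",
               "màu vàng", "màu đỏ"]
train_labels = ["direction", "direction", "action", "action",
                "object_color", "object_color"]

# TF-IDF turns each phrase into a weighted term vector; Naive Bayes
# then classifies the vector into a control-information class.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)
pred = clf.predict(["quay bên phải"])[0]
```

The predicted class would then be encoded numerically before being sent over the serial link to the control circuit.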

Inverse Kinematics Control for the Manipulator Using a Deep Learning Network.
The real 6DOF manipulator arm is presented in Figure 8, and its kinematics model is described in Figure 9.
In the kinematics model, the fixed global coordinate system is (OXYZ)_0. The local coordinate systems (OXYZ)_i (i = 1, …, 6) are placed on the corresponding joints. The i-th joint variable is denoted by q_i.
Let us denote q = [q_1 q_2 q_3 q_4 q_5 q_6]^T as the generalized coordinate vector of the 6 joint variables. The kinematics parameters of the 6DOF manipulator arm are determined according to the DH rule [1], as given in Table 1.
The homogeneous transformation matrices H_i (i = 1, …, 6) of the six links are determined in [1] in the following general form:

H_i = [[cos q_i, −sin q_i cos α_i, sin q_i sin α_i, a_i cos q_i],
       [sin q_i,  cos q_i cos α_i, −cos q_i sin α_i, a_i sin q_i],
       [0,        sin α_i,          cos α_i,          d_i],
       [0,        0,                0,                1]],    (2)

where d_i, a_i, and α_i are the DH parameters of link i (Table 1). The pose of the end-effector relative to the fixed frame is then

H_E = H_1 H_2 H_3 H_4 H_5 H_6,    (3)

H_E = [[R_E, p_E], [0 0 0, 1]],    (4)

where R_E is a (3 × 3) rotation matrix from the global coordinate system (OXYZ)_0 to the local coordinate system of the end-effector (OXYZ)_6, and p_E = [x_E y_E z_E]^T is the position vector of the end-effector relative to the fixed global coordinate system (OXYZ)_0.
By substituting the DH parameters into equations (2)-(4) and performing mathematical transformations (see the details in [1]), the position coordinates of the end-effector point are obtained as closed-form expressions in the joint variables, where cq_i stands for cos(q_i) and sq_i stands for sin(q_i). The data for the network DL2 model are the spatial coordinate sets of the end-effector point and the corresponding sets of joint variable values, which are collected and fed into the DL2 network for training multiple times until the model gives accurate control signals for the manipulator, meeting the motion requirements. After training and validation, the DL2 model is used to predict the manipulator rotation angle values for object positions in the manipulator workspace. Figure 10 describes the entire process: the DL2 model is built with the request signal received after vector encoding and the feasible position data in the workspace as input. The output of the model is the corresponding joint variable values. The workspace of the manipulator arm is shown in Figure 11. The hardware consists of Servo MG995 drive motors, an Arduino Nano circuit, a Logitech B525-720p camera, a Dell Precision M680 laptop, and a Razer Seiren Mini microphone (Figure 12).
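The forward-kinematics computation that generates the (position, joint-angle) training pairs for DL2 can be sketched as follows; the DH rows used here are illustrative placeholders, not the values of Table 1:

```python
import numpy as np

def dh_matrix(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transformation H_i."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def end_effector_position(q, dh_table):
    """Chain the link transforms H_1...H_n and read off p_E = (x_E, y_E, z_E)."""
    H = np.eye(4)
    for qi, (d, a, alpha) in zip(q, dh_table):
        H = H @ dh_matrix(qi, d, a, alpha)
    return H[:3, 3]

# Illustrative (d, a, alpha) rows for a short 3-link chain -- placeholders,
# not the 6DOF parameters of Table 1 in the paper.
dh_table = [(0.1, 0.0, np.pi / 2), (0.0, 0.3, 0.0), (0.0, 0.25, 0.0)]
p = end_effector_position([0.0, 0.0, 0.0], dh_table)
```

Sampling q over the joint limits and recording each resulting p would produce exactly the kind of (input position, output joint angles) dataset the text describes for training DL2.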

Experimental Results
The network parameters of DL2 for controlling the manipulator are shown in Figure 13, with 5 outputs corresponding to 5 rotation angles of the manipulator joints. The network consists of 9 hidden layers with the Relu activation function. The number of nodes per layer is presented in Figure 13.
Training results and prediction results of the motor control signals are shown in Figure 14. A check on the test data uses as input the position vector of the end-effector point in the workspace, x = [0 20 0]^T (mm); the output of the test data corresponds to the joint variable values. The q value obtained from the model is q = [90 50 105 90 79]^T (deg).
Thus, the accuracy is 98.67% on the test dataset. The actual experimental system, with the circuit reading and writing the joint variable values and the feedback values on the 16 × 2 LCD, is shown in Figure 15. The joint variable values used to control the manipulator arm to the position of the object (a yellow wheel) are shown in Figure 16.

Discussion
In actual operation, industrial robots in general and redundant manipulators in particular often do not perform as perfectly as calculated under ideal conditions due to the influence of many different factors, called noise, that make the robot control system imperfect. According to [31], although imperfections are unavoidable in real production processes, real devices still operate well in regimes far from ideality.
For example, mechanical imperfections may occur prior to operation, due to mechanical manufacturing defects or assembly errors, or during operation, due to mechanical system vibrations. Meanwhile, electrical imperfections can be caused by the electromagnetic interference of the surrounding environment, the instability of the power supply, or the high-intensity electric pulses of welding machines. To overcome the imperfections, additional modules related to noise compensation, noise cancellation, or noise suppression will be studied in the next research stages.
This study only considers the kinematics problem under ideal conditions, in which the impact of noise can be ignored. In fact, it is not possible to have a general anti-interference solution for all types of noise. Therefore, in practical applications, the research team will apply anti-interference solutions suitable for each context.
In the case of group coordination among multiple voice-controlled robots in a narrow space, naming or coding each robot needs to be done through an independent module with name-recognition or decoding capabilities.

Figure 10: The building process for the DL2 model.

Journal of Robotics
When the operator calls the robot's name or activates its code, that robot is ready to receive the next voice commands. Thus, when it is necessary to add a new robot to an existing robot network, it is possible to adjust the name-recognition or decoding module without any change to the entire control system. Differently, in a robot network, audio imperfections may come from voice interference. The audio imperfections can be reduced by the effect of different-range connections controlled by a central dispatcher, and the voice interference "can be improved by including long-range connections between the robots" [32].

Conclusion
In summary, the PYTHON language has been applied to build AI models for the Vietnamese voice recognition module and IK control of the 6DOF redundant manipulator. DL and ML techniques have been applied successfully, with over 98% training accuracy. The data used for training models DL1 and DL2 are built independently from the Vietnamese language and from calculated data of the 6DOF manipulator kinematics modeling. The AI models were tested on a real manipulator model and gave feasible results. This study could serve as a foundation for developing applications for various types of manipulators (serial, parallel, hybrid, and mobile manipulators) in industrial production (welding robots, 3D printing robots, and machining robots), medicine, service industries, and home activities (surgical robots, flexible robots, soft robots, humanoid robots, UAVs, and service robots in families and restaurants).

Data Availability
The datasets generated during the current study are available from the corresponding authors on reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.