An Intention Understanding Algorithm Based on Multimodal Information Fusion

This paper proposes an intention understanding algorithm (KDI) for an elderly service robot, which combines Neural Networks with a seminaive Bayesian classifier to infer the user's intention. The KDI algorithm uses a CNN to analyze gesture and action information, and YOLOV3 is used for object detection to provide scene information. Then, we input them into a seminaive Bayesian classifier and set key properties as the super parent to enhance their contribution to an intention, realizing intention understanding based on prior knowledge. In addition, we introduce the actual distance between the user and objects and give each object a different purpose to implement intention understanding based on object-user distance. The two methods are combined to enhance intention understanding. The main contributions of this paper are as follows: (1) an intention reasoning model (KDI) is proposed based on prior knowledge and distance, which combines Neural Networks with a seminaive Bayesian classifier. (2) A robot accompanying system is formed and applied in the elderly service scene.


Introduction
Population aging is currently a problem faced by many countries in the world. Because their children are busy with work, it is difficult to give elderly parents the care they need. At the same time, through the investigation of nursing homes and the research on service robots by Joost and others [1], it was found that robot service is more and more recognized by the elderly, and elderly service robots provide many services. However, the intention understanding rate of current robot service systems is relatively low, which increases the interaction burden on the elderly when using a service robot. Therefore, to improve the intention understanding of the elderly service robot system and the satisfaction of the elderly, we propose a multimodal intention understanding algorithm named KDI to understand the behavior of the elderly.
In this paper, the results of single-modal identification are obtained by Neural Networks: the gesture, action, and scene are each recognized by a Neural Network. The single-modal results are fused by a seminaive Bayesian classifier to infer the final intention, and by improving the YOLOV3 target detection model and establishing a perspective matrix, we obtain the actual distance between the user and the objects to further enhance the inferred intention.
Compared with a pure Neural Network classifier, KDI achieves better intention understanding: it combines the high single-modal recognition rate of Neural Networks with the advantages of a Bayesian classifier, whose results are easy to adjust and whose multimodal information is easy to extend, making it more suitable for high-level identification tasks. At present, model fusion is mainly divided into two methods: multimodal fusion based on Neural Networks and multimodal fusion based on probability. Multimodal fusion based on Neural Networks mostly processes the multimodal information into low-level features, splices the tensors into a new long tensor, and trains on the result [3]. During low-level feature fusion, various features can be extracted to improve the performance of the system [4]. Reference [5] proposed to bridge the emotional gap using a hybrid deep model, which first produces audio-visual segment features with Convolutional Neural Networks (CNN) and 3DCNN and then fuses the audio-visual segment features in a Deep Belief Network (DBN). The accuracy of the recognition results is high, but the user's intention is mostly composed of multiple pieces of modal information. Adding multimodal information introduces a large number of parameters, and feature-level information fusion is not easy to adjust frequently and has poor flexibility, so it is not suitable for high-level intention understanding tasks. High-level feature fusion has therefore been proposed [6, 7].
There have been studies combining Neural Networks and probabilistic models for recognition tasks [8]. Combined, the two models exploit the Neural Network to save single-modal training cost and bring a high identification rate by fine-tuning an existing network. Probabilistic models, in turn, make it easy to modify training parameters, to augment multimodal channels, and to arrange and combine results to obtain personalized high-level (intention) identification results. Reference [9] used the CNN's powerful capacity for learning high-level features directly from raw data to extract effective and robust action features; an HMM is then used to model the statistical dependence over adjacent subactions and infer the action sequences. Reference [10] proposed an integrated probability-based decision framework for robots to infer the role of humans in a particular task, combining a Neural Network and a probability model. These methods greatly increase the flexibility of the recognition task, allow important parameters in the recognition task to be adjusted, and are easy to extend with incremental learning.

Scene Perception and Intention Understanding.
Data from multimodal or heterogeneous sensors can provide additional scene information, which enables the system to understand objects more comprehensively and accurately [11]. So if the robot can sense the semantics of the surrounding environment, many recognition and prediction tasks can be completed effectively [6]. The robot sees the environment through sensors, and the collected sensor data are fused into a multilayer representation of spatial knowledge for semantic mapping [12, 13]. After several improvements, YOLOV3 not only has good accuracy but also maintains a high running speed [14, 15]. YOLOV3 is therefore favored in many tasks with demanding edge computing and real-time requirements and is widely used in environment detection to provide scene information [16, 17].
Machine understanding of human intention is a key problem in the field of human-computer interaction. In order to understand the visual world, machines must not only identify scene information but also recognize how humans interact with it and anticipate upcoming interactive actions [18]. Reference [19] proposed a system to extract human-verb-object triples in daily photos. Reference [20] proposed a new capability, "active understanding of human intentions," in which a robot monitors human behavior. Reference [21] presents a framework that allows a robot to automatically recognize and infer the action intention of a human partner based on visual information. During collaboration, a robot with intention understanding ability can predict the successive actions that a human partner intends to perform, provide necessary assistance and support, and give reminders for missing or failed actions, so as to achieve the desired task purpose.

Summary.
To sum up, we greatly improve the efficiency of high-level task recognition by combining the advantages of Neural Network and the advantages of the probabilistic model. We use scene information as one of the pieces of multimodal information. In the intention understanding task, we add the system active response process, which greatly improves the intelligence of the system.

Materials and Methods
The algorithm is mainly divided into two parts: intention reasoning based on prior knowledge and intention reasoning based on distance. Finally, the two reasoning probabilities are combined. The detailed process is shown in Figure 1.
In reasoning based on prior knowledge, we use Neural Networks to identify the single-modal information and obtain category labels, such as the gesture recognition label (h_i) and the action recognition label (a_i). At the same time, we obtain the target detection information through YOLOV3 and input the above information into the seminaive Bayesian classifier to obtain the intention result P_K.
In distance-based intention reasoning, we assign intentions to objects in the scene. At present, we give each object only one intention; that is, intention and object are in one-to-one correspondence. In our understanding, a user who is close to an object is likely to interact with it. Therefore, we assume that the closer the user is to an object, the greater the probability of interaction, obtained as P_D.
Finally, we combined the above two intention results to obtain the final intention result.
Next, we introduce the process of intention reasoning based on the Bayesian classifier and on distance, and the basis of intention classification after intention fusion.

Acquisition of Gestures, Actions, and Target Detection.
The method used to get gestures is the gesture recognition model proposed by [22]. Their experimental results show that the method can handle complex gesture interactions with a recognition rate of 97.7%. We used their model and data set for Kinect-based gesture recognition with five gestures from its classification results (grab, fist, five fingers open, three fingers open, and extend index finger). Using Kinect-based skeleton information detection, we preset five human actions (walk, bend over, reach out, lie down, and sit down) through the relative distances and angular relationships between the three-dimensional skeletal points and set corresponding thresholds [23]. The test accuracy was 84%.
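The threshold rules on skeletal points can be sketched as follows. The joint names, angle thresholds, and height thresholds here are illustrative assumptions, not the calibrated values used in the experiment:

```python
import math

def angle(a, b, c):
    """Angle at joint b (degrees) formed by 3D points a-b-c."""
    ab = [a[i] - b[i] for i in range(3)]
    cb = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ab, cb))
    na = math.sqrt(sum(x * x for x in ab))
    nc = math.sqrt(sum(x * x for x in cb))
    return math.degrees(math.acos(dot / (na * nc)))

def classify_action(skeleton):
    """Rule-based action classification; thresholds are illustrative only."""
    spine = angle(skeleton["head"], skeleton["spine"], skeleton["hip"])
    if spine < 120:                  # torso folded forward
        return "bend over"
    if skeleton["hip"][1] < 0.5:     # hip close to the ground plane
        return "sit down" if spine > 150 else "lie down"
    return "walk"
```

In practice each preset action would get its own rule over several joint distances and angles, with thresholds tuned on recorded Kinect data.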
We use the YOLOV3 network to train the object recognition network for real-time object detection. The data set for Neural Network training consists of 330,000 pictures covering 80 object categories, which include almost all indoor daily necessities and thus contain rich object information for taking care of elderly people living alone.

Intentional Probability Formula Based on Seminaive Bayesian.
In an actual daily-life scene, among the many pieces of information that affect the understanding of an intention, there is always one feature on which the intention depends strongly. For example, the existence of a water cup near the user is particularly important for determining the probability of a drinking intention: if the water cup is near the user, the user's intention to drink water will be significantly greater. The reason we set a super parent is to increase the importance of such a key attribute. The property dependencies are shown in Figure 2. Therefore, we choose the existence of a relatively important special object (μ) as the super parent and estimate it independently.
We write h_{t−1}, a_{t−1} and h_t, a_t for the gesture and action features at moments t − 1 and t, respectively. μ_e indicates whether a specific object μ exists at time t, with value range μ_e ∈ {cup_1, cup_0, chair_1, chair_0, . . .}; for example, cup_1 denotes the existence of a cup and cup_0 its absence. x_j is the j-th attribute value. I is the intention, with value range I ∈ {i_1, i_2, . . . , i_n}. x′ is the super parent, with value range x′ ∈ {cup_1, chair_1, . . .}; that is, under different intentions we take different object-existence values of μ_e as the super-parent class. p(I, x′) is the prior probability, p(x_j | I, x′) is the conditional probability, and P_t(I|x) is the posterior probability of I at time t, given by formula (1).
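Formulas (1)–(3) are not reproduced in this text; assuming the standard SPODE estimator with Laplace smoothing (symbols as defined here and in the next paragraph, smoothing constants assumed), they can be reconstructed as:

```latex
% (1) SPODE posterior: super parent x' plus one-dependence attributes x_j
P_t(I \mid x) \;\propto\; p(I, x') \prod_{j=1}^{d} p(x_j \mid I, x')

% (2) Laplace-smoothed prior, with N intention classes and N_{x'} possible
%     super-parent values
p(I, x') = \frac{|D_{I,x'}| + 1}{|D| + N \cdot N_{x'}}

% (3) Laplace-smoothed conditional, with N_j possible values of attribute j
p(x_j \mid I, x') = \frac{|D_{I,x',x_j}| + 1}{|D_{I,x'}| + N_j}
```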
We use the SPODE model of seminaive Bayes to estimate the prior probability and conditional probability as formulas (2) and (3). In formulas (2) and (3), D represents the complete data set, N denotes the number of possible intention classes in D, and N_i is the number of possible values of the i-th attribute. D_{I,x′} is the subset whose intention category is I and whose value on the μ_e attribute is x′. D_{I,x′,x_j} is the subset whose intention category is I and whose values on the μ_e and j-th attributes are x′ and x_j. We used the data set in Table 1 to train the prior probability and conditional probability.
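As a concrete sketch, a minimal SPODE classifier with Laplace smoothing (formulas (1)–(4)) might look like the following; it also includes the missing-channel rule discussed later, where a lost attribute contributes a factor of 1. Attribute names and counts are illustrative:

```python
from collections import Counter, defaultdict

class SPODE:
    """Super-parent one-dependence estimator with Laplace smoothing (a sketch)."""

    def __init__(self, n_intents, attr_value_counts):
        self.N = n_intents                 # number of intention classes (N)
        self.Ni = attr_value_counts        # possible values per attribute (N_i)
        self.joint = Counter()             # counts of (I, x')
        self.cond = defaultdict(Counter)   # counts of (I, x') -> (attr, value)
        self.total = 0                     # |D|

    def fit(self, samples):
        # each sample: (intent, super_parent_value, {attr: value})
        for intent, sp, attrs in samples:
            self.total += 1
            self.joint[(intent, sp)] += 1
            for a, v in attrs.items():
                self.cond[(intent, sp)][(a, v)] += 1

    def posterior(self, intents, sp, attrs):
        scores = {}
        for I in intents:
            # formula (2): smoothed prior p(I, x')
            p = (self.joint[(I, sp)] + 1) / (self.total + self.N * self.Ni["super"])
            for a, v in attrs.items():
                if v is None:              # missing channel: p(x_j|I,x') = 1
                    continue
                # formula (3): smoothed conditional p(x_j | I, x')
                p *= (self.cond[(I, sp)][(a, v)] + 1) / (self.joint[(I, sp)] + self.Ni[a])
            scores[I] = p
        z = sum(scores.values())           # formula (4): normalize
        return {I: s / z for I, s in scores.items()}
```

Training on the prior data set fills the count tables; at test time, `posterior` returns the normalized probability of each intention at time t.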
Next, the prior probability and conditional probability are obtained according to formulas (2) and (3). Then the test data are input into classifier (1) in real time. Finally, the posterior probability of classifying the test data at time t into each intention label is calculated as P_t(I|x). By formula (4), we normalize the probability of each intention obtained by seminaive Bayesian classifier (1) and obtain the proportion of each intention I among all intentions.

In ordinary target detection pictures, the distance between people and objects is perspectively distorted. Therefore, we need to obtain the actual distance through a perspective transformation.
In order to get the actual distance between the user and the object, we use a perspective matrix and calibrate the camera, which is placed 2.5 m above the ground. We define the perspective matrix in (5).
(x, y) and (x′, y′) are the coordinates in the source image and the target image, respectively.
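A plausible explicit form of the perspective matrix in (5), with the bottom-right entry normalized to 1 (hence the eight unknowns mentioned below), is:

```latex
\begin{bmatrix} u \\ v \\ w \end{bmatrix} =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix},
\qquad
x' = \frac{u}{w}, \quad y' = \frac{v}{w}
```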
There are eight unknowns, so we take four point correspondences in the source image to solve for the matrix. By establishing the perspective transformation matrix and improving YOLOV3, we obtain the mapping relationship of the image. For example, the distance between the chairs in the upper-left and upper-right corners equals the distance between the chairs in the lower-left and lower-right corners; after the perspective transformation, these distances are essentially the same, as shown in Figure 3.
In actual tests, we found that the best results are obtained by selecting the midpoint of the bottom edge of the bounding box to calculate the distance between objects. The error is within 15 cm, which ensures the accuracy of the estimated probability.
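The bottom-midpoint distance computation can be sketched as follows, assuming the 3x3 perspective matrix H has already been solved from the four calibration points (the function names and matrix values are illustrative):

```python
def ground_point(bbox, H):
    """Map the bottom-edge midpoint of a bounding box (x1, y1, x2, y2)
    through a 3x3 homography H to ground-plane coordinates."""
    x1, y1, x2, y2 = bbox
    x, y = (x1 + x2) / 2.0, y2          # bottom-edge midpoint
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    gx = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    gy = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return gx, gy

def ground_distance(bbox_a, bbox_b, H):
    """Euclidean distance between two detections on the ground plane."""
    ax, ay = ground_point(bbox_a, H)
    bx, by = ground_point(bbox_b, H)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
```

In a real pipeline, H would come from a routine such as OpenCV's `cv2.getPerspectiveTransform` applied to the four calibration points.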

Distance-Based Probabilistic Calculations.
Generally speaking, when the user is close to an object, there is a large probability that the user intends to operate it. For example, when the user is close to a chair, the user has a large probability of interacting with the chair, such as sitting on it or moving it, as shown in Figure 4.
In formula (6), D_t is the sum of the distances from the detected special objects μ ∈ {cup, chair, . . .} to the user at time t, and d_{μ,t} is the distance from μ to the user at time t. For example, d_{cup,t} is the actual distance from the cup to the user at time t.
At present, we bind only one intention to each object; multiple intentions can be bound in future work. Intention and object are in one-to-one correspondence, I_μ = I ∈ {i_1, i_2, . . . , i_n} (e.g., if μ = cup, then I = i_3 (pour the water)). P_D(I_μ) denotes the probability of intention I_μ between the user and object μ.
After normalization with formula (7), P_D^t(I_μ) gives the probability that the user will interact with object μ, reflecting how easy the interaction is: the smaller the distance, the higher the probability.
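A minimal sketch of the distance-based probability in (6) and (7); the inverse-distance normalization used here is one plausible reading of "closer means more likely," not necessarily the paper's exact formula:

```python
def distance_probability(distances):
    """distances maps each detected special object to its actual distance
    (from the perspective-corrected image) to the user at time t.
    Returns a normalized interaction probability per object: closer objects
    receive higher probability."""
    inv = {obj: 1.0 / d for obj, d in distances.items()}
    z = sum(inv.values())                 # normalization, as in formula (7)
    return {obj: v / z for obj, v in inv.items()}
```

With the example distances used later in the paper (chair at 3.1, cup at 17), the chair dominates the distance probability, matching the intuition of Figure 4.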

Intentional Reasoning Formula Based on Multimodal Fusion.
By combining each intention probability obtained from knowledge reasoning and distance reasoning, algorithm (8) complements and corrects the two intention recognition results. Finally, the prediction probability P_t(I) of each intention at time t is obtained.
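One plausible sketch of fusion formula (8): multiply the knowledge-based probability P_K by the distance-based probability P_D and renormalize, with unbound intentions given the neutral distance probability 0.5 used later in the paper. The exact combination rule is an assumption:

```python
def fuse(p_k, p_d, default=0.5):
    """Combine knowledge-based (P_K) and distance-based (P_D) intention
    probabilities by product, then renormalize. Intentions with no bound
    object fall back to a neutral distance probability."""
    raw = {i: p_k[i] * p_d.get(i, default) for i in p_k}
    z = sum(raw.values())
    return {i: v / z for i, v in raw.items()}
```

The product form lets either branch veto an intention (a near-zero factor) while agreement between the two branches reinforces it.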

Scientific Programming

KDI Intention Understanding Algorithm Flow.
Next, according to several statistical experiments, we determine the thresholds ε_1 and ε_2, testing thresholds over the range 0.1-0.9. For example, when threshold ε_1 is set to 0.9, even when the user's intention is obvious it still fails to meet the system standard; that is, the threshold setting is unreasonable. After 200 statistical experiments, the accuracy of system feedback is 95% when ε_1 = 0.7 and ε_2 = 0.3, reducing the threshold error to about 5%. When max(P_t(I)) > ε_1, the intention level in the current situation is considered simple and the robot can initiate the interaction actively. For example, when all the features of moving the chair are satisfied at time t (the gesture is grab, the action is bend over, the chair exists, and its distance is the closest), the result is simple, and the robot takes the initiative to help move the chair.
When ε_2 < max(P_t(I)) < ε_1, the level of the intention I occurring in the current situation is medium, and the user can be asked tentatively. When max(P_t(I)) < ε_2, it is difficult to determine the intention level in the current situation, and the robot declines to interact at time t. Figure 5 is the interaction diagram of the algorithm. In the figure, we take recognizing the user's intention to move the table as an example. Firstly, the robot models the scene information and learns that there are a person, a table, a thermos, and a cup. Next, it judges the user's gestures and actions and combines them with the location information of the objects and the user. The final result is that the intention level is judged simple, and the robot actively helps move the table.
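The threshold logic above (ε_1 = 0.7, ε_2 = 0.3) reduces to a short decision function; a sketch:

```python
EPS_1, EPS_2 = 0.7, 0.3  # thresholds determined by 200 statistical experiments

def decide(fused):
    """Map the fused intention distribution P_t(I) to a response level."""
    intent, p = max(fused.items(), key=lambda kv: kv[1])
    if p > EPS_1:
        return intent, "simple"    # act proactively
    if p > EPS_2:
        return intent, "medium"    # ask the user tentatively
    return None, "refuse"          # intention unclear; do not interact now
```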

KDI Algorithm Analysis.
The KDI algorithm can process multichannel information in parallel and alleviate the multimodal conflict problem. The acquired action, gesture, and object information is processed in real time, and the final probability is calculated by fusion algorithm (1). In our experiments there is some probability that a channel recognition fails, for example, because of illumination. If attribute information (x_j) is lost at the current time, we set the conditional probability of the missing attribute to 1 (p(x_j|I, x′) = 1) and continue the intention calculation. This ensures that the overall probability is not affected even if the information of a certain channel cannot be obtained. However, when all the user's information (action and gesture) is lost, we assume that the Kinect is occluded and has failed to recognize and retrieve the user's information.

Results and Discussion
The host processor selected in the experiment is an Intel(R) Core(TM) i7-9750 CPU. Under a 64-bit Windows 10 system, the target detection network model is based on YOLOV3, and the RGB camera, calibrated 2.5 m above the ground, collects the target detection images. We use Kinect as Pepper's eyes to take images, use a CNN to analyze gesture information, and use bone-point information to analyze action information (see 3.1.1). The development languages are C++ and Python, and the development platform is Microsoft.

Input: a frame image at time t. Output: the maximum probability among the intention classifications, max(P_t(I)).

Start: User's information detection (from Kinect).
(1) If the user is found in the scene, then (2) wake up the system and capture the user's information.

The actions are classified as a_1, a_2, a_3, a_4, and a_5. The existence attributes of objects are cup (binding intention i_3), chair (binding intention i_2), and thermos (no binding intention). The intentions are classified as i_1, i_2, i_3, i_4, and i_5. Table 1 shows part of the data set of this experiment. The data set of this classifier comes from on-site action demonstrations by 10 elderly people (five women and five men). During the demonstrations, we informed them of the intention range and provided different interactive scenes (with or without special objects), and each user then gave gesture and action feedback under an intention of his or her own choosing. From the test results, we screened 100 test data for each intention, so a total of 500 test data formed the prior data set. We used this data set to set the prior probability and conditional probability by (2) and (3). Table 3 shows the human-computer interaction behavior we predesigned: we consulted our interviewees and took the action most desired from the robot under each intention as the robot's feedback, based on our actual research results. Figure 6 shows a group of data at a certain time t, and we split it into multiple single modals as shown in Figure 7; they illustrate the operation process and intention recognition process of the algorithm through examples of each attribute value. The actual distances between the user and the chair and the water cup at time t, measured by the base-station camera, are d_chair,t = 3.1 and d_cup,t = 17, respectively. Table 4 shows the values of each attribute captured by the algorithm at a certain time t, which are input into the classifier. Based on the prior and conditional probabilities, the probability of each intention is calculated by (1). At the same time, the distance probability between each special object and the user is calculated by (7).
Finally, the intention probability after fusion is obtained by (8).

Examples and Analysis of Experiments.
For those intentions that can be inferred without binding objects, we set their distance probability P_D to 0.5 by default. That is, i_1 (shaking hands with the robot) does not need to bind an object, so P_D^t(i_1) = 0.5. After comparing the fusion probabilities, the probability of intention i_2 is the largest. Next, the robot classifies the user's intention according to the intention level classification method. Compared with the threshold, P_t(i_2) > ε_1, so the intention level is simple, and the robot walks over to move the chair.

Comparison Test and Analysis.
Three groups of experiments were designed as control experiments. In the identification layer, the acquisition of single-modal information based on the CNN model proposed in [22] and on the LSTM model proposed in [24] served as control experiments. In the intention fusion layer, the seminaive Bayesian classifier without distance reasoning and the deep belief network proposed in [25] were used as control experiments. The control experiments used the same prior data set, and 100 experiments were carried out with each model. We summarize the data in Table 5. We focus on the single-modal accuracy, the final intention understanding rate, and the time spent by the classifier, and we evaluate the understanding rate of the intent classifiers by building four confusion matrices in Figure 8.
It can be seen from the above table that the KDI model proposed in this paper is superior to the other three models in intention understanding rate. Moreover, the proposed distance reasoning optimizes the intention understanding rate in the experiment. Compared with the deep learning fusion model, this algorithm is more extensible, and new intention classification can be added by establishing its prior knowledge.

User Experience.
In this section, 30 elderly people (15 men and 15 women) were invited to participate in the user experience study. The results show that all four systems can roughly understand and interact with the user's intention. We recorded the user experience scale in detail using the questionnaire in Figure 9, surveying satisfaction in terms of system convenience, system helpfulness, system user load, and system accuracy. Each item is scored from 1 to 10 points (1 is the worst and 10 is the best). Finally, the user satisfaction questionnaires were statistically analyzed to obtain the satisfaction chart shown in Figure 10.

Conclusions
This paper proposes a multimodal intention understanding algorithm (KDI), which achieves a 91% intention understanding rate in our experiments. Compared with other intention understanding algorithms, it is efficient, accurate, and easy to extend through incremental learning. In the practical application scenario, user experience interviews with the elderly yielded a high evaluation, and the low user load of the algorithm model demonstrates its practicability and convenience.
There are still several weaknesses in our work: (1) In person recognition, only one person can be the protagonist. When multiple users appear in the scene, the system selects a "zero" user by default for system operation. (2) The system is limited by camera occlusion: when there is an obstruction between the user and the Kinect, the captured user information is incomplete.
In the future, we will consider obtaining the user's information through wearable devices.

Figure 6: Practical scene diagram of the experiment. The robot captures the user's gesture as h_1 and action as a_2, detects the chair, thermos, and cup, and computes their actual distances from the user.

Table 3: Human-computer interaction behavior.

A Pepper robot is used to perform the interaction for each intention. Intention i_1: the robot approaches the user and raises its arm to shake hands. Intention i_2: the robot moves over to help the elderly person move heavy objects. Intention i_3: the robot moves to the thermos, picks it up, and brings it to the user. Intention i_4: the robot moves to draw the curtains for the user. Intention i_5: the robot calls the user's children about sudden anomalies.

(3) At present, the algorithm has limited prior knowledge of user intentions and gives only a single intention to each object. In the future, multiple intentions will be considered for a single object.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.