Using a Multilearner to Fuse Multimodal Features for Human Action Recognition

School of Artificial Intelligence and Big Data, Hefei University, Hefei 230601, China
School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
School of Computer and Information Science, Shanxi University, Taiyuan 030006, China
School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, China
College of Information Science and Engineering, Jishou University, Jishou 416000, China


Introduction
Human action recognition is an interdisciplinary research direction in the field of computer vision, involving image processing, pattern recognition, machine learning, and artificial intelligence. With the rapid development of digital image processing and intelligent hardware manufacturing, human action recognition has wide application prospects in intelligent video monitoring [1][2][3][4], natural human-computer interaction [5,6], smart home products [7][8][9], and virtual reality [10]. The popularity of the topic has led to several survey articles [11][12][13][14][15], which discuss the various features and classifiers that have been used for human action recognition. In recent decades, computer vision research based on RGB image information has grown steadily. However, RGB images usually provide only the appearance of objects in the scene. When the foreground and background of an RGB image are similar in texture or color, accurate recognition is difficult with such limited information. In addition, the appearance of an object in an RGB image may not be robust to common visual changes, such as illumination variation, which seriously hinders the use of RGB-based visual algorithms in real-world environments.
With the continuous progress of sensing technology, Microsoft has released the Kinect sensor, which provides RGB information, scene depth information, and human skeletal information. The depth image depends only on the distance between the object and the camera and is not affected by illumination variation, environmental changes, or shadows. The human action sequence, in the form of multimodal sensor data, contains rich temporal patterns that can be used to distinguish between different action categories. This paper makes full use of the multimodal information provided by the Kinect sensor to extract effective human action features and uses a multilearner integration strategy based on the K-nearest neighbor algorithm to construct a classification model. The main contributions of this article are as follows: (1) The RGB modal information, based on the histogram of oriented gradients (RGB-HOG), maintains good invariance to both geometric and photometric deformation. The depth modal information, based on space-time interest points (D-STIP), preserves the dynamic stability of a human action feature and maintains good local invariance to human movement. The skeleton modal information, based on the joints' relative position feature (S-JRPF), describes the spatial structure of human action well. Together, the three modal features effectively represent human behavior and provide a reliable action representation.
(2) This work uses a multilearner ensemble to classify the prediction samples, making full use of the learning biases of different learners to enhance the generalization ability of the overall model. The rest of this paper is organized as follows. Section 2 presents the related works. Section 3 describes the method framework for human action recognition. In Section 4, three different behavioral descriptors are introduced. We introduce the human action recognition algorithm in Section 5. Experimental results are given in Section 6 to verify the feasibility and performance of the proposed method. Finally, a brief conclusion and future work are given in Section 7.

Related Works
Although there have been many achievements in action recognition research, human action recognition in real environments remains difficult. Video-based human action recognition can be divided into RGB-based and RGB-D-based approaches. Compared with RGB-D data, RGB data carry more abundant appearance information and can better describe the interaction between human and object. However, RGB data are easily affected by factors such as weather, lighting, shooting angle, and clothing, which makes it difficult to separate the foreground from the background when extracting features. Compared with traditional RGB data, RGB-D data are not affected by changes in illumination, color, or texture. More importantly, they allow the contour and skeleton of the human body to be estimated reliably.
Recently, with the development of RGB-D cameras, especially the Kinect sensor launched by Microsoft, research has focused on the use of depth images to solve the problem. Compared with traditional RGB data, the depth information provided by RGB-D images is more robust to changes in lighting conditions. The ever-growing popularity of the Kinect and inertial sensors has prompted intensive research efforts on human action recognition. Since human actions are captured by both Kinect and inertial sensors, they can be characterized by multiple feature representations. By encoding the multiview features into a unified space, richer data are available for human action recognition.
In recent years, video-based human action recognition has made great progress, and many scholars have surveyed and analyzed methods based on RGB-D data [16,17]. According to the type of data used, depth-sensor-based human action recognition methods can be divided into three categories: depth image sequence-based methods, skeleton data-based methods, and multimodal feature fusion-based methods.

Depth Image Sequence-Based Method.
In RGB-D video, depth data can be regarded as a spatiotemporal structure composed of depth information, and the feature representation of an action is the process of extracting features from this structure. Methods based on depth sequences mainly use the changes of the human body in the depth map to describe the action. Sahoo et al. [18] applied depth history images (DHIs) to AlexNet to fine-tune the weights of the pretrained deep learning architecture. To recognize closely related actions, DHI alone is not sufficient, so 3D projected planes are extracted and trained separately on AlexNet. Two types of projected planes are extracted in this work: the XT plane (side view) and the YT plane (top view) of the action videos. The scores from both learners are fused to produce the final recognition score. Li et al. [19] proposed a real-time human action recognition system that uses a depth map sequence as input. The system contains the segmentation of the human body, action modeling based on 3D shape context, and an action graph algorithm. Xu et al. [20] proposed an effective method for human action recognition from depth images. A multilevel frame select sampling (MFSS) method first generates three levels of temporal samples from the input depth sequences. Then, the proposed motion and static mapping (MSM) method is used to obtain the representation of the MFSS sequences. After that, a block-based LBP feature extraction approach extracts feature information from the MSM. Finally, the Fisher kernel representation is applied to aggregate the block features, which is then combined with a kernel-based extreme learning machine classifier. Chen et al. [21] proposed a human action recognition method using depth motion maps (DMMs). Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between two consecutive projected maps is accumulated through an entire depth video sequence, forming a DMM. An l2-regularized collaborative representation classifier with a distance-weighted Tikhonov matrix is then employed for action recognition. The developed method is shown to be computationally efficient, allowing it to run in real time. The above methods identify actions by analyzing and modeling the motion information in the depth sequence. However, because RGB-D video itself is noisy and lacks relevant appearance and texture information, depth sequence-based methods have not achieved ideal results on many datasets.

Skeleton Data-Based Method.
Action recognition based on skeleton data is an important direction in depth data research. Based on the skeleton sequence of the human body, this approach uses the changes of human joints between video frames, including changes of joint position and appearance, to describe the movement. Because the skeleton model of the human body can be quickly and accurately estimated from depth data, human posture estimation based on RGB-D data is widely used. Wan et al. [22] extracted orientation vectors from several groups of skeleton joints and used a stacked residual bidirectional long short-term memory (LSTM) network to build the model. Liu et al. [23] proposed a new LSTM-based action recognition network for skeleton data, the global context-aware attention LSTM network. By using a global context memory unit, the network can selectively focus on the informative joints in each frame. To further improve the attention ability of the network, a recurrent attention mechanism is introduced, through which the attention performance of the network is gradually improved. Liu et al. [24] proposed a method of human motion recognition based on the skeleton data collected by a depth sensor. To make full use of the skeleton data, movement features such as position, speed, and acceleration are extracted from each frame to capture the dynamic and static information of human action. Finally, a k-nearest neighbor algorithm with weighted voting is used for action recognition, with pose specificity serving as the voting weight. Phyo et al. [25] used skeleton motion history images to build a deep learning model for recognizing human behavior. Their experimental results show that this method achieves high recognition accuracy with low computational cost in various environments. Because skeleton information is not affected by background lighting and other factors, it is robust and can be quickly and accurately estimated from depth data. In recent years, with the development of deep learning, the application of convolutional neural networks (CNNs), recurrent neural networks (RNNs), LSTMs, and other frameworks has advanced skeleton-based action recognition, and further progress can be expected.

Multimodal Feature Fusion-Based Method.
Each feature extraction method has its own advantages, and the methods are independent of each other. If different features can be fused effectively, a more discriminative feature vector can be obtained and recognition performance will improve. Therefore, fusion methods have attracted increasing attention from scholars in recent years.
There are two kinds of fusion methods: feature-level fusion and decision-level fusion.
Feature-level fusion is an early fusion approach. First, feature vectors are extracted by different methods; then the extracted features are standardized, selected, or transformed to generate a new, more discriminative feature vector. Zhang et al. [26] proposed an action recognition method that combines gradient information and sparse coding. First, coarse depth-skeleton features are extracted using depth gradient information and skeleton joint distances. Then, sparse coding and max pooling are combined to refine the coarse depth-skeleton features. Finally, random decision forests are used to recognize the actions. El Din El Madany et al. [27] proposed a human action recognition framework using global locality preserving canonical correlation analysis (GLPCCA); their work fuses the depth and RGB modalities, using a hierarchical pyramid of depth motion map deep convolutional neural network (HP-DMM-CNN) for the depth images and an optical flow convolutional neural network to model the RGB videos. Guo et al. [28] proposed a new unsupervised feature fusion method for human action recognition, termed multiview Cauchy estimator feature embedding (MCEFE). By minimizing empirical risk, MCEFE integrates the encoded complementary information in multiple views to find the unified data representation and the projection matrices. To enhance robustness to outliers, the Cauchy estimator is imposed on the reconstruction error. Asteriadis et al. [29] presented a novel multimodal human action recognition method that handles a sensing device's noise and person-specific characteristics. Each action is represented by a basis vector, and spectral analysis is performed on an affinity matrix of new action feature vectors. Using modality-dependent kernel regressors to compute the affinity matrix, the complexity is reduced by forming robust low-dimensional representations. Gao et al. [30] proposed pyramid appearance and global structure action descriptors on both RGB and depth motion history images (MHIs) as a model-free method for human action recognition. In this algorithm, a motion history image is first constructed for both the RGB and depth channels, while the depth information is used to filter the RGB information; next, different action descriptors are extracted from the depth and RGB MHIs to represent the actions, and a multimodality information collaborative representation and recognition model is built in which the multimodality data enter an objective function naturally. In this method, information fusion and action recognition are performed together, with the goal of classifying human actions.

Decision-level fusion differs from feature-level fusion. First, the classifier trained by each method outputs its classification result, and then the results are fused to obtain the final classification. To effectively combine the joint, RGB, and depth information of the Kinect sensor, Seddik et al. [31] proposed local and global support vector machine models using a multilayer fusion scheme to connect different features. Malawski and Kwolek [32] proposed a new motion descriptor called joint motion history context, which is based on depth and skeleton data; a decision-level fusion method based on support vector machines and a multilayer perceptron is used to effectively fuse the motion information of multiple feature sets. Imran and Raman [33] proposed a multimodal action recognition method based on the deep learning paradigm. First, for RGB video, a new image-based descriptor called the stacked dense flow difference image (SDFDI) is proposed, which captures the temporal and spatial information in a video sequence. They then train various deep two-dimensional CNNs and compare SDFDI with the latest image-based representations. Second, for the skeleton stream, a data augmentation technique based on 3D transformations is proposed to train deep neural networks on small datasets, and an RNN model based on bidirectional gated recurrent units (BiGRU) is proposed. Third, for the inertial sensor data, a data augmentation method based on Gaussian white noise jitter is proposed, and action classification is performed with a deep one-dimensional CNN. The outputs of these three heterogeneous networks are combined by multiple model fusion methods based on score and feature fusion.
Although existing action recognition methods using depth information have made great progress, recognition reliability is still unsatisfactory for practical engineering. The primary reason is that human actions exhibit large within-class differences but small between-class differences, and distinguishing differences in human movement speed requires high computational complexity.

Method Framework
To improve the robustness and practicability of the recognition system and to make full use of the advantages of different features, we use the different modality data provided by the Kinect sensor. Three kinds of features are used as human action descriptors, and a multilearner ensemble algorithm is then used to recognize the action. The system flow is shown in Figure 1. This method preserves the efficiency of computing simple features while guaranteeing the robustness of the recognition system and the discriminative ability of the action features.
The system framework includes the following steps: (1) Obtain synchronous RGB image, depth image, and skeleton data from the Kinect sensor.
(2) Transform the RGB image data to grayscale to reduce the scale of data processing, use classical filtering methods to reduce image noise, and then extract a histogram of oriented gradients from the processed image. Space-time interest points are extracted as features from the depth image data, and the relative positions of joints are extracted as features from the 3D skeleton data. (3) Feed the three modal features to the multilearner ensemble based on the K-nearest neighbor algorithm to classify the action.

Feature Extraction
The output of the Microsoft Kinect camera is a multimodal signal that provides RGB video, a depth map sequence, and skeleton joint information at the same time.
Thus, it can effectively compensate for the loss of depth information and of spatial relationships between objects that occurs when a traditional RGB camera projects the 3D physical world onto the 2D image plane. The characteristics of the different modalities are independent but complementary. To obtain better recognition performance, this paper effectively fuses the features of multiple modalities and designs a highly discriminative description vector, using visual, depth, and skeleton information to improve the recognition results. In this section, three different behavioral descriptors are introduced.

RGB-HOG.
The histogram of oriented gradients (HOG) is a feature descriptor for object detection in computer vision and image processing [34]. HOG descriptors effectively extract the local gradient and orientation information of the image to describe the key characteristics of human behavior. The traditional HOG feature extraction process has a pyramid structure consisting of three layers: cell, block, and image. The bottom-up steps are as follows: (1) construct the feature vector of each cell; (2) construct the feature vector of each block; and (3) construct the feature vector of the image. In constructing the cell histogram, the traditional HOG operator does not consider the influence of neighborhood pixel gradients, so an "aliasing effect" easily appears. To solve this problem, Dalal et al. [34] used a block overlap method, but its computation is heavy; Pang et al. [35] used linear interpolation to adjust the voting weights of the pixels in a block, but this does not consider the influence of pixels in the block neighborhood. In fact, only part of the gradient information of a cell's neighborhood is used, which leads to insufficient information utilization. In this paper, the neighborhood range of each cell is defined and the voting method of neighborhood pixels is further improved: the histogram of the original cell is modified by the gradient amplitudes of all pixels in the cell's neighborhood. The HOG feature extraction flow is shown in Figure 2.
Step 1: input image and region of interest extraction. In human behavior recognition research, the region of interest (ROI) is a smaller region selected from an image; it is the part most important for human motion analysis. The region can be cropped from the full-size image to reduce processing time and increase accuracy. In this paper, an input image G(x, y) is first analyzed by a region of interest detection algorithm to predict the approximate position of the target, and the minimum rectangular boundary around the target is selected as the region of interest. Subsequent operations, including feature extraction, are carried out in the ROI of the original image:
F(x, y) ⟵ ROI(G(x, y)). (1)

Step 2: image graying and gamma correction. Due to variations in image acquisition devices and environments, the captured human images may be unclear and prone to failed or false detection. Consequently, it is necessary to preprocess the collected human images, mainly to handle images that are not luminous enough (too dark) or too luminous (too bright). Two processes deal with this issue: image graying and gamma correction.
(a) Image graying: for a color image, the RGB components are converted into a grayscale image. The standard conversion formula is

Gray(x, y) = 0.299 R(x, y) + 0.587 G(x, y) + 0.114 B(x, y). (2)

(b) Gamma correction: in the case of uneven illumination, gamma correction can be used to raise or lower the overall brightness of the image. In practice, gamma normalization can use either the square root or the logarithm; in this paper, we use the square root method. The formula is (with c = 0.5)

I(x, y) = F(x, y)^c = sqrt(F(x, y)). (3)

Step 3: gradient calculation. For the normalized image, the gradients are obtained via

G_x(x, y) = I(x + 1, y) − I(x − 1, y), G_y(x, y) = I(x, y + 1) − I(x, y − 1), (4)

and the gradient magnitude and direction via

G(x, y) = sqrt(G_x(x, y)^2 + G_y(x, y)^2), Φ(x, y) = arctan(G_y(x, y)/G_x(x, y)). (5)

Step 4: histogram of oriented gradients. The gradient direction image Φ(x, y) is divided into N cells, with 8 × 8 = 64 pixels per cell; adjacent cells do not overlap. The gradient directions of the pixels are counted in each cell. All gradient directions are divided into 9 bins (i.e., a 9-d eigenvector) on the horizontal axis of the histogram, and the accumulated gradient magnitude corresponding to each angle range forms the vertical axis.
Then, the original histogram vector values are modified. Suppose Cell_i is any cell and M_1 is any pixel in its neighborhood. The size of the Cell_i area is d × d, and the coordinate of its center is (x_i, y_i). The coordinates of pixel M_1 are (x, y, θ), where θ is the gradient direction value of M_1 and the gradient amplitude is G(x, y). Assume that θ lies between the direction bins θ_l and θ_r of Cell_i, and let the correction coefficients of M_1 for the θ_l and θ_r direction bins be w_l and w_r, respectively. Trilinear interpolation is used for the correction, giving

w_l = (θ_r − θ)/d_θ, w_r = (θ − θ_l)/d_θ, (6)

where d_θ is the angle difference between adjacent direction bins.
After correction, the histogram values h(x_i, y_i, θ_l) and h(x_i, y_i, θ_r) of the Cell_i histogram are

h(x_i, y_i, θ_l) ⟵ h(x_i, y_i, θ_l) + w_l · G(x, y), h(x_i, y_i, θ_r) ⟵ h(x_i, y_i, θ_r) + w_r · G(x, y). (7)

According to formula (7), the histogram of Cell_i is modified using the gradient information of all pixels in the Cell_i neighborhood. In the same way, we modify the HOG of the other cells of the original image to obtain the modified HOG vector.

Step 5: histogram normalization of overlapping blocks. If there is large variation of illumination and background in the image, the range of gradient values will be large, so good feature normalization is very important for the detection rate. There are many ways to normalize; most define a set of cells as a block and then normalize each block separately. Here, 2 × 2 adjacent cells form a block: each cell is 8 × 8 pixels, so each block is 16 × 16 pixels (in the accompanying figure, the red, blue, yellow, pink, and green boxes are all blocks). Adjacent blocks overlap, so the information of adjacent pixels is used effectively, which is very helpful to the detection results.
Next, each block is normalized. There are four cells in a block, and each cell contains a 9-dimensional feature vector, so each block is represented by a 4 × 9 = 36-dimensional feature vector v. In this paper, the L2 norm is used for feature normalization. Let ε be a very small normalization constant; then

v ⟵ v / sqrt(||v||_2^2 + ε^2). (8)
After normalizing the histograms of the overlapping blocks, the feature vectors of all blocks are concatenated to form the HOG feature.
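To make the pipeline concrete, the following is a minimal Python sketch of Steps 3-5 under the stated parameters (8 × 8-pixel cells, 9 orientation bins, overlapping 2 × 2-cell blocks, L2 normalization with constant ε). It implements plain magnitude-weighted voting with linear interpolation between the two nearest orientation bins (the w_l/w_r idea); the function name and defaults are illustrative, not the paper's implementation:

```python
import numpy as np

def hog_descriptor(gray, cell=8, bins=9, eps=1e-5):
    """Sketch of Steps 3-5: gradients, interpolated cell histograms,
    and L2 normalization over overlapping 2x2-cell blocks."""
    gray = gray.astype(float)
    # Step 3: centered gradients; orientation folded into [0, 180)
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    ny, nx = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((ny, nx, bins))
    bin_w = 180.0 / bins
    # Step 4: each pixel votes into its two nearest orientation bins,
    # weighted by gradient magnitude (the w_l / w_r correction idea)
    for i in range(ny * cell):
        for j in range(nx * cell):
            b = ang[i, j] / bin_w
            left = int(b) % bins
            right = (left + 1) % bins
            w_r = b - int(b)                     # and w_l = 1 - w_r
            hist[i // cell, j // cell, left] += mag[i, j] * (1.0 - w_r)
            hist[i // cell, j // cell, right] += mag[i, j] * w_r

    # Step 5: overlapping 2x2-cell blocks, each L2-normalized with eps
    blocks = []
    for i in range(ny - 1):
        for j in range(nx - 1):
            v = hist[i:i + 2, j:j + 2].ravel()   # 4 x 9 = 36-d vector
            blocks.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    return np.concatenate(blocks)
```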

D-STIP.
The action recognition method based on space-time interest points is currently one of the more popular approaches. It describes the action by detecting interest points whose pixel values change significantly in the spatiotemporal neighborhood and extracts low-level features from them.
Because space-time interest points are local features, they are not easily affected by illumination, motion characteristics, or background changes, so this method is more robust than less localized methods. In this paper, we implement the representation of space-time interest points and space-time words based on the depth image. The method first extracts accurate space-time interest points from the samples and then extracts the local neighborhood features of the interest points. Next, a space-time codebook based on the features of the interest points is established, and a statistical histogram of the interest points based on the space-time codebook is obtained. The D-STIP extraction flowchart is shown in Figure 3.
Step 1: Dollar STIP detection. Laptev extended the 2D Harris corner [36] to the 3D Harris corner [37] and used it to find significant change points in the spatiotemporal domain. First, the video sequence is represented in linear scale space as

L(x, y, t; σ_l^2, τ_l^2) = g(x, y, t; σ_l^2, τ_l^2) * D(x, y, t). (9)

Then, the second-moment matrix can be obtained as

N = g(·; σ_l^2, τ_l^2) * (∇L (∇L)^T), (10)

where g(·, σ_l^2, τ_l^2) is the Gaussian kernel function, σ_l^2 is the spatial scale factor, τ_l^2 is the temporal scale factor, and D(·) is the depth video image sequence. The three eigenvalues λ_1, λ_2, and λ_3 of the matrix N correspond to changes in the depth video sequence D(·) in the two spatial directions (x, y) and along the temporal domain t, respectively. When these values are all large, the video changes significantly along all three directions, and the point is therefore a space-time interest point. Laptev defined the response function of interest points as

H = det(N) − k · trace^3(N), (11)

where det(N) and trace(N) are the determinant and trace of the matrix, respectively, and k is a coefficient that usually takes the value 0.005. The function value H attains a local maximum at a point of interest. 3D Harris corner detection is very sensitive to movements that change the direction of velocity, such as walking, running, and waving, but for other movements, such as rotation and periodic movement, it often detects no interest points at all. The interest points detected by the 3D Harris spatiotemporal corner detector are too sparse. Although some sparsity is desirable, too few interest points mean too few low-level features, which can hurt recognition. Dollar et al. [38] proposed a new interest point detection method that yields denser interest points.
The Dollar response function H is calculated by separable linear filters:

H = (D * g * h_ev)^2 + (D * g * h_od)^2, (12)

where g(x, y; σ) is a 2D Gaussian smoothing kernel for spatial filtering, and h_ev and h_od are the quadrature components of a one-dimensional Gabor function used for filtering in the time domain:

h_ev(t; τ, ω) = −cos(2πtω) e^(−t^2/τ^2), h_od(t; τ, ω) = −sin(2πtω) e^(−t^2/τ^2), (13)

where ω = 4/τ. The response function H has only two parameters, σ and τ, corresponding to the spatial and temporal scales, respectively. A point where the response function H has a local maximum greater than a certain threshold is detected as an interest point; the number of detected interest points can be controlled by the threshold. To handle scale changes, a multiscale combination can be used. However, noise points in the depth image also respond strongly to the space-time kernel functions and are therefore mistakenly detected as interest points. These wrong interest points introduce many errors into the subsequent feature description, which seriously reduces the descriptive ability of the spatiotemporal interest points. In this paper, a correction filter is applied to the detected interest points to reduce noise interference. The noise in a depth image can be roughly divided into three categories. The first is generated by the depth sensing equipment and appears randomly across the whole depth image; such random noise is rare and has little influence on interest point detection. The second kind appears at the edges of scene objects because of the nature of structured-light imaging; the depth value often jumps between the foreground and background on the two sides of the edge. The third is due to reflective material on the object surface, fast movement, and so on, which produce "holes" in the depth image, that is, pixels whose depth value is lost (the pixel value is zero). The second and third kinds of noise strongly disturb interest point detection and are difficult to remove by ordinary spatial smoothing. Generally speaking, the disturbance frequency of the noise signal is much higher than the motion frequency of the human body, and it may persist across consecutive frames of a motion segment. Based on this, we can calculate the average duration of the noise disturbance and then filter the obtained interest points. The correction function of an interest point is

ρ(x, y) = (1/T) Σ_{i=1}^{n_fp(x,y)} δt_i(x, y), (14)

where n_fp(x, y) is the number of times the signal jumps at pixel (x, y) over the whole movement period of length T and δt_i(x, y) is the duration of the ith jump. The correction function measures the fraction of the time period during which the pixel's signal is active; it takes higher values at real moving pixels and lower values at noise points. Therefore, pixels (noise points) with a low ratio can be filtered out by setting a threshold. After detecting the interest points, appropriate local feature descriptors must be selected to represent them.
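A minimal sketch of the Dollar et al. [38] response computation described above, assuming the depth video is stored as a (T, H, W) array; the function name and default scales are illustrative, and detecting interest points would then amount to a thresholded local-maximum search on the returned H:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def dollar_response(video, sigma=2.0, tau=2.5):
    """Sketch of H = (D*g*h_ev)^2 + (D*g*h_od)^2 on a depth video of
    shape (T, H, W): Gaussian smoothing in space, quadrature 1-D Gabor
    pair in time."""
    omega = 4.0 / tau
    t = np.arange(-int(4 * tau), int(4 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)

    # Spatial Gaussian smoothing on each frame (axes 1 and 2 only)
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    # Quadrature pair of 1-D Gabor filters along the time axis (axis 0)
    r_ev = convolve1d(smoothed, h_ev, axis=0)
    r_od = convolve1d(smoothed, h_od, axis=0)
    return r_ev ** 2 + r_od ** 2
```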
Step 2: feature description of interest points. Dollar et al. [38] proposed the cuboid concept for describing detected interest points. A cuboid is a small video block centered on an interest point, whose edge length generally depends on the detection scale of the interest point. Cuboid descriptors represent an interest point together with its neighborhood information. First, three kinds of transformations are performed on the detected cuboid: (1) pixel value normalization; (2) for each pixel (x, y, t), the gradients in different directions are calculated, yielding three cuboid matrices (C_x, C_y, C_t); and (3) the Lucas-Kanade optical flow [39] is calculated for adjacent frames, yielding two cuboid matrices (V_x, V_y). Therefore, for each interest point p_i in the set of extracted interest points P = {p_1, p_2, . . . , p_n}, we can calculate its feature description as F_i = (C_ix, C_iy, C_it, V_ix, V_iy).
Step 3: establishment of the space-time codebook. Because of differences in performers' clothing, action style, and amplitude, the same action will produce different interest points in different videos. However, the features of these interest points are similar and capture the essential temporal and spatial characteristics of the action. After the feature representation of interest points, we need to use the feature vectors to represent different actions, that is, to model the action. The most common way to model the interest points is the bag of video words (BoVW) method. A k-means clustering algorithm is used to cluster the feature set extracted from the training dataset, with the number of cluster centers selected experimentally. The generated cluster centers are regarded as the spatiotemporal words w_i = (f_1, f_2, . . . , f_m), where m is the feature dimension and f_j is the jth feature component of the spatiotemporal word. The set of all spatiotemporal words is V = {w_1, w_2, . . . , w_n}, where n is the number of cluster centers. For different action videos, the spatiotemporal codebooks corresponding to different action categories are trained from the training set according to the above steps. In the subsequent recognition process, interest points are classified by calculating the distance between their features and the spatiotemporal words. The statistical histogram of interest points H = (h_1, h_2, . . . , h_n) based on the spatiotemporal codebook is obtained by counting the categories of all interest points in the video, where n equals the dimension of the spatiotemporal codebook and h_i is the frequency of the ith spatiotemporal word in the video. Finally, the histogram is used as the video descriptor:

F_D-STIP = H = (h_1, h_2, . . . , h_n). (15)
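A short sketch of this BoVW modeling step, assuming scikit-learn's k-means is an acceptable stand-in for the clustering described above; names and the codebook size are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_features, n_words=200):
    """Cluster cuboid descriptors from the training set into n_words
    spatiotemporal words (the codebook V); train_features is a list of
    per-video descriptor matrices."""
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(train_features))

def video_histogram(codebook, video_features):
    """Assign each interest-point descriptor to its nearest word and
    return the normalized frequency histogram H used as the descriptor."""
    words = codebook.predict(video_features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```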

S-JRPF.
The skeleton joint points are the visually salient points of the human body, and their movement in 4D space reflects the semantic information of the action. Research on joint-based motion recognition can be traced back to Johansson's early work [40]. His experiments show that most movements can be identified from the positions of the joint points alone. This idea has been adopted by a large number of subsequent researchers and has gradually formed an important branch of human motion recognition methods.
With the release of the Microsoft Kinect sensor, it is convenient to obtain the depth map of the scene and the 3D skeleton of the human body. Compared with features extracted from the depth image, the 3D skeleton data provided by the Kinect represent the human body with only 20 joint points. After feature extraction, the feature dimension is therefore lower and the computation smaller, which benefits the real-time performance of action recognition algorithms. For three-dimensional skeleton motion data, the motion must first be expressed through suitable features before it can be recognized correctly. Using the coordinate information of the 20 joint points from the Kinect, we can find a good representation of the human body.
Based on the joint modal data, this paper presents the spatial distribution feature of joint projections to represent human motion. First, the 3D skeleton data of each frame are collected and projected onto three planes (the XOY, YOZ, and XOZ planes) to obtain the position distribution of the projected joint points of a single frame on the different projection planes. The projection of the joint points of the human body is shown in Figure 4.
Then, the joint points on the three projection planes are represented in polar coordinates:

ρ = sqrt(u^2 + v^2), θ = arctan(v/u), (16)

where (u, v) are the coordinates of a projected joint point in its projection plane. Finally, the polar coordinates of the projection points on the three projection planes are concatenated as the feature vector of the frame. To make the feature data fall in [0, 1], the joints' relative position feature is obtained by min-max normalization,

f' = (f − f_min)/(f_max − f_min), (17)

since the skeleton modal information is invariant under translation, scale, and rotation transformations. Therefore, the feature view in the joint modality can be expressed as F_S-JRPF = (f'_1, f'_2, . . .).
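As an illustration, a minimal sketch of the S-JRPF computation for a single frame, assuming the 20 joints arrive as a (20, 3) array of (x, y, z) coordinates; applying min-max normalization over the concatenated frame vector is one plausible reading of equation (17):

```python
import numpy as np

def s_jrpf(joints):
    """Sketch of S-JRPF for one frame: project the 20 (x, y, z) joints
    onto the XOY, YOZ, and XOZ planes, convert each projection to polar
    coordinates (rho, theta), concatenate, and min-max normalize to [0, 1]."""
    planes = [(0, 1), (1, 2), (0, 2)]          # XOY, YOZ, XOZ
    feat = []
    for a, b in planes:
        u, v = joints[:, a], joints[:, b]
        feat.append(np.hypot(u, v))            # rho = sqrt(u^2 + v^2)
        feat.append(np.arctan2(v, u))          # theta, robust form of atan(v/u)
    f = np.concatenate(feat)
    # Equation (17): min-max normalization into [0, 1]
    return (f - f.min()) / (f.max() - f.min() + 1e-12)
```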

Recognition Algorithm
Experiments show that the classification performance of an ensemble learning system is better than that of each base classifier, which proves the effectiveness of ensemble learning. Dietterich [41] listed ensemble learning among the top four research directions of machine learning. The goal of ensemble learning is to build a strong classifier with excellent classification performance and generalization ability. Among traditional classification algorithms, SVM and KNN classifiers perform better than the others.
However, the classification performance of a single base classifier is not stable: simply using a base classifier to classify the data easily leads to overfitting. Combining base classifiers according to a combination strategy produces a strong classifier whose performance exceeds that of each base classifier. To obtain a better classification method, this paper builds an ensemble KNN multiclassifier model. The KNN method is based on analogical learning and is a nonparametric classification technique. It is very effective in statistical pattern recognition; it can obtain high classification accuracy for unknown and non-normal distributions and has the advantages of robustness and conceptual clarity.
The basic idea is as follows: given new data without a class label, extract its feature, compare the new feature with the feature of each sample in the training set, select the class labels of the k nearest (most similar) samples, and count the label occurrences. The class with the highest occurrence count is taken as the class of the new data. Now, suppose we use the KNN classification rule to classify a test data point x. By finding the k nearest neighbors of the test sample point x, the test sample is predicted to be the category occurring most often among those k neighbors. Among the N training samples, N_1 training samples belong to category ω_1, N_2 training samples belong to category ω_2, . . ., and N_c training samples belong to category ω_c. If k_1, k_2, . . . , k_c of the k nearest neighbors belong to categories ω_1, ω_2, . . . , ω_c, respectively, then the discriminant function can be defined as

g_i(x) = k_i, i = 1, 2, . . . , c. (18)

The decision rule is: if

g_j(x) = max_i k_i, (19)

then x ∈ ω_j. To classify specific actions, we can search the training set for the K actions nearest to the new action and determine its class from the classes of these K actions. This paper proposes an integrated classification method using multiple learners based on training sets of multimodal features, which identifies new actions more effectively. It fully utilizes the biasing effects of the different learners and therefore enhances the generalization capability of the learning. The implementation sequence of the algorithm is as follows. Step 1: describe the training sets of action features with the different modal information separately:

T_RGB-HOG = {x_1, x_2, . . . , x_N}, T_D-STIP = {y_1, y_2, . . . , y_N}, T_S-JRPF = {z_1, z_2, . . . , z_N}, (20)

where T_RGB-HOG, T_D-STIP, and T_S-JRPF are the training sample sets and the number of samples in each training set is N.
Step 2: determine the vector representations x*_RGB-HOG, y*_D-STIP, and z*_S-JRPF of the action to be predicted under the three modal descriptions. Step 3: select the Top-k_1, Top-k_2, and Top-k_3 actions nearest to the action to be predicted from the three training sets, using a different distance metric for each modality:

D_1(x*_RGB-HOG, x_j) = sqrt(Σ_i (x*_i − x_{j,i})^2), D_2(y*_D-STIP, y_j) = Σ_i |y*_i − y_{j,i}|, D_3(z*_S-JRPF, z_j) = sqrt((z*_S-JRPF − z_j)^T V^{-1} (z*_S-JRPF − z_j)), (21)

where D_1(x*_RGB-HOG, x_j) is the Euclidean distance, D_2(y*_D-STIP, y_j) is the Manhattan distance, D_3(z*_S-JRPF, z_j) is the Mahalanobis distance, and V^{-1} is the inverse covariance matrix.
Step 4: compute the weight of each class among the Top-k_1 + Top-k_2 + Top-k_3 actions nearest to the action to be predicted:

W(C_j) = Σ_i ω_i · I(label_i = C_j), with ω_i = 1/(D_i + ε), (22)

where x*_RGB-HOG, y*_D-STIP, and z*_S-JRPF are the feature vectors of the action described in the various modalities, I(·) is the class indicator function (it takes 1 if the ith neighbor belongs to class C_j and 0 otherwise), ω_i is the weight coefficient of the ith nearest neighbor, D_i^{-1} is the reciprocal of its distance, and ε is a small positive number that prevents division by zero.
Step 5: compare the class weights and assign the action to be predicted to the class with the largest weight.
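Putting Steps 3-5 together, the following is a hedged sketch of the multilearner ensemble prediction, assuming the three training matrices and their shared label list are available; SciPy's cdist supplies the three metrics, and the neighbor counts ks and smoothing constant eps are illustrative defaults:

```python
import numpy as np
from scipy.spatial.distance import cdist

def ensemble_knn_predict(x_hog, y_stip, z_jrpf, train, labels,
                         ks=(5, 5, 5), eps=1e-6):
    """Sketch of Steps 3-5: per-modality nearest neighbors under
    modality-specific metrics, then distance-weighted class voting.
    `train` holds the three training matrices (rows = samples)."""
    T_hog, T_stip, T_jrpf = train
    # Inverse covariance for the Mahalanobis metric; assumes it is
    # nonsingular (a pseudo-inverse may be safer in practice)
    VI = np.linalg.inv(np.cov(T_jrpf, rowvar=False))

    queries = [(x_hog, T_hog, 'euclidean', {}),
               (y_stip, T_stip, 'cityblock', {}),            # Manhattan
               (z_jrpf, T_jrpf, 'mahalanobis', {'VI': VI})]

    weights = {}
    for (q, T, metric, kw), k in zip(queries, ks):
        d = cdist(q[None, :], T, metric=metric, **kw)[0]
        for i in np.argsort(d)[:k]:                          # Top-k neighbors
            # Step 4: each neighbor votes for its class, weight 1/(d + eps)
            weights[labels[i]] = weights.get(labels[i], 0.0) + 1.0 / (d[i] + eps)
    # Step 5: assign the class with the largest accumulated weight
    return max(weights, key=weights.get)
```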

Experiments and Results
This section provides the experimental results and analysis of our algorithm as applied to the G3D dataset and the Cornell Activity Dataset 60 (CAD60).

Experiments and Results.
In this section, we validate the feasibility and efficiency of our method in two experiments. In the first, we test the recognition rate and the precision, recall, and F-measure on the G3D and CAD60 datasets using each single feature and the proposed algorithm. In the second experiment, we compare our method with other algorithms.
We present the results of our method in Experiment 1 with confusion matrices. The (i, j) element of the matrix is the percentage of actions of class i that are classified as actions of class j; therefore, the larger the diagonal elements, the better the classification result.
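For reference, a row-normalized confusion matrix of this kind can be computed as in the following sketch (class labels assumed to be integer-coded):

```python
import numpy as np

def confusion_matrix_pct(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry (i, j) is the percentage
    of class-i actions that were classified as class j."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    # Guard against empty rows before converting counts to percentages
    return 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
```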
Figures 7-9 illustrate the recognition rates using each single modal feature on the G3D dataset with confusion matrices, and Figure 10 shows the recognition rate of our method using multimodal information. From these figures, it can be observed that the recognition rates of all 20 action categories based on multimodal features are higher than those using any single modal feature. For the six actions of defend, throw bowling ball, aim and fire gun, wave, flap, and clap, the accuracy is 100%. Figures 11-13 illustrate the recognition rates using each single modal feature on the CAD60 dataset with confusion matrices, and Figure 14 shows the recognition rate of our method using multimodal features on the CAD60 dataset.
Through comparison, it can be found that our method achieves a good recognition rate of 94% on the CAD60 dataset, with 100% accuracy for the actions of drinking water, stirring, relaxing on couch, and writing on whiteboard. The experimental results show that the ensemble KNN model based on multimodal data is better than a single KNN model based on single modal data; a single KNN model can hardly meet the needs of human action recognition.
In addition, Table 1 reports the recognition rates of the single modal features and the multimodal features in terms of precision, recall, and F-measure. The recognition rates of our method using multimodal features are higher than those of the methods using any single modal feature.
In the second experiment, we compare our method with other classical machine learning methods. Table 2 compares our algorithm with boosting, bagging, support vector machines (SVMs), and artificial neural networks (ANNs). From the results in Table 2, it can be observed that the integrated multilearner recognition algorithm based on multimodal features achieves the highest recognition rate of 94%. The combined nearest neighbor classifier based on multimodal features has better classification accuracy mainly because the proposed algorithm fuses multimodal features and can make full use of the complementarity between the different modalities. In general, the accuracy of the combined nearest neighbor classifier based on multimodal data is higher than that of the original single nearest neighbor classifier. Table 3 compares the average class accuracy of our method with the results reported by other researchers. Compared with existing traditional machine learning approaches, our method shows much better performance, outperforming the state-of-the-art approaches. Note that a precise comparison between the approaches is difficult, since experimental setups, e.g., training strategies, differ slightly between approaches. In addition, compared with a random-dropout-based CNN method using only RGB data, our method also achieves better results. The dropout method sets the weights of some hidden layer nodes of a neural network to 0 during training, which mitigates the overfitting caused by too few training samples. On the basis of dropout, a further layer of randomization is added to realize random dropout and further prevent overfitting of the model. Even so, when the training sample data are small, the multilearner recognition method retains its advantage.

Table 3 (excerpt). Comparison with other methods (features, classifier, accuracy in %):
Dollar's method [38] — sparse spatiotemporal features — SVM — 78, 83
Liu's method [42] — PMI spatiotemporal features — SVM — 82, 86
Laptev's method [43] — spatiotemporal corner — SVM — 87, 84
Rapantzikos's method [44] — dense saliency spatiotemporal features — KNN — 88, 89
Rodriguez's method [45] — spatiotemporal [entry truncated in source]

Conclusion
A human action recognition method based on multimodal features is proposed in this paper. Through the Kinect sensor, three kinds of modal information are acquired, and the RGB-HOG, D-STIP, and S-JRPF features are extracted. An integrated learning strategy with multiple learners is adopted, which fully utilizes the biasing effects of the different learners. The method achieves good recognition rates on standard public datasets and is robust in real time. Although the method achieved good experimental results on public datasets, many issues remain in action recognition and call for deeper investigation. Generally, a large number of labeled video training samples is necessary for a classifier to generalize well; this requires much manual labeling work, which makes practical modeling difficult. It is therefore a valuable direction to investigate how to enhance the learning system's performance using the abundant unlabeled video samples available in public data.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

The abstract of the manuscript was previously presented in the proceedings of the Global Intelligence Industry Conference (GIIC 2018).

Conflicts of Interest
The authors declare that they have no conflicts of interest.