Evaluation of Motion Standard Based on Kinect Human Bone and Joint Data Acquisition

To improve the acquisition of human bone and joint data, we propose a method to collect the data and judge whether a motion is standard. Kinect is a 3D somatosensory camera released by Microsoft. It carries three optical components: in the middle is a color camera that captures color images at 30 frames per second; on the left is an infrared projector that irradiates objects to form a speckle pattern; on the right is the infrared depth camera that receives the reflected speckle and computes depth, from which the relative position of people can be detected. Along the bottom of Kinect is a four-element linear microphone array for speech recognition and filtering background noise, which can also locate the sound source, and below is a base with a built-in motor that adjusts the elevation angle. The device can therefore both capture color images and measure the depth information of objects. The experimental results on the MSRAction3D data set, obtained with the same cross-validation protocol as other recent methods and reported in the figures, show that the highest recognition rate of our method (algorithm 10) ranks second, while its lowest and average recognition rates are the highest among the compared methods; the improvement in the lowest recognition rate is particularly clear. This indicates that the method offers good recognition performance and better stability than other research methods, and that Kinect plays an important role in human bone and joint data acquisition.


Introduction
Body motion recognition has always been a hot issue. However, many basic problems in the field of computer vision remain imperfect and unsolved. We live in a real three-dimensional world, while the images obtained by ordinary cameras are two-dimensional, which leads to missing information and inaccurate recognition. Therefore, we use Kinect human bone data to recognize human actions, which reduces the loss of information during data collection. According to the knowledge of three-view projection introduced earlier, if human actions can be expressed by two-dimensional plane projections of the human body in three different directions, human motion in the three-dimensional world can be represented more faithfully [1]. This chapter uses the two-dimensional plane projection features of human bone data. The human action recognition method projects the three-dimensional joint points into two-dimensional space, constructs a feature vector composed of the joint angles in the three views, represents the combination of translation and rotation through the changes of 17 joint angles of the human body, and uses a multiclass support vector machine to classify the 20 human actions in the MSRAction3D data set [2]. The Kinect somatosensory device developed by Microsoft can directly capture the body movements of patients without requiring them to wear or operate any peripheral devices. It provides a more natural and convenient mode of human-computer interaction and is well suited to community and family medical rehabilitation platforms [3]. Therefore, this paper selects the Kinect somatosensory device as the human motion sensing carrier and designs a somatosensory rehabilitation training platform based on the Kinect sensor.
As shown in Figure 1, the platform integrates basic motion acquisition and rehabilitation evaluation, presets typical rehabilitation training movements for the shoulder and elbow joints of the left and right limbs, and collects template motion stream data and training motion stream data through the somatosensory sensor. After the data are processed by the algorithm logic, the similarity between the two groups of data streams is calculated and examined using the dynamic time warping algorithm and the Hausdorff distance measurement algorithm. In addition, this paper examines the kinematic measurement process of rehabilitation treatment to measure its effectiveness, performs interventions, and clarifies the algorithmic and theoretical basis [4].

Related Works
Since the 1980s, Internet information technology has developed greatly, and a series of human-computer interaction technologies, such as somatosensory interaction and intelligent gesture recognition, have emerged one after another. Human-computer interaction means that a human and a computer user interface interact in some way to produce information input and output. In real life, people can use gestures and language to provide input to the PC, and the computer can produce output through pictures, videos, and other media, thereby realizing information interaction between people and computers. Human-computer somatosensory interaction is based on the acquisition of depth images, which mainly follows three approaches: structured light, time of flight, and multicamera [5]. The Kinect device selected in this design mainly applies structured light detection technology. PrimeSense names this depth measurement technology light coding. Light coding is a kind of structured light technology, but its depth calculation method differs from that of other structured light techniques. Light coding is a depth detection algorithm independently developed by Cristache, C. M.; it is a structured light technology using a special depth calculation method. Compared with the conventional structured light algorithm, the light source of the light coding algorithm is the diffraction speckle randomly generated by a laser passing through ground glass, which is highly random and varies with the pattern and distance [6]. Of course, the prior light source calibration process cannot be omitted. Compared with the traditional structured light algorithm, the light coding algorithm is a kind of three-dimensional space coding, controlled by the PS1080 chip of Kulczyk, T. when calibrating the light source [7].
Moreover, the measurement accuracy is affected only by the density of the calibrated reference planes and has nothing to do with the spatial geometric position of the reference object. The first generation of Kinect somatosensory devices adopts light coding technology to collect depth information of three-dimensional space and compares it with the previously saved speckle reference images to obtain the distance between a target object in the Kinect field of view and the Kinect camera. However, according to the paper by Ma et al. on Kinect depth data accuracy analysis, as the distance between the object and the sensor increases, the random error of depth measurement also increases, growing from a few millimeters at close range to about 4 cm when the maximum range of the sensor is reached [8, 9]. To improve this accuracy, Naufal, A., Anam, C., Widodo, C. E., and Dougherty, G. put forward a theoretical error analysis that clarifies which factors affect the accuracy of the data. Vignesh et al. studied and developed a new somatosensory rehabilitation system that aims to improve patients' enthusiasm for rehabilitation training and the efficiency of that training [10]. In this system, the motion data of patients can be recorded in real time, and the obtained motion data can be compared with the corresponding standard motion data in the database, so as to judge the patients' recovery. In the Kinerehab system, patients are represented on the interface by a virtual animated avatar, which creates a good and natural human-computer interaction atmosphere, makes rehabilitation training more interesting, and stimulates patients to train independently. Fajri et al. studied and developed a rehabilitation training system based on human tracking technology, mainly aimed at recovery of the shoulder and elbow [11].
The system obtains the position information of the patient's key joint points through the sensor, corrects the effective position information of the key joint points according to the standard action position information, and then imports the corrected standard action data into the system model. The system can effectively reduce patients' wrong actions in rehabilitation training and avoid the damage such actions cause to the human body. At the same time, the system has a scoring mechanism to evaluate the patients' recovery. Kim designed and studied a rehabilitation training system aimed at improving patients' balance ability; in this system, sensors are used to obtain depth image information and reconstruct a 3D model of the human body, and human-computer interaction takes place through simple games in virtual reality scenes, so as to achieve the purpose of sports rehabilitation [12]. As a new rehabilitation training platform, the robot is an effective clinical intervention to assist doctors in the reconstruction of patients' motor function [13]. In the early stage, motion rehabilitation robots provided resistance and active force through spring supports with free inertia balance. Wang et al. used the human bone detection ability of the Kinect somatosensory device to realize imitation control of a 16-joint humanoid robot following human actions [14]. Their experimental results show that the time lag of the developed control system is only 200 ms. In addition, Afrieda, N. has developed humanoid robots that can recognize human joints over the whole body; they also use the Kinect somatosensory device and solve the robot's joint angles with an analytical geometry method, which solves the robot kinematics more quickly [15].

Method
With the support of the Kinect for Windows SDK, the skeletons of one or two people moving into Kinect's field of view can be tracked: the SDK receives the human bone data and returns the positions of all tracked joints [16]. The human skeleton model in this article is that of the first-generation Kinect. The number of each joint, which supports the description of the human skeleton, is shown in Figure 2.
Following the order of Kinect bone data acquisition, the 20 human bone joint points are numbered from a to t, as shown in Figure 2. Each number corresponds to the position coordinates of one human bone joint point and represents a different part of the human body; Table 1 gives the joint-point name corresponding to each number. Together, the 20 joint points represent the complete structure of the human skeleton. Through the analysis of human motion, using the coordinate data of the joint points, every two adjacent joint points form a joint vector that contains the motion information of the joint [17]. It is also easy to construct the two-dimensional joint vectors of the human body. Assuming that the two-dimensional coordinates of two adjacent joint points are U(x₁, y₁) and V(x₂, y₂), respectively, the joint vector they compose in two-dimensional space is

UV = (x₂ − x₁, y₂ − y₁).  (1)

According to formula (1), let point U(x₁, y₁) and point V(x₂, y₂) represent the joint points of the left ankle and left foot, respectively; the resulting vector then represents the activity state of the end of the left foot in two-dimensional space and covers the details of the movement of the left leg. In total, 19 two-dimensional joint vectors of the human body are composed from the 20 human bone joint points.
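As an illustration, formula (1) can be sketched in a few lines of Python (NumPy; the coordinate values below are invented for the example):

```python
import numpy as np

def joint_vector_2d(u, v):
    """Joint vector UV = (x2 - x1, y2 - y1) between two adjacent joints
    in the 2D projection plane, as in formula (1)."""
    (x1, y1), (x2, y2) = u, v
    return np.array([x2 - x1, y2 - y1])

# Example: left ankle U and left foot V (coordinates are illustrative)
left_ankle = (0.32, 0.10)
left_foot = (0.40, 0.02)
uv = joint_vector_2d(left_ankle, left_foot)
```

Applying this to each of the 19 adjacent joint pairs yields the 19 two-dimensional joint vectors described above.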
If human actions in three-dimensional space are observed from a single angle, the motion cannot be fully distinguished, because self-occlusion hides some characteristics of the movement. Therefore, we borrow the three-view projection method: objects in three-dimensional space are reflected onto three two-dimensional planes, so that the motion features observed at different plane angles complement one another and compensate for occlusion defects, instead of considering the motion features from only one view [18]. Because changes in human joint angles capture both translation and rotation, angle data can be used to characterize various movements. We therefore consider the human joint angles in three projection planes and construct the joint angles in three views: the front-view joint angle is the projection of the joint angle onto the frontal plane of the human body, seen from front to back; the side-view joint angle is the projection onto the lateral plane of the human body, seen from left to right; and the top-view joint angle is the projection onto the horizontal plane of the human body, seen from top to bottom [19]. The selected human bone joint angles are shown in Figure 3, which gives the 17 human joint angles we use. From the 19 two-dimensional joint vectors constructed above, the size of each bone joint angle can be calculated with the cosine similarity formula

cos θ = (U(t) · V(t)) / (|U(t)| |V(t)|),

where θ is the joint angle at frame t of the bone data and U(t) and V(t) are the two joint vectors at frame t. Since the collected Kinect bone data are three-dimensional coordinates (x, y, z), the three-dimensional bone data are first reduced to bone data on the two-dimensional XOY projection plane, and the two-dimensional coordinates are then used to calculate the size of the joint angle.
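A minimal sketch of the projection-then-cosine computation (NumPy; the plane naming and the sample vectors are our own assumptions for illustration):

```python
import numpy as np

def joint_angle(u3, v3, plane="xoy"):
    """Project two 3D joint vectors onto a coordinate plane and return
    the angle between them (degrees) via cosine similarity."""
    axes = {"xoy": (0, 1), "yoz": (1, 2), "xoz": (0, 2)}[plane]
    u = np.asarray(u3, dtype=float)[list(axes)]
    v = np.asarray(v3, dtype=float)[list(axes)]
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # clip guards against tiny floating-point overshoot beyond [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Two 3D joint vectors whose XOY projections are perpendicular
angle = joint_angle([1.0, 0.0, 0.3], [0.0, 1.0, -0.2], plane="xoy")
```

The same function with `plane="yoz"` or `plane="xoz"` gives the side-view and top-view joint angles.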

Experimental Results and Discussion
On this basis, aiming at the low accuracy and stability of human action recognition in complex, high-noise environments, a human action recognition method based on hierarchical feature fusion is proposed, which divides the human body into different parts according to the structure of the human body. The hierarchical strategy is conducive to decomposing complex human movements. First, according to the bone joint coordinates obtained by Kinect, the features of the human joint angles in two-dimensional space are extracted, and the actions are roughly classified with a support vector machine (SVM). Then, the limb vectors, angular velocities, and accelerations in 3D space are extracted, and the movements are finely classified with an HMM.
With knowledge of the human model, the human body can be divided into five parts. Part I: torso, which includes the head, neck, spine, and hip center. Part II: left arm, which includes the left hand, left wrist, left elbow, and left shoulder. Part III: right arm, which includes the right hand, right wrist, right elbow, and right shoulder. Part IV: left leg, which includes the left foot, left ankle, left knee, and left hip. Part V: right leg, which includes the right foot, right ankle, right knee, and right hip. The torso is an essential part of the human body: the motion feature information of the waist comes from its joint points, while the motion feature information of the hands and feet comes from the joint points of the limbs [20].
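The five-part division can be sketched as a simple grouping table (the joint labels below are illustrative stand-ins for the Figure 2 names, not the SDK's exact identifiers):

```python
# Hypothetical joint labels; the grouping mirrors the five parts above.
BODY_PARTS = {
    "torso":     ["head", "neck", "spine", "hip_center"],
    "left_arm":  ["left_hand", "left_wrist", "left_elbow", "left_shoulder"],
    "right_arm": ["right_hand", "right_wrist", "right_elbow", "right_shoulder"],
    "left_leg":  ["left_foot", "left_ankle", "left_knee", "left_hip"],
    "right_leg": ["right_foot", "right_ankle", "right_knee", "right_hip"],
}

def split_skeleton(frame):
    """Split one frame {joint_name: (x, y, z)} into the five part groups."""
    return {part: {j: frame[j] for j in joints if j in frame}
            for part, joints in BODY_PARTS.items()}
```

Grouping a frame this way makes the coarse classification by part combination straightforward.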
By dividing the human body in this way, combinations of these five parts can represent most human actions, so this paper adopts a hierarchical strategy. In the first layer, actions are first grouped into categories by the combination of parts involved; for example, two-handed actions involve the combination of Parts II and III. This is the coarse classification. The second layer then subdivides the actions within the same combination of parts and identifies the specific action; this is the fine classification. For the first, coarse decision on human actions, we take as features the vector of the 17 joint angles projected onto the three planes introduced above. When distinguishing human actions with the same combination of parts, we extract features from kinematic theory. A complete human action can be divided into a main action and auxiliary actions: the main action reflects the global state of the motion pattern, and the auxiliary actions reflect its local state; only by combining the features of both can the action be expressed accurately [21]. For the five parts of the human body divided above, we construct their limb vectors in three-dimensional space, where the superscript {3} denotes three-dimensional space, t denotes a time instant, and the joint points at the ends of the hands and feet, which are prone to drift, are temporarily discarded. Finally, the limb vectors of the five parts at all times in three-dimensional space are denoted GT{3}, AJ{3}, BK{3}, EP{3}, and FQ{3}, respectively. According to their different contributions to the expression of human action, two joint angles are selected from each part and called the main-action joint angles.
The torso selects angles θ₄ and θ₉, the left arm θ₂ and θ₃, the right arm θ₆ and θ₇, the left leg θ₁₂ and θ₁₃, and the right leg θ₁₅ and θ₁₆. The human action sequence is continuous and changes with time, and the change of a joint angle between consecutive frames forms the angular velocity

ω(t) = (θ(t) − θ(t − 1)) / Δt.

The limb vectors and the angular velocities of the main-action joint angles are the features of the main action, representing the overall movement of the limbs and trunk [22]. The bending of the limbs and trunk is reflected by the change of the distance between joint points. The human body is projected onto the lateral YOZ plane from the left-view direction, and the distance between the head and end joint points of each of the five parts at time t is

d(y, z) = √((y₁ − y₂)² + (z₁ − z₂)²),

where d(y, z) is the Euclidean distance between the two joint points in the lateral plane, reflecting the bending of the limbs and trunk in motion. The change of this distance between consecutive frames forms the speed v, and the acceleration, a physical quantity describing how quickly the motion changes, is

a(t) = (v(t) − v(t − 1)) / Δt.

The distances of the five parts in the lateral plane and the accelerations of their motion are taken as the features of the auxiliary action; the accelerations of the five parts are denoted a₁, a₂, a₃, a₄, and a₅. The features of the main action and the auxiliary action together constitute the second-level, fine-classification features of human action. Because everyone's height and arm length differ, even two people performing the same posture will produce some error.
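The angular velocity and acceleration features above amount to first differences over the frame sequence; a minimal sketch (assuming Kinect's nominal 30 frames per second; the sample angles are invented):

```python
import numpy as np

def finite_differences(theta, dt=1 / 30):
    """Angular velocity and acceleration of a joint-angle sequence by
    first differences over frames sampled every dt seconds."""
    theta = np.asarray(theta, dtype=float)
    omega = np.diff(theta) / dt   # angular velocity between frames
    alpha = np.diff(omega) / dt   # rate of change of the angular velocity
    return omega, alpha

omega, alpha = finite_differences([10.0, 12.0, 15.0, 19.0])
```

The same differencing applied to the inter-joint distances d(y, z) yields the speed v and acceleration a of the auxiliary-action features.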
To eliminate individual differences, the items in the formula are divided by the shoulder width d_AB and by the mean value d̄ of the Euclidean distances between joints in the YOZ plane, where d_AB is the width of each person's shoulders and d̄ is the average distance between the five major joint pairs of the human body over all times in the lateral YOZ plane. The feature vector of the final coarse classification is expressed as [θ₁{2}, θ₂{2}, θ₃{2}, ⋯, θ₁₇{2}], and the matrix composed of the fine-classification feature vectors is given in formula (17), in which each row represents a set of feature vectors.
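A minimal sketch of the shoulder-width normalization (the distances and shoulder width below are invented; the paper additionally divides by the mean distance d̄, which we omit here for brevity):

```python
import numpy as np

def normalize_distances(part_distances, shoulder_width):
    """Scale the five part distances by the subject's shoulder width d_AB
    so that subjects of different builds become comparable."""
    return np.asarray(part_distances, dtype=float) / shoulder_width

# Illustrative distances for the five parts, in meters
normed = normalize_distances([0.9, 0.6, 0.6, 0.8, 0.8], shoulder_width=0.4)
```

After this scaling, the same posture performed by a tall and a short subject maps to similar feature values.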
In this system, we compare the similarity between the standard action data sequence template and the action data sequence collected in real time, so as to achieve the effect of rehabilitation evaluation. In reality, the time a patient takes to complete a set of rehabilitation exercises is usually inconsistent with the time taken by the standard action template; some severely injured patients take several times as long. For two motion data sequences of different lengths, comparing corresponding time points alone cannot meet the accuracy requirements of the system. DTW, the dynamic time warping algorithm, is a widely used speech recognition algorithm originally designed to overcome errors caused by different speaking rates. Compared with other algorithms, dynamic time warping expresses the relationship between two sequences of inconsistent length through a time warping function under certain constraints; it is now widely used in gesture recognition, speech recognition, and other fields. The central idea of DTW is to stretch or shorten two data sequences of different lengths so that they can be aligned, and to select a path through the constructed distance matrix that minimizes the sum of the distances between the aligned elements. In human action sequence recognition, this amounts to finding the minimum distortion distance between the current sequence and the standard template.
Assume the standard template sequence and the sequence to be tested are R and T, with lengths m frames and n frames, respectively [23]. In general, for two sequences of different lengths, an m × n grid matrix is constructed, whose element (i, j) represents the distance d(R_i, T_j) between R_i and T_j, as shown in formula (19). The DTW algorithm finds an optimal path through this grid matrix; the grid points passed by the path are the point pairs of the two sequences that are aligned and compared. Let this path be the warping path W; its k-th element is defined as W_k = (i, j)_k. According to the continuity and monotonicity constraints, from each grid point the path can move in only three directions: from (i, j), the next grid point can only be (i + 1, j + 1), (i, j + 1), or (i + 1, j). Many warping paths satisfy these constraints, and we seek the one of minimum distortion distance. The minimum-distortion path must satisfy three conditions: (1) the path runs from the beginning of the sequences, W₁ = (1, 1), to their end, W_K = (m, n); (2) the path preserves the order of time, so i and j never decrease along it; and (3) the path is monotonic, with i and j each increasing by 0 or 1 at every step; that is, if W_{k−1} = (i, j), then W_k is one of (i + 1, j + 1), (i + 1, j), and (i, j + 1). The path with the smallest sum of cumulative distances of adjacent elements is the optimal warping path. From the above, the DTW cumulative distance can be deduced as

D(i, j) = d(R_i, T_j) + min{D(i − 1, j), D(i − 1, j − 1), D(i, j − 1)}.

A Kionix KXSD9 three-axis accelerometer is included in the Kinect internal structure to compensate for errors caused by Kinect being placed on an uneven surface and to improve the stability of Kinect's depth image acquisition.
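The cumulative-distance recursion can be implemented directly; a minimal sketch (absolute difference as the local distance d, one-dimensional sequences for clarity):

```python
import numpy as np

def dtw_distance(r, t):
    """Classic DTW: D(i, j) = d(r_i, t_j) +
    min(D(i-1, j), D(i-1, j-1), D(i, j-1))."""
    m, n = len(r), len(t)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(r[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[m, n]

# A slowed-down copy of the same motion stays close under DTW
dist = dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 2, 3])
```

In the real system the local distance would be taken between joint feature vectors per frame rather than between scalars.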
The camera in Kinect can be adjusted according to the user's needs: the tilt drive motor in Kinect can adjust the elevation of the camera to match changes in the user's position. Kinect also has a focusing system; if the user moves out of the field of vision, Kinect can automatically drive the base motor to tilt vertically by ±27° [24]. The field of view of the Kinect camera is 43° vertically and 57° horizontally.
The maximum distance range that the Kinect sensor can track and recognize is 0.8 m to 4 m, but in practice, to ensure accurate data, a recognition distance of 1.2 m to 3.5 m is used, as shown in Table 2. Kinect adopts a "pipeline" functional architecture, as shown in Figure 4. The original sensor data streams include the depth data stream, the color image data stream, and the audio data stream, and researchers can develop applications directly on the raw data stream information obtained through the Kinect SDK.
In acquiring depth images, Kinect uses the depth measurement technology known as light coding, which marks and codes the space to be measured with a light source. This technology belongs to structured light, but its depth calculation method differs from that of other structured light techniques. The light source of light coding is called "laser speckle": the diffraction spot image formed after the light source irradiates a non-smooth object or passes through ground glass. These speckle patterns change constantly with the distance from the light source and are highly random; generally speaking, the speckle patterns at any two places in the spatial environment are different. All speckle images in the whole space are first saved and recorded, and the structured light is then projected into the space for marking. When a new object is placed into the space, its specific position can be obtained just by observing the speckle pattern it produces [25].
Today, with the rapid development of science and technology, somatosensory technology has gradually matured. Besides its main use in entertainment and games, many researchers have developed various human-computer interaction systems with it, and many companies at home and abroad are carrying out research and development in the field of motion sensing, designing and manufacturing their own devices. The three most popular products on the market are the following.
Xtion, a motion sensing device designed and produced by ASUS, mainly uses structured light detection technology to obtain depth images. The structured light is used to obtain spatial data and compute other data from it, such as the depth image and the object skeleton. During operation, an infrared laser emitter emits an encoded near-infrared light source; after the infrared light is reflected, it is captured by an infrared camera, and the corresponding feature-code correspondence is computed to determine the depth information.
Leap Motion, a motion sensing device developed by Leap (USA), tracks a person's hands by detecting their movements. Leap Motion uses multicamera detection technology: it has two infrared cameras and uses powerful data processing chips to quickly process the image data and detect the hand movements of the target. The Leap Motion sensor's field of view has the spatial shape of an inverted four-sided pyramid, with an effective detection range of 25 mm to 600 mm. The product also has an open SDK (Software Development Kit), which lets developers work on Windows, Linux, and Mac, the three mainstream operating system platforms.
Kinect for Xbox 360 is Microsoft's external motion sensing device, officially unveiled on June 2, 2009. Like Xtion, it mainly uses structured light detection technology to obtain depth images. For open source development, Microsoft has also designed SDK tools containing rich API interfaces, so developers can combine various languages to program for Kinect.
Comparing the above three motion sensing devices shows that Kinect for Xbox 360 is more convenient and more precise for tracking people. Moreover, Microsoft provides a large number of API interfaces in the Kinect SDK, so that Kinect can obtain not only the raw depth data but also the bone node data of the target object. Therefore, Kinect for Xbox 360 is selected as the somatosensory device in this design.
To obtain the coordinate information of the key human joint points in real time for the subsequent similarity algorithm, the Kinect sensor is used to track the human joint points and acquire the motion data sequence in real time. The human skeleton is composed of bone joint points, each with a relative position and direction. The Kinect bone tracking module first detects the human body through the depth image information and then calibrates the human body posture. When calibration succeeds, the measured human body can be tracked in real time and the relevant bone information obtained continuously; when calibration fails, the process loops back and calibrates again.
In the examples provided with Kinect, programs for the two technologies of depth image data and bone tracking are provided separately, but there is no program for data extraction. After studying the relevant application software, the two technologies are combined to successfully record and extract the data of the key bone joint points of the human body. The basic flow is as follows: first, initialize the device environment and create new objects and user generators to store the relevant data and facilitate subsequent calls; second, register the relevant callback functions and calibrate the skeleton, where the functions to be called include the generation of new users and the detection of bone posture; finally, perform bone tracking, updating and reading the relevant bone information in real time. The flow chart for obtaining human bone information is shown in Figure 5.
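The calibrate-then-track loop described above can be sketched as a small control flow (the function names are hypothetical stand-ins for the real SDK calls, not an actual API):

```python
# Hypothetical sketch of the calibrate-then-track loop; `calibrate` and
# `read_joints` stand in for the SDK's calibration and skeleton-read calls.
def track_skeleton(calibrate, read_joints, max_frames=100):
    """Retry calibration until it succeeds, then stream joint frames."""
    frames = []
    while not calibrate():          # on failure, loop back and recalibrate
        pass
    for _ in range(max_frames):     # calibration succeeded: track in real time
        frames.append(read_joints())
    return frames
```

In the real system, `read_joints` would return the 20 joint coordinates per frame, which then feed the feature extraction and DTW comparison described earlier.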
The hardware for the simulation experiments in this chapter is an Intel(R) Core(TM) i5-4210M CPU @ 2.60 GHz with 8 GB RAM; the software is Windows 10 (64-bit) and MATLAB R2017b. The public MSRAction3D data set is used. The data set contains 567 samples; each action category is performed by 10 different male and female subjects, repeated 2 to 3 times each. In the previous chapter, we directly classified the 20 different human actions in one pass. This chapter first roughly classifies the 20 actions into seven categories and then subdivides them: category 1: single-arm actions; category 2: two-arm actions; category 3: single-leg actions; category 4: trunk actions; category 5: actions of both arms and legs; category 6: trunk plus arms; and category 7: trunk plus legs plus one arm. Because the study set does not record the heights and weights of the subjects, only the proportion of male and female subjects is included. We use the MSRAction3D data set and compare our method with other recent research methods under the same cross-validation protocol in Figures 6, 7, and 8. The highest recognition rate of this method (algorithm 10) ranks second, while the lowest and average recognition rates are the highest. The improvement in the lowest recognition rate is clear, which shows that this method has good recognition performance and better stability than the other methods.

Conclusion
Based on Kinect human skeleton and joint data collection, the expression of human motion is completed through the limbs, and all human bones together constitute the limbs. Rehabilitation exercise training mainly helps people with movement disorders gradually recover their limb motor function through existing medical technologies and means. Traditional training methods are not suitable for patients to carry out rehabilitation training in a family or community environment, whether in terms of cost or complexity of operation. At the same time, the complex and cumbersome process of motion data acquisition often hampers the rehabilitation doctors' judgment of patients' limb recovery. Therefore, this paper studies and designs a Kinect-based auxiliary evaluation and training system for movement disorders. The system combines the Kinect somatosensory sensor with virtual reality technology; by capturing and collecting human motion data in real time and applying the relevant algorithms, patients can carry out rehabilitation training and complete the evaluation of the training actions independently, thereby improving their rehabilitation efficiency.

Data Availability
The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
The author declares that there is no conflict of interest.