This paper presents a method to recognize continuous full-body human motion online by using sparse, low-cost sensors. The only input signals needed are linear accelerations without any rotation information, which are provided by four Wiimote sensors attached to the four human limbs. Based on the fused hidden Markov model (FHMM) and an autoregressive process, a predictive fusion model (PFM) is put forward, which considers the different influences of the upper and lower limbs, establishes an HMM for each part, and fuses them using a probabilistic fusion model. An autoregressive process is then introduced into the HMM to predict the gesture, which enables the model to deal with incomplete signal data. To reduce the number of alternatives in the online recognition process, a graph model is built that rejects some motion types based on the graph structure and previous recognition results. Finally, an online signal segmentation method based on semantic information and PFM is presented to complete the recognition task efficiently. The results indicate that the method is robust, achieves a high recognition rate for sparse and deficient signals, and can be used in various interactive applications.
In recent years, sensor-based human motion recognition has received a great deal of attention from researchers. Sensors have been adapted for large-scale movements to avoid shading and lighting problems. This has advantages over vision-based methods for special scenes and has allowed full-body motion recognition and sensor-based motion control to be applied in various fields, such as medical rehabilitation and interactive games.
Currently, motion control tasks are based on accurate and complete accelerations, as well as signals provided by other sensors. Unfortunately, these devices are expensive and not easily portable. In practice, sparse and low-cost sensors are more attractive, but they are usually accompanied by less information, more noise, and frequent signal deletion, making it difficult to acquire or reconstruct accurate position information and accordingly harder to achieve a proper online recognition result. Therefore, reconstructing human motion from signal features based on sparse and deficient signals has recently evoked much interest.
In light of the above problems, an online motion recognition method that adopts sparse, low-cost Wii Remote sensors (Wiimotes) as input devices is proposed. Because accurate position information of human motion cannot be acquired from sparse, deficient linear accelerations, a predictive fusion model, which combines a fused hidden Markov model (HMM) with an autoregressive process, is presented. Considering the independence of each part of the human body, a hierarchical fusion structure of fused HMMs is used to deal with human motion signals, which enhances the independent and cooperative expression of the classification model. The predictive capability that the autoregressive process gives the model ensures robustness when dealing with noisy and deficient signals. Once the online recognition process is underway, a graph model that encodes the transitions between different motion types filters the candidate motion types and reduces the recognition complexity of the predictive fusion model (PFM). Moreover, a semantic-based automatic signal segmentation method is introduced to ensure the continuity of the online recognition process.
Thus, based on sparse and deficient input signals, a human motion recognition PFM is presented that effectively supports sparse, low-cost sensors. The presented model achieves a high accuracy rate and is robust enough to handle insufficient and missing signals. An online motion recognition method is also proposed that does not require any position calibration. The method integrates PFM, an action graph structure, and a semantic-based signal segmentation method to support user-driven virtual human motion in virtual scenes with continuous motions.
As pattern recognition technologies develop, pattern recognition methods are increasingly used in the context of motion recognition. Typical methods, including self-organizing maps (SOMs), support vector machines (SVMs), and HMM approaches, can be adapted for motion recognition processes.
Methods for motion recognition vary depending on the input source. It has been shown that vision-based methods and sensor-based methods constitute two of the main research areas and are based on two types of input device, depending on the application. Poppe [
Recent work [
The present research is motivated by the above studies. A probabilistic fusion model and an autoregressive process in the hierarchical model of virtual human movement are proposed; together they ensure that full-body motion information can be expressed relatively independently and handle the deficient input caused by sparse, inexpensive sensors. The recognition process is designed to be robust, accurate, and efficient.
In this paper, a recognition model, PFM, is first proposed to deal with offline single motion segments. Combined with a graph constraint and online signal segmentation, the model can then be applied to online motion recognition. The method consists of three key technologies, the structure of which can be found in Figure
The structure of our method. The three main technologies include predictive fusion model (PFM), graph constraint, and semantics-based segmentation.
A robust learning model must acquire feature information from sparse and deficient sequential input, and the HMM is highly capable of dealing with time series. Here a predictive fusion model is presented based on the structure of the HMM, which considers not only the sparse and deficient signal but also the features of human motion.
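As a minimal sketch of why an HMM suits this kind of sequential input, the following scores a discrete observation sequence under two toy HMMs with the scaled forward algorithm and keeps the more likely model. The parameters, the discretized observations, and the name `forward_log_likelihood` are illustrative assumptions, not the paper's trained models; the fusion and autoregressive terms of the PFM are not shown.

```python
import numpy as np

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM.

    start: (S,) initial state probabilities
    trans: (S, S) state transition matrix, rows sum to 1
    emit:  (S, K) emission matrix, rows sum to 1
    """
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale  # rescale to avoid underflow on long sequences
    for symbol in obs[1:]:
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale
    return log_lik

# Two hypothetical motion models with 2 hidden states and 2 symbols.
start = np.array([0.5, 0.5])
sticky = np.array([[0.9, 0.1], [0.1, 0.9]])
uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
model_a = (start, sticky, sticky)    # structured model
model_b = (start, uniform, uniform)  # uninformative model

obs = [0, 0, 0, 1, 1, 1]
scores = {
    "a": forward_log_likelihood(obs, *model_a),
    "b": forward_log_likelihood(obs, *model_b),
}
best = max(scores, key=scores.get)
```

Recognition in this style evaluates one model per motion type and selects the highest log-likelihood, which is why the cost grows with the number of alternative types.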
Consider two HMMs with observations
Supported by the maximum mutual information criterion in [
The structure of the model. A fusion relationship has been built between the hidden states of HMM1 and observations of HMM2.
The observation
Vary the basic parameters
The methods described above define the model parameters
Then, how to use the model in the process of recognition will be shown. In training process, the input signal sequences
The accumulative results of similarity probability for HMM and PFM in a small scale database. The values on the
Panel titles: standard model; fusion model.
The model detailed above can properly identify the motion type from dozens of alternative ones. However, when the number of alternative motion types grows, it not only affects the accuracy rate of recognition but also increases the computation time due to the probability calculations required for each model. Therefore, a structured method was used to reduce the scale of alternative motion types in dealing with a large database.
When a user performs continuous and varied actions, certain action types cannot follow the current action type once it has been determined, owing to the coordination of human motion. This constraint can be used to guide selection of the next motion type based on the currently determined type.
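The constraint described above can be sketched as a lookup in an unweighted transition graph that restricts which models are scored next. The motion names, the toy scores, and the functions `candidate_types` and `recognize` are hypothetical and only illustrate the idea; they are not taken from the paper's database.

```python
def candidate_types(transitions, current_type, all_types):
    """Motion types allowed to follow current_type in an unweighted graph."""
    allowed = transitions.get(current_type)
    # Assumption: fall back to every type when the current type is unknown.
    return list(allowed) if allowed else list(all_types)

def recognize(score_fn, transitions, current_type, all_types):
    """Score only the allowed candidates; return (best type, number scored)."""
    candidates = candidate_types(transitions, current_type, all_types)
    best = max(candidates, key=score_fn)
    return best, len(candidates)

# Hypothetical per-model log-likelihoods for the next signal segment.
toy_scores = {"walk": -12.0, "run": -15.0, "kick": -20.0,
              "sit": -30.0, "stand_up": -9.5}
# Hypothetical transition graph: an edge means "can follow".
transitions = {"walk": ["walk", "run", "kick"], "sit": ["sit", "stand_up"]}

best, n_scored = recognize(toy_scores.get, transitions, "walk", toy_scores)
unconstrained = max(toy_scores, key=toy_scores.get)
```

In this toy case the unconstrained choice is `stand_up`, which cannot follow walking, while the constrained choice is `walk` and only three of the five models are evaluated.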
The present graph model is motivated by the methods of Li et al. [
The constraint built by unweighted graphs. (a) The transition probability between two motion types. (b) The visual graph structure constructed from the table on the left. The transition probability
Once the recognition process is underway, the motion type is annotated immediately after recognizing the current motion signal segment
For the online recognition process, input signals, which are always continuous and long, need to be separated into short segments based on different motion types. In recent studies, such as the recursive least squares (RLS) method presented by [
In order to combine the semantic information with the segmentation process, the motion content needs to be parsed by a recognition model in the online signal segmentation. PFM is introduced into the process to acquire semantic information. With specific semantic information, it can be ensured that the segmented sequence is an intact and independent motion type, which can greatly reduce the occurrence of oversegmentation. The method can be described as follows.
Let
The accumulative process of similarity probability for HMM and PFM in a small scale database. The values on the
To deal with transition signals and signals that do not belong to any alternative motion type, an appropriate threshold for each PFM should be set to filter out the redundant segment points during the PFM training process. The threshold is defined as the minimum normalized probability in the training dataset, and it rejects motion signals dissimilar to the training set.
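A minimal sketch of this thresholding follows. It assumes that "normalized probability" means the log-likelihood divided by the segment length in frames, which the text does not specify; the function names and numbers are illustrative only.

```python
def rejection_threshold(train_log_liks, train_lengths):
    """Minimum length-normalized log-likelihood over the training segments."""
    return min(ll / n for ll, n in zip(train_log_liks, train_lengths))

def accepts(log_lik, length, threshold):
    """True if a candidate segment is at least as likely as the worst training one."""
    return (log_lik / length) >= threshold

# Hypothetical training segments for one motion model.
train_log_liks = [-50.0, -66.0, -48.0]
train_lengths = [25, 30, 20]
threshold = rejection_threshold(train_log_liks, train_lengths)

similar = accepts(-55.0, 25, threshold)      # normalized score -2.2
dissimilar = accepts(-100.0, 25, threshold)  # normalized score -4.0
```

Segments that fall below the threshold for every model would be treated as transitions or unknown motions rather than forced into the nearest type.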
The method presented above considers the semantic information of the signal sequence and acquires the recognition result based on the PFMs trained offline. The recognition process is online, and its results will be discussed in Section
In this section, the functions of PFM, the effect of online recognition, and various applications of this technology are described. As presented in the last section, the input devices used in our experiment are sparse and low-cost (see Table
The details of general current portable input devices applied to motion control and recognition.
Sensor | Quantity | Unit price | Output information |
---|---|---|---|
Wii Remote | 4 | $39.99 | 3D linear accelerations, 2D rotation angle |
Xsens’ MTx [ | 4 or more | $1500 | Orientation, linear accelerations, angular velocity |
MEMS sensors [ | 8 for gait analysis | $250–8000 | 3D angular velocity, orientation, etc. |
HD Hero [ | 16 or more | $250 | Scene videos |
The offsets between actual motion data and position information, calculated by incomplete accelerations as the frame number increases.
Frame number | 10 | 50 | 100 | 150 | 200 |
---|---|---|---|---|---|
RMS value (cm) | 9.55 | 22.13 | 49.68 | 77.86 | 122.46 |
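The growth of the offset with frame number is what one would expect from integrating imperfect accelerations twice. The sketch below illustrates this with a constant measurement bias at 25 fps; the bias value and the function `integrate_position` are assumptions for illustration, and the numbers are not intended to reproduce the table.

```python
import numpy as np

def integrate_position(acc, dt):
    """Double-integrate acceleration (m/s^2) into position (m) by cumulative sums."""
    vel = np.cumsum(acc) * dt
    return np.cumsum(vel) * dt

fps = 25
dt = 1.0 / fps
frames = 200
bias = 0.05  # assumed constant accelerometer bias in m/s^2

true_acc = np.zeros(frames)   # the limb is actually at rest
measured_acc = true_acc + bias

# Position error in centimetres at every frame.
drift_cm = 100.0 * np.abs(
    integrate_position(measured_acc, dt) - integrate_position(true_acc, dt)
)
```

Under these assumptions the error grows roughly quadratically with time, so even a small bias yields an offset of more than a meter after 200 frames.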
The Wiimotes transmitted signals to a computer via a Bluetooth interface that supports a distance of 8–10 meters during an experiment. The sampling rate in our experiment was 25 fps, which can be adjusted to accommodate a range of precisions. A preliminary training motion signal database has been constructed and is clustered into 28 nodes in the graph structure based on the content of the motion signal segments. Each node consists of 3–4 groups of motion segments with different variants, such as walking in different styles or kicking to different positions. Each type of motion signal was captured 5 times by 4 different actors. These hundreds of motion signals are organized for model training. In the experiment, thousands of independent action signals and hundreds of long continuous action signals were performed by testers in real time to measure the recognition rate, robustness, and so forth.
Before the experiment, we tested several state-based methods, such as coupled HMM and structural HMM, as Pan et al. [
In our experiment, the recognition effect of different actors was validated by leave-one-out and
The recognition rate shown in the form of confusion matrices. With an increase in the recognition rate, the color of the matrix grid varies gradually from white to black.
The proposed model can handle imperfect signals as well as deletion of input signals. The fewest number of sensors that can retain complete full-body motion information remains to be determined. Further experiments will be conducted to show the robustness and capabilities of dealing with deficient signals and to determine the requisite number of sensors in order to properly function in the motion recognition process. Table
The PFM recognition rate for different actors with an increasing fraction of signal deletion.
Actor | Trained actor | New actor 1 | New actor 2 |
---|---|---|---|
Complete signals | 0.97 | 0.92 | 0.91 |
One intermittent Wii | 0.94 | 0.87 | 0.89 |
Two intermittent Wiis | 0.85 | 0.84 | 0.85 |
One missing Wii | 0.84 | 0.78 | 0.75 |
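Handling the intermittent rows above depends on predicting samples that never arrive. As a stand-alone illustration of autoregressive prediction, the sketch below fits an AR model by least squares and extrapolates a short gap; in the paper the autoregressive process is embedded in the HMM, which is not reproduced here, and the signal, model order, and function names are assumptions.

```python
import numpy as np

def fit_ar(signal, order):
    """Least-squares AR coefficients; coeffs[0] multiplies the most recent sample."""
    rows = [signal[t - order:t][::-1] for t in range(order, len(signal))]
    design = np.array(rows)
    target = signal[order:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

def predict_ahead(history, coeffs, steps):
    """Extrapolate `steps` samples past the end of `history`."""
    buf = list(history[-len(coeffs):])
    out = []
    for _ in range(steps):
        nxt = float(np.dot(coeffs, buf[::-1]))
        out.append(nxt)
        buf = buf[1:] + [nxt]  # slide the window over the predicted sample
    return np.array(out)

# Hypothetical periodic acceleration channel with the last 5 frames lost.
signal = np.sin(0.3 * np.arange(60))
observed = signal[:50]

coeffs = fit_ar(observed, order=2)
predicted = predict_ahead(observed, coeffs, steps=5)
max_error = np.max(np.abs(predicted - signal[50:55]))
```

A noiseless sinusoid is exactly an AR(2) process, so the gap is filled almost perfectly here; real accelerometer signals are noisier and the prediction degrades as the gap lengthens.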
An analysis of unknown motions not included in the training datasets provides an estimate for the maximal probability of the motions most likely to be in the training datasets. Evaluation methods demonstrate the accuracy of the input signal relative to the recognition results.
In an online recognition system, continuous signal processing is key for completing the task, and the results are essential for influencing and evaluating the recognition process. In our experiments, five actors were asked to perform a continuous motion that included 51 motion segments, which was used to test the segmentation accuracy rate. The accuracy rate of the segmentation experiment was evaluated by the number of desired segmentation points that were found or missed and the number of undesirable or redundant segmentation points. Table
The accuracy rate of our segmentation method for different actors.
Actor | Actor 1 | Actor 2 | Actor 3 | Actor 4 | Actor 5 |
---|---|---|---|---|---|
Desired points | 48/50 | 50/50 | 47/50 | 50/50 | 49/50 |
Redundant points | 8 | 10 | 5 | 12 | 7 |
Accuracy rate | 0.86 | 0.83 | 0.9 | 0.8 | 0.87 |
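The formula behind the accuracy rate is not stated here. The reported values appear consistent, to within rounding, with the number of desired points found divided by the sum of desired points found and redundant points, so the following sketch uses that ratio as an assumption; the function name is illustrative.

```python
def segmentation_accuracy(found_desired, redundant):
    """Assumed metric: desired points found / (desired found + redundant)."""
    return found_desired / (found_desired + redundant)

# (desired points found, redundant points) per actor, from the table above.
table = {
    "Actor 1": (48, 8),
    "Actor 2": (50, 10),
    "Actor 3": (47, 5),
    "Actor 4": (50, 12),
    "Actor 5": (49, 7),
}
reported = {"Actor 1": 0.86, "Actor 2": 0.83, "Actor 3": 0.9,
            "Actor 4": 0.8, "Actor 5": 0.87}

rates = {actor: segmentation_accuracy(*counts) for actor, counts in table.items()}
max_gap = max(abs(rates[a] - reported[a]) for a in table)
```

Under this assumed definition every computed rate lies within 0.01 of the reported figure, which suggests redundant points are penalized in the same way as false detections.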
The segmentation accuracy rate of three methods. A long, continuous motion was performed by five actors.
When dealing with large databases of alternative motion types, the difficulty in distinguishing features between different motion types becomes greater. The recognition capability of PFM is reduced substantially (see Table
The recognition rate of PFM and PFM with graph constraint for trained actors with an increasing number of alternative motions.
Total alternative types | 40 | 55 | 70 | 85 | 100 |
---|---|---|---|---|---|
PFM | 0.92 | 0.85 | 0.6 | 0.51 | 0.39 |
PFM with graph constraint | 0.95 | 0.94 | 0.91 | 0.85 | 0.85 |
The methods proposed here are applicable to a wide variety of applications, including behavioral teaching evaluations, interactive games in virtual environments, and activity validation systems in large-scale scenes.
A general application of the proposed recognition method is driving a virtual human to generate computer animations or to simulate a virtual environment for user interactions. After the user performs the continuous motions, segmentation and recognition are conducted efficiently, and the recognition results guide the search for the corresponding motion data in the database. The blending process in the motion graph technology guarantees continuity of the generated motion. Generative models, such as the Gaussian latent variable model presented by [
In the context of educational applications, the present method can be used to evaluate activities, such as playing tennis, doing martial arts, or dancing. Students can act out motions while following a standard motion sequence that is presented in advance. The system can then evaluate the similarity of the mimicked sequence to the standard sequence. An evaluation system can be constructed by calculating the probability ratio between the input motion signal and the normative training data. The ratio provides an important evaluation criterion. The weights of the fusion model may be adjusted to standardize the motions of each appendage. Figure
The motion evaluation system. Left: an actor performs exercises with accelerometers attached to her four limbs. Right: the recognition result and the actor’s performance grade.
Complex virtual environmental interactions constitute the main application focus of our method. Virtual environment games and special training regimens require environmental immersion and interactions with virtual objects. Our method, based on sparse, low-cost sensors, performed well in the context of these applications and can provide the user with an immersive experience.
This paper presents a full-body motion recognition method based on sparse, low-cost accelerometers. In the online recognition process, a semantics-based signal segmentation method was adopted to acquire short motion segments, and a motion transition graph structure was constructed to reduce the number of alternative motion types. To recognize the motion type accurately, a predictive fusion model was presented to efficiently distinguish between current motion types and alternative motion types. The model's recognition capability is robust and accurate in dealing with unstable and deficient signals that provide little information for reconstructing position information. Results show that the method has a high recognition rate and can be adapted to specific input signals.
During the experiments, it was found that the method had difficulty identifying the actors’ orientation, as the input devices we used lack the direction information needed to recover the whole motion. In addition, a short pause in a continuous motion occasionally led to a redundant motion segment. In the future, to overcome these problems, low-cost sensors that also provide direction information will be integrated so that the input device can be more conveniently adapted to a specific interaction. The database of the motion signals and the motion data will also be expanded. Ultimately, the method will be applied to complicated scene interactions between users and the virtual environment.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the Natural Science Foundation of China (Grant no. 61170186) and the Zhejiang Leading Team of Science and Technology Innovation (2011R50019-06). The data used was obtained from HDM05 in [