Multiple built-in cameras and the small size of mobile phones are underexploited assets for creating novel applications that are ideal for pocket-sized devices but may not make much sense with laptops. In this paper we present two vision-based methods for the control of mobile user interfaces based on motion tracking and recognition. In the first case the motion is extracted by estimating the movement of the device held in the user's hand. In the second, it is obtained by tracking the motion of the user's finger in front of the device. In both alternatives, sequences of motion are classified using Hidden Markov Models. The results of the classification are filtered using a likelihood ratio and the velocity entropy to reject possibly incorrect sequences. Our hypothesis is that incorrect measurements are characterised by a higher entropy value for their velocity histogram, denoting more random movements by the user. We also show that the same filtering criteria can be used to control unsupervised Maximum A Posteriori adaptation. Experiments conducted on a recognition task involving simple control gestures for mobile phones clearly demonstrate the potential of our approaches and may provide ingredients for new user interface designs.
Designing comfortable user interfaces for mobile phones is a challenging problem, given the limited amount of interaction hardware and the small size of the device. Touch-sensitive technology has already enabled new ways for users to interact with handheld devices. Recent touch screens provide an intuitive interface for navigating content, but this equipment still imposes some limitations: the user's fingertip size can decrease pointing accuracy, the area of interest can be occluded by fingers, and, most importantly, the operation area is restricted. Moreover, the number of functionalities in mobile devices is likely to keep increasing due to forthcoming 3D user interfaces and applications. Going forward, we will also see multiple sensors in portable devices that can enrich the mobile user experience by allowing control through gestures and other types of movement. Studies of alternative forms of mobile user interaction have therefore become a very active research area in recent years.
Much of the work in mobile interaction has been in direct manipulation interfaces, such as screen navigation by scrolling or pointing and clicking. In particular, it has been shown that different sensors provide viable alternatives to conventional user interaction. For example, tilting interfaces can be implemented with gyroscopes [
On the other hand, many current mobile phones also have two built-in cameras, one for capturing high-resolution photographs and the other for lower-resolution video telephony. Even the most recent devices have not yet utilised the unique input capabilities enabled by cameras for purposes other than photography. With appropriate computer vision methods, the information provided by images allows us to create new self-intuitive user interface concepts. In our work we have focused on what could be described as indirect interfaces, where an abstract shape is recognised and then interpreted as a command by the mobile device.
In this paper we investigate two specific approaches for creating patterns of motion: firstly, the estimation of the ego-motion of the device itself using the inbuilt camera now available on most mobile devices, and secondly, the use of this camera for tracking the motion of an external object, in our case the user's finger. These motion trajectory sequences are then modelled using
In the following section, Section
Much of the previous work on vision-based user interaction with mobile phones has utilised measured motion information directly for controlling purposes. In these systems the user can operate the phone through a series of hand movements whilst holding the phone to perform actions on the screen of the device such as scrolling or pointing and clicking [
After this work, we have seen many other image-motion-based approaches. Möhring et al. [
An alternative to markers is to estimate motion between successive image frames with similar methods to those commonly used in video coding. Rohs [
Some recent and particularly interesting directions for mobile interaction combine information from several different sensors. In their feasibility study, Hwang et al. [
Recently, motion input has also been applied to more advanced indirect interaction, such as recognising signs. This increases the flexibility of the control system, as the abstract signs can be used to represent any command, such as controls for a music player. A number of authors have examined the possibility of using phone motion to draw alphanumeric characters. Liu et al. [
The other solution studied in this paper, vision-based finger tracking, is a well-studied problem on desktop computers with numerous applications [
In our contribution, we propose two alternative solutions for extracting motion information from successive images, which can then be used as a feature for classification. In the first approach, the ego-motion of the device is estimated while the user operates the phone through a series of hand movements. The second technique is to move an object, such as a finger, in front of the camera and simultaneously track the object during gestures. Both of these approaches utilise feature-based motion analysis as a subtask, where a sparse set of image features is first selected from one image and their displacements are then determined. In order to improve the accuracy of the motion information, the uncertainty of these features is also analysed.
Feature motion analysis begins with the selection of image features from the first frame. The goal is to ensure that the features are distributed over the image so that the probability of adequately representing the overall image motion is high. We use a computationally straightforward approach where the image area is split into nonoverlapping regions and one feature is selected from each region [
Another goal is to select distinctive features that guarantee high precision in the estimation of the displacement vectors. Various criteria for selecting such features typically analyse the richness of texture within an image area [
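As a concrete illustration of this selection step, the following sketch splits a grayscale image into a grid and keeps, per cell, the point with the largest minimum eigenvalue of the local gradient matrix (a Shi-Tomasi-style texture measure). The grid size, patch size, and this particular distinctiveness score are illustrative assumptions, not necessarily the exact criterion used in the paper.

```python
import numpy as np

def select_grid_features(gray, grid=(4, 4), patch=7):
    """Pick one distinctive feature per grid cell, scored by the
    minimum eigenvalue of the local gradient (structure) matrix."""
    gy, gx = np.gradient(gray.astype(np.float64))
    h, w = gray.shape
    half = patch // 2
    features = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            y0, y1 = r * h // grid[0], (r + 1) * h // grid[0]
            x0, x1 = c * w // grid[1], (c + 1) * w // grid[1]
            best, best_pt = -1.0, None
            for y in range(max(y0, half), min(y1, h - half)):
                for x in range(max(x0, half), min(x1, w - half)):
                    wy = gy[y - half:y + half + 1, x - half:x + half + 1]
                    wx = gx[y - half:y + half + 1, x - half:x + half + 1]
                    a = np.sum(wx * wx)
                    b = np.sum(wx * wy)
                    d = np.sum(wy * wy)
                    # smaller eigenvalue of the 2x2 matrix [[a, b], [b, d]]
                    lam = 0.5 * (a + d - np.sqrt((a - d) ** 2 + 4 * b * b))
                    if lam > best:
                        best, best_pt = lam, (x, y)
            if best_pt is not None:
                features.append(best_pt)
    return features
```

In practice the inner scan would be vectorised; the loop form simply mirrors the one-feature-per-region selection described above.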
To estimate the displacement of the features
The uncertainty of the obtained estimate is analysed by detecting those displacements that may be close to the true displacement according to the matching measure value. The selection of this set of displacements is based on gradient-based thresholding of the motion profile. The result of this analysis is summarised as a covariance matrix
As a result of these computational steps, we obtain a set of
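A minimal sketch of the displacement estimation and uncertainty analysis, assuming SSD block matching over a small search window; the set of near-optimal displacements stands in for the thresholded motion profile, and its spread gives the covariance summarising the uncertainty (a simple relative threshold is used here in place of the gradient-based one described above):

```python
import numpy as np

def estimate_displacement(prev, curr, pt, patch=7, search=5):
    """SSD block matching for one feature: returns the best displacement
    (dx, dy) and a covariance summarising its uncertainty, computed from
    the spread of near-optimal displacements in the motion profile."""
    half = patch // 2
    x, y = pt
    tmpl = prev[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    ssd = np.full((2 * search + 1, 2 * search + 1), np.inf)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            cand = curr[yy - half:yy + half + 1, xx - half:xx + half + 1]
            if cand.shape == tmpl.shape:
                ssd[dy + search, dx + search] = np.sum((tmpl - cand) ** 2)
    best = np.unravel_index(np.argmin(ssd), ssd.shape)
    d = np.array([best[1] - search, best[0] - search], dtype=float)
    # displacements whose matching score is close to the optimum
    valid = ssd[np.isfinite(ssd)]
    thresh = ssd[best] + 0.1 * (valid.max() - ssd[best])
    near = np.argwhere(ssd <= thresh)
    cand_d = near[:, ::-1] - search           # rows of (dx, dy)
    if len(cand_d) > 1:
        cov = np.atleast_2d(np.cov(cand_d.T))
    else:
        cov = 1e-3 * np.eye(2)                # single sharp minimum
    return d, cov
```

A flat or edge-like patch yields several near-optimal displacements and hence a large covariance, which is exactly the uncertainty information exploited later in the outlier analysis and tracking.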
A mobile user interface system controlled through a series of hand movements requires a method for estimating the ego-motion of the device's camera [
Ego-motion estimation generally refers to the computation of 6-DOF motion. However, the choice of motion model and the number of parameters are application dependent. For simplicity, we use a four-parameter similarity motion model which represents the displacement
The global motion describing the device motion is estimated using those motion features which pass an outlier analysis stage. Such analysis is necessary because feature displacement estimates can be erroneous due to image noise, or there may be several independent motions in the scene. It is assumed that the majority of motion features are associated with the global motion we want to estimate. To select these inlier features, we use a RANSAC-based scheme where pairs of motion features are used to instantiate motion model hypotheses, which are then voted for by the other features.
A feature votes for a hypothesis if the displacement instantiated from the hypothesis is close to the estimated displacement. The covariance matrix
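The scheme can be sketched as follows, representing the four-parameter similarity transform as a complex multiplication and offset so that two correspondences determine a hypothesis. For brevity the voting uses a plain pixel-distance threshold rather than the covariance-weighted test described above:

```python
import numpy as np

def fit_similarity(p, q):
    """Similarity transform q = a*p + b in complex form (a encodes
    rotation and scale, b translation) from two correspondences."""
    z = p[:, 0] + 1j * p[:, 1]
    w = q[:, 0] + 1j * q[:, 1]
    a = (w[0] - w[1]) / (z[0] - z[1])
    return a, w[0] - a * z[0]

def ransac_global_motion(pts, disps, iters=200, tol=1.5, seed=0):
    """RANSAC over pairs of motion features: each pair instantiates a
    similarity hypothesis, which other features vote for when their
    predicted position is within `tol` pixels of the measured one."""
    rng = np.random.default_rng(seed)
    tgt = pts + disps
    z = pts[:, 0] + 1j * pts[:, 1]
    w = tgt[:, 0] + 1j * tgt[:, 1]
    best_inl = None
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        if z[i] == z[j]:
            continue
        a, b = fit_similarity(pts[[i, j]], tgt[[i, j]])
        inl = np.abs(a * z + b - w) < tol
        if best_inl is None or inl.sum() > best_inl.sum():
            best_inl = inl
    # least-squares refit of w = a*z + b over all inliers
    A = np.stack([z[best_inl], np.ones(best_inl.sum())], axis=1)
    a, b = np.linalg.lstsq(A, w[best_inl], rcond=None)[0]
    return a, b, best_inl
```

The complex parameterisation is a standard trick for the four-parameter (scale, rotation, 2D translation) model: it keeps the minimal solver to two lines and makes the final refit an ordinary linear least-squares problem.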
The goal of object tracking is to estimate the motion of an object such as a finger which can then be used as a feature for recognising gestures [
One way to track multiple object motions and cope with multimodal distribution is combinatorial data association methods [
In our model, we assume that the background and foreground motions are constant but subject to random perturbations. Translational models are considered as sufficient approximations, and then the state-space model of the camera (
Object tracking uses motion features described in Section
(a) Motion features. Estimates of feature displacements (lines) and associated error covariances (ellipses). (b) Assignment of motion measurements to two components. Weightings are illustrated using colors (red : background (
To estimate the motions we use a technique where the Kalman filter [
Soft assignments are then used in the computation of the Kalman gains which are needed to get the filtered estimates of
To describe the algorithm in more detail, let us denote the estimate of the state at each time step. Each iteration then proceeds through the following steps: (1) predict the state estimates; (2) compute the weights for the soft assignment of the measurements; (3) use the weights to compute the Kalman gains; (4) compute the filtered estimates of the state; and (5) update the estimates for the next frame.
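Under the simplifying assumption of constant-position (random-walk) translational states and an information-form weighted update, one predict/update cycle with soft assignments might look like this (the exact gain computation in the paper may differ):

```python
import numpy as np

def soft_assign_kf_step(states, covs, meas, meas_covs, q=0.5):
    """One predict/update cycle for K constant-position Kalman filters
    (e.g. background and foreground translations), with displacement
    measurements softly assigned to components by their likelihoods."""
    K = len(states)
    preds = [s.copy() for s in states]
    pcovs = [P + q * np.eye(2) for P in covs]      # random-walk predict
    # soft assignment weights from Gaussian measurement likelihoods
    W = np.zeros((len(meas), K))
    for i, (d, R) in enumerate(zip(meas, meas_covs)):
        for k in range(K):
            S = pcovs[k] + R                       # innovation covariance
            r = d - preds[k]
            W[i, k] = (np.exp(-0.5 * r @ np.linalg.solve(S, r))
                       / np.sqrt(np.linalg.det(2 * np.pi * S)))
    W /= W.sum(axis=1, keepdims=True)
    # weighted update per component, in information form
    new_states, new_covs = [], []
    for k in range(K):
        J = np.linalg.inv(pcovs[k])                # information matrix
        info = J @ preds[k]
        for i, (d, R) in enumerate(zip(meas, meas_covs)):
            Ri = np.linalg.inv(R)
            J = J + W[i, k] * Ri
            info = info + W[i, k] * (Ri @ d)
        P = np.linalg.inv(J)
        new_states.append(P @ info)
        new_covs.append(P)
    return new_states, new_covs, W
```

Each measurement contributes to every component in proportion to its weight, so an ambiguous feature influences both the background and the finger motion estimates rather than being hard-assigned to one of them.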
Figure
Sample frames from the sequence 1. (a) Frame 40, (b) frame 50, (c) frame 60, (d) frame 70, (e) frame 80, and (f) frame 110.
In order to perform classification we must select an appropriate method of modelling the motion sequences produced by the feature extraction methods described in Section
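For reference, classification with HMMs reduces to evaluating each command model's likelihood on the observed sequence and picking the maximum. Below is a minimal discrete-observation version using the forward algorithm; the paper's models use continuous (GMM) observation densities, so this is a simplified stand-in:

```python
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log of summed exponentials."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s, axis=axis)

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm in log space for stability."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = log_B[:, o] + logsumexp(alpha[:, None] + log_A, axis=0)
    return float(logsumexp(alpha))

def classify(obs, models):
    """Pick the command model with the highest likelihood."""
    scores = [log_forward(obs, *m) for m in models]
    return int(np.argmax(scores)), scores
```

The per-model scores are exactly the log-likelihoods used later for the likelihood-ratio filtering: the difference between the best and second-best score serves as the confidence measure.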
Due to the difficulty of tracking and the noisy nature of the measurements, in some applications it may be difficult to create general models for each class that will perform well for many different users. In order to improve the models' performance we propose using unsupervised
When using statistical models for pattern recognition we must train the models based on a training set. If this training set is labelled, then the
In a GMM the set of parameters
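As an illustration, mean-only MAP adaptation for a diagonal-covariance GMM in the common relevance-factor formulation can be sketched as follows; adapting only the means and the value of `tau` are illustrative choices, not necessarily those of the paper:

```python
import numpy as np

def map_adapt_means(means, covs, weights, X, tau=10.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM:
    posterior responsibilities give soft counts and sample means,
    which are interpolated with the prior means via relevance
    factor tau."""
    K, _ = means.shape
    logp = np.zeros((len(X), K))
    for k in range(K):
        diff = X - means[k]
        logp[:, k] = (np.log(weights[k])
                      - 0.5 * np.sum(np.log(2 * np.pi * covs[k]))
                      - 0.5 * np.sum(diff ** 2 / covs[k], axis=1))
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)      # responsibilities
    n = gamma.sum(axis=0)                          # soft counts
    Ex = (gamma.T @ X) / np.maximum(n, 1e-12)[:, None]
    alpha = (n / (n + tau))[:, None]
    return alpha * Ex + (1 - alpha) * means        # adapted means
```

Components that receive little adaptation data (small soft count relative to `tau`) stay close to the general model, which is what makes MAP adaptation robust with the small amounts of user-specific data available here.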
A key point in unsupervised learning is controlling either the learning process or the data used for adaptation. One way to control unsupervised adaptation is to filter out any incorrectly classified sequences before they are used for adapting the model. Here we propose the use of entropy and the log-likelihood ratio as criteria for selecting sequences for adaptation. This is based on our previous work [
To adapt the general model to a user-specific model, we first classify the sequences produced by the user using the general model. We then use the entropy and log-likelihood ratio as criteria for filtering incorrect sequences from these results. Finally, the sequences that pass the selection criteria are used to update the model for that class using MAP adaptation.
If we have a set of classes
Information Entropy is a measure of the randomness of a probability distribution of a random variable
So a sequence with higher entropy will have a more random velocity, while a sequence with lower entropy will have a more constant velocity. Our hypothesis is that well-formed signs will have a more constant velocity, and so a lower entropy, than more random or poorly formed signs. In this paper we demonstrate that sequences with higher entropy are more likely to be incorrectly classified and that, by setting a threshold on the entropy, we can filter these potentially incorrect sequences from the data used for adaptation.
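A sketch of the velocity entropy computation, assuming a trajectory of 2D positions and a fixed number of histogram bins (the entropy is normalised by the log of the bin count so that it lies in [0, 1]):

```python
import numpy as np

def velocity_entropy(traj, bins=10):
    """Normalised entropy of the histogram of frame-to-frame speeds.
    Erratic motion spreads mass over many bins (high entropy); a
    deliberately drawn sign concentrates it (low entropy)."""
    v = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    hist, _ = np.histogram(v, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(bins))
```

A trajectory drawn at near-constant speed collapses into one or two bins and scores close to zero, while a random movement spreads its speeds across the histogram and scores much higher.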
The system we propose here uses HMMs, described in Section
An overview of the proposed system for the recognition of device ego-motion, with result filtering based on log likelihood ratio and entropy.
We use two-level filtering of the results. The first level of filtering is based on the likelihood ratio between the most likely command and the second most likely command. This ratio can be seen as a confidence measure for the classification result. If it is below a predefined threshold, the confidence in the result is low and the sequence is rejected. A second level of filtering is employed to reject unintentional or accidental sequences, such as when the input system is activated without the user's knowledge or the user loses control of the phone for some reason. It is important that these unintended commands are not recognised and executed as real commands.
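The two-level filter can be sketched as follows; the threshold values are placeholders for those tuned on the validation set, and since the model scores are log-likelihoods the ratio test becomes a difference:

```python
def filter_result(log_likes, entropy, ratio_thresh=2.0, ent_thresh=0.75):
    """Accept the top-scoring command only if it beats the runner-up by
    a log-likelihood margin AND the velocity entropy is low; otherwise
    reject the sequence. Threshold values here are illustrative."""
    order = sorted(range(len(log_likes)), key=lambda k: log_likes[k],
                   reverse=True)
    best, second = order[0], order[1]
    if log_likes[best] - log_likes[second] < ratio_thresh:
        return None   # ambiguous between top two commands: reject
    if entropy > ent_thresh:
        return None   # erratic, possibly unintentional motion: reject
    return best       # index of the accepted command
```

Rejected sequences are simply ignored by the interface (and excluded from adaptation data), which is far less costly to the user than executing a wrong command.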
In our experiments we use two threshold values for
Entropy is used for filtering the final classification result. In our experiments we have included a number of sequences where the user has either deliberately made a bad sign or has just moved the phone at random. These sequences are used to test the case where the system may be unintentionally turned on by the user or the user loses control of the phone whilst making a sign. The mean of the entropy of “bad” sequences in the validation was found to be significantly higher than the mean of the “good” sequences. This initial result indicates the potential of using the velocity entropy as a measure of the quality of the sign.
In order to validate the technique described here, a hypothetical control system for mobile phone functions was devised. In this system a series of control commands was proposed. These commands are composed of seven simple elements based on seven different motions. These seven elements are shown in Table
Seven basic motion elements.
Element | Type of motion |
---|---|
Left horizontal | |
Right horizontal | |
Up vertical | |
Down vertical | |
Left diagonal | |
Right diagonal | |
Anticlockwise circle |
Eleven complex commands constructed from the seven basic motions.
Name | Command |
---|---|
Com1 | |
Com2 | |
Com3 | |
Com4 | ↑→ |
Com5 | ↑ |
Com6 | |
Com7 | |
Com8 | ↑→ |
Com9 | ↑ |
Com10 | |
Com11 |
The experimental data was collected from 35 subjects. Each subject was asked to draw each of the commands in Table
The sequence
The subjects were randomly divided into training, validation, and test sets. There were 20 subjects in the training set, 5 subjects in the validation set, and 10 subjects in the test set. Additionally 10 “bad” sequences were added to the validation set and 20 to the test set, giving a total of 1100, 285, and 570 sequences in the training, validation, and testing sets, respectively.
In addition to these subjects 30 random sequences were collected. These sequences were produced by moving the camera in a random way. These random or “bad” sequences were included in the data to test the system's performance with input caused by accidental activation of the camera or the user losing control of the phone whilst making a sign. The mean of the entropy of “bad” sequences in the validation set is 0.88 with standard deviation of 0.11, while the mean and standard deviation of the “good” sequences is 0.58 and 0.13, respectively.
It must be emphasised that there was no overlap of subjects between these three sets. The training set was used to train the parameters of the HMMs. The validation set was used to set the hyperparameters of the individual models, such as the number of Gaussians in the GMMs, that model the state distributions of the HMMs, and the number of states in the HMMs. The validation set was also used to set the hard threshold,
The results of running the system on the 570 test sequences are shown in Table
Results of testing on 570 sequences, of which 20 were intentionally bad. It should be noted that of the 570 sequences only 5 were ultimately incorrectly classified.
Command | Correct | Correct rejected | Incorrect rejected | Incorrect |
---|---|---|---|---|
Com1 | 50 | 0 | 0 | 0 |
Com2 | 47 | 2 | 0 | 1 |
Com3 | 46 | 4 | 0 | 0 |
Com4 | 49 | 0 | 1 | 0 |
Com5 | 50 | 0 | 0 | 0 |
Com6 | 48 | 0 | 2 | 0 |
Com7 | 50 | 0 | 0 | 0 |
Com8 | 47 | 0 | 1 | 2 |
Com9 | 46 | 1 | 1 | 2 |
Com10 | 48 | 0 | 2 | 0 |
Com11 | 50 | 0 | 0 | 0 |
Bad seq | 0 | 20 | 0 | 0 |
Total | 531 | 27 | 7 | 5 |
In this section we propose a system for controlling a mobile device by again recognising motion sequences. In this instance the sequences are generated by tracking the user's finger motion in front of the mobile device camera, as described in Section
The eight signs chosen to represent mobile phone commands. In the experiments each user was asked to draw the sign in the air in front of the mobile phone camera.
In order to recognise the motion trajectories produced by finger tracking we are again using HMMs. However, due to the diversity in how people make the gestures it may be difficult to create general models for each class that will perform well for many different users. In order to improve the model performance we propose using unsupervised MAP adaptation to tailor the general models for a specific user. We address the problem of controlling unsupervised learning by proposing a method of selecting adaptation data using a combination of entropy and likelihood ratio. We demonstrate how this approach can significantly improve the performance in the task of finger gesture recognition.
The experimental data was collected from 10 subjects. Each subject was asked to draw each of the commands in Figure
We first ran the baseline experiments using the training set of four subjects and the test set of four different subjects. This produced a sequence recognition rate of 82% on the test set. It can be seen from the confusion matrix shown in Table
Confusion matrix for the baseline recognition experiment using unadapted HMMs. The rows are the recognition result, and the columns are the labelling.
S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | |
---|---|---|---|---|---|---|---|---|
S1 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
S2 | 0 | 8 | 0 | 0 | 1 | 0 | 0 | 0 |
S3 | 2 | 2 | 15 | 0 | 0 | 1 | 1 | 0 |
S4 | 2 | 0 | 0 | 16 | 0 | 0 | 0 | 0 |
S5 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 |
S6 | 0 | 0 | 0 | 0 | 0 | 15 | 1 | 0 |
S7 | 0 | 6 | 0 | 0 | 0 | 0 | 14 | 0 |
S8 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 16 |
The signs S1 (a) and S8 (b) performed by a single user.
In the next set of experiments we tailor the general model to an individual user using unsupervised MAP adaptation. The results of these experiments can be seen in Table
Confusion matrix for recognition experiment using HMMs adapted to the two subjects. The rows are the recognition result, and the columns are the labelling.
S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | |
---|---|---|---|---|---|---|---|---|
S1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
S2 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
S3 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 |
S4 | 1 | 0 | 0 | 8 | 0 | 2 | 0 | 0 |
S5 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 |
S6 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
S7 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 |
S8 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 8 |
Results for the adaptation experiments. This shows the baseline percentage recognition rate, the recognition rate when adapting with no constraints, and the recognition rate after adapting with entropy and likelihood ratio constraints.
Baseline | Adapting with no constraints | Adapting with constraints | |
---|---|---|---|
Subject 1 | 81.2 | 84.4 | 90.6 |
Subject 2 | 81.2 | 87.5 | 93.7 |
We have presented two camera-based user interaction techniques for mobile devices that combine motion features and statistical sequence modelling to classify the hand movements of a user: the first recognises the motion of the device held in the user's hand, and the second recognises the motion of the user's finger. In order to improve the results produced by these systems, we have introduced two methods of filtering the classification results: the likelihood ratio and the entropy. In the first application these criteria were used to filter incorrect or random sequences from the final result, while in the second they were used for selecting data for unsupervised adaptation. It is clear from the results shown in Section
We conclude that the computer-vision-based motion estimation and recognition techniques presented in this paper have clear potential to become a practical means of interacting with mobile devices. They may also augment the information provided by other sensors, such as accelerometers and touch screens, in a complementary manner. In fact, the cameras in future mobile devices may, for most of the time, be used as sensors for self-intuitive user interfaces rather than for digital photography.