Recently, social touch gesture recognition has emerged as an important topic in touch modality research, as it can lead to highly efficient and realistic human-robot interaction. In this paper, a deep convolutional neural network (CNN) is used to implement a social touch recognition system that operates on raw input samples (sensor data) only. Touch gesture recognition is performed on a previously recorded dataset in which numerous subjects performed varying social gestures on a mannequin arm; this dataset is known as the Corpus of Social Touch (CoST). A leave-one-subject-out cross-validation method is used to evaluate system performance. The proposed method can recognize gestures in nearly real time after acquiring a minimum number of frames (on average, only 0.2% to 4.19% of the original frame lengths) with a classification accuracy of 63.7%. The achieved classification accuracy is competitive with existing algorithms. Furthermore, the proposed system outperforms other classification algorithms applied to the same dataset in terms of classification rate and touch recognition time, without requiring data preprocessing.
Social touch is one of the basic interpersonal methods used to communicate emotions. Social touch classification is a leading research area with great potential for further improvement and development [
Previous studies have aimed to identify touch gestures using 14 predefined classes [
Another issue is preprocessing, which introduces case dependency and, as previously discussed, prevents real-time performance (e.g., using an average or any other measurement that performs temporal abstraction) [
To handle this huge amount of data, we use a robust tool, which has become popular in the literature [
The main contributions of this work can be summarized as follows:
- High performance and accuracy, outperforming other recognition algorithms applied to the same dataset
- A CNN is used to recognize social gestures in an end-to-end architecture
- No preprocessing operations are required, except rescaling the pressure data between 0 and 1 by dividing by 1,023, the maximum measurable pressure (see the sketch below)
- The classification operation starts after receiving a minimum number of frames (frame length = 85)
- The social gesture class is predicted in nearly real time, after 629 ms of raw input samples (sensor data)
- Gestures are classified even if the data sample begins in the middle of the gesture
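A minimal sketch of this rescaling step, written in Python/NumPy for illustration (the implementation described later uses MATLAB; the array shape and random data here are assumptions based on the dataset description):

```python
import numpy as np

# Rescale raw pressure readings to [0, 1] by dividing by the maximum
# measurable value. Shapes and data are illustrative; a CoST sample is
# a sequence of 8 x 8 pressure frames with values in [0, 1023].
MAX_PRESSURE = 1023

raw_sample = np.random.randint(0, MAX_PRESSURE + 1, size=(8, 8, 85))
normalized = raw_sample.astype(np.float32) / MAX_PRESSURE
assert 0.0 <= normalized.min() and normalized.max() <= 1.0
```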
The remainder of this paper is organized as follows: Section
This section introduces the CoST dataset, describes the CNN, and sets up the parameters used to build the network.
The CoST dataset provides social touch gestures recorded from various subjects. The data frames were collected using a pressure sensor grid installed in a mannequin arm. The grid detects pressures ranging from 1.8 × 10⁻³ MPa to more than 0.1 MPa at an ambient temperature of 25°C. The sensor data were sampled at 135 Hz (frames per second) over an 8 × 8 grid covering the artificial skin. A single experiment collected from a subject consists of an 8 × 8 ×
Gesture instances for each class for time (
Several studies have classified the CoST dataset using various classification methods that depend on the number of features extracted from the raw input data. The first method was introduced by Jung et al. [
They used 28 features extracted from the dataset based on mean and maximum pressure, pressure variability, mean pressure per column and row, contact area, peak count, displacement, and duration. Classification results were evaluated using leave-one-subject-out cross validation. Results of touch gesture recognition ranged from 24% to 75% (
Gaus et al. [
To achieve high accuracy for gesture recognition, Hughes et al. [
Ta et al. [
Hughes et al. [
Zhou and Du [
Lastly, Jung et al. [
A CNN is a type of artificial neural network that must contain at least one convolutional layer and can include other types of layers, such as nonlinear, pooling, and fully connected layers, to create a deep convolutional neural network [
In the convolutional layer, multiple filters slide over the given input data. The output of this layer is the summation of an element-by-element multiplication between each filter and the receptive field of the input. This weighted summation becomes an element of the next layer. Figure
The convolution layer slides the filter over a given input. The output is the summation of an element-by-element matrix multiplication of the filter and the receptive field (image from [
Each convolutional operation is specified by its stride, filter size, and zero padding. The stride, a positive integer, determines the sliding step; for example, a stride of 1 means that the filter slides one position to the right before each output is calculated. The filter size (receptive field) must be the same across all filters used in the same convolutional operation. Zero padding adds rows and columns of zeros to the original input matrix to control the size of the output feature map [
Zero padding mainly serves to include the data at the edges of the input matrix. Without zero padding, the convolution output is smaller than the input, so the feature maps shrink with each additional convolutional layer, which limits the number of convolutional layers a network can have. Zero padding prevents this shrinking and thus places no such limit on the depth of our network architecture.
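The output size of a convolution follows the standard relation out = ⌊(n + 2p − f)/s⌋ + 1 for an n × n input, filter size f, padding p, and stride s. A small Python check of this relation (the function name is ours, for illustration only):

```python
def conv_output_size(n, f, p, s):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# A 3 x 3 filter with stride 1 and padding 1 preserves an 8 x 8 input
# ("same" padding), as used in the first layer of our architecture.
print(conv_output_size(8, 3, 1, 1))  # 8
print(conv_output_size(8, 3, 0, 1))  # 6: no padding shrinks the output
```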
The main task of the nonlinearity is to adjust or cut off the generated output. Several nonlinear functions can be used in a CNN; however, the rectified linear unit (ReLU) is one of the most common nonlinearities, applied in various fields such as image processing [
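For illustration, ReLU is simply an element-wise cutoff at zero (a Python sketch, not tied to any particular framework):

```python
import numpy as np

# ReLU zeroes out negative activations and passes positives unchanged.
def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```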
The pooling layer reduces the dimensions of its input. The most popular pooling method, max pooling, outputs the maximum value inside the pooling filter (2 × 2) [
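A minimal NumPy sketch of 2 × 2 max pooling with stride 2 (illustrative only):

```python
import numpy as np

# Each output element is the maximum of a non-overlapping 2 x 2 block,
# halving both spatial dimensions.
def max_pool_2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5  7]
                        #  [13 15]]
```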
The softmax layer is a standard way to represent a categorical distribution. The softmax function, which is mostly used in the output layer, is a normalized exponent of the output values [
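As a sketch, the softmax of a score vector z is exp(z_i) / Σ_j exp(z_j); subtracting the maximum beforehand is a common numerical-stability trick and does not change the result:

```python
import numpy as np

# Normalized exponent of the output values; the result is a categorical
# distribution (non-negative, sums to 1).
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approx. [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0
```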
Our approach uses a CNN and raw sensor data to classify social gestures. The main challenge is finding an optimal architecture for the CNN. Therefore, we first defined the input and output structures of the network and then selected the optimal architecture based on the results of various experiments. Each recorded sample is an 8 × 8 × (variable-length) sequence of pressure frames. Taking fixed-length subsamples increases the number of samples available to train the neural network, although this number depends on the frame length: short frame lengths yield more subsamples with less information each, and vice versa. We also obtain subsamples from other parts of the main sample (i.e., from the middle or toward the end of a social gesture). Thus, our method can recognize the social gesture class even when the starting point is unspecified. The methods proposed in previous studies were not designed for real-time classification; rather, they recognize the class only after the gesture is completed. By contrast, our approach recognizes the gesture after receiving a fixed length of data.
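A sketch of this subsampling, assuming a window of 85 frames and the 10-frame sliding step mentioned later in the discussion (Python, illustrative only):

```python
import numpy as np

# Fixed-length windows slide over a variable-length recording, so a
# gesture can be classified even when the window starts mid-gesture.
WINDOW, STRIDE = 85, 10  # frames; STRIDE is the 10-frame sliding step

def subsamples(recording):
    """recording: array of shape (8, 8, T), with T >= WINDOW frames."""
    T = recording.shape[-1]
    return [recording[:, :, s:s + WINDOW]
            for s in range(0, T - WINDOW + 1, STRIDE)]

recording = np.random.rand(8, 8, 300)  # illustrative 300-frame sample
print(len(subsamples(recording)))      # 22 windows from one recording
```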
The softmax function over 14 classes forms the output of our method. Although we take the output node with the peak value as the predicted class, the softmax values allow other highly probable hypotheses to be considered as well.
Our approach is similar to those used in CNNs for video classification or image processing. Image classification uses color images, for which the input shape is, for example, 128 × 128 × 3. The input shape of our social gesture recognition system is 8 × 8 × 85, with the pressure frames playing the role of the color channels.
We cascade convolutional layers to build the classifier; each convolutional layer consists of convolution, nonlinearity, and pooling. Our gesture recognition system uses three convolutional layers, followed by one fully connected layer and, lastly, a softmax layer. The meta parameters of our CNN architecture are presented in Table , and an illustrative code sketch of one possible reading of this architecture follows the table.
Meta parameters of our CNN architecture.
| Layer number | Element | Parameter | Value |
|---|---|---|---|
| 1 | Convolutional filter | Input channels | 8 × 8 × 85 |
| | | Size | 3 × 3 |
| | | Stride | 1 |
| | | Pad | 1 |
| | Max pooling | Size | 2 × 2 |
| | | Pad | 2 |
| 2 | Convolutional filter | Input channels | 64 |
| | | Size | 2 × 2 |
| | | Stride | 1 |
| | | Pad | 1 |
| | Max pooling | Size | 2 × 2 |
| | | Pad | 2 |
| 3 | Convolutional filter | Input channels | 128 |
| | | Size | 3 × 3 |
| | | Stride | 1 |
| | | Pad | 1 |
| | Max pooling | Size | 2 × 2 |
| | | Pad | 2 |
| 4 | Fully connected | Input to layer | 256 × 2 × 2 |
| | | Output from layer | 512 |
| 5 | Softmax | Output units | 14 |
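One plausible reading of the table as runnable code, sketched here in PyTorch for illustration (the actual implementation uses MATLAB with the LightNet Toolbox; padding conventions differ between frameworks, so the flattened size feeding the fully connected layer is inferred lazily rather than hard-coded to 256 × 2 × 2, and the output channel counts 64/128/256 are inferred from the listed input channels):

```python
import torch
import torch.nn as nn

# Sketch of the architecture in the table above; parameter values follow
# the table, while framework-specific details (padding semantics, inferred
# fully connected input size) are assumptions.
model = nn.Sequential(
    nn.Conv2d(85, 64, kernel_size=3, stride=1, padding=1),    # layer 1
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=2, stride=1, padding=1),   # layer 2
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),  # layer 3
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(512),                                       # layer 4
    nn.ReLU(),
    nn.Linear(512, 14),                                       # layer 5
    nn.Softmax(dim=1),
)

x = torch.rand(1, 85, 8, 8)  # one 85-frame window of 8 x 8 pressure data
print(model(x).shape)        # torch.Size([1, 14])
```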
For touch gesture recognition, we used MATLAB (Release 2016a) with the LightNet Toolbox, a versatile, purely MATLAB-based environment for deep learning [
The loss/objective as a function of training epoch.
We ran the experiments for frame lengths of 5, 10, 15, 20, …, 100 to find the optimum frame length. Because these experiments are computationally expensive, five randomly chosen subjects (IDs 5, 10, 18, 23, and 31) were used in a hold-out validation test to find the hyperparameters. The criterion for the optimal frame length is the average cross-validation accuracy. The results for these subjects and their average are shown in Figure
Evaluation of the performance of CNN with three convolutional layers using five randomly selected subjects from the CoST dataset. The figure shows improved performance with the increase in the frame length of the data.
However, system performance did not increase appreciably beyond 40 frames. The proposed system achieved its maximum classification rate at 85 frames, which is equivalent to 629 ms (85 frames / 135 Hz ≈ 0.63 s). Thus, this value was selected as the input dimension of our CNN.
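A runnable sketch of this grid-search protocol (the evaluation function here is a placeholder returning a random score; in the real experiment it would train the CNN at the given frame length and return hold-out accuracy on subjects 5, 10, 18, 23, and 31):

```python
import random

def evaluate_frame_length(frame_length):
    # Placeholder: stands in for training the CNN with this frame
    # length and measuring hold-out accuracy on the five subjects.
    return random.random()

frame_lengths = range(5, 105, 5)  # 5, 10, 15, ..., 100
scores = {L: evaluate_frame_length(L) for L in frame_lengths}
best = max(scores, key=scores.get)
print(f"best frame length: {best} frames ({best / 135 * 1000:.0f} ms)")
```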
Results of the leave-one-subject-out cross validation for all subjects ranged from 39.1% to 73% (
The average leave-one-subject-out cross-validation result using our proposed CNN for the gesture recognition.
Validation method | Correct classification rate (CCR) | Standard deviation |
---|---|---|
Leave-one-subject-out | 63.7% | 11.852% |
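A self-contained sketch of the leave-one-subject-out protocol (synthetic data and a simple classifier stand in for CoST and the CNN; only the validation scheme itself is the point):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_subjects, per_subject, n_classes = 5, 20, 14

X = rng.normal(size=(n_subjects * per_subject, 8 * 8))
y = rng.integers(0, n_classes, size=len(X))
subjects = np.repeat(np.arange(n_subjects), per_subject)

# Each fold trains on all subjects but one and tests on the held-out
# subject, so the reported CCR reflects generalization to new people.
accuracies = []
for train, test in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = LogisticRegression(max_iter=200).fit(X[train], y[train])
    accuracies.append(clf.score(X[test], y[test]))

print(f"mean CCR: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```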
To further understand the results, Table
Results of our proposed CNN for gesture recognition presented as the accumulated confusion matrix of the leave-one-subject-out cross validation for all subjects.
Gesture | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Grab (1) | 552 | 2 | 50 | 1 | 12 | 3 | 12 | 4 | 9 | 1 | 146 | 8 | 1 | 7 | 808 |
Hit (2) | 1 | 198 | 5 | 7 | 7 | 7 | 4 | 2 | 5 | 27 | 4 | 1 | 15 | 6 | 289 |
Massage (3) | 35 | 1 | 1090 | 3 | 37 | 8 | 38 | 140 | 57 | 1 | 103 | 49 | 2 | 102 | 1666 |
Pat (4) | 5 | 7 | 8 | 220 | 4 | 3 | 6 | 8 | 21 | 14 | 9 | 9 | 37 | 37 | 388 |
Pinch (5) | 9 | 1 | 42 | 1 | 392 | 25 | 14 | 9 | 15 | 2 | 48 | 16 | 1 | 21 | 596 |
Poke (6) | 7 | 2 | 8 | 3 | 24 | 296 | 21 | 3 | 11 | 1 | 9 | 6 | 11 | 22 | 424 |
Press (7) | 32 | 1 | 36 | 2 | 18 | 17 | 461 | 32 | 10 | 2 | 21 | 14 | 5 | 14 | 665 |
Rub (8) | 13 | 0 | 125 | 7 | 15 | 3 | 72 | 580 | 170 | 3 | 15 | 125 | 1 | 79 | 1208 |
Scratch (9) | 9 | 2 | 70 | 13 | 11 | 9 | 17 | 73 | 523 | 4 | 8 | 35 | 7 | 188 | 969 |
Slap (10) | 1 | 31 | 4 | 15 | 3 | 3 | 5 | 4 | 8 | 230 | 9 | 8 | 16 | 12 | 349 |
Stroke (11) | 126 | 1 | 116 | 2 | 57 | 2 | 17 | 10 | 7 | 3 | 345 | 7 | 1 | 21 | 715 |
Squeeze (12) | 4 | 1 | 75 | 7 | 23 | 2 | 8 | 94 | 41 | 2 | 4 | 591 | 3 | 76 | 931 |
Tap (13) | 2 | 10 | 8 | 48 | 5 | 11 | 4 | 7 | 13 | 14 | 0 | 7 | 186 | 47 | 362 |
Tickle (14) | 10 | 4 | 69 | 13 | 29 | 20 | 8 | 28 | 155 | 6 | 9 | 19 | 16 | 986 | 1372 |
Total | 806 | 261 | 1706 | 342 | 637 | 409 | 687 | 994 | 1045 | 310 | 730 | 895 | 302 | 1618 | 10,742 |
CCR% | 68.5 | 75.86 | 63.89 | 64.33 | 61.54 | 72.37 | 67.1 | 58.35 | 50.05 | 74.19 | 47.26 | 66.3 | 61.6 | 60.93 | 63.71 |
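The per-class CCR row can be reproduced from the matrix itself: each value is the diagonal entry divided by its column total, and the overall 63.71% is (up to rounding) the mean of the per-class values. A short check:

```python
import numpy as np

# Diagonal entries and column totals copied from the confusion matrix.
diag = np.array([552, 198, 1090, 220, 392, 296, 461, 580,
                 523, 230, 345, 591, 186, 986])
col_totals = np.array([806, 261, 1706, 342, 637, 409, 687, 994,
                       1045, 310, 730, 895, 302, 1618])

ccr = 100 * diag / col_totals
print(np.round(ccr, 2))      # 68.49, 75.86, 63.89, ... per class
print(round(ccr.mean(), 2))  # 63.71 overall
```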
Indeed, this confusion makes sense because humans perform these gestures similarly. An interesting outcome is that there is little confusion between massage and grab, despite both being confused with stroke. The same holds for tickle and rub, both of which are mutually confused with scratch. The performance of the proposed system varies slightly across gesture classes, as shown in Figure
Accuracy of the proposed model in predicting each gesture class.
In comparison with previous work, we address a more challenging task: classifying a gesture from its subsamples. Each prediction uses fewer frames, since we do not use the whole length of the sample. Moreover, subsampling generates more data for training and more test cases for evaluation: subsampling with 85-frame windows (sliding by 10 frames) generates about 15 times more test sets, which is of course more challenging.
Leave-one-subject-out cross validation is used to evaluate the classification accuracy of the CNN on the CoST dataset, whereas other previous approaches used 21 subjects for training and 10 subjects for testing. Thus, training on the nine additional subjects would likely result in improved performance. Table
Comparison with other existing classification methods applied to the same dataset.
| No. | Reference | Features extracted | # Subjects | # Touch classes | # Features | Classification method | Accuracy (%) | S.D. (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | [ | Yes | 31 | 14 | 28 | Bayesian classifier | 53 | 11 |
| | | | | | | SVM | 46 | 9 |
| 2 | [ | Yes | 31 | 14 | 28 | Bayesian classifier | 54 | 12 |
| | | | | | | SVM | 53 | 11 |
| 3 | [ | Yes | 31 | 14 | 45 | Neural network | 54 | 15 |
| 4 | [ | Yes | 31 | 14 | 42 | Random forests (RF) | 55.6 | 13 |
| 5 | [ | Yes | 31 | 14 | 5 sets | Random forests (RF) | 59 | |
| | | | | | | Boosting | 58 | |
| 6 | [ | Yes | 31 | 14 | 7 | Deep autoencoders | 56 | |
| 7 | [ | Yes | 31 | 14 | 273 | SVM | 60.5 | |
| | | | | | | Random forests (RF) | 60.8 | |
| 8 | [ | Yes | 31 | 14 | 54 | Bayesian classifier | 57 | 11 |
| | | | | | | Decision tree algorithm | 48 | 10 |
| | | | | | | SVM | 60 | 11 |
| | | | | | | Neural network | 59 | 12 |
| 9 | [ | No | 31 | 14 | Raw data 8 × 8 × 45 | CNN | 42.34 | |
| | | | | | Raw data 8 × 8 × 45 | CNN-RNN | 52.86 | |
| | | | | | 7 | Deep autoencoders | 33.52 | |
| 10 | Our proposed method | No | 31 | 14 | Raw data 8 × 8 × 85 | Convolutional neural network | 63.7 | 11.85 |
In this paper, a system that classifies touch gestures in nearly real time using a deep neural network was proposed. A CNN, which is considered a good feature extractor, was presented. The CoST dataset was used to train our CNN on the various gesture classes. Results showed that our method performs better than previous work under leave-one-subject-out cross validation on the CoST dataset.
The proposed approach offers two benefits compared with the existing literature. First, it does not need data preprocessing or manual feature extraction and can be applied end to end. Second, it can recognize a class after receiving a minimum number of frames; this minimum number can be determined for the CoST dataset using grid search. Meanwhile, the proposed approach also has certain limitations. First, CNN performance is affected by the size of the input frame: the smaller the frame (8 × 8 pixels), the more the performance suffers, because the CNN reduces the size of the input data in subsequent layers. Thus, zero padding of the rows and columns of a frame is applied after each convolutional operation to restore the frame size lost before the pooling operation. Second, increasing the number of filters used in the convolutional operations improves CNN performance; however, it also increases the time consumed to train the network.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.