An Indoor Scene Classification Method for Service Robot Based on CNN Feature

Indoor scene classification plays a vital part in the environment cognition of service robots. With the development of deep learning, fine-tuning a CNN (Convolutional Neural Network) on target datasets has become a popular way to solve classification problems. However, this method cannot obtain satisfying indoor scene classification results because of overfitting when scene training datasets are insufficient. To solve this problem, an indoor scene classification method is proposed in this paper, which utilizes the CNN features of scene images to generate scene category features and classifies scenes by a novel feature matching algorithm. The novel feature matching algorithm further improves the speed of scene classification. In addition, our method avoids overfitting even when the training data is limited. The presented method was evaluated on two benchmark scene datasets, the Scene 15 dataset and the MIT 67 dataset, achieving 96.49% and 81.69% accuracy, respectively. The experiment results showed that our method was superior to other scene classification methods in terms of accuracy, speed, and robustness. To further evaluate our method, tests on unknown scene images from the SUN 397 dataset were carried out, and the models based on the two training datasets obtained 94.34% and 79.80% test accuracy, respectively, which showed that the proposed method performs well in indoor scene classification.


Introduction
Scene cognition is a key part of service robot cognition. Scene information can help improve the service level of a robot. Indoor scene classification is one of the most important tasks of a service robot, since it enables the robot to provide different services according to different scenes.
A great deal of research on indoor scene classification has been done. Traditional scene classification methods were usually based on manual vision features such as SIFT (scale-invariant feature transform) [1] and SURF (speeded up robust features) [2]. Reference [3] utilized an improved SIFT feature named RootSIFT to build a BoW (Bag of Words) model and combined it with selective attention for scene classification. In [4], the SPM (spatial pyramid matching) model was proposed based on BoW to classify scenes. Reference [5] built a CLM (codebookless model) by extracting the SURF features of scene images, which obtained better accuracy on indoor scene classification. However, the mentioned vision features are low-level image features without rich semantic information, so it is hard to get satisfying results on complicated scene classification.
In recent years, deep learning has made huge progress in image classification [6,7], object detection [8][9][10], and so on. Deep learning methods, especially CNNs, have become popular solutions for scene classification. A scene classification model based on deep CNN features was presented in [11]. In [12] the accuracy of scene classification was improved by transfer learning, in which a model that has been well trained on a large-scale dataset is fine-tuned on a new dataset. In this way, better accuracy can be obtained than by training from scratch. In [13] CNN feature transfer was employed to classify scenes with good results. The cited references show that deep learning methods perform better in scene classification than manual vision features. However, there are still some problems as follows. (
Figure 1: Overall framework of the proposed indoor scene classification method based on CNN feature. There are two parts in the framework. The first part generates scene category features by CNN feature extraction and processing. The second part matches the CNN feature vector of the test scene image with the scene category features by a new feature matching algorithm to generate scores for the different scenes. The largest score indicates the result of scene classification. For example, the category of the test scene image in Figure 1 is bedroom, which has the highest score.
GPUs to speed up, which is expensive. (3) If the training dataset is insufficient, overfitting is likely. Therefore, it is difficult to get satisfying results on very limited indoor scene datasets by fine-tuning a pretrained CNN.
To address the above problems, this paper proposes an indoor scene classification method for service robots based on CNN features. Unlike the general approach of fine-tuning a CNN, our method utilizes the CNN features of scene images to generate scene category features and classifies indoor scenes by a new feature matching algorithm. The novel feature matching algorithm further speeds up scene classification. Meanwhile, the method avoids overfitting when training data is insufficient. The presented method was evaluated on two benchmark scene datasets, the Scene 15 [4] dataset and the MIT 67 [14] dataset, and tested on completely new scene images that are different from the training datasets.

Overall Framework
Essentially, a CNN is a kind of input-to-output mapping, which can learn many mapping relationships between input and output without requiring any precise mathematical expression. A CNN usually alternates convolution layers and sampling layers. The convolution layers are used to extract image features, named CNN features. The CNN features of a network pretrained on large-scale datasets contain abundant representation information. Therefore, an indoor scene classification method for service robots based on CNN features is proposed in this paper. The method contains two parts, and the overall framework is illustrated in Figure 1.
The first part is CNN feature extraction and processing. A CNN feature extraction model is built by reconstructing a pretrained CNN model. The output of the CNN feature extraction model is a one-dimensional feature vector with discriminative representation information. The images of one scene category are then processed by the model to create a scene category feature that generally represents this kind of scene. The other scene category features are obtained in the same way. This part is a learning process, and its main purpose is to obtain highly discriminative category features of the various scenes for scene classification in the next part.
The second part is about scene classification. A test scene image is put into the same CNN feature extraction model to generate a CNN feature vector. Then the CNN feature vector is matched with the scene category features by a proposed feature matching algorithm, which calculates diverse scores

Obtaining Pretrained CNN Models.
There are a lot of open source deep learning frameworks such as Theano [21], Caffe [22], and MXNet [23], which promote the development of deep learning. This paper is based on MXNet.
MXNet supports many programming languages and deep learning algorithms and provides diverse pretrained deep CNN models based on a variety of large-scale datasets. Three pretrained models (shown in Table 1) are selected from MXNet in this paper. The selected deep CNN models are all ResNets [7] with different numbers of layers. Model 1 and Model 2 were trained on a combined dataset including the ImageNet11K [24] dataset and the Places365 [25] dataset, achieving 0.3113 and 0.2255 top-1 accuracy, respectively. The ImageNet11K dataset includes 11,221 object categories and 11,797,630 images in total. Places365 is a scene dataset with 365 scene categories and 8,000,000 images. Model 3 was trained on the ImageNet11K dataset. After training on these large-scale datasets, the three models have a powerful capacity to extract CNN features.

Scene Category Feature.
Firstly, the CNN feature extraction model is built from the pretrained models by using a flatten layer instead of the softmax layer. The architectures of the feature extraction models are shown in Table 2, and ReLU activation functions are used in the models. The output of the CNN feature extraction model is a vector rich in semantic information. The vector is one-dimensional and has length 2048. The process of generating the scene category features is as follows.
Suppose that the scene dataset has $N$ categories and each category has $M$ scene images.

Input. Scene images in the training dataset.
Output. All scene category features.
Step 1. Put image $j$ of scene category $i$ into the CNN feature extraction model to create the feature vector $V_{ij} = [v_1, v_2, \ldots, v_k]$ of the image, where $k = 2048$ is the length of the feature vector.
Step 2. The feature vectors $V_{i1}, V_{i2}, \ldots, V_{iM}$ of all images in scene category $i$ are generated according to Step 1. Their mean value
$$\bar{V}_i = \frac{1}{M} \sum_{j=1}^{M} V_{ij}$$
is the scene category feature vector of scene category $i$.
Step 3. Each scene category feature vector is generated by Step 2, and all scene category feature vectors are $(\bar{V}_1, \bar{V}_2, \ldots, \bar{V}_N)$.
Step 4. Each scene category feature vector $\bar{V}_i$ is normalized by Z-Score into
$$Z_i = \frac{\bar{V}_i - \mu_i}{\sigma_i},$$
where $\mu_i$ is the mean value of the scene category feature and $\sigma_i$ is its standard deviation. If the original scene category feature vectors were used directly for analysis, the vectors with higher values would dominate the comprehensive analysis and the vectors with lower values would be weakened. Therefore, in order to ensure the reliability of the results, the original vectors need to be standardized. The improvement is demonstrated in the subsequent experiments. Figure 2 shows how the element values of a scene category feature vector change after normalization.
After these steps we obtain all scene category feature vectors for scene classification. A test scene image is then put into the CNN feature extraction model to create its CNN feature vector, which also needs to be normalized by Z-Score.
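The averaging and normalization steps above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the per-image CNN feature vectors have already been extracted, uses toy vectors of length 4 instead of 2048, and the helper names `category_feature` and `z_score` are hypothetical.

```python
from statistics import mean, stdev


def category_feature(image_features):
    """Element-wise mean of the feature vectors of one scene category."""
    return [mean(column) for column in zip(*image_features)]


def z_score(vector):
    """Z-Score: subtract the mean, divide by the sample standard deviation."""
    mu = mean(vector)
    sigma = stdev(vector)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in vector]


# Toy stand-ins for CNN features: 2 categories, 3 images each, length 4.
toy_features = {
    "bedroom": [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0], [1.0, 2.0, 3.0, 6.0]],
    "kitchen": [[4.0, 3.0, 2.0, 1.0], [6.0, 3.0, 2.0, 1.0], [8.0, 3.0, 2.0, 1.0]],
}

category_features = {}
for name, vectors in toy_features.items():
    # Mean over the images of the category, then Z-Score normalization.
    category_features[name] = z_score(category_feature(vectors))

for name, vector in category_features.items():
    print(name, [round(v, 3) for v in vector])
```

After normalization every category vector has zero mean and unit standard deviation, which is the property the feature matching analysis relies on.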

Scene Category Feature Matching.
Scene classification results are obtained by measuring the similarity between the scene category features and the CNN feature of the test scene image. Feature matching is therefore a key point. A service robot should be capable of real-time scene classification, which requires the feature matching algorithm to be fast enough. Some common feature vector matching algorithms are compared and analysed as follows, and a new feature vector matching algorithm is proposed in this paper.
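The Euclidean distance between two feature vectors $X = [x_1, x_2, \ldots, x_n]$ and $Y = [y_1, y_2, \ldots, y_n]$ is
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$
A larger Euclidean distance means a larger difference between the two vectors. The frequently used similarity measures are the Pearson correlation coefficient $\rho(X, Y)$ and the cosine similarity $\cos(X, Y)$:
$$\rho(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \tag{4}$$
$$\cos(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}.$$
A value close to 1 means that the two vectors are more similar. Suppose that vectors $X$ and $Y$ are normalized by Z-Score.
Equation (4) can be simplified to (8), since the mean value of $X$ and $Y$ is 0 and the standard deviation is 1:
$$\rho(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} = \cos(X, Y). \tag{8}$$
It can be found that the Pearson correlation coefficient is equal to the cosine similarity when the input vectors are normalized. From the variance formula (9) we can get (10):
$$\sigma_X^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 1, \tag{9}$$
$$\sum_{i=1}^{n} x_i^2 = n - 1. \tag{10}$$
When $n$ is big enough, $n - 1$ can be seen as $n$, so $\sum_{i=1}^{n} x_i^2 = n$. In the same way, $\sum_{i=1}^{n} y_i^2 = n$. There is a direct linear relationship between the square of the Euclidean distance and the Pearson correlation coefficient [15]. With these conditions, the square of the Euclidean distance can be expanded as follows:
$$d^2(X, Y) = \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} x_i y_i = 2n - 2n\,\rho(X, Y) = 2n\,\bigl(1 - \rho(X, Y)\bigr). \tag{12}$$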
In the same manner, for normalized vectors the Euclidean distance is equivalent to the Pearson correlation coefficient and the cosine similarity as a matching measure. Although the three feature matching algorithms have the same property, they differ in computation speed. All of them need to extract a root, which is time-consuming. Inspired by formula (12), a new feature matching algorithm is presented that does not extract a root. The new algorithm improves the calculation speed, which is demonstrated by the experiments.
The CNN feature vector V of a test scene image is matched with each of the scene category feature vectors by the proposed feature matching algorithm. The output score matrix is as follows.
The index of the largest element in the score matrix is the index of the scene category. In this way we obtain the category of the test scene image.  Figure 4. Experiment designs were specified as 5-fold cross-validation in order to make the experiment results convincing and repeatable. The scene images of each category were divided into 5 parts; 1 part was used to create the scene category features and the rest were used for testing. All images were resized to 224 × 224 pixels.
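The matching and arg-max step of the feature matching section can be illustrated as follows. This is a hedged sketch: the exact form of the proposed root-free measure is not reproduced in the text above, so the plain inner product of Z-Score normalized vectors is assumed here as a stand-in, and the toy category vectors and function names are hypothetical. For normalized vectors the inner product ranks categories in the same order as cosine similarity and as the (reversed) Euclidean distance.

```python
from math import sqrt
from statistics import mean, stdev


def z_score(vector):
    mu, sigma = mean(vector), stdev(vector)
    return [(v - mu) / sigma for v in vector]


def inner_product(x, y):
    """Assumed root-free matching score: larger means more similar."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    return inner_product(x, y) / (sqrt(inner_product(x, x)) * sqrt(inner_product(y, y)))


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify(test_vector, category_features, score=inner_product):
    """Score the test vector against every category and return the best one."""
    scores = {}
    for name, feature in category_features.items():
        scores[name] = score(test_vector, feature)
    return max(scores, key=scores.get), scores


# Hypothetical normalized category features and one normalized test vector.
categories = {
    "bedroom": z_score([1.0, 2.0, 3.0, 6.0, 2.0]),
    "kitchen": z_score([6.0, 3.0, 2.0, 1.0, 4.0]),
    "office": z_score([2.0, 5.0, 1.0, 2.0, 6.0]),
}
test = z_score([1.5, 2.0, 2.5, 5.0, 2.5])

best, scores = classify(test, categories)
best_by_cosine, _ = classify(test, categories, score=cosine)

distances = {}
for name, feature in categories.items():
    distances[name] = euclidean(test, feature)
best_by_distance = min(distances, key=distances.get)

print(best, best_by_cosine, best_by_distance)
```

Because every normalized vector has the same norm, all three measures select the same category in this toy setting; only the amount of arithmetic per match differs.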

Experiment Results and Analysis.
The proposed method was written in Python using the MXNet deep learning framework and run on a PC. The PC ran Ubuntu 16.04.4 with an Intel i5-6500 CPU, 32 GB of memory, and one NVIDIA GTX 1080 graphics card. In order to fully test the performance of our method, the following experiments were carried out.
(1) The Impact of Normalization on Classification Results. This experiment was done to test whether Z-Score normalization could improve scene classification accuracy. Two groups were set up for comparison; one used Z-Score to process the CNN feature vectors and the other did not. We utilized cosine similarity to match features, since the three matching algorithms are not equivalent without Z-Score. The experiment results are listed in Table 3, and a histogram is provided in Figure 5 for comparison. From Table 3 and Figure 5 we can see that using Z-Score normalization obtains better accuracy on both the Scene 15 dataset and the MIT 67 dataset with all 3 pretrained models.

(2) Computational Speed Comparison of Feature Matching Algorithms. A service robot should be able to recognize scenes in real time, so the speed of indoor scene classification is significant. In order to verify the advantage of the proposed feature matching algorithm in computing time, the following experiments were carried out in comparison with other algorithms. Euclidean distance (ED), cosine similarity (CS), Pearson correlation coefficient (PCC), and the proposed feature matching algorithm were each tested on the two datasets with the three different pretrained models. The scene classification results are shown in Table 4. From the results in Table 4, we can see that CS and PCC have similar processing speeds, which are much slower than ED and our algorithm. Compared with ED, the proposed algorithm is faster. In addition, the scale of the datasets and the number of layers of the pretrained models affect the scene classification speed. Overall, our feature matching algorithm is able to meet the service robot demand for real-time scene classification.
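A rough way to compare the per-match cost of the four measures is a micro-benchmark such as the one below. It is only a sketch under stated assumptions: the proposed measure is again approximated by a root-free inner product, the vectors are random stand-ins of length 2048, and pure-Python timings will not reproduce the figures in Table 4, which were measured on the authors' MXNet setup.

```python
import random
import timeit
from functools import partial
from math import sqrt


def inner_product(x, y):
    """Assumed root-free score (stand-in for the proposed measure)."""
    return sum(a * b for a, b in zip(x, y))


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    return inner_product(x, y) / (sqrt(inner_product(x, x)) * sqrt(inner_product(y, y)))


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine(centered_x, centered_y)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2048)]
y = [random.gauss(0.0, 1.0) for _ in range(2048)]

REPEATS = 200
timings = {}
for measure in (euclidean, cosine, pearson, inner_product):
    # partial() binds the arguments so timeit calls measure(x, y) directly.
    timings[measure.__name__] = timeit.timeit(partial(measure, x, y), number=REPEATS)

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>14}: {seconds / REPEATS * 1e6:.1f} microseconds per match")
```

The absolute numbers depend on the machine and interpreter; the point of the sketch is only that the measures differ in how many passes over the vector and how many root extractions they need.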
(3) Contrast with Transfer Learning on CNNs. A transfer learning strategy was used: the fully connected layers of the pretrained CNN models were changed according to the number of scene categories, and the models were fine-tuned on the datasets. The training parameters were a learning rate of 0.0001, a batch size of 32, and 100 epochs (MIT 67) or 200 epochs (Scene 15). Figure 6 shows the training curves of fine-tuning the CNNs on the Scene 15 dataset and the MIT 67 dataset. Results of the comparison are listed in Table 5.
(4) Contrast with Other Advanced Scene Classification Methods. In order to verify the performance of our method, other advanced indoor scene classification methods were used as references in the following comparison tests. Table 6 shows the indoor scene classification performance of the different methods. Confusion matrices created by CLM+SVM [5] and our method are shown in Figure 7. Performance comparisons on the MIT 67 dataset are listed in Table 7, and Figure 8 shows the scene classification confusion matrices of our method on the MIT 67 dataset with the three pretrained models.
In Table 6 our method achieves higher scene classification accuracy and better efficiency than the other methods. The scene classification confusion matrices in Figure 7 indicate that our method is more robust than CLM+SVM in [5]. For the classification of a large number of scenes in the MIT 67 dataset, it can be observed from Table 7 that our method with Model 2 obtains better results than the other methods. The classification confusion matrices of the MIT 67 dataset in Figure 8 demonstrate the robustness of our method. Table 8. To further demonstrate the performance of our method, scene classification test confusion matrices are shown in Figure 9 (the models trained on Scene 15) and Figure 10 (the models trained on MIT 67). From the test results in Table 8 it can be concluded that our method classifies indoor scenes well on scene data from a completely different source, which indicates that the proposed feature matching algorithm reflects the content distance between different indoor scenes. Although the models were trained on grayscale images from the Scene 15 dataset, they obtain more than 92% accuracy with the different pretrained models, which is further evidence that our method classifies scenes based on the content and semantics of the image instead of low-level features such as pixel, colour, and edge descriptors. In addition, Figures 9 and 10 show the test confusion matrices of scene classification by the models trained on the Scene 15 and MIT 67 datasets, respectively, which illustrates the robustness of our method on the test scene dataset.

Conclusions
In this paper an indoor scene classification method for service robots based on CNN features is proposed. We utilize the CNN features of scene images to generate scene category features and classify indoor scenes by an improved feature matching algorithm. The novel feature matching algorithm further speeds up scene classification. The presented method is evaluated on two benchmark scene datasets, the Scene 15 dataset and the MIT 67 dataset. Compared with the general approach of fine-tuning a CNN on the training dataset, this method obtains satisfying accuracy without overfitting on a small training dataset and does not need to be trained repeatedly. In contrast to other indoor scene classification methods, our method greatly improves the scene classification results in terms of accuracy, classification speed, and robustness. The experiment results show that this method performs well in indoor scene classification and can meet the task requirements of service robot indoor scene classification. With the continuous development of computer hardware and cloud robots, the computing capacity of service robots has been greatly improved. Our next step is to further improve the scene cognition ability of service robots based on deep learning methods.

Data Availability
The data used to support the findings of this study are available from the public scene datasets.

Conflicts of Interest
The authors declare that they have no conflicts of interest.