Enhanced Mask R-CNN for Chinese Food Image Detection

Food image detection plays an essential role in visual object detection, given its applicability to solutions that improve people's nutritional status and thus their health. At present, most food detection technologies target Western and Japanese food, and few target Chinese food. In this work, we establish a Chinese food image dataset called CF-108 that can serve as a data basis for Chinese food image detection. The CF-108 dataset contains most Chinese dishes and covers large variations in the presentation of dishes within the same category. In addition, we introduce a training architecture, Mask R-DSCNN, that replaces the traditional convolution in the mask region convolutional neural network (Mask R-CNN) with depthwise separable convolution to reduce the expensive computation cost. Experiments demonstrate that Mask R-DSCNN can significantly reduce resource consumption and improve the detection efficiency for Chinese food images without a large loss in accuracy.


Introduction
Diet is essential for people's dietary health and quality of life [1]. By effectively recognizing and segmenting food images of daily meals, people can obtain practical information and effectively analyze and summarize their needs [2]. For example, ordinary people can balance their nutrition intake [3,4]; people with diabetes can avoid all high-sugar dishes [5]; doctors can also analyze the patient's previous diet structure and give reasonable dietary recommendations [6].
Food image recognition has long been a topic of interest. In [7], statistical methods were proposed to calculate the characteristics of dish images for food image classification. The authors of [8] used a random forest to extract local features to classify dishes. In [9], the classification of dish images was discussed based on Anti-Textons texture features. In these cases, model accuracy and generalization ability are poor. With the development of deep learning, convolutional neural networks have come to dominate image classification and many other computer vision tasks [10][11][12]. Kagaya et al. [13] introduced the deep learning model AlexNet to detect and classify food images. Hassannejad et al. [14] used a deeper model, Inception, to classify food images. Singla et al. [15] applied the GoogLeNet model to classify food images and non-food images. The development of object detection technology [16,17] has put forward higher requirements for image recognition. The main popular algorithms can be divided into two categories. One is the family of region-proposal-based convolutional neural networks, called R-CNN [18], including R-CNN, Fast R-CNN [19], Faster R-CNN [20], and Mask R-CNN [21]. These R-CNN algorithms are two-stage algorithms that first generate the target candidate boxes and then predict the detection results. The other is the single-stage algorithm, such as YOLO [22][23][24] and SSD [25], which uses only a convolutional neural network (CNN) to directly predict the category and location of different targets.
Generally, the detection accuracy of two-stage algorithms is better than that of single-stage algorithms; in particular, Mask R-CNN outperformed existing single-model entries in each task in the 2016 COCO Challenge [26]. However, Mask R-CNN is computationally expensive and time-consuming because of its relatively complex model structure. In this paper, we present the work of establishing a Chinese food image dataset as an essential data basis for Chinese food image detection. We further propose using depthwise separable convolution [27] instead of traditional convolution to reduce the number of model parameters and the operation cost. The main contributions of this paper are as follows: (
The rest of the paper is organized as follows. Section 2 briefly reviews the target detection algorithm Mask R-CNN as background knowledge. Section 3 describes the procedure for building a dataset of Chinese food images. Section 4 formally sets out the framework of Mask R-DSCNN, and Section 5 presents the Chinese food detection experiments. This work is concluded in Section 6.

Background
In recent years, breakthroughs have been made in target detection algorithms. Among these algorithms, Mask R-CNN outperformed existing single-model entries in each task in the 2016 COCO Challenge [26]. The framework of Mask R-CNN mainly includes three parts: first, the backbone convolutional neural network (CNN) for feature extraction from the input image; second, the Region Proposal Network (RPN) [28], which slides anchors with different scales and aspect ratios over the feature map to generate region proposals; third, a prediction network with three parallel branches, namely two fully connected (FC) layers for bounding-box classification and regression and a fully convolutional network (FCN) [29] for predicting the object mask. In principle, the backbone network could be any mainstream deep neural network, such as AlexNet [30], VGG [31], GoogLeNet [32,33], or ResNet [34]. In the Mask R-CNN model, ResNet (with the last fully connected layer removed) is used as the backbone network to extract features, which can effectively alleviate vanishing gradients and training degradation without increasing model parameters. ResNet contains five sets of convolutions. The lower layers extract low-level features such as edges, while the upper layers extract the top-level features representing the target category. To make better use of the features at each level, Mask R-CNN extends the backbone network to a feature pyramid network (FPN), which uses the inherent layering and multiscale properties of convolutional neural networks to derive useful features for object detection. The goal of the RPN is to predict a set of region proposals efficiently. To this end, a small network slides over the feature map and generates multiple region proposals with multiple scales and aspect ratios based on anchors. Two FC layers then follow this feature for box regression (reg) and box classification (cls).
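The anchor mechanism can be illustrated with a minimal sketch. It assumes that each anchor has area scale² and that the aspect ratio means height divided by width; the function name `make_anchors` and the centre point are hypothetical, and the scales and ratios are the values reported later for the COCO experiment. It is not the authors' implementation.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Return (x1, y1, x2, y2) anchors centred on (cx, cy).

    Assumption: each anchor has area scale**2 and aspect ratio h / w = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # width shrinks as the ratio grows
            h = s * math.sqrt(r)   # height grows as the ratio grows
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Three scales times three ratios give nine anchors per sliding position.
anchors = make_anchors(64, 64, scales=(128, 256, 512), ratios=(0.5, 1, 2))
print(len(anchors))
```

In an RPN the same set of anchors is placed at every position of the feature map, so the number of candidate boxes grows with the feature-map size.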
In RPN training, the anchors with the largest Intersection-over-Union (IoU) overlap with a ground-truth box are used as positive labels, and the anchors with an IoU below 0.3 are used as negative labels. The IoU is calculated as

$$\mathrm{IoU} = \frac{\mathrm{area}(\text{Detection Result} \cap \text{Ground Truth})}{\mathrm{area}(\text{Detection Result} \cup \text{Ground Truth})},$$

where Detection Result indicates the predicted box and Ground Truth indicates the ground-truth box. The RPN fine-tunes the region proposals based on the obtained regression information and deletes those region proposals that cross the image boundary. Finally, after Non-Maximum Suppression (NMS) [35], about 2000 region proposals per image are left. The region proposals generated by the RPN require RoIAlign to adjust their dimensions to fit the multibranch prediction network. RoIAlign uses bilinear interpolation, instead of the rounding operation of RoIPool in Faster R-CNN, to extract the corresponding features of each region proposal on the feature map. The multibranch prediction network consists of FC layers for object detection and an FCN for masking. During model training, the loss function of the Mask R-CNN model for each proposal is

$$L = L_{cls} + L_{box} + L_{mask},$$

where $L_{cls}$ and $L_{box}$ represent the classification loss and the regression loss, respectively, and $L_{mask}$ represents the segmentation loss. The classification and regression losses are calculated as

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{box}(t_i, t_i^*),$$

where $i$ represents the index of the anchor, $p_i$ indicates the predicted probability of anchor $i$, $t_i$ represents the four coordinate parameters of the box, and $t_i^*$ represents the coordinate parameters of the ground-truth box corresponding to the positive anchor. If the anchor is positive, $p_i^*$ is 1; otherwise, $p_i^*$ is 0. $N_{cls}$ and $N_{reg}$ are normalization terms, and $\lambda$ is a balancing weight. By minimizing the loss function, the model is gradually optimized.
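As an illustration of the IoU measure and of NMS, the following is a minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2) and a greedy NMS variant. The overlap threshold is a free parameter here because the text does not state the value used; the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it above threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, threshold=0.5)  # second box overlaps the first too much
```

Here the second box is suppressed because its IoU with the higher-scoring first box is 81/119, which exceeds the assumed threshold of 0.5.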

Data Collection.
Building a dataset of Chinese food images requires meeting three goals. First, the dataset needs to contain as many Chinese food items as possible, and each item needs to be represented by as many images as possible. Second, in practical scenarios, the resolution of dish images varies with the camera used, so a dataset containing images of multiple resolutions represents food more accurately. Therefore, the image sources should contain dishes at different resolutions to improve the robustness of the model. Finally, both the image and the target object must be correctly labeled.
To meet these goals, we first gather the labels and images of Chinese dishes using publicly available images from relevant Chinese food websites (http://www.meishichina.com; http://www.douguo.com), where most users post their Chinese dishes with tags. The ten most common food items from these websites are shown in Table 1. Web crawler technology [36] is used to obtain the labels and images of Chinese dishes, since it can effectively obtain data on a topic, such as fried shrimp, braised pork, or pickled fish, within a specific time and a specific range on the website. As a result, over 100,000 images in 108 categories were crawled. Each dish has more than 100 images, and some dishes have more than 1,000.

Data Preprocessing.
Typically, the collected data is complex and may contain inappropriate images, unclear images, or images with complex noise. Therefore, the next step is to clean [37], smooth, and label the data to improve the quality of the dataset. In this step, we first remove the images that are unclear or irrelevant. Then we use median filtering [38] to smooth image noise caused by unrelated objects around the target object (such as background debris or image watermarks). In addition, we label both the image and the target object to achieve the complete segmentation of the target object from the background.
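Median filtering can be sketched in a few lines. The example below assumes a single-channel (grayscale) image stored as a list of rows, a 3 x 3 window, and borders handled by clamping coordinates to the image edge; none of these settings is stated in the text, and the function name is hypothetical.

```python
from statistics import median

def median_filter(img, k=3):
    """Apply a k x k median filter to a 2-D grayscale image (list of rows).

    Border pixels are handled by clamping coordinates to the image edge.
    """
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = median(window)
    return out

# A single bright outlier in a flat region is replaced by the local median.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
smoothed = median_filter(noisy)
```

Because the median ignores extreme values in the window, isolated bright spots are removed while flat regions are left unchanged.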
The image histogram is a common tool for data cleaning. The specific operation is to convert each image into a histogram and then use the correlation coefficient to measure the similarity between images. Images with a similarity below the threshold are removed. The correlation coefficient is defined as

$$r(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}},$$

where $x$ and $y$ are the histograms of the two images, $\mathrm{Var}[x]$ is the variance of $x$, $\mathrm{Var}[y]$ is the variance of $y$, and $\mathrm{Cov}(x, y)$ is the covariance between $x$ and $y$. The value of this formula ranges from −1 to 1. The larger the result, the more similar the two images. In this paper, the histogram method is used to clean the collected Chinese food images. Before using the histogram, we manually remove images that are unusually large or small, which are usually irrelevant. Then, we select a correct image for each dish category and calculate the correlation coefficient between this correct image and each of the remaining images. If the correlation coefficient of an image is less than 0.3, we consider the image irrelevant to the correct image and remove it.
The biggest difference between image content and noise lies in the change of gray level. The visual disturbance in an image is formed by the large difference between the gray level of the noise and the surrounding gray levels. Therefore, an image smoothing method is generally used to eliminate noise by exploiting these gray-level differences. Figure 1 compares spicy crayfish images before and after smoothing by median filtering (the left side is before processing and the right side is after processing). In the image on the right, we can clearly see that the background debris, such as green onions, peppers, tea cups, and chopsticks on the table, loses many obvious bright spots, and the image becomes smoother.
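The histogram-based cleaning step can be sketched as follows, assuming grayscale pixel values in 0..255 and 16 equal-width bins (both are assumptions; the text only fixes the 0.3 threshold). The function names are hypothetical.

```python
import math

def histogram(pixels, bins=16, max_value=256):
    """Count grayscale pixel values (0..max_value-1) into equal-width bins."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // max_value] += 1
    return hist

def correlation(x, y):
    """Correlation coefficient Cov(x, y) / sqrt(Var[x] * Var[y]).

    The 1/n factors of covariance and variance cancel, so sums are used directly.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat histogram carries no similarity information
    return cov / math.sqrt(var_x * var_y)

def clean(reference, candidates, threshold=0.3):
    """Keep candidates whose histogram correlates with the reference image."""
    ref_hist = histogram(reference)
    return [c for c in candidates
            if correlation(ref_hist, histogram(c)) >= threshold]

reference = [0] * 10 + [255] * 2   # mostly dark image with a few bright pixels
unrelated = [128] * 12             # uniformly mid-gray image
kept = clean(reference, [reference, unrelated])
```

In this toy case the mid-gray image has a negative correlation with the reference histogram and is discarded, while the copy of the reference is kept.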
After data cleaning and data smoothing, the next step is data labeling. Labeling Chinese dishes is an expensive part of building the dataset because, even within the same category, food images differ considerably in ingredients and cooking styles. In this experiment, we adopt the labeling method used in [39], which designs a semisupervised method to accelerate the labeling process. Specifically, it pretrains a CNN model for the food recognition task on some labeled samples and then uses this model to assign candidate labels to the collected images. Finally, the labels are verified manually to finalize the dataset. Note that both the image and the target object need to be labeled to achieve the complete segmentation of the target object from the background.

Dataset Description.
After data collection and preprocessing, the new Chinese food image dataset CF-108 contains 100,800 images in 108 categories, each of which covers significant variations in presentation. We divide the dataset into training and testing sets at a ratio of approximately 8 : 2, with 81,543 images for training and 19,257 images for testing. Figure 2 shows some example images from the CF-108 dataset at their original size.

Depthwise Separable Convolution.
Depthwise separable convolution, proposed by Laurent Sifre in 2013 [27], has fewer parameters and a lower operation cost than the standard convolution operation [40]. The main idea of depthwise separable convolution is to decompose the standard convolution into a depthwise convolution and a pointwise convolution. The comparison between depthwise separable convolution and standard convolution is shown in Figure 3.
Consider an input volume with width and height $D_f$ and $M$ input channels. If a color image were the input, then $M$ would be equal to three for the RGB channels. In standard convolution, the application of filters across all input channels and the combination of these values are done in a single step. When $N$ convolution kernels of shape $D_k \times D_k \times M$ are applied to the input in a standard convolutional neural network, the output volume is $D_g \times D_g \times N$. The cost of this convolution operation is $N \cdot D_k^2 \cdot D_g^2 \cdot M$. Taking the same input volume for comparison, depthwise separable convolution breaks the convolution down into two parts: depthwise convolution and pointwise convolution. Depthwise convolution applies convolution to a single input channel at a time. Therefore, each convolution kernel of shape $D_k \times D_k \times 1$ is applied to a single input channel in the depthwise convolution stage, and $M$ such kernels are required over the entire input volume. Stacking the outputs of these $M$ convolutions together gives an output volume of shape $D_g \times D_g \times M$. Depthwise convolution is followed by pointwise convolution, which performs a linear combination of the channels. The filter is a $1 \times 1$ convolution over all $M$ channels. Assuming $N$ such filters, the output volume has the same shape as in standard convolution, $D_g \times D_g \times N$. The total cost of these two phases is $M \cdot D_k^2 \cdot D_g^2 + N \cdot M \cdot D_g^2$. The effect of depthwise separable convolution can be shown as the ratio of the two costs:

$$\frac{M \cdot D_k^2 \cdot D_g^2 + N \cdot M \cdot D_g^2}{N \cdot D_k^2 \cdot D_g^2 \cdot M} = \frac{1}{N} + \frac{1}{D_k^2}.$$

For instance, with $N = 1024$ output channels and a kernel of size 3, the ratio is 0.112. In other words, standard convolution requires about nine times as many multiplications as depthwise separable convolution. Therefore, we conclude that the computational resources required for depthwise separable convolution are much lower than those required for standard convolution.
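The cost comparison can be checked numerically with a short sketch. The output size of 14 and the 512 input channels below are hypothetical values chosen for illustration; the ratio depends only on the number of output channels and the kernel size.

```python
def standard_conv_cost(d_k, d_g, m, n):
    """Multiplications of a standard convolution: N * Dk^2 * Dg^2 * M."""
    return n * d_k**2 * d_g**2 * m

def separable_conv_cost(d_k, d_g, m, n):
    """Multiplications of depthwise (M * Dk^2 * Dg^2) plus pointwise (N * M * Dg^2)."""
    depthwise = m * d_k**2 * d_g**2
    pointwise = n * m * d_g**2
    return depthwise + pointwise

# Kernel size 3 and N = 1024 output channels, as in the worked example.
ratio = separable_conv_cost(3, 14, 512, 1024) / standard_conv_cost(3, 14, 512, 1024)
print(round(ratio, 3))  # prints 0.112
```

The computed value equals 1/1024 + 1/9, so the pointwise stage contributes very little and the saving is governed mainly by the kernel size.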

Training Infrastructure.
Mask R-CNN has high detection accuracy in image recognition and segmentation but requires excessive computing resources and storage space. In this section, we use depthwise separable convolution instead of traditional convolution to reduce model consumption. Specifically, we replace all convolutional blocks of ResNet-50 with depthwise separable convolution to complete feature extraction. Figure 4 illustrates the training procedure of Mask R-DSCNN for Chinese food image detection. The training procedure of Mask R-DSCNN consists of three modules. The backbone is a depthwise separable convolution network with an FPN architecture that extracts feature maps from the input images. The feature map is shared by the subsequent RPN layers and the RoIAlign layer. The RPN is used to generate region proposals. This layer uses softmax to determine whether anchors are positive or negative and then uses bounding-box regression to refine the anchors into accurate proposals. The RoIAlign layer takes the input feature map and the proposals, extracts the proposal feature maps after combining this information, and sends them to the subsequent multibranch prediction network. The FC layers use the proposal feature maps to calculate the category of each proposal and use bounding-box regression to obtain the final precise position of the detection box, while the FCN segments the instance to produce the mask.
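To make the replaced operation concrete, the following is a minimal pure-Python sketch of a depthwise separable convolution on a tiny input. It assumes square inputs stored as [channel][row][column], stride 1, no padding, and no bias; it illustrates the operation only and is not the authors' ResNet-50 implementation. All function names are hypothetical.

```python
def depthwise_conv(x, kernels):
    """Convolve each input channel with its own k x k kernel (stride 1, no padding)."""
    k = len(kernels[0])
    out_size = len(x[0]) - k + 1
    return [
        [
            [
                sum(
                    x[c][i + di][j + dj] * kernels[c][di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
                for j in range(out_size)
            ]
            for i in range(out_size)
        ]
        for c in range(len(x))
    ]

def pointwise_conv(x, weights):
    """1 x 1 convolution: weights[n][c] mixes the M channels into N outputs."""
    size = len(x[0])
    return [
        [
            [sum(w[c] * x[c][i][j] for c in range(len(x))) for j in range(size)]
            for i in range(size)
        ]
        for w in weights
    ]

def depthwise_separable_conv(x, kernels, weights):
    """Depthwise stage followed by pointwise stage."""
    return pointwise_conv(depthwise_conv(x, kernels), weights)

# M = 2 input channels of size 3 x 3, 2 x 2 kernels, N = 3 output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
kernels = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
weights = [[1, 1], [1, -1], [0, 2]]
y = depthwise_separable_conv(x, kernels, weights)
```

The depthwise stage keeps the two channels separate, and only the pointwise stage mixes them, which is where the reduction in multiplications comes from.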

Experiments
In this section, we conduct experiments on Chinese food image detection based on Mask R-DSCNN. First, the evaluation metrics and experimental settings are described, and then we assess the effectiveness of Mask R-DSCNN on the COCO dataset, with Mask R-CNN for comparison. Finally, we provide the experimental results and analysis of Chinese food image detection. The evaluation metric for detection is the COCO target detection indicator Average Precision (AP), which can effectively measure the similarity between the real target and the predicted target. The consumption of the model mainly refers to the model size and the time consumed for training. The experiments were conducted on an NVIDIA Tesla M60 with TensorFlow 2.0 [41] and Python 3.6.

The Effectiveness of Mask R-DSCNN.
We trained Mask R-DSCNN and the standard Mask R-CNN on the COCO dataset for comparison. The anchor sizes are set to (128, 256, 512), and the aspect ratios are set to (0.5, 1, 2). Stochastic gradient descent (SGD) is selected as the training optimizer. The learning rate is set to 0.001, the momentum to 0.9, and the batch size to 128, with a total of 200,000 epochs. After training, we randomly selected 5,000 images from the testing set of the COCO dataset to evaluate the model performance (see Table 2).
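For readers unfamiliar with the optimizer settings, the following sketch shows one common formulation of the SGD-with-momentum update using the reported learning rate and momentum. The exact update rule used by the training framework is not stated in the text, so this form is an assumption, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update; returns the new (weights, velocity).

    Assumed form: v <- momentum * v - lr * g, then w <- w + v.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Two steps on a single weight with a constant gradient of 2.0.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, [2.0], v)
w, v = sgd_momentum_step(w, [2.0], v)
```

With a constant gradient, the velocity accumulates across steps, so the second step moves the weight further than the first.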
Experiments show that the AP values of Mask R-DSCNN are slightly lower than, but still on par with, those of Mask R-CNN. This is because Mask R-DSCNN replaces the standard convolutional layer with depthwise separable convolution to extract features, which may cause some loss of feature information. The model size and running time are recorded on the same configuration (see Table 3).
It can be seen that Mask R-DSCNN is more cost-efficient, with a smaller model size and therefore a faster running speed. Specifically, the model size of Mask R-CNN is 245 M, while the model size of Mask R-DSCNN is only 93 M. In addition, the total running time of Mask R-DSCNN is 3716 s, which is more than 1400 s shorter than the running time of Mask R-CNN. Therefore, we can conclude that Mask R-DSCNN can significantly reduce resource consumption and improve detection efficiency without a large loss in accuracy.

Chinese Food Image Detection.
To save training resources and ensure the models' performance on Chinese food images, we fine-tuned both Mask R-CNN and Mask R-DSCNN after pretraining them on the COCO dataset. Fine-tuning is a method of applying previously learned knowledge to a new task. In terms of deep learning, this means that the weights of each layer are no longer randomly initialized but are initialized from the parameters of a trained model. Since the target features extracted by deep learning are hierarchical, the high-level layers extract combinations of the features extracted by the low-level layers. Therefore, the primary information extracted by deep learning is common to different datasets. If the model achieves good training results on a large dataset, then the primary features it has learned can also be used on another dataset. In this experiment, the parameters of the models successfully trained on the COCO dataset are used as initialization parameters, and then we fine-tune the models on the new Chinese food image dataset. The anchor sizes are set to (8, 16, 32, 64, 128), and the aspect ratios are set to (0.5, 1, 2). The optimizer is SGD with a learning rate of 0.001 and a momentum of 0.9. The batch size is 64, with a total of 200,000 epochs. After training, we again randomly selected 5,000 images from the testing set of the CF-108 dataset to evaluate the model performance (see Table 4).
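The initialization step of fine-tuning can be sketched as below. This is an illustrative assumption about how pretrained parameters might be transferred: weights are copied wherever the layer name and shape match, and layers whose shape differs (for example a class-dependent output layer) are left for fresh initialization. The layer names, shapes, and string placeholders for weights are hypothetical, and the text does not describe how the authors handled such layers.

```python
def init_from_pretrained(model_shapes, pretrained):
    """Build initial weights from pretrained ones where name and shape match.

    model_shapes: dict mapping layer name -> shape tuple for the new model.
    pretrained: dict mapping layer name -> (shape tuple, weights).
    Returns (initial, reinit); reinit lists layers left for fresh initialization.
    """
    initial, reinit = {}, []
    for name, shape in model_shapes.items():
        if name in pretrained and pretrained[name][0] == shape:
            initial[name] = pretrained[name][1]
        else:
            reinit.append(name)
    return initial, reinit

# Hypothetical layers: the backbone matches, the class head does not.
pretrained = {
    "backbone/conv1": ((3, 3, 3, 64), "coco_conv1_weights"),
    "head/class_logits": ((1024, 81), "coco_class_weights"),
}
model_shapes = {
    "backbone/conv1": (3, 3, 3, 64),
    "head/class_logits": (1024, 109),
}
initial, reinit = init_from_pretrained(model_shapes, pretrained)
```

In this toy case the backbone layer reuses its pretrained weights, while the class head is flagged for reinitialization because its output dimension differs between the two datasets.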
As with training on the COCO dataset, the Mask R-DSCNN model trained on the CF-108 dataset is slightly weaker, but the gap is tolerable; when the IoU threshold is set to 0.5, the AP values of the two models are closest. Figure 5 shows the detection of Chinese food images by the Mask R-CNN model and the Mask R-DSCNN model when the IoU threshold is fixed at 0.5. The results show that Mask R-DSCNN can successfully identify and segment braised pork, and the two models are almost the same in terms of regression box and mask survival rate. The comparison of model size and running time between the two models is shown in Table 5.
It can be seen that Mask R-DSCNN still achieves a competitive result with a smaller model size and a shorter running time. Therefore, we can claim that Mask R-DSCNN has practical significance for Chinese food image detection.

Conclusions
In this paper, a method for Chinese food image detection based on an improved structure of Mask R-CNN was proposed. To achieve that goal, we first built a dataset of Chinese food images, called CF-108, which contains 100,800 images in 108 categories covering most Chinese food. In addition, we proposed a new model framework, namely, Mask R-DSCNN, which uses depthwise separable convolution instead of traditional convolution to reduce model consumption.
The experimental results on the CF-108 dataset demonstrate that Mask R-DSCNN can greatly reduce resource consumption and improve the detection efficiency for Chinese food images without a large loss in accuracy. Further work will be carried out with multispectral or hyperspectral images as in [42].
Mathematical Problems in Engineering

Data Availability
The dataset and software code used to support this study's findings have not been made available because the data also form part of an ongoing study. Requests for data, after publication of the ongoing study, will be considered by the corresponding author.

Disclosure
The authors contributed equally to this work and should be considered co-first authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.