In the field of object detection, tremendous success has been achieved recently, but detecting and identifying objects accurately and quickly remains very challenging. Human beings can detect and recognize multiple objects in images or videos with ease regardless of an object's appearance, but for computers it is challenging to identify and distinguish between things. In this paper, a modified YOLOv1-based neural network is proposed for object detection. The new neural network model improves on YOLOv1 in the following ways. Firstly, the loss function of the YOLOv1 network is modified: the improved model replaces the margin style with a proportion style. Compared to the old loss function, the new one is more flexible and more reasonable in optimizing the network error. Secondly, a spatial pyramid pooling layer is added; thirdly, an inception model with 1 × 1 convolution kernels is added.
Human beings can easily detect and identify objects in their surroundings regardless of the circumstances: no matter what position the objects are in, whether they are upside down, different in color or texture, partly occluded, and so on. Humans therefore make object detection look trivial. The same detection and recognition with a computer requires a great deal of processing to extract information about the shapes and objects in a picture.
In computer vision, object detection refers to finding and identifying an object in an image or video. The main steps involved in object detection include feature extraction [
Object detection is critical in different applications, such as surveillance, cancer detection, vehicle detection, and underwater object detection. Various techniques have been used to detect objects accurately and efficiently for these applications; however, the proposed methods still suffer from limited accuracy and efficiency. To tackle these problems, machine learning and deep neural network methods have proved more effective for object detection.
Thus, in this study, a new modified network is proposed based on YOLOv1 [ The loss function of the YOLOv1 network is optimized, the inception model structure is added, and a spatial pyramid pooling layer is used. The proposed model effectively extracts features from images and performs much better in object detection.
The remainder of this paper is organized as follows. Section
Detecting and identifying multiple objects in an image remains hard for machines. However, noteworthy effort has been devoted in recent years to object detection using convolutional neural networks (CNNs). In the object detection and recognition field, neural networks have been in use for decades but became prominent thanks to improved hardware and new techniques for training these networks on large datasets [
In this section, the proposed model is described in detail. Firstly, the improvement based on the loss function is presented. Secondly, the improvement based on the inception structure model is described. Lastly, the improvement based on the spatial pyramid pooling layer is presented. The symbolic representations are described in Table
The mathematical symbols.
Symbol/notation | Description |
---|---|
λ_coord | Hyperparameter set to ensure a “fair” contribution of the bounding box location |
λ_noobj | Hyperparameter set for bounding box score prediction |
C | Categories |
p(c) | Probability of detected class categories |
m | Training samples |
x | Training input |
ŷ | Output label |
y | Input label |
The following improvements to the YOLO network model are made while maintaining the original model's dominant idea.
The loss function of the original YOLOv1 network assigns the same error to large and small objects, which makes the model's predictions for neighboring objects unsatisfactory: if two objects appear in the same grid, only one object can be detected, and small objects are detected poorly. Compared with the old loss function, the new loss function is more flexible and better optimized. In the new loss function, the original difference is replaced by a proportion. Equation (
In convolutional neural networks, variance function is often used as the loss function [
Here,
In the loss function design of the YOLOv1 network, the variance function is used as part of the entire loss function. The normalization idea of contrast is used to improve it: the improved model replaces the margin style with a proportion style, so the size of the object in the picture is taken into account. The specific modified loss function is shown in
Here,
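To make the proportion idea concrete, the following is a minimal sketch, assuming the proportion-style term divides each coordinate error by its ground-truth value (the paper's exact form is given by the modified loss equation above; λ_coord is the standard YOLOv1 weighting hyperparameter):

```python
import torch

def proportion_coord_loss(pred, target, lambda_coord=5.0, eps=1e-6):
    """Proportion-style localization loss (illustrative sketch).

    YOLOv1's margin-style term sums squared differences (x - x_hat)^2,
    so the same absolute error counts equally for large and small boxes.
    Here each error is divided by the ground-truth value, so the error is
    measured relative to the object's size. `pred` and `target` are
    (N, 4) tensors of (x, y, w, h) for the responsible boxes.
    """
    rel_err = (target - pred) / (target + eps)  # proportion instead of margin
    return lambda_coord * (rel_err ** 2).sum()
```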
The third and fourth layers of the original network are replaced with new inception models. The inception model itself can deepen and widen the network and enhance its representational power; a 64 × 1 × 1 convolutional layer is added between the first and second layers of the original network, which reduces the number of network parameters. Figure
Partial structure of the new network after adding inception.
The inception model can deepen and widen the network, and convolutional kernels of different scales are connected in parallel. Thus, multiscale features can be extracted more effectively, and the hidden information in the image can be exploited more efficiently.
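As an illustration of such a block, here is a minimal GoogLeNet-style inception module in PyTorch; the branch widths (64, 128, 32, 32) are assumptions for the sketch, not the exact configuration of the modified network:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """GoogLeNet-style inception block (illustrative widths).

    Parallel 1x1, 3x3, and 5x5 branches plus a pooled branch are
    concatenated along the channel axis, so features at several scales
    are extracted from the same input. The 1x1 convolutions reduce
    channels before the costlier 3x3/5x5 kernels.
    """
    def __init__(self, in_ch, c1=64, c3=128, c5=32, cp=32):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3 // 2, 1),
                                nn.Conv2d(c3 // 2, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5 // 2, 1),
                                nn.Conv2d(c5 // 2, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```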
Figure The SPP layer can produce a fixed-length output for an input of any size or any aspect ratio, and it can extract pooled features at varying scales.
Partial structure of the new network after adding the SPP layer.
A classifier (SVM/Softmax), as well as fully connected layers, requires a fixed-length vector, which can be generated through Bag-of-Words (BoW) [
By using the SPP layer, richer image feature information is obtained, and a great improvement in the network's time efficiency is also observed. Hence, this technique achieves remarkable detection accuracy.
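A minimal sketch of such an SPP layer follows, assuming a three-level pyramid of 1 × 1, 2 × 2, and 4 × 4 bins (the actual level sizes of the modified network may differ):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling (sketch).

    Max-pools the feature map into fixed grids of 1x1, 2x2, and 4x4 bins
    (the level sizes here are illustrative) and concatenates the results,
    so the output length depends only on the channel count and the pyramid
    levels, not on the input height/width.
    """
    n, c = x.shape[:2]
    pooled = [F.adaptive_max_pool2d(x, l).view(n, -1) for l in levels]
    return torch.cat(pooled, dim=1)  # shape: (n, c * sum(l*l for l in levels))
```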
The following is a comprehensive analysis of our proposed network and the improved YOLO model based on the results of the experimental tests. By analyzing the confusion matrix, we observed which kinds of samples the new network detects well, which kinds it detects poorly, and how easily confused categories can be distinguished, thereby understanding the advantages and disadvantages of the network. We also examined the architecture of the new network model, for example by comparing the number of network parameters, and assessed its performance.
The test results are analyzed through the confusion matrix. A confusion matrix tabulates, for each class, how the actual data is classified, so we can observe which categories of samples are easily confused by the modified network. In the confusion matrix, the rows represent the true categories of the test images, while the columns show the classes assigned to the test images by the network in the actual test.
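For illustration, the same convention can be reproduced with scikit-learn on toy labels (the class names below are hypothetical stand-ins, not the paper's test data):

```python
from sklearn.metrics import confusion_matrix

# Toy example: rows are true classes, columns are predicted classes,
# matching the row/column convention described above.
y_true = ["cat", "dog", "dog", "horse", "cat", "horse"]
y_pred = ["cat", "dog", "horse", "horse", "cat", "dog"]
cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog", "horse"])
print(cm)
# [[2 0 0]
#  [0 1 1]   <- one "dog" image misclassified as "horse"
#  [0 1 1]]
```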
In the original Pascal VOC dataset, there are 20 categories of objects; here, some representative categories that easily cause misidentification are selected.
Table
Confusion matrix for the new network.
From Table
Here, the proposed network architecture is described. Before going into detail, note that the first and second layers share the same structure (a convolutional layer plus a downsampling layer), the third and fourth layers share the same structure (inception + pooling), and the fifth and sixth layers share the same structure (convolutional cascades); the seventh layer is the spatial pyramid pooling layer, and the eighth and ninth layers are fully connected layers.
For the first layer, it is assumed that the input is an image,
The computing area is the size of the convolution kernel area, so the result of (
and the size of the feature map after convolution will become
Next is the maximum downsampling layer; since the downsampling layer does not change the number of feature maps, the number
The calculation of the total number of
The following is the second convolutional layer, assuming that the number of features
The convolution operation with the feature maps of the upper layer is calculated as follows.
Assuming that the output of the maximum downsampling layer in the second layer is characterized by the size of the downsampling window and the step size
From the above, it can be seen that the output feature size of MaxPool2 is
Inception model architecture.
Thus, the whole calculation of the four inception layers can be done in the above way. Next is the fifth convolutional layer, and the total calculation is
Since the sixth layer and the fifth layer have the same structure, the calculation is the same as (
The seventh layer is the pyramid layer, denoted by
The eighth layer is fully connected. Assume that the number of input features is
Because the fully connected layer is inherited from the original neural network, its calculation method is the same as that of a conventional neural network, so the computational cost of this layer is
From the above analysis of the network architecture, it is observed that the network's overall computation is determined by the input image size, the convolution kernel sizes, and the number of convolutional layers, which shows that network depth and width have a big impact.
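As a concrete illustration of this bookkeeping, the short sketch below computes feature-map sizes and parameter counts for a convolution + pooling + fully connected stack; the layer sizes are hypothetical stand-ins, not the exact configuration of the proposed network:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial size of a convolution/pooling output."""
    return (size - kernel + 2 * pad) // stride + 1

# Hypothetical layer sizes for illustration only.
h = w = 448                      # input image, YOLOv1-style resolution
c_in, c_out, k = 3, 64, 7        # first conv: 64 kernels of 7x7 over RGB
h = w = conv_out(h, k, stride=2, pad=3)
conv_params = c_out * (c_in * k * k + 1)       # weights + biases
print(f"conv1 out: {c_out}x{h}x{w}, params: {conv_params}")

h = w = conv_out(h, 2, stride=2)               # 2x2 max pool, stride 2
print(f"pool1 out: {c_out}x{h}x{w} (pooling adds no parameters)")

fc_in, fc_out = c_out * h * w, 4096            # flatten, then FC layer
print(f"fc params: {fc_in * fc_out + fc_out}") # dense weights + biases
```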
Pascal VOC is divided into two datasets: Pascal VOC 2007 and Pascal VOC 2012. The newly designed network was tested on both datasets [
Pascal VOC 2007 dataset images. (a) 000006. (b) 000008. (c) 000014. (d) 000015. (e) 000017. (f) 000018.
Pascal VOC 2012 dataset images. (a) 000028. (b) 000031. (c) 000033. (d) 000036. (e) 000039. (f) 000040.
The whole experiment was conducted on an NVIDIA GeForce GTX 1060 GPU under the Ubuntu operating system. The number of training iterations was 40,000.
The results are discussed, and the network's performance is examined using the t-SNE visualization tool, showing the extent to which the new network is able to extract rich features from images.
Next, the features of a large number of samples are visualized in two dimensions using the t-SNE visualization tool, which maps high-dimensional data to low-dimensional data [
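A minimal sketch of this step with scikit-learn follows; the feature matrix below is a random stand-in for the network's extracted features, and the dimensions and perplexity are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: in practice, `features` holds the high-dimensional feature
# vectors the network extracts per test sample, `labels` their classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 1024))
labels = rng.integers(0, 10, size=500)

# Map the 1024-D features to 2-D for plotting; perplexity 30 is a common default.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedded.shape)  # (500, 2): one 2-D point per sample, colorable by label

# import matplotlib.pyplot as plt
# plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5)
```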
Figure
Two-dimensional visualization of ten samples.
There are about seven categories that do not overlap with one another, indicating that the features of these seven classes differ considerably and are relatively easy to identify. In addition, several classes partly merge, indicating that their features have a certain degree of similarity, which easily causes misidentification. Overall, however, the new network extracts features effectively and robustly, although it is still imperfect and needs further improvement. The improved network was tested on Pascal VOC 2007 and Pascal VOC 2012, respectively. The results are shown in Tables
Pascal VOC 2007 test results.
VOC 2007 | The modified YOLOv1 (%) | VOC 2007 | The modified YOLOv1 (%) |
---|---|---|---|
Aero | 77.9 | Table | 51.2 |
Bike | 77.6 | Dog | 81.9 |
Bird | 63.7 | Horse | 77.5 |
Boat | 47.6 | M-bike | 78.7 |
Bottle | 44.8 | Person | 68.6 |
Bus | 70.7 | Plant | 37.1 |
Car | 68.9 | Sheep | 71.8 |
Cat | 85.3 | Sofa | 58.4 |
Chair | 42.2 | Train | 71.0 |
Cow | 71.9 | Tv | 64.6 |
Average recognition rate | 65.6 |
Pascal VOC 2012 test results.
VOC 2012 | The modified YOLOv1 (%) | VOC 2012 | The modified YOLOv1 (%) |
---|---|---|---|
Aero | 76.1 | Table | 49.1 |
Bike | 67.8 | Dog | 80.3 |
Bird | 58.0 | Horse | 72.7 |
Boat | 39.9 | M-bike | 71.9 |
Bottle | 24.2 | Person | 64.2 |
Bus | 68.9 | Plant | 29.0 |
Car | 57.6 | Sheep | 54.5 |
Cat | 82.5 | Sofa | 55.2 |
Chair | 36.3 | Train | 73.9 |
Cow | 61.1 | Tv | 51.7 |
Average recognition rate | 58.7 |
The data in Tables
Pascal VOC 2007 comparison test results.
VOC 2007 | R-CNN (%) | YOLOv1 (%) | The modified YOLOv1 (%) |
---|---|---|---|
Aero | 63.5 | 78 | 77.9 |
Bike | 66 | 74.2 | 77.6 |
Bird | 47.9 | 61.3 | 63.7 |
Boat | 37.7 | 45.7 | 47.6 |
Bottle | 29.9 | 42.7 | 44.8 |
Bus | 62.5 | 68.2 | 70.7 |
Car | 70.2 | 66.8 | 68.9 |
Cat | 60.2 | 80.2 | 85.3 |
Chair | 32 | 40.6 | 42.2 |
Cow | 57.9 | 70 | 71.9 |
Table | 47 | 49.8 | 51.2 |
Dog | 53.5 | 79 | 81.9 |
Horse | 60.1 | 74.5 | 77.5 |
M-bike | 64.2 | 77.9 | 78.7 |
Person | 52.2 | 64 | 68.6 |
Plant | 31.3 | 35.3 | 37.1 |
Sheep | 55 | 67.9 | 71.8 |
Sofa | 50 | 55.7 | 58.4 |
Train | 57.7 | 68.7 | 71 |
TV | 63 | 62.6 | 64.6 |
Average recognition rate | 53.1 | 63.4 | 65.6 |
Pascal VOC 2012 comparative test results.
VOC 2012 | R-CNN (%) | YOLOv1 (%) | The modified YOLOv1 (%) |
---|---|---|---|
Aero | 68.1 | 77 | 76.1 |
Bike | 63.8 | 64.2 | 67.8 |
Bird | 46.1 | 57.7 | 58 |
Boat | 29.4 | 38.3 | 39.9 |
Bottle | 27.9 | 22.7 | 24.2 |
Bus | 56.6 | 68.3 | 68.9 |
Car | 57 | 55.9 | 57.6 |
Cat | 65.9 | 81.4 | 82.5 |
Chair | 26.5 | 36.2 | 36.3 |
Cow | 48.7 | 60.8 | 61.1 |
Table | 39.5 | 48.5 | 49.1 |
Dog | 66.2 | 77.2 | 80.3 |
Horse | 57.3 | 72.3 | 72.7 |
M-bike | 65.4 | 71.3 | 71.9 |
Person | 53.2 | 63.5 | 64.2 |
Plant | 26.2 | 28.9 | 29 |
Sheep | 54.5 | 52.2 | 54.5 |
Sofa | 38.1 | 54.8 | 55.2 |
Train | 50.6 | 73.9 | 73.9 |
TV | 51.6 | 50.8 | 51.7 |
Average recognition rate | 49.6 | 57.9 | 58.7 |
It can be seen from the tables that our modified model improves recognition over the YOLOv1 and R-CNN models in almost every category. Table
Comparison of test results for time performance.
Metric | R-CNN (s) | YOLOv1 (s) | The modified YOLOv1 (s) |
---|---|---|---|
GPU time/image | 6.9 | 0.14 | 0.11 |
Testing results of our model on Pascal VOC 2007.
Testing results of our model on Pascal VOC 2012.
The testing results demonstrate the robustness of the improved network: it classifies each class accurately and detects the desired objects.
In this paper, we proposed a YOLOv1-based neural network for object detection by modifying the loss function and adding a spatial pyramid pooling layer and an inception module with 1 × 1 convolution kernels.
In the future, we plan to extend our work by building our own benchmark dataset and a hybrid detector for small object detection.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that there are no conflicts of interest regarding this paper.
This work was supported in part by the National Key R&D Program of China under Grant 2018YFC0831404 and the State Grid Corp of China Science and Technology Project “Research on Key Technologies of Knowledge Discovery Based ICT System Fault Analysis and Assisted Decision”.