Vehicle Type Recognition Algorithm Based on Improved Network in Network

Vehicle type recognition algorithms are widely used in intelligent transportation, but their accuracy does not yet meet the requirements of production applications. The multilayer perceptron convolution layer of Network in Network (NIN) efficiently extracts nonlinear features from the local receptive fields of an image. Global average pooling (GAP) helps prevent the network from overfitting, and small convolution kernels reduce the dimensionality of the feature maps as well as the number of trainable parameters. On that basis, residual connections are adopted to build a novel NIN model by altering the size and layout of the original convolution kernels of NIN. The feasibility of the algorithm is verified on the Stanford Cars dataset. With properly set weights and learning rates, the accuracy of the NIN model for vehicle type recognition reaches 97.2%.


Introduction
Intelligent transportation [1] is an active research topic, and vehicle type recognition [2] is one of its foundations. Existing vehicle type recognition algorithms fall primarily into three groups: manual feature descriptions, 3D models, and artificial intelligence algorithms. Early methods used manual feature descriptors (e.g., SIFT [3] and HOG [4]) to extract vehicle features and then combined them with classifiers such as SVM and decision trees. Because feature extraction and data reconstruction are difficult to achieve, Hsieh et al. [5] employed HOG and a symmetric SURF descriptor to extract vehicle features on a generated mesh. Liao et al. [6] used the appearance and semantic segmentation of vehicle parts to recognize vehicle types. Biglari et al. [7] exploited the overall appearance of vehicles and the feature differences of their components to train an SVM classifier. These algorithms are easily affected by environmental factors (e.g., lighting and background), so their recognition accuracy is relatively low. Because the shooting angle of vehicle images varies randomly, 3D model-based vehicle type recognition methods were developed. A 3D model can reflect the spatial relationships between local features and the whole vehicle, and existing studies [8,9] effectively performed 3D modeling and feature extraction of vehicles. Artificial intelligence gave vehicle type recognition a new impetus, because vehicle features can be extracted automatically. Dong et al. [10] adopted a sparse Laplace filter and a semisupervised convolutional neural network to extract vehicle features and classify vehicles.
Studies [11-14] employed different methods or optimized existing neural networks for vehicle type recognition and improved the results significantly; however, for similar vehicles with a remarkably small feature gap (e.g., Volkswagen models whose front faces are nearly identical), the room for improving classification accuracy is limited.
In view of the low accuracy of vehicle type recognition, we propose an improved NIN that achieves high recognition accuracy. The key to vehicle type recognition is the efficient extraction of nonlinear vehicle features. NIN [15] uses a multilayer perceptron convolution layer (MLPConv) with a micronetwork structure that efficiently and automatically extracts local nonlinear features of images. The present study exploits the following features of the NIN model. Its 1 × 1 convolution kernel reduces the dimensionality of the feature maps and the number of network parameters. Its global average pooling (GAP) layer combines the features effectively and prevents the whole network from overfitting. The improvements are as follows: the original large convolution kernels of NIN are replaced by small convolution kernels, which increases the depth of the convolutional neural network and improves its performance. To avoid the vanishing gradient problem caused by the increased depth, residual connections are added to the structure to counter network degradation.
The improved NIN classifies well, and its accuracy in vehicle type recognition is better than that of VGG and GoogLeNet. Verified on the Stanford Cars dataset with reasonable weight and learning rate settings, the vehicle type recognition accuracy of the improved NIN reaches 97.2%.

Related Works
The 1 × 1 small convolution kernel, GAP, the micronetwork structure, and other measures proposed by NIN underpin later deep convolutional neural networks (CNNs). A CNN [16] extracts image features automatically, so the complex feature extraction and data reconstruction of conventional recognition algorithms can be avoided. AlexNet [17], VGGNet [18-21], GoogLeNet [22,23], ResNet [24-27], and other networks can be adopted for vehicle type recognition; however, because of limited sample quality and quantity, as well as shortcomings in feature extraction and classification performance, their vehicle recognition accuracy is relatively low.
Most networks can only extract linear features from images, which confuses the classification algorithm because the linear features of different objects can be essentially identical (Figure 1(a) and (b)). Only the overall information built from the linear features can be classified (Figure 1(c)).
In Figure 1, the linear features denoted by (a) and (b) are consistent: both are a line segment that forms part of an object, with no difference between them. Given the overall information, however, the objects represented in (c) are completely different.
Thus, the question is how to extract such nonlinear features effectively. This question is answered by the micronetwork [28,29] structure embedded in NIN, i.e., a fully connected layer consisting of two layers of convolution. In a neural network, two layers of fully connected hidden neurons can approximate arbitrary curves.
"Micronetwork" Structure.
In 2013, NIN changed the prevailing idea of network structure: it built a multilayer perceptron by replacing the conventional linear perceptron with an embedded "micronetwork"; as a result, nonlinear features of the local receptive fields of images are extracted far more efficiently.
In NIN, the "micronetwork" is a general nonlinear function approximator. The MLPConv of NIN and the linear perceptron of a CNN differ in how they extract image features. MLPConv consists of several fully connected layers with nonlinear activation functions, shared by all local receptive fields. By sliding over the input, it generates a feature map that is passed to the next layer. MLPConv can combine different feature maps, so the network can extract complex and useful nonlinear image features. The overall NIN structure is formed by stacking multiple MLPConv layers.
There are two reasons why NIN selects the multilayer perceptron: (1) MLPConv fits the structure of the convolutional neural network and (2) MLPConv can itself act as a deep model, in keeping with the spirit of feature reuse [22]. The feature map of MLPConv is calculated as

$$
\begin{aligned}
f^{1}_{i,j,k_1} &= \max\left( {w^{1}_{k_1}}^{T} x_{i,j} + b_{k_1},\ 0 \right), \\
&\ \ \vdots \\
f^{n}_{i,j,k_n} &= \max\left( {w^{n}_{k_n}}^{T} f^{n-1}_{i,j} + b_{k_n},\ 0 \right),
\end{aligned}
$$

where $n$ denotes the number of layers of the multilayer perceptron; $(i, j)$ represents the pixel index in the feature map; $x_{i,j}$ indicates the input block centred on the position $(i, j)$; $k$ is the channel index of the feature map; and $b_{k_1}$ is the bias. ReLU acts as the activation function in MLPConv.
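As a rough illustration of the computation described above, the following sketch applies a stack of perceptron layers, each followed by ReLU, to a single flattened input patch. The helper names (`relu`, `dense`, `mlpconv_at_pixel`) and the toy weights are hypothetical and chosen only for this example; a real MLPConv would slide the same layers over every position of the input.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(v, 0.0) for v in values]


def dense(weights, bias, x):
    """One perceptron layer; weights holds one row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlpconv_at_pixel(patch, layers):
    """Apply stacked (weights, bias) layers, each followed by ReLU, to one patch."""
    out = patch
    for weights, bias in layers:
        out = relu(dense(weights, bias, out))
    return out


# Hypothetical toy sizes: a flattened patch of 4 values, then 3 and 2 channels.
patch = [1.0, -2.0, 0.5, 3.0]
layers = [
    ([[0.5, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 2.0, 1.0]], [0.0, 0.0, -1.0]),
    ([[1.0, 1.0, 1.0],
      [1.0, 0.0, -1.0]], [0.0, 0.5]),
]
features = mlpconv_at_pixel(patch, layers)
print(features)
```

Because the layers after the first one act on the channel vector of a single position, they are equivalent to 1 × 1 convolutions when applied across the whole feature map.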

Global Average Pooling Layer.
In classification, GAP [30,31] remedies the defects of the fully connected layer. In earlier networks, the feature maps of the final convolutional layer are vectorized and passed into fully connected layers, whose output is then fed to the SoftMax layer [32-34]. Because the fully connected layer is prone to overfitting, the generalization ability of the whole network is reduced; later networks apply dropout [24] to the fully connected layer, which mitigates overfitting considerably. NIN instead uses GAP to make each feature map of the last MLPConv layer correspond to one classification category, which fits the convolutional structure more naturally. GAP has no parameters to optimize, so it avoids overfitting at this layer, and its regularization effect is stronger than that of dropout.

The 1 × 1 convolution was first proposed by NIN and contributes substantially to network performance. Through 1 × 1 convolution, MLPConv reduces the channel dimension of the feature maps and the number of parameters. The main functions of 1 × 1 convolution are as follows:
(1) Dimensionality reduction: for instance, applying 20 filters of 1 × 1 convolution to a 500 × 500 input with a depth of 100 yields a result of size 500 × 500 × 20.
(2) Enhanced nonlinear expression ability: because the 1 × 1 convolution is followed by an activation layer, it adds a nonlinear activation to the representation learned by the previous layer, which enhances the expressive power of the network.
(3) Increased model depth: the number of model parameters can be reduced while the network becomes deeper, so the representational capacity of the model is enhanced to some extent.
Figure 2 illustrates an NIN structure with 4 MLPConv layers and 1 GAP layer.
Subsampling layers can be added between MLPConv layers, and the number of layers of the "micronetwork" can be altered for specific tasks. Taking the first MLPConv as an example, the input image is 224 × 224 × 3, where 224 is the height and width of the input image in pixels and 3 is the number of channels. The convolution filter slides over the input image and computes the inner product at each position. The convolution filter has a size of 11 × 11 × 3, i.e., its length and width are both 11 and its depth is 3. The first layer of MLPConv uses 96 such filters. The embedded "micronetwork" is a fully connected neural network built from two layers of convolutional kernels that performs nonlinear feature extraction, with 96 neurons in each layer. Figure 2 presents one of the models compared in the subsequent experiments, and the specific parameter settings are given in the figure.
In the present study, the nonlinear feature extraction capacity of NIN is exploited to extract vehicle features from the image (e.g., texture and topological structure) and thereby improve the efficiency of vehicle type recognition. On that basis, by adjusting the size, quantity, and layout of the convolutional kernels in NIN, network performance and convergence speed are improved, so that NIN is trained efficiently on the vehicle sample data and the vehicle recognition accuracy is enhanced. Subsequently, the residual idea is adopted to address the vanishing gradient caused by the growing number of network layers.
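The two mechanisms discussed above can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: `global_average_pooling` reduces each class feature map to one score without any trainable parameters, and the two `conv1x1_*` helpers reproduce the dimensionality-reduction example (depth 100 reduced to 20). All function names and the toy feature maps are assumptions made for this sketch.

```python
import math


def global_average_pooling(feature_maps):
    """Average each channel's H x W map into one value (no trainable parameters)."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conv1x1_output_shape(height, width, num_filters):
    """A 1 x 1 convolution keeps the spatial size and changes only the depth."""
    return (height, width, num_filters)


def conv1x1_param_count(in_channels, num_filters, with_bias=True):
    """Each filter holds one weight per input channel, plus an optional bias."""
    return in_channels * num_filters + (num_filters if with_bias else 0)


# Toy example: 2 class channels, each a 2 x 2 feature map.
maps = [[[1.0, 3.0], [5.0, 7.0]],
        [[0.0, 0.0], [2.0, 2.0]]]
pooled = global_average_pooling(maps)   # one score per class
probs = softmax(pooled)

# Dimensionality-reduction example from the text: depth 100 reduced to 20.
shape = conv1x1_output_shape(500, 500, 20)
params = conv1x1_param_count(100, 20)
print(pooled, shape, params)
```

Since the pooling step has nothing to learn, the only parameters in such a classifier head are those of the preceding convolutions.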
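The sizes quoted for the first MLPConv layer can be checked with a short calculation. The sketch below uses the standard output-size formula for a convolution; the stride of 4 and zero padding are assumptions (the text does not state them), and the function names are illustrative only.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def conv_param_count(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


ASSUMED_STRIDE = 4  # assumption: not given in the text

# 224 x 224 x 3 input, 96 filters of size 11 x 11 x 3.
out_size = conv_output_size(224, 11, stride=ASSUMED_STRIDE)
first_conv_params = conv_param_count(11, 3, 96)

# The micronetwork: two 1 x 1 layers with 96 neurons each.
micronetwork_params = 2 * conv_param_count(1, 96, 96)

print(out_size, first_conv_params, micronetwork_params)
```

Under these assumptions, the 11 × 11 filters account for most of the parameters in the block, which is part of the motivation for replacing them with smaller kernels.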

Optimized NIN
At present, network performance is enhanced primarily by two measures. One is to increase the width or depth of the network; for instance, VGG enhances network performance by increasing network depth. The other is to optimize the input sample data, e.g., by increasing the number of samples, strengthening the texture of the samples, or transforming the shape of the sample images (inversion and distortion). As a network is deepened or widened, however, its defects gradually appear: the gradient vanishes, the number of parameters becomes huge, and the extracted features tend to lose their usefulness as they propagate through the network. In the present study, NIN is optimized by the following two means.

Use of Small Convolution Kernel.
The small convolution kernel increases the network depth and improves the network performance, and it also significantly reduces the number of network parameters. Convolution kernels of size 3 × 3 and 5 × 5 are used extensively in many networks, and 3 × 3 is the smallest size that can capture the 8-neighbourhood information of a pixel. Small convolution kernels are stacked to replace a large convolution kernel while the size of the receptive field remains unchanged. Multiple stacked 3 × 3 convolution kernels provide more nonlinearities (more layers of nonlinear functions) than a single convolutional layer with a large kernel, and they also have fewer parameters. If the input and output feature maps of each convolutional layer are assumed to have the same number of channels $C$, the number of parameters of three 3 × 3 convolutional layers is $3 \times (3 \times 3 \times C \times C) = 27C^2$, whereas one 7 × 7 convolutional layer has $49C^2$ parameters. Thus, the small convolution kernel significantly reduces the number of network parameters.
AlexNet and NIN employ large convolution kernels in their first layers, and their classification accuracy is not significantly enhanced. Even though NIN employs a micronetwork as a local nonlinear feature collector, this only increases the convergence speed of the model. VGG uses 3 × 3 convolution kernels throughout, and GoogLeNet contains 3 × 3, 5 × 5, and 1 × 1 kernels; the classification performance of the VGG and GoogLeNet models is better than that of the former two. This is also partly attributable to the larger number of network layers. As discussed in Section 2, the 1 × 1 convolution kernel can raise or reduce dimensionality and can reduce the number of network parameters.
An experiment is performed to verify the influence of small convolution kernels on model classification. The MNIST dataset is employed in the experiment, and the network structure in Figure 3 is adopted. The experiment is split into two groups to verify the effect of 7 × 7, 5 × 5, and 3 × 3 convolution kernels on the network performance. For each of the four models, we record the number of iterations at which the accuracy first exceeds 0.6, 0.7, 0.8, and 0.9, the number of iterations at which the maximum accuracy is first reached, the maximum accuracy itself, and the time consumed. Each model experiment is repeated 50 times, and the average number of iterations is listed in Table 1. Table 2 shows that the small convolution kernel enhances the extraction of local receptive field features and increases the classification accuracy of the model. Three 3 × 3 convolution kernels are equivalent to one 7 × 7 convolution kernel, and two 3 × 3 convolution kernels are equivalent to one 5 × 5 convolution kernel. For the same receptive field, a comparison readily shows that the 3 × 3 convolution kernel has the highest recognition efficiency. In all effective intervals, the average number of experimental iterations of the 5 × 5 convolution kernel is smaller than that of the 3 × 3 and 7 × 7 convolution kernels. The 3 × 3 convolution kernel exhibits the highest accuracy, the accuracy of the 5 × 5 convolution kernel is relatively low, and the 7 × 7 convolution kernel exhibits significantly lower accuracy. Accordingly, in general, the 3 × 3 convolution kernel has the maximum recognition efficiency and the fastest rise in accuracy; that is, it performs better at extracting local features of images.
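The parameter comparison above can be reproduced with a short calculation. The sketch below assumes stride-1 convolutions, ignores biases, and uses a hypothetical channel count of C = 64 purely for illustration.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def stacked_weight_count(kernel_size, num_layers, channels):
    """Weights (biases ignored) when every layer maps C channels to C channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64  # hypothetical channel count

three_small = stacked_weight_count(3, 3, C)   # 27 * C^2
one_large = stacked_weight_count(7, 1, C)     # 49 * C^2

print(stacked_receptive_field(3, 3), stacked_receptive_field(7, 1))
print(three_small, one_large, three_small / one_large)
```

Three stacked 3 × 3 layers cover the same 7 × 7 receptive field with 27/49 of the weights, roughly 55%, independent of the value chosen for C.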

Complexity
To improve the vehicle type recognition accuracy, the NIN structure is optimized. The size, quantity, and layout of the convolution kernels of the NIN structure in Section 2 are tuned to exploit the advantages of the small convolution kernel, namely, extracting local features of the image and reducing the number of computational parameters of the network. Figure 4 shows that the 11 × 11 convolution kernel of the first layer is replaced by four 3 × 3 convolution kernels.

Use of Residual Blocks.
Since AlexNet, state-of-the-art CNN architectures have grown steadily deeper, but network depth cannot be increased simply by stacking layers. The reason is that, as the gradient is backpropagated to earlier layers, repeated multiplication may make it infinitesimally small until it vanishes; the deep network then becomes difficult to train, and its performance saturates or even drops rapidly. To address this problem, He et al. proposed the residual network ResNet; in 2015, the proposed network won first place in the ImageNet image recognition challenge, and it has deeply influenced the design of later deep neural networks. He et al. argued that a deep network built by stacking identity mappings on a shallow network should not produce a higher training error than the shallow network itself. As shown in Figure 5, the residual block satisfies this condition, and the input can propagate forward faster through the cross-layer data path. ResNet is not the first model to exploit shortcut connections: highway networks [35] and long short-term memory (LSTM) [36] units employ different gate structures to build such connections.
ResNet (Figure 6) follows the all-3 × 3 convolutional layer design of VGG. The residual block first contains two 3 × 3 convolutional layers with an identical number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. The input then skips the two convolutional operations and is added directly before the final ReLU activation function. This design requires the output of the two convolutional layers to have the same shape as the input so that the two can be added. To change the number of channels, an additional 1 × 1 convolutional layer is introduced to transform the input into the required shape before the addition.
Drawing on the small convolution kernel and the residual concept, the NIN is further optimized: the convolution kernels in NIN are replaced by 3 × 3 convolution kernels so that the network converges and trains rapidly. Residual connections build data paths between earlier and later layers of the network, so that feature maps are transmitted efficiently across convolutional layers, which counteracts the shrinking of the gradient and avoids gradient vanishing. Following the design requirements of ResNet, the optimized NIN structure is illustrated in Figure 6.
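The structure of the residual block described above can be sketched on plain vectors. This is a simplified illustration under stated assumptions: linear maps stand in for the two 3 × 3 convolutions, batch normalization is omitted, and the optional `projection` matrix plays the role of the 1 × 1 convolution that changes the number of channels. All names are hypothetical.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(v, 0.0) for v in values]


def linear(weights, x):
    """Matrix-vector product; a stand-in for a convolution in this sketch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def residual_block(x, w1, w2, projection=None):
    """Compute ReLU(F(x) + shortcut(x)), where F is two layers with a ReLU between.

    Batch normalization is left out of this simplified sketch.
    """
    residual = linear(w2, relu(linear(w1, x)))
    shortcut = linear(projection, x) if projection is not None else x
    if len(residual) != len(shortcut):
        raise ValueError("residual and shortcut must have the same shape")
    return relu([r + s for r, s in zip(residual, shortcut)])


x = [1.0, 2.0]

# With zero weights in F, the block reduces to the identity mapping (after ReLU).
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zeros_2x2, zeros_2x2)

# Changing the channel count from 2 to 3 requires a projection on the shortcut.
zeros_3x2 = [[0.0, 0.0] for _ in range(3)]
zeros_3x3 = [[0.0, 0.0, 0.0] for _ in range(3)]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projected_out = residual_block(x, zeros_3x2, zeros_3x3, projection)

print(identity_out, projected_out)
```

The zero-weight case shows why added residual blocks need not raise the training error: if the extra layers learn nothing, the block still passes its input through unchanged.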

Implementation of Optimized NIN
The optimized NIN uses 3 × 3 and 1 × 1 convolution kernels [37,38]. The 3 × 3 convolution kernel is used to increase the network depth and improve the network performance. The 1 × 1 convolution kernel is used to enhance the ability of the network to extract nonlinear features. In the optimized NIN structure, GAP is used as the classifier instead of a fully connected layer, which improves the generalization ability of the network and avoids overfitting. To avoid the vanishing gradient caused by the increased network depth, residual connections are arranged between consecutive 3 × 3 convolutional layers of the optimized NIN to prevent network degradation. The partial source code of the optimized NIN is as follows: (Algorithm 1)

Results and Discussion
The representative Stanford Cars dataset is adopted in the experiment. Its images are taken in varying scenes, with different vehicle postures [39] and unfixed resolutions, so vehicle type recognition on this dataset is more challenging. The Stanford Cars dataset [...] optimized NIN (layer 20) act as the comparison network models. GAP + SoftMax is employed as the classifier of all these network models, and all networks are trained with the same input dimensions. The preprocessing of the dataset and its split into training and validation sets follow [40]: each image is resized to 256 × 256; the 4 corners and the centre are cropped to generate 5 images of size 224 × 224; and mirroring then yields 10 training images in total, from which the mean of the training set images is subtracted to obtain the training input data. In the present study, appropriate initial weights and learning rates are set manually. Training starts from the initial weights and learning rate and continues until the accuracy on the training set stops improving; the learning rate is then reduced to one-tenth of its value. This process is repeated five times. The model weights are updated with stochastic gradient descent, and the initial learning rate is 0.01.
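The cropping scheme described above (4 corners, the centre, and their mirrored copies) can be sketched as follows. The example operates on a small nested-list "image" so that it stays self-contained; the function names are hypothetical, and mean subtraction is left out.

```python
def ten_crop_boxes(image_size=256, crop_size=224):
    """Return (left, top, mirrored) for 4 corner crops, the centre crop, and their mirrors."""
    far = image_size - crop_size
    mid = far // 2
    origins = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
    return [(left, top, mirrored)
            for mirrored in (False, True)
            for left, top in origins]


def crop(image, left, top, size, mirrored=False):
    """Cut a size x size window from a 2-D list; optionally flip it horizontally."""
    window = [row[left:left + size] for row in image[top:top + size]]
    return [row[::-1] for row in window] if mirrored else window


# Crop positions for the sizes used in the text.
boxes = ten_crop_boxes(256, 224)

# Toy 4 x 4 image cropped to 2 x 2, to show the ten resulting views.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
crops = [crop(toy, left, top, 2, mirrored)
         for left, top, mirrored in ten_crop_boxes(4, 2)]

print(boxes[:5])
print(crops[4], crops[9])
```

Each source image therefore contributes ten 224 × 224 training views, which enlarges the effective training set without collecting new images.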

Vehicle Type Recognition Performance.
After repeated training of the models on the Stanford Cars sample data, the classification accuracy and the number of iterations at which each accuracy level is first reached are determined, as listed in Table 2.
The optimized NIN retains the original MLPConv of NIN. The "micronetwork" structure can approximate the nonlinear features of the image, so the optimized NIN converges fast. By replacing the large convolution kernels of the original NIN with small convolution kernels, the optimized NIN has more layers than the original NIN. Stacked 3 × 3 convolution kernels are computationally equivalent to a larger kernel (for example, two of them are equivalent to a 5 × 5 convolution kernel). Using this conversion, all the large convolution kernels of the original NIN are replaced by small 3 × 3 convolution kernels, which increases the number of convolutional layers of the NIN and enhances the network performance. Residual connections are deployed in the NIN structure to avoid the vanishing gradient and restrain the degradation of network performance. Table 2 shows that NIN needs fewer iterations than VGG and GoogLeNet to reach each accuracy level, which indicates that NIN converges faster than VGG and GoogLeNet. At the end of the experiment, however, the recognition accuracy of NIN did not exceed that of VGG and GoogLeNet, even after many more training iterations. The optimized NIN, by contrast, keeps good convergence because the "micronetwork" structure can extract the nonlinear features of the vehicle image. In addition, through its residual layout, the optimized NIN solves the problem of gradient weakening during computation and strengthens the feature maps for subsequent computation. Therefore, the optimized NIN model outperforms VGG and GoogLeNet in accuracy and convergence speed, and the final vehicle type recognition accuracy reaches 97.2%.

Convergence Effect of Optimized NIN.
The experimental data of NIN, VGG19, GoogLeNet, and the optimized NIN in the first 3000 iterations of the third experiment are extracted, and the training error curves of the four networks on the sample data are plotted (Figure 7). Figure 7 shows that the training error of the optimized NIN is significantly lower than that of the other three networks throughout training. At around 1300 iterations, the training error of the optimized NIN model stopped decreasing. We then reduced the learning rate of all compared models to one-tenth of its original value. Each model continued to learn at the new learning rate, and the training error dropped sharply, which speeds up training. At the 3000th iteration, the training error of the optimized NIN drops to 19.6%, while the error rates of NIN, VGG19, and GoogLeNet fall to 31.2%, 28.9%, and 24.6%, respectively.
This also indicates that the optimized NIN exhibits good convergence and accelerates the training of vehicle type recognition.

Conclusions
In the present study, the structure and key components of NIN are analysed. It is verified that the micronetwork embedded in NIN can efficiently extract the nonlinear features of vehicle images and that GAP avoids model overfitting and acts as a regularizer; in addition, the 1 × 1 small convolution reduces the dimensionality of the feature maps and the number of model parameters. Based on NIN, a novel vehicle type recognition algorithm is built by changing the size and layout of the convolution kernels and adopting the residual idea. The algorithm is verified on the Stanford Cars dataset, and the result shows that it achieves better vehicle type recognition performance, with a recognition accuracy of 97.2%. However, the optimized NIN also has shortcomings. First, within the same local receptive field, a large convolution kernel can be replaced by small convolution kernels; although this reduces the number of parameters compared with the large convolution kernel, the training time increases considerably and the efficiency is reduced. Second, the strategy for optimizing NIN is to deepen the network. To some extent, residual connections solve the vanishing gradient problem and restrain the degradation of network performance. Whether this approach can support a further increase in network depth remains to be studied, which points out the direction for our future work.

Data Availability
The authors used the vehicle dataset provided by Stanford University to verify the improved model. The Cars dataset contains 16,185 images of 196 classes of cars. The data are split into 8,144 training images and 8,041 testing images, where each class has been split roughly 50-50. Classes are typically at the level of make, model, and year, for example, 2012 Tesla Model S or 2012 BMW M3 coupe; visit http://ai.stanford.edu/~jkrause/cars/car_dataset.html.

Conflicts of Interest
The authors declare no conflicts of interest.