Convolutional Neural Network Based Vehicle Classification in Adverse Illuminous Conditions for Intelligent Transportation Systems

School of Electrical Engineering and Computer Science (SEECS), National University of Science and Technology (NUST), H-12, Islamabad 44000, Pakistan College of Technological Innovation, Zayed University, Abu Dhabi, UAE Control, Automotive and Robotics Lab Affiliated Lab of National Center of Robotics and Automation (NCRA HEC Pakistan), Department of Computer Science and Information Technology, Mirpur University of Science and Technology (MUST), Mirpur-10250, Pakistan Institute of Management Sciences, Peshawar, Pakistan Department of Computer Science and Engineering, Chungnam National University, Daejeon, Republic of Korea Department of Engineering, Lancaster University, Lancaster LA1 4YW, UK Department of Electrical Engineering, COMSATS University, Lahore 54000, Pakistan Department of Computer Science, Faculty of ICT, Balochistan University of Information Technology Engineering and Management Sciences, Quetta, Baluchistan, Pakistan


Introduction
With the exponential growth of vehicle production around the world, vehicle classification systems can play a significant role in the development of intelligent transportation systems, e.g., automated highway toll collection, perception in self-driving vehicles, and traffic flow control. Earlier, laser and loop-induction sensor-based methods were proposed for vehicle type classification [1][2][3][4]. These sensors were installed under the road pavement to collect and analyse data and extract relevant information about vehicles. However, the precision and stability of these methodologies are significantly affected by adverse weather conditions and impairment of the road pavement [5]. In step with the advancement of computer vision, image processing and pattern recognition-based vehicle classification systems have been proposed [6,7]. A computer vision-based classification system is essentially a two-step procedure: in the first step, handcrafted extraction methods are used to obtain visual features from the input visual frame; in the second step, machine learning classifiers are trained on the extracted features to perform classification. Handcrafted features are categorized into (i) global and (ii) local features, which describe and represent the image data, respectively [8].
These features are combined in the training of traditional machine learning classifiers to perform object recognition.
Though these systems perform well in specific controlled environments and are more convenient in terms of installation and maintenance than the existing laser- and induction-based schemes, they are trained on limited handcrafted features extracted from small datasets, and extensive prior knowledge is required to maintain accuracy in a real-time environment [9].
Recently, deep learning-based feature extraction and classification methods have been introduced, demonstrating better applicability and adaptability than traditional classification systems. Convolutional neural network (CNN) based classification systems have achieved significant accuracy on large-scale image datasets due to their sophisticated architectures [10][11][12].
Though the development of the graphics processing unit (GPU) has significantly increased the image-processing capabilities of computing machines, CNN-based classification systems require large amounts of data to sustain accuracy and ensure generalization. Until recently, to the best of our knowledge, no generalized benchmark dataset has been available for the development and evaluation of vehicle classification systems. The available vehicle classification datasets are consequently relatively small and based on limited classes from specific regions, e.g., the CompCars [13] and Stanford Cars [14] datasets. Intelligent transportation systems in those regions can achieve significant results with these datasets; however, their performance degrades in the presence of nonregional classes. To address the above-mentioned limitations of vehicle classification systems, we make the following contributions.
(i) A convolutional neural network (CNN) based generalized vehicle classification architecture is presented to improve the robustness of vehicle classification for intelligent transportation systems (ITS) in adverse illumination conditions.
(ii) A local dataset comprising 10,000 images across six classes (i.e., car, van, truck, motorbike, rickshaw, and mini-van) has been collected from traffic surveillance and driving videos. These classes are unique in design and shape and are not covered in the existing vehicle datasets.
(iii) The modified CNN has been further trained on the VeRi dataset, containing 50,000 images over six vehicle classes, to ensure generalization of the network.
(iv) Finally, an extensive comparison between the proposed and existing vehicle classification methods demonstrates the effectiveness of the proposed classification network.
The rest of the paper is organized as follows. Section 2 briefly discusses the existing handcrafted and deep learning feature-extraction and vehicle-classification methods. Section 3 elaborates the network architecture along with the preprocessing and dataset collection. Section 4 presents the results and the comparison study. Finally, Section 5 concludes the article.

Related Work
In step with the rapid advancement of artificial intelligence, vision-based vehicle classification is considered an important element of the perception module of self-driving vehicles. In the existing research [5], vision-based vehicle classification falls into two major categories: (i) handcrafted features-based and (ii) deep features-based methodologies.
In the early era of computer vision, handcrafted features-based vehicle classification methods were proposed for intelligent transportation systems. In this regard, Ng et al. [15] proposed a HOG-SVM method that trains an SVM classifier on HOG features with a Gaussian kernel function. The classifier was evaluated on a 2800-image dataset of surveillance videos and classified motorcycles, cars, and lorries with 92.3% accuracy. In another work, Chen et al. [16] presented a classification method that extracts texture and HOG features and classifies vehicles using a fuzzy-inspired SVM classifier. The classifier was evaluated on a dataset of 2000 images, in which the proposed system classified cars, vans, and buses with 92.6% accuracy. Matos et al. [17] proposed a combined method based on two neural networks, embedding features such as the height, width, and bounding borders of the vehicles; the resulting classifier achieved 69% accuracy on a dataset of 100 images. Furthermore, Cui et al. [19] proposed an AdaBoost-based fast-learning vehicle classifier to distinguish the data into vehicle and nonvehicle classes, along with an algorithm to extract Haar-like features for rapid learning of the classifier. The classifier was evaluated on the public Caltech dataset, achieving 92.89% accuracy.
To overcome the issues of handcrafted features-based classifiers, deep features-based systems have been proposed. Dong et al. [20] presented a CNN-based semisupervised method for real-time vehicle classification. A sparse Laplacian filter-based method was devised to extract relative vehicle information, and the softmax layer was trained to calculate the class probability of the corresponding vehicle. The method was evaluated on the BIT-Vehicle dataset and achieved 96.1% and 89.6% accuracy on day and night images, respectively. In another work, Wang et al. [21] presented a Fast R-CNN based vehicle classification method for traffic surveillance in a real-time environment. A crossroad dataset of 60,000 images was collected and divided into training and test data, on which the proposed method attained 80.051% accuracy. Cao et al. [22] proposed a combined CNN and end-to-end architecture for vehicle classification in unconstrained road environments. The framework was evaluated on the CompCars view-aware dataset, in which the classifier achieved a 0.953 accuracy rate. Chauhan et al. [23] proposed a CNN-based framework for vehicle classification and counting on highway roads and reported 75% mAP on a collected dataset of 5562 CCTV camera videos of highway traffic. Jo et al. [24] proposed a transfer learning-based GoogLeNet framework for road traffic vehicle classification, achieving a 0.983 accuracy rate in experiments on the ILSVRC-2012 dataset. Kim et al. [25] proposed a PCANet-HOG-HU based combined feature extraction method, whose output is provided to an SVM to train the classification model.
Moreover, the authors collected a dataset of 13,700 vehicle images covering six categories (i.e., motorcycle, van, car, truck, mini-bus, and large-bus), extracted from surveillance videos for the training and testing of the proposed classification network. Results demonstrated that the proposed lightweight classifier achieved 98.34% average accuracy on this dataset.
Though deep feature-based approaches can effectively enhance the accuracy of vehicle classification, these methodologies need a huge amount of data to achieve significant accuracy in real-time ITS applications [26][27][28][29]. Although extensive research has been carried out in this field in recent years, the available public datasets for self-driving vehicles and intelligent transportation systems comprise modern vehicle types that are common in well-developed countries. Consequently, these classification systems are not feasible for intelligent transportation systems in Asian countries, e.g., Pakistan, India, Bangladesh, and China. The above-mentioned issues indicate the need for a novel vehicle classification system, along with a dataset that covers the common vehicles of Asian countries, i.e., traditional trucks, buses, cars, rickshaws, and motorbikes.

The Proposed Method
To address the above-mentioned issues, we present a new vehicle dataset comprising 10,000 images in six classes based on common road traffic vehicles, as illustrated in Figure 1. To enhance the performance of the proposed classification in real-time ITS applications, the existing pretrained AlexNet [30], VGG [31], GoogleNet [32], Inception-v3 [33], and ResNet [34] are first fine-tuned on the self-constructed dataset. Based on their performance, the best-performing model is selected for further fine-tuning to increase the classification accuracy of the network. To ensure generalization, the proposed network is additionally fine-tuned on the public VeRi dataset for robust performance in the intelligent transportation systems of different regions. The whole process is summarized in Figure 1.

Dataset.
In deep learning-based classification systems, the dataset is a key input that lets the algorithm learn the features needed to perform predictions. Currently, to the best of our knowledge, there is no generalized public vehicle dataset containing images of the common vehicles needed to address such classification problems. For example, the CompCars and Stanford Cars datasets only contain classes of modern cars from certain regions, which cannot be employed in the real-time classification systems of other regions. Moreover, the proposed dataset differs from the existing datasets in terms of features and representations. Additionally, the existing vehicle classification systems are trained on relatively small datasets containing limited classes, which do not perform well in real-time intelligent transportation system applications [35]. To address these issues, road surveillance and driving videos were collected from different regions to extract vehicle images. Based on these analyses, six common road vehicle classes were identified, and the dataset was formed through manual labelling using the Windows editing tool, as shown in Figure 2. The dataset comprises 10,000 images categorized into six classes (i.e., car, bus, van, truck, motorbike, and rickshaw), with roughly 1,670 images per class.

Data Augmentation.
Data augmentation is the easiest and most common technique to mitigate overfitting in a network by artificially expanding the dataset through label-preserving transformations [36]. To increase the diversity of our dataset, we employed four distinct types of data augmentation: (i) Gaussian blur, (ii) rotation, (iii) horizontal flip, and (iv) Gaussian noise, as shown in Figure 3.
We applied Gaussian blur with a 5 × 5 kernel to reduce high-frequency noisy pixels while preserving low spatial frequencies, by convolving the Gaussian kernel over the 224 × 224 image. As the second type of augmentation, we applied a 10-degree rotation to the original dataset images to generate a diverse view of the original data. The third type generates data through horizontal flipping of the original dataset, whereas Gaussian noise is used as the fourth type to add a random luminous factor to the data. It is important to mention that the horizontal flip, Gaussian blur, and rotation are applied to the training set, whereas Gaussian noise is applied to the test set, as shown in Figure 3. The main purpose of applying Gaussian noise-based augmentation to the test set is to validate the efficacy of the proposed classification network on noisy data.
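As an illustration (a minimal pure-Python sketch, not the authors' implementation), two of the four augmentations applied to a grayscale image stored as a nested list of pixels; the blur and rotation would follow the same label-preserving pattern, typically via a library such as OpenCV or torchvision:

```python
import random

def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right; the class label is unchanged."""
    return [row[::-1] for row in image]

def add_gaussian_noise(image, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to every pixel, clamping to the valid [0, 255] range."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

img = [[0, 64], [128, 255]]
horizontal_flip(img)             # [[64, 0], [255, 128]]
noisy = add_gaussian_noise(img)  # same shape, slightly perturbed intensities
```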

Convolutional Neural Network (CNN) Model.
CNNs are supervised feed-forward networks that have demonstrated significant performance in large-scale object classification applications. The basic structure of CNNs is inspired by the primary visual cortex of the human brain, which oversees the processing of visual information [37]. In image classification, compared with traditional handcrafted feature extraction methods, CNNs can automatically extract learnable visual features from large-scale input images to perform classification. One of the main advantages of CNNs over traditional classification methods is that the feature representation and the classifier reside in the same network, eliminating their interdependencies. The architecture of a CNN principally comprises three types of layers: (i) convolution layers, (ii) pooling layers, and (iii) fully connected layers, briefly discussed below.

Convolutional Layers.
Convolutional layers are among the most important layers in CNNs; each consists of a defined set of learnable filters. The filters are spatially smaller than the input and slide over the input image during the forward pass to produce a two-dimensional activation map. The activation map indicates the location and the strength of the detected visual features in the input image. The features of a convolutional layer are computed as

y_n^l = f\Big(\sum_{m} y_m^{l-1} * k_{m,n}^{l} + b_n^{l}\Big),

where y_n^l is the n-th feature map of layer l, k_{m,n}^{l} is the convolution kernel connecting the m-th feature map of layer l-1 to the n-th feature map of layer l, y_m^{l-1} is the m-th feature map (characteristic pattern) of layer l-1, b_n^{l} is the bias term, and f(·) is the activation function.
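The per-feature-map computation above can be illustrated with a minimal single-channel example (a sketch, not the proposed network; like most deep learning frameworks, it computes cross-correlation, i.e., the kernel is not flipped):

```python
def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1) of a single-channel image."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = []
    for i in range(oh):
        row = []
        for j in range(ow):
            acc = 0.0
            for u in range(kh):          # slide the kernel over the patch
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
kernel = [[1, 0],
          [0, 1]]                        # illustrative diagonal-sensitive kernel
conv2d(image, kernel)                    # [[2.0, 5.0], [0.0, 2.0]]
```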

Pooling Layers.
The pooling layer is commonly used between consecutive convolution layers of a CNN to gradually reduce the spatial representation size, which lowers computation while retaining useful information and helps control overfitting during learning. Two configurations of pooling layer appear in existing state-of-the-art CNNs: a pooling layer with filter size 3 and stride 2, called overlapping pooling, and a pooling layer with filter size 2 and stride 2. Besides these, other pooling functions, i.e., L2-norm pooling and average pooling, have also been used in existing CNNs. The pooling operation can be expressed as

y_n^l = f\big(w_n^{l} \cdot \mathrm{down}(z_n^{l-1}) + b_n^{l}\big),

where down(·) is the subsampling (pooling) function, z_n^{l-1} is the value extracted from the convolution features of layer l-1, w_n^{l} is the map weight, and b_n^{l} is the offset (bias) value.
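A sketch of max pooling over a single feature map (illustrative only; sizes chosen for readability). With the stride equal to the window size the windows do not overlap; with a stride smaller than the window, as in overlapping pooling, they do:

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max pooling: take the maximum of each size x size window, stepping by stride."""
    out = []
    i = 0
    while i + size <= len(fmap):
        row = []
        j = 0
        while j + size <= len(fmap[0]):
            row.append(max(fmap[i + u][j + v]
                           for u in range(size) for v in range(size)))
            j += stride
        out.append(row)
        i += stride
    return out

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 0],
        [7, 2, 9, 8],
        [3, 1, 4, 2]]
max_pool2d(fmap)                    # non-overlapping 2x2/stride 2: [[6, 4], [7, 9]]
max_pool2d(fmap, size=3, stride=2)  # overlapping 3x3/stride 2: [[9]]
```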
Drop-Out Layer.
In CNNs, regularization is a common way to avoid the effects of overfitting by adding a penalty to the loss function. In this regard, a drop-out layer is added at the end of the proposed network so that the system does not learn interdependent feature weights.
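A minimal sketch of inverted dropout, the variant used by modern frameworks (illustrative, not the authors' implementation): surviving activations are rescaled by 1/(1-p) during training, so no rescaling is needed at test time:

```python
import random

def dropout(activations, p=0.5, training=True, seed=0):
    """Inverted dropout: zero each unit with probability p and rescale survivors
    by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return list(activations)        # identity at test time
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]

dropout([1.0, 2.0, 3.0, 4.0], p=0.5)   # each unit is either 0.0 or doubled
```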

Fully Connected Layer.
In the final section of the CNN structure, each neuron of the fully connected layer is linked to all activations of the previous layer to reduce the feature dimensions. The final pooling layer of the CNN flattens the last convolutional output, which is forwarded to the fully connected nodes of the network. These activations are then computed through matrix multiplication followed by a bias offset. A fully connected neuron can be computed as

y_n^l = f\Big(\sum_{m=1}^{N^{l-1}} w_{m,n}^{l}\, y_m^{l-1} + b_n^{l}\Big),

where N^{l-1} is the number of neurons in layer l-1, y_m^{l-1} is the m-th characteristic pattern of layer l-1, w_{m,n}^{l} are the connection weights, and b_n^{l} is the bias.

Selection of CNN Model.
State-of-the-art CNN models have demonstrated outstanding performance on large-scale datasets [38][39][40][41]. To choose a suitable CNN model, we initially fine-tuned the existing state-of-the-art AlexNet, Inception-v3, GoogleNet, VGG, and ResNet models according to the classes of the collected dataset. In the next step, transfer learning is applied to these models to evaluate the self-constructed vehicle dataset. Resultantly, ResNet demonstrated better applicability in terms of convergence, response time, and accuracy than the competing networks (discussed in Section 4). Consequently, the network architecture of ResNet with 152 layers is improved and employed in the proposed vehicle classification system.

The Architecture
In the proposed system, we employ the ResNet architecture to perform vehicle classification, one of the most groundbreaking CNN architectures, proposed by He et al. [34], which demonstrated outstanding performance in object recognition and classification by securing first place in ILSVRC-15 with a 3.57% top-5 error rate [34]. In earlier deep learning networks, increasing the number of layers could cause the vanishing gradient problem, due to which the model was unable to converge at its best. The ResNet architecture introduced a novel skip-connection-based technique, in which the input of a block is added to its output. As the network goes deeper, a bottleneck design is also adopted to mitigate the time complexity of the architecture. We employ a transfer learning approach, in which a model trained for one task is tuned to perform another task by learning new weights. This approach is effective when the available data are insufficient for training from scratch. In this work, we deploy a pretrained ResNet-152 network for vehicle classification, as shown in Table 1. This network has a depth of 152 layers, achieved by replacing each 2-layer block in the original ResNet with a 3-layer bottleneck block [34]. The input layer of the network takes an RGB colour image of 224 × 224 pixels. As shown in Table 1, the first layer uses 64 convolution kernels of 7 × 7 with a stride of 2, followed by 3 × 3 max pooling with a stride of 2 applied to the first convolution layer. Convolutional blocks 2-5 are organized as three-layer bottleneck blocks whose numbers of filters increase to 256, 512, 1024, and 2048, followed by an adaptive average-pooling layer.
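The skip connection can be sketched with a toy two-unit residual block in pure Python (illustrative weights and shapes, not the actual bottleneck blocks of ResNet-152). The block computes y = relu(F(x) + x), so when the residual branch F contributes nothing, the block falls back to the (rectified) identity, which is what eases optimization of very deep stacks:

```python
def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vec]

def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """y = relu(F(x) + x): the skip connection adds the block input back to the
    transformed output, keeping an identity path open for gradients."""
    h = relu(linear(x, w1, b1))      # first transform + nonlinearity
    f = linear(h, w2, b2)            # second transform (residual branch F)
    return relu([fi + xi for fi, xi in zip(f, x)])

# With zero weights the residual branch vanishes and the block is the identity
# for nonnegative inputs:
zeros_w, zeros_b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
residual_block([1.0, 2.0], zeros_w, zeros_b, zeros_w, zeros_b)  # [1.0, 2.0]
```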
To perform transfer learning, the last fully connected layer, pretrained to classify 1000 natural categories, was removed from the network. We append a new classification block consisting of a fully connected layer with a 1024-neuron feature vector, together with an average-pooling layer and a ReLU layer, to learn new visual features from the training dataset. At the end of the network, a drop-out layer is placed to reduce overfitting. On top of the classification block, a new fully connected layer is inserted to perform six-class vehicle classification, where each unit in the last layer is mapped to a six-class output probability through the softmax function. To ensure that these new layers learn higher-level visual features from the dataset, we increased their learning rate relative to the earlier layers, whose learning rate remains unchanged. We set the batch size and total epochs to 64 and 100, respectively. Network training was performed on a machine equipped with an RTX 2080 Ti (11 GB) GPU, a Core i9-9900K CPU, and 32 GB RAM, and took 8 hours to complete.
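The softmax mapping in the new classification head can be sketched as follows (the logit values are hypothetical; the class names follow the paper's six categories):

```python
import math

def softmax(logits):
    """Convert the six raw outputs of the last fully connected layer into probabilities."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["car", "bus", "van", "truck", "motorbike", "rickshaw"]
logits = [2.1, 0.3, -1.2, 0.8, 3.5, -0.4]    # hypothetical network outputs
probs = softmax(logits)                      # probabilities summing to 1.0
classes[probs.index(max(probs))]             # 'motorbike'
```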

Experiments and Results
The proposed vehicle classification method is assessed on the self-constructed dataset. The experiments are performed on a machine equipped with an RTX 2080 Ti (11 GB) GPU, a Core i9-9900K CPU, and 32 GB RAM, running 64-bit Windows 10.

Training of the Proposed Classification System.
The whole training process is categorized into three steps: (i) data preprocessing, (ii) training, and (iii) evaluation. In the first step, the dataset images are split into training, validation, and testing sets and resized to 224 × 224 according to the standard input size of CNN architectures. The training and testing images are randomly split at an 80-20% ratio of the total dataset, and the validation set is formed by randomly selecting 20% of the images from the training set. The PyTorch 1.4.0 library and MATLAB 2019a are used for the implementation (i.e., data preprocessing and organization, training, evaluation, and modification of the network) of the proposed classification system. The experiments are categorized into three types: (i) evaluation of the networks without fine-tuning, (ii) evaluation of the fine-tuned model on the self-constructed vehicle dataset, and (iii) evaluation of the fine-tuned model on the public VeRi dataset, each discussed below.
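The split described above (80-20 train/test, then 20% of the training portion held out for validation) can be sketched as follows (a generic sketch; `split_dataset` and the seed are illustrative, not the authors' code):

```python
import random

def split_dataset(items, test_frac=0.2, val_frac=0.2, seed=42):
    """Shuffle, hold out test_frac for testing, then carve val_frac of the
    remaining training portion out as a validation set."""
    items = list(items)
    random.Random(seed).shuffle(items)       # deterministic shuffle
    n_test = int(len(items) * test_frac)
    test, train = items[:n_test], items[n_test:]
    n_val = int(len(train) * val_frac)
    val, train = train[:n_val], train[n_val:]
    return train, val, test

train, val, test = split_dataset(range(10000))
len(train), len(val), len(test)              # (6400, 1600, 2000)
```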

Evaluation of the State-of-the-Art CNNs without Fine-Tuning.
To evaluate the CNNs, AlexNet, Inception-v3, GoogleNet, VGG, and ResNet are loaded from PyTorch resources. The training of these networks is performed using the PyTorch framework; a stochastic gradient descent (SGD) optimizer is employed for parameter learning with momentum, learning rate, and batch size of 0.9, 0.001, and 128, respectively. Cross-entropy, a commonly used loss function, is used to accumulate the loss during the whole process, and validation is performed after every epoch to monitor learning while training the network. The comparative accuracy of these networks is shown in Figure 4.
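The update rule behind the reported optimizer settings can be sketched as follows (the classical momentum formulation; note that PyTorch's `torch.optim.SGD` folds the learning rate into the velocity slightly differently). The `cross_entropy` helper shows the loss for an already-softmaxed prediction:

```python
import math

def sgd_momentum_step(params, grads, velocity, lr=0.001, momentum=0.9):
    """One classical SGD-with-momentum update: v <- mu*v - lr*g, then p <- p + v."""
    new_v = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_p = [p + v for p, v in zip(params, new_v)]
    return new_p, new_v

def cross_entropy(probs, target):
    """Negative log-likelihood of the true class given softmax probabilities."""
    return -math.log(probs[target])

params, velocity = [0.5, -0.3], [0.0, 0.0]
grads = [0.2, -0.1]
params, velocity = sgd_momentum_step(params, grads, velocity)
# params move opposite the gradient: roughly [0.4998, -0.2999]
loss = cross_entropy([0.7, 0.1, 0.05, 0.05, 0.05, 0.05], target=0)  # ~0.357
```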
Discussion: It can be observed from Figure 4 that ResNet with 152 layers demonstrated better accuracy than the 19-layer VGG, the 22-layer GoogLeNet, the 25-layer AlexNet, and Inception-v3, with an average difference of 1.7%. Consequently, it can be expected that ResNet will achieve even better accuracy after fine-tuning the architecture.

Evaluation of the Modified Network on Self-Constructed Vehicle Dataset.
Based on the performance of the networks discussed in the previous section, the architecture of ResNet is improved by adding a new classification block at the base of the network.
The new classification block comprises fully connected layers followed by average-pooling and ReLU layers. To find the best-fitting feature vectors for the fully connected layers, ResNet with 152 layers has been evaluated on the self-constructed dataset with multiple combinations of feature-vector sizes in the classification block to improve the robustness of the network.
To apply transfer learning, feature learning is restricted to the newly added classification block, which learns the optimal weights and biases from the input dataset. The SGD optimizer with the same parameters, i.e., learning rate and momentum, is used in the training and evaluation of the proposed classification system. It can be observed from Table 2 that the proposed network with two fully connected layers containing higher-dimensional feature vectors achieved significantly higher accuracy than the other FC-layer configurations with low-dimensional feature vectors. Fully connected layer-1 passes 2048 features down the classification block, while the other FC layer with the higher feature vector is added to the classification block of the network.
In the next step, the proposed network with different depths, i.e., 18, 34, 50, 101, and 152 layers, has been evaluated, and the accuracy of these networks is shown in Table 3.
Discussion: Table 3 demonstrates the influence of network depth on accuracy for the self-constructed vehicle dataset. The performance of ResNet increases with the depth of the network; consequently, ResNet with 152 layers achieved better accuracy across all classes of the dataset compared with the shallower ResNet variants. The detailed performance metrics of ResNet with 152 layers are shown in Table 4.

Evaluation of the Modified CNN on VeRi Dataset.
Based on the performance of the fine-tuned networks demonstrated in Table 4, the proposed classification network has been fine-tuned on the public VeRi dataset [42,43] to ensure generalization. The dataset, comprising 50,000 images, has been categorized into six classes, i.e., bus, MPV, pickup, sedan, truck, and van, and split into training and test sets at an 80:20 ratio. It is important to mention that these classes were selected based on the variation in the data. The performance metrics shown in Table 5 demonstrate the effectiveness of the presented classification system.

Comparison with Existing State-of-the-Art Vehicle Classification Methods.
The performance of the presented classification method is compared with traditional vehicle classification methods to establish the applicability of the proposed system in terms of class-wise and average accuracy, as shown in Table 6. The existing classification systems [11, 44-46] have been implemented in MATLAB 2019a and trained and evaluated on the self-constructed vehicle dataset.
Discussion: The proposed classification system has been compared with the existing vehicle classification systems [11, 44-46] to validate the efficacy of the proposed network; the existing networks were reproduced on the proposed dataset. Zhuo et al. [44] presented a GoogleNet architecture-based vehicle classification method with a 22-layer network. Gao et al. [45] introduced an AlexNet-based vehicle classification system containing 5 convolutional and 3 fully connected layers. Shivai et al. [46] introduced a self-designed CNN-based vehicle classification system with 13 convolutional layers and one fully connected layer, followed by max-pooling and dropout layers, whereas Zakria et al. [11] presented an Inception architecture-based classification system. Though these systems demonstrated good performance on their own datasets, one main reason for the difference in accuracy is that they comprise shallow networks that do not converge well on large-scale datasets. Besides, the existing systems [11, 44-46] are trained on limited classes that do not cover the common road traffic vehicles; resultantly, these systems do not perform well in real-time classification applications. Moreover, these methods were trained on unbalanced datasets, which is also an influential factor in the real-time performance of vehicle classification systems. Consequently, the performance of these existing systems degrades when evaluated on our self-constructed balanced dataset, whereas the proposed vehicle classification system is trained on the self-constructed vehicle dataset of 10,000 images covering the common road traffic classes and further fine-tuned on the public VeRi dataset of 50,000 images to ensure generalization. As a result, the proposed classification system achieves higher accuracy than the existing vehicle classification systems.

Figure 4: Results of the state-of-the-art CNNs without fine-tuning on a self-constructed dataset.

Conclusion
In this paper, a CNN-based vehicle classification system is proposed to improve the effectiveness of intelligent transportation systems. A new dataset containing 10,000 images in six classes is constructed to train the classification system. Initially, five state-of-the-art CNNs, i.e., AlexNet, Inception-v3, GoogleNet, VGG, and ResNet, are trained on the collected dataset to assess their performance. Based on this assessment, ResNet with 152 layers is improved by adding a new classification block to the original network through transfer learning. To ensure generalization, the proposed classification system is fine-tuned on the public VeRi dataset. Results demonstrate that the proposed classification system achieves 99.68% and 97.66% accuracy on the self-constructed and VeRi datasets, respectively, which is significantly higher than that of the existing state-of-the-art classification systems. In the future, we aim to extend this work to a fine-grained classification system to further improve the effectiveness of the proposed method in intelligent transportation systems.

Data Availability
All the data used to support the findings of the study are available in the manuscript.

Disclosure
Muhammad Atif Butt and Asad Masood Khattak are the joint first authors to this work.

Conflicts of Interest
The authors declare no conflicts of interest.

Authors' Contributions
The research conceptualization and methodology were done by Muhammad Atif Butt, Asad Masood Khattak, and Sarmad Shafique. The technical and theoretical framework was prepared by Sarmad Shafique and Saima Abid. The technical review and improvement were performed by Ahthasham Sajid, Muhammad Waqas Ayub, Bashir Hayat, and Awais Adnan. The overall technical support, guidance, and project administration were done by Muhammad Atif Butt, Asad Masood Khattak, Saima Abid, and Ki-Il Kim.