Auditory Speech Based Alerting System for Detecting Dummy Number Plate via Video Processing Data sets

A wide spectrum of computer vision applications uses object detection algorithms driven by AI and ML. State-of-the-art detection models such as Faster Region-based Convolutional Neural Network (Faster RCNN), Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLO) perform well for object detection, but many fail to detect small objects. In view of this, an improved network structure based on YOLOv4 is proposed in this paper. This work presents an algorithm for small object detection, trained on real-time high-resolution data, that is intended for porting to embedded platforms. A license plate, which is a small object in a car image, is considered for detection, and an auditory speech signal is generated when a fake license plate is detected. The proposed network is improved in the following aspects: (1) the classifier is trained on a positive data set formed from the core patterns of an image; (2) YOLOv4 is trained on features obtained by decomposing the image into low-frequency and high-frequency components. The resulting values are processed and conveyed through speech alert signals and messages. This reduces the computational load and increases the accuracy. The algorithm was tested on eight real-time video data sets. The results show that the proposed method greatly reduces computing effort while maintaining comparable accuracy. It processes 45 frames per second (fps) at an input size of 1280 × 960, which is sufficient for real-time operation. The proposed algorithm works well for tilted, blurred, and occluded license plates. In addition, an auditory traffic monitoring system can reduce criminal attacks by detecting suspicious license plates. The proposed algorithm is highly applicable to autonomous driving applications.


Introduction
Real-time object detection finds application in autonomous vehicles, manufacturing industries, and similar domains. Limited memory and computation power make real-time object detection very challenging. Small objects of interest are either objects that are physically small, such as a mouse, plate, jar, or bottle, or objects that are physically large but occupy only a small patch of the image, such as a train, airplane, bicycle, or car [1]. Diversity in input images and their representation complicates small object detection. An autonomous driving application requires end-to-end detection with low computation time. Even after the success of Deep Neural Networks (DNNs), there remains a large gap in accuracy between large, medium, and small objects. The two primary approaches to object detection are the one-stage approach and the two-stage approach. The difference between them is that the one-stage approach offers better detection efficiency. Two-stage approaches, which are region-based detectors, are too slow for real-time applications even though they exhibit higher accuracy. A deeper architecture requires more computation, which lowers speed. In practical applications, the major issue is to strike a compromise between speed and accuracy. The evaluation shows that the deeper the architecture, the larger the number of layers and the more training parameters it requires. This needs higher resource consumption and more data to fine-tune the training parameters. The resulting accuracy is high, but this makes practical application of deeper architectures difficult. For example, Faster RCNN can be considered a baseline model for detecting multiscale objects with high performance, but speed must be compromised. The choice of model considers various factors that enhance Mean Average Precision (mAP), such as super resolution for obtaining scaling information of small objects and multiscale training [1,2].
In practical applications such as vehicle detection and license plate detection, YOLO is a good choice because it exhibits a good tradeoff between speed and accuracy. The amount of data has a significant impact on the backbone model, and shallow models work well for scarce data. YOLOv2 offers a simple balance between speed and accuracy. It is ideal for small GPUs and high-frame-rate video, as it runs at 90 frames per second (FPS) with mAP comparable to Fast RCNN. The drawback of YOLOv2 is its difficulty in detecting small objects. The limitations of YOLO are comparatively low recall and more localization error than Faster RCNN, difficulty in detecting closely spaced objects because each grid cell can propose only 2 bounding boxes, and difficulty in detecting small objects. Considering all these facts, we have chosen YOLOv4 as the backbone model, and Haar-based training is performed to reduce the computational load.

Context of the Problem with reference to Case Study.
In India, the size of the number plate is variable (so plates may be considered small objects), and the CCTV cameras used for surveillance are of low resolution. Hence, automatic number plate recognition remains a challenging problem.
This problem of detecting variable-size number plates with good accuracy as well as good speed is addressed in this paper. The paper is organized as follows: Section 2 describes the challenges in small object detection; Section 3 briefs related work; Section 4 explains the basics and preliminaries associated with the proposed work; Section 5 introduces the proposed work; Section 6 describes the experimental setup and lists the results, followed by the conclusion, future work, and references.
Key contributions of this work are shown in Figure 1: (1) YOLOv4 along with Haar cascade training is proposed. Accordingly, low-level and high-level feature information is extracted using Haar cascading. This makes our method practical for real-time object detection (vehicles and, in particular, small objects such as the number plate of a vehicle). The proposed work helps in achieving multiple tasks such as small object detection, multiple object detection, and reduction of computation time and memory requirement. (2) As a case study for small object detection, license plate detection is studied in this work. A license plate forms a small object in the complete image area, and detection is achieved using YOLOv4, which is designed to overcome the drawbacks of YOLO and YOLOv2 in localization and small object detection, respectively.

Regarding Small Objects and Detection Challenges
Zhu et al. [1,3] stated that small objects occupy 20% of an image size. An object is said to be small if the dimensions of its bounding box are 20% lower than the image height and width.
Challenges associated with small object detection: Small object detection faces several challenges beyond those of normal object detection, such as small appearance. The detector gets confused when spotting small objects that are similar in appearance and located close together.
Also, locating small objects is difficult against a cluttered background. Pixels representing small objects are less informative. Furthermore, the number of pixels representing a small object is small compared to a normal object. Key features of small objects are eventually lost while passing through successive DNN layers. For example, an object occupying 32 × 32 pixels is represented by one pixel after pooling in VGG16.
This makes sliding window and selective search approaches impractical for attaining good outputs. Useful data pertaining to small objects may get ignored during training due to the widespread distribution of small objects in the image.
This demands exploiting the contents of an image, as the amount of data significantly impacts the model.
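The VGG16 example above can be checked with a little arithmetic. The sketch below assumes the standard VGG16 layout of five 2 × 2 max-pooling layers with stride 2 and ignores convolution padding effects; the function name is illustrative and not taken from the paper.

```python
def size_after_pooling(side: int, num_pools: int = 5, stride: int = 2) -> int:
    """Side length (in pixels) of a square region after repeated pooling.

    Assumes each pooling layer halves the spatial size (stride 2), as in
    the five max-pooling stages of VGG16.
    """
    for _ in range(num_pools):
        side = max(1, side // stride)  # a region cannot shrink below one pixel
    return side


# A 32 x 32 object collapses to a single feature-map location,
# while the full 224 x 224 input still maps to a 7 x 7 grid.
for side in (32, 64, 224):
    print(side, "->", size_after_pooling(side))
```

Under these assumptions, any object smaller than 32 × 32 pixels has no spatial extent left at the last feature map, which is why detectors that predict only from the deepest layer tend to miss it.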

Related Work
This section reviews work that uses deep learning approaches to detect small objects and recognize number plates. Expanding Receptive Field YOLO (ERF YOLO) was introduced in [4] to optimize YOLOv2 for locating small objects. An ERF block was used to expand the receptive field. Low-level information was downsampled by the ERF block to obtain location information, and deconvolution was used to upsample the high-level information to obtain feature information. The detection result was obtained by combining these two results. Even though ERF YOLO shows improvement in accuracy, the inference time required was high.
Computational Intelligence and Neuroscience
Two-stage approaches such as Fast and Faster RCNN and one-stage approaches such as YOLOv3 are evaluated in [1] for parameters such as resource utilization and processing speed, along with backbones such as Feature Pyramid Network (FPN), Residual Networks (ResNet), and ResNeXt. The work clearly states the advantages and disadvantages of each model with respect to these parameters, and how these parameters change when the size of the object is scaled. It provides a detailed comparison of two-stage and one-stage methods. The work concludes that Faster RCNN can be considered a baseline from which to develop a model, but it falls short in real-time applications and its training is complex. Hence, YOLO can be a good choice for real-time applications: even though training time is high, the tradeoff between speed and accuracy is worthwhile. The drawbacks of YOLO are removed in YOLOv2 and YOLOv3. YOLOv2 is not able to detect small objects. Even though YOLOv3 has good accuracy in detecting small objects, it is slower than YOLOv2 due to the use of Darknet-53 combined with techniques such as skip connections, residual blocks, and upsampling. To overcome these drawbacks of YOLOv2 and YOLOv3, we have used YOLOv4. YOLOv4 weight files are small and do not impose high hardware requirements. It can also be implemented in PyTorch so that it can be deployed on mobile devices, enabling edge devices to run these models as well, relieving the space constraint of immovable signal capture devices and providing the advantages of high accuracy and high detection rate [5]. Multiple object detection is achieved using YOLOv3 and OpenCV on the KITTI and COCO data sets. The evaluation shows that the accuracy obtained for cars and heavy vehicles is 95.5% and 96%, respectively, for day images [6].
This work does not report performance evaluation metrics for real-time data. It also suggests that the design can be modified to make the model more robust and suitable for real-time applications. Multiple vehicle tracking is proposed in [7], which introduces the EYOLO vehicle detection algorithm processing at 35 FPS. The method uses a Kalman filter for multiple vehicle tracking. Resource consumption can be reduced by using convolutional neural networks based on the Haar filter. G-Haar-based methods proposed in [8] outperform BNN, as G-Haar weights can keep higher computing precision than binary networks. Local regression tasks using a sparse window generation strategy can detect multiscale small objects [9].
Automobile-industry applications such as crime tracking and traffic violation tracking are gaining prominence, and hence so is license plate recognition (LPR). The LPR model is stable and robust due to the use of edge information, texture features, and mathematical morphology. A review of number plate detection and recognition is considered here because the license plate is a small object in vehicle images. License plate recognition for Chinese vehicles is proposed in [10] using a kernel-based learning machine with deep convolutional features; the main aim was recognition of Chinese number plates. Abedin et al. [10,11] proposed license plate recognition, but it performed well only on high-quality images. Cloud-based number plate recognition for smart cities was proposed by Polishetty et al. [12]; it involves binarization and edge detection, but these approaches are susceptible to complex backgrounds. The merging of CNN and RNN is proposed by Redmon and Farhadi [13] for reading car number plates. The proposed method investigates the blending of RNN and CNN for car number plate detection and recognition. The operational speed of these methods is not clearly mentioned, and none of these works considers real-time scenarios. The work in [14] addresses the problem of identifying moving vehicles by their number plates; it involves a database system for identifying the culprit. Most of these works used OpenCV with Python. The accuracy of OpenCV is lower than that of YOLOv3, and its detection speed is lower. A real-time vehicle detection and LPR system is presented to address the issues of fake number plates, accuracy, poor-quality images, speed, etc. The number plate is detected and recognized from moving vehicles. The work in [15] proposes a YOLOv2 Darknet approach based on Alexey's implementation, and [16] uses YOLOv2 for fast and accurate license plate detection.
Even though the recognition rate achieved was 78.33%, the results were unsatisfactory for some real-world Automatic License Plate Recognition (ALPR) applications. The author also proposes exploring new CNN architectures to optimize speed.

Basics of YOLOv4.
YOLO is a real-time object detection model whose successive versions brought progressively substantial improvements. YOLOv1 [17], widely known as YOLO, is a one-stage network that treats object detection as a regression problem, giving class probabilities and bounding-box coordinates simultaneously. The input image is resized to a fixed size, and a single convolutional network then operates on the image. The resulting detections are thresholded by the model's confidence score. On a GPU, YOLO detects at 45 fps, and the smaller Fast YOLO achieves 150 fps. The input image is divided into an S × S grid of cells of equal width and height. A grid cell takes responsibility for detecting an object if the center of the object lies inside that cell. Every grid cell also predicts bounding boxes and confidence scores, which represent the confidence of the model as well as the accuracy of the predicted bounding boxes. Background errors in YOLO are less than half those of Faster RCNN. YOLO struggles to locate small objects precisely. The localization error present in YOLO is addressed in YOLOv2 by introducing new training methods such as batch normalization, multiscale training with higher-resolution input images, and the use of default bounding boxes (anchor boxes) in place of fully connected layers. YOLOv2 focuses on improving localization and recall, offering a tradeoff between accuracy and speed. Improvements in YOLOv2 allow it to be trained on multiclass data sets such as COCO and ImageNet. YOLOv2 fails to detect small objects because the input downsampling leaves a low-dimensional feature map for the final prediction.
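The grid-cell responsibility rule described above can be illustrated with a short sketch. The grid size S = 7 is the value used in the original YOLO paper and is only an assumption here; the function name and the example coordinates are hypothetical.

```python
def responsible_cell(cx: float, cy: float, img_w: int, img_h: int, s: int = 7):
    """Return (row, col) of the S x S grid cell containing the object center.

    cx, cy are the pixel coordinates of the bounding-box center.
    """
    # Scale the center into grid units and clamp so that a center lying
    # exactly on the right or bottom border still falls in the last cell.
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


# A plate centered in a 1280 x 960 frame is assigned to the middle cell.
print(responsible_cell(640, 480, 1280, 960))
```

Because only one cell is responsible for each object center, two small objects whose centers fall into the same cell compete for that cell's few box predictions, which is one source of the small-object difficulty noted in the text.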
These issues of YOLOv2 are addressed in YOLOv3, with remarkable improvements in detecting small objects. The YOLOv3 [13,18] approach developed a deeper network consisting of 53 layers, termed Darknet-53, and combines it with methods such as skip connections as in ResNet, residual blocks, and upsampling in order to improve recall, precision, and Intersection over Union (IoU) metrics. Since YOLOv3 is a 106-layer fully convolutional architecture, it is slower than YOLOv2. Performance on low-resolution images is improved because YOLOv3 predicts objects at three distinct scales instead of making a single prediction at the last layer. Final detection is done by applying a 1 × 1 kernel on feature maps of three different sizes at three different positions in the network, as in FPNs. YOLOv3 creates nine anchor boxes and divides them into three groups. There are more bounding boxes per image because each location handles three anchor boxes. The number of boxes predicted by YOLOv3 is 10 times the number predicted by YOLOv2. The cost function calculation in YOLOv3 differs from that of YOLOv4 [19]. YOLOv3 uses logistic regression for the bounding box prediction, that is, binary cross-entropy loss for each label instead of mean square error for calculating classification loss. The softmax function is not used for class prediction. YOLOv3 obtained a mAP of 37 on the COCO-2017 validation set with an input image resolution of 608 × 608, whereas the competing MobileNet-SSD architecture obtained a mAP of 30. The YOLOv4 architecture is used in the current work [20]. Table 1 illustrates the different versions of YOLO [5].
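Intersection over Union (IoU), mentioned above as one of the metrics YOLOv3 aims to improve, can be computed as in the following minimal sketch. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates; this representation is an illustrative choice, not the paper's.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # overlap 25, union 175
```

For small objects such as license plates, a shift of only a few pixels changes the IoU considerably, which makes localization quality harder to achieve than for large objects.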

Basics of Haar Cascading.
The wavelet transform decomposes the input image into wavelet sub-images at different frequencies, namely the low-frequency LL, which contains the source image's vital information, and the high-frequency LH, HL, and HH, which preserve the source image's horizontal, vertical, and diagonal edge details, respectively. Wavelets can localize a signal in time and frequency over a limited duration. Haar-like features are adjacent rectangles at distinct positions in an image. They are based on detecting features and encoding information about the class to be detected. Haar-like features are of three types: edge features, line features, and center-surround features. Haar feature selection is based on calculating the difference between the sum of black pixels and the sum of white pixels. Haar-like features allow fast sum computation through the use of integral images. The integral image gives the sum of the pixels in a rectangle; since this sum does not require visiting individual pixels, computation speed is high. As it is based on the fundamentals of the Haar wavelet, the feature is called Haar-like. An integral image consists of small units representing a given image. The integral image is described as $ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')$, where $ii(x, y)$ is the integral image and $i(x', y')$ is the original image.
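The decomposition into LL, LH, HL, and HH sub-images can be sketched with a one-level 2D Haar transform. The version below is a plain-Python illustration using the orthonormal scaling (division by 2); the assignment of the labels LH and HL to horizontal and vertical detail follows the wording of this section, although conventions differ between libraries. It is not the authors' implementation.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns a dict with the four sub-bands, each half the input size.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")

    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, rows, 2):
        out = {name: [] for name in bands}
        for c in range(0, cols, 2):
            # The four pixels of one 2 x 2 block.
            tl, tr = img[r][c], img[r][c + 1]
            bl, br = img[r + 1][c], img[r + 1][c + 1]
            out["LL"].append((tl + tr + bl + br) / 2)  # local average (low frequency)
            out["LH"].append((tl + tr - bl - br) / 2)  # top minus bottom: horizontal edges
            out["HL"].append((tl - tr + bl - br) / 2)  # left minus right: vertical edges
            out["HH"].append((tl - tr - bl + br) / 2)  # diagonal detail
        for name in bands:
            bands[name].append(out[name])
    return bands


# A block with a bright top row and dark bottom row responds only in LL and LH.
print(haar_dwt2([[1, 1], [0, 0]]))
```

Each sub-band has a quarter of the pixels of the input, which is one way such a decomposition can reduce the amount of data passed to later stages.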
The Haar cascade classifier is a method used for detecting objects [21][22][23]. It has four components for object detection: Haar-like features, the integral image, AdaBoost learning, and the cascade classifier. The advantage of the integral image is that it performs basic operations in very little processing time. The use of the integral image allows the cascade classifier to run in real time.
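To make the speed advantage of the integral image concrete, the sketch below builds an integral image, reads any rectangle sum with four lookups, and evaluates a two-rectangle edge feature. It uses a zero-padded integral image (one extra row and column) to avoid boundary checks, and the left-minus-right layout of the edge feature is one illustrative choice among the Haar-like feature types.

```python
def integral_image(img):
    """Zero-padded integral image: ii[y][x] = sum of img[0:y][0:x]."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0
        for x in range(cols):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left pixel (x, y): four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def edge_feature(ii, x, y, w, h):
    """Two-rectangle edge feature: left half sum minus right half sum."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9
```

The cost of `rect_sum` is the same for a 2 × 2 window and a 200 × 200 window, which is why thousands of Haar-like features can be evaluated per frame.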

Speech Alerting Unit.
The speech alerting unit handles the redundancy mapping and signal strength orientation of the overall data in the processing channel. As the authors of [24] discuss in a detailed signal analysis of speech alerting, the process depends on mapping and extracting results via a spoken grammar library or lookup database systems. A similar pattern of signal compression is discussed in [25], with an order of ratio inter-dimensional signal processing. A fall-detection-based alerting system is discussed in [26] for feature-based discussion support. In the proposed system architecture, the behavioral model handles the processing and detection of dummy number plates in video data sets through interdependent decision support for alert management.

Proposed Work
The proposed work helps in achieving multiple tasks such as small object detection, multiple object detection, and reduction of computation time and memory requirement. As a case study for small object detection, license plate detection is studied in this work. The proposed approach consists of several layers: pre-processing, feature extraction, training, testing, vehicle license plate detection, and text-to-speech (TTS) conversion. During the pre-processing stage, a training set containing a large amount of positive and negative data is generated for automatic detection and recognition of license plates in real-time video. Positive images are obtained by cropping the core sub-image with the same aspect ratio as the license plate; using this positive training data set further reduces computation time. Pre-processing is followed by feature extraction, which includes extracting features such as the length, width, and height of the object. Haar features were extracted and trained using Haar cascading [27]. The Haar cascade classifier was trained on a batch of positive and negative samples, which were later stitched together into a vector file to generate an XML file. In addition, the use of a GPU (Graphics Processing Unit) increased the computation speed, and the system processes 45 frames per second. The purpose of using Haar is twofold: Haar cascading reduces both the computation and the memory requirement. The testing phase is executed after the training phase and checks whether the detected object is correct or incorrect. YOLOv4 detects moving objects using the CSPDarknet53 backbone. YOLOv4 is used to predict the bounding box and to calculate the confidence score of predicted vehicles using a single convolutional neural network. The input image is split into an S × S grid, and m bounding boxes are produced inside each grid cell. A bounding box whose class probability is greater than the threshold represents a detected object.
Speech synthesis, or text-to-speech (TTS), is the computer-generated simulation of human speech. It converts human language text into human-like speech audio. Text-to-speech conversion in Python is described in [28]. TTS reads words on webpages, smartphones, etc., and converts written text to a phonemic representation (the sounds of a word); it then converts the phonemic representation to waveforms using WaveNet, which can be output as sound. Figure 4 shows the block diagram of this process. The TTS unit is built using Python [29], which provides different APIs for converting text to speech. We have used the gTTS API, which enables us to read on-screen text in different languages such as Kannada, Tamil, English, and Hindi [30]. Algorithm 1 operates on a folder that contains image files.
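The final thresholding step described above (S × S grid, m boxes per cell, keep boxes whose class probability exceeds a threshold) can be sketched as follows. The `Detection` record, the threshold value of 0.5, and the example values are assumptions for illustration; the paper does not state its threshold.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str
    class_prob: float
    box: Tuple[float, float, float, float]  # (x, y, w, h) in pixels


def candidate_boxes(s: int, m: int) -> int:
    """Number of boxes proposed for an S x S grid with m boxes per cell."""
    return s * s * m


def keep_confident(dets: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose class probability exceeds the threshold."""
    return [d for d in dets if d.class_prob > threshold]


# Hypothetical raw predictions for one frame.
raw = [
    Detection("license_plate", 0.91, (412, 300, 96, 32)),
    Detection("license_plate", 0.12, (40, 52, 20, 8)),
    Detection("car", 0.77, (350, 180, 260, 190)),
]
print(candidate_boxes(7, 2), [d.label for d in keep_confident(raw)])
```

A practical detector would normally also apply non-maximum suppression after this filter; that step is not described in the text and is omitted here.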
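A minimal sketch of the text-to-speech step with the gTTS package is given below. The message wording, the choice to space out the plate characters so they are read one by one, the example plate, and the output file name are all assumptions for illustration. `gTTS(text=..., lang=...)` and `.save(...)` are the package's documented calls; the package is third-party and needs network access, so it is imported only inside the function that uses it.

```python
def build_alert_text(plate: str) -> str:
    """Compose the spoken alert for a suspicious plate (hypothetical wording)."""
    # Separating the characters encourages the engine to read them one by one.
    spelled = " ".join(plate)
    return f"Alert. Suspicious number plate detected: {spelled}"


def speak_to_file(text: str, lang: str = "en", out_path: str = "alert.mp3") -> str:
    """Synthesize `text` to an MP3 file using gTTS and return the file path.

    Language codes such as "en", "hi", "kn", and "ta" select English, Hindi,
    Kannada, and Tamil, respectively.
    """
    from gtts import gTTS  # third-party: pip install gTTS

    gTTS(text=text, lang=lang).save(out_path)
    return out_path


message = build_alert_text("KA01AB1234")
print(message)
# speak_to_file(message)  # uncomment when gTTS and a network connection are available
```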

Hardware Requirements and Annotation.
The proposed method is verified by carrying out experiments on data collected by real-time cameras. The camera used is the SC-IS42BP-I (Z) (S) (W). Its key feature is a 1/2.7 MP progressive-scan CMOS sensor, and its maximum resolution is 2688 × 1520 at 1-25 fps. It supports low bit rate, low delay, ROI-enhanced coding, and quad stream. Images were collected using this high-definition camera at 2688 × 1520 resolution. The cameras captured real-time traffic for 24 hours in variable traffic scenes. The data set is split into 40% for training, 40% for testing, and 20% for validation, in accordance with the division protocol proposed by Tanwir [31] for the SSIG data set. We have used LabelImg, a graphical annotation tool for labeling images. It is written in Python, and its graphical interface is provided by Qt. Annotations are saved as XML files in YOLO format. This study focuses on the following parameters: the x-coordinate of the bottom-left corner of the rectangle (x), the y-coordinate of the bottom-left corner of the rectangle (y), the width of the rectangle (w), and the height of the rectangle (h). According to research, low-level layers carry location information and high-level layers carry information pertaining to features [32]. To gauge the performance of the proposed algorithm, 2424 images were used. Some types of license plates appear more often than others, since the vehicle images were collected from real traffic scenes. Real-time data collection made the data set challenging due to variation in illumination, background, plate color, font, etc. To create the positive training set, images were cropped to the core sub-image with the same aspect ratio as the license plate [33]. This increases the detection rate of license plates. Class IDs 0, 1, 2, 3, and 4 are assigned to yellow, white, black, green, and red number plates, respectively. This is demonstrated in Table 2. Example of annotation: <Class ID> <x> <y> <w> <h>
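The annotation line format above can be produced with a small helper such as the one sketched below. The class-ID mapping follows the text; normalizing the coordinates by the image width and height is an assumption based on the common YOLO text format, which also conventionally stores the box center rather than a corner. The helper simply normalizes whichever reference point is passed in, and its name and example values are hypothetical.

```python
# Class IDs for plate colors as listed in the text.
PLATE_COLOR_CLASSES = {"yellow": 0, "white": 1, "black": 2, "green": 3, "red": 4}


def annotation_line(color: str, x: float, y: float, w: float, h: float,
                    img_w: int, img_h: int) -> str:
    """Format one '<class id> <x> <y> <w> <h>' line with normalized values.

    x, y, w, h are given in pixels; each is divided by the image width or
    height so that the stored values lie between 0 and 1 (an assumption).
    """
    class_id = PLATE_COLOR_CLASSES[color]
    values = (x / img_w, y / img_h, w / img_w, h / img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# A white plate in a 2688 x 1520 frame.
print(annotation_line("white", 1344, 760, 336, 190, 2688, 1520))
```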

Software Requirements.
Tensorflow and OpenCV [4] have been of great aid in providing cross-platform support for the execution of numerous vision algorithms.
Algorithm 1:
(i) Initially, load the video, converted to frames, into system memory.
(ii) Locate the folder path P provided for the process.
For every image file F belonging to the folder path P:
Construct the file list L such that F ∈ folder at path P.
Set the CSV report file R to empty.
Plot the bounding box for finding the license plate.
(11) Detect the characters and numbers by segmentation using OpenCV.
(12) If (confidence < 40%)
(13) Nested if (number of characters is not between 8 and 10)
(14) Enter the number plate into the alert table
(15) Else enter the number plate in the non-alert table
(16) Convert the alert number plates from text to speech.
(17) Intimation to the traffic in-charge for enquiry.
Figure 2 represents the vehicle detection on real-time data sets with the LPR number, color, and detection efficiency. Figure 3 shows the proposed block diagram of the auditory speech-based alert system for detecting fake number plates. Figure 4 illustrates text-to-speech conversion using Python [28].
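Steps (12) to (15) of Algorithm 1 can be read in more than one way because the nesting of the "Else" is not explicit. The sketch below takes one reading as an assumption: a plate goes to the alert table only when the recognition confidence is below 40% and the character count lies outside 8 to 10; every other plate goes to the non-alert table. Function names and the sample readings are hypothetical.

```python
def classify_plate(text: str, confidence: float,
                   min_chars: int = 8, max_chars: int = 10,
                   min_conf: float = 0.40) -> str:
    """Return "alert" or "non-alert" for one recognized plate string."""
    suspicious_length = not (min_chars <= len(text) <= max_chars)
    if confidence < min_conf and suspicious_length:
        return "alert"
    return "non-alert"


def triage(readings):
    """Split (plate_text, confidence) pairs into alert and non-alert tables."""
    alert_table, non_alert_table = [], []
    for text, conf in readings:
        if classify_plate(text, conf) == "alert":
            alert_table.append(text)
        else:
            non_alert_table.append(text)
    return alert_table, non_alert_table


readings = [("KA01AB1234", 0.95), ("KA01", 0.20), ("MH12DE1433", 0.30)]
print(triage(readings))
```

The plates in the alert table would then be passed to the text-to-speech step (16) and reported to the traffic in-charge (17).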
This work can detect number plates that are blurred and tilted. Figure 5 illustrates the architecture of YOLOv4. Figure 6 shows the annotation of a car license plate. Figure 7 shows the proposed method applied to real-time video, indicating the detected vehicle with its license plate. Figure 8 illustrates license plates at different angles. Table 3 shows the results of LPR detection with confidence scores. It can also be observed that multiple vehicles in an image are detected. The algorithm is applied to real-time videos containing vehicles of variable sizes for speed analysis. In comparison to the YOLOv3 approach, there is a 3 percent gain in speed [35]. Average accuracy, precision, and recall values obtained for six different real-time videos are tabulated in Table 4. The accuracy is 97%, which is 3.2% higher than that of the existing YOLOv3. Three metrics are considered for evaluating the performance of the object detection algorithm, namely accuracy, precision, and recall. Table 4 gives the performance metrics for the proposed algorithm.
The proposed algorithm is able to detect license plates at different angles and with different fonts and colors.
Alert messages generated from the LPR data in the GUI are conveyed to the traffic in-charge through the dashboard, and an enquiry is accordingly made for the false number plate. Table 5 presents the performance estimation and speech alert management of incoming patterns. Table 5 also reports the interdependency of information signal processing and the response time for a particular signal and its speech alert message. The process provides the computational statistics with reference to signal strength and processing delay.
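The three evaluation metrics named in this section follow the usual definitions in terms of true and false positives and negatives. The sketch below uses those standard formulas; the counts in the example are made up for illustration and are not the paper's results.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Fraction of all decisions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of reported detections that were real plates.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Fraction of real plates that were detected.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


print(detection_metrics(tp=8, fp=2, tn=8, fn=2))
```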

Conclusion and Future Work
In this paper, we proposed a modified YOLOv4 with Haar cascading-based feature extraction and training for detecting small objects such as license plates. OpenCV along with Python was used for character segmentation and recognition. With the proposed algorithm, we were able to reduce memory consumption and computation time. A Haar CNN was used to train the model. This approach processes images at a frame rate of 45 fps, which is greater than that of YOLOv3. Results on real-time data captured by camera show the operational compression of the model while still achieving a mAP of 87%. The proposed algorithm can significantly improve camera-based license plate detection systems and can contribute remarkably to autonomous driving applications. In future work, different wavelets such as contourlets can be used for extracting features. The network structure can also be improved further. Researchers can also focus on a combination of feature extractors such as GLCM and Haar to improve real-time performance as well as accuracy. The proposed system includes a speech-based alerting system with an estimated performance of 97.23% with reference to the selection of the alerting message and information. The framework handles the detection and processing of dummy plates in video data sets. Currently, objects in dark, low-resolution, and blurry images, at tough angles, and for all vehicle types are handled. Future work can improve the decoding of license plates and the detection of vehicle type (e.g., SUV, van, pickup truck), vehicle make and model (e.g., Honda Accord), color, and orientation, while ignoring bumper stickers, car signs, etc.

Data Availability
The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.