Utilizing Image Processing and the YOLOv3 Network for Real-Time Traffic Light Control



Introduction
In [1], a real-time traffic control system was developed using various image-processing techniques applied to images collected with webcams. The study used MATLAB software with timers to control the lights and a seven-segment display that shows the vehicle count. In addition, the authors used morphological image operations, but only at the simulation level, on flat surfaces with toy vehicles that are clearly different from the background. Furthermore, their study did not present results on real images or the time required by their algorithm. The study in [2] considered traffic congestion a basic problem in urbanized areas. Using webcam images and MATLAB software, it proposed controlling the traffic lights at a four-way intersection using RGB conversion techniques and background subtraction to determine vehicle density, achieving 90% accuracy in density identification.
An investigation carried out on a toll road in [3] established that traffic density can be controlled if the volume data of vehicles on the road are acquired and processed. Hence, the authors proposed a background subtraction method using a Raspberry Pi with OpenCV, which achieved a precision of 92.3% in the morning but dropped to 77.3% in the afternoon, thus showing the dependency of the technique on lighting conditions, noise, vehicle speed, and camera viewing angle.
In [4], it was confirmed that traffic congestion is a significant problem in all urban cities. The study proposed a vehicle detection model that uses image processing, converting images to RGB and HSV to determine whether they were taken by day or by night. The corresponding methodology was then applied to extract the vehicles and count the objects, achieving an average precision of 95% on the datasets used.
Furthermore, the study in [5] established that inefficient control of traffic lights causes numerous problems, such as long delays and wasted energy. Therefore, using data collected from different sensors and vehicular networks, a reinforcement learning model was proposed to control the duration of traffic signals, with the actions modeled by Markov decision processes. To validate their model, the authors used the Simulation of Urban Mobility (SUMO) environment, showing the efficiency of their model in controlling traffic lights.
In [6], a new vehicle counting technique was developed using a synergism attention network (SAN) to handle the overlapping phenomena and sophisticated large-scale variations that occur in high-density images. The authors showed that the new SAN model obtains better performance indicators than MCNN, P2PNet, and DMCount.
All previous investigations consider congestion a problem in cities, which is confirmed by the data published by INRIX. Based on data collected from 300 million cars and connected devices at different times of the day along different paths of the road network, it was established that the cost of congestion in the United Kingdom was 37.7 billion pounds sterling [7], with an average of £1,168 per driver. Globally, trade increases as transport and logistics activities fulfil their functions, allowing the mobility of people and goods in an efficient, timely, and assertive manner. Furthermore, in the main cities of South America, up to 86 h are lost in congestion, as in São Paulo, which represents an average of 30% of driving time lost in congestion. Hence, detailed knowledge of vehicular movement in transportation and logistics is necessary.
In addition, none of the previous studies provided information on the execution time of the proposed models or traffic light control algorithms. Hence, in this study, we propose the use of a YOLOv3 network for counting vehicles and people based on images obtained from cameras placed at an intersection. Furthermore, the method uses a strategy that allows counting to be completed in less than one second, thereby enabling its use in the control of traffic lights and improving traffic control in real time.
As observed, none of the previous research studies provides information regarding the execution times of the proposed models or traffic light control algorithms. Additionally, several of these studies developed their algorithms in MATLAB, which requires a laptop or desktop computer and is therefore unsuitable for real-world traffic light applications because of installation difficulties. Furthermore, some studies used OpenCV and single-board platforms but tested them only in simulated scenarios with toy vehicles or virtual environments.
In this study, we propose the use of YOLOv3 for real-time counting of vehicles and people in images obtained from cameras placed at an intersection, running on a Jetson Nano board, which makes it easier to implement in real-world situations. Thus, our research provides a more suitable and efficient strategy for real-time traffic light control than previous investigations.

Materials and Methods
Our interest is in digital images acquired by digital cameras, which are discrete representations of processed data with spatial (arrangement) and intensity (color) information. We treat such an image as a multidimensional signal. The digital image is represented in two discrete dimensions (2-D), where I(m, n) represents the intensity response of a sensor at a series of fixed positions (m = 1, 2, ..., M; n = 1, 2, ..., N) in 2-D Cartesian coordinates, derived from a continuous 2-D signal space I(x, y) through a sampling process often referred to as discretization. The value at each 2-D position is known as a pixel. An image contains one or more color channels that define the color intensity of the pixel at location (m, n), which we denote as I(m, n).
The image processing was performed automatically, following the steps of image acquisition and storage, preprocessing, segmentation, representation and description, and recognition and interpretation [8]. We implemented representation and description with convolutional networks [9] and recognition and interpretation with deep-learning neural networks [10].
A neuron, the basic unit of a neural network (see Figure 1), multiplies its inputs by their weights and adds its bias b; the resulting net output s is then passed through a transfer function. We show the sigmoid function, but other functions can be used, depending on the applications and learning algorithms [11].
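The computation performed by a single neuron can be sketched in a few lines of Python. This is only an illustration of the weighted sum plus bias (the "polarization" b) followed by a sigmoid transfer function; the function names and the sample inputs and weights are hypothetical and are not taken from the study.

```python
import math


def sigmoid(s):
    """Sigmoid transfer function: maps the net output s into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias b, passed through the sigmoid."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(s)


# Illustrative values only: s = 0.5 * 2.0 + (-1.0) * 1.0 + 0.0 = 0.0
y = neuron([0.5, -1.0], [2.0, 1.0], 0.0)
```

Swapping `sigmoid` for another transfer function changes only the last step; the weighted sum stays the same.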
Neural networks usually have more than one layer, and the connections between different neurons are the weights that are modified during training. Figure 2 shows a network with two layers of weights, in the first hidden layer and in the output layer, that can be updated during training. Neural networks with more than one hidden layer can learn complex nonlinear relationships and are able to approximate continuous functions.
Convolutional neural networks (CNNs) are neural networks specially designed to work with data that have spatial characteristics, such as images. They are composed of convolutional layers that filter their inputs to find valuable features [12]. In Figure 3, we apply a 7 × 7 × 3 filter to a 38 × 38 × 3 image to obtain a 32 × 32 × 1 image. The dimensions are reduced, but the image has more marked characteristics, which depend on the properties of the applied filter [13,14].
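The size reduction in this example follows from sliding the filter over every position where it fits completely: 38 − 7 + 1 = 32 positions per axis, with the three channels summed into a single output value. The sketch below assumes a stride of 1 and no padding, which the text does not state explicitly, and uses a naive pure-Python loop with a hypothetical all-ones filter purely for illustration.

```python
def convolve(image, kernel):
    """Naive 'valid' convolution with stride 1 and no padding.

    image[h][w][c] and kernel[kh][kw][c] share the same channel count c;
    the result is a single-channel map of size (h - kh + 1) x (w - kw + 1).
    """
    kh, kw, channels = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj][c] * kernel[di][dj][c]
                for di in range(kh)
                for dj in range(kw)
                for c in range(channels)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 38 x 38 x 3 image and a 7 x 7 x 3 filter, both filled with ones (illustrative).
image = [[[1.0] * 3 for _ in range(38)] for _ in range(38)]
kernel = [[[1.0] * 3 for _ in range(7)] for _ in range(7)]
feature_map = convolve(image, kernel)  # 32 rows of 32 values each
```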
The input images were split into an S × S grid, and the algorithm checks the bounding frame of each grid cell. In addition, the deep-learning algorithm extracts the bounding boxes of each possible detected object. Furthermore, using an available set of inputs, deep learning was used to determine the object type in each box. Finally, this process was repeated for each box, the objects were counted, and a list containing the number of objects found was returned [15].
The solution is built into an embedded device. Considering the processing speed and resolution of each frame, we reviewed the available embedded devices and selected the most suitable one. The Raspberry Pi 3 B+ has 1 GB of RAM, a 4-core processor with a 1.4 GHz clock frequency, four USB ports, an Ethernet port, and WiFi and Bluetooth connections [16]. Important features of this Raspberry Pi model are the availability of an HDMI port, a CSI camera port, and a DSI display port. In addition, it was noticed that when AlexNet is used, one video frame can be analyzed every 2.5 s, and in the worst case (BVLC reference models), the estimated time is 20 s per frame. Furthermore, the Jetson Nano board is a more powerful tool and can be used as a minicomputer alternative to the Raspberry Pi. It has 4 GB of RAM and 128 CUDA cores. Nvidia showed the performance of this board with image recognition models, with speeds of up to 25 fps, which is much higher than those of the Raspberry Pi 3 B+ [17]. Finally, we compared the characteristics of both minicomputers and ran simple detection algorithms on both platforms; based on this, we decided to build the solution on Nvidia's Jetson Nano minicomputer.
YOLOv3 is a pretrained network used to detect objects; it consists of 106 layers and uses 80 categories (objects). For this purpose, the image is divided into an S × S grid. In each grid cell, the algorithm checks the limits of the bounding boxes, and a deep learning algorithm extracts the bounding boxes for each detected object. In addition, the probability that each box contains an object of each category in the training set is calculated [18]. The output consists of the location of the object given by the coordinates (x, y), its size given by the width W and height H, and the probability of belonging to each of the 80 categories (objects).
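To illustrate how such an output can be turned into counts, the sketch below assumes that each detection is a flat row [x, y, W, H, p_1, ..., p_K], with one probability per category, as described above. The row layout, the confidence threshold, and the function name are assumptions made for this example; practical YOLOv3 pipelines typically also carry an objectness score and apply non-maximum suppression, both of which are omitted here.

```python
from collections import Counter


def count_detections(rows, class_names, threshold=0.5):
    """Count detections per category.

    Each row is assumed to be [x, y, W, H, p_1, ..., p_K]; a detection is
    assigned to its most probable category if that probability reaches
    the (hypothetical) threshold.
    """
    counts = Counter()
    for row in rows:
        scores = row[4:]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            counts[class_names[best]] += 1
    return counts


# Three illustrative categories instead of the full 80.
names = ["person", "car", "truck"]
rows = [
    [0.50, 0.50, 0.10, 0.20, 0.90, 0.05, 0.05],  # person
    [0.20, 0.70, 0.30, 0.20, 0.10, 0.80, 0.10],  # car
    [0.80, 0.30, 0.10, 0.10, 0.20, 0.30, 0.10],  # below threshold, ignored
    [0.40, 0.60, 0.30, 0.20, 0.00, 0.70, 0.20],  # car
]
totals = count_detections(rows, names)
```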

Analysis and Proposal.
We implemented vehicle detection in videos and photos of the crossing of two one-way streets, highlighting the precision of the YOLOv3 network (see Figure 4). First, we extracted relevant information about the detected objects (boxes of different colors in the image) and their quantities.
We analyzed the object detection performance on a set of images extracted from videos of the crossing of two one-way streets and made the following annotations (see Figure 5): (1) vehicles before each intersection, which are either stopped or could be stopped by a light change; (2) vehicles passing through or in the transit zone; (3) people crossing at the respective crosswalk; and (4) people waiting on the corners for a change in the lights so that they can cross.

Detection with Masks.
The first solution consists of filtering using a mask (see Figure 6) to obtain the corresponding counts. In total, we made four detections, two in each photo, to obtain directly the number of stopped vehicles, people crossing, vehicles crossing, and people stopping. The algorithm therefore consists of applying Filter 01 to the image of one of the streets, detecting and counting the stopped vehicles and crossing people in the resulting image, and subsequently using Filter 02 to detect and count the passing vehicles and stopped people. This is repeated for the image from the other street, and then the traffic light logic is applied. Furthermore, using a function from the time library that measures time with nanosecond precision, we determined the execution time of each detection process. In addition, we measured the total time, including the traffic light logic. On average, each detection took about 0.7 s, and a complete cycle took 2.8 s.
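The timing procedure can be sketched as follows. The text does not name the function used, so this example assumes `time.perf_counter_ns` from Python's standard `time` module; the helper names `timed` and `run_cycle` and the stand-in detector are hypothetical.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds), measured in nanoseconds."""
    start = time.perf_counter_ns()
    result = func(*args, **kwargs)
    elapsed_s = (time.perf_counter_ns() - start) / 1e9
    return result, elapsed_s


def run_cycle(detect, images):
    """Apply detect to every (already masked) image and accumulate the time."""
    counts, total_s = [], 0.0
    for image in images:
        result, elapsed_s = timed(detect, image)
        counts.append(result)
        total_s += elapsed_s
    return counts, total_s


# Stand-in detector: len() plays the role of "detect and count" here.
counts, total_s = run_cycle(len, ["ab", "cde", "", "f"])
```

With four detections of about 0.7 s each, the accumulated time of such a cycle matches the 2.8 s reported above.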

Detection Using Polygon to Delimit Objects.
Based on the observation above, the primary source of delay is the detection and counting processes in the different images that result from applying the corresponding filters (four detections and counts). Hence, we opted to perform only one detection in each image, thereby reducing the detections and counts to two (one for each image). In addition, we determined what was detected (person or vehicle) and checked whether each object is inside or outside a polygon using a function point_in_polygon based on the ray-casting method [19,20]. This allowed us to establish whether or not a given point lies in an area delimited by a polygon, and enabled counting stopped vehicles and crossing people in one case and passing vehicles and stopped people in the other. This information can then be applied in the smart traffic light control logic.
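A minimal sketch of the ray-casting test is given below. It is a generic textbook version, not necessarily the implementation used in this study: a horizontal ray is cast from the point to the right, and the point is inside if the ray crosses the polygon boundary an odd number of times. The helper `count_by_zone` and the use of box centers as the tested points are assumptions made for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order; points exactly on the
    boundary are not treated specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_by_zone(centers, polygon):
    """Return (number of centers inside, number outside) for one polygon."""
    inside = sum(1 for c in centers if point_in_polygon(c, polygon))
    return inside, len(centers) - inside


# Illustrative square zone and three detected box centers.
zone = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside_count, outside_count = count_by_zone([(2, 2), (5, 2), (1, 3)], zone)
```

Because each test is a single pass over the polygon vertices, it adds very little to the cost of a detection.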
Furthermore, we determined the execution time of each detection process and of each cycle of counting objects inside and outside the polygons in each image. Each detection takes 0.55 s on average, and counting the objects and determining whether they are inside or outside the polygon takes less than 10 ms. Hence, the complete cycle, with one detection per image, is completed in 1.14 s, which considerably improves on the performance mentioned previously.

Using Polygon to Clip Objects and a Single Detection.
The detection and counting times still exceeded one second. To reduce this total time, we chose to combine the images of both streets in a single shot (see Figure 7). We then conducted a single detection process and used polygons, testing whether a given point lies in each area, to determine whether there are vehicles and people stopping or crossing in each street.
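One way to picture this step is shown below. The sketch assumes that the two images are placed side by side, so that the polygons defined for the second street must be shifted by the width of the first image; the actual layout used in Figure 7 is not specified in the text, and images are represented here as nested lists of pixel values only to keep the example self-contained.

```python
def combine_side_by_side(image_a, image_b):
    """Concatenate two images (lists of pixel rows) horizontally."""
    if len(image_a) != len(image_b):
        raise ValueError("both images must have the same height")
    return [row_a + row_b for row_a, row_b in zip(image_a, image_b)]


def shift_polygon(polygon, dx, dy=0):
    """Translate polygon vertices into the coordinates of the combined image."""
    return [(x + dx, y + dy) for x, y in polygon]


# Tiny illustrative images: 2 x 2 for street A and 2 x 1 for street B.
street_a = [[1, 2], [3, 4]]
street_b = [[5], [6]]
combined = combine_side_by_side(street_a, street_b)

# A zone drawn on street B moves right by the width of street A's image.
zone_b = shift_polygon([(0, 0), (1, 0), (1, 2), (0, 2)], dx=len(street_a[0]))
```

A single detection on the combined image then yields points that can be tested against the polygons of both streets.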

Results and Discussion
In Figure 8, we compared the values obtained for the total number of vehicles and people in the images of the crossing of two one-way streets. Both counts followed the same trend, which confirmed the capability of the YOLOv3 network for detection and counting. In addition, we measured the performance using the mean error, where V_d denotes the detected values (vehicles or people), V_c the counted values (vehicles or people), and L the number of evaluated images. Applying this relationship, we obtained mean errors of 2.11 and 1.92 for the vehicle count and the people count, respectively, implying that, on average, the results returned by the detector are within about ±2 of the values found.
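A form of the mean error that is consistent with these definitions, assuming that absolute differences are averaged over the L images, is (1/L) Σ |V_d − V_c|. The sketch below computes it; both the absolute-value form and the sample numbers are assumptions made for illustration and are not data from the study.

```python
def mean_error(detected, counted):
    """Mean absolute difference between detected (V_d) and counted (V_c) values."""
    if len(detected) != len(counted) or not detected:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(v_d - v_c) for v_d, v_c in zip(detected, counted))
    return total / len(detected)


# Hypothetical counts for L = 3 images.
error = mean_error([3, 5, 8], [4, 5, 6])  # (1 + 0 + 2) / 3
```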

Discrete Counts.
We carried out detection and counting using the zones of vehicles before each intersection and people crossing described above (see Figure 5), with the results shown in Figure 9. The average errors in the vehicle and people counts are 1.85 and 1.03, respectively, implying that, on average, the detector results can be within ±2 and ±1 for vehicles and people, respectively.
We performed detection and counting in the zones shown in Figure 5, and the results are presented in Figure 10. It was found that the mean errors for the cases of vehicles and people are 2.07 and 1.11, respectively.
In addition, we determined the execution times of the single detection process and of the process of counting objects inside and outside each polygon for each street. The average detection time for the image was found to be 0.72 s. The time for counting objects and determining whether they are inside or outside the polygon is less than 10 ms, and the time for completing the entire cycle per image is 0.73 s, which considerably improves on the performance obtained previously. Furthermore, the use of other strategies was considered. The study in [2] proposes a traffic light control using MATLAB, which is not appropriate for embedded devices, and it does not indicate the execution time of the proposed control. Similarly, [21], which uses CNNs to count objects in dense images, does not present execution-time results that would allow its strategies to be applied to traffic light control.

Figure 12: Test platform GUI.

Real-Time Traffic Light Control.
Based on the obtained times, which are less than one second for detection and counting, we developed a real-time traffic light control logic that changes state depending on the number of vehicles stopped at the intersection. If a larger number of vehicles is stopped on one of the streets, the traffic light of that street should turn green. To achieve this, the logic first determines the current status of that light. If it is already green, no change is made, and its counter is decreased by one. If it is not green, the previous status is checked: if it was amber and the amber counter has reached zero, the light changes to green, the traffic light that controls the other street is changed to red, and the status and counters are updated. If the amber counter is not yet zero, it is decreased by one. If the light is red, a change is started by passing through amber (see Figure 11). We verified the functionality of the detection algorithms on a test platform consisting of a graphical application that simulates traffic light control based on the counts obtained from photos or videos (see Figure 12).
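The control logic described above can be read as a small state machine that is advanced once per counting cycle. The sketch below is one possible interpretation for two streets, not the implementation used in this study: the names `Light` and `step`, the counter lengths `green_ticks` and `amber_ticks`, and the tie-breaking rule (the first street wins) are all assumptions.

```python
from dataclasses import dataclass

GREEN, AMBER, RED = "green", "amber", "red"


@dataclass
class Light:
    state: str
    counter: int = 0


def step(lights, stopped, green_ticks=10, amber_ticks=3):
    """Advance the two-street controller by one cycle.

    lights:  dict street -> Light
    stopped: dict street -> number of stopped vehicles on that street
    """
    busy = max(stopped, key=stopped.get)  # street with most stopped vehicles
    other = next(street for street in lights if street != busy)
    light = lights[busy]

    if light.state == GREEN:
        # Already green: no change, only decrease its counter.
        light.counter = max(0, light.counter - 1)
    elif light.state == AMBER:
        if light.counter == 0:
            # Amber time is over: turn green and set the other street to red.
            light.state, light.counter = GREEN, green_ticks
            lights[other].state, lights[other].counter = RED, 0
        else:
            light.counter -= 1
    else:
        # Red: start the change by passing through amber.
        light.state, light.counter = AMBER, amber_ticks
    return lights


# Street B has more stopped vehicles, so its light moves red -> amber -> green.
lights = {"A": Light(GREEN, 5), "B": Light(RED)}
stopped = {"A": 1, "B": 6}
step(lights, stopped, green_ticks=4, amber_ticks=2)
```

Since each call only inspects two counters and two states, its cost is negligible next to the detection time.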

Conclusions
This study aimed to analyze vehicular and pedestrian traffic at the crossing of two one-way streets and to develop a real-time traffic light control system that considers the number of stopped vehicles, crossing vehicles, people crossing, and people stopping, with the goal of minimizing the number of stopped cars and stopped people. The detection and counting algorithms used in the study showed promising results, with an average detection time of 0.72 s and a complete cycle time per image of 0.73 s, which is suitable for intelligent traffic light control.
The developed real-time traffic light control system can improve traffic flow and reduce congestion in urban areas, leading to more efficient and effective transportation and potentially reducing environmental impacts. The system can manage traffic efficiently and effectively by considering the number of stopped vehicles, crossing vehicles, people crossing, and people stopping, with the goal of minimizing the number of stopped cars and stopped people. This study used a strategy of combining images from both cameras into a single shot, which improved the performance obtained.
However, the study considered only the crossing of two one-way streets, which may limit the applicability of the developed traffic light control system to other types of intersections or road configurations. Additionally, the performance of the developed system under real-world conditions was not tested and needs to be explored further.
Therefore, future research based on the findings of this study could explore the performance of the developed real-time traffic light control system under real-world conditions to evaluate its effectiveness in managing traffic flow and reducing congestion in urban areas. Additionally, the system could be extended to work with multiple intersections or road configurations to enhance its applicability and usefulness. Investigating alternative object detection and counting methods or algorithms to improve the performance and accuracy of the system, and integrating other sensors or data sources, such as GPS or weather data, could further enhance the capabilities and effectiveness of the developed system.

Data Availability
The corresponding author can provide the data used to support the findings of this study upon request.