Research on Object Detection of PCB Assembly Scene Based on Effective Receptive Field Anchor Allocation

Vision-based object detection in PCB (printed circuit board) assembly scenes is essential for accelerating the intelligent production of electronic products. In particular, it is necessary to improve detection accuracy as much as possible to ensure the quality of assembled products. However, the lack of object detection datasets for PCB assembly scenes is a key factor restricting research on intelligent PCB assembly. As an excellent representative of one-stage object detection models, YOLOv3 (you only look once version 3) mainly relies on placing predefined anchors on the three feature pyramid layers and realizes recognition and positioning through regression. However, the number of anchors distributed to each grid cell of the different-scale feature layers is usually the same, while the ERF (effective receptive field) corresponding to grid cells at different locations varies. The contradiction between the uniform distribution of fixed-size anchors and the ERF size range of the different feature layers reduces the effectiveness of object detection, yet few studies use the ERF as a standard for assigning anchors to improve detection accuracy. To address this issue, we first constructed a PCB assembly scene object detection dataset, which includes 21 classes of detection objects across three scenes: before assembly, during assembly, and after assembly. Second, we performed a refined ERF analysis on each grid of the three output layers of YOLOv3, determined the ERF range of each layer, and proposed an anchor allocation rule based on the ERF. Finally, for the small and difficult-to-detect TH (through-holes), we increased the context information and designed a joint improved-ASPP (atrous spatial pyramid pooling) and channel attention module. Through a series of experiments on the PCB assembly scene object detection dataset, we found that, under the YOLOv3 framework, anchor allocation based on the ERF can increase mAP (mean average precision) from 79.32% to 89.86%.
At the same time, our proposed method is superior to Faster R-CNN (region convolutional neural network), SSD (single shot multibox detector), and YOLOv4 (you only look once version 4) in balancing high detection accuracy with low computational complexity.


Introduction
With the increasing influence of electronic products on social change, major countries regard electronic product manufacturing as a strategic industry. Completing the intelligent transformation and upgrading of electronic product manufacturing is also an inevitable choice for the manufacturing industries of all countries. In particular, realizing visual object detection across the entire manufacturing process of electronic products can help manufacturers alleviate labor shortages and improve product competitiveness.
At present, visual object detection has been widely used in different stages of electronic product manufacturing, such as the manufacture of electronic components, PCB surface mounting, and reliability testing. However, THT (through-hole technology) in PCB assembly requires highly skilled operators trained in the corresponding standards. These manual placement costs are high and have become a bottleneck in the intelligent manufacturing of electronic products. The most challenging problems in the THT process of PCB assembly are the messy placement of electronic components and the extremely small vias. Even a well-trained operator who relies on vision to recognize these electronic components, with their different shapes and orientations, and the similar-looking vias has a high error rate. Therefore, studying intelligent visual object detection for THT in PCB assembly scenarios can help electronic product manufacturers increase output, improve quality, and significantly reduce costs. The task of object detection [1][2][3][4] is to find all objects of interest in an image and determine their positions and classes. Deep convolutional neural networks (DCNNs) [5][6][7][8] are biologically inspired structures for computing hierarchical features. Since the concept of the DCNN was introduced, object detection has achieved remarkable progress based on deep learning techniques [9]. For a long time, improving detection speed and accuracy has been the mainstream problem in the research field of vision-based object detection algorithms. The anchor box is key to improving the quality and speed of object detection using DCNNs. Since the introduction of the RPN (region proposal network) in Faster R-CNN [10], the idea of selecting a bounding box based on an anchor has been widely adopted. An anchor is one of a set of fixed reference frames of different scales and positions, determined according to the sizes of the labeled samples in the training dataset.
These reference frames cover almost all positions in the selected feature pyramid layer. Each reference frame is compared with the ground truth to obtain region proposals whose IoU (intersection over union) exceeds a threshold. These proposals are retained as bounding boxes using NMS (non-maximum suppression). Each bounding box contains the location and class information of a detected object. At present, most mainstream object detection methods use anchors, and experiments show that the effect is significantly better than that of the earlier sliding-window approach.
As a typical representative of one-stage object detection algorithms, the YOLO series has attracted much attention since its introduction because it can realize multiobject detection in real time [11]. Throughout the development from YOLOv1 to YOLOv5 [12], multiple anchors have been used as dense detectors to improve the accuracy and speed of object detection. The YOLOv1 network structure is simple, with only one output layer and two anchors; however, each grid of the output layer can predict only one class. For YOLOv2 [13], the input image is preprocessed and resized to 416 × 416, and the backbone network downsamples it through deep convolutions to 13 × 13. At this layer, five anchors of different sizes, predetermined from the training data, are fed to the model as preselection boxes before training and prediction. The added anchors make training faster by constraining the shapes and sizes of possible bounding boxes. The main problem with YOLOv2 is that it cannot adapt to more varied images and detects on only one layer. Unlike YOLOv1 and YOLOv2, which predict the output at the last layer, YOLOv3 [14], YOLOv4 [15], and YOLOv5 all use nine anchors generated by clustering the training data, and these nine anchors are distributed equally, by size, to three output layers of different scales to achieve object detection. This fixed 3-3-3 allocation, which evenly distributes the anchors over three output layers of different scales, makes full use of the ideas of the feature pyramid and dense box search and achieves excellent accuracy. However, this anchor allocation mechanism does not consider that the anchor sizes generated by clustering are random, whereas the ERF size of each grid of the three anchor allocation layers is fixed. In a DCNN, the ERF reflects the size of the effective area when each pixel of each feature layer is mapped back to the input image during the convolution process [16].
We can imagine that if the anchor size assigned to a grid is much larger than the ERF size of that grid, then the extracted feature map covers only a tiny part of the object to be detected; the network detects objects like a blind person touching an elephant. If the anchor size assigned to the grid is much smaller than the ERF size, the extracted feature map contains so much context information that interference easily occurs; the network detects objects like someone finding a needle in a haystack. However, few studies associate the ERF with the effectiveness of the anchor to improve object detection accuracy.
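For concreteness, the fixed 3-3-3 split discussed above can be reproduced with the nine standard COCO anchors from the original YOLOv3 configuration. The helper below is an illustrative sketch, not code from this paper:

```python
# Default YOLOv3 anchors (from the original COCO configuration), given as
# (width, height) in pixels of the 416 x 416 input image.
YOLO_V3_ANCHORS = [
    (10, 13), (16, 30), (33, 23),       # assigned to the 52 x 52 layer
    (30, 61), (62, 45), (59, 119),      # assigned to the 26 x 26 layer
    (116, 90), (156, 198), (373, 326),  # assigned to the 13 x 13 layer
]

def default_allocation(anchors):
    """Fixed 3-3-3 split by ascending anchor area, ignoring layer ERFs."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    return {52: ordered[0:3], 26: ordered[3:6], 13: ordered[6:9]}
```

Note that the split depends only on anchor area; nothing guarantees that, for example, a 373 × 326 anchor matches the ERF of the 13 × 13 grid cells, which is exactly the mismatch this paper addresses.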
To solve the above problems, we choose YOLOv3 as the object detection framework, take the THT electronic components and through-holes in the PCB assembly scene as the detection objects, take the ERF size range of each anchor distribution layer as the entry point, and design ERF-based anchor allocation rules to improve detection accuracy. The main contributions of this paper are summarized as follows: (1) We propose a new dataset for object detection in PCB assembly scenes. This dataset covers three THT scenarios (before assembly, during assembly, and after assembly) and 21 classes of detection objects.
(2) We realize a refined analysis of the three output layers of YOLOv3 from the perspective of the ERF and propose an ERF-based anchor allocation rule. (3) We design a joint improved-ASPP and channel-attention module for THs, which are small and similar in appearance. These modules increase the context information, adjust the weights of different channels in the feature layer, and improve detection accuracy. The following section reviews object detection methods in manufacturing scenes and various anchor-based improvements. In Section 3, we describe the proposed method. In Section 4, we present experimental results and show the effectiveness of the proposed method. Finally, Section 5 concludes this paper.

Related Work
As the ERF-based anchor allocation method for PCB assembly scene object detection involves both object detection in manufacturing scenes and anchor-based improvements, the following subsections introduce related work in these two fields.

Computational Intelligence and Neuroscience

Object Detection in Manufacturing Scenes.
With the development of the manufacturing industry, intelligent manufacturing technology has gradually become the key to realizing knowledge-driven, automated, and flexible manufacturing that responds rapidly to the market [17][18][19]. In particular, vision-based object detection in manufacturing scenes can effectively improve intelligent production. Some research results on object detection in manufacturing scenarios have already been reported. Ghadai et al. proposed using a 3D-CNN to identify difficult-to-manufacture drilled holes in CAD geometry; this detection can assist intelligent manufacturability decisions [20]. Lin et al. proposed a YOLO-based capacitor detection method for PCB assembly; however, the detection objects comprise only nine types of capacitors [21]. Tao et al. proposed a CNN detection model for five defects likely to occur in spring wire sockets during manufacturing; this model performs defect detection and classification simultaneously to improve detection efficiency [22]. Lemos et al. proposed using transfer learning with SSD300, SSD512, and Faster R-CNN to achieve efficient detection in an additive manufacturing target recognition task [23]. Li et al. studied the role of CAD synthetic data in semantic segmentation algorithms for industrial manufacturing [24]. For the four small parts of the watch assembly process, Qian et al. proposed IoMA-NMS and a new NMS-based loss function to address the excessive number of preselected frames during segmentation, improving the detection effect of Mask R-CNN [25]. At the industrial manufacturing site, Park et al. used wearable augmented reality smart glasses to render the scene using the three-dimensional spatial information of scene targets extracted by Mask R-CNN, helping operators better recognize and understand the operating objects in the physical scene [26].
Because of the rapid turnover of clothing styles in clothing manufacturing companies, using deep learning to detect defects on untrained clothing objects is a complex problem; San-Payo et al. proposed an incremental learning algorithm to learn new target features [27]. Tsai and Chou proposed four CNN-based models for precise positioning in PCB manufacturing to achieve exact position and angle detection [28]. Zhang et al. proposed a deep multimodel cascade method that combines single-frame and multiframe image processing to detect and identify foreign particles for the quality inspection of liquid pharmaceutical products [29]. Zhang et al. proposed a method that combines Faster R-CNN and R-FCN multichannel features to detect surface defects in the production of solar panels [30]. Li et al. proposed a lightweight model design method based on ERF and anchor matching to solve the detection problem of electronic components on assembled PCBs [31]. Wang et al. took workers' action recognition and parts detection at a toy assembly site as their research content; they proposed using Faster R-CNN as the detection method to realize skill transfer in intelligent manufacturing [32]. Facing the problem of semantic segmentation of safety vents welded on battery covers during power battery production, Zhu et al. proposed a lightweight multiscale attention model to effectively improve product quality inspection [33].

Anchor-Based Object Detection.
Using anchors as preselection boxes to search and match densely over images to improve the accuracy and speed of object detection is currently the mainstream approach. Many scholars are studying the design and improvement of the anchor box. Zhong et al. proposed a method of adaptively learning the anchor box size according to network capabilities and data distribution; this method removes the constraint of a fixed anchor box size during training and can improve detection accuracy [34]. When Zhang et al. used Faster R-CNN to detect surgical tools in laparoscopic videos, they followed the anchor-center and feature-alignment principle and proposed a subnetwork for predicting anchor size, reducing the number of dense anchors [35]. Yu et al. designed orientation-guided anchors for detecting rotated objects in remote sensing images: one of two subnetworks is responsible for training and generating anchors of appropriate size in the anchor generation stage, and the other is responsible for estimating the occurrence probability of the object, avoiding invalid anchors in the background [36]. Tian et al. used an attention mechanism to produce adaptive anchors to improve object detection in remote sensing images [37]. In the task of license plate detection, Nguyen et al. designed anchors of fixed length and proportion at different layers of the feature pyramid according to the aspect ratio of license plates and used a multiscale region proposal network with predicted anchor position information to improve the detection effect [38]. Ma et al. considered the spatial relationships among the anchor, ground truth, and bounding box for one-stage anchor-based object detection and proposed a location-aware framework to improve detection performance and robustness [39]. Li et al. proposed posture anchors when using a one-stage object detection framework for hand keypoint detection.
Through clustering analysis of representative posture anchors over multiple gestures, angles, and scales, the occlusion problem of hand keypoint detection can be solved [40]. For object detection, Jin et al. proposed using the ground truth areas in the training set to adaptively generate anchors for different feature layers in the backbone, reducing the number of anchors [41]. Hosoya et al. found that in video object detection, objects can be lost momentarily in consecutive frames; they proposed a soft threshold to address the problem of fixed-value anchors and thresholds when the scale and angle of the same object change across consecutive frames [42]. Given the multiscale and dense nature of remote sensing image targets, Guo et al. adopted a clustering algorithm with a nonfixed k value, used the principle of minimum clustering loss, and generated an adaptive number of anchors according to the characteristics of the object data in the training set [43]. Grabel et al. used circular anchors to replace the rectangular anchors commonly used in object detection when detecting blood cells [44]. Chen et al. proposed generating K-means anchor boxes based on similar-shape distances, aimed at the problem of detecting many small ships in synthetic aperture radar images [45]. Zhu et al. used the size ratios between feature layers to allocate the anchors obtained by K-means in the same ratios, which improved the object detection effect for multiscale remote sensing images [46]. When detecting ships against complex backgrounds, Xiao et al. proposed using pairwise semantics in the oriented and horizontal minimum bounding boxes, resulting in fewer and more accurate rotation anchors and reducing computational complexity [47]. Wang et al. proposed a receptive field generation method that matches the anchor size; they used a nonsquare convolution kernel to generate the receptive field closest to the anchor size and redistributed the anchor sizes using the spatial relationships between vehicles [48]. Wang et al. observed that invalid anchors slow down detection and proposed an anchor position prediction network and an anchor shape prediction network, which greatly reduce the number of anchors and the computation time [49]. For the multiclass object detection problem in remote sensing images, Mo et al. proposed a class-specific anchor generation algorithm to improve the recall rate and designed the most suitable anchor for each class [50]. Gao et al. found that the same object keeps the same aspect ratio even when the obtained field of view differs and proposed an anchor that predicts only the aspect ratio, improving object detection accuracy [51]. In pedestrian detection, Fang et al. found that geometric constraints on pedestrians can be combined with anchor box information, and anchors containing geometric constraints reduce both inference time and the inference error rate [52]. Deng et al.
used the learned proposals in two-stage R-CNN to propose a learnable anchor and replaced the fixed anchors in a one-stage detector with the learnable anchor to achieve real-time scene text detection [53]. In the object detection task of optical remote sensing images, Bao et al. proposed two anchor regressions that adapt to different IoU thresholds according to the different needs of recognition and positioning [54]. Zhang et al. proposed an optimized sampling anchor constructed using feature maps extracted by deep learning; this method is suitable for two-stage object detection and improves performance [55]. Yang et al. proposed a meta-anchor and designed an anchor function to learn parameters from preselected boxes; this anchor setting and the resulting bounding box distribution are more robust in object detection tasks [56].

Knowledge Gaps.
Although significant progress has been made in the two fields mentioned above, the review reveals some gaps that still need to be filled.
For object detection tasks in manufacturing scenarios, many studies have covered different manufacturing processes, such as drawing geometry detection in production design, tool and material detection in the early stage of production, action recognition during production, and defect detection in finished products. Defect detection in the manufacturing process, in particular, is a research hotspot. However, these object detection tasks each involve a single production scene, each scene has few object classes, and the objects are relatively scattered. The PCB assembly scenarios proposed in this paper include three different situations: the scattered electronic components before assembly, some electronic components inserted into the THs on the PCB during assembly, and all the inserted THs on the PCB after assembly. There are 21 classes. The objects to be detected differ significantly in scale and are highly similar in appearance. As far as we know, this is the first full-scene detection dataset for electronic component assembly.
As far as anchor-based object detection methods are concerned, the main research directions are adaptive anchors that can be trained and learned, anchor alignment, arbitrary-oriented anchor design, and anchor effectiveness. Most experts pay attention only to the anchor itself, ignoring the relationship between the anchor size and the ERF size corresponding to each grid of the anchor placement layer when the anchor serves as a preselection box.
This paper proposes an anchor allocation algorithm that needs only the basic K-means algorithm to generate anchors and allocates them based on the principle that the size of the anchor placed on each grid of an output channel should match the ERF size corresponding to that grid. As far as we know, this is the first time the ERF size of each grid in the output feature layers of an object detection model has been quantified, which provides a basis for effective anchor allocation and improves object detection performance.

Methodologies
This study uses the anchor-based one-stage object detection network YOLOv3 as the primary network. The electronic components, PCB, NITH (non-inserted through-holes), and ITH (inserted through-holes) are the detection objects in the electronic component assembly scene. Analyzing and calculating the ERF of each grid is the starting point of this research, which focuses on determining the anchor distribution principle according to the ERF size range of each anchor distribution layer. Figure 1 shows the entire ERF-based anchor allocation method. As we know, a CNN-based object detection network consists of four parts: input, backbone, neck, and head. For YOLOv3, the input is a 3-channel color image resized to 416 × 416. The backbone is a feature extractor in the form of a feature pyramid; here, we choose Darknet-53. The neck is a feature fusion module that helps the detector locate and recognize objects better. The head has three convolutional outlets, and each outlet completes object detection at a different scale using its assigned anchors. The proposed method includes ERF-based anchor assignment and a joint feature-fusion enhancement module adapted to difficult-to-detect targets, which act on the head and neck of YOLOv3, respectively, and are described in detail below.

Anchor K-Means.
K-means, a simple and commonly used unsupervised learning algorithm, is applied to the anchor generation problem of object detection. K groups of anchors with high within-group similarity in width and height are automatically generated by clustering the bounding boxes of the training set.
We choose 416 × 416 as the size of the YOLOv3 input image. All training and test images are first resized to 416 × 416, and the resized bounding box sizes are used for clustering to generate anchors. An anchor is used as a preselection box for object detection: we obtain the empirical sizes of the objects from the large amount of label data in the training set and quickly find the detected objects by training the CNN with regression. Why do we emphasize that the anchors are generated after the original images are resized to 416 × 416? There are two reasons. One is that for YOLOv3, whose input image size is fixed at 416 × 416, normalizing the anchors to 416 gives more accurate object detection. The other is that the anchor sizes can then be directly compared with the ERF sizes to determine the anchor distribution rules.
Here, we compare the anchors generated by clustering the PCB assembly scene training dataset before and after resizing to 416. As a professional camera is used to photograph the assembly scene, the photos are large: 4092 × 3000. The objects to be detected include the PCB, capacitors, inductors, chips, ITHs, NITHs, etc., and the size differences between these objects are large. As we all know, only one anchor can be responsible for detecting a given object.
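The resize-then-cluster procedure above can be sketched as follows. This is a standard YOLO-style k-means with 1 − IoU as the distance metric, written for illustration; the function names and the plain (non-aspect-preserving) resize are our assumptions, not the paper's code:

```python
import numpy as np

def resize_boxes(boxes_wh, orig_size=(4092, 3000), target=416):
    """Scale (w, h) label boxes from the original photo size to the 416 x 416
    network input, assuming a plain non-aspect-preserving resize."""
    sx, sy = target / orig_size[0], target / orig_size[1]
    return boxes_wh * np.array([sx, sy])

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids compared at a shared top-left corner."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    """YOLO-style k-means on (w, h) boxes with 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU
        assign = np.argmax(iou_wh(boxes_wh, centroids), axis=1)
        new = np.array([boxes_wh[assign == i].mean(axis=0)
                        if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]  # sorted by area
```

A training pipeline would call `resize_boxes` on all labels first and then feed the result to `kmeans_anchors`, so that the nine anchors are already expressed in the 416 × 416 coordinate frame used by the ERF comparison.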

Refined Analysis of the ERF Size.
In a deep neural network, the receptive field represents the extent of the original image perceived by neurons at different positions within the network. The deeper the convolutional layers, the larger the area of the original image reflected by a single output pixel.
Figure 1: Description of the ERF-based anchor allocation method. Object detection based on YOLOv3 is achieved by placing dense anchors on the three output channels of the head and using regression for object recognition and positioning. The red rectangular boxes in the head part represent anchor boxes placed in different output layers. N1, N2, and N3 represent the numbers of anchors allocated to the three output layers; the original allocation is 3-3-3. Blue, green, and red represent the anchors placed on each grid of the 13 × 13, 26 × 26, and 52 × 52 layers. The yellow rectangle represents the ERF size corresponding to a grid; grids at different positions correspond to different ERF sizes. Redistributing the number of anchors according to the ERF size range of the anchor placement layers improves upon the original even anchor distribution.

The ERF refers to the area of the original image from which information can be effectively received, and this area follows a Gaussian distribution. When a convolutional neural network uses a convolution kernel to perform traversal feature extraction on an image, pixels at different positions in the previous layer contribute differently to the feature value of a point in the current layer [57]. Other researchers have studied the calculation and analysis of the ERF of convolutional neural networks [31][57][58][59]. However, current studies assume by default that the ERF is square, without a detailed analysis of its size and shape. We use the gradient backpropagation method shown in Figure 3 to perform ERF analysis and calculation on the three output layers of YOLOv3, i.e., on each grid to which anchors are allocated. The refined ERF analysis is divided into seven steps.
Step 1. Load the object detection model. As the object detection framework used in this study is YOLOv3, the first step is to construct and load this model.
Step 2. Import model weights. Use the weights of the pretrained YOLOv3 object detection model to assign values to the grid of the output layer, perform gradient backpropagation, and determine the activation area.
Step 3. Determine the number of the feature layer to be analyzed. The ERF analysis method proposed in this paper can analyze, for any pixel in any feature map, the corresponding area of the original image. Therefore, we must first specify the layer number of the feature map to be studied within the entire object detection model.
Step 4. Assign an initial value of 1 to each pixel of the feature layer to be analyzed. The essence of refined analysis lies in precision: determining, point by point, the size of the original image area that each feature map point perceives is useful for the visual tasks of a convolutional neural network. Step 5. Use gradient backpropagation to calculate the activation degree of the three channels of the input image. First, construct three 416 × 416 black channels, representing the R, G, and B channels of the original input color image. Then backpropagate the feature map, with each pixel's brightness set to 1, to the R, G, and B channels of the original image to obtain a three-channel activation map.
Step 6. Use the Gaussian function to determine the ERF. Not all the activation points obtained in Step 5 are valid; the effective area is Gaussian. Using the two-sigma rule of the Gaussian distribution, we can determine the ERF area to which each pixel on the feature map under analysis maps back on the original image.
Step 7. Determine the ERF size. After clarifying the ERF range, determine the ERF size according to the leftmost, rightmost, bottommost, and topmost positions of the activation points within the 416 × 416 range.
Among the above steps, the sixth step is the most important. Using Gaussian distribution to determine the ERF range is the basis of this research.
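Steps 4-7 above can be sketched in PyTorch as follows. The toy convolutional stack stands in for the pretrained YOLOv3 backbone (Steps 1-3 would load Darknet-53 weights instead), and thresholding the activation map at exp(−2) of its peak, where a Gaussian falls at two sigma, is one possible way to operationalize the two-sigma rule; both are our assumptions for illustration:

```python
import torch
import torch.nn as nn
import numpy as np

def measure_erf(feature_extractor, grid_y, grid_x, input_size=416):
    """Backpropagate a unit gradient from one feature-map grid cell to a black
    3-channel input (Steps 4-5) and measure that cell's ERF (Steps 6-7)."""
    img = torch.zeros(1, 3, input_size, input_size, requires_grad=True)
    fmap = feature_extractor(img)
    grad = torch.zeros_like(fmap)
    grad[0, :, grid_y, grid_x] = 1.0   # Step 4: value 1 at the analyzed cell
    fmap.backward(grad)                # Step 5: activation on the 3 channels
    act = img.grad.abs().sum(dim=1).squeeze(0).numpy()

    # Step 6: keep the Gaussian core; at two sigma a Gaussian falls to
    # exp(-2) of its peak, so we threshold the activation map there.
    ys, xs = np.nonzero(act >= np.exp(-2) * act.max())

    # Step 7: ERF width/height from the extreme activated positions
    return int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)

# Linear stand-in backbone: three stride-2 convs give a 52 x 52 output map
# (the real analysis would use the pretrained YOLOv3 model instead).
torch.manual_seed(0)
toy = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1),
    nn.Conv2d(8, 8, 3, stride=2, padding=1),
    nn.Conv2d(8, 8, 3, stride=2, padding=1),
)
```

For this toy stack the theoretical receptive field is 15 × 15, so the measured ERF of a central grid cell is at most that large; on Darknet-53 the same procedure yields the much larger, location-dependent ERFs used by the allocation rule.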

Anchor Allocation Rules.
With the refined analysis method for the ERF of each pixel of the feature map, the ERF range of the entire feature map is determined. The YOLOv3 object detection network relies on placing anchors on each grid of the three output layers of the head and finally achieves object detection using regression. Once the training set of the object detection task is selected, the sizes of the corresponding anchors are determined. Compared with the ERF size of the detection layer where an anchor is placed, an anchor that is too small is like a needle in a haystack, and one that is too big is like a blind person touching an elephant.
(Figure caption: The three rectangular boxes, respectively, represent the size and position of the ERF corresponding to a grid in each of the three output layers. The three grids on the right represent the sizes of the three output layers when the input image size is 416 × 416. These three output layers are also the anchor placement layers.)

How do we determine the rules for assigning anchors based on the ERF? Here, we use Figure 4 to illustrate which anchor allocations are appropriate and which are not. From Figure 4(a), we can see that although the three anchors are all enclosed in the ERF, their sizes differ, and the anchor [34, 47] is much too small compared to the ERF. Placing such an anchor in this output layer makes object detection like finding a needle in a haystack, which is inappropriate. From Figure 4(b), we can see that only one anchor is enclosed in the ERF, while the other two anchors, [186, 120] and [233, 151], are larger than the ERF; the boundaries of these two anchors lie outside the ERF. Placing two such anchors in this layer makes object detection like a blind person touching an elephant: each time, only part of the object can be detected, which is inappropriate. The above analysis shows that the first condition for assigning a suitable anchor is that the anchor should be enclosed in the ERF and that the difference between the anchor and the ERF should not be too large. Therefore, we design an
anchor allocation rule based on ERF. According to the ERF of each grid of the anchor placement layer, the ERF range of the entire output layer can be obtained, and the anchors can be arranged from small to large. The larger the output feature map, the smaller the ERF corresponding to each grid; the smaller the output feature map, the larger the ERF corresponding to each grid. We start with the smallest anchor and compare it with the ERFs corresponding to all grids in the largest output feature map. If this anchor can be enclosed in every ERF, it is allocated to this layer; if even one ERF cannot contain it, the anchor moves to the second-largest output feature map for comparison and allocation. This is repeated until all anchors are allocated to the different layers. Algorithm 1 shows the pseudocode of the anchor-enclosed algorithm.
Through the above ERF-based anchor allocation algorithm, we change the original YOLOv3 even anchor distribution. We reallocate the anchors over the three output layers so that every anchor in a layer can be enclosed by the ERFs of all grids in that layer. At the same time, we follow the principle that anchors are not reused across layers. Because the ERF size corresponding to the 52 × 52 output layer is the smallest, the anchors allocated to the 52 × 52 layer are those completely enclosed by the ERFs of this layer. For the 26 × 26 output layer, the assignable anchors are those completely enclosed by this layer after removing the anchors already allocated to the 52 × 52 layer. For the 13 × 13 output layer, we first remove the anchors already used by the 26 × 26 and 52 × 52 layers, and the remaining anchors are allocated to this layer. In this way, an ERF-based anchor allocation algorithm is realized.
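Algorithm 1 itself is not reproduced here, but the allocation rule just described can be sketched as follows; the data layout, names, and the fallback of giving any remaining oversized anchors to the coarsest layer are illustrative assumptions:

```python
def allocate_anchors_by_erf(anchors, layer_erfs):
    """ERF-based allocation sketched from the description in the text: anchors
    sorted small-to-large go to the largest feature map whose every grid ERF
    fully encloses them; anchors are never reused across layers.

    layer_erfs: {layer_size: [(erf_w, erf_h) for every grid of that layer]}
    """
    # Largest feature map (smallest ERFs) first, e.g. [52, 26, 13]
    layers = sorted(layer_erfs, reverse=True)
    allocation = {s: [] for s in layers}
    remaining = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    for layer in layers:
        min_w = min(e[0] for e in layer_erfs[layer])
        min_h = min(e[1] for e in layer_erfs[layer])
        still = []
        for (aw, ah) in remaining:
            # Enclosed in every grid's ERF <=> enclosed in the smallest ERF
            if aw <= min_w and ah <= min_h:
                allocation[layer].append((aw, ah))
            else:
                still.append((aw, ah))
        remaining = still
    # Anything still left is larger than every ERF; assign it to the
    # coarsest layer as a fallback (our assumption, not stated in the paper).
    allocation[layers[-1]].extend(remaining)
    return allocation
```

Unlike the fixed 3-3-3 split, the resulting allocation (e.g., 4-3-2 or 5-2-2) depends entirely on how the clustered anchor sizes fall relative to the measured ERF ranges of the three layers.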

Some Other Improvements.
In addition to assigning suitable anchors for object detection, we further combine context and channel attention to improve multiscale feature fusion. In the neck part of YOLOv3, we pass the extracted feature maps through an improved-ASPP [60] and channel attention joint module before sending them to the final head part. Figure 5 shows the entire joint module.
Atrous spatial pyramid pooling (ASPP), i.e., spatial pyramid pooling with atrous/dilated convolution, can process objects of different sizes simultaneously and has been applied in semantic segmentation and object detection. The improved-ASPP proposed in this paper concatenates contextual multiscale feature information extracted from the original feature map. Then, the concatenated result and the original feature map are added together. Both the concatenation and the addition integrate feature information of different scales. The improved-ASPP strengthens the reuse of features and alleviates the vanishing-gradient problem in deep networks.
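The branch-and-merge structure can be made concrete with a minimal 1D NumPy sketch (the actual module operates on 2D feature maps inside YOLOv3, and the weights and function names below are purely illustrative): dilated branches at several rates are concatenated, fused by a 1 × 1 projection, and added back to the original feature map.

```python
import numpy as np

def dilated_conv1d(x, w, rate):
    """'Same'-padded 1D dilated convolution. x: (C_in, L), w: (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    pad = rate * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        # K input samples spaced `rate` apart -> enlarged receptive field.
        taps = xp[:, t : t + rate * (k - 1) + 1 : rate]
        out[:, t] = np.tensordot(w, taps, axes=([1, 2], [0, 1]))
    return out

def improved_aspp_1d(x, branches, proj_w):
    """Concatenate dilated branches, fuse with a 1x1 projection, add the input."""
    feats = [dilated_conv1d(x, w, rate) for rate, w in branches]
    cat = np.concatenate(feats, axis=0)                 # multiscale context
    fused = np.tensordot(proj_w, cat, axes=([1], [0]))  # 1x1 conv as a matmul
    return fused + x                                    # residual addition
```

A unit impulse passed through a rate-2 branch responds at positions two samples apart, which is exactly the receptive-field enlargement that dilation provides.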
As is well known, the attention mechanism allows the model to ignore irrelevant information and focus on the critical information we want it to attend to. Channel attention improves the network's representation ability by modeling the dependence between channels: it learns the weights of different channel features in object detection and adjusts the features channel by channel, so that the network learns to use global information to selectively enhance features that contain helpful information and suppress those that are less useful.
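A squeeze-and-excitation-style channel attention block of the kind described here can be sketched in NumPy as follows (the weight shapes, reduction ratio, and function names are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Rescale each channel of x: (C, H, W) by a learned weight in (0, 1).

    w1: (C // r, C) and w2: (C, C // r) form a bottleneck of reduction ratio r.
    """
    squeeze = x.mean(axis=(1, 2))            # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)   # bottleneck FC + ReLU
    weights = sigmoid(w2 @ hidden)           # per-channel attention weight
    return x * weights[:, None, None]        # channel-by-channel rescaling
```

The per-channel weight multiplies every spatial position of that channel, which is the "adjust the features channel by channel" step described above.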

Experiments
We validated the proposed method on the PCB assembly dataset. Therefore, in this section, we first introduce the constructed PCB assembly scene dataset. Second, we analyze the ERF of the grids of the three different output layers of YOLOv3. Then, the anchors are allocated to the different output layers using the ERF-based anchor allocation rules proposed in this paper. Finally, the experimental results are analyzed and discussed.

PCB Assembly Dataset.
Based on the background of intelligent manufacturing, we designed a PCB assembly scene dataset. So-called PCB assembly places electronic components in designated positions on the bare PCB. There are currently two main types of PCB assembly technology: through-hole technology (THT) and surface mount technology (SMT). The vast majority of PCB assembly uses a mixture of the two. SMT is already automated, while THT still largely relies on manual completion. We are concerned with detecting the electronic components to be inserted and the corresponding THs on the PCB after the PCB completes SMT. The dataset contains three scenes (before, during, and after assembly) and 21 object classes. Among these detection objects, there are four types of capacitors with small interclass variance, preinsertion and postinsertion through-holes with only slight differences in appearance, and large-size PCBs. There are 9636 pictures in the whole dataset. The size of all pictures is 4092 × 3000, and the picture background is white.

ALGORITHM 1: Anchor allocation based on the ERF.
Input: 9 anchors arranged in ascending order of size (anchors), where anchor[0] is the width and anchor[1] the height of each anchor; w_52/h_52, w_26/h_26, and w_13/h_13 are the tables of ERF widths/heights of every grid in the three output layers.
Output: data1, data2, and data3, the anchors allocated to the 52 × 52, 26 × 26, and 13 × 13 output layers.
procedure allocate
    data1, data2, data3 = [], [], []
    for anchor in anchors:
        isPass = 1
        for i in range(52):
            for j in range(52):
                if anchor[0] > w_52.iloc[i, j] or anchor[1] > h_52.iloc[i, j]:
                    isPass = 0
        if isPass == 1:
            data1.append(anchor)
    print("52 × 52:", data1)
    for anchor in anchors:
        isPass = 1
        for i in range(26):
            for j in range(26):
                if anchor[0] > w_26.iloc[i, j] or anchor[1] > h_26.iloc[i, j]:
                    isPass = 0
        if isPass == 1 and anchor not in data1:
            data2.append(anchor)
    print("26 × 26:", data2)
    for anchor in anchors:
        isPass = 1
        for i in range(13):
            for j in range(13):
                if anchor[0] > w_13.iloc[i, j] or anchor[1] > h_13.iloc[i, j]:
                    isPass = 0
        if isPass == 1 and anchor not in data1 and anchor not in data2:
            data3.append(anchor)
    print("13 × 13:", data3)
end procedure

The electronic components outside the PCB
are messy and disordered. The electronic components and THs on the PCB are in a fixed position and sequence relative to the PCB. We randomly divide the entire dataset into training data and test data at a ratio of 8 : 2. Figure 6 shows the object number and category statistics of the training set and test set of the whole PCB assembly scene.

ERF Analysis of the Three Output Layers.
The ERF sizes corresponding to the grids of the three output layers are listed in Tables 1-3, respectively. From these three tables, we find that the larger the output layer, the smaller the ERF corresponding to each grid, and the smaller the output layer, the larger the ERF corresponding to each grid. The ERF size corresponding to each grid differs even within the same layer.

Application of ERF-Based Anchor Allocation in PCB
Assembly Scenarios. According to the previous analysis of the YOLOv3 model, there are 169 grids in the 13 × 13 output layer. In terms of area, the smallest ERF size is (222, 162), and the largest is (390, 399). What we care about more is that the anchors allocated to this layer are entirely enclosed by the effective receptive fields of all grids. Therefore, we find the ERF size with the smallest width, (189, 227), and the ERF size with the smallest height, (226, 160). In the same way, for the 26 × 26 output layer, the ERF size with the smallest area is (84, 111), the one with the largest area is (182, 193), the one with the smallest width is (81, 122), and the one with the smallest height is (112, 86). For the 52 × 52 output layer, the ERF size with the smallest area is (41, 39), the one with the largest area is (118, 87), the one with the smallest width is (38, 54), and the one with the smallest height is (47, 38). The anchor allocation rules proposed above are then applied to the PCB assembly scene training dataset, for which the anchor sizes have already been determined. For the 52 × 52 output layer, 4 anchors are allocated: [9, 14, 17, 21, 22, 22, 32, 35]. For the 26 × 26 output layer, 3 anchors are allocated: [27, 30, 34, 41, 47, 47]. For the 13 × 13 output layer, 2 anchors are allocated: [186, 120] and [233, 151]. We have also proposed the improved-ASPP and attention mechanism joint module for YOLOv3. The main function of the improved-ASPP is to combine more contextual information and enlarge the receptive field. Therefore, for the YOLOv3 improved with the joint module, we use the same ERF-based anchor allocation method. 5 anchors are allocated to the 52 × 52 output layer: [9, 14, 17, 21, 22, 22, 32, 35] and [27, 41]. The 26 × 26 output layer is assigned a total of 3 anchors: [30, 47], [34, 47], and [186, 120].
The 13 × 13 output layer is assigned only 1 anchor, i.e., [233, 151]. Figures 7(a) and 7(b), respectively, show the two anchor allocation schemes obtained by applying the ERF-based anchor allocation rules when YOLOv3 and the joint-module-improved YOLOv3 perform object detection in PCB assembly scenes. This differs from the original YOLOv3, which distributes the nine anchors evenly over the three output layers according to size.

(Figure 5 caption: A schematic diagram of the improved-ASPP and channel attention combined module. The rate stands for the dilation rate. This joint module connects the neck and head of the original YOLOv3. It is used three times for the three output channels; the input and output layers are adjusted according to the number of feature channels extracted by the neck part. The output of the entire joint module is input to the corresponding head layer.)

[62]. Table 4 shows the parameters used for the experimental algorithms.

The Experimental Design.
To better illustrate the effectiveness of the method proposed in this research, we carried out eight experiments by adding the improved-ASPP, the channel attention mechanism, and the joint module to the neck part, combined with three anchor allocation schemes. Table 5 shows the naming and the specific improvement steps of these eight algorithms.

Analysis of Subjective Test Results.
The above eight algorithms can all achieve multiobject detection on a PCB assembly scene image, and we tested 120 images with them. Here, we take the detection results of 4 images to illustrate the effect of the eight algorithms. These four images cover one scene before assembly, two during assembly, and one after assembly. In the test result images, rectangular boxes of different colors indicate the detected object locations, and the upper left corner of each box indicates the class and confidence of the object. Figures 8-11, respectively, show the object detection results for the four images across the three PCB scenarios. Each figure contains nine subimages representing the original image's ground truth and the results of the eight algorithms in Table 5.
Comprehensively analyzing these four sets of images, we find that when only the evenly distributed anchors of YOLOv3 are used to detect the PCB assembly scene, the error rate is higher, especially on the PCB and some large objects such as chips. With the addition of the improved-ASPP and channel attention mechanisms to YOLOv3, the detection effect improves: for example, the detection of the PCB gradually approaches the ground truth, and the detection accuracy for inductances improves. However, even with the 4-3-2 ERF-based anchor allocation algorithm, some small objects, such as NITH and ITH, still have low detection accuracy, until the joint module was added to YOLOv3 together with the 5-3-1 anchor allocation, which achieved the best detection effect among all algorithms.

Analysis of Objective Test
Results. For detecting PCB assembly scenes containing 21 classes, we use AP (average precision) to represent the detection accuracy of a single class and mAP (mean average precision) to represent the average accuracy over all classes. The higher the AP and mAP, the better the detection performance of the algorithm. In Table 6, bold indicates that an algorithm's result is better than or equal to those of the other algorithms. From Table 6, we can see that YOLOv3-IASPP-CATT-531 (ours) improves the detection accuracy in 15 classes and obtains the highest mAP.
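As a reminder of how these metrics are computed, here is a minimal NumPy sketch of VOC-style all-point interpolated AP (the paper's exact evaluation protocol may differ, and the function names are illustrative):

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the interpolated precision-recall curve.

    recall / precision: arrays of the cumulative values obtained by
    sweeping detections in order of descending confidence.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]  # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP is simply the mean of the per-class APs."""
    return float(np.mean(ap_per_class))
```

A detector that reaches recall 1.0 at precision 1.0 scores AP = 1.0, and the 21-class mAP reported in Table 6 is the mean of 21 such per-class values.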

Ablation Analysis.
For the ERF-based anchor allocation rules, the improved-ASPP, and the channel attention proposed in this paper, we use a detailed ablation analysis in Table 7 to understand the influence of the different improvement modules on object detection in PCB assembly scenes. We take mAP at an IoU threshold of 0.5 as the measurement standard and use the object detection result of YOLOv3-333 as the benchmark. A first observation is that using ERF-based anchor allocation alone (row 5) brings a more significant increase in mAP (relative to the original YOLOv3-333) than using the improved-ASPP (row 2), channel attention (row 3), or the joint improvement module (row 4). This result gives us confidence that allocating anchors according to the ERF size range of the anchor placement layer can indeed improve detection accuracy. As the improved-ASPP can combine more contextual information to expand the ERF, we considered the joint improvement module of IASPP and CATT in the experiment and reallocated the anchors on the three output layers of YOLOv3-IASPP-CATT. A second observation is that the joint application of the three modules of improved-ASPP, channel attention, and ERF-based anchor allocation (row 8) obtains the highest mAP. Compared with the original YOLOv3-333 (row 1), the mAP of YOLOv3-IASPP-CATT-531 (row 8) increased by 10.54 percentage points.

Analysis of a Series of Curves.
To comprehensively compare the advantages and disadvantages of the eight algorithms, we have drawn four curves in Figure 12 for comparative analysis, using eight different colors to represent the eight algorithms; the color of each algorithm is shown in the four subpictures. Figure 12(a) shows the relationship between the accuracy and the threshold of object detection. It shows that as the threshold increases, object recognition accuracy also increases. During detection, the accuracy provided by the red-line model is significantly higher than that of the other three models, reflecting the superiority of the YOLOv3-IASPP-CATT-531 (ours) algorithm. The mAP (mean average precision) provides a single-figure measure of quality across recall levels; among evaluation measures for different object detection algorithms, mAP offers excellent discrimination and stability. In the experiments of this paper, training ran for a total of 100 epochs. We plot the mAP value on the y-axis, ranging from 0 to 100%, and the number of epochs on the x-axis, ranging from 0 to 100. We can see from Figure 12(b) that when the number of training epochs reaches 100, the best-performing red curve reaches a maximum mAP of 93.07%.
Figures 12(c) and 12(d) show how the train-loss and test-loss curves of the eight algorithms change with iterations. It can be observed from the decrease of the loss in these two figures that although all eight algorithms eventually reach a stable value in both training and testing, the original YOLOv3 algorithm decreases slowly at first, while the YOLOv3-IASPP-CATT-531 proposed in this paper always drops the fastest.
Through the above experimental results, the joint application of the ERF-based anchor assignment rules, the improved-ASPP, and the channel attention proposed in this paper has indeed achieved good results in object detection of PCB assembly scenes.

Comparisons of Detection Performance between Our
Algorithms and Other Methods. To further illustrate the performance of the algorithms, the above eight YOLOv3-based algorithms are compared with current advanced anchor-based object detection methods, including Faster R-CNN [63], SSD [64], and YOLOv4 [15].
Here, we are concerned with the detection accuracy (mAP) and the computational complexity. We use the number of parameters (in Mb) and FLOPs (floating point operations) to measure the complexity of a model. The former describes how many parameters are needed to define the network, i.e., the storage space required to store the model. The latter describes how much computation is required for data to pass through the network, i.e., the computing power needed to use the model. A low-complexity CNN has a small number of parameters and few FLOPs. Table 8 shows the accuracy and complexity of the eleven algorithms. SSD has the fewest parameters, YOLOv3-CATT-333 has the lowest computational complexity, and YOLOv3-IASPP-CATT-531 (ours) has the highest detection accuracy. In summary, YOLOv3-IASPP-CATT-531 (ours) is superior to the other algorithms in balancing high detection accuracy and low computational complexity.

Discussion.
We have shown that by calculating the ERF size corresponding to each grid of the anchor placement layers and using the ERF ranges of the three output layers to determine the allocation of anchors, the recognition and positioning accuracy of object detection in PCB assembly scenes can be improved. For small-size objects, we designed the improved-ASPP and applied channel attention to expand the ERF. When these two modules perform feature extraction, more context information is added, and weight information is automatically assigned to different channels; they are sensitive and reliable. By visualizing the ERF of the grids at different positions of the three output layers, we have achieved a refined analysis of the inside of the DCNN. By comparing the subjective detection images and objective statistics of the eight algorithms on the test dataset, we found that the proposed method shows superiority in object detection.
Although YOLOv3-IASPP-CATT-531 achieves high detection accuracy, there are still some potential limitations and challenges to improving its effectiveness further. First of all, within each ERF, the degree of activation differs; a strong degree of activation indicates a strong possibility that [...] detection speed while ensuring the detection effect is still challenging. Finally, after a refined analysis of the ERF, whether the smallest bounding rectangle of the active area of the ERF can be used in object orientation detection remains an open question. Fortunately, recent work on deep feature alignment [65], dynamic learnable anchors [66], and object orientation detection [67] can all bring inspiration for solving the above problems. Therefore, future research related to anchors and the ERF will continue in depth, adapting to different object detection tasks, improving detection results, and increasing detection speed.

Conclusion
Electronic product manufacturing against the background of intelligent manufacturing requires PCB assembly to be automated and intelligent, and visual object detection of the PCB assembly scene is the basis for realizing this intelligence. We address the problem of object detection in the PCB assembly scene. Firstly, we proposed a PCB assembly scene object detection dataset covering three scenes: before, during, and after assembly. It contains 21 classes of objects to be detected, mainly through-hole components and through-holes. Secondly, we deeply analyzed the ERF size corresponding to each grid on the three output layers of YOLOv3 and determined the ERF range of the three output layers. Thirdly, we proposed an ERF-based anchor allocation rule, using the reallocated anchors to optimize the classification and positioning of the bounding boxes. Finally, we proposed an improved-ASPP and channel attention joint module for the small-size THs before and after insertion, adding more context information to improve detection. Experimental comparison shows that our proposed method achieves the best results for object detection in PCB assembly scenes.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.