Vision-Based Multiscale Construction Object Detection under Limited Supervision

Contemporary multiscale construction object detection algorithms rely predominantly on fully-supervised deep learning, which requires an arduous and time-consuming labeling process. This paper presents a novel semisupervised multiscale construction object detection (SS-MCOD) method that harnesses nearly infinite unlabeled images along with limited labels, achieving more accurate and robust detection results. SS-MCOD uses a deformable convolutional network (DCN)-based teacher-student joint learning framework. The DCN exploits the advantages of deformable convolution to extract and fuse multiscale construction object features. The teacher module generates pseudolabels for construction objects in unlabeled images, while the student module learns the location and classification of construction objects in both labeled images and unlabeled images with pseudolabels. Experimental validation using commonly used construction datasets demonstrates the accuracy and generalization performance of SS-MCOD. This research can provide insights for other detection tasks with limited labels in the construction domain.


Introduction
Construction site monitoring methods are very important in construction site safety management and productivity analysis. As one of the important basic tasks of intelligent construction, vision-based multiscale construction object detection (MCOD), which aims to accurately localize and recognize various objects of different sizes, can provide important data support for subsequent collision risk warning and construction process optimization [1,2]. The earlier MCOD algorithms used traditional image processing technology to manually design features and make simple judgments. With the development of machine learning, designing automatic classification algorithms that use handcrafted features to conduct MCOD gradually became the mainstream.
Machine learning-based methods have improved MCOD accuracy to a certain extent, but they are not effective in the case of complex background interference. Deep learning methods have achieved convincing detection accuracy in the general object detection field, which has greatly improved the detection accuracy of construction objects. Most of these methods have used fully-supervised deep learning approaches and large labeled construction image datasets to train detection models [3,4]. The construction object detection accuracy depends largely on the quality and quantity of the labeled dataset.
Building high-quality large-scale MCOD datasets is very challenging and requires considerable time and labor. Meanwhile, the data distribution differs between datasets, and the performance of an MCOD model that performs well on one dataset decreases when it is tested directly on other datasets [5-7]. Therefore, developing an MCOD model that does not rely heavily on large-scale labeled datasets is of great significance for reducing training costs, expanding data utilization, and improving model generalization capabilities. To achieve this goal, this paper develops a novel semisupervised deep learning-based MCOD approach. As shown in Figure 1, only a limited number of labeled images (i.e., a small amount) and a nearly infinite number of unlabeled images (i.e., a large amount) are needed to achieve better detection accuracy than fully-supervised deep learning (both intradataset and across-dataset).
This paper makes two main contributions. Firstly, it proposes a novel semisupervised MCOD framework, which achieves more accurate and robust detection of construction objects. Secondly, to address the multiscale detection problem caused by large size differences, the presented SS-MCOD uses a DCN instead of a conventional convolutional network to further improve detection precision. The remaining sections of this paper are organized as follows: Section 2 examines the research advancement in vision-based construction object detection; Section 3 introduces the detailed architecture of the proposed SS-MCOD method; Section 4 illustrates the specifics of the implementation; Section 5 exhibits the outcomes of the training and evaluation, as well as the impact of key factors; and Section 6 provides a summary of this paper.

Related Studies
Before the widespread adoption of deep learning, most vision-based automatic construction object detection methods used sliding windows to extract image regions of interest and then used hand-crafted features to identify the construction objects contained in those regions. Chi and Caldas [8] used surveillance camera videos to develop a construction object detection algorithm. Background subtraction and morphological operations were used to obtain the area of the construction object, and two classifiers were trained to identify the object using shape and texture features. Park et al. [9] conducted a comparative analysis of various manual feature extraction techniques to assess their impact on object recognition accuracy. Azar et al. [10] developed Haar-HOG-based and Blob-HOG-based construction truck detectors, and a part-based excavator recognition method in videos using HOG features. Yuan et al. [11] used hybrid kinematic shape and key node features to develop an excavator detection algorithm. These approaches, which relied on manually designed features, achieved a partial abstraction of construction objects and led to enhanced processing efficiency and precision in construction object detection.
In the deep learning period, convolutional neural network-based object detection methods have gained wide popularity in the construction object detection domain. These deep learning-based methods can be categorized into two-stage anchor-based, single-stage anchor-based, and anchor-free techniques. In the case of two-stage anchor-based techniques, researchers predominantly used Faster R-CNN and its variants for construction object detection. Fang et al. [12] introduced an innovative approach called IFaster R-CNN, specifically designed to detect construction workers and heavy construction equipment. They demonstrated the superiority of their proposed method over the hand-crafted feature-based detection approach. Kim et al. [13] presented the use of transfer learning to address the scarcity of training data within the R-CNN series for the construction domain. To detect construction workers in complex backgrounds and changing postures more accurately, Son et al. [14] integrated the deep residual network (152 layers) into Faster R-CNN. Lu et al. [15] implemented fill factor estimation and bucket detection using Faster R-CNN with the feature integration of the region proposal network. In the case of single-stage anchor-based techniques, Guo et al. [3] devised an enhanced version of the SSD model to ensure accurate detection of dense construction vehicles. Roberts and Golparvar-Fard [16] used RetinaNet with a ResNeXt-101 backbone as the earthmoving equipment detection module for activity analysis. To accelerate detection, Arabi et al. [17] developed lightweight construction object detection networks using the SSD network with MobileNet as the backbone. You only look once (YOLO)-v3 is an excellent general object detection framework. Xiao et al. [18] used YOLO-v3 for construction machinery detection in nighttime environments, construction personnel safety device detection, and construction equipment detection in large-scale scenes. In the case of anchor-free techniques, Guo et al. [19] developed an anchor-free method for detecting construction vehicles with arbitrary orientations to realize precise localization of construction vehicles in any orientation.
The deep CNN-based construction object detection methods mentioned above are mostly fully-supervised, that is, they require labeled construction datasets for training. In theory, more labeled data combined with models that have more parameters will produce better detection results. To address the challenge of limited labeled data, researchers have dedicated significant efforts to curating benchmark datasets in the field of construction [20], such as ACID (10,000 images) [5], MOCS (41,668 images) [6], and SODA (19,846 images) [7]. Because obtaining and annotating datasets consumes a great deal of time and manpower, researchers have also tried to use new techniques to automatically generate construction images and the corresponding annotations. Soltani et al. [21] developed an automated construction image generation and annotation approach using 3D equipment models. Bang et al. [22] used generative adversarial networks to generate more construction images with various transforms. Hwang et al. [23] investigated the use of web crawling to acquire construction images and of a segmentation model to automatically label construction objects. Methods that use generative approaches to synthesize simulated construction site images have improved the accuracy of construction object detection to some extent. However, these methods still rely on source data from an existing, limited construction image dataset. The generative models generate new images that closely match the distribution of this existing image set, but they are unable to simulate the diversity of real construction site image distributions. Therefore, the precision and robustness of detecting construction objects using such methods are clearly limited.
The annotation of construction image datasets is time-consuming and laborious, and large-scale, high-quality annotation is even more difficult. Widely used datasets in general object detection are usually annotated at the million-image level, and it undoubtedly takes a long time to accumulate that number of annotations in the construction domain. By contrast, unannotated construction images are relatively easy to obtain. It is therefore crucial to research more accurate and robust construction object detection algorithms that combine limited labeled images with almost infinite unlabeled images. Kim et al. [24] introduced few-shot learning into the construction domain and successfully realized the detection of new construction object categories using limited labeled samples, which made it the pioneering work on construction object detection under limited labeled samples. Few-shot learning focuses on discovering new categories that are not in the training set, while semisupervised learning is able to effectively use unlabeled data to further enhance detection models [25]. The current application of semisupervised learning in the construction domain focuses on structural damage identification and segmentation. Guo et al. [26] developed a façade defect classification approach with semisupervised learning using a modified mean teacher technique, which could train on labeled and unlabeled images simultaneously. Wang and Su [27] proposed a surface crack semantic segmentation model and used the semisupervised teacher-student framework with EfficientUNet as the backbone. Zhang et al. [28] presented automatic defect segmentation frameworks that integrate GANs and semisupervised learning to achieve better precision. Unlike object classification and segmentation, semisupervised object detection needs to consider more factors and is more difficult to implement. To fill this gap, this paper proposes the SS-MCOD framework, which uses semisupervised learning to realize more precise and robust MCOD.

Methodology
As illustrated in Figure 2, the SS-MCOD method proposed in this paper is a teacher-student joint learning framework based on the deformable convolutional network (DCN). The teacher-student joint learning structure is designed to enable semisupervised learning, while the DCN component is leveraged to resolve multiscale issues, thereby enhancing accuracy in construction object detection tasks. During training, labeled construction images are directly input into the student module, and its output is compared with manually labeled data to calculate the loss. Unlabeled construction images undergo weak augmentation and are input into the teacher module to produce pseudolabels (for classification and localization). Additionally, unlabeled construction images undergo strong augmentation and are input into the student module, with its output compared with the pseudolabels to calculate the loss. The weights of the teacher model are transferred from the weights of the student model using the exponential moving average technique.
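The exponential moving average transfer mentioned above can be illustrated with a minimal sketch. The function name `ema_update`, the dictionary representation of weights, and the decay value are assumptions made for illustration; the paper does not state the decay it uses.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to floats. The default decay
    is a hypothetical value, not one reported in the paper.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy weights: the teacher moves only slightly toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the teacher changes slowly and smooths out the step-to-step noise of the student, which is the usual motivation for generating pseudolabels with the teacher rather than with the student itself.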

Teacher-Student Joint Learning Framework.
The proposed SS-MCOD approach introduces a teacher-student joint learning framework for effective semisupervised learning. Specifically, SS-MCOD is trained on a combination of labeled and unlabeled data. Both the teacher module and the student module use the identical fully-supervised object detection architecture, which serves as the base construction object detection model. However, the parameters of the two models differ. The teacher module's parameters are transferred from the student module through the exponential moving average technique [29]. This technique enables …
Unlabeled construction images go through two different data augmentation approaches to generate strongly and weakly augmented data. Strongly augmented data is sent into the student module as input, and the predicted construction object detection bounding boxes of the unlabeled construction images are output. Weakly augmented data is sent into the teacher module as input, and the pseudolabels of the unlabeled data are output. The difference between the predicted construction object detection bounding boxes and the pseudolabels is then computed. This difference is the training loss of unlabeled data $L_u$, which consists of the unlabeled classification loss $L_u^{cls}$ and the unlabeled location loss $L_u^{loc}$, as shown in equation (2). $N_u$ is the number of unlabeled construction images, $P_u^k$ denotes the corresponding predicted construction object detection bounding boxes of the student module, $\hat{G}_{cls}^k$ is the pseudocategory label, and $\hat{G}_{loc}^k$ is the pseudolocation label.
The quality of construction object pseudolabels is crucial to the training and inference accuracy of SS-MCOD. After the weakly augmented construction images are input into the teacher module, multiple construction object detection bounding boxes are generated. To eliminate highly overlapping results, nonmaximum suppression is used as preliminary postprocessing on these bounding boxes. Following Xu et al. [30], a high threshold is then used to filter the bounding boxes that remain. The construction object detection bounding boxes can be divided into foreground boxes ($r_k^{fg}$) and background boxes ($r_j^{bg}$). The foreground boxes are used as the classification pseudolabels, and the reliability measure ($\omega_j$) is used to weight the loss of each background box when calculating the unlabeled classification loss $L_u^{cls}$, as shown in equations (3) and (4). Because the construction object classification and localization tasks are not consistent with each other, high-quality classification pseudolabels usually do not coincide with high-quality location pseudolabels. This paper uses the box jitter method to select reliable bounding box location coordinates: the foreground box is jittered multiple times, the variance of the results is calculated as the reliability measure, and finally the box with high enough reliability ($r_k^{fg}$) is used as the location pseudolabel, where $l_{cls}$ represents the L1 loss.
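The two filtering steps described above (score thresholding after nonmaximum suppression, then a jitter-based reliability check) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the threshold of 0.9, the jitter scale, the number of jitters, and the `refine` callable (a stand-in for the teacher's box regression) are all assumptions.

```python
import random
import statistics


def split_by_score(boxes, scores, threshold=0.9):
    """Split NMS-filtered detections into foreground and background boxes.

    The threshold value is hypothetical; the paper only says it is high.
    """
    foreground, background = [], []
    for box, score in zip(boxes, scores):
        (foreground if score >= threshold else background).append(box)
    return foreground, background


def jitter(box, scale, rng):
    """Shift each coordinate by a random fraction of the box width or height."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (
        x1 + rng.uniform(-scale, scale) * w,
        y1 + rng.uniform(-scale, scale) * h,
        x2 + rng.uniform(-scale, scale) * w,
        y2 + rng.uniform(-scale, scale) * h,
    )


def box_uncertainty(box, refine, n_jitter=10, scale=0.06, seed=0):
    """Mean per-coordinate standard deviation of the refined jittered boxes,
    normalized by the average side length of the original box.

    `refine` is a hypothetical callable standing in for the teacher's box
    regression. A lower value means a more reliable location pseudolabel.
    """
    rng = random.Random(seed)
    refined = [refine(jitter(box, scale, rng)) for _ in range(n_jitter)]
    x1, y1, x2, y2 = box
    norm = 0.5 * ((x2 - x1) + (y2 - y1))
    stds = [statistics.pstdev(coords) for coords in zip(*refined)]
    return sum(stds) / (len(stds) * norm)


fg, bg = split_by_score([(0, 0, 10, 10), (5, 5, 50, 60)], [0.95, 0.40])
# A regressor that always snaps back to the same box is perfectly stable.
stable = box_uncertainty((0, 0, 10, 10), refine=lambda b: (0.0, 0.0, 10.0, 10.0))
# A regressor that leaves the jittered box unchanged is not.
unstable = box_uncertainty((0, 0, 10, 10), refine=lambda b: b)
```

A foreground box would then be kept as a location pseudolabel only when its uncertainty falls below a chosen cutoff, which is a separate hyperparameter not shown here.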
The total loss $L$ of SS-MCOD is composed of the labeled loss and the unlabeled loss, where $\lambda$ is the adjustment coefficient.
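The equation itself is not reproduced in this text. Based on the loss terms named in this paper ($L_l^{cls}$, $L_l^{loc}$, $L_u^{cls}$, $L_u^{loc}$, and $\lambda$), one plausible form is sketched below. Placing $\lambda$ on the unlabeled term and summing the classification and location parts are assumptions, chosen to match the common teacher-student formulation.

```latex
L = L_l + \lambda L_u, \qquad
L_l = L_l^{cls} + L_l^{loc}, \qquad
L_u = L_u^{cls} + L_u^{loc}
```

Under this reading, a larger $\lambda$ gives the pseudolabeled images more influence on the student's gradient relative to the manually labeled images.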

DCN-Based Student Module.
Conventional convolution layers use a consistent convolution operation across various feature maps, with fixed pixel sampling positions (shown in Figure 4(a)). This approach results in the inclusion of numerous background features within the extracted information. Consequently, conventional convolution-based construction object detection networks possess uniform receptive fields for multiscale objects. This limitation hinders the accurate detection of multiscale construction objects. The deformable convolution was proposed to replace the conventional convolution [31] by adding an offset to each original convolution sampling position, as shown in equation (7). $X$ and $Y$ represent the input and output convolutional feature maps, $\omega$ is the weight function, $R$ is the convolution kernel, $u_0$ is the location in $Y$, $u_k$ is the location in $R$, and $\Delta u_k$ is the offset.
With these offsets, the receptive field can be rotated and scaled, so that it effectively covers large construction objects and concentrates accurately around small construction objects (as shown in Figure 4(b)). Through the application of deformable convolution operations, distinct features are extracted according to the size and shape of the construction object. This approach minimizes the extraction of extraneous background information, contributing to more accurate object detection.
In the SS-MCOD framework, the base construction object detection model for both the teacher and student modules is derived from the Faster R-CNN-DCN (i.e., FRCD). The backbone of the FRCD architecture is ResNet-50, featuring four convolution stages. Notably, the first stage retains conventional convolutions, while the subsequent stages, from the second to the fourth, incorporate deformable convolutions.
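Equation (7) is not reproduced in this text. From the symbol definitions above, it presumably takes the standard deformable convolution form, sketched here as an assumption rather than a quotation of the paper:

```latex
Y(u_0) = \sum_{u_k \in R} \omega(u_k) \cdot X(u_0 + u_k + \Delta u_k)
```

Setting every offset $\Delta u_k$ to zero recovers the conventional convolution with fixed sampling positions. Because the learned offsets are generally fractional, the value of $X$ at the shifted position is typically obtained by bilinear interpolation.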

Implementation Details
4.1. Dataset.
The training dataset used in this paper was sampled from the MOCS dataset [6], whose images were acquired from 174 different construction sites under various weather conditions using a variety of equipment. The training dataset includes 12 common types of objects found on construction sites.
To implement the training of SS-MCOD, the training dataset was divided into labeled data and unlabeled data. The training dataset contains 3000 images in total. To explore the influence of different proportions of labeled data and unlabeled data on SS-MCOD, this paper conducts four training cases, in which the proportions of labeled data are 2%, 5%, 10%, and 50%, respectively. The numbers of images and objects in the different cases are shown in Figure 5. In this paper, objects with a bounding box area (width multiplied by height) smaller than 1024 pixels (32 × 32) are classified as small objects, those larger than 9216 pixels (96 × 96) are categorized as large objects, and the remainder fall under the medium object category.
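The case split and the size categories above reduce to simple arithmetic, sketched below. The function and constant names are illustrative, and the labeled image counts are derived here from the stated total and proportions rather than quoted from Figure 5.

```python
TOTAL_TRAIN_IMAGES = 3000
LABELED_PERCENT = (2, 5, 10, 50)   # Cases 1 to 4

SMALL_MAX_AREA = 32 * 32           # 1024 pixels
LARGE_MIN_AREA = 96 * 96           # 9216 pixels


def labeled_counts(total=TOTAL_TRAIN_IMAGES, percents=LABELED_PERCENT):
    """Number of labeled images per case, using exact integer arithmetic."""
    return [total * p // 100 for p in percents]


def size_category(width, height):
    """Classify a bounding box by area using the thresholds stated above."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area > LARGE_MIN_AREA:
        return "large"
    return "medium"


counts = labeled_counts()
categories = [size_category(w, h) for w, h in [(20, 20), (32, 32), (96, 96), (100, 100)]]
```

Note that boxes of exactly 1024 or 9216 pixels fall into the medium category, because the stated thresholds are strict inequalities.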

Structural Control and Health Monitoring
To evaluate the accuracy improvement of SS-MCOD, the MOCS validation dataset (4000 images) was used as validation dataset I (for intradataset evaluation). To evaluate the robustness improvement across different datasets, another dataset called ACID basic (2850 images) [5], which includes the excavator, truck, and concrete truck categories, was used as validation dataset II (for across-dataset evaluation).

Parameter Setting.
Batch size, as one of the crucial parameters of deep learning, has a significant impact on the training procedure. To keep the experimental cases consistent, the batch size was set to 5 in all cases. In each batch, the ratio of labeled data to unlabeled data was 1 : 4. For each experimental case, the number of training epochs was set to 45000, and the initial learning rate was set to 0.005. The learning rate followed a multistep strategy, with reductions of 0.3 at the 15000th and 30000th epochs. $\lambda$ in equation (6) was set to 2.0. The remaining parameter settings for the student and teacher models are consistent with those in the original Faster R-CNN [32] and DCN [31].
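The schedule and batch composition above can be sketched as follows. One assumption is labeled explicitly: "reductions of 0.3" is read here as multiplying the learning rate by 0.3 at each milestone, which is the usual meaning in multistep schedulers but is not stated unambiguously in the text. The function names are illustrative, and `step` stands for the counter the paper calls epochs.

```python
def learning_rate(step, base_lr=0.005, milestones=(15000, 30000), gamma=0.3):
    """Multistep schedule: multiply the rate by `gamma` at each milestone.

    Treating 0.3 as a multiplicative factor is an assumption.
    """
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops


def batch_composition(batch_size=5, labeled_parts=1, unlabeled_parts=4):
    """Split a batch into labeled and unlabeled images at the given ratio."""
    unit = batch_size // (labeled_parts + unlabeled_parts)
    return labeled_parts * unit, unlabeled_parts * unit


lr_start = learning_rate(0)
lr_mid = learning_rate(15000)
lr_end = learning_rate(44999)
n_labeled, n_unlabeled = batch_composition()
```

Under this reading, the rate is 0.005 until the first milestone, 0.0015 until the second, and 0.00045 afterwards, and each batch of 5 holds 1 labeled and 4 unlabeled images.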

Results and Discussions
Training and testing results of the proposed SS-MCOD under four training cases are introduced in this section. Moreover, the influence of pretraining on SS-MCOD is discussed.
The labeled location loss curve shows that $L_l^{loc}$ increases first and then decreases, which is due to the characteristics of the two-stage detection method: the coordinates of bounding boxes can be regressed only after the candidate regions are screened. Similarly, as the ratio of labeled data grows, the loss also increases. The unlabeled classification and location training losses are denoted $L_u^{cls}$ and $L_u^{loc}$, respectively. Figure 6(c) shows that $L_u^{cls}$ first increases significantly and then decreases. Figure 6(d) shows that $L_u^{loc}$ also increases significantly and then decreases slowly. This is because the quality of pseudolabels in the unsupervised learning branch is poor at the beginning of training. As the number of training steps increases, the accuracy of pseudolabel classification and regression improves, leading to a reduction in both losses. Additionally, as the ratio of labeled data increases, $L_u^{cls}$ and $L_u^{loc}$ decrease slightly, because the smaller amount of unlabeled data is easier to fit.
Figure 7 shows the curve of the SS-MCOD total loss $L$. This loss initially decreases sharply, then increases gradually, and eventually transitions into a slow decrease as training advances. The initial descent corresponds to the rapid data fitting by the supervised training branch of SS-MCOD. The subsequent ascent reflects the evolving quality of pseudolabels in the unsupervised training branch. The final descent reflects the improved data fitting capabilities of both the supervised and unsupervised training branches.

Intradataset Evaluation Results.
To demonstrate the effectiveness of the proposed SS-MCOD, the fully-supervised detection method FRCD and the well-known semisupervised object detection method Soft teacher (with Faster R-CNN as the student module) were used for evaluation and comparison. During training, Soft teacher and the proposed SS-MCOD used the same training data, while FRCD used only labeled data.
Table 1 presents the testing results of the three methods on validation dataset I for the different training cases. In Case 1, the mAP achieved by FRCD trained solely with 60 labeled images reached 5.3. The other two methods exhibit significant enhancements in mAP, highlighting the considerable advantage of the semisupervised detection framework for construction objects when labeled data is scarce. Notably, the proposed SS-MCOD outperforms Soft teacher in terms of evaluation accuracy. Similar trends are observed in Cases 2, 3, and 4. Relative to the fully-supervised FRCD, SS-MCOD yields substantial improvements in evaluation accuracy, with mAP increases of 10.8, 11.4, 10.4, and 13.8 in the respective cases. These improvements represent percentage increases of 204%, 107%, 58%, and 63%, respectively. This underscores that SS-MCOD achieves more pronounced accuracy enhancements as the proportion of labeled data decreases, as well as improvements in recall. In contrast, this relationship is inverted when comparing Soft teacher to SS-MCOD. This phenomenon can be interpreted as follows: by leveraging unlabeled data, the semisupervised construction object detection (COD) framework can yield a more precise COD model than its fully-supervised counterpart.
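The relation between the absolute mAP gains and the percentage increases quoted above can be checked for Case 1, the only case whose baseline mAP (5.3) is stated in the text. The function name is illustrative.

```python
def relative_gain_percent(baseline_map, absolute_gain):
    """Percentage increase of mAP relative to the fully-supervised baseline."""
    return 100.0 * absolute_gain / baseline_map


# Case 1: FRCD baseline mAP of 5.3, SS-MCOD gain of 10.8 mAP points.
case1_percent = relative_gain_percent(5.3, 10.8)
case1_rounded = round(case1_percent)
```

The rounded result agrees with the 204% reported for Case 1. The same formula also explains why the percentage gain shrinks in later cases even when the absolute gain stays near 10 to 14 points: the fully-supervised baseline grows as more labels become available.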
Figure 8 qualitatively shows example detection results of SS-MCOD (solid line) and Soft teacher (dotted line) in Case 4. Soft teacher failed to detect the two tower cranes positioned in the middle of the upper left image, the pump truck situated on the left side of the upper middle image, the construction worker located in the lower right of the middle left image, the three construction workers positioned on the right side of the lower left image, and the construction vehicle situated in the middle of the lower middle image. In contrast, SS-MCOD successfully detected all of these objects. This indicates that the proposed SS-MCOD can achieve more accurate detection, particularly when construction objects with large scale differences appear in the same image.

Across-Dataset Evaluation Results.
To ascertain the generalization capability across datasets, a new dataset (validation dataset II) was used to evaluate the proposed SS-MCOD. The evaluation results on validation dataset II for the various training cases are presented in Table 2.
In terms of MCOD accuracy, SS-MCOD performs better than Soft teacher. Specifically, SS-MCOD demonstrates growth in $AP_l$, $AP_m$, and $AP_s$ of 51%/11%/9%, 18%/3%/2%, 10%/−1%/−12%, and 19%/4%/…, respectively, for the four cases. This analysis highlights that SS-MCOD's primary enhancement lies in detecting large-scale construction objects, and it accentuates the noticeable disparity in accuracy among objects of different scales. This discrepancy can be attributed to the particular proportion of scales in validation dataset II, which stands at 86% : 13% : 1%, whereas the proportion in validation dataset I is 39% : 36% : 25%.

Influence of Pretraining on SS-MCOD.
The effectiveness of pretraining in object detection models, which allows for the extraction of more generalized features, has been widely recognized. To ensure both the representativeness and the accessibility of the pretraining data, this study used the training set of the COCO dataset [35] to train the student module of SS-MCOD. The training process spanned 180,000 epochs, after which the trained weight parameters were adopted as the initial weights for SS-MCOD's continued training or fine-tuning. Figures 9 and 10 illustrate the training loss curves with pretraining. A notable reduction in loss is observed for SS-MCOD with pretraining in comparison to SS-MCOD without pretraining, evident in both the partial losses and the total loss. This reduction signifies an enhanced capacity of the model to fit the dataset. Specifically, the trends of the partial losses $L_l^{cls}$ and $L_l^{loc}$ within the labeled branch mirror those of SS-MCOD without pretraining, albeit with a reduction of over 30% in loss values. However, significant changes are noted in the unlabeled training partial losses when compared to SS-MCOD without pretraining. The initial surge in $L_u^{cls}$ is considerably mitigated, and the ascending phase of $L_u^{loc}$ is notably shortened and followed by a pronounced reduction. These observations underscore the impact of pretraining on the unlabeled training branch: it expedites pseudolabel generation and thus substantially augments the model's fitting capability.
Table 3 presents the evaluation results of SS-MCOD with pretraining on the two validation datasets. In the intradataset evaluation, the mAP of SS-MCOD with pretraining improved substantially, with increments of 37%, 51%, 41%, and 30%, respectively, compared to SS-MCOD without pretraining. In the across-dataset evaluation, the mAP of SS-MCOD with pretraining increased by 33%, 31%, 31%, and 19%, respectively, relative to SS-MCOD without pretraining. These findings underscore the significant efficacy of the pretraining strategy in improving the performance of SS-MCOD trained on datasets with varying proportions of labeled data.
In addition to pretraining, the choice of backbone for the Faster R-CNN used in the student and teacher modules is also a significant factor affecting detection accuracy. A more powerful feature extractor can further improve detection accuracy, but at the same time the algorithm's processing speed will decrease. Furthermore, detection accuracy can be further enhanced by statistically analyzing the size characteristics of construction objects and then determining the size and quantity of anchors in Faster R-CNN accordingly.

Conclusions
In this paper, a novel semisupervised multiscale construction object detection method, SS-MCOD, is introduced. This approach takes advantage of a limited number of labeled samples along with a vast amount of unlabeled construction images for training. As a result, SS-MCOD achieves improved accuracy and robustness in object detection. The following conclusions can be drawn:
(1) Superior performance over fully-supervised methods: When contrasted with fully-supervised methods, SS-MCOD achieves substantial improvements in both intradataset and across-dataset evaluations. Notably, for the four cases, improvements of 204%, 107%, 58%, and 63% in intradataset evaluation and 357%, 168%, 156%, and 158% in across-dataset evaluation have been achieved. These outcomes underscore SS-MCOD's elevated accuracy and its ability to generalize across diverse datasets.
(2) Multiscale capability: By harnessing the potent multiscale feature extraction capabilities inherent in the DCN architecture, SS-MCOD demonstrates pronounced advancements in multiscale COD accuracy compared to the widely recognized semisupervised object detection method, Soft teacher.
(3) Impact of pretraining: The incorporation of a pretraining strategy yields a significant enhancement in the accuracy and generalization capabilities of SS-MCOD. Model pretraining using the COCO dataset results in an average mAP increase of 40% for intradataset evaluation and 28% for across-dataset evaluation.
The proposed semisupervised framework effectively enhances detection accuracy and robustness, making it applicable to the efficient and cost-effective detection of various valuable objects in civil engineering contexts. However, this study still has the following limitation: the proposed SS-MCOD framework uses Faster R-CNN as the detector, which is a classic two-stage anchor-based detection framework with relatively high detection accuracy but slow running speed. Future research efforts can focus on adopting single-stage anchor-based or anchor-free detectors with higher detection efficiency, ensuring a significant reduction in algorithm runtime while improving detection accuracy.

Figure 1: Comparison of MCOD methods with different supervision approaches.

Figure 8: Example detection results of two semisupervised methods in Case 4.

Table 1: Evaluation results on validation dataset I with different training cases. Best evaluation results among three methods.

Table 2: Evaluation results on validation dataset II with different training cases. Best evaluation results among three methods.

Table 3: Evaluation results of SS-MCOD on two validation datasets with pretraining.