Underwater Incomplete Target Recognition Network via Generating Feature Module

A complex and changeable underwater archaeological environment leads to a lack of target features in the collected images, affecting the accuracy of target detection. Meanwhile, underwater archaeological images are difficult to obtain, so little training data is available, resulting in poor generalization performance of the recognition algorithm. To address these practical issues, we propose an underwater incomplete target recognition network via a generating feature module (UITRNet). Specifically, for targets that lack features, features are generated by dual discriminators and generators to improve target detection accuracy. Then, multilayer features are fused to extract regions of interest. Finally, supervised contrastive learning is introduced into few-shot learning to improve the intraclass similarity and interclass distance of targets and enhance the generalization of the algorithm. The UIFI dataset is produced to verify the effectiveness of the proposed algorithm. Experimental results show that the mean average precision (mAP) of our algorithm improved by 0.86% and 1.29% under insufficient light and semiburied interference, respectively. The mAP for ship identification reached the highest level in all four sets of experiments.


Introduction
Underwater cultural heritage is a nonrenewable cultural resource. In recent years, substantial, high-intensity development in coastal areas has seriously threatened the safety of underwater cultural heritage, making its protection increasingly urgent. Underwater archaeology currently demands high physical and professional skills from staff, and underwater scenarios are complex and changeable, posing significant safety risks. Therefore, using an autonomous underwater vehicle (AUV) for underwater archaeology can effectively reduce these risks.
However, in underwater archaeological operations, the target images collected by an AUV suffer from missing features due to the harsh underwater environment: underwater light is insufficient, targets are often buried in mud and sand, and relics corrode into wreckage over time, all of which lead to low recognition accuracy. Generative adversarial networks (GANs) can generate features through discriminators and generators, which effectively improves the accuracy of underwater target recognition when not all features of an image can be extracted. However, since underwater images are generally blurred, a classical GAN generates considerable noise while generating features and requires a large number of iterations, resulting in slow convergence. In addition, target recognition algorithms require a large number of labeled samples to achieve high accuracy; most algorithms learn from a labeled training set and focus on recognizing classes that have already appeared in training. In practical applications, however, the difficulty of obtaining underwater target samples means there are not enough samples to train the network, and the large differences between individual relics give the algorithm poor generalization performance. Few-shot learning does not rely on large-scale training samples and can achieve low-cost, fast target recognition for an emerging task with few collectible samples. Its application effectively improves the generalization performance of target recognition algorithms, but such algorithms are usually prone to severe overfitting.
To address these problems, we propose an underwater incomplete target recognition network via a generating feature module. An overview of our algorithm is shown in Figure 1.
The main contributions of this paper are as follows: (1) Dual discriminators and generators are introduced to generate missing features in two submodules. The generated features retain semantic information while reducing the noise produced by the generator and the number of iterations required, thus improving the accuracy of the algorithm. (2) Supervised contrastive learning is applied to few-shot target detection using contrastive proposal encoding. The intraclass similarity and interclass variance of targets are improved by the CPE loss, which improves the generalization performance of the algorithm. (3) The proposed algorithm is evaluated on the UIFI dataset, which contains disturbances such as insufficient lighting, partially buried targets, and wreckage. The superior performance of the proposed algorithm is verified by comparison with state-of-the-art algorithms.
The rest of this article is arranged as follows. Related work is discussed in detail in Section 2. Section 3 presents the feature generation module, the few-shot learning network model, and the training process of the proposed algorithm in three subsections. In Section 4, experiments are conducted to verify the effectiveness of the proposed algorithm. Section 5 concludes the paper.

Related Work
In recent years, object recognition has been widely used in many fields. Nevertheless, in underwater archaeological target detection, the incomplete features of target images collected by AUVs make recognition difficult and inaccurate. Some scholars have conducted in-depth research on reconstructing images with missing features. Wang et al. [1] proposed DPNet, a dual-pyramid reconstruction framework that learns features at more scales, and further proposed a pyramid attention mechanism (PAM) in the decoder to obtain finer patches directly from the learning layer. Some scholars also perform phased reconstruction for objects with missing features [2,3], first achieving a coarse global result and then performing local refinement. Others attend to the texture information of the image [4,5], guiding reconstruction by generating image texture. Cai et al. [6] innovatively proposed a transfer reinforcement learning framework for the reconstruction of multiview optical fields. Niu et al. [7] proposed a defect image generation method with controllable defect area and intensity, where the generated defect area is controlled by a defect mask.
The small number of underwater image samples leads to poor generalization performance of recognition algorithms. Great progress has also been made in the field of few-shot object detection. Meta-learning [8][9][10] can learn classes that have never been trained on, and introducing it into a detection framework can effectively improve the performance of small-sample recognition algorithms. Lu et al. [11] designed Decoupled Representation Transformation (DRT) and image-level distance metric learning to eliminate the adverse effects of manually annotated prior knowledge by predicting the objects and anchor shapes. Kim et al. [12] inferred the geometric correlation between a new category and the basic regions of interest. Zhang et al. [13] proposed a Joint Adaptive Detection Framework (JADF), which matches marginal and conditional distributions between domains without introducing any additional hyperparameters. Kaul et al. [14] obtained high-quality pseudoannotations for each new category from the training set and removed candidate detections with incorrect class labels by introducing a verification technique. Hu et al. [15] proposed DCNet, which performs dense relation extraction with context-aware aggregation to capture fine-grained features of novel objects.
Underwater image target recognition still faces huge challenges, as the performance of identification algorithms is lowered by the variety of disturbances in complex underwater environments. Ref. [16] effectively improves underwater target identification performance by extracting salient features and spatial semantic information of targets and then fusing them. Cai et al. [17] improved the accuracy of target detection under glass interference by minimizing the abstract feature distance between the source and target domains. Cai et al. [18] proposed a framework based on transfer reinforcement learning that can improve the accuracy of cooperative multi-AUV target recognition. Wang et al. [19] treat deep CNNs as different views to extract semantic representations of images, and both visual and semantic representations are used to predict image categories. Ref. [20] innovatively proposed a fusion framework (SSFNet), which effectively mitigates the gap between features by means of a semantic modulation model and a resolution-aware model. Many other deep learning-based models have been applied to different tasks. Ref. [21] comprehensively surveys microorganism biovolume measurement algorithms and classifies them according to digital image analysis methods. For the problem of parameter explosion, Ref. [22] proposed a low-cost U-Net, which significantly reduces the high memory cost of U-Net. The algorithm proposed in Ref. [23] is divided into two stages and significantly improves performance in colorectal histopathology image classification.
In summary, there are some studies in target detection and few-shot learning. However, there are few studies on target detection in the case of images with missing features and few training samples at the same time. The proposed method in this paper effectively solves this practical problem.

International Journal of Distributed Sensor Networks

Proposed Method
In this section, for targets with missing features, we generate the missing features through the feature generation module (FGM), which consists of two submodules containing dual generators and discriminators; the features of underwater archaeological target images are extracted using the RepVGG network. With the FGM, the accuracy of target recognition is significantly improved. The generalization ability of the algorithm is further improved by introducing contrastive learning into our framework. The proposed algorithm effectively alleviates the low target identification accuracy caused by insufficient underwater light, targets buried in mud and sand, and target wreckage.

3.1. Underwater Image Missing Feature Generation Module.
In the process of target recognition in underwater archaeology, the AUV usually fails to extract all the features from the collected underwater target images. This section utilizes dual discriminators and generators to generate the unextracted features by two generation submodules. This reduces the impact of missing features on underwater recognition algorithms. The feature generation module is shown in Figure 2. However, classical GANs generate a lot of noise and require a large number of iterations. The feature generation model proposed in this section is divided into two submodules, submodule 1 for generating features while preserving the semantic information of the image and submodule 2 for noise reduction.
In real underwater archaeological scenes, the images taken by AUVs usually contain interference factors that lower recognition performance, such as insufficient light, partially buried targets, and relic wreckage. In submodule 1, x_s represents the real image and x_t represents the image with missing features. The generator G_1 is a deep network, and G_1(x) denotes its output for input x. Taking x_t as input, G_1 generates an intermediate image x_g with complete features, which is then fed to the discriminator D_1. D_1 denotes the discriminator network, and D_1(x) denotes its output for input x. For discriminator D_1, we label x_s as 1 and x_t as 0 and use the complete real image x_r and the generated intermediate image x_g as inputs to D_1. The loss of D_1 is based on the binary cross-entropy loss, and the adversarial loss of D_1 can be expressed as follows:

l_{D_1}^{adv} = l_b(D_1(x_r), 1) + l_b(D_1(x_g), 0). (1)

The adversarial loss corresponding to G_1 can be written as follows:

l_{G_1}^{adv} = l_b(D_1(G_1(x_t)), 1) + l_{st}(x_s, G_1(x_t)), (2)

where l_b is the binary cross-entropy loss function and l_{st} is the structural loss. The Nice Gener module acts as a link between the two submodules: it takes x_g as input, sets the loss function threshold to 0.01, and generates high-quality images x_n as input to submodule 2. In submodule 2, the image generated by submodule 1 is further denoised, and the generator G_2 is used as a denoising autoencoder. The adversarial loss of D_2 is similar to that of D_1 and can be expressed as follows:

l_{D_2}^{adv} = l_b(D_2(x_s), 1) + l_b(D_2(G_2(x_n)), 0). (3)

The main role of the adversarial loss at this stage is to narrow the gap between the distributions of the generated and true features. Similar to submodule 1, the adversarial loss corresponding to G_2 is given by the following:

l_{G_2}^{adv} = l_b(D_2(G_2(x_n)), 1) + l_a, (4)

where l_a = Δ(l_{st}(x_s, G_1(x_t)), l_{st}(x_s, G_2(x_n))) and Δ is the differential operator. Submodules 1 and 2 can be trained independently, resulting in shorter training durations.
Minimizing l_{G_2}^{adv} reduces the gap between the generated samples and the real samples and also reduces the noise in the generated samples. For feature extraction from images, the RepVGG-B2 network is used as the backbone.
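As an illustration, the adversarial objectives of submodule 1 can be sketched in plain Python. This is a minimal sketch, assuming scalar discriminator outputs in (0, 1); the function names and example values are illustrative, not the paper's implementation.

```python
import math

def bce(p, t):
    """Binary cross-entropy l_b for a predicted probability p and target label t."""
    eps = 1e-12  # guard against log(0)
    return -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))

def d1_adversarial_loss(d_real, d_generated):
    """Discriminator loss: real image x_r is labeled 1, generated image x_g is labeled 0."""
    return bce(d_real, 1.0) + bce(d_generated, 0.0)

def g1_adversarial_loss(d_generated, structural_loss):
    """Generator loss: try to fool D_1 (target label 1) plus the structural term l_st."""
    return bce(d_generated, 1.0) + structural_loss

# Illustrative values: D_1 is confident on the real image, unsure on x_g.
print(d1_adversarial_loss(0.9, 0.4))
print(g1_adversarial_loss(0.4, 0.1))
```

The generator term shrinks as D_1's score on the generated image approaches 1, which is the adversarial pressure that drives feature generation.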
Therefore, the final training loss of the feature generation module can be expressed by Equation (5), where λ_{adv} denotes the weight of the adversarial loss in the loss function:

L_{gen} = l_{st}(x_s, G_2(x_n)) + λ_{adv} (l_{G_1}^{adv} + l_{G_2}^{adv}). (5)

3.2. Few-Shot Learning Module. In the process of underwater target recognition, a large amount of data is usually needed to train the network. However, the small number of underwater image samples leads to poor generalization performance of the algorithm. To address this problem, this section introduces contrastive proposal encoding, performing few-shot target detection through supervised contrastive learning. Intraclass similarity and interclass distinction are increased to mitigate the low recognition accuracy on unseen underwater images caused by small training samples, thus improving the generalization performance of the network. The flow of this module is shown in Figure 3.
In this module, we take the feature map from the backbone as input to the region proposal network (RPN) and generate region proposals. Each region proposal is then classified by the RoI head; if a proposal contains a target, its bounding box is regressed through the loss function. In the RoI head, the region of interest is first pooled to a fixed size by a feature extractor, and the features are then encoded as the RoI feature s_i. To obtain more significant target feature representations from a very small number of samples, this paper applies batch contrastive learning to improve intraclass similarity and interclass differentiation within the proposed target regions.
This article introduces batch contrastive learning into our framework by adding a contrastive branch to the RoI head. The similarity between targets is then computed on the RoI features to increase intraclass similarity and interclass distinction. We use a bounding box classifier based on cosine similarity, denoted by Equation (6), where sim_{i,j} is the scaled cosine similarity between RoI feature s_i and category weight τ_j. By calculating sim_{i,j}, we predict the i-th instance to be of class j, further improving the similarity within a category and expanding the distinction between categories:

sim_{i,j} = \frac{δ \, s_i^T τ_j}{\lVert s_i \rVert \, \lVert τ_j \rVert}, (6)
where δ is the scaling factor that amplifies the gradient, usually set to δ = 20. The contrastive learning head learns contrast-aware object representations through the RoI head, simplifying the distinction between different categories. Embedding contrastive learning makes same-class features more similar in the classification task and extends the distance between different classes; therefore, the generalization performance of the algorithm is strengthened. By introducing supervised contrastive learning into the detection task, we can use the following CPE loss (Equation (7)). Specifically, for a mini-batch of N RoI head features {f_i, u_i, y_i}_{i=1}^{N}, we define f_i as the i-th RoI feature, u_i as the IoU of the bounding box with the ground truth, and y_i as the ground-truth label:

l_{cpe} = \frac{1}{N} \sum_{i=1}^{N} g(u_i) L_{f_i}, \quad L_{f_i} = \frac{-1}{N_{y_i} - 1} \sum_{j \ne i, \, y_j = y_i} \log \frac{\exp(\tilde{f}_i \cdot \tilde{f}_j / σ)}{\sum_{k \ne i} \exp(\tilde{f}_i \cdot \tilde{f}_k / σ)}, (7)
where N_{y_i} represents the number of samples with the same label as y_i, \tilde{f}_i = f_i / \lVert f_i \rVert is the normalized RoI feature, σ is a temperature hyperparameter, and g(u_i) controls the consistency of the proposals, with g(u_i) = I{u_i ≥ ω} · k(u_i), where k(·) weights the corresponding IoU score and the threshold ω is set to 0.7. Embedding the contrastive learning head into the network, the loss function of the few-shot learning module can be expressed as

L_{cont} = l_{rpn} + l_{cls} + l_{reg} + λ_{cpe} l_{cpe}, (9)

where λ_{cpe} is the corresponding weight (usually set to 0.5), l_{cls} denotes the loss function of the bounding box classifier, l_{rpn} is set to the binary cross-entropy loss, and l_{reg} is the bounding box regression loss.
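The two quantities described above — the scaled cosine-similarity score of Equation (6) and the IoU-gated contrastive term of Equation (7) — can be sketched in plain Python. This is a minimal sketch under stated assumptions: features are plain lists, k(u) is taken as the identity, and the σ value is illustrative (the paper does not specify it).

```python
import math

def scaled_cosine_sim(s, tau, delta=20.0):
    """Scaled cosine similarity between RoI feature s and category weight tau."""
    dot = sum(a * b for a, b in zip(s, tau))
    ns = math.sqrt(sum(a * a for a in s))
    nt = math.sqrt(sum(b * b for b in tau))
    return delta * dot / (ns * nt)

def cpe_loss(feats, ious, labels, sigma=0.2, omega=0.7):
    """IoU-gated supervised contrastive loss over L2-normalized RoI features."""
    z = []
    for f in feats:
        nrm = math.sqrt(sum(x * x for x in f))
        z.append([x / nrm for x in f])
    n = len(z)
    total = 0.0
    for i in range(n):
        g = ious[i] if ious[i] >= omega else 0.0  # g(u_i) = I{u_i >= omega} * k(u_i)
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if g == 0.0 or not positives:
            continue
        sims = [sum(a * b for a, b in zip(z[i], z[k])) for k in range(n)]
        denom = sum(math.exp(sims[k] / sigma) for k in range(n) if k != i)
        log_probs = [math.log(math.exp(sims[j] / sigma) / denom) for j in positives]
        total += -g * sum(log_probs) / len(positives)
    return total / n

# Two classes of proposals; the last proposal is gated out because its IoU < 0.7.
feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]
ious = [0.9, 0.8, 0.75, 0.5]
print(scaled_cosine_sim([1.0, 0.0], [2.0, 0.0]))  # -> 20.0
print(cpe_loss(feats, ious, labels))
```

Pulling same-class features together and pushing different-class features apart is what the contrastive branch contributes on top of the ordinary classification loss.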

3.3. UITRNet and Training Process. UITRNet can effectively solve the problems of missing target features and few training samples in underwater archaeological scenes. First, the missing features of the input image are generated through the feature generation module, whose two submodules generate features while reducing generation noise. Then, features from different layers are fused through a feature pyramid network (FPN), and the region of interest (RoI) is extracted and input into the few-shot learning module. Finally, by introducing contrastive learning, a contrastive branch over the RoI features is added to the detector to increase intraclass similarity and interclass gaps. The training process of the algorithm is as follows.
For the adversarial losses L_adv of the two submodules of the feature generation module: in submodule 1, by minimizing Equations (1) and (2), generator G_1 and discriminator D_1 are trained adversarially until they reach a Nash equilibrium, at which point D_1 considers the feature images generated by G_1 to obey the true distribution. In submodule 2, Equation (4) makes the structure and pixel values of the generated features more similar to those of the real samples, thus reducing the noise in the generated samples. For the few-shot learning module, the loss function is given by Equation (9).
In summary, the loss function of the proposed framework can be expressed as follows:

L = L_{gen} + λ L_{cont}, (10)

where the hyperparameter λ balances the two terms, L_{gen} is the loss of the generation module mentioned above, and L_{cont} is the loss of the few-shot learning module.
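As a toy sketch of how the module losses compose, the few-shot module loss sums the RPN, classification, and regression terms plus the weighted CPE term, and the overall loss balances it against the generation-module loss. The paper specifies λ_cpe = 0.5; the value of λ and the example numbers here are illustrative.

```python
def total_loss(l_gen, l_rpn, l_cls, l_reg, l_cpe, lam=1.0, lam_cpe=0.5):
    """Compose the framework loss: L = L_gen + lambda * L_cont, where
    L_cont = l_rpn + l_cls + l_reg + lambda_cpe * l_cpe."""
    l_cont = l_rpn + l_cls + l_reg + lam_cpe * l_cpe
    return l_gen + lam * l_cont

# Illustrative per-module loss values for one training step.
print(total_loss(0.8, 0.2, 0.4, 0.3, 0.6))
```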

Experiment
4.1. Dataset. This paper uses the self-built dataset UIFI (underwater incomplete feature images), which includes 1112 images with targets and the corresponding labels. The dataset is divided into training and test sets at a ratio of 0.7:0.3. It covers a variety of interference factors, such as insufficient light, semiburial, and wreckage, and can thus be used to evaluate the performance of object detection algorithms on underwater images with missing features.
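For reproducibility, a 0.7:0.3 split of the 1112 images can be sketched with a seeded shuffle; the file names and seed below are placeholders, not the paper's actual protocol.

```python
import random

def split_dataset(items, train_ratio=0.7, seed=42):
    """Shuffle deterministically and split into train/test at the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

images = [f"uifi_{i:04d}.jpg" for i in range(1112)]  # 1112 labeled images in UIFI
train, test = split_dataset(images)
print(len(train), len(test))  # -> 778 334
```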

4.2. Implementation Details. This article sets the batch size to 16 to train the model. Optimization is performed using stochastic gradient descent (SGD) with momentum set to 0.9, weight decay set to 0.0005, an initial learning rate of 0.01, and 500 epochs throughout the training process. The proposed algorithm has 145,439,611 trainable parameters and a memory cost of 563 MB. The training and validation loss curves are provided in Figure 4. The algorithm runs on an Intel(R) Core i5-7200U CPU and an RTX 3090 VENTUS 3X 24G GPU.
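The update rule implied by these settings can be sketched per parameter in plain Python (weight decay folded into the gradient, then a momentum buffer); the scalar example is illustrative only.

```python
def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD-with-momentum update for a single parameter, using the paper's settings."""
    g = grad + weight_decay * w          # L2 weight decay folded into the gradient
    velocity = momentum * velocity + g   # momentum buffer
    w = w - lr * velocity
    return w, velocity

w, v = 1.0, 0.0
w, v = sgd_step(w, grad=0.0, velocity=v)
print(w)  # the weight shrinks slightly from decay alone
```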

4.3. Analysis of Results.
In this section, experiments are performed on the UIFI dataset. For the three disturbances mentioned above, four sets of experiments containing different disturbances are designed to verify the effectiveness of our algorithm. The evaluation indicators are mAP and recognition time. We compare the method proposed in this paper with advanced methods such as SA-FPN [24], YOLOv4 [25], ViTDet [26], and UPDETR [27]. In each subsection, we analyze the experimental results, which give a clear picture of the efficiency of the proposed algorithm.
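As background on the mAP indicator, the average precision for one class is the area under its precision-recall curve, and mAP is the mean over classes. A minimal sketch using the all-point precision envelope (the example values are illustrative):

```python
def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve, using the
    right-to-left precision envelope (all-point interpolation)."""
    pts = sorted(zip(recalls, precisions))
    r = [0.0] + [pt[0] for pt in pts]
    p = [0.0] + [pt[1] for pt in pts]
    # Precision envelope: each precision becomes the max of itself and
    # every precision at a higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))

# A perfect detector keeps precision 1.0 at every recall level -> AP = 1.0.
print(average_precision([0.5, 1.0], [1.0, 1.0]))  # -> 1.0
```

mAP is then simply the mean of `average_precision` over all target classes (ship, stone statue, and so on).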

Results of Conventional Underwater Image Recognition. For target recognition in a conventional underwater environment, this section compares the advanced algorithms SA-FPN, YOLOv4, ViTDet, and UPDETR with the proposed algorithm. Figure 5 shows some recognition results of the different algorithms. The mAP and recognition time of our algorithm and the advanced algorithms are given in Table 1, where bold font marks the best result. As Table 1 shows, ViTDet achieves the highest mAP of 0.7236 in a conventional underwater environment, but its recognition time is 0.326. The mAP of our algorithm is 2.7% lower than that of ViTDet, but our recognition time of 0.229 is faster than ViTDet's. UPDETR has the fastest recognition time of 0.113, but our algorithm improves the recognition accuracy by 4.13% over it. The proposed algorithm performs excellently in the mAP of ship and stone statue.

Results of Underwater Insufficient Light Image Recognition. For target recognition under insufficient light, this section compares the advanced algorithms SA-FPN, YOLOv4, ViTDet, and UPDETR with the proposed algorithm. The recognition results are shown in Figure 6. The mAP and recognition time of each algorithm are given in Table 2, with the best results in bold. As Table 2 shows, in a poorly lit underwater environment, the proposed algorithm achieves the highest mAP of 0.5684 with a recognition time of 0.227. UPDETR has a faster recognition time of 0.112, but our method improves the mAP by 0.86% over UPDETR. The proposed algorithm achieves the highest accuracy for the identification of ship.

Results of Underwater Partially Buried Image Recognition. For the recognition of partially buried underwater targets, this section compares the advanced algorithms SA-FPN, YOLOv4, ViTDet, and UPDETR with the proposed algorithm. The recognition results of the different algorithms are presented in Figure 7. Table 3 summarizes the mAP and recognition time of our algorithm and the advanced algorithms, with the best results in bold. As Table 3 shows, the proposed algorithm achieves the highest mAP of 0.5902, with a recognition time of 0.229, under the partially buried condition. Compared with the advanced recognition algorithm ViTDet, the mAP is improved by 1.29%.

Results of Underwater Wreckage Image Recognition.
For the recognition of underwater wreckage targets, this section compares the advanced algorithms SA-FPN, YOLOv4, ViTDet, and UPDETR with the proposed algorithm. Figure 8 shows the recognition results. Table 4 gives the mAP and recognition time of the algorithms, with the best results in bold. As Table 4 shows, ViTDet achieves the highest mAP of 0.5559 on underwater wreckage targets, but its recognition time is 0.331. The mAP of the proposed algorithm is 0.78% lower than that of ViTDet, but our recognition time of 0.229 is faster than ViTDet's. UPDETR has the fastest recognition time of 0.112, but our method improves the mAP by 3.83% over it. The proposed algorithm performs excellently in the mAP of ship.

Conclusions
In a real underwater archaeological scene, the AUV works under various disturbances, making it difficult to extract the full features of a target. This paper proposes UITRNet, which can compensate for missing features in underwater images by generating features. The algorithm is evaluated on the self-built UIFI dataset, considering detection under conventional underwater conditions, insufficient light, buried targets, and wreckage. The mAP of the proposed algorithm is 56.84% under insufficient light interference, 0.86% better than the advanced algorithm UPDETR, and 59.02% under buried target interference, 1.29% better than the advanced algorithm ViTDet. For target (ship) recognition with insufficient training data, the mAP is higher than that of the advanced algorithms SA-FPN, YOLOv4, ViTDet, and UPDETR under all four disturbances. These experimental results show that our algorithm has excellent detection and position labeling ability for target recognition in images with missing features.
However, the performance of the algorithm still needs to be improved for wreckage images. In addition, artifacts in the extracted features also affect the accuracy of the algorithm, and the algorithm needs further improvement to achieve better performance across different image types.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.