High-Accuracy Real-Time Fish Detection Based on Self-Build Dataset and RIRD-YOLOv3

To better detect fish in an aquaculture environment, a high-accuracy real-time detection model is proposed. An experimental dataset was collected for fish detection in laboratory aquaculture environments using remotely operated vehicles. To overcome the inaccuracy of the You Only Look Once v3 (YOLOv3) algorithm in underwater farming environments, a suitable set of hyperparameters was obtained through multiple sets of experiments. Then, a real-time image recovery algorithm is applied before YOLOv3 to reduce the effects of both noise and light on images whilst keeping the real-time capability, leading to a mean average precision of 0.85 and a frame rate of 17.6 fps. Finally, compared with the base detection model using only the YOLOv3 algorithm, the enhanced detection model reduces the missed detection rate from 23% to only 9% across different environments, with the detection accuracy of the target in different environments being improved from 8% to 37%.


Introduction
Recently, ocean engineering and research have increasingly relied on underwater images captured by autonomous underwater vehicles (AUVs) and remotely operated vehicles (ROVs) [1]. However, since the collection of underwater datasets is more difficult than that for onshore datasets, there are few generally accessible datasets for underwater creatures, and public datasets for freshwater creatures are even rarer. In addition, underwater images usually suffer from various types of degeneration, such as low contrast, color casts, and noise, due to wavelength-dependent light absorption and scattering as well as the effects of low-end optical imaging devices [2]. To obtain much higher quality underwater images, a number of advanced methods have been designed and used. For example, Gray World [3] and White Patch [4] are used in color correction. Fang et al. proposed a single image enhancement approach based on an image fusion strategy to enhance the underwater image [5]. Li et al. presented a systematic underwater image enhancement method including underwater image dehazing algorithms and a contrast enhancement algorithm for high-quality underwater images [6], and Hitam et al. utilized contrast-limited adaptive histogram equalization (CLAHE) to enhance the contrast [7]. Recently, Peng and Cosman proposed a depth and background light estimation method for underwater scenes based on image blurriness and light absorption, which can be used to restore and enhance underwater images [8]. Besides, many studies try to address the issue at the physical level. Typically, Schechner and Karpel employed a polarizer in front of their camera [9]. These methods work well for underwater image processing, but few of them take the degeneration model into account, or the proposed models are too complex to work in real time.
Moreover, most existing algorithms are lacking in the capability of self-adaption and self-adjustment, which are important for a robot working in a changing and complex underwater environment.
Instead of traditional target detection, artificial neural networks (ANNs) can be used to detect fish in images, and some methods have shown promise for real-time performance, such as Faster R-CNN [10], R-FCN [11], SSD [12], and the YOLO series [13][14][15], amongst others. Among them, YOLOv3 performs well in terms of both real-time capability and mean average precision (mAP). However, YOLOv3 performs well only in clear water. In dim and turbid water, YOLOv3 loses almost all of its original land-based advantages.
In this paper, an experimental dataset was collected, which addresses the lack of datasets for fish in aquaculture environments. Furthermore, a set of suitable hyperparameters was obtained for the dataset through multiple sets of experiments, reducing training time and improving the detection accuracy. To improve the performance of YOLOv3, a high-accuracy real-time fish detection algorithm, named RIRD-YOLOv3, is proposed.
This paper is organized as follows: first, we introduce the laboratory acquisition of the dataset, the RIRD-YOLOv3 algorithm, and the matching of hyperparameters; second, we discuss the analytical results of the experiments; finally, we present the conclusion. The proposed algorithm is tested, and it performs well.

Dataset Collection and RIRD-YOLOv3 Algorithm
2.1. Acquiring the Dataset. Deep learning [16] requires a large number of training samples, and the amount of data used directly affects the detection accuracy of fish for this application. However, open-source fish datasets are very scarce and do not meet the training needs of grass carp detection models.
To address the lack of a grass carp dataset in the breeding environment, a simulated grass carp breeding environment was established in the laboratory following a field investigation. Based on the growth environment of grass carp, the length, width, and height of the pool are set to 600 cm, 450 cm, and 250 cm, respectively. The pond can simulate the real grass carp breeding environment. The experiment site is shown in Figure 1.
The samples in the dataset include adult grass carp and robot fish. The robot fish is a bionic robot purchased by the laboratory. It matches the shape and characteristics of the real fish. The purpose of placing it in the dataset is to verify whether the classification performance of the model obtained by the algorithm will be affected. Each sample image contains zero or more instances of fish. As a result, each image can contain from zero to multiple annotations. This method enriches the types of dataset samples and can be used to verify the classification characteristics of the detection model. The sample content of the dataset is shown in Figure 2.
To fully simulate the impact of light on grass carp in the breeding environment and to collect images of grass carp under different light environments, the lighting conditions were changed at different time periods. The specific implementation is shown in Table 1.
After the grass carp images were collected with the ROV according to Table 1 and sorted, three types of images were found to be useless, as shown in Figure 3. The three types of images shown in Figure 3 do not account for much of the dataset, but they still affect the accuracy of the subsequent detection model. Therefore, to improve the accuracy of the detection model, these three types of images were removed manually.
A standard dataset should include a training set and a testing set. The training set and the testing set are mutually exclusive.
The training set is used to obtain an excellent detection model, and the testing set is used to test the performance of the model. After removing the above three types of useless samples, the dataset contains 3069 images. To ensure the performance of the detection model, this paper uses the 'reserve method' (hold-out) to randomly divide the images of the dataset into a training set and a testing set at a ratio of 7 : 3. After this division, the training set contains 2148 images and the testing set contains 921 images. In addition, the training set is classified according to the three types of lighting conditions shown in Table 1, and the numbers of samples of the three types are 443 for NL, 1228 for NOL, and 477 for NORL. The dataset composed of the original images is named the Grass Carp Dataset before Restoration (GCDBR). Examples from GCDBR are shown in Figure 4.
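The 7 : 3 random division described above can be sketched as follows. This is only a minimal illustration: the file names, the `holdout_split` helper, and the fixed seed are assumptions made for the example, not details given in the paper.

```python
import random


def holdout_split(items, train_parts=7, total_parts=10, seed=0):
    """Randomly split items into mutually exclusive training and testing lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding of the split size.
    n_train = len(shuffled) * train_parts // total_parts
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the 3069 retained images.
image_names = [f"img_{i:04d}.jpg" for i in range(3069)]
train_set, test_set = holdout_split(image_names)
print(len(train_set), len(test_set))  # 2148 921
```

With 3069 images, this split reproduces the 2148 and 921 image counts reported for the training and testing sets.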
Besides, to improve the detection accuracy, it is necessary to separate the fish from the environment. Fish usually have a similar color to the environment to protect themselves. Therefore, a large number of optical images of the underwater environment were collected and labelled as 'negative sample.'

Dataset Labelling.
The dataset needs to be annotated to accomplish and validate the goal of classification and detection on the images. In this paper, labelling software is used to create annotations of the respective classes for the images in the dataset based on the PASCAL VOC [17] standard labelling format. Each annotation is created by drawing a bounding box around the object of interest belonging to one of the classes and assigning the bounding box and the class label associated with it. For simplicity, axis-aligned bounding boxes are used, as described in the PASCAL VOC dataset paper [17]. Examples of annotated images are shown in Figure 5.
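As an illustration of the annotation format, the sketch below reads the class labels and axis-aligned bounding boxes from one PASCAL VOC style XML annotation. The sample file content and the class names `grass_carp` and `robot_fish` are hypothetical; the paper does not list the exact label strings it used.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the PASCAL VOC layout; names and numbers are made up.
SAMPLE_XML = """<annotation>
  <filename>img_0001.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>grass_carp</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>350</ymax></bndbox>
  </object>
  <object>
    <name>robot_fish</name>
    <bndbox><xmin>900</xmin><ymin>500</ymin><xmax>1100</xmax><ymax>620</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_boxes(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) for one annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.find("name").text, coords))
    return boxes


boxes = read_voc_boxes(SAMPLE_XML)
for name, coords in boxes:
    print(name, coords)
```

An image without fish simply has no `object` elements, so the function returns an empty list, which matches the "zero or more annotations per image" convention of the dataset.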

The RIRD-YOLOv3 Algorithm.
The water in the grass carp farming environment is turbid, and due to the absorption of light by the water, the scattering effect, and the uneven illumination of the ROV, the quality of the image deteriorates and the grass carp cannot be distinguished by the naked eye. Examples of low-quality images of grass carp samples are shown in Figure 6.
To overcome this problem, Chen et al. provided three parameters, related to underwater image degradation and color correction, by presearching in the first frame of image sequences using an artificial fish school algorithm [18]. The core of the image restoration is a Wiener filter in the frequency domain, in which $V_{\mathrm{orig},C}$ represents one channel of the original image; $V_{\mathrm{deg},C}$ represents one channel of the degraded image due to underwater scattering and absorption; $R$ is the reciprocal of the signal-to-noise ratio and is used to restrict scattering; and $H(u, v)$ originates from a general image degradation model for turbulent media [18], in which $k$ is a crucial parameter related to the depth of water and the distance from the camera. After the Wiener filter is applied, color correction is implemented on the image using a gamma factor $\gamma$. At this point, $R$, $k$, and $\gamma$ have been introduced. To obtain a reliable combination of these three parameters, we employ a quality index of the restored image based on three indicators. α is a haze indicator, describing the level of haze by a gradient computed with the modified Tenegrad evaluation, where $M \times N$ is the size of an input image, $V_g$ is a grayscale map, and the orientations of the gradient are regulated as $k \times 45^{\circ}$. This indicator takes the textural feature and edge feature into consideration. Generally, a higher value of α reflects a clearer restored image.
β is a contrast indicator, calculated from the histogram distribution in the RGB channels and representing the image contrast, where $h_C(i)$ stands for the value of the histogram curve at gray level $i$ for channel $C$ and $\mu_C$ is the average of the histogram curve of channel $C$. Theoretically, objects can be distinguished more easily with a higher value of β.
η is an imbalance indicator, which denotes the level of color correction. Clearly, η decreases as the color correction improves.
A test result of a deep sea image is shown in Figure 7. Clearly, the method is effective in contrast and color correction, and it takes only 17.5 milliseconds for each frame on average. In addition, the amount of relevant information in the restored image has been retained to a large degree, such as color information, texture and edge information, and illumination information.
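To make the restoration step more concrete, the following is a rough sketch of a frequency-domain Wiener filter followed by gamma correction. It rests on several assumptions that this text does not confirm: the filter is taken in its textbook form F = conj(H) / (|H|^2 + R) · G, the degradation model is taken as the textbook turbulence model H(u, v) = exp(-k (u^2 + v^2)^(5/6)), and gamma correction is taken as a simple power law. The function names and all parameter values are hypothetical, and the presearch of R, k, and the gamma factor with the artificial fish school algorithm is not shown.

```python
import numpy as np


def turbulence_otf(shape, k):
    """Assumed textbook turbulence model H(u, v) = exp(-k * (u^2 + v^2)^(5/6))."""
    rows, cols = shape
    # Integer frequency indices in FFT order (0, 1, ..., -2, -1).
    u = np.fft.fftfreq(rows) * rows
    v = np.fft.fftfreq(cols) * cols
    uu, vv = np.meshgrid(u, v, indexing="ij")
    return np.exp(-k * (uu ** 2 + vv ** 2) ** (5.0 / 6.0))


def wiener_restore_channel(channel, k, R):
    """Wiener deconvolution of one channel with values in [0, 1]."""
    H = turbulence_otf(channel.shape, k)
    G = np.fft.fft2(channel)
    # R plays the role of the reciprocal signal-to-noise ratio.
    F = np.conj(H) / (np.abs(H) ** 2 + R) * G
    return np.clip(np.real(np.fft.ifft2(F)), 0.0, 1.0)


def restore(image, k, R, gamma):
    """Restore each channel of an H x W x C image, then apply power-law gamma correction."""
    channels = [
        wiener_restore_channel(image[..., c], k, R) for c in range(image.shape[-1])
    ]
    return np.stack(channels, axis=-1) ** gamma


# Usage on a synthetic image; the parameter values are illustrative only.
rng = np.random.default_rng(0)
degraded = rng.random((64, 64, 3))
restored = restore(degraded, k=0.001, R=0.01, gamma=0.8)
print(restored.shape)
```

Because the whole operation is a pair of FFTs and a pointwise multiplication per channel, its cost is low, which is consistent in spirit with the per-frame timing reported for the restoration method.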

Matching of Hyperparameters.
To train the ANN efficiently so that it predicts the desired outcome well, the hyperparameters of the network should be properly determined. A grid search over the number of epochs, momentum, learning rate, and batch size was performed to optimize the hyperparameters. All possible sets of values described in Table 2 were tested to train the network. Then, after training the network using each set of values, the sensitivity was assessed using 100 test images. Finally, the values that maximize the quality of the network were adopted for the hyperparameters.
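The grid search procedure can be sketched as below. The candidate values are hypothetical placeholders (the actual ranges are those of Table 2), and `toy_evaluate` is a stand-in for the expensive step of training the network with one set of values and measuring its sensitivity on the test images.

```python
from itertools import product

# Hypothetical candidate values; the real ranges are listed in Table 2.
grid = {
    "epochs": [10000, 20000, 30000],
    "momentum": [0.8, 0.9],
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
}


def grid_search(grid, evaluate):
    """Evaluate every combination of values and return the best one with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in score; a real run would train the network and measure sensitivity."""
    return (
        -abs(params["learning_rate"] - 0.001)
        - abs(params["momentum"] - 0.9)
        - abs(params["batch_size"] - 64) / 1000
        - abs(params["epochs"] - 30000) / 1e6
    )


best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params)
```

The number of training runs grows as the product of the number of candidates per hyperparameter (24 in this toy grid), which is why the candidate lists are usually kept short.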

Experimental Platform and GCDAR.
First of all, the experiment in this paper starts with image restoration. The specific implementation method is to restore the sample images of GCDBR to obtain the Grass Carp Dataset after Restoration (GCDAR). The comparison of sample images of the grass carp dataset before and after restoration is shown in Figure 8. The appearance of the target in the image differs due to the influence of the light. In Figure 8(a), since the color of the target is similar to the background environment, it becomes difficult for the human eye to detect the target. In Figure 8(b), under the action of the outdoor auxiliary light, the underwater image halo is enhanced, and the target is very blurred due to the presence of water mist. In Figure 8(c), when natural light and ambient light do not work, the self-contained light source of the ROV is used, but the observation of the underwater target is still difficult due to the limited light strength. After the image is restored, the target in the restored image in Figure 8

Training Result.
The best hyperparameters for training are shown in Table 3. Figure 9 is a graph showing the relationship between batches and average loss. For each batch, 64 images are randomly selected and used to train the ANN. Since the number of samples is limited, each image is used multiple times. The graph shows that the average loss is reduced to almost 0 as batches progress. A total of 30200 epochs were run, and it took 48 hours to complete the training. Compared with the initial training parameters, the best hyperparameters reduce the training time by at least 48 hours.
To check that the real-time requirements are met, the frame rate of the model trained on GCDAR was measured; the results are shown in Table 4.
In addition, evaluation of the trained network is performed by taking our validation dataset consisting of 300 images and executing detection on it using the trained model. The metrics used to evaluate the object detection are as follows: (a) mAP. This is the mean of the interpolated average precision across all the classes in the dataset used for object detection. (b) IOU. This is the ratio of the area of intersection to the area of union of the predicted bounding box and the corresponding maximally matched ground truth box, as defined in the following equation:
$$\mathrm{IOU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}, \qquad (8)$$
where $B_p$ is the predicted bounding box and $B_{gt}$ is the corresponding maximally matched ground truth box.
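The IOU definition above translates directly into a few lines of code for axis-aligned boxes. The sketch below assumes boxes are given as (xmin, ymin, xmax, ymax) tuples, as in the PASCAL VOC format; the function name is chosen for illustration.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IOU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Identical boxes give an IOU of 1 and disjoint boxes give 0, so the value can be thresholded to decide whether a predicted box counts as a correct detection.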

Comparison of Unrecovered and Recovered Images.
To verify the validity of RIRD-YOLOv3 in different environments, two new evaluation parameters are proposed, namely, the missed detection rate (MDR) and the target detection accuracy (TDA) in a single environment. In addition, to demonstrate the advantages of the method, a comparative experiment on image detection capability before and after restoration was conducted. The MDR and the TDA were also tested (as defined in equations (9) and (10)). The results are shown in Table 5. As seen from Table 5, the MDR of the restored images is reduced from 8.9% to 21.7% compared with the images before restoration. The TDA of the restored images is increased from 7.8% to 36.8% compared with the images before restoration. The large reduction in the rate of missed detection indicates the effectiveness of the RIRD-YOLOv3 algorithm. The improvement of detection accuracy in different environments shows that the model has generally excellent performance. Figure 10 shows the IOU contrast between the prerecovery image and the restored image. In the original image, the target detection showed missed detection and false detection. However, in the restored image, both missed detection and false detection are reduced.
$$\mathrm{MDR} = \frac{M}{M + N} \times 100\%, \qquad (9)$$
where $M$ is the number of images with missed detections and $N$ is the number of images without missed detections.
$$\mathrm{TDA} = \frac{t_1 + t_2 + t_3 + \cdots + t_n}{n} \times 100\%, \qquad (10)$$
where $t$ is the detection accuracy of a single target in the test images and $n$ is the number of targets.
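The two evaluation parameters can be computed as in the sketch below. It assumes that MDR is the share of test images with at least one missed detection, M / (M + N), which is inferred from the definitions of M and N rather than stated explicitly, and that TDA is the mean per-target detection accuracy expressed as a percentage. The counts and accuracies in the usage lines are invented for illustration and are not results from Table 5.

```python
def missed_detection_rate(num_missed_images, num_complete_images):
    """MDR in percent: images with missed detections over all tested images (assumed form)."""
    total = num_missed_images + num_complete_images
    if total == 0:
        raise ValueError("at least one test image is required")
    return 100.0 * num_missed_images / total


def target_detection_accuracy(accuracies):
    """TDA in percent: mean of the per-target detection accuracies t_1..t_n (each in [0, 1])."""
    if not accuracies:
        raise ValueError("at least one target is required")
    return 100.0 * sum(accuracies) / len(accuracies)


# Hypothetical example values.
print(missed_detection_rate(9, 91))
print(target_detection_accuracy([0.9, 0.8, 1.0]))
```

Computing both values separately for each lighting environment, before and after restoration, gives the kind of side-by-side comparison that the text describes.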

Conclusions
In this paper, we presented a high-accuracy real-time fish detection algorithm, called RIRD-YOLOv3. It is able to address the image blur and noise that arise in an underwater environment. In addition, a set of suitable hyperparameters is provided for the laboratory freshwater aquaculture dataset. When the discovered hyperparameters are used, the experimental results show that the training time for the dataset is reduced by 48 hours. During testing, the frame rate of RIRD-YOLOv3 was 17.6 FPS and the mAP of the model was 0.85. Prerecovery and postrecovery images were contrasted in three environments, and the missed detection rate and detection accuracy are reduced from 23% to 9% and increased from 13% to 37%, respectively. Therefore, overall, RIRD-YOLOv3 has demonstrated excellent performance for this type of environment.
The RIRD-YOLOv3 algorithm is of significance for underwater target detection applications. It can be applied to underwater submersibles such as ROVs and AUVs. It has potential for further contribution to the exploration of underwater resources.

Data Availability
This paper proposes the laboratory acquisition of the dataset, which is published on CSDN, and data can be obtained from the following links

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.