Low-Shot Wall Defect Detection for Autonomous Decoration Robots Using Deep Reinforcement Learning

Wall defect detection is an important function for autonomous decoration robots. Object detection methods based on deep neural networks require a large number of images annotated with handcrafted bounding boxes for training. However, building such large datasets manually is time-consuming and labor-intensive, and therefore impractical. In this work, we address this issue by proposing a low-shot wall defect detection algorithm based on deep reinforcement learning (DRL) for autonomous decoration robots. Our algorithm first uses the attention proposal network (APN) to generate attention regions and then applies AlexNet to extract features from the attention patches, which further reduces computation. Finally, we train our method with deep reinforcement learning to learn the optimal detection policy. The experiments are conducted on a low-shot dataset of images collected from real decoration environments, and the results show that the proposed method converges quickly and learns the optimal detection policy for wall defect images.


Introduction
Autonomous decoration robots are increasingly applied in the field of house decoration. Figure 1 shows our robot platform, an autonomous decoration robot that is used to decorate the walls of rough houses. The first task in wall decoration for such a robot is wall defect detection.
Wall defect detection is an important research problem in automatic housing decoration. In recent years, deep learning (DL) has been widely used in computer vision [1][2][3], and the current mainstream object detection methods [4] based on deep learning can be divided into two-stage detection and one-stage detection. Two-stage detection decomposes object detection into two stages: it first generates region proposals and then classifies them. Many methods belong to two-stage detection, such as R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7]. In general, two-stage detection methods have an advantage in accuracy, but they cannot meet real-time requirements in practical use. To address this issue, some researchers proposed one-stage detection methods, which have an advantage in speed. One-stage detection does not generate region proposals but directly outputs the location and class of each bounding box in the output layer. Classical one-stage detection methods include YOLO [8], YOLO-v2 [9], YOLO-v3 [10], and SSD [11]. However, both kinds of methods need a large number of wall defect images annotated with handcrafted bounding boxes. Moreover, collecting a large number of wall defect images from real building decoration environments is very difficult, and annotating bounding boxes for each image makes building a ground-truth dataset even harder. Therefore, the process of building a wall defect dataset is not only time-consuming but also labor-intensive.
To address these issues, we propose wall defect detection with low-shot data based on deep reinforcement learning, in which the image dataset is low-shot and handcrafted bounding boxes are not required. We first use the APN [12] to acquire attention regions and use AlexNet [13] to generate a compressed feature vector. We then feed the feature vector into an LSTM that outputs a center location for the wall defect image. In the subsequent iterations, our method continuously improves the accuracy of the detection location based on the previous center location. Figure 2 shows some image samples of wall defects. The contributions of our work can be summarized as follows: (1) Defect detection via DL requires a ground-truth dataset with handcrafted bounding boxes, whereas our method does not require manual annotations for wall defect images. (2) Our method extends deep reinforcement learning from classification tasks to detection tasks. The remainder of this paper is organized as follows: in Section 2, we present the background. Then, in Section 3, the proposed method is described in detail. In Section 4, the experimental results are presented and discussed, which demonstrate the advantage of our method. Finally, the conclusions are presented in Section 5.

Deep Reinforcement Learning.
The aim of reinforcement learning is to maximize a discounted sum of rewards when an agent interacts with an environment over a number of discrete time steps [14,15]. At each time step, the agent receives a state $s_t$ from the environment and produces an action $a_t$ according to its learned policy $\pi$. In return, the environment gives the agent a next state $s_{t+1}$ and a reward $r_{t+1}$. Reinforcement learning methods can be divided into three categories [14]: value-based methods, policy-based methods, and actor-critic methods. Among policy-based methods, policy gradient methods [16] compute an estimator of the agent's policy gradient and optimize the policy by stochastic gradient ascent, which makes them well suited to combination with deep neural networks. Several methods build on policy gradients, such as actor-critic (AC) [14,17], in which the actor is a policy and the critic is a baseline. Lillicrap et al. [18] extended DQN [15] and DPG [19] to propose deep deterministic policy gradient (DDPG); based on AC, DDPG includes four neural networks: a current critic network, a current actor network, a target critic network, and a target actor network. Mnih et al. [20] proposed asynchronous advantage actor-critic (A3C), which trains multiple agents asynchronously in parallel. In recent years, DRL algorithms have been applied to robotics [21][22][23].
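As a small illustration of the discounted sum of rewards mentioned above, the following sketch computes the return of one episode from a list of rewards. The function names and the discount factor `gamma=0.99` are assumptions made for illustration; the paper does not state a discount value.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_k gamma**k * rewards[k] for one episode.

    gamma is a hypothetical discount factor, not a value from the paper.
    """
    total = 0.0
    # Work backwards through the episode: G_t = r_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def returns_to_go(rewards, gamma=0.99):
    """Return the discounted return from every time step onwards."""
    result = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        result.append(running)
    result.reverse()
    return result
```

With `gamma=1.0` the discounted return reduces to the plain sum of rewards over the episode.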

Recurrent Attention Model.
The recurrent attention model (RAM) [24] is a visual attention model formulated as a single recurrent neural network. This visual attention model takes a glimpse window as the input and uses the internal state of the neural network to select the next detection location and to generate control signals in a dynamic environment. Although RAM is not differentiable, the unified architecture can be trained end-to-end from pixel inputs to actions using a policy gradient method.
The attention proposal network (APN) [12] receives the full image and iteratively generates attention regions from coarse to fine by taking the previous prediction as a reference, while a finer-scale network takes as input an amplified attention region from the previous scale in a recurrent way. The APN is trained in a weakly supervised fashion because part-level annotations are hard to obtain. Figure 3 shows the APN architecture, which consists of a pretrained VGG-19 [25] model and two stacked fully connected layers. The VGG-19 model is pretrained on ImageNet.

Method
In this work, we regard the wall defect detection problem as a sequential decision process in which a goal-directed agent interacts with the environment. Our architecture, as shown in Figure 4, can be decomposed into two modules: an initialization module and a refinement module. The initialization module is responsible for obtaining an initial detection location, which is the preliminary input for the refinement module. Over several recurrent iterations, the refinement module gradually refines the results of the wall defect detection. To further demonstrate the effectiveness of the detection, our method also classifies the wall defect images into convex and concave ones based on the detection results.
In the initialization module, we feed an input image into the pretrained APN model to generate the initial attention region. This initial attention region is much smaller than the input image, which reduces computation significantly. Then, AlexNet compresses the high-dimensional attention region into low-dimensional feature vectors. Finally, the agent, trained via the policy gradient, receives the feature vectors and outputs an initial detection and classification policy for the wall defect image. The purpose of the initialization module is to calculate rough center coordinates for the detection, and the refinement module receives these initial coordinates and performs recurrent iterations to improve the detection.

Detection Initialization.
The initialization module outputs the initial detection. We use the pretrained APN to predict the box coordinates of an attention region at a finer scale. At each step $t$, an original image $I_t$ is fed into the APN, and the APN outputs an attention region $x_t$. The attention region $x_t$ can be expressed as follows:
$$x_t = [t_x, t_y, t_l] = f_p(I_t),$$
where $t_x$ and $t_y$ are the center coordinates of the square along the $x$- and $y$-axes, respectively, and $t_l$ is the distance between the detection center and its border. For ease of notation, we rewrite the location $[t_x, t_y, t_l]$ as $l_{t-1}$. $f_p$ is the neural network architecture of the APN.
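The square attention region described by $[t_x, t_y, t_l]$ can be sketched as a simple cropping step. This is a minimal illustration only: the function names are hypothetical, coordinates are assumed to be integer pixels, and shifting a box that would cross the image border back inside the image is an assumption of this sketch rather than something stated in the paper.

```python
def attention_box(t_x, t_y, t_l, width, height):
    """Return (left, top, right, bottom) of the square attention region.

    The square is centred at (t_x, t_y) and t_l is the distance from the
    centre to the border, so the side length is 2 * t_l. Boxes that would
    leave the image are shifted back inside (an assumption of this sketch).
    """
    side = 2 * t_l
    if side > width or side > height:
        raise ValueError("attention region is larger than the image")
    left = min(max(t_x - t_l, 0), width - side)
    top = min(max(t_y - t_l, 0), height - side)
    return left, top, left + side, top + side


def crop(image, box):
    """Crop a row-major image, given as a list of pixel rows, to the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

For a 640 × 480 image and $t_l = 100$, a centred call such as `attention_box(320, 240, 100, 640, 480)` yields a 200 × 200 region.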

Journal of Robotics
To further reduce the network computation of our method, we compress the image $x_t$ by extracting features with a convolutional neural network. Compared with other frequently used CNNs, such as VGG and ResNet, AlexNet has far fewer trainable parameters. Therefore, we use AlexNet $f_n$ to extract features from $x_t$, and $f_n$ outputs feature vectors $n_t = f_n(x_t) = f_n(f_p(I_t))$ with much lower dimensionality.
The feature vectors $n_t$ are fed into the LSTM, and the input parameters of the inner LSTM are shown in Figure 5. Inside the LSTM, each hidden unit $h_t$ has an internal state that summarizes information from the environment states.
During the interaction period, each hidden unit is updated as $h_t = f_h(h_{t-1}, n_t; \theta_h)$.
Similar to [24], as shown in Figure 6, the agent outputs a location action $l_t$ and an environment action $a_t$ from the internal state $h_t$ when it interacts with the environment. In this paper, the location network $f_l(h_t; \theta_l)$ outputs $l_t$ for the next time step, and the environment network outputs the environment action $a_t$ after a fixed number of time steps.
Figure 1: Autonomous decoration robot, panels (a) and (b). The arm of the autonomous decoration robot is used to decorate the walls of rough houses, and a camera sensor on the arm captures environmental image information for the robot.
After performing an action $a_t$, the agent receives a new visual observation $I_{t+1}$ and a reward signal $r_{t+1}$ from the environment. Wall defect images are categorized into convex and concave types, and the reward is 1 if the agent classifies the wall defect image correctly; otherwise, the reward is 0. The aim of our agent is to maximize the accumulated reward $R = \sum_{t=1}^{T} r_t$ and thereby learn an optimal policy $\pi((l_t, a_t) \mid s_{1:t}; \theta)$.
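The 0/1 classification reward and the accumulated reward $R$ can be sketched as follows. The label encoding and function names are hypothetical, and giving the reward only at the final time step (when the classification action is emitted) is an assumption borrowed from the usual RAM setup rather than a detail confirmed by the text.

```python
CONVEX, CONCAVE = 0, 1  # hypothetical label encoding for the two defect types


def episode_rewards(predicted, label, num_steps):
    """Return the reward sequence r_1..r_T for one image.

    Assumption: all intermediate steps receive 0, and the last step receives
    1 if the predicted defect type matches the label, otherwise 0.
    """
    if num_steps < 1:
        raise ValueError("an episode needs at least one time step")
    final = 1 if predicted == label else 0
    return [0] * (num_steps - 1) + [final]


def accumulated_reward(rewards):
    """Return R, the sum of r_t over t = 1..T."""
    return sum(rewards)
```

Under this assumption, $R$ is simply 1 for a correctly classified image and 0 otherwise, regardless of the number of time steps.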

Detection Refinement.
The refinement module outputs the final detection and classification results. Once the initialization module has generated an initial detection location $l_t$, we obtain a new attention region $x_t'$ around $l_t$, and this region is fed into AlexNet to produce a compressed feature vector $n_t' = f_n(x_t')$. The LSTM agent receives $n_t'$ and outputs a classification policy $a_{t+1}$ and a detection policy $l_{t+1}$, which provides the initial detection coordinate for the next iteration. After $K$ recurrent iterations, the refinement module can learn the optimal classification and detection policies. The classification policy is used to categorize wall defect images as convex or concave defects, and the detection policy is used to generate the center coordinate of the final detection region.
Figure 4: The detection and classification architecture for wall defects. The initialization module is used to calculate rough center coordinates of the detection, and the refinement module is used to refine the detection and classification results with $K$ recurrent iterations.
Figure 5: The input parameters of the inner LSTM. Each hidden unit $h_t$ has an internal state that summarizes the environment information $n_t$, and its update is $h_t = f_h(h_{t-1}, n_t; \theta_h)$.
Figure 6: Reinforcement learning used for the detection and classification policy of wall defect detection. The agent outputs a location action $l_t$ and an environment action $a_t$ when it interacts with the environment. In return, the environment gives the agent the next state $I_{t+1}$ and a reward $r_{t+1}$.

Training Method.
Similar to RAM [24], our architecture has three small neural networks: a glimpse network, a location network, and an environment network. The training goal is to learn a policy that maximizes the total rewards.
Following RAM [24], the objective can be written as the expected total reward
$$J(\theta) = \mathbb{E}_{p(s_{1:T};\theta)}\left[\sum_{t=1}^{T} r_t\right] = \mathbb{E}_{p(s_{1:T};\theta)}[R],$$
where $p(s_{1:T}; \theta)$ depends on the policy. It is not easy to maximize $J$ because it involves an expectation over high-dimensional interaction sequences. Viewing the problem as a partially observable Markov decision process (POMDP), however, allows us to apply techniques from RL, and a sample approximation is used to estimate the gradient:
$$\nabla_\theta J \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_\theta \log \pi\big((l_t^i, a_t^i) \mid s_{1:t}^i; \theta\big) R^i,$$
where the superscript $i$ indexes the $M$ sampled interaction sequences.

Neural Network Architecture.
As shown in Figure 4, the initialization module contains a pretrained APN, which consists of VGG-19 layers pretrained on ImageNet and two fully connected layers. The APN parameters are frozen, and the APN outputs a 200 × 200 × 3 attention region from a 640 × 480 × 3 RGB defect image. We then feed the attention region into AlexNet to generate a 4000-d feature vector, which is passed through a 512-unit LSTM to produce an initial classification policy and the center coordinate of the initial detection. The refinement module has the same neural network architecture as the initialization module except that it has no APN.
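The sample approximation of the policy gradient used in the training method can be illustrated with a deliberately simplified sketch. Here the policy is a softmax over two discrete actions with a fixed logit vector, so it ignores the state history $s_{1:t}$ and the LSTM entirely; this simplification, the function names, and the learning rate are assumptions made for illustration and do not reproduce the paper's networks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def grad_log_policy(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def reinforce_gradient(logits, episodes):
    """Estimate grad J as (1/M) * sum_i sum_t grad log pi(a_t^i) * R^i.

    episodes is a list of (actions, total_reward) pairs, one per sampled
    interaction sequence.
    """
    grad = [0.0] * len(logits)
    for actions, total_reward in episodes:
        for action in actions:
            for k, g in enumerate(grad_log_policy(logits, action)):
                grad[k] += g * total_reward
    return [g / len(episodes) for g in grad]


def ascent_step(logits, grad, learning_rate=0.1):
    """One stochastic gradient ascent step on the policy parameters."""
    return [value + learning_rate * g for value, g in zip(logits, grad)]
```

In this toy setting, an episode whose action earned reward 1 pushes the logit of that action up, while episodes with reward 0 contribute nothing to the estimate.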

Experiments
Experiments are implemented in PyTorch, and the model is trained on an NVIDIA GeForce GTX 1080 Ti GPU. We evaluate our method on the low-shot dataset of wall defect images collected from real decoration environments.

Experimental Configuration.
We collected 317 RGB images from real decoration environments with our autonomous decoration robot to form a low-shot dataset, and we split the images into a training set of 254 images and a test set of 63 images. For ease of training, all images are resized to 640 × 480, and the distance $t_l$ between the detection center and its border is set to 100. Therefore, the size of the attention region from the APN is 200 × 200. In addition, the number of recurrent iterations $K$ is set to 2 to balance accuracy and training time. We trained the neural network with a shared RMSProp optimizer and a learning rate of $7 \times 10^{-4}$.
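The configuration above can be summarized in a short sketch. The dictionary keys and function name are hypothetical, and the random 80/20 shuffle with a fixed seed is an assumption: the paper reports only the resulting counts (254 and 63), not how the split was drawn.

```python
import random

# Values taken from the text; the key names are illustrative only.
CONFIG = {
    "image_size": (640, 480),      # width x height after resizing
    "half_side": 100,              # t_l, centre-to-border distance
    "refinement_iterations": 2,    # K
    "learning_rate": 7e-4,         # RMSProp learning rate
}

# The attention region side length follows from t_l.
ATTENTION_SIDE = 2 * CONFIG["half_side"]


def split_dataset(num_images=317, train_fraction=0.8, seed=0):
    """Split image indices into train and test lists.

    Assumption: a seeded random shuffle and an 80% training fraction,
    which reproduces the reported 254/63 counts for 317 images.
    """
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    num_train = round(num_images * train_fraction)
    return indices[:num_train], indices[num_train:]
```

In a PyTorch implementation, the learning rate would typically be passed to `torch.optim.RMSprop`; how the optimizer is shared in the paper's training setup is not specified here.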

Experimental Results and Analysis.
We took five wall defect images as examples and performed qualitative and quantitative analyses to demonstrate the effectiveness of our method. Figure 7 presents the qualitative experimental results: the first column shows the original images, the second column shows the initial wall defect detections, and the third column shows the refined detections. Each original image is fed into the proposed neural network, and the second column is the output of the APN, which generates the initial detection center. Although the initial detections for several images (Img-000, Img-082, Img-167, and Img-288) are only rough, the refinement module generates more accurate detections after the recurrent iterations. Even when the APN gives a wrong initialization, as for Img-237, whose initial detection does not contain the wall defect, our method still produces a correct detection. The qualitative experimental results show that the agent can learn the optimal detection policy from the low-shot dataset.
To further demonstrate the effectiveness of our detection method, we classified the wall defect images after detection into convex and concave ones. Figure 8 shows the quantitative experimental results. As shown in Figure 8(a), the classification loss decreases rapidly and stabilizes at a low value by about 200 episodes. In Figure 8(b), the classification accuracy reaches 90% after about 100 training episodes. The training process begins to converge at 150 episodes, and the average classification accuracy, calculated from episode 150 to the final episode, is 98.25%. The quantitative experimental results show that our agent can learn the optimal wall defect classification policy from the low-shot dataset, which further supports the effectiveness of our detection method.
According to the qualitative and quantitative analyses, training of our method with deep reinforcement learning converges quickly on the low-shot dataset. Moreover, the agent can learn the optimal detection policy even though the training dataset is low-shot. In addition, the trained parameter model is small (452M), and our method runs in real time in practice.

Conclusion
Wall defect detection is an important function for autonomous decoration robots. However, object detection methods based on deep learning require large image datasets annotated with ground-truth bounding boxes. It is impractical to collect enough images from real decoration environments and to handcraft bounding boxes for wall defect detection. To address this issue, we proposed low-shot wall defect detection using deep reinforcement learning. We first used the attention proposal network to generate attention regions and applied AlexNet to extract features from the attention patches to reduce computation. Then, we used deep reinforcement learning to train the proposed method. The proposed method converges quickly on the low-shot dataset and learns the optimal detection policy for wall defect images for autonomous decoration robots.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.