An End-to-End Learning-Based Row-Following System for an Agricultural Robot in Structured Apple Orchards

A row-following system based on end-to-end learning for an agricultural robot in an apple orchard was developed in this study. Instead of dividing the navigation into multiple traditional subtasks, the designed end-to-end learning method maps images from the camera directly to driving commands, which reduces the complexity of the navigation system. A sample collection method for network training was also proposed, by which the robot could automatically drive and collect data without an operator or remote control. No hand labeling of training samples is required. To improve the network generalization, methods such as batch normalization, dropout, data augmentation, and 10-fold cross-validation were adopted. In addition, internal representations of the network were analyzed, and row-following tests were carried out. Test results showed that the visual navigation system based on end-to-end learning could guide the robot through the tree rows by adjusting its posture according to different scenarios.


Introduction
With the development of technology, many labor-intensive tasks performed by humans in orchards can be taken over by agricultural robots [1][2][3][4][5][6][7][8]. However, developing robots that operate safely and reliably in a real orchard is still a challenge. One major challenge for an autonomous agricultural robot in an orchard is row following. Recently, vision sensors have been widely used in agricultural robot navigation because of their low cost, high efficiency, and capability to provide rich information [9][10][11][12][13][14][15][16][17]. In our previous study, a row-following system based on traditional machine vision was designed for an apple orchard, in which navigation was divided into multiple subtasks, such as image binarization, boundary detection, guidance path generation, coordinate transformation, and low-level motor control [18]. Each subtask was processed independently, and their outputs were integrated into the final control decision. Our previous tests showed that, with this traditional design, each module is easy to adjust, optimize, and troubleshoot. However, the system's complexity increased as more and more modules were added to improve the navigation. In addition, it was found that once the environment changed, for example in light intensity or through the occurrence of deep shadows, the traditional navigation system had to be readjusted to pick out the important features consistently. By contrast, deep learning has the potential to learn many of the complex perception tasks of mobile robot navigation [19,20]. This study attempted to develop a deep network that maps pixels directly to actuation based on the end-to-end learning scheme. As Figure 1 shows, the traditional navigation subtasks were replaced by a specially designed deep network, which reduces manual programming and simplifies the system.
The ALVINN (autonomous land vehicle in a neural network) study was the first to prove that end-to-end learning was feasible for unmanned driving. However, limited by the computing power of the time, the ALVINN system used a backpropagation network with a single hidden layer and a small number of samples for learning, which prevented it from being applied in more complex environments [21]. Recently, with the development of deep learning hardware and theory, many breakthroughs have been made in autonomous driving based on end-to-end learning. NVIDIA developed a self-driving car system based on the DAVE-2 network. The network was trained on human driving; samples were captured simultaneously from three different camera angles. Test results showed that, with less than a hundred hours of training, the car could drive autonomously on highways, local roads, and residential neighborhoods in sunny, cloudy, and rainy conditions [22,23]. In sharp contrast to the NVIDIA self-driving system, Mobileye divided the self-driving system into three parts: perception, high-precision map, and driving decision; each part was designed as an independent network. Supervised learning techniques and direct optimization were implemented in a recurrent neural network to solve the long-term planning problem. Results showed that, by incorporating adversarial elements into the environment, robust policies could be learned by the designed self-driving system [24]. Large and diverse datasets of driving information, such as steering, braking, and speed, were also considered in some studies and used in the end-to-end learning component to achieve the optimal driving strategy. Such a self-driving system was developed by Comma.ai, with the expectation that both speed and steering direction could be calculated intelligently under different driving conditions [25].
Most of the current research on self-driving (autonomous) systems using deep learning focuses on land road navigation [26][27][28][29][30][31][32][33], while studies on outdoor robot navigation remain relatively few. Muller et al. developed a vision-based obstacle avoidance system using deep networks for an outdoor robot. During training, the robot was driven by remote control over different terrains, around different obstacles, and under different lighting conditions. Test results showed that the robot exhibited an excellent ability to detect obstacles and navigate around them at speeds of 2 m/s [34].
Hwu et al. applied the IBM NS1e board, which contains the IBM Neurosynaptic System (IBM TrueNorth chip), on a robotics platform to speed up the CNN (convolutional neural network) training process. Results showed that the designed CNN-based self-driving system enabled the robot to drive along a mountain path with low-power processing [35]. Orchard environments are much more complicated than land road environments: they have uneven ground surfaces, diverse trees, and so on, which makes developing an autonomous navigation system based on deep learning challenging. However, studies in this field have not adequately addressed the issue. Bell et al. developed a monocular vision-based row-following system for pergola-structured orchards. A fully convolutional network was used to perform semantic segmentation of color images for an abstract class called "traversable space." Test results indicated that the designed system executed row following better than the existing 3D lidar navigation system. However, the system still needs traditional subtasks such as boundary detection and centerline fitting to generate the final steering decision [36].
The objective of this study is mainly focused on row following and the detection of row ends. A CNN for the row-following system was developed, which consists of five convolutional layers and one fully connected layer. Experiments were carried out and the results were analyzed. The main novelties of this study are as follows:
(1) A tree row-following system that maps pixels directly to actuation based on the end-to-end learning scheme was developed, which saves much hand programming and simplifies the system compared to traditional methods. In addition, the deep learning-based system has the potential to mitigate problems such as the fluctuation of light intensity and shadows in complex environments.
(2) A sample collection method for network training was proposed, by which the robot could automatically drive and collect data without an operator or remote control. No hand labeling of training samples is required.
(3) Methods such as batch normalization, dropout, data augmentation, and 10-fold cross-validation were adopted to improve the network's generalization ability, and a visualization analysis was carried out to clearly understand the useful features learned by the network.
The remainder of this paper is organized as follows. Section 2 contains an overview of the vision-based navigation system, the design of the training data collection method, and the CNN architecture. Section 3 details the results of the simulation and row-following tests, together with the discussion. Finally, Section 4 states the conclusions of this study.

System Overview.
A crawler-type robot platform was used in this research. To keep the navigation system simple and relatively low-cost, a monocular camera (Imaging Source DFK 21AU04) with a frame rate of 30 Hz and an image resolution of 640 × 480 pixels was used for image acquisition. An industrial computer was used to execute high-level algorithms and a microcontroller was used for low-level control operations. A dual antenna GNSS (Global Navigation Satellite System) and a Trimble BD982 receiver were applied for measuring driving trajectories. Figure 1 demonstrates the schematic diagram of the tree row-following platform.

Sample Collection. Given a series of N images $x_1, x_2, \ldots, x_N$ sampled at time instances 1, 2, ..., N, respectively, the goal of the designed deep network is to classify each image as one of the discrete commands {move forward, turn left, turn right, stop}. The commands are defined by the following rules: (1) the robot moves forward if both sides of the tree rows can be perceived in the camera field of view; (2) the robot turns left or right if and only if one side of a row can be perceived; and (3) the robot stops driving and gets ready for headland turning when the last row is detected.
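The three rules can be written down as a small decision function. The following Python sketch is only an illustration: the function name, the boolean inputs, and the `Command` enum are hypothetical, and the priority given to "stop" as well as the left/right mapping are assumptions inferred from the sample collection procedure (images showing only the left row are labeled "turn right"). In the actual system these decisions are learned by the network from images rather than computed from explicit flags.

```python
from enum import Enum


class Command(Enum):
    MOVE_FORWARD = "move forward"
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"
    STOP = "stop"


def command_from_view(left_row_visible: bool,
                      right_row_visible: bool,
                      last_row_detected: bool) -> Command:
    """Map what the camera sees to one of the four discrete commands."""
    # Rule (3): detecting the last row takes priority (assumption).
    if last_row_detected:
        return Command.STOP
    # Rule (1): both sides of the tree rows are in the field of view.
    if left_row_visible and right_row_visible:
        return Command.MOVE_FORWARD
    # Rule (2): only one side is visible, so steer toward the missing side.
    if left_row_visible:
        return Command.TURN_RIGHT
    if right_row_visible:
        return Command.TURN_LEFT
    # The rules do not define the case where no row is visible.
    raise ValueError("no tree row visible: command undefined by the rules")
```

Raising an error for the undefined case, instead of silently picking a command, keeps the sketch faithful to the rules as stated.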
For navigation based on deep learning, collecting a large number of training samples is a great challenge because a mobile robot may need to be steered or controlled by an operator for several days or even weeks. To reduce this workload, a collection method based on GNSS path tracking was developed in this study. During the initial stage, an artificial apple orchard environment was set up in order to focus on the performance of the designed method and to limit losses if the robot ran into the tree rows. A reference path was defined as the centerline between the tree rows. A path tracking controller was designed; its details are presented in [37]. The sample collection process was carried out as follows. First, the robot executed straight-line path tracking at a speed of 0.3 m/s, and the image sequences were saved and categorized as "move forward." Second, the yaw angle of the camera was adjusted until only the left row could be seen in the camera field of view. Straight-line path tracking was executed again, and the image sequences were categorized as "turn right." "Turn left" samples were obtained using a similar process. Finally, images from the "move forward," "turn right," and "turn left" sequences that showed the last row were extracted and categorized as "stop." Some samples are shown in Figure 2.

Network Configuration and Training
The designed CNN consists of five convolutional layers and one fully connected layer. To speed up training and real-time navigation, each input image is first resized to 48 × 36 pixels. The output of the network is the predicted steering command. As shown in Figure 3, the network mainly consists of convolutional layers, activation functions, pooling layers, a fully connected layer, and a SoftMax layer. The five convolutional layers perform feature extraction; each kernel of a convolutional layer produces a feature map, which is further compressed by the pooling layers to extract the key features. Finally, all the features are combined by the fully connected layer, and the results are passed to the classifier for steering command prediction.
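The final classification step can be illustrated in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the class order in `COMMANDS`, the function names, and the example scores are all assumptions made for illustration, since the paper does not list them.

```python
import math

# Class order is an assumption for illustration; the paper does not state it.
COMMANDS = ["move forward", "turn left", "turn right", "stop"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_command(logits):
    """Return the most probable command and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return COMMANDS[best], probs[best]


# Hypothetical scores from the fully connected layer for one image.
command, confidence = predict_command([2.0, 0.1, -1.0, 0.3])
```

Here the largest score belongs to the first class, so `command` is "move forward"; the accompanying probability could be used to reject low-confidence predictions, although the paper does not describe such a step.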

Network Training.
Optimizing a deep neural network is not trivial due to the vanishing/exploding gradient problem. In addition, optimization may get stuck at a saddle point, resulting in premature termination and inferior low-level features [38]. To improve the training, the following methods were adopted in this study:
(1) Batch normalization was applied right after each convolutional layer, which forces the network's activations to generate larger variances across different training samples, accelerating the optimization in the training phase and achieving a better classification performance [39]. Formally, denoting by $x \in B$ an input to batch normalization that is from a minibatch $B$, batch normalization transforms $x$ according to the following expression:
$$\mathrm{BN}(x) = \gamma \odot \frac{x - \mu_B}{\sigma_B} + \beta, \tag{1}$$
where $\mu_B$ is the sample mean and $\sigma_B$ is the sample standard deviation of the minibatch $B$. The notation $\odot$ represents the Hadamard (elementwise) product. $\gamma$ and $\beta$ are the scale parameter and shift parameter, respectively; they need to be learned jointly with the other model parameters. $\mu_B$ and $\sigma_B$ in equation (1) can be calculated as follows:
$$\mu_B = \frac{1}{|B|} \sum_{x \in B} x, \qquad \sigma_B^2 = \frac{1}{|B|} \sum_{x \in B} (x - \mu_B)^2 + \varepsilon, \tag{2}$$
where $\varepsilon > 0$ is a small constant that is added to the variance estimate to avoid division by zero, even in cases where the empirical variance estimate might vanish.
(2) Dropout was also adopted during training to prevent network overfitting [40]. With dropout probability $p$, each intermediate activation $h$ is replaced by a random variable $h'$ as follows:
$$h' = \begin{cases} 0 & \text{with probability } p, \\ \dfrac{h}{1 - p} & \text{otherwise.} \end{cases} \tag{3}$$
(3) To enrich the training samples, data augmentation was executed. In detail, samples labeled "move forward" and "stop" are horizontally flipped. All samples are randomly shifted horizontally or vertically within the range [−5, 5] pixels and randomly rotated counterclockwise or clockwise within the range [−10, 10] degrees. The brightness of each sample is randomly adjusted to simulate lighting changes at different times of the day. Some of the samples after augmentation are displayed in Figure 4.
(4) To minimize the sampling bias of the trained models, 10-fold cross-validation was used. The original training data are split into ten nonoverlapping subsets. Then model training and validation are executed ten times, each time training on nine subsets and validating on the remaining subset (the one not used for training in that round). Finally, the training and validation errors are estimated by averaging the results over the ten rounds.
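The batch normalization, dropout, and cross-validation steps listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the training code used in the study: it operates on scalar activations rather than image tensors, the function names and default values (`gamma`, `beta`, `eps`, `seed`) are hypothetical, and in practice a deep learning framework would supply these operations.

```python
import math
import random


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a minibatch of scalar activations, then scale and shift."""
    mean = sum(batch) / len(batch)
    # eps is added to the variance estimate to avoid division by zero.
    var = sum((x - mean) ** 2 for x in batch) / len(batch) + eps
    std = math.sqrt(var)
    return [gamma * (x - mean) / std + beta for x in batch]


def dropout(activations, p, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return [0.0 if rng.random() < p else h / (1.0 - p) for h in activations]


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, val_indices) for k nonoverlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
dropped = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=random.Random(0))
splits = list(k_fold_splits(100, k=10))
```

Rescaling the surviving activations by 1/(1 − p) keeps the expected value of each activation unchanged, which is why no extra scaling is needed when dropout is switched off at test time.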

Training Results.
A total of 21280 samples were classified into four categories for training. To test how effectively batch normalization (abbreviated as BN) and dropout (abbreviated as DP) prevent network overfitting, different network structures were trained and compared: BN plus DP, only BN, and only DP. Figure 5 shows the results of the comparison during network training. The performance was poor when using only DP, with an accuracy of only 37.04%. By contrast, when using BN plus DP or only BN, the accuracy reached 96.31% and 96.33% on average, respectively, which was much higher than using only DP. Figure 6 further compares the accuracy and loss. With BN plus DP or only BN, the accuracy had already reached 97% and the loss had dropped to 0.07 as early as the 30th iteration. By contrast, when using only DP, training had to be executed for 210 iterations to achieve the same performance. Overall, BN plus DP and only BN performed similarly, and both gave better results than DP alone.
For the final training run, BN plus DP was chosen to construct the network, since BN improved the accuracy and sped up the training, while DP reduced the number of features in intermediate layers, which could significantly reduce the network's dependence on certain features. In each category, 80% of the samples were randomly selected as the final training set (17024 images in total), and the remaining samples (4256 images) were used as the test set. During the final training, the loss was reduced to below 0.0055 after 330 iterations.
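The per-category 80/20 split can be reproduced with a short sketch. The code below is an illustration only: the function name and file names are hypothetical, and because the paper reports only the total of 21280 samples, four equal categories of 5320 images are assumed here so that the arithmetic can be checked.

```python
import random


def stratified_split(samples_by_class, train_fraction=0.8, seed=0):
    """Randomly split each category into training and test subsets."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train += [(s, label) for s in shuffled[:cut]]
        test += [(s, label) for s in shuffled[cut:]]
    return train, test


# Hypothetical class sizes (equal categories are an assumption, see lead-in).
labels = ["move forward", "turn left", "turn right", "stop"]
dataset = {label: [f"{label}_{i}.png" for i in range(5320)] for label in labels}
train_set, test_set = stratified_split(dataset)
print(len(train_set), len(test_set))  # prints "17024 4256"
```

Splitting inside each category, rather than over the pooled data, keeps the class proportions of the training and test sets identical.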

Results and Discussion
Simulation. A video simulation was carried out to evaluate the performance of the network before applying it in the actual test. The robot was driven through the tree rows by remote control for video recording; a total of 7 videos were recorded. During simulation, the network with BN plus DP and the network with only BN had similar performance, which was consistent with the training results. The average error rate was 3.8% for the network with BN plus DP and 4% for the network with only BN.
To further understand the features learned by the network, visualization was performed. Figure 7 displays the activations of the first convolutional layer, which has eight convolution kernels, under different image inputs. The first and most important task of machine vision-based navigation is perception, which in our case means recognition of the tree rows. In Figure 7, white pixels represent strong positive activation, while black pixels represent strong negative activation. A channel that is mostly grey does not activate as strongly on the input image. After training, the regions of the tree rows were strongly activated. At the same time, the contours and texture of the tree rows were clear, which suggested that the designed CNN had correctly learned the key features of the tree rows.
Without activation functions, deep networks would be less powerful and would not be able to learn complex patterns from the data. Traditionally, two widely used nonlinear activation functions are the sigmoid and the hyperbolic tangent (tanh). A general problem with both functions is the possibility of saturation. Once a unit saturates, it becomes challenging for the learning algorithm to continue adapting the weights to improve the performance of the model. Therefore, the ReLU (rectified linear unit) function was adopted, which not only solves the saturation problem but also has the following advantages: computational simplicity, representational sparsity, and linear behavior. The activation of the ReLU function in the first layer is shown in Figure 8.
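The saturation argument can be checked numerically. The sketch below, with hypothetical helper names, compares the derivatives of the sigmoid, tanh, and ReLU functions at a few input values; it is meant only to illustrate why gradients vanish for saturated units and is not part of the navigation system.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # The derivative at exactly 0 is undefined; 0 is used here by convention.
    return 1.0 if x > 0 else 0.0


# For large inputs the sigmoid and tanh gradients shrink toward zero
# (saturation), while the ReLU gradient stays at 1 for any positive input.
for x in (0.5, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

Because the weight update is proportional to these derivatives, a saturated sigmoid or tanh unit passes almost no gradient back, whereas an active ReLU unit passes it unchanged.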

Tree Row-Following Test.
Two types of row-following tests were carried out. Straight-line row following was executed first, where the length of the row was set to 20 m. To further test the generalization of the designed network, a curved path section was added between two straight-line row sections. The height and tilt angle of the camera were set to 1.65 m and 15°, respectively. The speed of the robot was set to 0.5 m/s, and the deviation was measured using GNSS. The experiments were conducted at noon on a sunny day with an occasional breeze. During the tests, the robot moved on flat land with naturally hard and dense surface soil. To reduce the sliding effects, a tracked mobile robot was adopted in this study, since it has better maneuverability in rough terrain and higher friction in turns due to its tracks and multiple points of contact with the surface. The robot platform and the test environment are shown in Figure 9.
During the test, the robot automatically moved forward when both sides of the tree rows appeared in the captured image; if only one side of the row appeared, the robot executed the corresponding turn; once the last row was detected, the robot stopped and got ready for the headland turning. Traditional navigation was also conducted and compared with the end-to-end method. During the test, images were collected after each run for retraining the end-to-end network. Figure 10 shows a comparison of the straight-line row-following performance, and Figure 11 shows the row-following trajectories.
As shown in Figure 10, the average lateral error was 0.14 m and the average heading error was 0.8° for traditional navigation, while the average lateral error was 0.29 m and the average heading error was 1.8° for end-to-end navigation. In general, traditional navigation had better performance. However, when the robot drove along trees with large gaps or when the trees swayed in the wind, traditional navigation became unstable, since the tree rows could not be fully extracted or were overextracted due to the fluctuation of light intensity. By contrast, the end-to-end row-following system greatly alleviated these issues, since different light intensities had been simulated using image augmentation during network training. Both the lateral and heading errors of the end-to-end method kept decreasing after each run, which means that performance improved with each retraining of the network. Figure 12 shows a case in which the end-to-end navigation system performed poorly. When the real-time captured image differed significantly from the training samples, the accuracy of the prediction decreased. In future work, this problem can be mitigated by adding more steering types, such as a large turn or a series of small turns. Increasing the amount of training data would also likely increase the accuracy of prediction. Moreover, since ideal steering is a continuous control problem, using a regression model to describe the robot movement would be more accurate. Other methods such as regularization, early stopping, and network pretraining may further improve the network's generalization and should be tested in the future. Overall, even with a certain initial deviation, the robot could still return to the center of the row after a while.
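Average lateral and heading errors of this kind can be computed from logged poses and a reference centerline. The sketch below is one plausible way to do so for a straight row, under assumptions that the paper does not confirm: poses are expressed in a local metric frame, errors are averaged as absolute values, and the function name, argument layout, and example poses are hypothetical.

```python
import math


def row_following_errors(track, ref_start, ref_end):
    """Mean absolute lateral (m) and heading (deg) error along a straight row.

    track: list of (x, y, heading_deg) poses in a local metric frame.
    ref_start, ref_end: (x, y) end points of the reference centerline.
    """
    dx, dy = ref_end[0] - ref_start[0], ref_end[1] - ref_start[1]
    length = math.hypot(dx, dy)
    ref_heading = math.degrees(math.atan2(dy, dx))
    lateral, heading = [], []
    for x, y, heading_deg in track:
        # Signed perpendicular distance from the pose to the reference line.
        cross = dx * (y - ref_start[1]) - dy * (x - ref_start[0])
        lateral.append(abs(cross / length))
        # Wrap the heading difference into [-180, 180) degrees.
        diff = (heading_deg - ref_heading + 180.0) % 360.0 - 180.0
        heading.append(abs(diff))
    return sum(lateral) / len(lateral), sum(heading) / len(heading)


# Hypothetical poses along a 20 m row whose centerline runs along the x-axis.
poses = [(0.0, 0.3, 2.0), (10.0, -0.1, -1.0), (20.0, 0.2, 0.0)]
mean_lateral, mean_heading = row_following_errors(poses, (0.0, 0.0), (20.0, 0.0))
```

For the three example poses the function yields a mean lateral error of about 0.2 m and a mean heading error of about 1.0°; a curved section would need a piecewise or spline reference instead of a single line segment.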

Conclusions
A visual tree row-following system based on end-to-end learning for an agricultural robot in an apple orchard environment was developed in this paper. The input image was directly mapped to steering commands by the designed CNN. A data collection method without human driving or remote control was also proposed. The CNN consisted of five convolutional layers and one fully connected layer. To improve the network's generalization ability, techniques such as batch normalization, dropout, data augmentation, and 10-fold cross-validation were adopted in the study. Two types of row-following tests were carried out. Test results showed that the robot could adjust its posture according to different situations and drive through the tree rows.
As a preliminary test, the tree row-following system was evaluated in a simple landscape with obvious color contrast and shape structure. With implements installed on the mobile robot, this research could be extended to different agricultural tasks such as planting, spraying, fertilizing, cultivating, harvesting, thinning, weeding, and inspection. In future work, more realistic elements of orchard navigation will be added.
(1) Leaves and weeds in a real apple orchard appear as noise in the captured images, which affects the accuracy of visual navigation. In this study, each input image was resized to a low resolution, which reduced the noise while preserving the main regions of the tree rows. In a future study, additional complications will be introduced; e.g., samples under different weather conditions, with different tree trunk sizes and colors, canopy shadows, and tree trunks with branches may be collected and added to the training process to enhance the generalization of the designed model. Moreover, image preprocessing such as noise reduction should also be considered and developed to increase the robustness of the navigation system.
(2) Keeping the camera steady during data collection is a great challenge for visual navigation. In this study, samples were randomly shifted and rotated during training to simulate camera vibration. In future studies, the camera orientation, such as the yaw, pitch, and roll angles measured by an IMU (Inertial Measurement Unit), could be used to calibrate the captured images and further reduce the effect of camera vibration.
(3) A tracked mobile robot was adopted in this study since it has better maneuverability in rough terrain and higher friction in turns due to its tracks and multiple points of contact with the surface. Furthermore, the sliding effects can be incorporated into an extended kinematics model to make the robot adapt to different terrain conditions.
(4) In a real orchard, some apple trees or even an entire tree row may be missing due to the planting plan. This situation should also be considered when executing the sample collection process in future research.
(5) Adopting high-performance GPUs and using an embedded development platform such as the NVIDIA Jetson TX2, which can process image and video information efficiently, will also be considered. With further development, the system can be combined with longer-term strategies, such as headland turning planning and control.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.