Research on the Agricultural Machinery Path Tracking Method Based on Deep Reinforcement Learning

With the rapid development of information technology, the industrial and service sectors have advanced quickly in recent years. In agriculture, by contrast, the adoption of information technology lags far behind, which directly hinders the sector's digital development. Starting from the current driving operation of agricultural machinery, this paper simplifies the traditional driving operation method by means of a kinematics model of the machinery and applies deep reinforcement learning theory to path tracking in agricultural operation scenes. A controller is designed, and an operating framework for autonomous path tracking under deep reinforcement learning is built. Experiments show that autonomous path tracking control of agricultural machinery achieves better automatic control after learning from experience. The I-DQN algorithm enables agricultural robots to adapt to the environment faster during path tracking, which improves path tracking performance. The results offer guidance for further promoting the automatic navigation and control of agricultural machinery and for realizing efficient mechanized agricultural operation.


Introduction
Automatic navigation control of agricultural machinery is a key technology supporting precision agriculture. This technology can improve the working accuracy and efficiency of agricultural machinery and relieve the driver of long, tiring, and repetitive driving work, leaving enough time to monitor and operate the machinery. Therefore, the automatic navigation control of agricultural machinery has broad development prospects. The path tracking methods at the core of automatic navigation control fall mainly into model-based control methods and model-independent control methods. Among model-based control methods, scholars have separately studied path following control based on the kinematics model and on the dynamics model of agricultural machinery [1][2][3][4][5][6][7][8][9]. However, the methods based on the kinematics model mostly approximate the model by small-angle linearization and design the controller under the assumption of constant speed.
This not only introduces linearization error but also degrades the controller's performance and robustness when the speed changes. The control method based on the dynamic model can fully account for the dynamic characteristics of agricultural machinery, but the dynamic model parameters are difficult to obtain online in real time. Among model-independent control methods [10][11][12][13][14][15], the online adaptive determination of the look-ahead distance in the pure pursuit method has not been well solved. Intelligent methods have some human-like intelligence and a nonlinear mapping ability that traditional control methods cannot match, but their design requires certain empirical knowledge and a complex learning and training process. Building on the outstanding advantages of intelligent methods in agricultural machinery path control, this paper proposes an agricultural machinery path tracking method based on deep reinforcement learning. Research on this method has practical significance for the development of efficient agricultural operation methods.

Related Theories
This paper combines reinforcement learning with deep learning, making full use of the decision-making advantages of reinforcement learning and the perceptual advantages of deep learning [13,16,17]. In deep reinforcement learning, reinforcement learning defines the problem and the optimization goal, deep learning represents the strategy function or value function, and the backpropagation algorithm optimizes the objective function. To a certain extent, deep reinforcement learning has the general intelligence to solve complex problems.

Deep Learning.
Deep learning is derived from the idea of the artificial neural network, which combines low-level features to form higher-level features and attribute categories. The most basic unit of an artificial neural network is the neuron, also known as the perceptron. A deep neural network (DNN) is also called a multilayer perceptron. It differs from a single-layer perceptron in that it adds multiple hidden layers and can have multiple outputs. The hidden layers can learn more complex feature information, and the network can output multiple values. This enables the neural network model to solve more types of problems, such as classification, regression, dimensionality reduction, and clustering. Combining deep neural networks with different activation functions further enhances the expressive ability of the model [13,16,17].
The deep neural network model is shown in Figure 1. The structure can be divided into an input layer, hidden layers, and an output layer. The input layer receives information obtained through sensors or from the environment, such as radar data of agricultural intelligent harvesting vehicles. Each hidden layer is a feature level, in which each neuron represents a feature attribute. The output layer produces the required variables, such as the angular velocity and linear velocity of agricultural intelligent harvesting vehicles.
In a DNN, adjacent layers are fully connected; that is, every neuron in layer $l$ is connected to every neuron in layer $l-1$. Assume that there are $m$ neurons in layer $l-1$, $W^l_{jk}$ is the weight between the $j$th neuron in layer $l$ and the $k$th neuron in layer $l-1$, $b^l_j$ is the bias of the $j$th neuron in layer $l$, and $\sigma(z)$ is the activation function. Then the output $a^l_j$ of the $j$th neuron in layer $l$ is

$$a^l_j = \sigma\left(z^l_j\right) = \sigma\left(\sum_{k=1}^{m} W^l_{jk}\, a^{l-1}_k + b^l_j\right). \tag{1}$$

The above process is the forward propagation of the neural network. To optimize the parameters of the neural network, backpropagation is required. In order to calculate the error between the model output and the true output of the training samples, a loss function for training is first defined, as in (2). The error is then used to update the weights of each neuron, and finally a better model is obtained. This is the backpropagation process of the deep neural network.
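The forward propagation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the network used in the experiments: the layer sizes, the sigmoid activation, and the squared-error loss are assumptions chosen for the example, and the names `forward` and `squared_error_loss` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, used here as an example of sigma(z)."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, biases, activation=sigmoid):
    """Propagate input x through fully connected layers.

    weights[i] has shape (neurons in layer i+1, neurons in layer i).
    Returns the list of activations of every layer, input included.
    """
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b       # weighted sum plus bias
        activations.append(activation(z))  # a = sigma(z)
    return activations


def squared_error_loss(output, target):
    """Assumed loss: half the squared distance between output and target."""
    return 0.5 * float(np.sum((output - target) ** 2))


rng = np.random.default_rng(0)
layer_sizes = [3, 4, 2]  # hypothetical: 3 inputs, 4 hidden neurons, 2 outputs
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0])
acts = forward(x, weights, biases)
loss = squared_error_loss(acts[-1], y)
```

Backpropagation would then differentiate `loss` with respect to each weight matrix and bias vector; that step is omitted here.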

Reinforcement Learning.
Unlike deep learning, which focuses on perception and representation, reinforcement learning focuses on finding problem-solving strategies [18,19]. Reinforcement learning consists mainly of an agent and an environment. Since the interaction between the agent and the environment resembles that between an organism and its environment, reinforcement learning can be regarded as a general learning framework that represents the future development trend of general artificial intelligence algorithms [20,21]. The basic framework of reinforcement learning is shown in Figure 2. Agents interact with the environment through states, actions, and rewards. Suppose that the state of the environment at time $t$ in Figure 2 is denoted $s_t$ and the agent performs an action $a_t$ in the environment. The action $a_t$ changes the state of the environment, so that the agent reaches a new state $s_{t+1}$ at time $t+1$, and the environment returns a feedback reward $r_{t+1}$ to the agent. The agent then performs a new action $a_{t+1}$ based on the new state $s_{t+1}$ and the feedback reward $r_{t+1}$ and keeps interacting with the environment iteratively through these feedback signals [22]. The ultimate goal of this process is to maximize the agent's cumulative reward. Equation (3) gives the cumulative reward $G$:

$$G = r_1 + r_2 + \cdots + r_n. \tag{3}$$

In the above process, the rule for selecting actions according to the state $s$ and the reward $r$ is called the strategy $\pi$, and the value function $v$ is the expectation of the cumulative reward. Reinforcement learning continuously performs trial-and-error learning according to the feedback from the environment and then adjusts and optimizes its own state information. The purpose is to find the optimal strategy or the maximum reward.
There are two types of environments in which an agent may be located [23]: if the environment is known, the task is called model-based; if the environment is unknown, it is called model-free. The relationship between model-based tasks and model-free tasks is shown in Figure 3. The line-following agricultural robot shown in Figure 3(a) controls its motion by sensing the black course on the ground through sensors. Since the black route on the ground is planned in advance and the surrounding environment is controllable and known, this can be regarded as a model-based task. Figure 3(b) shows the autopilot system of a car. In a real traffic environment, many things cannot be estimated in advance, such as the behavior of passers-by, the trajectories of passing vehicles, and other emergencies, so this can be regarded as a model-free task.
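As a small worked example of the cumulative reward in (3), the following sketch sums the rewards of one episode. The optional `discount` argument is an assumption added for illustration; with its default value of 1.0 the function reduces exactly to (3).

```python
def cumulative_reward(rewards, discount=1.0):
    """Sum the rewards r_1, ..., r_n of one episode.

    With discount == 1.0 this is G = r_1 + r_2 + ... + r_n.
    A discount below 1.0 (an assumption, not part of (3)) weights
    later rewards less.
    """
    total = 0.0
    weight = 1.0
    for r in rewards:
        total += weight * r
        weight *= discount
    return total


episode_rewards = [1.0, 2.0, 3.0]
G = cumulative_reward(episode_rewards)
```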

Deep Q-Learning (DQN) Algorithm.
The DQN algorithm is a well-known work of the Google DeepMind team. They combined reinforcement learning with a deep learning network model to solve control strategy problems, opening a new era of deep reinforcement learning [24][25][26][27]. The Q-learning algorithm stores Q values in the form of a Q table, as shown in Figure 4. This way of storing Q values can handle problems such as mazes, where the state space and action space are very small, but when the problem has a large action or state space, the Q table requires a very large amount of data. The DQN algorithm combines Q-learning with deep learning, using a deep convolutional neural network as shown in Figure 5.
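To make the Q table concrete, the sketch below stores Q values in a NumPy array indexed by state and action and applies the standard tabular Q-learning update. The table size, the learning rate `alpha`, and the discount factor `gamma` are illustrative assumptions.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the table."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table


# Hypothetical problem with 4 states and 2 actions: the table has 8 entries.
q_table = np.zeros((4, 2))
q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=2)
```

The table needs one entry per state-action pair, which is why this representation stops being practical once the state or action space grows large.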
The DQN algorithm makes the following improvements on the basis of the reinforcement learning algorithm:
(1) DQN uses a deep neural network to approximate the Q value function. The value function here corresponds to the weights $\theta$ of the layers in the convolutional neural network, that is, $Q(s, a; \theta) \approx Q^{\pi}(s, a)$. In this way, updating the Q value function is essentially updating the weights $\theta$ of the neural network. Once the parameter $\theta$ of the neural network is determined, the value function $Q$ is also determined.
(2) Experience replay is used to train the neural network. The deep neural network used by DQN is a supervised model, so its input data need to be independent and identically distributed. Since the data collected by the agent in the environment are sequential, adjacent samples are correlated. When the algorithm trains on a run of consecutive samples, the gradient descent directions become nearly the same, and computing the gradient with the same training step size may prevent convergence. The experience replay mechanism stores the data collected by the agent in a memory bank and then samples uniformly at random from the memory bank to obtain data for neural network training. With experience replay, the behavior distribution is averaged over many previous states, which smooths the learning process and avoids oscillation or divergence of the parameters. In addition, assigning a priority to each transition in the experience replay memory can greatly improve learning efficiency compared with uniform sampling.
(3) A target Q network is set up to calculate the TD error. When a convolutional neural network is used to approximate the Q value function, the parameter $\theta$ is updated by gradient descent, with learning rate $\alpha$:

$$\theta_{t+1} = \theta_t + \alpha\left[r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right]\nabla Q(s, a; \theta). \tag{4}$$

In (4), $r + \gamma \max_{a'} Q(s', a'; \theta)$ is called the TD target, and the network used to calculate the TD target is called the target network. The neural network used to approximate the Q value function is called the estimation network. As (4) shows, the target network uses the same parameters as the estimation network, so the two calculated values are correlated and the training results of reinforcement learning become unstable. To solve this problem, the DQN algorithm denotes the parameters of the target network as $\theta^-$. During training, the parameter $\theta$ of the estimation network is updated in real time, while the parameter $\theta^-$ of the target network is obtained by copying the parameters of the estimation network to the target network every $N$ iterations, so (4) becomes

$$\theta_{t+1} = \theta_t + \alpha\left[r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right]\nabla Q(s, a; \theta). \tag{5}$$

In the update of the neural network, the loss function is defined by the mean square error:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]. \tag{6}$$

The gradient of the error function is

$$\nabla_{\theta} L(\theta) = -2\,\mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)\nabla_{\theta} Q(s, a; \theta)\right]. \tag{7}$$

After the gradient in (7) has been computed and the value of $Q(s, a; \theta)$ obtained, $\nabla Q(s, a; \theta)$ can be used in the update (5) to move the network toward the optimal Q value.
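Improvements (2) and (3) can be illustrated together with a short sketch. It is not the implementation used in this paper: a linear Q function stands in for the convolutional network so that the gradient is easy to write by hand, and the names `ReplayBuffer`, `dqn_update`, and `sync_every` (the interval $N$), the `done` flag, and all numeric values are assumptions made for the example.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size memory of transitions with uniform or prioritized sampling."""

    def __init__(self, capacity, seed=0):
        self.transitions = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done, priority=1.0):
        self.transitions.append((state, action, reward, next_state, done))
        self.priorities.append(float(priority))

    def sample_uniform(self, batch_size):
        # Uniform random sampling without replacement breaks up the
        # correlation between consecutive transitions.
        return self.rng.sample(list(self.transitions), batch_size)

    def sample_prioritized(self, batch_size):
        # Sampling (with replacement) proportional to the stored priority.
        return self.rng.choices(list(self.transitions),
                                weights=list(self.priorities), k=batch_size)

    def __len__(self):
        return len(self.transitions)


def q_values(theta, state):
    """Linear stand-in for the Q network: one row of theta per action."""
    return theta @ state


def dqn_update(theta, theta_target, batch, alpha=0.1, gamma=0.9):
    """Apply the target-network update to each transition in turn."""
    theta = theta.copy()
    for s, a, r, s_next, done in batch:
        # TD target computed with the target parameters theta_target.
        target = r if done else r + gamma * np.max(q_values(theta_target, s_next))
        td_error = target - q_values(theta, s)[a]
        theta[a] += alpha * td_error * s  # gradient of Q(s, a) w.r.t. theta[a] is s
    return theta


s0, s1 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
buffer = ReplayBuffer(capacity=100)
buffer.add(s0, 0, 1.0, s1, False)
buffer.add(s1, 1, 0.0, s0, True, priority=2.0)

theta = np.zeros((2, 3))
theta_target = theta.copy()
sync_every = 5  # copy the estimation parameters to the target every N steps
for step in range(1, 21):
    batch = buffer.sample_prioritized(batch_size=2)
    theta = dqn_update(theta, theta_target, batch)
    if step % sync_every == 0:
        theta_target = theta.copy()
```

Keeping `theta_target` fixed between synchronizations is what decouples the TD target from the parameters being updated.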

Behavioral Learning Theory.
Considering the application of agricultural machinery on actual agricultural land, the machinery should have high flexibility and stability in complex environments. Therefore, this paper adopts a four-wheel agricultural machinery movement model, in which the two rear wheels provide the driving power. The two front wheels adopt different steering angles to ensure smooth steering of the mobile agricultural machinery. The movement model is shown in Figure 6.
When the agricultural machinery system is turning, the turning process can be simplified to the bicycle model shown in Figure 7. In the map coordinate system, $(x_r, y_r)$ and $(x_f, y_f)$ are the coordinates of the center position of the two rear wheels and of the center position of the two front wheels, respectively; $v_r$ and $v_f$ are the speeds at the rear-wheel and front-wheel center positions, respectively; $\varphi$ is the heading angle of the agricultural machine in the map coordinate system; $\delta_f$ is the deflection angle of the front wheel; and $l$ is the distance between the front-wheel center position and the rear-wheel center position. $P$ is the instantaneous turning center of the rear-wheel center position during the turning process, and $R$ is the turning radius of the rear-wheel center point. It is assumed that the deflection angle of the center of mass of the moving agricultural machinery does not change during the turning process; that is, the instantaneous turning radius and the radius of curvature of the path are the same. Then the speed $v_r$ at the rear wheel center $(x_r, y_r)$ is

$$v_r = \dot{x}_r \cos\varphi + \dot{y}_r \sin\varphi. \tag{8}$$

The kinematic constraints at the centers of the front and rear wheels are

$$\dot{x}_f \sin(\varphi + \delta_f) - \dot{y}_f \cos(\varphi + \delta_f) = 0, \qquad \dot{x}_r \sin\varphi - \dot{y}_r \cos\varphi = 0. \tag{9}$$

Combining (8) and (9) gives

$$\dot{x}_r = v_r \cos\varphi, \qquad \dot{y}_r = v_r \sin\varphi. \tag{10}$$

The center coordinates of the rear and front wheels, $(x_r, y_r)$ and $(x_f, y_f)$, are related by

$$x_f = x_r + l\cos\varphi, \qquad y_f = y_r + l\sin\varphi. \tag{11}$$

Substituting (10) and the time derivative of (11) into the front-wheel constraint in (9) gives the angular velocity $\omega$ when the agricultural machinery turns:

$$\omega = \frac{v_r}{l}\tan\delta_f, \tag{12}$$

where $\omega$ is the angular velocity at which the agricultural machinery rotates around the instantaneous rotation center $P$.
From $\omega$ and the moving speed $v_r$ of the agricultural machinery, the turning radius $R$ and the front wheel deflection angle $\delta_f$ are obtained:

$$R = \frac{v_r}{\omega}, \qquad \delta_f = \arctan\frac{l}{R}. \tag{13}$$

Finally, the kinematics model of the mobile agricultural machinery is obtained as

$$\begin{bmatrix} \dot{x}_r \\ \dot{y}_r \\ \dot{\varphi} \end{bmatrix} = \begin{bmatrix} \cos\varphi \\ \sin\varphi \\ \tan\delta_f / l \end{bmatrix} v_r. \tag{14}$$

The agricultural machinery adjusts its own strategy according to the reward and penalty values fed back by the environment while exploring it and finally accomplishes the path tracking task. The framework of the path tracking algorithm is shown in Figure 8. The Autolabor four-wheeled vehicle is used to simulate the operation of agricultural machinery in the design of the path tracking framework.
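The bicycle kinematics model of the agricultural machinery can be stepped forward numerically as a quick check. The sketch below assumes the standard rear-axle form, with $\dot{x}_r = v_r\cos\varphi$, $\dot{y}_r = v_r\sin\varphi$, and $\dot{\varphi} = v_r\tan\delta_f / l$, and integrates it with a simple forward Euler step; the function names, the time step `dt`, and the numeric values are illustrative assumptions.

```python
import math


def bicycle_step(x_r, y_r, phi, v_r, delta_f, wheelbase, dt):
    """Advance the rear-axle bicycle model by one forward Euler step."""
    x_next = x_r + v_r * math.cos(phi) * dt
    y_next = y_r + v_r * math.sin(phi) * dt
    phi_next = phi + v_r * math.tan(delta_f) / wheelbase * dt
    return x_next, y_next, phi_next


def turning_radius(wheelbase, delta_f):
    """Turning radius of the rear-wheel center for a nonzero steering angle."""
    return wheelbase / math.tan(delta_f)


# Hypothetical vehicle: 1 m wheelbase, 1 m/s, constant 0.2 rad steering angle.
pose = (0.0, 0.0, 0.0)
for _ in range(100):
    pose = bicycle_step(*pose, v_r=1.0, delta_f=0.2, wheelbase=1.0, dt=0.01)
```

With a constant steering angle the simulated rear-wheel center traces an arc whose radius is approximately `turning_radius(1.0, 0.2)`.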

Design of the Control Strategy for the
In the above framework, the agricultural machinery obtains external information through the Lidar sensor, executes an action $a$, reaches different states $s_t$, and obtains the corresponding reward value $r$ according to the defined reward and punishment function. When exploring the environment, OU noise is added to increase the exploration of the action space, and the experience gathered in the environment is stored as tuples in the experience replay pool. When training the network, a prioritized replay mechanism samples and learns from the important experience samples first, which reduces the training time of the mobile agricultural machine, and the mobile agricultural machine finally learns to track paths autonomously in the environment. The following sections design the state space, action space, and reward and punishment functions in the algorithm framework.
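The OU (Ornstein-Uhlenbeck) exploration noise mentioned above can be sketched as follows. The parameter values `theta`, `sigma`, and `dt`, the class name `OUNoise`, and the clipping range of the action are assumptions for illustration and are not taken from the experimental settings.

```python
import numpy as np


class OUNoise:
    """Temporally correlated noise that drifts back toward the mean mu."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = self.mu.copy()

    def sample(self):
        """Return the next noise value."""
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = (self.sigma * np.sqrt(self.dt)
                     * self.rng.normal(size=self.state.shape))
        self.state = self.state + drift + diffusion
        return self.state


noise = OUNoise(size=1)
policy_action = np.array([0.3])  # hypothetical action from the policy
noisy_action = np.clip(policy_action + noise.sample(), -1.0, 1.0)
```

Because successive samples are correlated, the perturbed actions change smoothly from step to step, which suits a vehicle that cannot change its steering instantly.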

Agent State and Space Design.
In order to simplify the path tracking model, it is assumed that the agricultural machine is moving at a fixed speed; that is, the agricultural machine has a fixed moving distance in each time step, so the steering angle φ of the machine is taken as the action space, and the dimension is 1.
In deep reinforcement learning training, the goal of the agricultural machinery is to move to the target path while avoiding obstacles. Therefore, the state space of the agricultural machinery needs to include its positional relationship with obstacles and with the target path. This article defines the state space of the agricultural machinery as follows:

$$S = \left[\,\ldots,\ \frac{x - x_{aim}}{k},\ \frac{y - y_{aim}}{k}\right].$$

Here, $(x, y)$ and $\theta$ represent the position and orientation of the agricultural machine in the current map, and $k$ is the normalization coefficient; $d_{obj}$ and $d_{aim}$ represent the distances from the agricultural machine to the nearest obstacle and to the target path; $(x - x_{obj})$, $(y - y_{obj})$ and $(x - x_{aim})$, $(y - y_{aim})$ represent the coordinate offsets of the agricultural machinery from the nearest obstacle and from the target path, respectively.
In actual movement, the real-time pose of the agricultural machine in the environment can be obtained through SLAM technology, and the distance between the agricultural machine robot and the obstacle is obtained through sensors.

Reward Function Design.
The reward and punishment value is the feedback signal given to the agent by the environment, which reflects how good or bad the actions performed by the agent are during task learning. When the agricultural machinery obtains a higher reward value from the environment, its current behavior is more conducive to the path tracking task; conversely, if the mobile agricultural machinery receives a large penalty value, the behavior it performed is bad for the path tracking task and should be avoided as much as possible. The mobile agricultural machine then adjusts its strategy according to the rewards and punishments in the environment. During training, when the agricultural robot reaches the target point or touches obstacles or walls, it is given a fixed reward. When the agricultural robot has neither reached the target nor touched an obstacle, the reward value contains two parts: the first is a negative reward value based on the distance between the agricultural machine and the nearest obstacle; the second is a positive reward value based on the distance between the agricultural robot and the target path. The sum of the two parts is the final reward value obtained by the agricultural robot after each action, and this sum defines the reward function of the agricultural machinery action. The rewards in this reward and punishment function are divided into continuous rewards and instant rewards. Continuous rewards are generated every time the agricultural robot takes an action; instant rewards are given immediately under certain circumstances.
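The structure of this reward function can be sketched as below. Only the structure follows the text: a fixed instant reward on reaching the target or colliding, and otherwise a continuous reward that sums an obstacle penalty and a target term. The specific functional forms, the constants `r_goal`, `r_collision`, `k_obj`, and `k_aim`, and the function name `step_reward` are hypothetical.

```python
def step_reward(d_obj, d_aim, reached_target, collided,
                r_goal=100.0, r_collision=-100.0, k_obj=1.0, k_aim=1.0):
    """Reward for one action of the agricultural robot.

    d_obj: distance to the nearest obstacle.
    d_aim: distance to the target path.
    """
    # Instant rewards: fixed values given at terminal events.
    if reached_target:
        return r_goal
    if collided:
        return r_collision
    # Continuous reward: assumed forms that penalize proximity to
    # obstacles and reward proximity to the target path.
    obstacle_penalty = -k_obj / max(d_obj, 1e-6)
    target_reward = k_aim / (1.0 + d_aim)
    return obstacle_penalty + target_reward


r_far_from_target = step_reward(d_obj=2.0, d_aim=1.0,
                                reached_target=False, collided=False)
r_near_target = step_reward(d_obj=2.0, d_aim=0.5,
                            reached_target=False, collided=False)
```

Under these assumed forms, moving closer to the target path at the same obstacle distance yields a higher reward, and moving closer to an obstacle yields a lower one.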

Design of Autonomous Path Tracking Control for Agricultural Machinery.
The path tracking process of the mobile agricultural machinery under the deep reinforcement learning algorithm is shown in Figure 9. The agricultural machinery first obtains environmental information through sensors, calculates the orientation and distance of obstacles and targets, and selects the corresponding action value according to the exploration noise and the exploration attenuation rate. It is then judged whether the machinery is in the end state or the target state. If it is in the end state, the environment is reset and restarted; otherwise, learning continues in the environment. If it is in the target state, it is further judged whether the algorithm has converged: if it has converged, the program ends; otherwise, new target end points continue to be generated and the machinery keeps interacting with the environment until the end.

Deep Neural Network Structure Design.
The deep neural network of the agricultural machinery is based on the Actor-Critic framework. In the current state, the mobile agricultural machinery obtains and executes actions through the Actor network and interacts with the environment to reach the next state and obtain a reward value. The Critic network then takes the action output by the Actor network together with the state value as input and outputs an evaluation of the current action value. This evaluation indicates how good or bad the action of the mobile agricultural machine is in the current state. The structure of the network is shown in Figure 10.
In the Actor network, the input is the state $S$ of the agricultural machinery robot. The two hidden layers have 400 and 300 neurons, the activation function is ReLU, and the output layer gives the linear velocity $v$ and the angular velocity $\omega$ of the mobile agricultural machinery. Since reversing of the agricultural machinery is not considered, the linear velocity $v$ has only positive values, whereas the angular velocity $\omega$ is signed, with the sign indicating the direction, so the Sigmoid and Tanh activation functions, respectively, are used to output the action values in the continuous action space. In the Critic network, the hidden layers use the same numbers of neurons and the same activation function as the Actor network. The Q value of the output layer does not require an activation function for a nonlinear transformation and is produced directly by a linear transformation. Finally, the smaller Q value is selected from two Critic networks of the same structure to avoid overestimation bias. According to the reward and punishment mechanism, the network parameters are continuously optimized so that the Actor network obtains a higher reward value after performing actions. The Critic network scores the value calculated by the Actor network, and the score is sent back to the Actor network.
The Actor network is updated according to this score. The combination of the two networks improves the efficiency of the algorithm update.
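The network structure described above can be sketched with plain NumPy forward passes. The hidden sizes (400 and 300), the ReLU activations, the Sigmoid and Tanh output heads, and the minimum over two Critic networks follow the description; the state dimension of 9, the random initialization, and the function names are assumptions, and no training step is shown.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for consecutive fully connected layers."""
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def mlp_forward(params, x):
    """ReLU hidden layers followed by a linear output layer."""
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b


def actor(params, state):
    """Return [v, omega]: v in (0, 1) via Sigmoid, omega in (-1, 1) via Tanh."""
    raw = mlp_forward(params, state)
    return np.array([sigmoid(raw[0]), np.tanh(raw[1])])


def twin_critic(params_1, params_2, state, action):
    """Return the smaller of the two Critic estimates of Q(state, action)."""
    x = np.concatenate([state, action])
    q1 = float(mlp_forward(params_1, x)[0])
    q2 = float(mlp_forward(params_2, x)[0])
    return min(q1, q2), q1, q2


rng = np.random.default_rng(0)
state_dim, action_dim = 9, 2  # state_dim is a hypothetical value
actor_params = init_mlp([state_dim, 400, 300, action_dim], rng)
critic_1 = init_mlp([state_dim + action_dim, 400, 300, 1], rng)
critic_2 = init_mlp([state_dim + action_dim, 400, 300, 1], rng)

state = rng.normal(size=state_dim)
action = actor(actor_params, state)
q_min, q1, q2 = twin_critic(critic_1, critic_2, state, action)
```

The Sigmoid head keeps the linear velocity positive, consistent with not allowing the vehicle to reverse, while the Tanh head lets the angular velocity take either sign.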

Simulation Environment Settings.
This section adopts a mobile agricultural machinery model. Autolabor, a ROS-based mobile four-wheeled vehicle, is used in place of agricultural machinery. It provides programmable, SLAM mapping and navigation, and motion control functions, and the Autolabor software is available in open source form. In RVIZ, the models of agricultural machinery robots are commonly described in URDF and XACRO files, which are essentially XML. Autolabor's model files are shown in Figure 11. After the model is built, it is started for testing by creating the file display.launch in the launch folder (Figure 12). The first input parameter, model, is the path to the urdf file to be launched. The second input parameter, gui, specifies whether to enable the joint rotation control panel window. The two parameters thus describe, respectively, the model description file to be started (urdf) and the joint control window (gui, corresponding to each joint). Three nodes are used to publish joint information, publish robot control information, and start rviz.
Link and Joint can be compared to the human skeleton and joints; they are the basis for describing the structure of agricultural machinery and agricultural machinery robots and are organized in a tree structure. The main body, wheels, and joints of the agricultural machinery robot are defined as links, each with several attributes: <visual> defines the appearance attributes of the link; <geometry> defines the shape of the structure; <inertial> and <collision> specify the inertial properties and collision properties, respectively. The final Autolabor model in RVIZ is shown in Figure 13.
Figure 8: Agricultural machinery path tracking framework based on Autolabor.
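To illustrate the link and joint tree described above, the following sketch parses a small URDF fragment with the Python standard library and lists its links and joints. The fragment is a hypothetical two-link example written for this illustration; it is not the Autolabor model file, and all names and numeric values in it are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal URDF: a body link and one wheel joined to it.
URDF_TEXT = """
<robot name="demo_vehicle">
  <link name="base_link">
    <visual>
      <geometry><box size="0.6 0.4 0.2"/></geometry>
    </visual>
    <collision>
      <geometry><box size="0.6 0.4 0.2"/></geometry>
    </collision>
    <inertial>
      <mass value="10.0"/>
      <inertia ixx="0.1" ixy="0" ixz="0" iyy="0.1" iyz="0" izz="0.1"/>
    </inertial>
  </link>
  <link name="rear_left_wheel"/>
  <joint name="rear_left_wheel_joint" type="continuous">
    <parent link="base_link"/>
    <child link="rear_left_wheel"/>
  </joint>
</robot>
"""


def summarize_urdf(urdf_text):
    """Return link names and (joint name, parent link, child link) triples."""
    root = ET.fromstring(urdf_text.strip())
    links = [link.get("name") for link in root.findall("link")]
    joints = [(joint.get("name"),
               joint.find("parent").get("link"),
               joint.find("child").get("link"))
              for joint in root.findall("joint")]
    return links, joints


links, joints = summarize_urdf(URDF_TEXT)
```

Each joint names one parent link and one child link, which is what gives the model its tree structure.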
Next, create a topographic map of agricultural land based on the topographic characteristics of agricultural land, as shown in Figure 14.

Pretest Results and Analysis of Physical Fitness.
A training experiment of the mobile agricultural machinery is essentially a process in which the agricultural robot explores the environment, adjusts its action strategy according to the feedback of the environment, and finally achieves path tracking and obstacle avoidance. During the training of agricultural robots, the starting point is the origin, and the target end point is randomly generated in the simulation environment.
Coordinates that fall within an obstacle collision area cannot be set as the target end point. When the agricultural machinery robot reaches the target, it has successfully completed one path tracking task and uses this point as the starting point to continue to the next randomly generated target end position. When an agricultural robot fails to track the path, this is regarded as a terminal state. The terminal state covers the cases in which the agricultural robot hits an obstacle, hits a wall, or reaches the upper limit of the planned number of steps. In these cases, the agricultural robot starts the next training round from the planned starting point. Training is completed after the set maximum number of training rounds is reached. The training process is shown in Figure 15.
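The episode logic described above can be sketched as two small functions. The thresholds `goal_tolerance` and `collision_distance`, the step limit `max_steps`, and the function names are hypothetical values chosen for illustration.

```python
def classify_step(d_aim, d_obj, step_count,
                  goal_tolerance=0.2, collision_distance=0.15, max_steps=500):
    """Classify the current step as goal, collision, step_limit, or running."""
    if d_aim <= goal_tolerance:
        return "goal"
    if d_obj <= collision_distance:
        return "collision"  # obstacle or wall
    if step_count >= max_steps:
        return "step_limit"
    return "running"


def next_start(outcome, current_position, origin=(0.0, 0.0)):
    """Choose the starting point of the next tracking task.

    After reaching a target the robot continues from where it is;
    after a terminal failure it restarts from the planned starting point.
    """
    return current_position if outcome == "goal" else origin


outcome = classify_step(d_aim=0.1, d_obj=1.0, step_count=42)
start = next_start(outcome, current_position=(3.0, 2.0))
```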

Experimental Parameter Settings.
In order to improve the reliability of the experimental data, the experiments in this section are all carried out in the environment Ubuntu 16.04 + CUDA 9.0 + PyTorch 0.4.1, and the experimental hardware is an i7-8750H + GeForce GTX1060 + 16 GB. The specific settings of the experimental parameters are shown in Table 1.

Experiment and Result Analysis.
In this section, experiments are conducted on static obstacle scenes and dynamic obstacle scenes, respectively. In each scene, the path tracking results of the DQN algorithm and of the I-DQN algorithm proposed in this paper are tested and analyzed.

Static Obstacle Experiment.
The reward value of the first 1000 training rounds is plotted as a reward curve in Figure 16. In the initial stage of training, since the agricultural robot has just started to interact with the environment, it often drives away from the target and finally collides with obstacles or walls, so the penalty value is high and the initial reward is roughly between -500 and -400. In the first 200 training rounds, because the DQN algorithm cannot distinguish the importance of experiences, it can only keep exploring and trying to learn, and its curve fluctuates greatly. The I-DQN algorithm uses a prioritized replay mechanism, which gives priority to learning important experiences. Compared with the DQN algorithm, this reduces the volatility of the curve, and the I-DQN algorithm starts to accelerate toward convergence at about 100 rounds, whereas the rewards of the DQN algorithm do not start to increase until 250 rounds. As training proceeds, the I-DQN algorithm basically converges after 300 rounds, while the DQN algorithm gradually converges at around 450 rounds. Therefore, in scenario 1 under the same training conditions, the I-DQN algorithm has better convergence and stability. Figure 17 shows the path tracking success rate in the first 1000 training rounds of the agricultural robot. The trend of the success rate is roughly the same as that of the reward curve. The success rate of the I-DQN algorithm starts to improve markedly at around 100 rounds, reaching 70% at 200 rounds and basically 90% at about 300 rounds. In contrast, the DQN algorithm had fewer successes in the early stage and lacked stability: its success rate was only 50% at 200 rounds of training and did not gradually reach 90% until after 450 rounds.
Therefore, distinguishing the importance of the experience samples in the experience pool enables the agricultural machinery robot to learn the path tracking task better and finally learn to use experience to avoid obstacles and reach the end point.

Dynamic Obstacle Experiment.
In order to test the path tracking ability of the agricultural robot with the DQN and I-DQN algorithms under different types of obstacles, a dynamic obstacle path tracking test was performed in scenario 2. After the agricultural robot enters the termination state, the dynamic obstacle also returns to its original point, and the agricultural machinery robot restarts path tracking. The reward value and success rate during training in the dynamic obstacle scene are analyzed, and the path length and planning time are tested. Figure 18 shows the reward curves of the two algorithms during training in scenario 2. As in scenario 1, under both algorithms the agricultural robot is still learning how to avoid obstacles in the early stage; because dynamic obstacle avoidance is more complicated, the agricultural robot is more likely to collide with obstacles at first and takes longer to learn. After 150 rounds, the volatility of the I-DQN algorithm began to decrease, the reward value increased rapidly over the subsequent 200 rounds, and the algorithm gradually converged at around 400 rounds. The DQN algorithm fluctuated greatly in the first 200 rounds, began to rise gradually at about 250 rounds, and did not begin to converge until 550 rounds.
Figure 15: Experimental training process.
The path tracking success rates in scenario 2 are shown in Figure 19. Both algorithms have low success rates in the first 150 rounds, but the success rate of the I-DQN algorithm is greater than that of the DQN algorithm in the subsequent 200 rounds. At 350 rounds the success rate of I-DQN reaches 75%, which is about 30% higher than that of the DQN algorithm. At 400 rounds, the success rate of the I-DQN algorithm basically reached 90%, while DQN reached the same success rate only at 550 rounds, which indicates that I-DQN is better than DQN at path tracking in dynamic obstacle scenarios.
After training, the dynamic obstacle avoidance process of the mobile agricultural machine in the Gazebo environment is shown in Figure 20. The agricultural robot is able to continuously reach different target paths while avoiding dynamic obstacles. Because obstacle avoidance is more complicated in dynamic obstacle scenarios, the requirements on obstacle avoidance and path tracking of mobile agricultural machinery are higher. Therefore, in order to test the algorithm better, the path length and movement time of path tracking in scenario 2 are measured. Before the test starts, ten coordinate points are randomly generated as target end points. In order to increase the reliability of the experiment, the coordinate range is set beyond the obstacle area, so that the agricultural robot must pass through the obstacle area when tracking the path and the target path can never lie very close to the agricultural robot. After the end points are set, ten experiments are performed with each of the I-DQN and DQN algorithms in scenario 2, starting from the origin each time and using the ten randomly generated end points as the target paths. The results are sorted by the straight-line distance between the target path and the starting point, and the final moving results are shown in Table 2.
Comparing Tables 2 and 3, for the same target end point, the average path length and planning time of the I-DQN algorithm are shorter than those of the DQN algorithm, and the gap gradually increases as the target path distance increases. This indicates that the I-DQN algorithm tends to make the agricultural machinery robot learn to take a shorter path in a dynamic obstacle scene in less time, which improves the performance of path tracking. The average results of the ten I-DQN and DQN path tracking runs are shown in Table 4.
Based on the analysis of the above experimental results, the DQN algorithm realizes the autonomous path tracking of mobile agricultural machinery in an unknown environment. At the same time, whether in static or dynamic obstacle scenarios, the I-DQN algorithm has a faster convergence speed, allowing agricultural robots to learn to avoid obstacles and reach the target destination faster, and the stability and path tracking performance are improved.

Conclusion
With the rapid development of information technology and the realization of smart agriculture, digital agriculture has become an inevitable trend in agricultural development now and in the future. Against this background, this paper studies the automatic navigation control of agricultural machinery, adopts deep reinforcement learning theory, designs an autonomous path tracking control strategy for agricultural machinery, and conducts simulation experiments in two operating scenarios. The DQN and I-DQN algorithms are applied to the path tracking task, and a number of experiments were designed to verify and analyze the results. The analysis of the experimental results shows that the DQN algorithm realizes autonomous path tracking of mobile agricultural machinery in unknown environments. The I-DQN algorithm converges faster: whether in static or dynamic obstacle scenarios, it enables the agricultural machinery robot to learn to avoid obstacles and reach the destination faster, improving stability and path tracking performance. This research simplifies the motion model and, to a certain extent, does not fully reproduce the actual scene. It therefore has certain limitations for practical applications, but the ideas provided lay a theoretical foundation for subsequent research on practical application.

Data Availability
The dataset can be accessed from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.