Mobile Robot Application with Hierarchical Start Position DQN

Advances in deep learning have significantly affected reinforcement learning, resulting in the emergence of deep RL (DRL). DRL does not need a labeled data set and has the potential to exceed the performance of human experts, which has led to significant developments in the field of artificial intelligence. However, because a DRL agent must interact with the environment extensively during training, it is difficult to train directly in the real environment due to the long training time, high cost, and possible material damage. Therefore, most or all of the training of DRL agents for real-world applications is conducted in virtual environments. This study focused on the difficulty a mobile robot faces in reaching its target by making a path plan in a real-world environment. The Minimalistic Gridworld virtual environment was used for training the DRL agent and, to our knowledge, this is the first real-world implementation for this environment. A DRL algorithm with higher performance than the classical deep Q-network algorithm was created for the expanded environment. A mobile robot was designed for use in a real-world application. To match the virtual environment with the real environment, algorithms were created that can detect the positions of the mobile robot and the target, as well as the rotation of the mobile robot. As a result, a DRL-based mobile robot was developed that uses only the top view of the environment and can reach its target regardless of its initial position and rotation.


Introduction
Path planning is one of the important topics in robotics, since navigating a robot to a desired destination without collision is a significant task [1]. The significance of such a task is expected to increase even further [2,3] due to the evolving interaction between robots and human beings. This means that human beings will share more places with robots, giving rise to the importance of path planning. In the literature, conventional path planning techniques such as A-star, ant colony optimization, and Artificial Potential Field (APF) can be encountered [4], but these are not efficient at addressing real-time issues within complex environments [5]. To achieve a more robust solution, deep learning (DL) has been successfully applied to path planning [6]. However, DL needs labeled data in order to learn the environment, which is time-consuming to produce [7]. To overcome the latter problem, researchers have recently adopted the deep reinforcement learning (DRL) technique as an efficient solution.
Reinforcement learning (RL) tries to maximize numerical reward signals by focusing on the actions that must be taken in specific situations [8]. Instead of being guided to perform a specific task, the RL agent is asked to learn which actions are more rewarding [9]. The RL agent, interacting with the environment sufficiently, learns which action contributes more to the cumulative reward; in other words, it learns the best state-action pairs. However, exploring the environment becomes costly as the numbers of states and actions increase, due to the curse of dimensionality [10]. Therefore, it may become impossible to solve some real-world problems with high-dimensional action and state spaces using traditional reinforcement learning [11]. The remarkable progress of DL has also significantly affected RL, resulting in the DRL method, which is a combination of both [12]. Significant achievements have been reported for arcade games [13,14], board games [15,16], autonomous driving [17], robots interacting with humans [18], drones [19], unmanned surface vehicles [20], robot navigation problems [21], and robotic surgery [22], all of which are examples of complex systems with multiple states.
Several recent studies on path planning of mobile robots using the DRL method are briefly summarized below: (i) Zheng et al. [1] proposed a method that optimizes the path planning process of a mobile robot for an unmapped environment. In the proposed method, a memory module based on the special structure of long short-term memory (LSTM) was introduced. Thanks to this method, the long-term memory capacity of the robot was increased, and the number of trials and the calculation time required for training were reduced.
(ii) Gong et al. [5] suggested optimizing the deep deterministic policy gradient (DDPG) network with LSTM to solve the path planning problem of the mobile robot in environments with static obstacles. In this method, mixed noise along with a more reasonable reward function was used for quick training. The proposed algorithm and DDPG-based algorithms [23,24] were experimentally compared, and the advantages of the proposed algorithm in a complex environment were demonstrated in terms of exploration efficiency, optimal path, and time.
(iii) In Zhou et al. [7], an improved DQN algorithm was proposed for the path planning problem of patrolling robots. In this study, the reward-penalty functions were improved, and the sparse reward problem was resolved by optimizing the state-action space through the addition of new reward value points. The advantages of the proposed algorithm over the classical DQN algorithm were experimentally demonstrated. (iv) In Wang et al. [25], an improved DQN method was proposed, focusing on the problems of poor exploration and sparse rewards in mobile robot path planning. The reward function was improved by combining it with the artificial potential field method to optimize the state-action space, and the performances of the proposed method and classical DQN were compared. (v) In Xing et al. [26], the area-division Deep Q-Network (AD-DQN) method was proposed, with which a mobile wireless power-transfer robot was able to determine the optimal path for charging a large number of IoT devices. (vi) Huang et al. [27] proposed a method that determines two reward thresholds for solving the anomalous reward problem encountered in the path planning of a mobile robot in an unknown dynamic environment; the improvement over value-based DRL algorithms was experimentally demonstrated.
(vii) Yu et al. [28] proposed a mobile robot path planning method based on neural networks and hierarchical reinforcement learning; the performance of the proposed method was evaluated in an environment with static obstacles. (viii) Wang et al. [29] proposed a DDQN-based method with prioritized experience replay (PER) for mobile robot path planning. Simulation experiments showed that this method had better convergence and success rates than classical DQN in unknown environments. (ix) Zhang et al. […]. In contrast to the above-listed works, in this study, a DRL agent was trained in a discrete virtual environment with sparse rewards, and the real-world targeting problem of a mobile robot was addressed using these DRL network parameters. The contributions of this study can be listed as follows: (i) An improved DQN algorithm is proposed for large environments with sparse rewards, increasing the convergence rate and success rate relative to the classical DQN algorithm.
(ii) Network parameters of the agent trained in the MiniGrid virtual environment were used for the first time in a real environment.
(iii) A pixel-based image processing algorithm is proposed to detect the rotation of the mobile robot. (iv) The model trained in the virtual environment is used to provide more efficient operation in the real environment. (v) The proposed design allows the mobile robot to be used in swarm robotics and multi-agent systems, with the advantages of low cost and long-range controllability from a central computer.
The rest of the article is organized as follows. Section 2 introduces the virtual environment used in DRL agent training. In Section 3, the DQN algorithm with hierarchical start position is presented as a method that is more efficient, in terms of training time and success rate, than the conventional DQN algorithm. Section 4 describes the design and control of the mobile robot used in the real environment. Section 5 explains how to pair the discrete virtual environment with the real-world environment, and the mode of operation of the DRL-based mobile robot, which can reach its target regardless of its starting position and rotation. Section 6 explains the implementation stage and presents the performance analyses. Finally, the paper is concluded in Section 7.

Structure of Virtual Environment
OpenAI Gym is a toolkit designed for RL research. Providing a common interface, OpenAI Gym contains many different tasks and environments [33].
This study uses one of the OpenAI Gym platforms, the Minimalistic Gridworld Environment (MiniGrid).
The MiniGrid platform offers numerous environments with increasing difficulty levels [34]. Studies on natural language processing [35][36][37], reinforcement learning for environments with sparse reward signals [38][39][40][41][42][43][44][45], and hierarchical reinforcement learning in natural language processing [46] can be found for the MiniGrid environment, but these studies were limited to the virtual environment. The present study focuses on a real-environment application using the agent trained in the virtual environment. The environments shown in Figure 1 are used in the application. Both environments represent an empty room, and the agent's goal is to reach the green target plot in an environment that provides a sparse reward. The only difference between the two environments is the number of plots constituting the environment; therefore, this section, where the virtual environment analysis is performed, explains the structure of the MiniGrid-Empty-8 × 8-v0 environment.

States.
The MiniGrid-Empty-8 × 8-v0 environment consists of 8 × 8 plots. The state of the virtual environment can be obtained in three different ways. As shown in Figure 2, these include: (i) an RGB image composed of plots of 32 × 32 pixels each by default, (ii) a string representation in which each plot is encoded with two characters, and (iii) a 3-dimensional matrix in which each plot is encoded with numerical values.
In the application, the state of the virtual environment is obtained as a 3-dimensional matrix. As the number of parameters is much smaller than for the RGB image, and the parameters are numerical values, they are easy to prepare for use in an artificial neural network (ANN). Because the network becomes more complex as the number of inputs increases, the outer wall, whose state never changes, is removed to simplify the network. The state parameters of the 3-dimensional matrix structure are flattened as in Figure 3 so that they can be used in ANNs.
As shown in Figure 3, each plot has three parameters (object, color, direction).
These parameters are concatenated as in the figure, resulting in an array of 108 parameters representing the state of the environment. The corresponding array for the MiniGrid-Empty-16 × 16-v0 environment has 588 parameters.
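As a minimal sketch, assuming a NumPy-style (N, N, 3) encoding of the grid with (object, color, direction) per plot, the wall removal and flattening described above can be expressed as:

```python
import numpy as np

def flatten_state(grid):
    """Strip the static outer wall and flatten the (N, N, 3) encoding
    into a 1-D array of (N-2)^2 * 3 parameters for the ANN input."""
    interior = grid[1:-1, 1:-1, :]   # remove the outer wall plots
    return interior.reshape(-1).astype(np.float32)

# Example: 8 x 8 and 16 x 16 grids, three parameters per plot
state_8 = flatten_state(np.zeros((8, 8, 3)))
state_16 = flatten_state(np.zeros((16, 16, 3)))
print(len(state_8), len(state_16))   # 108 588
```

This reproduces the parameter counts stated above: (8 − 2)² × 3 = 108 and (16 − 2)² × 3 = 588.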

Actions.
Practically, the agent can perform 3 different movements, and a numerical value is defined for each movement: turn left (0), turn right (1), and move forward (2).

Reward Function.
In the application, the total reward the agent can receive in an episode ranges from 0 to 1. An episode ends when the agent either reaches its target or takes the maximum number of steps, which was set to 50 for this application. If the agent reaches its target, the reward function is as shown in the following equation: reward = 1 − 0.9 × (steps taken / maximum steps). According to this reward function, the reward earned decreases as the number of steps increases, so the agent should find the shortest path to maximize its reward. If the agent takes the maximum number of steps, the episode ends and the agent receives a reward of 0 because it has failed.
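A sketch of this reward scheme, assuming MiniGrid's standard end-of-episode formula in which the reward shrinks linearly with the step count:

```python
def episode_reward(steps_taken, max_steps=50, reached_goal=True):
    """Sparse reward paid only at episode end: 0 on failure, otherwise
    a value in (0, 1] that shrinks as the step count grows."""
    if not reached_goal or steps_taken > max_steps:
        return 0.0
    return 1.0 - 0.9 * (steps_taken / max_steps)

print(round(episode_reward(10), 2))                 # 0.82
print(episode_reward(50, reached_goal=False))       # 0.0
```

Because the reward depends only on the step count, the shortest path to the target is also the highest-reward path.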

DQN and HSP-DQN
With the advancement of DL, Markov decision processes can be solved with deep neural networks [47]. As one such method, DQN reaches human-level performance in some areas [13]. DQN is a DRL algorithm combining supervised learning and RL techniques [13]. The structure of DQN, which is frequently used in fields such as games [13], robotics [48], and natural language processing [49], is shown in Figure 4.

Computational Intelligence and Neuroscience
In the DQN algorithm, the agent makes random movements at the beginning of training and records its experiences about the environment in the experience replay memory. How much reward (r) an action (a) brings in a state (s), and the new state (s′) after the action, are recorded there. When the experiences obtained in this way reach a sufficient number, a mini-batch of experiences is randomly selected from the experience replay memory and used for training the Q-network, so the network is less affected by local minima. Also, in the DQN algorithm, the agent determines its action using an epsilon-greedy (ε) policy, where ε can take values between 0 and 1. A value of ε close to 1 indicates that the agent focuses on exploration, while a value close to 0 indicates that it focuses on taking the best action determined by its experience. For this reason, at the initial stage of training, ε is set close to 1 so that the agent explores the environment. As the episodes progress, ε is proportionally reduced, allowing the agent to choose the best action using its experience. Due to this structure, the DQN algorithm is very practical for discrete, small-sized environments.
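The experience replay and epsilon-greedy mechanics described above can be sketched as follows; the class and function names are illustrative, not taken from the paper:

```python
import random
from collections import deque

class ReplayMemory:
    """Experience replay: store (s, a, r, s', done) tuples and sample
    uncorrelated mini-batches for Q-network updates."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Random sampling breaks correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In training, ε starts near 1 and is decayed toward a small floor as episodes progress, shifting the agent from exploration to exploitation.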
In the application, the MiniGrid-Empty-8 × 8-v0 and MiniGrid-Empty-16 × 16-v0 environments are used as virtual environments. The agent was trained separately in these environments with the DQN algorithm and with our proposed HSP-DQN (Hierarchical Start Position DQN) algorithm, and the performance of each algorithm was examined. The network architecture in Figure 5 is used for both algorithms.
In the real environment application, the mobile robot is required to reach its target regardless of its starting position and direction. Therefore, in training with DQN, the position and direction of the agent are chosen randomly in each new episode.
However, in training with HSP-DQN, a region where the agent has a high probability of reaching the target with random movements is determined, and training starts in this region. In each new episode, the agent starts in a random position and direction, provided that it stays in the determined region. After the network is sufficiently trained for this region, a new training region is determined and the ε value is increased, allowing the agent to re-explore. There are two regions at this stage: the region being trained for the first time and the region where training has already been performed. Both regions are used during training: in each new episode, one of the two regions is chosen alternately and the agent starts in a random position and direction within it. In this way, the previous knowledge of the network is kept up to date while new regions are explored. After the network is sufficiently trained at this stage, the newly trained region is merged into the previously trained region, the ε value is increased again, and a new exploration region is determined.
Thus, the network is trained gradually, and these stages continue until the network learns the whole environment. Figure 6 shows the application of the HSP-DQN algorithm in the MiniGrid-Empty-8 × 8-v0 environment, and Figure 7 shows the DQN and HSP-DQN reward graphs for this environment. As the HSP-DQN algorithm focuses on exploration at each new region definition, sections where it fails or scores low are observed; however, the graphs show similar behavior after about 1500 episodes. Table 1 shows the performance data of DQN and HSP-DQN. Both algorithms obtained successful results at the end of 3000 episodes. Although the total number of steps of HSP-DQN is slightly lower than that of DQN, this does not provide an obvious advantage.
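A simplified sketch of the HSP-DQN start-state schedule described above; the helper names and the exact alternation rule are our own illustration of the idea, not the authors' implementation:

```python
import random

def hsp_start_state(regions, episode):
    """Hierarchical start position: 'regions' is a list of start-state
    regions, ordered from the area nearest the goal outwards. The newest
    region alternates with the union of previously trained regions, so
    old knowledge stays fresh while the new region is explored."""
    newest = regions[-1]
    older = [s for region in regions[:-1] for s in region]
    if not older or episode % 2 == 0:
        return random.choice(newest)
    return random.choice(older)

def expand_curriculum(regions, new_region, epsilon, boost=0.5):
    """When the current region is learned, add a new exploration region
    and raise epsilon so the agent re-explores (capped at 1.0)."""
    regions.append(new_region)
    return regions, min(1.0, epsilon + boost)
```

Each call to `expand_curriculum` corresponds to one "new region definition" in the text: the start area grows and ε jumps back up, producing the temporary dips visible in the reward graphs.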
As the MiniGrid-Empty-8 × 8-v0 environment area is not large, the DQN algorithm has been successful in learning the environment.
Algorithm performances are also compared for the MiniGrid-Empty-16 × 16-v0 environment, which has a larger area. Figure 8 shows the application of the HSP-DQN algorithm in the MiniGrid-Empty-16 × 16-v0 environment, and Figure 9 shows the DQN and HSP-DQN reward graphs for this environment. For the DQN algorithm, the number of successful episodes in the first 2000 episodes is quite low, and at the end of 8000 episodes it is observed that it still often fails. In the HSP-DQN algorithm, successful and unsuccessful episodes are both observed up to 6000 episodes, but in the following episodes the failures become sparse and the reward interval tends to narrow, that is, the agent discovers short paths. Table 2 shows the performance data of DQN and HSP-DQN. The successful-episode rate of HSP-DQN is considerably higher than that of DQN, and the total number of steps taken by DQN over 8000 episodes is approximately 84% greater than for HSP-DQN. In other words, DQN combines a low success rate with a much longer training period. Also, at the end of 8000 episodes, the epsilon-greedy value decreases to 0.018 in DQN, while it is 0.178 for HSP-DQN, so HSP-DQN is more likely to keep exploring shorter paths.

General Structure of the Robot.
Maneuverability is an important feature for mobile robots [50]. Omni-wheeled robots have high maneuverability due to their ability to move directly from one position to another without reorienting [51].
As shown in Figure 10, a mobile robot with high maneuverability was designed so that the agent trained in the virtual environment could be used in the real-world application. The robot consists of five layers. The 1st and 2nd layers are visual layers used for determining the robot's position and rotation; their purposes are explained in the following sections. The 3rd layer has two serially connected Li-ion batteries, each with a capacity of 2500 mAh and a voltage of 3.7 V, which provide the energy needs of the robot.
In the 4th layer, there is the robot control circuit that executes commands from the computer via the RF controller, as shown in Figure 11.
The RF control circuit and the robot control circuit are microcontroller-based circuits; both use Atmel's ATmega328P microcontroller. The robot control circuit is designed to control DC motors with encoders, control PWM speed, and give audible and visual warnings. For communication between the controller and the robot, the SX1278 transceiver module, which operates in the 433 MHz frequency band and communicates with Long Range (LoRa) modulation, is used. The 5th layer contains four DC motors with N20 encoders and the four 8-roller omni wheels of 50 mm diameter driven by these motors. Nevertheless, because the real environment is continuous, the mobile robot can take numerous different angular and positional states. For the mobile robot to use the information it receives from the discrete environment, the real environment should be likened to a discrete one; in other words, the mobile robot should be able to move in a way that brings it to a desired position. The mobile robot was therefore designed to make 6 different movements, as shown in Figure 12, to go to the desired location. It can make turns in the range of 0°-360° with an accuracy of approximately 3°, and linear movements with an accuracy of about 1 cm.

Movements and Calibration of the Mobile Robot.
Although omni-wheeled robots have high maneuverability, the floor beneath them can negatively affect their movements [52]. Although the designed mobile robot has motors with encoders, the encoder signals cannot guarantee the position of the robot: due to the floor under the robot and the standing positions of the omni wheels, undesirable situations such as idle rotation of the wheels during take-off may be encountered. To reduce these problems, the mobile robot calibration program shown in Figure 13 was prepared. Using this program, the number of encoder signals corresponding to 1 unit step of the robot (determined as 30 cm for this study) and to its rotational movements can be determined. To control the abovementioned features of the mobile robot, a data packet is designed as in Figure 14. This data packet is prepared according to the action the robot is to take and sent with RF signals, and the robot performs the necessary actions by decoding these signals. Table 3 shows the experimentally obtained encoder signal values used in the study.
For example, the data packet required for the mobile robot to move 1 unit forward and turn 90° left is shown in Figure 15.
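The exact packet layout is given in Figure 14 and is not reproduced here; the sketch below uses hypothetical fields (a 1-byte action code, a 1-byte unit-step count, and a 2-byte signed angle) only to illustrate the encode/decode idea behind such a command packet:

```python
import struct

# Hypothetical action codes; the real mapping is defined in Figure 14.
ACTIONS = {"forward": 0, "backward": 1, "left": 2, "right": 3,
           "turn_left": 4, "turn_right": 5}

def encode_packet(action, steps=0, angle=0):
    """Pack an action command into 4 bytes: code, step count, angle."""
    return struct.pack(">BBh", ACTIONS[action], steps, angle)

def decode_packet(packet):
    """Inverse of encode_packet, as run on the robot's microcontroller."""
    code, steps, angle = struct.unpack(">BBh", packet)
    name = next(k for k, v in ACTIONS.items() if v == code)
    return name, steps, angle

# 1 unit forward, then a 90-degree left turn, as two commands:
print(decode_packet(encode_packet("forward", steps=1)))   # ('forward', 1, 0)
print(decode_packet(encode_packet("turn_left", angle=90)))
```

On the real hardware, the encoded bytes would be transmitted over the SX1278 LoRa link and decoded on the ATmega328P side.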

Real Environment Application with HSP-DQN
This section shows that the HSP-DQN network trained in a discrete empty environment can also be used in a real environment. Here, the HSP-DQN network was trained in a discrete environment, and the application was performed by likening the real environment to a discrete structure. Figure 16 shows the real environment that will be likened to a discrete structure. The environment data are obtained from an IP camera that captures the top view of the environment, while the mobile robot receives action commands from the computer via RF signals.
The real environment is likened to a discrete structure in stages: (i) detection of the mobile robot and the target with convolutional neural networks, (ii) plotting the real environment as in the virtual environment, (iii) correction of the position and angle of the mobile robot to fit the discrete environment, and (iv) matching the real environment with the virtual environment.
At the end of these stages, the mobile robot is able to use the HSP-DQN network trained in the virtual environment. Explanations about the stages are given in this section.
Today, CNN models developed for object detection have reached a prominent position due to their success in classification. However, model sizes are also increasing, which limits their use in real-world applications where latency and resource usage are important, such as autonomous robots. Therefore, model efficiency is becoming increasingly important in real-life applications [60].
In our study, the EfficientDet D0 CNN model was used for object detection. This model, which uses the EfficientNet model as its backbone network, achieves good performance in image classification by scaling the network width, depth, and input resolution together. As the number of parameters of the model is optimized, it is also advantageous in terms of system resource usage [60][61][62].
The CNN model was trained to detect the agent and target objects shown in Figure 17. 117 and 32 images, in different resolutions and light conditions, were used for training and testing, respectively. The total-loss graph of the model trained for 30,000 epochs is shown in Figure 18; the graph shows that the model was largely trained after 6,000 epochs.
Using the CNN model, the presence of the agent and the target in the environment, their positions, and the centers of the objects in the image are determined. In addition, pixel distances in the environment can be converted to real distances by taking the agent image as a reference.

Plotting of Real Environment.
The environment in which the HSP-DQN network is trained is discrete. For the mobile robot to use the trained HSP-DQN network, the real environment should be likened to the discrete environment.
The steps of the simulation process are described below. Stage-1: Real environment data are obtained using only the top view of the environment. Using the CNN on this image, the positions of the agent and the target are determined as in Figure 19. Stage-2: The centers of the target (Goal_x, Goal_y) and the agent (Agent_x, Agent_y) are calculated for use in the following steps. Stage-3: The centers of the plots in the real environment are determined using equation (3), based on the width (Image_w) and height (Image_h) of the environment image. Since the plotting is anchored on the target rather than the agent, a list of the plot centers (G_centers) is created by taking the center of the target as a reference, and the real environment is plotted.
Stage-4: The location of the agent may not coincide with the center of any plot, because the center of the target is used as the reference in the plotting process. For this reason, the mobile robot should be positioned according to the nearest plot. Using equation (4), the distances of the mobile robot to the plot centers (Center_d) are listed.
The element of the G_centers list whose Center_d value is the minimum indicates the plot center closest to the mobile robot; this plot is also the initial position of the mobile robot. The plotted real environment is shown in Figure 20. After the real environment is plotted, the location where the center of the robot should initially be is determined.
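The plotting stages above can be sketched as follows; the function names are illustrative, and the plot size in pixels is assumed to have been derived from the agent image as described earlier:

```python
import math

def plot_centers(goal_center, image_w, image_h, plot_size):
    """Build the list G_centers of grid-plot centers, anchored on the
    target's center so the goal always sits exactly on a plot center."""
    gx, gy = goal_center
    centers = []
    x = gx % plot_size          # first center column inside the image
    while x < image_w:
        y = gy % plot_size      # first center row inside the image
        while y < image_h:
            centers.append((x, y))
            y += plot_size
        x += plot_size
    return centers

def nearest_plot(agent_center, centers):
    """Center_d: Euclidean distance from the agent to each plot center;
    the minimum gives the agent's initial plot."""
    ax, ay = agent_center
    return min(centers, key=lambda c: math.hypot(ax - c[0], ay - c[1]))

centers = plot_centers(goal_center=(50, 50), image_w=400, image_h=400,
                       plot_size=100)
print(nearest_plot((160, 140), centers))   # (150, 150)
```

Because the grid is anchored on the target, the agent generally falls between centers, which is exactly why the nearest-plot step (Stage-4) is needed before the robot can be snapped onto the discrete grid.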

Determination of the Rotation of the Mobile Robot.
CNNs have become the main method in the field of image recognition thanks to their strong feature extraction ability. Most CNNs do not deal with the angle of the image and just focus on image recognition. However, angle data are also crucial in robot applications [63].
In the present study, the agent was trained with HSP-DQN in a discrete virtual environment, where only four directions exist. In the real environment, however, the mobile robot can take any angular position. For the mobile robot to use the information from the virtual environment, it should adjust its direction to the most suitable of the four virtual directions; for this reason, a structure that can detect the angle of the mobile robot is required. The image used by the CNN for detecting the mobile robot is also used in the algorithm created to detect the robot's angle. The angle and direction of the isosceles triangle on the mobile robot are the same as those of the robot itself, so the purpose of the algorithm is to determine the direction and angle of the triangle in the image. Figure 21 shows the stages of the algorithm.
Stage-1: A top image of the environment containing the mobile robot and the target is taken. Stage-2: The coordinates of the agent are determined by the CNN, and this part is cropped from the image. Just below the agent image there is a white layer that visually covers the area under the robot; thus, while the algorithm is running, it is not affected by the colors on the ground, and the mobile robot can use the algorithm on a floor of any color. Stage-3: A bilateral filter is applied to reduce noise and sharpen edges in the image [64]. Stage-4: As the mobile robot image is circular and the frame determined by the CNN is roughly square, the center of the frame always lies inside the triangle shape. The R, G, and B values of the pixel at the center of the image and of its eight neighbors are listed separately, and each list is sorted in ascending order. The reference RGB values of the triangle image are obtained by taking the 5th element (the median) of each of the R, G, and B lists; this aims to establish a structure that works under different lighting conditions. A two-color image is then obtained by assigning red to pixels close to the reference RGB values and white to all other pixels. Stage-5: The pixels where red is first detected are determined by scanning the image separately from the right, left, bottom, and top, yielding four points. Stage-6: The two closest of the four detected points belong to the same corner, and only one of them can be used. To make the selection, the distances of these two points to the other points are calculated separately, and the point farther from the others is selected as the 3rd corner point, since it lies closer to the actual corner tip. Stage-7: The distances between the corner points are calculated.
As the two equal sides of the isosceles triangle are longer than the base, the side whose two vertices are closest together is the base, and the remaining vertex is the apex of the triangle. By drawing a line from the apex to the midpoint of the base, a right triangle is formed as in the figure, and the angle of this line, hence the angle of the mobile robot, is determined by the arctangent.
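Stages 6 and 7 and the arctangent step can be sketched as follows, assuming the three corner points of the marker triangle have already been extracted from the binarized image:

```python
import math
from itertools import combinations

def heading_from_triangle(corners):
    """Given the three corner points of the isosceles marker triangle,
    the two closest points define the base (the equal sides are longer
    than the base), and the remaining point is the apex. The heading is
    the angle of the vector from the base midpoint to the apex."""
    p1, p2 = min(combinations(corners, 2),
                 key=lambda pair: math.dist(pair[0], pair[1]))
    apex = next(p for p in corners if p not in (p1, p2))
    mx, my = (p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2
    # atan2 of the midpoint->apex vector gives the marker's angle
    angle = math.degrees(math.atan2(apex[1] - my, apex[0] - mx))
    return angle % 360

# Apex pointing along +x: heading 0 degrees
print(heading_from_triangle([(0, 10), (0, -10), (20, 0)]))   # 0.0
```

In the real pipeline the result would be compared against the four discrete directions to pick the smallest corrective turn.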
If the angle found is 0°, the position of the mobile robot in the real and virtual environments can be matched directly. If the angle is different from 0°, the direction of the mobile robot lies in one of the four regions shown in Figure 22. The direction of the triangle determines which region the mobile robot is in, and the direction to which the mobile robot will turn is determined by comparing which of the four directions it is angularly closest to.
Using the algorithm, it is determined which region the mobile robot is in, how much the robot should turn to the right or left to match the most suitable direction in the virtual environment, and in which position the mobile robot will be after the rotation is corrected. The mobile robot autonomously adjusts its angle using the values determined in this stage and then moves to the initial position determined in the previous stage with linear movements (right, left, forward, backward). After the mobile robot corrects its angle and position, both the robot and the target are located in the centers of the plots they occupy, as shown in Figure 23.
e mobile robot is also positioned in one of the four directions it can take, as in the virtual environment.
As the mobile robot and the target are located only in the centers of the plots and there are four directions that the mobile robot can take, as shown in Figure 24, the real environment and the discrete virtual environment can be matched.

Matching the Real Environment with the Virtual Environment and Determining the Path Plan.
To use the HSP-DQN network trained in the virtual environment in the real environment, it is necessary to match the position of the mobile robot with that of the virtual agent, and the position of the real target with that of the virtual target. In the virtual environment, the agent can start in any position and direction, and the target is located in the lower right corner, as in Figure 25.
However, in the real environment, the mobile robot can be in any direction and position around the target; in other words, there are many situations in which the real and virtual environments cannot match directly. To solve this problem, the real environment is divided into four regions, as in Figure 26, according to the position of the mobile robot relative to the target, and the virtual environment is rotated so that it can match the real environment. To match the rotated virtual environment with the real environment, the initial direction and position of the virtual agent are converted according to the region, as in Figure 27.
As a result of these conversions, all states in the real environment can be matched with the virtual environment, and the trained HSP-DQN network can be used for the real environment, as in Figure 28. The path plan is determined in the virtual environment before the mobile robot takes action: with the trained HSP-DQN, an action is generated for the current state of the agent and applied in the virtual environment, moving the agent to its new position. The HSP-DQN continues to generate actions until the agent reaches its target, and these actions are listed for the mobile robot to use. In this way, the path plan of the mobile robot is obtained. After the path plan is drawn up, the actions to be taken are sent to the mobile robot in order. After each action, the angle and position of the mobile robot are checked and deviations are corrected. Whether the mobile robot has completed its current action is determined by the background-difference method, and the next action is sent only after the current one is completed, ensuring that the mobile robot reaches the target.
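The path-plan rollout described above can be sketched as below; `ToyEnv` is a minimal stand-in for the MiniGrid environment, and the greedy policy stands in for the trained HSP-DQN network:

```python
class ToyEnv:
    """Minimal stand-in for the virtual environment: the agent walks
    along a line until it reaches the goal cell."""
    def __init__(self, start=0, goal=3):
        self.pos, self.goal = start, goal

    def reset(self):
        return self.pos

    def step(self, action):                      # action 0 = move forward
        self.pos += 1
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done

def build_path_plan(policy, env, max_steps=50):
    """Roll out the trained network greedily in the virtual environment,
    recording the action list the mobile robot will later execute one
    command at a time over RF."""
    state = env.reset()
    plan = []
    for _ in range(max_steps):
        action = policy(state)                   # greedy, no exploration
        state, reward, done = env.step(action)
        plan.append(action)
        if done:
            break
    return plan

print(build_path_plan(lambda s: 0, ToyEnv()))    # [0, 0, 0]
```

Separating planning (in the virtual environment) from execution (on the robot) is what allows each real action to be verified and corrected before the next one is sent.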

Discussion and Future Work
In the previous sections of the paper, the proposed algorithms were discussed. In this section, the performance of the system is examined. Although the DQN algorithm is successful in the MiniGrid-Empty-8x8-v0 environment, it cannot achieve similar success in the MiniGrid-Empty-16x16-v0 environment, because the agent receives a reward only when it arrives at the destination in MiniGrid-Empty environments. If the environment has a large area, such as MiniGrid-Empty-16x16-v0, and the rewards are sparse, it is difficult for the DQN algorithm to generate enough meaningful data with random movements to train the network, and the convergence speed becomes very slow. In the proposed method, the agent initially starts in the region where the probability of reaching the reward is high, which increases the probability of the network obtaining meaningful training data. Then, as the network learns the paths in the environment, the area in which the agent can start is gradually expanded. Thus, the convergence problem of the network in large areas with sparse rewards is addressed. Performance analyses are also performed for MiniGrid-Empty-16x16, a larger environment than MiniGrid-Empty-8x8.
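The gradual expansion of the start region can be sketched as a curriculum over start states. The linear expansion schedule and the Chebyshev distance metric below are illustrative assumptions; the paper's actual schedule is given in Table 7.

```python
import random

def sample_start_state(episode, all_starts, target, total_episodes=8000):
    """Sample a start state from a region around the target that grows
    with training progress.  Early episodes start close to the target
    (so rewards are reached often); later episodes may start anywhere.
    The growth rate and distance metric are assumptions."""
    progress = min(1.0, episode / total_episodes)
    max_dist = max(1, int(progress * 14))  # 14 spans a 16x16 grid's interior
    candidates = [s for s in all_starts
                  if max(abs(s[0] - target[0]), abs(s[1] - target[1])) <= max_dist]
    return random.choice(candidates)
```

With this sampling, the fraction of episodes that reach the sparse reward stays high throughout training, which is the mechanism behind the faster convergence reported above.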
In the experimental study, a computer with the Windows 10 Pro 64-bit operating system was used. The hardware specifications are provided in Table 4.
The neural networks used in the DQN and HSP-DQN algorithms have the same structure. Information about the structure of the neural network is shown in Table 5, and the hyperparameters of the algorithms are shown in Table 6.
In the DQN algorithm, the number of different states in which the agent can start is 780. While the neural network is being trained, one of these 780 states is randomly selected as the initial state at the start of each new episode. In the proposed algorithm, as training progresses, the area in which an episode can begin is gradually enlarged. Thus, the number of different states in which the agent can start increases, as shown in Table 7.
In both algorithms, the performance of the models was examined after training for 8000 episodes under the above conditions. The time each algorithm spent training the neural network is shown in Table 8. The training of the proposed algorithm was completed in 45.78% less time than that of the DQN algorithm. Although the agent moved more in the environment under the DQN algorithm, the time spent per episode increased because it reached the target less often than with the proposed algorithm.
To compare the success of the trained models, both were tested in the environment under all 780 different initial conditions. The results are presented in Table 9. From the table, it can be seen that the success rate of the proposed algorithm is 41.28% higher than that of the DQN algorithm. The success of the algorithms is also shown graphically in Figure 29(a).
Moreover, the lengths of the paths found by the algorithms were examined. The models trained with DQN and with the proposed algorithm were started under the same initial conditions, and the obtained path lengths were compared, as shown in Figure 29(b). The proposed algorithm finds a shorter path than the DQN algorithm for 70.1% of the different initial conditions.
As a result of the tests, the success rate and convergence speed of the proposed algorithm were found to be better than those of the DQN algorithm for large environments with sparse rewards. In addition, the success rates of some related DRL-based algorithms are compared with our study in Table 10.
In the real environment application of the study, only the top view of the environment was used as the system input.
Since the application was required to run in real time, this image had to be used efficiently. The object detection capability, speed, and model size of the detection model were therefore very important, and the EfficientDet-D0 model was selected for object detection based on these criteria. The performance analyses of the EfficientDet model presented in the original article are shown in Figure 30. The model is suitable for mobile robot applications with limited resources in terms of model size, speed, and object detection performance.
The performance of the EfficientDet model in our application was also tested. The model was trained as specified in Section 5.1. The trained model detected objects in the image taken from an IP camera mounted at a height of 180 cm above the ground. The captured image had a resolution of 800 × 600 pixels and covered an area of 2.1 × 1.6 m². In the performed tests, the model successfully detected the mobile robot and target objects, provided that the ambient lighting was appropriate. The object recognition process was completed within 820-920 ms. As mentioned earlier, there are 780 different initial conditions in the MiniGrid-Empty-16x16 virtual environment. With the structure described in Section 5.4, the target can be in any position in the environment rather than only a fixed one; in other words, the target can be placed in 196 different positions at the start. Thus, the trained network handles 780 × 196 = 152,880 different initial conditions. The proposed regional matching process takes less than 1 ms to complete. The real-environment processes are performed in separate steps; Table 11 provides the processing times for those steps. The designed system can detect the state of the environment within 1-1.5 seconds, allowing the robot to perform a given task.
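The mapping from detected pixel coordinates to virtual grid cells can be sketched as below. The 800 × 600 resolution is from the paper; the alignment of the camera view to a 16 × 16 grid via simple linear scaling is an assumption for illustration.

```python
def pixel_to_cell(px, py, img_w=800, img_h=600, grid=16):
    """Map a detected object's pixel centre to a cell of the 16x16
    virtual grid by linear scaling (assumes the camera view covers
    the whole grid area without lens distortion)."""
    col = min(grid - 1, int(px * grid / img_w))
    row = min(grid - 1, int(py * grid / img_h))
    return row, col
```

The detected robot and target centres are converted to grid cells this way, and those cells define the agent and goal positions in the virtual environment.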
A central computer is used in the designed system to perform the semantic data extraction, determine the motion of the mobile robot, and transmit the determined motion. The mobile robot acts as an actuator, executing the commands from the central computer. As the mobile robot carries no computational load, a low-cost robot was designed with the parts listed in Table 12.
LoRa-based modulation is used for the wireless communication between the computer and the mobile robot.
LoRa has advantages such as wide coverage, low power consumption, low cost, and no license requirement [72]. However, it has a data transfer rate of 290 kbps, which is slower than many wireless communication methods [73]. Despite this disadvantage, LoRa is still an ideal communication method for the application reported in this work, since only small data packets are transmitted to move the robot.
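Since only small movement commands are sent over the link, each action can be framed as a tiny packet. The packet layout below (start marker, action code, XOR checksum) and the action numbering are illustrative assumptions, not the paper's actual protocol.

```python
def encode_command(action):
    """Encode a movement action (assumed 0=left, 1=right, 2=forward)
    as a 3-byte packet: start marker, action code, XOR checksum."""
    if action not in (0, 1, 2):
        raise ValueError("unknown action")
    start = 0xAA
    checksum = start ^ action
    return bytes([start, action, checksum])

def decode_command(packet):
    """Validate and decode a received 3-byte packet on the robot side."""
    start, action, checksum = packet
    if start != 0xAA or checksum != (start ^ action):
        raise ValueError("corrupt packet")
    return action
```

Even at LoRa's modest data rate, a 3-byte packet per movement keeps the round-trip latency negligible compared with the robot's physical motion time.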
In the real-environment experiments, the DRL-based mobile robot was launched with different initial conditions in each of the four regions specified in Section 5.4. The mobile robot was able to successfully reach the target in all experiments (please see: https://youtu.be/bEAmJF6lD4g). The following were also observed in the experiments: (i) The omni-wheeled structure of the mobile robot has advantages in terms of maneuvering. However, it is prone to disturbances from the floor on which the robot operates. Although this does not prevent the robot from reaching its target, it increases the time needed to reach it.
(ii) Thanks to the adaptive structure of the system, the working area can be changed by adjusting the height of the camera above the ground. However, excessively enlarging the area covered by the camera, or improper lighting of the environment, may adversely affect the detection of objects.
This work forms a good basis for future work, as it demonstrates the possibility of controlling many robots over a long range from a single center. The designed robot has the potential to be used in swarm robotics and multi-agent systems, as it is cost-effective and controllable over a long range; thus, this work contributes to the important research topic of multi-robot applications [74][75][76]. The latter is important for accelerating global policy training, as the use of large numbers of robots can provide greater data diversity [77]. In the future, real-world applications of DRL algorithms can be developed that include sequential tasks, enabling the agent to reach its target optimally in environments with static and dynamic obstacles, or that control more than one mobile robot.

Conclusion
In this study, the HSP-DQN algorithm, which has a higher convergence rate and success rate than the classical DQN algorithm, was proposed for environments with sparse rewards. With the proposed algorithm, a real-environment application was performed for the first time using the network parameters of an agent trained in the Minimalistic Gridworld virtual environment. A low-cost mobile robot that can be controlled over a long range by a central computer was designed for this purpose. In addition, a system that efficiently matches the discrete virtual environment with the real environment was designed.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.