FPGA Implementation of A∗ Algorithm for Real-Time Path Planning

The traditional A∗ algorithm is time-consuming because it needs a large number of iteration operations to calculate the evaluation function and sort the OPEN list. To achieve real-time path-planning performance, a hardware accelerator architecture called the A∗ accelerator has been designed and implemented in a field programmable gate array (FPGA). A specially designed 8-port cache and an OPEN list array are introduced to tackle the calculation bottleneck. The system-on-a-chip (SOC) design is implemented in a Xilinx Kintex-7 FPGA to evaluate the A∗ accelerator. Experiments show that the hardware accelerator achieves a 37–75 times performance enhancement relative to a software implementation. It is suitable for real-time path-planning applications.


Introduction
Path planning on grid maps is still an important problem in many modern domains, such as robotics [1], vessel navigation [2], and commercial computer games [3]. In some applications, the path-planning algorithm needs to run in real time. For example, mobile robots and real-time strategy (RTS) games operate on highly dynamic maps where obstacles and roads can change suddenly. In such cases, the maps cannot be loaded in advance to generate initial paths. Moreover, the paths generally must be solved in milliseconds because of the large number of path planning or replanning requests. Therefore, real-time path planning is needed.
The A-star (or A∗) search algorithm [4] is one of the most widely used heuristic path-planning algorithms on grid maps. It generates the global optimal paths dynamically and can theoretically guarantee convergence to the global optimal solution [5]. This characteristic makes it suitable for dynamically changing maps, as in real-time path planning for robotics or RTS games.
However, the software A∗ algorithm does not run in real time because of its large number of iteration operations. Many previous works attempted to overcome this by lowering the number of cells to be expanded [6], but the performance still falls short of real time. For example, Yao et al. [7] reduced the searching steps from 200 to 80 (to 40%), but the searching time only fell from 4.359 s to 2.823 s (to 65%). This paper introduces a hardware framework to accelerate the A∗ algorithm by parallelizing the iteration operations. Scientific code can benefit from executing on accelerators such as field programmable gate arrays (FPGAs) [8].
The calculation bottleneck lies mainly in two parts: calculating and sorting the evaluation value of each node. The evaluation value is used to determine the next searching steps. To make them parallel, a scalable array-based architecture that contains eight parallel processing lines was designed. Each processing line is responsible for one searching direction.
The architecture was implemented in a Xilinx Kintex-7 FPGA and compared to the software algorithm. FPGAs offer high flexibility compared to Application-Specific Integrated Circuits (ASICs) when implementing algorithms with a high degree of parallelism [9,10]. Results show that a 37-75 times performance enhancement can be achieved with the accelerator's clock frequency at 100 MHz.
This research makes the following contributions: (1) An exploration of how to accelerate the A∗ algorithm by parallelizing the iteration operations. (2) The architecture design of the hardware accelerator for the A∗ algorithm. An efficient 8-port cache design loads the node data of 8 directions in parallel, with the most suitable parameters decided by design space exploration. An array-based OPEN list architecture is proposed to achieve sorting while data is flowing in the array. (3) An evaluation of the system-on-a-chip (SOC) design on an FPGA circuit board. Experimental results show that parallelizing the iteration operations of the A∗ algorithm can achieve a massive performance enhancement, and the hardware design is suitable for applications with real-time performance requirements.
The rest of this paper is organized as follows. Related work is discussed in Section 2. The A∗ algorithm is analyzed in Section 3 to show the performance bottleneck. Then the hardware accelerator is introduced in Section 4 to tackle the bottleneck. Section 5 introduces the system design, and experiments to analyze this work are devised in Section 6. Finally, concluding remarks are drawn in Section 7.

Heuristic Path-Planning Algorithm.
The goal of path planning is to find the most direct and shortest path from the starting point to the target point according to the terrain and obstacles in the map. Global path-planning algorithms have been applied frequently and widely because of their advantages in computation time and avoidance of local optima [11]. The most well-known algorithm of this type is Dijkstra's algorithm [12]. It finds the shortest path by traveling from the start cell to the neighboring cells and calculating the path's cost. Then it chooses the lowest cost cell and travels again until the target cell is reached. The defect of Dijkstra's algorithm is that nearly all the cells are expanded before the shortest path is found. The A∗ algorithm [4] improved this by adding a heuristic value to the path's cost function when choosing the next cell. The heuristic value can lead the search path towards the goal. Then the Focused Dynamic A∗ (D∗) algorithm [13] and the D∗ lite algorithm [14] were proposed to cope with dynamic changes in the graph used for planning. They have been used for path planning on a large assortment of robotic systems [15][16][17]. Anytime Dynamic A∗ (AD∗) [18] uses an inflation factor to reach a suboptimal solution quickly, meeting real-time requirements. Liu et al. [2] further improved the A∗ algorithm for more complicated environments.
To make the A∗ algorithm converge more efficiently, Szczerba et al. [19] proposed a sparse A∗ search (SAS). This algorithm accurately and efficiently "prunes" the search space according to the constraints, which lowers the number of cells to be expanded. Block A∗ [20] is a database-driven search algorithm. It loads the map in advance and calculates the Local Distance Database (LDDB), which contains distances between boundary points of a local neighborhood. Then the search process is based on blocks. This method effectively lowers the number of iteration operations and achieves about 4x performance enhancement compared to the A∗ algorithm, but it needs the map to be loaded in advance. Yao et al. [7] proposed weighting the evaluation function to reduce the number of cells to be expanded. They reduced the searching steps from 200 to 80 (to 40%), but the searching time only fell from 4.359 s to 2.823 s (to 65%).
Those previous works can be summarized as improving the performance of A∗ by lowering the number of iteration operations or expanded nodes, but the performance enhancement is limited. In this paper, we take another approach: parallelizing the iterative processing of expanding the cells with a hardware accelerator in an FPGA.

FPGA Implementation of Path-Planning Algorithms.
One of the most popular path-planning algorithms implemented in FPGA is the genetic algorithm (GA) [21]. The GA method is based on Darwin's theory of evolution, where crossovers and mutations can generate better populations. However, the evolution process needs numerous iteration operations. Continuous research activity during the past 20 years has proved the effectiveness of hardware acceleration by parallelism. For example, Allaire et al. [22] accelerate the genetic operators on FPGA and achieve up to 50,000x performance enhancement in the population update operation. Hachour [23] shows the FPGA implementation of GA path planning for autonomous mobile robots. dos Santos et al. [24] achieve parallelism with an array-based architecture and obtain an obvious performance enhancement.
Many researchers have also investigated the FPGA implementation of heuristic path-planning algorithms. Fernandez et al. [25] proposed a parallel architecture for the implementation of Dijkstra's algorithm. The node processor architecture was introduced to achieve parallelism in the iteration process. For a 256 graph, the computation takes only 42 microseconds, which shows that an FPGA implementation can achieve real-time performance. Jagadeesh et al. [26] also implemented Dijkstra's algorithm on FPGA and achieved a 2.2 times performance enhancement compared with a CPU. Idris [27] proposed a hardware accelerator architecture for the A∗ algorithm but did not provide simulation results. Nery et al. [28] provided a coprocessor design based on the Xilinx High-Level Synthesis (HLS) compiler and achieved a 2.16x speedup for the A∗ algorithm. However, they did not tackle the bottleneck of sorting the OPEN list.

Algorithm
In order to achieve parallelism, the performance bottleneck of the A∗ algorithm needs to be analyzed. The traditional A∗ algorithm is introduced first, and then its performance bottlenecks are analyzed for hardware implementation. The traditional A∗ algorithm was first proposed in [4] for determining the minimum cost path through a graph. It is also suitable for finding the minimum cost path from the start node to the destination node on grid maps. Grid maps are a standard simplified model of real maps, commonly used in mobile robotics [20]. Grid maps are made up of square nodes, and each node stands for a step when moving in the map. An example is shown in Figure 1.
In this paper, we assume that a specific node on the grid map is only allowed to reach one of its eight neighbors. That is, the angle of the path is confined to a 45- or 90-degree turn. Previous works have also studied paths with any-angle turns [29,30], but that is not the focus of this paper. The flow of the traditional A-star algorithm is shown in Algorithm 1.
The open_list (referred to as the OPEN list in this manuscript) in the algorithm is an array that contains the nodes to be calculated. The goal is to choose the next step that has the minimum cost value from the current node to the destination.
This process is called expanding nodes. The evaluation value f is the heuristic value used to estimate the distance. It is calculated as

$f(n) = g(n) + h(n)$, (1)

where $g(n)$ is the actual cost from the start node $N(x_s, y_s)$ to the current node $n$ and $h(n)$ is the heuristic function that estimates the cost from the current node $n$ to the destination $N(x_g, y_g)$. The heuristic function uses the Euclidean distance.

3.2. Algorithm Analysis.
On a grid map, an individual cell is able to move to one of its eight closest neighboring nodes (successors). From Algorithm 1, it can be seen that the iteration operations focus on calculating the successors' evaluation function, a process called expanding nodes.
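As a point of reference for the expansion loop described above, the following is a minimal software sketch of A∗ on an 8-connected grid. It is not the authors' Algorithm 1 or their C implementation: the function names, the binary-heap OPEN list (the paper describes an array), and the step costs of 1 for straight moves and √2 for diagonal moves are assumptions made for illustration.

```python
import heapq
import math

# The 8 search directions (dx, dy): 4 straight and 4 diagonal moves.
DIRECTIONS = [(-1, -1), (0, -1), (1, -1), (-1, 0),
              (1, 0), (-1, 1), (0, 1), (1, 1)]


def heuristic(node, goal):
    """Euclidean distance h(n) from node to goal."""
    return math.hypot(node[0] - goal[0], node[1] - goal[1])


def a_star(grid, start, goal):
    """Return the path from start to goal as a list of (x, y), or None.

    grid[y][x] is True for an obstacle. Step costs (1 and sqrt(2)) are
    an assumption of this sketch, not taken from the paper.
    """
    height, width = len(grid), len(grid[0])
    g_cost = {start: 0.0}
    parent = {start: None}
    open_list = [(heuristic(start, goal), start)]  # entries are (f, node)
    closed = set()
    while open_list:
        _, current = heapq.heappop(open_list)  # node with the lowest f
        if current in closed:
            continue  # stale duplicate left in the heap
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        closed.add(current)
        # Expanding nodes: evaluate f = g + h for up to 8 successors.
        for dx, dy in DIRECTIONS:
            x, y = current[0] + dx, current[1] + dy
            if not (0 <= x < width and 0 <= y < height) or grid[y][x]:
                continue
            succ = (x, y)
            if succ in closed:
                continue
            new_g = g_cost[current] + math.hypot(dx, dy)
            if succ not in g_cost or new_g < g_cost[succ]:
                g_cost[succ] = new_g
                parent[succ] = current
                f = new_g + heuristic(succ, goal)
                heapq.heappush(open_list, (f, succ))
    return None
```

In this sketch the inner loop over `DIRECTIONS` and the push into `open_list` correspond to the two kinds of work measured in the following experiment: the evaluation calculations and the OPEN list operations.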
In order to analyze the bottleneck in the process of expanding nodes, we performed experiments to monitor the execution time. The software algorithm was written in C and executed in a single thread. After that, the software algorithm was optimized to run in 8 parallel threads. The compiler is MSVC on the Windows platform; a more detailed description is given in Section 6.3. A 256 × 256 grid map with 10% randomly placed obstacles is used for the experiments. The software results were obtained on an Intel(R) Core(TM) i5-3337U @ 1.80 GHz with 4 GB memory. We ran the software code 10,000 times and averaged the execution time. The experimental results are shown in Figure 2. The process was divided into two parts. The "OPEN list" process included inserting into, sorting, and deleting from the OPEN list, and the "calculation" process included the other calculations of expanding nodes.
When running in a single thread, the average execution time is 330 ms. The OPEN list operations consume 95% of the total time. After distributing the software process across 8 threads, the total execution time is 136 ms, a 2.4x speedup. However, the OPEN list operations are still the most time-consuming part.
To tackle the bottleneck of the OPEN list operations (inserting, sorting, and deleting nodes), the hardware architecture is designed to have 8 parallel processing lines. Each line is responsible for one direction. The OPEN list is an array-based architecture that contains 8 parallel sequence queues, called the OPEN list array. The OPEN list array sorts the data in parallel.

Design of the A∗ Accelerator
The hardware architecture is designed to tackle the bottleneck problems of the A∗ algorithm discussed in Section 3.2. Although it is targeted for FPGA implementation in this manuscript, it is also suitable for ASIC chip implementation.

Data Structure of the Nodes.
The information of a node includes its parent's coordinates, the actual cost value g, the evaluation value f, and information about whether this node is an obstacle or in the OPEN list. Since the heuristic function value h is calculated for each node, it does not need to be stored along with the nodes. The grid map in this manuscript is confined to sizes up to 256 × 256. The extension to bigger maps will be discussed in future work.
Under this constraint, the cost value and the evaluation value are designed as signed integers. The data structures are summarized in Table 1. The 1-bit information is combined with the cost value to form a 32-bit integer value. The total size of a node's data structure is then 10 bytes.
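One way to picture a 10-byte node record is the packing sketch below. The field order, the byte order, and the positions of the two flag bits are hypothetical, since Table 1 is not reproduced here; only the field list, the 32-bit widths, and the 10-byte total come from the text.

```python
import struct

# Hypothetical layout (little-endian, no padding):
#   parent_x, parent_y : one unsigned byte each (maps up to 256 x 256)
#   cost word          : signed 32-bit, g combined with two 1-bit flags
#   f                  : signed 32-bit evaluation value
NODE_FORMAT = "<BBii"

OBSTACLE_BIT = 1 << 30       # assumed flag position
IN_OPEN_BIT = 1 << 29        # assumed flag position
COST_MASK = IN_OPEN_BIT - 1  # low 29 bits hold the non-negative cost g


def pack_node(parent_x, parent_y, g, f, is_obstacle=False, in_open=False):
    """Pack one node into a 10-byte record."""
    word = g & COST_MASK
    if is_obstacle:
        word |= OBSTACLE_BIT
    if in_open:
        word |= IN_OPEN_BIT
    return struct.pack(NODE_FORMAT, parent_x, parent_y, word, f)


def unpack_node(data):
    """Inverse of pack_node; returns a dict of the node fields."""
    parent_x, parent_y, word, f = struct.unpack(NODE_FORMAT, data)
    return {
        "parent": (parent_x, parent_y),
        "g": word & COST_MASK,
        "f": f,
        "is_obstacle": bool(word & OBSTACLE_BIT),
        "in_open": bool(word & IN_OPEN_BIT),
    }
```

With this layout, two coordinate bytes plus two 32-bit words give the 10 bytes stated above.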

Hardware Framework Design.
An overall hardware framework is shown in Figure 3. The grid map is initialized and stored in the memory. After the start node is loaded, the nodes management module calculates the successors' coordinates and reads their data information into 8 parallel evaluation engines.
The two most critical modules are the nodes cache and the OPEN list array. Since the nodes' management module handles 8 nodes in parallel, the nodes cache must transfer 8 nodes' information in one cycle if there is no miss. So, it is designed as an 8-port cache. In addition, the efficiency of sorting nodes in the OPEN list array determines the throughput of the accelerator.

Design of the Cell Cache.
To load the successors' information in parallel, an 8-port cache is needed. On a grid map, the successors of a given node all lie in a 3 × 3 square. Accordingly, one cache line is designed to store a square block of N × N nodes. The size of the cache line is determined by design space exploration (DSE). The block-based arrangement is shown in Figure 4. The worst situation is when the node is in the corner of a block and the successors are spread over 4 neighboring blocks. Therefore, the 8-port cell cache consists of 4 banks so that the blocks can be read in parallel. The bank selection is determined by the lower bits of the block coordinate. This bank selection strategy ensures that the four parallel requests target different cache lines in the same cache. An example of the worst case is shown in Figure 5.
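The worst case can be illustrated with a small sketch. It assumes that the bank index is formed from the lowest bit of the block's x and y coordinates, which is one plausible reading of "the lower bits of block coordinate"; the exact formula, the function names, and the default 8 × 8 block are assumptions of this example.

```python
BLOCK_SIZE = 8  # nodes per block edge (N); illustrative default


def bank_of(x, y, block_size=BLOCK_SIZE):
    """Assumed bank selection: lowest bit of block y and of block x."""
    block_x, block_y = x // block_size, y // block_size
    return ((block_y & 1) << 1) | (block_x & 1)


def banks_for_successors(x, y, map_size=256, block_size=BLOCK_SIZE):
    """Map each bank to the set of blocks requested by the 8 successors."""
    banks = {}
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the node itself
            sx, sy = x + dx, y + dy
            if 0 <= sx < map_size and 0 <= sy < map_size:
                block = (sx // block_size, sy // block_size)
                banks.setdefault(bank_of(sx, sy, block_size), set()).add(block)
    return banks
```

For a node at the corner of a block, such as (8, 8), the eight successors touch four blocks and each block falls in a different bank, so the four block reads do not collide. For a node in the interior of a block, such as (3, 3), all eight requests fall in one block and can be merged.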
The detailed architecture is shown in Figure 6. The 8 addresses are transferred to the cache in parallel and distributed to different banks by an interconnect crossbar. Read addresses within the same block fall on the same cache line, so those read requests are merged. When the data of one bank is missed, the bank writes the data back if it is "dirty" and reads the new data from memory. The cache miss penalty time is the number of clock cycles spent on writing back and reading. If the data in the bank is available, it is distributed to the ports according to the requests. The cache is connected to the memory controller by an AXI bus with a data width of 64 bits. The mapping algorithm of the cache is designed as 2-way set associative. Direct mapping is the easiest mapping algorithm, but it is not suitable for the A∗ algorithm implementation. A∗ is a heuristic path-planning algorithm, so the process of reading nodes is random, and two different nodes that map to the same cache line will seriously affect the performance. A fully associative mapping algorithm would increase the design complexity of the cache. To balance cache size and performance, we choose 2-way set associative mapping. The trade-off between the cache size and performance is determined by design space exploration.

Design of the OPEN List Array.
The OPEN list array is composed of 8 parallel sequence queues called OPEN list queues. Each queue contains an input buffer (IB) structure to store the input data and an output buffer (OB) for nodes ready for output. The overall architecture is shown in Figure 7. The nodes in OB are ordered by the evaluation function's value f. The input node is sent to IB when a new node is inserted into the OPEN list. The sorting process is similar to the bubble sort algorithm. When the input node is flowing in IB and reaches the position IB_i, it is compared with the next OB buffer's position OB_{i+1}. If the value f of the node in IB_i is lower than that in OB_{i+1}, these two nodes are swapped and the node of OB_{i+1} is sent to the next buffer IB_{i+1}. In this way, the larger OB value is swapped into IB.
The other situation is when the head of OB pops from the queue, leaving a "bubble" in position OB_0. Then the corresponding IB cell IB_0 needs to be compared with the next OB cell OB_1. If the value f of IB_0 is lower than that of OB_1, the node in IB_0 is put in the position of OB_0. Otherwise, the node in OB_1 is put in that position and the bubble shifts right in the OB buffer. Figure 8 shows an example of the above two processes. At cycle 1, node 3 is inserted into IB and needs to swap with node 5 in OB. At cycle 2, node 3 takes the position of OB_1 and node 5 goes to IB_1. Assume that the next node inserted is node 4 and node 2 pops out. Then node 3 will take the position of OB_1. At cycle 3, node 4 reaches the position of IB_1 and will take the bubble position. At cycle 4, the inserted node finds the correct position.
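The compare-and-swap idea can be sketched in software as follows. This is a simplified sequential model of a single OPEN list queue, not a cycle-accurate description of the hardware: the separate IB and OB registers and the bubble handling are collapsed into one pass per insertion, and the class and method names are hypothetical.

```python
class OpenListQueue:
    """Sequential model of one OPEN list queue kept sorted by f."""

    def __init__(self):
        # Output buffer, ascending by f; ob[0] is the head of the queue.
        self.ob = []

    def insert(self, f, node):
        """Let the new entry flow through the queue, swapping as it goes."""
        carry = (f, node)
        for i in range(len(self.ob)):
            # If the flowing entry has the lower f, it takes this slot
            # and the displaced (larger) entry keeps flowing instead.
            if carry[0] < self.ob[i][0]:
                carry, self.ob[i] = self.ob[i], carry
        self.ob.append(carry)  # the largest entry settles at the tail

    def pop(self):
        """Remove and return the entry with the lowest f, or None."""
        return self.ob.pop(0) if self.ob else None
```

In hardware the comparisons for different positions proceed concurrently as data flows through the array, whereas this model performs them one after another; the final ordering is the same.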

Design of the Other Modules.
The node with the lowest value f must be one of the nodes produced by the evaluation engines or one of the nodes at the heads of the OPEN list queues. Therefore, the comparison engine module compares these 16 nodes and chooses the node for the nodes' management module to expand. The nodes' management module gets the node's coordinates from the comparison module and transfers the nodes' information from the node cache to the 8 parallel evaluation engines. Each evaluation engine computes the value f of its node and inserts it into the OPEN list array. The computation of the evaluation value contains a square root operation, which is done by a lookup table.
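A lookup-table square root for the evaluation value might look like the sketch below. The table indexing (one entry per squared distance), the rounding down to an integer, and the function names are assumptions; the paper states only that the root operation is done by a lookup table, and a hardware table would likely be organized more compactly.

```python
import math

MAP_SIZE = 256
# Largest squared Euclidean distance on the map: corner to corner.
MAX_SQUARED = 2 * (MAP_SIZE - 1) ** 2

# Precomputed integer square roots, indexed by squared distance.
SQRT_LUT = [math.isqrt(d2) for d2 in range(MAX_SQUARED + 1)]


def heuristic(x, y, goal_x, goal_y):
    """h(n): Euclidean distance to the goal, read from the table."""
    dx, dy = x - goal_x, y - goal_y
    return SQRT_LUT[dx * dx + dy * dy]


def evaluation(g, x, y, goal_x, goal_y):
    """f(n) = g(n) + h(n) using integer arithmetic only."""
    return g + heuristic(x, y, goal_x, goal_y)
```

The point of the table is that each evaluation engine needs only two multiplications, an addition, and a memory read, with no iterative root computation.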

System Design
As described in the above section, the performance bottleneck lies in the data fetching and sorting efficiency. In this section, we formulate the overall optimization problem as maximizing the system throughput under resource constraints. We demonstrate the design's implementation for maps with 256 × 256 resolution; it is scalable to bigger maps. Then the hardware system-on-a-chip (SOC) is introduced briefly to show the implementation process in a field programmable gate array (FPGA).

Analytical Model of Cache.
The system throughput is determined by cache performance. We built a high-level model (C++) to perform design space exploration (DSE) and identify the hardware parameters with maximum system throughput. The design space has two dimensions.
(1) The cache line size (marked as N). A larger cache line lowers the miss rate of the cache, but it also increases the possibility of loading useless nodes, which affects the cache miss penalty.
(2) The number of cache lines in a bank (marked as M). A higher number of blocks is better, but it increases the size of the cache. Therefore, some design options need to be "pruned" due to the limitations of memory size on the FPGA. The memory size should be less than 1 MB.
Fully associative mapping makes the structure too complex to implement. Therefore, we choose 2-way set associative mapping to balance cost and performance. The total number of nodes in the cache is calculated to be N × M. The size of the cache is N × M × sizeof(node). The number of cache misses is marked as m. The cache miss penalty is marked as T. The cache's performance is modeled by the memory time MT.

$MT = m \times T$. (4)
Although it is not the actual cache miss penalty time, it can be used to evaluate the trend of design options. The software model runs on 256 × 256 resolution maps with obstacles randomly generated. The experimental results will be introduced in Section 6.1.
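A toy version of such a model is sketched below. It is not the authors' C++ model: the LRU replacement policy, the set-index function, the bank formula, and the treatment of the memory time as the miss count multiplied by a fixed penalty are all assumptions of this example, and write-back traffic is ignored.

```python
from collections import OrderedDict

NUM_BANKS = 4


def count_misses(accesses, block_size, sets_per_bank, ways=2, map_size=256):
    """Count misses m for a sequence of (x, y) node reads.

    Models a 4-bank, set-associative cache of N x N node blocks with
    LRU replacement (the replacement policy is an assumption).
    """
    blocks_per_row = -(-map_size // block_size)  # ceiling division
    cache = [[OrderedDict() for _ in range(sets_per_bank)]
             for _ in range(NUM_BANKS)]
    misses = 0
    for x, y in accesses:
        bx, by = x // block_size, y // block_size
        bank = ((by & 1) << 1) | (bx & 1)  # assumed bank selection
        # Assumed set index, built from the remaining coordinate bits.
        set_index = ((by >> 1) * blocks_per_row + (bx >> 1)) % sets_per_bank
        lines = cache[bank][set_index]
        tag = (bx, by)
        if tag in lines:
            lines.move_to_end(tag)  # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict least recently used
            lines[tag] = True
    return misses


def memory_time(misses, penalty_cycles):
    """Memory time MT, assumed here to be misses times penalty."""
    return misses * penalty_cycles
```

Sweeping `block_size` and `sets_per_bank` over a recorded trace of node reads and comparing the resulting memory times is one way to explore the trend across design options.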

System Architecture.
A brief diagram of the experimental platform is shown in Figure 9. The A∗ accelerator was developed in Verilog RTL and implemented in a Xilinx Kintex-7 FPGA. We chose this FPGA because another project of our team was developed on it. Other components of the SOC (such as the CPU and buses) were also developed in that project. The CPU is based on the ARM Cortex-M0 ISA and is used to transfer the maps' data to DDR and the A∗ accelerator's data to the PC. The DDR controller and the A∗ accelerator are connected by 64-bit wide AXI buses. The experimental maps were loaded into DDR3 memory in advance. The CPU controls the A∗ accelerator through 16-bit APB buses. The calculating time (counted in clock cycles) is read back from registers and transferred to the PC for evaluation. The implementation results will be discussed in Section 6.2.
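Since the accelerator reports its calculating time in clock cycles, a small helper such as the following can convert a cycle count to milliseconds and compute a speedup against a software time. The function names are hypothetical, the 100 MHz default is the accelerator clock frequency reported in this paper, and the numbers in the usage note are illustrative rather than measured.

```python
CLOCK_HZ = 100_000_000  # accelerator clock frequency, 100 MHz


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count read back from the accelerator to ms."""
    return cycles * 1000.0 / clock_hz


def speedup(software_ms, hardware_cycles, clock_hz=CLOCK_HZ):
    """Ratio of software execution time to hardware execution time."""
    return software_ms / cycles_to_ms(hardware_cycles, clock_hz)
```

For example, a hypothetical run of 100,000 cycles corresponds to 1 ms at 100 MHz, so a software time of 50 ms for the same map would be a 50x speedup.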

Design Space Exploration of the Cache.
The software model for the design space exploration runs on a 256 × 256 resolution map with obstacles randomly placed. The start point is fixed to (0, 0) and the goal is fixed to (255, 255). The experimental results of different design options are shown in Figure 10.
From the chart, three options perform relatively better. Their results are shown in detail together with the cache size (see Table 2). The cache size is calculated by

$\mathrm{Cache\_size} = \mathrm{cache\_line\_size} \times \mathrm{block\_number} \times \mathrm{bank\_number} \times \mathrm{size\_of\_cell} = M \times N \times 4 \times 10.$ (5)
Although design option 2 gives the best performance, its size makes it infeasible to implement on most FPGAs. Moreover, design option 1 already resolves the performance bottleneck of fetching nodes' data. Therefore, the block size is designed to be 8 × 8 and the block number of each bank is 15.

FPGA Implementation Results.
The A∗ accelerator was developed in Verilog RTL and synthesized with the Xilinx EDA tool Vivado 2019.2. The FPGA used to implement the whole design is a Xilinx Kintex-7 XC7K410T. The synthesis results of the accelerator are illustrated (see Table 3). The maximum frequency of the A∗ accelerator is 255 MHz. In order to achieve better timing results, the core's clock frequency is designed as 100 MHz. The timing constraints come from the combinational logic in the OPEN list array module. In order to achieve sorting in parallel, the comparison chain is deep. The timing can be further optimized in future work. We further analyzed the utilization results. The resource utilization bottleneck falls on Slice LUTs. Moreover, almost all the LUTs are consumed by the OPEN list array module to achieve sorting in parallel.

Performance Enhancement.
We compared the software and hardware implementations of the A∗ algorithm using a 256 × 256 grid map filled with randomly placed obstacles, with the probability of a cell being an obstacle ranging from 0% to 50%, as in [15]. The software algorithm first runs on the PC to find whether a shortest path is available. Then the available map is transferred to the memory on the FPGA circuit board, and the hardware accelerator is run on it. The results of time and path are transferred back to the PC for evaluation after the FPGA's calculation. The results of the software implementation were obtained on an Intel(R) Core(TM) i5-3337U @ 1.80 GHz. For each dataset, a total of 10,000 maps were averaged to form the results (see Table 5). In every map, two situations are tested. In the worst case, the start point is fixed to (0, 0) and the goal point is fixed to (255, 255). In the other situation, they are randomly placed on the map. The results show that a 37-75 times performance enhancement was achieved by the hardware accelerator. This is mainly because each node calculates the 8 neighboring evaluation values and inserts them into the OPEN list in parallel. Moreover, the specially designed cache reduces the time of fetching data from memory. Therefore, it is suitable for real-time path-planning applications.
An interesting phenomenon is that the performance enhancement decreases when the number of expanded nodes decreases. We investigated this phenomenon further by analyzing the simulation waveforms and found that most of the time is spent in the memory-fetching process. For example, in the random 50% data set, over half of the running time is spent on fetching nodes' data. Therefore, most of the cache's data is wasted because the number of expanded nodes is low.

Comparison.
To compare the performance enhancement, previous software and hardware implementations of the A∗ algorithm are chosen. Yao et al. [7] proposed software optimization methods to lower the number of nodes to be expanded. Their work improved the processing time from 7.879 s to 3.061 s, a 2.58x speedup. Yap et al. [20] proposed the block A∗ algorithm and achieved up to 4.7x speedup. Nery et al. [28] provided a hardware implementation based on the Xilinx High-Level Synthesis (HLS) compiler and achieved a 2.16x speedup. The comparison results are summarized in Table 6. The experiments were conducted under different conditions, so it is not easy to scale the results for direct comparison. However, the speedup is a good standard for comparing the effectiveness of the optimization methods. The proposed architecture achieves a large performance enhancement compared to the previous work due to the carefully designed OPEN list array and nodes cache. Nery et al.'s work [28] is also implemented in FPGA, but its execution is still serial. Therefore, its enhancement is less pronounced.
Judging from the execution time alone, Yap et al.'s work [20] seems to outperform ours. But their method needs to calculate data related to the grid maps (called LDDB) in advance (costing 1.2 s). It is not suitable for real-time path planning.

Conclusions
This article proposed a hardware framework to accelerate the A∗ path searching algorithm by parallelizing the iteration operations. The 8-port cache is designed to tackle the memory bandwidth bottleneck and the OPEN list array to tackle the calculation bottleneck. The proposed architecture shows a 37-75 times speedup even at a low clock frequency of 100 MHz. Therefore, it is of research value to implement the A∗ family of algorithms for more complicated path-planning applications. The proposed SOC design shows its capability for further implementation as a coprocessor in Application-Specific Integrated Circuits (ASICs).
In the future, the proposed architecture will be investigated to lower the consumption of LUTs by optimizing the OPEN list array. The cache will be optimized to adapt to more situations. Moreover, an extensible and configurable architecture for general graph applications will be considered in future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.