Pipeline Implementation of Polyphase PSO for Adaptive Beamforming Algorithm

Adaptive beamforming is a powerful technique for anti-interference, in which searching for and tracking the optimal solution is a great challenge. In this paper, a partial Particle Swarm Optimization (PSO) algorithm is proposed to track the optimal solution of an adaptive beamformer, owing to its strong global search capability. Because PSO is naturally parallel, a novel Field Programmable Gate Array (FPGA) pipeline architecture using a polyphase filter bank structure is designed. To perform computations with large dynamic range and high precision, the proposed implementation uses an efficient user-defined floating-point arithmetic. In addition, a polyphase architecture is proposed to achieve a fully pipelined implementation. For PSO with a large population, the polyphase architecture significantly saves hardware resources while achieving high performance. Finally, simulation results obtained by cosimulation with ModelSim and SIMULINK are presented.


Introduction
Interference has been a major concern for system designers in military and critical civilian wireless communication, since it may obscure the received signal. Traditional filters process signals in the frequency domain and are usually incapable of interference cancellation when the interference occupies the same frequency band as the desired signal. In this case, an attempt to suppress high-power interference also eliminates the low-power signals of interest. Adaptive beamforming [1], known as a spatial filtering method, is a powerful technique that enhances the signals of interest while suppressing interference and noise through a linear combination of the array antenna outputs. Most adaptive beamforming algorithms can be divided into two classes according to whether a training sequence is used [2]: blind adaptive algorithms and nonblind adaptive algorithms. In our research, nonblind algorithms are employed.
LMS [3] may be the most widely used nonblind adaptive beamforming algorithm in engineering applications owing to its robustness and simplicity. However, it converges slowly and is easily trapped in a local optimum, which is a fatal flaw when the digital wireless communication system has demanding real-time requirements.
Particle Swarm Optimization (PSO), which was proposed by Eberhart and Kennedy in 1995 [4], is now one of the most important and widely used swarm intelligence algorithms. Using a few simple principles, PSO mimics the flocking behavior of birds to guide the swarm particles in the search for a global optimal solution. Compared to other evolutionary algorithms such as genetic algorithms [5], simulated annealing algorithms [6], and ant colony algorithms [7], PSO is much easier to implement and performs well in both convergence speed and global search. Therefore, it has been successfully used in many engineering applications in recent years, including adaptive filters, which can be regarded as real-world optimization problems [8][9][10][11][12][13].
Like other iterative evolutionary computation approaches, PSO is a population-based optimization technique whose main drawback is long execution time, especially when solving large-scale complex engineering problems. Because PSO is naturally parallel, parallel implementations have been proposed to overcome this problem, achieving high performance in comparison with software solutions [14][15][16][17]. However, the hardware cost of a parallel PSO implementation grows quickly with the population, since every increase in swarm size results in a linear increase in the consumption of hardware resources. This weakness has restricted the use of the PSO algorithm in digital signal processing applications.
Recently, advances in Very Large Scale Integration (VLSI) technology have generated significant interest in using Field Programmable Gate Arrays (FPGAs) to speed up scientific and engineering computation through parallel implementation and configurable hardware [18][19][20]. By taking advantage of architectural techniques such as pipelining and parallel computing, an FPGA can achieve much greater processing speed than common software solutions.
FPGA implementation of the PSO algorithm is a feasible and inexpensive solution because of its parallel high-performance computing and configurability. Several parallel architectures have been proposed to implement the PSO algorithm. Most previous FPGA implementations of PSO use fixed-point arithmetic, since conventional FPGA technology provides only integer and fixed-point arithmetic [21][22][23][24][25]. This approach reduces the hardware cost in logic area; however, the simplification is likely to degrade resolution because of the small dynamic range. A simple FPGA implementation of adaptive filters with the PSO algorithm has been presented in [23]. In the anti-interference communication field, especially in military wireless communication, the power of a narrow-band interference signal is usually more than 30 dB higher than that of the signal of interest, which requires a large dynamic range; namely, the algorithm operates on both small and large numbers during PSO execution. In addition, the iterative PSO algorithm needs high precision to offset the effect of update error. Fixed-point arithmetic cannot satisfy these two requirements. Hence, we propose an adaptive beamforming algorithm with PSO that uses user-defined floating-point arithmetic, which reduces the loss of precision while keeping the consumption of hardware resources as low as possible. Although a few previous works [18,26] have implemented PSO with floating-point arithmetic, they still use common parallel architectures in which each particle requires independent hardware units for signal processing. This results in a large consumption of hardware resources and power, which is a drawback for digital communication systems.
In this paper, we present a novel FPGA-based pipelined architecture that implements an adaptive beamforming algorithm using PSO based on the minimum mean square error (MSE) criterion. The proposed architecture is based on user-defined floating-point arithmetic [8]. It mainly applies to modern digital anti-interference communication systems in which the baseband chip period is much longer than the system clock period. As a consequence, a large time redundancy is available, of which full use can be made. Essentially, this architecture reuses hardware resources: all particles share the same hardware units to evaluate fitness and update position. Because of the large fixed time redundancy, this sharing hardly affects system performance. Using digital polyphase filtering technology saves a large amount of hardware resources and power, since essentially only the hardware processing unit of a single particle is needed. In addition, the existing floating-point arithmetic on FPGA designed by XILINX executes a formatting operation after every addition or multiplication, which increases the consumption of resources. Further, the existing floating-point arithmetic uses the IEEE-754 standard, which may not be sufficient to achieve the required dynamic range and precision. For these two reasons, the implementations of adaptive beamforming with the PSO algorithm are based on suitable user-defined floating-point arithmetic.
The remainder of this paper is organized as follows. The model of adaptive beamforming and the PSO algorithm is presented in Section 2. Section 3 describes the related operations for the FPGA implementation of adaptive beamforming with the PSO algorithm. Section 4 presents the complete proposed implementation architecture. The simulation methods and results are given in Section 5. Finally, we present our conclusions in Section 6.

Adaptive Beamforming
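In a real digital anti-interference communication system, an adaptive beamformer processes only baseband signals rather than RF (Radio Frequency) or IF (Intermediate Frequency) signals. Figure 1 shows the simplified adaptive beamforming system based on a Uniform Linear Array (ULA) with $N$ isotropic antennas. The output of the ULA, $\mathbf{x}(t)$, is given by [10]

$$\mathbf{x}(t) = \mathbf{a}(\theta_s)\, s(t) + \sum_{k} \mathbf{a}(\theta_k)\, i_k(t) + \mathbf{n}(t),$$

where $s(t)$ is the signal of interest, $i_k(t)$ are the interference signals, $\mathbf{a}(\cdot)$ is the steering vector for the corresponding direction of arrival, and $\mathbf{n}(t)$ is the noise. Figure 2 shows the working principle of the adaptive beamformer. The aim of adaptive beamforming is to use an a priori desired signal $d(n)$ to estimate the signal of interest, separating it from the interference and noise in the received signal.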
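The signal model above can be illustrated with a minimal NumPy sketch. The half-wavelength element spacing, the single interferer, the function names, and the circular complex Gaussian noise are assumptions made for this example; they are not taken from the implementation described in this paper.

```python
import numpy as np

def steering_vector(theta_deg, n_antennas, spacing=0.5):
    """ULA steering vector; spacing is in wavelengths (0.5 is an assumption)."""
    n = np.arange(n_antennas)
    return np.exp(-2j * np.pi * spacing * n * np.sin(np.deg2rad(theta_deg)))

def ula_output(s, interference, theta_s, theta_i, n_antennas, noise_std, rng):
    """Snapshots x = a(theta_s) s + a(theta_i) i + n, one column per sample."""
    a_s = steering_vector(theta_s, n_antennas)[:, None]
    a_i = steering_vector(theta_i, n_antennas)[:, None]
    shape = (n_antennas, s.size)
    noise = noise_std * (rng.standard_normal(shape)
                         + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    return a_s * s + a_i * interference + noise
```

With a signal arriving from broadside (0°), the steering vector is all ones, so every antenna receives the same copy of $s(t)$; the interferer adds a direction-dependent phase ramp across the array.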
As shown in Figure 2, the output of the adaptive beamformer is the linear combination of the weight vector and the output of the DDC (digital down-converter). The criterion is to maximize the output in the direction of the signal of interest and to place nulls in the directions of the interferences. The weight vector is updated in each iteration by the adaptive beamforming algorithm based on the minimum mean square error (MSE) criterion. Therefore, the adaptive beamforming problem can be described as follows: the output $y(n)$ of the adaptive beamformer is the linear combination of the input signal $\mathbf{X}(n)$ with the complex weight vector $\mathbf{W}(n)$. Then the error signal $e(n)$ between the desired signal $d(n)$ and the output $y(n)$ is minimized. Finally, $e(n)$ is used to update the weight vector $\mathbf{W}(n)$.
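As described above, a simple example that uses LMS with the MSE criterion as the adaptive beamforming algorithm can be expressed as [1]

$$y(n) = \mathbf{W}^{H}(n)\, \mathbf{X}(n), \tag{2}$$

where

$$\mathbf{W}(n) = [w_0(n), w_1(n), \ldots, w_{N-1}(n)]^{T}, \qquad \mathbf{X}(n) = [x_0(n), x_1(n), \ldots, x_{N-1}(n)]^{T},$$

$H$ denotes the Hermitian transpose, and $T$ denotes the transpose. The error signal $e(n)$ is given by

$$e(n) = d(n) - y(n),$$

and the weight vector update equation is

$$\mathbf{W}(n+1) = \mathbf{W}(n) + \mu\, \mathbf{X}(n)\, e^{*}(n),$$

where the parameter $\mu$ is the step size, which is related to the power of the input signal and controls the convergence speed.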
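For reference, a complex LMS beamformer of this kind can be sketched in a few lines of NumPy. This is a software illustration only: the function name, the zero initial weights, and the array layout (one column of `X` per snapshot) are assumptions of the example.

```python
import numpy as np

def lms_beamformer(X, d, mu):
    """Complex LMS: y = w^H x, e = d - y, w <- w + mu * x * conj(e).

    X has shape (n_antennas, n_samples); d has shape (n_samples,).
    Returns the final weights and the error sequence.
    """
    n_antennas, n_samples = X.shape
    w = np.zeros(n_antennas, dtype=complex)    # assumed initial weights
    errors = np.empty(n_samples, dtype=complex)
    for n in range(n_samples):
        x = X[:, n]
        y = np.vdot(w, x)                      # vdot conjugates w, giving w^H x
        e = d[n] - y
        w = w + mu * x * np.conj(e)
        errors[n] = e
    return w, errors
```

For a noiseless unit-power signal received identically on four antennas, each update shrinks the error by the factor $1 - 4\mu$, which shows how the step size trades convergence speed against stability.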

Adaptive Beamforming Based on PSO Algorithm
In this section, an adaptive beamforming algorithm using PSO based on the MSE criterion is proposed. Searching for the optimal solution of adaptive beamforming can be regarded as a Multiobjective Optimization Problem (MOP), which can be described by the following formula [26,27]:

$$\text{V-min}\ \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_n(\mathbf{x})]^{T}, \quad \mathbf{x} \in \Omega,$$

where $\mathbf{x}$ is the feasible solution in the feasible set $\Omega$ and V-min means the minimization of the group of functions. For the adaptive beamformer, the objective is to search for the optimal weight vector using the given input signals and the desired signals.
The weight vectors $\mathbf{W}(n)$ can be regarded as a set of the functions $f_i(\cdot)$. Therefore, the criterion is the minimization of $F$, where $F$ is the fitness function of PSO and $i = 0, 1, \ldots, N-1$.
To solve the MOP model of adaptive beamforming with PSO, we consider a $D$-dimensional problem space. The position of the $i$th particle is expressed as $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,D})$, which represents a weight vector, and the velocity of the particle, that is, the rate of change of $\mathbf{x}_i$, is $\mathbf{v}_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,D})$, with $i = 0, 1, \ldots, M-1$, where $M$ is the population size. In each iteration $k$, the PSO update equations are

$$\mathbf{v}_i^{k+1} = \omega\, \mathbf{v}_i^{k} + c_1 r_1 (\mathbf{p}_i - \mathbf{x}_i^{k}) + c_2 r_2 (\mathbf{p}_g - \mathbf{x}_i^{k}), \tag{9}$$

$$\mathbf{x}_i^{k+1} = \mathbf{x}_i^{k} + \mathbf{v}_i^{k+1}, \tag{10}$$

where $\omega$ is the inertia weight, which mainly balances the local search and the global search [14]. $c_1$ and $c_2$ are the acceleration constants, usually both set to 2, which is easy to implement with a shift operation on FPGA. $r_1 \in [0, 1]$ and $r_2 \in [0, 1]$ are two random numbers [8]. $\mathbf{p}_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,D})$ represents the individual best position, and $\mathbf{p}_g = (p_{g,1}, p_{g,2}, \ldots, p_{g,D})$ represents the global best position in the search space.
For the optimization problem solved by the PSO algorithm, the fitness function is described by (8), in which one parameter is the particle population size. The flowchart of adaptive beamforming based on the PSO algorithm is given in Figure 3.
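To make the search procedure concrete, the following NumPy sketch applies a standard PSO velocity and position update to complex beamforming weights. It is a software illustration under stated assumptions, not the hardware design: the block-averaged squared error used as fitness, the uniform initialization, the separate clamping of real and imaginary velocity parts, and all function and parameter names are choices made for this example. The acceleration constants (2), the velocity limit (0.125), and the inertia range (0.9 to 0.4) follow values quoted in the text.

```python
import numpy as np

def fitness(W, X, d):
    """Mean of |d - w^H x|^2 over the snapshots, for every particle (row of W)."""
    Y = W.conj() @ X                      # shape (particles, snapshots)
    return np.mean(np.abs(d - Y) ** 2, axis=1)

def pso_beamformer(X, d, n_particles=32, n_iter=50, w_max=0.9, w_min=0.4,
                   c1=2.0, c2=2.0, v_max=0.125, seed=0):
    rng = np.random.default_rng(seed)
    shape = (n_particles, X.shape[0])
    pos = rng.uniform(-1, 1, shape) + 1j * rng.uniform(-1, 1, shape)
    vel = np.zeros(shape, dtype=complex)
    pbest, pbest_fit = pos.copy(), fitness(pos, X, d)
    g = np.argmin(pbest_fit)
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    history = [gbest_fit]
    for k in range(n_iter):
        omega = w_max - (w_max - w_min) * k / n_iter    # linear inertia schedule
        r1, r2 = rng.random(shape), rng.random(shape)
        vel = omega * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = (np.clip(vel.real, -v_max, v_max)
               + 1j * np.clip(vel.imag, -v_max, v_max))
        pos = pos + vel
        fit = fitness(pos, X, d)
        improved = fit < pbest_fit                      # individual best update
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        g = np.argmin(pbest_fit)
        if pbest_fit[g] < gbest_fit:                    # global best update
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
        history.append(gbest_fit)
    return gbest, np.array(history)
```

Because the global best is replaced only when a strictly better individual best appears, the returned fitness history is non-increasing by construction.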

Related Operation Based on Floating-Point Arithmetic
The algorithms implemented on FPGA depend heavily on the arithmetic precision. User-defined floating-point arithmetic allows the designer to choose the bit-width of the floating-point representation according to the balance between logic area consumption and the precision requirement of the algorithm. As stated in (9) and (10), the related operations of the algorithm include multiplication, addition, and random number generation.
For user-defined floating-point data, the multiplication operation is easy to realize with the multiplier IP (Intellectual Property) core provided by XILINX. Therefore, our main work focuses on the pipeline addition operation and random number generation.

Floating-Point Uniform Random Number Generator.
PSO is a stochastic search algorithm based on several particles moving randomly in a feasible space. To compute (9) and (10), in which $r_1$ and $r_2$ must be set randomly, we need uniform Random Number Generators (RNGs). The initial position and velocity of the population in PSO also require RNGs to generate uniform random values. In our proposed scheme, the RNG module is built from configurable bit-width Linear Feedback Shift Registers (LFSRs), whose input is driven by the XOR (exclusive OR) of several bits of the shift register. The period of the mantissa sequence is on the order of $2^{\text{bit-width}}$. LFSRs on FPGA operate on fixed-point data. Hence, we define two LFSRs to generate the mantissa and the exponent of the floating-point format, respectively.
To simplify computation, all signals are power normalized; therefore, as presented in Figure 4, the sign bit of the exponent LFSR is set to 1. That is to say, the generated exponent is always a negative integer. To avoid numbers that are too small, the bit-width of the exponent LFSR is set to 4. The bit-width of the mantissa LFSR is configurable according to the precision requirement. In this way, the algorithm avoids fixed-point-to-floating-point conversion.
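The two-LFSR idea can be sketched in software as follows. This is an illustrative model only: the 16-bit mantissa width, the feedback tap positions (a commonly used maximal-length choice for each width), the seeds, and the function names are assumptions of the example, while the 4-bit exponent width and the always-negative exponent follow the description above.

```python
def lfsr_step(state, shifts, width):
    """One Fibonacci LFSR step: XOR the tapped bits and shift the result in."""
    bit = 0
    for s in shifts:
        bit ^= (state >> s) & 1
    return (state >> 1) | (bit << (width - 1))

def lfsr_period(seed, shifts, width):
    """Number of steps until the register returns to its seed state."""
    state, count = lfsr_step(seed, shifts, width), 1
    while state != seed:
        state = lfsr_step(state, shifts, width)
        count += 1
    return count

MANTISSA = ((0, 2, 3, 5), 16)   # assumed taps and width for the mantissa LFSR
EXPONENT = ((0, 1), 4)          # assumed taps; 4-bit exponent as in the text

def random_floats(n, m_seed=0xACE1, e_seed=0x1):
    """Combine the two registers as mantissa * 2**(-exponent)."""
    m, e, out = m_seed, e_seed, []
    for _ in range(n):
        m = lfsr_step(m, *MANTISSA)
        e = lfsr_step(e, *EXPONENT)
        out.append((m / 2 ** 16) * 2.0 ** (-e))   # exponent is always negative
    return out
```

Because each register is stepped once per output and no conversion is applied, a value is produced on every call of the loop body, which mirrors one value per clock cycle in hardware.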

User-Defined Floating-Point Pipeline Addition Operation.
The floating-point addition operation consists of the sequence of mantissa and exponent operations: shift, swap, round, and format [28,29].
The floating-point pipeline addition in our proposed implementation consists mainly of two parts, the basic adder and the formatting operation, shown in Figure 5, in which SHR and SHL mean shift right and shift left, respectively. The basic adder first compares the exponents of the two input operands; the bigger one, incremented, is taken as the exponent of the output sum. At the same time, according to the comparison result, it swaps the operands and shifts the mantissa of the smaller number to align the two incoming numbers. Then the two mantissas are added and the sum is truncated by discarding the lowest bit. The formatting operation first preprocesses the exponent and mantissa of the sum produced by the basic adder. Then it counts the duplicated sign bits and finally outputs the exponent by a subtraction and the mantissa by an SHL operation, both according to the number of duplicated sign bits.
A conventional floating-point addition IP core, provided by XILINX, performs the formatting operation after every two-input addition. However, the formatting operation consumes many more hardware resources than the basic adder because of the operation that counts the duplicated sign bits.
In our proposed architecture, we use an eight-input floating-point adder for formula (2) and two-input and four-input floating-point adders elsewhere. The use of the formatting operation should be minimized, since it is what makes a floating-point adder based on the IEEE-754 standard consume more resources. Hence the two-input and four-input floating-point adders are implemented as shown in Figures 6 and 7: formatting is not conducted after every basic-adder operation but only once, after all incoming numbers have been summed. The eight-input floating-point adder, built in the same way, is presented in Figure 8.
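The effect of deferring the formatting step can be modeled in software. The sketch below represents a number as a signed integer mantissa and an integer exponent, adds four operands with three unnormalized basic additions, and formats only once at the end. The 24-bit mantissa width, the helper names, and the use of Python integers in place of fixed-width registers are assumptions of this example rather than details of the proposed hardware.

```python
import math

MANT_BITS = 24  # hypothetical mantissa width (two's complement, sign included)

def from_float(x):
    """Encode x as (mantissa, exponent) with value = mantissa * 2**exponent."""
    frac, exp = math.frexp(x)                 # x = frac * 2**exp, 0.5 <= |frac| < 1
    return int(frac * (1 << (MANT_BITS - 1))), exp - (MANT_BITS - 1)

def to_float(v):
    mant, exp = v
    return math.ldexp(mant, exp)

def basic_add(a, b):
    """Align to the larger exponent plus one, then add; no normalization."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                               # swap so that a has the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    e = ea + 1                                # headroom for the carry
    return (ma >> 1) + (mb >> (e - eb)), e    # right shifts truncate low bits

def normalize(v):
    """Formatting: remove duplicated sign bits by shifting left."""
    mant, exp = v
    if mant == 0:
        return 0, 0
    while -(1 << (MANT_BITS - 2)) <= mant < (1 << (MANT_BITS - 2)):
        mant, exp = mant << 1, exp - 1
    return mant, exp

def add4(a, b, c, d):
    """Four-input adder: three basic additions, one formatting step."""
    return normalize(basic_add(basic_add(a, b), basic_add(c, d)))
```

In this model the normalization loop runs once per four-input sum instead of once per two-input addition, which is the saving the architecture aims for; the cost is that intermediate sums carry a few more truncated low-order bits.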

Pipeline Polyphase Architecture of PSO
In this section, we first explain what time redundancy is and then discuss how to use it to achieve a novel pipeline polyphase PSO (PPPSO) architecture for adaptive beamforming. The term polyphase is derived from polyphase filtering, a time-sharing multiplex technology that makes good use of hardware resource units without affecting the performance of the algorithm in our proposed architecture. Finally, the whole hardware unit of one particle and its main parts are presented.

Time Redundancy.
In a modern digital communication system, AD converters sample signals at a very high rate. Processing the signals directly after the AD converters would require extremely high data throughput and huge hardware resources; it is not feasible, since there are not enough hardware resources, and it is also unnecessary. In general, the sampled signals are transformed by the DDC into baseband signals with a low chip rate (e.g., 500 K chip/s). However, the system clock of a 7-series XILINX FPGA can easily reach 250 MHz, which is 500 times faster than the chip rate. An example is shown in Figure 9.
As can be seen in Figure 9, every baseband chip lasts for 500 system clock cycles, only one of which is needed in a conventional FPGA-based digital signal processing (DSP) scheme. This leads to a large time redundancy: the baseband chip is not processed in the other 499 system clock cycles, which is an enormous waste of hardware resources. Therefore, we propose a pipeline polyphase scheme to make full use of these idle cycles. For a clearer illustration, we define the Time Redundancy Rate (TRR) as

$$\mathrm{TRR} = \left\lfloor \frac{f_{\mathrm{clk}}}{f_{\mathrm{chip}}} \right\rfloor,$$

where $f_{\mathrm{clk}}$ is the system clock frequency, $f_{\mathrm{chip}}$ is the baseband chip rate, and $\lfloor\cdot\rfloor$ means rounding down to the nearest integer. One of the most important characteristics of PSO is that all particles in the same population search for the optimal solution independently (exchanging information only through the global best $\mathbf{p}_g$). Therefore, a single hardware unit can be shared by all particles to evaluate the fitness value and update the positions, which greatly reduces the use of hardware resources. The greater the TRR is, the larger the population of the PSO algorithm can be set and, theoretically, the higher the performance that can be achieved.

Adaptive Beamformer with PPPSO Algorithm.
$r_1$ and $r_2$ are generated at each system clock cycle by the RNG described above. For the inertia weight $\omega$, the proposed PPPSO architecture adopts the suggestion from [14] and sets it as a dynamic function of the iteration index, given by

$$\omega = \omega_{\max} - (\omega_{\max} - \omega_{\min})\, \frac{k}{K}, \tag{11}$$

where $\omega_{\max}$ and $\omega_{\min}$ represent the maximum and minimum values of $\omega$, respectively; $k$ is the current iteration index of the PSO algorithm; and $K$ is the maximum iteration index, at which the iterative process ends. The timing diagram of the critical signals in our proposed PPPSO architecture is presented in Figure 11. As depicted in Figure 11, a polyphase period consists of $M$ system clock cycles, where $M$ is the population size of the algorithm, during which the PPPSO algorithm finishes one iteration.
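As a quick numerical check of these two quantities, the short Python sketch below computes the time redundancy rate for the clock and chip rates quoted above and evaluates a linearly decreasing inertia weight. The function names and the linear form of the schedule are assumptions of this example; the 0.9 and 0.4 limits are the values used later in the simulations.

```python
def time_redundancy_rate(f_clk_hz, chip_rate_hz):
    """Whole number of system clock cycles available per baseband chip."""
    return f_clk_hz // chip_rate_hz

def inertia_weight(k, k_max, w_max=0.9, w_min=0.4):
    """Inertia weight that decreases linearly from w_max to w_min."""
    return w_max - (w_max - w_min) * k / k_max

trr = time_redundancy_rate(250_000_000, 500_000)   # 250 MHz clock, 500 K chip/s
weights = [inertia_weight(k, 100) for k in (0, 50, 100)]
```

With these inputs the rate evaluates to 500, so a population of up to 500 particles could in principle be served by one shared update unit within a single chip period.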
That is to say, benefiting from the pipeline polyphase signal processing technique, each particle independently finishes its individual best update in one system clock cycle (the particles exchange search information only at the end of a polyphase period, through the global best). Within a polyphase period, the input signals, the desired signals, and the inertia weight $\omega$ remain unchanged for all particles. Take the position (shown as $x$ in Figure 11) as an example of how the pipeline architecture works: $x_j^{k}$ represents the $k$th data of the $j$th ($1 \le j \le M$) phase data channel, that is, the $j$th particle's position value. Therefore, in a whole polyphase period, every particle receives the same input signals and desired signals as the input of the whole architecture to finish the update process. Since this is a pipeline process, the particle in each phase channel shares the same hardware units and performs its own update using the previous position value of the same phase channel, so the particles do not affect each other. In this way, when one polyphase period finishes, each particle has finished updating its own individual best value, and the search for the optimal solution in this iteration is complete.
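The time-sharing principle can be mimicked in software with a cycle-by-cycle loop in which one shared update routine serves a different particle on every clock cycle, with per-particle state kept in arrays that stand in for RAM. The sketch below is a toy model under stated assumptions: it uses a one-dimensional real-valued search problem and a caller-supplied fitness function instead of the beamforming error, and all names are hypothetical.

```python
import random

def run_polyphase(n_particles, n_periods, fitness, seed=0):
    rng = random.Random(seed)
    pos = [rng.uniform(-1, 1) for _ in range(n_particles)]   # position RAM
    vel = [0.0] * n_particles                                # velocity RAM
    pbest = pos[:]                                           # individual-best RAM
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g], pbest_fit[g]
    clock = 0
    for period in range(n_periods):
        omega = 0.9 - 0.5 * period / n_periods   # held constant within a period
        for phase in range(n_particles):         # one particle per clock cycle
            v = (omega * vel[phase]
                 + 2 * rng.random() * (pbest[phase] - pos[phase])
                 + 2 * rng.random() * (gbest - pos[phase]))
            v = max(-0.125, min(0.125, v))       # velocity limits
            x = pos[phase] + v
            f = fitness(x)
            vel[phase], pos[phase] = v, x        # write state back to RAM
            if f < pbest_fit[phase]:             # individual best, every cycle
                pbest[phase], pbest_fit[phase] = x, f
            clock += 1
        g = min(range(n_particles), key=pbest_fit.__getitem__)
        if pbest_fit[g] < gbest_fit:             # global best, once per period
            gbest, gbest_fit = pbest[g], pbest_fit[g]
    return gbest, gbest_fit, clock
```

The global best is read by every phase but written only after the last phase of a period, so all particles in one period see the same value, as in the timing described above. The total cycle count is the population size multiplied by the number of periods.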

The Individual Best and Global Best Update Module.
The individual best and global best update module (depicted in Figure 12) is another critical part of the whole pipeline polyphase architecture. It contains the fitness function, which evaluates the fitness value of every particle's position. Whether the individual best and global best are updated depends on the computed value of the fitness function.
As shown in Figure 12, the fitness function value is calculated using the particle position, the input signals, and the desired signal as incoming data. The individual and global best are then updated or not according to the newly evaluated fitness value. The global best is updated only once, at the end of the polyphase period, by comparing the fitness value of the global best position in the current iteration with that of the previous iteration. The individual best, however, must be compared with the corresponding evaluated value in every phase channel. Hence the individual best of each particle is stored in RAM in order, as shown in Figure 12, so that each particle's individual best position in the current iteration can be compared with that of the last iteration.
In our proposed adaptive beamformer with the PPPSO algorithm, a simple four-antenna ULA is applied. Hence, each particle has four dimensions.
The fitness function uses the MSE criterion to minimize the error value as stated in formula (8), in which the weight vector is the position of the particle in the PPPSO algorithm. Note that the adaptive beamformer with the PPPSO is a complex-valued algorithm, so all baseband signals are complex. The complex error is calculated as shown in Figure 13.
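Since the hardware operates on real numbers, the complex error has to be split into real and imaginary parts. The following sketch shows one way to do this using real multiplications and additions only; the function names are hypothetical, and the decomposition is the standard expansion of $d - \sum_i w_i^{*} x_i$, not a transcription of Figure 13.

```python
def complex_error(w, x, d):
    """Real and imaginary parts of e = d - sum(conj(w_i) * x_i)."""
    y_re = sum(wi.real * xi.real + wi.imag * xi.imag for wi, xi in zip(w, x))
    y_im = sum(wi.real * xi.imag - wi.imag * xi.real for wi, xi in zip(w, x))
    return d.real - y_re, d.imag - y_im

def squared_error(w, x, d):
    """|e|^2 computed from the two real components."""
    e_re, e_im = complex_error(w, x, d)
    return e_re * e_re + e_im * e_im
```

For four antennas, each of the two output components is a sum of eight real products, which is consistent with the eight-input adder mentioned for formula (2).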

Particle Update Equation.
The position and velocity update module is shown in Figure 14. As stated in formulas (9) and (10), the updating process of each particle in each dimension requires five additions (or subtractions), three multiplications, and two uniform RNGs.
As depicted in Figure 14, all operations in the hardware units of the position and velocity updating module work in a fully parallel pipelined way. These hardware units need to work together in every system clock cycle because of the pipeline requirement. However, our scheme lets all particles share just one particle updating module, which makes good use of the pipeline polyphase implementation.

Simulation Results and Analysis
The proposed architectures for an adaptive beamformer based on the PPPSO algorithm have been developed in the hardware description languages Verilog HDL and VHDL (Very High Speed Integrated Circuits Hardware Description Language). All the architectures are synthesizable in the XILINX ISE 14.7 tool and are based on parameterizable floating-point packages with user-defined bit-width. Our proposed architecture mainly targets the PPPSO algorithm with a large-scale population (more than 64 particles). As mentioned above, a TRR of 500 is easy to achieve in XILINX 7-series devices. Hence, the population can theoretically reach a size of 500. Mentor Graphics ModelSim is a widely used HDL simulator for validating the timing of signals in the whole architecture. However, validating the results of the whole adaptive beamforming system with ModelSim alone is complicated. Hence, for convenience and simplicity, cosimulation with ModelSim and MATLAB/SIMULINK through HDL Verifier is applied to verify the simulated results. HDL Verifier automates Verilog and VHDL design verification and analyzes the response, providing interfaces that link MATLAB/SIMULINK with ModelSim. In this way, we are able to compare the complete calculated results from ModelSim with the theoretical results from MATLAB to verify the responses. The cosimulation schematic diagram is depicted in Figure 15.
We consider a ULA with four elements to simulate the real situation. The SNR (Signal-to-Noise Ratio) is set to 1 dB and the ISR (Interference-to-Signal Ratio) is 30 dB. The azimuths (AZ) of the signal and the interference are set to 0° and 60°, respectively. The desired signals and the interference signals are composed of PN sequences and a sine function, respectively. The system supposes that the ULA receives signals including AWGN and horizontal narrow-band interference, so that the pitch angle is 90°. These parameters are shown in Table 1.
As for the initial parameters, Huang et al. [8] suggested a fixed value of 2.0 for both acceleration coefficients. The inertia weight ranges from 0.9 to 0.4 as a linear function, as stated in formula (11). The maximum and minimum of the velocity are 0.125 and −0.125, respectively. The hardware cost in DSP48 slices and RAM does not change, because the use of multipliers and RAM is fixed. As shown in Table 3, the cost in LUTs and FFs decreases gradually with shorter bit-widths of the mantissa and exponent. Using Table 3, designers can balance hardware consumption against precision. We suggest that the algorithm use bit-widths of mantissa and exponent that are as short as possible, provided that doing so does not affect the convergence of the algorithm. From our simulation results, a mantissa bit-width of 36 and an exponent bit-width of 8 already satisfy the precision requirement.

Simulation Results.
As mentioned above, it is convenient to verify the simulation results using cosimulation with ModelSim and SIMULINK, as shown in Figure 15. All simulation results are based on user-defined floating-point arithmetic with a mantissa bit-width of 36 and an exponent bit-width of 8.
Figure 16 depicts the MSE performance of the PPPSO algorithm with different population sizes (128, 192, 256, and 320, respectively). A Monte Carlo method with an ensemble of 10 runs is applied to our simulation, with different initial values for all swarm particles. It can be observed in Figure 16 that the MSE learning curves of the PPPSO are very steep, since the number of swarm particles is very large. Although the MSE learning curves shown in Figure 16 converge to similar values, the convergence speed of the algorithm with populations of 256 and 320 is clearly greater than with populations of 128 and 192.
Figure 17 shows the amplitude pattern of the PPPSO algorithm with different population sizes, computed from the global best position, in the situation where the signal is received amid an interferer and AWGN with the ISR and SNR values listed in Table 1.
In all of these situations, the algorithm nulls the signals from the 60° direction, namely, the interferer's direction, and achieves a high gain for signals from the 0° direction, namely, the direction of the signal of interest. In general, all configurations achieve a wide main lobe and null the signal in the direction of the interferer while attempting to achieve maximum reception in the specified direction of the desired signal.
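An amplitude pattern of this kind can be computed from any weight vector by evaluating the array response over a grid of angles. The NumPy sketch below is a minimal illustration that assumes half-wavelength element spacing and the $y = \mathbf{W}^{H}\mathbf{X}$ output convention; the function name and the example weights are hypothetical and are not the weights found by the PPPSO.

```python
import numpy as np

def amplitude_pattern(w, angles_deg, spacing=0.5):
    """|w^H a(theta)| for each angle; spacing is in wavelengths (assumed 0.5)."""
    n = np.arange(w.size)
    phase = np.outer(n, np.sin(np.deg2rad(angles_deg)))   # (antennas, angles)
    A = np.exp(-2j * np.pi * spacing * phase)             # steering matrix
    return np.abs(w.conj() @ A)

# Example: uniform weights steer a four-element array to broadside (0 degrees).
uniform = np.ones(4) / 4
pattern = amplitude_pattern(uniform, np.array([0.0, 30.0]))
```

With uniform weights the response is 1 at 0° and has a null at 30°; adaptive weights instead move a null toward the interferer's direction.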

Conclusions
This paper describes a pipeline polyphase PSO architecture implemented on FPGA for an adaptive beamformer, using efficient user-defined floating-point arithmetic. The user-defined floating-point arithmetic performs computations with a large dynamic range and suitable precision while saving hardware resources for digital anti-interference communication applications. The major advantage of our proposed architecture is that it allows the PSO algorithm to be used with a large-scale population by means of polyphase signal processing technology. To implement the proposed algorithm with a polyphase architecture rather than a fully parallel architecture, a pipelined hardware architecture for the processing unit of one swarm particle is required; this processing unit is shared by all other swarm particles, which saves a large amount of logic area.
Synthesis results demonstrate that using FPGA to implement the adaptive beamformer based on the PSO algorithm is an entirely acceptable solution. Moreover, the proposed architecture allows designers to explore the balance between precision and performance by using user-defined floating-point arithmetic.
To simplify the simulation process, cosimulation with ModelSim and SIMULINK is applied to validate the results of the whole adaptive beamforming system with a four-antenna ULA. The PPPSO architectures with various large-scale populations are simulated. The MSE learning curve and the amplitude pattern are used to measure performance. The simulation results demonstrate that it is efficient to implement the PPPSO algorithm with large-scale populations.
In the future, we intend to explore the balance between a precisely suitable precision requirement and the hardware logic area. Furthermore, more complicated time-varying situations should be considered in order to better reflect real scenarios.

Figure 3: The flowchart of the algorithm.

Figure 9: System clock cycle and data chip rate.
The whole architecture of the adaptive beamformer based on the PPPSO algorithm is presented in Figure 10. It consists of three parts: (1) the individual best and global best updating module, in which the individual best and global best values are updated or not according to the evaluated fitness function value; (2) the position and velocity update module, in which the swarm particles are updated according to formulas (9) and (10); and (3) the signals storage module, which mainly makes use of Random Access Memory (RAM). As depicted in Figure 10, the individual best and global best updating module receives the input signals and the desired signals, then calculates fitness values according to the fitness evaluation function, and finally updates the values of the individual best and global best. The individual and global best, together with $r_1$, $r_2$, and $\omega$, are applied to the position and velocity updating module to accomplish the updating process. Finally, our proposed pipeline polyphase architecture requires storage of all critical coefficients, including the position, velocity, individual best, and global best.

Figure 12: The individual best and global best update module.

Figure 13: Method to calculate the complex error.

Figure 17: ULA amplitude pattern for PPPSO with different population size.
In the ULA output model, $s(t)$ denotes the signal of interest with Direction of Arrival (DOA) $\theta_s$, and $i_k(t)$ denotes the interference signals with DOA $\theta_k$. $\mathbf{a}(\theta_s)$ and $\mathbf{a}(\theta_k)$ denote the steering vectors for the signal of interest and the interfering signals, respectively. $\mathbf{n}(t)$ is the additive white Gaussian noise (AWGN).

Table 2: Synthesis results for the PPPSO with different population sizes.

Table 3: Synthesis results for architectures based on user-defined floating-point arithmetic with various bit-widths.