A Multiagent Reinforcement Learning Solution for Geometric Configuration Optimization in Passive Location Systems

Passive location systems receive electromagnetic waves at one or multiple base stations to locate transmitters and are widely used in security fields. However, the geometric configuration of the stations can greatly affect the positioning precision. In the literature, the geometry of a passive location system is mainly designed based on empirical models. These empirical models can hardly capture the sophisticated electromagnetic environment of the real world, which results in suboptimal geometric configurations and low positioning precision. To capture the characteristics of complicated electromagnetic environments and improve positioning performance, this paper proposes a novel geometry optimization method based on multiagent reinforcement learning. In the proposed method, agents learn to optimize the geometry cooperatively by factorizing the team value function into agentwise value functions. To facilitate cooperation and deal with data transmission challenges, a constraint is imposed on the data sent from the central station to the vice stations to ensure that communications are concise and effective. According to empirical results under direct position determination systems, the agents can find better geometric configurations than existing methods in complicated electromagnetic environments.


Introduction
Passive location techniques are used in various scenarios, such as telecommunication pseudo base station discovery and aviation interference investigation. Traditional passive location algorithms [1] mainly include angle of arrival (AOA) [2], time difference of arrival (TDOA) [3], and frequency difference of arrival (FDOA) [4]. These algorithms, called two-step positioning methods, locate transmitters after first estimating signal parameters. Direct position determination (DPD) [5,6] uses the observations from all stations to locate the transmitter without estimating signal parameters and outperforms two-step methods in low signal-to-noise ratio (SNR) scenarios [7]. The geometric configuration of the stations can significantly affect the positioning precision [8] in both two-step and DPD positioning algorithms. Some existing studies have tried to derive general principles for geometric configuration from massive experiments [9,10], but only rough conclusions have been drawn: for instance, the stations should not all line up, or the stations should form a triangle surrounding the transmitter. Several other studies have employed heuristic methods, such as the genetic algorithm (GA) [11] or particle swarm optimization (PSO) [12], to search for the optimal geometry. These methods are based on empirical models in which signals are assumed to propagate ideally. In the real world, however, the electromagnetic environment changes abruptly with the positions of the stations owing to factors such as signal frequency, interference, attenuation, multipath, obstacles, and noise. These factors can hardly be described fully in empirical models, which leads to suboptimal geometric configurations and low positioning precision. Therefore, it is vital to adjust the geometric configuration to fit the sophisticated electromagnetic environment so as to improve the precision of a passive positioning task.
This problem is regarded as a sequential decision-making problem in a real-world complex electromagnetic environment rather than as an optimal geometry search problem based on empirical models. Reinforcement learning (RL) is a viable and elegant approach to obtaining an optimal policy for sequential decision-making problems [13]. RL can track the tricky electromagnetic environment in a trial-and-error paradigm. A nonlinear, parameterized deep neural network (DNN) provides a compact and powerful representation of experience and can adapt accurately to the complicated spatial distribution of the electromagnetic environment. Therefore, this paper addresses the problem of finding the optimal geometric configuration in a passive location system through deep reinforcement learning (DRL) [14].
Under the framework of DRL, each station is treated as a mobile agent that can receive signals and decide where to go; the terms station and agent are hereafter used interchangeably. Agents need to optimize the geometric configuration collaboratively to improve the positioning precision, and they can share information via communication channels to facilitate the collaboration. However, communication traffic becomes a concern as the number of agents increases, especially under adverse communication conditions.
This paper proposes an efficient multiagent reinforcement learning algorithm to optimize the geometric configuration of passive location systems. All agents share the collective objective of finding an optimal geometry that improves the positioning precision. To facilitate collaboration, the agents are trained based on value function decomposition, which solves the credit assignment problem among agents implicitly. A vice station needs information from other stations to better evaluate the situation and to improve its decisions on where to go. Meanwhile, the communication traffic must be reduced because of transmission and processing challenges. A mutual information objective function is therefore employed to constrain the messages sent to vice stations so that they are both expressive and concise. The proposed method is evaluated on simulated DPD positioning tasks in a complicated electromagnetic environment. The results demonstrate that the agents can find better geometric configurations than existing methods.

Background
This section introduces the relevant background on passive location (concretely, DPD) and multiagent reinforcement learning (MARL).

Passive Location with DPD.
Consider $H$ transmitters and $L$ stations intercepting the transmitted signals, as shown in Figure 1. Each station is equipped with an antenna array consisting of $M$ elements. The position of the $h$th transmitter is denoted by $p_h = [x_h, y_h]^\top$. The complex envelopes of the signals observed by the $\ell$th station are given by the following equation [5]:
$$z_\ell(t) = \sum_{h=1}^{H} b_{\ell h}\, \alpha_\ell(p_h)\, v_h\big(t - \tau_\ell(p_h) - t_h^{(0)}\big) + n_\ell(t), \tag{1}$$
where $0 \le t \le T$, $z_\ell(t)$ is a complex time-dependent $M \times 1$ observation vector, and $b_{\ell h}$ is an unknown complex scalar representing the channel attenuation between the $h$th transmitter and the $\ell$th station. Moreover, $\alpha_\ell(p_h)$ is the response of the $\ell$th array to the signal transmitted from position $p_h$, and $v_h(t - \tau_\ell(p_h) - t_h^{(0)})$ is the $h$th signal waveform, transmitted at time $t_h^{(0)}$ and delayed by $\tau_\ell(p_h)$. The vector $n_\ell(t)$ represents noise, interference, and multipath effects on the signals.
For brevity, we use $\alpha_{\ell h}$ and $\tau_{\ell h}$ instead of $\alpha_\ell(p_h)$ and $\tau_\ell(p_h)$. The observed signal can be partitioned into $K$ sections, each of length $T/K \gg \max \tau_{\ell h}$. Taking the Fourier transform of each section yields a frequency-domain observation model in which $j = 1, \ldots, J$ is the index of the Fourier coefficients. The vector $\alpha_\ell(j, p_h, b_{\ell h})$ then contains all the information about the transmitter's position. Furthermore, the phase shift caused by the transmit time $t_h^{(0)}$ is cancelled out when $s_h(j, k)$ is used by the DPD method.
In (3), the received signal is presented in matrix notation. Since the vector $v(j, k)$ is the same at all stations, the observed vectors of all stations can be concatenated into a single vector. The $h$th column of $A(j)$, denoted by $\alpha(j, p_h, b_h)$, corresponds to the $h$th emitter and is assumed to be factorable.
Mathematical Problems in Engineering
The additive noise vector $n(j, k)$ is assumed to be a realization of a circularly complex Gaussian process with zero mean. Its second-order moments are determined by the covariance matrix $W$, which represents the thermal noise as well as interference. In the case of spatially white noise, $W$ is the block diagonal matrix given in (10). The signals and the noise are assumed to be uncorrelated.
A matrix $U(j)$ is defined to construct the DPD estimator. The matrix $Z_{vv}(j)$ becomes diagonal for large $K$ if the signals are uncorrelated. The $h$th column of $U(j)$ and its $\ell$th subvector are denoted by $u_h(j)$ and $u_{\ell h}(j)$, respectively. The DPD estimator can be stated for a general noise covariance, and it takes a simpler form in the case of spatially white noise with the spectral density matrix defined in (10).
According to [5], the Cramér-Rao lower bound (CRLB) bounds the covariance of any unbiased estimator of the position vector when there are no model errors. The CRLB in (15), which is determined by the received signals $z$ and the locations of the stations $p$, is utilized as the reward function. The CRLB plays a major role in developing the passive location agents through MARL.

Multiagent Reinforcement Learning.
In reinforcement learning, an agent interacts with the environment to achieve a given goal. At time $t$, it observes state $s_t \in S$, where $S$ denotes the state space, takes action $a_t \in A$, where $A$ is the action space, receives reward $r_t \in \mathbb{R}$, and moves to the new state $s_{t+1} \in S$. The agent aims to learn a policy that maximizes the long-term reward. The action-value function, which is the expected return when starting from state $s$, taking action $a$, and following policy $\pi$ thereafter, is denoted by $Q^\pi(s, a)$ [13]:
$$Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s,\ a_t = a\Big],$$
where $\gamma \in [0, 1]$ is the discount factor that determines the importance of future rewards.
In multiagent reinforcement learning (MARL) [15,16], agents (robots, UAVs, sensors, etc.) interact with a shared environment to complete given tasks. The agents are learnable units that seek policies maximizing the long-term reward through interactions with the environment. Most MARL problems are classified as NP-hard [17] because of the sophisticated environments and the combinatorial nature of the problem.
In a cooperative MARL problem, agents must jointly optimize a cumulative scalar team reward over time. A centralized RL approach can be employed to solve the cooperation problem; that is, all state observations are merged and the problem is reduced to a single-agent problem with a combinatorial action space. However, according to Sunehag et al. [18], naive centralized RL methods fail to find the global optimum even when problems with such huge state and action spaces can be handled. The challenge lies in the fact that some agents may become lazy and fail to learn and cooperate as they should, which can cause the whole system to fail. They addressed these problems by training individual agents with a value decomposition network (VDN) architecture, in which the agents learn to decompose the team value function into agentwise value functions as follows:
$$Q(\rho, a) \approx \sum_{i=1}^{N} Q_i(\rho^i, a^i; \omega_i),$$
where $\rho = (\rho^1, \ldots, \rho^N)$ and $a = (a^1, \ldots, a^N)$ represent the observation-action history and the joint action, respectively, and $\omega_i$ denotes the value function parameters of agent $i$. VDN aims to learn an optimal linear value decomposition from the team reward signal by backpropagating the total $Q$ gradient through the deep neural networks that represent the individual value functions being summed. VDN thus solves the credit assignment among agents implicitly, without any specific reward for individual agents. Rashid et al. [19] treated the cooperative MARL problem as VDN does but imposed a constraint on the objective:
$$\frac{\partial Q}{\partial Q_i} \ge 0, \quad \forall i,$$
which is enforced by keeping the weights of the mixing network nonnegative and guarantees that the team value function is monotonic in every agentwise value function.
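The discounted return inside the action-value function can be illustrated with a minimal Python sketch. The function names `discounted_return` and `monte_carlo_q` are hypothetical and not part of the proposed method; the sketch assumes finite episodes whose reward sequences were all recorded after taking the same action in the same state.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for one finite episode."""
    g = 0.0
    # Accumulate backwards so that every step applies one more discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_q(reward_sequences, gamma):
    """Estimate Q(s, a) by averaging the discounted returns of episodes
    that all start with the same state-action pair (an assumption)."""
    returns = [discounted_return(seq, gamma) for seq in reward_sequences]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # 1 + 0.5 * 1 + 0.25 * 1 = 1.75
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A discount factor close to 0 makes the agent short-sighted, whereas a value close to 1 weights distant rewards almost as heavily as immediate ones.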
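A practical consequence of an additive decomposition is that the greedy joint action can be found by letting every agent maximize its own value function. The following Python sketch illustrates this with small lookup tables in place of neural networks; the table values and the function names are hypothetical.

```python
from itertools import product


def team_value(agent_q, joint_action):
    """VDN-style team value: the sum of the agentwise values."""
    return sum(q[a] for q, a in zip(agent_q, joint_action))


def decentralized_greedy(agent_q):
    """Each agent picks the action that maximizes its own value."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in agent_q)


def centralized_greedy(agent_q):
    """Brute-force search over the joint action space, for comparison."""
    spaces = [range(len(q)) for q in agent_q]
    return max(product(*spaces), key=lambda a: team_value(agent_q, a))


if __name__ == "__main__":
    # Hypothetical value tables: three agents with three actions each.
    agent_q = [[0.1, 0.7, 0.3], [0.5, 0.2, 0.9], [0.4, 0.6, 0.0]]
    print(decentralized_greedy(agent_q), centralized_greedy(agent_q))
```

The brute-force search grows exponentially with the number of agents, whereas the decentralized choice grows linearly, which is one reason why value decomposition scales better than naive centralized RL.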

MARL-Based Geometry Optimization
This section proposes a MARL-based geometric configuration optimization method for passive location systems.

Model Framework.
In this paper, a DPD location system is considered with $L$ mobile stations (e.g., UAVs equipped with positioning equipment), i.e., $L$ DPD agents. Each agent transfers the intercepted signals to a central processing agent, where the emitter's position is estimated. Agents have no knowledge of the emitter or the electromagnetic environment. Because of multipath and noise, the signals received by different agents vary. To adapt accurately to the complicated spatial distribution of the electromagnetic environment, a MARL-based method is considered in which the positioning error serves as the reward function.
The key elements in the MARL scheme are defined as follows. Two kinds of positioning error are candidates for the reward: (a) The CRLB is an effective index for evaluating the precision of a passive location system. Let the true position of the transmitter be $p_\star = (x_\star, y_\star)$. Then, the CRLB is a function of the state $s$ and the true position $p_\star$, i.e., $\mathrm{CRLB}(p_\star, s)$, according to (15). (b) The statistical error is a popular class of position errors, such as the mean error (ME) and the mean square error (MSE).
Among the errors listed above, only the CRLB can assess the geometry without estimating the target position, which saves considerable time and computation in training. Therefore, the CRLB is used as the reward function in training the DPD agents.

Learn to Optimize the Geometry.
This section presents an efficient multiagent actor-critic algorithm for geometric configuration optimization in passive location tasks. The overall architecture of the proposed method is illustrated in Figure 2. It is developed based on two main considerations: (i) factorizing the global value function into individual value functions with local observations for better collaboration and (ii) utilizing information constraints to facilitate communications and optimize the messages to tackle the transmission challenges.

Value Decomposition.
As shown in Figure 2, the global value function is factorized into a linear combination of individual value functions over the joint history $\rho = (\rho^1, \ldots, \rho^N)$, where $\rho^i = ((s_1^i, a_1^i), \ldots, (s_t^i, a_t^i))$ is the history of local observations, actions, and messages received by agent $i$. The local value functions are parameterized by $\omega_V$. The policy of each station maps the history of observations and actions to the next action: $\pi_{\theta_i}(\rho^i, a^i)$. The joint policy for the location system is denoted by $\pi_\theta(\rho, a) = \prod_i \pi_{\theta_i}(\rho^i, a^i)$. Both the actor and the critic of each agent use a gated recurrent unit (GRU) [20] to process the observation history. The GRU is a kind of recurrent neural network that can capture long-range dependencies among states. The mixing network and the individual value functions are trained end to end by minimizing the temporal-difference (TD) loss.
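For orientation, a linear factorization and a one-step TD loss of the usual form can be written as in the sketch below. This is an illustration under stated assumptions rather than the exact expressions of the proposed method: the mixing weights $w_i$, the symbols $Q_{tot}$ and $L_{TD}$, the target parameters $\omega_V^{-}$, and the next-step quantities $\rho'$ and $a'$ are notation introduced here.

```latex
% Assumed notation: w_i (mixing weights), Q_tot, L_TD, \omega_V^{-} (target parameters)
Q_{tot}(\rho, a; \omega_V) = \sum_{i=1}^{N} w_i \, Q_i(\rho^i, a^i; \omega_V),
\qquad
L_{TD}(\omega_V) = \mathbb{E}\Big[\big(r + \gamma \, Q_{tot}(\rho', a'; \omega_V^{-}) - Q_{tot}(\rho, a; \omega_V)\big)^2\Big]
```

Here $r$ is the team reward, $\gamma$ is the discount factor, and $a'$ would be the joint action chosen by the current policies at the next step.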

Information Constraint.
The central station must collect observations from all the stations to estimate the transmitter's position. A vice station $j$, by contrast, needs only the data that help it make better decisions. Therefore, the central station must learn to send messages that are as short as possible yet sufficient for the vice stations to act better. A natural solution is to add information constraints. In practice, to improve the effectiveness of the messages sent to vice stations, it is necessary to maximize the mutual information between the messages and the stations' actions. Let $c$ be the index of the central station; the mutual information $I(m_c, a_v)$ is then defined by (21), where $m_c$ represents the message sent from the central station to a vice station, $m_j^{in} = m_c$ $(j \ne c)$ is the message received by vice station $j$, and $a_v$ is the joint action of all vice stations.
However, if this were the only objective, agents could always ensure a maximally informative representation by taking the identity encoding of the raw data ($m_c = \rho$), which contradicts the goal of reducing transmission. To make the messages more concise, their complexity is limited by the constraint $I(m_c, \rho) \le H_0$. It is then possible to learn an encoding $m_c$ that is maximally expressive about $a_v$ while being maximally compressive about $\rho$. With the Lagrange multiplier $\beta$, this yields the information bottleneck (IB) objective, in which $\omega_{IB}$ represents the parameters of the encoder and decoder networks. The value networks are then trained together with the encoder and the decoder by minimizing an overall objective, where $\omega$ consists of $\omega_V$ and $\omega_{IB}$ and $\lambda$ is the weight that trades off the two subobjectives. The policy of station $i$ is optimized by gradient ascent along the policy gradient [21], where $\theta_i$ refers to the parameters of station $i$'s policy and $\eta > 0$ is the step size. The details of the training process are shown in Algorithm 1.
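To make the quantity concrete, the mutual information between a discrete message and a discrete action can be computed directly from their joint probability table. The Python sketch below uses the textbook definition; the function name `mutual_information` and the toy tables are hypothetical, and the proposed method would estimate this quantity with neural networks rather than with an explicit table.

```python
import math


def mutual_information(joint):
    """I(X; Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal of the message
    py = [sum(col) for col in zip(*joint)]  # marginal of the action
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:  # zero-probability cells contribute nothing
                mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    independent = [[0.25, 0.25], [0.25, 0.25]]  # message tells nothing
    informative = [[0.5, 0.0], [0.0, 0.5]]      # message determines action
    print(mutual_information(independent), mutual_information(informative))
```

A message that is independent of the actions carries zero mutual information, whereas a message that fully determines a binary action carries $\log 2$ nats.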
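One common way to write these objectives is sketched below. It is a hedged illustration rather than the exact formulation of the proposed method: the names $L_{IB}$, $L_{TD}$ (the TD loss of the value networks), and $J(\theta_i)$ (the expected return of station $i$'s policy) are assumed notation, and the constant term involving $H_0$ is dropped from the Lagrangian.

```latex
% Assumed notation: L_IB, L_TD, J(\theta_i)
L_{IB}(\omega_{IB}) = -I(m_c, a_v) + \beta \, I(m_c, \rho), \qquad
L(\omega) = L_{TD}(\omega_V) + \lambda \, L_{IB}(\omega_{IB}), \qquad
\theta_i \leftarrow \theta_i + \eta \, \nabla_{\theta_i} J(\theta_i)
```

A larger $\beta$ penalizes message complexity more strongly and therefore favors shorter messages, at the possible cost of less informative ones.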

Experiments
In this section, we develop a simulated electromagnetic environment for passive location tasks, based on which the agents are evaluated.

Environment.
In the experiment, the simulator covers a geographical area of 10 km × 10 km, as shown in Figure 3. The transmitter is located at the center of the map and is equipped with an isotropically radiating antenna. The signal model defined by (1) is employed with some modifications. The channel attenuation is a function of the receiver's position $p$: $b_\ell(p) \propto \lambda_s / (4\pi d)$, which follows the free-space path loss, with $\lambda_s$ the signal wavelength and $d$ the distance between the station and the transmitter. The noise and interference, as well as the multipath effect, are all encompassed in the noise term $n_\ell$, which is modeled by the spatially white noise in (10). There are some low SNR regions, highlighted in green in Figure 3, where the noise is stronger than in other areas. Because of these low SNR regions, the SNR contours turn into irregular concentric rings. Furthermore, since in the real world it is impossible to approach the transmitter too closely, there is a forbidden 1 km area around the transmitter in the simulator.
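The attenuation rule and the forbidden area can be sketched in a few lines of Python. The sketch makes several assumptions that are not stated above: positions are in meters, the proportionality constant is taken as 1, and the forbidden area is interpreted as a disc of radius 1 km centered on the transmitter. The function names are hypothetical.

```python
import math


def channel_gain(station, transmitter, wavelength):
    """Free-space amplitude attenuation lambda / (4 * pi * d).
    The proportionality constant is assumed to be 1."""
    d = math.dist(station, transmitter)
    return wavelength / (4.0 * math.pi * d)


def in_forbidden_zone(station, transmitter, radius_m=1000.0):
    """True if the station is closer to the transmitter than radius_m
    (assumed interpretation of the forbidden 1 km area)."""
    return math.dist(station, transmitter) < radius_m


if __name__ == "__main__":
    tx = (0.0, 0.0)
    near = channel_gain((1000.0, 0.0), tx, wavelength=0.3)
    far = channel_gain((2000.0, 0.0), tx, wavelength=0.3)
    print(near / far)  # doubling the distance halves the amplitude
```

Under this rule, a simulator could reject any move whose end point falls inside the forbidden disc before applying it.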

Setup.
Consider one central station and $L - 1$ vice stations with the task of cooperatively optimizing the geometric configuration in an area consisting of free propagation regions, low SNR regions, and forbidden regions. At each time step $t$, the stations observe the environment to obtain the state $s_t$ and decide on an action $a_t$, i.e., moving in direction $\varphi$ over distance $d$. While moving, the stations shut off their positioning and communication devices until they arrive at the next positions. The location task ends when the time step $t$ reaches the maximum $t_{max}$. Figure 4 shows the process of executing a passive location task, with the training and execution modes in different branches. With the geometry formed by the stations at each time step $t$, the reward is given by the theoretical error bound, the CRLB, which depends on the received signals $z$ and the positions $p = (p_1, \ldots, p_L)$ of all $L$ stations. In addition, the root mean square error (RMSE) is calculated to describe the positioning error more intuitively:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N_{est}} \sum_{k=1}^{N_{est}} \big\| p_\star^k - p_\star \big\|^2},$$
where $p_\star^k$ is the $k$th estimate of $p_\star$ and $N_{est}$ denotes the number of estimations for each geometric configuration.
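The movement action and the RMSE can be illustrated with a short Python sketch. It assumes planar coordinates in meters and a direction $\varphi$ measured in radians counterclockwise from the $x$-axis; neither convention is specified above, and the function names `move` and `rmse` are hypothetical.

```python
import math


def move(position, phi, d):
    """Apply one action: travel distance d in direction phi (radians,
    counterclockwise from the x-axis; an assumed convention)."""
    x, y = position
    return (x + d * math.cos(phi), y + d * math.sin(phi))


def rmse(estimates, true_position):
    """Root mean square distance between the position estimates
    and the true transmitter position."""
    squared = [math.dist(e, true_position) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    print(move((0.0, 0.0), 0.0, 100.0))
    print(rmse([(3.0, 4.0), (-3.0, -4.0)], (0.0, 0.0)))
```

Unlike the CRLB, the RMSE requires running the position estimator several times for every geometry.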

Results and Analysis.
The agents are trained on the passive positioning task described above with the maximum time step set to 100. For comparison, a basic version of the proposed method is also evaluated, in which the central station sends nothing except the reward (naive DPD agents). The top segment of Figure 5 shows the learning curves in terms of the averaged reward for DPD agents with communications versus naive DPD agents. DPD agents with communications converge to a much higher return than the naive DPD agents, which indicates that, with the messages sent by the central station, vice stations can estimate the value function more accurately. In other words, communications are essential to geometry optimization in DPD location tasks. The bottom segment of Figure 5 plots the information bottleneck loss $L_{IB}$ against the training epochs; $L_{IB}$ declines quickly during training. The proposed agents thus achieve higher positioning precision with lighter communication overhead.
To show the learned decomposition of value functions, Figure 6 presents the error curves, the normalized value functions, and the agents' situations when the learned agents perform a certain DPD positioning task. According to the top segment of Figure 6, both the CRLB and the RMSE decline as the agents take more steps. Furthermore, the RMSE of positioning converges to the CRLB over the optimization steps. This means that the agents can find geometric configurations in which the estimation error comes closer and closer to the CRLB, which is the best achievable result for passive location systems. The middle and bottom segments of Figure 6 show that when the agents are in the low SNR area, their value functions decrease and the positioning error increases, which is consistent with experience.
Figure 7 shows the final geometric configuration found by the proposed agents as well as that optimized by the GA [1]. In the geometry yielded by the GA, one station lies in the low SNR region, which makes the geometry suboptimal. In other words, the GA optimizes the geometry on the empirical model, which cannot identify the low SNR regions in the simulator. By contrast, the trained agents avoid the low SNR regions and find the optimal geometry successfully.

(1) Initialize the DPD passive location system with the target transmitter emitting signals; specify the number of stations $L$ and the central station $c$;
(2) Initialize the neural network parameters $\omega$, $\theta$;
(3) Initialize the iteration counter $t \leftarrow 0$;
(4) repeat
(5)   for $i = 1 : L$, $i \ne c$ do
(6)     Intercept the signals $o_t^i$;
(7)     Send the state $(s_t^i, a_t^i)$ to the central station;
(8)   end for
(9)   The central station intercepts the signals $o_t^c$ and sends $m_c$ to the vice stations;
(10)  Update the parameters of the value networks: $\omega \leftarrow \omega - \eta \nabla_\omega L(\omega)$;
(11)  for all $i$ do
(12)    Update the parameters of the policy network of station $i$ by gradient ascent;
(13)  end for
(14)  Update the counter $t \leftarrow t + 1$;
(15) until the task is completed or the counter reaches its maximum
Algorithm 1: Geometric configuration optimization with multiagent reinforcement learning.

Conclusions
This paper analyzed the geometry optimization problem of passive location systems in a complex electromagnetic environment and proposed a MARL method that addresses it in a trial-and-error fashion. In the method, by factorizing the global value function into agentwise value functions, agents learn to optimize the geometric configuration cooperatively. Moreover, by adding mutual information constraints, the communication traffic from the central station to the vice stations can be greatly reduced while effectiveness is ensured. A simulator with a sophisticated electromagnetic environment for passive location tasks was also developed, and the results obtained on it showed that the agents could find better geometric configurations than existing methods.
This paper should be seen as a first attempt at learning geometric configuration optimization through MARL in a passive location task. Although DPD is used in the proposed method, it can be replaced by any other passive location algorithm (e.g., TDOA or AOA) to make the method more flexible across location scenarios.

Data Availability
The data used to support the findings of the study are available from Shengxiang Li (lishengxiangzz@163.com) upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Figure 7: Optimized geometric configuration compared to the existing methods. The transmitter is at (0, 0); the legend entries are GA and our agents. Red: geometry found by our trained agents. Blue: geometry found by PSO, with one station in the low SNR region.