Trading and Pricing Sensor Data in Competing Edge Servers with Double Auction Markets

With the development of the Internet of Things (IoT), sensor networks can produce a large amount of valuable data. In addition to being utilized by local IoT applications, the data can also be traded in the connected edge servers. As an efficient resource allocation mechanism, the double auction has been widely used in stock and futures markets and can also be applied to data resource allocation in sensor networks. Currently, there usually exist multiple edge servers running double auctions that compete with each other to attract data users (buyers) and data producers (sellers). Therefore, the double auction market run on each edge server needs an efficient mechanism to improve its allocation efficiency. Specifically, the pricing strategy of the double auction plays an important role in determining traders' profits, and thus will affect traders' market choices and bidding strategies, which in turn affect the outcome of the competition between double auction markets. In addition, traders' trading strategies will also affect the markets' pricing strategies. Therefore, we need to analyze both the double auction markets' pricing strategies and the traders' trading strategies. Specifically, we use deep reinforcement learning combined with mean field theory to solve this problem, which has a huge state and action space. For trading strategies, we use the Independent Parametrized Deep Q-Network (I-PDQN) algorithm combined with mean field theory to compute the Nash equilibrium strategies. We then compare it with the fictitious play (FP) algorithm. The experimental results show that the I-PDQN algorithm is significantly faster than the FP algorithm. For pricing strategies, the double auction markets dynamically adjust their pricing strategies according to traders' trading strategies. This is a sequential decision-making process involving multiple agents. Therefore, we model it as a Markov game.
We adopt the Multiagent Deep Deterministic Policy Gradient (MADDPG) algorithm to analyze the Nash equilibrium pricing strategies. The experimental results show that the MADDPG algorithm solves this problem faster than the FP algorithm.


Introduction
With the development of the Internet of Things (IoT), smart terminals embedded with a large number of sensors such as cameras, GPS, and gyroscopes are becoming more and more common in daily life [1], and massive amounts of data are collected by them [2]. In addition to being utilized by local smart IoT applications, these valuable data can be traded in the connected edge servers, which on one hand provide computing resources for smartphone applications, and on the other hand provide a market mechanism for trading data between data users (referred to as buyers) and data generators (referred to as sellers) [3]. For example, traffic information can be collected from smartphones to an edge server and then sold to navigation applications for optimizing route planning. In this scenario, the double auction, an auction mechanism in which there are multiple buyers and sellers (referred to as traders in the following), can be used by the edge server for trading data between buyers and sellers. In this mechanism, buyers and sellers can bid at any time during trading, and the market matches buyers with sellers who have submitted bids at a specified time. This mechanism allows traders to enter the market at any time and trade multiple commodities at the same time. Due to its high allocation efficiency, the double auction has been widely used to solve real-world resource allocation problems, such as in the stock market [4], the emission trading market [5], the spectrum auction market [6], the cloud computing resource allocation market [7], and the sensor network resource allocation market [3]. In such markets, both the traders and the double auction market need to adopt efficient trading strategies and market mechanisms [8].
In the real world, a large number of traders may trade data on an edge server, and there may exist multiple edge servers running double auction markets. In this situation, traders need to decide which double auction to participate in and how to bid in the selected market, while each double auction market needs an efficient mechanism to improve its allocation efficiency and attract more traders. Since the pricing strategy determines the price at which traders trade, it affects traders' profits significantly. At the same time, the trading strategies (i.e., how traders choose markets and bid) will in turn affect the markets' pricing strategies, thereby affecting the market allocation efficiency. Therefore, we need to analyze traders' trading strategies and the double auctions' pricing strategies in an environment with multiple competing edge servers running double auction markets.
In double auctions, both the market and the traders are self-interested agents, and their strategies affect each other. Game theory is widely used to analyze the strategic interactions of self-interested agents, in which Nash equilibrium is an important solution concept. Therefore, we analyze the Nash equilibrium trading strategies and pricing strategies in this competing environment. Specifically, this problem involves a large number of traders, who may have continuous bidding spaces and private preferences. Although the generalized FP algorithm can solve similar problems, it is difficult for it to compute the Nash equilibrium in a feasible time when there are a large number of traders. In this paper, we analyze the Nash equilibrium trading strategies and pricing strategies based on deep reinforcement learning and mean field theory. For the Nash equilibrium trading strategies, we combine the Independent Parametrized Deep Q-Network (I-PDQN) algorithm, which is suitable for problems with hybrid actions, with mean field theory [9, 10]. The experimental results show that this algorithm solves the problem significantly faster than the FP algorithm. For the Nash equilibrium pricing strategies, we adopt the Multiagent Deep Deterministic Policy Gradient (MADDPG) algorithm. We find that the Nash equilibrium pricing strategy obtained by this algorithm is the same as the solution of the FP algorithm, and the MADDPG algorithm solves the problem faster. The experimental results can also provide useful insights for designing practical trading strategies and pricing strategies in the real world.
The rest of this paper is structured as follows. In Section 2, we introduce the related work. In Section 3, we introduce the basic settings of the double auction market. In Section 4, we analyze the Nash equilibrium trading strategy based on the I-PDQN algorithm and mean field theory. In Section 5, we use the MADDPG algorithm to solve the Nash equilibrium pricing strategy. Finally, we conclude the paper in Section 6.

Related Work
There exist a number of works about data acquisition [11][12][13][14]. Specifically, Sangoleye et al. studied the data acquisition problem from SIoT nodes following a techno-economics-based approach exploiting contract theory. Chung et al. proposed a test-bed consisting of on-body sensors and an Android mobile device to acquire human activity data and then used an LSTM network to recognize human behavior. Ho et al. proposed a framework that uses unmanned aerial vehicles (UAVs) to collect data and used the Particle Swarm Optimization (PSO) method to find the optimal topology in order to reduce the energy consumption, Bit Error Rate (BER), and UAV travel time. Maksymova et al. studied LiDAR sensor data acquisition and compression for automotive applications.
There also exist many works about data trading, such as [15][16][17]. Specifically, Tian et al. proposed a market mechanism considering privacy leakage for trading IoT data in one-to-many trading scenarios [18]. They further proposed a many-to-many data trading strategy, which redefines some unreasonable assumptions of the existing mechanisms [19]. Yu et al. proposed a market model to trade mobile data between mobile users by taking into account data demands and demand uncertainty [20]. Hui et al. proposed a sensing service system that considers the utilities of data providers and data service providers with a data pricing strategy in vehicle sensor networks [21]. Niyato et al. proposed a data market model for IoT data [22]. Al-Fagih et al. proposed a data pricing model for public sensing data by considering delay, quality of service, and trust factors [23]. Furthermore, the double auction, as a highly efficient resource allocation mechanism, has been widely used in data trading markets. For example, Jiao et al. designed a double auction-based data market model and pricing mechanism to maximize the profit [24]. Chen et al. used a double auction to trade sensor data [25]. Sun et al. used edge servers as a double auction market to solve the problem of insufficient computing resources [26]. Cai et al. proposed a truthful double auction mechanism for the data trading market that addresses three major challenges: diverse market preferences, the complex conflict-of-interest relations among data consumers, and the strategic behaviors of both sides [27].
Trading strategies and pricing strategies play an important role in the double auction market, and therefore, there exist a number of works on trading strategies and pricing strategies in double auctions. For trading strategies, Gode et al. first proposed the "Zero Intelligence" (ZI) trading strategy [28], in which traders can only randomly select bids and all bids are uniformly distributed. Brown proposed the fictitious play algorithm (FP algorithm for short) [29], in which each trader estimates the FP beliefs of other traders through historical bids and calculates the current best response strategy based on them; however, the original FP algorithm does not scale well to games with a large number of traders [31]. Schvartzman and Wellman first combined empirical game theory with the Q-learning algorithm in reinforcement learning to analyze the optimal trading strategy of traders in the double auction market [32], but this algorithm is only suitable for a small and discrete space of bidding actions. Chowdhury et al. proposed a trading strategy using Monte Carlo Tree Search (MCTS) [33]. However, this algorithm is suitable for discrete bidding sets and cannot deal with bidding problems with continuous types and action spaces. Bredin and Parkes designed a framework for truthful bidding in the double auction market [34]. Furthermore, there also exist some works analyzing market pricing strategies in a competing environment with multiple double auction markets. Miller and Niu experimentally analyzed traders' market selection strategies in a competing-marketplaces trading environment [35]. Cai et al. analyzed the impact of different adaptive strategies on the trading strategy and a market's own earnings in the market competition environment [36]. Shi et al. considered two different pricing strategies and analyzed how two competing markets should adjust their pricing strategies to attract traders [37], and then considered four typical types of fees in pricing strategies to analyze the Nash equilibrium market selection in a competing environment [38].
From the above works on data trading strategies, we can find that there exist few works on trading with continuous types and action spaces under incomplete information, and most of the above works only consider a small number of traders when analyzing the Nash equilibrium solution. Regarding market pricing strategies, although some works consider the competing environment, they have not considered how the market should adjust its pricing strategy in an incomplete-information game with a large number of traders. In this paper, we analyze the Nash equilibrium trading strategies for sensor data and the pricing strategies of the double auction markets running on edge servers in a competing environment with a large number of traders.

Basic Settings
In this section, we introduce the basic settings of the traders and of the double auction market running on the edge server. We first describe the basic settings of traders and introduce how to compute their expected profits. We then introduce the pricing strategy of double auctions and describe how to compute the allocation efficiency of the double auction market.
3.1. Basic Setting of the Trader. In this paper, the traders consist of data buyers and data sellers. The set of buyers is denoted as B = {1, 2, ⋯, B}, the set of sellers is denoted as S = {1, 2, ⋯, S}, and the set of all markets is denoted as M = {1, 2, ⋯, M}. Each trader has a type: the buyer's type indicates the highest price at which the buyer is willing to buy an item, and the seller's type is the lowest price at which the seller is willing to sell the item. The type thus indicates the trader's preference for the item. The types of a buyer and a seller are denoted as θ^b ∈ [0, 1] and θ^s ∈ [0, 1], respectively, and are private information; that is, the type of each specific buyer or seller is unknown to others. However, the type distributions are assumed to be common knowledge, and types are i.i.d. drawn from the cumulative distribution functions F^b and F^s on [0, 1], respectively, which are assumed to be differentiable with probability density functions f^b and f^s. We assume that a small cost τ is incurred when a trader enters the market (for example, the time it takes for online trading). Therefore, when the buyer's type is too low or the seller's type is too high, they choose not to enter the market. In doing so, the behavior of buyers submitting low bids in the market can be distinguished from the behavior of not entering the market at all, and similarly for sellers submitting high asks. Next, we describe how traders choose a market and bid in it. We define the action of a buyer as a tuple δ^b = ⟨m, d^b_m⟩, so the buyer's market choice and the bid in the selected market together form the trading strategy. Note that m ≠ 0 means that the buyer bids d^b_m in market m, and m = 0 means that the buyer does not enter any market. Similarly, we use δ^s = ⟨m, d^s_m⟩ to represent the seller's action.
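To make the setting concrete, the trader population and action encoding above can be sketched as follows (a minimal illustration in which F^b and F^s are taken to be uniform on [0, 1]; the function names and the numeric value of the entry cost τ are our own assumptions):

```python
import random

NUM_MARKETS = 2      # markets are labeled 1..M; action m = 0 means "stay out"
ENTRY_COST = 0.02    # assumed value of the small entry cost tau

def draw_types(num_buyers, num_sellers, rng):
    """Draw i.i.d. private types; here F^b and F^s are uniform on [0, 1]."""
    buyers = [rng.random() for _ in range(num_buyers)]
    sellers = [rng.random() for _ in range(num_sellers)]
    return buyers, sellers

def trader_action(market, bid):
    """A trader's action delta = <m, d_m>; m = 0 means not entering any market."""
    assert market in range(NUM_MARKETS + 1)
    return (market, bid if market != 0 else None)

rng = random.Random(0)
buyers, sellers = draw_types(50, 50, rng)
```

The 50/50 population matches the experimental setting used later in the paper.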
3.1.1. Trader's Expected Utility. In this section, we introduce how to compute the expected utilities of traders. In what follows, we show how a seller's expected utility is calculated; the buyer's expected utility can be derived in the same way. The expected utility of a seller is determined by its type θ^s, its action δ^s = ⟨m, d^s_m⟩, and its beliefs about the actions of the other buyers and sellers, Ω^b_m and Ω^s_m. We define Ω^s_m as a tuple ⟨d^s_{1,m}, d^s_{2,m}, ⋯, d^s_{η,m}⟩, where d^s_{i,m} represents the i-th smallest seller bid in market m. The number of sellers taking each distinct action is denoted as a tuple x = ⟨x_1, x_2, ⋯, x_η⟩, where x_i represents the number of sellers choosing action d^s_{i,m}. The seller's position is then determined as follows. We use X^<_m(Ω^s_m, d^s_{i,m}) to represent the number of other sellers who bid lower than d^s_{i,m} in market m, which can be calculated as X^<_m(Ω^s_m, d^s_{i,m}) = Σ_{j=1}^{i−1} x_j. Similarly, excluding the seller itself, we use X^=_m(Ω^s_m, d^s_{i,m}) to represent the number of other sellers who bid the same as the seller, which can be calculated as X^=_m(Ω^s_m, d^s_{i,m}) = x_i − 1. Now, any position from X^<_m(Ω^s_m, d^s_{i,m}) + 1 to X^<_m(Ω^s_m, d^s_{i,m}) + X^=_m(Ω^s_m, d^s_{i,m}) + 1 can be taken by a seller bidding d^s_{i,m}; this position is denoted as v_m ∈ V_m, where V_m is the set of all possible positions. Since the tied positions are equally likely, the probability of any v_m in the set V_m is P(v_m) = 1/(X^=_m(Ω^s_m, d^s_{i,m}) + 1). The seller's expected value for a given position can then be calculated using φ(v_m, Ω^b_m, d^s_{i,m}), which indicates whether the seller can trade in the market at that position.
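The position bookkeeping above can be illustrated in code (a minimal sketch following the definitions of X^<_m, X^=_m, and P(v_m); the function names are our own):

```python
def count_lower(asks_with_counts, my_ask):
    """X^<_m: number of other sellers asking strictly below my_ask.
    asks_with_counts maps each distinct ask d^s_{i,m} to its count x_i."""
    return sum(x for d, x in asks_with_counts.items() if d < my_ask)

def count_equal(asks_with_counts, my_ask):
    """X^=_m: number of *other* sellers with the same ask (excluding me)."""
    return asks_with_counts.get(my_ask, 1) - 1

def position_probability(asks_with_counts, my_ask):
    """Each of the X^= + 1 tied positions is equally likely."""
    return 1.0 / (count_equal(asks_with_counts, my_ask) + 1)

asks = {0.2: 2, 0.3: 3, 0.5: 1}   # distinct asks -> number of sellers
assert count_lower(asks, 0.3) == 2
assert count_equal(asks, 0.3) == 2
assert abs(position_probability(asks, 0.3) - 1 / 3) < 1e-9
```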
Specifically, φ(v_m, Ω^b_m, d^s_{i,m}) equals 1 if v_m ≤ Y_m and 0 otherwise, where Y_m represents the total number of buyers in the market whose bids are no less than d^s_{i,m}.
Considering all positions v_m in market m, the expected value of the seller is obtained by summing P(v_m) · φ(v_m, Ω^b_m, d^s_{i,m}) over all v_m ∈ V_m. The derivation process of the buyer's expected value is the same.
Then, we derive the equation for calculating the expected payment of the seller; the expected utility can then be calculated as the expected payment minus the expected value. According to the equilibrium k pricing strategy, we can determine the equilibrium price range [d_l, d_h], and the price is p_k = d_l + k × (d_h − d_l). From this, the seller's expected payment for bidding d^s_{i,m} and the corresponding expected utility follow; the derivation process for the buyer is similar. Now, at market auction stage t, assuming that the seller's trading strategy is δ^s = (g, d^s_g), the seller's immediate reward is its expected utility in the selected market g, and the accumulative reward of the seller is the discounted sum of its immediate rewards over stages, where γ is the discount factor in reinforcement learning, indicating the degree of importance of future rewards. The derivation for the buyer is the same.
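The equilibrium-k price and the discounted accumulative reward described above can be sketched as follows (a minimal illustration; the function names are our own, and γ = 0.95 matches the experimental setting used later):

```python
def equilibrium_k_price(d_low, d_high, k):
    """Trading price p_k = d_l + k * (d_h - d_l) inside the equilibrium range."""
    assert 0.0 <= k <= 1.0
    return d_low + k * (d_high - d_low)

def accumulative_reward(immediate_rewards, gamma=0.95):
    """Discounted sum of per-stage rewards; gamma weighs future rewards."""
    return sum(r * gamma ** t for t, r in enumerate(immediate_rewards))

# k = 0.5 splits the equilibrium range evenly between buyer and seller.
assert abs(equilibrium_k_price(0.4, 0.6, 0.5) - 0.5) < 1e-9
```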

3.2. Market Setting.
We now introduce the basic settings about the pricing strategy of double auctions.
3.2.1. Equilibrium k Pricing Strategy. In this paper, it is assumed that all markets adopt the equilibrium k pricing strategy, in which the pricing parameter of a market is k ∈ [0, 1]. The pricing parameter of market m is denoted as k_m, and the competing pricing strategies of the M markets are δ^M = ⟨k_1, k_2, ⋯, k_M⟩. In equilibrium k pricing, the equilibrium price range is E_p. After equilibrium matching, traders who are successfully matched (the matched seller's asking price does not exceed the buyer's bid) can trade at any price within the equilibrium price range. The set of successfully matched buyer-seller pairs is {⟨b_1, s_1⟩, ⟨b_2, s_2⟩, ⋯, ⟨b_{l_m}, s_{l_m}⟩}, and the corresponding set of bids is {⟨d^b_1, d^s_1⟩, ⋯, ⟨d^b_{l_m}, d^s_{l_m}⟩}. According to the above conditions, the equilibrium price interval must be a subinterval of the interval (d^s_{l_m+1}, d^b_{l_m+1}). Under equilibrium k pricing, all traders trade at the same price within E_p, and the trading price is p_k = d_l + k × (d_h − d_l). Obviously, when k is larger, the price biases toward the seller; otherwise, it biases toward the buyer.
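Equilibrium matching can be sketched by sorting bids in descending order and asks in ascending order and matching while the ask does not exceed the bid. The sketch below is our own illustration; the returned price interval [d_l, d_h] is the standard competitive-equilibrium range bounded by the marginal matched and unmatched bids:

```python
def equilibrium_match(buyer_bids, seller_asks):
    """Match the highest bids with the lowest asks while d^b >= d^s.
    Returns the matched pairs and the equilibrium price interval [d_l, d_h]:
    any price in it is acceptable to all matched traders and excluded for
    all unmatched ones."""
    bids = sorted(buyer_bids, reverse=True)
    asks = sorted(seller_asks)
    l, pairs = 0, []
    while l < min(len(bids), len(asks)) and bids[l] >= asks[l]:
        pairs.append((bids[l], asks[l]))
        l += 1
    if l == 0:
        return [], None
    d_low = max(asks[l - 1], bids[l] if l < len(bids) else 0.0)
    d_high = min(bids[l - 1], asks[l] if l < len(asks) else 1.0)
    return pairs, (d_low, d_high)

pairs, (d_low, d_high) = equilibrium_match([0.9, 0.7, 0.3], [0.2, 0.4, 0.8])
assert pairs == [(0.9, 0.2), (0.7, 0.4)]
assert (d_low, d_high) == (0.4, 0.7)
```

Combining this with the price rule p_k = d_l + k × (d_h − d_l) yields the single uniform trading price of the market.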

3.2.2. Allocation Efficiency.
We now introduce how to compute the allocation efficiency of the markets. Allocation efficiency is one of the most important metrics for measuring the performance of a double auction. It is the ratio of the actual profit obtained by all buyers and sellers in the market to the maximum profit theoretically obtained when they submit their types as their bids, which is AE = Σ_{i∈TB} [(θ^b_i − TP_i) + (TP_i − θ^s_i)] / Σ_{i∈TB*} [(θ^b_i − TP*_i) + (TP*_i − θ^s_i)], where TB is the set of actual transactions made by traders, θ^b_i is the type of the buyer in transaction i, θ^s_i is the type of the seller in transaction i, TP_i is the transaction price of transaction i, TB* is the set of transactions when traders submit their types as their bids, and TP*_i is the transaction price of transaction i when traders submit their types as their bids.
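Under this definition, each transaction i contributes (θ^b_i − TP_i) + (TP_i − θ^s_i) = θ^b_i − θ^s_i to the total profit, so the transaction price cancels out. A minimal sketch (function name our own):

```python
def allocation_efficiency(actual_pairs, truthful_pairs):
    """Ratio of realized total profit to the theoretical maximum. Each pair is
    (buyer type, seller type) of a matched transaction; the buyer earns
    theta^b - TP and the seller earns TP - theta^s, so the price cancels."""
    realized = sum(b - s for b, s in actual_pairs)
    maximum = sum(b - s for b, s in truthful_pairs)
    return realized / maximum if maximum > 0 else 0.0

# One transaction realized out of two that truthful bidding would produce:
eff = allocation_efficiency([(0.9, 0.2)], [(0.9, 0.2), (0.7, 0.4)])
assert abs(eff - 0.7) < 1e-9
```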

3.2.3. Market Reward.
In this paper, each competing double auction market intends to maximize its allocation efficiency by adopting an efficient pricing strategy in order to attract traders. Therefore, we take the market's allocation efficiency as its reward.
In each stage t, each market publishes its pricing action. Traders then choose a market to participate in and bid according to their trading strategies. When all participating traders have bid, each market matches buyers with sellers according to the equilibrium matching strategy. According to equation (12), the immediate reward of a market at stage t is its allocation efficiency in that stage, and the accumulative reward of the market is the γ-discounted sum of its immediate rewards.

Nash Equilibrium Trading Strategy
When traders choose which edge server market to participate in and how to bid, their strategies affect each other. Therefore, we need to derive the Nash equilibrium trading strategies. In this paper, all traders use reinforcement learning to improve their trading strategies until all traders have converged; at this point, the traders have reached the Nash equilibrium strategy. It should be noted that although the learning process is repeated, the game we study is essentially a one-shot game, meaning that all participants play only one round of the game. In this repeated learning process, agents choose an action in the current state according to the observed information of previous states and the obtained profit, and then enter the next state. This is a sequential decision-making process. Therefore, we model it as a Markov decision process and use deep reinforcement learning to solve the Nash equilibrium strategy. We use the I-PDQN (Independent Parametrized Deep Q-Network) algorithm to analyze traders' Nash equilibrium trading strategies and evaluate it against the FP algorithm [40] in terms of computation speed and convergence result. We assume that there are two competing edge server double auction markets; when the number of markets is greater than 2, our method is still applicable. In each stage, traders need to select a market and bid. Therefore, the trading strategy consists of two parts: choosing a market, where the action space is discrete, and bidding, where the action space is continuous. The whole trading action is thus a hybrid action with discrete and continuous parts. Furthermore, this problem involves a large number of traders. Therefore, we solve the trading strategy problem of a large number of traders with hybrid actions based on the I-PDQN algorithm and mean field theory.
4.1. I-PDQN Algorithm. As discussed above, the P-DQN algorithm [41] is applicable to the hybrid action space of a single agent. This algorithm was later extended to environments with multiple cooperative agents [42]. However, traders in the double auction market are not cooperative, and therefore, we extend it to environments with multiple noncooperative agents, calling the result the I-PDQN algorithm. In the following, we first briefly introduce the P-DQN algorithm and then introduce the I-PDQN algorithm.
The P-DQN algorithm can deal with problems with a hybrid action space. The idea is to update the discrete action strategy and the continuous action strategy, respectively, by combining the DQN algorithm [43] and the DDPG algorithm [44]. In the P-DQN algorithm, first, the low-level continuous parameters related to each high-level discrete action are selected, and then the discrete-continuous hybrid action pair that maximizes the action value function is calculated. More specifically, the discrete-continuous hybrid action space A can be defined as A = {(e, x_e) | e ∈ [E], x_e ∈ X_e}, where [E] = {0, 1, ⋯, E − 1} is the set of discrete actions and X_e is the set of continuous actions corresponding to discrete action e ∈ [E]. A deterministic function is then defined to map the state and each discrete action e to the corresponding continuous parameter x_e, namely x_e = μ(s, e; θ), where θ is the weight of the deterministic policy network. A discrete action value function Q(s, e, x_e; ω) is further defined to map the state s and any hybrid action to a value, where ω is the weight of the discrete action value network. P-DQN updates the discrete action value network parameters through the loss function ℓ_Q(ω) = E[(Q(s, e, x_e; ω) − y)²], where y = r + γ max_{e′∈[E]} Q(s′, e′, μ(s′, e′; θ); ω), and s′ is the next state after the hybrid action (e, x_e) is taken. The policy of the continuous parameter part is updated by fixing the parameter ω and minimizing the loss function ℓ_Θ(θ) = −E[Σ_{e∈[E]} Q(s, e, μ(s, e; θ); ω)]. Therefore, the action value function Q(s, e, x_e; ω) plays two roles: first, it outputs the greedy strategy over all discrete actions (as in DQN), and second, it provides a gradient for the policy update of the continuous parameters.
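The action selection and TD target of P-DQN can be illustrated without the neural network machinery (a minimal sketch in which `mu` and `q` stand in for the deterministic policy network μ and the Q-network; names are our own):

```python
def select_hybrid_action(state, mu, q, num_discrete):
    """Compute x_e = mu(state, e) for every discrete action e, then pick the
    discrete action maximizing Q(state, e, x_e)."""
    params = {e: mu(state, e) for e in range(num_discrete)}
    best_e = max(range(num_discrete), key=lambda e: q(state, e, params[e]))
    return best_e, params[best_e]

def td_target(reward, next_state, mu, q, num_discrete, gamma=0.95, done=False):
    """y = r + gamma * max_e Q(s', e, mu(s', e)), the target in the Q loss."""
    if done:
        return reward
    return reward + gamma * max(
        q(next_state, e, mu(next_state, e)) for e in range(num_discrete))

# Toy stand-ins: the policy outputs a constant bid; Q prefers action e = 1.
mu = lambda s, e: 0.5
q = lambda s, e, x: float(e == 1)
assert select_hybrid_action(0, mu, q, 3) == (1, 0.5)
```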
After introducing the P-DQN algorithm, we now introduce the I-PDQN algorithm for multiple noncooperative agents. I-PDQN is an algorithm with low space and time consumption. Specifically, I-PDQN has a space complexity of O(λn) in each round, where λ is the size of the bidding space and n is the size of the replay memory; since the replay memory is cleared in each round, the algorithm does not take up much memory space. Note that it is difficult to give an exact time complexity for deep reinforcement learning; however, in our experiments, we can obtain the convergence result in a reasonable time. In more detail, the algorithm takes the competing market pricing parameters, the numbers of buyers and sellers, and the bidding space as input, and outputs the Nash equilibrium trading strategy. Because each trader intends to maximize its own profit, it learns its best trading strategy independently. Therefore, the I-PDQN algorithm adopts the autonomous learning paradigm, and each trader runs an independent P-DQN learner [45, 46]. Because this game involves a large number of traders, we introduce mean field theory into the I-PDQN algorithm to describe the state of the market. The details of the algorithm are shown in Algorithm 1.
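The mean-field idea, summarizing the large trader population by aggregate statistics rather than by every individual action, can be sketched as follows (a minimal illustration; the chosen statistics follow the state features listed in Table 1, and the function name is our own):

```python
def mean_field_state(my_role, bids_by_role, last_price):
    """Summarize the market by aggregate statistics instead of all N actions:
    the best and average bid of traders with the same role, the average bid
    of traders with the opposite role, and the last transaction price."""
    same = bids_by_role[my_role]
    other = bids_by_role[1 - my_role]
    return (max(same), sum(same) / len(same),
            sum(other) / len(other), last_price)

# Role 0 = seller, role 1 = buyer (an arbitrary encoding for the sketch).
state = mean_field_state(0, {0: [0.2, 0.4], 1: [0.6, 0.8]}, 0.5)
assert state[0] == 0.4 and state[3] == 0.5
assert abs(state[2] - 0.7) < 1e-9
```

The observation size is thus constant in the number of traders, which is what makes learning with 100 traders tractable.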

4.2. Parameter Setting.
We now experimentally analyze the Nash equilibrium trading strategy. In the experimental analysis, we consider 50 buyers and 50 sellers. For the hybrid action of each trader, the discrete action set is expressed as [E] = {0, 1, ⋯, E − 1}, where E is the total number of discrete actions, and the continuous action parameter corresponding to each discrete action is x_e ∈ [0, 1]. In the action selection stage, each trader first generates the continuous parameters corresponding to all discrete actions according to the observed state. The exploration probability is set as ε = 0.995^t, which gradually decreases as the number of training iterations increases. For exploration of discrete actions, a trader draws a number ζ ~ U(0, 3) with probability ε: if ζ ∈ [0, 1], the discrete action is 0 and the trader does not enter any market; if ζ ∈ (1, 2], the discrete action is 1 and the trader enters market 1; and if ζ ∈ (2, 3], the discrete action is 2 and the trader enters market 2. There are six state features in each stage of the market, and the specific parameters are explained in Table 1. The replay memory of each trader i is D_i = 200000, the batch sample size is ℝ = 64, the update rates of θ_i and ω_i are α = 0.01 and β = 0.001, respectively, and the discount factor is γ = 0.95.
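The exploration rule above, decaying ε and mapping a draw ζ ~ U(0, 3) to a discrete market choice, can be sketched as:

```python
import random

def exploration_rate(t, base=0.995):
    """epsilon = 0.995^t, decaying with the training iteration t."""
    return base ** t

def explore_discrete_action(rng):
    """Map zeta ~ U(0, 3) to a discrete action:
    [0, 1] -> 0 (stay out), (1, 2] -> 1 (market 1), (2, 3] -> 2 (market 2)."""
    zeta = rng.uniform(0.0, 3.0)
    if zeta <= 1.0:
        return 0
    return 1 if zeta <= 2.0 else 2

rng = random.Random(0)
actions = {explore_discrete_action(rng) for _ in range(1000)}
assert actions == {0, 1, 2}   # all three choices appear under exploration
```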

4.3. Experimental Results. We select two typical pricing strategies, ⟨0, 1⟩ and ⟨0.5, 0.5⟩, for analysis, as they are the most common strategies in economic markets. The two markets compete with each other. The I-PDQN algorithm is used to train the Nash equilibrium trading strategies, i.e., which market traders will enter and how much they will bid in the equilibrium state. Figure 1 shows the changes of traders' market choices during the iterative process with the combined pricing strategy ⟨0, 1⟩, where the pricing strategy of market 1 is k = 0 and the pricing strategy of market 2 is k = 1. In this case, market 1 is completely biased towards the buyer, while market 2 is completely biased towards the seller. As can be seen in Figure 1(a), as training progresses, sellers with types less than 0.5 gradually enter market 2. This is because market 2 is completely biased towards sellers, and thus attracts sellers to participate. However, since sellers with types greater than 0.5 cannot win the competition there, they choose to go to market 1 in order to trade successfully. The analysis of buyers in Figure 1(b) mirrors that of the sellers' market selection strategies: market 1 eventually attracts buyers and sellers with types greater than 0.5, while market 2 attracts traders with smaller types. This shows that through continuous learning, buyers and sellers choose the market that either favors them or allows them to trade successfully. Figure 2 shows the convergence results of traders' Nash equilibrium trading strategies in the competing environment with pricing strategy ⟨0, 1⟩. Note that the training process of Algorithm 1 can only output the equilibrium actions of the specific sampled types. Based on these equilibrium actions, we further use a neural network to fit the final trading strategy, which is a mapping from trader types to actions.
The results show how, in the equilibrium state, traders select a market and bid in the selected market. We can also see that both markets attract traders and can coexist. According to the market choices of traders, traders with larger types enter market 1, while traders with smaller types enter market 2. In market 1, because it is completely biased towards the buyer, buyers are willing to bid close to their types, i.e., to bid truthfully, while sellers prefer to shade their bids. In market 2, sellers are willing to bid close to their true types because market 2 is completely biased towards sellers, and they bid truthfully in order to improve their matching probability. From Figure 2, we also find that when the type of a buyer is less than 0.12 or the type of a seller is greater than 0.88, they choose not to enter the market because of the fixed cost (such as time cost) of entering. For the competing markets with pricing strategy ⟨0.5, 0.5⟩, the results show that under the same pricing strategy, traders eventually converge to only one market, where which market they converge to is random, and the bidding strategies of all traders are similar to those in a single market. This is because when the two competing markets are identical, traders choose the market with more participants in order to improve their probability of being matched. As a result, only one market can survive in the end.

4.4. Experimental Evaluation against FP.
Another way to solve the game with continuous private types is to use the generalized FP algorithm. Therefore, we evaluate our algorithm against the FP algorithm. In this evaluation, we still consider two competing markets with the market pricing strategy ⟨0.1, 0.9⟩, and we again assume that there are 50 buyers and 50 sellers. We use the two algorithms to train the traders' trading strategies, respectively, to obtain the final Nash equilibrium trading strategy. The experiment is repeated 50 times. In each experiment, the I-PDQN algorithm initializes the types of traders randomly under the uniform distribution on [0, 1]; for the FP algorithm, the types of traders and the initial FP beliefs are also initialized randomly. Figure 3 shows the average profits of traders entering different markets when the Nash equilibrium trading strategies are obtained by the two algorithms. It can be seen that the results obtained by the two algorithms are almost the same, which shows that the I-PDQN algorithm achieves the same Nash equilibrium strategy as the FP algorithm.
We also evaluate the computation speed of the two algorithms. We calculate the number of iterations and the computation time of each iteration. The average and standard deviation are calculated and the results are shown in Table 2.
The results show that although the I-PDQN algorithm needs more iterations to converge to the equilibrium, a single FP iteration takes about 5.031 times as long as a single I-PDQN iteration, and therefore, the total average time of the FP algorithm is 4.6745 times that of the I-PDQN algorithm. We can thus see that the I-PDQN algorithm computes the traders' Nash equilibrium trading strategies faster. The reason is that traders using the I-PDQN algorithm constantly interact with the environment and with each other, obtaining more experience tuples to train their own policy networks, and therefore need more iterations; however, in each iteration, each trader only needs to calculate its own hybrid action according to the currently observed state, which takes little time. In the FP algorithm, in every iteration traders need to calculate the current best response strategy against the FP beliefs and then update those beliefs, and all traders repeat this process until convergence. Therefore, as the number of traders increases, the computation time of each iteration of the FP algorithm increases, resulting in an increased total convergence time.
Input: market pricing parameters k_1 and k_2, number of buyers B, number of sellers S, trader bidding space λ
Output: the Nash equilibrium trading strategy of each trader
1 Initialization: for each trader i ∈ B ∪ S, initialize the exploration parameter ε, the batch size R, the uniform distribution ξ, randomly initialize the network weights ω_i and θ_i, set t = 0, and set the initial state to s_{i,0}
2 while the loss function of the traders has not converged do
3  For each trader i, calculate the continuous parameter x^e_{i,t} ← μ_i(s_{i,t}; θ_{i,t}) corresponding to each discrete action according to the current state
4  Select the action a_{i,t} = (e_{i,t}, x^e_{i,t}) according to the following rule:
5  with probability ε, select an action at random from ξ; otherwise, select the action with the largest Q-value
6  When the bidding time of the current stage ends, each trader obtains its immediate reward r_{i,t} and the next-stage state s_{i,t+1} through the market rules
7  For each trader i, the tuple (s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1}) is stored in replay memory D_i
8  Strategy training:
9  Each trader i takes R samples from replay memory D_i and calculates y_i
10 Calculate the stochastic gradients ∇_ω l^Q_{i,t}(ω) and ∇_θ l^Θ_{i,t}(θ) according to equations (17) and (19), and update the weights accordingly
Algorithm 1. I-PDQN algorithm.
State features include the best bid of traders with the same role, the average bid of traders with the same role, the average bid of traders with the opposite role, and the last-round transaction price.
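The ε-greedy hybrid action selection of the I-PDQN algorithm above (a discrete action paired with its continuous parameter) can be sketched as follows; `param_net` and `q_net` are stand-ins for the trained deterministic parameter network and Q-network, not the paper's actual implementations.

```python
import random

def select_hybrid_action(state, discrete_actions, param_net, q_net, epsilon):
    """Epsilon-greedy selection of a hybrid (discrete, continuous) action.

    param_net(state, e) -> continuous parameter x_e for discrete action e
    q_net(state, e, x_e) -> Q-value of the hybrid action (e, x_e)
    """
    # Compute continuous parameters for every discrete action (step 3).
    params = {e: param_net(state, e) for e in discrete_actions}
    if random.random() < epsilon:
        e = random.choice(discrete_actions)  # explore at random
    else:
        # exploit: pick the discrete action with the largest Q-value
        e = max(discrete_actions, key=lambda a: q_net(state, a, params[a]))
    return e, params[e]

# Toy stand-in networks: the best discrete action matches the state index.
acts = [0, 1, 2]
pn = lambda s, e: 0.5 * e            # continuous parameter per action
qn = lambda s, e, x: -abs(s - e)     # Q-value peaks where e == s
e, x = select_hybrid_action(1, acts, pn, qn, epsilon=0.0)
assert (e, x) == (1, 0.5)
```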

The Competing Pricing Strategy
After analyzing the Nash equilibrium trading strategy, we now analyze how the double auction markets set their pricing strategies in Nash equilibrium. Specifically, we use the MADDPG algorithm to compute the competing pricing strategy and evaluate it against the FP algorithm in terms of computation speed and convergence result.
In the competing environment, the double auction market adjusts its pricing strategy in real time in order to attract traders and obtain higher allocation efficiency. Intuitively, the pricing strategy and the traders' Nash equilibrium strategies affect each other, and therefore this is a joint learning process between the market and the traders, which is shown in Figure 4. In the first stage, the market selects a pricing strategy based on the observed state. In the second stage, traders select a market and submit bids according to the Nash equilibrium trading strategy. The competing markets then compute the allocation efficiency according to the traders' current actions and further update the pricing strategy in order to improve the allocation efficiency. This process is repeated until the equilibrium state is reached. At that point, we obtain the Nash equilibrium pricing strategy and the Nash equilibrium trading strategy under this pricing strategy.

5.1. MADDPG Algorithm

As described above, the joint learning process is also a sequential decision process, and it involves two competing markets. This can be regarded as a Markov game. Therefore, we use the Multiagent Deep Deterministic Policy Gradient (MADDPG) algorithm [48] to analyze the Nash equilibrium pricing strategy.
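The two-stage joint learning loop described above can be sketched as follows; the three callables are placeholders for the learned market policy, the traders' I-PDQN equilibrium response, and the market update, not actual components from the paper.

```python
def joint_learning(market_policy, trader_equilibrium, update_pricing,
                   initial_state, num_rounds=100):
    """Alternate between market pricing and trader best responses.

    market_policy(state) -> pricing parameters k = (k1, k2)  [stage 1]
    trader_equilibrium(k) -> traders' equilibrium actions under k [stage 2]
    update_pricing(state, k, actions) -> (new_state, allocation_efficiency)
    """
    state = initial_state
    history = []
    for _ in range(num_rounds):
        k = market_policy(state)          # stage 1: markets choose pricing
        actions = trader_equilibrium(k)   # stage 2: traders respond
        state, efficiency = update_pricing(state, k, actions)
        history.append(efficiency)        # track allocation efficiency
    return history

# Toy run with constant placeholder components.
hist = joint_learning(lambda s: (0.0, 1.0), lambda k: sum(k),
                      lambda s, k, a: (s, a), 0, num_rounds=3)
assert hist == [1.0, 1.0, 1.0]
```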
The MADDPG algorithm uses centralized training and decentralized execution. Furthermore, each piece of experience in the replay memory contains the information of all agents at the current stage. Each agent learns multiple strategies and, at the same time, uses the overall effect of all strategies for optimization. The space complexity of MADDPG depends on the size of the replay memory D, which is O(λn), where λ is the size of an MDP tuple and n is the size of the replay memory. The same as in I-PDQN, the replay memory is cleared in each round. The time complexity cannot be calculated accurately; however, the algorithm can compute the convergent strategy in a reasonable time. Now, we briefly introduce MADDPG. We use θ = [θ_1, ⋯, θ_n] to represent the parameters of the strategies of the n agents and π = [π_1, ⋯, π_n] to represent their strategies. The cumulative expected reward of agent i is

J(\theta_i) = \mathbb{E}_{s \sim \rho^{\pi}, a_i \sim \pi_{\theta_i}} \left[ \sum_{t=0}^{\infty} \gamma^t r_{i,t} \right],

and for the deterministic strategy \mu_{\theta_i}, the gradient is

\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{s, a \sim D} \left[ \nabla_{\theta_i} \mu_i(a_i \mid s_i) \, \nabla_{a_i} Q_i^{\mu}(s, a_1, \ldots, a_n) \big|_{a_i = \mu_i(s_i)} \right], \quad (20)

where Q_i^{\mu} is the centralized value function of agent i. The centralized critic is updated by minimizing the loss function

L(\theta_i) = \mathbb{E}_{s, a, r, s'} \left[ \left( Q_i^{\mu}(s, a_1, \ldots, a_n) - y \right)^2 \right], \quad (21)

where

y = r_i + \gamma \, Q_i^{\mu'}(s', a_1', \ldots, a_n') \big|_{a_j' = \mu_j'(s_j)}.

The algorithm takes the pricing parameter spaces of the two markets as the input and outputs the Nash equilibrium market pricing strategy.
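The centralized critic update in equation (21) can be sketched numerically. This is a generic sketch of the TD target, mean squared loss, and Polyak target-network update used by MADDPG-style algorithms, with plain arrays standing in for network outputs; it is not the paper's implementation.

```python
import numpy as np

def critic_targets(rewards, next_q_values, gamma=0.9):
    """TD targets y = r_i + gamma * Q'_i(s', a'_1, ..., a'_n), where
    next_q_values are the target critic's values at the target actors'
    actions (computed elsewhere)."""
    r = np.asarray(rewards, dtype=float)
    q_next = np.asarray(next_q_values, dtype=float)
    return r + gamma * q_next

def critic_loss(q_values, targets):
    """Mean squared TD error minimized by the centralized critic."""
    q = np.asarray(q_values, dtype=float)
    y = np.asarray(targets, dtype=float)
    return float(np.mean((q - y) ** 2))

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'.
    Parameters are plain lists of floats here; in practice they are
    the network weight tensors."""
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_params, online_params)]

y = critic_targets([1.0, 0.0], [2.0, 2.0], gamma=0.9)
assert np.allclose(y, [2.8, 1.8])
assert critic_loss(y, y) == 0.0
assert np.allclose(soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1), [0.1, 0.2])
```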

5.2. Experimental Analysis of Pricing Strategy

We now experimentally analyze the Nash equilibrium pricing strategy. The experimental setting is the same as that for I-PDQN. The state of each market is a tuple s_i = ⟨ns_i, nb_i, ask_avr, bid_avr, g_i⟩, where ns_i and nb_i are the numbers of sellers and buyers entering the market, ask_avr and bid_avr are the average bids of sellers and buyers, and g_i is the number of deals. The pricing parameter spaces P_1 and P_2 of the two markets are both [0,1], and the replay memory size is D = 200000. For the generation of the pricing action, we use the normal distribution a_t ∼ N(a_0, 0.5) as the exploration noise. The number of samples taken for each training step is R = 128, the learning rate of the actor network is α_a = 0.001, the learning rate of the critic network is α_c = 0.0001, the update factor of the target network parameters is τ = 0.001, and the discount factor is γ = 0.9.

Experimental Results.
In this experiment, the two markets obtain the Nash equilibrium pricing strategy through continuous training. Figure 5 shows the action selection trend of the competing markets in the iterative process. Market 1 chooses a higher pricing parameter in the initial stage and finally stabilizes at k = 1. For market 2, since the higher pricing strategy of market 1 attracted a large number of sellers at the beginning, market 2 also tried to set a higher pricing parameter, that is, k = 0.8, but it could not beat market 1. During this period, the action choice of market 2 fluctuated greatly; it then gradually chose a lower pricing parameter and finally stabilized at k = 0. We have repeated this experiment many times. The results show that the two competing markets eventually stabilize at k = 0 and k = 1, that is, in the equilibrium state, the pricing parameter of market 1 is k = 0 and that of market 2 is k = 1 or vice versa, depending on the initialized market network parameters. This means that each market will favor one class of traders, buyers or sellers. In this case, the two markets can coexist. This further shows that in a highly competing environment, it is difficult for one market to attract all traders.

Input: continuous action spaces P_1 and P_2 of the markets
Output: equilibrium pricing strategies π_1 and π_2 of markets M_1 and M_2
1 Initialization: initialize the actor network μ and critic network Q of M_1 and M_2, respectively, with parameters θ^μ and θ^Q; initialize the corresponding target networks μ′ and Q′ with parameters θ^{μ′} and θ^{Q′}; initialize the replay memory D
2 Randomly initialize the distribution N for action exploration
3 Initialize the respective market states s_0^1 and s_0^2, and set the iteration cycle to t = 0
4 while the loss function of the markets has not converged do
5  Action selection:
6  Each market selects its action according to a_t^i = μ(s_t^i | θ^μ) + N_t
7  Release the pricing action a_t = (a_t^1, a_t^2); the traders then adjust their equilibrium trading strategies under this pricing (I-PDQN algorithm), and the markets calculate the reward r_t = (r_t^1, r_t^2) and the new state s_{t+1} = (s_{t+1}^1, s_{t+1}^2)
8  Store the tuple (s_t, a_t, r_t, s_{t+1}) in the replay memory
9  Strategy training:
10 for i = 1, 2 (update the strategy networks of the two markets, respectively) do
11  Randomly sample R tuples from the replay memory D and calculate y^i
12  Update the critic network Q by minimizing the loss function of equation (21)
13  Update the actor network μ by ascending the sampled policy gradient of equation (20)
14  Update the target network parameters via θ′ ← τθ + (1 − τ)θ′
15 end
16 end
Algorithm 2. MADDPG algorithm.

5.3. Experimental Evaluation against FP

Now, we evaluate our algorithm against the FP algorithm in terms of computation speed and allocation efficiency. The parameters are the same as those in Section 4.2. Each experiment is repeated 10 times, and we compute the average result.
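The observation that each market ends up favoring one class of traders is consistent with the standard k-double auction pricing rule, under which the transaction price is a k-weighted average of the matched bid and ask. We assume here that the paper's pricing parameter k plays this standard role.

```python
def k_double_auction_price(bid, ask, k):
    """Transaction price in a standard k-double auction:
        p = k * bid + (1 - k) * ask.
    A trade occurs only if the buyer's bid is at least the seller's ask.
    We assume this is the role of the pricing parameter k in the paper."""
    if bid < ask:
        return None  # bid below ask: no trade
    return k * bid + (1.0 - k) * ask

# k = 0: price equals the seller's ask, so the buyer keeps all surplus;
# k = 1: price equals the buyer's bid, so the seller keeps all surplus.
assert k_double_auction_price(0.9, 0.4, 0.0) == 0.4
assert k_double_auction_price(0.9, 0.4, 1.0) == 0.9
assert k_double_auction_price(0.3, 0.4, 0.5) is None
```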
The experimental results show that when the algorithms converge, the final pricing strategies obtained by the two algorithms both stabilize at k = 0 and k = 1, where which pricing parameter a market adopts depends on the initial network parameters or the initial FP beliefs. This shows that the MADDPG algorithm finally obtains the same result as the FP algorithm.
Furthermore, we look into the convergence speed of the two algorithms when they reach the Nash equilibrium. The results are shown in Figure 6. When the pricing strategy converges, the average computation time of the FP algorithm is 1.2 times that of the MADDPG algorithm. This means that our algorithm reaches the equilibrium faster than FP.

Conclusion
In this paper, we analyze the Nash equilibrium trading strategy of sensor data with a large number of traders in a competing environment with multiple edge servers running double auction markets. We adopt a deep reinforcement learning algorithm, I-PDQN, combined with mean field theory to compute the Nash equilibrium trading strategy. In the experimental analysis, the Nash equilibrium result of this algorithm is consistent with that of the FP algorithm, and its computation speed is significantly faster. We then analyze how an edge server running a double auction sets its price effectively in the competing environment and use MADDPG to compute the Nash equilibrium pricing strategy. The experimental results show that the Nash equilibrium pricing strategy of this algorithm is consistent with that of the FP algorithm, and its computation speed is faster. The analysis in this paper can provide useful insights for designing practical trading and pricing strategies in competing environments where multiple edge servers trade sensor data.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.