A Reinforcement Learning-Based Configuring Approach in Next-Generation Wireless Networks Using Software-Defined Metasurface

The next generation of wireless networks, including the fifth and sixth generations (5G and 6G), can provide the very high data rates demanded by the Internet of Everything (IoE), which connects millions of people and billions of machines. To reach such data rates, wireless networks must operate at sufficiently high frequencies, such as the millimeter-wave and THz bands, which in turn suffer from large attenuation and acute multipath fading. Coating objects in the environment with Software-Defined Metasurfaces (SDMs) has been proposed to control these effects by managing the electromagnetic properties of the environment. Since the programmable environment can change during communication (for example, when an obstacle suddenly appears), this management should be adaptive. This paper presents the use of a reinforcement learning (RL) algorithm for dynamically configuring such an environment. When a change happens in the environment, for example, an obstacle blocks some EM waves, the agent receives a large punishment, and therefore a new action is selected. In our model, the transmitted electromagnetic waves and the tiles are considered as the agents and states, respectively, and the actions of each tile include absorbing the impinging waves or reflecting them in a specific direction. We utilize the Q-learning technique to establish proper wireless links between the users and the access point (AP) by controlling the state of the tiles in an environment covered by SDMs. Evaluation of the proposed model in different scenarios, including suddenly emerging obstacles, indicates its potential to provide a proper signal level for all users and to improve the average received power by up to 12% in comparison with related works.


Introduction
With the increasing use of smart wireless devices and the emergence of the Internet of Everything (IoE), wireless communication systems face a demand for higher data rates as well as better Quality of Service (QoS). Accordingly, after the deployment of 5G around the world, academic and industrial efforts have started to conceptualize 6G [1][2][3]. To provide such high-data-rate communication, these new generations of wireless systems should work at higher frequencies such as the millimeter-wave or even THz bands. On the other hand, the larger attenuation due to free-space path loss and lower penetration depth, as well as acute multipath fading, limits the use of such high-frequency electromagnetic (EM) waves [4][5][6][7][8]. To alleviate these drawbacks, researchers have proposed using coating technologies to turn propagation environments into programmable media. Unlike regular environments, in which EM waves undergo various interactions including reflection, diffraction, and scattering, a programmable wireless environment (PWE) offers complete control over the EM waves impinging upon its surfaces, making it possible to reengineer their direction [9].
Relays [10,11], reflectarrays [12][13][14][15], and metasurfaces [9,16,17,18,19] are the most popular coating technologies for PWEs. Relays contain an array of low-cost antennas which passively either reflect or attenuate the EM waves impinging on the wall, configuring the environment to improve the wireless networks operating nearby. Being passive, this approach cannot adaptively reconfigure the PWE. The second approach, reflectarrays, comprises a number of λ/2 patch antennas in a 2D grid arrangement, together with active elements such as PIN diodes that alter the phase of the reflected EM wave. The common functionalities of reflectarrays are wave steering and absorption, which are achievable mainly in the far field. The third approach uses a metasurface, which is similar to a reflectarray but has a higher density of meta-atoms. A sufficiently high density lets the metasurface produce any EM profile, even in the near field [20]. We use the Software-Defined Metasurface (SDM) approach [16] in the remainder of this paper.
In an SDM-based PWE, all of the walls, doors, furniture, and other objects are coated with SDM tiles. Each tile has some networked hardware, control elements, and adaptive meta-atom metasurfaces and can receive external commands and set the states of its control elements to match the intended EM behavior [16]. Furthermore, tiles have environmental sensing and reporting capabilities to discover the communicating devices within the environment. In addition, there is a server that senses the EM profile and active users present in the environment and adaptively configures the matching functionality for each SDM tile.
Adaptively configuring the functionality of the tiles is a major challenge in an SDM-based environment, especially when the number and the positions of the users can change constantly. Moreover, the environment itself can change; for example, an obstacle can suddenly block the directed EM waves, causing some of the wireless links to fail. The approaches employed so far to control wireless communication in PWEs, including solving an optimization problem [16] and a neural-network-based technique [9], cannot dynamically reconfigure the functionality of the tiles as soon as the environment changes.
In this paper, we propose an approach based on reinforcement learning (RL) to adaptively configure and control the indoor wireless communication environment. In our learning model, the agents are the transmitted EM waves, the tiles play the role of states, and the different functionalities of the tiles are considered as valid actions. By applying the reinforcement learning algorithm, the tiles can make better decisions as the EM waves gradually interact with the environment. So, when a change happens in the environment, for example, an obstacle blocks some EM waves, the agent receives a large punishment and therefore a new action is selected. Since there is usually more than one user, we model the wireless communication environment as a multiagent RL problem. To evaluate the effectiveness of our approach, we consider different scenarios, namely, providing permanent coverage for the NLOS region, providing particular connections for each user when the number of users increases, providing connections in the presence of obstacles, and finally providing connections after the sudden emergence of some obstacles. Simulation results indicate that the RL approach improves the average received power by up to 12% in comparison with related works. The main contributions of this paper are as follows:
(i) We properly model the PWE to apply the RL technique
(ii) We introduce a novel multiagent RL-based approach to configure the PWE
(iii) Different scenarios are considered to model what happens in a real PWE
The rest of this paper is organized as follows. Section 2 reviews the related works. The preliminary background for SDM is presented in Section 3. Section 4 provides the proposed RL model to configure the tiles. The proposed approach is evaluated via ray-tracing-based simulations in Section 4.1. Finally, the conclusion, as well as future works, is presented in Section 5.

Related Work
The idea of an SDM-based communication environment is to coat planar objects with units of a kind of software-controlled metasurface called SDM tiles [16]. As can be seen in Figure 1, the tiles consist of a set of switch elements controlled by a set of networked controllers and a gateway that provides intertile and external connectivity. Using the gateway, the network of controllers receives commands from the configuration server to change the current state of the switches and therefore configure the tiles to yield the desired functionality. In addition, the gateway acts as a bridge that provides the required power for the tiles. According to Figure 1, the tiles form a network with a grid topology in which at least one of the gateways is connected to the environment configuration server, both for accumulating sensed environmental data to discover the communicating devices within the environment and for propagating the proper commands within the tile network. These commands contain at least the type of action, such as steering or absorbing, and the address of the intended tile gateway. It is worth noting that the translation of an action to a tile switch element configuration is usually done using a lookup table populated during the tile design/manufacturing process [16].
Due to the development of techniques for controlling the behavior of EM waves, researchers have been attracted to SDM-based PWEs, especially for the realization of the next generations of wireless networks. Most of the related literature focuses on designing the PWE tile unit [8,13,18,21,22] rather than on adaptive configuration approaches for it. In this section, we review the studies that concentrate on approaches to adaptively configure SDM-based PWEs.
Liaskos et al. introduced the idea of using SDMs, for the first time, to have a programmable wireless environment, especially for the next generations of wireless networks [16]. They provided coverage in an NLOS region considering 12 receivers in that area and adequately set the states of the tiles by solving an optimization problem using the genetic algorithm (GA) technique. Moreover, power maximization over the NLOS region is considered as the criterion for the genetic algorithm. Although this approach leads to adequate coverage in the NLOS region, it cannot adaptively reconfigure the functionality of the tiles as soon as the environment changes. For example, when an object suddenly emerges and blocks some of the wireless links, a new optimization problem should be solved, which will be time-consuming.
Liaskos et al. in another work [9] presented an approach based on machine learning algorithms, in particular neural networks, to adaptively configure the tiles for a set of users. The authors model the wireless propagation as a neural network with walls as layers, tiles as nodes, and their cross interactions as links. This research considered a problem with only one user, and the extension to multiple users has not been discussed.
In another study, Liaskos et al. [17] model PWEs as a graph and describe their workflow and performance objectives as path finding problems. Unlike the above-mentioned works, this reference presented a network-layer solution to configure PWEs for multiple users and objectives. Nevertheless, the proposed approach in this reference is time-consuming to reconfigure the states of the tiles when the environment changes.
Our proposed RL-based approach in this paper can adaptively configure PWEs that serve multiple users. In addition, when the environment changes, for instance, an obstacle blocks the path of some EM waves, the tiles can report this blockage to the configuration server. After recalculating the Q-table by considering the new reward/punishment value, the blocked EM waves can immediately find a new path. Table 1 summarizes the main adaptive configuration approaches for PWE tiles in the literature.

System Model and Problem Formulation
In this section, we present the reinforcement learning model of an SDM-based communication system working at millimeter-wave frequencies for the next generations of wireless networks. By applying reinforcement learning, the agent can make better decisions as it gradually interacts with the environment [23,24]. We utilize the Q-learning technique, which is one of the value-based reinforcement learning approaches, to establish proper wireless links between the users and the access point (AP) by controlling the state of the tiles in an environment covered by SDMs. Q-learning searches for an optimal state-action policy, that is, a sequence of actions that maximizes the expected discounted reward [25]. This optimal policy defines how the agent selects an action with regard to its state. As a typical action-selection policy, the agent chooses the action with the highest Q-value with probability 1 − ϵ and acts stochastically with probability ϵ. In other words, there should be a tradeoff between exploration and exploitation. Exploration tries to execute actions that have not been executed before when the agent is at a specific state. At the beginning of the learning process, the agent has no experience of the environment and therefore needs to obtain some rewards and punishments from it. So, in these conditions, exploration is dominant and the value of ϵ is near one. As the agent interacts with the environment more and more and gains experience, the value of ϵ is reduced and, therefore, exploitation becomes dominant. The Q-value updating rule according to the Bellman equation [25] is given as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \quad (1)$$

where $s_t$ and $a_t$ are the state and the action at time $t$, $r_{t+1}$ is the received reward, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $a'$ denotes one of the valid actions when the agent is in state $s_{t+1}$.
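The action-selection policy and the update rule described above can be sketched in a few lines of Python. This is a minimal tabular illustration, not the authors' simulator: the function names, the action labels such as `steer_15`, and the numeric values in the example call are assumptions chosen only for demonstration.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, valid_actions, epsilon, rng=random):
    """With probability epsilon explore a random valid action, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(valid_actions)
    return max(valid_actions, key=lambda a: q[(state, a)])

def q_update(q, s, a, reward, s_next, next_actions, alpha, gamma):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped target."""
    # Value of the best valid action in the next state (0 if there is none).
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
    return q[(s, a)]

# Hypothetical example: one hop from tile 0 to tile 1 with a hop punishment of -1.
q = defaultdict(float)
new_value = q_update(q, s=0, a="steer_15", reward=-1.0, s_next=1,
                     next_actions=["steer_0", "absorb"], alpha=0.2, gamma=0.9)
greedy = epsilon_greedy(q, 0, ["steer_15", "absorb"], epsilon=0.0)
```

With all Q-values initialised to zero, the single punished hop lowers the value of `steer_15`, so a purely greedy agent (ϵ = 0) would then prefer the untried action.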
Our reinforcement learning model for the SDM-based communication system has the following elements:
(i) Agent: The transmitted EM waves (signals) are the agents.
(ii) State: The tiles are the states. For each user, its signal is received by a proper tile (e.g., the nearest tile) as the first state. Afterward, the agent (signal), following the optimal policy, moves to the next states (tiles) until it arrives at the goal (AP). It should be noted that each tile can receive EM waves only at specific angles. So, when the agent is at state $s_t$, only some of the states are valid as the next state $s_{t+1}$. In addition, each state can be used by only one agent during the communication, so the actions that lead to a transition from one state to a previously used state become invalid. It is worth noting that the Q-values for the different states are stored in the server.
(iii) Action: For each tile, the wave can be transmitted at some predefined angles. Therefore, for each state, the action can be either steering the waves (agents) according to one of these valid angles or absorbing the waves. The action at time $t$ is denoted as $a_t \in A$, where $A$ is the action set. The server selects one of the valid angles according to the Q-values.
(iv) Reward: For each hop, that is, a transition from one tile to another, we consider a negative reward (punishment) $r_h$. This punishment makes the agent reach the goal with fewer hops and therefore prevents the agent from oscillating. In addition, if the agent arrives at the goal by selecting a valid action, we consider a large reward denoted by $r_g$.
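The state, action, and reward elements listed above can be illustrated with a small Python sketch. It assumes a hypothetical `reachable` map (which tiles can be reached from each tile at the valid angles) and placeholder values for `r_h` and `r_g`; none of these names or numbers come from the paper.

```python
def valid_actions(tile, reachable, used):
    """Steering targets reachable from `tile` that no other agent has used, plus absorb."""
    steer = [("steer", nxt) for nxt in reachable.get(tile, []) if nxt not in used]
    return steer + [("absorb", None)]

def hop_reward(next_state, goal, r_h=-1.0, r_g=100.0):
    """Punish every ordinary hop with r_h; give the large reward r_g at the goal."""
    return r_g if next_state == goal else r_h

# Hypothetical three-tile environment with the AP as the goal.
reachable = {"t1": ["t2", "t3"], "t2": ["AP"], "t3": ["t2", "AP"]}
acts = valid_actions("t1", reachable, used={"t3"})  # t3 is taken by another agent
```

Because tile `t3` is already in use, steering toward it is no longer offered, which mirrors the rule that each state can be used by only one agent.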
3.1. Multiagent Model. As previously mentioned, it is common to have more than one user communicating with the AP. For such a scenario, we should extend the above-mentioned single-agent RL algorithm to a multiagent one. A popular approach for a multiagent problem is Nash-Q learning [26,27], in which the agents are assumed to play a game to obtain the maximum payoff. Instead of finding the maximum Q-value for each agent, the Nash Equilibrium property is employed to find the optimal set of actions for each state [27]. Nevertheless, this approach cannot be efficient for our problem because, in addition to the large number of users, the number of valid actions for each agent can be large. For example, if the numbers of agents and their valid actions are 10 and 12, respectively, then a set of 120 equations should be solved to obtain the Nash Equilibrium for the mixed strategy. Therefore, determining the Nash Equilibrium for such a problem with this amount of calculation, which should be repeated for each transition from one state to another, is considerably time-consuming. Moreover, whenever a new user wishes to make a connection with the AP, the Nash Equilibrium should be calculated for all the users together, which can lead to a change in the state sequences of the previously connected users and therefore increase the delay for them.
Table 1: Main adaptive configuration approaches for PWE tiles in the literature.

| Reference | Approach | Discussion |
|---|---|---|
| Liaskos et al. [16] | Genetic algorithm (GA) | This approach cannot adaptively reconfigure the tiles' functionality as soon as the environment changes. |
| Liaskos et al. [9] | Neural network | This approach considered a problem with only one user. |
| Liaskos et al. [17] | Graph model | This approach is time-consuming to reconfigure the states of the tiles when the environment changes. |
| This work | RL | This approach can adaptively reconfigure the states of the tiles when the environment changes in the presence of multiple users. |

Since it is important to reduce the delay of establishing communication between the transmitter and receiver as much as possible, we should consider another approach for our problem. We can simplify the model by considering that the users usually do not request a connection at the same time. In other words, we can consider this problem as a single-agent model that is repeated for each user, with the environment modified after each iteration. This modification is needed because each tile can be used by only one agent. Therefore, at the end of each episode for every agent, some of the states are invalid for the other agents. In this model, all agents share and update a common Q-table [27]. So, we can rewrite equation (1) by adding a subscript i to denote agent i and by defining the Q-values as a function of all agents' actions [27].

Algorithm 1 presents the proposed reinforcement learning-based communication for the next generation of wireless networks. This algorithm takes the number of agents ($n_a$), the number of states ($n_s$), the acceptable error ($e$), and ϵ as input. Since each tile can be used only once to steer, we define the set U to indicate the tiles that have been used before. In addition, ΔQ indicates the variation of the Q-table after finishing each episode, and we set its initial value to some value greater than $e$. In each episode, all the agents should reach the goal. Therefore, if the episode is finished for some of the agents, the others should proceed until they reach the goal. After all agents have reached the target, the episode is finished and the variation of the Q-values is calculated over a batch of recent episodes. If this variation is lower than the threshold $e$, the learning process is complete. Each agent has its own ϵ value, which is equal to one at the beginning of the algorithm and is then multiplied by a constant decay factor $c$ after each of the agent's actions.

Finally, we can predict the execution time of the proposed algorithm to configure the PWE. Suppose that $R$ is the data rate of the wireless links, $L$ is the length in bits of the pilot packets sent to update the Q-table before the data transmission phase, $d_p$ is the total propagation delay between the user and the AP, and $d_{proc}$ is the processing delay per tile. If the average number of episodes required for the proposed RL algorithm to converge is $N_{episode}$ and the average number of tiles activated to forward the EM waves for each user is $n_{tile}$, the total time needed before starting the data communication, denoted by $T$, can be estimated from these quantities by equation (4). Because the next generations of wireless networks work at high frequencies, the value of $R$ is considerably high. On the other hand, for an indoor environment, $d_p$ will be negligible. We calculate $T$ in Section 4.3, when parameters such as $n_{tile}$ and $N_{episode}$ can be estimated.
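As a rough illustration of how the quantities defined above combine, the sketch below assumes that each episode costs one pilot transmission time (L/R), one propagation delay, and one processing delay per activated tile, and that the episodes run back to back. This additive form is an assumption made here for illustration; it is consistent with the definitions in the text but is not taken from the paper. All numeric values in the example are hypothetical.

```python
def setup_time(n_episode, pilot_bits, rate_bps, d_prop, n_tile, d_proc):
    """Estimated time (seconds) spent learning before data transmission starts.

    Assumed model: per-episode time = L/R + d_p + n_tile * d_proc.
    """
    per_episode = pilot_bits / rate_bps + d_prop + n_tile * d_proc
    return n_episode * per_episode

# Hypothetical numbers: 1000-bit pilots at 1 Gbit/s, negligible propagation
# delay, 5 tiles per path, 10 microseconds of processing per tile, 1000 episodes.
t_total = setup_time(n_episode=1000, pilot_bits=1000, rate_bps=1e9,
                     d_prop=0.0, n_tile=5, d_proc=10e-6)
```

Under this assumed model the per-tile processing delay dominates once R is large, which agrees with the remark that a high data rate and an indoor setting make the other terms small.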

Performance Evaluation
In this section, we evaluate the use of the reinforcement learning algorithm to establish wireless communication links in an environment covered by SDMs. We consider an indoor space, similar to that used in [9,16], with dimensions of 15 m × 10 m × 3 m. A wall with a length of 12 m and a thickness of 1 m exists in the middle of the room, which creates two separate sections, namely, the line-of-sight (LOS) and non-line-of-sight (NLOS) regions. All the walls are covered by 1 m × 1 m tiles. So, according to the dimensions of the room, there are 193 tiles. An AP working at 60 GHz with 100 dBmW transmitting power is located at a height of 2 m somewhere in the LOS region. In our model, the tiles can steer or absorb the received EM waves from the directions formed by a combination of the angles {−60°, −45°, −30°, −15°, 0°, 15°, 30°, 45°, 60°} in the elevation plane and 0° to 360° with a step of 15° in the azimuth plane. We assume that no power is dissipated when a tile steers the wave and that all the power of the wave is absorbed when the tile is absorbing. To evaluate the performance of the proposed RL-based system, we developed a 3D ray-tracing simulator in Python.

ALGORITHM 1: The process for finding the paths for multiple users in a PWE using the proposed RL algorithm.
Input: n_a: number of agents, n_s: number of states, e: acceptable error, ϵ
Output: Final Q-table
ϵ_i ← ϵ, ∀i ∈ {1, 2, ..., n_a}
n_g: the set of agents that have not yet reached the goal in the current episode
n_g ← {1, 2, ..., n_a}
S(s, a): returns the next state corresponding to the current state s and the action a
U: the set of states used by all of the agents
ΔQ: variance of the Q-table
[...] (2)
      Update Q(s_i, a_i) by equation (3)
      Update n_g
      ϵ_i ← ϵ_i * c
    end
  end
  Store Q(s_i, a_i)
  Calculate ΔQ
end
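The overall flow of Algorithm 1 (agents served one after another, a shared Q-table, a set U of used tiles, a per-agent ϵ multiplied by c after each action, and a stop test on ΔQ) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the tiny `reachable` graph, the reward values, taking ΔQ as the largest Q-value change within one episode, and the helper names are all assumptions.

```python
import random

def train(starts, reachable, goal, alpha=0.2, gamma=0.9, eps0=1.0, c=0.9,
          e=1e-3, r_h=-1.0, r_g=10.0, max_episodes=1000, seed=0):
    rng = random.Random(seed)
    q = {}                                    # shared Q-table: (tile, next_tile) -> value
    eps = [eps0] * len(starts)                # one epsilon per agent
    for episode in range(1, max_episodes + 1):
        used, delta_q = set(starts), 0.0      # U: tiles already taken in this episode
        for i, s in enumerate(starts):        # agents are handled one after another
            while s != goal:
                acts = [t for t in reachable[s] if t not in used]
                if not acts:                  # dead end: stop this agent
                    break
                if rng.random() < eps[i]:
                    nxt = rng.choice(acts)    # explore
                else:                         # exploit the shared Q-table
                    nxt = max(acts, key=lambda t: q.get((s, t), 0.0))
                reward = r_g if nxt == goal else r_h
                future = [q.get((nxt, t), 0.0)
                          for t in reachable.get(nxt, []) if t not in used]
                target = reward + gamma * max(future, default=0.0)
                old = q.get((s, nxt), 0.0)
                q[(s, nxt)] = old + alpha * (target - old)
                delta_q = max(delta_q, abs(q[(s, nxt)] - old))
                eps[i] *= c                   # decay exploration after every action
                if nxt != goal:
                    used.add(nxt)             # the goal (AP) is shared, tiles are not
                s = nxt
        if delta_q < e:                       # Q-table has stopped changing
            break
    return q, episode

def greedy_paths(q, starts, reachable, goal):
    """Follow the learned Q-values greedily, one agent at a time."""
    used, paths = set(starts), []
    for s in starts:
        path = [s]
        while s != goal:
            acts = [t for t in reachable[s] if t not in used]
            if not acts:
                break
            s = max(acts, key=lambda t: q.get((s, t), 0.0))
            path.append(s)
            if s != goal:
                used.add(s)
        paths.append(path)
    return paths

# Hypothetical environment: two users whose first tiles are "a" and "b".
reachable = {"a": ["c", "d"], "b": ["d", "e"], "c": ["AP"], "d": ["AP"], "e": ["AP"]}
q, episodes = train(["a", "b"], reachable, "AP")
paths = greedy_paths(q, ["a", "b"], reachable, "AP")
```

Serving the agents sequentially keeps the per-step cost proportional to the number of valid actions of a single agent, which is the simplification the text adopts in place of Nash-Q learning.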
We consider the reward function as equation (5) in which we have made differences between the next states in LOS and NLOS regions. If we assume that the EM waves propagate from the user in the NLOS region to the AP in the LOS region, we consider a larger value for r h2 to encourage the EM waves to enter the LOS region for arriving at the AP as soon as possible.
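A region-dependent reward of the kind described above might look like the following sketch. It assumes that `r_h1` applies when the next tile lies in the NLOS region and `r_h2` when it lies in the LOS region, with the default values taken from the parameter search reported in the text; the goal reward `r_g` and the function signature are placeholders introduced here.

```python
def reward(next_region, reached_goal, r_h1=-4.0, r_h2=6.0, r_g=100.0):
    """Reward for one hop: r_g at the AP, otherwise r_h1 (NLOS) or r_h2 (LOS)."""
    if reached_goal:
        return r_g
    return r_h2 if next_region == "LOS" else r_h1

r_nlos = reward("NLOS", reached_goal=False)
r_los = reward("LOS", reached_goal=False)
```

Because `r_h2` is larger than `r_h1`, a hop into the LOS region is preferred over another hop inside the NLOS region, which pushes the wave toward the AP.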
To obtain the optimal values for the reward in our RL modeling, we calculate the average number of activated tiles in a scenario with three users located in the NLOS region for different values of $r_{h1}$ and $r_{h2}$. According to Figure 2(a), the minimum average number of activated tiles is achieved for $r_{h1} = -4$ and $r_{h2} = 6$. In addition, we search for the optimum values of the discount factor $\gamma$ and the learning rate $\alpha$ using a similar process. As can be seen in Figure 2(b), there is a wide range of suitable choices for these two parameters, and we choose $\gamma = 0.9$ and $\alpha = 0.2$. The parameters related to the RL algorithm and the PWE which are used in the simulations are listed in Tables 2 and 3, respectively.
In this section, we consider four different scenarios and evaluate our proposed method in comparison with the related works.

Scenario 1: Providing Coverage for the NLOS Region.
An approach for providing coverage in an NLOS region is to put a large number of receivers in that area and adequately set the state of each tile. This setting is typically done by means of optimization algorithms such as the genetic algorithm (GA). For example, in [16], 12 receivers are uniformly distributed over the NLOS region and then, using the GA, the optimal tile configurations are searched to maximize the minimum received power over these receivers. In this subsection, we use the proposed RL-based model to establish proper wireless links between these 12 receivers and the AP located at position {2.25, 3, 2} m. So, we apply the proposed multiagent RL algorithm to properly configure the environment. Figure 3 shows the minimum, maximum, and average of the total received power by these receivers in comparison with the results obtained in [16]. According to this figure, our RL-based approach improves the mean and minimum values of the received power by up to 12% and 35%, respectively, after approximately 500 episodes. In addition, we consider different numbers of randomly located users in the NLOS region and investigate the coverage ability of this scenario based on the prelearning, that is, the Q-table obtained for the previously mentioned 12 receivers. Table 4 shows the minimum, maximum, and average of the power received by the users. The mean values of the received power are greater than 26.43 dBmW, which is large enough for proper communication.

Scenario 2: Performance Evaluation with Varying Number of Users.
In the second scenario, there is no prelearning, and the transmitted EM waves must find proper paths to their destinations by themselves. For such a scenario, we evaluate the performance of the system for different numbers of users. Accordingly, we increase the number of users and place them randomly in the NLOS region. For each case, we calculate the maximum, mean, and minimum of the received power and compare the results with those obtained in the first scenario (Table 4), as shown in Figure 4. According to this figure, when the PWE is configured specifically for users with known locations, as in Scenario 2, the users generally receive more power than in Scenario 1, in which the configuration is performed according to some predetermined receivers.
Moreover, we compare the results of the neural-network-based approach presented in [9] with those obtained in our Scenario 2 for a single user, as well as with regular propagation in a nonprogrammable environment. It is worth noting that, in this reference, the user can receive several signals, that is, five signals from different tiles. So, for a fair comparison, we modify our approach by assuming five users at the same place, as shown in Figure 5, and reporting the sum of the powers received by these users as the overall power. In addition, the transmitter works at 2.4 GHz with −30 dBmW transmitting power. Table 5 indicates that the proposed RL approach provides approximately the signal level reported in [9] for such a special case.
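When several received signals are combined into one overall power figure, values expressed in dBmW cannot be added directly; they are first converted to milliwatts. The sketch below shows this conversion, assuming a simple incoherent (phase-ignoring) sum, which is an assumption made here and may differ from how the simulator combines the signals.

```python
import math

def sum_dbm(powers_dbm):
    """Add powers given in dBm by summing them in milliwatts."""
    total_mw = sum(10 ** (p / 10) for p in powers_dbm)
    return 10 * math.log10(total_mw)

# Two equal 0 dBm signals combine to roughly 3 dBm; five equal signals add about 7 dB.
two_equal = sum_dbm([0.0, 0.0])
five_equal_gain = sum_dbm([-50.0] * 5) - (-50.0)
```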

Scenario 3: Performance Evaluation in Presence of Obstacles.
In this subsection, we consider a more realistic problem in which some obstacles can block the path of the EM waves. In the first case, we evaluate the proposed system's performance by assuming that one, two, and three obstacles exist in the indoor environment, as can be seen in Figure 6. For each case, we repeated the simulation ten times. Figure 7 gives the average received power of all cases for different numbers of users. According to this figure, increasing the number of obstacles in the PWE generally decreases the average received power. Nevertheless, the average received power is at least around 21 dBmW, which is an acceptable value in an indoor environment.
Afterward, we investigate the ability to establish wireless links between the users and the AP when some obstacles suddenly emerge after 500 episodes. Again, we consider the three cases shown in Figure 6, with the difference that there are no obstacles at the beginning of the communication, to evaluate the effect of environmental changes. Figure 8 illustrates the average received power of all cases for different numbers of users. As expected, the proposed RL-based approach adaptively configures the PWE for the new conditions to properly connect the AP and the receivers. In addition, we compare the number of episodes needed for our algorithm to converge in both cases, namely, with and without environment changes, as indicated in Figure 9. As expected, when the environment suddenly changes, the number of required episodes increases.
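The adaptation to a sudden obstacle can be pictured with the following sketch, in which a reported blockage is treated as a large punishment on the affected transition so that the greedy choice at that tile changes. The function names, the penalty of −100, and the initial Q-values are hypothetical, and treating the blocked hop as terminal (no bootstrapped future value) is a simplifying assumption.

```python
def report_blockage(q, tile, blocked_next, penalty=-100.0, alpha=0.2):
    """Apply one Q-learning step with a large punishment to the blocked hop."""
    old = q.get((tile, blocked_next), 0.0)
    q[(tile, blocked_next)] = old + alpha * (penalty - old)

def greedy_next(q, tile, candidates):
    """Pick the candidate next tile with the highest Q-value."""
    return max(candidates, key=lambda t: q.get((tile, t), 0.0))

# Hypothetical Q-values learned before the obstacle appears.
q = {("t1", "t2"): 5.0, ("t1", "t3"): 3.0}
before = greedy_next(q, "t1", ["t2", "t3"])
report_blockage(q, "t1", "t2")            # an obstacle now blocks t1 -> t2
after = greedy_next(q, "t1", ["t2", "t3"])
```

A single punished update is enough here to make the alternative hop preferable, which illustrates why only part of the Q-table needs to be relearned after a change rather than the whole configuration.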

Scenario 4: Performance Evaluation with Multipath Interference Cancellation.
In conventional wireless communication, the signal reaches the receiver not only via the direct path but also as a result of reflections from different objects. So, the overall signal at the receiver is a summation of the variety of signals being received. Since they all have different path lengths and therefore arrive with different phases, the overall signal strength varies as a result of the different ways in which the signals sum together. In this subsection, by using the results obtained in Scenario 3, we can overcome the multipath phenomenon. In other words, we consider an obstacle at the place of each user to prevent other signals from passing through the receiver. Figure 10 shows two examples of such a scenario. Moreover, [...] arrive at the destination is about 55 μs. It should be noted that we consider the average number of tiles activated to forward the EM waves to be five (Figure 2). Now, if the number of episodes needed for our algorithm to converge is 1000, according to equation (4), the total time for starting a communication will be about 55 ms.

Conclusion
In this paper, we proposed an approach based on reinforcement learning (RL) to adaptively configure the indoor wireless communication environment. The proposed approach can be applied to a multiple-user problem as well as to a changing environment. To evaluate the performance of our method, we considered four different scenarios, namely, providing permanent coverage for the NLOS region, providing particular connections for different numbers of users, providing connections in the presence of obstacles, and finally providing connections after the sudden emergence of some obstacles. Our evaluation indicates that the proposed RL-based approach can properly provide wireless links with a sufficient signal level in all the above-mentioned scenarios. The main concern about using the RL approach is the time required for updating the Q-table in the initial phase of the communication, before the data communication starts. Our calculation shows that this initial delay can be negligible for an indoor environment working at high frequencies. It should be noted that when the environment is constantly changing, the time needed for updating the Q-table may disrupt the data communication.
As future work, we plan to use deep reinforcement learning (DRL), instead of the conventional RL algorithm, to extend our approach to configuring more complicated environments. To this end, we can consider the IDs of the activated tiles and the current tile as well as the AP location as the neural network inputs. On the other hand, the output of the network can determine the reflection angle (proper action) to forward the EM waves to the next tile.

Data Availability
No data were used to support this study.

Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.