Over-the-Air Computation with Quantized CSI and Discrete Power Control Levels

In this paper, an Over-the-Air Computation (AirComp) scheme for fast data aggregation is considered. Multisource data are simultaneously transmitted by single-antenna mobile devices to a single-antenna fusion center (FC) through a wireless multiple-access channel. The optimal power levels at the devices and a postprocessing scaling function at the FC are jointly derived such that the mean square error of the computation is minimized. Different than the existing approaches that rely on perfect channel state information (CSI) at the FC and assume that the devices' optimal power levels can be selected from an infinite solution set, in the present paper, it is assumed that only quantized CSI is available at the FC and that the aforementioned optimal power levels lie in a finite discrete set of solutions. To derive the optimal power levels and the FC's scaling factor, a difficult nonconvex constrained optimization problem is formulated. An efficient and robust solution to quantization errors is developed via the deep reinforcement learning framework. Numerical results verify the good performance of the proposed approach while it exhibits a significant reduction in the required feedback.


Introduction
The sixth generation (6G) of wireless communications is foreseen to accommodate a huge number of mobile devices within the context of the so-called internet-of-things (IoT) for enabling novel and demanding applications such as smart cities, interconnected autonomous vehicles, and so forth [1]. These devices require the aggregation of massive data distributed to them, to support their functions. To that end, a promising technology called Over-the-Air Computation (AirComp) has recently emerged for fast wireless data aggregation [2, 3].
AirComp exploits the signal superposition property of the multiple-access channel (MAC) between the devices and a fusion center (FC) for averaging their simultaneously transmitted data over the wireless medium. By properly applying processing at both the devices and the FC ends, AirComp can be used to also calculate different data functions from their average that belong to the class of the so-called nomographic functions, for example, geometric mean and polynomial expressions. Recent works in the field of AirComp have expanded the original ideas in Nazer and Gastpar [2] and Soundararajan and Vishwanath [3] under different system models [4-7].
To deal with the fading characteristics of the wireless medium, the work in Cao et al. [8] and later works in the federated learning domain [9, 10] presented optimized power control schemes for AirComp systems by minimizing the computation error at the FC. This presented significant performance gains since it avoided the suboptimal approach of channel inversion power control, used in the previous works [4-7].
On the other hand, the approaches in [8-10] require perfect channel state information (CSI) at the FC side. This requirement can be very restrictive, especially since the CSI is typically estimated at the devices from the downlink training symbols and then has to be fed back to the FC. If highly accurate CSI is fed back to the FC, the required overhead could be extremely high. In the literature of communication systems, feedback mechanisms have been developed based on quantized CSI (QCSI) [11, 12]. In such approaches, the estimated CSI is quantized to one of a set of known states and only the index of the detected state is fed back, thus reducing the required overhead. This reduction comes at the cost of inferior performance if no methods that are robust to quantization errors are used. AirComp schemes with QCSI have yet to be developed and this is the first objective of this paper.
Furthermore, in [8-10], the optimal power levels at the devices are selected from an infinite space of values. Given that these devices are mostly of low hardware complexity and capability, for example, sensors, it is highly probable that they can support only a limited, finite number of power levels. This complicates the derivation of the optimal setup for the considered AirComp regime since it requires the solution to difficult discrete optimization problems. Moreover, it is very challenging to achieve satisfactory performance, due to the reduced solution set compared to the original approaches. On the positive side, such an approach results in reduced overhead for feeding back the optimal power levels to the mobile devices. Such solutions are not yet available for AirComp systems and their development is the second objective of the present paper. In more detail, the contributions of this work are as follows.
An IoT network applying AirComp over a fading MAC is assumed, based on a single-antenna FC and single-antenna devices. The devices simultaneously transmit their sensing data to the FC to calculate their average value. The objective is to determine the optimal transmit power levels for the devices and the postprocessing scaling factor applied at the FC, given that the FC has only QCSI knowledge and the devices can set their transmit power levels from a finite discrete set. The optimal transmit power levels and the FC's scaling factor are jointly derived such that the mean square error (MSE) of the computation is minimized. To that end, first a difficult nonconvex constrained optimization problem is defined with the view to minimize the MSE given the QCSI knowledge and the discrete set of transmit power levels. Then, a solution to the defined problem is developed based on the deep reinforcement learning (DRL) framework [13]. The DRL-based solution is able to exploit the computational power of the deep neural network (NN) in order to efficiently solve the difficult optimization problem while being robust to the CSI imperfections due to the quantization. Overall, this paper aims to show the efficacy of DRL in providing better suboptimal power level suggestions, in terms of MSE, compared with the typical scheme (i.e., direct inference of the optimal AirComp power control policy with the QCSI values as input). The comparisons are conducted with realistic system assumptions, including coarse CSI knowledge and discrete power levels. Numerical results show that the performance of the proposed approach is very satisfactory under coarse QCSI knowledge when compared to the perfect CSI approaches and their extensions to the QCSI case.
The rest of the paper is organized as follows: Section 2 describes the considered system model and formulates the problem to be solved. Section 3 derives the DRL-based algorithmic solution to the defined optimization problem. Section 4 presents the numerical results and Section 5 concludes this work.

System Model and Problem Formulation
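An AirComp scheme is considered over a MAC in which K single-antenna mobile devices send information to a single-antenna FC (Figure 1). Let us assume that the devices measure a set of time-varying parameters of the environment in which they are deployed. The FC aggregates the received information with the view to calculate the average of the data measured by the mobile devices. That is, in timeslot t, the FC calculates the function

$$f(t) = \frac{1}{K} \sum_{k=1}^{K} s_k(t), \tag{1}$$

where $s_k(t)$ is the observation of device k. The received signal at the FC, at timeslot t, is given by

$$y(t) = \sum_{k=1}^{K} h_k(t) \sqrt{p_k(t)}\, \frac{h_k^{*}(t)}{|h_k(t)|}\, s_k(t) + z(t) \tag{2}$$

$$\phantom{y(t)} = \sum_{k=1}^{K} \sqrt{p_k(t)}\, |h_k(t)|\, s_k(t) + z(t), \tag{3}$$

where $h_k(t)$ is the channel coefficient of channel k (also called perfect CSI), $p_k(t)$ is the transmit power of device k, $(\cdot)^{*}$ and $|\cdot|$ denote the conjugate and the absolute value of a complex number, respectively, and $z(t)$ is a complex additive white Gaussian noise variable at the FC of zero mean and variance $\sigma^2$. Upon receiving $y(t)$, the FC applies a denoising factor $\eta(t)$ for recovering the average measurement of the devices. Thus, the signal at the FC after postprocessing is given by

$$\hat{f}(t) = \frac{y(t)}{K \sqrt{\eta(t)}}, \tag{4}$$

where the scaling factor K is introduced for averaging. Evidently, the values of the power allocation variables $p_k(t)$, $1 \le k \le K$, and of the denoising variable $\eta(t)$ have to be determined in order to apply the AirComp scheme. A common approach for deriving the values of the required variables is the minimization of the MSE between the calculated average of the transmitted data $\hat{f}(t)$ in Equation (4) and the actual one $f(t)$. Under the assumption of statistically independent observations $\{s_k(t)\}$ among the users (taken, as in [8], to be of zero mean and unit variance), the instantaneous MSE can be shown to be given by

$$\mathrm{MSE}(t) = E\left\{ |\hat{f}(t) - f(t)|^2 \right\} = \frac{1}{K^2} \left[ \sum_{k=1}^{K} \left( \frac{\sqrt{p_k(t)}\, |h_k(t)|}{\sqrt{\eta(t)}} - 1 \right)^2 + \frac{\sigma^2}{\eta(t)} \right], \tag{5}$$

where $E\{\cdot\}$ is the expectation operator. By dropping the time index t for simplicity and based on Equation (5), the values of the power allocation variables $p_k$, $1 \le k \le K$, and of the denoising variable $\eta$ can be derived as the solution to the following minimization problem:

$$(\mathrm{P}_1): \quad \min_{\{p_k\},\, \eta \ge 0} \ \sum_{k=1}^{K} \left( \frac{\sqrt{p_k}\, |h_k|}{\sqrt{\eta}} - 1 \right)^2 + \frac{\sigma^2}{\eta} \quad \text{s.t.} \quad 0 \le p_k \le \bar{P}_k, \ 1 \le k \le K,$$

where $\bar{P}_k$ is the maximum power level of each sensor.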
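To make the role of the power levels and the denoising factor concrete, the following minimal sketch evaluates the instantaneous MSE for a given allocation. It assumes zero-mean, unit-variance, independent sources, so that the MSE is the sum of the per-device misalignment terms and the scaled noise term, divided by $K^2$. The function name `aircomp_mse` and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def aircomp_mse(powers, channels, eta, noise_var):
    """Instantaneous AirComp MSE (assumes zero-mean, unit-variance, independent sources).

    powers    : transmit powers p_k
    channels  : complex channel coefficients h_k (only |h_k| matters after phase alignment)
    eta       : denoising factor applied at the FC (must be > 0)
    noise_var : noise variance sigma^2 at the FC
    """
    k = len(powers)
    misalignment = sum(
        (math.sqrt(p) * abs(h) / math.sqrt(eta) - 1.0) ** 2
        for p, h in zip(powers, channels)
    )
    return (misalignment + noise_var / eta) / k ** 2

# Illustrative values (assumed): three devices with |h| = 1.0, 0.5, 1.0.
channels = [0.8 + 0.6j, 0.3 - 0.4j, 1.0 + 0.0j]
eta = 0.2
# Channel-inversion powers align every device perfectly, leaving only the noise term.
powers = [eta / abs(h) ** 2 for h in channels]
mse = aircomp_mse(powers, channels, eta, noise_var=0.01)
```

With channel inversion the misalignment terms vanish, so the remaining error is the noise term $\sigma^2 / (\eta K^2)$; any power constraint that prevents inversion for a weak channel reintroduces a misalignment penalty.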
This problem is nonconvex since the sensor power vector $\{p_k\}$ and the denoising factor $\eta$ are coupled in the objective function. On top of this, problem $(\mathrm{P}_1)$ requires CSI knowledge at the FC. The required CSI is estimated at each device via training symbols transmitted from the FC on the downlink. Thus, by exploiting the uplink-downlink channel reciprocity, the devices estimate the required CSI and then feed it back to the FC for solving $(\mathrm{P}_1)$. If perfect (or, to be more practical, highly accurate) CSI is assumed at the FC, the feedback phase results in huge communication overhead. To that end, in this paper, we assume a QCSI feedback scheme based on a predetermined QCSI codebook.
Let us now assume that the FC and the devices have knowledge of this predetermined QCSI codebook. Based on the CSI estimated via the previously described procedure, each device locates the closest representative entry in the codebook and feeds back to the FC only the index associated with the detected QCSI state, which requires reduced communication overhead.
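The codebook lookup described above can be sketched as follows. The codebook entries, the helper name `nearest_codeword_index`, and the use of the Euclidean distance in the complex plane are assumptions made for illustration; the paper only states that the closest representative entry is selected and that its index is fed back.

```python
import math

def nearest_codeword_index(h_est, codebook):
    """Index of the codebook entry closest to the estimated channel (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: abs(h_est - codebook[i]))

# Hypothetical 4-entry codebook shared by the FC and the devices (J = 2 bits).
codebook = [0.3 + 0.1j, 0.6 - 0.2j, 1.0 + 0.4j, 1.5 + 0.0j]

h_est = 0.7 - 0.1j                                  # CSI estimated at the device
index = nearest_codeword_index(h_est, codebook)     # only this index is fed back
feedback_bits = math.ceil(math.log2(len(codebook)))
h_qcsi = codebook[index]                            # QCSI value recovered at the FC
```

In this sketch the device sends 2 bits per channel estimate instead of a high-resolution complex value, at the price of the residual error `h_est - h_qcsi`.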
If the closed-form solution of Cao et al. [8] for $(\mathrm{P}_1)$ is applied directly under QCSI information, the performance exhibits severe degradation, especially for very coarse CSI quantization. Moreover, the situation is further complicated if it is assumed that the devices can set their power levels only through a discrete and finite codebook. This results in a feasible solution set for $(\mathrm{P}_1)$ that is discrete and thus in a very difficult optimization problem with no known efficient solution, whose solution has, in general, exponential complexity. The problem to be addressed under the QCSI and the finite set of power levels is defined as:

$$(\mathrm{P}_2): \quad \min_{\{p_k\},\, \eta \ge 0} \ \sum_{k=1}^{K} \left( \frac{\sqrt{p_k}\, |\hat{h}_k|}{\sqrt{\eta}} - 1 \right)^2 + \frac{\sigma^2}{\eta} \quad \text{s.t.} \quad p_k \in \mathcal{P}_k, \ 1 \le k \le K,$$

where $\mathcal{P}_k = \{P_{k,1}, \ldots, P_{k,M}\}$ is the set of discrete power levels for the kth device, $\hat{h}_k$ is the QCSI feedback of channel k, and M is the number of power levels, assumed to be the same for all the devices without loss of generality. In the following, a solution to $(\mathrm{P}_2)$ is developed based on the DRL framework, which effectively deals with the QCSI errors and the discrete levels of the devices' power.
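For very small instances, the discrete problem can be solved by enumerating every joint choice of power levels and denoising factor, which gives a reference point for any learned policy. The sketch below is such a brute-force baseline, not the method of the paper; it assumes that the denoising factor is also restricted to a finite grid of strictly positive values, and the names `objective` and `exhaustive_search` as well as the numbers are illustrative.

```python
import itertools
import math

def objective(powers, qcsi, eta, noise_var):
    """Sum of misalignment terms plus scaled noise, evaluated on the QCSI values."""
    misalignment = sum(
        (math.sqrt(p) * abs(h) / math.sqrt(eta) - 1.0) ** 2
        for p, h in zip(powers, qcsi)
    )
    return misalignment + noise_var / eta

def exhaustive_search(qcsi, power_levels, eta_levels, noise_var):
    """Enumerate all M**K * Phi joint candidates and return (value, powers, eta)."""
    best = None
    for eta in eta_levels:
        for powers in itertools.product(power_levels, repeat=len(qcsi)):
            value = objective(powers, qcsi, eta, noise_var)
            if best is None or value < best[0]:
                best = (value, powers, eta)
    return best

# Tiny illustrative instance (assumed values): K = 2 devices, M = 3, Phi = 3.
qcsi = [0.5, 1.0]
levels = [0.25, 0.5, 1.0]
best_value, best_powers, best_eta = exhaustive_search(qcsi, levels, levels, noise_var=0.01)
num_candidates = len(levels) ** len(qcsi) * len(levels)
```

Here only 27 candidates are visited, but the joint candidate count grows as $M^K \times \Phi$, which already reaches $10^{21}$ for K = 20 and M = Φ = 10.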

DRL-Based Solution
In this section, the solution to $(\mathrm{P}_2)$ is derived via a deep Q-learning (DQL) method. DQL is a DRL method that utilizes a NN as a quality function estimator (Q-value) [14]. In principle, the deep Q-network (DQN) agent observes the wireless environment in the form of a state $s \in S$ and performs an action $a \in A$, where S and A correspond to the state and action spaces, respectively [15]. Then, depending on the quality of the performed action, the agent receives a reward r. The DQL method involves the Bellman equation, given by

$$Q_t(s_t, a_t) = (1 - \alpha)\, Q_{t-1}(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') \right],$$

where $s_t$ is the state of the environment at time t and $a_t$ is the action performed by the agent. The hyperparameters $\alpha \in [0, 1]$ and $\gamma \in [0, 1]$ correspond to the learning rate and the discount factor and are used as a trade-off between previous Q-values $Q_{t-1}(s_t, a_t)$, immediate rewards $r(s_t, a_t)$, and the optimal future rewards $\gamma \max_{a'} \{Q(s_{t+1}, a')\}$. In practice, the Bellman equation quantifies the quality of being in state $s_t$ and performing the action $a_t$, designating the learning strategy framework. The DQN agent interacts with the environment in a trial-and-error process, ideally performing all possible actions A from all possible states S.
Therefore, the agent gains experience regarding the favorable and disadvantageous actions from any current state through the reward function during the training phase of the DQL algorithm [16]. Regarding the deep learning context, two identical (in dimensions) NNs are involved: (i) the Q-network, which is used to estimate the current best action (considered to contain the input features) and (ii) the target Q-network, which is used to estimate the next action (or action policy) that will return the maximum long-term reward (considered to contain the output labels). Once the training phase of the DQL algorithm has been finalized and the hyperparameters γ and α of the Bellman equation have been stabilized, the pretrained agent may be utilized for inference purposes in order to determine the action selection policy.
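As a small numerical illustration of the Bellman update, the sketch below applies one step of the standard Q-learning rule, in which the new Q-value blends the previous estimate with the immediate reward plus the discounted best future value. The function name `q_update` and the numbers are illustrative assumptions; in the DQN setting the same target is regressed by the Q-network rather than stored in a table.

```python
def q_update(q_prev, reward, next_q_values, alpha, gamma):
    """One Q-learning step: blend the old estimate with the bootstrapped target."""
    target = reward + gamma * max(next_q_values)   # immediate reward + discounted future
    return (1.0 - alpha) * q_prev + alpha * target

# Illustrative values (assumed): previous estimate 1.0, reward 1.0,
# and two candidate actions in the next state with Q-values 0.5 and 2.0.
q_new = q_update(q_prev=1.0, reward=1.0, next_q_values=[0.5, 2.0], alpha=0.5, gamma=0.9)
```

The target is 1.0 + 0.9 × 2.0 = 2.8, and with α = 0.5 the updated value lies halfway between the old estimate and the target. A smaller α moves the estimate more slowly, while a smaller γ gives less weight to future rewards.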
A solution to $(\mathrm{P}_2)$ is derived based on a DQN agent located at the FC, which interacts with the wireless environment. The design parameters of the DQN agent are described as follows.
State space: The state space describes the wireless environment from the communications perspective. In the proposed solution, the state space includes the combined QCSI and power information of each wireless sensor. At a given time t, the system state can be expressed as $s_t = [s_1, s_2, \ldots, s_K]$, with $s_k$ being related to the power $p_k \in \mathcal{P}_k$ and the channel coefficient $h_k$ of sensor k. The QCSI value of sensor k is represented by $\hat{h}_k = W(h_k)$, where $W(\cdot)$ is a quantization function that depends on the number of quantization bits J. The set of QCSI values is denoted by $\hat{\mathcal{H}}$. The state values of sensor k can then be derived by $s_k \in \mathcal{P}_k \times \hat{\mathcal{H}}, \forall k$ (all possible states are $2^{JK}$). In this context, the DQN agent at the FC receives the combined information related to the QCSI and power values for all sensors (Figure 1).
Action space: Upon observing the system state, the DQN agent selects an action during a specific training episode. Specifically, at a given time step t, a discretized power level $p_{k,m}$ is selected by the agent and assigned to each sensor, along with a discretized denoising factor $\eta$. Formally, the DQN agent action is described as $a_t = [p_{1,m}, p_{2,m}, \ldots, p_{K,m}, \eta_\phi]$ and its dimensionality is $K + 1$. As mentioned above, the power values that are assigned to the wireless sensors lie in the sets $\{\mathcal{P}_k\}$. Similarly, the values selected by the agent for the denoising factor $\eta$ from Φ levels lie in the set $\mathcal{H} = \{0, \eta_1, \ldots, \eta_\Phi\}$ (all possible actions are $M^K \times \Phi$). The selected action is then implemented on the wireless environment and the state is updated at the next time step, since it encompasses the updated power vector of all wireless sensors.
Reward function: The performed action $a_t$ from a state $s_t$ results in a new state $s_{t+1}$ and a positive or zero reward, depending on whether this action was beneficial toward the optimization goal. At a given time t, the reward is defined through the objective function F, which can be expressed by:

$$F = \sum_{k=1}^{K} \left( \frac{\sqrt{p_k}\, |h_k|}{\sqrt{\eta}} - 1 \right)^2 + \frac{\sigma^2}{\eta}.$$

Evidently, the reward function leads the DRL agent during the training process to gradually favor a sequence of actions that minimize the F function and thus also the MSE in Equation (5). Note that the training procedure is based on DRL principles with experience replay [14].
The complete procedure for training the AirComp-DRL model is summarized in Algorithm 1. Following the initialization of the learning hyperparameters (α and γ), a replay memory D (filled with experience/transition tuples corresponding to random actions), the two Q-function approximators (Q- and target Q-neural networks with random weights), and the $2^J$-level channel quantizer $W(\cdot)$ are also initialized. In each training episode, the wireless environment is initialized before the agent begins to perform actions by randomly selecting the power levels $p_k$ for the sensors and the $\eta$ value, which constitute the initial system state $S_0$. Depending on the phase of the training process, an action $a_t$ is selected; that is, the power levels of the sensors are randomly selected in the exploration mode, whereas the power vector is estimated by the Q-network during the exploitation mode of the algorithm. The performed action leads the environment to a new system state $S_{t+1}$ and a reward $r_t$ is returned to the DRL agent, according to the objective of the reward function. The transition tuple $(S_t, a_t, r_t, S_{t+1})$ is stored in the replay memory D, while a minibatch of experience tuples is randomly selected from D and, based on the Bellman equation, the Q-network is used to estimate the quality of immediate and future actions (in case $S_{j+1}$ is not a terminal system state). Thereafter, the gradient descent method is utilized to update the weights of the Q-network neurons (backpropagation), using the target Q-network estimations as output labels (every $N_c$ steps of the algorithm, the weights of the Q-network are copied to the target Q-network). Finally, to gradually transit from exploration to exploitation, an ϵ-greedy method with linear decay is adopted. Notably, the DRL training efficiency is highly influenced by the degree of exploration completeness, which in turn depends on the state/action space dimensionality. A sufficient number of training episodes T should ensure that the agent visits as many state/action pairs as possible.
Regarding the inference procedure of a pretrained model, the DRL agent performs actions only in exploitation mode ($\epsilon = 0$), while the storing of experience tuples in the replay memory and the backpropagation processes are simply omitted.
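The exploration schedule and the replay memory used during training can be sketched as follows. This is a minimal illustration of ϵ-greedy selection with linear decay and of uniform minibatch sampling; the helper names, the decay horizon, the buffer size, and the placeholder transitions are all assumptions, and the Q-values would in practice come from the Q-network.

```python
import random
from collections import deque

def linear_epsilon(step, decay_steps, eps_start=1.0, eps_end=0.0):
    """Exploration rate that decays linearly from eps_start to eps_end."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over a list of Q-values (one per discrete action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploitation

rng = random.Random(0)
replay_memory = deque(maxlen=1000)   # oldest transitions are discarded first

for step in range(5):
    epsilon = linear_epsilon(step, decay_steps=4)
    action = select_action([0.1, 0.9, 0.3], epsilon, rng)
    # Placeholder transition (state, action, reward, next_state).
    replay_memory.append((step, action, 0.0, step + 1))

# Uniformly sampled minibatch used for one gradient step on the Q-network.
minibatch = rng.sample(list(replay_memory), k=3)
# Inference corresponds to epsilon = 0: the greedy action is always returned.
greedy_action = select_action([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
```

Setting `epsilon=0.0` reproduces the inference mode described above, in which no transitions are stored and no weights are updated.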

Simulation Results
In this section, numerical results are presented both for the DRL training phase and for the MSE comparison between different schemes. The simulations were conducted in Python 3.8, and the libraries TensorFlow (version 2.3), Keras, and Scikit-Learn were used for constructing and training the AI/ML models. The scripts ran on a personal computer (CPU i7-8700; 3.2 GHz; RAM 8 GB; no GPU usage).

DRL Training.
A QCSI-based AirComp system with 20 sensors is considered during the DRL hyperparameter stabilization (see Figure 2). In addition, a time-varying channel model, composed of a dominant pathloss component and a Rayleigh fading component with variance $\sigma_c^2 = 0.1$, is adopted for the rest of the simulations to represent dynamic channel conditions. The channel quantizer $W(\cdot)$ is implemented via a k-means clustering algorithm trained over 10,000 channel samples (where k represents the number of quantization levels, $2^J = 4$). In this sense, time-varying QCSI values are represented by the k-means centroids based on a minimum Euclidean distance criterion. Without loss of generality, it is assumed that the numbers of power levels and of discrete FC scaling factor levels involved in the DRL solution are $M = \Phi = 10$. The values for the sets $\mathcal{P}_k$, $1 \le k \le K$, and $\mathcal{H}$ are derived by uniformly discretizing the continuous sets $(0, P_{\max}]$ and $(0, \eta_{\max}]$, respectively, where $P_{\max} = 1$ W and $\eta_{\max} = 1$. The noise variance at the receiver (FC) is set to $\sigma^2 = 0.01$. Upon testing multiple NN setups, we settled on a NN with three fully connected hidden layers with sizes $3\times$, $2\times$, and $1\times (MK + \Phi)$, while the update frequency of the target Q-network is set to $N_c = 100$ steps. The activation function of all neurons in the hidden layers is the rectified linear unit (ReLU), whereas the neurons of the output layer employ a linear activation.
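The discretization and network sizing described above can be made concrete with a short calculation. The sketch below assumes that the uniform grid places M equally spaced levels ending at the maximum value (which for M = 10 and a maximum of 1 yields 0.1, 0.2, ..., 1.0), and it reads the hidden-layer widths as 3, 2, and 1 times (MK + Φ). Both readings, as well as the helper names, are assumptions rather than details confirmed by the paper.

```python
def uniform_levels(max_value, count):
    """Equally spaced levels in (0, max_value], e.g. 0.1, 0.2, ..., 1.0 for count = 10."""
    return [max_value * (i + 1) / count for i in range(count)]

def hidden_layer_sizes(num_devices, num_power_levels, num_eta_levels):
    """Hidden-layer widths read as 3x, 2x, and 1x (M*K + Phi); an assumed interpretation."""
    base = num_power_levels * num_devices + num_eta_levels
    return [3 * base, 2 * base, base]

power_levels = uniform_levels(1.0, 10)    # P_max = 1 W, M = 10
eta_levels = uniform_levels(1.0, 10)      # eta_max = 1, Phi = 10
layers = hidden_layer_sizes(num_devices=20, num_power_levels=10, num_eta_levels=10)
```

For K = 20 and M = Φ = 10 this gives MK + Φ = 210 and hidden layers of 630, 420, and 210 neurons, which shows how quickly the network grows when either the number of sensors or the power granularity is increased.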
Notably, the DRL reward convergence defines the extent to which the resulting policy can optimize the objective function. Initially, two of the most critical hyperparameters involved in the DRL training process, namely the learning rate α (which controls the update ratio between new and previous Q-values) and the discount factor γ (which balances the degree to which immediate or future-expected rewards are preferred), were fine-tuned to ensure the best reward convergence. To that end, hyperparameter stabilization was obtained by inspecting the training/learning curve for varying values of α (see Figure 2) and γ (see Figure 3). As shown in Figures 2 and 3, the reward time course gradually transits from the exploration to the exploitation stage, reaching the highest values for $\alpha = 0.0001$ and $\gamma = 0.9$ (the reward function was relatively insensitive to the value of γ). Both parameters were set to these values for the rest of the simulations.

Impact of Power Granularity.
This section includes further simulations related to the impact of the power granularity (the number of available power levels M that can be selected by the DRL agent). In general, every MSE minimization model in AirComp systems presents a total estimation error (in MSE solutions) which is equal to the sum of four individual error terms: (i) channel quantization error (introduced by the bit-based representation of CSI), (ii) power discretization error (derived from the realistic and discrete power level configuration of the sensors), (iii) η discretization error (the scaling factor of data fusion takes practically discrete values), and (iv) model fitting error (resulting from the model itself in attempting to ensure a good trade-off between over- and under-fitting, also called generalization error). Power granularity is a crucial parameter for the DRL performance and MSE optimization, since it defines the extent to which the agent can precisely tune the transmit power of the sensors for a given power range. The target is to investigate whether increasing M actually improves the DRL performance, given a fixed CSI quantization (here 4-level quantization), number of sensors (K = 30), and power range (here 0.1-1 W). Variations in the CSI quantization and/or power range do not result in loss of generality of the conclusions.
Specifically, the higher the number of power levels for a given available power range (e.g., from 0.1 to 1 W), the better the training performance. This is attributed to the fact that the power granularity increases with increasing M, so that the action information (i.e., the outputs of the DRL agent) better approximates the perfect power level (the one selected by the optimal solution). Ideally, one could expect that when the number of available power levels is infinite (i.e., $M \to \infty$), the MSE of DRL approaches the MSE of the optimal solution more closely, given that the power level suggestions are almost continuous values and not discretized levels. Notably, the optimal solution outcome depends drastically on the inputs of the DRL agent, which are the QCSI values. Intuitively, when both perfect CSI and continuous power levels are considered, the DRL solution is closer to the optimal one. However, in realistic conditions, as the power granularity increases, the DRL becomes more demanding in terms of network dimensionality, given that the number of output layer neurons increases with the number of power levels. Thus, there is an upper bound on the power granularity, above which the MSE starts to degrade due to the concurrent increase of model dimensionality and complexity. The higher the dimensionality of the output layer, the more demanding the training phase, because the number of available actions is increased (i.e., more Q-values have to be estimated in the output neurons of the DRL agent).
As shown in Figure 4, the MSE performance follows a U-shaped form as a function of the number of power levels M. This means that, for a given power range, number of sensors, and quantization level, the MSE is improved (i.e., lower MSE values) until M reaches a threshold (here the critical value is M = 20). Beyond this critical value of M, the MSE performance starts to degrade (i.e., higher MSE values) because of the higher complexity and dimensionality of the DRL model. Specifically, complexity is proportional to the NN density, which also increases with the number of available power levels. We also note that, as the number of available actions that can be selected by the DRL agent is increased, the NN dimensionality should be increased to accurately estimate the large number of available power configurations. This U-shaped function of MSE versus M implies that the DRL performance in AirComp MSE minimization comes with a power granularity limitation, with the latter requiring no more than M = 20 power levels. In conclusion, power granularity has to be carefully selected in MSE minimization problems, so as

Comparative Results.
For comparison purposes, the MSE at the FC is computed for the proposed approach and compared to that of two baseline schemes: (i) the optimal power allocation and FC denoising factor scheme using perfect CSI knowledge in Cao et al. [8] and (ii) the optimal power and η allocation strategy computed in Cao et al. [8] under QCSI knowledge.
The MSE can be calculated as $F / K^2$ for the three aforementioned schemes, taking into account the sensors' assigned power levels and the assigned denoising factor that emerge from each allocation framework. A comparison of the MSE computed at the FC receiver for the three solutions is shown in Figure 5 for $J = 1$ and a varying number of wireless sensors. It is observed that the DRL-assisted solution reaches lower MSE values (∼10 dB) compared to the optimal solution under coarse QCSI knowledge, regardless of the number of sensors that participate in the AirComp system. Similar results can be observed in Figure 6, where the MSE at the FC is compared among the three solutions for $J = 2$ with respect to the number of IoT sensors. Notably, the MSE gap between the optimal solution with perfect CSI and with QCSI knowledge is reduced due to the increased number of quantization levels. Nevertheless, the power vector and denoising factor allocation strategy provided by the DRL solution achieves lower MSE values (∼3-4 dB) than the optimal QCSI solution.
As indicated by the results, the benefit of the deployed DRL framework in an AirComp system becomes apparent in cases where the quantization error significantly degrades the AirComp MSE performance. Consequently, the communication overhead between the IoT sensors and the FC can be effectively reduced (low number of transmitted quantization bits) without significant degradation of the MSE value and of the overall performance of the AirComp system.

DRL Versus Optimal Solution under Coarse CSI Knowledge.
The main drawback of the optimal solution is that it requires perfect CSI knowledge and assumes perfect precision in the power configuration. In realistic conditions, the perfect CSI is not known and communicating a precise power level requires a high number of bits to be transmitted. The assumption of discrete power levels is adopted in this work primarily to reduce the information that has to be exchanged by low-capacity and low-memory devices. Thus, here we aim to demonstrate the efficacy of applying DRL under coarse QCSI knowledge.
Assuming that the optimal solution is inferred with the QCSI values as input, the resulting solution deviates from the perfect power allocation and MSE minimization, primarily due to the quantization error introduced by the QCSI inputs. These quantization errors cannot be corrected by the optimal solution itself, since the latter is basically a closed-form equation that expects the perfect CSI values as inputs. The reason why DRL outperforms the optimal solution inferred with QCSI as input may be attributed to the reward function that is used to train the agent. Specifically, the rewards received by the agent (only during training) take into account the perfect CSI to better estimate the Q-values of all power actions. Note that the training is performed offline with simulated data, whereas during the inference phase, the agent is directly compared with the optimal QCSI method using only the QCSI. The inference output (i.e., the power suggestions) is used to calculate the resulting MSE, which was found to be lower than that of the optimal method with QCSI.
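The effect of feeding quantized inputs to a rule that expects perfect CSI can be illustrated with a toy calculation. The sketch below does not implement the closed-form solution of [8]; it uses plain channel inversion as a stand-in allocation rule, and the channel values, the function name `objective`, and the parameters are all assumed for illustration. The allocation looks perfectly aligned when evaluated on the QCSI, yet it incurs a misalignment penalty on the true channels.

```python
import math

def objective(powers, channels, eta, noise_var):
    """Objective F: misalignment terms plus scaled noise (MSE = F / K**2)."""
    misalignment = sum(
        (math.sqrt(p) * abs(h) / math.sqrt(eta) - 1.0) ** 2
        for p, h in zip(powers, channels)
    )
    return misalignment + noise_var / eta

# Assumed example: true channel magnitudes and their quantized representatives.
true_h = [0.45, 0.95]
qcsi = [0.5, 1.0]
eta, noise_var = 0.25, 0.01

# Stand-in rule (channel inversion, NOT the closed form of [8]) applied to the QCSI.
powers_from_qcsi = [eta / abs(h) ** 2 for h in qcsi]

f_believed = objective(powers_from_qcsi, qcsi, eta, noise_var)    # what the rule "sees"
f_actual = objective(powers_from_qcsi, true_h, eta, noise_var)    # what the FC experiences
mse_actual = f_actual / len(true_h) ** 2
```

In this example the believed objective equals the noise term alone, while the actual one is larger because the true channels are weaker than their quantized values. A reward computed from the true channels during training exposes exactly this gap to the agent.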

Conclusion
In this work, a wireless sensor AirComp system with QCSI feedback and discrete power control levels is studied. To mitigate the quantization error introduced in the AirComp's MSE calculation, a DRL-assisted framework for power level and denoising factor selection is described and implemented, jointly exploiting quantized channel and power information. The proposed DRL model is compared against the analytical (optimal) MSE optimization solution, assuming both perfect and quantized CSI knowledge and power level configuration. Numerical results confirm the ability of the centrally placed DRL agent to reduce the performance gap between the optimal solution with and without ideal CSI. Overall, this study demonstrates the advantage of the DRL-assisted solution under coarse QCSI conditions, highlighting the effective communication overhead reduction without considerable degradation of the AirComp system's performance.

FIGURE 2: Learning curves for different values of learning rate α as a function of the training episodes.

FIGURE 4: MSE performance derived by DRL as a function of the number of power levels M.

FIGURE 6:  Comparison of MSE as a function of the number of sensors between the DRL (red), and optimal schemes with perfect (green) and quantized (blue) CSI.Four-level quantization is considered.

FIGURE 5: Comparison of MSE as a function of the number of sensors between the DRL (red), and optimal schemes with perfect (green) and quantized (blue) CSI.Two-level quantization is considered.
FIGURE 1: System model. The IoT devices are measuring time-varying parameters that are transmitted with device-specific power $p_k$ over channel $h_k$. The FC aggregates the received signals and applies a denoising factor $\eta$.

Take action $a_t$, observe reward $r_t$ and state $S_{t+1}$
Store transition $(S_t, a_t, r_t, S_{t+1})$ in D ▹ Experience Replay
Select random minibatch of transitions (S t , a t , r t