Adaptive Quantization for Multichannel Wiener Filter-Based Speech Enhancement in Wireless Acoustic Sensor Networks

Speech enhancement in wireless acoustic sensor networks requires the exchange of audio signals. Since the wireless communication often dominates the nodes' energy budget, techniques for data exchange reduction are crucial. Adaptive quantization aims to optimize the bit depth of each exchanged signal according to its contribution to the speech enhancement performance. This enables the network to scale its energy and communication bandwidth requirements according to the current operating environment. The impact metric was previously proposed to predict the effect of quantization in linear minimum mean squared error (MMSE) estimation. We provide new insights into greedy adaptive quantization based on this impact metric. We achieve this by expanding the mathematical framework to include a new metric based on the gradient of the MMSE as a function of the quantization noise power. Using these tools, we show how the MMSE gradient naturally leads to a greedy algorithm and how the impact metric is a generalization of the gradient metric and a previously proposed metric. In addition, we validate the impact metric for adaptive quantization both in a simulated and in a real wireless acoustic sensor network deployed in a home environment, showing the energy savings achievable through greedy adaptive quantization.


Introduction
A wireless acoustic sensor network (WASN) is a collection of battery-powered sensor nodes where each node is equipped with a microphone or microphone array, a processing unit, and a wireless communication module [1]. The nodes are distributed over an area of interest with the goal of performing a signal processing task such as noise reduction or acoustic localization. The main advantage of a WASN over a single stand-alone microphone array is its extended coverage, which is made possible by placing many microphones over the area of interest. This typically translates into a better performance, as microphone array algorithms benefit from enhanced spatial diversity. Furthermore, the deployment of a WASN often yields a higher probability of having microphones close to a sound source, which is advantageous since these microphones will record signals with a high signal-to-noise ratio (SNR).
Nevertheless, WASNs pose several technical challenges that are not present in stand-alone microphone arrays, such as internode synchronization, delay management, communication bandwidth usage, and energy efficiency. The latter, energy efficiency, is crucial to allow the network to perform its task for a reasonable period of time, since nodes are mostly powered by batteries and hence have a tight energy budget. A significant effort has been made to classify the different approaches to improve energy efficiency in wireless sensor networks (WSNs), as the optimal techniques depend on the intended WSN application. A comprehensive taxonomy of these approaches can be found in [2], and a more recent survey in [3] also considers the importance of the different techniques for specific classes of applications of WSNs.
In this paper we focus on a speech enhancement application for a WASN, where the goal is to estimate a desired speech signal while suppressing interfering sound sources and noise. In particular, we focus on the multichannel Wiener filter (MWF) [4][5][6].
1.1. Sensor Subset Selection. A substantial part of previous research on energy efficiency in WSNs has been focused on the sensor subset selection problem, which is aimed at using only the signals from those sensors (microphones, in the case of WASNs) that provide a significant contribution to the signal processing task at hand, while putting other sensors to sleep. This saves energy by avoiding the transmission of signals from sensors with low relevance and allows the communication bandwidth resources to be allocated to the transmission of the signals from the most useful sensors. The sensor subset selection problem is combinatorial and thus difficult to solve in general. Due to its importance, it has been the focus of extensive research, and several techniques have been proposed to tackle it. For an overview of these techniques, the reader is directed to [7]. Recent work on sensor selection can be found in [8,9] and references therein. In [8] the authors investigate the sensor selection problem for parameter estimation in a WSN where the sensor measurements follow a nonlinear model, assuming that the measurements are independent random variables. The problem is formulated as a nonconvex optimization problem and solved through convex relaxation. In [9] the authors develop a more general framework where they consider correlated measurement noise and propose a greedy algorithm to solve the sensor selection problem based on the Fisher information matrix.
A different approach has been proposed to solve the sensor selection problem for signal estimation based on a greedy algorithm using the utility metric [10,11]. The utility of a sensor signal is defined as the change in estimation performance when the sensor is removed from the estimation process and the corresponding estimator is subsequently reoptimized. The motivation is that the utility can be computed and tracked at a very low computational cost, which combined with the greedy approach allows performing sensor subset selection swiftly and at low complexity, even though the solution will generally be suboptimal. In addition, the algorithm is fully data-driven and does not require any prior knowledge of the underlying measurement model, such as the microphone and source positions or the acoustic transfer functions, which is generally not available in WASN applications. This priority on speed and low complexity is crucial for adaptive signal estimation, since the network needs to react rapidly to the changing signal conditions (e.g., sound sources moving in the case of a WASN) and has to avoid investing too much energy from the already limited budget of the nodes. This approach has been specifically applied to WASNs [12], and it has been extended to a distributed implementation of the MWF [13].
1.2. Adaptive Quantization. While sensor subset selection does indeed help to save energy and communication bandwidth, it forces the nodes into a binary behaviour; that is, they either transmit their signals at full resolution or they are put to sleep. One technique to provide a more flexible scaling of the estimation performance and the energy consumption of the network is adaptive quantization, where each sensor signal is assigned a variable bit depth to encode its signal samples according to its contribution to the estimation performance. By using this technique, nodes are able to spend more or less energy on data transmission according to the estimation performance required. From the point of view of information theory, this problem can be tackled using source coding techniques. A comprehensive overview of source coding for WASNs can be found in [14,15], where the focus is directed towards theoretical results based on rate-distortion theory.
In [16], a pragmatic approach is taken, in which a generalized version of the utility metric referred to as the impact metric is introduced to predict the MMSE increase in the estimation due to the quantization noise. This allows modeling the effect on the estimation performance of the quantization noise that results from changing the bit depth of each sensor signal's samples. The impact metric can be used by a heuristic algorithm to gradually decrease the bit depth in each sensor signal until a target MMSE (or corresponding SNR) is met.

1.3. Contributions and Outline of the Paper. The goal of this paper is twofold. Our first goal is to provide some new insights on greedy adaptive quantization based on the impact metric from [16]. To this end, we expand the mathematical framework for adaptive quantization in linear MMSE estimation and we apply it in a WASN with a centralized processing architecture. We consider the MMSE as a function of the quantization noise power in each sensor signal, and based on this we define a new metric for adaptive quantization based on the gradient of the MMSE. We demonstrate how this MMSE gradient naturally gives rise to a greedy algorithm. We then show how the impact metric is in fact a generalization of this gradient metric, which then also motivates the use of a greedy algorithm using the impact metric. In addition, we explain how the utility metric for sensor subset selection [10,11] can be viewed as another limit case of the impact metric. Finally, we discuss the theoretical advantages and disadvantages of each metric and propose a correction to improve the gradient metric.
The second goal of the paper is to validate the impact metric for adaptive quantization in a speech enhancement task in a simulated as well as in a real-life WASN in a home environment. We compare the behaviour of the four metrics and show the superiority of the impact and the corrected gradient metrics over the gradient and utility metrics due to their inherent adaptation to the significance of each quantization bit. To conclude, we provide an estimate of the savings in transmission energy achievable through the use of the greedy adaptive quantization algorithm based on the aforementioned metrics.
The paper is structured as follows. In Section 2, we formulate the problem statement and signal model, we briefly review the multichannel Wiener filter for speech enhancement, and we introduce the quantization error model that is used throughout the paper. In Section 3 we model the effect of quantization noise in linear MMSE estimation and show how adaptive quantization can be performed based on four metrics derived from this model (utility, impact, gradient, and corrected gradient). In Section 4 we show experimental results of adaptive quantization for speech enhancement performed on real recordings from a WASN. Finally, we present the conclusions in Section 5.

Problem Statement
We consider a WASN composed of several nodes, each having one or more microphones, with $K$ microphones in total. The signal samples of the $k$th microphone signal are encoded, upon acquisition by the analog-to-digital converter, with a certain bit depth dictated by the hardware in use. We consider a centralized scheme for the network, where each node transmits its microphone signals to a fusion centre, which could be one of the nodes in the WASN or an external node with access to more computational power or energy resources. The fusion centre's task is to obtain an estimate of the desired speech component present in one of the microphone signals, which will be referred to as the reference microphone signal (the reference microphone does not necessarily belong to the fusion centre; the microphone of any node can be selected to be the reference). This speech enhancement task is solved in the fusion centre through the use of a multichannel Wiener filter [4][5][6], which produces a linear MMSE estimate of the desired speech signal component in the reference microphone signal. We will give a brief review of the MWF in Section 2.2.
Our main focus will be the problem of reducing the bit depth of each individual microphone signal in the WASN according to its contribution to the speech enhancement performance. The bit depth reduction leads to a reduction in the required communication bandwidth and in the node's required energy budget for wireless transmission, but it will also have an impact on the speech enhancement performance. Moreover, the contribution of each node to the enhancement performance is subject to changes in the acoustic scenario, so we will focus on strategies with low computational complexity that allow the fusion centre to decide quickly on the desired bit depth assignment for each individual microphone. This enables each node at runtime to scale down the energy spent in wireless transmission according to the current operating environment.
An illustration of the problem is given in Figure 1, where a small network with two nodes and a fusion centre is depicted. The nodes quantize the signals of each individual microphone $k$ with the corresponding bit depth $b_k$ before transmission. The fusion centre performs the speech enhancement task using the transmitted quantized microphone signals (dotted lines) and takes a decision on the optimal bit depth for each communicated microphone signal (dashed lines).
In the remaining part of this section we formally introduce the signal model for the WASN, we briefly review the multichannel Wiener filter for speech enhancement, and we explain the quantization error model we will use throughout the rest of the paper.
2.1. Signal Model. We denote the set of microphones by $\mathcal{K} = \{1, \ldots, K\}$. The signal $y_k$ captured by the $k$th microphone can be described in the short-time Fourier transform (STFT) domain as
$$y_k(\kappa, \omega) = x_k(\kappa, \omega) + v_k(\kappa, \omega), \quad k \in \mathcal{K}, \tag{1}$$
where $\kappa$ is the frame index, $\omega$ represents frequency, $x_k(\kappa, \omega)$ is the desired speech signal component, and $v_k(\kappa, \omega)$ is the undesired noise signal component. We assume that $x_k(\kappa, \omega)$ and $v_k(\kappa, \omega)$ are uncorrelated. We note here that $v_k(\kappa, \omega)$ contains all undesired sound signals, which may include speech from undesired speakers besides acoustic noise. For the sake of simplicity, we will omit the indices $\kappa$ and $\omega$ throughout the rest of the paper, keeping in mind that all operations take place in the STFT domain unless explicitly stated otherwise. The fusion centre stacks all signals in the $K \times 1$ vector
$$\mathbf{y} = [y_1, y_2, \ldots, y_K]^T. \tag{2}$$
The vectors $\mathbf{x}$ and $\mathbf{v}$ are defined in a similar manner, so the relationship $\mathbf{y} = \mathbf{x} + \mathbf{v}$ is satisfied.

2.2. Multichannel Wiener Filter.
In speech enhancement, the goal is to obtain an estimate of the speech component $x_{\mathrm{ref}}$ present in the microphone signal $y_{\mathrm{ref}}$ selected as the reference. We will focus on the multichannel Wiener filter to perform the speech enhancement task, and we will provide a brief summary in this section. For more information the reader is directed to [4][5][6].
The multichannel Wiener filter is the linear estimator $\hat{\mathbf{w}}$ that minimizes the mean squared error (MSE)
$$J(\mathbf{w}) = E\{|x_{\mathrm{ref}} - \mathbf{w}^H \mathbf{y}|^2\}, \tag{3}$$
where $E\{\cdot\}$ is the expectation operator and the superscript $H$ denotes conjugate transpose. When the microphone signal correlation matrix $\mathbf{R}_{yy} = E\{\mathbf{y}\mathbf{y}^H\}$ is full rank, the solution to the minimization problem is given by
$$\hat{\mathbf{w}} = \mathbf{R}_{yy}^{-1} \mathbf{r}_{y x_{\mathrm{ref}}}, \tag{4}$$
where $\mathbf{r}_{y x_{\mathrm{ref}}} = E\{\mathbf{y} x_{\mathrm{ref}}^*\}$ and the superscript $*$ denotes complex conjugation. In practice, the full-rank assumption is usually satisfied because each microphone signal contains a noise component that is independent of the other microphone signals, such as thermal noise. If this is not the case, matrix pseudoinverses have to be used instead of matrix inverses. Since we assume that $\mathbf{x}$ and $\mathbf{v}$ are uncorrelated, $\mathbf{r}_{y x_{\mathrm{ref}}}$ is given by $\mathbf{r}_{y x_{\mathrm{ref}}} = \mathbf{R}_{xx} \mathbf{c}_{\mathrm{ref}}$, where $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^H\}$ is the desired speech correlation matrix and $\mathbf{c}_{\mathrm{ref}}$ is the $K \times 1$ vector $\mathbf{c}_{\mathrm{ref}} = [0, 0, \ldots, 1, \ldots, 0]^T$, where the entry corresponding to the reference microphone signal is equal to one.
The matrix $\mathbf{R}_{yy}$ can be estimated by temporal averaging, for instance, using a forgetting factor or a sliding window. Temporal averaging is not possible for $\mathbf{R}_{xx}$ since the desired speech signal components $\mathbf{x}$ are not observable. In practice, the noise correlation matrix $\mathbf{R}_{vv} = E\{\mathbf{v}\mathbf{v}^H\}$ can be estimated during periods when the desired speech source is not active, as indicated by a voice activity detection (VAD) module. Since we assume that $\mathbf{x}$ and $\mathbf{v}$ are uncorrelated, it is then possible to use the relationship $\mathbf{R}_{xx} = \mathbf{R}_{yy} - \mathbf{R}_{vv}$ to obtain an estimate of $\mathbf{R}_{xx}$. However, this is prone to robustness issues caused by oversubtraction, which can leave the estimated desired speech correlation matrix not positive semidefinite. These issues often arise at high frequencies, where the desired speech component may have very low power. To improve robustness in low SNR and nonstationary conditions, an implementation based on the generalized eigenvalue decomposition (GEVD) can be employed [17,18].
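The estimation procedure described above can be sketched in a few lines. This is a minimal illustration, assuming an exponential (forgetting factor) average; the function names `update_correlation` and `speech_correlation` and the value `lam=0.99` are hypothetical choices for the example and do not come from the paper.

```python
def update_correlation(R, y, lam=0.99):
    """One exponential-averaging step: R <- lam * R + (1 - lam) * y y^H.

    R is a K x K nested list, y a length-K list of STFT coefficients.
    The forgetting factor lam is an illustrative value.
    """
    K = len(y)
    return [[lam * R[i][j] + (1 - lam) * y[i] * y[j].conjugate()
             for j in range(K)] for i in range(K)]


def speech_correlation(R_yy, R_vv):
    """Subtraction estimate R_xx = R_yy - R_vv (may lose positive semidefiniteness)."""
    return [[a - b for a, b in zip(row_y, row_v)]
            for row_y, row_v in zip(R_yy, R_vv)]


# Feed a constant observation vector: the estimate converges to y y^H.
y = [1 + 1j, 0.5j]
R = [[0j, 0j], [0j, 0j]]
for _ in range(2000):
    R = update_correlation(R, y)

# Oversubtraction in a single channel: an underestimated R_yy
# gives a negative "speech power".
negative_power = speech_correlation([[0.9]], [[1.0]])[0][0]
```

The last line illustrates the robustness issue mentioned in the text: when the noise estimate exceeds the noisy-signal estimate, the difference is negative, which is the kind of case the GEVD-based implementation is meant to handle.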
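The minimum mean squared error (MMSE) can be obtained by plugging (4) into (3) to obtain
$$J(\hat{\mathbf{w}}) = P_{x_{\mathrm{ref}}} - \mathbf{r}_{y x_{\mathrm{ref}}}^H \mathbf{R}_{yy}^{-1} \mathbf{r}_{y x_{\mathrm{ref}}} = P_{x_{\mathrm{ref}}} - \mathbf{r}_{y x_{\mathrm{ref}}}^H \hat{\mathbf{w}}, \tag{5}$$
where $P_{x_{\mathrm{ref}}} = E\{|x_{\mathrm{ref}}|^2\}$ is the power of the desired speech signal.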
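As a worked illustration of (4) and (5), the following sketch computes the MWF and its MMSE for a toy two-microphone case. The numbers are invented for the example, and the statistics are taken to be real-valued so that a closed-form 2 x 2 inverse suffices; a practical implementation would work with complex STFT-domain statistics at each frequency.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mwf(R_yy, R_xx, ref):
    """Return (w_hat, mmse) for the reference microphone index `ref`."""
    r = [row[ref] for row in R_xx]                     # r = R_xx c_ref
    R_inv = inv2(R_yy)
    w = [sum(R_inv[i][j] * r[j] for j in range(2)) for i in range(2)]  # eq. (4)
    P_ref = R_xx[ref][ref]                             # speech power at the reference
    mmse = P_ref - sum(ri * wi for ri, wi in zip(r, w))                # eq. (5)
    return w, mmse


# Toy scenario (hypothetical values): one speech source with unit power,
# picked up with gains [1, 0.5]; independent noise with powers 0.1 and 0.4.
R_xx = [[1.0, 0.5], [0.5, 0.25]]
R_vv = [[0.1, 0.0], [0.0, 0.4]]
R_yy = [[R_xx[i][j] + R_vv[i][j] for j in range(2)] for i in range(2)]

w_hat, mmse = mwf(R_yy, R_xx, ref=0)
single_mic_mse = 1.0 - 1.0 / 1.1   # Wiener filter using the reference microphone only
```

In this toy case the two-microphone MMSE is lower than the single-microphone one, which is the benefit of spatial diversity that the adaptive quantization later trades against transmission cost.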

2.3. Quantization Error Model. We will consider uniform quantization of the time domain samples of each microphone signal $y_k[t]$, prior to the transformation to the STFT domain.
In practice, this means that the nodes transmit their time domain samples and the STFT is performed in the fusion centre. We discuss the possibility of quantizing the STFT coefficients directly prior to transmission in Section 3.4. This configuration would require each node to perform the STFT over its own microphone signals and transmit the frequency domain coefficients to the fusion centre.
The quantization of a real number $a \in [-A/2, A/2]$ with $b$ bits can be expressed as
$$Q_b(a) = \Delta_b \left( \left\lfloor \frac{a}{\Delta_b} \right\rfloor + \frac{1}{2} \right), \tag{6}$$
where
$$\Delta_b = \frac{A}{2^b}. \tag{7}$$
In practice, the parameter $A$ is given by the dynamic range of the analog-to-digital converter of the corresponding microphone. The quantization error, or noise, is then defined as
$$e_b = Q_b(a) - a. \tag{8}$$
The mathematical properties of the quantization noise $e_b$ have been the subject of extensive study [19][20][21], where it has been shown that the input signal and the quantization noise are uncorrelated under certain technical conditions on the characteristic function of the input signal. Under the same conditions, the mean squared error due to quantization is given by
$$E\{e_b^2\} = \frac{\Delta_b^2}{12}. \tag{9}$$
We consider that, for the $k$th microphone signal, the time domain samples of $y_k$ are quantized with $b_k$ bits according to (6) before being transmitted to the fusion centre. The quantization error can be expressed as
$$e_k[t] = Q_{b_k}(y_k[t]) - y_k[t], \tag{10}$$
where $t$ indexes the samples of frame $\kappa$. The fusion centre performs the STFT and collects the results for each frequency $\omega$ and frame $\kappa$ in the $K \times 1$ vector $\mathbf{y}_e$ given by
$$\mathbf{y}_e = \mathbf{y} + \mathbf{e}, \tag{11}$$
where $\mathbf{e} = [e_1, \ldots, e_K]^T$ is the $K \times 1$ vector whose $k$th element is the quantization error corresponding to the $k$th microphone signal at frequency $\omega$. Note that all $K$ signals have been included in the quantization process. However, if the fusion centre is also equipped with microphones (e.g., it is a node of the WASN), these signals do not need to be transmitted and hence have a fixed quantization. In this case, the microphone signals from the fusion centre are removed from the adaptive quantization process, but they are still included in the estimation process.
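The quantizer in (6) and (7) and the error power in (9) can be checked numerically. The sketch below is an illustration under stated assumptions: the input is drawn uniformly over the full range (a benign case in which the $\Delta_b^2/12$ result holds well), and the clamp on the top level for $a = A/2$ is an added safeguard that is not part of the text.

```python
import math
import random


def quantize(a, bits, A=2.0):
    """Uniform mid-rise quantizer: delta * (floor(a / delta) + 1/2)."""
    delta = A / 2 ** bits
    level = math.floor(a / delta)
    # Assumption: clamp so that the boundary value a = A/2 stays in range.
    level = min(level, 2 ** (bits - 1) - 1)
    return delta * (level + 0.5)


random.seed(0)
A, bits = 2.0, 6
delta = A / 2 ** bits

samples = [random.uniform(-A / 2, A / 2) for _ in range(100_000)]
errors = [quantize(a, bits, A) - a for a in samples]

empirical_mse = sum(e * e for e in errors) / len(errors)
predicted_mse = delta ** 2 / 12        # eq. (9)
max_abs_error = max(abs(e) for e in errors)
```

For this input the empirical error power should be close to the prediction of (9), and no error exceeds half a quantization step. With very few bits or strongly non-uniform inputs the agreement is expected to degrade, in line with the caveat on the technical conditions.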
Using the statistical properties of the quantization error [19][20][21], we will assume that every element of $\mathbf{e}$ is uncorrelated with every element of $\mathbf{y}$. Again, under certain technical conditions, the power spectrum of the quantization noise is white; that is, its power is evenly distributed across all frequencies [19]. Although these conditions are not always satisfied in practice, particularly for quantization with only a few bits, we will combine this property with (9) to approximate the quantization noise power at each frequency as
$$p_{e_k} = E\{|e_k|^2\} \approx N \frac{\Delta_{b_k}^2}{12}, \tag{12}$$
where $N$ is the length of the discrete Fourier transform (DFT) used to implement the STFT in practice. The factor $N$ in (12) appears as a consequence of the application to $e_k[t]$ of Parseval's theorem for the nonunitary DFT, given by
$$\sum_{t=0}^{N-1} |y_k[t]|^2 = \frac{1}{N} \sum_{l=0}^{N-1} |y_k(\omega_l)|^2, \tag{13}$$
where $y_k(\omega_l)$ is the $N$-point DFT corresponding to $y_k[t]$. The nonunitary definition of the DFT is given by
$$X_l = \sum_{n=0}^{N-1} x_n e^{-j 2\pi l n / N}, \tag{14}$$
where $x_n$ is the input sequence, $j$ is the imaginary unit, and $X_l$ is the resulting transformed sequence. If a factor of $1/\sqrt{N}$ is applied to the right-hand side of (14), the DFT becomes a unitary transformation and the factor $N$ is no longer needed in (12). In the rest of the paper we assume that the nonunitary DFT is used to implement the STFT, keeping in mind that the unitary DFT can be employed simply by rescaling (12).
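The origin of the factor $N$ can be made concrete with a direct evaluation of the nonunitary DFT. The following sketch checks Parseval's relation on an arbitrary short sequence and evaluates the approximation in (12); the helper names and the test sequence are illustrative only.

```python
import cmath


def dft(x):
    """Nonunitary N-point DFT, evaluated directly from its definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * l * n / N) for n in range(N))
            for l in range(N)]


def quant_noise_power_per_bin(bits, N, A=2.0):
    """Approximate quantization noise power per frequency bin: N * delta^2 / 12."""
    delta = A / 2 ** bits
    return N * delta ** 2 / 12


x = [0.3, -0.7, 0.2, 0.9, -0.1, 0.4, -0.5, 0.0]
X = dft(x)

energy_time = sum(v * v for v in x)
energy_freq = sum(abs(V) ** 2 for V in X)   # equals N * energy_time

# Removing one bit doubles delta and therefore multiplies the noise power by 4.
ratio = quant_noise_power_per_bin(7, 512) / quant_noise_power_per_bin(8, 512)
```

The factor of four per removed bit (about 6 dB) is the reason why the noise power grows exponentially as the bit depth is reduced, a point that becomes relevant when comparing the metrics in Section 3.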

Adaptive Quantization for the Multichannel Wiener Filter in a WASN
We now consider the effect of quantization noise on the estimation process described in the previous section. Our interest here is to study how changing the bit depth for the transmission of the microphone signal samples affects the operation of the MWF, in particular, how it affects the MMSE. The analysis of this effect will lead to a metric based on the gradient of the MMSE which, as we will show, naturally leads to a greedy adaptive quantization algorithm. We will then demonstrate how this gradient metric is a limit case of a recently proposed impact metric [16], which was already known to also generalize the utility metric proposed in [10,11]. In addition, based on this reasoning, we propose a correction to improve the gradient metric for adaptive quantization. This analysis provides a motivation for applying a greedy algorithm based on any of these metrics, which allows dynamically changing, at any moment in time, the bit depth assigned to each microphone signal. In Section 4, we will demonstrate experimentally that the impact and the corrected gradient metrics outperform the gradient and utility metrics, due to their inherent adaptation to the difference in quantization levels corresponding to different bit depths.

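3.1. Effect of Quantization on the Minimum Mean Squared Error. The MWF $\hat{\mathbf{w}}_e$ based on the quantized microphone signal samples is obtained following (4) as
$$\hat{\mathbf{w}}_e = \mathbf{R}_{y_e y_e}^{-1} \mathbf{r}_{y_e x_{\mathrm{ref}}}, \tag{15}$$
where $\mathbf{R}_{y_e y_e} = E\{\mathbf{y}_e \mathbf{y}_e^H\}$. Using (11) and the assumptions stated in Section 2.3, we express $\mathbf{R}_{y_e y_e}$ as
$$\mathbf{R}_{y_e y_e} = \mathbf{R}_{yy} + \mathbf{R}_{ee}. \tag{16}$$
The quantization error correlation matrix $\mathbf{R}_{ee}$ is diagonal, with the $k$th element of the diagonal being $E\{|e_k|^2\} = p_{e_k}$, where $p_{e_k}$ is defined in (12). (One could intuitively expect quantization to reduce the cross-correlation between the microphone signals. In the Appendix we consider a quantization model that includes this reduction and show that its effect on the MWF is equivalent to the one presented in Section 3.1.) As $\mathbf{e}$ is assumed to be uncorrelated with $x_{\mathrm{ref}}$, the cross-correlation remains unchanged; that is,
$$\mathbf{r}_{y_e x_{\mathrm{ref}}} = \mathbf{r}_{y x_{\mathrm{ref}}}. \tag{17}$$
As explained in Section 2.2, $\mathbf{r}_{y_e x_{\mathrm{ref}}}$ can be computed as
$$\mathbf{r}_{y_e x_{\mathrm{ref}}} = (\mathbf{R}_{y_e y_e} - \mathbf{R}_{v_e v_e}) \mathbf{c}_{\mathrm{ref}} = (\mathbf{R}_{yy} - \mathbf{R}_{vv}) \mathbf{c}_{\mathrm{ref}} = \mathbf{R}_{xx} \mathbf{c}_{\mathrm{ref}},$$
where $\mathbf{R}_{v_e v_e} = E\{\mathbf{v}_e \mathbf{v}_e^H\} = \mathbf{R}_{vv} + \mathbf{R}_{ee}$ with $\mathbf{v}_e = \mathbf{v} + \mathbf{e}$, which indeed confirms (17). Similarly to (5), we can now find the MMSE corresponding to $\hat{\mathbf{w}}_e$, given by
$$J(\hat{\mathbf{w}}_e) = P_{x_{\mathrm{ref}}} - \mathbf{r}_{y x_{\mathrm{ref}}}^H \mathbf{R}_{y_e y_e}^{-1} \mathbf{r}_{y x_{\mathrm{ref}}}.$$
We highlight that $J(\hat{\mathbf{w}}_e)$ is a function of the quantization error powers $p_{e_k}$, which can be made explicit by rewriting the function as
$$J(\mathbf{p}_e) = P_{x_{\mathrm{ref}}} - \mathbf{r}_{y x_{\mathrm{ref}}}^H \left( \mathbf{R}_{yy} + \mathrm{diag}(\mathbf{p}_e) \right)^{-1} \mathbf{r}_{y x_{\mathrm{ref}}}, \tag{21}$$
where $\mathbf{p}_e = [p_{e_1}, \ldots, p_{e_K}]^T$ is the vector of quantization error powers and where $\mathrm{diag}(\cdot)$ is the operator that generates a diagonal matrix with diagonal elements equal to the entries of the vector in its argument. Equation (21) is important because it defines the cost function that we will use as the basis for adaptive quantization, since it is the minimum mean squared error that can be obtained with a linear estimator (i.e., the MWF) after adding quantization noise to each microphone signal. We emphasize that (21) gives the MMSE when the MWF is first reoptimized using the quantized microphone signals, that is, based on (15), and not the mean squared error resulting from applying the original MWF $\hat{\mathbf{w}}$ (optimized for the nonquantized signals) to the quantized microphone signals.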

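3.2. Gradient-Based Approach to Adaptive Quantization. The goal of adaptive quantization is to allocate a bit depth to each sensor which is smaller than (or at most equal to) an initial maximum bit depth. Since each bit depth reduction also reduces the speech enhancement performance, the goal becomes to find the bit depth allocation which uses the minimum total number of bits $\sum_k b_k$ given a maximum tolerated MMSE. Equivalently, the problem could be stated as finding the lowest MMSE with a given total number of bits $\sum_k b_k$. The gradient of the function $J(\mathbf{p}_e)$ gives the direction of maximal increase of the MMSE for a given $\mathbf{p}_e$, that is, for a given bit depth allocation. To further reduce the total number of bits beyond the bit depth allocation corresponding to $\mathbf{p}_e$, $\mathbf{p}_e$ has to be changed to $\mathbf{p}_e + \Delta\mathbf{p}_e$, where $\Delta\mathbf{p}_e$ is constrained to have nonnegative entries. The corresponding MMSE increase for an infinitesimally small $\Delta\mathbf{p}_e$ is then given by the inner product of $\Delta\mathbf{p}_e$ and the gradient of $J(\mathbf{p}_e)$. In order to compute this gradient, we will use the intermediate step
$$\frac{\partial J(\mathbf{p}_e)}{\partial \mathbf{R}_{ee}} = (\mathbf{R}_{yy} + \mathbf{R}_{ee})^{-1} \mathbf{r}_{y x_{\mathrm{ref}}} \mathbf{r}_{y x_{\mathrm{ref}}}^H (\mathbf{R}_{yy} + \mathbf{R}_{ee})^{-1}, \tag{22}$$
which follows from applying the identity [22]
$$\frac{\partial\, \mathbf{a}^H \mathbf{X}^{-1} \mathbf{a}}{\partial \mathbf{X}} = -\mathbf{X}^{-H} \mathbf{a} \mathbf{a}^H \mathbf{X}^{-H} \tag{23}$$
together with the fact that $(\mathbf{R}_{yy} + \mathbf{R}_{ee})^{-1}$ is a Hermitian matrix. Equation (22) can be simplified using (15)-(17) to obtain
$$\frac{\partial J(\mathbf{p}_e)}{\partial \mathbf{R}_{ee}} = \hat{\mathbf{w}}_e \hat{\mathbf{w}}_e^H. \tag{24}$$
Since the matrix $\mathbf{R}_{ee}$ is diagonal, we can now find the gradient $\mathbf{g}_e$ as the diagonal of the right-hand side term in (24); that is,
$$\mathbf{g}_e = |\hat{\mathbf{w}}_e|^2, \tag{25}$$
where the operator $|\cdot|$ is applied element-wise to its argument.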
To minimize the MMSE increase for an infinitesimally small $\Delta\mathbf{p}_e$, the inner product $\Delta\mathbf{p}_e^T \mathbf{g}_e$ has to be minimized. However, every component of $\mathbf{g}_e$ is nonnegative and the vector $\Delta\mathbf{p}_e$ is also constrained to have nonnegative components. Hence the best choice for $\Delta\mathbf{p}_e$ is a vector whose components are all zero except the one corresponding to the minimum element of $\mathbf{g}_e$.
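The gradient result can be verified numerically by comparing it with a finite-difference approximation of the MMSE as a function of the noise powers. The sketch below uses a hypothetical real-valued two-microphone example, so the squared magnitude reduces to an ordinary square; all values are invented for the illustration.

```python
def solve2(R, r):
    """Solve the 2 x 2 system R w = r in closed form."""
    (a, b), (c, d) = R
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def mmse_and_filter(p, R_yy, r, P_ref):
    """MMSE after reoptimising the MWF for quantization noise powers p."""
    R = [[R_yy[0][0] + p[0], R_yy[0][1]],
         [R_yy[1][0], R_yy[1][1] + p[1]]]
    w = solve2(R, r)
    return P_ref - (r[0] * w[0] + r[1] * w[1]), w


# Hypothetical statistics: unit-power speech with gains [1, 0.5],
# noise powers 0.1 and 0.4, reference microphone 0.
R_yy = [[1.1, 0.5], [0.5, 0.65]]
r = [1.0, 0.5]
P_ref = 1.0
p = [0.01, 0.02]                      # current quantization noise powers

J, w = mmse_and_filter(p, R_yy, r, P_ref)
gradient = [wk ** 2 for wk in w]      # element-wise squared filter coefficients

h = 1e-6                              # finite-difference step
finite_diff = []
for k in range(2):
    p_step = p[:]
    p_step[k] += h
    J_step, _ = mmse_and_filter(p_step, R_yy, r, P_ref)
    finite_diff.append((J_step - J) / h)

# The microphone whose noise power is cheapest to increase.
cheapest = min(range(2), key=lambda k: gradient[k])
```

Here the second microphone has the smallest gradient entry, so a small amount of additional quantization noise is best placed on that signal alone.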
This result shows that when adding a small amount of quantization noise, it should be added to a single microphone signal instead of dividing it over multiple microphone signals. This naturally leads to a greedy algorithm, where at each step the gradient $\mathbf{g}_e$ is computed from the MWF $\hat{\mathbf{w}}_e$ using (25), after which its minimum element is identified and the bit depth for the corresponding microphone signal is reduced by $\beta$ bits. Note that the above reasoning has assumed the vector $\mathbf{p}_e$ to be a continuous variable; that is, each element of the vector can take any real value. However, the bit depth is a discrete variable and it determines the quantization noise power added to a signal. Hence, the smallest possible quantization power that can be added to a signal corresponds to reducing its bit depth by 1 bit, which is the recommended value for $\beta$ in order to avoid taking too large a step. This also avoids reducing the bit depth of one signal too quickly, which may be a poor choice compared to distributing the $\beta$-bit reduction over several signals. After removing a bit from the microphone signal with the smallest entry in the gradient vector, the MWF is reoptimized to the new bit depth assignment, and the gradient is recomputed. This process is continued until the MMSE exceeds a predefined threshold.
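The greedy procedure described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it treats a single frequency bin with real-valued statistics, uses the noise power of (12) with $N = 1$, removes one bit per step, and stops just before the MMSE limit would be exceeded. The function names, the lower bound `min_bits`, and the numerical values are assumptions made for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def noise_power(bits, A=2.0):
    """Quantization noise power delta^2 / 12 (single bin, N = 1 for simplicity)."""
    return (A / 2 ** bits) ** 2 / 12


def reoptimise(R_yy, r, P_ref, bits):
    """Recompute the MWF and its MMSE for a given bit depth allocation."""
    R = [row[:] for row in R_yy]
    for k, b in enumerate(bits):
        R[k][k] += noise_power(b)
    w = solve(R, r)
    return w, P_ref - sum(ri * wi for ri, wi in zip(r, w))


def greedy_gradient(R_yy, r, P_ref, max_bits, mmse_limit, min_bits=1):
    """Remove one bit at a time from the signal with the smallest gradient entry."""
    bits = [max_bits] * len(r)
    w, J = reoptimise(R_yy, r, P_ref, bits)
    while True:
        candidates = [k for k in range(len(bits)) if bits[k] > min_bits]
        if not candidates:
            return bits, J
        k = min(candidates, key=lambda i: w[i] ** 2)   # gradient metric
        trial = bits[:]
        trial[k] -= 1
        w_new, J_new = reoptimise(R_yy, r, P_ref, trial)
        if J_new > mmse_limit:                         # keep the last valid allocation
            return bits, J
        bits, w, J = trial, w_new, J_new


# Hypothetical scenario: speech gains [1, 0.5, 0.1], noise powers 0.1, 0.4, 1.0.
R_yy = [[1.1, 0.5, 0.1], [0.5, 0.65, 0.05], [0.1, 0.05, 1.01]]
r = [1.0, 0.5, 0.1]
bits, J = greedy_gradient(R_yy, r, P_ref=1.0, max_bits=16, mmse_limit=0.09)
```

In this toy example the microphone that contributes least to the estimate is stripped of bits first, while the reference microphone, which carries most of the useful signal, retains the largest bit depth.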

3.3. Alternative Metrics for Adaptive Quantization.
In this section, we will show how the gradient metric used in the previous section is a limit case of the impact metric, which has been used in [16] for adaptive quantization. This provides an intuitive explanation of why the greedy approach, which follows naturally from the gradient metric, also works well when using this impact metric, as will be demonstrated in Section 4.
The impact metric from [16] was initially proposed as a generalization of the utility metric defined in [10,11]. The utility of the $k$th microphone signal $y_k$ is defined as the increase in MMSE when $y_k$ is removed from the estimation [10]. The mathematical expression of this definition is given by
$$U_k = J(\hat{\mathbf{w}}_{-k}) - J(\hat{\mathbf{w}}), \tag{26}$$
where $\hat{\mathbf{w}}_{-k}$ is the reoptimized MWF obtained with all signals except $y_k$. Assuming the MWF $\hat{\mathbf{w}}$ is known, the utility of $y_k$ is shown in [10] to be equal to
$$U_k = \frac{|\hat{w}_k|^2}{q_k}, \tag{27}$$
where $q_k$ is the $k$th element in the diagonal of $\mathbf{R}_{yy}^{-1}$ and $\hat{w}_k$ is the $k$th element of $\hat{\mathbf{w}}$.
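The impact of the noise $e_k$ is defined as the increase in MMSE when the uncorrelated noise signal $e_k$ is added to $y_k$, while other microphone signals remain unchanged [16]. In mathematical terms the definition can be expressed as
$$I_k = J(\hat{\mathbf{w}}_e) - J(\hat{\mathbf{w}}), \tag{28}$$
where $\hat{\mathbf{w}}_e$ is the reoptimized MWF for $\mathbf{y}_e$, as in (15), with $\mathbf{e} = [0, \ldots, e_k, \ldots, 0]^T$. In [16] the impact is shown to be equal to
$$I_k = \frac{|\hat{w}_k|^2}{q_k + 1/p_{e_k}}, \tag{29}$$
where $q_k$ is again the $k$th element in the diagonal of $\mathbf{R}_{yy}^{-1}$, $\hat{w}_k$ is the $k$th element of $\hat{\mathbf{w}}$, and $p_{e_k}$ represents the power of the noise added to $y_k$, given by (12) for the case of quantization noise.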
To simplify further notation and the comparison between different metrics, we consider the gradient for the case $\mathbf{p}_e = \mathbf{0}$, where $\mathbf{0}$ is the zero vector, such that (25) is rephrased as $\mathbf{g} = |\hat{\mathbf{w}}|^2$ (the comparison is valid for any $\mathbf{p}_e$; we choose this case purely to simplify the notation), where each element is given by
$$g_k = |\hat{w}_k|^2. \tag{30}$$
Although the impact (29), utility (27), and gradient (30) metrics predict a change in the minimum mean squared error, which implicitly requires reoptimizing the MWF, all three metrics can be calculated from the current MWF coefficients at almost no additional computational cost compared to the computation of $\hat{\mathbf{w}}$ itself.
By comparing (29) with (27) and (30), we see that both the gradient $g_k$ and the utility $U_k$ are limit cases of the impact $I_k$ when $p_{e_k} \to 0$ and $p_{e_k} \to \infty$, respectively. Although $p_{e_k} \to 0$ would obviously give an impact equal to zero, the relative differences between the impact metric for different $k$ become equal to those of the gradient metric.
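The relationship between the three metrics can be checked against a direct recomputation of the MMSE. The sketch below writes the impact as $|\hat{w}_k|^2 / (q_k + 1/p_{e_k})$, which is the form that follows from applying the Sherman-Morrison identity to a rank-one diagonal update; the two-microphone statistics are hypothetical and real-valued.

```python
def inv2(R):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = R
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mwf(R, r, P_ref):
    """Return the filter w, the diagonal q of R^{-1}, and the MMSE."""
    Ri = inv2(R)
    w = [Ri[0][0] * r[0] + Ri[0][1] * r[1], Ri[1][0] * r[0] + Ri[1][1] * r[1]]
    return w, [Ri[0][0], Ri[1][1]], P_ref - (r[0] * w[0] + r[1] * w[1])


def utility(w_k, q_k):
    return abs(w_k) ** 2 / q_k


def impact(w_k, q_k, p_k):
    return abs(w_k) ** 2 / (q_k + 1.0 / p_k)


def gradient(w_k):
    return abs(w_k) ** 2


R_yy = [[1.1, 0.5], [0.5, 0.65]]      # hypothetical statistics
r = [1.0, 0.5]
P_ref = 1.0
w, q, J = mwf(R_yy, r, P_ref)

k, p_k = 1, 0.2                        # add noise of power 0.2 to microphone 1
R_noisy = [row[:] for row in R_yy]
R_noisy[k][k] += p_k
_, _, J_noisy = mwf(R_noisy, r, P_ref)

# MMSE when microphone 1 is removed: single-channel Wiener filter on microphone 0.
J_without_k = P_ref - r[0] ** 2 / R_yy[0][0]
```

The closed-form impact matches the recomputed MMSE increase, the utility matches the increase caused by removing the microphone, and the two limits of the impact recover the utility and (after normalising by the noise power) the gradient.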
These two limit cases can be interpreted as follows. For the utility, the interpretation is that removing the microphone signal $y_k$ from the estimation process is similar to adding an infinite amount of noise to $y_k$ ($p_{e_k} \to \infty$), making it completely useless, which corresponds to a removal of that channel. For the gradient, the distinction between the gradient and the impact is that the gradient characterizes the best linear approximation of the function $J(\mathbf{p}_e)$, while the impact computes the actual MMSE increase produced by adding the error $e_k$ with power $p_{e_k}$. Since the gradient approximation is only valid in an infinitesimally small neighbourhood, it is only able to accurately capture the influence of $e_k$ on the MMSE for small values of $p_{e_k}$. Moreover, note that the quantization noise power $p_{e_k}$ increases exponentially with each bit removed, so the gradient becomes less accurate as the microphone signals are quantized with lower resolution. On the other hand, the impact metric accounts directly for $p_{e_k}$, which makes it inherently adaptive to the significance of each bit considered for removal. For low significance bits, the impact is close to the gradient. However, as the significance of a bit increases, the impact behaves more like the utility. By contrast, the gradient assumes that $p_{e_k}$ corresponding to a bit removal is the same for all $k$, or in other words it assumes that the search space is isotropic, which only holds true when all microphone signals have the same bit depth. This can be adjusted by making $p_{e_k}$ in (21) a linear function of the resolution corresponding to the least significant bit, for example, $\alpha_k \Delta_{b_k}$, and taking the derivative with respect to $\alpha_k$. This would then provide a warped gradient vector
$$\mathbf{g}^{\mathrm{warped}} = \mathbf{D} |\hat{\mathbf{w}}|^2, \tag{31}$$
where $\mathbf{D} = \mathrm{diag}(\Delta_{b_1}, \ldots, \Delta_{b_K})$. Note that this warped gradient is again an asymptotic case of the impact measure, if $p_{e_k}$ is substituted with $\alpha_k \Delta_{b_k}$ in (29) and letting $\alpha_k \to 0$.
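A small numerical illustration shows how the warping can change the decision when the microphone signals have different bit depths. The filter coefficients and bit depths below are invented for the example, and the weighting by the least significant bit resolution follows the description of the warped gradient in the text.

```python
def lsb_resolution(bits, A=2.0):
    """Resolution of the least significant bit: A / 2^bits."""
    return A / 2 ** bits


def plain_gradient(w):
    return [abs(wk) ** 2 for wk in w]


def warped_gradient(w, bits, A=2.0):
    """Gradient entries weighted by the current LSB resolution of each signal."""
    return [lsb_resolution(b, A) * abs(wk) ** 2 for wk, b in zip(w, bits)]


w = [0.1, 0.2]          # hypothetical MWF coefficients
bits = [4, 10]          # microphone 0 is already coarsely quantized

g = plain_gradient(w)
g_warped = warped_gradient(w, bits)

pick_plain = min(range(len(w)), key=lambda k: g[k])
pick_warped = min(range(len(w)), key=lambda k: g_warped[k])
```

The plain gradient selects microphone 0 because its coefficient is smaller, whereas the warped gradient selects microphone 1, since removing a bit from a signal that still has 10 bits adds far less noise than removing one from a signal with only 4 bits.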

3.4. Frequency Domain Considerations.
To conclude, we must turn our attention to the fact that all of the above is valid at each frequency $\omega$. This opens the possibility to assign a different bit depth to each frequency component of each microphone signal $y_k$. In Section 2.3 we took the approach of performing quantization in the time domain. In order to select the signal from which a bit is to be removed, we need to choose a rule to combine each metric across all frequencies. We propose to perform a sum of the metrics across all frequencies. For instance, for the impact the combined metric would be given by
$$I_k = \sum_{\omega} I_k(\omega). \tag{32}$$
For the utility, gradient, and warped gradient the combined metric is defined in a similar way. It is noted that one could as well use a weighted sum in (32), for example, based on speech intelligibility weights. We provide a summary of the greedy quantization algorithm based on any of the four metrics described so far in Algorithm 1.
However, strategies to allow the assignment of a different bit depth to each frequency component can be considered, as is commonly done in audio coding, to represent the most relevant frequency components with higher accuracy. Instead of assigning a different bit depth to every single frequency bin, frequency bins can also be grouped in a set of $J$ frequency bands $\Omega = \Omega_1 \cup \cdots \cup \Omega_J$, where $\Omega$ comprises all frequency bins such that $|\Omega| = N$. This means that every STFT coefficient of each microphone signal $y_k$ at the frequency band $\Omega_j$ is quantized following (6) with $b_{k,j}$ bits. The real and imaginary parts of each STFT coefficient are quantized independently. The corresponding metric can be computed in a similar way to (32) as
$$I_{k,j} = \sum_{\omega \in \Omega_j} I_k(\omega), \tag{33}$$
where $I_{k,j}$ is the impact corresponding to the $k$th microphone signal in the $j$th frequency band. For the utility, gradient, and warped gradient the combined metric is again defined in a similar way. This configuration opens up several strategies to decide which frequency band and microphone signal will have its bit depth reduced in each iteration of the algorithm. For our discussion we consider the strategy of removing, in each iteration, one bit in each frequency bin assigned to the frequency band $\Omega_{j_{\min}}$ of the microphone signal $y_{k_{\min}}$ with minimum $I_{k,j}$. This is the most conservative greedy strategy, which can be viewed as a limit case that will generally provide a better performance compared to greedier strategies where the bit depth is reduced in multiple channels and frequency bands simultaneously. It is noted that a more conservative greedy strategy comes with the cost of a larger number of required iterations to reach a predefined total number of bits. In Sections 4.1 and 4.2 we show the performance of this particular strategy applied to a speech enhancement scenario.
Note that, in every iteration, the bit depth in $|\Omega_p|$ (out of $L$) frequency bins is reduced, which corresponds to a reduction of $|\Omega_p|/L$ bits per time domain sample. This is less than the full bit per sample reduction achieved through time domain quantization, which shows that the proposed strategy for frequency domain quantization is more conservative than the strategy for time domain quantization.
Besides, it is important to mention that frequency bands do not influence each other in the sense that the bit depth reduction in one band will not affect the decision in the rest of the bands. In the case of nonuniform bands, where each frequency band spans a different number of frequency bins, a trade-off with the transmission energy has to be considered: removing a bit from a wider frequency band will introduce more quantization noise but will result in less energy spent in transmission, since the total number of bits will be lower.
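The band-wise combination and selection rule described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the array layout (channels along rows, frequency bins along columns), and the `band_edges` and `min_bits` parameters are assumptions made for the example and are not taken from the paper. The per-bin metric values are assumed to be given, since their computation depends on the chosen metric.

```python
import numpy as np

def combine_over_bands(metric, band_edges):
    """Sum a per-bin metric over frequency bands.

    metric:     array of shape (K, L), metric of channel k at frequency bin l.
    band_edges: increasing bin indices delimiting the P bands (length P + 1).
    Returns an array of shape (K, P) with one combined value per channel and band.
    """
    metric = np.asarray(metric, dtype=float)
    bands = zip(band_edges[:-1], band_edges[1:])
    return np.stack([metric[:, lo:hi].sum(axis=1) for lo, hi in bands], axis=1)

def select_band_to_reduce(band_metric, bits, min_bits=0):
    """Return (k_min, p_min), the channel and band with the minimum combined metric.

    Entries whose bit depth has already reached min_bits are skipped.
    Returns None when no entry can lose another bit.
    """
    masked = np.where(np.asarray(bits) > min_bits, band_metric, np.inf)
    if not np.isfinite(masked).any():
        return None
    k, p = np.unravel_index(np.argmin(masked), masked.shape)
    return int(k), int(p)

# Toy example: K = 2 channels, L = 4 bins, P = 2 uniform bands.
toy_metric = np.arange(8.0).reshape(2, 4)
toy_bands = combine_over_bands(toy_metric, [0, 2, 4])
toy_bits = np.array([[2, 4], [4, 4]])
choice = select_band_to_reduce(toy_bands, toy_bits, min_bits=2)
```

In the toy example the band of the first channel with the smallest metric is already at the minimum bit depth, so the selection falls on the next smallest entry. A weighted sum, as mentioned for (32), would only require multiplying `metric` by a weight per bin before summing.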

Experimental Results
In this section we discuss the results obtained from several experiments to observe and characterize the performance of the greedy adaptive quantization algorithm based on the four metrics described in Section 3. We will discuss experiments on two different audio datasets. In the first one the audio signals captured by the microphones are obtained by simulating the acoustics of a room with the image method [23]. In the second one, the audio signals were recorded using a wireless acoustic sensor network set up in a real home environment in a house in Mol, Belgium, using nodes designed by researchers from the MICAS group of the Department of Electrical Engineering (ESAT) in KU Leuven. The details of each experiment will be discussed in Sections 4.1 and 4.2.

(1) Choose a metric: the utility, the gradient, the warped gradient, or the impact.
(2) Initialize $A_k$ $\forall k \in \mathcal{K}$ to the dynamic range of each sensor.
(3) Initialize the bit depth assignment $b_k$ $\forall k \in \mathcal{K}$ to the maximum bit depth allowed by the hardware.
(4) Initialize the quantization noise power $\sigma_{e_k}^2$ $\forall k \in \mathcal{K}$ using equation (12).
(5) while MMSE current < MMSE threshold do
(6) Each signal $y_k$ is quantized in time domain with $b_k$ bits using (6).
(7) Receive $N_{\text{fr}}$ signal frames from $y_k$ $\forall k \in \mathcal{K}$.
(8) Apply STFT to the received frames.
(9) Compute $\hat{\mathbf{w}}(\omega_l)$ $\forall l$ based on the quantized microphone signals using equation (15).
(10) Update $\sigma_{e_k}^2$ using $b_k - 1$ and equation (12) $\forall k \in \mathcal{K}$. (The update is done with $b_k - 1$ in order for the metric to predict what would happen if the bit depth of the $k$th signal is reduced by 1 bit. However, only one signal gets its $b_k$ actually reduced in step (14).)
(11) Compute the selected metric at each frequency $\omega_l$ according to equation (29), (30), (31), or (27), respectively.
(12) Combine the metric across frequencies using equation (32).
(13) Find the index $k_{\min}$ of the signal with the minimal combined metric.
(14) Reduce $b_{k_{\min}}$ by 1 bit.
(15) If $b_{k_{\min}}$ equals 0 after the reduction, remove the $k_{\min}$th signal for subsequent iterations.
(16) end while
Algorithm 1: Greedy adaptive quantization for MWF in WASN.

In all experiments the desired speaker audio consists of three sentences, spoken by a female speaker, from the TIMIT database [24]. The noise characteristics will be described in the section corresponding to each experiment. The sampling frequency is $f_s = 16$ kHz. The audio processing is implemented in batch mode, where the correlation matrices $\mathbf{R}_{yy}(\omega_l)$ and $\mathbf{R}_{vv}(\omega_l)$ are estimated using samples over the entire length of the microphone signals. An ideal VAD is used to exclude the influence of speech detection errors. The audio signals are divided into frames using a Hann window with 50% overlap, and the STFT is implemented using a discrete Fourier transform (DFT) of length $L = 512$. The multichannel Wiener filter is computed based on a GEVD of $\mathbf{R}_{yy}(\omega_l)$ and $\mathbf{R}_{vv}(\omega_l)$ as in [17] since, as we mentioned in Section 2.2, this method is superior to the subtraction-based implementation.
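The control flow of Algorithm 1 can be illustrated with a short Python skeleton. It is a sketch under stated assumptions rather than the authors' implementation: `metric_fn` and `mmse_fn` are hypothetical callables standing in for steps (6) to (11) and for the MMSE evaluation, which involve the quantizer, the STFT, and the MWF and are not reproduced here. The callable passed as `metric_fn` is responsible for evaluating the metric as if each signal had lost one bit, as step (10) prescribes.

```python
import numpy as np

def greedy_bit_allocation(metric_fn, mmse_fn, n_channels, max_bits, mmse_threshold):
    """Skeleton of the greedy loop in Algorithm 1 (time domain quantization).

    metric_fn(bits) -> array of shape (K, n_freqs): chosen metric per channel and
                       frequency for the current bit depths (hypothetical callable).
    mmse_fn(bits)   -> float: current MMSE for the given bit depths (hypothetical).
    Returns the final bit depths and the list of allocations visited.
    """
    bits = np.full(n_channels, max_bits, dtype=int)   # step (3)
    active = np.ones(n_channels, dtype=bool)
    history = [bits.copy()]
    while active.any() and mmse_fn(bits) < mmse_threshold:      # step (5)
        per_freq = np.asarray(metric_fn(bits), dtype=float)     # steps (6)-(11)
        combined = per_freq.sum(axis=1)                         # step (12), eq. (32)
        combined[~active] = np.inf                              # skip removed signals
        k_min = int(np.argmin(combined))                        # step (13)
        bits[k_min] -= 1                                        # step (14)
        if bits[k_min] == 0:                                    # step (15)
            active[k_min] = False
        history.append(bits.copy())
    return bits, history

# Toy stand-ins (not the metrics of the paper): each removed bit quadruples the
# noise contribution of a channel, scaled by an arbitrary per-channel weight.
weights = np.array([1.0, 4.0])
toy_metric = lambda b: (weights * 4.0 ** (-b))[:, None]
toy_mmse = lambda b: float(np.sum(weights * 4.0 ** (-b)))
final_bits, visited = greedy_bit_allocation(toy_metric, toy_mmse, 2, 3, 0.5)
```

With these toy functions the loop removes bits from the less important channel first and stops as soon as the toy MMSE reaches the threshold. Any of the four metrics could be plugged in through `metric_fn` without changing the loop.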
In order to assess the changes in noise reduction and speech distortion due to the bit depth reduction we will use two figures of merit, the speech intelligibility weighted signal-to-noise ratio (SI-SNR) [25] and the speech intelligibility weighted spectral distortion (SI-SD) [6]. They are based on the band importance function $I_i$, which expresses the importance for intelligibility of the $i$th one-third octave band with centre frequency $f_{c,i}$. The values for $f_{c,i}$ and $I_i$ are defined in [26]. The definitions of the two figures of merit are given by

$$\text{SI-SNR} = \sum_i I_i \, \text{SNR}_i, \qquad \text{SI-SD} = \sum_i I_i \, \text{SD}_i.$$

The quantity $\text{SNR}_i$ is the SNR (in dB) in the one-third octave band with centre frequency $f_{c,i}$. In order to account for quantization, the quantization noise in each input signal is obtained as the difference between the clean input signal and its quantized version. The quantization error obtained is added to the noise component of each microphone, and the result is filtered to obtain the noise component in the output signal, which is then used to compute the noise power in each one-third octave frequency band.
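The way the quantization error is obtained for the SI-SNR computation can be sketched as follows. The quantizer of equation (6) is not reproduced in this section, so the sketch assumes a mid-rise uniform quantizer over a symmetric dynamic range; the function names and the `dynamic_range` parameter are illustrative assumptions.

```python
import numpy as np

def uniform_quantize(x, bits, dynamic_range):
    """Mid-rise uniform quantizer with 2**bits levels (assumed form).

    The input range [-dynamic_range / 2, dynamic_range / 2) is split into
    2**bits cells of width step; each sample is mapped to its cell centre.
    """
    step = dynamic_range / 2 ** bits
    idx = np.floor(np.asarray(x, dtype=float) / step)
    idx = np.clip(idx, -2 ** (bits - 1), 2 ** (bits - 1) - 1)   # saturate overloads
    return (idx + 0.5) * step

def quantization_error(x, bits, dynamic_range):
    """Difference between the quantized signal and the clean input signal."""
    return uniform_quantize(x, bits, dynamic_range) - np.asarray(x, dtype=float)

# Toy check on a signal that fills the assumed dynamic range.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
err = quantization_error(x, bits=8, dynamic_range=2.0)
step = 2.0 / 2 ** 8
```

For a signal that does not overload the quantizer, the error stays within half a quantization step and its power is close to the familiar value of step squared over 12. The error signal can then be added to the noise component of each microphone before filtering, as described above.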
For the SI-SD, $\text{SD}_i$ is the average spectral distortion in the one-third octave band with centre frequency $f_{c,i}$, given by

$$\text{SD}_i = \frac{1}{\left(2^{1/6} - 2^{-1/6}\right) f_{c,i}} \int_{2^{-1/6} f_{c,i}}^{2^{1/6} f_{c,i}} \left| 10 \log_{10} G(f) \right| \, df.$$

The function $G(f)$ is given by

$$G(f) = \frac{|X_{\text{out}}(f)|^2}{|X_{\text{in}}(f)|^2},$$

where $X_{\text{out}}(f)$ is the speech component at the output of the MWF, and $X_{\text{in}}(f)$ is the frequency domain speech component at the reference microphone signal. A distortion value of 0 indicates undistorted speech, while larger values correspond to increased speech distortion. To account for quantization, $X_{\text{out}}(f)$ is computed by first quantizing the speech component at each microphone with the corresponding bit depth and then applying the filter to the quantized speech components.

Simulated Room Acoustics.
Our first experiment is a study of the behaviour of the greedy algorithm for adaptive quantization using simulated room acoustics. The scenario consists of a room of dimensions 5 × 5 × 3 m, with a reverberation time of 0.2 s. In the room there are two babble noise sources [27] and one desired speech source. The WASN consists of four nodes, where each node is equipped with three omnidirectional microphones, such that the total number of microphone signals is $K = 12$. Independent white Gaussian noise was added to each microphone signal with a power of $2.5 \cdot 10^{-5}$, about 1% of the power of the babble noise impinging on the microphones. A 2D diagram of the acoustic scenario is depicted in Figure 2. All sources are located at a height of 1.8 m, while the nodes are placed 2 m high. The intermicrophone distance at each node is 4 cm and the sampling rate is 16 kHz. The maximum bit depth was set to 16 bits. The broadband input SNR for every microphone lies between 0 dB and 5 dB. The acoustics of the room are modeled using a room impulse response generator, which allows simulating the impulse response between each source and each microphone using the image method [23]. The code is available online (https://www.audiolabs-erlangen.de/fau/professor/habets/software/rir-generator). The total duration of the signals is 20 seconds.

In Figures 3 and 4 we can see the SI-SNR and SI-SD at each iteration of the greedy adaptive quantization algorithm presented in Algorithm 1 based on the four metrics discussed. In this experiment the quantization is performed in the time domain, as explained in Section 2.3, such that each time domain sample of the microphone signal $y_k$ is quantized using its allocated bit depth $b_k$. Note that both the SI-SNR and the SI-SD are plotted versus the average bit depth per sample and channel at each iteration, given by $\sum_k b_k / K$. In terms of SI-SNR, the impact metric performs better than both the utility and the gradient, as we expected due to its inherent adaptability to the significance of each bit for different bit depths. The same can be said about the warped gradient, which performs better than the uncorrected gradient and close to the impact due to the correction that accounts for the significance of each bit. In terms of distortion, there is no clear winner when the total number of bits is high. However, the impact and the warped gradient introduce the least distortion as the number of bits decreases.

We now turn our attention to quantization in the frequency domain, where each microphone signal $y_k$ has a bit depth $b_{k,p}$ allocated to its frequency band $\Omega_p$, as explained in Section 3.4. The STFT coefficient at each frequency bin $\omega_l \in \Omega_p$ is quantized using $b_{k,p}$ bits. In each iteration, one frequency band at one microphone signal has its bit depth $b_{k_{\min}, p_{\min}}$ reduced by one. The pair $(k_{\min}, p_{\min})$ is given by the channel and band with minimum impact (or corresponding metric). For this experiment we considered $P = 4$ uniform frequency bands, each spanning $L/4$ frequency bins. The bit allocation $b_{k,p}$ of any band can be reduced to a minimum of 2 bits. If all bands of a microphone signal $y_k$ are assigned 2 bits, the signal is removed from the estimation process for subsequent iterations. In Figures 5 and 6 we can again see the SI-SNR and SI-SD at each iteration of the greedy adaptive quantization algorithm. The two figures of merit are plotted versus the average bit depth per sample and channel $\sum_k b_k / K$, where $b_k = (1/P) \sum_p b_{k,p}$. We can observe again the impact and the warped gradient performing better in terms of SI-SNR, which is consistent with our previous experiment. However, the decay in SI-SNR for the utility and the gradient is less pronounced, and the region where their performance is similar to the impact and the warped gradient is larger. In terms of speech distortion the results are also consistent with the previous experiment in the sense that there is no clear winner, although the impact seems to perform better as the number of bits decreases for this particular experiment.

Experiments on Real Recordings.
In order to further compare the four metrics for greedy adaptive quantization, we turn our attention to an audio scenario where the signals are recorded using a real-life wireless acoustic sensor network set up in a house in Mol, Belgium, consisting of 6 nodes with 4 microphones per node. A 2D schematic of the whole house can be seen in Figure 7, although only the living room was used for this experiment. The acoustic scenario consisted of one loudspeaker acting as the desired speaker (represented by the blue circle) and a kitchen fan (located in the top right corner of the living room in the 2D schematic) acting as the noise source. Only the nodes marked 1, 2, 3, 6, 7, and 8 were used for this experiment. The speech signal for the loudspeaker consisted of three sentences from the TIMIT [24] database, spoken by a female speaker. The total duration of the recording was 23 seconds.
The microphones employed were Sonion N8AC03 (analog), and the intermicrophone distance at each node was 5 cm. A picture of one node with the location of the microphones indicated is shown in Figure 8. The sampling frequency was $f_s = 16$ kHz, and the analog-to-digital converter of every node was configured to use a bit depth of 12 bits for acquisition. The microcontroller unit in each node is the Wonder Gecko EFM32WG980 from Silicon Labs [28], which is used for sampling and sending data to a Raspberry Pi 3 [29] via USB. The Raspberry Pi at each node is used to upload the audio samples to a USB drive. The nodes were synchronized once every second using a pulse that was sent through coaxial cable and triggered by a GPS/DCF receiver. The recorded audio signals were stored and subsequently processed in MATLAB as described at the beginning of Section 4. We implemented the processing offline to focus on the characterization of the performance of the bit depth reduction algorithm and the comparison of the different metrics using real audio data.
In Figure 9 we can see the SI-SNR of the output signal estimated by the MWF using the recorded audio signals. In this case, quantization was performed in the time domain. The SI-SNR of the input microphone signals lay between −16 and −7 dB. The noise power for the SI-SNR calculation was computed using the nonspeech segments. The greedy adaptive quantization algorithm was stopped when the total number of bits used was 20 bits. It can be observed that the impact metric again outperforms the gradient and the utility metrics and provides a smoother way of downscaling the WASN performance, in agreement with the results from Section 4.1. Besides, the warped gradient performs very close to the impact due to the correction that accounts for the significance of each bit, again in agreement with the results from Section 4.1. We would like to note that the fact that the impact and the warped gradient outperform the gradient and the utility, as we can observe in both Figures 3 and 9, agrees with the theoretical discussion of Section 3.3, where we describe the limitations of each metric. The four metrics achieve a similar performance only in the high resolution regime, where the samples from every signal are encoded with a high bit depth and the bits removed have low significance.
Finally, we turn our attention again to quantization in the frequency domain, as explained in Section 3.4. We followed the same strategy as in the previous section, where we consider $P = 4$ uniform frequency bands, each spanning $L/4$ frequency bins. In Figure 10 we can see the behaviour of the SI-SNR for this experiment, where a slower decay compared to the evolution in Figure 9 is observed. Although the impact outperforms the rest of the metrics, the four metrics diverge less from each other compared to the time domain quantization as seen in Figure 9. We note that for this experiment the warped gradient performs worse than the utility and the gradient.

Analysis of Energy Consumption.
To conclude, we focus on estimating the energy savings that can be achieved in communication by reducing the bit depth assignment of the microphone signals using the greedy adaptive quantization algorithm. This estimation is based on the power consumption of the WASN hardware nodes we used to record the audio signals. We employ a simplified model for the average energy $E_{\text{RF}}$ required to transmit $B_{\text{RF}}$ bits from one node to the fusion centre given by

$$E_{\text{RF}} = \frac{B_{\text{RF}}}{R_{\text{RF}}} P_{\text{RF}}, \qquad (37)$$

where $R_{\text{RF}}$ is the data rate in bits per second and $P_{\text{RF}}$ is the average power consumed by the radio module in active status. We note that (37) provides only an approximation of the required transmission energy since it ignores some factors such as the retransmission of lost packets. However, a detailed model for the transmission energy is outside the scope of this paper. The interested reader can find more advanced methods in [30]. We will first discuss the case where quantization is performed in the time domain; that is, the bit depth assigned to the microphone signal $y_k$ is equal for every frequency.
The number of bits $B_{\text{RF}}$ needed for the transmission of an audio frame of length $N$ samples from microphone signal $y_k$ can be calculated as follows:

$$B_{\text{RF}} = b_k N + B_{\text{overhead}} N_{\text{pkt},k}, \qquad (38)$$

where $b_k$ is the bit depth assigned to the microphone signal $y_k$, $B_{\text{overhead}}$ is the length in bits of the headers containing protocol information, and $N_{\text{pkt},k}$ is the number of packets necessary to fit $N$ samples from $y_k$ according to the network protocol rules. The radio module of the nodes we used to acquire our audio recordings consists of an IEEE 802.15.4 standard compliant radio from Atmel (AT86RF233) in combination with an ARM Cortex M4 microcontroller. In active mode, the power consumption is $P_{\text{RF}} = 41.8$ mW at a data rate of $R_{\text{RF}} = 1$ Mbps. The packet in the IEEE 802.15.4 standard consists of 127 payload bytes and 6 header bytes [31]. The 127 bytes include 2 CRC bytes and 125 bytes of actual data plus headers originating from higher layers (such as, e.g., IPv6 for the network layer and UDP for the transport layer). We will assume that 25 bytes correspond to headers from higher layers. This leads to each packet carrying 33 bytes of overhead and a maximum of 100 bytes of data corresponding to audio samples. The number of packets necessary to transmit $N$ audio samples encoded with bit depth $b_k$ is then given by

$$N_{\text{pkt},k} = \left\lceil \frac{b_k N}{8 \cdot 100} \right\rceil. \qquad (39)$$

As we have explained in Algorithm 1, when a signal is assigned 0 bits, it gets removed from the estimation process for subsequent iterations. We are interested in calculating the total energy spent in the transmission of $N$ samples per microphone signal included in the estimation process, which is given by

$$E = \sum_{k \in \mathcal{K}_s} E_{\text{RF},k}, \qquad (40)$$

where $E_{\text{RF},k}$ is computed using (37) and (38) and $\mathcal{K}_s$ is the subset of $\mathcal{K}$ containing the indexes of the microphone signals included in the estimation process.

However, we also have to consider the messages the fusion centre needs to send to the nodes every iteration to inform them of which microphone signal $y_k$ will have its bit depth $b_k$ reduced. These messages are limited in size since only the index of the signal whose bit depth needs to be reduced has to be communicated to the nodes. The length of one fusion centre packet in bits is given by

$$B_{\text{FC}} = B_{\text{overhead}} + 8, \qquad (41)$$

where we assume that the message contains one byte of payload. The energy spent in the transmission of these packets is related to the refresh rate of the bit depth allocation algorithm, that is, the rate at which the network performs the iterations required by the algorithm. We will denote this rate by $R_{\text{refr}} \in (0, 1]$, which is given by the inverse of the number of frames the network waits between two consecutive iterations of the bit depth allocation algorithm. A value of 1 means that we change the bit depth allocation every frame and a value of 0.5 every two frames. Following (37) the average energy per frame required to transmit the fusion centre packet is given by

$$E_{\text{FC}} = R_{\text{refr}} \frac{B_{\text{FC}}}{R_{\text{RF}}} P_{\text{RF}}. \qquad (42)$$

We can then modify (40) to include $E_{\text{FC}}$ so that the total energy spent by the network in the duration of one frame is

$$E_{\text{total}} = \sum_{k \in \mathcal{K}_s} E_{\text{RF},k} + N_{\text{nodes}} E_{\text{FC}}, \qquad (43)$$

where $N_{\text{nodes}}$ is the number of nodes in the network, which is included to account for the energy spent by the nodes in the reception of the packet. Note that it is implicitly assumed here that the energy spent in the reception of a packet is of the same order of magnitude as the energy spent for its transmission. This assumption is valid over short distances [32], which can be expected in the context of a WASN. A quick calculation of the ratio between $E_{\text{FC}}$ and $E_{\text{RF},k}$ for $N = 512$, $b_k = 8$, $B_{\text{overhead}} = 264$ bits (corresponding to 33 bytes), and $R_{\text{refr}} = 1$ yields roughly 5%. While this is only an approximate energy model and other concerns related to communications may arise due to the refresh rate, such as the use of bandwidth or the need for retransmissions, from the point of view of energy we can conclude that even for fast rates, that is, one iteration per frame, the reduction of transmission energy is not jeopardized by the refresh rate in most situations. In practice, deciding on a value for the refresh rate $R_{\text{refr}}$ depends on the dynamics of the acoustic scenario; for example, in a scenario with moving sources it may be interesting to have a high rate to be able to track the sources, while in a static scenario a lower rate can be sufficient.

We turn our attention now to quantization with a different bit depth in each of the $P$ frequency bands. This leads to each microphone signal $y_k$ having a bit depth $b_{k,p}$ assigned for each frequency band $\Omega_p$. The number of bits $B_{\text{RF}}$ needed for the transmission of $L/2$ complex STFT coefficients from microphone signal $y_k$ can be calculated following (38) as

$$B_{\text{RF}} = \sum_{p=1}^{P} b_{k,p} L_p + B_{\text{overhead}} N_{\text{pkt},k} = \bar{b}_k L + B_{\text{overhead}} N_{\text{pkt},k},$$

where $L_p$ is the number of frequency bins included in band $\Omega_p$ and $\bar{b}_k$ is the average number of bits assigned to microphone signal $y_k$, which is given by

$$\bar{b}_k = \frac{1}{L} \sum_{p=1}^{P} L_p \, b_{k,p}.$$

The number of packets necessary is now given by

$$N_{\text{pkt},k} = \left\lceil \frac{\bar{b}_k L}{8 \cdot 100} \right\rceil.$$
We note that since each payload byte allows the fusion centre 256 combinations of channel and frequency band indexes, a packet of very similar length to the one we considered in (41) can be used in this case to let the fusion centre inform the nodes of where to remove bits. While the quantization in several frequency bands allows for extra granularity, the energy analysis shown above applies in a straightforward manner by considering the average number of bits $\bar{b}_k$ in place of $b_k$.

Finally, in Figure 11 the resulting SI-SNR (the same as in Figure 9) is plotted versus the total energy spent in transmission calculated from (43). Similarly, in Figure 12 we show the resulting SI-SNR (the same as in Figure 10) plotted versus the total energy spent in transmission calculated following the energy analysis for frequency domain quantization shown above. These graphs illustrate the estimated transmission energy savings which can be achieved through the use of the greedy adaptive quantization algorithm. For time domain quantization, from Figure 11 it can be observed that the total transmission energy can be reduced roughly by half without a meaningful loss in performance and to roughly a quarter for a small loss of about 1 dB. For frequency domain quantization the savings are potentially higher since the total transmission energy can be reduced roughly to one-third without meaningful loss in performance.
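The simplified energy model of this section can be checked numerically with a short Python sketch. The constants are the ones quoted in the text (41.8 mW at 1 Mbps, 33 bytes of overhead and at most 100 bytes of audio data per packet); the function names and the exact form of the per-frame total, in particular the multiplication of the fusion centre energy by the number of nodes, are assumptions made for illustration.

```python
import math

P_RF = 41.8e-3            # average radio power in active mode, in watts
R_RF = 1e6                # data rate, in bits per second
OVERHEAD_BITS = 33 * 8    # protocol overhead per packet, in bits
PAYLOAD_BITS = 100 * 8    # maximum audio payload per packet, in bits

def packets_needed(bit_depth, n_samples):
    """Packets required to carry n_samples samples at the given bit depth."""
    return math.ceil(bit_depth * n_samples / PAYLOAD_BITS)

def frame_bits(bit_depth, n_samples):
    """Total bits on air for one frame: audio data plus per-packet overhead."""
    return bit_depth * n_samples + OVERHEAD_BITS * packets_needed(bit_depth, n_samples)

def tx_energy(n_bits):
    """Energy in joules to transmit n_bits at rate R_RF with power P_RF."""
    return n_bits / R_RF * P_RF

def fc_energy(refresh_rate=1.0):
    """Average energy per frame of the one-byte fusion centre message."""
    return refresh_rate * tx_energy(OVERHEAD_BITS + 8)

def total_energy(bit_depths, n_samples, n_nodes, refresh_rate=1.0):
    """Per-frame energy: active microphone signals plus fusion centre messages."""
    audio = sum(tx_energy(frame_bits(b, n_samples)) for b in bit_depths if b > 0)
    return audio + n_nodes * fc_energy(refresh_rate)

# Ratio quoted in the text: N = 512 samples, 8 bits, one iteration per frame.
ratio = fc_energy(1.0) / tx_energy(frame_bits(8, 512))
```

With these values one frame needs 6 packets and 5680 bits, against 272 bits for the fusion centre message, so `ratio` is about 0.048, in line with the figure of roughly 5% given above. For frequency domain quantization the same functions can be called with the average bit depth of each signal, which may be fractional.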

Conclusions
We have provided a better understanding of adaptive quantization for speech enhancement in wireless acoustic sensor networks based on the previously proposed impact metric. We have done so by extending the mathematical framework of adaptive quantization in linear MMSE estimation, where we have proposed a metric based on the gradient of the MMSE and demonstrated how this metric naturally leads to a greedy approach. Moreover, we have shown that the impact metric is a generalization of the gradient metric, where the gradient is a limit case of the impact. We also propose a correction to improve the gradient metric by considering the significance of each quantization bit for different bit depths. Besides, the impact also generalizes a utility metric previously proposed for sensor subset selection. Through the use of a simulated and a real-life environment we have assessed the superiority of the impact and the corrected gradient metrics over the gradient and the utility metrics due to their adaptability to the significance of each quantization bit. Besides, we have provided an estimation of the possible energy savings achievable through the use of the greedy adaptive quantization algorithm based on any of the studied metrics. In future work, an extension of this approach to a distributed speech enhancement algorithm will be explored, hence going beyond the centralized setting targeted in this work. Another important research direction will be the incorporation of psychoacoustic characteristics of human hearing into the bit depth allocation algorithm in order to improve the allocation in different frequency bands.

Figure 1: Example of a small WASN with adaptive quantization.

Figure 2: Acoustic scenario for the simulated room acoustic experiment.
Figure 3: SI-SNR at each step of the greedy quantization algorithm using time domain quantization for the simulated room acoustic experiment.

Figure 4: SI-SD at each step of the greedy quantization algorithm using time domain quantization for the simulated room acoustic experiment.
Figure 5: SI-SNR at each step of the greedy quantization algorithm with frequency domain quantization for the simulated room acoustic experiment.

Figure 6: SI-SD at each step of the greedy quantization algorithm with frequency domain quantization for the simulated room acoustic experiment.

Figure 7: Schematic in 2D of the house used for the WASN recordings, with the desired speaker in blue and the WASN nodes in red.

Figure 8: One node of the WASN used to make the recordings.

Figure 10: SI-SNR at each step of the greedy quantization algorithm using frequency domain quantization for the real recordings.

Figure 11: SI-SNR versus total transmission energy spent in the duration of one frame in the case of time domain quantization.