Blind-Matched Filtering for Speech Enhancement with Distributed Microphones

A multichannel noise reduction and equalization approach for distributed microphones is presented. The speech enhancement is based on a blind-matched filtering algorithm that combines the microphone signals such that the output SNR is maximized. The algorithm is developed for spatially uncorrelated but nonuniform noise fields, that is, the noise signals at the different microphones are uncorrelated, but the noise power spectral densities can vary. No assumptions on the array geometry are made. The proposed method is compared to the speech distortion-weighted multichannel Wiener filter (SDW-MWF). Similar to the SDW-MWF, the new algorithm requires only estimates of the input signal-to-noise ratios and the input cross-correlations. Hence, no explicit channel knowledge is necessary. A new version of the SDW-MWF for spatially uncorrelated noise is developed which has a reduced computational complexity, because matrix inversions can be omitted. The presented blind-matched filtering approach is similar to this SDW-MWF for spatially uncorrelated noise but additionally achieves some improvements in speech quality due to a partial equalization of the acoustic system.


Introduction
In many speech communication systems, like hands-free car kits, teleconferencing systems, and speech recognition systems, the desired speech signal is linearly distorted by the room acoustics and also corrupted by undesired background noise. Therefore, efficient speech processing techniques are required to enhance the speech signal under the constraint of a small speech distortion. The use of multiple microphones can improve the performance compared to single-microphone systems [1]. The most common way to place the microphones is in beamformer arrays with a designed array geometry. Beamforming algorithms exploit spatial directivity effects by a proper combining of the signals, as in the Frost beamformer [2] or the generalized sidelobe canceler (GSC) [3]. Usually, the microphones are located in close proximity and the same signal conditions at the microphone positions are assumed.
Alternatively, multimicrophone setups have been proposed that combine the processed signals of two or more distributed microphones. The microphones are positioned separately in order to ensure incoherent recording of the noise [4][5][6]. Basically, all these approaches exploit the fact that the speech components in the microphone signals are strongly correlated while the noise components are only weakly correlated if the distance between the microphones is sufficiently large.
For immersive communication, future communication devices must be able to collect the desired speech signal as naturally as possible. But the speech signal quality depends on the speaker's distance to the microphone (array). Therefore, we propose the use of a setup with distributed microphones, where the user can place the microphones arbitrarily. Hence, the array geometry is arbitrary and not a priori known.
In this paper, we discuss schemes for an optimal speech signal combining in real-world acoustic scenarios with distributed microphones. With distributed arrays, the transfer functions to the different microphones vary, and these variations have to be taken into account to provide an optimal signal combining. Often, when the room acoustics are taken into account in a beamformer design, one microphone is taken as a reference channel, for example, in the speech distortion-weighted multichannel Wiener filter (SDW-MWF) [7,8] or the general transfer function GSC (TF-GSC) [9,10]. For microphone arrays with close proximities and similar transfer functions, this is a suitable solution. However, for distributed microphones, the a priori chosen reference channel is not necessarily the ideal choice. Moreover, possible equalization capabilities are often neglected.
The matched filter (MF) [11] and the special case of the MF, the minimum variance distortionless response (MVDR) beamformer, provide a signal combining that maximizes the signal-to-noise ratio (SNR) in the presence of additive noise. A direct implementation of matched filtering requires knowledge of the acoustic transfer functions. With perfect channel knowledge, the MVDR beamformer also provides perfect equalization. However, in speech applications, the acoustic transfer functions are unknown and we have no means to directly measure the room impulse responses. There exist several blind approaches for estimating the acoustic transfer functions (see, e.g., [12][13][14]) which have been successfully applied to dereverberation. However, the proposed estimation methods are computationally demanding. In [15], an iterative procedure was proposed in which the matched filter is combined with a least-mean-squares (LMS) adaptive algorithm for blind identification of the required room impulse responses.
In general, a signal combining for distributed microphones is desirable which does not require explicit knowledge of the channel characteristics. In a previous work, we have developed a matched-filter approach under the assumption of a uniform incoherent noise field [16]. The optimal weighting of the matched filter can be estimated by an approximation of the input SNR values and a phase estimate. Similarly, a scaled version of the MVDR-beamformer coefficients can be found by maximizing the beamformer output SNR [17]. In the frequency domain, these coefficients can be obtained by estimating the dominant generalized eigenvector (GEV) of the noise covariance matrix and the input signal covariance matrix. For instance, an adaptive variant for estimating a GEV was proposed by Doclo and Moonen [18], and later by Warsitz and Haeb-Umbach [19]. Furthermore, it can be shown that the SDW-MWF also provides an optimal signal combining that maximizes the signal-to-noise ratio [20]. The SDW-MWF requires only estimates of the input and noise correlation matrices. Hence, no explicit channel knowledge is required. However, the SDW-MWF does not equalize the speech signal.
In this work, we consider speech enhancement with distributed microphones. In Section 3, we present some measurement results that motivate a distributed microphone array. In particular, we consider two different acoustic situations: a conference room, where the noise level is typically low but the speech signal is distorted due to reverberation, and a car environment, where the reverberation time is low but strong background noise occurs.
The basic idea of the presented approach is to apply the well-known matched-filter technique for a blind equalization of the acoustic system in the presence of additive background noise. This concept is strongly related to the SDW-MWF. Therefore, we discuss different matched-filtering techniques and their relation to the multichannel Wiener filter in Section 4.
In many speech applications, a diffuse noise field can be assumed [21]. In a diffuse noise field, the correlation of the noise signals depends on the frequency and the distance between the microphones. Typically, for small microphone distances, the low-frequency band is highly correlated whereas the correlation is low for higher frequencies. With a larger microphone spacing, the noise correlation is further decreased and the noise components can be assumed to be spatially uncorrelated. In Sections 5 and 6, we demonstrate that this fact can be exploited to reduce the complexity of the SDW-MWF algorithm as well as to improve the equalization capabilities.
The calculation of the MWF requires the inversion of the correlation matrix of the input signals. This is a computationally demanding and also numerically sensitive task. In Section 5, we show that for a scenario with a single speech source and with spatially uncorrelated noise, the matrix inversion can be omitted. Using the matrix inversion lemma [22], the equation for the MWF filter weights can be rewritten into an expression that only depends on the correlations of the input signals and the input noise power spectral densities at the different microphones.
In Section 7, we present a blind-matched filtering approach for speech recording in spatially uncorrelated noise where no assumption on the geometrical arrangement of the microphones is made. The approach presented in [16] is limited to uncorrelated noise signals whose noise power spectral densities are equal for all microphone inputs. In this work, we extend these results to situations where the noise signals are spatially uncorrelated but the noise power spectral densities can vary. Furthermore, we show that, combined with a single-channel Wiener filter, this new structure is equivalent to the SDW-MWF with respect to noise suppression. However, the new approach provides a partial equalization of the acoustic transfer functions between the local speaker and the microphone positions.
Finally, we demonstrate in Section 8 that the presented filter structure can be utilized for blind system identification. For equal noise power spectral densities at all microphone inputs, the matched filter is equal to the vector of transfer function coefficients up to a common factor. Hence, by estimating the ideal matched filter, we estimate the linear acoustic system up to a common filter. Note that many known approaches for blind system identification can only infer the different channels up to a common filter [15]. Similarly, with the proposed system, all filters are biased. We derive the transfer function of the common filter and demonstrate that the biased acoustic transfer functions can be reliably estimated even in the presence of strong background noise.

Signal Model
In this section, we briefly introduce the notation. In general, we consider M microphones and assume that the acoustic system is linear and time invariant. Hence, the microphone signals y_i(k) can be modeled by the convolution of the speech signal x(k) with the impulse response h_i(k) of the acoustic system plus an additive noise term n_i(k). The M microphone signals can be expressed in the frequency domain as

Y_i(κ, ν) = H_i(ν) X(κ, ν) + N_i(κ, ν), i = 1, ..., M, (1)

where Y_i(κ, ν), X(κ, ν), and N_i(κ, ν) denote the corresponding short-time spectra and H_i(ν) the acoustic transfer functions. S_i(κ, ν) = H_i(ν) X(κ, ν) is the speech component of the ith microphone signal. The subsampled time index and the frequency bin index are denoted by κ and ν, respectively. In the following, the dependencies on κ and ν are often omitted for lucidity. Hence, we can define the M-dimensional vectors S, N, and Y, in which the signals are stacked as follows:

S = [S_1, ..., S_M]^T, N = [N_1, ..., N_M]^T, Y = S + N. (2)

Note that T denotes the transpose of a vector or matrix, whereas the conjugate transpose is denoted by † and conjugation by *, respectively. H denotes the vector of channel coefficients:

H = [H_1, ..., H_M]^T. (3)

In the following, we assume that the noise signals are zero-mean random processes with the variances σ²_N1, ..., σ²_NM. We denote the signal-to-noise ratio (SNR) at microphone i by

γ_i = σ²_X |H_i|² / σ²_Ni, (4)

where σ²_X is the speech power at the speaker's mouth.
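As a toy illustration of this per-bin model, the following sketch evaluates the input SNR γ_i for a single frequency bin; all channel coefficients and noise variances below are made-up values, not measured ones.

```python
import numpy as np

# Illustrative per-bin evaluation of the model Y_i = H_i X + N_i for M = 3
# microphones; channel and noise values are made up.
H = np.array([0.8 - 0.3j, 0.5 + 0.6j, -0.4 + 0.2j])  # transfer coefficients H_i(nu)
sigma2_X = 1.0                                        # speech power at the mouth
sigma2_N = np.array([0.1, 0.2, 0.05])                 # noise variances sigma^2_Ni

# Input SNR at microphone i: gamma_i = sigma_X^2 |H_i|^2 / sigma_Ni^2
gamma_i = sigma2_X * np.abs(H) ** 2 / sigma2_N
print(gamma_i)
```

Note that the best microphone in this toy example (the first one, γ_1 = 7.3) is determined jointly by the channel magnitude and the local noise level, mirroring the observation from the measurements that no single microphone is uniformly best.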

Measurement and Simulation Setup
Throughout the paper, we will illustrate the proposed method with measurement and simulation results for two different acoustic situations: a conference room, where the noise level is typically low but the speech signal is distorted due to reverberation, and a car environment, where the reverberation time is low but the strong background noise may lead to very low input SNR values. In this section, we first present some measurement results that motivate a distributed microphone array. Then, we describe the setup for the simulations. We first discuss some measurement results obtained in a conference room with a size of 4.7 × 4.8 × 3.0 m. For these measurements, we used three omnidirectional microphones which were placed on a table in the conference room as shown in Figure 1. The microphone distance was chosen as 1.2 m between mic. 1 and mic. 3, and 1 m between the other microphone pairs (see Figure 1). This results in distances in the range from 0.5 m to 1.3 m between the local speakers and the microphones.
With an artificial head, we measured the room impulse responses for five local teleconference participants. For this scenario, Figure 2 shows the magnitudes of the acoustic transfer functions. The influence of the room acoustics is clearly visible. For some frequencies, the magnitudes of the acoustic transfer functions show differences of more than 20 dB. It can also be stated that the microphone with the best transfer function is not obvious, because for some frequencies H_1(ν) or H_2(ν) and for others H_3(ν) exhibits the least attenuation.
Figure 3 depicts the SNR versus frequency for a situation with background noise arising from the fan of a video projector. From this figure, we observe that the SNR values for frequencies above 1.5 kHz are quite distinct for the three microphone positions, with differences of up to 10 dB depending on the particular frequency. Again, the best microphone position is not obvious in this case, because the SNR curves cross several times.
Theoretically, if we assume spatially uncorrelated noise signals, a matched filter combining these input signals would result in an output SNR equal to the sum of the input SNR values. With three inputs, a matched filter array achieves a maximum gain of 4.8 dB for equal input SNR values. In the case of varying input SNR values, the sum is dominated by the maximum value. Hence, for the curves in Figure 3, the output SNR would essentially be the envelope of the three curves. This is also shown in Figure 3, where the output SNR of an optimal signal combining is plotted (solid line).
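Both statements can be checked with a short sketch (the channel coefficients and noise variances are made up): for spatially uncorrelated noise the matched-filter output SNR is the sum of the input SNRs, and three equal inputs give a gain of 10 log10(3) ≈ 4.8 dB.

```python
import numpy as np

def mf_output_snr(H, sigma2_N, sigma2_X=1.0):
    """Matched-filter output SNR for spatially uncorrelated (diagonal R_N) noise."""
    return sigma2_X * np.sum(np.abs(H) ** 2 / sigma2_N)

H = np.array([0.8 - 0.3j, 0.5 + 0.6j, -0.4 + 0.2j])   # made-up channels
sigma2_N = np.array([0.1, 0.2, 0.05])                  # made-up noise variances
gamma_i = np.abs(H) ** 2 / sigma2_N                    # input SNRs (sigma_X^2 = 1)

# The output SNR equals the sum of the input SNRs ...
print(mf_output_snr(H, sigma2_N), gamma_i.sum())

# ... and three equal input SNRs give an array gain of 10 log10(3) dB.
print(round(10 * np.log10(3), 1))  # -> 4.8
```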
For the simulation results presented throughout the paper, the following setup was used: all processed signals are sampled at a rate of f_s = 16000 Hz. For the short-time Fourier transform (STFT), we used a length of L = 512 and an overlap of K = 384 samples, and overlap-add processing was performed for the signal synthesis. As clean speech signals, we used two male and two female speech samples, each with a length of 8 seconds. For this, we took the German-language test samples from recommendation P.501 of the International Telecommunication Union (ITU) [23]. To generate the speech signals s_i at the microphones, the clean speech was convolved with the corresponding room impulse responses. The reverberation time for the conference scenario was T_60 = 0.25 s. We also show results for a conference scenario with T_60 = 0.5 s, but for this scenario the impulse responses were generated using the image method [24]. Most of the presented algorithms require estimates of the noise power spectral density (PSD) and a voice activity detection (VAD); here, we used the methods described in [16] throughout the paper.
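The STFT framing can be sketched as follows. The analysis window is not specified in the text, so a periodic Hann window is assumed here; with L = 512 and a 384-sample overlap (128-sample frame shift), the shifted Hann windows sum to a constant, which makes overlap-add synthesis a simple scaled reconstruction.

```python
import numpy as np

L, K = 512, 384          # STFT length and overlap from the simulation setup
hop = L - K              # 128-sample frame shift
n = np.arange(L)
win = 0.5 * (1.0 - np.cos(2 * np.pi * n / L))   # periodic Hann window (assumed)

# At 75% overlap the shifted Hann windows sum to a constant (L/hop * 0.5 = 2),
# so plain overlap-add of windowed frames reconstructs a scaled signal.
cola = np.zeros(4 * L)
for start in range(0, len(cola) - L + 1, hop):
    cola[start:start + L] += win
interior = cola[L:-L]    # exclude the partially covered edges
print(interior.min().round(6), interior.max().round(6))  # -> 2.0 2.0
```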
For the measurements in the car environment, one microphone was installed close to the inside mirror, while the second microphone was mounted at the A-pillar (the A-pillar of a vehicle is the first pillar of the passenger compartment, usually surrounding the windscreen). This microphone setup leads to a distance of 0.6 m between the two microphones. We consider three different background noise situations for the noise recordings: driving noise at 100 km/h and 140 km/h, and the noise arising from an electric fan (defroster, car at standstill). For a comparison with typical beamformer constellations, we installed a second pair of microphones at the inside mirror, such that the microphone distance between these two microphones was 0.04 m. This microphone setup is evaluated later in Section 5. Figure 4 shows the measurement setup for the in-car environment. For all measurements in the car, we used cardioid microphones, which are often used for automotive applications. The artificial head was also used for the measurements of the room impulse responses. The reverberation time for this scenario was T_60 = 0.05 s.

Optimal Signal Combining
In this section, we discuss combining schemes which are optimal in a certain manner. For such a combining of the microphone signals, each input Y_i is processed by a linear filter G_i before the signals are added. Stacking these filter functions into the vector G = [G_1, ..., G_M]^T, the processed signal at the output of the combining system can be expressed as

Ŝ = G† Y = G† H X + G† N.

The SNR at the system output is defined as the ratio

γ_out = σ²_X |G† H|² / (G† R_N G),

where R_N = E{N N†} is the correlation matrix of the noise signals.

Maximization of the SNR.
Our aim is now to find the filter functions G which are optimal in the sense of a maximal output SNR of the combining system. Hence, the maximization problem can be stated as

G_opt = argmax_G σ²_X |G† H|² / (G† R_N G).

This maximization problem leads to an eigenvalue problem with the matched filter (MF) solution [25]:

G_MF = c R_N⁻¹ H, (10)

where c is a nonzero constant. Hence, one has to weight the input signals according to the acoustic transfer functions and the inverse of the noise correlation matrix. This weighting is also known as the maximum SNR (MSNR) beamformer. Applying a constant factor to R_N⁻¹ H does not affect the SNR at the filter output. Therefore, the matched filter can also be utilized for equalization to get a flat frequency response with respect to the source speech position (G_MF† H = 1). In this case, the following form is used:

G_MVDR = R_N⁻¹ H / (H† R_N⁻¹ H). (11)

This algorithm is also called the minimum variance distortionless response (MVDR) beamformer and was described by Cox et al. [26]. For this technique, knowledge of the room impulse responses and the noise correlation matrix is needed. For the estimation of the noise power density and the cross-power density, several approaches exist in the literature [21,[27][28][29][30]. But the estimation of the room impulse responses is a blind estimation problem [12][13][14]. A reliable estimation of the room impulse responses in real time is still an open issue. Often, only a linear-phase compensation is done by applying a sufficient time delay to the signals; thus, the transfer functions are replaced by the steering vector

A = [1, e^(−jΔ_2(ν)), ..., e^(−jΔ_M(ν))]^T,

where Δ_i(ν) denotes the phase difference between the first and the ith microphone. This corresponds to the classical Frost beamformer [2]. Thus, an estimate of the time difference of arrival (TDOA) is required. Note that for a known array geometry, this information is equivalent to the direction of arrival (DOA).
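A numerical sketch of the MVDR weighting, cf. (11), illustrates both properties: the distortionless response G† H = 1 and the fact that no other weighting achieves a higher output SNR. The channel vector and noise covariance below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3
# Made-up channel vector and a valid (Hermitian, positive definite) noise covariance.
H = rng.normal(size=M) + 1j * rng.normal(size=M)
A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
R_N = A @ A.conj().T + np.eye(M)

def output_snr(G, sigma2_X=1.0):
    """Output SNR sigma_X^2 |G^H H|^2 / (G^H R_N G) of a combining vector G."""
    return sigma2_X * np.abs(np.vdot(G, H)) ** 2 / np.real(np.vdot(G, R_N @ G))

# MVDR weighting: R_N^{-1} H normalized to a distortionless response G^H H = 1.
RinvH = np.linalg.solve(R_N, H)
G_mvdr = RinvH / np.vdot(H, RinvH)

snr_max = np.real(np.vdot(H, RinvH))  # achievable maximum H^H R_N^{-1} H (sigma_X^2 = 1)
print(output_snr(G_mvdr), snr_max)    # the two values agree
```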

Multichannel MMSE Criterion.
Another related criterion is the minimization of the mean square error (MSE) between the output signal and a reference signal. This can also be used to find an optimal combining strategy for the microphone signals. To calculate the minimum mean squared error (MMSE) estimate of the clean speech signal X at the speaker's mouth, one has to minimize the cost function

J(G) = E{|X − G† Y|²}.

By setting the complex derivative with respect to G* to zero, one obtains the solution of this minimization problem as

G_MWF = R_Y⁻¹ R_Yx, (14)

where R_Yx = E{Y X*} is the cross-correlation vector between the clean speech and the microphone input signals and R_Y = E{Y Y†} is the correlation matrix of the microphone input signals, respectively. In the literature, this is often referred to as the multichannel Wiener filter (MWF), which can be used for signal combining and noise reduction.
To overcome the problem of the required but unavailable cross-correlation vector R_Yx in the definition of the MWF, cf. (14), one can define an MWF that minimizes the mean squared error with respect to the speech component S_ref of a reference microphone signal. In [7], the SDW-MWF was proposed, where a trade-off parameter was introduced into the MWF. With this parameter, it is possible to adjust the noise reduction capabilities of the MWF with respect to the speech signal distortion at the output. Thus, the signal distortion is taken into account in the optimization. The distortion is measured as the distance between the speech component of the output signal and the speech component of an input channel. This reference channel is selected arbitrarily in advance.
The error signal ε for the minimization is then defined as the difference between the output signal G† Y and the speech component of the reference signal:

ε = G† Y − S_ref = G† Y − u† S.

The column vector u selects the reference channel, that is, the corresponding entry is set to one and the others are set to zero. Using the two MSE cost functions

J_S(G) = E{|G† S − u† S|²}, J_N(G) = E{|G† N|²},

the unconstrained minimization criterion for the SDW-MWF is defined by

min_G [ J_N(G) + (1/μ_W) J_S(G) ],

where 1/μ_W is a Lagrange multiplier. This results in the solution

G_SDW = (R_S + μ_W R_N)⁻¹ R_S u, (18)

where R_S = E{S S†} is the speech correlation matrix and μ_W is a parameter which allows a trade-off between speech distortion and noise reduction (for details cf. [7]).
For further analyses, we assume that the speech signal of the single speaker is a zero-mean random process with the PSD σ²_X and that the acoustic system is time invariant. The correlation matrix of the speech signal can then be written as

R_S = σ²_X H H†.

Using the matrix inversion lemma [22], the SDW-MWF can be decomposed as

G_SDW = G_MVDR · H*_ref · G_WF, with G_WF = γ/(μ_W + γ) and γ = σ²_X H† R_N⁻¹ H,

where (H† R_N⁻¹ H)⁻¹ is the noise variance at the output of G_MVDR:

E{|G†_MVDR N|²} = (H† R_N⁻¹ H)⁻¹. (21)

Appendix A provides a derivation of this decomposition. Thus, the SDW-MWF is decomposed into an MVDR beamformer and a filter that is equal to the acoustic transfer function of the reference channel (H*_ref = H† u). Furthermore, the noise reduction is achieved by a single-channel Wiener filter, where μ_W can be interpreted as a noise overestimation factor [31].
From this decomposition, it can be seen that the SDW-MWF provides an optimal signal combining with respect to the output SNR. Yet, it is not able to equalize the acoustic transfer functions. This is also evident in Figure 5, where the overall system transfer function is depicted (dashed line).
Here, the first microphone with the transfer function H_1(ν) was used as the reference channel. For this plot, the Wiener filter part G_WF(ν) of the transfer function was neglected. We observe that the overall transfer function of the SDW-MWF (dashed line) is equivalent to the transfer function of the reference channel (semidashed line). Note that we measured the transfer function between the speaker's mouth and the output of the SDW-MWF. Also, the flat transfer function of the MVDR beamformer is plotted for comparison (dotted line).
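The decomposition of the SDW-MWF into an MVDR beamformer, the reference transfer function, and a single-channel Wiener factor can be verified numerically. The sketch below uses a made-up channel vector, noise covariance, and trade-off parameter μ_W.

```python
import numpy as np

rng = np.random.default_rng(2)
M, mu_W, sigma2_X = 3, 1.5, 2.0
H = rng.normal(size=M) + 1j * rng.normal(size=M)       # made-up channels
A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
R_N = A @ A.conj().T + np.eye(M)                       # made-up noise covariance
R_S = sigma2_X * np.outer(H, H.conj())                 # rank-one speech correlation
u = np.array([1.0, 0.0, 0.0])                          # reference channel: mic 1

# Direct SDW-MWF solution, cf. (18): G = (R_S + mu_W R_N)^{-1} R_S u
G_direct = np.linalg.solve(R_S + mu_W * R_N, R_S @ u)

# Decomposition: MVDR beamformer x reference transfer function x Wiener factor
RinvH = np.linalg.solve(R_N, H)
gamma = sigma2_X * np.real(H.conj() @ RinvH)           # SNR at the MVDR output
G_mvdr = RinvH / (H.conj() @ RinvH)
G_deco = G_mvdr * np.conj(H[0]) * gamma / (mu_W + gamma)

print(np.max(np.abs(G_direct - G_deco)))               # numerically zero
```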

The SDW-MWF for Spatially Uncorrelated Noise
The calculation of the MWF in (18) requires the inversion of the correlation matrix of the input signals. This is a computationally demanding and also numerically sensitive task. In this section, we show that for a scenario with a single speech source and with spatially uncorrelated noise, the matrix inversion can be omitted. Using the matrix inversion lemma [22], the equation for the MWF filter weights can be rewritten into an expression that only depends on the correlations of the input signals and the input noise PSDs at the different microphones.
Consider the speech-distortion-weighted multichannel Wiener filter according to (18). We can rewrite the inverse (R_S + μ_W R_N)⁻¹ using the result in (A.1):

(R_S + μ_W R_N)⁻¹ = (1/μ_W) [ R_N⁻¹ − σ²_X R_N⁻¹ H H† R_N⁻¹ / (μ_W + γ) ]. (23)

Furthermore, using (21), we have

γ = σ²_X H† R_N⁻¹ H,

which is the signal-to-noise ratio at the output of the MVDR beamformer.
Using the inverse of (μ_W R_N + R_S) from (23) and the definition of the SDW-MWF in (18), we have

G_SDW = (1/(μ_W + γ)) R_N⁻¹ R_S u.

Because speech and noise are independent, we can estimate R_S by R_S = R_Y − R_N. Therefore, we obtain

G_SDW = (1/(μ_W + γ)) R_N⁻¹ (R_Y − R_N) u.

Note that the column vector u selects the reference channel, that is, the corresponding entry is set to one and the others are set to zero. Because we assume that the noise signals at the different microphones are uncorrelated, R_N is a diagonal matrix, and the elements of the main diagonal are the noise variances σ²_N1, ..., σ²_NM. Therefore, we obtain the inverse

R_N⁻¹ = diag(1/σ²_N1, ..., 1/σ²_NM). (27)

Let ref be the index of the one in the vector u. R_N⁻¹ R_Y u then results in the column vector

[ E{Y_1 Y*_ref}/σ²_N1, ..., E{Y_M Y*_ref}/σ²_NM ]^T. (28)

Therefore, we obtain the following expression for the ith coefficient of the SDW-MWF for spatially uncorrelated noise signals:

G_SDW,i = ( E{Y_i Y*_ref} − δ_{i,ref} σ²_Ni ) / ( σ²_Ni (μ_W + γ) ), (29)

where δ_{i,ref} equals one for i = ref and zero otherwise. This representation of the speech-distortion-weighted multichannel Wiener filter omits the inversion of the matrix (R_S + μ_W R_N). For spatially uncorrelated noise signals, the SNR γ can be calculated as the sum of the input signal-to-noise ratios:

γ = γ_1 + ... + γ_M.
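The inversion-free form can be checked against the matrix-inversion form numerically; the sketch below uses made-up channels and noise variances and the statistics R_Y = R_S + R_N that follow from independent speech and noise.

```python
import numpy as np

rng = np.random.default_rng(3)
M, mu_W, sigma2_X, ref = 4, 2.0, 1.0, 0
H = rng.normal(size=M) + 1j * rng.normal(size=M)  # made-up channels
sigma2_N = rng.uniform(0.05, 0.5, size=M)         # spatially uncorrelated noise PSDs
R_N = np.diag(sigma2_N)
R_S = sigma2_X * np.outer(H, H.conj())
R_Y = R_S + R_N                                   # speech and noise independent
u = np.eye(M)[ref]

# Reference form with matrix inversion, cf. (18)
G_inv = np.linalg.solve(R_S + mu_W * R_N, R_S @ u)

# Inversion-free form, cf. (29): gamma is the sum of the input SNRs, and
# G_i = (E{Y_i Y_ref^*} - delta_{i,ref} sigma_Ni^2) / (sigma_Ni^2 (mu_W + gamma))
gamma_i = sigma2_X * np.abs(H) ** 2 / sigma2_N
gamma = gamma_i.sum()
cross = R_Y[:, ref]                               # E{Y_i Y_ref^*}
G_fast = (cross - sigma2_N * u) / (sigma2_N * (mu_W + gamma))

print(np.max(np.abs(G_inv - G_fast)))             # numerically zero
```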

The SDW-MWF for Diffuse Noise
In the literature, many investigations of the spatial correlation properties of noise fields have been made. The assumption of spatially uncorrelated noise is rarely fulfilled in real-world scenarios, but it has been found, for example, by Martin and Vary in [21], that many noise fields can be assumed to be diffuse, like the noise in a car environment [32], office noise [21], or babble noise [33].
For diffuse noise, the spatial correlation depends on the intermicrophone distance and is dominant especially in the lower frequency bands. Typically, the low-frequency band is highly correlated whereas the correlation is low for higher frequencies. This fact can be exploited by omitting the matrix inversion for the higher frequencies.
To evaluate the correlation between the noise signals at different positions, the coherence function of the noise signals for different intermicrophone distances can be computed. The magnitude squared coherence (MSC) between two signals n_i and n_j is defined as

C_{NiNj}(ν) = |σ_{NiNj}(ν)|² / ( σ²_Ni(ν) σ²_Nj(ν) ),

where σ_{NiNj}(ν) is the cross-power spectral density (CPSD) and σ²_Ni(ν), σ²_Nj(ν) are the power spectral densities (PSDs) of the signals n_i and n_j, respectively. The values of the coherence function are between 0 and 1, where 0 means no correlation between the two signals at that frequency point. For highly correlated signals, the MSC is close to 1 for all frequencies.
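A minimal MSC estimator can be sketched by averaging short-time auto- and cross-power spectra over frames (without averaging, the estimate is trivially one for every bin). The two synthetic test cases below use made-up white signals.

```python
import numpy as np

def msc(x, y, nfft=256, frames=200):
    """Estimate the MSC from frame-averaged auto- and cross-power spectra."""
    X = np.fft.rfft(x.reshape(frames, nfft), axis=1)
    Y = np.fft.rfft(y.reshape(frames, nfft), axis=1)
    cpsd = np.mean(X * np.conj(Y), axis=0)
    return np.abs(cpsd) ** 2 / (np.mean(np.abs(X) ** 2, axis=0) *
                                np.mean(np.abs(Y) ** 2, axis=0))

rng = np.random.default_rng(4)
n = 200 * 256
common = rng.normal(size=n)            # one signal observed at both sensors
indep = rng.normal(size=n)             # an unrelated signal

c_coh = msc(common, 0.5 * common)      # fully coherent pair (gain difference only)
c_unc = msc(common, indep)             # independent pair
print(c_coh.mean(), c_unc.mean())      # close to 1 and close to 0, respectively
```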
In [34], Armbrüster et al. have shown that the coherence of an ideal diffuse sound field recorded with omnidirectional microphones can be computed as

C_theo(ν) = si²( 2π (ν f_s / L) d_mic / c ), with si(x) = sin(x)/x, (32)

where L denotes the length of the short-time Fourier transform (STFT) and f_s is the sampling frequency. The speed of sound is denoted by c, and d_mic represents the microphone distance. The zeros of the theoretical coherence function in (32) can be calculated by

f_n = n c / (2 d_mic), n = 1, 2, .... (33)

In the following, we consider the coherence of the noise signals for the in-car scenario in the driving situation at 100 km/h (see Section 3). Figure 6(a) shows the coherence functions of the noise for the microphone pair with an intermicrophone spacing of d_mic = 0.04 m. Also, the theoretical coherence function computed according to (32) is shown. Obviously, there is a high correlation of the noise signals at frequencies below 2 kHz. Note that the coherence of the noise signals is closely approximated by the theoretical coherence function C_theo, although cardioid microphones were used for this measurement. In Figure 6(b), the coherence function of the noise for the microphone pair with a 0.6 m spacing is depicted. In this constellation, the noise signals at the two microphones are highly correlated only for frequencies below 150 Hz.
From (33) and Figure 6, it is obvious that the correlation of the diffuse noise signals depends on the intermicrophone distance. The noise has a high correlation only at low frequencies, and especially the high frequencies are only weakly correlated. Thus, the assumption of spatially uncorrelated noise is fulfilled for the higher frequency bands. Therefore, we propose to calculate the filter weights depending on the theoretical coherence C_theo(ν): for frequencies with a high coherence, we calculate the filter weights using the matrix inversion (see (18)), while for frequencies with a low coherence, we assume uncorrelated noise and thus compute the weights according to (29). Hence, the filter function is calculated according to

G(ν) = G_SDW per (18) if C_theo(ν) ≥ C_lim, and G_SDW per (29) if C_theo(ν) < C_lim, (34)

where C_lim is a parameter that allows a trade-off between accuracy and computing time.
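The threshold frequency implied by this rule can be sketched directly from the theoretical coherence. Assuming c = 343 m/s and the STFT parameters of Section 3, the two spacings 0.04 m and 0.6 m yield limits near the 1500 Hz and 100 Hz values used in the simulations.

```python
import numpy as np

def c_theo(f, d_mic, c=343.0):
    """Theoretical diffuse-field MSC between two omni mics: si^2(2 pi f d / c)."""
    x = 2 * np.pi * f * d_mic / c
    return np.sinc(x / np.pi) ** 2     # np.sinc(t) = sin(pi t)/(pi t)

def f_lim(d_mic, c_lim=0.7, fs=16000, L=512, c=343.0):
    """Lowest STFT bin frequency whose theoretical coherence falls below c_lim."""
    f = np.arange(L // 2 + 1) * fs / L             # bin center frequencies
    return f[np.argmax(c_theo(f, d_mic, c) < c_lim)]

print(f_lim(0.04), f_lim(0.6))         # near 1500 Hz and 100 Hz, respectively
```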
The simulation results for the in-car scenario with the two different microphone setups are given in Table 1. Each scenario (microphone setup and noise condition) was simulated twice. First, it was simulated using the fullband matrix inversion according to (18) (denoted by fullband MWF). These results can be seen as an upper bound for the performance evaluation of the proposed method. Second, we used the proposed approach of the SDW-MWF with the partial inversion of the correlation matrix (partial MWF). Here, the inversion was omitted for all frequency bins with a theoretical coherence C_theo less than 0.7. In our simulation setup, this leads to the threshold frequencies f_lim = 1500 Hz for the closely spaced microphone pair and f_lim = 100 Hz for the setup with the microphone spacing of 0.6 m. As an objective evaluation criterion, we calculated the segmental signal-to-noise ratio (SSNR) of the output signal. For this, a voice activity detection according to ITU-T P.56 was used [35]. Furthermore, we show results from an instrumental quality analysis in Table 1. The speech quality and noise reduction were evaluated according to the ETSI standard EG 202 396-3 [36]. This algorithm calculates three objective quality measures (on the mean opinion score (MOS) scale): Speech-MOS (S-MOS), Noise-MOS (N-MOS), and Global-MOS (G-MOS). From these results, we observe that the partial MWF algorithm obtains nearly the same performance as the fullband SDW-MWF.

Matched Filtering for Spatially Uncorrelated Noise
We have seen in Section 4.2 that the SDW-MWF provides an optimal signal combining with respect to the output SNR, and the SDW-MWF does not require explicit channel knowledge to obtain this result. For spatially uncorrelated noise, the SDW-MWF according to (29) requires only estimates of the input SNR values and the input cross-correlations with respect to the reference channel. However, in contrast to the MVDR beamformer, the SDW-MWF does not equalize the acoustic system.
In the following, we show that knowledge of the input SNR values and the input cross-correlations with respect to the reference channel is sufficient to provide at least a partial channel equalization. We consider the matched filter for spatially uncorrelated noise signals. If we assume that the noise signals at the different microphones are uncorrelated, R_N is a diagonal matrix, and the elements of the main diagonal are the noise variances σ²_N1, ..., σ²_NM. Therefore, we obtain the inverse R_N⁻¹ as in (27). In this case, the filter coefficients of the matched filter can be determined independently, and we obtain as the ith coefficient of the matched filter

G_MF,i = c H*_i / σ²_Ni (35)

according to (10) and

G_MVDR,i = ( H*_i / σ²_Ni ) / ( Σ_j |H_j|² / σ²_Nj ) (36)

according to (11).

Filter Design.
In [16], we have demonstrated that under the assumption of a uniform and spatially uncorrelated noise field, this optimal MF weighting can be obtained by the filter

G_i(ν) = sqrt(γ_i) / γ (37)

together with an additional phase synchronization, where γ denotes the sum of all input SNR values. Hence, this filter requires only estimates of the input SNRs. In the following, we extend this concept to nonuniform noise fields. In this case, the optimal weighting also depends on the noise power densities σ²_Ni. Consider now the filter

G_i(ν) = (σ_N / σ_Ni) · sqrt(γ_i) / γ, (38)

where σ²_N is the mean of the noise power spectral densities at the different microphones, defined by

σ²_N = (1/M) Σ_i σ²_Ni. (39)

This filter depends on the noise power densities σ²_Ni and all input SNR values. Using (4), we obtain

G_i(ν) = (σ_X σ_N / γ) · |H_i(ν)| / σ²_Ni. (40)

Note that the term σ_X σ_N / γ is common to all filter coefficients. Hence, the filter according to (38) is proportional to the magnitude of the matched filter according to (35).
The proposed filter in (38) is real-valued. To ensure a cophasal signal combining, we require some additional system components for phase estimation.
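The real-valued weighting can be sketched as follows. The 1/γ normalization is illustrative (any common factor applied to all coefficients leaves the output SNR unchanged); the check confirms that weights built only from SNR and noise-PSD estimates are proportional to the matched-filter magnitude |H_i|/σ²_Ni, even though the channels themselves are never used.

```python
import numpy as np

rng = np.random.default_rng(5)
M, sigma2_X = 3, 1.0
H = rng.normal(size=M) + 1j * rng.normal(size=M)   # unknown channels (made up)
sigma2_N = rng.uniform(0.05, 0.5, size=M)          # nonuniform noise PSDs

# Quantities a blind filter may use: input SNRs and noise PSD estimates only.
gamma_i = sigma2_X * np.abs(H) ** 2 / sigma2_N
gamma = gamma_i.sum()
sigma_mean = np.sqrt(sigma2_N.mean())              # root of the mean noise PSD

# Real-valued weights from SNR/PSD estimates (normalization is illustrative).
w = (sigma_mean / np.sqrt(sigma2_N)) * np.sqrt(gamma_i) / gamma

# Check: w is proportional to the matched-filter magnitude |H_i| / sigma_Ni^2.
ratio = w / (np.abs(H) / sigma2_N)
print(np.allclose(ratio, ratio[0]))
```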

Phase Estimation.
For a coherent combining of the speech signals, we have to compensate the phase differences between the speech signals at the microphones. For this, it is sufficient to estimate the phase differences to a reference microphone. Let φ_i(ν) be the phase of the complex channel coefficient H_i(ν). We consider the phase differences to a reference microphone:

Δ_i(ν) = φ_i(ν) − φ_ref(ν).

Cophasal addition is then achieved by rotating each input with the compensating phase term e^(−jΔ_i(ν)). For multimicrophone systems with spatially separated microphones, a reliable phase estimation is a challenging task. A coarse estimate of the phase difference can also be obtained from the time shift τ_i between the speech components in the microphone signals, for example, using the generalized correlation method [37]. However, for distributed microphone arrays in reverberant environments, this phase compensation leads to a poor estimate of the actual phase differences. This can be observed in Figure 7, which depicts the phase φ_1(ν) of the reference channel for the in-car scenario with an intermicrophone spacing of 0.6 m (see Section 3). In an anechoic environment, the phase of the reference channel as well as the phase difference Δ_2 for the second microphone would be linear functions of the frequency. Hence, we could expect ideal sawtooth functions if we consider the phase in the interval [−π, π]. From Figure 7, we observe that this is only a rough estimate of the actual phase values.
In order to ensure a cophasal addition of the signals, we employ a phase estimation similar to the approach presented in [16]. We use a frequency-domain least-mean-squares (FLMS) algorithm to estimate the required phase differences, with Y_ref as the reference signal. Note that the filter is only adapted if voice activity is detected, where we use the VAD method described in [16]. The FLMS algorithm minimizes the expected squared error. For stationary signals, the adaptation converges to a filter transfer function determined by E{Y_i* Y_ref}, the cross-power spectrum of the two microphone signals, and E{|Y_i|²}, the power spectrum of the ith microphone signal. Assuming that the speech signal and the noise signals are spatially uncorrelated, (45) can be simplified. For frequencies where the noise components are uncorrelated, that is, E{N_i* N_1} = 0, the phase of the filter G_ANG_i is determined solely by the two complex channel coefficients H_i and H_ref. Hence, for the coherent signal combining, we use the phase of the filter G_ANG_i. According to (28) and (29), the phase of the filter G_SDW is likewise determined by the cross-correlation of the input signals.
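The key property used above can be checked numerically: with spatially uncorrelated noise, the phase of the cross-power spectrum E{Y_i* Y_ref} equals the channel phase difference needed for cophasal combining. The following sketch (a single frequency bin, snapshot averaging in place of the FLMS adaptation; all channel and noise values are illustrative) demonstrates this.

```python
import numpy as np

# Phase-difference estimation from the averaged cross-power spectrum in
# one frequency bin. The FLMS adaptation converges to the same solution
# for stationary signals.

rng = np.random.default_rng(1)

K = 20000                                # number of STFT snapshots
H_ref = 1.0 * np.exp(1j * 0.3)           # channel to the reference microphone
H_i = 0.7 * np.exp(1j * -1.1)            # channel to microphone i

# Complex speech and spatially uncorrelated noise snapshots
X = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
N_ref = 0.3 * (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
N_i = 0.3 * (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)

Y_ref = H_ref * X + N_ref
Y_i = H_i * X + N_i

# The cross-power spectrum has the phase of H_i* H_ref
cross = np.mean(np.conj(Y_i) * Y_ref)
phase_est = np.angle(cross)

phase_true = np.angle(np.conj(H_i) * H_ref)   # = phi_ref - phi_i
print(phase_est, phase_true)
```

Because the noise terms are uncorrelated across microphones, they average out of the cross-power spectrum, leaving only the channel phase difference.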
Comparing (28) and (45), and noting how the output signal of the SDW-MWF is computed, we see that the proposed approach leads to the same phase compensation as the SDW-MWF. With the estimated phase, we can now express the complex filter. Figure 7 presents simulation results for this phase estimation, where Δ_2 denotes the actual phase difference computed from the measured impulse responses and est. Δ_2 is the estimated phase difference. The presented results correspond to the driving situation with a car speed of 140 km/h and an intermicrophone distance of 0.6 m, as described in Section 3.

Residual Transfer Function.
Next, we derive the residual transfer function of the proposed signal combining. Using (40), the complex filter transfer function can be expressed in closed form. Assuming ideal knowledge of the SNR values and a perfect phase estimation, we can derive the overall transfer function.
Comparing the MVDR beamformer in (36) with (50), we observe that the overall transfer function of the proposed system differs from unity. Hence, the proposed system does not provide perfect equalization. However, the filter provides a partial dereverberation: the dips of the acoustic transfer functions are smoothed as long as a dip does not occur in all transfer functions simultaneously. Moreover, if the noise is uniform and stationary, the number of channels M is sufficiently high, and the channel coefficients are spatially uncorrelated, the sum over the squared channel magnitudes tends to a constant value independent of the frequency (cf. [15]).
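The flattening argument at the end of the paragraph can be illustrated with a small Monte Carlo experiment (our own sketch; channel model and parameters are illustrative): for spatially uncorrelated complex Gaussian channel coefficients, the per-bin average of |H_i(ν)|² fluctuates less over frequency as M grows.

```python
import numpy as np

# Numerical check: for spatially uncorrelated (Rayleigh-like) channel
# coefficients, the average (1/M) * sum_i |H_i(nu)|^2 becomes flatter
# over frequency as the number of channels M increases, so the residual
# transfer function of the combiner smooths out.

rng = np.random.default_rng(2)
n_bins = 512

def rel_fluctuation(M: int) -> float:
    # M independent complex Gaussian channels, n_bins frequency bins each
    H = (rng.standard_normal((M, n_bins))
         + 1j * rng.standard_normal((M, n_bins))) / np.sqrt(2)
    s = np.mean(np.abs(H) ** 2, axis=0)   # (1/M) sum_i |H_i|^2 per bin
    return float(np.std(s) / np.mean(s))  # relative fluctuation over frequency

print(rel_fluctuation(2), rel_fluctuation(32))  # shrinks roughly as 1/sqrt(M)
```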

Noise Reduction.
As shown by the decomposition of the speech-distortion-weighted multichannel Wiener filter in (20), the noise reduction of the MWF is achieved by a single-channel Wiener filter. Therefore, we combine the proposed matched-filter approach with a single-channel Wiener filter. In the remainder of this section, we discuss the integration of the Wiener postfilter into the blind-matched filter approach. The considered system is shown in Figure 8.
The single-channel Wiener filter in (20) can be rewritten as an expression that depends only on the output SNR γ. It is possible to integrate the Wiener filter function from (53) into the filter functions of the proposed blind-matched filter (38). This leads to the filter function G_MFWF_i, which consists of a blind matched filter (MF) followed by a single-channel Wiener postfilter (WF); thus, the MSE with respect to the speech component of the combined signal is minimized.
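As a sketch of the postfilter's effect (assuming the standard Wiener gain W = γ/(γ + 1) as the SNR-dependent form of (53); a noise overestimation factor μ_W would replace the 1), the following example shows that applying W to the combined signal reduces the MSE with respect to the speech component:

```python
import numpy as np

# Single-channel Wiener postfilter expressed through the output SNR gamma.
# For a combined bin y = x + n with SNR gamma, the MMSE gain is
# W = gamma / (gamma + 1); applying W lowers the MSE w.r.t. the speech x.

rng = np.random.default_rng(3)

K = 100000
sigma_x2, sigma_n2 = 1.0, 0.5
gamma = sigma_x2 / sigma_n2              # output SNR after combining

x = np.sqrt(sigma_x2) * rng.standard_normal(K)
n = np.sqrt(sigma_n2) * rng.standard_normal(K)
y = x + n

W = gamma / (gamma + 1.0)                # Wiener gain from the output SNR

mse_plain = np.mean((y - x) ** 2)        # no postfilter
mse_wiener = np.mean((W * y - x) ** 2)   # with Wiener postfilter

print(mse_plain, mse_wiener)             # the postfilter lowers the speech MSE
```

The trade-off is the usual one: the postfilter attenuates noise at the price of a slight speech attenuation, controlled in the paper by the overestimation factor μ_W.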

Simulation Results.
In this section, we present some simulation results for the proposed combining system with additional noise suppression. We used the simulation setup described in Section 3. Table 2 presents the results for the simulated in-car environment using the configuration with an intermicrophone distance of 0.6 m. As an objective evaluation criterion, we also calculated the segmental SNR of the output signal; the input signal-to-noise ratios are shown in Table 2 as well. Furthermore, we show results from an instrumental quality analysis in Table 1. Comparing these values with the results shown in Table 1, we observe that the proposed algorithm outperforms the SDW-MWF for this scenario. For this simulation, a higher noise overestimation factor μ_W can be used, while the S-MOS results remain nearly the same as those of the SDW-MWF. This is because the proposed combining system partially equalizes the acoustic system and, therefore, the speech signal components are equalized with respect to the speech signal X at the speaker's mouth. Thus, the system can achieve a higher output SNR at the same level of speech distortion. For the conference scenario, Table 3 shows the results of the performed simulations. It can be seen that the output SNR of the system as well as the MOS values are nearly the same for all speaker positions. This is a result of the proposed combining scheme.
Figure 8: System structure of the blind-matched filtering with noise reduction for two channels.
The effect of the partial equalization becomes obvious by comparing the individual acoustic transfer functions with the overall system transfer function H(ν). This is shown in Figure 9(a) for the speaker at position three, where the transfer functions between the speaker's mouth and the microphones are plotted as dashed and dotted lines. The overall system transfer function (including the system and the acoustic signal path) is plotted as a solid line. It can be seen that the deep dips of the individual transfer functions are equalized. The overall transfer function follows the envelope of all transfer functions (which may also include the microphone characteristic). Figure 9(b) shows the corresponding transfer functions. To show the applicability of the proposed system in more reverberant environments, we used a simulated conference scenario with a reverberation time T_60 = 0.5 s. For the generation of the room impulse responses, we used the image method described by Allen and Berkley in [24]. The results are presented in Table 4; again, the output SNRs for the different speaker positions are in the same range. Simulation results for the SDW-MWF are also given for a comparison of the two techniques.

Blind System Identification
In order to demonstrate that the filter G_i approximates the matched filter, we show that the structure in Figure 10 can be used for blind system identification. The SNR values of speech signals vary rapidly over time. Hence, we again use FLMS filters G_LMS_i to estimate the average filter transfer functions. Note that if we have equal noise power spectral densities at all microphone inputs, the matched filter G_MF is equal to the vector of transfer function coefficients H up to a common factor. This factor can vary with the frequency. Hence, by estimating the ideal matched filter, we estimate the linear acoustic system up to a common filter. Furthermore, note that many known approaches for blind system identification can only infer the M different channels up to a common filter [15]. Similarly, with the proposed system, all filters G_i are biased by a common factor. For equal noise power spectral densities, this common filter has a known transfer function, and the LMS filters should converge accordingly. For simulations, we use the two-microphone in-car setup with an intermicrophone distance of 0.6 m. We consider the driving situation with a car speed of 140 km/h. The magnitude of the actual transfer functions H_i and the magnitude of the corrected filter transfer functions G_LMS_i H are depicted in Figure 11. We observe that the transfer functions are well approximated. As a quality measure, we use the distance D_i and obtain values of D_1 = −16.4 dB and D_2 = −11.0 dB after 5 seconds of speech activity. For a driving situation with a car speed of 100 km/h, we obtain D_1 = −17.9 dB and D_2 = −11.9 dB, respectively.
Figure 10: Basic system structure for the system identification approach for two channels.
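The exact definition of the distance D_i is not reproduced above; a common choice, used here purely as an assumption, is the normalized misalignment in dB after resolving the scalar part of the common-filter ambiguity by a least-squares scaling factor:

```python
import numpy as np

# Normalized misalignment between an estimated channel G and the true
# channel H, after removing a common complex scaling factor alpha by a
# least-squares fit. This is one plausible realization of a distance
# measure D in dB, not necessarily the paper's exact definition.

def misalignment_db(G: np.ndarray, H: np.ndarray) -> float:
    alpha = np.vdot(G, H) / np.vdot(G, G)   # optimal common scaling
    err = alpha * G - H
    return float(10.0 * np.log10(np.sum(np.abs(err) ** 2)
                                 / np.sum(np.abs(H) ** 2)))

rng = np.random.default_rng(4)
H = rng.standard_normal(64) + 1j * rng.standard_normal(64)
noise = rng.standard_normal(64) + 1j * rng.standard_normal(64)

d_clean = misalignment_db(2.0 * H + 0.01 * noise, H)  # nearly a pure scaling
d_noisy = misalignment_db(2.0 * H + 0.30 * noise, H)  # larger estimation error

print(d_clean, d_noisy)   # d_clean is much more negative than d_noisy
```

Values like D_1 = −16.4 dB then indicate how closely G_LMS_i matches H_i once the unresolvable common factor is taken out.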

Conclusions
In this paper, we have presented a speech enhancement system with distributed microphones, where the array geometry is arbitrary and not known a priori. The system is based on a blind-matched filtering approach in which the filter coefficients depend only on the input signal-to-noise ratios and the correlation between the input signals. For spatially uncorrelated but not necessarily uniform noise, the system provides an optimal signal combining that maximizes the output SNR. Moreover, the presented approach achieves a partial equalization of the acoustic system up to a common filter. To demonstrate that the ideal filter coefficients can be reliably estimated, we have presented an application for blind system identification. The system is able to identify the M different channels up to a common filter. The presented simulation results indicate that this identification is robust against background noise. To provide a perfect equalization, the remaining filter ambiguity needs to be resolved separately. However, the presented system could also be combined with other speech dereverberation algorithms, for example, the single-channel reverberation suppression algorithms presented in [38, 39]. The system assumes a single speech source; since situations with more than one active speaker cannot be avoided in real conference scenarios, further investigations are needed to evaluate the concept for such scenarios.

Figure 3: Input SNR for a conference scenario with background noise.

Figure 4: Illustration of the measured in-car scenario.

Figure 5: Comparison of the overall transfer function of the SDW-MWF and the ideal MVDR beamformer.

Figure 6: Comparison of the magnitude squared coherence.

Figure 7: (a) Actual phase of the reference channel determined from the impulse response; (b) actual and estimated phase difference for channel 2.

Figure 9: Comparison of the system transfer functions.

Figure 11: Estimated (dashed line) and actual transfer functions for the two channels.

Table 2: Simulation results for the proposed system in the car environment, cf. results on the right of Table 1.

Table 3: Simulation results for the system in the conference environment with T_60 = 0.25 s.

Table 4: Simulation results for the system in a simulated conference environment with T_60 = 0.5 s.