Design Considerations When Accelerating an FPGA-Based Digital Microphone Array for Sound-Source Localization

The use of microphone arrays for sound-source localization is a well-researched topic. The response of such sensor arrays depends on the number of microphones operating in the array. A higher number of microphones, however, increases the computational demand, making real-time response challenging. In this paper, we present a Filter-and-Sum based architecture and several acceleration techniques to provide accurate sound-source localization in real-time. Experiments demonstrate how an accurate sound-source localization is obtained in a couple of milliseconds, independently of the number of microphones. Finally, we also propose different strategies to further accelerate the sound-source localization while offering increased angular resolution.


Introduction
Most of the signal processing needed in microphone arrays is traditionally done using general purpose processors. However, the computational demand is directly related to the number of microphones of the array. This number is drastically increasing as low-cost MEMS technology is readily available. Current FPGAs are a potential solution thanks to their high computational power and low-latency response. In fact, FPGAs have already been considered by other researchers, mainly for converting the analogue or digital microphone signals into an audio format [1,2] without further signal processing computation. We believe that FPGAs not only are able to manage relatively large microphone arrays, but also enable a faster response when compared to using general purpose processors.
In order to satisfy the most time-stringent sound-source localization applications that also use an increasing number of microphones, we propose a flexible, scalable, and real-time architecture. The main targets are the performance, scalability, and accuracy of the system to detect the direction of sound sources in real-time. Furthermore, we propose several techniques based on our architecture to accelerate the sound-source localization to guarantee real-time detection.
The architecture presented in this paper is an improved and more detailed version of the one presented in [3]. Because this novel architecture is designed to be part of an embedded system, the resource and the power consumption are included together with the performance in our analysis of the system. A frequency analysis is also done based on design parameters such as the number of microphones or the number of orientations. Altogether this leads to an architecture for which the frequency response must satisfy the basic needs of an application requiring real-time sound-source localization.
The main contributions of this work can be summarized as follows:
(i) A Filter-and-Sum based architecture for fast sound-source localization.
(ii) A complete frequency and performance analysis of the system.
(iii) Strategies to speed up the overall execution time.
This paper is organized as follows. Section 2 presents related work. The principles used for the sound-source localization are introduced in Section 3. In Section 4 our proposed architecture is detailed. A complete time analysis

Related Work
The use of microphone arrays for sound-source localization is a well-researched problem, where complexity increases with the number of microphones involved and the required response time of the application. The response time is indeed crucial for applications such as counter-sniper systems [4,5]. Such military systems are composed of microphone arrays mounted on top of a soldier's helmet and connected to an FPGA for signal processing. A similar approach is applied in [6], where the authors present a hat-type hearing system composed of an array of 48 digital MEMS microphones with an FPGA as the computational component. Their main target is a hearing aid system which emphasizes the sound coming from a certain direction by up to 10 dB. Such applications demand a fast response of the system while being power efficient.
Indoor applications, such as videoconferencing, home surveillance, and patient care, also make use of microphone arrays for speech detection [1,7]. One of these works describes the design and implementation on an FPGA of an eight-element digital MEMS microphone array for distant speech recognition. In [8] the authors propose a beamforming-based acoustic system for localization of the dominant noise source. The signal acquisition consists of a microphone array composed of up to 33 MEMS microphones, whereas the PDM demodulation and the beamforming are implemented in an FPGA. The implementation in the FPGA is completed with the delay-and-sum beamforming, measuring 60 angles, and generating a polar map for directivity pattern presentation. Another example is proposed in [9], in which the sound-source localization is obtained by using distributed microphone arrays in a wireless sensor network (WSN). The distributed information collected by the nodes is transferred and processed using data-fusion techniques in order to locate and profile the sound sources. Despite the fact that they implement most of the processing components on an FPGA, the 64k-FFT component becomes so large and resource hungry that it is not suitable for low-end and mid-range FPGAs. In both publications, however, the solutions are neither scalable nor adaptable to dynamic acoustic environments. Furthermore, they do not provide information about how fast their systems can be. Instead, we present a detailed description and analysis of a flexible, scalable, and real-time architecture.

Sound-Source Localization
Our microphone array is designed to spatially sample its surrounding sound field in order to detect and to locate certain types of sound sources. A 360° sound power scan is performed for a configurable number of orientations. A beamforming technique focuses the array in one specific direction or orientation, by amplifying all sounds coming from that direction and by suppressing sounds coming from other directions. A polar power plot is obtained from which the lobes can be used to estimate the nearby sound sources. Figure 1 shows the functional elements required to locate the sound-source, which involve several filters, a beamformer, and a relative sound power estimator.

Microphone Array Description.
The sensor array is composed of 52 digital MEMS microphones and designed for far-field and nondiffuse sound fields [9]. The array pattern consists of four concentric subarrays of 4, 8, 16, and 24 MEMS microphones mounted on a 20 cm circular printed circuit board (Figure 2). Each subarray is differently positioned in order to facilitate the capture of spatial acoustic information using a beamforming technique. Furthermore, the sensor array response is dynamically modified by individually activating or deactivating subarrays. This distributed geometry allows adapting the sensor to different sound sources. For instance, not all the subarrays need to be active to detect a particular sound-source. The computational requirements drastically decrease and the sensor array becomes more power efficient if only a few subarrays are active.

Filters.
The selected digital MEMS microphones are the ADMP521 MEMS microphones designed by Analog Devices, which offer an omnidirectional polar response and a wideband frequency response ranging from 100 Hz up to 16 kHz [10]. These digital MEMS microphones provide a multiplexed pulse density modulation (PDM) output. The PDM signals are generated by an analogue-to-digital converter (ADC) based on a sigma delta converter. The sigma delta conversion technique uses an embedded integrator-comparator circuit to sample the analogue signal and outputs a 1-bit signal [11]. The ADMP521 MEMS microphones use a fourth-order sigma delta converter, which reduces the added noise in the audio frequency spectrum by shifting it to higher frequency ranges. This undesirable high-frequency noise needs to be removed. The ADMP521 MEMS microphones require a clock input of around 1 to 3 MHz as sampling frequency ($F_S$). This range of $F_S$ is chosen to oversample the audio signal in order to have sufficient audio quality and to generate the PDM output signal. Therefore, the PDM signal needs not only to be filtered to remove the noise but also to be downsampled to convert the audio signal to a Pulse-Code Modulation (PCM) format. The target audible frequency range, from $F_{\min}$ to $F_{\max}$, determines the decimation factor ($D_F$) to properly downsample the PDM signal while satisfying the Nyquist theorem:

$$D_F = \left\lfloor \frac{F_S}{2 \cdot F_{\max}} \right\rfloor. \tag{1}$$

The usual range of $D_F$ is from a few tens up to hundreds when targeting audible frequency ranges. For instance, $D_F$ needs to be 83 to recover an audio signal oversampled at 2.49 MHz for a target $F_{\max}$ of 15 kHz.
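The worked example above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and rounding down to an integer factor is an assumption made here so that the decimated rate never drops below twice the maximum target frequency.

```python
def decimation_factor(f_s_hz: float, f_max_hz: float) -> int:
    """Largest integer decimation that keeps the output rate at or above 2 * f_max."""
    if f_s_hz <= 0 or f_max_hz <= 0:
        raise ValueError("frequencies must be positive")
    # Rounding down is an assumption; it keeps the output rate Nyquist-compliant.
    return int(f_s_hz // (2.0 * f_max_hz))


# Values quoted in the text: 2.49 MHz microphone clock, 15 kHz maximum frequency.
d_f = decimation_factor(2.49e6, 15e3)
output_rate_hz = 2.49e6 / d_f
```

With the values from the text, the decimated stream runs at exactly twice the maximum target frequency.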

Filter-and-Sum Beamforming.
The beamforming technique applied in our proposed architecture is based on the Filter-and-Sum beamforming [12]. The original Filter-and-Sum beamforming applies an independent weight to each microphone output before summing them. A variant of the Filter-and-Sum recovers the audio signal from the PDM signal, applies the same low-pass FIR filter to every microphone, and delays the filter output signal of each microphone by a specific amount of time ($\Delta$) before adding all the output signals together (Figure 3). The time delay $\Delta_m$ for a microphone $m$ is determined by the focus direction $\theta$, the position vector $\vec{r}_m$ of microphone $m$, and the speed of sound $c$:

$$\Delta_m(\theta) = \frac{\vec{r}_m \cdot \hat{u}(\theta)}{c}, \tag{2}$$

where the unit vector $\hat{u}(\theta)$ defines the direction vector of a far-field propagating signal with a focus direction $\theta$. The total output $o(\theta, t)$ of the array can be expressed based on the signal output of each microphone in the time domain, $s_m(t)$, and the number of microphones in the array, $M$:

$$o(\theta, t) = \sum_{m=1}^{M} s_m\bigl(t - \Delta_m(\theta)\bigr). \tag{3}$$

The response of the Filter-and-Sum beamforming, however, is usually represented in the frequency domain due to its dependence on the signal frequency. Let $S_m(\omega)$ be the output signal of each microphone at angular speed $\omega = 2\pi f$ for frequency $f$ and $M$ the number of microphones in the array. The total output $O(\theta, \omega)$ is defined as in [13]:

$$O(\theta, \omega) = \sum_{m=1}^{M} S_m(\omega)\, e^{-j\omega\Delta_m(\theta)}, \tag{4}$$

which can be simplified by assuming a monochromatic acoustic wave as

$$O(\theta, \omega_0) = S_0(\omega_0) \sum_{m=1}^{M} e^{-j\omega_0\left(\Delta_m(\theta) - \Delta_m(\theta_0)\right)} = S_0(\omega_0)\, W(\omega_0, \theta_0, \theta), \tag{5}$$

where $S_0(\omega_0)$ is the output signal of the monochromatic wave, $\omega_0$ is the incoming monochromatic angular speed, $\theta_0$ is its direction, and $\theta$ is the array focus. $W(\omega_0, \theta_0, \theta)$ is known as the array pattern, which determines the amplification or gain of the array output. For instance, when $\theta_0 = \theta$, which occurs when the array is focusing in the direction of the incoming monochromatic wave, the gain reaches its maximum $M$, equal to the number of microphones.

Figure 3: The proposed Filter-and-Sum beamforming filters and delays the output of each microphone before adding them together. (a) The acoustic wave received at each microphone is measured and filtered. The beamforming technique considers that the time $\Delta_m$ that the input signal takes to travel from the microphone $m$ to the origin is proportional to the projection of the microphone vector $\vec{r}_m$ on $\hat{u}$. (b) This $\Delta_m$ is determined by the position of the microphone in the array and the desired focus direction $\theta$ of the array. Consequently, the signals coming from the same direction are amplified after the addition of the delayed inputs. Source [9].

power per horizontal direction, which is done by a 360° sweep overview of the surrounding sound field. The directional power output of a microphone array, defined here as the polar steering response power (P-SRP), corresponds to the array's directional response to sound sources present in a sound field (Figure 4). The P-SRP is obtained by considering multiple broadband sources coming from different directions, for instance, human speech.
The array output when the microphone array is exposed to a broadband sound-source $S(\omega)$ with an angle of incidence $\theta_0$ can be modelled as

$$O(\theta, S) = a_1 W(\omega_1, \theta_0, \theta) + a_2 W(\omega_2, \theta_0, \theta) + \cdots + a_n W(\omega_n, \theta_0, \theta), \tag{6}$$

where $a_i$ with $i \in \{1, \ldots, n\}$ is the amplitude of one of the $n$ frequency components of $S(\omega)$. The equation can be generalized to consider a sound field $\mathcal{S}$ composed of multiple broadband sound sources at different locations and with uncorrelated noise:

$$O(\theta, \mathcal{S}) = O(\theta, S_1) + O(\theta, S_2) + \cdots + O(\theta, S_k). \tag{7}$$

The array's power output can be expressed as

$$P(\theta, \mathcal{S}) = \left| O(\theta, \mathcal{S}) \right|^2, \tag{8}$$

since the power of a signal is the square of the array's output. Finally, the normalized power output is defined as the P-SRP:

$$\text{P-SRP}(\theta, \mathcal{S}) = \frac{P(\theta, \mathcal{S})}{\max_{\theta} P(\theta, \mathcal{S})}. \tag{9}$$

The comparison of $P(\theta, \mathcal{S})$ for different values of $\theta$ determines in which direction the sound-source is located, since the maximum power is obtained when the focus corresponds to the location of a sound-source.
The calculation of the P-SRP is usually defined in the frequency domain [14,15], which requires the computation of a Fourier transform. Instead, we propose applying Parseval's theorem, which states that the sum of the squares of a function is equal to the sum of the squares of its transform. This theorem drastically simplifies the calculations since the P-SRP can be computed in the time domain. Let us define the sensing time ($t_s$) as the time the array is registering the previously defined sound field $\mathcal{S}$ for each orientation. Therefore, the power $P(\theta, t_s)$ can be expressed as follows:

$$P(\theta, t_s) = \sum_{t=0}^{t_s} \left| o(\theta, t) \right|^2. \tag{10}$$

Consequently, the P-SRP can be expressed in the time domain by

$$\text{P-SRP}(\theta, t_s) = \frac{P(\theta, t_s)}{\max_{\theta} P(\theta, t_s)}. \tag{11}$$
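The time-domain computation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the FPGA implementation: delays are rounded to whole samples, the speed of sound is taken as 343 m/s, the power is normalized by its maximum over all orientations, and the names `sample_delays` and `p_srp` as well as the 8-microphone ring are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the text


def sample_delays(positions, theta, rate_hz):
    """Non-negative integer delays (r_m . u(theta)) / c, expressed in samples."""
    ux, uy = math.cos(theta), math.sin(theta)
    raw = [(x * ux + y * uy) / SPEED_OF_SOUND * rate_hz for x, y in positions]
    base = min(raw)
    return [round(d - base) for d in raw]


def p_srp(mic_signals, positions, n_orient, rate_hz):
    """Normalized time-domain steered power for n_orient focus directions."""
    tables = [sample_delays(positions, 2 * math.pi * k / n_orient, rate_hz)
              for k in range(n_orient)]
    start = max(max(t) for t in tables)  # same summation window for every orientation
    n_samples = len(mic_signals[0])
    powers = []
    for delays in tables:
        power = 0.0
        for n in range(start, n_samples):
            o = sum(sig[n - d] for sig, d in zip(mic_signals, delays))  # delay and sum
            power += o * o  # accumulate the squared output (Parseval, time domain)
        powers.append(power)
    peak = max(powers)
    return [p / peak for p in powers]


# Synthetic check: 8 microphones on a 10 cm ring and one broadband source.
rate_hz, n_orient, source_idx = 48_000, 16, 5
positions = [(0.1 * math.cos(2 * math.pi * m / 8), 0.1 * math.sin(2 * math.pi * m / 8))
             for m in range(8)]
rng = random.Random(0)
advance = sample_delays(positions, 2 * math.pi * source_idx / n_orient, rate_hz)
source = [rng.uniform(-1.0, 1.0) for _ in range(2000 + max(advance))]
# Microphones with a larger projection towards the source hear the wave earlier.
mic_signals = [source[a:a + 2000] for a in advance]
srp = p_srp(mic_signals, positions, n_orient, rate_hz)
estimated_idx = max(range(n_orient), key=srp.__getitem__)
```

In this synthetic case the orientation with the highest normalized power coincides with the simulated source direction, because only for that orientation do all delayed signals add coherently.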

Sensor Array Evaluation.
The defined P-SRP allows estimating the direction of arrival of multiple sound sources under different sound field conditions. Nevertheless, the precision and accuracy of its estimation can be determined by different quality metrics.
The Filter-and-Sum beamforming is applied to a discrete number of orientations or angles. The angular resolution of the microphone array is determined by the number of measurements per 360° sweep. A higher number of measurements increases the resolution of the P-SRP displayed as a polar power map (Figure 5) and decreases the location error of the sound-source. The lobes of this polar power map can then be used to estimate the bearing of nearby sound sources in nondiffuse sound field conditions. In fact, the characteristics of the main lobe when considering a single sound-source scenario determine the directivity of the microphone array. The definition of array directivity, $D_P$, is proposed in [16] for broadband signals. The authors propose the use of $D_P$ as a metric of the quality of the array since $D_P$ depends on the main lobe shape and its capacity to unambiguously point to a specific bearing. The definition of array directivity presented in [16] is adapted for 2D polar coordinates in [9] as follows:

$$D_P(\theta_0, \omega) = \frac{\pi P(\theta_0, \omega)^2}{\frac{1}{2} \int_0^{2\pi} P(\theta, \omega)^2 \, d\theta}, \tag{12}$$

where $P(\theta_0, \omega)$ is the output power of the array when pointing to the direction $\theta_0$ and $\frac{1}{2} \int_0^{2\pi} P(\theta, \omega)^2 \, d\theta$ is the sum of the squared output power in all other directions. It can be expressed as the ratio between the area of a circle whose radius is the maximum power of the array and the total area of the power output. Consequently, $D_P$ defines the quality of the microphone array and can be used to specify a certain threshold for the microphone array. For instance, if $D_P$ equals 8, the main lobe is eight times slimmer than the unit circle and offers a confident estimation of a sound-source within half a quadrant.
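For a polar power map sampled at a finite number of orientations, the directivity ratio described above can be approximated numerically. The sketch below assumes equally spaced orientations and replaces the integral by a Riemann sum; the function name `directivity` is hypothetical.

```python
import math


def directivity(polar_power):
    """Ratio between the circle area of radius max power and the polar-map area."""
    n = len(polar_power)
    peak = max(polar_power)
    d_theta = 2.0 * math.pi / n  # equally spaced orientations are assumed
    map_area = 0.5 * sum(p * p for p in polar_power) * d_theta
    return math.pi * peak * peak / map_area


# An omnidirectional response gives 1; a lobe covering 1/8 of the circle gives 8.
uniform = [1.0] * 64
single_lobe = [1.0] * 8 + [0.0] * 56
d_uniform = directivity(uniform)
d_lobe = directivity(single_lobe)
```

The second case mirrors the example in the text: a directivity of 8 corresponds to a main lobe confined to one eighth of the circle, that is, half a quadrant.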
Whereas $D_P$ is usually considered for broadband sound sources, other metrics are necessary to profile the array's response for different types of sound sources. Figure 6 depicts the maximum side lobe (MSL) and the half-power beamwidth, which are two complementary metrics used to characterize the response of arrays for narrowband sound sources. Half-power beamwidth is the angular extent by which the power response has fallen to half of the maximum level of the main lobe. Since the half-power coincides with a 3 dB drop in power level, it is often called the 3 dB beamwidth ($\mathrm{BW}_{-3\,\mathrm{dB}}$). This metric determines the angular ratio between the power signal level which is at least 50% of the peak power level and the remaining circle. By contrast, MSL is another important parameter used to represent the impact of the side lobes when characterizing arrays. MSL is the normalized ratio between the highest side lobe and the power level of the main lobe, expressed in dB. Both metrics, the MSL and $\mathrm{BW}_{-3\,\mathrm{dB}}$, are desired to be as low as possible, whereas $D_P$ should be as high as possible to guarantee a precise sound-source location.
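Both narrowband metrics can be estimated from a sampled polar power map. The following sketch is one possible way to do so and relies on assumptions not stated in the text: the main lobe is taken as the region around the peak where the power keeps falling, and the beamwidth is quantized to the angular step of the map. The function names are hypothetical.

```python
import math


def _lobe_extent(power, peak_idx, step):
    """Count samples on one side of the peak while the power keeps falling."""
    n, k = len(power), 0
    while k + 1 < n and power[(peak_idx + step * (k + 1)) % n] <= power[(peak_idx + step * k) % n]:
        k += 1
    return k


def beamwidth_and_msl(power):
    """Return (half-power beamwidth in degrees, maximum side lobe level in dB)."""
    n = len(power)
    peak_idx = max(range(n), key=power.__getitem__)
    peak = power[peak_idx]

    # Half-power beamwidth: contiguous samples around the peak at or above peak / 2.
    width = 1
    for step in (1, -1):
        k = 1
        while width < n and power[(peak_idx + step * k) % n] >= peak / 2:
            width += 1
            k += 1
    bw_deg = width * 360.0 / n

    # Everything outside the main lobe is treated as side-lobe region.
    right = _lobe_extent(power, peak_idx, 1)
    left = _lobe_extent(power, peak_idx, -1)
    main = {(peak_idx + j) % n for j in range(-left, right + 1)}
    side = [power[i] for i in range(n) if i not in main]
    highest = max(side, default=0.0)
    msl_db = 10.0 * math.log10(highest / peak) if highest > 0 else float("-inf")
    return bw_deg, msl_db


# Illustrative 16-orientation map: main lobe at index 0, side lobe of 0.25 at index 5.
polar = [1.0, 0.6, 0.2, 0.05, 0.1, 0.25, 0.1, 0.05,
         0.02, 0.05, 0.1, 0.2, 0.1, 0.05, 0.2, 0.6]
bw_deg, msl_db = beamwidth_and_msl(polar)
```

The resolution of the beamwidth estimate is limited by the number of orientations, which is one more reason why a finer sweep improves the characterization of the array.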

A Filter-and-Sum Based Architecture
The proposed architecture uses a Filter-and-Sum based beamforming technique to locate a sound-source with an array of digital MEMS microphones. Many applications, however, demand a certain scalability and flexibility when locating the sound-source. With such requirements in mind, the proposed architecture has some additional features to support a dynamic response targeting applications with real-time demands. The proposed architecture is also designed to be battery power efficient and to operate in streaming fashion to achieve the fastest possible response. One of the features of the ADMP521 microphone is its low-power sleep mode capability. When no clock signal is provided, the ADMP521 microphone enters a low-power sleep mode (<1 μA), which makes this sound-source localizer suitable for battery-powered implementations. The PCB of the MEMS microphone array is designed to exploit this capability. Figure 2 depicts the subarray distribution of the MEMS microphones. Using the clock signal, it is possible to activate or deactivate subarrays since each subarray is fed with an individual clock signal. This flexibility allows disabling not only subarrays of microphones, but also the associated computational components, decreasing the computational demand and the power consumption. The proposed architecture is properly designed to support such flexibility.
The array computes its response as fast as possible to reach real-time sound-source location. The proposed architecture is designed to process in streaming fashion and is mainly composed of three cascaded stages operating in pipeline (Figure 7). The first stage is the filter chain, which is composed of the minimum number of components required to recover the audio signal in the target frequency range. The second stage computes the Filter-and-Sum beamforming operation. The final stage obtains $P(\theta, t_s)$ for the focused orientation. A polar power map is obtained once a complete steering loop is completed. The different stages are discussed in more detail in the following subsections. Table 1 summarizes the most relevant parameters of the proposed architecture.

Filter Stage.
The filter stage contains a PDM demultiplexer and as many filter chain blocks as MEMS microphones (Figure 8). Each microphone of the array is associated with a filter chain composed of a couple of cascaded filters. The full-capacity design supports up to 52 filter chain blocks working in parallel, but their number is defined by the number of active microphones. The unnecessary filter chain blocks are disabled at runtime. The microphones' clock $F_S$ determines the input rate and, therefore, how fast the filter stage should operate. The low operating frequency for current FPGAs allows interesting power savings [17].
Every pair of microphones has its PDM output signal multiplexed in time. Thus, at every edge of the clock cycle the output is the sampled data from one of the microphones. The PDM demultiplexing is the first operation to obtain the individual sampled data from each microphone. This task is done in the PDM splitter block.
The next component consists of a cascade of filters to filter and to downsample each microphone signal. Traditional digital filters such as the Finite Impulse Response (FIR) type of filters are a good solution to reduce the signal bandwidth and to remove the higher frequency noise. Once the signal is filtered it can be decimated to decrease the oversampling to a reasonable audio quality rate (e.g., 48 kHz). However, this filter consumes many adders and dedicated multipliers (DSPs) from the FPGA resources, particularly if its order increases.
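The demultiplexing step can be pictured as de-interleaving a stream captured on both clock edges. The sketch below is a software analogy only; which microphone of the pair drives the rising edge is an assumption, and the function name `split_pdm` is hypothetical.

```python
def split_pdm(interleaved_bits):
    """Split a stream sampled on both clock edges into two per-microphone streams.

    Assumption: even positions were captured on the rising edge (first microphone
    of the pair) and odd positions on the falling edge (second microphone).
    """
    rising_edge_mic = interleaved_bits[0::2]
    falling_edge_mic = interleaved_bits[1::2]
    return rising_edge_mic, falling_edge_mic


# Two clock periods carry two bits for each microphone of the pair.
mic_a, mic_b = split_pdm([1, 0, 1, 1, 0, 0, 1, 0])
```

In hardware the same separation is obtained by registering the shared data line on opposite clock edges, so no buffering of the interleaved stream is needed.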
The Cascaded Integrator-Comb (CIC) filter is an alternative low-pass filtering technique which has been developed in [18,19] and involves only additions and subtractions. This type of filter consists of 3 stages: the integrating stage, the decimator (or interpolator) stage, and the comb section. PDM samples are recursively added in the integrating stage while being recursively subtracted with a differential delay in the comb stage. The number of recursive operations in both the integrating and comb sections determines the order of the filter ($N_{\mathrm{CIC}}$) and should at least be equal to the order of the sigma delta converter of the ADC of the microphones. After the CIC filter, the signal growth ($G$) is proportional to the decimation factor ($D_{\mathrm{CIC}}$) and the differential delay (DD) and is exponential in the filter order [19]:

$$G = (D_{\mathrm{CIC}} \cdot \mathrm{DD})^{N_{\mathrm{CIC}}}. \tag{13}$$
The output bit width grows with the logarithm of $G$. Denote by $B_{\mathrm{in}}$ the number of input bits; then the number of output bits $B_{\mathrm{out}}$ is as follows:

$$B_{\mathrm{out}} = \left\lceil N_{\mathrm{CIC}} \cdot \log_2(D_{\mathrm{CIC}} \cdot \mathrm{DD}) + B_{\mathrm{in}} \right\rceil. \tag{14}$$

The proposed CIC decimation filter eliminates higher frequency noise components and decimates the signal by $D_{\mathrm{CIC}}$ at the same time. However, a major disadvantage of this filter is the nonflat frequency response in the desired audio frequency range. In order to improve the flatness of the frequency response, a CIC filter with a lower decimation factor followed by a compensation FIR filter is often chosen, as in [20-22].
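A behavioural model helps to see the signal growth and the resulting register width. The sketch below is a software model under stated assumptions, not the hardware description used in the architecture: PDM bits are fed as 0/1 values, registers are unbounded Python integers, and the function names and the chosen order and decimation factor are illustrative.

```python
import math


def cic_decimate(bits, order, decimation, diff_delay=1):
    """Decimating CIC: `order` integrators, rate change, then `order` combs."""
    integrators = [0] * order
    combs = [[0] * diff_delay for _ in range(order)]
    out = []
    for i, bit in enumerate(bits):
        value = bit
        for s in range(order):  # integrating stage at the input rate
            integrators[s] += value
            value = integrators[s]
        if (i + 1) % decimation == 0:  # keep one sample out of `decimation`
            for s in range(order):  # comb stage at the reduced rate
                delayed = combs[s].pop(0)
                combs[s].append(value)
                value -= delayed
            out.append(value)
    return out


def cic_output_bits(order, decimation, diff_delay=1, input_bits=1):
    """Register width needed to hold the full signal growth."""
    return math.ceil(order * math.log2(decimation * diff_delay) + input_bits)


# Illustrative parameters: fourth-order CIC decimating by 16, constant input of ones.
settled = cic_decimate([1] * 160, order=4, decimation=16)[-1]
width = cic_output_bits(order=4, decimation=16)
```

For a constant input the settled output equals the gain of the filter, which is why the register width must grow with the order and the decimation factor.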
The CIC filter is followed by an averager, which is used to cancel out the effects caused by the microphones' DC offset output, which would otherwise lead to a constant offset in the beamforming values. This block improves the dynamic range, reducing the bit width required to represent the data after the CIC.
The last component of each filter chain is a low-pass compensation FIR filter based on a Kaiser window. This filter equalises the passband drop usually introduced by CIC filters [19]. It additionally performs a low rate change. The proposed filter also needs a cut-off frequency of $F_{\max}$ at a sampling rate of $F_S / D_{\mathrm{CIC}}$, which is the sampling rate obtained after the CIC decimator filter with a decimation factor of $D_{\mathrm{CIC}}$. This low-pass FIR filter is designed in a serial fashion to reduce the resource consumption. In fact, the FIR filter order $N_{\mathrm{FIR}}$ is also determined by $D_{\mathrm{CIC}}$. Thanks to the streaming nature of the architecture, the CIC filter is able to generate an output value every clock cycle. Due to the decimation factor, only one output value per $D_{\mathrm{CIC}}$ input values is propagated to the low-pass FIR filter. Therefore, the FIR filter has $D_{\mathrm{CIC}}$ clock cycles to compute each input value, which determines its maximum order. The filtered signal is then further decimated by a factor of $D_{\mathrm{FIR}}$ to obtain a minimum bandwidth $\mathrm{BW} = 2 \cdot F_{\max}$ of audio signals to satisfy the Nyquist theorem. The overall $D_F$ can be expressed based on the low rate change of each filter:

$$D_F = D_{\mathrm{CIC}} \cdot D_{\mathrm{FIR}}. \tag{15}$$
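The interplay between the two decimation factors and the budget of the serial FIR filter can be explored with a short script. It is a sketch under assumptions: the overall factor of 64 is only an illustrative value, the serial filter is assumed to perform one multiply-accumulate per clock cycle, and both function names are hypothetical.

```python
def decimation_splits(d_f):
    """All (d_cic, d_fir) pairs whose product equals the overall decimation factor."""
    return [(d_cic, d_f // d_cic) for d_cic in range(1, d_f + 1) if d_f % d_cic == 0]


def max_serial_fir_taps(d_cic, macs_per_cycle=1):
    """Taps a serial FIR can process in the d_cic cycles between two CIC outputs.

    Assumption: one multiply-accumulate per clock cycle unless stated otherwise.
    """
    return d_cic * macs_per_cycle


# Illustrative overall factor of 64: a larger CIC share leaves more cycles to the FIR.
splits = decimation_splits(64)
budget = {d_cic: max_serial_fir_taps(d_cic) for d_cic, _ in splits}
```

A prime overall factor such as 83 admits only the trivial splits, so in practice the sampling frequency and target bandwidth would be chosen so that the overall factor can be shared between both filters.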

Beamforming Stage.
As detailed before, the main purpose of the beamforming operation is to focus the MEMS microphone array in one particular direction. The detection of sound sources is possible by continuously steering in loops of 360°. The number of orientations, $N_o$, determines the angular resolution. Higher angular resolutions demand not only a larger execution time per steering loop, but also more FPGA memory resources to store the precomputed delays per orientation. The beamforming stage depends on the number of microphones and subarrays. Although Filter-and-Sum beamforming assumes a fixed number of microphones and a fixed geometry, our scalable solution satisfies those restrictions while offering a flexible geometry. Figure 9 shows our proposed Filter-and-Sum based beamformer. This stage is basically composed of FPGA memory blocks (BRAM) used as ring buffers that properly delay the filtered microphone signal. The values of the delays at a given moment depend on the focus orientation at that moment and are determined by the array pattern $W(\omega_0, \theta_0, \theta)$ from (5). The delay for a given microphone is determined by its position on the array and by the focus orientation. All possible delay values per microphone for each beamed orientation are precomputed, grouped per orientation, and stored in ROMs at compilation time. During execution time, the delay values $\Delta_m(\theta)$ of each microphone $m$ when pointing to a certain orientation $\theta$ are obtained from this precomputed table.
The beamforming stage is designed to support a variable number of microphones. This is enabled by grouping the input signals following their subarray structure. Therefore, instead of implementing one simple Filter-and-Sum of 52 microphones, there are four Filter-and-Sum operations in parallel for the 4, 8, 16, and 24 microphones. Their sum operation is first done locally for each subarray and afterwards between subarrays. The only restriction of this modular beamforming is the synchronization of the outputs in order to have them properly delayed. Therefore, the easiest solution is to delay all the subarrays with the maximum delay of the subarrays. Although the output of some subarrays is already properly delayed, additional delays, shown at the Sums section in Figure 9, are inserted to ensure that the proper delay of each subarray has been obtained. This is achieved by using the valid output signals of each subarray beamforming, without additional resource cost. Consequently, only the Filter-and-Sum beamforming module linked to an active subarray is enabled. The outputs of the inactive beamformers are set to zero in order to avoid any negative impact on the beamforming operation.
A side benefit of this modular approach is a reduction of the memory resource consumption. Since each subarray has its ring-buffer memory properly dimensioned to its maximum sample delay, the portion of underused regions of the consumed memories is significantly low.
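The combination of a precomputed delay table and ring-buffer memories can be modelled in software. The sketch below is an analogy to the ROM and BRAM structure described above, not the actual hardware design: the 48 kHz beamforming rate, the 343 m/s speed of sound, the four-microphone ring, and the names `build_delay_rom` and `RingDelay` are all assumptions.

```python
import math


def build_delay_rom(positions, n_orient, rate_hz, c=343.0):
    """Precompute integer sample delays for every orientation and microphone."""
    rom = []
    for k in range(n_orient):
        theta = 2.0 * math.pi * k / n_orient
        raw = [(x * math.cos(theta) + y * math.sin(theta)) / c * rate_hz
               for x, y in positions]
        base = min(raw)
        rom.append([round(d - base) for d in raw])  # shift so that delays are >= 0
    return rom


class RingDelay:
    """Fixed-depth ring buffer; read(d) returns the sample pushed d steps ago."""

    def __init__(self, depth):
        self.buf = [0] * depth
        self.head = 0  # next write position

    def push(self, sample):
        self.buf[self.head] = sample
        self.head = (self.head + 1) % len(self.buf)

    def read(self, delay):
        return self.buf[(self.head - 1 - delay) % len(self.buf)]


# Inner subarray modelled as 4 microphones on a 10 cm radius (illustrative geometry).
positions = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1)]
rom = build_delay_rom(positions, n_orient=4, rate_hz=48_000)
depth = max(max(row) for row in rom) + 1  # smallest buffer holding the largest delay

line = RingDelay(depth)
for sample in range(40):
    line.push(sample)
delayed = line.read(rom[0][0])
```

Dimensioning each buffer to the largest delay of its own subarray, as the table makes possible, is what keeps the memory footprint of the inner subarrays small.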

Power Stage.
Figure 10 shows the components of the power stage. Once the filtered data has been properly delayed and added for a particular orientation $\theta$, $P(\theta, t_s)$ is calculated following (10). The P-SRP is obtained after a steering loop, allowing the determination of the sound sources. The sound-source is estimated to be located in the direction shown by the peak of the polar power map, which corresponds to the orientation with the maximum $P(\theta, t_s)$.

Performance Analysis of the Filter-and-Sum Based Architecture
A performance analysis of the proposed architecture is presented in this section. The analysis shows how the design parameters such as the filters' characteristics affect the final execution time of the sound-source locator. The links between performance and design parameters are explained, followed by the description of the different acceleration strategies. These strategies can be considered standalone or combined for certain timing constraints. The advantages of these strategies are presented later in Section 6.

Time Parameters.
The overall execution time of the proposed architecture is defined by the latency of the main components. A detailed analysis of the implementation of the components and the latency that they incur provides good insight into the speed of the system (Table 2). The operating frequency of the design can be assumed to be the same as the sampling frequency. Let us define $t_{\text{P-SRP}}$ as the overall execution time in clock cycles required to obtain the P-SRP. Thus, $t_{\text{P-SRP}}$ is defined as

$$t_{\text{P-SRP}} = N_o \cdot t_o = N_o \cdot (t_{\mathrm{filters}} + t_{\mathrm{beamforming}} + t_{\mathrm{power}}), \tag{16}$$

where $t_o$ is the execution time of one orientation and is determined by the execution time of the filter stage ($t_{\mathrm{filters}}$), the execution time of the beamforming ($t_{\mathrm{beamforming}}$), and the execution time of the power stage ($t_{\mathrm{power}}$), which are the main components of the system as explained in the previous section. The proposed architecture is designed to pipeline each stage, overlapping the execution of each component of the design. Therefore, only the initial latency or initiation interval (II) of the components needs to be considered, since it corresponds to the system group delay.
Let us assume that the design operates at the same frequency $F_S$ as the microphones; then (16) can be rearranged as follows:

$$t_{\text{P-SRP}} = N_o \cdot (t_{\mathrm{II}} + t_s), \tag{17}$$

where $t_{\mathrm{II}}$ is the latency of the system, determined by the initiation interval of the filter stage ($t^{\mathrm{filters}}_{\mathrm{II}}$), the initiation interval of the beamforming stage ($t^{\mathrm{beamforming}}_{\mathrm{II}}$), and the initiation interval of the power stage ($t^{\mathrm{power}}_{\mathrm{II}}$). The time during which the microphone array is monitoring one particular orientation is known as $t_s$. This is the time required to calculate a certain number of output samples ($N_s$). As previously detailed, the digital microphones oversample the audio signal by operating at $F_S$. The reconstruction of the audio signal in the target range demands a certain level of decimation $D_F$. This level of decimation is done by the CIC and the FIR filter in the filter stage, with decimation factors $D_{\mathrm{CIC}}$ and $D_{\mathrm{FIR}}$, respectively. Based on $D_F$ defined in (1), the time $t_s$ is expressed as follows:

$$t_s = D_F \cdot N_s = D_{\mathrm{CIC}} \cdot D_{\mathrm{FIR}} \cdot N_s. \tag{18}$$

The II of each stage of the implementation can also be further decomposed based on the latency of the components, where $t^{i}_{\mathrm{II}}$ is the initiation interval of each component $i$. Therefore, $t_{\mathrm{II}}$ is defined as the sum of all the initiation intervals:

$$t_{\mathrm{II}} = \sum_{i} t^{i}_{\mathrm{II}}.$$

Equation (16) can be rearranged (see Figure 11) as

$$t_{\text{P-SRP}} = N_o \cdot (t_{\mathrm{II}} + D_{\mathrm{CIC}} \cdot D_{\mathrm{FIR}} \cdot N_s). \tag{21}$$

The execution time $t_{\text{P-SRP}}$ is determined by $N_o$ and $N_s$, since the level of decimation is determined by the target frequency range and $t_{\mathrm{II}}$ is determined by the components' design. Although most of the latency of each component of the design is hidden thanks to the pipelined operation, there are still some cycles dedicated to initializing the components. A detailed analysis of $t_{\mathrm{II}}$ provides valuable information about the performance leaks.
CIC. The initiation interval of the CIC filter represents the time required to fill the integrator and the comb stages. Therefore, the order of the CIC ($N_{\mathrm{CIC}}$) determines $t^{\mathrm{CIC}}_{\mathrm{II}}$.
DC. The component which removes the DC level of the signal introduces a minor initial latency due to its internal registers. Since it needs at least two input values to calculate the DC level, it also depends on $D_{\mathrm{CIC}}$.
FIR. The initiation interval of the FIR filter is also determined by the order of this filter ($N_{\mathrm{FIR}}$). Since the filter operation is basically a convolution, the initial output values are not correct until at least the $\lceil (N_{\mathrm{FIR}} + 1)/2 \rceil$th input signal of the filter. Because the filters are cascaded, $D_{\mathrm{CIC}}$ also affects $t^{\mathrm{FIR}}_{\mathrm{II}}$.
Therefore, $t^{\mathrm{filters}}_{\mathrm{II}}$ is expressed as follows:

$$t^{\mathrm{filters}}_{\mathrm{II}} = t^{\mathrm{CIC}}_{\mathrm{II}} + t^{\mathrm{DC}}_{\mathrm{II}} + t^{\mathrm{FIR}}_{\mathrm{II}}.$$

Delay. The beamforming operation is done through memories, which properly delay the audio samples for a particular orientation. The maximum number of samples determines the minimum size of these delay memories. This value represents the maximum distance between a pair of microphones for a certain microphone array distribution and may vary for each orientation. The initiation interval of the Filter-and-Sum beamformer, $t^{\mathrm{Delay}}_{\mathrm{II}}$, is therefore expressed in terms of the maximum distance between pairs of microphones for a particular orientation, that is, in terms of $\max(\Delta_{\mathrm{am}}(\theta))$, the maximum time delay of the active microphones for the beamed orientation $\theta$. Therefore, $t^{\mathrm{Delay}}_{\mathrm{II}}$ is mainly determined by the microphone array distribution, $F_S$, and the target frequencies, which determine $D_F$. Due to the symmetry of the microphone array and for the sake of simplicity, it is assumed that each orientation has the same $\max(\Delta_{\mathrm{am}})$. Notice that this does not need to be true for different array configurations.
Sum. The proposed beamforming is composed of not only a set of delay memories but also a sum tree. The initiation interval of this component is defined by the number of active microphones ($N_{\mathrm{am}}$).
Therefore, $t^{\mathrm{beamforming}}_{\mathrm{II}}$ is expressed as follows:

$$t^{\mathrm{beamforming}}_{\mathrm{II}} = t^{\mathrm{Delay}}_{\mathrm{II}} + t^{\mathrm{Sum}}_{\mathrm{II}}.$$

Power. The final component is the calculation of the power per orientation. This simple component has a constant latency of a couple of clock cycles.
The timing analysis of the initiation interval of each component of the architecture indicates which design parameters have the highest impact. The definition of the filters, mainly their order, is determined by the application specifications, so it should not be modified to reduce the overall execution time. On the other hand, the distribution of the microphones in the array affects not only the frequency response of the system, but also the execution time. Notice, however, that the total number of microphones does not have a timing impact. Only the number of active microphones has a minor impact, amounting to a difference of a couple of clock cycles. Nevertheless, (21) already shows that the dominant parameters are $N_o$ and $N_s$.
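The dominance of the number of orientations and of the number of samples can be illustrated with a small timing model. This is a sketch under assumptions: each orientation is taken to cost the total initiation interval plus the sensing time in clock cycles, the design is assumed to be clocked at the microphone frequency, and all numeric values (64 orientations, 64 samples, factors 16 and 4, 200 cycles of initiation interval, 2 MHz clock) are hypothetical.

```python
def t_psrp_cycles(n_o, n_s, d_cic, d_fir, t_ii_cycles):
    """Clock cycles for one steering loop: n_o * (t_II + D_CIC * D_FIR * N_s)."""
    sensing_cycles = d_cic * d_fir * n_s  # cycles spent listening per orientation
    return n_o * (t_ii_cycles + sensing_cycles)


def cycles_to_ms(cycles, f_s_hz):
    """Convert cycles to milliseconds, assuming the design runs at f_s_hz."""
    return 1e3 * cycles / f_s_hz


# Hypothetical configuration, only meant to show the relative weight of each term.
total = t_psrp_cycles(n_o=64, n_s=64, d_cic=16, d_fir=4, t_ii_cycles=200)
total_ms = cycles_to_ms(total, f_s_hz=2.0e6)
halved_orientations = t_psrp_cycles(n_o=32, n_s=64, d_cic=16, d_fir=4, t_ii_cycles=200)
```

In this example the initiation interval accounts for less than 5% of the time per orientation, so halving the number of orientations or of samples has a much larger effect than shaving cycles from the component latencies.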

Sensitive Parameters.
The timing analysis provides an indication of the parameters dominating the execution time. Some parameters, like the microphone array distribution, which determines the beamforming latency, are fixed, while others, like $N_o$ or $t_s$ per orientation, are variable.
Orientations. Figure 5 depicts how an increment of $N_o$ leads to a better sound-source localization. This resolution, however, has a strong repercussion on the response time. A simple strategy is to maintain the angular resolution only where it is needed while quickly exploring the surrounding sound field. For instance, the authors in [3] propose a strategy to reduce the beamforming exploration to 8 orientations, with an angular separation of 45 degrees. Once a steering loop ends, the orientations are rotated one position, which represents a shift operation in the precomputed orientation table. Therefore, all the supported 64 orientations are monitored after 8 steering loops. Although this strategy intends to accelerate the peak detection by monitoring the minimum $N_o$, the overall $N_o$ remains the same for achieving the equivalent angular resolution.
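The rotating exploration described above can be sketched as follows. This is only an illustration of the shift over a precomputed orientation table, assuming 64 equally spaced orientations; the names are hypothetical and the code is not the implementation from [3].

```python
N_ORIENTATIONS = 64  # supported orientations, 5.625 degrees apart
PER_LOOP = 8         # orientations monitored in one steering loop

# Precomputed orientation table, in degrees.
ANGLES = [k * 360.0 / N_ORIENTATIONS for k in range(N_ORIENTATIONS)]
STRIDE = N_ORIENTATIONS // PER_LOOP  # 8 table entries, i.e. 45 degrees


def orientations_for_loop(loop_index):
    """Orientations monitored in a given steering loop.

    Each new loop shifts the selection by one table position.
    """
    shift = loop_index % STRIDE
    return [ANGLES[shift + j * STRIDE] for j in range(PER_LOOP)]


def full_scan():
    """Orientations visited over STRIDE consecutive steering loops."""
    seen = []
    for loop in range(STRIDE):
        seen.extend(orientations_for_loop(loop))
    return seen
```

Each loop monitors 8 orientations spaced 45 degrees apart, and `sorted(full_scan())` equals the complete table, so the full resolution still costs 8 steering loops.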
Sensing Time. The sensing time is a well-known parameter in radio frequency applications. The time $t_s$ is known to strengthen the robustness against noise [23]. In our case, the time a receiver is monitoring the surrounding sound field determines the probability of properly detecting a sound source. Consequently, a higher $t_s$ is needed to detect and locate sound sources under low Signal-to-Noise Ratio (SNR) conditions. Although this term could be modified at runtime to adapt the sensing of the array based on an estimated SNR, it would demand a continuous SNR estimation, which is out of the scope of this paper.
To conclude, Table 2 summarizes the timing definitions. On one hand, $t_s$ determines the number of processed acoustic samples and therefore directly affects the sensing of the system. On the other hand, $N_o$ determines the angular resolution of the sound-source search and influences the accuracy. There is a trade-off between $t_s$ and $N_o$ and the quality of the sound-source location.

Strategies for Time Reduction.
The following three strategies are proposed to accelerate the sound-source localization without any impact on the frequency response and   of the architecture. An additional strategy is proposed specifically for dynamic acoustic environments, but at a certain accuracy cost.

Continuous Processing.
The proposed architecture is designed to reset the filter and beamforming stages at each orientation transition. Because the beamforming is performed after the filter stage, the system can keep processing while resetting. The filter stage does not need to stop its processing. The input data is not lost due to the reset operations, since the filtered input values are stored in the beamforming stage. Furthermore, the initialization of the beamforming stage can also be eliminated, since the stored data from the previous orientation can be reused for the calculation of the new one. With this approach, (17) becomes as follows:

Time Multiplexing.
Nowadays, FPGAs can operate at clock speeds of hundreds of MHz. Although the power consumption is significantly lower when operating at low frequency [17], the proposed architecture is able to operate at a much higher frequency than the data sampling rate. This capability provides the opportunity to parallelize the beamforming computations without any additional resource consumption. Instead of consuming more logic resources by replicating the main operations, the proposed strategy, similar to Time-Division Multiplexing in communications, consists in time multiplexing these parallel operations. Because the input data is oversampled audio, the selection of the operations to be time multiplexed is limited. Based on (21), the candidates to be parallelized are $t_s$ and $N_o$. Since the input data rate is determined by $F_S$, (18) shows that $t_s$ cannot be reduced without decreasing $N_s$ or changing the target frequency range. Nevertheless, since the computation of each orientation is data independent, the orientations can be parallelized.

After $t_{II}$, the delay memories which compose the Filter-and-Sum beamforming stage have already stored enough audio data to start locating the sound-source. Because the beamforming operation relies on delaying the recovered audio signal, multiple orientations can be computed in parallel by accessing the content of the delay memories at a higher speed than the sampling of the input data. It basically multiplexes the output beamforming computations over time. The required frequency $F_P$ to parallelize all $N_o$ orientations for this architecture is defined as follows:

$$F_P = \frac{F_S}{D_F} \cdot N_o.$$

Due to (1), $F_P$ can also be expressed based on the target frequency range:

$$F_P = 2 \cdot F_{\max} \cdot N_o.$$

Notice that the required frequency to multiplex in time the computation of the orientations does not depend on the number of microphones in the array. Figure 12 shows the clock domains when applying this strategy. While the front end, consisting of the microphone array and the filter stage, operates at $F_S$, the output of the beamforming is processed at $F_P$. The additional cost in terms of resources is the extension of the register for the power per angle calculation. A memory of $N_o$ positions is required instead of the single register used to store the accumulated power values. This strategy allows fully parallelizing the computation of all the orientations. Thus, $t_{\text{P-SRP}}$ is mainly limited by $N_s$ and the maximum reachable frequency of the design, since $F_S$ is determined by the microphones' operational frequency and $D_F$ by the frequency range of the target sound-source. In fact, $F_P$ determines how many orientations can be processed in parallel.
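As a worked illustration, the required multiplexing frequency can be evaluated for the design analysed later in the paper ($N_o = 64$, $F_{\max} = 15.625$ kHz). The relation used here, one beamforming output per orientation for every output audio sample at a rate of $2 F_{\max}$, is an assumption inferred from the surrounding text:

$$
F_P = 2 \cdot F_{\max} \cdot N_o = 2 \cdot 15.625\ \text{kHz} \cdot 64 = 2\ \text{MHz}.
$$

Under this assumption $F_P$ equals the 2 MHz microphone clock of that design, which is consistent with the later remark that 64 orientations do not require a higher $F_P$.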

Parallel Time Multiplexing.
This proposed strategy is an extension of the previous one. The frequency $F_P$ is limited by the maximum attainable operating frequency of the implementation, which is determined by many factors, from the technology to the available resources on the FPGA. For instance, if $F_{\max}$ equals 30 kHz and the maximum attainable operating frequency is 100 MHz, then up to 1666 orientations could be computed in parallel. However, if not all the resources of the FPGA are completely consumed, especially the internal blocks of memory (BRAM), there is still room for improvement. With the time multiplexing strategy, the memories of the beamforming stage are fully accessed, since in each clock cycle there is at least one memory access, or even two memory accesses when new data is stored. Therefore, more memory resources can be used to further accelerate the computation of the P-SRP. The simple replication of the beamforming stage, preconfigured for different orientations, is enough to double the number of processed orientations while maintaining the same $t_{\text{P-SRP}}$. The strategy mainly consumes BRAMs. Nevertheless, due to the value of $\max(\Delta_{am})$ at BW for our microphone array, only a few audio samples are needed to complete the beamforming. This fact drastically reduces the memory consumption, which provides the potential computation of thousands of orientations by applying both strategies.

All strategies can be applied independently, although some will only work properly when combined. Not all strategy combinations are beneficial. For instance, a dynamic angular resolution should only be combined with the time multiplexing of the orientations when   is higher than   . Otherwise, the reduction of the number of orientations by dynamically readjusting the target orientations does not provide any acceleration and would only degrade the response of the system.

Results
The proposed architecture is evaluated in this section. Our analysis starts by evaluating different design solutions based on the timing analysis introduced in Section 5.1. One representative configuration is evaluated based on the frequency response and accuracy by using the metrics described in Section 3.5. This evaluation also considers sensitive parameters such as the number of active subarrays and the relevance of $N_o$, already introduced in Section 5.2. The resource and the power consumption for a Zynq 7020 target FPGA are also presented. Finally, the strategies presented in Section 5.3 are applied to the representative design.

General Performance Analysis.
The performance analysis proposed in the previous section is applied here to a concrete example. The explored design parameters are $F_S$ and $F_{\max}$, keeping $N_s$ and $N_o$ both constant at 64. Whereas $F_S$ is determined by the microphone's sampling frequency, $F_{\max}$ is determined by the target application. For our design space exploration, we consider an $F_{\max}$ from 10 kHz to 16 kHz in steps of 125 Hz, and $F_S$ ranges from 1.25 MHz to 3.072 MHz as specified in [10].
Equations (16) to (18) and (20) to (32) are used to obtain $t_{\text{P-SRP}}$. The performance analysis starts by obtaining $D_F$ for every possible value of $F_S$ and $F_{\max}$. All possible combinations of $D_{\text{CIC}}$ and $D_{\text{FIR}}$ are considered based on (15). The low-pass FIR filter parameters are $N_{\text{FIR}}$, which is determined by $D_{\text{CIC}}$, and $F_{\max}$ as the cut-off frequency. Each possible low-pass FIR filter is generated considering a transition band of 2 kHz and an attenuation of at least 60 dB at the stop band. If the minimum order of the filter is higher than $N_{\text{FIR}}$, the filter is discarded. We consider these parameters as realistic constraints for low-pass FIR filters. Furthermore, a minimum order of 4 is defined as the threshold for $N_{\text{FIR}}$. Thus, some values are discarded because $D_F$ is a prime number or $N_{\text{FIR}}$ is below 4. Each low-pass FIR filter is generated and evaluated in Matlab 2016b.
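The first steps of this exploration, deriving the overall decimation factor and splitting it between the CIC and FIR stages, can be sketched as below. The sketch assumes that the decimation factor is $F_S / (2 F_{\max})$ rounded to the nearest integer and that it is the product of the two stage factors. The `min_factor` parameter is a hypothetical simplification of the paper's threshold of 4, applied here to both factors. The FIR filter design and its feasibility check are not reproduced.

```python
def decimation_factor(f_s_hz, f_max_hz):
    """Overall decimation factor D_F.

    Assumption: D_F = F_S / (2 * F_max), rounded to the nearest integer.
    """
    return round(f_s_hz / (2.0 * f_max_hz))


def factor_pairs(d_f, min_factor=2):
    """Candidate (D_CIC, D_FIR) pairs whose product is d_f.

    Both factors are kept at or above min_factor (hypothetical rule).
    A prime d_f yields no pair and is therefore discarded.
    """
    pairs = []
    for d_cic in range(min_factor, d_f // min_factor + 1):
        if d_f % d_cic == 0:
            pairs.append((d_cic, d_f // d_cic))
    return pairs
```

For instance, `decimation_factor(2e6, 15625)` gives 64 and `factor_pairs(64)` contains the pair `(16, 4)` used by the design analysed in the next section, whereas `factor_pairs(61)` is empty because 61 is prime.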
Figure 13 depicts the minimum timings of the design space exploration (DSE) that the proposed Filter-and-Sum architecture needs to compute one orientation. This time is slightly reduced when varying $F_S$. For instance, it is reduced from 5.03 ms to 3.97 ms when $F_{\max}$ = 10 kHz. A higher $F_S$ means a faster sampling, which is in fact the limiting factor of the operational frequency. Furthermore, a larger decrement of $t_{\text{P-SRP}}$ is produced when increasing both $F_S$ and $F_{\max}$. Higher values of $F_{\max}$ allow higher values of $D_{\text{CIC}}$, which can greatly reduce the computational complexity of narrowband low-pass filtering. However, too high values of $D_{\text{CIC}}$ lead to such low rates that, although a higher order low-pass FIR filter is supported, it cannot satisfy the low-pass filtering specifications. Notice how the number of possible solutions decreases while increasing $F_{\max}$. Due to the $F_S$ and $F_{\max}$ ranges, the values of $D_F$ vary between 39 and 154. However, as previously explained, many values cannot be considered since they are either prime numbers or the decomposition in factors of $D_{\text{CIC}}$ leads to values below 4. Because higher values of $F_{\max}$ lead to low values of $D_{\text{CIC}}$ for low $F_S$, these $D_{\text{CIC}}$ values cannot satisfy the specifications of the low-pass FIR filter.
Finally, relatively low values of $t_{\text{P-SRP}}$ are obtained for $F_{\max}$ values from 10 kHz to 10.65 kHz and $F_S$ ranging from 2.7 MHz to 3.072 MHz. This is produced by high values of $D_{\text{CIC}}$, which means that a higher order low-pass FIR filter is supported. As expected, high values of $D_{\text{CIC}}$ lead to high order low-pass FIR filters and lower $D_{\text{FIR}}$. A lower $t_{\text{P-SRP}}$ is possible because unnecessary computations are avoided, since fewer samples are decimated after the low-pass FIR filter.

Analysis of a Design.
As shown in Figure 13, several design considerations drastically affect the final performance. However, most of these design decisions do not have a significant impact on the system response compared to other factors such as the number of active microphones or the number of orientations. The impact of these parameters on the system's response and performance is analyzed for one particular design. Table 3 summarizes the configuration of the architecture. The design considers $F_S$ = 2 MHz, which is the clock for the microphones and the functional frequency of the design. This value of $F_S$ is an intermediate value within the clock range required by the ADMP521 microphones [10]. The selected cut-off frequency is $F_{\max}$ = 15.625 kHz, which leads to $D_F$ = 64. In this example design, the CIC filter has an order $N_{\text{CIC}}$ = 4 with a decimation factor of 16 and a differential delay of 32. The chosen FIR filter has a beta factor of 2.7 and a cut-off frequency of $F_{\max}$ at a sampling rate of 125 kHz, which is the sampling rate obtained after the CIC decimator filter with $D_{\text{CIC}}$ = 16. The filtered signal is then further decimated by a factor $D_{\text{FIR}}$ = 4 to obtain an audio signal with BW = 31.25 kHz.
The architecture is designed to support a complete steering loop of up to 64 orientations, which represents an angular resolution of 5.625°. On the other hand, the subarray approach allows activating all 52 microphones when the 4 subarrays are active. The final results are obtained by assuming a speed of sound of ≈343.2 m/s.
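The figures of this configuration can be cross-checked with a short calculation. The relation $D_F = F_S / (2 F_{\max})$ is assumed here from the stated values rather than quoted from an equation in this section:

$$
D_F = \frac{F_S}{2 F_{\max}} = \frac{2\ \text{MHz}}{2 \cdot 15.625\ \text{kHz}} = 64 = D_{\text{CIC}} \cdot D_{\text{FIR}} = 16 \cdot 4
$$

$$
\frac{F_S}{D_{\text{CIC}}} = \frac{2\ \text{MHz}}{16} = 125\ \text{kHz}, \qquad \frac{125\ \text{kHz}}{D_{\text{FIR}}} = \frac{125\ \text{kHz}}{4} = 31.25\ \text{kHz} = 2 F_{\max}
$$

$$
\frac{360^\circ}{64} = 5.625^\circ
$$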

Frequency Response.
The waterfall diagrams of Figure 14 show the power output of the combined subarrays in all directions for all frequencies. In our case, the results are calculated with a single sound-source varying between 100 Hz and 15 kHz in steps of 100 Hz and placed at 180°. All results are normalized per frequency. Every waterfall shows a clear, distinctive main lobe. When only subarray 1 is active, there are side lobes at 5.3 kHz and 10.6 kHz which impede the sound-source location for those frequencies. The frequency response of the subarrays improves when they are combined, since their frequency responses are superposed. The combination of subarrays 1 and 2 reaches a minimum detectable frequency of 3.1 kHz, whereas the combination of subarrays 1, 2, and 3 and the combination of all subarrays reach 2.1 kHz and 1.6 kHz, respectively. These minimum values are clearly depicted in Figure 15, with a threshold of 8 for the directivity $D_P$, which indicates that the main lobe's surface corresponds at most to half of a quadrant. The frequency response of the combination of subarrays has a strong variation at the main lobe and, therefore, in $D_P$.

Figure 15 depicts the evolution of $D_P$ when increasing the angular resolution and when combining subarrays. The angular resolution determines the upper bound to which $D_P$ converges, which depends on the number of orientations. The number of active microphones, on the other hand, influences how fast $D_P$ converges to its upper limit. Consequently, the number of active microphones determines the minimum frequency which can be located when considering a threshold of 8 for $D_P$.

Alongside the directivity, other metrics such as the main beamwidth and the MSL levels are also calculated to properly evaluate the quality of the array's response. Figure 16 depicts the MSL when varying the number of active subarrays and the number of orientations. A low angular resolution leads to a lower resolution of the waterfall diagrams, but only the metrics can show the impact. At frequencies between 1 and 3 kHz the main lobe converges to a unit circle, which can be explained by the lack of any side lobe. Higher frequencies present secondary lobes, especially when only the inner subarray is active, which increases the MSL values independently of the angular resolution. A low angular resolution leads to unexpectedly low values of MSL, since the secondary lobes are not detected. On the other hand, a higher number of active microphones leads to lower values of MSL, independently of the angular resolution.

Figure 17 depicts the BW −3 dB metric for a similar analysis of the number of microphones and angular resolution. On one hand, a higher number of microphones produces a faster decrement of BW −3 dB, reflected as a thinner main lobe. Nevertheless, the BW −3 dB of each subarray converges to a minimum, which is only reached at higher frequencies. The angular resolution determines this minimum, which ranges from 90° to 11.25° when 8 or 64 orientations are considered, respectively.

Resource Consumption and Power Analysis.
Table 4 summarizes the resource consumption when combining subarrays. The consumed resources are divided into the resources for the filter stage, the beamforming stage, and the total consumption per group of subarrays. The filter stage mostly consumes DSPs, while the beamforming stage mainly demands BRAMs. Most of the resource consumption is dominated by the filter stage, since a filter chain is dedicated to each MEMS microphone. What determines the resource consumption is the number of active subarrays.
The flexibility of our architecture allows the creation of heterogeneous sound-source locators. Thus, the architecture can be scaled for small FPGAs based on the target sound-source profile or a particular desirable power consumption. For instance, the combination of the two inner subarrays would use 12 microphones while consuming less than 10% of the available resources. The LUTs are the limiting resource due to the internal registers of the filters. In fact, when all the subarrays are used, around 80% of the available LUTs are required. Nevertheless, any subarray can be disabled at runtime, which directly deactivates its associated filter and beamforming components. Although this does not affect the resource consumption, it has a direct impact on the power consumption. Table 5 shows the power consumption.

The other two strategies proposed in Section 5.3.1 are designed to fully exploit the FPGA resources and to overcome time constraints when considering a high angular resolution. In the first case, since the design under evaluation has a small angular resolution ($N_o$ = 64), there is no need for a higher $F_P$ when applying the time multiplexing strategy. However, a higher angular resolution can be obtained without additional timing cost by using the unconsumed resources. Table 8 shows that the combination of strategies increases the angular resolution without additional time penalty. The operational frequency ($F_{op}$) determines at what speed the FPGA can operate. By following (33), the beamforming operation can be exploited by increasing $F_P$ up to the maximum frequency, which increases $N_o$ as well:

$$\max(N_o) = \frac{\max(F_{op}) \cdot D_F}{F_S}.$$

Many thousands of orientations can be computed in parallel when combining all strategies. The beamforming stage can be replicated as many times as the remaining available resources allow. Of course, this estimation is certainly optimistic, since the frequency drops when the resource consumption increases. Nevertheless, this provides an upper bound for $N_o$. For instance, when only the inner subarray is considered, the DSPs are the limiting component. However, up to 53 beamforming stages could theoretically be placed in parallel. When more subarrays are active, the BRAMs are the constrained component. Notice how the number of supported orientations increases if the number of subarrays decreases. This has, however, an impact on the frequency response and the accuracy of the system, as shown in Section 6.2.1. Nevertheless, tens of thousands of orientations can be computed in parallel in only around 2 ms by operating at the highest $F_{op}$ and by replicating the beamforming stage to exploit all the available resources.
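The upper bound discussed above can be approximated with a few lines of code. This is a sketch under the assumption that each beamforming stage produces one orientation result per clock cycle at an output rate of twice the maximum target frequency. The function name and the `beamformers` parameter are hypothetical, and the drop in clock frequency as resources fill up is ignored, as in the text.

```python
def max_parallel_orientations(f_op_hz, f_max_hz, beamformers=1):
    """Optimistic bound on the orientations computed in parallel.

    f_op_hz: maximum attainable operating frequency of the design.
    f_max_hz: maximum target frequency; output rate assumed 2 * f_max_hz.
    beamformers: number of replicated beamforming stages (hypothetical).
    """
    per_beamformer = int(f_op_hz // (2 * f_max_hz))
    return beamformers * per_beamformer
```

With a 100 MHz clock and a 30 kHz maximum target frequency, `max_parallel_orientations(100e6, 30e3)` returns 1666, matching the example given for the time multiplexing strategy. Replicating the beamforming stage once doubles that bound.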

Conclusions
In this paper we have presented a scalable and flexible architecture for fast sound-source localization. On one hand, the architecture can flexibly disable sections of the microphone array that are not needed, or disable them to respect power restrictions. The modular approach of the architecture allows scaling the system for a larger or smaller number of microphones. Nevertheless, such capabilities do not impact the frequency response and accuracy of our sound-source locator. On the other hand, several strategies to offer real-time sound-source localization have been presented and evaluated. These strategies not only accelerate the localization but also provide solutions for time-stringent applications that demand a high angular resolution. Thousands of angles can be monitored in parallel, offering a high-resolution sound-source localization in a couple of milliseconds.

Figure 1: Operations needed for the proposed architecture to locate a sound-source.

Figure 4: Examples of a polar map obtained under experimental conditions for sound sources of 5 kHz (a) and 8 kHz (b).

Figure 7: Main stages of the proposed architecture.

Figure 8: The filtering stage consists of a couple of filters with a downsampling factor.

Figure 9: Details of the internal structure of the proposed modular Filter-and-Sum beamforming. Note that the delay values are stored in a precomputed table.

Figure 10: The power stage consists of a couple of components to calculate P-SRP and the estimated location of the sound-source.

Figure 13: Minimum values of the time needed per orientation based on $F_S$ and $F_{\max}$. Different perspectives are displayed in the bottom figures. Notice how the shortest time is obtained when increasing $F_{\max}$ and $F_S$.

Figure 14: Waterfall diagrams of the proposed architecture. The figures are obtained by enabling only a certain number of subarrays. From (a) to (d): only the 4 innermost microphones, only the 12 innermost microphones, the 28 innermost microphones, and all microphones.

Figure 15: Directivities when considering a variable number of orientations and active microphones. From (a) to (d): $D_P$ with only 8 orientations (a) up to 64 orientations (d).

Figure 16: Measured MSL when considering a variable number of orientations and active microphones. From (a) to (d): the MSL with only 8 orientations (a) up to 64 orientations (d).

Table 1: Relevant parameters involved in the proposed architecture.

Table 2: Relevant parameters involved in the performance calculation for the proposed architecture.

Figure 12: Clock regions for the time multiplexing of the computation of multiple orientations.

Table 3: Configuration of the architecture under analysis.

Table 4: Resource consumption after placement and routing when combining microphone subarrays. Each subarray combination details the resource consumption of the filter and the beamforming stage.

Table 5: Power consumption at $F_S$ = 2 MHz, expressed in mW, when combining microphone subarrays. Values obtained from the Vivado 2016.4 power report.

Table 6: Timing analysis without any optimization of the design under evaluation. The values are expressed in s.

from the filter stage is large enough. By combining the first two strategies, $t_{\text{P-SRP}}$ rounds to 2 ms and only the first steering loop needs 2.6 ms due to $t_{II}$. In this case, $t_{\text{P-SRP}}$ is expressed as follows:

$$t_{\text{P-SRP}} = t_{II} + t_s \approx t_s.$$
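The two figures quoted above can be related with a rough calculation. It assumes that the sensing time is the number of output samples divided by the output audio rate, $t_s = N_s \cdot D_F / F_S$, with $N_s = 64$, $D_F = 64$, and $F_S = 2$ MHz for the design under evaluation. This expression is an assumption, since the corresponding equation is not reproduced in this section:

$$
t_s \approx \frac{N_s \cdot D_F}{F_S} = \frac{64 \cdot 64}{2\ \text{MHz}} = 2.048\ \text{ms}, \qquad t_{II} \approx 2.6\ \text{ms} - 2.0\ \text{ms} \approx 0.6\ \text{ms}.
$$

Under this assumption the initiation interval is paid only once, in the first steering loop, and every later loop costs approximately $t_s$.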

Table 7: Timing analysis of the optimized designs when applying and combining the first two strategies. The values are expressed in ms.

Table 8: Maximum $N_o$ when combining strategies. The maximum number of beamformers is obtained based on the available resources and the resource consumption of each beamformer (Table 4). The maximum $F_{op}$ is reported by the Vivado 2016.4 tool after placement and routing. Columns: Inner 4 MICs, Inner 12 MICs, Inner 28 MICs, All 52 MICs.