Mechanical Fault Sound Source Localization Estimation in a Multisource Strong Reverberation Environment



Introduction
Over the past decade, researchers have been actively exploring the use of acoustic features, classification, and clustering algorithms to predict or detect the state of machinery. They have also leveraged microphone arrays to capture spatial information for fault source localization [1,2]. Various sound source locators have been developed; for example, Schober et al. built a functional sound source locator based on stochastic computing (SC) [3]. However, the practical efficacy of source localization systems is limited by factors such as noise interference, reverberation, the nature of the source signals, and the number of sources. As a result, the application of source localization technology in mechanical fault detection still faces numerous challenges.
Traditional source localization algorithms are primarily built upon signal models and array signal processing techniques and are roughly divided into three categories: time difference of arrival (TDOA)-based algorithms [4,5], signal subspace-based algorithms [6,7], and beamforming-based algorithms [8,9]. TDOA-based algorithms offer low computational complexity and ease of implementation, which makes them highly practical for scenarios demanding real-time performance. However, they are sensitive to array hardware errors and exhibit reduced resilience to noise and reverberation. Signal subspace-based algorithms can achieve ultra-high-resolution multisource localization. Nevertheless, they also exhibit certain limitations. (1) They require prior knowledge of the number of sources, making them unsuitable for scenarios with an unknown number of sources. (2) They impose several restrictions on sources and noise, demanding that sources be uncorrelated and that noise adhere to Gaussian signal assumptions.
Beamforming-based algorithms utilize synthesized beams to visualize the target region. The direction with the highest response power corresponds to the direction of the source. They are more robust to noise and reverberation than TDOA-based algorithms. DiBiase et al. [10] combined the characteristics of TDOA-based and beamforming-based source localization algorithms and proposed a method called steered response power-phase transform (SRP-PHAT). This approach applies the PHAT weighting to the sound signals received by each microphone, forming a directional beam. Subsequently, the beam scans the spatial search grids and the response power spectrum is computed. The location of the spectral peak is used to estimate the source's position. Compared to adaptive beamforming, SRP-PHAT does not require prior knowledge of the source and noise, making it one of the mainstream source localization algorithms in recent years. Shi et al. [11] proposed a localization method for low-frequency noise sources based on the virtual SMA extrapolation method and resolved the localization problem of a small-aperture SMA with low-frequency noise sources.
In addition to the array signal techniques mentioned above, many researchers have introduced neural networks into sound source localization, such as CNN [12], CRNN [6], and AE [13]. These approaches process the signals collected by microphone arrays through feature extraction modules to obtain input features, which are then fed into neural networks to estimate source positions or directions of arrival. For example, Chakrabarty and Habets [14] proposed using the phase spectrum obtained from the STFT of multichannel data as the input to a CNN and learning the phase relationship between neighboring channels through three consecutive convolutional layers to achieve sound source azimuth estimation, which shows excellent robustness in reverberant environments. Salvati et al. [15] proposed feeding the narrowband response power components of each beamformer into a neural network. Through a CNN, they automatically learned the weighting vectors of these components, achieving higher-precision source localization. Adavanne et al. [16], on the other hand, utilized a CRNN in a multitask configuration. This method applies to various microphone array structures and demonstrates strong robustness in scenarios involving reverberation and low signal-to-noise ratios. Senocak et al. [17] proposed a cross-modal alignment task as a joint task with sound source localization to better learn the interaction between audio and visual modalities. This approach aims to achieve high localization performance through robust cross-modal semantic understanding. Park et al. [18] proposed localizing sound sources in visual scenes with a self-supervised approach, in which a less strict decision boundary in contrastive learning alleviates the effect of noisy correspondences in sound source localization.
In summary, source localization methods incorporating neural networks can effectively address the performance degradation of source localization systems under complex scenarios involving reverberation, noise, and multiple sources. This paper addresses the challenge of spurious peaks in the response power spectra of the SRP-PHAT algorithm in multisource strong reverberation environments. Source localization amounts to locating the direction of sources within audio signals, a task analogous to object localization in images, where the U-net deep neural network has demonstrated remarkable performance [19]. Thus, we propose to introduce the U-net network. Firstly, the SRP-PHAT algorithm is utilized to generate response power spectra. Subsequently, the U-net network is employed for multifault sound source localization.
This paper is organized as follows. Section 1 discusses the application of sound source localization technology in mechanical fault detection and reviews the research progress of various methods. Section 2 provides a detailed introduction to the proposed multisource localization method, which combines the U-net and the SRP-PHAT algorithm. Section 3 focuses on the training and experimental analysis of the U-net network. Finally, Section 4 summarizes the key findings and conclusions of the paper.

U-Net-Based Multifault Sound Source Localization Method
The multifault sound source localization method proposed in this paper combines U-net with the SRP-PHAT algorithm. It treats sound source localization as a pixel-level classification task, where each pixel corresponds to a discrete spatial grid region and the pixel value represents the response power of that grid, from which the actual sound sources can be determined. In detail, the SRP-PHAT algorithm is used as a feature extractor for the spatial information of the sound sources. The response power spectrum, which contains pseudopeaks, is output as a feature map; the U-net network converts it into a "clean" sound source distribution map through a series of convolution, pooling, and upsampling operations, and an interpolation search method is then applied to realize multisource localization. The overall framework is shown in Figure 1 and is divided into two stages: U-net network training and sound source localization. In the training stage, for each sample in the dataset, the multichannel audio data are processed through the SRP-PHAT algorithm to compute the response power spectrum, which serves as the input feature. Simultaneously, the source distribution map is calculated using the source positions and sound power levels, and this map is used as the ground truth label for the sample. The U-net network is then trained. In the sound source localization stage, the signals collected by the array are processed by the SRP-PHAT algorithm to calculate the response power spectrum, which is then fed into the pretrained U-net model. This model predicts the source distribution map. By employing an interpolation search technique on this map, the specific positions of multiple sources are determined, thus completing the localization process.

Shock and Vibration

Controlled Response Power Spectrum Calculation Based on SRP-PHAT.
The process of calculating the response power spectrum using SRP-PHAT is depicted in Figure 2. The search space is divided into discrete spatial grids based on the desired resolution, and all these grids are candidate positions for sources. Firstly, the audio signal received by each microphone is weighted to form a directional beam. Secondly, this beam is used to scan each spatial grid in the search space, and the controllable response power of that grid is computed. This process yields the response power spectrum of the source plane.
Assume the microphone array has $M$ microphones. Let $R_{mn}(\tau)$ denote the phase-transform-weighted generalized cross-correlation function between the signals received by microphones $m$ and $n$, and let $\Delta\tau_{mn}(l)$ denote the time difference of arrival, that is, the difference between the propagation delays of the sound signal from grid $l$ to microphones $m$ and $n$. This delay is the steering delay of the microphone array at grid $l$. The response power at grid $l$ can be represented as follows:

$$P(l) = \sum_{m=1}^{M} \sum_{n=m+1}^{M} R_{mn}\big(\Delta\tau_{mn}(l)\big). \tag{1}$$

Compared to conventional controllable response power-based source localization algorithms, the SRP-PHAT algorithm applies phase-transform weighting to the generalized cross-correlation function when calculating the controllable response power. This weighting removes the amplitude information from the cross-power spectrum and retains only the phase information. It weakens the irrelevant peaks in the generalized cross-correlation function and leads to sharper spectral peaks. Consequently, it reduces the sensitivity of the SRP-PHAT localization algorithm to noise and reverberation components.
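For reference, the steering delay can be written explicitly in terms of geometry. The expression below is a sketch under the usual point-source, free-field assumption; the symbols $p_m$ (position of microphone $m$), $r^{l}_{\mathrm{grid}}$ (center of grid $l$), and $c$ (speed of sound) are introduced here for illustration and are not defined at this point in the original text.

$$\Delta\tau_{mn}(l) = \frac{\lVert r^{l}_{\mathrm{grid}} - p_m \rVert_2 - \lVert r^{l}_{\mathrm{grid}} - p_n \rVert_2}{c}$$

With this sign convention, a source located at grid $l$ reaches microphone $m$ later than microphone $n$ by $\Delta\tau_{mn}(l)$ seconds, so the cross-correlation between the two channels peaks at that lag.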
Let $x_m(t)$ and $x_n(t)$ represent the signals received by microphones $m$ and $n$. Similarly, let $X_m(\omega)$ and $X_n(\omega)$ denote the Fourier transforms of $x_m(t)$ and $x_n(t)$, and let $(\cdot)^{*}$ indicate the complex conjugate. With these definitions in mind, the phase-transform-weighted generalized cross-correlation function can be represented as

$$R_{mn}(\tau) = \int_{-\infty}^{+\infty} \frac{X_m(\omega) X_n^{*}(\omega)}{\left| X_m(\omega) X_n^{*}(\omega) \right|} e^{j\omega\tau} \, d\omega. \tag{2}$$

By substituting (2) into (1), the controlled response power spectrum of the SRP-PHAT algorithm can be obtained as

$$P(l) = \sum_{m=1}^{M} \sum_{n=m+1}^{M} \int_{-\infty}^{+\infty} \frac{X_m(\omega) X_n^{*}(\omega)}{\left| X_m(\omega) X_n^{*}(\omega) \right|} e^{j\omega\Delta\tau_{mn}(l)} \, d\omega. \tag{3}$$

In the resulting response power spectrum, each grid point's value corresponds to the microphone array's response power at that location. However, the response power spectrum often contains numerous false peaks due to reverberation and noise. When the number of sound sources is unknown, directly performing peak searching on the response power spectrum to locate sources might result in misjudgments of the source count due to the influence of reverberation and noise, leading to inaccurate localization results. In addition, the presence of noise widens the main lobe of each source, which may result in aliasing when the source positions are close together, making it difficult to distinguish the positions of two neighboring sources.
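The scan described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gcc_phat_spectrum` and `srp_phat`, the speed of sound of 343 m/s, zero-padding to twice the frame length, and summing over microphone pairs with m < n are all choices made here for the example.

```python
import numpy as np

def gcc_phat_spectrum(xm, xn, nfft):
    """PHAT-weighted cross-power spectrum of two real signals (phase only)."""
    Xm = np.fft.rfft(xm, nfft)
    Xn = np.fft.rfft(xn, nfft)
    cross = Xm * np.conj(Xn)
    return cross / (np.abs(cross) + 1e-12)  # small constant avoids division by zero

def srp_phat(signals, mic_pos, grid_pos, fs, c=343.0):
    """Steered response power for every candidate grid point.

    signals: (M, T) array, one row per microphone
    mic_pos: (M, 3) microphone coordinates in metres
    grid_pos: (L, 3) candidate source coordinates in metres
    Returns an (L,) array of response power values.
    """
    M, T = signals.shape
    nfft = 2 * T  # assumed zero-padding so the correlation is linear, not circular
    omega = 2 * np.pi * np.fft.rfftfreq(nfft, d=1.0 / fs)
    # Distance from every grid point to every microphone, shape (L, M)
    dist = np.linalg.norm(grid_pos[:, None, :] - mic_pos[None, :, :], axis=2)
    power = np.zeros(len(grid_pos))
    for m in range(M):
        for n in range(m + 1, M):
            psi = gcc_phat_spectrum(signals[m], signals[n], nfft)
            tau = (dist[:, m] - dist[:, n]) / c  # steering delay of each grid point
            # Discrete version of the integral over frequency, evaluated at tau
            power += np.real(np.exp(1j * np.outer(tau, omega)) @ psi)
    return power
```

Reshaping the returned vector to the grid dimensions gives the response power spectrum that serves as the U-net input feature. The one-sided frequency sum is used because the signals are real, so the negative frequencies only contribute the complex conjugate.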

Construction of U-Net Network Architecture.
The structure of the U-net network constructed for sound source localization is shown in Figure 3. The encoder part of the U-net comprises four power spectrum feature extraction units, which progressively extract features from the response power spectrum while reducing its spatial dimension. The decoder part of the network consists of four source distribution map restoration units, which systematically reconstruct the source distribution map. Excessive skip connections may cause overfitting during training and make the network overly sensitive to false peaks and noise in the response power spectrum. To prevent this, this paper does not connect all the corresponding layers of the encoder network and the decoder network as in the original U-net network; it only connects feature restoration units 1 and 2 of the decoder network with the corresponding layers of the encoder network.

Design of Response Power Spectrum Feature Extraction Unit.
In the encoder part of the U-net network, the input SRP-PHAT response power spectrum is initially subjected to a series of convolutional layers for feature extraction and pooling layers for downsampling. Hierarchical computations yield feature maps composed of multiple channels. The structure of the response power spectrum feature extraction unit is shown in Figure 4. It commences with two dilated convolutional layers for feature extraction, each followed by a batch normalization layer and a ReLU activation function layer. Finally, a max-pooling layer is employed for downsampling.
Each feature extraction unit consists of two dilated convolutional layers for feature extraction, followed by a max-pooling layer for downsampling. Each feature restoration unit includes a transpose convolutional layer for upsampling and two regular convolutional layers for feature fusion. Skip connections are incorporated in feature restoration units 1 and 2. These connections involve channelwise concatenation with the corresponding layers in the encoder, enhancing the ability to restore fine-grained details of the source distribution maps.
To effectively capture spatial structural information within the response power spectrum, the convolutional process in the encoder network of the U-net employs dilated convolution kernels of size 3 × 3 with a dilation rate K = 2. This choice allows the convolution operation to extend beyond neighboring elements and cover larger distances. Figure 5 illustrates convolution operations with different dilation rates. The dark grid represents the distribution of the dilated convolution kernel. Convolution with a dilation rate of K = 1 is equivalent to standard convolution. A dilation rate K > 1 inserts gaps ("holes") between the kernel elements, allowing the convolution kernel to cover a broader receptive field without additional parameters.
As shown in Figure 3, the max-pooling layer is used to reduce the resolution of the feature map. With a pooling size of 2 × 2, the resolution of the feature map is halved by each feature extraction unit. When the input response power spectrum is initially sized 100 × 100, after passing through four feature extraction units, its size decreases to 6 × 6, while the number of channels increases to 512.
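The size bookkeeping above can be checked with a short calculation. The sketch below assumes that the convolutions are padded so that only pooling changes the spatial size and that pooling rounds down on odd sizes; the helper names `pooled_sizes` and `dilated_extent` are hypothetical.

```python
def pooled_sizes(size, num_units, pool=2):
    """Spatial size after each feature extraction unit (floor-mode pooling assumed)."""
    sizes = [size]
    for _ in range(num_units):
        size = size // pool  # an odd size drops its last row and column
        sizes.append(size)
    return sizes

def dilated_extent(kernel, dilation):
    """Side length of the input window spanned by one dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)

print(pooled_sizes(100, 4))  # [100, 50, 25, 12, 6]
print(dilated_extent(3, 2))  # 5
```

The sequence 100, 50, 25, 12, 6 reproduces the 6 × 6 size quoted for the encoder output, and a 3 × 3 kernel with dilation rate 2 spans a 5 × 5 window while still using nine weights.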

Design of Source Distribution Map Feature Restoration Unit.
In the decoder part of the U-net network, the feature maps obtained after feature extraction and dimensionality reduction through the encoder network are gradually restored into source distribution maps via a series of upsampling layers and convolutional layers. In addition, skip connection operations are incorporated into the decoder network's first and second feature restoration units, as shown in Figure 6. The process starts with a transpose convolutional layer for upsampling. After upsampling, the feature map is connected to the corresponding layer in the encoder through skip connections, and then two consecutive convolutional layers are used for feature fusion. Skip connections not only encourage the decoder network to reuse the high-level contextual information from the input response power spectrum and better restore the details in the source distribution maps but also alleviate the vanishing gradient problem commonly encountered in deep neural networks, which makes it easier for gradients to propagate through the network. Therefore, skip connections are used in the decoder part to concatenate all channels of the shallow encoder features with all channels of the corresponding decoder layer, improving the ability of the U-net network to restore the sound source distribution maps.

Source Localization Based on Source Distribution Map Interpolation Search.
The sound source distribution map reflects the locations and sound powers of multiple sound sources. In the sound source distribution map, the sound power of each grid is inversely proportional to the distance from the grid center point to the sound source. In the presence of multiple sources on the plane, the power level in each grid results from the summation of the values contributed by each source in that grid. During U-net network training, it is necessary to construct the corresponding source distribution maps as labels for the training samples based on the source positions and power information. The construction process of the source distribution map is shown in Figure 7. Firstly, according to the size of the response power spectrum, a distribution map containing L grids is built. The sound power of each grid is then calculated in turn from the positions and sound powers of the sound sources, which yields the sound source distribution map.
Let $r^{l}_{\mathrm{grid}} = [x^{l}_{\mathrm{grid}}, y^{l}_{\mathrm{grid}}]$ represent the position of the center point of the $l$th grid, let $r^{m}_{s} = [x^{m}_{s}, y^{m}_{s}]$ and $q^{m}_{s}$ represent the position and power of the $m$th source among $M$ sources, let $R(l, m)$ represent the distance between the $l$th grid center point and the $m$th source, and let $\zeta(\cdot)$ be a function that calculates the power attenuation coefficient. The power $B(r^{l}_{\mathrm{grid}})$ of each grid in the source distribution map can then be expressed as

$$B(r^{l}_{\mathrm{grid}}) = \sum_{m=1}^{M} q^{m}_{s} \, \zeta\big(R(l, m)\big), \tag{4}$$

where $R(l, m) = \lVert r^{l}_{\mathrm{grid}} - r^{m}_{s} \rVert_2$ and the attenuation coefficient $\zeta(\cdot)$ depends on a parameter $N$. By changing the value of $N$, the rate at which the sound power attenuates with increasing $R$ can be changed, and this rate affects the width of the main lobe in the source distribution map. Retaining a moderate main lobe width improves the distribution map's ability to describe sound sources that are not at the center point of a grid and avoids extreme sparsity of the sound source distribution map.
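The label construction can be sketched as follows. The attenuation function used here, `zeta(R) = 1 / (1 + R)**N`, is a hypothetical stand-in chosen only because it equals 1 at the source and decays faster for larger N; it is not the expression used in the paper. The function names and the default value of N are likewise assumptions.

```python
import numpy as np

def zeta(R, N=20.0):
    """Hypothetical attenuation coefficient: 1 at R = 0, steeper decay for larger N."""
    return 1.0 / (1.0 + R) ** N

def source_distribution_map(src_pos, src_power, x_centers, y_centers, N=20.0):
    """Sum the attenuated power of every source at every grid center, then normalize."""
    gx, gy = np.meshgrid(x_centers, y_centers, indexing="ij")
    B = np.zeros_like(gx, dtype=float)
    for (sx, sy), q in zip(src_pos, src_power):
        R = np.hypot(gx - sx, gy - sy)  # distance from each grid center to this source
        B += q * zeta(R, N)
    return B / B.max()  # normalized map, as used for the training labels
```

For a 100 × 100 grid with 0.02 m resolution, `x_centers` and `y_centers` would each hold 100 cell-center coordinates. A larger N narrows the main lobe around each source, which matches the trade-off described above between sharp peaks and extreme sparsity.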
The positions and sound powers of the sound sources in all samples are converted into sound source distribution maps by the above method, which form the label set for training the U-net network. The following example uses the response power spectrum and source distribution map of a sample with two sources to illustrate the principle and advantages of using source distribution maps as labels. Figure 8 shows the spatial representation and plane mapping of the input features and labels for a sample with two sources, where the "+" marks on the input features represent the actual locations of the two sources.
It can be seen from Figure 8(a) that the spectral peaks of the two sound sources in the response power spectrum are aliased with each other. Some false peaks appear near the grids where the sound sources are located. The amplitude of these false peaks is close to that of the spectral peaks at the true sound source positions, so if a local maximum search is performed directly on the response power spectrum, the real sound sources cannot be located. From Figure 8(b), it can be seen that the amplitude of the power spectrum spreads outward from the center points of the two sound sources, and the center of this diffusion corresponds to the true position of each sound source. The constructed sound source distribution map is based on this feature, accurately delineating the position of each sound source by analyzing the overall spatial distribution trend of power. It employs (4) to simulate the diffusion of response power in the response power spectrum. From Figures 8(c) and 8(d), it can be seen that the sound source distribution map only retains the true main lobe of each sound source. This approach prevents false peaks from causing misjudgments of the number of sound sources and errors in the localization results during subsequent local maximum searches.
Let $\hat{B}_T$ represent the estimated sound source distribution map obtained after the response power spectrum $P$ is input to the U-net network. An interpolation search on the estimated sound source distribution map yields the locations of the sound sources. The implementation process is as follows: first, record the global maximum value of the estimated sound source distribution map as $B_{\max} = \max_{l} \hat{B}(r^{l}_{\mathrm{grid}})$, and then record all other local maxima, subject to a threshold condition in which $B_{\min}$ is a predefined static threshold. The number of local maxima that meet this condition is the estimated number of sound sources. The center point coordinates of the grids in which these local maxima are located are used as rough estimates of the sound source locations. After obtaining a rough estimate of a sound source location, an interpolation search is conducted around that grid to determine the precise position of the sound source.
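The rough estimation step can be sketched with plain NumPy. The exact threshold condition is not recoverable from the text, so this example assumes that a peak is kept when its value on the normalized map is at least `b_min`, and it compares each grid with its eight neighbors; the function name `find_sources` is hypothetical.

```python
import numpy as np

def find_sources(B_hat, x_centers, y_centers, b_min=0.6):
    """Return grid-center coordinates of strict local maxima that reach b_min."""
    B_hat = np.asarray(B_hat, dtype=float)
    # Pad with -inf so border cells can be compared with eight neighbors too
    padded = np.pad(B_hat, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = center >= b_min
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            neighbor = padded[1 + di : padded.shape[0] - 1 + di,
                              1 + dj : padded.shape[1] - 1 + dj]
            is_peak &= center > neighbor
    rows, cols = np.nonzero(is_peak)
    return [(float(x_centers[i]), float(y_centers[j])) for i, j in zip(rows, cols)]
```

The length of the returned list is the estimated number of sources, and each coordinate pair is the starting point for the finer interpolation search. Because the comparison is strict, a flat plateau of equal values would not be reported as a peak, which is a limitation of this simplified version.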
The specific process of the interpolation search is shown in Figure 9. A subregion is selected for each local maximum, with the local maximum as the center. This subregion consists of $h$ grids. A partition ratio $k$ is chosen to divide each grid of the subregion into $k^2$ subgrids proportionally, so the entire subregion is divided into $I = h \cdot k^2$ grids. Each subdivided grid is then assumed in turn to be the true position of the sound source, and the sound power values $B(r^{i}_{\mathrm{grid}})$ of all subdivided grids are calculated for a source in that grid according to (4). By calculating the error between $B(r^{i}_{\mathrm{grid}})$ and the estimated distribution map's grid power value $\hat{B}(r^{i}_{\mathrm{grid}})$, an error distribution map is obtained, and the

U-Net Network Training and Simulation Experiment Analysis
To train the U-net network, this paper uses the mirror-source (image-source) model of the Pyroomacoustics library [20] to simulate sound propagation in a room and generate the signals received by the microphone array. Using the CHZ02-S-112LA2 relay as the test object, a simulation dataset was constructed to train the U-net network and simulate the localization of fault sound sources.

Simulation Dataset Construction and Model Training.
Taking multirelay mechanical fault detection as an example, the basic principle is to install multiple relays to be inspected on the vibrating plate at a specific interval and to drive the relays on the vibrating plate through the electromagnetic exciter. A faulty relay vibrates under force, and its loosened armature moves irregularly. As the armature moves, it collides with the relay shell, producing passive sound waves that propagate through the relay shell. These sound waves are then captured by the microphone array positioned directly above the relays and converted into multichannel audio signals. By analyzing the audio signals, multiple fault sound sources are located, and the relay at each fault sound source location is judged to be a faulty relay.
To construct the simulation dataset, relay fault acoustic signals free of noise and reverberation are first collected. The diversity of the dataset is then ensured by setting different numbers of sound sources, different reverberation times, and different signal-to-noise ratios. The maximum number of sound sources in a training sample is set to 4.
The fault acoustic signals are collected in an anechoic chamber; the recording environment is shown in Figure 10. The relay fixture is fixed on the exciter, the inner wall of the anechoic chamber has a layer of sound-absorbing cotton to suppress reverberation, and the outside has a layer of soundproof cotton to isolate noise. All 500 real relays vibrate continuously at 25 Hz during the recording, and the audio sampling frequency is 48 kHz. The audio data are processed frame by frame, that is, divided into consecutive time segments. A fast Fourier transform (FFT) is applied to each segment to convert it from the time domain to the frequency domain. The spectra of all segments are then concatenated along the time axis to obtain a two-dimensional matrix, which is the spectrogram of the fault sound signal (Figure 12). Different colors represent the energy level of the signal segments at different frequencies. Darker areas in the spectrogram indicate higher energy levels at those frequency points.
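The framing, FFT, and concatenation steps can be sketched as below. The frame length of 2048 samples, the 50% hop, the Hann window, and the decibel scaling are assumptions made for this example, since the text does not state the spectrogram parameters; the function name `spectrogram` is hypothetical.

```python
import numpy as np

def spectrogram(x, frame_len=2048, hop=1024):
    """Magnitude spectrogram in dB, shape (frequency bins, frames)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)  # assumed window to limit spectral leakage
    n_frames = 1 + (len(x) - frame_len) // hop
    # Split the signal into overlapping frames and window each one
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)  # one FFT per frame
    # Transpose so that time runs along the horizontal axis
    return 20 * np.log10(np.abs(spectrum).T + 1e-12)
```

Row k of the result corresponds to the frequency k · fs / frame_len, so with a 48 kHz sampling rate and 2048-sample frames the frequency resolution is about 23.4 Hz.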
The audio data samples are generated in the room depicted in Figure 13. The simulated room size is set to 3 m × 4 m × 3 m, an 8-element circular array with a radius of 0.2 m is placed on a plane at a height of 1 m above the ground, the coordinates of the center point of the array are (1.5 m, 2 m, 1 m), and the microphone sampling rate is set to 48 kHz.
First, a set of 2000 random coordinates is generated on the sound source plane {(x, y, z) | 1.5 < x < 2.5, 2 < y < 4, z = 0.5} (in m). Each time, 1 to 4 recordings are selected from the 500 real relay fault audio recordings to serve as the sound source signals. Then, the same number of coordinates is randomly selected from the coordinate set and the sound sources are placed there; the signal-to-noise ratio is randomly set between 0 dB and 25 dB, and the reverberation time is randomly set between 0.2 s and 0.7 s. The multichannel audio data collected by the microphone array are used as samples. For each number of sound sources from 1 to 4, 2000 samples are generated, so a total of 8000 samples form the training set. Before model training, the sample data are preprocessed: an audio frame of 2048 samples is extracted from the multichannel audio data, and the response power spectrum is calculated by SRP-PHAT with the grid resolution set to 0.02 m and the scanning range set to the sound source plane, which gives a response power spectrum with dimensions of 100 × 100. To maintain the consistency of the sample data, all response power spectra are normalized. According to the positions and sound powers of all sound sources, an actual sound source distribution map $B_T$ with dimensions of 100 × 100 is constructed, and the normalized sound source distribution map is used as the training label for the sample.
In U-net network training, predicting the sound source distribution map can be regarded as a classification task for each pixel value, and the pixel mean squared error is used to represent the error between the actual source distribution map and the reconstructed sound source distribution map.
Let $B_T(r^{l}_{\mathrm{grid}})$ and $\hat{B}_T(r^{l}_{\mathrm{grid}})$ represent the pixel values of the $l$th grid in the true source distribution map $B_T$ and the predicted sound source distribution map $\hat{B}_T$, and let $N$ represent the total number of pixels in the sound source distribution image. The loss function can then be defined as

$$\mathrm{Loss} = \frac{1}{N} \sum_{l=1}^{N} \left( B_T(r^{l}_{\mathrm{grid}}) - \hat{B}_T(r^{l}_{\mathrm{grid}}) \right)^2. \tag{5}$$
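As a small worked example of the pixel mean squared error, the sketch below evaluates it for one map pair in NumPy. The function name `pixel_mse` is hypothetical, and a training framework would normally supply an equivalent built-in loss.

```python
import numpy as np

def pixel_mse(B_true, B_pred):
    """Mean of the squared pixel differences between label and prediction."""
    B_true = np.asarray(B_true, dtype=float)
    B_pred = np.asarray(B_pred, dtype=float)
    return float(np.mean((B_true - B_pred) ** 2))

# One of four pixels is wrong by 1, so the loss is 1 / 4
print(pixel_mse([[0, 1], [1, 0]], [[0, 1], [1, 1]]))  # 0.25
```

Because the labels are normalized to a maximum of 1, the loss is bounded by 1 for predictions in the same range, which keeps its scale comparable across samples with different numbers of sources.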

Positioning Accuracy Experiment under Different Reverberation Times.
To visually observe the effect of the U-net network in eliminating false peaks in the response power spectrum, localization simulation experiments were carried out under the following conditions: SNR = 10 dB, RT60 = {0.2 s, 0.5 s, 0.7 s}.

Figures 14(a)-14(c) and Figures 14(d)-14(f) compare the U-net input feature (the response power spectrum) with the U-net network's output (the sound source distribution map). The "+" in the response power spectrum indicates the actual positions of the sound sources. The comparison between the U-net input and output reveals that reverberation leads to varying degrees of distortion in the response power spectrum. With a reverberation time of 0.7 s, the false peaks are more pronounced, yet the trained U-net network still distinguishes the false peaks from the spectral peaks of the real sound sources and finds all the sound sources. According to the localization results in Table 1, the localization error in all cases is less than 0.02 m, which shows that the sound source localization algorithm has high positioning accuracy and localization stability.

Analysis of Spatial Resolution under Different Signal-to-Noise Ratios (SNRs).
Spatial resolution refers to the minimum separation between two adjacent sound sources that a sound source localization method can distinguish. Two sound source signals are placed in the sound source plane, and the algorithm proposed in this paper is used for localization. Then, the sound source spacing is continuously reduced with a step size of 0.01 m, and the minimum spacing at which the localization algorithm can still identify the two sound source positions is taken as the experimental result. Table 2 presents the spatial resolution of the algorithm in an environment with SNRs of {0 dB, 5 dB, 10 dB, 15 dB, 20 dB, 25 dB} and an RT60 of 0.7 s. It can be seen from the results that the algorithm's spatial resolution is affected by noise. As the signal-to-noise ratio increases, the algorithm's spatial resolution also improves. At an SNR of 25 dB, the proposed method can distinguish two sound sources spaced 0.08 m apart. When the SNR decreases to 0 dB, the method can distinguish sound sources spaced 0.2 m apart.

Multisource Localization Simulation Experiment.
To test the effectiveness of the proposed method for different numbers of sound sources, a simulation experiment for multisource localization is conducted. The localization accuracy and root mean square error (RMSE) are used as evaluation metrics.
The simulation parameters are as follows: SNR = 10 dB, RT60 = 0.5 s, the grid resolution is set to 0.02 m, and the distance error threshold v is set to 0.02 m. When calculating accuracy, the sound power threshold used in the interpolation search is set to 0.6. Five sets of tests were carried out under each environmental configuration. Each set contains 120 samples, and the average of the results from these five sets was used as the experimental result. Based on the previously obtained spatial resolution of the algorithm, when the number of sound sources is greater than 1, the distance between any two sound sources is constrained to be greater than 0.2 m.
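The two metrics can be sketched as follows. The text does not give their exact definitions, so this example assumes that a sample counts as correctly localized when the estimated number of sources equals the true number and every matched estimate lies within the threshold v of its true source, and that the RMSE is taken over all matched distances. The function names and the brute-force matching are choices made for this illustration.

```python
import itertools
import math

def match_error(true_pts, est_pts):
    """Per-source distances for the one-to-one assignment with the smallest total."""
    best = None
    for perm in itertools.permutations(est_pts):
        d = [math.dist(t, e) for t, e in zip(true_pts, perm)]
        if best is None or sum(d) < sum(best):
            best = d
    return best

def evaluate(samples, v=0.02):
    """samples: list of (true positions, estimated positions) pairs, in metres."""
    correct, sq_errors = 0, []
    for true_pts, est_pts in samples:
        if len(true_pts) != len(est_pts):
            continue  # wrong source count: treated as a failed localization
        d = match_error(true_pts, est_pts)
        sq_errors.extend(x * x for x in d)
        if max(d) <= v:
            correct += 1
    accuracy = correct / len(samples)
    rmse = math.sqrt(sum(sq_errors) / len(sq_errors)) if sq_errors else float("nan")
    return accuracy, rmse
```

Trying every permutation is affordable here because a sample contains at most five sources, that is, at most 120 assignments; a larger problem would call for an assignment solver instead.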
Table 3 shows the localization accuracy and root mean square error (RMSE) of the proposed algorithm for different numbers of sound sources ranging from 1 to 5. When the number of sound sources is less than or equal to the maximum number of sound sources in the training samples (4), the localization accuracy is greater than 98%, and the RMSE is less than 0.014 m. When the number of sound sources increases to 5, the positioning performance decreases slightly because the U-net training set does not contain samples with 5 sound sources, but the method still maintains high positioning accuracy and a small positioning error, which shows that the proposed method can accurately and stably locate multiple sound sources.

Experiment on Localization of Multiple Faulty Relay Sound Sources

Experimental Environment and Parameter Settings.
The experiment was carried out in a 4 m × 6 m × 4 m room, and the overall experimental environment is shown in Figure 15. The exciter was placed on the ground, and a 0.22 m × 0.22 m vibrating plate was attached to the top pole of the exciter to fix the relays. An 8-element microphone array in a circular arrangement with a radius of r = 0.2 m was placed 0.5 m directly above the vibrating plate and parallel to it. The plane of the vibrating plate was the sound source plane scanned by the beam during the sound source localization process, and the grid on this plane was established with its origin at the vertical projection of the array center onto the plane. Four relays with mechanical faults were installed at four positions on the vibrating plate. The exciter was activated to cause the relays to vibrate and emit faulty relay sounds, which were then captured as audio signals for the experiments. The relay vibration sound device comprises an SA-SG signal generator, an SA-PA power amplifier, and an SA-JZ electromagnetic exciter. The signal generator sends a sine wave electrical signal, which is amplified by the power amplifier and input to the electromagnetic exciter; the exciter then drives the fixed relays to vibrate. The microphone array uses MSM261S4030H0R omnidirectional digital microphones. The audio capture card operates at frequencies of up to 160 MHz and supports sample rates of 8 kHz, 16 kHz, 22.05 kHz, 24 kHz, 32 kHz, 44.1 kHz, and 48 kHz. Additionally, it simultaneously acquires 8 microphone signals by combining the left and right channels.
According to the spatial resolution results of Subsection 3.2.2, when the signal-to-noise ratio is reduced to 0 dB, sound sources that are 0.2 m apart can still be resolved. The interval between adjacent relays was therefore set to 0.2 m; the installation locations are shown in Figure 16.

Analysis of Experimental Results.
Fault relays were installed at all four locations shown in Figure 16, and the trained U-net network model was loaded to perform 50 sound source localizations. The average of the 50 localization results was used as the experimental result. During sound source localization, the grid resolution of the beam scanning plane is set to 0.01 m, and the scanning range is {x, y | −0.5 < x < 0.5, −0.5 < y < 0.5} (in m). The audio

Based on the results in Table 4, the average localization error is less than 0.02 m, which is significantly smaller than the 0.2 m spacing between the relays. In practical relay fault detection scenarios, if four relays simultaneously have mechanical failures and emit abnormal sounds, the algorithm presented in this paper would correctly identify all malfunctioning relays. The experimental results show that this algorithm can simultaneously localize the sound sources of four malfunctioning relays in a real-world environment.
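The scanning step described above can be made concrete with a small sketch of the grid and of the per-point steering delays that a steered-response-power search needs. This is not the authors' implementation: the even microphone spacing, the 343 m/s speed of sound, the exclusion of the boundary points (following the strict inequalities in the scanning range), and all function names are assumptions for illustration. Only the geometric part is shown; the PHAT-weighted cross-correlations and the U-net stage are omitted.

```python
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s; ASSUMED, not stated in the text


def circular_array(num_mics=8, radius=0.2, height=0.5):
    """ASSUMED layout: mics evenly spaced on a circle, `height` m above the plane."""
    return [
        (radius * math.cos(2 * math.pi * m / num_mics),
         radius * math.sin(2 * math.pi * m / num_mics),
         height)
        for m in range(num_mics)
    ]


def scan_axis(lo=-0.5, hi=0.5, step=0.01):
    """Grid coordinates strictly inside (lo, hi), spaced `step` metres apart."""
    n = round((hi - lo) / step)
    # Rounding removes floating-point drift such as -0.48999999999.
    return [round(lo + k * step, 10) for k in range(1, n)]


def pair_tdoas(point, mics, c=SPEED_OF_SOUND):
    """Expected time difference of arrival (s) for each microphone pair (i, j)."""
    dists = [math.dist(point, mic) for mic in mics]
    return {
        (i, j): (dists[i] - dists[j]) / c
        for i, j in combinations(range(len(mics)), 2)
    }


mics = circular_array()
axis = scan_axis()
# Candidate source positions on the source plane (z = 0).
grid = [(x, y, 0.0) for y in axis for x in axis]
delays_at_first_point = pair_tdoas(grid[0], mics)
```

In a steered-response-power search, each grid point's response would be obtained by summing the pairwise cross-correlations evaluated at these delays. Because the delays depend only on the geometry, they could be precomputed once and reused across all 50 localization runs.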

Conclusion
In this paper, a method for estimating the locations of mechanical fault sound sources in a multisource strong reverberation environment is studied. The SRP-PHAT algorithm is used to calculate the response power spectrum, and a U-net network transforms the response power spectrum, which contains spurious peaks, into a "clean" estimated sound source distribution map. The accurate locations of the fault sound sources are then obtained through interpolation search. Specifically, the SRP-PHAT algorithm performs cross-correlation and phase transform weighting on the multichannel audio signals to create directional beams. These beams scan the entire sound source plane to calculate its response power spectrum. Then, the U-net encoder and decoder networks for sound source distribution prediction are constructed, the power spectrum feature extraction unit and the sound source distribution map feature restoration unit are designed, and the interpolation search method based on the sound source distribution map is studied to estimate the precise location of each fault sound source.
The experimental dataset used to train the U-net network was constructed from mechanical fault data of relays in electromechanical equipment. The experimental results show that as the reverberation time increases from 0.2 s to 0.7 s, the U-net network can still effectively eliminate the pseudopeak interference in the response power spectrum; when the signal-to-noise ratio is reduced from 25 dB to 0 dB, the spatial resolution coarsens from 0.08 m to 0.2 m; and when multiple faulty relay sound sources are localized, the positions of four fault sound sources can be located simultaneously with an average positioning error of less than 0.02 m. Compared with traditional threshold processing and smoothing methods, the proposed method can eliminate false peaks and accurately locate multiple fault sound sources without affecting the real sound source signal, which provides a new approach to sound source localization in multisource strong reverberation scenes.

Shock and Vibration 13
the microphone array is diverse, and there are more fault sound sources, so our future research will focus on exploring different microphone arrays, larger numbers of sound sources, and different signal types. Additionally, we plan to conduct sound source localization experiments in multisource strong reverberation scenarios with other algorithms, such as MUSIC, CNN-DOA, and DAMAS, for comparison, in order to enhance localization accuracy and increase adaptability.

1 :
Multifault sound source localization using U-net and SRP-PHAT.

Figure 6: The structure of the feature restoration unit.

Figure 7: The construction process of the sound source distribution map.

Figure 11(a) shows the time-domain waveform of the relay fault sound signal. The fault sound signal exhibits distinct segmented characteristics: it is not emitted continuously but has certain time intervals between sound emissions. The spectrum obtained by performing a fast Fourier transform (FFT) on the fault signal is shown in Figure 11(b). The fault sound signal has a continuous frequency spectrum, and based on the relationship between center frequency and signal bandwidth, it should be regarded as a broadband signal. To obtain the spectrogram, the audio data are processed frame by frame, dividing the audio into consecutive time segments. An FFT is applied to each segment to convert it from the time domain to the frequency domain. The spectra of all segments are then concatenated along the time axis to obtain a two-dimensional matrix, which is the spectrogram of the fault sound signal (Figure 12). Different colors represent the energy of the signal segments at different frequencies; darker areas in the spectrogram indicate higher energy at those frequency points.

The audio data samples are generated in the simulated room depicted in Figure 13. The simulated room size is set to 3 m × 4 m × 3 m, the 8-element circular array with a radius of 0.2 m is placed on a plane at a height of 1 m above the ground, the coordinates of the center point of the array are (1.5 m, 2 m, 1 m), and the microphone sampling rate is set to 48 kHz. First, a set of 2000 random coordinates is generated on the sound source plane {x, y, z | 1.5 < x < 2.5, 2 < y < 4, z = 0.5} (unit m). Each time, 1∼4 recordings are selected from 500 real relay fault audio recordings to serve as the sound source signals. Then, the same number of coordinates is randomly selected from the coordinate set and the sound sources are placed there; the signal-to-noise ratio is randomly set between 0 dB and 25 dB, and the reverberation time is randomly set between 0.2 s and 0.7 s. The multichannel audio data collected by the microphone array are used as samples. For each number of sound sources from 1 to 4, 2000 samples are generated, giving a total of 8000 samples that form the training set.
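The sampling scheme for the training set described above can be sketched as follows. This covers only the random draw of the scene parameters. The room acoustics simulation that turns each configuration into multichannel audio is not shown, and the function name, dictionary keys, fixed seed, and the choice to sample coordinates and recordings without replacement within one scene are assumptions for illustration.

```python
import random


def sample_configs(per_count=2000, max_sources=4, n_coords=2000,
                   n_recordings=500, seed=0):
    """Draw scene configurations: `per_count` scenes for each source count 1..max_sources."""
    rng = random.Random(seed)
    # Candidate source coordinates on the source plane z = 0.5 m.
    coords = [(rng.uniform(1.5, 2.5), rng.uniform(2.0, 4.0), 0.5)
              for _ in range(n_coords)]
    configs = []
    for k in range(1, max_sources + 1):
        for _ in range(per_count):
            configs.append({
                # ASSUMED: distinct positions and recordings within one scene.
                "positions": rng.sample(coords, k),
                "recording_ids": rng.sample(range(n_recordings), k),
                "snr_db": rng.uniform(0.0, 25.0),
                "rt60_s": rng.uniform(0.2, 0.7),
            })
    return configs


configs = sample_configs()
```

Each configuration would then be passed to a room simulator together with the array position (1.5 m, 2 m, 1 m) to synthesize the 8-channel sample.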

Figure 10: Collection of relay fault sound signals.

Figure 11: The waveform and spectrum of the fault sound signal: (a) time-domain waveform; (b) spectrum.

sampling frequency is set to 48 kHz, 2048 sampling points are taken each time for positioning, and the shaker vibrates at a frequency of 25 Hz. Table 4 presents the localization results and localization errors. Figure 17(a) shows the response power spectrum of the positioning test with four fault sound sources, and Figure 17(b) shows the U-net output sound source distribution map, in which the points marked "*" are the sound source locations estimated by the algorithm in this paper. It can be seen that the proposed multifault sound source localization method identifies the locations of all faulty relays.
Due to experimental limitations, this study only used an 8-element circular microphone array to collect audio signals and conducted localization experiments with only 4 fault sources. However, in the actual mechanical fault detection,

Figure 17: Response power spectrum and U-net outputs for four fault sound sources: (a) response power spectrum; (b) U-net output.

Table 2: Spatial resolution of the localization algorithm under different SNRs.