Application of Generative Adversarial Nets (GANs) in Active Sound Production System of Electric Automobiles

To improve the diversity and quality of sound mimicry of electric automobile engines, a generative adversarial network (GAN) model was used to construct an active sound production model for electric automobiles. The structure of each layer in the network in this model and the size of its convolution kernel were designed. The gradient descent in network training was optimized using the adaptive moment estimation (Adam) algorithm. To demonstrate the quality difference of the generated samples from different input signals, two GAN models with different inputs were constructed. The experimental results indicate that the model can accurately learn the characteristic distributions of raw audio signals. Results from a human ear auditory test show that the generated audio samples mimicked the real samples well, and a leave-one-out (LOO) test shows that the diversity of the samples generated from the raw audio signals was higher than that of samples generated from a two-dimensional spectrogram.


Introduction
Electric automobiles can create traffic safety risks because they emit little sound during low-speed driving [1]. Standards for noise produced by electric automobiles during low-speed driving are being drafted in many countries [2]. Active sound production systems have been proposed to mimic the sound of an internal combustion engine. In one study [3], a multi-parameter-controlled mimicking algorithm based on digital audio signals was proposed for producing the sound effects of an engine based on multiple parameters including engine speed, driving speed, and acceleration. In another study [4], the superposition theory of speech synthesis technology was used to design a self-adaptive active sound production system based on engine speed. An engine sound-mimicking system based on a sine wave that can faithfully mimic the sound of a specific engine was proposed in [5]. The models in prior studies are all based on spectrum analysis technology of sound signals from specific internal combustion engines, and a vector signal processing algorithm is used for adaptive sound mimicking, but there are three common problems: (1) The vectorization algorithm of sound signals divides the raw sound into audio frames and solves the mapping relationship between corresponding audio frames. The sound generated by these methods does not sound like real engine noise to human ears, leading to the problem of discontinuity and poor authenticity. (2) Because the overall dynamic characteristics of electric automobiles are not considered, the sound mimicking is poor when the engine parameters fluctuate, resulting in poor authenticity. (3) Consumers with different driving experience and of different ages, genders, and occupational backgrounds have diversified demands for interior sounds of electric vehicles. The mimicked sound is based on the characteristics of sound signals of specific internal combustion engines, and the diverse demands for mimicking different kinds of internal combustion engines cannot be satisfied.
Generative adversarial networks (GANs) have been used in image generation, semantic segmentation, and speech generation. Goodfellow and Pouget-Abadie [6] proposed an image generation algorithm based on GAN, and training datasets such as the Mixed National Institute of Standards and Technology (MNIST) database were used to generate images that could be recognized by humans. Jin et al. [7] used a GAN to remove rain stripes from images. Donahue et al. [8] proposed the WaveGAN approach to generate audio signals, where a GAN was used for unsupervised synthesis of raw audio waveforms to generate drumbeats and bird sounds. The use of GANs for active sound production in electric automobiles has not been studied.
In this study, two GAN models were trained with raw sound signals and processed frequency domain signals from specific internal combustion engines in different conditions. An experiment showed that the reproduced sound of an internal combustion engine has a high similarity to the real sound.

GAN.
As a deep learning model, a GAN consists of a generator and a discriminator that are trained against each other [9]. The generator is responsible for generating samples and sending them, together with the real samples, to the discriminator for training, aiming to maximize the probability that its generated samples are judged to be real. Through training against real and generated samples, the discriminator identifies real samples and rejects generated samples as much as possible. This principle is shown in Figure 1.

The most direct model of a GAN is the multilayer perceptron, which learns a mapping from a low-dimensional latent vector $z \in Z$ (a prior variable of independent samples with the same distribution) to a point in the real data space $X$. Goodfellow and Pouget-Abadie [6] proposed that the loss function of the discriminator $D$ is actually the regular cross-entropy loss function of a binary classifier:

$$\mathrm{loss} = -\left(y \log p + (1 - y)\log(1 - p)\right). \tag{1}$$

The result of the loss function differs depending on the input sample type. Depending on the label $y$, one of the two terms in the loss function vanishes, and the result is the negative logarithm of the probability that the discriminator assigns to the correct class. Note that $y = 1$ for real samples. The quantities $p$ and $1 - p$ are the respective prediction probabilities of the real and false samples. If $D(x)$ denotes $p$ and $G(z)$ stands for the generated sample, then the loss function of the discriminant model can be written as follows [6]:

$$\mathrm{loss}_D = -\left(y \log D(x) + (1 - y)\log(1 - D(G(z)))\right). \tag{2}$$

The generator $G$ is designed to maximize the loss function of the discriminator $D$ [6]. Since $y \log D(x)$ has nothing to do with the generator, the loss function of $G$ can be expressed as

$$\mathrm{loss}_G = (1 - y)\log(1 - D(G(z))). \tag{3}$$

There is a minimax competitive relationship between the generator $G: Z \mapsto X$ and the discriminator $D: X \mapsto [0, 1]$, with the objective function [9]:

$$\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}\left[\log D(x)\right] + E_{z \sim p(z)}\left[\log(1 - D(G(z)))\right], \tag{4}$$

where $x$ is the real data with probability distribution $p_{data}(x)$, $z$ is the noise data with probability distribution $p(z)$, and $E$ is the mathematical expectation over real data $x$ and noise data $z$. Often represented by a fully connected or convolutional neural network, the generator $G$ obtains a distribution $p_g(x)$ of generated data from the noise distribution $p(z)$, aiming to make $p_g(x)$ as close to $p_{data}(x)$ as possible, i.e., to ensure that a minimum number of generated samples are judged as false. The discriminator $D$ is trained to maximize its ability to identify generated samples, i.e., to measure the gap between $p_g(x)$ and $p_{data}(x)$. Equation (4) tries to locate the minimum Jensen-Shannon divergence between $p_{data}(x)$ and $p_g(x)$, where $p_g$ is the distribution implicitly defined by the generator for $z \sim p(z)$.

If the data generated by $G(z)$ are $x$, then we obtain

$$x = G(z) \Rightarrow z = G^{-1}(x) \Rightarrow dz = (G^{-1})'(x)\,dx, \tag{5}$$

and the objective function can be written in integral form as

$$V(D, G) = \int_x p_{data}(x)\log D(x)\,dx + \int_z p(z)\log(1 - D(G(z)))\,dz. \tag{6}$$

The distribution $p_g(x)$ is defined as the generation distribution of $z$:

$$p_g(x) = p\left(G^{-1}(x)\right)(G^{-1})'(x). \tag{7}$$

Substituting equation (7) into (6) yields equation (8):

$$V(D, G) = \int_x \left[p_{data}(x)\log D(x) + p_g(x)\log(1 - D(x))\right]dx. \tag{8}$$

The maximum value of $D$ in the objective function [9] is

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}, \tag{9}$$

which is the optimal solution of $D(x)$. If the distribution of samples generated by $G$ is consistent with the distribution of real samples, i.e., $p_{data}(x) = p_g(x)$, then $D^{*}(x) = 1/2$. By substituting the optimal solution of $D$ in equation (8) to solve for the optimal value of $G$, we obtain

$$C(G) = \max_D V(D, G) = \int_x \left[p_{data}(x)\log\frac{p_{data}(x)}{p_{data}(x) + p_g(x)} + p_g(x)\log\frac{p_g(x)}{p_{data}(x) + p_g(x)}\right]dx. \tag{10}$$

We turn equation (10) into the KL divergence expression to obtain Theorem 2.5 of [10]:

$$C(G) = -\log 4 + KL\left(p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{data} + p_g}{2}\right). \tag{11}$$

In equation (11) [10], when the optimal solution of $G$, $p_g(x)$, is equal to the real distribution $p_{data}(x)$, $KL = 0$ and the minimum value of $G$ is $-\log 4$. Hence, when the discriminator approaches the optimal solution, $G$ also approaches the minimum value.
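The two closing statements of this derivation, that the optimal discriminator outputs 1/2 and that the objective reaches −log 4 when the generated and real distributions coincide, can be checked numerically. The following is a minimal sketch on toy discrete distributions; the function names and the probability values are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss of D: real samples carry y = 1, generated samples y = 0."""
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

def generator_loss(d_fake):
    """Generator term log(1 - D(G(z))), averaged over a batch; G tries to minimize it."""
    return float(np.mean(np.log(1.0 - d_fake)))

def optimal_discriminator(p_data, p_g):
    """Pointwise optimum D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data / (p_data + p_g)

def generator_criterion(p_data, p_g):
    """Value of the objective at D = D*, with the integral replaced by a sum."""
    d_star = optimal_discriminator(p_data, p_g)
    return float(np.sum(p_data * np.log(d_star) + p_g * np.log(1.0 - d_star)))

# Toy distributions over four outcomes (illustrative values only).
p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_matched = p_data.copy()
p_mismatched = np.array([0.4, 0.3, 0.2, 0.1])

print(optimal_discriminator(p_data, p_matched))    # 0.5 everywhere
print(generator_criterion(p_data, p_matched))      # equals -log 4
print(generator_criterion(p_data, p_mismatched))   # strictly larger than -log 4
```

Any mismatch between the two distributions adds a non-negative divergence term, which is why the mismatched case lies above −log 4.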

Design of the Active Sound Production System
We propose a GAN model to generate the sound from an internal combustion engine. The main process is as follows.
The preprocessed audio samples and tags are divided into training and test sets. The GAN model is trained with the training set data and the corresponding tags as the raw audio input. The test set is used to validate the iterated training model, so as to filter and save the optimal model. The generator model in the saved GAN model is used to generate new audio samples (Figure 2).

Design of the GAN Model.
The generator and discriminator have a convolutional neural network structure, and the convolutional layer of the generator is referred to as a transposed convolutional layer [11], i.e., the feature map is upsampled, which is similar to a reverse gradient calculation in an ordinary convolutional layer. The GAN structure used for training with raw audio samples is as follows. The generator consists of an input layer, a fully connected layer, and five convolutional layers. In the GAN model, the more convolutional layers the generator and the discriminator have, the closer the generated sample is to the real sample, but the longer the training time of the model. The choice of five convolutional layers for the generator is a compromise between the training time and the quality of the generated samples [8]. The ReLU activation function is used between two convolutional layers. Since the input raw audio samples are one-dimensional vectors, only the width of the convolution kernel is denoted (Table 1). The raw sample discriminator consists of five convolutional layers and one fully connected output layer. The ReLU function serves as the activation function between two layers, a phase conversion operation is added to each convolutional layer, and the discriminator contains a reconstruction layer and a fully connected layer (Table 2).

Shock and Vibration 3
A GAN model using spectrograms of samples has the following network structure. The generator consists of one fully connected layer and five convolutional layers. The ReLU function serves as the activation function between two layers, and the last layer is activated with a tanh function. Since the input spectrograms are two-dimensional arrays, the size of the convolution kernel is represented by length and width (Table 3). The discriminator consists of five convolutional layers, one reconstruction layer, and one output layer. Activation layers alternate with convolutional layers, and the ReLU function serves as the activation function. The output layer is fully connected. Table 4 shows the structure.

Therefore, long audio signal samples are decomposed into corresponding frequency bands during processing. Lee et al. [12] and Sainath et al. [13] proposed using audio signals in the time domain to complete training in semisupervised audio classification, and their results show that classification can be performed in the time or frequency domain with the same accuracy. The sound signal produced when the engine starts is nonstationary. It varies in time and may contain the audio characteristics of the motor and engine speed. Therefore, in short-term processing of audio samples, the two-dimensional time-frequency spectrogram signals and audio signals in the time domain are involved in a comparative experiment.

The working audio samples of the internal combustion engine were recorded in an ordinary laboratory environment. A total of 1200 audio samples were collected in the morning and afternoon to form an experimental sample library. This contained 400 sets of steady-state sound signals produced by a Toyota HR16DE gasoline engine, Hyundai D4BH diesel engine, and Mitsubishi 4G6 MIVEC gasoline engine, whose running speed increased from 800 rpm to 4500 rpm. Each audio sample was processed to be monophonic with a sampling rate of 16 kHz and a duration of 1 s.

Sample Generation
When processing audio samples in the time domain, a real signal waveform was converted to a one-dimensional vector. Extreme value normalization was used to adjust the data so that GAN model training could be completed in the end-to-end learning mode. The short-time Fourier transform (STFT) is the most commonly used method to process the spectrogram in the two-dimensional time-frequency domain. This idea has been adapted by multiple researchers to allow for an objective measure of various generative systems [14][15][16]:

$$\mathrm{STFT}(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w^{*}(\tau - t)\, e^{-j 2 \pi f \tau}\, d\tau, \tag{13}$$

where $x(t)$ is a continuous signal, $w^{*}(t)$ is a window function that varies with time, and the superscript "$*$" denotes the complex conjugate. Equation (13) was used to transform the audio signal from the time domain into the frequency domain in discrete time bins [14]. This provides the amplitude and phase of each frequency contained in the signal at any time. When preprocessing the signals, the signal was divided into frames by windowing, with a frame length of 16 ms and a frame shift of 8 ms. This process yields data in a 129 × 1999 matrix. In order to ensure that the obtained sound spectrum can reflect human auditory sensitivity, a mel-scale filter was used to process the two-dimensional time-frequency signals:

$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \tag{15}$$

The resulting signals effectively reflect the sensitivity of the human ear, i.e., the mel spectrum changes rapidly at low frequencies and slowly at high frequencies. Figure 3 shows the processed mel spectrogram. The sample data were translated and normalized as follows:

$$x' = \frac{x - \mu}{\sigma}. \tag{16}$$

In other words, the mean value of the dataset was set to 0, the standard deviation was set to 1, and the values of the processed dataset ranged from −1 to 1. In equation (16), $\mu$ is the mean value of all samples, and $\sigma$ is the standard deviation. A human ear auditory test on the processed data indicated no auditory difference between the samples.
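The mel mapping mel(f) = 2595 log10(1 + f/700) and the zero-mean, unit-deviation normalization described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the preprocessing code of the study: the function names are hypothetical, and the reading of the 16 ms figure as the frame length (256 samples at 16 kHz, hence 129 one-sided frequency bins) is an assumption inferred from the 129-row matrix.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel-scale mapping mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def standardize(x):
    """Shift the data to zero mean and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

SAMPLE_RATE = 16000                      # Hz, as stated for the audio samples
frame_len = SAMPLE_RATE * 16 // 1000     # assumed 16 ms frame -> 256 samples
n_freq_bins = frame_len // 2 + 1         # one-sided spectrum of a 256-point frame

# The mel scale is steeper at low frequencies than at high frequencies.
low_band = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_band = hz_to_mel(8000.0) - hz_to_mel(7000.0)
print(n_freq_bins, low_band, high_band)
```

The comparison of the two 1 kHz bands illustrates the statement that the mel spectrum changes rapidly at low frequencies and slowly at high frequencies.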

Training and Optimization.
To guarantee the comparability of the test results from the two signal inputs, the inputs from raw audio samples and spectrograms both lasted 1 s, and the data input to the generator consisted of 100-dimensional latent vectors. The generator trained against the raw samples used the generator structure in Table 1.
The fully connected layer converted the 100-dimensional noise vectors into a 16 × 1024 feature map. A deconvolution operation similar to upsampling was conducted based on the length and number of convolution kernels in Table 1. After five deconvolution and activation operations, a 16384-dimensional vector was obtained and input to the discriminator in the next step. The discriminator structure shown in Table 2 was used to train the discriminator against the raw samples.
The 16384-dimensional vector generated by the generator and the 16384-dimensional data read from real samples were input to the discriminator for convolution and corresponding activation operations. The leaky ReLU (LReLU) function [15] served as the activation function, which reduced the sparsity of ordinary ReLU. We set α = 0.2 in this study. After five convolution operations, the fully connected layer produced the discriminator output, which determined the authenticity of the samples.
While training the raw sample discriminator, due to common frequency overlap in the real data, tonal noise is inevitably produced during upsampling, which is like the "checkerboard" artifact [17] caused by deconvolution of two-dimensional images. Since tonal noise often occurs at a specific phase, the discriminator will probably learn a rule to reject these noisy samples, thereby suppressing the overall optimization and weakening the accuracy of the discriminator. To solve this problem, a phase disturbance operation was used during training, which randomly shifted the phase of the data in each activation layer by up to n samples, so that the feature map could be unified before being input to the next layer, thereby decreasing the effect of noise on the discriminator. The phase disturbance operation produces a uniform sample in each layer of the discriminator by filling the left or right side of the feature map. Figure 4 shows all the possible outputs from five feature maps when n = 2.
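A phase disturbance of this kind can be sketched as a random shift of each feature map along the time axis. The sketch below assumes that the shift k is drawn uniformly from −n to n and that the gap opened by the shift is filled by reflection padding, as in the phase shuffle of WaveGAN [8]; the source does not specify the filling rule, and the function names are hypothetical.

```python
import numpy as np

def shift_with_reflection(feature_map, k):
    """Shift a (length, channels) feature map by k samples along the time axis.

    The output keeps the input length; the opened gap is filled by reflection
    padding (an assumption, not confirmed by the source).
    """
    length = feature_map.shape[0]
    if k > 0:   # shift right: pad on the left, drop the tail
        padded = np.pad(feature_map, ((k, 0), (0, 0)), mode="reflect")
        return padded[:length]
    if k < 0:   # shift left: pad on the right, drop the head
        padded = np.pad(feature_map, ((0, -k), (0, 0)), mode="reflect")
        return padded[-k:]
    return feature_map.copy()

def phase_disturb(feature_map, n, rng):
    """Apply a random shift k drawn uniformly from the integers -n, ..., n."""
    k = int(rng.integers(-n, n + 1))
    return shift_with_reflection(feature_map, k)

# With n = 2 there are five possible outputs, one per shift in {-2, ..., 2}.
x = np.arange(6, dtype=float).reshape(6, 1)
for k in range(-2, 3):
    print(k, shift_with_reflection(x, k).ravel())

rng = np.random.default_rng(0)
print(phase_disturb(x, 2, rng).ravel())
```

Because each layer draws its own shift, the discriminator cannot rely on the absolute phase of periodic artifacts to separate generated samples from real ones.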
In the process of inputting and training against the spectrogram samples, the structure model shown in Table 3 was used for generator training, and the fully connected layer transformed the input 100-dimensional noise vector samples to a 4 × 4 × 1024 feature map. The 5 × 5 two-dimensional convolution kernel, with the number of convolution kernels decreasing layer-by-layer, was used for deconvolution. After five deconvolution and activation operations, the obtained 128 × 128 two-dimensional spectrogram was used as the generated sample input to the discriminator. The structure model shown in Table 4 was used for discriminator training, and the spectrograms generated by the generator and read from the real samples were unified as the input for convolution and corresponding activation operations. The activation function was the same as used for raw audio discriminator training, and the discriminator output was processed in the same way.
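The feature map sizes quoted for the two generators are consistent with a fixed upsampling factor per layer: 16 × 4^5 = 16384 for the raw audio generator and 4 × 2^5 = 128 for each side of the spectrogram. The sketch below makes that arithmetic explicit. The per-layer strides of 4 and 2 are inferred from the stated input and output sizes under the assumption that all five layers use the same stride; they are not read from Tables 1 and 3.

```python
def layer_sizes(start, stride, n_layers):
    """Sizes along one axis before and after each of n_layers upsampling steps."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(sizes[-1] * stride)
    return sizes

# Raw audio generator: 16 x 1024 feature map -> 16384 output samples (assumed stride 4).
raw_audio = layer_sizes(start=16, stride=4, n_layers=5)

# Spectrogram generator: 4 x 4 x 1024 feature map -> 128 x 128 output (assumed stride 2).
spectrogram = layer_sizes(start=4, stride=2, n_layers=5)

print(raw_audio)     # [16, 64, 256, 1024, 4096, 16384]
print(spectrogram)   # [4, 8, 16, 32, 64, 128]
```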
To ensure the comparability of the training of the two GAN models, batches of 64 samples were used to estimate the gradient in both model experiments, with a learning rate of 0.0002. To shorten the training time and mitigate the sparse gradient problem during the training process, the adaptive moment estimation (Adam) algorithm [18] was used to optimize the gradient descent with β = 0.5.
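For reference, a single Adam update with the stated learning rate can be written out as follows. This is a generic sketch of the algorithm in [18], not the training code of the study. It assumes that the reported β = 0.5 refers to the first-moment decay rate (beta1 below); beta2 = 0.999 and eps = 1e-8 are the defaults suggested in [18] and are not stated in the text.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w.
w = np.array([1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(w)   # slightly below 1.0 after 100 small steps
```

On the first step the bias correction makes the update magnitude close to the learning rate regardless of the gradient scale, which is one reason Adam is less sensitive to sparse or poorly scaled gradients than plain gradient descent.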

Experimental Process.
The computing environment comprised an NVIDIA GeForce GTX 1070 GPU and the CUDA 9.0 toolkit.
Eighty percent of the samples were selected as the training set, 10% as the validation set, and the remaining 10% as the test set. After 80 batches of samples were used for training, the data tended to converge in about nine hours. In Figure 5, G_W and G_S are the generator loss curves for training against the raw audio and spectrogram samples, respectively, and D_W and D_S are the corresponding discriminator loss curves for training. As shown in the figure, when 20 batches of samples were trained, the training loss function values of the two pairs of generators and discriminators tended to stabilize.
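The 80/10/10 partition of the 1200-sample library can be sketched as a random permutation of sample indices. The helper name, the use of a shuffled split, and the seed are illustrative assumptions; the source does not state how the samples were assigned to the three sets.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Randomly partition sample indices into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1200)
print(len(train_idx), len(val_idx), len(test_idx))   # 960 120 120
```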

Evaluation and Analysis of Experimental Results.
To test the quality of the training model, the samples generated by training were evaluated qualitatively and quantitatively.
Human listeners were asked to perform the qualitative evaluation. The sample set for the listening test was divided into two groups: one was composed of 10 audio samples generated by the raw audio GAN model and 10 real sounds, and the other was composed of 10 audio samples generated by the spectrogram GAN model and another 10 real sounds. Twenty-five volunteers randomly recruited on campus participated in the listening test, including 18 men and 7 women. The listeners were told in advance that part of the audio samples they heard were machine generated. The evaluation was conducted through an online voting system, and each listener voted on the authenticity of samples after listening to them. The degree of authenticity of the samples was divided into 11 levels, on a scale from 0 to 1, where 0 represents a completely generated sound sample and 1 represents a completely real sound sample. The results of the 25 listeners are shown in Figures 6 and 7. Samples with attribute values of 0 and 1 identify generated and real samples, respectively. According to the voting results, the authenticity rate of most samples reached more than 90%, which means ordinary people were unable to effectively distinguish between these two types of samples, nor could they evaluate the disparity between the raw audio model and the spectrogram model. The generated samples were quantitatively evaluated by the leave-one-out (LOO) method, which used a 1-NN classifier between generated and real samples [19]. If the samples generated by the model are qualified and their distribution perfectly matches that of the real samples, then the 1-NN classifier should exhibit an LOO accuracy of approximately 50%. No matter how the validation and training sets are allocated, there is only a probability of 50% that the 1-NN classifier could correctly predict the distribution of samples. All 1200 real samples were used as positive samples, and all 1200 generated samples as negative samples.
The LOO method was used to iteratively train and evaluate the 1-NN classifier. As shown in Figure 8, the LOO values for all validation sets with the two models are on an upward curve, indicating great reliability of the model. Table 5 shows evaluation results for samples generated by the two models. All validation sets have LOO values greater than or close to 50%, which implies there is no overfitting when training the raw audio GAN and the spectrogram GAN. Mode collapse leads to a high LOO value because the generated samples tend to gather in a small number of mode centers, and these modes are surrounded by generated samples in the same category. Hence, when they serve as validation sets, the 1-NN classifier will make the correct decision on the negative sample, resulting in a relatively high LOO value. As shown in Table 5, when the samples generated by the GAN model trained against spectrograms serve as the validation set, the LOO value is 0.945, which is 0.131 greater than that of the validation set for samples generated by the GAN model trained against the raw audio signals. This implies that the spectrogram GAN model may have suffered mode collapse during training, thereby failing to fully learn the true distribution of all the samples. The input samples used in the GAN model trained against the raw audio signals are only normalized, and their diversity is higher. Therefore, the samples generated by the raw audio GAN model are also distributed in multiple mode centers, and its mode collapse rate is lower than that of the spectrogram GAN model, which shows that the raw audio GAN model is better than the spectrogram GAN model in the diversity of generated samples. Although human ears are unable to distinguish generated sounds from real sounds, the two GAN models generate an insufficient variety of audio samples, which to some extent is correlated with the small number and limited types of training samples. Future studies should increase the number and types of training samples.
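The LOO evaluation with a 1-NN classifier can be sketched as follows: each sample is held out in turn, classified by the label of its nearest remaining neighbor, and the fraction of correct decisions is reported overall and per class. The function name, the Euclidean distance, and the toy data are assumptions made for illustration; the feature representation used in the study is not specified here.

```python
import numpy as np

def loo_1nn_accuracy(real, generated):
    """Leave-one-out accuracy of a 1-NN classifier on real (1) vs generated (0)."""
    x = np.vstack([real, generated]).astype(float)
    y = np.concatenate([np.ones(len(real), dtype=int),
                        np.zeros(len(generated), dtype=int)])
    sq = np.sum(x ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)   # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)       # a held-out sample cannot be its own neighbor
    nearest = np.argmin(dist, axis=1)
    correct = y[nearest] == y
    return {
        "overall": float(correct.mean()),
        "real": float(correct[y == 1].mean()),
        "generated": float(correct[y == 0].mean()),
    }

# Two sample sets drawn from the same distribution: accuracy should be near 0.5.
rng = np.random.default_rng(0)
same = loo_1nn_accuracy(rng.normal(size=(200, 8)), rng.normal(size=(200, 8)))

# Generated samples collapsed onto one tight cluster: their per-class accuracy is high.
collapsed = loo_1nn_accuracy(rng.normal(size=(200, 8)),
                             5.0 + 0.01 * rng.normal(size=(200, 8)))
print(same)
print(collapsed)
```

The per-class accuracy on generated samples is the quantity that rises under mode collapse, since collapsed samples find their nearest neighbor among other generated samples.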

Conclusions
(1) A GAN-based active sound production model with corresponding hierarchical structures for the generator and discriminator networks is proposed for producing internal combustion engine sounds in electric automobiles. In experiments, audio samples from internal combustion engines during startup were used as inputs to train a GAN model. Based on the evaluation of the 1-NN classifier, this model can be used to accurately learn the characteristic distribution of the raw audio signals. Human evaluation results show that the generated audio samples closely mimic the real sounds.
(2) Results from LOO tests show that a GAN model trained against raw audio samples exhibited a lower mode collapse rate than the GAN model trained against spectrograms. Overall, the samples generated with the GAN model trained against raw audio samples were of higher diversity than those generated with the GAN trained against spectrograms.

Data Availability
Some or all data, models, or code generated or used during the study are available from the corresponding author by request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.