Utterance Clustering Using Stereo Audio Channels

Utterance clustering is one of the actively researched topics in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining left- and right-channel audio signals in a few different ways and then by extracting the embedded features (also called d-vectors) from those processed audio signals. This study applied the Gaussian mixture model for supervised utterance clustering. In the training phase, a parameter-sharing Gaussian mixture model was obtained to train the model for each speaker. In the testing phase, the speaker with the maximum likelihood was selected as the detected speaker. Results of experiments with real audio recordings of multiperson discussion sessions showed that the proposed method that used multichannel audio signals achieved significantly better performance than a conventional method with mono-audio signals in more complicated conditions.


Introduction
With the development of artificial intelligence (AI), many techniques are applied in our daily life, such as automatic speech recognition (ASR) [1] and speaker recognition. Studies and products in speech processing are widely used in daily life, such as Apple's Siri, Amazon's Alexa, Google Assistant, and Microsoft's Cortana. As research in speech processing continues to develop, its popularity will likely increase further. Utterance clustering is a popular topic in speech processing that can be used for speaker diarization [2] and ASR. However, most studies are based on laboratory data sets, and these methods do not handle real-world conditions very well. Both formal and informal meetings have more segments with overlapping speech than segments with only one speaker [3]. In laboratory data sets, people speak one by one, but in the real world it is hard to ask people not to interrupt others' speech. The issue of overlapping speech segments has received considerable attention [4]. To expand the application of speech processing, it is necessary to achieve better performance in clustering overlapping utterances.
A key aspect of performance improvement in utterance clustering is audio feature embeddings, which play a vital role in clustering performance.
There are many studies that focus on the enhancement of audio feature embeddings, such as the mel-frequency cepstral coefficient (MFCC) [5], i-vector [6], x-vector [7], and d-vector [8]. However, a significant bottleneck to the widespread adoption of speech processing applications in daily life is the requirement for high-quality audio data. In addition, audio signal processing itself can contribute to the quality of feature embeddings. Many social science experiments need a better speaker diarization method that works with low-quality audio recordings.
This study initially tried several different published methods for our own experimental research, but their results were not as good as we had hoped.
To address this problem, a new method of audio signal processing for utterance clustering is proposed here [9]. The challenge this study aims to address is how to handle low-quality audio data recorded in real-world discussion settings. The audio data set was recorded using an ordinary video camcorder in a noisy environment without a professional microphone. This study contributes to the advancement of utterance clustering when recording conditions are limited.
This study aims to improve clustering performance by processing multichannel (stereo) audio signals. Mono-audio signals are typically used in audio processing studies because they can be obtained easily by downmixing stereo audio signals. In this study, the left- and right-channel signals were instead combined in several ways, and the d-vector of each audio segment was then obtained using pretrained neural networks as the audio feature representation.
Gaussian mixture model (GMM) was used as a supervised clustering method. The error rate (ER) of the clustering was compared, and the results showed that using the processed multichannel audio signals for utterance clustering was significantly better than using the original mono-audio signals. The structure of the paper is as follows. Section 2 introduces related work. Section 3 describes the feature processing method and the Gaussian mixture model. Section 4 gives the details of the data set and the experiments. The results are discussed in Section 5, and the conclusion and plans for future work are described in Section 6.

Related Work
Numerous researchers have made significant advances in utterance clustering and related fields over the last few decades. Certain studies place a greater emphasis on feature embeddings; historically, the most common feature representation was the MFCC [5], which is a method based on the Fourier spectrum. Then, as factor analysis developed, Dehak et al. [6] proposed a factor analysis called i-vector. Their factor analysis took into account the variability of speakers and channels without distinction. Lei and Kun [10] proposed wavelet packet entropy (WPE) to extract short vectors from utterances and then used the i-vector as a feature embedding. As with i-vectors, d-vectors [8] also have fixed sizes regardless of the length of the input utterance. Wan et al. [11] trained speakers' utterances of varying lengths using a deep neural network, resulting in fixed-length embeddings, namely d-vectors. The distinction between the i-vector and the d-vector is that the former is generated using a GMM, while the latter is trained using deep neural networks. Similar to the d-vector, the x-vector [7] is also trained with deep neural networks. Ma et al. [12] proposed an E-vector, obtained by minimizing a Euclidean metric, to improve the performance of speaker identification. All of the aforementioned feature embeddings are commonly utilized, and the d-vector was used in this study.
There are also some works that focus on the improvement of feature extraction. Lin et al. [13] introduced a novel feature extraction approach that combines multiresolution analysis with chaotic feature extraction to improve the performance of utterance features. Daqrouq et al. [14] proposed a feature extraction method based on the wavelet packet transform (WPT). They removed the silent parts from the audio data and decomposed the audio signal into wavelet packet tree nodes.
In some research, the clustering algorithms are given greater consideration. Delacourt and Wellekens [15] applied the Bayesian information criterion (BIC) to measure the distances among utterances and conducted agglomerative hierarchical clustering (AHC) based on the BIC metrics. Li et al. [16] applied a GMM to MFCC features to classify speakers' gender. Algabri et al. [17] applied a Gaussian mixture model with a universal background model (GMM-UBM) to recognize speakers according to the MFCCs of utterances. Shum et al. [18] used a resegmentation algorithm with a Bayesian GMM clustering model based on i-vectors to improve speech clustering. Zajíc et al. [19] proposed a model applying a convolutional neural network (CNN) to i-vectors to detect speaker changes. Wang et al. [20] developed an LSTM model on d-vectors for speaker diarization. Zhang et al. [21] constructed a supervised speaker diarization system on extracted d-vectors, called unbounded interleaved-state recurrent neural networks (UIS-RNNs).
In comparison to the previous efforts, this study used processed audio signals rather than mono-audio samples. The processed audio signals are derived from multichannel (stereo) audio signals, and the proposed method attempts to preserve more representative audio characteristics.

Methods
In this section, the proposed method of audio feature processing is discussed. The details of processing multichannel audio features are shown, the tool employed to extract audio feature embeddings is described, and the clustering method is presented.

Feature Processing.
This study processed the left-channel audio signals and the right-channel audio signals to obtain speech-only audio features. The details of feature processing are visualized in Figure 1. In this example, after removing the nonspeech part, the speaker's speaking time is 27 seconds. The 27 seconds of stereo audio were divided into 54 stereo audio segments, each 0.5 seconds in length. After that, mono-audio files, left-channel audio files, and right-channel audio files were extracted from the 0.5-second-long stereo audio files. The Python package librosa [22] was used to obtain the left and right audio signals in the time series.
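The segmentation step above can be sketched as follows. This is a minimal illustration using a synthetic stereo array; with librosa, the `(2, n)` stereo array would come from `librosa.load(path, sr=sr, mono=False)[0]`. The 16 kHz sample rate is an assumption, as the paper does not state one.

```python
import numpy as np

SR = 16000          # assumed sample rate (not reported in the paper)
SEG_LEN = SR // 2   # 0.5-second segments

def split_and_segment(stereo, seg_len=SEG_LEN):
    """Split a (2, n) stereo signal into per-channel 0.5 s segments.

    Trailing samples shorter than one full segment are dropped,
    mirroring the paper's removal of sub-0.5 s audio files.
    """
    left, right = stereo[0], stereo[1]
    n_seg = len(left) // seg_len
    return [
        (left[k * seg_len:(k + 1) * seg_len],
         right[k * seg_len:(k + 1) * seg_len])
        for k in range(n_seg)
    ]

# 27 seconds of synthetic stereo audio -> 54 segments, as in Figure 1
stereo = np.random.randn(2, 27 * SR)
segments = split_and_segment(stereo)
print(len(segments))  # 54
```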
Horizontal stacking of the original left and right audio signals (hstack) and horizontal stacking of the sum and the difference of the left- and right-channel signals (sumdif) were performed. The computational complexity of the proposed method is still O(L), where L is the length of the audio signal, which is the same order as the traditional methods (although the actual computation takes about twice as long because our method processes two channels of audio signals).
For the training set, all speakers' utterances S = (s_1, ..., s_i, ..., s_N) were acquired, where N represents the number of speakers in the audio data set and s_i represents the sequence of all speaking segments of the ith speaker. Specifically, s_i = (x_{i,1}, ..., x_{i,t}), where x_{i,t} represents the ith speaker's audio signal at the tth segment. Then the left- and right-channel audio signals were extracted from each segment in s_i: x^L_{i,t} for the ith speaker's left-channel audio signal at the tth segment and x^R_{i,t} for the ith speaker's right-channel audio signal at the tth segment. Using the left and right channels, the following two combined audio segments were created:

x^hstack_{i,t} = [x^L_{i,t}, x^R_{i,t}],
x^sumdif_{i,t} = [x^L_{i,t} + x^R_{i,t}, x^L_{i,t} - x^R_{i,t}].

For a fair comparison, a mono stack (mstack) was also created. It is the stacked result of a repeated mono signal, represented as

x^mstack_{i,t} = [x_{i,t}, x_{i,t}].
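The three channel combinations described above (hstack, sumdif, and the mstack baseline) reduce to simple concatenations, as in this sketch. The averaging downmix for the mono signal is a common convention, not something the paper specifies.

```python
import numpy as np

def hstack_feat(left, right):
    # concatenate the raw left and right channels
    return np.concatenate([left, right])

def sumdif_feat(left, right):
    # concatenate the channel sum and the channel difference
    return np.concatenate([left + right, left - right])

def mstack_feat(mono):
    # mono baseline of matching length: the mono signal repeated twice
    return np.concatenate([mono, mono])

left = np.array([1.0, 2.0])
right = np.array([3.0, 5.0])
mono = (left + right) / 2  # assumed downmix convention

print(hstack_feat(left, right))  # values 1, 2, 3, 5
print(sumdif_feat(left, right))  # values 4, 7, -2, -3
print(mstack_feat(mono))         # values 2, 3.5, 2, 3.5
```

All three outputs have the same length, so the downstream d-vector extractor sees inputs of a consistent duration regardless of which combination is used.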

Feature Embeddings.
After feature processing, the d-vector [11] was extracted as the feature representation of the audio signals. The pretrained model called real-time voice cloning [23] was used to extract the d-vector. The pretrained model was trained using three data sets: one is the LibriSpeech ASR corpus [24], which contains 292,000 utterances from more than 2,000 speakers in English, and the others are VoxCeleb 1 and 2 [25, 26], which together contain more than 1 million utterances from more than 7,000 speakers in multiple languages.

Gaussian Mixture Model.
The Gaussian mixture model (GMM) was used as the clustering method. GMM is one of the most frequently used tools for speaker clustering. In this study, separate GMM models for individual speakers were built, defined as

p(y) = sum_{m=1}^{M} α_m N(y; μ_m, Σ_m),

where y represents the feature vector of the audio signal, p(y) is the probability that the input audio signal belongs to a specific cluster, α_m represents the mixing proportions, μ_m represents the mean, and Σ_m represents the covariance matrix [27]. The expectation maximization (EM) algorithm [28] was used to estimate the model parameters of the GMM. GMM has significant advantages in acoustic modeling [27].
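The per-speaker training and maximum-likelihood detection can be sketched with scikit-learn, which the paper reports using. This is a minimal sketch on synthetic stand-ins for d-vectors; the speaker names, the number of mixture components, and the cluster geometry are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for d-vectors: one well-separated cluster per speaker
train = {
    "spk1": rng.normal(loc=0.0, scale=0.3, size=(100, 8)),
    "spk2": rng.normal(loc=3.0, scale=0.3, size=(100, 8)),
}

# One GMM per speaker, fitted with EM; full covariance and K-means
# initialization follow the paper, n_components is an assumption
models = {
    spk: GaussianMixture(n_components=2, covariance_type="full",
                         init_params="kmeans", random_state=0).fit(X)
    for spk, X in train.items()
}

def detect_speaker(y):
    """Return the speaker whose GMM gives the highest log-likelihood."""
    y = np.atleast_2d(y)
    return max(models, key=lambda spk: models[spk].score(y))

print(detect_speaker(rng.normal(3.0, 0.3, size=8)))  # spk2
```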

Experiments
The details of the experiments are described in this section. The details of the data set used in this study are presented first.
Then, the tool used to perform audio processing is introduced, and the details of the clustering experiments are shown. To ensure that the comparison between the proposed method and the comparative methods is fair, the same audio data and the same audio processing method to extract multichannel audio signals and mono-audio signals were used for both the proposed method and the comparative methods. Last but not least, a parameter-sharing GMM was conducted for both the proposed method and the comparative methods.

Data set.
A data set [29] containing 11 video files of discussions by multiple participants in a real-world physical environment was used in this work. The number of speakers in the 11 videos ranged from 4 to 10, the number of female speakers ranged from 1 to 6, the number of male speakers ranged from 1 to 6, and all speakers spoke English. Each speaker's speaking time ranged from 1 to 130.5 seconds. The total speaking time across all 11 videos is 31.6 minutes, and the average speaking time per speaker is 26.7 seconds. The data set was manually annotated with ground-truth speaker labels.
In the proposed experiments, two comparison groups were set up. The audio files in one group contain overlapping speech, and the audio files in the other group do not. The audio files in the two groups come from the same recordings. Speakers were in a real-world free discussion scenario, and an ordinary video camcorder with a built-in stereo microphone was used to record all videos and audio.

Audio Processing.
FFmpeg [30] was used to extract stereo audio files and mono-audio files from the video files. Based on the manually annotated speaking time data, audio segments were cut for each speaker. Then, each speaker's audio segments were cut into shorter segments 0.5 seconds in length. Audio files shorter than 0.5 seconds were deleted. The stereo signals were then split into left- and right-channel signals, and d-vectors were obtained from the processed signals. After the audio signal processing, the clustering experiments were conducted.
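The stereo and mono extraction steps with FFmpeg might look like the following sketch. The file names and the 16 kHz sample rate are hypothetical; the paper does not report the exact command line used. The commands are only constructed here, not executed.

```python
# Construct ffmpeg commands for extracting audio from a video file.
# -vn drops the video stream, -ac sets the channel count,
# -ar sets the sample rate (16 kHz is an assumption).
def ffmpeg_extract_cmd(video_path, wav_path, channels):
    return [
        "ffmpeg", "-i", video_path,
        "-vn",
        "-ac", str(channels),   # 2 = stereo, 1 = downmixed mono
        "-ar", "16000",
        wav_path,
    ]

stereo_cmd = ffmpeg_extract_cmd("session01.mp4", "session01_stereo.wav", 2)
mono_cmd = ffmpeg_extract_cmd("session01.mp4", "session01_mono.wav", 1)
# subprocess.run(stereo_cmd, check=True) would perform the extraction
print(" ".join(stereo_cmd))
```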

Clustering by the Gaussian Mixture Model.
This work applied scikit-learn [31] for GMM training and testing. In the initial experiment, a small part of the data set and the traditional method (mono-audio signals) were used to adjust the parameters to obtain better accuracy. Then, for a fair comparison, the same parameters were set for all the methods. The full covariance type was used, and K-means was used to initialize the model. The input to the clustering model is the d-vector. The clustering experiments were conducted 50 times, and each time a 10-fold cross-validation test was conducted.

Results
This part shows the results of the proposed experiments. The visualizations of the feature vectors are displayed to show the results of feature processing, followed by the results of GMM clustering and the results of the significance tests. Figures 2 and 3 show the visualizations of the processed feature vectors using t-SNE [32]; the proposed algorithms (hstack and sumdif) show better clustering results, and the data points form distinct clusters under the proposed methods. It can be seen from Figures 2 and 3 that in the proposed methods, the data points of Speakers 04, 05, 06, and 07 are clustered more closely. This implies that the d-vectors contain more information when the processed multichannel audio signal is used than when they are extracted from the mono-audio signal.

Feature Processing.
This study improves the performance of feature embeddings by processing multichannel audio signals. The proposed method extracts more useful features from the audio signals. Although the improvement is apparent, there are still some differences between the audio with overlap group and the audio without overlap group. Compared with the audio without overlap group, the proposed method enhances the performance of feature embedding more in the audio with overlap group. The audio with overlap group is more intricate than the audio without overlap group, and extracting more useful features helps more in the complicated scenario than in the simple one. Table 1 shows the comparison of the z-scores of the GMM error rates for the different algorithms. From Table 1, the sumdif algorithm works better than the other algorithms in the audio with overlap group. In the audio without overlap group, the mstack algorithm works better; however, hstack and sumdif still work better than mono. The overall performance of hstack and sumdif is better than that of mono and mstack.

Clustering.
One-way ANOVA tests with the Tukey HSD test were performed to determine whether there were differences among the error rates of the compared algorithms. Results are shown in Table 2 for both the audio with overlap group and the audio without overlap group. Both groups showed a statistically significant difference among the algorithms.
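The one-way ANOVA step can be sketched with SciPy as below. The per-run error rates here are fabricated stand-ins for illustration only, NOT the paper's measurements; only the shape of the test matches the paper's procedure.

```python
from scipy.stats import f_oneway

# Illustrative per-run error rates for the four algorithms
# (synthetic numbers, not the paper's results)
mono   = [0.42, 0.44, 0.43, 0.45, 0.41]
mstack = [0.41, 0.43, 0.42, 0.44, 0.40]
hstack = [0.31, 0.33, 0.32, 0.30, 0.34]
sumdif = [0.29, 0.31, 0.30, 0.28, 0.32]

# One-way ANOVA across all four groups
f_stat, p_value = f_oneway(mono, mstack, hstack, sumdif)
print(p_value < 0.05)  # with groups this far apart, the test is significant
```

A significant ANOVA result is then typically followed by a pairwise post hoc test such as Tukey HSD, as the paper does, to identify which algorithm pairs differ.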
Results of the Tukey HSD test are shown in Table 3. The proposed algorithms (hstack and sumdif) are significantly different from the traditional algorithms (mono and mstack) when the audio signals contained overlaps between speeches. The difference was less clear when the audio signals had no overlaps.
The clustering results signify that even when the traditional GMM is applied instead of a deep learning model, using the processed audio signals for utterance clustering can achieve a higher accuracy score than using mono-audio signals. The data set used in this study represents a real-world discussion setting. The proposed method shows significant improvements in a complicated discussion scenario, and the performance could be further improved by implementing deep learning models. The average difference in means also shows that, compared with the simple condition (audio without overlap), the proposed method extracted more features from the audio, which is more conducive to utterance clustering in the complicated scenario (audio with overlap).

Conclusion
This study generated processed audio signals by combining left- and right-channel audio signals in two different ways. d-vectors were extracted as embedded features from those processed audio signals. GMM was used for supervised utterance clustering. Based on the results obtained from the supervised clustering experiment, the proposed method works better in complicated conditions than the traditional methods. Namely, the proposed method can achieve a higher accuracy score than the traditional algorithms on speech that contains overlaps. This is because stereo audio signals contain information about the spatial location of the sound source (in a left-right direction). In a typical real-world discussion setting, speakers tend to sit in fixed locations, so using spatial information can help speaker identification and utterance clustering. This study successfully demonstrated this idea.
One limitation of the proposed method is its computational cost. Even though the theoretical computational complexity of the proposed method is the same as that of the traditional methods, in the actual experiments the run time of the proposed method is greater. Moreover, stereo audio signals were used in this study, so another limitation is that the input data must be multichannel audio signals that carry spatial information.
In this study, GMM was applied as the clustering method. For future work, an innovative clustering model using deep learning will be developed. Applying different clustering methods will enable more comprehensive comparisons between the proposed algorithms and the traditional algorithms.

Data Availability
The audio data used to support the findings of this study have not been made available because Institutional Review Board permissions do not accommodate their release.

Disclosure
A preprint version of this work is also available from https://arxiv.org/abs/2009.05076. The views expressed in this presentation are those of the authors and do not reflect the official policy or position of the Department of the Army, DOD, or the U.S. Government.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Computational Intelligence and Neuroscience