An Objective Approach to Identify Spectral Distinctiveness for Hearing Impairment

To facilitate the development of speech perception, speech-language pathologists have to teach a subject with hearing loss the differences between two syllables by manually enhancing acoustic cues of speech. However, this process is time-consuming and difficult. Thus, this study proposes an objective approach to automatically identify the regions of spectral distinctiveness between two syllables, which can be used for speech-perception training. To accurately represent the characteristics of speech, mel-frequency cepstrum coefficients are selected as analytical parameters. The mismatch between two syllables in the time domain is handled by dynamic time warping. Further, a filter bank is adopted to estimate the components in different frequency bands, which are also represented as mel-frequency cepstrum coefficients. The spectral distinctiveness in different frequency bands is then estimated by using Euclidean metrics. Finally, a morphological gradient operator is applied to automatically identify the regions of spectral distinctiveness. To evaluate the proposed approach, the identified regions are manipulated and the manipulated syllables are then measured by a closed-set speech-perception test. The experimental results demonstrate that the identified regions of spectral distinctiveness are useful in speech perception and can help speech-language pathologists in speech-perception training.


Introduction
Hearing loss seriously degrades a subject's speech perception, thereby affecting the development of articulation ability. It then reduces speech intelligibility and affects speech and language development, learning, and communication. Assistive listening devices such as hearing aids or cochlear implants can help subjects with hearing loss utilize their residual hearing and develop their speech perception [1][2][3][4][5][6][7]. To facilitate this process, speech-language pathologists (SLPs) have to provide speech-perception training that increases a subject's ability to distinguish one syllable from another. In clinical practice, SLPs manually enhance the distinguishable acoustic cues and use them extensively in speech-perception training. However, this is a time-consuming and expensive process. Thus, SLPs would benefit in speech-perception training if the regions of spectral distinctiveness between two syllables could be identified automatically.
When a speech wave propagates on the basilar membrane, it is characterized as time-spectral patterns. Unique perceptual cues, which are the basic units for speech perception, can then be identified. Therefore, the relation between the acoustic cues and perceptual units is a key problem for speech perception [8][9][10]. This relation has been examined over the last decade [11][12][13][14][15][16][17][18], and the results show that the main factors of acoustic cues are duration, stress, and spectral distinctiveness.
SLPs generally increase the duration and stress of a syllable to teach a subject how to distinguish one syllable from another. Duration and stress can be simply manipulated by speech techniques [19]. However, the spectral distinctiveness between two syllables is very difficult to identify. In clinical practice, SLPs have to repeatedly pronounce a syllable while raising the volume of part of the syllable. Manually enhancing the spectral distinctiveness of a syllable in this way is a complicated task for SLPs. To identify the regions of spectral distinctiveness, Li et al. proposed a psychoacoustic method in three dimensions: time, frequency, and intensity [20]; still, it is a time-consuming process. Moreover, it is difficult to apply to other languages. Hence, automatically identifying the regions of spectral distinctiveness is very important for the speech-perception training of subjects with hearing impairment.
In this study, an objective approach to identify the regions of spectral distinctiveness is proposed. The mel-frequency cepstrum coefficients (MFCCs) are selected as analytical parameters and used to represent the characteristics of the acoustic signal. The mismatch between two syllables in the time domain is handled by dynamic time warping, which yields an optimal matching condition. To accurately estimate the spectral similarity, a filter bank is applied to find the spectral components of different frequency bands. For the speech signal in each frequency band, the MFCCs are also extracted to represent the acoustical characteristics. According to the optimal matching condition, the spectral distinctiveness of each frequency band between two syllables can be estimated by using Euclidean metrics. Finally, a morphological gradient operator is developed to automatically identify the regions of spectral distinctiveness. Moreover, to evaluate the accuracy of the identified regions of spectral distinctiveness, an acoustic cue manipulation is proposed in this study.
The rest of this paper is organized as follows. Section 2 describes the objective approach to identify spectral distinctiveness, including feature extraction, spectral distinctiveness estimation, and spectral distinctiveness identification. The acoustic cue manipulation is also introduced there. Section 3 then describes a series of experiments to examine the performance of our approach. Finally, conclusions are drawn in Section 4, along with recommendations for future research.

Materials and Methods
In this section, the proposed objective approach to identify the regions of spectral distinctiveness between two syllables (as shown in Figure 1) is introduced. Firstly, the MFCCs are extracted from the input speech signals and the filtered speech signals. Secondly, the distance between the MFCCs of two syllables is computed and used to find the consonant-vowel boundary. This approach also adopts dynamic time warping to find an optimal matching condition between the two input syllables. Thirdly, according to the optimal matching condition, the spectral distinctiveness of each frequency band is estimated by using the Euclidean metric. Finally, a morphological gradient operator is applied to automatically identify the regions of spectral distinctiveness. To examine the proposed approach, an acoustic cues manipulation is also proposed to manipulate the regions of spectral distinctiveness. These procedures are described in turn in the following subsections.
2.1. Feature Extraction. Analytical parameters that accurately represent a speech signal play an important role in the objective measurement of spectral distinctiveness. Since MFCCs have been widely used in speech processing, especially speech recognition [21], they are suitable for representing not only a speech signal but also its components in different frequency bands. The procedure to extract MFCCs from a speech signal is as follows: (1) take the Fourier transform of frames windowed from the input speech signal; (2) map the powers of the spectrum onto the mel scale, which is defined as

$$ f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{hz}}{700}\right), \tag{1} $$

where $f_{hz}$ is the frequency (Hz) in the linear domain; (3) use triangular overlapping windows to get the power spectrum in the mel scale; (4) take the logs of the powers at each of the mel frequencies, denoted as $S(n)$; (5) take the discrete cosine transform of the mel log powers, which is defined as

$$ c(k) = w(k) \sum_{n=0}^{N-1} S(n) \cos\left(\frac{\pi (2n+1) k}{2N}\right), \quad 0 \le k \le N-1, \tag{2} $$

where $N$ is the length of window size and $w(k)$ is defined as

$$ w(k) = \begin{cases} \sqrt{1/N}, & k = 0, \\ \sqrt{2/N}, & 1 \le k \le N-1; \end{cases} \tag{3} $$

(6) finally, the MFCCs are composed of the amplitudes of the resulting spectrum, $c(k)$.
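Steps (2), (4), and (5) of the extraction procedure can be illustrated with a minimal Python sketch. It assumes the common mel mapping with constants 2595 and 700 and an orthonormal DCT-II; the function names (`hz_to_mel`, `dct_ii`, `mfcc_from_mel_powers`) and the default of 13 coefficients are hypothetical choices for illustration and are not taken from the paper. The Fourier transform and triangular filtering of steps (1) and (3) are assumed to have already produced the mel-band powers.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale (assumed common variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def dct_ii(values):
    """Orthonormal DCT-II of a list of numbers."""
    n = len(values)
    coeffs = []
    for k in range(n):
        # Normalization weight: sqrt(1/N) for k = 0, sqrt(2/N) otherwise.
        weight = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2.0 * n))
                  for i, v in enumerate(values))
        coeffs.append(weight * acc)
    return coeffs

def mfcc_from_mel_powers(mel_powers, num_coeffs=13):
    """Log-compress mel-band powers, then keep the first DCT coefficients."""
    # A small floor avoids log(0) for empty bands (an illustrative choice).
    log_powers = [math.log(max(p, 1e-12)) for p in mel_powers]
    return dct_ii(log_powers)[:num_coeffs]
```

For a flat mel spectrum, only the zeroth coefficient is nonzero, which is a quick sanity check on the transform.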
In Mandarin, a syllable can be decomposed into an INITIAL and a FINAL. INITIALs consist of consonants or semivowels, and FINALs consist of vowels or vowels plus one of the two nasal sounds. Thus, two syllables represented as $S_x = C_x V$ and $S_y = C_y V$ are used to estimate the regions of spectral distinctiveness. In this study, let $X = x_1 x_2 x_3 \ldots x_I$ and $Y = y_1 y_2 y_3 \ldots y_J$ represent the MFCCs of $S_x$ and $S_y$, respectively.
In addition, a filter bank is adopted to separate the input signal into multiple components, which are the acoustical characteristics of the frequency bands. For the $b$th frequency band, the corresponding MFCCs of $S_x$ and $S_y$ are also extracted and denoted as $X^b = x^b_1 x^b_2 x^b_3 \ldots x^b_I$ and $Y^b = y^b_1 y^b_2 y^b_3 \ldots y^b_J$.

2.2. Spectral Distinctiveness Estimation.
To estimate the spectral distinctiveness between two syllables, their mismatch in the time domain must first be resolved so that the difference in each frequency band can be estimated by using the Euclidean metric. Therefore, the dynamic time warping algorithm is adopted to compare the two MFCC sequences $X$ and $Y$, of lengths $I$ and $J$. The plane spanned by $X$ and $Y$ is considered as a distance matrix $D$, which is written as

$$ D = [d_{ij}], \quad 1 \le i \le I, \; 1 \le j \le J, \tag{4} $$

where $d_{ij}$ is the Euclidean distance between $x_i$ and $y_j$. The matching condition, which indicates the correspondence between the time axes of $X$ and $Y$, can be represented as a sequence of lattice points on the plane $D$ and written as

$$ H = \{h_1, h_2, \ldots, h_K\}, \tag{5} $$

where $h_k$ is the $k$th matching pair in $H$. The best matching condition is denoted as $\hat{H}$, and the dynamic time warping algorithm used to find it is shown in Algorithm 1.
In the dynamic time warping algorithm, the variable Path($a$, $b$) stores the path that reaches lattice point ($a$, $b$) with the minimum accumulative distance, and the minimum accumulative distance itself is stored in the variable DTW($a$, $b$). Since the durations are quite different for each INITIAL, the boundary condition is ignored in this study. Hence, in step 1, the path of each lattice point in the first row and first column goes through its previous lattice point. To stop the backtracking when the optimal path is traced, Path(1, 1) is set to (0, 0). In addition, the monotonicity and continuity conditions are imposed on the matching condition in step 2. This means that the search space of lattice point ($a$, $b$) includes three lattice points: ($a - 1$, $b$), ($a - 1$, $b - 1$), and ($a$, $b - 1$). In step 3, the optimal matching condition $\hat{H}$ is found by backtracking from lattice point ($I$, $J$), where $I$ and $J$ are the numbers of frames in the two syllables.
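The three steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: frames are plain lists of floats, the function names `euclidean` and `dtw` are hypothetical, and ties between predecessors are broken in the order ($a - 1$, $b$), ($a - 1$, $b - 1$), ($a$, $b - 1$), which the text does not specify.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))

def dtw(x, y):
    """Align frame sequences x and y.

    Returns (minimum accumulated distance, path), where path is a list of
    1-based (a, b) lattice points from (1, 1) to (len(x), len(y)).
    """
    n_x, n_y = len(x), len(y)
    dist = [[euclidean(x[a], y[b]) for b in range(n_y)] for a in range(n_x)]
    acc = [[0.0] * n_y for _ in range(n_x)]
    prev = [[None] * n_y for _ in range(n_x)]

    # Step 1: the first row and column can only come from the previous point.
    acc[0][0] = dist[0][0]
    for a in range(1, n_x):
        acc[a][0] = acc[a - 1][0] + dist[a][0]
        prev[a][0] = (a - 1, 0)
    for b in range(1, n_y):
        acc[0][b] = acc[0][b - 1] + dist[0][b]
        prev[0][b] = (0, b - 1)

    # Step 2: each point extends the cheapest of its three predecessors.
    for a in range(1, n_x):
        for b in range(1, n_y):
            candidates = [
                (acc[a - 1][b], (a - 1, b)),
                (acc[a - 1][b - 1], (a - 1, b - 1)),
                (acc[a][b - 1], (a, b - 1)),
            ]
            best_cost, best_prev = min(candidates, key=lambda c: c[0])
            acc[a][b] = dist[a][b] + best_cost
            prev[a][b] = best_prev

    # Step 3: backtrack from the last lattice point until no predecessor.
    path = []
    node = (n_x - 1, n_y - 1)
    while node is not None:
        path.append((node[0] + 1, node[1] + 1))
        node = prev[node[0]][node[1]]
    path.reverse()
    return acc[n_x - 1][n_y - 1], path
```

When one frame of the first sequence is matched to two frames of the second, the returned path contains that frame index twice, which is the situation the duration manipulation later relies on.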
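According to the optimal matching condition $\hat{H}$ and the MFCCs of each frequency band, a distinguishable matrix for syllable $S_x$, $M(S_x)$, can be estimated as

$$ M(S_x) = \begin{bmatrix} d^1_1 & d^1_2 & \cdots & d^1_I \\ d^2_1 & d^2_2 & \cdots & d^2_I \\ \vdots & \vdots & \ddots & \vdots \\ d^B_1 & d^B_2 & \cdots & d^B_I \end{bmatrix}, \tag{6} $$

where $B$ is the number of filter bands and $d^b_i$ is the distance for frame $i$ at the $b$th frequency band. $d^b_i$ can be defined as

$$ d^b_i = \frac{1}{\#i} \sum_{j:\,(i,j) \in \hat{H}} \left\| x^b_i - y^b_j \right\|, \tag{7} $$

where $\#i$ is the number of frames in $S_y$ that are matched with frame $i$ in $S_x$.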
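The band-wise distances can be sketched in Python as follows, assuming that the distance for frame $i$ is the mean Euclidean distance over all frames of the second syllable matched to it. The function name `distinguishable_matrix`, the nested-list data layout, and the value 0.0 for an unmatched frame are hypothetical choices for illustration.

```python
import math

def distinguishable_matrix(x_bands, y_bands, path):
    """Build a B-by-I matrix of band-wise frame distances.

    x_bands[b][i] and y_bands[b][j] are feature vectors (lists of floats)
    of the two syllables in band b; path holds 1-based (i, j) matches.
    """
    num_frames = len(x_bands[0])
    # Collect, for every frame i of x, the frames j of y matched to it.
    matches = {i: [] for i in range(1, num_frames + 1)}
    for i, j in path:
        matches[i].append(j)

    matrix = []
    for x_band, y_band in zip(x_bands, y_bands):
        row = []
        for i in range(1, num_frames + 1):
            matched = matches[i]
            if not matched:
                row.append(0.0)  # assumption: unmatched frames score zero
                continue
            total = 0.0
            for j in matched:
                xi, yj = x_band[i - 1], y_band[j - 1]
                total += math.sqrt(sum((p - q) ** 2 for p, q in zip(xi, yj)))
            row.append(total / len(matched))  # normalize by the match count
        matrix.append(row)
    return matrix
```

Each row of the result corresponds to one frequency band and each column to one frame of the first syllable, so large entries mark time-frequency cells where the two syllables differ most.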

Mathematical Problems in Engineering
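Step 1. Initialization
  DTW(1, 1) = $d_{11}$; Path(1, 1) = (0, 0)
  For $a$ = 2 to $I$: DTW($a$, 1) = DTW($a - 1$, 1) + $d_{a1}$; Path($a$, 1) = ($a - 1$, 1)
  For $b$ = 2 to $J$: DTW(1, $b$) = DTW(1, $b - 1$) + $d_{1b}$; Path(1, $b$) = (1, $b - 1$)
Step 2. Recursion
  For $a$ = 2 to $I$
    For $b$ = 2 to $J$
      DTW($a$, $b$) = $d_{ab}$ + min{DTW($a - 1$, $b$), DTW($a - 1$, $b - 1$), DTW($a$, $b - 1$)}
      Path($a$, $b$) = the lattice point that attains the minimum
    End For
  End For
Step 3. Backtracking and Termination
  The optimal (minimum) distance is DTW($I$, $J$).
  The optimal matching path $\hat{H}$ is found by simple backtracking from Path($I$, $J$).
Algorithm 1: The dynamic time warping algorithm.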

2.3. Spectral Distinctiveness Identification.
Once the spectral distinctiveness between two syllables has been estimated, the regions of spectral distinctiveness must be identified from the distinguishable matrix $M$. The grayscale morphological gradient operator is a powerful and fast technique for both contour detection and region-based segmentation [22]. Thus, it can be used to detect the regions of spectral distinctiveness. To obtain the regions of spectral distinctiveness from $M$, the morphological gradient operator is used and defined as

$$ G_s(M) = (M \oplus B_s) - (M \,\Theta\, B_s), \tag{8} $$

where $s$ is the scale of the morphological gradient operator and $B_s$ denotes the group of square structuring elements.
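A minimal Python sketch of the gradient for one flat square structuring element is given here. For a flat element, grayscale dilation and erosion reduce to the local maximum and minimum over the element's footprint. The function names, the `half` parameter (the element is `2 * half + 1` cells wide), and the clipping of the footprint at the matrix border are illustrative assumptions.

```python
def _neighborhood(matrix, r, c, half):
    """Values inside the square footprint centred on (r, c), clipped to the matrix."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[rr][cc]
            for rr in range(max(0, r - half), min(rows, r + half + 1))
            for cc in range(max(0, c - half), min(cols, c + half + 1))]

def dilate(matrix, half=1):
    """Grayscale dilation with a flat square element: local maximum."""
    return [[max(_neighborhood(matrix, r, c, half))
             for c in range(len(matrix[0]))] for r in range(len(matrix))]

def erode(matrix, half=1):
    """Grayscale erosion with a flat square element: local minimum."""
    return [[min(_neighborhood(matrix, r, c, half))
             for c in range(len(matrix[0]))] for r in range(len(matrix))]

def morphological_gradient(matrix, half=1):
    """Dilation minus erosion; large values mark sharp changes."""
    dil, ero = dilate(matrix, half), erode(matrix, half)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dil, ero)]

def threshold_mask(gradient, threshold):
    """Boolean mask of cells whose gradient exceeds a chosen threshold."""
    return [[value > threshold for value in row] for row in gradient]
```

On a matrix with a sharp step between a low and a high region, the gradient is nonzero only in the cells adjacent to the step, which is how the boundaries of a distinctive region are located.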
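In (8), the symbols $\oplus$ and $\Theta$ are the grayscale dilation and grayscale erosion, which are defined as follows:

$$ \begin{aligned} (f \oplus b)(x, y) &= \max\{ f(x - u, y - v) \mid (x - u, y - v) \in D_f, \; (u, v) \in D_b \}, \\ (f \,\Theta\, b)(x, y) &= \min\{ f(x + u, y + v) \mid (x + u, y + v) \in D_f, \; (u, v) \in D_b \}, \end{aligned} \tag{9} $$

where $b$ is a flat structuring element. $D_f$ and $D_b$ are the domains of $f$ and $b$, respectively. According to (9), the grayscale opening and closing can then be derived as

$$ f \circ b = (f \,\Theta\, b) \oplus b, \qquad f \bullet b = (f \oplus b) \,\Theta\, b. \tag{10} $$

2.4. Acoustic Cues Manipulation. The accuracy of the identified regions of spectral distinctiveness is examined in this subsection. To do so, the power of the spectrogram in these regions is manipulated to check whether a syllable is thereby converted into another. Thus, a speech modification procedure based on the short-time Fourier transform (STFT) is used to analyze a speech sound and then synthesize an enhanced speech sound [23].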
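The speech signal $x[n]$ is first divided into overlapping frames with an analysis window $w[n]$ of length $N$:

$$ x_m[n] = w[n]\, x[n + mR], \quad 0 \le n \le N - 1, \tag{11} $$

where $R$ is the step size and is defined as $N/4$. Therefore, the resulting STFT coefficients $X[m, k]$ can be derived as

$$ X[m, k] = \sum_{n=0}^{L-1} x_m[n]\, e^{-j 2 \pi k n / L}. \tag{12} $$

To improve the accuracy of modification, the windowed speech is zero-padded to a length $L \ge N$ before performing the Fourier transform.
The region of spectral distinctiveness is then modified by multiplying it by a specific gain $G[m, k]$. Specifically, $G[m, k] = 0$ indicates feature removal, $0 < G[m, k] < 1$ corresponds to feature attenuation, and $G[m, k] > 1$ represents feature enhancement. Thus, the modified speech spectrum can be written as

$$ \tilde{X}[m, k] = G[m, k]\, X[m, k]. \tag{13} $$

Generally, the gain is expressed in dB as

$$ G_{dB}[m, k] = 20 \log_{10} G[m, k]. \tag{14} $$

According to $\tilde{X}[m, k]$, the single frame signal is recovered by applying an inverse Fourier transform, which is defined as follows:

$$ \tilde{x}_m[n] = \frac{1}{L} \sum_{k=0}^{L-1} \tilde{X}[m, k]\, e^{j 2 \pi k n / L}. \tag{15} $$

Finally, an overlap-add synthesis is used to generate the modified speech $\tilde{x}[n]$ in the time domain, which can be written as

$$ \tilde{x}[n] = \sum_{m} \tilde{x}_m[n - mR]. \tag{16} $$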
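The gain step can be illustrated with a short Python sketch that converts a gain in dB to a linear factor and applies it to a rectangular time-frequency region. It assumes an amplitude gain (a factor of 20 in the logarithm) applied to complex STFT coefficients stored as nested lists indexed by frame and frequency bin; the function names and the rectangular region arguments are hypothetical.

```python
def db_to_linear(gain_db):
    """Convert an amplitude gain in dB to a linear multiplier."""
    return 10.0 ** (gain_db / 20.0)

def apply_region_gain(spectrum, frames, bins, gain_db):
    """Scale the STFT coefficients inside a rectangular region.

    spectrum[m][k] is the complex coefficient of frame m and bin k;
    frames and bins are range objects that delimit the region.
    Coefficients outside the region are returned unchanged.
    """
    gain = db_to_linear(gain_db)
    return [[coeff * gain if (m in frames and k in bins) else coeff
             for k, coeff in enumerate(row)]
            for m, row in enumerate(spectrum)]
```

For example, `apply_region_gain(spectrum, range(0, 5), range(40, 60), 9.0)` would enhance bins 40 to 59 of the first five frames by 9 dB, while a large negative value would attenuate the same region.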

Results and Discussions
To evaluate the proposed approach, a closed-set speech-perception test on stop consonants was performed in this study. The speech stimuli, including the syllables /da, ga, ka, ba, pa, ta/, were chosen from the University of Pennsylvania's Linguistic Data Consortium (LDC) LDC2005S22. The detailed experimental results are as follows.
3.1. Results of Manipulating /ta/ and /ka/. In this subsection, the syllables /ta/ and /ka/ are used to illustrate the results of the proposed approach. First, the regions of spectral distinctiveness were manually identified to check the results of our approach. The spectrograms of /ta/ and /ka/ are shown in Figure 2. /ta/ has a high-frequency burst above 4 kHz (marked with a black rectangle) and /ka/ has a low-frequency burst at about 1 kHz (marked with a black rectangle). These two regions should be very important for distinguishing /ta/ from /ka/.
Second, the results of the distance matrix and the dynamic time warping algorithm are examined. The distance matrix $D$ estimated from /ta/ and /ka/ is shown in Figure 3(a). Because the FINALs of /ta/ and /ka/ are the same, the distances between the speech segments of the FINALs are very small, and an optimal path clearly exists along the 45-degree diagonal. In Figure 3(b), the optimal matching path is successfully detected by the dynamic time warping algorithm. This demonstrates that the mismatch between two syllables in the time domain can be handled by our approach. In addition, the first frame and the third frame of /ka/ each match three frames of /ta/; thus, the first and third frames of /ka/ should each be duplicated three times when /ka/ is manipulated into /ta/. This also shows that our approach can be applied to correctly increase the duration of a syllable.
Third, the spectral distinctiveness between /ta/ and /ka/ measured by our approach is validated against the manually identified results. With the optimal matching path (shown in Figure 3(b)), the MFCCs of /ta/ and /ka/ in different frequency bands were used to estimate the spectral distinctiveness. The results were normalized in the time domain and are shown in Figure 4.
Comparing Figures 4 and 2, the differences in the spectrograms are closely matched by the estimated distinctiveness. By selecting a suitable threshold in the morphological gradient operator, the regions of spectral distinctiveness for /ta/ and /ka/ can be identified (shown in Figure 5), and they are very similar to the expected regions (shown in Figure 2).
Finally, the regions of spectral distinctiveness were manipulated so that their accuracy could be examined by subjects. According to the identified regions in Figure 5, the acoustic cues manipulation was applied to modify /ta/ and /ka/. /ta/ and /ka/ were converted to /ka/ (denoted as /ta → ka/) and /ta/ (denoted as /ka → ta/), respectively. The spectrograms of /ta → ka/ and /ka → ta/ are shown in Figure 6. Comparing Figure 6(a) with Figure 2(b), the spectral energy at about 1 kHz has been relatively decreased. Comparing Figure 6(b) with Figure 2(a), the spectral energy above 4 kHz increases. Therefore, /ta/ and /ka/ are heard as /ka/ and /ta/, respectively. Thus, the identified regions play an important role in distinguishing one syllable from another.

3.2. Experimental Results of Subject Evaluation.
In this subsection, the results of the manipulated syllables are used to examine the identified regions of spectral distinctiveness. Seven males and three females (college students, age about 30 years) participated in this study. Each token with and without manipulation was randomly presented to each subject 5 times. The speech stimuli were played at the most comfortable level (around 70 dB SPL) for the listeners. The gain $G$ in (14) was set to 3 dB, 6 dB, 9 dB, and 12 dB. After each presentation, subjects responded to the stimulus by clicking on one of two buttons labeled with syllables. The detailed results in recognition rate (%) are shown in Table 1. The experimental results show that the recognition rates are all over 86%. Moreover, the average recognition rates of the manipulated syllables are 89.53%, 91.27%, 92.87%, and 92.80% for $G$ of 3 dB, 6 dB, 9 dB, and 12 dB, respectively. The best recognition rate is achieved when the gain $G$ is set to 9 dB. For the larger gain, speech intelligibility is distorted.
To objectively compare these results, the syllables without manipulation were also tested, and the results are shown in Table 2. The average recognition rate is 94.93%, which is very similar to that of the syllables with manipulation ($G$ = 9 dB). This means that a syllable can be heard as another syllable by manipulating these regions of spectral distinctiveness. Hence, the identified regions of spectral distinctiveness play an important role in speech perception. SLPs can then apply the identified regions of spectral distinctiveness to help a subject with hearing loss increase his or her ability to distinguish one syllable from another, thereby facilitating the process of speech-perception training.

Conclusions
In this study, an objective approach is proposed to identify the regions of spectral distinctiveness between two syllables. The MFCCs are appropriate to represent not only the speech signal but also the speech components in different frequency bands. In addition, the use of dynamic time warping overcomes the mismatch between two speech signals in the time domain. According to the optimal matching condition, the spectral distinctiveness of each frequency band between two syllables is estimated by using Euclidean metrics. The regions of spectral distinctiveness are then identified by the morphological gradient operator. The experimental results demonstrate that the identified regions play an important role in distinguishing one syllable from another. In the future, the regions of spectral distinctiveness should be automatically enhanced and extensively used in speech-perception training; this could efficiently reduce the workload of SLPs and facilitate the process of developing speech perception.

Figure 1: The flowchart of the objective approach to identify the regions of spectral distinctiveness.

Table 2: Recognition rates (%) of syllables without manipulation.