Using Innovative Acoustic Analysis to Predict the Postoperative Outcomes of Unilateral Vocal Fold Paralysis

Objective. Autologous fat injection laryngoplasty is ineffective for some patients with iatrogenic vocal fold paralysis, and additional laryngeal framework surgery is often required. An acoustically measurable outcome predictor for lipoinjection laryngoplasty would assist phonosurgeons in formulating treatment strategies. Methods. Seventeen thyroid surgery patients with unilateral vocal fold paralysis participated in this study. All subjects underwent lipoinjection laryngoplasty to treat postsurgery vocal hoarseness. After treatment, patients were assigned to success and failure groups on the basis of voice improvement. Linear prediction (LP) analysis was used to construct a new voice quality indicator, the number of irregular peaks (NIrrP). It was compared with the measures used in the Multi-Dimensional Voice Program (MDVP), such as jitter (frequency perturbation) and shimmer (perturbation of amplitude). Results. For the [i] vowel produced by patients before lipoinjection laryngoplasty, receiver operating characteristic analysis showed NIrrP (AUC = 0.98, 95% CI = 0.78–0.99) to be a more accurate predictor of long-term surgical outcomes than jitter (AUC = 0.73, 95% CI = 0.47–0.91) and shimmer (AUC = 0.63, 95% CI = 0.37–0.85). Conclusions. NIrrP measured using the LP model could be a more accurate outcome predictor than the parameters used in the MDVP.


Introduction
Speech problems affect human communication. Degradation in voice quality can have a negative impact on a patient's daily life and in extreme cases can even lead to sociophobia [1]. This paper focuses on unilateral vocal fold paralysis (UVFP), which is a possible cause of dysphonia [2]. Iatrogenic UVFP caused by thyroid surgery can persist for 6 to 9 months after surgery. If the natural recovery process fails, patients may be required to undergo various types of phonosurgery, such as thyroplasty or injection laryngoplasty, to correct their voice impairment. However, few studies have reported on the outcome of lipoinjection laryngoplasty for iatrogenic UVFP after thyroid surgery.
Lipoinjection laryngoplasty is a conservative method for treating UVFP because autologous fat is a self-derived tissue that presents almost no tissue rejection concerns. It can also improve voice quality even in patients whose UVFP recovers naturally [3]. Moreover, the efficacy of lipoinjection laryngoplasty lasts for 12 months on average, reducing symptoms such as choking and glottal incompetence [4][5][6]. However, long-term outcomes are unpredictable because of reabsorption of the fat, with treatment failure rates of 30% after 2 years and 45% by 4 years [7]. Because of the unpredictable surgical outcome, repeated injections or laryngeal framework surgery such as thyroplasty are required. Preoperative prediction of the voice outcome is therefore desirable both to improve patient selection for lipoinjection laryngoplasty and to ensure early intervention after recurrence of hoarseness [8,9].
Existing tools for evaluating vocal hoarseness include quality-of-life questionnaires such as the Voice Handicap Index (VHI), perceptual voice analysis such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), acoustic analysis using the Multi-Dimensional Voice Program (MDVP) in a computerized speech laboratory, and stroboscopic analysis [10]. The predictive power of jitter, shimmer, and the VHI for surgical outcomes of injection laryngoplasty has been reported in the literature, but no consensus has yet been reached on their effectiveness [10,11]. Although some reports have claimed that lipoinjection laryngoplasty surgery reduces jitter and shimmer as measured using the MDVP for patients with UVFP, correlation with surgical outcomes has been inconsistent [10,11]. Although autologous fat absorption can be detected early by using stroboscopy, stroboscopy is difficult to perform in patients with a strong gag reflex [12].
MDVP has been found to be a reliable and objective tool for assessing voice quality in patients with UVFP after injection laryngoplasty [13]. However, it has often failed in the biomedical signal analysis of highly degraded signals [14]. In this study, an acoustic inverse scattering technique called linear prediction (LP) was used to evaluate each patient's voice before and after lipoinjection. Huang et al. [15] found that LP could enhance the periodicity in noisy speech signals. LP, when used for voice synthesis purposes, could improve the perceptual voice quality and restore the harmonic structure of the speech [15]. We therefore investigated an alternative method of clinically analyzing hoarse voice by identifying new acoustic parameters that can predict the outcome of lipoinjection laryngoplasty surgery in UVFP patients. LP is a mathematical technique that simultaneously estimates the glottal source wave (GSW) forms and the filtering effects provided by the vocal tract [16]. Because of its simultaneous source-filter estimation capability, LP has been widely used for digital speech communication purposes, such as speech synthesis and speech data compression [15]. In the current research, LP enabled us to focus on the GSW forms while ignoring the filtering effects of the pharynx and oral cavity. LP is therefore suitable for investigating the functions of the vocal fold without being influenced by confounding factors caused by vocal tract filtering. The aim of the study was to find a new acoustic predictor to improve patient selection and evaluation of postoperative outcomes.
All patients provided informed consent before lipoinjection laryngoplasty and the study was approved by the Institutional Review Board of the hospital. The diagnosis of UVFP was based on two criteria: laryngoscopic confirmation and the lack of laryngeal electromyography responses in the unilateral thyroarytenoid muscle.
Following an observation period of 1 year, all patients received lipoinjection laryngoplasty for their hoarseness and choking problems. As noted, UVFP patients were divided into a success group and a failure group; a patient was assigned to the failure group if both of the following criteria were met: the patient had recurrent nonrecoverable hoarseness, and the patient therefore underwent revision lipoinjection laryngoplasty or thyroplasty 6 months after the initial injection.

Autologous Fat Injection Laryngoplasty.
Autologous fat for injection laryngoplasty was obtained from the periumbilical subcutaneous area. A 2 cm incision was made 0.5 cm beneath the umbilical area after local infiltration. Lidocaine hydrochloride (20 mL), dexamethasone (1 mL), 7% sodium bicarbonate (20 mL), and epinephrine (5 mg) were added to 500 mL of sodium chloride and mixed. Between 30 and 50 mL of the mixed solution was injected into the periumbilical subcutaneous area to elute fat for 5 minutes. Fat globules were then harvested using a 10 mL Storz injection syringe (Karl Storz, Tuttlingen, Germany). A total of 30-40 mL of subcutaneous adipose soft tissue was obtained and rinsed in 10 mL of regular insulin for 5 minutes after being washed in normal saline solution to remove blood clots. After soaking, the adipose tissue was loaded into a Storz Brünings-type laryngeal injector (Karl Storz, Tuttlingen, Germany) in preparation for injection. Patients were sedated using general anesthesia and a 5.5 or 6 mm oral endotracheal tube. A rigid suspension laryngoscope was used to expose the patient's vocal fold, and 1.5-2.0 mL of autologous fat was injected into the paralyzed side using an 18- or 19-gauge syringe. The injection point was at the posterior third of the membranous vocal fold, at the lateral aspect of the vocal process in the thyroarytenoid muscle. Injection at this point causes medialization of the paralyzed vocal fold. In practice, 20-30% bulging of the paralyzed vocal fold across the midline was achieved after the injection (Figures 1(a) and 1(b)).

Voice Laboratory Measures.
All of the patients underwent preoperative and postoperative acoustic recording and phonation studies. The perceptual evaluation of grade, roughness, breathiness, asthenia, and strain (using the GRBAS) and measurement of maximum phonation time (MPT) were performed by an otolaryngologist, Y-A T, before surgery and 6 months after it. Videostroboscopic examinations were also made, and they confirmed UVFP. The patients were asked to produce the vowels [a] and [i] at a stable pitch and loudness; the voice was then recorded and analyzed using the MDVP in a computerized speech laboratory system (CSL4500, Kay Elemetrics Corp, Lincoln Park, NJ, USA). The maximum phonation time was measured by the same phonosurgeon while patients produced a sustained [a] vowel. The midportion of the [a] and [i] vowel voice samples, which is considered a stable voice segment, was used for acoustic analysis. Fundamental frequency, jitter (frequency perturbation), shimmer (perturbation of amplitude), and harmonics-to-noise ratio (HNR) values were obtained using the MDVP [17].
All patients received laryngoscope and stroboscope examinations to survey the postoperation laryngeal gap, and the voices of the patients were recorded and analyzed using the MDVP both before the operation and at weekly intervals for 6 months after operation.
Based on their improvement in voice quality after the surgery, the patients were assigned to two groups. The lipoinjection treatment was considered a failure if the patient's voice was poorer within 6 months as determined using the MDVP or if an increased vocal slit was observed in stroboscopic analysis. Current clinical practice requires such patients to receive additional injections or permanent laryngeal framework surgery, such as medialization thyroplasty (silicone or Gore-Tex). The lipoinjection treatment was considered a success if the patient's voice quality improved and stroboscopy showed no slit during the maximal closure phase of the phonation cycle.

Digital Signal Processing Methods.
The digitally stored signals were analyzed offline by using computer programs written in MATLAB (Mathworks, Natick, MA, USA). The signal processing flow included preemphasis, windowing, LP, and feature extraction. The eventual goal was to examine the GSW form indicated by LP and to construct a useful predictor that might indicate the pathological status of the vocal fold. The signal processing methods are explained as follows.
The high-frequency components of the human voice have a roll-off tendency at a rate of 6 dB/octave when the sound waves radiate from the oral cavity. To compensate for this high-frequency attenuation, a filter was applied to the recorded signal to whiten the spectrum [18]:

$$y[n] = x[n] - \alpha\, x[n-1], \qquad (1)$$

where $x[n]$ denotes the raw signal at time $n$, $y[n]$ is the result after preemphasis, and $\alpha$ is the filter coefficient. Spectrum whitening is a technique used to compensate for high-frequency attenuation of a signal within its own bandwidth in order to improve the resolution and appearance of voice data. This technique can prohibit excessive boosting of background noise that is not produced by a patient with UVFP.
In the field of speech processing, preemphasis is an essential step preceding LP analysis, because the LP error term (to be defined later) can be regarded as an effective representation of glottal waves only if the raw spectrum has no tendency to roll off. Equation (1) provides a 6 dB/octave boost that counteracts roll-off caused by radiative loss.
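The preemphasis step can be illustrated with a short sketch. It assumes a first-order difference filter with a coefficient `alpha = 0.95`, which is a common textbook choice and not a value stated in this paper; the function name `preemphasis` and the toy inputs are hypothetical.

```python
def preemphasis(x, alpha=0.95):
    """First-order preemphasis: y[n] = x[n] - alpha * x[n-1], with x[-1] taken as 0.

    alpha = 0.95 is an assumed, illustrative value.
    """
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


# A constant (zero-frequency) input is strongly attenuated after the first sample,
# whereas a sign-alternating (highest-frequency) input is boosted.
dc = preemphasis([1.0] * 5)
alt = preemphasis([1.0, -1.0, 1.0, -1.0, 1.0])
```

The contrast between `dc` and `alt` shows the high-frequency boost that counteracts the roll-off described above.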
Because of the nonstationary property of the human voice, features of the voice signal were repeatedly extracted for each predetermined short period. The signal recorded within this period is referred to as a frame, and the rate at which features were extracted is called the frame rate. In this study, the length of a frame was set at 64 ms and the frame rate was the inverse of the length of the frame (15.625 frames/sec); in other words, frames did not overlap. In certain applications of spectral estimation, it is beneficial to multiply the frame with a windowing function to trade off resolution in the time domain and in the frequency domain. In this study, however, the frame was not further windowed by such a function; every sample maintained its original value after preemphasis.
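The framing described above might be sketched as follows. The sampling rate of 16 kHz used in the example is an assumption for illustration only (the recording rate is not given in this section), and `split_into_frames` is a hypothetical helper name.

```python
def split_into_frames(signal, sample_rate_hz, frame_ms=64):
    """Cut a signal into consecutive, non-overlapping frames of frame_ms milliseconds.

    No window function is applied; samples keep their values.
    A trailing partial frame is discarded.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of (silent) signal at an assumed 16 kHz sampling rate.
frames = split_into_frames([0.0] * 16000, sample_rate_hz=16000)
frame_rate = 1000 / 64  # frames per second when frames do not overlap
```

With 64 ms frames and no overlap, `frame_rate` equals 15.625 frames/sec, in agreement with the value given in the text.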
The LP technique is employed to approximate every sample in a signal as a linear combination of previous samples. The approximation can be written as follows:

$$y[n] = \sum_{k=1}^{p} a_k\, y[n-k] + e[n], \qquad (2)$$

where $y[n]$ is the preemphasized signal, $\{a_1, a_2, \ldots, a_p\}$ are called the LP coefficients, $p$ is the order of LP, and $e[n]$ denotes the approximation error signal. When the LP coefficients are chosen to minimize the mean square of $e[n]$, the spectrum of $e[n]$ is maximally flat [19]. In practice, $e[n]$ can be regarded as an estimation of the glottal source signal if $p$ is sufficiently large and an optimal set of coefficients $\{a_1, a_2, \ldots, a_p\}$ is found [20]. A vocal tract filter is simultaneously estimated, characterized by the acoustic transfer function $H(\omega)$:

$$H(\omega) = \frac{1}{1 - \sum_{k=1}^{p} a_k\, e^{-jk\omega}}, \qquad (3)$$

where $\omega = 2\pi f / f_s$ is the digital frequency in rad/sample (where $f_s$ denotes the sampling rate). Equations (3) and (2) can be rewritten as follows:

$$e[n] = y[n] - \sum_{k=1}^{p} a_k\, y[n-k]. \qquad (4)$$

This arrangement is interpreted as follows: $y[n]$ traverses the inverse filter of $H(\omega)$ so that any information concerning vocal tract filtering is removed. Therefore, the result $e[n]$ can be regarded as a representation of the glottal source wave (GSW).
In the present study, $p$ was fixed at 20, and the LP coefficients and the error (or excitation) signal $e[n]$ were obtained iteratively through the Levinson-Durbin algorithm [21]. LP analysis was performed for every frame of the recorded signals. Figure 2 shows typical results for GSWs, estimated using (4).
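As an illustration of the LP step, the following sketch implements the autocorrelation method with the Levinson-Durbin recursion and then inverse-filters the frame to obtain the error signal. It is a minimal pure-Python version written for this explanation, not the MATLAB code used in the study; all function names are hypothetical, and the demonstration signal is a synthetic decaying exponential rather than voice data.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return LP coefficients a_1..a_p (x[n] ~ sum_k a_k x[n-k]) and the final error energy."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. silence)
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err


def lp_residual(x, coeffs):
    """Inverse filtering: e[n] = x[n] - sum_k a_k x[n-k], with samples before the frame taken as 0."""
    e = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        e.append(x[n] - pred)
    return e


# Synthetic check: x[n] = 0.9**n is exactly predictable from one past sample.
x = [0.9 ** n for n in range(200)]
coeffs, _ = levinson_durbin(autocorrelation(x, 1), order=1)
residual = lp_residual(x, coeffs)
```

For a voice frame, the same functions would be called with `order=20` to match the LP order used in this study, and `residual` would play the role of the estimated GSW.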

Statistical Analysis.
The data were analyzed using the statistical analysis software SPSS 15. We used descriptive statistics to present the patients' demographic characteristics. An independent t-test was used to compare the success and failure groups, and a paired t-test was used to determine the statistical difference between preoperative and postoperative GRBAS, MPT, and voice parameters for [a] and [i] vowels. The statistical significance of differences in gender distribution between the two groups was analyzed by Fisher's exact test. To create receiver operating characteristic curves, we plotted sensitivity against 1 − specificity for the MDVP voice parameters and the new parameter for [a] and [i]. The area under the curve (AUC), with its 95% confidence interval, was used to determine the accuracy of each parameter [22,23].
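The AUC used here can be understood as the probability that a randomly chosen patient from one group has a higher parameter value than a randomly chosen patient from the other group. The sketch below computes it by direct pair counting; the function name `roc_auc` and the numbers are hypothetical illustrations, not patient data, and the confidence interval is not computed.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC by pair counting: fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as one half."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical parameter values for two groups (illustration only).
auc_overlap = roc_auc([3, 4, 5], [1, 2, 3])   # one tied pair
auc_perfect = roc_auc([7, 8, 9], [1, 2, 3])   # complete separation
```

An AUC of 1.0 means that a single threshold separates the two groups completely, whereas 0.5 corresponds to chance-level discrimination.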

Results
The patients ranged in age from 31 to 71 years (average = 54 y). The sex, age, and vocal problems of the patients are presented in Tables 1 and 2. The patients' fundamental frequency distribution was 140-230 Hz for the females and 70-190 Hz for the males. The failure group comprised 8 patients and the success group comprised 9 patients. There were 5 females and 4 males in the failure group, and 6 females and 2 males in the success group. The two groups did not differ significantly from each other in terms of gender distribution (P = 0.62). After grouping, we analyzed voice parameters as follows.
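The gender comparison can be reproduced from the counts given above (5 females and 4 males versus 6 females and 2 males). The sketch below is a plain two-sided Fisher's exact test written with the standard library; the function name is hypothetical, and the two-sided rule used (summing all tables no more probable than the observed one) is the common convention, assumed here to match the software used in the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Rows: the two outcome groups; columns: females, males (counts from the text).
p_value = fisher_exact_two_sided(5, 4, 6, 2)
```

Rounded to two decimals, `p_value` is 0.62, in agreement with the reported value.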

A New Voice Quality Indicator.
In the search for a new voice quality indicator, we first carefully inspected the GSWs, which are the voice signals originating from the vocal folds. As shown in Figure 2, when the lipoinjection surgery was successful, the GSWs indicated by LP became less noisy and prominent spikes emerged in the waveforms (as shown in Figure 2(f), compared with Figures 2(b) and 2(d)). This generated the following postulates: first, in a normal GSW, the prominent spikes should be of approximately equal height within a short frame; second, the prominent spikes should rise far above the noise floor, whereas the heights of other local maxima (or peaks) in the GSW are mostly near or below the noise floor.
Following these postulates, we constructed a new voice quality indicator, namely, the number of irregular peaks (NIrrP), by peak counting in each GSW frame. It represents the number of voice cycles whose peak amplitude in the LP system was outside the two expected ranges (below 25%, near the noise floor, or above 75%, near the prominent spikes), that is, within the intermediate 25-75% band. First, the root-mean-square (RMS) value of the GSW was treated as the noise floor (shown as the lowest dashed line in Figure 3). The maximum amplitude was then identified (shown as the highest dashed line in Figure 3), and the amplitude range of the GSW between the RMS value and the maximum was partitioned into four equal regions. Finally, peaks whose height fell within the 25-75% range between RMS and maximum were counted (marked with red asterisks in Figure 3). The NIrrP is defined as the average number of these peaks in each 64 ms frame in the GSW. In a failed case, the GSWs should appear noisy and glottal spikes should be obscured. Therefore, we expected the NIrrP to be higher in the failure group (see Figure 3(a)) than in the success group. Compared with the failure group, the success group showed significant postoperative differences in NIrrP and MDVP parameters for the [a] and [i] vowels (P < 0.05; Table 3).

Figure 3: Peak counting in GSWs. A typical example from a patient in the failure group (a) and an example from a patient in the success group (b) are shown. The lowest dashed line is the RMS value in the frame, and the highest line shows the maximum amplitude in the GSW of the frame. Peaks that fall within the 25% to 75% range between RMS and the maximum are marked with a red asterisk (*).
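The peak-counting rule can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: a peak is taken to be a simple local maximum of the frame, amplitudes are used as signed values, and the function names and toy frames are hypothetical.

```python
from math import sqrt


def count_irregular_peaks(frame, low=0.25, high=0.75):
    """Count local maxima whose height lies in the 25-75% band between
    the frame's RMS value (noise floor) and its maximum amplitude."""
    rms = sqrt(sum(v * v for v in frame) / len(frame))
    peak_max = max(frame)
    if peak_max <= rms:
        return 0  # no usable range between noise floor and maximum
    lower = rms + low * (peak_max - rms)
    upper = rms + high * (peak_max - rms)
    count = 0
    for i in range(1, len(frame) - 1):
        is_local_max = frame[i] > frame[i - 1] and frame[i] >= frame[i + 1]
        if is_local_max and lower <= frame[i] <= upper:
            count += 1
    return count


def nirrp(frames):
    """Average number of irregular peaks per frame."""
    return sum(count_irregular_peaks(f) for f in frames) / len(frames)


# Toy frames: equal-height spikes versus spikes of uneven height.
regular = [0, 0, 0, 10, 0, 0, 0, 10, 0, 0, 0, 10, 0, 0]
irregular = [0, 0, 0, 10, 0, 0, 0, 6, 0, 0, 0, 10, 0, 0, 0, 7, 0]
value = nirrp([regular, irregular])
```

In the regular frame every spike reaches the maximum and none is counted, whereas the two reduced spikes in the irregular frame fall inside the band and are counted.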

Comparing NIrrP against MDVP Parameters.
The accuracy of the NIrrP in predicting surgical outcomes was compared with that of the parameters calculated by the MDVP, such as jitter, shimmer, and HNR. The aim was to ascertain whether NIrrP and other parameters can be used to predict the long-term outcome of lipoinjection surgery. We first made a preoperative comparison between the success and failure groups across each of the parameters. The MDVP parameters and NIrrP for the failure and success groups are shown in Table 3. When the voice samples recorded before and after the operation were compared, statistically significant differences were observed only for jitter of the vowels [a] and [i] and NIrrP of the vowel [i]. The failure and success groups did not differ from each other in the jitter and shimmer values measured by MDVP before surgery. However, after surgery the groups differed significantly for GRBAS. We further quantified the predictive power of the parameters by the area under the receiver operating characteristic (ROC) curve (AUC). The results showed that NIrrP for the [i] vowel is the most favorable among all parameters for discriminating between successful and unsuccessful outcomes before autologous fat injection laryngoplasty (Figures 4(a) and 4(b)). The ROC results for the new voice parameter are given in Table 4.

Discussion
Permanent recurrent laryngeal nerve injury may occur after surgery for thyroid cancer or thyroid nodular goiter and may require laryngeal surgery [24]. Thyroplasty is typically the first choice to treat permanent UVFP and stable results have been presented in the literature [25]. However, foreign body reaction and migration of implants may appear in long-term follow-up. In addition, the neck will have an additional wound and fibrotic scar, which may also cause implant migration. Furthermore, patients undergoing thyroplasty may still have residual glottic insufficiency, for which salvage injection laryngoplasty is required [26,27]. Therefore, autologous fat injection laryngoplasty offers an alternative treatment strategy. Controversy remains over the choice of injection laryngoplasty or thyroplasty for UVFP patients following thyroid surgery. However, no superior methods for the prognosis of injection laryngoplasty have been reported in the literature. GRBAS, CAPE-V, jitter, shimmer, and HNR have been used to compare patients' voice quality before and after various types of laryngeal surgery [28]. In particular, voice grade has been reported to be a predictor for the surgical outcome of thyroplasty [29]. However, none of these parameters are able to predict the surgical outcome of injection laryngoplasty. Therefore, a reliable outcome predictor is needed to allow phonosurgeons to improve surgical decision making. MPT and GRBAS are commonly used in the clinic to evaluate a patient's voice and are subjective voice evaluations for vocal fold paralysis [30]. In a previous study, a significantly shorter MPT was found in patients with vocal fold paralysis than in normal subjects, and MPT was an appropriate predictor of outcome after thyroplasty [31]. The results reported by Morsomme et al. [32] also indicated that reduced GRBAS reflected the success of UVFP treatment.
Increased MPT and a decrease in GRBAS would be expected after a successful lipoinjection laryngoplasty. However, in the present study only an increase in MPT was found in both the success and failure groups after lipoinjection laryngoplasty, while no changes were found in GRBAS. The reason may be that lipoinjection laryngoplasty improves the glottic closure efficiency, and MPT was useful to assess the improvements [33]. GRBAS is emphasized in the perceptual assessment of voice quality [32]. Acoustic parameters such as fundamental frequency, jitter, shimmer, and HNR allow an objective evaluation of voice quality, which complements perceptual voice evaluation. According to Zhang et al. [34], the pathological voice of UVFP patients had higher jitter and shimmer values compared to normal voices. In the present study, receiver operating characteristic analysis was used to assess the jitter and shimmer characteristics in the patients' [a] and [i] vowels. Before autologous fat injection laryngoplasty, these two voice parameters showed acceptable sensitivity and specificity in distinguishing the success group of patients with UVFP from the failure group. These findings are consistent with previous studies. In our study, NIrrP of the patients' [i] vowel differentiated the success and failure groups excellently after lipoinjection laryngoplasty. NIrrP in our model detected cycle-to-cycle variations in period length and amplitude of the voice signal more precisely than jitter and shimmer. Thus, it could be a predictor of voice quality, indicating stability of vocal fold vibration, to be used before and after phonosurgery.
Like GRBAS, CAPE-V is measured subjectively; the jitter, shimmer, and HNR are obtained objectively by using the MDVP. However, these factors are affected by the vocal tract. This might explain their poor performance as surgical outcome predictors. By contrast, LP measures the source of the voice produced by the vocal folds and removes the filtering effects of the vocal tract [31]. In this research, the NIrrP test on the [i] vowel showed a significant ability to differentiate between the success and failure groups both before operation and after operation. The AUC of the NIrrP test (>0.9) was higher than that of jitter and shimmer [23]. The NIrrP had greater predictive power than jitter and shimmer even when tested after 6 months in [i] production, but not in [a] production. These results could be explained by the higher vocal tension of the thyroarytenoid muscle and cricothyroid muscle in production of the [i] vowel [35][36][37]. This increased vocal tension allows more effective vocal fold closure for the [i] vowel than the [a] vowel. Testing on [i] also showed less jitter and shimmer than that on [a] and reduced interference of noise from irregular movements during vocal fold vibration. Therefore, voice analysis on the [i] vowel provided more significant changes in predicting surgical outcomes. A recent study conducted tests on different vowels and attributed dysphonia to the different supraglottal changes during [a] and [i] phonation, which, in turn, affect vocal tract resonances [32]. A dynamic magnetic resonance imaging study demonstrated the different movement of the vocal folds for [a] and [i] vowels and revealed a lower larynx position in [a] vowel production than in [i] vowel production [33]. However, differences in testing [a] and [i] persist even in the LP model, in which voicing comes directly from the vocal fold and is not affected by the actions of the vocal tract, such as the pharynx, oral cavity, and tongue.
Our focus was on identifying outcome predictors for lipoinjection laryngoplasty using the LP model. We found that jitter and shimmer were weaker predictive factors than the NIrrP in [i] before the operation and 6 months after the operation. Further research is needed to explore these results in more detail. We also suggest extending the research to the production of different vowels, such as /u/, /e/, and /o/. A limitation of the study is the lack of any measure of thyroarytenoid muscle and cricothyroid muscle tension. We postulated that the [i] vowel requires higher thyroarytenoid and cricothyroid muscle tension and that this might directly produce different NIrrP results in the LP model. Our participant group was also relatively small; to establish more robust results, replication with larger groups of participants is recommended.
The NIrrP can be used as a reliable preoperative predictor of surgical outcomes. The NIrrP in [i] vowel testing in the LP model for lipoinjection laryngoplasty produced accurate outcome predictions both before surgery and 6 months afterward. This approach can assist phonosurgeons in making more precise diagnoses and improve patient selection for this type of surgery. The predictive power of the NIrrP in [i] vowel testing for lipoinjection laryngoplasty has not been previously reported, and this technique should be applied more widely in clinical practice.

Disclosure
Yung-An Tsou and Yi-Wen Liu contributed equally to this work as first authors.