Invariant Measures Based on the U-Correlation Integral : An Application to the Study of Human Voice

Nonlinear measures such as the correlation dimension, the correlation entropy, and the noise level were used in this article to characterize normal and pathological voices.These invariantswere estimated through an automated algorithmbased on the recently proposedU-correlation integral. Our results show that the voice dynamics have a lowdimension.The value of correlation dimension is greater for pathological voices than for normal ones. Furthermore, its value also increases along with the type of the voice. The low correlation entropy values obtained for normal and pathological type 1 and type 2 voices suggest that their dynamics are nearly periodic. Regarding the noise level, in the context of voice signals, it can be interpreted as the power of an additive stochastic perturbation intrinsic to the voice production system. Our estimations suggest that the noise level is greater for pathological voices than for normal ones. Moreover, it increases along with the type of voice, being the highest for type 4 voices. From these results, we can conclude that the voice production dynamical system is more complex in the presence of a pathology. In addition, the presence of the inherent stochastic perturbation strengthens along with the voice type. Finally, based on our results, we propose that the noise level can be used to quantitatively differentiate between type 3 and type 4 voices.


Introduction
The human voice is the most important means of communication among individuals.Thanks to vocal communication, activities like asking for help are apparently trivial in our daily routine.Thus, a voice disorder can limit our ability to cover our most basic needs, producing a negative impact on our quality of life.For this reason, it is very important not only to increase our knowledge about the mechanism of voice production but also to characterize its dynamics in normal and pathological conditions.
In the literature, several methodologies to assess human voices can be found.However, their reliability depends on the nature of the studied voice.Consequently, Titze proposed a qualitative classification for voice signals [1] Figure 1 shows the time series, the state space reconstruction, and the spectrogram of each type of voice.A normal voice (first column) is characterized for a quasiperiodic time representation and a smooth attractor in the reconstructed state space.Moreover, from its spectrogram, one can easily observe the fundamental frequency and its harmonics.A pathological type 1 voice (second column) displays a more irregular time series than the normal voice.Notice that although their attractors are similar, pathological type 1 attractor is not so smooth.Furthermore, both spectrograms are also similar in the sense that one can still distinguish a fundamental frequency and its harmonics.Nevertheless, the noise content blurs the harmonics at high frequencies.

Complexity
In the case of a pathological type 2 voice (third column), it is possible to observe a less regular time series, compared to the pathological type 1 voice.The volume occupied by its attractor has been reduced and it is more difficult to distinguish a smooth shape.In its spectrogram, one can find subharmonic frequencies.The pathological type 3 voice (fourth column) is characterized by an irregular time series.The state space reconstruction is similar to the one of white Gaussian noise.Its spectrogram shows the fundamental frequency but its harmonics are rapidly blurred.Finally, a pathological type 4 voice (fifth column) has an irregular time series representation and a regular state space reconstruction, like the type 3 voice.In its spectrogram, one can find a fundamental frequency but not the harmonics.
These representations are useful to differentiate between some kinds of voices, for example, between normal and pathological type 3 voices.However, it is very difficult to differentiate between a normal and a pathological type 1 voice or between a type 3 and a type 4 voice.The importance of this classification is based on the fact that traditional perturbation measures like jitter and shimmer give reliable results if they are applied to type 1 and type 2 signals.In contrast, nonlinear dynamics concepts should be used to characterize type 3 voices, whereas they are unreliable when applied to type 4 signals.Moreover, until now, the classification of type 4 voices has been done subjectively by visual inspection [2].
Over the last three decades, strong evidence about the nonlinear behavior of the voice production mechanism has been collected [3][4][5][6], leading the scientific community to develop concepts and methods based on nonlinear dynamics and chaos theories.There exist an extensive set of publications in which those concepts have been used to characterize healthy and pathological voices.The correlation dimension () and the correlation entropy ( 2 ) are two quantities that are used to characterize the complexity of a dynamical system.The former can be thought of as an estimation of the number of variables involved in the dynamics (degrees of freedom) [7].The latter measures the rate at which information about the dynamics is lost over time.More complex systems are commonly characterized by having higher dimensions and entropy values [8].
The correlation integral is the quantity used to estimate  and  2 [9].It has been widely used in the biomedical field since it was proposed by Grassberger and Procaccia in the early 1980s [9].
According to what we know so far, this correlation integral has been used in all studies that have estimated  and  2 from voice signals.However, it is well known that the Grassberger and Procaccia approach is not robust in the presence of noise, and it needs special conditions to converge [10].More robust variants of the correlation integrals have been proposed, like the Gaussian kernel correlation integral (GCI) [11] and the U-correlation integral [12].
The objective of this article is to characterize the dynamics of normal and pathological voices through the correlation dimension, the correlation entropy, and the noise level ().Furthermore, we seek to statistically analyze these invariants for the four types of voices.As a novelty, we obtain these invariants using U-correlation integral through an automated algorithm [12] which allowed us to avoid subjective judgements.Finally, we propose a new quantitative methodology to differentiate between type 3 and type 4 voices.

Materials and Methods
The correlation dimension and the correlation entropy are invariants that characterize the natural measure of a dynamical system [13].These invariants can be easily computed from the correlation integral which is obtained from indirect temporal measures of one of the variables of the system [9].For an observed stationary time series of length , {  }  =1 , the reconstructed -dimensional delay vectors x i = (  ,  + , . . .,  −(−1) ),  = 1, 2, . . .,  =  − ( − 1), must be formed.The correlation integral   (ℎ), in its general form, is defined as the probability that the distance z = ‖x i − x j ‖ between two randomly selected -dimensional delay vectors x i and x j is smaller than a value ℎ [13]: where (ℎ, z) is a kernel function and   z (z) is the probability density function (pdf) of the distance between pairs of delay vectors.In this article, we used the Euclidean distance.
The Grassberger and Procaccia correlation integral   (ℎ) uses as a kernel function (ℎ, z) = (1 − z/ℎ), where (⋅) is the Heaviside step function [9].On the other hand, the GCI proposed by Diks et al.,   (ℎ), adopts (ℎ, z) = exp(−z 2 /4ℎ 2 ) [11].The main advantage of the GCI over the Grassberger and Procaccia correlation integral is that the former allows us to model the influence of additive noise on the scaling law.
For a zero-mean time series of variance  2  with additive white Gaussian noise of variance  2  , the scaling rule is [11,14,15] where It is important to say that all time series used in this work were rescaled to have unitary standard deviation.In this sense,  is the noise level after rescaling; that is,  → 0 corresponds to clean time series and  → 1 implies a pure stochastic process.The signal-to-noise ratio (SNR) can be calculated as In practical applications, it is very important to be able to quantify , the noise level in the time series.This is because it allows us to correct the estimations of  and  2 .Moreover, the conclusions driven by the interpretation of the invariants calculated in the presence of high levels of noise should be taken carefully.For this reason, an estimate of the noise level must be also reported, allowing the readers to be aware of the limitations of the estimates.On the other hand, when the data is taken under controlled conditions, like the voices in the database analyzed herein, a high noise level can be seen as an indicator of a dominant underlying additive stochastic process.
The main disadvantage of the GCI is that it requires high values of  to obtain a convergent estimate of  2 [10,15].This is because the pdf of the interpoint distance changes with the embedding dimension, whereas the GCI's kernel does not (see (1)).As a result, the convergence of the GCI to  2 is slowed down [12].The last statement also applies to the Grassberger and Procaccia correlation integral.As a solution, we have recently proposed the U-correlation integral (UCI) [12]: where Γ(, ) is the upper incomplete Gamma function, Γ() is the Gamma function, and f (; ) is the pdf of the squared interpoint distance.There are two important aspects about this correlation integral that deserve to be mentioned.First, note that the UCI's kernel function, given by has a parameter  that is used to incorporate information about the embedding dimension.In other words, this kernel function is able to change according to .Second, we are now working with the square of the interpoint distance, that is,  = z2 , to reduce the computational cost.
Once the correlation sum is calculated, we estimated ,  2 , and  using coarse-grained estimators.These are explicit expressions for ,  2 , and  as functions of  and ℎ [13].Such functions are useful because they allow us to estimate the invariants and to visually confirm a scaling regime.In [16], we have presented three coarse-grained functions:    (ℎ),    (ℎ), and    (ℎ) which are based on the UCI.As an advantage, they can be calculated in an automated manner from two U-correlation integrals.Moreover, we have proposed an automated algorithm to estimate ,  2 , and  from these coarse-grained estimators [16].

Coarse-Grained Estimators Based on UCI.
Here, we present the coarse-grained estimators used in this study.A more detailed description can be found in [12,16].
We define the noise level functional as [12] where This functional is the difference between the logarithmic derivatives of two U-correlation integrals:  =  (ℎ), which is the UCI calculated with squared distances   coming from -dimensional delay vectors and a kernel with the parameter  = , and  = +2 (ℎ).The last one is the correlation integral obtained from distances between pairs of (+2)-dimensional delay vectors and a kernel with the parameter  = .

2.2.
Database.This study was conducted using the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database, distributed by Kay Elemetrics [17].We have chosen a subset of 219 records (53 normal and 166 pathological) of sustained phonation of vowel /a/ [18].Each record was downsampled to 10 kHz and the subjects' descriptive statistics are presented in Table 1.From this set of voices, we excluded 3 records because they contained less than 8000 sample points, ending with a total of 216 records to be analyzed.
For each record, a centered window of  = 8000 data points was selected and normalized to have zero mean and unitary standard deviation.Then, the U-correlation sums were computed for  = {4, 6, . . ., 20},  = 10, and ℎ ∈ [ −5 ,  2 ].Moreover, the nearest 10 temporal neighbors of each delay vector were discarded [7].Finally, Algorithm 1 was applied.We must clarify that the range of values for the embedding dimension was selected taking into account the studies conducted in [6,[19][20][21], where a low-dimensional vocal system is suggested ( < 5 for pathological voices).On the other hand, the embedding lag was selected as the average lag (over all voices) where the first minimum of the mutual information function occurs [7,10].
In a preliminary study, the pathological records were classified into Titze's scheme based on the visual observation of the time series and its spectrograms.The classification resulted in 74 type 1 records and 81 type 2 records.The remainder 11 voices where taken apart since these techniques cannot differentiate between type 3 and type 4 voices.
Following the definition of type 4 voices given by Sprecher et al. in [2], we calculated the coarse-grained estimators    (ℎ) and    (ℎ) for these 11 voices.Then, we grouped together the voices that did show a scaling regime (finite correlation dimension and entropy) as type 3 and those that did not as type 4.

Results
The first result of the simulations can be observed in Figure 2. It is important to observe that all coarse-grained estimators present a scaling region, except the ones for correlation dimension and  2 entropy of type 4 voices.This highlights the suitability of these estimators to analyze normal and pathological (type 1, type 2, and type 3) voices.
As it can be observed in the first row of Figure 2, the noise level is greater for the pathological voices than for the normal voice and, in the pathological case, it increases along with the type.This means that, for both normal and pathological voices, there is an underlying stochastic component and its level increases in the presence of pathology.
The behavior of the estimator    (ℎ) is presented in the second row of Figure 2. As it can be seen, this estimator suggests a value of  ≈ 1.25 for normal voices and slightly greater values for pathological type 1 and type 2 voices.However, it is difficult to say whether there is any difference between the analyzed normal and pathological voice.Nevertheless, these results suggest that the voice production system has a relatively low dimension.
Regarding the estimator    (ℎ), it can be observed from the third row of Figure 2 that, for the voices analyzed herein, the estimator converges to values close to zero.This suggests the presence of a strong harmonic component.Furthermore, there exists a small increase of  2 entropy from normal voice to pathological type 1 voice and from the latter to pathological type 2 voice.This reflects an increasing degree of irregularity which is often associated with an increase in complexity.
For the type 3 voice, it can be observed in the fourth column of Figure 2 that the noise level coarse-grained estimator converges to a higher value than the one corresponding to the type 2 voice.However, it is still possible to observe a scaling range in both    (ℎ) and    (ℎ).Moreover, these estimators converge to higher values than the ones for type 2 voice.On the other hand, for the type 4 voice (fifth column of Figure 2), it is not possible to find a scaling range either in dimension or in the entropy estimator.This behavior is expected since, by definition, these types of voices have an infinite value of dimension and entropy.
As we mentioned before, this kind of nonlinear techniques should not be used with type 4 voices.Note that, from the curves of    (ℎ) and    (ℎ) (Figures 2(j) and 2(o), resp.), it is not easy to decide whether there is a scaling range.In this sense, an untrained person could erroneously determine its existence, resulting in misleading estimations.
On the other hand, a scaling range can be found for    (ℎ) (Figure 2(e)), and it converges to a very high noise level value  ≈ 0.5 (SNR ≈ 4.77 dB).This suggests that the dynamics are mostly ruled by the underlying stochastic component.It is important to mention that this behavior is consistent in all voices that did not have a scaling region on    (ℎ) and    (ℎ) (type 4 voices).Based on this result, we suggest that a high value of  could be a quantitative indicator of type 4 voices.
In order to find a noise level value that allows us to differentiate between type 3 and type 4 voices, we select by visual inspection all voices that did not show a scaling region on the estimators    (ℎ) and    (ℎ).Then, for each signal, we estimate the noise level value from    (ℎ) and select the minimum of these values ( thr = 0.4) as a threshold.

Automated Estimation of Invariants.
In order to obtain reliable estimations of ,  2 , and , it is essential to verify the existence of a scaling behavior over the coarse-grained estimators.Once it is found, one has to choose a range of ℎ values from which to estimate the invariants.This choice is critical since, in practical application, the coarse-grained estimators strongly vary as a function of the scale.To avoid subjective judgements, we have proposed an algorithm (see Algorithm 1) for the automated estimation of ,  2 , and  based on the coarse-grained estimators    (ℎ),    (ℎ), and    (ℎ), respectively [16].This algorithm selects the scaling range where those invariants should be estimated based on the next criteria: (i) the coarse-grained estimator must be constant for a range of scale values and (ii) the value of the invariant should converge as the embedding dimension increases.
The algorithm begins by approximating the U-correlation integrals  =  (ℎ) and  =−2  (ℎ) for different  values ( > 2) using (8).Note that  =−2  (ℎ) must be calculated for  > 2 since the shape parameter of the incomplete Gamma and Gamma functions must be greater than zero.Next, the logarithmic derivatives Ḋ =  (ℎ) and Ḋ = +2 (ℎ) must be computed in order to obtain Δ   (ℎ) (see (9)).Observe that the correlation integral  = +2 (ℎ) is equal to the correlation integral  = m−2 m (ℎ) evaluated at m =  + 2. In this article, the logarithmic derivatives were obtained using a wavelet transform approach [22].This produces a smooth version of the derivatives allowing us to better estimate the invariants.
The noise level must be calculated from a range of ℎ where the coarse-grained estimator    (ℎ) is nearly constant and its variation across the  values is the smallest.In this sense, for the noise level, we define the functions [16] where  ∈ { 1 ,  2 , . . .,   , . . .,   } and σ (ℎ) is the average of    (ℎ) across .  (ℎ) is the average over  of the derivative of    (ℎ) with respect to ln ℎ,   (ℎ) gives the variation of    (ℎ) across , and   (ℎ) is the product of the two aforementioned functions.We propose to estimate  within a range of ℎ centered at the ℎ value at which   (ℎ) is minimum.This way,  is estimated in a range of ℎ centered in a plateaued region (scaling region) of    (ℎ), and its value is consistent through the parameter .
The correlation dimension and the correlation entropy can be determined using the coarse-grained estimators    (ℎ) (see (12)) and    (ℎ) (see (13)), respectively.To find a range of ℎ values to estimate  and  2 , we use a similar approach, but from functions   (ℎ) and   2 (ℎ), respectively.The functions   (ℎ) and   2 (ℎ) can be computed similarly to   (ℎ) but using the coarse-grained estimators    (ℎ) and    (ℎ), respectively.The estimation of each invariant for the whole group of signals is presented in Figure 3. Figure 3(a) shows a box plot of the noise level estimation for normal (N), pathological type 1 (PT1), pathological type 2 (PT2), pathological type 3 (PT3), and pathological type 4 (PT4) voices.As it can be observed,  is greater for pathological voices than for normal ones.However, it is not possible to establish a statistical difference between normal and pathological type 1 and type 2 voices.On the other hand, we can differentiate normal from pathological type 3 and type 4 voices.It is very important to say that the group labeled as TP4 was selected because they have a noise level value  > 0.4.Note that, using this threshold, we can separate type 3 from type 4 voices.
The box plot of the correlation dimension is presented in Figure 3(b).There is an increase of  from normal to pathological voices.For normal voices, the median is  = 1.4; for pathological type 1 voices,  = 1.68; for type 2 voices,  = 2.23; and for type 3 voices,  = 2.99.From these results, we can establish that the vocal system has a low dimension, even in the presence of a pathology.The group TP4 is not shown in this plot since, by definition, its dimension is infinite.Regarding the correlation entropy, it is shown in Figure 3(c) that there is a small increase of  2 from normal to pathological voices.However, it is not possible to find a statistical difference between normal, pathological type 1, and pathological type 2 voices.The median correlation entropy estimated for normal voices is  2 = 0.0038, for pathological type 1 voices is  2 = 0.0053, for pathological type 2 voices is  2 = 0.0068, and for pathological type 3 voices is  2 = 0.012.

Discussion
Bearing in mind that the used records were taken in a controlled environment, we could associate the noise level with the strength of an additive stochastic component that coexists with the dynamic producing the voice.There is a tendency of  to increase its value from normal voices to pathological ones.This can be seen as an increase of the power of the stochastic component caused by the presence of a pathology.Moreover, in pathological voices,  increases its value along with the voice type.
The definition of type 4 voices given by Sprecher et al. [2] states that these signals are characterized by pure stochastic oscillations; therefore, their dimension is infinite [10].In practice, it is not possible to measure an infinite correlation dimension value since it is bounded by the embedding dimension used to calculate the correlation integral.Instead, an infinite correlation dimension is inferred if there is not a scaling regime in the coarse-grained estimator of .In previous studies, this was done through visual inspection, which is always a subjective judgement [2,19].
One interesting aspect of the coarse-grained estimator    (ℎ) is that, in the type 4 signals analyzed here, it always presents a scaling range from which to estimate , although    (ℎ) and    (ℎ) have no scaling regions.These results led us to think that the noise level can be used as an objective measure to discriminate between type 3 and type 4 voices.
The threshold proposed here ( thr = 0.4) was set using the voice signals that did not present a scaling regime in    (ℎ) and    (ℎ).With this threshold, we were able to well separate type 3 and type 4 voices.Nevertheless, we are aware that this threshold was selected based on observations from this database.This finding must be validated with a more extensive study involving a larger number of records and voice care professionals.
Our estimations of correlation dimension are in concordance with other studies [6,[19][20][21].In [23], Choi et al. conducted a very similar simulation over the Kay Elemetrics database.They reported a mean value of  = 1.57for normal voices and an increased dimension value for pathological ones.Moreover, they obtained a decreasing SNR (it was calculated according to [24]) with the type of the voice, being the lowest for type 3 voices.However, they did not analyze type 4 voices.Another study by Zhang and Jiang conducted over the Kay Elemetrics database reported a mean correlation dimension of  = 1.51 for normal voices and  = 3.17 for a group with vocal tremor [25].As far as we know, all researches conducted on normal and pathological voices have used the Grassberger and Procaccia correlation integral to estimate  and  2 .This methodology has the disadvantage that its estimations are sensitive to noise presence, requires large  values to converge, and does not give an estimation of the noise level [10,13].However, other variants like the Gaussian correlation integral have not been used.
Regarding the correlation entropy, our results suggest that pathological type 1 and type 2 voices have a slightly greater value than normal ones.Furthermore, these three types of voices have a  2 value close to zero, suggesting nearly periodic dynamics.For type 3 voices, the values of  2 are the greatest, meaning more irregular and unpredictable dynamics.These values are comparable with those presented by Yan et al. [26].They reported an estimation of  2 for normal (mean value  2 = 0.014) and esophageal phonation (mean value  2 = 0.023) subjects.In [27], Calawerts et al. calculated the largest Lyapunov exponent for type 1, type 2, and type 3 voices using the Kay Elemetrics database.Their results suggest an increase of the value of the largest Lyapunov exponent along with the type of the voice.Our results are in concordance with this study since we obtain an estimation of correlation entropy that not only is greater for pathological Complexity voices than for normal ones but also increases with the type of the voice.

Conclusions
In this article, we have studied normal and pathological voices through the correlation dimension, the correlation entropy, and the noise level.These invariants were estimated using an automated algorithm based on coarse-grained estimators derived from the U-correlation integral.The results suggest that the voice production dynamical system has a low dimension.The value of  is greater for pathological voices than for normal ones.Moreover, its value also increases along with the type of the voice.Regarding the correlation entropy, its value is very low for normal and for type 1 and type 2 pathological voices.Although a more extensive study is still needed, this finding suggests that the system dynamics have a harmonic oscillatory behavior.On the other hand, pathological type 3 voices present higher values of  2 , implying a more complex behavior which is reflected in a more irregular dynamic.The noise level can be interpreted as the power of a stochastic perturbation intrinsic to the voice production system.Our results show that  is greater for pathological voices than for normal ones.Furthermore, it increases along with the type of voice, being the highest for type 4 voices.This means that the presence of the stochastic component is stronger in pathological voices.Based on these results, in this work, we have proposed a quantitative criterion that can be used to differentiate between type 3 and type 4 voices.We are aware of the limitation of the reduced samples of type 3 and type 4 voices.In this sense, these preliminary results will be validated with a more extensive study involving a larger number of records and voice care professionals.
. The scheme proposed by Titze divides the signals into three types: type 1 signals are nearly periodic voice signals, type 2 signals have strong modulation or subharmonics, and type 3 signals are characterized by a very irregular or even chaotic behavior.Sprecher et al. proposed a modification to this scheme.They redefined type 3 voices as being deterministic chaotic signals, adding a fourth type that is characterized by a dominant random-like behavior [2].

Figure 1 :
Figure 1: Time series, state space reconstruction, and spectrogram of normal and pathological voices.First column: normal voice; second column: pathological type 1 voice; third column: pathological type 2 voice; fourth column: pathological type 3 voice; fifth column: pathological type 4 voice.

Table 1 :
Analyzed subjects: number and age.