Automatic Prosodic Analysis to Identify Mild Dementia

This paper describes an exploratory technique to identify mild dementia by assessing the degree of speech deficits. Twenty participants took part in the experiment: ten patients with a diagnosis of mild dementia and ten healthy controls. The audio session for each subject was recorded following a methodology developed for the present study. Prosodic features in patients with mild dementia and healthy elderly controls were measured using automatic prosodic analysis on a reading task. A novel method was used to extract twelve prosodic features from the speech samples. The best classification accuracy achieved was 85%, using four prosodic features. The results show that the proposed computational speech analysis offers a viable alternative for automatic identification of dementia features in elderly adults.


Introduction
Dementia is a disorder characterized by an impairment of intellectual and communicative functioning, with high impact among elderly people. This disorder usually makes patients dependent on their families or caregivers because they can no longer carry out their daily tasks. Experts in the field generally agree that the number of patients with dementia is increasing around the world as society progressively ages [1].
With this fact in mind the early detection of the dementia syndrome becomes an important goal to slow down the development of cognitive deterioration, allowing either the use of alternative nonpharmacological therapies or short periods of pharmacological treatments. Current methods of dementia screening in older adults involve structured interviews. Questionnaire tests such as the Mini-Mental State Examination (MMSE) [2], Clinical Dementia Rating (CDR) [3], or Memory Impairment Screen (MIS) [4] are commonly used. These methods typically rely on prolonged interviews with the patient and a family member. Therefore, an automated method for screening of dementia is highly desirable.
Because language is one of the areas most significantly affected by dementia, much research has been oriented towards speech analysis, showing that language impairment is strongly related to cognitive impairment. Moreover, the first clues start to appear some years before the patient is clinically diagnosed [5,6].
In this paper we propose a framework that applies speech signal analysis to identify mild dementia (MD). In contrast to previous work on automating the evaluation of cognitive impairment through speech analysis, which relied on manual transcripts of audio recordings, our system uses a novel method for automatically detecting syllable nuclei in order to measure prosodic features without the need for a transcription. In the next sections we present an overview of our data, followed by a description of the proposed feature extraction procedure and of the classification technique used to determine whether or not a subject has mild dementia.

Experimental Subjects.
Within the framework of this exploratory work, speech recordings were conducted at the Center of Elderly Adult #2 in Santa Clara, Cuba. A total of twenty subjects were selected for this pilot experiment from a group of candidates. Our sample comprises participants older than sixty years, both with a diagnosis of mild dementia and healthy. Other inclusion criteria were basic reading skills and no significant visual impairments. All the work was performed strictly following the ethical considerations of the center, and the participants were informed about the research and gave their agreement. Table 1 shows demographic data, with no significant differences between groups in terms of gender, age, or years of education.

Recording Tools and Procedures.
To make the process of speech recording as unobtrusive as possible for daily clinical practice, a specific tool was developed. It consisted of a standard laptop equipped with two head-worn AKG C520L condenser microphones for capturing both clinician and patient voices. Each microphone was connected to a different channel (left or right) of an M-Audio ADC device connected to the laptop through a USB port. This configuration provides some acoustic separation, though not complete isolation, of patient and clinician voices, thus making them easier to process.
A dedicated program, DCGrab v3.0, was created by the authors to allow easy recording of the audio signals during each of the parts defined in our recording protocol. The speech was recorded in stereo format with 16 bits of resolution and a sampling rate of 44.1 kHz. The DCGrab v3.0 software also allows storing clinical data and demographic information for each patient (see Figure 1).

Protocol and Speaking Styles.
Two major conflicting criteria were considered for the design of the recording protocol in our database. On one side, it should represent the minimum possible burden on the busy schedules of daily clinical practices, while on the other side it should collect the richest variety of speaking styles which can result in a notable increase in testing time.
Consequently we decided to design a protocol consisting of two sequential parts. During the first part we recorded the speech productions from both clinician and patient during structured interviews commonly used in clinical assessment procedures. More specifically we considered the Mini Examen Cognoscitivo (MEC) which is the Spanish version of the Mini-Mental State Examination (MMSE) [2]. The MEC evaluation is our gold standard to classify each participant and evaluate the automatic method proposed in this work.
The second part of the interview consists of asking each enrolled subject to read a Spanish version of the paragraph "The Grandfather Passage" [7]: ¿Tú quieres saber todo acerca de mi abuelo? Pues bien, él tiene casi 93 años de edad y aún conserva su mente tan brillante como siempre. Él usa un viejo saco negro al que generalmente le faltan algunos botones. Su barba larga y colgante produce en quienes lo observan un

Prosodic Analysis.
To obtain the prosodic features of speech recordings by means of automatic prosodic transcription, a novel algorithm to automatically detect syllable nuclei was used. The proposed algorithm is mainly based on the method described in [8] for speech rate detection. The overall procedure is illustrated in Figure 2. The input speech signal is processed in parallel to obtain an automatic estimation of both syllable nuclei and fundamental frequency (F0 detection). In order to increase the temporal resolution of the energy envelope, the downsampling process is removed, and a smaller window size (10 ms) and overlap (5 ms) are used in the temporal weighting stage. Then, the syllabic nuclei are detected using the same threshold mechanism described in the peak counting stage in [8].
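As a rough illustration of the envelope and peak counting steps, the following sketch computes a short-time energy envelope with a 10 ms window and 5 ms overlap and keeps the local maxima above a threshold. It is not the algorithm of [8]: the subband and temporal weighting stages are omitted, and the function names, the relative threshold `rel_threshold`, and the synthetic test signal are assumptions made only for this example.

```python
import math

def frame_params(sample_rate, win_s=0.010, overlap_s=0.005):
    """Window and hop length in samples (10 ms window, 5 ms overlap)."""
    win = int(round(win_s * sample_rate))
    hop = win - int(round(overlap_s * sample_rate))
    return win, hop

def energy_envelope(samples, sample_rate):
    """Mean squared amplitude of each analysis frame."""
    win, hop = frame_params(sample_rate)
    return [
        sum(x * x for x in samples[start:start + win]) / win
        for start in range(0, len(samples) - win + 1, hop)
    ]

def pick_peaks(envelope, rel_threshold=0.3):
    """Frame indices of local maxima above rel_threshold * max(envelope).

    rel_threshold is a hypothetical value, not the threshold used in [8].
    """
    floor = rel_threshold * max(envelope)
    return [
        i for i in range(1, len(envelope) - 1)
        if envelope[i] > floor
        and envelope[i] > envelope[i - 1]
        and envelope[i] >= envelope[i + 1]
    ]

def frame_time(index, sample_rate):
    """Centre time of a frame, in seconds."""
    win, hop = frame_params(sample_rate)
    return (index * hop + win / 2) / sample_rate

# Synthetic stand-in for speech: two tone bursts centred at 0.15 s and 0.45 s.
fs = 8000
signal = []
for n in range(4800):
    t = n / fs
    amp = math.exp(-((t - 0.15) / 0.03) ** 2) + math.exp(-((t - 0.45) / 0.03) ** 2)
    signal.append(amp * math.sin(2 * math.pi * 200 * t))

peaks = pick_peaks(energy_envelope(signal, fs))
peak_times = [frame_time(i, fs) for i in peaks]
```

Each burst in the synthetic signal plays the role of one syllable nucleus, so the number of detected peaks approximates a syllable count.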
The vowel nucleus is the place where the pitch estimation reaches a local maximum. This is less clear at the syllable boundaries, where simultaneous changes of intensity, spectral energy distribution, and voicing partially hide the perception of pitch changes [9]. This effect is more evident for stop and fricative consonants and less significant for liquids and nasals [10]. Consequently, the edges of the syllabic nucleus are more suitable than the syllable border for detecting noticeable changes in the fundamental frequency. The boundaries of the syllable nucleus are estimated from the minimum nearest to the detected peak on the energy envelope or from the vocal activity limits provided by the robust algorithm for pitch tracking (RAPT) [11].
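The boundary rule can be sketched as a downhill walk from each detected peak: the search stops at the nearest envelope minimum or at the edge of the voiced region, whichever comes first. The function name and the per-frame `voiced` flags, which stand in for the voicing decisions of a pitch tracker such as RAPT, are assumptions for illustration.

```python
def nucleus_bounds(envelope, peak, voiced):
    """Return (left, right) frame indices bounding the nucleus at `peak`.

    Walks away from the peak while the envelope keeps decreasing and the
    neighbouring frame is still voiced.
    """
    left = peak
    while left > 0 and voiced[left - 1] and envelope[left - 1] < envelope[left]:
        left -= 1
    right = peak
    while (right < len(envelope) - 1 and voiced[right + 1]
           and envelope[right + 1] < envelope[right]):
        right += 1
    return left, right

# Toy envelope with a peak at index 4 and minima at indices 2 and 6.
toy_envelope = [1, 3, 2, 5, 9, 6, 4, 7]
all_voiced = [True] * len(toy_envelope)
partly_voiced = [False, False, False, True, True, True, False, False]

bounds_by_minima = nucleus_bounds(toy_envelope, 4, all_voiced)
bounds_by_voicing = nucleus_bounds(toy_envelope, 4, partly_voiced)
```

In the first call the envelope minima limit the nucleus; in the second the narrower voiced region does.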
For each syllable nucleus obtained, a number of features related to measures of intensity, duration, and fundamental frequency are estimated (see central bottom block in Figure 2). Duration and fundamental frequency features are given in milliseconds (ms) and semitones (ST), respectively. Expressing fundamental frequency in semitones diminishes gender differences as suggested in [12,13].
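The semitone scale is logarithmic, so equal frequency ratios map to equal distances regardless of the speaker's base pitch. A minimal sketch of the conversion follows; the 1 Hz reference is an assumption, since the reference frequency is not stated here.

```python
import math

def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert a frequency in Hz to semitones relative to ref_hz.

    ref_hz = 1.0 is an assumed reference, not one given in the text.
    """
    return 12.0 * math.log2(f0_hz / ref_hz)

# An octave rise spans the same distance for a low and a high voice.
low_voice_rise = hz_to_semitones(200.0) - hz_to_semitones(100.0)
high_voice_rise = hz_to_semitones(400.0) - hz_to_semitones(200.0)
```

Differences in semitones do not depend on the reference chosen, which is why range and variability measures become comparable across speakers.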
For our research twelve prosodic features were calculated based on the syllable nucleus positions obtained by the novel prosodic method, as follows: (1) speech time (SPT): total speech time from the first syllable to the last syllable produced, (2) number of pauses (NPU): total number of silences; a gap of more than 0.3 s between two consecutive syllables was considered a silence,
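The two features defined above can be computed directly from the detected syllable intervals. The sketch below assumes each syllable is a (start, end) pair in seconds, that speech time runs from the start of the first syllable to the end of the last, and that a gap is measured from the end of one syllable to the start of the next; these conventions and the function names are assumptions, and only the 0.3 s threshold comes from the text.

```python
def speech_time(syllables):
    """SPT: time from the start of the first syllable to the end of the last."""
    return syllables[-1][1] - syllables[0][0]

def count_pauses(syllables, min_gap_s=0.3):
    """NPU: number of gaps longer than min_gap_s between consecutive syllables."""
    return sum(
        1
        for (_, prev_end), (next_start, _) in zip(syllables, syllables[1:])
        if next_start - prev_end > min_gap_s
    )

# Hypothetical syllable intervals in seconds, for illustration only.
example = [(0.10, 0.25), (0.30, 0.45), (1.00, 1.20), (1.25, 1.40), (2.00, 2.15)]
spt = speech_time(example)
npu = count_pauses(example)
```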

Automatic Classification.
Many classification techniques with remarkable performance have been developed in recent decades [14]. Since the main goal of this research is to find a set of prosodic parameters with discriminative potential for identifying mild dementia, a well-known classifier was selected. In [15] we explored the use of Random Forest for a similar task, but we have now found that better results can be achieved using the Support Vector Machine (SVM) classification technique. Our automatic classification of reading speech is therefore based on an SVM, used to evaluate how well the proposed features predict each participant's group membership. The results are evaluated in terms of accuracy (Accu), sensitivity (Sens), and specificity (Spec) [16]. Cross validation was used to avoid overfitting, that is, to avoid testing a discriminant function on the same data used to create it. Specifically, the leave-one-out method of cross validation was applied. It involves generating the discriminant function on all but one of the participants (N − 1) and then testing for group membership on the held-out sample. The process is repeated for each sample (N times), and the percentage of correct classifications is obtained by averaging over the trials. In our case N is equal to the total number of participants (N = 20), and, in each iteration of the cross validation method, one fold is used for testing and the other nineteen folds are used for training the SVM classifier.
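The leave-one-out loop and the three evaluation measures can be sketched as follows. To keep the example self-contained, a nearest-centroid rule is swapped in for the SVM; it is a stand-in, not the classifier used in this work, and the toy data and function names are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def leave_one_out(X, y, positive=1):
    """Train on all samples but one, test on the held-out one, repeat for each."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        model = nearest_centroid_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = nearest_centroid_predict(model, X[i])
        if y[i] == positive:
            tp += pred == positive
            fn += pred != positive
        else:
            tn += pred != positive
            fp += pred == positive
    return {
        "accuracy": (tp + tn) / len(X),
        "sensitivity": tp / (tp + fn),   # correctly identified positives
        "specificity": tn / (tn + fp),   # correctly identified negatives
    }

# Hypothetical one-feature data: label 1 = MD, label 0 = non-MD.
toy_X = [[0.0], [0.2], [2.3], [2.0], [2.2], [2.4]]
toy_y = [0, 0, 0, 1, 1, 1]
scores = leave_one_out(toy_X, toy_y)
```

In the toy data one non-MD sample lies among the MD samples, so sensitivity and specificity differ even though a single accuracy figure would hide which group was misclassified.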

Results
The goal of our experiments was to evaluate the potential of the selected features for automatic measurement of cognitive impairment through prosodic analysis. Table 2 contains descriptive statistics for these measures, showing the mean, standard deviation, and range of every prosodic measure for both groups (MD and non-MD). Visual inspection suggests probable differences between subjects with mild dementia and healthy subjects. Hence, we performed a Kolmogorov-Smirnov test (KS-test) to determine which set of features can significantly discriminate between the two groups (MD and non-MD) [17]. The KS-test has the advantage of making no assumption about the distribution of data. This nonparametric test for the equality of continuous, one-dimensional probability distributions can be used to compare a sample with a reference probability distribution (one-sample KS-test) or to compare two samples (two-sample KS-test). The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case). The two-sample test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples [18]. Nevertheless, with a KS-test we cannot guarantee finding the best set of features to reach the maximum performance of the SVM classifier, but we believe it can provide an acceptable first approximation to it. Statistical analyses were conducted using IBM SPSS v21 [19], and a significance level of 0.05 was used. A summary of the hypothesis results (h), the p values for the KS-test, and a ranking of the level of significance of each feature, based on the lower p values, are shown in Table 3.
Table 6 (recovered rows; column headers and the first author name are missing): [20] 72 (35) 75.4%-81.5%; Thomas et al. [21] 85 (50) 58.8%-75.3%; Bucks et al. [22] 24 (8) 87.5%.
Despite the small sample used, the ranking attempt illustrates the relative importance of these variables for discriminating between healthy subjects and subjects with mild dementia.
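For reference, the two-sample KS statistic is the largest vertical distance between the two empirical cumulative distribution functions. The sketch below computes only that statistic; it does not reproduce the significance values obtained with SPSS, and the function name and sample values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over observed x."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical feature values for two groups.
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])
```

A statistic near 1 means the two samples barely overlap, while a value near 0 means their distributions are almost indistinguishable.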
Using this information, three sets of features (presented in Table 4) were defined according to three different levels of discrimination. The first set included the features with significant differences (p < 0.05) between the two groups (SDF and MFF). The second set contained features with slight differences (p < 0.5) between MD and non-MD participants, for example, PTH, SPR, NSY, or MVF. The third set included the features with no significant differences (p > 0.5) between groups: SPT, NPU, PPU, PPH, ARR, and MSD. We think that the small sample size may have resulted in this lack of significance and that these temporal measures may yet offer additional explanatory power. An SVM was then used to determine how well the proposed significance groups predicted participants' group membership. Classification results were obtained using different combinations of the prosodic features included in the three significance groups summarized in Table 4.
As can be seen in Table 5, with this classification strategy the best accuracy, 75.0%, was achieved using only the features in the significant group. In interpreting the results of Table 5, we should note that the classification algorithm did not benefit from increasing the number of features. Even so, these results could be considered relatively good given the size of the data set and the results published elsewhere [20][21][22], as shown in Table 6.
The former method of feature selection is one of the statistical methods frequently used in similar studies to evaluate the level of significance of the measurements under analysis [23,24]. Due to the nature of this method (a KS-test over individual features), it cannot guarantee maximum accuracy for a classifier able to model complementary information between features. The set of features that best represents both classes (MD and non-MD) cannot be determined from the level of significance of individual features. Combining different features can either improve or worsen the final classification. Consequently, one critical question arises and still needs to be answered: how to choose the set of features able to reach the maximum classification rate for a given classification technique. To answer this question, in the next step we evaluated the SVM using all the possible combinations of the features obtained from the prosodic analysis. In our work a total of 4094 combinations from 12 features without repetition were tested to find the best feature set. The best classification rates, using leave-one-out cross validation, for each number of features are shown in Table 7.
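The exhaustive search can be sketched with `itertools.combinations`, keeping the best-scoring subset for each subset size. The `score` callback is a placeholder for the leave-one-out SVM accuracy and the toy scoring rule below is purely hypothetical. Enumerating every non-empty subset of twelve features gives 2^12 − 1 = 4095 candidates, one more than the 4094 reported above; which subset was excluded is not stated.

```python
from itertools import combinations

# The twelve feature abbreviations used in this work.
FEATURES = ["SPT", "NPU", "PPU", "PTH", "PPH", "NSY",
            "SPR", "ARR", "MSD", "SDF", "MFF", "MVF"]

def all_subsets(features):
    """Yield every non-empty combination of features, without repetition."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset

def best_subset_per_size(features, score):
    """Map each subset size to its (best score, best subset) pair."""
    best = {}
    for subset in all_subsets(features):
        value = score(subset)
        size = len(subset)
        if size not in best or value > best[size][0]:
            best[size] = (value, subset)
    return best

def toy_score(subset):
    """Hypothetical score; in practice this would be cross-validated accuracy."""
    return ("SDF" in subset) + ("MSD" in subset) - 0.01 * len(subset)

n_subsets = sum(1 for _ in all_subsets(FEATURES))
best = best_subset_per_size(FEATURES, toy_score)
```

Because every subset is scored independently, the search can find feature sets whose members are individually weak but complementary, at the cost of one full cross validation run per subset.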
Results in Table 7 indicate that the relation between features, and the way each affects the others in pattern classification, is not well known. The highest accuracy value was obtained for a combination of four features, including the best (SDF) and worst (MSD) ranked in Table 3. We should also note that increasing the number of features does not guarantee an increase in the classification rate.

Conclusions
The main goal of this pilot study was to investigate the potential use of automatically extracted prosodic features in the diagnosis of mild dementia in elderly adults. The results demonstrate the existence of significant measurable prosodic differences between the performance of healthy participants and patients with mild dementia in reading speech. Features like ARR, MSD, SDF, and MFF were identified as having a higher discriminative power. Furthermore, because the methodology is relatively simple and low cost, the technique for the screening of moderate cognitive impairment is easy to disseminate. This study lays the foundation for future research using a larger number of participants and other speech features in the time, spectral, or cepstral domain. In this way, definitive conclusions about prosodic analysis to identify mild dementia could be drawn.