A Triplet Multimodel Transfer Learning Network for Speech Disorder Screening of Parkinson’s Disease



Introduction
Parkinson's disease (PD) is a degenerative disease commonly seen in the elderly, mainly manifested as motor retardation, static tremor, and muscle rigidity. The examination for Parkinson's disease therefore depends mainly on the medical history and physical examination, which should be completed by neurologists in hospitals. If the patient shows obvious slowing of movement, a 4-6 Hz static tremor that weakens or disappears during movement, a masked face, a forward-leaning posture with a small-stepped gait, muscle stiffness, and increased muscle tension, the patient is more likely to have Parkinson's disease. However, the motor symptoms of Parkinson's disease often occur late. In contrast, nonmotor symptoms, such as language and cognitive disorders, can manifest decades before the onset of motor symptoms, which is of great significance for the early diagnosis of Parkinson's disease.
There is ample literature showing that early Parkinson's disease is also accompanied by mild speech impairment [1-5]. An assessment of vocal impairment was presented for separating healthy people from persons with early untreated Parkinson's disease (PD) [1]. The purpose of the study in [2] was to determine whether subjects in the early stages of untreated PD, or PD treated with deprenyl alone, suffer from motor speech abnormalities. Speech defects are common in advanced PD, including disturbances of respiration, phonation, and articulation. The authors studied 12 subjects with early PD (Hoehn and Yahr stage ≤ 2; mean disease duration 3.2 years) who were not taking symptomatic therapy and tested them under two conditions: on and off deprenyl. The study in [3] provides an evaluation of speech disorders in early Parkinson's disease. Moreover, evidence shows that speech difficulties were associated with greater autonomic dysfunction, sleep disturbances, and striatal dopaminergic deficit, and can serve as a predictor of faster cognitive decline in early Parkinson's disease [4]. Detecting speech disorders in early Parkinson's disease by acoustic analysis was the subject of another study in 2018 [5].
Language dysfunction can reflect the cognitive ability of the brain and the speed of response to external stimuli. It is mainly manifested as laryngeal voice and tongue movement disorders of different degrees, and the first manifestation is a weakening of the voice. In addition, there can be monotone pitch, slow speech rate, abnormal pauses, continuous dysphonia, abnormal stress, vague and hoarse voice, decreased fluency of oral expression, and simplified syntactic expression. Furthermore, PD can also hinder voice production, making the voice of patients with Parkinson's disease soft and monotonous. Research shows that these symptoms often appear in the early stages of disease development, sometimes decades earlier than movement-related symptoms. Recently, a study by neuroscientists at the University of Arizona showed that a specific gene usually associated with Parkinson's disease may be the reason behind these phonation-related problems. This discovery may help the early diagnosis and treatment of Parkinson's disease [6]. These manifestations are captured in the patient's speech data. Therefore, using computer methods to analyze and process these speech data is the primary task.
The speech recognition system can be roughly divided into four parts (Figure 1): (1) PD pronunciation test, including a read-vowel test and a continuous conversation test; (2) speech data collection, in which the test content is recorded using mobile phones, recording pens, or other equipment fitted with microphones; (3) speech recognition and PD detection, in which deep learning or machine learning algorithms perform feature extraction and recognition on the speech data; and (4) feedback, in which the prediction results of the model are returned to doctors to help them make treatment plans. To detect Parkinson's disease, computer scientists try to capture the unique disease symptoms of PD patients and build models to compare them with healthy people. A large proportion of these methods are artificial intelligence technologies. For example, supervised traditional machine learning (ML) algorithms, such as random forests [7-9], decision trees [10, 11], and K-nearest neighbors (KNN) [12], have been highly effective in the analysis of motor symptoms of Parkinson's disease; support vector machines [13, 14] and XGBoost [15] have shown competitive performance in PD speech analysis and recognition. Deep networks have achieved accuracy far beyond ML methods in speech, natural language processing, vision, and many other fields. In speech analysis, a classical ML algorithm usually requires complex feature engineering, while deep networks can usually achieve good performance by simply passing the data directly to the network. Moreover, deep models can more easily adapt to different fields and applications. For instance, transfer learning makes it effective to apply a pretrained deep network to different applications in the same field. Benefiting from these strengths, deep models have also shown clear advantages in speech recognition for Parkinson's disease [16], including time-series models (LSTM [17, 18] and GRU [19]), convolution-based neural networks (CNN [20-22] and ResNet [18, 23]), and hybrid or complex networks (transformers [24, 25], ensemble learning [26, 27], and few-shot learning [28]).
Although these methods have earned a place in a number of fields, they are limited to a single perspective on the speech data. For example, CNNs extract spatial features, LSTMs extract temporal features [29], and MFCC extracts spectral features. However, the expression forms of each feature are diverse, and the degree to which the same feature of multiple samples aggregates in different spaces or from different perspectives varies greatly. Therefore, we choose multimodel fusion to express different levels of features in different dimensions and spaces and fuse them until the best effect is achieved. The multimodel fusion method compensates for the one-sided feature representation of a single model and makes the fused model better suited to the input data.
In addition, we adopt a transfer learning framework to process the speech data. Because deep learning models require long training times and large amounts of data, it is difficult for the complex model we designed to achieve excellent classification results in a short time, and it is hard to change details once the model is trained. The transfer learning framework is divided into an upstream task and a downstream task, which optimizes system performance and improves training efficiency: the reconstructed model of the downstream task can cover the shortcomings of the upstream task, pay more attention to detailed feature description, and achieve faster modeling by fine-tuning the trained weights.
The inner units of the proposed triplet network integrate the attention mechanism, convolution, feature splicing, scalable structures, and other techniques. New elements have also been added to each block to improve the recognition ability of the model. This improved strategy successfully combines the advantages of each model and yields a more robust hybrid model.
The main contributions of our work are summarized as follows: (i) A triplet multimodel transfer learning network (TmmNet) is proposed for speech disorder screening of Parkinson's disease, which can not only extract multidimensional spatiotemporal features of the input speech but also selectively enhance the significance of the features. The two-layer task framework adopted addresses the problems of a large data volume and a long training process. (ii) The proposed triplet network integrates a variety of improved expansion units, adds multiscale convolution and multihead, spatial, and channel attention mechanisms, uses a parallel mode for training and a serial mode for feature splicing, and performs hierarchical feature representation and fusion. (iii) After the multifeature fusion of the upstream pretraining model of the transfer learning structure is completed, the downstream model appends a bidirectional temporal recurrent memory network and two fully connected modules to the pretrained model for fine-tuning, highlighting the significance of the fused features in the time-series dimension. The rest of this paper is organized as follows: Section 2 reviews related work in recent years. Section 3 describes the framework and computing process of the proposed model. Section 4 provides the dataset introduction, experimental results, and settings. Section 5 concludes and discusses strengths and limitations.

Related Work
Regarding the algorithms mentioned above, we introduce computer methods for speech recognition in the following three categories: manual feature extraction methods, machine learning methods, and deep learning methods. As illustrated in Table 1, we mark whether each investigated study involves these algorithm categories with "✓" (involved) or "−" (not involved).
Machine learning methods, such as SVM and XGBoost, have been widely applied in PD speech assessment [13, 15]. The study in [13] introduced L1-regularized SVM for speech signal feature extraction and then trained an improved genetic algorithm and an SVM classifier for speech recognition. Wang et al. [15] compared XGBoost with support vector machines, random forests, and neural networks for the detection of speech signals collected from Parkinson's patients, by identifying the vocal fundamental frequency of speech. Although machine learning algorithms have made achievements in the field of voiceprint recognition, most of them are still limited to feature classification, with little consideration of feature representation and description.
Among popular deep learning methods, a large number of classical models [18, 19, 24, 34, 35] have achieved good performance in PD speech recognition. In [18], MFCC features were input into LSTM, GRU, CNN, ResNet, and other deep models for automatic speech recognition (ASR). A GRU [19] was employed to assess speech impairments by computing static features from complete utterances. Hernandez et al. [24] explored the usefulness of Wav2Vec self-supervised speech representations as the speech feature of dysarthria in training ASR systems and used a transformer-based context network for feature representation and classification. In addition, several hybrid fusion models [25, 26, 28, 36] have gradually emerged in PD speech recognition. An audio spectrogram transformer [25] was proposed to analyze multimodal PD speech and handwriting data. An ensemble model [26] was designed for the classification of PD speech data, which combined a deep sample learning algorithm with a deep network, realizing deep dual-side learning. A deep model based on iterative mean clustering [28] was established to obtain new high-level deep samples, which addressed the problem of few-shot learning.
For MFCC feature extraction, several algorithms analyze and classify the MFCC features in speech data [30-33]. Qing et al. [32] designed a transfer learning network after extracting MFCC features from the raw speech data. In the study of [33], the 12-dimensional MFCC parameters with the best performance were extracted to represent the acoustic characteristics of articulation disorders and were utilized for automatic speech recognition based on an artificial neural network (ANN). Nivash et al. [31] carried out research in 2021 and used a series of machine learning algorithms, such as RF and naive Bayes, to classify the MFCC features of speech; naive Bayes was verified to be the best algorithm. MFCC was also utilized to detect patients with PD among healthy people. The study in [30] adopted an SVM classifier to distinguish the extracted voice and cepstrum features, and the comparison showed that MFCC performed best. These algorithms all involve MFCC feature extraction, which is sufficient to verify its usefulness.
Inspired by these approaches, we first extract the MFCC features of the speech files in the preprocessing part and then develop a transfer learning model that includes traditional machine learning and deep learning.

Triplet Multimodel Transfer Learning Network
To achieve reliable speech analysis and recognition of PD patients and healthy controls in a real environment, we propose a triplet multimodel transfer learning network for MFCC features, multilevel and multiscale feature extraction, and classification. First, we introduce the preprocessor for MFCC feature computation. Then, we describe the architecture of the pretrained model for multilevel and multiscale feature extraction, followed by a detailed discussion of its individual components. Finally, we describe the reconstructed model that fine-tunes the upstream parameters and scores the fused features before producing the final diagnostic result.

Data Preprocessor.
As shown in Figure 2, we first apply pre-emphasis to compensate the high-frequency part of the voice. For the sampled speech value $x[n]$ at time $n$, the output after pre-emphasis is

$$y[n] = x[n] - a\,x[n-1],$$

where the pre-emphasis coefficient $a$ is generally between 0.9 and 1. Then, the voice is divided into segments by windowing; the windowing function is nonzero only in some regions and zero elsewhere. The step after windowing and framing is the discrete Fourier transform (DFT), which maps the signal from the time domain to the frequency domain. Assuming that the number of sampling points after windowing is $N$, the DFT of these $N$ points is

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1.$$

Then, the amplitude of each frequency component is obtained, and the frequency is mapped to the mel scale. The relationship between mel frequency and frequency $f$ is

$$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$

Inverse discrete Fourier transform (IDFT): we take the logarithm of the mel features from the previous step, which can be used as acoustic features, treat the logarithmic spectrum as a time-domain signal, and apply a Fourier transform again. This is because the content of speech is largely determined by the path the sound passes through from the sound source (similar to a series of filters) and is independent of the vibration frequency (fundamental frequency) of the sound source itself. The function of the cepstrum is to separate the filter from the sound source to help identify the content of the sound. The calculation of the cepstrum, shown below (excluding the mel filtering step), treats the spectrum after the Fourier transform as a time-domain signal and performs another Fourier transform on it:

$$\hat{x}[n] = \mathcal{F}^{-1}\big\{\log \lvert \mathcal{F}\{x[n]\}\rvert\big\}.$$

The delta step is as follows: for each frame, the first 12 cepstral coefficients after the IDFT are selected, and the energy is used as the 13th feature. The energy of a windowed frame that starts at $t_1$ and ends at $t_2$ is

$$E = \sum_{t=t_1}^{t_2} x[t]^2.$$

The change of a feature over time also represents acoustic characteristics. Therefore, temporal differences are appended to the original 13-dimensional features to obtain the delta features, which represent the change of the cepstral coefficients and energy between frames.
First, we formulate the speech recognition problem. The MFCC features are defined as $X = \{x_i \in \mathbb{R}^{40 \times 10},\ i = 1, 2, \ldots, N\}$ with a corresponding two-class label sequence $L$, where $N$ is the number of input samples and 10 and 40 are the width and height of each training sample after preprocessing.
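To make the preprocessing concrete, the following is a minimal sketch of the MFCC pipeline described above, assuming the librosa library. The function name extract_mfcc_windows, the pre-emphasis coefficient of 0.97, and the use of RMS as the frame-energy term are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
import librosa

def extract_mfcc_windows(wav_path, sr=8000, n_mfcc=13, win=10):
    """Compute 40-dim MFCC frames (13 static + 13 delta + 13 delta-delta
    + 1 energy) and slice them into 10 x 40 training samples."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])         # pre-emphasis, a = 0.97
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (13, T)
    d1 = librosa.feature.delta(mfcc)                   # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)          # second-order differences
    e = librosa.feature.rms(y=y)[:, : mfcc.shape[1]]   # frame-energy proxy (1, T)
    feat = np.vstack([mfcc, d1, d2, e]).T              # (T, 40)
    n = feat.shape[0] // win
    return feat[: n * win].reshape(n, win, feat.shape[1])  # (n, 10, 40)
```

Each returned window corresponds to one training sample $x_i \in \mathbb{R}^{40 \times 10}$ (up to a transpose).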

Triplet Network.
The triplet network is a model that integrates three functional blocks. The transformer block integrates a multihead attention mechanism and a multiple feed-forward neural network (Multi-FFN). The multiscale convolution block contains the feature outputs of different depth-wise convolution kernels, spatial and channel attention components, a normalization module, two one-dimensional convolutions, and a fully connected layer. Finally, a DenseNet with three dense blocks is adopted to reduce the possibility of information loss from the first two blocks.

Transformer Block.
We redesigned the internal structure of the transformer block, which is composed of a multihead attention mechanism and a multiple feed-forward neural network (Multi-FFN). Since the input data are speech sequences, multihead attention receives three sequences: query, key, and value. The output sequence length of the multihead attention equals the length of the input query sequence. The length of the query is $L_q$, and the length of the key and value is $L_k$.
Multihead attention is composed of one or more parallel cell structures; we call each such cell structure a head (one-head attention), so multihead attention consists of multiple one-head attentions. Suppose a multihead attention has $n$ heads; the weights of the $i$-th head are $W_i^Q$, $W_i^K$, and $W_i^V$. The input $q$, $k$, and $v$ matrices are fed into each one-head attention, the output matrices of the heads are concatenated along the feature dimension to obtain a new matrix, and this matrix is multiplied by the $W^O$ matrix to obtain the output. The multihead attention process is illustrated in Figure 3. The multihead attention mechanism assigns each attention operation to a single head and can thus extract feature information from multiple dimensions. Three transformation tensors perform linear transformations on $Q$, $K$, and $V$, respectively. Each head then segments the output tensor at the semantic level, obtaining its own set of $Q$, $K$, and $V$ for the attention computation.
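As a concrete reference for this computation, here is a minimal PyTorch sketch of multihead attention; the model width of 40 and head count of 4 are illustrative, and the per-head projections are realized as single large matrices, which is equivalent to separate $W_i^Q$, $W_i^K$, and $W_i^V$.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=40, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k, self.n_heads = d_model // n_heads, n_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks all W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # stacks all W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks all W_i^V
        self.w_o = nn.Linear(d_model, d_model)   # output matrix W^O

    def forward(self, q, k, v):                  # q: (B, L_q, d); k, v: (B, L_k, d)
        B, Lq, _ = q.shape
        Lk = k.shape[1]
        # Project, then split the feature dimension into n_heads heads
        q = self.w_q(q).view(B, Lq, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(B, Lk, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(B, Lk, self.n_heads, self.d_k).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Lq, -1)  # concatenate heads
        return self.w_o(out)   # output length matches the query length L_q
```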
For the multiple feed-forward neural network (Multi-FFN), we embed three different feed-forward blocks and fuse their three outputs: an RBF block, an FC block, and a Conv block, whose structures are shown in Figure 4. This redesigned transformer block thus not only includes the multihead attention mechanism but also replaces the single feed-forward MLP with a combination of three blocks. The input $X$ is an $m$-dimensional vector, and the sample size is $P$, with $P > m$. The input data point $X_p$ is the center of the radial basis function $\phi$. The hidden layer maps the vector from the low dimension $m$ to the high dimension $P$; when the data are linearly inseparable in the low dimension, they can become linearly separable in the high dimension. We select the reflected sigmoidal function as the radial basis function $\phi$. The Conv block contains three layers of convolution with 3 × 3 kernels; the downsampling layer is removed to avoid information loss.
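A sketch of how the three Multi-FFN branches might be combined is shown below, assuming PyTorch; the exact reflected-sigmoid form, the number of RBF centers, and summation as the fusion rule are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiFFN(nn.Module):
    """Three parallel feed-forward paths (RBF, FC, Conv) whose outputs are
    fused by summation (the fusion rule is assumed)."""
    def __init__(self, d=40, n_centers=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, d))   # RBF centers
        self.beta = nn.Parameter(torch.ones(n_centers))          # RBF widths
        self.rbf_out = nn.Linear(n_centers, d)
        self.fc = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                nn.Linear(4 * d, d))
        self.conv = nn.Sequential(            # three 3 x 3 convs, no downsampling
            nn.Conv2d(1, 8, 3, padding=1), nn.GELU(),
            nn.Conv2d(8, 8, 3, padding=1), nn.GELU(),
            nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, x):                     # x: (B, L, d)
        r2 = ((x.unsqueeze(2) - self.centers) ** 2).sum(-1)  # (B, L, n_centers)
        rbf = self.rbf_out(1.0 / (1.0 + torch.exp(r2 / self.beta ** 2)))  # reflected sigmoid
        fc = self.fc(x)
        conv = self.conv(x.unsqueeze(1)).squeeze(1)  # treat the (L, d) map as an image
        return rbf + fc + conv                # fuse the three branch outputs
```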

Multiscale Convolution Block.
The multiscale convolution block follows the internal structure of the transformer. It uses group convolution to divide all channels into several groups, and convolution is performed within each group. The inverted bottleneck layer performs the convolution operations in the order of dimension increase (depth-wise convolution) to dimension reduction, with the depth-wise convolution moved to the top. This facilitates the comparison of features after the 1 × 1 convolutions and prevents the parameter count from growing. The structure of the multiscale Conv block is shown in Figure 5. The speech data are first convolved by depth-wise convolution kernels at three different scales. The joint features of channel attention and spatial attention are extracted from the output of each layer, and the final multiconvolution fusion feature is then obtained through two one-dimensional convolutions.
It is worth noting that this module uses multiscale convolution kernels, 7 × 7, 5 × 5, and 3 × 3, in the depth-wise convolution and processes the input data to obtain three features for fusion. After the feature maps of these three branches are obtained, serial channel attention and spatial attention are added to highlight the landmark information and target locations in the speech signal. Convolutional block attention (CBA) can improve the feature extraction ability of the network without significantly increasing the amount of computation or the number of parameters [37], as shown in Figure 6. This module serially generates attention maps in the channel and spatial dimensions and multiplies them with the original input feature map, producing the final feature map through adaptive feature correction. It includes two modules, channel attention and spatial attention. Channel attention uses the relationships between feature channels to generate a channel attention map; to compute it efficiently, we compress the spatial dimension of the input feature map, typically with average pooling to aggregate spatial information. Spatial attention uses the spatial relationships between features to generate a spatial attention map; to compute it, we first apply average pooling and max pooling operations along the channel axis and concatenate them to generate an effective feature descriptor.
(a) Channel Attention. When compressing the spatial dimension of the input feature map, average pooling and max pooling are both applied, yielding two one-dimensional vectors. Global average pooling provides feedback for every pixel on the feature map, whereas global max pooling provides gradient feedback only where the response is largest during backpropagation. The two spatial descriptors $F^c_{avg}$ and $F^c_{max}$ are passed through a shared MLP to generate a weight for each channel, and this weight is finally multiplied with the original feature map:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big),$$

where $F$ represents the input feature map, $F^c_{avg}$ and $F^c_{max}$ are the features computed by global average pooling and global max pooling, respectively, $W_0$ and $W_1$ denote the two layers of the multilayer perceptron, the features between $W_0$ and $W_1$ are processed with the ReLU activation function, and $\sigma$ is the sigmoid function. (b) Spatial Attention. In addition to generating attention over the channels, the network also needs to learn which spatial parts of the feature map should have a higher response. First, average pooling and max pooling compress the input feature map along the channel dimension, applying mean and max operations on the channel axis; the two resulting 2D features are concatenated along the channel dimension into a two-channel feature map. This map is then convolved with a hidden layer containing a single convolution kernel, ensuring that the final feature is consistent with the input feature map in the spatial dimension. The max and average pooling operations here are executed in the channel dimension, reducing the $C$ channels of the original feature to one so that spatial attention can be learned:

$$M_s(F) = \sigma\big(f^{7 \times 7}([F^s_{avg}; F^s_{max}])\big),$$

where the feature maps after average pooling and max pooling are $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$, $\sigma$ represents the sigmoid activation function, and $f^{7 \times 7}$ denotes a convolution layer with a 7 × 7 kernel.
Channel attention and spatial attention can be combined as

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F',$$

where $F$ is the feature map, $M_c(F)$ and $M_s(F')$ represent channel-based and space-based attention, $\otimes$ denotes element-wise multiplication, and $F'$ and $F''$ are the output feature maps after channel attention and spatial attention, respectively. Because the input and output sizes of the convolutional block attention module are the same, it can be inserted anywhere in an existing model. Subsequently, two 1 × 1 ordinary convolution layers are applied, with a GELU activation inserted between them to preserve the probabilistic dependence on the input and avoid vanishing gradients.
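The following is a compact PyTorch sketch of the convolutional block attention described by these formulas; the reduction ratio of 8 in the shared MLP is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, matching M_c and M_s."""
    def __init__(self, c, reduction=8, k=7):
        super().__init__()
        self.mlp = nn.Sequential(               # shared MLP: W_1(ReLU(W_0(.)))
            nn.Linear(c, c // reduction), nn.ReLU(),
            nn.Linear(c // reduction, c))
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)  # 7 x 7 spatial conv

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, _, _ = f.shape
        avg = self.mlp(f.mean(dim=(2, 3)))      # MLP(F_avg^c)
        mx = self.mlp(f.amax(dim=(2, 3)))       # MLP(F_max^c)
        f = f * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # F' = M_c(F) (*) F
        s = torch.cat([f.mean(1, keepdim=True),            # F_avg^s
                       f.amax(1, keepdim=True)], dim=1)    # F_max^s
        return f * torch.sigmoid(self.conv(s))             # F'' = M_s(F') (*) F'
```

Because input and output shapes match, this module can be dropped after any convolution layer, as noted above.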

DenseNet.
DenseNet includes three dense blocks and uses an aggressive dense connection mechanism: each layer accepts all preceding layers as its additional input. The feature maps within each dense block share the same size to facilitate the concatenation operation. The DenseNet network uses a dense block + transition structure. A dense block is a module containing many layers whose feature maps are the same size, with dense connections between layers. The transition module connects two adjacent dense blocks and reduces the feature map size through pooling. As shown in Figure 7, the DenseNet structure is mainly composed of dense blocks and transitions (convolution + pooling). Features are transferred by directly concatenating the features of all preceding layers into the next layer, rather than pointing each layer to all subsequent layers. The details can be found in [38].
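A minimal PyTorch sketch of a dense block and a transition is given below; the growth rate and layer count are illustrative, and the full architecture follows [38].

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps."""
    def __init__(self, in_ch, growth=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(n_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connection
        return torch.cat(feats, dim=1)

def transition(c_in, c_out):
    """Links two dense blocks: 1 x 1 conv plus pooling to shrink the maps."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 1), nn.AvgPool2d(2))
```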

Reconstructed Model.
As the downstream task of the transfer learning model, we reconstruct the network into a triplet network, a temporal network, and two fully connected layers.
We keep the triplet network unchanged from the pretraining process. Since the input speech data are sequential, we add a temporal network composed of a 1-D convolution layer and a bidirectional LSTM (BiLSTM) with attention to retrain and fine-tune the original network weights, followed by two fully connected layers. The BiLSTM uses a two-layer internal extensible unit as its structure and adds an attention mechanism, serving as the temporal feature extraction module of the fine-tuned downstream task.
We integrate the output features of the triplet network and the temporal network, preserving bidirectional information transmission between the frames of the speech sequence. This compensates for the shortcomings of the triplet network and, through attention, strengthens the focus on values at particular positions of the output matrix; the working mechanism is illustrated in Figure 8.
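The downstream temporal branch can be sketched as follows, assuming PyTorch; the hidden size and the additive attention pooling are illustrative choices, since these details are not fixed above.

```python
import torch
import torch.nn as nn

class TemporalNetwork(nn.Module):
    """1-D convolution, a two-layer BiLSTM, attention pooling over time,
    then two fully connected layers."""
    def __init__(self, d_in=40, hidden=64, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d_in, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(d_in, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # scores each time step
        self.fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_classes))

    def forward(self, x):                       # x: (B, T, 40)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(x)                     # (B, T, 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        return self.fc((w * h).sum(dim=1))      # weighted temporal pooling
```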

Experiments
This section presents our experimental settings and the performance of the proposed model, compared against several state-of-the-art methods on two challenging speech datasets.
First, we provide a brief introduction to the datasets. Then, we briefly describe the experimental settings. Finally, we give a global evaluation of the experimental results on the two speech datasets.

Dataset Specifications.
In this section, we give a brief description of the two speech datasets, i.e., the MDVR-KCL dataset [39] and the IPVS dataset [40], both collected with microphones and stored in ".wav" format. The details are as follows.
MDVR-KCL dataset: The MDVR-KCL dataset contains voice recordings of early- and late-stage Parkinson's disease patients and healthy controls recorded with mobile devices. It was collected in a typical examination room of about ten square meters with a typical reverberation time of approximately 500 ms. The recording was carried out in a real call situation (that is, the participant held the phone to the preferred ear with the microphone directly close to the mouth). It can be assumed that all recordings were made within the reverberation radius, so they can be considered "clean" [39]. A Motorola Moto G4 smartphone was used as the recording device. Through the developed application, high-quality recording with a sampling rate of 44.1 kHz and a bit depth of 16 bits (audio CD quality) was achieved, in ".wav" format. The collection process was as follows: (i) ask participants to relax a little and then call the test executor; (ii) ask them to read the article aloud; (iii) depending on the constitution of the participants, ask them to read the text; (iv) start a spontaneous conversation with the participants, with the test executor randomly asking questions about scenic spots, local transportation, or personal interests (if acceptable); (v) the test executor ends the call by saying goodbye. The dataset includes data from 16 PD patients and 21 healthy controls. For each HC and PD participant, scores were labeled on the Hoehn and Yahr (H&Y) scale, as well as the UPDRS II part 5 and UPDRS III part 18 scales.
IPVS dataset: The IPVS dataset includes voice recordings of 28 PD patients and 20 healthy controls, all collected at a 16 kHz sampling rate in a quiet, echoless, warm room. The microphone was located 15 to 25 cm from the speaker. The participants performed phonation and syllable tasks as well as readings. In this study, the readings of phonetically balanced phrases and the vowel recordings were utilized, and a phonetically balanced text was read twice [40]. The recording protocol was as follows: (a) two readings of a phonemically balanced text separated by a pause (30 s); (b) execution of the syllable "pa" (5 s), a pause (20 s), and execution of the syllable "ta" (5 s); (c) two phonations of the vowel "a"; (d) two phonations of the vowel "e"; (e) two phonations of the vowel "i"; (f) two phonations of the vowel "o"; (g) two phonations of the vowel "u"; (h) reading of some phonemically balanced words, a pause (1 min), and reading of some phonemically balanced phrases. It should be emphasized that there is a one-minute break between the execution of (a) and (b) and between (g) and (h). Before the tasks in points (c) through (g), participants were asked to inhale as much air as possible and to sustain the sound until their lungs were empty. A 30-second pause is required between the executions of (c), (d), (e), (f), (g), and (h).

Experimental Settings.
The experiments were implemented on the two speech datasets, with settings arranged according to the features of each dataset. The device had a GeForce RTX 3080 graphics card, 32.0 GB of RAM, and an Intel(R) Core(TM) i7-11700 CPU. The settings are described per dataset.
For the two speech datasets, we shuffled the data and randomly selected 80% for training and 20% for testing, with more than 10,000 samples for the MDVR-KCL dataset and more than 20,000 for the IPVS dataset. The final testing time on each dataset was approximately 15 ms (MDVR-KCL) and 27 ms (IPVS). We utilized the spontaneous dialogue files in the MDVR-KCL dataset, as well as the monophonic pronunciation files (/a/, /e/, /i/, /o/, and /u/) collected by microphone in the IPVS dataset, corresponding to points (c) through (g) in the collection protocol.
For the MDVR-KCL dataset, we had ".wav" files of 16 PD patients and 21 healthy controls, each containing about two minutes of recording. First, we extracted 40-dimensional MFCC features through the data preprocessing module: 13 static coefficients + 13 first-order difference coefficients + 13 second-order difference coefficients + 1 frame energy. The sampling rate was set to 8000, i.e., 8000 points per second. In this way, a segment of audio yields N × 40-dimensional vectors, as the audio is continuous. We took 10 × 40-dimensional sequences as one training sample. These 400-dimensional MFCC features were then fed into the triplet network for pretraining, and the model parameters were saved. The processed data were then input into the reconstructed model's triplet network and temporal network for retraining; the triplet network used the pretrained parameters, the temporal network used initialization parameters, the time step was set to 100, and the batch size of the entire network was set to 128. For the IPVS dataset, we had ".wav" files of 28 PD patients and 20 healthy controls; the parameters for MFCC feature extraction were the same, and the differences were mainly reflected in the amount of data.
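A minimal sketch of the split and loader settings described above, assuming the preprocessed MFCC windows are stored as NumPy arrays; the file names are hypothetical.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# X: (N, 10, 40) MFCC windows from the preprocessor; y: binary labels
X = np.load("mfcc_windows.npy")          # hypothetical file names
y = np.load("labels.npy")
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)   # 80/20 split
train_loader = DataLoader(
    TensorDataset(torch.tensor(X_tr, dtype=torch.float32),
                  torch.tensor(y_tr, dtype=torch.long)),
    batch_size=128, shuffle=True)        # batch size from the settings above
```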

Speech Recognition.
The experiments comprise four parts: an ablation experiment, comparison experiments with machine learning models and with deep models, and a global evaluation.

Ablation Experiment.
We split TmmNet into four constituent variants, i.e., TmmNet without the fine-tuning process (TmmNet NoFT), TmmNet without the MS Conv block (TmmNet NoConv), TmmNet without the ST-Attn block (TmmNet NoAttn), and TmmNet without the temporal network and the ST-Attn block (TmmNet NoTN), for the ablation experiment. Because the differences among the models on the IPVS dataset are small, we carry out the ablation experiment only on the MDVR-KCL dataset.
We used four evaluation indicators, precision, recall, F1 score, and accuracy, to evaluate the performance of the four variants and the overall model. As shown in Table 2, precision represents the proportion of true positives among the samples predicted as positive. The performance of the split modules varies greatly. TmmNet, TmmNet (NoConv), and TmmNet (NoAttn) perform best, while TmmNet (NoFT) and TmmNet (NoTN) perform poorly, because the fine-tuning process and the temporal network of the downstream task have a large impact on precision. For recall, the performance of the overall model and the split modules is not fully satisfactory, but the overall model reaches 75%, ranking first. For F1 score, TmmNet performs best, followed by TmmNet (NoConv); TmmNet (NoTN) is the worst, which indicates that the TN module of the fine-tuning part has the greatest impact on the F1 score of the overall model. Comparing the accuracy of these variants, TmmNet (NoFT) and TmmNet (NoTN) perform worse than the other models, indicating the importance of fine-tuning and the temporal network in the overall model.
The confusion matrices of the variants are shown in Figure 9. TmmNet achieves a 100% detection rate for PD; its misclassification rate for HC is still high, though better than that of the other variants. The worst PD detection rate belongs to TmmNet (NoTN), and the highest HC error rate belongs to TmmNet (NoFT). The fine-tuning part, attention, and temporal information thus play a significant role in the TmmNet framework. 24.74% of healthy subjects are classified as PD patients, because the pronunciation of some mild PD patients in the training data is similar to that of healthy people.
We also use ROC (receiver operating characteristic) curves and AUC (area under the curve) values to compare the performance of these variants (Figure 10). The closer the ROC curve is to the upper left corner, the better the performance. AUC is defined as the area enclosed by the ROC curve and the coordinate axes; it is a performance metric used to evaluate binary classifiers, and the amount by which the AUC exceeds 0.5 measures how much the algorithm outperforms random guessing. TmmNet, TmmNet (NoAttn), and TmmNet (NoConv) exceed 80%, while TmmNet (NoFT) and TmmNet (NoTN) exceed 70%.
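The four indicators and the AUC can be computed with scikit-learn as sketched below; the helper name evaluate and the 0.5 decision threshold are illustrative.

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_curve, auc)

def evaluate(y_true, y_prob, threshold=0.5):
    """Report precision, recall, F1, accuracy, and AUC; y_prob is the
    predicted probability of the PD class."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    fpr, tpr, _ = roc_curve(y_true, y_prob)   # points of the ROC curve
    return {"precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
            "accuracy": accuracy_score(y_true, y_pred),
            "auc": auc(fpr, tpr)}
```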

Results of Machine Learning Models.
In this section, we compare eight machine learning methods, i.e., DT, GBDT, LDA, KNN, LightGBM, LR, RF, and XGBoost, for the classification of speech signals from PD patients and healthy people. The ROC curves are shown in Figure 11. The AUC value of TmmNet is the highest, followed by LightGBM, a gradient boosting framework using a decision tree-based learning algorithm that is distributed and suitable for large datasets. DT and KNN perform worst. When KNN faces sample imbalance, the predicted accuracy for rare categories is low; DT is prone to overfitting and easily ignores correlations between attributes in the dataset. For the input speech data, the frames are interrelated and the sample volume is large, which is also why machine learning methods can be applied.

MDVR-KCL dataset: As illustrated in Table 3, we report the classification results of the eight machine learning methods on the MDVR-KCL dataset.
IPVS dataset: We also used the IPVS dataset to distinguish the characteristics of healthy subjects and patients with Parkinson's disease by following the pronunciation of the five vowels /a/, /e/, /i/, /o/, and /u/.
The classification performance of the machine learning methods is reported in Table 4. The results show that our proposed TmmNet is significantly superior to the traditional machine learning algorithms in the five-vowel pronunciation classification, with an average accuracy of more than 99%. This also means that our model can be directly used for speech disorder prediagnosis. Comparing the pronunciations of the five vowels, the /e/ and /i/ pronunciation patterns of the subjects are the hardest to distinguish, and the average performance of the machine learning methods on them is the worst, but TmmNet still exceeds 99%. It is worth mentioning that RF stands out among the compared methods and is comparable to the proposed TmmNet.
In addition, the ROC curve clearly shows which machine learning method is more suitable for the speech datasets. The comparison of AUC values is shown in Figure 12; we list only the evaluation results for the vowel /a/. The AUC values of LightGBM and RF rank first and second, respectively, showing their advantages among the traditional classifiers. The performance gaps among the other classifiers are relatively small, indicating that the data are highly separable and well suited to machine learning methods.
Results of Deep Models.

MDVR-KCL dataset: In Table 5, comparing the CNN-based models with the temporal models, we can see that the average classification performance of the temporal models is better, which is related to the sequential form of speech data. On the one hand, among the CNN-based models, DenseCNN obtains the highest accuracy: its aggressive dense connection mechanism connects all layers and directly concatenates the feature maps from different layers, achieving feature reuse and improving efficiency. The results of ResNet50 and ResCNN are relatively poor; the sparsity of the data during training leads to overfitting, making them inferior to the other models. On the other hand, the bidirectional memory models with an attention mechanism (BiGRU (Attn), BiLSTM (Attn)) in the temporal group perform satisfactorily thanks to their special gating mechanisms and expandable unit construction. Inspired by the advantages of these models, our proposed TmmNet integrates spatiotemporal features on a transfer learning infrastructure. It surpasses these mainstream models and becomes an effective tool that best fits the speech data being trained.
Furthermore, the ROC curves of the 11 different deep models on the MDVR-KCL dataset are shown in Figure 13. The results in Table 5 are consistent with the performance ranking of the ROC curves. The performance of the ResNet-based models and HMM is relatively poor. The corresponding AUC values also show that the inflection point of TmmNet is closer to the upper left, while the results of ResNet50, HMM, ResCNN, and LDA clearly lag behind the other deep models.
IPVS dataset: We also compared five deep learning methods in Table 6; their performance is significantly better than that of the machine learning methods, as the accuracy of pronunciation classification for each vowel is above 99%, except that HMM is clearly at a disadvantage, showing its limitations. We therefore give priority to the other networks as speech recognition algorithms.
All the machine learning methods and deep models compared in this paper achieve generally excellent classification results on this dataset, indicating that the monosyllabic pronunciation of subjects is easier to distinguish than reading a long passage or holding a dialogue. The research in this paper can serve as a reference for the prediagnosis and severity assessment of Parkinson's disease.
ROC and AUC are used to evaluate the classification performance of the above deep learning models in Figure 14. The performance of the deep models is not far from that of the traditional machine learning algorithms; owing to the high sensitivity of the temporal network to speech sequences, its average performance is slightly higher than that of the machine learning methods and the other deep learning models.

Global Evaluation.
In this section, we evaluate the overall effect of the TmmNet model and use its confusion matrices on the two datasets to describe the classification accuracy. In addition, a perceptual experiment is conducted to evaluate the classification of speech disorders.
MDVR-KCL dataset: We use the confusion matrix in Figure 15 to show the classification results of TmmNet on the MDVR-KCL dataset. The accuracy of screening for PD reached 100%, although some healthy subjects were wrongly classified as PD patients: in the lower left corner of the confusion matrix, 24.74% of the healthy subjects' samples were wrongly assigned to the PD class. We extracted one wrongly assigned sample and found that its feature distribution was close to that of the PD samples. IPVS dataset: We use the confusion matrix in Figure 16 to show the classification results of TmmNet for the pronunciation /a/. The probability of correct classification on the diagonal is above 99%. Compared with the MDVR-KCL dataset, the accuracy of monosyllabic classification on the IPVS dataset is significantly higher, which is inevitably related to the high complexity of long texts. Therefore, when conducting speech tests on subjects, vowels can be tested first and long texts afterward, which can effectively detect Parkinson's disease and evaluate its severity.
Perceptual Experiment. A total of 20 nonmedical subjects took listening tests in a quiet room of 20 square meters. We randomly selected an audio sample from the PD patients, and each person was limited to 10 seconds of listening before judging whether it came from a PD patient. Of the 20 people, 12 correctly recognized the audio as that of a PD patient, a combined accuracy of 60.00%; upon inquiry, the possibility of guessing cannot be ruled out. This recognition rate is much lower than the results of the model proposed in this article. We also invited a PD expert to conduct listening tests on 50 samples from the datasets, including 25 PD samples. The expert recognized PD audio with 84% accuracy and HC audio with 92% accuracy, for an overall accuracy of 88.00%. By contrast, the probability of accurate diagnosis of Parkinson's disease from audio using the proposed scheme exceeds 90%, providing a reference for automated diagnosis research. Through consistency testing, the Kappa values of TmmNet are 0.9980 (/a/), 0.9952 (/e/), 0.9926 (/i/), 0.9974 (/o/), and 0.9981 (/u/) on the IPVS dataset and 0.7863 on the MDVR-KCL dataset. TmmNet clearly performs much better on IPVS than on MDVR-KCL, achieving almost perfect consistency on the former and substantial consistency on the latter. The confusion matrix on the MDVR-KCL dataset is relatively imbalanced: since the detection rate of PD in the test set is 100%, there is a data imbalance problem. Other models show the same problem, and the data should be filtered or augmented in future work.
Severity Assessment. Furthermore, we adopt a Parkinson's disease speech dataset to validate the proposed model, and the results show that TmmNet also performs well in classifying the severity of PD speech. This study can first detect patients with potential Parkinson's disease from speech data and then evaluate their severity. The experimental results are shown in Table 7. According to the Hoehn and Yahr scale, the speech data in the MDVR-KCL dataset are classified into four categories: healthy individuals, PD1 level, PD2 level, and PD3 level, fully labeled by expert evaluation scores. We compared five deep learning methods [48-51], including models based on convolutional neural networks, transformers, and transfer learning for speech emotion recognition, and achieved good results, which is sufficient to show that these deep models can evaluate the severity of Parkinson's speech; the effectiveness of the proposed TmmNet is also remarkable.
Furthermore, we utilize the t-SNE visualization method to show the classification ability of our proposed method. The experimental results are shown in Figure 17.
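A minimal sketch of the t-SNE visualization, assuming scikit-learn and matplotlib; feats and labels stand for the learned features and ground-truth classes, and the file names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.load("tmmnet_features.npy")    # (N, D) penultimate-layer features
labels = np.load("tmmnet_labels.npy")     # 0 = HC, 1 = PD
emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
for cls, name in [(0, "HC"), (1, "PD")]:
    pts = emb[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=name)
plt.legend()
plt.show()
```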

Conclusion
We have presented techniques for screening out PD patients, or samples with potential PD, from the speech data of subjects, including MFCC feature extraction and a transfer learning scheme in which a pretrained triplet hybrid model and a reconstructed temporal model provide a high-level expression of the MFCC features. Experiments have shown that our method can not only be applied to the detection of monosyllabic vowels in patients with Parkinson's disease but also analyze and recognize a period of spontaneous dialogue. Although the latter effect is not as good as the former, it can serve as a reference for the detection and classification of PD speech. With these strong results, we hope to stimulate more research in this direction so that the ability of transfer learning models to process speech sequence data can eventually be improved and speech modeling promoted. This work was supported in part by Grant no. 2022M711741, the Natural Science Foundation of Shandong Province under Grant no. ZR2021QF084, and the Shandong Province Higher Education Institutions Youth Innovation and Technology Support Program (2023KJ365).


Figure 1: Speech recognition process. The speech recognition system can be divided into four parts: PD pronunciation test, speech data collection, speech recognition and PD detection, and feedback for diagnosis.

Figure 2: The structure of TmmNet. There are three modules: a data preprocessor for MFCC feature extraction, a pretrained model for multilevel and multiscale feature extraction, and a reconstructed model for temporal information acquisition and feature classification.

Figure 4: The structure of the transformer block.

Figure 3: The process of the multihead attention mechanism.

Figure 5: The structure of the multiscale Conv block.

Figure 6: The internal structure of the convolutional block attention.

Figure 10: ROC curves of the four constituent variants and the overall model.

Figure 11: ROC curves of the eight machine learning methods and TmmNet.

Figure 12: ROC curves of the machine learning methods and TmmNet.

Figure 13: ROC curves of the deep models and TmmNet on the MDVR-KCL dataset.

Cohen's Kappa. Cohen's kappa is an indicator used for consistency testing and can also measure the effectiveness of classification. For classification problems, consistency refers to whether the predicted results of the model agree with the actual classes. Cohen's kappa is computed from the confusion matrix, with values ranging from −1 to 1 and usually greater than 0:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad (11)$$

where $p_o$ is the sum of correctly classified samples over all classes divided by the total number of samples, i.e., the overall classification accuracy, and $p_e$ is the sum over all classes of the product of the actual and predicted counts, divided by the square of the total number of samples. Five ranges represent different levels of consistency: low consistency is [0.0, 0.20], fair consistency is [0.21, 0.40], moderate consistency is [0.41, 0.60], substantial consistency is [0.61, 0.80], and almost perfect consistency is [0.81, 1].
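Equation (11) can be computed directly from a confusion matrix, as in the sketch below; sklearn.metrics.cohen_kappa_score gives the same value from raw label vectors.

```python
import numpy as np

def cohen_kappa(cm):
    """Kappa from a confusion matrix, following equation (11)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p_o = np.trace(cm) / n                                   # observed accuracy
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Example: rows = actual (HC, PD), columns = predicted
print(cohen_kappa([[95, 5], [20, 80]]))   # about 0.75, substantial consistency
```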

Figure 14: ROC curves of the deep models and TmmNet on the IPVS dataset.
Figure 15: Confusion matrix of TmmNet on the MDVR-KCL dataset.
Figure 16: Confusion matrix of TmmNet for the pronunciation /a/ on the IPVS dataset.

Figure 17: Comparison of t-SNE feature dimensionality reduction on the MDVR-KCL dataset. (a, b) Comparison between the original data and the features extracted by our model after dimensionality reduction on the MDVR-KCL dataset.

Table 2: Results of the four constituent variants on the MDVR-KCL dataset. The bold values represent the highest results among the split models.

Table 3: Results of the eight machine learning methods on the MDVR-KCL dataset. The bold values represent the highest results among the machine learning classifiers.

Table 4: Performance of machine learning methods on the IPVS dataset. The bold values represent the highest results among all the compared methods.

Table 5: Performance of deep models on the MDVR-KCL dataset. The bold values represent the highest results among the deep learning models.

Table 6: Performance of deep models on the IPVS dataset. The bold values represent the highest results among all the compared methods.

Table 7: Performance of deep models on the MDVR-KCL dataset for severity rating.