FPT-Former: A Flexible Parallel Transformer of Recognizing Depression by Using Audiovisual Expert-Knowledge-Based Multimodal Measures



Introduction
The recognition that mental disorders are significant contributors to the burden of disease is growing [1]. Depression is currently one of the most prevalent mental illnesses, characterized predominantly by persistent, long-term feelings of low mood [2]. According to the World Health Organization (WHO), depression is projected to become the most prevalent mental disorder by 2030 [3]. In extreme cases, depression can result in suicide [4]. At present, there is no distinct and effective clinical definition for depression, so the diagnosis process can be both subjective and lengthy. Artificial intelligence and mathematical modeling methods are increasingly being employed in mental health research in an attempt to address this issue. These techniques can benefit the field of depression detection because they can exploit detailed data to distinguish among various depressive disorders [5].
A multitude of automatic depression estimation (ADE) systems have been developed [6]. Many audiovisual features can also be used to diagnose depression [7,8]. Zhou et al. [9] put forward a deep architecture named DepressNet, designed to learn representations from images for depression recognition. He et al. [10] proposed a network that integrates 2D-CNN networks and an attention mechanism for depression recognition. While the majority of earlier research concentrated on single-modal data, recent studies have demonstrated that multimodal data can provide superior predictive performance compared to single-modal data [11]. Yang et al. [12] introduced a multimodal fusion framework that integrates deep convolutional neural network (DCNN) and deep neural network (DNN) models. This model uses audio, video, and text streams as inputs and is aimed at detecting depression. However, how to better mine sequential information, how to better utilize multimodal information, and which features improve diagnostic accuracy most effectively are still topics that need further research.
Using raw audio and video as inputs risks disclosing personal information, and the sheer volume of raw video can also slow down prediction. Using knowledge-based descriptors as inputs is an alternative strategy. Facial expressions and the acoustic characteristics of speech are the two main categories of knowledge-based measures. Facial expressions serve as a powerful means of conveying emotions to others. Psychologists have meticulously modeled these expressions, culminating in a reference guide known as the Facial Action Coding System (FACS) [13], which is used in this experiment. This system catalogs the combinations of facial muscles involved in each expression and can be utilized as a tool to discern an individual's emotional state through their facial expression. The acoustic characteristics of speech have also been recognized as potential indicators of depression [14]. In this study, a set of acoustic features known as the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) is used. This set was devised for use in various areas of speech analysis [15]. The Mel-Frequency Cepstral Coefficient (MFCC) method is also one of the foremost techniques used for the extraction of speech features [16], and these features are also used in the model proposed in this article.
Numerous open-source datasets about depression are available, offering a valuable resource for researchers in mental health. Shen et al. [17] presented the Emotional Audio-Textual Depression Corpus (EATD-Corpus), which comprises audio recordings and transcribed responses gathered from both depressed and nondepressed individuals. However, this corpus contains only audio and textual data. Cai et al. [18] introduced a multimodal open dataset designed for the analysis of mental disorders. This dataset incorporates EEG and audio data from clinically depressed patients, as well as corresponding data from unaffected control subjects, but its sample size is only slightly above 50. The dataset used in this paper is the Extended Distress Analysis Interview Corpus (E-DAIC) [19], an enhanced version of WOZ-DAIC, which comprises semiclinical interviews intended for assisting in the diagnosis of psychological distress conditions. These resources stem from the Audio/Visual Emotion Challenge and Workshop (AVEC 2019), an event dedicated to the comparison of multimedia and machine learning techniques for health and emotion analysis [19].
Drawing inspiration from the potent learning capacity of transformers, a sequence transduction model based wholly on attention mechanisms [20], we propose the FPT-Former framework. This model is composed of multiple parallel encoders, one for each modality, which create low-dimensional global feature vectors encapsulating compact information. By combining these with expert knowledge, the model enhances the prediction accuracy of PHQ-8 scores [21]. FPT-Former is specifically tailored to process diverse data types (facial expressions, audio-MFCC, and audio-eGeMAPS) for accurate depression severity estimation. Our model differs from existing methodologies by combining parallel transformer encoders for each modality with a fusion layer for effective information convergence. Our FPT-Former achieved performance comparable to state-of-the-art works, with a root mean squared error (RMSE) of 4.80 and a mean absolute error (MAE) of 4.58.
The key contributions of this paper can be summarized as follows: (1) a flexible parallel transformer model is proposed for depression recognition; (2) the fusion of audiovisual expert-knowledge-based multimodal measures increases prediction accuracy; (3) the parallel structure adapts to different numbers of measures in diverse modalities; and (4) the use of low-dimensional video features in the proposed transformer model increases prediction efficiency and avoids personal privacy leakage.
The paper is organized as follows: Section 2 introduces the related work on automatic depression diagnosis. Section 3 introduces the expert-knowledge-based features that serve as inputs to the FPT-Former. Section 4 illustrates the framework of the proposed FPT-Former. Section 5 presents and analyzes the experimental results. Section 6 concludes the paper and discusses potential future work.

Related Work
In the area of depression recognition, many researchers have made progress. Du et al. [22] introduced the machine speech chain model for depression recognition (MSCDR), highlighting the significance of vocal tract changes as important markers for depression. Yang et al. [23] addressed the challenge of speech depression detection by proposing the DALF framework, which employs attention-guided learnable time-domain filterbanks. By learning acoustic features and spectral attention, DALF outperformed state-of-the-art methods for detecting depression from audio signals. In a similar vein, Niu et al.'s "Depressioner" model [24] turned its attention to facial dynamics, identifying facial changes as potential biomarkers for depression levels. In 2022, Kakuba et al. [25] introduced the concurrent spatial-temporal and grammatical (CoSTGA) model, a deep learning-based approach. This model is designed to simultaneously acquire spatial, temporal, and semantic representations within the local feature learning block (LFLB). These representations are then combined into a latent vector, which serves as the input for the global feature learning block (GFLB). The same authors also presented an attention-based multilearning model (ABMD) [26] that leverages residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) layers featuring multihead attention. The ABMD model delivers comparable performance while efficiently capturing global contextualized long-term dependencies among features in a parallel manner, which can be used in speech emotion recognition.
Moreover, researchers have been proactive in embracing multimodal approaches to improve the accuracy of depression recognition. Li et al.'s multimodal hierarchical attention (MHA) model [27], designed for social media settings, improves recognition performance by integrating various data types. This model employs attention mechanisms and combines multiple data sources, which highlights the significance of taking a holistic approach. In tandem, privacy concerns were addressed by Pan et al.'s AVA-DepressNet [28]. This model pays attention to facial privacy preservation while concurrently enhancing audiovisual features, addressing the ethical issues of applying technology in sensitive domains such as mental health. Zou et al. [29] developed a Chinese Multimodal Depression Corpus (CMDC) by conducting semistructured interviews with depression patients. Through feature analysis and benchmark evaluations, they established the effectiveness of their multimodal fusion approach, showcasing its potential for automatic depression screening. Zhang et al. [30] introduced a two-stream deep network within a depression detection framework, which achieved state-of-the-art performance on the AVEC-2014 datasets [31].
Extending the scope, Zhao et al.'s work [32] pushed the boundaries of depression detection through the analysis of image-based data. They introduced frequency attention, tapping into the distinctive traits of depression images to uncover significant patterns in depression patients.
Besides, the temporal dimension emerges as a recurring motif. He et al.'s DepNet [33] used deep learning to extract spatial-temporal patterns from video-based facial sequences. By scrutinizing patterns over time, this approach offers a deeper understanding of the dynamic nature of depression manifestations.
Lastly, the treatment of mental illnesses is also a field of interest for deep learning. Singh et al. proposed a cost-effective, socially designed robot named "Tinku" for teaching and assisting children with autism spectrum disorder [34].
In addition, the continuous introduction of new deep learning models, as well as improvements and adjustments to these models, has promoted research in this field [35][36][37].

Feature Descriptors
The data utilized in this study are derived from the Extended Distress Analysis Interview Corpus (E-DAIC) [19], a broader version of WOZ-DAIC, which comprises semiclinical interviews intended for assisting in the diagnosis of psychological distress conditions. The research for this paper involves 275 subjects, with each participant contributing three sets of data, specifically FACS, eGeMAPS, and MFCC. Every subject is assigned a Patient Health Questionnaire (PHQ-8) score. All the initial descriptors used in this paper are grounded in expert knowledge.

Video Descriptors.
The video descriptors in this experiment were extracted with OpenFace, a facial behavior analysis toolkit that can accurately detect head pose, recognize facial action units, and estimate eye gaze [38]. Each subject's interview video is processed with OpenFace. After this processing, every frame of the video is described by 49 distinct feature values.
The first to sixth feature values constitute the subject's head pose. The orientation of the head in relation to the camera can be represented through rotation and shift. Figure 1 illustrates the description of a head rotation transformation. The spatial coordinates corresponding to the head's position are also provided.
Features 7 to 14 represent the subject's eye gaze estimation. The gaze direction vector of each eye is represented by three numbers, and the averages of the horizontal and vertical gaze angles (in radians) of the two eyes provide two further features.
The Facial Action Coding System (FACS) [13] is a taxonomy of human facial movements defined by their manifestation on the face. It encodes the movements of individual facial muscles, discerning subtle instantaneous changes in facial appearance. With FACS, nearly any anatomically possible facial expression can be coded by breaking it down into specific action units (AUs). It serves as a widely accepted standard for objectively describing facial expressions [39]. Eighteen action units were considered in this study, all of which are more typically associated with the expression of negative emotions; these AUs are described in Table 1.
Each AU is denoted by two values: its presence and its intensity (except AU28, for which only presence is determined). The presence indicates whether the AU is visibly apparent on the face, and the intensity indicates the strength of the AU, rated on a 5-point scale ranging from minimal to maximal.
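The feature count above can be checked with a few lines of arithmetic. The following is a minimal sketch; the constant and function names are illustrative assumptions and are not part of OpenFace or of the original implementation.

```python
# Illustrative tally of the 49 per-frame visual features described in the text.
HEAD_POSE = 6    # head rotation and position values (features 1 to 6)
EYE_GAZE = 8     # features 7 to 14
NUM_AUS = 18     # action units considered in this study


def visual_feature_count() -> int:
    # Two 3-D gaze vectors (one per eye) plus two averaged gaze angles.
    gaze = 2 * 3 + 2
    assert gaze == EYE_GAZE
    # Every AU has a presence value; every AU except AU28 also has an intensity.
    au_values = NUM_AUS + (NUM_AUS - 1)
    return HEAD_POSE + EYE_GAZE + au_values


print(visual_feature_count())
```

Under this reading, the 6 head pose values, 8 gaze values, and 35 AU values account for all 49 features per frame.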

Audio Descriptors: MFCC.
The first set of audio expert-knowledge-based measures is the Mel-Frequency Cepstral Coefficients (MFCC), which encompasses 39 features.

International Journal of Intelligent Systems
Research in psychophysics has revealed that the human perception of frequency in speech signals does not follow a linear scale. Therefore, for each tone with an actual frequency $f$ measured in hertz (Hz), a subjective pitch is assessed on a scale known as the "Mel" scale, which is commonly calculated as follows [40]:
$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right).$$
In this context, $f_{mel}$ corresponds to the subjective pitch on the "Mel" scale that is associated with a specific frequency measured in hertz (Hz). This understanding forms the foundation for the definition of the Mel-Frequency Cepstral Coefficients (MFCCs), a fundamental set of acoustic features used in speech and speaker recognition applications [41].
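As an illustration of the Hz-to-Mel mapping, the sketch below uses the widely used form $f_{mel} = 2595 \log_{10}(1 + f/700)$. This particular constant pair is an assumption here, since other variants of the Mel formula exist; the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the Mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


for f in (100.0, 1000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

With this variant, 1000 Hz maps to approximately 1000 Mel, and the scale grows roughly logarithmically at higher frequencies.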
In the data employed for this experiment, 13 coefficients are preserved following the Discrete Cosine Transform (DCT), and the first and second derivatives of these 13 coefficients are also computed. In summary, the Mel-Frequency Cepstral Coefficient (MFCC) expert-knowledge-based dataset comprises a total of 39 features.
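The composition of the 39 features (13 static coefficients plus their first and second derivatives) can be sketched as follows. The derivative here is a plain frame-to-frame difference, which is an assumption made for brevity; feature extraction toolkits typically use a regression-based delta over several frames, and the exact formula used for this dataset is not stated in the text.

```python
def first_difference(frames):
    """Frame-to-frame difference; the first frame gets a zero delta."""
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def stack_mfcc_features(static):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    delta = first_difference(static)
    delta2 = first_difference(delta)
    return [s + d + dd for s, d, dd in zip(static, delta, delta2)]


# Toy input: 5 frames of 13 static coefficients each.
static = [[float(t * (k + 1)) for k in range(13)] for t in range(5)]
features = stack_mfcc_features(static)
print(len(features), len(features[0]))
```

Each frame then carries 13 + 13 + 13 = 39 values, matching the dimensionality reported above.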

Audio Descriptors: eGeMAPS.
Valuable information encapsulating emotional indicators can be extracted from audio signals, which can contribute significantly to the diagnosis of depression. Researchers are investigating various dimensions, such as the identification of emotional states, the conveyance of emotional signals through voice, the impact of emotions in language, and the automatic detection of speaker emotions, to enhance depression prediction efficacy.
eGeMAPS is an expanded version of GeMAPS [15], augmenting its set of 18 low-level descriptors (LLDs) with several new features. These additions include five spectral features, namely the first four Mel-Frequency Cepstral Coefficients (MFCC1-4) and the spectral difference between consecutive frames, alongside two frequency-dependent attributes: the bandwidths of the second and third formants. eGeMAPS contains 88 features in total, which extract acoustic parameters from speech to characterize vocal emotional expressions.
This paper uses the 23 of these 88 features that are most relevant to the diagnosis of depression, and every feature is smoothed using a window size of three frames. Detailed descriptions of these 23 features are given in Table 2.
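The three-frame smoothing can be illustrated with a simple moving average. This is a minimal sketch assuming a symmetric, equally weighted window that is truncated at the sequence boundaries; the text does not specify the kernel or the boundary handling, and the function name is hypothetical.

```python
def smooth_three_frames(values):
    """Symmetric 3-frame moving average with truncated windows at the edges."""
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - 1): min(n, i + 2)]
        smoothed.append(sum(window) / len(window))
    return smoothed


print(smooth_three_frames([0.0, 3.0, 6.0, 3.0]))  # [1.5, 3.0, 4.0, 4.5]
```

Applied independently to each of the 23 feature tracks, this kind of filter suppresses frame-level jitter while leaving slower trends intact.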

Framework of Flexible Parallel Transformer
The network introduced in this study is called the FPT-Former, a flexible, parallel transformer network specifically built to process multimodal data (facial expression, audio-MFCC, and audio-eGeMAPS measures) for depression identification. The FPT-Former architecture consists of an input layer, a set of parallel encoders (one for each modality), and a fusion layer that merges the information learned across the modalities. The total number of trainable parameters for the entire model is 234,463, and a ReduceLROnPlateau scheduler is used to adjust the learning rate based on the validation loss, which helps the model converge. This section provides a detailed exposition of the network architecture and of the transformations that the data undergo at each phase of the FPT-Former pipeline, as illustrated in Figure 2.

Data Preprocessing and Input.
Our dataset contains multimodal data including facial expression, audio-MFCC, and audio-eGeMAPS measures. The facial expression features, derived with OpenFace and including the FACS action units, consist of 49 dimensions, the audio-MFCC measures comprise 39 dimensions, and the audio-eGeMAPS measures comprise 23 dimensions. To preserve the temporal information across the sequence of video or audio frames, an additional feature value indicating the frame serial number is appended to each modality, resulting in dimensions of 50, 40, and 24 for the facial expression, audio-MFCC, and audio-eGeMAPS measures, respectively. Each of the three modalities is sampled at one frame every 0.1 s, and each subject contributes 4,146 frames.
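Appending the frame serial number to each frame can be sketched as below. The names are hypothetical, and numbering frames from zero is an assumption; the text does not state whether the serial number starts at 0 or 1 or whether it is normalized.

```python
def append_frame_index(frames):
    """Return a copy of each frame with its serial number appended as a feature."""
    return [list(frame) + [float(i)] for i, frame in enumerate(frames)]


# Per-frame feature counts before the serial number is added.
BASE_DIMS = {"facial": 49, "mfcc": 39, "egemaps": 23}

for name, dim in BASE_DIMS.items():
    dummy = [[0.0] * dim for _ in range(4)]   # four placeholder frames
    with_index = append_frame_index(dummy)
    print(name, len(with_index[0]))
```

The resulting per-frame dimensions are 50, 40, and 24, which match the input sizes used by the three encoder pathways.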

Input Layer.
The journey of data through the network commences at the input layer. Here, the raw multimodal data are introduced into the system frame by frame. The modality-specific data comprise facial expression, audio-MFCC, and audio-eGeMAPS measures of dimension 50, 40, and 24, respectively, corresponding to the different modalities of the dataset. This approach is predicated on the understanding that capturing the temporal dynamics inherent in the frame sequence is paramount to the effectiveness of the model.

Encoder Stage.
Following the input layer is the encoder stage. This stage consists of three parallel encoders, each designed for a specific modality: facial expression, audio-MFCC, and audio-eGeMAPS measures.

Transformer and Self-Attention.
The Transformer architecture, proposed by Vaswani et al. [20], has emerged as a groundbreaking paradigm in sequential data modeling, thanks to its self-attention mechanism. This mechanism allows the model to weigh the significance of different positions within a sequence while processing each element, making it well-suited for capturing long-range dependencies and relationships. The essence of the self-attention mechanism lies in the QKV (Query-Key-Value) mechanism, which is illustrated in Figure 3.
Given an input sequence of vectors $X = (x_1, x_2, \ldots, x_i)$, the self-attention mechanism calculates the weighted sum of value vectors based on their relevance to a query vector:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where $Q$ represents the query matrix, $K$ represents the key matrix, $V$ represents the value matrix, and $d_k$ is the dimension of the key vectors. The scaled dot-product operation inside the softmax function computes the compatibility between each query and key pair. The result is a set of attention scores that determine how much each value contributes to the final output.
In the context of our model, the self-attention mechanism enables the encoder to focus on relevant features within a sequence. This is particularly beneficial when processing multimodal data, as it helps capture meaningful interactions between different elements.
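To make the scaled dot-product operation concrete, the following is a minimal single-head sketch in plain Python, following the standard formulation of Vaswani et al. [20]. It is not the authors' implementation: the helper names are illustrative, and the learned projections that produce Q, K, and V are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scaled_dot_product_attention(q, k, v):
    """Return (output, attention weights) for single-head attention."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(q, k, v)
print(weights)
```

Each row of the weight matrix sums to one, so every output row is a convex combination of the value vectors, weighted by how well the corresponding query matches each key.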

Encoder Components.
In this architecture, data from the three different modalities are individually channeled into three separate encoder pathways. Each pathway follows a multistep process.
Initially, the data are subjected to a multihead self-attention mechanism with up to five attention heads, which captures intricate relationships and dependencies within each modality. Then, a layer normalization step is applied after the multihead self-attention process. The transformers are tailored to the specific requirements of each data modality, with different numbers of heads and layers. The model uses 5, 4, and 4 self-attention heads in its three transformer modules, respectively. Following layer normalization, the data pass through position-wise feed-forward networks. To mitigate the risk of vanishing gradients, residual connections are employed; these connections allow the output of each layer to be combined with its input. Three identical encoder layers are stacked one upon the other. The number of encoder layers and self-attention heads was determined after multiple attempts in our computational environment. Each encoder layer encompasses all the aforementioned components, and this stacking increases the model's representational capacity. At the end of this process, each pathway yields an output vector with dimensions [1,32].

Fusion Layer.
After the three encoders produce their respective outputs, three [1,32] feature vectors are obtained. These vectors are then combined in a fusion layer, resulting in a single comprehensive feature vector of dimensions [1,96]. This aggregated vector encompasses the essential information from the three modalities. Subsequently, the feature vector is passed through a fully connected layer, yielding the final PHQ-8 score, which is used to predict depression severity.
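The fusion step can be sketched as a concatenation followed by a single linear map. The sketch below is illustrative only: the encoder outputs and the weights are random placeholders rather than trained values, the function names are hypothetical, and the exact configuration of the fully connected layer is not given in the text.

```python
import random


def fuse(vectors):
    """Concatenate the per-modality feature vectors into one vector."""
    fused = []
    for vec in vectors:
        fused.extend(vec)
    return fused


def linear(x, weights, bias):
    """A single-output fully connected layer: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


random.seed(0)
# Placeholder outputs of the three encoder pathways (32 values each).
encoder_outputs = [[random.uniform(-1, 1) for _ in range(32)] for _ in range(3)]
fused = fuse(encoder_outputs)

# Untrained placeholder weights for the regression head.
weights = [random.uniform(-0.1, 0.1) for _ in range(len(fused))]
score = linear(fused, weights, bias=0.0)
print(len(fused))
```

Because the fusion is a plain concatenation, adding or removing a modality only changes the length of the fused vector, which is what makes the parallel design easy to adapt.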
In summary, the FPT-Former exploits the richness of information intrinsic to the different modalities and combines them to improve the prediction accuracy of depression severity. The next section examines the effectiveness of our model, substantiated by empirical results from our experiments.

Experiments and Analysis
In this section, we analyze the experimental results, assess our proposed FPT-Former model, and compare it with existing state-of-the-art techniques. Furthermore, through ablation studies, we substantiate the effectiveness of the FPT-Former model in estimating depression severity from expert-knowledge-based multimodal measures.

Dataset Split and Model Training.
This study makes use of the E-DAIC dataset, which comprises data from 275 participants. Each participant's data represent one sample, which includes visual features from OpenFace 2.1.0 as well as eGeMAPS and MFCC features extracted using OpenSMILE [42]. For each sample, the maximum number of frames considered is 12,438 for visual data and 41,460 for both MFCC and eGeMAPS features.
To ensure the consistency of the input across all samples, we restrict the data for each modality.For visual features, we select every third frame, resulting in 4146 frames per sample.For audio features, every tenth frame is chosen, resulting in 4146 frames per sample for each of these modalities as well.
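The frame selection described above amounts to strided subsampling. A minimal sketch follows; the function name and the `target_len` parameter are assumptions, and the handling of recordings shorter than the maximum length (for example, padding) is not described in the text and is therefore not modeled.

```python
def subsample(frames, step, target_len=4146):
    """Keep every `step`-th frame and cap the result at `target_len` frames."""
    return frames[::step][:target_len]


visual = list(range(12438))   # placeholder indices for visual frames
audio = list(range(41460))    # placeholder indices for MFCC or eGeMAPS frames

print(len(subsample(visual, 3)), len(subsample(audio, 10)))
```

Both strides yield 4,146 frames per sample, so the three modalities stay aligned in time after subsampling.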
We employ a ten-fold cross-validation scheme for our model training and evaluation, thus splitting our dataset into ten partitions.For each fold, nine partitions are used for training, and one partition is left out for testing.
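The ten-fold protocol can be sketched as follows. The shuffling, the seed, and the way the 275 samples are distributed across folds are assumptions made for illustration; the text does not state how the partitions were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal partitions
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(275))
for train, test in splits[:2]:
    print(len(train), len(test))
```

With 275 subjects, each held-out partition contains 27 or 28 samples, and every subject appears in exactly one test partition.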
The FPT-Former is trained using the Adam optimizer with an initial learning rate of 0.01 and a cosine annealing schedule for learning rate decay. The model is evaluated using two metrics, RMSE and MAE, both of which are calculated on the validation set for each fold.
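RMSE and MAE are defined as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - r_i'\right)^2}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|r_i - r_i'\right|,$$
where $N$ is the total number of observations, $r_i$ is the prediction from the model, and $r_i'$ is the actual observed value. Each fold is repeated and the reported results are averaged over all folds. In this manner, we ensure a robust estimation of our model's performance.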
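Both metrics are straightforward to compute from paired predictions and observed scores. The sketch below uses made-up toy values purely for illustration; they are not data from the experiments.

```python
import math


def rmse(pred, actual):
    """Root mean squared error between predictions and observed values."""
    n = len(pred)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / n)


def mae(pred, actual):
    """Mean absolute error between predictions and observed values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


# Toy PHQ-8 values for illustration only.
pred = [4.0, 10.0, 15.0]
actual = [2.0, 10.0, 19.0]
print(rmse(pred, actual), mae(pred, actual))
```

Because RMSE squares the errors before averaging, it penalizes large individual errors more heavily than MAE and is never smaller than MAE on the same data.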

Depression Recognition Results.
To establish the effectiveness of our proposed FPT-Former model in multimodal depression severity estimation, we compared it with several existing state-of-the-art methods. The comparative evaluation focused on the primary performance metrics: RMSE and MAE.
Table 3 outlines the performance of our method against others. Al Hanai et al. [43] employed audio and text features in an LSTM neural network, achieving an RMSE of 6.50 and an MAE of 5.13. Zhang et al. [44] introduced an autoencoder model with BiGRU for speech-based depression severity prediction, resulting in an RMSE of 5.68 and an MAE of 4.64. Yang et al. [45] integrated speech, text, and face data with DCGAN for feature augmentation, yielding an RMSE of 5.52 and an MAE of 4.63. Han et al. [46] proposed a spatial-temporal feature network for speech-based depression detection, achieving an RMSE of 6.29 and an MAE of 5.38, while Fang et al. [47] presented a multimodal fusion model with a multilevel attention mechanism for depression detection, with an RMSE of 5.17. Our FPT-Former achieved performance comparable to these state-of-the-art works, with an RMSE of 4.80 and an MAE of 4.58. To ensure comparability, all the methods listed in Table 3 used E-DAIC as the dataset.
To provide a better understanding of the agreement between our FPT-Former model's predictions and the actual depression severity scores, we conducted a Bland-Altman analysis. The Bland-Altman plot (the left part of Figure 4) depicts the difference between the predicted depression severity scores and the actual scores on the y-axis, against the average of the two scores on the x-axis. The regression analysis (the right part of Figure 4) also allows us to observe whether the model exhibits consistent deviations across different levels of depression severity.
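The quantities plotted in a Bland-Altman analysis can be computed as in the sketch below. The limits of agreement at the mean difference plus or minus 1.96 standard deviations follow the usual convention and are an assumption here, since the text does not state which limits were drawn; the input values are made-up toy data.

```python
import statistics


def bland_altman(pred, actual):
    """Return per-sample means and differences, the bias, and the limits of agreement."""
    diffs = [p - a for p, a in zip(pred, actual)]
    means = [(p + a) / 2 for p, a in zip(pred, actual)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


# Toy PHQ-8 values for illustration only.
means, diffs, bias, limits = bland_altman([4.0, 10.0, 15.0, 8.0], [2.0, 10.0, 19.0, 8.0])
print(bias, limits)
```

The means go on the x-axis and the differences on the y-axis; a bias far from zero or a trend in the differences would indicate systematic over- or under-estimation at particular severity levels.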

Ablation Experiment.
To better understand the contribution of each modality to our model's performance, we conducted ablation experiments. These experiments systematically removed one or two modalities from the multimodal model and observed the effect on performance.

Single Modality Ablation.
In the single modality ablation experiments, we individually removed each modality (FACS, MFCC, or eGeMAPS) from our FPT-Former model and observed the change in model performance. Table 4 presents the results of the ablation experiments, showing the RMSE and MAE values when each modality was removed.
From the results presented in Table 4, it can be observed that each modality plays a vital role in the performance of the FPT-Former model. Removing any one of the modalities leads to an increase in RMSE and MAE, indicating a decline in prediction accuracy. This underlines the importance of multimodal data and the synergy between these modalities in making accurate predictions. The extent of performance degradation varies with the modality removed, suggesting that each modality contributes differently to the overall model's performance.

Double Modality Ablation.
Next, we examined the interplay between different modalities by conducting double modality ablation experiments. Here, we removed two modalities at a time and evaluated the performance of the model with only the one remaining modality. The results are shown in Table 5.
These findings reinforce the notion that each expert-knowledge-based modality carries unique and valuable information for the task of depression severity estimation. Relying on a single modality can cause the loss of essential information. This underlines the significance of combining expert-knowledge-based measures in developing robust predictive models for depression recognition. To visually represent these findings, a bar plot was generated (Figure 5) to compare the RMSE and MAE values for the different ablation scenarios. The plot illustrates the impact of removing each modality on the model's predictive accuracy.
The dataset we utilized originates from the AVEC 2019 challenge [19], which provided a baseline for comparison. We compared the results of our double modality ablation experiment with the challenge baseline (the challenge provided the RMSE only), and the results (Table 6) showed that our model demonstrated better performance in predictions.
The results reveal that the removal of FACS has the most significant impact, leading to the highest increase in both RMSE and MAE values. On the other hand, removing MFCC causes a comparatively smaller increase in RMSE and MAE, indicating its relatively lower contribution to the model's performance. It is noteworthy that each modality plays a distinct role, and their removal affects the model's predictive capabilities differently.

Depression Classification.
In addition to estimating depression severity scores, we further conducted a classification task to distinguish between normal subjects and individuals with depression. This binary classification task allows us to evaluate the model's capability to differentiate between the two categories based on a threshold of 10 points on the Patient Health Questionnaire (PHQ-8) score [21]. To demonstrate the model's generalization, we also conducted the same testing on the AVEC-2014 [48] dataset, which contains 150 subjects. The AVEC-2014 dataset uses BDI-II scores as labels, with a threshold of 21 to distinguish between individuals with depression and those without depression [49].
To provide a visual representation of our classification model's performance, we constructed a confusion matrix. The confusion matrix displays the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. This matrix, shown in Figure 6, provides insights into the model's strengths and weaknesses in terms of correctly and incorrectly classified instances. In the E-DAIC dataset (Figure 6(a)), 61 of 86 (70.93%) subjects with depression and 157 of 189 (83.07%) normal subjects were correctly predicted. In the AVEC-2014 dataset (Figure 6(b)), 32 of 45 (71.11%) subjects with depression and 81 of 105 (77.14%) normal subjects were correctly predicted.
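The confusion-matrix counts reported for E-DAIC can be turned into standard summary metrics. The sketch below uses the usual definitions of accuracy (ACC), sensitivity (SEN), specificity (SPE), positive predictive value (PPV), and negative predictive value (NPV); the function name is hypothetical, and the FN and FP counts are derived from the totals stated above (86 depressed and 189 normal subjects).

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "SEN": tp / (tp + fn),    # true positives among actual positives
        "SPE": tn / (tn + fp),    # true negatives among actual negatives
        "PPV": tp / (tp + fp),    # true positives among predicted positives
        "NPV": tn / (tn + fn),    # true negatives among predicted negatives
    }


# E-DAIC counts: 61 of 86 depressed and 157 of 189 normal subjects correct.
edaic = classification_metrics(tp=61, fn=86 - 61, tn=157, fp=189 - 157)
print({name: round(value, 2) for name, value in edaic.items()})
# {'ACC': 0.79, 'SEN': 0.71, 'SPE': 0.83, 'PPV': 0.66, 'NPV': 0.86}
```

Reporting PPV and NPV alongside sensitivity and specificity is useful here because the classes are imbalanced, with roughly twice as many normal subjects as depressed ones.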
For assessing the performance of our classification model, we employed the following evaluation metrics, which are shown in Figure 7. Accuracy (ACC) is defined as the proportion of correctly classified instances among all instances. Sensitivity (SEN) denotes the ratio of true positive predictions to the actual positive instances (depressed individuals). Specificity (SPE) is the ratio of true negative predictions to actual negative instances (normal individuals). Positive predictive value (PPV) is the proportion of true positive predictions among the instances that the model classified as positive. Negative predictive value (NPV) indicates the proportion of true negative predictions among the instances that the model classified as negative. Here, ACC, SEN, SPE, PPV, and NPV reach 0.79, 0.71, 0.83, 0.66, and 0.86, respectively, on the E-DAIC dataset (Figure 7(a)), and the same indices reach 0.75, 0.58, 0.85, 0.71, and 0.78, respectively, on the AVEC-2014 dataset (Figure 7(b)).

Limitations and Future Works.
Despite the promising results achieved by our proposed FPT-Former model, there exist several limitations that highlight avenues for future research. First, the model is trained and evaluated using the E-DAIC dataset and the AVEC-2014 dataset. Although these datasets are widely accepted, the generalizability of the model can be further validated using other multimodal datasets that are more diverse in terms of demographic characteristics and cultural contexts. Future work can involve conducting experiments on more datasets to improve the robustness and universality of the model. Second, our study focused on three modalities: facial expressions, audio-MFCC, and audio-eGeMAPS. While these are undoubtedly important, depression manifests in various other ways. Future studies could consider incorporating additional modalities, such as text from patient interviews, physiological signals like heart rate variability, or even social interaction patterns [50]. Furthermore, it is essential to note that the current implementation of the FPT-Former relies on a concatenation method for multimodal fusion. In future research, we plan to explore more fusion techniques, such as attention mechanisms, tensor fusion, or hierarchical fusion, to enhance the model's accuracy and better capture the interdependencies among different modalities.
Despite these limitations, the FPT-Former represents a significant step towards a more comprehensive, accurate, and nuanced approach to depression severity estimation. Future work guided by these identified areas of improvement holds the potential to enhance the predictive capability of the model and broaden its applicability in real-world scenarios.

Conclusions
In this study, we introduced a novel flexible parallel transformer model, the FPT-Former, designed to harness the power of multimodal data in recognizing depression. Through its architecture, this model circumvents the challenge of quantitative differences across various modality features and provides a robust solution to reduce prediction errors. Besides, this model's ability to adapt to different numbers of measures across diverse modalities underlines its flexibility and applicability. Our FPT-Former model incorporates expert-knowledge-based audiovisual measures, facilitating the extraction of meaningful patterns from data while maintaining the low dimensionality of input features. By employing low-dimensional measures as inputs, our model not only increases predictive efficiency but also addresses concerns related to personal privacy leakage, which is paramount in mental health applications. Experimental results on the E-DAIC dataset show that our model achieves lower RMSE and MAE than the compared techniques. The ablation studies further reveal the integral role each modality plays in achieving this performance. Integrating multiple modalities and capturing long-term temporal dependencies from videos has the potential to detect depression accurately. After comprehensive evaluation, the FPT-Former may become a useful diagnostic tool for mental health and contribute to global efforts in addressing this critical mental health issue.
In conclusion, this research contributes to the understanding and technology of mental health diagnostics. The FPT-Former model, with its emphasis on expert-knowledge integration and privacy protection, not only advances the field of depression detection but also promotes the development of intelligent systems in mental healthcare. Its flexible structure and high predictive efficiency make it a potential tool for clinicians and researchers.

Figure 1: Relationship between head posture angle change and head motion.

Figure 4: The Bland-Altman plot and regression analysis for FPT-Former depression severity predictions.

Figure 5: The effect of modality ablation on model performance.

Table 1: List of AUs in OpenFace.

Table 2: Selected eGeMAPS features for depression diagnosis.

Table 3: The performance of depression recognition on the E-DAIC dataset. The bold font indicates the lowest value among the compared studies.

Table 4: The performance of FPT-Former when each modality is removed.

Table 5: The performance of FPT-Former when only one modality is used.

Table 6: Comparison between FPT-Former (when only one modality is used) and the baseline of AVEC-2019. The bold font indicates that the RMSE of our model is lower than the previous three models.