Kernel Probabilistic Dependent-Independent Canonical Correlation Analysis

There is growing interest in developing linear and nonlinear feature fusion methods that fuse the features elicited from two different sources of information to achieve a higher recognition rate. In this regard, canonical correlation analysis (CCA), cross-modal factor analysis (CFA), and probabilistic CCA (PCCA) have been introduced to better deal with data variability and uncertainty. In our previous research, we developed the kernel version of PCCA (KPCCA) to capture both the nonlinear and the probabilistic relations between the features of two different source signals. However, KPCCA is only able to estimate the latent variables that are statistically correlated between the features of two independent modalities. To overcome this drawback, we propose a kernel version of the probabilistic dependent-independent CCA (PDICCA) method to capture the nonlinear relations between both dependent and independent latent variables. We have compared the proposed method to the PDICCA, CCA, KCCA, CFA, and kernel CFA (KCFA) methods over the eNTERFACE and RML datasets for audio-visual emotion recognition and the M2VTS dataset for audio-visual speech recognition. Empirical results on the three datasets indicate the superiority of both the PDICCA and kernel PDICCA methods to their counterparts.


Introduction
It is evident that data collection from a single sensor (modality) does not capture all the discriminative information in an observation. For instance, when we film a person during his/her speech, neither the voice signal nor the video frames alone can capture all the information in the input states. Therefore, multisource data collection has attracted researchers' attention because the various data from an observation can compensate for each other's uncertainty, variability, and partial observation. For instance, when we listen to somebody and watch him/her simultaneously, even if we cannot hear a word well, we can guess the corresponding word by observing his/her lip motions. Thus, in this research, we fuse audio and visual information to achieve a higher speech recognition rate. Although a few fusion techniques have been developed to fuse the elicited correlated features of different modalities, this research proposes a novel approach for fusing both dependent and independent probabilistic information of feature vectors extracted from two different modalities.
1.1. Background. It has been shown empirically that raw audio and video data pass through several neural processing stages [1], which are nonlinear, and are eventually integrated with each other via a complex neural process. Bimodal data collection is used repeatedly in applications such as medical image fusion [2], multimodal interaction [3], multimodal emotion detection [4], and audio-visual speech recognition [5, 6].
Information fusion can be carried out at different levels of data processing, including raw data fusion, feature fusion, model fusion, and decision fusion. Among these approaches, feature fusion [7, 8] is of concern in this study because it can consider the linear/nonlinear relations among the elicited features. It is worth noting that feature fusion is different from feature concatenation [9]. In other words, extracting feature vectors from different sources of data and arranging them into one long feature vector is not feature fusion. Feature concatenation strongly affects the number of training parameters in several classifiers such as the Bayes classifier [10] or deep learning schemes [11, 12]. In the case of a small sample size problem, the covariance of such a small dataset is underestimated. From another perspective, conventional classifiers are unable either to capture the interaction of all features or to tolerate the uncertainty and variability of features [6, 13-15].
Therefore, an efficient feature fusion method should not produce very high-dimensional feature vectors. To mimic the human fusion system, representative features from each modality are elicited and then projected into a new space (processing spikes at a higher level) in such a way that the projected features of these modalities have a maximum correlation or a minimum distance, while having a proper size. Therefore, by fusing the projected features (e.g., audio and video features) in the correlation space, better recognition performance can be obtained [4].
Canonical correlation analysis (CCA) [16] is a well-known technique for feature fusion that identifies the shared (dependent) information between two different sources of data. CCA is determined by optimizing two different linear projections of the features belonging to two different modalities. These features are projected into a new space (called the correlation space), in which the cross-correlation of the two projected features is maximized. These projected features are called latent variables/features. Nonetheless, CCA has some drawbacks, such as a lack of understanding of the stochastic nature of the features. Moreover, CCA is unable to capture nonlinear relations between features. Cross-modal factor analysis (CFA) [17] is another feature fusion method that, similar to CCA, projects the input features of two different modalities into a new space, but in such a way that the distance (Frobenius norm) between the projected features is minimized. CCA and CFA have been adopted in practical multimodal recognition systems such as face recognition [18], signal processing [19], monitoring and fault detection [20], audio-visual speaker detection [21], fusion of multimodal medical imaging [22, 23], and audio-visual synchronization [24]. To enable both CCA and CFA to capture input nonlinearities, their kernel versions, called KCCA [25] and KCFA [26], were developed. They have been applied to various data fusion applications such as specific radar emitter identification [27], audio-visual emotion recognition [26, 28], and feature selection [20, 29]. Nevertheless, during the recording of audio-video signals, a few undesired disturbing factors occur, such as slight movement of the recording camera or the speaker moving closer to and farther from the recording microphone. For instance, in a bimodal recognition system [26], the KCFA scheme is deployed to elicit the latent variables of audio and video data, but the achieved results are not convincing. This is because ignoring the variability factors during the recording affects the quality of the recorded data and degrades the performance of the recognizer.
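As an illustration of the CCA projection described above, the following is a minimal NumPy sketch, not the implementation used in this paper. The function name `linear_cca`, the small ridge term `reg`, and the toy data are assumptions introduced only for this example. It whitens each view and takes the singular value decomposition of the whitened cross-covariance, which is one standard way to obtain the canonical directions.

```python
import numpy as np

def linear_cca(X, Y, n_components, reg=1e-8):
    """Return projections (A, B) and the leading canonical correlations.

    X: (n_samples, d_x), Y: (n_samples, d_y). `reg` is a small ridge term
    added only for numerical stability (an assumption of this sketch).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    # Whiten each view with its Cholesky factor, then decompose the
    # whitened cross-covariance; singular values are the correlations.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)

    # Map the whitened directions back to the original feature spaces.
    A = np.linalg.solve(Lx.T, U[:, :n_components])
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return A, B, s[:n_components]

# Toy data: one shared latent factor observed through two noisy views.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, -0.5, 0.3, 0.8]]) + 0.1 * rng.normal(size=(500, 4))
Y = z @ np.array([[0.7, -1.2, 0.4]]) + 0.1 * rng.normal(size=(500, 3))
A, B, corr = linear_cca(X, Y, n_components=1)
proj_corr = np.corrcoef((X @ A)[:, 0], (Y @ B)[:, 0])[0, 1]
```

The projected features `X @ A` and `Y @ B` play the role of the latent features in the correlation space; their empirical correlation `proj_corr` should agree with the leading canonical correlation `corr[0]`.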
In practice, each set of recorded data has a degree of randomness for several reasons, such as the movement of sources during data acquisition, power line noise, and the induction noise of other equipment. To capture the stochastic nature as well as the variability of recorded data from two modalities, Bach and Jordan [30] proposed the probabilistic CCA (PCCA) model. Although PCCA is linear, we proposed its kernel version in our previous study to capture the nonlinearity of the features elicited from two modalities [6].
Although correlated features can compensate for each other's shortcomings, independent features do not suffer from redundancy and reveal a new perspective on an input observation. Notably, some fusion methods estimate only the dependent (shared) features between two modalities of data, while a few of them use both dependent and independent features. Thus, employing both dependent and independent features in the PCCA framework, which is termed PDICCA in the literature [31], allows us to move from a partial description toward a wider observability of the inputs.
Since PDICCA is a linear method and cannot capture nonlinear relations between the elicited features, the main contribution of this study is to kernelize the PDICCA method; we call the result KPDICCA hereafter. The proposed method is able to capture different aspects of the inputs such as dependencies, independencies, uncertainty, variability, and nonlinearities. When only a limited number of samples is available, kernel methods are able to provide convincing results because the size of the kernel matrix depends on the number of feature vectors. In contrast, to get a convincing result from a classifier, the number of training samples should be high. Hence, there is a tradeoff between applying a kernel to the input features and training a classifier well.
The rest of this paper is structured as follows. In Section 2, the PDICCA method and the proposed method are introduced. Section 3 introduces the deployed datasets and their feature extraction techniques. Section 4 presents the experimental results obtained from the proposed method along with state-of-the-art methods, and their achievements are compared and discussed. Finally, the paper is concluded in Section 5.

Methodology
In this section, PDICCA is first briefly explained, and then the proposed method (KPDICCA) is introduced.

PDICCA.
To address the inability of the CCA and KCCA models to capture uncertainty, one solution is a probabilistic model, optimized by the maximum likelihood (ML) method [30], that considers a linear projection between the observations of the sources and the dependencies among latent variables. Building on a Gaussian distribution assumption, the probabilistic CCA (PCCA) method is able to model the variability of data and outlines a solution for the CCA problem. PCCA has been extended [31] by incorporating a dependent variable $z$, similar to CCA, and two other independent latent variables $z_x$ and $z_y$, which belong exclusively to the modalities $x$ and $y$, respectively, as described in Figure 1. This method is termed probabilistic dependent-independent CCA (PDICCA) and is categorized as a generative data model. PDICCA captures both the dependence and the independence of latent variables as follows:

$$x = f(z \mid W_x) + g(z_x \mid B_x) + \epsilon_x, \qquad y = f(z \mid W_y) + g(z_y \mid B_y) + \epsilon_y,$$

where $f(\cdot)$ and $g(\cdot)$ are two deterministic functions that transform the dependent latent variables ($z$) and the independent latent variables ($z_x$, $z_y$) to the observation space, and $\epsilon_x$ and $\epsilon_y$ denote the additive noise on the $x$ and $y$ observations, respectively.
For simplification, Klami and Kaski [31] use two linear functions, $f(z \mid W_i) = z W_i^T$ and $g(z_i \mid B_i) = z_i B_i^T$, and they consider an independent Gaussian distribution with equal variance for the noise terms, $\epsilon_i \sim N(0, \sigma_i^2 I)$. Furthermore, they assume a Gaussian distribution with zero mean and unit covariance for the latent variables ($z$, $z_x$, $z_y$).
Therefore, we can write

$$x = z W_x^T + z_x B_x^T + \epsilon_x, \qquad y = z W_y^T + z_y B_y^T + \epsilon_y.$$

To solve the above equation, first the shared latent variable $z$ and the projection matrices $W_x$ and $W_y$ must be estimated while the set-specific latent variables $z_x$ and $z_y$ are marginalized out. Afterward, the probabilistic model is marginalized over the shared latent variable $z$, and then $B_i$ and $\sigma_i$ ($i = x$ or $y$) can be optimized accordingly. This learning scheme is summarized in pseudocode in Algorithm 1.
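To make the linear generative model concrete, the following is a minimal NumPy sketch that draws samples using the linear functions $f(z \mid W_i) = z W_i^T$ and $g(z_i \mid B_i) = z_i B_i^T$ and compares the empirical covariances with the ones the model implies. All dimensions, noise levels, and the random projection matrices are illustrative assumptions, not values from this paper, and the helper `rel_err` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_y = 20000, 4, 3        # sample count and observed dimensions (illustrative)
k, k_x, k_y = 2, 1, 1            # shared and set-specific latent sizes (illustrative)
sigma_x, sigma_y = 0.3, 0.2      # noise standard deviations (illustrative)

# Projection matrices for the shared (W) and set-specific (B) latent variables.
W_x, W_y = rng.normal(size=(d_x, k)), rng.normal(size=(d_y, k))
B_x, B_y = rng.normal(size=(d_x, k_x)), rng.normal(size=(d_y, k_y))

# Latent variables with zero mean and unit covariance.
z = rng.normal(size=(n, k))
z_x = rng.normal(size=(n, k_x))
z_y = rng.normal(size=(n, k_y))

# Observations: shared part + set-specific part + isotropic Gaussian noise.
x = z @ W_x.T + z_x @ B_x.T + sigma_x * rng.normal(size=(n, d_x))
y = z @ W_y.T + z_y @ B_y.T + sigma_y * rng.normal(size=(n, d_y))

# Covariances implied by the model.
cov_x_model = W_x @ W_x.T + B_x @ B_x.T + sigma_x**2 * np.eye(d_x)
cov_y_model = W_y @ W_y.T + B_y @ B_y.T + sigma_y**2 * np.eye(d_y)
cross_model = W_x @ W_y.T        # only the shared variable couples x and y

# Empirical counterparts (the model has zero mean, so no centering is needed).
cov_x_emp = x.T @ x / n
cross_emp = x.T @ y / n

def rel_err(a, b):
    """Relative Frobenius-norm error of a with respect to b."""
    return np.linalg.norm(a - b) / np.linalg.norm(b)
```

The cross-covariance depends only on $W_x$ and $W_y$, which is why the shared projections can be estimated after the set-specific variables are marginalized out, while $B_i$ and $\sigma_i$ only shape the within-modality covariance.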
The CCA, CFA, and PDICCA methods are all linear approaches, which are capable of finding linear relationships between two synchronously recorded modalities. It should be noted that these models cannot capture the nonlinear correlations between two sets of features elicited from two different modalities. Herein, a nonlinear kernel is inserted into PDICCA to overcome this drawback.

Algorithm 1: PDICCA algorithm.
(1) Assume that $B_x$ and $B_y$ are fixed and marginalize over $z_x$ and $z_y$ to update the shared projection $W$. Here, $\varphi$ is a block-diagonal matrix that consists of $\varphi_x$ and $\varphi_y$, and $\Sigma$ is the joint sample covariance matrix.
(2) Marginalize over $z$ to update $B_x$ and then $\sigma_x^2$, where $d_x$ is the dimensionality of $x$ and $B_x$ is the new value just updated. Repeat the above two substeps for the parameters related to $y$, replacing all subscripts $x$ with $y$.

The Proposed Method.
Our approach is similar to KCCA [25] and KCFA [26]. To the best of the authors' knowledge, a kernel version of PDICCA has not been proposed yet. This paper proposes the kernel version of PDICCA (KPDICCA) to account for the nonlinear relations among the observations (from the audio and visual modalities). To equip the PDICCA method to capture nonlinear relations between the observations and their elicited latent variables, the kernel derivation of PDICCA is developed by implicitly mapping the data from the original space to a higher-dimensional space and then applying the Klami and Kaski method [31] to find the dependent and independent latent variables, as shown in Figure 2. To derive the formulas, we first assume that all latent variables are normally distributed. Similar to the derivation of KCCA, we can write

$$z_x, z_y, z \sim N(0, I).$$

To simplify the above relations, similar to KCCA, we assume that $W_i$ and $B_i$ ($i = x$ or $y$), the transformation matrices that project $\phi(x)$ and $\psi(y)$ into the $\alpha$ and $\beta$ subspaces, can be written as follows:

$$W_x = \phi(x)^T \alpha_x, \quad W_y = \psi(y)^T \alpha_y, \quad B_x = \phi(x)^T \beta_x, \quad B_y = \psi(y)^T \beta_y.$$

Considering these forms of $W_i$ and $B_i$ ($i = x$ or $y$), we can derive a learning method using the expectation maximization (EM) algorithm [31] to obtain the parameters $\theta = \{\alpha_i, \beta_i, \delta_i^2\}$, where $i \in \{x, y\}$ and $\delta_i^2$ is the noise variance of the primary domain. Assuming that $\beta_x$ and $\beta_y$ are fixed and marginalizing over the independent factors $z_x$ and $z_y$, we construct a probabilistic model that depends only on $W_x$ and $W_y$, with the following covariances:

$$\varphi_x = \phi(x)^T \left(\beta_x \beta_x^T + \delta_x^2 I\right) \phi(x), \qquad \varphi_y = \psi(y)^T \left(\beta_y \beta_y^T + \delta_y^2 I\right) \psi(y). \quad (5)$$

The resulting model is Gaussian, with a mean that is linear in $\alpha$ and a covariance matrix built from $\varphi_x$ and $\varphi_y$. If $K_x$ and $K_y$ are invertible, the update formula for $\alpha$ can be obtained from this model. In the second step, we marginalize over the parameter $z$ and use a similar method to obtain an updating function for $\beta_i$, after which the noise variance is recovered. The posterior expectations of the shared and set-specific latent variables given an observation $x$ can be obtained by applying ML estimation to the probabilistic model; the corresponding expressions for an observation $y$ follow by replacing every subscript $x$ by $y$ and $\phi$ by $\psi$. If $K_x$ and $K_y$ are not invertible, these update equations do not provide a solution for KPDICCA. This problem can be solved using a regularization approach similar to the one that has been presented for KCCA.
In this study, we use the methods of Golub et al. [32] and Koskinen et al. [33], both of which incorporate a priori knowledge, where $r$ is a regularization parameter, and apply the EM algorithm to obtain the regularized update relations for $\alpha$ and $\beta_i$, in which $\lambda_\varphi = \mathrm{trace}(\varphi)$. In the KPDICCA method, we used the expectation maximization (EM) algorithm presented in [34]. In order to infer $z$, we need to marginalize out the latent independent factors $z_x$ and $z_y$. Therefore, we assume that $\beta_x$ is fixed (and hence $B_x$ is fixed) and marginalize over the independent factor.
As in the study [34], this involves an integral over $p(\tilde{x} \mid z_x)\, p(z_x)$, where $\tilde{x} = \phi(x) - z \alpha_x^T \phi(x)$. As the prior $p(z_x)$ is Gaussian and is multiplied by a Gaussian term whose mean is linear in $z_x$, we can integrate $z_x$ out analytically, obtaining $\tilde{x} \sim N(0, \varphi_x)$, where $\varphi_x = \phi(x)^T (\beta_x \beta_x^T + \delta_x^2 I) \phi(x)$. Doing the same marginalization for $z_y$ leads to the generative model.
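The analytic integration used here rests on a standard Gaussian identity. As a hedged illustration, written in the linear (non-kernel) notation of the PDICCA section and with $\tilde{x} = x - z W_x^T$ introduced only for this example, marginalizing the set-specific latent variable gives:

```latex
\begin{aligned}
p(\tilde{x}) &= \int N\!\left(\tilde{x} \mid z_x B_x^T,\ \sigma_x^2 I\right)\,
                     N\!\left(z_x \mid 0,\ I\right)\, dz_x \\
             &= N\!\left(\tilde{x} \mid 0,\ B_x B_x^T + \sigma_x^2 I\right).
\end{aligned}
```

The covariance $B_x B_x^T + \sigma_x^2 I$ is the linear counterpart of $\varphi_x$: the set-specific variation and the noise are absorbed into a single covariance term, so that only the shared variable $z$ remains to be inferred.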
If we consider $\alpha = [\alpha_x \ \alpha_y]^T$ and $\varpi(O) = \begin{bmatrix} \phi(x) & 0 \\ 0 & \psi(y) \end{bmatrix}$, the above model can be written compactly in terms of $\alpha$ and $\varpi(O)$. This is exactly the model proposed in [30] for interpreting CCA probabilistically. To obtain the above solution, we implicitly assume that the dimensionalities of $z_x$ and $z_y$ are sufficiently high to produce an unconstrained covariance matrix ($\varphi$). However, Klami and Kaski [31] propose a new algorithm that does not require this assumption, giving a more general EM algorithm for linear projections. The algorithm includes an additional step that marginalizes $z$ out to enable estimation of the set-specific matrices ($B_x$ and $B_y$).
The EM algorithm for optimizing the extended probabilistic CCA is described in Algorithm 1 and repeats the two steps until convergence. This method is a linear approach and is capable of finding linear relationships between two modalities. To extend this approach to nonlinear relations, similar to the KCCA method, we consider $W$ and $B$ as in (10) and (11) and substitute them into Klami's method to obtain the EM algorithm for updating the parameters $\theta = \{\alpha, \beta, \delta^2\}$.
By substituting equation (10) into step (1) of the EM algorithm (Algorithm 1), and considering that $K_x$ and $K_y$ are invertible, we have $\varpi(O)\varpi(O)^{-1} = I$. Using EM to update $W$, we can then obtain the update equation for the $\alpha$ parameter. In the second step, we marginalize over the parameter $z$ and use a similar method to obtain an updating function for $\beta_i$, where $i \in \{x, y\}$. Finally, the variance is recovered using the model described in equation (9).

Application
In this paper, we assess the proposed method and its competitors on speech recognition and emotion recognition datasets. Here, the M2VTS audio-visual database [35] is used for speech recognition. The M2VTS dataset includes 185 recordings from 37 subjects (12 females and 25 males). Each speaker utters five shots. The subjects utter the digits from "0" to "9" within each shot, and their audio and video signals are recorded. The sampling rate of the audio signals is 48 kHz, and the frame rate of the video is 25 Hz. Several features for characterizing speech signals have been proposed, such as cepstral coefficients [13], the discrete cosine transform (DCT), mel-frequency cepstral coefficients (MFCCs) [9], and the perceptual linear predictor (PLP) [36]. First, the background signal is removed from the speech, and then the cleaned speech signal is segmented into successive Hamming windows with 50% overlap, where the length of each window is 512 samples. From each windowed signal, 12 MFCC coefficients are extracted. Since our processing is simultaneous, we have to characterize the lip motion in parallel. In this regard, the lip contour should first be elicited so that its key points can be traced in successive frames. Here, we use the method of Rohani et al. [13], in which a colored face image is first divided into lip and nonlip clusters. This segmentation is done by simulating a simple geometric lip model and applying spatial fuzzy C-means clustering in order to extract the lip contour. The geometric lip model described by equation (25) is presented in Figure 3.
After matching the lip model to each image, a lip contour is extracted. The lip model contains six features (two key points in the upper lip and four points in the lower lip) that need to be traced in successive frames.
The employed emotion recognition datasets are eNTERFACE and RML, both of which include six emotional states: anger, disgust, fear, happiness, sadness, and surprise [37, 38]. In eNTERFACE, 44 subjects participated; their video is recorded at 25 frames per second, and their acoustic signals are recorded at a sampling rate of 48 kHz. On the other hand, in the Ryerson (RML) database, eight subjects speak six different languages, generating three believable reactions to all the situations. Their acoustic sampling rate is 22050 Hz, and the video frame rate is 30 frames per second.
For the emotion recognition datasets, similar processing stages are applied. In order to remove the speech noise, we take the wavelet transform of the signal and, by thresholding the energy of the wavelet coefficients on different scales, remove those scales whose energy values are less than an empirical threshold [39]. After reconstructing the signals, the energy of the signal in the time domain within each window is first determined [40], and then the first 12 MFCC features [26, 34] are added to the feature vector. In the emotion recognition system, facial expression features play a very important role. The challenging issue in video processing is to extract the face margin precisely. In this research, the Haar cascade technique [41] is employed to detect the face region. The image in each frame is resized to 64 × 64 pixels. Afterward, a Gabor wavelet filter [42] with five scales and eight orientations is applied to each image to elicit key facial features [6, 26]. Nonetheless, Gabor feature vectors are high-dimensional. To reduce the feature size, principal component analysis (PCA) is deployed.
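The windowing step described above can be sketched as follows. This is a minimal NumPy example, not the code used in this study: the function name `frame_signal` and the synthetic one-second tone are assumptions, and the subsequent MFCC extraction is not shown.

```python
import numpy as np

def frame_signal(signal, frame_len=512, overlap=0.5):
    """Split a 1-D signal into Hamming-windowed frames.

    With frame_len=512 and overlap=0.5, consecutive frames start 256
    samples apart. Trailing samples that do not fill a frame are dropped
    (a choice made for this sketch).
    """
    hop = int(frame_len * (1.0 - overlap))
    n_frames = 1 + (len(signal) - frame_len) // hop
    if n_frames <= 0:
        return np.empty((0, frame_len))
    window = np.hamming(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return frames * window

# One second of a synthetic 440 Hz tone at the 48 kHz sampling rate.
fs = 48000
t = np.arange(fs) / fs
signal = np.sin(2.0 * np.pi * 440.0 * t)
frames = frame_signal(signal)
```

Each row of `frames` is one windowed segment from which the 12 MFCC coefficients would then be computed.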
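To see why dimensionality reduction is needed, note that if every filter response is kept for every pixel (an assumption; the study may subsample), a 64 × 64 image filtered at five scales and eight orientations yields 64 × 64 × 40 = 163,840 values per frame. The following is a minimal sketch of PCA via the singular value decomposition; the function names and the random stand-in data are hypothetical, and the number of retained components is illustrative.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on the rows of X; return mean, components, explained ratio."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    explained = (s**2)[:n_components] / np.sum(s**2)
    return mean, components, explained

def pca_transform(X, mean, components):
    """Project the rows of X onto the retained principal directions."""
    return (X - mean) @ components.T

# Stand-in for high-dimensional feature vectors (50 frames, 200 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
mean, components, explained = pca_fit(X, n_components=10)
Z = pca_transform(X, mean, components)
```

In practice the projection would be fitted on the training frames only and then applied unchanged to the test frames.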

Experimental Results
In this section, the results of applying the proposed method along with PDICCA, KPCCA, KCCA, CFA, and KCFA to the described speech recognition and emotion recognition datasets are presented. As described before, the M2VTS database (https://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/) [35] for audio-visual speech recognition and the eNTERFACE (https://www.enterface.net/enterface05/docs/results/databases/project2_database.zip) [43] and Ryerson (RML) (https://www.kaggle.com/datasets/ryersonmultimedialab/ryerson-emotion-database) [44] databases for audio-visual emotion recognition are employed in this research. The model parameters are estimated during the cross-validation phase. For the speech dataset, the number of hidden states and the number of Gaussians per state in the hidden Markov model (HMM) are set to three and one, respectively. For the emotion recognition datasets, the number of HMM hidden states and the number of Gaussians per state are six and three, respectively. The variance parameter of the Gaussian kernel is set to 14 for both applications.
The computational complexity of the kernel-based methods depends on the number of samples. Here, the dimension of the audio-visual features is 200. In addition, 10 persons are selected randomly from among the subjects in each experiment to overcome the memory shortage. 70% of the subjects are selected for the training phase, and the rest are used as the test set. We repeat this split 10 times, and the final results are determined by averaging over the experiments. The final results are shown in Figures 4-15.
Figures 4 and 5 report the audio-visual emotion recognition accuracy and F1-score on the eNTERFACE and Ryerson (RML) datasets for the CCA, CFA, and PDICCA methods; the results are calculated for different dimension sizes.
Figure 6 reports the audio-visual emotion recognition ROC curves for the best accuracy results on the eNTERFACE and Ryerson (RML) datasets for the CCA, CFA, and PDICCA methods; the ROC curves are shown for each emotion class.
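As a rough illustration of how a Gaussian kernel with this variance setting produces a Gram matrix, and of why a regularization parameter $r$ helps when that matrix is not invertible, consider the following NumPy sketch. The kernel form $\exp(-\lVert a - b \rVert^2 / (2 \cdot \text{variance}))$, the ridge-style term $K + rI$, and the toy data are assumptions of this example; they are not necessarily the exact scheme of [32, 33].

```python
import numpy as np

def gaussian_gram(X, variance):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * variance))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    return np.exp(-d2 / (2.0 * variance))

def regularized_inverse(K, r):
    """Inverse of K + r*I, a ridge-style fix for a singular Gram matrix."""
    n = K.shape[0]
    return np.linalg.solve(K + r * np.eye(n), np.eye(n))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X[5] = X[0]                           # a duplicated sample makes K singular
K = gaussian_gram(X, variance=14.0)
K_reg_inv = regularized_inverse(K, r=0.4)
```

The Gram matrix has one row and one column per sample regardless of the feature dimension, so its size grows with the number of samples.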
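The evaluation protocol above can be sketched in plain Python. The function names are hypothetical, the placeholder scores are not results from this paper, and re-drawing the 10 subjects in every repetition is an assumption about how the protocol is applied.

```python
import random

def repeated_subject_splits(subject_ids, n_selected=10, train_fraction=0.7,
                            n_repeats=10, seed=0):
    """Draw n_selected subjects per repeat and split them into train/test."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_selected)
    splits = []
    for _ in range(n_repeats):
        # sample() draws without replacement, so train and test are disjoint.
        chosen = rng.sample(subject_ids, n_selected)
        splits.append((chosen[:n_train], chosen[n_train:]))
    return splits

def mean_score(scores):
    """Average a list of per-repeat scores."""
    return sum(scores) / len(scores)

subjects = list(range(37))            # e.g., the 37 M2VTS subjects
splits = repeated_subject_splits(subjects)

# Placeholder per-repeat scores, used only to show the averaging step.
placeholder_scores = [0.5] * len(splits)
final_score = mean_score(placeholder_scores)
```

Splitting by subject rather than by recording keeps all recordings of a test speaker out of the training set.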
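For reference, the two reported metrics can be computed as in the following plain-Python sketch. Macro-averaging of the per-class F1-scores is an assumption, since the averaging scheme is not stated here, and the label sequences are made-up toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for three emotion classes.
y_true = ["anger", "anger", "fear", "fear", "sad", "sad"]
y_pred = ["anger", "fear", "fear", "fear", "sad", "anger"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case four of six labels are correct, and the per-class F1-scores are 0.5 (anger), 0.8 (fear), and 2/3 (sad), which are then averaged.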

The experimental results for audio-visual speech recognition accuracy and F1-score using MFCC and PLP acoustic features for feature fusion and decision fusion are shown in Figures 7 and 8, respectively. In the proposed algorithm, dimension sizes between one and five are considered for the independent subspaces. The test results indicate that the best results at the feature level and the decision level are achieved with three and five dimensions. The ROC curves of each class for the best accuracy results are reported in Figure 9. These results show that fusing the dependent and independent parts of bimodal data can improve the recognition accuracy. Nevertheless, the accuracy of the linear methods is still not acceptable for a real application; therefore, we applied the kernel versions of these methods to the same features in order to increase the accuracy.
Figures 10 and 11 depict the experimental results for the proposed KPDICCA and the state-of-the-art KCCA and KCFA methods on the eNTERFACE and Ryerson (RML) datasets based on the accuracy and F1-score metrics. In the proposed KPDICCA method, we set different dimension sizes for the independent latent variables and report the validation results. These results are reported for different values of the regularization parameter, and they show that this parameter affects the recognition performance; however, it is difficult to identify an optimum interval for it.

International Journal of Intelligent Systems
For instance, in the decision fusion case, the best results for the eNTERFACE and RML datasets are obtained at $r = 1.0$ and $r = 0.4$, respectively.
Figure 12 reports the audio-visual emotion recognition ROC curves for the best accuracy results on the eNTERFACE and Ryerson (RML) datasets for the KPDICCA, KCCA, and KCFA methods; the ROC curves are shown for each emotion class.
The audio-visual speech recognition accuracy and F1-score of the conventional methods, together with those of the proposed method, on the M2VTS database using MFCC and PLP acoustic features are presented in Figures 13 and 14. In KPDICCA, dimension sizes between one and six are considered for the independent space, and the results indicate that the best result is achieved with an independent dimension of four.
Figure 15 depicts the audio-visual speech recognition ROC curves for the best accuracy results on the M2VTS database using MFCC and PLP features for the KPDICCA, KCCA, and KCFA methods. The ROC curves are shown for each class.
By comparing the recognition accuracy and F1-score in Figures 4-5, 7-8, 10-11, and 13-14 on real datasets, we find that the relation between audio and video data is nonlinear and that using a kernel can handle this problem and achieve a suitable accuracy for the emotion and speech recognition systems. This superiority emerges from incorporating dependent and independent variables that contain nonlinear information. It can also be interpreted that the extra information underlying the superiority of KPDICCA over KCCA is the incorporation of the independent latent variables as independent features for each modality in the HMMs. In other words, when we consider just the common information between two modalities, some discriminative features that belong to each modality and carry unique, independent information are removed. This elimination decreases the performance of the recognition system. As we can see from the results, the performance of KPDICCA declines when the dimension of the elicited variables is increased. On the other hand, it should be pointed out that in the RML dataset, due to the low number of samples, increasing the number of elicited features decreases the performance of both the KCCA and KPDICCA methods, which is caused by the curse of dimensionality.

Conclusion
In this paper, a novel approach for audio-visual information fusion based on probabilistic dependent and independent canonical correlation analysis (PDICCA) is proposed. Empirical results reveal that fusing the dependent and independent latent variables of bimodal inputs can increase recognition accuracy. Although a combination of nonlinear dependent latent features and set-specific (independent) features provides more discriminative information than the dependent latent features alone, the dependent latent variables have a high share in the final results, and the nonlinear independent features can be considered auxiliary features that slightly improve the performance of a recognition system. This superiority arises from the fact that KPDICCA captures the data variation in its covariance matrices, while KCCA and KCFA do not consider any input tolerance in their formulas. Our experimental results confirmed the feasibility and efficiency of KPDICCA for the multimodal data fusion application. This method provides good results on low-dimensional inputs; for high-dimensional inputs, selecting a suitable regularization factor allows it to handle the inputs better when the covariance matrix is sparse, and its results on the emotion datasets confirm this claim.
In future work, temporal information can be added to the proposed model to increase its performance. To extend this study, other types of kernels can be assessed for different applications, and other types of regularization models can be employed. In addition, to achieve the best kernel map and the best dependent and independent latent features, deep learning approaches such as deep CCA (DCCA) and deep canonically correlated autoencoders (DCCAEs) can be used.

Figure 1: Graphical representation of the generative model structure used to detect dependencies and independencies (Klami and Kaski [31]).

Figure 2: Graphical representation of the generative model structure used to detect dependencies and independence in the kernel domain.

Figure 7: Experimental results of the linear CCA, CFA, and PDICCA methods based on overall accuracy measure. (a) Feature level on the M2VTS database with MFCC features. (b) Decision level on the M2VTS database with MFCC features. (c) Feature level on the M2VTS database with PLP features. (d) Decision level on the M2VTS database with PLP features.

Figure 9: ROC curves for best accuracy results of the linear CCA, CFA, and PDICCA methods. (a) Feature level on the M2VTS database with MFCC features. (b) Decision level on the M2VTS database with MFCC features. (c) Feature level on the M2VTS database with PLP features. (d) Decision level on the M2VTS database with PLP features.

Figure 14: Experimental results of the nonlinear KCCA, KCFA, and KPDICCA methods based on overall F1-score measure. (a) Feature level on the M2VTS database with MFCC features. (b) Decision level on the M2VTS database with MFCC features. (c) Feature level on the M2VTS database with PLP features. (d) Decision level on the M2VTS database with PLP features.

Figure 15: ROC curves for best accuracy results of the nonlinear KCCA, KCFA, and KPDICCA methods. (a) Feature level on the M2VTS database with MFCC features. (b) Decision level on the M2VTS database with MFCC features. (c) Feature level on the M2VTS database with PLP features. (d) Decision level on the M2VTS database with PLP features.