Adaptive-Weighted Multiview Deep Basis Matrix Factorization for Multimedia Data Analysis

Feature representation learning is a key issue in artificial intelligence research. Multiview multimedia data can provide rich information, which has made feature representation one of the current research hotspots in data analysis. Recently, a large number of multiview data feature representation methods have been proposed, among which matrix factorization shows excellent performance. Therefore, we propose an adaptive-weighted multiview deep basis matrix factorization (AMDBMF) method that integrates matrix factorization, deep learning, and view fusion. Specifically, we first perform deep basis matrix factorization on the data of each view. Then, all views are integrated to complete the procedure of multiview feature learning. Finally, we propose an adaptive weighting strategy to fuse the low-dimensional features of each view so that a unified feature representation can be obtained for multiview multimedia data. We also design an iterative update algorithm to optimize the objective function and verify the convergence of the optimization algorithm through numerical experiments. We conduct clustering experiments on five multiview multimedia datasets and compare the proposed method with several strong recent methods. The experimental results demonstrate that the clustering performance of the proposed method is better than that of the compared methods.


Introduction
With the rapid development of computer technology, the multimedia data collected in many research fields, such as computer vision, image processing, and natural language processing, often have high-dimensional features and complex structures. These high-dimensional data provide abundant information but also bring problems such as the "curse of dimensionality" [1,2]. Therefore, how to effectively deal with high-dimensional data has become a widespread concern [3]. Dimensionality reduction is an efficient way to address this issue: it maps the original data to a low-dimensional space and obtains a low-dimensional representation derived from the hidden information in the original data [4].
In recent years, many dimensionality reduction methods have been proposed for multimedia data [5]. Matrix factorization has become one of the research hotspots owing to its simple theoretical basis and easy implementation. Principal component analysis (PCA) [6], independent component analysis (ICA) [7], vector quantization (VQ) [8], etc. are well-known matrix factorization methods that obtain a low-rank approximation by decomposing a high-dimensional data matrix, and they can effectively extract a low-dimensional representation from high-dimensional data. However, these methods do not impose any constraints on the matrix elements during the decomposition. As a result, the factors may contain negative elements, which leads to a loss of physical meaning in the low-dimensional representations. To solve this problem, Lee et al. added nonnegative constraints to the matrix decomposition and proposed the nonnegative matrix factorization (NMF) [9] method. The low-dimensional feature representations obtained by NMF are part-based and therefore highly interpretable. Consequently, NMF has attracted wide attention from researchers. A large number of improved algorithms based on NMF have emerged and have achieved great success in computer vision, natural language processing, speech recognition, DNA sequence analysis, and other areas [10][11][12][13].
NMF decomposes the original nonnegative data matrix into the product of a nonnegative basis matrix and a nonnegative coefficient matrix (also called the low-dimensional feature matrix). Each original sample can be expressed as a linear combination of the basis vectors, and the combination coefficients form the coefficient matrix. Since NMF uses nonnegative constraints, it reflects the intuitive notion of combining parts to form a whole and has better interpretability than other methods. Experimental results indicate that NMF achieves good performance on image and document clustering tasks. Nevertheless, the traditional NMF method only considers the nonnegativity constraints on the elements, which may result in a basis matrix with poor sparseness and independence. To solve these problems, researchers have imposed additional constraints on the basis matrix or the coefficient matrix and proposed a series of improved methods. For instance, Hoyer [14] designed a sparsity measurement criterion and proposed an NMF variant with sparsity constraints (NMF-SC). Moreover, to enhance the independence of the obtained basis vectors and low-dimensional representation, Choi [15] proposed orthogonal nonnegative matrix factorization (ONMF), which imposes orthogonal constraints on the basis matrix and the coefficient matrix. However, the above methods require the original data to be nonnegative, which limits the applicability of these NMF-based algorithms. Therefore, Ding et al. [16] proposed semi-nonnegative matrix factorization (SNMF). Different from traditional NMF, SNMF relaxed the limitations on the original data and coefficient matrix and only imposed a nonnegative constraint on the basis matrix. The methods mentioned above have better capabilities than their predecessors for feature extraction and achieve better results in real-world tasks, but they only extract shallow features [17].
In recent years, deep learning has exhibited outstanding performance in feature representation tasks [18][19][20]. Therefore, many researchers have introduced deep learning into matrix factorization and proposed a large number of deep feature representation methods [21][22][23][24][25][26][27]. Ahn et al. [21] proposed multilayer nonnegative matrix factorization (MNMF). Different from traditional NMF-based approaches, MNMF decomposed the coefficient matrix several times to obtain an underlying part-based representation that can extract deep hierarchical features from the original data. In addition, to expand the application scope, Trigeorgis et al. [22] integrated deep factorization and semi-NMF to propose a deep semi-nonnegative matrix factorization (deep semi-NMF) method. However, both MNMF and deep semi-NMF only considered the deep decomposition of the coefficient matrix for the training data. For the new test data, the basis matrix was used to obtain the deep low-dimensional representation. Therefore, the basis matrix directly affected the results of the deep low-dimensional representation. To obtain a more accurate deep low-dimensional representation of the original data matrix, Zhao et al. [23] applied deep factorization to the basis matrix and proposed a deep NMF method based on basis image learning.
With the rapid development of the Internet and data collection technology, a large amount of multiview multimedia data can be easily acquired [28][29][30]. For example, an object can be shot from different views, and an image can be described with different types of features such as color, texture, and shape. Each view of such multiview multimedia data provides different information, but there are also potential correlations among the different views. Furthermore, multiview data contain more information than single-view data. It is possible to simply integrate multiview data into single-view data, but doing so ignores the differences and potential correlations between the various views of the data [28][29][30].
Consequently, extensive multiview data dimensionality reduction methods have been proposed [31][32][33]. Liu et al. [34] proposed a multiview NMF (multi-NMF) method, which established the relationship between different views by learning a common coefficient matrix among them. Subsequently, Chang et al. [35] introduced a new regularization term into multi-NMF and used it for clothing image clustering. Inspired by ONMF, Liang et al. [36] proposed NMF with coorthogonal constraints (NMFCC) for multiview multimedia data clustering. Additionally, to consider the correlations between multiple views, Zhan et al. [37] jointly optimized the graph matrix and the concept factorization process and proposed an adaptive structure concept factorization (ASCF) method for multiview clustering. Although the above methods can handle multiview multimedia data well, they still belong to the class of feature representation methods based on shallow factorization [38,39], so the underlying deep features in the multiview data remain unavailable. Therefore, Zhao et al. [40] maximized the mutual information between various views, which forced the nonnegative representation of the last layer in each view to be as similar as possible. Then, the deep semi-NMF method was applied to multiview multimedia data clustering. Different from the existing studies, to adaptively provide feature weights for different views in the multiview deep feature representation procedure, Huang et al. introduced an adaptive-weighted framework into multiview deep semi-NMF and proposed an adaptive-weighted multiview clustering method based on deep matrix factorization [41]. Unlike the method in [40], it can adaptively assign weights to different views in a multiview deep feature representation. However, these methods still consider only the deep decomposition of the coefficient matrix.
Therefore, an adaptive-weighted multiview deep basis matrix factorization (AMDBMF) is proposed for multimedia data clustering in this paper. Different from the above methods, AMDBMF first decomposes the basis matrix in a deep manner on the data of each view simultaneously and then integrates the low-dimensional features of all views through the adaptive weighting mechanism to extract more accurate multiview deep low-dimensional representations. The flowchart of the proposed AMDBMF approach is shown in Figure 1. Finally, we perform extensive experiments on five publicly available multiview multimedia datasets. These experimental results show that the proposed AMDBMF approach outperforms the existing related approaches. The remainder of this paper is organized as follows. "Related Works" briefly describes the related algorithms, including NMF and deep semi-NMF. "Adaptive-Weighted Multiview Deep Basis Matrix Factorization" introduces the adaptive-weighted multiview deep basis matrix factorization (AMDBMF) algorithm in detail. The experimental results and analysis are discussed in "Experiments and Analysis." Finally, the conclusions are given in "Conclusions and Future Work."

Nonnegative Matrix Factorization.
Suppose that the given multimedia data can be represented as $X = [x_1, \cdots, x_N] \in \mathbb{R}_+^{D \times N}$, where $D$ is the dimensionality of the data and $N$ is the number of samples. Each sample can be represented as a $D$-dimensional feature vector $x_j$ $(1 \le j \le N)$. NMF aims to find two low-rank nonnegative matrices $W = [w_1, \cdots, w_d] \in \mathbb{R}_+^{D \times d}$ and $H = [h_1, \cdots, h_N] \in \mathbb{R}_+^{d \times N}$ ($d \ll N$ and $d \ll D$) that fulfill $X \approx WH$. After obtaining $W$ and $H$, the original data can be expressed as
$$x_j \approx W h_j = \sum_{i=1}^{d} w_i h_{ij},$$
that is, each sample can be expressed as a linear combination of the columns of the basis matrix $W = [w_1, \cdots, w_d]$, and the coefficient vector is $h_j$. Therefore, the matrices $W$ and $H$ are called the basis matrix and coefficient matrix, respectively. The objective function of NMF is defined as follows:
$$\min_{W, H} \| X - WH \|_F^2 \quad \text{s.t. } W \ge 0,\ H \ge 0,$$
where $\| \cdot \|_F$ is the Frobenius norm.
According to the Karush-Kuhn-Tucker (KKT) condition, the update formulas for variables $W$ and $H$ are as follows:
$$W_{ik} \leftarrow W_{ik} \frac{(X H^{T})_{ik}}{(W H H^{T})_{ik}}, \qquad H_{kj} \leftarrow H_{kj} \frac{(W^{T} X)_{kj}}{(W^{T} W H)_{kj}}.$$
2.2. Deep Nonnegative Matrix Factorization.
The traditional NMF method can remove redundant information and reveal the hidden semantic features of multimedia data, but it cannot learn an effective feature representation for data with complex variations. For example, a facial image contains various changes such as posture, lighting, and expression changes. Therefore, Trigeorgis et al. [22] pointed out that the coefficient matrix, as a low-dimensional representation of high-dimensional data, should be able to continue to be decomposed so that more abstract low-dimensional features can be obtained. Thus, the process of deep factorization is defined as
$$X \approx W_1 H_1, \quad H_1 \approx W_2 H_2, \quad \cdots, \quad H_{l-1} \approx W_l H_l, \tag{3}$$
where $W_i$ and $H_i$ represent the factorization results of the $i$-th layer. It can be seen from Eq. (3) that deep NMF performs a matrix factorization at each layer and uses the decomposed coefficient matrix as the input data of the next layer to continue decomposing. Consequently, the process of deep matrix factorization performed on the data is expressed as
$$X \approx W_1 W_2 \cdots W_l H_l.$$
The objective function of deep NMF is defined as follows:
$$\min_{W_i, H_i} \| X - W_1 W_2 \cdots W_l H_l \|_F^2 \quad \text{s.t. } H_i \ge 0,\ i = 1, \cdots, l.$$
Similar to that of NMF, the update formulas can be defined as follows:
$$W_i = \Psi^{\dagger} X \tilde{H}_i^{\dagger},$$
$$H_i \leftarrow H_i \odot \sqrt{\frac{[\Psi^{T} X]^{pos} + [\Psi^{T} \Psi]^{neg} H_i}{[\Psi^{T} X]^{neg} + [\Psi^{T} \Psi]^{pos} H_i}},$$
where $\Psi = W_1 \cdots W_{i-1}$ in the update of $W_i$ and $\Psi = W_1 \cdots W_i$ in the update of $H_i$, $\tilde{H}_i$ denotes the reconstruction of the $i$-th layer's feature matrix, $\dagger$ denotes the Moore-Penrose pseudoinverse, the symbol $\odot$ represents the element-wise product of matrices, and the square root and division are also taken element-wise. $[A]^{pos} = (|A| + A)/2$ represents a matrix operation that sets all the negative elements to zero and keeps the positive elements unchanged. Conversely, $[A]^{neg} = (|A| - A)/2$ sets the positive elements to zero and replaces the negative elements with their absolute values.
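As a minimal sketch of the multiplicative NMF updates and of the positive and negative part operators described above, the following NumPy code may help. The function names (`nmf`, `pos_part`, `neg_part`), the random initialization, the fixed iteration count, and the small constant `eps` added to the denominators are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pos_part(A):
    """[A]^pos: keep positive entries, set negative entries to zero."""
    return (np.abs(A) + A) / 2

def neg_part(A):
    """[A]^neg: absolute values of negative entries, zero elsewhere."""
    return (np.abs(A) - A) / 2

def nmf(X, d, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative-update NMF: X (D x N) ~= W (D x d) @ H (d x N)."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.random((D, d)) + eps  # assumed random nonnegative initialization
    H = rng.random((d, N)) + eps
    errors = []
    for _ in range(n_iter):
        # eps in the denominators only guards against division by zero
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        errors.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, errors

# Toy usage on a synthetic nonnegative matrix of rank at most 4
rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 30))
W, H, errors = nmf(X, d=4)
```

The list `errors` records the squared Frobenius reconstruction error after each iteration, which is one simple way to inspect convergence numerically.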

Adaptive-Weighted Multiview Deep Basis Matrix Factorization
First, an adaptive-weighted multiview deep basis matrix factorization (AMDBMF) method is proposed, which incorporates nonnegative matrix factorization and deep learning into a unified framework. Next, an optimization algorithm with an iterative updating rule is designed to solve the objective function of AMDBMF. Then, an adaptive-weighted fusion mechanism is provided. Finally, we provide the complexity analysis of the proposed algorithm.
Suppose that $X = [x_1, \cdots, x_N]$ denotes a multimedia dataset that contains $N$ samples. Each sample $x_i$ $(i = 1, \cdots, N)$ is described by $M$ views. Thus, the features of this sample in the $m$-th view can be represented as $x_i^m$ $(m = 1, \cdots, M)$. The features of all samples in this view can be represented as
$$X^m = [x_1^m, \cdots, x_N^m].$$
First, matrix factorization is performed on the features in each view of the multimedia data, and the objective function can be defined as
$$\min_{W^m, H^m} \| X^m - W^m H^m \|_F^2 \quad \text{s.t. } W^m \ge 0,\ H^m \ge 0,$$
where $W^m$ and $H^m$ denote the basis matrix and the coefficient matrix of the $m$-th view's features, respectively. Then, deep factorization is performed on $W^m$, so that the basis matrix obtained at each layer is factorized again at the next layer. The Lagrangian function of Eq. (12) is then used to derive the iterative update rules, in which the symbol $\odot$ represents the element-wise product of matrices. Finally, the algorithmic steps of the proposed method are given in Algorithm 1. To make it easier to understand, Figure 2 depicts the block diagram of the proposed optimization algorithm.
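The layer-wise idea of factorizing the basis matrix can be sketched for a single view as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 1: it assumes the chain $X \approx W_1 H_1$, $W_1 \approx W_2 H_2$, and so on, so that $X \approx W_l H_l H_{l-1} \cdots H_1$; it uses plain multiplicative NMF updates for every layer; and it covers only a pretraining-style pass without any fine-tuning. The names `nmf`, `deep_basis_pretrain`, and `layer_dims` are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=300, eps=1e-10, seed=0):
    """Plain multiplicative-update NMF: X ~= W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H

def deep_basis_pretrain(X, layer_dims, seed=0):
    """Factorize X, then repeatedly factorize the resulting basis matrix.

    Returns the last-layer basis, the per-layer coefficient matrices
    [H_1, ..., H_l], and the representation H_l @ ... @ H_1.
    """
    basis = X
    Hs = []
    for i, k in enumerate(layer_dims):
        basis, H = nmf(basis, k, seed=seed + i)  # assumed: next layer factorizes the basis
        Hs.append(H)
    rep = Hs[0]
    for H in Hs[1:]:
        rep = H @ rep  # accumulate H_l ... H_1
    return basis, Hs, rep

# Toy usage: one view with 30 features and 50 samples, two layers
rng = np.random.default_rng(0)
X = rng.random((30, 50))
W_last, Hs, rep = deep_basis_pretrain(X, layer_dims=[10, 5])
```

Under these assumptions, `rep` plays the role of the low-dimensional representation of the view, and `W_last @ rep` is the corresponding reconstruction of `X`.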

Feature Fusion.
After obtaining the basis matrix and coefficient matrix of each layer for each view through the optimization algorithm, an adaptive-weighted fusion mechanism is adopted to obtain a low-dimensional representation of the multiview data. A weight $\alpha^m$ is first calculated for each view, where a small constant $\varepsilon$ is included in the calculation. Then, $\alpha^m$ is normalized by Eq. (17). Finally, since the low-dimensional representation of each view is expressed as $H^m = H_l^m \Lambda_{l-1}^m$, the low-dimensional features derived from the multiview data are fused by combining the representations $H^m$ of all views according to their normalized weights.
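One plausible form of such a fusion step is sketched below. The exact weight formula is not reproduced in this text, so the sketch assumes that each view's raw weight is the inverse of its reconstruction error plus a small constant, that the weights are normalized to sum to one, and that fusion is a weighted sum of per-view representations of identical shape. The names `adaptive_weights` and `fuse_views` are hypothetical.

```python
import numpy as np

def adaptive_weights(errors, eps=1e-8):
    """Assumed rule: lower reconstruction error gives a larger weight."""
    errors = np.asarray(errors, dtype=float)
    raw = 1.0 / (errors + eps)  # eps guards against division by zero
    return raw / raw.sum()      # normalize so that the weights sum to one

def fuse_views(reps, weights):
    """Weighted sum of per-view representations (all of the same shape)."""
    return sum(w * H for w, H in zip(weights, reps))

# Toy usage: two views with reconstruction errors 1.0 and 3.0
weights = adaptive_weights([1.0, 3.0])
reps = [np.ones((2, 3)), np.zeros((2, 3))]
fused = fuse_views(reps, weights)
```

In this toy case the first view receives roughly three times the weight of the second, so the fused matrix is dominated by the first view's representation.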

Wireless Communications and Mobile Computing
3.4. Complexity Analysis.
The proposed algorithm can be divided into two stages: pretraining and fine-tuning. For convenience, suppose that the number of iterations is $T$, $M$ is the number of data views, and $L$ is the number of layers. The number of features for all views is $D$, and the dimensionality of the low-dimensional representation at each layer is $K$. In the pretraining process, the complexity of a single view is $O(TDNK)$ per layer. Therefore, the complexity of the whole pretraining process is $O(TMLDNK)$. For the fine-tuning part, the main computational cost comes from updating $\Lambda_{l-1}^m$, $W_l^m$, and $H_l^m$, which requires $O(TM(L-1)NK^2)$, $O(TMLDNK)$, and $O(TMLDNK)$ complexity, respectively.
Since $D \gg K$, the total computational complexity of the proposed algorithm is $O(TMLDNK)$.

Experiments and Analysis
One dataset [42] includes 737 news articles from the BBC Sport network from 2004 to 2005. These news articles cover five fields: athletics (track and field), cricket, football, rugby, and tennis (http://mlg.ucd.ie/datasets/segment.html).
Another dataset [43] includes 1200 English articles from six types of samples, and each article has been translated into French, German, Italian, and Spanish (http://lig-membres.imag.fr/grimal/data.html).
A further dataset [43] consists of specific Wikipedia material with 2669 articles in 29 categories (http://www.svcl.ucsd.edu/projects/crossmodal/). In the experiments, we select a subset of the 10 most popular categories containing a total of 693 samples. The detailed statistical information about the different datasets is given in Table 1.

Metrics.
In the experiments, we select three commonly used clustering evaluation indicators [44]: accuracy (ACC), normalized mutual information (NMI), and purity to evaluate the performance of the proposed method.
Assuming that the clustering result of $x_i$ is $l_i$ and that the corresponding true label is $t_i$, the clustering accuracy (ACC) [45] is defined as
$$\mathrm{ACC} = \frac{\sum_{i=1}^{N} \delta(t_i, \mathrm{map}(l_i))}{N},$$
where the function $\delta(\cdot)$ is defined as follows:
$$\delta(x, y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise.} \end{cases}$$
The function $\mathrm{map}(\cdot)$ maps the clustering result to the corresponding true label. The Kuhn-Munkres algorithm [46] is employed to find the best mapping result.
Assume that $L$ and $T$ are the clustering result and the true label set, respectively. The mutual information (MI) between them is defined as
$$\mathrm{MI}(L, T) = \sum_{l_i \in L,\, t_j \in T} p(l_i, t_j) \log \frac{p(l_i, t_j)}{p(l_i)\, p(t_j)},$$
where $p(l_i)$ and $p(t_j)$ represent the probabilities that a sample randomly selected from the dataset belongs to $l_i$ and $t_j$, respectively, and $p(l_i, t_j)$ represents the joint probability that a randomly selected sample belongs to both $l_i$ and $t_j$. Let $H(L)$ and $H(T)$ represent the entropies of $L$ and $T$, respectively. Since the value range of mutual information is between 0 and $\max(H(L), H(T))$, the normalized mutual information (NMI) is defined as
$$\mathrm{NMI}(L, T) = \frac{\mathrm{MI}(L, T)}{\max(H(L), H(T))}.$$

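Purity is a straightforward and transparent evaluation measure that is defined as follows:
$$\mathrm{Purity} = \sum_{i=1}^{k} \frac{|C_i|}{N} \cdot \frac{|C_i^d|}{|C_i|} = \frac{1}{N} \sum_{i=1}^{k} |C_i^d|,$$
where $k$ represents the number of clusters, $|C_i^d|$ is the number of elements in the most numerous category in cluster $C_i$, $|C_i|$ is the number of elements in cluster $C_i$, and $N$ is the number of samples.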
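The three measures can be sketched in plain Python as follows. As a simplification, the best cluster-to-label mapping for ACC is found here by brute force over permutations instead of the Kuhn-Munkres algorithm, which is only practical for a small number of clusters. The function names are illustrative, and returning 0.0 for NMI when both entropies are zero is an assumed convention.

```python
from collections import Counter
from itertools import permutations
from math import log

def clustering_accuracy(true_labels, cluster_labels):
    """ACC under the best one-to-one cluster-to-class mapping (brute force)."""
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    # Pad with placeholders so every cluster can be assigned something
    classes += [None] * max(0, len(clusters) - len(classes))
    pair_counts = Counter(zip(cluster_labels, true_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(pair_counts[(c, t)] for c, t in zip(clusters, assigned))
        best = max(best, hits)
    return best / len(true_labels)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(true_labels, cluster_labels):
    """MI(L, T) divided by max(H(L), H(T))."""
    n = len(true_labels)
    count_l = Counter(cluster_labels)
    count_t = Counter(true_labels)
    joint = Counter(zip(cluster_labels, true_labels))
    mi = sum(
        (c / n) * log((c / n) / ((count_l[l] / n) * (count_t[t] / n)))
        for (l, t), c in joint.items()
    )
    denom = max(entropy(cluster_labels), entropy(true_labels))
    return mi / denom if denom > 0 else 0.0  # assumed convention for the degenerate case

def purity(true_labels, cluster_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    total = 0
    for c in set(cluster_labels):
        members = [t for t, l in zip(true_labels, cluster_labels) if l == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true_labels)

# Toy usage: six samples, two classes, one sample placed in the wrong cluster
true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(true, pred)
nmi = normalized_mutual_information(true, pred)
pur = purity(true, pred)
```

Note that the cluster identifiers in `pred` are deliberately swapped relative to the class labels; ACC and NMI are both insensitive to such relabeling.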

Experimental Results and Analysis.
In the first experiment, to test the influences of the parameters on the proposed method, we select the number of factorization layers L and the feature dimension D of each layer from {1, 2, 3} and {10, 30, 50, 70, 90, 110, 130}, respectively. Furthermore, we adopt a grid search to find the optimal parameter values. In the experiment, the low-dimensional features obtained by the proposed algorithm are clustered by the K-means algorithm. Since the initialization of the K-means algorithm has an impact on the clustering results, we repeat the random initialization process 10 times and report the mean value. First, the optimal feature dimension of each layer is fixed, and the number of layers is varied. As shown in Figure 3, in most cases, when the number of layers is set to 1, the result of each measure is poorer than the rest. However, as the number of layers increases, the performance also increases. This shows that deep factorization helps to improve the performance of the proposed method.
Then, the number of layers is fixed, and the dimension of the features is varied. The result is shown in Figure 4. It can be seen that as the dimensionality increases, the clustering performance also improves in most cases. However, this trend is not always maintained: once the performance reaches the optimal level, the clustering performance decreases or remains stable as the dimensionality increases further. The details of the optimal parameter groups in our proposed algorithm are listed in Table 2.
The second experiment is conducted to verify that the fusion of multiview information is beneficial for improving the clustering performance of the proposed method. First, we perform traditional NMF and deep basis matrix factorization (DBMF) for the data of each view. Then, we obtain the low-dimensional features of the multiview data by fusing the features of different views with equal weight. Finally, the proposed AMDBMF method is compared with the above two methods. The comparison results are listed in Tables 3-5. According to the tables, the performance of the DBMF method is better than that of the traditional NMF method, which indicates that more abstract features can be obtained through deep factorization. The performance of the proposed AMDBMF method is better than that of the DBMF method, which verifies that the adaptive fusion of different views is beneficial for extracting more robust low-dimensional features from multiview data.
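The parameter search protocol described above (a grid over the number of layers and the feature dimension, with each setting evaluated over repeated random initializations and averaged) can be sketched generically. Here `score_fn` is a hypothetical stand-in for "run the method with these parameters, cluster the features with K-means using the given seed, and return a clustering score"; the toy scoring function in the usage lines is invented purely to exercise the code and does not reflect any result from the paper.

```python
from itertools import product
from statistics import mean

def grid_search(score_fn, layer_options, dim_options, repeats=10):
    """Return the best (layers, dim) pair and the mean score of every pair."""
    results = {}
    for n_layers, dim in product(layer_options, dim_options):
        # One score per random initialization, then report the mean
        scores = [score_fn(n_layers, dim, seed) for seed in range(repeats)]
        results[(n_layers, dim)] = mean(scores)
    best = max(results, key=results.get)
    return best, results

# Toy usage with a made-up, deterministic scoring function
def toy_score(n_layers, dim, seed):
    return 1.0 - abs(dim - 70) / 200 + 0.01 * n_layers

best, results = grid_search(
    toy_score,
    layer_options=[1, 2, 3],
    dim_options=[10, 30, 50, 70, 90, 110, 130],
)
```

Keeping the scoring function as a parameter separates the search protocol from the model and the clustering step, so the same loop could be reused for ACC, NMI, or purity.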
The third experiment compares the performance of the proposed AMDBMF method with those of some currently popular multiview algorithms, including MVCF [37], DeepMVC [41], GMC [47], and NMFCC [36]. MVCF utilizes the correlation information between the views obtained by jointly optimizing the graph matrix of the data of each view. DeepMVC uses a nonparameterized adaptive learning method to obtain the weights between views. NMFCC introduces orthogonal constraints into the basis matrix and coefficient matrix. The best results yielded by the different multiview learning methods on different datasets are shown in Tables 6-8. It can be seen that the proposed method performs better than the other comparison methods in most cases. Since these methods use different mechanisms to fuse multiview data information, all the methods present different performances on different datasets. Therefore, how to effectively integrate fusion mechanisms is still an open problem.
The final experiment verifies the convergence of the proposed optimization algorithm. The convergence curves of the proposed method on different datasets are given in Figure 5. As seen from the figure, the iterative update rules in Algorithm 1 decrease the objective function value of the proposed method. Moreover, the proposed method converges quickly on these datasets.

Conclusions and Future Work
To efficiently learn the feature representations of multiview multimedia data, this paper proposes a new deep nonnegative matrix factorization method with multiview learning. Unlike traditional methods, the proposed method deeply decomposes the basis matrix, so it not only can learn the component representation of the original data but also can learn more abstract deep features. Furthermore, to effectively fuse the available multiview data information, this paper introduces an adaptive feature fusion mechanism.
To address the shortcomings of information fusion for multiview data, a large number of fusion mechanisms have been proposed, and they achieve different performances on different datasets. Therefore, how to effectively integrate different mechanisms to improve the feature representation ability of a given approach is one of the key research tasks to be addressed in the future. Moreover, we will apply our method to other fields such as medical image processing and medical text analysis [48].

Data Availability
The data are derived from public domain resources.

Conflicts of Interest
The authors declare that they have no conflicts of interest.