Analyzing complex systems with multimodal data, such as images and text, has recently received tremendous attention. Modeling the relationship between different modalities is the key to addressing this problem. Motivated by recent successful applications of deep neural learning to unimodal data, in this paper we propose a computational deep neural architecture, the bimodal deep architecture (BDA), for measuring the similarity between different modalities. Our proposed BDA has three closely related, consecutive components. For the image and text modalities, the first component can be constructed using popular feature extraction methods for each individual modality. The second component consists of two types of stacked restricted Boltzmann machines (RBMs). Specifically, for the image modality a binary-binary RBM is stacked over a Gaussian-binary RBM; for the text modality a binary-binary RBM is stacked over a replicated softmax RBM. In the third component, we propose a variant of the autoencoder with a predefined loss function for discriminatively learning the regularity between different modalities. We experimentally show the effectiveness of our approach on the task of classifying image tags on publicly available datasets.
Recently, there has been a growing demand for analyzing complex systems with a large number of variables [
During the past few years, motivated by biological propagation phenomena in the distributed structure of the human brain, deep neural learning has received considerable attention, starting around 2006. These deep neural learning methods aim to learn hierarchical and effective representations that facilitate various recognition and analysis tasks in complex artificial systems. Even within this short period of development, deep neural learning has achieved great success in modeling unimodal data, for example in speech recognition systems [
Motivated by this progress in deep neural learning, in this paper we endeavor to construct a computational deep architecture for measuring the similarity between modalities in complex multimodal systems with a large number of variables. Our proposed framework, the bimodal deep architecture (BDA), has three closely related, consecutive components. For the image and text modalities, the first component can be constructed with popular feature extraction methods for each individual modality. The second component consists of two types of stacked restricted Boltzmann machines (RBMs). Specifically, for the image modality a Bernoulli-Bernoulli RBM (BB-RBM) is stacked over a Gaussian-Bernoulli RBM; for the text modality a BB-RBM is stacked over a replicated softmax RBM (RS-RBM). In the third component, we propose a variant of the autoencoder with a predefined loss function for discriminatively learning the regularity between modalities.
It is worthwhile to highlight several aspects of the BDA proposed in this paper. In the first component of the BDA, for the image modality, three feature extraction methods are used in our setting; however, more feature extraction methods could be explored. In the second component of the BDA, we stack two RBMs for each modality; in principle, more RBMs could be stacked to obtain more effective representations. In the third component of the BDA, motivated by deep neural architectures, we design a loss function that keeps the distance small for semantically similar bimodal data and makes it large for semantically dissimilar data. The work in this paper primarily focuses on image and text bimodal data; however, the BDA presented here can be naturally extended to other kinds of bimodal data.
The remainder of this paper is organized as follows. Section
There have been several approaches to learning from cross-modal data with many variables. In particular, Blei and Jordan [
Recently, motivated by deep neural learning, Chopra et al. [
Another line of research focuses on bimodal semantic hashing, which represents data as binary codes; the Hamming metric is then applied to the learned codes as the measure of similarity. McFee and Lanckriet [
The main idea of our deep framework is to construct hierarchical representations of bimodal data. This framework, as shown in Figure
A deep framework is used for measuring the similarity of cross-modal data such as images and text. From left to right: first, classical methods for each modality are used to extract basic modality-specific features; for example, we use MPEG-7, gist, and other well-known feature descriptors for images, and the bag-of-words model for tags. Second, for each modality two RBMs are stacked to extract intermediate modality-specific features: for images, a binary RBM is stacked over a Gaussian RBM; for text, a binary RBM is stacked over a replicated softmax. Third, an autoencoder with a similarity constraint is used to extract similar representations. The number in each box is the number of neurons in that layer.
In the second component, the low-level representations of images and tag words, which usually have different dimensions, are distilled into mid-level representations using two stacked restricted Boltzmann machines (RBMs) for each modality. The first-layer RBMs, a Gaussian RBM for the low-level image representations and a replicated softmax for those of the text, are adopted mainly to normalize the bimodal data to the same number of output units. The second-layer RBMs, two binary RBMs, are used to obtain more abstract representations.
In the third component, we propose a variant of the autoencoder for learning high-level, semantically similar/dissimilar representations of these bimodal data. The details of this network are described in Section
In the training stage, a collection of image-text pairs is presented to the system, and our learning algorithm learns the neural connection weights. In the test stage, a new pair of bimodal data is presented to the learned system, which outputs the similarity/dissimilarity of the unseen pair.
Different unimodal data, such as images or text, usually call for different methods to extract representative features. We use these extracted features as our basic representations. For the image modality, popular methods such as MPEG-7 and gist descriptors can be used. Gist represents the dominant spatial structure of a scene by a set of perceptual dimensions, including naturalness, openness, roughness, expansion, and ruggedness. These perceptual dimensions can be estimated using spectral and coarsely localized information.
One part of MPEG-7 is a standard for visual descriptors. We use four different visual descriptors defined in MPEG-7 for image representations: color layout (CL), color structure (CS), edge histogram (EH), and scalable color (SC). CL is based on the spatial distribution of colors and is obtained by applying a DCT transformation. CS is based on the color distribution and the local spatial structure of the colors. EH is based on the spatial distribution of edges. SC is based on the color histogram in HSV color space, encoded by a Haar transformation.
For text modality, we use the classical bag-of-words model for its basic representations. A dictionary of
For binary data, as in the second layer of the second component of our framework, we use RBMs. An RBM [
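As a concrete illustration of how the Bernoulli-Bernoulli RBMs in this layer can be trained, the following is a minimal NumPy sketch of a single contrastive-divergence (CD-1) update; the helper names, batch handling, and learning rate are illustrative and not taken from our implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One CD-1 step for a Bernoulli-Bernoulli RBM.
    v0   : (batch, n_visible) binary data
    W    : (n_visible, n_hidden) weight matrix
    b, c : visible and hidden bias vectors
    """
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(v0.dtype)
    # Negative phase: one Gibbs step down to the visible units and back up.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Approximate log-likelihood gradient and parameter update.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```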
We model the real-valued image features, as in the first layer of the second component of our framework, using a Gaussian RBM. It extends the binary RBM by replacing the Bernoulli distribution with a Gaussian distribution for the visible data [
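For reference, one common parameterization of the Gaussian-Bernoulli RBM energy, with per-unit standard deviations \(\sigma_i\) for the visible units, is:

```latex
E(\mathbf{v}, \mathbf{h}) = \sum_{i} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  - \sum_{j} c_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} W_{ij} h_j
```

In practice the input features are often standardized so that each \(\sigma_i\) can be fixed to 1.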
For the text features, which are count data, as in the first layer of the second component of our framework, we use the replicated softmax model [
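A standard form of the replicated softmax energy, where \(v_k\) is the count of dictionary word \(k\) in a document and \(D = \sum_k v_k\) is the document length, is:

```latex
E(\mathbf{v}, \mathbf{h}) = - \sum_{j,k} W_{jk} h_j v_k - \sum_{k} b_k v_k - D \sum_{j} a_j h_j
```

Scaling the hidden biases by \(D\) makes documents of different lengths comparable under the same set of hidden units.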
In the second component of our framework, for each modality we stack two RBMs to learn the intermediate representations. These two-layer stacked RBMs can be trained by the greedy layer-wise training method [
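The sketch below illustrates the greedy layer-wise idea, reusing the hypothetical cd1_update and sigmoid helpers from the earlier sketch: the first RBM is trained on the data and its hidden activations become the training data for the second RBM. For brevity both layers here are Bernoulli RBMs; in our framework the first layer is a Gaussian RBM (images) or a replicated softmax (text).

```python
import numpy as np
# Assumes sigmoid() and cd1_update() from the earlier CD-1 sketch.

def train_rbm(data, n_hidden, epochs=10, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b = np.zeros(data.shape[1])
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        W, b, c = cd1_update(data, W, b, c, lr)
    return W, b, c

# Toy binary stand-in for the low-level text features (4000-d bag of words).
X = (np.random.rand(100, 4000) < 0.05).astype(float)
W1, b1, c1 = train_rbm(X, n_hidden=1024)     # first-layer RBM on the data
H1 = sigmoid(X @ W1 + c1)                    # its hidden activations
W2, b2, c2 = train_rbm(H1, n_hidden=1024)    # second-layer RBM stacked on top
```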
In the third component of our framework, we propose a special type of autoencoder for bimodal representations to learn the similarity. As shown in the rightmost part of Figure
Formally, we denote the mapping from the inputs of two subnets to the code layers as
To learn similar representations of the two modalities of one object, we propose a loss function given input
By the standard backpropagation algorithm [
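A minimal sketch of a loss in this spirit (not the exact formula defined above) keeps the distance between the two code-layer outputs small for semantically similar pairs and pushes it beyond a margin for dissimilar pairs, much as in the contrastive formulation of Chopra et al.; the function and parameter names below are illustrative.

```python
import numpy as np

def pairwise_similarity_loss(f_img, g_txt, y, margin=1.0):
    """Illustrative contrastive-style loss (not the exact loss defined above).
    f_img, g_txt : (batch, d) code-layer outputs of the image and text subnets
    y            : (batch,) 1 for semantically similar pairs, 0 for dissimilar
    """
    d = np.linalg.norm(f_img - g_txt, axis=1)
    pos = y * d ** 2                                   # pull similar pairs together
    neg = (1 - y) * np.maximum(margin - d, 0.0) ** 2   # push dissimilar pairs apart
    return 0.5 * np.mean(pos + neg)
```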
To summarize, by the above three consecutive components we can learn the similarity metric for bimodal data.
We evaluate our proposed method on the task of image annotation selection, compared with a multilayer perceptron (MLP) with two hidden layers and canonical correlation analysis (CCA) with RBMs as benchmark methods, on two publicly available datasets.
In the following sections, we will describe the two datasets used, our experimental settings, and the evaluation criteria. Moreover, we report and discuss our experimental results.
The two datasets used in our experiments are the Small ESP Game dataset [
Examples from the ESP dataset. The top row shows images and the bottom row their corresponding groups of tag words. See text for more details.
The MLC-2013 dataset, referred to as MLC, was created by Ian J. Goodfellow for the workshop on representation learning at the International Conference on Machine Learning, ICML 2013. Specifically, the MLC consists of 1,000 manually labeled images, which were obtained by Google image search queries for some of the most commonly used words in the ESP. For each image, two labels are given, one of which fits better than the other. The labels were intended to resemble those in the ESP; for example, they include incorrect spellings that were common in the ESP. Some examples from the MLC are shown in Figure
Two examples from the MLC dataset. Each one shows the image and its two tags. The words in the sky-blue box are the original tags, and the words in the carnation box are the generated ones.
In all our experiments, we use the ESP as the training set and the MLC as the test set. Note that each image in the ESP has only a single (correct) group of tags, so to train our system we must first construct an incorrect counterpart group of tags for each image in the ESP. We do this automatically by randomly choosing one from the correct tags of the other images, while ensuring that each is used only once. This constitutes our preprocessing procedure.
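A sketch of this preprocessing step, under the reading that the incorrect counterpart is another image's tag group and each group is used at most once, might look as follows (the helper name and data format are illustrative):

```python
import random

def make_negative_counterparts(tag_groups, seed=0):
    """Assign to each image an incorrect tag group taken from another image,
    using every group exactly once (a derangement of the correct groups).
    tag_groups : list of tag-word groups, one per image (illustrative format).
    """
    rng = random.Random(seed)
    n = len(tag_groups)
    perm = list(range(n))
    while True:
        rng.shuffle(perm)
        if all(i != j for i, j in enumerate(perm)):  # no image keeps its own tags
            break
    return [tag_groups[j] for j in perm]
```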
In this section, we describe the settings in our experiments, including the details of feature extraction methods in the first component and the neurons configurations in the second and third components.
In the first component, for the image modality, three popular methods are adopted. The first group of features is obtained by the following steps: (1) preprocess images using local contrast normalization [
The second group of features is obtained from the MPEG-7 visual descriptors. We use 192 DCT coefficients for CL, 256 coefficients for CS, a fixed 80 coefficients for EH, and 256 coefficients for SC. The software module based on the MPEG-7 Reference Software (available at
The third group of features is obtained by gist descriptor. The package used in our paper is available at
For the text representation we use the bag-of-words (BoW) model. In our experiments, a dictionary of 4000 high-frequency words is built from all the tag words of the ESP. Thus, each group of tag words is represented as a 4000-dimensional one/zero vector.
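A minimal sketch of this representation (with a toy dictionary standing in for the 4000 high-frequency words) is:

```python
def bow_vector(tag_words, dictionary):
    """Map a group of tag words to a one/zero vector over the dictionary."""
    vec = [0] * len(dictionary)
    for word in tag_words:
        idx = dictionary.get(word)
        if idx is not None:          # words outside the dictionary are ignored
            vec[idx] = 1
    return vec

# Toy example; in our experiments the dictionary has 4000 entries.
dictionary = {"dog": 0, "grass": 1, "sky": 2}
print(bow_vector(["dog", "sky", "snow"], dictionary))  # [1, 0, 1]
```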
In the second component, we use a neuron configuration of 1704-1024-1024 for the image modality and 4000-1024-1024 for the text modality. That is, the Gaussian-Bernoulli RBM has 1704 visible neurons and 1024 hidden neurons, and the replicated softmax has 4000 visible neurons and 1024 hidden neurons. The two Bernoulli RBMs stacked above them for the intermediate representations each have 1024 visible neurons and 1024 hidden neurons.
In the third component, we use a neuron configuration of 1024-512 for both the image and text modalities. That is, the autoencoders for the advanced representations of both modalities have 1024 visible neurons and 512 hidden neurons.
As for the parameters, we illustrate the method for setting the parameter
For comparison, two benchmark methods, an MLP with two hidden layers and CCA with RBMs, are used in our experiments. Next, we describe the details of these two methods.
The MLP is a popular supervised learning model in artificial neural networks [
The MLP system used in our experiments. In the lower right corner, an image is represented using MPEG-7 and gist descriptors, forming a vector with 1,704 elements. In the upper right corner, the corresponding tag words are first represented using the BOW model; a replicated softmax RBM with 4,000 visible neurons and 1,024 hidden neurons is then used to learn the text representation. Finally, in the dashed box, from bottom to top, an MLP with two hidden layers is designed to learn the mapping from the image modality to the text modality.
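A hedged sketch of this benchmark, using scikit-learn's MLPRegressor purely for illustration (the hidden-layer sizes and the toy data below are assumptions, not the exact configuration used in our experiments), is:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_img = rng.standard_normal((200, 1704))   # stand-in for the 1,704-d image features
Y_txt = rng.standard_normal((200, 1024))   # stand-in for the 1,024-d RS-RBM text codes

# Two hidden layers mapping the image representation to the text representation.
mlp = MLPRegressor(hidden_layer_sizes=(1024, 1024), max_iter=50)
mlp.fit(X_img, Y_txt)
pred_txt = mlp.predict(X_img)              # predicted text representation
```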
The CCA [
In the CCA system, the two groups of variables are the modal representations. One input of the CCA system is the basic representation of the image modality; the other is the replicated softmax RBM representation based on the BOW model. The CCA is then applied to these two representations. Specifically, we set the image representation to have 1,704 neurons, and a replicated softmax RBM is used with an input of 4,000 neurons and an output of 1,024 neurons. The canonical components
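As an illustrative sketch, the canonical components can be computed with scikit-learn's CCA; the dimensions below are toy values chosen so the example runs quickly, not the sizes used in our system.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_img = rng.standard_normal((300, 64))   # stand-in image representations
Y_txt = rng.standard_normal((300, 64))   # stand-in text representations

cca = CCA(n_components=16, max_iter=500)
X_c, Y_c = cca.fit_transform(X_img, Y_txt)
# Pair similarity can then be measured in the shared canonical space,
# e.g. as the cosine similarity of the corresponding rows of X_c and Y_c.
```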
The CCA system used in our experiments. From the lower left corner, the image modality is represented using MPEG-7 and gist descriptors, forming a vector of size 1,704; a Gaussian RBM with 1,704 visible neurons and 1,024 hidden neurons is then used to learn the image representation. From the lower right corner, the text modality is represented using the BoW model, forming a vector of size 4,000; a replicated softmax RBM with 4,000 visible neurons and 1,024 hidden neurons is then used to learn the text representation. Finally, a CCA model with 1,024 twin inputs and 1,024 twin outputs is built from these bimodal representations.
The performance of a classifier's predictions is evaluated based on accuracy. Here, the accuracy is, for simplicity, defined as the area under the receiver operating characteristic (ROC) curve. The adoption of this accuracy measure is motivated by two facts. One is its successful application to evaluating binary classifiers [
Often used for evaluating binary classifiers, the ROC curve is a plot of the false positive rate
To perform this computation, the three models in our experiments must be able to produce a continuous-valued output that can be used for ranking their predictions. Therefore, we define the continuous-valued output
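A hedged sketch of this evaluation, scoring each (image, tag group) pair by a continuous value and computing the area under the ROC curve with scikit-learn, is shown below; the scoring function used here (negative Euclidean distance between learned codes) and the toy data are illustrative rather than the exact definition used above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
f_img = rng.standard_normal((100, 512))          # learned image codes (toy data)
g_txt = rng.standard_normal((100, 512))          # learned text codes (toy data)
labels = rng.integers(0, 2, size=100)            # 1 = correct tag group, 0 = incorrect

scores = -np.linalg.norm(f_img - g_txt, axis=1)  # higher score = more similar pair
print("Accuracy (AUC):", roc_auc_score(labels, scores))
```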
We compare our deep neural architecture with an MLP-based system with two hidden layers and a CCA-based system with two RBMs. The experimental results of the different methods are shown in Table
Accuracies achieved by BDA and benchmark methods.
Method | Accuracy (%) |
---|---|
BDA (proposed) | 88.96 |
CCA [ | 85.54 |
MLP [ | 81.54 |
In addition, we investigate the effect of the hyperparameter
The effect of hyperparameter
In this section we discuss the three methods and the experimental results. The models used in our experiments have several things in common. First, for the image modality, popular descriptors are adopted, and for the text modality we adopt the classical BOW model; these low-level modality-specific representations form the basis for further, more abstract representations. Second, all three models obtain a cross-modal metric that enables heterogeneous data from different sources to be compared.
Note that the experimental results are largely affected by the differences among these models. In the MLP-based system, the nonlinear mapping function is learned directly from the low-level representation of the image modality to that of the text modality; the assumption behind this system is that the text representation is relatively abstract while the image representation is relatively concrete. In contrast, the other two systems, the CCA-based system and our BDA-based system, treat the two modalities symmetrically; the assumption behind these systems is that there exists a common representation space for the bimodal data of one object. Recent neurobiological research [
The experiments on the hyperparameter
To conclude, we propose a computational deep neural architecture for measuring the similarity between different modalities in complex systems with images and text. Our proposed framework closely combines feature extraction methods for the individual modalities with deep neural networks built from stacked RBMs and a variant of the neural autoencoder architecture. We experimentally show the effectiveness of our approach on the task of classifying image tags on publicly available datasets.
Our computational framework is flexible and could be extended in several ways. For example, more feature extraction methods could be explored in the first component. As another example, more complex neural representations could be exploited. Moreover, the architecture presented here can be naturally extended to other modalities. In future work we will investigate this flexibility in other complex systems.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was partially supported by National Natural Science Foundation of China (nos. 61273365, 61100120, 61202247, and 61202248), National High Technology Research and Development Program of China (no. 2012AA011103), Discipline Building Plan in 111 Base (no. B08004), and Fundamental Research Funds for the Central Universities (no. 2013RC0304). The authors would also like to thank the editor and the anonymous reviewers for their useful comments and suggestions that allowed them to improve the final version of this paper.