Topologically Ordered Feature Extraction Based on Sparse Group Restricted Boltzmann Machines

1 School of Computer Science and Technology, Wuhan University of Technology, 122 Luoshi Road, Wuhan 430070, China
2 State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
3 Engineering Research Center for Spatio-Temporal Data Smart Acquisition and Application, Ministry of Education of China, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
4 Institute of Information Technology, Luoyang Normal University, 71 Luolong Road, Luoyang 471022, China


Introduction
Restricted Boltzmann Machines (RBMs) [1] are a type of product-of-experts model [2] based on Boltzmann Machines [3] but with a complete bipartite interaction graph. In general, RBMs, which are used as generative models to simulate input distributions of binary data [4], are viewed as an effective feature-representation approach for extracting structured information from input data. They have received much attention recently and have been successfully applied in various domains, such as dimensionality reduction [5], object recognition [6], topic modeling [7], and feature learning [8]. In addition, RBMs have attracted much attention as building blocks for multilayer learning systems (e.g., Deep Belief Networks (DBNs) and Deep Boltzmann Machines (DBMs)), and variants and extensions of RBMs have many applications in a wide range of feature learning and pattern recognition tasks.
Because of their arbitrary connectivity, Boltzmann Machines are too slow to be practical. To make inference efficient and exact, RBMs are restricted to have no visible-visible or hidden-hidden connections, which has the obvious advantage that inference in RBMs is much easier than in Boltzmann Machines [9]. As a result, the hidden units are conditionally independent given the visible units, and we may generate a more powerful learning model [10]. Lee et al. [11] proposed sparse RBMs (SRBMs), pointing out that RBMs tend to learn distributed and nonsparse representations as the number of hidden units is increased; accordingly, they added a regularization term that penalizes deviation of the expected activation from a low target level, to ensure that the hidden units are sparsely activated. Moreover, in order to group similar activations of the hidden units and capture their local dependencies, Luo et al. [12] proposed sparse group RBMs (SGRBMs) using a novel regularization of the activation probabilities of the hidden units in RBMs. What SRBMs and SGRBMs have in common is that they adopt sparsity-promoting regularization, making them powerful enough to represent complicated distributions.
By introducing the $\ell_{1,2}$ regularizer into the activation probabilities of the hidden units, SGRBMs have the following two properties: first, the model encourages only a few groups to be active given the observed data (sparsity at group level), and second, it results in only a few hidden units being active within a group (sparsity within the group). However, SGRBMs do not address overfitting, since they lack a strategy for controlling the reconstruction complexity of the weight matrix. In addition, the features extracted by the hidden units are not topologically ordered (i.e., arranged so that similar features are grouped together without discarding group sparsity), although it is essential for a learning machine to obtain structured information from the input data. In 2002, Welling et al. [13] proposed learning sparse topographic representations with products of Student t-distributions and found that if the Student t-distribution is used to model the combined outputs of sets of neurally adjacent filters, then the orientation, spatial frequency, and location of the filters change smoothly across the topographic map. Later, Goh et al. [14] proposed a method for regularizing RBMs during training to obtain features that are sparse and topographically organized. The learned features are Gabor-like and demonstrate a coding for orientation, spatial position, frequency, and color that varies smoothly with the topography of the feature map. To efficiently extract invariant features with group sparsity from high-dimensional data, in this paper we first adopt a weight-decay strategy [15,16] at group level based on SGRBMs; second, by adding an extra term that penalizes the topologically ordered factors in the log-likelihood function, we obtain topologically ordered features at group level.
The remaining sections of this paper are organized as follows. In Section 2, RBMs and the Contrastive Divergence algorithm for RBM training are described in brief. In Section 3, a nontopologically ordered feature extraction approach is proposed to obtain sparse but not topologically ordered features between groups from the input data. In Section 4, a topologically ordered feature extraction approach is proposed to obtain structured information (i.e., sparse and topologically ordered features between the overlapping groups) from the input data. In Section 5, experimental results on two different datasets (namely, natural images and Flying Apsara images in the Dunhuang Grotto Murals) are shown to validate the proposed approach. Finally, the conclusions are given in Section 6.

Restricted Boltzmann Machines and Contrastive Divergence
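RBMs are a particular form of the Markov Random Field (MRF) model and are regarded as undirected generative models that use a layer of binary hidden units to model a distribution over binary visible units [17]. Suppose an RBM consists of $n$ visible units $\mathbf{v} = (v_1, v_2, \dots, v_n) \in \{0, 1\}^n$ representing the input data and $m$ hidden units $\mathbf{h} = (h_1, h_2, \dots, h_m) \in \{0, 1\}^m$ to capture the features of the input data. The joint probability distribution $P_\theta(\mathbf{v}, \mathbf{h})$ is given by the Gibbs distribution with the energy function [17,18]:

$$P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\theta)} \exp\left(-E_\theta(\mathbf{v}, \mathbf{h})\right), \qquad E_\theta(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^{T}\mathbf{W}\mathbf{h} - \mathbf{b}^{T}\mathbf{v} - \mathbf{c}^{T}\mathbf{h}, \tag{1}$$

where $\mathbf{W} \in \mathbb{R}^{n \times m}$ is the matrix of weights and $\mathbf{b} \in \mathbb{R}^n$ and $\mathbf{c} \in \mathbb{R}^m$ are vectors that represent the visible and hidden biases, respectively. All these are referred to as the RBM parameters $\theta = \{\mathbf{W}, \mathbf{b}, \mathbf{c}\}$; $E_\theta(\mathbf{v}, \mathbf{h})$ is the energy function, and $Z(\theta) = \sum_{\mathbf{v}, \mathbf{h}} \exp(-E_\theta(\mathbf{v}, \mathbf{h}))$ is the corresponding normalization constant. Therefore, the marginal distribution of the visible variables becomes

$$P_\theta(\mathbf{v}) = \frac{1}{Z(\theta)} \sum_{\mathbf{h}} \exp\left(-E_\theta(\mathbf{v}, \mathbf{h})\right). \tag{2}$$

As the hidden units are independent given the states of the visible units and vice versa, when given the observed data, the conditional probabilities and conditional distributions of the hidden units are

$$P_\theta(h_j = 1 \mid \mathbf{v}) = \mathrm{sig}\left(c_j + \mathbf{W}_{\cdot j}^{T}\mathbf{v}\right), \qquad P_\theta(\mathbf{h} \mid \mathbf{v}) = \prod_{j=1}^{m} P_\theta(h_j \mid \mathbf{v}), \tag{3}$$

where $\mathbf{W}_{\cdot j}$ is the $j$th column of $\mathbf{W}$, which is a vector that represents the connection weights between the $j$th hidden unit and all visible units, and $\mathrm{sig}(x) = 1/(1 + e^{-x})$ is the sigmoid activation function. Thus, the marginal distribution $P_\theta(\mathbf{v})$ of the visible variables is in fact a product-of-experts model [2,12]:

$$P_\theta(\mathbf{v}) = \frac{1}{Z(\theta)} \exp\left(\mathbf{b}^{T}\mathbf{v}\right) \prod_{j=1}^{m} \left(1 + \exp\left(c_j + \mathbf{W}_{\cdot j}^{T}\mathbf{v}\right)\right). \tag{4}$$

Equation (4) shows that the hidden units act as experts whose contributions for a given data vector $\mathbf{v}$ are combined multiplicatively, each contributing according to its activation probability. If a data sample activates one specific hidden unit with high probability, that hidden unit is responsible for representing the data sample. If more data in the training set activate a hidden unit with high probability, the feature of that hidden unit becomes less discriminative. Thus it is sometimes necessary to introduce sparsity at the hidden layer of an RBM [11,19,20].

For a training example $\mathbf{v}^{0}$, training an RBM amounts to modeling the marginal distribution $P_\theta(\mathbf{v})$ of the visible units. A common practice is to follow the log-likelihood gradient [16,18] to maximize the marginal probability $P_\theta(\mathbf{v}^{0})$, that is, to make the model generate $\mathbf{v}^{0}$ with the largest probability. This problem can be solved by gradient ascent on the log-likelihood:

$$\frac{\partial \ln P_\theta(\mathbf{v}^{0})}{\partial \theta} = -\sum_{\mathbf{h}} P_\theta(\mathbf{h} \mid \mathbf{v}^{0}) \frac{\partial E_\theta(\mathbf{v}^{0}, \mathbf{h})}{\partial \theta} + \sum_{\mathbf{v}, \mathbf{h}} P_\theta(\mathbf{v}, \mathbf{h}) \frac{\partial E_\theta(\mathbf{v}, \mathbf{h})}{\partial \theta}. \tag{5}$$

The second term of (5) is intractable because it is an expectation under the model distribution, which we cannot evaluate directly. In order to solve this problem, Hinton et al. [21] proposed Contrastive Divergence (CD) learning, which has become a standard way to train RBMs. The $k$-step contrastive divergence learning (CD-$k$) proceeds in two steps: first, the Gibbs chain is initialized with a training example $\mathbf{v}^{(0)} = \mathbf{v}^{0}$ of the training set. Second, the sample $\mathbf{v}^{(k)}$ is obtained after $k$ steps, where each step $t$ ($t = 1, 2, \dots, k$) consists of sampling $\mathbf{h}^{(t-1)}$ from $P_\theta(\mathbf{h} \mid \mathbf{v}^{(t-1)})$ and subsequently sampling $\mathbf{v}^{(t)}$ from $P_\theta(\mathbf{v} \mid \mathbf{h}^{(t-1)})$. According to general Markov Chain Monte Carlo (MCMC) theory, as $k \to \infty$ the $k$-step contrastive divergence estimate converges to the second term of (5), as shown in the proof of Bengio and Delalleau [22]. However, Hinton [16] pointed out that, when the chain is initialized with the training example $\mathbf{v}^{0}$, running one-step ($k = 1$) Gibbs sampling approximates this term in the log-likelihood gradient relatively well. Therefore, (5) can be approximated as

$$\frac{\partial \ln P_\theta(\mathbf{v}^{0})}{\partial \theta} \approx -\sum_{\mathbf{h}} P_\theta(\mathbf{h} \mid \mathbf{v}^{0}) \frac{\partial E_\theta(\mathbf{v}^{0}, \mathbf{h})}{\partial \theta} + \sum_{\mathbf{h}} P_\theta(\mathbf{h} \mid \mathbf{v}^{(1)}) \frac{\partial E_\theta(\mathbf{v}^{(1)}, \mathbf{h})}{\partial \theta}. \tag{6}$$

Thus, the iterative update of the $j$th column $\mathbf{W}_{\cdot j}$ of the weight matrix $\mathbf{W}$ for the training example $\mathbf{v}^{0}$ is

$$\mathbf{W}_{\cdot j} := \mathbf{W}_{\cdot j} + \varepsilon \left( P_\theta(h_j = 1 \mid \mathbf{v}^{0})\, \mathbf{v}^{0} - P_\theta(h_j = 1 \mid \mathbf{v}^{(1)})\, \mathbf{v}^{(1)} \right), \tag{7}$$

where $\varepsilon$ is the learning rate. The first term of (7) decreases the energy of $\mathbf{v}^{0}$ [23]; at the same time, it makes unit $j$ more likely to be activated when the hidden layer observes $\mathbf{v}^{0}$ again, which means that the hidden units are learning to represent $\mathbf{v}^{0}$ [12]. In the next section, we use a weight-decay strategy at group level for SGRBMs to capture features with group sparsity from the input data.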
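The CD-1 procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sigmoid` and `cd1_update` are hypothetical, the weight matrix is stored as an n × m array (visible by hidden) as in the text, and the bias updates follow the standard CD rule, which the text does not spell out.

```python
import numpy as np

def sigmoid(x):
    """Logistic function sig(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, v0, lr=0.1, rng=None):
    """One CD-1 step for a binary RBM on a single example v0 of shape (n,).

    W has shape (n, m); b and c are the visible and hidden biases.
    Returns updated copies of (W, b, c). Names are illustrative only.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: P(h_j = 1 | v0) for every hidden unit j.
    p_h0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One Gibbs step: sample v1 from P(v | h0), then recompute P(h | v1).
    p_v1 = sigmoid(b + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(c + v1 @ W)
    # Column j of the outer products gives P(h_j=1|v) * v, the two terms
    # of the weight update rule.
    W_new = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    # Bias updates: standard CD rule (an assumption, not stated in the text).
    b_new = b + lr * (v0 - v1)
    c_new = c + lr * (p_h0 - p_h1)
    return W_new, b_new, c_new
```

Passing a seeded `np.random.default_rng` makes a run reproducible; in practice the update would typically be averaged over a minibatch rather than applied per example.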

Nontopologically Ordered Feature Extraction Based on Sparse Group RBMs
In the unsupervised learning process, some of the hidden units may extract similar features if there is little difference between their corresponding weight vectors $\{\mathbf{W}_{\cdot j}\}_{j=1}^{m}$. This homogenization problem can become obvious and serious as the number of hidden units is increased. To alleviate this problem, Lee et al. [11] introduced SRBMs and Luo et al. [12] introduced SGRBMs, which reduce the statistical dependencies between the hidden units by adding a penalty term. SRBMs have been popular because an RBM with a low average hidden activation probability is better at extracting discriminative features than nonregularized RBMs [8,24]. This is especially the case in Luo et al. [12], who divided the hidden units equally into nonoverlapping groups to restrain the dependencies within these groups and penalized the overall activation level of each group. More discriminative features are learned when SGRBMs are applied to deep learning systems for classification tasks. However, Luo et al. [12] did not consider overfitting and did not propose any strategy for controlling the reconstruction complexity of the weight matrix. Thus, to balance the reconstruction error (i.e., the learning accuracy on specific training samples) against the reconstruction complexity (i.e., the generalization ability), we use a weight-decay strategy at group level based on SGRBMs to capture features with group sparsity from the input data.
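For an RBM with $m$ hidden units, let $H = \{1, 2, \dots, m\}$ denote the set of all indices of the hidden units. The $k$th group is denoted by $G_k$, where $G_k \subset H$, $k = 1, 2, \dots, K$. Suppose all groups are nonoverlapping and of equal size [12] (see Figure 1). Given a grouping $G$ and a sample $\mathbf{v}^{(l)}$, the $k$th group norm $N_{G_k}(\mathbf{v}^{(l)})$ is given by

$$N_{G_k}(\mathbf{v}^{(l)}) = \sqrt{\sum_{j \in G_k} P_\theta\left(h_j^{(l)} = 1 \mid \mathbf{v}^{(l)}\right)^{2}}, \tag{8}$$

where $N_{G_k}(\mathbf{v}^{(l)})$ is the $\ell_2$ (Euclidean) norm of the vector comprising the activation probabilities $\{P_\theta(h_j^{(l)} = 1 \mid \mathbf{v}^{(l)})\}_{j \in G_k}$, which is considered as the overall activation level of the $k$th group. Given all the group norms $\{N_{G_k}(\mathbf{v}^{(l)})\}_{k=1,2,\dots,K}$, the mixed $\ell_{1,2}$ norm is

$$\sum_{k=1}^{K} N_{G_k}(\mathbf{v}^{(l)}). \tag{9}$$

Figure 1: Nonoverlapping grouping for nontopologically ordered SGRBMs.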
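The mixed $\ell_{1,2}$ norm of the grouping weight vectors $\{\mathbf{W}^{(l)}_{\cdot j}\}_{j \in G_k}$ is built from the group weight norms

$$M_{G_k}(\mathbf{W}^{(l)}) = \sqrt{\sum_{t \in G_k} \sum_{i=1}^{n} \left(W^{(l)}_{it}\right)^{2}} \tag{10}$$

and is given by

$$\sum_{k=1}^{K} M_{G_k}(\mathbf{W}^{(l)}). \tag{11}$$

In fact, this mixed $\ell_{1,2}$ norm is considered as the overall weight strength level of the grouping. We add these two $\ell_{1,2}$ regularizers ((9) and (11)) to the log-likelihood of the training examples. Thus, given the training set $\{\mathbf{v}^{(l)}\}_{l=1}^{L}$ comprising $L$ examples, we need to solve the following optimization problem:

$$\max_{\theta} \; \sum_{l=1}^{L} \left( \ln P_\theta(\mathbf{v}^{(l)}) - \lambda \sum_{k=1}^{K} N_{G_k}(\mathbf{v}^{(l)}) - \beta \sum_{k=1}^{K} M_{G_k}(\mathbf{W}^{(l)}) \right), \tag{12}$$

where $\lambda$ and $\beta$ are two regularization constants. The second term increases the sparsity of the hidden units at group level, and the third term decreases the reconstruction complexity of the model. To solve (12), we apply the contrastive divergence update rule (see (7)), followed by one step of gradient ascent using the gradient of the regularization terms.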
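The two group-level quantities defined above, the Euclidean norm of the activation probabilities in each group and the norm of the weight columns in each group, can be computed as in the following sketch. It assumes a 1-D arrangement of hidden units split into consecutive equal-size blocks; the helper names `nonoverlapping_groups`, `group_activation_norms`, and `group_weight_norms` are hypothetical and do not come from the paper.

```python
import numpy as np

def nonoverlapping_groups(m, size):
    """Split hidden-unit indices 0..m-1 into consecutive groups of equal size."""
    if m % size != 0:
        raise ValueError("m must be a multiple of the group size")
    return [np.arange(start, start + size) for start in range(0, m, size)]

def group_activation_norms(p_h, groups):
    """Euclidean norm of the activation probabilities inside each group."""
    return np.array([np.linalg.norm(p_h[g]) for g in groups])

def group_weight_norms(W, groups):
    """Frobenius norm of the weight columns that belong to each group.

    W has shape (n, m): one column per hidden unit.
    """
    return np.array([np.linalg.norm(W[:, g]) for g in groups])

def mixed_norm(norms):
    """Mixed l_{1,2} norm: the sum (l_1) of the per-group l_2 norms."""
    return float(np.sum(norms))
```

For example, with 256 hidden units and groups of four units, `nonoverlapping_groups(256, 4)` yields 64 groups, matching the configuration used in the experiments.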
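By introducing these two regularizers, the iterative process (7) for solving the optimal parameters is updated as follows:

$$\mathbf{W}^{(l)}_{\cdot j} := \mathbf{W}^{(l)}_{\cdot j} + \varepsilon \left( P_\theta(h_j = 1 \mid \mathbf{v}^{(l)})\, \mathbf{v}^{(l)} - P_\theta(h_j = 1 \mid \mathbf{v}^{(l)-})\, \mathbf{v}^{(l)-} - \lambda\, \frac{P_\theta(h_j = 1 \mid \mathbf{v}^{(l)})^{2} \left(1 - P_\theta(h_j = 1 \mid \mathbf{v}^{(l)})\right)}{N_{G_{k_j}}(\mathbf{v}^{(l)})}\, \mathbf{v}^{(l)} - \beta\, \frac{\mathbf{W}^{(l)}_{\cdot j}}{M_{G_{k_j}}(\mathbf{W}^{(l)})} \right), \tag{13}$$

where $\mathbf{v}^{(l)-}$ is the CD-1 sample obtained from $\mathbf{v}^{(l)}$, and we assume the $j$th hidden unit belongs to the $k_j$th group $G_{k_j}$. The last step in (13) is derived from

$$\frac{\partial P_\theta(h_j = 1 \mid \mathbf{v}^{(l)})}{\partial \mathbf{W}^{(l)}_{\cdot j}} = P_\theta(h_j = 1 \mid \mathbf{v}^{(l)}) \left(1 - P_\theta(h_j = 1 \mid \mathbf{v}^{(l)})\right) \mathbf{v}^{(l)}.$$

Since the second and third terms in (12) control the sparsity of the activation probabilities of the hidden units, using (13) we can regard the hidden layer of the RBM as a nontopological feature extractor that captures features with group sparsity from the training data, abbreviated to NTOSGRBMs. In the next section, we obtain topologically ordered features from the input data at group level by adding an extra term to the mixed $\ell_{1,2}$ norm of the group weights based on SGRBMs and by adding an extra term that penalizes the topologically ordered factors in the log-likelihood function.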
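The regularized update can be sketched as a CD gradient plus the gradient of the two group penalties. The code below is an illustrative sketch under assumptions, not the authors' code: the function names are hypothetical, the penalty is evaluated for a single sample, and the small constant `tiny` is added only to avoid division by zero. The activation term uses the chain rule with the sigmoid derivative sig'(x) = sig(x)(1 - sig(x)).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def penalty(W, c, v, groups, lam=0.1, beta=0.5):
    """lam * (sum of group activation norms) + beta * (sum of group weight norms)."""
    p = sigmoid(c + v @ W)
    act = sum(np.linalg.norm(p[g]) for g in groups)
    wgt = sum(np.linalg.norm(W[:, g]) for g in groups)
    return lam * act + beta * wgt

def penalty_grad(W, c, v, groups, lam=0.1, beta=0.5, tiny=1e-12):
    """Gradient of `penalty` with respect to W (shape (n, m))."""
    p = sigmoid(c + v @ W)
    grad = np.zeros_like(W)
    for g in groups:
        n_k = np.linalg.norm(p[g]) + tiny      # group activation norm
        m_k = np.linalg.norm(W[:, g]) + tiny   # group weight norm
        # d N_k / d W[:, j] = p_j^2 (1 - p_j) / N_k * v  for j in the group
        coeff = p[g] ** 2 * (1.0 - p[g]) / n_k
        grad[:, g] = lam * np.outer(v, coeff) + beta * W[:, g] / m_k
    return grad

def regularized_step(W, cd_grad, c, v, groups, lr=0.1, lam=0.1, beta=0.5):
    """CD update followed by one gradient step on the regularization terms."""
    return W + lr * (cd_grad - penalty_grad(W, c, v, groups, lam, beta))
```

The default values `lam=0.1` and `beta=0.5` mirror the regularization constants reported in the experiments; `cd_grad` is whatever CD-1 estimate of the log-likelihood gradient is available for the same sample.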

Topologically Ordered Feature Extraction Based on Sparse Group RBMs
Sparsity-based feature extraction approaches employ regularizers to induce sparsity during discriminative feature representation [25,26]. According to Luo et al. [12], the mixed $\ell_{1,2}$ norm encourages sparsity at group level; however, it does not contain any prior information about possible groups of covariates that we may want to select jointly [27,28]. From SGRBMs, we can learn a set of features with group sparsity that is useful for representing the input data; however, drawing inspiration from the human brain, we would like to learn a set of features that both have group sparsity and are topologically ordered in some manner. Here, topologically ordered means that similar features are grouped together while group sparsity is retained. The aim of this constraint on the hidden units is to group adjacent features together in the smoothed $\ell_1$ penalty. Therefore, instead of keeping all groups nonoverlapping, we allow the groups to overlap. Suppose the $\bar{k}$th group is denoted by $\bar{G}_{\bar{k}}$, where $\bar{G}_{\bar{k}} \subset H$, $\bar{k} = 1, 2, \dots, \bar{K}$, that all of these overlapping groups are of the same size, that all overlapping parts are of the same size, and that each hidden unit belongs to two neighboring groups (see Figure 2).
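One simple way to build groups that satisfy these constraints (equal size, equal overlap, every hidden unit in exactly two neighboring groups) is a cyclic 1-D layout in which consecutive groups are shifted by half a group. This is only an assumed construction for illustration: the function name `overlapping_groups` is hypothetical, and the paper's actual layout (Figure 2) may differ, for example by using a 2-D grid.

```python
import numpy as np

def overlapping_groups(m, size):
    """Cyclic overlapping groups over hidden-unit indices 0..m-1.

    Each group has `size` units and consecutive groups are shifted by
    size // 2, so neighboring groups share size // 2 units and every unit
    belongs to exactly two groups. The wrap-around (modulo m) is an
    assumption made so that the first and last units also have two groups.
    """
    if size % 2 != 0:
        raise ValueError("group size must be even")
    stride = size // 2
    if m <= size or m % stride != 0:
        raise ValueError("m must exceed the group size and be a multiple of size // 2")
    return [np.arange(start, start + size) % m for start in range(0, m, stride)]
```

With `m = 8` and `size = 4` this gives the groups {0,1,2,3}, {2,3,4,5}, {4,5,6,7}, and {6,7,0,1}.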
)) 2 represents the overall activation level of hidden units for all groups, then (12) becomes max

In fact, since we actually minimize the overall weight strength level of the grouping, the third term in (14) ensures that only a few hidden units can be activated in a group. From the perspective of information theory, the entropy of the distribution of the conditional probabilities for all hidden units in a group is relatively low. For the $\bar{k}$th group $\bar{G}_{\bar{k}}$ in the hidden layer, the entropy of the distribution of the conditional probabilities is defined as

For two neighboring groups at the hidden layer, the $\bar{k}$th group $\bar{G}_{\bar{k}}$ and the $(\bar{k} + 1)$th group $\bar{G}_{\bar{k}+1}$, we can define the topologically ordered factor (TOF) between them as

where $C(\cdot)$ is the count function that returns the number of elements in a set. The structured features in the hidden units are topologically well-ordered if $T_{\bar{G}_{\bar{k}}, \bar{G}_{\bar{k}+1}}(\mathbf{v}^{(l)})$ is close to zero. Since one important research direction in using stochastic methods (e.g., contrastive divergence [2] and approximate maximum-likelihood [29]) for RBMs is the design of a regularization term [30], it is common to use weight-decay, which regularizes the growth of the parameters, to avoid overfitting and stabilize learning. Therefore, in this paper, we add an extra term that penalizes the topologically ordered factors, so that topologically ordered feature extraction based on SGRBMs (TOSGRBMs) is obtained by extending SGRBMs with these two extra regularizers. Thus, given training data $\{\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \dots, \mathbf{v}^{(L)}\}$, we need to solve the following optimization problem:

This type of regularization can be seen as a combination of group weight-decay and regularization of the topologically ordered factors. To address this problem, it is possible to use the alternating direction method of gradient ascent. The partial derivatives of the first and third terms are shown in (13), so the problem reduces to computing the partial derivative term $(\partial / \partial \mathbf{W}^{(l)}_{\cdot j}) \sum_{\bar{k}=1}^{\bar{K}-1} T^{2}_{\bar{G}_{\bar{k}}, \bar{G}_{\bar{k}+1}}(\mathbf{v}^{(l)})$. Suppose $j \in \bar{G}_{\bar{k}} \cap \bar{G}_{\bar{k}+1}$; thus $j \in \bar{G}_{\bar{k}}$ and $j \in \bar{G}_{\bar{k}+1}$; then

The objective function in (17) is optimized using the iterative method described. In the above discussion, the form of (13) is not changed, but one significant difference is that we assume the $j$th hidden unit belongs to the $\bar{k}_j$th group $\bar{G}_{\bar{k}_j}$ and to its neighboring group, the $(\bar{k}_j + 1)$th group $\bar{G}_{\bar{k}_j+1}$. Thus, the iterative formula for the optimal solution of the RBM parameters is

where $M_{\bar{G}_{k}}(\mathbf{W}^{(l)}) = \sqrt{\sum_{t \in \bar{G}_{k}} \sum_{i=1}^{n} (W^{(l)}_{it})^{2}}$ $(k = \bar{k}_j, \bar{k}_j + 1)$. The first term of (17) represents the true distribution of the input data, and maximizing it in fact minimizes the reconstruction error of the RBM. The third term of (17) represents the group sparsity of the hidden layer, and minimizing this penalty term is equivalent to minimizing the reconstruction complexity of the RBM. In addition, in order to obtain topologically ordered features from these features with group sparsity, we add the second term of (17) to penalize the topologically ordered factors.
The objective function in (17) defines an optimization problem that is not convex. In principle, we can apply gradient ascent to solve it; however, computing the gradient of the log-likelihood term is expensive. Fortunately, the contrastive divergence learning algorithm gives an efficient approximation to the gradient of the log-likelihood [2]. Building upon this, in each iteration we apply the contrastive divergence update rule, followed by one gradient step on the regularization terms [11].

Experimental Results
In this section, we compare the results of the proposed nontopologically ordered and topologically ordered feature extraction approaches based on SGRBMs. First, we applied the two approaches to model patches of natural images. Then, we applied them to analyze the structured features of Flying Apsara images from the Dunhuang Grotto Murals at four different historical periods.

Modeling Patches of Natural Images.
The training data consist of 20,000 patches of size 8 × 8, randomly extracted from a standard set of ten 512 × 512 whitened images. All the patches were divided into minibatches of 256 patches each, and the weights were updated after each minibatch (total batches = 2,000).
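The data preparation described above can be sketched as follows. This is an illustrative sketch under assumptions: the images are taken to be already whitened and stored as a NumPy array of shape (number of images, height, width), and the function names `extract_patches` and `minibatches` are hypothetical.

```python
import numpy as np

def extract_patches(images, n_patches, patch=8, rng=None):
    """Randomly crop `n_patches` square patches and flatten each to a vector.

    `images` has shape (N, H, W) and is assumed to be whitened already.
    Returns an array of shape (n_patches, patch * patch).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_img, height, width = images.shape
    out = np.empty((n_patches, patch * patch), dtype=float)
    for i in range(n_patches):
        k = rng.integers(n_img)                 # which image
        r = rng.integers(height - patch + 1)    # top-left row
        col = rng.integers(width - patch + 1)   # top-left column
        out[i] = images[k, r:r + patch, col:col + patch].ravel()
    return out

def minibatches(X, batch_size=256):
    """Yield consecutive minibatches; the last one may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size]
```

Each 8 × 8 patch becomes a 64-dimensional input vector. With 20,000 patches and a batch size of 256, one pass over the data produces 79 minibatches, the last of which holds the remaining 32 patches.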
We trained a nontopologically ordered SGRBM with 64 real-valued visible units (one per pixel of an 8 × 8 patch) and 256 hidden units, which were divided into 64 uniform nonoverlapping groups of four hidden units each. The learning rate $\varepsilon$ was set to 0.1 [16], and the regularization constants $\lambda$ and $\beta$ were empirically set to 0.1 and 0.5, respectively. The learned features are shown in Figure 3(a). For comparison, we also trained a topologically ordered SGRBM with 256 hidden units. The learned features are shown in Figure 3(b). Some of the extracted features are localized Gabor-like edge detectors in different positions and orientations; these results are similar to those in [13,14]. Since the hidden units within a group compete with each other in modeling patches, each hidden unit in the nontopologically ordered SGRBMs focused on modeling more subtle patterns in the training data. As a result, the features learned with the topologically ordered SGRBMs are more aggregative at group level than those learned with the nontopologically ordered SGRBMs. Moreover, the features learned by TOSGRBMs shown in Figure 3 have an enforced topological order, where the location, orientation, and frequency of the Gabor-like filters all change smoothly. In conclusion, from the perspective of invariant feature learning, the topologically ordered feature extraction approach facilitates the training of the whole network to extract more discriminative features at the hidden layer.
We also compared our results with the standard SRBMs and SGRBMs. With the same parameter settings and the same number of iterations ($10^4$), Figure 4 shows that the SRBMs, SGRBMs, NTOSGRBMs, and TOSGRBMs all extracted Gabor-like filter features; however, SRBMs and SGRBMs have many more redundant feature patches than NTOSGRBMs and TOSGRBMs. Moreover, since the grouped features are significant in SGRBMs, NTOSGRBMs, and TOSGRBMs, the features learned by these three models are more localized than those learned by the SRBMs.
In addition, we use Hoyer's sparseness measure [31] to quantify the sparseness of the representations learned by the SRBMs, SGRBMs, NTOSGRBMs, and TOSGRBMs. This measure lies in the interval [0, 1] and is on a normalized scale. Figure 5 shows the activation probabilities of the hidden units computed using the regular SRBMs, SGRBMs, NTOSGRBMs, and TOSGRBMs. It can be seen that the representations learned by the SGRBMs were much sparser than those of the other three models. NTOSGRBMs and TOSGRBMs, with similar activation probabilities of the hidden units, learned similar representations (close sparseness values), which were much sparser than those of the SRBMs but much less sparse than those of the SGRBMs.
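For reference, Hoyer's sparseness of a vector $x$ with $n$ entries is $(\sqrt{n} - \|x\|_1 / \|x\|_2) / (\sqrt{n} - 1)$, which equals 1 when a single entry is nonzero and 0 when all entries have the same magnitude. A minimal sketch follows; the function name `hoyer_sparseness` is hypothetical, and applying it to a vector of hidden activation probabilities is an assumption about how the measure was used here.

```python
import numpy as np

def hoyer_sparseness(x):
    """Hoyer's sparseness measure of a vector, in the interval [0, 1]."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two entries and a nonzero vector")
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)
```

For instance, a one-hot vector gives a sparseness of 1 and a constant vector gives 0, the two ends of the normalized scale.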

Modeling Patches of Flying Apsaras Images in the Dunhuang Grotto Murals.
The image dataset of the Flying Apsaras in the Dunhuang Grotto Murals published in Zheng and Tai's book [32] contains 300 images. These images cover four historical periods: the early stage, the developing stage, the flourishing stage, and the terminal stage [33]. In the present study, as an example, the training data consisted of 20,000 randomly selected 8 × 8 image patches of the Flying Apsara images. These patches were randomly extracted from a standard set of ten 512 × 512 fine-art paintings of the Flying Apsaras (Figure 6). Features from these images covering the four historical periods were extracted using both nontopologically ordered and topologically ordered SGRBMs. The parameter settings were the same as in the previous subsection. In Figure 7, we see that both approaches show pronounced advantages in extracting discriminative features at group level, although the features learned by the topologically ordered SGRBMs were more aggregative than those learned by the nontopologically ordered SGRBMs. Moreover, since there is structural similarity between the features of the Flying Apsara images, their representations varied smoothly compared to the transformations and invariance achieved using TOSGRBMs. The features learned by TOSGRBMs (Figure 7) had an enforced topological order, where the location, orientation, and frequency of the Gabor-like filters all change smoothly. It is concluded that, when using topologically ordered feature extraction based on SGRBMs, feature selection performed well because of the sparse and aggregative features at group level.
Taking the early stage of the Flying Apsara images as an example, we also compared our learned features with those of the standard SRBMs and SGRBMs. With the same parameter settings and the same number of iterations ($10^4$), Figure 8 shows that the grouped features are significant in SGRBMs, NTOSGRBMs, and TOSGRBMs. The features learned using the SGRBMs, NTOSGRBMs, and TOSGRBMs are more localized than those learned using the SRBMs. Moreover, the sparse features learned with SRBMs and NTOSGRBMs did not group similar feature patches into an aggregative representation. Both SGRBMs and TOSGRBMs learned to group similar features without discarding group sparsity; however, the sparse features learned by TOSGRBMs not only grouped similar feature patches into an aggregative representation but also kept similar feature patches in topological order at group level.
To  (see Figures 7, 8, and 10) but also generate representative features well. Since the average sparsity of the weight matrix is large (see Figure 11), TOSGRBMs give the hidden units low correlation when representing structured features of the input data, and according to the trend of the average reconstruction errors (Figure 9), they simultaneously avoid overfitting. The topologically ordered SGRBMs achieved the best discriminative feature extraction and produced the best trade-off between reconstruction error and complexity. The topologically ordered SGRBMs also have lower average topologically ordered factors, which indicates that, because of the penalty on the topologically ordered factors of all groups, they decrease the similarities between extracted features and order them well topologically.

Conclusions
To extract topologically ordered features efficiently from high-dimensional data, we first used a weight-decay strategy at group level based on SGRBMs to capture features with group sparsity from the input data. Second, by adding an extra term that penalizes the topologically ordered factors in the log-likelihood function, we obtained topologically ordered features at group level. Experimental results on both natural images and the Flying Apsara images from the Dunhuang Grotto Murals at four different historical periods demonstrate that the combination of these two extra terms in the log-likelihood function helps to extract more discriminative features with much sparser and more aggregative hidden activation probabilities. In our experiments, the topologically ordered SGRBMs showed markedly higher sparsity of the weight matrix, more discriminative features at the hidden layer, and sparse feature representations. For these reasons, topologically ordered SGRBMs were found to be superior to nontopologically ordered SGRBMs.

Figure 3: Features learned by the nontopologically ordered (a) and topologically ordered (b) feature extraction approaches with 256 elements, learned on a dataset of 8 × 8 whitened natural image patches with 6 × 6 cyclic overlapping groups.

Figure 5: (a) Activation probabilities computed under the SRBMs; the sparseness is 0.68; (b) activation probabilities computed under the SGRBMs; the sparseness is 0.89; (c) activation probabilities computed under the NTOSGRBMs; the sparseness is 0.81; (d) activation probabilities computed under the TOSGRBMs; the sparseness is 0.78.

Figure 7: Learned features using the nontopologically ordered (a) and topologically ordered (b) feature extraction approaches (taking the early stage as an example).

Figure 9: The different reconstruction errors of topologically ordered SGRBMs and nontopologically ordered SGRBMs when applied to image dataset of Flying Apsara at four different historical periods.

Figure 11: The different weight matrix sparseness of topologically ordered SGRBMs and nontopologically ordered SGRBMs when applied to image dataset of Flying Apsara at four historical periods.