Query-Specific Deep Embedding of Content-Rich Network

In this paper, we propose to embed a content-rich network for the purpose of similarity search for a query node. In such a network, besides the node and edge information, we also have the content of each node. We use a convolutional neural network (CNN) to represent the content of each node and then use a graph convolutional network (GCN) to further represent each node by merging the representations of its neighboring nodes. The GCN output is further fed to a deep encoder-decoder model that converts each node to a Gaussian distribution and then converts the distribution back to the node's identity. The dissimilarity between two nodes is measured by the Wasserstein distance between their Gaussian distributions. We define nodes of the network as positives if they are relevant to the query node and as negatives if they are irrelevant. The labeling of positives/negatives is based on an upper bound and a lower bound of the Wasserstein distances between the candidate nodes and the query node. We learn the parameters of the CNN, the GCN, the encoder-decoder model, the Gaussian distributions, and the upper and lower bounds jointly. The learning problem is modeled as the minimization of the losses of node identification, network structure preservation, positive/negative query-specific relevance-guided distance, and model complexity. An iterative algorithm is developed to solve the minimization problem. We conducted experiments over benchmark networks, especially innovation networks, to verify the effectiveness of the proposed method and showed its advantage over state-of-the-art methods.


Background.
Recently, content-rich network analysis has attracted much attention. Different from a traditional network, in which each node is identified only by its ID, a content-rich network attaches content to each node [1,2]. For example, in a scientific article citation network, each node is a research paper and each edge is a citation between two papers, while each node is enriched by the content of the research paper. In this case, each node is represented not only by the paper ID but also by its content, such as its title, abstract, and text. However, in past research, the content of each node was ignored and only the network structure was considered to represent the nodes. For example, a popular network analysis tool is network embedding, where each node is mapped to a low-dimensional vector space in which the network structure is preserved [3][4][5][6]. Traditional network embedding methods consider only the network structure by learning from the edges of the network, while the content of the nodes is not encoded into the embedding process [7][8][9][10]. However, in many cases, the contents of two nodes provide a strong clue about their linkage, even when there is no direct edge between them in the network structure. As an example, in the innovation network analysis problem, two recently published research papers may have similar ideas even though they have not cited each other. From the content of these two papers, we can conclude that they share the same idea.
Thus, the content is a good complementary component for network embedding, besides the network structure itself.
Meanwhile, information retrieval is a major application of network analysis. Given a query node in the graph, the search task is to rank the other nodes according to their similarity to the query and return the top-ranked nodes [11][12][13][14][15]. Using both the network structure and the query information to rank the nodes of a network has been a popular approach to information retrieval, while network embedding is another important direction of network analysis. It is natural to combine these two technologies to improve retrieval performance. However, up to now, no network embedding work has considered the query information to boost the embedding for retrieval. In this paper, we fill this gap by learning network embeddings specific to a given query and the content of the nodes.

Related Works.
In this section, we summarize the related works on network embedding and network-based retrieval. Our work is a query-specific network embedding method that embeds a network for the purpose of searching for nodes similar to a given query node. Thus, our work is related to both network embedding and network-based retrieval. The related network embedding works are summarized as follows.
He et al. [2] developed the Network-to-Network Network Embedding model to combine the network structure and the content of nodes into one embedding vector. To this end, two neural networks are employed: one for the content, based on the convolutional neural network (CNN) model [16][17][18], and another for the network structure, based on the graph convolutional network (GCN) model [19][20][21][22]. The CNN model embeds the content of each node into a convolutional representation vector and then feeds it to the GCN model, where the representation vectors of each node's neighbors are taken as input and converted to a vector of node identity. The parameters are learned by minimizing the loss of the node identity prediction task. Zhu et al. [6] proposed embedding each node of a network as a Gaussian distribution and using the Wasserstein distance to measure the dissimilarity between two nodes' Gaussian embeddings. Moreover, they developed an encoder-decoder method to map the neighborhood coding vector to the Gaussian embedding parameters and then map it back to the neighborhood coding vector. The embedding parameters are optimized to minimize the decoding error while keeping the network structure. Tu et al. [23] proposed a deep recurrent neural network (RNN)-based network embedding method. It uses a node's neighbors in the network as the input of an RNN model and uses the output of the RNN model to approximate the embedding of the node. The inputs for the neighboring nodes are, accordingly, their own embedding vectors. Moreover, the RNN outputs are further fed to a multilayer perceptron (MLP) model to approximate the degree of the node. Learning is conducted by minimizing the approximation errors of both the embedding vectors and the degrees.
Wang et al. [24] introduced a novel graph embedding method for a group of networks. This method learns a group of base vectors, each of which can be extended to a base affinity matrix of a base network by a self-product. Then each network affinity matrix is approximated by a learned combination of the base matrices.
The base vectors and the combination coefficients are learned jointly by minimizing the approximation error. The network-based retrieval works are summarized as follows.
Li et al. [25] proposed adjusting the affinity matrix of a network according to a given query and a set of positive/negative nodes from the network.
This method first calculates a ranking vector from the affinity matrix and the query node indicator vector and then imposes the constraint that the ranking scores of the positive nodes are larger than those of the negative nodes. Please note that the positive nodes are nodes known to be similar to the query, while the negative nodes are nodes known to be dissimilar to it. The new affinity matrix is learned by minimizing the loss of the positive/negative node constraints while keeping the adjusted affinity matrix as similar to the original affinity matrix as possible. Yang et al. [26] proposed learning an improved affinity matrix of a network by first calculating the tensor product of the matrix and then conducting diffusion over the tensor product. The tensor product of the matrix defines an extended network in which each node is a pair of nodes of the original network and each edge weight is the product of the weights of the edges between the corresponding pairs of nodes. The diffusion over a matrix is calculated as the summation of the different orders of the matrix, where the order varies from zero to infinity. The learned affinity matrix is obtained by recovering it from the diffusion over the tensor product. They proved that the recovered matrix can be obtained by an iterative algorithm, Q ← AQA^⊤ + I, where A is the original affinity matrix and Q is the expected recovered matrix. Bai et al. [27] proposed learning the ranking scores of the nodes in a network with respect to a query node by an iterative label propagation algorithm. In this algorithm, the ranking score of a node is updated as the weighted average of those of its neighboring nodes, while, at the beginning of each iteration, the ranking score of the query node itself is reset to one.

Our Contributions.
Our contribution in this paper is threefold: (1) We propose a novel problem for network analysis: the query-specific content-rich network embedding problem. The setting of this problem is that each node of the network is attached to some content, such as text and images. Moreover, one or more nodes are known as the query node(s). The task is to learn effective embedding vectors of the nodes so that, from the embeddings, we can calculate a similarity measure to rank the nodes of the network for the purpose of information retrieval. (2) We develop a novel solution to this problem. We use a CNN model to extract the content-level features of each node and then use a GCN model to encode the features of the neighboring nodes to represent the node. The new representations of the nodes produced by the GCN are further converted to Gaussian distribution parameters, including the mean and the covariance of each node, by an encoder. Finally, the Gaussian distribution parameters are decoded to a node identity probability vector. To learn the parameters, we model the learning problem as minimizing the losses of node identity decoding and network structure preservation. Meanwhile, to utilize the query node, we define a set of positive nodes, which are supposed to be relevant to the query and returned by the retrieval system, and a set of negative nodes, which are supposed to be ignored by the retrieval system. The labels of the positives/negatives are used to learn the distance between nodes. (3) We design an optimization algorithm to solve the minimization problem modeled above. Firstly, the labels of positives and negatives are based on the distance between the Gaussian distributions of each pair of nodes, measured by the Wasserstein distance, together with an upper bound and a lower bound on the distance.
Secondly, the parameters of the CNN, the GCN, the Gaussian distributions, and the upper/lower bounds of the labels are learned jointly. Thirdly, labeling and learning are conducted iteratively in one algorithm.

Remark 1.
Our work is based on the idea of learning a probabilistic model relying on an autoencoder architecture, well known in the literature as the variational autoencoder (VAE) proposed by Kingma and Welling [28]. However, our cost function is different from the standard loss function used in VAEs.

Organization.
Our paper is organized as follows. In Section 2, we introduce the novel method for query-specific embedding of a content-rich network. In Section 3, we evaluate the proposed method experimentally and compare it against the state-of-the-art methods. In Section 4, we conclude the paper.

Proposed Method
In this section, we introduce our query-specific embedding method for a content-rich network. The embedding of the nodes of the network is conducted at two layers. The first layer is the representation of the content of each node by a convolutional representation method. The second layer is the representation of the node neighborhood by a graph convolutional representation of the contents of the neighboring nodes, where the embedding of the nodes lies in the Wasserstein space. To learn the parameters of the model, we consider the problem of query-specific search while keeping the network structure.

Content-Rich Graph Embedding
In this subsection, we discuss the representation of the node contents. Given a node v of the network, we assume its content is a text, which can be denoted as a sequence of word embedding vectors, [x_1, ..., x_N], where x_i ∈ R^d is the embedding vector of the i-th word, d is the dimension of the word embedding space, and N is the number of words in the text of node v. To represent the text, we employ a CNN model with one convolutional layer and one max-pooling layer.
In the convolutional layer, we have a filter bank F, which filters the word embeddings within a window of s words. The response of the k-th filter at the i-th position is calculated as z_i^k = ReLU(w_k^⊤ x_{i:i+s-1} + b_k), where x_{i:i+s-1} ∈ R^(d×s) is the concatenation of the word embedding vectors x_i, ..., x_{i+s-1}, w_k is the k-th filter, and b_k is the bias parameter of the k-th filter. ReLU(x) = max(x, 0) is the rectified linear unit (ReLU) activation function [29][30][31][32].
In the max-pooling layer, the maximum response of each filter is selected as the output of the layer, h^k = max_i z_i^k. The convolutional representation of the content of node v is the vector of the max-pooling outputs of the |F| filters, u_v = [h^1, ..., h^(|F|)].
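As an illustration, the convolution and max-pooling steps above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation; the function and variable names are ours, and the filter bank is assumed to be stored as one matrix with one row per filter.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_maxpool(X, W, b, s):
    """Convolutional content representation of one node.

    X : (N, d) word-embedding matrix of the node's text.
    W : (K, d*s) filter bank, one flattened filter per row.
    b : (K,) filter biases.
    s : window size (number of words per filter window).
    Returns the (K,) vector of max-pooled filter responses.
    """
    N, d = X.shape
    K = W.shape[0]
    out = np.full(K, -np.inf)
    for i in range(N - s + 1):
        window = X[i:i + s].reshape(-1)   # concatenated x_{i:i+s-1}
        z = relu(W @ window + b)          # filter responses at position i
        out = np.maximum(out, z)          # running max-pooling over positions
    return out
```

Each filter slides over the word sequence, and only its strongest response survives the pooling, which makes the representation independent of the text length N.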

Neighborhood Convolutional Encoder-Decoder.
In this subsection, we will introduce the neighborhood representation of a node from the content of its neighboring nodes.
To this end, we apply an encoding-decoding methodology to code the neighborhood of each node to the Wasserstein space.
(1) Graph Convolutional Encoder. We assume the network is denoted as G = (V, E), where V = {v_1, ..., v_n} is the set of nodes, v_i is the i-th node, and n is the number of nodes. Thus, we can denote the set of neighbors of a node v_i as N_i = {v_j | e_ij = 1 or e_ji = 1, j = 1, ..., n}. To represent the neighborhood of a node, we normalize the edge weights of its neighbors so that they sum to one, a_ij = e_ij / Σ_j' e_ij'. To utilize the neighborhood to represent a node, we employ a deep graph convolutional network (GCN). The input layer of the GCN is the convolutional representations of the nodes. For the l-th layer of the GCN, the output is calculated as v_i^(l+1) = tanh(W^l Σ_{v_j ∈ N_i} a_ij v_j^l + b^l), where v_j^l is the input of the l-th layer for the j-th node; the neighboring nodes' content representations are linearly combined with the normalized edge weights and then passed through a fully connected layer with a tanh activation, and W^l and b^l are the weight and bias parameters. The number of GCN layers is L, and the output of the last layer of the GCN for the i-th node is denoted as v_i^L.
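A minimal sketch of one such layer, assuming the normalized edge weights a_ij are collected in a row-normalized adjacency matrix (the function names are illustrative, not from the paper):

```python
import numpy as np

def normalize_adjacency(A):
    """Row-normalize edge weights so each node's neighbor weights sum to 1."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                  # isolated nodes: avoid division by zero
    return A / deg

def gcn_layer(H, A_norm, W, b):
    """One graph-convolutional layer.

    H      : (n, f) node representations entering the layer.
    A_norm : (n, n) row-normalized adjacency (the a_ij weights).
    W, b   : layer weight (f, f') and bias (f',).
    Neighbors are linearly combined with the normalized edge weights,
    then passed through a fully connected layer with tanh activation.
    """
    return np.tanh(A_norm @ H @ W + b)
```

Stacking L such layers lets information from L-hop neighborhoods reach each node's representation.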
(2) Gaussian-Based Encoder. We further assume that each node is generated from a lower-dimensional Gaussian distribution in the Wasserstein space. The Gaussian distribution is characterized as N(μ_i, Σ_i), where μ_i ∈ R^g is the mean of the distribution, Σ_i ∈ R^(g×g) is the covariance matrix of the distribution, and g is the dimension of the lower-dimensional Gaussian embedding. In our work, we assume that the covariance matrix is diagonal, Σ_i = diag(σ_i). To bridge the network structure and the distribution of the node, we assume that the mean and covariance can be reconstructed from the GCN output by two fully connected layers, μ_i = Θ v_i^L + θ and σ_i = Elu(Ψ v_i^L + ψ) + 1, where Θ and Ψ are the weight matrices, while θ and ψ are the bias vectors.
Elu(x) is the exponential linear unit activation function [33][34][35][36], and Elu(x) + 1 is used to guarantee that σ_i is a positive vector. In this way, each node is encoded as a Gaussian distribution in the Wasserstein space.
To measure the dissimilarity between two nodes, v_i and v_j, from their Gaussian distributions, we apply the 2nd Wasserstein distance, dist(v_i, v_j) = W_2(N(μ_i, Σ_i), N(μ_j, Σ_j)). (3) Node-Identity Decoder. After we have the Gaussian-based encoding of each node, we want to decode it back to its original identity in the graph. Thus, we design a decoder that converts a node's Gaussian distribution to the probabilities of it being each of the nodes of G. In the decoder, we first sample from the Gaussian distribution to obtain a representation z_i of the node, z_i = μ_i + σ_i ⊙ ε, where ε is a noise vector sampled from a standard Gaussian. Then we calculate the reconstructed node probability vector by a fully connected layer and a sigmoid activation layer,
where the j-th dimension of p_i, denoted p_ij, is the probability of the node being the j-th node of the network.
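For diagonal covariances, the 2nd Wasserstein distance between two Gaussians has a simple closed form. The sketch below assumes the sigma vectors store per-dimension standard deviations, in which case W_2^2 reduces to the squared distance between the means plus the squared distance between the standard deviation vectors:

```python
import numpy as np

def w2_diag_gaussian(mu1, sigma1, mu2, sigma2):
    """2nd Wasserstein distance between two diagonal Gaussians.

    mu*    : mean vectors.
    sigma* : per-dimension standard deviations (diagonal covariance).
    Closed form for this case:
        W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    sigma1, sigma2 = np.asarray(sigma1, float), np.asarray(sigma2, float)
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2))
```

Unlike the KL divergence, this distance is symmetric and stays finite even when the two distributions barely overlap, which is convenient for ranking candidates against a query.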

Problem Modeling and Solving.
To learn the parameters of the CNN, GCN, and Gaussian-based encoder-decoder, we consider the following problems.

Decoder Loss of Node Identification.
Since the Gaussian-based encoder-decoder is designed to identify each node of the graph, we propose to minimize the node identification loss, measured by the cross-entropy loss −Σ_i Σ_j π_ij log p_ij, where π_ij = 1 if v_i is the j-th node, and 0 otherwise.
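One plausible reading of this loss, sketched with the decoded probabilities and one-hot targets stored as matrices (the averaging over nodes is our choice, not stated in the text):

```python
import numpy as np

def node_identification_loss(P, Pi):
    """Cross-entropy decoder loss over all nodes.

    P  : (n, n) decoded probabilities, P[i, j] = p_ij.
    Pi : (n, n) one-hot identity targets, Pi[i, j] = pi_ij.
    Returns -sum_ij pi_ij * log(p_ij), averaged over the n nodes.
    """
    eps = 1e-12                            # numerical guard against log(0)
    return -np.sum(Pi * np.log(P + eps)) / P.shape[0]
```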

Neighborhood Structure Preservation.
With the new coding of each node as a Gaussian distribution in the Wasserstein space, we hope the neighborhood structure can be preserved. To this end, we first define a set of triplets (v_i, v_j, v_k), where v_i and v_j are connected and v_i and v_k are disconnected in graph G. The energy between two nodes v_i and v_j in the graph is also defined as the Wasserstein distance, E(v_i, v_j) = dist(v_i, v_j). To keep the structure of the network, we propose minimizing the squared energy between the connected nodes together with the exponential of the negative energy between the disconnected nodes. By minimizing this objective, we hope the learned Gaussian distributions of connected nodes are close in the Wasserstein space, while those of disconnected nodes are far from each other. Thus, the network structure is preserved in the Wasserstein space.
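A sketch of this triplet objective; the distance function is passed in as a callable, and the combination E(i,j)² + exp(−E(i,k)) follows the description above (names are ours):

```python
import numpy as np

def structure_loss(dist, triplets):
    """Neighborhood structure preservation over triplets (i, j, k),
    where v_i ~ v_j are connected and v_i ~ v_k are disconnected.

    dist     : callable, dist(a, b) -> Wasserstein energy between nodes.
    triplets : iterable of (i, j, k) index triples.
    Penalizes the squared energy of connected pairs, and penalizes
    small energy on disconnected pairs via exp(-energy), which decays
    to zero as the disconnected pair is pushed apart.
    """
    loss = 0.0
    for i, j, k in triplets:
        loss += dist(i, j) ** 2 + np.exp(-dist(i, k))
    return loss
```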

Query-Specific Distance Supervision.
In our problem setting, we already have a known query node, and the task is to find similar nodes in the network. We assume the q-th node is the query node, v_q. To use the query to guide the learning process, we define a label for each node to indicate whether it is similar to the query. By default, the query is similar to itself; thus, y_q = 1. For the other nodes, it is difficult to define the labels accurately.
Thus, we develop a heuristic method to learn the labels from the Wasserstein distance between the query node and a given candidate node. For this purpose, we split the distance range into three intervals, divided by an upper bound, u, and a lower bound, l, where l < u. The labeling process selects the nodes whose Wasserstein distance to the query is smaller than l as positives and the nodes whose Wasserstein distance to the query is larger than u as negatives. The nodes whose distance to the query lies between l and u are left ambiguous. Thus, the label of a node v_i is defined as y_i = 1 if dist(v_i, v_q) < l, y_i = −1 if dist(v_i, v_q) > u, and None otherwise.
We further define an indicator β_i, with β_i = 1 if v_i is labeled and 0 otherwise. The range of distances between l and u is the range of ambiguous nodes, and we define u − l as the ambiguous range. For the labeled nodes, we minimize a linear loss L(v_i, v_q; y_i) = y_i × dist(v_i, v_q). Meanwhile, we also hope the ambiguous range can be as small as possible so that more nodes can be labeled. Thus, we minimize u − l.
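The labeling rule can be stated as a small helper; positives are candidates closer to the query than the lower bound, negatives are farther than the upper bound, and anything in between stays unlabeled:

```python
def label_node(dist_to_query, l, u):
    """Query-specific labeling by the learned bounds l < u.

    Returns +1 (positive) when the node's Wasserstein distance to the
    query is below the lower bound l, -1 (negative) when it is above
    the upper bound u, and None for the ambiguous interval [l, u].
    """
    if dist_to_query < l:
        return 1
    if dist_to_query > u:
        return -1
    return None
```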
The overall minimization problem is the combination of the linear losses of the labeled nodes and the ambiguous range, weighted by a regularization parameter c. In this way, for the positive nodes, which are labeled as similar to the query, the distance to the query is minimized, while the distance to the query for the negatives is maximized. The overall optimization problem is the combination of the three subproblems, where Φ represents the set of parameters of the CNN, the GCN, and the Gaussian-based encoder-decoder. Solving this problem directly is difficult because the label definition, the ambiguous range parameters, and the Gaussian-based encoder parameters are coupled. To be specific, the labels are defined over the ambiguous range and the distances between the Gaussian distributions of the nodes, while the parameters of the Gaussian distributions are learned from the labels of the nodes. To solve this problem, we use the fixed-point iteration method [37][38][39][40] in an iterative algorithm. We first fix the parameters Φ and the ambiguous range parameters, u and l, to update the labels according to (20). Then we fix the labels and the ambiguous range parameters to update the parameters Φ. This subproblem is solved by the back-propagation algorithm with the ADAM optimizer [41].
Finally, we fix Φ and the labels to update the ambiguous range parameters. We use the gradient descent algorithm to solve this subproblem, where ρ is the descent step size. Since ∂o_2(u, l)/∂u = 1 and ∂o_2(u, l)/∂l = −1, in each descent step, u is decreased by ρ, while l is increased by ρ, until u = l.
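The bound update can be sketched as follows; since the gradients of u − l with respect to u and l are the constants 1 and −1, each step simply shrinks the ambiguous interval from both sides (the clamping at the midpoint is our own safeguard so the bounds never cross):

```python
def shrink_ambiguous_range(u, l, rho, steps):
    """Gradient-descent update of the bounds, minimizing u - l.

    With d(u - l)/du = 1 and d(u - l)/dl = -1, each descent step moves
    u <- u - rho and l <- l + rho, until the two bounds meet.
    """
    for _ in range(steps):
        if u - l <= 2 * rho:
            # one more full step would make the bounds cross; clamp them
            u = l = (u + l) / 2.0
            break
        u -= rho
        l += rho
    return u, l
```

In the full algorithm this update alternates with the label update and the back-propagation step, so the interval only shrinks as far as the current iteration allows.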

Experimental Results
In this section, we conduct experiments over benchmark data sets of networks.

Datasets.
In the experiments, we use the following benchmark datasets of the innovation networks.
The first dataset is the Cora dataset [42]. This dataset is a network of research articles on machine learning topics. The research articles are treated as nodes, and the edges are citations between papers. The abstract of each article is treated as the content of the node. This network has only 2,211 nodes and 5,214 edges. The content of the articles has around 170 words on average; the total number of unique words over the nodes of this network is 12,619. The second dataset is the Citeseer dataset [43]. This dataset is a network of scientific articles on ten different multidisciplinary topics. Each node is also a research article, and each edge is a citation relation connecting two articles. But in this network, the content of each node is its title rather than its abstract. The number of nodes of this network is 4,610, and the number of edges is 5,923. The content has 10 words on average, and the number of unique words overall is 5,523. The third dataset is the DBLP dataset [44]. This network contains the bibliography data of 13,404 computer science articles. Each node is an article, and each edge is a citation relation. The articles are labeled with four different research topics, including artificial intelligence and computer vision. The content of each node is also the title. The number of edges is 39,861. The average length of the content is 10 words, and the size of the unique word set is 8,501.

Experimental Settings.
To conduct the experiments, we set up the following protocol using leave-one-out validation. For each network, we leave one node out as the query node and use the remaining nodes as the candidate nodes to be retrieved. Since our data are research articles, we define the relevance of two research articles according to their research subareas. If an article is in the same subarea as the query article, then it is defined as a positive node. The task is to retrieve as many positives as possible while keeping as many negatives as possible out of the search results.
This process is repeated for each node of the network in turn; that is, each node is treated as a query node one by one. Then we apply our algorithm to learn the embeddings of the nodes, use the Wasserstein distances to measure the dissimilarity between the query and each candidate node, and rank the candidates accordingly, returning the nodes with the smallest Wasserstein distances. To measure the performance of the retrieval results, we use the mAP (mean average precision) [45,46]. Remark 2. mAP is an effective measure of database retrieval performance. Given a set of queries, Q, the retrieval system returns a list of ranked database objects for each query. For each query, q ∈ Q, we can calculate a precision at each rank k, Prec(q)@k = TP(q)@k / k, where TP(q)@k is the number of returned objects relevant to q within the top k ranks. The average precision (AP) of q is calculated as the average of the precisions over the ranks at which relevant objects are returned, while mAP is the mean of AP over the queries in Q:
mAP = (1/|Q|) Σ_{q ∈ Q} AP(q), where |Q| is the size of Q.
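The metric defined in the remark above can be computed directly from per-query lists of relevance flags; this sketch averages precision@k over the ranks at which relevant objects appear, matching the definition used here:

```python
def average_precision(ranked_relevance):
    """AP for one query.

    ranked_relevance : list of 0/1 relevance flags in ranked order.
    Averages precision@k over the ranks k where a relevant object
    is returned (0.0 if nothing relevant is retrieved).
    """
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision@k = TP@k / k
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_rankings):
    """mAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)
```

For example, a ranking with relevant items at ranks 1 and 3 yields AP = (1/1 + 2/3)/2 = 5/6.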

Experimental Results.
We compare the proposed method, named Query-Specific Deep Embedding of Content-Rich Network (QDECN), first against network-based ranking methods and then against other network embedding methods. For the comparison with the network embedding methods, we first embed the nodes of the network and then use the dot-product scores between the embedding vectors of the candidates and the query as the ranking scores for retrieval.

Comparison with Network-Based Ranking Methods.
We compared against the following network-based ranking methods: graph transduction (GT) [27], tensor product graph diffusion (TPGD) [26], and Query-Specific Optimal Networks (QUINT) [25]. The comparison results are shown in Table 1. From this table, we can see that the proposed method, QDECN, outperforms the other methods in all cases. This is not surprising, for the following reasons.
(1) QDECN is the only method that explores the content of the nodes of the network; all the other methods utilize only the network structure information, such as the edge data. However, in these innovation network datasets, two articles may not have a citation relation, yet, according to their content similarity, they should belong to the same subarea. QDECN not only encodes the content of each node into its representation but also leverages the content features of its neighboring nodes. So it has the capability to learn from both the node content and the edges of the network, while the other network-based ranking methods learn only from the network structure itself. (2) QDECN is the only method that employs network embedding technology to improve retrieval performance. The remaining methods, such as QUINT and TPGD, aim to learn a better network affinity matrix to guide the learning of the ranking scores, but they still follow the same schema as GT. The network-based ranking methods use the network affinity matrix to regularize the learning of the ranking scores, whereas network embedding methods map the nodes to low-dimensional continuous vectors, which contain richer information about the network and intrinsic information about the relevance of nodes; thus, embedding is a better choice for node relevance search tasks.
(3) Only QDECN and QUINT adjust the learning of the network parameters according to the query node. This allows them to learn a network representation that is optimal for finding the nodes relevant to the query. This setting does not guarantee that the learned network representation is optimal for other tasks, but, for the given query node, it gives better results than the other methods. Please also note that the supervision information of QUINT is richer than that of QDECN, since the positive/negative nodes of QUINT are given as ground truth, while for QDECN the positives and negatives are both learned by the algorithm. Nevertheless, QDECN still outperforms QUINT in all cases.

Comparison to Network Embedding Methods.
We compare QDECN to the following network embedding methods: Network-to-Network Network Embedding (Net2Net-NE) [2], Deep Variational Network Embedding in Wasserstein Space (DVNE) [6], and Deep Recursive Network Embedding (DRNE) [23]. The results are shown in Table 2. We have the following observations from this table.
(1) Again, QDECN obtains better results than the other methods. Compared to DRNE and DVNE, the proposed method and Net2Net-NE can use the content of the nodes to enrich the embedding results. Compared to Net2Net-NE itself, our method can further sense which node is the query node and take advantage of this to guide the embedding process, which Net2Net-NE cannot. For the above reasons, the overall results of QDECN are better than those of the others. (2) DRNE and DVNE are general network embedding methods, which have no supervision from node content or the query node. Meanwhile, Net2Net-NE has supervision from the node content but cannot access the query. Thus, this is not an entirely fair competition. However, our method is the first algorithm that can use both the node content and the query node of a network. The fact that QDECN gives the best results is strong evidence that it is necessary to develop an effective method that takes both the node content and the query node into account during network embedding, especially for the purpose of information retrieval.
There are two dimensionalities of the latent spaces, d and g, and we set their values to 300 and 500, respectively, for all three datasets. The learning rate of the optimizer is set to 0.01 for all three datasets.

Parameter Analysis.
In our model, there are three tradeoff parameters, C_1, C_2, and C_3. We conduct experiments to analyse them one by one.
3.4.1. Analysis of C_1. C_1 is the weight of the network structure preservation loss term. We vary the value of C_1, measure the changes in mAP over the three datasets, and plot the curves in Figure 1. We can see that the performance of our algorithm keeps improving as the value of C_1 increases. Since this parameter is the weight of the network neighborhood structure preservation term, this phenomenon indicates that the neighborhood structure plays an important role in a good-quality network embedding process and is also critical for the node-level relevance search problem. Moreover, we also observe that the performance improvement becomes minor after a certain value. For example, for the DBLP network, this value is 1, while, for the Citeseer network, this value is 10.
3.4.2. Analysis of C_2. We vary the value of C_2 and plot the mAP curves in Figure 2. From this figure, we can conclude that, overall, a larger value of C_2 gives better performance in terms of mAP. But the improvement is limited, and the performance is not sensitive to changes in the value of C_2. A possible reason is that the positive and negative labeling is not ground truth but is estimated via the upper/lower bounds. However, the upper/lower bounds themselves are learned as variables. Thus, in nature, the learning process is an unsupervised one, so the improvement is not comparable to that of supervised learning. Even though it is unsupervised, we can still see improvements with increasing C_2, which is the benefit of sensing the query node.

Computational Intensity.
In this section, we study how computationally intensive the proposed method is. The average running time (in seconds) of the learning process for a query over the Cora, Citeseer, and DBLP datasets is 43.66, 93.34, and 437.04, respectively. The running time is rather long because the model needs to be retrained for each query node, and the neighborhood-preserving term in (19) scales as n³, since we consider all possible triplets within the network. This problem becomes even more serious as the network grows large. Thus, we propose to reduce the size of the training triplet set. To this end, for each node, instead of using all the disconnected nodes to construct the training triplets, we sample only a few disconnected nodes for this purpose. After this change, the running times are reduced to 21.12, 55.07, and 94.34 seconds, respectively.
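The triplet sampling described above can be sketched as follows; instead of pairing each edge with every disconnected node, each connected pair draws a few random non-neighbors (the function name and the per-edge sample count are illustrative, and we assume the graph has at least one non-neighbor for each sampled node):

```python
import random

def sample_triplets(edges, n, neg_per_edge=1, seed=0):
    """Build training triplets (i, j, k) by sampling disconnected nodes.

    edges        : list of (i, j) connected pairs, nodes indexed 0..n-1.
    n            : number of nodes in the network.
    neg_per_edge : disconnected nodes sampled per connected pair.
    Avoids enumerating all O(n^3) triples: each connected pair (i, j)
    is matched with a few random nodes k that are not adjacent to i.
    """
    rng = random.Random(seed)
    adjacency = {i: set() for i in range(n)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    triplets = []
    for i, j in edges:
        for _ in range(neg_per_edge):
            k = rng.randrange(n)
            while k == i or k in adjacency[i]:   # resample until disconnected
                k = rng.randrange(n)
            triplets.append((i, j, k))
    return triplets
```

This reduces the structure-preservation term from all possible triples to O(|E| × neg_per_edge) sampled ones, which is consistent with the reported drop in running time.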

Contribution of Different Objective Terms during the Training Phase.
In this section, we analyse the contributions of the different terms of the objective for the Cora dataset. To measure the contribution of a term, we first remove the term from the objective, learn the model, retrieve the nodes for the queries, and calculate the mAP of the retrieval results. Then we add the term back to the objective and measure the retrieval results by mAP again. The contribution of the term is measured by the improvement in mAP after the term is added back to the objective. The term-wise mAP improvements are reported in Figure 3. From this figure, we can see that the query-specific distance supervision term gives the largest contribution, while the regularization term has the least contribution. The second and third most significant contributions come from the node identification term and the neighborhood structure preservation term, respectively.

Conclusions
In this paper, we develop a novel network embedding method for content-rich networks for the purpose of node-level information retrieval. We first use a CNN to extract features from the content, then use a GCN to encode the features of the neighboring nodes, and finally use a deep encoder-decoder to map these features to a Gaussian distribution and convert it back to the node's identity. The parameters are learned by minimizing a loss function. In the loss function, in addition to the node identification loss, the neighborhood preservation loss, and the model complexity, we also consider the query node regularization problem. For this purpose, we define positive/negative nodes according to the Wasserstein distances between the query and the candidate nodes. Experimental results show the advantage of the proposed method, which embeds the content-rich network guided by the query node.