Fusing Node Embeddings and Incomplete Attributes by Complement-Based Concatenation

Network embedding that learns representations of network nodes plays a critical role in network analysis, since it enables many downstream learning tasks. Although various network embedding methods have been proposed, they are mainly designed for a single network scenario. This paper considers a “multiple network” scenario by studying the problem of fusing the node embeddings and incomplete attributes from two different networks. To address this problem, we propose to complement the incomplete attributes, so as to conduct data fusion via concatenation. Specifically, we first propose a simple inductive method, in which attributes are defined as a parametric function of the given node embedding vectors. We then propose its transductive variant by adaptively learning an adjacency graph to approximate the original network structure. Additionally, we also provide a light version of this transductive variant. Experimental results on four datasets demonstrate the superiority of our methods.


Introduction
Social network sites (SNSs, also commonly referred to as social networking services) are online platforms that provide users with various features to facilitate digital social interaction and information sharing [1,2]. Over three billion users are currently active on various SNSs (such as Facebook, Twitter, and QQ), spending on average two hours daily. These large and active SNSs naturally form an important part of the digital economy, making social network analysis [3,4] a hot research topic over the years.
Recently, network embedding [5], as a fundamental problem in network analysis, has attracted considerable research interest. Network embedding learns low-dimensional vector representations for network nodes. The learned vectorized representations, which preserve certain structural and content information of networks, can be easily combined with off-the-shelf learning algorithms for many social network analysis tasks such as node classification [6], link prediction [7], and diffusion prediction [8].
1.1. Problem. Although various network embedding methods have been proposed, they mainly focus on a single network scenario. In the era of big data, the related information from different networks should be fused together to facilitate applications. In this paper, we consider a "multiple network" scenario by studying the problem of fusing the node embeddings and incomplete attributes provided by two different networks.
As illustrated in Figure 1, this problem has practical importance. Imagine that you use Yelp (see Figure 1(a)), a popular review app, and try to sign in to your account there. Yelp allows you to sign in with your Facebook account. In addition, as the node (user) embeddings not only preserve certain characteristics of networks but also protect users' privacy [9], Facebook may provide these embeddings to Yelp to facilitate its applications, e.g., cold-start recommendation. More importantly, as some Yelp users begin to write reviews, a very practical problem arises: is it possible to fuse the original node embeddings provided by Facebook and the reviews provided by Yelp to get new user embeddings (illustrated in Figure 1(b))?
1.2. Challenge and Solution. One fundamental challenge is the incompleteness of attributes, i.e., only a small portion of the nodes are provided with attributes. This challenge is very common: as reported in [10], the distribution of user activity tends to be long-tailed, suggesting that most social media content (like the reviews on Yelp) is actually written by a few active users. To address this, we propose to complement the incomplete attributes by defining attributes as a parametric function of the given node embedding vectors. This complement enables us to conduct data fusion via concatenation (illustrated in the bottom right corner of Figure 1(b)).
To obtain high-quality fusion results, we further propose a transductive method by adaptively learning an adjacency graph to approximate the original network structure. In particular, the adjacency graph is learned by jointly considering the given node embeddings and attribute knowledge. Additionally, we also provide a light version of the proposed transductive method. Specifically, for each node, this light version reduces its neighbor candidate set for efficient adjacency graph learning. We then conduct extensive experiments to verify the effectiveness of our methods.
In summary, our main contributions are as follows: (i) We study the problem of fusing node embeddings and incomplete attributes from two different networks. To the best of our knowledge, little work has addressed this problem. (ii) We propose a very simple and effective inductive method based on the idea of attribute complement. (iii) We further propose a transductive method POINTS and its light version POINTS*, both of which obtain superior performance.
The remainder of the paper is organized as follows. We review related work in Sect. 2 and formalize the problem in Sect. 3. We present our methods in Sect. 4, develop their optimization in Sect. 5, and provide some discussion in Sect. 6. We report experiments in Sect. 7. Finally, we conclude in Sect. 8.

Related Work

2.1. Network Embedding.
Over the past few years, there has been a lot of interest in automatically learning useful node embeddings (i.e., features) from large-scale networks [5]. A representative work is DeepWalk [6], which performs random walks on a network to generate node sequences and then applies the skip-gram algorithm [11] to those sequences to obtain the embedding. Another well-known work is LINE, which preserves both the first-order proximity (i.e., the similarity between linked nodes) and the second-order proximity (i.e., the similarity between nodes with shared neighbors) of a network. In addition, researchers have also proposed some deep learning-based embedding models, such as SDNE [12] and GraphGAN [13]. Recently, many studies consider network embedding with side information, such as node attributes. For example, by proving that DeepWalk is equivalent to matrix factorization, the work in [14] presents text-associated DeepWalk (TADW). GraphSAGE [15] employs graph convolutional networks [16] to aggregate features within a node's neighborhood for network embedding. RSDNE [17] and RECT [18] further consider the problem of zero-shot graph embedding, i.e., the completely imbalanced label setting.

2.2. Data Fusion.
Data fusion is the study of efficient methods for automatically transforming information from different sources and different points in time into a representation that provides effective support for human or intelligent systems. Data fusion has proved useful in many disciplines, as discussed in [19,20]. For example, in bioinformatics, jointly analyzing multiple datasets describing different organisms improves the understanding of biological processes [21]. In information retrieval, fusing the retrieval results from multiple search engines would significantly improve the retrieval performance [22]. In biometric recognition systems, feature fusion could greatly improve the recognition performance [23]. We refer to [24,25] for a comprehensive survey. However, little previous work considers the fusion of incomplete data or network embedding data. Our work fills this gap.

Problem Statement
The studied problem is defined as follows. We are given the node embeddings of a network U ∈ ℝ^(n×d), where n is the number of nodes and the i-th row of U (denoted as u_i) is the d-dimensional embedding vector of node i. In addition, another network provides the attributes of l (l < n) nodes: L_A = {(u_1, a_1), ⋯, (u_l, a_l)}, where a_i ∈ ℝ^(1×m) is the attribute vector of node i and m is the number of attribute features. Our goal is to fuse the given node embeddings and these incomplete attributes, so as to obtain updated embeddings for all nodes. Note that, unlike in existing network embedding methods, the original network structure is unknown in our problem.

4.1. Fusion via Attribute Complement.
Since only a small part of the nodes are provided with attributes, we cannot directly fuse node embeddings and attributes. To address this problem, we adopt a very simple complement strategy: predicting the missing attributes. In particular, for each node i that is provided with attributes, we assume that its node embedding u_i should be able to generate its attribute vector a_i. The optimal generation function f can be obtained by solving the following minimization problem:

min_f ∑_{i=1}^{l} ℓ(f(u_i), a_i),     (1)

where ℓ is a loss function that measures the reconstruction error, such as the squared loss or hinge loss. By solving the problem in Eq. (1), we obtain the generation function f. Then, for a node i with no attributes, we can predict its attributes as f(u_i). This complement enables us to conduct data fusion via concatenation. More details and discussion about the concatenation strategy can be found in Sect. 6.2.
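To make this concrete, here is a minimal sketch of the inductive complement, assuming (as one simple instance of Eq. (1)) a linear generation function f(u) = uW trained with the squared loss; the variable names and toy dimensions are ours, not from the paper:

```python
import numpy as np

def complement_attributes(U, A_known, known_idx):
    """Fit a linear generator f(u) = uW on the attributed nodes by least
    squares (squared loss), then predict attributes for all nodes."""
    W, *_ = np.linalg.lstsq(U[known_idx], A_known, rcond=None)
    return U @ W  # predicted attribute matrix Y, shape (n, m)

# toy example: 6 nodes, 4-dim embeddings, 3-dim attributes known for nodes 0..2
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))
W_true = rng.normal(size=(4, 3))
A_known = U[:3] @ W_true                      # observed attributes
Y = complement_attributes(U, A_known, np.arange(3))
print(Y.shape)  # (6, 3)
```

Fitting on only the attributed nodes and then applying f to every node is what makes the method inductive: no unattributed node participates in training.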

4.2. Transductive Attribute Prediction.
The method formulated in Eq. (1) is inductive. In this section, we present a transductive method. Generally, transductive methods, which leverage the test data for model training, perform better than inductive methods [26]. For network embedding, classical transductive methods exploit all network nodes by preserving the inherent network structure in the embedding space, i.e., connected nodes tend to have similar embeddings [27,28]. Although the original network structure is unknown, one can simply build a sparse adjacency graph S to approximate it (we use the term "graph" for the recovered network structure, to avoid ambiguity with the original network structure), i.e., S_ij = 1 when node j is among the k-nearest neighbors of node i in the given node embedding space, and S_ij = 0 otherwise. This approximation captures the intuition of transductive learning through the following cost term:

min_Y ∑_{i,j} S_ij Dist(y_i, y_j),  s.t. y_i = a_i, ∀i ∈ {1, ⋯, l},     (2)

where Dist(⋅, ⋅) is a distance function, and y_i ∈ ℝ^(1×m) (the i-th row of matrix Y ∈ ℝ^(n×m)) is the predicted attribute vector of node i. The imposed constraint ensures that the predicted attributes are consistent with the known attributes. The adjacency graph plays a crucial role in this kind of graph-based transductive learning method [29,30]. However, the matrix S in Eq. (2) might not be the optimal adjacency graph. On the one hand, the original network information is only approximately described by the given node embeddings (i.e., u_{i=1,⋯,n}) from which S is built. On the other hand, the construction of S ignores the attribute information, i.e., similar (dissimilar) attributes indicate similarity (dissimilarity) between nodes. In this paper, we address this in an adaptive way: we propose to learn S by jointly considering the given node embeddings and the attribute knowledge.
This yields the following cost term:

min_{S,Y} ∑_{i,j} S_ij (α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂²),  s.t. s_i′1 = k, S_ij ∈ {0,1},     (3)

where s_i ∈ ℝ^(n×1) is the vector whose j-th element is S_ij (i.e., s_i′ is the i-th row vector of matrix S), 1 denotes a column vector with all entries equal to one, and α and β are two adjustable parameters. Intuitively, the first and second terms of Eq. (3) measure how well the adjacency graph fits the attributes and the given node embeddings, respectively.
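For contrast with the adaptive graph of Eq. (3), the fixed k-nearest-neighbor graph S that Eq. (2) starts from can be sketched as a small numpy routine (a toy implementation; the function name is ours):

```python
import numpy as np

def knn_graph(U, k):
    """0/1 adjacency graph: S[i, j] = 1 iff j is one of node i's k nearest
    neighbors (squared Euclidean distance) in the embedding space."""
    n = U.shape[0]
    d2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # a node is not its own neighbor
    S = np.zeros((n, n))
    nbrs = np.argsort(d2, axis=1)[:, :k]    # indices of the k smallest distances
    np.put_along_axis(S, nbrs, 1.0, axis=1)
    return S

# two well-separated pairs of nodes
U = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
S = knn_graph(U, 1)
print(S)
```

Note that S built this way is generally not symmetric, which is why the optimization later works with the symmetrized graph (S + S′)/2.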
The unified model POINTS: combining the learning of the attribute generation function (Eq. (1)) and of the adjacency graph (Eq. (3)), the proposed method solves the following optimization problem:

min_{f,S,Y} ∑_{i=1}^{n} ℓ(f(u_i), y_i) + ∑_{i,j} S_ij (α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂²),  s.t. s_i′1 = k, S_ij ∈ {0,1}, y_i = a_i, ∀i ∈ {1, ⋯, l}.     (4)

Since the key idea of this method is to learn the adjacency graph adaptively, we term our method adaPtively netwOrk embeddIng aNd aTtribute fuSion (POINTS).
A light version of POINTS: to learn the optimal neighbors of each node i, POINTS needs to consider all nodes. This is inefficient, as the network may be extremely large (a complexity analysis can be found in Sect. 6.3). Therefore, we give a light version of POINTS (denoted as POINTS*). In particular, we build a candidate neighbor set (denoted as N_{k*}(i)) for each node i, where k* (k < k* ≪ n) is the candidate neighbor number. Based on this idea, the light version POINTS* solves the following optimization problem:

min_{f,S,Y} ∑_{i=1}^{n} ℓ(f(u_i), y_i) + ∑_{i,j} S_ij (α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂²),  s.t. s_i′1 = k, S_ij ∈ {0,1}, S_ij = 0 ∀j ∉ N_{k*}(i), y_i = a_i, ∀i ∈ {1, ⋯, l}.     (5)
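A hedged sketch of the candidate-set construction for POINTS*: one natural choice (an assumption on our part, since the construction is not spelled out here) is to take each node's k* nearest neighbors in the given embedding space, computed once up front:

```python
import numpy as np

def candidate_sets(U, k_star):
    """For each node, keep its k* nearest nodes in the embedding space as
    the only neighbor candidates considered when S is later optimized."""
    d2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude the node itself
    return np.argsort(d2, axis=1)[:, :k_star]  # shape (n, k*)

# 1-d toy embeddings: three close nodes and one outlier
U = np.array([[0.0], [1.0], [2.0], [10.0]])
cand = candidate_sets(U, k_star=2)
print(cand)  # row i lists node i's 2 nearest candidates
```

Because this matrix is computed only once, the per-iteration cost of updating S drops from sorting all n nodes to sorting k* candidates per node.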

Optimization
The objective functions of POINTS (i.e., Eq. (4)) and POINTS* (i.e., Eq. (5)) both contain 0/1 constraints, which are difficult to handle with conventional optimization tools. In this section, we develop efficient solutions for these two problems.
5.1. Optimization for POINTS. We solve problem (4) by alternately updating the three variables, instantiating ℓ as the squared loss and f as a linear function f(u_i) = u_i W with parameter matrix W ∈ ℝ^(d×m). When the other variables are fixed, the partial derivative of the objective J w.r.t. W is

∂J/∂W = 2U′(UW − Y).     (6)

Therefore, we can update W as W ← W − η(∂J/∂W), where η is the learning rate.
When the other variables are fixed, we can obtain the partial derivative of J w.r.t. Y as

∂J/∂Y = 2(Y − UW) + 4αΔY,     (7)

where Δ = D − (S + S′)/2, and D is a diagonal matrix whose i-th diagonal element is ∑_j (S_ij + S_ji)/2. Then, we can update Y as Y ← Y − η(∂J/∂Y). After that, for each node i with given attributes a_i, we adjust its predicted attributes as y_i = a_i, so as to satisfy the constraint in Eq. (4).
Update rule of S: when the other variables are fixed, the original optimization problem reduces to

min_S ∑_{i,j} S_ij (α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂²),  s.t. s_i′1 = k, S_ij ∈ {0,1}.     (8)

As problem (8) is independent across different i, we can instead solve n decoupled subproblems:

min_{s_i} ∑_j S_ij (α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂²),  s.t. s_i′1 = k, S_ij ∈ {0,1}.     (9)

The optimal solution of problem (9) is (proved in Sect. 6.1)

S_ij = 1 if j ∈ N_k^{UA}(i), and S_ij = 0 otherwise,     (10)

where the set N_k^{UA}(i) contains the top-k nearest nodes to i in the "embedding-attribute" space, in which the distance between nodes i and j is defined as α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂². We iteratively update these three variables until convergence to obtain the final solution. After that, as discussed in Sect. 6.2, we get the final fusion results by concatenation. For clarity, we summarize the complete fusion procedure in Alg. 1.
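The three update rules above can be sketched as one alternating pass, assuming the squared loss and a linear generator f(u) = uW as before; the step size, toy dimensions, and function name are illustrative:

```python
import numpy as np

def points_step(U, Y, W, S, a_known, known_idx, alpha=1.0, beta=1.0,
                k=2, eta=0.01):
    """One alternating pass over W, Y, and S (a sketch, not Alg. 1 verbatim)."""
    # 1) gradient step on W for the reconstruction term ||UW - Y||_F^2
    W = W - eta * 2 * U.T @ (U @ W - Y)
    # 2) gradient step on Y using the Laplacian of the symmetrized graph
    Ssym = (S + S.T) / 2
    Delta = np.diag(Ssym.sum(1)) - Ssym
    Y = Y - eta * (2 * (Y - U @ W) + 4 * alpha * Delta @ Y)
    Y[known_idx] = a_known              # enforce the constraint y_i = a_i
    # 3) closed-form S: top-k neighbors in the "embedding-attribute" space
    dist = alpha * ((Y[:, None] - Y[None]) ** 2).sum(-1) \
         + beta * ((U[:, None] - U[None]) ** 2).sum(-1)
    np.fill_diagonal(dist, np.inf)
    S = np.zeros_like(S)
    np.put_along_axis(S, np.argsort(dist, 1)[:, :k], 1.0, axis=1)
    return Y, W, S

rng = np.random.default_rng(1)
U = rng.normal(size=(8, 3))
Y, W, S = np.zeros((8, 2)), np.zeros((3, 2)), np.zeros((8, 8))
a_known, known_idx = rng.normal(size=(3, 2)), np.arange(3)
for _ in range(5):
    Y, W, S = points_step(U, Y, W, S, a_known, known_idx)
print(S.sum(1))  # each row selects k = 2 neighbors
```

Each pass costs what Sect. 6.3 accounts for; the dominant term is the full pairwise sort when updating S, which is precisely what POINTS* trims down.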

5.2. Optimization for POINTS*. The optimization approach of POINTS* is very similar to that of POINTS in Sect. 5.1. The only difference is that, when updating S with the other variables fixed, we only need to sort the nodes in the neighbor candidate set N_{k*}(i) to get the top-k nearest neighbors in the "embedding-attribute" space, so as to obtain the optimal solution of S.
Proof. By contradiction, suppose node i has obtained an optimal neighbor set N_k^{UA} that contains a node p not among i's top-k nearest nodes in the "embedding-attribute" space. For convenience, we use Ψ(i, j) to denote the distance between nodes i and j in this space, i.e., Ψ(i, j) = α‖y_i − y_j‖₂² + β‖u_i − u_j‖₂². As such, there must exist a node q ∉ N_k^{UA} that is one of i's top-k nearest nodes in this space. Then, we get Ψ(i, p) > Ψ(i, q). Considering our minimization problem (i.e., Eq. (9)), replacing p with q therefore strictly decreases the objective value. This indicates that (N_k^{UA} \ {p}) ∪ {q} is a better solution than N_k^{UA}, a contradiction. Actually, we can generalize the above proof to a more general case.
Proof. This conclusion can be proved by replacing the squared Euclidean distance function in the proof of Theorem 1 with Dist(⋅, ⋅).

6.2. Fusion Strategy.
In this part, we discuss how to conduct data fusion based on the proposed attribute complement methods. The inductive method (described in Sect. 4.1) learns a generation function f; then, for each node i, we can predict its attribute vector as y_i = f(u_i). The two transductive methods (described in Sect. 4.2) directly produce the predicted attribute vectors y_{i=1,⋯,n}. As such, the attributes are completed for fusion. Specifically, we adopt the following concatenation strategy: (1) if node i has no attributes, we obtain its final fusion vector by concatenating u_i and the predicted attribute vector y_i; (2) if node i has attributes, we obtain its final fusion vector by concatenating u_i and a_i. The principle behind this strategy is that the given attributes are more reliable and accurate than the predicted ones for describing a node.
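The concatenation strategy is straightforward to implement; a minimal sketch (variable names are ours):

```python
import numpy as np

def fuse(U, Y, A_known, known_idx):
    """Concatenate each node's embedding with its given attributes when
    available, otherwise with its predicted attributes."""
    A = Y.copy()
    A[known_idx] = A_known      # prefer the given, more reliable attributes
    return np.hstack([U, A])    # final fusion vectors, shape (n, d + m)

U = np.ones((4, 3))
Y = np.zeros((4, 2))            # predicted attributes for every node
A_known = np.full((2, 2), 9.0)  # given attributes for nodes 0 and 1
Z = fuse(U, Y, A_known, np.array([0, 1]))
print(Z.shape)  # (4, 5)
```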
6.3. Time Complexity. The time complexity of Alg. 1 is as follows. The complexity of updating W is O(d²n + d²m + dmn). The complexity of updating Y is O(nnz(Δ)m + dmn), where nnz(⋅) is the number of nonzero entries of a matrix. The complexity of updating S is O(n² log n), because for each node we have to find its top-k nearest neighbors. As d, m ≪ n and nnz(Δ) is linear in n, the overall complexity of POINTS is O(τ n² log n), where τ is the number of iterations to converge.
For the light version POINTS*, the complexity of updating S becomes O(n k* log k*), and all other costs remain the same. Hence, since k* ≪ n, the overall complexity becomes O(τ(d²n + dmn + n k* log k*)). As our method usually converges fast (τ ≤ 20 in our experiments) and d, m ≪ n, the complexity of POINTS* is linear in the number of nodes.

Experiments
Datasets: we conduct our experiments on four widely used citation network datasets: Citeseer [31], Cora [31], Wiki [32], and Pubmed [32]. In these networks, nodes are documents and edges denote citation relationships between them. Node attributes (i.e., features) are the bag-of-words representations of documents. The statistics of these networks are shown in Table 1. Experimental setting: as illustrated in Figure 1, for each dataset, we first obtain the original node embeddings and then provide some nodes with attributes for data fusion, so as to simulate fusing data from two different networks. Specifically, we first obtain the original node embeddings with the well-known network embedding method LINE, adopting its first-order proximity version LINE(1st). (We also try other network embedding methods in Sect. 7.3.) After obtaining the original node embeddings, we randomly select some nodes and provide them with attributes. Finally, we employ different fusion methods to obtain the final fusion results.
Baseline methods: since this incomplete data fusion problem has not been studied before, there is no natural baseline to compare with. We thus compare our methods with methods that directly fuse the original given node embeddings and attributes, listed as follows: (1) LINE(1st): the method used to obtain the original node embeddings; it neglects the incomplete node attributes (in Sect. 7.3, we also try more network embedding methods). (2) Attributes: the zero-padded attributes used directly as fusion results; this method neglects the given node embeddings. (3) NaiveCombine: the simple concatenation of the given node embeddings and the zero-padded attributes. For our method, we test three different versions: POINTS_ind (the inductive version formulated in Eq. (1)), POINTS (the full transductive version formulated in Eq. (4)), and POINTS* (the light version formulated in Eq. (5)).
Parameters: we follow the suggestion of LINE and set the embedding dimension to 128. In addition, following [14], we reduce the dimension of the attributes by applying an SVD decomposition to the original text features; for simplicity, we also reduce this dimension to 128. In the proposed methods POINTS and POINTS*, we fix the parameters α = 1 and β = 1 throughout our experiments, although tuning them would yield better results. Besides, we simply set the neighbor number k = 5, as in most graph-based transductive methods [33], and set the candidate neighbor number k* = 20k for POINTS*.
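The SVD-based attribute reduction can be sketched with a plain truncated SVD (this is our reading of the preprocessing in [14]; the implementation details are assumptions):

```python
import numpy as np

def reduce_attributes(A, dim=128):
    """Reduce bag-of-words attributes with a truncated SVD, keeping the
    top singular directions (A ≈ U_r Σ_r V_r^T; we keep U_r Σ_r)."""
    Uv, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(dim, len(s))
    return Uv[:, :r] * s[:r]    # (n, r) reduced attribute features

# toy bag-of-words matrix: 200 documents, 300-word vocabulary
A = np.random.default_rng(2).random((200, 300))
A_red = reduce_attributes(A, dim=128)
print(A_red.shape)  # (200, 128)
```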
7.1. Node Classification. Following [6], we train one-vs-rest logistic regression classifiers to evaluate the quality of the fusion results (i.e., the updated embeddings). Specifically, for Citeseer, Cora, and Wiki, we fix the label rate of the classifiers to 10%. Since Pubmed is a much larger dataset with fewer classes, we follow [34] and set the percentage of labeled data to 1%. In addition, we vary the rate of nodes with attributes from 10% to 90% on all datasets. Following [28], before evaluation, we normalize all representation vectors to unit length for a fair comparison. Figures 2 and 3 show the classification performance measured by micro-F1 and macro-F1 [35], respectively. We can draw three conclusions from these results. Firstly, all our methods (POINTS_ind, POINTS, and POINTS*) outperform the baseline methods significantly. For example, on Citeseer with 50% attributes, POINTS_ind, the weakest of our three methods, still outperforms LINE(1st) by 13%, Attributes by 8%, and NaiveCombine by 3%. The improvements of our two transductive methods POINTS and POINTS* are even more remarkable. These results clearly demonstrate the effectiveness of our complement strategy.
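For reference, micro- and macro-F1 for single-label predictions can be computed as follows (a self-contained sketch; in the single-label case micro-F1 coincides with accuracy):

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools TP/FP/FN over all classes; macro-F1 averages the
    per-class F1 scores, weighting every class equally."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    micro = np.mean(y_true == y_pred)   # accuracy, for single-label data
    return float(micro), float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

Macro-F1 is the more telling of the two when classes are imbalanced, since a classifier that ignores rare classes is penalized on every class equally.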
Secondly, the two proposed transductive methods (POINTS and POINTS*) consistently outperform our inductive method POINTS_ind. In particular, on Citeseer, Cora, and Pubmed, the two transductive methods generally outperform POINTS_ind by 5-12%. On the other hand, we also find that the improvement becomes less significant on Wiki. We conjecture that it may be hard to recover Wiki's original network structure from the given node embeddings and attributes; more specifically, this might be because Wiki (whose edge number is eight times its node number) is much denser than the other three datasets.
Thirdly, the light version POINTS * is comparable to POINTS on all datasets. This indicates that we can reduce the neighbor candidate set size for efficient transductive learning.
7.2. Visualization. Following [28], we use the t-SNE package [36] to visualize the final node representations obtained by different fusion methods. Without loss of generality, we choose the first dataset, Citeseer, and test the case with 50% attributes. As in [28], for a clear comparison, we visualize the nodes from three different research fields: IR, DB, and HCI. Figure 4 shows the visualization results.
As shown in Figures 4(a)-4(c), the visualization results of the three compared baselines are not very meaningful: the points belonging to different categories are heavily mixed with each other. This is because none of these baselines can sufficiently utilize the incomplete attributes. In contrast, as shown in Figures 4(d)-4(f), the results of our three methods are much better (nodes with the same colors are distributed closer together). In addition, compared to our inductive method POINTS_ind, our two transductive methods POINTS and POINTS* show a more meaningful layout. Specifically, the blue points in POINTS_ind are partly separated by the red points, while these two types of points in POINTS and POINTS* are less mixed with each other. To clarify the reason, we further visualize the predicted attributes of POINTS_ind and POINTS.

7.3. More Network Embedding Baselines. We evaluate the performance of our methods on top of more network embedding methods; in particular, we further test another five network embedding methods. Without loss of generality, we fix the label rate to 10% and provide 50% of the nodes with attributes. For convenience, we use "OrigEmb" to denote the original node embeddings obtained by the various network embedding methods. Figure 5 shows the performance on Citeseer. We can clearly see that our methods (POINTS_ind, POINTS, and POINTS*) consistently outperform the baselines by a large margin. On the other hand, the light version POINTS* always achieves accuracy similar to its full version POINTS. Taken together, these observations clearly indicate the effectiveness of our methods.

Conclusion
This paper investigates the problem of fusing node embeddings and incomplete attributes provided by two different networks. We develop both inductive and transductive variants of our method, and additionally provide an efficient light version of the transductive variant. Extensive experiments have demonstrated the effectiveness of our methods. In the future, we plan to extend our method to fuse more types of related information from more networks and resources.

Data Availability
The datasets used in this paper can be found at https://linqs.soe.ucsc.edu/data.

Conflicts of Interest
The author(s) declare(s) that they have no conflicts of interest.