Sign Prediction on Social Networks Based Nodal Features

The sentiments among social individuals are complex and diverse, and the relationships between them may be friendly or hostile. Positive (“friendly”, “like” or “trust”) and negative (“hostile”, “dislike” or “distrust”) sentiments in these relations can be modeled as signed connections or links. The missing relations or sentiments between individuals are always worth inferring. Sign prediction on links has significant applications in a variety of online settings, such as online recommendation systems and abnormal user detection. A novel sign prediction method, called the SPR model, is built on two indexes: one is similarity; the other is preference-reputation (PR). The similarity of a pair of nodes is defined by the statistical properties of local structures. This definition agrees with the theory of social balance, because existing connections reflect the tendency of new links to emerge between individuals. The PR value measures the positive or negative tendency of edges without a sign. Experiments on large real social datasets demonstrate the feasibility and efficiency of the SPR model: compared with some popular prediction methods, the SPR model shows lower complexity and higher accuracy. The experimental results also show that the SPR model provides insight into the mechanism driving the sign formation of links.


Introduction
In social networks, relations among members exhibit not only friendship and cooperation but also hostility and competition. Positive and negative links are used to describe cooperative (friendly/trustful) and competitive (hostile/distrustful) relationships, respectively. Assigning signs to links is a significant way of including more information in networks than traditional binary or weighted approaches [1][2][3]. One of the challenges in signed networks is inferring the signs of unknown relations, which is often referred to as sign prediction [4] and which reveals the underlying relationships between social members. Therefore, it can be widely used in applications such as recommendation systems and abnormal user detection [5].
Sign prediction is the problem of inferring those hidden signs using the information provided by the rest of the network. It is similar to link prediction, which is a well-studied problem in traditional unsigned social network analysis [6]. However, compared with link prediction, sign prediction is still in its early stage owing to the following difficulties. On the one hand, the effects of negative and positive signs are unbalanced in signed social networks [7,8]. Positive signs can be propagated between members of social networks, while negative signs cannot. For example, if A trusts B and B trusts C, then A will trust C to some extent; but if A distrusts B and B distrusts C, it is hard to judge the relationship between A and C directly [9]. Accordingly, in the propagation model of reference [10], the distrust relationship propagates only once among the trust relationships. On the other hand, the formation mechanism of negative links differs from that of positive links. In signed network research, fewer datasets with negative signs are available for study [11], because members of social networks rarely express their antipathy to others for fear of retaliation [12]. Negative sign prediction has thus become a difficult problem in the field of sign prediction. Therefore, in-depth study and mining of the formation mechanism of social networks is the key to improving prediction accuracy.
Sign prediction was first introduced and investigated by Guha et al. [10], and later developed along the lines of matrix calculation, machine learning, and collaborative filtering. Guha et al. [10] used matrix powers to calculate the propagation of trust and distrust. Based on such matrices, a variety of prediction techniques have been discussed. Leading eigenvectors with fitness functions were used to fine-tune clusters [13]. A random walk guided by the similarity between nodal pairs was used to study the inconsistency of distrust in propagation [14]. Minimizing the rank of the adjacency matrix approximately yields the most balanced structure [15]. To quickly obtain the maximal balanced matrix, Cai et al. [16] proposed a singular value projection algorithm, in which the product of the top-k singular vectors and singular values is taken to approximate the original matrix. Agrawal et al. [17] and Hsieh et al. [18] approximated the original matrix by a matrix decomposition method, in which the original matrix is decomposed into the product of two factor matrices, and the element values of the product matrix are used as the predicted values. To date, the machine learning methods used include logistic regression [4,9,19,20], support vector machine [21], decision tree [22], and naive Bayes [23]; the features used for learning include nodal degrees [4,9], types [23], similarity [9,20], trustworthiness [24], preference [25,26], triangle structures [4], quadrilateral structures [19], and user reviews [22,27].
Collaborative filtering focuses on similarity: similar individuals are more likely to behave similarly, which is the basic idea of sign prediction by collaborative filtering. Javari and Jalili [28] argue that computing the similarity between nodes is affected by the sparsity of social networks. Therefore, they cluster the network and use the similarity between clusters in place of the similarity between individuals. Individual behaviors in signed networks are believed to be hidden in "group intelligence", which is embodied by the community structure [5]. However, the community structure embedded in a social network is intractable to find even in complete networks [29].
Inspired by these references and their methods, this paper presents a new sign prediction method based on two indexes: one is similarity; the other is the preference-reputation (PR) value. The method is called the SPR model for short. The statistics of local structures are analyzed to explore the constitution mechanism of signed social networks, from which the similarity of a pair of nodes is defined. The meaning of similarity agrees with the theory of social balance, because the existing connections reflect the tendency of new links to emerge between individuals. The PR value, which coincides with the preferential attachment mechanism [2], measures the positive or negative tendency of edges without a sign. Experiments on real data demonstrate the feasibility and efficiency of the SPR model. Compared with popular prediction methods, the SPR model shows lower complexity and higher accuracy. The experimental results also show that the SPR model provides insight into the mechanism driving the sign formation of links. The paper is arranged as follows. The introduction and motivation are given in Section 1. In Section 2, the similarity and the PR value are defined; thereafter, the predictive method, namely the SPR model, is presented based on these indexes. In Section 3, the experimental results and comparisons on three real signed social networks, Epinions, Slashdot, and Wikipedia, are shown. Finally, the discussion and conclusion of this work are presented in Section 4.

The Method and Model
A signed graph is denoted by $G = (V, E, W)$, where $V$ and $E$ are the node set and the link set of $G$, respectively, and $W = \{+1, -1, 0\}$ is a weight set on $E$ such that the link $e_{i,j}$ has weight $w_{i,j} = 1$, $-1$ or $0$ if node $i$ shows a positive, negative, or no attitude toward node $j$. Whether positive or negative, such sentiments are clear and distinct. The absence of an attitude, however, is ambiguous and unsettled, and the precise attitude remains to be determined. A natural question is then to predict the sign of the link $e_{i,j}$ based on the information of the other links and their signs [4]. The sign prediction problem is also interpreted as the question of to "what extent the evolution of a network can be predicted using its structural information" [26].
In this section, indexes such as similarity, dissimilarity, preference and reputation are presented, and the link sign prediction model is constructed.
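As a rough illustration of the data structures these indexes rely on, the sketch below groups each node's neighbors by direction and sign, starting from a list of signed directed edges. The function name `build_neighbor_sets` and the dictionary keys are hypothetical choices made for this example, not notation from the paper, and edges with no attitude (weight 0) are assumed to be absent from the input.

```python
from collections import defaultdict


def build_neighbor_sets(edges):
    """Group neighbors by direction (out/in) and sign (pos/neg).

    edges: iterable of (source, target, sign) with sign in {+1, -1}.
    Returns a dict of four maps: out_pos, out_neg, in_pos, in_neg.
    (Names are illustrative assumptions, not the paper's notation.)
    """
    names = ("out_pos", "out_neg", "in_pos", "in_neg")
    nbrs = {name: defaultdict(set) for name in names}
    for src, dst, sign in edges:
        if sign not in (1, -1):
            raise ValueError(f"unexpected sign {sign!r} on edge ({src}, {dst})")
        suffix = "pos" if sign == 1 else "neg"
        nbrs["out_" + suffix][src].add(dst)  # dst is reached from src
        nbrs["in_" + suffix][dst].add(src)   # src points into dst
    return nbrs


# Toy signed network with four directed edges.
edges = [("a", "b", 1), ("a", "c", -1), ("b", "c", 1), ("c", "a", 1)]
nbrs = build_neighbor_sets(edges)
```

Keeping the four maps separate makes the later set intersections over positive and negative neighborhoods direct lookups rather than filtered scans.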

Similarity and Dissimilarity.
In order to predict the sign $w_{i,j}$ of the edge from node $i$ to node $j$, the prediction task must be analyzed specifically. Consider the local structures shown in Figure 1. In panel (a), since $k$ is a node pointing into $j$ and $w_{k,j} \neq 0$, the more attributes $i$ and $k$ have in common, the higher the probability that $w_{i,j} = w_{k,j}$. In panel (b), since $h$ is a node pointed to by $i$, the more attributes $h$ and $j$ have in common, the higher the probability that $w_{i,j} = w_{i,h}$. Therefore, $w_{i,j}$ can be predicted via the common attributes between $i$ and $k$ and the common attributes between $h$ and $j$. In Figure 1, since $i$ and $k$ are the source nodes and $h$ and $j$ are the target nodes of the quadrilateral structure, the common attributes between $i$ and $k$ are equal to the common attributes between $h$ and $j$. Thus, it yields twice the result with half the effort. Generally, the more common neighbors two nodes have (with consistent polarity), the more attributes they have in common. The similarity $S_{i,j}$ is then defined by Equation (1), where $N^+_{out}(\cdot)$ and $N^-_{out}(\cdot)$ are the sets of neighbors reached from a node by positive and negative links, respectively, and $N_{in}(j)$ is the set of neighbors with links into node $j$, irrespective of the signs of the links. Further, $S_{i,j}$ is refined by the signs of the links between node $j$ and its neighbors, as in Equation (2), where $S^+_{i,j}$ and $S^-_{i,j}$ are the cases $k \in N^+_{in}(j)$ and $k \in N^-_{in}(j)$ of Equation (1), respectively, and $N^+_{in}(j)$ and $N^-_{in}(j)$ are the sets of neighbors with positive and negative links into node $j$, respectively.
$S^+_{i,j}$ and $S^-_{i,j}$ are called the positive similarity and the negative similarity, respectively.
Figure 2 shows all the cases of $\sum_{k \in N_{in}(j)} |N_{out}(i) \cap N_{out}(k)|$: panels (a)-(d) are the cases with $k \in N^+_{in}(j)$, and panels (e)-(h) are the cases with $k \in N^-_{in}(j)$. Hence, panels (a)-(d) concern the positive similarity $S^+_{i,j}$, whereas panels (e)-(h) concern the negative similarity $S^-_{i,j}$. By Equation (1), panels (a) and (b) support $S^+_{i,j}$, while panels (c) and (d) count against it; panels (e) and (f) support $S^-_{i,j}$, while panels (g) and (h) count against it. To capture this opposing property, the dissimilarity is also introduced. In Figure 2, the more structures of types (a) and (b) there are, the larger the value of $S^+_{i,j}$, and the more structures of types (c) and (d), the smaller the value of $S^+_{i,j}$. Likewise, the more structures of types (e) and (f), the larger the value of $S^-_{i,j}$, and the more structures of types (g) and (h), the smaller the value of $S^-_{i,j}$.
Analogously to the similarity of nodes $i$ and $j$, the dissimilarity $D_{i,j}$ between nodes $i$ and $j$ is defined by Equations (3) and (4), where $D^+_{i,j}$ and $D^-_{i,j}$ are the cases $k \in N^+_{in}(j)$ and $k \in N^-_{in}(j)$ of Equation (3), respectively. $D^+_{i,j}$ and $D^-_{i,j}$ are called the positive dissimilarity and the negative dissimilarity, respectively.
By Equations (1)-(4), two facts hold if $N_{out}(i) \cap N_{out}(k) \neq \emptyset$, and two other facts hold otherwise, when $N_{out}(i) \cap N_{out}(k) = \emptyset$. In general, $S_{i,j}$ represents the degree of consistency between nodes $i$ and $j$, while $D_{i,j}$ is the degree of inconsistency between them. In real social networks, positively similar nodes tend to have positive relationships, while nodes with large differences between them may have negative relationships.
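The counting that underlies the similarity and dissimilarity can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formula: it reads Figure 2 as eight quadrilateral types obtained from the sign of the link into the target node and from whether the two source nodes agree or disagree on a common out-neighbor, and it returns raw counts without the normalization used in Equations (1)-(4). The function name `quadrilateral_counts` and the four count labels are hypothetical.

```python
from collections import defaultdict


def quadrilateral_counts(edges, i, j):
    """Count quadrilaterals i->h, k->h, k->j around a target edge (i, j).

    Counts are split by the sign of k->j (pos/neg) and by whether i and k
    give the common out-neighbor h the same sign (agree/disagree).
    This grouping is an assumed reading of Figure 2, not a quoted formula.
    """
    out_sign = defaultdict(dict)  # out_sign[u][v] = sign of edge u -> v
    in_nbrs = defaultdict(set)    # in_nbrs[v] = {u : u -> v exists}
    for u, v, s in edges:
        out_sign[u][v] = s
        in_nbrs[v].add(u)

    counts = {"pos_agree": 0, "pos_disagree": 0,
              "neg_agree": 0, "neg_disagree": 0}
    for k in in_nbrs[j]:
        if k == i:
            continue  # the target edge itself is what we want to predict
        side = "pos" if out_sign[k][j] == 1 else "neg"
        # Common out-neighbors of i and k, excluding the target node j.
        common = (out_sign[i].keys() & out_sign[k].keys()) - {j}
        for h in common:
            agree = out_sign[i][h] == out_sign[k][h]
            counts[side + ("_agree" if agree else "_disagree")] += 1
    return counts


edges = [
    ("i", "h1", 1), ("k", "h1", 1), ("k", "j", 1),
    ("i", "h2", -1), ("k", "h2", 1),
    ("m", "j", -1), ("m", "h1", 1),
    ("i", "j", 1),
]
counts = quadrilateral_counts(edges, "i", "j")
```

Under this reading, agreeing structures would raise the corresponding similarity and disagreeing structures would raise the corresponding dissimilarity; how the counts are normalized is left open here.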

Preference and Reputation.
In social networks, the preference and reputation of individuals are influential in the decision to form a connection [25]. The preference, known as optimism or bias in previous studies [26], applies to edge-generating nodes. Some nodes might be more optimistic than others, meaning their attitudes are more likely to be positive. The preference of node $i$ is defined as $P(i) = |N^+_{out}(i)| / (|N^+_{out}(i)| + |N^-_{out}(i)|)$ (7). In Equation (7), $P(i)$ measures the general attitude of node $i$ toward other nodes, and it is also the proportion of positive edges among all edges generated by node $i$. The greater $P(i)$ is, the higher the probability that node $i$ generates another positive edge.
Reputation, also known as prestige or deserve in previous studies [26], applies to edge-receiving nodes. Reputation reflects the popularity of a node in the network. A node with a high reputation tends to receive more positive edges. The reputation of node $j$ is defined as $R(j) = |N^+_{in}(j)| / (|N^+_{in}(j)| + |N^-_{in}(j)|)$ (8). In Equation (8), $R(j)$ measures the general attitude of other nodes toward node $j$, and it is also the proportion of positive edges among all edges received by node $j$. The greater $R(j)$ is, the higher the probability that node $j$ receives another positive edge.
Combining both $P(i)$ and $R(j)$ would enhance the prediction for the pair of nodes $i$ and $j$. Therefore, we calculate the weighted sum $PR_{i,j}$ of $P(i)$ and $R(j)$ as in Equation (9). The coefficients of $P(i)$ and $R(j)$ in Equation (9) sum to 1, which means the equation takes into full consideration not only the preference of node $i$ and the reputation of node $j$, but also the preferential attachment mechanism [2].

The Prediction: SPR-Model.
This section predicts signs using the similarity-dissimilarity feature (denoted $S$-$D$) and the PR value. $S$-$D$ is a local environmental feature that reflects the interaction structures in which the target edge actually participates, while the PR value is a feature of the nodes themselves that reflects empirical estimates based on past behavior. The prediction method takes $S$-$D$ as the decisive factor and the PR value as the auxiliary factor. The model is as follows. Denote $S^+ = S^+_{i,j} + D^-_{i,j}$ as the positive index and $S^- = S^-_{i,j} + D^+_{i,j}$ as the negative index, and let $\delta$ be a given positive real number, $\delta \in [0, 1]$, serving as a threshold on the difference between $S^+$ and $S^-$. $S^+$ reflects the positive tendency between the nodes, while $S^-$ reflects the negative tendency. When the gap between $S^+$ and $S^-$ is large enough, the tendency is regarded as obvious. Therefore, the two cases $S^+ - S^- > \delta$ and $S^+ - S^- < -\delta$ are taken as the positive and negative signs, respectively. Hence, the sign of the link between nodes $i$ and $j$ is assigned by these two cases: $w_{i,j} = 1$ if $S^+ - S^- > \delta$, and $w_{i,j} = -1$ if $S^+ - S^- < -\delta$. In these cases, the sign tendency of $e_{i,j}$ is obvious, so the $S$-$D$ feature is competent for the prediction task. The remaining case, $|S^+ - S^-| \le \delta$, means that the sentiment's tendency is ambiguous, and the $S$-$D$ feature loses its efficacy for prediction. In this case, the value of $PR_{i,j}$ is used instead. Denote the proportion of positive links in the network by $p^+$. Then the sign of the link is assigned as $w_{i,j} = 1$ if $PR_{i,j} \ge p^+$, and $w_{i,j} = -1$ otherwise. In fact, $PR_{i,j} \ge p^+$ means that the positive tendency indicated by the preference and the reputation is at least the overall proportion of positive links, so $w_{i,j} = 1$ is natural to accept; otherwise, $w_{i,j} = -1$. $PR_{i,j} = 1$ means that the links generated by node $i$ and received by node $j$ are all positive, while $PR_{i,j} = 0$ means that they are all negative.

The Pseudo-Code for the SPR-Model.
In the experimental analysis, the input real social network data is an edge list stored as a matrix with $|E|$ rows and 3 columns. Each row is an edge: the first and second columns are the source and target nodes, respectively, and the third column is the observed sign from the source to the target. To compute the SPR-model, a matrix of dimension $|E| \times 21$ is defined. As described above, the first three columns still hold the network link data. The 4th to 11th columns hold, for each edge, the numbers of the eight special quadrilaterals that contain it. The 12th to 15th columns store the positive and negative similarity and dissimilarity values of each edge. The 16th to 18th columns are the values of $P(i)$, $R(j)$ and $PR_{i,j}$ of each edge, respectively. The 19th and 20th columns are the values of $S^+$ and $S^-$ of each edge, respectively. The 21st column is the predicted sign of each edge. Hence, the space complexity is $O(|E| \times 21)$. In addition, space is needed to store the neighbor set of each node.
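As a rough illustration of the preference, the reputation and the two-stage decision rule of the SPR model, consider the sketch below. Several points are assumptions rather than the paper's specification: the weights of the PR value are not recoverable from the text, so `pr_value` uses an equal-weight placeholder; the negative case is taken to be the symmetric condition that the gap falls below minus the threshold; and the default threshold of 0.15 is taken from the value the experiments report. All function and parameter names are hypothetical.

```python
from collections import Counter


def preference_and_reputation(edges):
    """Share of positive edges sent (preference) and received (reputation)."""
    out_pos, out_all, in_pos, in_all = Counter(), Counter(), Counter(), Counter()
    for src, dst, sign in edges:
        out_all[src] += 1
        in_all[dst] += 1
        if sign == 1:
            out_pos[src] += 1
            in_pos[dst] += 1
    preference = {n: out_pos[n] / out_all[n] for n in out_all}
    reputation = {n: in_pos[n] / in_all[n] for n in in_all}
    return preference, reputation


def pr_value(preference_i, reputation_j, weight=0.5):
    """Convex combination of preference and reputation.

    ASSUMPTION: the paper's coefficients sum to 1 but are not recoverable
    here, so an equal weight is used purely as a placeholder.
    """
    return weight * preference_i + (1.0 - weight) * reputation_j


def predict_sign(s_pos, s_neg, pr, p_pos, delta=0.15):
    """Two-stage rule: similarity-dissimilarity first, PR value as fallback.

    s_pos, s_neg: positive and negative indexes of the target edge.
    pr: PR value of the node pair; p_pos: share of positive links overall.
    ASSUMPTION: the negative case is the symmetric condition gap < -delta.
    """
    gap = s_pos - s_neg
    if gap > delta:
        return 1
    if gap < -delta:
        return -1
    # Ambiguous local structure: fall back to the nodal PR value.
    return 1 if pr >= p_pos else -1


edges = [("a", "b", 1), ("a", "c", -1), ("b", "c", 1), ("c", "a", 1)]
preference, reputation = preference_and_reputation(edges)
p_pos = sum(1 for _, _, s in edges if s == 1) / len(edges)
pr_ac = pr_value(preference["a"], reputation["c"])
sign_ac = predict_sign(0.05, 0.02, pr_ac, p_pos)
```

In the toy call, the gap between the two indexes is below the threshold, so the decision falls through to the comparison of the PR value with the overall share of positive links.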

Evaluating Metrics.
Experimental results are presented by three metrics: accuracy, average accuracy and $F_1$-score. The accuracy (acc) is defined as $acc = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TPR \cdot P + TNR \cdot N}{P + N}$ (12), where TP, TN, FP and FN are defined as shown in Table 3.
TPR is the true positive rate, TNR is the true negative rate, P is the number of positive edges, and N is the number of negative edges. Equation (12) shows that the role of negative edge prediction is almost ignored, and the result is almost completely determined by the positive edges, when $N/P \to 0$. Therefore, the average accuracy is defined as $\frac{TPR + TNR}{2}$. Thus, predictors with a higher average accuracy predict both signs at high rates even in skewed datasets, disregarding bias [30]. In addition, since sign prediction is a binary classification task, the $F_1$-score is used to measure the predictive precision and recall rate, and it is calculated as $F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$, where $precision = TP/(TP + FP)$ and $recall = TP/(TP + FN)$. The $F_1$-score is the harmonic mean of precision and recall and thus a trade-off between them.
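The three metrics can be computed from predicted and observed signs as in the sketch below. It assumes the usual confusion-matrix conventions with the positive sign as the positive class, and it takes average accuracy to be the mean of the true positive rate and the true negative rate; the function name `sign_metrics` is a hypothetical choice for this example.

```python
def sign_metrics(y_true, y_pred):
    """Accuracy, average accuracy and F1 for signs in {+1, -1}.

    ASSUMPTION: +1 is the positive class, and average accuracy is the
    mean of the true positive rate and the true negative rate.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)

    acc = (tp + tn) / len(pairs) if pairs else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positive edges
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negative edges
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tpr
    f1 = 2 * precision * tpr / denom if denom else 0.0
    return {"acc": acc, "avg_acc": (tpr + tnr) / 2, "f1": f1}


y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1]
metrics = sign_metrics(y_true, y_pred)
```

On a skewed toy sample like this one, accuracy is pulled toward the positive class, while the average accuracy weights both classes equally, which is the motivation given in the text for reporting it.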

Generalization across
The space complexity of storing the neighbor sets is $O(\langle d \rangle |V|)$, where $\langle d \rangle$ is the average degree of nodes and $|V|$ is the size of the node set of the network. Summarizing the above analysis, the total space complexity is $O(2|E| + \langle d \rangle |V|)$.
For each edge $e_{i,j}$, do the following 5 steps:
Step 1. Compute the neighbor set of each node.
Step 2. Compute the number of special quadrilaterals.
Step 3. Compute the similarity and dissimilarity.
Step 4. Compute the PR value.
Step 5. Predict the sign of each edge $e_{i,j}$, where $p^+$ is the proportion of positive links in the network.
Output: The sign of each edge $e_{i,j}$.
(1) Social balance theory cannot fully explain the mechanism of the formation of signed social networks: although MOI-10 measures the balance of cycles with lengths ≤ 10, its prediction results are still inferior to those of the other algorithms. In addition, the low accuracy of CF also illustrates that the prediction of edge signs should take full account of other features of the network, rather than relying solely on structural balance. (2) Local structure is more indicative of signs than macro structure. In other words, nodes usually generate signed edges based on their local connections; e.g., although HOC-5 learns the features of cycles of lengths 3 to 5, its predictive results are still inferior to those of the other machine learning algorithms. (3) Machine learning cannot effectively capture the key signed structural features when there are too many features to learn, i.e., for the nine scalars of the three datasets Epinions, Slashdot and Wikipedia.
In Table 2, 91.6% of Epinions, 91.1% of Slashdot and 97.7% of Wikipedia are extracted for testing. Table 4 shows the three sub-datasets whose edges are contained in at least one panel of Figure 2.
The performance of the predictive model is displayed in Figures 3(a), 3(c) and 3(e), which demonstrate the following. (1) When predicting based only on the PR value, the accuracies on the three datasets are 85.51%, 78.66% and 75.34%, respectively, whereas when predicting based only on $S$-$D$, the results are 97.57%, 95.31% and 90.20%, improvements of 12.06, 16.65 and 14.86 percentage points, respectively. (2) When using $S$-$D$ as the decisive factor and the PR value as the auxiliary factor, the accuracies on the three datasets are all improved, which supports the soundness of the predictive model.
Since $S$-$D$ is computed from the numbers of quadrilaterals displayed in Figure 2, each dataset is divided into four sub-datasets according to the number of quadrilaterals to test the performance of $S$-$D$, as shown in Figures 3(b), 3(d) and 3(f). For Epinions, the predictive performance does not differ significantly over the four sub-datasets, and the predictive accuracy is always high. This shows that $S$-$D$ is highly robust. For Slashdot and Wikipedia, when the number of quadrilaterals is between 0 and 100, the predictive accuracy is clearly lower than when it exceeds 100. This indicates that these two networks have less data from which to extract features, which is the main reason why the accuracy on these two datasets is not as good as on Epinions. Therefore, the conclusions are threefold. First, the network of Epinions is more mature than those of Slashdot and Wikipedia. Second, the predictive accuracy on Slashdot and Wikipedia increases as more network data become available. Third, predicting with $S$-$D$ is sound.

Comparison of Results.
To further test the prediction performance of the SPR model, it is compared with existing approaches: the logistic regression (LR) proposed by Leskovec et al. [4], the logistic regression based on three attributes (LR-3A) proposed by Yuan et al. [9], the supervised learning based on higher order cycles (HOC) proposed by Chiang et al. [19], the logistic regression based on Bayesian node properties (LR-BNP) proposed by Song et al. [23], the Troll-Trust model based on ranking proposed by Wu et al. [24], the logistic regression based on reputation and optimism (LR-RO) proposed by Shahriari et al. [26], the measures of imbalance (MOI) and the matrix factorization (MF) studied by Chiang et al. [15], the collaborative filtering (CF) introduced by Javari and Jalili [28], and the closed triple micro structure (CTMS) proposed by Khodadadi and Jalili [30]. The comparison results are shown in Table 5. In order to compare the approaches fairly, the experimental data in Table 5 are quoted from the previous studies. Note that in the predictive model $\delta = 0.15$.
Table 5 shows that the accuracy values of the SPR-model on Epinions, Slashdot and Wikipedia are all larger than those of the other 10 approaches. This supports the feasibility and validity of SPR's predicting mechanism based on nodal features. By comparing the accuracy of the 10 approaches, the following conclusions can be drawn: (1) Social balance theory cannot fully
Table 5: The results of accuracy on three networks.
"≈" is the approximation from the reference.

Method             Epinions   Slashdot   Wikipedia
LR [4]             0.9342     0.9351     0.8021
LR-3A [9]          0.9592     0.8892     0.8786
HOC-5 [19]         0.9080     0.8469     0.8605
LR-BNP [23]        0.9313     0.8565     0.8737
Troll-Trust [24]   ≈ 0.96     ≈ 0.90     ≈ 0.89
LR-RO [26]         0.9582     0.9010     0.8880
MOI-10 [15]        0.8497     0.7850     0.8220
MF [15]            0.9448     0.8835     0.8839
CF [30]            0.9282     0.8258     0.8137
CTMS [30]          0
Owing to the skewness of actual datasets, accuracy is basically determined by the positive edges. Therefore, the average accuracy of the SPR model is compared with that of the existing algorithms, as shown in Table 6. In order to compare the approaches fairly, the experimental data in Table 6 are quoted from previous studies. Since some previous studies did not report these results, fewer algorithms are compared in Table 6 than in Table 5. The SPR-model significantly outperforms the others, which supports the soundness and validity of SPR's predictive mechanism. Among the five algorithms in Table 6, LR-RO is still the most competitive, which is consistent with the conclusion from Table 5. However, the average accuracy of the other algorithms is greatly reduced.
This shows that most of the algorithms have defects in predicting negative edges. In addition, SPR's $F_1$-score is also compared with those of the LR-3A and Troll-Trust algorithms, as shown in Table 7; the experimental data show that the predictive model has high predictive precision and recall rate. The comparison with state-of-the-art methods demonstrates that SPR outperforms the others in predicting both positive and negative edges.
For the three algorithms (LR, HOC-5 and LR-BNP), eight scalars are inferior to those of LR-RO. The main reason is that LR-RO learns only two features (reputation and optimism), while the other three algorithms learn many features. (4) The main factor affecting the sign of an edge is the features of its two endpoints, followed by its local features, and finally its global features. Of these 11 algorithms, only Troll-Trust and LR-RO are comparable to SPR in terms of accuracy and robustness. What these three algorithms have in common is that they predict the sign of an edge based on the features of its two endpoints. The above comparative analysis demonstrates that SPR successfully avoids the shortcomings of other algorithms and captures the key signed structural features.
The trends of accuracy and $F_1$-score are basically synchronized, which also shows that these two evaluation metrics are mainly determined by the positive edges; moreover, when $\delta$ is very small ($0 \le \delta \le 0.1$), they reach their optimum. However, the trend of the average accuracy is quite different. As $\delta$ changes, the average accuracy first increases and then decreases, and its optimal value clearly lags behind that of accuracy or $F_1$-score. This is because, when $\delta$ is very small, the edge signs are mainly determined by the $S$-$D$ feature; as $\delta$ increases, a considerable part of the edges are determined by the PR value, which suggests that the PR value is superior to $S$-$D$ in predicting negative edges. Yet, owing to the overwhelming numerical advantage of the positive edges, all three evaluation metrics are reduced when $\delta$ is too large.

Data Availability
The three .txt files, Epinions.txt, Slashdot.txt, and Wikipedia.txt, are the datasets used to support the findings of this study and have been deposited in the Stanford website repository at https://snap.stanford.edu/data/#signnets. The datasets are in the form of adjacency lists and include three arrays: the first is the source node, the second is the target node, and the third is the edge weight or sign. Epinions is a consumer review site; its data includes 131828 nodes and 841372 links. Users can read or comment on a variety of goods and services, and can also evaluate the comments made by other users, that is, mark other users as trustworthy or distrusted. Slashdot is a blog site that allows users to say whether they like or dislike other users' comments; its data contains 82144 nodes and 549202 links. The Wikipedia data is an online voting network where users vote for or against a candidate administrator; it has 7118 nodes and 104359 links.

Discussion and Conclusion
In this paper, the SPR model is proposed to predict the edge signs in large online social networks where interactions can be both positive and negative. The model is easy to understand because it uses only two indexes to measure the interactions between nodes and their local environments.
$S$-$D_{i,j}$ shows the similarity and dissimilarity between nodes $i$ and $j$, which can be refined into positive and negative similarity-dissimilarity. Experimental results on Epinions, Slashdot, and Wikipedia support the soundness and validity of $S$-$D_{i,j}$ in predicting edge signs. The main advantages of the $S$-$D$ index in precisely predicting edge signs are as follows. The first advantage is that $S$-$D$ measures the common attributes of nodal pairs; hence $S$-$D$ is calculated from a highly symmetrical quadrilateral. Since the signs of bidirectional edges basically coincide, as strongly supported by the evidence in Table 8, it is natural to conjecture that the directions of the links in the network should be symmetrical. In fact, the proportions of bidirectional links in Epinions, Slashdot and Wikipedia are 30.55%, 17.39%, and 2.04%, respectively. The reason why Wikipedia has a worse prediction performance might be its low proportion of bidirectional links. The second advantage is that the values of $S$-$D$ keep both social balance theory and status theory valid, or at least skillfully avoid the conflicts between them. For example, in Figure 2(a), the quadrilateral is structurally balanced when $w_{i,j}$ is 1, and $w_{i,j}$ should be the same as $w_{k,j}$ when nodes $i$ and $k$ have similar status.
The third, though perhaps not the last, advantage is that the prediction model makes the best possible use of the existing data to predict the missing signs of links. Previous methods are mostly based on the triangle structure, and triangle data are scarcer in actual data. As shown in Table 2, in Epinions, Slashdot, and Wikipedia there are 11.5%, 39%, and 5.8% fewer triangle data, respectively, compared with the data on which the model is based.
$PR_{i,j}$ displays the tendency of $e_{i,j}$ and is a weighted sum of the preference of $i$ and the reputation of $j$. Nodal preference and reputation are derived from the preferential attachment mechanism, which can be described in signed social networks as follows: nodes with a larger positive/negative outdegree (or indegree) generate a positive/negative edge with larger probability, and nodes with a smaller positive/negative outdegree (or indegree) generate a positive/negative edge with smaller probability, as shown in Figure 5. Experimental results demonstrate that negative edges have obvious features when they are generated. Therefore, it may be more effective to predict edge signs by distinguishing the features of nodal pairs. In this paper, the underlying mechanism that determines the signs of links in large social networks is explored, and the conclusion is that edge signs are mainly determined by the nodes' own or local features, not the global ones. Through experimental analysis, the soundness and validity of the predictive model are verified. In addition, because the features

Figure 1: Similarity diagram. (a) The out node pair. (b) The in node pair.
Datasets. To test the performance of the predictive model, experiments are made on different

Table 1: Pseudo-code of the SPR-model algorithm. Input: Network adjacency matrix.

Figure 4: Experimental results of three real datasets under different $\delta$. (a) Epinions. (b) Slashdot. (c) Wikipedia.

Figure 5: Degree distributions of the three real datasets. (a) Epinions. (b) Slashdot. (c) Wikipedia.
Figure 4 shows the experimental results, plotted as a function of $\delta$. With the change in $\delta$, the trend
measured by the model are extracted from the nodes themselves or from their local structures, the model is very advantageous for large-scale datasets.
Computing the SPR-Model. Step 1 takes $O(|E|)$, where $|E|$ is the size of the edge set $E$. In Step 2, for each edge $e_{i,j}$, the neighbors $h$ and $k$ of $i$ and $j$, respectively, are matched; the time complexity of Step 2 is $O(|E| \langle d \rangle^2)$, where $\langle d \rangle$ is the average degree of nodes. In Step 3, computing the similarity and dissimilarity of each pair of nodes takes $O(|E|)$. In Step 4, computing the PR value of each pair of nodes takes $O(|E|)$. Finally, in Step 5, predicting the sign of each edge also takes $O(|E|)$. Therefore, the total time complexity of predicting the signs of the edges in $G$ is $O(|E| \langle d \rangle^2)$.
In order to verify the efficiency and soundness of the link sign prediction model, experiments on real data are carried out.
Slashdot allows users to say whether they like or dislike other users' comments. The Slashdot data consists of 82144 nodes and 549202 edges, 77.4% of which are positive edges. Wikipedia is an online voting network where users can vote for or against a candidate administrator. The Wikipedia dataset consists of 7118 nodes and 104359 edges, 78.4% of which are positive edges. The details of these three networks are shown in Table 2.
Table 2: Three real signed social networks.
+Edges (-Edges) is the number of positive (negative) edges in the networks. % +Edges is the percentage of positive edges.