Analysis on Influential Functions in the Weighted Software Network

1 College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China
2 The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao, Hebei 066004, China
3 The Key Laboratory for Software Engineering of Hebei Province, Qinhuangdao, Hebei 066004, China
4 Beijing Key Laboratory of Software Security Engineering Technique, Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing 100081, China


Introduction
Accurately measuring the importance of nodes in software networks is a prerequisite for improving the security and robustness of software [1,2]. Moreover, as software evolves, measuring node importance has practical significance for defending and protecting the influential nodes in the software network [3]: if these nodes suffer deliberate attacks, cascading failures may occur [4,5]. Accordingly, how to mine the latent characteristics of software in order to control the evolution of the software structure has become a research hot spot [6][7][8][9].
Many researchers have introduced the idea of complex networks into the field of software structure and abstracted software into a network at different levels of granularity [10]. With the network structure, many latent characteristics can be discovered directly. Ma et al. [11] abstracted the interaction relationships between packages into a software network, defining the functions in a package as nodes and the dependencies among functions as edges. Wang et al. [12] proposed an approach, based on the theory of complex networks, to study the evolution of special software kernel components. They also proposed a generic method to find major structural changes that happened during the evolution of software systems. Li et al. [13] proposed a modular attachment mechanism of software network evolution. Their approach treated an object-oriented software system as a modular network, which was more realistic. A new definition of asymmetric probabilities was given to acquire links in directed networks when new nodes attached to the existing network. With the directed network, both the "scale-free" and the "small-world" properties were verified to be present in the software network. In [14], David proposed a method to reduce the complexity of the software network, with which some valuable characteristics of the network could be obtained easily. The research above shows that complex networks are applicable to software engineering and bring a new perspective to the study of software structure. However, the modeling methods mentioned above were based on the static structure of the source code and neglected the execution characteristics of the running software. For software, most characteristics are exhibited during the execution process.
The characteristics of software execution can help us to understand the software better. The node is an important part of a network and has an enormous influence on the stability, reliability, and robustness of the network [15]. In a software network, the software function plays a critical role in the stability and robustness of the software during the execution process. In the structure of the software, the functions carry most of the feature characteristics and topology information, and they can affect each other. In most cases, the fault of a function is caused not only by the function itself but also by faults propagated from other functions. Recently, the importance of a node in the network has been defined from different aspects. Bhattacharya et al. [2] defined a measure to evaluate the relative importance of the nodes in software by assigning a numerical weight to each node of the software graph. Using the betweenness and the clustering coefficient, Zhang et al. [16] measured the importance of each node to analyze its influence on the entire network. According to the propagation field of the classes, Li et al. [17] put forward an indicator to measure the importance of classes in the software network at the class level. Based on the indegree and the outdegree of each node, Wang and Lü [18] proposed a method to mine the influential nodes, and with it they showed that faults appear with high probability in nodes with a large degree. The research above shows that the node plays a key role in analyzing the network. However, the node was regarded as an individual unit, and the relationship between the node and the entire network was ignored. In practical applications, the network should be considered as a whole, in which the nodes interact with each other.
Considering the above-mentioned shortcomings, the dependency relationships between the function nodes, and the absence of efficient analysis methods, we construct a weighted software network (WSN) that represents the software structure according to the information of multiple executions. Based on the dependency relationships between the function nodes, we present a targeted method, FunctionRank, to evaluate the importance of the software nodes. With the analysis result of each node, we rank the nodes by influence to mine the top-k nodes. These function nodes play an important part in ensuring software reliability and stability, so they deserve more attention during software updating and software maintenance.
The primary contributions of this paper can be summarized as follows:
(i) A novel method is proposed to construct the weighted software network (WSN), which makes the understanding and recognition of software structure more accurate.
(ii) A measurement, node importance (NI), is put forward to evaluate the importance of each node in the network.
(iii) The IC (independent cascade) model is used as an attack model to evaluate the influential functions of a software system.
(iv) The proposed algorithm is an effective method for security measurement of cybernetworks and provides a basis for improving software security and reliability.
The rest of this paper is organized as follows. The construction process of the weighted software network (WSN) is described in Section 2. The node importance of each function node is defined in Section 3. Then, in Section 4, the method FunctionRank is given to mine the most influential nodes. In Section 5, the performance of the proposed algorithm is shown by experiments. Finally, conclusions and future work are presented in Section 6.

Definitions of Weighted Software Network
Complex networks are suitable for showing the invoking relationships between software functions. Based on the information of multiple execution processes, we define the software execution dependency structure as a directed weighted network.
2.1. Software Network. In this section, according to the multiple execution information, we define a software network to represent the software execution dependency structure.
Figure 1 shows a real example of a software network, where each node represents a software function and each edge is the invoking relationship between two functions. In the software network, most of the characteristics are exhibited during the software execution process.

Weighted Software Network.
Next, in order to guarantee the completeness of the experimental data and make the understanding and recognition of software structure more accurate, we define a weighted software network. Compared with the software network, we take the number of executions in which one function invokes another, over multiple execution processes, as the weight of each edge. The weighted software network is suitable for representing the complex invoking relationships between the software functions. Figure 2 shows a weighted software network (WSN), where Node is the set of software functions and Edge is the set of invoking relationships between the software functions. The weight $w_{ij}$ of edge $e_{ij}$ is calculated by the following formula:

$$w_{ij} = \sum_{k=1}^{n} x_k$$

where $n$ is the number of trials with different experiment cases and $x_k$ takes the value 1 or 0: if the calling relationship $e_{ij}$ appears in the $k$-th execution trace, no matter how many times, $x_k$ is 1; otherwise it is 0.

Figure 3 presents a simple example of how the WSN is established. As shown in Figure 3(a), $s_i$ ($1 \le i \le 5$) is the function invoking trace of one execution of the software. The trace $s_i$ contains a series of function calling relationships that reflect the software execution process. Figure 3(b) shows the structure of the resulting WSN: the nodes and edges of the network are the functions and the calling relationships between functions that appear in $s_1 \sim s_5$ in Figure 3(a), and the weight of an edge is the number of the 5 executions in which the calling relationship occurs. How many times a calling relationship occurs within a single execution is ignored.
Using multiple executions of the software under different experimental cases guarantees the completeness of the experimental data. The function nodes that appear during the multiple execution processes form the node set of the network, and the calling relationships between the software functions form the edge set. The weight of an edge is the number of execution traces, among the $n$ traces of the software, in which the edge appears; how many times it appears within a single execution trace is ignored. In this way, the WSN is built.
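The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name build_wsn, the representation of a trace as a list of (caller, callee) pairs, and the sample function names are all assumptions made for the example.

```python
from collections import Counter


def build_wsn(traces):
    """Return edge weights of a weighted software network.

    Each trace is a list of (caller, callee) pairs recorded in one execution.
    The weight of an edge is the number of traces in which it appears at
    least once; repeated calls inside a single trace are ignored.
    """
    weights = Counter()
    for trace in traces:
        for edge in set(trace):  # set() collapses repeats within one trace
            weights[edge] += 1
    return dict(weights)


# Hypothetical traces from three executions (names are illustrative only).
traces = [
    [("main", "parse"), ("parse", "read"), ("parse", "read")],
    [("main", "parse"), ("main", "report")],
    [("main", "parse"), ("parse", "read")],
]
wsn = build_wsn(traces)
```

In this example the edge (parse, read) is called twice in the first trace but still contributes only 1 for that trace, so its weight over the three executions is 2.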

Node Importance
According to the complex invoking relationships of a software system, we show the most common topology structures of the weighted software network in Figure 4 to explain the importance of a function node. Figure 4(b) shows a terminal node.
Definition 4 (LTN (loop terminal nodes)). Nodes whose only outlink points to themselves are defined as loop terminal nodes.
As shown in Figure 4(c), the node has an outlink only to itself, so it is a loop terminal node.
Definition 5 (OD (output degree)). The sum of the weights of the edges from a node $v$ to its outdegree nodes, OD($v$), is called the output degree of node $v$.
In Figure 4(a), the weights of the edges from the example node to its outdegree nodes are 2, 2, and 5. Its output degree is the sum of these weights, 2 + 2 + 5 = 9.
Definition 6 (WC (weighted contribution)). The ratio of the weight of the edge from node $v_j$ to node $v_i$ to the output degree of $v_j$, WC($v_j$, $v_i$) $= w_{ji} / \mathrm{OD}(v_j)$, is the weighted contribution of $v_j$ to $v_i$.
In Figure 4(a), the weight of the edge considered is 2, so the weighted contribution is 2 divided by the output degree of the calling node. Based on the above definitions, the node importance (NI) of a node $v_i$ is defined. The definition uses a fixed probability with which a loop terminal node calls a random node, every node being equally likely to be invoked; this probability is set to 0.15 on the basis of experimental verification.
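Definitions 5 and 6 can be checked with a short Python sketch. The node names a, b, c, and d are hypothetical, and the example assumes that the edge of weight 2 leaves the same node whose outgoing weights are 2, 2, and 5; neither detail is taken from Figure 4.

```python
def output_degree(weights, v):
    """OD(v): sum of the weights of all edges leaving node v."""
    return sum(w for (src, _), w in weights.items() if src == v)


def weighted_contribution(weights, src, dst):
    """WC of src to dst: weight of edge src -> dst divided by OD(src)."""
    od = output_degree(weights, src)
    if od == 0:  # a node without outgoing edges contributes nothing
        return 0.0
    return weights.get((src, dst), 0) / od


# Hypothetical node "a" with outgoing edge weights 2, 2, and 5.
weights = {("a", "b"): 2, ("a", "c"): 2, ("a", "d"): 5}
od_a = output_degree(weights, "a")                 # 2 + 2 + 5 = 9
wc_a_b = weighted_contribution(weights, "a", "b")  # 2 / 9
```

Under these assumptions the weighted contributions of a node to all of its outdegree nodes sum to 1, which is what lets NI be distributed along outgoing edges in proportion to the edge weights.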

Important Nodes Mining
In this section, we first provide an algorithm, outdegree nodes, that builds the outdegree node list of every node in the software network, and then another algorithm, FunctionRank, that calculates the NI of each node. In FunctionRank, the importance of the nodes is evaluated iteratively (see Algorithms 1 and 2).
As shown in algorithm outdegree nodes, for each node in the node set we traverse the edges in the edge set in lines (1) and (2). We call the two endpoints of an edge its start node and its end node. In lines (3) and (4), the end node of an edge is added to the childStr of node $v$ when $v$ equals the start node of that edge. Finally, the childStr of $v$ is printed in line (6).
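The outdegree node list can be sketched in Python as follows. The sketch is an assumption-laden illustration: it makes a single pass over the edges instead of the nested node-by-edge traversal described above, it returns a dictionary instead of printing, and the names outdegree_nodes and children are hypothetical (children plays the role of childStr).

```python
def outdegree_nodes(nodes, edges):
    """Map every node to the list of nodes it calls directly."""
    children = {v: [] for v in nodes}
    for start, end in edges:
        # Add the end node to the list of its start node, skipping repeats.
        if start in children and end not in children[start]:
            children[start].append(end)
    return children


# Hypothetical network; the duplicate edge is listed once in the result.
nodes = ["main", "parse", "read", "report"]
edges = [("main", "parse"), ("parse", "read"),
         ("main", "report"), ("main", "parse")]
children = outdegree_nodes(nodes, edges)
```

Nodes whose list is empty, such as read and report here, are the terminal nodes of the network.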
We evaluate the importance of each node in the network by an iterative process, as shown in Algorithm 2. In line (1), we initialise NI($v_i$) as the importance of each node, together with the fixed probability of calling a random node. Lines (2) to (19) form the iterative process, which computes the importance contributed by the outdegree of the current node and by the other nodes that call the current node. The computational formula of node importance (NI) is given in line (18); the larger the NI value of a node, the higher its influence. The iteration stops, and the final NI of every node is obtained, when the difference between the current and the previous NI value is less than a given threshold for all nodes.
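The shape of such an iteration can be illustrated with a weighted PageRank-style update. The sketch below is not the FunctionRank formula of Algorithm 2, which also accounts for the outdegree of the current node. It is a swapped-in simplification that assumes NI(v) = d + (1 - d) * sum of WC(u, v) * NI(u) over the callers u of v, with d = 0.15. The function name node_importance and the parameters tol and max_iter are hypothetical.

```python
def node_importance(weights, d=0.15, tol=1e-8, max_iter=1000):
    """Iterate a PageRank-style importance score until it stabilises.

    weights maps (caller, callee) edges to positive weights. The update
    rule is an assumption for illustration, not the paper's NI formula.
    """
    nodes = {v for edge in weights for v in edge}
    od = {v: 0.0 for v in nodes}
    for (src, _), w in weights.items():
        od[src] += w  # output degree of every calling node

    ni = {v: 1.0 for v in nodes}  # initial importance
    for _ in range(max_iter):
        new = {v: d for v in nodes}
        for (src, dst), w in weights.items():
            # The caller passes on importance in proportion to WC(src, dst).
            new[dst] += (1 - d) * (w / od[src]) * ni[src]
        err = max(abs(new[v] - ni[v]) for v in nodes)
        ni = new
        if err < tol:  # stop when every node changed by less than tol
            break
    return ni


# Hypothetical weighted software network.
weights = {("main", "parse"): 3, ("parse", "read"): 2, ("main", "report"): 1}
ni = node_importance(weights)
```

Because the weighted contributions of each caller sum to 1 and are scaled by 1 - d, every update is a contraction, so the stopping test on the largest per-node change is eventually met under this assumed rule.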
With the measuring results obtained from Algorithm 2, we choose the top-k nodes as the influential nodes of the software network. Algorithm 3 illustrates the process of selecting the top-k nodes (KN).
In Algorithm 3, we initialise list as the measurement list for all the nodes in line (1). The process starting at line (2) stores the NI value of each node in the list. The sorting process is given in lines (5) and (6), and the top-k nodes are chosen from the list in line (7).
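The selection step can be sketched in Python as below. The function name top_k_nodes, the tie-breaking by node name, and the sample NI values are assumptions for illustration and are not part of Algorithm 3.

```python
def top_k_nodes(ni, k):
    """Return the k node names with the largest NI values.

    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(ni.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical NI values for five functions.
ni_values = {"f1": 0.52, "f2": 3.10, "f3": 0.48, "f4": 1.25, "f5": 0.52}
top3 = top_k_nodes(ni_values, 3)
```

Sorting all nodes matches the description above; for very large networks a partial selection such as heapq.nlargest would avoid the full sort.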

Experimental Analysis
A series of experiments was conducted to compare the performance of the proposed algorithm (named FunctionRank) with different parameter values. The algorithms were implemented in JDK1.6.0 and executed on a PC with a 3.30 GHz CPU and 5 GB of memory.

Experimental Datasets.
Firstly, several dynamic software datasets are used to evaluate the performance of the algorithms. The software is obtained from the open-source community and is written in C or C++; it includes the programs tar and cflow.
In the experiment, we chose different versions of tar and cflow. tar is an archiving utility for Linux, and cflow is an analysis tool that extracts the function call relationships of C programs (both downloaded from the open-source software library https://sourceforge.net).

Evaluation on the FunctionRank.
We run the algorithm on each version of tar and cflow. With the algorithm FunctionRank, we calculate the NI of each function node. Here we mine the top-10 nodes in each version of tar and cflow. The results are shown in Tables 1 and 2, respectively.
As shown in Table 1, for versions tar-1.21 and tar-1.23 the NI values of the top-10 nodes are almost the same. The reason is that these versions differ only in the number of function calls; in other words, the component functions did not change between them. In the latest three versions, developers changed the logic of some functions or inserted new functions to enrich the features of the software; on the other hand, the software was simplified or some features were removed to improve robustness, which results in ranking variation. For example, node gnu_flush_read ranked 2nd or 3rd in the earlier versions but 7th or 8th in versions tar-1.25, tar-1.27, and tar-1.28. Table 2 shows the top-10 influential functions of cflow in different versions. The ranking of some functions varies across the versions of cflow, but only within a narrow range. For example, the ranking of function print_symbol ranges from 1 to 2, so we can predict that it will probably remain more influential than most other functions in the next version. Meanwhile, function alloc_cons is absent from the latest versions, cflow-1.3 and cflow-1.4, which results in ranking variation; in other words, the component functions changed in these two versions.
In addition, the number of nodes with a high NI is rather small in each version. These high-value nodes play a great part in ensuring software reliability and stability, which means that only a few functions need special attention during software updating and software maintenance. We count the nodes in different ranges of NI values. The results for tar and cflow are shown in Figures 5 and 6, respectively.
As we can see in Figure 5, most of the nodes are ordinary functions that do not require special attention. Meanwhile, a handful of nodes with a high NI deserve more attention, since they play important roles in software updating and software maintenance. For cflow, the number of nodes in each scope is shown in Figure 6. It has the same characteristic as tar: the number of nodes with a high NI is much smaller than the number with a low NI. By paying more attention to these influential nodes in future versions, we can improve software reliability and stability while greatly reducing the amount of work and improving work efficiency.
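Counting nodes per NI range, as done for Figures 5 and 6, amounts to a simple histogram. The following Python sketch assumes a fixed bin width of 0.5 and made-up NI values; the bin boundaries actually used in the figures are not stated in the text.

```python
from collections import Counter


def count_by_range(ni_values, width=0.5):
    """Count how many NI values fall into each half-open range [lo, hi)."""
    counts = Counter(int(v // width) for v in ni_values)
    return {(b * width, (b + 1) * width): counts[b] for b in sorted(counts)}


# Hypothetical NI values: many low scores and a single high one.
scores = [0.2, 0.4, 0.45, 0.9, 3.1]
histogram = count_by_range(scores)
```

With these values, three of the five nodes fall into the lowest range and only one into a high range, the same skewed shape described above.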
At the same time, the NI of nodes with the same ranking fluctuates only slightly across different versions, as shown in Figures 7 and 8.
As shown in Figure 7, the NI distributions of tar are extremely similar across the six versions. As the rank number increases, the NI of the nodes decreases. The NI at a given rank remains stable, and the NI distribution is nearly the same for different software versions, so the trend of future versions can be predicted on this basis. Meanwhile, Figure 8 shows the NI distribution of cflow: the higher NI values range from 0.8 to 4.0, and most nodes have values around 0.5. The curves of all versions have the same tendency; that is, the NI distribution of cflow follows the same trend in every version.

Performance Evaluation.
In the study of complex networks, the effectiveness of a method is often examined [19,20] by analyzing the spreading influence of the top-k nodes. Therefore, this paper introduces the IC (independent cascade) model. The IC model is derived from the SIR (Susceptible-Infected-Recovered) model, a theory of virus spreading that has been widely studied in complex networks, with applications such as marketing, advertising, early warning, and social stability. In software engineering, similar algorithms have been used to analyze change impact [21] and error propagation [22].
The IC model is a probability model: when a node $v$ is activated, it attempts to activate each of its inactive outdegree nodes with a fixed probability, only once [23]. Whether or not $v$ succeeds in activating its neighbor nodes, $v$ remains active but has no further influence. The propagation process ends when there are no influential active nodes left in the network. In the actual execution of software, a running fault can affect the execution of other functions because of the invoking relationships: when a fault occurs, any invoked function can affect the normal execution of its parent function, so faults can spread widely among the function nodes during the running process. We therefore take the IC model as a software attack model to evaluate the effectiveness of our method. A software attack instance is shown in Figure 9.
We assume that nodes a and d are attacked, as Figure 9(a) shows. Then a and d each attack their inactive outdegree nodes with the fixed probability, only once. In Figure 9(b), some of these outdegree nodes, node h among them, have been attacked successfully. After that, a and d have no further aggressivity, and the nodes attacked by a and d attack their own inactive outdegree nodes with the same probability in the same way. Finally, the number of attacked nodes represents the influence of the originally attacked nodes.
When calculating the influence of the top-k important nodes obtained by different methods, we run the IC model 10 times for each method and take the average number of active nodes as the performance measure of that method.
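The evaluation procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the network is given as a mapping from each node to its outdegree nodes, the activation probability is passed as a parameter named p, and the function names, the seeded random generator, and the small example network are all hypothetical.

```python
import random


def independent_cascade(children, seeds, p, rng):
    """Run one IC simulation and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)  # nodes that still get their single attempt
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in children.get(node, []):
                # Each newly active node tries each inactive child once.
                if child not in active and rng.random() < p:
                    active.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return len(active)


def average_spread(children, seeds, p, runs=10, seed=0):
    """Average number of activated nodes over several IC runs."""
    rng = random.Random(seed)
    total = sum(independent_cascade(children, seeds, p, rng)
                for _ in range(runs))
    return total / runs


# Hypothetical network: a calls b and c, b calls d.
children = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
spread_certain = average_spread(children, ["a"], p=1.0)
spread_none = average_spread(children, ["a"], p=0.0)
```

With p = 1.0 every node reachable from the seed is activated, and with p = 0.0 only the seed itself is counted, which gives two easy sanity checks before trying intermediate probabilities.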
The key entities of software typically account for a small proportion of the whole; in studies at the class level they account for only 1.5% to 2% [24]. At the same time, the cost of checking a large number of key entities is not acceptable, so an appropriate number of key entities needs to be selected. By ranking all functions in descending order of the measurements, we chose a small number of key functions for the different systems: the top 20 for tar and the top 30 for cflow.
Figure 10 shows the average number of active nodes for different software versions. In all versions of the software systems, the key functions identified by NI activate more nodes than those identified by PageRank and MKN [25], as Figure 10 shows. Compared with these two methods, NI is clearly more effective in identifying the key functions. The key functions play an important role in a software system in terms of reducing the amount of test data, detecting vulnerabilities of the software structure, and analyzing software reliability, and they deserve more attention during software updating and software maintenance. Accurately measuring the importance of nodes in software networks is a prerequisite for improving the security and robustness of software. Moreover, as software evolves, measuring node importance has practical significance for protecting the influential nodes in the software network from deliberate attacks.

Conclusions and Future Work
In order to understand and recognize software structure better, this paper proposes a novel method to mine the influential nodes in a weighted software network. Firstly, taking the invoking times into account, we construct a directed weighted network structure to make the understanding and recognition of software structure more accurate. Then, a measurement, NI, is put forward to evaluate node importance, bringing the ideas of PageRank and the WSN into the software engineering domain. Furthermore, we consider the outdegree value, which reflects the complexity of a node, as a key parameter of node importance. Finally, the algorithm FunctionRank is presented to calculate the NI, and the trends of node importance are analyzed across different software versions. The experimental results show that the proposed approach is feasible and performs well in identifying the influential software nodes.
Although the proposed approach shows some feasibility in identifying influential nodes in complex software networks, its broad validity should be demonstrated further. In future work, we will use more open-source software networks to evaluate the validity of our approach and improve it.

Definition 1 (IN (indegree nodes)). For a node $v_i$, IN($v_i$) is the set of functions that call node $v_i$ directly, that is, in only one call step. In Figure 4(a), the IN of the example node contains two nodes. The influence of node $v_i$ is based on the nodes in IN($v_i$), which call $v_i$ directly.
Definition 2 (ON (outdegree nodes)). For a node $v_i$, ON($v_i$) is the set of functions that are called by node $v_i$ directly. The number of nodes in ON($v_i$) is the outdegree of $v_i$, CO($v_i$). In Figure 4(a), the ON of the example node contains three nodes, so its CO is 3.
Definition 3 (TN (terminal nodes)). Nodes that have no outdegree and make no contribution to the influence of other nodes are defined as terminal nodes.

Figure 5: Number of nodes in each NI value scope of tar.