Mining the Key Nodes from Software Network Based on Fault Accumulation and Propagation

The growth of software complexity directly increases the number of software faults and makes software development and maintenance costly. The complex network model is used to study the accumulation and propagation of faults in complex software as a whole. Key nodes with high fault probability and powerful fault propagation capability can then be found, so that faults can be discovered as early as possible and the severity of the damage to the system can be reduced effectively. In this paper, the algorithm MFS AN (mining fault severity of all nodes) is proposed to mine the key nodes from a software network. A weighted software network model is built by using functions as nodes, call relationships as edges, and call times as weights. Using a recursive method, a fault probability metric FP of a function is defined according to the fault accumulation characteristic, and a fault propagation capability metric FPC of a function is proposed according to the fault propagation characteristic. Based on FP and FPC, the fault severity metric FS is put forward to obtain the function nodes with larger fault severity in the software network. Experimental results on two real software networks show that the algorithm MFS AN can discover the key function nodes correctly and effectively.


Introduction
With the development of computer technology and the expansion of software applications [1], the scale and complexity of software systems increase continuously. Software faults directly raise the system failure ratio, and system reliability is becoming more and more difficult to guarantee. In the test and maintenance process, developers cannot deal with software problems with a clear purpose [2]. Therefore, if potentially useful information can be extracted from the software source code or the dynamic execution process to help software workers understand the structural characteristics of software quickly, it will be of great significance for improving the efficiency of software development and maintenance [3][4][5]. Influenced by the achievements in the complex network field, some researchers regard a software system as a software network. This provides a novel research idea and platform for better understanding and measuring the internal topology of complex software systems, and it has received great attention.
Complex network theory has been introduced into software engineering by using network models to represent the structural characteristics of a software system, and researchers have found many novel structural features from different points of view [6,7]. Valverde et al. [8] apply complex networks to construct the topology of software and, for the first time, propose a method to model software as an undirected network. In this method, a node represents a software class and an edge represents a call relationship between classes. Through experiments, they find the "scale-free" and "small-world" properties in software networks. Myers et al. [9] use a directed network to represent the collaboration relationships among the classes of software. They find that the in-degree and out-degree distributions of the network obey power-law distributions with different exponents. Pan et al. [10] adopt a binary software network to represent class units or package units and their dependence relationships in a software system, together with a community detection method that finds the best module partition of the software system. The optimal module partition is compared with the real module partition of the system to guide the optimization of the module design when a software version is updated. Thung et al. [11] propose a new method to reduce the complexity of a software network. They measure the importance of classes by different properties (betweenness, closeness, etc.) and then condense the class network so that it contains only the important classes. With this method, they are able to depict the overall design of the software and make the design model easy to understand. Mohammed et al. [12] construct a mapping of the research to identify the software security techniques used in the software development process, which enables software developers to better understand the existing software security approaches. Thus, software security problems urgently need to be solved.
Accurately measuring the importance of nodes in a software network is the premise for improving the security and reliability of software [13]. In a software network, a few key nodes have an important effect on the overall stability, reliability, and robustness of the system [14], for example through cascading failure propagation. Faults in these nodes can result in partial or total system crashes with irreversible consequences. Identifying these nodes and giving them special protection helps to prevent system crashes caused by deliberate attacks. Researchers have defined the importance of nodes in software networks from different aspects. Freeman [15] utilizes betweenness to measure node importance and points out that a node is more important in a software network if its betweenness is bigger. Callaw et al. [16] consider that a node is more important in a software network if its degree is bigger, because a node with bigger degree connects with more nodes. However, this measure does not consider the overall structure of a software network and therefore has some limitations. Kitsak [17] makes it clear that the location of a node in the network has a great impact on its importance and exploits the k-shell decomposition method to measure node importance. The resulting metric is shown to be better than betweenness and degree centrality. Turnu et al. [18] measure the quality of software by analysing the degree distribution of nodes in a software network. They define the structure entropy to describe the degree distribution of nodes and show that the statistics of the structure entropy in a software network are related to the number of software bugs. This further indicates that there is a relationship between the structural characteristics of a software network and the quality of the software. Wang et al. [19] define the influential nodes in a network by studying the weighted software network at the function level. They analyse the relationship between the statistical characteristics of the software network and the influential nodes through experiments. Bhattacharya et al. [20] predict software evolution based on static graph topology analysis and propose the NodeRank value to measure the importance of a node. The fault of a function is not only caused by itself but also affected by other functions. Huang et al. [21] define the importance of a node based on the dependence relationships and the information propagation among functions. Their algorithm MIN can effectively mine the influential nodes in a software network, but its assignment of the information propagation probability is somewhat subjective.
In complex networks, the random walk model judges the importance of a node by considering its own connectivity degree and the importance of its neighbouring nodes. Typical methods are PageRank, NodeRank [20], and so on. In software networks, the CK metric set proposed by Chidamber and Kemerer [22] indicates that the number of classes coupled to a given class, named CBO, can affect the propensity of the class node to contain defects. The larger the CBO of a class, the more sensitive it is to changes in other parts of the system, and the harder it is for software workers to maintain. Ren et al. [23] believe that the more modules, classes, or functions that directly or indirectly depend on a function node, the greater the cost of constructing it and its probability of error. Ren et al. [24] also believe that when a function node acts as a calling function, it may accumulate the defects of the called function node with a certain probability, and when it acts as a called function, it may propagate its defects to its caller with a certain probability. Based on the random walk model and combined with the directed weighted feature of software networks, the following FP and FPC are proposed.
In summary, this paper focuses on the call dependence relationships among functions and on the fault accumulation and propagation in the dynamic execution process. Firstly, according to the dynamic execution information of software, we build a weighted software network model. Then, using a recursive method, the fault probability metric FP of a function is defined in accordance with the fault accumulation characteristic, and the fault propagation capability metric FPC of a function is proposed on the basis of the fault propagation characteristic. Finally, by combining FP and FPC, the fault severity metric FS is put forward, and the algorithm MFS AN (mining fault severity of all nodes) is proposed to calculate FS and obtain the function nodes with larger fault severity in the software network.
The rest of this paper is organized as follows. Section 2 describes the process of mining key nodes in a software network based on fault accumulation and propagation. The experimental results are given in Section 3. Conclusions and future work are presented in Section 4.

Mining Key Nodes Based on Fault Accumulation and Propagation
A software system is modeled as a weighted function execution network WFEN = (NSet, ESet, Weight), where NSet is the function node set of the software network, ESet is the edge set given by the function call relationships during software execution, and Weight denotes the number of times that one function calls another. While the software system runs, a function can be both a calling function and a called function. During the execution of a function node u, the nodes called directly by u are the direct out-degree neighbor nodes of u, and the set of these nodes is called the Direct Out-degree Neighbor Set (DONS). Similarly, the set of in-degree neighbor nodes that call node u directly is called the Direct In-degree Neighbor Set (DINS).
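The network model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the call trace `CALLS`, the helper name `build_wfen`, and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical call trace: (caller, callee, number of calls observed).
CALLS = [
    ("A", "B", 2), ("A", "C", 1),
    ("B", "D", 1), ("B", "E", 3),
    ("C", "E", 1), ("C", "F", 2),
]

def build_wfen(calls):
    """Build (NSet, Weight, DONS, DINS) from a list of call records."""
    nset = set()
    weight = defaultdict(int)
    for caller, callee, times in calls:
        nset.update((caller, callee))
        weight[(caller, callee)] += times  # repeated records accumulate
    dons = {u: set() for u in nset}  # functions that u calls directly
    dins = {u: set() for u in nset}  # functions that call u directly
    for caller, callee in weight:
        dons[caller].add(callee)
        dins[callee].add(caller)
    return nset, dict(weight), dons, dins

nset, weight, dons, dins = build_wfen(CALLS)
print(sorted(dons["B"]))  # ['D', 'E']
print(sorted(dins["E"]))  # ['B', 'C']
```

The edge set ESet is simply the key set of `weight`, so it is not stored separately in this sketch.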
2.2. The Fault Probability. Figure 2 shows three common topology structures of software networks. By analyzing these three topologies, we study the fault accumulation characteristics of functions in a software network and obtain the fault probability of a function. Nodes B1 and C1 have call function sets of the same size, that is, the same number of call nodes, but the call relationships among these nodes are different. Nodes B1 and D1 have the same number of execution routes, but node D1 has a larger call function set and a more complex execution process. Therefore, the influences of nodes B1, C1, and D1 on node A are different.
From the structures shown in Figure 2, we can learn that a function fault is caused not only by the function itself but also by the functions it calls. Moreover, each node in the DONS of the objective function has a different influence on the objective function. For this reason, we define the fault probability metric FP of each node as the accumulation of the infection coming from its call nodes. Based on the call relationships among the functions in DONS, the computational formula of FP is given as follows.

Security and Communication Networks
Definition 4 (FP (the fault probability of a node)).

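$$FP(u) = \alpha + \sum_{i=1}^{n} P_{u \to v_i} \cdot FP(v_i), \qquad P_{u \to v_i} = \frac{W(u, v_i)}{\sum_{j=1}^{m} W(w_j, v_i)}$$
where $\alpha$ is the fault probability of $u$ caused by itself ($0 \le \alpha \le 1$), $v_i$ is a node in $DONS(u)$, $n$ is the size of $DONS(u)$, $P_{u \to v_i}$ is the probability that $u$ is infected by its direct neighbor $v_i$, $W(x, y)$ is the weight of the edge from $x$ to $y$, $w_j$ is a node in $DINS(v_i)$, and $m$ is the size of $DINS(v_i)$.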
Example 5. Figure 3 is a simple weighted function execution network.
In Figure 3, an example shows how to calculate the FP of a function node. In real software, functions differ in size and definition, so the fault probability of each function is also different, but setting specific fault values for every function is complicated. The main work of this paper is to show the correlation of faults rather than the fault calculation method of a node itself. For generality, we therefore set the fault probability of every function node itself to the same value, α = 0.5. Nodes E and F are leaf nodes of the software network, so FP(E) = FP(F) = α = 0.5; the fault probabilities of the other nodes follow recursively from the definition of FP. Through the calculation of FP, the fault probability of each function node is identified. According to the calculation results, the fault probabilities of the function nodes in Figure 3 are ordered as A > B > C > D > E = F.
The fault probability of a node (FP) in Definition 4 is not a probability in the strict sense. It is a metric of a node: the higher the FP of a node, the more likely the node is to have faults. Because FP measures the recursive weighted out-degree of node u, it can take arbitrarily large values, which reflects the process of fault accumulation in a software system. Therefore, the accumulated value may exceed 1; it expresses a likelihood and is not bounded by 1.
Following the above analysis, based on the fault accumulation characteristics of a function and the recursive method, we use formula (4) to calculate the fault probability FP of each function in the software network. The algorithm MFP AN (mining fault probability of all nodes) is then proposed to obtain the FP of all nodes in the software network.
Algorithm 6 shows the process of the method MFP AN. Line (1) initializes an FPList that stores the information and the fault probability FP of all function nodes. Lines (2-9) present a loop that calls the procedure MFP to calculate and store the FP of every function node. The procedure MFP computes the FP of a node recursively. In line (1), it defines and initializes the related variables. Lines (2-21) describe how the influence of the out-degree neighbor nodes on the current node is computed recursively to obtain the FP of the target node.
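The recursive FP computation can be illustrated with a short Python sketch. It is not the authors' Algorithm 6: the example network `EDGES`, the function name `fault_probabilities`, and the memoization are assumptions, and the infection probability is assumed to be the call count from u to v divided by the total number of calls that v receives. The recursion terminates only on a network without loops.

```python
from functools import lru_cache

# Hypothetical weighted call graph: (caller, callee) -> number of calls.
EDGES = {
    ("A", "B"): 2, ("A", "C"): 1,
    ("B", "D"): 1, ("B", "E"): 3,
    ("C", "E"): 1, ("C", "F"): 2,
}
ALPHA = 0.5  # self-induced fault probability, identical for every node

def fault_probabilities(edges, alpha=ALPHA):
    """Return {node: FP(node)} for an acyclic weighted call graph."""
    dons = {}       # node -> list of directly called nodes
    in_weight = {}  # node -> total weight of incoming calls
    for (u, v), w in edges.items():
        dons.setdefault(u, []).append(v)
        dons.setdefault(v, [])
        in_weight[v] = in_weight.get(v, 0) + w

    @lru_cache(maxsize=None)  # each FP value is computed only once
    def fp(u):
        # Assumed normalization: share of v's incoming calls that come from u.
        return alpha + sum(edges[(u, v)] / in_weight[v] * fp(v) for v in dons[u])

    return {u: fp(u) for u in dons}

FP = fault_probabilities(EDGES)
print(FP["A"], FP["B"], FP["E"])  # 3.0 1.375 0.5
```

In this toy network the root A accumulates the largest value and the leaves keep the base value α, which matches the qualitative behaviour described for FP. For very deep call chains an iterative traversal in reverse topological order would avoid Python's recursion limit.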
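2.3. The Fault Propagation Capability. By analogy with FP, the fault propagation capability FPC of a node accumulates recursively over the functions that call it:
$$FPC(u) = \frac{ID_u}{ID_{max}} + \sum_{i=1}^{n} P_{v_i \to u} \cdot FPC(v_i), \qquad P_{v_i \to u} = \frac{W(v_i, u)}{\sum_{j=1}^{m} W(v_i, w_j)}$$
where $ID_u$ is the in-degree of a node $u$, $ID_{max}$ is the maximum in-degree in the network, $ID_u / ID_{max}$ denotes the fault propagation capability of the objective function itself, $v_i$ is a node in $DINS(u)$, $n$ is the size of $DINS(u)$, $P_{v_i \to u}$ is the probability that $u$ is called by the function $v_i$ in $DINS(u)$, $W(x, y)$ is the weight of the edge from $x$ to $y$, $w_j$ is a node in $DONS(v_i)$, and $m$ is the size of $DONS(v_i)$.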
Through the calculation of FPC, the fault propagation capability of each function node is identified. According to the calculation results, the fault propagation capabilities of the function nodes in Figure 3 are ordered as F > E > C > D > B > A. Similarly, based on the fault propagation characteristics of a function and the recursive method, we use formula (8) to calculate the fault propagation capability FPC of each function in the software network. The algorithm MFPC AN (mining fault propagation capability of all nodes) is then proposed to obtain the FPC of all nodes in the software network. Algorithm 9 is similar to Algorithm 6.
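A matching sketch for FPC is given here. It is again an illustration under assumptions, not the authors' Algorithm 9: the network `EDGES` and the name `fault_propagation_capabilities` are hypothetical, the in-degree is counted as the number of distinct callers, and the calling probability is assumed to be the call count from v to u divided by the total number of calls that v makes.

```python
from functools import lru_cache

# Hypothetical weighted call graph: (caller, callee) -> number of calls.
EDGES = {
    ("A", "B"): 2, ("A", "C"): 1,
    ("B", "D"): 1, ("B", "E"): 3,
    ("C", "E"): 1, ("C", "F"): 2,
}

def fault_propagation_capabilities(edges):
    """Return {node: FPC(node)} for an acyclic weighted call graph."""
    nodes = set()
    dins = {}        # node -> list of direct callers
    out_weight = {}  # node -> total weight of outgoing calls
    for (u, v), w in edges.items():
        nodes.update((u, v))
        dins.setdefault(v, []).append(u)
        out_weight[u] = out_weight.get(u, 0) + w
    id_max = max(len(dins.get(u, [])) for u in nodes)

    @lru_cache(maxsize=None)
    def fpc(u):
        callers = dins.get(u, [])
        own = len(callers) / id_max  # ID_u / ID_max
        # Assumed normalization: share of v's outgoing calls that go to u.
        return own + sum(edges[(v, u)] / out_weight[v] * fpc(v) for v in callers)

    return {u: fpc(u) for u in nodes}

FPC = fault_propagation_capabilities(EDGES)
print(FPC["A"], FPC["B"], FPC["D"])  # 0.0 0.5 0.625
```

The direction of accumulation is reversed with respect to FP: the root A, which nobody calls, gets 0, while frequently called nodes deep in the call graph receive the largest values.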

The Fault Severity.
Some researchers believe that the type of a fault determines its propagation behaviour [25,26]. That is, different faults in the same software follow different laws of propagation. This line of research focuses on the characteristics of the faults themselves. Other researchers believe that the system architecture determines the behaviour of fault propagation [27]. That is, the same fault in different architectures can evolve into system failures of different types or different severity levels. This view is based on the analysis of the system structure and focuses on the regularity of fault propagation in the architecture. This paper mainly studies the latter. Therefore, this article first calculates the fault probability FP and the fault propagation capability FPC of a function node separately. Both are then taken into account for the function node: supposing that it fails, the possible fault severity FS to the software system is calculated. On this premise, we study the function fault characteristics of a software system based on its architecture.
In Sections 2.2 and 2.3, the fault probability and the fault propagation capability of a function node have been studied, respectively. The former is measured from the out-degree neighbours of a function node, that is, it describes how the node is affected by others. The latter is measured from the in-degree neighbours of a function node, that is, it describes the effect of the node on others. Only a comprehensive consideration of these two aspects can fully measure the severity of the damage to a software system.
A node is more likely to have faults if its FP is higher, and it deserves more attention. However, if a function has faults but does not spread them, it will not cause very serious consequences to the software system, whereas a function that is both prone to fail and highly capable of spreading its faults to others will cause very serious consequences. Therefore, the fault severity of a function is directly proportional to both its fault probability FP and its fault propagation capability FPC. The definition of FS (the fault severity) is given as follows.

$$FS(u) = FP(u) \times FPC(u), \quad u \in NSet$$
where $FP(u)$ is the fault probability of a function node $u$ and $FPC(u)$ is the fault propagation capability of $u$. Together they determine the fault severity to the software system when the function node $u$ fails. A node with a bigger $FS$ has a greater impact on the software system and is therefore more critical.
First, we obtain the fault probability set FPList and the fault propagation capability set FPCList of the software network through Algorithms 6 and 9, respectively. Then, we use formula (12) to calculate the fault severity FS, and the algorithm MFS AN (mining fault severity of all nodes) is proposed to discover the top-k key nodes of the software network.
Algorithm 11 presents the process of the method MFS AN. Line (1) initializes an empty FSList that stores the FS of all function nodes. Lines (2-7) present a loop that calculates the FS of every node. In line (8), the entries of the FSList are sorted by FS. In line (11), the first k function nodes in the FSList are selected as the key nodes of the software network.
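The final ranking step is simple enough to sketch directly. The input values in `FP_LIST` and `FPC_LIST` are made-up numbers for six hypothetical functions, and the name `mine_key_nodes` as well as the alphabetical tie-break are assumptions of this example rather than part of the published algorithm.

```python
def mine_key_nodes(fp_list, fpc_list, k):
    """Return the k nodes with the largest FS = FP * FPC as (node, FS) pairs."""
    fs_list = {u: fp_list[u] * fpc_list[u] for u in fp_list}
    # Sort by decreasing FS; ties are broken by node name for reproducibility.
    ranked = sorted(fs_list.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Illustrative metric values (not taken from the paper's experiments).
FP_LIST = {"A": 3.0, "B": 1.375, "C": 1.125, "D": 0.5, "E": 0.5, "F": 0.5}
FPC_LIST = {"A": 0.0, "B": 0.5, "C": 0.5, "D": 0.625, "E": 1.5, "F": 0.75}

top3 = mine_key_nodes(FP_LIST, FPC_LIST, k=3)
print(top3)  # [('E', 0.75), ('B', 0.6875), ('C', 0.5625)]
```

Note how node A, which has the largest FP, drops to the bottom because its FPC is 0: the product rewards only nodes that are both fault-prone and able to spread faults.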

Experiment and Analysis
In this section, we verify the method MFS AN on two classic open-source tools, Tar and Cflow. Tar is a file archiving tool that also supports compression and decompression. Cflow is a C program analysis tool that traces the calling process of functions in a C program. In the Linux environment, we extract the functions and their dependence relationships from the open-source software with the help of the tool pvtrace. The results are written to text files (such as graph.dot). The nodes and the dependence relationships can then be displayed graphically with the visualization tool Graphviz. Because the main function is obviously important to the software, it is excluded from the following experimental verification. In addition, before the experiment, we preprocess the experimental data and delete the loops in the software network so that the recursion can terminate.
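The text does not say how the loops are deleted. One common way to make a call graph acyclic, shown here purely as an assumed illustration, is to run a depth-first search and drop every back edge, which also removes self-recursive calls. The function name `remove_back_edges` and the sample graph are hypothetical.

```python
def remove_back_edges(edges):
    """Return a copy of `edges` without the back edges found by a DFS.

    `edges` maps (caller, callee) to a call count. Removing all back edges
    of a depth-first search leaves a directed acyclic graph.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {u: WHITE for u in succ}
    acyclic = dict(edges)

    def visit(u):
        color[u] = GRAY
        for v in succ[u]:
            if color[v] == GRAY:     # v is an ancestor of u (or u itself)
                del acyclic[(u, v)]  # back edge closes a loop: drop it
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    for u in succ:  # start from every node so unreachable parts are covered
        if color[u] == WHITE:
            visit(u)
    return acyclic

GRAPH = {("main", "f"): 1, ("f", "g"): 2, ("g", "f"): 1, ("g", "g"): 3, ("f", "h"): 1}
print(sorted(remove_back_edges(GRAPH)))
# [('f', 'g'), ('f', 'h'), ('main', 'f')]
```

Which edges count as back edges depends on the visiting order, so a different traversal may delete a different but equally valid set of edges.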

The Distribution of FS.
By tracking the execution process of Tar and Cflow, the dynamic execution information of the two kinds of software is obtained, and the weighted function execution network WFEN is constructed as the basis of the experimental data. The fault probability FP and the fault propagation capability FPC of all functions are obtained by feeding the node set and the call relationships of the software network to Algorithm 6 (MFP AN) and Algorithm 9 (MFPC AN). The return values of Algorithms 6 and 9 are then passed to Algorithm 11 (MFS AN), which yields the fault severity FS of all nodes and the key nodes of the software network. Figures 4 and 5 show the fault severity scores and the distribution of key nodes in the different versions of Tar and Cflow. From the distributions shown in Figures 4 and 5 (only the first 70 nodes are plotted because the number of nodes differs between versions), we can summarize the following rules:
(1) Every result distribution obeys a power-law distribution. This verifies that the software network shows the scale-free property of complex networks.
(2) There are a few nodes with big FS and many nodes with small FS. The criticality of the few nodes and their impact on the overall software architecture are reflected in their higher scores.
(3) The curves basically follow the same trend in the different versions of Tar and Cflow. In other words, function nodes with the same criticality in different versions cause nearly the same fault severity to the software system.
By analysing the node criticality of the two kinds of software in Figures 4 and 5, the key nodes of the software network are defined according to FS. Because the hierarchical structure of the FS distribution is obvious, we use the turning point of the curves to select the top-10 nodes as the key nodes of the Tar and Cflow software networks, respectively. Tables 1 and 2 present the key nodes and their ranks for the different versions of Tar and Cflow.
From the data shown in Tables 1 and 2, the following rules can be summarized:
(1) For a given function node, the criticality ranking in different versions is basically stable. Although the ranking of a specific function node changes between software versions, the change is very small. For example, in Table 1, the ranking of the function node find_next_block stays within [4,5], and the criticality ranking of the function node dump_file remains at 2 in all versions.
(2) Because the ranking of the key nodes is stable during software evolution, we have sufficient reason to predict the position of a function node in a new software version. For instance, in Table 2, the function node yylex is among the more critical nodes in every version, so we can predict that it is likely to remain critical in the next version.

Degree Distribution of FS.
In order to illustrate the correctness of the key nodes, taking Cflow-1.4 as an example, we use the in-degree $K_{in}$ and the out-degree $K_{out}$ as two indicators that characterize the criticality of a function in the software network.
According to the data in Table 3, although the criticality of a function node is not directly determined by its degree, the two are positively correlated. The top-5 nodes have larger out-degree values, so their fault probability is greater; their in-degree values are also larger, so their fault propagation capability is greater as well, and their overall fault severity is therefore greater. In contrast, the back-5 nodes are all leaf nodes with an in-degree of one, so even if they fail, the range of fault propagation is limited. The results for the different versions of Tar and the other versions of Cflow are similar to Table 3 and are not described in detail here.
Figures 6 and 7 show the joint distribution of FP and FPC in the Tar-1.28 and Cflow-1.4 software networks. As shown, most functions are located in the lower left corner of the graph, which means that their FP and FPC are relatively small; a small number of functions are located in the middle of the graph, which means that their FP and FPC are relatively big; only a very small number of functions are in the upper right corner, which means that both their FP and FPC are big. Such functions are not only prone to fail but also have a strong fault propagation capability. If they fail, the severity of the disruption to the system will be greater. In order to ensure the stability of the software system, we should pay more attention to such functions and guarantee their correctness and robustness.

IC Model.
In social networks, the Independent Cascade Model (IC Model) is a propagation model used to study the influence maximization problem. It is a probabilistic model. When a node u is activated, it tries to activate each inactive neighbor node v with probability $P_{uv}$. Each attempt is made only once, and the attempts are independent of each other. That is to say, the activation of v by u is not affected by other nodes.
In a software network, when a failed node u is called, it propagates its faults to the neighbor node that calls it with a probability $P_{uv}$. If node u can affect many nodes, the severity of its failure is significant. This is very similar to influence maximization in social networks. Therefore, we use the IC model to verify that the proposed method MFS AN does help to measure the fault severity of a node.
According to the mining results for the two kinds of software Tar and Cflow, the top-10 nodes and the back-10 nodes are selected as source nodes, respectively. The IC model is used to simulate the number of nodes that they can affect after being called, which in turn reflects the severity of their failure. Because the IC model is random, we repeated the simulation 10 times for each version of each kind of software and averaged the results, as shown in Figures 8 and 9.
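The simulation procedure can be sketched as follows. The caller map `CALLERS`, the function names, and in particular the use of one uniform activation probability `p` for every edge are assumptions of this sketch; the text does not state how $P_{uv}$ was chosen in the experiments.

```python
import random

# Hypothetical caller map: CALLERS[u] lists the functions that call u (its DINS).
CALLERS = {
    "A": [], "B": ["A"], "C": ["A"],
    "D": ["B"], "E": ["B", "C"], "F": ["C"],
}

def simulate_ic(callers, source, p, rng):
    """One Independent Cascade run; returns how many other nodes are affected."""
    active = {source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in callers[u]:
                # u gets a single attempt on each caller that is still inactive.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active) - 1  # do not count the source itself

def average_spread(callers, source, p, runs=10, seed=0):
    """Average the number of affected nodes over `runs` simulations."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    return sum(simulate_ic(callers, source, p, rng) for _ in range(runs)) / runs

print(average_spread(CALLERS, "E", p=0.5))
```

With `p=1.0` the result is deterministic and equals the number of direct and indirect callers of the source, which is a convenient sanity check before running the randomized experiment.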
As can be seen from Figures 8 and 9, if the functions fail, the number of nodes that can be affected by the top-10 function nodes is about 4 to 5 times that of the back-10 function nodes. This shows that if the top-10 function nodes fail, the fault severity to the software system will be about 4 to 5 times that caused by the back-10 function nodes. Therefore, the ranking of node importance by the MFS AN method does help to measure the fault severity of a node.
In a directed graph, the degree centrality algorithm is a classical algorithm that measures node criticality from out-degree and in-degree. Thus, this paper compares the algorithm MFS AN with the degree centrality algorithm (denoted as Degree). Tables 4 and 5 show the comparative results for Cflow-1.4 and Tar-1.28. The node ranking lists produced by the two methods are different, and in the Degree method multiple nodes with the same metric value receive the same rank. The reason is that the MFS AN method starts from the fault accumulation and propagation characteristics of a function and focuses on the out-degree and in-degree neighbor nodes that have a direct or indirect relationship with the current function node. It considers the global influence of the node, whereas the Degree method only pays attention to the direct out-degree and in-degree neighbor nodes of the current function node and ignores the indirect influence of other nodes. However, in a software network, the nodes are not isolated; they realize complicated software functions by calling each other. Therefore, compared with the Degree method, the MFS AN method can identify the structure of software more clearly and mine the key nodes of a software network more accurately.
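The tie effect of a degree-based ranking is easy to reproduce on a toy graph. The sketch below assumes that the Degree score of a node is the sum of its unweighted in-degree and out-degree and that equal scores share a rank; the exact variant used for Tables 4 and 5 is not specified in the text, and the edge list is hypothetical.

```python
# Hypothetical call relationships (caller, callee).
EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "E"), ("C", "F")]

def degree_ranking(edges):
    """Rank nodes by in-degree + out-degree; equal degrees share one rank."""
    degree = {}
    for caller, callee in edges:
        degree[caller] = degree.get(caller, 0) + 1  # out-degree contribution
        degree[callee] = degree.get(callee, 0) + 1  # in-degree contribution
    distinct = sorted(set(degree.values()), reverse=True)
    rank_of = {value: position for position, value in enumerate(distinct, start=1)}
    return {node: rank_of[d] for node, d in degree.items()}

ranks = degree_ranking(EDGES)
print(sorted(ranks.items()))
# [('A', 2), ('B', 1), ('C', 1), ('D', 3), ('E', 2), ('F', 3)]
```

Six nodes collapse into only three ranks here, because the score looks at direct neighbors only and cannot separate nodes with the same local degree.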
In summary, the algorithm MFS AN proposed in this paper is correct and effective for evaluating node criticality in a software network. Using the algorithm MFS AN to identify the key nodes of a software network helps to reduce the software fault severity and to improve the robustness and stability of software.

Conclusions and Future Work
In this paper, a novel algorithm MFS AN is proposed to evaluate the criticality of nodes in a software network by combining two characteristics, fault probability and fault propagation capability. Function nodes with larger fault probability and stronger fault propagation capability are regarded as the key nodes. Through experiments, we analyse the FS distribution of the nodes in different software versions, identify the evolution law of the software, and show that the algorithm MFS AN can discover the key function nodes of a software network correctly and effectively. In addition, although the criticality of a function node is not directly determined by its degree, the two are positively correlated. With these results, the software structure can be understood more easily and the workload of the testing and maintenance process can be reduced. In future research, we will focus on how to divide software modules based on the key nodes.

Figure 1: A portion of a weighted function execution network.

Figure 2: Common topologies in software network.

Figure 3: A simple weighted function execution network.

According to the definition of FPC, the fault propagation capability of all nodes in the software network is as follows:

Figure 6: Joint distribution of FP and FPC in Tar-1.28.

Figure 7: Joint distribution of FP and FPC in Cflow-1.4.

Figure 8: IC Model simulation results for different Tar versions.

Figure 9: IC Model simulation results for different Cflow versions.

Table 3: In/Out-degree statistics of the top-5 and back-5 ranked nodes in Cflow-1.4.

Table 4: The comparison of node rankings in Cflow-1.4.

Table 5: The comparison of node rankings in Tar-1.28.

Comparison with Degree Method.
The algorithm MFS AN measures the node criticality from two aspects, the out-degree and the in-degree, in the whole network structure.