An Improved Approach to Identifying Key Classes in Weighted Software Network

1 State Key Lab of Software Engineering and School of Computer, Wuhan University, Wuhan 430072, China
2 School of Computer, Wuhan Vocational College of Software and Engineering, Wuhan 430205, China
3 International School of Software, Wuhan University, Wuhan 430072, China
4 Research Center of Complex Network, Wuhan University, Wuhan 430072, China
5 Faculty of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China


Introduction
With the rapid development of open-source software (OSS), the scale and complexity of software systems keep growing. A newcomer who wants to gain a general understanding of such a system as quickly as possible is usually pointed to its key classes, which convey the basic concepts [1]. Software can be characterized as a dependency network built from the relationships between its elements (classes, packages, features, and so on), referred to as a software dependency network (SDN). The nodes of an SDN represent classes or interfaces, and the links represent the dependencies between them. The number of dependencies between two nodes is taken as the edge weight, so software can be understood as a weighted software network. In other words, the problem of identifying the key classes of a software system is converted into that of measuring the key nodes of its SDN.
Key nodes are the nodes that are most likely to affect the structure and function of a network. Although such nodes usually make up a small proportion of the network, they can rapidly influence most of the remaining nodes. In complex networks, node importance reflects a node's influence, transmission capacity, and robustness. Numerous measures have been proposed to identify the important nodes in a network. For example, both Jian-Guo et al. [2] and Ren and Lü [3] compared several commonly used methods to analyze their differences and their specific application scenarios. Sun and Luo [4] also reviewed approaches to measuring important nodes from the perspective of global network topology and node attributes. Ren et al. [5] proposed a method for measuring node importance based on node degree and clustering coefficient. Kitsak et al. [6] proposed a $k$-shell based method to address the issue.
In social network analysis, besides centrality, the ℎ-index has been widely applied to evaluate the position or influence of actors [7]. In a coauthorship network, the ℎ-index reflects scholars' performance based on their position and influence within their collaboration network. Its main advantage is that it considers both the quantity and the quality of the papers published by a scholar.
As mentioned above, there are many methods for measuring node importance in a network, but few attempts have been made to use the ℎ-index or its variants to identify key classes in software engineering. To fill this gap and verify the feasibility of the ℎ-index, this paper proposes new measures based on the ℎ-index to identify key classes in SDN.
Our contributions are summarized as follows:
(1) We propose four new measures based on the ℎ-index to study class importance in SDN, defined according to the degrees of a node's neighbors and the weights of its edges, respectively.
(2) By comparison with several existing centrality measures, we validate the feasibility of the proposed measures for identifying important nodes.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the preliminary theories and approaches. Section 4 presents the results of our experiments in detail. Section 5 concludes the paper and outlines future work.

Related Work
Many indicators of node importance have been defined in complex networks and social network analysis, such as node centrality [8]. One of the simplest is degree centrality, which can be understood as the number of other nodes connected to the target node. It represents the potential impact of a node on its surrounding nodes and is a local property of nodes in a network. To reflect the global properties of a node, researchers have put forward betweenness centrality, closeness centrality, eigenvector centrality, and so forth. Both betweenness centrality and closeness centrality depend on the topological distance between nodes in the network.
Wang and Pan [9] used betweenness centrality, closeness centrality, and eigenvector centrality to measure the importance of classes in the software network and analyzed how these indexes differ in identifying key classes. In defect prediction, attention can be focused on different parts of a software system according to the importance of its entities. Zimmermann and Nagappan [10] took Windows Server 2003 as the experimental object, built a software network from the dependency relationships between classes, and compared the effect on defect prediction of network indexes (such as node degree, betweenness centrality, and closeness centrality) with that of source code metrics. The results showed that predictions based on network indexes were more effective than those based on source code metrics.
Meneely et al. [11] built a developer collaboration network from file-level code change information and then used network centrality indexes to measure the contribution of developers and to predict system failures after release. The authors found an obvious correlation between centrality indexes in the developer collaboration network and post-release failures, which confirmed that network indexes help to find faults earlier, during the development phase. Using centrality indexes, Crowston et al. [12] measured the differences between the roles of developers in a community and then identified the core-periphery hierarchy among developers. Based on the contribution relationship between developers and modules, Pinzger et al. [13] used network centrality metrics to measure the contribution of developers and, by relating developer contribution to the number of post-release defects, found that modules in the center are more likely to fail than those at the edge. Shin et al. [14] used centrality indexes to predict vulnerable code and found that these indicators could distinguish vulnerable files from neutral files well and could be used to build a file vulnerability prediction model.
Zimmerman et al. [15] identified the core program units by using centrality indexes to analyze the software network. Bhattacharya et al. [16] built code-level and module-level networks, respectively, and used network indicators to assess bug severity, to prioritize refactoring, and to predict defects. Steidl et al. [17] used centrality, PageRank, and node degree to determine the core classes of a software system and confirmed that the results agree well with the judgment of experienced developers. Zanetti et al. [18] also used centrality indexes to measure the ability level of defect reporters and then to guide defect assignment according to the index values.
Besides centrality indexes, Perin et al. [19] ordered classes according to the dependencies between them by using the PageRank algorithm. Zaidman and Demeyer used the HITS algorithm to calculate the importance of classes and thereby determine a system's key classes [20]. Pan et al. [21] put forward a weighted PageRank algorithm to identify the key packages of a software system. Zhou and Xu [22] compared PageRank, HITS, and betweenness centrality in identifying important classes at the class level of the software network. Meyer et al. [23] modeled software as a network and applied $k$-core decomposition to identify a core subset of potentially important classes. Jiang et al. [24] proposed a technique that measures the importance of each class based on unique input/output sequences to identify the key classes. Kamran et al. [25] proposed an efficient technique that pinpoints the core architecture classes of a system. Şora [26] modeled the static dependency structure of a system as a graph and applied a graph ranking algorithm to identify its key classes. Pan et al. [27] put forward a weighted $k$-core decomposition to identify important packages of object-oriented software. Şora [28] proposed a tool that automatically extracts a summary identifying the most important classes of a system.
Although the ℎ-index is one of the important indexes for evaluating the research level of scholars, few researchers have applied it to measure the importance of classes in a software system. Wang et al. [29] used the ℎ-index and its variants to identify the key classes in the class-level software network, and it was also compared

Software Dependency Network (SDN).
Because the direction and weight of the dependencies between classes have practical significance, a reverse engineering method is used in this paper to construct the SDN model. We adopt $G = (V, E, W)$ to represent a weighted network, where $V = \{v_i\}$ is the set of all nodes in the network, that is, the set of all classes in a system, and $E = \{e_{ij}\}$ is the set of all edges. If there is a dependency between node $i$ and node $j$, then $e_{ij} = 1$; otherwise $e_{ij} = 0$. $W = \{w_{ij}\}$ is the weight set corresponding to the edge set $E$, where $w_{ij}$ is the number of dependencies between node $i$ and node $j$, that is, the weight of $e_{ij}$. When building the SDN, a directed edge $e_{ij}$ is created for the following five kinds of dependencies:
(1) Class $v_i$ inherits or implements the class or interface $v_j$ (inheritance dependency).
(2) Class $v_i$ contains a field whose type is class $v_j$ (field dependency).
(3) Class $v_i$ calls a method of class $v_j$ (method dependency).
(4) Class $v_i$ returns an object of class $v_j$ (return dependency).
(5) A method of class $v_i$ takes an object of class $v_j$ as a parameter (parameter dependency).
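The construction above can be illustrated with a minimal Python sketch. The class names and dependency records are hypothetical and serve only as an example; in particular, the sketch assumes that every extracted dependency of any of the five kinds adds 1 to the weight of the corresponding directed edge.

```python
from collections import Counter

# Hypothetical dependency records (source class, target class, kind).
# The class names are invented for illustration only.
dependencies = [
    ("Parser", "Lexer", "field"),
    ("Parser", "Lexer", "method"),
    ("Parser", "Node", "return"),
    ("Visitor", "Node", "parameter"),
    ("ListNode", "Node", "inheritance"),
]


def build_sdn(records):
    """Return (V, E, W): node set, 0/1 edge indicators, and edge weights."""
    # Assumption: each record contributes 1 to the weight of edge (src, dst).
    weights = Counter((src, dst) for src, dst, _kind in records)
    nodes = {cls for edge in weights for cls in edge}
    edges = {edge: 1 for edge in weights}
    return nodes, edges, dict(weights)


V, E, W = build_sdn(dependencies)
print(W[("Parser", "Lexer")])  # 2: one field and one method dependency
```

Pairs of classes that do not appear as keys of `E` correspond to $e_{ij} = 0$.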

Centrality Measures.
Hirsch defined the ℎ-index as follows: suppose that the papers published by a scholar are sorted in descending order of their number of citations; if each of the top ℎ papers has at least ℎ citations, the scholar is considered to have index ℎ. Using this concept, we apply the ℎ-index to SDN and define four new centrality measures.
aN-Degree: the aN-Degree of a node is the average degree of its top $h$ neighbors such that the degree of each of them is not less than $h$.
mN-Degree: the mN-Degree of a node is the maximum $h$ such that the degrees of its top $h$ neighbors sum to at least $h^2$.
aE-Weight: the aE-Weight of a node is the average weight of its top $h$ edges such that each of them has a weight of at least $h$.
mE-Weight: the mE-Weight of a node is the maximum $h$ such that the weights of its top $h$ edges sum to at least $h^2$.
For node 1 in Figure 1, the aN-Degree is 14/3, as its top 3 neighbors each have a degree of at least 3. The mN-Degree is 4, as the sum of the degrees of the top 4 neighbors is greater than $4^2$. The aE-Weight is the average of the weights of the top 3 edges, $(8 + 6 + 5)/3$. The mE-Weight is 5, as the sum of the weights of the top 5 edges is not less than $5^2$ (see Table 1).
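A minimal Python sketch of the four measures follows. The function names are ours, and the two input lists are hypothetical values chosen to be consistent with the figures quoted for node 1 (14/3, 4, 19/3, and 5); they are not read from Figure 1. The sketch takes plain lists of neighbor degrees or edge weights, so it makes no assumption about how degree is counted in a directed network.

```python
from fractions import Fraction


def h_cut(values):
    """Largest h such that each of the top h values is at least h."""
    ranked = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value < rank:
            break
        h = rank
    return h


def a_index(values):
    """Average of the top h values (aN-Degree or aE-Weight)."""
    ranked = sorted(values, reverse=True)
    h = h_cut(ranked)
    return Fraction(sum(ranked[:h]), h) if h else Fraction(0)


def m_index(values):
    """Largest h whose top h values sum to at least h**2 (mN-Degree or mE-Weight)."""
    ranked = sorted(values, reverse=True)
    best, total = 0, 0
    for rank, value in enumerate(ranked, start=1):
        total += value
        if total >= rank * rank:
            best = rank
    return best


# Hypothetical inputs for one node (assumed, not taken from Figure 1).
neighbor_degrees = [5, 5, 4, 3, 3]
edge_weights = [8, 6, 5, 3, 3]

print(a_index(neighbor_degrees), m_index(neighbor_degrees))  # 14/3 4
print(a_index(edge_weights), m_index(edge_weights))          # 19/3 5
```

Passing neighbor degrees yields aN-Degree and mN-Degree, and passing edge weights yields aE-Weight and mE-Weight, so the same two functions cover all four measures.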

Research Problems.
In this paper, the research mainly focuses on the following two questions.
(1) Do the Proposed Centrality Measures Work Well? As mentioned above, various measures for identifying important nodes have been proposed in complex network and social network analysis. Thus, it is worth analyzing the correlation between the proposed centrality measures and the existing ones.
(2) Do the Proposed Centrality Measures Identify More Key Classes in SDN? To further investigate the usefulness of the proposed centrality measures, we compared them with several existing centrality measures in terms of their ability to identify the key classes in actual maintenance, in the context of weighted software networks.

Experiment Design.
We first present the framework of our work (see Figure 2), which consists of three parts. First, we used Dependency Finder and SNAT, a network analysis tool developed by us, to parse the source code and extract classes and their dependencies, and then built the SDN at the class level. Second, we calculated the centrality of all nodes in the SDN and analyzed the correlation between four existing centrality measures and the proposed ones. Finally, using the version control log exported by TortoiseSVN, we further validated the feasibility of the proposed centrality measures.

4.4. Results.
We organize our results according to the two research questions proposed in Section 4.2.
(1) Do the Proposed Centrality Measures Work Well? The purpose of a centrality measure is to rank the classes of a software system by importance: the greater the measure value, the more important the class. To test the feasibility of the proposed measures, we first need to quantify their relationship with other commonly used centrality measures.
In other words, we should ensure that the results are significantly related as a whole. In this paper, we use four typical centrality measures as benchmarks: node degree, betweenness centrality (BC), closeness centrality (CC), and eigenvector centrality (EC). Since our purpose is to compute the correlation between two groups of rankings, the Kendall rank correlation coefficient is used for the analysis (tau-b, $\tau_b$, in SPSS). $\tau_b = 1$ indicates that the two rankings agree completely, $\tau_b = -1$ indicates that they are completely reversed, and $\tau_b = 0$ indicates that they are unrelated. Table 3 shows that there is a strong correlation between the two groups of centrality measures, except for EC. For example, for JUNG the correlation coefficient is up to 0.9 between aN-Degree and node degree and up to 0.6 between aN-Degree and CC. Furthermore, although the correlation coefficients between the proposed measures and EC are small, they are statistically significant. Figure 3 shows the relationship between aN-Degree and the four typical centrality measures in the three software projects.
Therefore, on the one hand, there is a significant correlation between the proposed centrality measures and the four benchmark measures. On the other hand, the consistency between mE-Weight and the other centrality measures is more obvious. In summary, the results show that the proposed centrality measures work as well as the four benchmark centrality measures.
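For readers who wish to reproduce the correlation analysis without SPSS, the Kendall tau-b coefficient can be sketched in plain Python as below. This is an illustrative quadratic-time implementation under the standard tau-b definition, not the SPSS routine itself, and the two score lists are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation of two equally long score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1          # pair tied in x (possibly also in y)
        if dy == 0:
            tied_y += 1          # pair tied in y (possibly also in x)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denominator = sqrt((total - tied_x) * (total - tied_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denominator


# Hypothetical scores of four classes under two centrality measures.
measure_a = [1, 2, 2, 3]
measure_b = [1, 2, 3, 3]
print(kendall_tau_b(measure_a, measure_b))
```

Identical rankings give 1 and fully reversed rankings give -1, which matches the interpretation of the coefficient used in the analysis above.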
(2) Do the Proposed Centrality Measures Identify More Key Classes in SDN? The previous question showed that the proposed centrality measures are remarkably consistent with existing network centrality measures in measuring the key classes. However, whether the important nodes identified by the proposed measures, especially by mN-Degree and mE-Weight, are key classes in the actual system is still unknown. A key class might be more complex because it is associated with other classes and might be highly reused because it relies on more classes.
During software maintenance, such classes are more likely to be changed. Therefore, we exported the revision information from the version control log of each software system and obtained the number of revisions of each class during the period of change. Figure 4 shows that, as a whole, the number of times a class is changed is positively related to its centrality value. For example, the value of $R^2$ (coefficient of determination) is up to 0.864 between the average mE-Weight values and the number of changes of classes. That is, classes with greater centrality are changed more frequently. In turn, the results confirm that centrality measures are useful for identifying the key classes in software systems.
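The coefficient of determination mentioned above can be computed as in the following sketch. It assumes an ordinary least-squares straight-line fit of change counts against centrality values, which is our assumption about how the trend in Figure 4 was obtained; the data points are hypothetical.

```python
def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x fitted to the points."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical (centrality value, number of changes) pairs.
centrality = [1, 2, 3]
changes = [1, 2, 2]
print(r_squared(centrality, changes))
```

A value close to 1 means that the change counts follow the fitted line closely, and a value close to 0 means that the line explains little of their variation.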
Because the results in the other two projects show a tendency similar to that of the Tomcat project, only the Tomcat case is given. Figure 4 also shows that the trends of our proposed centrality measures are more obvious than that of CC, as indicated by their greater values of $R^2$. Note that, except for mE-Weight, the BC measure performs better than the proposed measures, which further confirms the advantage of the BC measure mentioned in [6,30]. Meanwhile, mE-Weight clearly performs best, which indicates that computing centrality from the weighted edges of a node has an advantage over computing it from node degree and is even superior to other frequently used centrality measures. In addition, we report the proportion of classes that have actually been modified among the top $k$ key classes returned by each centrality measure. Table 4 shows that the proportion becomes larger as $k$ increases. When $k$ is 5, none of the five most important classes has been modified, except in the case of mN-Degree and BC. A possible explanation is that the core framework should remain stable and is in general less likely to be modified during software maintenance. When $k$ is set to 200, the proportion exceeds 60% for seven of the eight measures, especially for mN-Degree and mE-Weight. For instance, 87.5% of the top 200 classes returned by the mE-Weight measure were actually changed. Moreover, the proportion obtained with mE-Weight in Table 4 is larger than that of the other measures as a whole; compared to BC, the improvement is up to 0.215. This further verifies the advantage of mE-Weight in identifying key classes in software engineering.
In summary, compared to existing network centrality measures, the proposed measures do identify more key classes in the software network, especially the mE-Weight measure.
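The proportion of modified classes among the top-ranked classes, as reported in Table 4, can be computed as in the sketch below. The class names, scores, and set of modified classes are hypothetical, and breaking ties by class name is our assumption, since the tie-breaking rule is not stated.

```python
def top_k_modified_ratio(scores, modified, k):
    """Share of the k highest-scoring classes that appear in `modified`."""
    # Sort by descending score; ties are broken by class name (assumption).
    ranked = sorted(scores, key=lambda cls: (-scores[cls], cls))[:k]
    if not ranked:
        return 0.0
    return sum(1 for cls in ranked if cls in modified) / len(ranked)


# Hypothetical centrality scores and classes changed in the revision log.
scores = {"A": 9.0, "B": 7.5, "C": 7.5, "D": 3.0, "E": 1.0}
modified = {"B", "C", "E"}

print(top_k_modified_ratio(scores, modified, 2))  # 0.5
print(top_k_modified_ratio(scores, modified, 4))  # 0.5
```

Running the function for each centrality measure and each cutoff yields one table cell per combination.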

Conclusion
This paper puts forward four centrality measures based on the ℎ-index to compute the importance of a class in a software system from two aspects: the degrees of its neighbors and the weights of the edges that connect the class to its neighbors. Taking three open-source software projects as the research objects, the results indicate that the proposed measures not only identify the key classes as well as some commonly used centrality measures (correlation coefficient 0.987) but also perform better than some of them (the improvement is at least 0.215). In addition, the findings suggest that mE-Weight, which is defined by the weights of a node's top $h$ edges, performs best as a whole.
This work will help new developers understand the core parts of a software system faster and provides a guideline for prioritizing class modification in software maintenance. In the future, we will further validate the measures proposed in this paper with more open-source software applications and apply these measures to software network from the other

Figure 1: A simple weighted software dependency network.

4.1. Data.
Three open-source projects are chosen as the research objects in this paper.

Figure 2: The framework of our experiment.

Figure 3: An example: the relationship among aN-Degree and other centrality measures in three software projects.

Table 1: Centrality measures for node 1 in Figure 1.

with the code index in identification accuracy. They found that taking the top 15% of classes ranked by these indicators achieved a recall of 70%. In their work, however, the calculation of the ℎ-index considered only node degrees and not edge weights. Compared with Wang et al., although this article does not consider other variants of the ℎ-index, we take into account both the number of edges of a node (the node degree) and the edge weights when calculating the ℎ-index of nodes.

Table 2 lists the statistical information. Tomcat (http://tomcat.apache.org/) is a free Web

Table 2: The information of projects used in our experiment.

Table 4: The proportion of actually modified classes in the top $k$ key classes returned by different centrality measures (Tomcat).