Using Hierarchical Latent Dirichlet Allocation to Construct Feature Tree for Program Comprehension

Program comprehension is an important task faced by developers during software maintenance. As evolving systems grow more complex, program comprehension becomes increasingly difficult. In practice, programmers typically first get a general view of the features in a software system and then find the interesting or necessary files to start the understanding process. The traditional view of a system is a package-class structure, which is difficult to understand, especially for large systems. In this article, we focus on understanding the system in both a feature view and a file structure view. This article proposes an approach to generate a feature tree based on hierarchical Latent Dirichlet Allocation (hLDA), which includes two hierarchies, the feature hierarchy and the file structure hierarchy. The feature hierarchy shows the features from abstract level to detailed level, while the file structure hierarchy shows the classes from whole to part. Empirical results show that the feature tree can produce a view for the features and files, and that the clustering of classes in the package in our approach achieves better recall than another clustering approach, namely, hierarchical clustering.


Introduction
Understanding a software system at hand is one of the most frequently performed activities during software maintenance [1][2][3]. It is reported that developers working on software maintenance tasks spend up to 60% of their time on program comprehension [4][5][6]. Program comprehension is a process performed by a software practitioner using knowledge of both the semantics and syntax to build a mental model of its relation to the situation [7,8]. As software evolves, its complexity usually increases as well [9]. Moreover, documents associated with the evolving system sometimes become inaccessible or outdated, which makes the program comprehension activity even more difficult.
To reduce the difficulty of program comprehension, one effective approach is to create a meaningful decomposition of a large-scale system into smaller, more manageable subsystems, which is called software clustering [10][11][12]. A number of program comprehension techniques have been studied [13,14]. The widely used clustering approaches are partitional clustering and hierarchical agglomerative clustering [15][16][17]. These approaches usually cluster the system based on static structural dependency in the program.
However, program decompositions based on the structural relationships of the software system can merely provide a structure view of these clusters. During software maintenance, the first activity faced by software developers is to locate the code that is relevant to the task (related to a functional feature) at hand [18]. Thus developers may be more interested in understanding the functional features of a system and how the source code corresponds to functional features. So some other clustering approaches have been proposed based on semantic clustering, which exploits linguistic information in the source code identifiers and comments [14][19][20][21][22]. All these clustering approaches provide either a structure view or a brief feature view for comprehension, but only a few can generate both of them.
In practice, understanding features from abstract level to detailed level and structure from whole to part can effectively help developers to understand a system in a stepwise manner. Such a whole-part way is more beneficial to developers when comprehending object-oriented systems [3,5]. For evolving software with increasing size and complexity, developers find it difficult to identify and choose the interesting packages and understand the chosen classes, their relationships, and their functionalities. Hence, a clustering representation considering both the file structure and the features of the program should be constructed to ease program comprehension. With such clustering, developers can gain an easier, stepwise, and quicker understanding of the system.
In this article, we propose a feature tree to help understand a software system. Program comprehension is performed relying on two hierarchies of the functional features and file structure based on the feature tree. The feature tree contains the features from abstract level to detailed level and the file structure from whole to part. Hence, developers can obtain a good understanding of functional features of the whole system. The feature tree is generated based on hierarchical Latent Dirichlet Allocation (hLDA), which is a hierarchical topic model to analyze unstructured text [23,24]. hLDA can be employed to discover a set of ideas or themes that well describe the entire text corpus in a hierarchical way. In addition, the model supports the assignment of the corresponding files to these themes, which are clusters for the software system corresponding to the functional features.
Therefore, our approach can be effectively used in a whole-part program comprehension way during software maintenance. Our approach is particularly suitable for researchers in the scientific and engineering computing area, as they can employ our approach to understand the systems and maintain them in their own way. Researchers can easily understand these clusters for a software system. The main contributions of this article are as follows:
(1) We propose using hLDA to generate a feature tree. The tree includes a feature hierarchy and a file structure hierarchy. The feature hierarchy shows the features from abstract level to detailed level, while the file structure hierarchy shows the classes from whole to part.
(2) We provide a real case study to explain how the feature tree helps to understand the features and files for the JHotDraw program.
(3) We conduct empirical studies to show the effectiveness of our approach on two real-world open-source projects, JHotDraw and JDK.
The rest of the article is organized as follows: in the next section, we introduce the background of the hLDA model. Section 3 describes the details of our approach. Section 4 shows a real case study of the feature tree on the JHotDraw program. We conduct empirical studies to validate the effectiveness of our approach in Section 5. The empirical results and threats to our studies are shown in Sections 6 and 7, respectively. In Section 8, related work using clustering for program comprehension is discussed. Finally, we conclude the article and outline directions for future work in Section 9.

Background
In this article, we use hLDA to cluster the classes in the software system into a hierarchical structure for easy program comprehension. This section discusses the background of hLDA.
Given an input corpus, which is a set of documents each consisting of a sequence of words, hLDA is used to identify useful topics for the corpus and organize these topics into a hierarchy. In the hierarchy, more abstract topics are near its root and more concrete topics are near its leaves [23,24]. hLDA is a model for multiple-topic documents which models dependency between topics in the documents from abstract to concrete. The model picks a topic according to its distribution and generates words according to the word distribution of the topic.
Let us consider a data set composed of a corpus of documents. Each document is a collection of words, where a word is an item in a vocabulary. Our basic assumption is that the topics in a document are generated according to a mixture model where the mixing proportions are random and document-specific. These topics are the basic mixture components in hLDA. The document-specific mixing proportions associated with these components are denoted by a vector Θ. We assume that there are K possible topics in the corpus and Θ is a K-dimensional vector. Suppose that we generate an L-level tree, where each node is associated with a topic. The process of applying hLDA is as follows:
(1) To select a path from the root to a leaf in the tree
(2) To generate a vector of topic proportions Θ from an L-dimensional Dirichlet distribution
(3) To identify the words in the document from the topics along the path from root to leaf based on the topic proportions Θ
Then, the nested CRP (Chinese Restaurant Process) is used to relax the assumption of a fixed tree structure [23,24]. A document is then generated by first choosing an L-level path through the restaurants and then identifying the words from the L topics associated with the restaurants along that path. In this way, hLDA can generate a hierarchy where more abstract topics are near its root and more concrete topics are near its leaves.
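The generative process described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the hyperparameters `gamma`, `alpha`, and `eta`, and the toy vocabulary are assumptions chosen for illustration, and the sketch only simulates the generative story (path selection with the nested CRP, topic proportions, word generation); it does not perform the posterior inference that hLDA needs to fit a real corpus.

```python
import random
from collections import defaultdict


def dirichlet(alpha, dim, rng):
    """Draw a symmetric Dirichlet sample by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_path(tree_counts, depth, gamma, rng):
    """Choose a root-to-leaf path with the nested CRP.

    A node is identified by the tuple of child indices leading to it (the
    root is the empty tuple). tree_counts[node] maps each child index to the
    number of earlier documents whose path went through that child.
    """
    node = ()
    for _ in range(depth - 1):
        children = tree_counts[node]
        threshold = rng.uniform(0, sum(children.values()) + gamma)
        chosen = len(children)  # default: open a new child ("new table")
        for child, count in children.items():
            threshold -= count
            if threshold < 0:  # existing child, chosen in proportion to its count
                chosen = child
                break
        children[chosen] = children.get(chosen, 0) + 1
        node = node + (chosen,)
    return node


def generate_document(tree_counts, topics, vocab, depth, n_words, rng,
                      gamma=1.0, alpha=1.0, eta=0.5):
    """Generate one document: path, then proportions, then words."""
    leaf = sample_path(tree_counts, depth, gamma, rng)      # step (1)
    nodes = [leaf[:level] for level in range(depth)]        # root ... leaf
    theta = dirichlet(alpha, depth, rng)                     # step (2)
    words = []
    for _ in range(n_words):                                 # step (3)
        node = nodes[rng.choices(range(depth), weights=theta)[0]]
        if node not in topics:  # lazily create the node's word distribution
            topics[node] = dirichlet(eta, len(vocab), rng)
        words.append(rng.choices(vocab, weights=topics[node])[0])
    return leaf, words


rng = random.Random(42)
tree_counts, topics = defaultdict(dict), {}
vocab = ["draw", "figure", "tool", "xml", "action"]  # toy vocabulary (assumption)
for _ in range(5):
    print(generate_document(tree_counts, topics, vocab, depth=3, n_words=6, rng=rng))
```

Because every document passes through the root, the root topic contributes words to all documents, which is why the more abstract topics end up near the root.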

Our Approach
Faced with the source code of a software system, developers need to use their domain knowledge to understand the features from abstract level to detailed level and the classes from whole to part. When understanding a system, developers first need to get a general understanding of the whole system and then find the interesting functions and source code. The process of our approach is shown in Figure 1. First, the source code is preprocessed for the information retrieval technique; then the preprocessed corpus is processed with hLDA. Finally, we visualize the results in a feature tree view for program comprehension.

Preprocessing Source Code.
We preprocess the source code of each software system by applying the typical source code preprocessing steps [25]. We first isolate source code identifiers and comments and then strip away syntax and programming language keywords. We deal with comments and identifiers differently due to their different programming requirements; that is, comments are natural language texts and identifiers are (composed of) single words.
The comments sometimes include authorship information, which does not help program comprehension. For example, words like "author, distributed" and other descriptions about the software itself are included in the header comments. So we first remove the header comments within the classes. Then, we follow four common operations to handle the remaining identifiers and comments as follows:
(1) Tokenize: we first tokenize each word in the source code to remove numbers and punctuation marks
(2) Split: we split the identifiers based on common programming naming practices, for example, camel case ("oneTwo") and underscores ("one_two")
(3) Remove stop words: we remove common English language stop words ("in, it, for") and key words ("int, return") in the program to reduce the noise
(4) Stem: we stem the corpora to reduce the vocabulary size (e.g., "changing" becomes "change")
After these preprocessing operations on the unstructured source code, information retrieval can be more effectively used to extract the key information from the source code. Figure 2 shows an example of the process of preprocessing the source code in the class Applet.java in JDK. After tokenizing, splitting, removing the stop words, and stemming the source code, useful words are left for information retrieval application.
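The four operations above can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions rather than the tooling used in this study: the stop-word and keyword sets are tiny samples, header-comment removal is omitted, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, so its output forms (for example, "chang") may differ from those a production stemmer would give.

```python
import re

# Small illustrative samples (assumptions), not complete lists.
STOP_WORDS = {"in", "it", "for", "the", "a", "of", "to", "and", "is"}
JAVA_KEYWORDS = {"int", "return", "public", "private", "class", "void", "new", "static"}


def tokenize(text):
    """Keep runs of letters and underscores; numbers and punctuation are dropped."""
    return re.findall(r"[A-Za-z_]+", text)


def split_identifier(token):
    """Split on underscores and camel case, e.g. oneTwo and one_two -> one, two."""
    parts = []
    for chunk in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [p.lower() for p in parts]


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(source_text):
    """Tokenize, split, remove stop words and keywords, then stem."""
    words = []
    for token in tokenize(source_text):
        words.extend(split_identifier(token))
    kept = [w for w in words if w not in STOP_WORDS and w not in JAVA_KEYWORDS]
    return [naive_stem(w) for w in kept]


print(preprocess("int drawFigure = create_figure(42); // changing the figure for it"))
```

The resulting word list for each class is what would be passed to the topic model as one document of the corpus.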

Modeling the Corpus with Hierarchical Latent Dirichlet Allocation.
After preprocessing the source code of the software system, we apply hLDA to generate a hierarchy in which more abstract topics are near the root of the hierarchy and more concrete topics are near the leaves.
With the hLDA model, we can draw an outline about the topics, the files, and the hierarchical structure.The topics and the hierarchical structure constitute the feature hierarchy which shows the features from abstract level to detailed level.The files and the hierarchical structure constitute the file structure hierarchy, which shows the classes from whole to part.

Visualizing the Hierarchy for Program Comprehension.
After modeling the corpus with hLDA, we can get a hierarchy of the topics for the software system. To get a view of the hierarchy, we display the hierarchy in a tree structure, which is called the feature tree. The tree is generated from the results of the hLDA and includes two parts: one is the topic words and the other is the assigned classes. Moreover, when comprehending the classes in each topic, developers may want to know the name of the enclosing package. So we provide the package name when listing the relevant classes of each topic. In the feature tree, we can get three types of relationships: node to topic, father node to son node, and topic to file.
Node is an important part of the feature tree and it contains key information for program comprehension. The node consists of two elements, topics and relevant file names. The node in the first level generated by the hLDA model is the root of the feature tree, which includes all the topics and files of the system. Nodes in the remaining levels are the members of the subtree indicating the subtopics of the system and the related files. Topics are composed of words extracted from the preprocessed comments and identifiers based on the topic distribution in the hLDA model. So the topics are used for understanding the source code of interest. This is the feature hierarchy showing the features from abstract level to detailed level, which can help developers get a general view of the features of the software system. With a general view of the topics, users are more interested in the files corresponding to these topics. Relevant files are assigned based on the distribution of the hLDA with the generated topics. This is the other hierarchy, the file structure hierarchy, which shows the classes from whole to part in the form of clusters corresponding to the features. In addition, when listing all the files, some more information is needed such as the location of the file. So the files and their locations are also presented in the tree nodes. The structure of the feature tree shows the relation between nodes. Father-son is the main relationship in the feature tree. All this information can be obtained from the results of the hLDA. A father node is a generalization of its son nodes. For the topics in the nodes, a father node represents more general and abstract features than its son nodes. For the files, a father node includes all the files of its son nodes.
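One possible in-memory representation of such a node is sketched below in Python. The class name `FeatureNode`, its fields, and the toy assignment of files to nodes are assumptions for illustration; they are not the data structure or the actual hLDA output of this study. The sketch only demonstrates the two properties stated above: each node carries topic words plus files with their package location, and a father node includes all the files of its son nodes.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    topic_words: list                               # words labeling the node's topic
    own_files: list = field(default_factory=list)   # (package, file) assigned directly here
    sons: list = field(default_factory=list)        # nodes for more detailed sub-features

    def all_files(self):
        """A father node includes its own files plus all files of its son nodes."""
        files = list(self.own_files)
        for son in self.sons:
            files.extend(son.all_files())
        return files

    def render(self, level=0):
        """Return the subtree as indented text lines: topic, file count, files."""
        pad = "  " * level
        lines = [f"{pad}[{' '.join(self.topic_words)}] ({len(self.all_files())} files)"]
        lines += [f"{pad}  - {package}/{name}" for package, name in self.own_files]
        for son in self.sons:
            lines += son.render(level + 1)
        return lines


# Toy tree (hypothetical assignment, loosely based on names from JHotDraw).
leaf_a = FeatureNode(["attribute", "show", "action"],
                     own_files=[("draw.action", "AttributeAction.java"),
                                ("draw.action", "AttributeToggler.java")])
leaf_b = FeatureNode(["split", "combine", "svg"],
                     own_files=[("samples-svg-action", "SplitAction.java")])
middle = FeatureNode(["instal", "method", "plug", "action"], sons=[leaf_a, leaf_b])
root = FeatureNode(["invoc", "defin", "tool", "draw"], sons=[middle])

print("\n".join(root.render()))
```

Storing files only at the node they are assigned to and aggregating upward keeps the whole-to-part property consistent by construction.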
A three-level feature tree of the JDK program is displayed in Figure 3. We can see the topics in each node with the corresponding package names and class names in the content.

Case Study
In this section, we provide a case study of applying our approach to the JHotDraw program (this subject is also used in our empirical study; more about this subject will be introduced in Section 5.2). The JHotDraw program includes 23 packages and 305 classes.
We first generate the feature tree for the JHotDraw program. Part of the tree view is shown in Figure 4. The tree has three levels and there are several words describing each node in the tree. For the root node of the tree, 305 classes are included and the topic labeling the root node is "invoc defin tool net delete modif preserve ad clone create draw." Due to the preprocessing step, some words are transformed into different forms. For the verbs in the topic, we can easily find that the original form of the verbs "invoc defin delete modif preserve ad clone create draw" is just "invoke define delete modify preserve add clone create draw" and these words express the functional features of the system. The nouns in the topic are "tool net," which indicate the objects of the system. JHotDraw defines a skeleton for a GUI-based editor with tools in a tool palette, different views, user-defined graphical figures, and support for saving, loading, and so forth. So the words in the root node describe the features of the software system to some extent.
For the first son node in the second level of the root node, there are 163 classes. The node contains words like "instal method plug point instance creat action applic event draw." The original forms of these words are just "install method plugin point instance create action application event draw." The content describing the functional features in this node is finer than that in the root node. For example, "plugin application action event" are words for plugin and application and some actions in the source code. A plugin program typically includes XML code. In the results, we can also find that packages like "nanoxml, xml" are in this node. Then, for the "action" feature, packages corresponding to it are distributed in this node, like "drawaction, samples-svg-action, app-action." Some actions can be easily identified from the class names in the node, such as "SplitAction.java, CombineAction.java" in "samples-svg-action," which indicate the actions of split and combine for the SVG (Scalable Vector Graphics).
For the first node in the third level, which is the son node of the node mentioned above, 26 classes are assigned to this node. The labeled topic is "attribute show action applic event draw." The original form of these words is just "attribute show action application event draw." Most of them have appeared in their father node. The representative word is "attribute." In JHotDraw, AttributeFigure is a directly accessed component. We can further refine its behavior by overriding some methods such as draw() to customize the graphical representation in the diagram. The node only includes four packages. They are draw.action, app, draw, and app.action. In "draw.action," we may find "AttributeAction.java, AttributeToggler.java, SelectSameAction.java, DefaultAttributeAction.java," which correspond to the word "attribute."
Hence, from the above study on a real software system, we can use the information on the feature tree to facilitate understanding the functional features and file structure in the system.

Empirical Study
In this section, we conduct empirical studies to evaluate the effectiveness of our approach. In our studies, we address the following two research questions:
(RQ1) Does the feature hierarchy really help users to comprehend the software system?
(RQ2) Are the clustering results more convenient for users to understand the software system than the hierarchical clustering approach [16]?
(RQ1) and (RQ2) are used to evaluate the feature hierarchy and file structure hierarchy, respectively. (RQ1) indicates whether the results of the feature tree really help users to comprehend the software system. In addition, we investigate (RQ2) to see whether the clustering results can make it easy for users to comprehend software systems compared with the results of hierarchical clustering [16].

Subject Systems.
We address our research questions by performing studies on the source code of two well-known software systems, JHotDraw (https://sourceforge.net/projects/jhotdraw), and two packages in the Java Development Kit 1.7 (http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html), that is, JDK-sun and JDK-java. JHotDraw is a medium-sized, 2D drawing framework developed in Java. The Java Development Kit (JDK) is an implementation of either one of the Java SE or Java EE platforms in the form of a binary product. Specifics of these subject systems are shown in Table 1.
These projects belong to different problem domains and are of medium or large size. They are all real-world software systems and have been widely used as subjects of empirical studies in the context of software maintenance or program comprehension [26,27].

Design of Our Study.
In our approach, there is a parameter, that is, the tree level L. It represents the number of levels of the feature tree, which affects both the tree structure for program comprehension and the clustering results in the software system.
In our study, we set L to 3 for manual program understanding since a tree of three levels achieves relatively good results in our study.
For the hLDA computation, we used the C++ implementation from David Blei's hLDA topic modeling software (https://github.com/xch91/hlda-cpp). We ran it for 10,000 sampling iterations, the first 1000 of which were used for parameter optimization.
In addition, we used a hierarchical clustering algorithm for a comparative study. Anquetil and Lethbridge proposed a hierarchical clustering algorithm suite, which provides a selection of association and distance coefficients as well as update rules to cluster software systems [16]. There are three variants of hierarchical agglomerative clustering, which were used in many software architecture recovery techniques [28,29]. These variants can be distinguished based on their update rules as follows: Single Linkage (SL), Complete Linkage (CL), and Average Linkage (AL). SL merges the two clusters with the smallest minimum pairwise distance. CL merges the two clusters with the smallest maximum pairwise distance, and AL merges the two clusters with the smallest average pairwise distance. Previous works have shown that, in relatively large systems, CL generates the best results in terms of recovering authoritative decompositions [16,30,31]. We also used the same semantic data (words in comments and identifiers) to perform the hierarchical agglomerative clustering.
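The three update rules can be contrasted in a short Python sketch of agglomerative clustering over word sets. It is only an illustration under assumptions: the Jaccard distance, the function names, and the toy class vocabularies are choices made here, whereas the algorithm suite of [16] offers several association and distance coefficients and is not reproduced.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two word sets: 1 - |intersection| / |union|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Each rule summarizes the pairwise distances between two clusters.
LINKAGE = {
    "SL": min,                              # Single Linkage: minimum pairwise distance
    "CL": max,                              # Complete Linkage: maximum pairwise distance
    "AL": lambda ds: sum(ds) / len(ds),     # Average Linkage: mean pairwise distance
}


def cluster_distance(c1, c2, docs, rule):
    distances = [jaccard_distance(docs[i], docs[j]) for i in c1 for j in c2]
    return LINKAGE[rule](distances)


def agglomerate(docs, n_clusters, rule="CL"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in docs]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], docs, rule),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Toy word sets per class (hypothetical), as produced by preprocessing.
docs = {
    "AttributeAction": {"attribute", "action", "draw"},
    "DefaultAttributeAction": {"attribute", "action", "default"},
    "XMLParser": {"xml", "parse", "element"},
    "XMLElement": {"xml", "element", "node"},
}
for rule in ("SL", "CL", "AL"):
    print(rule, agglomerate(docs, n_clusters=2, rule=rule))
```

On this tiny input all three rules agree; they diverge on larger inputs, where SL tends to chain clusters together and CL favors compact ones.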

Participants.
We conducted a user study to answer the two research questions. Our study involved 10 participants from school and/or industry. These participants have different levels of software development experience and familiarity with Eclipse. Half of them are from our university (graduates in our lab) with 2-3 years of development experience and the other half are from industry with 5-6 years of development experience, especially large project development experience. They were required to conduct program comprehension for a system (e.g., JDK-java, JDK-sun, and JHotDraw). In addition, we gave participants the classes, and they needed to identify the semantically relevant classes for them.

Measures.
For (RQ1), we investigated whether the feature hierarchy helps the participants comprehend the functional features of the whole system. To show whether the topic hierarchy generated by our approach is useful, the participants needed to assess whether the feature hierarchy enables them to understand the clusters. A five-point Likert scale from 1 (very useless) to 5 (very useful) assesses each participant's view of the topic hierarchy. According to the scores, we can see whether the structure helps program comprehension.
To answer (RQ2), we used precision and recall, two widely used metrics for information retrieval and clustering evaluation [32,33], to evaluate the accuracy of the topics and the clustering results. An intrapair is a pair of entities that belong to the same cluster. Precision in clustering evaluation is the percentage of intrapairs proposed by the clustering approach, which are also intrapairs in the authoritative decomposition. Recall is the percentage of intrapairs in the authoritative decomposition, which are also intrapairs in the decomposition proposed by the clustering approach. The authoritative results are given by the participants. We found that the participants gave different results for the given classes. So we consider that a class is included in the authoritative decomposition when 60% of participants or more selected it as a relevant one. These two metrics are used to measure the accuracy of the clustering results in an a posteriori way. Writing intra(C) for the set of intrapairs in the decomposition proposed by the clustering approach and intra(A) for the set of intrapairs in the authoritative decomposition, they are defined as follows:

$$P = \frac{|\mathrm{intra}(C) \cap \mathrm{intra}(A)|}{|\mathrm{intra}(C)|}, \qquad R = \frac{|\mathrm{intra}(C) \cap \mathrm{intra}(A)|}{|\mathrm{intra}(A)|}.$$

To avoid optimizing for either precision or recall, the F-measure is used, which is the harmonic mean of precision and recall. It is defined as follows:

$$F = \frac{2 \cdot P \cdot R}{P + R}.$$
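As a worked illustration of these measures, the following Python sketch computes intrapair-based precision, recall, and their harmonic mean for two small decompositions, together with the 60% agreement rule for building the authoritative decomposition. The function names, the class labels A to E, and the vote counts are hypothetical; this is a sketch of the definitions in the text, not the evaluation scripts used in the study.

```python
from itertools import combinations


def intrapairs(decomposition):
    """All unordered pairs of entities that share a cluster."""
    pairs = set()
    for cluster in decomposition:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def evaluate(proposed, authoritative):
    """Return (precision, recall, F) based on intrapairs."""
    prop, auth = intrapairs(proposed), intrapairs(authoritative)
    common = prop & auth
    precision = len(common) / len(prop) if prop else 0.0
    recall = len(common) / len(auth) if auth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def authoritative_members(votes, n_participants, threshold_percent=60):
    """Keep a class when at least threshold_percent of participants selected it."""
    return {name for name, count in votes.items()
            if 100 * count >= threshold_percent * n_participants}


proposed = [["A", "B", "C", "D"], ["E"]]        # hypothetical clustering output
authoritative = [["A", "B"], ["C", "D", "E"]]   # hypothetical reference decomposition
print(evaluate(proposed, authoritative))

votes = {"A": 9, "B": 6, "C": 5}                # hypothetical votes from 10 participants
print(authoritative_members(votes, n_participants=10))
```

In this example the proposed decomposition has six intrapairs, two of which also occur among the four authoritative intrapairs, so precision is 2/6, recall is 2/4, and the harmonic mean is 0.4.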
Since agglomerative clustering can generally be adjusted to increase or decrease recall at the expense of precision by changing parameters, we selected the 50 results with the best F values and then computed the average values for our comparative study.
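The selection step can be sketched in a few lines of Python. The record layout (dictionaries with `precision`, `recall`, and `f` keys) and the function name are assumptions for illustration; the example uses three made-up runs and keeps the best two only to show the mechanics.

```python
def average_of_best(results, k=50):
    """Average precision, recall, and F over the k runs with the highest F."""
    best = sorted(results, key=lambda r: r["f"], reverse=True)[:k]
    return {metric: sum(r[metric] for r in best) / len(best)
            for metric in ("precision", "recall", "f")}


# Hypothetical results of three clustering runs with different parameters.
runs = [
    {"precision": 0.60, "recall": 0.20, "f": 0.30},
    {"precision": 0.50, "recall": 0.30, "f": 0.375},
    {"precision": 0.90, "recall": 0.05, "f": 0.095},
]
print(average_of_best(runs, k=2))
```

Ranking by the combined measure rather than by precision or recall alone avoids reporting a configuration that is tuned to one metric at the expense of the other.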

Variables.
In our study, the main independent variable is the clustering technique (hLDA or CL) that we used to generate the clustering results for program comprehension. The dependent variables are the values of precision (P), recall (R), and F-measure (F), which are used to measure the accuracy of these clustering techniques.

Empirical Results
In this section, we collect and discuss the results from our studies to answer the proposed two research questions, respectively.

(RQ1).
Our approach is aimed at giving a view of the topics in a hierarchical way from a general level to a concrete level for the whole system and provides a clustering result of the packages and classes.In this subsection, we discuss whether the feature tree structure of the topics helps to understand the whole system and whether the topics are representative.
We provided the topic tree generated from hLDA of each system to the participants to investigate whether the tree can give a view of the topics in a hierarchical way from a general level to a concrete level for program comprehension. First, we provided the topics of each cluster to the participants. They used a five-point Likert scale to answer questions related to the quality of the topics. The results are shown in Table 2.
The results show that the average score is around 3.5, which indicates that the participants think that the topics are useful for understanding the cluster. So, for program comprehension, some topics labeling the clusters are useful to help users to understand the program.

(RQ2).
In this subsection, we compare the accuracy of the two clustering results in leaves of the feature tree generated by our approach and the hierarchical CL clustering approach.
To quantitatively compare these two clustering approaches, we compute their precision and recall results, which are shown in Table 3. From the results, we notice that the average precision of CL is 54% and the highest is 60% for JDK-sun. All the precision values of CL for the three subject programs are higher than those of the hLDA approach.
But, for the recall results, we notice that the average recall of hLDA is around 40% and the highest is 42% for JDK-java. All the recall values of hLDA for the three subject programs are higher than those of CL, which means that the hLDA model can predict more files for the clusters. This is because the hLDA model clusters the files or classes not only with the same words but also with the same latent topics. That is to say, our approach can cover more relevant classes in the authoritative clusters, which can effectively facilitate program comprehension.
Although clustering results with high precision are more correct, many other correct relevant results are not discovered [16]. When using clustering for program comprehension, high recall is more important [16]. So, from the results discussed above, our approach can effectively identify more relevant classes in a cluster to help program comprehension compared with the CL approach.

Threats to Validity
Like any empirical validation, ours has its limitations. In the following, threats to the validity of our studies are discussed.

Threats to External Validity.
We only applied our technique to two subject programs. Thus we cannot guarantee that the results in our studies can be generalized to other more complex or arbitrary subjects. However, these subjects are selected from open-source projects and widely employed for experimental studies [26,27]. In addition, we only selected part of the representative subjects and used them for comparison with the CL clustering approach when evaluating the effectiveness of clustering results.
In addition, we considered only the Java programming language and the Eclipse development environment. Further studies are required to generalize our findings to large-scale industrial projects and to developers who have sufficient domain knowledge and familiarity with the subject systems in various development environments.

Threats to Internal Validity.
First, the preprocessing of the program transforms the source code into a word list, which is then used for information retrieval. This may affect the results of the hLDA model. Since we derive our topics based on the use of identifier names and comments, the quality of the topics generated by hLDA relies on the quality of the source code preprocessing.
Another threat, shared with other comprehension techniques based on semantics, is that the code should be of high quality in both its naming rules and its comments. Programs of lower quality may not yield results as good as those in our study.
A third threat to the internal validity of our study is the difference in the capabilities of the participants. Specifically, we cannot eliminate the threat that the participants have different capabilities. This also can affect the accuracy of the results. As the authoritative decomposition changes, the results may also be different.

Threats to Construct Validity.
To evaluate the effectiveness of our approach, we used precision and recall metrics. These two metrics only focus on the false positives and false negatives with respect to the authoritative clustering results. However, for program comprehension, other factors may be more important.
In addition, when we compare our approach with agglomerative clustering, we selected the 50 results with the best F values and then computed the average values for our comparative study. However, agglomerative clustering approaches can generally be adjusted to increase or decrease recall at the expense of precision by changing parameters. Other experimental designs may obtain different results.

Related Work
Source code is one of the important inputs for program comprehension. There are a number of studies focusing on this area [10][34][35][36][37][38]. Program clustering is one of the effective ways for program comprehension. There are two different types of program clustering: one is based on syntactic dependency analysis [39][40][41][42][43], while the other is based on semantic information analysis [44,45].
The syntactic-based clustering approaches usually focus on analyzing the structural relationships among entities, for example, call dependence, control dependence, and data dependence. Mancoridis et al. proposed an approach which generates clusters using a module dependency graph for the software system [36]. Anquetil and Lethbridge proposed an approach which generates clusters using a weighted dependency graph for program clustering [46]. Sartipi and Kontogiannis presented an interactive approach to recover cohesive subsystems within C systems. They analyze different relationships and build an attributed relational graph. Then the graph is manually or automatically partitioned using data mining techniques for program clustering [47]. All these works analyze syntactic relationships to cluster the program, and developers understand how the functional features are programmed in the source code based on this syntactic clustering. In this article, we focus on extracting the functional features in the source code and clustering the source code based on hLDA.
Semantic-based clustering approaches attempt to analyze the functional features in a system [13,48]. The functional features in the source code are extracted from comments, identifier names, and file names [49]. For example, Kuhn et al. proposed an approach to group software artifacts based on Latent Semantic Indexing. They focused on grouping source code containing similar terms in the comments [14,50]. Scanniello et al. presented an approach which also uses Latent Semantic Indexing to get the dissimilarity between the entities [51]. Corazza et al. introduced the concept of zones (e.g., comments and class names) in the source code to assign different importance to the information extracted from different zones [50,52]. They use a probabilistic model to automatically give weights to these zones and apply the Expectation Maximization (EM) algorithm to derive values for these weights. In this article, our approach uses hLDA to generate a feature tree structure of the topic hierarchy for program comprehension and clusters the packages and classes based on them. The feature tree includes two hierarchies for the functional features and clusters of packages and classes.
In addition, some other approaches combine syntactic analysis and semantic analysis for program clustering [11][53][54][55]. Tzerpos proposed the ACDC algorithm, which uses the names and dependencies of classes to cluster the whole system into small clusters for comprehension [56]. Andritsos et al. presented an approach, LIMBO, which considers both structural and nonstructural attributes to decompose a system into clusters. In this article, by contrast, we used only semantic analysis for clustering, but our approach can generate topics to help users more easily understand the classes or packages in each cluster.

Conclusion and Future Work
In this article, we proposed an approach of generating a feature tree based on hLDA, which includes two hierarchies for functional features and file structure. The feature hierarchy shows the features from abstract level to detailed level, while the file structure hierarchy shows the classes from whole to part. We conducted empirical studies to show the effectiveness of our approach on two real-world open-source projects. The results show that our approach achieves higher recall than the hierarchical CL clustering approach, although CL achieves higher precision. In addition, the topics labeling these clusters are useful to help developers understand them. Therefore, our approach could provide an effective way for developers to understand the system quickly and accurately.
We only conducted our studies on two Java-based programs, so the results do not establish the generality of our approach for other types of systems. Future work will focus on conducting more studies on different types of systems to evaluate the generality of our approach. In addition, we find that, just like other IR approaches, our results are sensitive to the preprocessing process, the scale of the corpus, and the parameters in the model. Further study on how to make the best use of IR approaches when transforming programs into corpora is necessary.

Figure 1: Process of our approach.

Figure 2: An example of preprocessing the source code.

Figure 3: The feature tree of the JDK program.

Figure 4: A part of the overview of the feature tree for JHotDraw.

Table 2: The score for the quality of the topics.

Table 3: Precision and recall of two clustering approaches for each subject system.