The five-level CMMI software process maturity framework [
Some researchers suggested that we should study how to systematically exploit
One of the potentially most useful application areas in mining software repository data is understanding defect insertion and defect removal processes [
Unfortunately, it is quite difficult to identify defect insertions. The common first step in defect-related MSR analyses is to identify transactions in the version archive that correct a defect, usually called bugfix commits.
We call such a pair of a bugfix commit and a bugtracker entry a bugfix link.
For the present study, we define a
The study describes how practicing engineers at a company, Infopark, went about establishing bugfix links in order to (later on) perform actual bugfix data analysis for engineering purposes. This is in contrast to researchers performing such analysis for research purposes and has a number of consequences for what the paper does and does not contain.
We will proceed to describe the design rationale of our study and its research contributions (Section
This section describes why we set up the study the way we did and which scientific (certainty-oriented) and engineering (cost-/benefit-oriented) contributions we claim it makes.
(1) Previous studies on linking commits to issue reports have mostly considered open source software systems only (as opposed to commercial closed source). Most of the few uses of closed source data in MSR works are either vague with respect to the origin and nature of the data (e.g., [
(2) We do indeed find a characteristic in our data that has not previously been used: explicit reverse links contained in a bugtracker entry that point to a commit. To exploit it, we introduce new filtering heuristics (Sections
(3) We describe and analyze the filters individually so they can be used in a modular fashion and users can understand for what types of repository it might be sensible to use or not use certain filters (Section
(4) For some bugfix link candidates, it is difficult to decide whether they are valid or not. We aim to always judge this correctly by employing a product expert from within the company, who carefully and manually checked more than 2500 bugfix link candidates (Section
(5) Most filters involve a cutoff threshold or similar settable parameter. Successful use of the filtering chain requires adjusting these parameters to the characteristics of the given repository, because unsuitable settings can utterly ruin the filter chain’s performance. To our knowledge, our work is the first to provide guidance for this step; we describe metaheuristics for setting the filtering heuristics’ cutoff parameters in Sections
(6) While a scientific study of bugfix link identification will perform extensive manual validation of the results, a practicing engineer can at best afford manual validation for a small sample. We therefore sketch the overall procedure of how one would apply bflinks in practice without a comprehensive manual validation and discuss how to adapt to different properties of the repository at hand (Section
Summing up, our contributions focus on achieving practical applicability of the technique while ensuring high validity of the results. In contrast, it is expressly
On a methodological level, we explain why absolute measurements of recall are problematic (Section
This work was started by a company, Infopark AG, with the intention of achieving insights regarding their own defect insertion and removal processes that could be turned into process improvements.
It was planned to (1) identify bugfix commits, (2) establish their bugfix links, (3) identify the corresponding bug commits (defect insertions), and (4) analyze them for interesting patterns.
Steps 3 and 4 turned out to be infeasible [
Infopark identified
We perform the study on Infopark's repository only, rather than on several, because accurate assessment of precision requires the correct manual classification of many commits, which in turn requires background knowledge of the respective product and its development practices.
Infopark is an early Web company. Founded in 1994, it built the first version of its main product CMS Fiona in 1997. CMS Fiona is a content management system (CMS) aiming at large-scale and high-traffic web sites with both static and dynamic parts. Its particular strengths lie in consistency-keeping for static content. Two other products represented in our repository data, the Online Marketing Cockpit and an internal product, are so much smaller that they hardly influence the results of our investigation at all.
Infopark has always been very open to innovation, not only in its products but also in the technologies and processes used for building them. Therefore, the data investigated represents, over time, a multitude of technical platforms for managing it as well as of programming languages. CMS Fiona was originally started in Objective
The following properties of Infopark will be relevant for the current study: (i) in the CMS domain, feature requests are very frequent and there is no clear line between avoidable bugs and feature requests; consequently, Infopark often treats the implementation of small improvements to the functionality just like bugs, and such improvements represent a substantial fraction of our "bugfix" data; (ii) because feature requests typically involve more code than ordinary bugfixes, our data contains many nonsmall bugfixes. Infopark has always had low turnover of staff and was therefore able to follow intended processes and good practices stably, in particular (iii) and (iv): (iii) the commit message of a bugfix commit typically mentions the number of the corresponding bugtracker entry; (iv) when closing a bugtracker entry, a comment will often be added that mentions the version number of the corresponding bugfix commit.
The earliest version archive data that is available for our analysis starts in the year 2000 and takes the form of a CVS archive.
The only bug tracking data that is still available (and extends until the present) comes from an instance of Bugzilla.
For the analysis presented here, we created one single continuous Git archive that covers all commits (including branches) from the Infopark CVS and SVN and Git archives which contain 25 653, 14 694, and 5 261 commits, respectively; more than 45 000 commits overall, created over roughly 11 years of development. We added artificial bridging commits leading into each initial state of each subsequent subrepository; we kept all original version identifiers (and in fact all relevant metadata used in our analysis) in a separate database as extracted by the MininGit (
Since we are interested in bugfixes, and those are defined to relate to bugtracker entries, we use only those commits leading towards the product releases made since the introduction of Bugzilla. These data consist of 31 854 commits, which represent a total of 263 033 file-level deltas comprising 21 995 065 line changes. Only 2% of the commits lack a commit message.
The Bugzilla database contains 9 444 bug entries with a total of 46 302 comments. 60% of the entries are marked as repaired, 9% as duplicates, and 24% as invalid or “works for me.” According to results reported in [
Our approach for establishing a set of bugfix links consists of two methods for generating candidate bugfix links and a set of filters for removing invalid candidates from among them.
The two generator methods are as follows: BM scans commit messages for mentions of Bugzilla entry IDs; CM scans bugtracker entry comments for mentions of commit IDs (version numbers).
The filters are as follows: FB rejects Bugzilla IDs that occur in too many different commit messages; FC rejects commit IDs that occur in too many different Bugzilla entries; SB rejects small integers that are unlikely to be Bugzilla IDs at all; TT rejects links that would imply references into the future ("time travel"); LU rejects backward links whose mention appears only long after the commit; UD rejects links that are merely unidirectional.
If it appears appropriate for a given dataset, any one of the filters could be left out, so the approach is customizable. Before all of these, the filter
Early works such as [
Since it was introduced by [
The present paper appears to be the first report involving backward links and the corresponding filters FC and UD.
ReLink [ checks candidate links against three criteria: first, a combination of TT and LU; second, the requirement that the author of the commit must also be a (co)author of the bugtracker entry; and third, sufficient textual similarity (TS) between the commit message and the description and comments in the bugtracker entry. For the latter, ReLink employs stemming, stop-word elimination, and thesaurus-based word unification to reduce the vocabulary and then applies the cosine text similarity metric. The idea is that if no bug ID is provided in the commit message, the message will instead talk about topics (such as failure symptoms or class or method identifiers) that are also mentioned somewhere in the (typically much longer) bugtracker entry.
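For readers unfamiliar with that metric, here is a minimal bag-of-words version of cosine similarity in Python (a sketch of the standard metric only; ReLink's actual implementation additionally performs the stemming, stop-word elimination, and word unification described above):

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity of two token lists, treated as bag-of-words vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A candidate link passes the TS criterion only if this value exceeds ReLink's cutoff.
print(cosine_similarity("null pointer in cache flush".split(),
                        "fixes null pointer exception during cache flush".split()))
```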
MLink [
MLink and ReLink are very clever ideas, which could nicely be combined with the ideas of bflinks. When a dataset has many explicit bugfix links (as the Infopark data does), bflinks will work well, while for other datasets ReLink and MLink can improve the otherwise low recall with, one hopes, sufficient precision.
Note, however, that ReLink’s LU and TS both require cutoff parameters and their values are extremely critical for good precision because of the huge number of incorrect candidate links that need to be rejected.
This is not a small issue. For instance, the smallest of the three datasets investigated, ZXing, has 1 694 commits and 135 fixed bugs, hence over 220 000 candidate bugfix links. The reported precision is 91% (107 of 118 correct), but if less appropriate parameter settings let just an additional 1% of the incorrect candidates through, precision would drop to 4.6% (107 of 2318 correct). For the largest of the datasets, Apache, the equivalent drop in precision would go from 75% to 0.016% (only 0.06% of the commits from the Apache dataset, the 493-commit Linkster subset, are actually used in the paper; its Table 3 is misleading in this respect, consult the original paper [
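Spelled out, using the numbers just quoted:

$$\frac{107}{118} \approx 91\%, \qquad \frac{107}{118 + 0.01 \cdot 220\,000} = \frac{107}{2318} \approx 4.6\%.$$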
Unfortunately, in the published form of ReLink, precisely this critical parameter setting step is unsound. The published algorithm involves a fine-grained search of the parameter space. The best parameter values are identified by measuring the resulting precision and recall against a ground truth dataset (golden set). This procedure is impossible in a real application, because ground truth will not be known; if it were, no procedure for establishing bugfix links would be needed at all. Therefore, the published performance figures are based on near-optimal parameter settings that could not be achieved by a real user and are hence overoptimistic. This means that the ReLink procedure as published is incomplete and cannot be applied in practice. Until a sound method for tuning the ReLink cutoff parameters is devised, we can therefore perform neither a combination nor a comparison of bflinks and ReLink.
MLink has a similar problem, if a less severe one. The authors present in detail what they call an "unsupervised hill-climbing algorithm" (which is in fact neither unsupervised nor hill-climbing) that requires ground truth for setting the threshold parameters, but they do not state what data they fed it. At least they use just one fixed set of parameters for all four of their benchmark datasets afterwards.
Bugfix link identification is a form of information retrieval: from among all conceivable links, find all correct ones.
If the output of such a search is not too large, measuring its precision is practical: manually assess each output element and classify it as correct or incorrect.
Measuring recall, however, is generally difficult because it involves assessing all conceivable links. Unless this base set is rather small, complete manual assessment (and hence reliable determination of recall) is not feasible.
For instance, determining the ground truth for the smallest dataset in the ReLink article involves checking 220 000 pairs. Even at an implausibly fast speed of 15 seconds per pair, true individual checking of each pair would take half a person-year of effort. It appears unlikely that the authors have invested this much effort for determining their "golden set" (and then another). Instead, they will have performed a reduced, heuristic checking, which opens possibilities for overlooking correct bugfix links; at a determined bugfix link density of only 0.065% (143 links), the fraction of overlooked ones can easily be quite large. For the Infopark dataset, establishing a complete ground truth would even involve checking over 300 million pairs (or 700 person-years at 15 seconds apiece).
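The underlying arithmetic, assuming roughly 1800 working hours per person-year:

$$220\,000 \cdot 15\,\mathrm{s} \approx 917\,\mathrm{h} \approx 0.5 \text{ person-years}, \qquad 3 \cdot 10^{8} \cdot 15\,\mathrm{s} = 1.25 \cdot 10^{6}\,\mathrm{h} \approx 700 \text{ person-years}.$$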
So alternative approaches to complete manual checking have to be devised in a domain-dependent manner. See, for instance, the ingenious approach taken by the TREC text retrieval contest [
In the bugfix link identification domain, we see four approaches that have been used to obtain estimates of recall.
(1) Heuristic manual checking, aimed at a good approximation of the complete ground truth for small repositories (ZXing and OpenIntents of ReLink). This approach will be inaccurate if too many links are overlooked.
(2) Tool-supported manual checking by a project expert, aimed at a good approximation of a partial ground truth for a large repository (Linkster/Apache of ReLink). This case stems from [
(3) A survey, aiming at a rough estimate of overall recall for a very large repository. This was used at Infopark and will be described in Section
(4) Relative measures similar to recall (results gain for augmentation methods such as ReLink and results loss for filtering methods such as bflinks). This approach avoids the need for a ground truth completely and instead requires only the checking of links proposed (perhaps at some internal stage) by the respective system. It will be described in the next subsection.
It is feasible to avoid relying on ground truth datasets for bugfix link identification. For research purposes, relative measures are often sufficient for comparing different systems and much cheaper than establishing ground truth. For practical engineering purposes, the survey method is sufficient for classifying recall as good enough or not good enough and again cheaper than establishing ground truth.
We go on to describe the measures we use in the present study to characterize the quality of the individual bflinks filters.
After the generators, the filters can be used individually or in arbitrary combination. Assume that we have as input
For a given filter or filtering chain, we also speak of the following: its filtering precision (the fraction of its removals that are correct, i.e., that remove an invalid candidate), its strength (the fraction of all candidates that it removes), the results loss (the fraction of valid bugfix links that get lost by the filtering), and the overall precision (the precision of the set of candidates that remain after filtering).
The next subsection describes how these metrics were operationalized.
For measuring any of the above metrics we need to know which candidates are actually valid ones. This can only be determined by applying human judgment and background knowledge, so that manual assessment is required. For this study, we have manually checked over 2500 (52%) of our candidate bugfix links. More precisely, a long-term member of the Infopark development team has individually looked at these pairs of defect database entry and commit message and consulted the source code diff and/or colleagues where necessary in order to make a reliable decision whether a pair forms a true bugfix link or not.
For our research goal of
Unless the filters are idiotic, these candidates will contain a higher density of invalid bugfix links than the rest of the population and our sample is thus biased. The overall precision assessment from this biased sample is pessimistic and the resulting estimates of overall precision
No such “will be pessimistic” argument is available for overall recall
For obtaining at least a very rough estimate of recall in order to check whether Infopark's 50% recall requirement is fulfilled, we performed a quick, informal survey of four long-time Infopark developers, asking (a) what percentage of all commits are primarily bugfix commits and (b) what percentage of those have a corresponding Bugzilla entry. We compare the outcome to the number of bugfix links we found.
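The computation behind this comparison is straightforward; the following sketch shows it with placeholder survey answers (the two percentages are hypothetical illustrations of ours, not Infopark's actual survey results):

```python
# Survey-based recall estimate; the survey percentages below are placeholders.
total_commits = 31854   # commits considered in our analysis (see the dataset description)
pct_bugfix = 0.30       # survey answer (a): fraction of commits that are bugfixes -- placeholder
pct_with_entry = 0.70   # survey answer (b): fraction of bugfixes with a Bugzilla entry -- placeholder
links_found = 5005      # candidate bugfix links produced by both generators together

expected_links = total_commits * pct_bugfix * pct_with_entry
print(f"estimated recall: {links_found / expected_links:.0%}")  # ~75% with these placeholders
```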
This section explains the generator methods, filters, and tuning parameters in detail.
Infopark was interested in complete bugfix links only, not in solo bugfix commits without a corresponding link to a bugtracker entry. We therefore made the following early (and quite radical) decision with respect to our analysis of the commit messages: we would not perform any keyword string matching on the commit messages (for terms such as "bug," "defect," and "fix") at all. Instead, we would only look for integer numbers that represent existing Bugzilla entry IDs, assume that each such number constitutes a bugfix link, and then validate the correctness of that link as best we can. (Most mentions are not just plain integers such as "1234" but rather strings such as "#1234" or "BZ1234" or similar forms. However, there was no fixed rule in place at Infopark in this regard, so we decided to go for all integers (as also suggested by [
Requiring some keyword matching in addition would improve the precision of the results at the expense of lower recall. It is quite plausible that careful application of keyword matching could improve our results somewhat, but we did not investigate this in the present study and rather present the results of a “pure” bugfix link search instead.
This numbers-based search approach and the disciplined Infopark development culture suggest a second data source for bugfix links that we have not seen analyzed in the literature yet: if commit messages may contain numbers referring to bugtracker entries, why not also look in bugtracker entry comments for numbers referencing commits (version numbers)? We use all strings that look like the ID of an existing version as candidate bugfix links as well. These IDs look very different for each versioning system: for CVS they are either single-dotted numbers (on the trunk, e.g., 1.23) or multi-dotted numbers (on a branch, e.g., 1.23.1.7); for SVN they are integers, at Infopark preceded by "r" (e.g., r1722; we accept "R" as well); for Git they are 40-digit hexadecimal strings (e.g., 6050732e725c68b83c35c873ff8808dff1c406e2).
SVN and Git version IDs are unique for the whole repository and so are easy to resolve. CVS version IDs, however, are local per file, so resolving them requires finding the corresponding filename and perhaps (where filenames are not unique themselves) even the pathname. This is much more difficult, hence quite error-prone, and often even impossible. Since we have only a few months' worth of CVS data, we decided not to bother and left the CVS part of the history out of our analysis entirely.
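A minimal sketch of what the two generators amount to, in Python (all function names and data structures here are illustrative assumptions of ours; the actual implementation is not shown in this paper):

```python
import re

BUG_ID = re.compile(r"\d+")               # any integer (the "all integers" decision above)
SVN_ID = re.compile(r"\b[rR]\d+\b")       # e.g., r1722 (we accept "R" as well)
GIT_ID = re.compile(r"\b[0-9a-f]{40}\b")  # e.g., 6050732e72...

def generate_bm(commits, bugzilla_ids):
    """BM: candidate links from Bugzilla-ID mentions in commit messages.
    commits: iterable of (commit_id, message); bugzilla_ids: set of existing entry IDs."""
    for commit_id, message in commits:
        for number in BUG_ID.findall(message):
            if int(number) in bugzilla_ids:   # keep only numbers that are existing entry IDs
                yield (commit_id, int(number))

def generate_cm(comments, commit_id_of):
    """CM: candidate links from commit-ID mentions in bugtracker comments (SVN and Git only).
    comments: iterable of (bug_id, text); commit_id_of: dict from mention string
    (lowercased) to the canonical commit ID, for all existing revisions."""
    for bug_id, text in comments:
        for mention in SVN_ID.findall(text) + GIT_ID.findall(text):
            if mention.lower() in commit_id_of:
                yield (commit_id_of[mention.lower()], bug_id)
```

Both generators emit candidate links as (commit, bugtracker entry) pairs, so that forward and backward candidates can later be compared directly.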
The union of these two sets of candidate bugfix links (from commit messages and from bugtracker entry comments) forms the basis for the subsequent filtering steps. Such filtering is very important: Bugzilla IDs in particular will be polluted with plenty of false positives from port numbers, RFC numbers, percentages, time measurements, and many other kinds of numbers. The version IDs, although far less confusable, will also have false positives, because they may be mentioned in roles other than the bugfix link role (e.g., "defect was validated to be still present in r123"). Our study will investigate the following filtering criteria.
Filter FB (overly frequent Bugzilla IDs). An integer number found in a commit message that really is the ID of a Bugzilla entry will occur in only one or possibly a few different commit messages (for difficult bugs needing multiple complementary fixes or fixes of fixes), but not in many.
Filter FC (overly frequent commit IDs). Even more dubious is the same commit ID appearing in multiple different Bugzilla entries, because Infopark does not often check in multiple bugfixes in a single commit. Therefore, multiply appearing commit IDs probably indicate something other than a bugfix (such as an alpha release).
Filter SB (suspiciously small Bugzilla IDs). Small integers found in commit messages will very often not be Bugzilla IDs at all but rather other things such as percentages, HTTP status codes, buffer sizes, and item numbers. Since such numbers tend to occur repeatedly with the same value, rejecting them should improve precision. It will also not hurt recall much, because only few correct bugfix links involve such small Bugzilla IDs.
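The three counting-based criteria just described (FB, FC, and SB) reduce to simple bookkeeping. A sketch under the same illustrative conventions as before; the default cutoffs correspond to the default parameter choices derived later in the paper:

```python
from collections import Counter

def filter_fb(candidates, max_commits=4):
    """FB: reject Bugzilla IDs that occur in more than max_commits different commits."""
    commits_per_bug = Counter(bug for _, bug in set(candidates))
    return [(c, b) for (c, b) in candidates if commits_per_bug[b] <= max_commits]

def filter_fc(candidates, max_bugs=3):
    """FC: reject commit IDs that occur in more than max_bugs different Bugzilla entries."""
    bugs_per_commit = Counter(c for c, _ in set(candidates))
    return [(c, b) for (c, b) in candidates if bugs_per_commit[c] <= max_bugs]

def filter_sb(candidates, min_bug_id=350):
    """SB: reject Bugzilla IDs that are implausibly small integers."""
    return [(c, b) for (c, b) in candidates if b >= min_bug_id]
```

Here, candidates is a list of (commit, bug) pairs as produced by the generators.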
Filter TT (time travel). It makes no sense to assume that a commit message refers to a Bugzilla entry that was created only after the commit happened, nor that a Bugzilla comment refers to a commit from its future; candidate links that imply such time travel should be rejected.
There are three exceptions to these rules. (1) Git has a feature called "rebase" that allows creating time-travel effects at will by removing a commit from the history and reinserting an equivalent one at any other point in the history. In practice, this is used for eliminating branches and the reinsertion will be at a later time, not an earlier one. Also, the commit changes its ID. So this is not a problem. (2) Git allows changing the content or timestamp of a commit message later. This feature is not normally used at Infopark, so this is not a problem either. (3) If the time difference is only small, it might be due to a misalignment between the bugtracker server clock and the version archive server clock. We therefore accept references into the very near future (a few minutes), but not into a farther future, as those could only be valid in the presence of lucky guessing or time travel—and history teaches us that both of these are not common in software development.
Filter LU (late update). If a commit is mentioned in a Bugzilla comment only a long time after the commit happened, the mention probably refers to the commit in some role other than that of a bugfix, so such late mentions should be rejected.
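Both time-based criteria (TT and LU) reduce to timestamp comparisons. A sketch, where the few-minutes drift tolerance and the 80-hour LU cutoff reflect the parameter choices discussed later:

```python
from datetime import timedelta

CLOCK_DRIFT = timedelta(minutes=5)  # tolerance for bugtracker/version-archive clock misalignment

def passes_tt_c(entry_created, commit_time):
    """TT(C): a commit message must not mention a Bugzilla entry created after the commit."""
    return entry_created <= commit_time + CLOCK_DRIFT

def passes_tt_b(commit_time, comment_time):
    """TT(B): a Bugzilla comment must not mention a commit that lies in its future."""
    return commit_time <= comment_time + CLOCK_DRIFT

def passes_lu(commit_time, comment_time, cutoff=timedelta(hours=80)):
    """LU: the mention of a commit must appear within `cutoff` after the commit."""
    return comment_time - commit_time <= cutoff
```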
Filter UD (unidirectionality). Any single bugfix link may be spurious, and the above heuristics attempt to identify it as such in a context-free manner, without looking at any other link. However, as there are tens of thousands of commits and thousands of bugtracker entries, it is unlikely that a spurious bugfix link from, for example, commit c to bugtracker entry e will be accompanied by a second spurious link in the opposite direction, from e to c. Links confirmed in both directions (bidirectional links) are therefore almost certainly valid, and filter UD rejects all links that are merely unidirectional.
This is a potentially ruinous filter. Unless it is common in the development organization to mention commit IDs in bugtracker comments, the rule will be far too strict and will result in extremely low recall. In the Infopark case, however, it turns out to be practical and should be considered because it promises high precision for the links it keeps. Also, this filter does not require choosing a tuning parameter.
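The bidirectionality test itself is a simple set operation (same illustrative conventions as before):

```python
def split_by_direction(forward_links, backward_links):
    """Partition candidates into bidirectional links (found by both BM and CM)
    and unidirectional ones (found by only one of the two generators)."""
    fwd, bwd = set(forward_links), set(backward_links)
    bidirectional = fwd & bwd
    return bidirectional, (fwd | bwd) - bidirectional

# Filter UD would keep only the bidirectional set; the tuning section below
# instead turns this around and accepts the bidirectional links unconditionally.
```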
This section describes how we select the tuning parameters of the filters and what performance is thus obtained with each. The methods for choosing each tuning parameter do not use the ground-truth knowledge from our manual assessment of the candidates, but rather are procedures as they could be applied by any engineer attempting to perform a good search for bugfix links automatically. These procedures are based on simple diagnostic plots and invoke human judgment, so that relevant background knowledge is not thrown away if the engineer has such knowledge. We will describe this reasoning for each filter.
In each case, we will describe three choices of parameter: (1) a loose choice, (2) a default choice, and (3) an aggressive choice.
Section
Generator BM suggested 4037 candidate bugfix links. These have a precision of 79%.
Generator CM suggested 3015 candidate bugfix links. These have a precision of 60%.
Both generators combined suggested 5005 candidate bugfix links (of which 2047 are bidirectional). These 5005 have a precision of 73%, the starting point of our filters’ precision improvement work.
We emphasize that this estimate is very rough. It has an unknown and possibly large margin of error. Its only purpose for Infopark was to make the decision whether the required “at least 50% recall” (see Section
For choosing the cutoff parameter of filter FB, we consider the following diagnostic question.
How many Bugzilla IDs (among our candidate links) occur in how many different commit messages?
It turns out that 3 is in fact overly aggressive and results in a filtering precision of only 39% and a hurtful results loss of 12%. Cutoffs of 4 (and 5) work all right, at filtering precision 50% (60%) and a results loss of 6% (3%). All three parameter values nevertheless result in the same overall precision of 77%, so the loose choice would clearly have been best in this case.
For a complete overview of the performance statistics for the default parameter choice of all the filters, please refer to Table
Performance of each successive stage of the default COMBI filter: filtering precision (fp), strength, results loss, and overall precision (op), all in percent. Columns 2–5 describe this one filter solo; columns 6–9 describe the filters up to here together. Row UDa shows UD solo used as a rejection filter (left) versus the full chain with UD turned into an acceptance stage (right).

Filter | fp (%) | Strength (%) | Results loss (%) | op (%) | fp (%) | Strength (%) | Results loss (%) | op (%)
---|---|---|---|---|---|---|---|---
TT(C) | 43 | 0.3 | 0.2 | 73 | 43 | 0.3 | 0.2 | 73 |
TT(B) | 99 | 0.9 | 0 | 74 | 86 | 1.2 | 0.2 | 74 |
FC | 96 | 3.7 | 0.2 | 77 | 93 | 4.9 | 0.4 | 78 |
LU | 62 | 5.6 | 2.1 | 77 | 74 | 8.3 | 2.7 | 79 |
SB | 81 | 6.9 | 1.3 | 81 | 67 | 15 | 4.0 | 90 |
FB | 50 | 11 | 5.6 | 77 | 60 | 23 | 9.3 | 93 |
UDa | 29 | 59 | 42 | 99 | 63 | 19 | 7.1 | 93 |
Analogously, for choosing the cutoff parameter of filter FC, we consider the following question.
How many commit IDs (among our candidate links) occur in how many different Bugzilla entries?
It turns out that this filter is very precise. Filtering precision for 3 (4) is 96% (99%). However, as the strength is only 4% (2%), the impact of the filter is modest, with results losses below 0.3% and overall precision of 77% (76%).
For filter SB, we plot the density of the candidate Bugzilla IDs. For readability we restrict the plot to a maximum ID of 2000 (this covers the bottom 26% of all IDs) and obtain the plot shown in Figure
Density of the occurrence of small and medium small candidate Bugzilla IDs. The boxplot shows percentiles 10, 25, 50, 75, and 90. The density estimator was computed by the density function of the R statistical software system (version 2.12) using default parameters.
We choose 130 as the loose choice and 750 as the aggressive choice. Considering that the raw data values look like a good mix of quite a few different values (rather than just a very few values occurring over and over), we decide not to be too aggressive and pick 350 as the default choice. The mix of many different values also means the filter will not overlap too strongly with FB.
These considerations all turn out to be quite valid and successful. Filtering precision of the loose/default/aggressive choice is 92%/81%/56% and the resulting overall precision is 78%/81%/81%, so being aggressive is not helpful in this case. Results loss is 0.3%/1.3%/5%.
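The study produced this diagnostic with R's density function; a rough Python equivalent (purely illustrative, using scipy's default bandwidth rather than R's) would be:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_small_id_density(candidate_bug_ids, cap=2000):
    """Diagnostic plot for choosing the SB cutoff: density of candidate Bugzilla IDs up to cap."""
    ids = np.array([i for i in candidate_bug_ids if i <= cap], dtype=float)
    xs = np.linspace(0, cap, 500)
    plt.plot(xs, gaussian_kde(ids)(xs))   # analogue of R's density() call
    plt.xlabel("candidate Bugzilla ID")
    plt.ylabel("estimated density")
    plt.show()
```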
We expect the time travel filter to work perfectly once we account for a few minutes of clock drift. However, the two diagnostic plots used each hold a surprise. Figure
Time difference from creation of Bugzilla entry to time of commit that mentions it.
Figure
Time difference from commit to time of Bugzilla comment that mentions it.
Both TT(B) and TT(C) are very weak, with a strength below 1% and almost no results loss at all. TT(C) has filtering precision 43% and overall precision 73%. TT(B) has filtering precision 99% and overall precision 74%.
For filter LU, we use the same plot as for TT(B), but now we focus on the positive time range. We add 1.5 hours to minimize the distortion from the time zone problem. The density plot is not useful if we start it at time zero, as by far most of the activity is early, not late; we therefore start the density estimation only after 4 hours, as shown in Figure
Time from commit to the appearance of its mention (if any) in a Bugzilla entry.
As Infopark is a one-time-zone company, the plausible cutoff points lie in the nights: the first after 10 hours (clearly overly aggressive; not visible in the density plot because of our removal of the first 4 hours), then after about 32 hours, 55 hours, 80 hours, and 100 hours. An informal survey among Infopark developers suggested "three days" as a good cutoff, so we pick 32 hours as the aggressive choice and 80 hours as both the default and the loose choice.
The difference is not large: filtering precision for 32 (80) hours is 56% (62%), results loss is 3% (2%), and overall precision is 77% in both cases.
Filter UD uses a qualitative criterion and does not have a tuning parameter.
Since the majority of bugfix links are unidirectional, this filter's strength is high (59%), while its filtering precision is low: only 29%. The results are rather extreme: results loss is an unacceptable 42%, but the resulting overall precision is a brilliant 99%.
Given these properties of filter UD, it is obviously not helpful to combine it with the others in a successive filter chain, as even UD alone is too aggressive and has too much results loss.
So we turn UD around and use it as an acceptance stage instead: all bidirectional links are accepted unconditionally, so that no other filter can remove them, while the unidirectional links remain subject to the rest of the filter chain.
With the loose parameter choice for each filter, the resulting COMBI filter has a filtering precision of 70% and a strength of 10%. It achieves an overall precision of 88% at 4.1% results loss; a very good result. The default filter is even better: its filtering precision of 63% leads to 93% overall precision at just 7.1% results loss. The aggressive version is a little less efficient: with filtering precision 48% it achieves 94% overall precision, but at the cost of 14% results loss (strength 27%).
Table
For the loose parameter choices, the acceptance stage reduces results loss from 5.3% to 4.1%; for the aggressive choices it is from 19% to 14%. Overall, this is a smooth and effective ensemble of filters.
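Putting the pieces together, the default COMBI filter has the following overall shape (a sketch reusing the illustrative conventions from the sections above; the rejection filters are applied in order of increasing strength, as in the table):

```python
def combi_filter(candidates, rejection_filters, bidirectional):
    """COMBI: run all rejection filters in turn, then unconditionally
    re-accept every bidirectional link (the acceptance stage)."""
    surviving = list(candidates)
    for f in rejection_filters:   # e.g., TT, FC, LU, SB, FB
        surviving = f(surviving)
    return set(surviving) | set(bidirectional)
```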
As for final overall recall, we need to go back to the estimate of initial recall in Section
For clarification, we will now summarize how one would apply the overall bflinks method to one's own repository and how one can cope with the potentially large variation in repository content and properties that may occur due to different application domains, technology used, development conventions, and idiosyncrasies.
The procedure has to be applied by somebody who knows and understands the repository content well, in particular the text of the commit messages and of the bugtracker descriptions and comments.
(1) Make the version archive and bugtracker data accessible and run the generators BM and, if applicable, CM.
(2) Create the diagnostic plots for choosing the cutoff parameter for each filter as described in Sections
(3) Based on each plot, combined with your understanding of the repository content, select a loose, a default, and an aggressive parameter setting for each filter. Expect to find values potentially much different from those in the present paper. For instance, your bugtracking procedures may produce much higher numbers of correct mentions of some bug IDs (affects filter FB), your quality assurance procedures may produce much later correct updates of bugtracker entries (affects LU), your application domain may involve other number ranges of non-ID numbers mentioned in commits (affects SB), your server clocks may have fewer or more time and time zone issues (affects TT), and so on.
(4) Using the default parameters, compute the filter strength of each filter. If these are similar to the strengths reported in this paper or conform to your repository-specific expectations for some other reason, you may be willing to trust the values and hence the filters and start using the filter chain in this form. Otherwise, switch to the loose or aggressive setting for the problematic filters.
(5) Sort the filters by increasing filter strength and compute the overall filter strength along the chain as shown in the right half of Table
(6) Otherwise, draw a random sample of 100 candidate links, apply the filter chain to them, and manually validate the results. Compute precision. Compute results loss.
(7) If necessary, adjust the parameter settings of your strongest two or three filters until you obtain a reasonable tradeoff between precision and results loss. If results loss appears unacceptably high, your repository is not well suited for bugfix link identification. If precision appears unacceptably low, add syntax matching and keyword matching to the BM generator so that it does not use every integer found but instead requires forms such as "#1234" or "BZ1234".
(8) If precision and results loss are now acceptable for the sample, this provides you with a rough estimate of true precision and true results loss, with approximately the following accuracy: according to the binomial distribution, for observed frequencies of false positives or false negatives of 5% (or 10% or 25%), with a probability of 90%, the actual value will be in the range 2%–9% (or 5%–15% or 18%–32%, resp.); a sketch of this computation follows below.
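The uncertainty ranges quoted in the last step can be reproduced approximately (the exact figures depend on which interval construction is used; we sketch the standard Clopper-Pearson variant) as follows:

```python
from scipy.stats import beta

def clopper_pearson(k, n, confidence=0.90):
    """Two-sided Clopper-Pearson interval for an observed frequency of k out of n."""
    alpha = 1 - confidence
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

for k in (5, 10, 25):   # observed false positives/negatives in a sample of 100
    lower, upper = clopper_pearson(k, 100)
    print(f"{k}/100 observed -> {lower:.0%} to {upper:.0%}")  # close to the ranges quoted above
```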
Keep in mind that no amount of quality of the method (or of validation of that quality on other repositories) and no amount of understanding of your particular repository content will make bugfix link identification succeed if the repository content has too few correct and/or too many misleading mentions of IDs. See also the discussion of external validity below.
It should be clear from the above discussion that the results of bugfix link filtering strongly depend on the development practices of the organization.
Our results serve to explain certain methods by which good results can be obtained, but the actual results are clearly specific to our particular case and might be very different elsewhere.
In particular, the results could be better if the discipline of mentioning Bugzilla IDs and commit IDs were higher or far worse if it were lower; they could also be better if a fixed syntax for Bugzilla ID mentions were used throughout; they could be worse if more Bugzilla IDs or commit IDs were mentioned in other roles than the bugfix link role.
One particularly important issue for the generalizability of our method and findings is the frequency of backward links from the bugtracker to a particular commit. If this were a common practice in the large old version repositories of popular projects such as Apache and Eclipse, somebody would have invented the use of backward links long ago. But how about more recent environments? In particular, the Git version management system (
To determine whether such technology leads to a development culture with many backward links, we performed a study on GitHub ( We searched for the super-generic keyword "project," which gave 151 730 project hits (repository hits) in apparently random order. We took the first 100 of these projects and selected all those that had at least 1000 followers (stars), which resulted in 16 projects ranging from 1041 to 7990 followers. Three of these projects did not use the issue tracker at all. For each of the others, if they used tags, we reviewed the 30 youngest closed tracker entries tagged "bug," "confirmed," or "defect." For projects not using tags, we reviewed all "issue" entries among the 30 youngest closed entries and excluded only those that were very obviously not about a bug (but rather about a feature request or configuration issue). Of these tracker entries, between 0% and 100% (per project) provided backward links to a commit; only one project was below 20%. The average over the 13 projects was 44%, which happens to be quite close to our own data (41%).
We conclude that the applicability of the powerful UD filter will likely be good at least in many younger projects using Git and GitHub.
The manual validation of bugfix links is boring, quasi-repetitive work and is likely to contain some amount of error even for the repository expert. We estimate this error to be on the order of 1% to 2%, too little to modify our conclusions.
Since we have validated only a subset of all bugfix links, the precision measurements also involve sampling error. As described in Section
As mentioned in Section
In this paper, we have described how to use a chain of different filters on candidate bugfix links in order to obtain high precision without losing much recall. In particular, we have shown how to tune those filters' cutoff parameters and how to make use of backward links from defect database to version archive if such links are available. We have evaluated the techniques on a rather large commercial repository by carefully and individually checking over 2500 candidate bugfix links. There are a number of conclusions as follows.
(1) The ambitious quality criteria set by Infopark for avoiding incorrect conclusions (at least 50% recall and at least 80% precision) were met for the bugfix links. Our generator/filter network achieved roughly 65% recall (please observe the discussion in Sections
(2) We present 7 filters, 6 of which have a cutoff parameter that needs tuning. The simple diagnostic plots and tactical reasoning we describe for this parameter tuning have worked very well. In all 6 cases the results were good; in 5 of the 6 cases they also had the properties that were expected; only the FB filter parameter choice turned out to be more aggressive than expected, and needlessly aggressive, too. These approaches and their general idea are practical and appear to be helpful and sufficiently safe.
(3) While we have demonstrated a practical approach to choosing the cutoff values with the help of diagnostic plots, the particular parameter tuning decisions made with this approach need to be idiosyncratic and must take specific properties of the product and the development process into account.
(4) After tuning, it took a chain of 5 of the 6 filters to cross the precision threshold of 80%. We conclude that filtering must not be done too timidly and that many filtering ideas (possibly organization-specific ones) should be combined.
(5) Results loss can be used to guide the aggressiveness of the parameter choice, by means of some manual classification of filtering results where needed.
(6) Unconditional acceptance of bidirectional links helps limit results loss.
(7) Much less results loss needs to be accepted in order to achieve the same high precision if (and only if) a high density of backward links from bugtracker entries to commits is available. Thus, we recommend regularly mentioning bugfix commit IDs in bugtracker comments as a development practice.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors thank Thomas Witt and the development team at Infopark for helping with access to and understanding of the repository data. They thank Franz Zieris for catching two major errors in the paper.