Comparing maintainability index, SIG Method, and SQALE for technical debt identification

Many techniques have emerged to evaluate software Technical Debt (TD). However, the differences in how these techniques report TD have not yet been studied widely, even though they can give different perceptions of the evolution of TD in projects. The goal of this paper is to compare three TD identification techniques: i. the Maintainability Index (MI), ii. SIG TD models, and iii. SQALE analysis. Considering 17 large open source Python libraries, we compare the TD measurement time series in terms of trends across different sets of releases (major, minor, micro). While all methods report generally growing trends of TD over time, MI, SIG TD, and SQALE report different patterns of TD evolution.


INTRODUCTION
Technical Debt (TD) is a metaphor coined by Ward Cunningham in 1993, drawing an analogy between poor decisions during software development and economic debt. Even though short-term decisions can speed up development or the release process, there is an unavoidable interest that will have to be paid in the future.
In general, the impact of TD can be quite relevant for industry. Many studies have found that TD has negative financial impacts on companies (e.g., [14]). Every hour of developer time spent on fixing poor design, or on figuring out how badly documented code works with other modules, instead of developing new features is essentially a waste of money from the company's point of view.
The goal of the paper is to compare three main techniques for source code TD identification that were proposed over time: i. the Maintainability Index (MI) (1994), one of the first attempts to measure TD and still in use, ii. SIG TD models (2011), defined in search of proper code metrics for TD measurement, and iii. SQALE (2011), a framework that attempts to put into more practical terms the indications of the ISO/IEC 9126 standard.

TECHNICAL DEBT (TD)
One of the widely spread definitions of TD is from McConnell in 2008: "A design or construction approach that's expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now (including increased cost over time)" [8]. In the research context, Guo et al. (2014) presented TD as "incomplete, immature, or inadequate artifact in the software development lifecycle" [4]. Theodoropoulos proposed a new, broader definition of TD: "Technical debt is any gap within the technology infrastructure or its implementation which has a material impact on the required level of quality" [13]. Gat (2011) proposed an even more extensive definition of TD: "Quality issues in the code other than function/feature completeness", dividing these into intrinsic and extrinsic quality issues [1]. One of the most important shortcomings of TD definitions is the fact that there is yet to be a unified measurement unit [11]. It is generally complex to quantify most forms of TD. Furthermore, there are different categories of technical debt: code debt, design and architectural debt, environment debt (connected to the hardware/software ecosystem), knowledge distribution and documentation debt, and testing debt [14].
There are not many studies that compare alternative TD identification methods. One reason could be the complexity and time required to implement the methods; another is the limited comparability of the metrics they define. Furthermore, Izurieta et al. [5] note that it can be difficult to compare alternative TD measurement methods due to the missing ground truth and the uncertainties of the measurement process. One of the earliest studies to compare metrics for TD identification was by Zazworka et al. [15], comparing four alternative methods across different versions of Apache Hadoop: a) modularity violations, b) design pattern grime build-up, c) source code smells, and d) static code analysis. The focus was on comparing how such methods behave at the class level. The finding was that the TD identification techniques indicate different classes as problematic, with few overlaps between the methods. Furthermore, Griffith et al. [3] compared ten releases of ten open source systems with three methods of TD identification (i. the SonarQube TD plug-in, ii. a method based on a cost model applied to detected violations, and iii. a method using design disharmonies to derive quality issues). These methods were compared against software quality models. The authors found that only one method had a strong correlation to the quality attributes of reusability and understandability.

MI, SQALE, SIG TD COMPARISON
For the definition of the experimental evaluation, we defined the following goal: to analyze technical debt evaluation techniques (MI, SQALE, SIG TD) for the purpose of comparing their similarity with respect to the trends and evolution of the measurements, from the point of view of practitioners aiming at measuring TD. The goal was refined into three main research questions (RQs):

RQ1: Are the trends of the measurements provided by the three methods comparable? Metric: Pearson correlation between trends.

RQ2: How do the trends of TD compare across different release types? Metric: comparison of trends by release type.

RQ3: How much can one method be used to forecast another one? Metric: Granger causality between time series.

Compared TD Measurement Techniques
Of the many methods proposed over time, we focus on three representative methods for TD identification: i. the Maintainability Index (MI), ii. SIG TD models, and iii. SQALE analysis. These are methods that practitioners can still find implemented in TD analysis tools.

Maintainability Index (MI)
MI was introduced in 1994, with the goal to find a simple, applicable model that is generic enough for a wide range of software [10]:

MI = 171 − 5.2 × ln(HV) − 0.23 × CC − 16.2 × ln(LoC) + 50 × sin(√(2.4 × CMT))

where HV is the average Halstead Volume per module, CC is the average Cyclomatic Complexity per module, LoC is the average lines of code per module, and CMT is the average lines of comments per module. A derivative formula of MI is still used in some popular code editors (e.g., Microsoft Visual Studio), so it is relatively easy for practitioners to adopt it to detect TD.
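
As a concrete illustration, here is a minimal sketch of the MI computation in Python; the function and the example input values are ours, for illustration only:

```python
import math

def maintainability_index(hv: float, cc: float, loc: float, cmt: float) -> float:
    """Classic four-metric MI; all inputs are averages per module.

    hv  -- average Halstead Volume
    cc  -- average Cyclomatic Complexity
    loc -- average lines of code
    cmt -- average lines of comments
    """
    return (171
            - 5.2 * math.log(hv)      # natural logarithm, as in the 1994 formula
            - 0.23 * cc
            - 16.2 * math.log(loc)
            + 50 * math.sin(math.sqrt(2.4 * cmt)))

# Hypothetical module profile (illustrative values only): lower MI means worse maintainability.
print(round(maintainability_index(hv=1200.0, cc=6.0, loc=85.0, cmt=12.0), 2))
```
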
SIG TD Models
The Software Improvement Group (SIG) defined in 2011 a model which quantifies TD based on an estimation of repair effort and an estimation of maintenance effort, providing a clear picture of the cost of repair, its benefits, and the expected payback period [9]. Quantifying the TD of a project is done in several steps and requires the calculation of three different variables: Rebuild Value (RV), Rework Fraction (RF), and Repair Effort (RE).
Rebuild Value is defined as an estimate of the effort (in man-months) that needs to be spent to rebuild a system using a particular technology. To calculate this value, the following formula is used:

RV = SS × TF

where SS is the System Size in Lines of Code and TF is a Technology Factor, a language productivity factor.
Rework Fraction is defined as an estimate of the percentage of LoC to be changed in order to improve the quality by one level. The values of the RF between two quality levels are empirically defined [9].
Finally, the Repair Effort is calculated by multiplying the Rework Fraction by the Rebuild Value. It is possible to further multiply it by a Refactoring Adjustment (RA) factor, which captures external, context-specific aspects of the project and represents a discount on the overall technical debt of the project:

RE = RF × RV × RA

SQALE
Software QuALity Enhancement (SQALE) focuses on the operationalization of the ISO 9126 Software Quality standard by means of several code metrics that are attached to the taxonomy defined in ISO 9126 [6]. Mimicking the ISO 9126 standard, SQALE has a first level defining the characteristics (e.g., testability), further sub-characteristics (e.g., unit testing testability), and further source code level requirements. Such source code requirements are then mapped to remediation indexes that translate into the time/effort required to fix the issues. For the calculation of TD, the Remediation Cost (RC) represents the cost to fix the violations of the rules defined for each category, and the overall TD is the sum of the remediation costs of all detected violations [7]. For SQALE, we adopted the SonarQube implementation: a default set of rules was used, which is claimed to be the best-practice, minimum set of rules to assess technical debt.
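
To make the two estimation schemes concrete, the following is a minimal sketch in Python. The technology factor, rework fraction, rule names, and per-violation costs are illustrative assumptions, not the calibrated values from [9] or from the SonarQube rule set:

```python
def sig_repair_effort(system_size_loc: int,
                      technology_factor: float,
                      rework_fraction: float,
                      refactoring_adjustment: float = 1.0) -> float:
    """SIG TD model: RE = RF x RV (x RA), with RV = SS x TF."""
    rebuild_value = system_size_loc * technology_factor  # RV, in man-months
    return rework_fraction * rebuild_value * refactoring_adjustment

def sqale_remediation_cost(violations: dict[str, int],
                           cost_per_violation: dict[str, float]) -> float:
    """SQALE-style TD: sum of remediation costs over all rule violations."""
    return sum(count * cost_per_violation[rule]
               for rule, count in violations.items())

# Illustrative usage (all numbers are made up for the example).
re_man_months = sig_repair_effort(system_size_loc=120_000,
                                  technology_factor=0.00015,
                                  rework_fraction=0.10)
rc_minutes = sqale_remediation_cost(
    violations={"unused-import": 40, "high-complexity": 7},
    cost_per_violation={"unused-import": 2.0, "high-complexity": 30.0})
print(re_man_months, rc_minutes)
```
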

Rationale & Methods
To compare the three methods, we looked at the time series of all the measures collected by the three methods. The analyzed packages were randomly selected from the list of the 5000 most popular Python libraries (the full list can be found in [12]). We define a time series T, consisting of data points of TD at each release time R = {r_1, r_2, ..., r_n}, as T = {t_{r_1}, t_{r_2}, ..., t_{r_n}}. The MI measure is the inverse of the other measures, as it gives an indication of the maintainability of the project (the lower, the worse), while the other methods give an indication of TD accumulating (the higher, the worse).
For the other parts of the analysis, to compare the time series, we reversed the MI index to make it comparable. For RQ1 (trends of measurements), we compute the ∆ of TD measurements between two consecutive releases for each of the projects. For release r_i, ∆TD is defined as:

∆TD_{r_i} = TD_{r_i} − TD_{r_{i−1}}

We then compute the Pearson correlations between all the points of each of the compared methods. The results of ∆TD_{r_i} for each of the time series are also shown in aggregated form in boxplots.
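
As an illustration of this step, a minimal sketch in Python, assuming each method's measurements are already extracted and aligned per release (the variable names and values are ours):

```python
import numpy as np
from scipy.stats import pearsonr

def td_deltas(td_series: list[float]) -> np.ndarray:
    """Delta TD between consecutive releases: TD_{r_i} - TD_{r_{i-1}}."""
    return np.diff(np.asarray(td_series, dtype=float))

# Illustrative TD series for one project (made-up values);
# MI is reversed (negated) beforehand so that higher always means more TD.
sig = td_deltas([10.0, 12.5, 12.5, 14.0, 13.0])
sqale = td_deltas([30.0, 31.0, 31.0, 33.5, 33.5])
mi_reversed = td_deltas([-80.0, -78.0, -77.5, -76.0, -76.5])

r, p = pearsonr(sig, sqale)  # correlation of the delta-TD trends
print(f"SIG vs SQALE: r={r:.2f}, p={p:.3f}")
```
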
For RQ2, we consider different types of project releases: major (e.g., 0.7.3 → 1.0.0), minor (e.g., 0.7.3 → 0.8.0), and micro (e.g., 0.9.0 → 0.9.1) releases, looking at the differences in TD trends. To answer this research question, we look at ∆↑ as increasing trends, ∆↓ as decreasing trends, and ∆0 as periods between releases in which TD did not change, where each ∆TD_{r_i} is categorized as ∆↑ if ∆TD_{r_i} > 0, ∆0 if ∆TD_{r_i} = 0, and ∆↓ if ∆TD_{r_i} < 0.

For RQ3, we look at how much one of the three methods can be used to forecast the results of another method. We take into account the time series of the measurements from the three methods (MI, SIG TD, SQALE) and compute the Granger causality between methods in pairs. The Granger causality test, first proposed in 1969 by Clive Granger, is a statistical hypothesis test used to determine whether one time series can be used to predict the values of another [2]. More precisely, we can report that T1 "Granger-causes" T2 if the lags of T1 (i.e., T1_{t−1}, T1_{t−2}, T1_{t−3}, ...) provide predictive capability over T2 beyond what is provided by the own lags of T2. The null hypothesis is that T2 does not Granger-cause the time series of T1. We adopted the standard SSR-based F test: if the probability value is less than 0.05, T2 Granger-causes T1.

We ran Wilcoxon Signed-Rank Tests (paired tests) to evaluate the mean rank differences for the correlations. For the Wilcoxon Signed-Rank Test, we calculate the effect size as r = Z / √N, where N = #cases × 2, to account for non-independent paired samples, using Cohen's definition to discriminate between small (0.0 to 0.3), medium (0.3 to 0.6), and large (> 0.6) effects. The difference is statistically significant for SQALE-MI vs SIG-MI (p-value 0.044, p < 0.05, two-tailed, medium effect size r = 0.34), while it is not significant for SQALE-SIG vs SQALE-MI (p-value 0.423, p ≥ 0.05, two-tailed).
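
The RQ3 computation can be sketched with statsmodels as follows; the two series below are illustrative stand-ins for a project's aligned TD measurements:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Illustrative aligned delta-TD series for two methods (made-up values).
sqale = np.array([1.0, 0.0, 2.0, 1.5, 0.0, 3.0, 1.0, 0.5, 2.5, 1.0,
                  0.0, 1.5, 2.0, 0.5, 1.0, 3.5, 0.0, 2.0, 1.5, 1.0])
sig = np.array([0.5, 1.0, 0.0, 2.0, 1.0, 1.5, 2.5, 0.0, 1.0, 2.0,
                1.5, 0.0, 2.5, 1.0, 0.5, 2.0, 1.5, 1.0, 0.0, 2.5])

# Column order matters: grangercausalitytests checks whether the SECOND
# column Granger-causes the FIRST one.
data = np.column_stack([sig, sqale])
results = grangercausalitytests(data, maxlag=2)

# SSR-based F test p-value at lag 1: reject H0 ("SQALE does not
# Granger-cause SIG") when p < 0.05.
p_value = results[1][0]["ssr_ftest"][1]
print(f"p = {p_value:.3f}")
```
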

Results
When we look at the trends in comparisons between every two consecutive releases (Table 1), the trend is similar for SIG and MI (in line with the correlations discussed previously), with a slight difference in the falling trend. This seems to indicate that, according to MI, TD tends to be repaid more often than according to SIG. With SQALE, however, we can observe that TD was more stable across different releases (Table 1).

RQ1 Findings. SIG TD and MI are the models which show statistically significant comparability in terms of correlation of the trends of TD changes. SQALE and SIG TD also show similarities, though not statistically significant ones. Generally, SQALE and MI are the models that show the lowest correlation in trends.

RQ2 is similar to RQ1, but here we look at the comparison based on release types, i.e., whether major, minor, and micro releases matter for the differences in TD identification. Comparisons solely between major releases brought interesting results (Table 2), similar to the results on all releases. Throughout all comparisons, most major releases caused TD to rise for each analysis. SQALE again had the highest number of steady trends, and the most TD repayments (falling trends) were recorded with MI.

The rise of TD is stronger at the minor release level for both SIG TD and MI, as each of these methods shows a larger share of rising trends compared to major releases (see Table 3), while SQALE showed a decrease in growing trends. As in the previous cases, SQALE recorded more periods of steady TD, and MI the most repayments of TD (to a much larger extent than SIG TD and SQALE).

The last comparison was done on micro releases. The same trends were observed at this level: the vast majority of releases induced more TD according to SIG and MI, while according to SQALE the majority of releases did not change TD (see Table 4). Again, MI is the method that reports the most TD repayment (21.99%).

CONCLUSIONS
Comparing three main methods for TD identification (Maintainability Index (MI), SIG TD models, SQALE) on a set of 17 Python projects, we can see increasing trends of reported TD in all cases. However, there are different patterns in the evolution of the measurement time series. MI and SIG TD report generally more growing trends of TD compared to SQALE, which shows more periods of steady TD. MI is the method that reports considerably more repayments of TD compared to the other methods. SIG TD and MI are the models that show more similarity in the way TD evolves, while SQALE and MI are the least comparable. Granger causality for all projects and combinations of methods shows that there is a limited dependency between the TD time series. We could find some relationships between SQALE & MI, and between SQALE & SIG TD models, in the sense that previous lags of the SQALE time series could be used to improve the prediction of the other models in one third of the projects.