A Tool-Based Perspective on Software Code Maintainability Metrics: A Systematic Literature Review



Introduction
Nowadays, software security and resilience have become increasingly important, given how pervasive software is. Effective tools and programming languages can (i) discover mistakes earlier, (ii) reduce the odds of their occurrence, and (iii) make a large class of common errors impossible by restricting at compile time what the programmer can do. Several best practices are consolidated in software engineering, e.g., continuous integration, testing with code coverage measurement, and language sanitization. All these techniques allow the automatic application of code analysis tools, which can significantly enhance source code quality and allow software developers to efficiently detect vulnerabilities and faults [1]. However, the lack of comprehensive tooling may make it challenging to apply the same code analysis strategies to software projects developed with different languages or for different domains.
The literature defines software maintainability as the ease with which a software system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changing environment [2]. Thus, maintainability is a highly significant factor in the economic success of software products. Several studies have described models and frameworks, based on software metrics, to predict or infer the maintainability of a software project [3][4][5]. However, although many different metrics have been proposed by the scientific literature over the course of the last 40 years, the available models are very language- and domain-specific, and there is still no agreement in industry and academia about a universal set of metrics to adopt to evaluate software maintainability [6].
This work aims at answering the primary need of identifying evaluation frameworks for different programming languages, either established or newly emerged, e.g., the Rust programming language, developed by Mozilla Research as a language similar in characteristics to C++, but with better code maintainability, memory safety, and performance [7,8].
Thus, the first goal of this paper is to find the metrics most commonly mentioned in the state-of-the-art literature. We focused on static metrics since the analysis of dynamic metrics (i.e., metrics collected during the execution of adequately instrumented software [9]) was out of the scope of this work. The second goal of the paper is to determine which tools are most commonly used in the literature to calculate source code metrics. Based on the most used tools, we then define an optimal selection able to compute the most popular metrics for a set of programming languages.
To pursue both goals, we (i) applied the systematic literature review (SLR) methodology on a set of scientific libraries and (ii) performed a thorough analysis of all the primary studies available in the literature about the topic of software metrics for maintainability. Hence, this manuscript provides the following contributions to researchers and practitioners:
(i) The definition of the most mentioned metrics that can be used to measure software maintainability for software projects
(ii) Details about closed-source and open-source tools that can be leveraged by practitioners to evaluate the quality of their software projects
(iii) Optimal sets of open-source tools that can be leveraged to investigate the computation of software metrics for maintainability, adopt them in evaluation frameworks, and adapt them to other programming languages that are currently not supported
The remainder of the manuscript is structured as follows: (i) Section 2 describes the approach we adopted to conduct our SLR; (ii) Section 3 presents a discussion of the results obtained by applying such approach; (iii) Section 4 discusses the threats to the validity of the present study; (iv) Section 5 provides a comparison of this study with existing related work in the literature; (v) Section 6 concludes the paper and provides directions for future research.

Research Method
In this section, we outline the method that we used to carry out this study. We performed a systematic literature review (from now on, SLR), following the guidelines provided by Kitchenham and Charters [10] to structure the work and report it in an organized and replicable manner. An SLR is considered one of the key research methodologies of evidence-based software engineering (EBSE) [11]. The methodology has gained significant attention from software engineering researchers in recent years [12]. All SLRs include three fundamental phases: (i) planning the review (which includes specifying its goals and research questions); (ii) conducting the review (which includes querying article repositories, selecting the studies, and performing data extraction); and (iii) reporting the review.
All those steps have been undertaken during this research and are detailed in the following sections of this paper.

2.1. Planning.
According to the guidelines of Kitchenham and Charters, the planning phase of an SLR involves the identification of the need for the review (hence the definition of its goals), the definition of the research questions that will guide the review, and the development of the review protocol to be used.

Goals.
The need for the review, as said in the introduction, came from the need to improve software maintainability, in terms of the clarity of the source code, while implementing complex algorithms. Our primary objective was to identify a dependable set of metrics that are widely used in the literature and can be computed with available tools. The objectives of our research are defined by using the Goal-Question-Metric paradigm by van Solingen et al. [13]. Specifically, we based our research on the following goals: (iv) RQ2.2: what is the ideal selection of tools able to apply the most popular metrics for the most supported programming languages? This research question entails measuring the coverage provided by the set of the most popular metrics for each language and providing the optimal set of tools that can compute those metrics.

Selected Digital Libraries.
The search strategy involves the selection of the search resources and the identification of the search terms. For this SLR, we used the following digital libraries: (i) ACM Digital Library, (ii) IEEE Xplore, (iii) Scopus, and (iv) Web of Science.

Search Strings.
The formulation of the search strings is crucial for the definition of the search strategy of the SLR. According to the guidelines defined by Kitchenham et al., the first operation in defining the search string involved an analysis of the main keywords used in the RQs, their synonyms, and other possible spellings of such words.
In this phase, all the researchers collaboratively selected several pilot studies. The selected pilot studies are presented in Table 1 and are related to the target research domain. These studies are used to verify the quality of the search queries: the researchers should revise the queries if the pilot studies are not present in the results after the refining phase. The starting keywords identified were software, maintainability, and metrics. The search string "software maintainability metric" was hence used to perform the first search on the selected digital libraries. Our results include articles published between 2000 and 2019.
This first search showed that adding the keyword code, as a synonym of software, added a large number of papers to the results.
Also, the following keywords were excluded from the search to reduce the number of unfitting papers in the results:
(i) Defect and fault, to avoid considering manuscripts more related to the topic of verification and validation, error-proneness, and software reliability prediction than to code maintainability
(ii) Co-change, to avoid considering manuscripts more related to the topic of code evolution
(iii) Policy-driven and design, to avoid considering manuscripts more related to the definition and usage of metrics used to design software, instead of evaluating existing code
Table 2 reports the search queries before and after excluding the keywords listed above, for each of the chosen digital libraries.
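The refinement of the search string can be sketched as a small helper that assembles a query from a group of synonyms, a list of required keywords, and a list of excluded keywords. This is only an illustration: the function name `build_query` and its parameters are assumptions introduced here, and the operator syntax mimics the ACM Digital Library query reported in Table 2, so it would need to be adapted for the other libraries.

```python
def build_query(synonym_group, required, excluded=()):
    """Assemble a search string in the style of the ACM Digital Library query.

    synonym_group: alternative spellings grouped in one required clause.
    required: keywords that must each appear.
    excluded: keywords whose presence discards a result.
    """
    # Synonyms share a single required clause, e.g. +("software" "code").
    group = " ".join(f'"{word}"' for word in synonym_group)
    query = f"+({group})"
    # Every other keyword becomes a required clause of its own.
    query += "".join(f"+({word})" for word in required)
    # Excluded keywords are appended as negated clauses.
    query += "".join(f"-({word})" for word in excluded)
    return query


before = build_query(["software", "code"], ["metrics", "maintainability"])
after = build_query(
    ["software", "code"],
    ["metrics", "maintainability"],
    excluded=["defect", "fault", "co-change", "policy-driven"],
)
print(before)
print(after)
```

Keeping the excluded keywords in a separate list makes it easy to rerun the search with and without the exclusions and to check that the pilot studies are still returned.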

Inclusion and Exclusion Criteria.
The final phase of the study selection uses the studies obtained by applying the final search queries detailed below. The following are the inclusion criteria used for the study selection:
IC1: studies written in a language comprehensible by the authors
IC2: studies presenting a new metric accurately
IC3: studies that present, analyze, or compare known metrics or tools
IC4: detailed primary studies
The exclusion criteria, on the other hand, are defined as follows:
EC1: studies written in a language not directly comprehensible by the authors, i.e., not written in English, Italian, Spanish, or Portuguese
EC2: studies that present a novel metric but do not describe it accurately
EC3: studies that do not describe or use metrics or tools
EC4: secondary studies (e.g., systematic literature reviews, surveys, and mappings)

2.2. Conducting.
After defining the review protocol in the planning phase, the conducting phase involves its actual application, the selection of papers by application of the search strategy, and the extraction of relevant data from the selected primary studies.

Study Search.
This phase consisted of gathering all the studies by applying the search strings formulated and discussed in Section 2.1.4 to the selected digital libraries. To this end, we leveraged the Publish or Perish (PoP) tool [17]. To aid the replicability of the study, we report that we performed the last search iterations at the end of October 2019. After the application of the queries on the four considered digital libraries and the removal of the duplicate papers, 801 unique papers were gathered (see Table 3). The result of this phase is a list of candidate papers that must be subjected to the exclusion and inclusion criteria, which yields the final verdict on their selection as primary studies for our SLR. We exported the mined papers to a CSV file with basic information about each extracted manuscript.
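The removal of duplicates across libraries and the CSV export can be illustrated with a minimal sketch. It assumes that each search result is a dictionary with `title`, `year`, and `library` fields and that two records are duplicates when their normalized titles coincide; both the field names and the matching rule are assumptions made for this example, not a description of how the Publish or Perish tool works. The sample records are hypothetical.

```python
import csv
import io
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_unique(records):
    """Keep the first record seen for each normalized title."""
    unique = {}
    for record in records:
        unique.setdefault(normalize_title(record["title"]), record)
    return list(unique.values())


def to_csv(records, fields=("title", "year", "library")):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical search results returned by two different libraries.
results = [
    {"title": "Metrics for Maintainable Code", "year": 2015, "library": "Scopus"},
    {"title": "Metrics for maintainable code.", "year": 2015, "library": "IEEE Xplore"},
    {"title": "A Tool for Code Quality", "year": 2018, "library": "Scopus"},
]
unique_papers = merge_unique(results)
print(to_csv(unique_papers))
```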

Study Selection.
The authors of this SLR carried out the paper selection process independently. To analyze the papers, we used a 5-point Likert scale instead of simply dividing them into fitting and unfitting. We assigned the points as follows:
(i) One point to papers that matched exclusion criteria and did not match any inclusion criteria
(ii) Two points to papers that matched some exclusion criteria and some inclusion criteria
(iii) Three points to papers that did not match any criteria (neither exclusion nor inclusion)
(iv) Four points to papers that matched some, but not all, inclusion criteria
(v) Five points to papers that matched all inclusion criteria
We analyzed the studies in two steps. First, we read the title and abstract to check the immediate compliance of the paper with the inclusion and exclusion criteria. Then, for papers that received 3 points after reading the title and abstract, the full text was read, with particular attention to possible usage or definition of metrics throughout the body of the article. At the end of the second read, none of the uncertain studies was evaluated as fitting our research needs, and hence, no other primary study was added to our final pool.
During this phase, we also applied the process of snowballing. Snowballing refers to using the reference list of the included papers to identify additional papers [18]. The application of snowballing, for this specific SLR, did not lead to any additional paper to take into consideration.
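The five-point assignment used in the study selection can be written as a simple decision function. The sketch below assumes that, for each paper, the reviewer records how many inclusion and exclusion criteria it matches; the function name `selection_points` and the default of four inclusion criteria (IC1 to IC4) are illustrative choices.

```python
def selection_points(matched_inclusion, matched_exclusion, total_inclusion=4):
    """Map the number of matched criteria to the 1-5 scale used for selection."""
    if matched_exclusion > 0 and matched_inclusion == 0:
        return 1  # only exclusion criteria matched
    if matched_exclusion > 0:
        return 2  # some exclusion and some inclusion criteria matched
    if matched_inclusion == 0:
        return 3  # no criteria matched: the full text must be read
    if matched_inclusion < total_inclusion:
        return 4  # some, but not all, inclusion criteria matched
    return 5      # all inclusion criteria matched


def needs_full_text(points):
    """Papers scoring 3 after title and abstract are read in full."""
    return points == 3


print(selection_points(matched_inclusion=2, matched_exclusion=0))
```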

Data Extraction.
In this phase, we read each identified primary study again, to mine relevant data for addressing the formulated RQs. We created a spreadsheet form to be filled in for each of the considered papers, containing the data of interest subdivided by the RQ they contributed to answering. The data extraction phase, again, was performed independently by all the authors of this paper.
For each paper, we collected some basic context information:
(i) Year of publication
(ii) Number of times the paper was viewed fully and number of citations
(iii) Authors and location of the authors
To answer RQ1.1, we needed to inspect the set of primary studies to understand which metrics they defined or mentioned. Hence, for each paper, we extracted the following data:

(i) The list of metrics and metric suites utilized in each paper
(ii) The programming languages and the family of programming languages (e.g., C-like and object oriented) for which the used or proposed metrics can be computed
To answer RQ1.2, we wanted to give an additional classification of the metrics, other than the number of mentions. We took into consideration the opinion of the authors on each of the metrics studied in their papers. This allowed us to evaluate whether a metric is considered useful or not in most papers, and to measure the popularity of the metrics by computing the difference between positive and negative citations by authors.

Library: ACM Digital Library
Before refinement: +("software" "code")+(metrics)+(maintainability)
After refinement: +("software" "code")+(metrics)+(maintainability)-(defect)-(fault)-(co-change)-(policy-driven)
To answer RQ2.1, we needed to inspect the primary studies to understand which tools they presented or used to compute the metrics that were adopted. For each paper that mentioned tools, we hence gathered the following information:
(i) The list of tools described, used, or cited by each paper
(ii) When possible, the list of metrics that can be calculated by each tool
(iii) The list of programming languages on which the tool can operate
(iv) The type of the tool, i.e., whether the tool is open source or not
Finally, to answer RQ2.2, we had to correlate the information gathered for the previous research questions. We achieved this by finding the tool or tools covering the metrics that proved to be the most popular among the selected primary studies.
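Finding a set of tools that covers the most popular metrics is an instance of the set cover problem. The sketch below uses a greedy heuristic, which is an assumption made for illustration: the text does not state which procedure was used, and a greedy choice is not guaranteed to return the smallest possible set. The tool names and the metric sets are hypothetical.

```python
def select_tools(tool_metrics, target_metrics):
    """Greedily pick tools until the target metrics are covered.

    tool_metrics: dict mapping a tool name to the set of metrics it computes.
    Returns the chosen tools and the metrics that no tool can compute.
    """
    remaining = set(target_metrics)
    chosen = []
    while remaining:
        # Pick the tool covering most uncovered metrics (ties: alphabetical).
        best = max(sorted(tool_metrics), key=lambda t: len(tool_metrics[t] & remaining))
        gained = tool_metrics[best] & remaining
        if not gained:
            break  # the remaining metrics are not supported by any tool
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Hypothetical tools and the metrics each one supports.
tools = {
    "ToolA": {"CC", "LOC", "MI"},
    "ToolB": {"WMC", "DIT", "CC"},
    "ToolC": {"LOC"},
}
popular = {"CC", "LOC", "MI", "WMC", "DIT", "JLOC"}
chosen, uncovered = select_tools(tools, popular)
print(chosen, uncovered)
```

Returning the uncovered metrics alongside the chosen tools makes the coverage gap for a given language explicit.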

Data Synthesis and Reporting.
In this phase, we elaborated the data extracted and synthesized previously to obtain a response for each of the research questions we had. Having all the data we needed, in the shape of a form per paper analyzed, we proceeded with the data synthesis.
We gathered all the metric suites and the metrics we found in tables, keeping track of the papers mentioning them. We computed aggregate measures on the popularity value assigned to each metric.

Results
This section describes the results obtained to answer the research questions described in Section 2.1.2. The appendices of this paper report the complete tables with the extracted data to improve the readability of this manuscript.
At the end of this phase, we collected a final set of 43 primary studies for the subsequent phase of our SLR. Figure 1 reports the distribution of the selected papers over the considered time frame, and Figure 2 indicates the distribution of the authors of related studies over the world. We report the selected papers in Table 4. The statistics suggest that the interest in software maintainability metrics has grown since 2008 and has increased further since 2016 (see the barplot in Figure 1).

RQ1.1: Available Metrics.
The papers selected as primary studies for our SLR cited a total of 174 different metrics. We report all the metrics in Table 5 in the appendix. The table reports (i) the metric suite (empty if the metric is not part of any specific suite), (ii) the metric name (acronym, if existing, and a full explanation, if available), and (iii) the list of papers that mention the metric. The last two columns, respectively, report (iv) the total number of papers mentioning the metric (i.e., the number of studies in the third column) and (v) the score we gave to each metric. We computed the score in the following way: (i) +1 if the study used (or defined) the metric or the authors of the study expressed a positive opinion about it; (ii) −1 if the paper criticized the metric. The last two columns of the metrics table are identical most of the time. This is because the majority of the papers we found just use the metrics without commenting on them, either positively or negatively.
It is immediately evident that some suites and metrics are taken into consideration much more often than others. More than 75% of the metrics are mentioned by just a single paper. The boxplots in Figure 3 show, in red, the distribution of the total number of mentions and the score for all the considered metrics. It is evident, from the boxplots, that the difference between the two distributions is rather limited, confirming the vast majority of neutral or positive opinions when the metrics are referenced in a research paper. Since only 24.7% of the metrics are used by more than one of our selected studies, the median values of both the measured indicators, "TOT" and "Score", are equal to 1 if the whole set of metrics is considered.
In general, however, it is worth underlining that a low score does not necessarily mean that the metric is of lesser quality but instead that it is less known in the related literature. Another interesting thing to point out is that we did not find a particular metric that received many negative scores.
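The scoring rule described in this section (+1 for a use, a definition, or a positive opinion; −1 for a criticism) can be sketched as follows. The input format, a list of (metric, study, opinion) tuples, and the study identifiers are assumptions made for this example.

```python
from collections import defaultdict


def metric_scores(mentions):
    """Return {metric: (total_mentions, score)} from (metric, study, opinion) tuples.

    opinion is 'use', 'positive', or 'negative'; only 'negative' subtracts a point.
    """
    total = defaultdict(int)
    score = defaultdict(int)
    for metric, _study, opinion in mentions:
        total[metric] += 1
        score[metric] += -1 if opinion == "negative" else 1
    return {metric: (total[metric], score[metric]) for metric in total}


# Hypothetical extraction results for three studies.
mentions = [
    ("LOC", "S1", "use"),
    ("LOC", "S2", "use"),
    ("LOC", "S3", "negative"),
    ("CC", "S1", "positive"),
    ("CC", "S2", "use"),
]
table = metric_scores(mentions)
print(table)
```

With this rule, each critical study lowers the score by two points relative to the total number of mentions, which is why the two columns coincide whenever no study criticizes the metric.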

RQ1.2: Most Mentioned Metrics.
Since our analysis was aimed at finding the most popular metrics, in order to extract a set of them to be adapted to different languages, we were interested in finding metrics mentioned by multiple papers. In Table 6, we report the metrics that were used by at least two papers among the selected primary studies. This operation allowed us to reduce the noise caused by metrics that were mentioned only once (possibly in the papers where they were originally defined). After applying this filter, only 43 metrics (24.7% of the original set of 174) remained. The boxplots in Figure 3 show, in green, the distributions of the total number of mentions and the measured score for this set of metrics. On these distributions, the rounded median value is 3 for the total number of mentions and 3 for the score. Since our final aim in answering RQ1.2 was to find a set of most popular metrics for the maintainability of source code, we selected, from the complete set of 43 metrics mentioned in at least two papers, those whose score was above the median.
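The two filtering steps of this section (keep the metrics mentioned at least twice, then keep those at or above the rounded medians) can be sketched as below. The comparison uses "greater than or equal to" for both the number of mentions and the score, which is an assumption, since the text describes the threshold both as "above" and as "higher or equal to" the median. The sample table is hypothetical.

```python
from statistics import median


def most_popular(table, min_mentions=2):
    """Select metrics at or above the rounded medians of mentions and score.

    table: dict mapping a metric name to (total_mentions, score).
    """
    shortlisted = {m: v for m, v in table.items() if v[0] >= min_mentions}
    if not shortlisted:
        return []
    # Note: Python's round() rounds exact halves to the nearest even integer.
    median_total = round(median(v[0] for v in shortlisted.values()))
    median_score = round(median(v[1] for v in shortlisted.values()))
    return sorted(
        metric
        for metric, (total, score) in shortlisted.items()
        if total >= median_total and score >= median_score
    )


# Hypothetical (total mentions, score) pairs for six metrics.
sample = {"A": (1, 1), "B": (2, 2), "C": (3, 3), "D": (5, 5), "E": (4, 2), "F": (6, 6)}
selected = most_popular(sample)
print(selected)
```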
With this additional filtering, we obtained a set of 13 metrics and 2 metric suites, which are reported in Table 7. Two suites were included in their entirety (namely, the Chidamber and Kemerer suite and the Halstead suite) because all of their metrics had a total number of mentions and a score higher than or equal to the median. For them, the table reports the lowest number of mentions and the lowest score among those of the contained metrics. Instead, for the Li and Henry suite, only the MPC (message passing coupling) metric obtained a number of mentions and a score above the median and hence was included in our set of selected most popular metrics. A brief description of the selected most popular metrics is reported in the following. The metrics are listed in alphabetical order:
(i) CC (McCabe's Cyclomatic Complexity). It was developed by McCabe in 1976 [56] and is a metric meant to calculate the complexity of code by examining the control flow graph of the program, i.e., counting its independent execution paths based on the flow graph [14]. The assumption is that the complexity of the code is correlated to the number of execution paths of its flow graph. It has also been shown that there exists a linear correlation between the CC and the LOC metrics, as found by Jay and Hale. Such relationship is independent of the programming language and code paradigm used [57]. Each node in the flow graph corresponds to a block of code in the program where the flow is sequential; the arcs correspond to branches that can be taken by the control flow during the execution of the program. Based on those building blocks, the CC of a source code is defined as $M = e - n + 2p$, where n is the number of nodes of the graph, e is the number of edges of the graph, and p is the number of connected components, i.e., the number of exits from the program logic [6].
(ii) CE (Efferent Coupling). It is a metric that measures how many data types the analyzed class utilizes, apart from itself. The metric takes into consideration the known type inheritance, the interfaces implemented by the class, the types of the parameters of its methods, the types of the declared attributes, and the types of the used exceptions.
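The cyclomatic complexity formula of item (i), M = e − n + 2p, can be evaluated on a small control flow graph. The sketch below assumes the graph is given as a list of node labels and a list of directed edges; the number of connected components p is computed with a union-find structure. The function name and the example graphs are illustrative.

```python
def cyclomatic_complexity(nodes, edges):
    """Compute M = e - n + 2p for a control flow graph."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components joined by each edge (direction is irrelevant here).
    for a, b in edges:
        parent[find(a)] = find(b)
    components = len({find(node) for node in nodes})
    return len(edges) - len(nodes) + 2 * components


# Control flow graph of a single if/else statement.
if_else_nodes = ["cond", "then", "else", "join"]
if_else_edges = [("cond", "then"), ("cond", "else"), ("then", "join"), ("else", "join")]
print(cyclomatic_complexity(if_else_nodes, if_else_edges))
```

For the if/else graph, e = 4, n = 4, and p = 1, so M = 2, which matches the two independent execution paths through the statement.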

(iii) CHANGE (Number of Lines Changed in the Class).
It is a change metric, which measures how many lines of code are changed between two versions of the same class of code. This metric is hence not defined on a single version of the software project, but it is tailored to analyze the evolution of the source code. The assumption behind the usage of [...] In the literature, there is general agreement on how to count modifications: a modification typically counts twice as much as an addition or a deletion (the modification is considered as a deletion followed by an addition). Most of the time, comments and blank lines are not considered in the computation of the changed LOCs during the evolution of software code.
(iv) C&K (Chidamber and Kemerer Suite). It is one of the best-known sets of metrics, which was introduced in 1994 [58]. This suite has been designed with the object-oriented approach in mind. It is composed of 6 metrics, listed as follows:
WMC, weighted methods per class, is defined in the same way as McCabe's WMC (weighted method count, described below) but applied to a class, i.e., it gives the complexity of that particular class by adding together the CC of all the methods within that same class [58].
DIT, depth of inheritance tree, is defined as the length of the maximal path from the leaf node to the root of the inheritance tree of the classes of the analyzed software. Inheritance helps to reuse the code; therefore, it increases the maintainability. The side effect of [...] Having one, two, or even three levels of inheritance can help the maintainability, but increasing the value further is deemed detrimental.
NOC, number of children, is the number of immediate subclasses of the analyzed class. As the NOC increases, the maintainability of the code increases.
CBO, coupling between objects, is the number of classes with which the analyzed class is coupled. Two classes are considered coupled when methods declared in one class use methods or instance variables defined by the other class. Thus, this metric gives an idea of how interlaced the classes are with each other and hence how much influence the maintenance of a single class has on the other ones.
RFC, response for class, is defined as the set of methods that can potentially be executed in response to a message received by an object of that class. Also in this case, the greater the returned value, the greater the complexity of the class.
LCOM, lack of cohesion in methods, is defined as the difference between the number of method pairs having no attributes in common and the number of method pairs having common attributes. Several other versions of the metric have been provided in the literature. High values of LCOM indicate that the methods of the class are relatively disparate in nature.

Suite; Metric; Mentioning studies; TOT; Score
Chidamber and Kemerer; NOC, number of children; [5,14,26,43,44], [20,27,32,39], [23,29,35,47]; 13; 11
Chidamber and Kemerer; RFC, response for class; [5,15,19,26,43], [14,21,27,44], [20,29,32,39], [23,46,47,50]; 17; 15
Chidamber and Kemerer; DIT, depth of inheritance tree; [5,14,26,43], S13, [20,21,27,32], [23,28,29,35,39,47]; 15; 13
Chidamber and Kemerer; LCOM, lack of cohesion in methods; [5,14,26,43], S13, [20,27,28,32], [23,36,39,46,47]; 14; 12
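Three of the Chidamber and Kemerer metrics of item (iv), DIT, NOC, and LCOM, can be illustrated on a toy class hierarchy. This sketch relies on Python introspection, so it counts the built-in `object` class as the root at depth 0, which is a choice made for this example; tools for other languages may count the root differently. The LCOM function takes a hypothetical mapping from method names to the attributes they use and clamps negative results to zero, following the original Chidamber and Kemerer definition.

```python
from itertools import combinations


def dit(cls):
    """Depth of inheritance tree: longest path from cls up to the root (object)."""
    if cls is object:
        return 0
    return 1 + max(dit(base) for base in cls.__bases__)


def noc(cls):
    """Number of children: immediate subclasses of cls."""
    return len(cls.__subclasses__())


def lcom(method_attributes):
    """Pairs of methods sharing no attribute minus pairs sharing some, floored at 0."""
    disjoint = sharing = 0
    for attrs_a, attrs_b in combinations(method_attributes.values(), 2):
        if attrs_a & attrs_b:
            sharing += 1
        else:
            disjoint += 1
    return max(disjoint - sharing, 0)


# Toy hierarchy used only for illustration.
class Shape: pass
class Polygon(Shape): pass
class Circle(Shape): pass
class Square(Polygon): pass


# Hypothetical usage of attributes by the methods of one class.
rectangle_methods = {"area": {"w", "h"}, "perimeter": {"w", "h"}, "describe": {"label"}}
print(dit(Square), noc(Shape), lcom(rectangle_methods))
```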
(v) CLOC (Comment Line of Code). It is the metric which gives the number of lines of code which contain textual comments. Empty lines of comments are not counted. In contrast to the LOC metric, the higher the value CLOC returns, the more comments there are in the analyzed code; therefore, the code should be easier to understand and to maintain. The literature has also proposed a metric that relates CLOC and LOC, called the code-to-comment ratio.
(vi) The Halstead Suite. It was introduced in 1977 [59] and is a set of statically computed metrics, which tries to assess the effort required to maintain the analyzed code, the quality of the program, and the number of errors in the implementation.
To compute the metrics of the Halstead suite, the following indicators must be computed from the source code: $n_1$, i.e., the number of distinct operators; $n_2$, i.e., the number of distinct operands; $N_1$, i.e., the total number of operators; and $N_2$, i.e., the total number of operands. Operands are the objects that are manipulated, and operators are all the symbols that represent specific actions. Operators and operands are the two types of components that form all the expressions. The following metrics are part of the Halstead suite:
Length (N): $N = N_1 + N_2$, where $N_1$ is the total number of occurrences of operators and $N_2$ is the total number of occurrences of operands.
Vocabulary (n): $n = n_1 + n_2$, where $n_1$ is the number of distinct operators and $n_2$ is the number of distinct operands in the program. By definition, the Vocabulary constitutes a lower bound for the Length, since each distinct operator and operand has at least one occurrence.
Volume (V): $V = N \log_2 n$, i.e., the size, in bits, of the space used to store the program (note that this varies according to the specific implementation of the program).
Difficulty (D): $D = \frac{n_1}{2} \cdot \frac{N_2}{n_2}$, which represents the difficulty of understanding the code.
(vii) JLOC (JavaDoc Lines of Code). It is a metric specific to Java code, which is defined as the number of lines of code to which JavaDoc comments are associated. It is similar to other metrics discussed in the literature that measure the number of comments in the source code. In general, a high value for the JLOC metric is deemed positive, since it suggests better documentation of the code and hence better changeability and maintainability. Similar documentation generators are available for JavaScript (JSDoc) and PHP (phpDocumentor); however, we were not able to gather evidence from the manuscripts about the applicability of the JLOC metric to them, so we deemed it applicable only to source code written in Java.
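The Halstead formulas of item (vi) can be applied to a tiny example. The sketch below assumes that the source has already been split into a list of operator occurrences and a list of operand occurrences; how tokens are classified depends on the language and on the tool, so the classification used here for the statement `z = x + y * x` is only illustrative.

```python
import math


def halstead(operators, operands):
    """Compute Length, Vocabulary, Volume, and Difficulty from token occurrences."""
    n1, n2 = len(set(operators)), len(set(operands))  # distinct operators / operands
    N1, N2 = len(operators), len(operands)            # total occurrences
    length = N1 + N2
    vocabulary = n1 + n2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    return {
        "length": length,
        "vocabulary": vocabulary,
        "volume": volume,
        "difficulty": difficulty,
    }


# Tokens of the statement: z = x + y * x
result = halstead(operators=["=", "+", "*"], operands=["z", "x", "y", "x"])
print(result)
```

Here N = 7 and n = 6, so the Vocabulary is indeed a lower bound for the Length, and the Volume is 7 · log2(6), roughly 18.1 bits.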
Table 8: All tools found in the selected set of primary studies.

(viii) LOC (Lines of Code). It is a widely used metric, often chosen for its simplicity. It gives an immediate measure of the size of the source code. Among the most popular metrics, the LOC metric was the only one to have two negative mentions in other works in the literature. These comments are related to the fact that there appears to be no single, universally adopted definition of how this metric is computed [14]. Some works consider the count of all the lines in a file, and others (the majority) remove blank lines from such computation; if there is more than one instruction in a single line or a single instruction is divided into different rows, there is ambiguity about considering the number of lines (physical lines) or the actual number of instructions involved (logical lines). Thus, it is of the utmost importance that the tools used to calculate the metrics specify exactly how they calculate the values they return (or that they are open source, hence allowing an analysis of the tool source code for deriving such information). Although LOC seems to be poorly related to the maintenance effort [14] and there is more than one way to calculate it, this metric is used within the [...]
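The ambiguity in the definition of LOC discussed in item (viii) can be made concrete by counting the same snippet in several ways. The sketch below analyzes Python source, counts only whole-line comments as comment lines, and approximates logical lines with the number of statements in the syntax tree; all three choices are simplifying assumptions for illustration.

```python
import ast


def count_lines(source, comment_prefix="#"):
    """Count physical, non-blank, comment, and logical lines of Python source."""
    physical = source.splitlines()
    non_blank = [line.strip() for line in physical if line.strip()]
    # Whole-line comments with some text; empty comment lines are skipped.
    comment = [
        line for line in non_blank
        if line.startswith(comment_prefix) and line != comment_prefix
    ]
    # Logical lines: every statement node found in the syntax tree.
    logical = sum(isinstance(node, ast.stmt) for node in ast.walk(ast.parse(source)))
    return {
        "physical": len(physical),
        "non_blank": len(non_blank),
        "comment": len(comment),
        "logical": logical,
    }


SAMPLE = "x = 1; y = 2\n\n# add the values\ntotal = (x +\n         y)\n"
counts = count_lines(SAMPLE)
print(counts)
```

The sample has five physical lines, four non-blank lines, one comment line, and three logical statements, so four different values could legitimately be reported for the same code depending on the definition adopted.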

(ix) LCOM2 (Lack of Cohesion in Methods). It is an evolution of the LCOM metric, which is part of the Chidamber and Kemerer suite. LCOM2 equals the percentage of methods that do not access a specific attribute, averaged over all attributes in the class. If the number of methods or attributes is zero, LCOM2 is undefined and displayed as zero. A low value of LCOM2 indicates high cohesion and a well-designed class.
(x) MI (Maintainability Index). It is a composite metric, proposed as a way to assess the maintainability of a software system. There are different definitions of this metric, which was first introduced by Oman and Hagemeister in 1992 [61]. There are two different formulae to calculate the MI: one uses only three metrics, Halstead volume (HV), cyclomatic complexity (CC), and the number of lines of code (LOC), while the other also takes into consideration the number of comments. Although the metric is quite popular, Ostberg and Wagner express their doubts about its effectiveness, claiming that it does not give information about the maintainability of the code, since it is based on metrics considered not suited for that task, and that the result of the metric itself is not intuitive [14]. In contrast, Sarwar et al. state that MI proved to be very efficient in improving software maintainability and cost-effectiveness [6]. In both equations, the following symbols are adopted: avgV is the average Halstead volume for the source code files; avgLOC is the average LOC metric; avgCC is the average cyclomatic complexity; perCM is the percentage of LOC containing comments. A returned value above 85 means that the code is easily maintainable; a value from 85 to 65 indicates that the code is not so easy to maintain; below 65, the code is difficult to maintain. The returned value can reach zero, and even become negative, especially for large projects.
(xi) MPC (Message Passing Coupling). It is a metric from the Li and Henry suite (the only metric of that suite to have a score above the rounded median), and it is defined as the number of send statements defined in a class [62], i.e., the number of method calls in a class.
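The two MI formulae of item (x) are not reproduced in the text above. As an illustration only, the sketch below uses a formulation that is commonly cited in the general literature, with the coefficients 171, 5.2, 0.23, and 16.2, natural logarithms, and an optional comment term 50·sin(√(2.4·perCM)). These coefficients, the logarithm base, and the unit of perCM are assumptions and may differ from the equations adopted by the authors. The thresholds at 85 and 65 are the ones stated in the text.

```python
import math


def maintainability_index(avg_v, avg_cc, avg_loc, per_cm=None):
    """Commonly cited MI formulation (assumed here, not taken from the text).

    avg_v: average Halstead volume; avg_cc: average cyclomatic complexity;
    avg_loc: average LOC; per_cm: share of lines containing comments (optional).
    """
    mi = 171 - 5.2 * math.log(avg_v) - 0.23 * avg_cc - 16.2 * math.log(avg_loc)
    if per_cm is not None:
        # Optional comment term of the four-metric variant.
        mi += 50 * math.sin(math.sqrt(2.4 * per_cm))
    return mi


def classify(mi):
    """Apply the thresholds reported in the text (above 85, 85 to 65, below 65)."""
    if mi > 85:
        return "easily maintainable"
    if mi >= 65:
        return "not so easy to maintain"
    return "difficult to maintain"


# Hypothetical averages for a small and for a very large project.
small = maintainability_index(avg_v=250.0, avg_cc=3.0, avg_loc=40.0)
large = maintainability_index(avg_v=1e6, avg_cc=50.0, avg_loc=1e5)
print(round(small, 1), classify(small))
print(round(large, 1), classify(large))
```

The second call shows how large average volumes and sizes can push the index below zero, as noted in the text.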

(xii) NOM (Number of Methods Counts).
It is the number of methods in a given class/source file, with the assumption that the higher the number of methods, the lower the maintainability of the code.

RQ2.1: Available Tools.
In Table 8, we report all the tools that were identified while reading the papers. The columns report, respectively, the name of the tool, as it is presented in the studies; the studies using it; and a web source where the tool can be downloaded. In the uppermost section of the table, we report the papers for which we could not find the used tool (i.e., a tool was mentioned but no download pointer was provided, indicating that the tool has never been made public and/or has been discontinued), or for which no information about the used tool was provided. For the latter, we have indicated the studies in the table with the respective author's name.
In the second and third sections of the [...] The majority of the tools we found are mentioned by only one study; three are cited by two studies, and only one, CKJM, is quoted by five papers.
It is immediately evident that the open-source tools are more than twice as numerous as the closed-source ones.
This result may be unrelated to the quality of the tools themselves and may instead be explained by the fact that open-source tools are better suited for academic usage, since they make it possible to check the algorithms and to modify or integrate them to analyze their performance.

Scientific Programming
For each of the tools that we were able to identify, we give a brief description in the following; the details about their supported languages and metrics can be found after the descriptions of the tools.

Closed-Source Tools.
Six closed-source tools can be found in the analyzed primary studies, three of which are mentioned in the same paper. The tools described hereafter are listed in alphabetical order and not in any order of importance.
(i) CAST's Application Intelligence Platform. This tool analyzes all the source code of an application to measure a set of nonfunctional properties such as performance, robustness, security, transferability, and changeability (which is strictly tied to maintainability). This last nonfunctional property is measured based on cyclomatic complexity, coupling, duplicated code, and modification of indexes in groups [63]. The tool produces as output a set of violations of typical architectural and design patterns and best practices, which are aggregated in formats specific to both the management and the developers.
(ii) CMT++/CMTJava. CMT is a tool specifically made to estimate the overall maintainability of code written in C, C++, C#, or Java, and to identify its less maintainable parts. Many of the discussed metrics can be computed with the tool: McCabe's cyclomatic number, Halstead's software science metrics, lines of code, and others. CMT also allows computing the maintainability index (MI). The tool can work in command line mode or with a GUI.
(iii) Codacy. It is free for open-source projects and can be self-hosted; otherwise, a license must be purchased to use it. This tool aims at improving code quality, increasing code coverage, and preventing security issues. Its main focus is on identifying bugs and undefined behaviours rather than calculating metrics. It provides a set of statistics about the analyzed code: error-proneness, code style, code complexity, unused code, and security.
(iv) JHawk. The tool is tailored to analyze only code written in Java, but it can calculate a vast variety of different metrics. JHawk is not new on the market, since its first release was introduced more than ten years ago. At the time of writing this article, the last ... The correlation between the supported metrics and the inferred maintainability of software projects is not explicitly mentioned in the tool's documentation.
(v) Visual Studio. It is a very well-known IDE developed by Microsoft. In addition to all its other functions, it comes with embedded modules for the computation of code quality metrics. Among the maintainability metrics listed in the previous section, it supports MI, CC, DIT, class coupling, and LOC. The main limitation of the Visual Studio tool is that these metrics can be computed only for projects written in the C and C++ languages, and not for projects in any other of the many languages supported by the IDE. Also, from the Visual Studio documentation, it can be seen that the IDE makes some assumptions about the metrics that are different from the standard ones. As an example, the MI metric used in Visual Studio is an integer between 0 and 100, with different thresholds from the standard ones defined for MI (an MI of at least 20 indicates code that is easy to maintain, a rating from 10 to 19 indicates that the code is relatively maintainable, and a value below 10 indicates low maintainability).
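The difference between the standard MI thresholds (85 and 65) and the Visual Studio ones (20 and 10) can be illustrated with a minimal Python sketch. It assumes the commonly cited three-metric form of the index with natural logarithms, and it models the Visual Studio variant as the same value rescaled to the 0 to 100 range and clamped at zero, as described in Microsoft's code metrics documentation. The function names and the rounding step are assumptions of this sketch and are not taken from the surveyed tools; the comment-aware variant of MI is deliberately left out.

```python
import math


def maintainability_index(avg_v, avg_cc, avg_loc):
    """Three-metric MI (natural logarithms are an assumption of this sketch).

    avg_v:   average Halstead volume
    avg_cc:  average cyclomatic complexity
    avg_loc: average lines of code
    """
    return 171.0 - 5.2 * math.log(avg_v) - 0.23 * avg_cc - 16.2 * math.log(avg_loc)


def classify_standard(mi):
    """Thresholds quoted in the text: above 85, from 85 to 65, below 65."""
    if mi > 85:
        return "easy to maintain"
    if mi >= 65:
        return "not so easy to maintain"
    return "difficult to maintain"


def visual_studio_style_mi(avg_v, avg_cc, avg_loc):
    """Rescale MI to 0..100 and clamp at zero; rounding is an assumption."""
    raw = maintainability_index(avg_v, avg_cc, avg_loc)
    return round(max(0.0, raw * 100.0 / 171.0))


def classify_visual_studio(mi):
    """Thresholds quoted in the text: at least 20, from 10 to 19, below 10."""
    if mi >= 20:
        return "easy to maintain"
    if mi >= 10:
        return "relatively maintainable"
    return "low maintainability"


if __name__ == "__main__":
    # Hypothetical averages for one project, chosen only for illustration.
    v, cc, loc = 1000.0, 10.0, 100.0
    mi = maintainability_index(v, cc, loc)
    vs = visual_studio_style_mi(v, cc, loc)
    print(f"standard MI = {mi:.2f}: {classify_standard(mi)}")
    print(f"Visual Studio style MI = {vs}: {classify_visual_studio(vs)}")
```

With these illustrative inputs the standard index is about 58.2, which falls in the "difficult to maintain" band, while the rescaled value is 34, which the Visual Studio thresholds label as easy to maintain. The same code can therefore receive different verdicts depending on which definition a tool adopts.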

Open-Source Tools.
Fourteen open-source tools could be found in the analyzed primary studies. Most of them, however, require a license to be used in non-open-source projects or to be used without limitations. The tools described hereafter are listed in alphabetical order and not in any order of importance:
(i) CBR Insight. It is a tool built on top of Understand (see the previous section about closed-source tools), and it uses it to calculate the metrics. The tool calculates metrics that are highly related to software reliability, maintainability, and preventable technical debt. It provides a dashboard to present the data to developers/maintainers. It is worth noting that the tool, although open source, needs a license for the Understand tool to be used.
(ii) CCFinderX (Code Clones Finder). Previously known as CCFinder, it is a tool able to detect duplicate code fragments in source code written in Java, C, C++, C#, COBOL, and VB. At the time ...
(iii) CKJM. The tool [64], cited in five of our selected studies, supports only the Java programming language. It can calculate the six metrics of the C&K suite, plus the afferent coupling (CA) and the number of public methods (NPM). The results can be exported in XML format, and the program can be integrated with Ant. The tool appears to have been discontinued, since its last release at the time of writing this manuscript, version 1.9, dates back to 2008.
(iv) CodeMetrics (IntelliJ IDEA Plugin). The tool is released under the MIT license. It can compute the complexity of each method and the total for each class of the source code. It does not calculate the standard cyclomatic complexity, but an approximation of it. At the time of writing this article, the project is still maintained.
(v) Escomplex. It is a tool that performs a software complexity analysis of JavaScript abstract syntax trees. It can compute several metrics among those previously identified, e.g., the maintainability index, the Halstead suite, McCabe's CC, and LOC. The results are returned in JSON format so that they can be used by front-end programs. At the time of writing this SLR, the last version of the tool dates back to the end of 2015.
(vi) Eslint. The tool is a linting utility for JavaScript (linting means running a program that analyzes code to automatically detect potential errors). The tool offers a set of built-in linting rules and also allows adding custom ones as plugins that are dynamically loaded. It can also automatically fix some of the issues that it finds.
(viii) JSInspect. It is a program to analyze JavaScript code in search of code smells, such as duplicate code and repeated logic. The basic aim of the tool is to identify separate portions of code with a similar structure in a software project, based on the AST node types, e.g., BlockStatement, VariableDeclaration, and ObjectExpression. At the time of writing this SLR, the tool seems to have been discontinued, since the last commit on the repository dates back to August 2017.
(ix) MetricsReloaded (IntelliJ IDEA Plugin). The tool, in addition to being available as a plugin for the popular IDE IntelliJ IDEA, can also be used standalone from the command line. The project seems to have been discontinued since September 2017.
(x) Quamoco Benchmark for Software Quality. It is a Java-based tool aimed at analyzing code written in Java. It is based on the Quamoco model, which integrates abstract code quality attributes and concrete software quality assessments [65]. The tool is mentioned in several academic studies selected in this SLR, and its code repository is available on GitHub. From the repository, it can be seen that the development has been discontinued, and the last commit dates back to July 2013.
(xi) Ref-Finder (Eclipse Plugin). A tool whose principal aim is to detect refactorings that occurred between two program versions [66].
(xii) SonarQube. Along with CodeAnalyzers, it is a product by SonarSource. The two products are provided in two different editions: the community one, which is open source, and a commercial one. The community edition features fewer metrics and fewer programming languages and does not provide the security reports that are a main feature of the commercial versions. They support more than 25 programming languages (15 in the open-source editions) and hundreds of rules, among which are code smells and maintainability metrics.
(xiii) Squale (Software QUALity Enhancement). It is based on third-party technologies (commercial or open source) that produce raw quality information (such as metrics) and uses quality models (such as ISO-9126) to aggregate the raw information into high-level quality factors. Released under the LGPLv3 license, it is a program that helps to assess software quality, giving as output information to be used by both the development and the management team, and dealing with both technical and economic aspects of software quality. It targets different programming languages (including Java, C/C++, .NET, PHP, and Cobol) and utilizes code metrics and quality models to assess the grade of the code. The tool appears to be discontinued, and the last version of the program, v7.1, was released in May 2011.

Correspondence between Tools and Languages.
Figure 4 shows which languages are supported by each tool. Some of the considered tools support a wide variety of languages, such as Understand, Codacy, and the tools by SonarSource (SonarQube and CodeAnalyzers). CBR Insight, as stated before, is based on Understand; hence, it supports the same set of programming languages. The majority of tools, however, support a limited number of programming languages or even just one. For instance, JHawk, CKJM, CodeMetrics, and Ref-Finder all support only Java; JSInspect, escomplex, and eslint are tailored to work only with JavaScript.
From the table, it is evident that the closed-source tools support more programming languages (an average of 10.5) compared to open-source tools (an average of 4.85). The primary studies selected for this SLR also report that closed-source tools tend to support some metrics better than their open-source counterparts: for instance, a comparative study between different tools capable of computing MI reports a higher dependability of this metric when computed using closed-source tools rather than open-source alternatives [6]. Figure 5 shows how many closed-source and open-source tools have been found for each language. From that chart, it is evident that some languages are better supported than others. Java, C, and C++, followed closely by JavaScript and C#, are supported by at least half of the tools we considered in our study. More specifically, Java, C, C++, and C# are supported by almost all the closed-source programs we found. Some less widespread languages (e.g., ABAP, Go, RPG, and T-SQL) are supported only by open-source tools, among the set of tools that we gathered from analyzing the primary studies used for the SLR.

Correspondence between Tools and Metrics.
A suite was considered as supported by a tool if at least one of its metrics was supported by that tool.
In the case of the closed-source tools, the metrics have in most cases been inferred from limited documentation. Closed-source tools, in fact, mostly provide dashboards with custom-defined evaluations of the code, for which the linkage with widespread software metrics is unclear. For instance, the Codacy tool provides a single, overall grade for a software project, between A and F. This grade depends on a set of tool-specific parameters: error-proneness, code complexity, code style, unused code, security, compatibility, documentation, and performance. Apart from some metrics whose usage was explicitly mentioned by the tool's creators (e.g., number of comments and JavaDoc lines for the documentation property and McCabe's CC for the code complexity property), it was not possible to find the complete set of metrics used internally by the tool.
In many cases, the tools also compute compound metrics (i.e., metrics built on top of other ones reported in the literature) or metrics that were not previously found in the analysis of the literature performed to answer RQ1. In these cases, the tools were labelled as featuring other metrics: this information is reported in the last row of the table.
As is evident from the table, no tool supported all the most popular metrics previously identified. The number of supported metrics among the most popular ones ranged from 1 to 10. Two tools featured just one suite/metric from the set of the most popular ones. The Halstead Metrics Tool, as evident from its name, is an open-source tool with the sole purpose of computing the entire set of metrics of the Halstead suite; likewise, the CodeMetrics plugin is a basic tool capable of computing only the McCabe cyclomatic complexity (for each method and the total for each class of the project). Quamoco is not only a tool but also a quality metamodel, based on a set of metrics that are defined, in the scope of the paper presenting the approach, as base measures; the metamodel is theoretically applicable to any kind of base measure that can be computed through static analysis of source code; however, the literature presenting the tool mentions only the LOC metric explicitly. Some other tools, such as JSInspect, CCFinderX, and Ref-Finder, featured a limited set of the maintainability metrics previously identified, since they were mainly focused on other aspects of code quality, e.g., detecting code duplicates and code smells.
Tools such as MetricsReloaded, Squale, and SonarQube featured large sets of derived metrics, which were obtained as specializations, sums, or averages of basic metrics such as the McCabe cyclomatic complexity or the coupling between classes.
The bar graph in Figure 7 reports the number of tools that featured each of the considered metrics. Also in this case, the metrics were divided into three sections on the x-axis: the 15 metrics/suites deemed as most popular in the answer to RQ1, other metrics from the full set, and other metrics not in the set of metrics mined from the literature. Two metrics stood out in terms of the number of tools that supported them. The LOC metric, although many papers in the literature question its usefulness as a maintainability metric, was supported by 14 out of 19 tools. It is closely followed by the cyclomatic complexity (CC), which was supported by 13 tools. Those numbers were to be expected, since both metrics are simple to compute and are needed by many other derived metrics. On the other hand, three of the most popular metrics were used by only two of the selected tools. The CHANGE metric refers to the changed lines of code between different releases of the same application and was not computed by most of the tools, which performed static analysis on single versions of the application; it was instead computed by two tools that specifically aimed at measuring code refactorings and smells. The LCOM2 metric is an extension of the LCOM metric, which is part of the C&K suite; several tools just mentioned the adoption of the suite without explicitly mentioning possible adoptions of enhanced versions of the metrics. Finally, the message passing coupling was adopted by two tools and in both cases defined with the synonym fan-out.
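Since LCOM2 is rarely documented explicitly by the tools, a minimal Python sketch of the definition given earlier in the paper (the share of methods that do not access an attribute, averaged over all attributes) may help to make it concrete. The input format, a mapping from method names to the set of attributes each method accesses, and the function name `lcom2` are assumptions of this sketch; real tools would extract this information from the source code. The result is returned as a fraction between 0 and 1 rather than as a percentage.

```python
def lcom2(accesses, attributes):
    """LCOM2 of a class, as a fraction in [0, 1].

    accesses:   dict mapping each method name to the set of attribute
                names that the method reads or writes (hypothetical input)
    attributes: iterable with the names of all attributes of the class
    """
    attributes = list(attributes)
    n_methods = len(accesses)
    n_attributes = len(attributes)
    # As stated in the text, the metric is undefined when the class has
    # no methods or no attributes, and it is then displayed as zero.
    if n_methods == 0 or n_attributes == 0:
        return 0.0
    total = 0.0
    for attr in attributes:
        # Share of methods that do NOT access this attribute.
        not_using = sum(1 for used in accesses.values() if attr not in used)
        total += not_using / n_methods
    # Average of that share over all attributes.
    return total / n_attributes


if __name__ == "__main__":
    # Hypothetical class with two attributes and three methods.
    point = {
        "get_x": {"x"},
        "get_y": {"y"},
        "norm": {"x", "y"},
    }
    print(lcom2(point, ["x", "y"]))
```

In this toy class, each attribute is ignored by one method out of three, so the value is one third; a class in which every method touches every attribute would score zero, which corresponds to the highest cohesion.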
In general, closed-source tools featured a higher number of metrics than their open-source counterparts. Open-source tools were, in fact, often plugins of limited size, tailored to compute just a single metric or suite. If only the measures mined from the primary studies are considered, the closed-source tools were able to compute an average of slightly less than 8 metrics, while open-source tools were able to compute an average of 5 metrics. Of the set of 15 most popular metrics, on average 6 could be computed by the closed-source tools and 3 by the open-source tools.
Table 9 reports the tools able to compute each of the most popular metrics for the five languages that were supported the most (see bar plot in Figure 5). We took into account C, C++, C#, Java, and JavaScript, since at least 7 tools (more than the average for all programming languages) supported them. The table reports all tools that can compute a metric for a given language. For the JLOC metric, the relevant information is only related to the tools compatible with Java, since the metric cannot be computed for other programming languages. Open-source tools are highlighted by using bold lettering. As is evident from the table, the most featured metrics (e.g., CC and LOC) can be computed with many alternative tools (either closed source or open source) for the same languages. On the other hand, several metrics can be computed by just a single tool: for instance, CCFinderX is the only tool that explicitly supports the CHANGE metric for all the languages of the C family, and the MPC (message passing coupling) metric is explicitly supported only by CAST's Application Intelligence Platform for the languages of the C family and JavaScript.

RQ2.2: Ideal Selection of Tools
Tables 10 and 11 show the optimal set of tools to cover all the most popular metrics shown in Table 5. The former takes into account both closed-source and open-source tools; the latter only considers open-source tools.
We define an optimal set of tools as the minimal set of tools that can cover the highest possible number of metrics (or suites) out of the set of 14 most mentioned ones (15 for Java, for which the JLOC metric can also be computed). Inside round brackets, we identified alternative tools that could be selected without influencing the number of tools in the optimal set or the number of metrics covered.
By using both closed-source and open-source tools, it is possible to compute all the most mentioned metrics with an optimal set of 4 tools for all languages except for Java, for which 5 tools were necessary. Specifically, for all the languages of the C family, all the metrics are covered by CAST's Application Intelligence Platform, Understand, CCFinderX, and CMT++. Java also needed the adoption of a tool among MetricsReloaded, Squale, or Codacy to compute the JLOC metric; JHawk and Ref-Finder could be used, respectively, as alternatives to CAST's AIP and CCFinderX; CMTJava had to be selected instead of CMT++. For JavaScript, escomplex and either CodeAnalyzers or eslint have to be included in the set, replacing CCFinderX and CMT.
By using open-source tools only, it is not possible to obtain full coverage of the most mentioned metrics. The LCOM2 and MPC metrics were not explicitly supported by any of the considered open-source tools. The maximum number of metrics that could be supported with an optimal set of tools ranged between 8 (for the JavaScript programming language, with two tools) and 13 (for Java, with 5 tools, also including the JLOC metric).
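The notion of an optimal set of tools used in this section (the smallest set of tools that reaches the highest achievable metric coverage) is an instance of the set cover problem. The following Python sketch shows one way such sets could be enumerated by exhaustive search; it is not the procedure used to build Tables 10 and 11, which the text does not describe. The tool names and the tool-to-metric mapping are hypothetical and do not reproduce Table 9.

```python
from itertools import combinations


def optimal_tool_sets(support, targets):
    """Return (covered metrics, list of smallest tool sets reaching them).

    support: dict mapping a tool name to the set of metrics it computes
    targets: the metrics of interest (e.g., the most mentioned ones)
    """
    targets = set(targets)
    # Highest coverage achievable when every available tool is used.
    best_cover = set().union(*support.values()) & targets
    tools = sorted(support)
    # Try subsets of increasing size; the first size that reaches the
    # best coverage is minimal, and all subsets of that size are kept
    # as interchangeable alternatives.
    for size in range(len(tools) + 1):
        found = [
            combo
            for combo in combinations(tools, size)
            if set().union(*(support[t] for t in combo)) & targets == best_cover
        ]
        if found:
            return best_cover, found
    return best_cover, []


if __name__ == "__main__":
    # Hypothetical mapping, for illustration only.
    support = {
        "tool_a": {"CC", "LOC", "MI"},
        "tool_b": {"LOC", "CHANGE"},
        "tool_c": {"CC", "LCOM2", "MPC"},
        "tool_d": {"MI", "CHANGE"},
    }
    targets = {"CC", "LOC", "MI", "CHANGE", "LCOM2", "MPC", "JLOC"}
    covered, sets = optimal_tool_sets(support, targets)
    print(sorted(covered))
    for combo in sets:
        print(combo)
```

In this toy example no tool computes JLOC, so full coverage is impossible, which mirrors the open-source-only case discussed above; three alternative sets of three tools reach the best achievable coverage, and all of them contain `tool_c`. Exhaustive search grows exponentially with the number of tools, but it remains practical for a catalogue of about twenty tools.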

Threats to Validity
Threats to construct validity, for an SLR, are related to failures in the claim of covering all the possible studies related to the topic of the review. In this study, this threat was mitigated with a thorough and reproducible definition of the search strategy and with the use of synonyms in the search strings. Also, all the principal sources for the scientific literature were taken into consideration for the extraction of the primary studies.
Threats to internal validity are related to the data extraction phase of the SLR. The authors of this paper evaluated the papers manually, according to the defined inclusion and exclusion criteria. The authors limited biases in the inclusion and exclusion of the papers by discussing disagreements. The metric selection phase was performed based on the opinions extracted from the examined primary studies (considered as adverse, neutral, or positive). Again, the reading of the papers and the subsequent opinion assignments are based on the judgment of the authors and may suffer from misinterpretation of the original opinions. It is, however, worth mentioning that none of the authors of this paper were biased towards the demonstration of a specific preference for one of the available metrics.
Threats to external validity are related to the inability to obtain generalized conclusions from the conducted study. This threat is limited in this study since its main results, i.e., the sets of most popular metrics, were formulated with respect to a set of programming languages. The results are not generalized to programming languages that were not discussed in the primary studies examined in the SLR.

Related Works
The literature offers several secondary studies regarding code metrics and tools. However, those studies usually analyze or present a set of tools, and they describe the metrics based on the features of the tool. Our review instead started from an analysis of the literature aimed at finding all metrics available in relevant studies, and then the focus was moved to tools to understand whether the metrics found were supported by those tools.
For example, in a literature review published in 2008, Lincke et al. [67] compared different software metric tools, showing that, in some cases, different tools provided incompatible results; the authors also defined a simple universal software quality model, based on a set of metrics that were extracted from the examined tools. Dias Canedo et al. [68] performed a systematic literature review to find tools that can perform software measures. Starting from the tools, the authors analyzed the tool features and described the metrics the software could analyze. For their secondary study, the authors analyzed papers from 2007 to 2018.
On the other hand, there are also other secondary studies explicitly focused on metrics, such as the comparative case study published in 2012 by Sjoberg et al. [69], which focuses on code maintainability metrics but only considers a subset of 11 metrics for the Java language. The primary aim of that work was to question the consistency between different metrics in the evaluation of the maintainability of software projects. The systematic mapping study published in 2017 by Nunez-Varela et al. [70] is one of the most complete works on this topic. The authors discovered 300 source code metrics by analyzing papers published from 2010 to 2015. They also mapped those metrics to the tools that can use them. This work, however, covers a limited time window and does not focus on a specific family of software metrics, gathering dynamic and change metrics along with static ones.
In a recent systematic mapping and review, Elmidaoui et al. identified 82 empirical studies about software product maintainability prediction [71]. The paper focuses on analyzing the different methods available for maintainability estimation, including fuzzy, neurofuzzy, artificial neural network (ANN), support vector machines (SVMs), and group method of data handling (GMDH). The paper concludes that the prediction of software maintainability, although many techniques are available to perform it, is still limited in industrial practice.
Our work differs from the secondary studies presented above. Our aim is to find the most common maintainability metrics and tools to be applied to new programming languages. To do so, we analyzed papers in a 20-year time window (2000-2019). We also distinguished open-source tools from closed-source tools, and for each of them, we mapped the maintainability metrics they use. The output of this work is actionable by practitioners wanting to create new tools for applying maintainability metrics to new programming languages.
Other primary studies in the literature presented (or used) popular software metric tools, which were, however, not extracted during our study selection phase, since their primary purpose was not analyzing code from a maintenance point of view, and hence, the manuscripts could not be found by searching for the maintainability keyword. A relevant example of those tools is CCCC, a widespread tool to evaluate code written with object-oriented languages [72,73].

Conclusion
Maintainability is a fundamental feature of software projects, and the scientific literature has proposed several approaches, metrics, and tools to evaluate it in real-world scenarios. With this systematic literature review, we wanted to obtain an overview of the maintainability metrics used in the literature in the last twenty years and to find the most commonly used ones, which can be used to evaluate existing software and can be adapted to measure the maintainability of new programming languages. In doing so, we wanted to provide the readers with actionable results by identifying sets of (closed- and open-source) tools that can be adopted to compute all the most popular metrics for a specific programming language.
With the application of a formalized SLR procedure, we identified a total of 174 metrics, some of which were distributed in 10 metric suites. Among them, we extracted a set of the 15 most frequently mentioned ones, of which we reported the definitions and formulae. We also identified a set of 38 tools mentioned in primary studies about software maintainability metrics: by filtering out those that were not made available by the authors, could not be retrieved on the web, or were no longer available, we came up with a set of 6 closed-source and 13 open-source tools that can be used to evaluate software projects, covering 34 different programming languages. By analyzing the tools, we found that Java, JavaScript, C, C++, and C# are the programming languages most commonly supported by the analyzed tools. By pairing the information about supported programming languages and supported metrics, we found that it is possible to find an optimal selection of at most five tools to cover all the most mentioned metrics for Java and the languages of the C family. However, not all the most popular metrics could be computed by taking into consideration only open-source tools.
This manuscript can provide actionable guidelines for practitioners who want to measure the maintainability of their software by providing a mapping between popular metrics and tools able to compute them. Also, this manuscript provides actionable guidelines for practitioners and researchers who may want to implement tools to measure software metrics for newer programming languages. Our work identifies which tools can compute the most popular maintainability metrics and the support they provide for the most common programming languages. Our work also provides pointers to existing open-source tools already available for computing the metrics, which can be leveraged by tool developers as guidelines for their counterparts for source code written in different languages.
As future work, we aim at implementing a tool that uses the set of metrics we found in RQ1.2 to analyze code written in the Rust programming language. For the Rust programming language, we identified no tool capable of computing the most popular maintainability metrics mentioned in the literature. We plan to extend a tool named Tokei, which offers compatibility with many modern programming languages. The results of this work are expected to help other researchers create tools for measuring the maintainability of modern programming languages and to encourage new comparisons between programming languages.

Data Availability
The data used to support the findings of this study are included within the article in the form of references linking to resources available on the FigShare public open repository.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.