An Effective Strategy to Build Up a Balanced Test Suite for Spectrum-Based Fault Localization

During the past decades, many automated software fault diagnosis techniques, including Spectrum-Based Fault Localization (SBFL), have been proposed to improve the efficiency of software debugging. In SBFL, suspiciousness calculation is closely related to the numbers of failed and passed test cases. Studies have shown that the ratio of failed to passed test cases has a more significant impact on the accuracy of SBFL than the total number of test cases, and that a balanced test suite is more beneficial to the accuracy of SBFL. Based on theoretical analysis, we propose a PNF (Passed test cases, Not execute Faulty statement) strategy to reduce a test suite and build up a more balanced one for SBFL, which can be used in regression testing. We evaluated the strategy in experiments on the Siemens programs and the Space program. The experiments indicated that the PNF strategy can construct a new test suite effectively. Compared with the original test suite, the new one is smaller (on average, 90% of the test cases were removed in the experiments) and has a more balanced ratio of failed to passed test cases, while keeping the same statement coverage and fault localization accuracy.


Introduction
Software fault localization is the activity of identifying the exact locations of program faults. It is one of the most tedious and time-consuming tasks in program debugging [1]. To decrease the cost of software debugging, many automatic software fault localization techniques have been proposed in recent years [2][3][4][5][6][7][8]. Among those, Spectrum-Based Fault Localization (SBFL) techniques, such as Tarantula [9] and Ochiai [10], have attracted a lot of attention. SBFL techniques are simple and highly efficient, and they usually calculate the suspiciousness of program entities according to the different dynamic behaviours of failed and passed test executions.
In the field of SBFL, more than 30 formulas have been proposed to calculate the suspiciousness of a program entity [11]. Every SBFL technique needs to count the numbers of failed and passed test cases. We can therefore speculate that the ratio of failed test cases to passed test cases might affect the accuracy of software fault localization. In this paper, we call the ratio of the number of failed test cases to the number of passed test cases the class ratio. Test case reduction is an important topic with respect to the number of test cases. Some studies [12][13][14][15] focused on test case reduction techniques for SBFL, but they primarily reduced the total number of test cases without considering the class ratio.
Only a few studies focused on the impact of class ratio on the accuracy of software fault localization [16][17][18]. The experiments in Gong et al.'s work [16] showed that the class imbalance phenomenon of test suites would negatively affect the efficiency of SBFL. They used two methods to change the class ratio of failed to passed test cases. (1) The first method fixed the total size of the test suite and changed the ratio of failed to passed test cases. (2) The second method fixed the number of failed test cases and changed the number of passed test cases. They selected test cases from an original test suite randomly when generating a new test suite. Based on this work, Gao et al. [17] conducted a theoretical study on generating a balanced test suite. They cloned the failed test cases a suitable number of times to catch up with the number of passed test cases. Their theoretical analysis suggested that, by using their strategy, the efficiency of SBFL can be improved under certain conditions and is never impaired.
Mathematical Problems in Engineering
In this paper, we propose a nonrandom PNF (Passed test cases, Not execute Faulty statement) test case selection strategy to build up a reduced and balanced test suite. This strategy is applicable to some kinds of regression testing. For example, part of the test cases should be selected from the original test suite when the software platform is updated. Instead of cloning failed test cases as in [17], we select passed test cases from the original test suite to construct a new balanced test suite. This strategy not only makes the test suite more balanced than before, but also significantly reduces the size of the original test suite without decreasing statement coverage.
This paper is structured as follows. In Section 2, we analyse the PNF strategy from a theoretical perspective. The experiments on Siemens and Space are then presented in Section 3. Finally, Section 4 concludes the paper and outlines our future work.

Theoretical Analysis
In this section, we conduct a theoretical analysis of the impact of increasing passed test cases on the accuracy of SBFL. Here, we take two typical SBFL techniques, Tarantula and Nashi2, as representatives. Based on the analysis result, we propose the PNF strategy to build up a balanced test suite, which not only reduces the size of the test suite but also keeps the accuracy of SBFL at the level of the original test suite.

PNF Strategy.
In order to describe our PNF strategy, we first define the following symbols:
(i) $T_{orig}$: the original test suite.
(ii) $T_{init}$: the initial test suite, which consists of some test cases selected from $T_{orig}$.
(iii) $T_{new}$: the increased test suite, in which some test cases are added to $T_{init}$.
(v) $P_{orig}$: the number of all passed test cases in $T_{orig}$.
(vi) $F_{orig}$: the number of all failed test cases in $T_{orig}$.
(vii) $P$: the number of all passed test cases in $T_{init}$.
(viii) $F$: the number of all failed test cases in $T_{init}$.
(ix) $a_{ep}^{s}$: the number of passed test cases which execute the statement $s$ in $T_{init}$.
(xii) $a_{ef}^{\prime s}$: the number of failed test cases which execute the statement $s$ in $T_{new}$.
In the fault localization report of SBFL, all statements are usually ranked by their suspiciousness in descending order. A smaller Rank indicates a higher likelihood of being the faulty statement. Let statement $f$ represent the faulty statement and let $s$ represent any of the nonfaulty statements. Suppose $f$ is ranked ahead of $s$ in the original test suite; if $f$ is still ranked ahead of $s$ after modifying the class ratio of failed to passed test cases, we regard the modification as a positive change strategy.
In this paper, we use the PNF strategy to modify the class ratio and construct the new test suite. In the PNF strategy, we keep the failed test cases unchanged and change the number of passed test cases. The detailed steps are as follows:
(1) Build up an initial test suite $T_{init}$. Copy all failed test cases from $T_{orig}$ and select part of the passed test cases from $T_{orig}$. Here, the number of selected passed test cases is equal to $F_{orig}$, which means that the class ratio is 1 : 1 (failed : passed).
(2) Build up a new test suite $T_{new}$ in which the class ratio is 1 : $N$ (failed : passed). Here, the number of passed test cases is $N$ times $P$. Moreover, the added test cases do not execute the faulty statement $f$.
(3) Take the coverage information into consideration. We also calculate the statement coverage because of the importance of the coverage criterion in software testing. When selecting a new test case, we give priority to the passed test cases which contribute to the statement coverage. If the statement coverage of $T_{new}$ is lower than that of $T_{orig}$, we add further passed test cases until the statement coverage of $T_{new}$ is equal to that of $T_{orig}$.
Algorithm 1 of PNF is used to select passed test cases and build up a balanced test suite.
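The three steps above can be illustrated with a short sketch. It is not Algorithm 1 itself, only one possible reading of the steps under stated assumptions: the `TestCase` type, the function name `pnf_select`, and the greedy choice of the test that adds the most uncovered statements are all hypothetical, and the faulty statement is assumed to be known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestCase:
    name: str
    passed: bool
    covered: frozenset  # ids of the statements executed by this test case


def pnf_select(orig, faulty_stmt, n):
    """Return a reduced suite whose class ratio is about 1 : n (failed : passed)."""
    failed = [t for t in orig if not t.passed]
    pool = [t for t in orig if t.passed]      # passed tests not selected yet
    target = set().union(*(t.covered for t in orig))
    selected = list(failed)                   # every failed test is kept
    covered = set().union(*(t.covered for t in failed))

    def take(candidates, count):
        # Greedily move up to `count` tests from the pool into the new suite,
        # preferring the test that covers the most not-yet-covered statements
        # (a hypothetical tie-breaking rule, not taken from the paper).
        for _ in range(count):
            if not candidates:
                return
            best = max(candidates, key=lambda t: len(t.covered - covered))
            candidates.remove(best)
            pool.remove(best)
            selected.append(best)
            covered.update(best.covered)

    # Step 1: initial suite with ratio 1 : 1, drawn from all passed tests.
    take(list(pool), len(failed))
    # Step 2: grow to 1 : n with passed tests that avoid the faulty statement.
    take([t for t in pool if faulty_stmt not in t.covered], (n - 1) * len(failed))
    # Step 3: add further passed tests until coverage matches the original.
    while covered != target and pool:
        take(list(pool), 1)
    return selected


# Hypothetical toy suite: statement 3 is the faulty one.
suite = [
    TestCase("f1", False, frozenset({1, 2, 3})),
    TestCase("p1", True, frozenset({1, 2, 3})),
    TestCase("p2", True, frozenset({1, 4})),
    TestCase("p3", True, frozenset({1, 5})),
    TestCase("p4", True, frozenset({1, 4, 5})),
    TestCase("p5", True, frozenset({1, 6})),
    TestCase("p6", True, frozenset({1})),
]
reduced = pnf_select(suite, faulty_stmt=3, n=2)
print([t.name for t in reduced])  # prints ['f1', 'p4', 'p5']
```

In this toy run the reduced suite keeps the single failed test, has two passed tests (ratio 1 : 2), and still covers all six statements of the original suite.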
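
Theoretical Analysis in Tarantula.
The suspiciousness formula of Tarantula is
$$\mathrm{Susp}_s = \frac{a_{ef}^{s}/F}{a_{ef}^{s}/F + a_{ep}^{s}/P}, \tag{1}$$
where $a_{ef}^{s}$ and $a_{ep}^{s}$ are the numbers of failed and passed test cases in $T_{init}$ which execute the statement $s$. The same quantities in $T_{new}$ are written with a prime ($F'$, $P'$, $a_{ef}^{\prime s}$, $a_{ep}^{\prime s}$, $\mathrm{Susp}'_s$).
To do a detailed theoretical analysis, we describe some common points when applying this strategy to increase passed test cases. Since we do not add any failed test cases, we get $F' = F$ and $a_{ef}^{\prime s} = a_{ef}^{s}$ for every statement $s$; in particular, $a_{ef}^{\prime f} = a_{ef}^{f} = F$ for the faulty statement $f$, which is executed by all failed test cases. Since the number of passed test cases becomes $N$ times $P$, we get $P' = NP$. We try to calculate the object equation:
$$\mathrm{Susp}'_f - \mathrm{Susp}'_s = \frac{a_{ef}^{\prime f}/F'}{a_{ef}^{\prime f}/F' + a_{ep}^{\prime f}/P'} - \frac{a_{ef}^{\prime s}/F'}{a_{ef}^{\prime s}/F' + a_{ep}^{\prime s}/P'}, \tag{2}$$
where $\mathrm{Susp}'_f$ and $\mathrm{Susp}'_s$ denote the suspiciousness of statement $f$ and statement $s$ after increasing passed test cases. We discuss three cases based on the relation between the suspiciousness of the faulty statement and that of the nonfaulty statement in the initial test suite.
According to (1) and $a_{ef}^{f} = F$, (2) can be expressed as follows:
$$\mathrm{Susp}'_f - \mathrm{Susp}'_s = \frac{NP}{NP + a_{ep}^{\prime f}} - \frac{a_{ef}^{s} NP}{a_{ef}^{s} NP + a_{ep}^{\prime s} F}.$$
Here, we prove that the PNF strategy is positive in three cases, respectively: $\mathrm{Susp}_f > \mathrm{Susp}_s$, $\mathrm{Susp}_f = \mathrm{Susp}_s$, and $\mathrm{Susp}_f < \mathrm{Susp}_s$.
Case 1 ($\mathrm{Susp}_f > \mathrm{Susp}_s$). Because the suspiciousness of statement $f$ is greater than the suspiciousness of statement $s$, we can express it as follows:
$$\frac{a_{ef}^{f}/F}{a_{ef}^{f}/F + a_{ep}^{f}/P} > \frac{a_{ef}^{s}/F}{a_{ef}^{s}/F + a_{ep}^{s}/P}.$$
That is,
$$\frac{a_{ef}^{f} P}{a_{ef}^{f} P + a_{ep}^{f} F} > \frac{a_{ef}^{s} P}{a_{ef}^{s} P + a_{ep}^{s} F}. \tag{5}$$
Since $a_{ef}^{f} = F$, (5) can be simplified as follows:
$$a_{ep}^{s} F > a_{ef}^{s} a_{ep}^{f}.$$
To ensure that our strategy is positive, we need to have $\mathrm{Susp}'_f - \mathrm{Susp}'_s > 0$.
According to the previous calculation, with $P$ replaced by $NP$ and $a_{ep}$ replaced by $a'_{ep}$, we know that
$$\mathrm{Susp}'_f - \mathrm{Susp}'_s > 0 \iff a_{ep}^{\prime s} F > a_{ef}^{s} a_{ep}^{\prime f}.$$
According to the above analysis, this problem can be simplified to the following proof.
The following is given: $a_{ep}^{s} F > a_{ef}^{s} a_{ep}^{f}$.
The following is to be proved: $a_{ep}^{\prime s} F > a_{ef}^{s} a_{ep}^{\prime f}$.
Proof. For the faulty statement $f$, because the added passed test cases do not execute this statement, $a_{ep}^{\prime f} = a_{ep}^{f}$. For any statement $s$ ($s \neq f$), because we added $(N - 1) \times P$ passed test cases, the value range of $a_{ep}^{\prime s}$ is
$$a_{ep}^{s} \leq a_{ep}^{\prime s} \leq a_{ep}^{s} + (N - 1)P,$$
where $a_{ep}^{\prime s}$ denotes the number of passed test cases in $T_{new}$ which execute the statement $s$. Consider
$$a_{ep}^{\prime s} F \geq a_{ep}^{s} F > a_{ef}^{s} a_{ep}^{f} = a_{ef}^{s} a_{ep}^{\prime f}.$$
Based on the proof, if $\mathrm{Susp}_f > \mathrm{Susp}_s$, we get $\mathrm{Susp}'_f > \mathrm{Susp}'_s$ when we use the strategy to select passed test cases. That is to say, if the suspiciousness of the faulty statement $f$ is higher than that of statement $s$ before increasing passed test cases, it is still higher after increasing passed test cases. This shows that the rank of the faulty statement $f$ will not decrease. Therefore, the strategy of increasing passed test cases is a positive approach to selecting passed test cases in this condition.
Case 2 ($\mathrm{Susp}_f = \mathrm{Susp}_s$). Similarly to the first case, the condition gives $a_{ep}^{s} F = a_{ef}^{s} a_{ep}^{f}$. When we increase the passed test cases to $N$ ($N > 1$) times the previous number, the value range $a_{ep}^{s} \leq a_{ep}^{\prime s} \leq a_{ep}^{s} + (N - 1)P$ gives
$$a_{ep}^{\prime s} F \geq a_{ep}^{s} F = a_{ef}^{s} a_{ep}^{f} = a_{ef}^{s} a_{ep}^{\prime f}.$$
Based on the above analysis, if $\mathrm{Susp}_f = \mathrm{Susp}_s$, we get $\mathrm{Susp}'_f \geq \mathrm{Susp}'_s$ when we use the strategy to select passed test cases. This implies that the rank of the faulty statement $f$ will not decrease and may increase. When we increase passed test cases, selecting passed test cases which execute as many nonfaulty statements as possible ($a_{ep}^{\prime s} > a_{ep}^{s}$) will enhance the rank of the faulty statement. Increasing passed test cases in this way is a positive approach to enlarging the test suite.
Case 3 ($\mathrm{Susp}_f < \mathrm{Susp}_s$). The object equation is the same as in the previous analysis:
$$\mathrm{Susp}'_f - \mathrm{Susp}'_s = \frac{NP \left(a_{ep}^{\prime s} F - a_{ef}^{s} a_{ep}^{f}\right)}{\left(NP + a_{ep}^{f}\right)\left(a_{ef}^{s} NP + a_{ep}^{\prime s} F\right)}.$$
Since the reciprocal of the denominator is positive and $NP > 0$, we only focus on the numerator factor
$$a_{ep}^{\prime s} F - a_{ef}^{s} a_{ep}^{f}.$$
In this case $a_{ep}^{s} F < a_{ef}^{s} a_{ep}^{f}$, so with $a_{ep}^{\prime s} \geq a_{ep}^{s}$ the numerator could fall into any of the three cases: $> 0$, $= 0$, and $< 0$. However, the following condition can effectively reduce the negative effects:
$$a_{ep}^{\prime s} > a_{ep}^{s},$$
and the numerator becomes positive once $a_{ep}^{\prime s} > a_{ef}^{s} a_{ep}^{f} / F$. Therefore, we should make the value of $a_{ep}^{\prime s}$ as large as possible; namely, we should select those passed test cases which execute as many nonfaulty statements as possible.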
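The Tarantula analysis can be checked numerically on a small example. The sketch below is an illustration only: it assumes the usual Tarantula formula (failed ratio divided by the sum of failed and passed ratios), a single faulty statement executed by every failed test, and made-up spectrum counts; the names `tarantula`, `pnf_gaps`, and `bad_gap` are hypothetical. Exact fractions are used so that comparisons are not affected by rounding.

```python
from fractions import Fraction


def tarantula(a_ef, a_ep, F, P):
    """Tarantula suspiciousness from spectrum counts (exact arithmetic)."""
    fail_part = Fraction(a_ef, F)
    pass_part = Fraction(a_ep, P)
    return fail_part / (fail_part + pass_part)


F, P, N = 2, 2, 3               # initial suite is 1 : 1, new suite is 1 : 3
f = {"a_ef": 2, "a_ep": 1}      # faulty statement: run by every failed test
s = {"a_ef": 1, "a_ep": 1}      # some nonfaulty statement

# Gap between the faulty and the nonfaulty statement in the initial suite.
before = tarantula(**f, F=F, P=P) - tarantula(**s, F=F, P=P)

# PNF-style growth: the (N - 1) * P added passed tests never execute f,
# while s may be executed by anywhere between none and all of them.
pnf_gaps = [
    tarantula(f["a_ef"], f["a_ep"], F, N * P)
    - tarantula(s["a_ef"], s["a_ep"] + extra, F, N * P)
    for extra in range((N - 1) * P + 1)
]

# Contrast: every added passed test executes f and none executes s.
bad_gap = (
    tarantula(f["a_ef"], f["a_ep"] + (N - 1) * P, F, N * P)
    - tarantula(s["a_ef"], s["a_ep"], F, N * P)
)

print(before, min(pnf_gaps), bad_gap)  # prints 1/6 3/28 -9/44
```

With these counts the faulty statement stays ahead for every admissible number of added passed tests that execute the nonfaulty statement, whereas adding passed tests that execute the faulty statement reverses the order.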

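Theoretical Analysis in Nashi2.
The suspiciousness formula of Nashi2 is
$$\mathrm{Susp}_s = a_{ef}^{s} - \frac{a_{ep}^{s}}{P + 1}. \tag{15}$$
The symbols in (15) are the same as in the formula of Tarantula: $a_{ef}^{s}$ and $a_{ep}^{s}$ are the numbers of failed and passed test cases which execute the statement $s$, $P$ is the number of passed test cases, and primed symbols refer to $T_{new}$. We use an analysis process similar to that for Tarantula. Therefore, we again calculate the difference between $\mathrm{Susp}'_f$ of the faulty statement and $\mathrm{Susp}'_s$ of the statement $s$ and expect that the rank of the faulty statement in the report of SBFL is not impaired. With $a_{ef}^{\prime s} = a_{ef}^{s}$, $a_{ep}^{\prime f} = a_{ep}^{f}$, and $P' = NP$, the difference between $\mathrm{Susp}'_f$ and $\mathrm{Susp}'_s$ can be expressed as follows:
$$\mathrm{Susp}'_f - \mathrm{Susp}'_s = \left(a_{ef}^{f} - a_{ef}^{s}\right) + \frac{a_{ep}^{\prime s} - a_{ep}^{f}}{NP + 1}. \tag{16}$$
We discuss three cases based on the relation between $\mathrm{Susp}_f$ (suspiciousness of the faulty statement) and $\mathrm{Susp}_s$ (suspiciousness of the statement $s$) in the original test suite, as in Section 2.2.
Case 1 ($\mathrm{Susp}_f > \mathrm{Susp}_s$). According to $\mathrm{Susp}_f > \mathrm{Susp}_s$, we have
$$\left(a_{ef}^{f} - a_{ef}^{s}\right) + \frac{a_{ep}^{s} - a_{ep}^{f}}{P + 1} > 0.$$
In order to ensure $\mathrm{Susp}'_f > \mathrm{Susp}'_s$, the object equation (16) can be transformed into the following:
$$\left(a_{ef}^{f} - a_{ef}^{s}\right)(NP + 1) + a_{ep}^{\prime s} - a_{ep}^{f} > 0.$$
According to the above analysis, this problem can be simplified to the following proof.
The following is given: $\left(a_{ef}^{f} - a_{ef}^{s}\right)(P + 1) + a_{ep}^{s} - a_{ep}^{f} > 0$.
The following is to be proved: $\left(a_{ef}^{f} - a_{ef}^{s}\right)(NP + 1) + a_{ep}^{\prime s} - a_{ep}^{f} > 0$.
Proof. Consider the following. Because $a_{ef}^{f} = F \geq a_{ef}^{s}$, $N \geq 1$, and $a_{ep}^{\prime s} \geq a_{ep}^{s}$,
$$\left(a_{ef}^{f} - a_{ef}^{s}\right)(NP + 1) + a_{ep}^{\prime s} - a_{ep}^{f} \geq \left(a_{ef}^{f} - a_{ef}^{s}\right)(P + 1) + a_{ep}^{s} - a_{ep}^{f} > 0.$$
This proof shows that the suspiciousness of the faulty statement $f$ is still higher than the suspiciousness of statement $s$ after passed test cases are increased. It means that the rank of the faulty statement does not decrease.
Case 2 ($\mathrm{Susp}_f = \mathrm{Susp}_s$). According to this condition, we have
$$\left(a_{ef}^{f} - a_{ef}^{s}\right)(P + 1) + a_{ep}^{s} - a_{ep}^{f} = 0.$$
In order to ensure $\mathrm{Susp}'_f \geq \mathrm{Susp}'_s$, the object equation (16) can be transformed into the following:
$$\frac{\left(a_{ef}^{f} - a_{ef}^{s}\right)(NP + 1) + a_{ep}^{\prime s} - a_{ep}^{f}}{NP + 1} \geq 0.$$
Proof. Because $NP + 1 > 0$, we can focus only on the numerator and, using the condition above, simplify it into
$$\left(a_{ef}^{f} - a_{ef}^{s}\right)(N - 1)P + \left(a_{ep}^{\prime s} - a_{ep}^{s}\right) \geq 0.$$
The above inequality implies that the suspiciousness of the faulty statement $f$ must be equal to or higher than the suspiciousness of statement $s$ after passed test cases are increased. Moreover, we should make the value of $a_{ep}^{\prime s}$ as large as possible, which means that we should select the passed test cases which execute as many nonfaulty statements as possible.
Case 3 ($\mathrm{Susp}_f < \mathrm{Susp}_s$). According to this condition, we have
$$\left(a_{ef}^{f} - a_{ef}^{s}\right)(P + 1) + a_{ep}^{s} - a_{ep}^{f} < 0.$$
We need to calculate the relation between $a_{ep}^{\prime s}$ and $\left(a_{ep}^{f} - \left(a_{ef}^{f} - a_{ef}^{s}\right)(NP + 1)\right)$. If $a_{ep}^{\prime s} > a_{ep}^{f} - \left(a_{ef}^{f} - a_{ef}^{s}\right)(NP + 1)$, it shows $\mathrm{Susp}'_f > \mathrm{Susp}'_s$. In this case, increasing passed test cases can effectively reduce the negative effects on the rank of the faulty statement. Therefore, we should make the value of $a_{ep}^{\prime s}$ as large as possible; namely, we should select those passed test cases which execute as many nonfaulty statements as possible.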
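Case 3 can be made concrete with a small numeric sketch. It assumes that the formula meant here is the one commonly written Naish2 in the SBFL literature, that is, the number of failed executions minus the number of passed executions divided by (number of passed tests + 1); the counts below are made up, and the names `naish2`, `threshold`, and `gaps_after` are hypothetical.

```python
from fractions import Fraction


def naish2(a_ef, a_ep, P):
    """Assumed formula: a_ef - a_ep / (P + 1), in exact arithmetic."""
    return a_ef - Fraction(a_ep, P + 1)


F, P, N = 2, 2, 3                # initial suite 1 : 1, new suite 1 : 3
a_ef_f, a_ep_f = 2, 2            # faulty statement f (run by all failed tests)
a_ef_s, a_ep_s = 2, 1            # nonfaulty statement s, initially ahead of f

# Negative gap: f is behind s in the initial suite (Case 3).
gap_before = naish2(a_ef_f, a_ep_f, P) - naish2(a_ef_s, a_ep_s, P)

# f overtakes s once the passed executions of s in the new suite exceed this.
threshold = a_ep_f - (a_ef_f - a_ef_s) * (N * P + 1)

# Gap after growth, for every admissible count of passed tests executing s;
# the added passed tests never execute f, so a_ep_f is unchanged.
gaps_after = {
    a_ep_s_new: naish2(a_ef_f, a_ep_f, N * P) - naish2(a_ef_s, a_ep_s_new, N * P)
    for a_ep_s_new in range(a_ep_s, a_ep_s + (N - 1) * P + 1)
}

for count, gap in gaps_after.items():
    print(count, gap)
```

Here the faulty statement starts behind (gap of -1/3), ties when two passed tests execute the nonfaulty statement, and moves ahead for three or more, which is why selecting passed tests that execute many nonfaulty statements helps.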

Summary of Strategy.
According to the analysis and proofs for Tarantula and Nashi2 in Sections 2.2 and 2.3, we conclude that the accuracy of SBFL would not be impaired by the class ratio of failed to passed test cases when we change the class ratio using the nonrandom PNF strategy. Here, the PNF strategy means keeping the number of failed test cases unchanged and modifying the class ratio by increasing passed test cases. The selected passed test cases are expected not to execute the faulty statement and to execute as many nonfaulty statements as possible. According to the analysis, we can build up a balanced test suite with the PNF strategy. The new test suite has the following advantages:
(1) A smaller size than the original test suite.
(2) A more balanced ratio of failed to passed test cases than the original one.
(3) The same statement coverage as the original test suite.
(4) At least the same fault localization accuracy of SBFL as the original test suite.
Although the PNF strategy relies on knowledge of the fault location, this knowledge can be replaced by other information, such as the top $k$ suspicious statements according to the suspiciousness calculation. However, in order to get more exact results, knowledge of the fault location is assumed in this paper. The PNF strategy is not applicable to regression testing in which the target program has been modified. However, it can be used to construct a new test suite for regression testing in which the target program is unchanged. For example, regression testing is often required for all developed products in a company when the OS is updated, a package is applied, or the platform is changed.
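One way to read the replacement of fault-location knowledge by the most suspicious statements is sketched below. This is a hypothetical illustration, not a procedure given in the paper: the function names, the dictionary-based inputs, and the choice to exclude any passed test that executes at least one of the top-ranked statements are all assumptions.

```python
def top_k_statements(susp, k):
    """Return the k statements with the highest suspiciousness."""
    return set(sorted(susp, key=susp.get, reverse=True)[:k])


def passed_tests_avoiding(passed_coverage, suspects):
    """Keep the passed tests that execute none of the suspect statements."""
    return [name for name, cov in passed_coverage.items() if not (cov & suspects)]


# Hypothetical suspiciousness values and coverage of three passed tests.
susp = {"s1": 0.9, "s2": 0.8, "s3": 0.1}
passed_coverage = {"p1": {"s1", "s3"}, "p2": {"s3"}, "p3": {"s2"}}

suspects = top_k_statements(susp, k=2)
candidates = passed_tests_avoiding(passed_coverage, suspects)
print(sorted(suspects), candidates)  # prints ['s1', 's2'] ['p2']
```

A larger `k` makes the filter more conservative but leaves fewer passed tests to choose from, which may matter when coverage has to be preserved.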

Experiment
Because the effectiveness of our PNF strategy has been analysed and proven from a theoretical perspective in Section 2, in this section we took two small programs (tcas and totinfo) from the Siemens program suite and one large program (Space) as samples to verify our strategy from an experimental perspective. There are several faults in some faulty versions of tcas and Space, and we call these versions multiple-fault versions. In the experiments, we simplified the interactions and interferences between multiple faults presented in [19] and treated one multiple-fault version as several single-fault versions. For these multiple-fault versions, we supposed that only the most suspicious fault could be localized in each iteration; we then fixed it and entered the next iteration to find another fault. Although this approach is not very efficient, it is similar to the fault localization process of real software testing to a certain degree. In addition, we used 20 of the 38 versions of the Space program in our experiments and excluded the other versions, which had compile errors or no failed test cases.
To evaluate the influence of class ratio on fault localization, we applied 4 basic fault localization techniques, Tarantula, Nashi2, Jaccard, and Ochiai, to each faulty program version with different class ratios. Because of time and space constraints, our case studies did not use other advanced fault localization techniques, such as RBF, DStar, and others proposed in [4, 6, 20, 21, 22].
Since failed test cases contribute more to fault localization, we generated test suites by retaining all the failed test cases and increasing the passed test cases in accordance with the different class ratios. When we constructed test suites, we used the random strategy and the nonrandom PNF strategy, respectively. The random strategy selects passed test cases randomly from the original passed test cases to generate a new test suite, while the nonrandom PNF strategy selects passed test cases according to Algorithm 1.

Results Measured by Score.
In this section, we used the suspiciousness Score to measure the accuracy of fault localization, which has been widely used in software fault localization [1, 2, 23]:
$$\mathrm{Score} = 1 - \frac{\mathrm{Rank}}{\mathrm{Total}}. \tag{24}$$
In (24), Total denotes the total number of statements, and Rank denotes the rank of the faulty statement. Rank/Total represents the percentage of code that needs to be examined before the fault is identified. A higher score means higher efficiency of fault localization. When calculating the score, we used the First-Line strategy to deal with ties: all statements sharing the same suspiciousness were assigned the first ranking number among them. We took the average of the scores of all faulty versions of the same program as the final score of the program. For a multiple-fault version, we assigned the highest score among its iterations as the score of the version.
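The score computation with First-Line tie handling can be sketched as follows. The sketch assumes the score is 1 - Rank/Total, which is consistent with the statement that a higher score means higher efficiency; the function names and the suspiciousness values are hypothetical.

```python
def first_line_ranks(susp):
    """Rank statements by descending suspiciousness; ties share the first rank."""
    ordered = sorted(susp, key=susp.get, reverse=True)
    ranks = {}
    for position, stmt in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and susp[stmt] == susp[previous]:
            ranks[stmt] = ranks[previous]   # same value: reuse the first rank
        else:
            ranks[stmt] = position
    return ranks


def score(susp, faulty_stmt):
    """Assumed score: 1 - Rank / Total for the faulty statement."""
    ranks = first_line_ranks(susp)
    return 1 - ranks[faulty_stmt] / len(susp)


# Hypothetical suspiciousness values; s2 and s3 are tied.
susp = {"s1": 0.9, "s2": 0.7, "s3": 0.7, "s4": 0.2, "s5": 0.0}
ranks = first_line_ranks(susp)
print(ranks["s3"], round(score(susp, "s3"), 4))  # prints 2 0.6
```

Because tied statements all receive the first rank of their group, this strategy gives the most optimistic score when the faulty statement is tied with others.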
Figures 1, 2, and 3 present our experimental results on tcas, totinfo, and Space, respectively. On the horizontal axis of these figures, "orig" denotes the original test suite and "1 : N" means that the class ratio of failed test cases to passed test cases is 1 : N.
The three figures show that the class ratio affects fault localization performance for all four SBFL techniques when the random strategy is used to generate the test suite. For tcas and totinfo, a more balanced class ratio gives better fault localization accuracy, which implies that a balanced test suite is more efficient for SBFL when passed test cases are selected randomly; this is consistent with the result of [16]. However, when we use the nonrandom PNF strategy to increase passed test cases, the class ratio has barely any effect on fault localization performance, which conforms with our theoretical analysis. Consequently, whether the class ratio affects the accuracy of SBFL is closely related to the strategy used to generate the new test suite. From the results, it can be observed that the scores of the new test suites constructed by the PNF strategy are higher than the scores of the original test suite for the three target programs, while this advantage does not always hold when the random strategy is used to build up a new test suite.
As shown in Figure 3, there is a small difference between Space and tcas/totinfo. Figure 3(b) indicates that, for the Space program, the four classical SBFL techniques achieve better fault localization accuracy with a more balanced test suite when using the random strategy, which is the same as for tcas and totinfo. With the PNF strategy, however, the fault localization accuracy becomes better as the distribution of test cases becomes more unbalanced, which is a different trend from tcas and totinfo. Nevertheless, with either the PNF strategy or the random strategy, we can still achieve a higher score than with the original test suite. The experimental evidence shows that our PNF strategy is effective for building up a relatively balanced test suite. In this figure, 1 : 2 is the best class ratio considering both the accuracy of fault localization and the size of the test suite.

Results Measured by NScore.
This section evaluates our theoretical analysis with the NScore measurement, in conformity with [16]. NScore is calculated by the following equation:
$$\mathrm{NScore} = \frac{N_{detected}}{N_{total}} \times 100\%. \tag{25}$$
In (25), $N_{detected}$ is the number of faulty program versions in which the accuracy of fault localization is higher than a threshold $t$, and $N_{total}$ is the total number of program versions.
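A minimal sketch of this measurement is given below. It assumes that NScore is the percentage of faulty versions whose score is strictly higher than the threshold; the function name `nscore` and the per-version scores are hypothetical and are not taken from the experiments.

```python
def nscore(scores, threshold):
    """Percentage of versions whose score is strictly above the threshold."""
    detected = sum(1 for value in scores if value > threshold)
    return 100.0 * detected / len(scores)


# Hypothetical per-version scores for one program and one SBFL technique.
version_scores = [0.99, 0.95, 0.85, 0.40]

for threshold in (0.8, 0.9, 0.95):
    print(threshold, nscore(version_scores, threshold))
```

With these made-up scores the result drops from 75% to 50% to 25% as the threshold rises from 0.8 to 0.95, which shows how a stricter threshold exposes differences that an average score would hide.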
The experimental results of tcas and totinfo using different thresholds $t$ (0.8/0.9/0.95) are presented in Figures 4 and 5. For the tcas program, the top-left point in Figure 4(a) means that the scores in 50% of the versions are higher than 0.8. Namely, if the threshold $t$ of accuracy is 0.8 and the class ratio is 1 : 1, then 50% of the faults can be localized correctly with Tarantula. The following can be observed: (1) the performance of fault localization is more stable when the PNF strategy, rather than the random strategy, is used to generate the test suite; (2) higher fault localization accuracy can be achieved using the PNF strategy instead of the random strategy. The totinfo program shows the same tendency as tcas.
Figure 6 illustrates the experimental result of the Space program. Similar to the results in Section 3.2, there is a small difference between the Siemens programs and the Space program. Figures 6(c) and 6(d) indicate that, no matter whether the PNF or the random strategy is used for Nashi2, the NScore of Space does not change with the class ratio, while for Tarantula, presented in Figures 6(a) and 6(b), NScore tends to increase as the class ratio becomes more unbalanced. A thorough analysis of the reason for this difference has not been conducted yet.
In the experiments, we also found that the scores of different faulty versions differ significantly under the same SBFL method, while the results of some versions show the same tendency. The average score hides this difference. Table 2 lists the scores of different tcas versions with several SBFL methods, sorted by Tarantula's score in descending order. From this table, we observed that the maximum score of tcas is 0.9859, but the minimum score is only 0.1268. Why are the scores of some versions so different? Nashi2, Jaccard, and Ochiai have the same problem. It is worth doing more studies from this perspective.

Conclusions
Every suspiciousness calculation in SBFL is closely related to the numbers of passed and failed test cases. Previous studies have shown that a balanced test suite, in which the numbers of failed and passed test cases are similar, is more beneficial to the accuracy of SBFL. In this paper, we proposed a PNF strategy to build up a balanced test suite according to theoretical analysis and evaluated it by experiments using different SBFL methods.
In the PNF strategy, in order to construct a new balanced test suite, we kept the failed test cases unchanged and selected passed test cases from the original test suite according to certain rules: the selected passed test cases should not execute the faulty statement and should execute as many nonfaulty statements as possible with the consideration of statement coverage. The experiments indicated that the PNF strategy is effective for SBFL. Based on the original test suite, it can generate a new, more balanced test suite, which has a smaller size (on average, 90% of the test cases are removed), the same accuracy of SBFL, and the same statement coverage.
However, the PNF strategy still has some limitations. For example, when there are only a few failed test cases in the original test suite, the passed test cases selected by the PNF strategy are not enough for normal testing. In addition, the handling of programs with multiple faults is not yet sufficient. The work can be improved in the following directions: (1) revealing other factors, besides the class ratio, which affect the accuracy of SBFL; (2) combining the PNF strategy with test suite reduction techniques. These studies will improve the efficiency of testing and the accuracy of fault localization. In addition, how to build up a balanced test suite without knowledge of the fault location is part of our future work.

Figure 6: NScore with different class ratio (Space).

Table 1: Information of three target programs.
The Siemens and Space (http://sir.unl.edu/portal/index.php) datasets have been widely used in research on fault localization, and the original information of the three target programs is listed in Table 1. All the programs are written in C. The average fault localization accuracy of all available versions of a target program is taken as the final accuracy of the program.