Autonomous systems must successfully operate in complex time-varying spatial environments even when dealing with system faults that may occur during a mission. Consequently, evaluating the robustness, or ability to operate correctly under unexpected conditions, of autonomous vehicle control software is an increasingly important issue in software testing. New methods to automatically generate test cases for robustness testing of autonomous vehicle control software in closed-loop simulation are needed. Search-based testing techniques were used to automatically generate test cases, consisting of initial conditions and fault sequences, intended to challenge the control software more than test cases generated using current methods. Two different search-based testing methods, genetic algorithms and surrogate-based optimization, were used to generate test cases for a simulated unmanned aerial vehicle attempting to fly through an entryway. The effectiveness of the search-based methods in generating challenging test cases was compared to both a truth reference (full combinatorial testing) and the method most commonly used today (Monte Carlo testing). The search-based testing techniques demonstrated better performance than Monte Carlo testing for both of the test case generation performance metrics: (1) finding the single most challenging test case and (2) finding the set of fifty test cases with the highest mean degree of challenge.
The use of autonomous systems is growing across many areas of society, with further increases planned in the next decade [
Photographs of (a) Google self-driving car (image source: Wikipedia Commons. Author: Mark Dolinar. Used under the Creative Commons Attribution-Share Alike 2.0 Generic license) and (b) autonomous UAV (image source: Wikipedia Commons. Author: Miki Yoshihito. Used under the Creative Commons Attribution-Share Alike 2.0 Generic license).
Testing the software that controls autonomous vehicles operating in complex dynamic environments is a challenging task. Due to cost and time considerations, software testing with digital models is used to augment real-world tests run with the actual autonomous vehicle. Model-based testing methods are frequently used for autonomous vehicle software testing, wherein an executable version of the vehicle software under test (SuT) and a simulation of the vehicle and operating environment (known as the Simulation Test Harness) are executed together in closed-loop fashion.
One important objective for testers of autonomous vehicle software is to generate an adequate set of test cases to assess overall system robustness.
For real-world autonomous systems, it is not possible to test all combinations of initial conditions and fault sequences. The multidimensional test space is too large to combinatorially generate and execute all possible test cases in the time allocated for software testing. Therefore, given a SuT that must control the vehicle and a Simulation Test Harness to provide the sensor inputs to the SuT and to act on the SuT’s commands, the tester must identify and prioritize the most challenging sets of valid test case conditions.
The remainder of this paper is organized as follows. Prior related research is summarized in Section
This section provides relevant background information on the overlapping disciplines related to this research. Section
Autonomous vehicle software is typically assessed using model-based testing methods prior to field deployment. Model-based testing connects the autonomous vehicle SuT with a Simulation Test Harness that is used to test closed-loop interactions between the SuT and the vehicle in the context of the intended operating environment. The Simulation Test Harness models the behavior of the vehicle and its associated hardware being controlled by the SuT as well as the vehicle’s external environment. Because the SuT for an autonomous system must make decisions based on sensed data from the environment, it requires a closed-loop test capability to realistically assess software performance. The SuT can begin as an executable model of the software and later be replaced with more mature code instantiations.
Figure
Modelbased testing process (adapted from [
A Simulation Test Harness for a complex system typically has system
Because a large number of test cases are desired to test autonomous vehicle software robustness, ATG methods can be used to significantly augment the number of test cases that can be developed manually. As the name implies, ATG is the ability to automatically generate test cases to be used for exercising software code to evaluate its suitability for its intended purpose. Software testing can take up to 50% of the overall software development budget [
The most complete form of ATG software robustness testing is full combinatorial testing, in which every possible combination of input conditions is executed.
However, for most real-world systems, it is highly unlikely that all combinations of inputs can be tested in a realistic amount of time due to the exponential growth of the test space as the number of initial conditions, faults, and possible fault occurrence times increases. Figure
Example of multiple fault injection timelines occurring for a single initial condition.
The inability to test all possible combinations motivates a range of approaches for selecting and prioritizing test cases to be run. One approach frequently used is to leverage Design of Experiment (DoE) principles originally developed in the statistics and operations research communities. DoE is a method of generating a series of tests in which purposeful changes are made to the system input variables and the effects on response variables are measured.
DoE principles can be used to analyze simulations with large numbers of input and output variables [
Monte Carlo (MC) testing is the ATG method most commonly used today to assess autonomous vehicle software robustness [
SBT is a subdiscipline within ATG that uses optimization techniques to generate challenging test cases. Approaches used in SBT include metaheuristic algorithms, branch-and-bound algorithms, and mixed integer linear programming [
GAs are a specific type of metaheuristic search algorithm that are frequently applied to global optimization problems with many input variables.
GAs have a long history of use for SBT applications. A systematic review of SBT for finding test cases that will cause the system to no longer function as expected is given in [
SBO is an optimization technique that creates a surrogate model (or metamodel) that is used in place of a computationally expensive objective function (often a simulation of a physics-based model) when performing certain optimization calculations. The goal of SBO is to minimize the number of required executions of the expensive objective function while still being able to perform a global search. The results of the surrogate model predictions are used to carefully select the next point for objective function evaluation. A variety of surrogate models have been used in SBO applications, including polynomial, radial basis function, and Kriging models [
SBO has more typically been used in the design phase of system development rather than the test phase. The use of surrogate models in the design optimization problem is discussed in [
The reasons why SBO techniques may have been used less frequently for test applications than for design are discussed in [
While there has been a large amount of work performed in both model-based testing and SBT areas, there are only a small number of examples of using these techniques together—that is, the use of SBT principles applied to closed-loop model-based testing. One prominent example in this field is the TestWeaver software package [
GAs are used to find rule-based fault conditions that cause failures in autonomous vehicle controllers in [
In summary, there are no examples in the literature that attempt to perform simultaneous SBT optimization of initial conditions, fault combinations, and fault occurrence times for closed-loop model-based testing applications.
In order to quantify the performance of SBT algorithms when used in autonomous vehicle model-based testing applications, SBT performance was compared to the performance of the method most commonly used today (MC testing) as well as the theoretical upper bound of performance for this problem (as determined by full combinatorial testing).
To support these objectives, a medium complexity Simulation Test Harness has been developed. The Simulation Test Harness was intentionally designed to include sufficient features and variables to support a meaningful comparison of the test case generation methods without being so complex as to preclude the generation of a full combinatorial set of test cases to be used for comparison. The key feature of this test setup is that the total number of input variables (initial conditions and fault parameters) was small enough that complete combinatorial testing of the system was feasible but large enough to reveal performance differences between the ATG methods that were compared. Being able to evaluate each possible combination of input conditions made it possible to enumerate a full ranking of the maximum error for all possible test cases. The ATG algorithms (MC and SBT) were constrained to perform significantly fewer simulation runs than the full combinatorial testing, thus enabling evaluation of their effectiveness and efficiency relative to a known truth baseline. It is assumed that, for complex real-world systems, the execution of the high-fidelity simulation interacting with the vehicle control software will be very computationally expensive, thus limiting the number of simulation executions that the tester will be able to execute while generating test cases. Therefore, in this research, the SBT algorithms were constrained to run only a fraction of the total possible combinations, ranging from 0.03% in the most limiting trial up to 1.27% for the trial with the most allowed simulation executions.
The problem for this research was testing UAV flight control software attempting to steer a small UAV quadcopter through a 10 m entryway. As seen in Figure
Depiction of the UAV flight control software test problem.
In order to assess the UAV flight control software’s robustness, an objective of the software testing process is to define test cases that will maximize the test case degree of challenge, in this case, the lateral deviation of the UAV. For this particular example problem, the system can pass through the entryway if the lateral deviation is no more than ±5 meters of the intended flight path, a straight line passing through the center of the entryway. If the UAV can successfully pass through the entryway with the required success rate over the most challenging set of test cases that can be generated, the UAV development team will have increased confidence that the system is ready for the actual flight test phase.
Figure
Automated test generation architecture for UAV flight control software testing.
Figure
Block diagram of the UAV Simulation Test Harness.
Each test case for evaluating the flight control software consists of six initial condition variables and three faults. The six initial conditions are lateral position, lateral velocity, actuator bias, actuator scale factor, sensor bias, and sensor longitudinal scale factor. The three faults that can occur are a stuck actuator, a sensor multipath error, and an intermittent wind gust. A fault may occur during a single time step of the simulation (between the first and last time steps), or it may not occur at all in a given test case.
The state variables of the motion model are
Equation (
The model is executed for five seconds, and a time step of 1 second is used. Note that a higher-fidelity dynamics model would normally use a much smaller integration step to capture high-frequency dynamic effects. For this example, a coarse time step is used in order to limit the total number of fault insertion points in the model, thus enabling the generation of a truth reference through full combinatorial testing of all possible test cases. This truth reference is used to assess the performance of the test case generation algorithms being developed.
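A minimal sketch of a discrete-time lateral motion model with a 1 second step is shown below. The double-integrator form, variable names, and the idea of injecting a fault by perturbing a commanded acceleration are illustrative assumptions; the paper's actual state variables and force model may differ.

```python
def simulate_lateral_motion(y0, v0, accel_cmds, dt=1.0):
    """Propagate lateral position and velocity with one commanded lateral
    acceleration per 1 s step (double-integrator assumption)."""
    y, v = y0, v0
    trajectory = [y]
    for a in accel_cmds:
        y = y + v * dt + 0.5 * a * dt * dt   # position update over the step
        v = v + a * dt                       # velocity update
        trajectory.append(y)
    return trajectory

# Five 1 s steps cover the 5 s scenario; a fault could be represented by
# perturbing one entry of accel_cmds at its insertion time.
traj = simulate_lateral_motion(y0=1.0, v0=0.0, accel_cmds=[0.0] * 5)
```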
The Lateral Sensor model is composed of the following terms:
The SuT for this research problem is the flight control software of the UAV. Its objective is to minimize the lateral deviation relative to the intended flight path while the vehicle is passing through the entryway. The flight control software accepts as input the imperfect Lateral Sensor model estimate of the lateral position and produces as output a command to the UAV Rotor Actuator. As described above, the force actuator is incapable of perfectly implementing the commanded value from the flight control software.
Equation (
The ATG algorithms are responsible for automatically generating the test cases used to assess UAV flight control software performance. Three ATG techniques will be analyzed and compared: MC, GA, and SBO.
Test cases for the MC testing were generated randomly using uniform probability distributions. For each of the six initial condition variables, each of the three possible values (minimum, midpoint, maximum) was equally likely, and for each of the three fault variables, each of the six possible values (occurring at 1, 2, 3, 4, or 5 seconds, or not occurring) was equally likely.
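This uniform random generation can be sketched as follows (illustrative Python; the variable names and encoding are assumptions, not taken from the paper):

```python
import random

IC_LEVELS = ("min", "mid", "max")     # three discrete values per initial condition
FAULT_TIMES = (1, 2, 3, 4, 5, None)   # insertion time in seconds, or no fault

def generate_mc_test_case(rng):
    """Uniform random draw of six initial conditions and three fault times."""
    ics = [rng.choice(IC_LEVELS) for _ in range(6)]
    faults = [rng.choice(FAULT_TIMES) for _ in range(3)]
    return ics + faults

case = generate_mc_test_case(random.Random(0))
```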
The GA is a good metaheuristic algorithm candidate for this problem due to its potential to perform well for both test case generation performance metrics: finding the single most challenging test case (fittest member of the population) as well as generating a set of challenging test cases through the natural evolution of the entire population.
The UAV test case generation problem was framed as a discrete optimization problem where each initial condition could take one of three values and each fault insertion time could take one of six values. The objective of the GA was to find test cases that maximized the final lateral deviation. The MATLAB
Figure
Flowchart of the genetic algorithm automated test generation algorithm.
Step 1 of the GA randomly generates the population of the initial generation of test cases using the constraints on allowable values for initial conditions, faults, and fault occurrence times.
Step 2 sends the initial population of generated test cases to the Simulation Test Harness for evaluation. The Simulation Test Harness runs each generated test case in closedloop fashion with the Flight Control Software. The final lateral deviation is calculated for each generated test case that is evaluated, and this value is sent back to the GA.
Step 3 sorts the generated test cases based on the final lateral deviation they produced, with the highest ranked test case being the one that produced the maximum lateral deviation.
The GA then evaluates whether the stop criterion for the current trial has been met. For this research, the number of Simulation Test Harness Executions performed is compared to the maximum number allowed for the current trial. If the number of allowed executions has not been reached, the GA moves to Step 4 and begins creating the next generation of test cases for evaluation.
Step 4 selects the elite member of the previous generation for inclusion as a member of the next generation. For this research, the elite member is the test case with the largest final lateral deviation as calculated by the Simulation Test Harness. The elite member is automatically advanced to the next generation of test cases. This ensures that the largest final lateral deviation value for the next generation will always be equal to or larger than the previous generation’s maximum value.
Step 5 selects parent test cases to be used to produce offspring in the next step of the process. Parent selection uses a weighted random draw where test cases with larger final lateral deviations are more likely to be selected.
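A sketch of this fitness-proportionate parent selection, using Python's standard library (the paper's implementation used MATLAB's GA toolbox, so this is illustrative only):

```python
import random

def select_parent_pairs(population, deviations, rng, n_pairs):
    """Weighted random draw: test cases with larger final lateral deviations
    are proportionally more likely to be chosen as parents."""
    picks = rng.choices(population, weights=deviations, k=2 * n_pairs)
    return [(picks[2 * i], picks[2 * i + 1]) for i in range(n_pairs)]
```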
Step 6 performs the crossover operation using the parents that were selected in Step 5 to create offspring in the form of new test cases that have attributes from both parents. As discussed above, each test case is composed of nine values (six initial conditions and three fault values). The genetic algorithm considers each of these nine entries to be a gene of the parent test case. The crossover process mixes these genes across the two parents. For this research, a scattered crossover function was used. A nine-entry random binary vector is generated, with each entry having a value of 0 or 1. During the crossover operation, the offspring is created by taking the genes from Parent 1 in all cells where the random binary vector has a value of 0 and taking genes from Parent 2 in all cells where the value is 1. This ensures that new offspring are created that combine properties from both parents.
Step 7 introduces potential mutations into the population. The mutation process randomly changes gene values to help introduce additional randomness into the search process. Mutation helps the process avoid convergence on local minima due to lack of diversity in the current generation of parents. A random number generator is used to determine if any of the values in the test cases will be changed to a different value. For both the crossover and mutation steps, the input constraints on initial conditions, valid faults, and fault occurrence times are enforced to ensure that only valid test cases are generated. If a generated test case violates these constraints, it is discarded and the process is repeated until a valid test case is produced.
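Steps 6 and 7 can be sketched as follows (illustrative Python, not the paper's MATLAB implementation; the mutation rate is an assumed parameter, and drawing mutants only from the allowed values enforces validity by construction rather than by discard-and-retry):

```python
import random

N_GENES = 9   # six initial-condition genes plus three fault-time genes

def scattered_crossover(parent1, parent2, rng):
    """Take each gene from parent1 where the random mask is 0, parent2 where 1."""
    mask = [rng.randint(0, 1) for _ in range(N_GENES)]
    return [g2 if m else g1 for g1, g2, m in zip(parent1, parent2, mask)]

def mutate(test_case, allowed_values, rate, rng):
    """Independently replace each gene with a random allowed value at `rate`.
    Drawing only from allowed_values keeps every mutant a valid test case."""
    return [rng.choice(allowed_values[i]) if rng.random() < rate else g
            for i, g in enumerate(test_case)]
```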
Following Step 7, the next generation of test cases is sent to Step 2 for evaluation in the Simulation Test Harness, and the process repeats until the maximum number of allowed evaluations is reached. The final output at termination is the ranked set of all test cases that were generated.
An SBO method based on [
Flowchart of the surrogatebased optimization automated test generation algorithm.
Step 1 of the SBO uses Latin Hypercube Sampling based on DoE principles to generate the initial set of test cases. Latin Hypercube Sampling is a form of stratified sampling that attempts to distribute samples evenly across the sample space. For this research, the initial design typically consumed 20–40% of the total available simulation executions for the trial.
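A minimal Latin Hypercube Sampling sketch over the unit hypercube (illustrative, not the paper's implementation):

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Stratified sampling: each dimension is split into n_samples strata,
    and each stratum is used exactly once across the sample set."""
    samples = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)                      # random stratum-to-sample pairing
        for i, s in enumerate(strata):
            samples[i][d] = (s + rng.random()) / n_samples  # point inside stratum
    return samples
```

The continuous values in [0, 1) would then be mapped onto the discrete initial-condition levels and fault insertion times.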
Steps 2 and 3 are very similar to the corresponding steps described above for the GA. Generated test cases are sent to the Simulation Test Harness for evaluation, and the generated test cases are sorted based on the final lateral deviation values returned from the Simulation Test Harness. The stop criteria are again evaluated to determine if the total number of allowed Simulation Test Harness Executions has been met or exceeded. If not, the algorithm proceeds with generating the next test case to be evaluated.
Step 4 creates a surrogate model for the Simulation Test Harness by fitting a cubic polynomial regression model to the output of all available Simulation Test Harness evaluations. The inputs to the surrogate model are the generated test cases (each comprising values for the six initial condition variables and the three fault variables) that have been evaluated in the Simulation Test Harness. The goal is to create a surrogate model that approximates the final output of the Simulation Test Harness while being much less computationally expensive to evaluate.
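A sketch of a cubic polynomial surrogate fit by least squares. The feature set here is simplified to per-variable terms without cross terms, which is an assumption; the paper's exact regression form is not reproduced.

```python
import numpy as np

def cubic_features(X):
    """Design matrix with an intercept and per-variable linear, quadratic,
    and cubic terms (cross terms omitted as a simplification)."""
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2, X ** 3])

def fit_cubic_surrogate(X, y):
    """Least-squares fit of the cubic surrogate to evaluated test cases."""
    coeffs, *_ = np.linalg.lstsq(cubic_features(X), y, rcond=None)
    return coeffs

def predict(coeffs, X):
    """Cheap surrogate prediction of final lateral deviation."""
    return cubic_features(X) @ coeffs
```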
Steps 5a and 5b use local and global sampling strategies, respectively, to identify potential test cases to evaluate next in the Simulation Test Harness. Because the surrogate model can be executed more quickly than the Simulation Test Harness, it makes sense to evaluate many candidate points with the surrogate model in order to determine the strongest possible candidate for the next Simulation Test Harness execution.
In Step 5a, a local sampling algorithm is used to generate 25 new potential test cases. This algorithm adds random local perturbations to the test case with the largest final lateral deviation found to date using the Simulation Test Harness. The surrogate model is then evaluated for each of these 25 potential test cases to estimate the expected final lateral deviation.
In Step 5b, a uniform global sampling algorithm is used to attempt to help the algorithm avoid local minima. The state space is uniformly sampled to generate 25 new potential test cases, and again the surrogate model is evaluated for each potential test case.
In Step 6, the next Simulation Test Harness candidate test case is selected from the set of 50 potential test cases generated by the local and global sampling algorithms. The selection algorithm uses a weighted calculation that considers both the predicted final lateral deviation based on the surrogate model and how far away in the state space each potential test case is from the test case with the maximum final lateral deviation evaluated so far by the Simulation Test Harness. Test cases that are farther away in the state space are rewarded in the selection process to help minimize the risk of being stuck in a local minimum.
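This weighted selection can be sketched as below; the weighting factor and the normalization scheme are assumptions for illustration, not the paper's exact calculation.

```python
import math

def select_next_candidate(candidates, predicted_devs, best_case, w_dist=0.3):
    """Score each candidate by its surrogate-predicted deviation plus its
    distance from the current best test case; w_dist rewards exploration."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    dists = [distance(c, best_case) for c in candidates]
    max_dev = max(predicted_devs) or 1.0     # avoid division by zero
    max_dist = max(dists) or 1.0
    scores = [(1 - w_dist) * dev / max_dev + w_dist * dst / max_dist
              for dev, dst in zip(predicted_devs, dists)]
    return candidates[scores.index(max(scores))]
```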
Once the best candidate is selected in Step 6, it is sent to Step 2 for evaluation in the Simulation Test Harness, and the process begins again. This process continues until the maximum number of allowed Simulation Test Harness evaluations is reached and the algorithm is terminated. The final output at termination is the ranked set of all test cases that were generated.
The following sections present the results of the research. Section
The generation of the truth reference test case rankings for the UAV flight control problem is discussed in Section
The UAV flight control problem for this research was carefully constrained to make full combinatorial testing possible to enable generation of a truth reference for evaluating ATG algorithm performance. Full combinatorial testing generates the complete set of possible test cases by evaluating all possible combinations of initial conditions and fault occurrence times. In this case, there were six initial condition variables, each of which could have one of three discrete values (minimum, midpoint, and maximum values). Therefore, there are 3^{6} = 729 initial condition combinations that can be tested when no faults are inserted. For a given set of initial conditions, the three fault variables can each take one of six values, inserted at times 1, 2, 3, 4, and 5 seconds, or not inserted, giving a total of 6^{3} = 216 fault variations. The total number of possible test cases when initial condition variations and faults are considered simultaneously is therefore 729 × 216 = 157,464.
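The counts, and the fraction of the space consumed by each trial's execution budget, can be verified directly:

```python
# Test-space size for the example problem: six initial conditions with three
# levels each, and three faults with six options each (five insertion times
# or "no fault").
ic_combos = 3 ** 6        # 729 initial-condition combinations
fault_combos = 6 ** 3     # 216 fault variations
total_cases = ic_combos * fault_combos   # 729 * 216 = 157464

# Fraction of the space consumed by each experimental trial's budget:
budgets = (50, 100, 200, 500, 1000, 2000)
percentages = [round(100 * b / total_cases, 3) for b in budgets]
```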
Figure
Complete time histories of the UAV flight trajectories for all possible test cases.
Absolute value of the lateral deviations for all possible test cases.
For more realistic autonomous software testing problems, combinatorial testing is not an option in the time available for testing. For example, for a basic high-fidelity Simulation Test Harness with 20 initial conditions, 10 faults, and an execution time of 15 seconds with a 0.1 second time step, the total possible number of test cases exceeds 10^{21}. This combinatorial explosion motivates the use of more selective ATG methods that require fewer simulation executions to generate challenging test cases.
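This estimate can be checked under a few stated assumptions: three levels per initial condition (as in the example problem) and 150 insertion points plus a no-fault option per fault.

```python
# Rough test-space size for a more realistic problem (assumed parameters).
ic_combos = 3 ** 20                # 20 initial conditions, three levels each
n_steps = 150                      # 15 s of run time at a 0.1 s step
fault_options = n_steps + 1        # insertion at any step, or no fault
total = ic_combos * fault_options ** 10   # 10 independent faults

# Fault timing alone already exceeds 10^21 possible combinations.
```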
We evaluate ATG performance for three different algorithms (MC, GA, and SBO) for varying numbers of total allowed simulation executions, as shown in Table
Number of simulation executions allowed for each experimental trial.
Trial                                        1        2        3        4        5        6
Number of simulation executions allowed      50       100      200      500      1000     2000
Percentage of total possible executions      0.032%   0.064%   0.127%   0.318%   0.635%   1.270%
This section presents the ATG results for the two performance metrics for each of the simulation trials, which differed in number of simulation executions. The objective of this research is to determine if the SBT methods (GA and SBO) can generate test cases with equal or higher lateral deviations than MC testing when using the same number of simulation executions. Because random sampling is used in all three ATG algorithms tested, 50 repetitions were conducted for each method for each trial, and the mean value of the parameter over those repetitions is included in the table. A one-sided pooled two-sample t-test was used to assess the statistical significance of the differences between the methods.
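The pooled two-sample t statistic referred to here can be sketched as follows (standard textbook formula; this illustrative implementation is not from the paper):

```python
import math

def pooled_t_statistic(sample_a, sample_b):
    """One-sided pooled two-sample t statistic (H0: mean_a <= mean_b)."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / na, sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    # Pooling assumes the two samples share a common variance.
    sp2 = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    t = (mean_a - mean_b) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2          # statistic and degrees of freedom
```

The statistic would then be compared against the t distribution with the returned degrees of freedom to obtain a p value.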
Table
Maximum lateral deviation (in meters) as a function of simulation executions.
Trial                            1      2      3      4      5      6
Simulation executions            50     100    200    500    1000   2000
Monte Carlo                      4.27   4.59   4.78   5.58   5.93   6.40
Genetic algorithm                3.95   4.33   5.66   6.85   7.28   7.39
Surrogate-based optimization     4.76   6.85   7.11   7.38   7.47   7.47
True maximum                     7.47
Results of statistical hypothesis testing for maximum lateral deviation.
Trial                       1                  2                  3           4           5           6
Simulation executions       50                 100                200         500         1000        2000
H0: GA mean = MC mean       Fail to reject     Fail to reject     Reject      Reject      Reject      Reject
H0: SBO mean = MC mean      Reject             Reject             Reject      Reject      Reject      Reject
Maximum lateral deviation as a function of allowable simulation executions.
The following findings are based on the results of attempting to find the test case that maximizes lateral deviation using each method:
At very small numbers of simulation executions (50 and 100), MC testing outperforms the GA, while at all values 200 and above, the GA begins to significantly outperform MC. This result makes intuitive sense; a GA requires a balance of population size with an appropriate number of generations to evolve in order to improve the results over random selection. At very small numbers of allowable simulation executions, random draws using MC can be equally effective at exploring the test space.
The SBO algorithm is able to slightly outperform MC testing for the trial with the fewest simulation executions (50) and then proceeds to significantly outperform MC for all other numbers of allowable execution runs (
The SBO algorithm outperforms the GA in all trials, with the most noticeable differences in the range of trials that allowed 50 to 500 simulation executions.
The SBO algorithm is able to find the test case that produces the true maximum lateral deviation (7.47 m) every time when allowed to run 1,000 and 2,000 simulation executions. The test case found by the GA approaches the true maximum value (97% of the maximum true value) when allowed to perform 2,000 simulation executions.
Table
Mean lateral deviation (in meters) for the 50 most challenging test cases generated.
Trial                            1      2      3      4      5      6
Simulation executions            50     100    200    500    1000   2000
Monte Carlo                      1.45   2.29   2.85   3.52   3.91   4.30
Genetic algorithm                1.41   2.09   3.48   5.33   6.18   6.92
Surrogate-based optimization     1.61   3.85   5.39   6.05   6.64   6.88
True maximum                     7.07
Results of statistical hypothesis testing for mean of 50 most challenging test cases.
Trial                       1                  2                  3           4           5           6
Simulation executions       50                 100                200         500         1000        2000
H0: GA mean = MC mean       Fail to reject     Fail to reject     Reject      Reject      Reject      Reject
H0: SBO mean = MC mean      Reject             Reject             Reject      Reject      Reject      Reject
Mean lateral deviation for the 50 most challenging test cases as a function of allowable simulation executions.
The findings for this measure of performance are generally similar to those for maximum lateral deviation. At low numbers of allowed simulation executions, MC testing outperforms the GA and is much closer to performing as well as the SBO algorithm. As the number of executions increases, both the GA and the SBO algorithm show significant improvement over MC testing, with all trials of 200 or more simulation executions having a statistically significantly larger mean, with
Two different types of SBT algorithms (GA and SBO) were used to automatically generate test cases to challenge UAV flight control software. A medium-fidelity Simulation Test Harness was used to perform closed-loop testing with the UAV flight control software for a variety of different initial conditions and fault occurrence times. The SBT algorithms significantly outperformed the ATG method most commonly used today (MC testing) for both performance metrics assessed: (1) finding the most challenging single test case and (2) finding the set of the 50 most challenging test cases. When the number of allowed simulation executions was small (<0.1% of the total possible runs), MC testing was able to outperform the GA and come closer to SBO algorithm performance levels, but as the number of allowed executions increased, both SBT algorithms significantly outperformed MC testing; this finding was confirmed using statistical hypothesis testing. The SBO algorithm demonstrated a rapid rise in performance for relatively small numbers of runs (between 0.1 and 0.5% of the total number of runs) and was able to achieve performance very close to the theoretical maximum (as evaluated using full combinatorial testing) when finding the most challenging test case once the number of allowed executions was above 0.3%. The GA was slower to improve than the SBO algorithm but also achieved performance approaching the theoretical maxima once the number of executions was greater than 1% of the total possible test cases.
Future work in this area will be to evaluate performance for a more complex autonomous vehicle model-based testing scenario. Typical high-fidelity autonomous vehicle simulations have more than 20 initial conditions and more than 10 possible faults. Because these simulations also have longer run times with smaller integration time steps, the number of possible test case scenarios can increase by several orders of magnitude. GA and SBO algorithm performance can be evaluated against MC testing for trials with the same number of simulation executions. Full combinatorial testing would not be possible for such a complex system, but GA and SBO performance can also be evaluated against MC tests that are allowed to run many more simulation executions to determine if comparable SBT performance can be achieved with far fewer executions. Given the very large number of total possible simulation executions, it will be of interest to see if the GA and SBO algorithms are able to significantly outperform MC methods again, even though the total number of executions is likely to be less than 0.1% of the total due to the extremely large search space.
The authors declare that there is no conflict of interest regarding the publication of this paper.