A New Look at Worst Case Complexity: A Statistical Approach

We present a new and improved worst case complexity model for quick sort as $y_{\text{worst}}(n, t_d) = b_0 + b_1 n^2 + g(n, t_d) + \varepsilon$, where the LHS gives the worst case time complexity, $n$ is the input size, $t_d$ is the frequency of sample elements, and $g(n, t_d)$ is a function of both the input size $n$ and the parameter $t_d$. The remaining terms, which arise from linear regression, have their usual meanings. We claim this to be an improvement over the conventional model, namely $y_{\text{worst}}(n) = b_0 + b_1 n + b_2 n^2 + \varepsilon$, which stems from the worst case $O(n^2)$ complexity of this algorithm.


Introduction
Sometimes theoretical results on algorithms are not enough for predicting the algorithm's behavior in real time implementation [1]. From research in parameterized complexity, we already know that for certain algorithms, such as sorting, the parameters of the input distribution must also be taken into account, apart from the input size, for a more precise evaluation of the time complexity of the algorithm in question [2,3]. Based on the results obtained, we present a new and improved worst case complexity model for quick sort as $y_{\text{worst}}(n, t_d) = b_0 + b_1 n^2 + g(n, t_d) + \varepsilon$, where the LHS gives the worst case time complexity, $n$ is the input size, $t_d$ is the frequency of sample elements, and $g(n, t_d)$ is a function of both the input size $n$ and the parameter $t_d$. The remaining terms, which arise from linear regression, have their usual meanings. We claim this to be an improvement over the conventional model, namely $y_{\text{worst}}(n) = b_0 + b_1 n + b_2 n^2 + \varepsilon$, which stems from the worst case $O(n^2)$ complexity of this algorithm. It is important to note that our results change the order of the theoretical $O(n^2)$ complexity of this algorithm, as we get $y_{\text{worst}}(n, t_d) = O(n^3)$ complexity in some situations. This new model, in our opinion, can be a guiding factor in distinguishing this algorithm from other sorting algorithms of similar order of theoretical average and/or worst case complexities.
The dependence of the response on the basic operation(s) is more prominent for discrete distributions than for continuous ones, since the probability of a tie is zero in the continuous case. For discrete cases, however, the presence of ties and their relative positions in the array are crucial. This is precisely where the parameters of the input distribution, apart from $n$ characterizing the size of the input, come into play.
We make a statistical case study on the robustness of theoretical worst case complexity measures for quick sort [4] over discrete uniform distribution inputs. The uniform distribution input parameters are related as $n = t_d \cdot k$, where $k$ is the range parameter of the discrete uniform distribution $U[1, \ldots, k]$ and $t_d$ is the tie density. The runtime complexity of quick sort varies from $O(n \log_2 n)$ to $O(n^2)$ depending on the extent of tied elements present in a sample. For example, the complexity is $O(n \log_2 n)$ when all keys are distinct and $O(n^2)$ when all keys are identical [5]. Apart from this result, an important observation with respect to average time complexity is made by Singh et al. [6], who claim quadratic average case complexity of the quick sort program under a universal data set. This is true especially for certain models where the tie density is a positive linear function of $n$. With these observations, it would be interesting to know the behavior of the quick sort program when the linear growth of $t_d$ is replaced by some superlinear function. This interest is a major motivation for this research article.
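The effect of ties on the operation count can be illustrated with a minimal sketch. It assumes a last-element pivot with a single-pass partition, which is not necessarily the partition scheme used in the experiments reported here, and the function name `quicksort_comparisons` is hypothetical. An explicit stack replaces recursion so that degenerate inputs do not exhaust the call stack.

```python
import random

def quicksort_comparisons(data):
    """Sort a copy of `data`; return (sorted_list, number_of_key_comparisons).

    Assumption: last element as pivot, single-pass partition. This is an
    illustrative variant, not necessarily the one measured in the paper.
    """
    a = list(data)
    comparisons = 0
    stack = [(0, len(a) - 1)]  # pending subarrays, processed iteratively
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            comparisons += 1  # one key comparison per scanned element
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]  # place the pivot in its final position
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return a, comparisons

rng = random.Random(0)
distinct_keys = list(range(200))
rng.shuffle(distinct_keys)
_, c_distinct = quicksort_comparisons(distinct_keys)
_, c_equal = quicksort_comparisons([7] * 200)
print("distinct keys:", c_distinct, "all keys equal:", c_equal)
```

With all keys equal, every partition under this scheme is maximally unbalanced, so the count is $n(n-1)/2$, whereas shuffled distinct keys need far fewer comparisons.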
Just as quick sort is found in this case to be worse than its theoretical bound suggests, the reverse is also possible for other algorithms. That is to say, an algorithm can perform better than what a worst case mathematical bound says. In this case the bound becomes conservative [7]. A certificate on the level of conservativeness can be provided only through a statistical analysis. Empirical-O, the statistical bound estimate, will then point to some other bound lower than the mathematical bound obtained by theoretical analysis. The difference provides the desired certificate. For a detailed discussion on empirical-O, the reader is referred to [8].
It is well known that quick sort's performance depends on the underlying pivot selection algorithm, since a proper pivot selection greatly reduces the chances of getting the worst case instances. As discussed above, the worst case complexity measures can be conservative. That is, for an arbitrary algorithm with $y(n) = O(n^{p})$ worst case complexity, in a finite range setup, we can expect a $y(n) = O(n^{p-q})$ complexity, where $q > 0$. However, when other input parameter(s) ($t_d$ in our case) are also taken into account, we come up with $y(n, t_d) = O(n^{p+q})$ worst case complexity, which is a novel finding.

Justifying the Choice of Algorithm.
There are many versions of quick sort. Industrial implementations of quick sort typically include heuristics that protect it against $O(n^2)$ performance when keys are similar. With respect to quick sort, the question of choosing a proper pivot selection algorithm is more relevant in average case complexity measures, as its $O(n \log_2 n)$ average case complexity itself is not robust [5]. The small (but nonzero) probability of getting the worst case instances is often cited as the reason for the overall good performance of quick sort. This is true especially for random continuous distribution inputs. This research article is a study on finding the worst case behavior of both the naïve and randomized versions of the quick sort algorithm.
The Organization of the Paper.
The paper is organized as follows. Section 2 gives the analysis of quick sort using the statistical bound estimate. Within Section 2, Section 2.1 gives the analysis for sorted data sequences, Section 2.2 gives the analysis for random data sequences with three case studies, and Section 2.3 gives the justification for the worse than $O(n^2)$ complexity of quick sort. Section 3 gives the conclusion.

Analysis of Quick Sort Using Statistical Bound Estimate
Our statistical adventure explores the worst case behavior of the well-known standard quick sort algorithm [4] as a case study. The worst case analysis was done by working directly on program run time to estimate the weight based statistical bound over a finite range by running computer experiments [9,10]. This estimate is called empirical-O. Here the time of an operation is taken as its weight. Weighing allows collective consideration of all operations, trivial or nontrivial, into a conceptual bound. We call such a bound a statistical bound, as opposed to the traditional count based mathematical bounds, which are operation specific. Since the estimate is obtained by supplying numerical values to the weights obtained by running computer experiments, the credibility of this bound estimate depends on the design and analysis of computer experiments in which time is the response. The interested reader is referred to [11,12] for more insight into statistical bounds and empirical-O.
This section includes the empirical results obtained for the worst case analysis of the quick sort algorithm. The samples are generated randomly, using a random number generating function, to characterize discrete uniform distribution models with $k$ as the parameter. Our sample sizes, for random data sequences, lie between $5 \times 10^5$ and $10 \times 10^6$. The discrete uniform distribution $U[1, \ldots, k]$ depends on the parameter $k$, which determines the range of the sample obtained.
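The sampling scheme just described can be sketched as follows. This is only an illustration under stated assumptions: the function name `discrete_uniform_sample` is hypothetical, the range parameter is written $k$, and the relation $n = t_d \cdot k$ is used to derive $k$ from a chosen $n$ and $t_d$ (with $t_d$ assumed to divide $n$).

```python
import random

def discrete_uniform_sample(n, td, seed=None):
    """Draw n keys uniformly from {1, ..., k}, where k = n // td.

    Assumption: td divides n, so that n = td * k holds exactly. Each value
    then appears td times on average, which is the tie density of the sample.
    """
    if td <= 0 or n % td != 0:
        raise ValueError("td must be a positive divisor of n")
    k = n // td
    rng = random.Random(seed)
    return [rng.randint(1, k) for _ in range(n)], k

sample, k = discrete_uniform_sample(1000, 10, seed=1)
print(k, min(sample), max(sample))
```

Choosing `td = 1` gives the widest range ($k = n$, few ties), while `td = n` gives $k = 1$ and a sample in which all keys are equal.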
Most of the mean time entries (in seconds) are averaged over 500 trial readings. These trial counts, however, should be varied depending on the extent of noise present at a particular sample size value. As a rule of thumb, the greater the noise at each point of $n$, the larger the number of observations should be.
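The averaging procedure can be sketched as below. The experiments reported here were run in C; this Python harness only illustrates the idea of repeating a timed run and reporting the mean together with a measure of noise. The names `mean_runtime`, `sort_fn`, and `make_input` are hypothetical, and the built-in `sorted` stands in for the sorting routine under study.

```python
import random
import statistics
import time

def mean_runtime(sort_fn, make_input, trials=500):
    """Return (mean, sample standard deviation) of run time in seconds."""
    times = []
    for _ in range(trials):
        data = make_input()  # fresh input for every trial
        start = time.perf_counter()
        sort_fn(data)
        times.append(time.perf_counter() - start)
    spread = statistics.stdev(times) if trials > 1 else 0.0
    return statistics.mean(times), spread

# Stand-in workload: 1000 keys from U[1, ..., 50], sorted with the built-in.
mean, spread = mean_runtime(
    sorted, lambda: [random.randint(1, 50) for _ in range(1000)], trials=20
)
print(f"mean = {mean:.6f} s, noise (std dev) = {spread:.6f} s")
```

A large spread relative to the mean at a given $n$ is a signal to raise `trials` for that sample size.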
The interpretations made for the various statistical data are guided by [13].
System Specification.
All the computer experiments were carried out using a PENTIUM 1600 MHz processor and 512 MB RAM. Statistical models/results are obtained using the Minitab-16 statistical package. The standard quick sort is implemented in the "C" language by the authors themselves. It should be understood that although program run time is system dependent, we are interested in identifying patterns in the run time rather than the run time itself. It may be emphasized here that statistics is the science of identifying and studying patterns in numerical data related to some problem under study.

Analysis for Sorted Data Sequence.
This section includes the empirical results for sorted data sequences. The samples thus generated consist of all distinct elements (at least theoretically). The program runtime data obtained for sorted sequences is fitted to a quadratic model. The regression analysis result is given in Box 1. With a very significant t-value (194.92) for the quadratic term, the regression analysis strongly supports a quadratic model. The goodness of the quadratic model is further tested through a cubic fit for the same runtime data set.
Next the very same program runtime data is fitted to a cubic model. The regression analysis result is given in Box 2. With a value of 23.75, the t statistic for the quadratic term is significantly higher than that of the other terms in the obtained regression model. Remarkably, the statistical significance of the cubic term is very weak compared to the other terms, so it is liable to be discarded. The $R^2$ value is the maximum with a very

Worst Case Analysis of Naïve Quick Sort (Case Study 1).
As our first case study with random data sequences we have analyzed the worst case complexity measures for inputs in the range $5 \times 10^5$ to $50 \times 10^5$. The points on the horizontal axis in Figures 2(a)-2(c) and 2(e) correspond to points on the third degree polynomial shown in Figure 1. It can be seen in Figure 2(a) that the runtime data, when fitted to a quadratic model, gives an underfit. This fit improves significantly when the fitted model is changed to a cubic model of type $y = b_0 + b_1 n + b_2 n^2 + b_3 n^3$. We get a further improved fit when the cubic model is replaced by a fourth degree polynomial. We do not consider higher order models, such as fifth or sixth degree polynomials, in order to avoid overfitting: we wish to catch the general trend of the population rather than force a polynomial to pass through all the input points. Our graphical observation is next verified with more rigorous statistical results given in Boxes 3-5.

Regression Analysis and ANOVA (Analysis of Variance) Results.
The program runtime data obtained for random data sequences corresponding to the curve in Figure 1 is fitted to a quadratic model. The corresponding regression analysis result is given in Box 3. The significant t-value of the quadratic term suggests a quadratic complexity. However, the $R^2$ value is relatively low and the standard error is high.

The Cubic Model as a Test Of Quadratic Goodness of Fit.
A test of quadratic goodness of fit is performed by fitting a cubic model to the same program runtime data. From the regression and ANOVA table (Box 4), this cubic model looks much better than the earlier quadratic model. The $R^2$ is much higher (98.0 against 87.7), the standard error is smaller (3.60446 against 8.19305), but, more importantly, the coefficient of the cubic term is highly significant. Interestingly, the cubic term is statistically more significant than the quadratic term (t-values of 5.49 against −4.03)!
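The comparison of a quadratic fit against a cubic fit can be sketched as follows. The reported results come from Minitab; the pure-Python least squares routine below is an assumed stand-in, and the data are synthetic (a cubic trend plus a small alternating perturbation), not the measurements from the boxes. The function name `fit_polynomial` is hypothetical.

```python
def fit_polynomial(xs, ys, degree):
    """Least squares fit of y = b0 + b1*x + ... + bd*x**d.

    Returns (coefficients, R^2, standard error S). Solves the normal
    equations by Gaussian elimination with partial pivoting.
    """
    m = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    coeffs = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (rhs[i] - s) / A[i][i]
    fitted = [sum(b * x ** p for p, b in enumerate(coeffs)) for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - sse / sst
    std_error = (sse / (len(xs) - m)) ** 0.5  # residual degrees of freedom
    return coeffs, r_squared, std_error

# Synthetic data: input size in millions, response with a cubic trend.
xs = [0.5 * i for i in range(1, 11)]
ys = [1.0 + 0.3 * x ** 3 + (0.1 if i % 2 == 0 else -0.1)
      for i, x in enumerate(xs)]
quad = fit_polynomial(xs, ys, 2)
cubic = fit_polynomial(xs, ys, 3)
print("quadratic: R^2 = %.4f, S = %.4f" % (quad[1], quad[2]))
print("cubic:     R^2 = %.4f, S = %.4f" % (cubic[1], cubic[2]))
```

Expressing the input size in millions keeps the powers of $x$ small, which helps the conditioning of the normal equations. On data with a genuine cubic trend, the cubic fit shows the higher $R^2$ and the smaller standard error, which is the pattern the test of quadratic goodness of fit looks for.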
Verifying the Cubic Complexity through "Box-Cox Transformation".
In order to get a better fit of the model we prefer the response to be transformed. In general, transformations are used for three purposes: stabilizing the response variance, making the distribution of the response variable closer to the normal distribution, and improving the fit of the model to the data [14]. We perform the transformation to accomplish more than one of these objectives simultaneously. The power family of transformations $y^* = y^{\lambda}$ is very useful, where $\lambda$ is the parameter of the transformation to be determined.
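A minimal sketch of the power family of transformations is given below. The function name `box_cox_power` is hypothetical, and the handling of $\lambda = 0$ by the natural logarithm follows the usual Box-Cox convention rather than anything stated in this paper.

```python
import math

def box_cox_power(values, lam):
    """Apply the power transformation y* = y**lam to positive responses.

    By the usual convention the limiting case lam = 0 is taken as ln(y).
    """
    if any(v <= 0 for v in values):
        raise ValueError("the power transformation requires positive responses")
    if lam == 0:
        return [math.log(v) for v in values]
    return [v ** lam for v in values]

# lam = 0.5 is the square root transformation.
print(box_cox_power([4.0, 9.0, 16.0], 0.5))
```

With $\lambda = 0.5$, a response that grows like $n^3$ is mapped to one that grows like $n^{1.5}$, so the transformed data are compared against correspondingly lower powers of $n$.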
Below we provide statistical results for the Box-Cox transformation of the response variable for $\lambda$ (lambda) = 0.5. The detailed result of the Box-Cox transformation is provided in Box 5.
This result once again refutes the theoretical $O(n^2)$ worst case complexity of the quick sort algorithm for certain data patterns.

Worst Case Analysis of Naïve Quick Sort (Case Study 2).
We investigate the behavior of the quick sort program for still larger data sets, whose horizontal axis elements correspond to points on Figure 3. In this experimental setup the sample sizes vary from 5 million to 10 million elements. From the regression and ANOVA table (Box 6), this new cubic model looks much better than the quadratic model in Box 3. The $R^2$ is much higher (97.2 against 87.7), the standard error is still smaller (2.96191 against 8.19305), but, more importantly, the coefficient of the cubic term is highly significant. In fact the cubic term is statistically more significant than the quadratic term. These statistics indicate that a cubic model better describes the sample experimental data, as it does not overfit the complexity data. The smaller PRESS statistic for the cubic models in Boxes 4 and 6 is also favorable. See Figure 4 for an insight into the super quadratic nature of the $y$ versus $n$ curve. These observations lead to the conjecture that the quick sort worst case complexity is $y_{\text{worst}}(n, t_d) = b_0 + b_1 n^2 + g(n, t_d) + \varepsilon = O_{\text{emp}}(n^3)$. It is important to note that our results do change the order of the theoretical $O(n^2)$ complexity of this algorithm.
The statistical analysis in both case studies suggests a super quadratic worst case complexity for the quick sort algorithm, provided the response $y$ is a function of both $n$ and $t_d$.
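The PRESS statistic cited in the case studies is the sum of squared leave-one-out prediction errors. A minimal sketch follows, assuming a pure-Python polynomial fit in place of the Minitab computation and synthetic data with a cubic trend rather than the measured run times. The names `polyfit` and `press_statistic` are hypothetical.

```python
def polyfit(xs, ys, degree):
    """Least squares polynomial coefficients [b0, ..., bd] via normal equations."""
    m = degree + 1
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[piv] = rows[piv], rows[col]
        for r in range(m):
            if r != col:
                f = rows[r][col] / rows[col][col]
                rows[r] = [v - f * p for v, p in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]

def press_statistic(xs, ys, degree):
    """Sum of squared errors when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(xs)):
        coeffs = polyfit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], degree)
        predicted = sum(b * xs[i] ** p for p, b in enumerate(coeffs))
        total += (ys[i] - predicted) ** 2
    return total

# Synthetic data: input size in millions, response with a cubic trend.
xs = [0.5 * i for i in range(1, 11)]
ys = [2.0 + 0.5 * x ** 3 + (0.05 if i % 2 == 0 else -0.05)
      for i, x in enumerate(xs)]
press_quad = press_statistic(xs, ys, 2)
press_cubic = press_statistic(xs, ys, 3)
print("PRESS quadratic = %.4f, PRESS cubic = %.4f" % (press_quad, press_cubic))
```

Because each point is predicted by a model that never saw it, PRESS penalizes overfitting, so a smaller PRESS for the cubic model is evidence that the extra term captures a real trend.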

Worst Case Analysis of Randomized Quick Sort (Case Study 3).
As our third case study with random data sequences we have analyzed the worst case complexity measures of randomized quick sort for inputs in the range $5 \times 10^5$ to $50 \times 10^5$. Apart from the version of quick sort used in this case study, the other input requirements are similar to those in Section 2.2.1. The response variable $y$ in this case corresponds to points in Figure 5.
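For illustration, a randomized quick sort can be sketched as below. It assumes a uniformly random pivot swapped to the end and followed by a single-pass partition; the randomized version used in the experiments may differ, and the name `randomized_quicksort` is hypothetical.

```python
import random

def randomized_quicksort(data, seed=None):
    """Sort a copy of `data` with a random pivot; return (sorted_list, comparisons).

    Assumption: the pivot is chosen uniformly at random from the current
    subarray, swapped to the end, and a single-pass partition follows.
    """
    a = list(data)
    rng = random.Random(seed)
    comparisons = 0
    stack = [(0, len(a) - 1)]  # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        r = rng.randint(lo, hi)  # random pivot position, both ends inclusive
        a[r], a[hi] = a[hi], a[r]
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            comparisons += 1
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return a, comparisons

_, c_sorted = randomized_quicksort(list(range(300)), seed=0)
_, c_ties = randomized_quicksort([5] * 300, seed=0)
print("sorted distinct keys:", c_sorted, "all keys equal:", c_ties)
```

Under this scheme, random pivoting protects against sorted input, but when all keys are equal every pivot choice is equivalent and the count stays at $n(n-1)/2$. This is consistent with ties, rather than input order, driving the worst case examined in this case study.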
The runtime data obtained for randomized quick sort is fitted to a cubic model. The corresponding result is given in Box 7. It can be seen that, with a t statistic of 5.27, the cubic term is statistically more significant than the quadratic term. The standard error ($S$ = 2.29581) of the model thus obtained is relatively small. Also the PRESS statistic (278.092) strongly supports a cubic complexity for the observed data. Compare these values against the corresponding values in Box 3. The worst case complexity thus can be expressed as

$$y_{\text{worst}}(n, t_d) = b_0 + b_1 n^2 + g(n, t_d) + \varepsilon = O_{\text{emp}}(n^3). \quad (3)$$

Justification for Worse than $O(n^2)$ Complexity.
It is well known that the runtime of quick sort depends on the number of equal keys present in the sample [5]. With a fixed tie-density value, in a finite but reasonably wide range setup, $k$ is a linear function of the input parameter $n$. Similarly, for fixed $k$, the value $t_d$ grows linearly (see Figure 6) over the input parameter $n$. An important observation, made in [6], is that for a linear growth in $t_d$ (of course $k$ is to be kept constant to satisfy $n = t_d \cdot k$), the time complexity is quadratic. With this observation, it is interesting to know the behavior of the quick sort program when this linear growth is replaced by some steeper function. This interest is a major motivation for this research article. Under this requirement, due to the nonlinear nature of $t_d$, even $k$ is not a constant but rather a decreasing function of the input parameter $n$.

Conclusions
We conclude this paper with the following remarks. Worst case analysis is termed a useful science, since the worst case bounds give a sense of guarantee against the nonfavorable cases. But is the science as useful as it is projected by the computer scientists? Far from it. Worst case
Box 2: Regression analysis: $y$ versus $n$, $n^2$, and $n^3$ (sorted data sequence).
small standard error value. With an insignificant t value (0.13) for the cubic term, this statistical analysis result suggests a strong quadratic model for the given data set.
Box 1: Regression analysis: $y$ versus $n$ and $n^2$ (sorted data sequence).
over increasing $n$ values, a linear growth in $n$ results in a linear growth of $t_d$ values as well and vice versa. In any such case either $k$ or $t_d$ has to be constant. As a change, if the growth rate of $t_d$ is made superlinear, we get random samples in which neither $k$ nor $t_d$ remains constant. The empirical results for these random sequences are given in Boxes 3-7.