Knee Point Search Using Cascading Top-k Sorting with Minimized Time Complexity

Anomaly detection systems and many other applications are frequently confronted with the problem of finding the largest knee point in the sorted curve for a set of unsorted points. This paper proposes an efficient knee point search algorithm with minimized time complexity using cascading top-k sorting when the a priori probability distribution of the knee point is known. First, a top-k sort algorithm is proposed based on a quicksort variation. We then divide the knee point search problem into multiple steps. In each step an optimization problem of the selection number k is solved, where the objective function is defined as the expected time cost. Because the expected time cost in one step depends on that of the subsequent steps, we simplify the optimization problem by minimizing the maximum expected time cost. The posterior probability distribution of the largest knee point and the other parameters are updated before solving the optimization problem in each step. An example of source detection of DNS DoS flooding attacks is provided to illustrate the application of the proposed algorithm.


Introduction
Anomaly detection systems and many other applications often rely on finding the largest knee point in the sorted curve to perform clustering, classification, anomaly identification, and so forth [1][2][3][4][5][6]. The largest knee point is targeted because the interest lies in finding the cluster of the largest points, whose values differ significantly from those of their lower neighbors in the sorted curve.
A knee point is defined as a point whose value is close to that of its upper neighbor but far from that of its lower neighbor in the sorted curve; it is therefore taken as the boundary of the cluster of upper points. An unsorted list must be sorted to facilitate the knee point search. For reasons of time and space efficiency, first sorting the list completely and then searching the sorted list is often not the optimal method. An alternative approach is to search a partially sorted, namely, top-k, list, in the hope of saving sorting cost. The top-k sort algorithm is therefore introduced in this paper to help minimize the time complexity of the knee point search. There have been many efforts to bound and evaluate the time and space complexity of sort algorithms [7][8][9][10][11][12]. These works provide component algorithms for our work, but the problem of knee point search via top-k sorting has not been addressed by any of them. We present in this paper a knee point search algorithm using top-k sorting with minimized time complexity. This paper is organized as follows: some basic concepts and definitions on knee point search and top-k sorting are presented in Section 2; Section 3 designs a knee point search algorithm, covering the basic idea, the top-k sort algorithm, time complexity, parameter updating, cascading top-k sorting with minimized time complexity, the knee point search algorithm, and the solution of the optimization problem; Section 4 introduces source detection of DNS DoS flooding attacks as an application example of the proposed algorithm; Section 5 concludes the paper.

Knee Point Search and Selection Sort
Assume there are $n$ points whose values, arranged in descending order, form the sorted curve. As illustrated in Figure 1, there is usually a notable gap in value between the points on the upper left side and those on the lower right side of the sorted curve. We define a knee point as one whose neighboring differential values differ significantly, as judged against a threshold whose value ranges from 10 to 50 in the practice of anomaly detection.
Note that there may be more than one knee point for a list, and the goal of the algorithm is to find the largest one in the sorted curve. For an unsorted list, we should first partially sort the list to find the sorted top-k list and then search the sorted top-k list for the largest knee point.
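The search on an already sorted list can be sketched as a single linear scan. The exact knee criterion is an assumption here: a point is taken as a knee when the gap to its lower neighbor exceeds `theta` times the gap to its upper neighbor. The function name `largest_knee_index` and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def largest_knee_index(sorted_desc: Sequence[float], theta: float = 10.0) -> Optional[int]:
    """Return the 0-based index of the largest knee point, or None if there is none.

    Assumed criterion (illustrative only): point i is a knee when its gap to
    the lower neighbor is more than theta times its gap to the upper neighbor.
    """
    # Scanning from the top means the first knee found is the largest one.
    for i in range(1, len(sorted_desc) - 1):
        upper_gap = sorted_desc[i - 1] - sorted_desc[i]
        lower_gap = sorted_desc[i] - sorted_desc[i + 1]
        if lower_gap > 0 and lower_gap > theta * upper_gap:
            return i
    return None


# Hypothetical values, already sorted in descending order.
values = [980.0, 975.0, 971.0, 40.0, 38.0, 35.0, 33.0]
knee = largest_knee_index(values, theta=10.0)
```

Because the scan stops at the first hit, its cost is linear in the length of the sorted prefix it inspects.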

Definition 2.
A top-k sort problem of selection number k for an unsorted list L is the problem of finding the k largest elements of L, sorted in descending order.
Total sorting is often not optimal for this problem, since the knee point search may succeed on a partially sorted top-k list as long as that list contains the largest knee point. It is therefore preferable to proceed in steps, each of which first sorts selectively using the top-k sort algorithm and then searches the sorted part. Because the search may fail in earlier steps, the procedure may go through many recursive steps before finding the largest knee point. There is a tradeoff between the time cost and the expected hit probability of the knee point search in each step, both determined by the selection number and both contributing to the expected overall time cost. In this paper, the optimization problem of the selection number is solved by minimizing the maximum expected time cost.

The Knee Point Search Algorithm
3.1. Basic Idea. The knee point search algorithm is based on cascading top-k sorting. In each step, top-k sorting segments the list left to be searched. The optimal selection number is determined by minimizing the expected time cost on the remaining list. If the search finds the knee point in the sorted top-k list, the algorithm ends there. Otherwise, the residual list excluding the top-k elements requires further checking. It becomes the objective list for the next step, and the expected time cost function is updated according to the posterior knowledge that the search failed in the previous step. A new top-k sort problem of the same form thus arises in the next step. The algorithm runs in this way recursively until the knee point is found. Two cases bring the algorithm to an end.
(1) The knee point is found in the sorted top-k list in some step.
(2) The optimal selection number in some step equals the length of the objective list. This means that the optimal option is total sorting, so the knee point is certain to be found in the completely sorted list.

Top-k Sort Algorithm.
We design a quicksort variation as the top-k sort algorithm. Quicksort is a very efficient sort algorithm invented by Hoare [7]. Quicksort has two phases: the partition phase and the sort phase, which makes it a good example of the divide and conquer strategy for solving problems.
Top-k sorting only aims at treating the largest k elements, and thus it can be facilitated by the divide and conquer strategy. The intermediate results of quicksort, namely, the pivot positions, can be used to cut off one of the smaller problems into which the bigger problem is divided. For the strategy to be effective, when the pivot falls after position k the partition phase recurses only into the upper part, because there is no need to sort the lower part, which consists only of elements smaller than the top-k. This is the major distinction from the original quicksort algorithm and the source of the sorting efficiency.
At the same time, the pivots located after position $d_i$ (the optimal selection number at step $i$; see Section 3.6) are potentially useful for the subsequent steps, although they do not help the top-k sorting within step $i$. Therefore we record those pivots in a stack, a data structure characterized by last in, first out (LIFO) access. Because the recursive partitions whose pivots fall after position $d_i$ produce those pivots in descending position order, pushing them in turn yields a stack ordered by position, with the pivot closest to $d_i$ on top. At the subsequent step $i+1$, if the optimal selection number $d_{i+1}$ is larger than the position of the pivot at the top of the stack, that pivot is popped and used as an inner pivot of the top-k sorting in step $i+1$. Since this pivot is no longer needed for later steps, it is not kept in the stack. When the stack is empty or the pivot at the top of the stack (and hence every other pivot in it) is located after the selection number, the partition has to find a pivot by itself without the help of the pivot stack.
The top-k sort algorithm, namely, QuickSortTopK, can be expressed as in Algorithm 1.
The input of QuickSortTopK is the objective unsorted list $L$ indexed from FirstIndex to LastIndex. For all steps, LastIndex is fixed at $n$, whereas FirstIndex is progressively increased to exclude the part of $L$ sorted in all previous steps. The output includes the sorted top-$k$ elements of $L$ within the range from FirstIndex to LastIndex and the stack containing all pivots falling after position $k$ obtained in all previous steps.
The termination condition of the recursion is checked in Line 1 of the algorithm. If the stack is nonempty and the top element of the stack falls into the objective range (see Line 2), the top element is used as the pivot for the partition (see Line 3). Otherwise, the pivot is obtained by a partition (see Line 6). Once the pivot is available, different recursive steps are taken depending on its position. If the pivot falls after position k, it is pushed into the stack, in the hope of helping subsequent steps, and sorting continues on the part of the list before the pivot (see Lines 9, 11). If the pivot is located exactly at position k, the pivot itself is the last element of the output, and only top-(k−1) sorting on the part of the list before the pivot is needed (see Line 15). If the pivot is located before position k, both the upper and lower parts must be treated. The action on the upper part is equivalent to Quicksort, while the action on the lower part is a recursive run of QuickSortTopK with a diminished selection number k and a shrunken objective list (see Lines 18, 20).
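The three pivot cases can be illustrated with a minimal Python sketch. It is not a transcription of Algorithm 1: the names `partition_desc`, `quicksort_desc`, and `quicksort_topk`, the 0-based indexing, the middle-element pivot choice, and the cleanup of stale stack entries are all assumptions made for illustration.

```python
from typing import List


def partition_desc(a: List[int], lo: int, hi: int) -> int:
    """Partition a[lo..hi] in descending order; return the pivot's final index."""
    mid = (lo + hi) // 2                  # assumed pivot choice: middle element
    a[mid], a[hi] = a[hi], a[mid]
    pivot, store = a[hi], lo
    for j in range(lo, hi):
        if a[j] > pivot:                  # larger elements go before the pivot
            a[j], a[store] = a[store], a[j]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    return store


def quicksort_desc(a: List[int], lo: int, hi: int) -> None:
    """Plain quicksort of a[lo..hi] in descending order."""
    if lo < hi:
        p = partition_desc(a, lo, hi)
        quicksort_desc(a, lo, p - 1)
        quicksort_desc(a, p + 1, hi)


def quicksort_topk(a: List[int], lo: int, hi: int, k: int, stack: List[int]) -> None:
    """Put the k largest items of a[lo..hi] into a[lo..lo+k-1], descending.

    `stack` holds positions of pivots found after earlier targets; the top of
    the stack is the pivot with the smallest position.
    """
    while stack and stack[-1] < lo:       # drop pivots inside the sorted prefix
        stack.pop()
    if k <= 0 or lo >= hi:
        return
    if stack and stack[-1] <= hi:
        p = stack.pop()                   # reuse a pivot from a previous step
    else:
        p = partition_desc(a, lo, hi)     # no usable pivot: partition now
    rank = p - lo + 1                     # 1-based position of the pivot in range
    if rank > k:
        stack.append(p)                   # may help a later step
        quicksort_topk(a, lo, p - 1, k, stack)
    elif rank == k:
        quicksort_desc(a, lo, p - 1)      # pivot is the k-th element
    else:
        quicksort_desc(a, lo, p - 1)      # upper part: ordinary quicksort
        quicksort_topk(a, p + 1, hi, k - rank, stack)


data = [5, 1, 9, 3, 7, 2, 8, 6, 4, 0]
pivots: List[int] = []
quicksort_topk(data, 0, len(data) - 1, 3, pivots)   # step 1: top-3
top3 = data[:3]
quicksort_topk(data, 3, len(data) - 1, 4, pivots)   # step 2: next 4, reusing pivots
top7 = data[:7]
```

Passing the same `pivots` list to successive calls is what lets a later step skip partitions that an earlier step already paid for.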

Time Complexity.
Let total sorting of a list of length $n$ require $T(n)$ time. Measured by the number of comparisons, the average time complexity $T(n)$ is $O(n \log n)$ for efficient algorithms such as Quicksort [7].
Let top-k sorting with selection number $k$ require $T(n, k)$ time, where $n$ is the length of the list. The QuickSortTopK algorithm requires an expected time of $O(n + k \log k)$, so $T(n, k)$ equals $O(n + k \log k)$.
Let the time complexity of finding the knee point in a sorted list of length $n$ be $S(n)$. Recalling (2), $S(n)$ is $O(n)$.

Parameter Updating.
For a list of length $n$, the algorithm divides the overall procedure into $m+1$ steps by a sequence of selection numbers $d_1, d_2, \ldots, d_m$, with $1 \le d_1 < d_2 < \cdots < d_m \le n$. Additionally, to facilitate the formulation, we let $d_0 = 0$ and $d_{m+1} = n$. Let the length of the objective list in step $i$, $i = 1, 2, \ldots, m+1$, be $n_i$; we have

$$n_i = n - d_{i-1}. \quad (3)$$

In the first step, $n_1 = n$. Let top-$k_i$ sorting for the objective list of length $n_i$ be performed in step $i$, $i = 1, 2, \ldots, m+1$, and we have

$$k_i = d_i - d_{i-1}. \quad (4)$$

Particularly, $k_{m+1} = n_{m+1}$ in step $m+1$, and thus top-$k_{m+1}$ sorting for the objective list of length $n_{m+1}$ is actually total sorting of the objective list. If the search for the knee point in the sorted top-$k_i$ list succeeds in step $i$, the algorithm ends at step $i$. Otherwise, the algorithm continues with the next step. The algorithm lasts until step $m+1$ if the search misses in all of the previous steps. Since step $m+1$ makes no further selection on the objective list, the algorithm finishes in it.
Let $X$ be the position variable of the knee point, $X = 1, 2, \ldots, n$. Let $P(x)$ represent the probability that $X = x$, $x = 1, 2, \ldots, n$, and thus $\sum_{x=1}^{n} P(x) = 1$. The value of $P(x)$ is assumed to be known at the beginning of the algorithm. Let $X_i$ be the position variable of the knee point in step $i$, $X_i = d_{i-1}+1, d_{i-1}+2, \ldots, n$. Let $P_i(x)$ represent the probability that $X_i = x$, $x = d_{i-1}+1, d_{i-1}+2, \ldots, n$, $i = 1, 2, \ldots, m+1$, and thus $\sum_{x=d_{i-1}+1}^{n} P_i(x) = 1$.

Lemma 3. The probability distribution of the knee point in step $i$, $P_i(x)$, can be written as

$$P_i(x) = \frac{P(x)}{\sum_{j=d_{i-1}+1}^{n} P(j)}, \quad x = d_{i-1}+1, \ldots, n. \quad (5)$$

Proof. At the first step, all knowledge about the probability distribution of the knee point is given by $P(x)$. The search in the subsequent steps, however, should use the posterior distribution of the knee point, because the knee point is confirmed not to lie before the selection number of the previous step; that is, when the algorithm comes to step $i$, $i = 2, 3, \ldots, m+1$, the knee point has already been checked to be absent from the top-$d_{i-1}$ list. Therefore $P(x)$ should be updated in step $i$ as in (5).

Lemma 4. Let the hit probability of the search in step $i$ be $h_i$. Then

$$h_i = \sum_{x=d_{i-1}+1}^{d_i} P_i(x) = \frac{\sum_{x=d_{i-1}+1}^{d_i} P(x)}{\sum_{x=d_{i-1}+1}^{n} P(x)}. \quad (6)$$

Proof. For the selection number $d_i$ in step $i$, the search for the knee point is successful if and only if the knee point falls into the interval of positions between $d_{i-1}+1$ and $d_i$. According to Lemma 3, the probability distribution of the knee point at step $i$, $P_i(x)$, is given by (5). Summing (5) over this interval gives (6).

Algorithm 1: QuickSortTopK.
Input:
  L: an unsorted list of length n
  FirstIndex: the first index of L to be sorted
  LastIndex: the last index of L to be sorted
  k: the top-k elements of L are to be sorted
Output:
  L: sorted top-k elements of L ranging from FirstIndex to LastIndex
  Stack: a stack of pivot positions useful for the subsequent selection sorts
(3) pivotpos = pop(Stack) /* use the top element of the stack as the pivot */
(4) else
(5) /* Partition without using pivots from the stack */
(6) pivotpos = Partition(L, FirstIndex, LastIndex)
(7) if pivotpos − FirstIndex + 1 > k then /* The pivot falls after position k */
(8) /* The pivot may be useful for the subsequent steps */
(9) push(pivotpos, Stack)

Cascading Top-k Sorting with Minimized Time Complexity
Lemma 5. Let the expected overall computational time cost in step $i$ be $tc_i$; it satisfies

$$tc_i = T(n_i, k_i) + (1 - h_i)\, tc_{i+1}. \quad (7)$$

Proof. When the search succeeds in step $i$, $tc_i$ comes only from the top-k sorting, which requires $T(n_i, k_i)$ time, and the search in the top-$k_i$ sorted list, which requires $S(k_i)$. Recalling Section 3.3, $S(k_i)$ is negligible compared to $T(n_i, k_i)$, so their sum can be approximated by $T(n_i, k_i)$. When the search fails, which happens with probability $1 - h_i$, the cost $tc_{i+1}$ of the subsequent steps is incurred in addition, which gives (7).
Lemma 5 tells us that the expected computational cost $tc_1$ can be calculated iteratively following (7), until reaching $tc_{m+1}$, where there is no selection for step $m+1$. Thus $tc_{m+1}$ consists only of the time cost of total sorting of the list of length $n_{m+1}$ and of the search in it. As total sorting takes $T(n_{m+1})$ and the search takes $S(n_{m+1})$, we have

$$tc_{m+1} = T(n_{m+1}) + S(n_{m+1}).$$

Let $Z(a, b)$ ($a \le b$, with $a$ and $b$ integers) denote the set of integers $\{x \mid a \le x \le b \text{ and } x \text{ is an integer}\}$. In every step $i$, the algorithm calculates the current probability distribution of the knee point, which determines the hit probability, and chooses $d_i$ to be the solution of the following optimization problem:

$$\text{Min: } tc_i \quad \text{s.t. } d_i \in Z(d_{i-1}+1, n). \quad (12)$$

We see in (12) that, for any fixed $d_i \in Z(d_{i-1}+1, n)$, the minimum of $tc_i$ is determined by $tc_{i+1}$ under the optimal selection of $d_{i+1}$ in the next step. The optimal $tc_{i+1}$ for any fixed $d_{i+1}$ is in turn determined by $tc_{i+2}$ under the optimal selection of $d_{i+2}$, and so on. This iterative dependency extends to the last step, which has no further selection. So $tc_i$ is a function of $d_i, d_{i+1}, \ldots$. Since the selection can vary in every one of these steps, the search space of the optimization is very large, especially for the initial steps. It is therefore not practical to evaluate all possible selections in all of the subsequent steps when solving the optimization problem in step $i$, and it is necessary to restrict the variable of the objective function $tc_i$ in (12) to $d_i$ alone.
Plugging (3) into (15) and then (15) into (16), we have (14), and thereby (13) is proved. Theorem 6 shows that, for a fixed $d_i$, $tc_{i+1}$ is bounded by the time cost of total sorting of the residual list of length $n_{i+1}$ plus that of the search in it. The optimization problem in step $i$ described by (13) can thus be isolated from all possible selections of the subsequent steps and becomes a function of $d_i$ alone. This minimax technique simplifies the analysis, such that (13) can be simplified as

$$\text{Min: } T(n_i, k_i) + (1 - h_i)\left(T(n_{i+1}) + S(n_{i+1})\right) \quad \text{s.t. } d_i \in Z(d_{i-1}+1, n). \quad (17)$$

Plugging (5) into (17), we have

$$\text{Min: } T(n - d_{i-1}, d_i - d_{i-1}) + \frac{\sum_{x=d_i+1}^{n} P(x)}{\sum_{x=d_{i-1}+1}^{n} P(x)}\left(T(n - d_i) + S(n - d_i)\right) \quad \text{s.t. } d_i \in Z(d_{i-1}+1, n). \quad (18)$$

Solving (14), we can obtain the optimal $d_i$ in step $i$.
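A brute-force solution of the one-variable problem can be sketched as follows. The concrete cost models are assumptions chosen to match the stated orders of growth: `topk_cost` stands in for the top-k sorting cost as n + k log2 k, and `total_cost` for total sorting plus the linear search as n log2 n + n. The input `P` is taken to be the distribution over the residual list, already normalized to sum to 1.

```python
import math
from typing import Sequence


def topk_cost(n: int, k: int) -> float:
    """Assumed model of the top-k sorting cost: n + k*log2(k)."""
    return n + (k * math.log2(k) if k > 1 else 0.0)


def total_cost(n: int) -> float:
    """Assumed model of total sorting plus linear search: n*log2(n) + n."""
    return n * math.log2(n) + n if n > 0 else 0.0


def optimal_selection(P: Sequence[float]) -> int:
    """Return d in 1..n minimizing topk_cost(n, d) + miss(d) * total_cost(n - d).

    P[x-1] is the probability that the knee sits at position x of the residual
    list; miss(d) is the probability mass beyond position d.
    """
    n = len(P)
    tail = math.fsum(P)
    best_d, best_cost = n, math.inf
    for d in range(1, n + 1):
        tail -= P[d - 1]
        miss = max(tail, 0.0)             # guard against rounding below zero
        cost = topk_cost(n, d) + miss * total_cost(n - d)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Hypothetical prior: the knee is almost surely within the first three positions.
prior = [0.5, 0.3, 0.2] + [0.0] * 97
d_opt = optimal_selection(prior)
```

The scan costs O(n) evaluations per step, which is small next to the sorting cost it is trying to save.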

The Knee Point Search Algorithm.
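The knee point search algorithm runs iteratively using cascading top-k sorting. When the optimal selection number $d_i$ is determined at step $i$, the top-k sorting can be done by running QuickSortTopK($L$, $d_{i-1}+1$, $n$, $d_i - d_{i-1}$). Specifically, the first step starts with QuickSortTopK($L$, 1, $n$, $d_1$).
The procedure can be described as follows.
Step 1.
(1) The probability distribution of the knee point is given by the a priori distribution $P(x)$.
(2) The optimal selection number $d_1$ is obtained by solving the following optimization problem as in (18):

$$\text{Min: } T(n, d_1) + \sum_{x=d_1+1}^{n} P(x)\left(T(n - d_1) + S(n - d_1)\right) \quad \text{s.t. } d_1 \in Z(1, n).$$

(3) Perform top-$d_1$ sorting on the list of length $n$.
(4) Search for the knee point on the sorted top-$d_1$ list. If successful or $d_1 = n$, the algorithm ends. Otherwise, go to Step 2.
Step 2.
(1) The probability distribution of the knee point for the optimization problem is updated as follows:

$$P_2(x) = \frac{P(x)}{\sum_{j=d_1+1}^{n} P(j)}, \quad x = d_1+1, \ldots, n.$$

(2) $d_2$ is derived as the solution of the following optimization problem:

$$\text{Min: } T(n - d_1, d_2 - d_1) + \frac{\sum_{x=d_2+1}^{n} P(x)}{\sum_{x=d_1+1}^{n} P(x)}\left(T(n - d_2) + S(n - d_2)\right) \quad \text{s.t. } d_2 \in Z(d_1+1, n).$$

Note that $d_1$ is inherited from Step 1.
(3) Perform top-$(d_2 - d_1)$ sorting on the residual list of length $n - d_1$.
(4) Search for the knee point on the newly sorted list. If successful or $d_2 = n$, the algorithm ends. Otherwise, go to Step 3.
Step i.
(1) The probability distribution of the knee point for the optimization problem is updated according to (5).
(2) $d_i$ is obtained as the solution of the optimization problem in (18), where $d_{i-1}$ is known from step $i-1$.
The knee point search algorithm can be summarized as in Algorithm 2.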
The knee point search algorithm can also be expressed recursively.
For each recursive step, we have an unsorted list $L$ of length $n$ and the probability distribution $P$ of the knee point in the sorted list of length $n$. Thus we can modify the optimization problem in (18) as follows:

$$\text{Min: } T(n, d) + \sum_{x=d+1}^{n} P(x)\left(T(n - d) + S(n - d)\right) \quad \text{s.t. } d \in Z(1, n).$$

Solving (14), we can obtain the optimal $d$ in each recursive step.
When the knee point search fails after top-$d$ sorting in a recursive step, the algorithm has to go to the next recursive step. First, we need to update the probability distribution of the knee point as well as the residual unsorted list, the two parameters of the recursive function. According to Lemma 3, the update of the probability distribution of the knee point yields

$$P(x) \leftarrow \frac{P(x + d)}{\sum_{j=d+1}^{n} P(j)}, \quad x = 1, 2, \ldots, n - d. \quad (24)$$

In each recursive step, top-$d$ sorting can be done by running QuickSortTopK(L, 1, length(L), d), where $L$ is the objective list in the current step and length($L$) denotes the length of $L$.
A recursive version of the knee point search algorithm can be summarized as in Algorithm 3.
3.7. The Solution of the Optimization Problem. In this section, we will assume two forms of the probability distribution of the knee point and discuss the solutions of (18) under these presumptions. Although (26) is a discrete function, we still use differentiation to find the extremum, a method that strictly applies only to continuous and differentiable functions. Here we treat the discrete variables as continuous ones, so that (26) turns into a continuous and differentiable function. This is a reasonable approximation of the problem, which facilitates the analysis and the solution. The final solution is the rounded value of $d$ obtained by solving the continuous problem. For simplicity, we introduce the constants $c_1$, $c_2$, and $c_3$. By choosing $d$ such that the derivative of the objective with respect to $d$ vanishes, we obtain the condition (28) that the optimal $d$ satisfies. For large $n$ and $d$, (28) is approximated by (29). To get an explicit expression for the solution of the nonlinear equation (29), we use a heuristic approach to simplify the problem. We assume that $d$ is proportional to $n$, such that

$$d = pn, \quad (30)$$

where $p$ decides the optimum of $d$.
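The probability update after a failed search can be sketched as a truncation followed by a renormalization. The function name `update_after_miss` and the list representation (entry `x-1` holds the probability of position `x`) are assumptions for illustration.

```python
from typing import List, Sequence


def update_after_miss(P: Sequence[float], d: int) -> List[float]:
    """Posterior over the residual list after the top-d search has failed.

    Drops the first d positions and rescales the remaining mass to sum to 1.
    """
    tail = list(P[d:])
    mass = sum(tail)
    if mass <= 0.0:
        raise ValueError("no probability mass left beyond position d")
    return [p / mass for p in tail]


# A uniform prior over 10 positions stays uniform over the 6 residual positions.
uniform = [0.1] * 10
posterior = update_after_miss(uniform, 4)
```

The example illustrates why a uniform prior is preserved from one recursive step to the next: every surviving entry is divided by the same tail mass.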

Thus (30) yields
Input: unsorted list $L$ of length $n$; probability distribution $P$ of the knee point in the sorted list of length $n$.
Output: the knee point $e$, or its nonexistence ($e$ = null).

Theorem 7. If the probability distribution of the knee point is uniform and the optimal selection number takes the form of (30), the optimal method for the knee point search algorithm is binary search, that is, logarithmic search.
Proof. We first prove by induction that, in each recursive step of the search algorithm, $P(x)$ is equal for all $x \in Z(1, n)$. As the starting point of the induction, $P(x)$ is equal for all $x \in Z(1, n)$ in the first recursive step. Then, for the second recursive step, we can derive from (24) that $P(x)$ is again equal for all $x \in Z(1, n)$, where $n$ now denotes the length of the residual list.
It follows inductively that, in each recursive step of the knee point search algorithm, $P(x)$ is equal for all $x \in Z(1, n)$. So the optimization problem in each recursive step is none other than (25), whose solution, as discussed above, is a selection number equal to half of the length of the list. Here the list is the one left from the previous recursive step. Hence the optimal method for the search algorithm is binary search, that is, logarithmic search.
As an approximate treatment of the summation $\sum_{x=d}^{n} (1/x)$, we consider its behavior in two regimes. Particularly, for $d = 1$ and a large $n$, we have

$$\sum_{x=1}^{n} \frac{1}{x} \approx \ln n + \gamma,$$

where $\gamma$ is the Euler constant, with the approximate value 0.5772. For a large $d$, the summation is approximated as in (36).
Plugging ( ) and then (36)
Proof. We only need to prove that ( ) is a monotonically increasing function of $d$, for which the equivalent condition is (41). We assume that $d$ is an exponential function of $n$, such that

$$d = n^{a}, \quad 0 < a < 1. \quad (42)$$

Thus the left and right sides of (41) can be written, respectively, as (43) and (44). Comparing (43) and (44), we obtain (41) for a large $n$. That means that the optimal selection number is $n$ for the first step; that is, the optimal top-k sorting for the knee point search algorithm is full sorting.

DNS DoS Flooding Attacks.
The Domain Name System is a fundamental and indispensable component of the modern Internet [13,14]. The availability of the DNS can affect the availability of a large number of Internet applications. Ensuring DNS data availability is an essential part of providing a robust Internet.
In the past few years, some important DNS name servers at the top level of the DNS hierarchy were targeted by DoS or DDoS attackers, and some of these attacks succeeded in disabling the DNS servers, causing parts of the Internet to experience severe name resolution problems [15][16][17][18]. In particular, DNS DoS flooding attacks are attacks launched against DNS name servers with overwhelming traffic in order to disrupt the DNS service for legitimate clients. It is usually not easy to detect and defend against DoS flooding attacks efficiently, because the attacking traffic is blended with legitimate traffic, which complicates the task of telling them apart. Moreover, the detection mechanism should be implementable and should not add a heavy computational load. Here we focus on the source-based detection method and show that the problem of source detection of DNS DoS flooding attacks can be addressed by the knee point search in the sorted curve discussed in this paper.

Detection Using the Knee Point Search.
Generally, DNS name servers may receive queries from thousands of DNS clients (mostly DNS cache servers), whose traffic volumes are expected to remain far below those of DoS flooding attacks. The real-time query rates of all incoming sources can be counted by the traffic monitoring system residing at the border gateway in front of the DNS name server. The goal of DoS attack defense is to detect the attacking sources in real time and then filter out the attacking traffic from these sources. The detection problem is therefore equivalent to knee point search in the curve of sorted query rates, where all points above the largest knee point are identified as attacking sources. Time efficiency is also a key requirement, since timely attack detection means timely defensive action. Applying the knee point search algorithm proposed in this paper minimizes the expected detection time.

Learning the Knee Point Distribution.
The assumption on probability distribution of the knee point is the prerequisite for the knee point search algorithm. However, in the initial rounds of detection we have hardly any a priori knowledge about the knee point. But the distribution estimation of the knee point can be learned based on the empirical data obtained in all previous rounds of detection.
First, suppose that the knee point largely follows stationary random distribution; hence its distribution exhibits almost the same probability model in all rounds of detection. We can fit a statistical model to data and provide estimates for the model's parameters. Here we apply the method of maximum likelihood for the estimation.
We obtain the maximum likelihood estimate of $P(x)$, $x = 1, 2, \ldots, n-1$:

$$\hat{P}(x) = \frac{c_x}{N}, \quad x = 1, 2, \ldots, n-1,$$

where $N$ denotes the number of detection rounds so far and $c_x$ the number of those rounds in which the knee point was found at position $x$. At the beginning of each round of detection, if the previous round found the knee point at position $x^*$, $N$ and $c_x$, $x = 1, 2, \ldots, n$, are updated as follows:

$$N \leftarrow N + 1, \quad c_{x^*} \leftarrow c_{x^*} + 1.$$

The knee point distribution may evolve over time; thus the position of the knee point detected in recent rounds provides more reliable information for the estimation than that of earlier rounds. Taking the chronological order into consideration, we assign more weight to recent rounds than to earlier rounds. This can be done by progressively discounting the detection results of previous rounds. The discounting is performed when updating $N$ and $c_x$, $x = 1, 2, \ldots, n$, which sum the current detection and the previous ones at a discount $\lambda$, $0 < \lambda < 1$. Formally, the updating of $N$ and $c_x$, $x = 1, 2, \ldots, n$, can be modified as follows:

$$N \leftarrow \lambda N + 1, \quad c_{x^*} \leftarrow \lambda c_{x^*} + 1, \quad c_x \leftarrow \lambda c_x, \; x \in Z(1, n), \; x \ne x^*. \quad (53)$$
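The discounted update can be sketched with a per-position count list and a running total. The names `update_counts` and `estimate_distribution`, the 1-based `knee_pos` argument, and the sample rounds are hypothetical; setting `discount=1.0` gives the plain, undiscounted counting.

```python
from typing import List, Tuple


def update_counts(counts: List[float], total: float, knee_pos: int,
                  discount: float = 1.0) -> Tuple[List[float], float]:
    """Discount all earlier evidence, then credit the newly observed position.

    counts[x-1] is the discounted number of rounds with the knee at position x;
    knee_pos is 1-based.
    """
    new_counts = [discount * c for c in counts]
    new_counts[knee_pos - 1] += 1.0
    return new_counts, discount * total + 1.0


def estimate_distribution(counts: List[float], total: float) -> List[float]:
    """Estimate of the knee position distribution: count divided by total."""
    return [c / total for c in counts]


# Three hypothetical detection rounds over 4 positions with discount 0.5.
counts, total = [0.0, 0.0, 0.0, 0.0], 0.0
for observed in (2, 2, 3):
    counts, total = update_counts(counts, total, observed, discount=0.5)
estimated = estimate_distribution(counts, total)
```

In this run the most recent observation at position 3 outweighs the two older observations at position 2, which is the intended effect of the discount.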

Conclusion
Knee point search in the sorted curve is often used in the practice of anomaly detection and many other applications. Due to the inefficiency of total sorting, top-k sorting should be adopted for the knee point search. In this paper, a knee point search algorithm using cascading top-k sorting is proposed. The expected time complexity is minimized via optimizing the selection number in each step.