Parallel kd-Tree Based Approach for Computing the Prediction Horizon Using Wolf’s Method

. In different fields of science and engineering, a model of a given underlying dynamical system can be obtained by means of measurement data records called time series. This model becomes very important to understand the original system behaviour and to predict the future values of that system. From the model, parameters such as the prediction horizon can be computed to obtain the point where the prediction becomes useless. In this work, a new parallel kd-tree based approach for computing the prediction horizon is presented. The parallel approach uses the maximal Lyapunov exponent, which is computed by Wolf’s method, as an estimator of the prediction horizon.


Introduction
In large amount of applications in science and engineering, the main purpose to study a given dynamical system is to be able to obtain a prediction in order to carry out preventive actions.For example, in the case of the analysis of electrocardiogram signals, the information obtained through that study could be used to prevent heart related diseases; in air temperature regulation, the prediction information could be used to manage energy in an efficient manner either into a building or in a crop; moreover, breakdown in wind turbine generators caused by a broken wing could be prevented applying the same principles over the study of the wind speed.However, in all these cases and in the most part of the real-world applications based on dynamical systems, there is not a mathematical model or description of the underlying dynamical system.It is precisely the context in which the nonlinear time series analysis plays an important role.According to this kind of analysis, from time series of one single physic variable, for example, voltage, air temperature, or wind speed, it is possible to obtain parameters which describe the behaviour of the underlying dynamical systems represented for these time series.This work deals with the problem of computing an estimation for the prediction horizon, that is, the point where the prediction becomes useless.To calculate this value, we used the parameter known as the maximal Lyapunov exponent.From nonlinear time series analysis, there are several methods in the literature to obtain this parameter: Wolf et al. [1], Rosenstein et al. [2], and Kantz [3].The advantage of Wolf 's method is the conceptually simpler and direct approach which converges to the maximal Lyapunov exponent value without subsequent computing.However, the implementation presented in [1] uses a brute force approach which is enough for shorter time series but becomes very expensive in terms of the execution time for larger data records.As far as the authors know, there is not any implementation of the method considering other approaches.
Therefore, the main goal of this work is to obtain a runtime reduction of Wolf 's method, bearing in mind the time requirements of the real-world applications for the subsequent prediction stage.Thus, to improve the runtime of the method, we developed a parallel approach introducing kd-tree data structure into the implemented code.In order to assert the results, four time series have been considered as case studies.These data records of one single physic variable were obtained from the Lorenz system, the Hénon map, the Rössler system, and electrocardiogram signal.
The parallel approach was developed as part of a multidisciplinary framework.Into the framework, a group formed by computer scientists and physicists is working applying high performance programming onto nonlinear time series analysis algorithms.To present this particular work, we have structured the paper as follows.Wolf 's method is introduced in Section 2. The parallel approach is introduced in Section 3. The experimental results are shown in Section 4 and the conclusions and future works are presented in Section 5.

Wolf's Method
Into the context of nonlinear time series analysis, we mainly focused on time series from deterministic dynamical systems that exhibit sensitive dependence on initial conditions.For the purpose of quantifying this sensitivity, the maximal Lyapunov exponent  1 can be used.This exponent describes the average rate at which the predictability is lost [4].Thus, a crucial fact is that the prediction horizon  can be defined by Before computing  1 , let us introduce you to the basic concepts related with nonlinear time series analysis.Let  = {V  ∈ R |  = 0, 1, . . .,  − 1} be a time series, a set of sequence scalar values of one single physic variable which are typically measured at successive times and spaced at uniform time intervals.From these measurement values, points in delay coordinates, also known as delay vectors, can be constructed according to where  is the embedding delay and  is the embedding dimension (Fraser and Swinney [5]).Note that the number of delay vectors constructed in this manner is  =  − .
Takens' embedding theorem [6] states that, for a large enough embedding dimension greater than or equal to  ∈ N, where  > 0, the delay vectors yield a phase space that has exactly the same properties as the one formed by the original variables of the system.We can use the mutual information method [5] to obtain the proper value of  and the false nearest neighbors method [7] for determining the minimal embedding dimension .
For the reconstructed -dimensional phase space obtained from a single variable, we can compute  Lyapunov exponents Λ = {  ∈ R |  1 >  2 > ⋅ ⋅ ⋅ >   }.These exponents determine the rate of convergence or divergence for two initially nearby trajectories in that phase space.The fact that if one or more of these exponents are larger than zero we are in presence of a system that exhibits sensitive dependence on initial conditions is a well-established fact [8,9].In this case, two arbitrary close trajectories of the system will diverge apart exponentially, eventually ending up in completely different phase space areas as time progresses.
For our purpose, as we mentioned at Introduction, only the maximal Lyapunov exponent computation will be sufficient.To calculate this parameter we use the method introduced by Wolf et al. [1], referred to by Wolf 's method.The method works as follows.In the embedding dimension , we selected a point ⃗ V  as the starter point, also known as fiducial point, which by default is ⃗ V 0 .A near neighbor searching is performing to obtain a point ⃗ V  which satisfies the relation ‖ ⃗ V  − ⃗ V  ‖ < , where ‖ ⋅ ⋅ ⋅ ‖ is the square norm of a vector and  is a small constant.Then, both points are iterated forward in time for a fixed evolution time , which should be a couple of times larger than the embedding delay  but not much larger than .If the dynamical system has sensitive dependence on initial conditions, the distance   after the evolved time ‖ ⃗ V + − ⃗ V + ‖ will be larger than the initial ; that is,   ≫ .But, if the dynamical system does not have sensitive dependence on initial conditions, then  ≈   .After the evolution, a near neighbor searching is performing to find a new point ⃗ V  , the replacement point, whose distance to the evolved point ⃗ V + is less than .The angular separation  between the vectors constituted by the points ⃗ V + and ⃗ V + and ⃗ V + and ⃗ V  should be small.If we found a point ⃗ V  that satisfies the constraint, the point ⃗ V + is replaced by that point.This procedure is repeated until the fiducial point of the trajectory reaches the last one which is called the fiducial trajectory.Finally, the maximal Lyapunov exponent  1 can be calculated according to where  is the total number of replacement steps and ln denotes the natural logarithm.Figure 1 shows a partial snapshot of the trajectories followed by the computing, and Algorithm 1 depicts the algorithmic notation by Wolf 's method.The algorithm is called fet1 by fixed evolution time for  1 such as in the open source code presented by Wolf et al. [1].

Parallel Approach Using kd-Tree
The most time-consuming part in the algorithm fet1 is the neighbor searching (lines 10 and 17 in Algorithm 1).This task searches for the point using the brute force approach.This approach will be enough when working with shorter time series but not for working with larger data records.For this task, a review of neighbor search algorithms, which are especially useful for the study of time series data, can be found in Schreiber [10].
In our experience into the context of nonlinear time series analysis, we have worked with a kd-tree data structure to implement the false nearest neighbor method [7] onto a distributed memory platform [11].In that work, we showed that the implementation based on kd-tree provides better results in terms of the execution time.Thus, we decide to keep this data structure in the present work.

The kd-Tree Data Structure.
From the computational theory point of view, the kd-tree based algorithms [12,13] have the advantage of providing an asymptotic number of operations proportional to (log) for a set of  points, which until  was a minimum; (19) if a replacement point ⃗ V  was found then  =  else  =  + ; (20) Set  =  + ; V  is computed.The value of  1 converges according to (3) using the values of  and   obtained for each evolved step.
is the best possible performance for arbitrary distribution of elements.The disadvantage of the brute force search approach is the fact that it involves a lot of comparisons with all the array elements one by one.Then, the algorithm complexity of the searching is equivalent to () for a set of  points with  coordinates.Therefore, it is expected that the runtime will be reduced replacing the brute force algorithm by a kd-tree based algorithm.
In general, a kd-tree data structure considers a set of  points ⃗ V  ∈ R  , where  = 0, 1, . . .,  − 1.This data structure is a -dimensional binary search tree that represents a set of points in a -dimensional space.The variant described in Freidman et al. [13] distinguishes between two kinds of nodes: internal nodes partition the space by a cut plane defined by a value from the  dimensions; external nodes, also known as buckets, store the points in the resulting hyperrectangles of the partition.For instance, Figure 2 depicts a 2-dimensional space partitioned using cut planes defined by the dimension containing the maximum spread, and Figure 3 shows the kdtree that represents these partitions.
Although many different versions of kd-trees have been developed, their purpose is always to hierarchically decompose the -dimensional space into a relatively small number of buckets such that no bucket contains too many points.This decomposition provides a fast way to access any point by position.We traverse down the hierarchy until we find the bucket containing the point and then scan through the few points in the bucket to identify the right one.Figure 2: A 2-dimensional space partitioned using cut planes defined by the dimension containing the maximum spread.In the first stage, the whole space is split using the cut dimension  with the cut value 0.91.In the second stage, the bottom partition is split using the cut dimension  with the cut value 3.33, and the top partition is split using the cut dimension  with the cut value −0.4 and so on.Figure 3 shows the kd-tree that represents these partitions.
Figure 3: The kd-tree that represents the partitions in Figure 2. The internal nodes have the cut dimension and the cut value for each partition.For each level, all the points contained in the left subtree have values less than or equal to the cut value in the cut dimension; all the points contained in the right subtree have values greater than the cut value in the cut dimension.The external nodes, or buckets, store the points in the resulting hyperrectangles of the partitions.
To perform this work, we have implemented the kd-tree data structure and the kd-tree based functions using the outlines described in [13,14].The most important functions are the following: (i) newkdtree: building a kd-tree using as input a set of  points with  dimensions.By definition, a kdtree data structure has no repeated points, that is, points with similar coordinates.Due to the fact that the index of each point is very important in the course of Wolf 's method, we have updated the algorithms in [14] to implement a kd-tree that supports these feature.
(ii) frnn: performing a fixed-radius-near-neighbor query.This query reports all points within a fixed radius of the given target point under a specified metric.In this case, the metric used is the square norm of a vector.We also have updated the algorithms in [14] to use both the scalmn and the scalmx parameters (lines 10 and 17 in Algorithm 1).Thus, the resulting query is a set of points which have a distance between scalmn and scalmx to the fiducial point.

Tasks Decomposition.
We use data decomposition to implement the parallel approach.Conceptually, the set of  delay vectors ⃗  is splitted into  consecutive subsets of length /, where  is the number of processes.This data decomposition technique allow us to build a local kd-tree which represents a -dimensional subspace formed for all the  points in the subset ⃗   , where  = 0, 1, . . .,  − 1.Hence, the transposition of all the local subspaces represents the whole space as shown in Figure 4 as an example of 2-dimensional space and 4 processes.
We try to obtain a well-balanced task workload with this coarse-grained decomposition.For this goal, we use balanced trees; that is, to split each subspace in the resulting hyperrectangles, the median of the cut dimension is selected as the cut value.

Tasks Synchronization.
A master-slave model is implemented as the tasks synchronization.We assume a parallel platform model which allows us to run copies of a single program in different processes with unique identifiers for each.Hence, each process uses its process-identifier  to know if it is the MASTER or a slave, where  = 0, 1, . . .,  − 1.
For each near neighbors searching, each process calls the function frnn to obtain a set of points  from the local kd-tree which are between scalmn and scalmx of the fiducial point ⃗ V  .From , each process computes which is the local nearest neighbor which has the minimum distance and excludes the others.A synchronization between processes is performed to gather the local results in the MASTER.This process computes which is the global nearest neighbor from  candidates and a new synchronization is performed to broadcast the result from the MASTER to the slaves.Table 1: Runtime for the sequential brute force approach, considering the four case studies, using time series of one million data records.

Case study
(seconds) Lorenz [15] 764.103Hénon [16] 4878.367Rössler [17] 90.6 ECG 1094.323 Algorithm 2 depicts the algorithmic notation by the parallel Wolf 's method approach.We call this approach pkdfet1 by parallel kd-tree-fixed evolution time for  1 .Note that the MASTER process has the task to print  1 and  (see (1)) at the end of the entire processing.

Experimental Results
As we mentioned previously, the sequential brute force approach of Wolf 's method is enough for shorter time series but becomes very expensive in terms of the execution time for larger data records.This is put in evidence in Table 1, which depicts the sequential runtime   of the method presented in Algorithm 1, using data records of one million pieces of data.network.Each processor has 4 cores with 6144 KB of cache memory per each.This platform uses Red Hat Enterprise Linux operating system.
The experimental results are presented in terms of execution time, speed-up, and efficiency.These parallel metrics are defined as follows: Execution time: the sequential runtime of a program is the elapsed time between the beginning and the ending of its execution on a sequential computer.The parallel runtime is the time that elapses from the moment when a parallel computation starts to the moment when the last processing element finishes its execution.We denote the sequential runtime by   and the parallel runtime by   .
Speed-up: it is a measure that captures the relative benefit of solving a problem in parallel.It is defined as the ratio of the time taken to solve a problem in a single processing element to the time required to solve the same problem on a parallel computer, with  identical processing elements.We denote speedup by the symbol   .Mathematically, it is given by   =   /  .
Efficiency: it is a measure of the fraction of time for which a processing element is usefully employed; it is defined as the ratio of speed-up to the number of processing elements.We denote efficiency by the symbol   .Mathematically, it is given by   =   /.
We use the Message Passing Interface MPI [19] to implement the tasks synchronization into the source code of the parallel approach.The implementation of Algorithms 1 and 2 is written in C language (all the codes can be obtained from http://www.dsi.uclm.es/ntsa/).The results obtained are depicted in Figure 5.The runtimes involves the calculus performed from line 12 to line 30 in Algorithm 2.
For clarity, we present the results of the execution time in Table 2.Note that the parallel runtimes for the case  = 1 are not the same as those shown in Table 1.This is because at the first stage of this work we implement a sequential approach of Mathematical Problems in Engineering  Algorithm 1 using a kd-tree data structure, and with this new approach we take times for that case.Hence, as a first result of our work, we have obtained an important reduction of Wolf 's method which is from 6 (Hénon case) up to 55 (Rössler case) times faster than the original approach.In practice, at the second stage of this work, we introduce MPI in this new sequential approach to obtain the final parallel version.As you can note, for each case, the parallel metrics are quite different between them, despite the fact that all the cases have the same value of .We can explain this as follows.In first term, we have the fixed evolution time which determines the number of replacement steps that pkdfet1 must do, for example, we use  = 2 for Hénon and  = 120 for Rössler, corresponding with the maximal   and the minimal   , respectively.In second place, we have the size of the fixedradius-near-neighbor query which is determined by both scalmn and scalmx variables.At the same time, the query size determines the number of buckets that must be visited.This number of buckets must be increased or decreased or kept constant when we add a major number of processes.If the number of buckets decreases, then the performance gain achieved could be unexpected due to the best behaviour of the kd-tree based algorithms; for example, Hénon depicts a phenomenon known as superlinear speed-up from  = 2 to  = 16.If the number of buckets increases or remains constant, the speed-up has the expected behaviour as shown in the other cases, where   < .Moreover, the Hénon case does not have a superlinear speed-up in  = 32, probably due to the fact that the number of buckets cannot decrease more.In the Rössler case, we cannot achieve better results for parallel metrics due to the fact that the introduction of the kd-tree at the first stage of the work allows us to obtain a very small value for the neighbors searching using  = 1.So for  > 1 the synchronization has a most important influence on the parallel runtime.
In general, as we expected, the parallel runtime is better when we use a major number of processes, due to the less number of comparisons for each query.However,   also involves the processes synchronization.Hence, at the moment when the neighbors searching reaches a very small value, the synchronization has a most important influence on the parallel runtime.This is the situation shown in  = 32 for both the Lorenz and Rössler case studies.
Finally, the efficiency showed us that the parallel approach has a better performance when  ≤ 8 for almost all the cases.In a real-world application, the reasons to use less or more processes will be a decision of the researchers.
At this part, we want to mention that the kd-tree query is performed using a top-down approach over balanced trees.For a bottom-up approach and for unbalanced trees, we have obtained quite similar results.
To check the performance of the parallel approach using real time series, we download two time series, related with engineering system applications, from the UCI Machine Learning Repository [20].In these cases we use the parallel approach considering the better performance reached when  ≤ 8, as we commented before.We run the implementation of Algorithm 2 using default values for both the embedding delay and the minimal embedding dimension parameters.
The first case is a time series with measurements of electric power consumption in one household with a oneminute sampling rate over a period of almost 4 years.In particular, the time series is a household global minuteaverage active power (in kilowatt).The metrics for this case are depicted in Figure 6.
The second case is a time series with measurements of a chemical sensor exposed to gas mixtures at varying concentration levels.This time series was constructed by the continuous acquisition of the sensor signals for a duration of about 12 hours without interruption.In particular, the sensor was exposed to a gas mixtures of Ethylene and CO in air.The metrics for this case are depicted in Figure 7.
As you can see, both cases have similar performance using default values into the parallel approach, despite the fact that both cases have different value of : the first case has 2 million pieces of data and the second case has 4 million pieces of data.In general, as we can observe for the theoretical case studies, the parallel runtime is better when using a major number of processes; at the same time both the speed-up and the efficiency decrease with  ≤ 8, but this one is over 60% until 8 processes.

Conclusions and Future Works
A new parallel kd-tree based approach for computing the prediction horizon was presented.The results obtained can be extrapolated below over Wolf 's method, which is used as an estimator of the prediction horizon.Moreover, a new sequential kd-tree based approach for that method was developed as part of this work.This sequential approach is from 6 to 55 times faster than the original approach and by itself represents an improvement of the mentioned method.
In the case of the new parallel approach, this is from 2 to 31 times faster than the new sequential approach, considering from 2 to 32 processes onto a distributed memory platform.As was presented, the results depend on each particular case.The better results were obtained when the advantage in the data organization of the kd-tree data structure can be fully exploited.
In terms of the efficiency, for almost all the cases, the better performance is achieved using  ≤ 8. But, considering the time requirements of the real-world application, if we use 32 processes, the parallel program is from 170 to 297 times faster than the original brute force approach, which represents a time reduction from one hour and 20 minutes to 27 seconds in the best case.
As far as the authors know, to work with a kd-tree data structure onto a distributed memory platform, the literature suggests the same approach by different ways.That is, the local kd-tree building represents a subtree of the original tree formed by a sequential building.Here, a new approach is presented which proposes a decomposition of the whole space in subspaces represented by the local trees; that is, the data decomposition technique is focused on the phase space instead of the kd-tree.For this particular work, this new approach allows us to obtain a well-balanced task workload.
As a future work, we will apply the QR decomposition into Wolf 's method to improve the accuracy of the maximal Lyapunov exponent estimation.Through this new parallel implementation, we thought that the whole spectrum of Lyapunov exponents could be computing, despite the fact that we work with time series of one single physic variable instead of working with a model of the dynamical system.

Figure 1 :
Figure 1: A partial snapshot of Wolf 's method.The fiducial point is signed by ⃗ V  .The fiducial trajectory is formed by the points ⃗ V  , ⃗ V + , ⃗ V +2 , and ⃗ V +3 .For each evolved step, a new replacement point ⃗ V  for ⃗V  is computed.The value of  1 converges according to (3) using the values of  and   obtained for each evolved step.

Figure 4 :
Figure 4: The circles in the figure are the processes with a unique identifier.Each process has a local kd-tree which represents a 2-dimensional subspace formed by the data decomposition technique.The transposition of all the subspaces represents the whole space, such as the space shown in the example of Figure 2. The lines between processes represent the bidirectional synchronization.

Figure 5 :
Figure 5: Parallel metrics for the case studies using data records of one million pieces of data.(a) Execution time.(b) Speed-up.(c) Efficiency.

pFigure 6 :
Figure 6: Parallel metrics for a household global minute-average active power real case study, using data records of 2 million pieces of data.(a) Execution time.(b) Speed-up.(c) Efficiency.

pFigure 7 :
Figure 7: Parallel metrics for a chemical sensor exposed to gas mixtures real case study, using data records of 4 million pieces of data.(a) Execution time.(b) Speed-up.(c) Efficiency.

Table 2 :
Parallel runtime for the case studies using data records of one million pieces of data.