Prediction after a Horizon of Predictability: Nonpredictable Points and Partial Multistep Prediction for Chaotic Time Series

This paper introduces several novel strategies for multi-step-ahead prediction of chaotic time series. Introducing the concept of "generalized z-vectors" (vectors of nonsuccessive time series observations) makes it possible to generate sets of possible prediction values for each point to be predicted. By examining these sets, unified predictions are calculated, which are in turn used as a basis for predicting subsequent points. The key difference between the strategy presented in this paper and its conventional counterparts is the concept of "nonpredictable" points (points which the algorithm categorizes as "incalculable" and excludes from the calculations altogether). The results obtained for benchmark and real-world time series indicate that while the number of nonpredictable points typically tends to grow exponentially with the number of steps ahead to be predicted, the average error for the predicted points remains small and nearly constant. Thus, we redefine the problem of multi-step-ahead prediction as a two-objective optimization problem: minimizing both the number of nonpredictable points and the average error among the predictable ones. The resulting strategy demonstrates accurate results for both benchmark and real-world time series, with the number of predicted steps exceeding that of any other published algorithm.


Introduction
There are countless examples of chaotic systems in the world, which is evidenced by the constantly increasing number of chaotic time series forecasting algorithms. However, while algorithms dealing with one-step-ahead prediction demonstrate remarkable efficiency [1], their multi-step-ahead counterparts are still in their infancy. This may be attributed to the exponential growth of the average prediction error with increasing prediction horizon (the number of steps ahead to be predicted).
This exponential growth reflects Lyapunov instability, which is inherent to any chaotic system. According to the definition of Lyapunov instability, any initial difference between two neighboring trajectories, however small, grows exponentially over time, with the exponent equal to the largest Lyapunov exponent λ [2, 3]. Lyapunov instability leads to another concept, that of the horizon of predictability, sometimes referred to as "Lyapunov time" [4]. For a given observational error ε(0) and maximum admissible prediction error ε_max, the aforementioned exponential growth follows ε(t) ≈ ε(0)e^(λt), where ε(t) represents the error at moment t and λ is the largest Lyapunov exponent (positive for chaotic time series and negative for nonchaotic ones). The condition ε(t) ≤ ε_max gives an estimated horizon of predictability T ≈ (1/λ) ln(ε_max/ε(0)) [2, 3]. In the context of multi-step-ahead prediction, it follows that the smallest difference between true and predicted values at any intermediate position between the last observable point and the point to be predicted triggers exponential error growth in all subsequent points, regardless of the employed prediction method. It should be noted that the horizon of predictability should not be confused with the prediction horizon: the former is a theoretical upper boundary for the number of steps to be predicted (given ε(0) and ε_max), whereas the latter is simply the number of steps to be predicted. Most of the time, the prediction horizon h is significantly smaller than the horizon of predictability T: h ≪ T. However, the predictive clustering approach provides the necessary tools for dealing with the exponential growth of the average prediction error [5, 6].
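As a worked example of this estimate, take the value λ ≈ 0.92 obtained for the Lorenz series in Section 5 (integration step Δ = 0.1) and, purely for illustration, ε(0) = 10⁻⁴ and ε_max = 10⁻¹:

$$T \approx \frac{1}{\lambda}\ln\frac{\varepsilon_{\max}}{\varepsilon(0)} = \frac{\ln 10^{3}}{0.92} \approx 7.5 \ \text{time units} \approx 75 \ \text{steps},$$

which agrees with the 75-step horizon of predictability quoted for the Lorenz series in Section 5.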
First, predictive clustering utilizes motifs, repeated sequences of data in a series. If a section of a time series resembles the initial part of a motif "closely enough," it is likely that the subsequent points of the series will closely match the subsequent points of the motif, thus enabling the use of motifs for forecasting the time series at hand. Motifs can be obtained by clustering vectors of observations and calculating the cluster centers. It should be noted that whereas most available prediction methods attempt to construct a single unified prediction model, the motifs provided by predictive clustering form a set of local prediction models.
Second, Gromov and Borisenko [6] proposed using generalized z-vectors: vectors of nonsuccessive observations spaced out according to some predefined pattern. A pattern is a vector of distances between the positions of series observations. The resulting vector generalizes a conventional z-vector (successive observations), which corresponds to the pattern (1, 1, ..., 1) of length L − 1. Figure 1 demonstrates a z-vector constructed according to a pattern (k_1, k_2, ..., k_{L−1}), k_i ∈ N, whereas Figure 2 illustrates the way z-vectors are clustered into motifs and how motifs are used for obtaining predictions. The process of generating z-vectors can be visualized as placing a "comb" with L teeth spaced out according to the pattern, with all the other teeth "broken off." Moving the "comb" along the series, we obtain a vector of observations under the teeth of the "comb", (y_t, y_{t+k_1}, ..., y_{t+k_1+...+k_{L−1}}), for each position t. Each pattern (k_1, k_2, ..., k_{L−1}), k_i ∈ N, has its own set of such vectors (the sample corresponding to the given pattern), and each such sample is then clustered separately. This approach yields a set of possible values for each point to be predicted, with each set comprised of predicted values obtained using motifs corresponding to different patterns. Since the number of patterns can be arbitrarily large, so can the sizes of the sets of possible prediction values. Although points in those sets are not always statistically independent, a great variety of algorithms determining a unified predicted value based on these sets can be designed.
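The "comb" construction is easy to state in code. Below is a minimal sketch (the function name and the stand-in series are ours, not from the paper):

    import numpy as np

    def z_vectors(series, pattern):
        # Slide the "comb" defined by `pattern` (distances between teeth)
        # along `series` and collect the observations under the teeth:
        # for pattern (k1, ..., k_{L-1}) each vector is
        # (y_t, y_{t+k1}, ..., y_{t+k1+...+k_{L-1}}).
        offsets = np.concatenate(([0], np.cumsum(pattern)))  # tooth positions
        span = offsets[-1]
        return np.array([series[t + offsets] for t in range(len(series) - span)])

    y = np.sin(0.1 * np.arange(100))      # stand-in series
    print(z_vectors(y, (1, 1, 1))[0])     # conventional z-vector: y_0, y_1, y_2, y_3
    print(z_vectors(y, (2, 3, 4))[0])     # generalized z-vector: y_0, y_2, y_5, y_9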
Third, the concept of nonpredictable points can be extended. Previously, a point was considered nonpredictable if it had no corresponding motif(s); now it can also be considered nonpredictable if it is impossible to calculate a unified predicted value for this particular point based on its set of possible prediction values. An example of the latter would be a point whose set of possible prediction values consists of two equally sized clusters with different centers. The present work introduces a novel concept of nonpredictable points that the authors have not yet encountered in any literature on the topic. It should be noted that while our previous paper employed only the former type of nonpredictable points, we use both in this one.
The introduction of the concept of nonpredictable points renders the problem two-objective: on the one hand, we have to minimize the total number of nonpredictable points, and on the other, the average error among the predictable ones. Discarding some points as nonpredictable has proven to have a positive effect on the results: it is better for the algorithm to skip some points rather than force a prediction at every point. This approach is actually quite natural in some areas: for example, stock market traders do not have to make trades at every single moment in time; they can simply choose the moments that the algorithm recognizes as predictable.
Furthermore, since the patterns are comprised of nonsuccessive observations, the fact that the point at position t is nonpredictable does not imply that the points at subsequent positions t + 1, t + 2, ... will be nonpredictable as well. This is fundamentally important for multi-step-ahead prediction: if a point t + h needs to be predicted, the iterated strategy is used (or rather its modifications based on nonsuccessive observations). Namely, the intermediate values between points t and t + h are predicted step by step, with nonpredictable points being identified along the way and excluded.
In the context of dynamical systems [2, 3], each cluster of vectors (which by definition represent similar sections of a trajectory) corresponds to a particular area of a strange attractor. Areas with a high value of the invariant measure are associated with larger clusters (i.e., more frequently observed motifs), and vice versa. The number of clusters increases with the size of the training set, whereas the number of nonpredictable points and the average error among the predictable ones decrease. A large-scale simulation for the Lorenz series [7] supports this conclusion.
In conclusion, although the algorithm does not calculate a prediction for every point (hence the word "partial" in the title), it does make predictions up to the horizon of predictability (sometimes even surpassing it). The paper presents several methods of identifying nonpredictable points and corresponding strategies for multi-step-ahead prediction, as well as examples of partial predictions beyond the horizon of predictability for benchmark and real-world time series.
The rest of the paper is organized as follows:
(i) Section 2 reviews recent advances in the field
(ii) Section 3 formally states the problem
(iii) Section 4 outlines the employed clustering technique and ways of identifying nonpredictable points and calculating predictions
(iv) Section 5 provides the results for predicting the Lorenz series and hourly electric grid load values in Germany
(v) Section 6 contains conclusions

Related Works
Most of the recently published papers discussing chaotic time series prediction [1, 6, 8–11] concern themselves only with the one-step-ahead prediction problem, whereas the number of papers tackling the problem of multi-step-ahead prediction (MSAP) is considerably lower. An MSAP algorithm for chaotic time series consists of two parts, which are equally responsible for the accuracy of the final prediction. The first part is a technique used to make one- or few-steps-ahead predictions, whereas the second one "assembles" the results of the first part into a final multi-step-ahead prediction.
Algorithms used to make a one-step-ahead prediction employ nearly all fields of data mining and machine learning, such as
(i) Support vector regression and its modifications [9]
(ii) Multilayer perceptrons [12]
(iii) Clustering algorithms in predictive clustering [13–16]
(iv) LSTM neural networks [10]
(v) Voronoi diagrams [17]
(vi) Ridge polynomial neural networks [18]
(vii) Wavelet neural networks [19, 20]
(viii) Dilated convolutional networks [21]
Another factor of fundamental importance is the strategy that "assembles" one-step-ahead predictions into a multi-step-ahead one. There are two basic MSAP strategies, the iterated (recursive) and the direct [22, 23]; a toy sketch of both is given below. The iterated strategy carries out multi-step-ahead prediction step by step: new predicted values are calculated based on the predictions made in the previous steps. The direct strategy aims to obtain the result immediately, without calculating predicted values for intermediate positions; it applies prediction techniques for various prediction horizons, thereby providing predictions for any position to be predicted. Ben Taieb et al. [22] review different methods based on these two basic strategies. Sangiorgio and Dercole [10] apply both strategies with multilayer perceptrons and LSTM nets employed as tools to make one-step-ahead predictions.
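The difference between the two basic strategies can be illustrated with a toy sketch (the persistence model and all names here are illustrative stand-ins, not implementations from the cited works):

    import numpy as np

    def iterated_forecast(model, history, h):
        # Recursive MSAP: feed each one-step prediction back as an input.
        window = list(history)
        preds = []
        for _ in range(h):
            preds.append(model(np.array(window[-len(history):])))
            window.append(preds[-1])
        return preds

    def direct_forecast(models, history, h):
        # Direct MSAP: a separate model per horizon, no intermediate values.
        x = np.array(history)
        return [models[k](x) for k in range(h)]   # models[k] predicts k+1 steps ahead

    persistence = lambda x: x[-1]                 # toy one-step model
    print(iterated_forecast(persistence, [0.1, 0.2, 0.3], h=3))
    print(direct_forecast([persistence] * 3, [0.1, 0.2, 0.3], h=3))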
Unfortunately, MSAP methods designed within the aforementioned strategies suffer from the same exponential error growth, which gave rise to a number of hybrid strategies aiming to resolve it. In their review, Ben Taieb et al. [8] compare the two basic strategies and three novel ones (DirRec, MIMO, and DIRMO). DirRec (Direct + Recursive) combines the two basic strategies and uses the direct approach to predict values, with the number of inputs being enlarged iteratively to include the values of the most recently predicted positions. In the MIMO (multiple input multiple output) strategy [24, 25], an array of values is produced for all positions between the observed values and the prediction horizon (inclusive). This reveals any correlations the time series may have and allows them to be stored within the predicted values.
The DIRMO (DirRec + MIMO) strategy divides the series (up to some prediction horizon) into chunks and applies the MIMO strategy to each chunk separately. The authors test these five strategies on a reasonably large sample of different time series (the NN5 competition), which reflects various irregularities inherent to time series. Bao et al. [9] compare the efficiency of the iterated, direct, and MIMO strategies by performing one-step-ahead prediction using a modified support vector regression. According to the authors, the MIMO strategy compares favourably with the other two (all other factors being equal).
Chaotic time series prediction methods that rely on reservoir computing mark a different approach [26, 27]. Canaday et al. [28] indicate that such methods demonstrate excellent results for small prediction horizons; however, their performance worsens greatly when applied to larger horizons.
Multiple-task learning can be viewed as an MSAP strategy of its own. Chandra et al. [12] propose an algorithm to determine a neural network structure for solving MSAP problems; their approach can be considered a combination of the direct and iterated strategies. Wang et al. [29] utilise a neural model to combine periodic approximations for longer periods with machine learning models for the shorter ones; this could be viewed as a separate strategy for dealing with data with a marked periodicity. Ye and Dai [11] employ multitask learning for multistep prediction. Kurogi et al. [17] make use of an out-of-bag model for selecting models for multistep prediction; they aim at predicting chaotic series as far ahead as possible (the largest possible prediction horizon) while retaining reasonable accuracy, and present their results for the Lorenz series.
The importance of being able to make accurate predictions for an increasing number of steps was named the "Run for the Horizon" by the authors of the current paper in one of their previous publications [7]. The series of works by Sangiorgio et al. [10, 30] (culminating in a monograph [25] in 2021) approaches multi-step-ahead forecasting using neural networks, both classic feed-forward and recurrent (LSTM) architectures. The latter are traditionally trained with a supervisor, which speeds up the convergence of the optimization. The authors managed to make adequate predictions up to six Lyapunov times on benchmark series (the logistic and Hénon maps, as well as two generalized Hénon maps). Even though the authors did manage to considerably delay the exponential error growth, they failed to avoid it completely: the results indicate that after six Lyapunov times the error starts to grow exponentially. It should also be noted that the predictions achieved in [31], which relied on reservoir computing with data calculated by integrating the Lorenz-96 model, are similar in terms of prediction intervals.
To summarize, none of the aforementioned strategies is immune to the exponential growth of the average prediction error with an increasing prediction horizon. The present paper discusses a few novel strategies whose main difference from their classical counterparts is the concept of nonpredictable points and the ability to exclude clearly erroneous predictions at intermediate positions [6], thus weakening the exponential growth rate.

Problem Statement
This paper deals with h-steps-ahead prediction for a chaotic time series Y = {y_0, y_1, ..., y_t, ...}, h ∈ N, h > 0. We assume that all transient processes have completed and that the series itself reflects the movement of a trajectory in the neighborhood of a strange attractor. The third assumption is that the series meets the conditions of Takens' theorem (which makes the analysis of the attractor structure based on the series observations possible) [2, 3].
We divide the series into two parts: the training set Y_1 and the test set Y_2. Throughout, we use the convention ζ(∅) = 0 for any function ζ (so that sums and averages over empty sets of predictable points are well defined).
We denote the algorithm that constructs a set of possible predicted values S̃_{t+h} for position t + h as f_h: S̃_{t+h} = f_h({y_t, ..., y_{t−s+1}}). In addition to S̃_{t+h}, we consider the set of sets of possible predicted values corresponding to all patterns; its size is directly proportional to the number of employed patterns, which allows for implementing algorithms both for determining the final, unified predicted value at t + h and for identifying nonpredictable points. The concept of nonpredictable points implies that two operators are applied to the set S̃_{t+h}. The first operator determines whether the position is predictable at all. The second operator determines a unified predicted value from the set of possible predicted values (provided the position is predictable). Using these two operators, we can define the multi-step-ahead prediction process as a two-objective optimization problem (see the sketch below). Both functionals sum over all observations of the test set, with the first one minimizing the total number of nonpredictable points and the second one minimizing the average error among the predictable ones. The present paper attempts to solve this two-objective problem.
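A plausible formalization of the two functionals, assuming NP(h) and P(h) denote the sets of nonpredictable and predictable positions of the test set and using a squared-error measure (the recovered text does not fix the exact error metric), is

$$J_1 = |NP(h)| \longrightarrow \min, \qquad J_2 = \frac{1}{|P(h)|}\sum_{t' \in P(h)} \left(\hat{y}_{t'} - y_{t'}\right)^2 \longrightarrow \min,$$

with J_2 = 0 for P(h) = ∅ by the convention ζ(∅) = 0 introduced above.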

Algorithm. This subsection is organized as follows:
(i) The process of generating samples from the time series according to predefined patterns
(ii) The employed clustering techniques
(iii) Generating sets of possible prediction values
(iv) Quality measures of identification of nonpredictable points
(v) Ways of identifying nonpredictable points

Training Set.
The series is assumed to be normalized. Patterns (arrays of distances between nonsuccessive observations) are used to generate samples. Each pattern is an (L − 1)-tuple (k_1, k_2, ..., k_{L−1}) with k_i ∈ {1, ..., K_max}, where K_max is the maximum distance between any two consecutive observations in the pattern.
For example, let us consider a three-point pattern (2, 3, 4). The first vector of the corresponding training set would be (y_0, y_2, y_5, y_9) (the first z-vector); the next vector would then become (y_1, y_3, y_6, y_10), and so forth. The last vector of the training set would be (y_{t−9}, y_{t−7}, y_{t−4}, y_t), with y_t being the last observable value of the training set.
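The index arithmetic of this example can be checked with a throwaway snippet (not part of the authors' code):

    from itertools import accumulate

    pattern = (2, 3, 4)
    offsets = [0, *accumulate(pattern)]       # tooth positions: [0, 2, 5, 9]
    print([f"y_{o}" for o in offsets])        # ['y_0', 'y_2', 'y_5', 'y_9']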
In the context of predictive clustering, the conventionally used z-vectors comprising successive observations have proven to be less efficient than their counterparts comprising nonsuccessive observations [6]. We attribute this to the fact that vectors of nonsuccessive observations have more chances to store information about salient observations (minima, maxima, tipping points, etc.) and correlations between them.
The set of all combinatorially possible patterns ℵ(L, K_max) is used for generating training sets. Each set is generated separately, according to its corresponding pattern α ∈ ℵ(L, K_max). Vectors from these training sets can be used either "as is" (i.e., each vector is treated as a separate motif), or they can be clustered first, with the center of each of the formed clusters becoming a motif of its own.
We denote the set of motifs corresponding to the pattern α as Ξ_α; the obtained motifs can only be used with the pattern α. The set of all motifs is denoted as the union of Ξ_α over all patterns α ∈ ℵ(L, K_max).

Clustering Technique.
As a trajectory repeatedly moves through the same area of a strange attractor, similar recurring sequences of values associated with that particular area can be observed in the trajectory's time series. Centers of clusters of sequences associated with different areas of the attractor serve as the simplest prediction models for the respective areas [14]. To cluster sequences of observations, we employ a modified version of the Wishart clustering technique [32, 33] developed by Lapko and Chentsov [34]. It uses graph-theoretic concepts and a nonparametric probability density estimator based on r nearest neighbors. Some of the difficulties associated with the application of this algorithm to forecasting are discussed in [6].
A significance value of a point x is defined as p(x) = r/(V_r(x) · n), where V_r(x) and d_r(x) are the volume and the radius of a minimal hypersphere centered at x that contains at least r observations, and n is the number of sample vectors. The method relies on a proximity graph G(Z_n, U_n) whose vertices correspond to the sample vectors and whose edges connect vertices lying within each other's r-nearest-neighbor radius; the subgraph G(Z_q, U_q) is formed by the first q vertices Z_q (taken in ascending order of d_r) and the edge set comprised of all edges of U_n whose vertices belong to Z_q. Let w(x_q) be the cluster number label of x_q. A cluster c_l, l > 0, is said to be height-significant with respect to the height value μ > 0 if the difference between the maximum and minimum significance values p(x) over its points is at least μ. The algorithm then consists of the steps listed in Algorithm 1. After performing large-scale simulations for different parameter values, we determined the best values of r and μ to be 11 and 0.2, respectively.

Algorithm 1: The Wishart clustering algorithm.
    determine d_r(x_q), the distance to the sample's r-th nearest neighbor;
    sort the samples by d_r(x_q) in ascending order; q = 1;
    while q ≤ n:
        form the subgraph G(Z_q, U_q); x_q = the newly added vertex;
        if x_q is not connected to any cluster:
            start a new cluster;
        else, if x_q is connected to vertices of clusters c_1, c_2, ..., c_l, l ≥ 1:
            if all of these clusters are completed:
                w(x_q) = 0;
            else, if several of them are height-significant:
                label the significant clusters as completed;
                delete the labels of the insignificant clusters;
            else:
                merge clusters c_2, ..., c_l into c_1;
                w(x_q) = c_1; set w(x_i) = c_1 for the samples in c_2, ..., c_l;
        q = q + 1;
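A minimal sketch of the r-nearest-neighbor density estimate underlying the procedure is given below (scikit-learn and SciPy based; the helper name is ours, and r = 11 follows the value reported above):

    import numpy as np
    from scipy.special import gamma
    from sklearn.neighbors import NearestNeighbors

    def significance(X, r=11):
        # p(x) = r / (V_r(x) * n), with V_r(x) the volume of the minimal
        # hypersphere around x containing its r nearest neighbors.
        n, dim = X.shape
        nn = NearestNeighbors(n_neighbors=r + 1).fit(X)   # +1: the point itself
        d_r = nn.kneighbors(X)[0][:, -1]                  # radius to the r-th neighbor
        unit_ball = np.pi ** (dim / 2) / gamma(dim / 2 + 1)
        return r / (unit_ball * d_r ** dim * n)

    X = np.random.rand(500, 4)    # stand-in sample of z-vectors
    print(significance(X).round(3))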

Sets of Possible Predicted Values. We consider a set of motifs Ξ_α and construct a set of possible prediction values S̃^{p,α}_{t+h} associated with the pattern α. Namely, for a position t + h we are trying to predict, we compose a vector C of time series observations corresponding to the pattern α (all elements of C are assumed to be either observed or predicted in the previous step(s)). The next step is to calculate the Euclidean distance d between C and a truncated motif C^{α,trunc} (C^α ∈ Ξ_α), a vector comprised of all elements of a motif except for the last one. If d ≤ ε (where ε is a parameter of the algorithm), then the last element of the motif C^α becomes a possible predicted value at the position t + h. Thus, the set of possible predicted values at position t + h for the pattern α consists of the last elements of all motifs whose truncated parts lie within ε of C.
Elements of these sets can be weighted to reflect their reliability. The first technique relies on the notion of "reliability" of individual intermediate predictions used to calculate ŷ^i_{t+h}. Each prediction step results in a greater error of the final prediction at position t + h. Thus, in order to offset this error growth, we can assign a weight to an element of a set of possible predicted values based on the number of prediction steps made in order to calculate it. We assign weights ω_i = 1 to observed values (thus indicating that they are 100% reliable). Given a possible prediction value ŷ^i_{t+h} (either at the final point i = t + h or at intermediate points), its weight is calculated as the average of the weights of the preceding points of the pattern α times a step-down factor λ. The step-down factor λ ensures that predicted points receive a progressively smaller weight compared to the observed and earlier predicted points; we typically used λ = 0.99. The weight of a unified predicted value (either at the final or at intermediate points) is calculated as the average weight of those possible predicted values that were used to calculate it.
The second technique considers ω_i to be inversely proportional to the distance between the observed values and the motif chosen for the prediction, ω_i = 1/max(d, ε), where ε is a small threshold (typically equal to 0.05 for the purposes of the current simulation). The third approach calculates ω_i as the product of the results of the previous two methods.
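The matching step that fills a set of possible predicted values for a single pattern can be sketched as follows (the names and toy motifs are ours; the weighting schemes above would then be applied to the returned values):

    import numpy as np

    def possible_values(C, motifs, eps):
        # If the vector C of observed/previously predicted values lies within
        # eps of a motif's truncated part, the motif's last element becomes
        # a candidate prediction. `motifs` has shape (m, L); C has length L-1.
        d = np.linalg.norm(motifs[:, :-1] - C, axis=1)   # distances to C^{alpha,trunc}
        return motifs[d <= eps, -1]                      # last elements of close motifs

    motifs = np.array([[0.10, 0.20, 0.30, 0.40],
                       [0.10, 0.20, 0.90, 0.50],
                       [0.11, 0.21, 0.31, 0.42]])
    print(possible_values(np.array([0.1, 0.2, 0.3]), motifs, eps=0.05))  # [0.4, 0.42]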

Unified Predicted Value. The method of calculating a unified predicted value (UPV) depends on whether the algorithm's parameter p = 1 or p > 1. If p = 1, the UPV is calculated based solely on the set of possible prediction values S̃_{t+h}. Several techniques can be used to extract the UPV from S̃_{t+h}:
(1) [avg] Averaging over S̃_{t+h}: ŷ_{t+h} = (1/N_{t+h}) Σ_i ŷ^i_{t+h}, where N_{t+h} = |S̃_{t+h}|.
(2) Weighted averaging over S̃_{t+h} using the weights ω_i defined above.
(3) Clustering S̃_{t+h} and taking the center of the largest cluster as the UPV. DBSCAN [35] demonstrates good performance for one-dimensional data and was deemed acceptable for the task.
(4) Clustering only the elements of S̃_{t+h} with weights exceeding some threshold ω_0.
(5) Applying the two previous techniques to fuzzy sets: normalized weights can be viewed as values of a membership function that indicates belonging to a set of possible predicted values.
(6) Clustering S̃_{t+h} and picking the center of a randomly chosen cluster, with the probability of choosing a cluster proportional to the sum of the weights of its elements relative to the sum of the weights of all possible prediction values (roulette wheel).
(7) [mf] Choosing the mode of S̃_{t+h}.
(8) [mfp] Choosing the mode of S̃_{t+h} and adding a small amount of noise to it: ŷ_{t+h} = mode(S̃_{t+h}) + ζ(Δ), where ζ(Δ) is a normally distributed random variable with zero mean and variance Δ ≥ 0.
In the case when p > 1, we introduce the concept of a "prediction trajectory": a sequence of possible prediction values for every position from t to t + h. For p = 1, the algorithm deals with the set S̃_{t+h} exclusively; for p = h, it deals with the sets of possible predicted values for all the points y_t, y_{t+1}, ..., y_{t+h}. Let us denote the s-th possible prediction trajectory (PPT) as ξ̃^s, with ξ̃^s_{t+i} being its value at the i-th position and S_max being the maximum number of trajectories. The set of all PPTs ending at position t + h is then denoted as Ξ̃_{t+h}, with Ξ̃_{t+h}(i) being the set of values of its trajectories at position i; naturally, S̃_{t+h} ≡ Ξ̃_{t+h}(h) (see Algorithm 2). Among the techniques considered were the following: starting a new trajectory at the center of each cluster of S̃_{t+h}, thus making the total number of trajectories increase exponentially with the number of steps (a "garden of forking trajectories"); and starting a new trajectory at the center of each cluster of S̃_{t+h} while assigning each trajectory a weight equal to the product of the weights of its individual points, with trajectories whose weight falls below a certain threshold being filtered out at each step and not forming new trajectories. Regardless of the technique used, the UPV is calculated as an average of the last points of all trajectories from the set Ξ̃_{t+h}: ŷ_{t+h} = (1/S_max) Σ_s ξ̃^s_{t+h}(h). For trajectories relying on reduced initial information, we may also compare the last value of the trajectory starting at position t, ŷ_{t+h} = ξ̃^0_{t+h}(h), with a weighted average of the last values of trajectories starting at positions t − 1, t − 2, ..., t − a. Finally, all of the listed techniques can also be applied to the set of final points of Ξ̃_{t+h}. Based on the results of the research, the first technique proved to be the most efficient one.

Quality Measures of Identification of Nonpredictable Points.
Determining whether a point is predictable or not relies on either its set of possible prediction values or its set of prediction trajectories. For each prediction horizon h, we may consider the set NP(t, p, h) of nonpredictable points encountered while predicting h steps ahead from position t. The set of nonpredictable points at intermediate positions is defined as NIP(t, h) = NP(t, 1, h), and the full set NP(h) of all nonpredictable points from the test set is obtained by taking the union over all positions t of the test set.

In other words, this is the set of all positions that the algorithm failed to produce a prediction value for. A point becomes nonpredictable if either of the following holds: (1) its set of possible predicted values is empty, or (2) it is impossible to calculate a unified prediction value based on its set of possible predicted values. These sets depend heavily on the specific one-step-ahead prediction algorithm used, the value of ε, and the technique of identifying nonpredictable points. In order to develop evaluation criteria for algorithms identifying nonpredictable points, we need to consider two extreme cases: (1) not identifying nonpredictable points at all (the algorithm always uses the closest available motif); (2) assuming that the algorithm possesses a priori information about the actual values at intermediate positions, with a point marked as nonpredictable if the difference between the predicted and actual values satisfies d ≥ ε. It is worth noting that this algorithm does not replace its predictions with the true values: it simply excludes clearly erroneous predictions from the subsequent prediction operations. In our team's internal slang, this algorithm is named "the daemon": Socrates is known to have had his own daemon, who gave him advice in his most difficult situations.
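A minimal sketch of the "daemon" criterion (the function name is ours):

    import numpy as np

    def daemon_mask(y_true, y_pred, eps):
        # With a priori knowledge of the true values, mark a point as
        # nonpredictable when |prediction - truth| >= eps. Erroneous
        # predictions are excluded from subsequent steps, not replaced.
        return np.abs(np.asarray(y_pred) - np.asarray(y_true)) >= eps

    print(daemon_mask([0.50, 0.52, 0.60, 0.61],
                      [0.51, 0.58, 0.61, 0.60], eps=0.05))  # [False True False False]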
The remaining points comprise a set of ground-truth nonpredictable points for a given algorithm. We can create a set of predicted values for each point t + h that we are predicting, where ŷ_{t+i} = g(S̃^{(p)}_{t+i}) is the unified prediction value of the set of possible prediction values S̃^{(p)}_{t+i}. Consequently, the set of positions that the algorithm failed to make a prediction for, the corresponding set of nonpredictable points, and the set of ground-truth nonpredictable points GTNP(h) are defined accordingly. The overall prediction procedure is summarized in Algorithm 2.

Algorithm 2: The prediction procedure.
    procedure PREDICT(t, h)
        for i = 1 to h do
            for each pattern α ∈ ℵ(L, K_max) do
                C^α ⟵ (η^(α)_1, ..., η^(α)_{L−1}), the observed and previously predicted values under α;
                for each motif in Ξ_α do
                    if d(C^α, C^{α,trunc}) ≤ ε then add η^(α)_L to S̃_{t+i};
                end for
            end for
            if point t + i is predictable then
                calculate ŷ_{t+i} using the corresponding algorithm;
            end if
        end for
    end procedure

Simulations reveal that the second case is almost completely independent of the threshold value ε. Naturally, "real" algorithms do not possess a priori information about actual values at intermediate positions y_{t+i}, since they deal only with either sets of possible predicted values S̃^{(p)}_{t+i} or sets of predicted trajectories Ξ̃_{t+i}. Nevertheless, this method of identifying nonpredictable points is useful for determining a lower boundary of the prediction error. Methods of identifying nonpredictable points should be developed in such a way as to approximate this boundary as closely as possible; in the team's internal slang, we referred to this as "approximating the daemon."
The graphs in Figure 3 exemplify the dependence of the number of nonpredictable points (Figure 3(a)) and the average error on predictable points (Figure 3(b)) on the prediction horizon h for the two extreme cases. The blue curves correspond to the first extreme case (intermediate nonpredictable points are not identified at all), and the orange ones to the second (intermediate nonpredictable points are identified with a priori information). It can be seen that in the first case the number of nonpredictable points is equal to zero and the prediction error grows exponentially with h. The second extreme algorithm demonstrates the opposite: the number of nonpredictable points grows exponentially with h, while the prediction error remains nearly constant, bounded, and rather small. It is obvious that the first algorithm minimizes the functional (3), neglecting the functional (4); the second one minimizes the functional (4), neglecting the functional (3).
Figures 4(a) and 4(b) display the true values (blue solid) and intermediate predicted values (red dashed) for the Lorenz series for the first (Figure 4(a)) and second (Figure 4(b)) extreme cases. From the latter subfigure, we notice that the second algorithm does not make a prediction at every point, while making rather accurate predictions where it "decides" to predict. In the first case, on the other hand, the predicted trajectory diverges from the true one just after the point (a green disk in Figure 4(a)) that the second algorithm identifies as nonpredictable: the first algorithm must predict at this point (as at any point), and this immediately ruins the MSA prediction process, causing the predicted trajectory to diverge from the true one.
The results of algorithms identifying nonpredictable points fall somewhere between these two extreme cases. In the context of multi-step-ahead prediction, the second case is of more interest, since it allows one to make predictions up to a fairly large number of steps ahead. From this point on, all the algorithms discussed attempt to follow it.
A large-scale simulation suggests that a forecast algorithm will be efficient only if its set of identified nonpredictable points NP(h) is sufficiently close to the set of ground-truth nonpredictable points GTNP(h). The difference between the two constitutes an auxiliary quality measure for the techniques of identifying nonpredictable points. Two opposite scenarios can be observed: either the difference |GTNP(h) \ NP(h)| is large whereas |NP(h) \ GTNP(h)| is small, or vice versa. The first case corresponds to an overly optimistic algorithm, and the second to an overly conservative one.
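This quality measure reduces to two set differences (a sketch with test positions represented as integers):

    def np_quality(gtnp, found):
        # |GTNP \ NP| large -> overly optimistic algorithm;
        # |NP \ GTNP| large -> overly conservative algorithm.
        gtnp, found = set(gtnp), set(found)
        return len(gtnp - found), len(found - gtnp)

    missed, spurious = np_quality(gtnp={3, 7, 9, 12}, found={7, 9, 15})
    print(missed, spurious)   # 2 missed ground-truth points, 1 spurious flag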

Identification of Nonpredictable Points
Before discussing different ways of identifying nonpredictable points, we need to establish the concept of a "perfectly predictable" point: one whose set of possible prediction values consists of a single cluster containing the vast majority of its possible predicted values. Sets of possible prediction values for nonpredictable points, on the other hand, tend either to consist of multiple equally sized clusters or to be nonclusterable at all. In terms of probability distributions, predictable points correspond to a unimodal symmetric distribution with a relatively small variance, whereas nonpredictable ones correspond either to a uniform distribution or to a distribution with several pronounced modes.
In order to characterize nonpredictable points, we employed the following quantities:
(1) Multimodality: the percentage of possible predicted values remote from the main mode
(2) The difference between the α and 1 − α percentiles
(3) The second, third, and fourth central moments, as well as the kurtosis of the distribution and its entropy
We performed a large-scale simulation on the validation set Y_3 by calculating the sets of nonpredictable points NP(h) and ground-truth nonpredictable points GTNP(h) for each feature separately. A comparison of these sets makes it possible to estimate the best threshold values for each feature. It revealed that no single feature suffices by itself; therefore, it becomes necessary to take multiple characteristics into account at once. When clustering sets of possible prediction values, we take into consideration the following:
(1) The total number of clusters
(2) The size of the largest cluster compared to the size of the entire set
(3) The relative difference in cluster sizes
We use the following machine learning techniques to separate the aforementioned features into regions corresponding to predictable and nonpredictable points, respectively:
(1) [lr] Logistic regression
(2) [svm] Support vector machine
(3) [dt] Decision tree
(4) [knn] K-nearest-neighbors classifier
(5) [mlp] Multilayer perceptron with a single hidden layer of 8 neurons
In order to train and test this approach, a validation set Y_3 independent of both Y_1 and Y_2 is introduced and labeled using the set of ground-truth nonpredictable points. Since the number of nonpredictable points exceeds the number of predictable ones for larger prediction horizons, the sample is balanced by SMOTE (synthetic minority oversampling technique) [36], which generates new elements in the neighbourhood of the smaller class.
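A sketch of the resulting classification pipeline, with synthetic stand-in features and labels (the real features are those listed above, and the labels come from GTNP(h)); SMOTE is taken from the imbalanced-learn package:

    import numpy as np
    from imblearn.over_sampling import SMOTE              # pip install imbalanced-learn
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 6))                        # stand-in feature vectors
    y = (rng.random(1000) < 0.15).astype(int)             # imbalanced 0/1 labels

    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)   # balance the sample
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    print(clf.predict(X[:5]))                             # 1 = nonpredictable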
In addition to the above classifying methods, we employ their ensembles:
(1) [adb] AdaBoost: an algorithm training each subsequent classifier on the elements incorrectly classified by the previous classifier
(2) [lrs] Stacking: applying several classifiers to produce a feature space for a combined algorithm
(3) [lrv] Voting: assigning a label to each element by the classifiers' votes
When analyzing a set of possible prediction trajectories Ξ̃_{t+h} using the second approach, trajectories converging at certain points indicate predictability. Thus, predictable points correspond to the regions of convergence of the possible prediction trajectories.
Several techniques for identifying nonpredictable points were developed (a sketch of the spread-based ones follows this list):
(1) Calculating an average spread k_avg over the sets of possible prediction values: a point is classified as nonpredictable if its spread exceeds k_avg times some factor ϑ.
(2) Instead of considering the spread itself, one can consider its variation over a number of successive positions: the point is classified as nonpredictable if the spread increases monotonically. Based on the results of a simulation, the best option here is to consider three successive points.
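Both spread-based techniques admit a compact sketch (the max-min definition of "spread" is our assumption; the toy sets are illustrative):

    import numpy as np

    def spread(values):
        return np.max(values) - np.min(values) if len(values) else np.inf

    def nonpredictable_by_spread(S, k_avg, theta):
        # Technique (1): flag the point if its spread exceeds theta * k_avg.
        return spread(S) > theta * k_avg

    def nonpredictable_by_growth(S_window):
        # Technique (2): flag the point if the spread grows monotonically
        # over successive positions (three positions worked best).
        s = [spread(S) for S in S_window]
        return all(a < b for a, b in zip(s, s[1:]))

    sets = [np.array([0.50, 0.51]), np.array([0.48, 0.55]), np.array([0.40, 0.60])]
    print(nonpredictable_by_growth(sets))   # True: the spread grows over 3 positions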

Numerical Results
The methods discussed in the previous sections are applied to a benchmark (Lorenz) and a real-world (electric grid load) time series. Samples are generated and clustered using all possible patterns consisting of four elements (L = 4) with the maximum distance between neighboring positions K_max = 10; thus, the total number of used patterns equals |ℵ(L, K_max)| = K_max^(L−1) = 1000. The Lorenz series is calculated by integrating the Lorenz system with parameters σ = 10, b = 8/3, r = 28 using the fourth-order Runge–Kutta method with integration step Δ = 0.1. The result is a typical chaotic series used as a conventional benchmark for testing chaotic time series forecasting methods.
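A minimal sketch of generating such a benchmark series (the choice of the x-coordinate as the observable and the initial condition are our assumptions):

    import numpy as np

    def lorenz_series(n, dt=0.1, sigma=10.0, b=8.0 / 3.0, r=28.0, discard=3000):
        # Integrate the Lorenz system with fourth-order Runge-Kutta and return
        # the x-coordinate series, discarding the transient (as in the paper).
        def f(s):
            x, y, z = s
            return np.array([sigma * (y - x), x * (r - z) - y, x * y - b * z])

        s = np.array([1.0, 1.0, 1.0])
        out = np.empty(discard + n)
        for i in range(discard + n):
            k1 = f(s); k2 = f(s + dt / 2 * k1)
            k3 = f(s + dt / 2 * k2); k4 = f(s + dt * k3)
            s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            out[i] = s[0]
        return out[discard:]

    y = lorenz_series(11000)                  # 10000 training + 1000 test points
    y = (y - y.min()) / (y.max() - y.min())   # min-max normalization (the paper only
                                              # states that the series is normalized)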
In order to distinguish between chaotic and merely noisy series, we check their positions on the entropy-complexity plane. The values for the Lorenz series are 0.68 and 0.45, respectively, indicating that the series is chaotic. Furthermore, the largest Lyapunov exponent λ calculated for the Lorenz series using the Eckmann algorithm has a value of 0.92, which corresponds to the results of Malinetskiy and Potapov [3]. Its strictly positive value confirms the inherently chaotic nature of the series.
For testing purposes, the first 3000 observations of the series were discarded to ensure that the trajectory moves in the neighbourhood of its respective strange attractor. The sizes of the training and testing sets are 10000 and 1000 observations, respectively.
The real-world series is a series of electric grid load values in Germany from 2014-12-31 to 2016-02-20, measured at 1-hour intervals. The entropy and complexity of the series are 0.499 and 0.372, respectively, and its largest Lyapunov exponent is 0.125, which indicates the series' chaotic nature. The series was analyzed in the same way as the Lorenz series, with the results presented in Tables 1-4.
A large-scale simulation suggests that classifying as nonpredictable only those points which lack corresponding motifs results in an exponential increase in both the total number of nonpredictable points and the average prediction error. Based on Figure 2, it may be concluded that the choice of a specific technique used to calculate the unified prediction value has little effect on the rate of growth. However, the choice of a specific technique of identifying nonpredictable points has a much larger impact (see Tables 5 and 6).
As a quick reminder, the unified predicted value can be calculated in the ways summarized in Table 5, while nonpredictable points are identified using several methods from various fields of machine learning and mathematics, summarized in Table 6. The results of each combination of techniques are presented in the graphs and tables. The graphs present the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) for each technique alongside those of the two extreme cases (see Quality Measures of Identification of Nonpredictable Points) for comparison.
The graphs presented below are of two types. The first type displays a predicted trajectory up to a certain prediction horizon alongside the "real" trajectory; the second type shows error measures (namely, the root mean squared error, the mean absolute percentage error, and the number of nonpredictable points) for the different versions of the algorithm against different prediction horizons. Furthermore, each graph of the second type also contains plots of its respective error measure for the aforementioned two extreme cases: using a priori information, and not identifying nonpredictable points at all.
A large-scale simulation reveals that in the first extreme case (only points that have no corresponding motifs are considered nonpredictable) both the number of nonpredictable points and the errors among the predictable ones grow exponentially with the prediction horizon.
The second extreme case (where the aforementioned algorithms are also taken into consideration when determining the predictability of a point) shows the opposite situation. Figure 5 demonstrates MAPE, RMSE, and the number of nonpredictable points as functions of the prediction horizon. The orange line corresponds to the method using a priori information; gray, to the version of the "rapid growth" algorithm that uses Wishart clustering for clustering sets of possible predicted values;
red, to the version of the "rapid growth" algorithm that uses DBSCAN for clustering sets of possible predicted values; and blue, to the version of the "rapid growth" algorithm that averages sets of possible predicted values. It should be noted that, when comparing the results with those of others [25] that also demonstrate an ability to predict multiple Lyapunov times ahead, the almost-constant (nonexponential) error behavior intrinsic to some of the algorithm's variants allows for making predictions even more Lyapunov times ahead; however, this comes at the cost of an increasing number of nonpredictable points, and the work by Sangiorgio et al. [25] eventually demonstrates exponential error growth. Of all the machine learning algorithms used, logistic regression and the AdaBoost ensemble (Figure 6) stand out: their error measures remain nearly constant with increasing prediction horizon and are comparable to those produced by the algorithm using a priori information for identifying nonpredictable points. However, their number of nonpredictable points grows exponentially with the prediction horizon, at a rate larger than that of the second extreme case. The method employing a multilayer perceptron, on the contrary, demonstrates a moderate rate of growth of the number of nonpredictable points along with rapid error growth.
When considering different techniques of identification of nonpredictable points, simulations reveal that the versions of the algorithm based on sets of trajectories perform better than their counterparts based on sets of possible predicted values. It can be seen from the tables that their number of nonpredictable points is close to that of the extreme case using a priori information, while the average prediction error is constrained between the two extreme cases. This behavior is typical for all methods of calculating the unified predicted value and identifying nonpredictable points. It should be noted that the main objective is still achieved: instead of increasing exponentially, the average prediction error remains constant and relatively small. Figure 7 demonstrates an example of a predicted energy load series (Tables 3 and 4).
Figure 8 indicates that the algorithm is able to predict values up to (and sometimes beyond) the estimated horizon of predictability of 75 steps (hence the title of the article).
In order to compare our algorithm with the results of other researchers, we recreated the simulation by Kurogi and colleagues [17]. Figure 8 displays the Lorenz series (blue), the prediction made by Kurogi et al. (green), and the values predicted by our algorithm (dashed red line with dots). As can be seen from the figure, while the trajectory predicted by Kurogi et al. diverges from the actual one, the algorithm discussed in the present paper correctly identifies nonpredictable points, and the predicted points approximate the true trajectory quite accurately.
In conclusion, it should be noted that the results obtained on the real-world German energy time series seem to confirm that noise (stochasticity and nonstationarity) dramatically affects the forecasting accuracy: the prediction is reliable only for the first day ahead due to observation noise and the daily periodicity that affects many real phenomena [30, 37]. However, the current paper did not specifically concern itself with the algorithm's behavior in the case of chaotic series with a significant noise component; we hope to make this the subject of future research.

Figure 6: RMSE, MAE, and percentage of nonpredictable points of various ML algorithms compared to those of the "ideal" version. Black solid lines correspond to logistic regression with AdaBoost; round red dotted lines, to a support vector machine with AdaBoost; square orange dotted lines, to a regular logistic regression; green dashed lines, to a multilayer perceptron; blue dash-dotted lines, to a support vector machine; the long red dashed line corresponds to the "ideal" version that uses a priori information. The subplot order is the same as in Figure 5.

Conclusion
The current paper introduces several novel strategies for multistep prediction of chaotic time series. Generalized z-vectors, comprised of nonsuccessive time series observations, allow for obtaining a set of possible values for each point a prediction is being made for. Upon examining such a set, the point can be classified as either predictable (in which case its value is estimated) or nonpredictable. The concept of nonpredictable points renders multistep prediction a two-objective problem, in which both the number of nonpredictable points and the average prediction error for the predictable ones have to be minimized.
The results indicate that while the number of nonpredictable points grows exponentially with the number of steps (reaching 80-90%), the average prediction error remains nearly constant. This allows for making predictions up to a considerable number of steps ahead (although skipping some intermediate points), sometimes even exceeding the horizon of predictability. This was tested on the Lorenz and real-world time series.
It was concluded that the choice of a technique of identifying nonpredictable points has a much greater impact on the accuracy of the final prediction than the choice of a technique of estimating the actual value of the point itself. Large-scale simulation reveals that prediction methods based on sets of possible trajectories deliver more accurate results than those based on sets of possible values, allowing for predictions beyond the horizon of predictability.

Figure 3: Lorenz series. The number of nonpredictable points (a) and RMSE (b) vs. the prediction horizon for the two extreme cases. The blue curves correspond to the first extreme case (intermediate nonpredictable points are not identified at all); the orange ones correspond to the second (intermediate nonpredictable points are identified with a priori information).

Figure 4: True (blue solid) and intermediate predicted (red dashed) values for the Lorenz series for the first (a) and second (b) extreme cases. A green dot indicates a point that the algorithm using a priori information identifies as nonpredictable.


Figure 7: German energy time series: true trajectory (blue curve) and predicted points (red dashed curve with disks). The algorithm employs possible predicted trajectories and identifies nonpredictable points using spread growth.
The Lorenz Series: Comparison with Results by Kurogi and Colleagues.

Figure 8: The Lorenz series: true trajectory (blue curve) and predicted points (red dashed curve with dots). The green line presents the results by Kurogi and colleagues. For the sake of comparison with other works, note that 72 steps roughly correspond to 7 Lyapunov times.

Table 1: The Lorenz series: nonpredictable points and error measures. Bold values represent column minimums for ease of identification of the corresponding algorithms.

Table 2: The Lorenz series: sets of nonpredictable points.

Table 3: Electricity load series: nonpredictable points and error measures. Bold values represent column minimums for ease of identification of the corresponding algorithms.

Table 4: Electricity load series: sets of nonpredictable points.

Table 5: Abbreviations for the methods to calculate the unified prediction value.
avg: Average. Averaging the set of possible prediction values or trajectories.
wavgl: Weighted average, length. Averaging the set of possible prediction values or trajectories, with weights determined by the prediction length.
wavgd: Weighted average, distance. Averaging the set of possible prediction values or trajectories, with weights determined by the distance to the cluster center.

Table 6: Abbreviations for the methods to identify nonpredictable points.
dt: Decision tree. Applying a decision tree with an entropy criterion and no restriction on tree depth to the features of the sets of possible predicted values.
knn: K-nearest neighbors. Applying a k-nearest-neighbors classifier (k = 3) to the features of the sets of possible predicted values.
mlp: Multilayer perceptron. Applying a multilayer perceptron with a single hidden layer of 8 neurons and a sigmoidal activation function to the features of the sets of possible predicted values.
adb: AdaBoost. Assembling logistic regression classifiers using AdaBoost.
adblr: AdaBoost + logistic regression. Logistic regression classifiers assembled with AdaBoost.
adbsvm: AdaBoost + SVM. Assembling SVM classifiers using AdaBoost.
lrs: Logistic regression, stacking. Stacking of logistic regression classifiers.
lrv: Logistic regression, voting. Assembling logistic regression classifiers by voting.
dbscan: DBSCAN. Clustering the sets of possible predicted values at each intermediate point using DBSCAN and calculating the growth rate of the total number of clusters over several consecutive points.
wshrt: Wishart. Clustering the sets of possible predicted values at each intermediate point using Wishart clustering and calculating the growth rate of the total number of clusters over several consecutive points.
cwat: Compare weighted average, trajectories. Comparing the main predicted trajectory with a weighted average of the other predicted trajectories.
amd: Average modulus of difference. Calculating an average of the modulus of a difference.
wamd: Weighted average modulus of difference. Calculating a weighted average of the modulus of a difference.
wamdt: Weighted average modulus of difference, trajectory. Comparing a weighted average of the modulus of a difference of the main trajectory to that of the other trajectories.
rg: Rapid growth. The spread grows monotonically over several consecutive points.
rgdbscan: Rapid growth, DBSCAN. DBSCAN is used to obtain the centers of clusters of possible predicted values.
rgwshrt: Rapid growth, Wishart. Wishart clustering is used to obtain the centers of clusters of possible predicted values.