Learning Bounds of ERM Principle for Sequences of Time-Dependent Samples

Many generalization results in learning theory are established under the assumption that samples are independent and identically distributed (i.i.d.). However, numerous learning tasks in practical applications involve time-dependent data. In this paper, we propose a theoretical framework to analyze the generalization performance of the empirical risk minimization (ERM) principle for sequences of time-dependent samples (TDS). In particular, we first present the generalization bound of the ERM principle for TDS. By introducing some auxiliary quantities, we also give a further analysis of the generalization properties and the asymptotic behavior of the ERM principle for TDS.


Introduction
Let $\mathcal{X} \subset \mathbb{R}^{I}$ and $\mathcal{Y} \subset \mathbb{R}^{J}$ be an input space and the corresponding output space, respectively. Define $\mathcal{Z} := \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{K}$ with $K = I + J$. Many classical results of statistical learning theory are built under the assumption that samples $\{z_n\} := \{(x_n, y_n)\}$ are independently drawn from an identical distribution on $\mathcal{Z}$, that is, the so-called i.i.d.-sample assumption; see, for example, [1][2][3][4][5]. However, theoretical results that rely on the i.i.d.-sample assumption may not be valid for (or cannot be directly used in) the non-i.i.d. scenario.
There has been considerable research interest in the theoretical analysis of learning processes for time-dependent samples. Zhang and Tao [6,7] study the generalization properties of ERM-based learning processes for time-dependent samples drawn from Lévy processes and continuous-time Markov chains, respectively. Zou et al. [8] establish an exponential bound on the rate of relative uniform convergence for the ERM algorithm with dependent observations and derive the related generalization bounds for mixing sequences. Jiang [9] introduces a triplex inequality based on Hoeffding's inequality (cf. [10]) and then obtains probability bounds for uniform deviations in a very general framework, which is suitable for cases with unbounded loss functions and dependent samples. Zou et al. [11] present a novel Markov sampling algorithm to generate uniformly ergodic Markov chain samples from a given dataset. Xu et al. [12] adopt techniques similar to those of [11] to develop the error bound for an online SVM classification algorithm, which provides a higher learning performance than that of classical random sampling methods.
In this paper, we are mainly concerned with the generalization performance of the ERM principle for sequences of time-dependent samples (TDS), which are independently and repeatedly observed from an undetermined stochastic process at fixed time points. The generalization performance of such a learning process is affected by the following factors: the number of independent observation sequences, the time-dependence among samples in a sequence, the number of samples in a sequence, and the observing time points.
By introducing some auxiliary quantities, we propose a new framework to analyze the generalization properties of the learning process. In particular, we first show that the generalization bound of this learning process can be decomposed into four parts, $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\Phi_4$, which are related to the aforementioned factors. We then analyze the properties of these quantities. By imposing some mild conditions, we further obtain upper bounds on the first three quantities $\Phi_1$, $\Phi_2$, and $\Phi_3$. Finally, some techniques of statistical learning theory are applied to bound the quantity $\Phi_4$, for example, the uniform entropy number and the relevant deviation and symmetrization inequalities for independent sequences of TDS.
Different from previous works [6,7], we make no specific assumption on the distribution of the stochastic process from which the samples are drawn. In contrast, the previous works require that samples be observed from Lévy processes and continuous-time Markov chains, respectively. Instead of the purely $\mathcal{Z}$-valued function class appearing in the previous works, we use a function class consisting of functions evaluated on $\mathcal{Z}$ and the time interval $[T_1, T_2]$. Moreover, the samples considered in previous works [6][7][8] are a series of data observed from one particular stochastic process, while this paper discusses the learning process based on a number of independent sequences of TDS observed from a stochastic process at fixed time points.
Moreover, the works [9,11,12] only consider the case of one single sequence of TDS, while this paper studies the learning process for multiple sequences of TDS, where one sequence corresponds to one trajectory (sample path) of a stochastic process. Therefore, our results are more general than previous ones.
The rest of the paper is organized as follows. In Section 2, we first introduce some notions and notations used in the paper and then present the decomposition of the generalization bound of the ERM principle for TDS. In Section 3, we bound the four auxiliary quantities $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\Phi_4$ and present the main results of the paper. The last section concludes the paper.

Problem Setup
In this section, we formalize the main research issue of this paper and then show the decomposition of the generalization error of ERM principle for TDS.

ERM Principle and Generalization Bounds.
For any $t \ge 0$, let $x_t \in \mathcal{X}$ and $y_t \in \mathcal{Y}$ be the time-$t$ inputs and the corresponding outputs, respectively. Denote $z_t := (x_t, y_t) \in \mathbb{R}^{K}$ ($t \ge 0$) and assume that $\{z_t\}_{t \ge 0}$ is an undetermined stochastic process with a countable state space. Consider a function class $\mathcal{G} \subset \mathcal{Y}^{\mathcal{X} \times \mathcal{T}}$ consisting of functions evaluated on the input space $\mathcal{X}$ and the time interval $\mathcal{T} = [T_1, T_2]$. We would like to find a function $g^* \in \mathcal{G}$ such that, for any input $x_t \in \mathcal{X}$ ($t \in \mathcal{T}$), the corresponding output $y_t$ can be predicted as accurately as possible.
A natural criterion for choosing the function $g^*$ is the lowest expected risk (1) attained by a function in $\mathcal{G}$, where $g(x_t, t)$ is a function of the time-$t$ input $x_t$ and the time point $t$ (here, $x_t$ in $g(x_t, t)$ is just an input value of the functional $g(\cdot, t)$), $\ell \circ g$ is the composition of the loss function $\ell$ with $g$, and $P_t$ stands for the time-$t$ distribution of the stochastic process $\{z_t\}_{t \ge 0}$ on $\mathcal{Z}$. However, the distribution of $\{z_t\}_{t \ge 0}$ is unknown, and it is difficult to obtain the target function $g^*$ directly by minimizing the expected risk $\mathrm{E}(\ell \circ g)$.
Instead, the empirical risk minimization (ERM) principle provides a solution scheme for this issue. For $1 \le n \le N$, let $S_M^{[n]} := \{z_{t_m}^{[n]}\}_{m=1}^{M}$ be the $n$th sequence of samples observed from a certain stochastic process $\{z_t\}_{t \ge 0}$ at the fixed time points $T_1 \le t_1 < t_2 < \cdots < t_M \le T_2$, and let the sequences $S_M^{[n]}$ ($1 \le n \le N$) be independent of each other. The ERM principle aims to minimize the empirical risk (2) over $\mathcal{G}$, and the solution $\hat{g}$ is regarded as an estimate of the expected solution $g^*$ with respect to the sample sequences $\{S_M^{[n]}\}_{n=1}^{N}$. For convenience, we further define the loss function class $\mathcal{F} := \{\ell \circ g : g \in \mathcal{G}\}$ and call $\mathcal{F}$ the function class in the rest of this paper. Given $N$ sample sequences $\{S_M^{[n]}\}_{n=1}^{N}$, taking $f = \ell \circ g$ provides brief notations for expected risk (1) and empirical risk (2).
Similar to the classical statistical learning theory [13], the main issue of this paper is whether the empirical solution $\hat{g}$ provided by the ERM principle performs as well as the expected solution $g^*$. The supremum, over $f \in \mathcal{F}$, of the deviation between the expected risk and the empirical risk is called the generalization bound of the ERM principle for TDS $\{S_M^{[n]}\}_{n=1}^{N}$, and it plays an important role in analyzing the generalization performance of the above learning process.
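The ERM step described above can be illustrated with a minimal Python sketch. It assumes scalar inputs and outputs, a finite candidate class, a squared loss, and a plain average over all sequences and time points as the empirical risk; these choices, and the names `empirical_risk`, `squared_loss`, and `erm`, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def empirical_risk(g, loss, xs, ys, times):
    """Average loss of g over N sequences observed at M shared time points.

    xs, ys : arrays of shape (N, M); xs[n, m] is the input of sequence n at times[m].
    times  : array of shape (M,), the fixed observation time points.
    The plain average over all N*M samples is an assumption of this sketch.
    """
    xs, ys, times = map(np.asarray, (xs, ys, times))
    preds = g(xs, times[None, :])  # broadcast the time grid over the sequences
    return float(np.mean(loss(preds, ys)))

def squared_loss(pred, y):
    return (pred - y) ** 2

def erm(candidates, loss, xs, ys, times):
    """Return the candidate with the smallest empirical risk (finite class only)."""
    risks = [empirical_risk(g, loss, xs, ys, times) for g in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

# Toy data: N = 3 sequences, M = 4 time points, generated noise-free by y = t * x.
times = np.array([0.0, 1.0, 2.0, 3.0])
xs = np.array([[1.0, 2.0, 3.0, 4.0],
               [0.5, 0.5, 0.5, 0.5],
               [2.0, 1.0, 0.0, -1.0]])
ys = times[None, :] * xs

candidates = [
    lambda x, t: x,            # ignores the time argument
    lambda x, t: t * x,        # time-dependent predictor
    lambda x, t: 2.0 * t * x,  # time-dependent, wrong scale
]
g_hat, risk_hat = erm(candidates, squared_loss, xs, ys, times)
```

In this toy setting only the predictor that uses the time argument reaches zero empirical risk, which is the reason for evaluating the function class on both the input space and the time interval.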

Relationship with Some Time-Dependent Problems.
At the end of this section, we show that many time-dependent problems can be reduced to the aforementioned learning process based on the ERM principle, for example, the estimation of information channels and functional linear models.

Estimation of Information Channel.
Since the functions in $\mathcal{F}$ are evaluated on both the real input space $\mathcal{X}$ and the time interval $\mathcal{T}$, the framework proposed in this paper can describe the inherent characteristics of time-dependent problems more precisely than the learning framework given in the previous works [6,7], where the function class is only evaluated on $\mathcal{X}$. The time-dependent problem mentioned in [14], for example, the estimation of information channels, can also be included in this learning setting. Different from the previous one, the setting considered in this paper is also suitable for analyzing the performance of estimating a dynamic information channel.
The estimation of a dynamic information channel is based on the following model: $y_t = H(t)x_t + n(t)$, where the channel matrix $H(t)$ and the noise vector $n(t)$ vary with time. The corresponding function class $\mathcal{H}$ consists of the maps $(x_t, t) \mapsto H(t)x_t$.

Functional Linear Models.
Moreover, functional data classification or regression with functional linear models is also in accordance with the learning setting of this paper. The following are the most frequently used functional linear models mentioned in [15]:
(i) The model with a scalar input and a functional output: $y(t) = B(t)x + \epsilon(t)$, with the corresponding function class defined for any $t \in \mathcal{T}$.
(ii) The model with a functional input and a scalar output: $y = \alpha + \int_{\Theta} B(t)x(t)\,dt + \epsilon$, with the corresponding function class defined for any $t \in \mathcal{T}$.
In the above models, the loss function $\ell$ is usually chosen as the mean squared error, and functional data algorithms are then used to find the function that minimizes empirical risk (2) over the function class $\mathcal{F}$. We refer to [16] for more details on functional linear models.

Analysis of Generalization Performance.
The generalization bound can then be decomposed in terms of four quantities $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\Phi_4$. This result implies that the behavior of the generalization bound can be described by the sum of the quantities $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\Phi_4$. In the next section, we discuss the properties of each of these quantities.

Analysis of Relevant Quantities
As mentioned above, regardless of the distribution characteristics of $\{z_t\}_{t \ge 0}$, there are four factors affecting the generalization performance of the ERM learning process for TDS: the time-dependence, the number of TDS sequences, the number of samples, and the observing time points. These factors are related to the quantities $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\Phi_4$.
Moreover, two mild conditions are imposed for the following discussion:
(C1) There exists a constant $C_1$ such that, for all $t \in \mathcal{T}$ and $f \in \mathcal{F}$, $\left| \int f(z, t)\,dP_t(z) \right| \le C_1$.
(C2) Each $f \in \mathcal{F}$ is differentiable with respect to the time $t$, and there exists a constant $C_2$ such that, for all $f \in \mathcal{F}$, $z \in \mathcal{Z}$, and $t \in \mathcal{T}$, $\left| \partial f(z, t) / \partial t \right| \le C_2$.
Condition (C1) requires that any function $f \in \mathcal{F}$ have a bounded expectation with respect to the distribution of $\{z_t\}_{t \ge 0}$ at any time $t \in \mathcal{T}$, and Condition (C2) requires that all functions in $\mathcal{F}$ have a bounded first-order partial derivative with respect to $t$.
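A standard consequence of the second condition, spelled out here as a short worked step, is that every function in the class is Lipschitz in time. The sketch below writes $C_2$ for the constant in (C2) and takes $s, t \in \mathcal{T}$; these symbol names are assumptions made for illustration.

```latex
% Mean value theorem applied to u -> f(z, u) between s and t,
% with C_2 denoting the constant from (C2) (symbol name assumed).
\[
  |f(z,t) - f(z,s)|
  = \left| \frac{\partial f(z,u)}{\partial u}\Big|_{u=\tau} \right| \, |t - s|
  \le C_2 \, |t - s|,
  \qquad \text{for some } \tau \text{ between } s \text{ and } t .
\]
```

In words, replacing one observation time by a nearby one changes the value of any function in the class by at most a multiple of the time gap.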

Upper Bound of Quantity $\Phi_1$.
Under Condition (C1) and recalling (15), we can bound the quantity $\Phi_1$. The resulting bound implies that $\Phi_1$ is affected by the choice of the specific time sequence $\Gamma^* = \{t^*_0, t^*_1, \ldots, t^*_M\}$ that achieves supremum (13). Note that the time sequence $\Gamma^*$ achieving supremum (13) need not be unique; in that case, one sequence is singled out among the candidates. It can be observed that if the time points $t^*_1, \ldots, t^*_M$ are such that the intervals $\Delta^*_m$ ($1 \le m \le M$) all have equal length, the sampling-time error $\Phi_1$ is equal to zero. From the probabilistic perspective, this means that the selected sequence is the one closest to the uniform distribution among the candidate sequences. In other words, the more uniformly the time points in $\Gamma^*$ are distributed, the closer the quantity $\Phi_1$ is to zero. On the other hand, the uniform distribution of the time points $t^*_0, t^*_1, \ldots, t^*_M$ implies that, for the stochastic process $\{z_t\}_{t \ge 0}$, the distributions of $z_t$ are identical for all $t \ge 0$, which is then a single probability distribution.

Upper Bound of Quantity $\Phi_2$.
The bound on $\Phi_2$ is expressed through the so-called integral probability metric (IPM) between the distributions of the process at two time points, taken with respect to the function class. IPM plays an important role in measuring the discrepancy between two distributions in probability theory, and we refer to [17,18] for more details on IPM. According to (16), we then obtain a bound on $\Phi_2$. In particular, according to (23), if the stochastic process $\{z_t\}_{t \ge 0}$ has an identical distribution at any time, the quantity $\Phi_2$ is equal to zero.

Upper Bound of Quantity $\Phi_4$.
Similar to the classical statistical learning theory, the task of bounding $\Phi_4$ can be divided into three steps: the complexity measure of function classes, the deviation inequality, and the symmetrization inequality. Based on Azuma's inequality, we obtain a suitable deviation inequality for TDS and then the relevant symmetrization inequality. By using the uniform entropy number, we finally bound the quantity $\Phi_4$ in the sense of probability.
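Returning to the integral probability metric used for the second quantity: in its standard form it is the supremum, over a function class, of the absolute difference between the expectations of a function under two distributions. The Python sketch below estimates this from two finite samples over a small finite class; the function `empirical_ipm` and the three test functions are hypothetical choices for illustration, not the function class of the paper.

```python
import numpy as np

def empirical_ipm(functions, sample_p, sample_q):
    """Largest gap |mean f(sample_p) - mean f(sample_q)| over a finite function class."""
    sample_p = np.asarray(sample_p, dtype=float)
    sample_q = np.asarray(sample_q, dtype=float)
    gaps = [abs(np.mean(f(sample_p)) - np.mean(f(sample_q))) for f in functions]
    return float(max(gaps))

# Hypothetical bounded test functions standing in for the function class.
functions = [np.sin, np.cos, np.tanh]

# Same empirical distribution (a permutation of the same points): the gap vanishes.
same = empirical_ipm(functions, [0.0, 1.0, 2.0], [2.0, 0.0, 1.0])

# Shifted sample: the distributions differ, so the gap is strictly positive.
shifted = empirical_ipm(functions, [0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

The first call mirrors the remark that the quantity vanishes when the process has the same distribution at every time; the second shows a nonzero discrepancy once the distribution moves.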

Deviation Inequality.
Let $S_M = \{z_{t_m}\}_{m=1}^{M}$ be the observation sequence of the stochastic process $\{z_t\}_{t \ge 0}$ with a countable state space $\mathcal{Z}$. Following [19], a filtration associated with $S_M$ can be built as follows: (iii) For any $1 \le m \le M - 1$, let $\Omega_m$ be the $\sigma$-algebra generated by $\{z_{t_1}, \ldots, z_{t_{m-1}}\}$.
Then these $\sigma$-algebras are naturally nested. According to Azuma's inequality [19,20], since the observation sequences $S_M^{[1]}, \ldots, S_M^{[N]}$ are independent of each other, we obtain the following result.
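For reference, the classical Azuma (Azuma-Hoeffding) inequality invoked here can be stated as follows. This is the standard textbook form for a martingale with bounded increments, not the TDS-specific deviation inequality of this paper; the symbols $X_k$, $c_k$, $n$, and $\varepsilon$ are notation assumed for this statement only.

```latex
% Azuma-Hoeffding inequality, standard textbook form.
% (X_k)_{k=0}^{n}: a martingale with respect to a filtration;
% c_k: almost-sure bounds on the increments (notation assumed for this sketch).
\[
  |X_k - X_{k-1}| \le c_k \ \ (1 \le k \le n)
  \quad\Longrightarrow\quad
  \Pr\bigl\{ |X_n - X_0| \ge \varepsilon \bigr\}
  \le 2 \exp\!\left( - \frac{\varepsilon^2}{2 \sum_{k=1}^{n} c_k^2} \right)
  \quad \text{for all } \varepsilon > 0 .
\]
```

The filtration built from the observations supplies the martingale structure, and the boundedness of the functions in the class supplies the increment bounds.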

Approximate Error
The following is the definition of the covering number and we refer to [22] for details.
Definition 4. Let $\mathcal{F}$ be a function class and let $d$ be a metric on $\mathcal{F}$. For any $\xi > 0$, the covering number of $\mathcal{F}$ at radius $\xi$ with respect to the metric $d$, denoted by $\mathcal{N}(\mathcal{F}, \xi, d)$, is the minimum size of a cover of radius $\xi$.
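To make the definition concrete, the following Python sketch builds a cover greedily for a finite function class represented by its values on a sample. The size of a greedy cover is only an upper bound on the covering number, not the minimum itself; the empirical mean-absolute-difference metric and the name `greedy_cover_size` are assumptions of this sketch.

```python
import numpy as np

def greedy_cover_size(values, radius):
    """Size of a greedily built cover; an upper bound on the covering number.

    values : array of shape (K, n); row k holds function k evaluated on n sample points.
    radius : cover radius, measured as the mean absolute difference over the sample.
    """
    values = np.asarray(values, dtype=float)
    uncovered = np.ones(len(values), dtype=bool)
    size = 0
    while uncovered.any():
        center = values[np.argmax(uncovered)]            # first still-uncovered function
        dist = np.mean(np.abs(values - center), axis=1)  # distance of every function to it
        uncovered &= dist > radius                       # drop everything within the radius
        size += 1
    return size

# Four functions evaluated on four sample points.
values = np.array([[0, 0, 0, 0],
                   [0, 0, 0, 1],
                   [1, 1, 1, 1],
                   [1, 1, 1, 0]])

fine = greedy_cover_size(values, 0.1)     # every function needs its own ball
medium = greedy_cover_size(values, 0.25)  # functions pair up
coarse = greedy_cover_size(values, 1.0)   # one ball covers the whole class
```

The cover shrinks as the radius grows, which is the monotonicity that complexity measures such as the covering number are meant to capture.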
The uniform entropy number (UEN) is a variant of the covering number, and we again refer to [22] for details. Setting the metric to $\ell_p(S_M)$ ($p > 0$), the UEN is defined as $\ln \mathcal{N}_p(\mathcal{F}, \xi, M) := \sup_{S_M} \ln \mathcal{N}(\mathcal{F}, \xi, \ell_p(S_M))$. Based on the uniform entropy number, the upper bound of $\Phi_4$ can be obtained as follows.
Theorem 5. Assume that $\mathcal{F}$ is a function class such that, for any $f \in \mathcal{F}$ and any $t \in \mathcal{T}$, the functional $f(\cdot; t)$ is bounded with range $[a, b]$. Let $\{z_t\}_{t \ge 0}$ be a stochastic process with a countable state space. Let $\{S_M^{[n]}\}_{n=1}^{N}$ be $N$ independent observation sequences of $\{z_t\}_{t \ge 0}$ at the time points $t_1, t_2, \ldots, t_M$. Let $\{S_M^{\prime[n]}\}_{n=1}^{N}$ be the ghost samples of $\{S_M^{[n]}\}_{n=1}^{N}$ and denote $S_{2M}^{[n]} := \{S_M^{[n]}, S_M^{\prime[n]}\}$ for any $1 \le n \le N$. Then, given any $\xi > 0$, one has, for any $N \ge 8(b - a)^2/\xi^2$, the probabilistic bound (31).
Compared to the classical generalization results for i.i.d. samples (cf. Theorem 2.3 of [22]), upper bound (31) involves time-dependent data. This result also differs from the generalization bound of [7] because more than one sequence of time-dependent samples is investigated. Denote the right-hand side of (31) by $\epsilon$; under the notations of Theorem 5, we immediately obtain that bound (33) holds with probability at least $1 - \epsilon$. From result (33), we can see that the generalization performance of the ERM principle for TDS is affected by the following factors:
(i) The choice of the observation time points $t_1, t_2, \ldots, t_M$.
(ii) The number $N$ of TDS sequences.
(iii) The length $M$ of sequences.
(v) The complexity of function classes.
Moreover, the quantitative relationship among these factors is given explicitly in the above result.

Conclusion
In this paper, we propose a new framework to analyze the generalization properties of the ERM principle for sequences of TDS. By introducing four auxiliary quantities $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\Phi_4$, we give detailed insight into the interactions among the complexity of function classes, the size of samples, and the dependence among samples. To obtain the upper bound of $\Phi_4$, we develop the relevant deviation inequality and symmetrization inequality for sequences of TDS. This work is an extension of the classical techniques of statistical learning theory. In future work, we will consider relaxing conditions (C1) and (C2) and further investigate the properties of ERM-based learning processes for TDS sequences that are not observed at fixed observation time points.
Different from the i.i.d. learning setting, regardless of the distribution characteristics of $\{z_t\}_{t \ge 0}$, the generalization performance of the aforementioned learning process is also affected by the following factors:
(i) Time-dependence among samples in one sequence.