Long-time predictive modeling of nonlinear dynamical systems using neural networks

We study the use of feedforward neural networks (FNN) to develop models of nonlinear dynamical systems from data. Emphasis is placed on predictions at long times, with limited data availability. Inspired by global stability analysis, and the observation of the strong correlation between the local error and the maximum singular value of the Jacobian of the ANN, we introduce Jacobian regularization in the loss function. This regularization suppresses the sensitivity of the prediction to the local error and is shown to improve accuracy and robustness. Comparison between the proposed approach and sparse polynomial regression is presented in numerical examples ranging from simple ODE systems to nonlinear PDE systems including vortex shedding behind a cylinder, and instability-driven buoyant mixing flow. Furthermore, limitations of feedforward neural networks are highlighted, especially when the training data does not include a low dimensional attractor. Strategies of data augmentation are presented as remedies to address these issues to a certain extent.

from complex data, and in building relationships between features and outputs. Neural networks with a single hidden layer and nonlinear activation function are guaranteed to be able to predict any Borel measurable function to any degree of accuracy on a compact domain [17].
The idea of leveraging neural networks to model dynamical systems has been explored since the 1990s. ANNs are prevalent in the system identification and time series modeling community [28,35,29,22], where the mapping between inputs and outputs is of prime interest. Billings et al. [7] explored connections between neural networks and the nonlinear autoregressive moving average model (NARMAX) with exogenous inputs. It was shown that neural networks with one hidden layer and sigmoid activation function represents an infinite series consisting of polynomials of the input and state units. Elanayar et al. [52] proposed the approximation of nonlinear stochastic dynamical systems using radial basis feedforward neural networks. Early work using neural networks to forecast multivariate time series of commodity prices [10] demonstrated its ability to model stochastic systems without knowledge of the underlying governing equations. Tsung et al. [50] proposed learning the dynamics in phase space using a feedforward neural network with time-delayed coordinates.
Paez and Urbina [31,51,32] modeled a nonlinear hardening oscillator using a neural networkbased model combined with dimension reduction using canonical variate analysis (CVA). Smaoui [44,45,43] pioneered the use of neural networks to predict fluid dynamic systems such as the unstable manifold model for bursting behavior in the 2-D Navier-Stokes and the Kuramoto-Sivashinsky equations. The dimensionality of the original PDE system is reduced by considering a small number of Proper Orthogonal Decomposition (POD) coefficients [6]. Interestingly, similar ideas of using principal component analysis for dimension reduction can be traced back to work in cognitive science by Elman [13]. Elman also showed that knowledge of the intrinsic dimensions of the system can be very helpful in determining the structure of the neural network. However, in the majority of the results [44,45,43], the neural network model is only evaluated a few time steps from the training set, which might not be a stringent performance test if longer time predictions are of interest.
ANNs have also been applied to chaotic nonlinear systems that are challenging from a datadriven modeling perspective, especially if long time predictions are desired. Instead of minimizing the pointwise prediction error, Bakker et al. [3] satisfied the Diks' criterion in learning the chaotic attractor. Later, Lin et al. [24] demonstrated that, even the simplest feedfoward neural network for nonlinear chaotic hydrodynamics can show consistency in the time-averaged characteristics, power spectra, and Lyapunov exponent between the measurements and the model.
A major difficulty in modeling dynamical systems is the issue of memory. It is known that even for a Markovian system, the corresponding reduced-dimensional system could be non-Markovian [11,34]. In general, there are two main ways of introducing memory effects in neural networks. First, a simple workaround for feedforward neural networks (FNN) is to introduce time delayed states in the inputs [29]. However, the drawback is that this could potentially lead to an unnecessarily large number of parameters [21]. To mitigate this, Bakker [3] considered following Broomhead and King [8] in reducing the dimension of the delay vector using weighted principal component analysis (PCA). While the second approach uses output or hidden units as additional feedback. As an example, Elman's network [21] is a recurrent neural network (RNN) that incorporates memory in a dynamic fashion.
Miyoshi et al. [27] demonstrated that recurrent RBF networks have the ability to reconstruct simple chaotic dynamics. Sato et al. [41] showed evolutionary algorithms can be used to train recurrent neural networks to capture the Lorenz system. Bailer-Jones et al. [2] used a standard RNN to predict the time derivative in discrete or continuous form for simple dynamical systems; this can be considered an RNN extension to Tsung's phase space learning [50]. Wang et al. [53] proposed a framework combining POD for dimension reduction and long-short-term memory (LSTM) recurrent neural networks, and applied it to a fluid dynamic system.
We limit ourselves to feedforward neural networks, since there are still many unanswered questions about modeling dynamical systems even in this simplest form. It is known that time delayed FNNs closely resemble simple RNNs trained with teacher forcing [15]. Further, RNNs are not easy to train since standard training algorithms (e.g., back propagation through time [38]) are likely to introduce stronger overfitting than FNN due to vanishing gradients [15]. Recently, sparse regression (SINDy) [9,26] has gained popularity as a tool for data-driven modeling. The idea is to search for a sparse representation of a linear combination of functions selected from a library. In this work, we will compare it with FNN-based models and highlight some differences.
The paper is organized as follows: the problem description is provided in section 2 and the mathematical formulation of standard and Jacobian-regularized FNNs is presented in section 3. Results and discussion are presented in section 4. We first present a comparison with SINDy for simple dynamical systems. Then we highlight the importance of stabilization to control the global error of predicted trajectory and the impact of Jacobian regularization. Finally, we apply the model in a nonlinear PDE system where a low dimensional attractor is not realized and discuss the limitations of black-box modeling of dynamical system and propose data augmentation as remedies. Conclusions are drawn in section 5.

Consider a dynamical system in Euclidean space
Similarly, one can define a discrete dynamical system induced by the above smooth dynamical system by considering a constant time step ∆t ∈ R and a state transition map F d (x) = φ ∆t (x) : Equivalently, one can rewrite the above system as where F r : R M → R M resembles a first order solution [2] to F c . Our goal is to find an approximation to the dynamics, either in (i) a discrete sense F d , given i=0 uniformly sampled from a trajectory given initial condition where N is the number of data points. It must be mentioned that -as highlighted in the result section -data does not have to be collected on the same trajectory.
Depending on the way one defines the training and testing set, two types of problems are considered in the current work: 1. prediction of a certain trajectory starting from an initial condition that is different from the training trajectories. 2. prediction of the future trajectory given past information of the trajectory as training data. Conservatively speaking, the success of tackling the first of the above problems requires the trajectories in the training data to be representative of the distribution in the region of interest, which may or may not be feasible depending on how informative the data is. In the context of modeling dynamical systems, it is often implied in previous literature [45] that the initial condition of unseen testing data is not far away from the training data. The second problem can also be difficult since it will challenge the effectiveness of the model as past information might not be sufficient for the model to be predictive on unseen data. Again, it is often implied in previous works [53,49,43], successful predictions are often accompanied by an underlying low dimensional attractor so the past states as training data can be collected until it becomes representative of the future.

Mathematical Modeling Framework
In this section, we first define performance metrics of the approximation to the dynamics f , then introduce the standard FNN model and the Jacobian-regularized FNN model. Finally, techniques to mitigate overfitting are described.

Definitions of error metrics
To measure the prediction error for each sample in an a priori sense (i.e., given exact x i ), we define the local error vector ξ i local ∈ R M for the i-th sample (x i , y i ) as where y i ∈ R M is the i-th target to learn from the i-th feature x i ∈ R M . For example, the feature is the state vector at i-th step, x i and the target can be x i+1 for discrete dynamical system oṙ x i for continuous dynamical system. Then, we can define local error at the i-th sample by where · 2 : R M → [0, +∞) is the vector 2-norm, i.e., l 2 norm, and |·| : R → [0, ∞) is the absolute value.
We can further define the local error of the i-th sample for the j-th component as shown by The local error assumes that the i-th input feature x i , is predicted accurately. On the other hand, the global error vector is defined by eq. (7), in which x i is obtained by iterative prediction, i.e., a posteriori evaluation, at the i-th step from an initial condition through either time integration or transition function as a discrete map. That is, x i is obtained from f ( x i−1 ) in a recursive sense as follows Similarly, the global error is defined by and for the j-th component specifically by Further, to obtain a holistic view of the model performance in feature space, if F d or F c is known, either in the continuous or discrete case, we can define stepwise error as Note that e stepwise is not restricted by the training or testing trajectory, but it can be evaluated arbitrarily in the region of interest. Finally, we consider the uniform averaged coefficient of determination R 2 as a scalar metric for measuring regression performance where R 2 j is given by where n sample is the number of samples in the validation data,ȳ j = 1 n sample n sample −1 i=0

Basic model: densely connected feedforward neural network
The basic model approximates F c in eq. (1) for the continuous case, and F r in eq. (3) in the discrete case using a feedforward neural network. The existence of an arbitrarily accurate feedforward neural network approximation to any Borel measureable function given enough number of hidden units is guaranteed from the property of the universal approximator [18]. It should be noted that our basic model is related to Tsung's phase-space-learning model [50]. If the Markovian assumption is adopted, the training feature matrix snapshots X and training target matrix snapshots Y are as follows: where M is the dimension of the state, N is the total number of snapshots of training data, learning target Y is the time derivative, and the subscript stands for the index of the component. Note that each component of the feature and target are normalized to zero mean and unit variance for better training performance in the neural network. By generally constructing a densely connected feedforward neural network f (·): R M → R M with L − 1 hidden layers and output layer as linear, the following recursive expression is defined for each hidden layer: where η 0 stands for the input of the neural network x, η l ∈ R n l , n l ∈ N + is the number of hidden units in layer l, and σ l is the activation function of layer l. Note that the output layer is linear, i.e., σ L (x) = x: where parameters of the neural network are For example, if we consider using two hidden layers where L = 3 and the number of hidden units are the same, the full expression for the neural network model is given bŷ where x ∈ R M is the state of the dynamical system, i.e., the input to the neural network and y ∈ R M is the modeling target, i.e., the output of neural network. σ(·): R → R is a nonlinear activation function.
Sets of weights and biases are The problem is to find the set of parameters of W 3 and b 3 that result in the best approximation of the underlying ground truth (F c , F d or F r ). Under the framework of statistical learning, it is standard to perform empirical risk minimization (ERM) with mean-square-error loss. The set up and parameters corresponding to the desired solution f (x, W * , b * ) can be written as where I train is the index set of training data, and x i and y i correspond to the i-th feature-target pair.
To deal with the high dimensionality of the optimization problem, we employ Adam [20], a gradient-based algorithm, which is essentially a mixture between momentum acceleration and rescaling parameters. The weights are initialized using a truncated normal distribution to potentially avoid saturation and use the Automatic Differentiation (AD) provided by Tensorflow [1] to compute the gradients. The neural network model is implemented in Python using the Tensorflow library [1]. Due to the non-convex nature of eq. (18), for such a high degree of freedom of parameters, one can only afford to find a local minimum. In practice, however, a good local minimum is usually satisfactory [15]. Hyperparameters considered in current work for the basic model are the number of units for each hidden layer n h and activation function σ(·).
Model selection for neural networks is an active research area [4,46,5]. Well known methods involve grid search/random search [4]/Tree of Parzen Estimators (TPE) [5]/Bayesian optimization [46] with cross validation. We pursue the following trial-error strategy: 1. Given the number of training points, computing the number of equations to satisfy if the network overfits all the training data. 2. Pick a neural network with uniform hidden layer structure to overfit the training data with the number of parameters in the network no more than 10% to 50% of the number in step 1. 3. Keep reducing the size of neural network by either decreasing the hidden units or number of layers until the training and validation error are roughly the same order. 4. For the choice of other hyperparameters, we simply perform grid search.

Jacobian regularized model
In standard FNNs, minimizing mean-squared-error on the training data only guarantees model performance in terms of the local training error. It does not guarantee the reconstruction of even training trajectory in the a posteriori sense.
Here, we take a closer look at the error propagation in a dynamical system for the FNN model when evaluated in an iterative fashion, i.e., a posteriori sense. Without any loss of generality, considering the discrete case, after we obtain the model f , we can predict x i+1 given x i Moreover, given F d , we can find the ξ i+1 global given x i and ξ i global as follows, Consider a Taylor expansion of f ( where H is the Hessian matrix evaluated at some point between x i and x i + ξ i global . Assuming ξ i global 2 1, H 2 is bounded, and the high order terms are negligible compared to the Jacobian term we have . (23) Similarly, in the continuous case, we have The right hand sides of eq. (23) and eq. (25) contain contributions from the global error and accumulation of local error. Optimization as in eq. (18) can minimize the latter term, but not necessarily the former. This suggests that manipulating the eigenspectrum of the Jacobian might be beneficial for stabilization by suppressing the growth of the error. Due to the simplicity of computing the Frobenius norm compared to the 2-norm, we consider penalizing the Frobenius norm of the Jacobian of the neural network model. In the context of improving generalization performance of input-output neural network models, similar regularization has been also proposed by Rifai [37]. It should be noted that our purpose is to achieve better error dynamics in a temporal sense, which differs from the generalization goal in deep learning. Thus, one may seek a locally optimal solution that can suppresses the growth in global error while minimizing the local error.
The regularized loss function inspired from the above discussion is thus where J is the Jacobian of the neural network output with respect to the input, and λ is a hyperparameter. On one hand, it should be noted that regularizing the Frobenius norm of the eigenspectrum of the Jacobian indirectly suppresses the magnitude of the eigenvalue of the Jacobian. On the other hand, excessive weighting on the magnitude of the eigenvalue would lead to less weighting on local error, which might result in an undesirably large local error. Thus, λ should be set as a relatively small value without strongly impacting the model performance in an a priori sense.

Reducing overfitting
Overfitting is a common issue in the training of machine learning models, and it arises when models tend to memorize the training data instead of generalizing true functional relations. In neural networks, overfitting can occur from poor local minima and is partially due to the unavoidable non-convexity of an artificial neural network. Overfitting cannot be completely eliminated for most problems, given the NP-hard nature of the problem. Generally, overfitting can be controlled by three kinds of regularization techniques: The first follows the Occam's razor principle, e.g., L1 sparsity regularization [9]. However there is no guarantee that Occam's razor is appropriate for all cases, and finding the optimal sparsity level is often iterative. The second is to smooth the function, e.g., using weight decay [15]. The third type is especially suitable in iterative learning, e.g., early stopping, which is a widely used strategy in the deep learning community [15]. In this work, we found validation-based early stopping to be sufficient. We split the data further into pure training and validation sets, and then monitor overfitting by measuring R 2 .

Results & Discussion
Given sequential training data, the capability of the basic FNN is first evaluated in two-dimensional dynamical systems with polynomial non-linearities in section 4.1 and non-polynomial-non-rational dynamics in section 4.2. The basic model is compared with SINDy [9], a method that directly aims to learn functional models using L 1 sparse regression on a dictionary of candidate basis functions. In section 4.3, we demonstrate that the basic model performs better than SINDy on the problem of incompressible flow behind a cylinder, in spite of the explicit addition of quadratic terms to the dictionary. In addition, the local error is found to be strongly correlated with the maximal singular value of the Jacobian, thus serving as an inspiration for Jacobian regularization. In section 4.4, we demonstrate the stabilizing aspect of Jacobian regularization for the problem of laminar wake behind a cylinder, where the system exhibits a low dimensional attractor. In section 4.5, we assess the ability of our regularized FNN model to approximate a dynamically evolving high-dimensional buoyancy-driven mixing flow system that is characteristic of flow physics driven by instabilities. The results show that, for systems that do not exhibit a low dimensional attractor, it is difficult for a black-box model to have satisfactory long-time prediction capabilities. In section 4.6 we show that predictive properties can be improved by data augmentation in the state space of interest.

2D polynomial system: Van der Pol oscillator
The first order forward discretized scheme of the Van der Pol (VDP) system is given by where ∆t = 0.1 and µ = 2.0. The modeling target is Our goal is to reproduce the dynamics governed by F r from on data collected from a single trajectory. The loss function of basic model in eq. (18) is optimized using training data from a single trajectory, containing 399 data points. Test data containing 599 points is generated using a different initial condition. The data distribution of the training and testing features is shown  hidden layers with each layer containing 8 hidden units. Two hidden layers are accompanied by Swish nonlinear activation as σ(x) = x · sigmoid(βx) where in practice β is fixed as unity [36]. The output layer is linear. Randomly 20% of training data is used as a validation set and we monitor the performance on the validation set as a warning of overfitting. In fig. 1, the learning curve suggests that the model is well-trained and overfitting is not observed. Results of a priori and a posteriori prediction are shown in fig. 2. The basic model predicts the F r at each training point very well a priori, but slight phase lag is observed a posteriori in testing, which originates from the extrapolation of the testing data initially.
The variation of the local and global error together with the maximal singular value of the Jacobian is shown in fig. 3. For training data, e local is observed to be relatively uniform, as expected since the objective optimized is MSE uniformly across all training data points. For testing data, e local exhibits peak values near the beginning of the trajectory as expected, since the first few points are far away from the training data shown in fig. 1. Moreover, it is interesting to observe that in fig. 3, the peak of the temporal history of local/global error shows strong correlation with the maximal singular value of the Jacobian.
Stepwise error contours are displayed in fig. 4. The region of large error close to red (implying the difference of the step-wise vector between neural network prediction and ground truth is large) is located near the corner of figure, where there is a dearth of training points. The model performs well near the training points as expected. In this case, since testing data is not very far away from the training data, good performance of extrapolation can be expected. However, we would like to note that there is a moderate amount of error associated with the vector direction in fig. 4 not only at the corners but also near the origin. This implies that a feedforward neural network can generalize to some extent, but with no guarantees, even in regions enclosed by training data. The results also confirm that the known result that for a dynamical system with an attractor, the neural network can reproduce the dynamics near the attractor [2,56,7,3,49].
With the prior knowledge that the system is polynomial in nature, one can use polynomial basis functions to extract the ground truth. To illustrate this, results obtained from SINDy [9] are shown in fig. 5, with threshold parameter as 2 × 10 −4 , maximal polynomial order as 3, and no validation data set considered. As displayed in fig. 6, the excellent result of SINDy shows the advantage of finding the global features where parameters obtained are not restricted to the scope of training data since the ground truth is governed by sparse polynomials.

2D non-polynomial system: a non-rational non-polynomial oscillator
The success of SINDy is a consequence of the fact that the underlying system can be represented as a sparse vector in a predefined basis library such as that consisting of polynomial or rational functions [26]. Here, we choose a different case: a non-rational, non-polynomial oscillator with ∆t = 0.004:   Here the basic model in eq. (18) is optimized using 1199 data points of a single trajectory. Testing data contains 1799 points. Randomly 20% of training data is taken as the validation set, but also included in later evaluation. The feature distribution in phase space is shown in fig. 7. Hyperparameters are listed in table 2 and 128 minibatchs and 20000 epochs are used. The training error and validation error is also shown in fig. 7.  Results for a priori and a posteriori performance on training and testing data are shown below in fig. 8. The training trajectory is perfectly reconstructed while the predictions show slight deviation. The distribution of the local and global error is shown in fig. 9. Again, we observe that maximal local/global error correlates with the peaks of the maximal singular value of the Jacobian. It is interesting to note that the highest local testing error occurs at the peak of the maximal singular value of the Jacobian, instead of at points close to the initial condition.
The error contour in fig. 10 shows that stepwise error around training trajectory is below 0.1. It is important to note that model performance deteriorates at places far away from the training trajectory, especially at the right corner shown in fig. 10.
For SINDy, the polynomial order is set to three and threshold as 2 × 10 −4 . A priori and a posteriori validation for training and testing is shown in fig. 11. Correspondingly, the stepwise error contour displayed in fig. 12 shows the misfit for the region of interests ranging from -1 to 3 for both two components. Because there is no sparsity in polynomial basis in this case, it is expected that SINDy cannot reconstruct the dynamics correctly and would perform worse than the basic model of FNN. The implication is that for strongly non-polynomial systems, neural networks are far more flexible compared to SINDy.

Nonlinear PDE system: flow behind a cylinder
In this section, we compare the basic model with SINDy in reconstructing the flow in a cylinder wake. The data is from Brunton et al. [9] which comes from an immersed boundary method solution [47] of the 2D incompressible N-S equations with Re = 100 based on the cylinder diameter. The computational domain consists of a non-uniform grid with near-wall refinement.  The inlet condition is uniform flow and the outlet is a convective boundary condition to allow the vorticity to exit the domain freely. Testing data is generated as a temporal extension of states that lie on a limit cycle at the boundary of training data, which indicates this is not an extrapolation task. To work with such a high-dimensional nonlinear PDE system, we use the coefficients of two POD modes [6] and one 'shift mode, which represents the shift of short-term averaged flow away from the POD space of the first two harmonic modes to reduce the spatial dimension. More details on POD and 'shift-modes' are provided in Refs. [30,6]. Training and testing data is the same as in Brunton et al. [9] where the first 2999 snapshots in time are used for training, and a later 2994 snapshots used for testing. A random 10% of training snapshots is considered as validation set but also included in later evaluation. The distribution of training data and testing data is shown in fig. 13.
Hyperparameters of the basic model are shown in table 3 with 40000 epochs. For SINDy, the hyperparameters are the same as in previous work [9]. As shown in fig. 14, for training data, SINDy reconstructs a smaller growth rate of oscillating behavior while the basic model accurately reconstructs both the shift mode and two POD modes. For testing data, SINDy contains an observable phase lag for the time period concerned, while the basic model achieves an almost perfect match. This implies that the model obtained from SINDy, although much easier to interpret than neural network, is not the best model for this dynamical system in terms of accuracy. However, we note that from the data distribution in fig. 13, the basic model performs as expected, as the training data covers the attractor well.

Stabilizing the neural network with Jacobian regularization
Due to the non-convexity of the optimization problem that arises in the solution of the basic model in eq. (18), employing a stochastic gradient-descent type method might lead to a solution corresponding to a local minimum, which is often undesirable and difficult to avoid. Most works in the field of deep learning for feedforward neural networks focus on decreasing the impact of poor local minima to promote generalizability. However, in the context of modeling a dynamical system, as it is often assumed that the trajectory of interest is stable with respect to small disturbances [3], the model should be able to approximately reconstruct the training trajectory in the presence of local errors that arise at each step. This would require regularizing instabilities that could arise in a posteriori prediction. To have meaningful comparisons, random number seeds are fixed for initialization of weights and training data shuffling. Nevertheless, we observe that in some cases, for example in the previous case of the cylinder wake, an inappropriate choice of neural network configuration of the basic model, e.g. number of hidden units and type of activation function, can potentially lead to instability in a posteriori evaluation. Such instabilities may materialize even while reconstructing the training trajectory, while the corresponding a priori prediction is almost perfect. Previous work [50] explicitly ensured stability by simply adding more adjacent trajectories. Here, we take a different approach by adding a Jacobian regularization term in the cost function in eq. (26).
In our numerical experiments, with a certain fixed random seed, it is observed that, when the layer structure is 2-20-20-2 with tanh as activation function instead of elu, the basic model becomes numerically unstable after 2000 steps for training data which is displayed in fig. 15. Similar numerical instability is also observed in testing evaluations. However, for the same fixed random seed, the regularized model with λ = 5 × 10 −5 shows numerically stable results with the same neural network configuration for both training and testing data.
The effectiveness of Jacobian regularization may be attributed to finding a balance between lowering the prediction error, i.e., MSE, and suppressing the sensitivity of the prediction of the future state to the current local error. As shown in fig. 16 and fig. 17, on average, the maximal eigenvalue of the Jacobian is smaller for the regularized model than for the basic model. Furthermore, the distribution of the eigenvalues of the Jacobian is shown in fig. 18 in the form of a linear stability diagram with explicit 5th order Runge-Kutta time integration. It is clear that the model with Jacobian regularization has significantly smaller positive real eigenvalues. Note that, due to the Frobenius norm, negative real eigenvalues are also decreased in magnitude.

Nonlinear PDE system: instability-driven buoyant mixing flow
The test problems thus far have served to assess the performance of the basic and Jacobianregularized models on nonlinear dynamical systems that either evolve on or towards an attractor. Such systems, even if high-dimensional, are amenable for projection onto a lower dimensional subspace, using for instance, POD techniques. In this section, we consider the Boussinesq buoyant mixing flow [54,25], also known as the unsteady lock-exchange problem [40] which exhibits strong shear and Kelvin-Helmholtz instability phenomena driven by the temperature gradient. Compared to the cylinder flow that evolves on a low-dimensional attractor approaching a limit cycle, the Boussinesq flow is highly convective and instability driven. Consequently, such a system state cannot be represented by a compact set of POD modes from the spatial-temporal field of nondimensionalized velocity and temperature. Rather, the low-dimensional manifold itself evolves with time. Further, any noise in the initial data can produce unexpected deviations that makes such systems challenging to model, even using equation-driven reduced order models such as POD-Galerkin [40].
The data set is generated by solving the dimensionless form of the two-dimensional incompressible Boussinesq equations [40], as shown in eq. (30) on a rectangular domain that is 0 < x < 8 and 0 < y < 1.
∂u ∂x + ∂u ∂y = 0, (30a) where u, v, and θ are the horizontal, vertical velocity, and temperature components, respectively. The dimensionless parameters Re, Ri, and P r are the Reynolds number, Richardson number, and Prandtl number, respectively with values chosen as follows: Re = 1000, Ri = 4.0, and P r = 1.0. These equations are discretized on a 256 × 33 grid. Initially, fluids at two different temperatures are separated by a vertical line at x = 4. The bounding walls are treated as adiabatic with the no-slip condition. A fourth-order compact finite difference scheme is used to compute the derivatives in eq. (30). The evolution of the thermal field over the simulation time interval of 32 seconds is shown in fig. 19 and illustrates the highly transient nature of the dynamics. To reduce the dimensionality of the system, POD modes are extracted from the entire data set consisting of 1600 snapshots. The reduced feature set consisting of ten POD weights captures nearly 97% of the total energy is used to train the model and predict the trajectory. For the setup of training and testing, future state prediction is pursued with the first 70% states of the trajectory treated as training data and the rest for testing. For such a system in 10 dimensions, it is observed that the problem of the a posteriori instability in the basic model becomes more pronounced and difficult to avoid. Challenges of numerical instability were observed even for reconstruction for a wide range of network configurations, and thus results from the basic model are not reported. The Jacobian regularized model is employed with hyperparameters shown in table 4, with fig. 20 showing a posteriori evaluation on training data. The reconstruction is successful, but the performance deteriorates on testing data because the trajectory of the system does not exhibit a low dimensional attractor as in the cylinder case. Therefore the training data is not informative for predictions on the test set. For a black-box machine learning model, this phenomena can be expected to be more pronounced in high dimensional space due to data scarcity. Specifically, we discuss this problem in the following section. 4.6 Improving model predictability by data augmentation In this section, we consider two scenarios of data augmentation: (i) augmenting the information in the data by spreading training locations randomly following a uniform distribution provided that one has access to F c or F d at any desired location; (ii) augmenting the data by assembling several trajectories generated from different initial conditions.

Random uniform sampling in phase space
Recall that, in the two-dimensional problems in section 4.1 and section 4.2, the stepwise error contour shows that local error increases on testing scenarios located far away from the training data which was highly concentrated in a compact region of phase space. Without any knowledge of system behavior, it is sensible to start with training data from a random uniform distribution in a compact region of phase space corresponding to interesting dynamics. To conduct a thorough stepwise error contour evaluation of the training target in phase space, the VDP system is chosen to illustrate this idea.
Determining the most informative data samples would potentially involve specific knowledge of the underlying system and the models used, and is beyond the scope of current work. Here we simply consider uniform random sampling in the phase space in a finite domain: [−3, 3] for the first component and [−5, 5] for the second component. We obtained a new set of 399 training data points using random uniform sampling in phase space while retaining the same testing data as in section 4.1.
The performance of the basic model with the same hyperparameter setting as in section 4.1 on randomly distributed training data is shown in fig. 21. While the number of data points has  not been changed, the contour error of the resulting model decreased significantly compared to training with the same number of data points in a single trajectory, which indicates an improved generalizability with the same amount of training data.

Training with multiple trajectories with random initialization
Training data can also be augmented by multiple trajectories with different initial conditions. Here we take the one-dimensional viscous Burgers equation shown in eq. (31) as example.
The initial conditions are generated following a specific energy spectrum [23,34] shown in eq. (32).
where for each k, β k is a random number drawn from a uniform distribution on [−π, π], E(k) = 5 −5/3 if 1 ≤ k ≥ 5, A = 25, and E(k) = k −5/3 if k > 5. Multiple trajectories are generated using different seeds for random numbers to obtain the trajectories of the full-order system. To fully resolve the system as a DNS, eq. (31) is solved using a standard pseudo-spectral method with SSP-RK3 [16] for time stepping. Here we choose k c = 2. Discrete cosine transformation (DCT) is used to reduce the dimension of the full system to first 4 cosine modes in the system where around 97% of kinetic energy is preserved. For simplicity, we seek a closed Markovian reduced-order-system, whereas the underlying dynamics is clearly non-Markovian [33,34].
Since the first component of the DCT is constant, the remaining components of feature space are shown in fig. 22. The training data is far away from the testing data initially, whereas the data converges at a later stage. This is because of the presence of a spiral fixed point attractor resulting from the viscous dissipative nature of the system. Therefore, if the model is only trained from a single trajectory, it will be very difficult for the model to generalize well in the phase space especially where the state of the system is not near an attractor. Many dynamical systems in nature exhibit attractors in the asymptotic sense. From the viewpoint of data-driven modeling of such dynamics, data scarcity is encountered at the start of trajectory where the number of trajectories required to provide enough information to cover the region of interest grows exponentially. Much research on applying neural network-based models for dynamical systems [50] [43][53] demonstrate problems starting on limit cycles or chaotic attractors in a low-dimensional feature space, where the issue of initial data scarcity is not significant, or can be easily alleviated by a small increase in available data. However, for the purpose of modeling phenomena such as turbulent fluid flow, which can be high dimensional even after dimension reduction, the model would likely fail for long-time prediction due to data scarcity. Such a situation may be realized in regions of phase space where the state has not arrived at the low manifold attractor. Therefore, the training data might not be representative of testing data which violates the fundamental assumption of a well-posed machine learning problem [14]. Moreover, data scarcity will shrink the region of generalizability of the model as the dimension of the system increases.
A key benefit of using a neural network model is its linear growth in complexity with dimension of the system, in contrast to traditional polynomial regression methods [15]. However, initial data scarcity would limit the generalizability of a ANN in modeling a high dimensional dynamical system that does not exhibit a low dimensional attractor. We believe this phenomenon of data scarcity observed from this simple nonlinear PDE example also applies to other nonlinear dynamical systems.
To alleviate the initial data scarcity issue, a solution is to augment the training data with more trajectories with different random number seeds in generating the initial condition, while keeping the energy spectrum the same across all cases. In this case we choose 18 such trajectories. Each trajectory contains states of 1000 snapshots equally spaced in time. For testing data, we simply consider one DNS result with an initial condition different from all training trajectories. The corresponding training and testing trajectories are visualized in phase space as shown in fig. 22. The basic model is trained with hyperparameters in table 5 and 1000 epochs. The resulting learning curve and a posteriori evaluation are shown in fig. 23. Relatively large discrepancy is observed near the initial condition as the initial data scarcity is not completely eliminated due to limited number of additional trajectories. Increasing the number of additional trajectories, may be unaffordable for very high dimensional systems. Moreover, the result also shows that the error decreases once the trajectory falls on the fixed point attractor. Thus, if the model starts in the low dimensional attractor where the information is well-preserved in the training data, better performance might be expected. This hypothesis is consistent with previous work [53], where successful prediction of future states starts at the time when the states converge to a low dimensional attractor.

Conclusions
This work investigated the modeling of dynamical systems using feedforward neural networks (FNN), with a focus on long time prediction. It was shown that neural networks have advantages over sparse polynomial regression in terms of adaptability, but with a trade-off in training cost and difficulty in extrapolation, which is a natural barrier for almost all supervised learning. From the perspective of global error analysis, and the observation of the strong correlation between the local error and maximal singular value of the Jacobian, we propose the suppression of the Frobenius norm of the Jacobian as regularization. This showed promise in improving the robustness of the basic FNN model given limited data, or when the model has a non-ideal architecture, or when the model is unstable. The effectiveness of Jacobian regularization is attributed to finding a balance between lowering the prediction error, and suppressing the sensitivity of the prediction of the future state to the current local error. In terms of modeling dynamical systems that do not involve low-dimensional attractors, limitations of FNNs, and perhaps all local ML methods, was demonstrated in a buoyant mixing flow. Challenges were noted in the example of the reducedorder viscous burgers system, where significant initial data scarcity is present. Augmenting the data either by altering the distribution of training data in phase space or by simply adding multiple trajectories from different initial conditions resulted in improvement of the performance of FNN model to some extent. However, these remedies require a significant amount of additional sampling in phase space, especially for high dimensional systems for the period of time without the apparent low-dimensional attractor which suffers data sparsity from the curse of dimensionality.