Comparing the Forecast Performance of Advanced Statistical and Machine Learning Techniques Using Huge Big Data: Evidence from Monte Carlo Experiments

This research compares factor models based on principal component analysis (PCA) and partial least squares (PLS) with Autometrics, the elastic smoothly clipped absolute deviation (E-SCAD), and the minimax concave penalty (MCP) under different simulated schemes such as multicollinearity, heteroscedasticity, and autocorrelation. The comparison is made across varying sample sizes and numbers of covariates. We found that in the presence of low and moderate multicollinearity, MCP often produces superior forecasts, except in the small-sample case, where E-SCAD remains better. In the case of high multicollinearity, the PLS-based factor model remains dominant, but asymptotically the prediction accuracy of E-SCAD improves significantly relative to the other methods. Under heteroscedasticity, MCP performs very well and most of the time beats the rival methods; in some circumstances with large samples, Autometrics provides forecasts similar to MCP. In the presence of low and moderate autocorrelation, MCP shows outstanding forecasting performance except in the small-sample case, where E-SCAD produces a remarkable forecast. In the case of extreme autocorrelation, E-SCAD outperforms the rival techniques under both small and medium samples, but a further increase in sample size makes the MCP forecast comparatively more accurate. To compare the predictive ability of all methods empirically, we split the data into two parts (i.e., data over 1973-2007 as training data and data over 2008-2020 as testing data). Based on the root mean square error and mean absolute error, the PLS-based factor model outperforms the competitor models in terms of forecasting performance.


Introduction
The prediction of macroeconomic variables is very important in macroeconomic studies, monetary policy analysis, and environmental economics. Accurate forecasts yield sound insights into the mechanisms of dynamic economies [1], more effective monetary policies [2], and better portfolio management and hedging strategies [3]. In today's data-rich environment, many macroeconomic series are tracked by economists and decision-makers.
Low-dimensional models often include a few prespecified economic covariates, for instance vector autoregressions, and therefore have difficulty capturing the dynamic and complex patterns contained in huge panels of time series [4]. It is a fact that omitting important variable(s) leads to an underspecified model, inducing biased results. There is an intense need for updated statistical models and analysis frameworks that expand the low-dimensional counterparts for improved forecasts. Thus, in the recent era, the analysis of "Big Data" has become a core topic of economics research. This in turn has resulted in special attention being paid to the huge class of techniques available in the domains of machine learning, dimension reduction, and penalized regression [5,6]. Recently, in the regression context, Doornik and Hendry [7] categorized Big Data into three classes, where N and P represent the number of observations and covariates, respectively: (i) tall big data: many observations and several covariates (N >> P); (ii) huge big data: more observations than covariates (N > P); (iii) fat big data: fewer observations than covariates (N < P). We represent Big Data graphically in Figure 1.
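This taxonomy can be expressed as a small helper function. A sketch in Python; note that the paper does not quantify the cutoff for "N >> P", so the `tall_ratio` parameter below is an illustrative assumption:

```python
def classify_big_data(n_obs: int, n_cov: int, tall_ratio: int = 10) -> str:
    """Classify a dataset by the Doornik-Hendry Big Data taxonomy.

    `tall_ratio` is an illustrative cutoff for "many more observations
    than covariates" (N >> P); the paper does not specify one.
    """
    if n_obs < n_cov:
        return "fat"   # fewer observations than covariates (N < P)
    if n_obs >= tall_ratio * n_cov:
        return "tall"  # many more observations than covariates (N >> P)
    return "huge"      # more observations than covariates (N > P)

print(classify_big_data(1000, 20))  # tall
print(classify_big_data(120, 60))   # huge
print(classify_big_data(48, 90))    # fat
```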
Moreover, Stock and Watson [17] elaborately discussed the past studies on the utility of factor models in forecasting.
There is an intensive and growing body of literature in this area. A few studies are especially relevant, as they address both theoretical and empirical problems, including Armah and Swanson [12,13]; Artis et al. [8]; Bai and Ng [1,33,34]; Banerjee and Marcellino [35]; Boivin and Ng [9,10]; Ding and Hwang [36]; Dufour and Stevanovic [37]; Stock and Watson [15-18]; and Smeekes and Wijler [38]. The abovementioned papers consider principal component analysis, independent component analysis, and sparse principal component analysis for the construction of the factor model. However, there is also a small and growing body of literature investigating the classical approach (Autometrics) in the context of macroeconomic forecasting [7,21,22]. We failed to discover any paper to date that has investigated the use of partial least squares (PLS) theoretically in our context, although the method has been applied empirically in various fields. Apart from this, some papers have utilized shrinkage methods such as ridge regression, lasso, elastic net, adaptive lasso, and the nonnegative garrote, but none of the papers to date have used the updated forms of shrinkage methods in our context.
Filling these gaps, this work implements some updated big data techniques to add to the literature on macroeconomic forecasting both theoretically and empirically. From the dimension reduction perspective, we build factor models intending to highlight the importance of such models for macroeconomic prediction. In particular, while building factor models, we employ principal component analysis (PCA) and partial least squares (PLS). In addition, we assess the latest version of the classical approach (Autometrics) and updated versions of shrinkage methods, namely the elastic smoothly clipped absolute deviation (E-SCAD) and the minimax concave penalty (MCP). We evaluate the performance of these techniques in a simulation setting where the true data generating process (DGP) of the factor model is used. To summarize, our prime contribution is a forecasting comparison of updated shrinkage methods and Autometrics with factor models under simulated scenarios featuring multicollinearity, heteroscedasticity, and autocorrelation, along with an application to macroeconomic data, to provide a conclusive answer on predictability. The study aims to produce an improved method to help policymakers; the improved tool is not restricted to workers' remittances or the stock market (our applications) but is valid for any time series. The remaining part of the paper is organized as follows. In Section 2, we provide a detailed discussion of factor models based on principal component analysis and partial least squares. In Section 3, we discuss big data techniques, namely the classical approach and shrinkage methods. Monte Carlo evidence on the comparative performance of the forecasting techniques is discussed in Section 4. Empirical findings are given in Section 5. Section 6 provides concluding remarks.

Methods
The techniques we intend to apply in subsequent sections are reported in Figure 2.
This study aims to compare the predictive ability of factor models based on principal component analysis and partial least squares with Autometrics, the elastic smoothly clipped absolute deviation (E-SCAD), and the minimax concave penalty under different scenarios such as multicollinearity, heteroscedasticity, and autocorrelation. Macroeconomic and financial datasets are used for the real-data analysis.

Factor Models.
The notion of factor models, also called diffusion indexes, entails using properly extracted hidden common factors, distilled from a huge set of features, as inputs in the identification of parsimonious models. To be more specific, let X be an N × P dimensional matrix of data points and F an N × k dimensional matrix of latent factors.
Stock and Watson [17] have delineated in depth the literature regarding forecasting through factor models. In the detailed discussion of factor model methodology below, we follow Stock and Watson [15]:

X = Fφ′ + ε, (1)

where ε represents the random error matrix, φ is the P × k matrix of loadings, and F is the factor matrix of dimension N × k.
We construct the following forecasting model based on the work of Bai and Ng [39], Kim and Swanson [19], and Stock and Watson [15]:

Y_{t+h} = c_F′F_t + e_{t+h}, (2)

where Y_{t+h} is the outcome variable to be forecasted, h is the forecast horizon, and F_t is the k × 1 vector of factors distilled from F in equation (1). The associated coefficient c_F is a vector of unknown parameters, and e_{t+h} is the random error. The whole process of factor model forecasting consists of two steps. In the first step, we estimate the k latent (unobserved) factors, represented by F, from the P observable predictors; to gain convenient dimension reduction, k is supposed to be much smaller than P (i.e., k ≪ P). In the second step, we estimate c_F by utilizing the data at hand on Y_t and F_t, and subsequently an out-of-sample forecast is constructed. Kim and Swanson [19] utilized the PCA approach to obtain estimates of the unobserved factors, known as principal components (PCs). The latent PCs are uncorrelated and are obtained by projecting the data in the directions of maximal variance; naturally, the PCs are ordered by their variance contributions. The first PC reflects the direction of maximal variance in the data, the second PC reflects the direction that explains the maximal variance in the remaining orthogonal subspace, and so on.
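The two-step procedure above can be sketched in Python with scikit-learn (the paper's own analysis uses R packages; the panel below is simulated purely for illustration, with assumed dimensions and coefficients):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulate a data-rich panel: N observations of P predictors driven by k factors
N, P, k, h = 200, 40, 3, 1
F = rng.normal(size=(N, k))                     # latent factors
loadings = rng.normal(size=(k, P))
X = F @ loadings + rng.normal(scale=0.5, size=(N, P))
y = F @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.3, size=N)

# Step 1: estimate the k latent factors by principal components
F_hat = PCA(n_components=k).fit_transform(X)

# Step 2: regress the h-step-ahead target on the estimated factors
reg = LinearRegression().fit(F_hat[:-h], y[h:])
forecast = reg.predict(F_hat[-h:])              # out-of-sample h-step forecast
print(forecast.shape)
```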
This approach is most frequently used in the factor analysis literature because PCs are easily derived via singular value decompositions [15,33,34].
Boivin and Ng [10], however, argued that the performance of the factor model is likely to be worse in prediction if the incorporated factors are dominated by factors irrelevant to the variable being forecast. Similarly, Tu and Lee [26] stated that PCA imposes the factor structure only on X and does not consider the outcome variable; that is, PCA ignores the dependent variable entirely. Neglecting the outcome variable at the time of factor extraction induces an inefficient forecast of the outcome variable. The solution to this problem is given in the next section.

The Partial Least Squares (PLS) Method.
This study looks at another method, partial least squares (PLS) regression, developed by Wold [40]. The method is appropriate in a data-rich environment and may be considered an alternative to PCA-based factor models. Unlike PCA, PLS identifies new factors in a supervised way; that is, it makes use of the response variable to identify new factors that not only approximate the old factors well but are also related to the response variable. Roughly speaking, the PLS approach attempts to find the directions of maximum variance that help explain both the response variable and the explanatory variables. The PLS for an outcome variable is motivated by a statistical model as follows:

y_t = x_t′c_P + e_t, t = 1, 2, . . ., T, (3)

where x_t is the n × 1 vector of covariates, c_P is the n × 1 vector of associated coefficients, and e_t is the disturbance term. Kim and Ko [29] argued that PLS models are useful especially when there is a large number of covariates. Instead of using the model in (3), one may adopt another dimension reduction approach through the following linear regression with the Z × 1 vector of components s_t = [s_1,t, s_2,t, . . ., s_Z,t]′:

y_t = s_t′τ + e_t, (4)

where we define s_t = w′x_t, in which w = [w_1, w_2, . . ., w_Z] is the n × Z weight matrix whose columns w_z = [w_1,z, w_2,z, . . ., w_n,z]′, z = 1, 2, . . ., Z, denote the vectors of weights on the covariates for the z-th component, and τ is the Z × 1 vector of PLS coefficients. The same framework may be used for predicting the k-steps-ahead value y_{t+k}, k = 1, 2, . . ., m.

Classical Approach and Shrinkage Methods
The fundamental comparison of interest here is between automatic selection over variables and PC- and PLS-based factors in terms of prediction. Factors are often regarded as essential for summarizing a large amount of information, but the classical approach and shrinkage methods are alternatives.

Classical Approach.
Autometrics is a well-known big data algorithm consisting of five steps. In the first step, we begin with the construction of a linear model referred to as the general unrestricted model (GUM); in the second step, we obtain estimates of the unknown parameters and test them statistically; the third step entails a presearch process; step four delivers the tree-path search; and the last step leads to the selection of the final model. Doornik [41] elaborately delineated the complete algorithm.
The key notion is to commence modeling with a linear model that incorporates all candidate features (the GUM). The GUM is estimated by least squares, and statistical tests are then carried out to validate the congruency of the model. If the estimated GUM contains statistically insignificant coefficients at the prespecified criteria, simpler models are estimated along different search paths and ratified by diagnostic tests. Once terminal models are detected, Autometrics undertakes their union testing. The rejected models are discarded, and the union of the surviving terminal models forms a new GUM for another tree-path search iteration. The whole inspection process proceeds, and the terminal models are statistically checked against their union. If two or more terminal models clear the encompassing tests, the preselected information criterion decides the final choice.
The econometric models are obtained by applying Autometrics to the GUM. Under Autometrics, two main strategies are commonly used for model selection: a conservative strategy and a liberal strategy. Our study implements the liberal strategy, here based on a one percent significance level rather than five percent; in other words, the statistical significance of each estimated coefficient is judged at the one percent level.
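A drastically simplified sketch of this general-to-specific idea follows (the paper itself uses Autometrics via the R gets package; the presearch, multiple tree-path searches, and encompassing tests are all omitted here, leaving only iterative removal of the least significant regressor):

```python
import numpy as np
from scipy import stats

def gets_selection(X, y, names, alpha=0.01):
    """Toy general-to-specific selection: start from the GUM with every
    candidate regressor and drop the least significant one until all
    remaining coefficients are significant at `alpha` (one percent,
    matching the liberal strategy used in the paper)."""
    keep = list(range(X.shape[1]))
    while keep:
        Xk = X[:, keep]
        n, p = Xk.shape
        beta = np.linalg.lstsq(Xk, y, rcond=None)[0]
        resid = y - Xk @ beta
        sigma2 = resid @ resid / (n - p)
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xk.T @ Xk)))
        pvals = 2 * stats.t.sf(np.abs(beta / se), df=n - p)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break            # every remaining regressor is significant
        keep.pop(worst)      # drop the least significant regressor
    return [names[j] for j in keep]

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 6))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only x0, x3 are relevant
selected = gets_selection(X, y, [f"x{j}" for j in range(6)])
print(selected)
```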

Shrinkage
Methods. An alternative prominent approach to dealing with many features is the family of penalized regression methods, which comprises many techniques; our study adopts the following updated forms: the elastic smoothly clipped absolute deviation and the minimax concave penalty.

Elastic Smoothly Clipped Absolute Deviation. Fan and Li [42] added a new penalization technique to the literature, known as SCAD. The technique is nonconvex and enjoys the oracle properties: sparsity, continuity, and unbiasedness. It selects the useful covariates and estimates their magnitudes asymptotically as efficiently as if the underlying true model were known (i.e., the oracle property).
The SCAD function addresses the limitations faced by existing methods such as ridge and lasso. The penalty function of SCAD is defined as follows:

p_λ(|β|) = λ|β| if |β| ≤ λ; −(|β|² − 2aλ|β| + λ²)/(2(a − 1)) if λ < |β| ≤ aλ; (a + 1)λ²/2 if |β| > aλ,

where the unknown tuning parameter λ is determined by the generalized cross-validation approach and the value a = 3.7 is assumed. The penalty function above is continuous, and for the one-dimensional problem the resulting solution is

β̂ = sign(z)(|z| − λ)₊ if |z| ≤ 2λ; {(a − 1)z − sign(z)aλ}/(a − 2) if 2λ < |z| ≤ aλ; z if |z| > aλ.

The tuning parameters can be chosen by data-driven techniques. A limitation of SCAD is that it tends to select only one variable from a correlated set of predictors. Zeng and Xie [43] extended SCAD by augmenting it with an L₂ penalty, calling it elastic SCAD (E-SCAD). Mathematically,

p_{λ₁,λ₂}(β) = p_{λ₁}^{SCAD}(|β|) + λ₂β².

Due to the L₂ penalty, E-SCAD achieves an additional property along with the oracle properties: the penalty encourages highly correlated features to enter or leave the model simultaneously. Hence, the proposed form selects the whole group of correlated predictors rather than a single variable.
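The SCAD penalty and its elastic extension can be evaluated directly. A sketch with the conventional a = 3.7 and illustrative tuning values (the closed forms are the standard ones from Fan and Li and from Zeng and Xie):

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li, evaluated element-wise."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                        # |beta| <= lam
    quad = -(b**2 - 2 * a * lam * b + lam**2) / (2 * (a - 1))  # lam < |beta| <= a*lam
    const = (a + 1) * lam**2 / 2                            # |beta| > a*lam (flat: unbiased)
    return np.where(b <= lam, linear, np.where(b <= a * lam, quad, const))

def escad_penalty(beta, lam1, lam2, a=3.7):
    """Elastic SCAD of Zeng and Xie: SCAD plus a ridge (L2) term."""
    beta = np.asarray(beta, dtype=float)
    return scad_penalty(beta, lam1, a) + lam2 * beta**2

print(scad_penalty([0.0, 0.5, 10.0], lam=1.0))  # [0.   0.5  2.35]
```

Note that the penalty is flat beyond aλ, which is what makes large coefficients (asymptotically) unbiased, unlike the lasso's ever-growing L₁ penalty.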

Minimax Concave Penalty. Zhang [44] proposed the minimax concave penalty (MCP), which retains the convexity of the penalized loss in sparse regions as far as possible, given thresholds for feature selection and unbiasedness. The MCP is defined as follows:

ρ_{λ,γ}(β) = λ|β| − β²/(2γ) if |β| ≤ γλ; γλ²/2 if |β| > γλ.

The tuning parameter γ > 0 governs the maximum concavity, subject to the requirements of unbiasedness and feature selection. The dual tuning parameters (λ, γ) in concave penalty regression play a key role in controlling the amount of regularization. The concavity of the MCP considerably limits the loss of sparse convexity by minimizing the maximal concavity. Zhang [44] showed that a rise in the regularization parameter yields more convexity and an almost unbiased penalty. The MCP penalty function is a quadratic spline.
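The MCP can be evaluated the same way. A sketch with an illustrative γ = 3 (the closed form is the standard one from Zhang):

```python
import numpy as np

def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty of Zhang, evaluated element-wise."""
    b = np.abs(np.asarray(beta, dtype=float))
    inner = lam * b - b**2 / (2 * gamma)   # concave region, |beta| <= gamma*lam
    flat = gamma * lam**2 / 2              # beyond the threshold the penalty is constant
    return np.where(b <= gamma * lam, inner, flat)

print(mcp_penalty([0.0, 1.0, 10.0], lam=1.0, gamma=3.0))
```

As γ grows, the concave region widens and MCP approaches the soft-thresholding (lasso) penalty; as γ shrinks toward its lower limit, it approaches hard thresholding.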

Monte Carlo Evidence on Forecasting Performance
Our simulation study consists of three main scenarios, namely, simulations on a data generating process (DGP) with (i) multicollinearity, (ii) heteroscedasticity, and (iii) autocorrelation. In each scenario, we vary the DGP attributes in terms of the correlation strength among features, the magnitude of the variance of the error term, and the magnitude of the correlation of the error term with its lagged values.

Data Generating Process.
We generate data from the following equation:

y_t = x_t′β + ε_t, (13)

where the set of predictors X₁, X₂, . . ., X_P is generated from the multivariate normal distribution, X ∼ N(0, Σ). The same DGP, given in (13), was used by [38] for artificial data generation. Our study considers three sample sizes for the simulation experiments. We suppose a dual set of features, altering the numbers of active (p) and inactive (q) features, as portrayed in Figure 3.
In our simulation experiments, we assume three scenarios. In the first scenario, we generate pairwise correlation between the predictors x_m and x_n as cov(x_m, x_n) = Σ^{|m−n|}, so the population covariance matrix has a Toeplitz structure. By altering the parameter Σ, we obtain different correlation structures; following Xiao and Xu [45], we assume Σ ∈ {0.25, 0.5, 0.9}. In the second scenario, we generate correlation between the current residual and its lag (autocorrelation), symbolized by ρ and generated as

ε_t = ρε_{t−1} + u_t,

where u_t is white noise. Our experiments assume low, moderate, and high autocorrelation, ρ ∈ {0.25, 0.5, 0.9}. The third scenario examines heteroscedasticity; that is, the variance of the error term is not constant and alters across data points through σ_k.
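A sketch of this DGP in Python; the unit coefficients on active predictors and the specific heteroscedasticity profile are illustrative assumptions, since the paper does not state them here:

```python
import numpy as np

def simulate_dgp(n, p_active, q_inactive, corr=0.5, rho=0.0, hetero=False, seed=0):
    """Simulate one realization of the design described above (assumed
    details: unit coefficients on active predictors, standard-normal
    innovations, linearly increasing error scale under heteroscedasticity)."""
    rng = np.random.default_rng(seed)
    P = p_active + q_inactive
    # Toeplitz covariance: cov(x_m, x_n) = corr**|m - n|
    idx = np.arange(P)
    Sigma = corr ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(P), Sigma, size=n)
    beta = np.r_[np.ones(p_active), np.zeros(q_inactive)]
    u = rng.normal(size=n)
    if hetero:
        u *= np.linspace(0.5, 2.0, n)      # error variance changes across observations
    # AR(1) errors: eps_t = rho * eps_{t-1} + u_t
    eps = np.empty(n)
    eps[0] = u[0]
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + u[t]
    y = X @ beta + eps
    return X, y

X, y = simulate_dgp(200, 5, 45, corr=0.9, rho=0.5)
print(X.shape, y.shape)  # (200, 50) (200,)
```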
To evaluate the forecasting performance of all methods, we divide each realization such that 80 percent of the data are used to train the models and the remaining data are utilized for model evaluation, following [46]. The entire process is replicated M = 1000 times. The averages of the root mean square error (RMSE) and mean absolute error (MAE) are computed over the M replications to assess forecast performance. The smaller the values of RMSE and MAE, the closer the predicted values are to the actual values and the better the forecast. For the analysis, we relied on several packages, including gets, glmnet, ncvreg, pls, caret, forecast, and Metrics, under the R programming language.
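The evaluation metrics and the chronological 80/20 split can be sketched as follows, using a naive last-value benchmark on toy data for illustration:

```python
import numpy as np

def rmse(actual, pred):
    actual, pred = np.asarray(actual), np.asarray(pred)
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mae(actual, pred):
    actual, pred = np.asarray(actual), np.asarray(pred)
    return float(np.mean(np.abs(actual - pred)))

# Chronological 80/20 split: time series data are never shuffled
y = np.arange(10, dtype=float)
cut = int(0.8 * len(y))
y_train, y_test = y[:cut], y[cut:]
naive_forecast = np.full_like(y_test, y_train[-1])  # repeat the last training value
print(rmse(y_test, naive_forecast), mae(y_test, naive_forecast))
```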

Simulation Results.
The forecast comparison results derived from the Monte Carlo experiments are presented in Tables 1-3. All methods improve their performance as the number of observations grows, whereas increasing the number of irrelevant and candidate variables adversely affects predictive ability.

Scenario 1.
In the presence of low and moderate multicollinearity, the performance of MCP is superior to the rival methods except in the small-sample case, where E-SCAD and the PLS-based factor model are dominant; in particular, E-SCAD often produced better forecasts. Under high multicollinearity, the PLS-based factor model is superior, while asymptotically E-SCAD outperformed the other methods.

Scenario 2. Under all schemes of heteroscedasticity, the performance of MCP is often better than all competitor models. When the number of predictors equals 50, Autometrics provides forecasts similar to MCP in large samples.

Scenario 3. In the presence of low and moderate autocorrelation, MCP showed outstanding forecasting performance, particularly as the sample size increases; in contrast, when n = 100, E-SCAD produced a remarkable forecast. Under extreme autocorrelation, E-SCAD outperformed the rival techniques in both small and moderate samples, but as the sample size grows to 400, MCP produced a comparatively more accurate forecast.

Real Data Analysis
After the Monte Carlo experiments, this study performs real data analysis using big data. We focus on two datasets, macroeconomic and financial, considering workers' remittance inflows and stock market data, respectively. It is a fact that many factors influence workers' remittance inflows and the stock market. Among them, some covariates are recommended by economic and financial theories for inclusion in the model, and a long list of variables has been recommended by past studies. This study considers all the possible determinants, based on both theory and the literature, to form a general model; in the econometrics literature, such a model is known as the general unrestricted model (GUM).

Data Source.
This study collects annual data for Pakistan from 1973 to 2020. The data are sourced from the World Development Indicators (WDI) and other international databases; the variables are detailed in Table 4.

Correlation Matrix.
For the empirical analysis, we split the dataset into two parts: observations from 1973 to 2007 are utilized to train the models, and the remaining data are used to evaluate their forecasting performance. Before computing the forecast errors, we examine the correlation structure among the covariates through visualization. In Figures 4 and 5, blue and red colors exhibit positive and negative correlations, respectively; the color intensity and the area of each circle are directly associated with the correlation coefficients. On the right side of each correlogram, the legend maps the correlation coefficients to the corresponding colors. We can observe many dark blue and red circles, which clearly illustrate high pairwise correlation; in other words, high multicollinearity exists among the predictors in both datasets. Figure 6 reveals that the distribution of the stock market data is almost symmetric. Apart from this, diagnostic tests revealed that the residuals of the estimated model are independently and identically distributed. As noted in the simulation experiments, in the presence of high multicollinearity the PLS-based factor model outperformed the other methods in terms of forecast error, particularly when the sample size is small, which suggests that the PLS-based factor model is more robust in such circumstances.

Forecast Comparison Based on Two Real Datasets.
The root mean square error and mean absolute error are computed to ascertain the predictive ability of MCP, E-SCAD, Autometrics, and the factor models based on PCA and PLS, reported in Figures 7 and 8, respectively. The findings show that the PLS-based factor model outperformed the rival methods in the out-of-sample forecast: it has better predictive power than the competitor models, with the lowest forecast errors in the multistep-ahead forecasts for the period 2008 to 2020. This supports the simulation results for both real datasets.

Concluding Remarks
This study compares factor models based on principal component analysis and partial least squares with the classical approach (Autometrics) as well as shrinkage procedures (i.e., the minimax concave penalty (MCP) and the elastic smoothly clipped absolute deviation (E-SCAD)). The comparison is made in the presence of multicollinearity, heteroscedasticity, and autocorrelation, with varying sample sizes and numbers of covariates. We carried out Monte Carlo experiments to compare all methods in terms of prediction. All methods improve with a growing sample size in all scenarios, while expanding the number of irrelevant and candidate variables negatively affects forecasting accuracy. In the presence of low and moderate multicollinearity, MCP often produced comparatively better forecasts except for small numbers of observations, where E-SCAD is dominant. Under extreme multicollinearity, the PLS-based factor model is superior, but with increasing sample size, the prediction accuracy of E-SCAD improves significantly relative to the other methods. Under all schemes of heteroscedasticity, the performance of MCP is better than all competitor models; when the number of predictors equals 50, Autometrics provides forecasts similar to MCP in large samples. In the presence of low and moderate autocorrelation, MCP showed outstanding forecasting performance except in the small-sample case, where E-SCAD produced a remarkable forecast. Under extreme autocorrelation, E-SCAD outperformed the rival techniques in both small and medium samples, but as the sample grows to 400, MCP produced a comparatively more accurate forecast.
For the empirical application, macroeconomic and financial datasets are used. To compare the forecasting performance of all methods, we divide the data into two parts (i.e., data over 1973-2007 as training data and data over 2008-2020 as testing data) for both datasets. All methods are trained on the training data, and their performance is subsequently evaluated on the testing data. Based on RMSE and MAE, the PLS-based factor model is more robust in terms of forecasting than the competitor models. This study offers several recommendations, reported in Table 4.

Limitations and Future Direction.
The recommendations of the study (Table 4) can be summarized as follows:

Scenario                           Small sample                Large sample
Multicollinearity (low/moderate)   E-SCAD                      MCP
Multicollinearity (high)           PLS-based factor model      E-SCAD
Heteroscedasticity                 MCP                         MCP
Autocorrelation (all levels)       E-SCAD                      MCP

A few limitations of this study are that it focuses only on linear models and considers yearly data. The simulation part of this study is restricted to Gaussian-distributed errors, but in practice the errors of a model need not be normal. Hence, future research can investigate the forecasting performance of advanced statistical and machine learning techniques under nonnormal residuals as well as missing observations in the dataset. This study can also be expanded to examine the performance of nonlinear and nonparametric algorithms such as artificial neural networks, random forests, and support vector machines.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this study.

Supplementary Materials
Appendix