Evaluating the Performance of Feature Selection Methods Using Huge Big Data: A Monte Carlo Simulation Approach

In this article, we compare Autometrics with machine learning techniques, namely the Minimax Concave Penalty (MCP), Elastic Smoothly Clipped Absolute Deviation (E-SCAD), and Adaptive Elastic Net (AEnet). For the simulation experiments, three kinds of scenarios are considered, allowing for multicollinearity, heteroscedasticity, and autocorrelation, with varying sample sizes and numbers of covariates. We found that all methods improve their performance as the sample size grows. In the presence of low and moderate multicollinearity and low and moderate autocorrelation, the considered methods retain all relevant variables. However, under low and moderate multicollinearity, all methods except AEnet also keep many irrelevant predictors, whereas under low and moderate autocorrelation, Autometrics, along with AEnet, retains fewer irrelevant predictors. In the case of extreme multicollinearity, AEnet retains more than 93 percent of the correct variables with an outstanding gauge (zero percent). The potency of the remaining techniques, specifically MCP and E-SCAD, tends towards unity as the sample size grows, but they capture many irrelevant predictors. Similarly, in the case of high autocorrelation, E-SCAD shows good performance in selecting relevant variables in small samples, while in terms of gauge, Autometrics and AEnet perform better and often retain less than 5 percent irrelevant variables. In the presence of heteroscedasticity, all techniques usually retain all relevant variables but also suffer from overspecification, except AEnet and Autometrics, which avoid the irrelevant predictors and recover the true model precisely. For an empirical application, we consider workers' remittance data for Pakistan along with its twenty-seven determinants, spanning 1972 to 2020.
The AEnet selected thirteen relevant covariates of workers' remittance, while E-SCAD and MCP suffered from overspecification. Hence, policymakers and practitioners should focus on the relevant variables selected by AEnet to improve workers' remittances in the case of Pakistan. In this regard, the Pakistani government has devised policies that make it easy to transfer remittances legally and that mitigate the cost of transferring remittances from abroad. The AEnet approach can help policymakers identify relevant variables in the presence of a huge set of covariates, which in turn produces accurate predictions.


Introduction
"Big Data" has arrived, but big insights have not [1]. In regression analysis, researchers are often interested in discovering the important features while predicting the response variable. Therefore, it is important to identify the potential features for knowledge discovery and for the predictive ability of the model [2]. Variable selection is thus one of the crucial steps in constructing a linear regression model. Picking too many covariates is likely to inflate the variance of the estimated model. Stated differently, including more variables leads to high variability in the least squares fit, resulting in overfitting and thus poor future prediction [3]. In contrast, selecting too few covariates may yield unreliable or biased results [3,4]. As [5] stated, for valid results all relevant predictors should be incorporated in the regression model; missing a single predictor might lead to a misspecified model, and the conclusions we draw can be fallacious. According to [6,7], if the covariates are highly correlated with each other, the confidence interval associated with each estimated coefficient becomes wider and leads to wrong inferences.
In the recent era, a substantial mass of research has concentrated on the analysis of "Big Data" in the field of economics. As a result, substantial attention is being paid to the wide variety of techniques available in data mining, machine learning, dimension reduction, and penalized least squares [8,9]. Recently, in the regression context, [1] categorized Big Data into three classes: Tall Big Data, Huge Big Data, and Fat Big Data. Each type can be defined as follows: (i) Tall Big Data: many more observations than covariates (N >> P); (ii) Huge Big Data: more observations than covariates (N > P); (iii) Fat Big Data: fewer observations than covariates (N < P). Here, N and P represent the number of observations and covariates, respectively. We represent the types of Big Data graphically in Figure 1.
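The three-way taxonomy above can be expressed as a small helper. The text only gives the inequalities N >> P, N > P, and N < P, so the factor of 10 used below to separate "tall" from "huge" is an illustrative assumption; the paper's own analysis uses R, and this sketch is in Python for illustration.

```python
def classify_big_data(n_obs: int, n_covariates: int) -> str:
    """Classify a dataset into the three Big Data types of [1].

    "Tall": many more observations than covariates (N >> P),
    "Huge": more observations than covariates (N > P),
    "Fat":  fewer observations than covariates (N < P).
    The factor of 10 separating "Tall" from "Huge" is an
    illustrative assumption, not part of the original definition.
    """
    if n_obs >= 10 * n_covariates:
        return "Tall Big Data"
    if n_obs > n_covariates:
        return "Huge Big Data"
    return "Fat Big Data"

print(classify_big_data(1000, 10))   # N >> P
print(classify_big_data(100, 30))    # N > P
print(classify_big_data(40, 100))    # N < P
```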
Handling Big Data is not an easy task, and to date only a handful of methods in the literature can be used to improve the least squares estimates in a data-rich environment (Big Data). In Figure 2, we identify all common methods and their modifications. We now briefly discuss these methods. Penalized least squares methods are an integral component of machine learning (ML). It has already been shown in the literature that ML methods are efficient approaches for handling Big Data [10]. Penalized regression methods are a modified form of ordinary least squares (OLS). Mathematically, the modified form can be written as

min_α { Σ_{t=1}^{N} (y_t − x_t′α)² + k Σ_{j=1}^{P} [ (1 − ϑ) α_j² + ϑ |α_j| ] }.

As in classical regression, the first component is the sum of squared residuals, and the remaining part represents the shrinkage penalty. Here, k is the tuning parameter and is often selected by cross-validation. The other parameter is ϑ; by altering its value, we obtain different models. More specifically, setting ϑ = 0 yields the ridge regression model, and ϑ = 1 yields lasso regression. For values of ϑ between zero and one, we obtain the elastic net [6]. As the name reflects, penalized least squares methods are based on constraints. A good penalty satisfies the following three oracle properties: unbiasedness, continuity, and sparsity [11]. Methods belonging to the family of penalized regression, such as ridge, lasso, and the elastic net, do not satisfy all of these oracle properties [12,13]. Although some modified methods in the literature do satisfy the required oracle properties, including the smoothly clipped absolute deviation (SCAD) and the adaptive lasso, these two methods share a drawback: they select only one variable from a group of correlated covariates and ignore the others. The selected variable may or may not be theoretically important.
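The mixing idea above, with ϑ interpolating between ridge and lasso, can be sketched with scikit-learn, whose `ElasticNet` estimator exposes `l1_ratio` in the role of ϑ. The simulated data and penalty levels below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy data: 10 covariates, only indices 0, 1, and 4 truly relevant.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

# l1_ratio plays the role of theta: 1 -> lasso, values in (0, 1) -> elastic
# net (scikit-learn recommends Ridge() rather than l1_ratio=0 for pure ridge).
lasso_like = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("lasso-like nonzeros:", np.flatnonzero(lasso_like.coef_))
print("elastic-net nonzeros:", np.flatnonzero(enet.coef_))
```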
[14] modified SCAD by adding another property to its penalty, which encourages a set of highly correlated covariates to enter or leave the model at the same time. In other words, the new version of SCAD is able to select a group of correlated variables instead of a single one. Similarly, [2] modified the elastic net into the adaptive elastic net, which achieves the oracle property; this method is capable of including and excluding correlated features simultaneously. The minimax concave penalty (MCP) is another extension, developed by [6], based on a concave penalty; this method also enjoys the oracle property. To summarize, the Adaptive Elastic Net, MCP, and Elastic SCAD are updated forms of penalization techniques, primarily used for variable selection, and will be explored in the next sections.
Another approach for automatic model selection was proposed by [15,16], known as PcGets. This method is based on the idea of general-to-specific (gets) modeling. It starts from a general unrestricted model that captures the key attributes of the underlying dataset. Standard testing approaches are then utilized to reduce its complexity by removing statistically insignificant variables, inspecting the validity of the reductions at every stage to ensure the congruence of the selected model. They studied the probability that PcGets recovers the data generating process (DGP) through Monte Carlo experiments and obtained reliable results. The consistency of the PcGets procedure was established by [17].
A new version of the PcGets algorithm was proposed by [18] as Autometrics. This version is based on the same principles as PcGets. Autometrics utilizes a tree-path search to identify and knock out statistically insignificant covariates. If a relevant covariate is eliminated by chance, the algorithm does not get stuck in a single route, since other covariates act as proxies (as in stepwise regression). The beauty of this algorithm is that it works well even when the number of covariates exceeds the number of observations [10].
Our study contributes both theoretically and empirically to the literature. There exists an immense literature using conventional approaches such as vector autoregressions and vector error correction models. Such approaches accommodate no more than about 10 covariates, as more covariates create serious issues that invalidate the results. More precisely, increasing the number of predictors (Big Data) leads to several major problems, such as loss of degrees of freedom, high variability, and multicollinearity. To fix these problems and achieve valid results, this study adopts several updated classical and machine learning techniques. The techniques are compared under simulated scenarios of multicollinearity, heteroscedasticity, and autocorrelation, and are then applied to macroeconomic data to provide conclusive answers on predictability and validity across distinct theoretical scenarios simultaneously. Our study aims to provide an improved technique to help policymakers; the improved tool is not restricted to workers' remittances (our application) but is valid for any macroeconomic data set under Huge Big Data (P < N). The goal of this study is to compare the performance of the classical approach (Autometrics) with improved shrinkage methods, namely the Adaptive Elastic Net, Elastic Smoothly Clipped Absolute Deviation, and Minimax Concave Penalty, under different scenarios (multicollinearity, heteroscedasticity, and autocorrelation) in terms of variable selection. In this study, we focus solely on the case of Huge Big Data. The rest of the article is arranged as follows: Section 2 gives an overview of the methods; Section 3 discusses the simulation exercise; Section 4 carries out the real data analysis; Section 5 concludes.

Methods
In statistics and econometrics, it is imperative to investigate the performance of statistical models both theoretically and empirically. This work attempts to describe both aspects of the included methods. Our study considers a variety of modified penalization techniques and classical approaches.
The methods considered here are the Adaptive Elastic Net, Elastic Smoothly Clipped Absolute Deviation, Minimax Concave Penalty, and Autometrics. Here, we provide a detailed description of each method.

Adaptive Elastic Net (AEnet).
The lasso estimator was designed to improve on the ridge estimator. It is certainly useful, particularly when most coefficients of the true model are zero. However, ridge regression performs better than the lasso when the correlation between predictors is high [19].
To overcome the shortcomings of the lasso and ridge regression, the elastic net method was proposed by [19]; it uses both the lasso and ridge penalties simultaneously. The penalty function of the elastic net (EN) is given by

P_EN(α) = k1 Σ_{j=1}^{m} |α_j| + k2 Σ_{j=1}^{m} α_j².   (2)

The tuning parameters k1 and k2, typically chosen by cross-validation, control the relative importance of the L1-norm and L2-norm penalties. Both the lasso and ridge regression are special cases of the elastic net, as discussed in Section 1. In this sense, the elastic net combines two features: shrinkage and variable selection.
To estimate α_EN, [19] proposed an algorithm called least angle regression (LAR). However, the EN does not satisfy the oracle property, unlike the adaptive lasso, although it often performs better than the adaptive lasso [11]. Later, the ideas of the adaptive lasso and elastic net regularization were combined to achieve a further improvement known as the Adaptive Elastic Net (AEnet), defined as

α̂_AEnet = argmin_α { Σ_{t=1}^{N} (y_t − x_t′α)² + k2 Σ_{j=1}^{m} α_j² + k1 Σ_{j=1}^{m} ŵ_j |α_j| },

where the ŵ_j (j = 1, . . ., m) are adaptive, data-driven weights. According to [2], we first estimate α_EN using the EN method as given in (2) and then use it to compute the weights as ŵ_j = (|α̂_EN,j|)^(−τ), where τ is a positive constant. Thus, AEnet, the modified form of the elastic net, attains the oracle property.
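The two-step recipe above — a pilot elastic-net fit, then adaptive weights ŵ_j folded into a second fit — can be sketched as follows. This is a simplified illustration, not the exact estimator of [2]: the weights are absorbed by rescaling the design matrix (which also rescales the L2 term), a small eps guards against division by zero, and the tuning values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

# Step 1: pilot elastic-net fit to obtain initial coefficients alpha_EN.
pilot = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)

# Step 2: adaptive weights w_j = (|alpha_EN_j| + eps)^(-tau); eps is an
# implementation assumption to avoid dividing by exact zeros.
tau, eps = 1.0, 1e-4
w = (np.abs(pilot.coef_) + eps) ** (-tau)

# Absorb the weights into the design: penalizing w_j * |alpha_j| on X
# is equivalent to an unweighted L1 penalty on the columns X_j / w_j
# (the L2 part is also rescaled here -- a simplification).
X_w = X / w
final = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X_w, y)
aenet_coef = final.coef_ / w  # map back to the original scale

print("selected:", np.flatnonzero(aenet_coef))
```

Variables with tiny pilot coefficients receive huge weights and are driven out of the second fit, which is exactly the mechanism that gives the adaptive step its selection consistency.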

Elastic Smoothly Clipped Absolute Deviation (E-SCAD).
Reference [12] developed a new regularization method known as SCAD. This method is nonconvex and fulfills the properties of a good penalty. It not only selects the important features consistently but also estimates the unknown coefficients more efficiently, as if the true model were known. Therefore, the SCAD function overcomes the limitations of existing methods such as ridge and lasso. The penalty function of SCAD is defined piecewise as

p_k(|α|) = k|α|, if |α| ≤ k;
p_k(|α|) = (2ck|α| − α² − k²) / (2(c − 1)), if k < |α| ≤ ck;
p_k(|α|) = k²(c + 1)/2, if |α| > ck.

They set the value of c to 3.7, and the unknown tuning parameter k is computed by generalized cross-validation. As noted above, the penalty function is continuous, and the tuning parameters can be induced from data-driven techniques. The idea of combining SCAD with an L2 penalty was proposed by [14], who called it Elastic SCAD.
Mathematically, E-SCAD can be written as

α̂_E-SCAD = argmin_α { Σ_{t=1}^{N} (y_t − x_t′α)² + Σ_{j=1}^{m} p_k(|α_j|) + k2 Σ_{j=1}^{m} α_j² }.
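The SCAD penalty used above can be written out numerically. This sketch assumes the standard three-piece form with the recommended c = 3.7: linear near zero (lasso-like), quadratic in the middle, and constant beyond c·k, so large coefficients are not shrunk.

```python
import numpy as np

def scad_penalty(t, k, c=3.7):
    """SCAD penalty evaluated at |coefficient| t >= 0.

    Linear (lasso-like) up to k, quadratic between k and c*k, and
    constant beyond c*k; c = 3.7 is the value recommended in the text.
    """
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= k
    mid = (t > k) & (t <= c * k)
    return np.where(small, k * t,
           np.where(mid, (2 * c * k * t - t**2 - k**2) / (2 * (c - 1)),
                    k**2 * (c + 1) / 2))

k = 1.0
print(scad_penalty(0.5, k))    # linear region: k*t = 0.5
print(scad_penalty(10.0, k))   # flat region: k^2*(c+1)/2 = 2.35
```

The three pieces join continuously at |α| = k and |α| = ck, which is the continuity property required of a good penalty.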

Minimax Concave Penalty (MCP).
The idea of the minimax concave penalty (MCP) was initially proposed by [20]. This method maintains the convexity of the penalized loss in sparse regions to the greatest extent possible, given thresholds for variable selection and unbiasedness. The MCP is defined as

P(α; k, c) = k|α| − α²/(2c), if |α| ≤ ck;
P(α; k, c) = ck²/2, if |α| > ck.

The tuning parameter (c > 0) reduces the maximal concavity subject to the unbiasedness and feature-selection constraints. The role of the two tuning parameters in concave penalty regression is to control the amount of regularization, and minimizing the maximal concavity preserves the convexity of the loss in sparse regions as far as possible. As the value of the regularization parameter rises, the objective becomes more convex and the penalty approaches an unbiased one [20]. The penalty function is a quadratic spline governed by the two tuning parameters.
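A corresponding numerical sketch of the MCP function defined above; the value c = 3.0 below is only an illustrative choice of the concavity parameter, not a universal default.

```python
import numpy as np

def mcp_penalty(t, k, c=3.0):
    """Minimax concave penalty (MCP) at |coefficient| t >= 0.

    Rises like k*t near zero and flattens out at c*k, after which
    large coefficients incur no further shrinkage (unbiasedness);
    c = 3.0 here is an illustrative choice.
    """
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= c * k, k * t - t**2 / (2 * c), c * k**2 / 2)

k = 1.0
print(mcp_penalty(0.5, k))   # 0.5 - 0.25/6, still in the rising region
print(mcp_penalty(5.0, k))   # flat region: c*k^2/2 = 1.5
```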

Classical Approach.
Autometrics comprises five fundamental phases. The initial phase concerns the construction of a linear model known as the General Unrestricted Model (GUM); the second step yields estimates of the unknown parameters and statistical testing of the GUM; the third step consists of the presearch process; the fourth step performs the tree-path search; and the last step involves selection of the final model. The complete algorithm is precisely delineated in [18]. The key notion is to commence modeling with a linear model incorporating every essential feature. The GUM is estimated by least squares, and statistical tests are then executed to ensure the congruency of the model. If the estimated GUM contains coefficients that are statistically insignificant at prespecified criteria, simpler models are estimated along different path searches and validated by statistical and diagnostic tests. Once some terminal models are detected, Autometrics undertakes their union testing. Rejected models are eliminated, and the union of the surviving terminal models forms a new GUM for another tree-path search iteration.
This inspection process continues, and the terminal models are statistically examined against their union. If two or more terminal models pass the encompassing tests, the prechosen information criterion provides the final decision. The forecasting model is obtained by applying the Autometrics approach to the GUM. Two strategies are widely used for variable selection: a conservative and a superconservative strategy. This study adopts the superconservative strategy, based on a one percent level of significance instead of five percent.

Simulation Study
Our simulation experiment involves three main scenarios, namely simulations on a data generating process (DGP) with (i) multicollinearity, (ii) heteroscedasticity, and (iii) autocorrelation. In each case, we vary DGP characteristics such as the correlation structure among predictors, the variance of the error term, and the correlation between the current and lagged value of the error term.

Data Generating Process.
We generate data from the following equation:

Y_t = X_t′β + ε_t,   (9)

where Y_t is the outcome variable. The feature set, X_t = (x_1, x_2, . . ., x_P), is generated from a multivariate normal distribution, X_t ∼ MVN(0, Σ), where the mean of the covariates is zero and Σ is the variance-covariance matrix. The same data generating process (DGP), given in equation (9), was used by [1,21] for artificial data generation. Three sample sizes are used in the simulation exercise. Moreover, we assume two sets of candidate variables, varying the numbers of relevant (p) and irrelevant (q) variables, as presented in Figure 3.
In the first scenario, we generate the pairwise correlation between predictors x_m and x_n as cov(x_m, x_n) = γ^|m−n|, so the population covariance matrix is Σ = [γ^|m−n|]. By varying the parameter γ, we obtain different pairwise correlations; we assume the values γ ∈ {0.25, 0.5, 0.9}, following [22]. In the second scenario, we generate correlation between the current and lagged residuals (autocorrelation), denoted by ρ. The autocorrelation is generated by

ε_t = ρ ε_{t−1} + u_t.

We assign the following values to the coefficient of the lagged residuals: ρ ∈ {0.25, 0.5, 0.9}. In the third scenario, heteroscedasticity, the variance of the error term is not constant and varies across observations via σ_k.
Thus, we divide the variance σ_k into two parts, σ_1 and σ_2. For half of the observations (n/2), we set the variance to σ_1, and to σ_2 for the remaining n/2 data points. Our experiment assumes three cases of heteroscedasticity, setting π_i = σ_1/σ_2, where i = 1, 2, 3, with π_i ∈ {0.1/0.3, 0.2/0.6, 0.3/0.9}. This study evaluates the performance of Autometrics, AEnet, E-SCAD, and MCP using Huge Big Data under all of the preceding scenarios. Tenfold cross-validation is executed to determine the optimal value of the tuning parameter.
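The three designs above can be sketched as a single generator. The Toeplitz correlation cov(x_m, x_n) = γ^|m−n|, the AR(1) errors ε_t = ρ ε_{t−1} + u_t, and the two-regime error variance follow the text; the unit coefficients on the relevant variables are an illustrative assumption, since the text does not state the coefficient values.

```python
import numpy as np

def simulate_dgp(n, p_rel, q_irr, scenario="multicollinearity",
                 gamma=0.5, rho=0.5, sigmas=(0.2, 0.6), seed=0):
    """Generate (X, y, true_support) under one of the three scenarios.

    multicollinearity: Toeplitz covariance gamma**|m-n| among predictors.
    autocorrelation:   AR(1) errors e_t = rho*e_{t-1} + u_t.
    heteroscedasticity: error s.d. sigma_1 on the first n/2 points,
                        sigma_2 on the rest.
    Unit coefficients on the p_rel relevant variables are an assumption.
    """
    rng = np.random.default_rng(seed)
    p = p_rel + q_irr
    if scenario == "multicollinearity":
        cov = gamma ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    else:
        cov = np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.r_[np.ones(p_rel), np.zeros(q_irr)]
    if scenario == "autocorrelation":
        u = rng.normal(size=n)
        e = np.zeros(n)
        for t in range(1, n):
            e[t] = rho * e[t - 1] + u[t]
    elif scenario == "heteroscedasticity":
        e = np.r_[rng.normal(0, sigmas[0], n // 2),
                  rng.normal(0, sigmas[1], n - n // 2)]
    else:
        e = rng.normal(size=n)
    return X, X @ beta + e, np.arange(p_rel)

X, y, support = simulate_dgp(100, 5, 15, scenario="heteroscedasticity")
print(X.shape, y.shape, support)
```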
To evaluate performance, [1] used potency and gauge to assess which model performs best in feature selection, and we follow the same criteria for model selection. The entire process is replicated 1,000 times. The comparison of the regularization techniques and Autometrics is assessed in terms of the correct retention of relevant variables (potency) and the incorrect retention of irrelevant variables (gauge) [1]. For the simulations as well as the empirical analysis, we use R software.
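Potency and gauge, as defined above, can be computed directly from the selected and true supports; the worked example below is illustrative.

```python
def potency_gauge(selected, relevant, p_total):
    """Potency: share of truly relevant variables retained.
    Gauge: share of irrelevant variables wrongly retained.
    Follows the definitions used in the text (gets literature).
    """
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p_total)) - relevant
    potency = len(selected & relevant) / len(relevant)
    gauge = len(selected & irrelevant) / len(irrelevant) if irrelevant else 0.0
    return potency, gauge

# A model that keeps all 3 relevant variables plus 1 of the 7 irrelevant ones:
pot, gau = potency_gauge(selected={0, 1, 2, 5}, relevant={0, 1, 2}, p_total=10)
print(pot, gau)   # 1.0 and 1/7
```

A perfect selector has potency 1 and gauge 0, which is the benchmark the tables are read against.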

Simulation Results.
The Monte Carlo simulation results are presented in Tables 1-3. Table 1 depicts the simulation findings in the cases of low, moderate, and high multicollinearity for different combinations of observations (n) and covariates. The performance of all methods improves with increasing sample size. (1) In the case of low and moderate multicollinearity, the potency associated with all methods is one under most simulated scenarios, clearly revealing that they retain all the relevant variables. Increasing the level of multicollinearity tends to improve the performance of AEnet and Autometrics, in that they retain fewer irrelevant variables, but adversely affects the performance of MCP, particularly in small and moderate samples. Across low and moderate multicollinearity, the gauge associated with AEnet is lower than that of the other methods, showing that it retains fewer irrelevant covariates. By comparison, E-SCAD retains more irrelevant variables and thus overspecifies the true model. (2) In the case of high collinearity: high collinearity among variables substantially distorts the performance of MCP and Autometrics in terms of potency and gauge. AEnet retained more than 93 percent of the correct variables with an outstanding gauge (zero percent). The potency and gauge of the other methods tend to increase with sample size, with MCP and E-SCAD in particular significantly overspecifying the true model (retaining more irrelevant variables). AEnet showed outstanding performance in terms of gauge. In large samples, the E-SCAD gauge improved relative to the cases of low and moderate multicollinearity. Table 2 presents the simulation results under varying heteroscedasticity, sample sizes, and numbers of covariates (both relevant and irrelevant).
(1) In the case of heteroscedasticity: the potency of all included methods is one in almost all scenarios, clearly showing that they retain all the active covariates. In contrast, the gauges of AEnet and Autometrics show that they avoid the irrelevant variables and identify the true model very precisely. A higher level of autocorrelation adversely affects the potency of Autometrics relative to the rival methods. The results suggest that MCP drops the inactive variables, particularly as the sample size increases, while E-SCAD considerably overspecifies the model. Increasing the number of covariates tends to affect the gauge associated with Autometrics and AEnet.

Real Data Implications
After the Monte Carlo experiments, this study performs real data analysis using Huge Big Data. We consider workers' remittance inflows and all their possible determinants. Many factors affect workers' remittance inflows: some covariates are recommended by economic theory, and a long list of further variables has been recommended by past studies. This study considers all possible determinants based on economic theory and the literature to build a general model. In the econometrics literature, such a model is known as the general unrestricted model (GUM).

Data Source.
This study collects yearly data for Pakistan from 1972 to 2020 from different sources, such as the World Development Indicators (WDI), International Financial Statistics (IFS), the International Country Risk Guide, and the State Bank of Pakistan. The few missing observations in the data set are replaced by averaging neighboring observations. Most variables are transformed into logarithmic form to ensure normality. Table 4 describes the variables, their symbols, the definition of each variable, and the data source.

Correlation Matrix.
In Figure 4, blue and red colors exhibit positive and negative correlations between the variables, respectively. The color intensity and the area of the circles indicate the strength of the pairwise correlation, and the legend on the right side of the correlogram maps colors to correlation values. We observe numerous intensely colored circles in blue and red, evidence of high pairwise correlation. Figure 4 thus shows that there exists high multicollinearity among the predictors over the period 1972 to 2020. We noted in the Monte Carlo simulations that, in the case of high multicollinearity, AEnet outperformed its rivals in terms of potency and gauge, mainly when the sample size is small. This suggests that AEnet is more robust in such circumstances, and thus we should rely on the AEnet output.

We performed diagnostic tests and found that the residuals of the estimated model are homoscedastic and uncorrelated. Table 5 depicts the feature selection based on the real data using the classical and shrinkage methods. In Table 5, AEnet selects 13 important determinants of workers' remittance out of 27. In contrast, MCP and E-SCAD recommend many unrelated determinants of workers' remittance; in other words, they overspecify the model. Apart from this, Autometrics keeps the fewest irrelevant variables. The selection of an irrelevant set of covariates leads to poor forecasting, whereas the right set of covariates can improve forecasting, leading to low forecast error. Consequently, an accurate forecast can help the government and other sectors in decision-making. To summarize, the empirical application strongly supports the findings of the simulation exercise.

Conclusion and Recommendations
This study compares Autometrics and three machine learning techniques, namely the Minimax Concave Penalty (MCP), Elastic Smoothly Clipped Absolute Deviation (E-SCAD), and Adaptive Elastic Net (AEnet), under different scenarios (multicollinearity, heteroscedasticity, and autocorrelation) with varying sample sizes and numbers of covariates. We conducted Monte Carlo experiments to compare all methods in terms of variable selection using potency and gauge. All methods improve their performance as the sample size expands. In the cases of low and moderate multicollinearity, as well as low and moderate autocorrelation, the techniques retain all relevant predictor variables. However, for low and moderate multicollinearity, all methods except AEnet also keep many irrelevant predictors, whereas under low and moderate autocorrelation, Autometrics, like AEnet, retains few irrelevant predictor variables. In the presence of extreme multicollinearity, AEnet retains more than 93 percent of the correct variables. The potency of the remaining techniques, specifically MCP and E-SCAD, tends towards unity with increasing sample size, but they also capture many irrelevant predictors. At a higher level of autocorrelation, E-SCAD shows good performance in selecting relevant variables in small samples; however, the same method collapses in terms of gauge. Autometrics and AEnet, by contrast, performed better in gauge and often retained less than 5 percent irrelevant variables. In the presence of heteroscedasticity, all techniques usually retain all relevant variables but also suffer from overspecification, except AEnet and Autometrics, which avoid the irrelevant predictors and identify the true model precisely.
On the application side, we take the workers' remittance data along with its twenty-seven determinants, spanning 1972 to 2020. AEnet keeps thirteen predictors of workers' remittance, while MCP and E-SCAD select many irrelevant determinants and consequently overspecify the model. This study offers several recommendations: (i) in the case of low or moderate multicollinearity with a small sample, practitioners and policymakers can use E-SCAD, provided there are few irrelevant covariates; beyond this case, AEnet is recommended in the presence of multicollinearity, particularly when the covariates are highly correlated with each other; (ii) the study recommends AEnet when the residuals are heteroscedastic; (iii) in the presence of autocorrelation, if there are more active variables and fewer inactive variables, researchers should adopt E-SCAD, and in the converse scenario, AEnet or Autometrics; (iv) in the case of Pakistan, AEnet showed remarkable performance in selecting relevant variables. Hence, policymakers and practitioners should focus on the relevant variables selected by AEnet to improve workers' remittances in the case of Pakistan. In this regard, the Pakistani government has devised policies that make it easy to transfer remittances legally and that mitigate the cost of transferring remittances from abroad. The AEnet approach can help policymakers identify relevant variables in the presence of a huge set of covariates, which in turn produces accurate predictions.