Composite Quantile Regression Neural Network for Massive Datasets

Traditional statistical methods and machine learning on massive datasets are challenging owing to the limitations of computer primary memory. Composite quantile regression neural network (CQRNN) is an efficient and robust estimation method, but most existing computational algorithms cannot solve CQRNN for massive datasets reliably and efficiently. To this end, we propose a divide-and-conquer CQRNN (DC-CQRNN) method to extend CQRNN to massive datasets. The major idea is to divide the overall dataset into several subsets, apply CQRNN to the data within each subset, and obtain the final results by combining the subset training results via a weighted average. Our approach significantly reduces both the required amount of primary memory and the computational time. Monte Carlo simulation studies and an application to an environmental dataset with millions of observations verify and illustrate that the proposed approach performs well for CQRNN on massive datasets. The proposed DC-CQRNN method has been implemented in Python on a Spark system; it takes 8 minutes to complete the model training, whereas a full-dataset CQRNN takes 5.27 hours to get a result.


Introduction
With the development of information technology, mobile Internet, social networks, and e-commerce have greatly expanded the boundaries and applications of the Internet. Terabyte-scale datasets are becoming more common. For example, the National Aeronautics and Space Administration Earth Observing System Terra and Aqua satellites monitor the Earth's atmosphere, oceans, and land, producing approximately 1.5 TB of environmental data per day. According to Intel's forecast, in 2020 a networked self-driving car will generate 4 TB of data every 8 hours of operation. Massive datasets offer researchers both unprecedented challenges and opportunities. The key challenge is that directly applying machine learning and statistical methods to these massive datasets with conventional computing methods is prohibitive. First, the calculation time is too long to get results quickly. Second, the data can be so big that the computer's primary memory overflows. To overcome these challenges, researchers have proposed divide-and-conquer methods [1][2][3], which can be an effective way to analyze massive datasets.
In this paper, we consider a divide-and-conquer method for massive datasets. Fan et al. [4] analyzed the least squares regression of the linear model on a massive dataset using a divide-and-conquer method. Lin [1] considered a divide-and-conquer method for estimating equations in massive datasets. Chen [2] analyzed generalized linear models on extraordinarily large data using the divide-and-conquer method. Zhang [5] proposed a divide-and-conquer kernel ridge regression. Schifano [3] extended the divide-and-conquer approach to online updating for streaming data. A block average quantile regression (QR) approach for massive datasets is proposed in [6] by combining the divide-and-conquer method with QR. Jiang [7] extended the work of [6] to composite quantile regression (CQR) for massive datasets. Recently, Chen et al. [8] studied QR under memory constraints for massive datasets. Chen [9] considered a divide-and-conquer approach for quantile regression in big data.
As is well known, QR is more robust than ordinary least squares (OLS) regression when the error distribution is heavily skewed. However, the relative efficiency of QR can be arbitrarily small compared with OLS. To this end, CQR, proposed in [10], is always effective regardless of the error distribution and can be much more efficient than OLS. Since then, CQR has been extensively studied in nonlinear models. Reference [11] utilized local polynomials and CQR to estimate the nonparametric model and proposed local polynomial CQR. Kai [12] studied CQR for the varying coefficient partially linear model. Guo [13] proposed CQR for the partially linear additive model. Jiang [14] studied two-step CQR for the single index model.
Artificial intelligence (AI) methods do not require any a priori assumptions about the model when dealing with nonlinear problems, which is a significant advantage over statistical models such as the nonparametric model and the varying coefficient partially linear model. There has been a lot of research on combining the nice properties of AI with QR or CQR. For instance, Taylor and Cannon [15, 16] proposed the quantile regression neural network (QRNN) by combining the artificial neural network (ANN) with QR. A support vector quantile regression (SVQR) method is proposed in [17] by combining the support vector machine (SVM) and QR. A composite quantile regression neural network (CQRNN) method is studied in [18], which adds an ANN structure to CQR. However, when the amount of data is large, directly computing CQR and ANN with conventional computing methods is well known to be very slow, so the computation of CQRNN is even slower. Thus, computation is a bottleneck for applying CQRNN to massive datasets.
In this paper, our focus is on CQRNN for massive datasets whose size exceeds the limitations of a single computer's primary memory. Fortunately, we are not limited to a single computer. To this end, we consider CQRNN for massive datasets by the divide-and-conquer method on a distributed system. A distributed system is composed of multiple computers (called nodes) that can run independently, and each node uses wire protocols (such as RPC and HTTP) to transfer information to achieve a common goal or task. In a distributed system, the cost of data communication between different nodes is usually very high. We consider a "master and worker" type of distributed system in which workers do not communicate directly with each other, as shown in Figure 1. The concrete steps are illustrated as follows.
Step 1: randomize the initial parameters.
Step 2: distribute the initial parameters to each worker.
Step 3: each worker trains a subset of the data and sends the training results to the master.
Step 4: the master takes the weighted average of the training results of each worker as the approximate global training results.
Step 5: when there is more data to be processed, return to Step 2.
Traditional statistical methods and machine learning on massive datasets are challenging owing to the limitations of computer primary memory. CQRNN is an efficient and robust estimation method, but most existing computational algorithms cannot solve CQRNN for massive datasets reliably and efficiently. In this paper, we propose a DC-CQRNN method to extend CQRNN to massive datasets. The major idea is to divide the overall dataset into several subsets, apply CQRNN to the data within each subset, and obtain the final results by combining the subset training results via a weighted average. The proposed DC-CQRNN method can significantly reduce the computational time and the required amount of primary memory, while the training results remain as effective as analyzing the full data at once. For illustration, we use Monte Carlo simulation to compare the performance of the DC-CQRNN method with CQRNN [18], QRNN [15, 16], the artificial neural network (ANN) [19], SVM [20], and random forest (RF) [21]. In addition, an application to an environmental dataset with millions of observations verifies and illustrates that our proposed approach performs well for CQRNN on massive datasets. The remainder of this paper is organized as follows. In Section 2, we present the DC-CQRNN for massive datasets in detail. In Section 3, we use Monte Carlo simulation to illustrate the finite sample performance of the proposed DC-CQRNN method. A detailed presentation of the environmental dataset analysis is given in Section 4. Section 5 concludes the paper.

Our Motivation.
In recent years, China's GDP has grown rapidly. This has also intensified the contradiction between national economic development and the environment. At the same time, smog pollution has occurred in parts of North China, Northeast China, and Central China, which has had a huge effect on the country. Therefore, it is extremely important to accurately report air quality to the public and to take smog-prevention measures in advance. At present, a large number of air quality monitoring stations have been established in many places in China, such as Beijing, Chengde, Tangshan, and Tianjin. However, for areas that do not have air monitoring stations, how to accurately predict air quality and report it to the public in a timely manner remains a problem.
To this end, we collected environmental datasets from air quality monitoring stations in different places, with a total of 1,018,562 observations. When using the CQRNN method to deal with these environmental datasets, we found that the data are so big that a typical computer's primary memory overflows, and the computational time is too long to get results quickly. The CQRNN method is thus basically ineffective for massive data. Likewise, SVM, ANN, RF, etc., also take a long time to process massive datasets. In this section, we propose a divide-and-conquer CQRNN (DC-CQRNN) method to extend CQRNN to massive datasets. The same idea can be applied to SVM, ANN, and RF.

Composite Quantile Regression Neural Network.
In the real world, there is usually a nonlinear relationship between the response y and the predictors x = (x_1, x_2, ..., x_p)', which can be written as the stochastic model

y = f(x, θ) + ϵ,

where ϵ is the model error and θ is a vector of unknown parameters. There are many regression techniques for estimating the unknown parameters of the model, such as OLS, QR, CQR, and their derived methods. Xu et al. [18] proposed CQRNN by combining the nice properties of ANN and CQR. Given predictors x_i and responses y_i, the CQRNN objective is to minimize the empirical loss function

E(θ) = (1/n) Σ_{q=1}^{Q} Σ_{i=1}^{n} ρ_{τ_q}(y_i − ŷ_i(τ_q)),   (5)

where ρ_{τ_q}(u) = τ_q u I(u ≥ 0) + (τ_q − 1) u I(u < 0), τ_q = q/(Q + 1) for q = 1, ..., Q (see [10]), and ŷ_i(τ_q) is the conditional quantile of y at τ_q, which can be estimated by the following two steps. First, the output of the jth hidden-layer node, g_j(τ_q), is calculated by applying an activation function f^(h) to the inner product of x_i and the hidden-layer weights w^(h)_{jl}, plus the hidden-layer bias b^(h)_j:

g_j(τ_q) = f^(h)( Σ_{l=1}^{p} w^(h)_{jl} x_{i,l} + b^(h)_j ),   j = 1, ..., L,

where f^(h) is an activation function, generally the hyperbolic tangent, and (b^(h)_1, ..., b^(h)_L)' is the bias vector of the hidden layer. Second, an estimate of the response y_i is given by

ŷ_i(τ_q) = f^(o)( Σ_{j=1}^{L} w^(o)_j g_j(τ_q) + b^(o) ),

where f^(o) is the output-layer activation function and θ contains all coefficients, including weights and biases, to be trained or estimated. The main purpose of CQRNN is to estimate θ by minimizing the empirical loss in (5).

Remark 1.
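As a concrete illustration, the two-step forward pass and the composite loss can be sketched in NumPy. This is a minimal sketch rather than the implementation of [18]: it assumes one hidden layer shared across all quantile levels, a separate output bias per quantile level, a tanh hidden activation, and an identity output activation; all variable names are illustrative.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = tau*u*I(u>=0) + (tau-1)*u*I(u<0)."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def cqrnn_forward(x, W_h, b_h, w_o, b_o):
    """One-hidden-layer forward pass: tanh hidden units, linear output.
    b_o has one entry per quantile level (an illustrative choice; the
    hidden layer itself is shared across quantile levels)."""
    g = np.tanh(x @ W_h + b_h)               # (n, L) hidden-node outputs
    return (g @ w_o)[:, None] + b_o[None, :]  # (n, Q) predicted quantiles

def composite_loss(y, y_hat, taus):
    """Empirical CQR loss: average check loss over all Q quantile levels."""
    total = 0.0
    for q, tau in enumerate(taus):
        total += check_loss(y - y_hat[:, q], tau).mean()
    return total / len(taus)

# Toy usage with Q = 4 quantile levels tau_q = q/(Q+1)
rng = np.random.default_rng(0)
n, p, L, Q = 200, 3, 5, 4
taus = np.array([(q + 1) / (Q + 1) for q in range(Q)])
x = rng.normal(size=(n, p))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=n)
W_h = rng.normal(scale=0.1, size=(p, L))
b_h = np.zeros(L)
w_o = rng.normal(scale=0.1, size=L)
b_o = np.zeros(Q)
y_hat = cqrnn_forward(x, W_h, b_h, w_o, b_o)
loss = composite_loss(y, y_hat, taus)
print(float(loss))
```

In a real fit, θ = (W_h, b_h, w_o, b_o) would be updated by a gradient-based optimizer; the loss above is what such an optimizer would minimize.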
(1) According to the suggestion of [16], we use the Huber norm to overcome the problem that ρ_τ(u) is not differentiable everywhere.
(2) To avoid overfitting the model, we add a penalty term to equation (5) to obtain equation (6),
where ‖·‖₂ is the L₂-norm and λ is a regularization parameter. Following the suggestion of [18], we choose L and λ through an EBIC-like criterion, where df is the number of selected variables corresponding to λ and L.
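The Huber-smoothed check loss mentioned in Remark 1 can be sketched as follows. This is a minimal NumPy sketch following the common construction in [16]: the |u| factor in the check loss is replaced by the Huber norm so the loss is differentiable at u = 0. The threshold eps is an assumed tuning value, not one specified in the text.

```python
import numpy as np

def huber(u, eps=1e-3):
    """Huber norm approximating |u|: quadratic within eps of zero, linear beyond."""
    au = np.abs(u)
    return np.where(au <= eps, u ** 2 / (2 * eps), au - eps / 2)

def smoothed_check_loss(u, tau, eps=1e-3):
    """Differentiable approximation of rho_tau(u): writes the check loss as
    |u| * (tau if u >= 0 else 1 - tau) and replaces |u| by the Huber norm."""
    return np.where(u >= 0, tau, 1.0 - tau) * huber(u, eps)

# Far from zero, the smoothed loss is essentially the exact check loss.
u = np.array([-1.0, 0.0, 1.0])
exact = np.where(u >= 0, 0.3 * u, (0.3 - 1) * u)
smooth = smoothed_check_loss(u, tau=0.3)
print(exact, smooth)
```

The quadratic region only matters in a small neighborhood of zero, so the minimizer of the smoothed objective stays close to that of the original composite loss.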

Divide-and-Conquer Composite Quantile Regression
Neural Network. When the sample size N is too big, directly solving the optimization problem in (5) with conventional computing methods is infeasible. Based on the ideas of [1, 2], our method is to divide the overall dataset into subsets, each small enough to fit in the computer's primary memory. Then, we implement CQRNN on the data within each subset. Finally, we obtain the results by combining these training results via a weighted average. In detail, the proposed DC-CQRNN is obtained by the following concrete steps:
Step 1: randomize the initial parameters and divide the full dataset into K subsets, so that the kth subset contains n_k observations: (x_{k,i}, y_{k,i}), i = 1, ..., n_k.
Step 2: distribute the initial parameters to each worker.
Step 3: each worker trains a subset of the full dataset, obtains the estimator θ̂_k, k = 1, ..., K, by solving equation (5), and sends θ̂_k to the master.
Step 4: the master takes the weighted average of θ̂_k, k = 1, ..., K, to obtain θ̂ as the resulting estimator of θ.
Step 5: when there are more data to be processed, return to Step 2.
The detailed process is shown in Figure 2.
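The five steps can be sketched end to end as follows. This is a hedged sketch, not the authors' implementation: train_subset fits ordinary least squares purely as a runnable stand-in for the per-worker CQRNN fit, and the combination weights n_k / N are our assumption for the Step 4 weighted average. On the distributed system, each iteration of the loop would run on a separate worker in parallel.

```python
import numpy as np

def train_subset(x_k, y_k):
    """Stand-in for the per-worker model fit (CQRNN in the paper); here a
    least-squares line fit, so the combination step below is runnable."""
    X = np.column_stack([np.ones(len(x_k)), x_k])
    theta, *_ = np.linalg.lstsq(X, y_k, rcond=None)
    return theta

def dc_estimate(x, y, K):
    """Divide-and-conquer estimate: split into K subsets, fit each, and
    combine by a sample-size-weighted average (weights n_k / N assumed)."""
    N = len(y)
    subsets = np.array_split(np.arange(N), K)    # Step 1: K subsets
    thetas, weights = [], []
    for sub in subsets:                          # Steps 2-3: per-worker fits
        thetas.append(train_subset(x[sub], y[sub]))
        weights.append(len(sub) / N)
    return np.average(thetas, axis=0, weights=weights)  # Step 4

# Toy check: the combined estimate tracks the full-data fit.
rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 2.0 + 3.0 * x + rng.normal(size=5000)
theta_full = train_subset(x, y)
theta_dc = dc_estimate(x, y, K=10)
print(theta_full, theta_dc)
```

Because the subsets are disjoint, memory use per worker is roughly n_k / N of the full-data requirement, which is the point of the method.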

Remark 2.
Obviously, if K is large, the subset sample sizes n_k will be very small, so the correlation among the values of the dataset is destroyed. To this end, we place some restrictions on K to manage correlation. The regularity conditions are as follows: (a) the sample size of the kth subset is restricted in terms of n_max = max(n_k) and n_min = min(n_k).

Numerical Simulations
In this section, we use Monte Carlo simulation to compare the finite sample performance of the DC-CQRNN method with CQRNN [18], QRNN [15, 16], ANN [19], SVM [20], and RF [21]. The performances of the CQRNN method with different Q are very similar in the simulations of [22]. Thus, we only consider Q = 4, 9 as a compromise between the estimation and computational efficiency of the CQRNN method.

Simulation Data.
To investigate the performance of DC-CQRNN with different structures, we choose various values of the parameters Q and K, namely, Q = 4, 9 and K = 1, 10, 50. At the same time, to illustrate the effectiveness and robustness of our method, we consider three different error distributions for ϵ: a normal distribution (N(0, 0.25)), a Student's t distribution with three degrees of freedom (t(3)), and a chi-square distribution with three degrees of freedom (χ²(3)), together with two model cases.
We generated 100,000 samples with the same observations and variables under case 1 and case 2 for each of the three error distributions, and randomly assigned 50,000 samples as the training dataset (in sample) and the remaining samples as the testing dataset (out of sample). To assess the performance of the competing models, all simulations are run for 100 replications.
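A minimal NumPy sketch of the simulation error terms and the random 50/50 split follows. The sample size and distribution parameters come from the text; the regression functions for case 1 and case 2 are not reproduced here, so only the error generation and split are shown.

```python
import numpy as np

rng = np.random.default_rng(2025)
n = 100_000

# The three error distributions considered in the simulations.
eps_normal = rng.normal(loc=0.0, scale=np.sqrt(0.25), size=n)  # N(0, 0.25)
eps_t3 = rng.standard_t(df=3, size=n)                          # t(3)
eps_chi2 = rng.chisquare(df=3, size=n)                         # chi-square(3)

# Random 50/50 split into training (in sample) and testing (out of sample).
perm = rng.permutation(n)
train_idx, test_idx = perm[:n // 2], perm[n // 2:]
print(len(train_idx), len(test_idx))
```

Note that N(0, 0.25) specifies the variance, so the standard deviation passed to the generator is sqrt(0.25) = 0.5; the t(3) and χ²(3) errors exercise heavy tails and skewness, respectively.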

Prediction
where f̂(x_i) is an estimator of f(x_i). We employ the EBIC criterion in (7) to select L and λ, and the results are shown in Table 1. To reduce the computational load, we consider all combinations of λ = 0, 0.05, ..., 1 and L = 1, 2, ..., 10. The prediction performance is listed in Tables 2 and 3.
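The prediction accuracy summaries used in the tables (RMSE and MAE, as named in Section 4) can be computed as below; this is a generic sketch, with y_hat standing in for any fitted model's predictions f̂(x_i).

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared prediction error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute prediction error."""
    return float(np.mean(np.abs(y - y_hat)))

# Toy usage on a small vector of predictions.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])
print(rmse(y, y_hat), mae(y, y_hat))
```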
From Tables 2 and 3, we can compare the prediction accuracy of the methods; the corresponding CPU times are reported in Table 4. It can be seen from Table 4 that the CPU time of CQRNN is longer than that of QRNN. At the same time, DC-CQRNN runs faster than QRNN and CQRNN, and the CPU time of DC-CQRNN decreases as K increases. The main advantage of DC-CQRNN is that each worker in the master-and-worker distributed system is independent, so the CQRNN on each worker can be executed in parallel. As a result, the computing time of DC-CQRNN on massive datasets is greatly reduced compared with CQRNN.
In addition, to compare the computational efficiency of the DC-CQRNN method in sequential and parallel distributed environments, Table 5 records the computing time of DC-CQRNN and DC-QRNN under a sequential distributed environment (SDE) and a parallel distributed environment (PDE). It can be seen from Table 5 that PDE has an obvious computational advantage over SDE.

Environmental Dataset.
The research areas of this paper comprise Baoding, Beijing, Chengde, Shijiazhuang, Tangshan, Tianjin, Xingtai, and Zhangjiakou in China from January 1, 2015, to July 1, 2019, for a total of 1,018,561 observations. Our study collected hourly historical PM2.5 monitoring data from the Ministry of Environmental Protection of the People's Republic of China (http://datacenter.mep.gov.cn) and meteorological data from the National Oceanic and Atmospheric Administration (https://www.noaa.gov/). Each sample includes the PM2.5 concentration, temperature, pressure, humidity, wind direction, wind speed, and visibility measured at the air quality monitoring stations. Details of the dataset are shown in Table 6.

Normalization.
Data normalization is an important step for many machine-learning estimators, particularly when dealing with neural networks. Features with wide ranges are likely to cause instability when training models. Standardization was used to rescale the features by subtracting the mean and dividing by the standard deviation, with the standardized value Z_i calculated as

Z_i = (x_i − x̄) / δ_x,

where x_i, x̄, and δ_x are the sample value, mean, and standard deviation, respectively.
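The standardization step can be sketched in NumPy as follows. This is a minimal version; in practice the mean and standard deviation would be estimated on the training set only and then applied to the test set.

```python
import numpy as np

def standardize(x):
    """Z-score standardization: subtract the sample mean and divide by the
    sample standard deviation, as described above."""
    return (x - x.mean()) / x.std()

# Example: a feature with a wide range is rescaled to mean 0, sd 1.
pm25 = np.array([12.0, 35.0, 80.0, 150.0, 300.0])
z = standardize(pm25)
print(z.mean(), z.std())
```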

Empirical
Results. The aim of our experiments is to examine the effectiveness of the proposed DC-CQRNN for the spatial prediction of PM2.5 concentration. We also consider the ANN, QRNN, CQRNN, RF, and SVM models. The PM2.5 concentration monitoring dataset is divided into a training set and a testing set in a ratio of 7:3. To obtain results in the most objective way, we repeated the training and testing experiments 100 times with randomly chosen compositions of training and testing data; the final training and testing results are the averages over all trials. For the parameter settings of the neural-network models, we use the EBIC approach, with the selected parameters presented in Table 7; for the other machine learning models, we use cross-validation [23] for parameter setting.
We set K = 1, 10, 50 and Q = 4, 9. For DC-QRNN, we only report the results at τ = 0.5. The results of our experiments, measured by RMSE, MAE, and CPU time, are reported in Table 8.
In terms of prediction accuracy, the training results of DC-QRNN and DC-CQRNN are close to those of QRNN and CQRNN based on the full sample, respectively. CQRNN is significantly better than ANN, RF, and SVM. In particular, CQRNN performs best with Q = 9 for both in-sample and out-of-sample data.
Considering the CPU time of the models, the CPU time of ANN trained on the full sample is higher than that of RF and SVM. In DC-QRNN and DC-CQRNN, the training on each piece of data is independent, so the pieces can be processed in parallel, which significantly reduces the computational time. The experimental results show that the CPU times of DC-QRNN and DC-CQRNN are much lower than those of QRNN and CQRNN based on the full sample, respectively. The environmental dataset includes 1,018,562 observations.
The entire methodology has been implemented in Python on a Spark system. Using the proposed DC-CQRNN method with K = 50 and Q = 9 on the Spark system, it takes 8 minutes to complete the model training, whereas the full-sample CQRNN method with Q = 9 takes 5.27 hours to get a result.

Results and Discussion
Massive datasets offer researchers both unprecedented challenges and opportunities. The key challenge is that directly applying machine learning and statistical methods to these massive datasets with conventional computing methods is prohibitive. In this paper, Monte Carlo simulation studies and an environmental dataset application verify and illustrate that our proposed approach performs well for CQRNN on massive datasets. Therefore, the DC-CQRNN method is effective and important. Obviously, the larger the value of K, the more computationally efficient the DC-CQRNN method is. However, it should be noted that if K is too large, the subset sample sizes n_k will be very small, so the correlation among the values of the dataset is destroyed. Therefore, K should be chosen to be moderately large but not excessive.

Conclusion
Using composite quantile regression neural networks to deal with massive datasets faces two main challenges: first, the calculation time is too long to get results quickly; second, the data can be so big that the computer's primary memory overflows. To solve these difficulties, we propose DC-CQRNN, a divide-and-conquer method on a "master and worker" type of distributed system. The proposed DC-CQRNN can significantly reduce the computational time and the required amount of computer primary memory, while the training results are as effective as analyzing the full data at once. The divide-and-conquer idea also extends to QRNN, ANN, and SVM. In the future, we will try to use subsampling methods to further reduce the time required to train neural networks on massive data and to save computational cost.
Data Availability
The research areas of this paper comprise Baoding, Beijing, Chengde, Shijiazhuang, Tangshan, Tianjin, Xingtai, and Zhangjiakou in China from January 1, 2015, to July 1, 2019, for a total of 1,018,561 observations. Our study collected hourly historical PM2.5 monitoring data from the Ministry of Environmental Protection of the People's Republic of China (http://datacenter.mep.gov.cn) and meteorological data from the National Oceanic and Atmospheric Administration (https://www.noaa.gov/).

Conflicts of Interest
The authors declare that they have no conflicts of interest.