The Optimal Bandwidth Parameter Selection in GPH Estimation

In this paper, we investigate the optimal bandwidth parameter of the GPH algorithm. First, drawing on the stylized facts of financial time series, we generate long memory sequences with the ARFIMA(1, d, 1) process. Second, we use the Monte Carlo method to study how the bandwidth parameter affects the GPH algorithm in three respects: the existence test of long memory, the judgment of persistence or antipersistence, and the estimation accuracy of the long memory parameter. The results show that the accuracy in all three respects reaches a relatively high level within the bandwidth parameter interval 0.5 < a < 0.7. For time series of different lengths, the bandwidth parameter a = 0.6 can be used as the optimal choice for GPH estimation. Furthermore, we report the accuracy of the GPH algorithm for the existence of long memory, its persistence or antipersistence, and the long memory parameter d when a = 0.6.


Introduction
Long-term memory is widespread in biology, medicine, geology, hydrology, climate, and the social sciences [1][2][3]. It refers to the fact that observations depend on each other over long horizons, so that the autocorrelation function of a sequence decays slowly. In a system with long memory, some important historical events influence the future over long time spans, which contributes to the formation of long memory. For example, large price rises and falls in stock markets and extremely high and low temperatures have been shown to affect the long memory of the corresponding sequences [4,5]. According to the relationship between long memory and approximate entropy revealed by Pincus and Kalman, the stronger the long memory of a sequence is, the better its predictability will be [6]. In addition, Baillie found that if a time series has long memory, it is difficult to characterize its internal structure with short memory models such as the ARMA model, and the simulation and prediction accuracy of those models is comparatively low [7]. Therefore, research on long memory is of great importance for both theory and practice [8,9].
There are two types of long memory in time series. One is persistent long memory, which means that the future trend of the time series will stay in line with its current direction of movement. The corresponding long memory parameter satisfies 0 < d < 0.5. The other is antipersistent long memory, which indicates that future movement will be opposite to the present state; its long memory parameter satisfies −0.5 < d < 0. It is generally believed that the British hydrologist Hurst was the first to study long memory in a system. He used the Hurst index (H) to describe the strength of long memory in a time series [10]. The relationship between the Hurst index and the long memory parameter is H = 0.5 + d [11,12]. When H → 1 (or d → 0.5), the persistent long memory of the time series is strong. When H → 0 (or d → −0.5), the antipersistent long memory is strong. When H → 0.5 (or d → 0), the time series behaves like a random walk, which suggests that it is unpredictable in theory.
So far, there are no fewer than ten methods for calculating long memory parameters, which can be roughly divided into three categories. The first category is time-domain estimation, such as Aggregated Variance, Differencing the Variance [13], Higuchi [14], R/S Analysis, and Detrended Fluctuation Analysis (DFA) [15]. The second category is frequency-domain estimation, such as Whittle and Averaged Periodogram Estimation [16,17]. The third category is wavelet-domain estimation, such as the Wavelet Maximum Estimator and Wavelet-Based Estimation [18,19]. In addition, researchers have proposed many improved estimators based on these algorithms, such as the modified Rescaled Range [20], the exact local Whittle method, and the modified local Whittle estimation [21,22]. However, for time-domain estimation, it is difficult to judge the significance of long memory because the statistical distribution of the estimated long memory parameter is not available. For wavelet-domain estimation, the requirements on the structural features of a sequence are often too strict for the modulus to be extracted correctly, which sometimes makes the results inconsistent with qualitative analysis. For fully parametric estimation in the frequency domain, such as Whittle estimation, the random disturbance term is required to be Gaussian and integral operations are involved, requirements that are difficult to meet in practice. For example, it is well known that the distribution of return series in stock markets has a sharp peak and heavy tails [23,24]. If the time-domain and wavelet-domain estimators are regarded as nonparametric methods, then semiparametric methods in the frequency domain have gradually developed as a compromise between fully parametric and nonparametric estimation.
Unlike the methods above, the GPH estimator proposed by Geweke and Porter-Hudak [25] has advantages as a semiparametric method: it relaxes the normality requirement on the random disturbance term, and the statistical distribution of the estimator is available within a certain range. Based on the GPH framework, several improved algorithms for estimating long memory have been proposed [26,27]; they extend the GPH method to different kinds of long memory sequences, simplify the concept, and improve the computational speed. Robinson established the asymptotic normality of the GPH estimator and showed that it is suitable for stationary and invertible Gaussian vector sequences [28]. Hurvich et al. established the asymptotic properties of the GPH estimator and derived expressions for its asymptotic bias, variance, and mean square error, which allow the accuracy of the asymptotic theory for the mean square error to be evaluated in finite samples [29]. On this basis, Velasco generalized Robinson's results, showing that, with sufficient data tapering, the modified estimates are consistent and asymptotically normally distributed for any d (including nonstationary and noninvertible processes) [30]. In addition, in a study of long-range dependent linear time series, Velasco proved the consistency of the log-periodogram regression estimate of the long memory parameter and obtained its asymptotic distribution under possibly non-Gaussian observations [31]. However, from the application point of view, the GPH method is still a basic estimation method [32,33]. Moreover, the newer methods require programming to use, whereas the GPH algorithm is already available through menu-based operation in some econometric software, such as OX and R.
Therefore, the GPH method is indispensable for estimating long memory in terms of its generality and ease of use. However, there are three problems when using GPH to test the long memory of time series (financial data) [34][35][36][37]. First, for the parameter a of the bandwidth g(N) = N^a (where N is the sequence length), most studies choose 0.5, 0.6, 0.7, or 0.8, or directly select g(N) = N/2, to estimate long memory; these choices reflect the subjective judgment of the authors. Second, there is no clear understanding of how the bandwidth parameter a influences the existence test of long memory, the judgment of persistence or antipersistence, and the accuracy of the estimated long memory parameter d obtained with the GPH algorithm. Third, as Jeong et al. [38] pointed out, GPH and other methods for estimating long memory share a common problem: few authors have conducted a simulation analysis on data close to actual sequences to test the accuracy of the parameter d.
In this paper, based on the ARFIMA(1, d, 1) process and some typical features of financial time series, we use the Monte Carlo method to test the impact of the parameter a on the existence test of long memory, the judgment of persistence or antipersistence, and the estimation accuracy of the long memory parameter d, so as to give the optimal bandwidth for the GPH algorithm. The structure of this paper is as follows: Section 2 introduces the GPH method. Section 3 presents the Monte Carlo simulation method and validation rules. Section 4 analyzes the simulation results. Section 5 concludes.

The GPH Semiparametric Method of Long Memory Estimation
Many studies [39] show that the GPH semiparametric method rests on the assumption that the data-generating process is a fractional white noise process. Therefore, consider the fractional white noise process

$$(1 - L)^{d} x_t = u_t,$$

where L is the lag operator and u_t is a stationary process. If f_u(λ) is the spectral density function of u_t, then the spectral density function f_x(λ) of x_t can be expressed as

$$f_x(\lambda) = \left[4\sin^2(\lambda/2)\right]^{-d} f_u(\lambda). \tag{1}$$

Discretizing the logarithmic form of equation (1) gives

$$\ln f_x(\lambda_j) = \ln f_u(0) - d\,\ln\left[4\sin^2(\lambda_j/2)\right] + \ln\left[\frac{f_u(\lambda_j)}{f_u(0)}\right], \tag{2}$$

where λ_j = 2πj/N, j = 1, 2, ..., g(N), and g(N) = N^a. N is the length of the sequence x_t, and λ_j is called the harmonic frequency of the sample data. Geweke and Porter-Hudak proved that the last term ln[f_u(λ_j)/f_u(0)] in equation (2) is negligible or close to a constant at sufficiently small harmonic frequencies. Therefore, the Ordinary Least Squares (OLS) algorithm can be applied to equation (2) to estimate the long memory parameter d.
Besides, when d < 0, Geweke and Porter-Hudak showed that the estimator d̂ from equation (2) has the approximate distribution

$$\hat{d} \sim N\!\left(d,\ \frac{\pi^2}{6\sum_{j=1}^{g(N)}\left(z(j) - \bar{z}\right)^2}\right), \tag{3}$$

where z(j) = ln[4 sin²(λ_j/2)] and z̄ = (1/g(N)) Σ_{j=1}^{g(N)} z(j). For d > 0, equation (3) is verified empirically, and its theoretical proof is still an open question. However, for actual sequences, it is difficult to know the true value of d in advance. Once d is estimated from equation (2), the existence of long memory can be checked by judging whether the estimator d̂ is significantly different from 0. Ignoring the ln[f_u(λ_j)/f_u(0)] term and adding ln I(λ_j) to both sides, equation (2) is transformed as follows:

$$\ln I(\lambda_j) = \ln f_u(0) - d\,\ln\left[4\sin^2(\lambda_j/2)\right] + \ln\left[\frac{I(\lambda_j)}{f_x(\lambda_j)}\right], \tag{4}$$

where I(λ_j) is the periodogram, i.e., the squared magnitude of the discrete Fourier transform of the sequence at frequency λ_j, scaled by 1/(2πN). Porter-Hudak proved that ln[I(λ_j)/f_x(λ_j)] follows a Gumbel distribution with mean −0.57721 (the negative of Euler's constant) and variance π²/6. Hence, equation (4) is further simplified to

$$\ln I(\lambda_j) = \alpha - d\,\ln\left[4\sin^2(\lambda_j/2)\right] + e_j, \tag{5}$$

where α = ln f_u(0) − 0.57721 and e_j ∼ N(0, π²/6) under the large sample scenario. The parameter d can be estimated from equation (5). The existence of long memory in a sequence can then be tested with the statistic

$$t_{\hat{d}} = \frac{\hat{d}}{\mathrm{se}(\hat{d})} \sim t\big(g(N) - 2\big), \tag{6}$$

where se(d̂) is the OLS standard error of d̂ in equation (5). When the sample size is large, the t distribution approximates the normal distribution, and the statistical test of the estimator d̂ by equation (6) is approximately equivalent to that based on equation (3). Given a significance level, we can test the existence of long memory. In this paper, taking into account the robustness of the GPH algorithm in estimating long memory, we use a significance level of 0.1. For a large sample, Agiakloglou et al. and Sowell mentioned that equation (5) can still estimate the long memory of a sequence even if the sequence contains short-term components, as in the ARFIMA process [39,40]. In addition, Geweke and Porter-Hudak demonstrated the relationship between the long memory parameter d and the Hurst exponent H by a constructive method, i.e., d = H − 0.5.
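As an illustration, the log-periodogram regression described in this section can be sketched in Python with NumPy. The function names `gph_estimate` and `classify_memory` are hypothetical, the periodogram is assumed to be scaled by 1/(2πN), and the critical value 1.645 is the two-sided 10% point of the normal approximation to the t distribution. This is a minimal sketch under those assumptions, not the implementation used in the paper.

```python
import numpy as np


def gph_estimate(x, a=0.6):
    """Log-periodogram (GPH) regression; returns (d_hat, t_stat, m).

    m = floor(N**a) is the number of harmonic frequencies used.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    m = int(np.floor(n ** a))                 # bandwidth g(N) = N^a
    j = np.arange(1, m + 1)
    lam = 2.0 * np.pi * j / n                 # harmonic frequencies lambda_j
    # Periodogram at lambda_1..lambda_m (assumed scaling 1 / (2*pi*N)).
    dft = np.fft.fft(x - x.mean())
    periodogram = np.abs(dft[1:m + 1]) ** 2 / (2.0 * np.pi * n)
    y = np.log(periodogram)
    z = np.log(4.0 * np.sin(lam / 2.0) ** 2)
    # OLS of y on z with intercept; the slope estimates -d.
    zc = z - z.mean()
    slope = np.sum(zc * (y - y.mean())) / np.sum(zc ** 2)
    intercept = y.mean() - slope * z.mean()
    resid = y - intercept - slope * z
    sigma2 = np.sum(resid ** 2) / (m - 2)     # residual variance
    se = np.sqrt(sigma2 / np.sum(zc ** 2))    # OLS standard error of the slope
    d_hat = -slope
    return d_hat, d_hat / se, m


def classify_memory(d_hat, t_stat, crit=1.645):
    """Existence test first, then the sign of d_hat (hypothetical helper)."""
    if abs(t_stat) <= crit:
        return "none"
    return "persistent" if d_hat > 0 else "antipersistent"
```

For example, calling `gph_estimate(x, a=0.6)` on a sequence of length 4096 uses g(N) = 147 harmonic frequencies.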

Simulation Method.
In empirical financial research, it is generally believed that a first-order model can adequately describe autocorrelation and fluctuation in financial time series [41]. Combining the typical features of financial time series, such as a sharp peak, heavy tails, an asymmetric distribution, and long memory, this paper uses the ARFIMA(1, d, 1) model with the skewed Student's t distribution (SKST) to generate simulated data close to actual sequences, so as to test the impact of the bandwidth parameter a of the GPH algorithm on long memory estimation. The ARFIMA(1, d, 1) model is expressed as

$$(1 - \phi L)(1 - L)^{d} x_t = (1 + \theta L)\,\varepsilon_t, \qquad \varepsilon_t \sim \mathrm{SKST}(0, 1, \lambda, \nu), \tag{7}$$

where λ and ν are the skewness coefficient and the degrees of freedom of the skewed Student's t distribution. We randomly select λ from (−3, 3) and set ν = 4 in this paper. ϕ and θ are the autoregressive (AR) coefficient and the moving average (MA) coefficient, respectively. Most autoregressive and moving average coefficients in first-order models of financial time series lie between −1 and 1. Thus, ϕ and θ are drawn randomly from (−1, 1). We generate nine types of data with long memory parameters d = −0.4, −0.3, −0.2, −0.1, 0, 0.1, 0.2, 0.3, and 0.4 by equation (7). Given that the performance of the GPH algorithm in estimating long memory depends on the sequence length, we generate 5000 sequences for each long memory parameter and each length N = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 5000, 10000, and 50000. Hence, there are 9 × 14 × 5000 = 630000 sequences in total. Figure 1 shows one simulated sequence and its probability distribution. It can be seen that the simulated data have a sharp peak, a heavy tail, and an asymmetric distribution. When the GPH algorithm is applied to sequences with long memory, most of the literature suggests bandwidth parameters 0.2 ≤ a ≤ 0.5, a ≥ 0.5, and N^a ≤ N/2 [42,43].
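The data-generating step described above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes the ARFIMA(1, d, 1) form (1 − ϕL)(1 − L)^d x_t = (1 + θL)ε_t, truncates the MA(∞) expansion of (1 − L)^(−d) at the start of the simulated sample, and, as a labeled substitution, draws innovations from a symmetric standardized Student's t distribution with ν = 4 rather than the skewed Student's t distribution used in the paper. The names `frac_weights` and `simulate_arfima` are hypothetical.

```python
import numpy as np


def frac_weights(d, n_terms):
    """MA weights psi_k of (1 - L)^(-d): psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k."""
    k = np.arange(1, n_terms)
    return np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))


def simulate_arfima(n, d, phi=0.0, theta=0.0, nu=4, burn=1000, rng=None):
    """Simulate n observations of an ARFIMA(1, d, 1) sequence (sketch).

    Assumption: symmetric Student's t innovations scaled to unit variance,
    used here in place of the skewed Student's t of the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = n + burn
    eps = rng.standard_t(nu, size=total) * np.sqrt((nu - 2) / nu)
    # ARMA(1, 1) part: u_t = phi * u_{t-1} + eps_t + theta * eps_{t-1}
    u = np.empty(total)
    u[0] = eps[0]
    for t in range(1, total):
        u[t] = phi * u[t - 1] + eps[t] + theta * eps[t - 1]
    # Fractional integration: x_t = sum_k psi_k * u_{t-k}, truncated at t = 0
    psi = frac_weights(d, total)
    x = np.convolve(u, psi)[:total]
    return x[burn:]                            # drop the burn-in segment
```

The direct convolution costs on the order of (n + burn)² operations, so for the longest sequences in the study an FFT-based convolution would be a more practical choice.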
In order to fully understand the influence of different bandwidth parameters a on GPH estimation under different long memory parameters d and sequence lengths, this paper takes 0.2 ≤ a ≤ sup{a: N^a ≤ N/2} with a discretization step of 0.01, where sup{a: N^a ≤ N/2} ≈ 0.9.
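The upper limit of this grid follows from N^a ≤ N/2, which is equivalent to a ≤ 1 − ln 2 / ln N. A small sketch (the function names are hypothetical) that builds the grid of bandwidth parameters with step 0.01 for a given sequence length:

```python
import math


def max_bandwidth_exponent(n):
    """sup{a : n**a <= n/2}, i.e., 1 - ln 2 / ln n."""
    return 1.0 - math.log(2.0) / math.log(n)


def bandwidth_grid(n, a_min=0.2, step=0.01):
    """Grid a_min, a_min + step, ... not exceeding max_bandwidth_exponent(n)."""
    a_max = max_bandwidth_exponent(n)
    count = int(math.floor((a_max - a_min) / step + 1e-9)) + 1
    return [round(a_min + i * step, 2) for i in range(count)]
```

For N = 100 the bound is about 0.85, and for N = 50000 it is about 0.94, which is consistent with the approximate value 0.9 quoted in the text.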

Validation Rules.
The impact of the bandwidth parameter a on testing long memory is analyzed from three aspects: the existence of long memory, the persistence or antipersistence of long memory, and the estimation accuracy of the long memory parameter for different sequence lengths. The first two aspects should be validated sequentially: the existence of long memory in a sequence is tested first, and if the sequence has long memory, we then judge whether it is persistent or antipersistent. However, our simulation results show that analyzing the effect of the bandwidth parameter a in this way loses a lot of useful information. The optimal range of the parameter a derived from the existence judgment may be a very small interval, or even a single point. It is then difficult to fully investigate the impact of different bandwidth parameters a on the persistence or antipersistence judgment and on the estimation accuracy of the long memory parameter d, which is not conducive to finding the optimal bandwidth parameter a. To this end, we set the following rules to select the optimal bandwidth parameter a.
Rule 1. Based on Monte Carlo simulation and the GPH algorithm, the optimal bandwidth parameter sets for the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter d are recorded as A_1, A_2, and A_3, respectively.

Rule 2. According to the different requirements on testing the long memory of sequences with the GPH algorithm, three related sets of optimal parameters a are constructed, namely A_1, A_1 ∩ A_2, and A_1 ∩ A_2 ∩ A_3. A_1 ∩ A_2 denotes the optimal parameter set that satisfies the existence test and the persistence or antipersistence judgment of long memory simultaneously. A_1 ∩ A_2 ∩ A_3 represents the optimal parameter set that satisfies the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter. By this definition, the judging accuracy of the existence test is higher for parameters a in A_1 than for parameters outside it. The same holds for A_1 ∩ A_2 and A_1 ∩ A_2 ∩ A_3 with respect to their corresponding criteria.
Rule 3. Given a time series, we assume that it has long memory or no long memory with equal probability, i.e., 0.5. For time series with long memory, the different long memory parameters d are equally likely.
Pxz(d, a) is the judging accuracy of the existence test of long memory using the GPH algorithm under bandwidth parameter a and long memory parameter d. Obviously, the closer Pxz(d, a) is to 1, the more accurate the GPH algorithm is. Pxz(0, a) measures the accuracy of the existence test on sequences with no long memory. To comprehensively judge the ability of the GPH algorithm on sequences with different long memory, we take the mean of Pxz(d, a) over d ≠ 0, denoted Pxzc(a). For d = 0, we set Pxzfc(a) = Pxz(0, a) for comparative analysis. In actual analysis, it is impossible to know in advance whether a sequence x_t has long memory. Therefore, Pxzzl(a) is constructed to measure the overall accuracy of the existence test of long memory.

The Persistence or Antipersistence Judgment of Long Memory.
As d > 0 and d < 0 denote persistent and antipersistent long memory, respectively, we construct the judging accuracies Pzc(a) and Pfc(a) to study the impact of the GPH algorithm with different parameters a on the long memory test.

The Estimation Accuracy of the Long Memory Parameter.
According to the simulation results, if the error rate (1/5000) Σ_{j=1}^{5000} |d − d̂_j| / |d| is used to measure the precision of the GPH algorithm, the values for large and small |d| differ by several orders of magnitude, which is not conducive to finding the optimal range of the bandwidth parameter a. In this paper, the following rule is used instead. If the estimate d̂ falls within the neighborhood U(d, δ) of the true value d, the estimate obtained by the GPH algorithm is considered valid under the accuracy δ. A basic principle for selecting δ is that the neighborhoods U(d, δ) of different parameters d do not overlap. The estimation efficiency of the GPH algorithm under long memory parameter d and bandwidth parameter a is defined as

$$\mathrm{Pyx}(d, a) = \frac{\mathrm{Num}\left(d - \delta \le \hat{d} < d + \delta\right)}{5000},$$

where Num(d − δ ≤ d̂ < d + δ) is the number of estimates d̂ falling in the neighborhood U(d, δ). Pyxm(a) is the average estimation efficiency of the GPH algorithm over the different long memory parameters. The larger Pyxm(a) is, the higher the estimation efficiency of the GPH algorithm under the estimation accuracy δ. Generally, the smaller δ is, the farther apart the neighborhoods U(d, δ) of different parameters d are and the fewer estimates fall into U(d, δ), so the discriminating power of Pyx(d, a) may decline, which is not conducive to finding the optimal range of the bandwidth parameter a. Given the long memory parameters d used in the Monte Carlo simulation, we set δ = 0.05 in this paper.
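The neighborhood rule above can be illustrated with a short sketch. The function names are hypothetical, and the average is taken over whichever true values of d are supplied, since the exact set averaged in Pyxm(a) is an assumption here.

```python
import numpy as np


def estimation_efficiency(d_hats, d_true, delta=0.05):
    """Share of estimates with d_true - delta <= d_hat < d_true + delta."""
    d_hats = np.asarray(d_hats, dtype=float)
    inside = (d_hats >= d_true - delta) & (d_hats < d_true + delta)
    return float(inside.mean())


def average_efficiency(d_hats_by_d, delta=0.05):
    """Mean of estimation_efficiency over a dict {true d: list of estimates}."""
    return float(np.mean([estimation_efficiency(v, d, delta)
                          for d, v in d_hats_by_d.items()]))
```

With δ = 0.05 and true values spaced 0.1 apart, the half-open neighborhoods [d − δ, d + δ) touch but do not overlap, so each estimate is counted for at most one true value.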

The Existence Judgment of Long Memory.
As can be seen in Figure 2, the judging accuracy of the existence test for sequences with long memory increases gradually over the bandwidth parameter range [0.2, sup{a: N^a ≤ N/2}], while the judging accuracy for sequences with no long memory decreases gradually. More specifically, within the range [0.2, 0.4] of the bandwidth parameter a, for every sequence length simulated in this paper, the judging accuracy is less than 0.3 if the sequence has long memory. When a → 0.2, the judging accuracy is only 0.1. However, if a sequence has no long memory, the judging accuracy of the GPH algorithm reaches approximately 0.9. Because it is impossible to know in advance whether a time series has long memory, the GPH algorithm with bandwidth parameter a ∈ [0.2, 0.4] cannot reliably distinguish long memory from no long memory. In the range [0.8, sup{a: N^a ≤ N/2}] of the bandwidth parameter a, for sequences of different lengths, the judging accuracy of the GPH algorithm is about 0.9 if the sequence has long memory but less than 0.2 if it has none. Hence, a bandwidth parameter a ∈ [0.8, sup{a: N^a ≤ N/2}] is also unsuitable for distinguishing long memory from no long memory. Further analysis shows that the GPH method is likewise unsuitable for testing the existence of long memory in time series with bandwidth parameter a ∈ [0.4, 0.5] ∪ [0.7, 0.8]. The judging accuracy curves of the existence test for long memory and no long memory intersect at a = 0.6. To the left of this point, the judging accuracy for long memory is low, and to the right of this point, the judging accuracy for no long memory is low. Therefore, the judging accuracies for no long memory and long memory both reach a relatively high level at this point, which is beneficial to estimating long memory with the GPH algorithm.
In addition, the longer the time series is, the higher the judging accuracy at this point. In order to find the optimal range of the bandwidth parameter a, we plot the judging accuracy curves Pxzzl(a) in Figure 3. When the sequence length is 2000 or more, Pxzzl(a) exceeds 0.75 around a = 0.6.

4.2. The Persistence or Antipersistence Judgment of Long Memory.
As seen in Figure 4, for time series of different lengths, the judging accuracy curves of persistence or antipersistence of long memory share the same parabolic shape. Within the bandwidth parameter range a ∈ [0.5, 0.7], the judging accuracy reaches a relatively high value. Figure 5 gives the comprehensive judging accuracy Pcszl(a) of persistence and antipersistence. Within a ∈ [0.5, 0.7], the judging accuracy increases gradually with the length of the time series, and when the length of the sequence is above 2000, the judging accuracy Pcszl(a) is over 0.9.

Estimation of Long Memory Parameters.
In Figure 6, for time series with different lengths and long memory parameters d, the estimation accuracy curves exhibit a similar shape. Within the bandwidth parameter range a ∈ [0.55, 0.7], the accuracy reaches a relatively high value. Figure 7 gives the average estimation accuracy Pyxm(a) of the GPH algorithm. Within a ∈ [0.55, 0.7], the average estimation accuracy under δ = 0.05 increases gradually with the length of the time series, which indicates that the probability of the estimate d̂ falling into the neighborhood U(d, δ) increases with the sequence length and that the GPH algorithm is effective.
In order to make the optimal parameters a suitable for all three branches of the long memory test, ten bandwidth parameters a corresponding to high judging accuracy of the GPH estimation under each sequence length are recorded as the optimal bandwidth parameter range, as seen in Table 1.
According to Table 1, we take the intersection, i.e., the common part of the different sets, to find the optimal ranges of the bandwidth parameter a in several scenarios. Without considering the sequence length, we can select [0.59, 0.62] as the optimal bandwidth range of the GPH algorithm for the existence test and the persistence or antipersistence judgment of ... the existence test, persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter with sequence length below 1000. For operational convenience, a = 0.6 is recommended as the optimal bandwidth parameter for estimating long memory with the GPH algorithm. Table 2 provides the calculation accuracy of the GPH algorithm for estimating long memory with the bandwidth parameter a = 0.6. P(A_1) (A_1 = 0.6) denotes the calculation accuracy of the GPH algorithm for the existence test of long memory, which is equal to Pxzzl(a). P(A_1 ∩ A_2) (A_1 ∩ A_2 = 0.6) is the calculation accuracy for satisfying the existence test and the persistence or antipersistence judgment of long memory simultaneously. P(A_1 ∩ A_2) is calculated as the ratio of the smaller of the two counts (sequences judged correctly by the existence test and sequences judged correctly by the persistence or antipersistence judgment) to the total number of simulations, which is similar to the calculation of Pcszl(a).
Table 2: Calculation accuracy of GPH long memory estimation under the optimal bandwidth parameter.
As seen from Table 2, P(A_1), P(A_1 ∩ A_2), and P(A_1 ∩ A_2 ∩ A_3) increase gradually with the sequence length. When the sequence length is more than 700, P(A_1) exceeds 0.7, and P(A_1 ∩ A_2) is over 0.7 when the sequence length is more than 1000. However, P(A_1 ∩ A_2 ∩ A_3) is only 0.4822 when the sequence length is 5000, which is mainly caused by the poor accuracy of the GPH algorithm in estimating the long memory parameter. In Figure 7, when the sequence is short, such as 300, the estimation accuracy is only 0.2 under δ = 0.05, implying that about 80% of the estimates d̂ fall outside the neighborhood of the true value d. This result is consistent with the conclusion in [26]. Therefore, the GPH algorithm has certain defects when estimating long memory parameters: the estimation result is effective only when the sequence length is over 10000.

Conclusions
In this paper, we use the Monte Carlo simulation method to generate long memory sequences of different lengths with the ARFIMA(1, d, 1) process, so as to study the performance of the GPH algorithm in the existence test of long memory, the persistence or antipersistence judgment, and the estimation of the long memory parameter. Within the bandwidth parameter range a ∈ [0.5, 0.7], for time series of different lengths, the judging accuracy of the GPH algorithm for the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter all reach a relatively high level. a = 0.6 can be selected as the optimal bandwidth parameter in applications. As the length of the time series increases from 100 to 50000, the accuracy of the GPH algorithm for the existence test of long memory increases from 0.5612 to 0.8786. The calculation accuracy of the GPH algorithm for the persistence or antipersistence judgment of long memory increases from 0.4697 to 0.8673. The calculation accuracy for satisfying the existence test and persistence or antipersistence judgment of long memory increases from 0.0623 to 0.6624. The rules used in the analysis of long memory estimation by the GPH algorithm are developed step by step from the experimental results. The approach is practical and novel, and it can serve as a reference for testing long memory with other methods.

Data Availability
The data used in this study are available on request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.