Selection of Stationarity Tests for Time Series Forecasting Using Reliability Analysis

Stationarity is an essential concept in time series forecasting. A reliable stationarity test that yields unbiased outcomes is vital because it is the gateway to developing a suitable forecasting model. Renewable generation time series are inherently seasonal, contain trend components, and are often volatile. These characteristics, together with time series length, tend to bias the outcomes of stationarity tests. This paper presents a critical comparison study of the tests' reliability using different synthetic data sets, as required for case-by-case analysis. Based on how each test works, reliabilities are analyzed for different time series lengths and group sizes, varying from 200 to 1000 in increments of 200. This shows how the reliabilities of the tests change with time series length or group size. This comprehensive comparison, with a set of well-illustrated figures, a table, and a thorough explanation of the obtained results, is expected to help novice readers select an apt combination of stationarity tests for renewable generation applications.


Introduction
In renewable generation forecasting, stationarity is a crucial notion [1]. As a result, knowing whether a renewable generation time series is effectively stationarized is important [2,3]. A reliable stationarity test that can deliver impartial results for a particular application is therefore necessary. Reliability information for a set of tests would give the user enough confidence to select tests appropriately. The calculation of the power of a test, used for the reliability study, confirms whether the test behavior is ideal for a set of parameters associated with the test. Any deviation from the ideal for specific parameters indicates that the test is unreliable [4]. Thus, a complete reliability record of tests, obtained by analyzing the plot of power vs. test parameters, is crucial for the appropriate selection of tests for a given application.
Reliability analysis through power calculation is well documented in the literature for various tests. Unit root tests examine time series stationarity using the concept of a unit root; the power of some of these tests is calculated in [5] and analyzed for various time series lengths. Similarly, the power calculations in [4,6] for various data distributions, time series lengths, and significance levels expose the limitations of the Mann-Kendall (MK) test. The power calculation of Levene's test is well explained in [7,8], along with a comparison of type-I error probabilities that highlights the test's sensitivity to variance differences, data distributions, and sample sizes. In addition, the power study and error analysis of the Kruskal-Wallis (KW) test with respect to sample sizes and data distribution reveal the test's limitations [9,10]. Power plots have also been computed for the Shapiro-Wilk (SW) and Kolmogorov-Smirnov (KS) tests and analyzed for various data distributions to study the tests' behavior against nonnormal distributions [11-14].
Though the reliability analysis of the above well-established tests is presented with detailed reasoning, a comparative analysis of these tests on a common platform, which is vital for selecting tests for a particular application, has not been performed in the literature. Furthermore, in the above studies, the effect of time series length on the tests' reliability was studied only for a few specific tests. In addition, certain tests, for example, Levene's and KW tests, work by dividing the data samples into groups, which opens up the opportunity to perform the reliability analysis for different group sizes. However, such an analysis has not been carried out in the literature. Moreover, for the KS and SW tests, no reliability analysis has been done with respect to the tests' integral parameters, such as skewness and kurtosis. Lastly, a comparison table of the above tests' merits, demerits, and key application tips for identifying the best set of tests for a particular application is always of interest to novice researchers in time series forecasting.
The authors have considered all the above-highlighted research gaps to provide a unique reliability analysis, and the significant contributions are as follows. Firstly, the importance of power calculation for reliability study is highlighted, and the power calculation steps for the nine well-established tests are represented pictorially. The basis for the stationarity outcome of these tests, alongside their ideal reliability plots, is given special attention. Secondly, five different time series lengths/group sizes are considered for the tests' reliability analysis to compare their reliability performance critically and to expose the reason for a test being reliable or unreliable in a particular case. Finally, considering their merits and demerits, a critical comparison of the tests is tabulated, together with key application tips for each test for a better outcome. The remainder of the paper includes a thorough discussion of the importance of reliability analysis, the power calculation steps, and a comparison of reliability plots.

Reliability Analysis of Well-Established Stationarity Tests
A reliable stationarity test is expected to indicate that a time series is stationary if the time series satisfies the conditions for stationarity. Nevertheless, in some instances, a test declares a stationary time series as nonstationary. The power of a test, defined as "the probability with which a test detects a divergence from the null hypothesis conditional that the divergence exists," helps indicate the test's capability of yielding a fair outcome. For the augmented Dickey-Fuller (ADF), Phillips-Perron (PP), Breitung, MK, Levene's, KW, KS, and SW tests, the power 1 - β is the probability that the test rejects the null hypothesis given that it is actually false [4,5,8,10,13]. Here, β is the probability of accepting the null hypothesis when it is actually false. For the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, the power calculation metric is different because the hypotheses of the KPSS test are reversed compared to those of the other unit root tests [5]. Therefore, the power 1 - α for this test indicates the probability that the test does not reject the null hypothesis given that it is true, where α is the probability of rejecting the null hypothesis when it is actually true. The plots of the power of a test indicate the deviations of the test behavior from the ideal case. The ideal plots of power for all the well-established tests are presented in Figure 1. Furthermore, the basis for the tests' outcomes is highlighted, and the various symbols used in Figure 1 are described underneath.
Power is calculated according to the test properties. For example, unit root tests are designed to detect the presence of a unit root. The unit root can be easily characterized using an AR(1) process. When the AR(1) parameter (φ) is less than 1, the unit root is absent, whereas the unit root is present if φ ≥ 1. In the ideal case, the power should be 1 for φ < 1 and 0 for φ = 1.
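The two power conventions and the ideal unit root curve described above can be made concrete with a minimal Python sketch. The function names, the `null_is_stationary` flag, and the sample rejection flags are illustrative assumptions and do not come from the paper.

```python
def empirical_power(rejected, null_is_stationary=False):
    """Estimate power from per-replication rejection flags (True = H0 rejected).

    null_is_stationary=False: power = 1 - beta, the share of replications in
    which a false null hypothesis is rejected (ADF-style convention).
    null_is_stationary=True: power = 1 - alpha, the share of replications in
    which a true null hypothesis is not rejected (KPSS-style convention).
    """
    if not rejected:
        raise ValueError("need at least one replication")
    share_rejected = sum(1 for r in rejected if r) / len(rejected)
    return 1.0 - share_rejected if null_is_stationary else share_rejected


def ideal_unit_root_power(phi):
    """Ideal curve from the text: power 1 for phi < 1, power 0 at a unit root."""
    return 1.0 if phi < 1.0 else 0.0


outcomes = [True, True, False, True]  # hypothetical rejection flags
print(empirical_power(outcomes))                               # 0.75
print(empirical_power(outcomes, null_is_stationary=True))      # 0.25
print(ideal_unit_root_power(0.5), ideal_unit_root_power(1.0))  # 1.0 0.0
```

Comparing an estimated curve against `ideal_unit_root_power` over a grid of φ values is one simple way to see where a test departs from the ideal behavior.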
The MK test confirms time series stationarity by detecting the presence of a monotonic trend. Hence, for any value of its slope (s) other than 0, the test should reject the null hypothesis.
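As a rough illustration of how the power of the MK test could be estimated for a given slope, the sketch below implements the standard two-sided MK statistic with the normal approximation (no tie correction) and counts rejections over simulated series. The data model (a linear trend plus standard Gaussian noise), the function names, and the default settings are assumptions made here for illustration; they are not the paper's simulation setup.

```python
import math
import random


def mann_kendall_p_value(x):
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = x[j] - x[i]
            s += (diff > 0) - (diff < 0)  # sign of each pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity-corrected standard normal score.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def mk_power(slope, length, replications=200, alpha=0.05, seed=0):
    """Share of simulated series for which the MK test rejects 'no trend'."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        # Assumed data model: linear trend plus unit-variance Gaussian noise.
        series = [slope * t + rng.gauss(0.0, 1.0) for t in range(length)]
        if mann_kendall_p_value(series) < alpha:
            rejections += 1
    return rejections / replications


for s in (0.0, 0.02, 0.1):
    print(s, mk_power(s, length=40))
```

Under this assumed model, the rejection rate at slope 0 estimates the type-I error rate, while the rates at nonzero slopes estimate the power 1 - β.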

Power Calculation for Stationarity Tests
Time series length "T" or group size "G" tends to affect stationarity tests' outcomes [5,7]. Therefore, their impact on the well-established tests' reliability is always of interest. Since power calculation is essential for reliability analysis, Figure 2 systematically lays out the power calculation steps for the well-established tests.

Power Calculation for Unit Root Tests.
The power of a unit root test for a specific value of φ can be calculated by following the steps in Figure 2. Repeating the calculation for different values of φ ranging from 0 to 1 [5], that is, 0.01, 0.02, . . ., 0.99, yields a grid of power values corresponding to the set of φ. The plots of power vs. φ for various values of "T" show how closely a test complies with the ideal behavior, confirming the test's reliability for a particular value of "T." This approach is suitable for unit root tests, and hence, the ADF, PP, KPSS, and Breitung tests can be analyzed using this approach.
To provide a detailed comparison report of the reliability of well-established stationarity tests and to reveal the pertinent issues, the analyses carried out in this section are twofold, as listed underneath.
(i) Firstly, the powers of all the well-established tests for different time series lengths/group sizes are compared and critically analyzed.
(ii) Next, the merits and demerits of the well-established tests are compared, and tips for better test outcomes are suggested.
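The grid procedure described for unit root tests (simulate an AR(1) series for each φ, apply the test, and record the share of rejections over the replications) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it substitutes a plain Dickey-Fuller regression without constant or lagged differences for the ADF test, uses -1.95 as an approximate 5% critical value for that case, and all function names and defaults are hypothetical.

```python
import math
import random

DF_CRITICAL_5PCT = -1.95  # approximate 5% critical value, no-constant case


def simulate_ar1(phi, length, rng):
    """Generate y_t = phi * y_{t-1} + e_t with standard Gaussian noise, y_0 = 0."""
    y = [0.0]
    for _ in range(length - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y


def dickey_fuller_t(y):
    """t-statistic of rho in dy_t = rho * y_{t-1} + e_t (no constant, no lags)."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    sxx = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, diffs)) / sxx
    residual_ss = sum((d - rho * v) ** 2 for v, d in zip(lagged, diffs))
    sigma2 = residual_ss / (len(diffs) - 1)
    return rho / math.sqrt(sigma2 / sxx)


def power_curve(phis, length, replications=300, seed=0):
    """Map each phi to the share of replications that reject the unit root null."""
    rng = random.Random(seed)
    curve = {}
    for phi in phis:
        rejected = sum(
            dickey_fuller_t(simulate_ar1(phi, length, rng)) < DF_CRITICAL_5PCT
            for _ in range(replications)
        )
        curve[phi] = rejected / replications
    return curve


curve = power_curve([0.5, 0.9, 1.0], length=200)
for phi, power in curve.items():
    print(phi, power)
```

The example uses a coarse grid and 300 replications only to keep the run short; the study itself uses φ = 0.01, 0.02, . . ., 0.99 and R = 3000, and the same loop can be repeated for each value of "T."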

Analysis of Power Values of Stationarity Tests.
To inspect the tests' reliability, the primary task is to determine the number of replications, that is, the value of "R" (refer to Figure 2). For all the analyses in the simulation study, the value of "R" is set to 3000. The steps suggested in Figure 2 are followed to construct the reliability plots. For five randomly chosen values of "T/G," the comparison of reliability plots for the nine tests is portrayed in Figures 3-6. The authors in [5] carried out a reliability analysis considering only a few selected stationarity tests. The time series lengths chosen for that analysis were also very small for real-time applications. Therefore, this work includes other established unit root tests, such as the Breitung test, and the time series lengths considered for the analysis are suitable for real-time applications. Similar problems are associated with the MK test. Furthermore, Levene's and KW tests are analyzed with relatively small and large groups to detect any vital changes in the results. The KS and SW tests have been analyzed previously for various data distributions in the literature [12-14]. However, that approach does not provide useful information about the tests' reliabilities based on their working; instead, it provides information on the usability of these tests for various distributions. Therefore, a novel approach is used for the KS and SW tests, in which reliability is analyzed for changing skewness and kurtosis values; this reveals possible discrepancies in the tests' outputs arising from their basic working of detecting whether the data are normally distributed. The ensuing paragraphs critically discuss the reliability performance of unit root and nonunit root tests.
For the ADF test, the power values start approaching zero at a lower value of φ for smaller lengths, and the power increases as the length increases (refer to Figure 3). However, no such pattern is seen for the KPSS test. The KPSS test power begins to fall to zero for T = 800 at the lowest value of φ compared to other lengths, whereas the power plot for T = 200 begins to fall at the highest φ value. Overall, the performance of the test is better for shorter lengths. Hence, the KPSS test is suggested to be used along with other tests because of its tendency to commit type-I errors frequently, which leads to discrepancies in the obtained power values. The PP test power values follow a trend similar to that of the ADF test, but for shorter lengths the PP test is significantly more reliable than the ADF test. This is because the former uses nonparametrically adjusted test statistics. For the Breitung test, the output is the same for all lengths except T = 1000. The power for T = 1000 falls by a small amount for φ = 0.98, while for other lengths the values do not approach zero even for φ = 1. Here, the test statistic is based on that of the KPSS test, with specific changes made to counter the disadvantages of the KPSS test. Although the power problem of the KPSS test is solved, the issue of the power value not approaching zero for φ = 1 arose; that is, while solving the type-I error problem of the KPSS test, the test statistic of the Breitung test became prone to type-II error.
The MK test has very low reliability (refer to Figure 4) due to frequent type-I errors. The test formulation cannot differentiate between a trend effect and general highs and lows in the data. Levene's test is completely reliable, as the plots obtained are analogous to the ideal plot (refer to Figure 5). The KW test fails to detect a very small difference in mean values between groups, and hence, biased results are seen. This bias is prominent for smaller group sizes. For G = 200, power values begin to rise at a higher m value than for G = 1000 (refer to Figure 5). Reliability for the SW and KS tests is checked against various c1 and c2 values. The SW test performs better than the KS test in both aspects, but a very high rate of type-I errors in the SW test results in biased power values with respect to c2. The KS test is designed to be sensitive to every form of difference between two distributions, leading to low power.
Figure 2: Step-by-step procedure for power calculation of stationarity tests.

Conclusion
The objective of this paper was to compare and critically study nine well-established time series stationarity tests with respect to reliability and to assist the reader in selecting tests for a given application. The tests' reliability was characterized using a metric known as power, and the inferences from the reliability plots were examined. Furthermore, the merits and demerits of the tests were compared. Suggestions for the tests' direct application with apt setting(s), along with information on other pertinent aspects, are expected to help novice readers build accurate forecasting models.
highly reliable in handling inherent renewable generation time series components, such as trend, seasonality, and volatility. The Breitung test suitably solves the reliability problem of the PP test for lower time series lengths, while the PP test solves the same problem of the Breitung test for higher time series lengths. However, the above two unit root tests are incapable of detecting seasonality and volatility effects in a time series. Therefore, their suitable hybridization with a nonunit root test, such as Levene's test, can solve the above limitations. Levene's test's inability to detect the trend component can, in turn, be resolved by the above two unit root tests.

Data Availability
The figures and tables used to support the findings of this study are included in the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.