A New Nonparametric Multivariate Control Scheme for Simultaneous Monitoring Changes in Location and Scale

Real-time monitoring of the breast cancer index is becoming increasingly important because it can support advances in the diagnosis and treatment of breast cancer. In modern medical processes, simultaneously monitoring changes in the location and scale of observations is convenient for implementing control schemes but can be challenging. In this paper, we consider a new nonparametric control scheme for monitoring location and scale parameters in multivariate processes. The proposed method is easy to implement, and its performance is discussed. We then compare the proposed scheme with some competing methods. Simulation results show that the proposed scheme can efficiently detect a range of shifts. The proposed chart can trigger an alert and detect a change in the breast cancer index in a timely manner.

Nonparametric control charts are important in the manufacturing and service sectors when observations are nonnormal. Some control schemes are used to monitor high-dimensional processes when little is known about the underlying distribution [39][40][41][42]. Most control schemes are designed to monitor location parameters. For example, Liu and Singh [43] introduced several multivariate rank tests based on data depth. Liu [44] used the concept of data depth to propose several new control charts for monitoring multivariate processes. Data depth provides an efficient metric of process performance without parametric assumptions. In addition, Zou et al. [45] provided a multivariate spatial rank for monitoring high-dimensional processes with unknown parameters. For detecting location changes in nonparametric multivariate processes, we also recommend the discussions in [46,47]. To detect changes in the location and scale of observations simultaneously, several monitoring methods have been proposed in the literature, including those of Mukherjee and Chakraborti [48] and Chowdhury et al. [49]. Recently, Mukherjee and Marozzi [50] considered the sum of the squares of the standardized Wilcoxon and Bradley statistics for monitoring high-dimensional processes with unknown parameters, which is advantageous for the simultaneous monitoring of multiple aspects.
Some schemes have also been proposed to monitor changes in location and scale simultaneously using a single chart, and the performance advantages of these charts have been clearly established [51]. Lepage [52] discussed a nonparametric two-sample test for location and dispersion. Based on Lepage [52], Mukherjee and Marozzi [51] introduced new circular-grid charts for simultaneous monitoring of process location and process scale based on Lepage-type statistics. Meanwhile, Mukherjee and Marozzi [53] investigated a new single distribution-free Phase-II CUSUM procedure based on the Cucconi statistic for simultaneously monitoring changes in the location and scale parameters of a process. In addition, Mukherjee and Sen [54] discussed a distribution-free (nonparametric) Shewhart-Lepage scheme for simultaneous monitoring of location and scale parameters using an adaptive strategy. Li et al. [55] and Shi et al. [56] provided powerful control schemes for simultaneously monitoring the location and scale parameters of any continuous process. Moreover, Zafar et al. [57] proposed a new parametric memory-type charting structure based on the progressive mean under the max statistic for the joint monitoring of location and dispersion parameters. Song et al. [58] introduced distribution-free adaptive Shewhart-Lepage-type schemes for simultaneous monitoring of location and scale parameters using information about the symmetry and tail weights of the process distribution. Huang et al. [59] proposed a new statistical process monitoring scheme with a double-sampling plan for simultaneously monitoring location and scale shifts. Bai and Li [60] considered the monitoring of ordinal categorical factors, accounting for shifts in the location or scale parameters of latent variables. For multivariate processes, Cheng and Shiau [61] proposed a distribution-free phase I monitoring scheme for both location and scale parameters based on the multisample Lepage statistic.
Although the literature contains many control schemes for monitoring location and scale parameters simultaneously, much less attention has been paid to control strategies that do so in multivariate processes. In this study, we propose a useful and easy-to-implement control scheme for simultaneously monitoring location and scale parameters, which is based on nonparametric location and scale hypothesis testing. Reference samples are denoted as phase I data streams, and test samples are denoted as phase II data streams. One problem is that the size of phase II increases with the number of data streams. To address this issue, we perform hypothesis testing repeatedly with each new data stream, so that the amount of phase II data is constant at each acquisition time.
The remainder of this paper is organized as follows: In Section 2, we review nonparametric hypothesis testing in detail. In Section 3, we propose a new scheme based on a hypothesis testing statistic for monitoring location and scale parameters and discuss the proposed method's performance and validity. In Section 4, we compare the proposed chart with other existing charts through simulation. In Section 5, breast cancer data are investigated to illustrate the performance of the proposed chart. Lastly, we briefly draw conclusions in Section 6.

Review of Nonparametric Hypothesis Testing
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. Consider a reference sample $\{X_{1,t}, X_{2,t}, \cdots, X_{m,t}\}$ of size $m$ and a test sample $\{Y_{1,t}, Y_{2,t}, \cdots, Y_{n,t}\}$ of size $n$. We test the null hypothesis $H_0: \mu_1 = \mu_2, \sigma_1^2 = \sigma_2^2$ against the alternative hypothesis $H_1: \mu_1 \neq \mu_2$ or $\sigma_1^2 \neq \sigma_2^2$, where $\mu_1$ is the location parameter of the reference sample; $\mu_2$ is the location parameter of the test sample; and $\sigma_1^2$ and $\sigma_2^2$ are the scale parameters of the reference and test samples, respectively. A suitable statistical decision procedure is used to decide whether to reject the null hypothesis $H_0$. In real situations, it is difficult to identify the exact distribution of data streams. Therefore, nonparametric hypothesis testing, which makes no assumption about the distribution of the original data, is also used. For hypothesis testing about the location parameter, Mood [62] proposed the median test, which is based on the rank of each datum. Considering the interaction between the reference and test samples, Wilcoxon [63] and Mann and Whitney [64] introduced the Mann-Whitney-Wilcoxon statistic. In addition, rank-based nonparametric hypothesis tests for the scale parameter are used in the literature [65][66][67].

Methods for Location Detection.
In practice, one often needs to check whether the location parameter of a process has changed. The t-statistic is commonly used under the assumption that the distribution is normal. However, using the t-statistic is risky when the population distribution is unknown. Thus, some distribution-free statistics have been developed. The Brown-Mood median test is a useful nonparametric method. However, the two-sided test does not yield satisfactory results when $m \neq n$. To use more information about the relative sizes of the reference sample and the test sample, the Wilcoxon rank-sum test was developed. We assume that a reference sample of size $m$ and a test sample of size $n$ are given, and we let $N = m + n$. Considering the pooled sample $\{X_{1,t}, X_{2,t}, \cdots, X_{m,t}, Y_{1,t}, Y_{2,t}, \cdots, Y_{n,t}\}$ at time $t$, Mann and Whitney [64] developed the Mann-Whitney statistic as follows: Therefore, the Wilcoxon rank-sum statistic is where Computational and Mathematical Methods in Medicine N + 1)/2. It can be seen that [68] Under the null hypothesis, we also calculate the approximate normal statistic when the pooled sample size $N$ is sufficiently large.
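The Mann-Whitney and Wilcoxon rank-sum statistics described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, ties are counted as one half (an assumption, since the data are taken to be continuous), and the standardization uses the null mean $mn/2$ and variance $mn(N + 1)/12$ that appear in the S-W chart statistic $Z_{W,t}$ of Section 3.

```python
import math


def mann_whitney_count(x, y):
    """Count pairs (x_i, y_j) with x_i < y_j; a tie contributes one half (assumption)."""
    return sum((xi < yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def wilcoxon_rank_sum(x, y):
    """Sum of the ranks of the test sample y in the pooled sample.

    Uses the standard identity: rank sum = Mann-Whitney count + n(n + 1)/2.
    """
    n = len(y)
    return mann_whitney_count(x, y) + n * (n + 1) / 2


def standardized_w(x, y):
    """Standardize the Mann-Whitney count with its null mean and variance."""
    m, n = len(x), len(y)
    N = m + n
    w1 = mann_whitney_count(x, y)
    return (w1 - m * n / 2) / math.sqrt(m * n * (N + 1) / 12)


# Illustrative data (not from the paper): reference sample x, test sample y.
x = [1.2, 0.4, 2.5, 1.9, 0.7]
y = [3.1, 2.8, 1.5]
print(mann_whitney_count(x, y), wilcoxon_rank_sum(x, y), standardized_w(x, y))
```

In this toy example the test sample sits mostly above the reference sample, so the standardized value is positive; a chart would signal only if it exceeded the chosen control limit.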

Methods for Scale Detection.
A location parameter describes the position of a distribution, and a scale parameter is another important characteristic of a distribution. When the distribution of observations is unknown, distribution-free methods are typically used. Consider two independent samples, $\{X_{1,t}, X_{2,t}, \cdots, X_{m,t}\} \sim F(\mu_1, \sigma_1^2)$ and $\{Y_{1,t}, Y_{2,t}, \cdots, Y_{n,t}\} \sim F(\mu_2, \sigma_2^2)$. We assume that the location parameters of the two samples are equal ($\mu_1 = \mu_2$). Based on the Mann-Whitney statistic, Siegel and Tukey [65] proposed the Siegel-Tukey statistic. The implementation of this statistic consists of the following steps: (1) mix the two samples $\{X_{1,t},$ ... Mood [62] also provided a useful test statistic for scale parameters. As before, we consider two sequences $\{X_{1,t}, X_{2,t}, \cdots, X_{m,t}\} \sim G(\mu_1, \sigma_1^2)$ and $\{Y_{1,t}, Y_{2,t}, \cdots, Y_{n,t}\} \sim G(\mu_2, \sigma_2^2)$, where $\mu_1 = \mu_2$. The Mood statistic can be described as follows: where $R_{i,t}$ is the rank of $Y_{i,t}$, $i = 1, 2, \cdots, n$, in the pooled sample $\{X_{1,t}, X_{2,t}, \cdots, X_{m,t}, Y_{1,t}, Y_{2,t}, \cdots, Y_{n,t}\}$ of size $N (= m + n)$. For $m, n \to +\infty$ and $m/N \to$ constant $C$. Additionally [68], Fligner and Killeen [69] also introduced a test statistic for scale parameters that is based on the absolute rank. The statistic is defined as rank-sum statistic under the null hypothesis. Therefore,
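As a rough illustration of the Mood scale statistic, the sketch below uses its standard textbook form: the sum of squared deviations of the test-sample ranks $R_{i,t}$ from the mid-rank $(N + 1)/2$, with null mean $n(N^2 - 1)/12$ and null variance $mn(N + 1)(N^2 - 4)/180$. These expressions are assumptions in the sense that the displayed formulas are not present in this text; the function names are hypothetical, and ties are assumed absent.

```python
import math


def ranks_in_pooled(x, y):
    """Ranks of each y_i in the pooled sample (assumes no tied values)."""
    pooled = sorted(list(x) + list(y))
    return [pooled.index(v) + 1 for v in y]


def mood_statistic(x, y):
    """Sum of squared deviations of the test-sample ranks from (N + 1)/2."""
    N = len(x) + len(y)
    return sum((r - (N + 1) / 2) ** 2 for r in ranks_in_pooled(x, y))


def standardized_mood(x, y):
    """Standardize the Mood statistic with its standard null mean and variance."""
    m, n = len(x), len(y)
    N = m + n
    mean = n * (N * N - 1) / 12
    var = m * n * (N + 1) * (N * N - 4) / 180
    return (mood_statistic(x, y) - mean) / math.sqrt(var)


# Illustrative data (not from the paper).
x = [1.2, 0.4, 2.5, 1.9, 0.7]
y = [3.1, 2.8, 1.5]
print(mood_statistic(x, y), standardized_mood(x, y))
```

Large values indicate that the test-sample ranks lie far from the middle of the pooled ranking, which is what an increase in scale would produce when the locations are equal.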

Proposed Monitoring Strategy
We assume that there are $m$ independent observations from an unknown multivariate continuous distribution with dimensionality $p$. We assume that the independent observations $X_i$ follow the model below: where $\mu_0$ and $\mu_1$ are the in-control (IC) and out-of-control (OC) location vectors, respectively; $\Sigma_0$ and $\Sigma_1$ are the IC and OC covariance matrices, respectively, with $(\mu_0, \Sigma_0) \neq (\mu_1, \Sigma_1)$; $\tau$ is an unknown change point; and $G_p(\cdot)$ is an unknown continuous distribution function. In phase I, we assume that the IC sample of size $m$ is given at time After the phase I sample $R$ is analyzed, the phase II sample $T$ is monitored. Inspired by Mukherjee and Marozzi [50] for multivariate processes, we consider the $p$-dimensional statistic given by the Euclidean distance between new observations and the mean vector of the phase I data, Then, a Shewhart-type chart for monitoring location changes that is based on the Wilcoxon rank-sum statistic (i.e., the S-W chart) can be constructed. The statistic of the S-W chart is $Z_{W,t} = (W_{1,t} - mn/2)/\sqrt{mn(N + 1)/12}$ with upper control limit (UCL) and lower control limit (LCL) where $L$ is an unknown constant.
We use the average run length (ARL) to evaluate the performance of these methods. The ARL is the average number of points plotted on a control chart before the chart signals an OC condition. If the process is IC, $\mathrm{ARL}_0 = 1/\alpha$; if the process is OC, $\mathrm{ARL}_1 = 1/(1 - \beta)$, where $\alpha$ is the probability of a type I error and $\beta$ is the probability of a type II error. Therefore, we typically fix the IC ARL, denoted as $\mathrm{ARL}_0$, and compare the OC ARL, denoted as $\mathrm{ARL}_1$; a smaller $\mathrm{ARL}_1$ is better. Figure 1 shows the OC ARL of the S-ST, S-MD, and S-FK charts, three Shewhart-type schemes for detecting changes in scale parameters. We let $m = 50$, $n \in \{5, 10, 20\}$, and $p = 4$ under the multivariate Gaussian distribution with expectation $\mu_0$ and covariance matrix $\Sigma_0$. For a fair comparison, we set $\mathrm{ARL}_0 = 500$ for all control schemes. Figure 1 shows that the S-MD chart performs better than the other charts when detecting a range of scale shifts.
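The relationship between the per-point signal probability and the ARL can be checked with a short simulation. The sketch below assumes that points are plotted independently and that each point signals with the same probability, so the run length is geometric; the function names are hypothetical.

```python
import random


def arl_from_signal_prob(p_signal):
    """ARL of a Shewhart-type chart whose points signal independently with probability p_signal.

    In control, p_signal = alpha; out of control, p_signal = 1 - beta.
    """
    return 1.0 / p_signal


def simulate_arl(p_signal, n_runs=20000, seed=0):
    """Monte Carlo estimate of the ARL: average number of points up to the first signal."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        run_length = 1
        while rng.random() >= p_signal:  # no signal at this point, plot another one
            run_length += 1
        total += run_length
    return total / n_runs


print(arl_from_signal_prob(0.002))  # IC ARL for alpha = 0.002
print(simulate_arl(0.05))           # simulated ARL for a signal probability of 0.05
```

With $\alpha = 0.002$ the formula gives the value $\mathrm{ARL}_0 = 500$ used for the comparison above; the simulated estimate for a signal probability of 0.05 should be close to 20.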
Calculating the Mahalanobis distance requires the sample size to exceed the dimension of the data; otherwise, the sample covariance matrix is singular and its inverse does not exist. Thus, the Mahalanobis distance sometimes fails to meet practical requirements. It is also not appropriate to simply use the Euclidean distance to reduce the dimensionality of high-dimensional data, because doing so treats differences in different data attributes (i.e., the dimensions of each index or variable) as equivalent. The standardized Euclidean distance is an improvement that overcomes this shortcoming of the simple Euclidean distance. Since the distribution of each component of the data is different, the first step is to standardize each component so that all components have equal mean and variance.
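A minimal sketch of the standardized Euclidean distance is given below. It assumes that each component is centered and scaled by the phase I sample mean and sample standard deviation of that component, which is one common way to standardize; the exact form used in the paper is not shown in this text, and the function name is hypothetical.

```python
import math
import statistics


def standardized_euclidean(obs, reference):
    """Distance between a p-dimensional observation and the reference mean vector.

    Each component is divided by the reference standard deviation of that
    component, so that variables on different scales contribute comparably.
    """
    p = len(obs)
    columns = [[row[j] for row in reference] for j in range(p)]
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return math.sqrt(sum(((obs[j] - means[j]) / sds[j]) ** 2 for j in range(p)))


# Illustrative data: the second variable is on a scale ten times larger than the first.
reference = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]
obs = [4.0, 50.0]
print(standardized_euclidean(obs, reference))
```

In this example the plain Euclidean distance from the reference mean would be dominated by the second variable, whereas after standardization both variables contribute one standard deviation each.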
Mukherjee and Marozzi [50] considered the sum of the squares of the standardized Wilcoxon and Bradley statistics for monitoring high-dimensional processes with unknown parameters. Inspired by Mukherjee and Marozzi [50], we combine the ideas of control schemes and hypothesis testing to propose an effective control scheme that simultaneously monitors expectation and variance. Based on this analysis, we propose an alternative control scheme, whose statistic is with The term asymptotic distribution is used in the sense of convergence in law when $m \to \infty$ and $n \to \infty$ with the ratio $m/N$ constant [52]. Under $H_0$, the statistics $Z_{W,t}$ and $Z_{MD,t}$ are uncorrelated for all $m$ and $n$. Since, for all $m$ and $n$, Thus, we have Equality (14) is the product of $E(W_{2,t} \mid H_0)$ and $E(MD_t \mid H_0)$. Therefore, It is obvious that Under $H_0$, $Z_{W,t} = (W_{1,t} - E(W_{1,t} \mid H_0))/\sqrt{\mathrm{Var}(W_{1,t} \mid H_0)} \to N(0, 1)$ and $Z_{MD,t} = (MD_t - E(MD_t \mid H_0))/\sqrt{\mathrm{Var}(MD_t \mid H_0)} \to N(0, 1)$ as $m \to \infty$ and $n \to \infty$ with the ratio $m/N$ constant.
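The combined statistic is not displayed in this text. One plausible reading, by analogy with the sum-of-squares construction of Mukherjee and Marozzi [50], is $Z_{W,t}^2 + Z_{MD,t}^2$; the sketch below implements that reading only as an assumption. The inputs are taken to be univariate values (for example, distances of each observation from the phase I mean vector), the function names are hypothetical, and ties are assumed absent.

```python
import math


def pooled_ranks_of_test(ref, test):
    """Ranks of the test values in the pooled sample (assumes no tied values)."""
    pooled = sorted(list(ref) + list(test))
    return [pooled.index(v) + 1 for v in test]


def combined_statistic(ref, test):
    """Assumed form of the charting statistic: Z_W^2 + Z_MD^2."""
    m, n = len(ref), len(test)
    N = m + n
    ranks = pooled_ranks_of_test(ref, test)

    # Location part: Mann-Whitney count recovered from the rank sum, then standardized.
    w1 = sum(ranks) - n * (n + 1) / 2
    z_w = (w1 - m * n / 2) / math.sqrt(m * n * (N + 1) / 12)

    # Scale part: Mood statistic with its standard null mean and variance.
    md = sum((r - (N + 1) / 2) ** 2 for r in ranks)
    md_mean = n * (N * N - 1) / 12
    md_var = m * n * (N + 1) * (N * N - 4) / 180
    z_md = (md - md_mean) / math.sqrt(md_var)

    return z_w ** 2 + z_md ** 2


# Illustrative data (not from the paper).
ref = [1.2, 0.4, 2.5, 1.9, 0.7]
test = [3.1, 2.8, 1.5]
print(combined_statistic(ref, test))
```

Under this reading the chart would signal when the statistic exceeds an upper control limit, which in practice would be calibrated by simulation to reach the target IC ARL.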

Performance Evaluation
In this section, we compare the performance of these charts for different reference sample sizes $m$ and test sample sizes $n$ when shifts occur. We assume that the $t$th future observation, $X_t$, is collected over time using the following multivariate model: where $\mu_0 = (0, 0, 0, 0)$, $\mu_1 = (0, 0, \delta, \delta)$, and $\Sigma_0$ is the $4 \times 4$ identity matrix. We let $\tau = 50$ and dimensionality $p = 4$. Table 2 shows the OC ARL of these charts. Table 3 presents the OC ARL of these charts when the variables are correlated: $X_t \sim N_p(\mu_0, \Sigma_2)$, for $t = 1, 2, \cdots, \tau$, Results for Weibull-type distributional changes are shown in Table 4, where $\mathrm{Weibull}(\theta_1, \theta_2)$ represents the Weibull distribution with shape parameter $\theta_1$ and scale parameter $\theta_2$. The IC distribution is $\mathrm{Weibull}(1, 1)$, and the OC distribution is $\mathrm{Weibull}(1, 1 + \delta)$. We also consider three types of general changes (multivariate t with 3 df, multivariate exponential, and multivariate gamma distributions) in Table 5. Tables 2-5 show that the proposed method performs well for detecting a range of shifts.
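The location-shift scenario described above can be simulated with a few lines of Python. This is only a sketch under stated assumptions: the model equation is not displayed in this text, so the code assumes that observations up to the change point $\tau$ are drawn from $N_4(\mu_0, \Sigma_0)$ and later observations from $N_4(\mu_1, \Sigma_0)$, with $\Sigma_0$ the identity matrix so that components are independent. The function name and defaults are hypothetical.

```python
import random


def generate_stream(total, tau=50, delta=1.0, seed=0):
    """Generate `total` four-dimensional observations with a mean shift after time tau.

    Assumed model: N(mu0, I) for t <= tau and N(mu1, I) for t > tau,
    where mu0 = (0, 0, 0, 0) and mu1 = (0, 0, delta, delta).
    """
    rng = random.Random(seed)
    mu0 = [0.0, 0.0, 0.0, 0.0]
    mu1 = [0.0, 0.0, delta, delta]
    stream = []
    for t in range(1, total + 1):
        mu = mu0 if t <= tau else mu1
        # Identity covariance: each component is an independent unit-variance normal.
        stream.append([rng.gauss(mu_j, 1.0) for mu_j in mu])
    return stream


stream = generate_stream(total=200, tau=50, delta=1.0)
print(len(stream), len(stream[0]))
```

Feeding such a stream through a chart and recording how many post-change points are plotted before the first signal, averaged over many replications, gives a simulated OC ARL.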

Data Source.
To illustrate the proposed method, we analyze a real clinical case. Samples arrive periodically as Dr. [70][71][72][73]. In this work, we aim to monitor the Breast Cancer Wisconsin Data Set and identify whether there is a shift in the process. Figure 2 highlights that the normality assumption is invalid, which leads us to reject the null hypothesis that the data are normally distributed. Thus, we use the proposed distribution-free control scheme to monitor the breast cancer data.
We let $m = 100$ and $n = 5$. We use observations 1-350 as IC data to find the control limits of the S-W chart, the S-MD chart, and the proposed chart. For a fair comparison, the IC ARL of all control charts is set equal to 400, and the remaining 249 breast cancer observations are monitored. The S-W and S-MD charts for the monitored breast cancer data are shown in Figure 3, which indicates that the S-W chart produces a false alarm when the process is IC; conversely, the S-MD chart produces no OC signal when the process is OC. Figure 4 shows the proposed chart for the breast cancer data; its statistic falls outside the control limits after 353 observations. Compared with the S-W and S-MD charts, the proposed chart detects the shift earlier and more accurately.

Conclusions and Discussion
This paper provides a new control scheme for detecting location and scale changes. Inspired by Mukherjee and Marozzi [50], we proposed an effective control chart that simultaneously monitors changes in both location and scale. The Breast Cancer Wisconsin Data Set is analyzed using the proposed method. Spectral analysis is also reviewed and conducted to investigate the periodicities of shorter time series, and then, nonlinear least squares fitting is used for fitting analysis. The real-data example shows that the proposed scheme performs well in detecting process changes. In this study, we mainly considered the standardized Euclidean distance to reduce the dimensionality of high-dimensional data; other methods of dimensionality reduction still need to be investigated in more detail.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.