Improved Kaplan-Meier Estimator in Survival Analysis Based on Partially Rank-Ordered Set Samples

This study presents a novel methodology for the nonparametric estimation of a survival probability under random censoring, using the ranked observations from a Partially Rank-Ordered Set (PROS) sampling design, and employs it in a hematological disorder study. The PROS sampling design has numerous applications in medicine, social sciences, and ecology, where the exact measurement of the sampling units is costly but the units can be ordered by judgment ranking or available concomitant information. General estimation methods are not directly applicable to samples from rank-based sampling designs, because the sampling units do not meet the identically distributed assumption. We derive the asymptotic distribution of a Kaplan-Meier (KM) estimator under the PROS sampling design. Finally, we compare the performance of the suggested estimators via several simulation studies and apply the proposed methods to a real data set. The results show that the proposed estimator under rank-based sampling designs outperforms its counterpart based on a simple random sample (SRS).


Introduction
The idea of ranked set sampling (RSS) was first introduced by McIntyre [1]. It provides a more structured method for collecting the sample units. A generalization of RSS is the PROS sampling design. The two designs are similar, with one clear difference: in the PROS sampling design that we use in this paper, the ranker divides the sampling units into ranked subsets of prespecified sizes based on their partial ranks [2]. These sampling designs are techniques to obtain more representative samples from the underlying population when measurement of the units is costly and/or time-consuming. In such sampling designs, sampling units are ordered fairly accurately by using available auxiliary information, which may itself be costly to some extent (see [3]).
After the PROS sampling design was introduced by Ozturk [4], many statisticians became interested in this rank-based sampling method. For example, Ozturk [5] and Frey [6] have relaxed the assumption concerning the prespecification of the number of subsets in each set. Nazari et al. [7] have developed nonparametric kernel density estimators using PROS data. Hatefi et al. [3] have applied PROS sampling in mixture modeling to estimate the age structures of short-lived fish species. Ozturk [8] has used the properties of PROS samples under multiple auxiliary information in the estimation of the population mean and total in finite population settings. Nazari et al. [9] have estimated the distribution function using PROS samples. Hatefi et al. [10] have studied the information and uncertainty structures of PROS data.
Currently, survival analysis is one of the important statistical tools for analyzing data from medical studies and the social sciences. The presence of censored observations distinguishes survival analysis from other statistical analyses (see [11]). However, survival studies are expensive because they need a large sample size and a potentially long follow-up duration [12]. For the sake of parsimony, we may consider cost-effective sampling methods, in which only a small proportion of the available units is measured but the measured units carry a portion of the information contributed by all of the units; for more information, see [13].
In this study, we develop the KM nonparametric estimator using the PROS sampling design. The KM estimator measures the probability that a person survives longer than a specific time, which is fundamental in survival analysis. We study the asymptotic properties of this new estimator and compare it with its SRS and RSS counterparts. What distinguishes the present research from previous work is that we employ the PROS sampling design for incomplete data containing censored observations, while all previous research on the PROS sampling design has been concerned with inference for complete data. Only a few results are available for incomplete data, and they are based on RSS rather than PROS samples. For example, Yu and Tam [14] have considered maximum likelihood estimation of the parameters of the log-normal distribution and have introduced a KM estimator for RSS. Zhang et al. [15] have used RSS to construct the KM estimator of a reliability function with random right-censored data where the population distribution is unknown. Strzalkowska-Kominiak and Mahdizadeh [13] have proposed a KM estimator based on RSS when censored data are under a random detection limit assumption. Mahdizadeh and Strzalkowska-Kominiak [16] have proposed a confidence interval for a distribution function when data are right-censored with random censoring time by applying the RSS design.
In Section 2, we present some preliminaries. In Section 3, we introduce the nonparametric KM estimator. In Section 4, we show the asymptotic normality of the KM estimator based on the imperfect PROS sampling design. In Section 5, we compare the performance of the PROS KM estimator with its SRS and RSS counterparts using simulation studies. In Section 6, we illustrate our proposed method with a real example, treating a dataset collected at the Amir Medical Oncology Center as our population.

Necessary Background
2.1. Ranked Set Sampling. To obtain an RSS of size nL, with set size n and L cycles, a set of n units is randomly selected from the underlying population and the units are ranked by some mechanism. The unit ranked as the smallest is selected for the final measurement. Another set of n units is then drawn and ranked, and the unit ranked as the second smallest is selected for measurement. This process continues until, in the nth set, the unit ranked as the largest is selected and measured. This is one cycle of the RSS procedure; the cycle is repeated L times to generate an RSS of size nL (see [17]).
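The RSS procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rss_sample`, the `rank_key` argument, and the choice to draw each set without replacement from a finite list are hypothetical, and ranking is taken to be perfect with respect to `rank_key`.

```python
import random


def rss_sample(population, n, L, rank_key=None, rng=None):
    """Draw a ranked set sample of size n * L (set size n, L cycles).

    Assumptions (illustrative only): `population` is a finite list, each
    set of n units is drawn without replacement, and ranking is perfect
    with respect to `rank_key` (identity if omitted).
    """
    if rng is None:
        rng = random.Random()
    if rank_key is None:
        rank_key = lambda unit: unit
    sample = []
    for _ in range(L):                      # one pass = one cycle
        for r in range(n):                  # r-th set gives its r-th smallest unit
            units = sorted(rng.sample(population, n), key=rank_key)
            sample.append(units[r])         # only this unit is fully measured
    return sample
```

With `n = 1` every set contains a single unit, so the function reduces to simple random sampling of size `L`.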

2.2. Partially Rank-Ordered Set Sampling. In this section, we introduce the PROS sampling design and present the necessary notation. This sampling design takes the form of the G** design in Ozturk (see [4]). In order to extract a PROS sample of size $N = nL$, we choose a set size $s = nm$ and a design parameter $D = \{d_1, \dots, d_n\}$ that partitions the set $\{1, 2, \dots, s\}$ into $n$ mutually exclusive subsets $d_j = \{(j-1)m + 1, \dots, jm\}$, $j = 1, \dots, n$. Sampling units are then assigned to the subsets $d_j$, $j = 1, \dots, n$, based on visual inspection, judgment ranking, or a concomitant variable, such that all units in the subset $d_j$ are judged to have smaller ranks than all units in the subset $d_{j'}$ when $j < j'$. A unit is then randomly selected from the subset $d_1$ for full measurement and denoted by $X_{[d_1]1}$. Again, we randomly select a set containing $s$ units and assign them to $n$ subsets; after that, we randomly draw a member from subset $d_2$ and denote it by $X_{[d_2]1}$. These steps are continued until we randomly extract a unit from $d_n$, $X_{[d_n]1}$. These observations constitute one cycle of the PROS sampling design; after $L$ repetitions of this process, we obtain a PROS sample of size $nL$, denoted by $X_{PROS} = \{X_{[d_j]i},\ i = 1, \dots, L,\ j = 1, \dots, n\}$; for more details, see [9].
Table 1 presents a simple example of the construction of a PROS sample when $s = 9$, $n = 3$, and $m = 3$, the number of cycles is $L = 2$, and the design parameter is $D = \{d_1, d_2, d_3\} = \{\{1, 2, 3\}, \{4, 5, 6\}, \{7, 8, 9\}\}$. Each set contains nine units assigned to three partially rank-ordered subsets. In this process, units in each subset have an equal chance to take any place in the subset. One unit, in each set from the bold-faced subset, is randomly drawn and measured. The resulting PROS sample is denoted by $\{X_{[d_j]i},\ i = 1, 2,\ j = 1, 2, 3\}$.
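The construction of one PROS sample can be sketched as follows. This is an illustrative Python sketch for the perfect-subsetting case only; the function name `pros_sample`, the `rank_key` argument, and drawing each set without replacement from a finite list are assumptions made for the example. Sorting the whole set and slicing it into blocks of size m is used here merely as a convenient way to form the subsets; the order within a subset plays no role because the measured unit is drawn uniformly from it.

```python
import random


def pros_sample(population, n, m, L, rank_key=None, rng=None):
    """Draw a perfect PROS sample of size n * L with set size s = n * m.

    Returns a list of (j, unit) pairs, where j = 0, ..., n - 1 is the index
    of the subset d_{j+1} the measured unit was taken from.
    Assumptions (illustrative only): finite list population, sets drawn
    without replacement, perfect subsetting based on `rank_key`.
    """
    if rng is None:
        rng = random.Random()
    if rank_key is None:
        rank_key = lambda unit: unit
    s = n * m
    sample = []
    for _ in range(L):                                  # L cycles
        for j in range(n):                              # one new set per subset
            units = sorted(rng.sample(population, s), key=rank_key)
            subset = units[j * m:(j + 1) * m]           # units judged to be in d_{j+1}
            sample.append((j, rng.choice(subset)))      # measure one unit at random
    return sample
```

For the setting of Table 1 one would call `pros_sample(population, n=3, m=3, L=2)`, which yields six measured units, two from each subset.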
It should be noted that, if all members in the subset $d_j$ have exactly smaller ranks than all members in $d_{j'}$, $j < j'$, the PROS sampling design is perfect. Otherwise, we have an imperfect PROS sampling design. Suppose that $\alpha$ is a doubly stochastic matrix; we model the subsetting error probabilities in the imperfect PROS as follows (see [7] and [9]):

$$\alpha = \begin{pmatrix} \alpha_{d_1,d_1} & \cdots & \alpha_{d_1,d_n} \\ \vdots & \ddots & \vdots \\ \alpha_{d_n,d_1} & \cdots & \alpha_{d_n,d_n} \end{pmatrix}, \qquad \sum_{j=1}^{n} \alpha_{d_j,d_h} = \sum_{h=1}^{n} \alpha_{d_j,d_h} = 1,$$

where $\alpha_{d_j,d_h}$ is the probability of assigning a unit into the subset $d_j$ when it belongs to the subset $d_h$. Throughout this paper, we use $\mathrm{PROS}_{\alpha}(n, L, s, D)$ to denote an imperfect PROS sampling design with the design parameter $D = \{d_j,\ j = 1, \dots, n\}$, where $\alpha$ is the subsetting error probability matrix, $n$ is the number of subsets, and $L$ and $s$ are the number of cycles and the set size, respectively. It should be pointed out that $m = s/n$.
SRS and RSS designs are special cases of the PROS sampling design when $s = 1$ and $s = n$, respectively. For a perfect PROS design, since $\alpha_{d_j,d_h} = 0$ for $h \ne j$ and $\alpha_{d_j,d_j} = 1$ for $j = 1, \dots, n$, the subsetting error matrix is an identity matrix and the notation $\mathrm{PROS}_{I}(n, L, s, D)$ can be used.
In this paper, the cumulative distribution function (CDF) of the studied variable in the population, the CDF of $X_{[d_j]i}$ for $i = 1, \dots, L$, and the CDF of the $r$th-order statistic among a simple random sample of size $s$ are denoted by $F$, $F_{[d_j]}$, and $F_{(r:s)}$, respectively. In addition, the corresponding probability density functions are represented by $f$, $f_{[d_j]}$, and $f_{(r:s)}$.

Kaplan-Meier Estimator Based on PROS Sampling Design
Definition 1. Let $X_1, \dots, X_n \sim F$ and $C_1, \dots, C_n \sim G$ be two independent random samples, where we observe $Y_i = \min\{X_i, C_i\} \sim H$, and let $\delta_i = 1\{X_i \le C_i\}$ be the indicator variable which specifies the event/censored status. The KM estimator is defined as

$$1 - \hat{F}_n(t) = \prod_{i=1}^{n} \left[1 - \frac{\delta_{[i]}}{n - i + 1}\right]^{1\{Y_{(i)} \le t\}},$$

where $Y_{(1)}, \dots, Y_{(n)}$ are the ordered values of the simple random sample (SRS) with related $\delta_{[1]}, \dots, \delta_{[n]}$ values; see [18] for more information.
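Based on the above Definition 1 and Definition 1 in [9], we define the KM estimator based on the imperfect PROS sampling design $\mathrm{PROS}_{\alpha}(n, L, s, D)$.
The KM estimator based on the $\mathrm{PROS}_{\alpha}(n, L, s, D)$ sample, $X_{PROS}$, is defined as

$$1 - \hat{F}_{PROS}(t) = \frac{1}{n} \sum_{j=1}^{n} \left(1 - \hat{F}_{[d_j]}(t)\right),$$

where $1 - \hat{F}_{[d_j]}$ is the KM estimator based on the independent and identically distributed sample $(Y_{[d_j]i}, \delta_{[d_j]i})$, $i = 1, \dots, L$, where $Y_{[d_j]i} = \min\{X_{[d_j]i}, C_{[d_j]i}\}$, $\delta_{[d_j]i} = 1\{X_{[d_j]i} \le C_{[d_j]i}\}$, and $C_{[d_j]i}$ is the censoring time corresponding to $X_{[d_j]i}$.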
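As a concrete illustration, the product-limit estimator and a stratified version that averages the subset-wise KM estimates (in line with the relation $F = (1/n)\sum_j F_{[d_j]}$ used in Section 4) can be sketched in Python. The function names `km_survival` and `pros_km_survival` and the input layout are assumptions made for this example; at tied times, events are placed before censored observations, which is the usual convention and not something stated in the text.

```python
def km_survival(times, events, t):
    """Product-limit (Kaplan-Meier) estimate of P(X > t).

    times  : observed values Y_i = min(X_i, C_i)
    events : indicators delta_i (1 = event observed, 0 = censored)
    """
    n = len(times)
    # Order by time; at ties, events (delta = 1) come before censored units.
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    surv = 1.0
    for pos, i in enumerate(order):        # n - pos units are still at risk
        if times[i] > t:
            break
        if events[i]:
            surv *= 1.0 - 1.0 / (n - pos)  # factor 1 - delta_[i] / (n - i + 1)
    return surv


def pros_km_survival(strata, t):
    """Average of the subset-wise KM estimates.

    strata : list of n lists, one per subset d_j, each holding
             (y, delta) pairs from the L cycles.
    """
    total = 0.0
    for stratum in strata:
        ys = [y for y, _ in stratum]
        ds = [d for _, d in stratum]
        total += km_survival(ys, ds, t)
    return total / len(strata)
```

For example, with uncensored times 1, 2, 3, 4 the call `km_survival([1, 2, 3, 4], [1, 1, 1, 1], 2)` gives (3/4)(2/3) = 0.5, the empirical survival proportion.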

Asymptotic Properties
In this section, we study the behavior of the nonparametric KM estimator in large samples based on the imperfect PROS sampling design. The asymptotic properties of the KM estimator under SRS are widely available in the literature [19][20][21]. We demonstrate that no stronger assumptions are needed when using the imperfect PROS-based KM estimator. First, we introduce the following lemma, which is a direct consequence of Lemma 2.1 in Stute and Wang [18].
Lemma 1. Suppose $X \sim F$ and $C \sim G$ are two independent random variables. In addition, let $X_{[d_j]i} \sim F_{[d_j]}$ be the PROS sample from subset $d_j$ in the $i$th cycle and $C_{[d_j]i}$ be the corresponding censoring time. Set $H_{[d_j]}(t) = P(\min(X_{[d_j]i}, C_{[d_j]i}) \le t)$; then we have
Proof. See Appendix A.
Due to the expressed lemma, we can define
We also set
Let $\varphi(w)$ be a score function
Now, we present Theorem 1.

Theorem 1. Assume $F$ and $G$ are continuous and
where
Also, set
As $L \to \infty$ and $N = nL$, we have
where
Proof. In view of the equivalent theorem in SRS sampling design [21], it suffices to show that, for every $j = 1, \dots, n$,
As to equation (15), under continuity of $F$ and $G$ and $\gamma_{0d_j}(x) = (1 - G(x))^{-1}$, we also have
Under the continuity of $F$, there exists a density $f$. We have $F(dx) = (1/n) \sum_{j=1}^{n} F_{[d_j]}(dx)$; hence, $nF(dx) = \sum_{j=1}^{n} F_{[d_j]}(dx)$.
By using the above relationship,
By equation (9), this expression is finite, so we prove equation (15).
To prove that (16) holds, we have to determine a lower bound for $1 - F_{[d_j]}(z)$.
We know that
Therefore, we have
We know
Also,
so,
Then,
Based on Lemma 1 and the above equations
We define constant $C$ as
Because $s - u \ge 0$, we have
so (26) is smaller than
In view of equation (10), this expression is finite, and this completes the proof. □
It should be noted that Theorem 1 has been proven only for the imperfect model, which has already been described in Section 2.2, and this model is not completely general.

Simulation Study
In this section, we compare the performance of the KM estimator of survival function under the PROS sampling design relative to its SRS and RSS counterparts.
To do so, we considered two situations in which the original random variables were generated from an exponential distribution with mean 1 (model A) and a standard lognormal distribution with mean 1.649 (model B). The censoring variables in both cases are assumed to have an exponential distribution, whose rate was chosen to achieve a prespecified censoring level. In all simulation scenarios, s = nm and the set size for the RSS sampling design is n. The algorithm of the simulation study is explained in Appendix B.
By distribution theory, if $D$ and $E$ are independent and exponentially distributed with means $\theta_1$ and $\theta_2$, respectively, then $P(D \le E) = \theta_2/(\theta_1 + \theta_2)$. On the other hand, $P(D \le E) = 1 - p$. Setting the censoring level $p$ and $\theta_1 = 1$ in these equations, we can find the appropriate value of the exponential rate in model A. Because there is no such closed-form expression for model B, we found the exponential rate for the censoring variable by trial and error, although one can easily solve this problem numerically by using software like R. The values of the exponential rate were equal to 0.013 and 0.190 and led to censoring levels of 0.1 and 0.6, respectively.
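The calculation of the censoring rate can be sketched as follows. Solving $\theta_2/(\theta_1 + \theta_2) = 1 - p$ gives $\theta_2 = \theta_1 (1 - p)/p$, so the censoring rate is $1/\theta_2$. When no closed form is available, the rate can be found by bisection on a Monte Carlo approximation of the censoring level, using $P(C < x) = 1 - e^{-\lambda x}$ for an exponential censoring time with rate $\lambda$. The function names, the bisection bounds, and the tolerance below are hypothetical choices for this sketch; it is not the procedure the authors used, which the text describes as trial and error.

```python
import math
import random


def exp_censoring_rate(p, event_mean=1.0):
    """Closed-form censoring rate when the event time is exponential.

    Solves theta2 / (theta1 + theta2) = 1 - p for theta2 and returns 1 / theta2.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("censoring level p must lie in (0, 1)")
    theta2 = event_mean * (1.0 - p) / p
    return 1.0 / theta2


def censoring_level(event_times, rate):
    """Monte Carlo approximation of P(C < X) for C ~ Exp(rate)."""
    return sum(1.0 - math.exp(-rate * x) for x in event_times) / len(event_times)


def solve_censoring_rate(event_times, p, lo=1e-8, hi=1e3, tol=1e-8):
    """Bisection on the rate; the censoring level is increasing in the rate."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if censoring_level(event_times, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a non-exponential event-time distribution, one would pass a large simulated sample of event times (for instance, values generated with `random.lognormvariate(0.0, 1.0)`) to `solve_censoring_rate` together with the target censoring level.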
For each combination of sample sizes $N = 30$, 120, and 240 and the mentioned censoring levels 0.1 and 0.6, 5000 samples were generated under the SRS, RSS, and PROS sampling designs. For different values of $n$, $m$, and $L$ and the misplacement probabilities $\alpha_{d_i,d_i} = \alpha_0$ and $\alpha_{d_i,d_j} = (1 - \alpha_0)/(n - 1)$ for $i \ne j$, the values of the mean squared error (MSE) were computed for the three estimators from each sample when $\alpha_0 = 0$, 0.5, 0.7, and 1.
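The misplacement structure used in the simulations can be written down explicitly. The sketch below builds the matrix with $\alpha_0$ on the diagonal and $(1 - \alpha_0)/(n - 1)$ elsewhere and checks that it is doubly stochastic. The helper `draw_true_subset` shows one possible way to simulate imperfect subsetting: because the matrix is doubly stochastic and every true subset is equally likely a priori, row $j$ can also be read as the distribution of the true subset given that a unit was judged to be in $d_j$. All function names are hypothetical, and this sampling mechanism is an assumption of the sketch rather than a description of the authors' code.

```python
import random


def misplacement_matrix(n, alpha0):
    """n x n matrix with alpha0 on the diagonal and (1 - alpha0)/(n - 1) elsewhere."""
    if n < 2:
        raise ValueError("need at least two subsets")
    if not 0.0 <= alpha0 <= 1.0:
        raise ValueError("alpha0 must lie in [0, 1]")
    off = (1.0 - alpha0) / (n - 1)
    return [[alpha0 if i == j else off for j in range(n)] for i in range(n)]


def is_doubly_stochastic(a, tol=1e-12):
    """Check non-negativity and that every row and column sums to one."""
    n = len(a)
    nonneg = all(x >= 0.0 for row in a for x in row)
    rows = all(abs(sum(row) - 1.0) <= tol for row in a)
    cols = all(abs(sum(a[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return nonneg and rows and cols


def draw_true_subset(alpha, j, rng):
    """Draw the index of the true subset of a unit judged to be in d_{j+1}."""
    return rng.choices(range(len(alpha)), weights=alpha[j])[0]
```

Setting `alpha0 = 1` gives the identity matrix, that is, the perfect design.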

Comparing the Kaplan-Meier Estimators.
We compare the performance of the KM estimators of the survival function between the studied sampling designs. The efficiency of the PROS estimator with respect to its SRS and RSS counterparts, at the point $t$, is defined as

$$\mathrm{RP}(t) = \frac{\mathrm{MSE}\{1 - \hat{F}_{\bullet}(t)\}}{\mathrm{MSE}\{1 - \hat{F}_{PROS}(t)\}},$$

where $\hat{F}_{\bullet}$ denotes the estimator under SRS or RSS, so that values above 1 favor the PROS design. In the literature, the PROS and RSS designs have been compared at similar sample sizes but with a much smaller set size for RSS than for PROS. However, simulation studies that are not presented here show that the RSS-based estimator may perform better than the one using the PROS sampling design under the same sample size and the same set size.
As shown in Figures 1 and 2, in model A, the KM estimator based on the PROS sampling design is in most cases more efficient than the KM estimators based on the RSS and SRS sampling designs with similar sample sizes. The best performance of the PROS design over the SRS and RSS designs occurs when the ranking errors are small or zero, i.e., when $\alpha_0 = 0.7$ and 1. The efficiency of the KM estimator based on PROS relative to SRS is as good as or higher than its efficiency relative to RSS, regardless of the censoring level and ranking error. For a fixed sample size and censoring level, increasing $n$ for large values of $\alpha_0$ enhances the efficiency of the KM estimator based on the PROS sampling design. It should be noted that in an imperfect PROS sampling design $(\alpha_0 = 0)$, the efficiency decreases as $n$ increases.
We can conclude that increasing the level of censorship in a smaller sample size leads to a reduction in efficiency in both models, but for a larger sample size, this rarely happens; in other words, the level of censoring has a greater impact on the performance of the PROS sampling design in smaller samples than in larger ones.
We conclude that, regardless of the censoring level and ranking error, increasing the sample size leads to increased efficiency. The perfect PROS KM estimator performs three times more efficiently than the SRS KM estimator in several simulation scenarios. It is worth noting that RP might decrease when one considers the same set size in the PROS and RSS designs with similar sample sizes. In all figures, we consider m = 3 for the PROS design.
In addition, we compared these three sampling methods using the mean integrated squared error (MISE), defined as

$$\mathrm{MISE}(\hat{F}) = E\left[\int \left(\hat{F}(t) - F(t)\right)^2 dt\right].$$

From Table 2, we can conclude that most of the time, PROS has a smaller MISE than the RSS and SRS sampling methods with similar sample sizes, especially for a large $\alpha_0$. In addition, we observe that as the level of censoring increases, the MISE increases as well in both models. It should be mentioned that at the low level of censorship, the log-normal model (model B) has a lower MISE than the exponential model (model A), but at the high level of censorship, model B has a larger MISE than model A. The results show that when $\alpha_0 = 0.5$, 0.7, and 1 in a smaller sample size with a low percentage of censored data, a larger $n$ leads to a smaller MISE of the estimators, but with a high percentage of censored data, the MISE increases as $n$ increases. However, in larger sample sizes, the MISE of the estimator decreases as $n$ goes up at all censoring levels.
In Table 2, as the misplacement probabilities decrease, the superiority of the PROS estimator compared to the RSS and SRS estimators becomes more obvious. The MISE values of the KM estimator derived from perfect PROS and perfect RSS sampling designs are smaller than those from the imperfect designs. Note that the KM estimator based on the SRS sampling design has a smaller MISE value than the one based on the imperfect rank-based sampling designs in some cases with small sample sizes and a high censoring percentage. Note also that the RSS KM estimator can have a lower MISE than the PROS one when we consider a similar set size and fixed sample size.
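A numerical approximation of the MISE criterion can be sketched as below. The integral of the squared error is approximated by a midpoint rule on $[0, t_{max}]$ and then averaged over replicated estimates. The integration range `t_max`, the grid size, and the function names are assumptions of this sketch; the text does not state which range or quadrature was used.

```python
def integrated_squared_error(s_hat, s_true, t_max, grid_size=1000):
    """Midpoint-rule approximation of the integral of (s_hat - s_true)^2 on [0, t_max].

    s_hat and s_true are callables returning the estimated and the true
    survival (or distribution) function at a time point.
    """
    h = t_max / grid_size
    total = 0.0
    for k in range(grid_size):
        t = (k + 0.5) * h                      # midpoint of the k-th cell
        total += (s_hat(t) - s_true(t)) ** 2
    return total * h


def mise(estimates, s_true, t_max, grid_size=1000):
    """Average integrated squared error over a list of replicated estimates."""
    errors = [integrated_squared_error(e, s_true, t_max, grid_size) for e in estimates]
    return sum(errors) / len(errors)
```

In a simulation, `estimates` would hold one fitted survival curve per replicate, for instance closures around a KM routine evaluated on each generated sample.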

Real Data Application
In this section, we use the information of children under 18 years of age with nonhematological disorders such as Beta-Thalassemia and Idiopathic Thrombocytopenic Purpura (ITP) and children with hematological malignancies including various types of lymphoma and Acute Lymphocytic Leukemia (ALL), registered in the Amir Medical Oncology Center from May 2014 to August 2017. The dataset contains the survival information of 61 patients. We provide KM estimates for Y, the survival time (in months), as the variable of interest, and use Z, the white blood cell (WBC) measurement, as the concomitant variable for ranking purposes. The correlation coefficient between Z and Y is 0.455 and is significant (p value = 0.0001); in addition, 50.8% of the observations are censored. We considered the perfect PROS and RSS sampling designs. In order to compute the KM estimator of survival time, we regarded this data set as a target population and extracted PROS, RSS, and SRS samples (with replacement) of size $N = nL$ from the population. We considered the design parameter $D = \{d_1, d_2, d_3, d_4, d_5\}$. In the first step, we randomly selected $nm = 15$ patients from the target population and then partitioned these patients into subsets $d_1$, $d_2$, $d_3$, $d_4$, and $d_5$ based on their WBC values. In the next step, we randomly selected a unit from subset $d_1$ and observed its survival time. Again, we randomly selected 15 patients, assigned them to $d_1$, $d_2$, $d_3$, $d_4$, and $d_5$, and randomly drew a member from subset $d_2$; we repeated these steps until we selected a unit from subset $d_5$. These observations constitute one cycle of PROS. For this real data set, we considered 3 cycles, which gives 15 survival time observations from patients.
In RSS, we randomly selected 5 patients from the target population and ranked them based on their WBC values; then we selected the patient with the smallest WBC and observed his or her survival time. This procedure continued until the survival time of the 5th ranked unit in the 5th set of units was measured. These 5 observations constitute one cycle of RSS; in this example, we considered 3 cycles, so we observed the survival time of 15 patients.
For each sampling design, the KM estimator was calculated at different time points. Then, this process was repeated M times. We took $(n, m, L, M) = (5, 3, 3, 50)$. These 50 KM curves under the three sampling designs are shown in Figure 3. Figure 3 shows that the variation of the KM estimators at each fixed time under the PROS sampling design is less than the variation of the RSS and SRS counterparts. We conclude that, for this real data set, the PROS estimator performs better than its RSS and SRS counterparts. We uploaded the raw data as supplementary material.

Summary and Concluding Remarks
In numerous medical fields, the exact measurement of the desired variable is expensive or time-consuming. Rank-based sampling designs such as PROS can help overcome this difficulty by ranking a small number of sampling units based on a concomitant variable. These sampling designs can be used to obtain samples that are more informative and also result in more accurate inference about the parameters of interest.
In this paper, we considered the KM estimator, a commonly used technique in survival analysis, under an imperfect PROS sampling design. PROS is a new sampling design that avoids ranking all units in a given set. Furthermore, we developed asymptotic distributional properties of the new KM estimator based on the proposed sampling method. We showed how well this estimator performs in comparison with its RSS and SRS counterparts. The simulation results suggest that, under both perfect and imperfect subsetting assumptions, the efficiency of the estimator based on the PROS sampling design is higher than the efficiency of the estimators based on the two other sampling methods with the same sample sizes. It is noteworthy that, by increasing the set size in RSS while keeping the sample size fixed in both designs, the RSS KM estimator can have smaller values of MSE than the PROS one. Finally, we applied all the introduced sampling designs to a real data set. We believe that it would be appealing to apply the proposed methodology to other useful statistical models, for example, a Cox regression model for analyzing time-to-event data, which is applicable to the majority of medical fields.
Finally, we recommend extending this study to recently proposed sampling designs, for example, even order ranked set sampling (EORSS) [22] and quartile pair ranked set sampling (QPRSS) [23], which have recently received attention from some researchers.

Appendix
A. The Proof of Lemma 1
We have (A.1)

B. Algorithm of Simulation Scenarios
The steps of the simulation study algorithm are as follows:
Step 1: Perform data generation in the following ways:
(i) Generate 1000 random event time observations from the desired distribution (X)
(ii) Generate 1000 random censored time observations from the desired distribution (C)
(iii) Observe the status variables (δ = I(X < C))
(iv) Calculate survival time variable (T = min(X, C))
Step 2: Perform sampling in the different studied designs:
(i) Generate PROS, RSS, and SRS samples from the target population. For PROS and RSS, we generate the samples based on different values for subsetting error matrices, set sizes, and cycle sizes
Step 3: Estimate the desired estimators:
(i) Estimate the KM estimator using the corresponding formula coding
Step 4: Calculate comparison criteria:
(i) Compute the MSE of the KM estimator in different percentile points
(ii) Compute the MISE values for KM estimators under the three different sampling designs
Step 5: Repeat all the above steps 5000 times.
Step 6: Compute the mean of 5000 calculated MSE and MISE and report them.
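The steps above can be sketched in Python for a reduced setting. This is an illustrative simplification under several assumptions, not the authors' code: it covers only model A (exponential event times with mean 1), compares SRS with a perfect PROS design ranked on the event time itself, evaluates the MSE at a single time point, and uses far fewer replicates by default than the 5000 of the study. The function names and default arguments are hypothetical.

```python
import math
import random


def km_survival(times, events, t):
    """Product-limit estimate of P(X > t); events first at tied times."""
    n = len(times)
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    surv = 1.0
    for pos, i in enumerate(order):
        if times[i] > t:
            break
        if events[i]:
            surv *= 1.0 - 1.0 / (n - pos)
    return surv


def km_from_units(units, t):
    """KM estimate from (x, c) pairs via T = min(x, c) and delta = 1{x <= c}."""
    times = [min(x, c) for x, c in units]
    events = [int(x <= c) for x, c in units]
    return km_survival(times, events, t)


def simulate_mse(t, n=3, m=3, L=10, p=0.1, pop_size=1000, reps=500, seed=0):
    """Return (MSE under SRS, MSE under perfect PROS) of the KM estimate at t."""
    rng = random.Random(seed)
    cens_rate = p / (1.0 - p)              # model A: gives censoring level p
    true_surv = math.exp(-t)               # survival function of Exp(mean 1)
    se_srs = se_pros = 0.0
    for _ in range(reps):                  # Step 5: repeat
        # Step 1: population of event and censoring times
        pop = [(rng.expovariate(1.0), rng.expovariate(cens_rate))
               for _ in range(pop_size)]
        # Step 2: SRS of size n * L and perfect PROS ranked on x (assumption)
        srs = rng.sample(pop, n * L)
        strata = [[] for _ in range(n)]
        for _ in range(L):
            for j in range(n):
                units = sorted(rng.sample(pop, n * m), key=lambda u: u[0])
                strata[j].append(rng.choice(units[j * m:(j + 1) * m]))
        # Step 3: KM estimates under both designs
        s_srs = km_from_units(srs, t)
        s_pros = sum(km_from_units(st, t) for st in strata) / n
        # Step 4: squared errors at the point t
        se_srs += (s_srs - true_surv) ** 2
        se_pros += (s_pros - true_surv) ** 2
    # Step 6: average over replicates
    return se_srs / reps, se_pros / reps
```

The ratio of the two returned values is a Monte Carlo estimate of the efficiency of PROS relative to SRS at the chosen time point, for this simplified setting only.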

Data Availability
In the present study, we used the information about children under 18 years of age with non-hematological disorders such as Beta-Thalassemia and Idiopathic Thrombocytopenic Purpura (ITP) and also children with hematological malignancies including various types of lymphoma and Acute Lymphocytic Leukemia (ALL), registered in Amir Medical Oncology Center during May 2014 to August 2017, as a population of interest.

Conflicts of Interest
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

Supplementary Materials
This supplementary file includes the data for the Section 6 (real data) example in the paper. This file contains the information of children under 18 years of age with nonhematological disorders such as Beta-Thalassemia and Idiopathic Thrombocytopenic Purpura (ITP) and children with hematological malignancies including various types of lymphoma and Acute Lymphocytic Leukemia (ALL), registered in Amir Medical Oncology Center during May 2014 to August 2017. The dataset contains the survival information of 61 patients.