Variable Selection Methods for Right-Censored Time-to-Event Data with High-Dimensional Covariates

Advancement in technology has led to greater accessibility of massive and complex data in many fields such as quality and reliability. The proper management and utilization of valuable data could significantly increase knowledge and reduce cost by preventive actions, whereas erroneous and misinterpreted data could lead to poor inference and decision making. On the other side, it has become more difficult to process the streaming high-dimensional time-to-event data in traditional application approaches, specifically in the presence of censored observations. This paper presents a multipurpose analytic model and practical nonparametric methods to analyze right-censored time-to-event data with high-dimensional covariates. In order to reduce redundant information and to facilitate practical interpretation, variable inefficiency in failure time is determined for the specific field of application. To investigate the performance of the proposed methods, these methods are compared with recent relevant approaches through numerical experiments and simulations.


Introduction
Time-to-event data such as failure or survival times have been extensively studied in reliability engineering.By the advent of modern data collection technologies, a huge amount of this type of data includes high-dimensional covariates.This massive amount of data is increasingly accessible from various sources such as transaction-based information, informationsensing devices, remote sensing technologies, machines and logistics statistics, wireless sensor networks, and analytics in quality engineering, manufacturing, service operations, and many other segments.Unlike traditional datasets with few explanatory variables, the analysis of datasets with a high number of variables requires different approaches.In this situation, variable selection techniques could be used to determine a subset of variables that are significantly more valuable to analyze high-dimensional time-to-event datasets.If data is compiled and processed correctly, it can enable informed decision making [1][2][3][4][5][6][7].
In many professional areas and activities, as well as manufacturing and services, decision making is increasingly based on the type and size of data, as well as analytic methods, rather than on experience and intuition.It has been suggested that business should discover new ways to collect and use data every day and develop the ability to interpret data to increase the performance of their decision makers [8,9].As stated in a broad survey [10], advanced analytics are among the most popular techniques used in high-dimensional and massive data analysis and decision making process.As an analytical approach, decision making is the process of finding the best option from all feasible alternatives [11,12].
The opportunity for manufacturing and services in the era of data is to analyze their performance to enhance the quality.Quality dimensions of products and services, defined by quality experts [13,14] or perceived by customers [15], are summarized as performance, availability, reliability, maintainability, durability, serviceability, conformance, warranty, and aesthetics and reputation.Access to valuable data for

Basic Definitions
In this section, an applied introduction of time-to-event data analysis and a brief review of prominent data mining tools and techniques relevant to this study are presented.
2.1.Time-to-Event Data Analysis.Time-to-event data analysis methods consider the time until the occurrence of an event.This time can be measured in any unit such as days, weeks, or years with this analysis widely used in reliability engineering.In time-to-event data, subjects are usually followed over a specified time period.The study of time-to-event data focuses on predicting the probability of survival or failure.Examples of time-to-event data are the lifetime of mechanic devices, electronic components, or complex systems [21,27,28].
Regression models cannot effectively perform in the presence of the censoring of observations [27,29].Censored data occurs when the information about the event time is not complete or missed for any reason.Right censoring occurs when a test subject does not remain under the test for a full test period or until it fails.In this paper, we focus on time-toevent data with this type of censoring.
Available methods to analyze time-to-event data and to find a relationship between survival time and other variables can be categorized in parametric, semiparametric, and nonparametric methods.Parametric survival analysis is based on survival function distributions such as the exponential function.Semiparametric models do not assume knowledge of absolute risk.These models estimate relative risk rather than absolute risk with this assumption called the proportional hazards assumption.In this category, the Cox proportional hazards regression analysis is by far the most popular model for survival data analysis.For moderateto high-dimensional covariates, it is difficult to apply semiparametric methods [25].In nonparametric methods which are useful when the underlying distribution of the problem is unknown, statistical assumptions are not required.These methods are commonly used to describe survivorship of a study population or compare two or more study populations.The Kaplan-Meier product limit estimate is a commonly used nonparametric method in estimating the survival function.This estimator has clear advantages since it does not require an approximation of the follow-up time assumption [27,30].
The probability of the failure time occurring at time  is In time-to-event or survival analysis, the information on an event status and follow-up time are used to estimate a survival function, (), which is defined as the probability that an object survives at least until time :  () =  (an object survives longer than ) =  ( > ) . ( From the definition of the cumulative distribution function (or failure function) Accordingly, the survival function is calculated by a probability density function as In most applications, the survival function is shown as a step function rather than a smooth curve.The nonparametric estimate of () according to the Kaplan-Meier (KM) estimator for distinct ordered event times  1 to   is as follows: where at each event time   there are   subjects at risk and   is the number of subjects which experienced the event, for example, failed.Let   denote the number of subjects censored between   and  +1 .Then the likelihood function takes the following form: For the conditional probability of surviving, if we define   = (  )/( −1 ), then the maximum-likelihood estimation of   is as follows: Graphically, the Kaplan-Meier estimate is a step function with discontinuities which increases at observed failure times.It has been shown [31] that the KM estimator is consistent.The completely nonparametric nature of this estimator assures little or no loss in efficiency.A quick review of commonly used data mining tools and techniques in this study is presented next.

Data Mining Tools and Techniques.
To analyze time-toevent data, when the size and dimensions are large, advanced analytics are advantageous.Data reduction techniques are categorized in three main strategies: dimensionality reduction, numerical reduction, and data compression [32,33].Dimensionality reduction is the most efficient strategy in the field of large-scale data deals by reducing the number of random variables or attributes in the special circumstances of the problem.Dimensionality reduction methods are mainly wavelet transformations and principal components analysis (PCA) [34,35].The transformation and projection of the original data eliminate a subset of the original data in terms of the variables' covariance.
All dimensionality reduction techniques are also classified as feature extraction and feature selection approaches.
Feature extraction is defined as transforming the original data into a new lower dimensional space through some functional mapping such as PCA and SVD [36,37].Most unsupervised dimensionality reduction techniques are closely related to PCA which is one of the oldest and most well-known multivariate analysis techniques, but this technique is not applicable to large complex datasets [38].Feature selection is denoted by selecting a subset of the original data (features) without a transformation in order to filter out irrelevant or redundant features, such as filter methods, wrapper methods, and embedded methods [39,40].The next section presents a proposed analytic logical model for the transformation of the explanatory variable dataset of a time-to-event data.

Proposed Analytic Model
The direction for developing an analytic model and analyzing datasets depends on the type and size of the data.A multipurpose, flexible, and innovative model for a type of rightcensored time-to-event data with a large number of variables when the correlation between variables is complicated or unknown provides the motivation to find an applicable solution for this type of data.For such data, we propose a model to simplify the original covariate dataset into a logical dataset by transformation lemma.In order to select the most significant variables in terms of efficiency, variable reduction methods and clustering algorithms are proposed in Section 4. The analytic model [41] and its following methods and algorithms are potentially applicable solutions for many problems in a vast area of science and technology.
The original right-censored high-dimensional time-toevent dataset may include any type of explanatory data as binary, continuous, categorical, or ordinal data.The concept of this proposed analytic model is that many variables are even binary or interchangeable with a binary variable such as dichotomous and Bernoulli variables.Also, the interpretation of a binary variable is simple, understandable, and comprehensible.In addition, the model is appropriate for fast and low-cost calculation which makes the time-dependent analysis with data streaming possible.
The random variables  and  represent the time-toevent and censoring time, respectively.Time-to-event data is represented by (T, Δ, U) where T = min(, ) and the censoring indicator Δ = 1 if the event occurred, for instance failure is observed, otherwise 0. The observed covariate U represents a set of variables.Let denote any observations by (  ,   ,   ),  = 1, . . ., ,  = 1, . . ., .It is assumed that the hazard at time  only depends on the survivals at time  which assures the independent right censoring assumption [21].
In order to simplify the original complex highdimensional time-to-event dataset, we propose a transformed logical model.Based on the concept of this model, for any by- dataset matrix U, there are  independent observations and  variables.Each array of  variables vectors will take only two possible values, canonically 0 and 1.Therefore, it is required to define a Bernoulli criterion to split all arrays as a set of binary outcomes and reach a logical dataset.Transforming numerical attributes to binary variables has been well studied [42].In order to construct the abovementioned transformed logical time-to-event dataset as a simplified and applied representation of the original one, we define   as an initial Bernoulli criterion: For any array   in the -by- dataset matrix U = [  ], assign a substituting array V  as Note that for each of the variable vectors v . the criterion   could be defined by an expert using experimental or historical data as well (8).The proposed model assumes any array with a value of 1 as desired for an expert and 0 otherwise.In other words, V  = 0 represent the lack of the th variable in the th observation.In this fashion, only desired variables will be considered in each variable vector.The transformed dataset is used in the proposed methods and algorithms.Therefore, the result of the transformation is an -by- dataset matrix V = [V  ] which will be used in the following methods and algorithms.Also, we define the time-to-event vector T = [  ] including all observed failure times and S = [  ] as the survival function.The proposed logical model validation and verification of the robustness were presented comprehensively in [41,43].The logical model initially could be satisfied by the proper design of the data collection process based on Boolean logic to generate binary attributes.

Proposed Methods and Heuristic Algorithms
In order to design appropriate methods and algorithms, a test is performed on the efficiency of a cluster of variables as a subdataset of the complete time-to-event covariates dataset.The test is done by comparing this subdataset with the complete transformed logical dataset.A key assumption in this approach is that the variable which is completely inefficient solely can provide a significant performance improvement when engaged with others, and two variables that are inefficient by themselves can be efficient together [40].By expanding these assumptions over the presence of a large number of covariates with a complexity in correlation, the efficient subset of variables does not differ meaningfully from the effect of the whole body of variables on the time-to-event outcome.Therefore, comparing through nonparametric test, any subset from the complete dataset could be determined as an efficient or inefficient selection.Based on these assumptions, we design two methods and hybrid algorithms for the proposed analytic model for selecting inefficient variables in right-censored time-to-event datasets with high-dimensional covariates.
We use the Kaplan-Meier estimator in this study to estimate and graph survival probabilities as a function of time.In addition, a nonparametric method is used to test a null hypothesis of whether two samples are drawn from the same distribution as compared to a given alternative hypothesis.Among many nonparametric tests for comparing survival functions for the aforementioned propose, we use a log-rank test in our methods as best fit for the comparison of two nonparametric distributions.This test is the most commonly used one for a typical study under different models for the relationship between the groups [27,30].
V is constructed by  observation vectors that correspond to each of the variables D = [  ] as a -by- matrix, a selected subset of V.  is defined as the number of observations in any subset of V, where  ≤ .We also define vector R as follows: Vector R is constructed by all nonzero arrays .The preliminary step for the highest efficiency in proposed methods is to cluster the variables based on the correlation coefficient matrix of the original dataset M = [  ] and choose a representative variable from each highly correlated cluster and then eliminate the other variables from the dataset.Let   denote the covariance of variables  and .The correlation coefficient matrix is defined as follows: V  and V  represent the values of variable  in observation  and the mean of variable  and the second parenthesis defined similarly for variable .
Applying this lemma, for instance, to any given dataset, determines three highly correlated variables from M. Only one of them is selected randomly and the other two are eliminated from the dataset.The outcome of this process ensures that the remaining variables for applying methods and heuristic algorithms are not highly correlated.

Method I: Nonparametric Test Score Variable Clustering.
The log-rank test score variable selection method is applied for selecting a subset of the best and the worst variables in terms of efficiency.The nonparametric test score (NTS) method is a variable clustering technique which selects a set of size  variables from the transformed logical dataset V and calculates the score of each variable in two levels.The first level is to determine the priority of the variable efficiency via the scores.The scores are obtained from the frequency of each variable in the rejected subsets from comparison with the original time-to-even vector T. We code this level of calculation with letter F. The second level rates the variables by the cumulative score of each variable from comparisons of selected subsets of all nonparametric test results with the original time-to-even vector T.This level acts as a searching procedure to detect the less efficient variables.The code which this level is denoted by is the letter C. The randomization (RN) algorithm randomly chooses a defined  subset of  from the V transformed logical dataset of  variable.We define a randomization dataset matrix Ψ = [  ] where each row is formed by  variable identification numbers in any selected subsets for overall  subsets.The heuristic algorithm of NTS method level F is as in Algorithm 1.
The heuristic algorithm of NTS method level C is presented is Algorithm 2. The experiment results for these algorithms are followed in Section 5.

Method II: Splitting Semigreedy Clustering Algorithm.
The splitting semigreedy (SSG) method to select an inefficient variable subset is proposed.And a clustering procedure through a randomly splitting approach to select the best local subset according to a defined criterion is incorporated.The concept of this method is inspired by the semigreedy heuristic [44,45] and tabu search [46].The criterion of this search is similar to Method I which is to collect the most inefficient variable subset via a log-rank test score.At each of  trials, all  variables from the transformed logical dataset V are randomly clustered into subsets of size  variables, where one cluster possibly contains less than  variables and the number of clusters is equal to ⌈/⌉.To calculate the score summation for each variable over all trials, we define a randomization dataset matrix Ξ = [  ] where each row is formed by  variable identification numbers in any selected subsets for all  selected subsets.The heuristic algorithm of SSG is presented in Algorithm 3. Comprehensive experimental results for the validation of the proposed methods by comparison with similar methods are presented next.

Experimental Results and Analysis
To evaluate the performance of the proposed methods, the designed algorithms under different types of datasets including collected and simulated data are investigated.In order to obtain an estimation of desired number of variables in any selected subset of NTS (F and C levels) and SSG methods for calculations in hybrid algorithms (Algorithms 1, 2, and 3), we use principal component analysis (PCA) scree plot criterion [47].The criterion for this evaluation is based on the eigenvalue of components in a dataset.The cut-off in the scree plot is interpreted as a set of significant eigenvalues among all components which leads to determining , number of efficient variables in a given dataset (see Figure 1).We apply this technique on the original dataset to determine .

PBC Data Numerical Experiment.
First, the well-known primary biliary cirrhosis (PBC) dataset (Fleming and Harrington 1991) is considered as the sample collected dataset.These data are from a double-blinded randomized trial including 312 observations.This right-censored dataset contains 17 variables in addition to censoring information, identification number, and event times for each observation.Experimental results for the 276 observations with complete records are presented.For the original PBC dataset, approximate value of  is 3.This value is the subset size in the algorithms applied on V transformed logical dataset in the -by- matrix, shown in Figure 1.
To verify the performance of the proposed methods, the result of methods and algorithms for the transformed logical PBC dataset is compared with the results of random survival forests (RSF) method [23,48], additive risk model (ADD) [24], and weighted least square (LS) [25] for variable selection in the PBC dataset given in Table 1.RSF is a method for the analysis of right-censored survival data using splitting rules for growing survival trees, and it is based on the random forests (RF) approach that ensembles learning which can be improved further by injecting randomization into the base learning process.Therefore, the RSF methodology is an extension of the RF method in which randomization is used in two forms; by randomly drawing a bootstrap sample of the data and by randomly selecting a subset of variables [23].The ADD proposes a principal component regression to give unique and numerically stable estimators.This model assumes that the hazard function is the summation of the baseline hazard function and the regression function of covariates [24].The LS estimator uses KM weights to account for censoring in the least squares criterion and makes the adaptation of this approach to multiple covariate settings computationally feasible [25].A comprehensive comparison of RSF, ADD, and LS performance with other relevant methods in high-dimensional time-to-event data analysis such as Cox's proportional hazard model, LASSO, and PCR has been presented in [23][24][25].
Each number in Tables 1 and 2 represents a specific variable in experiment dataset.For example, in Table 1, variable 15 is among the selected less efficient variables by NTS (F), SSG, RSF, and LS methods.
From the results shown on Table 1, the NTS (F) method has the closest performance to the SSG.This is an advantage of these methods that more than 80% of inefficient variables which have been detected by other methods (RSF, LS, and ADD) are collected by proposed algorithms at significantly short periods of calculation.The robustness of this class of methods has been examined for several time-to-event data samples collected with high-dimensional covariates in this study.

Simulation Numerical Experiment.
Continuing the validation of the proposed methods, we set  = 400 observations and  = 25 variables and simulated event times from a pseudorandom algorithm.In addition to constant and periodic binary numbers, normal and exponential distributed pseudorandom numbers are generated as independent values of explanatory variables.The censored observations were also generated independent of the events.Additionally, some variables are set as a linear function of event time intentionally.We present the results of methods and algorithms applying the simulated data in Table 2.These results are compared with the simulation defined pattern and the comparison verifies the performance of all proposed methods and algorithms.Inefficiency analysis results for the simulation experiment show that variables with identification numbers 5, 10, and 25 are detected as less efficient variables among all 25 simulated variables by NTS (F), NTS (C), and SSG methods.In this example, one could eliminate these less efficient identified variables if the objective is to reduce the number of variables in the dataset for further analysis.
In addition, a set of numerical experiments are presented to evaluate the performance of the proposed methods as shown in Table 3. Six different simulation models are defined.These models are set as explained above for  = 400 observations and  = 25 variables following same data generating algorithm.The number of significant inefficient variables in each simulation model is set  = {4, 5, 6} as presented in Definition in Table 3.The simulations are repeated 100 times independently.For each method,  represents integer average number of determined and selected inefficient variables and  denotes the performance of the method, for 100 trails, where  is as follows: and  is the average of performance of each method.The adjusted censoring rates for all simulation numerical experiments are 50 ± 5%. .Therefore, it is interpreted from the hybrid scatter plot that each variable with larger diameter and more distance from the center has less efficiency and is an ideal candidate to remove from the dataset if it is desired.

Conclusions
This paper presents applied procedures beneficial to covariate reduction through an inefficient variable selection approach which enables one to obtain an appropriate variable subset in a right-censored high-dimensional and large-scale timeto-event data.Two hybrid methods are introduced in which the experimental results and comparisons demonstrate their performance.By using such novel methods in the field of reliability, data analysis and decision making processes will be faster, simpler, and more accurate.We aim to apply median rank estimates in the proposed nonparametric methods and develop a weighted score and penalized nonparametric maximum likelihood and least square in the proposed logical time-to-event model by the accelerated failure time model.Also, the next challenge in this research is to face the data streaming circumstances in which

Step 1 .Figure 1 :
Figure 1: Scree plot of the original PBC dataset including 17 variables and 276 observations.

Figure 2 :
Figure 2: Hybrid scatter plot (upper right variables have less efficiency).

Figure 2
displays a graphical interpretation of the results from the proposed methods, hybrid scattering for variable inefficiency for three criteria as (a) NTS (F) score, (b) NTS (C) score, and (c) SSG score for PBC data.Each variable is exposed by a circle icon where two axes represent the standardized criteria (a) and (b) and the diameter of each circle demonstrates the standardized criterion (c) Step 1. for i = 1 to q do Step 2. ComposethedatasetD  for variable set i in Ψ including variables   where j = 1 to k Step 3. CalculateR  over the dataset D  Step 4. CompareT and R  with log-rank test Step 5. Save the test score for variables in subset i as  (+1) Step 6. end for Step 7. Eliminate rows of Ψ where the array  + 1 of the row has a value of more than 0.05, and call this new set Ψ * Step 8. Count the contribution (presence) of each variable ℎ based on its identification number in the Ψ * first k columns and save as  ℎ Step 9.Return Γ = [  ] as the variable efficiency vector Algorithm 1: NTS (F) algorithm.Step 1. for  = 1 to  do Step 2. Compose the dataset D  for variable set  in Ψ including variables   where  = 1 to  Step 3. Calculate R  over the dataset D  Step 4. Compare T and R  with log-rank test Step 5. Save the test score for variables in subset  as  (+1) Step 6. end for Step 7. Assume Ω = [  ] as the reverse variable efficiency vector where initially each array as the cumulative contribution score corresponding to a variable is zero Step 8. for  = 1 to  do Step 9. for  = 1 to  do Step 10.Add the value of  (+1) to the cumulative contribution score   of the variable  based on its identification number =   end for Step 11. end for Step 12.Return Ω = [  ] as the variable inefficiency vector

Table 1 :
Selected less efficient variables in proposed methods as compared to those from RSF, ADD, and LS methods.

Table 2 :
Selected less efficient variables in all proposed methods and comparison to simulation defined pattern.