Nonlinear Survival Regression Using Artificial Neural Network

,


Introduction
Many different parametric, nonparametric, and semiparametric regression methods are increasingly examined to explore the relationship between a response variable and a set of covariates.The choice of an appropriate method for modeling depends on the methodology of the survey and the nature of the outcome and explanatory variables.
A common research question in medical research is to determine whether a set of covariates are correlated with the survival or failure times.Two major characteristics of survival data are censoring and violation of normal assumption for ordinary least squares multiple regressions.These two characteristics of time variable are reasons that straightforward multiple regression techniques cannot be used.Different parametric and semiparametric models in survival regression were introduced which model survival or hazard function.Parametric models, for instance, exponential or weibull, predict survival function while accelerated failure time models are parametric regression methods with logarithm failure time as dependent variable [1,2].
Choosing an appropriate model for the analysis of the survival data depends on some conditions which are called the underlying assumptions of the model.Sometimes, these assumptions may not be true, for example: (a) lack of independence between consequent waiting times to occurrence of an event or nonproportionality of hazards in semiparametric models, (b) lack of independency of censoring or the distribution of failure times in the case of parametric models [1][2][3].
Although, the Cox regression model is an efficient strategy in analyzing survival data, but when the assumptions of this model are fail, the free assumption methods could be suitable.
Note that when several covariates and complex interactions are of concern the best method is ANN; otherwise, based on model assumptions simple regression models can be appropriately used.
In this study, simulated data sets with different rates of censoring were used to predict the outcome using ANN and traditional Cox regression models, and then the results of predictions were compared.

Cox Regression Model.
Suppose that  denotes a continuous nonnegative random variable describing the failure time of an event (i.e., time-to-event) in a system.The probability density function of ; that is, the actual survival time is ().The survival function, (), is probability that the failure occurs later than time .The related hazard function, ℎ(), denotes the probability density of an event occurring around time , given that it has not occurred prior to time .
As we know, an inherent characteristic of survival data is censoring.Right censored data which is the commonest form of censoring occurs when survival times are greater than some defined time point [1,2].The generated data used in this study contains right-censored data.
Proportional hazards model, which also called Cox regression, is a popular method in analysis of survival data.This model is presented as In this model,  is the regression coefficients vector, x is a vector of covariates, and ℎ 0 () is the baseline hazard which is unspecified and a function of time.The likelihood and partial likelihood function for right censored survival data [18], given by ( 2) and (3): , is censoring status where   = 1 if the observation is complete and   = 0 if it is censored.(  ) is defined as the risk set at time   .
To fit the Cox regression and estimate , the partial likelihood in (3) is maximized using iteratively reweighted least squares to implement the Newton-Raphson method.However, in the high-dimensional case, this approach cannot be used to estimate ; even  is not unique.

Neural Networks
Model.An ANN consists of several layers.Layers are interconnected group of artificial neurons.In addition, each layer has a weight that indicates the amount of the effect of neurons on each other.Usually an ANN model has three layers that called input, hidden (middle), and output.The input layer contains the predictors.The hidden layer contains unobservable nodes and applies a nonlinear transformation to the linear combination of input layer.The value of each hidden node is a function of the predictors.The output layer contains the outcome which is some functions of the hidden units.In hidden and output layers, the exact form of the function depends on the network type and user definition (based on response variable).
There are different methods for learning in the NN.For example, in multiple layers perceptron (MLP), which is the most commonly used, the learning performs with minimization of the mean square error of the output and by back-propagation learning algorithm [16,19,20].
In this paper, we use the activation transfer function ( ℎ ) as sigmoid function in hidden and in output layers (  ).The th response   for the predictor values  ℎ is a nonlinear function as Note that X   is th row of the input data matrix X,  ℎ is a nonlinear function of linear combination of input data,  is the vector weights of the hidden to the output units, and  is the matrix weights of the input to the hidden units.Equation ( 4) together yield the MLP model: By the sigmoid activation function, (5) can be written as below which is a nonlinear regression: where: ,  1 ,  2 , . . .,   are unknown parameter vectors, X  is a vector of known constants, and   are residuals.The parameters (weights) can be estimated by optimizing some criterion function such as maximizing the log-likelihood function or minimizing the sum of squared errors.In an MLP framework, a serious problem is overfitting.To control of the overfitting, usually a penalty term is added to the optimization criterion.To this, penalized least squares criterion for parameter estimation is given by [1]: where the penalty term is In likelihood schema which is often used in shrinkage method, an adaption of ( 7) is [8,21] The penalty weight  regulates between over-and underfitting.A best value of  is between 0.001 and 0.1 and is chosen by cross-validation [1,8].In this paper, we use (8) to get the parameter estimated.It is mentioned that, for an outcome (the response variable) with two classes  = (0/1),   is probability of event for the th patient, and the error function provides the cross-entropy error function as An ANN can be modeled as a generalized linear modeling with nonlinear predictors [8][9][10][11].Bignazoli et al. [8] introduced a method called partial logistic ANN, and Lisboa [22] developed it with fit smooth estimates of the discrete time hazard in structure.It is similar to MLP [23] with additional covariate, namely, time as an input and given by where   and  ℎ denote the number of the input and the hidden nodes, respectively. ℎ and  denote bias term in the hidden and output layers, respectively.After the estimation of the network weights, w, a single output node estimates conditional failure probability values from the connections with the middle units, and the survivorship is calculated from the estimated discrete time hazard by multiplying the conditionals for survival over time interval.Then −log(likelihood) statistics could be obtained as [8] where  is the censoring indicator function.

Model Fitting.
The ultimate goal of the learning process is to minimize the error by net.In training step, to fit the model by a fixed number of hidden nodes, we use penalized likelihood as By using this, we improve the convergence of the optimization and also control overfitting problem [1,8,9,16,21].
To identify the number of the hidden nodes and then model selection, Bayesian Information Criterion (BIC) and Network Information Criterion (NIC) [8,23,24], that is, generalization of Akaike Information Criterion (AIC), are calculated: where  is the number of the parameters estimated, and  is the number of observations in training set.The best model is with the smallest value of these criterions.In addition, to assess prediction accuracy in validation (testing) group, we calculated classification accuracy and mean square error (MSE).
The best model is selected with the smallest value of MSE.The models considered 2, 3, 4, 5, 10, 15, and 20 hidden nodes.The weight decay was considered 0.012 which is chosen based on some empirical study [25].
At finally, in order to comparison of the Cox and ANN predictions, classification accuracy and concordance indexes were calculated.All simulations and comparisons were performed by R 2.14.1.

Simulation
In order to compare the accuracy of the predictions by ANN and Cox regression, four different simulation schemes, based on Monte Carlo simulation, were used.In each schema, hazard at any time  was considered as exponential form [26], namely,  (Table 1).For each schema, 1,000 independent random observations were generated and then with the based on the relationship between exponential parameter and independent variables, survival times were generated.Afterward, the survival times were transformed as right censored.In this context, if generated time   is greater than the quantile of exponential function with parameter of , it is considered as censorship.This process was repeated 100 times.To access the accuracy of predictions, each sample is randomly divided to two parts.The first part, the training group, was consisting of 700 observations, and the 300 remainder observations were allocated to second group, that is, the testing group.Furthermore, in all simulation, the average rates of censorship were considered equal to 20%, 30%, 40%, 50%, 60%, 70%, and 80%.In addition, the models were considered with the main effects and without/with any interaction terms as simple and complex model, respectively.
In simulation 1 and 2, two covariates were used which was generated randomly from binomial and standard normal distributions.The models of these simulations were consisting of any and one interaction terms.In simulation 3, three covariates were used which was generated randomly from binomial and standard normal distributions, respectively.The model of this simulation has had two interaction terms.In simulation 4, four covariates were used which was generated randomly from binomial and standard normal distributions.The models of these simulations are complex and consist of two-, three-and four-interaction terms (Table 1).The model selection is based on BIC for learning set and SSE criterion for the testing subset data as a verification.The results in Table 2 show that the simple model performs with less hidden node but complex model performs better with more hidden nodes.The MSE values confirm these results (Table 2).
In the next step, to compare of ANN and Cox regression predictions, concordance indexes were calculated from classification accuracy table in testing subset.Concordance index was reported as a generalization of the area under receiver operating characteristic curve for censored data [27,28].This index means that the proportion of the cases that are classified correctly in noncensored (event) and censored groups and 0 to 1 values indicated as the ability of the models accuracy.The concordance index of ANN and Cox regression models was reported in Table 3.The results of simulation study in simpler model showed that there was not any different between the predictions of Cox regression and NN models.But NN predictions were better than Cox regression predictions in complex model with high rates of censoring.

Conclusion
In this paper, we presented two approaches for modeling of survival data with different degrees of censoring: Cox regression and neural network models.A Monte-Carlo simulation study was performed to compare predictive accuracy of Cox and neural network models in simulation data sets.
In the simulation study, four different models were considered.The rate of censorship in each of these models was considered from 20% up 80%.These models were considered with the main effects and also with the interaction terms.Then the ability of these models in prediction was evaluated.As was seen, in simple models and with less censored cases, there was little difference in ANN and Cox regression models predictions.It seems that for simpler models, the levels of censorship have no effect on predictions, but the predictions in more complex models depend on the levels of censorship.The results showed that the NN model for more complex models was provided better predictions.But for simpler models predictions there was not any different in results.This result was consistent with the finding from Xiang's study [26].Therefore, NN model is proposed in two cases of: (1) occurrence of high censorship (i.e., censoring rate of 60% and higher) and/or (2) in the complex models (i.e., with many covariates and any interaction terms).This is a very good result and can be used in practical issues which often are faced on with many numbers of variables and also many cases of censorship.For that reason, in these two cases the ANN strategy can be used as an alternate of traditional Cox model.Finally, it is mentioned that there are some flexible alternative methods, such as piecewise exponential and grouped time models which can be used for survival data and then its ability compared with ANN model.

Table 2 :
Results of ANN models for simulated data.

Table 3 :
Results of concordance indexes of simulation study in testing subset (300 cases with 100 replications).
*Four models with different rates of censoring.