Two Artificial Neural Networks for Modeling Discrete Survival Time of Censored Data

Artificial neural network (ANN) theory is emerging as an alternative to conventional statistical methods in modeling nonlinear functions. The popular Cox proportional hazard (PH) model falls short in modeling survival data with nonlinear behaviors. ANN is a good alternative to the Cox PH model because it requires neither the proportional hazards assumption nor model relaxations. In addition, ANN possesses a powerful capability of handling complex nonlinear relations among the risk factors associated with survival time. In this study, we present a comprehensive comparison of two different approaches to utilizing ANN in modeling a smooth conditional hazard probability function. We use real melanoma cancer data to illustrate the usefulness of the proposed ANN methods. We report some significant results in comparing the survival time of male and female melanoma patients.


Introduction
Artificial neural network (ANN) is becoming one of the most popular alternatives to conventional statistical modeling. It can be regarded as an advanced generalized linear model. ANN has been applied in many scientific fields, including engineering, economics, environmental science, and health. For example, van Hinsbergen et al. [1] in 2009 applied artificial neural networks to short-term prediction of traffic travel time. Kingston et al. [2] in 2005 proposed ANN to model water resources. In economics, Baesens et al. [3] in 2005 used ANN to predict survival time of personal loan data. Baesens et al. compared the ANN model with other survival analysis models such as logistic regression and Cox PH, and the results came out in favor of the ANN.
In the medical sciences, most of the proposed applications of ANN were prognostic models. For example, one of the most important research areas is cancer. Classifying a tumor as malignant or benign is important in cancer research. Chen et al. in 2002 used ANN to diagnose breast cancer tumors [4]. Ercal et al. in 1994 presented an ANN model to distinguish between three benign skin cancer categories and malignant melanoma [5]. However, fitting a complex nonlinear model such as ANN in regression problems is less prevalent. Using ANN to determine the risk factors that cause cancer, or to model the survival time of a patient once he or she is diagnosed with cancer, is less common.
In the present study, we are interested in utilizing ANN in survival time modeling of skin cancer (melanoma) patients. Soong et al. [6] in 2010 developed a statistical model to predict the survival time of localized melanoma patients. They used the proportional hazard model developed by Cox [7], but the assumption of hazard function proportionality may not hold for a different set of data. Moreover, they did not study the effect of interaction terms. Thus, ANN is more appropriate and efficient, especially when the data do not satisfy the Cox PH assumptions. ANN does not require any assumptions that need to be justified, and it is more precise in fitting nonlinear models [8][9][10]. One of the basic approaches to utilizing ANN in survival analysis is classification, that is, predicting whether or not a patient will survive over a fixed time interval [11]. However, this classification method provides no estimate of the survival probability function. In 1995, Lapuerta et al. proposed the use of multiple neural networks, one for each time interval [12]. This model predicts the survival probability of each time period based on a neural network trained on the observations of the same time period only. The pitfall of this approach is the large number of networks that must be trained if one studies the survival time over a large number of time intervals.
Other methods of applying ANN to survival time were proposed by Faraggi and Simon in 1995 [8] and by Ohno-Machado in 1996 [13]. We consider in this study the approach presented by Biganzoli et al. in 1998 [9], which was a modification of a study done by Ravdin and Clark in 1992 [14], in addition to the approach presented by Mani et al. in 1999 [10]. Thus, in this study a comparison between the methods of Biganzoli et al. and Mani et al. is given. We also study the difference between the survival time of male and female melanoma patients.
In the following section, we discuss the data used to perform our comparison, along with significant results exhibiting the differences between male and female melanoma patient survival times. In the third section we briefly discuss the two methods, emphasizing the differences, advantages, and disadvantages of both. In the fourth section we present our results and identify the model that gave the best performance in estimating the survival probability function with the least error.

Materials and Methods
2.1. Data. We have data on 130,006 patients diagnosed with melanoma between the years 2000 and 2009 in the USA. The data were collected from 13 registries of the Surveillance, Epidemiology, and End Results (SEER) program [15]. We filter this large dataset to retain only records with complete information on the patient's age at diagnosis, tumor thickness, stage of cancer, and ulceration. Soong et al. [6] in 2010 used these four variables, but their study did not consider the difference between male and female survival. We found a significant difference between the median survival time of males and females at the 5% level of significance using the Kruskal-Wallis test. Thus, studying the effect of gender on survival time by fitting one model for both males and females is not statistically correct, as the survival times of males and females do not have the same distribution. Figure 1 represents a schematic diagram of the distribution of the complete data with respect to gender and cancer stage.
We train a neural network model separately for males and females. Figure 1 also includes the total number of patients with complete information who were diagnosed with melanoma between the years 2000 and 2009. Patients were either alive at the end of the 10-year period (censored), lost to follow-up during the ten-year interval (censored), or died because of melanoma (uncensored); 95% of male patients and 98.1% of female patients were censored. We omitted patients with incomplete information. For training purposes, the data for each gender were divided into six groups; five are used in the cross validation technique to train and validate the neural network model, while the last group is used for prediction and for carrying out the comparison between the two modeling approaches. More details about the cross validation follow in the next section.
In our modeling procedure, we used age at diagnosis and tumor thickness (in millimeters) as quantitative variables along with three dummy variables representing stage 1 (referring to a localized tumor), stage 2 (referring to a regional tumor), and stage 3 (referring to a distant tumor). The base level is in situ tumor.
[16]. Today, the multilayer perceptron (MLP) is what is commonly meant by a neural network; it consists of multiple layers of neurons. The first layer (input layer) represents the covariates or the risk factors, which are the inputs of the hidden neurons in the first hidden layer. The output of the hidden layer is the input of another hidden layer (if more than one hidden layer exists) or the output layer. This type of MLP is called a feed forward artificial neural network (FFANN). We shall discuss both methods and compare them in the current study using FFANN with one hidden layer.

Method 1: PLANN.
Partial logistic artificial neural network (PLANN) is the approach that was introduced by Biganzoli et al. [9]. PLANN is a three-layer feed forward artificial neural network with one output unit in the output layer. The activation function used in the hidden and output layers is the logistic function given by

$$f(z) = \frac{1}{1 + e^{-z}}. \tag{1}$$

PLANN estimates the conditional hazard function that is based on the discrete survival method. The discrete survival method was introduced by Allison [17] in 1982 and then Singer and Willett [18] in 1993. The discrete survival method considers grouping the continuous survival time into $l = 1, 2, \ldots, L$ disjoint intervals, in which the individual records will be replicated $l_i$ times, where $l_i$ is the number of the time period in which the event occurred. The discrete hazard probability function for time period $l$ given a vector of covariates $\mathbf{x}_i$ is given by

$$h_l(\mathbf{x}_i) = h(\mathbf{x}_i, t_l) = P(T = l \mid T \ge l, \mathbf{x}_i), \tag{2}$$

where $T$ is the discrete survival time and $t_l$ represents the time input for time period $l$. To estimate the conditional hazard in (2), PLANN uses a three-layer FFANN with the activation function for the hidden and output layers given in (1). The output of the network with $H$ hidden units is given by

$$\hat{h}(\mathbf{x}_i, t_l) = f\left(a + \sum_{h=1}^{H} w_h \, f\left(a_h + \sum_{k} w_{kh} z_{ilk}\right)\right), \tag{3}$$

where the sum over $k$ runs over the components $z_{ilk}$ of the input vector $(\mathbf{x}_i, t_l)$, $w_{kh}$ and $w_h$ are the weights of the ANN to be estimated for the first layer and second layer, respectively, and $a_h$ and $a$ are the weights for the bias connection with the hidden units and with the output unit, respectively. The target of this network is the censoring indicator $d_{il}$, which is equal to 1 if the event occurred for subject $i$ in time period $l$ and 0 otherwise. The cost function used in PLANN is the cross entropy function, which is appropriate for binary classification problems [19]. The weights of PLANN can be estimated by minimizing the cost function given by

$$E = -\sum_{i=1}^{n} \sum_{l=1}^{l_i} \left[ d_{il} \log \hat{h}(\mathbf{x}_i, t_l) + (1 - d_{il}) \log\left(1 - \hat{h}(\mathbf{x}_i, t_l)\right) \right], \tag{4}$$

where $n$ is the number of subjects. Once the network weights are estimated, the monotone survival probabilities can be easily found by converting the discrete hazard rate estimates obtained from the network output by the following equation:

$$\hat{S}(t_l \mid \mathbf{x}_i) = \prod_{j=1}^{l} \left(1 - \hat{h}(\mathbf{x}_i, t_j)\right). \tag{5}$$

The advantage of this approach is that time dependent covariates can easily be introduced in the model, as the individual records are available for each time period. However, for large datasets or studies conducted over a long period of time this approach is impractical due to the immense number of replications required [3]. Figure 2 shows the architecture of the PLANN introduced by Biganzoli.
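The record replication and the hazard-to-survival conversion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `expand_person_period` and `hazards_to_survival`, the example patient, and the hazard values are hypothetical, and the sketch assumes that the target is 1 only in the final period of a subject who experienced the event.

```python
from typing import List, Sequence, Tuple


def expand_person_period(
    covariates: Sequence[float], last_period: int, event: int
) -> List[Tuple[List[float], int, int]]:
    """Replicate one subject once per period in which the subject is at risk.

    Each row is (covariates, period, target). The target is 1 only in the
    last period of a subject whose event was observed (assumption).
    """
    rows = []
    for period in range(1, last_period + 1):
        target = 1 if (event == 1 and period == last_period) else 0
        rows.append((list(covariates), period, target))
    return rows


def hazards_to_survival(hazards: Sequence[float]) -> List[float]:
    """Survival after each period: running product of (1 - hazard)."""
    survival, running = [], 1.0
    for h in hazards:
        running *= 1.0 - h
        survival.append(running)
    return survival


# Hypothetical patient (age 52, thickness 1.2 mm) who died in period 3.
rows = expand_person_period([52.0, 1.2], last_period=3, event=1)
print([(period, target) for _, period, target in rows])

# Hypothetical hazard estimates for three periods.
print(hazards_to_survival([0.1, 0.2, 0.5]))
```

Because every factor (1 - hazard) lies between 0 and 1, the resulting survival estimates cannot increase over time, which is why the conversion yields monotone survival probabilities. The sketch also makes the cost of replication visible: a subject followed for many periods contributes that many rows to the training set.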
The first layer in Figure 2 contains the bias, one node for the time period, and the rest of the nodes for the covariates. PLANN uses one input for the time to estimate smooth discrete hazard rates. However, we have used 10 nodes for the time (one for each time period) to be able to compare it with the second method.

Method 2.
Mani et al. developed an approach based on the one utilized by Street [20], which predicts the survival function using a neural network with $L$ outputs, where $L$ is the number of time periods. Street trained his network utilizing a target vector derived from Kaplan-Meier survival curves [21]. Mani used the same neural architecture as Street, but to estimate the hazard function instead. In order to estimate the hazard function, each individual or subject would have a training vector (1 by $L$) target of hazard probabilities $h_{il}$ as follows:

$$h_{il} = \begin{cases} 0, & \text{for } 1 \le l < t, \\ 1, & \text{for } t \le l \le L \text{ and event} = 1, \\ d_l / n_l, & \text{for } t \le l \le L \text{ and event} = 0. \end{cases} \tag{6}$$

Here, $h_{il} = 0$ for each time period if patient $i$ survived. $h_{il} = 1$ from time interval $t$ to $L$ if the patient died because of melanoma at duration $t$ within the study time. For those patients who are lost to follow-up during the study at duration $t < L$, the hazards are equal to the ratio $d_l / n_l$, which is the Kaplan-Meier hazard estimate for time interval $l$; $d_l$ is the number of patients who died because of melanoma in time period $l$, and $n_l$ is the number of patients that are at risk in time interval $l$. For training the neural network, Mani used the logistic sigmoid function given by

$$f(z) = \frac{1}{1 + e^{-z}}. \tag{7}$$

The network weights are estimated by minimizing the cost function, which is the cross entropy function. Figure 3 shows the neural network architecture utilized in Method 2. The number of units $K$ of the input layer is equivalent to the number of independent variables or risk factors. The $L$ output units of the output layer learn to estimate the hazard probability of each individual. Once the ANN is trained and the hazard estimates are predicted, we convert those hazard estimates to the survival estimates by using (5) (for each method). We have trained the weights of the ANN in both methods using the quasi-Newton algorithm.
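The construction of the hazard target vectors for Method 2 can be sketched as follows. This is an illustrative reading of the description above rather than the code of Mani et al.: the names `km_hazards` and `mani_target`, the three status labels, and the toy follow-up data are hypothetical. The sketch also assumes that, for a patient lost to follow-up in period t, the Kaplan-Meier hazard is used from period t itself onward.

```python
from typing import List, Sequence


def km_hazards(
    last_periods: Sequence[int], events: Sequence[int], n_periods: int
) -> List[float]:
    """Kaplan-Meier hazard d_l / n_l for each period l = 1..n_periods."""
    hazards = []
    for l in range(1, n_periods + 1):
        at_risk = sum(1 for t in last_periods if t >= l)
        deaths = sum(
            1 for t, e in zip(last_periods, events) if t == l and e == 1
        )
        hazards.append(deaths / at_risk if at_risk > 0 else 0.0)
    return hazards


def mani_target(t: int, status: str, km: Sequence[float]) -> List[float]:
    """Target hazard vector of length L for one patient.

    status is "alive" (survived the whole study), "died" (event in
    period t), or "lost" (lost to follow-up in period t).
    """
    if status not in ("alive", "died", "lost"):
        raise ValueError("status must be 'alive', 'died', or 'lost'")
    target = []
    for l in range(1, len(km) + 1):
        if status == "alive" or l < t:
            target.append(0.0)          # no event observed in this period
        elif status == "died":
            target.append(1.0)          # hazard fixed at 1 from period t on
        else:
            target.append(km[l - 1])    # Kaplan-Meier hazard after censoring
    return target


# Toy follow-up data: last observed period and event flag for five patients.
last_periods = [1, 2, 2, 3, 3]
events = [1, 0, 1, 0, 0]
km = km_hazards(last_periods, events, n_periods=3)
print(km)
print(mani_target(2, "died", km))
print(mani_target(2, "lost", km))
print(mani_target(3, "alive", km))
```

Unlike the record replication used in PLANN, each patient here contributes a single training row with a vector-valued target, so the size of the training set does not grow with the number of time periods.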

Model Selection. We are now concerned with the optimal number of hidden units in the hidden layer that will give us the best neural network model. There are several methods in the literature that we can use to select the best neural network. The most popular is the V-fold cross validation method, as it does not rely on any probabilistic assumptions and helps in determining when overfitting occurs. Other statistical methods such as hypothesis testing or information criteria were introduced and examined by Anders and Korn in 1999 [22] for neural network model selection, and they suggested that those statistical methods should take part in neural network modeling. However, since their proposed methods were based on certain probabilistic assumptions, they may not always be applicable in modeling real phenomena.
In order to do our comparison we took the best neural network for each method and then tested their performance on the same set of data (this set of data was removed from the training dataset). In the current study, we have used 5-fold cross validation to select the best model for each method (Methods 1 and 2). We divide the male and female datasets into six groups. Five were used in the training and validation, and the last group (the hold-out dataset) was used for comparing the best models from the two methods. In addition, we use weight decay, which helps avoid overfitting by penalizing large weight solutions and thereby improves generalization. As mentioned by Ripley [23,24], a weight decay value between λ = 0.01 and 0.1 would be more appropriate depending on the degree of fit that is expected. We have used the cross validation method along with four different values of weight decay, λ = {0.025, 0.05, 0.075, 0.1}, to pick the best model. The same procedure of trying different weight decay values was used in [9].
The cross validation method will help us in finding the optimal number of hidden nodes. In addition, we consider the model with the lowest prediction error when applied to new data. Therefore, for each method, we picked the best model (with the lowest cross validation error) and then compared its performance on the hold-out dataset. We repeated our comparison for the four values of weight decay, since for the same data two factors affect ANN performance (the number of hidden units and the weight decay value).
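The selection procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the names `six_way_split`, `select_hidden_units`, and `toy_score` are hypothetical, the grid of hidden unit counts is arbitrary, and the scoring function stands in for training a network on the training indices and returning its error on the validation indices.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def six_way_split(n: int, seed: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Shuffle indices 0..n-1 into six groups: five CV folds and a hold-out."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    groups = [indices[g::6] for g in range(6)]
    return groups[:5], groups[5]


def select_hidden_units(
    folds: Sequence[Sequence[int]],
    hidden_grid: Sequence[int],
    decay_grid: Sequence[float],
    fit_and_score: Callable[[List[int], List[int], int, float], float],
) -> Dict[float, Tuple[int, float]]:
    """For each weight decay value, pick the hidden unit count with the
    lowest mean validation error over the folds."""
    best: Dict[float, Tuple[int, float]] = {}
    for decay in decay_grid:
        for n_hidden in hidden_grid:
            errors = []
            for v, val_fold in enumerate(folds):
                train_idx = [
                    i for f, fold in enumerate(folds) if f != v for i in fold
                ]
                errors.append(
                    fit_and_score(train_idx, list(val_fold), n_hidden, decay)
                )
            mean_error = sum(errors) / len(errors)
            if decay not in best or mean_error < best[decay][1]:
                best[decay] = (n_hidden, mean_error)
    return best


def toy_score(train: List[int], val: List[int], n_hidden: int, decay: float) -> float:
    """Placeholder error surface (assumption); a real version would train an ANN."""
    return abs(n_hidden - 5) + decay


folds, holdout = six_way_split(60)
chosen = select_hidden_units(folds, [2, 5, 8], [0.025, 0.05, 0.075, 0.1], toy_score)
print(chosen)
```

Keeping one selected model per weight decay value mirrors the design described above: each method yields four candidate models, and the hold-out group, which never enters the cross validation loop, is reserved for the final comparison.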

Results
In our analysis, we used ten time intervals (12 months each). To compare PLANN with Mani's method, we used 10 time inputs in PLANN (one for each time interval) instead of one, so that the output of PLANN can be compared with that of the second method.
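One plausible way to realize the ten time inputs is an indicator (one-hot) encoding of the time interval; the text does not state the exact encoding, so this is an assumption. The sketch below also assumes that interval l covers months 12(l - 1) up to, but not including, 12l, and that later times fall in the last interval. The names `months_to_period` and `one_hot_period` are hypothetical.

```python
from typing import List


def months_to_period(months: float, width: int = 12, n_periods: int = 10) -> int:
    """Map follow-up time in months to a period index 1..n_periods."""
    if months < 0:
        raise ValueError("months must be non-negative")
    period = int(months // width) + 1
    return min(period, n_periods)  # times beyond the study fall in the last period


def one_hot_period(period: int, n_periods: int = 10) -> List[int]:
    """Indicator vector with a single 1 at the position of the period."""
    if not 1 <= period <= n_periods:
        raise ValueError("period out of range")
    return [1 if l == period else 0 for l in range(1, n_periods + 1)]


# A hypothetical follow-up time of 30 months falls in the third interval.
period = months_to_period(30)
print(period, one_hot_period(period))
```

With this encoding each time interval receives its own set of first-layer weights, in contrast to the single time input of the original PLANN, which ties the hazards of neighboring intervals together through one shared input.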
After training, the cross validation method resulted in choosing the networks with 52 hidden units (number of hidden nodes seems to be large but, by taking into account the number of output units, we have 10 parallel networks with five hidden nodes each) for both methods as the best model.We obtain similar results from ANN trained with the four different values of weight decay.Still, we want to examine the prediction accuracy for all eight available models to choose our best-fit model.Table 1 exhibits the comparison between the eight models for male melanoma patients and Table 2 exhibits the comparison for the female melanoma patients.
It is clear from Table 1 that Mani's method yields a better predictive neural network model than the PLANN proposed by Biganzoli et al. Among the four competing models of Mani's method, we have chosen the model with weight decay λ = 0.1 as the best-fit model for predicting survival times of male melanoma patients, as it yields the smallest mean error and standard deviation.
The results in Table 2 support the conclusion reached for male melanoma patients that Mani's method has better predictive accuracy than the PLANN. However, the best model for predicting the survival time of female melanoma patients is the model with weight decay value λ = 0.075. It is clear to us that male and female melanoma patients need to be treated differently, as shown by the survival plots in Figure 4, which displays the surface plot of survival probability of males (Figure 4(a)) and females (Figure 4(b)). In this figure, the survival is estimated as a function of age at diagnosis and time in years. The tumor thickness is 0.58 mm and the ulceration variable is set to 0, considering that the patient was diagnosed in the initial stage. Male infant patients have a lower survival probability than female infant patients over a 10-year period, whereas a male patient at the age of 40 to 50 seems to have higher survival probabilities compared to female patients of the same age.
Figure 5 displays the surface plot of survival results for male melanoma patients for tumor thickness ranging from 0.01 mm to 9 mm. The surface plot in Figure 5(a) is for a male patient diagnosed at 20 years of age, whereas the plot in Figure 5(b) is for a male patient diagnosed at the age of 60 years. As we can see, the survival estimate for young men is clearly lower than that of older men, and these findings are similar to those of a recent study by Fisher and Geller in 2013 [27].
Fisher and Geller mentioned that more attention was given to older men over the past years and suggested that more awareness needs to be directed to young men to help in early detection of melanoma. They also mentioned the difference between young men and young women, which we can see by comparing the two left plots of Figures 5 and 6. The survival probability for young men (diagnosed with tumor thickness larger than 4 mm) within two years of diagnosis is very low (almost 0) compared to that of young women. Some of our significant findings were similar to those found in another study by Gamba et al. in 2013 [28]. However, more investigation and statistical data analysis are required to better understand the causes of the differences between young males and females and to plan new strategies to fight the major pernicious form of skin cancer (melanoma) [29].

Figure 1: Distribution of complete information of melanoma patients.

Figure 2: FFANN for partial logistic artificial neural network, with three layers. The input layer has K covariates, the hidden layer has H hidden units, and the output layer has one output unit. The activation function used in both the hidden and output layers is the logistic function (1).

Figure 3: Three-layer network with L output units in the output layer, where L is equal to the number of time intervals.

Figure 4: Survival probability function surface plot results for age: (a) for male melanoma patients and (b) for female melanoma patients. The survival probability is estimated as a function of time (ten-year period) and age. Other risk factors like tumor thickness are fixed at 0.58 mm with no ulceration and in the initial stage.

Figure 5: Survival probability function surface plot results for tumor thickness: (a) for a male patient diagnosed at the age of 20 years and (b) for a male patient diagnosed at the age of 60 years.

Figure 6: Survival probability function surface plot results for tumor thickness: (a) for a female patient diagnosed at the age of 20 years and (b) for a female patient diagnosed at the age of 60 years.

Table 1: Mean prediction error for the eight competing neural network models for estimating the survival time of male melanoma patients.

Table 2: Mean prediction error for the eight competing neural network models for estimating the survival time of female melanoma patients.
Regarding learning techniques for ANN, Lisboa et al. [25] have amended the PLANN by adapting the Bayesian learning for neural networks developed by Mackay in 1995 [26]. It is still an open problem how Bayesian learning will affect the performance of Mani's ANN, and whether it will change the comparison results with the PLANN. These are among the questions that remain to be answered, and they open further areas of research on this type of problem.