Model-Independent Evaluation of Tumor Markers and a Logistic-Tree Approach to Diagnostic Decision Support

The sensitivity and specificity of individual tumor markers rarely meet clinical requirements. This challenge has given rise to many efforts, e.g., combining multiple tumor markers and employing machine learning algorithms. However, results from different studies are often inconsistent, which is partially attributable to the use of different evaluation criteria. In addition, the wide use of model-dependent validation leads to a high risk of overfitting when complex models are used for diagnosis. We propose two model-independent criteria, namely, area under the curve (AUC) and Relief, to evaluate the diagnostic values of individual and multiple tumor markers, respectively. For diagnostic decision support, we propose the use of a logistic-tree, which combines decision tree and logistic regression. Application to a colorectal cancer dataset shows that the proposed evaluation criteria produce results that are consistent with current knowledge. Furthermore, the simple and highly interpretable logistic-tree achieves diagnostic performance that is competitive with more complex models.


INTRODUCTION
Serum-based tumor markers are claimed to be "most ideal for early diagnosis of cancer" [1]. As such, researchers have applied a wide array of machine learning methods and statistical tools to multiple tumor markers in pursuit of accurately predicting tumor types. Although promising outcomes have been achieved, results from different studies are often inconsistent. In this paper, we investigate the origins of this inconsistency in three aspects, namely, evaluation criteria, evaluation procedures, and predictive models. Based on the findings, we propose proper criteria to evaluate the diagnostic value of tumor markers and a new predictive model, with the objective of providing an unbiased, effective, and interpretable tumor marker-based cancer diagnosis procedure.
Early diagnosis significantly increases the chance of successful cancer treatment. Conventional computed tomography (CT) and transabdominal ultrasonography are often employed for initial imaging in early diagnosis [2]. However, CT devices and operations are too costly to be deployed for large populations, and CT cannot detect tumors that are still small. A great deal of effort has therefore been focused on serum-based tumor markers, whose ease of operation and cost advantage give them the potential to be applied to large asymptomatic populations. In the past decades, an increasing number of tumor markers have been discovered and reported to be related to specific cancers. For instance, Ni et al. [3] reported more than 10 tumor markers for pancreatic cancer. Sun et al. [4] evaluated 12 tumor markers on hundreds of patients with diverse cancer types including lung cancer, breast cancer, and liver cancer. The use of individual tumor markers for cancer diagnosis, however, is limited, since the resulting sensitivity and specificity hardly meet the strict clinical requirements. For example, the 10 reported markers in [3] were found to lack sufficient sensitivity and specificity for diagnosis. Therefore, researchers resorted to combining multiple tumor markers and employing machine learning algorithms to improve diagnosis accuracy. However, outcomes were found to be inconsistent across different studies. Tumor markers initially reported to be useful may fail to yield consistently good outcomes in subsequent studies, and different studies have drawn contradictory conclusions about the optimal tumor markers for the same type of cancer. This inconsistency could be explained by a number of factors such as poor study design, assays that are not standardized, and inappropriate or misleading statistical analyses [5]. The problems of evaluation criteria, evaluation procedures, and predictive models could potentially be the cause of the inconsistency, and they are examined in turn below.
In many earlier studies, the individual diagnostic value of a tumor marker is determined by sensitivity and specificity, calculated by labeling patients as elevated or un-elevated based on suggested cutoff points. Unfortunately, an individual tumor marker is rarely found to be clinically acceptable due to low sensitivity, which stems from the fact that elevation of a tumor marker is frequently found in benign disorders and normal individuals [6]. Researchers have also combined multiple tumor markers by parallel tests [2,3] and serial tests [7]. In a parallel test, patients are labeled as elevated as long as any one of the serum levels is above its cutoff point; in a serial test, patients are labeled as elevated only when all tumor marker levels are above the corresponding cutoff points. These two procedures have been widely used due to their simplicity and interpretability. However, strictly adhering to the suggested cutoff points has two drawbacks: 1) the cutoff points are determined on different patient groups in different studies and thus may represent different percentiles of benign controls [8]; and 2) referring to only one fixed cutoff point does not take full advantage of the diagnostic information conveyed by the continuous values. The former drawback can be illustrated by the following example. Based on the same suggested cutoff point, Moghimi et al. [2] calculated the sensitivity and specificity of the tumor marker CEA for pancreatic cancer as 83.78% and 69.44%, respectively, whereas Ni et al. [3] obtained 45% for sensitivity and 75% for specificity.
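Under the standard convention, a parallel test is positive when any marker is elevated (favoring sensitivity), and a serial test is positive only when all markers are elevated (favoring specificity). These two rules can be sketched as follows; the marker names and cutoff points here are hypothetical, not clinically validated values.

```python
# Illustrative parallel/serial combination of tumor markers.
# Cutoff points below are hypothetical, not clinically validated.
CUTOFFS = {"CEA": 5.0, "CA199": 37.0}

def parallel_test(levels):
    """Elevated if ANY marker exceeds its cutoff (higher sensitivity)."""
    return any(levels[m] > c for m, c in CUTOFFS.items())

def serial_test(levels):
    """Elevated only if ALL markers exceed their cutoffs (higher specificity)."""
    return all(levels[m] > c for m, c in CUTOFFS.items())

patient = {"CEA": 6.2, "CA199": 20.0}  # CEA elevated, CA199 not
print(parallel_test(patient))  # True
print(serial_test(patient))    # False
```

Both rules still depend on a single fixed cutoff per marker, which is exactly the limitation discussed above.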
From the standpoint of machine learning, assessing the diagnostic value of tumor markers can be viewed as the procedure of building a predictive model and evaluating the discriminative power of independent variables (e.g., tumor markers, sex, and age) with respect to the response variable (e.g., tumor type) based on prediction accuracy. Parallel and serial tests can be viewed as simple predictive models taking specificity and sensitivity as the criteria of prediction performance. Their prediction procedures are rather "rough" in combining multiple independent variables and are not suited to accommodating the contradictory effects on sensitivity and specificity. By this rationale, several models originating from machine learning, such as support vector machines (SVM) [4], classification trees [9], evolutionary algorithms [10-12], and neural networks [9, 13-17], have been widely applied both for evaluation of diagnostic values and for diagnostic decision support. Sun et al. [4] applied SVM to a sample set with 12 tumor markers and found that, compared with the parallel test, SVM achieved a larger increase in specificity at the expense of a smaller decrease in sensitivity. Poon et al. [9] applied classification trees and neural networks to the diagnosis of the hepatocellular carcinoma subgroup without elevated tumor marker AFP; they concluded that neural networks appeared to outperform classification trees and demonstrated their potential for recognizing subtle diagnostic patterns across multiple tumor markers. Flores-Fernández et al. [17] used neural networks to select CEA, CA125, and CRP from 14 biomarkers as an optimized set for lung cancer diagnosis. However, complex models such as SVM and neural networks may not be the best choices for capturing the simple dependency between tumor markers and tumor type. An inappropriate evaluation procedure that lacks independent model validation is another source of misleading results. Carpelan-Holmström et al. [8] took the AUC derived from the probabilities returned by a logistic regression model trained on the whole dataset as the evaluation criterion, ignoring the model's bias toward overoptimistic results. This could be avoided by training a model on a training dataset and then evaluating it on an independent validation dataset.
In light of these observations and analyses, we argue that a good evaluation criterion for the diagnostic value of tumor markers should be model-independent, serve universally under different scenarios, and consider multivariate diagnostic value. We propose area under the curve (AUC) and Relief weights as the evaluation criteria for univariate and multivariate diagnostic values, respectively. A simple model termed logistic-tree is developed specifically for diagnostic decision support. Combining the advantages of the classification tree and logistic regression, a logistic-tree is able to present an understandable diagnosis procedure, explore multi-variable interaction effects, and output continuous probabilities.

Univariate Evaluation Criterion
AUC has been applied to cancer diagnosis in several studies [18-21]. In this paper, AUC serves as the evaluation criterion both for the diagnostic values of individual markers and for the performance of predictive models. The advantages of the ROC curve and AUC for cancer diagnosis can be summarized as follows: 1) the ROC curve allows clinicians to choose any cutoff point by which a certain requirement for sensitivity (or specificity) can be guaranteed, 2) the diagnostic information conveyed by continuous values can be fully utilized, and 3) it is also applicable to the continuous probabilities returned by predictive models.
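The AUC of a single marker can be computed directly from the raw values, without fitting any model, via its rank-statistic (Mann-Whitney) interpretation: it equals the probability that a randomly chosen case scores higher than a randomly chosen control. A minimal sketch, with invented marker values:

```python
def auc(pos_values, neg_values):
    """AUC = P(pos > neg) + 0.5 * P(pos == neg) over all positive/negative pairs."""
    wins = ties = 0
    for p in pos_values:
        for n in neg_values:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_values) * len(neg_values))

# hypothetical log-transformed marker values for malignant and benign groups
malignant = [2.3, 1.8, 3.1, 0.9]
benign = [0.7, 1.1, 0.4, 1.9]
print(auc(malignant, benign))  # 0.8125
```

This pairwise form is O(n_pos * n_neg); the equivalent rank-sum formulation scales better on large samples, but the value is identical.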

Multivariate Evaluation Criterion
Multivariate dependency represents the discriminative power of multiple variables considered jointly with respect to the response variable. Beyond individual discriminative power, it includes two more factors of dependency: interaction effect and redundancy. An interaction effect in cancer diagnosis is illustrated by the phenomenon that high diagnosis accuracy is achieved by considering two tumor markers together, although each one alone has weak diagnostic value. Redundancy is explained as follows: when the information carried by two tumor markers is highly correlated, combining the two markers will not improve diagnosis accuracy compared to using only one of them. It is preferable to remove redundant variables for the purpose of identifying an optimal subset of discriminative variables.
Good at detecting interaction effects, the Relief algorithm [22] is a variable estimation algorithm with a natural interpretation and many successful applications. Its pseudocode is presented as follows:

Input: training dataset comprised of m variables and a response variable with two classes.
Output: Relief weights W[1..m] for the m variables.
    initialize W[j] = 0 for j = 1, ..., m
    for i = 1 to n:
        randomly select a sample S_i
        find its nearest hit H and nearest miss M
        for j = 1 to m:
            W[j] = W[j] + diff(j, S_i, M)/n - diff(j, S_i, H)/n

The input to the Relief algorithm is a dataset characterized by m variables and a binary response variable. The variables are first standardized to jointly construct an m-dimensional variable space. In each iteration, a sample is randomly selected. The nearest hit is then identified, defined as the sample whose position in the variable space is nearest to the selected sample and which has the same class. Analogously, the nearest miss is defined as the sample whose position in the variable space is nearest to the selected sample but which has a different class. Intuitively, the variables whose values differ more between the selected sample and its nearest miss are thought to contribute more to differentiating them. Therefore, the difference of variable values between the selected sample and its nearest miss is referred to as positive discriminative power. Likewise, for the selected sample and the nearest hit, discriminative variables are supposed to have close values [23]; the difference of variable values in this situation is called negative discriminative power. Following this rationale, the Relief weights are updated by adding the positive discriminative power and subtracting the negative discriminative power, each weighted by 1/n, and they gradually evolve by iterating this procedure.
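The weight update can be sketched in code. This is a minimal illustration for continuous variables (assumed pre-standardized) using Manhattan distance; it is not a full reimplementation of [22], and the toy data are invented.

```python
import random

def relief(X, y, n_iter=100, seed=0):
    """Return one weight per variable; a higher weight means more discriminative."""
    rng = random.Random(seed)
    m = len(X[0])
    w = [0.0] * m

    def dist(a, b):  # Manhattan distance in the variable space
        return sum(abs(a[j] - b[j]) for j in range(m))

    for _ in range(n_iter):
        i = rng.randrange(len(X))
        s = X[i]
        # nearest hit: closest other sample of the same class
        hit = min((X[k] for k in range(len(X)) if k != i and y[k] == y[i]),
                  key=lambda x: dist(s, x))
        # nearest miss: closest sample of the other class
        miss = min((X[k] for k in range(len(X)) if y[k] != y[i]),
                   key=lambda x: dist(s, x))
        for j in range(m):
            # add positive discriminative power, subtract negative
            w[j] += (abs(s[j] - miss[j]) - abs(s[j] - hit[j])) / n_iter
    return w

# toy data: variable 0 separates the classes, variable 1 does not
X = [[0.1, 0.5], [0.2, 0.4], [0.9, 0.5], [0.8, 0.6]]
y = [0, 0, 1, 1]
w = relief(X, y)
print(w[0] > w[1])  # True: variable 0 receives the larger weight
```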
It should be noted that the Relief algorithm is incapable of detecting redundant variables. However, since the number of tumor markers in a cancer diagnosis application is low and they are rarely perfectly correlated in reality, the diagnostic information that redundant variables bring will probably not be offset by the redundancy effect. Therefore, redundant variables are unlikely to degrade the efficiency or effectiveness of predictive models. On the other hand, the redundancy between tumor markers merits analysis with domain knowledge and can be helpful if used wisely. For example, if two tumor markers are found to be related, their relationship may help reveal valuable biomedical information about the interaction between the two cell types producing them. For practical cancer diagnosis, domain knowledge can help choose the tumor marker whose level is easier to measure in order to lower the experimental cost.

Logistic-tree Model
Many seemingly "sophisticated" models such as SVM and neural networks may not be suitable for cancer diagnosis. The predominant reason lies in the assumption of a simple structure underlying the tumor markers and tumor type. Normally the number of tumor markers found to be informative is less than a dozen. A complicated dependency between tumor markers and tumor type is therefore unlikely, so models designed to discover complex structure have a higher probability of overfitting the dataset than simple models. The second reason is that complicated prediction procedures, often described as "black boxes," are uninterpretable for domain experts, as the process and results cannot be explained and validated by domain knowledge. This hinders the acceptance of these models by clinicians. Logistic regression models and classification trees can overcome this problem. With sound statistical justification and straightforward prediction procedures, "what" they predict and "how" they predict can be easily interpreted with domain knowledge and validated by statistical methods.
In cancer diagnosis applications, predictive models normally act as decision support, so continuous probabilities of tumor type are favored for follow-up diagnosis and interpretation. A logistic regression model produces continuous probabilities ranging from 0 to 1, while a classification tree provides discrete probabilities, since all instances residing in the same terminal node share the same probability irrespective of their different continuous variable values. On the other hand, the prediction mechanism of the classification tree helps detect interaction effects of multiple independent variables, which happens to be the weakness of the logistic regression model. To combine their respective strengths, we design a model termed logistic-tree by merging the logistic regression model and the classification tree in the following way: after the pruned tree has been built, a logistic regression model is developed on each of its terminal nodes. The advantages of the logistic-tree can be characterized as 1) presenting an understandable prediction procedure, 2) detecting interaction effects, and 3) providing continuous probabilities. Its diagnostic power will be assessed by comparing it with other models in the experimental study.
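To make the construction concrete, the following sketch uses a single hypothetical split (standing in for the pruned tree) and then fits a univariate logistic regression in each terminal node by plain gradient descent. The variables, split value, and patient records are illustrative stand-ins, not the actual fitted model from this study.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def fit_logistic(xs, ys, lr=0.5, n_iter=3000):
    """Univariate logistic regression by gradient descent on standardized x."""
    mx = sum(xs) / len(xs)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = _sigmoid(b0 + b1 * (x - mx) / sx)
            g0 += p - y
            g1 += (p - y) * (x - mx) / sx
        b0 -= lr * g0 / len(xs)
        b1 -= lr * g1 / len(xs)
    return mx, sx, b0, b1

def fit_logistic_tree(data, split_var, split_val, leaf_var, target):
    """One split (a stand-in for the pruned tree), then a logistic model per leaf."""
    leaves = {
        "left": [d for d in data if d[split_var] <= split_val],
        "right": [d for d in data if d[split_var] > split_val],
    }
    models = {k: fit_logistic([d[leaf_var] for d in v], [d[target] for d in v])
              for k, v in leaves.items()}

    def predict(d):  # continuous probability of the positive class
        mx, sx, b0, b1 = models["left" if d[split_var] <= split_val else "right"]
        return _sigmoid(b0 + b1 * (d[leaf_var] - mx) / sx)
    return predict

# invented patients: log-CEA value, age, malignant status
data = [
    {"CEA": 0.5, "age": 40, "malignant": 0},
    {"CEA": 0.6, "age": 70, "malignant": 1},
    {"CEA": 0.4, "age": 45, "malignant": 0},
    {"CEA": 0.8, "age": 65, "malignant": 1},
    {"CEA": 1.5, "age": 50, "malignant": 1},
    {"CEA": 2.0, "age": 55, "malignant": 1},
    {"CEA": 1.8, "age": 60, "malignant": 1},
    {"CEA": 1.2, "age": 42, "malignant": 0},
]
predict = fit_logistic_tree(data, "CEA", 1.0, "age", "malignant")
print(predict({"CEA": 0.4, "age": 45}))  # low CEA, younger: low probability
print(predict({"CEA": 0.6, "age": 70}))  # low CEA, older: high probability
```

Unlike a plain tree, samples in the same leaf receive different probabilities depending on their continuous variable values.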

Dataset
The colorectal cancer dataset [24] was provided by Renji Hospital, a university hospital affiliated with Shanghai Jiaotong University, Shanghai, China, and the study was approved by the same university. The raw dataset is composed of data from 578 patients characterized by measurements of 9 serum-based tumor markers and age. Each patient has a colorectal tumor, either benign or malignant. These 9 tumor markers were not all measured and recorded for all patients, resulting in a large amount of missing values. Table 1 lists the 9 tumor markers and age, along with their corresponding percentages of missing values in increasing order. For simplicity, tumor markers with more than 50% missing values are excluded, and only the cases having no missing values for all the remaining variables are retained for analysis. Thus, the final dataset is reduced to 253 patient samples and 6 independent variables (5 tumor markers and age). Among the 253 patients, 111 are diagnosed with malignant tumors and 142 with benign tumors. Note that the dataset is transformed by applying a natural logarithmic scale to each tumor marker, and the tumor marker values presented below are the transformed values.
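The preprocessing steps above can be sketched as follows. The tiny records are invented (None marks a missing value), and M3 is a hypothetical marker name standing in for one of the excluded markers.

```python
import math

# invented records; None marks a missing value, M3 is a hypothetical marker
records = [
    {"CEA": 2.0, "CA199": 15.0, "M3": None, "age": 61, "malignant": 1},
    {"CEA": 1.2, "CA199": None, "M3": None, "age": 48, "malignant": 0},
    {"CEA": 3.4, "CA199": 40.0, "M3": None, "age": 70, "malignant": 1},
    {"CEA": 0.8, "CA199": 10.0, "M3": 5.0, "age": 35, "malignant": 0},
]
markers = ["CEA", "CA199", "M3"]

# 1) drop markers with more than 50% missing values
kept = [m for m in markers
        if sum(r[m] is None for r in records) / len(records) <= 0.5]

# 2) retain only complete cases for the remaining variables
complete = [r for r in records if all(r[m] is not None for m in kept)]

# 3) natural-log transform the retained marker values
data = [{**r, **{m: math.log(r[m]) for m in kept}} for r in complete]

print(kept)       # ['CEA', 'CA199']: M3 is 75% missing and dropped
print(len(data))  # 3 complete cases remain
```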
An experiment is conducted to classify these patients as having either benign or malignant tumors based on the tumor marker values. The methodologies described in the previous section are tested on the data. The study is implemented in the statistical computing environment R, which has modeling packages for SVM, neural networks, random forest, and others. A brief summary of the experimental procedure can be found in Figure 1.

Diagnostic Value Evaluation
Figure 2 shows the ROC curves of the 5 tumor markers and age, and the corresponding AUC values are presented in Table 2. CEA has the highest AUC value by a significant margin in comparison with the others. AFP and CA125 have AUC values quite close to 0.5, so they can be viewed as having nearly no individual diagnostic value. Age, CA50, and CA199 have AUC values ranging from 0.57 to 0.62, showing weak individual dependency on tumor type, but their interaction effects can be further explored by multivariate diagnostic value analysis.
As shown in Table 3, age is ranked second on the Relief measurements while it shows little individual diagnostic value, with an AUC close to 0.5. Therefore, it is reasonable to assume that age may have an interaction effect with other tumor markers on tumor type. The scatter plot of CEA vs. age is shown in Figure 3.
• Logarithm is applied on tumor marker values.
• Samples and variables with missing values are removed.
• Univariate discriminative power of variables is ranked by AUC.
• Multivariate discriminative power of variables is ranked by the Relief algorithm.
• Result: 3 groups of variables for modeling are determined based on variable ranking.
• Model parameters are optimized by grid search.
• The variable group of CEA and age is recommended.
• Logistic-tree shows diagnostic performance comparable to other sophisticated models.

Figure 1. Flow chart of the experimental procedure.

Diagnostic Performance Comparison
Logistic-tree, logistic regression, classification tree, SVM, neural networks, and random forest are tested on the same dataset for comparison of diagnostic performance.
To verify the effectiveness of the Relief algorithm when working with diverse models, three variable sets are considered: 1) CEA and age, which are found to have an interaction effect on tumor type and rank highest on the Relief measurements; 2) CEA, age, and CA199, whose Relief measurements are above 0; and 3) all independent variables. As all the models' outputs are continuous, AUC is chosen as the performance evaluation criterion. Classification accuracy is also provided using a cutoff point of 0.5. Throughout the experiment, 5 repetitions of 10-fold cross-validation are used to achieve unbiased assessment. In n-fold cross-validation, the whole dataset is split into n folds, each having roughly the same sample size. Every fold is subsequently used to test the model that is trained on the other n-1 folds. Thus, n-fold cross-validation produces n models and n corresponding evaluation results. When n is set equal to the sample size N, it becomes the special version named leave-one-out cross-validation (LOOCV). LOOCV has low bias but high variance. To accommodate the trade-off between reducing measurement variance and fully utilizing the samples for model training, we choose 10-fold cross-validation in the experiment. Additionally, parameters for SVM, neural networks, and random forest are tuned by means of grid search. For SVM, the radial basis kernel is chosen, and the kernel coefficient gamma and cost (the C-constant of the regularization term in the Lagrange formulation) are tuned. The tuned parameters for neural networks are size and decay, which are, respectively, the number of units in the hidden layer and the coefficient for weight decay. For random forest, the tuned parameter mtry represents the number of variables randomly sampled as candidates at each split. In a grid search, a vector of values for each parameter is manually defined, and the modeling is performed for each combination of parameter values. The combination of parameter values leading to the highest performance is taken as the optimized parameters. The summary of the performance comparison over 5 repetitions of 10-fold cross-validation is presented in Table 4, and the result of each 10-fold cross-validation can be found in the Appendix. In total, 3 scenarios of variable sets and 6 models are considered, forming 18 combinations. AUC averaged over the 5 repetitions of 10-fold cross-validation is used as the primary indicator of predictive performance.
Classification rates at the cutoff point of 0.5 are also averaged over the 5 repetitions of 10-fold cross-validation for reference. The corresponding tuned parameters for SVM, neural networks, and random forest are also shown. The classification tree consistently outperforms the other models when a cutoff point of 0.5 is used. Again, this is an indication that sophisticated models do not necessarily lead to better classification performance (the classification tree is much simpler than SVM and neural networks). In summary, the performance comparison demonstrates that, for both AUC and classification rate, the logistic-tree shows competitive diagnostic performance compared with other parameter-optimized sophisticated models when the discriminative variables are selected by the Relief algorithm. Considering that the AUC mean achieved by applying the logistic-tree to the variable set of CEA and age is comparable to the other highest AUC means, we recommend using the variable set of CEA and age for this application of colorectal cancer diagnosis. Table 5 shows the sensitivity and specificity on training and test data (10-fold cross-validation) achieved using the logistic-tree with CEA and age. It can be seen that the model performance is quite stable across different folds.
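The evaluation procedure above can be sketched generically. Here `train_and_score` stands in for any of the compared learners; the trivial fixed-cutoff classifier and the toy data are invented purely to make the skeleton runnable.

```python
import random

def n_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validate(X, y, train_and_score, n_folds=10):
    """Average held-out score; every fold is used once as the test set."""
    scores = []
    for fold in n_fold_indices(len(X), n_folds):
        test = set(fold)
        tr = [i for i in range(len(X)) if i not in test]
        scores.append(train_and_score([X[i] for i in tr], [y[i] for i in tr],
                                      [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / len(scores)

def grid_search(X, y, grid, scorer_factory, n_folds=10):
    """Return the parameter combination with the highest cross-validated score."""
    return max(grid, key=lambda p: cross_validate(X, y, scorer_factory(**p), n_folds))

# toy usage: choose a cutoff on a single marker by 5-fold CV accuracy
X = [[v] for v in (0.2, 0.4, 0.5, 0.9, 1.2, 1.4, 1.6, 2.0, 2.2, 2.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

def scorer_factory(cutoff):
    def train_and_score(Xtr, ytr, Xte, yte):  # trivial "model": a fixed cutoff
        return sum(int(x[0] > cutoff) == t for x, t in zip(Xte, yte)) / len(yte)
    return train_and_score

best = grid_search(X, y, [{"cutoff": c} for c in (0.5, 1.0, 1.5)], scorer_factory, n_folds=5)
print(best)  # {'cutoff': 1.0} separates the toy data perfectly
```

Because the parameter choice itself is made by cross-validated scores, no fold's test data leaks into training for that fold, which is the safeguard against the overoptimistic whole-dataset evaluation criticized earlier.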

DISCUSSION
In the diagnostic value evaluation, Relief successfully identifies the interaction effect of CEA and age on colorectal cancer diagnosis. It further reveals how CEA and age interact in differentiating the two tumor types, as illustrated in Figure 3. As the logistic-tree achieves the highest AUC for the variable set of CEA and age, its prediction process in terms of tree structure is investigated. For this variable set, 8 out of the 10 logistic-trees built and pruned in the 10-fold cross-validation procedure have the same tree structure as that presented in Figure 4, with slightly different splitting values at the nodes. The pair of numbers in each terminal node represents the proportions of patient tumors labeled as benign and malignant. The classification mechanism of the tree structure shown in Figure 4 is consistent with the partition mechanism illustrated by Figure 3. As stated before, Relief has the potential of exploring the interaction effects of multiple independent variables on the response variable, which makes the informative variables recommended by the Relief algorithm especially suited for models with a tree-structured classification mechanism. As expected, neural networks, SVM, and random forest, which are renowned for discovering complex dependency structures, are not superior to the logistic-tree with its simple modeling mechanism in this case. The response surface of the logistic-tree, in terms of the probability of having a malignant tumor against CEA and age, is presented in Figure 5. It can be seen that CEA is highly indicative of the status of colorectal tumors. This is consistent with current knowledge, although CEA alone is not recommended as a screening test for colorectal cancer [25]. As for age, when it is over approximately 55, even if the CEA value is low, there is a drastic increase in the probability of a malignant tumor. This is also consistent with the recommendation by the US Preventive Services Task Force [26] of starting regular screening for colorectal cancer at the age of 50.

CONCLUSION
Evaluating the diagnostic values of tumor markers for decision support in the context of cancer diagnosis is not an ordinary data analysis problem. In this type of problem, multivariate dependency, interpretability of the diagnosis procedure, and the issue of overfitting need to be carefully addressed. The drawbacks of some extant approaches and their unintended consequences are pointed out and illustrated, highlighting the need for evaluation criteria that are model-independent. The effectiveness of the Relief algorithm for multivariate diagnostic value evaluation is validated both theoretically and experimentally. We also demonstrate the necessity of using a simple, highly interpretable model for clinical decision support. Accordingly, the logistic-tree is developed, which combines the advantages of the classification tree and logistic regression. Its diagnostic performance also proves to be highly competitive with other complex models in the colorectal dataset application. This cancer diagnosis procedure is evaluated on only one colorectal cancer dataset; its effectiveness needs to be further validated on more datasets of different cancers in the future.

Figure 5. Response surface of the logistic-tree.

Table 1. Percentages of missing values for all independent variables.
A red vertical line and a blue horizontal line are manually added to Figure 3 to delineate the joint effect of CEA and age. Evidently, the red vertical line, which denotes a cutoff value of around 1.0 on CEA, distinguishes the two tumor types best. In the section to the left of the red vertical line, the blue horizontal line, reflecting a cutoff point of around 55 on age, best distinguishes the two tumor types. Without considering CEA, age indeed does not exhibit much discriminative power; only when CEA values are less than 1.0 does age play a role in diagnosis. This two-line separation mechanism results in one mixed section and two other sections where high prediction accuracy can be achieved. In cancer diagnosis, the mixed section deserves more detailed clinical analysis, and its final diagnosis can be made by referring to other factors.