Modified Logistic Regression Models Using Gene Coexpression and Clinical Features to Predict Prostate Cancer Progression

Predicting disease progression is one of the most challenging problems in prostate cancer research. Adding gene expression data to prediction models that are based on clinical features has been proposed to improve accuracy. In the current study, we applied a logistic regression (LR) model combining clinical features and gene co-expression data to improve the accuracy of the prediction of prostate cancer progression. The top-scoring pair (TSP) method was used to select genes for the model. The proposed models not only preserved the basic properties of the TSP algorithm but also incorporated the clinical features into the prognostic models. Based on the statistical inference with the iterative cross validation, we demonstrated that prediction LR models that included genes selected by the TSP method provided better predictions of prostate cancer progression than those using clinical variables only and/or those that included genes selected by the one-gene-at-a-time approach. Thus, we conclude that TSP selection is a useful tool for feature (and/or gene) selection to use in prognostic models and our model also provides an alternative for predicting prostate cancer progression.


Introduction
Prostate cancer (PCa) is the second leading cause of cancerrelated deaths among men in the USA [1,2]. Screening using serum prostate-specific antigen (PSA) has improved the early detection of PCa and has resulted in an increase in the proportion of patients with disease that is curable by prostatectomy [3,4]. However, 20% to 30% of treated patients will develop a local or metastatic recurrence which reflects the most adverse clinical outcome [4]. Thus, from the clinical perspective, it is important to be able to predict which patients will experience a relapse.
Traditional PCa prognosis models are based on some clinical features, such as pretreatment PSA levels, biopsy Gleason score (GS), and clinical stage, but in practice, they are inadequate to accurately predict disease progression [5]. With the development of microarray technology in recent years, a number of studies have been conducted to characterize the dynamics of gene expression in PCa progression by using DNA microarrays. In some studies, tumor expression signatures associated with clinical parameters and outcomes have been identified [6][7][8][9]. As a result, it is possible to develop the clinical models with the variables of gene signatures identified from microarray data and some clinical features to predict which men would experience progression to the metastatic form of PCa.
However, it has been found that none of the predictive models using gene expression profiles are significantly better than models using clinical variables only in predicting PCa progression [10,11]. In fact, only a limited number of genes are used to avoid overfitting in these models. The genes are usually selected through a gene-by-gene comparison. The results of recent studies, however, suggest that assessing the expression of more than one gene (i.e., coexpression analysis) yields a better prediction of tumor progression than the analysis of individual genes does [12][13][14][15].
In this study, we tried to propose such models by merging the coexpressed genes' profiles and some clinical features to predict the patients who would suffer from PCa progression. The genes used in our models are identified by a topscoring pair (TSP) algorithm. The TSP method was initially introduced by Geman et al. as a classification technique for microarray data [16]. We applied the TSP-based LR model to published microarray experiments whose patients suffered from PCa progression. We analyzed the effects of the number of coexpressed genes included in the models and the selection of the clinical variables on the accuracy of the prediction. We also compared the performance of the most commonly used classification methods with our proposed method.

Logistic Regression Model for the Classification of Gene
Microarrays. Genome-wide microarray data from different cells give insight into the gene expression variation of various genotypes and phenotypes. Classification of patients is an important aspect of cancer diagnosis and treatment. For example, microarray experiments can be employed to screen gene expression levels from cancerous and normal phenotypes so that proper prediction rules can be built from these gene expression data. In this section, we introduce a logistic regression (LR) model to classify the phenotypes of microarray data.
We denote a gene expression matrix by = { } × , where there are genes and samples, and denotes the expression value of the th gene, ∈ {1, . . . , }, from the th sample, ∈ {1, . . . , }. The vector ⋅ = ( 1 , . . . , ) represents the gene expression values over all samples and ) is the expression profile of all genes for the th sample. Let be the binary phenotype of the th profile: = 0 indicates that the th sample belongs to class 0 (e.g., normal tissues) and = 1 indicates that the th sample belongs to class 1 (e.g., tumor tissues).
The classification of microarray data has been intensively researched for years. But some limitations have stood out, such as the small-sample dilemma, "black box, " and lack of prediction strength [16][17][18]. We used LR to build the prediction models for a binary outcome. Obviously, the underlying probability of labels and contribution of predictor variables can be explicitly provided in LR models, which is helpful for biologists in discovering the genes that interact and cause the occurrence of disease.
The goal of classification with LR was to find a formula that gives the probability that the th sample with all its measured expressions ⋅ represents a class 1 case. Since only two classes are considered, the probability of the sample representing class 0 is consequently 1 − . We used the following normal LR model: where 0 , 1 , . . . , are parameters that can be estimated by maximizing the following likelihood: For microarray experiments of typical "large , small , " the number of samples, , is usually on the order of tens, but the number of genes, , is usually on the order of thousands or even tens of thousands. So the number of samples is much less than the number of variables ( ≪ ). This situation presents a number of problems when building a LR model, such as overfitting, multicollinearity of the gene expression profiles, and infinite solutions for [17][18][19]. Feature selection can be used to identify the significant genes that contribute to most of the classification. Thus, some dimension-reduction techniques, such as support vector machines (SVMs), singular value decomposition, and partial least squares, are commonly used to tackle those problems and make the computation feasible [17][18][19]. However, the featured genes are usually selected one by one. According to the biological mechanism, genes do not work by themselves, so we employed coexpressed TSP genes in the model, as described in the following section.

Identification of Coexpressed Genes.
Recent studies have suggested that assessing the expression of more than one gene (i.e., coexpression analysis) provides a better prediction of tumor progression than analyzing the expression of individual genes [20][21][22]. We identified the coexpressed gene with the paired-gene approach of the top-scoring pairs (TSP) algorithm as described by Geman et al. [16]. The TSP algorithm was originally developed for the binary classification of phenotypes according to the relative expression profiles of one-gene pair. The TSP classifier has the following advantages over the standard classifiers used in gene expression studies: (i) it is a parameter-free and data-driven machine-learning method that avoids overfitting by eliminating the need to perform specific parameter tuning, as in other machinelearning techniques, such as SVMs and neural networks; (ii) it involves only two genes, which leads to easily interpretable data and inexpensive diagnostic tests; (iii) the rank-based TSP classifiers are less affected by technical factors or normalization than classifiers which are based on expression levels of individual genes; and (iv) the simple and accurate results generated by TSP facilitate follow-up studies.
TSP gene pairs may be considered biomarker genes in a diagnostic test from microarray experiments [16,[20][21][22]. The methodology is being extended from one TSP gene pair to top-scoring pair of groups (TSPG) as gene signatures [20][21][22]. However, there are still some unresolved issues of biological explanation and the selection criteria related to the use of gene pairs instead of larger groups of significant genes. Most of the algorithms in gene selection are based on the distribution assumption of the gene expression data. However, the rank-based TSP algorithm is a parameterfree, data-driven machine-learning method. It is difficult to determine the number of gene pairs selected, but current Computational and Mathematical Methods in Medicine 3 research indicates that only a few gene pairs with the top scores need to be considered [20,21].
For simplicity, using the gene expression matrix = { } × with genes and samples, we assume that 1 samples are labeled class 0, = 0 ( ∈ {1, . . . , 1 }), and 2 samples are labeled class 1, = 1 ( ∈ { 1 + 1, . . . , }), and 1 + 2 = . In this method, we focused on detecting "marker gene pairs" ( , V) because there is a significant difference in the probability of observing < V between class 1 and class 0, where and V denote the th and Vth rows of . The conditional probabilities of observing < V in each class are defined as where ( < V ) is the indicator function defined as The typical TSP method is based on maximizing the following score of ( , V) defined by German et al. [16]: This approach has been shown to be as accurate as SVMs and other more sophisticated methods [20][21][22]. Although maximizing delta identifies the best classifier with high accuracy, it may be associated with relatively low sensitivity or specificity, as pointed out by German et al. and Ummanni et al. [16,23]. For example, in the classification of cancer versus normal samples, accuracy is defined as the ratio between the number of correctly predicted samples and the total number of samples, and sensitivity (resp., specificity) is the ratio between the number of correctly predicted cancer (resp., normal) samples and the total number of cancer (resp., normal) samples [16]. This low sensitivity or specificity restricts us to use the classifier of one TSP for making medical decisions. This issue was improved with the use of multiple gene pairs as the classifier, which can achieve similar scores with high accuracy, sensitivity, and specificity [20][21][22]. Thus, we considered not just one but multiple TSP gene pairs in our models.

Evaluation of the Model Using Published Datasets.
To evaluate the efficiency of the TSP-based LR model, we applied our model to datasets with both clinical parameters and gene expression values. We selected a dataset with a large sample size because we could obtain more reliable estimates of the efficiency of the classifiers. The dataset was from the recently published study of Sboner et al. [5], who analyzed gene expression in patients with up to 30 years of clinical followup data. Men who died within 10 years of being diagnosed with PCa were considered to have "lethal" disease, and those who survived at least 10 years after diagnosis were considered to have "indolent" disease. There were 165 men with lethal and 116 with indolent disease. The GS, tumor percentage, and presence of an estrogen-regulated gene (ERG) rearrangement were provided for each patient in the study. The expression of 6,100 genes was assessed using a custom gene expression array (GSE 16560).
For our model, we first randomly separated the 281 samples into a learning set with 186 samples and a validation set with the other 95 samples, with an approximately equal proportion of men with lethal and indolent PCa in each group. The learning set was utilized to create the models whose performance was evaluated in the validation set by means of the area under the receiver operating characteristic (ROC) curve (AUC). To compare the performance of our model, we performed the statistical testing based on the null hypothesis that there is no difference between the AUCs of Sboner's models and ours. Similar to the estimation of AUCs in [5], the corresponding 95% confidence intervals of the AUCs were computed in 100 iterative 10-fold cross validation procedures that enabled an unbiased estimation of the model's performance since the evaluation was performed on an independent dataset. The model is inferred to be better only if its AUC is statistically larger than that of the other models. In the original study, the authors conducted an extensive comparative analysis of the most frequently used classification methods, including the k-nearest neighbor, the nearest template prediction, diagonal linear discriminate analysis, SVMs, and neural network analysis. Their results allowed us to compare the performance of the TSP-based LR classifiers with that of the other classifiers.
To optimize and select the best models, we adopted an iterative cross validation procedure within the learning set that was similar to the procedure used by Sboner et al. [5]. The stratified tenfold cross validation procedure split the learning set into 10 disjointed partitions, test i ( = 1, . . . , 10), with approximately equal proportions of lethal and indolent cases in each. For a given partition, test i , the models were fitted using all the other cases in the learning set, that is, the training i set and then were evaluated with AUC analysis of test i . In the procedure of 10-fold cross validation, the model i ( = 1, . . . , 10) was first parameterized in the training i sets and then the corresponding AUC ontest i ( = 1, . . . , 10) sets were calculated from model i . To avoid potential biases in the selection of the 10 partitions, the entire procedure was repeated 100 times, for 1,000 different partitions. We identified the best model with the largest AUC by comparing them as obtained in the 100 iterations. Furthermore, the featured gene pairs and estimated parameters in the model were also considered as the best model in learning set. The rationale was that the results of this procedure enable the identification of the best model, which can then be used to build a classifier that was finally evaluated on the validation set.
During the iterations of our cross validation procedure, the feature-selection procedure was carried out to identify the subsets of genes that are expressed differently in the lethal and indolent samples. In the study by Sboner et al. [5], a two-sided -test was performed for each gene to identify the differently expressed genes. We then compared our models using TSPselected coexpressed genes with the models described by Sboner et al. [5].

Results
We proposed the LR models by combining TSP-selected genes and clinical features to identify and predict the patients whose PCa will progress. The performance of the models was evaluated with dataset GSE 16560. Table 1 lists the 16 LR models that we tested. Our models include all possible combinations of the following variables: age, GS, tumor percentage, presence of ERG gene rearrangement, and TSPselected genes. The AUCs of 1,000 different partitions were calculated to select the best models. Figure 1(a) shows the AUC boxplots for the 100 tenfold cross validations of the 16 models listed in Table 1, with one pair of TSP-selected genes per model. The red stars denote the AUC values from the validation datasets that correspond to the best LR models from the learning dataset. Figure 1(b) shows the AUC boxplots for the same 16 models but with two TSP-selected gene pairs per model. We plotted the AUC values of the validation dataset to assess the effects of the variables on the models (Figure 2). The blue line represents AUC values from the models with one TSP-selected gene pair, and the black line represents those from the models with two TSP-selected gene pairs. Furthermore, we tested the statistical significance of the models based on the null hypothesis that there is no difference between the AUCs of Sboner's models. It is found that most of AUC values in Sboner's models were out of the 95% confidence intervals of AUCs in our models. So our models can provide an alternative in predicting prostate cancer progression. The addition of TSP-selected gene pairs can improve our models' prediction of PCa progression, which differed from Sboner's results.
What is the role of TSP-selected gene pairs in comparison with the fusion ERG and the other clinical features, especially GS? Obviously, the GS was the most statistically significant variable because all the top models included it. In Figure 2, the red circles label the 8 models that include the GS. The AUCs in those 8 models were much higher than they were in the others and were very similar in the one-and twogene-pair models. The 8 AUCs were more than 0.8, as shown in Figure 2, so we can conclude that the models with TSPselected gene pairs performed better than all of Sboner's models, for which the largest AUC was 0.79 in [5].
The model using only the GS yielded an AUC of 0.76; by adding fusion ERG, the largest AUC observed by Sboner et al. was 0.79 [5]. Similarly, the other models that used only GS and tumor percent (or age) without molecular profiles could yield a higher AUC if fusion ERG was added. Therefore, the addition of fusion ERG may improve the prediction capability of models that use only clinical features [5].
However, the effect of fusion ERG was a little different in our analysis. First, our models could perform better by replacing fusion ERG with TSP-selected genes. In comparison with the best model with GS and fusion ERG (AUC, 0.79) in [5], our model 1.3 with the GS and TSP-selected gene pairs performed better, with an AUC of 0.84 (95% CI = [0.81, 0.88]); our best model was model 1.9, which used GS, tumor percentage, and one TSP-selected gene pair (AUC, 0.86; 95% CI = [0.79, 0.92]), but the corresponding model reported by Sboner with fusion ERG for replacement yielded an AUC of 0.75 [5]. On the other hand, the addition of fusion ERG had little or no effect on our models that included TSPselected gene pairs. For example, the same AUC was obtained with our Model 1.3 and 1.10, with GS, fusion ERG, and TSPselected gene pairs. Thus, TSP-selected genes seem to have a more important effect than the fusion ERG does in predicting PCa progression.
The addition of genes other than fusion ERG could not improve the prediction capability because the best models in the Sboner study [5] lacked molecular profiles. However,  Table 1, and the y-axis is the AUC values. The red star denotes the corresponding AUC values of the validation dataset that uses the best logistic regression models from the learning dataset. (a) Models that included one pair of TSP-selected genes and (b) those that included two such gene pairs.   Table 1, and the y-axis is the AUC values. The blue line shows the AUCs from the models with one TSP-selected gene pair, and the black line shows those from models with two TSP-selected gene pairs. The points circled in red are the AUCs in the 8 models that included the Gleason score as a variable.
some improvement was observed in our study: by replacing the molecular profiles in the Sboner models with one or two TSP-selected gene pairs, our models performed better than theirs did. For example, the best model with molecular profiles in the Sboner study used GS, age, and 12 genes and yielded an AUC of 0.75, whereas our model 1.6, which used GS, age, and TSP-selected gene pairs, yielded an AUC of 0.8 (95% CI = [0.76, 0.85]). Thus, although we added fewer genes to our model, its performance was better. Moreover, the prediction capability of Sboner's models was also improved when the same number of genes was replaced with TSPselected gene pairs, as demonstrated in Table 2. Therefore, adding the TSP-selected genes had an important effect on the performance of the original models. Models that use only clinical features may perform better if appropriate genes, such as those selected with the TSP algorithm, are added. To explore the effect of adding genes, we compared our approach with that used in the nine models in the original study by Sboner et al. [5] that included genes. For the comparison, we selected the same number of genes for our models. However, the featured genes in the models differed each time because the 1,000 random training and testing partitions were different in the iterative cross validation procedure.
The results of our comparison were presented in Table 2. As noted, the AUCs in our models were often higher than those in the study by Sboner  The model from the Sboner study that included the GS and 16 genes did not perform any better than their model did that used the GS only, with AUCs of 0.75 and 0.75, respectively [5]. However, the AUC of our model that included the GS and 16 TSP-selected genes was 0.81 (95% CI = [0.76, 0.85]) in Table 2, and the models that used GS and one (or two) TSP gene pair(s) performed better, with an AUC of 0.84 in Figure 2.
Finally, of all the models tested in the original study, the one that included the GS and the ERG rearrangement (with no gene expression data) had the highest AUC value, 0.79 [5], whereas most of the AUC values for our models were higher than that. Therefore, in contrast to the conclusion reached by Sboner et al., we believe that adding the molecular profiles can improve the results obtained with the traditional prognostic models of PCa if the appropriate genes are selected.
From the results in Figure 2 and Table 2, we may conclude that the models' performance was not improved by the addition of large numbers of genes but was improved by the addition of significant clinical features and molecular profiles. For example, adding one TSP-selected gene pair is enough if the important clinical variables, such as GS, are included in the model. However, in the case of model 1.1, which included only one gene signature, and models 1.2 and 1.8, which also included age, the addition of more gene pairs can greatly improve the performance. Obviously, the gene selection strongly depends on the patient samples and so some statistical techniques such as bootstrap, repeated sampling, or cross validation were commonly used in the TSP-extended algorithms. In the current research, the computation cost of TSP-based algorithms is not the main concern, but the topics about the optimal number of gene pairs added to improve the clinical models are still interesting in further research.

Conclusion and Discussion
We have introduced an LR-based classification method that combines TSP-selected genes and clinical measurements. The empirical results of [19,20] based on the datasets of prostate cancer progression show that the classification models using one or two TSP-selected gene pairs perform better than most commonly used one-gene-at-a-time approaches. With the combination of LR, our models not only preserved the basic advantages of the TSP algorithm but also incorporated the clinical features. Furthermore, the LR-TSP model provides the underlying probability of predictionand coexpressed genes that are used as biomarkers in the model. Thus, our proposed method provides explicit biologic interpretation of the clinical tests. Based on the statistical inference with the iterative cross validation, the better performance was shown in our models.
As mentioned in the report of Sboner et al. [5], many factors can influence the performance of the models, such as the definitions of lethal and indolent PCa, the use of samples contaminated with stromal tissue, the selection of genes assayed using a DASL (cDNA-mediated annealing, selection, extension, and ligation) array, and the effect of intertumor heterogeneity. Based on the study of GSE 16560, we explored the possible effect of genes used in the clinical models. The featured genes are often selected by using a one-gene-at-atime approach. Sboner et al. performed a two-sided -test for each gene within the training i partition, thereby avoiding overfitting because the selection of the genes was performed on only training sets [5]. They also implemented stepwise-like feature selection, sorted the genes according to their values from the t-testing, and then added the genes to their models. Our study, on the other hand, demonstrates that coexpression analysis yields better prediction of tumor progression than the analysis of individual genes does. Therefore, we conclude that TSP selection is a useful tool for feature (and/or gene) selection to use in prognostic models.