Regularized F-Measure Maximization for Feature Selection and Classification

Receiver Operating Characteristic (ROC) analysis is a common tool for assessing the performance of classifiers. It has gained much popularity in medicine and other fields, including biological marker and diagnostic test studies. This is due in particular to the fact that misclassification costs are often unknown in real-world problems, so the ROC curve and related utility functions such as the F-measure can be more meaningful performance measures. The F-measure combines recall and precision into a single global measure. In this paper, we propose a novel method based on regularized F-measure maximization. The proposed method assigns different costs to positive and negative samples and performs simultaneous feature selection and prediction with an L1 penalty. The method is especially useful when the data set is highly unbalanced or when the labels for negative (positive) samples are missing. Our experiments with benchmark, methylation, and high dimensional microarray data show that, in limited experiments, the proposed algorithm performs as well as or better than other popular classifiers.


Introduction
Receiver Operating Characteristic (ROC) analysis has received increasing attention in the recent statistics and machine learning literature (Pepe [1,2]; Pepe and Janes [3]; Provost and Fawcett [4]; Lasko et al. [5]; Kun et al. [6]). ROC analysis originates in signal detection theory and is widely used in medical statistics for visualizing and comparing the performance of binary classifiers. Traditionally, a classifier is evaluated by minimizing an estimate of the generalization error or some related measure (Vapnik [7]). However, accuracy (the rate of correct classification) is not always an appropriate measure. When the data are highly unbalanced, accuracy may be misleading, since the all-positive or all-negative classifier may achieve a very good classification rate. In real-life applications, unbalanced data sets arise frequently. Utility functions such as the F-measure or AUC provide a better way to evaluate classifiers, since they can assign different error costs to positive and negative samples.
When the goal is to achieve the best performance under a ROC-based utility function, it may be better to build classifiers by directly optimizing that utility function. In fact, optimizing the log-likelihood function or the mean-square error does not necessarily imply good ROC curve performance. Hence, several algorithms have recently been developed for optimizing the area under the ROC curve (AUC) (Freund et al. [8]; Cortes and Mohri [9]; Rakotomamonjy [10]), and they have been shown to work with varying degrees of success. However, few methods have been proposed for F-measure maximization. Most approaches that we know of maximize the F-measure using SVMs, and do so by varying the parameters of a standard SVM in an attempt to make the F-measure as large as possible (Musicant et al. [11]). While this may result in a "best possible" F-measure for a standard SVM, there is no evidence that this technique produces an F-measure comparable with that of a classifier designed specifically to optimize the F-measure. Jansche [12] proposed an approximation algorithm for F-measure maximization in the logistic regression framework. His method, however, gives extremely large values for the estimated parameters and creates too many steep gradients. It therefore either converges very slowly or fails to converge for large datasets.
Our aim in this paper is to propose a novel algorithm that directly optimizes an approximation of the regularized F-measure. The regularization term can be an L2 penalty, an L1 penalty, or a combination of the two, depending on the prior assumptions (Tibshirani [13,14]; Wang et al. [15]). Because the L1 penalty drives some coefficients exactly to zero, our algorithm with the L1 penalty performs feature selection and classification simultaneously. The proposed algorithm can be easily applied to high dimensional microarray data. One advantage of this method is that it is effective when the data are highly unbalanced, since it assigns different costs to the positive and negative samples.
The paper is organized as follows. In Section 2 we introduce the related concepts of ROC and F-measure. The algorithm and a brief proof of its generalization bounds are presented in Section 3. The computational experiments and performance evaluation are given in Section 4. Finally, conclusions and remarks are given in Section 5.

ROC Curves and F-Measure
In binary classification, a classifier attempts to map the instances into two classes: positive (p) and negative (n). There are four possible outcomes for a given classifier: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Table 1 summarizes these outcomes with their associated terminology. The number of positive instances is N_p = TP + FN. Similarly, N_n = TN + FP is the number of negative instances.
From these counts the following statistics are derived:
tpr = TP/(TP + FN), tnr = TN/(TN + FP), fpr = FP/(FP + TN), fnr = FN/(FN + TP),
where the true positive rate (also called recall or sensitivity) is denoted by tpr and the true negative rate (specificity) by tnr. The false positive rate and false negative rate are denoted by fpr and fnr, respectively. Note that tnr = 1 − fpr and fnr = 1 − tpr. We also define the precision Pr = TP/(TP + FP). ROC curves plot the true positive rate versus the false positive rate as the threshold varies; the thresholded quantity is usually the probability of membership in a class, the distance to a decision surface, or a score produced by a decision function. In ROC space, the upper left corner represents perfect classification, while the diagonal line represents random classification. A point in ROC space that lies to the upper left of another point represents a better model. The F-measure combines the true positive rate (recall) and the precision Pr into a single utility function, defined as the γ-weighted harmonic mean:
F_γ = 1/(γ/tpr + (1 − γ)/Pr).
F_γ can be expressed with TP, FP, and FN as follows:
F_γ = TP/(TP + γ FN + (1 − γ) FP),
or equivalently
F_γ = TP/(γ N_p + (1 − γ) M_p),
where N_p is the number of positive samples and M_p = TP + FP. Clearly 0 ≤ F_γ ≤ 1, and F_γ = 1 only when all the data are classified correctly. Maximizing the F-measure is equivalent to maximizing the weighted sensitivity and specificity. Therefore, maximizing F_γ will indirectly tend to maximize the area under the ROC curve (AUC).
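As a small numerical illustration of these definitions, the following Python sketch computes the rates and F_γ from the four outcome counts. It assumes that γ weights the recall term of the harmonic mean, that is, F_γ = TP/(γ N_p + (1 − γ) M_p); this weighting is an assumption chosen to agree with the quantities N_p and M_p defined above. The function names and the example counts are hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Return the ROC-related rates and the precision from the outcome counts."""
    n_pos = tp + fn                      # N_p: number of positive instances
    n_neg = tn + fp                      # N_n: number of negative instances
    tpr = tp / n_pos                     # recall / sensitivity
    tnr = tn / n_neg                     # specificity
    precision = tp / (tp + fp)           # undefined if nothing is predicted positive
    return {"tpr": tpr, "tnr": tnr, "fpr": 1 - tnr, "fnr": 1 - tpr,
            "precision": precision}


def f_gamma(tp, fp, fn, gamma):
    """gamma-weighted harmonic mean of recall and precision (assumed weighting)."""
    n_pos = tp + fn                      # N_p
    m_pos = tp + fp                      # M_p: number of predicted positives
    return tp / (gamma * n_pos + (1 - gamma) * m_pos)


# Hypothetical counts for illustration only.
stats = rates(tp=40, fp=10, tn=45, fn=5)
f_half = f_gamma(tp=40, fp=10, fn=5, gamma=0.5)
```

Under this assumed weighting, γ = 1 reduces F_γ to the recall and γ = 0 reduces it to the precision.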
To optimize F_γ, we have to define TP, FN, and FP mathematically. We first introduce an indicator function
I(C) = 1 if C holds, and I(C) = 0 otherwise,
where C is a condition defining a set. Let ŷ = f(w, x) be a classifier with coefficients (weights) w and input variable x, where ŷ is the predicted value. Given n samples D = {(x_1, y_1), . . . , (x_n, y_n)}, where x_i is an input vector of dimension m and y_i ∈ {−1, 1} is the class label, TP, FN, and FP are given, respectively, by
TP = Σ_{i=1}^{n} I(ŷ_i = 1) I(y_i = 1), FN = Σ_{i=1}^{n} I(ŷ_i = −1) I(y_i = 1), FP = Σ_{i=1}^{n} I(ŷ_i = 1) I(y_i = −1).
It is clear that the F-measure is a utility function defined over the whole data set.
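The indicator-based counts can be sketched in Python as follows. This is a minimal illustration that assumes labels and predictions are coded as −1 and 1, as in the text; the helper names and the toy label vectors are hypothetical.

```python
def indicator(condition):
    """I(C): 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0


def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP) as sums of products of indicator functions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(indicator(p == 1) * indicator(t == 1) for t, p in pairs)
    fn = sum(indicator(p == -1) * indicator(t == 1) for t, p in pairs)
    fp = sum(indicator(p == 1) * indicator(t == -1) for t, p in pairs)
    return tp, fn, fp


# Toy labels and predictions, for illustration only.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, 1, -1, -1, -1, 1]
tp, fn, fp = confusion_counts(y_true, y_pred)
```

Because every count is a sum over all n samples, the resulting F-measure cannot be decomposed into independent per-sample losses.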

The Algorithm
Usually, given a classifier with known parameters w, the F-measure can be calculated on test data to evaluate the performance of the model. The aim of this paper, however, is to learn a classifier and estimate the corresponding parameters w from given training data D by maximizing a regularized F-measure. Statistically, F_γ is a probability that measures the proportion of samples correctly classified. Based on these observations, we can maximize log F_γ in the maximum log-likelihood framework. Different assumptions for the prior distribution of w lead to different penalty terms. Given the coefficient vector w with dimension m, we have L2 = (1/2) Σ_{j=1}^{m} |w_j|^2 under the assumption of a Gaussian prior and L1 = Σ_{j=1}^{m} |w_j| under a Laplacian prior. In general, the L1 penalty encourages sparse solutions, while classifiers with the L2 penalty are more robust. We make TP, FN, and FP depend on w explicitly and maximize the following penalized F-measure functions:
E_1(w) = log F_γ(w) − λ Σ_{j=1}^{m} |w_j|, E_2(w) = log F_γ(w) − (λ/2) Σ_{j=1}^{m} |w_j|^2.
We have
log F_γ(w) = log TP(w) − log(γ N_p + (1 − γ) M_p(w)), where M_p(w) = TP(w) + FP(w).
Note that TP(w), FN(w), and FP(w) are all integers, and the indicator function I in (7) is not differentiable. We first define an S-type function h(z) to approximate the indicator function I. Let z = w^T x be a linear score function. The decision rule ŷ(w, x) = 1 if z = w^T x > 0 can then be represented through I(z > 0), which h(z) approximates. Figure 1 gives some insight into h(z): it shows that h(z) is a better approximation of I(z > 0) than the sigmoid function g(z) = 1/(1 + e^{−z}).
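To make the smoothed objective concrete, the following Python sketch evaluates log F_γ(w) and the two penalized functions with a smooth step in place of the indicator. The exact form of h(z) is not reproduced here, so the sketch uses a steepened logistic function with a hypothetical slope parameter a as a stand-in; it is not the paper's h(z). It also assumes F_γ = TP/(γ N_p + (1 − γ) M_p), and all function names and data are illustrative.

```python
import math


def smooth_step(z, a=5.0):
    """Stand-in for h(z): a steepened logistic approximation of I(z > 0)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-a * z))
    e = math.exp(a * z)                  # stable branch for negative z
    return e / (1.0 + e)


def log_f_gamma(w, X, y, gamma, a=5.0):
    """Smoothed log F_gamma(w) for a linear score z = w^T x."""
    soft = [smooth_step(sum(wj * xj for wj, xj in zip(w, x)), a) for x in X]
    tp = sum(s for s, label in zip(soft, y) if label == 1)   # soft TP(w)
    m_p = sum(soft)                                          # soft M_p(w)
    n_p = sum(1 for label in y if label == 1)                # N_p is fixed
    return math.log(tp) - math.log(gamma * n_p + (1 - gamma) * m_p)


def penalized_objectives(w, X, y, gamma, lam, a=5.0):
    """Return (E_1, E_2): log F_gamma minus the L1 or L2 penalty."""
    lf = log_f_gamma(w, X, y, gamma, a)
    l1 = sum(abs(wj) for wj in w)
    l2 = 0.5 * sum(wj * wj for wj in w)
    return lf - lam * l1, lf - lam * l2


# Toy separable data; the first column is a constant bias feature.
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, -1, -1]
w = [0.0, 3.0]
lf = log_f_gamma(w, X, y, gamma=0.5)
e1, e2 = penalized_objectives(w, X, y, gamma=0.5, lam=0.1)
```

For this separable toy example the smoothed F_γ is close to 1, so log F_γ is close to 0 and the two objectives differ essentially only in their penalty terms.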
The first derivative of h(z), denoted h′(z), is continuous and is given in (12). Based on (10) and (11), the approximated versions of TP(w) and M_p(w) = TP(w) + FP(w) can be written as follows:
TP(w) ≈ Σ_{i: y_i = 1} h(w^T x_i), M_p(w) ≈ Σ_{i=1}^{n} h(w^T x_i).
We can find the first-order derivatives of E_1 and E_2, respectively, as follows:
∂E_1/∂w_j = (1/TP(w)) ∂TP(w)/∂w_j − ((1 − γ)/(γ N_p + (1 − γ) M_p(w))) ∂M_p(w)/∂w_j − λ sign(w_j) for w_j ≠ 0,
∂E_2/∂w_j = (1/TP(w)) ∂TP(w)/∂w_j − ((1 − γ)/(γ N_p + (1 − γ) M_p(w))) ∂M_p(w)/∂w_j − λ w_j,
where
∂TP(w)/∂w_j = Σ_{i: y_i = 1} h′(w^T x_i) x_ij, ∂M_p(w)/∂w_j = Σ_{i=1}^{n} h′(w^T x_i) x_ij.
Knowing E_1 and E_2 and their derivatives ∇E_1 = [∂E_1/∂w_j] and ∇E_2 = [∂E_2/∂w_j], we can maximize the penalized functions E_1 and E_2 with a gradient-based algorithm such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (Broyden [16]). The algorithm for E_2 maximization is straightforward, as shown in Algorithm 1. The step size μ in the algorithm can be found with a line search.
Algorithm 1: L2 regularized F-measure maximization.
1. Given γ, λ, and a small number ε, initialize w_t = w_0, and set ...
The regularized F-measure maximization with the L1 penalty (E_1) is of special interest because it favors sparse solutions and can select features automatically. However, maximizing E_1 is somewhat more complex, since L1, and hence E_1, is not differentiable at 0. For simplicity, let LF = log F_γ(w); we have
E_1(w) = LF − λ Σ_{j=1}^{m} |w_j|.
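The L2 regularized maximization can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: it uses plain gradient ascent with a backtracking line search for the step size μ instead of a BFGS quasi-Newton update, it replaces h(z) with a logistic function of hypothetical slope a, and it assumes F_γ = TP/(γ N_p + (1 − γ) M_p). All names and the toy data are hypothetical.

```python
import math


def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def objective_and_gradient(w, X, y, gamma, lam, a=2.0):
    """Smoothed E_2(w) and its gradient for the linear score z = w^T x."""
    m = len(w)
    n_p = sum(1 for label in y if label == 1)
    tp = m_p = 0.0
    d_tp, d_mp = [0.0] * m, [0.0] * m
    for x, label in zip(X, y):
        s = sigmoid(a * sum(wj * xj for wj, xj in zip(w, x)))
        ds = a * s * (1.0 - s)               # derivative of the smooth step
        m_p += s
        tp += s if label == 1 else 0.0
        for j in range(m):
            d_mp[j] += ds * x[j]
            if label == 1:
                d_tp[j] += ds * x[j]
    tp = max(tp, 1e-300)                     # guard against log(0)
    denom = gamma * n_p + (1.0 - gamma) * m_p
    value = math.log(tp) - math.log(denom) - 0.5 * lam * sum(v * v for v in w)
    grad = [d_tp[j] / tp - (1.0 - gamma) * d_mp[j] / denom - lam * w[j]
            for j in range(m)]
    return value, grad


def maximize_e2(X, y, gamma, lam, w0, eps=1e-6, max_iter=500):
    """Gradient ascent on E_2 with a backtracking line search for mu."""
    w = list(w0)
    value, grad = objective_and_gradient(w, X, y, gamma, lam)
    for _ in range(max_iter):
        if max(abs(g) for g in grad) < eps:  # stationary point reached
            break
        mu = 1.0
        while mu > 1e-12:
            trial = [wj + mu * gj for wj, gj in zip(w, grad)]
            trial_value, trial_grad = objective_and_gradient(trial, X, y, gamma, lam)
            if trial_value > value:          # accept the first improving step
                break
            mu *= 0.5
        else:
            break                            # no improving step was found
        w, value, grad = trial, trial_value, trial_grad
    return w, value


# Toy data with a constant bias feature in the first column.
X = [[1.0, 2.0], [1.0, 1.5], [1.0, 0.5], [1.0, -0.5], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
value0, _ = objective_and_gradient([0.0, 0.0], X, y, 0.5, 0.1)
w_hat, value_hat = maximize_e2(X, y, gamma=0.5, lam=0.1, w0=[0.0, 0.0])
```

A quasi-Newton method would replace the raw gradient direction with one scaled by an approximate inverse Hessian, which typically needs fewer iterations; the objective and gradient routine would stay the same.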
The Karush-Kuhn-Tucker (KKT) conditions for optimality are given as follows:
∂LF/∂w_j = λ sign(w_j) if w_j ≠ 0, |∂LF/∂w_j| ≤ λ if w_j = 0.
The KKT conditions tell us that there is a set Ψ of nonzero coefficients, corresponding to the variables whose first-order derivative is maximal in absolute value and equal to λ, and that all variables with smaller derivatives have zero coefficients at the optimal penalized solution. Since L1 is differentiable everywhere except at 0, we can design an algorithm that deals with the nonzero coefficients only. Algorithm 2 works on the subspace of nonzero coefficients indexed by Ψ. It includes a procedure that adds a variable to Ψ when its first-order derivative becomes large and removes a variable from Ψ when its coefficient hits 0.
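The optimality conditions and the add/remove rule for Ψ can be illustrated with the short Python sketch below. It assumes the standard L1 conditions ∂LF/∂w_j = λ sign(w_j) for nonzero coefficients and |∂LF/∂w_j| ≤ λ for zero coefficients. The helper names, the tolerance, and the rule of adding one variable at a time are assumptions made for illustration; this is not a transcription of Algorithm 2.

```python
def sign(v):
    return (v > 0) - (v < 0)


def kkt_violations(grad_lf, w, lam, tol=1e-8):
    """Indices j at which the assumed L1 optimality conditions fail."""
    bad = []
    for j, (g, wj) in enumerate(zip(grad_lf, w)):
        if wj != 0:
            ok = abs(g - lam * sign(wj)) <= tol   # gradient balances the penalty
        else:
            ok = abs(g) <= lam + tol              # gradient too small to leave 0
        if not ok:
            bad.append(j)
    return bad


def update_active_set(active, grad_lf, w, lam, tol=1e-8):
    """Drop coefficients that hit 0; add the zero coefficient with the largest
    gradient magnitude exceeding lam, if there is one."""
    active = {j for j in active if w[j] != 0}
    candidates = [j for j in range(len(w))
                  if j not in active and abs(grad_lf[j]) > lam + tol]
    if candidates:
        active.add(max(candidates, key=lambda j: abs(grad_lf[j])))
    return active


# Hypothetical gradient of LF and coefficient vector.
grad_lf = [0.3, -0.3, 0.1, 0.5]
w = [1.2, -0.4, 0.0, 0.0]
violations = kkt_violations(grad_lf, w, lam=0.3)
psi = update_active_set({0, 1}, grad_lf, w, lam=0.3)
```

In this example only the last variable violates the conditions, because its gradient magnitude exceeds λ while its coefficient is zero, so it is the one added to the active set.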

Computational Considerations.
Both γ and λ are free parameters that need to be chosen. We choose the best values of γ and λ according to the area under the ROC curve (AUC), another scalar measure for classifier comparison. Its value lies between 0 and 1. Larger AUC values indicate better classifier performance across the full range of possible thresholds. For datasets with a skewed class distribution or an unknown cost distribution, as in our applications, AUC is a better measure than prediction accuracy. Given a binary classification problem with N_p positive class samples and N_n negative class samples, let f(x) be the score function used to rank a sample x. AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Mathematically,
AUC = (1/(N_p N_n)) Σ_{i=1}^{N_p} Σ_{j=1}^{N_n} I(f(x_i) > f(y_j)),
where x_i denotes a positive sample, y_j denotes a negative sample, and I(·) is an indicator function with I(·) = 1 if f(x_i) > f(y_j) and I(·) = 0 otherwise. AUC is also called the Wilcoxon-Mann-Whitney statistic (Rakotomamonjy [10]). Note that log F_γ(w) is generally a nonconcave function of w, so only a local maximum is guaranteed. One way to deal with this difficulty is to use multiple-point initialization: multiple random starting points are generated, and the proposed algorithms are used to find the maximum from each point. The result with the lowest test error is chosen as our best solution.
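The pairwise definition of AUC and the multiple-point initialization can be sketched in Python as follows. The AUC function follows the stated definition directly, counting ties as 0. The multi-start helper is a generic illustration: `local_search` is a hypothetical callback that takes a starting point and returns a pair (w, score), where the score is whatever criterion is being maximized (for the selection rule in the text, the negative test error). The Gaussian sampling of starting points is also an assumption.

```python
import random


def auc(pos_scores, neg_scores):
    """Wilcoxon-Mann-Whitney estimate: fraction of (positive, negative)
    pairs in which the positive sample receives the higher score."""
    wins = sum(1 for p in pos_scores for q in neg_scores if p > q)
    return wins / (len(pos_scores) * len(neg_scores))


def multi_start(local_search, n_starts, dim, seed=0, scale=1.0):
    """Run local_search from several random starting points; keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        w0 = [rng.gauss(0.0, scale) for _ in range(dim)]
        w, score = local_search(w0)
        if best is None or score > best[1]:
            best = (w, score)
    return best


# Hypothetical scores for three positive and four negative samples.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])

# Toy local search that returns its starting point unchanged.
best_w, best_score = multi_start(lambda w0: (w0, -sum(v * v for v in w0)),
                                 n_starts=5, dim=3)
```

The double loop costs N_p × N_n comparisons; for large data sets a rank-based computation gives the same value more cheaply.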

Benchmark Data.
To evaluate the performance of the proposed method, experiments were performed on six benchmark datasets, which can be downloaded from http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm. These benchmark datasets have been widely used in model comparison studies in machine learning. They are all binary classification problems, and the datasets were randomly divided into training and test data 100 times to limit bias and overfitting. The data are normalized to zero mean and unit standard deviation. An overview of the datasets is given in Table 2. The computational results for our algorithms, logistic regression, and linear support vector machines (SVM) are given in Figures 2-3. Figures 2-3 show that, in limited experiments, L2 F-measure maximization performs as well as or better than logistic regression and linear SVM. In fact, the test errors for all datasets except Thyroid are competitive with those of the nonlinear classification methods reported by Rätsch (http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm). The inferior performance of L2 F-measure maximization on the Thyroid data indicates strong nonlinear factors in that data.

Real Methylation Data.
... from small cell lung cancer and 46 lines from non-small cell lung cancer. The proportion of positive values for the different regions ranges from 39% to 100% for small cell lung cancer and from 65% to 98% for non-small cell lung cancer. The data are available at http://www-rcf.usc.edu/kims/SupplementaryInfo.html. We use a twofold cross-validation scheme to choose the best λ and test our algorithms. Other cross-validation schemes, such as 10-fold cross validation, lead to similar results but are more computationally intensive. We randomly split the data into two roughly equal-sized subsets, build the classifier with one subset, and test it with the other.
To avoid the bias arising from a particular partition, the procedure is repeated 100 times, each time splitting the data randomly into two folds and performing the cross validation. The average computational results for different values of γ and λ = 0.05 are given in Table 3. Table 3 shows the selected variables (1: selected; 0: not selected), sensitivity, specificity, test errors, and AUC values for different values of γ. We can see clearly that the sensitivity increases while the specificity decreases as γ increases. When γ = 0.9, every example is classified as positive. The best γ is 0.4 according to AUC but 0.2 according to test error, so again there is some inconsistency between the two measures. Figure 4 gives some insight into how to choose λ and the number of features. Given γ = 0.4, the optimal λ is 0.04, and the 5 of 7 CpG regions selected by L1 F-measure maximization have been shown to be predictive of lung cancer subtype (Siegmund et al. [18]). The performance of the model improves by roughly 6% in AUC and 3% in test error with only 5 instead of 7 CpG regions.
Table 4 shows how the model performance changes with different values of γ. Generally, the number of false negatives (FN) decreases and the number of false positives (FP) increases as γ increases. The only exception is γ = 0.1, where both FN and FP are at their worst. The best performance is achieved when γ ∈ [0.7, 0.8], according to both AUC and the number of misclassified samples.
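The repeated twofold cross-validation scheme described for the methylation data can be sketched in Python as follows. This is a minimal illustration: the generator only produces index splits, and `evaluate` is a hypothetical callback that would train a classifier on the training indices and return a score (for example, the test error or AUC) on the test indices. Using each half once for training and once for testing is an assumption about how the two folds are used.

```python
import random


def repeated_twofold_splits(n_samples, n_repeats=100, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated twofold CV."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2                    # two roughly equal-sized subsets
    for _ in range(n_repeats):
        rng.shuffle(indices)                 # new random partition each repeat
        fold_a, fold_b = indices[:half], indices[half:]
        yield fold_a, fold_b                 # train on A, test on B
        yield fold_b, fold_a                 # train on B, test on A


def average_score(evaluate, n_samples, n_repeats=100, seed=0):
    """Average a user-supplied score over all splits."""
    scores = [evaluate(train, test)
              for train, test in repeated_twofold_splits(n_samples, n_repeats, seed)]
    return sum(scores) / len(scores)


# Toy usage: the "score" is just the test-fold size.
splits = list(repeated_twofold_splits(11, n_repeats=3))
mean_test_size = average_score(lambda train, test: len(test), 11, n_repeats=3)
```

Averaging over many random partitions reduces the dependence of the reported error on any single split, which is the stated reason for repeating the procedure.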

High Dimensional Microarray Data.
The 10 genes selected are given in Table 5. The selected genes allow the separation of cancer from normal samples in the gene expression map. Some genes were selected because their activities resulted in the difference in the [...] of colon cancer because these tissue types shared similarity. Our method is supported by the meaningful biological interpretation of the selected genes. For instance, three muscle-related genes (H20709, T92451, and J02854) were selected from the colon cancer data, reflecting the fact that normal colon tissue has a higher muscle content, whereas colon cancer tissue has a lower muscle content (biased toward epithelial cells). The selection of the x12671 ribosomal protein gene agrees with the observation that ribosomal protein genes have lower expression in normal than in cancerous colon tissue.

Conclusions and Remarks
We have presented a novel regularized F-measure maximization method for feature selection and classification. This technique directly maximizes the tradeoff between specificity and sensitivity. Regularization with L2 and L1 penalties allows the algorithm to converge quickly and to perform simultaneous feature selection and classification. We found that, in limited experiments, it performs as well as or better than other popular classifiers.
The proposed method can incorporate nonstandard tradeoffs between sensitivity and specificity through different values of γ. It is well suited to unbalanced data or data with missing negative (positive) samples. For instance, in the problem of gene function prediction, the available information concerns positive samples only. In other words, we know which genes have the function of interest, while it is generally unclear which genes do not have the function. Most standard classification methods will fail in this setting, but our method can train the model with only positive labels by setting γ = 1.
One difficulty with regularized F-measure maximization is the nonconcavity of the objective function. We used random multiple-point initialization to search for the optimal solutions. More efficient algorithms for nonconcave optimization will be considered to speed up the computations. Applications of the proposed method to gene function prediction and other problems will be explored in the future.