Regularization in Retrieval-Driven Classification of Clustered Microcalcifications for Breast Cancer

We propose a regularization-based approach for case-adaptive classification in computer-aided diagnosis (CAD) of breast cancer. The goal is to improve the classification accuracy on a query case by making use of a set of similar cases retrieved from an existing library of known cases. In the proposed approach, a prior is first derived from a traditional CAD classifier (which is typically pre-trained offline on a set of training cases). It is then used together with the retrieved similar cases to obtain an adaptive classifier on the query case. We consider two different forms for the regularization prior: one is fixed for all query cases and the other is allowed to vary with different query cases. In the experiments the proposed approach is demonstrated on a dataset of 1,006 clinical cases. The results show that it could achieve significant improvement in numerical efficiency compared with a previously proposed case-adaptive approach (by about an order of magnitude) while maintaining similar (or better) improvement in classification accuracy; it could also adapt faster in performance with a small number of retrieved cases. Measured by the area under the ROC curve (AUC), the regularization-based approach achieved AUC = 0.8215, compared with AUC = 0.7329 for the baseline classifier (P-value = 0.001).


Introduction
Clustered microcalcifications (MCs) can be an important early sign of breast cancer in women. MCs are calcium deposits of very small dimension and appear as granular bright spots in a mammogram (e.g., Figure 1). Due to their subtlety in appearance and variation in size and shape in mammogram images, accurate diagnosis of MC lesions as benign or malignant is a very challenging clinical problem for radiologists [1]. In recent years, there has been significant research in development of computer-aided diagnosis (CADx) techniques for clustered MCs, aiming to provide a second opinion to radiologists in their diagnosis in order to improve their performance and efficiency [1][2][3]. Laboratory observer studies have shown that with CADx radiologists can improve their biopsy recommendation by sending more cancer cases and fewer benign cases for biopsy [2][3][4][5].
In CADx, a pattern classifier is typically first pretrained on a set of existing cases and subsequently applied to predict the likelihood that a given lesion is malignant or benign. For this purpose, many different machine-learning methods have been investigated, for example, [6][7][8][9]. In recent years, content-based image retrieval (CBIR) has been studied as an alternative approach in CADx [10][11][12]. Instead of predicting likelihood, this approach is to provide radiologists with examples of lesions with known pathology that are similar to the lesion being evaluated. The purpose is to provide relevant information from the retrieved cases to boost the diagnostic accuracy on the case under consideration [13]. In the literature, there exist a number of studies on the predictive value of retrieved mammogram cases. For example, the correlation in disease condition between the query and retrieved cases was examined in [14,15]. The fraction of malignant cases among all retrieved cases was used as a useful predictor for the query by Floyd et al. [16,17]. The similarity level between a retrieved case and the query was used as a weighting factor in the prediction by Zheng et al. [18]. A genetic algorithm was used to adjust the 2 International Journal of Biomedical Imaging weighting factors of the retrieved cases by Mazurowski et al. [19]. An observer study was used to investigate the potential diagnostic value of similar cases by Nakayama et al. [20]. Recently, we have been exploring a case-adaptive approach to boost the performance of a CADx classifier by using retrieved similar cases [21,22]. 
The basic idea behind this approach is as follows: for a query case under consideration, we will first apply CBIR to obtain a set of similar cases from a reference library of known cases; we will then use these retrieved cases to modify the decision boundary of an existing classifier (baseline) in the neighborhood of the query case so as to improve its classification accuracy on the latter. In particular, we demonstrated this case-adaptive approach on a classifier based on logistic regression [22]. The adaptive classifier was obtained through retraining with a weighted mixture of the retrieved cases and the training cases of the baseline classifier. This adaptive approach was demonstrated to yield improved classification accuracy when compared to its baseline counterpart. Moreover, it could even outperform the classifier when it was retrained with all the cases in the reference library. Based on this prior success, in this work, we further develop this case-adaptive classification approach by using a regularized adaptive classifier. One drawback of the adaptive classifier in [22] is the extra cost associated with retraining of the classifier for each query case. It also needs access to the training cases of the baseline classifier, which are required for retraining the classifier. To reduce this computational complexity associated with the adaptive classifier, we will use a prior to regularize the adaptive classifier as opposed to reusing the entire training set of the baseline classifier each time for a new query. This prior is derived from the baseline classifier, and it plays the following two key roles: (1) incorporate the information of the baseline classifier, and (2) prevent overfitting by the adaptive classifier when the number of retrieved samples is small. We will consider two specific forms for this prior: one is uniform for all query cases, and the other varies adaptively with the query. 
Our results demonstrate that such a regularized adaptive classifier not only can be much simpler computationally, but also can adapt faster in performance with a small number of retrieved cases.
Regularization techniques are often used in machine learning to deal with ill-posed problems or to prevent overfitting by an underlying model. They usually assume the form of a penalty to the complexity of the model, such as the $L_2$-norm penalty in ridge regression [23], which penalizes the length of the solution in a least-squares problem. In the well-known support vector machine (SVM) [24], the separation margin of the classifier is used in the form of an $L_2$-norm penalty term. Parallel to the $L_2$-norm, the $L_1$-norm penalty has also been used for regularization, for example, in the LASSO algorithm [25]. In this work, we will derive from the baseline classifier a regularization term for adaptive classification. The regularization term has the form of an $L_2$-norm penalty, which can also be viewed as a prior distribution of the solution.
The rest of the paper is organized as follows: The development of the adaptive classification schemes with regularization is given in Section 2. Details related to evaluation methods on CADx classification performance are described in Section 3. Experimental results and discussions are furnished in Section 4. Finally, conclusions are given in Section 5.

Regularized Adaptive Classification with Retrieval of Similar Cases
The problem we consider can be stated as follows: for a given query lesion x, we first obtain from a reference library a set of known cases which have similar image features to x; our goal is to make use of these similar, known cases to improve the classification accuracy on x. To motivate the proposed development, below we first briefly review the case-adaptive approach developed previously in [22]. For simplicity, our approach will be presented using a linear classifier. However, it can be readily extended to a nonlinear classifier by using the kernel trick as in [22].

Adaptive Classification Boosted with Similar Cases.
Consider a linear classifier of the form

$$f(x) = w^{T} x, \tag{1}$$

where $x$ is a vector denoting an input pattern (i.e., lesion), and $f(x)$ is the classifier output, which is typically compared against an operating threshold for decision on $x$. For notational simplicity, in (1) the input vector $x$ is augmented by a constant element 1 so that the bias term is absorbed into the discriminant vector $w$.
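In practice, the unknown vector $w$ is determined from a set of training samples $\{(x_i, y_i), i = 1, \ldots, N\}$, where the labels $y_i \in \{0, 1\}$ are given for each sample $x_i$. In [22], we considered logistic regression [26], in which $w$ is determined by maximizing the following log-likelihood function:

$$L(w) = \sum_{i=1}^{N} \left[ y_i \log p(y_i = 1 \mid x_i; w) + (1 - y_i) \log\bigl(1 - p(y_i = 1 \mid x_i; w)\bigr) \right], \tag{2}$$

where $p(y_i = 1 \mid x_i; w) = 1/[1 + \exp(-w^{T} x_i)]$.
Now, consider a query lesion $x$, and a set of $N_r$ retrieved cases $\{(x^{(r)}_j, y^{(r)}_j), j = 1, \ldots, N_r\}$. The adaptive classifier for $x$ is obtained by modifying the objective function (2) as

$$L_{\mathrm{Ada}}(w) = L(w) + \sum_{j=1}^{N_r} \beta_j \left[ y^{(r)}_j \log p(y^{(r)}_j = 1 \mid x^{(r)}_j; w) + (1 - y^{(r)}_j) \log\bigl(1 - p(y^{(r)}_j = 1 \mid x^{(r)}_j; w)\bigr) \right], \tag{3}$$

where the weighting factors $\beta_j$ are defined according to the similarity of $x^{(r)}_j$ to the query $x$ [22]. These factors are larger than 1 in magnitude. The idea is to put more emphasis on the retrieved samples, particularly those more similar to the query, so as to refine the decision boundary of the classifier in the neighborhood of $x$. For retrieval of similar cases, the Euclidean distance between their image features and those of the query was used in [22], and it is also used in this work.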
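The retrieval step described above can be illustrated with a minimal sketch. It assumes the cases are stored as rows of a NumPy feature matrix; the function name `retrieve_similar` and the toy data are hypothetical and are not taken from [22].

```python
import numpy as np

def retrieve_similar(query, library_X, library_y, n_retrieved):
    """Return the n_retrieved library cases nearest to the query (Euclidean)."""
    # Euclidean distance from the query to every library case.
    dists = np.linalg.norm(library_X - query, axis=1)
    # Indices of the closest cases, nearest first.
    order = np.argsort(dists, kind="stable")[:n_retrieved]
    return library_X[order], library_y[order], dists[order]

# Toy library (hypothetical): 5 cases with 2 features each, labels 0/1.
library_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.5, 0.5]])
library_y = np.array([0, 1, 0, 1, 1])
query = np.array([0.4, 0.4])

X_r, y_r, d_r = retrieve_similar(query, library_X, library_y, n_retrieved=3)
print(y_r, d_r)
```

The returned distances can then be mapped to the weighting factors of the retrieved cases.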

Regularized Adaptive Classification with Uniform Prior.
Observe that the objective function in (3) consists of two terms: the first term is that of the baseline classifier in (2), and the second term is the weighted sum of the log-likelihood of the retrieved cases. Conceptually, the first term can be viewed as a stabilizer for the adaptive classifier to avoid overfitting to the retrieved cases (which would lead to poor generalization on the query). However, this term involves all the training cases of the baseline classifier, which can be computationally demanding, particularly when the number of retrieved cases is much smaller than the number of existing training cases, that is, $N_r \ll N$. To address this problem, we propose a regularized approach for designing the adaptive classifier, as illustrated in Figure 2. The idea is to replace the baseline classifier term in (3) by a prior term on the discriminant vector $w$. Naturally, this prior term is desired to be predetermined from the training set $\{(x_i, y_i), i = 1, \ldots, N\}$, so that the resulting adaptive classifier will be computationally more efficient for online implementation.
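Let vector $\hat{w}$ denote the solution of the baseline classifier in (2), that is, the likelihood function $L(w)$ attains its maximum at $\hat{w}$. Noting that the gradient $\nabla L(\hat{w}) = 0$, we can apply a second-order Taylor series expansion about $\hat{w}$ and rewrite $L(w)$ as

$$L(w) \approx L(\hat{w}) + \frac{1}{2} (w - \hat{w})^{T} \nabla^{2} L(\hat{w}) (w - \hat{w}). \tag{4}$$

Thus, we can rewrite the modified objective $L_{\mathrm{Ada}}(w)$ in (3) as (after ignoring the constant term)

$$L_{\mathrm{Ada}}(w) \approx \sum_{j=1}^{N_r} \beta_j \left[ y^{(r)}_j \log p(y^{(r)}_j = 1 \mid x^{(r)}_j; w) + (1 - y^{(r)}_j) \log\bigl(1 - p(y^{(r)}_j = 1 \mid x^{(r)}_j; w)\bigr) \right] + \frac{1}{2} (w - \hat{w})^{T} \nabla^{2} L(\hat{w}) (w - \hat{w}). \tag{5}$$

The second term in (5) can be viewed as a penalty term defined by a multivariate Gaussian prior which has mean $\hat{w}$ and covariance matrix $[-\nabla^{2} L(\hat{w})]^{-1}$. Consequently, the objective function in (5) corresponds to a maximum a posteriori (MAP) criterion for $w$ based on the retrieved cases (in which the retrieved cases are weighted according to their similarity level to the query). Note that the Hessian matrix $\nabla^{2} L(\hat{w})$ in (5) can be precomputed from the likelihood function $L(w)$ of the training samples. By comparing to (3), we can see that the numerical complexity associated with the objective function in (5) is much reduced, because it consists of far fewer data terms than (3) when $N_r \ll N$. Furthermore, there is no longer any need in (5) to access the training cases, which can be advantageous in practice.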
To further reduce the computational complexity of the adaptive classifier, in this study we assume that the components of $w$ are independent and approximate the covariance matrix $[-\nabla^{2} L(\hat{w})]^{-1}$ in (5) by $C^{-1} I$, where $C$ is a constant and $\hat{w}$ is the solution of the baseline classifier in (2). Upon such approximation, we can further simplify the objective function in (5) as

$$L_{\mathrm{Uni}}(w) = \sum_{j=1}^{N_r} \beta_j \left[ y^{(r)}_j \log p(y^{(r)}_j = 1 \mid x^{(r)}_j; w) + (1 - y^{(r)}_j) \log\bigl(1 - p(y^{(r)}_j = 1 \mid x^{(r)}_j; w)\bigr) \right] - \frac{C}{2} \lVert w - \hat{w} \rVert^{2}. \tag{6}$$

The constant $C$ in (6) can be viewed as a parameter to control the influence of the regularization term, which has two important roles. First, it is used to prevent overfitting by the adaptive classifier, especially when $N_r$ is small. Second, and more importantly, it is also used to enforce the fidelity of the adaptive classifier to the training cases $\{(x_i, y_i), i = 1, \ldots, N\}$ as in (3). Consequently, $L_{\mathrm{Uni}}(w)$ in (6) consists of information from both the retrieved cases and the existing training cases. In particular, in the extreme case that $C = 0$, the objective function $L_{\mathrm{Uni}}(w)$ in (6) simply amounts to retraining the classifier with only the retrieved cases; on the other hand, when $C = \infty$, the adaptive classifier in (6) coincides with the baseline classifier $\hat{w}$.
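To make the role of the penalty concrete, the following is a minimal sketch of maximizing an objective of this form (weighted logistic log-likelihood of the retrieved cases minus $(C/2)\lVert w - \hat{w} \rVert^{2}$) with plain Newton steps. It is not the authors' appendix algorithm. The function name, the toy data, and the weights `beta` are hypothetical, and the weights are simply supplied as numbers larger than 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_regularized_adaptive(X_r, y_r, beta, w_base, C, n_iter=50, tol=1e-8):
    """Maximize sum_j beta_j * loglik_j(w) - (C/2) * ||w - w_base||^2 by Newton steps."""
    w = w_base.copy()
    for _ in range(n_iter):
        p = sigmoid(X_r @ w)
        # Gradient of the penalized, weighted log-likelihood.
        grad = X_r.T @ (beta * (y_r - p)) - C * (w - w_base)
        # Negative Hessian; positive definite whenever C > 0.
        neg_hess = (X_r * (beta * p * (1.0 - p))[:, None]).T @ X_r + C * np.eye(w.size)
        step = np.linalg.solve(neg_hess, grad)
        w = w + step
        if np.linalg.norm(step) < tol:
            break
    return w

# Toy retrieved cases (hypothetical); the last column is the constant 1 for the bias.
X_r = np.array([[0.5, 1.0], [1.5, 1.0], [-1.0, 1.0], [2.0, 1.0]])
y_r = np.array([0.0, 1.0, 0.0, 1.0])
beta = np.array([1.5, 1.2, 1.1, 1.8])   # assumed similarity weights
w_base = np.array([0.2, -0.1])          # assumed baseline solution
C = 1.0

w_ada = fit_regularized_adaptive(X_r, y_r, beta, w_base, C)
print(w_ada)
```

Each Newton step solves a system whose size is the feature dimension and touches only the retrieved cases, which is where the computational saving over retraining on the full training set comes from. With a very large `C` the solution stays at `w_base`.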
In this study, the weighting coefficient $\beta_j$ for a retrieved case $x^{(r)}_j$ in (6) is defined by (7) as a function of the distance between $x^{(r)}_j$ and $x$. The parameter $\sigma$ in (7) is used to adjust the sensitivity of $\beta_j$ with respect to this distance.
In our experiments, the Newton-Raphson algorithm was used for optimization of the adaptive classifier in (6). For completeness, the detailed algorithm is provided in the appendix.

Adaptive Classification with Varying Regularization.
In (6), the regularization term is the same for all query cases, the purpose of which is to keep the adaptive classifier from being too different from the baseline classifier. As an alternative, it might be advantageous to adjust this term according to the input feature of the query case. Below, we consider such an approach in which the mean vector in the regularization term is allowed to vary for each query case. That is, we modify the objective function in (6) as

$$L_{\mathrm{Var}}(w) = \sum_{j=1}^{N_r} \beta_j \left[ y^{(r)}_j \log p(y^{(r)}_j = 1 \mid x^{(r)}_j; w) + (1 - y^{(r)}_j) \log\bigl(1 - p(y^{(r)}_j = 1 \mid x^{(r)}_j; w)\bigr) \right] - \frac{C}{2} \lVert w - w_{\mathrm{Var}} \rVert^{2}, \tag{8}$$

where $w_{\mathrm{Var}}$ now varies with the query case $x$.
To determine the mean vector $w_{\mathrm{Var}}$, we first predetermine an adaptive vector $w_i$ customized for each case in the training set, that is, $\{(x_i, y_i), i = 1, \ldots, N\}$, as described below. Afterward, for a given query $x$, the mean vector $w_{\mathrm{Var}}$ is interpolated from the adaptive vectors of the training cases according to their distances to the query. Specifically, we have

$$w_{\mathrm{Var}} = \sum_{i=1}^{N} c_i w_i, \tag{9}$$

where the weighting coefficients $c_i$ are so defined that those cases closer to $x$ will have more contributions to $w_{\mathrm{Var}}$; the specific form of $c_i$ used in this study is given in (10).
To determine the adaptive vectors $w_i$ for the cases in the training set $\{(x_i, y_i), i = 1, \ldots, N\}$, we modify the objective function in (2) by introducing a penalty term, as given in (11). The rationale for the introduced penalty term in (11) is that those cases with similar features should also have similar discriminant vectors. The parameter $C'$ is used to control the trade-off between this penalty and the likelihood term. In (11), $\bar{w}_i$ denotes the weighted average of the adaptive vectors from the rest of the cases as in (9), with $x_i$ treated as the query in (10).
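The interpolation of a query-dependent mean vector from per-case adaptive vectors can be sketched as follows. The weighting used here, a Gaussian kernel of the squared Euclidean distance normalized to sum to one, is an assumption made purely for illustration and is not claimed to be the form used in (10). The function name and toy data are likewise hypothetical.

```python
import numpy as np

def interpolate_prior_mean(query, train_X, train_W, sigma):
    """Weighted combination of per-case vectors; closer cases get larger weights."""
    d2 = np.sum((train_X - query) ** 2, axis=1)
    # Assumed Gaussian-kernel weights; subtracting d2.min() avoids underflow
    # and cancels out after normalization.
    c = np.exp(-(d2 - d2.min()) / sigma**2)
    c /= c.sum()
    return c @ train_W, c

# Toy training cases (2 features) and their adaptive vectors (3 components each).
train_X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
train_W = np.eye(3)
query = np.array([1.0, 1.0])

w_near, c_near = interpolate_prior_mean(query, train_X, train_W, sigma=0.1)
w_flat, c_flat = interpolate_prior_mean(query, train_X, train_W, sigma=1e6)
print(w_near, w_flat)
```

With a small kernel width the result is dominated by the nearest training case, and with a very large width it approaches the plain average of all adaptive vectors.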

Dataset.
In this study, we use the same dataset as in our previous work [22]. This dataset consists of digitized, standard-view, screen-film mammographic images collected from two sources: one from the Department of Radiology, The University of Chicago (UC), and the other from the DDSM dataset maintained at The University of South Florida [27]. Altogether, there were a total of 1,006 cases (646 benign, 360 malignant) in the dataset, all containing clustered MCs. To characterize the MC lesions, we use a set of nine features previously determined in [22].

Experiment Setup.
To demonstrate the proposed approach for case-adaptive classification, we used the following setting in our experiments. The dataset of all 1,006 cases was first randomly divided into three subsets, denoted by $S_1$, $S_2$, and $S_3$, respectively, such that $S_1$ and $S_2$ consisted of 175 cases (100 benign, 75 malignant) each, and $S_3$ had the remaining 656 cases (446 benign, 210 malignant). These three subsets were used as follows: $S_1$ was used as the training set, $S_2$ was used as the test set for performance evaluation, and $S_3$ was set aside as a library of known cases for retrieval for adaptive classification. The cases were distributed among the three subsets in this way so as to balance the malignant and benign cases for both training and testing while maintaining a large number of cases for retrieval.
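A random split with these class counts can be sketched as below. The label vector is synthetic: only the counts (646 benign, 360 malignant, and 100 benign plus 75 malignant in each of the first two subsets) come from the text. The function name and the random seed are hypothetical.

```python
import numpy as np

def split_cases(labels, n_benign, n_malignant, rng):
    """Split case indices into S1, S2 (fixed class counts each) and S3 (the rest)."""
    benign = rng.permutation(np.flatnonzero(labels == 0))
    malignant = rng.permutation(np.flatnonzero(labels == 1))
    s1 = np.concatenate([benign[:n_benign], malignant[:n_malignant]])
    s2 = np.concatenate([benign[n_benign:2 * n_benign],
                         malignant[n_malignant:2 * n_malignant]])
    s3 = np.concatenate([benign[2 * n_benign:], malignant[2 * n_malignant:]])
    return s1, s2, s3

# Synthetic labels with the stated class counts: 0 = benign, 1 = malignant.
labels = np.array([0] * 646 + [1] * 360)
rng = np.random.default_rng(0)
s1, s2, s3 = split_cases(labels, n_benign=100, n_malignant=75, rng=rng)
print(len(s1), len(s2), len(s3), int(labels[s3].sum()))
```

The three index sets are disjoint by construction, so no test case in the second subset can appear in the retrieval library.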
To avoid any potential bias, the training set $S_1$ was used to determine the classifier parameters $C$ in (6) and (8) and $C'$ in (11) from the following candidate values: [0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 5, 10, 100] using 10-fold cross-validation. For the parameter $\sigma$, we follow our previous work [22] and set it to 1.63, which corresponds to the 10th percentile of the pairwise distances among the training cases. The test set $S_2$ was used exclusively for evaluation. When testing the adaptive classifiers, for each case in $S_2$, a set of cases similar to the test case was retrieved from $S_3$ and subsequently used to train the adaptive classifiers. The resulting classifiers were then applied to classify the test case. This was to ensure that the test case itself was not used in any way for boosting the adaptive classifier.
To evaluate the classification performance, we conducted a receiver operating characteristic (ROC) analysis, which is now routinely used for performance evaluation in classification tasks. An ROC curve is a plot of the classification sensitivity (i.e., true positive fraction) as the ordinate versus one minus the specificity (i.e., false positive fraction) as the abscissa; for a given classifier, it is obtained by continuously varying the threshold associated with its decision function. As a summary measure of overall diagnostic performance, the area under an ROC curve (denoted by AUC) is used. A larger AUC means better classification performance. In our experiments the ROCKIT program [28] was used to calculate the AUC values for the different classifiers.
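For readers who want a quick check of an AUC value, the nonparametric (Mann-Whitney) estimate can be computed directly from classifier scores, as in the sketch below. This is not the ROCKIT procedure used in the paper, so the two may give somewhat different values. The function name and toy scores are hypothetical.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Toy example: 2 benign (label 0) and 2 malignant (label 1) cases.
scores = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0, 0, 1, 1])
auc = empirical_auc(scores, labels)
print(auc)  # 3 of the 4 positive-negative pairs are ordered correctly: 0.75
```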
To remove the effect of case distributions, we applied a bootstrapping methodology for testing the performance of the classifiers. A total of 2,000 bootstrap sample sets were used [29,30], each of which was obtained by sampling with replacement from the cases in $S_2$. The classifier performance was subsequently obtained over each bootstrap sample set.
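The bootstrap evaluation can be sketched as follows, under several assumptions: the AUC is the nonparametric estimate rather than the ROCKIT value, the scores are synthetic, and only 200 resamples are drawn to keep the example fast (the study used 2,000). All names are hypothetical.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Nonparametric AUC: fraction of correctly ordered positive-negative pairs."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

def bootstrap_auc(scores, labels, n_boot, rng):
    """AUC over n_boot test sets drawn with replacement from the original cases."""
    n = labels.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)      # sample case indices with replacement
        if labels[idx].min() == labels[idx].max():
            continue                          # skip resamples missing one class
        aucs.append(empirical_auc(scores[idx], labels[idx]))
    return np.array(aucs)

# Synthetic test set with the class counts of S2: 100 benign, 75 malignant.
rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 75)
scores = labels + rng.normal(size=labels.size)   # assumed classifier outputs

aucs = bootstrap_auc(scores, labels, n_boot=200, rng=rng)
print(aucs.mean(), np.percentile(aucs, [2.5, 97.5]))
```

The spread of the resulting values gives an indication of how much a reported AUC depends on the particular composition of the test set.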
In our evaluation, the proposed adaptive approach was compared against the following classifiers: (1) the baseline classifier in (2), trained with $S_1$; and (2) the adaptive classifier in (3). In addition, to demonstrate the effect of regularization, we also tested the adaptive classifier without using regularization, that is, by ignoring the regularization term in (6). Finally, we also considered the classifier trained with both $S_1$ and $S_3$, which represents the scenario where all the cases in the retrieval library were used for training the classifier.

Regularized Adaptive Classification.
In Figure 3, we show the performance results obtained by the regularized adaptive classifiers with uniform prior in (6) (Ada-Reg-Uni) and with adaptive prior in (8) (Ada-Reg-Var). To demonstrate the effect of retrieved cases, the results are shown for the number of retrieved cases $N_r$ varied from 6 to 300. For comparison, results are also given in Figure 3 for the following classifiers: the baseline classifier (LR), the adaptive classifier previously developed in [22] (Ada-LR), the adaptive classifier without regularization (Ada-Reg-Non), and, finally, the classifier trained with all the cases in $S_1$ and $S_3$ (LR-all).
From Figure 3, it can be seen that the regularized adaptive classifiers Ada-Reg-Uni and Ada-Reg-Var both could outperform the baseline classifier LR. In particular, with $N_r = 50$, Ada-Reg-Uni achieved AUC = 0.8111 and Ada-Reg-Var achieved AUC = 0.8059, compared with AUC = 0.7329 for the baseline classifier LR (P-value = 0.001 for Ada-Reg-Uni, and 0.004 for Ada-Reg-Var). With $N_r = 100$, Ada-Reg-Uni obtained its best performance of AUC = 0.8215; Ada-Reg-Var achieved its best performance of AUC = 0.8192 with $N_r = 200$. However, no further improvement was observed when $N_r$ was increased beyond 200. We believe that this is because the benefit from additional retrieved cases diminishes as they are not sufficiently similar to the query. Furthermore, from Figure 3, it can be seen that when $N_r < 50$ Ada-Reg-Var and Ada-Reg-Uni are both higher in AUC than Ada-LR. This indicates that Ada-Reg-Var and Ada-Reg-Uni could adapt faster to the local decision boundary with a small number of retrieved cases. This could be attributed to the use of the prior in the regularized adaptive classifiers. With $N_r$ further increased (above 100), Ada-Reg-Uni and Ada-Reg-Var became similar in performance; this is because with a large $N_r$ the retrieved cases became more influential than the prior on the classifier. The respective effects of retrieved cases and regularization can be illuminated by examining the results achieved by Ada-Reg-Non, that is, when no regularization was used in the adaptive classifier. With $N_r < 50$, Ada-Reg-Non was much lower in performance than even the baseline classifier LR; this was clearly due to the issue of overfitting. However, with increased $N_r$, its AUC improved from 0.7633 with $N_r = 50$ to 0.7927 with $N_r = 200$, approaching its regularized counterparts.
Furthermore, the regularized adaptive classifiers Ada-Reg-Uni and Ada-Reg-Var could also outperform the classifier LR-all (AUC = 0.7643), which was trained with all the available cases in $S_1$ and $S_3$. Specifically, Ada-Reg-Uni and Ada-Reg-Var outperformed LR-all with P-value = 0.004 ($N_r = 100$) and P-value = 0.005 ($N_r = 200$), respectively.

Effect of Regularization.
The rationale behind the proposed regularization-based approach for adaptive classification is to use a prior to regularize the adaptive classifier in order to prevent it from overfitting to the retrieved cases. As can be seen from (6), the regularization parameter $C$ is used to control the balance between the retrieved cases and the baseline classifier. A larger $C$ means more influence of the prior on the adaptive classifier (and less influence by the retrieved cases), and vice versa. To demonstrate this effect, in Figure 4, we show the resulting performance achieved by the classifier Ada-Reg-Uni with the parameter $C$ varied over a large range. The number of retrieved cases $N_r$ was fixed at 100. Note that, as $C \to 0$, the classifier performance approaches that of Ada-Reg-Non (i.e., retrieval only, $N_r = 100$); on the other hand, as $C \to \infty$, the classifier performance approaches that of the baseline classifier LR (i.e., no retrieval). The best performance was obtained with $C$ between these two extreme cases.

Execution Time.
In Figure 5, we show the execution time taken for classifying the cases in the test set by the different adaptive classifiers Ada-Reg-Uni, Ada-Reg-Var, and Ada-LR. For comparison, results are also shown for the baseline classifier LR. Our implementation was in MATLAB on a 2-GHz PC. As can be seen, the regularized classifiers Ada-Reg-Uni and Ada-Reg-Var were similar in execution time. While slower than the baseline classifier LR, both were notably faster than Ada-LR. Specifically, with $N_r < 100$, the regularization-based approaches were about 10 times faster. For $N_r$ larger than 100, the execution time increased for the regularized adaptive classifiers as more samples were used in training, but it remained much lower than that of Ada-LR. Interestingly, the execution time for Ada-LR slightly decreased with $N_r$ larger than 10. We believe that this was due to improved conditioning in the Hessian matrix of the objective function with increased $N_r$. The higher numerical efficiency of the regularized classifiers over the adaptive classifier Ada-LR was due to their much simplified objective functions, in which only retrieved cases were used.

Conclusion
In this work, we investigated a regularization-based approach for case-adaptive classification of microcalcification lesions in mammograms. Deviating from a previously developed adaptive approach, in which a set of retrieved cases was used in conjunction with the training cases of a baseline classifier to retrain an adaptive classifier, we derived a prior in place of the baseline classifier as a regularization term in the adaptive classifier. This prior was used together with the retrieved cases from a reference library to optimize the classification on a query case. Our goal was to reduce the numerical complexity associated with online training of the adaptive classifier. We explored two different forms for the regularization prior: one is invariant for the different query cases and the other is allowed to vary with respect to the features of the query cases. We demonstrated the proposed regularization approach on a dataset of 1,006 cases. The results show that it could achieve significant improvement in numerical efficiency (around 10 times in execution speed) while maintaining similar (or better) improvement in classification accuracy compared to a previous nonregularized approach. The regularization approach was also observed to achieve faster adaptation in performance with a small number of retrieved cases.