Deep Active Learning Framework for Lymph Node Metastasis Prediction in Medical Support System

Assessing the extent of cancer spread by histopathological analysis of sentinel axillary lymph nodes is an important part of breast cancer staging. With the maturity and prevalence of deep learning technology, building auxiliary medical systems can help to relieve the burden on pathologists and increase diagnostic precision and accuracy during this process. However, such histopathological images have complex patterns that are difficult for laypeople to understand and require professional medical practitioners to annotate, which increases the cost of constructing such medical systems. To reduce the annotation cost while improving model performance as much as possible, in other words, to obtain the greatest performance improvement from as few labeled samples as possible, we propose a deep learning framework with a three-stage query strategy and a novel model update strategy. The framework first trains an auto-encoder on all the samples to obtain a global representation in a low-dimensional space. In the query stage, unlabeled samples are first selected according to uncertainty, and then coreset-based methods are employed to reduce sample redundancy. Finally, distribution differences between labeled and unlabeled samples are evaluated, and samples that can quickly eliminate these differences are selected. This method achieves faster iterative efficiency than uncertainty, representative, or hybrid strategies on the lymph node slice dataset and other commonly used datasets. It reaches the performance of training with all data while using only 50% of the labeled samples. During the model update process, we randomly freeze some weights and train the task model only on the newly labeled samples with a smaller learning rate. Compared with fine-tuning the task model on new samples, large-scale performance degradation is avoided.
Compared with the retraining strategy or the replay strategy, it reduces the training cost of updating the task model by 79.87% and 90.07%, respectively.


Introduction
Accurate breast cancer staging is an essential task performed by pathologists worldwide to inform clinical management [1]. The histopathological analysis is the gold standard for precancerous lesion diagnosis, with very high accuracy and reliability. Assessing the extent of cancer spread by the histopathological analysis of sentinel axillary lymph nodes is an important part of breast cancer staging. However, this assessment process is tedious, time-consuming, and error-prone when handled by pathologists. With the development of artificial intelligence technologies and the prevalence of auxiliary medical diagnostic systems based on them [2][3][4][5][6][7][8], developing an auxiliary system for the detection of lymph node metastases in breast cancer is feasible and valuable. It could result in a significant reduction in the workload of pathologists.
The construction of such systems generally relies on supervised learning technology. However, supervised learning requires a large number of labeled samples. Histopathological scans of lymph nodes are complex, as shown in Figure 1. It is not easy to find deterministic features with the human eye, so nonprofessionals can hardly distinguish between positive and negative samples. This complexity means that constructing such diagnostic systems, or other sophisticated medical systems, requires a large number of labeled samples for training and consumes considerable resources, especially precious medical resources, for annotation. When resources are limited, building such a medical system is very challenging. Fortunately, although labeled samples are scarce and labeling is expensive, hospitals currently hold a large number of unlabeled samples, and useful information can be extracted from them.
To alleviate the limitation of insufficient labeled data, researchers have proposed different kinds of methods, including active learning. Active learning [9] is an effective method to solve the problem of lacking labeled data. It is an iterative process that follows three steps. First, the model is trained with a small labeled dataset. Second, the most informative samples are selected from the unlabeled data based on some strategy and sent to human experts for labeling. Then, the model is retrained on the new training data. Samples selected based on specific strategies aim to quickly improve the performance of the original model. The application scenario suits auxiliary medical support systems well: an insufficient initial training set (system builders do not have enough labeled data at first and have to collect or annotate data before system construction), an enormous quantity of unlabeled data (the amount of data preserved in hospital databases is huge but rarely labeled), and expensive annotation (medical images usually need annotation by professional practitioners). Therefore, active learning is applied widely in medical informatics [10][11][12].
At present, most strategies are based on the model's uncertainty about the samples, such as least confidence, margin sampling, and entropy [13]. Compared with blindly spending time and energy on labeling data, active learning can improve model performance with a smaller labeling cost [14].
However, most active learning frameworks have two defects that limit their application. The first is that the selection strategies are not efficient enough: similar samples selected in one batch decrease annotating efficiency. To overcome this shortcoming, we propose a hybrid three-stage selection that reduces the sample redundancy caused by uncertainty-based selection. In addition, this hybrid strategy selects samples that quickly eliminate the distribution difference between labeled and unlabeled data, improving annotation efficiency further. The other defect is that most active learning frameworks rely on retraining to update the task model. This is because it is difficult for neural networks to acquire knowledge incrementally: training new tasks or new data on an old neural network leads to a sharp drop in performance on the original data or tasks.
This phenomenon is called catastrophic forgetting [15]. It is especially serious when the task types or data domains differ greatly. While in the active learning iteration process the data distribution difference between the newly labeled samples and the old labeled samples may be small, it still leads to performance degradation, which is called concept shift [14]. Retraining is a simple way to avoid concept shift but has high time and computation costs, which is an obstacle in some application scenarios. In this study, we investigate a new method that reduces the performance drop and the training cost simultaneously. The main contributions of the study include the following: (i) We constructed a classification system for breast cancer lymph node metastasis prediction based on deep active learning and proposed a new three-stage selection strategy. Different from the traditional uncertainty-based strategy, a diversity strategy is introduced to reduce data redundancy. Meanwhile, distribution differences between labeled and unlabeled samples are measured and reduced. This hybrid strategy obtains higher annotating efficiency than uncertainty-based or diversity-based strategies.
(ii) We explore a new incremental approach for model updating. Different from the general active learning iteration process that uses all the labeled data to retrain the model, we use a freezing and fine-tuning method to ensure that the model acquires new knowledge while reducing the forgetting of the old knowledge. The rest of this study is organized as follows: Section 2 summarizes the research work in related fields, Section 3 introduces the method used in this study, Section 4 presents the experiments and result analysis, and Section 5 concludes.

Related Works
Active learning has been widely combined with deep learning models due to its significant reduction in labeling costs [16][17][18][19]. Yang et al. [10] combined active learning with a fully convolutional neural network for segmentation tasks on lymph node ultrasound images and achieved comparable performance using only 50% of the labeled samples. Smailagic et al. [17] used active learning and convolutional neural networks to classify fundus blood vessel images, melanoma images, and breast cancer pathology images. The experimental results showed that a model combined with the active learning strategy can be trained with only 25% of the labeled data and still achieve an accuracy 6.3% higher than the base model under the same conditions. Zhao et al. [18] used an active learning framework based on the U-Net model to segment hand bone images and used only 43.16% of the labeled samples to achieve the same effect as training with all of them. Zhou et al. [19] used active learning for colonoscopy frame classification, polyp detection, and pulmonary embolism detection, reducing the labeling cost by 82%, 86%, and 80%, respectively. These applications fully demonstrate the effectiveness of active learning.
A typical active learning process [20,21] is composed of a dataset, a model, and experts or oracles for the model to query. The dataset in active learning is generally made up of a small number of labeled samples and a large number of unlabeled samples. The model is first trained on the labeled dataset; then, based on a certain strategy, some samples are selected from the unlabeled data and given to experts for labeling. The newly labeled data are added to the training set for retraining the model. This process iterates until a certain stopping condition is met, such as the performance meeting the requirements or the labeling cost exceeding the budget. The core of active learning is to design a selection strategy so that the labeled samples effectively improve the model performance.
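The loop described above can be sketched as follows. This is a schematic sketch, not the paper's implementation; `train`, `acquire`, and `annotate` are hypothetical placeholders for task-model training, the query strategy, and the expert (oracle) labeling step.

```python
def active_learning_loop(pool, initial_labels, train, acquire, annotate,
                         batch_size=5, budget=20):
    """Generic pool-based active learning loop (illustrative sketch).

    pool           -- list of unlabeled sample ids
    initial_labels -- dict {sample_id: label} seeding the first model
    train          -- callable(labeled) -> model
    acquire        -- callable(model, unlabeled, k) -> list of k sample ids
    annotate       -- callable(sample_id) -> label (stands in for the oracle)
    """
    labeled = dict(initial_labels)
    unlabeled = [x for x in pool if x not in labeled]
    model = train(labeled)
    while unlabeled and len(labeled) - len(initial_labels) < budget:
        # 1) query the most informative samples from the unlabeled pool
        query = acquire(model, unlabeled, min(batch_size, len(unlabeled)))
        # 2) send them to the expert for labeling
        for x in query:
            labeled[x] = annotate(x)
            unlabeled.remove(x)
        # 3) update the task model with the enlarged training set
        model = train(labeled)
    return model, labeled
```

The loop terminates either when the labeling budget is exhausted or when the unlabeled pool is empty, mirroring the stopping conditions mentioned above.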
The classic selection strategy is based on model uncertainty [22,23].
Many researchers have carried out research based on uncertainty. For example, Wang et al. integrated active learning with the training process of deep belief networks for the first time, introduced a loss function specific to active learning tasks, and trained the model to minimize it. Houlsby et al. [24] proposed Bayesian active learning by disagreement (BALD), an uncertainty measure mainly used in Bayesian networks. Gal et al. [25] proposed the MC-dropout method as a proxy for BALD, which obtains model perturbations by turning on dropout during prediction so that BALD uncertainty can be captured in general convolutional networks. Gal et al. [26] validated the effectiveness of the MC-dropout method on high-dimensional image data. William et al. [22] used an ensemble-based method to measure the uncertainty of convolutional neural networks, integrating the results of multiple networks to obtain an uncertainty measure; it is better than the geometry-based method and yields faster performance improvement than the MC-dropout [25] method. Zhao et al. [18] used the output difference between intermediate layers of the network to measure the uncertainty of a convolutional neural network on a segmentation task. In particular, Dice indices are calculated between the final output of the network and the outputs of an earlier layer and a middle layer, and their average is taken as the uncertainty proxy. Experiments show that this proxy and the true Dice index exhibit a significant correlation, so it can be used as an uncertainty measure; that is, the larger the calculated average Dice index, the smaller the uncertainty.
However, uncertainty-based strategies in neural networks generally select a batch of samples at a time. They cannot deal with sample redundancy and often select a batch that contains many similar samples, which reduces labeling efficiency. Therefore, strategies based on representativeness or diversity have been proposed. The representative strategy aims to pick representative samples for annotation so that the model gains a better understanding of the overall data distribution. As shown in Figure 2, the green circles represent class A, and the blue circles represent class B. The size of a circle represents the model's uncertainty about the sample. Generally speaking, the decision boundary in disagreement regions (intersection regions) is complex, so annotating samples in disagreement regions yields a higher performance improvement. The samples selected by the uncertainty-based strategy may be clustered together; for example, samples A, B, and C may be selected based on uncertainty, while A and D may be selected by the representative-based strategy. Samples A and D are more useful for the model to understand the overall data distribution, so they tend to achieve a higher performance improvement. There are many active learning application cases based on representative strategies [27,28].
Rather than using the representative strategy alone, a hybrid strategy combining representative and uncertainty strategies is used more often [29][30][31][32][33][34]. Yang et al. [16] trained a cluster of models by resampling the labeled data, used the output variance of the models to measure uncertainty, and used an intermediate output layer of the convolutional neural network as the representation of each image, with representation similarity serving as the similarity metric between images. Then, a greedy strategy selects batches with small similarity between samples for annotation. Andreas et al. [31] proposed BatchBALD. Different from the general BALD selection strategy, which relies only on the BALD score, BatchBALD selects samples one by one, each time choosing the unlabeled sample with the smallest mutual information with the already-selected samples, so that the sample diversity within a greedily constructed batch is maximized, although there is no guarantee that the selected batch is the most diverse among all possible combinations. Fedor et al. [29] also combined uncertainty and diversity: first, a batch of samples with large uncertainty is selected; then, these samples are clustered and the samples nearest to the class centers are chosen. Experiments on text and image datasets show that this outperforms uncertainty and clustering strategies used alone. Jordan et al. [33] proposed an adaptive gradient embedding method, which uses the gradient magnitude of the last layer of the model to represent uncertainty and accounts for both uncertainty and diversity by embedding samples into the gradient space and clustering them.
The benefit of this approach is that clustering in the gradient space automatically balances uncertainty and diversity without manual tuning of other hyperparameters and thus adapts better to different batch sizes. Zhou et al. [19] used the difference between the classifier outputs for a rotation-augmented image and the original image to measure uncertainty and used the class difference among samples within a batch as the diversity measure; a sampling probability is explicitly calculated before sampling from the unlabeled data.

Method

Figure 3 shows the general process of our proposed framework. We use the proposed three-stage selection strategy, aiming to obtain samples that have large uncertainty and low redundancy and that can quickly eliminate the distribution difference between labeled and unlabeled samples. Each stage focuses on one selection indicator: uncertainty, sample diversity, and the distribution difference between labeled and unlabeled samples. Overall, the selection strategy is still an improvement on uncertainty. Traditional uncertainty-based strategies face the problem of high sample redundancy. As described in Section 2, many works incorporate diversity strategies and balance the weights of the two explicitly or implicitly. On this basis, we add a selection criterion based on the distribution difference between labeled and unlabeled samples. The motivation is that, due to the model's preference for certain data, the distribution difference between labeled and unlabeled samples grows over time, and reducing this difference helps speed up performance improvement. Section 3.1 describes each component in Figure 3 and the overall workflow in detail. Section 3.2 describes the specific implementation of each stage.

Components and Workflow
3.1.1. Task Model. Breast cancer lymph node prediction is a classification problem, and we use a convolutional neural network as the classification model. The breast cancer lymph node image and its category are represented by x and y, respectively, the classification network is represented by M with parameters θ_M, and the predicted class is ŷ = M(x). M is optimized according to the following equation:

θ_M^* = arg min_{θ_M} Σ_{(x,y)∈D^L} l_M(M(x; θ_M), y),  (1)

where l_M(·) is the classification loss of the task model.

Labeled and Unlabeled Datasets.
The labeled set is denoted D^L and the unlabeled set D^U, so the total dataset is D = D^L ∪ D^U. The initial labeled set is D^L_0, the labeled set in the ith round is D^L_i, and the unlabeled set is D^U_i. The goal of active learning is to design a selection strategy Q that selects D^Q_i from D^U_i, where D^Q_i is the set of samples selected and sent to the experts for annotation in the ith iteration. After annotation, D^L_{i+1} = D^L_i ∪ D^Q_i. The selection strategy Q follows the following equation:

D^Q_i = arg min_{D^Q_i ⊆ D^U_i} E_{(x,y)∈D} [ l_M(M_{i+1}(x), y) ],  (2)

where M_{i+1} is the task model trained on D^L_i ∪ D^Q_i and l_M(·) is the loss function of task model M.

Auto-Encoder.
In addition, we need to learn a representation of the global distribution of samples. Embedding the samples into a low-dimensional space is conducive to measuring their representativeness and helps to distinguish whether D^L and D^U come from the same distribution. We use an auto-encoder for this. A well-learned auto-encoder improves the accuracy of the diversity metric and reduces the learning difficulty of the distribution discriminator. The auto-encoder consists of an encoder E and a decoder G, with network parameters θ_E and θ_G, respectively. E is responsible for encoding, z = E(x), and G is responsible for reconstructing the original image from the encoding z. We expect the size of z to be smaller than the size of the original x. The optimization of θ_G and θ_E follows the following expression:

θ_G^*, θ_E^* = arg min_{θ_G, θ_E} Σ_{x∈D} l_AE(G(E(x)), x),  (3)

where l_AE(·) is the loss function of the auto-encoder, generally the mean square error. In (3), the auto-encoder uses all the data (D^L ∪ D^U) for training, without adding any loss terms other than the reconstruction loss. We emphasize this because it ensures that the auto-encoder treats labeled and unlabeled samples fairly, without bias. We can therefore regard the learned low-dimensional variable z as subject to the same distribution on D^L and D^U, although z does not necessarily obey N(0, 1) (in a VAE [35], z is bound to a fixed distribution to facilitate sampling fake data from z; we do not need fake data, so we can focus on optimizing the reconstruction loss regardless of the distribution of the latent variable z).
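As a toy stand-in for the convolutional auto-encoder above, the following sketch trains a linear auto-encoder by gradient descent on the reconstruction loss of equation (3). All names and hyperparameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def train_linear_autoencoder(X, k=2, lr=0.01, epochs=200, seed=0):
    """Toy linear auto-encoder trained on ALL samples (labeled + unlabeled).

    X -- (n, d) data matrix; k -- embedding dimension (k < d).
    Minimizes l_AE = mean ||X - X @ We @ Wd||^2, i.e. only the
    reconstruction loss, with no constraint on the latent distribution.
    Returns encoder/decoder weights and the loss history.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    We = rng.normal(0, 0.1, (d, k))   # encoder: z = x @ We
    Wd = rng.normal(0, 0.1, (k, d))   # decoder: x_hat = z @ Wd
    losses = []
    for _ in range(epochs):
        Z = X @ We                    # encode
        Xh = Z @ Wd                   # reconstruct
        R = Xh - X                    # residual
        losses.append(float(np.mean(R ** 2)))
        # gradients of the mean-squared reconstruction error
        gWd = Z.T @ R * (2.0 / (n * d))
        gWe = X.T @ (R @ Wd.T) * (2.0 / (n * d))
        We -= lr * gWe
        Wd -= lr * gWd
    return We, Wd, losses
```

In the framework, the learned encoder plays the role of E, producing the embedding z used later by the diversity metric and the discriminator.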

Discriminator.
The discriminator D is used to measure the distribution difference between D^U and D^L during each iteration. It receives z as input and outputs whether the sample belongs to D^U or D^L. This is a self-supervised process requiring no labeling. The discriminator follows a general classification neural network architecture.

Doctors (Oracle).
After completing the data selection, professional personnel are needed for annotation. In the breast cancer lymph node classification problem, this role is generally filled by doctors. By annotating new samples, they help the model acquire new knowledge and improve performance. The biggest advantage of active learning is to reduce the number of annotations in situations that need professional but expensive annotation, thereby reducing the cost of building task models. In the experiment section, annotation by doctors is simulated by database queries.

Proposed Query Strategy.
The query strategy is the core of active learning. We have designed a three-stage active learning selection strategy. The entire selection process is marked with red arrows in Figure 3 and is divided into 5 steps, marked ①–⑤. In the ith iteration, we first use D^L_i to train the task model M and then calculate the uncertainty of D^U_i according to M, denoted unc(D^U_i, M) (Step 1), where unc(·) is the uncertainty metric.
Samples with large uncertainty are selected from D^U_i and recorded as x_batch1 (Step 2); these samples have high uncertainty but may be similar, as described in Section 2. Next, the representativeness of x_batch1 is evaluated, and the most representative samples are selected and recorded as x_batch2 (Step 3), a subset of x_batch1. Then, we encode x_batch2 with the pretrained encoder E to obtain E(x_batch2), and the discriminator D is used to evaluate the distribution difference, giving D(E(x_batch2)) (Step 4). x_batch3 is obtained by sorting D(E(x_batch2)) and is the final selected D^Q_i. After querying its labels (Step 5), it is merged with the existing labeled dataset D^L_i. The entire query process is then complete.
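The three stages above compose naturally into a single pipeline. The sketch below shows only the orchestration; `uncertainty`, `diversify`, and `disc_score` are hypothetical placeholders for the margin-sampling score, the coreset selection, and the discriminator output described in the following subsections.

```python
def three_stage_query(unlabeled, uncertainty, diversify, disc_score,
                      n1, n2, n3):
    """Three-stage selection (a schematic sketch of Steps 1-5).

    unlabeled   -- list of candidate sample ids
    uncertainty -- callable(ids) -> scores (higher = more uncertain)
    diversify   -- callable(ids, k) -> k ids with low redundancy (coreset)
    disc_score  -- callable(ids) -> discriminator outputs (lower = more
                   'unlabeled-like', i.e. larger distribution difference)
    n1 > n2 > n3 are the per-stage batch sizes.
    """
    # Stage 1: keep the n1 most uncertain samples -> x_batch1
    scores = uncertainty(unlabeled)
    ranked = sorted(zip(unlabeled, scores), key=lambda p: -p[1])
    batch1 = [x for x, _ in ranked[:n1]]
    # Stage 2: reduce redundancy with a coreset selection -> x_batch2
    batch2 = diversify(batch1, min(n2, len(batch1)))
    # Stage 3: keep the samples the discriminator deems most unlabeled-like
    d = disc_score(batch2)
    ranked2 = sorted(zip(batch2, d), key=lambda p: p[1])
    return [x for x, _ in ranked2[:n3]]   # x_batch3 = D_i^Q
```

Because each stage filters the output of the previous one, the expensive diversity and discriminator computations run only on progressively smaller candidate sets.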

Uncertainty.
The first stage of the strategy selects based on uncertainty. The uncertainty-based query strategy is the most basic and most commonly used. Deep active learning is active learning based on deep learning models and thus involves measuring uncertainty in neural networks. A natural idea is to regard the output of the neural network as a probability distribution, from which a variety of uncertainty measures are derived, such as least confidence, entropy, margin sampling, and the BALD method.
Assume that the probability that sample i belongs to category c is p_c, and C is the set of all categories. For least confidence, the uncertainty is measured according to the following equation:

unc_LC(i) = 1 − max_{c∈C} p_c.  (4)

However, neural networks tend to be overconfident in their predictions; therefore, confidence-based methods are not ideal.
Entropy-based uncertainty is calculated from the entropy of the output probability distribution of the neural network:

unc_EN(i) = − Σ_{c∈C} p_c log p_c.  (5)

Margin sampling uncertainty is calculated from the probability difference between the class with the largest confidence and the class with the next largest confidence:

unc_MS(i) = p_{c_1} − p_{c_2},  (6)

where c_1 = arg max_{c∈C} p_c and c_2 = arg max_{c∈C\{c_1}} p_c; a smaller margin indicates a higher uncertainty. BALD uncertainty is measured by keeping the dropout layers active during prediction and performing multiple stochastic forward passes:

unc_BALD(i) = H((1/T) Σ_{t=1}^{T} p^t) − (1/T) Σ_{t=1}^{T} H(p^t),  (7)

where H(·) is the entropy, T is the total number of predictions, and p^t_c is the probability that sample i belongs to c in the tth prediction. Since multiple predictions are required, it often incurs a large time expense.
In this study, we use uncertainty based on margin sampling.
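The four measures discussed above can be sketched in a few lines of numpy; this is an illustrative implementation of the standard formulas, not code from the paper.

```python
import numpy as np

def least_confidence(p):
    """1 - max_c p_c; higher = more uncertain. p: (n, C) probabilities."""
    return 1.0 - p.max(axis=1)

def entropy(p, eps=1e-12):
    """-sum_c p_c log p_c; eps guards against log(0)."""
    return -(p * np.log(p + eps)).sum(axis=1)

def margin(p):
    """Difference between the two largest class probabilities.
    A SMALL margin means HIGH uncertainty (the measure used in this study)."""
    top2 = np.sort(p, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def bald(p_t):
    """BALD from T stochastic (MC-dropout) passes. p_t: (T, n, C).
    Entropy of the mean prediction minus the mean entropy of predictions."""
    mean_p = p_t.mean(axis=0)
    return entropy(mean_p) - np.array([entropy(p) for p in p_t]).mean(axis=0)
```

For margin sampling, the first stage of the query strategy would sort samples by ascending margin and keep the smallest ones.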

Diversity.
The second stage selects based on sample representativeness or diversity. This approach is inspired by the fact that uncertainty strategies focus only on uncertainty and select many similar samples; performing a secondary selection based on representativeness helps to improve selection efficiency. We model the selection of representative samples as the k-center problem, which aims to select k centers from a dataset to minimize the maximum distance from any other point to its nearest center, so that the whole dataset can be represented by the k center points. Here our purpose is to reduce the redundancy of samples in x_batch1, so that x_batch2 together with D^L_i can represent x_batch1. This process can be described as follows:

min_{x_batch2 ⊆ x_batch1} max_{x_u ∈ x_batch1} min_{x_c ∈ x_batch2 ∪ D^L_i} dis(x_u, x_c) = δ,  (8)

where dis(·) is the distance metric and δ is the resulting covering radius, i.e., the maximum distance from a non-center point to its nearest center. The distance is based on the L2 distance between embeddings from the previously trained auto-encoder, namely:

dis(x_i, x_j) = ||E(x_i) − E(x_j)||_2.  (9)

This process is depicted in more detail in Figure 4. Each circle represents a sample point. Points surrounded by a larger circle with radius δ are the center points. The green points represent D^L_i. The red and blue points together form x_batch1. The red and green points are the center points of all sample points, and the red points are the result x_batch2.
However, the k-center problem is NP-hard. In practice, we use the improved greedy algorithm proposed in [26], which repeatedly adds the candidate farthest from the current centers. We can formulate this process as follows:

u^* = arg max_{x_u ∈ x_batch1} min_{x_c ∈ x_batch2 ∪ D^L_i} dis(x_u, x_c),  (10)

where u^* is the next center added to x_batch2.
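The greedy farthest-first rule of equation (10) can be sketched as follows, operating on auto-encoder embeddings; this is a minimal illustration of the standard greedy 2-approximation, with hypothetical argument names.

```python
import numpy as np

def greedy_k_center(emb_pool, emb_labeled, k):
    """Greedy approximation for the k-center selection.

    emb_pool    -- (n, d) embeddings of the candidates x_batch1
    emb_labeled -- (m, d) embeddings of the already-labeled set D_i^L
    Returns indices into emb_pool of the k chosen centers (x_batch2).
    Repeatedly picks the candidate farthest from every current center.
    """
    # distance from each candidate to its nearest existing center
    d = np.linalg.norm(emb_pool[:, None, :] - emb_labeled[None, :, :],
                       axis=2).min(axis=1)
    chosen = []
    for _ in range(k):
        i = int(np.argmax(d))          # farthest uncovered candidate
        chosen.append(i)
        # update nearest-center distances with the newly added center
        d = np.minimum(d, np.linalg.norm(emb_pool - emb_pool[i], axis=1))
    return chosen
```

Because the already-labeled embeddings seed the distance computation, samples close to D^L_i are never re-selected, which is exactly the redundancy reduction the stage is meant to achieve.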

Distribution Difference.
The initial labeled samples D^L_0 and unlabeled samples D^U_0 are randomly sampled from D, so there is no distribution difference between them; however, with the biased selection from D^U_i based on M, a distribution difference between D^L_i and D^U_i arises. Our goal is to use a small number of labeled samples to represent the unlabeled samples, so D^L_i and D^U_i need to obey the same distribution. The purpose of the third stage is therefore to select the samples from D^U_i whose distribution is most dissimilar to that of D^L_i. We do not need to know what distributions D^L_i and D^U_i follow; we just need to determine whether they are the same.
This can be achieved by training a discriminator whose function is similar to the discriminator in a GAN [36]. In a GAN, the discriminator discriminates whether a sample is real or synthetic; here, it determines whether a sample comes from D^L_i or D^U_i. We feed the encoder outputs into the discriminator D for training, with the following training loss:

L_D = − E_{x∈D^L_i}[log D(E(x))] − E_{x∈D^U_i}[log(1 − D(E(x)))].  (11)

This forces D to output 0 for E(D^U_i) and 1 for E(D^L_i).
When querying, E(x_batch2) is input into D, and the points with the smallest output values are picked. The final obtained x_batch3 is D^Q_i. D^Q_i is sent to experts for annotation and combined with D^L_i to form D^L_{i+1}, while D^Q_i is removed from D^U_i to form D^U_{i+1}.
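A logistic-regression discriminator on the embeddings is enough to illustrate this stage; the paper uses a neural discriminator, so the sketch below is a simplified stand-in trained with the binary cross-entropy loss above, followed by the smallest-output selection rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_discriminator(z_lab, z_unl, lr=0.5, epochs=300):
    """Logistic-regression discriminator on embeddings.

    Trained so that D outputs 1 for labeled embeddings and 0 for
    unlabeled ones (binary cross-entropy, as in the loss above).
    Returns a callable D(z) -> probability of being labeled.
    """
    Z = np.vstack([z_lab, z_unl])
    y = np.concatenate([np.ones(len(z_lab)), np.zeros(len(z_unl))])
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(Z @ w + b)
        g = p - y                      # gradient of BCE wrt the logits
        w -= lr * Z.T @ g / len(y)
        b -= lr * g.mean()
    return lambda z: sigmoid(z @ w + b)

def pick_most_unlabeled_like(D, z_batch2, n):
    """Return indices of the n candidates with the SMALLEST D output."""
    return list(np.argsort(D(z_batch2))[:n])
```

Samples with the smallest discriminator output look most like the unlabeled pool, so labeling them shrinks the distribution gap between D^L_i and D^U_i fastest.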
In summary, the entire process can be summarized as Algorithm 1.


Update Strategy.
There are two ways to update the model. One is retraining: reinitializing the model and training it with all the labeled data. The other is incremental updating: using part of the labeled data to update the original task model.
Retraining gives newly added samples the same weight as the original samples, so the model is neither hindered by deviations learned from the old samples nor affected too much by the new samples; the overall data distribution is grasped more accurately, and therefore retraining is widely used. However, its time cost is large: as the iterations proceed, the size of the labeled dataset grows, and the cost of each retraining is high. Therefore, we use a fine-tuning-based method to update the model. It differs from general fine-tuning: it not only reduces the learning rate but also adds some dropout layers. During the first training, these dropout layers preserve all the weights. When the model is updated, only the newly labeled data are used for training, and the dropout layers are turned on to suppress some neurons with a certain probability.

Implementation Details.
We define Conv(x, y) to denote a convolutional layer, which consists of a 2D convolution with x kernels of size y × y, a batch norm operation, a ReLU activation, and a 2 × 2 max pooling operation; FC(x) to denote a fully connected layer with x output units activated by the ReLU function; and DP(p) to denote a dropout layer with probability p of preserving each unit. The encoder of the auto-encoder for the PCam dataset is obtained by deleting the last four layers of the task model, and the decoder is the reversed version of the encoder (the convolutions are replaced with transposed convolutions and the structure is inverted). The structures of the task models for MNIST and CIFAR10 follow [33], and their auto-encoders are built in a similar way to PCam's. The discriminator takes the d-dimensional encoder embedding as input. All the datasets are split into training, validation, and testing sets; we randomly preserve 7,000 samples for testing and 3,000 samples for evaluation. After each epoch of training, the task model is evaluated and saved, and the final testing performance is calculated on the model with the best evaluation performance. Each experiment is carried out 3 times with different dataset splits. We use the Adam optimizer with a learning rate of 0.0001. Training stops if the evaluation performance does not increase for 20 epochs.
When updating by the proposed method, we add an extra DP(p) layer after every layer not already followed by a dropout layer, and set p = 1 for training and p = 0.7 for fine-tuning. We fine-tune for 20 epochs in both the proposed method and the fine-tuning baseline.
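The freeze-and-fine-tune idea can be simulated on a single weight matrix: a random mask decides which weights remain trainable, and the reduced-learning-rate update is applied only to those. This is an illustrative sketch with hypothetical names; the exact frozen fraction in the paper follows its dropout probability setting.

```python
import numpy as np

def partial_freeze_update(weights, grads, keep_prob=0.7, lr=2e-5, seed=0):
    """Fine-tuning step with randomly frozen weights.

    keep_prob -- probability that a weight stays TRAINABLE; the remaining
                 weights keep their old values, protecting old knowledge.
    lr        -- reduced fine-tuning learning rate.
    Returns the updated weights and the boolean trainable mask.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(weights.shape) < keep_prob   # True = trainable
    new_w = weights - lr * grads * mask            # frozen entries untouched
    return new_w, mask
```

Because a fixed fraction of weights never moves during the update, the model cannot drift arbitrarily far from its previous state, which is the mechanism limiting catastrophic forgetting here.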

Effectiveness of the Proposed Strategy.
First, we conducted experiments to prove the effectiveness of the proposed framework on the public PatchCamelyon dataset [37] (PCam).
The PatchCamelyon dataset consists of 327,680 color images (96 × 96 px) extracted from histopathological scans of lymph node sections. Each image is annotated with a binary label indicating the presence of metastatic tissue.
The PCam dataset contains a large amount of data, and it is difficult to obtain such a large dataset in real applications. Therefore, we use only 50,000 training images as the total pool of training samples, with positive and negative samples in equal proportion.
In the experiment, 10% of the total training samples are selected as the initial training set, and then 5% of the samples are annotated according to the specific query strategy in each iteration. The accuracy curve is recorded as shown in Figure 5. All strategies use the same classification model structure. When querying with the proposed strategy, 15%, 10%, and 5% of the total samples are selected at the three stages, respectively. If the remaining samples are fewer than 15% or 10%, all the remaining samples are selected.
As shown in Figure 5, both the uncertainty-based strategy and the representation-based strategy are better than random selection. In the first iteration, our strategy achieves much higher accuracy than the other strategies. Over the entire iterative process, our strategy improves accuracy by up to 3.8% compared with random selection (when the labeled dataset accounts for 50%) and by up to 1.2% compared with the other selection strategies (when the labeled dataset accounts for 30%). When the labeled dataset reaches 50%, the accuracy achieved by our strategy already exceeds that of training with the entire dataset, while the uncertainty-based strategy only surpasses full-dataset training when the labeled dataset reaches 85%.
To further compare the performance of the proposed method, we calculate the receiver operating characteristic (ROC) curve and the area under the curve (AUC) of the different active learning strategies after each iteration. The experimental results are shown in Figure 6, which compares the proposed method with the uncertainty-based and diversity-based strategies. The performance of the proposed strategy improves significantly in the first half of the iteration process. In the second half, with the increase in sample size, the performance of the various methods gradually flattens, yet the proposed strategy maintains a higher AUC. The results on CIFAR10 and MNIST, shown in Figure 7, also support the effectiveness of our method. Although the various selection strategies gradually achieve close performance as the amount of data increases, in the early stage of iteration the proposed strategy outperforms the others significantly, which shows its value in reducing the cost of labeling. The proposed selection strategy is at most 2.04% higher than random selection on MNIST (when the labeled samples account for 15%) and up to 0.5% higher than the other strategies (also at 15% labeled). On CIFAR10, it is at most 6.77% higher than random selection and 3.68% higher than the other selection strategies (both when the labeled dataset accounts for 30%).
To verify that introducing the distribution difference between labeled and unlabeled data is better than a pure hybrid strategy based on uncertainty and representativeness, we compare the selection efficiency of the proposed strategy and the hybrid strategies.
Assume that the number of samples queried in each iteration is n (here n is 5% of all samples). As shown in Figure 8, the "coreset-marg" strategy uses the coreset method to select 2n samples from D_i^U and then selects n samples based on margin-sampling uncertainty. The "marg-coreset" method first selects 2n samples from D_i^U based on margin-sampling uncertainty and then uses the coreset method to select n samples. The strategy that considers uncertainty first is better than the one that considers representativeness first. The strategy that additionally accounts for the distribution difference between D_i^U and D_i^L performs better than both, which validates the introduction of the discriminator D.
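The three-stage query described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names and the candidate-pool sizes (4n uncertain candidates reduced to 2n by coreset, then to n by the discriminator scores) are assumptions chosen for clarity, and the discriminator is represented only by its precomputed scores.

```python
import numpy as np

def margin_uncertainty(probs):
    # Margin sampling: gap between the top-2 class probabilities.
    # A smaller margin means the model is less certain.
    srt = np.sort(probs, axis=1)
    return srt[:, -1] - srt[:, -2]

def coreset_greedy(feats_unlab, feats_lab, k):
    # k-center greedy coreset: repeatedly pick the unlabeled point
    # farthest from the labeled/selected set, reducing redundancy
    # among the queried samples.
    dists = np.linalg.norm(
        feats_unlab[:, None, :] - feats_lab[None, :, :], axis=2).min(axis=1)
    chosen = []
    for _ in range(k):
        idx = int(np.argmax(dists))
        chosen.append(idx)
        new = np.linalg.norm(feats_unlab - feats_unlab[idx], axis=1)
        dists = np.minimum(dists, new)
    return chosen

def three_stage_query(probs, feats_unlab, feats_lab, disc_scores, n):
    # Stage 1: keep the 4n most uncertain samples (smallest margin).
    cand = np.argsort(margin_uncertainty(probs))[:4 * n]
    # Stage 2: coreset selection down to 2n to remove redundancy.
    cand = cand[coreset_greedy(feats_unlab[cand], feats_lab, 2 * n)]
    # Stage 3: keep the n samples the discriminator rates most
    # "unlabeled-like", i.e. those whose labeling best shrinks the
    # gap between the labeled and unlabeled distributions.
    return cand[np.argsort(-disc_scores[cand])[:n]]
```

The features here would come from the auto-encoder's low-dimensional representation, and the probabilities from the current task model.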

The Effectiveness of the Update Strategy.
To demonstrate the generality of the proposed framework, we also conduct experiments on the multiclass datasets MNIST [38] and CIFAR10 [39].
We compare the proposed update strategy with two other incremental update strategies. The first trains the model only with the newly selected samples, with the learning rate reduced to one-fifth of the original; it is denoted as "queried only." The second, in addition to the newly selected data, trains the task model with the old labeled samples whose model predictions differ most from the true labels. This error-based selection is denoted as "mistake replay"; it replays 40% of the old labeled samples in each iteration. The proposed update strategy freezes 70% of the weights through the dropout layers during training, in addition to reducing the learning rate to one-fifth of the original. The training time and accuracy of each iteration are recorded, and the strategies are compared by training time and accuracy drop. The experimental platform is a server with 15 cores of an AMD EPYC 7543 32-core processor, 80 GB RAM, and an RTX 3090 GPU. Figure 9 shows the accuracy change when querying with the random selection strategy. The "retrain" series uses all the labeled data (old and newly labeled) for training each time, which is the upper bound for the other update strategies. Training with only the queried data does not improve the performance of the model and even shows a slight downward trend. Both the mistake replay strategy and the proposed strategy avoid the performance degradation caused by training only with the queried data. The accuracy under the proposed strategy decreases only slightly compared with retraining on all data (Table 1).
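The two update baselines and the proposed freeze-based update can be sketched in NumPy as below. This is a hedged illustration under stated assumptions: the function names are hypothetical, a plain SGD step stands in for the task model's actual optimizer, and the dropout-style freeze is shown as a random per-weight mask.

```python
import numpy as np

def frozen_update(weights, grads, lr, freeze_ratio=0.7, rng=None):
    # Proposed strategy: randomly freeze `freeze_ratio` of the weights
    # (dropout-style) and apply a gradient step only to the rest, with
    # `lr` already reduced (the paper uses 1/5 of the original rate).
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(weights.shape) >= freeze_ratio  # True = trainable
    return weights - lr * grads * mask

def mistake_replay_indices(probs_old, labels_old, frac=0.4):
    # "Mistake replay" baseline: replay the `frac` of old labeled samples
    # whose predicted probability for the true class is lowest, i.e.
    # whose predictions differ most from the real labels.
    err = 1.0 - probs_old[np.arange(len(labels_old)), labels_old]
    k = int(frac * len(labels_old))
    return np.argsort(-err)[:k]
```

In each active learning iteration, "queried only" would call the plain SGD step on the new samples, mistake replay would train on the new samples plus the replayed indices, and the proposed strategy would use `frozen_update` on the new samples only.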
To further verify its effectiveness, we carry out experiments with the margin sampling-based query strategy; the results are shown in Figure 10.
Similar results are obtained with the margin sampling strategy. The performance degradation caused by training only with the queried data is more prominent under margin sampling. This may be attributed to the distribution difference between the samples selected by margin sampling and the full dataset, whereas random selection has no such bias (Table 2). Figures 9 and 10 show that, across datasets and query strategies, the proposed fine-tuning strategy achieves performance close to the mistake replay strategy while consuming about as much time as updating with only the queried samples. Its update cost is far lower than that of the retraining and mistake replay strategies.

Conclusion
The construction of an auxiliary medical image system requires a large amount of labeled data, which incurs expensive annotation costs. In this study, based on the prediction of lymph node metastasis in breast cancer, an efficient active learning selection strategy is proposed, and its effectiveness is verified on other classification datasets. The three-stage selection strategy proposed here improves on traditional uncertainty-based selection: samples with large uncertainty are first selected according to the uncertainty measure, the redundancy of the samples to be labeled is then reduced by the coreset-based method, and finally a discriminator of the distribution difference between labeled and unlabeled samples further filters the candidates. This selection strategy, which accounts for the distribution difference between labeled and unlabeled samples and works to eliminate it, achieves greater labeling efficiency than uncertainty strategies, representative strategies, or hybrid strategies alone. On the breast cancer lymph node dataset, only 50% of the data is needed to match the effect of training with all the data. To address the large time cost of retraining during model updates, we propose a dropout-based fine-tuning method, which achieves performance similar to the mistake replay update method while reducing training cost by 79.87% on average. Compared with the retraining update strategy, training cost is reduced by 90.07% on average without excessive accuracy loss.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.