Medical Specialty Classification Based on Semiadversarial Data Augmentation

Rapidly increasing adoption of electronic health record (EHR) systems has caused automated medical specialty classification to become an important research field. Medical specialty classification not only improves EHR system retrieval efficiency and helps general practitioners identify urgent patient issues but also is useful in studying the practice and validity of clinical referral patterns. However, currently available medical note data are imbalanced and insufficient. In addition, medical specialty classification is a multicategory problem, and it is not easy to remove sensitive information from numerous medical notes and tag them. To solve those problems, we propose a data augmentation method based on adversarial attacks. The semiadversarial examples generated during the dynamic process of adversarial attacking are added to the training set as augmented examples, which can effectively expand the coverage of the training data on the decision space. Besides, as nouns in medical notes are critical information, we design a classification framework incorporating probabilistic information of nouns, with confidence recalculation after the softmax layer. We validate our proposed method on an 18-class dataset with extremely unbalanced data, and comparison experiments with four benchmarks show that our method improves accuracy and F1 score to the optimal level, by an average of 14.9%.


Introduction
Recently, deep neural networks (DNNs) have achieved remarkable success in classification tasks in various fields, such as computer vision [1], network anomalous behavior detection [2][3][4], and the medical domain [5,6]. The widespread use of electronic health record (EHR) systems has made the task of medical specialty classification more important in modern healthcare. Classifying clinical notes into medical specialty fields improves the retrieval efficiency of the EHR system, which enables the doctor to quickly access the target information. In addition, automated medical specialty classification can be extended to other downstream applications, for example, assisting in medical knowledge extraction and supporting intelligent medical decision systems.
However, obtaining and labeling unstructured medical notes is not easy. Physician writing styles vary widely, as do the probabilities of disease outbreaks in different medical subfields. These objective factors lead to existing datasets with significant deficiencies: insufficient data volume [6], nonopen access [5], and unbalanced categories [7]. Abundant medical specialty categories with scarce and unbalanced data seriously impact the performance of the classification model, which is the greatest challenge in the task of medical specialty classification.
As far as we know, the existing work focuses on how to design a better model and tune its parameters [6] to improve accuracy, such as comparing the effectiveness of different machine-learning models and deep-learning models and determining the best combination of models [7] or algorithms [5]. An approach of integrated data analysis was proposed in [5], where the researchers applied various techniques to extract features, including the unified medical language system and semantic network. However, the problems of insufficient and imbalanced data have hardly been considered in the existing work. In addition, the maximum number of categories considered in the available work is 9, fewer than the number of specialties that arise in real medical specialty classification. A finer classification is more in line with the needs of realistic application scenarios, but it also implies a greater challenge.
Starting from the realistic scenario, we explore how to improve the performance of the classifier with a limited corpus. In this paper, instead of focusing on model comparison and selection, we pay more attention to data augmentation, which is an effective method to address the data imbalance problem. In the machine vision field, many outstanding augmentation techniques have been demonstrated to be effective in previous work [8]. However, for textual data, randomly modifying examples is ineffective due to the discrete nature of text. In addition, data augmentation techniques applicable to different tasks vary widely, which leads to poor transferability.
To tackle these challenges, we developed a data augmentation method based on adversarial attacks. The adversarial attack aims to generate adversarial examples that are similar to the original examples but make the model predict incorrectly. From the geometric perspective, the process of an adversarial attack can be described as a clean example approaching the decision boundary until it crosses it completely. Interestingly, the intermediate product of the attack process matches the definition of augmented data: data with a distribution close to that of the original data.
(3) We designed a medical specialty classifier based on a tough dataset situation. To the best of our knowledge, we cover the largest number of specialty categories. In addition, experimental results show that the classifier obtained by our method has stronger robustness.

Related Work
In this section, we review existing related work in three areas: (1) classification tasks in the medical domain, (2) data augmentation methods, and (3) adversarial attacks as well as adversarial augmentation methods.

Medical Classification.
Machine learning excels at classification tasks and plays an important role in smart healthcare. Image classification-related applications are particularly widespread. For example, in breast cancer detection, Fotin et al. [9] used AlexNet trained on a proprietary database to produce better performance than that achieved by years of engineering manual feature systems; in Alzheimer's detection, Lim and Schaar [10] utilized the flexibility and scalability of deep neural networks to enhance a joint longitudinal and temporal model of event data to predict the trajectory of Alzheimer's disease over time; in heart disease detection, Poudel et al. [11] introduced an RNN recurrent connection in the U-net architecture to learn which information of the previous ventricle to remember when segmenting the next ventricle in a slice-by-slice segmentation of the left ventricle.
Compared to medical image classification, the application of machine learning to the medical classification of textual data has not been widely explored. For electronic health records, Weng et al. [5] constructed a machine learning-based natural language processing (NLP) pipeline and developed a medical subdomain classifier based on medical record content. Ahnaf et al. [6] used Bengali text for training machine-learning and deep-learning models and used a bidirectional LSTM model to classify text-based records based on medical specialties. Cheng et al. [12] trained a CNN on a temporal matrix of medical codes for each patient to predict the onset of congestive heart failure (CHF) and chronic obstructive pulmonary disease (COPD).

Data Augmentation.
Data augmentation techniques are proposed for solving insufficient data and poor data quality by constructing new examples to enrich the training data and improve the generalization ability of machine-learning models [13][14][15].
In terms of execution granularity, text data augmentation is classified into the character level, word level, phrase level, and document level. Character-level text data augmentation includes randomly changing a letter in a word [16], deleting or inserting characters [17], and modifying punctuation to introduce weak textual noise [18].

Computational Intelligence and Neuroscience

Such methods have been shown to enable models to better handle noisy text. Phrase-level methods are based on structure [19] and interpolation [20]. This type of method is more restricted to specific languages and tasks. Common document-level methods include back translation [21] and generative methods [22]. The most widely promoted word-level approach is the text augmentation method based on synonym substitution [23,24]. Embedding-based replacement aims at identifying more contextually appropriate words by using neural network embedding models and vector similarity calculation [25][26][27]. In contrast to plain synonym substitution, semantic and high-dimension-based methods take the context into account and have more comprehensive distributional assumptions. The BERT [28] model has been trained on a completion task with a large-scale corpus, making it capable of predicting [MASK] as a specific word. This feature of BERT is fully exploited in data augmentation techniques for word replacement; for example, [18] proposed conditional BERT (c-BERT), which uses BERT contextual augmentation to generate augmented data.
Data augmentation techniques applied to the medical domain have focused on image enhancement. Janowczyk et al. [29] used SAEs to normalize H&E-stained histopathology images; Benou et al. [30] used CNNs to denoise DCE-MRI time series. Aydin et al. [31] combined images and text, using attention mechanisms and transfer-learning approaches to further improve medical data classification accuracy in small batches of data. In addition, methods based on GAN [32,33] and reinforcement learning [34] are also used in image synthesis for the medical domain. Text-only data augmentation is difficult because label-preserving text transformations are hard to define [35,36], and this disadvantage is accentuated in specialized fields, such as medicine.

Adversarial Attacks.
Given a text $x$, the attacker adds an imperceptible disturbance $\Delta x$ to $x$ and aims to make the pretrained model $F$ misclassify. The $\Delta x$ operation includes adding, deleting, and replacing characters or words. In terms of textual form, there is some similarity between adversarial and augmented examples in that both are similar copies of original examples produced by performing certain modification operations on the original example. In the natural language field, gradient-based adversarial training is effective in improving the accuracy and generalization of models [7,21] but yields weak gains in adversarial robustness. In addition, adversarial data augmentation [37,38] and virtual adversarial data augmentation [21] also effectively improve the adversarial robustness of models, but such methods are prone to decreasing model accuracy. Lee et al. [38] proposed a combination of friendly data augmentation and gradient-based adversarial training that can improve the adversarial robustness of models while maintaining their accuracy.

Methodology
3.1. Notions and Definitions. We denote $F$ as the target model and $D_{orig} = \{(x_i, y_i)\}_{i=1}^{n}$ as the original dataset. $x_i$ is the text, denoted as the set of words $x_i = \{w_1, w_2, \ldots, w_m\}$, and $m$ is the number of words. $y_i$ is the label of $x_i$, and $y_i \in Y$, where $Y$ is the set of all labels. $F_y(x)$ is the confidence (probabilistic score output by the softmax layer) of $F$ predicting $x$ as $y$. $F(x)$ is the predicted label of $x$.
An adversarial example, generated by implementing imperceptible perturbations on $x$, is indicated as $x_{adv}$. The dataset after data augmentation is indicated as $D_{ada}$. As for adversarial data augmentation, the steps are as follows:

Semiadversarial Data Augmentation.
Established data augmentation techniques fully consider how to enrich the training set by generating new data close to the original data but ignore the data distribution in the model decision space. The process of adversarial example generation simulates well the movement of data in the decision space. We presume that adversarial attacks can augment the dataset with a more comprehensive distribution. Although adversarial data augmentation has been shown to hurt model performance [39], perturbed examples that do not cross decision boundaries can overcome this drawback [40]. "Friendly adversarial examples" have been proposed and shown to improve the adversarial robustness of the model while maintaining accuracy [40]. Inspired by this, we propose semiadversarial data augmentation (SemiADA). Specifically, the multiple-step adversarial attack method (MSAA) generates semiadversarial examples for data augmentation. Semiadversarial examples are perturbed but do not successfully attack the target model. Multiple-step means that we perturb several words in each attack action.
A visual illustrative example is shown in Figure 1. Figure 1(a) describes the general data augmentation approach, which generates examples distributed around the original examples. The dynamic process of the adversarial attack is described in Figure 1(b). As shown in Figure 1(c), SemiADA can cover a larger area of the decision space. It is worth noting that there is a relatively large gap in the decision space between the perturbed and original samples, as shown in Figure 1(c), but the texts are still highly similar to each other, which means that the perturbed examples preserve semantics.
In common attack algorithms, only one word or embedding vector is perturbed in each attack action, which is described as a single-step attack. In contrast, we propose a multiple-step adversarial attack method (MSAA), in which multiple words are selected for perturbation in each attack action, and finally, a set of combined candidates is identified. During the MSAA process, the semiadversarial examples generated in intermediate steps are retained as augmented examples. The whole semiadversarial data augmentation procedure is shown in Algorithm 1, which mainly consists of three steps as follows.
Step 1. Word Importance Ranking. For any input $x = [w_1, w_2, \ldots, w_m]$, each word has a different effect on the final prediction result. Therefore, we rank the importance of all words and perturb the important words first. Calculating the difference in confidence after deleting a word is a common way to compare word importance. This type of method requires accessing the target model $m$ times and is time consuming. To improve computational efficiency, we calculate the embedding vector difference when the word is replaced with [MASK] and measure the importance of the word by the projection of the vector difference onto the gradient direction. The importance of each word $w_i$ in $x$ is denoted $I(w_i, x)$ and computed via equation (1), where $V_{[MASK]}$ is the embedding of [MASK], $V_{w_i}$ is the embedding of word $w_i$, and $J$ is the loss function of the model $F$. It only requires querying the model once to get the scores of all words, which greatly boosts efficiency. We further filter out the stop words derived from the NLTK (https://www.nltk.org/) and spaCy (https://spacy.io/) libraries, such as "the," "then," and so on. Finally, we get the sorted and filtered set $W$.
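The ranking in Step 1 can be illustrated with a minimal sketch. It assumes that the importance score is the dot product between the embedding difference $V_{[MASK]} - V_{w_i}$ and the loss gradient with respect to $V_{w_i}$, which is one plausible reading of "projection onto the gradient direction"; the exact form of equation (1) is not reproduced here. The function names, the toy embeddings, the toy gradients, and the stop-word set are all hypothetical.

```python
def word_importance(embeddings, grads, mask_embedding):
    """First-order estimate of the loss change when each word is replaced
    by [MASK]: (V_mask - V_wi) . dJ/dV_wi  (assumed form of the score)."""
    scores = []
    for vec, grad in zip(embeddings, grads):
        scores.append(sum((m - v) * g for m, v, g in zip(mask_embedding, vec, grad)))
    return scores


def rank_words(words, scores, stop_words):
    """Sort words by descending importance and drop stop words."""
    order = sorted(range(len(words)), key=lambda i: -scores[i])
    return [words[i] for i in order if words[i].lower() not in stop_words]


# Toy data: in practice the gradients come from one backward pass of F.
words = ["the", "patient", "reports", "stomach", "pain"]
embeddings = [[0.1, 0.0], [0.5, 0.2], [0.3, 0.1], [0.9, 0.8], [0.7, 0.6]]
grads = [[0.0, 0.0], [-0.2, -0.1], [-0.1, 0.0], [-1.0, -1.0], [-0.5, -0.5]]
mask_embedding = [0.0, 0.0]

scores = word_importance(embeddings, grads, mask_embedding)
ranked = rank_words(words, scores, stop_words={"the", "then"})
print(ranked)
```

Because all scores are derived from a single gradient, the model is queried once rather than once per word, which is the efficiency argument made in the text.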
Step 2. Identify Candidate Word Combinations. We construct a vocabulary dictionary from $D_{orig}$, which contains 27816 words. We determine the synonym set $Syn_{w_i}$ for each $w_i$ in the dictionary, which is initialized with the $k$ closest words from the WordNet synonyms of $w_i$ based on cosine similarity. WordNet [41] is a semantic-oriented English dictionary with 155,287 words and 117,659 synonym sets. The word vectors used for similarity computation are from the pretrained word embedding model GloVe [42].
Human-written medical notes are not perfect and always contain some syntactic errors, so we do not need the generated augmented examples to be perfect.Unlike adversarial example generation, we aim to generate data that better meet the data augmentation conditions, so syntactic correctness checking is not strictly necessary.
In each attack action, we select the top $t$ words from the sorted set $W$ as the perturbed word set $PerSet = \{w_i, \ldots, w_{i+t}\}$, where $i = j * t$ and $j$ indexes the $j$-th attack action. There are $k^t$ possible combinations, so it is extremely time consuming to try all replacements. To save overhead, we randomly sample $r = k \times t$ times, which reduces the exponential number of combinations to a count that is linear in $t$. The candidate substitution words are obtained by applying $R(w, Syn_w)$ to each $w$ in PerSet, where $R(w, Syn_w)$ represents randomly selecting a word from $Syn_w$ to replace $w$.
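The sampling in Step 2 can be sketched as follows. This is an illustrative sketch only: the function name `sample_candidates`, the toy synonym sets, and the fixed random seed are assumptions, and in the paper the synonym sets come from WordNet ranked by GloVe cosine similarity.

```python
import random


def sample_candidates(per_set, syn_sets, k, rng):
    """Draw r = k * t random substitution combinations instead of
    enumerating all k**t possibilities."""
    t = len(per_set)
    r = k * t
    candi_set = []
    for _ in range(r):
        # R(w, Syn_w): pick one random synonym for every perturbed word
        candi_set.append([rng.choice(syn_sets[w]) for w in per_set])
    return candi_set


# Hypothetical synonym sets of size k = 3 for t = 2 perturbed words.
syn_sets = {
    "stomach": ["abdomen", "belly", "tummy"],
    "pain": ["ache", "soreness", "discomfort"],
}
per_set = ["stomach", "pain"]
candi_set = sample_candidates(per_set, syn_sets, k=3, rng=random.Random(0))
print(len(candi_set), "sampled combinations out of", 3 ** len(per_set), "possible")
```

With $k = 3$ and $t = 2$ the saving is small (6 samples versus 9 combinations), but the gap widens quickly: for $k = 20$ and $t = 3$ it is 60 samples versus 8000 combinations.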

Complexity Analysis of Algorithm 1.
According to the loops in the workflow, the time complexity depends on the input length $n$ and the synonym set size $k$. As $k$ is a constant, the computation time increases by a constant multiple as the input text size grows. The time complexity of mainstream black-box adversarial attack methods tends to be above $O(n^2)$ [35,43]. Benefiting from the idea of a multistep combinatorial attack (as shown in Step 1), our method is at least one order lower than mainstream attack methods. We have confirmed this experimentally, as shown in Table 1.
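As a rough illustration of why the cost is linear, the loops of Algorithm 1 can be counted directly. The estimate below is a back-of-the-envelope count that considers model queries only and assumes one query per candidate combination; it is not the authors' exact expression.

```latex
T(n) \;\approx\; \underbrace{\left(\frac{n}{t} - 1\right)}_{\text{attack actions}}
\times \underbrace{(k \times t)}_{\text{candidates per action}}
\;\approx\; k\,n \;=\; O(n) \quad \text{for constant } k
```

Under this count, the step size $t$ cancels out, so the number of queries is governed by the text length $n$ and the synonym set size $k$ alone.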

Weighted Classification by Probabilistic Information.
Data augmentation mechanisms considerably alleviate the problem of unbalanced and insufficient data, but the accuracy under supermultiple categories is still unsatisfactory. We focus attention on the task and the data itself to seek more solutions. In the medical field, nouns play an important role, and high-frequency words vary greatly across medical specialties. For example, "stomach" often appears in the "gastroenterology" category but rarely in the "podiatry" category. We inferred that simple probabilistic statistical information is useful to express the differences between categories. Therefore, we considered incorporating probabilistic information (PI) for classification.
We add the probabilistic information (PI) layer after the softmax layer (Figure 2).Its function is to recompute the probabilistic distribution and make the model prediction more accurate by incorporating the knowledge of probabilistic statistics.In the inference phase, for any input x, we perform the following steps.

Calculating Word Category Importance.
We propose the concept of word category importance (WCI) to indicate the relevance of different nouns to different medical specialties. Referring to the BM25 algorithm in information retrieval, we design the formula for WCI.

Algorithm 1, steps (2) to (26):
(2) $W$ ⟵ Sort all words in $x$ in descending order of their importance scores via equation (1)
(3) Filter the stop words from $W$
(4) $n$ ⟵ length of $W$
(5) for $j = 1$ to $(n/t - 1)$ do: $i$ ⟵ $j * t$
(6)   PerSet ⟵ the words in $W$ whose index is $i$ to $i + t$
(7)   CandiSet ⟵ { }
(8)   for $w_i$ in PerSet do
(9)     Initialize the candidate set $Syn_{w_i}$ by extracting the top $k$ synonyms for $w_i$ from WordNet using cosine similarity
(10)  end for
(11)  for $i = 1$ to $k \times t$ do
(12)    Candidates ⟵ Randomly sample $t$ words from $Syn_{w_i}$ to $Syn_{w_{i+t}}$
(13)    Add Candidates to CandiSet
(14)  end for
(15)  $x_{adv}$ ⟵ $x$
(16)  for Candidates in CandiSet do
(17)    $x'$ ⟵ Replace $w_i$ to $w_{i+t}$ of $x_{adv}$ with their corresponding candidates in Candidates
(18)    if $F_y(x') < F_y(x_{adv})$ then
(19)      Add $x'$ to AESet
(20)      $x_{adv}$ ⟵ $x'$
(21)    end if
(22)  end for
(23)  if there exists $x'$ whose prediction result $F(x') \neq y$ then
(24)    return AESet
(25)  end if
(26) end for

In the softmax layer, $z_i$ denotes the output of the $i$-th node and $c$ denotes the number of categories. After the probabilistic information layer, the output of each node is recomputed, where $M = \max\{z_j\,Score(x, y_j)\}_{j=1}^{c}$ serves to prevent overflow of values.
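The role of $M$ can be illustrated with a small sketch of the confidence recalculation. It assumes that the PI layer applies an exponential normalization to the products $z_j\,Score(x, y_j)$ after subtracting their maximum $M$; this form is inferred from the definition of $M$ and is an assumption, not the paper's stated formula. The function name `pi_layer` and the example scores are hypothetical, and the per-class scores are taken as given inputs rather than computed from WCI.

```python
import math


def pi_layer(z, scores):
    """Recompute class probabilities from softmax outputs z and per-class
    scores Score(x, y_j), subtracting the maximum M for numerical stability."""
    weighted = [zj * sj for zj, sj in zip(z, scores)]
    m = max(weighted)  # M in the text: prevents overflow in exp()
    exps = [math.exp(w - m) for w in weighted]
    total = sum(exps)
    return [e / total for e in exps]


z = [0.5, 0.3, 0.2]        # softmax outputs for c = 3 categories
scores = [1.0, 4.0, 1.0]   # hypothetical noun-based category scores
probs = pi_layer(z, scores)
print(probs)

# Subtracting M keeps exp() finite even for very large inputs.
stable = pi_layer([1000.0, 999.0], [1.0, 1.0])
```

In this toy case the second category overtakes the first after recalculation, which mirrors the intended effect of letting category-specific nouns shift the prediction toward under-represented classes.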

Experiment Setup
4.1.1. Dataset. We adopt the medical specialty classification dataset from Kaggle (https://www.kaggle.com/competitions/medical-specialty-classification/overview). The dataset of patient notes contains initial consultations, procedure visits, and so on. As some categories contain fewer than 30 items and are too difficult to train, we filter out the classes with fewer than 30 examples. The filtered dataset includes 3,140 notes and 18 medical specialty categories. The distribution of the data is shown in Figure 3, and the distribution of text length after preprocessing is shown in Figure 4.
In the performance evaluation of different models trained in a plain way and with the proposed method, we used stratified K-fold cross-validation ($k = 5$). We divide the dataset into five folds and assign the training and test data in a 4:1 ratio. Data augmentation is applied to the training set only. Each metric score (Table 2) is the average score on the test data over the $k$ models. Considering the time consumption of data augmentation and retraining of a large model, in other experiments, we fix the test data and the training data to the split on which the trained BioBERT model performs at the median in Table 2. In all training, the final training and validation sets are obtained by randomly dividing the training data in a 9:1 ratio in a stratified manner.

Models.
We adopt BioBERT as the classifier model. We utilize BioBERT with 12 transformer layers, 12 self-attention heads, and a hidden size of 768. We set dropout to 0.1, epochs to 10, max sequence length to 512, and batch size to 16. A learning rate of 1e-5 is selected. In addition, we compare BioBERT with different models, including CNN, LSTM, and BERT. Specifically, the parameters of BERT are the same as those of BioBERT. Te CNN model contains
SemiADA + PI is our proposed method, where SemiADA represents the data augmentation mechanism and PI represents the classification mechanism incorporating probabilistic information.

Evaluation Metrics.
In this paper, we used accuracy, precision, recall, and F1 score to evaluate the performance of the model. Because the medical classification task is a multicategory problem, after a two-class confusion matrix is formed for each category, we average the confusion matrices to obtain the averages of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) as $\overline{TP}$, $\overline{FP}$, $\overline{TN}$, and $\overline{FN}$ and then calculate accuracy (Acc), microprecision (micro-P), microrecall (micro-R), and micro-F1.
The formulas of all the metrics used are as follows:
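The metric formulas are not reproduced in this text, so the sketch below uses the standard definitions computed from the averaged confusion-matrix counts; it is assumed, not confirmed, that these match the paper's expressions. The function name `micro_metrics` and the example counts are hypothetical.

```python
def micro_metrics(tp, fp, tn, fn):
    """Accuracy, micro-precision, micro-recall, and micro-F1 from the
    averaged TP, FP, TN, and FN counts (standard definitions)."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1


# Hypothetical averaged counts over all one-vs-rest confusion matrices.
acc, micro_p, micro_r, micro_f1 = micro_metrics(tp=8.0, fp=2.0, tn=85.0, fn=5.0)
print(acc, micro_p, micro_r, micro_f1)
```

Note how the large TN count keeps accuracy high even when recall is modest, which is why precision, recall, and F1 are reported alongside accuracy for imbalanced data.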

Data Augmentation Based on Embedding Replacement (ERA).
According to the replacement method described in [27], replacement is determined by two factors: whether the vector cosine similarity is less than the threshold and whether the lexical identity is consistent.Keeping the same experimental conditions as in the original paper, the threshold size is set to 0.7 in the experiments, and the NLTK library is used for lexical annotation.

Data Augmentation Based on the Language Model (LMA).
We choose conditional BERT as the augmented language model [46], and the specific implementation follows the original algorithm scheme: randomly mask $k$ words, then predict label-compatible words for the masked positions, and generate multiple new examples by replacing the predicted words. The value of $k$ is 20% of the total number of words in the input examples.

Main Results.
We investigated the effect of the method proposed in this paper on the generalization ability of different models, and the results are shown in Table 2. We observe that the performance of the BioBERT and BERT models improves more than that of CNN and LSTM.
Compared with the plain training of the four models, the BioBERT model pretrained with medical data has significantly better performance than the other models. As shown in Table 3, SemiADA + PI shows a significant improvement in performance in contrast to other augmentation techniques. It is worth noting that ADA leads to degradation in performance. The main reason for this phenomenon is that the augmented examples in ADA are adversarial examples that have led to changes in the labels and relatively large shifts in the decision boundaries of the model. In addition, SRA and ERA have comparable augmentation capabilities, and LMA performs better as it is based on the language model.

Ablation Studies.
We conduct ablation studies on the BioBERT model to clarify the impact of the two parts of the proposed method. As shown in Table 4, SemiADA markedly improves the performance of the model in each metric, but precision is still higher than recall because the categories remain imbalanced. This issue can be well mitigated by the PI strategy. It is worth noting that although the PI strategy alone does not improve model performance significantly, it improves microrecall, which means that the classification accuracy of categories with small data size is improved.

Impact of Synonym Set Size.
A larger synonym set size $k$ means that there are more possibilities for word replacement and that more diverse augmented data can be generated. But does a larger $k$ necessarily mean better performance? To further clarify the relationship between $k$ and model performance, we varied $k$ in steps of 5 over the interval [5,50] and observed the change in model classification performance. As shown in Figure 5, model performance does not significantly improve after $k$ exceeds 20 and even shows a slightly decreasing trend when $k$ reaches 40.

Impact of Attack Step Size.
We propose the MSAA method, which perturbs $t$ words in each attack action for semiadversarial data augmentation. A larger $t$ leads to a greater difference between the generated examples and the original examples, so there is less risk of the model overfitting during the training phase. On the other hand, a large $t$ makes the number of semiadversarial examples very limited and insufficient to augment the dataset. We evaluated the effect of the generated augmented examples for different $t \in [1,10]$, and the results are shown in Figure 6. We observe that the model works best for $t = 3$. How to determine the value of $t$ for different datasets in a more direct and automatic way needs to be further explored.
As shown in Figure 7, test accuracy no longer increases when the augmented data amount for each category reaches 5000.

Robustness Analysis.
We evaluate the robustness of our method against four attack methods, which rely on the TextAttack library. Due to the inefficiency of the attacks on long text, we select 200 examples for each experiment and repeat the experiment three times to take the average value. The maximum perturbation rate is set to 0.1, and the minimum text similarity threshold is set to 0.84. We summarize the robustness results of the plain training mode and our proposed method in Table 5.

Visualization Analysis.
To further verify our interpretation given in Figure 1, we compare the difference between SRA and SemiADA in two types of vector representations: the embedding distribution of the hidden layer output at the [CLS] position and the output of the softmax layer. The output embedding of [CLS] can be viewed as a sentence vector. The output of the softmax layer is the most direct reflection of the distribution of examples in the decision space. Since the candidate words for both methods are derived from WordNet, the word vector distribution is the same, so we do not visualize and compare the word embeddings.

As we can see from Figure 8, although the candidate word distributions used by the two methods are the same, the sentence embeddings are markedly different from each other. The distribution of the new sentences generated by SRA is much closer to that of the original sentences (smaller area of the same color). As shown in Figure 9, in the decision space, the new samples generated by SemiADA are clearly more scattered, while the samples generated by SRA are very close to each other. In other words, the new samples generated by SemiADA are richer and cover a wider area in the decision space.

and Limitations.
Pretrained models are currently the most powerful tools for NLP as they significantly improve the accuracy of many NLP tasks and have strong generality. However, we also need to consider the resource consumption in model implementation, because the huge model architecture is not convenient for physical storage and application. We believe that lightweight models will be more popular in the medical industry, and this is the direction of our future research.
Healthcare is an important field regarding human life and development, with low fault tolerance for models and higher requirements for model interpretability. Dealing with vague and uncertain medical texts remains a challenging task. Literature studies [47,48] give applications of fuzzy classifiers in key areas, which give us some insights. As fuzzy classifiers are transferable, we believe that the accuracy and stability of the models will be greatly improved by applying them to the healthcare domain.
Adversarial robustness aims to enhance the security of deep-learning models, and we have accomplished some preliminary work on it in this paper. We hope this will trigger more thoughts and exploration on the security and reliability of deep-learning model applications in the healthcare field.

Conclusions
In intelligent medical scenarios, training a high-quality model with nonideal data is an important task, which is the starting point of our work in this paper. We propose a data augmentation method based on semiadversarial attacks and probabilistic information to address the problem of insufficient data amount and imbalanced data distribution in supermultiple classification tasks. Our approach significantly improves the performance of medical specialty classifiers in a cost-friendly manner. Experiments show that our proposed method performs significantly better than various data augmentation methods. In addition, the robustness of the model is evaluated under various attack methods. The results show that our proposed method improves the adversarial robustness of the target model to a certain degree.
Our approach takes into consideration the idea of solving data problems in deep learning and the unique characteristics of data in the medical field, to complement each other and maximize performance gain. Such an idea is of great interest in cross-disciplines, such as the intersection of medicine and artificial intelligence, where this paper is positioned.

Figure 1: A visual illustrative example of (a) data augmentation and (b) adversarial attack. The circles and squares represent the different categories. The black curve represents the resultant decision boundary. As shown in the yellow-shaded part in (c), semiadversarial data augmentation covers a larger decision space.

Figure 3: Data category distribution statistics. The horizontal coordinate indicates the categories, and the vertical coordinate indicates the total number of samples under that category. As we can see from the figure, the data distribution is severely imbalanced.

Figure 4: Text length statistics. Texts with lengths of 180-230 words are the most numerous. The horizontal coordinate indicates the total number of words in a note, and the vertical coordinate indicates the number of texts with each total word count.

Figure 5: The performance of BioBERT with different synonym set sizes $k$. We select attacking step sizes of $t = 3$, 4, and 5 to conduct the repeated experiments, but only the results for $t = 3$ are shown in the figure, as the performance shows the same trend in the three sets of experiments.

Figure 6: The effect of different $t$ on the performance of the enhanced model.

Figure 7:

Figure 8: Comparison of sentence embedding distribution generated by SRA and SemiADA. We randomly select 10 original samples, which are sampled from different categories in the dataset. Then, we generate 20 new samples for each original sample by SRA and SemiADA. In order to avoid the overlap of embeddings from different categories, we add bias terms of different sizes to embeddings from different categories in the visualization, so that the categories are far away from each other. We repeat the experiment three times to obtain three sets of plots (each column is a set of experimental results), where the visualization results under the SRA method are shown in (a1-a3) and the visualization results under SemiADA are shown in (b1-b3). We only need to observe the coverage of each category (the area covered by each color). Larger coverage means that the examples cover a wider range in the decision space.

Figure 9: Visualization comparison for the output of the softmax layer. The samples used are exactly the same as in Figure 8. (a1-a3) The visualization results under SRA and (b1-b3) the visualization results under SemiADA. It is easy to observe that the distribution of SemiADA is significantly wider (area covered by the same color).
Taking advantage of this property, we extend the training dataset using the intermediate examples generated in the attack process as augmented examples, which are called semiadversarial examples. Those examples better cover different regions of the decision space and improve both the generalization ability and robustness of the model. Furthermore, since nouns in medical notes play a key role in identifying the subfield to which the note belongs, we designed a classifier architecture with confidence recalculation after the softmax layer using probabilistic information. This mechanism has advantages in supermultiple classification tasks, especially for categories with insufficient examples. Our contributions are summarized as follows.
(1) train $F$ on the original dataset $D_{orig}$ to obtain a base model $F_{base}$, (2) generate several semiadversarial examples $\{x'_{adv}\}$ for each text in $D_{orig}$, (3) construct the adversarial dataset $D_{adv} = \{(x'_{adv}, y_i)\}$, and (4) train $F$ on $D_{ada} = D_{orig} \cup D_{adv}$ to get the final model.
Step 3. Construct Semiadversarial Examples. We sequentially replace words in PerSet with the combinations of candidate words in CandiSet to generate the perturbed examples $x'$. If the predicted probability of $x'$ for the original label $y$ is reduced, we add $x'$ to the final augmentation set. It is worth noting that we do not add the final adversarial examples to the augmented set because they mislead the decision boundaries of the model to deviate more from the true ones. The idea that adversarial data augmentation leads to a decrease in model accuracy has also been experimentally verified in several works [39,40].
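The retention rule of Step 3 can be sketched for a single attack action as follows. This is a simplified illustration rather than the full Algorithm 1: it stops at the first candidate that flips the label and never stores that example, following the statement above that final adversarial examples are excluded. The function name, the keyword-based toy classifier, and the example tokens are all hypothetical stand-ins for the real model $F$.

```python
def collect_semiadversarial(x, y, positions, candi_set, confidence, predict):
    """Keep every perturbed text that lowers the confidence for label y
    while still being classified as y; stop once the label flips."""
    ae_set = []
    x_adv = list(x)
    for candidates in candi_set:
        x_new = list(x_adv)
        for pos, word in zip(positions, candidates):
            x_new[pos] = word
        if predict(x_new) != y:
            break  # the adversarial example itself is not added
        if confidence(x_new, y) < confidence(x_adv, y):
            ae_set.append(x_new)
            x_adv = x_new
    return ae_set


# Toy stand-in for F: confidence is the fraction of keyword tokens.
KEYWORDS = {"stomach", "gastric", "nausea"}


def confidence(tokens, y):
    return sum(tok in KEYWORDS for tok in tokens) / len(tokens)


def predict(tokens):
    return "gastroenterology" if confidence(tokens, "gastroenterology") > 0.1 else "other"


x = ["stomach", "pain", "with", "nausea", "today"]
candi_set = [["stomach", "sickness"], ["belly", "nausea"], ["belly", "sickness"]]
ae_set = collect_semiadversarial(x, "gastroenterology", [0, 3], candi_set, confidence, predict)
print(ae_set)
```

Here the first candidate lowers the confidence without changing the label and is kept, the second does not lower it further and is skipped, and the third flips the label, which ends the attack.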
Algorithm 1, input, output, and step (1):
Input: Medical note text $x$, the ground truth $y$, target model $F$, attack step size $t$, synonym set size $k$, original dataset $D_{orig}$
Output: Semiadversarial examples set AESet
(1) $F_{base}$ ⟵ train $F$ on $D_{orig}$

In the WCI formula, $|D_y|$ denotes the total number of examples in the dataset whose labels are $y$, $D_{Y_j}$ is the average data amount for all categories, and $IDF'$ is a variant of the inverse document frequency.

Table 1: The performance of different attacks.

Table 2: The performance of different models trained in a plain way and with the proposed method.
4.1.4. Experimental Environment. All models are trained on 4 GeForce RTX 2080 GPUs; the Python version used is 3.6.13; the deep-learning framework used is PyTorch (https://pytorch.org/), version 1.10.2.
4.2. Baselines. We utilize multiple data augmentation methods as comparison methods. The size of the augmented dataset is consistent across methods. Apart from the examples in the augmented dataset, other training details are consistent.
4.2.1. Plain Training. We use the dataset $D_{orig}$ for plain training of the four models without any extra optimization.
Augmented Data Amount. The appropriate number of augmented examples is important. An excessive number of augmented examples may lead the model to overfit. We compare the variation in training accuracy and testing accuracy of the models obtained by training with different numbers of augmented texts.

Table 3: The performance of different data augmentation modes on the BioBERT model.

Table 4: Ablation studies of our method on BioBERT.

Table 5: The robustness experiment results of the plain training mode and SemiADA + PI training mode, including accuracy under attack (AUA %) and attack success rate (ASR %). The black bold values denote the stronger robustness capability among the two modes.