Improved Double-Layer Structure Multilabel Classification Model via Optimal Sequence and Attention Mechanism

Multilabel classification is a key research topic in the machine learning field. In this study, the authors put forward a two-layer chain classification algorithm with optimal sequence based on the attention mechanism. This algorithm is a classification model with a two-layer structure. By introducing an attention mechanism, this study analyzes the key attributes to achieve the goal of classification. To solve the problem of accuracy degradation caused by the order of classifiers, we adopt the OSS (optimal sequence selection) algorithm to find the optimal sequence of labels. Test results on real datasets show that the ATDCC-OS algorithm performs well on all evaluation metrics. Its average precision exceeds 80%, its micro-averaged AUC reaches 0.96, its coverage stays below 10%, its one-error performance is the best overall, and its ranking loss is about 0.03. The purpose of the proposed ATDCC-OS algorithm is to improve the accuracy of multilabel classification so as to obtain more effective data information.


Introduction
Multilabel classification, a commonly used method in big data analysis, aims to associate multiple labels with a sample at the same time. The ubiquity of multilabel data in real-life scenarios makes multilabel classification methods a popular research topic. However, in real-life applications, the integrity of the labels is usually not guaranteed. Due to poor data collection, the high cost of labeling, and other reasons, only some of the labels in those samples are marked.
There are many ambiguous examples in the real world, and sample instances may, with some probability, be associated with different attributes. Many multilabel classification algorithms have come into being, since it is usually very challenging to extend single-label classification theory to the multilabel case. With the development of machine learning, multilabel classification algorithms can be applied to imaging, recommendation systems, medical diagnosis, information retrieval, and many other fields [1][2][3][4][5][6][7][8]. In recent years, a large body of work accepted by top conferences (e.g., ACL, AAAI, COLING, KDD, NIPS, ICDM, CIKM, INTERSPEECH, ICML, and IJCAI) has proposed technologies and solutions for multilabel classification. Multilabel classification theory is a heated topic in data mining, which has attracted wide attention in the machine learning community.
There are two commonly used approaches to constructing a multilabel classification model: algorithm adaptation and problem transformation. Algorithm adaptation adjusts existing algorithms (e.g., AdaBoost and decision trees) to solve multilabel classification issues; its performance often remains poor. Problem transformation splits a multilabel classification task into several single-label classification tasks. Classical single-label classification theory is then used to solve these tasks, and the trained single-label classifiers are brought together as a super-classifier through linear combination. In this study, we investigate multilabel classification algorithms based on problem transformation.
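As a minimal sketch of the problem transformation idea, the snippet below splits one hypothetical multilabel dataset into L independent single-label (binary) tasks; the data values are illustrative only.

```python
# Problem transformation: split one multilabel task into L binary tasks.
# Hypothetical toy data; each row of Y marks which of the 3 labels apply.
X = [[0.2, 1.1], [0.9, 0.3], [0.5, 0.7]]
Y = [[1, 0, 1],
     [0, 1, 1],
     [1, 0, 0]]

# One single-label (binary) dataset per label column.
binary_tasks = [
    (X, [row[j] for row in Y])        # same attributes, j-th label as target
    for j in range(len(Y[0]))
]

for j, (_, y_j) in enumerate(binary_tasks):
    print(f"task {j}: targets = {y_j}")
```

Each resulting task can then be handed to any single-label learner, and the L trained classifiers jointly act as the multilabel super-classifier.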
There are many existing problem transformation methods, such as the BR method [9], the CC theory [10], the MBR model [11], and the DLMC-OS algorithm [12]. However, these methods usually ignore the correlation among labels, the randomness of label sequences, and redundant interactive label information, which reduces the accuracy of classification. The problem transformation method uses extended attributes to mine the correlation between labels, but the importance of feature attributes to different classification tasks is usually ignored in the process, which decreases the sensitivity of the classifiers. Therefore, we introduce an attention mechanism into the method. The attention mechanism [13] is a biologically inspired process based on how the human brain works. It is widely used in machine learning in areas such as speech recognition, image recognition, and natural language processing. The attention mechanism usually calculates a probability mapping from an input to different outputs; the result with the largest probability is chosen as the output, which is valuable for modeling the correlation between multiple attributes and labels. We therefore propose an attention mechanism-based multilabel classification algorithm built on a double-layer chain structure.
In the proposed algorithm (attention-based two/double-layer chain classification with optimal sequence, ATDCC-OS), we integrate three multilabel classification frameworks (BR, MBR, and CC) and an attention mechanism into a chain structure with two layers. This structure exploits a binary association classification framework. Layer one carries out the initial classification. In layer two, the chain-based classifier uses an updating process, interacting with the label information output by layer one, to complete the final classification. In particular, we place an attention mechanism in layer two and use the output of layer one to calculate the probability of the final classification results. This reveals important relationships between different attributes and improves the final classifier accuracy for different tasks. However, ATDCC-OS suffers from a random chain order problem. We leverage the optimal sequence selection (OSS) algorithm to solve this issue. OSS integrates several methods (the hierarchical traversal algorithm, PageRank, Kruskal's algorithm, and mutual information) to decide the labels' priority. The resulting priority ranking helps ATDCC-OS assign classifiers and construct the chain classification model.
The main contributions of this study are as follows: (1) A double-layer structure multilabel classification model is constructed that fully integrates the advantages of three classical classification models. At the same time, an attention mechanism is introduced to further analyze the influence of key attributes on classification results, optimizing traditional classification. (2) The OSS algorithm is proposed to solve the problem of low classification accuracy caused by the random chain order in the chain classification model, and it is applied to improve the second layer of the chained classification model. This classification model does not depend on any particular base classification algorithm. Experiments on benchmark datasets validate the effectiveness of the proposed approach by comparing its predictive performance with state-of-the-art methods.
The rest of this study is organized as follows: Section 2 deals with related work. Section 3 presents the proposed ATDCC-OS method. Section 4 introduces the datasets used in the experiments, performs simulations to verify the proposed method, and discusses the experimental results. We conclude our work in Section 5.

Multilabel Classification Method.
The multilabel classification approach has received much attention and is widely used in various fields, including text classification, scene and video classification, and bioinformatics. Multilabel classification includes two common methods: problem transformation and algorithm adaptation. The former changes a multilabel problem into one or several single-label problems [11] and solves them with basic classification algorithms such as naive Bayes, support vector machines [14], and the k-nearest neighbor algorithm. The latter transforms existing algorithms so that they can solve the multilabel classification problem directly, e.g., the ML-RBF method [15,16], the ML-kNN approach [17,18], rank-SVM classification [9], and associative classification algorithms [19,20].
BR (binary relevance) [9] is a common problem transformation method, which transforms the multilabel classification issue into several binary relevance problems by training one binary classification model per label. However, BR is often criticized because it cannot effectively use the correlation between labels. MBR, based on BR, was proposed [11] and constructed as a two-layer model. The output of layer one in MBR is taken as an additional sample attribute in the input of layer two in order to consider label correlation. However, the problem of label value redundancy is ignored in the training process of layer two.
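A binary relevance baseline in this spirit can be sketched with scikit-learn, assuming a logistic regression base learner in place of the paper's SMO; both X and Y below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                  # synthetic attributes
Y = (X[:, :3] + rng.normal(scale=0.1, size=(100, 3)) > 0).astype(int)  # 3 labels

# Binary relevance: one independent binary classifier per label column.
classifiers = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Predict each label for one instance, independently of the others.
x_new = X[:1]
pred = np.array([clf.predict(x_new)[0] for clf in classifiers])
print(pred.shape)  # → (3,), one 0/1 bit per label
```

Because each classifier never sees the other labels, any correlation between them is lost, which is exactly the weakness MBR and CC try to address.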
The CC method was proposed by the authors in [10], where a chain is exploited to model the correlation among all labels. It arranges the classifiers in a linear, randomly ordered chain, adds each previous classifier's output to the data sample attribute set, and takes it as the input to the next classifier. However, the random chain has several disadvantages. First, during CC training, each classifier's output is fed as a new attribute, together with the original attributes, into the next classifier, so the earlier classifiers in the chain have a greater impact on classification than the later ones; the order of classifiers in the chain therefore affects the classification result. Second, CC considers attribute correlation, but two linked classifiers can only use the correlation between adjacent attributes, and correlations between non-adjacent attributes cannot be used. Finally, the order of classifiers in the chain is randomly assigned, so the CC model is not unique, which gives the model strong randomness and undermines the stability of the algorithm [21,22].
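A classifier chain with a fixed (here arbitrary) label order can be sketched as follows: classifier j is trained on the original attributes plus labels 0..j-1, and at prediction time the earlier predictions are appended to the attribute vector. The base learner and data are stand-ins, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
Y = np.zeros((200, 3), dtype=int)
Y[:, 0] = (X[:, 0] > 0).astype(int)
Y[:, 1] = Y[:, 0] ^ (X[:, 1] > 0)                 # label 1 depends on label 0
Y[:, 2] = (X[:, 2] + Y[:, 1] > 0.5).astype(int)   # label 2 depends on label 1

# Train: classifier j sees the original attributes plus labels 0..j-1.
chain = []
for j in range(3):
    Xj = np.hstack([X, Y[:, :j]])                 # ground-truth labels at train time
    chain.append(LogisticRegression().fit(Xj, Y[:, j]))

def predict(chain, x):
    preds = []
    for clf in chain:
        xj = np.hstack([x, preds])                # earlier predictions become attributes
        preds.append(int(clf.predict(xj.reshape(1, -1))[0]))
    return preds

print(predict(chain, X[0]))
```

The sketch also makes the order sensitivity visible: permuting the chain changes which label estimates each classifier can condition on, which is the problem OSS addresses later in this paper.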

Complexity
The two-layer classification model DLMC-OS was proposed to solve the classification problem [12]. In this model, the output of the first-level classifier is forwarded to the second-level classifier as an extended feature. Each classifier in the second layer passes the latest classification results backward through a chain to consider the correlation between labels. This approach suppresses the randomness of the classifier chain, but it cannot obtain a unique classifier sequence in the chain.

Attention Mechanism.
The attention mechanism [23][24][25] is derived from the study of human vision, and it simulates the selective interest of human vision during observation. When the human eye scans a global image, the parts of the image that assist judgment are tracked dynamically, and irrelevant information is ignored. This process effectively reduces the amount of information processing when the eyes recognize images by paying more attention to selected parts. The modern attention mechanism was adopted for machine translation, where it greatly improved model performance [26]. In 2014, Google Brain published an article on the attention mechanism [27], pointing out that when viewing an image, people do not first look at every pixel but pay more attention to specific parts of the image based on their needs, and that humans focus future attention on the required locations based on previous observations. The authors designed a new architecture named the transformer, in which self-attention mechanisms are extensively used to compute text representations [28], breaking away from traditional RNN/CNN models. In recent years, transformer-style models have achieved good results in various tasks. Attention mechanisms have since become common and are widely used in classification tasks, such as sentiment classification [29], musical instrument recognition [30], visual recommender systems [31], multilabel text classification [32], and multiple protein subcellular location prediction [33].
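The probability mapping described above can be illustrated with a minimal softmax attention step; the keys and values below are hypothetical.

```python
import numpy as np

def soft_attention(query, keys, values):
    # Probability mapping from the query to each candidate output.
    scores = keys @ query                      # similarity of query to each key
    p = np.exp(scores - scores.max())
    p /= p.sum()                               # softmax -> attention distribution
    return p, p @ values                       # weights and the weighted summary

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([10.0, 20.0, 30.0])
p, out = soft_attention(np.array([1.0, 0.2]), keys, values)
print(p.argmax())  # → 2: the [1, 1] key is most similar to the query
```

The weights p sum to one, so the mechanism both selects the most relevant entry (largest probability) and blends the values proportionally to relevance.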

ATDCC-OS
3.1. Preliminaries. We set χ ∈ R^d and Y ∈ R^L as the input domain and output domain, respectively. There are d-dimensional attributes in the input domain and L-dimensional labels in the output domain. Each instance x_i ∈ χ is the input, and a label vector y = (y_1, y_2, ..., y_L) is the output. If label j is related to x, then y_j = 1; otherwise, y_j = 0. c^f = (y^f_1, y^f_2, ..., y^f_L) and c^s = (y^s_1, y^s_2, ..., y^s_L) denote the outputs of the first and second layers, respectively.

The ATDCC Framework.
By referring to the algorithm DLMC-OS, we construct the double-layer chain classification based on the attention mechanism (ATDCC). ATDCC converts the multilabel classification issue into a series of binary classification issues, each of which is independent of the others. In layer one, ATDCC performs binary transformation on the labels and constructs a classifier between the attributes and each label. After training, the classifier of each binary classification model is obtained [12]. ATDCC completes the binary classification of instances in layer one and then transfers the classification results to layer two as extended attributes. In layer two, ATDCC constructs a classification method with a chain structure by realizing an updating process with dynamic feedback: it exploits the classifier chain to transfer and update the labels, realizes the interaction among labels, and optimizes the classification result. ATDCC utilizes the correlation among all labels for multilabel classification through label information interaction within layers and label information transfer between layers.

ATDCC First-Layer. ATDCC First-layer follows the idea of a binary correlation classification model. It constructs a binary classifier for every label, and these binary classifiers are combined as classification layer one, as shown in Figure 1.

In step one, assume there is an annotated dataset with L labels. ATDCC First-layer constructs an attribute set for each label using equation (1). In step two, a binary algorithm B (such as SMO) is used to create the binary classifier for the training instances, as in equation (2). In step three, we use the obtained binary classifiers to classify and predict the unseen instance X.
Finally, the prediction result of each classifier (as shown in Figure 1(b)) is the output of the unseen instance in the first layer of ATDCC; these outputs c^f are integrated with the attribute set of the samples to build a new attribute set.

ATDCC AT-Layer. The attention mechanism is usually exploited in sequence-to-sequence learning paradigms. For different multilabel classification tasks, the attribute mapping weights between the two layers of ATDCC are different. The attention mechanism can capture the weight value of every attribute in the samples according to the requirements, which improves the final accuracy of the classification results.
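Building the extended attribute set is simply a concatenation of the original attributes with the layer-one label estimates c^f; a toy illustration with hypothetical values:

```python
import numpy as np

# Layer-one outputs c_f (one 0/1 estimate per label) are appended to the
# original attributes to form the extended attribute set for layer two.
x = np.array([0.4, 1.3, 0.7])          # original attributes of one instance
c_f = np.array([1, 0, 1])              # hypothetical first-layer label estimates
x_extended = np.concatenate([x, c_f])
print(x_extended)                      # 3 attributes followed by 3 label bits
```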

ATDCC AT-layer uses the attention mechanism mentioned above to dynamically compute the weights of the extended attributes. The layer-two model can adapt to the requirements of the current classification task by adjusting the weight values of the attributes transferred between the two layers of ATDCC.
In step one, the weight matrix W is defined according to the dimension of the first layer's original sample attributes, and the tanh function is used to train ATDCC AT-layer to capture the correlations between the input attributes and label i; the trained model is expressed as in equation (3), where W and b denote the weight matrix and the model's bias, respectively. In step two, ATDCC AT-layer uses a softmax function to transform the output of equation (3) into a probability value, which yields the attention score weights.
Finally, the extended attribute set is weighted by the attention weights obtained from equation (4), as shown in equation (5). The parameters of our model are optimized by minimizing the feedback from the loss function. The cross-entropy loss in equation (6) is used as the loss function; it accumulates the loss between the actual and predicted labels for each instance.
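The tanh scoring, softmax weighting, and cross-entropy loss of equations (3)-(6) can be sketched as follows; the weight matrix here is randomly initialized rather than trained, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                   # extended attribute dimension (orig + L)
x_ext = rng.normal(size=d)              # extended attribute vector from layer one

W = rng.normal(size=(d, d)) * 0.1       # untrained stand-in weight matrix
b = np.zeros(d)

scores = np.tanh(W @ x_ext + b)         # eq. (3): correlation scores per attribute
a = np.exp(scores - scores.max())
a /= a.sum()                            # eq. (4): softmax attention weights
x_weighted = a * x_ext                  # eq. (5): attention-weighted attributes

# eq. (6): cross-entropy between a true label y and a predicted probability p
def cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(cross_entropy(1, 0.9), 4))  # → 0.1054
```

In training, W and b would be updated by minimizing the accumulated cross-entropy over all instances and labels.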

ATDCC Second-Layer.
ATDCC Second-layer is the second layer of the ATDCC model (Figure 2). It uses the chain classification structure and exploits an updating process to classify the instances a second time. The attribute set of each binary model is expanded with the labels classified before the current position in the chain, creating the chain structure of classifiers. The attribute set of each binary model is augmented with the 0/1 label estimates obtained in layer one as well as all prior binary correlation estimates from layer two. In the second layer, the correlations between labels are fully applied: given the attribute set, each classifier in the chain learns and predicts the binary association of its label. In step one, ATDCC Second-layer creates the extended attribute vector D^s_{y_k} (1 ≤ k ≤ L) for each class label, as shown in the following equation,
where W represents the set of attribute weight values.
In step two, we use the binary approach B (such as SMO) to learn from the constructed extended attribute vector and create the binary classifier H^s_{y_k} ← B(D^s_{y_k}).
In step three, we use the constructed binary classifier to classify and predict the unseen instance X. During model training, we use the latest predicted label value to update the label values in each sample's attribute set; for example, for the third classifier H^s_{y_3} in a chain, the latest estimates of the preceding labels replace the corresponding label values in its extended attribute set. Finally, ATDCC takes the classification prediction result c^s = (y^s_1, y^s_2, ..., y^s_L) of the classifiers as the final classification for the unseen instance.
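The dynamic update step can be illustrated by replacing one layer-one label estimate in the extended attribute vector with the latest second-layer prediction; positions and values here are hypothetical.

```python
# During the second-layer pass, each classifier's latest prediction replaces
# the earlier estimate for that label in the extended attribute vector.
x_ext = [0.4, 1.3, 0.7, 1, 0, 1]   # 3 attributes + 3 layer-one label estimates
n_attr = 3                          # number of original attributes

def update_label(x_ext, label_idx, new_value, n_attr=n_attr):
    x_new = list(x_ext)             # copy, so earlier classifiers' input survives
    x_new[n_attr + label_idx] = new_value
    return x_new

print(update_label(x_ext, 1, 1))   # → [0.4, 1.3, 0.7, 1, 1, 1]
```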

OSS Method.
In the MBR model, the sequence of the classifiers in the chain is randomly arranged. If the classification accuracy of a classifier at the core of the chain is very low, the error propagates backward along the classifier chain, decreasing the accuracy of the subsequent classifiers. This further lowers the classification correctness and accuracy of the whole chain. As the number of labels increases, the randomness of the classifier chain also increases rapidly. The algorithm DLMC-OS can reduce the classifier chain's randomness, but the optimal label recognition sequence cannot be determined due to the nonuniqueness of the root node. The most effective remedy is to sort the sequence of the chain: the sequence of classifiers needs to be ranked according to the attributes and the characteristics of the chain classification model. For this reason, the following constraints are proposed for the search for the optimal chain sequence: (A) the label list is ordered according to a sequence that contains all label information; (B) the label sequence satisfies the greatest correlation of labels; (C) the label list sequence is optimal under the current conditions. Under these design rules, we propose OSS, which integrates mutual information and PageRank with Kruskal's algorithm and the hierarchical traversal method to find an optimal label sequence. The chain classification model uses this sequence as the rule for assigning the order of the classifiers, and the second layer of ATDCC is optimized with the OSS algorithm.

Subalgorithms Related to OSS
(1) Mutual Information (MI) Theory. In information theory and probability theory, mutual information (MI) evaluates the interdependence between two random variables, so we can obtain the "information amount" of a stochastic variable by observing another random variable. Equation (9) shows the MI of two variables. Probability theory and information theory are widely used in current information technologies, and MI theory is widely exploited in research. In machine learning, MI can be utilized for feature selection [34,35]. Search engines often use the MI between phrases and contexts to find semantic clusters [36]. In statistical mechanics, MI is used to address mechanical problems such as Loschmidt's paradox [37,38]. Following these applications, we evaluate the correlation between labels by computing the MI among labels and then use it as the edge weight in a fully connected graph. (2) PageRank. PageRank (PR) was proposed in [39] to solve the page ranking problem in link analysis. The key idea is that a page's importance is related to the number and quality of other pages that point to it. The algorithm is applied in Google's search engine [40]: the importance of a Webpage can be quantified by the number of links in the link structure, rather than relying on specific search requests. Twitter uses a personalized PageRank to recommend other accounts to users [41]. In this study, we combine PageRank with priority search to build a customized PageRank algorithm that selects the most important label to act as the chain's first node, overcoming the nonuniqueness of the chain head. (3) Weighted Graph. Many problems can be formulated as a graph with edge weights [42], where the weight of an edge may represent cost, length, or capacity, depending on the problem to be solved [43][44][45][46].
In our model, we use this weighted graph formulation to create the fully connected graph over the labels in Algorithm 1. (4) Kruskal's Algorithm. Kruskal's algorithm is normally used to find a minimum spanning tree [47]. Here, we use Kruskal's method to find the spanning tree with the largest label correlation weight, which provides the basis for creating a sequence in which the association between labels is largest. The designed procedure is shown in Algorithm 2. (5) Breadth-First Search Method. Breadth-first search (BFS) is an algorithm for finding the available paths of a graph by traversing or searching tree or graph data structures. We use PageRank to find the starting point and BFS to traverse the maximum label spanning tree to construct the resulting label order, as shown in Algorithm 3.
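The mutual information of equation (9) between two binary label columns can be computed from empirical frequencies; a stdlib-only sketch:

```python
import math
from collections import Counter

def mutual_information(a, b):
    # I(A;B) = sum over observed pairs of p(a,b) * log( p(a,b) / (p(a) p(b)) )
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    return sum(c / n * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

y1 = [0, 0, 1, 1]
y2 = [0, 0, 1, 1]   # identical labels -> MI equals the entropy, ln 2
print(round(mutual_information(y1, y2), 4))  # → 0.6931
```

MI is nonnegative and zero for independent labels, which is what makes it usable as an edge weight in the fully connected label graph.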

The OSS Detailed Design Framework.
The detailed design steps of the OSS algorithm are shown in Figure 3.
Step 1. Calculate the MI correlation between labels. Assuming there are N labels y_1, y_2, ..., y_N, we use formula (9) to calculate the MI of any two labels y_i and y_j; the MI is always nonnegative.
Step 2. Construct a fully connected graph G from the labels, where the labels are the graph's vertices and the MI values between labels act as the edge weights. Kruskal's algorithm is then used to build the label tree with the maximum weight; since Kruskal's algorithm finds a minimum spanning tree, the mutual information values are inverted so that the result is the maximum-weight spanning tree.
Step 3. Use the PageRank algorithm to sort the labels in the dataset by "voting" and select the label node with the highest PR value. This node acts as the root of the maximum-weight tree and is also selected as the first node of the hierarchical traversal algorithm. This overcomes the issue of a non-unique head label in the chain.
Step 4. Use Kruskal's algorithm to generate the maximum-weight label tree (MWT) of the fully connected graph G. The label tree includes all labels and the edges connecting the label nodes, and its weighted sum is the largest.
Step 5. Traverse the MWT by BFS, starting from the root node obtained by PageRank, to obtain the label sequence. Use this sequence as the guide for arranging the order of the classifiers in the chain, overcoming the order uncertainty of the classification, as shown in Algorithm 4.
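Steps 1-5 can be sketched end to end in pure Python under assumed MI values: Kruskal's algorithm (taking edges in decreasing MI order) builds the maximum-weight spanning tree, a small power-iteration PageRank on the weighted label graph picks the root, and BFS over the tree yields the label order. All MI values below are hypothetical.

```python
# Hypothetical MI values between four labels (symmetric, nonnegative).
mi = {("y1", "y2"): 0.8, ("y1", "y3"): 0.2, ("y1", "y4"): 0.3,
      ("y2", "y3"): 0.5, ("y2", "y4"): 0.7, ("y3", "y4"): 0.1}
labels = ["y1", "y2", "y3", "y4"]

# Steps 2/4: Kruskal on the fully connected graph; taking edges in
# decreasing MI order yields the maximum-weight spanning tree.
parent = {l: l for l in labels}
def find(u):
    while parent[u] != u:
        u = parent[u]
    return u

tree = {l: [] for l in labels}
for (u, v), w in sorted(mi.items(), key=lambda e: -e[1]):
    ru, rv = find(u), find(v)
    if ru != rv:                      # no cycle: keep this edge
        parent[ru] = rv
        tree[u].append(v)
        tree[v].append(u)

# Step 3: PageRank by power iteration on the MI-weighted graph picks the root.
weight = lambda u, v: mi.get((u, v), mi.get((v, u), 0.0))
out_w = {u: sum(weight(u, v) for v in labels if v != u) for u in labels}
pr = {l: 1 / len(labels) for l in labels}
for _ in range(50):
    pr = {u: 0.15 / len(labels) + 0.85 * sum(
              pr[v] * weight(v, u) / out_w[v] for v in labels if v != u)
          for u in labels}
root = max(pr, key=pr.get)

# Step 5: breadth-first traversal of the tree gives the label sequence.
order, queue = [], [root]
while queue:
    node = queue.pop(0)
    if node not in order:
        order.append(node)
        queue += tree[node]
print(root, order)
```

With these values, y2 has the largest MI-weighted degree, so it becomes the root and heads the resulting classifier chain.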

The ATDCC-OS Framework.
The ATDCC-OS design framework is plotted in Figure 4, where ATDCC First-layer and ATDCC Second-layer are the first and second layers. We utilize the OSS approach to optimize the chain structure in the ATDCC-OS framework and thereby find an optimal sequence of labels. According to this optimal sequence, we train each classifier in our model. We insert the attention mechanism layer between layer one and layer two to find the attributes and features that are important for the current task. In this way, we can build a better classifier in layer two, as shown in Algorithms 5 and 6.

Experiments
To validate the method, we perform simulations and use the experimental results to analyze the performance of the proposed algorithm. In the simulation, we compare the proposed algorithm (ATDCC-OS) with other multilabel classification algorithms (DLMC-OS, BR, CC, and MBR) on five metrics, using seven datasets as the multilabel benchmark.

Test Datasets.
We utilize the standard datasets provided on the Mulan [48] platform as the multilabel benchmark. Table 1 describes each dataset and the related statistics used in the simulation. N, F, and L represent the number of instances, the number of attributes per instance, and the number of labels in the dataset, respectively. Label cardinality (LCard) is a standard measure [49] denoting the average number of labels associated with an instance.
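LCard is just the mean number of positive labels per instance; for a toy label matrix:

```python
# Label cardinality: average number of labels associated with an instance.
Y = [[1, 0, 1],
     [0, 1, 0],
     [1, 1, 1]]
lcard = sum(sum(row) for row in Y) / len(Y)
print(lcard)  # → (2 + 1 + 3) / 3 = 2.0
```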

Evaluation Methods.
An evaluation indicator is a measure that directly indicates an algorithm's performance. To evaluate the method thoroughly, we use average precision, coverage, one-error, ranking loss, and micro-averaged AUC to analyze the performance of ATDCC-OS.
(1) Average precision: average precision [12] is an accuracy metric that combines recall with precision to rank search results. For each relevant label, it evaluates the mean fraction of labels ranked above it that are also relevant. The larger the average precision, the better the classifier. The average precision is computed from rank(·), the ranking function of the classifier.
[Algorithms 1-6, referenced in the text, respectively construct the MI-weighted label graph from the label values Y = (y_1, y_2, ..., y_L) (Algorithm 1), build the maximum-weight spanning tree of the labels (Algorithm 2), traverse it hierarchically to obtain the label sequence (Algorithm 3), combine these steps to output the optimized chain order OS(y_1, y_2, ..., y_L) (Algorithm 4), and train the first- and second-layer classifiers with attention-derived attribute weights (Algorithms 5 and 6).]

(2) Coverage [12]: coverage indicates how well the algorithm covers all possible labels. This metric describes how far, on average, we need to go down the ranked label list to include all labels relevant to the document. At the perfect recall level, coverage is loosely related to accuracy. The smaller the coverage, the better the algorithm. Coverage is calculated from rank(·), the ranking function associated with the classifier H(·).

(3) One-error metric [50]: the one-error metric indicates the proportion of examples whose top-ranked label does not fall into the relevant label set. The larger this metric, the worse the algorithm. It is expressed in terms of f(·), the real-valued function associated with the multilabel classifier H(·).
(4) Ranking loss metric [12]: the ranking loss metric measures how often labels are ordered incorrectly, that is, how often labels not relevant to the instance are ranked above relevant labels in the label sequence. The smaller this indicator, the better the algorithm's performance. (5) Micro-averaged AUC [50]: this metric is the area under the ROC curve, with values ranging from 0 to 1, and it directly reflects the classifier's performance; the smaller this metric, the worse the algorithm. The micro-averaged AUC is computed from a real-valued function f(·) [51] over the sets of related and unrelated label pairs.
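Several of these metrics have direct counterparts in scikit-learn (coverage_error, label_ranking_average_precision_score, label_ranking_loss, and micro-averaged roc_auc_score); one-error is computed by hand below. The scores are hypothetical classifier confidences, not outputs of ATDCC-OS.

```python
import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_average_precision_score,
                             label_ranking_loss, roc_auc_score)

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.8],      # hypothetical classifier confidences
                    [0.1, 0.7, 0.3],
                    [0.6, 0.8, 0.4]])

ap = label_ranking_average_precision_score(y_true, y_score)  # higher is better
cov = coverage_error(y_true, y_score)                        # lower is better
rloss = label_ranking_loss(y_true, y_score)                  # lower is better
# One-error: fraction of instances whose top-ranked label is irrelevant.
one_error = np.mean(y_true[np.arange(len(y_true)), y_score.argmax(1)] == 0)
micro_auc = roc_auc_score(y_true, y_score, average="micro")  # higher is better

print(ap, cov, rloss, one_error, micro_auc)
```

In this toy case every relevant label outranks every irrelevant one, so the ranking-based losses are zero and the precision/AUC metrics are perfect.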

Experimental Setting.
We use the datasets provided by the Mulan platform to evaluate all algorithms. Mulan [48] is an open-source library for multilabel classification built on Weka. In this study, we use SMO as the base classification algorithm. Four different classifiers are used for comparison: the DLMC-OS, MBR, CC, and BR algorithms. We select 80% of the instances of every dataset for training and the rest for testing. We adopt Adam [52] as the optimizer during training, with the default hyperparameters alpha = 0.001, beta1 = 0.9, beta2 = 0.999, and epsilon = 10^-8. Our simulation platform includes an Intel(R) Xeon(R) E5-2630 CPU, 128 GB RAM, and the CentOS 7.6 operating system. We design and implement the algorithms in the Java (JDK 1.8) runtime environment. Figures 5-10 show the performance comparison among the ATDCC-OS, DLMC-OS, MBR, CC, and BR algorithms in terms of average precision, coverage, one-error, ranking loss, and micro-averaged AUC. We use the mean ranking (Ave. rank) parameter to review the different classification results of the algorithms [53]. In these figures, each color represents an algorithm, and the algorithm names are listed in the upper left corner of each graph. The number on top of each bar is the performance rank of the algorithm on that dataset. In Figures 5-9, the y-axis denotes the evaluation results, while the x-axis gives the dataset names. In Figure 10, the x-axis denotes the algorithm name, while the y-axis shows the average rank of each algorithm over all datasets. Figure 5 shows the accuracy of each algorithm on each dataset: the ATDCC-OS method proposed in this study has the best overall performance.
Compared with the other methods, its accuracy is the highest on every dataset except yeast, where it is the lowest. Its accuracy on the flags, emotions, and medical datasets exceeds 80%.

Results and Discussion.
Figure 6 compares the micro-averaged AUC performance of the algorithms. The ATDCC-OS algorithm is also the best and the most stable in terms of micro-averaged AUC: its performance is the best on all datasets except birds, and its performance on the medical dataset reaches 0.96. Figure 7 compares the coverage performance of the algorithms. The lower the coverage, the better the algorithm. The coverage of the proposed ATDCC-OS algorithm is optimal on all datasets, and it is below 10% on the flags, emotions, birds, medical, and yeast datasets.
The one-error performance of each algorithm is shown in Figure 8. The performance of the proposed ATDCC-OS algorithm in this graph is relatively unstable compared with the other algorithms. However, from a comprehensive perspective, its performance is still good, and it is the best on the flags, birds, Enron, and BibTeX datasets. On the emotions dataset, its performance is second only to that of the MBR algorithm.
From Figures 5-9, we can see that ATDCC-OS shows the best classification performance, DLMC-OS follows closely, and the remaining methods perform worse. Among the evaluation metrics, mean precision and microaverage AUC directly indicate the performance of the classifiers: the larger the values, the better the algorithm. According to Figures 5 and 6, the proposed ATDCC-OS algorithm and the DLMC-OS method perform much better than the other algorithms. This is because both use a two-layer classification structure with label information interaction to build the second-layer classifiers, a design that takes the interrelationships between labels into account. In addition, ATDCC-OS exploits the classical attention mechanism to improve the sensitivity of the classifiers and adapt them to a variety of tasks. Three indicators, namely, coverage, ranking loss, and one-error, are often used to detect irrelevant labels in classification results. As shown in Figures 7-9, the ATDCC-OS algorithm and the earlier DLMC-OS algorithm again perform better than the rest, the BR approach shows medium performance, and the MBR and CC methods are the worst on these metrics. This is because ATDCC-OS and DLMC-OS use an optimization algorithm to train the classifiers in an optimal order, whereas the random chain order in the CC and MBR methods directly leads to poorer performance. The BR method, by contrast, does not consider the label order at all and therefore avoids this penalty, which explains its better performance. The ranking loss of each algorithm is shown in Figure 9. The ATDCC-OS algorithm achieves the best loss performance: on all datasets it is a level better than the other algorithms, and on the medical dataset its ranking loss is about 0.03.
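Ranking loss, the metric plotted in Figure 9, also has a standard definition: the fraction of (relevant, irrelevant) label pairs that the classifier orders wrongly, averaged over instances. A minimal Python sketch under the same assumed `Y`/`S` list-of-lists conventions (illustrative only; the paper's implementation is in Java):

```python
def ranking_loss(Y, S):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly.
    A wrong ordering scores an irrelevant label above a relevant one;
    ties count half. Instances with no relevant or no irrelevant labels
    contribute zero."""
    total = 0.0
    for y, s in zip(Y, S):
        rel = [s[j] for j in range(len(y)) if y[j]]
        irr = [s[j] for j in range(len(y)) if not y[j]]
        if not rel or not irr:
            continue
        bad = sum((r < i) + 0.5 * (r == i) for r in rel for i in irr)
        total += bad / (len(rel) * len(irr))
    return total / len(Y)
```

A perfect ranker, which scores every relevant label above every irrelevant one, achieves a ranking loss of 0.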
Figure 10 shows the comprehensive performance ranking of the compared algorithms, i.e., the mean ranking of the five classifiers over the mean accuracy, coverage, single error, ranking loss, and microaverage AUC metrics. Across all indexes, the ATDCC-OS algorithm performs best, the DLMC-OS algorithm is second only to ATDCC-OS, and the order of the remaining algorithms varies from metric to metric.
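A mean ranking like that of Figure 10 can be computed by ranking the algorithms on each dataset and averaging the ranks. A minimal sketch (ties are not rank-averaged here, a simplification; the algorithm names and scores in the example are made up for illustration):

```python
def average_ranks(scores, higher_is_better=True):
    """scores: {algorithm: [metric value per dataset]}.
    Returns {algorithm: mean rank across datasets}, rank 1 being best."""
    algos = list(scores)
    n_datasets = len(next(iter(scores.values())))
    rank_sums = {a: 0.0 for a in algos}
    for d in range(n_datasets):
        # order algorithms on this dataset, best first
        order = sorted(algos, key=lambda a: scores[a][d],
                       reverse=higher_is_better)
        for r, a in enumerate(order, start=1):
            rank_sums[a] += r
    return {a: rank_sums[a] / n_datasets for a in algos}
```

For lower-is-better metrics such as coverage, one-error, or ranking loss, pass `higher_is_better=False`.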
From our simulations, we find that the ATDCC-OS algorithm outperforms the other algorithms on most of the datasets, while it performs relatively poorly on yeast and birds. As is well known, no algorithm can obtain the best performance on all types of test datasets [10]. Algorithm performance depends not only on the detailed structure of the algorithm but also on the type and size of the dataset and on the balance of labels in the test data. Figures 11 and 12 plot average precision and ranking loss against the percentage of training data, illustrating how changing the amount of training data affects performance. In this experiment, we take the emotions dataset as an example for both comparisons. Figure 11 shows how the average precision of each classifier changes with the percentage of training data. From Figure 11, we observe that the average precision of the classifiers rises as the percentage of training data increases. When the percentage of training data is between 10% and 30%, the accuracy of all algorithms fluctuates. Beyond 30%, the average precision of ATDCC-OS and DLMC-OS rises steadily, while MBR needs to reach 40%, and CC and BR need to reach 60%, before rising steadily. Overall, as the training data increase, ATDCC-OS performs better than DLMC-OS, followed by MBR and BR, while CC is the worst.
From Figure 12, we can see the comparison in terms of ranking loss. As the percentage of training data increases, the ranking loss of ATDCC-OS and DLMC-OS decreases steadily compared with MBR, CC, and BR. When the training percentage is between 10% and 40%, the ranking loss of each algorithm is unstable: MBR fluctuates the most, followed by CC and BR, while ATDCC-OS and DLMC-OS behave better. When the training percentage exceeds 40%, the ranking loss curves of all algorithms trend downward, and ATDCC-OS still presents the lowest loss.
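The exact protocol behind Figures 11 and 12 is not spelled out in the text; one plausible setup is to shuffle the data once and take growing prefixes as training sets, scoring the held-out remainder at each step. A minimal sketch of such a sweep (the function name, seed, and fractions are illustrative assumptions, not the paper's procedure):

```python
import random

def learning_curve_splits(n_samples, fractions, seed=0):
    """Index splits for a training-percentage sweep.
    Shuffles once, then takes a growing prefix as the training set,
    so smaller training sets are subsets of larger ones."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    splits = {}
    for frac in fractions:
        k = int(n_samples * frac)
        splits[frac] = (idx[:k], idx[k:])  # (train indices, eval indices)
    return splits

splits = learning_curve_splits(100, [0.1, 0.3, 0.6])
```

At each fraction, the evaluation indices would be scored with average precision and ranking loss to produce one point on each curve.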

Conclusion
In this study, we propose a simple and effective multilabel classification model (ATDCC-OS) that integrates the frameworks of three classic problem-transformation approaches to multilabel classification. It exploits the advantages of each of these methods while addressing their shared weakness of ignoring the correlations among labels during classification. To further improve classification performance, the algorithm solves the problem of non-real-time label information interaction in the second-layer chained classification model by introducing the idea of "update replacement." At the same time, the algorithm dynamically computes the weight of every feature attribute through an attention mechanism, so that each classifier attends to the attribute features most important to its current classification target. This increases the sensitivity of the classifiers and greatly improves the precision of classification. Five different metrics are used to evaluate the algorithms on seven different datasets. The experimental results show that, in most cases, the proposed method achieves higher predictive performance than state-of-the-art multilabel classification methods. In terms of average accuracy, the ATDCC-OS algorithm is essentially the highest on all datasets, exceeding 80% on the flags, emotions, and medical datasets. In microaverage AUC, the ATDCC-OS algorithm performs best on all datasets except birds and reaches 0.96 on the medical dataset. In coverage, the ATDCC-OS algorithm performs best on all datasets, with coverage below 10% on several of them. In single error, the algorithm has the best comprehensive performance, and its ranking loss on the medical dataset is about 0.03.
Based on the above results, we conclude that the proposed ATDCC-OS algorithm performs best. These are only the preliminary results of this study. In the future, we will further optimize the algorithm to reduce the time complexity introduced by the model structure, and we will also try to apply the algorithm to classification problems in everyday work and life. Finally, we hope that this study can provide some reference and assistance to researchers working on problem-transformation approaches to multilabel classification.

Data Availability
The data used and/or analyzed during the current study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.