Weighted Decoding for the Competence Reliability Problem of ECOC Multiclass Classification

Error-Correcting Output Codes has become a well-known, established technique for multiclass classification due to its simplicity and efficiency. Each binary split contains different original classes. A noncompetent classifier emerges when it classifies an instance whose real class does not belong to the metasubclasses which is used to learn the classifier. How to reduce the error caused by the noncompetent classifiers under diversity big enough is urgent for ECOC classification. The weighted decoding strategy can be used to reduce the error caused by the noncompetence contradiction through relearning the weight coefficient matrix. To this end, a new weighted decoding strategy taking the classifier competence reliability into consideration is presented in this paper, which is suitable for any coding matrix. Support Vector Data Description is applied to compute the distance from an instance to the metasubclasses. The distance reflects the competence reliability and is fused as the weight in the base classifier combination. In so doing, the effect of the competent classifiers on classification is reinforced, while the bias induced by the noncompetent ones is decreased. Reflecting the competence reliability, the weights of classifiers for each instance change dynamically, which accords with the classification practice. The statistical simulations based on benchmark datasets indicate that our proposed algorithm outperforms other methods and provides new thought for solving the noncompetence problem.


Introduction
Multiclass classification has been an open issue in machine learning and many solutions are possible. As a "divide-andconquer" framework, Error-Correcting Output Codes (ECOC) [1,2] can realize the class decomposition and dichotomize ensemble effectively through a binary or ternary matrix, which not only simplifies the complexity of pattern recognition but also enables the classic binary classifiers suitable for multiclass classification. So far, ECOC has been widely applied to biological data recognition [3,4], disease diagnosis [5][6][7], military target recognition [8], and intelligent transportation systems [9] with fairly good recognition performance. e procedures of using ECOC to solve the multiclass problems are usually divided into three steps: coding, base classifier training, and decoding. e goal of coding is to construct a binary or ternary matrix M � (m ij ) c×l , m ij ∈ 1, 0, −1 { }, where the rows hold the codewords and the columns represent the binary splits. e coding methods can be categorized into three kinds based on the existing researches: predefined code, datadependent code, and dichotomies-based code. Independent of the specific application and instances, the predefined code ignores the potential information of the original classes and confines the improvement of classification performance. e dichotomies-based code involves finding an optimal matrix given a set of binary classifiers, which is proven to be an NPcomplete problem by Crammer and Singer [10]. However, the data-dependent code can make the best of the class separability to enhance performance as a whole, which has drawn special attention. e classic data-dependent codes contain Discriminant ECOC [11], Subclass ECOC [12], and HECOC [13]. In order to construct a data-driven matrix and enhance the diversity between dichotomizers, Zhou et al. [14] proposed a Confusion Matrix Superclass ECOC (CMECOC) based on the Fisher criterion. CMECOC used "one-versus-all" method to encode classes between the superclasses to avoid the unnecessary redundancy and "oneversus-one" method among the superclasses to reduce the complexity of decision boundaries. More research studies on data-driven ECOC are available [15][16][17][18][19][20][21], all of which promote the ECOC development. e second step is base classifier training. e most common practice is to train the classifiers (decision functions) with the binary splits. e outputs of the base classifiers are merged in the final decoding step. It is generally acknowledged that there are three categories of decoding strategies according to the existing literatures. e first type is those based on the distance between the predicted codeword and the target codeword.
e state-of-the-art decoding strategies of this type contain the Hamming decoding, Euclidean decoding, loss-based decoding, and so on. Zhou et al. [22] proposed a new weighted decoding by using genetic algorithm to learn the weight coefficient matrix and taking the final generalization error of the ensemble classifiers as fitness function to reduce the error caused by the consistent-diverse balance problem. However, the predicted codewords of the base classifiers are hard (either 1 or −1), which may be not suitable for the cost-sensitive classification. On this account, Lei et al. [23] introduced the reject option into coding phase to extend a binary-symbol output to a three-symbol one and used the modified Hamming decoding strategy for decoding. e second type is those based on class posterior probability estimation, which estimates the class membership based on the decomposition framework of ECOCs. Suppose that the base classifier h(x) outputs the class probability for a given test sample x; then the probability vector can be obtained as follows: where L is the number of M columns. en the relationship between the coding matrix, probability vector, and the posterior probability p � (P(c 1 | x), . . . , P(c k | x)) can be described as where k is the number of classes. Many research works have been done based on the equation. Zhou et al. [24] put forward a variation of the product rule and a linear rule based on posterior probabilities for ternary ECOC. e authors refined the decomposition process of the probability to get smoother estimates in the product rule and extend the linear rule in binary ECOC to ternary ones. Simeon et al. [25] designed the reject rules in decoding with an external and internal approach. e third type is those based on the analysis of the pattern space [26].
As Prior and Windeat et al. [27] pointed out, it was exactly the diversity inherent to ECOC decomposing framework that makes multiclass classification based on ECOC efficient. It is known that, due to the third symbol, ternary ECOC becomes more versatile and robust. However, when using ternary ECOC, we usually need to face the contradiction between a base classifier and the testing samples whose real class is not used to learn the classifier. We call this contradiction the noncompetence problem. On this account, Galar et al. [28] proposed a dynamic classifier selection strategy for the specific one-versus-one matrix that tried to avoid the noncompetent classifiers. e strategy considered the neighborhood of each instance to decide whether a classifier is competent or not. Furthermore, they developed a distance-based combination strategy to reduce the effect the noncompetence contradiction. In the strategy, the degree of the base classifiers' competence reliability was measured with the distance from the instance to each class [29].
However, Galar et al.'s work mainly concentrated the noncompetence problem on the predefined code (oneversus-one scheme) [30]. With the development of coding, most of the ternary ECOC are data-driven matrices with high base classifier diversity and low column redundancy. How to reduce the error caused by the noncompetence contradiction for the data-driven matrices under diversity big enough is the breakthrough of the ECOC classification. To this end, we propose a new weighted decoding strategy by considering the competence reliability (WCR) to decrease the error caused by the noncompetent classifiers. In the proposed algorithm, the SVDD is used to measure the closeness of an instance to each class, which can be regarded as the estimates of the competence reliability of the base classifiers for an instance. en, we can get the weight coefficient matrix, in which the weights of the competent classifiers are higher and those of the noncompetent ones are lower. In so doing, the influence of the noncompetence contradiction on classification can be weakened or eliminated. e outline of the paper is organized as follows: a brief introduction of the noncompetence problem and SVDD are discussed in Section 2. Section 3 focuses on the proposed WCR weighted algorithm. Experiments and results are summarized in Section 4 and Section 5 presents conclusions.

Noncompetence Problem
According to the pattern recognition theory, a good classification algorithm always demands for the same distribution of training samples and testing samples. Some original classes are denoted by zero when a ternary ECOC is used for classification. e testing samples and training samples for a base classifier may not belong to the same metasubclass. erefore, a classifier is noncompetent for an instance whose real class does not belong to the metasubclass used to learn the classifier. It is a prior question to estimate whether a base classifier is competent or not for an instance, which is equivalent to the classification problem itself. However, the outputs of noncompetent classifiers are merged in the decoding step equivalently. e existing datadriven matrices possess big diversity and fewer columns. e bias induced by the noncompetent classifiers may yield a decisive error for the final decoding results.
Taking the analysis above into consideration, to reduce the error induced by the noncompetent classifiers, it is crucial to determine how to estimate the competence reliability of a classifier for a testing instance. In order to achieve this goal, we present a new WCR decoding strategy to balance the noncompetence contradiction. As for the weighted decoding, Hüllermeie and Vanderlooy [31] provided a sound theoretical demonstration from the perspective of the optimal adaptive voting strategy. Zhou et al. [22] also acknowledged the significance of the weighted method and advocated that different weighted schemes led to different decoding results and were more important than decoding strategies themselves under some circumstances. e proposed WCR decoding can estimate the competence reliability by learning the weight coefficient matrix based on the column outputs. e ECOC framework based on WCR weighted decoding is described in Figure 1.
X is the training set with m dimensions and n samples. f denotes the base classifier. h(·) is the base classifier output. (w ij ) k×L is the weight coefficient matrix, in which k and L are the numbers of the classes and columns, respectively. e traditional ECOC classification is shown in the dotted box. h(·) can be trained with the binary splits based on the ternary matrix where the white rectangles represent 1, the black ones represent −1, and the gray ones are 0, which means the corresponding classes are not used to train the base classifier. e weight coefficient matrix can be obtained according to the decoding rule D and weight method. e final results can be achieved by the following formula: e key of WCR ECOC is how to calculate the weight coefficient matrix which reflects the competence reliability of the classifier for an instance. As we know, the closer an instance to a class, the more likely it belongs to, the more competent the base classifier trained with the corresponding class. So the question is how to measure the distance between an instance and each class.
First proposed by Tax and Robert [32], SVDD is an effective and commonly used single-class classification mechanism. SVDD obtains a spherically shaped boundary around a dataset X � x 1 , x 2 , . . . , x N ⊂ R m and can be made flexible by using kernel functions to project the original nonlinear data to high-dimensional feature space. Featured by the center o and radiusr > 0, the hypersphere covers all of the target data and superfluous space as little as possible, which can be used to detect nontarget data or outliers to realize two-class classification.
e hypersphere can be obtained by minimizing r 2 with the constraint ‖x i − o‖ 2 ≤ r 2 . In order to make the optimal zone more compact, according to the kernel projection notation, the low-dimensional input space F can be reflected into high-dimensional feature space H by nonlinear function Φ. e inner product operation in H can be replaced by the kernel function satisfied with Mercer constraints, which means finding the kernel function K(x, y) subjected to K(x, y) � 〈Φ(x), Φ(y)〉; after that, fix the hypersphere in H [33]. Allowing for the possibility of outliers, the distance from x i to o should not be strictly smaller than r 2 , but a larger distance should be penalized. erefore, the slack variable ξ i ≥ 0 is introduced and the minimization problem can be rewritten as follows: Parameter C controls the trade-off between r and the number of outliers. e optimal problem can be solved by changing into a dual problem with Lagrange multipliers: For a testing sample x, If ‖x − o‖ 2 ≤ r 2 , object x is recognized as the target class, and vice versa.

Construction of Weight Coefficient Matrix.
Consider a kclass classification case and the training dataset . . , k, and N i is the number of each class. N � k i�1 N i . e first step of the WCR weighted decoding is to use SVDD to compute the distance from an instance to each class.
for each class are achieved with X i , respectively, by using the kernel function. e Euclidean distance between an instance and the i th class is We may make the following assumption: an instance belongs to a class if the distance is equal to or smaller than the radius, and vice versa. D � (d 1 /r 1 , d 2 /r 2 , . . . , d k /r k ) for belonging preference relation is identified by normalizing the distance vector. From the analysis above, we can see that if d i /r i is not more than 1, the instance is likely to belong to class i . e smaller d i /r i , the larger the probability, while the larger the ratio, the smaller the probability.
Given a ternary matrix M � (m ij ) c×l , m ij ∈ 1, 0, −1 { }, if M ij � 0, class i is omitted in the j th base classifier training. If an instance belongs to class i , then the base classifier is noncompetent for the instance, whose outputs have little Computational Intelligence and Neuroscience guidance for classification prediction. erefore, the influence of the j th classifier should be weakened when the instance is classified. e WCR decoding relearns the weight of the base classifier based on competence reliability by using the following formula: It is worth noting that more than one class will be ignored during the training, so the final weight of base classifier is obtained: From equation (7), we may draw the following conclusions.
Given an instance, whose real class label is class i , the distance from the instance to class i is the shortest. If class i is ignored during the base classifier training, the weight of the corresponding base classifier in the decoding phase becomes smaller because d i /r i ≤ 1 and it is noncompetent. e more the classes ignored, the smaller the weight and the less competent the base classifier. According to the aforementioned considerations, the WCR weighted strategy is listed in Algorithm 1.
In order to elaborate the proposed method, the Balance benchmark dataset, 625 by 4 dataset with 3 classes from the University of California at Irvine (UCI) repository, is taken as an example. e numbers of the training and testing sets in one of the cross-validation folds are 499 [39 230 230] and 126 [58 10 58], respectively. e hypersphere configurations obtained by the proposed method during the training phase are indicated in Table 1.
In Table 1, σ is the value of radial basis function used in SVDD, α i is the solution to equation (4), and threshold is the radius range from the support vectors to the hypersphere center.
Consider a testing instance [1 5 2 4] belonging to class 1 . According to Algorithm 1, the distance vector D is (0.5417, 0.6978, 0.6519). If the ternary matrix is one-versus-one matrix, the weights of three base classifiers are 0.6519/0.4652 � 1.4013, 0.6978/0.4482 � 1.5569, and 0.5417/0.6003 � 0.9024, respectively. e weight of the third base classifier is the smallest because class 1 does not take part in the base classifier training, so the third base classifier is noncompetent for the testing instance and the weight is smaller than the other two competent classifiers.

Computational Intelligence and Neuroscience
After constructing the weight coefficient matrix, we can get the final results based on decoding strategy.

Computational Complexity of the Proposed Weighted
Decoding Strategy. Compared to the traditional ECOC classification, the time of our proposal is mainly consumed in constructing SVDD hyperspheres, which needs to solve the quadratic programming problem. e testing set has a complexity of Ο(m * N) to compute the distance from an instance to the center, where N is the number of samples and m is the dimension of attributes. erefore, the complexity has not obviously increased. e computation time and storage are acceptable. e approach of the weighted decoding for competence reliability comes into being. Next, we will validate the performance of its application in the classification through experiments.

Experiment Data.
In this section, we validate the proposed algorithm by using benchmark datasets. e characteristics of the UCI datasets are listed in Table 2. Certain UCI datasets are normalized by deleting the small samples. In the meanwhile, the principal component analysis is applied to reduce the dimensionality to promote classification speed.

Experimental Design.
ree kinds of experiments are executed for validating the proposed method in this paper.
Firstly, the benchmark datasets are used to evaluate the classification performance of the WCR decoding by comparing the results with four classic decoding strategies: Hamming decoding (HD), inverse Hamming decoding (IHD), linear loss function-based decoding (LLB), and exponential loss function-based decoding (ELB). ese strategies all belong to the first type of the decoding category.
Secondly, we compare the results of the weighted Hamming and LLB decoding based on error rate, class separability, and genetic algorithm [26], respectively. Five different ternary matrices, one-versus-one, spares random, subclass ECOC [12], HECOC based on SVDD [30], and CMECOC [22], are adopted in the experiments. When selecting the random code, we pick up the wanted matrix at random in the matrix set of 2000 sparse random matrices (whose probability of each code word is p(−1) � 1/3, p(0) � 1/3, and p(+1) � 1/3).
Finally, the WCR weighted method is compared to the distance-based relative competence weighting (DRCW) algorithm based on k-nearest neighbors [29] to evaluate the efficiency of estimating the competence reliability and solving the noncompetence problem. e one-versus-one coding matrix is adopted for consistency.
Decision tree and support vector machine (SVM) base classifiers are adopted in comparison, whose parameter configuration is shown in Table 3.
To evaluate the performance of different experiment results, we apply stratified tenfold cross-validation and test for the confidence interval at 95 percent with a two-tailed test if the number of the pieces of sample data is larger than 500. Otherwise, we apply stratified fivefold cross-validation [34] and the calculation formula is given by where μ and σ indicate the mean and variance, respectively, n is the number of fold validations, and t 0.025 (4) � 2.7764 and t 0.025 (9) � 2.2622.

Comparison of Classification Performance.
We compare the classification results of the WCR decoding with those of the original decoding strategies based on oneversus-one matrix and CMECOC. e former belongs to the predefined-based matrix and the latter belongs to the datadriven type. Figure 2 shows the classification accuracy of Input: any data-driven ternary matrix M ∈ −1, 0, +1 { } k×L , training set I, testing sample x Training phase Step 1: use coding matrix M and train l base classifiers f j (j � 1, 2, . . . , L); Step 2: apply SVDD to get the hypersphere (r i , o i ) i � 1, . . . , k, for each class; Testing phase Step 3: for a testing instance x, compute the distance to each class to get the D distance matrix; Step 4: use equations (6) and (7) to calculate the weights of base classifiers if M (i, j) � 0; Step 5: normalize w; Output: base classifier weight vector w ALGORITHM 1: e weight coefficient matrix construction based on SVDD. Computational Intelligence and Neuroscience these decoding strategies mentioned before based on decision tree classifier. From the results, we can see that the WCR decoding mechanism outperforms the original decoding ones on most of the datasets, which proves that the weighted decoding for competence reliability problem indeed has a positive influence on the classification performance regardless of the coding matrix. From the other way around, the idea of considering the noncompetence contradiction for the base classifier has practical significance. Effective and feasible, the WCR decoding possesses a remarkable promotion by evaluating the distance from an instance to each class based on SVDD and shifting the distance information into weight to merge in decoding phase. Figure 3 shows the classification accuracy of four different kinds of the weighted decoding strategies based on class separability (SW), error rate (EW), genetic algorithm (GW), and competence reliability (WCR), respectively.
In order to evaluate the performance comprehensively, Figure 4 shows the classification error for three data-driven matrices based on linear loss-based decoding and the corresponding weighted mechanism. From Figures 3 and 4, we can see that the green line representing the WCR strategy is on the top for accuracy and on the bottom for error on most of the datasets, illustrating that the decoding strategy based on the competence reliability performs better than the rest. Different from the weighted decoding strategies based on error rate, class separability, and genetic algorithm, in which the weights for base classifier are static, the weights in WCR mechanism dynamically change along with the distance from the testing instances to each class. In the meanwhile, the weights in classic methods are obtained in the testing step. e weights in WCR are got throughout the testing phase, which is much close to the real classification practice.
It is worth noting that some other weighted decoding algorithms perform the best sometimes. is can be explained by the fact that the corresponding dataset has a balanced distribution. Compared to the WCR decoding algorithm, there is only a little difference in the weight matrix got by error rate and class separability, which can still achieve the sound classification performance under certain situation suitable for their starting points. Figure 5 shows the runtime of different weighted decoding strategies. e computation complexity of WCR focuses on the hypersphere construction. Once the hyperspheres are built, short testing time is required, so the computation complexity is acceptable. In the meanwhile, the weighted decoding based on GW possesses the longest runtime because the weights are obtained by solving an optimal problem. e runtime of weighted decoding based on error rate is the shortest, which can be explained by the fact that the weights based on error rate can be got directly from the training accuracy.

Comparison of Competence Reliability Evaluation.
In order to validate the efficiency of our proposal further, the WCR mechanism based on SVDD is compared with the DRCW-OVO based on k-nearest neighbors. e coding is one-versus-one matrix, and SVM and decision tree are chosen as base classifiers, where k � 3 * number of class. Table 4 lists the accuracy of two methods of evaluating competence reliability, where the bold face denotes the best classification accuracy.
From Table 4, we can see that the accuracy of the weighted decoding based on WCR wins over DRCW twelve times in total 32-time experiment.
In order to get the statistical comparison, we calculate the average ranks as 1.75 and 1.25 for DRCW and WCR, respectively. e Nemenyi test [35] is used further to test the significant difference between the two methods. e critical difference value CD � q a ��������� � k(k + 1)/6J � q 0.05 ���������� 2 × 3/6 × 16 √   Computational Intelligence and Neuroscience � 0.49, where q a is the Studentized Range Statistic and k and J are the number of methods to be compared and experiment time for each method, respectively. We can see that the average rank of WCR and the difference are both larger than CD, which means the difference between the two methods is significant and the performance of WCR with lower rank is better.
is corroborates the efficiency of WCR in dealing with the noncompetence problem. In the contrast with k-nearest neighbors method, the distance obtained by SVDD hyperspheres is more comprehensive for the reason that it takes all the class samples into consideration. In summary, the idea of the weighted decoding based on WCR provides new sparks and thought for solving noncompetence problem.

Conclusion
How to estimate the competence reliability of base classifiers is a new issue for ternary ECOC classification. In order to reduce the error caused by the noncompetent classifiers, a new WCR weighted decoding is proposed in this paper. To achieve the goal, the SVDD hyperspheres are used to measure the distance from an instance to be classified to each class. en the distance is applied as the weight to fuse in the decoding step to make the final decision. In so doing, the    classifier competence reliability is evaluated quantitatively by the weights. In the meanwhile, the influence of the competent classifiers on classification is reinforced and that of the noncompetent ones is weakened. e experimental results prove the promotion of the WCR weighted decoding for classification. Roughly speaking, the proposed algorithm provides new possibilities for the noncompetence contradiction solutions. How to deal with the outlets for subclass separation and data uncertainty [36] is the next research direction.

Data Availability
All data used in this paper can be obtained by contacting the authors of this study.

Ethical Approval
is article does not contain any studies with human or animal subjects performed by any of the authors.

Consent
Informed consent was obtained from all individual participants included in the study.

Conflicts of Interest
e authors declare that they have no conflicts of interest.