An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data

High dimensionality and noise have made it difficult to detect related biomarkers in omics data. Through previous study, penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with mislabeled error. However, the algorithm commonly used in these studies is the concentration step (C-step), and the C-step algorithm that is applied to robust penalized regression does not ensure that the criterion function is gradually optimized iteratively, because the regularized parameters change during the iteration. This makes the C-step algorithm runs very slowly, especially when dealing with high-dimensional omics data. The AR-Cstep (C-step combined with an acceptance-rejection scheme) algorithm is proposed. In simulation experiments, the AR-Cstep algorithm converged faster (the average computation time was only 2% of that of the C-step algorithm) and was more accurate in terms of variable selection and outlier identification than the C-step algorithm. The two algorithms were further compared on triple negative breast cancer (TNBC) RNA-seq data. AR-Cstep can solve the problem of the C-step not converging and ensures that the iterative process is in the direction that improves criterion function. As an improvement of the C-step algorithm, the AR-Cstep algorithm can be extended to other robust models with regularized parameters.


Introduction
The first challenge presented by omics data is the high dimension, which far exceeds the sample size. The second challenge is the presence of noise in the omics data. This noise may be caused by misdiagnosis, mislabelling, recording errors, technical problems in the laboratory, or sample heterogeneity [1,2]. Penalized regression is a common method to solve the problem of variable selection and prediction for a high-dimensional dataset. It has been applied to omics data such as gene expression (4), GWAS [3], and DNA methylation [4]. However, the outliers in the data make the estimation of penalized regression inaccurate, so biomarkers cannot be properly screened. Additionally, the identification and further investigation of these outliers can correct the errors during the experiment or investigation. Therefore, it is very important to develop robust statistical methods for penalized regression.
A robust estimation method, least trimmed square (LTS), was proposed by Rousseeuw [5]. LTS is highly robust to outliers in both the response and predictors. It is effective for identifying outliers and can solve the problem of the masking phenomenon caused by the coexistence of multiple outliers [5,6]. Alfons et al. [6] applied LTS to LASSO-type penalized linear regression to solve the problem of robust high-dimensional variable selection when the dependent variable is quantitative data. Kurnaz et al. [7] applied LTS to elastic net-(EN-) type penalized linear and logistic regression to solve the problem of robust high-dimensional variable selection when the dependent variable is quantitative and binary data (enetLTS).
Both studies adopted the concentration step (C-step) in the FAST-LTS algorithm proposed by Rousseeuw and Van Driessen [8]. The basic ideas were an inequality involving order statistics and sums of squared residuals. This inequality guarantees that the criterion function declines monotonically as the iteration progresses. However, when it is applied to penalized regression based on trimming, the inequality does not necessarily hold due to the change of the regularized parameters. Thus, the criterion function cannot be guaranteed to decrease. Through our previous simulation study [9], we have found that enetLTS is effective in identifying mislabeled samples in high-dimensional data with mislabeled error. However, it is also found that for a dataset with n = 500, p = 1,000, and an outlier ratio of 10%, it takes nearly 2 hours (Intel Core i7-6500U @2.50GHz) to run enetLTS once. For the omics data in real data analysis with n = 924 and p = 19690, enetLTS running time is about 77.8 hours (Intel Xeon Silver 4112 @2.60GHZ), which obviously does not meet the requirements for efficient data processing.
Therefore, the C-step algorithm needs to be improved to adapt to high-dimensional data. In this study, the AR-Cstep algorithm is proposed to solve the estimation of robust penalized regression based on trimming, which combines the C-step algorithm with the acceptance-rejection algorithm proposed by Chakraborty and Chaudhuri [10] . Two algorithms are compared in terms of variable selection and outlier identification accuracy and computation speed in simulation study. An RNA-seq dataset for triple negative breast cancer (TNBC) [1] that contains 28 samples with discordant labels obtained from different tests (immunohistochemical (IHC) method or fluorescence in situ hybridization (FISH)) is used to illustrate the application of the two algorithms.
The structure of this paper is as follows: In results section, simulation experiments are described that compare the MTL-EN (elastic net-type maximum trimmed likelihood) estimation using the AR-Cstep algorithm with enetLTS. The results of enetLTS and MTL-EN applied to a triple negative breast cancer (TNBC) RNA-seq dataset are compared. Then, the results are discussed and concluded.
In this article, a robust penalized logistic regression model based on trimming is introduced in Section 2. And the AR-Cstep algorithm is proposed and described in Section 3. In Section 4, simulation experiments are described that compare the MTL-EN (elastic net-type maximum trimmed likelihood) estimation using the AR-Cstep algorithm with enetLTS. The results of enetLTS and MTL-EN applied to a triple negative breast cancer (TNBC) RNA-seq dataset are compared in Section 5. We conclude with a discussion in Section 6 and a conclusion in Section 7.

Robust Penalized Logistic Regression Model Based on Trimming
Kurnaz et al. [7] proposed an EN-type penalized logistic regression based on trimming.
where dðy i 1 , Robust penalized logistic regression model based on trimming was denoted as enetLTS (robust EN based on the LTS), and C-step algorithm was adopted. We denote it as β∧ enetLTS in this paper. The estimate of the same model obtained by the AR-Cstep algorithm is recorded as the ENtype maximum trimmed likelihood estimate β∧ MTL−EN .

Algorithm
3.1. C-Step Algorithm. Kurnaz et al. [7] adopted the C-step algorithm in enetLTS. This algorithm was described below.
When the regularized parameters λ = λ 1 and α = α 1 are fixed, at the kth step of the iteration, H k is the current subset with h observations, and b β H k is the solution of the penalized logistic regression based on H k . The negative log-likelihood functions corresponding to n observations can be derived from b β H k . The subsample H k+1 consists of the h smallest negative log-likelihood observations, that is, where dðy i 1 , Thus, QðH k+1 ; b β H k Þ ≤ QðH k ; b β H k Þ can be obtained. H k+1 is the subset that minimizes the criterion function under the solution b β H k . Then, penalized logistic regression is applied to subset H k+1 . If λ = λ 1 and α = α 1 are unchanged, we get the solution b β H k+1 which minimize the solution of criterion function under the regularization parameters λ 1 and α 1 . Thus QðH k+1 ; b β H k+1 Þ ≤ QðH k+1 ; b β H k Þ holds. Therefore, when λ = λ 1 is fixed, The definition of H k+1 makes the first equation hold. The definition of b β H k+1 makes the second inequality hold.
For the C-step algorithm, the candidate subset H k+2 is constructed by sorting out h samples with the smallest negative log-likelihood contribution to QðH k+1 ; b β H k+1 Þ. Then, the C-step algorithm continues until Q m = Q m−1 .

2
Computational and Mathematical Methods in Medicine Therefore, when λ = λ 1 and α = α 1 remain unchanged, as the number of iterations k increases, the criterion function decreases. Because the criterion function is nonnegative and the number of subsets with sample size h is limited, the C-step algorithm must converge to the subset with the smallest criterion function after a limited number of steps.
The C-step algorithm is described in Algorithm 1, where "continueCstep" is set so that the absolute value of the difference between the likelihood functions of two iterations is less than some small value.
However, when penalized regression is performed on the subset H k+1 , the regularized parameters λ and α are not fixed. The regularized parameters are usually determined by data, such as by cross-validation. The regularized parameters determined for penalized regression performed on two different subsets are often different, which leads to the second inequality of [11] not necessarily being true.
A way to solve the problem is to set all λ and α values firstly. For a certain combination of λ and α, perform the C-step algorithm until convergence. Then, compare the convergent subsets under different regularized parameters, and select the subset that minimizes the criterion function. If the number of λ values is 40 and that of α values is 20, there are 800 parameter combinations. This means running the Cstep algorithm 800 times, which will undoubtedly make the algorithm very slow.

AR-Cstep Algorithm.
In this study, the AR-Cstep algorithm is proposed to solve the estimation of the robust penalized regression based on trimming, which combines the C-step algorithm with the acceptancerejection algorithm, which was proposed by Chakraborty and Chaudhuri [10].
3.2.1. Acceptance-Rejection Algorithm. The acceptancerejection algorithm is similar to that of Metropolise-Hastings in MCMC. Let H k represent the subset at the kth step of the iteration. Then, a randomly selected sample outside of H k replaces one of the samples in H k to form H cand . The corresponding likelihood function is obtained after penalized regression is performed on H cand . If the criterion function corresponding to H cand is better than that corresponding to the current subset H k , then H cand is accepted as H k+1 with probability one, and H k+1 = H cand . Otherwise, H cand is accepted as H k+1 with a probability of p < 1, so that the algorithm can escape the local optimal value.
In the acceptance-rejection algorithm, the candidate sample at each step is randomly selected from the remaining samples other than the current subset H k . Thus, whether the candidate subset can improve the criterion function better is completely random, which leads to the slower convergence of the iteration. The advantage of this algorithm is that, whether the criterion function corresponding to the candidate subset is better than that of the current subset is examined at each step. Moreover, the subset with the optimal criterion function up to the current step is recorded at each step.

AR-Cstep Algorithm.
The changes of the regularized parameters λ and α make the C-step algorithm hardly grad-ually converge to the subset with the smallest criterion function. Suppose the current subset is H k , and we obtain b β H k , and corresponding criterion function QðH k ; b β H k ; λ 1 , α 1 Þ after the penalized regression is performed on H k . The h smallest negative log-likelihood observations constitute the subset H cand , so that QðH cand ; b β H k ; λ 1 , α 1 Þ ≤ QðH k ; b β H k ; λ 1 , α 1 Þ holds. Then, penalized regression is performed on H cand , b β H cand is obtained, and the corresponding regularized parameters changed to λ 2 and α 2 . The corresponding criterion function QðH cand ; b β H cand ; λ 2 , α 2 Þ of H cand is not necessarily less than QðH cand ; b β H k ; λ 1 , α 1 Þ. The AR-Cstep algorithm adds the step of comparing the criterion function of the candidate subset H cand with that of the current subset The criterion function corresponding to the initial subset is recorded as the optimal subset, that is, To make the proportion of samples with y = 1 in the candidate subset H cand consistent with that in the full set, the samples constituting the candidate subset H cand are selected in the following manner. H cand consists of h 1 observations with the smallest dðy i , x i ′ b β H k Þ among observations with y = 1 (set a total of n 1 observations), and h 0 observations with the smallest dðy i , x i ′ b β H k Þ among observations with y = 0 (set a total of n 0 observations), where h 1 = bðn 1 + 1Þηc, and b:c means round down. 1 − η is the trimming ratio and h 0 = h − h 1 . In comparison with the acceptance-rejection algorithm, for which H cand consists of samples selected randomly from the complementary set, H cand of AR-Cstep is composed of observations with the smallest deviance; that is, each sample of H cand contains information that improves the criterion function; hence, the algorithm converges to the subset with the optimal criterion function faster. The AR-Cstep algorithm is described in Algorithm 2.
The acceptance probability p = e τ k ðlog ℓðβ∧ H cand , H cand Þ−log ℓðβ∧ H k , H k ÞÞ . It is inversely proportional to the absolute value of the difference between the two likelihood functions log ℓð b β H k , H k Þ and log ℓð b β H cand , H cand Þ. The acceptance probability p is also related to τ k . According to τ k ≔ log ðk + 1Þ/D, the acceptance probability p is inversely proportional to k, which is the kth step of the iteration. Similar to the study of Farcomeni and Viviani [12], D ≔ 0:1nð1 − ηÞ, and the acceptance probability p is inversely proportional to the sample size h of the subset. When other features remain unchanged, the larger the sample size h of the subset, 3 Computational and Mathematical Methods in Medicine the smaller the probability of being accepted. Additionally, if the current subset is not replaced after r iterations, the iteration process is stopped.
To ensure that the initial subset does not contain outliers, the sample size should be smaller. The initial subset consisted of six observations, three of which were randomly selected from groups y = 1 and y = 0, respectively. In order to make the algorithm reach the global optimal value, multiple initial subsets were selected.
First, the two-step iteration of AR-Cstep was performed on 500 initial subsets, and 500 updated subsets were obtained. Then, the 10 subsets with the smallest criterion function were retained. Then, AR-Cstep was performed on these 10 subsets until convergence. Among the 10 convergent subsets, the subset with the smallest criterion function was selected, denoted by H opt . The penalized regression was performed on H opt , and b β opt was obtained.

Reweighted
Step. In this article, we choose the subset of size h = bnηc where η = 0:75. So 1 − η is the initial guess that less than 25% of outliers contained in the data. This is a rather conservative estimation of proportion of outliers. There may not be so many outliers in the data. Therefore, reweighted step is considered to detect outliers via b β opt . Then, these outliers are excluded, and a new subset H rwt is While (continueCstep) do Penalized logisitic regression is applied on the current subset H k , and get b i βÞ of every observation is got and observations are sorted according to their deviances.
The h observations with smallest negative log-likelihood function are retained to form a subset H k+1 . end Algorithm 1: Description of C-step algorithm.
k represents the kth iteration, and r represents that the current subset has not been replaced after r iterations.
The corresponding criterion function is where y is the dependent variable and X j is the jth independent variable. In iteration step of AR-Cstep, we take a grid with steps of size 0.05 b λ max and α = 0:5 to reduce the computational burden. In the reweighted step, we take a grid with steps of size 0.01 b λ max of λ to derive the solution b β opt and b β rwt . The choice of α is selected by cross-validation in the interval [0.1,1] with a step size of 0.1. It would be better to standardize predictors before applying the penalized regression. Standardization mainly is aimed at eliminating the influence of dimension and quantity of a predictor. However, the mean and standard deviation computed from all sample are not robust with outliers. In the algorithm described above, penalized regression is applied to the subset in every iteration step of AR-Cstep. So we firstly, respectively, compute mean and standard deviation from subsamples. Then, we standardize all samples with this mean and standard deviation before applying penalized regression.

Comparison of MTL-EN and enetLTS on Outlier
Detection and Variable Selection. Simulation settings were the same as Sun et al. [9]. The parameter h of both enetLTS and MTL-EN was both set to b0:75nc, which meant the trimmed rate is 25%. The parameters in Ensemble followed Lopes et al. [1].
In the simulation experiment, we compared the two methods enetLTS and MTL-EN using C-step and AR-Cstep algorithms, respectively. Through our previous research [9] and subsequent simulation experiments, we can see that enetLTS is good at identifying outliers. However, the FDR of its variable selection is high, and many unrelated variables are identified. When encountering mislabeled omics data, we can combine enetLTS with Ensemble. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy. Then, we added the third method Ensemble to the simulation experiment. A detailed description of Ensemble is provided in our previous study [9].
The performances of the three methods are summarized in Figure 1.
The outlier detection accuracy of the three methods is shown in Figure 1. Here, we used two indicators Sn (sensitivity) and FPR (False Positive Rate) [14]. Sn represents the proportion of true misclassified individuals identified as misclassified ones among all true misclassified observations. FPR represents the proportion of individuals with correct labels that are wrongly categorized as misclassified ones.
The outliers identified by MTL-EN had the higher Sn than enetLTS. When the proportion of outliers were 10% and 15%, the gap between them further widened. MTL-EN FPRs were close to enetLTS. Ensemble has the lowest Sn and FPRs among the three methods. Therefore, MTL-EN had the best accuracy in identifying outliers.
The variable selection accuracy of the three methods is shown in Figure 1. PSR (Positive Selection Rate) indicates the proportion of true disease-related biomarkers identified in all true disease-related biomarkers. FDR (False Discovery Rate) represents the proportion of biomarkers that are not related to disease among all the screened biomarkers. A comprehensive indicator GM [15,16] for the accuracy of variable selection was used, which is the geometric mean of PSR and (1 − FDR). High accuracy of variable selection is indicated by a high GM.
MTL-EN variable selection accuracy was very similar to enetLTS with high PSR and FDR. As also shown in our previous study [9], Ensemble had the highest variable selection accuracy with much low FDR; however, Ensemble missed some associated variables when the proportion of outliers was 10% or 15%.
In terms of variable selection, when there were a small proportion of outliers, Ensemble performed best. However, its accuracy was greatly decreased when the proportion of outliers was large. In terms of outlier detection, regardless of the portion of outliers, MTL-EN had the highest outlier detection accuracy among the three methods.

4.2.
Combining with Ensemble to Improve the Accuracy of Variable Selection. In our previous study [9], we considered a two-step procedure when the proportion of outliers was relatively large. We found that it improved the variable selection accuracy by applying Ensemble on a subset with outliers identified by enetLTS removed. In this study, we also used MTL-EN to detect outliers and then applied Ensemble on the subset with outliers removed. The results of MTL-EN and enetLTS were compared by simulation, which is shown in Table 1.
From Table 1, compared with the results in the original data, the PSR of Ensemble raised from 0.533 to 0.644, and the GM was improved from 0.714 to 0.786 for subset after removing outliers identified by enetLTS. For subset with outliers identified by MTL-EN removed, the results of Ensemble were also improved with PSR increased from 0.533 to 0.708 and GM increased from 0.714 to 0.828. It can be seen that after removing the outliers identified by MTL-EN, the accuracy of Ensemble variable selection is the highest. Table 2, the computation time of enetLTS is 39 times that of 5 Computational and Mathematical Methods in Medicine MTL-EN (Intel Core i7-6500U @2.50GHz); that is, the computation time of MTL-EN was 2% of that of enetLTS. This is because the C-step algorithm used by enetLTS does not take into account the regularized parameters that need to be determined at each step of the iteration. The criterion function cannot be guaranteed to gradually decrease, which makes the algorithm converge slowly. The AR-Cstep algorithm adopted by MTL-EN solves this problem well, which greatly improves the convergence speed.

Case Study
In the previous study [9], we compared the application of enetLTS, Ensemble, and Rlogreg on a TNBC dataset from the TCGA-BRCA data collection. The results showed that enetLTS identified 68 outliers, seven of which were individuals with inconsistent labels. After removing the outliers identified by enetLTS, the prediction accuracy of the three Ensemble models was improved, and the number of associated genes identified increased from 5 to 9. In this study, we applied MTL-EN to this TNBC dataset. The outliers identified by MTL-EN were compared with those by enetLTS, and we also compared the performances of Ensemble after removing the outliers identified by MTL-EN and enetLTS, respectively. From Tables 3 and 4, among the 68 outliers identified by enetLTS, 3 of them were labeled as TNBC, which were also identified by MTL-EN; among them, 65 individuals with non-TNBC labels included 35 non-TNBC patients identified by MTL-EN. In other words, 38 of the 47 outliers with non-TNBC labels identified by MTL-EN were also identified by enetLTS. However, nine patients with TNBC labels were not identified by enetLTS. These 9 TNBC patients were highly expressed in one or more of the three genes, suggesting that they were likely to be non-TNBC patients or misclassified individuals.  HER2 60.12), and TCGA-LL-A740 (HER2 68.56), with high expression in one or more of three receptors, were more likely not to be a TNBC patients; that is, his/her labels were probably wrong. Seven of the 47 outliers identified by MTL-EN were suspect individuals with inconsistent HER2 labels. Six of them were labeled as non-TNBC, which were also detected by enetLTS. The remaining one "TCGA-A2-A0EQ" was labeled as TNBC, which was not detected by enetLTS.
In our previous study [9], we combined the advantages of enetLTS and Ensemble and removed 68 outliers identified by enetLTS, then ran Ensemble on a subset (856 samples), to improve the accuracy of gene selection. In this study, we removed 47 misclassification samples detected by MTL-EN and then ran Ensemble in the remaining 877 samples. The results are shown in Tables 6 and 7.
From Table 6, for the subset with outliers detected by enetLTS removed, the prediction index MR of the three models in Ensemble was much lower than that on the original TNBC dataset; the MR of EN decreased from 0.012 to 0, the SPLS-DA MR reduced from 0.064 to 0.008, and the SGPLS MR reduced from 0.059 to 0.015. When Ensemble was run on a subset of 47 outliers identi-fied by MTL-EN, the prediction accuracy MR of the three models in Ensemble also decreased greatly, to 0.001, 0.014, and 0.013, respectively.

Discussion
Through our previous research [9], we have found that in high-dimensional data with mislabeled error, robust trimmed penalized regression is a recommended method in identifying mislabeled samples. However, the C-step algorithm to implement this method (enetLTS) is too slow to meet the requirement of data analysis for high-dimensional omics data. The reason is that for LTS without regularized parameters, the inequality that guarantees the convergence of the C-step algorithm is established. However, for the robust trimmed penalized regression with regularized parameters, the inequality does not necessarily hold due to the change of the regularized parameters.
In the AR-Cstep algorithm, penalized regression is repeatedly performed on the subset at each step to concentrate on the individuals who fit the model best gradually; that is, the idea of the C-step algorithm is still adopted. However, AR-Cstep can solve the problem of the C-step algorithm not converging because the regularized parameters change during the iteration. A comparison of the likelihood function of the current subset and that of the candidate subset is used to determine whether to replace the current subset with the candidate subset in AR-Cstep, thereby ensuring that the iterative process is in the direction that improves the criterion function. To avoid falling into a local optimum, the Metropolis-type probabilistic acceptance-rejection algorithm is combined.  Through simulation experiments, it is found that MTL-EN using AR-Cstep algorithm was more accurate than enetLTS using C-step algorithm in outlier identification. In particular, the accuracy of Ensemble variable selection on the subset after removing outliers identified by MTL-EN was higher than the result of Ensemble running on the subset after removing outliers identified by enetLTS. The AR-Cstep algorithm adopted by MTL-EN greatly improved the convergence speed; that is, the computation time of MTL-EN was 2% of that of enetLTS.
If a misclassified sample identified by a certain method is labeled as non-TNBC, it means that the expression of the key genes ER, PR, or HER2 is false positive in this patient. Similarly, if a misclassified sample identified is labeled as TNBC, it implies that the expression of ER, PR, or HER2 is a false negative in the patient. In the analysis of the TNBC dataset, there are 153 individuals labeled as TNBC in this TNBC dataset. There are 3 samples identified by enetLTS that were labeled as TNBC patients with false negative rate 2% (3/153). Twelve individuals labeled as TNBC patients were identified as mislabeled samples by MTL-EN with false negative rate 7.8% (12/153). In the TNBC dataset, IHC test of ER and PR was adopted for all patients. For HER2 detection, the results of IHC were for 507 patients. According to previous studies, the false negative rates of IHC test for ER, PR, and HER2 were not low, 15.1%~21.8% for ER [39], 6.8% (4/58) for PR [40], and 6.2% (4/65) for HER2 [41], respectively. Therefore, the false negative misclassified samples identified by MTL-EN were more likely to be close to the reality than enetLTS.
A large class of computational problems in robust statistics can be formulated as the selection of the optimal subset of data based on some criterion function [10]. AR-Cstep algorithm, as the improvement of C-step algorithm, can be extended to other robust models with regularized parameters. It is an effective algorithm for finding the most suitable subset of regularized models, such as robust Adaptive LASSO, Group LASSO, SCAD, and MCP. The AR-Cstep algorithm can be extended to other generalized linear models, such as penalized multiclass logistic regression and penalized Poisson regression.

Conclusion
AR-Cstep can solve the problem of the C-step algorithm not converging because the regularized parameters change during the iteration. It provides an idea for developing the efficient algorithm of robust penalized regression based on trimming. The AR-Cstep algorithm can be extended to other robust models with regularized parameters. In practice, MTL-EN using AR-Cstep algorithm is the recommended method for mislabeled sample identification in omics data because of its high accuracy and high operation speed. When the proportion of mislabeled samples is relatively low and ≤5%, Ensemble can be used for variable selection. When the proportion of mislabeled samples is >5%, Ensemble can be used for variable selection on a subset of data after removing mislabeled samples identified by MTL-EN.