Perceptron Ranking Using Interval Labels with Ramp Loss for Online Ordinal Regression

Owing to its wide applicability and learning efficiency, online ordinal regression using perceptron algorithms with interval labels (PRIL) has been increasingly applied to ordinal ranking problems. However, the PRIL method still struggles with noisy labels, under which the ranking results may change dramatically. To tackle this problem, in this paper we propose noise-resilient online learning algorithms based on the ramp loss function, called PRIL-RAMP, together with a nonlinear variant K-PRIL-RAMP, to improve the performance of the PRIL method on noisy data streams. The proposed algorithms iteratively optimize the decision function under the framework of online gradient descent (OGD), and we justify the algorithms by proving the order preservation of the thresholds. Experiments on real-world datasets validate that both approaches are more robust and efficient against noisy labels than state-of-the-art online ordinal regression algorithms.


Introduction
Ordinal regression, also called ranking learning, plays a central role in learning tasks where the labels of data samples need to be ordered. It has been routinely used in social ranking tasks, e.g., collaborative filtering [1], ecology [2], and detecting the severity of Alzheimer's disease [3], to name a few. In contrast to other types of regression analysis, ordinal regression describes relationships between variables where the order matters. For example, age categories can be sorted as (0-9, 10-19, · · ·, 90-99) and levels of bond credit can be sorted as "B" < "A" < "AA" < "AAA" [4]. In general, ordinal regression learns a linear (or nonlinear) scoring function together with a set of K − 1 thresholds, where consecutive thresholds delimit a class. Ordinal regression also differs from multiclass classification in that the output labels have a natural order.
A seminal work on ordinal regression can be traced back to the proportional odds model (POM) proposed by McCullagh [5], in which a general class of regression models for ordinal data was studied. Later, large-margin formulations for ordinal regression were proposed in [6,7]. Both approaches outperformed existing benchmarks when applied to ranking and multiclass classification. Furthermore, to accelerate training, Gu et al. [8] proposed asynchronous parallel coordinate descent algorithms. Besides, Berg et al. [9] showed that training a classifier can improve neural network accuracy and proposed that using several discrete data representations simultaneously can improve neural network learning compared to a single representation. In many situations, interval labels have to be used instead of an exact label. As an example, when predicting product ratings we may get an entire range of scores (e.g., 1-3 and 4-7) from different customers. In [10], a large-margin batch algorithm using interval labels was proposed. The approaches discussed so far are batch algorithms. Recently, in order to reduce computational time, Crammer et al. [11] proposed an online learning algorithm for ordinal regression, the passive aggressive (PA) algorithm, which chooses an appropriate step size to find the new parameters in every trial. Later, Manwani and Chandra [12] employed the PA algorithm to solve problems with interval labels. The perceptron is another principled method for learning classifiers in an online fashion. The authors of [13,14] proposed an online algorithm (with similar principles to the classic perceptron used for 2-class separation) for finding the set of parallel hyperplanes complying with the separation rule. In [15], a perceptron-based approach was proposed for ordinal regression using interval labels.
Although existing works have made huge progress in reducing memory requirements and computational time, training models on datasets of poor quality, e.g., with a lot of noise, remains challenging. Most existing classification algorithms predominantly focus on static data and are designed to learn discriminant information from clean data. They, however, suffer from two major shortcomings: noisy labels may negatively affect the learning procedure, and learning from scratch may incur a huge computational burden. The authors of [16] proposed a noise-resilient online classification algorithm, which is scalable and robust to noisy labels, and applied it to peptide identification [17]. To reduce the negative influence of outliers, the authors of [18] proposed a more robust algorithm termed ramp loss for twin K-class support vector classification (Ramp-TKSVC), in which the ramp loss function substitutes the hinge loss function.
In this work, we study noise-resilient online learning algorithms for online ordinal regression on noisy data streams. Particularly, we propose perceptron ranking using interval labels with ramp loss (PRIL-RAMP) and its kernel variant (K-PRIL-RAMP) for ordinal regression in an online learning manner. Our key contributions are as follows: (1) We propose linear and nonlinear noise-resilient online algorithms, i.e., PRIL-RAMP and K-PRIL-RAMP, and design a procedure to update the model parameters; the optimal parameters are obtained by an online gradient descent (OGD) procedure. (2) We theoretically prove that the proposed PRIL-RAMP algorithm keeps the thresholds in order. (3) Experimental studies on various datasets demonstrate the effectiveness of the proposed algorithms by comparing them with state-of-the-art algorithms.
The paper is organized as follows. In Section 2, we discuss a generic framework of ordinal regression using interval labels; furthermore, we introduce the proposed PRIL-RAMP and K-PRIL-RAMP algorithms and discuss the order preservation of their thresholds. In Section 3, we present experiments and the comparison results between PRIL-RAMP and the state-of-the-art algorithms. We conclude the paper in Section 4.

Learning to Rank in Online Ordinal Regression.
Ordinal regression has been successfully applied to a great number of real-world problems. Typical ordinal regression trains models in a batch manner, i.e., feeding all data into the training model at once [6][7][8]. The batch training manner demands huge computation and memory for large-scale problems, and it is also not adaptable to streaming data. Compared with the traditional batch learning framework, online learning (shown in Figure 1) processes samples in a streaming fashion, so it scales well and responds in real time. We first introduce online ordinal regression with the perceptron algorithm and its variant, and then present the noise-resilient online learning algorithms PRIL-RAMP and K-PRIL-RAMP in the next section.
Let X ⊂ R^d be the instance space and Y = {1, . . . , K} the label space with "<" as the order relation, and let (x_1, y_1), . . . , (x_T, y_T) be a sequence of instance-rank pairs, where x_t ∈ R^d and y_t ∈ Y is its corresponding rank, t = 1, . . . , T. Every instance x ∈ X is assigned an interval label [y_l, y_r] ∈ Y × Y; in particular, this reduces to the exact-label scenario when y_l = y_r. Let S = {(x^1, y_l^1, y_r^1), . . . , (x^T, y_l^T, y_r^T)} be the training set. In ordinal regression, the objective is to learn a scoring function f: X ⟶ R together with ordered thresholds θ_1 ≤ · · · ≤ θ_{K−1} (with θ_K = ∞), which predict the rank as the first interval into which f(x) falls, i.e., ŷ = min{i ∈ Y : f(x) < θ_i}; the lower interval endpoint can be treated similarly. Let L^I_IMC(f(x), θ, y_l, y_r) denote the loss with implicit constraints (IMC) enforcing the ordering of the thresholds θ_i, where I indicates the interval setting [8].
Then, the loss function is defined as in [13]:

L^I_IMC(f(x), θ, y_l, y_r) = Σ_{i∈I} max(0, −z_i(f(x) − θ_i)),

where I = {1, . . . , y_l − 1} ∪ {y_r, . . . , K − 1}. As each term max(0, −z_i(f(x) − θ_i)), i = 1, . . . , K − 1, is convex, the loss is also convex. Note that the loss is unbounded and can grow arbitrarily large on heavily noisy samples. The loss function L^I_IMC is therefore primarily suited to learning from clean data, and the overall performance of the learning model degrades under label noise.
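As a concrete illustration, the interval-label loss above can be sketched in a few lines of Python (the function name `interval_loss` and its signature are ours, not from the paper; we assume the K − 1 thresholds are stored in a list and labels run from 1 to K):

```python
def interval_loss(f_x, theta, y_l, y_r):
    """Sketch of the interval-label perceptron loss L_IMC from the text.

    theta    : list of K-1 thresholds (labels are 1..K)
    y_l, y_r : interval label endpoints
    For i < y_l we want f(x) > theta_i (z_i = +1);
    for i >= y_r we want f(x) <= theta_i (z_i = -1);
    thresholds strictly inside the interval incur no loss.
    """
    loss = 0.0
    for i in range(1, len(theta) + 1):  # threshold index i = 1..K-1
        if i <= y_l - 1:
            z = 1.0
        elif i >= y_r:
            z = -1.0
        else:
            continue
        loss += max(0.0, -z * (f_x - theta[i - 1]))
    return loss
```

For example, with `theta = [1, 2, 3]` and interval label [2, 2], a score f(x) = 1.5 lies in the target slab and incurs zero loss, whereas f(x) = 0.5 violates θ_1 and incurs loss 0.5.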

Perceptron Ranking Using Interval Label Algorithms with Ramp Loss.
We start the presentation of the noise-resilient online learning algorithms with the PRIL algorithm and its variants.

PRIL Algorithm.
Perceptron ranking using interval labels (PRIL) was first introduced in [13] for ordinal regression in an online fashion. Let x_t be the observed example, and let w_t ∈ R^d and θ_t ∈ R^{K−1} be the parameters of the online ordinal regression model at time t. Let I = {1, . . . , y_l − 1} ∪ {y_r, . . . , K − 1}. Then we define z_i = I[i ≤ y_l − 1] − I[i ≥ y_r], where I[A] is an indicator function taking value 1 if A is true. Thus z_i = 1 for i ∈ {1, . . . , y_l − 1} and z_i = −1 for i ∈ {y_r, . . . , K − 1}.
Let f(x) = w · x. For a given sample-interval pair (x, [y_l, y_r]), the loss L_IMC(w · x, θ, y_l, y_r) = 0 only when z_i(w · x − θ_i) > 0 for all i ∈ I, i.e., when w · x lies strictly between θ_{y_l−1} and θ_{y_r}. The perceptron ranking loss function in the linear case can thus be rewritten as L(w, θ; x, y_l, y_r) = Σ_{i∈I} max(0, −z_i(w · x − θ_i)).

PRIL-RAMP and K-PRIL-RAMP Algorithms.
Now, we present the PRIL-RAMP procedure, which minimizes the estimated risk of the ramp loss in online ordinal regression using the OGD framework [14].
(1) Ramp Loss. To offset the influence of noisy data, we adopt the ramp loss as a surrogate of the perceptron loss function. Writing the margin as t = z_i(f(x) − θ_i) and the cut-off parameter as s < 0, the ramp loss is defined as r_s(t) = min(max(0, −t), −s). Notice that the ramp loss is noise-resilient: once z_i(f(x) − θ_i) falls below s, the loss is capped and the sample contributes zero gradient. In classification methodologies, robustness to noise is always an important issue. The effect of noisy samples can be very large under the perceptron loss, since the penalty given to outliers is unbounded, as is the case for any convex surrogate loss. The ramp loss, in contrast, is upper-bounded, and hence it can control and largely remove the effect of noisy samples. Plots of the perceptron loss and the ramp loss in Figure 2 illustrate the robustness (noise resilience) of the ramp loss.
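The two losses can be sketched directly from the definitions above (a minimal illustration; the function names are ours):

```python
def perceptron_loss(t):
    """Unbounded perceptron loss on the margin t = z_i * (f(x) - theta_i)."""
    return max(0.0, -t)

def ramp_loss(t, s=-1.0):
    """Ramp loss: the perceptron loss clipped at -s (the cut-off s is negative).

    t >= 0     -> 0    (correct side, no loss)
    s <= t < 0 -> -t   (linear penalty, same as the perceptron loss)
    t < s      -> -s   (capped: badly violated / noisy samples contribute
                        a constant loss and hence zero gradient)
    """
    return min(max(0.0, -t), -s)
```

A sample with margin −100 costs 100 under the perceptron loss but only −s = 1 under the default ramp loss, which is exactly the bounded-penalty behaviour the text describes.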
In addition, the ramp loss is a difference-of-convex (DC) function: it can be written as r_s(t) = l_0(t) − l_s(t), where l_0(t) = max(0, −t) is the perceptron loss and l_s(t) = max(0, s − t) is a shifted hinge, as illustrated in Figure 3.
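The DC decomposition can be checked numerically; here `ls` is the shifted hinge max(0, s − t) (a sketch under our naming, with the default cut-off s = −1):

```python
def l0(t):
    """Convex perceptron loss."""
    return max(0.0, -t)

def ls(t, s):
    """Convex shifted hinge: zero until t drops below s, then linear."""
    return max(0.0, s - t)

def ramp_via_dc(t, s=-1.0):
    """Ramp loss written as a difference of two convex functions: l0 - ls."""
    return l0(t) - ls(t, s)

# the DC form agrees with the direct definition min(max(0, -t), -s)
for k in range(-50, 51):
    t = k / 10.0
    assert abs(ramp_via_dc(t) - min(max(0.0, -t), 1.0)) < 1e-12
```

This decomposition is what makes procedures for DC programs (such as CCCP, discussed below) applicable to the ramp loss.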
(2) PRIL-RAMP Algorithm. In the setting of ordinal regression, instead of letting all violated thresholds contribute errors, we let the ramp loss decide which thresholds to update; the ordinal inequalities on the thresholds are then satisfied automatically at the optimal solution. Figure 4 uses an example to illustrate the update rule of the PRIL-RAMP algorithm. In this example, we set θ = {1, . . . , 10}; note that θ_10 = ∞ is omitted from all the plots in Figure 4. The correct rank of the instance is given by the interval [y_l, y_r] = [9, 10], so the value of w · x_t should fall in the last interval, above θ_9. However, as Figure 4 shows, the value of w · x_t falls below θ_7, and the predicted rank is ŷ = 6. The thresholds θ_7, θ_8, and θ_9 are in error, since their values are higher than w · x_t. To mend the mistake, the algorithm decreases θ_7 and θ_8 by one unit, replacing them with θ_7 − 1 and θ_8 − 1 (the margin at θ_9 lies below the ramp cut-off, so θ_9 is left untouched). It also modifies w to w + 2x_t, since Σ_{i: s ≤ z_i(w·x_t − θ_i) ≤ 0} z_i = 2.
Thus, the inner product w · x_t increases by 2‖x_t‖². This update is illustrated in the middle plot of Figure 4, and the updated prediction rule is sketched on the right-hand side. Note that after the update, the predicted rank of x_t is ŷ = 8, which is closer to the true rank y = 9; we allow a small error of one unit. Note also that the data with interval label [1, 2] (in red) are noise. By replacing the perceptron loss with the ramp loss,

Figure 1: The framework of online ordinal regression. At round t, the learner is given a question x_t ∈ X and is required to provide an answer f(x_t). After predicting the answer, it receives the correct rank y_t and updates its ranking rule by modifying w, so that it enjoys good scalability and real-time response.
PRIL-RAMP guarantees that the parameters are not updated by data whose score z_i(f(x) − θ_i) falls below the ramp cut-off s, which significantly controls the effect of noisy data.
Next, we describe the procedure by which PRIL-RAMP updates the parameters for predicting the label. The concave-convex procedure (CCCP) [19] could be used to obtain the optimal solution of the DC objective; however, as mentioned above, such a batch learning manner cannot meet the real-time requirement of streaming data. In this work, we instead use the OGD method to find a near-optimal solution, as a tradeoff between accuracy and scalability. To estimate w and θ, we initialize w^0 = 0 and θ^0 = 0. Let w^t and θ^t be the estimates of the parameters at the beginning of trial t, let x_t be the example observed, and let [y_l^t, y_r^t] be its label interval. Taking a subgradient of the ramp loss, only thresholds whose margin lies in the sloped region s ≤ z_i(w^t · x_t − θ_i^t) ≤ 0 trigger an update, and w^{t+1} and θ^{t+1} are estimated as

w^{t+1} = w^t + η Σ_{i ∈ I: s ≤ z_i(w^t · x_t − θ_i^t) ≤ 0} z_i x_t,
θ_i^{t+1} = θ_i^t − η z_i if i ∈ I and s ≤ z_i(w^t · x_t − θ_i^t) ≤ 0, and θ_i^{t+1} = θ_i^t otherwise.

In practice, we take s = −1. Note that the sets I are known at every trial t. The complete description of the PRIL-RAMP algorithm is given in Algorithm 1.
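The update just described can be condensed into a single OGD step (a hypothetical sketch; `pril_ramp_update` and its signature are our naming, with the text's defaults η = 1 and s = −1):

```python
import numpy as np

def pril_ramp_update(w, theta, x, y_l, y_r, eta=1.0, s=-1.0):
    """One sketched PRIL-RAMP step on example (x, [y_l, y_r]).

    Only thresholds whose margin z_i * (w.x - theta_i) lies in [s, 0]
    are updated; margins below s sit on the flat part of the ramp
    (treated as noise) and are skipped.
    """
    f = float(np.dot(w, x))
    tau = 0.0
    theta = theta.astype(float).copy()
    for i in range(1, len(theta) + 1):  # threshold index i = 1..K-1
        if i <= y_l - 1:
            z = 1.0
        elif i >= y_r:
            z = -1.0
        else:
            continue
        if s <= z * (f - theta[i - 1]) <= 0.0:
            tau += z                 # accumulate the weight-vector step
            theta[i - 1] -= eta * z  # push the threshold past the score
    return w + eta * tau * x, theta
```

With w = 0, thresholds (1, 2, 3), and exact label 4, the score 0 violates all three thresholds, but with s = −1 only θ_1 (margin −1) is updated; a looser cut-off such as s = −10 would update all three, as plain PRIL does.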
(3) K-PRIL-RAMP Algorithm. For nonlinear ordinal regression, we choose a suitable kernel function κ: X × X ⟶ R, as proposed by Manwani [15]. That is, the decision function is represented as f(x) = Σ_{s∈[t]} α_s κ(x_s, x), where [t] denotes the indices of the examples observed so far. The kernel PRIL algorithm with ramp loss (K-PRIL-RAMP) is given in Algorithm 2.
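A minimal sketch of the kernelized score, assuming a Gaussian RBF kernel (our choice for illustration; [15] selects kernels per dataset):

```python
import numpy as np

def rbf(x1, x2, gamma=1.0):
    """Gaussian RBF kernel kappa(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.exp(-gamma * np.dot(d, d)))

def kernel_score(support, alphas, x, gamma=1.0):
    """f(x) = sum_s alpha_s * kappa(x_s, x) over the examples seen so far."""
    return sum(a * rbf(xs, x, gamma) for xs, a in zip(support, alphas))
```

The same threshold updates as in PRIL-RAMP then apply to this score; note that the support set grows with the stream, which is the usual memory cost of online kernel methods.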

Theoretical Analysis of the PRIL-RAMP Algorithm.
Considering the properties of PRIL algorithms, we now show that the PRIL-RAMP algorithm inherently maintains the ordering of the thresholds at each iteration.

Theorem 1.
Order preservation of thresholds in the PRIL-RAMP algorithm: let w^t and θ_1^t ≤ · · · ≤ θ_{K−1}^t be the parameters at trial t, let x_t be the instance at trial t, and let [y_l^t, y_r^t] be its corresponding interval label. Let θ_1^{t+1}, . . . , θ_{K−1}^{t+1} be the thresholds updated by PRIL-RAMP. Then θ_1^{t+1} ≤ · · · ≤ θ_{K−1}^{t+1}.

Proof. We need to analyse four cases, according to whether each threshold index k lies below, at the boundary of, inside, or above the interval [y_l, y_r]. In each case, the unit-size ramp updates either move adjacent thresholds together or move a single threshold away from its neighbour, so adjacent thresholds cannot cross; in particular, for k ∈ {y_r, . . . , K − 1}, the updated thresholds satisfy θ_k^{t+1} ≤ θ_{k+1}^{t+1}. This completes the proof. A similar demonstration can be given for the K-PRIL-RAMP algorithm, and hence we omit it. Theorem 1 shows that PRIL-RAMP keeps the thresholds in order.

Algorithm 2: K-PRIL-RAMP algorithm (kernel perceptron ranking using interval labels with ramp loss).
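Theorem 1 can also be sanity-checked numerically under the sketched update rule, assuming integer-valued initial thresholds, η = 1, and s = −1, matching the unit-step setting of the algorithm (an illustration, not a substitute for the proof):

```python
import numpy as np

def update_thresholds(theta, f, y_l, y_r, eta=1.0, s=-1.0):
    """Sketched PRIL-RAMP threshold update for a score f = w.x."""
    theta = theta.astype(float).copy()
    for i in range(1, len(theta) + 1):
        z = 1.0 if i <= y_l - 1 else (-1.0 if i >= y_r else 0.0)
        if z != 0.0 and s <= z * (f - theta[i - 1]) <= 0.0:
            theta[i - 1] -= eta * z
    return theta

# brute-force check that theta_1 <= ... <= theta_{K-1} survives the update
rng = np.random.default_rng(1)
for _ in range(1000):
    theta = np.sort(rng.integers(-3, 4, size=4)).astype(float)  # K = 5
    y_l = int(rng.integers(1, 6))
    y_r = int(rng.integers(y_l, 6))
    new = update_thresholds(theta, float(rng.normal()), y_l, y_r)
    assert np.all(np.diff(new) >= 0.0)
```

Note that the check relies on the thresholds staying integer-spaced under unit updates, which is exactly the structure the proof exploits.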

Experiments
We evaluate the performance of the PRIL-RAMP algorithm by comparing it with other benchmark methods on datasets with various ratios of noise. All experimental studies are performed in the MATLAB R2016a environment on a PC with a 2.5 GHz Intel Core i5 processor and 8 GB of RAM, running the Windows 10 operating system.

Dataset.
All datasets are obtained from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) [20] and the LIBSVM website, and data features are normalized to zero mean and unit variance coordinate-wise. More details about the datasets are provided as follows: Abalone: the dataset contains 4177 instances with 8 attributes, related to physical measurements of abalone found in Australia. A typical task on this dataset is to predict the age of the abalone via the "Rings" attribute, which varies from 1 to 29. We divide the target range 1-29 into 4 intervals: 1-7, 8-9, 10-12, and 13-29 [12].
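The interval construction for Abalone can be reproduced with a short helper (a sketch; `rings_to_class` is our naming):

```python
import numpy as np

def rings_to_class(rings):
    """Map the Abalone 'Rings' value (1-29) to the 4 ordinal classes
    used in the text: 1-7 -> 1, 8-9 -> 2, 10-12 -> 3, 13-29 -> 4."""
    right_edges = [7, 9, 12, 29]  # inclusive right edge of each bin
    return int(np.searchsorted(right_edges, rings)) + 1
```

The same binning pattern applies to the other regression targets below (e.g., house price per unit area), with different edges.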
Parkinsons-updrs: the dataset consists of a range of biomedical voice measurements from people with early-stage Parkinson's disease. There are 5847 instances with 21 features.
Real estate valuation: this dataset has 414 instances with 7 attributes. We focus on predicting the house price per unit area, which ranges from 0 to 200; the 4 created intervals are 0-20, 21-40, 41-60, and 61-200. Table 1 shows the characteristics of the 5 datasets, including the number of patterns, attributes, and classes, as well as the number of patterns per class. To show the statistical properties of the different datasets, we drew the box plots in Figure 5, where the x-axis denotes the feature dimensions and the y-axis the feature values, with points beyond the whiskers marked as outliers. The plots indicate that the datasets themselves already contain noise, especially the Parkinsons-updrs dataset.

Generating Noise.
Noise is generated by randomly choosing m% of the instances from the dataset. First, for the covariate variables x, the error term follows the normal mixture 0.8N(0, 1) + 0.2N(0, 10²) [21]. A visualization of the noisy datasets is provided in Figure 6, where the red points denote the original examples and the blue circles the outliers. Then, noise is injected into the target variable: for each chosen example, we randomly assign one of the intervals [y − 1, y], [y, y + 1], [y − 2, y − 1], or [y + 1, y + 2], where y is the actual label. Finally, we consider m = 25 and 50.
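The noise-generation protocol can be sketched as follows (our implementation of the description above; the function names and the fixed `rng` seed are illustrative, and candidate intervals are clipped to the valid label range):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_feature_noise(X, m=0.25):
    """Perturb m*100% of the rows of X with the normal mixture
    0.8*N(0, 1) + 0.2*N(0, 10^2), applied coordinate-wise."""
    X = np.asarray(X, float).copy()
    idx = rng.choice(X.shape[0], size=int(m * X.shape[0]), replace=False)
    for i in idx:
        wide = rng.random(X.shape[1]) < 0.2  # pick the mixture component
        X[i] += rng.normal(0.0, np.where(wide, 10.0, 1.0))
    return X

def noisy_interval(y, K):
    """Replace an exact label y by one of the four candidate intervals,
    clipped to the valid label range 1..K."""
    cands = [(y - 1, y), (y, y + 1), (y - 2, y - 1), (y + 1, y + 2)]
    lo, hi = cands[int(rng.integers(len(cands)))]
    return int(np.clip(lo, 1, K)), int(np.clip(hi, 1, K))
```

The clipping step is an assumption on our part; the paper does not state how out-of-range intervals near the boundary labels are handled.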

Kernel Functions.
We use the kernel functions suggested in [15] for the different datasets in Kernel PRIL and K-PRIL-RAMP.

Result Comparison.
We compare the noise-tolerance performance of the PRIL-RAMP and K-PRIL-RAMP algorithms with the basic PRIL and Kernel PRIL proposed in [15], respectively, where PRIL is a perceptron-based approach for online ordinal regression using interval labels. The difference between PRIL and Kernel PRIL is that PRIL is linear, while Kernel PRIL uses a nonlinear kernel function in the experiments.

Accuracy.
We use average accuracy as the metric to evaluate the performance of all algorithms in the experimental studies. Accuracy, short for average accuracy, is defined as 1 − MZE, where the mean zero-one error is MZE = (1/T) Σ_{t=1}^T I[ŷ_t ≠ y_t], with y_t the true label and ŷ_t the predicted label. We run all algorithms 10 times and average the instantaneous accuracy across the 10 runs. The accuracies of the four algorithms on data with different noise levels are reported in Figures 7-9, where the x-axis represents the number of examples and the y-axis the accuracy on test examples. We select an appropriate ramp loss parameter s for PRIL-RAMP and K-PRIL-RAMP and set η = 1.
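The metric is simply one minus the mean zero-one error, which can be sketched as:

```python
import numpy as np

def mze(y_true, y_pred):
    """Mean zero-one error: fraction of mispredicted ranks."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def accuracy(y_true, y_pred):
    """Average accuracy used in the experiments: 1 - MZE."""
    return 1.0 - mze(y_true, y_pred)
```

Note that MZE treats every misprediction equally; the order-aware measures MAE and RMSE used later penalize larger rank deviations more.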
The results of the experimental studies are summarized as follows: (1) For all the datasets, accuracy gradually increases and stabilizes as T increases, while the accuracy of all algorithms decreases as the level of noise increases; the interference of label noise does degrade prediction accuracy. (2) On different datasets, the linear algorithms (PRIL and PRIL-RAMP) and the nonlinear kernel algorithms (Kernel PRIL and K-PRIL-RAMP) perform differently, which is determined by each dataset; nevertheless, the kernel algorithms need to solve a more complex model for ranking problems. (3) In general, PRIL-RAMP outperforms PRIL. In particular, K-PRIL-RAMP shows consistently higher test accuracy than Kernel PRIL over all datasets with different noise levels. The noise-resilient algorithms (PRIL-RAMP and K-PRIL-RAMP) and the noise-sensitive algorithms (PRIL and Kernel PRIL) work equally well in terms of accuracy when no noise is added; moreover, the accuracy gap between the noise-tolerant and noise-sensitive algorithms widens as the level of noise grows. (4) On the Abalone and Parkinsons datasets, the linear algorithms perform better than the kernel methods when noise is added. The prediction accuracies of PRIL and Kernel PRIL are relatively low and unstable on datasets with 25% and 50% noise, which shows that these two algorithms are very sensitive to noise; interestingly, the noise-resilient PRIL-RAMP and K-PRIL-RAMP algorithms show better noise tolerance. On the real estate valuation dataset, the accuracies of the four algorithms are close when the dataset has 0% noise. Overall, the noise-insensitive PRIL-RAMP and K-PRIL-RAMP significantly outperform the basic PRIL and Kernel PRIL, which indicates that ramp loss-based algorithms can effectively handle noisy data.
The major difference between the linear algorithms (PRIL and PRIL-RAMP) and the nonlinear ones (Kernel PRIL and K-PRIL-RAMP) is that PRIL and PRIL-RAMP use a constant step size, whereas the nonlinear algorithms choose a more elaborate step size to find the new w and θ in every trial. In PRIL-RAMP and K-PRIL-RAMP, the step size is determined by solving the OGD optimization problem of the ramp loss, a noise-resilient function that zeroes the contribution of heavily violated (noisy) samples, whereas PRIL and Kernel PRIL update on all samples so as to drive the loss on the current example to 0. Thus, the proposed PRIL-RAMP and K-PRIL-RAMP perform better than or comparably to the other methodologies and are noise resistant.

Noise Statistical Study.
We investigate the proposed noise-resilient online ordinal regression algorithms in the case of noisy covariate data while taking the order into account. Specifically, we apply statistical measures to answer how effective the proposed PRIL-RAMP method is in handling data with noisy input. In this subsection, we report a number of simulation studies on finite-sample performance (t = (1/3)T, t = (2/3)T, t = T). Considering order-aware statistical measures, we compare the prediction MAE (mean absolute error) [22], RMSE (root mean square error), Spearman's correlation coefficient, the number of discarded samples, the discard rate, and the run time of PRIL, PRIL-RAMP, and PA-I [10] on data with different noise levels. Here MAE = (1/T) Σ_{t=1}^T |O(ŷ_t) − O(y_t)| and RMSE = sqrt((1/T) Σ_{t=1}^T (O(ŷ_t) − O(y_t))²), where O(ŷ_t) is the predicted rank and O(y_t) the true one; both range from 0 to K − 1 (the maximum deviation in the number of categories), and larger values mean worse prediction accuracy. In ordinal matching, a useful and common measure is Spearman's rank correlation, which measures the correspondence in rank terms between two distributions; it captures structural similarity, its values range from −1 to 1, and the higher the value, the closer the two rankings. The discarded samples are those that trigger no model update, namely, samples with z_i^t(w^t · x_t − θ_i^t) < s in an algorithm iteration; we count them by tracking the value of s, and the discard rate is the percentage of discarded samples over the total. For each dataset, we discuss four noise scenarios: 0%, 25%, 50%, and 75%. The results are summarized in Tables 2-5. As can be seen from the MAE and RMSE columns, the proposed PRIL-RAMP method outperforms the competing PRIL and PA-I in all four noise cases; especially at high noise levels (50% and 75%), PRIL-RAMP significantly outperforms the baseline PRIL.
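These order-aware measures can be sketched as follows (a simplified Spearman that, for brevity, does not average tied ranks; a full implementation would):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between predicted and true ranks (0 .. K-1)."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

def rmse(y_true, y_pred):
    """Root mean square error between predicted and true ranks."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sqrt(np.mean(d ** 2)))

def spearman(y_true, y_pred):
    """Spearman's rho as the Pearson correlation of the rank vectors
    (ties are broken by position here, a simplification)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a))
        return r
    a, b = ranks(np.asarray(y_true)), ranks(np.asarray(y_pred))
    a, b = a - a.mean(), b - b.mean()
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))
```

Unlike MZE, MAE and RMSE grow with the size of the rank deviation, which is why they are the natural measures for ordinal predictions.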
Spearman's correlation coefficient shows that PRIL-RAMP achieves the closest rank correspondence between the two distributions in three of the noise cases. Empirically, as we increase the ratio of noise, the differences between the methods on each evaluation index become more apparent. Due to the intrinsic flaw of the hinge loss, the PA-I method is highly sensitive to noise; the ramp loss can effectively reduce the impact of noisy data, with exceptions on individual datasets. Besides, the number of samples discarded by PRIL-RAMP increases with the noise level, in contrast to PRIL and PA-I, and most of the samples are discarded when the noise level is 75%. Not surprisingly, the percentage of noisy data significantly affects the discard rate and the results of the algorithms; empirically, the discard rate is approximately equal to the noise ratio. Moreover, estimation from small samples can destabilize the model, because prediction accuracy depends on what proportion of noise points is used to train it. This indicates that PRIL-RAMP is a more noise-resilient method than PA-I for dealing with noisy data, though the latter is faster.

Noise Sensitivity Comparison.
To compare the trends of MAE and discard rate, we plot the 2D performance variation under different noise settings in Figure 10. As can be observed, the proposed PRIL-RAMP significantly outperforms the two competing methods (PA-I and PRIL) in the case of high-level noise. Because of the perceptron loss, the original PRIL is very sensitive to noisy data; the MAE of PA-I is even higher than that of PRIL, because the hinge loss is still more noise-sensitive. Moreover, MAE gradually increases and then stabilizes with an increasing noise ratio: when the noise ratio exceeds 50%, most of the noisy samples are discarded, so the MAE stabilizes even as the amount of noisy data grows. Thanks to discarding samples under high-level noise, PRIL-RAMP clearly reduces the impact of noisy data.

Conclusion and Future Works
In this paper, we studied online learning from noisy data streams. First, we proposed the ramp loss-based ordinal regression model PRIL-RAMP and its nonlinear variant K-PRIL-RAMP to incrementally predict interval labels in noisy environments in an online manner. Moreover, we carried out a theoretical analysis of threshold order preservation, which shows that the proposed ramp loss-based updates remain valid in the interval-label scenario. Finally, experimental studies showed that PRIL-RAMP and K-PRIL-RAMP are robust and can be used to deal with noisy data streams. Future work will involve extending the approach to the PA model for ordinal regression by introducing a suitable loss function [12], building on convex optimization theory [23].

Data Availability
The datasets used to support the findings of this study are available from the UCI repository (http://archive.ics.uci.edu/ml/).

Conflicts of Interest
The authors declare that they have no conflicts of interest.