Partial least squares regression (PLS regression) is used as an alternative to ordinary least squares (OLS) regression in the presence of multicollinearity, which is common in chemical engineering problems. In addition to the linear form of PLS, there are versions based on a nonlinear approach, such as the quadratic PLS (QPLS2). QPLS2 differs from the regular PLS algorithm in that it uses quadratic regression instead of OLS regression in the calculation of the latent variables. In this paper we propose a robust version of QPLS2 that overcomes its sensitivity to outliers by using the Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm. Our hybrid method is tested on both real and simulated data.

After it was developed by Wold [

PLS regression is sensitive to outliers and leverage points. Several robust versions have therefore been proposed in the literature, but only for linear PLS. Hubert [

In this work we attempt to obtain a robust version of the quadratic PLS algorithm QPLS2, by using the BACON algorithm. An application on real and simulated data is used to validate the method.

Every linear regression method is based on the following optimization problem:

Instead of regular predictors, PLS regression uses a set of latent variables called scores:

The quadratic nonlinear PLS is a PLS algorithm that assumes nonlinear relations between the two blocks of variables. Instead of the OLS regression used in the linear PLS algorithm, it uses a quadratic regression in the calculation of the latent variables.
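As a rough illustration of what a quadratic relation between two score vectors looks like in practice, the sketch below fits u ≈ c0 + c1·t + c2·t² by least squares. The function name `fit_quadratic_inner`, the symbols `t` and `u`, and the synthetic data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def fit_quadratic_inner(t, u):
    """Least-squares fit of u ~ c0 + c1*t + c2*t**2.

    Returns the coefficient vector (c0, c1, c2) and the fitted values.
    """
    T = np.column_stack([np.ones_like(t), t, t**2])  # quadratic design matrix
    coeffs, *_ = np.linalg.lstsq(T, u, rcond=None)
    return coeffs, T @ coeffs

# Synthetic scores (hypothetical): a known quadratic relation plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
u = 1.0 + 0.5 * t - 2.0 * t**2 + 0.01 * rng.normal(size=200)

coeffs, u_hat = fit_quadratic_inner(t, u)
print(np.round(coeffs, 2))
```

A linear PLS would fit only the `c1` term at this step, which is the sense in which the quadratic version generalizes it.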

Robust regression is a way of dealing with outliers, which are observations that come from a different distribution than the bulk of the data. They can also result from measurement errors, and they can harm the quality of the estimation. Just like OLS regression, PLS regression is sensitive to outliers [

Many researchers have proposed methods for dealing with the outlier problem in PLS regression. Hubert [

The BACON algorithm [

An initial subset is chosen first. A distance is then defined and used as the criterion for including an observation in this initial subset. Two such distances are used in the literature.
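For orientation, the two distances most commonly associated with BACON are the Mahalanobis distance and the Euclidean distance from the coordinatewise median. The notation below is an assumption and may differ from the authors' own equations: $x_i$ is the $i$-th observation, $\bar{x}$ and $S$ are the mean and covariance matrix of the data, and $m$ is the vector of coordinatewise medians.

$$
d_i(\bar{x}, S) = \sqrt{(x_i - \bar{x})^{\top} S^{-1} (x_i - \bar{x})}, \qquad d_i(m) = \lVert x_i - m \rVert
$$

The first is affine equivariant but less resistant to outliers. The second is more robust but depends on the scaling of the variables.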

(1) Select an initial set

(2) Compute the distances (

(3) Set the new subset with all the points that have

where

(4) Repeat (2) and (3) until the subset does not change.
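The iteration above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the initial subset is taken as the `c * p` points closest to the coordinatewise median, the cutoff uses the correction factor as we recall it from the original BACON description, and the chi quantile is obtained from the Wilson-Hilferty approximation to avoid extra dependencies. The names `bacon`, `mahalanobis`, and `chi_quantile` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi_quantile(p, prob):
    """Approximate quantile of the chi distribution with p degrees of freedom
    (square root of the Wilson-Hilferty chi-square approximation)."""
    z = NormalDist().inv_cdf(prob)
    a = 2.0 / (9.0 * p)
    return np.sqrt(p * (1.0 - a + z * np.sqrt(a)) ** 3)

def mahalanobis(X, subset):
    """Mahalanobis distance of every row of X, using the mean and covariance
    estimated from the rows selected by the boolean mask `subset`."""
    mu = X[subset].mean(axis=0)
    cov = np.atleast_2d(np.cov(X[subset], rowvar=False))
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
    return np.sqrt(np.maximum(d2, 0.0))

def bacon(X, c=4, alpha=0.05, max_iter=100):
    """Return a boolean mask of observations nominated as non-outliers.
    Assumes n > 3p + 1 so that the correction factor is well defined."""
    n, p = X.shape
    # Step 1: initial subset = c*p points closest to the coordinatewise median.
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    subset = np.zeros(n, dtype=bool)
    subset[np.argsort(d0)[: min(n, c * p)]] = True

    h = (n + p + 1) // 2
    c_np = 1 + (p + 1) / (n - p) + 2 / (n - 1 - 3 * p)  # assumed correction
    chi = chi_quantile(p, 1 - alpha / n)

    for _ in range(max_iter):
        d = mahalanobis(X, subset)                 # Step 2: distances
        r = subset.sum()
        c_hr = max(0.0, (h - r) / (h + r))
        new_subset = d < (c_np + c_hr) * chi       # Step 3: new subset
        if np.array_equal(new_subset, subset):     # Step 4: stop when stable
            break
        subset = new_subset
    return subset

# Usage on synthetic data: 190 clean points and 10 shifted outliers.
rng = np.random.default_rng(0)
clean = rng.normal(size=(190, 3))
outliers = rng.normal(loc=10.0, size=(10, 3))
X = np.vstack([clean, outliers])
mask = bacon(X)
print(mask[:190].sum(), mask[190:].sum())
```

Observations where `mask` is `False` are the nominated outliers. How they are handled afterwards (removed or downweighted) is a separate design choice.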

We merge the BACON algorithm with the quadratic PLS, with the goal of obtaining a robust version of the algorithm:

Run the BACON algorithm on the dataset using distance (

For every PLS dimension, repeat until convergence of

Calculate the weights:

Calculate the scores:

Fit

Calculate

Update

Update

Calculate the new value of t:

Calculate the loadings using the final value of t:

Deflate

If an additional dimension is required, replace
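To make the flow of the steps above concrete, here is a simplified sketch of one latent dimension. It is not the authors' algorithm: the update formulas follow common NIPALS-style conventions and are assumptions, the weight-correction step used in some quadratic PLS variants is omitted, and the inlier mask is assumed to come from a prior BACON run. The names `quad_fit` and `robust_qpls_component` are hypothetical.

```python
import numpy as np

def quad_fit(t, u):
    """Least-squares fit of u on (1, t, t^2); returns the fitted values."""
    T = np.column_stack([np.ones_like(t), t, t**2])
    coef, *_ = np.linalg.lstsq(T, u, rcond=None)
    return T @ coef

def robust_qpls_component(X, Y, inliers, tol=1e-10, max_iter=500):
    """Extract one latent dimension using only the rows flagged as inliers.

    `inliers` is a boolean mask (assumed to be produced by BACON).
    Returns weights w, scores t, loadings p, Y-loadings q, deflated blocks.
    """
    E, F = X[inliers], Y[inliers]
    u = F[:, 0].copy()                      # start from the first Y column
    t_old = np.zeros(E.shape[0])
    for _ in range(max_iter):
        w = E.T @ u / (u @ u)               # weights
        w /= np.linalg.norm(w)
        t = E @ w                           # scores
        u_hat = quad_fit(t, u)              # quadratic inner relation
        q = F.T @ u_hat / (u_hat @ u_hat)   # Y-loadings
        u = F @ q / (q @ q)                 # updated Y-scores
        if np.linalg.norm(t - t_old) <= tol * np.linalg.norm(t):
            break                           # scores t have converged
        t_old = t
    p = E.T @ t / (t @ t)                   # X-loadings from the final t
    E_new = E - np.outer(t, p)              # deflate X
    F_new = F - np.outer(u_hat, q)          # deflate Y
    return w, t, p, q, E_new, F_new

# Synthetic example (hypothetical data): quadratic response, five gross outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
t_true = X @ np.array([0.6, 0.8, 0.0])
Y = (t_true + 0.3 * t_true**2).reshape(-1, 1)
X[:5] += 15.0                               # contaminate five rows of X
inliers = np.ones(300, dtype=bool)
inliers[:5] = False                         # mask a BACON run is expected to give

Xc = X - X[inliers].mean(axis=0)            # center using clean rows only
Yc = Y - Y[inliers].mean(axis=0)
w, t, p, q, E_new, F_new = robust_qpls_component(Xc, Yc, inliers)
explained_y = 1 - np.sum(F_new**2) / np.sum(Yc[inliers] ** 2)
print(round(float(explained_y), 3))
```

For a further dimension, the same function would be called again on `E_new` and `F_new` with an all-true mask, since the outliers have already been dropped.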

The goal of this application is to compare the performance of the robust quadratic PLS with the original quadratic PLS. The comparison is conducted on both simulated and real data.

We use the dataset presented in [

Since we cannot calculate the mean squared error, we will compare the percentage of explained variance in both the robust and original quadratic PLS:

In Table

Comparison between explained variance of proposed robust quadratic PLS and original quadratic PLS in cosmetic dataset.

| | | | | | | | | Cumulated variance |
|---|---|---|---|---|---|---|---|---|
| 0.286 | 0.196 | 0.129 | 0.139 | 0.05 | 0.11 | 0.08 | 0.003 | 0.99 |
| 0.277 | 0.239 | 0.155 | 0.177 | 0.051 | 0.093 | 0.004 | 0 | 0.99 |
| 0.180 | 0.077 | 0.137 | 0.134 | 0.042 | 0.04 | 0.065 | 0.03 | 0.68 |
| 0.33 | 0.181 | 0.103 | 0.117 | 0.06 | 0.037 | 0.032 | 0.05 | 0.91 |

In this section, a contamination study is used to assess the quality of the proposed robust method, by following these steps:

The nonlinear function presented in [

The dataset is randomly contaminated by adding a small percentage of data (5%, 10%, and 15%) from a multivariate normal distribution.

We first apply the quadratic PLS to the generated data, and then we apply the robust quadratic PLS described previously.

We compare the original quadratic PLS with the proposed robust PLS using the explained variance, as well as the predictive mean squared error and the predicted residual error sum of squares (PRESS).

The dataset is simulated 1000 times. The reported explained variance, predictive mean squared error, and PRESS are averages of the values calculated over the 1000 simulated datasets.
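The contamination and averaging steps can be sketched as below. Several choices here are assumptions made only for illustration: contaminated rows replace existing rows so the sample size stays fixed, the contaminating normal distribution has a shifted mean and identity covariance, the generating function is a simple parabola rather than the one used in the paper, and an ordinary quadratic polynomial fit stands in for the PLS models. Only 20 replications are run to keep the example fast.

```python
import numpy as np

def contaminate(X, y, rate, rng, shift=10.0):
    """Replace a fraction `rate` of the rows by draws from a multivariate
    normal distribution (mean `shift`, identity covariance; both assumed)."""
    n, p = X.shape
    k = int(round(rate * n))
    idx = rng.choice(n, size=k, replace=False)
    Z = rng.multivariate_normal(np.full(p + 1, shift), np.eye(p + 1), size=k)
    Xc, yc = X.copy(), y.copy()
    Xc[idx], yc[idx] = Z[:, :p], Z[:, p]
    return Xc, yc, idx

def explained_variance(y, y_hat):
    """Share of the variance of y reproduced by the fitted values y_hat."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def simulate_once(rate, rng, n=500):
    """One replication: generate, contaminate, fit, and score."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = x[:, 0] ** 2 + 0.05 * rng.normal(size=n)     # hypothetical function
    Xc, yc, _ = contaminate(x, y, rate, rng)
    coef = np.polyfit(Xc[:, 0], yc, deg=2)           # stand-in for the model
    return explained_variance(yc, np.polyval(coef, Xc[:, 0]))

rng = np.random.default_rng(1)
mean_ev = {
    rate: float(np.mean([simulate_once(rate, rng) for _ in range(20)]))
    for rate in (0.05, 0.10, 0.15)
}
print(mean_ev)
```

Replacing `simulate_once` with a function that fits both the original and the robust quadratic PLS would give averaged criteria of the kind reported in the comparison.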

In case of a 5% contamination rate (Table

Comparison between explained variance of proposed quadratic algorithm and original one in simulated dataset for the three contamination rates (5%, 10%, and 15%).

| Contamination rate | Explained variance of X (original quadratic PLS) | Explained variance of Y (original quadratic PLS) | Explained variance of X (robust quadratic PLS) | Explained variance of Y (robust quadratic PLS) |
|---|---|---|---|---|
| 5% | 0.99 | 0.73 | 1 | 0.99 |
| 10% | 1 | 0.68 | 1 | 0.99 |
| 15% | 1 | 0.67 | 1 | 0.99 |

The dataset of 500 observations was then split into two parts. The first contained 400 observations and was used to estimate two models: one with the original quadratic PLS and one with the robust quadratic PLS. We then calculated the predictive residual mean squared error (RMSEP) of the dependent variable on the 100 left-out observations.
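A hold-out evaluation of this kind can be sketched as follows. The data-generating function and the quadratic polynomial used as a stand-in model are assumptions for illustration only, and taking RMSEP as the square root of the mean squared prediction error is also an assumption about the definition used.

```python
import numpy as np

def mspe(y_true, y_pred):
    """Mean squared prediction error on held-out observations."""
    return float(np.mean((y_true - y_pred) ** 2))

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction (square root of the MSPE)."""
    return float(np.sqrt(mspe(y_true, y_pred)))

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(-2, 2, size=n)
y = 1.0 + x - 0.5 * x**2 + 0.1 * rng.normal(size=n)   # hypothetical data

# Random split: 400 observations for estimation, 100 held out for prediction.
perm = rng.permutation(n)
train, test = perm[:400], perm[400:]

coef = np.polyfit(x[train], y[train], deg=2)          # stand-in for the model
y_pred = np.polyval(coef, x[test])

print(mspe(y[test], y_pred), rmsep(y[test], y_pred))
```

Running the same split for both models and comparing the two error values mirrors the comparison described above.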

The results of a comparison (Table

Comparison between optimal mean squared prediction error and predictive error sum of squares of proposed quadratic algorithm and original one for simulated dataset with three contamination rates (5%, 10%, and 15%).

| Contamination rate | MSPE (quadratic PLS) | MSPE (robust quadratic PLS) | PRESS (quadratic PLS) | PRESS (robust quadratic PLS) |
|---|---|---|---|---|
| 5% | 103.07 | 12.83 | 100.9 | 89 |
| 10% | 110.42 | 60.9 | 108.75 | 17.15 |
| 15% | 119.08 | 4.32 | 117.37 | 17.15 |

Figures

Comparison of predicted and actual values of test dataset in case of quadratic and robust quadratic PLS regression on 5% contaminated data.

Comparison of predicted and actual values of test dataset in case of quadratic and robust quadratic PLS regression on 10% contaminated data.

Comparison of predicted and actual values of test dataset in case of quadratic and robust quadratic PLS regression on 15% contaminated data.

PLS regression has developed considerably since it was first introduced. The nonlinear nature of data encountered in chemical engineering motivated the development of nonlinear PLS methods. In this paper we proposed a robust version of the quadratic nonlinear PLS, a hybrid of the quadratic PLS algorithm and the BACON algorithm, in order to overcome the problems caused by outliers. Our method outperformed the quadratic PLS on both real and simulated data.

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.