Accurate and fast determination of blood component concentrations is essential for efficient patient diagnosis. This paper proposes a nonlinear regression method with high-dimensional space mapping for blood component spectral quantitative analysis. Kernels are introduced to map the input data into a high-dimensional space for nonlinear regression. The Gaussian kernel is the most widely adopted, but other kernels deserve study because each kernel describes its own high-dimensional feature space mapping, which affects regression performance. In this paper, eight kernels are used to examine the influence of different space mappings on blood component spectral quantitative analysis. Each kernel and its corresponding parameters are assessed to build the optimal regression model. The proposed method is evaluated on real blood spectral data obtained from uric acid determination. The results verify that the proposed nonlinear models predict more accurately than linear models. Support vector regression (SVR) provides better performance than partial least squares (PLS) when combined with kernels. Local kernels are recommended given the characteristics of the blood spectral data. SVR with the inverse multiquadric kernel has the best predictive performance and can be used for blood component spectral quantitative analysis.
The concentration of a component in human blood may be an indicator of certain diseases, so fast and accurate determination is essential to early diagnosis. For instance, the serum uric acid (UA) level can be used as an indicator for detecting diseases related to purine metabolism [
In spectroscopic quantitative analysis, when radiation hits a sample, part of the incident radiation is absorbed, and the shape of the absorption spectrum depends on the chemical composition and physical parameters of the sample. A spectrometer is used to collect a continuous absorption spectrum, and the concentration of the component can then be predicted by a regression algorithm [
Nonlinear regression with high-dimensional space mapping for blood component spectral quantitative analysis is discussed in this paper. Kernels are incorporated with PLS and SVR to realize nonlinear regression over the original input space. The kernel extension of PLS and SVR is completed by replacing the dot product calculation of elements with the kernel. Eight kernels are used in this paper to examine the influence of different space mappings on blood component spectral quantitative analysis. Each kernel and its corresponding parameters are assessed to build the optimal nonlinear regression model. A dataset obtained from spectral measurement of uric acid concentration is used to evaluate the effectiveness of the proposed method. The experimental results are analyzed, and the mean squared error of prediction (MSEP) is used to compare the predictive capability of the various models.
This article is organized as follows. The methods are introduced in Section
PLS is advantageous over ordinary multiple linear regression because it accounts for collinearity among the predictor variables. It assumes uncorrelated latent variables that are linear combinations of the original input data. PLS relies on a decomposition of the input variable matrix based on a covariance criterion: it finds factors (latent variables) that describe the input variables and are correlated with the output variables. For PLS, the concentration of the blood component (
For SVR, a linear regression can be performed between the matrix of wavelength signals
The regression ability of a linear model can be enhanced by mapping the input data into a high-dimensional space. With the kernel method, the algorithm makes predictions in the high-dimensional feature space without an explicit mapping from the original space. A kernel is a function of two elements in the original space whose value equals the dot product of their images in the feature space. A kernel extension of a linear algorithm is obtained by replacing every dot product between elements with the kernel. The kernel extensions of PLS and SVR are introduced below.
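As a minimal sketch of this idea (hypothetical data, Gaussian kernel as an example), a linear algorithm that only ever touches the Gram matrix of dot products can be kernelized by swapping that matrix for a kernel matrix:

```python
import numpy as np

def linear_gram(X):
    # A linear algorithm works with the matrix of dot products <x_i, x_j>.
    return X @ X.T

def gaussian_gram(X, sigma=1.0):
    # Kernel extension: replace each dot product with
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2)),
    # i.e. an implicit dot product in a high-dimensional feature space.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.random((5, 3))    # 5 hypothetical spectra with 3 signals each
K = gaussian_gram(X)      # stands in for linear_gram(X) in the algorithm
```

The rest of the algorithm is unchanged; only the pairwise similarity computation is replaced.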
Kernel PLS is a nonlinear extension of PLS. A nonlinear mapping
For kernel-extended SVR, the concentration of the component is calculated by the regression function:
The kernel determines the characteristics of the high-dimensional space mapping and thus affects the regression performance. To build the optimal nonlinear regression model, different kernels should be evaluated in combination with PLS and SVR. The kernels are listed as follows [
Linear kernel:
The linear kernel has no parameter; with it, KPLS reduces to ordinary PLS and kernel SVR reduces to LinearSVR. Gaussian kernel:
The kernel parameter is the width. Polynomial kernel:
The kernel parameter is the polynomial degree. Inverse multiquadric kernel:
The kernel parameter is the shift constant. Semi-local kernel:
The kernel parameter is the width. Exponential kernel:
The kernel parameter is the width. Rational kernel:
The kernel parameter is the shift constant. Kmod kernel:
The kernel parameters are the width and the shift constant.
The prediction performance of the high-dimensional mappings induced by these kernels and the related parameter optimization are discussed in the next section.
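For reference, common textbook forms of several of these kernels can be sketched in Python. The exact parameterizations used in the paper may differ, and the semi-local and Kmod kernels are omitted here because their definitions are less standardized:

```python
import numpy as np

def _sqdist(x, y):
    # Squared Euclidean distance between two sample vectors.
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def gaussian(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma is the width.
    return np.exp(-_sqdist(x, y) / (2.0 * sigma**2))

def polynomial(x, y, d=2):
    # k(x, y) = (<x, y> + 1)^d; d is the degree.
    return (np.dot(x, y) + 1.0) ** d

def inverse_multiquadric(x, y, c=1.0):
    # k(x, y) = 1 / sqrt(||x - y||^2 + c^2); c is the shift constant.
    return 1.0 / np.sqrt(_sqdist(x, y) + c**2)

def exponential(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y|| / (2 * sigma^2)); sigma is the width.
    return np.exp(-np.sqrt(_sqdist(x, y)) / (2.0 * sigma**2))

def rational(x, y, c=1.0):
    # k(x, y) = 1 - ||x - y||^2 / (||x - y||^2 + c); c is the shift constant.
    d2 = _sqdist(x, y)
    return 1.0 - d2 / (d2 + c)
```

Note that the local kernels here (Gaussian, inverse multiquadric, exponential, rational) evaluate to 1 when x equals y with the default parameters, while the polynomial kernel does not.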
To evaluate the effectiveness of nonlinear regression with high-dimensional space mapping for blood component spectral quantitative analysis, the UA dataset is used in the experiments.
A total of 200 samples were obtained from the uric acid concentration spectral determination experiment. Each spectrum contains 601 signals from 400 nm to 700 nm at a 0.5 nm interval. The UA concentrations range from 105 to 1100
Spectra of the UA dataset.
To assess the prediction effect of high-dimensional space mapping nonlinear regression for blood component spectral quantitative analysis, the linear, Gaussian, polynomial, inverse multiquadric, semi-local, exponential, rational, and Kmod kernels are combined with PLS (abbreviated as PLS, GKPLS, PKPLS, IMKPLS, SLKPLS, EKPLS, RKPLS, and KKPLS) and SVR (abbreviated as LinearSVR, GSVR, PSVR, IMSVR, SLSVR, ESVR, RSVR, and KSVR) to build prediction models for the uric acid dataset, and the effectiveness of these models is evaluated.
For the experiments, the dataset is split into a calibration set and a validation set using an interleaved (shutter) grouping strategy: every fifth sample is placed in the validation set, and the remaining samples form the calibration set. Of the 200 samples, 40 are used for validation and the remaining 160 for calibration. The calibration set is used to build the prediction model, and the validation set is used to evaluate its effectiveness. Both the spectral signals and the reference UA concentrations of the two sets are normalized according to the values of the calibration set.
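Assuming the split described above, the shutter grouping and calibration-based normalization can be sketched as follows (random arrays stand in for the real spectra):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 601))   # 200 spectra, 601 wavelength signals each
y = rng.random(200)          # reference UA concentrations (stand-in)

# Shutter grouping: every fifth sample goes to the validation set.
val_idx = np.arange(4, 200, 5)
cal_idx = np.setdiff1d(np.arange(200), val_idx)

X_cal, y_cal = X[cal_idx], y[cal_idx]   # 160 calibration samples
X_val, y_val = X[val_idx], y[val_idx]   # 40 validation samples

# Normalize both sets using statistics of the calibration set only,
# so no information from the validation set leaks into the model.
mu, sd = X_cal.mean(axis=0), X_cal.std(axis=0)
X_cal_n = (X_cal - mu) / sd
X_val_n = (X_val - mu) / sd
```

Normalizing with calibration-set statistics keeps the validation set strictly unseen during model building.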
To compare the prediction performance of the different kernels, the kernel parameter and related model parameters must be optimized. The kernel parameter
Grid search based on cross-validation is used for parameter optimization. For each kernel, different parameter combinations are tested on the calibration set with 10-fold cross-validation: the data are divided into 10 groups, 9 groups are used for training, and the remaining group is used for testing; the test group is then rotated until every group has been tested, and the 10 results are averaged to give the final cross-validation error. For each kernel, the parameter combination with the minimum mean squared error of cross-validation (MSECV) is adopted to build the regression model.
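A grid search of this kind can be sketched with scikit-learn (illustrative, deliberately small grid and random stand-in data; the paper's actual parameter ranges are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_cal = rng.random((160, 60))   # stand-in calibration spectra
y_cal = rng.random(160)

# Illustrative grid; in practice each kernel gets its own grid over the
# kernel parameter, the penalty constant, and the insensitive loss.
param_grid = {"C": [1.0, 32.0], "gamma": [0.01], "epsilon": [0.1]}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X_cal, y_cal)
best_mse_cv = -search.best_score_   # minimum MSECV over the grid
```

The parameter combination in `search.best_params_` is then used to fit the final model on the whole calibration set.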
The MSEP for the validation set, the squared correlation coefficient for the validation set (
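Assuming the usual definitions of these metrics, they can be computed as:

```python
import numpy as np

def msep(y_true, y_pred):
    # Mean squared error of prediction on the validation set.
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def squared_corr(y_true, y_pred):
    # Squared Pearson correlation between reference and predicted values.
    return float(np.corrcoef(y_true, y_pred)[0, 1] ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical reference values
y_pred = np.array([1.1, 1.9, 3.2, 3.8])   # hypothetical predictions
```

A lower MSEP and a squared correlation closer to 1 indicate better predictive capability.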
In the next section, the influence of the parameters on the MSECV for each kernel introduced above is discussed, and the prediction capability of each kernel is evaluated on the validation data.
For each kernel, the curves of parameter optimization for KPLS and SVR are shown in Figures
The influence of
The influence of penalty constant and kernel parameter on the MSECV of different kernels for SVR on UA dataset. (a) The influence of penalty constant and nonsensitive loss on the MSECV of LinearSVR. (b–f) The penalty constant and kernel parameter curves of GSVR, PSVR, IMSVR, SLSVR, ESVR, RSVR, and KSVR.
Analytical results for UA dataset.
Model | MSEP | R² (validation) | MSECV | R² (CV) | Kernel parameter | Parameter 1a | Parameter 2b
---|---|---|---|---|---|---|---
IMSVR | 1523.42 | 0.9831 | 0.0535 | 0.8433 | 64 | 256 | 0.003906 |
RSVR | 1528.66 | 0.9830 | 0.0526 | 0.8456 | 64 | 16 | 0.003906 |
KSVR | 1530.09 | 0.9829 | 0.0526 | 0.8455 | 64 | 1024 | 0.003906 |
SLKPLS | 1880.18 | 0.9786 | 0.0410 | 0.8804 | 0.0055243 | 28 | / |
GSVR | 2021.93 | 0.9765 | 0.0448 | 0.8691 | 0.005524 | 512 | 0.003906 |
GKPLS | 2347.11 | 0.9740 | 0.0430 | 0.8740 | 0.0039063 | 25 | / |
SLSVR | 2359.86 | 0.9731 | 0.0427 | 0.8749 | 0.003906 | 11.3137 | 0.003906 |
IMKPLS | 2365.22 | 0.9721 | 0.0495 | 0.8560 | 32 | 25 | / |
RKPLS | 2519.68 | 0.9692 | 0.0481 | 0.8691 | 32 | 23 | / |
KKPLS | 2613.13 | 0.9693 | 0.0481 | 0.8589 | 32 | 23 | / |
EKPLS | 2860.96 | 0.9692 | 0.0672 | 0.8023 | 0.0625 | 2 | / |
ESVR | 2971.75 | 0.9660 | 0.0597 | 0.8254 | 0.003906 | 90.5097 | 0.003906 |
PSVR | 5518.49 | 0.9410 | 0.0365 | 0.8935 | 1 | 32 | 0.031250 |
LinearSVR | 5519.22 | 0.9410 | 0.0365 | 0.8935 | / | 32 | 0.031250 |
PKPLS | 8554.57 | 0.9062 | 0.0393 | 0.8852 | 1 | 10 | / |
PLS | 8554.57 | 0.9062 | 0.0393 | 0.8852 | / | 10 | / |
MSEP: mean squared error of prediction; MSECV: mean squared error of cross-validation.
The influence of
The influence of the penalty parameter
For KPLS, SLKPLS achieves the most accurate prediction with the lowest MSEP and the highest
The linear kernel (PLS) produces the worst prediction performance, with an MSEP of 8554.57. For SVR, IMSVR has the best predictive capability with an MSEP of 1523.42, followed by RSVR (1528.66), KSVR (1530.09), GSVR (2021.93), SLSVR (2359.86), ESVR (2971.75), PSVR (5518.49), and LinearSVR (5519.22). It is obvious that traditional linear regression algorithms cannot perform well on blood component spectral quantitative analysis: PLS has the highest MSEP, followed by LinearSVR. IMSVR exhibits the best performance on the validation set; its MSEP is 0.34%, 0.44%, 18.97%, 24.65%, 35.09%, 35.44%, 35.59%, 39.54%, 41.70%, 46.75%, 48.74%, 72.39%, 72.40%, 82.19%, and 82.19% lower than the values obtained by RSVR, KSVR, SLKPLS, GSVR, GKPLS, SLSVR, IMKPLS, RKPLS, KKPLS, EKPLS, ESVR, PSVR, LinearSVR, PKPLS, and PLS, respectively. Taking advantage of the structural risk minimization (SRM) principle, SVR has better prediction performance in general.
For both PLS and SVR, the optimized kernel parameter
The other kernels used in this paper are local kernels, for which only data points close to the test point influence the kernel value. The good extrapolation abilities of the local kernels indicate that only some specific spectral data are essential to blood component concentration prediction, and the contribution of these critical data is enhanced during high-dimensional mapping by local kernels. Based on the above studies, IMSVR is recommended for nonlinear regression in blood component spectral quantitative analysis. The optimal kernel parameter
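As a concrete sketch, an IMSVR-style model can be built with scikit-learn's precomputed-kernel interface. Random arrays stand in for the normalized UA spectra, and the values C=256, epsilon=0.003906, and kernel shift c=64 are taken from the table above on the assumption that Parameter 1 and Parameter 2 are the penalty constant and the insensitive loss:

```python
import numpy as np
from sklearn.svm import SVR

def imq_kernel(A, B, c=64.0):
    # Inverse multiquadric kernel matrix: k(x, z) = 1 / sqrt(||x - z||^2 + c^2).
    d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return 1.0 / np.sqrt(np.maximum(d2, 0.0) + c**2)

rng = np.random.default_rng(0)
X_cal = rng.random((160, 601)); y_cal = rng.random(160)   # stand-in data
X_val = rng.random((40, 601))

model = SVR(kernel="precomputed", C=256.0, epsilon=0.003906)
model.fit(imq_kernel(X_cal, X_cal), y_cal)
y_pred = model.predict(imq_kernel(X_val, X_cal))
```

Because SVR only needs kernel values, any of the kernels discussed above can be swapped in through the same precomputed-kernel interface.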
In this paper, high-dimensional space mapping methods that combine kernels with PLS and SVR are proposed for blood component spectral quantitative analysis. For each model, the general trend of the MSECV with respect to the model parameters is discussed. The following conclusions can be drawn. First, the blood component spectral quantitative results show that the nonlinear regression models predict more accurately than the linear models. Second, SVR provides better performance than PLS when combined with kernels. Third, local kernels are recommended for high-dimensional mapping according to the characteristics of the blood spectral data. Finally, the experimental results verify that IMSVR (a local kernel combined with SVR) has the highest predictive ability and can be used effectively for blood component spectral quantitative analysis.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (61375055), the Program for New Century Excellent Talents in University (NCET-12-0447), the Provincial Natural Science Foundation of Shaanxi (2014JQ8365), the State Key Laboratory of Electrical Insulation and Power Equipment (EIPE16313), and the Fundamental Research Funds for the Central Universities.