Quantitative Determination of Sucrose Adulterated in Red Ginseng by Terahertz Time-Domain Spectroscopy (THz-TDS) with Monte Carlo Uninformative Variable Elimination (MCUVE) and Support Vector Regression (SVR)

This paper introduces a method based on terahertz spectroscopy for determining the content of sucrose, an adulterant of red ginseng. Red ginseng samples at 6 adulteration levels were measured using terahertz time-domain spectroscopy (THz-TDS). Information was extracted from the terahertz spectral curves by principal component analysis (PCA) and by Monte Carlo uninformative variable elimination (MCUVE), and quantitative analysis was then performed with partial least squares regression (PLSR) and support vector regression (SVR). Because it accounts for the nonlinear factors in the terahertz spectral curves of red ginseng samples, MCUVE-SVR achieves a high correlation coefficient (> 0.99) and ratio of prediction to deviation (> 7.4), together with a low root mean square deviation (< 1.2%) and Bias (< 0.05%). The results indicate that MCUVE-SVR can be regarded as an ideal quantitative analysis method for detecting sucrose adulteration in red ginseng by terahertz spectroscopy.


Introduction
In Asia, red ginseng is a well-known and popular herb because of its positive effects on cognitive ability [1], its anti-ageing [2], anti-oxidation [3], anti-inflammatory [4], and anti-obesity [5] properties, and its ability to improve immunity [6]. However, high-quality red ginseng is not easy to process. Qualified fresh ginseng first needs to be picked out and washed; it is then steamed for a specified time and finally dried into red ginseng [7]. To increase their profits, some unscrupulous traders add cheap sucrose during the processing of red ginseng, which increases the weight and improves the color [8]. Therefore, an effective method to detect whether red ginseng is adulterated is necessary to ensure its quality.
So far, the most commonly used methods for the detection of sucrose are chemical analysis [9] and high-performance liquid chromatography (HPLC) [10][11][12]. Other methods are also suitable for quantitative analysis, such as liquid chromatography-mass spectrometry (LC-MS) [13], nuclear magnetic resonance (NMR) [14], and capillary electrophoresis (CE) [15]. All these methods give accurate and objective results but often require additional chemical reagents and complex operations. The result is not only higher cost and longer analysis time but also that the tested red ginseng can no longer be used. In East Asian countries, precious red ginseng is in great demand, so a rapid and nondestructive detection technology is valuable.
According to many research reports, terahertz time-domain spectroscopy (THz-TDS) systems have been applied to research on drugs [16], food safety [17], materials [18], and biomedicine [19,20], and the results show that terahertz time-domain spectroscopy is a fast, nondestructive, and nonpolluting detection technology.
For the analysis of component content of terahertz spectroscopy, support vector regression (SVR) and partial least-squares regression (PLSR) are the two most commonly used algorithms, and they are used in applications such as evaluation of peroxide value in peanut oils [21], detection of maltose in wheat grains [22], and detection of octogen content [23].
Monte Carlo uninformative variable elimination (MCUVE) was originally developed as an effective information extraction algorithm for use with PLSR [24,25]. To find a potentially ideal method for detecting the adulteration of red ginseng, we build on SVR by additionally considering nonlinearity and the removal of invalid data, and we combine SVR with MCUVE.

Sample Preparation.
Red ginseng in dried form was purchased from Tongrentang Co., Ltd. (Beijing, China), and the sucrose was purchased from Aladdin Biochemical Technology Co., Ltd. (Shanghai, China). The purity of the sucrose is above 99.9%.
Since herbs are not a single component and the spatial distribution of their components is uneven, the sample needs to be ground into powder so that it can be mixed evenly. Because no chemical reaction takes place in the testing process, the tested samples do not lose their medicinal value. Therefore, the detection method based on terahertz spectroscopy is considered nondestructive. The red ginseng was crushed into powder with a grinder (DFX-X200, Wenzhou Dingli Medical Instrument Co., China), while the sucrose was ground into powder with a pestle and mortar. All powders were passed through a 200-mesh sieve and then placed in a drying box, where they were dried at 50°C for two hours to remove water. Each sample was made independently. Sucrose powder was added to the red ginseng powder and mixed evenly at concentrations of 5%, 10%, 15%, 20%, 25%, and 30%. When the sucrose concentration exceeds 30%, the change in the appearance of red ginseng is obvious and can easily be identified by observation. The powders were then pressed into circular tablets with a diameter of 13 mm and a thickness of 1.2 mm under a pressure of 12 MPa using a hydraulic press (PC-15, Tianjin Jingtuo Instrument Technology Corp., China). The weight of each sample is about 220 mg. For each concentration, 36 samples were made. For each sucrose content, 24 samples were randomly selected by sampling without replacement and put into the training set, and the remaining 12 samples were used as the testing set. This operation was repeated for every sucrose content, so that the training set has 144 samples and the testing set has 72 samples.
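The per-concentration split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the integer sample identifiers, and the random seed are assumptions introduced here.

```python
import random

def split_by_concentration(sample_ids_by_level, n_train=24, seed=0):
    """For each concentration level, draw n_train samples without
    replacement for training; the remaining samples form the testing set."""
    rng = random.Random(seed)  # the seed is an arbitrary, hypothetical choice
    train, test = [], []
    for level, ids in sorted(sample_ids_by_level.items()):
        chosen = set(rng.sample(ids, n_train))  # sampling without replacement
        train += [(level, i) for i in ids if i in chosen]
        test += [(level, i) for i in ids if i not in chosen]
    return train, test

# Six sucrose levels (in %), 36 tablets per level, identified here by 0..35.
levels = [5, 10, 15, 20, 25, 30]
samples = {level: list(range(36)) for level in levels}
train_set, test_set = split_by_concentration(samples)
print(len(train_set), len(test_set))  # 144 72
```

Splitting within each level keeps every concentration represented in the same 2:1 proportion in both sets.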

Instrumentation.
The terahertz time-domain spectroscopy (THz-TDS) system used in this study is composed of a terahertz time-domain spectrometer (Z3, Zomega Terahertz Corp., USA) and a femtosecond laser (FemtoFiber pro NIR, TOPTICA Photonics Inc., Germany). The structure is shown in Figure 1. In operation, the laser generates pulses with a repetition frequency of 82 Hz, a pulse width of about 100 fs, a center wavelength of 780 nm, and an average power of nearly 100 mW. The laser beam is divided into a pump beam and a probe beam by a cube beam splitter (CBS). The terahertz pulse, generated when the pump beam irradiates the photoconductive antenna, passes through the sample under test. After the delay line, the probe beam and the terahertz pulse pass through the ZnTe crystal, a quarter-wave plate (QWP), and a polarization beam splitter (PBS) and then reach the detector. The beam path of the terahertz time-domain spectrometer is enclosed in a closed box purged with dry air. To ensure the accuracy of the experiment, the terahertz spectra of the samples were collected only when the humidity in the box was less than 1% and the indoor temperature was 25°C.

Data Acquisition.
Dry air is measured to obtain the reference signal, and the samples are measured to obtain the sample signals. The signals are transformed by fast Fourier transform (FFT) to obtain the reference spectrum $E_{ref}(\omega)$ and the sample spectrum $E_s(\omega)$, respectively, where $\omega$ is the frequency. Finally, the terahertz absorption spectrum, which reflects the absorption level of the sample, is calculated according to formula (1).
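As a rough illustration of this step, the sketch below computes an absorbance spectrum from a reference and a sample time-domain trace. The absorbance definition used here, $-\log_{10}(|E_s|^2/|E_{ref}|^2)$, is one common convention and is an assumption; it is not necessarily identical to formula (1). The synthetic Gaussian pulse, the time step, and all names are likewise hypothetical.

```python
import numpy as np

def absorbance_spectrum(t_ps, e_ref, e_sam):
    """Return frequencies (THz) and absorbance from two time-domain traces.

    Assumed definition: A = -log10(|E_s|^2 / |E_ref|^2).
    """
    dt = t_ps[1] - t_ps[0]
    freqs = np.fft.rfftfreq(len(t_ps), d=dt)   # ps time axis gives THz
    E_ref = np.abs(np.fft.rfft(e_ref))
    E_sam = np.abs(np.fft.rfft(e_sam))
    # Guard against division by zero and log(0) outside the usable band.
    ratio = E_sam / np.maximum(E_ref, 1e-30)
    absorbance = -2.0 * np.log10(np.maximum(ratio, 1e-30))
    return freqs, absorbance

# Synthetic check: a sample that halves the field at every frequency.
t = np.arange(1000) * 0.05                      # 50 ps window, 0.05 ps step
reference = np.exp(-((t - 10.0) / 0.3) ** 2)    # hypothetical THz pulse
sample = 0.5 * reference
freqs, absorbance = absorbance_spectrum(t, reference, sample)
band = (freqs >= 0.3) & (freqs <= 1.6)          # band analysed in the paper
print(absorbance[band][:3])
```

Only the band in which the reference spectrum has adequate signal is meaningful; outside it the ratio is dominated by noise.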

Principal Component Analysis (PCA).
The terahertz absorption spectrum of the samples contains a wealth of information, but part of it is redundant and useless.
Therefore, an algorithm is needed that simplifies the data while retaining as much effective information as possible. PCA is the most widely used dimensionality reduction algorithm. By transforming the coordinate system, the algorithm projects the high-dimensional original data into a low-dimensional space while maximizing the variance of the retained data. Generally speaking, the first few principal components already contain most of the information. PCA simplifies the original data, which helps reduce the computational complexity of the model, while the accuracy of the model is almost unaffected [26].
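The projection described above can be sketched with a singular value decomposition. This is a generic PCA illustration under assumed data dimensions (144 spectra with 66 frequency points, both hypothetical), not the implementation used in the study.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project mean-centred data onto its first principal components.

    Returns the scores and the fraction of variance each component explains.
    """
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]

# Synthetic spectra driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(144, 3))
loadings = rng.normal(size=(3, 66))
X = latent @ loadings + 0.01 * rng.normal(size=(144, 66))
scores, explained = pca_scores(X, 3)
print(scores.shape, explained.sum())
```

In this synthetic case three components capture nearly all of the variance, which mirrors the statement that the first few principal components carry most of the information.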
Ni et al. studied the basis for selecting the number of principal components (PCs) [27]. First, the standard deviation (SD) is calculated according to formula (2), where $\varepsilon_{ij}$ is the squared residual of object $i$ on principal component $j$, $n$ is the number of objects, $p$ is the number of data points in the detection band, and $m$ is the number of principal components. Then, the noise level (NL) of the instrument is calculated according to formula (3) by measuring the sample $q$ times, where $x_{kl}$ is the absorption spectrum, $k$ is the scan index, $l$ is the wavelength, and $\bar{x}_l$ is the average absorption spectrum at wavelength $l$, given by formula (4). To determine the upper limit on the number of principal components, the cutoff value $r$ is defined by formula (5). The number of PCs for which $r$ equals 1.5 is considered ideal.

PLSR.
Owing to its good analytical performance and robustness, partial least squares regression (PLSR) has become very popular in multiple regression analysis applications [28,29].
This multiple regression method constructs the relationship between the concentration and the spectral matrix as $y = Xb + e$, where $y$ is the concentration matrix, $X$ is the spectrum matrix, $b$ is the regression coefficient vector, and $e$ is the error vector [30]. PLSR can perform regression analysis on multiple variables with low computational complexity and fast data processing. The algorithm is most advantageous when the data features are relatively linear. In the nonlinear case, the partition and variable selection should be optimized; otherwise, the prediction accuracy will suffer.
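To make the model $y = Xb + e$ concrete, the sketch below fits a single-response PLS model with the classical NIPALS deflation scheme. It is a textbook-style illustration under assumptions made here (function name, synthetic data, number of components), not the software used by the authors.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit y = X b + b0 by PLS1 (NIPALS); returns coefficients b and intercept b0."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector
        t = Xk @ w                     # latent score
        tt = t @ t
        p = Xk.T @ t / tt              # X loading
        c = yk @ t / tt                # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    b0 = y_mean - x_mean @ b
    return b, b0

# Noise-free synthetic data: with as many components as variables,
# PLS1 reproduces the true linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
b_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ b_true + 2.0
b, b0 = pls1_fit(X, y, n_components=5)
print(np.round(b, 6), round(float(b0), 6))
```

With fewer components than variables the fit becomes a regularized approximation, which is the usual reason for preferring PLSR over ordinary least squares on highly collinear spectra.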

Monte Carlo Uninformative Variable Elimination (MCUVE).
The UVE (uninformative variable elimination) algorithm is a typical variable screening method that eliminates uninformative variables and reduces the errors they cause. Based on the PLS model, the UVE algorithm adds a certain number of random variable matrices and obtains a regression coefficient matrix $B$ by leave-one-out cross-validation. The stability $C_j$ of the coefficient $B_j$ is defined as $C_j = \mathrm{mean}(B_j)/\mathrm{std}(B_j)$, where $\mathrm{mean}(B_j)$ is the average value of the regression coefficient $B_j$ and $\mathrm{std}(B_j)$ is its standard deviation. Let $C_{artif}$ be the maximum value obtained by the artificial variables. When $|C_j|$ is less than $|C_{artif}|$, the variable is eliminated [31]. Monte Carlo cross-validation can be used in the original UVE algorithm to replace leave-one-out cross-validation.
This requires the model to be trained and verified multiple times, with the data randomly divided into a training set and a verification set each time. Finally, the results of all runs are averaged.
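The stability criterion can be illustrated with a short Monte Carlo loop. Two simplifications are assumed here and are not from the paper: ordinary least squares stands in for the PLS model to keep the sketch short, and variables are ranked by absolute stability instead of being thresholded against artificial noise variables. The run count, training fraction, and synthetic data are also hypothetical.

```python
import numpy as np

def mc_stability(X, y, n_runs=200, train_fraction=0.8, seed=0):
    """Stability C_j = mean(B_j) / std(B_j) over Monte Carlo resampling runs.

    Assumption: least squares is used in place of PLS for brevity.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(train_fraction * n)
    coefs = np.empty((n_runs, X.shape[1]))
    for r in range(n_runs):
        idx = rng.choice(n, size=n_train, replace=False)  # random training subset
        Xc = X[idx] - X[idx].mean(axis=0)
        yc = y[idx] - y[idx].mean()
        coefs[r] = np.linalg.lstsq(Xc, yc, rcond=None)[0]
    return coefs.mean(axis=0) / coefs.std(axis=0)

# Synthetic data: only variables 0 and 1 carry information about y.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
stability = mc_stability(X, y)
selected = np.argsort(-np.abs(stability))[:2]   # keep the 2 most stable variables
print(sorted(selected.tolist()))
```

Informative variables have coefficients that are large and nearly constant across runs, so their stability is far larger in magnitude than that of variables whose coefficients fluctuate around zero.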

SVR.
SVR is also a very common regression analysis model. Unlike the SVM classification method, SVR constructs the optimal hyperplane and then introduces the variable ε to form upper and lower planes parallel to the hyperplane. The two planes sandwich the sample points between them while the distance between the two planes is minimized. With the help of kernel functions, the algorithm maps data from a low-dimensional space into a high-dimensional space, which makes it possible to handle nonlinear relationships between the independent and dependent variables. Owing to its powerful linear and nonlinear regression capabilities, SVR has been successfully used in a number of applications [33,34]. In this paper, the radial basis function (RBF) was used as the kernel function of the SVR model; in its expression, $x_i$ is a point in space and $y_i$ is the center of the kernel function. The genetic algorithm, an optimization algorithm commonly used with SVR, was used to search for the optimal penalty factor $C$ of the SVR model and the optimal parameter $\gamma$ of the RBF.
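For reference, the sketch below evaluates an RBF kernel in the widely used form $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$. This parameterization is an assumption chosen to match a single kernel parameter $\gamma$; the function names and example points are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel, assumed form: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, gamma):
    """Pairwise kernel values, i.e. the matrix an SVR solver works with."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(points, gamma=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

A larger $\gamma$ makes the kernel decay faster with distance, so the fitted function becomes more local; this is why $\gamma$ is tuned together with the penalty factor $C$.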

Model Validation.
The performance of model prediction is generally assessed by the correlation coefficient (R), the root mean square deviation (RMSD), the Bias, and the ratio of prediction to deviation (RPD). In their definitions, $N$ is the number of samples, $y_i$ is the reference concentration of the $i$-th sample, $\bar{y}$ is the average of the reference concentrations, $\hat{y}_i$ is the predicted concentration of the $i$-th sample, and $\bar{\hat{y}}$ is the average of the predicted concentrations.
We used 10-fold cross-validation to obtain the validation set from the training set; its correlation coefficient and root mean square deviation indicate the performance of the model and are denoted Rv and RMSDV, respectively. Correspondingly, the correlation coefficient and root mean square deviation of the testing set indicate the generalization ability of the model and are denoted Rp and RMSDP, respectively. The closer R is to 1 and the smaller the root mean square deviation, the better the model.
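A minimal sketch of how 10-fold cross-validation partitions the 144 training samples is given below. The shuffling seed and the function name are assumptions, and the fold assignment shown is only one of several reasonable schemes.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # hypothetical seed
    folds = [indices[i::k] for i in range(k)]    # k nearly equal folds
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(144, k=10))
print(len(splits), [len(val) for _, val in splits])
```

Each sample serves as validation data exactly once, so the pooled validation predictions cover the whole training set.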
In the final evaluation of the model, Bias and RPD are considered in addition to R and RMSD. Bias indicates the overall deviation between the predicted values and the actual values, so the smaller its absolute value, the better. RPD indicates the predictive ability of the model. When the RPD is greater than 3.5, the model is considered ideal, and the larger the RPD, the better the model.
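The four indicators can be computed as in the sketch below. The exact normalizations are assumptions: RMSD and Bias divide by $N$, and RPD is taken as the sample standard deviation of the reference values (with $N-1$) divided by RMSD. The paper's own definitions may differ in such details, and the example values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R, RMSD, Bias and RPD for reference and predicted concentrations."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)                    # correlation coefficient
    rmsd = math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)
    bias = sum(p - t for t, p in zip(y_true, y_pred)) / n  # mean signed error
    sd = math.sqrt(var_t / (n - 1))                       # assumed: sample SD
    return {"R": r, "RMSD": rmsd, "Bias": bias, "RPD": sd / rmsd}

# Hypothetical example: every prediction is 1 percentage point too high.
reference = [5.0, 10.0, 15.0, 20.0]
predicted = [6.0, 11.0, 16.0, 21.0]
metrics = regression_metrics(reference, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

The example shows why Bias is reported alongside R: a constant offset leaves the correlation at 1 while RMSD and Bias both reveal the systematic error.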

Spectral Analysis.
The time-domain spectra of red ginseng and sucrose samples were obtained by THz-TDS and are shown in Figure 2.
The corresponding absorbance spectra in the band of 0.3-1.6 THz were calculated from the time-domain spectra and are presented in Figure 3. Four main absorption peaks of red ginseng can be found at 1.03, 1.17, 1.13 and 1.45 THz, while the main absorption peak of sucrose is at 1.47 THz. According to the Lambert-Beer law, absorbance is proportional to the concentration and thickness of the sample. In our quantitative analysis the thickness is fixed, so what we are interested in are the changes in the absorbance curves. As shown in Figure 4, when red ginseng and sucrose are mixed in different ratios, the main absorption peaks can still be observed and change with the content. However, the large number of curves and the mutual interference between them make it impossible to find the pattern of change by observation alone. To obtain the content ratios, pattern recognition methods are usually used to establish an analysis model.

Feature Selection.
To deal with redundant information and reduce computation, it is necessary to reduce the dimension of the spectral data. Before dimensionality reduction, the Z-score was used to standardize the spectral data in order to eliminate the influence of differences in dimension and value range between variables. Then, the spectral curves were transformed into sets of linearly uncorrelated variables by PCA, and the results are shown in Figure 5. Following Ni et al. [27], to balance the reduction of interference against the loss of information, the value of r in formula (5) is set to 1.5, and the calculated number of PCs is 30.
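Z-score standardization can be sketched as follows. Two choices here are assumptions, not statements from the paper: each spectral variable (column) is standardized separately, and the mean and standard deviation are estimated on the training set and reused for the testing set. The array sizes are hypothetical.

```python
import numpy as np

def z_score(X_train, X_test):
    """Standardize each column using training-set mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0            # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(144, 10))
X_test = rng.normal(loc=5.0, scale=2.0, size=(72, 10))
Z_train, Z_test = z_score(X_train, X_test)
print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0, ddof=1).round(6))
```

Reusing the training statistics for the testing set avoids leaking information from the testing samples into the model.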
The other dimension reduction method used in this paper is MCUVE.
This method tries to remove the invalid information carried in the spectral data, reducing the data scale within a reasonable limit without affecting the amount of effective information carried by the data. Figure 6 shows the stability of the subsets of spectral bands selected by MCUVE. For comparison with PCA, we selected the 30 variables with the highest stability to establish the model. The quantitative analysis results of the different models are shown in Table 1. The relationship between the observed sucrose content and the results predicted by each regression model is shown in Figure 7. The accuracy of the models is judged by comparing RMSDV, Rv, RMSDP, and Rp.
According to the Beer-Lambert law, the spectrum of a mixture can be considered the linear superposition of the spectra of its components. However, the Beer-Lambert law is a limiting law, and factors such as interactions between the components of the mixture and the disturbance of stray light invalidate it in the actual spectrum acquisition process. As a result, the nonlinear factors in the terahertz spectra of herbs are often not negligible.
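The idealized linear superposition can be illustrated with a toy calculation. The two spectra below are invented numbers, not measured data, and the helper names are hypothetical; the point is only that under strict linearity the mixing fraction is recovered exactly by a least squares projection.

```python
def mix(spec_a, spec_b, fraction_b):
    """Ideal mixture spectrum: linear superposition of two component spectra."""
    return [(1 - fraction_b) * a + fraction_b * b for a, b in zip(spec_a, spec_b)]

def estimate_fraction(mixture, spec_a, spec_b):
    """Least squares estimate of the fraction of component b in the mixture."""
    diff = [b - a for a, b in zip(spec_a, spec_b)]
    num = sum((m - a) * d for m, a, d in zip(mixture, spec_a, diff))
    den = sum(d * d for d in diff)
    return num / den

# Invented absorbance values at four frequencies (arbitrary units).
ginseng = [0.2, 0.5, 0.9, 1.4]
sucrose = [0.1, 0.3, 1.2, 2.0]
mixture = mix(ginseng, sucrose, 0.15)
print(round(estimate_fraction(mixture, ginseng, sucrose), 6))  # 0.15
```

Once interaction or stray-light terms are added to the mixture spectrum, this linear estimate is no longer exact, which is the motivation for also testing a nonlinear regressor.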
Theoretically, SVR, a method with strong nonlinear analysis ability, can predict the sucrose content in red ginseng more accurately than PLSR, a linear regression method. The results show that SVR obtains better prediction results than PLSR, as reflected in a larger Rp and a smaller RMSDP. Meanwhile, PLSR has a larger Rv and a smaller RMSDV than SVR. Better results on the validation set combined with worse results on the testing set indicate that PLSR overfits.
In the case of combination with PCA, Rv and Rp of PLSR remain at 0.988 and 0.985, respectively. RMSDV increased by 0.015%, and RMSDP decreased slightly. This shows that, compared with PLSR alone, PCA-PLSR neither reduces overfitting nor improves prediction ability. Combined with the same feature extraction method, Rv of SVR increased from 0.987 to 0.989, while Rp remained at 0.987. RMSDV and RMSDP decreased to varying degrees. It can be seen that the noise reduction effect on the terahertz spectrum of red ginseng samples is not obvious, and the improvement brought by PCA-SVR is very limited.

While PCA reduces noise, it also reduces effective information. When the benefit of noise reduction cannot offset the loss of effective information, the prediction ability of the model is adversely affected. After PCA was replaced with MCUVE, both PLSR and SVR achieved better performance. In particular, with the combination of SVR and MCUVE, RMSDV and RMSDP are reduced to the lowest values among these models: Rv increased from 0.989 to 0.993, and Rp increased from 0.987 to 0.990; RMSDV decreased from 1.272% to 0.999%, and RMSDP decreased from 1.412% to 1.172%. This shows that MCUVE extracts effective information from the terahertz spectrum of red ginseng better than PCA, because MCUVE can remove the noise signal without affecting the amount of effective information carried by the data.
To visualize the evaluation of the models, observed (y-axis) vs. predicted (x-axis) (OP) regressions were used, and the slope and intercept parameters were compared against the 1:1 line [35]. The evaluation graphs are shown in Figure 7. For each sucrose content, the predictions of MCUVE-SVR are closer to the observed values and more concentrated, especially at contents of 15%, 20%, and 30%.
In the comparison of Bias, smaller absolute Bias values mean higher overall prediction accuracy. Arranged in descending order of the absolute value of Bias, the models rank as PCA-PLSR, SVR, PLSR, MCUVE-PLSR, PCA-SVR, and MCUVE-SVR on the validation set, and as PLSR, PCA-PLSR, MCUVE-PLSR, SVR, PCA-SVR, and MCUVE-SVR on the testing set. This broadly meets the expectation that, for the detection application of this paper, SVR is more suitable than PLSR and MCUVE is more suitable than PCA, with MCUVE-SVR obtaining the best result. It is also consistent with the evaluation results of R and RMSD.
In the comparison of RPD, the values of all models are greater than 5, which indicates that these models have reached a reliable level. The RPD values of MCUVE-SVR on the validation set and the testing set are 8.5 and 7.5, which are significantly larger than the corresponding values of the other models, indicating that MCUVE-SVR has the highest reliability.
Comparing the values of R, RMSD, Bias, and RPD of each model shows that MCUVE-SVR is more suitable than the other five common models for the detection of adulterated sucrose in red ginseng.

Conclusions
To improve the accuracy of measuring the sucrose content incorporated into red ginseng, we studied quantitative analysis models based on terahertz spectral characteristics. The experimental results show that a quantitative analysis model of sucrose based on terahertz spectroscopy can be made more accurate by combining it with an effective information extraction algorithm. MCUVE is an effective information extraction algorithm commonly used with PLSR, where it can improve prediction accuracy. In this paper, we used MCUVE with an SVR model and obtained more accurate prediction results. The RPD values of MCUVE-SVR on the validation set and the testing set reached 8.5 and 7.5, indicating that the model is ideal and better than the other compared models. Therefore, MCUVE-SVR is a suitable quantitative regression model for the detection of adulterated sucrose in red ginseng. This work provides a useful reference for the detection of component contents in food and drug safety.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.