Protected Geographical Indication Identification of a Chinese Green Tea (Anji-White) by Near-Infrared Spectroscopy and Chemometric Class Modeling Techniques

This paper reports a rapid identification method for Anji-white tea, a Chinese green tea with protected geographical indication (PGI), based on class modeling techniques and near-infrared (NIR) spectroscopy. For model training, 167 real and representative Anji-white tea samples were collected from 8 tea plantations in their original producing areas. Another 81 non-Anji-white tea samples of similar appearance were collected from 7 important tea producing areas and used to validate model specificity. Diffuse NIR spectra were measured on finely ground tea powders. One-class partial least squares (OCPLS) and soft independent modeling of class analogy (SIMCA) were used to describe the distribution of representative Anji-white tea objects and to predict the authenticity of new objects. For data preprocessing, smoothing, derivatives, and standard normal variate (SNV) were applied to improve the raw spectra and the classification performance. Taking derivatives and SNV improved classification accuracy and reduced the complexity of the class models by removing spectral background and baseline. For the best models, both obtained with SNV spectra, the sensitivity and specificity were 0.886 and 0.951 for OCPLS and 0.886 and 0.938 for SIMCA, respectively. Although it is difficult to perform an exhaustive analysis of all types of potential false objects, the proposed method can detect most of the important non-Anji-white teas in the Chinese market.


Introduction
As one of the most popular beverages in the world, tea is favored because of its pleasurable aroma, taste, and putative health effects [1][2][3][4][5][6][7]. Teas can be generally grouped into three principal types according to the degree of fermentation: unfermented, partially fermented, and fully fermented [8].
China has a long history of tea cultivation, processing, and consumption. Among the various types of tea, green tea accounts for the bulk of the total production and is favored by most Chinese consumers. In China, tea producing areas cover most of the central and southern provinces, which differ vastly in geographical and natural conditions as well as in tea species, cultivation techniques, and processing procedures. Therefore, almost all of the famous teas in China are named after their origins. Among the most famous teas, Anji-white tea with protected geographical indication (PGI) has a somewhat confusing name because it is a typical green tea. It is called a "white" tea because its leaves are very light in color due to its low chlorophyll and polyphenol contents [9]. Its processing procedure makes it a green tea: withering, pan firing, and shaping, followed by gentle firing over charcoal. This specific variety of tea bush, reported in ancient literature, was rediscovered growing wild in the 1980s at an altitude over 800 m. It is now cultivated in the mountains of Anji County, along the Huangpu River, in a spectacular area with heavy mist and vast forests of bright green bamboo. The flat and straight leaves produce a lasting fragrance and a unique taste. Traditional tea-tasting specialists attribute the high quality of Anji-white tea to its species, growing environment, and processing procedure. Therefore, PGI authentication of Anji-white tea is needed to identify false products and protect consumer interests. Numerous studies have examined how the chemical composition of tea is influenced by various factors [10][11][12][13][14][15], including species, season, age of the leaves (plucking position), climate, and horticultural conditions (soil, water, minerals, fertilizers, etc.). Such investigations are important for understanding the biological and health effects of teas but usually lack a comprehensive view of chemical composition. Because the chemical composition of tea is very complex, it is very difficult to perform a thorough component analysis and to represent the quality or class of a tea by the contents of a few chemical components. In traditional sensory analysis, the quality of teas is evaluated by professional tea tasters.
Because training a qualified tea taster may take years and is very expensive [16], it is desirable to evaluate tea quality by instrumental techniques.
Recently, spectroscopy coupled with chemometric methods has been widely applied in food analysis [17][18][19][20][21][22]. The principle of such techniques is that the chemical composition of a complicated sample is represented by multivariate spectral signals, from which relevant and useful information concerning food quality parameters can be extracted by multivariate analysis methods. Near-infrared (NIR) spectroscopy has been one of the most commonly used spectroscopic techniques in food quality evaluation and has some advantages over traditional chemical analysis methods, including lower sample preparation requirements, reduced analysis time and cost, the ability to perform simultaneous multicomponent analysis, and the potential for online analysis [23].
The performance of NIR spectroscopy analysis depends heavily on the proper use of chemometric methods. It has been pointed out that PGI authentication is a typical one-class problem [24], where a decision needs to be made on whether a new object should be accepted or rejected by a target class. In such cases, the commonly used classification methods that discriminate two or more predefined classes are unsuitable for several reasons. Firstly, PGI identification requires identifying various unknown false objects, which are difficult to collect and analyze exhaustively. Moreover, a discrimination/classification model would be highly complex and have poor generalization performance if it included many different classes of training objects. Therefore, a class model is required to describe the representative samples belonging to the target class and predict the identities of unknown objects. A class model aims at describing the distribution of a target class and has reduced model complexity. However, the sensitivity and specificity of a class model should be sufficiently validated to ensure its usefulness. The sampling procedure should be representative and comprehensive so that it includes most if not all of the significant variations likely to be encountered in future test materials [18].
Table 1: Analyzed tea samples.
With the above considerations, the objective of this paper is to develop a rapid and well-validated PGI authentication method for Anji-white tea by using class modeling techniques and NIR spectroscopy, with emphasis on representative sample collection and validation of class models.

Tea Samples Analyzed.
A total of 167 authentic Anji-white tea samples and 81 samples of other main non-Anji-white teas were collected directly from the market branches of tea plantations in the original producing areas with official certifications. All of the samples were made of green tea leaves picked before the Qingming Festival 2011 (April 5, 2011). Detailed information concerning the samples is shown in Table 1. All of the samples were stored in a cool, dark, and dry place in intact packaging before spectroscopic analysis.

FTIR Spectroscopy.
Diffuse NIR spectra were collected using a Bruker-TENSOR37 FTIR spectrometer (Bruker Optics, Ettlingen, Germany) in the wavenumber range from 4000 to 12000 cm⁻¹. Tea samples were finely ground using an agate pestle and mortar and passed through a 40-mesh sieve. The powders were then packed into a NIR sample cup. The sample cup was filled fully and compacted naturally without further pressing. For each sample, 128 scans were performed with a resolution of 8 cm⁻¹ at room temperature using OPUS software. An increase in scanning time did not significantly improve the signal. The average of the 128 scans was used as the raw spectrum for further data analysis. The scanning interval was 3.857 cm⁻¹; therefore, each raw spectrum had 2074 individual data points.

Outlier Diagnosis and Data Preprocessing.
All of the data analysis was performed on MATLAB 7.0.1 (Mathworks, Sherborn, MA, USA). In practical data analysis, outliers in the data can cause model bias or even breakdown of the models. Therefore, robust principal component analysis (rPCA) [25] was used to detect outliers. rPCA can overcome the masking effects caused by the presence of multiple outliers. Considering the high-dimensional nature of the NIR spectral data (2074 variables per raw spectrum), an improved rPCA [26] was used, which was shown to be more numerically stable for high-dimensional data and to have a moderate computational cost. According to the computed score distance (SD) and orthogonal distance (OD), an rPCA diagnosis plot classifies the samples into four groups: regular data (small SD and small OD), good PCA-leverage points (large SD and small OD), orthogonal outliers (small SD and large OD), and bad PCA-leverage points (large SD and large OD).
The data with outliers removed were then split into a representative training set and a test set by the Kennard and Stone (K-S) algorithm [27]. The K-S algorithm splits the data in such a way that the test objects are scattered uniformly within the range of the training objects. Because the distributions of tea samples from each producing area were not the same, the K-S method was performed separately for the teas from each producing area. For class model analysis, the training and test samples selected by the K-S algorithm from each producing area were then pooled to form the overall training set and test set.
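The selection rule behind a K-S split can be sketched in a few lines. The following is a minimal pure-Python illustration, not the MATLAB code used in this work: the function name `kennard_stone`, the Euclidean distance, and the convention that the selected objects form the training set are assumptions made for the example.

```python
import math

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by the Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(X)")
    # Pairwise Euclidean distances between all objects (assumed metric).
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Start with the two objects that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda pair: dist[pair[0]][pair[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Add the object whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Toy one-variable data set (hypothetical values, not tea spectra).
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
train_idx = kennard_stone(X, 3)
test_idx = [k for k in range(len(X)) if k not in train_idx]
```

To mimic the per-area splitting described above, such a function would be called once per producing area and the resulting index lists pooled afterwards.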
Smoothing, taking derivatives, and standard normal variate (SNV) [28] were used to improve the training and prediction performance of the class models. Smoothing can suppress random noise in spectra and improve the signal-to-noise ratio (SNR). The Savitzky-Golay (S-G) polynomial fitting algorithm [29] was used considering its popularity and simplicity. Taking derivatives can enhance spectral resolution and remove baseline and background, so first-order and second-order derivatives were used. To prevent the degradation of SNR caused by differencing, derivatives were also computed by the S-G algorithm. SNV has proved effective in reducing scattering effects and correcting the interference caused by variations in optical path length. In this paper, SNV was used to reduce the spectral variations caused by possible differences in powder bulk density.
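As an illustration of these preprocessing steps, the sketch below applies SNV and S-G filtering to a single spectrum in plain Python. The 5-point window and quadratic polynomial are assumptions chosen because their convolution weights are well known; the window size and polynomial order actually used in this work are not stated here, and the names `snv`, `sg_filter`, `SG_SMOOTH`, and `SG_DERIV1` are hypothetical.

```python
import math

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and std."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]

# Savitzky-Golay weights for a 5-point window and a quadratic polynomial (assumed settings).
SG_SMOOTH = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]
SG_DERIV1 = [-2 / 10, -1 / 10, 0.0, 1 / 10, 2 / 10]

def sg_filter(spectrum, weights, step=1.0, deriv_order=0):
    """Convolve a spectrum with S-G weights; the edge points are dropped."""
    half = len(weights) // 2
    out = []
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        value = sum(w * v for w, v in zip(weights, window))
        out.append(value / step ** deriv_order)  # rescale derivatives by the sampling step
    return out

# Toy "spectrum" y = x^2 sampled at x = 0..6 (hypothetical values).
raw = [float(x * x) for x in range(7)]
smoothed = sg_filter(raw, SG_SMOOTH)                 # reproduces a quadratic exactly
first_deriv = sg_filter(raw, SG_DERIV1, 1.0, 1)      # equals 2x at x = 2, 3, 4
scaled = snv(raw)
```

For real spectra, `step` would be the scanning interval (3.857 cm⁻¹ in this work) so that the derivative is expressed per wavenumber unit.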

Class Modeling Techniques.
We recently proposed a new class modeling technique based on one-class partial least squares (OCPLS) regression [30]. It was used for the authentication of pure sesame oils by mid-infrared spectra and was demonstrated to have a performance comparable to that of soft independent modeling of class analogy (SIMCA) [24]. OCPLS develops a partial least squares regression model relating the features to a class response vector 1 with all elements equal to one. The use of 1 as a response vector means that all the objects in the same class should be distributed as compactly as possible. Unlike SIMCA, which projects the raw variables onto a few principal components (PCs) explaining most of the data variance, OCPLS considers both the explained variance and the compactness of a class by projecting the raw features onto the class average. The modeling error of the response variable is assumed to have a normal distribution and is used as the distance measurement from an object to the class center. The class center is estimated as the mean of the modeling error. Since OCPLS can be performed in the framework of multivariate calibration, estimation of its model complexity is more straightforward than for SIMCA; for example, a well-established F-test combined with Monte Carlo cross-validation (MCCV) [31] was demonstrated to be effective in reducing the risk of overfitting [32]. SIMCA describes the class structure of the training objects by the space spanned by a few significant PCs. The magnitude of the residual error can be used as a distance measurement from an object to the class center. To reject or accept a new object, its residuals can be tested with an F-test procedure. It was realized that the residual error could be underestimated when it is computed directly from PCA of the training samples. This would lead to a large number of objects being wrongly rejected (a large α-error); therefore, residuals predicted by leave-one-out cross-validation (LOOCV) rather than the training residuals were used [33]. This procedure was shown to be effective in reducing the number of false outliers.
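The idea of regressing the features onto a vector of ones can be illustrated with a deliberately simplified sketch: a single PLS component, no centering, and a normal-theory cut-off on the modeling error. This is not the OCPLS implementation of [30], which selects the number of components by MCCV and an F-test; the function names, the toy data, and the 1.96 cut-off are all assumptions for illustration.

```python
import math

def fit_one_component(X):
    """Single-component PLS regression of X onto a response vector of ones."""
    n, p = len(X), len(X[0])
    # The PLS weight vector is proportional to X^T 1, i.e. the column sums of X.
    w = [sum(row[j] for row in X) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in X]   # scores
    q = sum(t) / sum(v * v for v in t)                      # least squares fit of 1 on t
    resid = [1.0 - q * ti for ti in t]                      # modeling errors
    center = sum(resid) / n                                 # class center
    spread = math.sqrt(sum((e - center) ** 2 for e in resid) / (n - 1))
    return {"w": w, "q": q, "center": center, "spread": spread}

def accept(model, x, z_crit=1.96):
    """Accept x if its modeling error lies within z_crit standard deviations of the center."""
    t = sum(a * b for a, b in zip(x, model["w"]))
    error = 1.0 - model["q"] * t
    return abs(error - model["center"]) <= z_crit * model["spread"]

# Hypothetical three-variable "spectra" of a target class.
train = [[1.0, 2.0, 3.0], [1.1, 2.0, 2.9], [0.9, 2.1, 3.0], [1.0, 1.9, 3.1]]
model = fit_one_component(train)
similar = accept(model, [1.0, 2.0, 3.05])   # close to the class average
foreign = accept(model, [3.0, 2.0, 1.0])    # very different profile
```

With only one component and a residual-only test, objects that differ from the class but share the same projection onto the class average would still be accepted, which is one reason a practical model needs more components and a properly validated decision rule.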

Results and Discussion
Some of the spectra of the authentic Anji-white tea and non-Anji-white tea samples are shown in Figure 1. The spectral range of 9000-12000 cm⁻¹ carries little chemical information and has a very low SNR, so this range was excluded from further data analysis. As seen in Figure 1, all the teas have very similar absorbance bands in the range of 4000-9000 cm⁻¹. The wide bands in 8000-9000 cm⁻¹ can be attributed to the second overtone of C-H stretching. Peaks in 6000-7000 cm⁻¹ involve the contributions of O-H stretching vibrations and the stretching vibration of N-H (~6800 cm⁻¹) in amino acids. Other obvious bands include 5600 cm⁻¹ (first overtone of C-H stretching), 5200 cm⁻¹ (combination of O-H and C-O stretching), 4700 cm⁻¹ (combination of O-H bending and C-O stretching), and 4300 cm⁻¹ (combination of C-H stretching and -CH2 deformation). The raw spectra are highly overlapped and characterized by poor peak resolution, so accurate assignment of specific peaks is very difficult. The low level of detail in the raw spectra can be attributed to the contributions of multiple components and to the shifts and distortions resulting from their interactions. Though different tea varieties had very similar absorbance patterns, the relative intensities of the bands differed. Therefore, class modeling techniques are useful for extracting the subtle information in the spectral data that characterizes real Anji-white teas.
To sharpen the classification performance of the class models, smoothing, first- and second-order S-G derivatives, and SNV were used to preprocess the raw spectra. Some of the preprocessed spectra of Anji-white and non-Anji-white teas are plotted in Figure 2. As seen in Figure 2, although smoothing can slightly improve the SNR of the raw spectra, it risks losing some useful high-frequency information in the raw data. The second-order derivative spectra remove most of the baseline and enhance some detailed information. SNV spectra reduce some spectral variations while enhancing others. The actual effects of data preprocessing should be evaluated by classification performance.
Outlier detection was performed by rPCA of the raw spectra. The number of PCs was determined by robust pooled predicted residual sum of squares (PRESS) values. Following the rule of thumb, the first seven PCs were selected so as to account for 95.32% (>95%) of the total data variance. The rPCA diagnosis plot of the 167 authentic Anji-white tea samples is shown in Figure 3. OD measures the distance from a sample to the model space spanned by the selected PCs, and SD describes the dispersion of the samples in the class projected onto the model space. Therefore, both orthogonal outliers and bad PCA-leverage points should be excluded from the training set. Because the real Anji-white tea samples came from different producing areas, there might be considerable differences in the contents of different chemical components. Therefore, good PCA-leverage points (objects 49, 51, and 135) were retained to represent the spectral variations among the real Anji-white tea samples from different producing areas. In Figure 3, three orthogonal outliers (objects 26, 115, and 156) were removed. The K-S algorithm was then used to split the remaining 164 Anji-white samples into a training set of 120 samples and a test set of 44 samples. Therefore, the test set had 44 positive (Anji) objects and 81 negative (non-Anji) objects.
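The retention rule used here (keep regular objects and good PCA-leverage points, remove orthogonal outliers and bad PCA-leverage points) can be written as a small decision function. The sketch below assumes that SD and OD values and their cut-offs are already available from the rPCA step; the numerical values are invented for illustration and are not taken from Figure 3.

```python
def classify_diagnosis(sd, od, sd_cut, od_cut):
    """Assign one object to a group of the SD/OD diagnosis plot."""
    large_sd = sd > sd_cut
    large_od = od > od_cut
    if large_sd and large_od:
        return "bad PCA-leverage point"
    if large_od:
        return "orthogonal outlier"
    if large_sd:
        return "good PCA-leverage point"
    return "regular"

def keep_for_training(label):
    """Retention rule described in the text: only large-OD objects are removed."""
    return label in ("regular", "good PCA-leverage point")

# Hypothetical (SD, OD) pairs and cut-off values.
SD_CUT, OD_CUT = 4.0, 0.5
objects = {"A": (1.2, 0.1), "B": (5.3, 0.2), "C": (1.0, 0.9), "D": (6.1, 1.4)}
labels = {name: classify_diagnosis(sd, od, SD_CUT, OD_CUT)
          for name, (sd, od) in objects.items()}
kept = [name for name, label in labels.items() if keep_for_training(label)]
```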
OCPLS and SIMCA models were developed to describe the distribution of real Anji-white tea samples. For SIMCA, the improved decision region [33] was adopted to reduce the risk of having a large number of objects wrongly rejected. Cross-validation was performed to evaluate the number of significant PCs; the criterion of 95% total explained variance was also considered. For OCPLS, Monte Carlo cross-validation (MCCV) with 10% of the objects left out was used to determine the number of PLS components, and the number of resamplings was 100. The PRESS values obtained by MCCV were subjected to an F-test [32]. As suggested, a significance level of 0.25 was adopted to select the smallest number of latent variables with a PRESS value not significantly larger than the minimum value according to the F-test. Sensitivity and specificity were used to evaluate the performance of different models and preprocessing options. The training results and the prediction results for the test samples obtained by SIMCA and OCPLS are shown in Table 2. As seen in Table 2, preprocessing generally improved the classification performance in terms of sensitivity and specificity. However, the models based on smoothed spectra seem to have inferior performance, which might be attributed to the possible loss of detailed frequency information. Second derivative and SNV significantly improved the class models by reducing the baseline and background. The model complexity of SIMCA and OCPLS based on such preprocessing was also reduced. For both OCPLS and SIMCA, the best class models were obtained with SNV preprocessing; their sensitivity and specificity were 0.886 and 0.951 for OCPLS and 0.886 and 0.938 for SIMCA, respectively, and the prediction results are shown in Figures 4 and 5. The comparison of different preprocessing methods demonstrated that the spectral variations caused by scattering effects and baseline shifts affected the class models more than a low SNR did.
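For reference, sensitivity and specificity follow directly from the confusion counts defined for Table 2 (TP, FN, TN, FP). In the short sketch below, the counts are back-calculated from the reported ratios and the test set sizes (44 positive and 81 negative objects); they are an inference for illustration, not values quoted from the table.

```python
def sensitivity(tp, fn):
    """Fraction of positive (Anji-white) objects correctly accepted."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of negative (non-Anji-white) objects correctly rejected."""
    return tn / (tn + fp)

# Counts inferred from the reported ratios (assumption): 39 of 44 positives accepted,
# 77 of 81 negatives rejected by OCPLS and 76 of 81 by SIMCA.
ocpls_sens = round(sensitivity(tp=39, fn=5), 3)
ocpls_spec = round(specificity(tn=77, fp=4), 3)
simca_spec = round(specificity(tn=76, fp=5), 3)
```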

Conclusions
Rapid identification methods for a PGI green tea, Anji-white tea, were developed by using NIR spectroscopy and chemometric class modeling techniques. With SNV preprocessing, OCPLS (sensitivity 0.886 and specificity 0.951) and SIMCA (sensitivity 0.886 and specificity 0.938) achieved the best classification performance in terms of prediction sensitivity and specificity. The analysis results indicate that removal of spectral background and baseline plays a more important role than a higher SNR. Taking derivatives and SNV transformation can not only improve classification accuracy but also reduce the complexity of OCPLS and SIMCA models. Although it is hard to perform an exhaustive collection and analysis of all types of white teas, this study provides a reliable and effective tool to identify Anji-white tea against most of the important non-Anji-white teas in the Chinese market.

F 1 :
Preprocessing.All of the data analysis was performed on MATLAB 7.0.1 (Mathworks, Some of the NIR spectra of real Anji-white (a) and non-Anji-white (b) tea samples.

F 2 :
Some of the preprocessed spectra of Anji-white (le) and non-Anji white (right) tea samples by smoothing, taking derivatives, and SNV.

F 3 :
Outlier diagnosis plot obtained by robust PCA with 7 principal components of 167 Anji-white tea samples.

F 4 :
Results obtained by SNV-OCPLS with 8 latent variables for (a) 44 positive test objects, and (b) 81 negative test objects.

F 5 :
Results obtained by SNV-SIMCA with 7 principal components for (a) 44 positive test objects, and (b) 81 negative test objects.

Table 2: Prediction results obtained by OCPLS and SIMCA. TP: number of true positives; FN: number of false negatives; TN: number of true negatives; FP: number of false positives. "Positive" and "negative" represent Anji-white and non-Anji-white tea, respectively.
a Numbers of OCPLS or SIMCA latent variables.b