Rapid Discrimination of the Geographical Origins of an Oolong Tea (Anxi-Tieguanyin) by Near-Infrared Spectroscopy and Partial Least Squares Discriminant Analysis

This paper focuses on a rapid and nondestructive way to discriminate the geographical origin of Anxi-Tieguanyin tea by near-infrared (NIR) spectroscopy and chemometrics. A total of 450 representative samples were collected from Anxi County, the original producing area of Tieguanyin tea, and another 120 Tieguanyin samples with similar appearance were collected from unprotected producing areas in China. All samples were measured by NIR. The Stahel-Donoho estimates (SDE) outlyingness diagnosis was used to remove outliers. Partial least squares discriminant analysis (PLSDA) was performed to develop a classification model and predict the authenticity of unknown objects. To improve the sensitivity and specificity of classification, the raw data were preprocessed to reduce unwanted spectral variations by standard normal variate (SNV) transformation, second-order derivative (D2) spectra, and smoothing. The best model, built on SNV spectra, reached a sensitivity of 0.931 and a specificity of 1.000. The combination of NIR spectrometry and statistical model selection can provide an effective and rapid method to discriminate the geographical producing area of Anxi-Tieguanyin.


Introduction
Oolong tea is one of the most popular traditional beverages in the world. As a semifermented tea, it lies between green and black tea, and its particular processing gives it a pleasant aroma and taste. Moreover, many people regard oolong tea as a functional drink because of its high content of antioxidant substances such as epigallocatechin gallate (EGCG) and other catechins [1]. Oolong tea has also been reported to have antiobesity, anticaries, disinfectant, and hypolipidemic effects [2,3].
The special aroma and taste of a tea depend largely on the geographical and natural conditions in which the tea tree grows, as well as on the tea cultivar, cultivation traditions, and processing procedures. Therefore, most of the famous teas in China are named after their origins. Anxi-Tieguanyin tea (ATT) is one of the most famous oolong teas with a protected geographical indication (PGI), produced in Anxi County, a small town in Fujian province. Owing to the distinctive climate and soil conditions of Anxi County, ATT has a long-lasting fragrance and a strong aftertaste, and it has been exported to many places all over the world. Because Tieguanyin teas of different provenances have a similar appearance, some merchants fraudulently label non-Anxi-Tieguanyin teas (NATT) as ATT for illegal profit [4]. These actions damage the reputation of ATT [5]. Therefore, quality control of ATT against various counterfeits is important and urgent. Until now, sensory analysis has been the usual method for identifying the geographical origin of ATT, and it depends largely on the experience and personal feelings of tea tasters. A more stable and effective tool is therefore needed for this application.

Journal of Analytical Methods in Chemistry
Different cultivating places have varied growing conditions, including altitude, climate, soil, microelements, fertilizer, and processing [6,7]. All these factors contribute to the different chemical components in teas. Although the varied pattern of several chemical components can partially indicate the character of specific teas, making error-free discriminations from only a few chemical components has proved difficult because the composition of tea is very complex [8,9]. In recent years, instrumental methods coupled with chemometrics have provided promising alternative approaches in food component analysis [10][11][12]. As a rapid and effective measuring technique, near-infrared (NIR) spectroscopy has been widely applied in multivariate quality control of food [13,14]. Because it responds to the characteristic vibrational frequencies of molecular structures, NIR can characterize multiple chemical components of a sample, which can help researchers discriminate the provenances of tea products. NIR has the following advantages over chemical analysis: (1) lower cost in money and time; (2) nondestructive measurement of samples; and (3) suitability for online analysis [15].
For automatic identification, some researchers have successfully used NIR spectroscopy and class models to discriminate the provenances of green teas [16,17]. However, the chemical composition of Tieguanyin, a semifermented tea, is far more complicated, so higher sensitivity is required in both measurement and classification. This paper aims to provide an effective way to discriminate the geographical origin of ATT by NIR spectroscopy and PLSDA.

Tea Samples.
A total of 450 authentic ATT samples with official certifications were collected from 30 main Tieguanyin-producing areas of Anxi County. A further 120 NATT samples were collected from Yongchun, Huaan, Xiandu, and so forth. All samples were spring teas of 2013 (bought in the local tea markets of Anxi County before May 23, 2013) and were kept in cold storage (4 °C) before measurement. Detailed information on the samples is presented in Table 1.

NIR Spectrometric Analysis.
All samples were scanned on a TENSOR37 Fourier transform NIR spectrometer (Bruker, Ettlingen, Germany) with OPUS 7.2 software. Each sample was packed in a quartz cuvette and measured with a PbS detector. Each reported spectrum is the average of 64 scans over the spectral range from 4000 cm⁻¹ to 12000 cm⁻¹. The resolution was 8 cm⁻¹ and the scanning interval was 1.928 cm⁻¹, so 4148 individual data points were acquired from each spectrum for multivariate analysis.

Data Preprocessing and Splitting.
Data analysis was performed in MATLAB 7.14.0.739 (Mathworks, Sherborn, MA). Aberrant spectra (outliers) are usually caused by abnormal samples or measurement faults. Outliers have a negative influence on class models and can even lead to model bias. Therefore, the Stahel-Donoho estimates (SDE) were used to detect abnormal spectra. SDE can detect multiple outliers by calculating an outlyingness value for each object [18]. In addition, the spectra need to be preprocessed to ensure good sensitivity and specificity. Three preprocessing methods were investigated in this study: standard normal variate (SNV) transformation [19], second-order derivatives (D2), and smoothing [20].
After preprocessing, the data were split into a training set and a prediction set by the Kennard and Stone (K-S) algorithm [21]. The K-S algorithm ensures that the prediction objects are uniformly distributed within the range of the training objects. In this paper, the K-S algorithm was applied separately to the ATT and NATT samples. The two training sets were then combined into a total training set, and the two prediction sets formed a total prediction set.
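As a rough illustration of the three preprocessing steps named above, the following Python sketch applies SNV, a central-difference second derivative, and a moving-average smoother to a single spectrum. The function names, the toy spectrum, and the window width are assumptions made for illustration only; the study itself used MATLAB, and the derivative and smoothing filters of [20] may well differ from the simple ones used here.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    return [(a - m) / s for a in spectrum]


def second_difference(spectrum):
    """Crude second derivative by central differences (unit point spacing assumed)."""
    return [spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
            for i in range(1, len(spectrum) - 1)]


def moving_average(spectrum, window=5):
    """Boxcar smoothing with an odd window; the edge points are dropped."""
    half = window // 2
    return [mean(spectrum[i - half:i + half + 1])
            for i in range(half, len(spectrum) - half)]


# Hypothetical absorbance values, not taken from the study.
raw = [0.10, 0.12, 0.18, 0.30, 0.42, 0.40, 0.31, 0.22]
scaled = snv(raw)

# A spectrum that differs only by a multiplicative factor and an offset
# collapses onto the same SNV spectrum.
shifted = snv([2 * a + 1 for a in raw])
```

Because SNV works on each spectrum separately, it needs no reference spectrum and can be applied before or after the data are split.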
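The max-min selection rule behind the K-S algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: Euclidean distance between spectra, a hypothetical function name `kennard_stone`, and toy one-dimensional data. It is not the authors' MATLAB code.

```python
from math import dist


def kennard_stone(samples, n_train):
    """Return (train_idx, rest_idx) chosen by the Kennard-Stone max-min rule."""
    n = len(samples)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    # Distance from each remaining sample to its nearest selected sample.
    nearest = {i: min(dist(samples[i], samples[s]) for s in selected)
               for i in remaining}
    while len(selected) < n_train:
        # Add the sample that is farthest from everything selected so far.
        pick = max(remaining, key=lambda i: nearest[i])
        selected.append(pick)
        remaining.remove(pick)
        del nearest[pick]
        for i in remaining:
            nearest[i] = min(nearest[i], dist(samples[i], samples[pick]))
    return selected, remaining


# Toy data: five one-dimensional "spectra".
points = [[0.0], [1.0], [2.0], [3.0], [10.0]]
train_idx, pred_idx = kennard_stone(points, n_train=3)
```

To mirror the per-class splitting described above, the function would be called once on the ATT spectra and once on the NATT spectra, and the two index sets would then be merged.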

PLSDA.
As a key method in chemometrics, partial least squares (PLS) has various applications [22][23][24]. The spectra of the training set can be represented as an $n \times p$ matrix $\mathbf{X}$ (with $n$ training objects and $p$ wavelength points). Let $c$ be the number of classes; in this paper, $c = 2$ (the ATT class and the NATT class). An $n \times c$ matrix $\mathbf{Y}$ is then constructed, in which each element encodes the class of the corresponding object in $\mathbf{X}$. If object $i$ ($i = 1, \ldots, n$) is from class $j$ ($j = 1, \ldots, c$), the element in the $i$th row and $j$th column of $\mathbf{Y}$ is given the value 1; all other elements of $\mathbf{Y}$ are set to $-1$. For prediction, a new sample is classified into class $j$ ($j = 1, \ldots, c$) when the $j$th element of its predicted response is above zero. Because $c = 2$, only the first column of the predicted responses needs to be considered: if its value is above zero, the sample is classified as ATT; otherwise, it is classified as NATT.

For PLSDA, the number of latent variables is a key parameter, and too many latent variables bring a risk of overfitting. In this model, Monte Carlo cross-validation (MCCV) [25] was therefore performed to estimate this parameter. The number of latent variables was selected to minimize the misclassification rate of MCCV (MRMCCV):

$$\mathrm{MRMCCV} = \frac{\sum_{i=1}^{T} M_i}{N \times T},$$

where $M_i$ is the number of misclassified objects in the $i$th split, $N$ is the total number of prediction objects in each split, and $T$ is the number of data splits. Sensitivity and specificity were then calculated to evaluate the performance of the classification models:

$$\mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},$$

where TP is the number of true positives, FN is the number of false negatives, TN is the number of true negatives, and FP is the number of false positives.
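The two-class coding and decision rule described above can be illustrated with a small Python sketch. It assumes the standard NIPALS form of PLS with a single response column (coded +1 for ATT and −1 for NATT), which may differ in detail from the authors' MATLAB implementation; the function names and the toy two-variable "spectra" are hypothetical.

```python
def fit_pls1(X, y, n_components):
    """Fit PLS with one response by NIPALS; X is a list of rows, y holds +1/-1 codes."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                 # centred y
    components = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # no covariance left between X and y
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Predicted response for one new spectrum."""
    x_mean, y_mean, components = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


def classify(model, x):
    """Decision rule from the text: response above zero means ATT."""
    return "ATT" if predict_pls1(model, x) > 0 else "NATT"


def sensitivity_specificity(true_labels, predicted_labels, positive="ATT"):
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy training data with two variables per object (illustrative values only).
X_train = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [0.2, 1.0], [0.1, 1.1], [0.3, 0.9]]
y_train = [1, 1, 1, -1, -1, -1]  # +1 = ATT, -1 = NATT
model = fit_pls1(X_train, y_train, n_components=1)
```

With two classes, the second column of the response matrix is simply the negative of the first, which is why a single response vector is enough in this sketch.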

Results and Discussions
Some of the raw NIR spectra of ATT and NATT are shown in Figure 1. The wavenumbers from 9000 to 12000 cm⁻¹ were discarded in further analysis because they had low response signals and carried little information. As Figure 1 shows, although some of the NATT samples have lower absorbance in the range of 4000 to 5000 cm⁻¹, a part of the NATT spectra is highly similar to the ATT spectra, and the two can hardly be distinguished by the naked eye. Therefore, chemometrics was used to extract useful chemical information from these spectra and to develop classification models. Outlier detection was performed by the SDE outlyingness diagnosis, and the results are shown in Figure 2. In this paper, a spectrum is regarded as an outlier if its outlyingness is above 3. As seen from Figure 2, 19 ATT objects and 4 NATT objects were removed.
The raw spectral data were then preprocessed by SNV, D2, and smoothing. The preprocessed spectra are shown in Figure 3. Compared with the raw spectra, SNV reduces some spectral variations, D2 enhances the resolution of some bands and removes most of the baselines, and smoothing reduces the strength of noise signals.
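The outlyingness idea can be sketched as follows. This is an approximation under explicit assumptions: random unit directions stand in for the search over all projections, the median and the unscaled median absolute deviation (MAD) serve as robust location and scale, and the function name `sde_outlyingness` is hypothetical. How the study normalized the outlyingness before applying the cutoff of 3 is not stated, so the numbers here are not comparable with Figure 2.

```python
import random
from math import cos, pi, sin
from statistics import median


def sde_outlyingness(samples, n_directions=200, seed=0):
    """Approximate Stahel-Donoho outlyingness using random unit directions."""
    rng = random.Random(seed)
    p = len(samples[0])
    scores = [0.0] * len(samples)
    for _ in range(n_directions):
        a = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = sum(v * v for v in a) ** 0.5
        a = [v / norm for v in a]
        # Project every sample onto the direction a.
        proj = [sum(x * w for x, w in zip(row, a)) for row in samples]
        med = median(proj)
        mad = median(abs(v - med) for v in proj)
        if mad == 0:
            continue  # degenerate direction, skip it
        for i, v in enumerate(proj):
            # Keep the worst-case robust z-score seen over all directions.
            scores[i] = max(scores[i], abs(v - med) / mad)
    return scores


# Toy data: 20 points on a unit circle plus one gross outlier (index 20).
cluster = [[cos(2 * pi * k / 20), sin(2 * pi * k / 20)] for k in range(20)]
data = cluster + [[50.0, 50.0]]
scores = sde_outlyingness(data)
flagged = [i for i, s in enumerate(scores) if s > 3]  # cutoff used in the text
```

Because the median and MAD are barely moved by a few extreme objects, several outliers can be flagged at once without masking one another.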
The K-S algorithm was used to divide the data into a training set (with 300 ATT samples and 80 NATT samples) and a prediction set (with 131 ATT samples and 35 NATT samples), and then PLSDA models were developed using the raw, D2, and SNV spectra. MCCV was used to estimate the number of PLSDA latent variables: the training set was randomly divided 20 times into a secondary training set (50%) and a secondary prediction set (50%), and the number of latent variables that gave the minimal MRMCCV was selected. The objects in the prediction set were then used to calculate the sensitivity and specificity of the PLSDA model. The prediction results and optimized parameters for the different preprocessing methods are listed in Table 2. As seen from Table 2, the best model is SNV-PLSDA, with a sensitivity/specificity of 0.931/1.000. The PLSDA models based on smoothed spectra and on raw NIR spectra gave exactly the same results, which means that smoothing had little effect on the NIR spectra. The model based on D2 spectra, however, gave an even lower sensitivity/specificity than the model based on raw data. This might be caused by the loss of frequency information when taking second derivatives. The training and prediction results of SNV-PLSDA are shown in Figure 4.
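The MCCV procedure above (20 random 50%/50% splits of the training set, pooled misclassification rate, smallest rate wins) can be sketched generically in Python. The classifier is passed in as a callable so that the sketch stays short. The k-nearest-neighbour function used in the demonstration is only a stand-in with a tunable parameter, not PLSDA, and all names and data are hypothetical. Whether the study stratified the random splits by class is not stated; this sketch does not.

```python
import random


def mrmccv(X, y, fit_predict, param, n_splits=20, train_fraction=0.5, seed=0):
    """Pooled misclassification rate over repeated random splits of (X, y)."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_sub_train = int(round(train_fraction * len(X)))
    errors = 0
    tested = 0
    for _ in range(n_splits):
        rng.shuffle(indices)
        train, test = indices[:n_sub_train], indices[n_sub_train:]
        predicted = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in test], param)
        errors += sum(p != y[i] for p, i in zip(predicted, test))
        tested += len(test)
    return errors / tested


def select_parameter(X, y, fit_predict, candidates, **kwargs):
    """Return the candidate with the lowest MRMCCV and the rate of every candidate."""
    rates = {c: mrmccv(X, y, fit_predict, c, **kwargs) for c in candidates}
    # min() keeps the first minimum, so ties go to the earliest candidate.
    best = min(candidates, key=lambda c: rates[c])
    return best, rates


def knn_fit_predict(train_X, train_y, test_X, k):
    """Stand-in classifier (k-nearest neighbours on 1-D data), NOT PLSDA."""
    out = []
    for x in test_X:
        nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
        votes = sum(train_y[i] for i in nearest)
        out.append(1 if votes > 0 else -1)
    return out


# Toy data: two well-separated one-dimensional classes coded +1 and -1.
X_toy = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
y_toy = [1] * 10 + [-1] * 10
best_k, rates = select_parameter(X_toy, y_toy, knn_fit_predict, [1, 3, 5])
```

Listing the candidates in ascending order makes ties resolve toward the smaller value, which for PLSDA would favour fewer latent variables and so lean against overfitting; this tie-breaking rule is a design choice of the sketch, not something reported in the study.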

Conclusion
The results in this paper demonstrate the feasibility of combining NIR spectroscopy and PLSDA for discriminating the geographical origin of Tieguanyin tea. The sensitivity and specificity of the PLSDA model based on SNV-preprocessed spectra reached 0.931 and 1.000, respectively. In contrast to traditional methods [26,27], an NIR spectrum of a sample can be acquired within a minute, and the PLSDA model needs only several seconds to make a prediction. Moreover, unlike other NIR-chemometrics methods [16,17], the samples in this study were scanned by NIR without any pretreatment such as grinding or smashing, so the method is also nondestructive. For geographical identification of teas, our future work will evaluate other sensitive measuring instruments, for example, inductively coupled plasma mass spectrometry and atomic absorption spectroscopy, and compare them comprehensively with NIR.