Mid-Infrared Spectroscopy for Coffee Variety Identification: Comparison of Pattern Recognition Methods

The potential of mid-infrared transmittance spectroscopy combined with pattern recognition algorithms to identify coffee variety was investigated. Four coffee varieties from China were studied: Typica Arabica coffee from Yunnan Province, Catimor Arabica coffee from Yunnan Province, Fushan Robusta coffee from Hainan Province, and Xinglong Robusta coffee from Hainan Province. Ten pattern recognition methods were applied to the optimal wavenumbers selected by principal component analysis loadings. According to their classification accuracy for coffee variety identification, these methods were classified as highly effective methods (soft independent modelling of class analogy, support vector machine, back propagation neural network, radial basis function neural network, extreme learning machine, and relevance vector machine), methods of medium effectiveness (partial least squares-discrimination analysis, K nearest neighbors, and random forest), and methods of low effectiveness (Naive Bayes classifier).


Introduction
Coffee is one of the most important and popular beverages in the world. Coffee plants are cultivated in over 70 countries. Coffee trade and consumption are an important source of income for many people and many countries [1]. Because coffee is grown over a vast territory and in many varieties, the quality of coffee beans varies and is significantly related to the growth conditions and the variety. Identification of coffee bean varieties is therefore crucial for coffee trade and consumption.
Since identification of coffee variety by mid-infrared spectroscopy has been proved feasible, the methods used to build classification models should be studied further to obtain better and more robust classification results. Many pattern recognition methods have been applied to spectral data analysis for classification, especially supervised methods for constructing classification models. Different pattern recognition methods give different results because their algorithmic principles differ. In many studies, at least 3 methods were used and compared. In this study, we applied 10 pattern recognition methods to coffee variety identification in order to select the optimal methods for practical application: partial least squares-discrimination analysis (PLS-DA) [29], K nearest neighbors (KNN) [30], SIMCA [31], support vector machine (SVM) [32], back propagation neural network (BPNN) [33], radial basis function neural network (RBFNN) [34], extreme learning machine (ELM) [35], random forest (RF) [36], Naive Bayes classifier [37], and relevance vector machine (RVM) [38]. Among these methods, PLS-DA, KNN, SIMCA, SVM, and BPNN are the most widely used in spectral data analysis.
The main objective of this study was to use mid-infrared spectroscopy for coffee bean variety identification with different pattern recognition methods. The specific objectives were to (1) select the optimal wavenumbers which contributed most to the identification of coffee bean varieties; (2) build classification models by using 10 different pattern recognition methods; (3) compare and select the most effective pattern recognition methods for coffee bean variety identification.

Sample Preparation.
Four varieties of coffee beans from China (Typica Arabica coffee from Yunnan Province, Catimor Arabica coffee from Yunnan Province, Fushan Robusta coffee from Hainan Province, and Xinglong Robusta coffee from Hainan Province) were collected. All coffee beans were collected in the same year and medium roasted. Six hundred coffee beans of each variety were collected and stored in a vacuum glass box. Twenty coffee beans of each variety were ground, screened through an 80-mesh sieve, and dried to form one sample, and 30 samples of each variety were prepared. The samples of each variety were randomly divided into the training set and the test set at a ratio of 2 : 1 (20 samples of each variety for training and 10 samples of each variety for testing).
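The 2 : 1 per-variety split described above can be sketched as follows. This is only an illustration in plain Python, not the procedure actually used in the study; the function name `stratified_split` and the fixed random seed are assumptions introduced here.

```python
import random

def stratified_split(labels, n_train_per_class, seed=0):
    """Return (train_idx, test_idx) with a fixed number of training samples per class."""
    rng = random.Random(seed)  # fixed seed (assumption) so the split is reproducible
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)  # random assignment within each variety
        train.extend(idxs[:n_train_per_class])
        test.extend(idxs[n_train_per_class:])
    return sorted(train), sorted(test)

# 4 varieties coded 1..4, 30 samples each, as in the text
labels = [variety for variety in (1, 2, 3, 4) for _ in range(30)]
train_idx, test_idx = stratified_split(labels, n_train_per_class=20)
print(len(train_idx), len(test_idx))  # 80 training samples, 40 test samples
```

Splitting within each variety, rather than over the pooled 120 samples, guarantees the 20/10 balance per variety stated in the text.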

Mid-Infrared Spectra Collection.
The mid-infrared spectra of the samples were acquired by a Jasco FT/IR-4100 spectrometer (Japan) in the spectral range of 400-4000 cm⁻¹. For each sample, 20 mg was mixed with 980 mg of KBr powder, and the mixture was ground and mixed thoroughly. The mixture was pressed into a tablet in a tablet machine, and the sample tablet was used for transmittance MIR spectral data collection. For each sample, 32 scans were acquired at a resolution of 4 cm⁻¹, and the average of the 32 spectra was used as the transmittance spectrum of the sample.

Multivariate Analysis Methods.
Principal component analysis (PCA) is the most widely used method for qualitative analysis of spectral data. PCA linearly transforms the original data into new variables (principal components (PCs)), which are linear combinations of the original variables. The first PC (PC1) lies in the direction of maximum variance, the second PC (PC2) lies in the direction of the second largest variance, and so on for the remaining PCs. Generally, the first few PCs explain most of the variance of the original data. In many cases, the 2D scores scatter plot of PC1 and PC2 is used to present the sample distribution, especially for classification problems [39,40].
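To illustrate how PC1, its scores, and its share of the total variance arise, the following plain-Python sketch applies power iteration to the covariance matrix. It is not the Unscrambler implementation used in the study; the function names and the toy data are hypothetical.

```python
import math

def center(X):
    """Subtract the column means from every row."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_pc(X, iters=200):
    """Return (loading vector, scores, explained-variance ratio) of PC1."""
    Xc = center(X)
    C = covariance(Xc)
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p  # start vector; assumed not orthogonal to PC1
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Cv = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * Cv[i] for i in range(p))  # variance along PC1
    total_variance = sum(C[i][i] for i in range(p))
    scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
    return v, scores, eigenvalue / total_variance

# Toy data lying exactly on a line, so PC1 carries all of the variance
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
v, scores, ratio = first_pc(toy)
```

The loading vector `v` plays the role of the PCA loadings discussed later in the text, and `scores` are the coordinates plotted in a scores scatter plot.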
PLS-DA is a supervised pattern recognition method based on partial least squares regression (PLSR). PLS-DA conducts PLSR with integers representing the categories as Y. The outputs of PLS-DA are real numbers rather than integers. To determine which category a sample belongs to, a threshold value should be set. In this study, the threshold value was set as 0.5, meaning that the sample is assigned to the category whose integer is nearest to the predicted value [29]. Two approaches can be used in PLSR, PLS1 and PLS2. When the Y response consists of only 1 variable, PLS1 is applied. When the Y response contains more than one variable, PLS2 is used. In this study, the Y response contained only one variable, and PLS1-DA was used [41].
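The 0.5 threshold rule can be written as a small decision function. This is a sketch only: the category codes 1 to 4 follow the text, but treating a prediction that is not within 0.5 of any category code as unclassified (`None`) is an assumption made here, and `assign_category` is a hypothetical name.

```python
def assign_category(y_pred, categories=(1, 2, 3, 4), threshold=0.5):
    """Return the category whose integer code lies within `threshold` of y_pred.

    Predictions farther than `threshold` from every code return None
    (an assumption of this sketch, not stated in the source).
    """
    for code in categories:
        if abs(y_pred - code) < threshold:
            return code
    return None

print(assign_category(2.3))   # within 0.5 of category 2
print(assign_category(3.62))  # within 0.5 of category 4
print(assign_category(4.8))   # no category within 0.5
```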
KNN is a supervised pattern recognition method. KNN calculates the distances between samples, and the category of a sample is determined by its K nearest neighbors (the K samples with the smallest distances from the sample). The sample is classified by a majority vote of its K nearest neighbors [30].
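A minimal plain-Python sketch of this voting rule with Euclidean distance is given below. The toy data and the function name `knn_predict` are hypothetical, and breaking ties in favor of the class encountered first among the nearest neighbors is an assumption of this sketch.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    # Sort training samples by Euclidean distance to x
    neighbors = sorted((math.dist(row, x), label)
                       for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in neighbors[:k])
    # most_common keeps first-seen order for ties, i.e. the nearer class wins
    return votes.most_common(1)[0][0]

# Hypothetical two-feature toy data with two categories
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [1, 1, 1, 2, 2, 2]
print(knn_predict(train_X, train_y, [0.2, 0.1]))
print(knn_predict(train_X, train_y, [5.5, 5.5]))
```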
SIMCA is a supervised pattern recognition method based on PCA scores. SIMCA first conducts PCA on each category and determines the optimal number of PCs for classification. The classification procedure is then implemented based on the PCA residuals [31].
SVM is a widely used machine learning method for regression and classification. SVM maps the original data into a high dimensional space and finds the hyperplane which has the largest distance to the nearest data point of any category. The samples are then classified using this hyperplane. The choice of kernel function is essential to the performance of SVM [32].
BPNN is a widely used artificial neural network (ANN). BPNN uses error back propagation to modify the internal network weights after each training epoch until the target training error or the maximum number of training epochs is reached [33].
RBFNN is another widely used artificial neural network. RBFNN uses a radial basis function (RBF) as the activation function. RBFNN typically has three layers: an input layer, a hidden layer with a nonlinear RBF activation function, and a linear output layer. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters [34].
ELM is a single-hidden layer feedforward network (SLFN). ELM is a fast, simple method for regression and classification. In ELM, only the number of hidden neurons needs to be set to obtain a unique optimal solution [35].
RF is an ensemble method that uses a multitude of decision trees. RF constructs different decision trees, and the decision trees are independent of each other. To construct a random forest, the samples used to train each decision tree are randomly selected from the training sample set by sampling with replacement. The features used at each node of a decision tree are also randomly selected from the features of the training sample set. The output classification result is based on the outputs of the individual decision trees [36].
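The two sources of randomness in RF, sampling the training samples with replacement and drawing a random feature subset at a node, can be sketched in plain Python as follows. The sizes (80 training samples, 29 features, 5 features per node) follow values reported elsewhere in this paper; the function names and the seed are hypothetical.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement (some indices repeat, some are left out)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, n_pick, rng):
    """Draw n_pick distinct feature indices for one tree node."""
    return sorted(rng.sample(range(n_features), n_pick))

rng = random.Random(0)  # fixed seed (assumption) for reproducibility
tree_rows = bootstrap_sample(80, rng)          # training rows for one tree
node_features = random_feature_subset(29, 5, rng)  # candidate features at one node
print(len(set(tree_rows)), node_features)
```

Each tree sees a different bootstrap sample, which is what makes the trees differ from one another even though they are built from the same training set.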
Naive Bayes classifier is a supervised pattern recognition method based on Bayes' theorem. It makes a strong assumption that the features are independent of each other. Naive Bayes classifier is a probabilistic classifier in which each feature contributes independently to the probability of the sample category [37].
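The independence assumption means that the class score is simply the log prior plus a sum of per-feature log likelihoods. The sketch below assumes Gaussian per-feature likelihoods and empirical class priors; the source does not state which likelihood model was used, so the Gaussian choice, the small variance floor, the function names, and the toy data are all assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the empirical prior, per-feature mean, and variance for each class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        n, p = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(p)]
        # 1e-9 floor (assumption) avoids division by zero for constant features
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9
                     for j in range(p)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            # Features contribute independently: log likelihoods simply add up
            score += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical two-feature toy data with two categories
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]]
y = [1, 1, 1, 2, 2, 2]
nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, [1.1, 2.0]))
```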
RVM is a machine learning method based on Bayesian inference. RVM is a special case of sparse Bayesian modelling. RVM has the same functional form as SVM and shares many features with it. Unlike SVM, RVM provides probabilistic classification [38].

Optimal Wavenumber Selection.
The acquired transmittance spectra of 400-4000 cm⁻¹ contained 3734 spectral data points at a spectral resolution of 4 cm⁻¹. Such a large amount of data increases the computation time and the hardware requirements. In addition, redundant information can result in complex, unstable, and inaccurate models. Optimal wavenumber (wavelength) selection is generally used to solve these problems. Optimal wavenumber (wavelength) selection methods work by selecting a small number of wavenumbers (wavelengths) which carry most of the information in the original spectral data. In this study, PCA loadings were used to select the optimal wavenumbers. The peaks and valleys of the loading plots of the first few PCs were selected as the optimal wavenumbers, since these wavenumbers contribute most to the loadings of each PC [42].
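Picking the peaks and valleys of a loading curve can be sketched as a search for strict local extrema. This is a simplified illustration: the source does not say whether the selection was automated or done by inspecting the loading plots, and a practical version would likely also require a minimum peak prominence. The wavenumber grid and loading values below are hypothetical.

```python
def local_extrema(loadings):
    """Indices where the loading is a strict local maximum (peak) or minimum (valley)."""
    indices = []
    for i in range(1, len(loadings) - 1):
        left, mid, right = loadings[i - 1], loadings[i], loadings[i + 1]
        is_peak = mid > left and mid > right
        is_valley = mid < left and mid < right
        if is_peak or is_valley:
            indices.append(i)
    return indices

# Hypothetical wavenumbers (cm^-1) and PC loading values, for illustration only
wavenumbers = [700, 704, 708, 712, 716, 720, 724]
loading = [0.01, 0.05, 0.02, -0.03, -0.01, 0.04, 0.03]
selected = [wavenumbers[i] for i in local_extrema(loading)]
print(selected)  # [704, 712, 720]
```

Repeating this for each of the first few PCs and taking the union of the selected positions would give a reduced variable set of the kind described in the text.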

Model Evaluation and Software.
The classification accuracy of the models was evaluated as the ratio of the number of correctly classified samples to the total number of samples (correct classification rate) in the training set and in the test set. The higher the classification accuracy, the better the performance of the model. KNN, SIMCA, SVM, BPNN, RBFNN, ELM, RF, Naive Bayes classifier, and RVM models were built in MATLAB R2010b (The MathWorks, Natick, USA), and PLS-DA and PCA were conducted in Unscrambler® 10.1 (CAMO AS, Oslo, Norway).
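The correct classification rate defined above amounts to the following computation, shown here as a plain-Python sketch with hypothetical labels and a hypothetical function name.

```python
def accuracy_percent(y_true, y_pred):
    """Correct classification rate: 100 * (correctly classified) / (total samples)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * n_correct / len(y_true)

# Hypothetical labels: 7 of 8 samples are classified correctly
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 2, 3, 4, 4, 4]
print(accuracy_percent(y_true, y_pred))  # 87.5
```

With the sample sizes used in this study (80 training and 40 test samples), each misclassified sample changes the accuracy by 1.25% in the training set and by 2.5% in the test set.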

Results and Discussion
Spectral Profiles.
Because of the noise at the beginning and the end of the collected transmittance spectra in the range of 400-4000 cm⁻¹, only the spectra of 700-3600 cm⁻¹ were used for analysis. The raw transmittance spectra are shown in Figure 1(a). Random noise could be observed in the spectral data, so preprocessing of the spectral data was necessary to reduce the noise. Wavelet transform (WT) is an efficient tool to remove noise from signals by a wavelet series with different spatial and frequency properties, and it has been used to remove noise from spectral data [43]. In this study, the wavelet function Daubechies 4 (db4) with a decomposition level of 4 was applied after trials of different wavelet functions with different decomposition levels.
The raw spectrum and the spectrum preprocessed by WT of a randomly selected sample are shown in Figure 1(a). The preprocessed spectrum was much smoother than the raw spectrum, and the critical transmittance peaks and valleys were retained. The average spectra (Figure 1(b)) show that the transmittance spectra of the 4 varieties followed the same trend, while the transmittance values differed markedly between varieties.

Principal Component Analysis.
PCA was conducted on the preprocessed spectral data. The scores scatter plot of PC1 and PC2 is shown in Figure 2. PC1 and PC2 explained 91.275% and 4.159% of the total variance, respectively. It could be observed in Figure 2 that most of the samples could be distinguished from the samples of the other varieties, indicating the feasibility of coffee variety identification. Some overlaps were observed, indicating that further analysis was needed for coffee variety identification.

Optimal Wavenumbers Selection.
The first 4 PCs explained over 99.310% of the total variance. The loadings of the first 4 PCs were used to select the optimal wavenumbers. The peaks and valleys of the loading plot were selected (shown in Figure 3). In all, 29 optimal wavenumbers were selected, and the selected optimal wavenumbers are shown in Table 1.

Classification Models on Optimal Wavenumbers.
Compared with the original data, the selection of optimal wavenumbers significantly reduced the number of input variables by 99.04%. The classification models were built on the optimal wavenumbers, and the results are shown in Table 2. To build classification models, the four varieties of coffee were assigned the category values of 1, 2, 3, and 4.
PLS-DA models were built with the spectral data as X and the category values as Y, with leave-one-out cross validation. The threshold value of the PLS-DA model was set as 0.5. The optimal number of latent variables (LVs) was determined by the minimum Y residual variance. The classification accuracies of the training set and the test set were 86.25% and 80.00% with 7 LVs, indicating good classification results.
The number of nearest neighbors was important for the KNN model. To obtain optimal results, the number of nearest neighbors was varied from 3 to 10. Euclidean distance was used as the distance between samples. The highest classification accuracy was achieved with 3 nearest neighbors. The classification accuracies of the training set and the test set were 95.00% and 90.00%, respectively.
For the SIMCA model, different numbers of PCs were used and compared for each variety, and the best results were achieved with 10 PCs for each variety. The classification accuracies of the training set and the test set were 96.25% and 95.00%, respectively.
For the BPNN model, the learning rate was 0.1 and the number of iteration epochs was 1000. The number of neurons in the hidden layer is determined by $h = \sqrt{m + n} + a$, where $h$ is the number of neurons in the hidden layer, $m$ is the number of neurons in the input layer, $n$ is the number of neurons in the output layer, and $a$ is a constant between 1 and 10 [43]. After comparing the performances of BPNN models with different numbers of neurons in the hidden layer, the optimal number was determined as 7, with classification accuracies of 100% in both the training set and the test set.
The number of nodes in the hidden layer was important in ELM models. For the ELM model, the optimal number of nodes in the hidden layer was determined by a stepwise search from 1 to 80 with a step of 1. The optimal classification accuracy was obtained with 52 nodes in the hidden layer. The classification accuracies of the training set and the test set were 100.00% and 97.50%, respectively.
For the Naive Bayes classifier, the empirical prior probabilities of the classes were used. The classification accuracies of the training set and the test set were both 72.50%.
For the RBFNN model, the determination of the spread value was important. In this study, the spread value was explored from 1 to 20, and the corresponding RBFNN models were built. The classification accuracies of the training set and the test set were both 100% with a spread value of 5.
For SVM, RBF was used as the kernel function. A grid search procedure was applied to search for the optimal penalty coefficient and the optimal kernel parameter of RBF. The classification accuracies of the training set and the test set were 100.00% and 95.00% with an optimal penalty coefficient of 27.8576 and an optimal kernel parameter of 0.0068.
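To make the role of the kernel parameter concrete, the sketch below evaluates an RBF kernel in the commonly used form exp(-g * ||x - y||^2). The source does not state which parameterization was used, so this form, the symbol `g`, and the sample vectors are assumptions; only the value 0.0068 is taken from the text.

```python
import math

def rbf_kernel(x, y, g):
    """RBF kernel in the assumed form exp(-g * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)

g_reported = 0.0068  # kernel parameter value reported in the text
# Hypothetical vectors with squared distance 25
similarity = rbf_kernel([0.0, 0.0], [3.0, 4.0], g_reported)
print(round(similarity, 4))
```

Under this form, a small kernel parameter gives a slowly decaying kernel, so samples that are fairly far apart in the 29-dimensional input space still influence each other. A grid search simply repeats model training over a grid of candidate penalty coefficient and kernel parameter pairs and keeps the pair with the best validation accuracy.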
For the RF model, the number of trees in the forest was set from 50 to 500, and the number of features used at each node was 5. The optimal number of trees was determined by the performances of the RF models. The optimal classification accuracies of the training set and the test set were 100.00% and 92.50% with 50 trees.
For the RVM model, RBF was selected as the kernel function, and the optimal kernel parameter was searched from 0.1 to 1. The RVM model obtained classification accuracies of 100.00% in the training set and 95.00% in the test set with the optimal kernel parameter of 0.6.
It could be noted that the classification results of the different models differed. The results of BPNN and RBFNN were excellent, with classification accuracies of 100% in both the training set and the test set, while the results of the Naive Bayes classifier were poor, with classification accuracies lower than 80%. In general, the nonlinear classification models (SVM, BPNN, RBFNN, ELM, RF, and RVM) showed better results than the linear classification models (SIMCA, PLS-DA, KNN, and Naive Bayes classifier) in this study. The reason might be that the selected optimal wavenumbers contained more nonlinear features. In the study of Balabin et al. [47], the classification models were divided into three categories by their classification accuracy: highly effective methods, methods of medium effectiveness, and methods of low effectiveness. In this study, the ten methods for coffee variety classification were divided into the same three categories by classification accuracy. The methods with a classification accuracy of at least 95% in both the training set and the test set were classified as highly effective methods, including SIMCA, SVM, ELM, BPNN, RBFNN, and RVM. The remaining methods with a classification accuracy of at least 80% in both the training set and the test set were classified as methods of medium effectiveness, including PLS-DA, KNN, and RF. The methods with a classification accuracy under 80% in the training set and the test set were classified as methods of low effectiveness, including the Naive Bayes classifier. Moreover, all models were built on a computer with an Intel Core i7 processor and 16 GB of memory; the computation time was less than 5 seconds, and the differences in computation time between the models were quite small.
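The three-tier grouping can be reproduced from the accuracies reported in this section with a few lines of plain Python. The accuracy values are those stated in the text; reading the thresholds as inclusive ("at least 95%" and "at least 80%") is an interpretation made here, needed so that models with exactly 95.00% or 80.00% test accuracy fall into the tiers named by the authors.

```python
# (training accuracy %, test accuracy %) as reported in the text
reported = {
    "PLS-DA": (86.25, 80.00),
    "KNN": (95.00, 90.00),
    "SIMCA": (96.25, 95.00),
    "BPNN": (100.00, 100.00),
    "ELM": (100.00, 97.50),
    "Naive Bayes": (72.50, 72.50),
    "RBFNN": (100.00, 100.00),
    "SVM": (100.00, 95.00),
    "RF": (100.00, 92.50),
    "RVM": (100.00, 95.00),
}

def tier(train_acc, test_acc):
    """Assign an effectiveness tier from the lower of the two accuracies."""
    worst = min(train_acc, test_acc)
    if worst >= 95.0:   # inclusive threshold (interpretation, see lead-in)
        return "high"
    if worst >= 80.0:
        return "medium"
    return "low"

tiers = {name: tier(*acc) for name, acc in reported.items()}
for name in sorted(tiers):
    print(name, tiers[name])
```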
As for the 4 coffee varieties, variety 3 (Fushan Robusta coffee from Hainan Province) and variety 4 (Xinglong Robusta coffee from Hainan Province) were more likely to be misclassified in all classification models, indicating that the differences between these two varieties were smaller.
The overall results indicated that MIR spectroscopy with pattern recognition methods could efficiently identify the coffee varieties. The number of inputs of all models was significantly reduced compared with the original data, and the computation time of the models showed no significant difference. The results showed that although ELM, RBFNN, and RVM are not frequently used in spectral data analysis, these methods could also be quite effective and promising for spectral data analysis and online application.

Conclusion
Mid-infrared spectroscopy combined with 10 different pattern recognition methods was successfully used to identify coffee varieties. The collected transmittance spectra were preprocessed by wavelet transform with the db4 wavelet function and a decomposition level of 4. The scores scatter plot of PCA showed the feasibility of identifying coffee varieties, and 29 optimal wavenumbers were selected by the loadings of the first 4 PCs. Ten classification models were built on the optimal wavenumbers. SIMCA, SVM, ELM, BPNN, RBFNN, and RVM models were classified as highly effective methods, with classification accuracies of at least 95% in the training set and the test set; PLS-DA, KNN, and RF were classified as methods of medium effectiveness, with classification accuracies of at least 80% in the training set and the test set; the Naive Bayes classifier was classified as a method of low effectiveness, with classification accuracy lower than 80%. Because of the optimal wavenumber selection, there was no significant difference in computation time between the methods. The highly effective methods are recommended for practical application. SVM, ELM, BPNN, RBFNN, and RVM models showed advantages in this study and provide more alternatives for other studies.

Figure 1: The raw mid-infrared spectrum and the mid-infrared spectrum preprocessed by wavelet transform of a randomly selected sample (a) and the average spectra of 4 varieties after preprocessing (b).

Table 1: The selected wavenumbers by principal component analysis loadings.

Table 2: Classification results of the models on optimal wavenumbers (columns: Nr/Nt and accuracy (%) for the training set and the test set). a: Nr was the number of correctly classified samples; b: Nt was the total number of the samples.