Quality prediction models are constructed based on multivariate statistical methods, including ordinary least squares regression (OLSR), principal component regression (PCR), partial least squares regression (PLSR), and modified partial least squares regression (MPLSR). The prediction model constructed by MPLSR achieves superior results compared with the other three methods in both fitting efficiency and prediction ability. Building on this model, further research is dedicated to selecting key variables that directly predict the product quality with satisfactory performance. The prediction models presented are more efficient than traditional ones and can support human experts in evaluating and classifying product quality. The effectiveness of the quality prediction models is finally illustrated and verified on a practical red wine data set.
An accurate evaluation of wine quality is important for vintners in wine classification and target marketing. However, since influenced by numerous factors such as grape varieties, yeast strains, winemaking technologies, and human experience [
Among the various factors that influence wine quality, the grape is the most basic and important one for making high-quality wine, and some grape physicochemical indexes are strongly related to the wine quality. Since the sugar in grapes is the raw material from which yeast produces alcohol, its content plays a crucial role in the fermentation process and largely determines the alcohol level of the wine [
In this paper, based on a data set of wine qualities and grape physicochemical indexes, wine quality prediction models are constructed with multivariate statistical methods. To obtain an efficient wine quality prediction model, the models established by ordinary least squares regression, principal component regression, partial least squares regression, and a modified partial least squares regression are compared and the best one is selected. The calibration set includes 50 grape physicochemical indexes, and dealing with so much data is a time-consuming and complex task. Correlation analysis shows that multicollinearity exists among some of the grape physicochemical indexes. Within the framework of the wine quality prediction model, a suggestion is therefore proposed for using fewer grape physicochemical indexes to predict the wine quality while preserving prediction accuracy. Compared with the majority of the methods reported for wine quality analysis, the method discussed in this paper provides a simpler and more convenient way to predict the wine quality. Most notably, the proposed method makes it more feasible for winemakers to predict the wine quality before the winemaking process is complete and to make decisions in advance, such as grape selection, wine classification, and target marketing.
Although investigations that evaluate wine quality based only on grape physicochemical indexes are rare in the wine science community, the relationship between grapes and metabolites in wine is well documented in the literature. Since multiple biochemical processes occur as grapes ripen, the harvest time influences the wine composition. For example, the yeast-derived metabolites, including volatile esters, dimethyl sulfide, glycerol, and mannoproteins, will increase with harvest date [
The rest of this paper is structured as follows. Section
Multivariate statistical analysis is a powerful tool for wine analysis problems, which often involve large amounts of data, and it has been widely applied in many relevant studies, such as analyzing the elements in wines by PLS regression [
Ordinary least squares regression (OLSR) is a standard approach for finding an approximate solution of an overdetermined system by minimizing the sum of the squared errors of the individual equations. OLSR is most often applied when only a one-dimensional response variable is involved. However, it can also handle response variables of more than one dimension, and the algorithm can be briefly introduced as follows.
Collect
Minimize the sum of errors
According to the invertibility of
Use the
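As a minimal sketch of the least-squares idea, assuming NumPy and synthetic data (the function name `olsr_fit` and all values are illustrative and not taken from the paper), the coefficient matrix for a multidimensional response can be computed as follows.

```python
import numpy as np

def olsr_fit(X, Y):
    """Return B minimizing the squared error ||Y - X B||^2 (least squares)."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B

# Synthetic, noise-free example (hypothetical data, not the wine data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))      # 30 samples, 4 independent variables
B_true = rng.normal(size=(4, 2))  # two-dimensional response
Y = X @ B_true

B_hat = olsr_fit(X, Y)

# When X.T @ X is invertible, the normal equations give the same solution
B_normal = np.linalg.solve(X.T @ X, X.T @ Y)
```

Using `np.linalg.lstsq` rather than forming the inverse explicitly is a design choice made here for numerical stability; both routes agree when the independent variables are not collinear.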
In practice, the independent variables may be highly collinear. This phenomenon is the so-called multicollinearity and it is known that such collinearity problems can sometimes lead to serious stability problems when the OLSR is applied [
Perform normalization on the gathered measurable variables and the response variables, presented as
Implement singular value decomposition (SVD) on the covariance matrix of independent variable set:
Use appropriate criteria [
Perform OLS regression between the score matrix
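The PCR steps above can be sketched roughly as follows, assuming NumPy, zero-mean and unit-variance normalization, and a user-chosen number of components. The names `pcr_fit` and `pcr_predict` and the synthetic data are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def pcr_fit(X, Y, n_components):
    # Step 1: normalize the independent and response variables
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean, y_std = Y.mean(axis=0), Y.std(axis=0, ddof=1)
    Xn = (X - x_mean) / x_std
    Yn = (Y - y_mean) / y_std
    # Step 2: SVD of the covariance matrix of the independent variables
    cov = Xn.T @ Xn / (Xn.shape[0] - 1)
    U, _, _ = np.linalg.svd(cov)
    # Step 3: keep the leading principal directions (the count is chosen by the user here)
    P = U[:, :n_components]
    # Step 4: OLS regression between the score matrix and the responses
    T = Xn @ P
    Q, *_ = np.linalg.lstsq(T, Yn, rcond=None)
    return {"x_mean": x_mean, "x_std": x_std,
            "y_mean": y_mean, "y_std": y_std, "B": P @ Q}

def pcr_predict(model, X):
    Xn = (X - model["x_mean"]) / model["x_std"]
    return Xn @ model["B"] * model["y_std"] + model["y_mean"]

# Synthetic example (hypothetical data): a noise-free linear response
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Y = X @ rng.normal(size=(5, 1)) + 3.0

model_full = pcr_fit(X, Y, n_components=5)   # all components retained
model_small = pcr_fit(X, Y, n_components=2)  # reduced model
err_full = np.abs(pcr_predict(model_full, X) - Y).max()
err_small = np.abs(pcr_predict(model_small, X) - Y).max()
```

With all components retained the sketch reduces to OLS on the normalized data; discarding components trades some fitting accuracy for robustness against collinearity.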
Different from the PCR, which only considers the outer relation of
Normalize the collected data sets of
Iteratively calculate the following equations
Store
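For illustration, the classical NIPALS form of PLS regression is sketched below, assuming NumPy and data that have already been centered. This is the textbook algorithm and may differ in detail from the iteration used in the paper; the function name `pls_nipals`, the tolerance, and the synthetic data are assumptions.

```python
import numpy as np

def pls_nipals(X, Y, n_components, tol=1e-10, max_iter=500):
    """Classical NIPALS PLS; X and Y are assumed to be centered already."""
    X = np.array(X, dtype=float)  # work on copies, since both are deflated
    Y = np.array(Y, dtype=float)
    W = np.zeros((X.shape[1], n_components))
    P = np.zeros((X.shape[1], n_components))
    Q = np.zeros((Y.shape[1], n_components))
    for a in range(n_components):
        u = Y[:, [0]]
        t_old = None
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)       # weight vector
            t = X @ w                    # score vector of X
            q = Y.T @ t / (t.T @ t)      # loading of Y
            u = Y @ q / (q.T @ q)        # score vector of Y
            if t_old is not None and np.linalg.norm(t - t_old) < tol * np.linalg.norm(t):
                break
            t_old = t
        p = X.T @ t / (t.T @ t)          # loading of X
        X -= t @ p.T                     # deflate X
        Y -= t @ q.T                     # deflate Y
        W[:, [a]], P[:, [a]], Q[:, [a]] = w, p, q
    # Regression coefficients expressed in terms of the original variables
    return W @ np.linalg.solve(P.T @ W, Q.T)

# Synthetic example (hypothetical data, not the wine data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = X @ rng.normal(size=(4, 1)) + 0.1 * rng.normal(size=(30, 1))
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)

B_pls = pls_nipals(Xc, Yc, n_components=4)
B_ols, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
```

When as many latent variables are kept as there are independent variables, the PLS coefficients coincide with the OLS solution; using fewer latent variables is what provides the dimension reduction.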
Both PCR and PLSR project the original data set onto a principal component space or latent variable space and a residual space, and the lower dimension of the principal component or latent variable space overcomes the multicollinearity problem that hampers OLSR. However, this dimension reduction also carries the risk that useful information is lost when the PCs or LVs are selected. Although OLSR builds a model that retains all of the information in the original data set, it breaks down when the data set suffers from multicollinearity or when the number of samples is smaller than the number of variables, two phenomena that are common in practice. In order to solve these problems, Yin et al. proposed the MPLSR, which has been validated on the industrial benchmark of Tennessee Eastman process for fault detection and a good result has been obtained [
Gather all the
Calculate the regression coefficient matrix
Obtain the final model of MPLSR:
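As a rough sketch only, and assuming that the regression coefficient matrix is obtained through a pseudo-inverse-based least-squares solution (an assumption made here for illustration; the full MPLSR algorithm of Yin et al. may contain further steps), the behavior for fewer samples than variables can be illustrated with NumPy. The name `pinv_regression` and the random data are hypothetical.

```python
import numpy as np

def pinv_regression(X, Y):
    # Minimum-norm least-squares solution; it remains defined even when
    # X.T @ X is singular (collinear variables or fewer samples than variables)
    return np.linalg.pinv(X) @ Y

# Hypothetical data with the same shape as the calibration set: 27 samples, 50 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(27, 50))
Y = rng.normal(size=(27, 1))

M = pinv_regression(X, Y)
residual = Y - X @ M
rmsec = float(np.sqrt(np.mean(residual ** 2)))  # calibration error of the fit
```

Under this assumption the calibration samples are reproduced exactly whenever the sample matrix has full row rank, which would be consistent with a calibration error of zero; it says nothing by itself about the prediction error on new samples.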
Practically all wines are made by a common process, from grape harvesting to bottling. For red wines, the winemaking process can be roughly divided into six steps.
Firstly, grape harvesting is performed to supply the raw material for winemaking. Because the grape composition may positively or negatively influence wine chemistry and sensory properties and varies with the maturity of the grapes, the harvest dates should be carefully considered. Traditionally, the levels of grape sugar, acids, and pH are used to determine the harvest dates. However, Jackson and Lombard have proposed that such measures alone are not sufficient to accurately predict wine composition because many key grape-derived compounds in wine do not track with sugar accumulation [
The natural yeast already present on grapes may give unpredictable results depending on the exact types of yeast; thus cultured yeast is always added to the must in order to ensure a successful fermentation. The time consumed by the fermentation process varies depending on the type of grape and the methods adopted by winemakers. Generally, the fermentation, which takes place in large vats, lasts 10 to 30 days [
Flow sheet of winemaking.
In this section, multivariate statistical methods are utilized to establish the models of the relationship between the grape physicochemical indexes and the wine quality. The fitting efficiency and predicting ability of these methods are compared by corresponding figures or indexes. After analyzing the influence of grape physicochemical indexes on the wine quality, a suggestion is proposed to reduce the usage of grape physicochemical indexes in efficient wine quality prediction.
The data set (mathematical modeling official site:
Corresponding to grape samples, wine samples have been vinified and collected simultaneously to avoid the differences in wine quality caused by different vintages, which influence the compositions in wine significantly [
The wine quality of one sample given by human experts.
Human expert | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean points |
---|---|---|---|---|---|---|---|---|---|---|---|
Presentation (15 points) | 10 | 12 | 10 | 7 | 10 | 12 | 12 | 10 | 11 | 10 | 10.4 |
Fragrance (30 points) | 18 | 24 | 24 | 18 | 24 | 22 | 22 | 18 | 27 | 20 | 21.7 |
Mouth feel (44 points) | 24 | 33 | 37 | 29 | 28 | 26 | 22 | 26 | 34 | 29 | 28.8 |
Overall feeling (11 points) | 8 | 9 | 10 | 8 | 8 | 7 | 8 | 8 | 9 | 8 | 8.3 |
Sum points (100 points) | 60 | 78 | 81 | 62 | 70 | 67 | 64 | 62 | 81 | 67 | 69.2 |
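The arithmetic behind the table can be reproduced with a short NumPy sketch; the scores are copied from the table, while the variable names are illustrative.

```python
import numpy as np

# Rows: presentation, fragrance, mouth feel, overall feeling; columns: experts 1-10
scores = np.array([
    [10, 12, 10,  7, 10, 12, 12, 10, 11, 10],   # presentation (15 points)
    [18, 24, 24, 18, 24, 22, 22, 18, 27, 20],   # fragrance (30 points)
    [24, 33, 37, 29, 28, 26, 22, 26, 34, 29],   # mouth feel (44 points)
    [ 8,  9, 10,  8,  8,  7,  8,  8,  9,  8],   # overall feeling (11 points)
])

sum_points = scores.sum(axis=0)          # total score given by each expert
criterion_means = scores.mean(axis=1)    # mean points per criterion
wine_quality = sum_points.mean()         # mean of the ten sum points
```

The mean of the ten sum points is the single quality value assigned to the sample in the table.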
The data used to construct a model should not also be used to verify its performance. Following this guideline, the samples are divided into two parts, named the calibration set and the verification set. The calibration set, formed by 27 samples, is used to construct the wine quality prediction models, and the verification set, formed by 10 samples, is used to verify the predicting ability of the models.
In order to simplify the problem, some assumptions are made as follows. For all wines involved in this paper, the vinification processes are identical, including the materials added during the winemaking process. All wines have been vinified by winemakers who possess the same vinification experience. The wine qualities have been evaluated by human experts as soon as the wines have been vinified.
All of the above assumptions are made to ensure that the differences in wine quality are caused only by differences in the grapes. This paper merely aims to study the relationship between the grape physicochemical indexes and the wine quality.
As aforementioned, OLSR is not suitable for building the regression equation when the calibration set suffers from multicollinearity or the number of samples is smaller than the number of variables. Correlation analysis of the calibration set reveals strong correlations among some grape physicochemical indexes, especially reducing sugar and glucose, whose correlation coefficient reaches
Apart from OLSR, all the discussed multivariate statistical methods, that is, PCR, PLSR, and MPLSR, have been utilized to construct the relevant models. Two commonly used indexes, the root mean squared error of calibration (RMSEC) and the root mean squared error of prediction (RMSEP), are used here to evaluate the fitting efficiency and prediction ability of the different methods. RMSEC and RMSEP can be described as follows:
The
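A minimal sketch of both indexes, assuming the common definition of the root mean squared error without a degrees-of-freedom correction (the paper's exact formula may differ), is given below. The same function yields RMSEC when applied to the calibration samples and RMSEP when applied to the verification samples; the quality values are hypothetical.

```python
import numpy as np

def rmse(y_actual, y_estimated):
    """Root mean squared error between actual and estimated quality values."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_estimated = np.asarray(y_estimated, dtype=float)
    return float(np.sqrt(np.mean((y_actual - y_estimated) ** 2)))

# Hypothetical actual and predicted quality scores for three verification samples
actual = [70.0, 65.0, 80.0]
predicted = [72.0, 64.0, 77.0]
rmsep_example = rmse(actual, predicted)  # square root of (4 + 1 + 9) / 3
```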
Figures
RMSEC and RMSEP values.
Index | PCR | PLSR | MPLSR |
---|---|---|---|
RMSEC | 4.53 | 2.82 | 0 |
RMSEP | 4.26 | 2.32 | 1.31 |
Correlation coefficients of some grape physicochemical indexes.
First index | Second index | Correlation coefficient |
---|---|---|
Total content of amino acids | Proline | 0.978 |
Resveratrol | CIS resveratrol | 0.978 |
Flavonol | Quercetin | 0.968 |
Reducing sugar | Fructose | 0.967 |
Reducing sugar | Glucose | 0.982 |
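A correlation screening of this kind can be sketched as follows, assuming NumPy and Pearson correlation coefficients. The function name `correlated_pairs`, the threshold of 0.95, and the synthetic columns are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.95):
    """List the index pairs whose absolute correlation reaches the threshold."""
    R = np.corrcoef(X, rowvar=False)  # columns of X are the indexes
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(R[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(R[i, j])))
    return pairs

# Synthetic columns (hypothetical values): reducing sugar nearly proportional to glucose
rng = np.random.default_rng(0)
glucose = rng.normal(size=27)
reducing_sugar = 2.0 * glucose + 0.05 * rng.normal(size=27)
tartaric_acid = rng.normal(size=27)   # generated independently of the sugars

X = np.column_stack([reducing_sugar, glucose, tartaric_acid])
pairs = correlated_pairs(X, ["reducing sugar", "glucose", "tartaric acid"])
```

Pairs flagged in this way indicate multicollinearity, so one index of each pair carries little information beyond the other.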
(a)–(f) denote fitted quality versus actual quality. (g)–(i) denote predicted quality versus actual quality. (a), (b), (g) are plotted by PCR. (c), (d), (h) are plotted by PLSR. (e), (f), (i) are plotted by MPLSR.
The goal of the regression models is to predict the wine quality from newly measured grape physicochemical indexes. To compare the prediction ability of these models, the RMSEP value of each model is calculated. The PCR model has the largest RMSEP and therefore the worst prediction ability, while the MPLSR model has the smallest RMSEP and the best prediction ability, which mirrors the fitting efficiency results of the three models. Figures
According to Figure
The calibration set includes 50 grape physicochemical indexes, which together provide sufficient information. However, measuring so many grape physicochemical indexes is time-consuming and complex work. The regression equations show that some grape physicochemical indexes contribute little to the wine quality. Besides, Table
In order to analyze the contribution of every grape physicochemical index to the wine quality, a contribution ratio (CR) is introduced, which can be formulated as follows:
As the best prediction model, the MPLSR model is applied to analyze the CR of every grape physicochemical index and select the optimal indexes. Utilizing one of the samples, the CR of every grape physicochemical index can be obtained as shown in Table
The CR of 50 grape physicochemical indexes.
Index | CR |
---|---|
Total amino acids | 2.49 |
Aspartic acid | |
Threonine | 19.18 |
Serine | |
Glutamic acid | 10.25 |
Proline* | 0.89 |
Glycine | 14.00 |
Alanine | |
Cystine | |
Valine | |
Methionine | |
Isoleucine | 17.28 |
Leucine | 20.47 |
Tyrosine | |
Phenylalanine | 1.56 |
Lysine | −5.16 |
Histidine | 8.77 |
Arginine | |
Protein | 45.29 |
VC* | −0.78 |
Anthocyanin* | 0.23 |
Tartaric acid | −6.97 |
Malic acid | 7.41 |
Citric acid | −1.90 |
POA† | 6.34 |
Browning degree | |
DPPH free radical | 6.54 |
Total phenolic | |
Tannin | 4.71 |
Grape flavonoid* | |
Resveratrol* | |
Trans-RG†* | 0 |
CIS-RG†* | 0.12 |
Transresveratrol | 2.21 |
CIS resveratrol* | 0 |
Flavonol* | 0.38 |
MC†* | 0.74 |
Quercetin* | 0.29 |
Kaempferol* | |
Isorhamnetin* | 0 |
Total sugar | |
Reducing sugar | 10.67 |
Fructose | 9.08 |
Glucose | 10.72 |
Soluble solids | |
pH | |
Titratable acid | 11.32 |
Solid acid ratio | |
Dry matter content | 39.92 |
Ear weight | 1.43 |
The CRs of 50 grape physicochemical indexes.
To remove the indexes that contribute little to the wine quality or correlate strongly with other grape physicochemical indexes, a CR threshold is applied, and a grape physicochemical index is ignored if its CR is lower than the threshold. After comparing the regression results obtained with different CR thresholds, 1 is selected as the threshold and 13 indexes are ignored, which are marked in Table
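The thresholding step can be sketched in plain Python on a few CR values taken from the CR table. Comparing the absolute value of the CR with the threshold is an assumption made here; it is consistent with the table, where VC (CR of −0.78) is marked as ignored while lysine (CR of −5.16) is retained.

```python
# A few CR values copied from the CR table (subset chosen for illustration)
cr_values = {
    "Protein": 45.29,
    "Lysine": -5.16,
    "Phenylalanine": 1.56,
    "Proline": 0.89,
    "VC": -0.78,
    "Anthocyanin": 0.23,
}

CR_THRESHOLD = 1.0  # threshold selected in the text

# Assumption: the magnitude of the CR is compared with the threshold
kept = [name for name, cr in cr_values.items() if abs(cr) >= CR_THRESHOLD]
ignored = [name for name, cr in cr_values.items() if abs(cr) < CR_THRESHOLD]
```

The retained indexes are then used to rebuild the regression model, while the ignored ones no longer need to be measured.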
The values of RMSEC and RMSEP.
Index | Old model | New model |
---|---|---|
RMSEC | 0 | 0 |
RMSEP | 1.31 | 1.68 |
The CRs of 37 grape physicochemical indexes in the new model.
A comparison between the new model and the old model is made in Figure
A comparison between the model with 50 grape physicochemical indexes and the model with 37 grape physicochemical indexes. (a), (b), (c) are plotted by old model constructed with 50 grape physicochemical indexes. And (d), (e), (f) are plotted by the new model constructed with 37 grape physicochemical indexes.
Whether from the results shown in Figure
In this paper, four multivariate statistical methods, that is, OLSR, PCR, PLSR, and MPLSR, are first reviewed. Based on these methods and real data obtained in practice, wine quality prediction models have been constructed. With superior fitting efficiency and better predicting ability, as represented by RMSEC and RMSEP, respectively, the model built by MPLSR outperforms the other three models. Several grape physicochemical indexes, such as protein, soluble solids, and total sugar, are found to contribute significantly to the final wine quality, while others are insignificant. By ignoring the insignificant grape physicochemical indexes, the model constructed from the key indexes can still provide a satisfactory wine quality predicting ability. The MPLSR model still needs to be validated on a larger data set, and developing robust wine quality prediction models is a meaningful direction for future work.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (no. 61304102) and the Natural Science Foundation of Liaoning Province, China (no. 2013020002).