Identification of Geographical Origin of Milk by Amino Acid Profile Coupled with Chemometric Analysis

This study aimed to establish a method to identify the geographical origin of milk based on its amino acid profile. High-performance liquid chromatography (HPLC) was used to measure amino acid contents. Differences in the amino acid profiles of milk samples from four regions in China (Hebei, Ningxia, Heilongjiang, and Inner Mongolia) were analyzed by ANOVA. Furthermore, principal component analysis (PCA) demonstrated the feasibility of geographical origin identification using the amino acid profile, with the first two principal components accounting for 65.62% of the total variance. The predictive model for the geographical origin of milk samples was established by orthogonal partial least squares-discriminant analysis (OPLS-DA), with a classification accuracy of 100% and performance parameters of R²X = 0.98, R²Y = 0.82, and Q² = 0.75. The excellent predictive ability of the model was validated using the validation data set. The analysis of variable importance in projection (VIP) showed that seven amino acids played a key role in the geographical origin identification. This method is a reliable strategy to identify the geographical origin of milk for protecting consumers against mislabeling fraud.


Introduction
Milk and its products have become an indispensable part of people's lives, providing about 20% of the total protein consumed across the world. Pastures that support cows and sheep are found on almost all of the earth's land. The best pastures in the world are concentrated in the temperate grasslands at about 40-50 degrees north and south latitude, internationally recognized as the "golden belt of milk source," where the climate and environment favor the growth of forage and cows.
The output, quality, and nutritional composition of milk are closely related to forage quality and the health of the cows. Cows with a comfortable and healthy lifestyle can provide high-quality milk. The golden belt of milk sources in China is mainly located in the grasslands of Northeast, Northwest, and North China, in regions such as Hebei, Ningxia, Inner Mongolia, Heilongjiang, and Xinjiang. These milk sources provide more than 60% of the raw milk in China. High-quality milk is generally processed into high-value products, such as infant milk powder produced from New Zealand milk sources. The upscale market for dairy products in the world is almost completely occupied by milk sourced from these belts. The nutritional and economic value of milk and dairy products is often associated with their geographical origin. Just as the price of wine, coffee, and tea varies enormously depending on where they come from, customers are willing to pay more for products from specific geographical regions, whose origin is accepted as a reliable quality criterion. An example of a preeminent dairy product is the "Emmentaler Switzerland" cheese from the alpine regions, which has the status of a "Protected Designation of Origin (PDO)" and commands a considerable premium over Emmental cheese produced elsewhere [1].
There is an increasing demand for robust analytical techniques for tracing the geographical origin of dairy products, which regulatory authorities can use to ensure fair competition and protect consumers against fraud due to mislabeling. So far, many methods have been developed for identifying the geographical origin of foodstuffs [2,3]. Isotope ratio mass spectrometry (IRMS) coupled with chemometric methods is the most promising technique and has been widely used to determine the authenticity and geographical origin of dairy products [4-7]. The stable isotope ratios of milk vary as a function of environmental factors and animal feeding regimes, which provides evidence linked to the geographical origin of milk. A pilot study was conducted to evaluate the suitability of multielement isotope ratio analysis for determining the origin of cows' milk from seven dairying regions in Australia and New Zealand. Each milk sample displayed a distinct fingerprint of isotopic ratios of δ¹³C, δ¹⁵N, δ¹⁸O, δ³⁴S, and δ⁸⁷Sr, which verified the potential of IRMS for determining the geographical origin of dairy products produced within Australasia [8]. The stable isotope ratios δ¹³C and δ¹⁵N of milk samples from different Italian origins, and of their fractions (fat, casein, and whey), were used to develop a new analytical approach that can simultaneously discriminate milk samples according to their geographical origin and type of processing [9]. By using δ¹³C and δ¹⁵N values of extracted proteins and δ²H and δ¹⁸O values of milk water, IRMS was applied to identify the geographical origin of pure milk from Australia and New Zealand, Germany and France, the USA, and China [10]. Using δ¹³C, δ¹⁵N, δ²H, and δ¹⁸O values to assign geographical origin, Zhao et al. [11] studied the traceability accuracy of cow milk samples from various provinces in China.
It was found that geographical locations separated by more than 0.7 km can be distinguished using multi-element (C, N, H, and O) stable isotope ratio analysis. In addition, stable isotope ratio analysis in combination with other techniques, such as inductively coupled plasma atomic emission spectroscopy (ICP-AES), inductively coupled plasma mass spectrometry (ICP-MS), and nuclear magnetic resonance (NMR), is a promising approach for determining the geographical origin of milk [12,13]. Regarding dairy products, the geographical origin of cheeses [14-16], butter [17], and milk powder [18] has also been successfully determined by IRMS coupled with chemometric analysis.
It has been demonstrated that isotope ratio analysis is a powerful tool for the traceability and identification of milk and dairy products. However, this technique has some limitations, such as the high cost of sample analysis and the high price of the instrument [5,19]. The objective of this work was to provide a new low-cost method for identifying the geographical origin of milk and dairy products. The method is based on the characteristics of the amino acid profile of raw milk, which, like isotope ratio values, is correlated with the living environment and feeding regimes of cows. Amino acids were analyzed by high-performance liquid chromatography (HPLC), and the results were coupled with chemometric analysis to develop a classification model for the geographical origin of milk. The method could be used to prevent mislabeling fraud concerning the geographical origin of milk.

Sample Collection.
Cow's milk was sampled from dairy farms located in four Chinese provinces (Hebei, Ningxia, Heilongjiang, and Inner Mongolia) in August 2018. A total of 178 fresh milk samples were collected to ensure a representative data set (Appendix 1). The collected samples were transported to the laboratory by cold chain logistics and kept frozen at −20°C until preparation. In addition, three samples with production origin labels were purchased from markets: samples A, B, and C were labeled as Inner Mongolia, Heilongjiang, and Ningxia, respectively.

Sample Preparation.
Approximately 50-100 g (fresh weight) of homogeneous milk sample was placed on a glass plate and lyophilized for 24 h to a dry powder. After grinding, 0.2 g of the freeze-dried milk powder was weighed into a 12 mL glass vial, 10 mL of 6 mol/L HCl solution (containing 1 g/L of phenol) was added, and the cap was screwed on tightly. Subsequently, hydrolysis was carried out in an air-blowing thermostatic oven at 110°C for 24 h.
Then, 1 mL of hydrolysate was transferred into a 50 mL eggplant-shaped flask and evaporated to dryness in vacuo at 70°C. The residue was redissolved in 2 mL of 0.1 mol/L HCl solution by vortex mixing and passed through a 0.45 μm inorganic filter membrane. Next, 100 μL of hydrolysate filtrate, 200 μL of buffer solution (pH 9.0), and 100 μL of derivatizing agent (300 mg/mL of 2,4-dinitrochlorobenzene) were successively transferred into a 1.5 mL glass vial, and the cap was screwed on tightly. After vortex mixing, derivatization was carried out in a thermostatic oven at 90°C for 90 min.
After derivatization, the solution obtained was adjusted to pH 7 with 50 μL of 10% (V/V) acetic acid and diluted with 600 μL of ultrapure water. Finally, the derivative solution was filtered through a 0.45 μm organic membrane for subsequent analysis by HPLC.
Seventeen amino acids were selected for determination. The analytes, listed by abbreviation, were Asp, Thr, Ser, Asn, Glu, Gln, Pro, Gly, Ala, Cit, Abu, Val, Met, Ile, Leu, Tyr, Phe, Lys, His, and Arg (in elution order). The data acquired by HPLC were exported to Microsoft Excel (Microsoft Corp., USA). The amino acid ratio, i.e., the proportion of each amino acid in the total content of the 17 amino acids, was used for multivariate data analysis. Statistical analysis of the amino acid data was performed using SPSS 22.0; analysis of variance (ANOVA) followed by Duncan's post hoc test was used to determine significant differences between samples from different origins. The preprocessed data were subjected to principal component analysis (PCA) using the Origin software package (Northampton, MA, USA).
For modeling, the samples of each class were divided into calibration sets (125 samples) and validation sets (53 samples) by applying the Kennard-Stone (KS) uniform sampling algorithm. The calibration set was used to develop a model with the supervised technique OPLS-DA in SIMCA 14.2 software (Umetrics, Umeå, Sweden). A permutation test (n = 200) was used to evaluate whether the model overfitted the data. Furthermore, 7-fold cross-validation was run, and its validation metrics were Q² and the lowest root mean square error of cross-validation (RMSECV). External validation was performed according to a previously reported procedure [20,21]. The external validation data set was imported into the developed model under the "specify toolbar" of SIMCA. Its validation metrics were the correct discrimination rate and the receiver operating characteristic (ROC) curve. The area under the curve (AUC) of the ROC illustrates the method performance; the closer the value is to 1, the better the performance.
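The Kennard-Stone split mentioned above can be sketched as follows. This is a minimal illustration of the general algorithm, not the routine used in this study: the function name `kennard_stone`, the Euclidean distance, and the toy data are assumptions made here for illustration, and in practice the selection would be run separately within each class.

```python
import math

def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all samples.
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical two-variable data, for illustration only.
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (2.5, 2.5), (0.1, 0.1)]
calibration = kennard_stone(toy, 4)
validation = [i for i in range(len(toy)) if i not in calibration]
```

The samples left unselected form the validation set, so the calibration set spans the variable space as uniformly as possible.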

Differences in Amino Acid Profile of Milk from Different Geographical Origins.
The contents of 17 amino acids in the milk samples were determined by HPLC. The amino acid ratios, i.e., the proportion of each amino acid in the total content of the 17 amino acids, were calculated and are shown in Table 1. Unlike the amino acid content, the amino acid ratio is related only to the protein composition and is not affected by other components of the milk sample, such as fat. Therefore, the amino acid ratio, rather than the amino acid content, was selected to represent the amino acid profile of milk. The differences in amino acid profiles of milk samples from the four regions were analyzed by ANOVA with Duncan's post hoc test. The ratios of Asp, Cys, and Ala differed significantly among the four regions (P < 0.05). The ratios of Glu and His differed significantly between three of the four regions (P < 0.05). The amino acid profiles of samples from Hebei and Ningxia were relatively similar. The ANOVA results indicate the potential feasibility of using amino acid ratios as an indicator for geographical origin traceability. It has been reported that the amino acid profile of milk is linked to feed [22]. Differences in pasture conditions, such as forage quality, feeding strategy, and climate, lead to differences in the amino acid profiles of milk from different geographical origins.
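The ratio calculation described above is simple enough to sketch in a few lines. The function name `amino_acid_ratios` and the content values below are hypothetical and use only three amino acids for brevity; the example also shows that scaling all contents by a common factor leaves the ratios unchanged.

```python
def amino_acid_ratios(contents):
    """Convert amino acid contents to proportions of their total."""
    total = sum(contents.values())
    if total <= 0:
        raise ValueError("total amino acid content must be positive")
    return {name: value / total for name, value in contents.items()}

# Hypothetical contents (arbitrary units) for a subset of amino acids.
sample = {"Asp": 2.0, "Glu": 6.0, "Leu": 2.0}
ratios = amino_acid_ratios(sample)

# Halving every content (e.g., a sample with less protein per gram)
# gives the same ratios.
diluted = {name: 0.5 * value for name, value in sample.items()}
diluted_ratios = amino_acid_ratios(diluted)
```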

Potential of Geographical Origin Classification Based on Amino Acid Profiles.
PCA was first applied for data visualization, which demonstrated the general potential to differentiate between the geographical origins of milk samples using amino acid profiles. PCA is the most commonly used variable-reduction method; it decomposes the data matrix with n rows (samples) and p columns (variables) into the product of a score matrix and a loading matrix [22]. All principal components (PCs) are mutually orthogonal. Each successive PC contains less of the total variability of the initial data set, and the scores show the position of samples in the space of the PCs.
PCA was carried out on the 178-sample data set of four geographical origins. The scores are plotted as a multiclass model; i.e., each geographical origin of the milk samples is presented as a separate class (Figure 1). The first two PCs accounted for 65.62% of the total variance, which explained most of the variability in the raw data. Samples from Inner Mongolia were well separated from the other three groups. The three groups of samples from Hebei, Ningxia, and Heilongjiang overlapped to a certain extent, which was consistent with the ANOVA results. The results suggested that amino acid profiles have potential for the identification of geographical origin.
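For reference, the PCA decomposition and the cumulative variance figure reported above can be written in the standard textbook form. The symbols are notation introduced here rather than taken from the original analysis: X is the n × p data matrix, T the score matrix, P the loading matrix, E the residual, and λ_k the variance (eigenvalue) of the k-th PC.

```latex
X = T P^{\mathsf{T}} + E, \qquad
\text{explained variance of PC}_k = \frac{\lambda_k}{\sum_{i} \lambda_i}, \qquad
\frac{\lambda_1 + \lambda_2}{\sum_{i} \lambda_i} = 0.6562
```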

Establishment of the OPLS-DA Model.
As an unsupervised chemometric method, PCA simply shows the data as they are, and it is frequently used as a practical indicator of the potential of an OPLS-DA model [23]. In contrast, OPLS-DA is a supervised chemometric method that can determine features within data and is explicitly oriented toward particular problems, such as food authentication and geographical origin traceability. By constructing predictive models, OPLS-DA can separate and classify new data points, which allows it to be used as a nontargeted method to decide whether an unknown sample is accepted or rejected by a predefined class. Orthogonal signal correction (OSC) filters out the orthogonal information in the independent variable matrix X that is unrelated to the dependent variable Y, so that the correlation between X and Y is strengthened and the explanatory ability and accuracy of the predictive model are improved [24,25]. The OPLS-DA predictive model was established using the calibration data set with unit variance scaling and components of "3 + 8 + 0" (three predictive and eight orthogonal components). There was a clear clustering of milk samples from different regions with obvious separation boundaries (Figure 2(a)). Note that the new variables t1 and t2 summarize the X-variables. Score t1, the first component, corresponds to the largest variation in the X space, followed by t2, and so on. Inner Mongolia samples had negative t1 scores, whereas samples from the other provinces all had positive t1 scores. Ningxia, Hebei, and Heilongjiang samples were separated along t2, on which Heilongjiang samples had negative scores.
The prediction performance of the model was assessed according to the cumulative coefficient of determination (R²(cum)) and the cross-validated coefficient of determination (Q²(cum)). R² evaluates the degree of fit, and Q² indicates the predictability. The R² values were evaluated separately for the input variables (R²X(cum)) and the class response (R²Y(cum)). The model was considered stable and robust when the values of R² and Q² were greater than 0.5, and the closer these values were to 1, the better the model [26]. The model fitting parameters R²X(cum) and R²Y(cum) were 0.98 and 0.82, respectively, and the prediction parameter Q²(cum) was 0.75. In addition, the lowest root mean square error of cross-validation (RMSECV) for the proposed model was 0.18; the closer this value is to 0, the better the model. These results indicated that the model had a good fit and a strong predictive ability. In the score plot (Figure 2(a)), the data points of samples from Hebei and Ningxia appear relatively close to each other. However, these two groups are clearly separated in the corresponding 3D model plot (Appendix 2). The data points of the Inner Mongolia samples were separated very successfully from those of the other three regions, probably because the large differences in latitude and longitude led to differences in amino acid profiles. The permutation test (n = 200) was performed to assess whether the OPLS-DA model overfitted the data (Figure 2(b)). The results showed that the intercept value of Q²Y was below 0, and the values of R² and Q² on the left were all lower than the original points on the right, which indicated that the model was valid and did not exhibit overfitting [27]. Moreover, the statistical significance of the OPLS-DA model was also validated by the CV-ANOVA P value, which was 0. The ROC curve also represents the ability of the model to classify samples correctly. An AUC value of 1 in Figure 2(c) revealed an excellent performance of this model.
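To make the two cross-validation metrics concrete, the sketch below computes Q² and RMSECV from cross-validated predictions using the common definitions Q² = 1 − PRESS/SS_total and RMSECV = sqrt(PRESS/n). The function name and the numbers are hypothetical, and the exact conventions used by SIMCA (for example, how folds and components are aggregated) may differ.

```python
import math

def q2_and_rmsecv(y_true, y_cv_pred):
    """Return (Q2, RMSECV) from responses and their cross-validated predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # PRESS: squared error of predictions made while each sample was held out.
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv_pred))
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - press / ss_total, math.sqrt(press / n)

# Hypothetical dummy responses (1 = in class, 0 = not) and held-out predictions.
y = [1, 1, 0, 0]
y_hat = [0.9, 0.8, 0.1, 0.2]
q2, rmsecv = q2_and_rmsecv(y, y_hat)
```

A Q² close to 1 and an RMSECV close to 0 both indicate that held-out samples are predicted well, which is the sense in which the reported values of 0.75 and 0.18 are interpreted.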
The analysis of variable importance in projection (VIP) was performed to evaluate the contribution of the independent variables to the model classification. The potential key markers for differentiation between classes were determined according to the criteria of both a VIP value ≥1 and P < 0.05 [28]. As can be seen from Figure 2(d) and Appendix 3, Asp, Glu, Leu, Cys, Ala, Pro, and Val made a major contribution to the model classification and were proposed as potential markers for distinguishing the four geographical origins of the milk samples. The amino acids required by dairy cows include essential and nonessential amino acids. Essential amino acids are those that cannot be synthesized by the cow herself and need to be absorbed directly from feed or from metabolites produced by the rumen microbiome; they include Arg, His, Ile, Leu, Lys, Met, Phe, Thr, and Val [29,30]. Nonessential amino acids are synthesized through the cow's metabolism using feed, and their profiles are controlled by genes. The amino acids in milk are fractionated from the amino acids in the blood through mammary metabolism. Therefore, the amino acid differences between milk origins may have the following causes. Differences in essential amino acids are mainly influenced by feeding differences, as different topography and environments result in different feed plant populations. Differences in nonessential amino acids are mainly influenced by genetic differences, because different regions may raise different breeds of cows, and different environments can also cause genetic mutations as cows adapt to them. In addition, the rumen microbiome may also vary depending on the environment, and it has been reported that some amino acids in cows are derived from these microbial metabolites [29].
In conclusion, amino acid differences between milk origins are influenced by a combination of many factors that are representative of the origin, including the regional climate conditions (rainfall, temperature, possibility to graze) and the rearing of different breeds.
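As an illustration of how VIP values arise, the sketch below implements the classical PLS formula, VIP_j = sqrt(p · Σ_a SS_a (w_ja / ‖w_a‖)² / Σ_a SS_a), where p is the number of variables, w_a the weight vector of component a, and SS_a the Y sum of squares explained by that component. The weights and sums of squares are hypothetical, and the VIP reported by SIMCA for OPLS models may be computed differently (for example, with separate predictive and orthogonal parts).

```python
import math

def vip_scores(weights, ss_explained):
    """Classical PLS VIP.

    weights[a][j]   : weight of variable j in component a
    ss_explained[a] : Y sum of squares explained by component a
    """
    n_vars = len(weights[0])
    total_ss = sum(ss_explained)
    scores = []
    for j in range(n_vars):
        acc = 0.0
        for w_a, ss_a in zip(weights, ss_explained):
            norm_sq = sum(w * w for w in w_a)
            # Share of component a's explained variation attributed to variable j.
            acc += ss_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(n_vars * acc / total_ss))
    return scores

# Hypothetical weights for 3 variables and 2 components.
weights = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]
ss_explained = [3.0, 1.0]
vip = vip_scores(weights, ss_explained)
key_variables = [j for j, v in enumerate(vip) if v >= 1.0]
```

Because the squared VIP values average to 1 across variables, a threshold of VIP ≥ 1 selects variables that contribute more than average to the model.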

Validation of the OPLS-DA Model.
The established OPLS-DA predictive model was validated using the validation data set. The OPLS-DA model is a binary classification method that can only assign dummy values of 1 and 0. For example, if the dummy value of 1 (deviation <0.5) is assigned to the class predefined as Hebei, the value of 0 (deviation <0.5) will be assigned to the other three classes (Inner Mongolia, Ningxia, and Heilongjiang). The predicted classification values were calculated separately by the OPLS-DA model for each of the four classes, and the results of the four binary classification predictions are shown in Figure 3. The classification accuracy of the predictive model was 100% for the four classes (Table 2). All the samples were correctly classified, which suggested the powerful predictive ability of the model.

Practical Application of the OPLS-DA Model.
Chemometric analysis, particularly using partial least squares-related methods, can intuitively visualize classification results and is widely used in the field of geographical origin traceability. Xie et al. [13] established an OPLS-DA model for milk traceability in a small-scale region. Chen et al. [31] used PCA and LDA to trace the geographical origin of Thelephora ganbajun. However, these studies did not apply their models to real samples. It is crucial to provide a practical process for identifying the geographical origin of commercial milk using the proposed model. In this study, the established OPLS-DA model was applied to the geographical origin identification of commercial milk. First, the amino acid ratios of the unknown milk samples were measured. Subsequently, these data were imported into the developed model under the "specify toolbar" of SIMCA. The geographical origin class of each sample was judged according to the predicted value. If the predicted value of an unknown sample lies in the range from 0.5 to 1.5 in one of the four binary classifications, the sample is identified as the corresponding class of geographical origin, with an assigned value of 1. Following this procedure, the predicted values of the three samples were obtained and are detailed in Table 3. The predicted results show that samples A, B, and C are from Inner Mongolia, Heilongjiang, and Ningxia, respectively, which is consistent with the labeled geographical origin.
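The decision rule described above can be expressed as a short sketch. The function name `assign_origin` and the predicted values are hypothetical and are not taken from Table 3; returning `None` when no class, or more than one class, falls in the acceptance range is an assumption made here, since the text does not state how such cases are handled.

```python
def assign_origin(predicted, low=0.5, high=1.5):
    """Return the single class whose predicted value lies in [low, high], else None."""
    hits = [origin for origin, value in predicted.items() if low <= value <= high]
    return hits[0] if len(hits) == 1 else None

# Hypothetical predicted values from the four binary (one-vs-rest) classifications.
sample_a = {"Hebei": 0.05, "Ningxia": -0.10, "Heilongjiang": 0.12, "Inner Mongolia": 0.93}
origin = assign_origin(sample_a)

# A sample accepted by no class is left unassigned under this sketch.
unknown = {"Hebei": 0.20, "Ningxia": 0.30, "Heilongjiang": 0.10, "Inner Mongolia": 0.40}
unassigned = assign_origin(unknown)
```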

Conclusions
This study provided a method to identify the geographical origin of milk based on its amino acid profile. The differences in amino acid profiles of milk samples from four regions in China (Hebei, Ningxia, Heilongjiang, and Inner Mongolia) were analyzed by ANOVA and PCA. The results suggested that amino acid profiles have potential for the identification of geographical origin. The predictive model for the geographical origin of milk samples was established by OPLS-DA with a classification accuracy of 100%. The excellent predictive ability of the model was validated using the validation data set. The VIP analysis showed that Asp, Glu, Leu, Cys, Ala, Pro, and Val made a major contribution to the model classification. This method is a reliable strategy to identify the geographical origin of milk for protecting consumers against mislabeling fraud.

Data Availability
The data used to support the study are included within the article. Raw data can be acquired from the corresponding author upon reasonable request.

Conflicts of Interest
All authors declare that there are no conflicts of interest.