Use of pattern-recognition display techniques to visualize the data contained in complex data-bases. A case study



Introduction
Automation and computer acquisition of data in analytical chemistry have meant that enormous amounts of data can be obtained. The analytical chemist is faced with the problem of bringing some order into these data, so that he can understand the relationships between the variables being measured and between those variables and the object of his research. This is particularly difficult when more than two variables are being measured. Indeed, when only one or two variables are measured the analyst usually makes graphs to explain what happens. When more than two variables are measured, the data can no longer be represented directly in two dimensions. The purpose of this article is to show that, in such a case, pattern-recognition methods can be used to visualize the data and therefore to gain insight into them. These methods can be called display methods. Instead of giving a lot of theory, this is demonstrated with a practical example: the authentication of the geographical origin of wine.
The classical methods for the chemical analysis of wines enable the major constituents, such as ethanol content, dry matter, acidity and mineral content, to be controlled. These parameters can be useful for the detection of adulterations, but do not help with the problem of the identification of wines according to their origin. Until now, such an identification has only been possible through sensory evaluation by experienced tasters.
As the characterization of wines according to origin is important, several investigators have attempted to solve the problem by multivariate characterization of wines. Wines are analysed for a number of constituents and the set of observations made on each wine sample constitutes a pattern. If the constituents that were analysed were appropriately chosen, wines from different origins would have different patterns, so these patterns may be used for classification according to origin.
Motet et al. [1] and Scarponi et al. [2 and 3] analysed samples of three groups of Venetian DOC wines for several inorganic parameters. On the basis of the patterns obtained, and with the use of multivariate statistical techniques, the three groups of wines could be reasonably well identified. Kwan et al. [4] also proved that the characterization of wines on the basis of inorganic contents can be useful for assignment according to origin. Schreier et al. [5] made use of the concentrations of aroma constituents derived from the grapes: the six groups of German white wines that were involved in their study could be discriminated using multivariate techniques. When the same samples were characterized by their content of yeast metabolites, the discrimination decreased considerably.
It has been suggested that amino-acid patterns may be useful for the detection of adulterations of food products (orange-juice for example) [6, 7 and 8], and in order to determine the geographical origin of foodstuffs. Gilbert et al. [9] made use of amino-acid patterns to prove that groups of honey samples from different countries could be distinguished from each other.
Professor Dr Massart is corresponding author.
As amino-acids are important factors in the vinification process, and as most of the amino-acids in wines are related to the grape varieties, Ooghe and De Waele [10] suggested that the differentiation of wines according to origin is possible on the basis of such patterns. Therefore, French wines were analysed for 20 amino-acids. The resulting data-set cannot be interpreted with univariate statistics or by visual inspection. Evaluating the results by comparing the values for each of the parameters separately takes no account of the relationships between these parameters. This can cause an important loss of information, as becomes obvious from the following example.
In figure 1 the proline concentration is plotted against the glycine concentration for 13 Côtes du Rhône wines and 14 Beaujolais wines. From the range of these two parameters in both groups it is clear that univariate consideration of the two parameters separately does not permit differentiation of the two groups of wines. However, when the two parameters are plotted against each other, a good discrimination can be made between the two groups. Obviously, more information is obtained from bivariate interpretation of the results. If the concentration of a third amino-acid is taken into consideration, the data-set can be visualized by representing each individual wine sample in a three-dimensional space, each dimension corresponding to the concentration of one amino-acid. This three-dimensional representation might then improve the differentiation between the groups. However, only a minor part of the information included in the data-set is then used.
As 20 parameters are available for each wine, each of the samples can be thought of as situated in a 20-dimensional space. Visual representation of the data is no longer possible; more sophisticated techniques that utilize the information in an optimal way and that can give a visual representation of the results are necessary. Multivariate statistical techniques or pattern-recognition techniques can be used for this purpose.
Among the different pattern-recognition techniques, a distinction can be made between three groups: pure display methods, supervised methods and clustering techniques.
(1) The aim of the pure display methods is to represent multivariate data collected for a group of samples in a two-dimensional space without significant loss of information.
(2) Supervised methods are used when the data-set consists of samples that can be divided a priori into several groups, which is the case in this example. They aim to develop mathematical decision rules that can be used for the classification of new samples.
(3) Clustering techniques are employed when the samples of the data-set cannot be a priori divided into groups, or when one ignores the a priori existence of groups. The fundamental purpose of these techniques is to find or to define groups in the data-set.
The first step in the analysis of a multivariate data-set should always be the display, as in this way an indication can be obtained of what can be expected from further multivariate investigations. This article introduces these methods and illustrates them with results obtained from a specific data-set. Besides the typical display methods, some supervised and most clustering techniques also permit visualization of the data, so the application of such techniques to the same data-set is also explained.
Figure 1. Beaujolais wines (1-14) and 13 Côte du Rhône wines (15-27). The full lines are the axes of the plot of the unscaled data; the dotted lines are those of the plot of the scaled data. The ranges of the parameters in the Beaujolais wines and in the Côte du Rhône wines are marked by separate symbols.

Computer programs
The computer package 'ARTHUR' [11] was used to obtain principal component plots and non-linear maps of the data-set; the 'BMDP' package [12] was used for the application of linear discriminant analysis (LDA). From the 'CLUSTAN' package [13] the routines 'Hierarchy', 'Plink' and 'Tree' were used for the application of the clustering technique and for the representation of the corresponding dendrogram.

Pre-treatment of the data
Before the application of the multivariate techniques, the data were transformed in order to make the ranges of the different variables comparable. A z transformation, or autoscaling, was applied: each variable is given variance one and mean zero. This means that from each variable the average of that variable over the entire data-set is subtracted and the result is divided by the standard deviation of that variable over the data-set. It will be demonstrated that scaling can have a great influence on the display obtained.
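The autoscaling step described above can be sketched in a few lines of Python. This is only an illustration: the concentrations are hypothetical values, not taken from the wine data-set, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not say which divisor was used.

```python
from statistics import mean, stdev

def autoscale(column):
    """Return a z-transformed copy of one variable (mean 0, variance 1)."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1): an assumption
    return [(x - m) / s for x in column]

# Hypothetical concentrations, NOT taken from the wine data-set.
proline = [2.1, 3.4, 1.8, 2.9, 3.8]
glycine = [0.21, 0.18, 0.25, 0.20, 0.16]

# Each variable is scaled separately, over all samples.
scaled = [autoscale(column) for column in (proline, glycine)]
```

After this step both variables have the same spread, so neither can dominate a display merely because it is measured on a larger scale.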

Data-set
195 French wines were analysed for their amino-acid pattern. An appropriate sample of wine was boiled for 20 h in a nitrogen atmosphere with an excess of 6 N HCl. After filtration and concentration in vacuum an adequate amount of the concentrate was injected into an amino-acid analyser (a Technicon NC-2P). The concentration of each of the 20 identifiable amino-acids was determined by the use of an external standard that contained each of the amino-acids in an amount comparable to that of the samples.
Each principal component PC$_j$ is a linear combination of the $n$ original variables $X_1, \ldots, X_n$:
$$U_j = a_{1j} X_1 + a_{2j} X_2 + \cdots + a_{nj} X_n \qquad (1)$$
The coefficients $a_{ij}$ (the loadings) give an indication of the importance of the original variable $X_i$ in the direction of PC$_j$.
Instead of thinking of the objects as situated in the hyperspace with the position of each object given by the values of the $n$ original parameters, the objects can now be thought of as situated in a hyperspace the dimensions of which are given by the latent variables. This corresponds to a rotation of the original space. The position of each sample in this rotated space is then given by the scores of that object on each PC. These scores are calculated by filling in the values of the parameters $X_i$ in equation (1).
As the first few PCs explain the greatest part of the variance within the data, the samples can be represented in a reduced space defined by these first few PCs only. The percentage of the variance explained by each PC can be calculated and gives an idea of how much of the information in the data-set is represented in the display.
A PC analysis was performed on the scaled data-set plotted in figure 1. As in this data-set each sample is characterized by two parameters, two PCs can be computed. The vector V1 in figure 1 gives the direction of the first PC. The second PC is orthogonal to the first. The PCs are given by the following equations:
$$U_1 = 0.7071\,X_1 - 0.7071\,X_2$$
$$U_2 = 0.7071\,X_1 + 0.7071\,X_2$$
Instead of representing the samples in the original pattern space, as was done in figure 1, they can also be represented in a rotated space, the co-ordinates of which coincide with the directions of the PCs (see figure 2). The position of, for instance, sample 1, which has the co-ordinates (-0.158, 1.997) in the plot of the scaled data, is computed in the following way:
$$U_{1,1} = 0.7071\,(-0.158) - 0.7071\,(1.997) = -1.524$$
$$U_{2,1} = 0.7071\,(-0.158) + 0.7071\,(1.997) = 1.300$$
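The score calculation for sample 1 can be reproduced with a short Python sketch. The loadings are taken as plus or minus 1/sqrt(2) (0.7071 to four decimals), and the signs of the two PCs are read from the arithmetic of the worked example; the function name `scores` is introduced here for illustration only.

```python
import math

# Loadings of the two PCs for two autoscaled variables: +/- 1/sqrt(2).
a = 1.0 / math.sqrt(2.0)  # 0.7071...
loadings = {
    "PC1": (a, -a),  # signs assumed from the worked example for sample 1
    "PC2": (a, a),
}

def scores(sample, loadings):
    """Score of one sample on each PC: the loadings applied to its co-ordinates."""
    return {
        pc: sum(coefficient * value for coefficient, value in zip(coefficients, sample))
        for pc, coefficients in loadings.items()
    }

sample_1 = (-0.158, 1.997)  # scaled co-ordinates of sample 1
u = scores(sample_1, loadings)
print(round(u["PC1"], 3), round(u["PC2"], 3))  # -1.524 1.3
```

The same function applied to every sample gives the co-ordinates in the rotated space; keeping only the PC1 score gives the projection onto the first PC.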
In order to represent the data in a reduced space, i.e. in a one-dimensional space in this example (on one axis), the data points can be projected on the first PC: figure 3 shows such a display.
As the fraction of the total variance explained by the first PC amounts to 57.8%, the display obtained in the reduced space gives an idea of the situation of the samples in the original space. The loadings of both parameters on the first PC are equal in absolute value, which means that the variance represented in the direction of this PC is defined by both variables to the same extent. In fact, when PCA is performed on a two-dimensional data-set with autoscaled variables, the absolute values of the loadings on both PCs will always be equal to each other. When more than two variables are included in the data-set, the loadings cannot be predicted a priori; the loadings of the variables on the important PCs give an indication of the relations between the parameters themselves. Indeed, when two variables are strongly correlated, their values vary in a similar way over all the objects. If a great part of the variance of one of these variables is explained by, for instance, PC 1, resulting in a high loading on this PC, a significant part of the variance within the other variable will also be explained by the same PC, so that the loading of this second variable will also be high on this same PC. Variables that have high loadings on the same important PC will therefore tend to be strongly related to each other.
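The statement about equal loadings can be checked with a short derivation. It assumes only that both variables are autoscaled, so that the matrix being diagonalized is the 2 x 2 correlation matrix; the symbol $r$ for the correlation coefficient between the two variables is introduced here and is not used in the text.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\det(R - \lambda I) = (1 - \lambda)^2 - r^2 = 0
\;\Rightarrow\; \lambda_{\pm} = 1 \pm r,
\qquad
v_{+} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad
v_{-} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}.
```

For any non-zero $r$ both eigenvectors have loadings of absolute value $1/\sqrt{2} \approx 0.7071$, and the first PC explains a fraction $(1 + |r|)/2$ of the total variance. Read this way, the 57.8% quoted above would correspond to $|r| \approx 0.156$, i.e. a weak correlation between the two variables; this value is inferred here and is not reported in the text.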
If PCA is performed on the unscaled data, the first PC explains 97.5% of the variance and the second 2.5%. This first PC is given by the equation
$$U_1 = 0.9997\,X_1 - 0.0251\,X_2$$
where $X_1$ is the proline concentration and $X_2$ the glycine concentration.
Clearly the direction of this PC is given almost solely by proline: the loading of this variable is much greater than that of glycine. Consequently, the information given by this PC (see figure 4) is almost the same as that given by proline itself. This is to be expected as the variance of proline in the samples (0.8906) is much greater than that of glycine (0.1442).
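The effect of scaling on the first PC can be illustrated with a closed-form eigen-decomposition of a 2 x 2 matrix. The sketch below is a rough consistency check under two loudly stated assumptions, neither of which is made in the text: the correlation between the two variables is taken as -0.156 (the value implied by the 57.8% figure for autoscaled data), and the two spreads quoted above (0.8906 and 0.1442) are entered as standard deviations. The function `first_pc` is a helper written for this illustration.

```python
import math

def first_pc(var1, var2, cov):
    """First PC of a 2 x 2 covariance (or correlation) matrix.

    Returns (fraction of the variance explained, (loading 1, loading 2)).
    """
    trace = var1 + var2
    gap = math.hypot(var1 - var2, 2.0 * cov)
    lam1 = (trace + gap) / 2.0       # largest eigenvalue
    vx, vy = lam1 - var2, cov        # unnormalised eigenvector for lam1
    norm = math.hypot(vx, vy)
    if norm == 0.0:                  # happens when cov == 0 and var2 >= var1
        return lam1 / trace, (0.0, 1.0)
    return lam1 / trace, (vx / norm, vy / norm)

r = -0.156                      # ASSUMED correlation between proline and glycine
s_pro, s_gly = 0.8906, 0.1442   # ASSUMED to be standard deviations

# Autoscaled data: unit variances, covariance equal to the correlation.
scaled_fraction, scaled_loadings = first_pc(1.0, 1.0, r)

# Unscaled data: variances s^2, covariance r * s_pro * s_gly.
raw_fraction, raw_loadings = first_pc(s_pro ** 2, s_gly ** 2, r * s_pro * s_gly)
```

Under these assumptions the scaled case gives a fraction of 0.578 with loadings of about 0.7071 and -0.7071, while the unscaled case gives a fraction of about 0.975 with a proline loading close to 1 and a small negative glycine loading, which is the behaviour described above: without scaling, the variable with the larger spread dictates the first PC.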
If the only purpose of displaying the data is to obtain a visualization of the data-set without interest in the relations between the parameters used to characterize the samples, non-linear display methods can be used as well. These methods produce displays in which the reduced dimensions are non-linear combinations of the original variables. The displays one obtains are called maps. Non-linear mapping (NLM) is one of the most frequently used methods: it represents the data in a reduced dimension according to the criterion that the distances between the samples in the reduced space approach the distances between the samples in the original space. Kowalski and Bender [14] give the following comparison in order to explain the method. Suppose that each of the samples in the original hyperspace is connected with every other sample by tensionless springs. The total energy of the springs is zero when considered in the original space. Pressing the samples together in order to obtain a mapping into, for instance, two dimensions gives rise to a tension on each of the springs. NLM aims to give a representation of the data such that the sum of the tensions on all the springs is minimal.
Bordeaux wines, particularly, form a compact cluster in the plot. The difference between the two subgroups in the Bourgogne wines is less explicit, but a certain degree of dissimilarity can still be seen. Table 2 gives the loadings of the amino-acids on the first two PCs. The loadings on PC 1 have the same magnitude for all the parameters (except for proline, which has practically no importance in the first direction). This means that this PC mainly represents the difference in the overall concentration of the amino-acids in the samples. As the Bordeaux and the Côtes du Rhône wines all have a low score on this component, it can be concluded that these groups of wines are characterized by a lower total amino-acid content as compared to the Bourgogne wines.
The direction of the second PC is largely due to the proline content. The discrimination that occurs in this direction between Beaujolais wines and the other groups is mainly due to the greater concentration of proline in Beaujolais. As mentioned above, the loadings on the most important PCs may give an indication of the relationship between the parameters themselves. In the example, only the first PC is really important. The amino-acids Asp-ac, Thr, Ser, Glu-ac, Gly, Ala, Val, Met, Ile, Leu, Phe and Lys have almost the same weight on this PC; consequently, these parameters must be correlated to a high degree. This conclusion was confirmed by the application of the routine Pearson available in the SPSS package [15]. The correlation between each pair of the amino-acids mentioned ranges from 0.68 to 0.95. There is no significant correlation between proline and any of the other amino-acids.
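The NLM criterion introduced earlier (springs between every pair of samples, with the total tension to be minimized) can be sketched as follows. This is a simplified illustration and not the ARTHUR routine: the "tension" is taken as the plain sum of squared differences between original and mapped distances, the minimization is an ordinary gradient descent with step halving, and the six three-dimensional samples are hypothetical.

```python
import math
import random

def distances(points):
    """Matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def stress(target, layout):
    """Sum over all pairs of (original distance - mapped distance) squared."""
    d = distances(layout)
    n = len(layout)
    return sum((target[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def nonlinear_map(points, dims=2, steps=300, rate=0.05, seed=0):
    """Map points into `dims` dimensions; returns (layout, initial, final stress)."""
    rng = random.Random(seed)
    target = distances(points)
    n = len(points)
    layout = [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(n)]
    initial = current = stress(target, layout)
    for _ in range(steps):
        d = distances(layout)
        trial = []
        for i in range(n):
            gradient = [0.0] * dims
            for j in range(n):
                if i == j:
                    continue
                dij = max(d[i][j], 1e-12)          # avoid division by zero
                pull = -2.0 * (target[i][j] - dij) / dij
                for k in range(dims):
                    gradient[k] += pull * (layout[i][k] - layout[j][k])
            trial.append([layout[i][k] - rate * gradient[k] for k in range(dims)])
        new = stress(target, trial)
        if new < current:        # accept the step only if the tension decreases
            layout, current = trial, new
        else:                    # otherwise retry with a smaller step
            rate /= 2.0
    return layout, initial, current

# Hypothetical samples in three dimensions: two loose groups.
samples = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (5, 5, 5), (6, 5, 5)]
layout, initial, final = nonlinear_map(samples)
```

Because every pair of samples contributes to the tension, the cost of each step grows with the square of the number of samples, which is in line with the remark below that the method is time-consuming.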
The second display method applied was NLM. As the method is time-consuming, the NLM routine of the ARTHUR package was used to obtain maps for two groups of wine at a time. Figure 6 is the map of the Bordeaux and the non-Beaujolais Bourgogne wines; figure 7 is the map of the Bourgogne wines. Again, it can be concluded that a differentiation of wines must be possible on the basis of the parameters used in the study. In order to compare the representations obtained with NLM and PCA, PCA was applied to the same combinations of two groups of wines. The pictures obtained with NLM are similar to the PC plots. As the optimization procedure to obtain an NLM map is rather tedious, and no supplementary information is obtained about the parameters themselves (i.e. it cannot be easily concluded which parameters are responsible for the differentiation between the groups), PCA is the preferred display method.
Table 2. Loadings of the parameters on the two first principal components.
Figure 7. Non-linear map of 110 Bourgogne wines. Where 1 = Beaujolais wines; and 2 = non-Beaujolais Bourgogne wines.

Supervised methods that can be used for display
As mentioned above, supervised pattern-recognition techniques are methods that deal with the problem of the classification of samples or objects into a group. The general procedure carried out in a supervised situation is as follows. First, with a data-set consisting of objects with known classification, a classification or decision rule is developed that separates the learning classes in an optimal way; these decision rules can then be used for the classification of new samples. The computation of the decision rules corresponds to the division of the original hyperspace into as many regions as there are learning classes in the training set, each region corresponding to one class. New objects are classified according to their position with respect to the boundaries between the classes. The data-set collected on the wines can obviously be investigated with supervised techniques as the samples can a priori be divided into several groups, i.e. into groups of wines from the same region.
Some supervised techniques, such as linear discriminant analysis (LDA), also give a display. In LDA, the classification functions are developed in a reduced pattern space; as in PCA, each of the directions of this space is obtained by a linear combination of the original variables. These new variables are called discriminant functions or canonical variates. The criterion used to define the discriminant function is the maximization of the ratio of the between-group variance to the within-group variance. The resulting discriminant functions are oriented in the direction which provides a maximum differentiation between the classes. The dimension of the reduced pattern space is one less than the number of training classes. In the case of two groups, one discriminant function is obtained. Each of the discriminant functions is given by the equation:
$$DF_j = a_{1j} X_1 + \cdots + a_{nj} X_n$$
The position of an object, k, in the reduced space is given by its score on each discriminant function:
$$ds_{jk} = a_{1j} X_{1k} + \cdots + a_{nj} X_{nk}$$
The display obtained with this technique visualizes the optimal discrimination between the classes.
The vector V' in figure 1 gives the direction of the discriminant function obtained when LDA was applied to the data-set represented in that figure. Figure 8 gives the display of the data in the reduced one-dimensional space.
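For two groups and two variables, the criterion described above (maximum ratio of between-group to within-group variance) has a well-known closed-form solution, Fisher's discriminant direction: the inverse of the pooled within-group scatter matrix applied to the difference of the group means. The sketch below illustrates that solution; it is not the BMDP procedure, and the two groups are hypothetical points chosen so that neither variable separates them on its own.

```python
from statistics import mean

def fisher_direction(group_a, group_b):
    """Coefficients (a1, a2) of the two-group, two-variable discriminant function."""
    def centre(group):
        return (mean(p[0] for p in group), mean(p[1] for p in group))

    mean_a, mean_b = centre(group_a), centre(group_b)

    # Pooled within-group scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for group, m in ((group_a, mean_a), (group_b, mean_b)):
        for x, y in group:
            sxx += (x - m[0]) ** 2
            sxy += (x - m[0]) * (y - m[1])
            syy += (y - m[1]) ** 2

    det = sxx * syy - sxy ** 2
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # Inverse of the 2 x 2 scatter matrix applied to the mean difference.
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)

def score(w, point):
    """Discriminant score: linear combination of the original variables."""
    return w[0] * point[0] + w[1] * point[1]

# Hypothetical groups: the ranges of each single variable overlap.
group_a = [(0, 1.2), (1, 1.9), (2, 3.1), (3, 3.8)]
group_b = [(1, 0.1), (2, 0.8), (3, 2.2), (4, 2.9)]

w = fisher_direction(group_a, group_b)
scores_a = [score(w, p) for p in group_a]
scores_b = [score(w, p) for p in group_b]
```

Plotting the scores on a single axis gives a one-dimensional display in which the two hypothetical groups no longer overlap, although each original variable taken alone does not separate them.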

Application to the wine data
With LDA, a display of the data-set is obtained when the samples are plotted in the space defined by the discriminant functions. When a discrimination is attempted between three groups, two discriminant functions are calculated and the optimum discrimination can be represented in a two-dimensional plot. If the differentiation between, for instance, four groups is investigated, the reduced space is three-dimensional and so visualization becomes more difficult.
Therefore LDA was applied to only three groups at a time. The variables were introduced into the discriminant function one at a time. The selection criterion for the introduction of a new parameter is based on the discriminating power of that variable, i.e. the degree to which a newly introduced variable increases the discrimination between the groups. Variables are included in the discriminant function until the further inclusion of one of the remaining parameters does not significantly improve the discrimination. The coefficients of the canonical variates obtained in this application are given in table 3. Only the amino-acids Asp-ac, Thr, Pro, Cys, Ileu, Gaba, His, Orn, Eth, and Arg were included. It is this combination of amino-acids that gives the best discrimination between the three groups of wines under investigation.

Clustering techniques
The aim of multivariate techniques belonging to this group is mainly to find groups of similar samples in the data-set and to display these in a so-called dendrogram. The first step in clustering is the determination of the similarity between the different objects to be clustered. One of the possible criteria is the distance between the samples in the original hyperspace.
Consider, for instance, four objects (A, B, C, D), each measured for three different variables (X, Y, Z).
The second step in the clustering procedure is the search for similar groups of objects. Many clustering methods exist, differing in the criteria used to decide which objects belong to the same group. One of the possible procedures is the following. In the first instance, all the objects are considered separately. Then the two most similar objects (A and B) are joined to form a cluster A*. The similarity between this cluster and the remaining objects is now represented by one value, for instance the mean similarity between the objects of the cluster and the other objects.
Figure 10. Representation of the situation of four objects (each measured for three parameters) in the pattern space.
Again, the two most similar objects or clusters, A* and C, are joined and represented as one object.
A dendrogram visualizes the relations between the different objects of the data-set: the heights of the links are a measure of the distances between the objects. The algorithm described above for joining the clusters is called the average linkage method; the method used for clustering the data of figure 1 and the wine data is known as Ward's method, or the error sum of squares method. The clustering obtained with this method is generally comparable to that obtained with the average linkage algorithm. More information on the different clustering algorithms can be found in a recent book by Massart and Kaufman [16].
Figure 11 is the dendrogram of the scaled data of figure 1. As the objects can be divided into two classes, the highest link in this dendrogram can be cut to obtain two clusters and to see whether each of these contains mainly objects of the same class. Applying this criterion, a cluster is obtained which consists only of Côte du Rhône wines, while the second contains all the Beaujolais wines and three Côte du Rhône wines. Obviously, the dendrogram visualizes the degree of discrimination between the groups of wines.
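The joining procedure for the four objects can be sketched as a small average linkage routine. The co-ordinates of A, B, C and D are hypothetical (the original values are not given in the text) and were chosen so that the merge order matches the one described: first A and B, then the resulting cluster with C. Similarity is measured by Euclidean distance, and the distance between two clusters is the mean of all pairwise distances between their members.

```python
import math
from itertools import combinations

def average_linkage(objects):
    """Agglomerative clustering; returns a list of (sorted members, link height)."""
    clusters = [[name] for name in objects]

    def mean_distance(c1, c2):
        total = sum(math.dist(objects[p], objects[q]) for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters (smallest mean distance).
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: mean_distance(*pair))
        merges.append((sorted(c1 + c2), mean_distance(c1, c2)))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c is not c1 and c is not c2] + [c1 + c2]
    return merges

# Hypothetical co-ordinates (X, Y, Z) for the four objects.
objects = {
    "A": (1.0, 1.0, 1.0),
    "B": (1.5, 1.0, 1.0),
    "C": (3.0, 3.0, 1.0),
    "D": (8.0, 8.0, 8.0),
}

merges = average_linkage(objects)
for members, height in merges:
    print(members, round(height, 3))
```

The link heights returned for each merge are the values that would be drawn on the vertical axis of the dendrogram. Ward's method follows the same agglomerative scheme but joins, at each step, the pair of clusters whose fusion gives the smallest increase in the within-cluster error sum of squares.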
The application of Ward's method to the wine data results in the separation of the objects into two very distinct groups (see figure 12): group A is compact and consists mainly of Bordeaux wines; group B, which is more heterogeneous, can be divided into three subgroups: B1 and B3 consist of Bourgogne wines and B2 joins wines from Bordeaux and from Bourgogne. The Côte du Rhône wines appear in all of the clusters.

Conclusion
The object of this paper is to show that display and related pattern-recognition techniques permit the investigation of multivariate data-sets. Wine analysis is considered only as an example and certain aspects are treated more fully by Ooghe and De Waele [10]. Clearly, the applications are not confined to wine or food analysis. Applications in, for example, archaeometry [17], meteoritics [18], microbiology [19], medical diagnosis [20], investigations of structure-activity relations [21], and environmental problems [22] have also been reported. Pattern recognition is a natural addition to automated analysis: eventually there will be instruments which can analyse samples, determine the pattern to which a sample corresponds, and suggest its geographical origin.
Figure 11. Dendrogram of the scaled objects in figure 1 (obtained by using Ward's method).