Global Clustering Quality Coefficient Assessing the Efficiency of PCA Class Assignment

An essential factor influencing the efficiency of the predictive models built with principal component analysis (PCA) is the quality of the data clustering revealed by the score plots. The sensitivity and selectivity of the class assignment are strongly influenced by the relative position of the clusters and by their dispersion. We are proposing a set of indicators inspired from analytical geometry that may be used for an objective quantitative assessment of the data clustering quality as well as a global clustering quality coefficient (GCQC) that is a measure of the overall predictive power of the PCA models. The use of these indicators for evaluating the efficiency of the PCA class assignment is illustrated by a comparative study performed for the identification of the preprocessing function that is generating the most efficient PCA system screening for amphetamines based on their GC-FTIR spectra. The GCQC ranking of the tested feature weights is explained based on estimated density distributions and validated by using quadratic discriminant analysis (QDA).


Introduction
Invented by Pearson in 1901 [1,2], principal component analysis (PCA) is still one of the most popular statistical tools, because it helps us inspect the structure of large amounts of data characterizing a given phenomenon. PCA is mostly used in exploratory data analysis in order to identify the clusters formed by samples with similar properties or behavior as well as the most important variables (principal components (PCs)) that explain most of the data variance. Once the natural clustering of the data is known, PCA may be applied to build predictive models that are then used for classification purposes, that is, for assigning the class of unknown samples.
The spreading of computers and user-friendly software applications has allowed in the last decades the use of PCA in a large variety of domains [2]. Useful PCA applications have been recently reported in very different scientific fields, from food science [3] to educational management [4], from economy [5] to environmental research [6], from energy [7] to biometrics [8], and from toxicology [9] to archaeometry [10]. Most of the PCA studies have been performed in analytical chemistry (chemometrical applications), as most of the instruments are currently linked to computers that allow the storage of huge amounts of data. For example, PCA has been used to explore, sort, and group spectral data obtained with techniques such as UV-visible (UV-VIS), near-infrared (NIR) and mid-infrared (MIR) spectroscopy [11], Fourier transform near-infrared spectroscopy (FT-NIR) [12], diffuse reflectance infrared Fourier transform spectroscopy [13], attenuated total reflection Fourier transform infrared spectroscopy (ATR-FTIR) [10], gas chromatography-Fourier transform infrared spectroscopy (GC-FTIR) [9], gas chromatographymass spectroscopy [14], or inductively coupled plasma optical emission spectrometry (ICP-OES) [15].
In order to operate data and make predictions, PCA is also often used in combination with other methods, such as linear discriminant analysis (LDA) [8,13], soft independent modeling of class analogy (SIMCA) [9], cluster analysis (CA) [15], or artificial neural networks (ANN) [14,16]. More specifically, PCA is used to reduce the dimensionality of complex data sets and to reveal sometimes hidden, simplified structures that underlie it. These structures form, in the PCA score plots, clusters that are expressing the data in such a way as to highlight their similarities and differences.
In other words, PCA is an unsupervised pattern cognition technique that may be used to identify patterns in data and build associated models. Then these models may be used for developing more complex classification systems with supervised pattern recognition techniques such as LDA, SIMCA, or ANN.
An essential factor influencing the efficiency of the predictive models is the quality of the data clustering in the score plots. The sensitivity and selectivity of the class assignment are strongly influenced by the relative position of the clusters and by their dispersion. The larger the distances between the clusters and the lower the cluster dispersion, the better the modeling and discrimination power of the PCA models. The quality of the data clustering may often be improved by signal preprocessing techniques [9,17,18]. However, the selection of the best preprocessing function is not always obvious, as often an improved sensitivity is obtained at the expense of the selectivity or vice versa [19]. For example, increased distances between the clusters are often generated by signal preprocessing functions that are also increasing the dispersion of the clusters. In addition, the positive effects of different preprocessing functions are often compared in a qualitative and thus relatively subjective manner, that is, by the visual inspection of the associated PCA score plots.
In this paper, we are proposing a set of indicators inspired from analytical geometry that may be used for an objective quantitative assessment of the data clustering quality in the PCA score plots, as well as a global clustering quality coefficient (GCQC) that is a measure of the overall efficiency of the PCA class assignment. The use of these indicators is illustrated by a comparative study performed for the identification of the preprocessing function that is generating the most efficient PCA system screening for amphetamines based on their GC-FTIR spectra. The effect of three feature weights on the clustering quality is discussed in comparison with the cluster relative positions and dispersions obtained in the case of unprocessed spectra. The ranking of the feature weights indicated by the GCQC is explained based on the dynamics of the cumulated explained variance and on the estimated density distributions calculated with the PC1 and PC2 scores associated with stimulant amphetamines (class M), hallucinogenic amphetamines (class T), and negatives (class N). The hierarchy obtained with the GCQC was validated by using quadratic discriminant analysis (QDA) [20].
Amphetamines are the main recreational drugs of abuse and thus important efforts are made to develop chemometrical tools allowing the automated detection of these drugs [21]. In order to improve the celerity of forensic procedures, these systems must be frequently updated, as new analogues are constantly emerging on the black market. On the other hand, taking into account the legal implications of the detection results, the expectations regarding the sensitivity and selectivity of these systems are very demanding.
Internationally, amphetamine is a Schedule II drug under the Convention on Psychotropic Substances [21,22] in accordance with the Convention on Psychotropic Substances of 1971. Amphetamine analogues, often referred to as "amphetamines" or "substituted amphetamines, " contain amphetamine as a molecular skeleton. Substituted amphetamines, such as methamphetamine, whose basic molecular structure contains an aromatic ring linked by an aliphatic side chain to an amino group (see Figure 1(a)), are also stimulants [9,14]. Other substituted amphetamines, such as 3,4-methylenedioxyamphetamine (MDA) and its analogues (see Figure 1(b)), are Schedule I drugs, as they have a pronounced hallucinogenic effect.
This study was performed in order to determine the GC-FTIR spectra preprocessing function yielding the best PCA clustering quality and associated models for performing an automated detection of amphetamines. The effect of three different preprocessing functions was assessed by an objective clusters analysis, based on analytical geometry indicators and GCQC. The results indicate that all three preprocessing functions increase the efficiency of the detection system. Although no information about the biological activity or toxicity of amphetamines was introduced in the input database, the illicit amphetamines may be detected according to their primary (stimulant or hallucinogenic) biological activity. This result is explained by the correlation between the biological effect of amphetamines and the substitution pattern of the aromatic ring present in their molecular skeleton, the information about the latter structural feature being maintained by the GC-FTIR spectra. GCQC has indicated which spectra preprocessing function is the most fit for purpose, that generates the PCA models with the best predictive (modeling and discriminating) power.

Experimental Part
The PCA training set consists of 30 GC-FTIR spectra, seven of which are the main illicit stimulant amphetamines. These amphetamine analogues have been assigned the class code M, as they all have in their molecular structure a monosubstituted aromatic ring (see Figure 1). The training set also contains the main hallucinogenic amphetamines. This group is formed by six 3,4-methylenedioxyamphetamine (MDA) analogues and has been assigned the code class T, as the molecular skeleton of all these compounds contains a trisubstituted aromatic ring (see Figure 1). The rest are spectra of nonamphetamines, representing compounds of toxicological interest (class code N). The experimental conditions in which the spectra have been recorded are presented in detail in previous studies [9,14]. The analyzed spectra were recorded between 4000 and 600 cm −1 , and the absorbance was measured every 5 cm −1 . Thus, each spectrum is a vector with 681 variables.
The spectra in the database have been divided into two distinct classes, class I including the positives (M and T) and class II the negatives (N). The effect of three spectra preprocessing functions , 2 , and ( − 1) 2 on the efficiency of the PCA based amphetamines screening system was evaluated in comparison with the results obtained with unprocessed spectra. The form of the feature weight is given by [9] = ∑ ( 2 where I and II are the absorptions corresponding to the samples of classes I and II, and I and II are the numbers of samples in classes I and II. The functions and 2 act as selective amplifiers; that is, they enhance the intensity of the absorptions according to their modeling and/or discrimination power ( 2 > > 1) and leave unchanged the absorptions corresponding to irrelevant wave numbers ( = 1). The function ( − 1) 2 acts as an amplifying selector, amplifying the absorptions with an important modeling and/or discrimination power and cancelling the absorptions recorded at irrelevant wave numbers, for which ( − 1) 2 = 0 [9].
PCA was performed by using the MATLAB 2012a software. The optimum number of PCs was established by taking into account the dynamics of the cumulated explained variance. The score plots revealed the existence of three clusters, formed by the class of stimulant amphetamines (class M), hallucinogenic amphetamines (class T), and negatives (class N). The best clustering results were obtained in all cases with the PC1 versus PC2 score plot.
A MATLAB 2012a application was built in order to assess quantitatively the relative position and the density of the clusters. As the results have indicated that each spectra preprocessing function has other advantages, a GCQC characterizing the overall predictive power of the PCA based detection system was defined. This global coefficient was also included in the MATLAB 2012a application and used to obtain an objective ranking of the spectra preprocessing functions from the point of view of their positive effect on the quality of the clusters in comparison with the results obtained with unprocessed spectra.
The results have been explained based on density distributions determined with a normal kernel estimator. The feature weight hierarchy obtained with the GCQC was validated by using a supervised pattern recognition technique, that is, QDA.

Detecting Amphetamines with Principal Component
Analysis. The system built for the detection of stimulant M and hallucinogenic T amphetamines is based on PCA. The assignment of the class of an unknown occurs depending on the cluster where the point associated with this compound falls in the score plots.
The predictive power of the detection system may be improved by preprocessing the GC-FTIR spectra with various functions that may yield better defined clusters (see Figure  2). The more distant the clusters (M, T, N) are from each other in the -dimensional space defined by the principal components (PC), the more selective the detection method is, that is, the smaller the probability of obtaining the false positives (M or T) or false negatives (N) is. At the same time, the more dense each cluster is, the better the sensitivity of the detection method is, that is, the larger the probability of correct classifications (true positive and true negative) is.

Assessing the Clustering Quality with Quantitative Indicators Derived from Analytical
Geometry. The quality of the clusters found in the PCA score plots is usually assessed by visual inspection. This is a relatively subjective method, which can be easily misleading in assessing the overall improvement of the predictive power of the system. In order to obtain a more objective evaluation and overall comparison of the detection performances yielded by each feature weight, we have used a set of quantitative indicators derived from analytical geometry.
First, the center ( , ) of each cluster [19,23] was determined: where is the number of compounds forming the cluster, is the score associated with compound on the PC represented on the abscissa, and is the score of the same compound on the PC represented on the ordinate. Then the relative position of the clusters was characterized by using the following indicators: (a) distance , between two neighboring clusters, determined as the Euclidean distance [24] between the centers of clusters and : and, in our case, = 2, as the distance was determined for bidimensional representations (score plots); (b) perimeter of the polygon (triangle in our case) defined by the centers of the clusters; (c) distance , between the closest two points on the periphery of two different clusters and . The larger , , , and , are, the better the cluster discrimination is. In our case, these indicators confirm that all the tested spectra preprocessing functions are improving the dissociation of the clusters. Perimeter is considerably higher in these cases than in the case of unprocessed spectra (see Table 1). The largest , , , and , are generated by the 2 selective amplifier. However, the efficiency of the detection system is influenced not only by the distance between the centers of clusters but also by the dispersion of the clusters. The effect of cluster Journal of Analytical Methods in Chemistry 5 condensation generated by each preprocessing function was assessed by using the following indicators: (a) radius of a cluster, calculated as the distance between the center and most peripheral point of the cluster; (b) distance between the two most distant points on the periphery of the same cluster; (c) dispersion of the points forming a cluster, calculated as the mean Euclidean distance between the center and the component points of a cluster.
These indicators characterize the density of a cluster. In other words, the lower , , and are, the higher its density is. Tables 3 and 4 show that the 2 selective amplifier, which generates the best results from the point of view of the relative position of the clusters, yields in the same time the most important cluster dispersion, that is, the largest cluster radii , distances between the most distant peripheral points, and dispersion of all three clusters (M, T, and N). The best results from the point of view of the cluster density are obtained with the amplifying selector ( − 1) 2 (see Table 4). On the other hand, this feature weight has only a moderate positive effect from the point of view of the relative position of the clusters (see Table 1).

Assessing the Global Efficiency of the PCA Class Assignment.
In conclusion, each preprocessing function has other advantages. Therefore, we have defined a global clustering quality coefficient (GCQC) that is assessing the overall predictive power of the PCA models. In our particular case, it is used for evaluating the overall effect of each preprocessing function on the PCA detection efficiency. The GCQC is defined based on the indicators previously used to characterize the relative position and the dispersion of the clusters; that is, where is the radius of cluster , is the dispersion of cluster , is the distance between the two most distant points of cluster , is the distance between the two closest peripheral points belonging to the neighboring clusters and , and is the perimeter. Table 5 confirms that all the feature weights that were used for spectra preprocessing are improving the clustering quality. The largest GCQC is obtained for the ( − 1) 2 amplifying selector, which is ensuring the best global performances, that is, the best selectivity and sensitivity in detecting amphetamines. The and 2 selective amplifiers lead to smaller, almost equal GCQCs (see Table 5), indicating that their advantage of generating more distant clusters is offset by the disadvantage of yielding more dispersed clusters.
A first explanation for these results is given by the number of PCs found adequate to perform PCA in each case. Table  6 shows the dynamics of the cumulated explained variance obtained for each feature weight. A number of 10 PCs were necessary to compress the useful information in the case of unprocessed spectra and of the preprocessed spectra. However, the cumulated explained variance obtained for preprocessed spectra is larger (95.9%) than for unprocessed spectra (93.8%). Only 7 PCs were necessary in the case of 2 preprocessed spectra which are associated with nearly the same cumulated explained variance than in the case of preprocessed spectra, although slightly smaller (95.7%). On the other hand, the first 3 PCs were enough to explain a cumulated explained variance of 95.8% in the case of ( − 1) 2 preprocessed spectra. The results are consistent with the feature weight hierarchy indicated by the GCQC (see Table 5).
We have mentioned before that the PC2 versus PC1 score plot yielded the best clustering in all cases. Table 6 also shows that the cumulated explained variance of the first two PCs is much larger in the case of ( − 1) 2 preprocessed spectra (92.8%) than in the rest of the cases.

Assessing the Predictive Power of the PCA Detection
System. In order to explain the results obtained with the GCQC and the associated analytical geometry indicators, each cluster has been analyzed by using the estimated density distribution of the PCA scores associated with GC-FTIR spectra. The cluster density was determined with a normal kernel estimator evaluated at 100 equidistant points that cover the whole range of PCA scores determined for each class of modeled compounds [25][26][27]: where ℎ = 0.94 is the smoothing leveling parameter, = 100, and is the kernel function. The estimated density distributions determined for unprocessed spectra indicate that an efficient detection can be obtained only for hallucinogens, especially due to the large positive PC2 scores (see Figure 3). Stimulant amphetamines and negatives can be distinguished better based on their PC1 scores. However, the negatives characterized by relatively large positive PC1 scores will most probably be classified as (false) M positives and stimulant amphetamines characterized by relatively small positive PC1 scores will be classified as (false) negatives N.
The GCQC increases with 10% in the case of the spectra preprocessed with the selective amplifier in comparison with the case of unprocessed spectra. The estimated density distributions determined for this feature weight (see Figure 4) show that its main effect on the relative position of the clusters is an improved detection of the hallucinogens (T). The larger distances , between the centers of the M, T, and N clusters and perimeter (see Table 1) and the increased distances , between the two closest points on the periphery of two of the M, T, and N clusters (see Table 2) are contributing especially to the improvement of the sensitivity and selectivity of the T cluster (see Figure 4(b)). The overlap of the M and N estimated density distributions determined with the associated PC1 scores is less important than in the case of the unprocessed spectra. However, false M and N classifications may still be expected due to significant cluster dispersion.
The estimated density distributions determined for 2 show that the hallucinogens form the best defined cluster (see Figure 5). Similarly to the case of the selective amplifier, these compounds can be distinguished from the other substances due to their high positive PC2 scores. The 2 selective amplifier generates the largest distances characterizing the relative positions of the clusters. As a result, the M and N estimated density distributions determined for PC2 scores are overlapping less than in the case of the selective amplifier (see Figure 4(b)) or of the unprocessed spectra (see Figure 3(b)). However, 2 also generates the largest cluster dispersion. In this case, the dispersion is so important that it counterbalances the positive effect related to cluster relative positions stronger than in the case of the selective amplifier. The GCQC is sensitive to these variations. As Table 5 shows, the GCQC obtained for 2 represents only 95.43% of the value determined for the selective amplifier.
The estimated density distributions (see Figure 6) indicate that the ( − 1) 2 amplifying selector leads to the best discrimination of amphetamines according to their biological activity and associated toxicity. The most important positive effect of this feature weight in comparison with the and 2 selective amplifiers is the improved discrimination of the clusters of stimulant amphetamines (class M) and of negative compounds (class N). Indeed, the cluster of stimulant amphetamines is found in quadrant I, the cluster of negatives in quadrants II and III, and the cluster of hallucinogenic amphetamines in quadrant IV of the PC2 versus PC1 score plot (see Figure 2(d)).
QDA is a supervised pattern recognition technique closely related to LDA, where it is assumed that the measurements from each class are normally distributed. The quadratic version of discriminant analysis was preferred as, unlike LDA, in QDA there is no assumption that the covariance of each of the classes is identical. The classification results are displayed in Figures 7, 8, 9, and 10.
The classification results obtained with QDA are synthesized in Table 7. The classification rate indicates the percentage of compounds that could be classified (correctly or not) by the system. Among these, the correct classification rate indicates the percentage of compounds that have been classified as true positives (amphetamines classified as such) and true negatives (nonamphetamines classified as such). Unlike the case of unprocessed spectra, all the class assignments performed with preprocessed spectra yield classification rates of 100%. The correct classification rates confirm the ranking provided by GCQC for the spectra preprocessing functions and their positive effect on the quality of the clusters in comparison with the case of unprocessed spectra (see Tables  5 and 7 The increased predictive power indicated by the GCQC in the case of the selective amplifier is validated by an improved correct classification rate. As the estimated density distributions have indicated, this positive effect is due to the improvement of the sensitivity and selectivity of the clusters formed by hallucinogenic and stimulant amphetamines, the T and M correct classification rates reaching 100% and 95.23%, respectively. However, these results are obtained at the expense of the sensitivity of the N model.
The GCQC has indicated a slightly lower predictive power for the models generated by spectra preprocessing with the 2 selective amplifier. Indeed, the correct classification rate becomes 84.90% in this case. As the estimated density distributions have indicated, this is due to the large cluster dispersion that affects the sensitivity of the N model even more than in the case of the selective amplifier.
Finally, the GCQC has forecasted that the best results may be obtained with the spectra preprocessed with the ( − 1) 2 amplifying selector. This function generates the best balance between the relative position and the dispersion of each cluster. As a result, the correct classification rate reaches 89.93%. Consistent with the information provided by the estimated density distributions, the most important positive effect of the ( − 1) 2 amplifying selector in comparison with the second best preprocessing function, that is, the selective amplifier, is related to an improved sensitivity of the M and N models. With this feature weight, no false negatives have been recorded. Taking into account the purpose of the screening system, its most important requirement is related to the sensitivity of the M and T models. The only acceptable compromise is related to the selectivity of these models: false positives are not desirable, but acceptable.

Conclusions
The literature presents many cases in which the natural clustering of the data has been improved by preprocessing the analytical signal used for characterizing the target compounds. However, an improved selectivity is often obtained at the expense of sensitivity of the PCA class assignment, and often the feature weight yielding the most efficient screening system is selected subjectively.
We have proposed a set of quantitative indicators inspired from analytical geometry that may be used for characterizing objectively the sensitivity, selectivity, and overall efficiency of unsupervised pattern recognition applications. Their use has been illustrated by a comparative study on the effect of three spectra preprocessing functions on the efficiency of a PCA based screening system designed to assign the class   of amphetamines according to the type of their biological activity. Analytical geometry indicators have been used for an objective assessment of the cluster relative position and of the cluster density, which are directly influencing the predictive power of the PCA models, and thus the sensitivity and selectivity of the detection system. The results show that the preprocessing function that is yielding the best cluster relative position is not necessarily ensuring the best efficiency in class Table 5: Global clustering quality coefficient (GCQC) assessing the overall efficiency of the PCA class assignment.
Unprocessed spectra 0,8889 Spectra processed with the selective amplifier 0,9774 Spectra processed with the 2 selective amplifier 0,9327 Spectra processed with the ( − 1) 2 amplifying selector 1,2124 assignment. For example, the 2 selective amplifier generates an important cluster dispersion, which shadows its positive effect of yielding the largest distances between the clusters. In order to obtain an objective ranking, we have defined a GCQC that allows the quantitative overall assessment of the efficiency of the PCA based screening systems. The results indicate that the best detection of amphetamines according to their biological activity is obtained when the GC-FTIR spectra are preprocessed with the ( − 1) 2 amplifying selector. Although this spectra preprocessing function yields only a moderate improvement of the cluster relative position, it yields the class assignment because it ensures the best      balance between the distances between the clusters and their dispersion.
The hierarchy indicated by the GCQC has been explained by analyzing the dynamics of the cumulated explained variance, the associated number of PCs, and the estimated density distributions determined for the PC1 and PC2 scores associated with stimulant amphetamines, hallucinogenic amphetamines, and negative compounds. The results indicate that the ( − 1) 2 amplifying selector maximizes the sensitivity of the detection system.
In conclusion, GCQC may be used to compare score plots provided by this nonhierarchical clustering technique in the same way as the cophenetic correlation coefficient may be used for ranking different algorithms generating hierarchical classification trees (dendrograms). The same indicators may be used for deciding the size and nature of the training set. In other words, they can help us make objective evaluations for deciding matters such as what is the best number and nature of samples to include in the training set? Should we keep in the training set samples that are not outliers but are found on the border of a given cluster? How many and which variables may be eliminated without affecting the efficiency of the class assignment? How many models should be built? Another important advantage is that this approach saves the time spent for evaluating the predictive power of the models by more complex classification techniques.
Last but not least, we would like to emphasize that the use of quantitative (analytical geometry and GCQC) indicators presents the major advantage of allowing the automatic evaluation of the cluster relative position and dispersion, as well as of the overall predictive power of the models. The described procedure has been developed as a MATLAB algorithm for the overall assessment of PCA based screening systems.