Probabilistic Adaptive Crossover Applied to Chilean Wine Classification

Recently, a new crossover technique for genetic algorithms has been proposed. The technique, called probabilistic adaptive crossover (PAX), estimates the probability distribution of the population and stores information about the best and the worst solutions of the problem being solved in a probability vector. This paper reports the use of the technique for Chilean wine classification based on chromatograms obtained from an HPLC. PAX is used in the first stage as the feature selection method, and then support vector machines (SVM) and linear discriminant analysis (LDA) are used as classifiers. The results are compared with those obtained using the standard uniform (discrete) crossover technique and a variant of PAX called mixed crossover.

Genetic algorithms (GAs) are one of the techniques known as evolutionary algorithms (EAs) and are inspired by the concept of biological evolution [26]. The main idea is that each individual of the population is represented by a binary string called a chromosome, where each element of the string is called a gene or locus and the value it can take is called an allele. Each individual in the population corresponds to a possible solution of the problem being solved. An objective function is defined so that each individual has its own fitness. The probability that each individual has offspring is defined according to its fitness. The population evolves by randomly generating new possible solutions through crossover, that is, the way in which the selected parents are mixed to generate the solutions of the next generation. A new fitness scoring is then made, and the possible solutions evolve over the generations [26].
One of the most interesting applications of GAs is feature selection in the pattern recognition area [18][19][20][21][22][23][24][27][28][29]. The methodology introduced by Holland in his 1975 book [30] has been referred to as GAs and was later generalized by De Jong [31] to the N-point crossover case. In 1987, Ackley [32] proposed the uniform crossover technique; however, this method is usually attributed to Syswerda, who presented a theoretical study of uniform crossover in 1989 [33,34].
According to Yang [35], the way in which the crossover is adapted in a GA can be classified into three categories: adaptation of the crossover type [36][37][38], adaptation of the crossover rate [39,40], and adaptation of the crossover position or the probability of change of each bit [41][42][43][44]. The new technique called probabilistic adaptive crossover (PAX), proposed in Salah et al. [26], belongs to the third category.
Mathematical Problems in Engineering
The purpose of this paper is to show an application of the adaptive crossover technique PAX, recently introduced in [26], to an important problem of the Chilean economy related to the production and quality of its wine. In this technique the population distribution is estimated in a new way, rewarding the best individuals and penalizing the worst individuals of the population. Once the estimate of the distribution is obtained, two parents are selected to generate new individuals, transferring the allele of each locus to the offspring in those places where the parents are equal and applying the estimated distribution to determine the alleles in those places where the parents have different values. Other similar techniques have been proposed by Varnamkhasti et al. [45] and by Talaslioglu [46]. In the former, a new crossover operator and probability selection technique are proposed based on the population diversity using a fuzzy logic controller, whereas in the latter a bipopulation-based genetic algorithm with enhanced interval search is introduced. In both cases the proposed methodologies are effective in finding solutions of better performance and quality.
This paper presents the application of this new crossover technique as the feature selection methodology in the problem of classifying the grape variety of Chilean red wines. To this end, a wine database was first built using different commercial Chilean wines of three types: cabernet sauvignon, carménère, and merlot. The classification is based on the information contained in the liquid chromatograms of phenolic compounds provided by an HPLC. A brief description of the PAX methodology is presented in Section 2, and its application to a database of 172 wine samples is developed in Section 3, where it is compared with uniform (discrete) crossover and with a variant of PAX called mixed crossover. Finally, some conclusions are drawn in Section 4.

Probabilistic Adaptive Crossover (PAX)
As mentioned above, a new crossover methodology for genetic algorithms called PAX was proposed in [26]. It uses a probability vector (whose $i$th entry is the probability that the bit in the $i$th position of an individual is equal to 1), which is updated at each generation. The updating procedure is such that the evolution moves away from the worst genotypes and gets closer to the best genotypes. When the crossover is performed, the method retains in the offspring those bits that are equal in both parents and randomly generates those bits that differ between the parents, using the previously updated probability vector. This probability vector is characterized by boundedness, a forgetting factor, and a learning rate. Thus, the methodology feeds back the contribution of the different alleles to the fitness function. This provides the crossover with the kind of memory needed to use the information from previous evaluations. A new way of capturing population statistics is formulated, favoring the best individuals and punishing the worst. This makes it possible to state a new feature selection methodology that mixes two existing groups of methodologies: filter and wrapper. A brief description of PAX is presented in the following, although the reader is referred to [26] for a more detailed explanation of the method.
In the proposed method, those genes for which both parents have the same value are directly transferred to the children. Since natural selection is expected to leave both parents with the same genetic content in certain positions, these values are inherited directly by the children, because they have produced good adaptation in their parents. For those genes where the parents have different values, the method analyzes which of these values produces a greater adaptation to the environment, using the information from the experience of the entire population.
The method uses an adaptive vector of probabilities, which is computed for the entire population and constantly updated. Each entry of this vector represents the probability that the corresponding bit (locus) takes the value 1 in the children, provided that the parents have different values at that position. The vector also takes into account the information of the previous generations and is updated using the fitness of the individuals of the population. In the updating procedure, individuals whose fitness is over the average of the population are treated as positive examples, and individuals whose fitness is below the average are treated as negative examples.
In the updating process the information of previous generations is introduced, weighted by a forgetting factor. This forgetting factor is introduced because the average fitness of the individuals theoretically increases, which is why a solution that was over the average in one generation can be below the average in the next. Thus, the probability vector is updated considering the population history as well as the current population. If the evolution has led to probability values near the extremes, almost all the individuals in the next generation will have the same value. In order to prevent this overfitting, the forgetting factor brings the probability values back towards 0.5.
This updating methodology has another characteristic: the contributions to the adaptive vector of probabilities of the individuals that are over the average are weighted differently. Thus, the individuals that are over the average do not contribute equally to updating the probability vector; rather, each contribution is proportional to the fitness of the individual relative to the others. The same concept is used for the individuals that are below the average, where the individuals with the worst performance are punished most severely. If no allele stands out over the rest at some locus, the corresponding entry of the probability vector should converge to 0.5, which would transform the method into a discrete crossover [47].
The forgetting factor is applied first. This step yields an intermediate probability vector and involves the unity vector $\vec{1}$, which has the same length as the probability vector; the exact expressions are given in [26]. After the application of the forgetting factor, the individuals of the population Pobl are separated into those whose fitness is over the population mean and those whose fitness is under the mean. These two groups are denoted as the subsets pobl+ and pobl−, respectively. Then, the probability vector is updated using a relationship governed by the learning rate, which determines the maximum value that can be used to update one position of the probability vector. This maximum value is reached in the case where all the individuals in the set pobl+ have the same value in one position and all the individuals in the set pobl− have the opposite value in the same position (in the case of a binary alphabet).
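As a rough illustration of the two-step update described above, the following Python sketch first pulls every probability towards 0.5 and then shifts it using the above-average individuals as positive examples and the below-average individuals as negative examples. The exact update equations are those of [26]; the formulas used here, the function name `update_probability_vector`, the parameter names `alpha` (forgetting factor) and `beta` (learning rate), and the clipping bounds are assumptions chosen only to agree with the verbal description.

```python
def update_probability_vector(p, population, fitness, alpha=0.1, beta=0.1,
                              lower=0.3, upper=0.7):
    """Hypothetical PAX-style update; not the exact rule from [26]."""
    # Step 1 (assumed form): the forgetting factor pulls each entry toward 0.5.
    p_mid = [pi + alpha * (0.5 - pi) for pi in p]

    # Split the population around the mean fitness; each individual is
    # weighted by its distance from the mean.
    mean_fit = sum(fitness) / len(fitness)
    above = [(x, f - mean_fit) for x, f in zip(population, fitness) if f > mean_fit]
    below = [(x, mean_fit - f) for x, f in zip(population, fitness) if f < mean_fit]
    w_above = sum(w for _, w in above)
    w_below = sum(w for _, w in below)

    new_p = []
    for j in range(len(p)):
        # Weighted vote in [-1, 1] of the positive examples for allele 1.
        pos = sum(w * (2 * x[j] - 1) for x, w in above) / w_above if w_above else 0.0
        # Same vote for the negative examples, which push the opposite way.
        neg = sum(w * (2 * x[j] - 1) for x, w in below) / w_below if w_below else 0.0
        # Step 2 (assumed form): the shift never exceeds beta in magnitude.
        shifted = p_mid[j] + beta * (pos - neg) / 2
        # Boundedness: keep the probability inside [lower, upper].
        new_p.append(min(upper, max(lower, shifted)))
    return new_p
```

With this form, the largest possible shift `beta` occurs exactly when all above-average individuals share one value at a position and all below-average individuals hold the opposite value, which is the limiting case mentioned in the text.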
Once the probability vector is updated, the individuals that will be the parents of the next generation are selected according to the desired selection scheme (tournament selection [48], ranking selection [49], stochastic universal sampling (SUS) [50], deterministic crowding [51], etc.). Next, the chosen parents are grouped in pairs, each pair giving rise to two new children, which take the value of the parents in those positions where the values of the parents coincide. Each of the remaining positions is set to 1 with the probability given by the corresponding entry of the probability vector. After this, it is possible to apply a tournament between children and parents, elitism operators, or mutation, according to the desired scheme.
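The crossover step just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `pax_crossover` and the optional `rng` argument are hypothetical.

```python
import random

def pax_crossover(parent_a, parent_b, p, rng=random):
    """Return two children of two binary parents, given probability vector p."""
    children = ([], [])
    for a, b, pj in zip(parent_a, parent_b, p):
        for child in children:
            if a == b:
                # Alleles shared by both parents are inherited directly.
                child.append(a)
            else:
                # Where the parents differ, the bit is 1 with probability pj.
                child.append(1 if rng.random() < pj else 0)
    return children
```

For example, `pax_crossover([1, 0, 1, 0], [1, 1, 0, 0], [0.5] * 4)` always yields children whose first bit is 1 and whose last bit is 0, while the two middle bits are drawn at random for each child.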
Tournament selection is a method for selecting the parents for the next generation by running several tournaments between randomly selected individuals. The winner of each tournament is selected as a parent [48]. In the ranking selection method the probability of being selected as a parent for the next generation is assigned proportionally to the fitness of the individual [49]. SUS is an improvement of ranking selection that uses a single random value for selecting the whole population, reducing the bias of ranking selection [50]. In this study the deterministic crowding technique [51] was chosen as the selection method. In this method all the individuals are used as parents for the next generation.
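For readers unfamiliar with deterministic crowding, the replacement step can be sketched as follows. The sketch follows the usual textbook formulation (each child competes with the more similar parent, measured by Hamming distance) and may differ in detail from [51] or from the toolbox used in this study; the function names and the tie-breaking rule in favor of the child are assumptions.

```python
def hamming(x, y):
    """Number of positions at which two binary strings differ."""
    return sum(a != b for a, b in zip(x, y))

def crowding_replacement(p1, p2, c1, c2, fitness):
    """Return the two survivors of one family (two parents, two children)."""
    # Pair each child with its most similar parent.
    if hamming(p1, c1) + hamming(p2, c2) <= hamming(p1, c2) + hamming(p2, c1):
        pairs = [(p1, c1), (p2, c2)]
    else:
        pairs = [(p1, c2), (p2, c1)]
    # A child replaces its paired parent only if it is at least as fit.
    return [c if fitness(c) >= fitness(p) else p for p, c in pairs]
```

Because every individual takes part in exactly one family per generation, all individuals act as parents, which matches the property noted in the text.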
In order to analyze empirically the performance of PAX, the method is compared in [26] with six different crossover methodologies: one- and two-point crossover, uniform crossover, discrete crossover, statistics-based adaptive nonuniform crossover (SANUX) [35], and selective crossover. From the studies performed in [26] over several databases it is concluded that PAX provides better solutions using up to 15% fewer computations than the other crossover methodologies for problems with one global optimum in which each bit contributes individually to the fitness function. On the other hand, for problems with multiple optima or in which each bit does not make an explicit contribution to the fitness function, PAX performs as well as the other methodologies studied.

Application of PAX as Feature Selection for Chilean Wine Classification
Recently, there has been a strong interest in applying techniques for wine classification. Numerous techniques and algorithms have been used to classify not only the variety but also the geographical origin of wines. This classification has been done using physical characteristics (color, density, conductivity, etc.), chemical features (anthocyanins, phenols, amino acids, etc.), and organoleptic characteristics (flavor, aromas, etc.) [23,24,26][52][53][54][55].
In what follows, a methodology based on PAX [26] is presented for selecting the main variables from the viewpoint of wine variety classification, using information coming from liquid chromatograms supplied by an HPLC-DAD [53].

Problem Description.
The information used for wine classification in this study corresponds to polyphenolic compounds of low molecular weight contained in liquid chromatograms obtained from an HPLC-DAD [53]. HPLC is an analytical chemistry technique used to separate the chemical compounds present in a liquid sample, in order to identify each component and to determine its concentration. Generally speaking, the method involves passing the liquid sample over a solid adsorbent material located inside a column, using a flow of liquid solvent. Each compound in the sample interacts differently with the adsorbent material, so that light compounds flow off the column first, whereas heavier compounds flow off much later.
The equipment used in the study corresponds to a Merck-Hitachi model L-4200 UV-Vis detector with a pump and column holder thermostat. The column is a Novapack C18, with a length of 300 mm and an inner diameter of 3.9 mm. For separation of the different phenolic compounds a set of solvents was used. After a series of trial experiments, it was determined that the best gradient to be used in the HPLC for the separation process was that shown in Table 1. Each chromatogram provided by the HPLC is a 90-minute signal sampled every 800 ms and contains 6751 points. Each of the peaks in a chromatogram has been identified by researchers in the areas of chemistry and agronomy and associated with a particular chemical compound [53,56,57].
In general, the approach followed to identify a wine sample has been to identify each peak of the chromatogram (which is associated with a particular chemical compound) and then to use the concentrations of these compounds as features for classification. The hypothesis is that each wine variety has a different combination of these compounds present in the chromatogram, each with a specific concentration. The approach proposed in this paper is to use all of the information contained in the chromatogram (electric signal), without identifying the chemical compounds associated with each peak, using the complete chromatogram as a distinctive mark (a kind of "fingerprint") associated with the wine sample. Here the hypothesis is that wines belonging to the same class have similar distinctive marks and wines belonging to different classes have distinctive marks that differ in some sense.
As an example, Figure 1 shows a typical normalized profile corresponding to a carménère wine sample supplied by the HPLC-DAD. The chromatograms last 90 minutes and contain 6751 points, since the HPLC uses a sampling period of 800 milliseconds.
A spectral analysis of the chromatograms (DFT) revealed that, on average, 99.95% of the power spectrum is located at frequencies below 0.125 Hz. Taking $f_N = 0.125$ Hz as the Nyquist frequency for the data and applying the Shannon sampling theorem [58], it is possible to resample the chromatograms with a sampling period of $T_s = 1/(2 f_N) = 1/(2 \cdot 0.125) = 4$ s, losing about 0.05% of the information. Thus, after resampling the chromatograms at this new sampling period, the number of points to be considered for each chromatogram is reduced from the original 6751 points to only 1350.
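The arithmetic behind these figures can be checked with a short Python snippet. The variable names are illustrative only, and the snippet reproduces just the point counts, not the spectral analysis itself.

```python
duration_s = 90 * 60            # each chromatogram lasts 90 minutes
original_period_s = 0.8         # HPLC sampling period (800 ms)
nyquist_hz = 0.125              # frequency below which 99.95% of the power lies

# Shannon sampling theorem: sample at least twice the highest frequency kept.
new_period_s = 1 / (2 * nyquist_hz)

# 6750 intervals of 0.8 s, plus the initial sample, give the 6751 points.
original_points = round(duration_s / original_period_s) + 1

# Going from 0.8 s to 4 s keeps one sample out of every five.
decimation_factor = round(new_period_s / original_period_s)

# 90 minutes contain 1350 intervals of 4 s.
resampled_intervals = round(duration_s / new_period_s)
```

Under these definitions `new_period_s` is 4.0, `original_points` is 6751, `decimation_factor` is 5, and `resampled_intervals` is 1350.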
On the other hand, the first 5 minutes of the chromatograms were discarded, since this part of the chromatogram basically contains information related to the effluents used to perform the liquid chromatography and does not contain any useful information associated with the wine sample itself. Thus, we finally worked with chromatograms of 1276 points.
Since the size of the maxima depends upon the wine volume injected into the HPLC to obtain the chromatogram, the amplitudes were normalized in order to avoid distortions. In some cases 20 mL of a prepared wine sample was injected into the HPLC, but in others up to 100 mL was used, and, as a consequence, peaks with different magnitudes were obtained in the chromatograms. After normalization, the values of all chromatograms lie in the interval between 0 and 1, which allows the chromatograms to be compared on a common basis. The 172 wine samples available for this study are distributed as shown in Table 2. These are commercial wines from different valleys of the central part of Chile (Maipo, Rapel, Curicó, Maule, and Itata) and are from the 2000 and 2001 vintages.
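One common way to map every chromatogram onto the interval from 0 to 1 is min-max scaling, sketched below in Python. Whether the authors used exactly this form is an assumption; the function name `normalize` is hypothetical.

```python
def normalize(chromatogram):
    """Min-max scaling (assumed form): map the signal onto [0, 1]."""
    lo, hi = min(chromatogram), max(chromatogram)
    if hi == lo:
        # A flat signal carries no amplitude information.
        return [0.0 for _ in chromatogram]
    return [(v - lo) / (hi - lo) for v in chromatogram]
```

With this scaling, two chromatograms that differ only by a constant amplitude factor, as would be expected from different injected volumes, produce the same normalized profile.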
A standard practice in data mining is to separate the data into two sets: one for training-validation purposes and the other for testing. With the first set, all the parameters of the classifier are suitably tuned, and then the test set is presented to the classifier to evaluate its performance. The training-validation and test sets were chosen at random, with the number of samples from each class proportional to the total number of samples available for that class. Based on previous works [23,24,54], 2/3 of the database was used for training-validation and the rest for testing.
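A class-proportional random split of this kind can be sketched in Python as follows. The function name `stratified_split`, the fixed seed, and the rounding rule are assumptions made for illustration; the actual class counts are those of Table 2.

```python
import random

def stratified_split(labels, train_fraction=2/3, seed=0):
    """Split sample indices at random, keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # About 2/3 of each class goes to training-validation.
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

For a hypothetical label list with 9 samples of one class and 6 of another, the function returns 10 training-validation indices and 5 test indices.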

Figure 2: Individual coding, from feature number 1 to feature number 1276.

Methodology Description
3.2.1. Individual Coding. For individual coding purposes binary strings of length $N$ were used, where $N$ is the number of features available for the problem, which in this case is $N = 1276$ (each feature is associated with a point of the chromatogram). The existence of a 1 at the $i$th position indicates that the GA has chosen feature $i$ of the sample to be considered in the classification stage, and a 0 means that this feature should not be considered in the classification process, as shown in Figure 2.
The number of individuals in each new generation has a direct relationship to the amount of computation necessary to find the optimal solution. As this number increases the exploration is wider, but the number of computations is also greater.
In this study a population of 150 individuals was chosen, which is equal to that used in the study reported in [23] on a similar problem. A simple mutation probability of 0.001 was considered, together with 150 generations for each population.
The individuals of the initial population were chosen in the following manner: a random number of features between 1 and 1276 was first chosen. Then that number of ones was randomly located in the binary string of length 1276, giving an individual of the initial population.
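The coding and initialization just described can be sketched in Python as follows. The function names `random_individual` and `select_features` are hypothetical and the sketch is not the authors' MATLAB implementation.

```python
import random

N_FEATURES = 1276  # one feature per point of the resampled chromatogram

def random_individual(n_features=N_FEATURES, rng=random):
    """Build one individual of the initial population."""
    k = rng.randint(1, n_features)                # how many features to switch on
    ones = set(rng.sample(range(n_features), k))  # where the ones are placed
    return [1 if j in ones else 0 for j in range(n_features)]

def select_features(sample, individual):
    """Keep only the chromatogram points whose bit is 1."""
    return [value for value, bit in zip(sample, individual) if bit == 1]
```

For instance, `select_features([5, 6, 7], [1, 0, 1])` returns `[5, 7]`. An initial population of 150 individuals would then be `[random_individual() for _ in range(150)]`.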

Fitness Function.
In order to define the fitness of each individual, classification using SVM was performed with the radial basis function (RBF) as kernel, which depends on the two patterns being compared and on the spread of the RBF. The fitness considers the number of correct classifications normalized by the total number of samples employed in the training. This number is penalized by subtracting a term proportional to the number of selected characteristics, normalized by the total number of characteristics available ($N$) and weighted by a factor $\lambda$. The fitness is then determined by
$$\text{fitness} = \frac{n_c}{n_t} - \lambda \frac{n_f}{N}, \tag{6}$$
where $n_c$ is the number of correct classifications, $n_t$ is the number of samples employed in the training, and $n_f$ is the number of selected characteristics. For this study a value of $\lambda = 0.1$ was considered, chosen by considering the case of an individual (solution) that correctly classifies one sample fewer than another individual but uses 11 fewer features. The percentage of correct classification was obtained by using the K-fold cross-validation [36] methodology over the training-validation set with $K = 6$. K-fold cross-validation consists of separating the data set into $K$ groups, training the classifier with $K - 1$ of them, and then validating with the remaining group. This is repeated for each of the $K$ groups.
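The two ingredients of this subsection can be sketched in Python. The Gaussian form of the kernel, with the spread entering as $2\sigma^2$, is an assumption (other parameterizations of the RBF exist), and the names `rbf_kernel`, `fitness`, and `penalty` are hypothetical; the fitness follows the verbal definition given in the text.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel between two patterns (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def fitness(n_correct, n_samples, n_selected, n_features=1276, penalty=0.1):
    """Accuracy minus a penalty proportional to the fraction of features used."""
    return n_correct / n_samples - penalty * n_selected / n_features
```

With the penalty factor set to 0.1, an individual that selects all 1276 features loses 0.1 of fitness relative to its classification rate, so feature count acts as a secondary criterion next to accuracy.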
Since SVMs are defined for two classes, a scheme was developed in which one class is first separated from the other two ("other classes") and then, using another SVM, the samples classified as "other classes" are separated. Since there are three possible schemes (depending on which class is separated first), all three schemes were implemented. The results obtained from these three schemes were voted on, assigning a sample to a given class when that class is the outcome in two out of the three possible classification schemes.
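The two-stage scheme and the vote can be sketched in Python as follows. The classifiers are passed in as plain functions, so the sketch does not depend on any SVM library; the function names and the choice of returning `None` when no class obtains two votes are assumptions, since the text does not say how such cases are handled.

```python
from collections import Counter

def two_stage_predict(sample, is_first_class, split_rest, first_class):
    """One scheme: separate `first_class` from the rest, then split the rest."""
    if is_first_class(sample):
        return first_class
    return split_rest(sample)

def vote(predictions):
    """Combine the labels produced by the three schemes for one sample."""
    label, count = Counter(predictions).most_common(1)[0]
    # A class is assigned only when at least two schemes agree on it.
    return label if count >= 2 else None
```

For example, `vote(["merlot", "merlot", "carménère"])` returns `"merlot"`.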
To determine the SVM parameters, K-fold cross-validation with $K = 6$ was used. Different combinations of the spread of the radial basis function, between 0.001 and 1, and of the degree of penalization for wrongly classified samples in the SVM, between 1 and 100, were chosen and evaluated. The best parameters in each case are shown in Table 3.
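A parameter search of this kind can be sketched in Python as follows. The scoring function `cv_score` is a placeholder to be supplied by the user (for example, the mean 6-fold accuracy of an SVM trained with the given parameters); the interleaved fold assignment and all names are assumptions for illustration.

```python
from itertools import product

def k_fold_indices(n_samples, k=6):
    """Yield (training, validation) index lists for K-fold cross-validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield training, validation

def grid_search(spreads, penalties, cv_score):
    """Return the (spread, penalty) pair with the best cross-validated score."""
    return max(product(spreads, penalties), key=lambda pair: cv_score(*pair))
```

Each sample appears in exactly one validation fold, so every sample is used once for validation and K − 1 times for training.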

Crossover and Selection Methodology.
In this study the niching method called deterministic crowding [51] was used for selecting the individuals to produce the next generation. Three different crossover techniques were used for feature selection with the GA: discrete crossover, PAX (with a learning rate of 0.1, a forgetting factor of 0.1, and bounds between 0.3 and 0.7), and a modification of PAX called mixed crossover (with the same learning rate, forgetting factor, and bounds). In the mixed crossover methodology the probability vector is not directly updated using each individual's fitness but rather using other measures applied to the data (the Fisher criterion in this study). This combination corresponds to an approximation of a hybrid methodology between the filter feature selection method and the filter-wrapped selection method. It is an open-loop/closed-loop approach that can be applied because in PAX individuals are compared according to the results of the SVM classifier, but they are led towards the feature sets presenting the best characteristics according to the separation measure being used (the Fisher criterion in our case). Figure 3 presents the block diagram of the classification method used here.
The results for the three crossover techniques employed using SVM as the classifier are given in Table 4. Also, the percentage of correct classification obtained in testing is presented, using the features proposed by the best individual of the population in generation 150. Figure 4 shows the average fitness of the best individuals at each generation when using the GA with the mixed, PAX, and discrete crossover methodologies, using SVM as the classifier. Figure 5 shows the 213 characteristics selected by the mixed crossover technique plotted over the chromatogram of a cabernet sauvignon sample (Sample 1). These characteristics correspond to those obtained for the best individual in run 14.
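The Fisher criterion that guides the mixed crossover can be illustrated for a single feature with the following Python sketch. It uses a common definition (between-class scatter divided by within-class scatter); the exact expression used by the authors is not given in the text, so both the formula and the name `fisher_score` are assumptions.

```python
def fisher_score(values, labels):
    """Between-class over within-class scatter of one feature (assumed form)."""
    overall = sum(values) / len(values)
    between = within = 0.0
    for c in set(labels):
        group = [v for v, l in zip(values, labels) if l == c]
        mean_c = sum(group) / len(group)
        between += len(group) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in group)
    return between / within if within > 0 else float("inf")
```

A feature whose class means coincide scores 0, whereas a feature that separates the classes well scores high, which is the kind of open-loop measure that can steer the probability vector independently of the classifier.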
The maximum percentage of correct classification obtained in training-validation and the number of features (N° Char.) selected by the three crossover techniques studied, using LDA as the classifier, are shown in Table 5. The classification rate obtained in testing is also presented, using the characteristics obtained for the best individual of the population after 150 generations.
The average fitness of the best individuals of each generation for the three crossover techniques (mixed, PAX, and discrete) when LDA is used as the classifier is presented in Figure 6. Figure 7 shows the 22 features selected by the mixed crossover methodology, plotted over the chromatogram of a cabernet sauvignon wine sample (Sample 1). These features correspond to those obtained for the best individual in run 14.

Analysis and Discussion of the Results
Simulations were performed on 12 Pentium IV computers of 1.5 GHz and 256 MB RAM running independently. The software used was MATLAB 6.5 and the GENETIC ALGORITHM TOOLBOX [59]. LDA classification was done using the MATLAB toolbox DISCRIM [60], and the SVM classification was done using LIBSVM [61], an open source machine learning library. From Table 4 and Figure 4 it can be seen that the three crossover methodologies achieve similar results, as far as fitness is concerned, when using SVM. The number of selected features differs from one method to another, with the mixed crossover being the one with the smallest number of selected features and with a lower value of the objective function (6).
As far as the convergence speed is concerned, it can be seen from Figure 4 that the fitness obtained with PAX increases during the first generations at a higher rate than with the other techniques. It is important to note that PAX reaches values near 0.85 at around generation 70, whereas the same level is reached by the discrete crossover at around generation 120. This shows a significant computational saving when using PAX as compared with the discrete crossover.
Regarding the speed of convergence, the results obtained when LDA is used as the classifier are similar to those obtained with SVM (see Table 5). However, the number of features selected with LDA is 75% to 80% lower than the number obtained with SVM.
For both classifiers, correct classification rates of around 70% are obtained in testing, which is attributable to the small number of wine samples in the database.
Other results on wine classification, using the same wine database, were obtained by the authors and reported in [62]. There, a feature extraction stage was used instead of the feature selection reported here. Several combinations of feature extraction methods followed by the quadratic discriminant analysis (QDA) classification technique were studied. Average correct classification rates as high as 99% in the set were reached when using the quadratic Fisher transformation (QFT) for feature extraction [63]. These results are superior to those reported here, and a future investigation will analyze why such a large difference arises between the two methodologies and how the results reported here could be improved.

Conclusions
A genetic algorithm with a new crossover technique has been applied to the problem of classifying Chilean red wines. The GA was used as the feature selection method, and the classification is based on the information contained in the liquid chromatograms of phenolic compounds present in the wine samples, obtained from an HPLC. The results obtained using SVM and LDA with a database of 172 wine samples are quite acceptable, and the PAX methodology (standard and modified) presents some advantages over the uniform (discrete) crossover technique.
Better results are obtained when LDA is used as the classifier than when SVM is used. In the first case, classification rates of 71% are achieved on average, against only 68% in the SVM case. More important is the fact that the number of selected features decreases on average from 200 (in the case of SVM) to only 40 for LDA. Nevertheless, this last aspect has to be carefully examined. In fact, since the objective function (6) considers the number of features, suitably weighted through its penalty parameter, as well as the percentage of correct classification, a lower number of features does not necessarily mean better classification rates.
In general, the application of PAX to this wine database reveals that PAX maintains the good characteristics exhibited when applied to standard databases [26], regarding the speed of convergence and the quality of the obtained solutions. Unfortunately, the good results obtained in training-validation are not reflected in testing. This can be attributed to the fact that the small number of samples in the database is not representative enough of the problem under analysis, which could be causing overtraining.
Finally, it is important to point out that PAX allows the development of open-loop/closed-loop optimization methodologies, such as the one presented in the mixed crossover approach, leading to better solutions as long as a suitable fitness function is defined. In this sense, other combinations of open-loop/closed-loop measures applied to the case of SVM, as was done in the case of LDA where the Fisher criterion was used, could improve the results reported here using SVM. This is left as future work in this area.

Figure 3: Block diagram of the classification system.

3.3. Results. Fifteen runs using different randomly chosen initial populations of 150 individuals were analyzed. At each run the sets used for training and for cross-validation were changed, choosing them at random. The maximum percentage of correct classification obtained in training-validation and the number of features selected by the GA (N° Char.) are shown in Table 4.

Figure 4: Average evolution of the best individual using SVM.

Figure 5: Features selected by the mixed crossover technique using SVM, plotted over the chromatogram of wine Sample 1 (cabernet sauvignon).

Figure 6: Average evolution of the best individual using LDA.

Figure 7: Features selected by the mixed crossover technique using LDA, plotted over the chromatogram of wine Sample 1 (cabernet sauvignon).

Table 1: Gradient used by the HPLC.

Table 2: Wine sample distribution according to the three grape varieties.

Table 3: Parameters used for the SVM.

Table 4: Results obtained for feature selection of Chilean wines using SVM.