Classification of the Entities Represented by Samples from Gaussian Distribution

This paper aims to cluster entities described by a data matrix. Under the assumption of normality of the observations contained in each table, each entity is represented by samples from a Gaussian distribution, that is, by the number of measurements in the data matrix, the sample mean vector, and the sample covariance matrix. We propose a new distance based on Mahalanobis's discriminant score to measure the similarity between objects. The present study addresses an important topic of research, not only in the quest for an adequate model of data representation but also in the choice of a distance index between entities that justifies the homogeneity of the observed classes.


Introduction
One of the fundamental problems in automatic classification is the development and validation of similarity indices between the objects to be classified. These indices must be adapted to the objects and must allow measuring the adequacy between an object and a class of objects. For objects described by matrices comprising repeated observations of individuals for the variables that describe them over a finite period of time, we present a new distance based on Mahalanobis's discriminant score to measure the similarity between objects.
Usually, for this type of data, a reduction step precedes the classification stage. We can summarize each table by a vector or by a hyperrectangle, or we can use factorial techniques to reduce each table.
However, these reduction techniques require assumptions that are difficult to satisfy in practice. Indeed, the first type of reduction makes sense only if the mean or another central value perfectly summarizes the observations of each individual, and this reduction does not take into account the variability of the observations. The hyperrectangles are Cartesian products of intervals. The estimated interval depends on the variability of the observations but does not consider the possible relationships between the variables; this type of reduction requires the variables to be uncorrelated. Several distances between interval objects have been extended to distances between hyperrectangles and remain a subject of research in automatic classification. These include the distance based on the city block distance [1], the Hausdorff distance between hyperrectangles, the Wasserstein-based distance [2], and the single adaptive distance [3]. Finally, the third type of reduction leads to new uncorrelated variables but poses significant mathematical problems, such as the search for a compromise space and the number of observations to be used for the reduction of each input table (see [4, 5]). If the number of observations of each variable is the same for each object, the input data can be considered as a structure of data matrices (see [6]).
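To make the hyperrectangle reduction concrete, here is a minimal sketch of a distance between hyperrectangles. The per-dimension interval Hausdorff formula is standard; the city-block aggregation is one of the extensions cited above, and all function names are ours:

```python
import numpy as np

def interval_hausdorff(i1, i2):
    """Hausdorff distance between two intervals [a, b]: max gap of endpoints."""
    return max(abs(i1[0] - i2[0]), abs(i1[1] - i2[1]))

def hyperrectangle_distance(r1, r2):
    """City-block aggregation of per-dimension interval Hausdorff distances.

    r1, r2: arrays of shape (p, 2), one [min, max] interval per variable.
    """
    return sum(interval_hausdorff(a, b) for a, b in zip(r1, r2))

# Two 2-dimensional hyperrectangles
r1 = np.array([[0.0, 1.0], [2.0, 4.0]])
r2 = np.array([[0.5, 1.0], [2.0, 5.0]])
print(hyperrectangle_distance(r1, r2))  # 0.5 + 1.0 = 1.5
```

Note that, as the text points out, this reduction keeps only per-variable ranges: any correlation between the variables inside a table is lost.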
This paper aims to cluster entities described by a data matrix. Under the assumption of normality of the observations contained in each table, each entity is represented by samples from a Gaussian distribution, that is, by the number of measurements in the data matrix, the sample mean vector, and the sample covariance matrix. We define a new distance based on Mahalanobis's discriminant score to measure the similarity between objects and propose an extension of the k-means algorithm to this case. The approach can be extended to cluster objects described by variables subject to measurement errors.

Advances in Decision Sciences
In analogy to the classical squared-error criterion and the k-means algorithm, clustering proceeds here by defining and minimizing a joint clustering (heterogeneity) criterion for a partition (with a given number K of classes) and a set of K class prototypes, that is, the sum of the class-specific sums of dissimilarities between class elements and the corresponding class prototype.
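A sketch of this joint criterion, written for a generic dissimilarity d (here squared Euclidean distance for illustration; the names are ours):

```python
import numpy as np

def criterion(X, labels, prototypes, dist):
    """Sum over classes of dissimilarities between class elements and their prototype."""
    return sum(dist(x, prototypes[k]) for x, k in zip(X, labels))

X = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([10.0, 0.0])]
labels = [0, 0, 1]                                     # partition into K = 2 classes
prototypes = [np.array([0.5, 0.0]), np.array([10.0, 0.0])]
sq = lambda x, g: float(np.sum((x - g) ** 2))          # squared-error dissimilarity
print(criterion(X, labels, prototypes, sq))  # 0.25 + 0.25 + 0.0 = 0.5
```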
The paper is organized as follows. In Section 2, we present the data structure and some references. In Section 3, we introduce the distance index between objects and the steps of the algorithm. In Section 4, we provide a numerical example and carry out a comparative study with the classical approach. In Section 5, we explain how the algorithm is applied to cluster workdays according to the degree of traffic pollution at the city's most important roundabout; in combination with six weather condition parameters measured on the same days, the resulting classes are analyzed and described in terms of these six meteorological characteristics. In Section 6, we draw the corresponding conclusions.

The Data Structure
Let Ω be a set of n objects described by a set of p quantitative variables {Y_j; j = 1, ..., p}. Each object i is described by a data matrix X_i of dimension n_i × p, where x_tj^(i) is the value taken by the t-th observation of the individual i for the variable Y_j.
(i) X_i may, for example, represent the medical record of the patient i for p variables measured over time; x_tj^(i) represents, in this case, the value taken by the patient i for the variable Y_j at time t.
(ii) In our study, X_i contains the values of the seven pollution parameters for the day i over the 24 hours of the day.
The input data are the n tables X_1, ..., X_n.

Classical Approach
In order to compare the objects, we must work in the same reference frame; thus the basic problem of the search for a compromise axis system arises. This problem also concerns other branches of mathematics, especially differential geometry [5]. The criteria proposed in the literature for the search of a compromise space, on which the objects are projected so as to compare them in terms of proximity, are not really justified; the proposed techniques are purely heuristic [7]. Relations between tables are also analyzed with Procrustes analysis and compromise factorial axes in the context of multiple factorial analysis; one important reference is Gardner et al. [8].
(ii) If the matrices X_i all have the same dimension (n_0, p), an algorithm of k-means type based on the Hilbert-Schmidt inner product is proposed in [6] to classify these matrix objects. If the X_i do not have the same dimension, we can envisage a completion step in order to obtain a structure of juxtaposed data tables of the same dimension.
If there exist i ≠ i' such that n_i ≠ n_i', we can use the following procedure to complete the tables. We assume that n_i > 1 for all i = 1, ..., n. Let m be the least common multiple of the n_i: m = LCM(n_i, i = 1, ..., n). There exists k_i such that m = k_i n_i. By duplicating k_i times each table X_i, we obtain a new table T_i of dimension m × p. So, if n is large, the least common multiple necessarily becomes large and the procedure leads to a structure of large tables. Moreover, this completion destroys any chronological order of the data. It therefore seems more reasonable to carry out the classification without this completion step and to study the case where the tables X_i do not have the same dimension, without a reduction stage. If the hypothesis of normality of the observations in each column of the table X_i is verified, this matrix can be considered as grouping realizations of a normal random vector whose distribution parameters (μ_i, Σ_i) are to be estimated; these parameters are estimated empirically from the observations in the input tables. The aim of the present paper is therefore to present a new classification approach based on the k-means algorithm, using a new distance index based on Mahalanobis discriminant scores. The proposed algorithm extends to tables of different dimensions and is validated on real traffic pollution data.
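The completion procedure above can be sketched in a few lines; this is our illustration of the LCM duplication step, not code from the paper:

```python
from math import lcm
import numpy as np

def complete_tables(tables):
    """Duplicate each table so that all share m = LCM(n_i) rows.

    tables: list of (n_i, p) arrays with a common number of columns p.
    Each table X_i is stacked k_i = m / n_i times, as in the text.
    """
    m = lcm(*(t.shape[0] for t in tables))
    return [np.tile(t, (m // t.shape[0], 1)) for t in tables]

tables = [np.ones((2, 3)), np.ones((3, 3))]   # n_1 = 2, n_2 = 3, so m = 6
completed = complete_tables(tables)
print([t.shape for t in completed])  # [(6, 3), (6, 3)]
```

The sketch also makes the drawback visible: with many objects, m grows quickly, and the row duplication discards any chronological ordering.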

Proposed Approach
Each table X_i is summarized by the empirical mean vector and covariance matrix of its observations. These estimators are unbiased, convergent, and consistent.

Classification Algorithm.
We wish to gather the n individuals into K homogeneous classes. The heterogeneity of the classes is measured by a criterion summing the inertias of the classes,

Cr(P, L) = Σ_{k=1}^{K} Σ_{i ∈ C_k} d(z_i, L_k),

where L_k is the prototype or kernel of the class C_k, z_i is the description of the individual i, and d is a distance index between the objects and the prototypes (representative elements) of the classes. This criterion expresses the adequacy between the individuals and the classes to which they are assigned.

Description of Individuals.
We suppose that every table X_i groups a sample of size n_i of a Gaussian random vector with parameters (μ_i, Σ_i). For example, in the case of data with measurement errors, each table groups the repeated observations describing the individual on the p variables; these observations are realizations of the Gaussian random vector. The observations are assumed uncorrelated, so the estimated variance-covariance matrix is of full rank and thus nonsingular. Each individual i is described by z_i = (n_i, μ_i, Σ_i), where (i) n_i ∈ N is the number of observations of the individual i; (ii) μ_i ∈ R^p is the vector containing the estimated means of the p variables; (iii) Σ_i ∈ S_p^+(R), the set of real symmetric positive definite matrices of order p.
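The triple z_i = (n_i, μ_i, Σ_i) is obtained from each table by the usual empirical estimators; a minimal sketch (function name ours):

```python
import numpy as np

def describe(table):
    """Summarize an (n_i, p) table by (n_i, mean vector, covariance matrix)."""
    n_i = table.shape[0]
    mu_i = table.mean(axis=0)
    sigma_i = np.cov(table, rowvar=False)  # unbiased estimator (ddof = 1)
    return n_i, mu_i, sigma_i

rng = np.random.default_rng(0)
X_i = rng.normal(size=(50, 3))             # 50 observations of p = 3 variables
n_i, mu_i, sigma_i = describe(X_i)
print(n_i, mu_i.shape, sigma_i.shape)      # 50 (3,) (3, 3)
```

With n_i > p rows in general position, the estimated covariance matrix is symmetric positive definite, as required by (iii).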

Distance between Individuals.
Let i and j be two individuals described, respectively, by z_i and z_j. We wish to build a distance index which takes the distribution parameters into account. To do this, we use the notion of discriminant score. For a realization x_t^(i) of the individual i, the Mahalanobis discriminant score of this observation with regard to the realizations of the individual j is given by

sc_j(x_t^(i)) = (x_t^(i) − μ_j)ᵀ Σ_j^{−1} (x_t^(i) − μ_j),

where μ_i and μ_j are the mean vectors of the individuals i and j, respectively. Assimilating each of the n_i observations of the individual i to its mean vector (the empirical center of the distribution of its observations) gives

d_1(z_i, z_j) = n_i (μ_i − μ_j)ᵀ Σ_j^{−1} (μ_i − μ_j).

Similar arguments lead to

d_2(z_i, z_j) = n_j (μ_j − μ_i)ᵀ Σ_i^{−1} (μ_j − μ_i).

These scores are positive quantities and express the dissimilarity between the two individuals.
The distance index d is defined as the weighted sum of these two scores, d(z_i, z_j) = d_1(z_i, z_j) + d_2(z_i, z_j). Without loss of generality, we assume that all objects are observed the same number of times, n_i = n for all i = 1, ..., n. We assume that (1) d(z_1, z_2) = 0 ⇔ z_1 = z_2; (2) n > p, a hypothesis which implies that the matrices Σ_1 and Σ_2 are nonsingular.
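A minimal sketch of this score-based distance, under our reading that each score is a Mahalanobis quadratic form weighted by the number of observations (function names ours; other weightings are possible):

```python
import numpy as np

def score(x, mu, sigma):
    """Mahalanobis discriminant score of x with regard to the group (mu, sigma)."""
    d = x - mu
    return float(d @ np.linalg.solve(sigma, d))

def distance(z_i, z_j):
    """Score-based distance between two described objects z = (n, mu, sigma)."""
    n_i, mu_i, s_i = z_i
    n_j, mu_j, s_j = z_j
    return n_i * score(mu_i, mu_j, s_j) + n_j * score(mu_j, mu_i, s_i)

z1 = (10, np.array([0.0, 0.0]), np.eye(2))
z2 = (10, np.array([3.0, 4.0]), np.eye(2))
print(distance(z1, z2))  # 10*25 + 10*25 = 500.0
```

Using `np.linalg.solve` instead of inverting Σ explicitly is the standard numerically stable way to evaluate the quadratic form; hypothesis (2) guarantees the solve is well posed.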
Proof. We focus on the case where n_i = n for all i = 1, ..., n. We seek the prototype (μ*, Σ*) which minimizes the within-class sum of distances. Writing the necessary first-order condition, we note that the first expression does not depend on Σ; expanding (I + A^{−1})^{−1} then yields the stated minimizers.

Remark 2. In the case of measurement errors, the characterization of the obtained classes is not affected and is given exactly, which seems quite natural.

The Distance between an Individual and a Class

The individual i is described by (μ_i, Σ_i), and the class C_k, which contains m_k individuals, is characterized by a prototype L_k = (μ_k*, Σ_k*) ∈ R^p × S_p^+(R). The distance between the individual i and the class C_k is d(z_i, L_k), and the minimum is obtained at the prototype characterized in the preceding section.

Classification Algorithm.
We choose an initial partition P^(0), and in all cases we alternately apply the representation function g and the affectation function f. The algorithm stops as soon as the partition no longer changes. This builds two sequences P^(t) and L^(t).

Proposition 3. The sequence u_t = Cr(P^(t), L^(t)) is decreasing and converges.
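The alternation described above can be sketched as follows. The prototype refit by plain averaging of the class means and covariances is our simplification of the paper's exact minimizer, and all names are ours:

```python
import numpy as np

def score(x, mu, sigma):
    d = x - mu
    return float(d @ np.linalg.solve(sigma, d))

def dist(z, proto):
    """Symmetrized score between an object z = (n, mu, sigma) and a prototype."""
    mu_p, s_p = proto
    _, mu, s = z
    return score(mu, mu_p, s_p) + score(mu_p, mu, s)

def cluster(objects, init, iters=50):
    """k-means-type alternation of affectation and representation steps.

    init: indices of the objects used as initial prototypes.
    """
    protos = [(objects[i][1], objects[i][2]) for i in init]
    labels = None
    for _ in range(iters):
        new = [min(range(len(protos)), key=lambda k: dist(z, protos[k]))
               for z in objects]                     # affectation step
        if new == labels:                            # partition unchanged: stop
            break
        labels = new
        for k in range(len(protos)):                 # representation step
            members = [z for z, l in zip(objects, labels) if l == k]
            if members:
                protos[k] = (np.mean([z[1] for z in members], axis=0),
                             np.mean([z[2] for z in members], axis=0))
    return labels

# Two well-separated groups of four objects each
rng = np.random.default_rng(1)
objs = [(30, np.array([0.0, 0.0]) + 0.1 * rng.normal(size=2), np.eye(2))
        for _ in range(4)]
objs += [(30, np.array([5.0, 5.0]) + 0.1 * rng.normal(size=2), np.eye(2))
         for _ in range(4)]
print(cluster(objs, init=[0, 4]))  # first four objects in one class, last four in the other
```

Each affectation step and each representation step can only decrease the criterion, which is the content of Proposition 3.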

Numerical Illustrative Example
We wish to cluster six objects {1, ..., 6} into 2 clusters. Each object is described by three variables Y_1, Y_2, and Y_3. The three variables are unspecified, and we assume that the condition of normality of the observations is verified. The artificial input data are given in (32); the tables X_1, ..., X_6 do not have the same dimension.

Classical Approach.
Usually, we summarize the observations of each individual by a central value, such as the mean. This method of data reduction can lead to erroneous results, as this numerical example shows.
(i) The mean value of each variable and the coordinates of the final cluster centers are obtained using the k-means algorithm, which yields the final partition. (ii) With the proposed approach, the distances between the objects and the kernels of the classes yield a different final partition. Taking into account the variability caused by measurement errors and running the k-means algorithm with the weighted Mahalanobis distance, individuals 2 and 4 are placed in the same class, which the preceding procedure fails to detect. When the variability of the observations plays an important part in the description of the individuals, a classification made without taking this variability into account leads to results that are incorrect with respect to the reality of the data.
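The effect can be reproduced in a few lines: two mean gaps of equal Euclidean length receive very different score-based distances once the covariance is anisotropic. The numbers below are ours, not the paper's example:

```python
import numpy as np

def score(x, mu, sigma):
    d = x - mu
    return float(d @ np.linalg.solve(sigma, d))

def distance(z_i, z_j):
    n_i, mu_i, s_i = z_i
    n_j, mu_j, s_j = z_j
    return n_i * score(mu_i, mu_j, s_j) + n_j * score(mu_j, mu_i, s_i)

s = np.diag([0.1, 10.0])            # tight along axis 1, loose along axis 2
z0 = (20, np.array([0.0, 0.0]), s)
z1 = (20, np.array([1.0, 0.0]), s)  # gap along the low-variance axis
z2 = (20, np.array([0.0, 1.0]), s)  # gap along the high-variance axis

# A mean-only (Euclidean) reduction sees both gaps as equal to 1,
# while the score-based index separates them sharply:
print(distance(z0, z1), distance(z0, z2))  # ≈ 400.0 vs ≈ 4.0
```

A gap of one unit along a direction where the observations barely vary is strong evidence of distinct objects; the same gap along a high-variance direction is not, and the covariance-weighted index reflects this.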

Application
Two files were used. The first contains the observations of seven parameters measuring the pollution caused by gases emitted by cars at a major intersection of a city. The seven measured pollutants are carbon monoxide, nitrogen monoxide, nitrogen dioxide, PM10 dust, sulfur dioxide, volatile organic compounds, and ozone. These pollutants were measured every hour for each day. The observations cover 420 days without gaps over the past three years, so this file contains 420 tables of dimension 24 × 7. For these 420 days, we built a second file by measuring the daily averages of 6 meteorological parameters: temperature, rainfall, atmospheric pressure, humidity, wind speed, and hours of sunshine. This table is of dimension 420 × 6. The interest lies in the possible relationships between the variables measuring pollution and the meteorological variables. We classify the days into three classes according to the degree of pollution and explain them using the meteorological variables. Each day i is described by 7 curves corresponding to the 7 pollutants; the proposed algorithm, written in Matlab, grouped the 420 days into 3 classes without a reduction step: class 1, "low-pollution days"; class 2, "days of average pollution"; and class 3, "days of high pollution."
As a result, many questions arise: we want to study the relationship between the pollution variables and the weather condition variables, explain the classes in connection with the weather condition variables, and determine the profile of each class with respect to them.
The first approach to this study consisted in summarizing the pollution file (420 tables of dimension 24 × 7) into a table of dimension 420 × 7 by taking the daily average of each pollutant and then studying the relationship between the two groups of variables, "pollution" and "weather conditions." Since the variability of the measurements is thereby removed, we do not pursue this approach.
The results are conclusive; the profile of each class has been explained by the weather conditions variables.
Table 1 shows the discriminating variables; it describes the pollution classes obtained according to the weather condition parameters.
Characterization of the pollution classes by the weather conditions parameters at the station is as follows.
Class 1 is characterized by low temperatures, an important amount of rain, and strong winds with a minimum of sunshine.This corresponds to the class of weather disturbances.
Class 2 is characterized by the category of days with intermediary weather conditions between the stable situation and the weather disturbances situation.
Class 3 is characterized by high temperatures, light rainfalls, and a lot of sunshine.This represents the anticyclonic situation.
According to Table 1, we notice that pressure is not a discriminating variable; that is, it does not help us differentiate between the classes. Conversely, temperature, rainfall, humidity, wind speed, and sunshine do discriminate between the classes. According to [9], pollution episodes in large cities are often related to high atmospheric pressure situations, which represent ideal conditions for the accumulation of pollutants in the air.
Since the weather conditions can nowadays be predicted 2 days ahead, action can be taken, such as regulating the road traffic at the roundabout cited above, to prevent a day from falling into the high-pollution class.

Conclusion
In several scientific disciplines, and particularly in medicine, the variability of the observations plays an important part. The proposed approach permits classifying objects while taking this variability into account. The approach can be extended to the classification of matrix objects, even of different dimensions, and to functional data, and it can integrate the variability into the distribution describing each object for each variable; each object would then be described by a multidimensional distribution. This extension will be developed in future work.
(i) A standardized principal component analysis on each table X_i leads to the construction of p orthogonal factor axes on which the n_i observations of the individual i are projected; we obtain new uncorrelated variables, giving n systems of axes.

Let P = (C_1, ..., C_K) ∈ P_K be a partition and L = (L_1, ..., L_K) ∈ L_K a set of prototypes. We search for (P*, L*) realizing

min_{P ∈ P_K, L ∈ L_K} Cr(P, L). (14)

The algorithms used to solve such problems are of k-means type. They alternate a representation function g and an affectation function f, each application of which decreases the criterion. The representation function satisfies

g: P_K → L_K, P = {C_1, ..., C_K} ↦ g(P) = {L_1, ..., L_K}.

Class of Individuals. We seek the kernel of each class generated by the algorithm. Let {1, ..., m_k} be the m_k individuals of the class C_k, and let h be the map defined, without loss of generality, by h: N × R^p × S_p(R) → R_+. The minimizers μ* and Σ* of h are given by the expressions derived for the class prototype.