Depth-Based Classification for Distributions with Nonconvex Support

Halfspace depth has become a popular nonparametric tool for the statistical analysis of multivariate data during the last two decades. One of the applications of data depth considered recently in the literature is the classification problem. The data depth approach is used instead of linear discriminant analysis mostly to avoid parametric assumptions and to obtain a better classifier for data whose distribution is not elliptically symmetric, for example, skewed data. In this paper, we suggest using a weighted version of halfspace depth rather than the halfspace depth itself in order to obtain a lower misclassification rate in the case of “nonconvex” distributions. Simulations show that the results of depth-based classifiers are comparable with linear discriminant analysis for two normal populations, while for nonelliptic distributions the classifier based on weighted halfspace depth outperforms both linear discriminant analysis and the classifier based on the usual (nonweighted) halfspace depth.


Introduction
Data classification or discrimination is an old statistical problem which has been discussed, studied, and applied for more than a century. Therefore we recall the problem only very briefly. The goal of classification is to allocate a new observation to one of two (or more) groups. The rule for assigning a new observation to one of the possible groups is created by analysis of available observations with known group assignment (the so-called training set).
The most popular classification method is based on the normality assumption and is known as linear discriminant analysis (LDA). LDA is easy to use and quite successful in many cases, in particular for elliptically symmetric distributions.
Naturally, there are also nonparametric methods of classification. One of the motivations for replacing the LDA method is to avoid the normality assumption and therefore, hopefully, to get better results for nonelliptic distributions. Recently it was proposed to use data depth as the basis for new nonparametric classifiers; see, for example, [1] or [2].
Let us recall the essential facts about data depth. The depth of a point $x$ with respect to a probability measure $P$ is a nonnegative number $D(x, P)$ which measures the "centrality" of $x$ with respect to $P$. In other words, depth should reflect the position of the point with respect to the probability distribution or, in the sample version, the position of the point with respect to the observed data cloud. Recall also that for multivariate data there is no natural linear ordering available. The depth, as a nonnegative (and bounded) number, allows us to define a depth-based ordering of multivariate data. Points of high depth are "central" while points of low depth are "outliers"; hence we speak about the "centre-outward" ordering.
There are many ways to define data depth. The most popular and best known are the halfspace (Tukey) depth [3] and the simplicial depth [4]. Other depths include the zonoid depth [5] and the $L_1$ depth [6], among many others. A complete overview of data depth, its properties, computational aspects, and applications may be found in Liu et al. [7].
The main advantage of LDA is its computational simplicity. On the other hand, the assumption of normality is inadequate in some situations, and depth-based methods are more appropriate, for example, for nonconvex distributions (in the sense of nonconvex level sets of the density or even nonconvex support of the distribution) or for the classification of functional data (observations in a function space) [8]. Unfortunately, the computation of most data depth functions (even of the most popular halfspace and simplicial depths) is very slow, in particular in higher dimensions. The $L_1$ depth is computationally much easier, but it has several disadvantages for classification. For example, points far from the convex hull of all training sets still have positive $L_1$ depth. Therefore we restrict ourselves to the halfspace depth and its generalization; the simplicial depth and its generalization would give very similar conclusions.
The paper is divided into two main parts. In the first part we introduce the max-depth classification rule. Then the halfspace depth function is defined, and the maximal depth classifier for the halfspace depth is discussed. Some disadvantages of this classifier suggest changing some of the depth properties in order to improve the depth-based classifier. We then define the generalized halfspace depth and show that, thanks to its "local" property, it may be considered a better choice of depth for depth-based classification than the original halfspace depth.
Section 3 is devoted to simulations. The simulation study shows that using the generalized halfspace depth in the maximal depth classifier is usually better than using the halfspace depth itself, regardless of the underlying distribution. In particular, it is shown that the classification based on the generalized halfspace depth [9] gives results comparable to those of the halfspace depth classification and LDA for normal populations, while it is much better for "nonconvex" distributions.

Classification Based on a Depth Function
In what follows we consider $k \ge 2$ populations $\mathcal{P}_1, \dots, \mathcal{P}_k$ with unknown absolutely continuous $d$-variate probability distributions $P_i$, $i = 1, \dots, k$, respectively. The corresponding empirical probability measures $\hat{P}_i$, $i = 1, \dots, k$, may be easily calculated from the training sets $Y_{i,j}$, where $j = 1, \dots, n_i$, and $Y_{i,1}, \dots, Y_{i,n_i}$ are iid observations from the distribution $P_i$. Having a new observation $X = x$ (independent of the training sets and following an unknown one of the distributions $P_i$), the problem is to find a rule (measurable map) $c : x \in \mathbb{R}^d \mapsto \{1, \dots, k\}$ allocating the observation to a population $\mathcal{P}_{c(x)}$. In other words, the classification rule $c$ means that the unknown distribution of $X$ is estimated by $P_{c(X)}$. As a usual measure of the quality of classification, the misclassification rates are considered in what follows, that is, the probabilities $P_i[c(X) \ne i]$ if $X$ follows the distribution $P_i$.
The classifier which allocates a new observation to the population with respect to which it has maximal depth is known as the maximal depth classifier. This idea was used by Jörnsten [1], Hartikainen and Oja [10], and Mosler and Hoberg [11].

Halfspace Depth.
The halfspace depth is the most popular and the most frequently used depth function. Briefly speaking, the halfspace depth of a point $x$ is the minimal probability of any closed halfspace which contains this point. A more formal definition follows.
Definition 1. Let $P$ be a probability measure on $\mathbb{R}^d$. The halfspace depth of the point $x \in \mathbb{R}^d$ with respect to $P$ is defined as
$$\mathrm{HD}(x, P) = \inf \{ P(H) : H \text{ is a closed halfspace in } \mathbb{R}^d,\ x \in H \}.$$
The sample version of the halfspace depth is defined as $\mathrm{HD}(x, \hat{P})$, where $\hat{P}$ is the empirical probability measure. The maximal possible depth is 1/2 for symmetric distributions (1/2 is attained at the centre of symmetry in such a case); for general distributions the maximal depth may be much lower. Points that are not in the convex hull of the support of the distribution (the convex hull of the observations) have zero depth.
Halfspace depth has many desirable properties; see the discussion in Zuo and Serfling [12]. In particular, the halfspace depth is strongly consistent, affine invariant, vanishing at infinity, and maximal at the centre of symmetry (if it exists); there exists a unique deepest point for any absolutely continuous distribution, the depth is decreasing along rays from the deepest point, the depth level sets are always convex, and so forth.
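The sample version can be approximated numerically. The following Python sketch is only an illustration for bivariate data, assuming that a finite grid of directions is accurate enough; the function name `halfspace_depth_2d` and the parameter `n_dirs` are hypothetical and do not come from the paper.

```python
import math


def halfspace_depth_2d(x, sample, n_dirs=360):
    """Approximate sample halfspace depth of the point x in the plane.

    For each of n_dirs unit directions u, compute the fraction of sample
    points y lying in the closed halfspace {y : u.(y - x) >= 0}; the depth
    is the smallest such fraction. Because only a finite grid of directions
    is scanned, the result is an upper bound on the exact sample depth.
    """
    n = len(sample)
    best = 1.0
    for j in range(n_dirs):
        angle = 2.0 * math.pi * j / n_dirs
        ux, uy = math.cos(angle), math.sin(angle)
        count = sum(1 for (y1, y2) in sample
                    if ux * (y1 - x[0]) + uy * (y2 - x[1]) >= 0.0)
        best = min(best, count / n)
    return best


# Toy data: the four corners of a square centred at the origin.
corners = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
depth_centre = halfspace_depth_2d((0.0, 0.0), corners)
depth_outside = halfspace_depth_2d((3.0, 0.0), corners)
```

For this toy sample the centre of symmetry attains depth 1/2, while the point outside the convex hull of the observations has zero depth, in agreement with the remarks following Definition 1.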
Ghosh and Chaudhuri [2] have proved that the maximal depth classifier based on the halfspace depth is asymptotically optimal (it has the lowest possible average misclassification rate) if the considered distributions are elliptically symmetric with the density function strictly decreasing in every direction from its centre of symmetry, differ only in location and rotation and have equal prior probabilities.We will show that for nonelliptic distributions the halfspace depth classifier is far from being optimal.

Problem of Distributions with Nonconvex Density Level Sets.
One particularly important property of the halfspace depth is its quasiconcavity. Quasiconcavity means that the level sets (and equivalently the central regions) of halfspace depth are always convex; that is, for any $\alpha \ge 0$ and for any probability measure $P$ the set $\{x \in \mathbb{R}^d : \mathrm{HD}(x, P) \ge \alpha\}$ is convex (or empty). However, this property becomes a disadvantage for classification if the level sets of the density function are not convex.
Let us show an example of two bivariate distributions with disjoint supports such that LDA has a misclassification rate almost equal to 1/2 (the estimated normal distributions are almost the same) and the central regions of halfspace depth have large intersections; hence the misclassification rate of the halfspace depth classifier is also relatively high.
Example 2. Consider two bivariate uniform distributions $P_1$ and $P_2$ on fixed bounded sets (supports). The (nonconvex) supports of the distributions are disjoint, and both are symmetric about the origin. The supports are plotted as grey areas in Figure 1.
Although the supports are disjoint and the classification may seem to be an easy task, this is not true at all. The halfspace depth central regions are quite similar and have large intersections rather than being disjoint. In Figure 1 the contours of the 0.25, 0.50, and 0.75 central regions (a multivariate analogy of univariate quantiles) are shown. Clearly the discrimination cannot be good in this situation. In Section 3 it will be shown that the LDA is even worse in this case (almost as bad as a random classifier) since the estimated means and variance matrices are almost equal in most simulations. The problem obviously lies in the construction of the supports $\mathrm{supp}(P_1)$ and $\mathrm{supp}(P_2)$. The convexity of both the halfspace depth level sets and the normal density level sets is in sharp contrast with the nonconvexity of the underlying distributions $P_1$ and $P_2$. Therefore it is natural to look for a depth function which may reflect this nonconvexity.

Generalised Halfspace Depth.
Motivated by Example 2 we may try to use a generalised version of halfspace depth proposed by Hlubinka et al. [9]. The main idea of the generalisation is to use a weighted probability in the halfspace rather than the probability of the halfspace itself. The generalised halfspace depth may also be considered a localised version of the halfspace depth. Recall that the depth reflects the global position of a point with respect to the distribution. The localised depth should also include more local properties of the distribution besides the global position. We will see that adding locality may improve the performance of the max-depth classifier substantially.
Let us first clarify the connection between the weighted halfspace depth and the halfspace depth itself. If we consider a weight function equal to 1 on the whole halfspace, that is, $w(x) = 1$ for all $x \in \mathbb{R}^d$ with $x_d \ge 0$, we get a depth function which can be easily "converted" to the halfspace depth, since
$$\mathrm{HD}(x, P) = \frac{\mathrm{GD}(x, P)}{1 + \mathrm{GD}(x, P)} \quad (5)$$
in this case. As this transformation is monotone, the ordering based on the weighted halfspace depth is the same as the ordering based on the halfspace depth in this case. While the halfspace depth of a point $x$ can be interpreted as the minimal probability of any closed halfspace containing this point, the weighted halfspace depth is the minimal ratio of probabilities of complementary halfspaces. The difference between the weighted halfspace depth and the halfspace depth might be observed only when other weights are considered. Weights are used to "localize" the depth: greater weight might be assigned to points which are "near" the considered point $x$.
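Relation (5) can be checked numerically on a sample. The sketch below assumes unit weights, so that the weighted depth is the minimal ratio of the empirical probabilities of complementary halfspaces, and it scans a finite grid of directions; the helper names `direction_counts`, `halfspace_depth`, and `ratio_depth` are hypothetical.

```python
import math
import random


def direction_counts(x, sample, n_dirs=360):
    """For each grid direction u, yield the number of sample points in the
    closed halfspace {y : u.(y - x) >= 0}."""
    for j in range(n_dirs):
        angle = 2.0 * math.pi * j / n_dirs
        ux, uy = math.cos(angle), math.sin(angle)
        yield sum(1 for (y1, y2) in sample
                  if ux * (y1 - x[0]) + uy * (y2 - x[1]) >= 0.0)


def halfspace_depth(x, sample):
    """Minimal empirical probability of a closed halfspace containing x."""
    n = len(sample)
    return min(c / n for c in direction_counts(x, sample))


def ratio_depth(x, sample):
    """Unit-weight version of the weighted depth (an assumption of this
    sketch): minimal ratio of the mass of a closed halfspace containing x
    to the mass of its complement."""
    n = len(sample)
    return min(math.inf if c == n else c / (n - c)
               for c in direction_counts(x, sample))


rng = random.Random(1)  # fixed seed so that the example is reproducible
sample = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
x = (0.3, -0.2)
hd = halfspace_depth(x, sample)
gd = ratio_depth(x, sample)
converted = gd / (1.0 + gd)  # right-hand side of (5)
```

Since the map $t \mapsto t / (1 + t)$ is increasing, the direction minimising the ratio also minimises the halfspace probability, so `converted` and `hd` agree up to rounding error.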
One particularly useful choice of the weights will be presented in the following paragraphs.
The weighted halfspace depth loses some properties of the halfspace depth. The depth is still strongly consistent for "regular" weight functions [9]. It is invariant under translation and rotation only, not under scaling; hence, it is generally recommended to use one of the transformation techniques [13] to obtain an affine invariant version. There is a unique deepest point for symmetric absolutely continuous distributions, but for general distributions several deepest points may exist (which, once more, can even be an advantage for classification). The central regions of the generalised halfspace depth need not be convex; they need not even be connected sets. This is mainly due to the fact that a weight function may emphasise the "local" properties of the underlying distribution.
One of the weight functions which is simple to use and has the "localisation" property is the band weight function, defined in the $d$-variate case by formula (6), where $h > 0$ is half of the bandwidth. We shall call the generalised halfspace depth with the band weight function the band depth. From the geometrical point of view, we restrict our attention from the whole halfspace to a band only (where the weight is nonzero). The band contains points whose first $d - 1$ coordinates are close to the first $d - 1$ coordinates of the considered point $x$.
The choice of the bandwidth $h$ may be based on the training set. Clearly, the smaller the value of $h$, the more "localised" the band depth is, while for higher $h$ the band depth approximates the halfspace depth. Namely, in Example 2 the band depth is equivalent to the halfspace depth if $h \ge 2$.
The main difficulty in choosing $h$ lies, however, in the computation of the depth. Depth computation is relatively slow (compared to LDA) as it is necessary to calculate the depth for each observation. Obviously, for an optimal choice of $h$ the calculation of the band depth must be repeated for different values of $h$; hence the time needed for classification grows rapidly. In Figure 2 we can see that the shapes of the band depth central regions "mimic" the shape of the supports. The value $h = 0.15$ is used here. We tried several choices, and $h = 0.15$ was chosen so that the borders of the central regions are sufficiently "smooth" while the central regions do not cover much area without observations from the training sets. Comparing the contours in Figures 1 and 2, it is not surprising that the results of the band depth-based classification are much better than the corresponding results for the halfspace depth-based classifier.
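The effect of the bandwidth can be illustrated on a small bivariate sample. The sketch below rests on several assumptions that are not confirmed by the text above: the band weight is taken to be the indicator of a band of half-width `h` around the line through the considered point, the depth is the minimal ratio of the numbers of band observations on the two sides of the point, the ratio is set to zero whenever the numerator is empty, and only a finite grid of directions is scanned. The name `band_depth_2d` is hypothetical.

```python
import math


def band_depth_2d(x, sample, h, n_dirs=360):
    """Sketch of a sample band depth in the plane (indicator band weight is
    an assumption of this example, not a formula taken from the paper)."""
    best = math.inf
    for j in range(n_dirs):
        angle = 2.0 * math.pi * j / n_dirs
        ux, uy = math.cos(angle), math.sin(angle)
        above = below = 0
        for (y1, y2) in sample:
            d1, d2 = y1 - x[0], y2 - x[1]
            across = -uy * d1 + ux * d2  # coordinate orthogonal to u
            if abs(across) > h:
                continue                 # observation lies outside the band
            along = ux * d1 + uy * d2    # coordinate in the direction u
            if along >= 0.0:
                above += 1
            else:
                below += 1
        if above == 0:
            return 0.0                   # one half of the band is empty
        ratio = math.inf if below == 0 else above / below
        best = min(best, ratio)
    return best


# L-shaped sample: a nonconvex "support" made of two perpendicular arms.
horizontal = [(0.1 * i, 0.0) for i in range(21)]
vertical = [(0.0, 0.1 * i) for i in range(1, 21)]
sample = horizontal + vertical

x_gap = (0.8, 0.8)   # inside the convex hull, far from both arms
x_arm = sample[10]   # an observation on the horizontal arm

depth_gap_narrow = band_depth_2d(x_gap, sample, h=0.2)
depth_gap_wide = band_depth_2d(x_gap, sample, h=100.0)
depth_arm_narrow = band_depth_2d(x_arm, sample, h=0.2)
```

With the narrow band the point in the gap has zero depth although it lies inside the convex hull of the observations, whereas with a very wide band the whole halfspace is used and the depth of the same point is positive.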

Observations of Zero Depth.
Most depth functions assign zero depth to points outside the convex hull of the support of the distribution. In the empirical version this means that points outside the convex hull of a training set have zero depth. In the case of the weighted halfspace depth there may exist points in the convex hull of the support with zero depth; see Figure 2.
Let us consider the situation in which a new observation $x$ satisfies $D(x, \hat{P}_i) = 0$ for all $i = 1, \dots, k$; that is, the observation has zero depth with respect to all training sets. What should the classification $c(x)$ be in this case? There are several ways to handle this situation.
(1) Another depth function which is positive at any point (e.g., the Mahalanobis depth or the $L_1$ depth) may be used. This method was used, for example, by Mosler and Hoberg [11], who combined the zonoid and Mahalanobis depths.
(2) $c(x)$ may be chosen to minimise the Euclidean distance from the (nearest) deepest point.
(3) $c(x)$ may be chosen to minimise the Euclidean distance from a specific central region (or regions).
(4) Central regions may be inflated, and $c(x)$ is chosen such that the inflation needed to cover $x$ is the smallest one.
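Rule (2) above can be sketched as follows. The example assumes bivariate data, approximates the halfspace depth on a finite grid of directions, and searches for the deepest point among the observations of each training set only; these simplifications and the function names are assumptions of the sketch, not part of the paper.

```python
import math


def halfspace_depth_2d(x, sample, n_dirs=360):
    """Approximate sample halfspace depth (minimum over a direction grid)."""
    n = len(sample)
    best = 1.0
    for j in range(n_dirs):
        angle = 2.0 * math.pi * j / n_dirs
        ux, uy = math.cos(angle), math.sin(angle)
        count = sum(1 for (y1, y2) in sample
                    if ux * (y1 - x[0]) + uy * (y2 - x[1]) >= 0.0)
        best = min(best, count / n)
    return best


def deepest_sample_point(sample, depth=halfspace_depth_2d):
    """Observation with the highest depth within its own training set."""
    return max(sample, key=lambda y: depth(y, sample))


def nearest_deepest_point_rule(x, training_sets, depth=halfspace_depth_2d):
    """Allocate x to the population (numbered from 1) whose deepest
    observation is closest to x in the Euclidean distance."""
    centres = [deepest_sample_point(ts, depth) for ts in training_sets]
    dists = [math.hypot(x[0] - c[0], x[1] - c[1]) for c in centres]
    return dists.index(min(dists)) + 1


# Two toy training sets: 3 x 3 grids centred at (0, 0) and at (10, 0).
grid_a = [(float(i), float(j)) for i in (-1, 0, 1) for j in (-1, 0, 1)]
grid_b = [(10.0 + i, float(j)) for i in (-1, 0, 1) for j in (-1, 0, 1)]

x_new = (3.0, 0.0)  # outside both convex hulls, hence zero depth in both
label = nearest_deepest_point_rule(x_new, [grid_a, grid_b])
```

In this toy configuration the new observation has zero halfspace depth with respect to both training sets and is allocated to the first population, whose deepest observation is closer.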

Simulation Study
Although depth-based classification is computationally slower than LDA, in some situations it gives much better results. In what follows three different situations are considered to show the strengths and weaknesses of the discussed classification procedures. The maximal depth classifier which uses the band depth is compared to the classifier which uses the halfspace depth and to the classical LDA.
In all examples two subpopulations $\mathcal{P}_1$ and $\mathcal{P}_2$ of a total population $\mathcal{P}$ are considered. The sizes of the respective training sets are assumed to be equal. First, training sets consisting of 100 observations each are considered; later, the size of each training set is 1000.
Observations of zero depth are allocated randomly to one of the populations.This rule is very naïve, and the misclassification rates of depth-based classifications are therefore overestimated in what follows: in particular, the overestimation is high for the smaller training sets.
Let us now introduce three situations which are considered in the simulation study.
(i) In the first simulation a mixture of two bivariate normal distributions with the same variance-covariance matrix and different means is considered. Therefore the LDA should be the best in this case.
(ii) The second situation is taken from Example 2. Figures 1 and 2 suggest that the classifier based on the band depth is more promising than the other two classifiers.
(iii) In the third situation two uniform distributions with disjoint supports formed by rectangles are considered; the support of the first distribution is nonconvex, and its convex hull contains one-third of the (convex) support of the second distribution.
The formal description of the three different pairs of populations is given in Table 1.
In the next three paragraphs we discuss the results for the three considered classification problems. The summary of the simulation studies is presented in Table 2 and in Figure 3.
Situation 1 (two normal distributions). The bandwidth $h$ of the weight function (6) is set to $h = 6$. This value was chosen ad hoc to show the difference between the band depth and the halfspace depth, since the optimal value is $h = \infty$ (the halfspace depth is proven to be optimal in this situation).
For small training sets the misclassification rate is in general higher for depth-based classification. This is partly due to the relatively higher number of observations with zero depth than for large training sets. For larger training sets the misclassification rates are almost equal for the three methods.
The simulation has shown that the depth-based classifiers are comparable to optimal classification procedures (LDA) in standard situations.
Situation 2 (uniform distribution on circle). We have already stated that the band depth is expected to give the best results in this situation. Not surprisingly, the misclassification rate of the LDA is close to 1/2 with extremely high variability, and the variability does not decrease with the size of the training sets. This is mainly due to the fact that the estimates of the mean values and variance matrices of $P_1$ and $P_2$, respectively, are very similar in most simulations.
The classification based on the band depth is superior to both LDA and the halfspace depth classification. The superiority is particularly clear for the larger training sets (1000 points each). The results for the smaller training sets are biased, mainly due to the chosen naive allocation of points of zero depth, and the misclassification rate is highly overestimated, in particular for the band depth.
Situation 3 (uniform distribution on disjoint rectangles). This is a very different situation since it is not symmetric from the point of view of the subpopulations. The convex hull of the support of distribution $P_1$ contains one-third of the support of distribution $P_2$ of population $\mathcal{P}_2$. On the other hand, the (convex) support of distribution $P_2$ and the support of distribution $P_1$ are disjoint. Therefore it may be expected that many points of population $\mathcal{P}_2$ will be allocated to population $\mathcal{P}_1$ while the opposite case will be quite rare. This is indeed true for the LDA and the halfspace depth classification.
The classification based on the band depth has a similarly small misclassification rate for both populations. The main reason is that the band depth of points inside the convex hull of the support of $P_1$ but outside the support itself is zero or very close to zero. For the halfspace depth almost 1/3 of the points of population $\mathcal{P}_2$ have higher depth in population $\mathcal{P}_1$, and this corresponds to the misclassification rate of the halfspace depth.

Conclusion.
The misclassification rate of the LDA, which is low for normal distributions, may be very high for distributions with nonconvex supports (or for distributions with nonconvex level sets of the density). The nonparametric approach using the halfspace depth may be better in such situations (Situation 2) but may also be worse (Situation 3).
In general it may be recommended to use the generalised halfspace depth instead of the halfspace depth if possible. The advantage of the generalised halfspace depth (the simple band depth was used here) is shown on the extreme Situations 2 and 3 above to emphasise its properties. Hence it may be concluded that the generalised halfspace depth is much better for distributions with nonconvex level sets. Situation 3 was chosen to show that for different types of distributions of the populations $\mathcal{P}_i$ the misclassification rate is similar for each population when the generalised halfspace depth is used, while for the LDA and the classification based on the halfspace depth the misclassification rates differ a lot between the considered populations.
The main disadvantage of the generalised halfspace depth is the need to choose the weight function and its parameter. It seems that the band weight function (6) is good enough, but it need not be optimal, and even if we choose the band weight function the problem of choosing the bandwidth $h$ remains.

2.1. Maximal Depth Classifier.
The basic depth-based rule for classification is simple and natural. Consider a depth function $D$. Compute the depth of a new observation $x$ with respect to all empirical distributions $\hat{P}_1, \dots, \hat{P}_k$ and allocate $x$ to the population with respect to which it has maximal depth:
$$c(x) = \arg\max_{i = 1, \dots, k} D(x, \hat{P}_i). \quad (1)$$
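The rule can be sketched in a few lines of Python. The example below is only an illustration: it uses an approximate bivariate halfspace depth computed on a finite grid of directions, and it breaks ties at random, which also covers observations of zero depth with respect to all training sets; the function names are hypothetical.

```python
import math
import random


def halfspace_depth_2d(x, sample, n_dirs=360):
    """Approximate sample halfspace depth (minimum over a direction grid)."""
    n = len(sample)
    best = 1.0
    for j in range(n_dirs):
        angle = 2.0 * math.pi * j / n_dirs
        ux, uy = math.cos(angle), math.sin(angle)
        count = sum(1 for (y1, y2) in sample
                    if ux * (y1 - x[0]) + uy * (y2 - x[1]) >= 0.0)
        best = min(best, count / n)
    return best


def max_depth_classify(x, training_sets, depth=halfspace_depth_2d, rng=random):
    """Return the label (1, ..., k) of the training set in which x is deepest.

    Ties, including the case of zero depth in all training sets, are broken
    at random (an assumption of this sketch).
    """
    depths = [depth(x, ts) for ts in training_sets]
    top = max(depths)
    winners = [i for i, d in enumerate(depths) if d == top]
    return rng.choice(winners) + 1


# Two toy training sets: 3 x 3 grids centred at (0, 0) and at (10, 0).
grid_a = [(float(i), float(j)) for i in (-1, 0, 1) for j in (-1, 0, 1)]
grid_b = [(10.0 + i, float(j)) for i in (-1, 0, 1) for j in (-1, 0, 1)]
sets = [grid_a, grid_b]

label_near_a = max_depth_classify((0.2, 0.1), sets)
label_near_b = max_depth_classify((10.2, -0.1), sets)
label_far = max_depth_classify((5.0, 0.0), sets, rng=random.Random(0))
```

Any other depth function with the same calling convention may be passed through the `depth` argument, so the same rule can be used with a localised depth.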

Figure 1: Supports of two bivariate uniform distributions (represented by grey areas). The solid curves are the estimated halfspace depth central regions containing 25, 50, and 75 percent of the deepest points.

Rules 1 and 2 are almost useless in Example 2 since the origin is the theoretical deepest point of both populations, and the Mahalanobis depth gives almost the same value for both training sets at any point. Method 4 works well for star-shaped central regions (as in Example 2).