Characterizing the Relative Spatial Structure of Point Patterns

We generalize Ripley’s K function to get a new function, M, to characterize the spatial structure of a point pattern relatively to another one. We show that this new approach is pertinent in ecology when space is not homogenous and the size of objects matters. We present how to use the function and test the data against the null hypothesis of independence between points. In a tropical tree data set we detect intraspecific aggregation and interspecific competition.


Introduction
Investigating the spatial structure of point patterns has been a long-time challenge for ecologists.Pielou [1] claimed that the information an ecologist wants to have immediately when observing a point set representing a vegetal community is the density of each species and the existence of interactions between plants.Density can be estimated by various methods [2] but interactions have motivated a living literature for more than half a century.In ecology, Ripley's K function [3], or its square-root transformation L [4], is the most used tool to characterize them [5], assuming the pattern is a realization of a homogenous point process; that is to say the probability to find a point is the same everywhere.
However, identifying interactions under the assumption of nonhomogeneity of space is still an open question.Twenty years ago, Cuzick and Edwards [6] followed by Diggle and Chetwynd [7] paved the way by introducing some specific tests.The D function proposed by Diggle and Chetwynd is defined as the difference between the K function for the studied points (called cases) and the K function for others (called controls).It is not completely satisfactory yet because both K functions are computed separately so all the data contained in the relative position of cases and controls is lost.A more recent important advance was proposed by Baddeley et al. [8] who generalized K to inhomogeneous point processes.They developed a complete theoretical framework but practical applications are still difficult because assumptions are necessary about the relative scales of heterogeneity and interactions, leading to possibly opposite analyses of the results [9].A recent review of these methods can be found in [5].
Other tools were also developed by economists [10] with a different approach, comparing the distribution of points of interest relatively to that of other points.Brülhart and Traeger [11] call them relative measures, as opposed to topographic measures such as K which take space (measured by areas) as their benchmark.
In this study, we introduce a relative measure of spatial structure, namely, the M function [12] to extend the ecologists' toolbox and allow a more pertinent approach when the null hypothesis is that points of interest are distributed like others.It bypasses the issue of heterogeneity and allows weighting points.Moreover, its computation is easy.
The paper is organized as follows: first, we derive the M function as a generalization of Ripley's K function; then, we apply it to theoretical examples and real data sets in a tropical forest and in epidemiology; we finally discuss the way it can be usefully applied after clarifying the assumptions it relies on.

Ripley's K Function
2.1.1.Definition.The theoretical framework is a point process whose realization is observed in a window of area A. Nonparametric methods such as Ripley's K are used to reject the null hypothesis of independence of points.To use Ripley's K, we will assume that the point process is stationary (i.e., its intensity does not vary by translation).All point processes used in this paper are second-order stationary (i.e., interactions between points do not vary under translation) and isotropic (i.e., they do not depend on direction).The null hypothesis to reject is therefore complete spatial randomness (CSR); that is, the point pattern is a realization of a homogenous Poisson process.More details on K can be found in [5] or [13].We focus on its estimator here.
Points are denoted x.We call a point x's neighbors all the points less than r apart from it (all points in a disk of radius r centered on the point x).
Ripley's K function estimator was built by counting neighbors (indexed by n) around reference points (indexed by f ), which can belong to the same type (univariate K function) or not (intertype or bivariate K function).N points are found in the window, and we denote 1( x f −x n ≤ r) the indicator function equal to 1 if the distance between x f and x n is less than or equal to r, 0 else.
An unbiased estimator of univariate K [14] with no edgeeffect correction is The bivariate version of K (denoted K f ,n ) is very similar.We denote N f the number of reference points and N n the number of neighbors.We have (2)

Edge-Effect Correction.
Points located close to the window borders are problematic because a part of the circle inside which points are supposed to be counted is outside the window.Various answers have been proposed to correct for this [15,16].We prefer Besag's [4] correction.Let us denote A f r the part of the area of the circle of radius r centered on the point x f located inside the window.We count the number of neighbors inside the circle, and we correct it by the ratio between the circle's area and its inside part.We suppose that the outside part of the circle would have contained the same neighbor density than the inside part.Finally, an unbiased estimator of K with edge-effect correction is 2.1.3.Normalization.Besag [4] proposed to normalize K to obtain a benchmark of r rather than πr 2 .The well-known L function is defined as L(r) = K(r)/π.It can be interpreted as a distance [17]: L(r) = r + l means that as many neighbors are found around reference points up to distance r as would be expected at distance r + l under CSR.We believe that K(r)/πr 2 is a better normalization.Its reference value is 1, and it can be interpreted as the density of neighbors around reference points divided by the density of neighbors anywhere.

The
r is the number of neighbors divided by the area where it is counted.Its average value is compared to what it is expected to be all over the window, (N − 1)/A.
Topographic measures like K use space as their benchmark; that is, the number of points is divided by an area.The benchmark may also be another point pattern; for example, the number of trees of a species under study may be divided by the total number of neighbor trees, defining relative measures.
We transpose K into a relative framework.The ratio is now built comparing a number of neighbors of interest to the total number of neighbors.Weights can be associated to points without changing the construction of the measure.Reference points are indexed by f (x f is a reference point), neighbor points by n; all points whatever their type (i.e., the benchmark) by a; their numbers are N f , N n , and N a .w i is the weight of point x i , W i = Ni i=1 w i is the total weight of this type of points.
The average weighted ratio of neighbor points around reference points is (1/N f ) In the whole window, the same ratio is points and reference points belong to the same type, (1/N f ) We define the univariate M function as International Journal of Ecology 3 The bivariate M f ,n function is Equations ( 5) and ( 6) are simplified when points are not weighted:

Case-Control Design.
A particular attention must be paid to case-control designs.In practical terms, all points of interest (called cases) are carefully referenced, and the benchmark point set (called controls) is just sampled.This approach has been widely used for spatial clustering of diseases [7, 9, 18-20, among others]: sick people are the cases and the rest of the population the controls.Casecontrol design is of course not limitative to geographical epidemiology and can easily be applied to ecology questions.
The M function defined previously can be slightly modified to take into account this feature.Since the controls are chosen to be a representative sample of the population at every scale, neighbors of any kind are replaced by controls, indexed by a. Reference and neighbor points are N c cases; their total weight is W c .After simplifications, M cases can be written as follows:

Significance.
The first-order property (intensity) of the process must be controlled to allow the detection of the second-order property (nonindependence of points, that is to say interactions between the objects they represent).Thus, a point distribution generated according to the null hypothesis must respect, on the one hand, the local values of the density of the process the point distribution is a realization of and, on the other hand, its points must be distributed independently from each other.The practical difficulty comes from the lack of knowledge of the point process that gave the point distribution, which is its unique available realization.Its first-order property is consequently widely unknown.We can only assume that the actual set of point locations is a good approximation of it, following Duranton and Overman [10].Consequently, we generate random data sets for the univariate and case-control M functions by redistributing the actual point set (type and weight couples) on the actual location set (coordinates).The confidence interval of the null hypothesis is then computed by the Monte Carlo technique [21].
The intertype function must support two null hypotheses [22].The random labeling hypothesis is simulated by permuting the point types, keeping point locations and weights unchanged.The population independence hypothesis is more complex to test.The reference points are kept unchanged, so that the spatial structure of the reference point type is maintained, and all other points are redistributed across the available locations.This allows testing the independence of populations considering the structure of the reference point type.Then, the reference and neighbor types are interchanged and the test is repeated.If both M f ,n and M n, f functions leave their null hypothesis confidence envelope in a range of distances, then population independence is rejected.This test requires that some points do not belong to either the reference or the neighbor point type or there will be nothing to redistribute.More generally, testing the relative spatial structure only makes sense if the tested point types are a small part of the point pattern; see discussion Section 4.1.
The tests based on Monte Carlo simulations are actually not correct because they are repeated at each step of the function (see [23] for an extensive discussion).A global test, without the need of simulations, is available only for K against CSR [14].We can follow Loosmore and Ford's goodness-of-fit (GoF) test to obtain a correct P-value to reject the null hypothesis.We first need to compute the average value of M(r) on all simulations, more exactly [24]: where s is the number of simulations and M i (r) the value of M(r) in the ith simulation.Then, the statistic u i is calculated for the ith simulation by summing on all values of r, where δr is the difference between the next value of r and the present one: The same statistic for the actual data, denoted u, is compared to the simulated values to get a P-value: If u is greater than all simulated values, the P-value to reject the null hypothesis erroneously is around 0. To avoid 0 or 1 P-values, we can assume that another simulation would have given a value of u i higher or lower than u and write P u < 1/s or P u > 1 − 1/s.

3).
No theoretical example is given with weighted points because they are not so easy to understand visually.Three real point patterns are considered then.They do not allow a classical analysis by the K function because of heterogeneity.Cuzick and Edwards [6] introduced the first formal way to deal with nonhomogeneous point processes: they used a dataset (published with the paper) concerning the location of 62 cases of childhood leukemia between 1974 and 1986 in the North Humberside area, England.A control set of 141 children representing the whole concerned population was chosen from the birth register (all weights are 1).They could conclude that the cases were significantly clumped.We use this data set to go further: we are now able to corroborate their conclusion and also to precise the size of aggregates.The M function is computed according to the case-control design, (8).
We overall want to provide evidence of the interest of relative spatial structure in ecology.Trees are considered in a 25 ha plot of tropical rainforest in Paracou field station in French Guiana [25].We investigate the spatial structure of two species, Vouacapoua americana Aublet (Caesalpiniaceae) and Qualea rosea Aublet (Vochysiaceae) in a point set of 11,276 trees above 10 cm diameter at breast height (DBH), excluding flooded zones.All trees above 1 cm diameter have been measured and plotted for a few species, allowing us to study the spatial relations between saplings (up to 10 cm DBH) and possibly reproductive trees (30 cm or more) of V. americana.Points are weighted by the basal area of the tree they represent, the reference and neighbor points are mentioned in the results, and all trees of the maps, including references and neighbors, are used as the benchmark.

Theoretical Examples.
In what follows, we generate a point pattern ("black points," represented by closed circles in the figures) to investigate its spatial structure with the L and M univariate functions.Other points ("grey points" represented by grey open circles) are used by M only: grey and black points together constitute the benchmark.Black points may be considered as trees of a species of interest in a forest plot, while grey points are all other trees.
All confidence intervals are computed at 1% risk level generated from 10,000 simulations.

3.1.1.
Aggregates. 200 grey points are completely randomly distributed.Black points are generated by a Matérn process [26]: 5 aggregates (radius 0.5) of 5 points.All point weights equal 1.The map is in Figure 1; the curves are in Figure 2.
The M curve shape is similar to L's: significant, positive peaks denote concentration.The benchmark points of the M function are distributed almost homogeneously so the number of neighbors around each point is proportional to the area: the relative and the topographic measures are nearly equivalent.Nevertheless, while L peaks approximately correspond to the diameter of aggregates [27], M peaks occur exactly at distances at which the local density is the greatest, that is, approximately the distance between points in the aggregates.The differences are due to the way L is normalized: M(r) peaks occur at the same distance as those of K(r)/πr 2 (not shown on the figure).
3.1.2.Regularity.200 grey points are drawn from a homogenous Poisson process again.100 black points have a regular distribution around a square, 1-by-1 grid, with a perturbation: each point is randomly moved horizontally and vertically within a 0.4 interval around the grid nodes (Figure 3).All point weights equal 1.
The first part of the univariate M curve (Figure 4) is made of 0 values, showing the absence of neighbors at any distance up to 0.6.Note that the univariate L curve shape is different since its original value is 0 and its minimum slope is −1 by construction.Negative peaks of both the univariate L and M functions allow detecting the grid size (before r = 1, 2, and 3).Maximum values correspond to the diagonal of the grid ( √ 2 ≈ 1, 44 is the diagonal length, then √ 5 ≈ 2, 24 and so on).

Inhomogeneous Point Set.
We generated two completely random point sets in a 10-by-10 window.Then, we transformed the point coordinates: after having calculated the polar coordinates (r, θ) of each point from the center of the point set, we squared the distance to get (r 2 , θ).The result is a nonhomogeneous Poisson pattern, shown in Figure 5.Both point types have the same random distribution, but the center of the map shows a greater density.
It can be seen (Figure 6) that the L function is not applicable: assuming homogeneity, it interprets the black point distribution as a single big aggregate.This issue is known as "virtual aggregation" [28,29].The M function is able to control for density variations: since the pattern of the black points does not differ from that of all points, M values are around 1.

Cuzick and Edwards Data
Set.The case-control M function is used (Figure 7).0.7 km apart from cases, the average case density is about 70% higher than it would be if the cases followed the control pattern (at this distance, the peak of the M function reaches 1.7).
In the discussion of [6], Diggle (page 101) suggested that the D function, equal to K cases -K controls , would be appropriate Both methods suffer here a severe lack of power due to the very little number of controls.The confidence envelopes are computed at 10% levels (from 1000 simulations).The GoF test applied to M returns P u = 23%.Diggle and Chetwynd obtained a P-value equal to 14% for D. Increasing the number of controls would not have been a real problem if the experimental design had included a distance-based point pattern analysis.8) contains 156 V. americana, 388 Q. rosea, and 10,732 other trees in a 25 ha plot where flooded zones have been excluded, leaving a 20.06 ha, polygonal shape for the study.

V. americana and Q. rosea. The dataset (map in Figure
Aggregation of both species is detected up from 4-6 meters (Figures 9(a These results suggest competition if our null hypothesis is correct: we suppose that both species could locate anywhere if the other did not impede it.Of course, it might be wrong so further work is necessary to test alternate hypotheses: the environment may be different and niche preferences may be the reason for segregation, or else the spatial distribution of populations may not be in equilibrium, and we may be observing the contact of two colonization fronts.Figure 6: Inhomogeneous point set, univariate L and M functions for black points.L is not pertinent contrarily to M that controls for first-order heterogeneity.

V. americana Regeneration.
The density of V. americana is very variable (Figure 10(a)).The univariate M function is applied to saplings (reference and neighbor points are saplings, the benchmark is all trees; weights are basal areas).Saplings are aggregated (Figure 10(b)) at all distances, P u < 0.1%.The intertype M function shows that potentially reproducing trees repulse saplings (Figure 10(c)), with significant results up to 15 meters.Actually, P u = 6.7% (reference points are potentially reproducing trees, neighbors are saplings, and the benchmark is all trees; weights are basal areas).Jansen et al. [31] already mentioned the absence of seedlings around adult V. americana at short distances (6 meters).Our results show that no sapling can be found less than 3 meters apart from adult trees, but also that the relative abundance of saplings among neighbors is low (these results are significant according to Monte-Carlo simulations), suggesting that Vouacapoua regeneration follows the Janzen-Connell hypothesis [32,33].

Using M.
The theoretical examples illustrate that M(r) is equivalent to K(r)/πr 2 when applied to a homogenous, nonweighted point pattern.The univariate M function is not affected by virtual aggregation when the point pattern is not homogenous, using all points as its benchmark.This means that the points of interest should be a small enough fraction of the whole point set to allow considering the latter as a valid control set.The case-control approach is meaningful when the points of interest must not be included in the control set.Then, a sufficient number of controls is required, or the test against the null hypothesis of independence of points will not be powerful.
Although the M function requires several point types, it is completely different from a bivariate K or L function.This may be illustrated by the example of Section 3.1.1:the univariate M function characterizes the spatial structure of black points, exactly like K; grey points are added to black points to obtain a benchmark for M(r) (the number of points whatever their type less than r apart from each black point), while the disk area πr 2 is the benchmark for K(r).In Section 3.3, the bivariate K function could be computed instead of the bivariate M function: it would only consider the 156 V. americana and 388 Q. rosea trees (and suppose both are distributed homogenously), while M also includes the 10,732 other trees to constitute its benchmark.
Figure 9(a) shows a good example of the behavior of the M function.Confidence envelopes are around the expected value equal to 1.At very low distances, they are not defined if no pair of points less than r apart exists, and the confidence interval is very wide then because of stochasticity, amplified by the little number of point pairs.At long distances, all values tend to 1.When point weights are not homogenously distributed, the envelope is not around 1 (Figure 10).Heuristically, M measures the spatial structure of square centimeters of basal areas of trees: when points are redistributed independently from each other under the null hypothesis, square centimeters are still aggregated.More or less aggregation than under the null hypothesis is detected relatively to its envelope and by the GoF test.As shown by Loosmore and Ford [23], the classical, Monte-Carlo-generated confidence envelope may be too optimistic: while the M curve is clearly out of the 1% envelope, the P-value for V. americana regeneration (Figure 10) is only 6.7% (but no power study for the GoF has been conducted as far as we know).

Relative versus Topographic Measures.
Distance-based measures of spatial concentration can be classified into two main categories [34] following Brülhart and Traeger [11].Topographic ones compare a number of neighbors to a measure of space (a surface area) while relative ones compare it to another number of neighbors.All indices used in ecology are topographic, except for Diggle and Chetwynd M shows that in a 0.7 km radius circle around each case, the case density is 70% higher than expected without aggregation (M = 1.7).
Confidence intervals for the null hypothesis of independence (dotted lines) are computed by Monte-Carlo simulations at the 10% risk level.The poor significance levels are due to too few controls.
in the spatial distribution of firms on a territory, mostly use relative measures: using the distribution of the whole industry as a benchmark to study the spatial structure of an economic sector is even one of the good properties a measure must respect according to Duranton and Overman [10].We believe both frameworks can help addressing ecological questions, as they allow different null models to be tested.The topographic toolbox is already well furnished, with Baddeley et al.'s [8] K inhom and Wiegand and Moloney's [28,35] O-ring allowing to deal with inhomogenous point patterns.Diggle et al. [9] separated control points to evaluate intensity and cases to evaluate dependence in K inhom ; that is to say they used K inhom as a relative measure.The M function is designed for this purpose.It is similar to K inhom with a simple box kernel [13] with bandwidth parameter r used to estimate density around each reference point, but it also allows weighting points.The relative structure of basal areas of trees is more meaningful than that of individuals in many applications (biomass spatial structure, competition for light, etc.): roughly speaking, a big tree is often more similar to many small trees than to a single one.

Conclusion
The M function is defined as a generalization of Ripley's K function to allow its application to inhomogeneous point processes and to take into account point weights.From a more theoretical point of view, it is a weighted, relative measure of spatial structure.We believe that relative measures (comparing a point pattern to another) are powerful tools, even though the topographic approach is more used.To allow the effective use of the M function, we developed the necessary code for R [36], available as a supplementary material available online at doi:10.1155/2012/619281.
reference and neighbor points belong to the same type, N f = N n and W f = W n .

Figure 1 :
Figure 1: Aggregates, Point map.Grey circles are drawn from a homogenous Poisson process.Black disks are generated by a Matérn (radius = 0.5) process.

2. 3 .
Examples.Three theoretical examples are given.Two of them illustrate very simple point patterns on a homogeneous space for a comparison of L and M functions (Sections 3.1.1and 3.1.2).The third one computes an inhomogenous Poisson point process to show how the M function controls for the first-order property of point processes (Section 3.1.

Figure 2 :Figure 3 :
Figure 2: Aggregates, univariate L and M functions for the aggregated point set.Solid curves are the L and M function values; dotted lines are the envelope of the confidence interval of the null hypothesis.Both functions detect clumping.L(r) − r is plotted rather than L(r) for convenience.

6 InternationalFigure 4 :Figure 5 :
Figure 4: Regular point set, univariate L and M functions for the regular point set.Both functions detect dispersion.
) and 9(d)) by the univariate M function.The species repulse each other (Figures9(b) and 9(c)).P u < 0.1% in all cases, according to the bivariate M function.All trees are used as the benchmark in all analyses.

[ 7 ]Figure 7 :
Figure7: Childhood Leukemia epidemiology[6].Map (a): cases (62 circles) are ill children locations; controls (141 crosses) are a sample of the whole population; distances are in km.Cases are significantly clumped, as shown by both functions D (b) and M (c), drawn as solid lines.M shows that in a 0.7 km radius circle around each case, the case density is 70% higher than expected without aggregation (M = 1.7).Confidence intervals for the null hypothesis of independence (dotted lines) are computed by Monte-Carlo simulations at the 10% risk level.The poor significance levels are due to too few controls.

Figure 8 :Figure 9 :Figure 10 :
Figure 8: Map of trees.Vouacapoua americana trees are blue, Qualea rosea red and other trees grey.Circle sizes are proportional to those of the trees.Flooded zones are excluded.Distances are in meters.