This paper presents a spatiotemporal analysis of hotspot areas based on the Extended Fuzzy C-Means (EFCM) method implemented in a geographic information system (GIS). The method has been adapted to detect spatial areas with high concentrations of events and tested by studying their temporal evolution. The data consist of georeferenced patterns corresponding to the residences of patients in the district of Naples (Italy) who underwent surgery on the oto-laryngopharyngeal apparatus between 2008 and 2012.
1. Introduction
In a GIS, the impact of a phenomenon on a specific area due to the proximity of an event (e.g., the impact area of an earthquake, or the constraint area around a river basin) is analyzed using buffer-area geoprocessing functions. Given a geospatial event topologically represented as a georeferenced point, line, or areal element, an atomic buffer area is constituted by circular areas centered on the element. For example, if the event is the epicenter of an earthquake, georeferenced by a point, a set of buffer areas is formed by concentric circular areas around that point; the radius of each circular buffer area is defined a priori.
When an area of impact cannot be defined statically and we need to determine the area affected by a consistent set of events, we are faced with the problem of detecting this area as a cluster in which the georeferenced events are concentrated. These clusters are georeferenced, represented as polygons on the map, and called hotspot areas.
The study of hotspot areas is vital in many disciplines, such as crime analysis [1–3], which studies the spread of criminal events over a territory; fire analysis [4], which analyzes the spread of fires over forested areas; and disease analysis [5–7], which studies the localization of disease foci and their temporal evolution. The clustering methods mainly used for detecting hotspot areas are density-based algorithms (see [8, 9]); they can detect the exact geometry of the hotspots but are computationally expensive, and in the great majority of cases it is not necessary to determine the exact shape of the clusters. The clustering algorithm most widely used, owing to its linear computational complexity, is the Fuzzy C-Means algorithm (FCM) [10], a partitional fuzzy clustering method that uses the Euclidean distance and represents cluster prototypes as points.
Let $X=\{x_1,\dots,x_N\}\subset\mathbb{R}^n$ be a dataset composed of $N$ patterns $x_j=(x_{1j},x_{2j},\dots,x_{nj})^T$, where $x_{kj}$ is the $k$th component (feature) of the pattern $x_j$. The FCM algorithm minimizes the following objective function:
$$J(X,U,V)=\sum_{i=1}^{C}\sum_{j=1}^{N}u_{ij}^{m}d_{ij}^{2}, \tag{1}$$
where $C$ is the number of clusters, fixed a priori; $u_{ij}$ is the membership degree of the pattern $x_j$ to the $i$th cluster ($i=1,\dots,C$); $V=\{v_1,\dots,v_C\}\subset\mathbb{R}^n$ is the set of the centers of the $C$ clusters (prototypes); $m$ is the fuzzifier parameter; and $d_{ij}$ is the distance between the center $v_i=(v_{1i},v_{2i},\dots,v_{ni})^T$ of the $i$th cluster and the $j$th pattern $x_j$, calculated as the Euclidean norm:
$$d_{ij}=\|x_j-v_i\|=\sqrt{\sum_{k=1}^{n}(x_{kj}-v_{ki})^{2}}. \tag{2}$$
Using the Lagrange multipliers method for minimizing the objective function (1), we obtain the following solution for the center of each cluster prototype:
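$$v_i=\frac{\sum_{j=1}^{N}u_{ij}^{m}x_j}{\sum_{j=1}^{N}u_{ij}^{m}}, \tag{3}$$
where $i=1,\dots,C$, and for the membership degrees $u_{ij}$,
$$u_{ij}=\frac{1}{\sum_{h=1}^{C}\left(d_{ij}/d_{hj}\right)^{2/(m-1)}} \tag{4}$$
subject to the constraints
$$\sum_{i=1}^{C}u_{ij}=1\quad\forall j\in\{1,\dots,N\},\qquad 0<\sum_{j=1}^{N}u_{ij}<N\quad\forall i\in\{1,\dots,C\}. \tag{5}$$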
Initially, the $u_{ij}$ and the $v_i$ are assigned randomly; they are then updated at each iteration. If $U^{(l)}=(u_{ij}^{(l)})$ is the matrix $U$ calculated at the $l$th step, the iterative process stops when
$$\|U^{(l)}-U^{(l-1)}\|=\max_{i,j}\left|u_{ij}^{(l)}-u_{ij}^{(l-1)}\right|<\varepsilon, \tag{6}$$
where ε>0 is a prefixed parameter.
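The FCM iteration defined by (3), (4), and (6) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this work: the function name `fcm`, its parameters, and the `max_iter` safeguard are assumptions introduced here.

```python
import numpy as np

def fcm(X, C, m=2.0, eps=1e-2, max_iter=300, seed=0):
    """Illustrative Fuzzy C-Means sketch. X has shape (N, n); returns (U, V)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Random initial memberships; each column sums to 1 (constraint (5)).
    U = rng.random((C, N))
    U /= U.sum(axis=0)
    for _ in range(max_iter):
        Um = U ** m
        # Equation (3): prototypes as membership-weighted means.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Equation (2): Euclidean distances d_ij, stored with shape (C, N).
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)  # guard against division by zero
        # Equation (4): u_ij = 1 / sum_h (d_ij / d_hj)^(2/(m-1)).
        ratio = (D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)
        # Equation (6): stop when the largest membership change is below eps.
        done = np.abs(U_new - U).max() < eps
        U = U_new
        if done:
            break
    return U, V
```

The returned prototypes `V` are those from which the final memberships were computed. A very loose threshold can stop the loop early when the random start is nearly symmetric, so a smaller `eps` may be preferable in practice.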
This algorithm has linear computational complexity; however, it is sensitive to noise and outliers. Furthermore, the number of clusters C must be fixed a priori, so a validity index is needed to determine an optimal value for C.
In order to overcome these shortcomings, in [11, 12], the EFCM algorithm is proposed, where the cluster prototypes are hyperspheres in the case of the Euclidean metric. Like FCM, the EFCM algorithm is characterized by a linear computational complexity; furthermore, it is robust with respect to the presence of noise and outliers, and the final number of clusters is determined during the iterative process.
In [13, 14], the authors propose the use of the EFCM algorithm for detecting hotspot areas. The final hotspots are identified as the detected cluster prototypes and shown on the map as circular areas. In [4], the authors analyze the spatio-temporal evolution of the hotspots in fire analysis. The pattern event dataset is partitioned according to the time of the event's detection, so that each subset corresponds to a specific time interval. The authors compare the hotspots obtained in two consecutive years by studying their intersections on the map. In this way, it is possible to follow the evolution of a particular phenomenon.
The cluster prototypes detected by the EFCM method are circular areas on the map that can approximate a hotspot area. Figure 1 shows an example of two circular hotspots obtained as clusters.
Figure 1: Intersections of two hotspots detected for events that happened in two consecutive periods.
Figure 1 shows three different regions:
The area of hotspot A that is not intersected by hotspot B (corresponding to A-(A∩B)=A-B): this region can be considered a geographical area in which previously detected events subsequently disappear.
The region of intersection of the two hotspots, A∩B: this region can be considered a geographical area in which the events continue to persist.
The area of hotspot B that is not intersected by hotspot A (corresponding to B-(A∩B)=B-A): this region can be considered a geographical area into which the events, previously not detected there, subsequently spread.
We can study the spatio-temporal evolution of the hotspots by analyzing the interactions between the corresponding circular cluster prototypes obtained for consecutive periods, detecting both the presence of new hotspots in regions previously not covered by hotspots and the absence of hotspots in regions previously included in hotspot areas.
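Since the hotspots are circles, the areas of the three regions A-B, A∩B, and B-A follow from the two radii and the distance between the centers via the standard circle-circle intersection (lens) formula. The sketch below is illustrative only; the function names and dictionary keys are assumptions introduced here, and planar coordinates in a common length unit are assumed.

```python
import math

def circle_overlap_area(r_a, r_b, d):
    """Area of the intersection of two circles with radii r_a, r_b and center distance d."""
    if d >= r_a + r_b:            # disjoint (or externally tangent) circles
        return 0.0
    if d <= abs(r_a - r_b):       # one circle lies entirely inside the other
        return math.pi * min(r_a, r_b) ** 2
    # Half-angles subtended by the common chord at each center.
    alpha = math.acos((d * d + r_a * r_a - r_b * r_b) / (2 * d * r_a))
    beta = math.acos((d * d + r_b * r_b - r_a * r_a) / (2 * d * r_b))
    # Sum of the two circular segments that form the lens.
    return (r_a * r_a * (alpha - math.sin(2 * alpha) / 2)
            + r_b * r_b * (beta - math.sin(2 * beta) / 2))

def hotspot_regions(r_a, r_b, d):
    """Areas of A-B, A∩B and B-A for an earlier hotspot A and a later hotspot B."""
    both = circle_overlap_area(r_a, r_b, d)
    return {
        "only_A": math.pi * r_a ** 2 - both,  # events that disappear
        "both": both,                         # events that persist
        "only_B": math.pi * r_b ** 2 - both,  # newly affected area
    }
```

For example, `hotspot_regions(r_a, r_b, d)` with `d` larger than `r_a + r_b` reports an empty persistence region, meaning that the later hotspot is entirely new.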
In this research, we present a method for studying the spatio-temporal evolution of hotspot areas in disease analysis; we apply the EFCM algorithm to compare, across consecutive years, event datasets corresponding to diagnoses of oto-laryngopharyngeal diseases in the district of Naples (Italy). Each event corresponds to the residence of a patient who contracted the disease.
We study the spatio-temporal evolution of the hotspots by analyzing the intersections of hotspots corresponding to two consecutive years, the displacement of the centroids, the increase or reduction of the hotspot areas, and the emergence of new hotspots.
In Section 2, we give an overview of the EFCM algorithm. In Section 3, we present our method for studying the spatio-temporal evolution of hotspots in disease analysis. In Section 4, we present the results of the spatio-temporal evolution of hotspots for the oto-laryngopharyngeal disease diagnosis events detected in the district of Naples (Italy). Our conclusions are in Section 5.
2. The EFCM Algorithm
In the EFCM algorithm, we consider clustering prototypes given by hyperspheres in the $n$-dimensional feature space. The $i$th hypersphere $V_i$ is characterized by a centroid $v_i=(v_{1i},\dots,v_{ni})$ and a radius $r_i$.
We say that a pattern $x_j$ belongs to $V_i$ if $d_{ij}\le r_i$.
The radius ri is obtained considering the covariance matrix Pi associated with the ith cluster, defined as
$$P_i=\frac{\sum_{j=1}^{N}u_{ij}^{m}(x_j-v_i)(x_j-v_i)^{T}}{\sum_{j=1}^{N}u_{ij}^{m}}, \tag{7}$$
whose determinant gives the volume of the $i$th cluster. Since $P_i$ is symmetric and positive definite, it can be decomposed in the following form:
$$P_i=Q_i\Lambda_iQ_i^{T}, \tag{8}$$
where $Q_i$ is an orthonormal matrix and $\Lambda_i$ is the diagonal matrix of the eigenvalues $\lambda_{ik}$ ($k=1,\dots,n$). The radius $r_i$ is given by the following formula (see [12]):
$$r_i=\sqrt{\prod_{k=1}^{n}\lambda_{ik}^{1/n}}=\sqrt{\det(P_i)^{1/n}}. \tag{9}$$
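As an illustration of (7) to (9), the radius of one cluster can be computed from the fuzzy covariance matrix as sketched below. The sketch assumes that the radius is the square root of $\det(P_i)^{1/n}$, as in the volume-prototype formulation of [12]; if (9) is read differently, only the last line changes. The function name `cluster_radius` and its arguments are hypothetical.

```python
import numpy as np

def cluster_radius(X, u_i, v_i, m=2.0):
    """Radius of one cluster from its fuzzy covariance matrix.

    X: (N, n) patterns; u_i: (N,) memberships to the cluster; v_i: (n,) center.
    Assumes r_i = sqrt(det(P_i)^(1/n)).
    """
    w = u_i ** m
    diff = X - v_i
    # Equation (7): membership-weighted covariance matrix P_i.
    P = (w[:, None] * diff).T @ diff / w.sum()
    # Equation (8): eigenvalues of the symmetric matrix P_i.
    eigvals = np.linalg.eigvalsh(P)
    n = X.shape[1]
    # Assumed reading of (9): square root of the geometric mean of the eigenvalues.
    return float(np.sqrt(np.prod(eigvals) ** (1.0 / n)))
```

In two dimensions this yields the radius of the circle drawn on the map for the corresponding hotspot.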
The objective function to be minimized is the following:
$$J(X,U,V)=\sum_{i=1}^{C}\sum_{j=1}^{N}u_{ij}^{m}\left(d_{ij}^{2}-r_i^{2}\right), \tag{10}$$
where the membership degrees uij are updated as
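$$u_{ij}=\left[\sum_{h=1}^{C}\left(\frac{d_{ij}^{2}\max\left(0,1-r_i^{2}/d_{ij}^{2}\right)}{d_{hj}^{2}\max\left(0,1-r_h^{2}/d_{hj}^{2}\right)}\right)^{1/(m-1)}\right]^{-1}=\frac{1}{\sum_{h=1}^{C}\left(d_{ij}^{2}w_{ij}/d_{hj}^{2}w_{hj}\right)^{1/(m-1)}}, \tag{11}$$
with $w_{kj}=\max\left(0,1-r_k^{2}/d_{kj}^{2}\right)$.
We set $d'^{\,2}_{kj}=d_{kj}^{2}w_{kj}=\max\left(0,d_{kj}^{2}-r_k^{2}\right)$ and define the number $\varphi_j=\operatorname{card}\{k\in\{1,\dots,C\}:d'_{kj}=0\}$ for any $j=1,\dots,N$; thus, we obtain
$$u_{ij}=\begin{cases}\dfrac{1}{\sum_{h=1}^{C}\left(d'_{ij}/d'_{hj}\right)^{2/(m-1)}} & \text{if }\varphi_j=0,\\[2ex] 0 & \text{if }\varphi_j>0\text{ and }d'_{ij}>0,\\[1ex] \dfrac{1}{\varphi_j} & \text{if }\varphi_j>0\text{ and }d'_{ij}=0.\end{cases} \tag{12}$$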
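The case distinction in (12) can be made concrete with a short NumPy sketch. It is illustrative only: the function name `efcm_memberships` and the array layout (clusters along rows, patterns along columns) are assumptions made here.

```python
import numpy as np

def efcm_memberships(D, r, m=2.0):
    """Membership update (12). D: (C, N) distances d_ij; r: (C,) radii."""
    # d'_ij^2 = max(0, d_ij^2 - r_i^2): zero for patterns inside the hypersphere.
    D2p = np.maximum(0.0, D ** 2 - r[:, None] ** 2)
    inside = D2p == 0.0
    phi = inside.sum(axis=0)          # phi_j: number of hyperspheres containing x_j
    U = np.zeros_like(D, dtype=float)
    outside_all = phi == 0
    # Case phi_j = 0: FCM-like update on the reduced distances d'_ij.
    p = 1.0 / (m - 1.0)
    inv = D2p[:, outside_all] ** (-p)
    U[:, outside_all] = inv / inv.sum(axis=0)
    # Case phi_j > 0: membership shared equally among the containing clusters,
    # and zero for the clusters that do not contain the pattern.
    U[:, ~outside_all] = inside[:, ~outside_all] / phi[~outside_all]
    return U
```

Each column of the result sums to 1, so constraint (5) on the memberships of a pattern is preserved in both cases.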
However, the use of (12) has the negative effect of diminishing the objective function (10) when a significant number of patterns fall inside a cluster, and this can prevent the separation of the clusters. A solution to this problem is to assume a small starting value of $r_i$ and then increase it gradually by the factor $\beta^{(l)}/C^{(l)}$, where $C^{(l)}$ is the number of clusters at the $l$th iteration and $\beta^{(l)}$ is defined recursively as $\beta^{(0)}=1$, $\beta^{(l)}=\min\left(C^{(l-1)},\beta^{(l-1)}\right)$. To decide which clusters are to be merged, an index of similarity between the $i$th and $k$th clusters is defined by setting
$$I_{ik}=\frac{\sum_{j=1}^{N}\min(u_{ij},u_{kj})}{\sum_{j=1}^{N}u_{ij}}, \tag{13}$$
together with the symmetric matrix $S=(S_{ik})$, where $S_{ik}=\max\{I_{ik},I_{ki}\}$. If $S^{(l)}$ is the matrix $S$ at the $l$th iteration ($l>1$) and the threshold $\alpha^{(l)}=1/\left(C^{(l-1)}-1\right)$ is introduced as a limit, then two indexes $i^*$ and $k^*$ are determined such that $S^{(l)}_{i^*k^*}\ge\alpha^{(l)}$, and the clusters $i^*$ and $k^*$ are merged by setting
$$u^{(l)}_{i^*j}=u^{(l)}_{i^*j}+u^{(l)}_{k^*j}\quad\forall j\in\{1,\dots,N\},\qquad C^{(l)}=C^{(l-1)}-1. \tag{14}$$
The $k^*$th row can then be removed from the matrix $U^{(l)}$. In summary, the EFCM algorithm consists of the following steps.
(1) The user assigns the initial number of clusters $C^{(0)}$, the fuzzifier $m>1$ (usually $m=2$), the threshold $\varepsilon>0$, the initial value $S^{(0)}_{ik}=0$, and $\beta^{(0)}=1$.
(2) The membership degrees $u^{(0)}_{ij}$ ($j=1,\dots,N$ and $i=1,\dots,C^{(0)}$) are assigned randomly.
(3) The centers of the clusters $v_i$ are calculated by using equation (3).
(4) The radii of the clusters are calculated by using equation (9).
(5) The membership degrees $u_{ij}$ are calculated by using equation (12).
(6) The indexes $i^*$ and $k^*$ are determined in such a way that $S^{(l)}_{i^*k^*}$ assumes the greatest possible value at the $l$th iteration.
(7) If $\left|S^{(l)}_{i^*k^*}-S^{(l-1)}_{i^*k^*}\right|<\varepsilon$ and $S^{(l)}_{i^*k^*}>\alpha^{(l)}=1/\left(C^{(l-1)}-1\right)$, then the $i^*$th and $k^*$th clusters are merged via (14) and the $k^*$th row is deleted from $U^{(l)}$.
(8) If the stopping condition (6) is satisfied, then the process stops; otherwise, go to step (3) for the $(l+1)$th iteration.
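The selection and merging of the most similar pair of clusters, based on (13) and (14), can be sketched as follows. This is a simplified illustration rather than the full procedure: it omits the stability check on $S$ between consecutive iterations, and the function name `merge_most_similar` and its return convention are assumptions introduced here.

```python
import numpy as np

def merge_most_similar(U, alpha):
    """One merging attempt based on (13)-(14). U: (C, N) memberships.

    Returns (U_out, pair), where pair is (i*, k*) if a merge happened, else None.
    Every row of U is assumed to have a positive sum, as required by (5).
    """
    # Equation (13): I_ik = sum_j min(u_ij, u_kj) / sum_j u_ij.
    mins = np.minimum(U[:, None, :], U[None, :, :]).sum(axis=2)
    I = mins / U.sum(axis=1, keepdims=True)
    S = np.maximum(I, I.T)               # S_ik = max(I_ik, I_ki)
    np.fill_diagonal(S, -np.inf)         # ignore the similarity of a cluster with itself
    i, k = np.unravel_index(np.argmax(S), S.shape)
    if S[i, k] < alpha:
        return U, None                   # no pair is similar enough
    i, k = int(min(i, k)), int(max(i, k))
    merged = U.copy()
    merged[i] += merged[k]               # equation (14): add the k*-th row to the i*-th
    merged = np.delete(merged, k, axis=0)  # the number of clusters drops by one
    return merged, (i, k)
```

With `C` current clusters, the threshold would be passed as `alpha = 1 / (C - 1)`, so merging becomes harder as the number of clusters decreases.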
3. Hotspots Detection and Evolution in Disease Analysis
Each pattern is given by the event corresponding to the residence of a patient in whom a specific disease has been detected. The two features of the pattern are the geographic coordinates of the residence.
The first step of our process is a geocoding activity, necessary for obtaining the event dataset starting from the street addresses of the patients.
To ensure an accurate matching for the geopositioning of the event, we need the topologically correct road network and the corresponding complete toponymic data.
The starting data include the name of the street and the house number of the patient's residence. After the matching process, each record is converted into an event point georeferenced on the map.
In Figure 2, the road network of the district of Naples is shown; the name of the street is labeled on the map; the events are georeferenced as points on the map.
Figure 2: Example of events georeferenced on a road network.
Figure 3 shows the data corresponding to an event selected on the map.
Figure 3: Data associated with an event on the map.
After georeferencing each event, the event dataset can be partitioned by time interval. For example, events such as the one in Figure 3 can be partitioned by the field “Year.”
For each subset of events, we apply the EFCM algorithm to detect the final cluster prototypes.
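Partitioning the georeferenced events by year before clustering can be sketched as follows. The record layout, with keys `"year"`, `"x"`, and `"y"`, is hypothetical and stands in for the attribute table of the event layer.

```python
from collections import defaultdict

def split_by_year(events):
    """Group event coordinates by year.

    events: iterable of dicts with hypothetical keys "year", "x", "y".
    Returns {year: [(x, y), ...]} with the years in ascending order.
    """
    groups = defaultdict(list)
    for ev in events:
        groups[ev["year"]].append((ev["x"], ev["y"]))
    return dict(sorted(groups.items()))
```

Each list of coordinates can then be passed to the clustering step independently, which yields one set of hotspots per year.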
In this research, we analyze the temporal evolution and spread of oto-laryngo-pharyngeal diseases detected within the district of Naples. The datasets, divided into time sequences of one year each, consist of patterns for georeferenced events corresponding to ailments found in patients who underwent an intervention and a subsequent histological examination. Each event refers to the geographic position of the patient's residence.
The data have been further divided by type of disease in order to analyze the distribution and evolution of each specific disease over the study area.
The EFCM algorithm has been encapsulated in the GIS platform ESRI ArcGIS. Figure 4 shows the form created for setting the parameters and running the EFCM algorithm.
Figure 4: A form created in the GIS tool ESRI ArcGIS for managing the EFCM process.
Other numerical fields can be selected in order to add further features to the geographical coordinates.
Initially, we set the initial number of clusters, the fuzzifier m, and the error threshold for stopping the iterations. After running EFCM, the number of iterations, the final number of clusters, and the error calculated at the last iteration are reported. The resultant clusters are shown as circular areas on the map and can be saved in a new geographic layer.
The final process concerns the comparative analysis of the hotspots obtained from the clusters corresponding to each subset of events.
Figure 5 shows an example of the display on the map of hotspots obtained as final clusters for two consecutive subsets of events.
Figure 5: Analysis of the spatio-temporal evolution of hotspots detected in two consecutive periods.
In order to assess the expansion and the displacement of a hotspot, we measure the radius of the hotspot and the distance between the centroids of two intersecting hotspots.
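The two measurements just described can be sketched as below. The sketch assumes that each hotspot is given as a tuple `(x, y, r)` in a planar projected coordinate system with a common length unit (for example, kilometers); the function name and the dictionary keys are hypothetical.

```python
import math

def compare_hotspots(h_prev, h_next):
    """Compare two circular hotspots given as (x, y, r) tuples in planar coordinates."""
    x1, y1, r1 = h_prev
    x2, y2, r2 = h_next
    d = math.hypot(x2 - x1, y2 - y1)      # distance between the centroids
    return {
        "centroid_distance": d,           # displacement of the hotspot
        "radius_change": r2 - r1,         # positive if the hotspot expands
        "intersect": d < r1 + r2,         # True if the two circles overlap
    }
```

Only pairs for which `intersect` is true are candidates for being the same hotspot followed across two consecutive periods.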
In the next section, we present the results obtained by applying this method to the data corresponding to surgical interventions on the oto-laryngo-pharyngeal apparatus in patients resident in the district of Naples between the years 2008 and 2012.
We divide the dataset per year and analyze various types of diseases.
Among the most frequent types of disease, the following were analyzed:
carcinoma,
edema of bilateral Reinke,
hypertrophy of the inferior turbinate,
nasal polyposis,
bilateral vocal fold prolapse.
In the next section, we show the most significant results obtained by applying this method to each partitioned dataset of events.
4. Test Results
We present the results obtained on the event dataset described above in the period between the years 2008 and 2012.
We consider first the subset of data corresponding to the edema of bilateral Reinke disease.
We fix the fuzzifier parameter m to 2, the initial number of clusters to 15, and the final iteration error ε to 1 × 10^{−2}.
Table 1 shows the results obtained for each year.
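Table 1: Final number of clusters and final error per year for edema of bilateral Reinke disease.

| Year | Initial number of clusters | Final number of clusters | $\lVert U^{(l)}-U^{(l-1)}\rVert$ | $\varepsilon$ |
|---|---|---|---|---|
| 2008 | 15 | 4 | 0.48 × 10^{−2} | 1 × 10^{−2} |
| 2009 | 15 | 4 | 0.55 × 10^{−2} | 1 × 10^{−2} |
| 2010 | 15 | 4 | 0.71 × 10^{−2} | 1 × 10^{−2} |
| 2011 | 15 | 4 | 0.67 × 10^{−2} | 1 × 10^{−2} |
| 2012 | 15 | 3 | 0.53 × 10^{−2} | 1 × 10^{−2} |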
We present the details relating to the comparison of the hotspots obtained by considering the event data for the years 2011 and 2012.
Figures 6 and 7 show, respectively, the hotspots obtained by using the pattern subset of events that occurred in the years 2011 and 2012.
Figure 6: Hotspots displayed on the map for edema of bilateral Reinke disease, year 2011.
Figure 7: Hotspots displayed on the map for edema of bilateral Reinke disease, year 2012.
Figure 8 shows the overlap of the hotspots obtained for the two years: in red, the hotspots corresponding to the year 2011; in blue, the ones corresponding to the year 2012.
Figure 8: The two series of hotspots displayed on the map for edema of bilateral Reinke disease, years 2011 and 2012.
Table 2 lists, in the first two columns, the labels of the intersecting hotspots in 2011 and 2012; in the third (resp., fourth) column, the radius obtained in 2011 (resp., 2012); and, in the fifth column, the distance between the centroids.
Table 2: Comparison of the hotspots obtained for the years 2011 and 2012 for edema of bilateral Reinke disease.

| 2011 hotspot | Intersecting 2012 hotspot | Radius of 2011 hotspot (km) | Radius of 2012 hotspot (km) | Distance between centroids (km) |
|---|---|---|---|---|
| 1 | 1 | 1.724 | 3.848 | 1.759 |
| 2 | 1 | 1.943 | 3.848 | 1.507 |
| 3 | 2 | 3.434 | 3.453 | 0.074 |
| 4 | 3 | 3.591 | 6.519 | 1.115 |
The results show that only hotspot 3 obtained for the year 2011 remains almost unchanged in the year 2012. Hotspots 1 and 2, instead, seem to merge into a single larger hotspot (hotspot 1 obtained for the year 2012), while hotspot 4 shifts by about 1 km and expands: its radius in 2012 is about 6.5 km (hotspot 3 obtained for the year 2012 in Figure 8).
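These observations can be checked numerically from the values in Table 2. The sketch below computes, for each pair, the ratio between the 2012 and 2011 circle areas and tests whether the 2011 circle lies entirely inside the 2012 circle (which holds when the centroid distance plus the 2011 radius does not exceed the 2012 radius). The helper name `summarize` and the output keys are assumptions introduced here; the numbers are those of Table 2.

```python
# (2011 label, 2012 label, radius 2011 km, radius 2012 km, centroid distance km)
TABLE_2 = [
    (1, 1, 1.724, 3.848, 1.759),
    (2, 1, 1.943, 3.848, 1.507),
    (3, 2, 3.434, 3.453, 0.074),
    (4, 3, 3.591, 6.519, 1.115),
]

def summarize(row):
    """Area growth and containment test for one row of Table 2."""
    old, new, r_old, r_new, d = row
    return {
        "pair": (old, new),
        # Circle areas scale with the square of the radius.
        "area_ratio": (r_new / r_old) ** 2,
        # The 2011 circle is inside the 2012 circle if d + r_old <= r_new.
        "old_inside_new": d + r_old <= r_new,
    }

summaries = [summarize(row) for row in TABLE_2]
```

Under these definitions, the 2011 hotspots 1, 2, and 4 lie entirely within their 2012 counterparts, whereas hotspot 3 keeps almost the same area; for hotspot 4 the area grows by a factor of about 3.3.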
Now we show the results obtained for the disease nasal polyposis.
Figure 9 shows the overlap of the hotspots obtained for the two years, 2011 and 2012.
Figure 9: The two series of hotspots displayed on the map for nasal polyposis disease, years 2011 and 2012.
Table 3 reports the comparison results.
Table 3: Comparison of the hotspots obtained for the years 2011 and 2012 for nasal polyposis.

| 2011 hotspot | Intersecting 2012 hotspot | Radius of 2011 hotspot (km) | Radius of 2012 hotspot (km) | Distance between centroids (km) |
|---|---|---|---|---|
| 1 | 1 | 3.087 | 4.951 | 2.656 |
| 2 | 2 | 4.915 | 7.103 | 1.052 |
The results in Figure 9 show that in both 2011 and 2012 there are two hotspots: one covering an area of the city of Naples and the other covering many Vesuvian towns. The two hotspots, which in 2011 covered circular areas with radii of about 3 and 5 km, respectively, in 2012 cover circular areas with radii of about 5 and 7 km, respectively.
The histogram in Figure 10 shows the trend of the radii of the two hotspots in the course of time.
Figure 10: Histogram showing the variation of the radius of the two hotspots over time for nasal polyposis disease.
The spread in recent years of the hotspot that surrounds the Vesuvian towns is notable: its radius grows from about 2 km in the year 2008 to about 7 km in the year 2012.
Another significant trend concerns the hotspots obtained for the carcinoma disease.
In this case too, the two main hotspots cover the city of Naples and many Vesuvian towns. Here, the hotspot covering the city of Naples spreads very markedly (cf. Figure 11); in recent years, the radius of this hotspot has increased up to 9.5 km.
Figure 11: Histogram showing the variation of the radius of the two hotspots over time for carcinoma disease.
5. Conclusions
The hyperspheres obtained as clusters by using EFCM (circles in the case of two dimensions) can represent hotspots in hotspot analysis; this method has linear computational complexity and is robust to noise and outliers. In hotspot analysis, the patterns are bidimensional and the features are the geographic coordinates; the cluster prototypes are circles that can give a good approximation of hotspot areas and can be displayed as circular areas on the map.
In this paper, we present a new method that uses the EFCM algorithm for studying the spatio-temporal evolution of hotspots in disease analysis.
We consider the residence information of patients in the district of Naples (Italy) who underwent surgery on the oto-laryngo-pharyngeal apparatus between the years 2008 and 2012. A geocoding process is used for georeferencing the data; the georeferenced dataset is then partitioned by year and type of disease. We compare the hotspots obtained for each pair of consecutive years and analyze the trend of each hotspot over time by measuring the variation of the radius and the distance between the centroids of intersecting clusters in two consecutive years.
The results show a consistent spread in recent years of the nasal polyposis disease hotspot covering some Vesuvian towns and of the carcinoma disease hotspot covering the city of Naples.
References
[1] S. P. Chainey, S. Reid, and N. Stuart, "When is a hotspot a hotspot? A procedure for creating statistically robust hotspot geographic maps of crime," in D. Kidner, G. Higgs, and S. White (Eds.).
[2] K. Harries.
[3] A. T. Murray, I. McGuffog, J. S. Western, and P. Mullins, "Exploratory spatial data analysis techniques for examining urban crime."
[4] F. Di Martino and S. Sessa, "The extended fuzzy c-means algorithm for hotspots in spatio-temporal GIS."
[5] R. M. Mullner, K. Chung, K. G. Croke, and E. K. Mensah, "Introduction: geographic information systems in public health and medicine."
[6] K. Polat, "Application of attribute weighting method based on clustering centers to discrimination of linearly non-separable medical datasets."
[7] C. K. Wei, S. Su, and M. C. Yang, "Application of data mining on the development of a disease distribution map of screened community residents of Taipei County in Taiwan."
[8] I. Gath and A. B. Geva, "Unsupervised optimal fuzzy clustering."
[9] R. Krishnapuram and J. Kim, "Clustering algorithms based on volume criteria."
[10] J. C. Bezdek.
[11] U. Kaymak, R. Babuska, M. Setnes, H. B. Verbruggen, and H. M. van Nauta Lemke, "Methods for simplification of fuzzy models," in D. Ruan (Ed.).
[12] U. Kaymak and M. Setnes, "Fuzzy clustering with volume prototypes and adaptive cluster merging."
[13] F. Di Martino, V. Loia, and S. Sessa, "Extended fuzzy c-means clustering algorithm for hotspot events in spatial analysis."
[14] F. Di Martino and S. Sessa, "Implementation of the extended fuzzy c-means algorithm in geographic information systems."