In recent years, the size and complexity of datasets have grown exponentially. In many application areas, huge amounts of data are generated, explicitly or implicitly containing spatial or spatiotemporal information. However, the ability to analyze these data remains inadequate, and the need for adapted data mining tools has become a major challenge. In this paper, we propose a new unsupervised algorithm suited to the analysis of noisy spatiotemporal Radio Frequency IDentification (RFID) data. Two real applications show that this algorithm is an efficient data mining tool for behavioral studies based on RFID technology. It allows discovering and comparing stable patterns in an RFID signal and is suitable for continuous learning.
In recent years, the size of datasets has grown exponentially. One study shows that the amount of data doubles every three years [
As the study of data streams and large databases is a difficult problem because of the computing costs and the big storage volumes involved, two issues appear to play a key role in such an analysis: (i) a good condensed description of the data properties [
In this work, we focus on the analysis of RFID data. RFID is an advanced tracking technology. The RFID tags, which consist of a microchip and an antenna, must be used with a reader that can detect many tags simultaneously in a single scan. A computer stores the position of each tag for each scan in a database, which allows different analyses. Thanks to miniaturization, RFID offers the advantage of automation and overcomes the constraints imposed by video analysis. The evolution of these data over time and their spatial position require the exploration of multiple datasets described in high-dimensional spaces.
The proposed algorithm has several properties that make it well suited to RFID data. It builds a compact representation of each trajectory at a linear computational cost. This is important because the total amount of recorded data can be large: each tag’s position is recorded very frequently (sometimes more than once per second) over several hours, and many tags are followed simultaneously. The algorithm is able to deal with noisy data and to find a suitable number of clusters automatically, without constraints on the clusters’ shape. Moreover, it shows good clustering performance compared to traditional algorithms. This is valuable because RFID trajectories are very noisy in our application, which makes the extraction of general patterns a difficult task. The algorithm is thus used to cluster each trajectory from its representation. It is also suitable for comparing trajectories. As new trajectory records can be added to the database at any time, it is important to compare the patterns of new trajectories with older ones. The algorithm can evaluate the similarity of two trajectories’ representations using information about the underlying distribution of the tags’ movements. This approach is very resistant to noise and is more reliable than distance-based methods.
Thus, the algorithm combines all the needed properties (good abstraction performance, low memory and computational cost, and suitability for noisy experimental data) to be a good candidate for RFID data mining.
The remainder of this paper is organized as follows. Section
The basic assumption in this work is that it is possible to define prototypes in the data space and to calculate a distance measure between data points and prototypes. First, each dataset is modeled using an enriched SOM model, constructing an abstract representation which is supposed to capture the essential properties of the data. Then, a clustering of the data is computed, in order to catch the global structure of these data. Finally, the density function of each dataset is estimated from the abstract representation and different datasets can be compared using a dissimilarity measure based upon these density functions.
The idea is to combine the dimension reduction and fast learning capabilities of the SOM to construct a new vector space, and then to apply other analyses in this space. Such approaches are called two-level methods. They are known to greatly reduce the computational time, the effects of noise, and the “curse of dimensionality” [
The algorithm proceeds in three steps, followed by a final comparison. The first step is the learning of the enriched SOM. During the learning, each SOM prototype is extended with novel information extracted from the data. This information is used in the following steps to find clusters in the data and to infer the density function. More specifically, the attributes added are a density and a variability value for each prototype and a neighborhood value for each pair of prototypes. The second step is the clustering of the data using density and connectivity information so as to detect low-density boundaries between clusters. The third step is the construction, from each cluster (i.e., a set of enriched prototypes in a SOM), of a density function which will be used to estimate the density in the input space. This function is constructed by induction from the information associated with the prototypes of the SOM and is represented as a mixture model of spherical normal functions. Finally, two different datasets (e.g., clusters from different databases) are compared using a dissimilarity measure that compares the two density functions constructed in the previous step.
Kohonen SOM can be defined as a competitive unsupervised learning neural network [
In our algorithm, the SOM’s prototypes will be “enriched” by adding new numerical values extracted from the dataset.
The enrichment algorithm proceeds in three phases.
Input: the data.
Output: the density and the neighborhood values.
Initialize the SOM parameters.
For all data: find the two closest prototypes (BMUs: Best Match Units) and update the number of data, the variability, and the density of the prototypes.
For all pairs of prototypes: update the neighborhood value.
In this study, we used the default parameters of the SOM Toolbox [
At the end of this process, each prototype is associated with a density and a variability value, and each pair of prototypes is associated with a neighborhood value. The substantial information about the distribution of the data is captured by these values. Then, it is no longer necessary to keep data in memory.
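The enrichment described above can be sketched in a few lines of Python. This is only a minimal batch illustration, not the learning procedure itself: the function name `enrich_prototypes`, the Gaussian bandwidth `sigma`, and the exact formulas (mean distance for the variability, an average of Gaussian kernels for the density, a count of shared best-matching pairs for the neighborhood) are assumptions made for the example. The prototypes are supposed to be already learned.

```python
import numpy as np

def enrich_prototypes(data, prototypes, sigma=1.0):
    """Compute illustrative enrichment values for already-learned prototypes.

    All formulas here are assumptions chosen for the sketch.
    """
    n_proto = prototypes.shape[0]
    # Euclidean distance between every data point and every prototype.
    dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    # First and second best match units (BMUs) of each data point.
    order = np.argsort(dists, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]

    # Number of data represented by each prototype.
    counts = np.bincount(bmu1, minlength=n_proto)

    # Variability: mean distance between a prototype and the data it represents.
    dist_to_bmu = dists[np.arange(len(data)), bmu1]
    sums = np.bincount(bmu1, weights=dist_to_bmu, minlength=n_proto)
    variability = np.divide(sums, counts, out=np.zeros(n_proto), where=counts > 0)

    # Density: average Gaussian kernel response of all data at each prototype.
    density = np.exp(-dists ** 2 / (2.0 * sigma ** 2)).mean(axis=0)

    # Neighborhood: number of data whose two BMUs are prototypes i and j.
    neighborhood = np.zeros((n_proto, n_proto), dtype=int)
    np.add.at(neighborhood, (bmu1, bmu2), 1)
    neighborhood = neighborhood + neighborhood.T
    return counts, variability, density, neighborhood
```

Once these four arrays are computed, the data themselves are no longer needed by the later steps, which is what keeps the memory requirement low.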
Various prototypes-based approaches have been proposed to solve the clustering problem [
Here, DS2L-SOM uses information learned by the enriched SOM. Figure
Example of a sequence of the different stages of the clustering algorithm: database; sets of connected prototypes; density modes detection; subgroups associated to each mode; merging of irrelevant subgroups (final clusters); data clustering from prototypes clustering.
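The first stages of this sequence can be illustrated with a short Python sketch. It covers only the detection of sets of connected prototypes and the assignment of each prototype to a density mode by following increasing density; the merging of irrelevant subgroups is left out because its criterion is not restated here. The function names and the use of a neighborhood matrix with positive entries for connected prototypes are assumptions of the example.

```python
import numpy as np

def connected_sets(neighborhood):
    """Label sets of prototypes linked by a positive neighborhood value."""
    n = neighborhood.shape[0]
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal of one connected set
            i = stack.pop()
            for j in np.flatnonzero(neighborhood[i] > 0):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def follow_density_gradient(density, neighborhood):
    """Assign each prototype to the local density maximum it climbs to."""
    n = len(density)
    mode_of = np.empty(n, dtype=int)
    for i in range(n):
        cur = i
        while True:
            nbrs = np.flatnonzero(neighborhood[cur] > 0)
            higher = nbrs[density[nbrs] > density[cur]]
            if higher.size == 0:  # cur is a density mode
                break
            cur = int(higher[np.argmax(density[higher])])
        mode_of[i] = cur
    return mode_of
```

Prototypes sharing the same value in `mode_of` form one subgroup; each walk terminates because the density strictly increases at every move.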
At the end of the enrichment process (Section
Each dataset is modeled using an enriched Self-Organizing Map (SOM) model, constructing an abstract representation which is supposed to capture the essential data structure. Each dataset is then partitioned using the DS2L-SOM algorithm. In order to compare clusters from different databases, the algorithm first estimates the underlying density function of each cluster and then uses a dissimilarity measure based upon these density functions for the comparison.
The first objective of this step is to estimate the density function which associates a density value to each point of the input space. Estimates of some values of this function have been calculated (i.e.,
The hypothesis here is that this function may be properly approximated in the form of a mixture of Gaussian kernels. Each kernel
The most popular method to fit mixture models (i.e., to find
Thus, we propose the heuristic to choose
Now, since the density
Thus, we have a density function that is a model of the dataset represented by the enriched SOM. Some examples of estimated density are shown on Figures
“Engytime” dataset and the related estimated density function.
“Rings” dataset and the related estimated density function.
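As an illustration of such a model, the following Python sketch evaluates a mixture of spherical Gaussian kernels at a point of the input space. Centering one kernel on each prototype, and the names `weights` and `sigmas` for the mixture coefficients and kernel widths, are assumptions of the example; how these parameters are chosen from the enriched SOM is not shown.

```python
import numpy as np

def mixture_density(x, prototypes, weights, sigmas):
    """Density at point x under a mixture of spherical Gaussian kernels.

    prototypes: (M, d) kernel centers (assumed to be the SOM prototypes)
    weights:    (M,) mixture coefficients, assumed to sum to 1
    sigmas:     (M,) standard deviation of each spherical kernel
    """
    d = prototypes.shape[1]
    sq_dist = ((x[None, :] - prototypes) ** 2).sum(axis=1)
    # Normalization constant of a d-dimensional spherical Gaussian.
    norm = (2.0 * np.pi * sigmas ** 2) ** (d / 2.0)
    kernels = np.exp(-sq_dist / (2.0 * sigmas ** 2)) / norm
    return float(np.sum(weights * kernels))
```

Evaluating the function costs one pass over the prototypes, independent of the number of original data points.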
A measure of dissimilarity between two clusters
The dissimilarity between
The idea is to compare the density functions
The effectiveness of the proposed two-level clustering method has been demonstrated in [
To summarize, DS2L-SOM presents some interesting qualities in comparison to other clustering algorithms: the number of clusters is automatically detected by the algorithm, nonlinearly separable and nonhyperspherical clusters can be detected, and the algorithm can deal with noise (i.e., touching clusters) by using density estimation.
In order to demonstrate the performance of the proposed dissimilarity measure, nine artificial dataset generators and two real datasets were used in [
The main idea to test the quality of the comparison measure is that a low dissimilarity value is only consistent with a similar distribution and does, of course, give an indication of the similarity between the two sample distributions. On the other hand, a very high dissimilarity does show, to the given level of significance, that the distributions are different. Then, if the measure of dissimilarity is efficient, it should be possible to compare different datasets (with the same attributes) to detect the presence of similar distributions; that is, the dissimilarity of datasets generated from the same distribution law must be much smaller than the dissimilarity of datasets generated from very different distributions. This is measured using a generalized index of Dunn [
The results were compared with some distance-based measures usually used to compare two sets of data (here, we compare two sets of prototypes from the SOMs). These measures are the average distance (Ad: the average distance between all pairs of prototypes in the two SOMs), the minimum distance (Md: the smallest Euclidean distance between prototypes in the two SOMs), and the Ward distance (Wd: the distance between the two centroids, with some weight depending on the number of prototypes in the two SOMs) [
As shown in Table
Value of the Dunn index obtained from various dissimilarity measures to compare various data distributions.
Distributions | Average (Ad) | Minimum (Md) | Ward (Wd) | Proposed |
---|---|---|---|---|
Ring 1–3 + Spiral 1–2 | 0.4 | 0.9 | 0.5 | |
Noise 1–4 | 1.1 | 1.4 | 22.0 | |
Shuffle 1–2 | 1.1 | 16.5 | 6.3 | |
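To make the comparison protocol concrete, the sketch below implements the three distance-based measures between two sets of prototypes and a Dunn-style index computed from a matrix of pairwise dissimilarities between datasets. Two points are assumptions of the example: the weight in `ward_distance` follows the usual Ward criterion, since the text only mentions "some weight", and `dunn_index` uses the simplest ratio (smallest between-group dissimilarity over largest within-group dissimilarity), which may differ from the generalized index actually used.

```python
import numpy as np

def pairwise(a, b):
    """Euclidean distances between all prototypes of a and all prototypes of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def average_distance(a, b):
    """Ad: average distance between all pairs of prototypes."""
    return float(pairwise(a, b).mean())

def minimum_distance(a, b):
    """Md: smallest distance between prototypes of the two sets."""
    return float(pairwise(a, b).min())

def ward_distance(a, b):
    """Wd: squared centroid distance weighted by the set sizes (assumed weight)."""
    na, nb = len(a), len(b)
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    return float(na * nb / (na + nb) * gap ** 2)

def dunn_index(dissim, groups):
    """Smallest between-group dissimilarity over largest within-group one."""
    groups = np.asarray(groups)
    same = groups[:, None] == groups[None, :]
    off_diagonal = ~np.eye(len(groups), dtype=bool)
    within = dissim[same & off_diagonal]
    between = dissim[~same]
    return float(between.min() / within.max())
```

Here `groups[k]` is the generating distribution of dataset `k`; an index well above 1 means that datasets drawn from the same distribution are closer to each other than to datasets drawn from a different one.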
The complexity of the algorithm is scaled as
This is much faster than traditional density estimation algorithms such as the kernel estimator [
In this section, an adaptation of the algorithm is introduced to deal with RFID data, and two applications are described. The first application aims at mining customers’ trajectories in a supermarket during their shopping. The second is an analysis of migration behavior in an ant colony.
The proposed method for analyzing the RFID signal proceeds in three steps.
(i) Data postprocessing: The aim is to analyze the variation of the spatial behavior over time. To do that, a current spatiotemporal behavior must be defined for each instant during the following. This behavior must be inferred from the complex and noisy RFID signal. To do this, we compute for each Obviously, this definition implies some correlations between the descriptions of two time windows if they are separated by less than
(ii) Detection of individual homogeneous patterns: In order to regroup similar behaviors and to detect changes over time, an enriched SOM is applied on time windows from each individual sequence. DS2L-SOM is then applied on the enriched SOM as in Section
(iii) Detection of similar patterns: The method used in step ( The idea is to define a similarity measure between two sets of prototypes (from the enriched SOM) that represent two individual subsequences (i.e., clusters). For this purpose, the related density function is computed for each stable pattern as in Section
In this application, we aim at studying the individuals’ spatiotemporal activity during their shopping in a supermarket. Until now, little research has been undertaken in this way. Usual questions are: how do customers really travel through the store, do they go through every area or do they skip from one area to another in a more direct manner, do they follow a single, dominant pattern, or are they rather heterogeneous?
The purpose of this work is to explore data recorded via an RFID device to model and analyze the purchasing behavior of customers. In particular, we would like to know the time spent in each area of the store so as to detect hot spots and cold spots. We also aim at analyzing the customers’ trajectory patterns.
The movement of customers during their shopping was monitored using an RFID device. To do this, plastic baskets were put at the disposal of the customers. Each basket has an RFID tag glued on its back.
The supermarket used for this experiment is a store specializing in the sale of decorative objects (6000 m
The RFID experimental device. Yellow boxes represent the readers’ positions.
The data files are in text format. They indicate, for each scan (about one scan per second), the ID number of the tag detected, the IP address of the reader that has detected the tag, and the date and time of the detection (Figure
Example of a recorded scan in the data file.
As customers move inside the store, they are detected successively by different readers. However, depending on the area crossed, one tag can be detected by more than one reader at approximately the same time. This adds information about the actual location of the customer, but it also makes the moving sequence much harder to interpret. Furthermore, the recorded data are very noisy, because the RFID signal is perturbed by the metallic structures and by human bodies (see Figure
Example of a moving sequence. On the abscissa is the time (minutes). On the ordinate are the readers detecting the tag.
Here, we want to analyze the variation of the customers’ spatial behavior over time. To do that, a current location must be defined. The current location represents the area where the customer is at a given time. Every 10 seconds along the customer’s trajectory, a vector is computed representing how many times and for how long each RFID reader has detected the customer’s tag during a 3-minute time window centered on the current time. This makes it possible to detect when a customer moves from one area to another, by detecting a sudden change in the description of the current location.
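The construction of these window descriptions can be sketched as follows. The input format (parallel lists of detection times in seconds and reader identifiers for one tag), the function names, and the reading of "how many times" as a number of separate detection episodes (runs of scans separated by at most `max_gap` seconds) and of "how long" as a number of scans are assumptions of this example.

```python
import numpy as np

def window_features(times, readers, reader_ids, center, width=180.0, max_gap=2.0):
    """Describe one time window centered on `center` for a single tag.

    For each reader, returns the number of detection episodes and the
    number of scans (about one scan per second) inside the window.
    """
    times = np.asarray(times, dtype=float)
    readers = np.asarray(readers)
    in_window = (times >= center - width / 2) & (times < center + width / 2)
    features = []
    for r in reader_ids:
        t = np.sort(times[in_window & (readers == r)])
        n_scans = len(t)
        # A new episode starts whenever two detections are far apart in time.
        n_episodes = 0 if n_scans == 0 else 1 + int(np.sum(np.diff(t) > max_gap))
        features.extend([n_episodes, n_scans])
    return np.array(features, dtype=float)

def sliding_features(times, readers, reader_ids, step=10.0, width=180.0):
    """One feature vector every `step` seconds along the trajectory."""
    centers = np.arange(min(times), max(times) + step, step)
    rows = [window_features(times, readers, reader_ids, c, width) for c in centers]
    return centers, np.vstack(rows)
```

Because consecutive windows overlap by all but 10 seconds, neighboring vectors are strongly correlated, and a move between areas shows up as a rapid drift in the sequence of vectors.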
In order to regroup similar current locations and to detect changes over time, the DS2L-SOM algorithm is applied on time windows from each individual sequence.
By using the similarity measure to compare all the subsequences of all the customers recorded during the first day of recording, we found six clusters that represent six well-defined homogeneous locations. The similarity measure can now be used to label each new customer’s sequence recorded after the first day. This is fast enough to be done in real time during the customer’s shopping.
The analysis method found six well-defined homogeneous locations (named sectors). We were thus able to define more well-localized areas than there are readers (50% more), which shows that the method extracts information effectively. The sectors can be described as follows (see also Figure
(S1) detected by reader 9 only; it corresponds to the entrance of the store. Baskets waiting for new customers are detected in this sector.
(S2) detected by reader 1 only. In this sector, customers can find flowers and vases.
(S3) mainly detected by readers 4 and 10. In this sector, customers can find wrought iron objects.
(S4) mainly detected by reader 1, sometimes by reader 9 (wood furniture).
(S5) mainly detected by readers 9 and 10, sometimes by reader 4 or 1 (dishes and small objects).
(S6) mainly detected by readers 1 and 4, sometimes by reader 9 (mirrors and linens).
Estimation of sectors location. The thickness of arrows is proportional to transition frequency.
Figure
Finally, the mean time spent in each sector is computed so as to find hot and cold spots. This shows that (S5) is a very hot spot (48% of the time), unlike (S2) (6%) and (S4) (2%), which are very cold spots. Note that we do not use (S1) for this analysis, as this sector includes waiting baskets.
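The share of time per sector reduces to a simple count once every time window carries a sector label. The sketch below assumes equally spaced windows, labels written as strings such as "S5", and a hypothetical helper name `sector_time_shares`; the sample labels in the usage note are made up for illustration and are not the recorded data.

```python
from collections import Counter

def sector_time_shares(window_labels, excluded=("S1",)):
    """Fraction of time spent in each sector, ignoring excluded sectors.

    window_labels: one sector label per (equally spaced) time window.
    """
    kept = [label for label in window_labels if label not in excluded]
    total = len(kept)
    if total == 0:
        return {}
    counts = Counter(kept)
    return {sector: n / total for sector, n in counts.items()}
```

For example, `sector_time_shares(["S1", "S5", "S5", "S2", "S5", "S4"])` gives a share of 0.6 for S5 and 0.2 each for S2 and S4, with S1 left out of the total.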
Animal societies are dynamic systems characterized by numerous interactions between individual members. Such dynamic structures stem from the synergy of these interactions, the individual capacities in information processing, and the diversity of individual responses.
Ants, often caricatured and little known, nevertheless have a huge ecological impact, and they are considered major energy catalysts. Their complex underground nests contribute to soil ventilation, and because of their predatory and detritivore diets, they contribute to ecosystem equilibrium. Ant colonies cope with rapid changes in environmental conditions and constraints through considerable individual flexibility. The following study aims at studying the mechanisms leading to colony migration (change of nest). Migration is a widespread phenomenon in many species, but it remains a risky event because the queen and brood are particularly vulnerable during the movement. The strategies used in nest choice and movement organization are therefore crucial for group survival.
An RFID device has been developed for this study. Based on marketed products, it requires little development. It consists of a network of RFID readers in a constrained space with compulsory passageways in an artificial nest. These readers are connected to a detector which sends the information to a computer.
For this study, we chose a big-sized tropical ant
Ant with RFID tag.
The movement between nests of a colony of 55
The experimental device for this experiment consists of two artificial nests (N1 and N2) of three rooms each (Room 1, 2, and 3) and a foraging area, linearly connected by six tunnels (Figure
The RFID experimental device.
At
The data files are in text format. They indicate, for each antenna scan (about three scans per second), the scan number, the date, time, and, for each individual (i.e., for each tag), which antenna is activated (Figure
Example of a recorded scan in the data file.
We used this information to produce the individual moving sequence of each ant. This sequence is a function that gives the ant’s location at any time during the move.
However, what we would like to analyze is the variation of the ant’s current spatial behavior over time. To do that, a current spatial behavior must be defined. Here, the current location alone cannot be used, because we would lose all dynamic information such as “the ant is moving quickly” or “the ant makes round trips between two rooms”. Therefore, the current behavior is defined as the time spent in each location (static information) and the number of exits from each location (dynamic information) during a 10-minute time window centered on the current time. As there are 19 locations in the RFID device (7 rooms and 12 readers), each temporal window is coded as a vector of 38 normalized features (one static and one dynamic feature for each location).
In order to regroup similar current behaviors and to detect changes in current behaviors over time, the DS2L-SOM algorithm is applied on the time-window vectors, modeled by an enriched SOM, from each individual sequence. Then, the similarity measure is used to compute a similarity matrix between all the subsequences of all the ants. DS2L-SOM is used to find clusters of homogeneous subsequences, which makes it possible to compare the behaviors of different ants. These clusters are then used to rename all the subsequences, so as to give the same name to subsequences that belong to the same cluster.
The RFID apparatus only provides a partial observation of the individuals. No information is provided about what happens inside a room; only the time an ant stays inside it can be known. Moreover, the sensors are not reliable, with a missed-detection rate ranging from 5% to 15%. Thus, a HMM variant called S-HMM [
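A possible encoding of one such window is sketched below. The moving sequence is assumed to be given as the list of times at which the ant enters successive locations, with locations numbered from 0 to 18; the function name and the normalization (each half of the vector divided by its sum) are assumptions, since the normalization scheme is not detailed here.

```python
import numpy as np

def behavior_vector(entry_times, locations, end_time, n_locations, center, width=600.0):
    """Static and dynamic description of a time window for one ant.

    entry_times[k] is the time the ant enters locations[k]; it stays there
    until entry_times[k + 1] (or end_time for the last location).
    Returns 2 * n_locations features: time spent, then number of exits.
    """
    lo, hi = center - width / 2, center + width / 2
    time_spent = np.zeros(n_locations)
    exits = np.zeros(n_locations)
    leave_times = list(entry_times[1:]) + [end_time]
    for start, stop, loc in zip(entry_times, leave_times, locations):
        overlap = min(stop, hi) - max(start, lo)
        if overlap > 0:
            time_spent[loc] += overlap  # static information
        if lo <= stop < hi and stop < end_time:
            exits[loc] += 1  # dynamic information: the ant left this location

    def normalize(v):
        total = v.sum()
        return v / total if total > 0 else v

    return np.concatenate([normalize(time_spent), normalize(exits)])
```

With 19 locations the result has 38 components, matching the coding described above; an ant making round trips between two rooms yields high exit counts for both, which the time-spent features alone would not reveal.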
At the end of learning procedure, eight groups of behaviors A
In order to check the validity of the obtained results, we compared them with some visual observations from a video record. A movie camera was placed over the foraging area and every ant moving across this area was filmed. This allowed us to detect only one apparent behavior: the transportation of larvae and cocoons. Each ant can be identified visually thanks to a color painted on its tag. So, we know which ants transported brood and at what time this behavior occurred. We compared this with the results of the automatic analysis, and we found that all the transportation subsequences were grouped into only one activity (Activity A5, see Table
Correspondence between visual observation and automatic analysis (number of ants).
 | Transportation | Others |
---|---|---|
Activity A5 | 8 | 2 |
Others | 0 | 42 |
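The agreement summarized in this table can be quantified with standard scores. The sketch below assumes that rows are the automatic analysis and columns the visual observation, so that 8 ants are in both the detected activity and the observed transportation group, 2 are in the detected activity only, and none is observed transporting outside it; the function name is illustrative.

```python
def agreement_scores(both, auto_only, visual_only, neither):
    """Precision, recall, and accuracy of the automatic grouping
    with respect to the visual observation (counts of ants)."""
    precision = both / (both + auto_only)
    recall = both / (both + visual_only)
    accuracy = (both + neither) / (both + auto_only + visual_only + neither)
    return precision, recall, accuracy

precision, recall, accuracy = agreement_scores(8, 2, 0, 42)
```

Under this reading, the recall is 1.0 (every ant seen transporting is in the detected activity), the precision is 0.8, and 50 of the 52 ants are classified consistently by both methods.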
Examples of detected transportation patterns.
The analysis of the spatiotemporal structure of each activity has been used by ethologists to hypothesize a plausible explanation for each behavior.
Now, the abstracted individual activity during emigration can be used to compute the collective dynamic of emigration. Figure
Activity analysis: evolution of the number
From an ethological point of view, these results are of great help for understanding how tasks are distributed during a nest relocation. In fact, we obtained a very accurate description of the dynamics of the whole colony during the entire emigration phase, allowing us to put forward some strong hypotheses about the function of the different behaviors during the nest relocation phase. Some results are in accordance with previous works, especially for the behaviors that can be observed in the foraging area. For example, the dynamics of the transportation behavior detected by the system match the results presented in [
In this paper, a new algorithm is proposed for modeling data structure, based on the learning of a SOM, and a measure of dissimilarity between cluster structures. The advantages of this algorithm are not only the low computational cost and the low memory requirement, but also the high accuracy achieved in fitting the distribution of the modeled datasets. The results obtained on the basis of artificial and real datasets are very encouraging. The new unsupervised algorithm used in this paper is an efficient data mining tool for behavioral studies based on RFID technology. It allows discovering and comparing stable patterns in an RFID signal and is suitable for continuous learning.
Here, it was possible to highlight some characteristics of the spatial organization of customers during their shopping in a big store from a complex and noisy spatiotemporal database. Moreover, the characteristics of spatiotemporal organization in ant colonies during migration were described, and these results are perfectly compatible with the results of previous works using classic methods [
This work was partly supported by the ANR (Agence Nationale de la Recherche) CADI 07 TLOG 003 and SILLAGE 05 BLAN 017701.