Using Omnidirectional Vision to Create a Model of the Environment: A Comparative Evaluation of Global-Appearance Descriptors

Nowadays, the design of fully autonomous mobile robots is a key discipline. Building a robust model of the unknown environment is an important ability the robot must develop. Using this model, the robot must be able to estimate its current position and to navigate to the target points. Omnidirectional vision sensors are commonly used to solve these tasks. When using this source of information, the robot must extract relevant information from the scenes both to build the model and to estimate its position. The possible frameworks include the classical approach of extracting and describing local features or working with the global appearance of the scenes, which has emerged as a conceptually simple and robust solution. While feature-based techniques have been extensively studied in the literature, appearance-based ones require a full comparative evaluation to reveal the performance of the existing methods and to tune their parameters correctly. This work carries out a comparative evaluation of four global-appearance techniques in map building tasks, using omnidirectional visual information as the only source of data from the environment.


Introduction
During the last years, the presence of mobile robots in both industrial and household environments has increased substantially since they are able to solve many different tasks. The expansion of robots into such environments and applications has been eased by the development of their abilities in perception, computation, autonomy, and adaptability to different circumstances. As far as perception is concerned, the robots must be equipped with sensors that allow them to extract the necessary information from the environment to carry out their tasks autonomously. Vision sensors have gained popularity because they present some interesting advantages, such as providing a large amount of information at a relatively low cost and with low power consumption (compared to other sensors, such as laser rangefinders) and stable data both outdoors and indoors (unlike GPS, whose signal is prone to degradation indoors). They also permit carrying out additional high-level tasks, such as people detection and recognition. Among vision sensors, catadioptric systems have become widespread in recent years, as they are able to capture images with a field of view of 360 deg around the robot [1].
In our approach, the mobile robot is equipped with a catadioptric system, which captures images from the environment. Using this information, the objective is to build a robust model of the environment. In general, these models can be represented as a metric, a topological, or a hybrid map [2]. First, metric maps define the position of some relevant features of the environment with respect to a coordinate system and permit robot localization with geometric accuracy (except for a relative error) [3]. Second, topological maps often represent the environment as a graph where nodes are distinctive locations (e.g., rooms) of the environment and links are the connectivity relationships between locations. Usually, such maps do not permit fine localization but they are enough to estimate the position of the robot and to navigate to the desired locations [4]. Finally, hybrid maps are hierarchical models where information is arranged in multiple levels. Usually, there is a high level of topological information that allows an approximate localization (in an area of the environment) and several low levels of metric information that permit refining the localization (in the previously detected area) [5].
In all cases, to build a functional map, it is necessary to extract relevant information from the scenes. Traditionally, researchers have focused on methods that extract some outstanding landmarks or interest points from the scenes and describe them with a robust description method. These methods have become popular in map building and mobile robot localization. For example, Angeli et al. [6] make use of SIFT features [7] to solve the mapping and global localization problems simultaneously (SLAM), and Valgren and Lilienthal [8] and Murillo et al. [9] make use of SURF features [10] to solve the localization problem in a previously created model. Using feature-based approaches in combination with probabilistic techniques, it is possible to build metric maps [3]. However, these methods present some drawbacks. For example, the environment must be rich in prominent details (otherwise, artificial landmarks can be inserted in the environment, but this is not always possible); also, the detection of such points is sometimes not robust against changes in the environment and their description is not always invariant to changes in robot position and orientation. Besides, camera calibration is crucial in order to incorporate new measurements in the model correctly, so small deviations in either the intrinsic or the extrinsic parameters add some error to the measurements. Finally, extracting, describing, and comparing landmarks are computationally complex processes that often make it unfeasible to build the model in real time, as the robot explores the environment.
In contrast, global-appearance techniques have gained relevance in more recent works [11][12][13]. These techniques are useful when the robot moves within unstructured environments where extracting and describing robust points is difficult. These approaches lead to conceptually simpler algorithms since each scene is described by means of a unique descriptor. Map creation and localization can be achieved just by storing these descriptors and comparing them pairwise. As a drawback, extracting metric relationships from this information is difficult; thus, this family of techniques is usually employed to build topological maps (unless the visual information is combined with other sensory data, such as odometry). Despite their simplicity, several difficulties must be faced when using these techniques. Since no local information is extracted from the scenes, it is necessary to use a compression and description method that makes the process computationally feasible. These descriptors are invariant neither to changes in the robot orientation nor to changes in the lighting conditions or in the environment (position of objects, doors, etc.). They also suffer problems in environments where visual aliasing is present, which is a common phenomenon in indoor environments with repetitive visual structures.
Many algorithms can be found in the literature working both with local features and with the global appearance of images. All these algorithms involve many parameters that have to be correctly tuned so that the mapping and localization processes are correct. Feature-based approaches have reached a relative maturity and some comparative evaluations have been carried out, such as [14]. These evaluations are useful to choose the most suitable extractor and descriptor for a specific application. However, global-appearance-based approaches are still a field that is worth deeper exploration. We have not found any work that makes a comparative evaluation of the performance of such descriptors in mapping tasks. This is the main objective we propose in this paper. We have selected four accepted global-appearance description methods, adapted them to be used with omnidirectional visual information, and studied their properties. Then, we have developed the necessary algorithms to create a model of the environment, tested their performance, and studied the influence of the most relevant parameters.
The remainder of the paper is structured as follows. Section 2 briefly presents the description approaches that will be evaluated throughout the paper. After that, Section 3 describes the kind of models of the environment we will build to test the performance of the approaches. Then, Section 4 details the set of experiments designed and the results obtained. To finish, a final discussion is carried out in Section 5.

Global-Appearance Descriptors: State of the Art
This section outlines some methods to describe the global appearance of images. Four families of methods are proposed to be analysed: methods based on the discrete Fourier transform (Section 2.1), on principal components analysis (Section 2.2), on orientation gradients (Section 2.3), and on the gist of the scenes (Section 2.4). These are the description methods whose performance will be evaluated throughout the paper.

Methods Based on the Discrete Fourier Transform (DFT).
The Discrete Fourier Transform (DFT) is a classical method to describe scenes that presents some interesting features. When the two-dimensional DFT of a scene $im(x, y)$ is calculated, the result is a complex function $IM(u, v)$ in the frequency domain ($u$ and $v$ are the frequency variables) that can be decomposed into magnitude and argument matrices. The first matrix (also known as amplitude spectrum) represents the distribution of spatial frequencies within the image (i.e., it contains information on the overall structure of the image: orientation of the edges, smoothness, width, etc.). On the other hand, the argument matrix contains information about the local properties of the scene (shape and position of the objects). Taking these facts into account, the amplitude spectrum can be used as a global descriptor of the scene, since it contains information about the dominant structural patterns and it is invariant to the distribution of the objects. This information has proved to be relevant to solve simple classification tasks [15]. However, this kind of descriptor has no information about the spatial relationships between the structures in the image. To have a complete description of the scene, such information must be included.
Considering this, we have opted for a formulation of the DFT which contains complete information. This formulation is the Fourier Signature (FS), first described in [12]. It is defined as the matrix composed of the 1D DFT of each row in the original image. When applied to panoramic scenes, it offers rotational invariance. When we calculate the FS of a panoramic image $im(x, y) \in \mathbb{R}^{N_x \times N_y}$, we arrive at a new matrix $IM(u, y) \in \mathbb{C}^{N_x \times N_y}$, where the main information is concentrated in the low frequency components of each row (so we can retain only the $k$ first columns, which has a compression effect). This new matrix with $N_x$ rows and $k$ columns can be decomposed into a magnitude matrix $A(u, y) = |IM(u, y)|$ with $N_x$ rows and $k_1$ columns and an argument matrix $\Phi(u, y)$ with $N_x$ rows and $k_2$ columns.
Based on the shift property of the DFT, when two panoramic images have been captured from the same position but with different robot orientations, both images have the same magnitude matrix and the argument matrices permit obtaining the relative robot orientation. This property allows us to use the magnitude matrix to estimate the position of the robot (as it presents rotational invariance) and, then, the argument matrix to estimate the relative orientation of the robot.
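The shift property can be checked numerically. The following is a minimal sketch, assuming that a change of robot orientation is equivalent to a circular shift of the columns of the panoramic image; the function names `fourier_signature` and `relative_shift` and the use of the first harmonic to recover the shift are illustrative choices, not the procedure of [12].

```python
import numpy as np

def fourier_signature(pano, k1, k2):
    """Row-wise 1D DFT of a panoramic image (rows x columns).

    Returns the magnitude matrix (first k1 columns) and the
    argument matrix (first k2 columns)."""
    spectrum = np.fft.fft(pano, axis=1)
    magnitude = np.abs(spectrum[:, :k1])   # invariant to column shifts
    argument = np.angle(spectrum[:, :k2])  # encodes the column shift
    return magnitude, argument

def relative_shift(arg_ref, arg_rot, n_cols):
    """Estimate the circular column shift between two images.

    Assumption: only the first harmonic (column 1) is used. Shifting a
    row by s columns subtracts 2*pi*s/n_cols from its phase there."""
    diff = arg_ref[:, 1] - arg_rot[:, 1]
    mean_angle = np.angle(np.mean(np.exp(1j * diff)))  # circular mean over rows
    return (mean_angle * n_cols / (2.0 * np.pi)) % n_cols

# Synthetic example: a random "panorama" and a rotated copy of it.
rng = np.random.default_rng(0)
pano = rng.random((64, 256))
rotated = np.roll(pano, 40, axis=1)
mag_a, arg_a = fourier_signature(pano, 16, 16)
mag_b, arg_b = fourier_signature(rotated, 16, 16)
print(np.allclose(mag_a, mag_b), relative_shift(arg_a, arg_b, 256))
```

With real images, noise and occlusions make the two magnitude matrices similar rather than identical, so a distance between them would be used instead of an equality check.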

Methods Based on Principal Components Analysis (PCA).
Panoramic images are data that fall in a space with a very high number of dimensions. However, the image pixels tend to be highly correlated, since they have been captured from a 3DOF process (robot pose on the ground plane). Taking this fact into account, a natural way to compress the information is principal components analysis (PCA), as shown in [16]. This kind of descriptor has evolved from its original formulation to be adapted to mapping and localization tasks. The works of Leonardis and Bischof [17] show some examples of how this analysis can be used for robust mobile robot localization.
When we have a set of $N$ images $im_j(x, y) \in \mathbb{R}^{N_x \times N_y}$, $j = 1, \ldots, N$, each image can be considered a point in a space with $N_x \cdot N_y$ dimensions, $\vec{x}_j \in \mathbb{R}^{N_x \cdot N_y \times 1}$, $j = 1, \ldots, N$ ($N \ll N_x \cdot N_y$). Using the classical formulation of PCA, it is possible to transform each point $\vec{x}_j$ into a new data point, namely, the image projection $\vec{p}_j \in \mathbb{R}^{k_3 \times 1}$, $j = 1, \ldots, N$, where $k_3$ is the number of PCA features that contain the most relevant information, $k_3 \leq N$. Turk and Pentland [18] show how the necessary transformation matrix $V$ can be obtained in an efficient way. They make use of the SVD of the covariance of the data matrix, retaining only the eigenvectors with the highest eigenvalues. If the number of eigenvectors is equal to $N$, then there is no loss of information during the compression process [16]. Thus, after applying PCA techniques, images can be handled efficiently, with a low computational cost. However, depending on the size of the images, the process to obtain $V$ may be substantially slow.
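The compression step can be sketched as follows. This is only an illustration under stated assumptions: it computes the principal directions with a thin SVD of the mean-centred data matrix (which yields the same eigenvectors as the covariance-based formulation), the mean image is subtracted before projecting, and the names `pca_projections` and `reconstruct` are hypothetical.

```python
import numpy as np

def pca_projections(images, k3):
    """images: array of shape (N, Nx, Ny). Returns the mean vector,
    the transformation matrix V (k3 x Nx*Ny) and the N projections."""
    n = images.shape[0]
    data = images.reshape(n, -1).astype(float)  # one image per row
    mean = data.mean(axis=0)
    centred = data - mean
    # Thin SVD: rows of vt are the principal directions, sorted by
    # decreasing singular value (i.e., decreasing eigenvalue).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    v = vt[:k3]
    projections = centred @ v.T  # shape (N, k3)
    return mean, v, projections

def reconstruct(mean, v, projections):
    """Map the projections back to image space (as flattened rows)."""
    return projections @ v + mean

# Synthetic example: 10 small images, compressed to 3 features each.
rng = np.random.default_rng(0)
images = rng.random((10, 8, 32))
mean, v, proj = pca_projections(images, 3)
error = np.abs(reconstruct(mean, v, proj) - images.reshape(10, -1)).max()
print(proj.shape, error)
```

Retaining as many eigenvectors as images reproduces the original data exactly, which matches the no-loss statement above; smaller values of $k_3$ trade reconstruction error for a more compact map.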
The use of PCA in mapping and localization tasks is limited since the image projections depend on the robot orientation. Even when omnidirectional scenes are used, the image projections contain information only about the position and orientation the robot had when capturing the images. This is the reason why Jogan and Leonardis developed the concept of eigenspace of spinning images [19]. This model uses specific properties of panoramic images to obtain, in an efficient way, an optimal subspace that takes into account the different orientations a robot may have when capturing each image. The method takes advantage of the symmetry properties the data matrix presents when the rotation information is added. This method has the advantage of permitting the estimation of the robot orientation, but the computational cost to obtain the transformation matrix $V$ is extremely high. For this reason, it has only been used with small environments, with a limited number of images.

Methods Based on the Histogram of Oriented Gradients (HOG).
HOG is a description method used traditionally in object detection. This technique considers the gradient orientation in localized parts of a scene. The method stands out for its simplicity, low computational cost, and relatively good results in object recognition tasks. It was initially described by Dalal and Triggs [20], who used it in people detection tasks. Later on, some researchers developed a version of the algorithm improved both in detection accuracy and in computational cost [21].
However, the experience with HOG descriptors in the mobile robotics field is limited to simple and small environments. Few previous works have made use of HOG in robot mapping and localization. Hofmeister et al. [22] use this descriptor in localization tasks with small robots, with low resolution images and small environments not prone to visual aliasing. Under these limited conditions, the algorithm works well.
HOG is not defined as a global-appearance descriptor because the basic implementation consists in dividing the scene into a set of cells and obtaining a histogram of gradient orientation using the pixel information in each cell. The combination of all these histograms is the image descriptor. We have redefined the algorithm to obtain a unique descriptor per image that contains information on the global appearance of this image. The version of HOG we consider is described in [23], where a global version of HOG is used to carry out mapping and Monte-Carlo localization in large environments. In any case, it is necessary to evaluate the performance of this algorithm and to systematize it in map creation tasks.

Methods Based on Gist and Prominence.
The gist concept was first introduced by Oliva and Torralba [24], with the idea of creating a low-dimension scene descriptor and avoiding the segmentation and processing of points, objects, or individual regions. They were inspired by some works that suggested that humans recognize scenes by codifying the global configuration and ignoring most of the details and individual objects [25].
More recently, some works have made use of the prominence concept together with gist. Prominence refers to regions of pixels that stand out with respect to the neighbouring regions, in contrast to gist, which implies the accumulation of statistical data from the whole image. Siagian and Itti [26] try to establish a synergy between the two concepts and design a unique descriptor that takes both into account. This descriptor is built using the intensity, orientation, and colour information.
The experience with this kind of descriptor in mobile robot applications is limited. For example, Chang et al. [27] present a localization and navigation system based on gist and prominence and Murillo et al. [28] make use of gist descriptors in a localization problem. However, they obtain these descriptors using specific regions in a set of panoramic images.
Like HOG, gist is not primarily defined as a global-appearance descriptor and we have redefined the algorithm to obtain a unique descriptor per image. The version of gist we consider in this evaluation is described in [23] and is built from orientation information, analysed at several resolution levels.

Creating a Visual Topological Map of the Environment
In this section we focus on the map creation problem. The robot, which is equipped with a catadioptric vision system on its top, explores the environment to be mapped until it is completely covered. During this process, the robot captures a set of omnidirectional scenes from several positions. Only this visual information will be used to build the map (neither odometry nor laser or other sensory data will be used). This way, the final model will be a topological map since it contains some locations (represented as panoramic scenes) and connectivity relations, but no metric data. In Section 3.1, we describe how the nodes of the map are represented with each description method and, in Section 3.2, the process to add connections between the nodes is outlined.

Using Global-Appearance Descriptors to Create a Model of the Environment.
Let us suppose that the mobile robot has gone across the environment to be mapped (either in a teleoperated way or autonomously, following any exploration algorithm) and has captured a set of omnidirectional images $I = \{im_1, im_2, \ldots, im_N\}$, where $im_j \in \mathbb{R}^{N_x \times N_y}$. From this set of images, a set of descriptors, one per original scene, is calculated. As a result, the nodes of the map will be a set of descriptors $D = \{d_1, d_2, \ldots, d_N\}$ where, in general, $d_j \in \mathbb{C}^{M_x \times M_y}$. For these nodes to be functional, $d_j$ must contain information that permits estimating the position of the robot when capturing $im_j$ (taking into account that the robot may have any orientation in this position). In the next subsections, we detail the kind of information each $d_j$ should contain when using each description method.

DFT Descriptor.
Each node $d_j$ contains two matrices: the magnitude matrix $A_j(u, y) \in \mathbb{R}^{N_x \times k_1}$ and the argument matrix $\Phi_j(u, y) \in \mathbb{R}^{N_x \times k_2}$. $k_1$ is the number of columns we retain in the localization descriptor and $k_2$ is the number of columns retained in the orientation descriptor. The higher $k_1$ and $k_2$, the more information the descriptor contains. However, we must take into account that the main information is concentrated in the low frequency columns and that, if noise is present in the image, it will mostly affect the high frequency components; thus, removing these components may bring an additional benefit. The effect of both parameters in a mapping process will be evaluated.

PCA Descriptor.
The PCA descriptor we use is proposed in the works of Jogan and Leonardis [19]. This model uses the specific properties of panoramic images to create a set of $N_R$ spinning images from each of the $N$ original panoramic images, so we get $N_R$ data vectors per original image. To obtain the transformation matrix $V$, the similarities among the rotated versions of each image are taken into account. This permits decomposing the original problem (which is computationally very heavy) into a set of lower order problems.
As a result of the process, the map will be composed of (a) a set of descriptors $\vec{p}_j \in \mathbb{R}^{k_3 \times 1}$, $j = 1, \ldots, N$, which are the projections of the original panoramic images and contain information on the robot position, (b) a set of phase vectors $\vec{\phi}_j \in \mathbb{R}^{k_3 \times 1}$, one per image, which contain information on the robot orientation, and (c) a unique transformation matrix $V \in \mathbb{C}^{k_3 \times N_x \cdot N_y}$. $k_3$ is the number of eigenvectors chosen. The higher $k_3$, the more information the map contains. If $k_3 = N$, there is no loss of information.

HOG Descriptor.
Each image will be described through two HOG descriptors. The first one, $\vec{h}_1$, is the position descriptor and is invariant against rotations of the robot. To obtain it, the panoramic image is divided into horizontal cells, whose width is equal to $N_y$ (number of columns in the image) and whose height can be configured freely. The size of $\vec{h}_1$ is $1 \times k_4 \cdot b$, where $k_4$ is the number of horizontal cells and $b$ is the number of bins in each orientation histogram. The second one, $\vec{h}_2$, is the orientation descriptor. To obtain it, the panoramic image is divided into vertical cells whose height is equal to $N_x$. Some overlap between these cells may exist. If the width of the cells is $l_1$ and the distance between consecutive cells is $s_1$, then the number of vertical cells is $k_5 = N_y / s_1$. The size of the orientation descriptor $\vec{h}_2$ is then $1 \times k_5 \cdot b$. In the experiments, the influence of $k_4$, $k_5$, and $b$ will be evaluated.
Figure 1 shows, from a panoramic image whose gradient has been calculated, the process to obtain both descriptors: (a) $\vec{h}_1$ and (b) $\vec{h}_2$.
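The cell layout described above can be sketched as follows. This is a minimal illustration, not the implementation of [23]: the circular horizontal gradient, the unsigned orientation range, the magnitude-weighted histograms, the final L2 normalisation, and the names `hog_position`, `hog_orientation`, `width`, and `step` are all assumptions made for the example.

```python
import numpy as np

def gradient(pano):
    """Gradient magnitude and unsigned orientation of a panoramic image."""
    # Circular difference along the columns, since a panorama wraps around.
    gx = (np.roll(pano, -1, axis=1) - np.roll(pano, 1, axis=1)) / 2.0
    gy = np.gradient(pano, axis=0)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # orientation in [0, pi)
    return mag, ang

def cell_histogram(mag, ang, bins):
    """Orientation histogram of one cell, weighted by gradient magnitude."""
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist

def hog_position(pano, k4, bins):
    """Position descriptor: k4 horizontal cells spanning all the columns."""
    mag, ang = gradient(pano)
    rows = np.array_split(np.arange(pano.shape[0]), k4)
    h = np.concatenate([cell_histogram(mag[r], ang[r], bins) for r in rows])
    return h / (np.linalg.norm(h) + 1e-12)

def hog_orientation(pano, width, step, bins):
    """Orientation descriptor: overlapping vertical cells of the given
    width, one every `step` columns, wrapping around the image."""
    mag, ang = gradient(pano)
    n_cols = pano.shape[1]
    cells = []
    for start in range(0, n_cols, step):
        cols = np.arange(start, start + width) % n_cols
        cells.append(cell_histogram(mag[:, cols], ang[:, cols], bins))
    h = np.concatenate(cells)
    return h / (np.linalg.norm(h) + 1e-12)

# Synthetic example with a 64 x 256 image.
rng = np.random.default_rng(0)
pano = rng.random((64, 256))
h1 = hog_position(pano, k4=4, bins=8)
h2 = hog_orientation(pano, width=16, step=8, bins=8)
print(h1.shape, h2.shape)
```

Because each horizontal cell spans every column, a circular shift of the image only permutes the pixels inside a cell, which is why the position descriptor does not change with the robot orientation. With 256 columns and a step of 8 columns, the orientation descriptor uses 256 / 8 = 32 vertical cells.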

Gist and Prominence Descriptor.
The information on the orientation of the edges in the image is used to build the descriptor. First, two versions of each image are considered: the original one and a new version obtained after applying a Gaussian low-pass filter and subsampling to a new size $0.5 N_x \times 0.5 N_y$. Second, both images are filtered with a bank of $m$ Gabor filters with orientations evenly distributed between 0 and 180 deg. Third, to reduce the amount of information, the pixels in each resulting image are grouped into blocks. The block division is carried out in a similar fashion as in HOG: a position descriptor $\vec{g}_1$ is obtained by defining $k_6$ horizontal blocks and an orientation descriptor $\vec{g}_2$ is calculated with $k_7$ vertical blocks (with overlapping). In the experiments, the influence of $k_6$, $k_7$, and $m$ will be evaluated. Figure 2 shows, from a panoramic image, the process to obtain $\vec{g}_1$. To sum up, Table 1 shows the parameters to be tuned in each description method included in the evaluation. On the other hand, Table 2 gives details of the contents of the map when we consider each description method.
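The three steps of the position descriptor can be sketched as follows. Several details are assumptions made only for illustration and are not taken from [23]: a 2 x 2 block average stands in for the Gaussian low-pass filter and subsampling, the Gabor kernels use arbitrary size, wavelength, and spread, the filtering is circular (computed through the FFT), and each block is summarised by the mean absolute response. All function names are hypothetical.

```python
import numpy as np

def gabor_kernel(theta, size=15, wavelength=6.0, sigma=3.0):
    """Real (even) Gabor kernel with orientation theta, in radians."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength)

def circular_filter(img, kernel):
    """Circular convolution of img with a small kernel, via the FFT."""
    kh, kw = kernel.shape
    padded = np.zeros(img.shape)
    padded[:kh, :kw] = kernel
    # Centre the kernel on the origin so the response is not displaced.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(padded)))

def half_size(img):
    """Stand-in for low-pass filtering and subsampling: 2 x 2 averages."""
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def gist_position(pano, m, k6):
    """Position descriptor: 2 scales x m orientations x k6 horizontal blocks."""
    descriptor = []
    for img in (pano, half_size(pano)):
        for theta in np.arange(m) * np.pi / m:  # orientations in [0, 180) deg
            response = np.abs(circular_filter(img, gabor_kernel(theta)))
            for block in np.array_split(response, k6, axis=0):
                descriptor.append(block.mean())
    return np.array(descriptor)

# Synthetic example with a 64 x 256 image, m = 8 filters, 4 blocks.
rng = np.random.default_rng(0)
pano = rng.random((64, 256))
print(gist_position(pano, m=8, k6=4).shape)
```

In this sketch the image dimensions must be even and, at both scales, larger than the kernel. The orientation descriptor would follow the same pattern with overlapping vertical blocks instead of horizontal ones.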

Adding Topological Relations.
Our starting point is a set of images captured from unknown positions. The objective of this section consists in designing an algorithm that allows us to establish adjacency relations among them, with the goal of creating a topological map. Apart from this, we expect the distribution of the nodes in this map to be similar to the distribution of the points where the images were captured. This goes beyond the classical concept of topological map since, besides adjacency, it also introduces the concepts of closeness and farness. Thanks to this kind of map, the robot will be able to plan its trajectory more accurately.
To create such a map, a method based on a mechanical system of forces is used. This kind of method has often been used to simulate the movement of flexible bodies, as in [29], where the body is discretized into a set of particles, and the interaction among them is modelled with a set of springs. Our framework also includes a set of dampers in parallel with the springs, since the dampers can help to achieve an overdamped behaviour that facilitates reaching the steady state.
The idea we develop consists in considering each image a particle which is linked to the rest of the images (particles) through a spring-damper pair, where the natural length of each spring is equal to the distance between the descriptors of the two images linked by this spring. The particles start their evolution from random positions. If we let the forces produced by the springs and dampers act freely on the system until it tends to a minimum energy position, we expect the distribution of particles to be similar to the distribution of capture points. The algorithm we use is inspired by the algorithm presented by Menegatti et al. [12], who used it in small environments.

Mass-Spring-Damper Method.
Each image is considered a particle $P_i$, $i = 1, \ldots, N$, with mass $m_i$, where $N$ is the number of images to include in the map. No information about the coordinates of the capture points is available. Each pair of particles $P_i$ and $P_j$ is linked with a spring $S_{ij}$ with elastic constant $k_{ij}$ and a damper with damping constant $c_{ij}$. The natural length of each spring, $l^0_{ij}$, is equal to the distance between the descriptors of the images associated with the particles $P_i$ and $P_j$.
The positions of the particles are initialised randomly. After that, the system is allowed to evolve freely until it reaches a steady state. At this state, the distribution of the particles is expected to be similar to the distribution of capture points (except for a scale factor and a rotation). This way, the result is a scaled model of the real distribution. We consider the value of the elastic constants to be proportional to the distance between images and, from a threshold distance, the images are not linked by any spring. Under these circumstances, the spring and damper linking each pair of particles $P_i$ and $P_j$ exert on these particles a force $\vec{F}_{ij}$ that depends on the positions $\vec{r}_i$, $\vec{r}_j$ and the speeds $\vec{v}_i$, $\vec{v}_j$ of the two particles. Then, the resulting force $\vec{F}_i$ on each particle is obtained by adding the forces exerted by all the springs and dampers linked to it.
From this resulting force, the acceleration of the particle is obtained from Newton's second law, $\vec{a}_i(t) = \vec{F}_i / m_i$, where $\vec{a}_i(t)$ is the $i$th particle acceleration at time instant $t$, $\vec{F}_i$ is the resulting force on particle $i$, and $m_i$ is the $i$th particle mass. From this acceleration, the speed and position of particle $i$ after a period of time $\Delta t$ can be calculated. This method, known as Euler integration, may not be stable unless the time step is small enough, which would increase the computational cost of the process. This is the reason why Verlet integration is sometimes suggested. In this integration method, the position and speed are updated at each iteration with one expression at the first time instant, $t = \Delta t$, and another one from this time instant onwards.
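The relaxation process can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise force is taken as a standard linear spring plus a damper acting along the line that joins the two particles, a single elastic constant `k` and damping constant `c` are used for every pair, all masses are equal, and the integration is a semi-implicit Euler scheme (the speed is updated first and then used to update the position). The names `pair_forces` and `relax` and every default value are hypothetical.

```python
import numpy as np

def pair_forces(pos, vel, rest, k, c, cutoff):
    """Resulting force on each particle from all spring-damper pairs.

    rest[i, j] is the natural length of the spring linking i and j
    (the distance between the two image descriptors)."""
    n = pos.shape[0]
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            if rest[i, j] > cutoff:
                continue  # images too different: no spring between them
            delta = pos[i] - pos[j]
            length = np.linalg.norm(delta) + 1e-12
            unit = delta / length
            stretch = length - rest[i, j]
            closing_speed = np.dot(vel[i] - vel[j], unit)
            force = -(k * stretch + c * closing_speed) * unit
            forces[i] += force
            forces[j] -= force  # action and reaction
    return forces

def relax(rest, steps=6000, dt=0.01, k=1.0, c=1.0, mass=1.0,
          cutoff=np.inf, seed=0):
    """Let the system evolve from random positions; return final positions."""
    rng = np.random.default_rng(seed)
    n = rest.shape[0]
    pos = rng.random((n, 2))
    vel = np.zeros((n, 2))
    for _ in range(steps):
        acc = pair_forces(pos, vel, rest, k, c, cutoff) / mass
        vel = vel + acc * dt   # speed update
        pos = pos + vel * dt   # position update with the new speed
    return pos

# Synthetic example: natural lengths taken from three known 2D points.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rest = np.linalg.norm(points[:, None] - points[None, :], axis=2)
layout = relax(rest)
print(np.linalg.norm(layout[:, None] - layout[None, :], axis=2))
```

The recovered layout can only match the capture points up to a rigid motion and a reflection, so the comparison is made on pairwise distances. With many particles the system may settle in a local minimum, which is one reason to repeat the process from several random initialisations.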

Experiments
In this section, we compare the performance of the four description methods. First, we describe the sets of images we have used to carry out the experiments. Then, the evaluation is carried out from several points of view to fully uncover the goodness of each method in mapping tasks. We analyse the computational cost of the mapping process, the relationship between the image distance and the geometric distance, and the performance in topological map building.

Sets of Images.
To carry out the experiments, we make use of two sets of images, captured with two different catadioptric systems. First, set 1 has been captured by us in a building of Miguel Hernández University (Spain). The images were captured across 6 different rooms in an office-like environment. Figure 3(a) shows a bird's eye view of this environment. The database is composed of 873 panoramic colour images of 64 × 256 pixels, which have been captured on a dense 40 × 40 cm grid of points (red points in Figure 3(a)). Set 1 [30] is a challenging database due to the tendency of the environment to visual aliasing. There are many zones which, despite being geometrically far apart, present a similar visual appearance. Also, the images were captured at different times of day (changing lighting conditions) and the positions of some objects in the scenes are modified (e.g., changes in the state of doors). All the images were captured with an Imaging Source DFK 21BF04 camera mounted on a Pioneer P3-AT robotic platform. The camera takes pictures of a hyperbolic mirror (Eizoh Wide 70) which is mounted on it with its axis aligned with the optical axis of the camera. The resulting omnidirectional images are transformed with a cylindrical projection to obtain their panoramic versions. The P3-AT robot has 4 drive wheels. Its maximum linear speed is equal to 0.7 m/s, its maximum turning speed is equal to 140 deg/s, and the minimum turning radius is null. The robot can move freely on the floor, so the image capture process has 3 degrees of freedom: position on the ground plane with respect to a world coordinate system $(x, y)$ and orientation with respect to the $x$-axis ($\theta$). Figure 3(b) shows the robot, the catadioptric system, and a sample image (omnidirectional and panoramic formats).
The second set of images has been captured by a third party [31]. It is composed of a set of panoramic grayscale images, captured in several rooms of a university and a [...] Table 3 shows the rooms we have used and the main features of the images.

Computational Cost.
Previously, Section 3.1 has outlined the contents of the map nodes. Now, the objective of this section consists in making a comparative evaluation of the computational cost of the four description methods during the creation of the map nodes. This study will be carried out depending on the value of the most relevant parameters of each description method. Data set 1 is used to carry out this comparative evaluation. This is an interesting study as it allows us to know which algorithms could work in real time. First, Figure 4 shows the computation time using (a) FS versus $k_1$ and $k_2$, and (b) rotational PCA versus $N_R$. Second, Figure 5 shows the time when using HOG versus $k_4$, $l_1$, and $s_1$. Finally, Figure 6 shows gist with $m = 8$ versus $k_6$, $l_2$, and $s_2$. In all cases, the time per image is depicted. The total time to build the map can be obtained by multiplying by 873 (number of images in set 1).
In the case of FS, as $k_1$ and $k_2$ increase, the time increases slightly. The cost to obtain the DFT of each row is the same; the difference lies in the need to compute the magnitude and argument of a different number of components, which implies a low computational cost. In any case, the computational cost of FS is very low.
As far as rotational PCA is concerned, Figure 4 shows how the time increases exponentially as $N_R$ does, reaching up to 110 seconds per image (27 hours to build the whole map) when $N_R = 32$ rotations. It has been impossible to consider a higher number of rotations due to the enormous memory requirements of the process.
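The total time quoted above follows directly from the time per image and the number of images in set 1, as this short check shows (the symbol $t_{\text{map}}$ is introduced here only for the calculation):

```latex
t_{\text{map}} = 873 \times 110\ \text{s} = 96\,030\ \text{s} \approx 26.7\ \text{h}
```

This is consistent with the figure of about 27 hours for the whole map.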
If we now analyse HOG, on the one hand, the influence of $k_4$ is low and, on the other hand, the time increases linearly with $l_1$. Finally, when $s_1$ increases, the time decreases, as fewer vertical cells are considered. In general, HOG presents a substantially higher computational cost compared to FS; despite this, the algorithm is quick enough to permit carrying out the mapping process in real time, as the robot explores the unknown environment.
Finally, the computational cost of gist is, in general, approximately 10 times the cost of FS and similar to that of HOG. FS, HOG, and gist are all computationally feasible algorithms. Nevertheless, rotational PCA could only be used if the mapping process is allowed to be done offline. Also, the maximum number of rotations included in the map has been $N_R = 32$. This means that the resolution in orientation estimation will be low. In any case, even if the computational cost of rotational PCA had been low enough, this algorithm would not have permitted building maps online since all the training images must be available to start the process (unless an incremental PCA algorithm is used [32], which would add more computational cost to the process and make it unfeasible in real time). The other three algorithms do not present this disadvantage since they are inherently incremental methods (each image is described independently of the rest of the images, so the robot can build the map as it explores the unknown environment).

Image Distance versus Geometric Distance.
Once we know the computational cost of the description methods, the objective of this section consists in carrying out several experiments to test the applicability of these methods to the creation of topological maps.
The first experiment consists in studying the relationship between the geometric distance between the positions where two images have been captured and the distance between the descriptors of these two images. Ideally, the image distance should increase monotonically and linearly with the geometric distance, at least in a small interval around the point where the reference image was captured.
To carry out this study, several distance measures are taken into consideration. First, these distances are formalized. Let $\vec{a} \in \mathbb{R}^{l \times 1}$ and $\vec{b} \in \mathbb{R}^{l \times 1}$ be two descriptors, where $a_i$ and $b_i$ are the $i$th components of $\vec{a}$ and $\vec{b}$, with $i = 1, \ldots, l$. The distance between these descriptors can be defined as follows.

(b) Pearson Correlation Coefficient. It is a similitude coefficient that can be obtained as follows:

$$\rho(\vec{a}, \vec{b}) = \frac{\sum_{i=1}^{l} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{l} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{l} (b_i - \bar{b})^2}},$$

where $\bar{a}$ and $\bar{b}$ are the mean values of the components of $\vec{a}$ and $\vec{b}$.

(c) Inner Product. It is also a similitude coefficient, which can be calculated as the scalar product between the two vectors to compare:

$$s(\vec{a}, \vec{b}) = \frac{\vec{a}^{T} \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}.$$

As shown in the equation, $\vec{a}$ and $\vec{b}$ are usually normalized. In this case, this measure is known as cosine similitude and takes values in the range $[-1, 1]$.

(d) Other Distance Measures. Other distance measures have been considered in the study, as they have provided good results when applied to very-high dimensional data in clustering tasks [33]. We name them log and root distances. Their definitions involve $\max$ and $\min$, which are, respectively, the maximum and minimum value among the components of the vectors in a set $R$ of descriptors. This way, the distance does not only depend on $\vec{a}$ and $\vec{b}$, but also on the set of vectors in $R$.

To study the relation between the image distance (distance between the descriptors of two images) and the geometric distance (Euclidean distance between the points where these images were captured), the rooms kitchen and hall of data set 2 have been used, since these are the two rooms whose grid presents the highest resolution (10 × 10 cm). In both cases, starting from a reference point, several sets of scenes have been taken both horizontally and vertically, and the distance between the reference image and each of them has been obtained. The next figures show the results obtained (average distance and variance) after this set of experiments.
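As a rough illustration of the measures discussed in this section, the following sketch computes them for two descriptors stored as plain Python lists. All function names are hypothetical. The Minkowski family (cityblock for p = 1, Euclidean for p = 2) is included as an assumption, based on the distance names used later in this section, and the log and root distances of [33] are not covered. The `to_distance` helper is only one common way to turn a similitude coefficient into a dissimilarity and is not necessarily the conversion used in the experiments.

```python
import math


def minkowski(a, b, p=2.0):
    """Minkowski distance; p=1 is cityblock, p=2 is Euclidean (assumed family)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def pearson(a, b):
    """Pearson correlation coefficient (a similarity in [-1, 1])."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cosine_similarity(a, b):
    """Inner product of the normalized vectors (a similarity in [-1, 1])."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def to_distance(similarity):
    """Hypothetical conversion of a similarity into a distance-like value."""
    return 1.0 - similarity
```

With such a conversion, two identical descriptors yield a distance of 0 for both the correlation and the cosine measures, which makes their curves comparable with those of the Minkowski family.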
First, Figure 7 shows the distance results when using FS. This figure shows how, in this case, the different distance measures present quite similar results. In a small interval around the reference image, the image distance increases (quite linearly in the case of the correlation and cosine distances). However, the curves present an undesirable behaviour, since they reach a maximum and then begin to decrease. The cosine distance is not shown as it provides a very similar result to the correlation.
Next, Figure 8 shows the results when using rotational PCA to describe scenes. In all cases, 200 PCA components have been used. The result obtained with the cityblock distance is remarkable because, despite being the simplest measure, it behaves quite linearly when the number of rotations is high enough (although it presents a local minimum in the middle). The logarithm and root distances also present relatively good results. The data in Figure 9 allow us to analyse the influence of the number of PCA components. In all cases, including those with a low number of components (very compact descriptors), the behaviour is quite linear and monotonic with some distance measures.
Thirdly, Figure 10 shows the results obtained when the images are described through HOG. In all cases, the results are quite similar to those of FS. However, the local maximum is reached at a point closer to the reference image. This fact limits the validity range of the computed distance.
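The notion of a validity range can be made concrete with a small sketch. It assumes the image distances have been sampled at regular geometric steps from the reference image (as on the grids of data set 2) and returns the geometric distance up to which the curve keeps increasing. The function name and its arguments are hypothetical.

```python
def monotonic_range(distances, step):
    """Return the geometric distance up to which the curve strictly increases.

    distances[k] is the image distance measured at k * step from the
    reference image (distances[0] corresponds to the reference itself).
    """
    last = 0
    for k in range(1, len(distances)):
        if distances[k] <= distances[k - 1]:
            break  # first local maximum reached: the curve stops increasing
        last = k
    return last * step
```

A descriptor whose curve peaks early, as described for HOG, would yield a smaller value than one whose curve keeps increasing over the whole sampled interval.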
To finish the distance results, we show the results obtained with gist in Figure 11. In this case, the results obtained with correlation (and cosine) must be highlighted, thanks to their linearity and monotonicity.
As a final conclusion, the FS and HOG descriptors present a limited utility to estimate the topological distance between images, given that the behaviour of the distances is not monotonic (FS presents a larger useful interval). Rotational PCA presents a relatively good behaviour when using the cityblock, Euclidean, and Minkowski distances. Finally, the excellent performance of gist with the cityblock and correlation distances must be highlighted, due to their monotonicity and linearity. The quality of this configuration suggests that it could be the first option to implement a topological mapping algorithm.
4.4. Topological Model. This section presents the last experiment carried out. The algorithm presented in Section 3.2 has been used to build several topological maps using data set 2. This data set presents different grid steps, depending on the room considered. This way, it allows us to study the influence of this important parameter.
As far as the configuration of the mass-spring-damper algorithm is concerned, the most critical parameter is the spring constant. If all the springs are given the same elastic constant, the results are not consistent, because the presence of visual aliasing in the environment introduces undesired forces in the system. To avoid this effect, each elastic constant is calculated from the distance between the descriptors of the two particles linked by the spring, through an expression that uses the average slope measured in Figures 7-11 for the selected descriptor and parameters. The value of each elastic constant has been limited to 100 to avoid very high forces. The natural length of each spring is equal to the distance between the descriptors of the particles linked by the spring. All the particles are considered to have the same unit mass, since our experiments have shown that it is not a relevant parameter. The damping constant of all the dampers is set to 0.6. Thanks to this dynamic friction, the behaviour of the system tends to be overdamped and more stable, permitting a gradual evolution from the initial position to the steady state, without large oscillations. Finally, we have defined the time step $\Delta t = 0.03$ s. It is an important parameter that influences both the settling time and the stability of the resulting system. A low value implies a long settling time, whereas a high value shortens this time, but the movement between two consecutive iterations may be so large that the system could become unstable.
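The following sketch illustrates how such a mass-spring-damper relaxation could be simulated with the values given above (unit mass, damping 0.6, time step 0.03 s, stiffness capped at 100, natural length equal to the descriptor distance). It is not the algorithm of Section 3.2: the function name is hypothetical, the stiffness rule (inverse of the descriptor distance) is only a placeholder for the paper's expression, and the semi-implicit Euler integration and the random two-dimensional initialization are assumptions made for the example.

```python
import math
import random


def relax_layout(dist, k_max=100.0, mass=1.0, damping=0.6, dt=0.03,
                 steps=5000, seed=0):
    """Place n particles in 2D so that spring lengths approach dist[i][j].

    dist is a symmetric matrix of image distances between descriptors.
    """
    n = len(dist)
    rng = random.Random(seed)
    # No prior knowledge of the capture positions: start at random.
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    vel = [[0.0, 0.0] for _ in range(n)]
    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                length = math.hypot(dx, dy) or 1e-9
                rest = dist[i][j]  # natural length = descriptor distance
                # ASSUMPTION: placeholder stiffness rule, capped at k_max;
                # the paper's own expression is not reproduced here.
                k = min(k_max, 1.0 / max(rest, 1e-9))
                f = k * (length - rest)  # positive pulls the pair together
                fx, fy = f * dx / length, f * dy / length
                force[i][0] += fx
                force[i][1] += fy
                force[j][0] -= fx
                force[j][1] -= fy
        for i in range(n):
            for c in range(2):
                acc = (force[i][c] - damping * vel[i][c]) / mass
                vel[i][c] += acc * dt  # semi-implicit Euler step
                pos[i][c] += vel[i][c] * dt
    return pos
```

The resulting layout is defined only up to a rotation, a reflection, and a translation, so it should be compared with the ground truth through its pairwise distances rather than through absolute coordinates.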
After an exhaustive set of experiments, the best results have been obtained with the gist descriptor with 16 blocks, 16 orientations, and the correlation distance, and with the FS descriptor with 32 blocks and the correlation distance. These results are in line with Section 4.3. HOG has not provided good results, as Figure 10 suggested.
Figures 12, 13, and 14 show some of the topological maps created in three different rooms of data set 2 (hall, laboratory, and corridor, resp.). We show the results of these rooms because they have different grid sizes. In each room, several sets of images, with different sizes and distributions over the space of the room, have been chosen. Then, the mapping algorithm has been applied. The final distribution of each map is shown. In these maps, the lines are drawn only for illustrative purposes (when the algorithm starts, it has no information about the initial positions or about the vicinity relations).
The figures show that, despite the different grid size, relatively good results are achieved in all cases.This way, global-appearance descriptors prove to be a good choice for the creation of topological maps where the concepts of closeness and farness are included.
Regarding the comparison with feature-based techniques, in a previous work [34], a new global-appearance description method was proposed and a preliminary comparison with a classical global-appearance method (the Fourier Signature) and a feature-based method (SIFT features) was carried out. The results showed that global-appearance descriptors are robust in the localization process and that their computational cost is relatively low, improving on the performance of local feature descriptors.
To finish, Table 4 shows a final comparison of the performance of the four methods in mapping tasks. First, to compare the computational cost, the table shows the minimum and the maximum time necessary to include each image in the model. Second, to study the relation between the image distance and the geometric distance, a least squares linear fit has been carried out with all the curves in Figures 7-11. In all cases, the origin has been weighted to ensure that the resulting line passes through it. The table shows the results of the best fit: the slope, the coefficient $R^2$, and the values of the parameters.
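A minimal sketch of such a fit is given below. It simplifies the procedure described above: instead of weighting the origin, it forces the line through the origin exactly, which has a closed-form solution. The function name is hypothetical, and the coefficient of determination is computed about the mean of the image distances, which is one of several conventions for fits without an intercept.

```python
def fit_through_origin(x, y):
    """Least-squares line y = slope * x forced through the origin.

    x: geometric distances, y: image distances (same length).
    Returns the slope and a coefficient of determination.
    """
    slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, 1.0 - ss_res / ss_tot
```

A coefficient close to 1 indicates that the image distance grows almost linearly with the geometric distance over the fitted interval.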

Conclusion and Future Works
This paper has focused on the study of the mapping problem. It has been addressed from a topological point of view, using the information provided by an omnidirectional vision sensor to build the model, and methods based on global appearance to extract relevant information from the scenes. The work has carried out a comparative evaluation of some well-known description methods in map building tasks.
The main contributions of the paper include an exhaustive study of visual appearance techniques (FS, PCA, HOG, and gist) and the adaptation of some of these algorithms to store position and orientation information from panoramic scenes. Also, the computational cost to build the nodes of the map has been studied, including the influence of the most relevant parameters. This study has revealed that FS, HOG, and gist present a reasonable computational cost and, from this point of view, their use could be feasible in real time applications. Besides this, the performance of the descriptors has been tested in mapping tasks. First, we have focused on the relation between the image distance and the geometric distance, which reveals the descriptors that best reflect an idea of closeness and farness, two important concepts to capture in the map. All the description methods have been tested along with several distance measures, and the results have shown that the gist and FS descriptors present positive results with certain distance measures. Second, a mass-spring-damper method has been implemented to build topological maps, its parameters have been tuned, and several experiments have been carried out. To finish, several topological maps have been built, including not only connectivity but also closeness and farness concepts. The results have shown the quality of the mapping approach and of the parameter tuning.
These results have demonstrated that global-appearance methods are a feasible approach to solve the mapping task. Thanks to them, the robot can build a model of the environment that goes beyond the classical topological maps, since the model is a version of the original grid except for a scale factor. This suggests that the model could be used to estimate the position and orientation of the robot in the environment accurately and with computational efficiency. This fact may have interesting implications in future developments in the field of mobile robotics. As an example, this concept can be used to build hybrid maps that arrange the information in several layers with different accuracy: a high level layer that permits carrying out a rough and quick localization and a lower layer that contains information with geometric accuracy and allows the robot to refine the estimation of its position. Global-appearance methods can be used on their own or in conjunction with feature-based techniques to develop algorithms that address these problems efficiently.
All these facts encourage us to study this framework in greater depth. To build a fully autonomous mapping and localization system, several future works should be considered. First, the image collection process could be automated to obtain an optimal representation of the environment. Second, this model must be used to estimate the current position and orientation of the robot, taking into account typical situations such as changes in lighting conditions or visual occlusions. Finally, both processes could be integrated in a topological SLAM system that carries out both the model creation and the localization from scratch. To optimize these algorithms, we also consider carrying out a complete comparison between global-appearance and feature-based techniques as a future work.

Figure 3: (a) Bird's eye view of the environment where set 1 was captured. (b) Catadioptric system mounted on the robot and sample scene captured in the corridor (omnidirectional and panoramic formats).

Figure 4: Computational cost to obtain the nodes' descriptors using (a) Fourier Signature and (b) rotational PCA.

Figure 6: Computational cost to obtain the nodes' descriptors using gist.

Figure 12: Topological maps created in the hall, data set 2. The grid size is 10 × 10 cm.

Figure 13: Topological maps created in the laboratory, data set 2. The grid size is 30 × 30 cm.

Figure 14: Topological maps created in the corridor, data set 2. The grid size is 50 × 50 cm.

Table 1: Parameters to be tuned in each description method.

Table 2: Contents of the map, relative to localization and orientation estimation, per image included in the model.

Table 3: Image set 2: rooms considered in the experiments and main parameters.

Table 4: Performance of the description methods: computational cost per image to build the model and best linear fit of the image distance versus the geometric distance.