Intrinsic Dimension Estimation : Relevant Techniques and a Benchmark Framework

When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension and nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widespread used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant stateof-the-art estimators.


Introduction
Since the 1950s, the rapid pace of technological advances allows us to measure and record increasing amounts of data, motivating the urgent need to develop dimensionality reduction systems to be applied on real datasets comprising high-dimensional points.
To this aim, a fundamental information is provided by the intrinsic dimension (id) defined by Bennett [1] as the minimum number of parameters needed to generate a data description by maintaining the "intrinsic" structure characterizing the dataset, so that the information loss is minimized.
More recently, a quite intuitive definition employed by several authors in the past has been reported by Bishop in [2], p. 314, where the author writes that "a set in  dimensions is said to have an id equal to  if the data lies entirely within a -dimensional subspace of R  ." Though more specific and different id definitions have been proposed in different research fields [3][4][5], throughout the pattern recognition literature the presently prevailing id definition views a point set as a sample set uniformly drawn from an unknown smooth (or locally smooth) manifold structure, eventually embedded in a higher dimensional space through a nonlinear smooth mapping; in this case, the id to be estimated is the manifold's topological dimension.
Due to the importance of id in several theoretical and practical application fields, in the last two decades a great deal of research effort has been devoted to the development of effective id estimators.Though several techniques have been proposed in the literature, the problem is still open for the following main reasons.
At first, it must be highlighted that though Lebesgue's definition of topological dimension [6] (see Section 3.2) is quite clear, in practice its estimation is difficult if only a finite set of points is available.Therefore, id estimation techniques proposed in the literature are either founded on different notions of dimension (e.g., fractal dimensions, Section 3.2.1)approximating the topological one or on various techniques aimed at preserving the characteristics of data-neighborhood distributions, which reflect the topology of the underlying manifold.Besides, the estimated id value markedly changes as the scale used to analyze the input dataset changes [7] (see Figure 1: At very small scales (a) the dataset seems zero-dimensional; in this example, when the resolution is increased until including all the dataset (b) the id looks larger and seems to equal the embedding space dimension; the same effect happens when it is estimated at noise level (d); the correct id estimate is obtained at an intermediate resolution.
an example in Figure 1), and with the number of available points being practically limited, several methods underestimate id when its value is sufficiently high (namely, id ⩾ 10).Other serious problems arise when the dataset is embedded in higher dimensional spaces through a nonlinear map.Finally, the too high computational complexity of most estimators makes them unpractical when the need is to process datasets comprising huge amounts of high-dimensional data.
In this work, after recalling the application domains of interest, we survey some of the most interesting, widespread used, and advanced id estimators.Unfortunately, since each method has been evaluated on different datasets, it is difficult to compare them by solely analyzing the results reported by the authors.This highlights the need of a benchmark framework, such as the one proposed in this work, to objectively assess and compare different techniques in terms of robustness with respect to parameter settings, highdimensional datasets, datasets characterized by a high id, and noisy datasets.
The paper is organized as follows: in Section 2 the usefulness of the id knowledge is motivated and interesting id application domains profitably exploiting it are recalled; in Section 3 we survey notable state-of-the-art id estimators, grouping them according to the employed methods; in Section 4 we summarize mostly used experimental settings, we propose a benchmark framework, and we employ it to objectively assess and compare relevant id estimators; in Section 5 conclusions and open research problems are reported.

Application Domains
In this section we motivate the increasing research interest aimed at the development of automatic id estimators, and we recall different application contexts where the knowledge of the id of the available input datasets is a profitable information.
In the field of pattern recognition, the id is one of the first and fundamental pieces of information required by several dimensionality reduction techniques [8][9][10][11][12], which try to represent the data in a more compact, but still informative, way to reduce the "curse of dimensionality" effects [13].
Furthermore, when using an autoassociative neural network to perform a nonlinear feature extraction, the id value  can suggest a reasonable value for the number of hidden neurons [14].Indeed, a network with a single hidden layer of neurons with linear activation functions has an error function with a unique global minimum and, at this minimum, the network performs a projection on the subspace spanned by the first  principal components [15] estimated on the dataset (see 8.6.2 of [2]), with  being the number of hidden neurons.Besides, according to statistical learning theory [16], the capacity and generalization capability of a given classifier may depend on the id.More specifically, in the particular case of linear classifiers where the data are drawn from a manifold embedded through an identical map, the Vapnik-Chervonenkis (VC) dimension of the separation hyperplane is  + 1 (see [16], pp.156-158).Since the generalization error depends on the VC dimension, it follows that the generalization capability may depend on the id value .Moreover, in [17] the authors mark that, in order to balance a classifier generalization ability and its empirical error, the complexity of the classification model should also be related to the id of the available dataset.Furthermore, since complex objects can be considered as structures composed by multiple manifolds that must be clustered to be processed separately, the knowledge of the local ids characterizing the considered object is fundamental to obtain a proper clustering [18].
These observations motivate applications employing global or local id estimates to discover some structure within the data.In the following we summarize or simply recall some interesting examples [19][20][21][22][23][24][25].
In [19] the authors introduce a fractal dimension estimator, called correlation dimension (CD) estimator (see Section 3.2.1),and show that the id estimate it computes is a reliable approximation of the strange attractor dimension in chaotic dynamical systems.
In the field of gene expression analysis, the work proposed in [20] shows that the id estimate computed by the nearest neighbor estimator (described in [26] and Section 3.2.2) is a lower bound for the number of genes to be used in supervised and unsupervised class separation of cancer and other diseases.This information is important since generally used datasets contain large number of genes and the classification results strongly depend on the number of genes employed to learn the separation criteria.
In [21], the authors show that id estimation methods being derived from the basis theory of fractal dimensions ( [7,19,27], see Section 3.2.1)can be successfully used to evaluate the model order in signals and time series, which is the number of past samples required to model the time series adequately and is crucial to make reliable predictions.This comparative work employs fractal dimension estimators, since the domain of attraction of nonlinear dynamic systems has a very complex geometric structure, which could be captured by closely related studies on fractal geometry and fractal dimensions.
A noteworthy research work in the field of crystallography [22] employs the fractal CD estimator [19] followed by a correction method [28] that, according to the authors, "is needed because the CD estimator, to give correct estimations of the id, requires an unrealistically large number of points."Anyway, the experimental results show that id is a useful information to be exploited when analyzing crystal structures.This study not only proves that id estimates are especially useful when dealing with practical tasks concerning real data, but also underlines the need to compute reliable estimates on datasets drawn from manifolds characterized by high id and embedded in spaces of much greater dimensionality.
The work of Carter et al. [23] is very interesting and notable because it is one of the first considering that the input data might be drawn from a multimanifold structure, where each submanifold has a (possibly) different id.To separate the manifolds, the authors compute local id estimates, by applying both a fractal dimension estimator (namely, MLE [27]; see Section 3.2.1)and a nearest neighbor-based estimator (described in [29,30]; see Section 3.2.2) on properly defined data neighborhoods.The authors then show that the computed local ids might be helpful for the following interesting applications: (1) "Debiasing global id estimates": the negative bias caused both by the limited number of available sample points and by the curse of dimensionality is reduced by computing global id estimates through a weighted average of the local ones, which assign greater importance to the points away from the boundaries.However the authors themselves note that this method is only applicable for data with a relatively low id, since in high dimensions the points lie nearby the boundaries [31].(2) "Statistical manifold learning": the local id estimates are used to reduce the dimension of statistical manifolds [32], that is, manifolds whose points represent a pdf.When this step is applied as the first step of document classification applications, and analysis of patients' samples acquired in the field of flow cytometry, it allows us to obtain lower dimensional points showing a good class separation.(3) "Network anomaly detection": considering that the overall complexity of a router network is decreased when few sources account for a disproportionate amount of traffic, a decrease in the id of the entire network is searched for.(4) "Clustering": problems of data clustering and image segmentation are dealt with by assuming that different clusters and image patches belong to manifold structures characterized by different complexity (and ids).
In [24], to the aim of analyzing gene expression time series, the authors compute id estimates by comparing the fractal CD estimator and the nearest neighbor (NN) estimator [26].The results on both simulated and real data show that NN seems to be more robust than CD with respect to nonlinear embedding and the underlying time-series model.
In the field of geophysical signal processing, hyperspectral images, whose pixels represent spectra generated by the combination of an unknown set of independent contributions, called endmembers, often require estimating the number of endmembers.To this aim, the proposal in [25] is to substitute state-of-the-art algorithms specifically designed to solve this task with id estimators.After motivating the idea by describing the relation between the id of a dataset and the number of endmembers, the authors choose to experiment two fractal id estimators [7,19] and a nearest neighborbased one [33].They obtain the most reliable results with the latest one after opportunely tuning the number of nearest neighbors to be considered.
Finally, other noteworthy examples of research works that profitably exploit id and estimate it by usually applying fractal dimension estimators concern financial time series prediction [34], biomedical signal analysis [35][36][37], analysis of ecological time series [38], radar clutter identification [39], speech analysis [40], data mining and low dimensional representation of (biomedical) time series [41], and plant traits representation [42].

Intrinsic Dimension Estimators
In this section we survey some of the most notable, recent, and effective state-of-the-art id estimators, grouping them according to the main ideas they are based on.
Specifically, in Section 3.1 we describe projective id estimators, which basically process a dataset P  ≡ {p  }  =1 ⊆ R  to identify a somehow appealing lower dimensional subspace where to project the data; the space dimension of the identified subspace is viewed as the id estimate.
More recent projective id estimators exploit the assumption of datasets P  ≡ {p  }  =1 ⊆ R  being uniformly drawn from a smooth (or locally smooth) manifold M ⊆ R  , embedded into a higher -dimensional space through a nonlinear map; this is also the basic assumption of all the other groups of methods that will be referred to as topological id estimators (see Section 3.2) and graph-based id estimators (see Section 3.3).
We note that the taxonomy we are using to group the reviewed methods is different from the one, commonly used by several authors in the past (as an example, see [43]), that viewed methods as global, when id estimation is performed by considering a dataset as a whole, or local, when all the data neighborhoods are analyzed separately and an estimate is computed by combining all the local results.All the recent methods have abandoned the global approach since it is now clear that analyzing a dataset at its biggest scale cannot produce reliable results.They thus estimate the global id by somehow combining local ids.This way of proceeding comes from the assumption that the underlying manifold is locally smooth.

Projective id Estimators.
The first projective id estimators introduced in the literature explicitly compute the mapping that projects input points P  ∈ R  to the subspace M ⊆ R  minimizing the information loss [27,43] and therefore view the id as the minimal number of vectors linearly spanning the subspace M. It must be noted that, since these methods were originally designed for exploratory data analysis and dimensionality reduction, they often require the dimensionality of M (the id to be estimated) to be provided as input parameter.However, their extensions and variants include methodologies to automatically estimate id.
Most of the projective id estimators can be grouped into two main categories: projection techniques based on Multidimensional Scaling (MDS) [44,45] or its variants, which tend to preserve as much as possible pairwise distances among the data, and projection techniques based on Principal Component Analysis (PCA) [15,46] and its variants that search for the best projection subspace M minimizing the projection error.Some of the best known examples of MDS algorithms are MDSCAL [47-52], Bennett's algorithm [1,53], Sammon's mapping [54], Curvilinear Component Analysis (CCA) [55], ISOMAP [56], and Local Linear Embedding (LLE) [57].As shown by experiments reported in [56,57] ISOMAP and variants of LLE compute the most reliable id estimates.We believe that their better performance is due to the fact that both ISOMAP and LLE have been the first projective methods based on the assumption that the input points are drawn from an underlying manifold, whose curvature might affect the precision of data neighborhoods computed by employing the Euclidean distance.However, as noted in [58,59], these algorithms have shown that they suffer of all the major drawbacks affecting MDS-based algorithms, which are too much tied by the preservation of the pairwise distance values.Besides, as highlighted in [30], ISOMAP as an id estimator, as well as other spectral based methods like PCA, relies on a specific estimated eigenstructure that may not exist in real data.Regarding LLE, it either requires the id value to be known in advance or may automatically estimate it by analyzing the eigenvalues of the data neighborhoods [60]; however, as outlined in [7,27], id estimates computed by means of eigenvalue analysis are as unreliable as those computed by most PCA-based approaches.Moreover, in [46] it is noted that methods such as LLE are based on the solution of a sparse eigenvalue problem under the unit covariance constraint; however, due to this imposed constraint, the global shape of the embedded data cannot reflect the underlying manifold.
PCA [15,46] is one of the most cited and well-known projective id estimators, often used as the first step of several pattern recognition problems, to compute low dimensional representations of the available datasets.When PCA is used for id estimation, the estimate is the number of "most relevant" eigenvectors of the sample covariance matrix, also called principal components (PCs).Due to the promising dimensionality reduction results, several PCA-based approaches, both deterministic and probabilistic, have been published.
Among deterministic approaches, we recall the Kernel PCA (KPCA) [61] and the local PCA (LPCA) [62] and its extensions to automatically select the number of PCs [63,64].We observe that the work presented in [64] is one of the first works that estimates id by considering an underlying topological structure and therefore applies LPCA on data neighborhoods represented by an Optimally Topology Preserving Map (OTPM) built on clustered data (given an input dataset P  , its OTPM is usually computed through Topology Representing Networks (TRNs); these are unsupervised neural networks [65] developed to map P  to a set of neurons whose learnt connections define proximities in P  .These proximities correspond to the optimal topology preserving Voronoi tessellation and the corresponding Delaunay triangulation.In other words, TRNs compute connectivity structures that define and perfectly preserve the topology originally present in the data, forming a discrete path-preserving representation of the inner (topological) structure of the topological manifold underlying the dataset P  ).The authors of this method state that their approach is more efficient and less sensitive to noise with respect to the PCA-based approaches.However, they do not show any experimental comparison and, besides, their algorithm employs critical thresholds and a data clustering technique whose result heavily influences the precision of the computed estimate [27].
The usage of a probabilistic approach has been firstly introduced by Tipping and Bishop in [66].Considering that deterministic methodologies lack an associated probabilistic model for the observed data, their Probabilistic PCA (PPCA) reformulates PCA as the maximum likelihood solution of a specific latent variable model.PPCA and its extensions to both mixture and hierarchical mixture models have been successfully applied to several real problems, but they still provide an id-estimation mechanism depending on critical thresholds.This motivates its subsequent variants [67] and developments, whose examples are Bayesian PCA (BPCA) [68] and two Bayesian model order selection methods introduced in [69,70].In [71] the asymptotic consistency of id estimation by (constrained) isotropic version of PPCA is shown with numerical experiments on simulated and real datasets.
While the aforementioned methods have been simply recalled since their id estimation results have shown to be unreliable [7,27], in the following recent and promising proposals are described with more details.
The Simple Exponential Family PCA (SePCA) [72] has been developed to overcome the assumption of Gaussiandistributed data that makes it difficult to handle all types of practical observations, for example, integers and binary values.SePCA achieves promising results by using exponential family distributions; however, it is highly influenced by critical parameter settings and it is successful only if the data distribution is known, which is often not the case, specially when highly nonlinear manifold structures must be treated.
In [73] the authors propose the Sparse Probability PCA (SPPCA) as a probabilistic version of the Sparse PCA (SPCA) [74].Precisely, SPCA selects id by forcing the sparsity of the projection matrix that is the matrix containing the PCs.However, based on the consideration that the level of sparsity is not automatically determined by SPCA, SPPCA employs a Bayesian formulation of SPCA, achieving sparsity by employing a different prior and automatically learning the hyperparameter related to the constraint weight through Evidence Approximation ([75] Section 3.5).The authors' results and also the results of the comparative evaluation proposed in [76] show that this method seems to be less affected by the problems of the aforementioned projective schemes.
An alternative method (MLSVD) [77] applies Singular Value Decomposition (SVD), basically a variant of PCA, locally and in a multiscale fashion to estimate the id characterizing -dimensional datasets drawn from nonlinearly embedded -dimensional manifolds M corrupted by Gaussian noise.Precisely, exploiting the same ideas of the theoretical PCAbased id estimator presented in [78], the authors note that the best way to avoid the effects of the curvature (induced by the nonlinearity of the embedding) is to apply SVD locally, that is, in hyperspheres B(p, ) centered on the data points p and having radius .However, the choice of  is constrained by the following considerations: (1)  must be big enough to have at least  ≥  neighbors, (2)  must be small enough to ensure that M∩B is linear (or at least smooth), and (3)  must be big enough to ensure that the effects of noise are negligible.When these three constraints are met, the tangent space   M (p, ), computed by applying SVD on the  neighbors, is a good approximation of the tangent space of M ∩ B and the number of its relevant eigenvalues corresponds to the (local) id of M. To find a proper value for , the authors propose a multiscale approach that applies SVD on neighborhoods B(p,   ) whose radius varies in a range   ∈ {  , . . .,   }.This allows us to compute  scale-dependent, local singular values  1 (p,   ) ≥ ⋅ ⋅ ⋅ ≥   (p,   ); using a least squares fitting procedure the SVs can be expressed as functions of  whose analysis allows us to identify the range of scales [ min , . . .,  max ] not influenced by either noise or curvature.Finally, in the range   = [ min , . . .,  max ] the squared SVs are analyzed to get the id estimate d that maximizes the gap Δ() =   (p,   ) −  +1 (p,   ) for the largest range of   .The proposed algorithm has been evaluated on unit -dimensional hyperspheres and cubes embedded in R 100 and affected by Gaussian noise.The reported results are very good, while other ten well-known methods [19,23,27,30,[79][80][81] show that the ids estimated on the same datasets are unreliable also in the absence of noise.

Topological Approaches.
Topological approaches for id estimation consider a manifold M ⊆ R  embedded in a higher dimensional space R  through a proper (locally) smooth map  : M → R  and assume that the given dataset is , where x  are independent identically distributed (i.i.d.) points drawn from M through a smooth probability density function (pdf)  : Under this assumption the id to be estimated is the manifold's topological dimension, defined either through the firstly proposed Brouwer Large Inductive Dimension [82] or the equivalent Lebesgue Covering Dimension [83].Since Brouwer's definition has been soon neglected by mathematicians for its difficult proof [83], the commonly adopted topological dimension definition is Lebesgue's Covering Dimension, reported in the following.
Definition 2 (refinement of a cover).A refinement of a cover C of a set Y is another cover C  such that each set in C  is contained in some sets of C.

Definition 3 (topological dimension (Lebesgue Covering Dimension))
. Given the aforementioned definitions, the topological dimension of the topological space X, also called Lebesgue Covering Dimension, is  if every finite cover of X admits a refinement C  such that no subset of X has more than  + 1 intersecting open sets in C  .If no such minimal integer value exists, X is said to be of infinite topological dimension.
To our knowledge, at the state of the art only two estimators have been explicitly designed to estimate the topological dimension.
One of them, the Tensor Voting Framework (TVF) [84] and its variants [85], relies on the usage of an iterative information diffusion process based on Gestalt principles of perceptual organization [86].TVF iteratively diffuses local information describing, for each p  ∈ P  , the tangent space approximating the underlying neighborhood of M. To this aim, the information diffused at each iteration is secondorder symmetric positive definite tensors whose eigenvectors span the local tangent space.Practically, during the initialization step a ball tensor T 0  , which is an identity matrix representing the absence of orientation, is used to initialize a token   for each point p  as {  = (p  , T 0  )}  =1 .During iteration  each token   "generates" the set of tensors {T  , }  ̸ = that enact as votes cast to neighboring tokens; precisely, T  , is sent to the th neighbor, and it encodes information related to the local tangent space estimate in p  at time  and decays as the curvature and the distance from the th neighbor increase.On the other side, at iteration  each token   receives votes that are summed to update the   's tensor as T +1  = ∑  ̸ = T  , ; this essentially refines the estimate of the local tangent space in p  .Based on the definition of topological dimension provided by Brouwer [82], in [87] it is noted that TVF can be employed to estimate the local ids by identifying the number of most relevant eigenvalues of the computed second-order tensors.Although interesting, this method has a too high computational cost, which makes it unfeasible for spaces of dimension  ≥ 4.
From the definition of Lebesgue Covering Dimension it can be derived [88] that the topological dimension of any M ⊆ R  coincides with the affine dimension  of a finite simplicial complex (a simplicial complex in R  has affine dimension  if it is a collection of affine simplexes in R  , having at most dimension  or having at most  + 1 vertices) covering M.This essentially means that a dimensional manifold could be approximated by a collection of -dimensional simplexes (each having at most  + 1 vertices); therefore, the topological dimension of M could be practically estimated by analyzing the number of vertices of the collection of simplexes estimated on P  .To this aim, in [89] a method is proposed to find the number of relevant positive coefficients that are needed to reconstruct each p  ∈ P  from a linear combination of its  neighbors, where  is a parameter to be manually set in the range  <  ≤  + 1.This algorithm is based on the fact that neighbors with positive reconstruction coefficients are the vertices of a simplex with dimension equal to the topological dimension of M. Practically, to ensure that  > , its value is set to , the reconstruction coefficients are calculated by means of an optimization problem constrained to be nonnegative, and the coefficients bigger than a user-defined threshold are considered as the relevant ones.The id estimate is then computed by employing two alternative approaches: the first one simply computes the mode of the number of relevant coefficients for each neighborhood and the second one sorts in descending order the coefficients computed for each neighborhood, computes the mean c of the sorted coefficients, and estimates id as the number of relevant values in c.Note that, since  > , this method is strongly affected by the curvature of the manifold when the id is big enough.Indeed, the results of the reported experimental evaluation make the authors assert that the method works well only on noisy-free data of low id (id ≤ 6), under the assumption that the sampling process is uniform and the data points are sufficient.
Though interesting, both approaches have shown to be effective only for manifolds of low curvature as well as low id values.
In the following we survey other id estimators employing two different estimation approaches, which allow us to categorize them.More precisely, in Section 3.2.1 fractal id estimators are described, which estimate different fractal dimensions since they are good approximations of the topological one; in Section 3.2.2nearest neighbors-based () id estimators are recalled, which are often based on the statistical analysis of the distribution of points within small neighborhoods.

Fractal id Estimators.
Since topological dimension cannot be practically estimated, several authors implicitly assume that M has a somehow fractal structure (see [90] for an exhaustive description of fractal sets) and estimate id by employing fractal dimension estimators, the most relevant of which are surveyed in this section.
Very roughly, since the basis concept of all fractal dimensions is that the volume of a -dimensional ball of radius  scales with its size  as   [90,91], all fractal dimension estimators are based on the idea of counting the number of observations in a neighborhood of radius  to (somehow) estimate the rate of growth of this number.If the estimated growth is   , then the estimated fractal dimension of the data is considered to be equal to .
Note that all the derived estimators have the fundamental limitation that, in order to get an accurate estimation, the size  of the dataset with id  has to satisfy the inequality proved by Eckmann and Ruelle in [92] for the correlation dimension estimator (CD [19], see below): (1) This will lead to a large value of , even for a dataset with lower id.Among fractal dimension estimators, one of the most cited algorithms is presented in [19] and will be referred to as CD in the following.It is an estimator of the correlation dimension (dim Corr ), whose formal definition is as follows.
Definition 4 (correlation dimension).Given a finite sample set P  , let where ‖ ⋅ ‖ is the Euclidean norm and (⋅) is the step function used to simulate a closed ball of radius  centered on each p  (() = 0 if  < 0, and () = 1 otherwise).Then, for a countable set, the correlation dimension dim Corr is defined as In practice CD computes an estimate, d, of dim Corr by computing   () for different   and applying least squares to fit a line through the points (log   ; log   (  )).It has to be noted that, to produce correct id estimates, this estimator needs a very large number of data points [22], which is never available for practical applications; however, the computed unreliable estimations can be corrected by the correction method proposed in [28].
The relevance of the CD estimator is shown by its several variants and extensions.An example is the work proposed in [91], where the authors propose a normalized CD estimator for binary data and achieve estimates approximating those computed by CD.
Since CD is heavily influenced by the setting of the scale parameters, in [93] the authors estimate the id by computing the expectation value of dim Corr through maximum likelihood estimate of the distribution of distances among points.The estimated d is computed as where  is the set  = {  |   < } and   is the Euclidean distance between two generic data points and  is a real value, called cut-off radius.
To develop an estimator more efficient than CD, in [94] the authors choose a different notion of Fd, namely, the Information Dimension dim  : where N() is the minimum number of -sized hypercubes covering a topological space and   is the probability of finding a point in the th hypercube.Noting that when the scale  in ( 5) is big enough the different coverings used to estimate dim  could produce different values for N(), the authors look for the covering composed by the minimum number N min () of nonempty sets.Similar to the CD algorithm, the id is the average slope of the curve obtained by fitting the points with coordinates (log ; ∑ N min () =1 log   ).This algorithm is compared with the CD estimator, and the experimental tests show that both methods compute the same estimates.However, the achieved computation time is much lower than that of CD.
Considering that CD can severely underestimate the topological dimension if the data distribution on the manifold is nearly nonuniform, in [7] the author proposes the Packing Number (PN), a fractal dimension estimator that approximates the Capacity Dimension (dim Cap ).This choice is motivated by the fact that dim Cap does not depend on the data distribution on the manifold and, if both dim Cap and the topological dimension exist (which is certainly the case in machine learning applications), the two dimensions agree.To formally define dim Cap , the -covering number N() of a set S ⊂ X must be defined; N() is the minimum number of open balls B(x 0 , ) = {x ∈ X : ‖x 0 −x‖ < } whose union is a covering of S, where ‖ ⋅ ‖ is a distance metric.The definition of dim Cap of S ⊂ X is based on the observation that the covering number N() of a -dimensional set is proportional to  − : Since the estimation of N() is computationally expensive, based on the relation N() ≤ N Pack () ≤ N(/2), the authors employ the -Packing Number N Pack (), defined in [95] as the maximum cardinality of an -separated set.Employing a greedy algorithm to compute N Pack (), the estimate, d, of dim Cap is computed as To estimate d a greedy algorithm is used; however, as noted by the author, the dependency of d with respect to the order in which the points are visited by the greedy algorithm introduces a high variance.To avoid this problem, the algorithm iterates the procedure  times on random permutations of the data and considers the average as the final id estimate.The comparative evaluation with the CD estimator makes the authors assert that PN "seems more reliable if data contains noise or the distribution on the manifold is not uniform."Unfortunately, also this method is scale-dependent.
To avoid any scale-dependency in [79] the authors propose an estimator (Hein) based on the asymptotes of a smoothed version of (2), obtained by replacing the step function (⋅) with a suitable kernel function.Precisely, they define where  ℎ is a kernel function with bandwidth ℎ and  is the assumed dimensionality of the manifold from which the points are sampled.Note that, to guarantee the converge of ( 8), the bandwidth ℎ has to fulfill the constraint lim  → ∞ (ℎ  ) = ∞.For this reason the authors formalize ℎ as a function of  and, to achieve scale-independency, propose a method that estimates the id by analyzing the convergence of (, ℎ, ) when varying the parameters  and .
This work is notable since the empirical tests are performed on synthetic datasets specifically designed to study the influence of high curvature as well as noise on the proposed estimator.The usefulness of these datasets is confirmed by the fact that they have been also employed to assess several subsequent methods [76,96].
In [97] the authors present a fractal dimension estimator derived by the analysis of a vector quantizer applied to datasets P  ⊆ R  .Considering the codebook Y = {y 1 , . . ., y  } ⊂ R  containing  code-vectors y  , a k-point quantizer is defined by a measurable function   : R  → Y, which brings each data point to one of the code-vectors in Y.This partitions the dataset into  so-called quantizer cells S  = {p  ∈ P  :   (p  ) = y  }, where log 2 () is called the rate of the quantizer.Being X a random vector distributed according to a probability distribution ], the quantization error is When the quantizer rate is high, the quantizer cells can be well approximated by -dimensional hyperspheres with radius equal to  and centered on each code-vector y  ∈ Y.In this case, the regularity of ] ensures that the probability of such balls is proportional to  1/ , and it can be shown [98] that  *  (  | ]) ≈  −1/ .This is referred to as the high-rate approximation and motivates the definition of quantization dimension of order : The theory of high-rate quantization [98] confirms that, for a regular ] supported on the manifold M,   (]) exists for each 1 ≤  ≤ ∞ and equals the intrinsic dimension of M. Furthermore, the limit  → ∞ allows us to motivate the relation between the quantization dimension and the Capacity Dimension.Indeed, according to the theory of high-rate quantization [98,99], there exists a decreasing sequence {  }, such that for sufficiently large values of  (i.e., in the high-rate regime that is when  → ∞) the ratio −(log )/(log  *  ( | ])) can be approximated increasingly finely, both from below and from above, by quantities converging to the common value dim Cap .To practically compute an estimate of the quantization dimension, having fixed the value of , the authors select a range  1 ≤  ≤  2 of codebook sizes and design a set of quantizers giving good approximations ê ( | ]) of  *  ( | ) over the chosen range of .An id estimate is obtained by fitting the points with coordinates (log(); − log ê ( | ])) and measuring the average slope over the chosen range .Though the authors mention that their algorithm is less affected by underestimation biases than neighborhood-based methods (see Section 3.2.2), in [23] this statement is confuted with theoretical arguments.

Nearest Neighbors-Based id Estimators.
In this section we consider estimators, referred to as  estimators in the following, that describe data-neighborhoods' distributions as functions of .They usually assume that close points are uniformly drawn from small -dimensional balls (hyperspheres) B  (x, ) having radius (a small radius  → 0 ∈ R + guarantees that samples included into B  (x, ), being less influenced by the curvature induced by the map , are approximating well enough the intrinsic structure of the underlying portion of M)  → 0 ∈ R + and being centered on x ∈ M.
Practically, given an input dataset P  , the value of functions () is computed by approximating the sampling process related to B  through the -nearest neighbor algorithm (kNN).
Among  id estimators, Trunk's method [100] is often cited as one of the first methods.It formulates the distribution function, (), with an ad hoc statistic based on geometric considerations concerning angles; in practice, having fixed a threshold  and a starting value for the parameter , it applies kNN to find the neighbors of each p  ∈ P  and calculates the angle ]  between the ( + 1)th-nearest neighbor and the subspace spanned by the -nearest neighbors.Considering a threshold parameter , if (1/) ∑  =1 ]  ≤ , then  is considered as the id estimate; otherwise,  is incremented by 1 and the process is repeated.The main limitation of this method is the difficult choice of a proper value for .
The work presented by Pettis et al. [26] is notable since it is one of the first works providing a mathematical motivation for the use of nearest-neighbor distances.
Indeed, for an i.i.d.sample P  ⊆ R  drawn from a density distribution (x) in R  , the following approximation holds: where  is the number of nearest neighbors to x within the hypersphere B  (x, ) of radius  and centered on x and () is the volume of the (unit -dimensional) ball in R  .This means that the proportion of sample points falling in B  (x, ) is roughly approximated by (x) times the volume of B  (x, ).Since this volume grows as   , assuming the density () to be a constant, it follows that the number of samples in B  (x, ) is proportional to   .From the relationship in (11), and assuming that the samples are locally uniformly distributed, the authors derive an id estimator for : where   is the average of the distances from each sample point to its th nearest neighbors; defining  ()  as the distance between x  and its th-nearest neighbor,   is expressed as .Since this algorithm is limited by the choice of a suitable value for parameter , in [63] the authors propose a variant which considers a range of neighborhood sizes [ min ,  max ].However, in the same work the authors themselves show that this technique generally yields an underestimate of the id when its value is high.
Taking into account relation (11), in [101] the number  B  of data points in B  (x, ) is described by a polynomial () = ∑  =0     of degree .In practice, considering p  , p  ∈ P  , calling   = ‖p  − p  ‖ the interpoint distances, and being  = min  ,=1   ,  a parameter adaptively estimated (to estimate  by means of P  , the radius value corresponding to the first significant peak of the histogram of the   s is found), a set of  radius values r = {  = +(−)/}  =1 is selected and used to calculate  pairs {(  , f(  ))}  =1 , where f( is the number of interpoint distances strictly lower than   .To estimate the coefficients {  }  =1 , the computed pairs are fit by a least squares fitting procedure that estimates exactly  + 1 coefficients.Since by hypothesis the degree of  is , the significance test described in [17] is used to estimate the degree d of f, which is considered as the id estimate.The comparative evaluation of this algorithm with the well-known Maximum Likelihood Estimator (MLE) [27] and its improved version [102], both described below, has shown that it is more robust than them when dealing with high-dimensional datasets.MLE [27], one of the most cited estimators, treats the neighbors of each point p  ∈ P  as events in a Poisson process and the distance  () (p  ) between the query point p  and its th nearest neighbor as the event's arrival time.Since this process depends on , MLE estimates id by maximizing the log-likelihood of the observed process.In practice a local id estimate is computed as Averaging the d(p  , )s, the global id estimate is The theoretical stability of the proposed id estimator for data living in  1 submanifold of R  ,  ≤ , and for data in an affine subspace of R  has been proved, respectively, in [103,104].Though the authors' comparative evaluation shows the superior performance of the proposed estimator with respect to the CD estimator [19] (see Section 3.2.1)and the NN estimator [26], they further improve it by removing its dependency from the parameter ; to this end, different values for  are adopted and the computed results are averaged to obtain the final id estimate: d = (1/) ∑ ∈{ 1 ,...,  } d().
Considering that, in practice, MLE is highly biased for both large and small values of , a variant of MLE is proposed in [102], where the arithmetic mean is substituted with the harmonic average, leading to the following estimator: Though the proposal in [102] seems to achieve more accurate results, it is based on the assumption that neighbors surrounding each p  are independent, which is clearly incorrect.To cope with this problem, in [105] an interesting regularized version of MLE applies a regularized maximum likelihood technique to distances between neighbors.The comparative evaluation with the aforementioned MLE methods [27,102] makes the authors state that, though the method might be the first to converge to the actual estimate given enough data points, its estimation accuracy is comparable to that achieved by the competing schemes.
In [59,106] a further improvement of MLE is presented; it achieves a better performance by substituting euclidean distances with geodesic ones.
Despite the good results achieved by MLE-based approaches, these techniques have shown to be affected by the curvature induced by  on the manifold neighborhoods approximated by kNN.To reduce this effect, various id estimators have been proposed in [76,107]; here, we review those achieving the most promising experimental results.
After noting that this algorithm is still affected by a bias causing underestimations when the dataset dimensionality becomes sufficiently high (i.e.,  ≥ 10), the authors present theoretical considerations which relate the bias to the fact that id estimators based on nearest-neighbor distances are often founded on statistics derived under the assumption that the amount of available data is unlimited, which is never the case in practical applications.Based on these considerations, two different estimators, named MiND KL and IDEA, are presented.
MiND KL compares the empirical pdf of the neighborhood distances computed on the dataset ( Data ) with the distribution of the neighborhood distances computed from points uniformly drawn from hyperspheres of known increasing dimensionality (  Sphere ).The id estimate is the dimensionality that minimizes their Kullback-Leibler divergence KL( Data ,   Sphere ), which is evaluated by means of the datadriven technique proposed in [108].
IDEA relies on the authors' observation that the quantities 1 −  () (p  )/ (+1) (p  ) are distributed according to the beta distribution  1, with parameters 1 and , respectively.Therefore, since E[ 1, ] =  = 1/(1 + ), a consistent id estimator d ≃  equals To reduce the effect of the aforementioned bias, IDEA finally applies an asymptotic correction step that, inspired by the correction method presented in [28], models the underestimation error by considering both the base algorithm and the given dataset.Motivated by the promising results achieved by MiND KL , in [76] the authors propose its extension, namely, DANCo; it further reduces the underestimation effect by combining an estimator employing normalized nearest-neighbor distances with one employing mutual angles.More precisely, DANCo compares the statistics estimated on P  with those estimated on (uniformly drawn) synthetic datasets of known id.The comparisons are performed by two Kullback-Leibler divergences applied to the distribution of normalized nearestneighbor distances (; , ), having (; , ) =  −1 (1 −   ) −1 , and the distribution of pairwise angles (x; ^, ), (x; ^, ) being the von Mises-Fisher distribution (VMF) [109] with parameters ^and .
The id estimate d is the one minimizing the sum of the two divergences: A fast implementation of DANCo (Fast-DANCo) is also developed.Comparative evaluations show that this algorithm achieves promising results (as shown in [76] and Section 4).Another work, which is notable because the authors not only prove the consistency in probability of the presented estimators but also derive upper bounds (see (19) below) on the probability of the estimation-error for finite, and large enough, values of , is proposed in [33].More precisely, the authors introduce two estimators by firstly defining a function  : R  × R → R + slowly varying near the origin (see [33] for a detailed description and motivation of this assumption).The function  is then used to express the logarithm of the probability of a point p of being in the hypersphere B  (p  , ): log(P(p ∈ B  (p  , ))) = log((p, )) + log(), having P(p ∈ B  (p  , )) = (p, )  .

Mathematical Problems in Engineering
Considering that P(p ∈ B(p  ,  () (p  ))) ≈ / for  big enough, the authors derive the following system of equations: and solve it for d(p  ) to obtain a local id estimate: The two proposed estimators are then computed either by averaging ( davg ) or by voting ( dvote ): where #[cond]  =1 denotes the number of points p  for which cond is verified.
Under differentiability assumptions on the function  and regularity assumptions on M the authors prove the consistency in probability of their estimators and provide upper bounds (see (19)) on the probability of the estimationerror for finite, and large enough, values of .However, the derived bounds depend on unknown universal constants ,   ,   > 0: 3.3.Graph-Based id Estimators.As noted in [96], the work of [110] has cleared up the fact that theories underlying graphs can be applied to solve a variety of statistical problems; indeed, also in the field of id estimation various types of graph structures have been proposed [29,96,111,112] and used for id estimation.Examples are the kNN graph (kNNG) [30], the Minimum Spanning Tree (MST) [113] and its variation, the geodesic MST (GMST) [29]; and the sphere of influence graph (SIG) [114] and its generalization, the −sphere of influence graph (kSIG) [96].
A kNNG  (P  ) is built by employing a distance function, which commonly is the Euclidean one, to weight the arcs connecting each p  to its kNNs.
A MST(P  ) is the spanning tree minimizing the sum of the edge weights.When the weights approximate Geodesic distances [56], a GMST  (P  ) is obtained.
A SIG  (P  ) is defined by connecting nodes p  and p  iff ‖p  − p  ‖ ≤ () + (), where () is the distance between p  and its nearest neighbor in P  .Essentially, two vertices p  and p  are connected if the corresponding NN hyperspheres intersect.A generalization of SIG  (P  ) is kSIG(P  ), where nodes p  and p  are connected iff ‖p  −p  ‖ ≤   ()+  (),   () being the distance between p  and its kNN in P  .This means that the kNN hyperspheres centered on p  and p  intersect.
In the following we recall interesting id estimators based on GMST(P  ), kNNG(P  ), and kSIG(P  ).
In [29,30], after defining the length functional L(  (P  )) = ∑ | , |  ,  ∈ (0, ), to build either the GMST(P  ) or the MST(P  ) of kNNG(P  ), graph theories are exploited to estimate both the id of the underlying manifold structure M and its intrinsic Rènyi -entropy H M .To this aim, the authors derive the linear model: logL(MST(P  )) =  log +,  = (−)/,  = log+H M ,  being an unknown constant, and exploit it to define an estimator of both  and H M .Briefly, a set of cardinalities {  }  =1 is chosen and, for each   , the MST(P   ) is constructed on the set P   , which contains   points randomly sampled from P  , to obtain a set of  pairs (logL(MST(P   )),   ).Fitting them with a least squares procedure the estimates â ≃  and b ≃  are computed.Recalling that  = (−)/, the id is calculated as d = round{/(1 − â)} ≃ .This process is iterated to produce the final estimate as the average of the obtained results.
The aforementioned kNNG based algorithm [29,30] is exploited in [112], where the authors consider datasets sampled from a union of disjoint manifolds with possibly different ids.To estimate the local ids, the authors propose a heuristic, which is not described here, to automatically determine the local neighborhoods with similar geometric structures without any prior knowledge on the number of manifolds, their ids, and their sampling distributions.
In [96] the authors present three id estimation approaches, defined as "graph theoretic methods" since the statistics they compute are functions only of graph properties (such as vertex degrees and vertex eccentricities) and do not directly depend on the interpoint distances.
The first statistic, denoted by  1  (P  ) =   (kNNG(P  )) in the following, is based on the reach (the reach  , (p  , ), in  steps of a node p  ∈ , is the total number of vertices which are connected to p  by a path composed of  arcs or less in ) of vertices in the kNNG(P  ).Considering that the reach of each vertex p  ∈ kNNG(P  ) grows as the id increases, in [115] the average reach   (kNNG) in  steps of vertices in kNNG(P  ) is employed:  1  (P  ) =   (kNNG(P  )) = (1/) ∑  =1  , (p  , kNNG(P  )).The second statistic, denoted by  2  (P  ) =   (MST(P  )), is computed by considering the degree of vertices in the MST(P  ).Recalling that, for datasets P  obtained from a continuous distribution on R  , the ratio of nodes with a given degree  in MST  (P  ) converges a.s.
to a limit depending only on  and  [116] and that the average degree in a tree is a constant depending only on the number of vertices, the authors empirically observe a dependency between the average degree and the id.This leads to the definition of an id estimator employing the statistic  2  =   (MST(P  )) = (1/) ∑  =1 (deg MST(P  ) (p  )) 2 .The third statistic, denoted by  3  (P  ) =    (kSIG(P  )), is motivated by studies in the literature [117] showing that the expected number of neighbors shared by a given pair of points depends on the id of the underlying manifold.Accordingly, calling  , the number of samples in the intersection of the two kNN hyperspheres centered on p  and p  , intuitions similar to those considered for   (kNNG) lead to defining of  3   (P  ) =    (kSIG(P  )) = (1/) ∑ ≤  , .Based on their theoretical results and empirical tests on synthetically generated datasets characterized by id values   in a finite range F ⊆  + (where F = {  } 12    =2 in the reported experiments), the authors propose an approximate Bayesian estimator that could indistinctly employ each of the three statistics  1  ,  2  , and  3  , denoted by  *  in the following.To this aim, they assume that each statistic can be approximated by a Gaussian density    (⋅) = N((  ),  2 (  )); to estimate (  ) and  2 (  ), for each   F,  datasets of large size are synthetically generated by random sampling from the uniform distribution on the unit   -cube.These datasets are then used to estimate the parameters μ(  ) ≃ (  ) and σ2 (  ) ≃  2 (  ) that define the approximation f  (⋅), computed on a generic sample set with size  and id =   , of the Gaussian density    (⋅) of  *  .At this stage, given a new input dataset P  having unknown id, the statistic  *  (P  ) =  P is computed and used to calculate the approximated value f  ( P ) = N( μ2 (  ), σ2 (  )/) ≃    ( P ).Assuming equal a priori probability for all the   ∈ F, the posterior probability [  |  *  ] is given by and employed to compute an "a posteriori expected value" of the id: The authors evaluate the performance of their methods on synthetic datasets, some of which have been used by similar studies in the literature [79], while the others (challenging ones) are proposed by the authors to have manifolds with nonconstant curvature.The comparison of the achieved results with those obtained by the estimators proposed in [27,33,112,118] has led to the conclusion that none of the methods has a good performance on all the tested datasets.However, graph theoretic approaches would appear to behave better when manifolds of nonconstant curvature are processed.
This interesting comparison strengthens the need of defining a benchmark framework to allow an objective and reproducible comparative evaluation of id estimators.For this reason, in Section 4 we describe our proposal in this direction.

A Benchmark Proposal
At the present, an objective comparison of different id estimators is not possible due to the lack of a standardized benchmark framework; therefore, in this section, after recalling experimental datasets and evaluation procedures introduced in the literature (see Sections 4.1 and 4.2), we choose some of them to propose a benchmark framework (see Section 4.3) that allows for reproducible and comparable experimental setups.The usefulness of the proposed benchmark is then shown by employing it to compare relevant stateof-the-art id estimators whose code is publicly available (see Section 4.4).

Datasets.
The datasets employed in the literature are both synthetically generated datasets and real ones.In the following sections we describe those we choose to use in our benchmark study.

Synthetic Datasets.
Synthetic datasets are generated by drawing samples from manifolds (M) linearly or nonlinearly embedded in higher dimensional spaces.
The publicly available tool (http://www.mL.uni-saarland .de/code/IntDim/IntDim.htm) proposed by Hein and Audibert in [79] allows us to generate 13 kinds of synthetic datasets by uniformly drawing samples from 13 manifolds of known id; they are schematically described in Table 1, where they are referred to as M  * .These manifolds are embedded in higher dimensional spaces through both linear and nonlinear maps and are characterized by different curvatures.We note that manifold M  8 is particularly challenging for its high curvature; indeed, when it is used for testing, most relevant id estimators compute pronounced id overestimates (see also the results reported in [107]).
Another interesting dataset [96] is generated by sampling a -dimensional paraboloid, M  , nonlinearly embedded in a higher (3( + 1)) dimensional space, according to a multivariate Burr distribution with parameter  = 1.Tests on this dataset are particularly challenging since the underlying manifold is characterized by a nonconstant curvature.
To perform tests on datasets generated by employing a smooth nonuniform pdf, we propose the dataset M beta , obtained as follows: we sample  points in [0, 1) 10 , according to a beta distribution  0.5,10 with parameters 0.5 and 10, respectively (high skewness), and store them in a matrix X  ∈ R ×10 ; multiply each point of X  (X  (, )) by sin(cos(2X  (, ))), thus obtaining a matrix D 1 ∈ R ×10 ; multiply each point of X  by cos(sin(2X  (, ))), thus obtaining another matrix D 2 ∈ R ×10 ; append D 1 and D 2 to generate a matrix D 3 ∈ R 2500×20 ; append D 3 to its duplicate to finally generate a test dataset containing  points in R 40 .To further test estimators' performance on nonlinearly embedded manifolds of high id, we propose to generate two datasets, referred to as M 1 and M 2 in the following (a tool to generate the datasets sampled from -dimensional paraboloids, the M beta dataset, the M 1 dataset, and the M 2 dataset, is available at http://security.di.unimi.it/∼fox721/dataset generator.m).Precisely, to generate M 1 we uniformly draw  points in [0, 1] 18 , we transform each point by means of tan(x  cos(x 18−+1 )) where  = 1, . . ., 18, we obtain points in R 36 by appending each transformed x to arctan(x 18−+1 sin(x  )), and we duplicate the coordinates of each point to finally generate points in R 72 .The id of M 1 is 18, and its points are drawn from a manifold nonlinearly embedded in R 72 .To generate M 2 containing  points in R 96 , we applied the same procedure on vectors sampled in [0, 1] 24 .
4.1.2.Real Datasets.Real datasets employed in the literature generally concern problems in the fields of image analysis, signal processing, time series prediction, and biochemistry.Among them, the most known and used datasets are ISOMAP face database [56], MNIST database [119], Isolet dataset [120], 2 Santa Fe [121] dataset, and DSVC1 time series [21].Recently, the Crystal Fingerprint space for the chemical compound silicon dioxide dataset has also been proposed [22].
ISOMAP face database consists in 698 gray-level images of size 64 × 64 depicting the face of a sculpture.This dataset has three degrees of freedom: two for the pose and one for the lighting direction (see Figure 2(a)).
MNIST database consists in 70000 gray-level images of size 28 × 28 of hand-written digits (see Figure 2(b)).The real id of this database is not actually known, but some works [79,122] propose similar estimates for the different digits; as an example, the proposed id values for the digit "1" are in the range {8, . . ., 11}.
Isolet dataset has been generated as follows: 150 subjects spoke the name of each letter of the alphabet twice, thus producing about 52 training examples from each speaker, for a total of 7797 samples.The speakers are grouped into 5 sets of 30 speakers each and are referred to as o1, 2, 3, 4, and 5.The real id value characterizing this dataset is not actually known, but a study reported in [123] shows that the correct estimate could be in the range {16, . . ., 22}.
The version 2 of Santa Fe dataset is a time series of 50000 one-dimensional points having nine degrees of freedom (id = 9) and being generated by a simulation of particle motion.In order to estimate the attractor dimension of this time series, it is possible to employ the method of delays described in [124], which generates -dimensional vectors by partitioning the original dataset in blocks containing  consecutive values; as an example, by choosing  = 50 a dataset containing 1000 points in R 50 is obtained.
DSVC1 is a time series composed by 5000 samples measured from a hardware realization of Chua's circuit [125].Employing the method of delays with  = 20, a dataset containing 250 points in R 20 is obtained.The id characterizing this dataset is ∼2.26 [21].Crystal Fingerprint spaces, or Crystal Finger spaces, have been recently proposed in crystallography [22] with the aim of representing crystalline structures; these spaces are built starting from the measured distances between atoms in the crystalline structure.The theoretical id of one Crystal Finger space consists in 3  + 3 crystal degrees of freedom, where   is the number of atoms in the crystalline unitary cell.

Experimental Procedures and Evaluation Measures.
At the state of the art, two approaches have been mainly used to assess id estimators on datasets of known id.
The first one subsamples the test dataset to obtain  subsets of fixed cardinality and computes the percentage of correct estimations.To analyze estimators' behavior with respect to the cardinality of input datasets, this procedure may be repeated by using different cardinality values [29,30,79,122], thus obtaining a distinct performance evaluation measure for each cardinality.
The second approach estimates the id on  permutations of the same dataset and averages the  id estimates to obtain the final one [27,76,107,126].This value is then compared with the real one to assess the id estimator.
To also test the estimator's robustness with respect to its parameter settings, in [27,107,126] the authors apply a further test, originally proposed by Levina and Bickel in [27].Precisely, sample sets with different cardinalities are drawn from the standard Gaussian pdf in R 5 and, for each set, the estimator is applied varying the values of its parameters in fixed ranges; this allows us to analyze the behavior of the id estimate as a function of both the dataset's cardinality and the parameter settings.
Note that, since id estimators are usually tested on different datasets to evaluate their reliability when confronted by different dataset structures and configurations, in [126] an overall evaluation measure is proposed.This indicator, called Mean Percentage Error (MPE), summarizes all the obtained results in a unique value computed as MPE = (100/#M) ∑ M (| dM −  M |/ M ), where #M is the number of tested datasets, dM is the id estimated on the dataset M, and  M is the real id of M. To apply this technique to real datasets whose id belongs to the range { min , . . .,  max }, the same authors propose to calculate the associated MPE's term as min ∈{ min ,..., max } (| dM −|/ M ), where  M is the mean of the range.
The benchmark is composed by following steps: (1) Test all the considered estimators on both the synthetic and real datasets described below.We highlight that the synthetic datasets whose id is a user-defined parameter should be created with sufficiently high id values (id ≥ 10).
(2) Comparative evaluation steps are as follows: (a) compute the MPE indicator both for synthetic and real datasets, (b) compute a ranking test with control methods; to this aim, we suggest the Friedman test with Bonferroni-Dunn post hoc analyses [127], (c) perform the tests proposed in [27] to evaluate the robustness, with respect to different cardinalities and parameter settings.
The 21 synthetic datasets used in the benchmark, referred to as M * in the following, are listed in Table 2 with their relevant characteristics (, , and ).The first 15 datasets are generated with the tool proposed in [79]; they include 4 instances, M 10 * , of dataset M 10 , which are drawn from M  10 after its embedding in R  by setting  = {11, 18, 25, 71}.Note that we did not include the dataset sampled from M  8 (see Table 1) since relevant and recent id estimators have similarly produced highly overestimated results when tested on it [107].Indeed, dealing with highly curved manifolds is still a quite challenging problem in the field.
The last six synthetic datasets are M 1 , M 2 , M beta , and 3 instances of dataset M  * , which are sampled from paraboloids M  whose id is, respectively,  = {3, 6, 9}.
To perform multiple tests, 20 instances of each dataset have been generated, and the achieved results have been averaged.
Regarding the real datasets we used the DSVC1 time series [21] (M DSVC1 , id ∼ 2.26), the ISOMAP face database [56] (M ISOMAP , id = 3), the Santa Fe dataset [121] (M Santa Fe , id = 9), the MNIST database [119] (M MNIST1 , id ∈ {8, . . ., 13}), the Isolet dataset [120] (M Isolet , id ∈ {16, . . ., 22}), and the Crystal Fingerprint space for the chemical compound silicon Table 3: Parameter settings for the different estimators:  represents the number of neighbors,  represents the edge weighting factor for kNN,  represents the number of Least Square (LS) runs,  represents the number of resampling trials per LS iteration,  and  represent the parameters (shape and rate) of the Gamma prior distributions, which describe the hyperparameters and the observation noise model of BPCA, and  contains the mean and the precision of the Gaussian prior distribution describing the bias inserted in the inference of BPCA.To run multiple tests also on M MNIST1 , M SiO2 , and M Isolet , for each of them we generated 5 instances by extracting random subsets containing 2500 points each and we averaged the achieved results.

Dataset
Table 3 summarizes the parameter values we employed for different estimators.Note that, to relax the dependency of the kNNG algorithm from the setting of its parameter , we performed multiple runs with  1 ≤  ≤  2 and we averaged the achieved results.Furthermore, we tested two versions of the algorithm (referred to as kNNG 1 and kNNG 2 ) obtained by varying the parameters  and .

Experimental Results. Table 4 summarizes the results
obtained by the compared estimators on the synthetic datasets, while in Table 5 the results obtained on the real datasets are reported.
Looking at the number of correct estimations computed by each algorithm (highlighted in boldface), we have the following ranking: MLSVD is correct on 13 out of 21 synthetic datasets, DANCo (correct on 10 out of 21 datasets), Hein By observing the MPE indicator, which accounts for the precision of the achieved estimates, we obtain a different ranking: DANCo, MiND KL , and kNNG 1 and kNNG 2 , MLE, Hein, CD, and MLSVD.This difference is due to the fact that algorithms, such as kNNG 1 and kNNG 2 , MLE, and Hein, most of the times produce results close to the correct value.
Regarding the real datasets, all the algorithms achieve a much worse MPE indicator, and again DANCo is best performing method.
Furthermore, we compute the Friedman ranking test with the Bonferroni-Dunn post hoc analysis as proposed in Section 4.3 to state the quality of the achieved results on both the synthetic and real datasets.Tables 6 and 7 summarize the ranking results.Finally, we performed the tests proposed in [27] to evaluate the robustness of MiND KL , MLE, DANCo, and kNNG * with respect to the settings of their  parameter.Precisely, these tests employ synthetic datasets subsampled from the standard Gaussian pdf in R 5 (id = 5).As proposed in Section 4.2, we repeated the tests for datasets with cardinalities  ∈ {200, 500, 1000, 2000} varying the parameter  in the range {5, . . ., 100}.
As shown in Figure 3 many of the tested techniques are strongly influenced by the parameter settings; therefore, studying the variability of the algorithms' behavior when changing their parameter settings is of utmost importance.

Conclusions and Open Problems
This work presents the base theories of state-of-the-art id estimators and surveys the most relevant and recent among them, highlighting their strengths and their drawbacks.Unfortunately, performing an objective comparative evaluation among the surveyed methods is difficult because, to our knowledge, no benchmark framework exists in this research field; therefore, in Section 4 we propose an evaluation approach that employs both real and synthetic datasets and suggests experiments to evaluate the estimators' robustness with respect to their parameter settings.Note that, the benchmark is designed to evaluate the performance achieved by id estimators when both low and high id data must be processed; this consideration is due to the fact that, to our knowledge, only few methods [28,76,107,126] have empirically investigated the problem of datasets drawn from manifolds nonlinearly embedded in higher dimensional spaces and characterized by a sufficiently high id (i.e., id ⩾ 10).However, due to the continuous technological advances, high id datasets are becoming more and more common, and the construction of a theoretically well-formed and robust id estimator able to deal with high id data and limited amount of points remains one of the open research challenges in machine learning.Besides, id estimators should be developed by also considering datasets drawn through nonuniform smooth pdfs from manifolds M characterized by a nonconstant curvature; indeed, most of the algorithms are tested by only employing data drawn by means of uniform pdf.
We further note that, though the aforementioned problems still need further investigations, most researches in this field are presently focusing on tasks that require to estimate the id as the first step.Examples are "multimanifold learning," whose aim is to process datasets drawn from multiple manifolds, each characterized by different id, to identify the underlying structures (see [128] for an example); "nonlinear dimensionality reduction"; or "manifold reconstruction," whose aim is to find the mapping that projects the data (embedded in a higher -dimensional space) on the lower d-dimensional subspace, d being the id estimated on the input dataset (as examples, see [9,11,129]).

Table 2 :
Synthetic datasets and real datasets suggested by the benchmark; N is the dataset cardinality, d is the id, and D is the embedding space dimension.

Table 4 :
Results achieved on the synthetic datasets.The bottom row reports the MPE achieved by each algorithm; anyhow, for each test dataset the best approximation results are highlighted in boldface.

Table 5 :
Results achieved on the real datasets by the compared approaches.The bottom row reports the MPE achieved by each algorithm; anyhow, for each test dataset the best approximation results are highlighted in boldface (when the real id takes values in a range, we highlighted the results that best approximate the mean value of the range).

Table 6 :
Friedman ranking results achieved on all the datasets.The null hypothesis that the algorithms perform comparably is rejected with  value < 0.00001.CD, MLE, and Hein obtain good estimates only for low id manifolds, while they produce underestimated values when processing datasets of high id.

Table 7 :
Hypothesis testing of significance between techniques.Bonferroni-Dunn's procedure rejects those hypotheses that have a  value ≤ 0.0125.