Dimension Estimation Using Weighted Correlation Dimension Method

Dimension reduction is an important tool for feature extraction and has been widely used in many fields including image processing, discrete-time systems, and fault diagnosis. As a key parameter of dimension reduction, the intrinsic dimension represents the smallest number of variables needed to describe a complete dataset. Among all the dimension estimation methods, the correlation dimension (CD) method is one of the most popular; it always assumes that every point has an identical effect on the intrinsic dimension estimate. However, this does not hold when the distribution of a dataset is nonuniform: an intrinsic dimension estimated from a high-density area is more reliable than one estimated from a low-density or boundary area. In this paper, a novel weighted correlation dimension (WCD) approach is proposed. The vertex degree of an undirected graph is invoked to measure the contribution of each point to the intrinsic dimension estimate. In order to improve the adaptability of WCD estimation, the $k$-means clustering algorithm is adopted to adaptively select the linear portion of the log-log sequence $(\log \delta_k, \log C(n, \delta_k))$. Various factors that affect the performance of WCD are studied. Experiments on synthetic and real datasets show the validity and the advantages of the developed technique.


Introduction
Many engineering applications, such as face recognition [1][2][3], nonlinear dynamic systems [4,5], and fault diagnosis, are difficult to analyze by traditional methods owing to the high dimensionality of their signals. Therefore, a suitable dimension reduction of the high-dimensional signals is necessary before further processing.
Considerable attention has been paid to dimension reduction and many techniques have been reported [6,7]. They can be roughly divided into two groups: linear methods and nonlinear methods. Principal component analysis (PCA) [8], linear discriminant analysis (LDA), locality preserving projections (LPP), and multidimensional scaling (MDS) are the classical linear methods, in which the original space is assumed to be linear and the raw data can be directly mapped into a lower dimensional space. Classical nonlinear methods such as isometric mapping (Isomap), locally linear embedding (LLE) [9], Laplacian eigenmaps (LE), local tangent space alignment (LTSA), Hessian locally linear embedding (HLLE), and diffusion maps (DM) all regard the dataset as being locally homeomorphic to a Euclidean space $\mathbb{R}^d$ and preserve the local geometry of the high-dimensional space in the low-dimensional one.
For dimension reduction, one key is to choose a proper intrinsic dimension. An underestimated intrinsic dimension may lose significant information, whereas an overestimated one may retain too much redundant information, increasing the amount of calculation and obscuring the important features. Recently, intrinsic dimension estimation methods have attracted considerable attention [10][11][12][13][14][15][16]. They can usually be categorized into three classes: the projection approach [17], the probabilistic approach, and the geometric approach. In the projection approach, the first step is to extract a low-dimensional representation from a high-dimensional space; the representation is then analyzed and the dimension is estimated by PCA, factor analysis, or MDS. The classical probabilistic approach is the maximum likelihood estimate (MLE) [18], which first estimates the probability distribution of a dataset and then estimates the intrinsic dimension by the maximum likelihood method. Its accuracy depends entirely on the estimate of the probability distribution. The geometric approach includes the geodesic minimal spanning tree (GMST) and fractal methods. GMST constructs a sequence of minimal spanning trees (MST) [19] using the geodesic edge matrix and estimates the intrinsic dimension from the overall lengths of the MSTs. GMST is a global method that does not require estimating the multivariate density of the dataset, but its drawback is the restriction to isometric embeddings. The fractal dimension [20,21] is a statistical index of the complexity of a dataset, which is commonly calculated by the box-counting method [22][23][24] and the CD method [25,26].
In this paper, a WCD method is presented to improve the accuracy of the CD method. The remainder of this paper is organized as follows. Section 2 presents a review of previous work on dimension estimation. In Section 3, a theoretical analysis of WCD estimation is conducted. Section 4 analyzes the influence of various factors on WCD by experiments. In Section 5, experiments on synthetic and real-world datasets are used to confirm the effectiveness of WCD. Finally, conclusions are drawn in Section 6.

Previous Work on Dimension Estimation
Informally, the intrinsic dimension of a dataset is the minimum number of independent variables that can completely describe the dataset, and it can be used to measure the complexity of the dataset. A smaller intrinsic dimension indicates a simpler dataset and vice versa. An accurate estimate of the intrinsic dimension is useful for improving the performance of dimension reduction methods and for extracting features.
A detailed review of intrinsic dimension estimation methods can be found in [16], which summarises almost all the typical methods, including Fukunaga-Olsen's method, near neighbor methods, TRN-based methods, projection techniques, multidimensional scaling methods, and fractal-based methods. Recently some new intrinsic dimension estimation methods have been presented, such as the minimal cover method [27], the axiomatic method [28], the packing number method [29], and the expected absolute projection (EAP) method [30]. Each method has its own characteristics and is therefore suited to different kinds of datasets.
Fractal methods are a powerful tool for estimating the intrinsic dimension. Among the existing fractal methods, the Hausdorff dimension method, the box-counting dimension method, and the CD method are the most representative ones. Further details on the fractal methods can be found in [31].
The Hausdorff dimension, which is derived from the Hausdorff measure, is the basis of the fractal dimension. To proceed further, the Hausdorff measure [32] is introduced first.
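Definition 1 (Hausdorff measure). Let $(X, \rho)$ be a metric space. For any subset $S \subset X$ and any $\delta > 0$, one defines a nonnegative function

$$H_\delta^d(S) = \inf\left\{ \sum_i (\operatorname{diam} U_i)^d : S \subset \bigcup_i U_i,\ \operatorname{diam} U_i < \delta \right\}, \tag{1}$$

where $\operatorname{diam}(U) = \sup\{\rho(x, y) : x, y \in U\}$ represents the diameter of a subset $U$. The $d$-dimensional Hausdorff measure of $S$ can be defined as

$$H^d(S) = \lim_{\delta \to 0} H_\delta^d(S). \tag{2}$$

Definition 2 (Hausdorff dimension). The Hausdorff dimension of a set $S$ in a metric space $(X, \rho)$ is

$$\dim_H(S) = \inf\{ d \ge 0 : H^d(S) = 0 \}. \tag{3}$$

Hence, the Hausdorff dimension is the critical value of $d$ at which the Hausdorff measure jumps from $\infty$ to 0. The Hausdorff dimension provides a complete theoretical framework for dimension estimation, from which many fractal dimension estimation methods can be derived. However, it is difficult to use for dimension estimation in practice. The box-counting dimension, which is derived from the Hausdorff dimension, reduces its computational complexity.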
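Definition 3 (box-counting dimension). For a totally bounded set $S$ in a metric space, let $N_\delta(S)$ be the minimal number of balls with scale $\delta$ that cover $S$. The box-counting dimension is then [33]

$$\dim_{BC}(S) = \lim_{\delta \to 0} \frac{\log N_\delta(S)}{\log (1/\delta)}, \tag{4}$$

and the necessary condition for the existence of the limit is that $N_\delta(S)$ is proportional to $\delta^{-d}$:

$$N_\delta(S) = c\,\delta^{-d}, \tag{5}$$

where $c$ is a constant. Taking the logarithm of (5) gives

$$\log N_\delta(S) = \log c - d \log \delta. \tag{6}$$

The box-counting dimension $d$ can be expressed as

$$d = -\frac{\log N_\delta(S)}{\log \delta} + \frac{\log c}{\log \delta}, \tag{7}$$

and according to (7), in order to obtain a good estimate of $d$, $\log c / \log \delta$ must approach 0. In practice, because of the finite sample size or the value of $\delta$, $\log c / \log \delta$ cannot be completely eliminated. Usually, the box-counting dimension is determined by calculating the slope of the linear part of the curve of $\log N_\delta(S)$ versus $\log (1/\delta)$.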
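The slope-based estimate described above can be illustrated with a minimal Python sketch. It makes two simplifying assumptions that are not part of the definition: the minimal cover by balls is approximated by counting occupied cells of a regular grid of side $\delta$, and the slope is taken between only two scales. The function names `box_count` and `box_counting_slope` are hypothetical.

```python
import math

def box_count(points, delta):
    """Number of grid cells of side delta that contain at least one point."""
    return len({tuple(math.floor(c / delta) for c in p) for p in points})

def box_counting_slope(points, delta1, delta2):
    """Slope of log N_delta against log(1/delta) between two scales."""
    n1, n2 = box_count(points, delta1), box_count(points, delta2)
    return (math.log(n2) - math.log(n1)) / (math.log(1 / delta2) - math.log(1 / delta1))

# A filled unit square sampled on a regular 200 x 200 lattice (dimension 2).
square = [(i / 200, j / 200) for i in range(200) for j in range(200)]
print(box_counting_slope(square, 0.1, 0.05))
```

For real data, more than two scales would normally be used and a line fitted to the linear part of the curve, as stated in the text.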
Although the box-counting method is simpler to calculate than the Hausdorff method, it still has higher computational complexity than the CD method [32]. Let $X = \{x_1, x_2, \ldots, x_n\}$ denote a dataset, $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of points and $p$ the number of observed variables. The correlation integral $C(n, \delta)$ [34] can be defined as

$$C(n, \delta) = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} H(\delta - \|x_i - x_j\|), \tag{8}$$

where $\|x_i - x_j\|$ can be any metric between data points $x_i$ and $x_j$. $H(\cdot)$ is the Heaviside function, which is 1 if the distance is less than $\delta$ and 0 otherwise. $C(n, \delta)$ is therefore the fraction of pairwise distances that are less than $\delta$. It can also be written as

$$C(n, \delta) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} H(\delta - \|x_i - x_j\|). \tag{9}$$

The CD is defined as

$$d = \lim_{\delta \to 0} \frac{\log C(n, \delta)}{\log \delta}. \tag{10}$$

Although (7) and (10) have the same form, their calculation processes are completely different. The numerator of the CD method represents a global bulk with scale $\delta$, whereas the numerator of the box-counting method stands for the minimum number of hyperspheres with scale $\delta$ that cover the dataset. Note that (10) cannot be applied directly to obtain the CD in practice. A commonly used scheme is to calculate the slope of the curve of $\log C(n, \delta)$ against $\log \delta$. Let $(\log \delta_1, \log C(n, \delta_1))$ and $(\log \delta_2, \log C(n, \delta_2))$ denote any two points of the curve; the slope is then defined as

$$d = \frac{\log C(n, \delta_2) - \log C(n, \delta_1)}{\log \delta_2 - \log \delta_1}, \tag{11}$$

and the accuracy of the CD method depends strongly on the choice of $\delta_1$ and $\delta_2$. To obtain an accurate CD, the linear portion of the log-log sequence $(\log \delta_k, \log C(n, \delta_k))$ is selected and a straight line is fitted to it.
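As an illustration of the correlation integral and the two-point slope, the following Python sketch counts the fraction of point pairs closer than a scale and evaluates the slope between two scales. It assumes the Euclidean metric and a strict inequality in the Heaviside step; the function names and the test data (points on a straight segment in three-dimensional space, whose intrinsic dimension is 1) are illustrative choices, not taken from the paper.

```python
import math
import random

def correlation_integral(points, delta):
    """Fraction of distinct pairs (i < j) whose Euclidean distance is below delta."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < delta:
                close += 1
    return 2.0 * close / (n * (n - 1))

def two_point_slope(points, delta1, delta2):
    """Slope of log C(n, delta) against log delta between two scales."""
    c1 = correlation_integral(points, delta1)
    c2 = correlation_integral(points, delta2)
    return (math.log(c2) - math.log(c1)) / (math.log(delta2) - math.log(delta1))

random.seed(0)
# 400 points on a straight segment embedded in 3-D space.
line = [(t, 2.0 * t, -t) for t in (random.random() for _ in range(400))]
print(two_point_slope(line, 0.05, 0.2))
```

The two scales here were picked by hand inside the range where the curve is roughly linear; a poor choice of scales changes the estimate noticeably, which is the sensitivity noted in the text.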

Analysis of WCD Estimation.
From a geometric point of view, an object's bulk is directly related to its scale $\delta$ raised to the power of its dimension [31]. For example, the length of a straight line grows as the first power of the scale, and the area of a circle as the second power.
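The relationship between the bulk and the scale $\delta$ can be described as

$$\text{bulk} \sim \delta^{\text{dimension}}, \tag{12}$$

where the bulk can be any measure such as a volume, area, or mass. Although many notions of bulk are possible, a good quantity for the bulk function $B_{x_i}(\delta)$ is defined in the CD method [31]:

$$B_{x_i}(\delta) = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} H(\delta - \|x_i - x_j\|), \tag{13}$$

and (13) indicates that the local bulk is given by the number of points (normalized by $n - 1$) falling into the hypersphere with scale $\delta$ centered at $x_i$.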
Note that $j = i$ should be excluded, which implies that the denominator is $n - 1$ rather than $n$. Since $B_{x_i}(\delta)$ is a local bulk, some averaging method should be used to obtain the global bulk. In the CD method, the arithmetic average is used:

$$C(n, \delta) = \frac{1}{n} \sum_{i=1}^{n} B_{x_i}(\delta), \tag{14}$$

where $C(n, \delta)$ is the correlation integral, that is, the global bulk. For a uniform dataset, the arithmetic average gives a good correlation integral $C(n, \delta)$. For a nonuniform dataset, however, it is unreasonable to treat every point equally, because the local bulk $B_{x_i}(\delta)$ differs from point to point. Here, a weighted approach is considered for the global bulk, that is, each local bulk contributes with its own weight; the global bulk can then be described as

$$C(n, \delta) = \sum_{i=1}^{n} w_i B_{x_i}(\delta), \tag{15}$$

where $w = (w_1, w_2, \ldots, w_n)$ is the weight vector.
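The difference between the plain average and the weighted average of local bulks can be seen in a small Python sketch. It assumes the Euclidean metric and normalizes the weights to sum to 1, which is an assumption made here for comparability with the plain average; the function names are hypothetical.

```python
import math

def local_bulk(points, i, delta):
    """Fraction of the other n - 1 points that lie within delta of points[i]."""
    n = len(points)
    inside = sum(
        1 for j in range(n)
        if j != i and math.dist(points[i], points[j]) < delta
    )
    return inside / (n - 1)

def weighted_global_bulk(points, delta, weights):
    """Weighted average of the local bulks (weights normalized to sum to 1)."""
    total = sum(weights)
    return sum(
        (w / total) * local_bulk(points, i, delta)
        for i, w in enumerate(weights)
    )

pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
uniform = weighted_global_bulk(pts, 1.5, [1, 1, 1, 1])     # equal weights: the plain average
dense_only = weighted_global_bulk(pts, 1.5, [1, 1, 1, 0])  # isolated point given zero weight
print(uniform, dense_only)
```

With equal weights the isolated point at 10.0 drags the global bulk down; giving it zero weight leaves only the local bulks measured in the dense region.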
Local bulks calculated in three cases (high-density points, sparse points, and boundary points) are shown in Figure 1. Leaving noise points aside, the local bulks estimated in the high-density area are clearly more reliable than those in the other two cases. It is therefore natural to increase the weights of the high-density area and to decrease those of the low-density and boundary areas for dimension estimation. An accurate estimate of the data distribution is thus important, and many methods are available for it, such as probability distribution estimation methods and boundary detection methods. In this paper, the vertex degree of an undirected graph is used to measure the distribution of a dataset, upon which a novel and simple WCD method is proposed to improve the performance of the CD method. If the degree of a vertex is large, the area around the vertex is dense; otherwise the vertex is a sparse point or a boundary point. Moreover, the vertex degree reflects the credibility of the estimated local bulk, so it is reasonable to use the vertex degree as the weight of the local bulk. Twenty points of the dataset in Figure 2 are marked by the vertex degree method: ten squares mark the vertices with the largest degrees and ten circles those with the smallest. The dense area and the sparse or boundary points are distinguished correctly. Therefore, the WCD method is expected to estimate the intrinsic dimension more accurately than the CD method. The specific description of the WCD method is given in Algorithm 1.
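A minimal Python sketch of the vertex degree idea follows. The text above does not state how the undirected graph is built, so the sketch assumes a neighbourhood graph in which two points are joined when their Euclidean distance is below a chosen `radius`; both that construction and the function names are assumptions made for illustration.

```python
import math

def vertex_degrees(points, radius):
    """Degree of each vertex in the undirected graph joining points closer than radius."""
    n = len(points)
    degrees = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                degrees[i] += 1  # an undirected edge counts
                degrees[j] += 1  # towards both of its endpoints
    return degrees

def densest_points(points, radius, m):
    """Indices of the m vertices with the largest degree."""
    degrees = vertex_degrees(points, radius)
    order = sorted(range(len(points)), key=lambda i: degrees[i], reverse=True)
    return order[:m]

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
print(vertex_degrees(pts, 0.5))
print(densest_points(pts, 0.5, 4))
```

The four clustered points each have three neighbours, while the isolated point has none, so it receives the smallest weight and is the last to be selected.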

Selecting the Linear Portion of the log-log Sequence.
Selecting different portions of the log-log sequence to calculate the slope leads to different precision of the CD estimate. A log-log plot drawn from the log-log sequence $(\log \delta_k, \log C(n, \delta_k))$ is shown in Figure 3; it can be divided into three portions: the low portion, the middle portion, and the upper portion. In the low portion, the scale $\delta$ of the hyperspheres is small and only a few points fall into the hyperspheres, so even a few noise points can cause a large error, which is why the low portion fluctuates. In the upper portion, where the scale $\delta$ of the hyperspheres is larger than a specific value, the number of points falling into the hyperspheres no longer increases, as illustrated by the scatter plot of the dataset in Figure 4. This is why the upper portion bends down and approaches a plateau. Usually, the middle portion is linear and is well suited to estimating the CD of a dataset.

To minimize the error caused by nonlinearity, only a few points of the log-log sequence should be used, all within the linear portion. However, to maximize the sample size, we want to include as many points as possible. How can the linear points be chosen accurately from the log-log sequence? Because the three portions of the sequence have clearly different characteristics, the $k$-means clustering method can be used to decide which pairs of the log-log sequence should be used for CD estimation. The $k$-means clustering method partitions the log-log sequence into three categories by minimizing the objective function

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{p_j \in S_i} \|p_j - \mu_i\|^2, \tag{16}$$

where $p_j$ is a pair of the sequence $(\log \delta_k, \log C(n, \delta_k))$; $k = 3$ is the number of categories, corresponding to the low portion, the middle portion, and the upper portion; $S_1, S_2, S_3$ denote the three categories; and $\mu_i$ is the mean of $S_i$. Hence, the points that belong to $S_2$ are fitted with a straight line by the least squares method, and this line is used to estimate the CD. The most important factor of the $k$-means method is the initial value of $\mu_i$. In this paper, the curve is divided equally into three portions and the mean of each portion is taken as the initial value of $\mu_i$.

Algorithm 1 (excerpt).
Input: signal dataset $X$.
Output: intrinsic dimension $d$.
(3) The scale sequence $(\delta_1, \delta_2, \ldots, \delta_K)$ is computed by $\delta_k = \min(D_1) + k\,(\max(D_1) - \min(D_1))/K$, $k = 1, 2, \ldots, K$, where $K$ is the number of scales $\delta$ and $D_1$ is the quantity obtained in the preceding steps of the algorithm.
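The selection of the linear portion can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the pairs are assumed to be sorted by $\log \delta$, the three initial centres are the means of the three equal parts of the curve as described above, and the function names and the synthetic test curve (a fluctuating low part, a linear middle part of slope 2, and a flat upper part) are hypothetical.

```python
import math

def mean(pairs):
    """Component-wise mean of a list of 2-D pairs."""
    return (sum(p[0] for p in pairs) / len(pairs),
            sum(p[1] for p in pairs) / len(pairs))

def kmeans_three_portions(pairs, max_iter=100):
    """Label each log-log pair 0 (low), 1 (middle), or 2 (upper) with k-means, k = 3."""
    third = len(pairs) // 3
    # Initial centres: means of the three equal parts of the curve.
    centres = [mean(pairs[:third]), mean(pairs[third:2 * third]), mean(pairs[2 * third:])]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(3), key=lambda c: math.dist(p, centres[c])) for p in pairs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(3):
            members = [p for p, lab in zip(pairs, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = mean(members)
    return labels

def least_squares_slope(pairs):
    """Slope of the least-squares line through the pairs."""
    mx, my = mean(pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sxx

def slope_of_middle_portion(pairs):
    """Fit a line to the pairs that k-means assigns to the middle cluster."""
    labels = kmeans_three_portions(pairs)
    middle = [p for p, lab in zip(pairs, labels) if lab == 1]
    return least_squares_slope(middle)

pairs = []
for i in range(25):
    x = -4.0 + 0.25 * i
    if x < -3.0:
        y = 2.0 * x + (0.8 if i % 2 == 0 else -0.8)  # fluctuating low portion
    elif x <= 0.0:
        y = 2.0 * x                                   # linear middle portion
    else:
        y = 0.0                                       # saturated upper portion
    pairs.append((x, y))
print(slope_of_middle_portion(pairs))
```

Because the clusters are formed in the two-dimensional log-log plane, the result depends on the relative ranges of the two axes; rescaling one axis could change which points end up in the middle cluster.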

Complexity Analysis of WCD Method.
In this section, the computational complexity of the WCD method is investigated and compared with that of the CD method. Over the whole calculation, the local bulks of the WCD method cost more than the pair counts of the CD method. For the analysis, assume that the sample size is $n$. Calculating a local bulk $B_{x_i}(\delta)$ at point $x_i$ with scale $\delta$ requires $n - 1$ distance comparisons, so its complexity is $O(n - 1)$. There are $n$ local bulks to be calculated, so the total is $O(n(n - 1))$. The CD method, which counts each unordered pair only once, requires only $(n - 1) + (n - 2) + \cdots + 1 = n(n - 1)/2$ comparisons.
In addition, unlike the CD method, the WCD method needs the vertex degrees, whose cost cannot be ignored when the sample size is huge. All this suggests that the computational complexity of the WCD method is much higher than that of the CD method. In fact, it is unnecessary to calculate all local bulks of the dataset for the WCD method: a small number of points is enough to estimate the local bulks and still obtain an accurate result. In that case, the computational cost of the WCD method is almost the same as that of the CD method, as the following experiments show.
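The comparison can be made concrete with a short count of distance comparisons per scale. The symbol $m$ (the number of points whose local bulks are actually evaluated) and the numerical values are introduced here for illustration only, and the cost of computing the vertex degrees is left out.

```latex
\begin{align*}
\text{WCD, all } n \text{ local bulks:} \quad & n(n-1) \\
\text{CD, pairs } i<j: \quad & \tfrac{1}{2}\,n(n-1) \\
\text{WCD, } m \text{ selected points:} \quad & m(n-1) \le \tfrac{1}{2}\,n(n-1) \iff m \le \tfrac{n}{2} \\[4pt]
\text{e.g. } n = 4000,\ m = 400: \quad & 400 \cdot 3999 = 1\,599\,600
  \quad\text{versus}\quad \tfrac{1}{2} \cdot 4000 \cdot 3999 = 7\,998\,000
\end{align*}
```

Under this count, restricting the local bulks to fewer than half of the points already brings the cost of the weighted sum below that of the full pair count.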

Experimental Study
There are many factors affecting the results of the WCD method, including the sample size, the intrinsic dimension, the selection of the linear portion of the log-log sequence, the number of local bulks used for the correlation integral $C(n, \delta)$, and the selection of scales. In our experiments, samples with different dimensions and sample sizes are generated by the MATLAB randn function, so the samples are independent and Gaussian distributed. The performance of the WCD method is compared with that of the CD method and the various factors are analyzed.

Correlation dimensions are depicted in Figures 5(a), 5(b), and 5(c) for both the WCD and CD methods, with three different sample sizes. Only sample sizes of 100, 200, and 500 and intrinsic dimensions of 3, 5, and 8 are plotted; the results are similar for other sample sizes and intrinsic dimensions. For each plot in Figure 5, the horizontal axis indicates the number of local bulks, whose maximum value is the sample size. The vertical axis represents the actual and the estimated values (via the WCD method and the CD method) of the intrinsic dimension. Each horizontal green line represents the actual intrinsic dimension for reference. Each red dot denotes the intrinsic dimension estimated by the WCD method, and each black asterisk the one estimated by the CD method.

It can be observed from Figure 5 that the intrinsic dimensions calculated by the WCD method are more accurate than those calculated by the CD method. However, the front part of the curves plotted by the WCD method fluctuates frequently, because only a few local bulks are used for the estimation, which makes the result unstable. In addition, all the curves plotted by the WCD method slope downward as the number of local bulks increases, but they still converge to a good value. In general, the front part of the curves plotted by the WCD method is more precise than the latter part. The loss of precision is mainly caused by the data distribution. The high-density area is chosen first by the vertex degree method to calculate the local bulks, leading to high accuracy; as more sparse or boundary points are used to calculate the local bulks, accuracy is lost. Hence, it is inferred that the number of points used to calculate the local bulks is one of the main factors in intrinsic dimension estimation. This also supports the effectiveness of using a small number of high-density points to estimate the intrinsic dimension by the WCD method.

Examining the curves estimated by both methods, when the sample size is fixed, the accuracy gradually decreases as the actual intrinsic dimension increases. The main reason is that, for the same sample size, the dataset becomes sparser as the intrinsic dimension increases. Comparing Figures 5(a), 5(b), and 5(c), the accuracy of both methods tends to improve with increasing sample size at the same actual intrinsic dimension, because the dataset becomes denser as the sample size increases.

Additionally, the selection of the scales $\delta$ is an important factor affecting the intrinsic dimension estimate. Smaller scales are easily affected by noise, whereas larger scales result in saturation, in which the correlation integral $C(n, \delta)$ no longer changes as the scale increases. Moreover, too many scales increase the computational cost, whereas too few reduce the precision.
To analyze the calculation speed, we generate three-dimensional datasets with sample sizes from 100 to 4000 by the MATLAB randn function and estimate the intrinsic dimension by four methods: WCD, CD, MLE, and GMST. The computation time of all four methods is shown in Figure 6. It reveals that the GMST method costs the most computation time, while the WCD, MLE, and CD methods cost almost the same time. However, the WCD method slows down noticeably as the number of local bulks increases.

Empirical Results
In order to validate the proposed method, the WCD method is used to estimate the intrinsic dimension of two kinds of datasets (synthetic datasets and real-world datasets). Moreover, comparisons with the geodesic minimum spanning tree (GMST), correlation dimension (CD), and maximum likelihood estimation (MLE) methods are performed to further demonstrate the advantage of the proposed method.

Synthetic Datasets.
In this subsection, two synthetic datasets (the Koch curve and the S-curve) are investigated first. Each dataset contains 2000 samples, and the plots are shown in Figures 7 and 8. The dimensions estimated by all methods are listed in Table 1. The Koch curve originates from a line segment whose middle third is repeatedly replaced by the other two sides of an equilateral triangle. If the Koch curve is measured with a one-dimensional tool (length), its Hausdorff measure is infinite; if it is measured with a two-dimensional tool (area), its Hausdorff measure is 0. So the intrinsic dimension of the Koch curve is between 1 and 2, and the dimensions estimated by the four methods fall into this range. Moreover, the data points in the S-curve dataset lie on a curved surface in three-dimensional space, so the intrinsic dimension of the S-curve dataset is 2. The results show that all the considered methods have high accuracy, and that the method developed in this paper is the most accurate.
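The construction of the Koch curve described above can be sketched in Python. The text does not state how its 2000 samples were drawn, so this sketch simply returns the vertices of the polygonal approximation after a chosen number of refinements; the function name `koch_curve` and the choice of five refinements are assumptions. The last line prints the similarity dimension $\log 4 / \log 3$ of the ideal curve for reference.

```python
import math

def koch_curve(level):
    """Vertices of the Koch curve after `level` refinements of the unit segment."""
    points = [(0.0, 0.0), (1.0, 0.0)]
    cos60, sin60 = 0.5, math.sqrt(3) / 2
    for _ in range(level):
        refined = [points[0]]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            dx, dy = (x1 - x0) / 3.0, (y1 - y0) / 3.0
            a = (x0 + dx, y0 + dy)          # one third of the way along the segment
            b = (x0 + 2 * dx, y0 + 2 * dy)  # two thirds of the way along the segment
            # Apex of the equilateral triangle: the middle third rotated by 60 degrees.
            apex = (a[0] + dx * cos60 - dy * sin60, a[1] + dx * sin60 + dy * cos60)
            refined.extend([a, apex, b, (x1, y1)])
        points = refined
    return points

curve = koch_curve(5)
print(len(curve))                  # 4**5 segments, hence 4**5 + 1 vertices
print(math.log(4) / math.log(3))   # similarity dimension of the ideal Koch curve
```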
Real Datasets.
Following a process similar to that of Section 5.1, three real datasets (the laser generated data, the Ikeda map, and the Hénon map) are analyzed in this subsection. The datasets are described as follows.

Laser Generated Data.
The data were recorded from a far-infrared laser in a chaotic state [4]; they consist of 1000 samples, and the attractor dimension is approximately 2.26. The plot is shown in Figure 9.

Ikeda Map.
The Ikeda map [31] is a complex map, which is defined by

$$z_{t+1} = a + b\, z_t \exp\left[ i \left( \kappa - \frac{\eta}{1 + |z_t|^2} \right) \right]. \tag{17}$$

The Ikeda map is derived from a model of the plane-wave interactivity field in an optical ring laser. It is iterated many times, and the points $[\operatorname{Re}(z), \operatorname{Im}(z)]$ are plotted for $n = 2000$. Here, $a = 1.0$, $b = 0.9$, $\kappa = 0.4$, and $\eta = 6$. The intrinsic dimension of this attractor is approximately 1.7. The visualization of the map is shown in Figure 10.

Hénon Map.
The Hénon map [31] is usually cast as an equation of the form

$$x_{t+1} = 1 - a x_t^2 + y_t, \qquad y_{t+1} = b x_t, \tag{18}$$

with $a = 1.4$ and $b = 0.3$, which gives an attractor with an intrinsic dimension of approximately 1.3. The plot of the Hénon map dataset for $n = 2000$ is shown in Figure 11.
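Both maps can be iterated with a few lines of Python. The sketch below assumes the standard forms of the two maps with the parameter values quoted above; the initial conditions, the number of discarded transient iterations (`burn_in`), and the function names are assumptions made for illustration.

```python
import cmath

def ikeda_points(n, a=1.0, b=0.9, kappa=0.4, eta=6.0, burn_in=100):
    """n points (Re z, Im z) of the Ikeda map after discarding a transient."""
    z = 0j
    points = []
    for step in range(n + burn_in):
        z = a + b * z * cmath.exp(1j * (kappa - eta / (1.0 + abs(z) ** 2)))
        if step >= burn_in:
            points.append((z.real, z.imag))
    return points

def henon_points(n, a=1.4, b=0.3, burn_in=100):
    """n points (x, y) of the Henon map after discarding a transient."""
    x, y = 0.0, 0.0
    points = []
    for step in range(n + burn_in):
        x, y = 1.0 - a * x * x + y, b * x  # both updates use the previous x and y
        if step >= burn_in:
            points.append((x, y))
    return points

ikeda = ikeda_points(2000)
henon = henon_points(2000)
print(len(ikeda), len(henon))
```

The resulting point lists can be passed directly to a dimension estimator, since no embedding step is needed for these two datasets.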
To estimate the intrinsic dimension of the laser generated data, the phase space is reconstructed by the delay-time embedding technique. Although Takens has proved that the original state space of a dynamical system can be reconstructed as long as $m > 2d + 1$, where $m$ is the embedding dimension and $d$ denotes the intrinsic dimension of the attractor, it is nontrivial to choose the embedding parameters. If the product $(m - 1)\tau$, where $\tau$ is the delay time, is too large, the components of the reconstructed vector become effectively decorrelated in phase space, which leads to an overestimated dimension. If the product $(m - 1)\tau$ is too small, the components become effectively redundant, which leads to an underestimated dimension. In order to compare the result with [4], we select embedding dimension $m = 5$ and delay time $\tau = 10$. The dimensions of the Ikeda map and the Hénon map are estimated directly by the dimension estimation methods, which avoids selecting $m$ and $\tau$. From Figures 10 and 11, we note that the thinner the attractor, the lower its dimension. The results are listed in Table 2, from which we can infer that the WCD method is also effective on the real datasets.
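The delay-time embedding step can be sketched in Python as follows. The sketch assumes forward delays, so each reconstructed vector is $(s_t, s_{t+\tau}, \ldots, s_{t+(m-1)\tau})$; the function name `delay_embed` is hypothetical, and the integer series is only a stand-in for the 1000 laser samples.

```python
def delay_embed(series, m, tau):
    """Delay vectors (s[t], s[t + tau], ..., s[t + (m - 1) * tau]) for every valid t."""
    span = (m - 1) * tau
    return [
        tuple(series[t + k * tau] for k in range(m))
        for t in range(len(series) - span)
    ]

series = list(range(1000))  # placeholder for the 1000 laser samples
vectors = delay_embed(series, m=5, tau=10)
print(len(vectors), vectors[0])
```

With $m = 5$ and $\tau = 10$, each vector spans $(m - 1)\tau = 40$ samples, so a series of 1000 samples yields 960 reconstructed vectors.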

Conclusion
When the distribution of a dataset is nonuniform, the CD method for intrinsic dimension estimation suffers from a large bias. To address this issue, the WCD method has been proposed, with a weight vector determined by the vertex degree. The factors influencing the WCD method have also been analyzed, including the sample size, the selection of the linear portion of the log-log sequence, the number of local bulks used for the correlation integral $C(n, \delta)$, and the selection of scales. The WCD method is validated by experiments on synthetic datasets and real-world datasets.
Compared with the CD method, the main drawback of the WCD method is that the computation slows down when many local bulks $B_{x_i}(\delta)$ are calculated. However, the experiments indicate that it is unnecessary to calculate all the local bulks of the dataset: using only a few points in the high-density area of the dataset also gives a good result. The experiments above show that the computational cost of the WCD method is almost the same as that of the CD method when fewer than 3500 local bulks are used. Moreover, the density estimation of a dataset by vertex degree is only applicable to a single distribution. When the dataset follows multiple distributions, the WCD method fails; this case should be studied further.

Figure 2: The indication of falling into the circle at different location.

Figure 3: log-log plot for computation of CD.

Figure 4: Bending explanation for the upper portion.

Figure 5: Estimated and actual intrinsic dimension for datasets on different sample size.

Table 1: Intrinsic dimension estimation of synthetic datasets with different methods.

Table 2: Intrinsic dimension estimation of real datasets with different methods.