Spatial Object Tracking Using an Enhanced Mean Shift Method Based on Perceptual Spatial-Space Generation Model

Object tracking is one of the fundamental problems in computer vision, but existing efficient methods may not be suitable for spatial object tracking. Therefore, it is necessary to propose a more intelligent mathematical model. In this paper, we present an intelligent modeling method using an enhanced mean shift method based on a perceptual spatial-space generation model. We use a series of basic and composite graphic operators to complete signal perceptual transformation. The Monte Carlo contour detection method could overcome the dimensions problem of existing local filters. We also propose the enhanced mean shift method with estimation of spatial shape parameters. This method could adaptively adjust tracking areas and eliminate spatial background interference. Extensive experiments on a variety of spatial video sequences with comparison to several state-of-the-art methods demonstrate that our method could achieve reliable and accurate spatial object tracking.


Introduction
Mathematical formalism is probably the most precise and logical language in scientific research. Researchers in the pure natural sciences typically attempt to describe observed phenomena using mathematical correlations. However, because of real-world scenarios, it is often very difficult to construct a perfect and permanent mathematical model for a specific issue in engineering fields [1]. In recent years, efforts concentrated on self-optimization and self-adaptation have led to a new field between mathematics and applications, called intelligent modeling [2]. In this paper, we propose a new intelligent modeling method using the enhanced mean shift method based on a perceptual spatial-space generation model for spatial object tracking.
Object tracking has been applied to many fields, such as video surveillance [3], robot recognition [4], and traffic control [5]. In spatial on-orbit docking, object tracking can be used to track spatial aircraft and assist with ground control. Because spatial images are mainly generated from low-rate videos [6] or airborne spectral imagery [7], which are captured by aircraft sensors [8,9], their resolution and spatial-temporal coverage are far from ideal. In addition, because of differences in sensor spectral bands, acquisition position, and contrast gradient settings, multisource images of the same scene exhibit shifts in relative position and scale. All of these factors affect spatial object tracking results.
Recently, multidimensional decomposition and multiscale representation methods have been widely applied to image processing and computer vision. Mumford and Gidas proposed the stochastic model [10], in which truncation errors and noise interference can be isolated from the discrete domain. Witkin and Koenderink proposed the image scale-space model [11,12], which removes noise interference at fine scales and reduces analysis errors at coarse scales. Burt and Lindeberg proposed the coarse-to-fine model [13,14], which can reduce useless gradients in gradient entropy calculation. In the object tracking field, the traditional method is rectangular block region tagging [15,16]. In [17], Isard and Blake applied a local filter to object tracking. Sun and Liu proposed combining a local description with a global representation in object tracking [18]. Recently, a graphics model based on a Bayesian neural network was also applied to continuous object tracking [19]. However, when these methods are applied to spatial object tracking, spatial background clutter and overlapping moving objects appear in different scenes. The resulting multiscale deviation seriously reduces spatial object tracking accuracy.
In this paper, we propose a spatial object tracking method using an enhanced mean shift method based on a perceptual spatial-space generation model. We detect the spatial object continuity and saliency between different scales in the perceptual spatial-space generation model. The enhanced mean shift considers the relevance between the motion area and the static background, which makes object tracking more robust. Our proposed method is shown in Figure 1. This paper is organized as follows. Section 2 describes the perceptual spatial-space generation model. Section 3 proposes an enhanced mean shift method. Section 4 shows experimental results. Section 5 concludes the paper.

Prototype Pyramid Generation.
The generation model is a joint probability function of the prototype  and the image , where Δ  is a dictionary of image primitives such as blobs, edges, crosses, and bars. It can be expressed as
The decomposition probability can be divided into primitives and texture. Consider the following:
where  = ⟨, ⟩ is the attribute graph,  is the collection of primitives in , (, V) denotes the primitives in the dictionary Δ, and  is the variance of the corresponding primitive features. The prior model () is an inhomogeneous Gibbs model, which defines the graphical attributes on . It focuses on continuity properties in the perceptual model, such as smoothness, continuity, and typical functions. Consider the following:
where   is the primitive mark in  and its connection degree is . (  ,  V ) is the potential association between two correlation functions. Because there is uncertainty in the inner perception posterior probability, the prototype pyramid may not be continuous across layers when each layer is calculated separately. In order to ensure transition and consistency within a single frame, we define a set of graphical operation factors
Graphical operation factors synthesize detected graphic edges into pairs of characteristic bridges. Each bridge is related to the properties of the probability function. The conversion from   to  +1 is realized by a series of conversion rules   , and the rule order directly determines conversion efficiency. The generative rule graphic path from   to  +1 can be expressed as
where  is the optimal-calculated prototype. Provided that there is no loss in perception model accuracy, we assume that   begins to decay from  through a single operation factor.   will gradually reduce the resolution, and   will also have a related complexity. The posterior probability can be expressed as
A layered reduced perception model will not completely adapt to the complex model , so the first logarithmic ratio is often negative. The parameter  is used to balance model fitness and complexity. If  = 1, we could launch simplified ().   could be decided in the following range:
The transform between   and  +1 is achieved using a group of greedy detections. The accurate scale of the graphical operation factor differs according to the subjective goal. We suppose that the graphical operation factor is between  and
Based on formulas (6) and (7), we determined that the corresponding interval is
2.2. Perceptual Transform of Prototype Pyramid.
In this section, our goal is to determine the optimum conversion path and deduce the hidden graphics prototype. Our method scans the prototype pyramid from top to bottom based on each primitive learning decision rule. It can be divided into three steps.
Step 1 (prototype pyramid independent calculation). We apply a pyramid algorithm to the bottom image  and calculate the Gaussian pyramid  0 . Because each prototype layer is calculated using a MAP estimation [20], there is a certain loss in the continuity of the prototype pyramid. The specific formula is as follows:
Step 2 (bottom-up pattern matching). We match the image prototype attribute from   to  +1 using an image registration algorithm from a previous study [21]. We use , , , and  as a judgment function to process each node  and obtain the related image characteristics at each time. Specifically, the matching degree between the th node at the th scale and the th node at the ( + 1)th scale can be expressed as
where  is the variance of related features. For similarity matching between   and  +1 , this formulation allows the empty variable prototype to appear in   . We multiply   by the related variance  and obtain a homologous  +1 with subsidiary value. Pattern matching results are used as the initial parameter for the Markov chain matching in the next step. Consider the following:
Step 3 (Markov chain matching). Because of the uncertainty of the initial perception in the posterior probability model and the relative complexity of dynamic graphic structure mining, we use the reversibility of Markov chain matching to match a perceptual transform. The Markov chain includes 25 pairs of reversible jumps. In each path of Markov chain matching, there is a reversible jump between  1 and  2 [22]. These reversible jumps are related to their corresponding grammars, and each pair of these rules is based on probability selection. We use this mechanism to optimize the perception conversion path, which leads to a cross-scale continuous perception prediction. Consider the following:
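To make Step 1 of the procedure above more concrete, the following is a minimal sketch of a Gaussian pyramid in plain Python. It is not the authors' implementation: the 5-tap binomial kernel, the border replication, the factor-of-two downsampling, and the names `gaussian_blur` and `gaussian_pyramid` are all assumptions chosen for illustration.

```python
from typing import List

Image = List[List[float]]

# 5-tap binomial kernel, a common discrete approximation of a Gaussian
# (an assumption; the source does not specify the smoothing kernel).
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def _blur_rows(img: Image) -> Image:
    """Convolve every row with KERNEL, replicating border pixels."""
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(k * row[min(max(x + d - 2, 0), n - 1)]
                for d, k in enumerate(KERNEL))
            for x in range(n)
        ])
    return out


def _transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gaussian_blur(img: Image) -> Image:
    """Separable blur: filter the rows, then the columns."""
    return _transpose(_blur_rows(_transpose(_blur_rows(img))))


def gaussian_pyramid(img: Image, levels: int) -> List[Image]:
    """Return [level 0 (input), level 1, ...], halving the size per level."""
    pyramid = [img]
    for _ in range(levels - 1):
        blurred = gaussian_blur(pyramid[-1])
        # Keep every second row and column of the smoothed image.
        pyramid.append([row[::2] for row in blurred[::2]])
    return pyramid
```

Each coarser level would then be the input to the per-layer prototype estimation described in Step 1; that estimation itself is not sketched here.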

Object Contour Evolution.
The problem of such a matching strategy is that it does not contain any matching, which could split a long contour into short edges. We propose a new Monte Carlo contour detection method. This method mainly chooses the right standard at a certain scale using spatial-space domain knowledge. Our proposed method uses the weight set { , ,  , } to estimate the posterior probability density ( , ,  , ). According to the resampling theory [23], it is feasible to calculate the sample specimen (, ) with appropriate weights from the normal density distribution    . Consider the following:
For the sequence sample, the importance probability (, ) can be chosen using the following mode:
The entire probability can be approximated with a simplified formulation
We use only one part of the whole sample; (
Based on the resampling theory, the sampling weight can be updated as follows:
where  is the sample retrieval in the th part,  is the sample length of the th part, and    is the sample set of the th part. Because we use the Markov property, the density function (   |  ,  ) can be further approximated by the product of the local observation similarities of each unit . Consider the following:
In our proposed framework, all parts of the detected object are tracked at the same time. There is no need to calculate all unit probabilities (   |  ,  ), so we can use the local possibility and directly estimate the weight of all units.
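The weight update and resampling described above can be illustrated with a generic importance-resampling sketch. This is an assumption-laden illustration rather than the paper's exact scheme: the multiplicative update by a local likelihood, the systematic resampling strategy, and the names `update_weights` and `systematic_resample` are not taken from the source.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale the weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def update_weights(weights: Sequence[float],
                   likelihoods: Sequence[float]) -> List[float]:
    """Multiply each weight by its local observation likelihood, then normalize."""
    return normalize([w * l for w, l in zip(weights, likelihoods)])


def systematic_resample(samples: Sequence[T], weights: Sequence[float],
                        rng: random.Random) -> List[T]:
    """Draw len(samples) samples with probability proportional to weight."""
    n = len(samples)
    w = normalize(weights)
    start = rng.random() / n          # single random offset in [0, 1/n)
    out: List[T] = []
    cumulative, i = w[0], 0
    for k in range(n):
        u = start + k / n             # evenly spaced positions in [0, 1)
        while u >= cumulative and i < n - 1:
            i += 1
            cumulative += w[i]
        out.append(samples[i])
    return out
```

Systematic resampling is used here only because it needs a single random number per pass; any resampling scheme that draws in proportion to the weights would serve the same purpose.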

Markov Random Field Representation.
We propose a Markov random field representation for the perception generation model. We define the pixel perception set , in which any two arbitrary pixels are adjacent. The adjacency relation is symmetric: in the adjacent pixels system  = {(, ): (, ) ∈ }, if (, ) is adjacent to (, V), then (, V) is also adjacent to (, ), and () is the Markov random field with respect to the perception set
  ( , |  (, )) =  ( , |  ( \  (, ))). (20)
We can define () as the pixel set that contains all pixels of . The Markov property in formula (20) relies on the density distribution of the adjacent pixels. According to S. Geman and D. Geman [24], the Markov random field-related pixels in system  can be rewritten as the following Gibbs distribution:
where   (()) is the potential function defined on (), and  is a normalizing constant that makes () sum to 1. For spatial object tracking, the perception latent variables that appear in pairs are difficult to describe accurately. If  has an amount of pixels, then   will be a multidimensional function. Here, we use the topological model to solve the above problems. Assuming that the spatial perceptual rules set is (), then  , () can extract the local characteristics near pixel (, ). A specific example is  , () = ⟨,  ,, ⟩,  = (, ). We transform the image  with the Gibbs distribution
where  ,, is the low-dimensional image characteristic function set,  is the normalizing constant that depends on , and  ,, = (,  ,, ). Assuming that the normal distribution is (),  ,, is the normal distribution of  ,, under  − ().
In the formulation of ( | ),  * can be seen as the best approximation of probability , and it can also be regarded as the maximum likelihood estimate. In order to model the observed image, we assume that  ,, ( ) =   ( ), so that  does not depend on (, ). We can continue to parameterize   or normalize  at a low-dimensional scale. If we normalize   in  = 1, . . .,  and make   ( ,, ()) =   , we can rewrite the Gibbs distribution as follows:
where   () =  ,  ,, () represents the number of effective points  ,, () that fall into the interval  and   = (  , ) is the marginal matrix related to { ,, , }. If we want to find the maximum estimate coefficient , we need to calculate the spatial-scale statistical coefficient   ().
In other words, we need to match the spatial data and the related model. The most suitable model is determined by   ( observed ), where   [  ()] is an average parameter and the value  is a natural parameter. In the perceptual model, there is a global balance variable. We can make a minor adjustment
where  0 is the approximation value of the global observation estimated from observed spatial images. The local equilibrium parameter defined at any pixel obeys the specific distribution
where   is the local approximation for pixel  0 . It is the only distribution that has similar effects to the perceptual model.
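The histogram form of the Gibbs distribution described above, in which an image is scored through the number of feature responses falling into each interval, can be sketched as follows. Everything specific in this block is an assumption made for illustration: horizontal finite differences stand in for the local feature, the energy is taken as a weighted sum of bin counts, and the names `feature_responses`, `histogram`, and `unnormalized_log_prob` do not come from the source. The normalizing constant is deliberately left out.

```python
from typing import List, Sequence

Image = List[List[float]]


def feature_responses(img: Image) -> List[float]:
    """Horizontal finite differences, used as a stand-in local feature."""
    return [row[x + 1] - row[x] for row in img for x in range(len(row) - 1)]


def histogram(responses: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count responses in half-open bins [edges[k], edges[k+1]).

    Responses outside the covered range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for r in responses:
        for k in range(len(counts)):
            if edges[k] <= r < edges[k + 1]:
                counts[k] += 1
                break
    return counts


def gibbs_energy(counts: Sequence[int], beta: Sequence[float]) -> float:
    """Energy as a weighted sum of bin counts (assumed sign convention)."""
    return sum(b * h for b, h in zip(beta, counts))


def unnormalized_log_prob(img: Image, edges: Sequence[float],
                          beta: Sequence[float]) -> float:
    """Log of the Gibbs density up to the additive normalizing constant."""
    return -gibbs_energy(histogram(feature_responses(img), edges), beta)
```

Under this convention, matching the model to observed data amounts to choosing the coefficients so that the expected bin counts under the model agree with the counts measured on the observed images.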

Enhanced Mean Shift Method
3.1. Mean Shift.
In this section, we propose an enhanced mean shift method. Our method uses the mathematical recursion method [25]. The spatial tracking object is represented by a spatial histogram consisting of a weighted evaluation, in which the probability estimation functions () and ( 0 ) represent the potential motion probability in images () and ( 0 ). The histogram variables can be expressed as follows:
where
We apply the first-order Taylor series expansion, in which (, ) are the coordinates of the center position in the previous frame, and obtain the following expanded formulas:
The center of the kernel function can be determined by the estimate of (, )  =   
In order to estimate the kernel function, the normalized bandwidth is applied to a similarity judgment. The normalized bandwidth can be obtained by estimating |Σ|(, )
where    = (  −) could accurately determine (, ). Equations (32) and (33) are calculated in alternating iterations until the estimated parameters can cover all the variables.
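For orientation, the basic mean shift iteration that this subsection builds on can be sketched as below. This is the textbook procedure with a flat window, not the enhanced variant proposed here: the per-pixel weights are taken as given, and the names `bhattacharyya_coefficient` and `mean_shift` as well as the default tolerance are assumptions. The default of 20 iterations mirrors the iteration count reported in the experiments.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Similarity of two normalized histograms: sum over bins of sqrt(p*q)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def mean_shift(points: Sequence[Point], weights: Sequence[float],
               start: Point, bandwidth: float,
               tol: float = 1e-3, max_iter: int = 20) -> Point:
    """Move the window center to the weighted mean of the points inside it."""
    cx, cy = start
    for _ in range(max_iter):
        sw = sx = sy = 0.0
        for (x, y), w in zip(points, weights):
            # Flat window: only points within the bandwidth contribute.
            if (x - cx) ** 2 + (y - cy) ** 2 <= bandwidth ** 2:
                sw += w
                sx += w * x
                sy += w * y
        if sw == 0.0:
            break                      # empty window, nothing to shift toward
        nx, ny = sx / sw, sy / sw
        shift = math.hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        if shift < tol:
            break                      # converged
    return cx, cy
```

In histogram-based tracking, the weight of each pixel is usually derived from the ratio between the target histogram and the candidate histogram at that pixel's bin, and the Bhattacharyya coefficient measures how well the two histograms agree after the shift.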

Estimation of Spatial Shape Parameters.
We also use the iterated function to determine boundary parameter  (2)   , which contains five fully adjustable affine box parameters. These parameters are the width, height, length, orientation, and center location. The orientation  is defined as the angle between the horizontal and width matrix. The box with width  and height ℎ is the ellipse area between the long and short coordinate system. The relationship between , ℎ,  and the bandwidth matrix Σ can be expressed as follows:
where  = [cos  −sin ; sin  cos ] is the 2 × 2 rotation matrix of the orientation angle. (34)
These parameters can be calculated using the octave decomposition method [26]; the specific formula is as follows:
where  pre  and  cur  are the octave decomposition components of the previous frame and the current frame.
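One common way to relate a box of given width, height, and orientation to a 2 × 2 bandwidth matrix is to place the squared half-axes on the diagonal and rotate them by the orientation angle. The sketch below assumes that parameterization, uses the symbols `w`, `h`, and `theta` as hypothetical names, and recovers the parameters with a closed-form eigendecomposition. It is not the octave decomposition method cited in the text, and it assumes `w >= h` so that the orientation is unambiguous.

```python
import math
from typing import Tuple

Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]


def bandwidth_from_box(w: float, h: float, theta: float) -> Matrix2:
    """Sigma = R * diag((w/2)^2, (h/2)^2) * R^T for the rotation R(theta)."""
    a, b = (w / 2) ** 2, (h / 2) ** 2
    c, s = math.cos(theta), math.sin(theta)
    s11 = a * c * c + b * s * s
    s12 = (a - b) * c * s
    s22 = a * s * s + b * c * c
    return ((s11, s12), (s12, s22))


def box_from_bandwidth(sigma: Matrix2) -> Tuple[float, float, float]:
    """Recover (w, h, theta) from a symmetric positive definite Sigma."""
    (s11, s12), (_, s22) = sigma
    half_trace = (s11 + s22) / 2
    root = math.sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)
    lam_max, lam_min = half_trace + root, half_trace - root
    # Orientation of the eigenvector that belongs to the larger eigenvalue.
    theta = 0.5 * math.atan2(2 * s12, s11 - s22)
    return 2 * math.sqrt(lam_max), 2 * math.sqrt(max(lam_min, 0.0)), theta
```

A round trip through both functions returns the original parameters for orientations strictly between −π/2 and π/2, which is a quick consistency check on the parameterization.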

Spatial Object Tracking Using Enhanced Mean Shift.
In order to limit the possibility that background pixels appear in the tracking object, we use a relatively small elliptical area, in which the contract domain is determined by the factor ; in our experiments,  = 0.7. The elliptical area can be defined as follows:
In our proposed method, we should determine whether the previous frame motion area (Object ) will be used to guide the next frame object (Object ) tracking. If the number of continuous characteristic pixels and the Bhattacharyya coefficient for  are both higher, the initial rectangular tracking area for  will refer to 's settings, and the tracking area will be determined by the previous frame mean shift. Consider the following:
where  (2)  1 and  (2)  2 determine the threshold value. In order to solve drift and error propagation, we use enhanced mean shift for frame resampling. The object tracking resampling operation can be summarized as follows:
where   ←
where  (2)  , and  (2)  −1, are the four-dimensional motion areas in frames  and −1, and  (2)  3 and  (2)  4 are the threshold values determined by the graphic distance and the similarity shape.
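The two decisions described above, restricting attention to a contracted ellipse and deciding whether the previous frame should guide the next one, can be sketched as follows. The way the contraction factor scales both semi-axes, the strict threshold comparisons, and the names `inside_shrunken_ellipse`, `use_previous_box`, `t1`, and `t2` are assumptions for illustration; only the factor value 0.7 comes from the text.

```python
import math


def inside_shrunken_ellipse(px: float, py: float, cx: float, cy: float,
                            w: float, h: float, theta: float,
                            k: float = 0.7) -> bool:
    """True if (px, py) lies in the ellipse with semi-axes k*w/2 and k*h/2."""
    dx, dy = px - cx, py - cy
    c, s = math.cos(theta), math.sin(theta)
    u = c * dx + s * dy       # coordinate along the width axis
    v = -s * dx + c * dy      # coordinate along the height axis
    a, b = k * w / 2, k * h / 2
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0


def use_previous_box(n_feature_pixels: int, bhattacharyya_coeff: float,
                     t1: int, t2: float) -> bool:
    """Reuse the previous frame's tracking area only if both cues are strong."""
    return n_feature_pixels > t1 and bhattacharyya_coeff > t2
```

With a 20 × 10 box and the factor 0.7, for example, a point 7.5 pixels from the center along the width axis lies inside the full ellipse but outside the contracted one, so it would be treated as potential background.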

Experimental Result
We conducted experiments on four different spatial video sequences, in which the tracking objects are spatial satellites and aircraft. We normalized the sequence images to the same spatial resolution; each frame is 320 × 256. The superiority of our proposed algorithm is validated by both intuitive performance and objective evaluation.

Contour Evolution in Prototype Pyramid.
Contour evolution experimental results are shown in Figure 2. Each prototype pyramid layer is calculated independently. The results show that the contour evolution has better continuity in the layer-by-layer prototype pyramid, and the approximate effects derived from the perceptual spatial-space generation model are closer to the human visual perception system.

Markov Random Field Representation.
The Markov random field representation is shown in Figure 3. The significant representative region derived from the perceptual spatial-space generation model contains significant feature information, which is closely related to the different color and texture distributions in the spatial area. The experimental results show that the smoothed region representation highlights the sensitivity of the motion area more than the initial representation.

Enhanced Mean Shift Object Tracking.
We conducted experiments on ten video sequences, in which the tracking objects include spatial satellites and aircraft, highway and park surveillance, and the human body. The enhanced mean shift performed 20 iterations on each video sequence. The normalized matrix bandwidth parameters are determined by the experimental sample. In our experiments, the value is 0.63 for video sequences 1, 2, and 3, 0.42 for video sequences 4 and 6, 0.56 for video sequences 5 and 8, 0.21 for video sequence 7, and 0.60 for video sequences 9 and 10.

Satellite-1, 2, and 3.
Tracker-1: the particle filter has a negative impact in the horizontal direction, and the motion estimation appears discontinuous. Tracker-2: the distance metric learning affects motion area determination, and the rectangular window tracking produces some deviation. Our proposed algorithm tracks the spatial objects better, and satellites and aircraft are completely contained in the rectangular window even with a similar color distribution and background quiver. The edge deviation variance is well controlled using spatial object detection, with no offset or blurring.

Automobile-1 and 2.
In a nighttime environment, because of the weak light and low brightness, Tracker-1 and Tracker-2 obtain a vague tracking result, and the confusion area between the background and the tracking object becomes larger. In the Automobile-2 sequence in particular, because of the illumination from the front lamps of other automobiles, Tracker-1 produces a particularly serious deviation. Our proposed enhanced mean shift method can distinguish the tracking object from background confusion, and the tracking results do not deviate noticeably.
The red (solid line) box is from our proposed method, the blue (dashed line) box is Tracker-1 from the particle filters in [27], and the green (dashed line) box is Tracker-2 from the online distance metric learning in [28]. Rows 1-10 contain images from the videos "Satellite-1," "Satellite-2," "Satellite-3," "Automobile-1," "Running," "Automobile-2," "Highway," "Walking," "Automobile-3," and "Walking-2," respectively.

Highway.
Tracker-1: this method loses the tracking center in some frames. Some rectangular windows cross between the far-view and near-view sequences, and shape errors exist in the rectangular window estimations. Tracker-2: the tracking results also show a deviation in the far view. Our proposed method maintains consistency between the different views and does not show large deviations.

Running, Walking.
There are no obvious differences between the methods, except that for some slow motion (a walking body) Tracker-2 shows partial deviations. Our proposed method is more robust when dealing with scenarios of two or more moving objects.

Objective Evaluation.
We use three objective evaluations for evaluating the object tracking performance (Figure 4).

Euclidian Distance.
The Euclidian distance is the distance between the rectangular window obtained by a tracking method and the manually marked window. The specific calculation is as follows:
where ( , ,  , ), ( , ,  , ), and  = 1, 2, 3, 4 are the corner coordinates of the rectangular windows calculated by the tracking methods and manually marked, respectively. Figure 5 shows the Euclidian distance between the tracked and manually marked areas for our proposed method and the two other trackers on the videos.
The averaged Euclidian distance calculated on all videos is shown in Table 1.Compared to the distance values from Tracker-1 and Tracker-2, our proposed method clearly shows smaller and bounded Euclidian distances for the tested videos.
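As an illustration of the corner-based Euclidian distance used in this subsection, the following sketch compares the four corners of a tracked window with those of a manually marked window. Averaging over the four corners is an assumption, since the source formula is not reproduced here, and `corner_distance` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def corner_distance(tracked: Sequence[Point], marked: Sequence[Point]) -> float:
    """Mean Euclidean distance between the four corresponding corners."""
    if len(tracked) != 4 or len(marked) != 4:
        raise ValueError("each window needs exactly four corners")
    total = sum(math.hypot(xt - xm, yt - ym)
                for (xt, yt), (xm, ym) in zip(tracked, marked))
    return total / 4
```

The corners must be listed in the same order for both windows; a per-video figure such as the one reported in Table 1 would then be the mean of this value over all frames.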

Mean Square Errors (MSEs).
We have MSE
where (   ,    ) and (   ,    ) are the centers of the tracking area obtained by the methods and manually marked, respectively, and  is the total number of video sequence frames. The experimental results are shown in Table 2.
The Bhattacharyya distance between the tracked object area and the manually marked region is shown in Figure 6. Among nine case studies, our proposed method shows a marked improvement in tracking accuracy compared with the two existing trackers (Tracker-1 and Tracker-2). This is mainly due to our combination of the perceptual spatial-space generation model and the enhanced mean shift. The averaged Bhattacharyya distance on the different videos is shown in Table 3. Our proposed method has the smallest average tracking deviation among the methods.
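The center-based MSE introduced at the start of this subsection can be sketched as the mean squared distance between tracked and manually marked centers over all frames. Summing the squared errors of both coordinates before averaging is an assumption about the lost formula, and `center_mse` is a hypothetical name.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def center_mse(tracked_centers: Sequence[Point],
               marked_centers: Sequence[Point]) -> float:
    """Mean over frames of the squared distance between the two centers."""
    n = len(tracked_centers)
    if n == 0 or n != len(marked_centers):
        raise ValueError("need the same positive number of centers")
    return sum((xt - xm) ** 2 + (yt - ym) ** 2
               for (xt, yt), (xm, ym) in zip(tracked_centers, marked_centers)) / n
```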

Conclusions
In this paper, we propose a new intelligent modeling method using the enhanced mean shift method based on a perceptual spatial-space model for spatial object tracking. The perceptual spatial-space model can obtain a continuous spatial object contour and highlight tracking object saliency. The enhanced mean shift method extends mean shift with an estimation of spatial shape parameters and can effectively cope with severe spatial interference. The comparison between our method and other state-of-the-art methods demonstrates that our proposed method has higher tracking accuracy and precision.
In future research, we can incorporate more spatial object information, such as spatial textures and aircraft shapes, into our intelligent model to generate a more robust spatial object tracking method.

Figure 1: Enhanced mean shift method based on a perceptual spatial-space generation model.

Figure 2: Contour evolution on spatial video sequences. (a) Original video frame and (b)-(d) contour evolution from a coarse to fine scale.

Figure 3: Markov random field representation in the perceptual spatial-space generation model. (a) Original frame and (b)-(c) initial and global smoothing representation.
,  |  , −1 ) simulates the interaction value between the two adjacent areas  ,  and  ) is the weight of the interactive area. In this paper, we use a sequence of Monte Carlo simulations to estimate the interactive part (

Table 1: Results of the averaged Euclidian distance over all frames in each video.
These cases are used to further test the robustness of our proposed method in scenes containing two or more moving objects. We observe that Tracker-2 produces less accurate boxes, probably because of its poor estimate of the center positions of different moving objects in the same scene. The performance of Tracker-1 is somewhat better; however, it also produces partial deviations.

Table 2: Results of averaged MSE errors for the tracked box from our proposed method, Tracker-1, and Tracker-2.
Our proposed method has the minimum MSE, which means it has the lowest tracking deviation. Our method has obvious advantages compared to other methods.
4.4.3. Bhattacharyya Distance.
The Bhattacharyya distance is used to judge the deviation degree between the tracking area and the actual motion area. The specific calculation method is as follows:
where mean  and mean  are the mean vectors of the tracking area calculated by our method and manually marked, respectively, and cov  and cov  are the corresponding covariance matrices of the tracking area calculated by our method and manually marked.
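The definition above is stated in terms of mean vectors and covariance matrices, which matches the standard Bhattacharyya distance between two Gaussian distributions. The sketch below assumes that standard form for two-dimensional regions, since the source formula itself is not reproduced here; the name `bhattacharyya_distance` and the argument names are hypothetical.

```python
import math
from typing import Sequence


def _det2(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def bhattacharyya_distance(mean_a: Sequence[float],
                           cov_a: Sequence[Sequence[float]],
                           mean_b: Sequence[float],
                           cov_b: Sequence[Sequence[float]]) -> float:
    """Bhattacharyya distance between two 2D Gaussians.

    D = (1/8) d^T C^-1 d + (1/2) ln(det C / sqrt(det A * det B)),
    where d is the difference of the means and C = (A + B) / 2.
    """
    c = [[(cov_a[i][j] + cov_b[i][j]) / 2 for j in range(2)] for i in range(2)]
    det_c, det_a, det_b = _det2(c), _det2(cov_a), _det2(cov_b)
    if min(det_c, det_a, det_b) <= 0:
        raise ValueError("covariance matrices must be positive definite")
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # Quadratic form d^T C^-1 d via the closed-form inverse of a 2 x 2 matrix.
    quad = (c[1][1] * dx * dx - (c[0][1] + c[1][0]) * dx * dy
            + c[0][0] * dy * dy) / det_c
    return quad / 8 + 0.5 * math.log(det_c / math.sqrt(det_a * det_b))
```

The first term grows with the offset between the two region centers and the second with the mismatch in their shape, so a value of zero means the tracked and marked regions have identical first- and second-order statistics.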