Video Shot Boundary Recognition Based on Adaptive Locality Preserving Projections

A novel video shot boundary recognition method is proposed, which includes two stages: video feature extraction and shot boundary recognition. Firstly, we use adaptive locality preserving projections (ALPP) to extract video features. Unlike locality preserving projections, we define the discriminating similarity with mode prior probabilities and an adaptive neighborhood selection strategy, which make ALPP more suitable for preserving the local structure and label information of the original data. Secondly, we use an optimized multiple kernel support vector machine to classify video frames into boundary and nonboundary frames, in which the weights of different types of kernels are optimized with an ant colony optimization method. Experimental results show the effectiveness of our method.


Introduction
Video shot boundary recognition is a fundamental step towards video summarization and analysis. Many boundary recognition methods have already been presented [1, 2]. A common way to recognize a shot boundary is to compare the difference between two adjacent frames with a threshold. In paper [3], the abrupt shot boundary is detected based on an adaptive threshold and the gradual transition boundary is detected with a set of standard templates. Warhade et al. [4] detected shot boundary with cross-correlation coefficient, stationary wavelet transform, and combination of local. Thakar and Hadia [5] proposed a new gradual shot detection method in which the threshold can be adaptively determined based on the total information change of video frames. Huo et al. [6] used a statistical model of the video frame differences to determine the adaptive threshold. In paper [7], the threshold is automatically determined according to the quantified magnitude of color differences. Warhade et al. [8] first extracted structure features from each video frame by using the dual-tree complex wavelet transform and then decided the shot boundary based on the spatial domain similarity. In order to reduce the computation, Gao and Ma [9] used color histogram and mutual information to measure the difference between frames and then used the corner distribution of frames to exclude most of the false boundaries.
The main disadvantage of these methods is that they are susceptible to the choice of thresholds, which can cause mistakes for some complicated long gradual shots. To resolve this problem, video shot recognition has been treated as a categorization task. In paper [10], a fuzzy logic method is used to detect shot boundary. This method contains two processing modes, where one is dedicated to the detection of abrupt shots and the other to the detection of gradual shots. In paper [11], video features including HSV (hue, saturation, value), edge orientation, and texture features are obtained, and then the Kohonen self-organized network is used to recognize shot boundary. Huang et al. [12] classified video frames with a radial basis function neural network. Mohanta et al. [13] used a multilayer perceptron network to classify video frames based on a local feature matrix. To improve the recognition performance, Li et al. [14] first removed the frames that were clearly not shot boundaries from the original video and then used a novel SIFT key point matching algorithm to detect shot boundary. Zhao et al. [15] used a context feature vector and Tabu-SVM to recognize shot boundary. In paper [16], the proposed approach first detected general shot boundaries with the Fisher criterion and then classified the cut and gradual shots with SVM. In order to improve the effect of SVM, Zhao et al. [17] optimized the parameters of SVM with a particle swarm method. In addition, Lankinen and Kamarainen [18] detected shot boundary using a visual bag-of-words approach. Donate and Liu [19] extracted salient features from a video sequence and tracked them over time to estimate shot boundaries. Li and Chen [20] recognized shot boundary with macroblock type information, which saves considerable computation cost.
In this paper, we present a novel method to improve the shot boundary recognition accuracy. Firstly, based on the analysis of LPP, we present an adaptive LPP to extract more useful and discriminating features. Secondly, we recognize shot boundary with an optimized multiple kernel support vector machine.
The rest of this paper is organized as follows. Section 2 summarizes the theoretical fundamentals of LPP. In Section 3, we extract the video feature with the improved LPP. Section 4 uses an optimized multiple kernel SVM to recognize shot boundary. Experiments that evaluate the presented method are reported in Section 5, and the paper is concluded in Section 6.

Locality Preserving Projection.
Locality preserving projection (LPP) is a dimensionality reduction method that can be explained by graph theory [21]. Assume there is a data point set $X = [x_1, x_2, \ldots, x_n]$ in a high-dimensional space; we try to find a projection matrix $A$ to project these data points into a low-dimensional subspace, and the projection can be expressed as $y_i = A^T x_i$.
The objective function of LPP is as follows:
$$\min \sum_{i,j} \| y_i - y_j \|^2 W_{ij}, \tag{1}$$
where the weight matrix $W$ can be defined as follows:
$$W_{ij} = \begin{cases} \exp\left(-\| x_i - x_j \|^2 / t\right), & x_i \text{ and } x_j \text{ are neighbors} \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
Then, the objective function of LPP can be converted into the following minimization problem:
$$\min_{A} \operatorname{tr}\left(A^T X L X^T A\right) \quad \text{s.t. } A^T X D X^T A = I, \tag{3}$$
where $D_{ii} = \sum_j W_{ij}$ is a diagonal matrix and $L = D - W$ is a Laplacian matrix. Lastly, the projection matrix $A$ can be obtained by solving a generalized eigenvalue problem as follows:
$$X L X^T a = \lambda X D X^T a. \tag{4}$$
Let the column vectors $a_1, a_2, \ldots, a_d$ be the solutions of (4), ordered according to their eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$; we can define the following transformation form:
$$x_i \to y_i = A^T x_i, \qquad A = [a_1, a_2, \ldots, a_d]. \tag{5}$$

2.2. Weight Definition.
In LPP, the weight between two points is defined to be either a simple 0/1 value or a heat kernel, which cannot reflect the class information. Given $X = [x_1, x_2, \ldots, x_n]$, let $l_i$ and $N(x_i)$ be the label and the $k$ nearest neighbors of the point $x_i$. Li et al. [22] presented an orthogonal discriminating projection (ODP), in which the weight $W_{ij}$ between two points is defined from $d(x_i, x_j)$, the geodesic distance between points $x_i$ and $x_j$, and a regularization parameter $\beta$.
From Figure 1 we find that $W_{ij}$ is not a monotonically decreasing function of $d^2(x_i, x_j)/\beta$, which is due to the fact that $W_2$ is a nonmonotonic function. When $0 < d^2(x_i, x_j)/\beta < 0.71$, $W_{ij}$ is monotonically increasing. In actual applications, $W_{ij}$ should decrease as $d^2(x_i, x_j)/\beta$ increases. Zhang et al. [23] proposed a modified ODP (MODP) with correlation coefficient. Obviously, the above definitions of $W_{ij}$ consider only the space structure and not the manifold structure. Meanwhile, the fixed $k$ nearest neighbors do not reflect the real information of the manifold structure.
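As a rough illustration of the graph quantities that LPP relies on, the following Python sketch builds a symmetric nearest-neighbor graph with heat-kernel weights, the diagonal matrix of row sums, and the Laplacian, and then evaluates the LPP objective for a one-dimensional embedding. The function names, the heat-kernel parameter `t`, and the symmetric neighbor rule are assumptions made for this example; the generalized eigenvalue step is omitted.

```python
import math


def heat_kernel_graph(points, k=2, t=1.0):
    """Return (W, D, L) for a symmetric k-nearest-neighbor heat-kernel graph."""
    n = len(points)

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # k nearest neighbors of each point, excluding the point itself.
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sqdist(points[i], points[j]))
        knn.append(set(order[:k]))

    # Heat-kernel weight if either point is among the other's neighbors.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in knn[i] or i in knn[j]):
                W[i][j] = math.exp(-sqdist(points[i], points[j]) / t)

    # D holds the row sums of W on its diagonal; L = D - W.
    D = [[sum(W[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]
    return W, D, L


def lpp_objective(y, W):
    """Sum over i, j of (y_i - y_j)^2 * W_ij for a one-dimensional embedding y."""
    n = len(y)
    return sum((y[i] - y[j]) ** 2 * W[i][j] for i in range(n) for j in range(n))
```

For a symmetric weight matrix the objective equals twice the quadratic form of the Laplacian, which is why minimizing it leads to an eigenvalue problem in the Laplacian.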

Video Feature Reduction
In this section, we extract more discriminating video features from the original color, shape, and texture features. Firstly, we propose an improved LPP. We adaptively select the nearest neighbors of each point and introduce the mode information into the new weight similarity, which makes the weight a monotonically decreasing function of distance. The major merit of ALPP is that it preserves the local structure and label information of the original data. Then we use ALPP to extract more discriminating video features for shot boundary recognition.

Mode Detection.
In order to preserve the mode information of data points, we use the median-shift method, which relies on computing the median of local neighborhoods instead of the mean, to detect modes. Since the median of a set is a point in the set, the method is more robust than the mean-shift method. Most importantly, the median-shift method is a nonparametric method that neither requires prior knowledge of the number of clusters nor places any limitation on the shape of the clusters. The process of mode detection is as follows [24]. Suppose we are given a set $X = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, $d > 1$; the Tukey depth of a point is defined to be the minimum of its depth along any projection vector $v$.
Firstly, the median of $X$ is an element with maximal depth. Then we seek the modes with the median-shift algorithm. For each point we wish to ascend in the direction of the positive gradient of the underlying probability density function. The median-shift for a point $x$ in the set $X$ is defined as the median of its local neighborhood, where $h$ is a bandwidth parameter. Since the median-shift necessarily uses a point in the dataset, there is no need for multiple iterations in this step. After one iteration all points are linked, and we only need to go through the list of discovered medians to find a mode. The result of this step is a set of modes representing clusters.
Next, we proceed by iteratively working on the reduced set of modes, replacing the median calculation by a weighted median calculation until convergence, where the weight of a mode is the number of points mapped to it. The weights are taken into account during the calculation of the depth of each point in the next iteration, with $W = \sum_i \mathrm{Weight}(x_i)$ defined as the total weight in the neighborhood of the point. Finally, in the case of data clustering, and not only mode detection, we map each data point to its closest mode. Let $m_i$ be the mode of the point $x_i$, and let $M$ be the set of modes.
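The depth-based median used by median-shift can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the points are two-dimensional, the projection vectors are sampled on a regular grid of angles rather than taken over all directions, and the depth along one direction is taken to be the number of points whose projection does not exceed that of the query point. The names `tukey_depth` and `tukey_median` and the parameter `n_dirs` are hypothetical.

```python
import math


def tukey_depth(point, points, n_dirs=180):
    """Approximate Tukey depth of `point` within a list of 2-D `points`."""
    best = len(points)
    for k in range(n_dirs):
        theta = 2.0 * math.pi * k / n_dirs
        vx, vy = math.cos(theta), math.sin(theta)
        proj = point[0] * vx + point[1] * vy
        # Count the points on or behind the query point along this direction;
        # the small tolerance absorbs floating-point noise in cos and sin.
        count = sum(1 for (x, y) in points if x * vx + y * vy <= proj + 1e-12)
        best = min(best, count)
    return best


def tukey_median(points, n_dirs=180):
    """Return the element of `points` with maximal approximate depth."""
    return max(points, key=lambda p: tukey_depth(p, points, n_dirs))
```

Because the median is always one of the input points, a median-shift step built on it maps every point to another member of the data set, which is the property the text uses to avoid repeated iterations.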

Adaptive Neighborhood Selection.
Since a fixed $k$ neighborhood does not reflect the mode information of the manifold structure, we apply an adaptive strategy to select the neighborhood of each point. Firstly, we define the manifold adjusted length of a line segment as
$$L(x_i, x_j) = \begin{cases} \rho^{\operatorname{dist}(x_i, x_j)} - 1, & x_i \text{ and } x_j \text{ are neighboring points} \\ \infty, & \text{else,} \end{cases} \tag{13}$$
where $\operatorname{dist}(x_i, x_j)$ is the Euclidean distance between $x_i$ and $x_j$ and $\rho$ is a flexing factor. Obviously, this formulation can be utilized to describe the global consistency. In addition, the length of the line segment between two points can be elongated or shortened by adjusting the flexing factor $\rho$ [25].
Then, let the data points be the nodes of a graph $G = (V, E)$ and let $p \in V^l$ be a path of length $l = |p| - 1$ connecting the nodes $p_1$ and $p_{|p|}$, in which $(p_k, p_{k+1}) \in E$, $1 \le k < |p|$. Let $P_{ij}$ denote the set of all paths connecting nodes $x_i$ and $x_j$; the manifold distance metric between two points is defined as follows:
$$D(x_i, x_j) = \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} L(p_k, p_{k+1}), \tag{14}$$
where $L(\cdot, \cdot)$ denotes the manifold adjusted length of a line segment.
Next, the average manifold distance of point $x_i$ is defined as follows:
$$\bar{D}(x_i) = \frac{1}{n_i} \sum_{m_j = m_i} D(x_i, x_j), \tag{15}$$
where $n_i$ is the total number of points that meet the condition $m_j = m_i$.
Lastly, the adaptive neighborhood of point $x_i$ is constructed as
$$N(x_i) = \left\{ x_j \mid D(x_i, x_j) < \bar{D}(x_i) \right\}, \tag{16}$$
which shows that the neighborhood of point $x_i$ is adaptively built from the points whose distance is shorter than the average distance.
The major merits of the adaptive neighborhood selection method can be summarized as follows.
(1) The manifold distance metric can measure the geodesic distance along the manifold, which elongates the distance among data points in different regions of high density and simultaneously shortens that in the same region of high density.
(2) The neighborhood of each point is different from the others and is decided by the local density of the original space. When the local density around $x_i$ is lower, the neighborhood is larger, and vice versa.
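The adaptive neighborhood selection described above can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation: it assumes the manifold adjusted length of an edge is the flexing factor raised to the Euclidean distance minus one, computes shortest paths with the Floyd-Warshall algorithm, and averages the manifold distance over the reachable points that share a mode with the query point. The names `manifold_distances`, `adaptive_neighborhood`, `neighbors`, and `modes` are hypothetical.

```python
import math


def manifold_distances(points, neighbors, rho=2.0):
    """All-pairs manifold distance over the neighbor graph (Floyd-Warshall).

    `neighbors[i]` lists the indices adjacent to point i; an edge between
    i and j has adjusted length rho ** dist(x_i, x_j) - 1.
    """
    n = len(points)
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            length = rho ** math.dist(points[i], points[j]) - 1.0
            dist[i][j] = min(dist[i][j], length)
            dist[j][i] = min(dist[j][i], length)
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


def adaptive_neighborhood(dist, modes, i):
    """Indices in the same mode as i that are closer than the average distance."""
    same = [j for j in range(len(dist))
            if j != i and modes[j] == modes[i] and dist[i][j] < math.inf]
    if not same:
        return []
    average = sum(dist[i][j] for j in same) / len(same)
    return [j for j in same if dist[i][j] < average]
```

With points at 0, 1, 2, and 6 on a line connected as a chain and a flexing factor of 2, the long last edge has adjusted length 15 while the short edges have length 1, so the first point keeps only the two nearby points as neighbors.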

Improve Weight Definition.
In order to resolve the problem described in Section 2.2, we improve the weight definition $W_{ij}$ between two points $x_i$ and $x_j$, where $d(x_i, x_j)$ denotes the distance between points $x_i$ and $x_j$, $\beta_1$ and $\beta_2$ are the regularization parameters, and $l_i$ is the label of the point $x_i$. For the definition of $N(x_i)$, please refer to (16).
Similar to paper [26], let $\exp(-d^2(x_i, x_j)/\beta_1)$ be the local weight, and let $\exp(-d^2(x_i, x_j)/\beta_2)$ be the intermode discriminating weight. The new weight definition can be viewed as combining the local weight and the discriminating weight. It means that the discriminating similarity reflects both the local neighborhood structure of the modes and the label information of the data set.
The properties and the corresponding advantages of the improved weight definition can be summarized as follows.
(1) The improved weight definition makes use of the label information and mode information to preserve the manifold information, which is very important for classification.
(2) Since the value of $\exp(-d^2(x_i, x_j)/\beta_2)$ ranges from 0 to 1, no matter how far apart the two points are, the intermode similarity is limited to a certain range.

Video Feature Reduction.
In order to enhance the discriminating information for shot boundary recognition, we combine the label information and mode information to improve the discriminating ability and preserve the local neighborhood structure of the original data. With the similarity matrix $W$ introduced, we define the local scatter matrix $S_L$ following [23]. Then we define the nonlocal scatter matrix $S_N$, where
$$S_T = \sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)(x_i - x_j)^T, \qquad S_N = S_T - S_L.$$
Lastly, the objective function of the improved LPP is expressed with these scatter matrices and an adjustable factor, so that $A$ consists of the eigenvectors associated with the top $d$ eigenvalues of the corresponding eigen-equation. The algorithmic procedure of video feature extraction is stated below.
(1) Extract the original video features, including the color, shape, and texture features.
(2) Perform PCA projection. To avoid a singular matrix, we project the dataset into a PCA subspace with a transformation matrix.
(3) Define the similarity matrix. For each point $x_i$, compute the similarity $W_{ij}$ with the improved weight definition.
(4) Compute the diagonal matrix $D$ and the Laplacian matrix $L$, and then compute the top $d$ eigenvalues and their corresponding eigenvectors based on (20).
(5) Perform the ALPP transformation. Let $A$ be the optimal projection matrix; we can project new data into the low-dimensional space with it.

Shot Boundary Recognition
The process of shot boundary detection includes two steps. Firstly, we extract the video features using the ALPP method. Secondly, we detect shot boundaries using an optimized MKSVM.
4.1. Multiple Kernel SVM.
Support vector machines are a family of pattern classification algorithms based on the idea of structural risk minimization rather than empirical risk minimization [27]. However, it is often unclear what the most suitable kernel for the task at hand is. Recently, multiple kernel learning has been used to train different kernels by jointly optimizing both the coefficients of the classifier and the weights of the kernels, which is more effective for object recognition than a single-kernel SVM [28]. In this paper, we combine several possible kernels to improve the precision of shot boundary recognition. Let $\mu = [\mu_1, \mu_2, \ldots, \mu_M]^T$ be a vector of weights for the mixture of kernels. A multiple kernel is the combination of the $M$ basis kernels
$$K(x_i, x_j) = \sum_{m=1}^{M} \mu_m K_m(x_i, x_j).$$
Assume there is a data set $S = \{x_i, y_i\}_{i=1}^{n}$ of labeled examples, where $x_i \in X$ is the input vector and $y_i \in \{-1, 1\}$.
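To make the weighted kernel mixture concrete, the following Python sketch combines a linear, a polynomial, and a radial basis kernel, the three kernel types used in the experiments, into one kernel value and a Gram matrix. The kernel parameterizations (polynomial offset `c`, the `exp(-||x - z||^2 / (2 * sigma2))` form of the radial basis kernel) and all function names are assumptions for illustration; the weights are fixed by hand here, whereas the paper optimizes them.

```python
import math


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, c=1.0):
    return (linear_kernel(x, z) + c) ** degree


def rbf_kernel(x, z, sigma2=0.5):
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma2))


def combined_kernel(x, z, weights, kernels):
    """Weighted sum of basis kernels evaluated at the pair (x, z)."""
    return sum(w * k(x, z) for w, k in zip(weights, kernels))


def gram_matrix(samples, weights, kernels):
    """Gram matrix of the combined kernel over a list of samples."""
    return [[combined_kernel(x, z, weights, kernels) for z in samples]
            for x in samples]
```

A Gram matrix built this way can be handed to any SVM solver that accepts precomputed kernels; with nonnegative weights the combination of valid kernels remains a valid kernel.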
According to paper [29], the primal form of the multiple kernel support vector machine (MKSVM) is formulated as an optimization problem. Similar to the SVM, this minimization problem can be transformed into a dual problem, from which the decision function of MKSVM is computed for a test input $x$.

4.2. Ant Colony Optimization Method.
The ant colony optimization method (ACO) is an optimization method inspired by the foraging behavior of ant colonies [30]. When ants walk between their nest and a food source, they mark the paths with a special kind of chemical termed pheromone, and the shorter paths attract more and more pheromone [31]. In the method, an ant determines its transfer direction according to the amount of pheromone on each path. Firstly, every ant constructs a path from a start vertex to an end vertex. Then, when all ants have reached the end vertex, the edges are marked with a pheromone quantity. Thus the colony can converge to the shortest path [32]. In this paper, we apply the ant colony optimization method to solve the weight optimization problem in MKSVM.
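Let $p_{ij}^{k}(t)$ be the probability with which ant $k$ transfers from element $i$ to element $j$ at iteration $t$. $p_{ij}^{k}(t)$ can be defined as
$$p_{ij}^{k}(t) = \begin{cases} \dfrac{[\tau_{ij}(t)]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{s \in \mathrm{allowed}_k(t)} [\tau_{is}(t)]^{\alpha} [\eta_{is}]^{\beta}}, & j \in \mathrm{allowed}_k(t) \\ 0, & \text{otherwise,} \end{cases} \tag{27}$$
where $\mathrm{allowed}_k(t)$ is the set of elements that have not yet been visited, $\tau_{ij}$ is the pheromone quantity of path $(i, j)$, $\eta_{ij}$ is a heuristic measure of moving from element $i$ to element $j$, and $\alpha$ and $\beta$ are two parameters that control the relative importance of the information heuristic factor and the expectation heuristic factor, respectively. All $m$ ants deposit pheromone on the paths step by step. After time $t$, the pheromone quantity $\tau_{ij}(t)$ associated with the edge joining elements $i$ and $j$ is updated according to the following formula:
$$\tau_{ij}(t+1) = (1 - \rho)\,\tau_{ij}(t) + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}(t),$$
where $\rho$ is a pheromone evaporation loss coefficient and $\Delta\tau_{ij}^{k}(t)$ is the pheromone quantity deposited at iteration $t$ by ant $k$ on the edge joining elements $i$ and $j$. $\Delta\tau_{ij}^{k}(t)$ is usually defined as
$$\Delta\tau_{ij}^{k}(t) = \begin{cases} Q / L_k, & \text{if ant } k \text{ passes edge } (i, j) \text{ at iteration } t \\ 0, & \text{otherwise,} \end{cases}$$
where $Q$ is a constant and $L_k$ is the cost function of the $k$th ant.
For the process of the ant colony optimization method, please refer to paper [33].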
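The two rules that drive the ant colony method, the transition probability and the pheromone update, can be sketched in Python as below. This is a generic Ant System style illustration under assumptions, not the weight-optimization procedure of the paper: pheromone evaporates by the factor `1 - rho`, each ant deposits `q / cost` on every edge of its tour, and the names `transition_probabilities`, `update_pheromone`, `tours`, and `costs` are hypothetical. All pheromone and heuristic values are assumed positive.

```python
def transition_probabilities(i, allowed, tau, eta, alpha=1.0, beta=5.0):
    """Probability of moving from element i to each not-yet-visited element."""
    scores = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in allowed}
    total = sum(scores.values())
    return {j: s / total for j, s in scores.items()}


def update_pheromone(tau, tours, costs, rho=0.5, q=100.0):
    """Evaporate pheromone, then add each ant's deposit along its tour."""
    n = len(tau)
    new_tau = [[(1.0 - rho) * tau[i][j] for j in range(n)] for i in range(n)]
    for tour, cost in zip(tours, costs):
        # Consecutive elements of the tour are the edges the ant traversed.
        for i, j in zip(tour, tour[1:]):
            new_tau[i][j] += q / cost
    return new_tau
```

A lower cost yields a larger deposit, so edges on good solutions accumulate pheromone and are chosen with higher probability in later iterations.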

Shot Boundary Recognition.
The process of shot boundary recognition includes the following stages. Firstly, we extract the original features. Then we reduce the video features with the improved LPP. Lastly, we classify each frame as a boundary or nonboundary frame with the MKSVM classifier optimized by the ACO method.
The algorithm of video boundary recognition is described as follows.
(1) Extract the original video features, which include color, shape, and texture features.
(6) Update $N_c$ by $N_c = N_c + 1$ and choose the element $j$ as the ant's transfer direction according to function (27).

Experiments and Analysis
In this section we present some experiments to validate the proposed approach. Firstly, we investigate the performance of the proposed ALPP method in a video feature extraction experiment. Then we recognize shot boundaries with the proposed method. The video database includes movie, news, sports, documentary, and MTV videos. These types of video were selected because news videos have many long abrupt shots, MTV has fast changes of scenes, sports videos have fast camera movements and zooming in, and movies include many gradual shots. Some shot samples are shown in Figure 3. For evaluation, we use the common figures of merit, standard precision and recall [34]:
$$\text{Precision} = \frac{\#(\text{Boundaries correctly detected})}{\#(\text{Total boundaries detected})}, \qquad \text{Recall} = \frac{\#(\text{Boundaries correctly detected})}{\#(\text{Total ground truth boundaries})}. \tag{31}$$
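The two evaluation measures can be computed with a short Python helper such as the one below. It is a sketch under assumptions: boundaries are given as frame indices, a detection counts as correct when it lies within `tolerance` frames of a ground truth boundary, and each detection may match at most one ground truth boundary. The function name and the tolerance parameter are hypothetical and are not taken from the evaluation protocol of [34].

```python
def precision_recall(detected, ground_truth, tolerance=0):
    """Return (precision, recall) for detected versus true boundary frames."""
    unmatched = sorted(detected)
    correct = 0
    for boundary in sorted(ground_truth):
        for frame in unmatched:
            if abs(frame - boundary) <= tolerance:
                # Consume the detection so it cannot match a second boundary.
                unmatched.remove(frame)
                correct += 1
                break
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, four detections of which two fall within the tolerance of three true boundaries give a precision of one half and a recall of two thirds.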

Video Feature Extraction Experiment.
In order to verify the effectiveness of the adaptive LPP (ALPP), we extract video features with the ODP, MODP, and ALPP methods. For convenience of comparison, we use the same method as in paper [35] to detect shot boundaries. In the experiment, the original video feature is built on color, shape, and texture features. Then the ODP, MODP, and ALPP methods are used to extract video features with  = 0.5 and  = 0.6. In the ODP and MODP methods we adopt the $k$ nearest neighbor criterion to define the adjacency matrix, in which $k$ is set to 10. The results of the performance comparison among ODP, MODP, and ALPP are shown in Table 1.
From Table 1, we can find that ALPP obtains comparable recognition performance to LPP and ODP with the same shot boundary detection method. In the ALPP method, the improved weight definition combines the label information and mode information with the adaptive nearest neighbor selection strategy, which is very important for reflecting the data information truthfully. The experiment shows that ALPP extracts more useful and discriminating video features than the other methods.

Shot Boundary Recognition Experiment.
In order to investigate the performance of the proposed method, we recognize different video shots, especially gradual shots. The system performance is compared with the multilayer perceptron network method (MPN method) [36]. We use the original video feature and set the parameters  and  to be the same as before. In the proposed method, we use a polynomial kernel, a radial basis kernel, and a linear kernel to build the MKSVM and set  = 3,  2 = 0.005,  = 1,  = 5,  = 0.5, and  = 100. The experimental results are summarized in Figure 4.
From Figure 4, we find that both methods detect not only abrupt cuts but also gradual shots well, but the proposed method achieves better performance than the MPN method for shot boundary recognition. The average precision and recall of the proposed method reach 94.1% and 91.7%, which are higher by 3.5% and 3.1% than those of the MPN method, respectively. These results demonstrate that the proposed method, with its optimized MKSVM, is a good tool for shot boundary recognition.

Discussion
Two experiments on different types of video have been systematically performed, and we can now conclude the following.
(1) We improve the effect of shot boundary detection in two stages. In the feature extraction stage, we use ALPP to extract more useful and discriminating video features.
(2) For feature extraction, the proposed ALPP performed better than LPP and ODP. This is because the former makes use of the label information and mode information with an adaptive nearest neighbor selection strategy. At the same time, the improved weight definition guarantees that two near points from different modes have a smaller similarity.
(3) Compared with the MPN method, the proposed method yields better performance on shot boundary recognition. This owes much to the optimized MKSVM, in which the parameters are optimized by the ant colony method. It should be noted that there are some false detection results in the above methods, which may be due to irregular object movement and small content changes between consecutive frames.

Conclusion
In this paper, we present a new video shot boundary recognition method, which focuses on two key problems: extracting more useful and discriminating features and improving the accuracy of the shot boundary classifier. The major contribution of the paper is an optimized locality preserving projection method with mode detection and an optimized neighbor selection strategy. Meanwhile, an optimized shot boundary classifier based on MKSVM is designed with the ant colony optimization method. Experiments demonstrate that the proposed method achieves higher average precision and recall than the MPN method. Future work is to optimize the other parameters of MKSVM to achieve better results.

Figure 4: Comparison of two recognition methods.
Figure 1: Typical plot of $W_{ij}$ as a function of $d^2(x_i, x_j)/\beta$.
(3) With the decrease of the geodesic distance, the intermode discriminating weight decreases, which means that two near points from different modes have a smaller similarity.
(4) Note that $W_1$ and $W_2$ always decrease when $x_i$ and $x_j$ are far apart and increase when $x_i$ and $x_j$ are close. Thus, $W_{ij}$ is a monotonically decreasing function of $d^2(x_i, x_j)/\beta$.
(7) Update the taboo table pointer. Move the ant to the selected new element and add the element to the ant's taboo table.
(8) If all elements of the set have been fully traversed, go to step 7 or else go to the next step.
(9) Recalculate the pheromone of each path; if $N_c < N_c(\max)$, go to step 6, or else save the weights $\mu = [\mu_1, \mu_2, \ldots, \mu_M]$.