Sparsity Based Locality-Sensitive Discriminative Dictionary Learning for Video Semantic Analysis



Introduction
Sparse representation (SR) is an active research area, particularly in signal reconstruction [1], and it performs well in video analysis and image classification, as the experimental results in [2, 3] show. Dictionary learning (DL), on the other hand, aims to learn a good dictionary from training samples so that a given signal can be well represented; the quality of the dictionary is therefore crucial for efficient SR [4]. The dictionary can be obtained either by using all the training samples as the dictionary to code the test samples (e.g., locality-constrained linear coding (LLC) [5]) or by adopting a learned dictionary for the sparse representation of each training sample (e.g., K-SVD [6] and Fisher discriminative dictionary learning (FDDL) [7]). In addition, group-centered sparse coding, cast as a rank minimization problem, is used in [8] to measure the sparse coefficient of each group by estimating the values of each grouping. Methods of the first kind use the training samples themselves as the dictionary. Although they show good classification performance, such a dictionary may not represent the samples effectively, because noisy information can accompany the original training samples, and it may not fully exploit the discriminative information hidden in them. The second category is also not ideally suited to recognition, because it only requires that the dictionary best express the training samples under a strict sparse representation. The LSDL approach addressed these issues by incorporating a locality constraint into the DL objective function, which makes the learned overcomplete dictionary more representative. A remaining problem with traditional sparse representation methods is that they cannot produce the same or similar coding results when the input features come from the same category.
The elementary approach to building the dictionary is to use all the training samples, which can make the dictionary huge and is consequently unfavorable to the sparse solver [9]. Many techniques have therefore been proposed: the Method of Optimal Directions (MOD) [10], which updates all the atoms simultaneously with fixed sparse codes using least-squares methods and orthogonal matching pursuit for sparse coding, and the K-SVD algorithm [11], which learns a modestly sized dictionary for sparse coding from the training data. One attempt to resolve the discrimination issue iteratively updates the K-SVD-trained dictionary (K-SVD focuses only on the representational power of the dictionary, as discussed in [12], not on its discrimination ability) based on the result of a linear classifier, yielding a dictionary that combines representational power with classification ability. Related efforts include [13, 14], which use more sophisticated objective functions during dictionary optimization so that the dictionary gains discriminative power. Another line of discriminative dictionary learning uses both discriminative and reconstructive error to learn the dictionary. A structured Fisher discriminative dictionary learning (FDDL) method was proposed in [15]; it improves pattern classification performance with a dictionary related to the class labels, and [16] further exploited the discriminability of both the representation coefficients and the residual. Recently, a discriminative structured dictionary learning method with hierarchical group sparsity, which reduces the linear predictive error and improves the discriminability of the sparse codes, was introduced in [17]. A discriminative structured dictionary learning method for image classification that combines the discriminative properties of the reconstruction error,
representation error, and classification error terms in its objective function was proposed in [18]. In the same vein of improving the discriminability of dictionary learning algorithms, [19] proposed structure-adaptive dictionary learning for sparse-representation-based classification, [20] proposed learning a dictionary discriminatively for group sparse representation, and [21] introduced discriminative dictionary learning with low-rank regularization for face recognition. Locality-constrained linear coding (LLC) for image classification, built on the theory of linear spatial pyramid matching using sparse coding [22], was proposed in [5]. LLC can achieve a smaller reconstruction error through multiple-code reconstruction and favors local smoothness with respect to sparsity.
More recently, sparse dictionary learning has been advanced for video semantics: [2] proposed a video semantic detection method based on locality-sensitive discriminant sparse representation and weighted KNN (LSDSR-WKNN) to achieve better category discrimination in the sparse representation of video semantic concepts. Despite this better category discrimination, the LSDSR-WKNN technique fails to fully exploit the potential information of the samples. A latent label-consistent DL (LLCDL) method, which solves a unified framework that simultaneously minimizes the structured discriminative sparse-code approximation, the classification error, and the structured sparse reconstruction to boost representation and classification performance, was proposed in [23]. The authors of the LLCDL algorithm previously introduced the joint label-consistent embedding and dictionary learning (JEDL) algorithm [24] based on LC-KSVD. That technique concurrently learns discriminative sparse codes, sparse reconstruction, classification errors, and code approximation, enabling a linear combination of signals for the sparse-code auto-extractor during classification. Moreover, vital similarity information among signals is captured by enforcing the sparse coefficients to be discriminative among different classes. However, it adopts the K-SVD computation approach to obtain the dictionary columns, which renders it computationally expensive. All the above SR-based DL techniques demonstrate outstanding performance, but they still face scene-variability tolerance issues when further improving the accuracy of video semantic analysis.
Based on the foregoing, this paper seeks to enhance the discriminative power of sparse representation features by encoding group sparse codes of samples from the same category, and thus proposes a sparsity-based locality-sensitive discriminative dictionary learning (SLSDDL) method for VSA, which is more efficient and has superior numerical stability. The SLSDDL algorithm adjusts the dictionary adaptively using the differences between the reconstructed dictionary atoms and the training samples via the locality adaptor. Moreover, applying the locality adaptor at both the DL and sparse coding stages lets the algorithm exploit the locality and similarity of the dictionary atoms more efficiently. The proposed algorithm also exploits class information by introducing a discriminant loss function during the dictionary learning stage, obtaining more discriminative information and enhancing the classification ability of sparse representation features. The learned dictionaries can represent video semantics more realistically and thus give a better representation of the samples underlying the sparse coefficients.
The main contributions of this paper are enumerated as follows: (1) A discriminative loss function is utilized, giving an optimal dictionary for addressing sparse representation issues and optimized sparse coefficients for reconstructing the dictionary, so as to enhance the discriminative classification of sparse representation features.
(2) The introduction of group sparsity also enables efficient reconstruction of samples, especially when the sample size gets large.
(3) A locality-sensitive discriminative dictionary learning and sparse representation method based on group sparsity is developed for video semantic detection that encodes the original video features. Features are thus sparsely encoded for video semantic detection with better preservation of the dictionary structure, improving the accuracy of video concept detection.
The rest of the paper is organized into the following sections: Section 2 presents a review of related works; Section 3 discusses the proposed algorithms; experimental results are presented in Section 4. Finally, Section 5 outlines the main conclusions and recommendations.

Related Works
2.1. Sparse Representation Algorithms. SR underpins signal acquisition and the overall pipeline of coding, sampling, compression, transmission, and decoding, and it is one of the most essential ideas in signal processing. The classical sampling theorem tells us that a sampled signal can be perfectly reconstructed from its samples if the sampling rate exceeds twice the maximum frequency of the original signal [25]. SR utilizes an overcomplete dictionary to linearly reconstruct a data instance for a given dictionary. Suppose the samples come from the space R^d and all the samples are concatenated to form a matrix M ∈ R^{d×m}. If any sample can be approximately represented by a linear combination of the dictionary D and the number of atoms is larger than the dimension of the samples, i.e., m > d, the dictionary D is referred to as an overcomplete dictionary. A signal is said to be compressible if it is sparse in the original or a transformed domain with no information or energy loss during the transformation. Generally, for a sample y with d < m, we define the linear system of equations as

y = Dα. (1)

The sparsest representation solution can be acquired by solving (1) with an ℓ0-norm minimization constraint [26]. Thus problem (1) can be converted to the following optimization problem:

min_α ‖α‖_0  s.t.  y = Dα, (2)

which is called the sparse approximation problem. Because real data always contain noise, representation noise is unavoidable in most cases. Thus the original model (1) can be revised to a modified model with respect to small possible noise:

y = Dα + s, (3)

where s ∈ R^d refers to the representation noise and is bounded as ‖s‖_2 ≤ ε. In the presence of noise, the sparse solutions of problem (2) can be approximately obtained by resolving the following optimization problems:

min_α ‖α‖_0  s.t.  ‖y − Dα‖_2^2 ≤ ε (4)

or

min_α ‖y − Dα‖_2^2  s.t.  ‖α‖_0 ≤ ε. (5)

Furthermore, according to the Lagrange multiplier theorem, a proper constant λ exists such that (4) and (5) are equivalent to the following unconstrained minimization problem:

min_α ‖y − Dα‖_2^2 + λ‖α‖_0, (6)

where λ refers to the Lagrange multiplier associated with ‖α‖_0.
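As a concrete illustration of problems (2)-(6), the sketch below implements a minimal greedy orthogonal matching pursuit, the classical approximate solver for the ℓ0-constrained problem mentioned alongside MOD in the introduction. The dictionary, signal, function name `omp`, and sparsity level are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximately solve
    argmin_x ||y - D x||_2 subject to ||x||_0 <= k."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares refit on the current support.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = y - D @ x
    return x

# Toy overcomplete dictionary (d = 4, m = 8 atoms), unit-norm columns.
rng = np.random.default_rng(0)
D = rng.standard_normal((4, 8))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(8)
x_true[[1, 5]] = [2.0, -1.5]          # a 2-sparse ground truth
y = D @ x_true
x_hat = omp(D, y, k=2)                # at most 2 nonzero coefficients
```

Each iteration can only shrink the residual, so the recovered code is at least as good a fit as the zero vector while honoring the sparsity budget.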

Locality-Sensitive Sparse Representation.
Test samples in SR are usually represented by dictionary atoms that may not actually neighbor them for signal or data reconstruction. This is incongruous with preserving data locality and can lead to poor recognition rates. In the LSDL method, locality constraints are added in both the dictionary learning and sparse coding phases, so that a test sample is represented mainly by its neighboring dictionary atoms. The optimization function for LSDL is stated as follows:

min_{D,x} ‖y − Dx‖_2^2 + μ‖p ⊙ x‖_2^2  s.t.  1^T x = 1, (7)

where p is the locality adaptor (with entries p_i = ‖y − d_i‖_2), ⊙ denotes element-wise multiplication, and μ is the scalar parameter that weights the locality constraint. The LSDL method does not consider class information; we therefore design a discriminative loss function using class information to enhance sparse-representation-based classification and to improve the accuracy of image data representation and video analysis.
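A minimal sketch of the locality-constrained coding step in (7), assuming the LLC-style closed-form solution of a locality-weighted least-squares problem under the sum-to-one constraint. The function `lsdl_code` and the toy data are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def lsdl_code(D, y, lam):
    """Closed-form locality-constrained coding:
    min_x ||y - D x||^2 + lam * ||p (*) x||^2  s.t.  1^T x = 1,
    where p_i = ||y - d_i||_2 penalizes atoms far from y."""
    m = D.shape[1]
    p = np.linalg.norm(D - y[:, None], axis=0)   # locality adaptor
    Z = D - y[:, None]                           # atoms shifted by y
    C = Z.T @ Z + lam * np.diag(p ** 2)          # regularized covariance
    x = np.linalg.solve(C, np.ones(m))           # solve C x = 1
    return x / x.sum()                           # enforce 1^T x = 1

rng = np.random.default_rng(1)
D = rng.standard_normal((5, 10))
D /= np.linalg.norm(D, axis=0)
y = D[:, 3] + 0.01 * rng.standard_normal(5)      # near atom 3
x = lsdl_code(D, y, lam=0.5)
```

Because the penalty grows with the distance p_i, coefficients of distant atoms are suppressed, which is exactly the locality behavior the text describes.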

Proposed Method
The proposed SLSDDL algorithm is based on the locality-sensitive adaptor: a discriminative loss function centered on group SR is incorporated into the objective function of the locality-sensitive discriminative dictionary learning method stated in (7), as detailed below. The proposed method can achieve an optimal dictionary and further enhance the discriminability of the sparse codes as well as reduce the computational cost. Features from the same category may not otherwise achieve the same coding results; hence we assume that such samples should be encoded as similar sparse coefficients in SR-based data detection, with the objective of enhancing the power of discrimination in SR-based classifiers. In the proposed objective, g(X) is the proposed discriminant loss function enforcing group-sparsity discriminability, as defined in (11) below; the shift-invariance constraint 1^T x = 1 enforces that the coding result of x remains the same even when the origin of the data coordinates is shifted, as indicated in [27]; and λ is the regularization parameter controlling the trade-off between the reconstruction error and the sparsity.
Bearing in mind that sample features from the same category should have similar sparse codes, we propose a discriminant loss function based on group sparse coding of the sparse coefficients, for the purpose of enhancing the discrimination of input signals or samples in SR. The group structure enforced on the coefficient vector ensures that the variables in the same group tend to be zero or nonzero simultaneously, which enhances classification and hence yields better classification results. Consequently, samples from the same category are compacted and those from different categories are separated. The discriminant loss function g(X), based on the sparse coefficients, is defined as follows:

g(X) = Σ_{c=1}^{k} Σ_{i} ‖x_i^c‖_2, (11)

where x_i^c denotes the block of coefficients of sample i associated with the atoms of class c; this mixed ℓ2,1-type penalty drives the coefficients within a class block to become zero or nonzero together.
The proposed method as stated in (12) enforces data locality, with the locality adaptor measuring the distance between the test sample y and each column of D. Note that A = [a_1, a_2, ..., a_n] represents all the training samples in the feature space. The dissimilarity vector p in (12) suppresses the corresponding weights and penalizes the distance between the test sample and each training sample in the feature space. Furthermore, the resulting coefficients in our SLSDDL formulation may not be strictly sparse with respect to the ℓ2-norm, but they are effectively sparse because the representation solutions have only a few significant values, most being zero. When problem (11) is minimized, test samples are encoded mainly by their neighboring training samples in feature space, and the resulting coefficient matrix X remains sparse: as p_i gets large, x_i shrinks to zero, so most coefficients are zero with only a few significant values. Because (12) is not convex as a whole, it is divided into two subproblems: updating X with D fixed and updating D with X fixed. To find the desired dictionary D and the sparse coefficients X, an alternating optimization is implemented iteratively, adopting the strategies of [15, 23, 24, 29].
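The group-wise behavior that the discriminant loss enforces, i.e., the coefficients of one class block vanishing or surviving together, can be sketched with the proximal (block soft-thresholding) operator of a mixed ℓ2,1 penalty. The function and grouping below are illustrative, not the paper's exact solver.

```python
import numpy as np

def group_soft_threshold(x, groups, tau):
    """Proximal step of the l2,1 group-sparsity penalty: each
    within-class block of coefficients is shrunk toward zero as a
    unit, so a class's coefficients vanish or survive together."""
    out = np.zeros_like(x)
    for g in groups:                    # g = indices of one class's atoms
        norm = np.linalg.norm(x[g])
        if norm > tau:                  # keep the block, shrink its norm
            out[g] = (1 - tau / norm) * x[g]
        # else: the whole block is set to zero
    return out

x = np.array([0.9, 1.2, 0.05, -0.04, 0.02, 1.5])
groups = [[0, 1], [2, 3, 4], [5]]       # three hypothetical class blocks
x_shrunk = group_soft_threshold(x, groups, tau=0.3)
# The weak middle block (norm ~0.067 < 0.3) is zeroed entirely,
# while the strong blocks are only rescaled.
```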

Optimization. The objective function in (12) is solved by alternating between the two subproblems below.

Updating X with D Fixed (Sparse Coding Stage). When the coefficient matrix X is updated with the dictionary D fixed, the objective function of (12) reduces to a per-sample coding problem. Setting the derivative of the Lagrangian with respect to x to zero yields a linear system of the form Ux = η1, where U = 2(D^T D + λ_1 diag(p)^2 + λ_2(G + I)) collects the reconstruction, locality, and group-discrimination terms, diag(p)^2 is the diagonal matrix whose nonzero elements are the squares of the entries of the locality adaptor p, and η is the Lagrange multiplier of the constraint 1^T x = 1. Premultiplying by 1^T U^{-1} gives η = −(1^T U^{-1} 1)^{-1}; substituting η back yields the analytical solution of the coding problem, and the coefficient vector x is then normalized.

Updating D with the Sparse Coefficient Matrix X Fixed (Dictionary Update Stage). During the dictionary update stage, the objective function in (12) reduces to the reconstruction problem min_D ‖A − DX‖_F^2 in (22), which is solved atom by atom with the updated atoms renormalized. The optimal X and D are obtained by alternating between sparse coding and dictionary update until a maximum number of iterations is reached or the reconstruction error falls below a threshold. The SLSDDL procedure is summarized in Algorithm 1. As an extension of the LSDL method, the proposed SLSDDL method seeks to exploit the discriminative information of all training samples; therefore, the LSDL method is adopted to initialize the sparse coefficients of all the training samples.
Let ∂L(x, η)/∂x = 0, which gives a linear system Ux = η1 with U = 2(D^T D + μ diag(p)^2); once (29) is premultiplied by 1^T U^{-1}, the multiplier η is obtained. Then, substituting η into (30), the analytical solution x of (30) is obtained. The sparse coding coefficient is then obtained by normalizing x, as determined in (31). Using the analysis in [30], we get the sparse coding coefficient associated with a test sample y by solving (30) and (31), after which the residual r_i(y) = ‖y − D_i x_i‖_2^2 is calculated, where x_i is the sparse coding coefficient on the i-th class; the test sample y is finally classified into the class with the minimum residual:

identity(y) = arg min_i r_i(y).
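The alternating scheme and the minimum-residual rule can be sketched as follows, with a MOD-style closed-form dictionary update standing in for the paper's atom-by-atom update of (22). All names and toy data are hypothetical.

```python
import numpy as np

def mod_dictionary_update(A, X):
    """MOD-style dictionary update: D = A X^T (X X^T)^{-1},
    then renormalize every atom to unit l2 norm."""
    D = A @ X.T @ np.linalg.pinv(X @ X.T)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

def classify_by_residual(D, x, y, class_of_atom):
    """Assign y to the class whose block of atoms reconstructs it best."""
    classes = np.unique(class_of_atom)
    residuals = [np.linalg.norm(y - D[:, class_of_atom == c] @ x[class_of_atom == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy check: standard-basis atoms, two classes of two atoms each.
D = np.eye(4)
class_of_atom = np.array([0, 0, 1, 1])
y = np.array([1.0, 0.0, 0.0, 0.0])   # lies in class 0's span
x = np.array([1.0, 0.0, 0.0, 0.0])   # its coding coefficients
pred = classify_by_residual(D, x, y, class_of_atom)

# MOD sanity check: with A = D_true X and X full row rank, the update
# recovers D_true exactly (here D_true = I, whose atoms are unit norm).
X = np.array([[1.0, 0, 0, 1, 0],
              [0, 1.0, 0, 0, 1],
              [0, 0, 1.0, 1, 1]])
A = np.eye(3) @ X
D_hat = mod_dictionary_update(A, X)
```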

Experimental Results and Analysis
This section details the experimental results and their analysis to demonstrate the effectiveness of the proposed method.

Video Shot Preprocessing.
In the experiments, the video key-frame extraction approach of ordered-samples clustering based on artificial immune theory [31] was adopted to extract static image frames from each original video. Afterwards, features consisting of the 5-dimensional radial Tchebichef moments [32], the 6-dimensional gray level co-occurrence matrix (GLCM) [33], the 81-dimensional HSV color histogram [34], and the 30-dimensional multi-scale local binary pattern (LBP) [35] were extracted from each key-frame. Details of these features can be found in [2].
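The concatenated key-frame descriptor has 5 + 6 + 81 + 30 = 122 dimensions, which matches the row dimension of the dictionaries used later (e.g., 122 × 400). A minimal sketch with random stand-ins for the real descriptors (the actual extraction is in the cited references):

```python
import numpy as np

# Illustrative stand-ins for the four key-frame descriptors; the real
# extraction (Tchebichef moments, GLCM, HSV histogram, LBP) follows
# references [32]-[35]. Dimensions follow the paper: 5 + 6 + 81 + 30.
rng = np.random.default_rng(0)
tchebichef = rng.random(5)    # radial Tchebichef moments
glcm       = rng.random(6)    # gray level co-occurrence statistics
hsv_hist   = rng.random(81)   # HSV color histogram
lbp        = rng.random(30)   # multi-scale local binary pattern

feature = np.concatenate([tchebichef, glcm, hsv_hist, lbp])  # 122-dim
```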
Figure 1 shows some key-frames of the three video datasets implemented.

Database Selection and Video Analysis.
Experiments are performed on real-world video data to demonstrate the performance of our method in this section. We analyze and compare the performance of our proposed algorithm with typical algorithms such as KSVD [11], FDDL, LLC [5], and LSDL [3]. Out of the many collections of datasets for video categorization and classification, three public datasets are used in our experimental evaluation: the TRECVID 2012 video dataset, the OV video dataset, and the YouTube video dataset.
The TRECVID 2012 video dataset has airplane, baby, building, car, dog, flower, instrumental musician, mountain, scene text, and speech as its video semantic classes, each containing 60 samples; 50 samples per class were randomly selected for training and the remainder used for testing. The YouTube video dataset comprises basketball, biking, diving, golf swing, horse riding, soccer juggling, swing, tennis, trampoline jumping, volleyball, and walking, each class containing 70 samples, of which 60 were randomly selected for training and the rest for testing. The semantic concepts of the OV dataset are aircraft, sea, rocket, parachute, road, face, satellite, and star; each class consists of 70 samples, of which 60 are randomly selected for training and the remaining ones for testing. The proposed approach uses the l2-norm locality adaptor. The experimental results were obtained by twenty-fold cross-validation, in which the training samples are selected randomly and the remainder serve as testing samples for each video dataset.
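The per-class random split described above can be sketched as follows for one fold of the twenty-fold protocol; `per_class_split` is an illustrative helper, not the authors' code.

```python
import numpy as np

def per_class_split(labels, n_train, rng):
    """Randomly pick n_train indices per class for training; the
    remaining indices of each class become the test set."""
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)

# TRECVID-style setup: 10 classes x 60 samples, 50 train / 10 test each.
labels = np.repeat(np.arange(10), 60)
train, test = per_class_split(labels, n_train=50,
                              rng=np.random.default_rng(0))
```

Repeating this with twenty different random seeds and averaging the recognition rates reproduces the cross-validation protocol described in the text.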

Parameter Selection.
Several parameters are used by the proposed SLSDDL and the classification error scheme: γ, the classification parameter; λ_1, a positive weighting parameter for the locality-sensitive constraint; and λ_2 for the discriminative constraint, with values assigned per dataset. The classification parameter γ had little or no effect on the experimental results and was therefore set to 0.01 in all experiments. The recognition rates of the proposed SLSDDL approach (in the optimization phase) with varying values of λ_1 and λ_2 on the TRECVID dataset are depicted in Figure 2. It can be seen from the figure that the optimum recognition results were obtained when λ_1 is 0.01 and λ_2 is 0.007, respectively, with λ_2 and λ_1 fixed in turn for each dataset by twenty-fold cross-validation. Based on the values stated in Table 1, an experiment was conducted to validate the performance of our proposed method.
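The parameter selection can be sketched as a simple grid search over (λ1, λ2). The `surrogate` score below is a dummy stand-in for one cross-validated SLSDDL run, constructed only so that the example peaks at the reported optimum (0.01, 0.007); it is not the paper's evaluation.

```python
import itertools
import numpy as np

def grid_search(evaluate, lam1_grid, lam2_grid):
    """Return the (lam1, lam2) pair maximizing the recognition score;
    `evaluate` stands in for one cross-validated SLSDDL run."""
    return max(itertools.product(lam1_grid, lam2_grid),
               key=lambda pair: evaluate(*pair))

def surrogate(l1, l2):
    # Dummy score peaking at l1 = 0.01, l2 = 0.007 (log-scale bowl).
    return -((np.log10(l1) + 2.0) ** 2
             + (np.log10(l2) - np.log10(0.007)) ** 2)

lam1_grid = [0.001, 0.005, 0.01, 0.05, 0.1]
lam2_grid = [0.001, 0.007, 0.01, 0.05]
best = grid_search(surrogate, lam1_grid, lam2_grid)
```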

Experimental Results and Analysis: TRECVID 2012 Video Dataset
The recognition rates of video semantic detection by the proposed SLSDDL with varying initial dictionary sizes are indicated in Table 2. SLSDDL gives optimum performance when the size of the dictionary is 122 × 400, which results in the highest accuracy of semantic analysis, and this size is therefore chosen. The recognition rates of LLC, FDDL, LSDL, KSVD, and SLSDDL are shown in Table 3. The proposed SLSDDL obtains higher recognition rates than the other approaches. Figure 3 presents the average recognition rates of the various video semantic detection methods for each class on the TRECVID video dataset, which clearly shows that the proposed method is the best among all comparative approaches in most categories. The recognition rate of the proposed method on the TRECVID dataset outperforms KSVD by 8.23% and LSDL by 7.86%, which confirms its effectiveness. Hence, the proposed SLSDDL approach can effectively improve the accuracy of TRECVID video semantic detection. Nevertheless, the recognition rates of the proposed SLSDDL are lower for Instrument and Mountain than those of KSVD, because KSVD performs better at semantic detection for these two categories of the TRECVID 2012 dataset.

Experimental Results and Analysis: OV Video Dataset.
The recognition rates of video semantic detection by the proposed SLSDDL with varying initial dictionary sizes are indicated in Table 4. SLSDDL gives optimum performance when the size of the dictionary is 122 × 500, which results in the highest accuracy of semantic analysis, and this size is therefore chosen.
The recognition rates of LLC, FDDL, LSDL, KSVD, and SLSDDL are shown in Table 5. The proposed SLSDDL obtains higher recognition rates than the other approaches. Figure 4 presents the average recognition rates of the various video semantic detection methods for each class on the OV video dataset, which clearly shows that the proposed method is the best among all comparative approaches in most categories. The recognition rate of the proposed method on the OV dataset outperforms KSVD by 10.86% and LSDL by 10.41%, which confirms its effectiveness. Hence, the proposed SLSDDL approach can effectively improve the accuracy of OV video semantic detection.

Experimental Results and Analysis: YouTube Video Dataset
The recognition rates of video semantic detection by the proposed SLSDDL with varying initial dictionary sizes are indicated in Table 6. SLSDDL gives optimum performance when the size of the dictionary is 122 × 550, which results in the highest accuracy of semantic analysis, and this size is therefore chosen. The recognition rates of LLC, FDDL, LSDL, KSVD, and SLSDDL are shown in Table 7. The proposed SLSDDL obtains higher recognition rates than the other approaches. Figure 5 presents the average recognition rates of the various video semantic detection methods for each class on the YouTube video dataset, which clearly shows that the proposed method is the best among all comparative approaches in most categories. The recognition rate of the proposed method on the YouTube dataset outperforms KSVD by 8.9% and LSDL by 9.36%, which confirms its effectiveness. Hence, the proposed SLSDDL approach can effectively improve the accuracy of YouTube video semantic detection.

Conclusion and Recommendations
In this paper, a sparsity-based locality-sensitive discriminative dictionary learning method is proposed. The proposed approach enforces that video samples from the same category should be encoded as similar sparse coefficients in the process of sparse-representation-based video semantic detection, so as to enhance the discriminative power of sparse representation features. The paper also introduces into the locality-sensitive framework a loss function centered on sparse coefficients to optimize the dictionary. Through these constraints, SLSDDL enhances the discriminative power of sparse representation features along Fisher-like discrimination principles, for better dictionary optimization and improved video semantic classification.
Based on the experimental results, the proposed SLSDDL algorithm for video semantic analysis is more effective than and outperforms state-of-the-art approaches such as LLC, FDDL, LSDL, and KSVD. Despite the superior results demonstrated by SLSDDL on the TRECVID, YouTube, and OV video datasets, there is still a need to improve the execution time and to further improve the discriminative power; hence, in future work we plan to use deep learning, as discussed in [36], for extracting the video features and to introduce kernels into the framework.
Figure 1: Some key frames from video shots for the respective datasets: (a) the OV video set, (b) the TRECVID 2012 video set, (c) the YouTube video set.

Figure 2: Recognition rate vs. variations of parameter values on the TRECVID dataset.

Figure 3: Recognition rate of TRECVID 2012 videos with different algorithms.

Figure 4: Recognition rate of OV videos with different algorithms.
Figure 5: Recognition rate of YouTube videos with different algorithms.
Assume that A = [A_1, A_2, ..., A_k] = [a_1, a_2, ..., a_n] ∈ R^{d×n} denotes n training samples from k classes, where the column vector a_i is the i-th sample (i = 1, ..., n) and the submatrix A_c consists of the column vectors from class c (c = 1, ..., k). If there are M atoms in the dictionary D = [d_1, d_2, ..., d_M] ∈ R^{d×M}, an overcomplete dictionary for SR with M ≤ n, then the coding coefficients of A over D are denoted by X = [x_1, x_2, ..., x_n]. Considering that image or video features of the same category should yield representation coefficients carrying class information, which is imperative for classification, our SLSDDL framework is modeled with the locality adaptor p, the training samples A, the dictionary D, the sparse coefficient vectors x_i, and the discriminative loss function for group sparsity g(X). The proposed method is significantly less expensive computationally than existing SR classification approaches as a result of the smaller dictionary size: the learned D is much easier to obtain than using the full training set as the dictionary, and it enforces the capture of dependencies among video samples of the same category. With the constraint weight set to 1 for simplicity, (9) can be reformulated, and the encoding of (28) follows the same scheme. For a given test sample y, the representation coefficients on the dictionary D are obtained by solving the locality-constrained coding problem, and the test sample is finally classified after obtaining D, where μ is a scalar constant that weights the locality term, ⊙ denotes element-wise multiplication, p = [p_1, p_2, ..., p_M]^T, and every entry is p_i = ‖y − d_i‖_2. We utilize the Lagrange multiplier method to obtain the solution of (28), as explained in the optimization section.

Algorithm 1: Sparsity-Based Locality-Sensitive Discriminative Dictionary Learning (SLSDDL).
Input: training samples A, initial dictionary D, parameters λ_1, λ_2.
Output: the learned dictionary D.
(a) Initialize the coding coefficient matrix X and i (i ← 1).
(b) Update X with D fixed: update each x_i (i = 1, ..., n) class-wise by solving equation (24); if i = n, accept the update (X ← X̂), else set i = i + 1 and return to (b).
(d) Initialize i_1 and i_2 with i_1 ← 1 and i_2 ← 1.
(e) Solve equation (24) for the solution at (i_1, i_2).
(f) Go to (g) if i_2 = M, else set i_2 = i_2 + 1 and return to (e).
(g) Compute X(i_1, :) from equation (25).
(h) Fix X and update D using equation (22).
(i) If i_2 = M, normalize the updated atom and go to (j); otherwise set i_1 ← i_1 + 1, i_2 ← 1 and return to (e).
(j) Compute the reconstruction error; if the error value is less than the current minimum, keep the dictionary attaining the minimum error (D* ← D̂).
(k) Return to step (b) until convergence or until the reconstruction error is less than the threshold; then output D and the algorithm ends.

Table 1: Parameter values implemented on the various datasets.

Table 1
indicates the various parameter selections for the respective datasets. Based on these values, an experiment was conducted to validate the performance of our proposed algorithms, and the findings are detailed below. Different datasets have different parameter settings because the fundamental structure of each dataset differs; hence different parameter values were realized for the experiments on the respective datasets.

Table 2 :
Classification results on different dictionary atoms of TRECVID 2012.

Table 3 :
The recognition rate of different video semantic analysis algorithms of TRECVID 2012.

Table 4 :
Classification results on different dictionary atoms of OV dataset.

Table 5 :
The recognition rate of different video semantic analysis algorithms of OV dataset.

Table 6 :
Classification results on different dictionary atoms of YouTube.

Table 7 :
The recognition rate of different video semantic analysis algorithms of YouTube.