Audio-Visual-Based Query by Example Video Retrieval

Query by example video retrieval aims at the automatic retrieval of video samples from a video database that are similar to a user-provided example. Considering that much of the prior work on video analysis supports retrieval using only visual features, this paper proposes a two-step method for query by example in which both audio and visual features are used. In the proposed method, sets of audio and visual features are extracted at the shot level and the key frame level, respectively. Among these features, audio features are employed for rough retrieval, while visual features are applied to refine the retrieval. The experimental results demonstrate the good performance of the proposed approach.


Introduction
While query by example image retrieval is becoming a mature field [1][2][3][4][5], query by example video retrieval (QEVR) is still in its infancy. Video retrieval in particular has received much attention in the area of digital video processing. For videos, the management of ever-growing databases is time consuming when done completely manually, which is why automatic systems are required to lighten the job. Usually, retrieving a video from a large video database requires knowledge of the players or actors, the team or director name, or other textual information. Such textual queries are efficient when users have quite precise knowledge of what they are searching for within manually indexed and structured databases. However, this classical approach cannot work when users need to search for video clips in a highly unstructured database such as the web, or to discover video clips without information on the players or actors or the team or director name. A widely accepted query type for video is now query by example [6]. Query by example video retrieval aims at automatically retrieving samples from a database that are similar to the example provided by the user.
As discussed by Hu et al. [7], there are two widely accepted query types for video retrieval: nonsemantic-based video query types and semantic-based video query types.
Nonsemantic video query types include query by example [8], by sketch [9], and by objects [10]. Semantic-based query types include query by key words and by natural language [11,12].
An example video may be a video clip composed of a set of shots describing a particular event. Although people often treat a video as a sequence of images, it is actually a compound medium, integrating diverse media such as realistic images, graphics, text, and audio. Multimedia data are represented by features from multiple media sources. Among these media, audio is a significant part of videos. In fact, some audio information in video plays an important role in video detection. For instance, the audio signal of a sports video is capable of characterizing some short-duration events, such as a goal or audience applause. Although audio has attracted much attention from researchers, most of them focus on applying audio features to event detection or copy detection [13]. Little research has been conducted on applying the audio signal to video retrieval. In this sense, query by example video retrieval can be a compromise solution that fills the gap between low-level features and human perception. The search for videos that are similar to a given query example, from the point of view of their content, has become a very important part of research.
In this paper, a two-step retrieval framework for QEVR applying audio-visual features is presented to facilitate video retrieval and access. We focus on features from the shot level and the key frame level, respectively. At the shot level, audio feature analysis is first conducted on each clip for rough retrieval, after which the scope of target videos in the database is narrowed. Then a fine similarity measurement is conducted by applying visual features at the key frame level. The system finally returns the similar videos according to the visual features. Based on our approach, existing video processing techniques and algorithms can be integrated to find similar videos. We use sports videos and other social videos as our test bed. Our performance study on video datasets indicates that the proposed approach offers good performance.
The rest of this paper is organized as follows. Section 2 describes the two-step retrieval system architecture using audio and visual features, Section 3 elaborates the specific audio and visual features extracted in the system, and Section 4 proposes the similarity measure. Section 5 presents the experimental settings and performance evaluation. Section 6 concludes this paper.

Two-Step Retrieval and Management Architecture
The proposed two-step retrieval and management scheme is shown in Figure 1.
There are two stages in the proposed video retrieval scheme: offline database archiving and online video retrieval.
During the offline database archiving stage, video preprocessing is essential to support shot-based video retrieval: the video is segmented using a shot cut technique [14], and audio-visual features are extracted. Audio features are stored in the database for rough retrieval, while visual features from key frames are used for fine retrieval.
The video preprocessing procedures for the video retrieval stage are the same as those for the database archiving stage. After feature extraction, a two-step retrieval method is conducted: the first step is rough retrieval based on audio features, whose aim is to narrow the scope of target videos by excluding irrelevant videos; the second step is refined retrieval, which confirms the final result videos using the similarity measure discussed in Section 4. Lastly, the system returns the result videos.

Features
When designing a QEVR system, we pay attention to both audio and visual features. With respect to their fusion, as mentioned in the literature [15], there are two types of approaches: some audio-video feature fusion approaches try to directly evaluate the interaction of the two features [16]; other methods first map the features into a subspace where this relationship is enhanced and can therefore be estimated. In this paper, we adopt the latter: the audio features are first mapped into a feature subspace for rough retrieval; after that, the retrieval results are optimized by visual features.

Audio Features at Shot Level.
There are many features that can be used to characterize audio signals. Generally, they can be separated into two categories: time domain and frequency domain. To reduce the sensitivity of any single audio feature to noise, several complementary audio features are employed.
To map the audio features into a feature subspace, the feature vector representation of audio clips can be obtained in a three-step way.
Step 1 (meta-feature extraction). For each frame, six features are calculated: energy entropy, short-time energy, zero crossing rate, spectral roll-off, spectral centroid, and spectral flux. This step leads to six feature sequences for the whole audio signal. These basic audio features are described as follows.
(a) Energy entropy: it reflects the volatility and distribution of the energy in the time-frequency domain. The probability density function (PDF) of the $i$th frame can be written in its standard form as [17]
$$p_i(k) = \frac{|X_i(k)|^2}{\sum_{j=1}^{N} |X_i(j)|^2},$$
where $X_i(k)$ is the amplitude of the $k$th component of the $i$th frame after the fast Fourier transform (FFT) and $N$ is the length of the audio frame. Then, for the $i$th frame, the entropy can be described as
$$H_i = -\sum_{k=1}^{N} p_i(k) \log p_i(k).$$
(b) Short-time energy: it is one of the short-time analysis functions and reflects the slow change of the audio signal over time. The short-time energy of the $i$th frame can be defined as [18]
$$E_i = \sum_{n=1}^{N} x_i^2(n),$$
where $x_i(n)$ is the $n$th sample of the $i$th frame and $N$ is the length of the $i$th frame.
(c) Zero crossing rate (ZCR): it is a correlate of the spectral centroid. It is defined as the number of time-domain zero crossings within the processing window and can distinguish unvoiced from voiced speech. The ZCR is high for high-frequency content and relatively low for low-frequency content. The ZCR of the speech signal $x_i(n)$ can be defined as follows [18]:
$$Z_i = \frac{1}{2} \sum_{n=2}^{N} \left| \operatorname{sgn}[x_i(n)] - \operatorname{sgn}[x_i(n-1)] \right|,$$
where $\operatorname{sgn}[x]$ is $1$ for $x \ge 0$ and $-1$ otherwise.
(d) Spectral features: they characterize a signal's distribution of energy over frequency and are calculated from the short-time Fourier transform, frame by frame along the time axis. Three features are employed.
(1) Spectral centroid: this measure is obtained by evaluating the "center of gravity" using the Fourier transform's frequency and magnitude information. The individual centroid of a spectral frame is defined as the average frequency weighted by amplitudes, divided by the sum of the amplitudes, or
$$\mathrm{SC} = \frac{\sum_{k} f[k]\, a[k]}{\sum_{k} a[k]},$$
where $a[k]$ is the amplitude of bin number $k$ and $f[k]$ is the center frequency of that bin.
(2) Spectral flux [19]: it is a feature that measures the degree of variation in the spectrum across time. The 2-norm of the frame-to-frame spectral amplitude difference vector is
$$\mathrm{SF}_i = \left\| a_i - a_{i-1} \right\|_2 = \sqrt{\sum_{k} \left( a_i[k] - a_{i-1}[k] \right)^2},$$
where $a_i$ is the spectral amplitude vector of the $i$th frame.
(3) Spectral roll-off [19]: this measure distinguishes voiced from unvoiced speech. Unvoiced speech has a high proportion of its energy in the high-frequency range of the spectrum, whereas most of the energy of voiced speech is contained in the lower bands. Roll-off is a measure of the "skewness" of the spectral shape; the value is higher for right-skewed distributions.
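The six frame-level features of Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the formulas follow common textbook definitions wherever the text does not spell them out, the 85% roll-off fraction and the base-2 logarithm are assumed values, and a naive DFT replaces the FFT to keep the sketch free of dependencies. All function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), adequate for a sketch)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples."""
    def sign(x):
        return 1 if x >= 0 else -1
    return sum(abs(sign(b) - sign(a)) for a, b in zip(frame, frame[1:])) / 2


def energy_entropy(mag):
    """Entropy of the normalized power spectrum (log base 2 is an assumption)."""
    power = [m * m for m in mag]
    total = sum(power)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log2(p / total) for p in power if p > 0)


def spectral_centroid(mag, sample_rate, n):
    """Amplitude-weighted mean of the bin center frequencies."""
    total = sum(mag)
    if total == 0:
        return 0.0
    return sum((k * sample_rate / n) * m for k, m in enumerate(mag)) / total


def spectral_flux(mag, prev_mag):
    """2-norm of the difference between consecutive magnitude spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mag, prev_mag)))


def spectral_rolloff(mag, sample_rate, n, fraction=0.85):
    """Frequency below which `fraction` of the total magnitude lies (assumed 85%)."""
    threshold = fraction * sum(mag)
    running = 0.0
    for k, m in enumerate(mag):
        running += m
        if running >= threshold:
            return k * sample_rate / n
    return (len(mag) - 1) * sample_rate / n
```

Calling these functions on every frame of a shot's audio clip yields the six feature sequences that the next step summarizes.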
Step 2 (statistical feature). In order to achieve computational simplicity and detection effectiveness, we quantify each of the six feature sequences by a statistical characteristic value, that is: energy entropy standard deviation (Std), signal energy Std by mean (average) ratio, zero crossing rate Std, spectral roll-off Std, spectral centroid Std, and spectral flux Std by mean ratio. This step leads to six single statistic values (one for each feature sequence). Those six values are the final feature values that characterize the audio clip using a six-dimensional feature vector, which is denoted as $F = \{f_1, f_2, f_3, f_4, f_5, f_6\}$.
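As a rough illustration of Step 2, the sketch below collapses six per-frame sequences into the six-dimensional vector. The dictionary keys and the function name are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which is meant.

```python
import statistics


def clip_feature_vector(seqs):
    """Map six per-frame feature sequences to a six-dimensional clip vector.

    `seqs` is a dict of lists keyed by hypothetical feature names.
    """
    def std(values):
        # Population standard deviation (an assumption, see the lead-in).
        return statistics.pstdev(values)

    def std_by_mean(values):
        mean = statistics.fmean(values)
        return std(values) / mean if mean != 0 else 0.0

    return [
        std(seqs["energy_entropy"]),
        std_by_mean(seqs["short_time_energy"]),
        std(seqs["zero_crossing_rate"]),
        std(seqs["spectral_rolloff"]),
        std(seqs["spectral_centroid"]),
        std_by_mean(seqs["spectral_flux"]),
    ]
```

The ratio features guard against a zero mean so that a silent clip does not raise a division error.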
Step 3 (normalized statistical feature). Normalization ensures that the contributions of all audio feature elements are adequately represented, preventing one feature from dominating the whole feature vector.
In this paper, each audio feature is computed over the audio clip of each shot, using frames of 30 ms duration. For $F = \{f_1, f_2, f_3, f_4, f_5, f_6\}$, suppose $f_{\max}$ and $f_{\min}$ are the maximum and minimum values of the feature components, respectively, which can be defined as follows:
$$f_{\max} = \max_{1 \le i \le 6} f_i, \qquad f_{\min} = \min_{1 \le i \le 6} f_i.$$
A normalized audio feature vector $a$ consisting of six components is constructed by min-max scaling, denoted as
$$a = \{a_1, a_2, \dots, a_6\}, \qquad a_i = \frac{f_i - f_{\min}}{f_{\max} - f_{\min}}. \tag{7}$$
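A minimal sketch of the normalization step follows, assuming plain min-max scaling over the components of one feature vector, which is how the text above reads. Scaling each feature by its range across the whole database is another common choice and may be what a production system would use. The function name is hypothetical.

```python
def min_max_normalize(features):
    """Scale the components of a feature vector to the range [0, 1]."""
    lo, hi = min(features), max(features)
    if hi == lo:
        # All components equal: no spread to normalize, return zeros.
        return [0.0 for _ in features]
    return [(f - lo) / (hi - lo) for f in features]
```

For instance, `min_max_normalize([2.0, 4.0, 6.0])` maps the smallest component to 0, the largest to 1, and the middle one to 0.5.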

Visual Features at Key Frame Level.
We segment video clips into shots and extract representative frames from each shot. The key frames of a shot reflect the characteristics of the video to some extent. Traditional image retrieval techniques can be applied to key frames to achieve video retrieval. In image indexing, image representation has become color oriented, since most images of interest are in color. Many previous studies used the color composition of an image [20]. An advantage of the color histogram is that it is invariant to translation and to rotation about the viewing axis. However, using a single image attribute for retrieval may lack sufficient discriminatory information; for example, the color feature cannot convey the spatial information of an image. The texture feature describes the spatial correlation between pixels, compensating for this shortcoming.
For the color feature, we define it in terms of a histogram in the quantized hue-saturation-value (HSV) color space. Since the HSV space conforms more closely to the human perception of color similarity, this property is used to quantize the color space into a small number of colors. The basic properties of color are emphasized, and the number of color histogram bins is set to 180. For the texture, we define it in terms of the GLCM (gray level cooccurrence matrix) [21,22]. Here, we adopt eight indicators extracted from the GLCM, namely, the mean and standard deviation of energy, entropy, contrast, and inverse difference moment. Energy is a measure of the textural uniformity of an image. Energy reaches its highest value when the gray level distribution has either a constant or a periodic form. Entropy measures the disorder of an image, and it achieves its largest value when all elements of the GLCM are equal. Contrast is a difference moment of the GLCM, and it measures the amount of local variation in an image. Inverse difference moment measures image homogeneity. This parameter achieves its largest value when most of the occurrences in the GLCM are concentrated near the main diagonal. Users can choose other feature representations according to their own requirements.
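The texture indicators can be illustrated with a small pure-Python GLCM sketch. Several details are assumptions rather than the paper's stated settings: the matrix is made symmetric and normalized to sum to one, the mean and standard deviation are taken over four pixel offsets (0, 45, 90, and 135 degrees at distance 1), and the logarithm is base 2. The image is a list of rows of integer gray levels in `range(levels)`; all names are hypothetical.

```python
import math
import statistics


def glcm(image, levels, dx=1, dy=0):
    """Normalized symmetric co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def glcm_indicators(p):
    """Return (energy, entropy, contrast, inverse difference moment)."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    energy = sum(v * v for _, _, v in cells)
    entropy = -sum(v * math.log2(v) for _, _, v in cells if v > 0)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    idm = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    return energy, entropy, contrast, idm


def texture_vector(image, levels):
    """Mean and standard deviation of each indicator over four offsets."""
    offsets = [(1, 0), (1, 1), (0, 1), (-1, 1)]  # assumed directions
    per_offset = [glcm_indicators(glcm(image, levels, dx, dy))
                  for dx, dy in offsets]
    vector = []
    for k in range(4):
        values = [indicators[k] for indicators in per_offset]
        vector += [statistics.fmean(values), statistics.pstdev(values)]
    return vector
```

On a uniform 2 x 2 GLCM, `glcm_indicators` gives the maximum entropy for that size (2 bits), while a constant image gives energy 1 and entropy 0, matching the behavior described in the text.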
Shot $s$ is represented as a weighted matrix by the following:
$$s = [\, w_c C,\; w_t T \,] = [\, w_c (c_1, \dots, c_{180}),\; w_t (t_1, \dots, t_8) \,], \tag{8}$$
where $w_c$ and $w_t$ represent the weights of color and texture, respectively, when describing a representative frame of shot $s$. Since the cognition of a shot can be quite subjective, different weight assignments may reflect different user requirements and preferences. The symbols $C$ and $T$ represent the normalized feature vectors of color and texture, respectively, and $c_i$ $(1 \le i \le 180)$ and $t_j$ $(1 \le j \le 8)$ represent the components of the HSV histogram and the GLCM indicators, respectively. The normalization method for the visual feature vector is the same as that for the audio feature vector, as elaborated in Section 3.1.
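The weighted shot representation can be sketched as a simple concatenation of the weighted color and texture vectors. The function name and the flat-list layout are assumptions made for illustration; the default weights of 0.5 follow the setting reported later in the parameter discussion.

```python
def shot_descriptor(color_hist, texture, w_color=0.5, w_texture=0.5):
    """Concatenate the weighted 180-bin color histogram and 8 texture values."""
    if len(color_hist) != 180 or len(texture) != 8:
        raise ValueError("expected 180 color bins and 8 texture indicators")
    return [w_color * c for c in color_hist] + [w_texture * t for t in texture]
```

Raising `w_color` relative to `w_texture` makes color differences dominate any distance computed on the descriptor, which is how user preferences would enter the comparison.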

Similarity Measure
The overall similarity matching contains two levels: rough retrieval based on audio features and refined retrieval based on visual features.

Rough Retrieval Based on Audio Feature.
Firstly, the audio signals are segmented at 30 ms per frame with a Hamming window; the frame is the basic unit for feature extraction, as shown in Figure 2. Then, the audio features elaborated in Section 3.1 are used to construct the feature vector for each audio frame. The audio samples were collected at a 44.1 kHz sample rate, with stereo channels and 16 bits per sample.
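The framing step can be sketched as follows. At 44.1 kHz, a 30 ms frame holds 1323 samples. The sketch assumes a mono sample list (a stereo signal would first need to be mixed down or processed per channel) and non-overlapping frames, neither of which is specified in the text; the function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, sample_rate=44100, frame_ms=30):
    """Split samples into non-overlapping windowed frames of frame_ms each."""
    n = int(round(sample_rate * frame_ms / 1000))
    window = hamming(n)
    return [[s * w for s, w in zip(samples[start:start + n], window)]
            for start in range(0, len(samples) - n + 1, n)]
```

Trailing samples that do not fill a complete frame are dropped in this sketch.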
Feature analysis is conducted on each clip. The aim of rough retrieval is to narrow the scope of target videos in the database by using the audio feature vector described above.
A normalized Euclidean distance $d(a_q, a_d)$ is used to measure the similarity between the audio clips from query video shot $q$ and database video shot $d$, where $a_q$ and $a_d$ are the normalized feature vectors depicting the characteristics of audio clips $q$ and $d$:
$$d(a_q, a_d) = \sqrt{\sum_{i=1}^{6} \left( a_{q,i} - a_{d,i} \right)^2}.$$
We say $q$ and $d$ are dissimilar if and only if $d(a_q, a_d) > \varepsilon_{\text{audio}}$ (a distance threshold). The greater $d(a_q, a_d)$, the greater the dissimilarity between $a_q$ and $a_d$.
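The rough retrieval filter can be sketched as below, assuming the distance is the plain Euclidean distance between already normalized six-dimensional vectors. The function name, the dictionary layout, and the default threshold of 0.5 (the value reported in the parameter settings) are illustrative choices.

```python
import math


def rough_retrieval(query_audio, database_audio, eps_audio=0.5):
    """Keep database shots whose audio distance to the query is <= eps_audio.

    `database_audio` maps a shot identifier to its normalized audio vector.
    Returns {shot_id: distance}, ordered from most to least similar.
    """
    kept = {}
    for shot_id, vector in database_audio.items():
        distance = math.dist(query_audio, vector)
        if distance <= eps_audio:  # dissimilar only if distance > eps_audio
            kept[shot_id] = distance
    return dict(sorted(kept.items(), key=lambda item: item[1]))
```

Only the shots that survive this filter need to be compared on visual features in the second step.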

Refined Retrieval Based on Visual Features.
The scope of target videos is narrowed after rough retrieval. In this section, we address how the temporal similarity of two videos can be measured properly.
We use key frames to represent a shot in a video, and each key frame is represented by (8) using the visual features elaborated in Section 3.2. Then, a large number of key frames are extracted from the query video as the baseline for comparison. Alternative temporal sampling is also possible, but our current comparison uses key frame features only.
The similarity between the shots in the above subclassification and the query video shot can be estimated with the Euclidean distance. Firstly, let us denote the shot set of the above subclassification and the query video shot by $S_d = \{s_1, s_2, \dots, s_i, \dots, s_n\}$ and $s_q$, respectively. The shots similar to the query video shot $s_q$ are defined as
$$\mathrm{Sim}(s_q) = \{\, s_i \mid d(s_q, s_i) < \varepsilon_{\text{visual}},\ s_i \in S_d \,\}, \tag{11}$$
where $d(s_q, s_i)$ is the Euclidean distance function that measures the similarity of $s_q$ and $s_i$. The parameter $n$ represents the number of shots in $S_d$. We say $s_q$ and $s_i$ are similar if $d(s_q, s_i) < \varepsilon_{\text{visual}}$ with $s_i \in S_d$, where $\varepsilon_{\text{visual}}$ is a small positive value called the shot similarity threshold. Now, considering that many current videos are the result of synthesis and editing at a later stage, let us denote the query video by $V_q = \{s_1, s_2, \dots, s_i, \dots, s_m\}$ and a database video belonging to the above subclassification by $V_d = \{s_1^1, s_2^1, \dots, s_i^v, \dots\}$, where $m$ is the number of shots in the query video, and $i$ and $v$ represent the serial numbers of the shot and the video, respectively. The shot matching result between $V_q$ and $V_d$ is shown in Figure 3.
Based on the above observations, the database video that is similar to the query video $V_q$ is defined by (12), where $\mathrm{Sim}(s)$ denotes the shot that is similar to shot $s$.
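The shot-level matching can be sketched as follows. `similar_shots` implements the threshold rule on the Euclidean distance between visual descriptors. The video-level aggregation in `rank_videos` is purely an assumption made for illustration (the fraction of query shots that find at least one similar shot in a database video); it is not claimed to be the definition in (12). All names and the data layout are hypothetical.

```python
import math


def similar_shots(query_shot, candidate_shots, eps_visual=0.5):
    """Identifiers of candidate shots closer than eps_visual to the query shot."""
    return [shot_id for shot_id, vector in candidate_shots.items()
            if math.dist(query_shot, vector) < eps_visual]


def rank_videos(query_shots, database, eps_visual=0.5):
    """Rank videos by the fraction of query shots that have a similar shot.

    `database` maps video_id -> {shot_id: descriptor}. The scoring rule is an
    illustrative assumption, not the paper's equation (12).
    """
    scores = {}
    for video_id, shots in database.items():
        matched = sum(1 for q in query_shots
                      if similar_shots(q, shots, eps_visual))
        scores[video_id] = matched / len(query_shots)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In a two-step pipeline, `database` would contain only the videos retained by the audio-based rough retrieval.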

Evaluation Criteria for Experiment.
To analyze the effectiveness of the proposed approach, two major criteria for retrieval systems, namely precision and recall, are used in the experimental evaluation. Precision and recall are defined as follows [5]:
$$\text{precision} = \frac{\text{correct}}{\text{retrieved}}, \qquad \text{recall} = \frac{\text{correct}}{\text{relevant}},$$
where correct is the number of relevant videos among those retrieved, retrieved is the number of all videos retrieved by the proposed approach, and relevant is the ground truth representing the number of all relevant videos in the database. Precision reflects the ability to find the videos the user has in mind, and recall represents the ability to find all of the positive videos in a query session.
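A short sketch of the two criteria, with the retrieved and relevant videos given as collections of identifiers. The function name is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the text specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of video identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four videos of which two are among three relevant ones gives a precision of 2/4 and a recall of 2/3.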

Parameter Settings.
Before evaluating the proposed approach, appropriate parameter settings need to be determined. In our experiments, two groups of parameters need to be assigned. The first group is the distance thresholds, denoted as $\varepsilon_{\text{audio}}$ and $\varepsilon_{\text{visual}}$ in (10) and (11), which measure the similarity between the query video and the database videos from the perspective of audio features and visual features, respectively. The second group is the feature weights. For the audio feature vector described in Section 3.1, considering that audio features are used only for filtering out irrelevant candidate videos, and to keep the balance between them, we weight the six audio features equally during rough retrieval. In the same way, 0.5 is adopted as the default setting for both $w_c$ and $w_t$.
The appropriate setting of the distance threshold $\varepsilon_{\text{audio}}$ was decided by experiments on four types of videos, as shown in Tables 1 and 2.
From the two tables, we can see that the best setting for $\varepsilon_{\text{audio}}$ is 0.5, where the scope of target videos begins to stabilize. If the value of $\varepsilon_{\text{audio}}$ is too small, some relevant videos are probably left out; if it is too big, the number of videos that must be processed in the second step increases.
Applying a similar approach, we find that the appropriate setting for $\varepsilon_{\text{visual}}$ is also 0.5.
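The threshold selection can be sketched as a simple sweep: for each candidate threshold, retrieve everything within that distance and record precision and recall. The function name and data layout are hypothetical, and the distances and labels must come from a labeled validation set, which is not reproduced here.

```python
def sweep_threshold(distances, relevant, thresholds):
    """Return [(eps, precision, recall)] for each candidate threshold.

    `distances` maps video_id -> distance to the query; `relevant` is the set
    of ground-truth video identifiers.
    """
    relevant = set(relevant)
    rows = []
    for eps in thresholds:
        retrieved = {vid for vid, dist in distances.items() if dist <= eps}
        correct = len(retrieved & relevant)
        precision = correct / len(retrieved) if retrieved else 0.0
        recall = correct / len(relevant) if relevant else 0.0
        rows.append((eps, precision, recall))
    return rows
```

A threshold that is too small lowers recall because relevant videos are filtered out, while a threshold that is too large passes more candidates to the second step; the sweep makes that trade-off visible.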

Experiments.
The proposed approach is tested on real-world experimental data acquired from a collection of videos containing sports videos and other social science videos. In our experiment, the sports videos are composed of 830 shots, while the social videos are composed of 227 shots. Three kinds of sports videos are included: badminton, wrestling, and basketball.
To illustrate the retrieval procedure based on the proposed two-step retrieval mechanism, a schematic example is given in Figures 4 and 5.
In online retrieval, a query video including 27 shots is submitted to the system, and audio and visual features are computed for each shot and its key frames, respectively. The two-step retrieval procedure is then carried out: the first step is rough retrieval based on audio features, applied to the 61 raw videos with 1057 shots in the database, after which 18 target videos are retained; the second step is refined retrieval based on visual features, after which 10 videos are retained as the final target videos. Figure 4 shows part of the sequences of the query video and part of the videos in the database, where part (a) shows the query video sequences and part (b) lists raw target videos in the database.
Figure 5 demonstrates the retrieval results, in which part (a) is rough retrieval based on audio features and part (b) is refined retrieval with visual features.
To evaluate the performance of the proposed two-step retrieval mechanism in terms of effectiveness and efficiency, we have conducted four groups of experiments (badminton, wrestling, basketball, and social science, respectively).
Table 3 shows the comparison between the proposed audio-visual-based retrieval method and the visual-based method, in which NS denotes the number of shots in the query video.
We also consider two other situations: one performs rough retrieval with visual features and refined retrieval with audio features, and the other performs retrieval with audio features only. The results are shown in Table 4.
From Table 4 we can see that, in general, audio features work worse than visual features. However, these two kinds of features are highly complementary. When audio and visual features are simply combined, the average precision and recall improve by 16.755% and 11.757%, respectively, compared with the visual-only method, and by 66.135% and 20.2%, respectively, compared with the audio-only method. We also compare this paper's approach with Kong's approach [12], which also supports shot-based retrieval, in terms of precision and recall in Table 5.
In summary, we have compared the retrieval results of the proposed two-step hierarchical retrieval mechanism based on audio-visual features with those of the visual-based, audio-based, and visual-audio-based methods and of a retrieval method from the literature. The comparative results indicate that the proposed method offers better performance.

Conclusion
Considering that audio is a significant part of videos, we present a novel and efficient method that incorporates audio features into query by example video retrieval. In our system, the audio features are first used for rough retrieval to narrow the scope of target videos in the database. Then, the visual features are applied to refine the retrieval. Finally, the system returns the videos that are similar to the user-provided example. Experimental results indicate that the proposed approach performs better in query by example video retrieval than the other retrieval mechanisms compared.

Figure 1: Block diagram of the proposed two-step method for QEVR.

Figure 3: Shot matching between the query video and a database video belonging to the subclassification.

Figure 4: (a) Part sequences of query video. (b) Part raw target video sequences in database.

Figure 5: (a) Rough retrieval based on audio features. (b) Refined retrieval based on visual features.

Table 1: Precision performance on four types of video by varying $\varepsilon_{\text{audio}}$.

Table 2: Recall performance on four types of video by varying $\varepsilon_{\text{audio}}$.

Table 3: Comparison of experimental results between the audio-visual-based and visual-based methods.

Table 4: Comparison of experimental results between the visual-audio-based and audio-only methods.

Table 5: Comparison between this paper's approach and Kong's approach.