Existing key-frame extraction methods are oriented mainly toward video summary, while the index task of key-frames is ignored. This paper presents a novel key-frame extraction approach that serves both video summary and video index. First, a dynamic distance separability algorithm is proposed to divide a shot into subshots according to semantic structure; then appropriate key-frames are extracted from each subshot by singular value decomposition (SVD). Finally, three evaluation indicators are proposed to evaluate the performance of the new approach. Experimental results show that the proposed approach achieves a good semantic structure for semantics-based video index and, at the same time, produces a video summary consistent with human perception.
In the last few years, the rapid growth of video data has called for efficient techniques for browsing and indexing these data [
Early key-frame extraction approaches can be classified into two categories: those based on interframe difference and those based on clustering. In the interframe-difference approaches, a new key-frame is extracted only if the interframe difference exceeds a certain threshold [
The motion-based approaches regard motion as an intrinsic attribute of video to which human eyes are highly sensitive; they therefore take motion events and camera operations into account in key-frame extraction. In [
The approaches based on visual attention attempt to find semantically relevant key-frames by simulating the human visual perception mechanism. These approaches usually combine several representative feature maps (values) into a single saliency map (value), which serves as an indication of the attention level. Lai and Yi [
All semantically relevant methods attempt to find key-frames by recognizing video semantic content; however, automatic understanding of semantic content is beyond the reach of contemporary computers, and many problems remain unsolved, especially the following two. First, all existing methods focus on video summary and ignore the index task of key-frames. A new key-frame extraction approach is needed that accounts for both tasks; this is the future direction of development and remains an important challenge, in which establishing a semantic structure for a video is the essential part. Second, current methods extract only the frame at each peak point, which easily leads to content jumps in the video summary. Some intermediate frames, exhibiting continuity and similarity in video content, are therefore needed to help viewers infer the original video content.
To address these problems, this paper proposes a new key-frame extraction method, whose basic concept can be described as follows. A shot is divided into several clips (hereafter called subshots) in chronological order according to the overall discrepancies between the video frames themselves. Each subshot consists of frames with similar content, and there are great visual differences between subshots. Since similar video content expresses the same semantic element, subshot segmentation also amounts to semantic structure division, which is the basis of video index. After subshot segmentation, proper key-frames from the same subshot can ensure visual continuity. If each frame is represented as a high-dimensional feature vector, the frames of a subshot form a matrix: the rank of this matrix indicates how many key-frames are needed, and a linearly independent subset of its columns locates them.
The algorithm can be separated into four steps: (1) represent each frame by a color feature vector in HSV space; (2) divide the shot into subshots with the dynamic distance separability algorithm; (3) determine the number of key-frames in each subshot from the rank of its frame matrix; (4) locate the key-frames by selecting a linearly independent subset of the matrix columns.
The remainder of this paper is organized as follows. The subshot segmentation method is described in Section
Compared with other color spaces, the HSV color space is the closest to the characteristics of human vision [
The three color components are then synthesized into a one-dimensional feature vector for each frame.
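As an illustration, the following sketch builds such a per-frame feature. It is a minimal example rather than the paper's exact implementation: the non-uniform quantization levels (8 hue, 3 saturation, 3 value bins) and the function name `frame_feature` are assumptions chosen for concreteness.

```python
import cv2
import numpy as np

def frame_feature(frame_bgr, h_bins=8, s_bins=3, v_bins=3):
    """Synthesize the H, S, and V components of one frame into a single
    one-dimensional histogram vector with h_bins * s_bins * v_bins entries."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # Quantize each channel to a bin index (OpenCV ranges: H in 0-179, S and V in 0-255).
    h_q = h.astype(np.int32) * h_bins // 180
    s_q = s.astype(np.int32) * s_bins // 256
    v_q = v.astype(np.int32) * v_bins // 256
    # Combine the three indices into one code per pixel, then histogram the codes.
    code = h_q * (s_bins * v_bins) + s_q * v_bins + v_q
    hist = np.bincount(code.ravel(), minlength=h_bins * s_bins * v_bins)
    return hist.astype(np.float64) / hist.sum()  # normalize by the pixel count
```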
The frames within a subshot, showing similar video content, can be considered one class, and different subshots can be viewed as different classes. According to the distance separability criterion, the greater the between-class distance and the smaller the within-class distance, the higher the separability of two classes. Applied to subshot segmentation, the criterion amounts to finding the border frames that maximize the between-class distance between the two subshots on either side of a border frame while minimizing the within-class distance within each subshot. This paper extends this criterion to a dynamic distance separability algorithm for subshot segmentation, whose process can be described as follows: a sliding window of fixed length is moved over the shot frame by frame, the frames in the two halves of the window are treated as two classes, and the separability between the two classes is evaluated at every window position.
This approach uses dynamic distance separability to achieve subshot segmentation. It tracks video content changes through the overall differences among frames, rather than through particular factors such as objects, motions, or other physical characteristics, which ensures accuracy and robustness. In addition, similar video content carries the same semantic element; therefore subshot segmentation based on video content is equivalent to subshot segmentation based on semantic structure.
The procedure is as follows. First, establish a sliding window of the chosen length and treat the frames in its two halves as two classes. Second, calculate the mean vector of each class. Third, compute the separability between the two classes. There are various definitions of distance separability criteria; in practice, the most widely used criterion is based on the within-class dispersion matrix and the between-class dispersion matrix. The greater the between-class dispersion and the smaller the within-class dispersion, the better the class separability, as the criterion below makes explicit.
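Since the original equations are not legible in this copy, a standard two-class form of the criterion is reproduced here for reference; the symbols $\omega_1$, $\omega_2$ (the two half-window classes), $m_1$, $m_2$ (their mean vectors), and $n_1$, $n_2$ (their sizes) are our notation, not necessarily the paper's:

$$
S_w = \sum_{j=1}^{2} \frac{1}{n_j} \sum_{x \in \omega_j} (x - m_j)(x - m_j)^{T},
\qquad
S_b = (m_1 - m_2)(m_1 - m_2)^{T},
\qquad
J = \frac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)} .
$$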
As the sliding window is moved along the shot frame by frame, the criterion value $J$ forms a curve over time. In the calculation of $J$, only the frames inside the current window are involved, so content changes are measured locally. When the window straddles a content boundary, the separability between its two halves reaches a peak; subshot borders therefore correspond to local maxima of the $J$ curve. Next, we need to determine the frame number of each local maximum. Assume a new function derived from the sign of the first-order difference of the $J$ curve: a local maximum corresponds to a change of this sign from positive to negative. At the local-maximum points of the $J$ curve, the separability between the two half-windows is locally greatest, and the frame at the window center is taken as a candidate border. Local maxima of small amplitude may be caused by noise or minor content fluctuation, so only peaks whose $J$ value exceeds a preset threshold are accepted as subshot borders.
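A minimal sketch of this segmentation step is given below. It assumes the trace-based scalar criterion stated above; the half-window length `half_len`, the epsilon guard, and the function names are illustrative choices, not the paper's implementation.

```python
import numpy as np

def separability_curve(features, half_len=8):
    """Compute the separability value J at each candidate border.
    `features` is an (n_frames, dim) NumPy array of frame feature vectors;
    J = trace(S_b) / trace(S_w) for the two half-windows around each frame."""
    n = len(features)
    J = np.zeros(n)
    for t in range(half_len, n - half_len):
        a = features[t - half_len:t]        # frames before the candidate border
        b = features[t:t + half_len]        # frames after the candidate border
        m_a, m_b = a.mean(axis=0), b.mean(axis=0)
        # Within-class dispersion: summed squared deviation around each class mean.
        s_w = ((a - m_a) ** 2).sum() + ((b - m_b) ** 2).sum()
        # Between-class dispersion: squared distance between the class means.
        s_b = ((m_a - m_b) ** 2).sum()
        J[t] = s_b / (s_w + 1e-12)          # small epsilon avoids division by zero
    return J

def find_borders(J, threshold):
    """Accept a frame as a subshot border if the J curve has a local maximum
    there and the peak exceeds the noise threshold."""
    return [t for t in range(1, len(J) - 1)
            if J[t] > J[t - 1] and J[t] >= J[t + 1] and J[t] > threshold]
```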
Existing algorithms consider only the spatial information within a frame, not the temporal characteristics between frames; it is therefore difficult to determine the number and the location of key-frames as a whole. If each frame is represented as a column vector and the frames of a subshot are stacked into a matrix, spatial information is preserved within the columns while temporal relations appear across them, so both questions can be answered jointly from the matrix itself.
Determine the number of key-frames; namely, determine the rank of the frame matrix, since the rank counts the linearly independent frames and hence the distinct content states that must be represented.
For a subshot of frames whose feature vectors form the columns of a matrix $A$, the singular value decomposition (SVD) factors it as $A = U \Sigma V^{T}$, where the diagonal entries of $\Sigma$ are the singular values in descending order. A classical theorem states that the rank of a matrix equals the number of its nonzero singular values. In practice, the smallest singular values are dominated by noise, so we use the accumulated singular-value energy and take as the effective rank the smallest number of leading singular values whose share of the total energy exceeds a preset threshold.
For a static video, as the frames are very close in content, they are in an approximately linear relation, which means that the rank of the frame matrix is close to one and a single key-frame suffices.
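Formally, and with the caveat that the energy-ratio notation below ($\rho$, $\theta$, $k^{*}$) is introduced here for illustration rather than recovered from the paper:

$$
A = U \Sigma V^{T},
\qquad
\operatorname{rank}(A) = \#\{\, i : \sigma_i > 0 \,\},
\qquad
\rho(k) = \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sum_{i} \sigma_i^{2}},
\qquad
k^{*} = \min\{\, k : \rho(k) \ge \theta \,\}.
$$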
Locate the key-frames; namely, select a linearly independent subset of the matrix columns and take the corresponding frames as key-frames.
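The following sketch combines both steps for one subshot. It is a minimal illustration, not the paper's implementation: the energy threshold `energy_ratio` and the use of column-pivoted QR to pick a linearly independent column subset are assumptions; any rank-revealing selection scheme would serve the same purpose.

```python
import numpy as np
from scipy.linalg import qr

def extract_keyframes(features, energy_ratio=0.9):
    """Sketch of the SVD step for one subshot: `features` is an
    (n_frames, dim) array of frame feature vectors."""
    A = np.asarray(features, dtype=np.float64).T        # columns = frames
    sigma = np.linalg.svd(A, compute_uv=False)          # singular values, descending
    energy = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    k = int(np.searchsorted(energy, energy_ratio)) + 1  # effective rank = number of key-frames
    # Rank-revealing QR: the first k pivot columns form a (nearly)
    # linearly independent subset of the frame matrix.
    _, _, piv = qr(A, pivoting=True)
    return sorted(piv[:k])                              # frame indices of the key-frames
```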
There are three parameters in the proposed algorithm that must be determined: the window length, the threshold for accepting peaks of the separability curve, and the threshold for the effective rank. The window length trades sensitivity against robustness: a short window reacts quickly to content changes but is easily disturbed by noise, whereas a long window smooths the curve but may merge adjacent subshots.
To evaluate the performance of the proposed method, various test videos are downloaded from the standard video library OPENVIDEO. Six shots with distinctly different characteristics are selected in this section.
The first video is a nearly static shot with little change in content; it is segmented into a single subshot, from which one key-frame is extracted.
Key-frames extracted from video
The second video is a shot with fast camera motion; it is divided into five subshots, from which six key-frames are extracted.
Key-frames extracted from video
The third video is a shot with object movement, in which a man in white comes to a corner and waits for another man’s arrival and, after a short talk, returns by his original route. According to the above semantic structure, the video shot is divided into two subshots, with the results shown in Figure
Key-frames extracted from
The fourth video is a shot with both object movement and camera movement, in which a girl comes from afar, suddenly stops, then looks around, and finally runs in the opposite direction. When the girl looks around, her face is in a close-up. As shown in Figure
Key-frames extracted from
Besides object and camera motion, artificial editing effects can also give rise to video content changes. The fifth video is a shot with special effects, in which many ordered plates gradually come together into a stack and then disappear suddenly. The extraction results are shown in Figure
Key-frames extracted from
The last video is a shot with scrolling captions, in which a yacht heads from the shore out to sea; suddenly a motorboat comes up fast from behind and then gradually moves out of sight together with the yacht. In addition, large-area captions scroll rapidly across the screen throughout the shot. The extracted key-frames are displayed in Figure
Key-frames extracted from
Table  summarizes the extraction results for the six test videos, together with their main characteristics.
Extraction results for different videos.
| Video name | Total frames | Subshots (manual) | Subshots (automatic) | Number of key-frames | Video characteristics |
|---|---|---|---|---|---|
| | 329 | 1 | 1 | 1 | Little change |
| | 84 | 5 | 5 | 6 | Fast camera motion |
| | 283 | 2 | 2 | 4 | Object motion |
| | 107 | 4 | 4 | 6 | Both camera and object motion |
| | 76 | 3 | 4 | 4 | Special effects |
| | 328 | 3 | 4 | 6 | Scrolling captions |
Due to the absence of well-defined objective criteria [ ], the extraction results are evaluated with the three indicators proposed in this paper: semantic structure, visual continuity, and redundancy (repetition).
Analysis of Table  shows that the automatically detected subshot numbers agree with the manual segmentation for the first four videos, while the shot with special effects and the shot with scrolling captions are each oversegmented by one subshot.
As the analysis above shows, subshot segmentation based on video content and subshot segmentation based on semantic meaning are not fully identical. Fortunately, in most cases, video content and semantic meaning basically coincide. Therefore, the method in this paper can carry out subshot segmentation based on semantic structure, which captures both temporal and semantic independence between frames.
To verify the robustness of the proposed algorithm, 100 video shots are clipped from four different types of videos: lecture, news, documentary, and entertainment. We adopt the mean opinion score (MOS) criterion and recruit twenty testers to score the key-frame extraction results subjectively. Every tester is given five shots covering the four types of videos. After viewing the extraction results and the original videos, the testers are asked to score the extraction results in terms of structure, continuity, and repetition. A scale of 0.0–1.0 is used, where 0.0 represents great dissatisfaction and 1.0 represents great satisfaction. The scores from all testers are averaged to yield the assessment outcome shown in Table
Assessment outcome for different types of videos.
Video type | Structure | Continuity | Repetition |
---|---|---|---|
Lecture | 0.91 | 0.93 | 0.95 |
News | 0.92 | 0.90 | 0.94 |
Documentary | 0.90 | 0.92 | 0.91 |
Entertainment | 0.87 | 0.91 | 0.84 |
Table  shows consistently high scores across the four video types for all three indicators, which indicates that the proposed method is robust to different video genres.
This paper is the first study to address both video summary and video index. The new method achieves good semantic structure, good visual continuity, and low redundancy; it not only provides a video summary consistent with human perception but also provides an index for further video operations and analysis.
Note that, because of the complexity and diversity of videos, the proposed algorithm cannot be proved to perform well and stably on all videos; more experiments are needed to delimit its area of applicability. In addition, the underlying cause of oversegmentation is overly simplified feature selection; future research should therefore concentrate on composite feature selection to resist scrolling captions.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the Natural Science Foundation of Shanxi Province, China (Grant no. 2011011012-2), and Taiyuan Special Fund for Science and Technology Talents (Grant no. 120247-28). Acknowledgments are due to Cao Changqing, Duan Hao and Yang Qian for their collaboration in the realization of field experiments. The authors would also like to thank the reviewers for their time and their valuable comments.