SIFT Feature-Based Video Camera Boundary Detection Algorithm



Introduction
Video is the most complex of all multimedia data types: it not only contains the content of a still image but also includes the motion information of targets in the scene and information about how the objective world changes over time.
The huge volume of data and the unstructured nature of film and television make effective retrieval extremely difficult.
Traditional film and television retrieval relies mainly on manually defined keywords. Although this retrieval method is simple, it has many disadvantages. For example, the rich content of film and television is difficult to describe comprehensively with concise words, and manual annotation of a complete segment is inherently subjective. In general, for small film and television units, such as a single shot, users can only locate information accurately by fast-forwarding and rewinding, which also causes additional transmission bandwidth overhead.
Shot boundary detection is the basis of content-based film and television retrieval systems. Many scholars and institutions are conducting research on content-based retrieval and have developed a variety of film and television data retrieval systems, reflecting the main achievements in this field [1,2]. The QBIC (Query By Image Content) system is a typical representative of content-based retrieval systems. It allows users to query large image and film databases using sample images, sketches, color and texture mode selection, shot and target motion, and other information. Its film and television functions include automatic shot segmentation, key frame extraction of shots, static sample query, and text and title content query [3,4]. It provides a set of tools for searching and retrieving images and video on the Web and supports content-based video/image retrieval over the Internet [5,6]. Other work allows the user to retrieve video by visual characteristics and spatiotemporal relationships and integrates text and visual search methods, automatic video object segmentation and tracking, and a rich visual feature library including color, texture, shape, and motion, with interactive query and browsing on the Internet [7,8]. A two-stage algorithm based on visual and audio content automatically divides film and television into a large number of clips with logical semantics, adds a title decoder and word indicator to extract text information, and indexes the edited material [9,10]. Due to the complexity of shots, the problem of shot boundary detection has not been completely solved [11]. Traditional shot boundary detection algorithms mainly include edge-based algorithms, histogram-based algorithms, and the pixel difference method [12].
The edge-based shot boundary detection algorithm performs poorly on frames with complex image content. The histogram-based algorithm computes, from a statistical point of view, the color or gray-level distribution of each frame; it adapts well to low-speed motion of the camera and of objects in the shot, and it also has low computational complexity.
Its disadvantage is that when the light intensity changes or the camera moves rapidly, the histogram becomes distorted, leading to detection errors.
The pixel difference method also has low computational complexity, but it is very sensitive to brightness changes and to lighting changes caused by the motion of the camera and of objects, which easily cause false shot detections [13,14]. In recent years, texture features [15] and scale-invariant feature transform (SIFT) features [16] have also often appeared in the literature on shot boundary detection. The texture feature is a global feature of the image; in shot boundary detection, texture and other features are combined into a feature set to detect the boundary.
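To make the traditional measures concrete, the following is a minimal sketch (not taken from the paper; the function names and the 16-bin quantization are illustrative choices) of the histogram-difference and pixel-difference measures for a pair of grayscale frames:

```python
# Hedged sketch: histogram-difference and pixel-difference measures for
# adjacent frames, assuming frames are 2-D lists of 8-bit gray values.

def gray_histogram(frame, bins=16):
    """Count pixels per gray-level bin (256 levels folded into `bins`)."""
    hist = [0] * bins
    step = 256 // bins
    for row in frame:
        for px in row:
            hist[px // step] += 1
    return hist

def histogram_difference(f1, f2, bins=16):
    """Sum of absolute bin differences, normalized by pixel count."""
    h1, h2 = gray_histogram(f1, bins), gray_histogram(f2, bins)
    n = sum(h1)
    return sum(abs(a - b) for a, b in zip(h1, h2)) / n

def pixel_difference(f1, f2):
    """Mean absolute per-pixel difference (sensitive to motion/lighting)."""
    total, n = 0, 0
    for r1, r2 in zip(f1, f2):
        for p1, p2 in zip(r1, r2):
            total += abs(p1 - p2)
            n += 1
    return total / n

# Two toy 4x4 frames: identical except one bright pixel.
a = [[10] * 4 for _ in range(4)]
b = [[10] * 4 for _ in range(4)]
b[0][0] = 250
print(histogram_difference(a, b))  # 2/16 = 0.125
print(pixel_difference(a, b))      # 240/16 = 15.0
```

The example illustrates the sensitivity gap noted above: a single changed pixel barely moves the histogram measure but shifts the pixel-difference measure strongly.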
A histogram of gradient directions combined with a color histogram has been proposed for shot boundary detection [17,18]. Color and texture features can be extracted by wavelet transform, the difference between adjacent frames defined from the mutual information of the color and texture features, and the shot boundary then determined with a dynamic threshold [19]. SIFT features are local features of the image that remain stable under image scaling, rotation, and brightness changes, so they can effectively reflect the local changes of moving objects. For frames of the same shot, SIFT features match to a high degree, and SIFT-based shot detection [20] can better distinguish moving objects in different gradual shots and find the shot boundary within the permissible error range. However, using SIFT features alone for shot boundary detection is strongly affected by rapid camera motion and changes in ambient light intensity [21,22], and when the SIFT feature alone was used to detect dissolve shots, the detection effect was not ideal. As an important image feature, texture reflects changes in the underlying features of the image, and the similarity between images can be measured by their texture matching degree. The texture concept of a static image can be extended to the temporal domain to form a dynamic texture. Dynamic textures are defined differently in different applications [23,24]; a dynamic texture can be described as an image sequence with temporally stable properties of a moving scene, such as waves or smoke [25,26]. For two or more adjacent frames of the same shot, if objects in the frames undergo local motion, each frame is divided into homogeneous blocks, and the average gradient of each block forms an average gradient matrix.
There is a strong correlation between the gradient matrices of adjacent frames, similar to a "dynamic texture" [27,28]. Through this "movie dynamic texture," the uniform frame blocks are processed and the average gradient of each block is computed to form the average gradient matrix. From the average gradient matrix of the dynamic texture and the dynamic change of the "texture" between adjacent frames [29,30], it is determined whether the content of adjacent frames has changed dramatically. When the ambient light does not change dramatically [31,32], the dynamic texture of adjacent frames within the same shot will not change fundamentally.
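The average-gradient-matrix idea above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the forward-difference gradient operator and the binary change matrix with threshold `delta` are assumptions.

```python
# Sketch: split a frame into m x n blocks, compute the average gradient
# magnitude per block (the "dynamic texture" descriptor), and mark which
# blocks changed between two frames.

def average_gradient_matrix(frame, m, n):
    h, w = len(frame), len(frame[0])
    bh, bw = h // m, w // n
    G = [[0.0] * n for _ in range(m)]
    for bi in range(m):
        for bj in range(n):
            total, cnt = 0.0, 0
            for y in range(bi * bh, (bi + 1) * bh - 1):
                for x in range(bj * bw, (bj + 1) * bw - 1):
                    gx = frame[y][x + 1] - frame[y][x]  # horizontal gradient
                    gy = frame[y + 1][x] - frame[y][x]  # vertical gradient
                    total += (gx * gx + gy * gy) ** 0.5
                    cnt += 1
            G[bi][bj] = total / cnt if cnt else 0.0
    return G

def matrix_change(G1, G2, delta):
    """Binary change matrix: 1 where a block's gradient changed by > delta."""
    return [[1 if abs(a - b) > delta else 0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(G1, G2)]

flat = [[50] * 8 for _ in range(8)]              # uniform frame: zero gradient
edge = [[0] * 2 + [100] * 6 for _ in range(8)]   # frame with a vertical edge
G1 = average_gradient_matrix(flat, 2, 2)
G2 = average_gradient_matrix(edge, 2, 2)
print(matrix_change(G1, G2, delta=5.0))  # [[1, 0], [1, 0]]
```

A dense change matrix between adjacent frames then signals a drastic content change; a sparse one is consistent with the same shot.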
For shot boundary detection, the optical flow method is used to extract motion information from film and television, and a vector quantization method for the optical flow is proposed. According to this quantization, the motion between adjacent images is corrected and the frame difference is obtained; the frame difference is then used to detect candidate shot boundaries. A model matching method is proposed to detect abrupt and gradual shots and thereby obtain the shot boundary information of the film and television. To avoid the difficulty of threshold setting in threshold-based shot detection and to further improve the detection effect, this paper realizes a unified shot boundary detection strategy based on fuzzy clustering, built on an analysis of shot motion characteristics. The algorithm can detect abrupt and gradual shots at the same time without setting a threshold. It effectively reduces the impact of flashes, subtitle insertion, advertising, and other factors on shot detection, as well as the impact of camera motion on boundary detection, thus further enhancing the robustness of shot detection. The detection effect of the algorithm is verified by experiments.

Dynamic Texture Boundary Detection Based on SIFT Feature
Dynamic texture matching aims to find the similarity of the dynamic texture between two movie frames. The overall framework for comparing film and television dynamic textures is shown in Figure 1. Each of the two frames of a movie clip is divided into M × N subregions, and the average gradient of each region is calculated to form the average gradient matrix.
The similarity of frame images, the global change of the image, and local spatial changes can be detected and judged by many methods, among which frame dynamic texture matching works well. From the definition of the film and television dynamic texture, the dynamic texture is defined by the average gradients of subregions. The change of the average gradient in the same area of two adjacent frames reflects the gray-level change of that region; in other words, the change of a subregion's average gradient reflects a local change of the frame. Because the whole image is represented by the average gradient matrix, the change of the average gradient matrix also reflects the overall change between adjacent frames, and the change of a given dynamic texture can be obtained by comparison. SIFT features are local features of the image that remain stable under image scaling, rotation, and brightness changes; the SIFT feature is commonly used in image object matching, and the SIFT features of adjacent frames of the same shot match to a high degree. For cut shots, if Δη i,k+1 is dense and the SIFT features of adjacent frames have a low matching degree, then the adjacent frames are considered to belong to different shots. For a gradual shot with complex structure (including dissolve, fade, and overlap), if the Δη i,k+1 generated by the frames is a sparse matrix and the SIFT features of adjacent frames have a high matching degree, it cannot be fully determined that the adjacent frames belong to different shots, so subsequent frames continue to be compared. Here, r < 24, that is, the value is less than the limit frame-resolution of the human eye, so the shot boundary detection error is allowed to stay below this limit. The limit frame-resolution of the human eye differs slightly between different types of films and television programs; in this paper, r = 20 is adopted. Since the detected extreme points are unstable, further processing is required, that is, removal of pixels with asymmetric DoG curvature. SIFT employs three-dimensional quadratic function fitting and precise localization of the scale and position of extremum points to improve anti-noise capability and enhance matching stability.
Extreme points with low contrast are removed. SIFT features describe the image well and are unaffected by rotation, zoom, and translation.
Therefore, SIFT features are used to characterize film and television images: when comparing frames, it can be ensured that any difference between pictures is caused not by rotation, contraction, or translation of a single picture, but by real differences in content.
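The matching step can be sketched with the standard nearest-neighbour ratio test used for SIFT descriptors (Lowe's ratio test). The toy 4-D "descriptors" below stand in for real 128-D SIFT vectors, and the 0.8 ratio is the conventional value, not a parameter from this paper:

```python
# Minimal sketch of SIFT-style descriptor matching between two frames.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Keep matches whose best distance is clearly below the second best."""
    matches = []
    for i, d1 in enumerate(desc1):
        dists = sorted((euclidean(d1, d2), j) for j, d2 in enumerate(desc2))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

def match_degree(desc1, desc2):
    """Fraction of frame-1 features matched in frame 2."""
    return len(ratio_test_matches(desc1, desc2)) / max(len(desc1), 1)

frame1 = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
frame2 = [(1.0, 0.1, 0.0, 0.0),   # near-duplicates: same shot
          (0.0, 1.2, 0.1, 0.0)]
frame3 = [(5.0, 5.0, 5.0, 5.0),   # ambiguous, unrelated features
          (5.0, 5.0, 5.1, 5.0)]
print(match_degree(frame1, frame2))  # 1.0  (high matching degree)
print(match_degree(frame1, frame3))  # 0.0  (candidate shot boundary)
```

A high match degree supports the same-shot hypothesis; a low one, together with a dense gradient-change matrix, signals a cut.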
Shot boundary detection relies mainly on the similarity between adjacent frames within a shot. When the shot changes, this similarity is destroyed and the difference between frames is generally large. Therefore, the basic idea of shot detection is to compute the difference between frames and compare it with a threshold; if the difference is large enough, a shot transition is judged to have occurred. When the shot changes abruptly, the distance between frames appears as a sharp peak. During a gradual transition, there is only a small bump in the waveform, and the difference is not as obvious as for a cut.
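The thresholding idea above can be written in a few lines. The difference values and the threshold below are invented for illustration:

```python
# Sketch: flag a cut wherever the inter-frame difference spikes above a
# threshold. Gradual transitions produce smaller bumps that this simple
# rule misses, which motivates the combined approach of this paper.

def detect_cuts(frame_diffs, threshold):
    """Return indices i where the transition frame i -> i+1 is a cut."""
    return [i for i, d in enumerate(frame_diffs) if d > threshold]

# Synthetic difference signal: small within-shot noise, one sharp peak.
diffs = [0.02, 0.03, 0.02, 0.85, 0.03, 0.04]
print(detect_cuts(diffs, threshold=0.5))  # [3]
```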
After shot boundary detection, test indices can be used to objectively evaluate the detection results, measure different shot detection algorithms, and help select the right algorithm. In shot detection, recall and precision are the two most basic and commonly used evaluation parameters. According to the processing method, shot boundaries can be divided into two types: abrupt boundaries and gradual boundaries, as shown in Figure 2. On the basis of shot boundary detection, the next step is to analyze the film and television shot by shot.
Therefore, the segmentation accuracy after shot detection is particularly important for key frame extraction: the accuracy of shot boundary detection directly affects the subsequent key frame extraction and data analysis.
The key frame extraction algorithm based on image content analysis measures image similarity according to changes in underlying image features such as color, shape, and texture. The specific steps are as follows: (1) choose the first frame of the shot as a key frame and treat it as the comparison frame; (2) calculate, frame by frame, the degree of difference between each image frame and the comparison frame; when a frame shows a large change, that is, its difference from the comparison frame is greater than a preset threshold T, that frame is taken as a key frame and becomes the new comparison frame.
The comparison then continues between subsequent frames and the new comparison frame, repeating these two steps until all frames in the film and television have been examined; all selected key frames form the final key frame set of the current film and television.
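Steps (1) and (2) above translate directly into code. The mean-absolute-difference metric below is a placeholder for whatever frame difference is used:

```python
# Sketch of comparison-frame key frame extraction: the first frame is a key
# frame; any frame whose difference from the current comparison frame
# exceeds threshold T becomes a new key frame and the new comparison frame.

def difference(f1, f2):
    """Mean absolute difference of gray values (illustrative metric)."""
    return sum(abs(a - b) for a, b in zip(f1, f2)) / len(f1)

def extract_key_frames(frames, T):
    keys = [0]                  # step (1): first frame is a key frame
    comparison = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        if difference(frame, comparison) > T:   # step (2)
            keys.append(i)
            comparison = frame  # becomes the new comparison frame
    return keys

# Frames as flat gray-value lists; a content jump occurs at frame 2.
frames = [[10, 10, 10], [12, 11, 10], [200, 210, 205], [201, 209, 204]]
print(extract_key_frames(frames, T=50))  # [0, 2]
```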
Figure 3 shows the outline of the film and television editing algorithm based on combined SIFT features. Three aspects are introduced in detail below: subjective features based on audience comments, objective features based on visual effects, and film and television editing generation.
The following symbols are defined: C represents the total set of highlights, l represents a single highlight, and S represents a subset of the highlight set composed of multiple highlights, referred to as a fragment set. The goal of this section is to find the optimal fragment set.
There are also a large number of flashes in film and television, and general flash detection only excludes the influence of the largest change in a flash sequence on shot detection; it does not consider the influence of the whole flash sequence. Analysis of false detections shows that the mean of the inter-frame difference tends to zero when there are many small changes between frames.

Key Frame Extraction of Camera Lens Boundary Detection Based on SIFT Feature Fusion

Because each frame of film and television differs in brightness, color, and other characteristics, each frame can be mapped to a corresponding space based on these differences, with each frame corresponding to a point at different coordinates. A point-cluster classification method is adopted to judge the correspondence between images in the film and television. The division of point clusters is based on the density value of each point and its distance to the other corresponding points, rather than on specific coordinates in the two-dimensional space. In this section, the similarity between images is used to measure the corresponding distance: the smaller the similarity, the greater the corresponding distance value.
The correspondence is built as follows. SIFT is adopted to obtain the distance between film and television frames, and each frame is then mapped to a point in the two-dimensional space. Next, the frames are divided into clusters through a clustering operation. In the decision diagram, the selected cluster center points correspond to high values, indicating that each selected point is the center of its class, that is, each center point represents the characteristics of the corresponding group. The frame corresponding to each class center is selected to express the information of that class; this image is therefore chosen as the key frame to describe the main information of the film and television. The process of selecting key frames using the SIFT image frame-mapping method is as follows. Input: the marked image sequence of the source video. Output: the set of key frames of the source film and television. Step 1. Calculate the statistics and texture characteristics for each image. In expressing the main information through a film and television summary, choosing the number of key frames is important. If too many frames are used, redundancy between frames results; if too few are selected, the expression of complete film and television information suffers. The choice of the number of frames is another objective evaluation element of the film summary. In the density peak clustering algorithm, the number of clusters is obtained from the decision graph through human-computer interaction. For the clustering used in the film and television summary proposed in this section, the interaction burden grows when the film and television is long and the number of frames is large; if the interactive form is used again, the number of categories cannot be determined automatically and quickly.
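The density-peak idea behind the decision graph can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions: frames are reduced to 1-D "features" and the cutoff distance is made up; the paper's real distances come from SIFT matching.

```python
# Sketch of density-peak clustering for key frame selection: each frame
# gets a local density rho (neighbours within a cutoff) and a distance
# delta to the nearest denser frame; frames with high rho * delta are the
# cluster centres, i.e. the key frame candidates.

def density_peaks(points, cutoff):
    n = len(points)
    dist = [[abs(points[i] - points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))
    return rho, delta

# Two tight groups of frame features, around 0 and around 100.
feats = [0.0, 1.0, 2.0, 100.0, 101.0, 102.0]
rho, delta = density_peaks(feats, cutoff=1.5)
gamma = [r * d for r, d in zip(rho, delta)]
centres = sorted(range(len(feats)), key=lambda i: -gamma[i])[:2]
print(sorted(centres))  # [1, 4]: the middle frame of each group
```

The two selected centres are the densest frame of each group, matching the intuition that one representative key frame is kept per cluster.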
Along the chosen sampling path, the closer a pixel is to the reference point, the more pixels are used for comparison. Choosing the sampling path in this way expresses the illumination characteristics of the whole image while also reflecting its main content. Figure 4 shows the brightness-value curves of abrupt and fading shots. When fading in, the image brightness curve rises gradually and visibly; for the abrupt shot in Figure 4, the brightness curve shows a sudden change, with a large difference before and after, which is conducive to rapid detection. When significance values are measured at multiple scales, mean values are used to enhance the contrast between significant and nonsignificant areas. The significance threshold is acquired from the local significance subregion in the significance map, and the significance value of each pixel outside the subregion is obtained by Euclidean-distance weighting of the adjacent significant pixels, yielding a new significance value. In this way, the significance value near the significant target is increased and that of the background is weakened. The acquired significant regions are of great value for image analysis: we strengthen the analysis of the important content of the image while ignoring minor parts, which is the key to improving efficiency and optimizing the effect. After the significant area of the image is obtained, processing it can increase the difference between different shots. Mutual information represents the information correlation between two systems, that is, how much information one system covers of the other. Image mutual information measures how much information two images share, and the switching of a gradual shot is a process of merging the contents of the front and rear shots. Therefore, we use mutual information to measure the similarity between images. Mutual information is defined as follows:
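The mutual-information measure can be sketched by estimating marginal and joint gray-level distributions and computing I(X;Y) = Σ p(x,y) log₂(p(x,y) / (p(x)p(y))). The quantization to 4 gray levels is an illustrative choice, not a parameter from the paper:

```python
# Sketch: mutual information between two grayscale frames (flat lists of
# 8-bit values), estimated from quantized gray-level co-occurrences.
import math
from collections import Counter

def mutual_information(f1, f2, levels=4):
    q = 256 // levels
    pairs = [(a // q, b // q) for a, b in zip(f1, f2)]
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy * n * n / (px[x] * py[y]))
    return mi

same = [10, 10, 200, 200]
print(mutual_information(same, same))        # 1.0: identical frames share all info
print(mutual_information(same, [10] * 4))    # 0.0: constant frame shares nothing
```

High mutual information between adjacent frames indicates the same shot; a drop in mutual information marks a candidate boundary.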

Example Verification
To verify the shot boundary detection effect of the method proposed in this paper, two groups of film and television material were selected. The first group consists of 150 film and television clips from the Internet, including cut shots and a variety of gradual shots with complex structure (fade, dissolve, and overlap); the types include film clips, sports footage, and news footage. The second group uses the authoritative international TRECVID2003 evaluation collection as the test material, from which 6 classic segments were selected, each containing cut shots and gradual shots. The clips include black-and-white and color shots, with specific parameters shown in Table 1.
To measure the detection effect, recall, precision, and F1, a comprehensive index combining recall and precision, were used. The higher the F1 value, the better the detection effect.
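The evaluation indices above, written out (the counts below are invented for illustration):

```python
# Recall, precision, and the F1 score that combines them, for shot
# boundary detection evaluation.

def recall(detected_correct, missed):
    return detected_correct / (detected_correct + missed)

def precision(detected_correct, false_alarms):
    return detected_correct / (detected_correct + false_alarms)

def f1(r, p):
    return 2 * r * p / (r + p)

r = recall(80, 20)       # 80 true boundaries found, 20 missed -> 0.8
p = precision(80, 10)    # 10 false detections -> 80/90
print(round(f1(r, p), 3))  # 0.842
```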
In this algorithm, the choice of parameters has a great influence on the experimental results. The value of δ reflects the degree of gradient change in subregions, the value of δ1 reflects the sparsity of the gradient change matrix, and the value of δ2 reflects the matching degree of SIFT features of adjacent frames.

Influence of δ Value on Experimental Results.
Without considering the contribution of frame SIFT features to shot boundary detection, different δ values affect the detection results differently. Figure 5 shows the trend of recall, accuracy, and precision when δ = 0.75; it can be seen from Figure 5 that good recall and precision are achieved when δ = 0.4. Similarly, different ε values affect the detection results differently; Figure 6 shows how recall and precision change with ε when δ = 0.75.
In this paper, SIFT features and dynamic texture have different effects on shot detection. To verify their respective contributions, the proposed algorithm is compared with shot detection using the SIFT feature alone and using the dynamic texture alone. The first group was used to test the three algorithms. In the experiment, the subregion size of the frame image was 13 × 13, with δ = 0.3 and ε = 0.6 as empirical values; the higher the AVF value, the better the detection effect. Table 2 shows that the detection results of the proposed algorithm are better than those using the SIFT feature or dynamic texture alone.
In the experiment, the film and television were first segmented manually, and the segmentation results were taken as the reference shot boundaries. The criterion for boundary judgment differs somewhat for different types of shots: for gradual shots, the position of the boundary is difficult to define accurately, and an error within 20 frames is allowed. The experimental results of the proposed algorithm are shown in Table 3.
After verifying the relationship between the ratio and the accuracy of cut detection, a further experiment verified the influence of the partitioning ratio on the fitting-feature error rate at gradual transitions. Figure 7 shows the influence of the ratio on the fitting error rate for the extraction of the gradual region.
Figure 7 also shows that the error of the gradual-transition fitting features reaches a minimum when the block ratio is close to 0.6. The smaller the error rate of the fitting features, the closer the arched waveform generated by the feature description is to the standard waveform shape. This is conducive to gradual-transition detection and improves its accuracy.
To compare robustness, several different types of video clips downloaded from the network in MPG format were used, including advertisements, video clips, and MTV. A total of 7726 frames were used, containing 109 shot boundary transitions, of which 85 were abrupt shots and 23 were gradual shots. The algorithm presented in this paper was compared on these movie sequences with the histogram-based double-threshold method and the pixel-based double-threshold method. The specific experimental results are shown in Table 4: algorithm 1 is the algorithm presented in this paper, algorithm 2 is the histogram-based double-threshold method, and algorithm 3 is the pixel-based double-threshold method.
As can be seen from Table 4, for different types of film and television clips, algorithm 1 achieves good detection results, with overall recall and precision above 72%, while algorithms 2 and 3 give mixed results on different material, which confirms the general adaptability of algorithm 1. In addition, across the different test types, the algorithm performs well on advertisement and movie-clip material. This is because the algorithm includes robustness improvements, so even when advertisement and movie-clip video involves complicated situations such as flashing, camera movement, subtitle changes, and noise interference, good results are still obtained.
Figure 8 clearly shows that algorithm 1 has high recall and precision, especially high recall, because the algorithm reduces the influence of subtitle and icon insertion on shot detection when computing and selecting clustering features. On the basis of clustering, preprocessing was carried out according to the characteristics of abrupt and gradual transition boundaries, and the results were further analyzed. During shot detection, flash, camera movement, and noise detection were added to reduce their impact, and finally the shot boundary detection was realized. The method therefore has better robustness to common interference and obtains better detection results on the same film and television clips.

Conclusion
A new method of lens edge detection based on dynamic texture and SIFT features is proposed for many kinds of film and television data.
The algorithm mainly includes four parts: construction of the film and television dynamic texture, dynamic texture matching, SIFT feature matching of frame images, and false detection processing.
The film and television dynamic texture takes into account both local and global changes of the frame image. The presented method is effective for both cut and gradual shot boundary detection, especially for dissolve shots, and it also reduces the influence of lighting on shot detection. However, its boundary detection for fluid objects (such as seawater in film and television) needs improvement, and the adaptive parameter selection needs further refinement. In the detection process, to effectively reduce the influence of subtitles and other factors on shot detection, an improved histogram segmentation method is constructed: the frame difference of each block is taken as the feature, and the weight of each block as the feature weight for fuzzy clustering. On the basis of clustering, the abrupt and gradual boundary features are preprocessed, and the results are further analyzed to exclude the influence of flash and camera motion on shot detection, finally realizing shot boundary detection. However, due to the complexity of film and television, this algorithm still misses some boundaries and produces some false detections, which require further study. Film and television carry many kinds of information, including images, text, and sound; most current detection algorithms are limited to image features and detect boundaries through changes of image content. In the future, algorithms could make use of more information, such as superimposed text and audio.

Step 2. Obtain the distance between each pair of points, and use the SIFT form to map the film frames to points in the two-dimensional space.
Step 3. Collect the local density value ρ and the corresponding distance value of each point.
Step 4. Based on the local density value ρ, draw the required decision graph as a function of distance, and determine the number of point clusters in the decision graph interactively.
Step 5. Define the image subsets contained in the clusters according to the relationship between film and television frames in the spatial point mapping.

Figure 3: Outline of the film and television editing algorithm based on SIFT features.

Figure 7: Block ratio and fitting characteristic error.

Figure 8: Comparison of recall and precision of the three algorithms.

Table 1: Experimental data set parameters.

Figure 5: Influence on recall rate and accuracy of predicted results when δ = 0.75.

Figure 6: Influence of ε on recall rate and accuracy of predicted results when δ = 0.3.

Table 2: Results of the three methods on the test set of sports films and films.

Table 4: Comparison of detection results of different types of film and television clips.