Coverless Video Steganography Based on Audio and Frame Features

The coverless steganography based on video has become a research hotspot recently. However, existing schemes usually hide secret information based on single-frame features of video and do not take advantage of other rich features. In this work, we propose a novel coverless steganography that makes full use of the audio and frame image features of video. First, three features are extracted to obtain hash bit sequences: the DWT (discrete wavelet transform) coefficients and short-term energy of the audio, and the SIFT (scale-invariant feature transform) feature of the frame images. Then, we build a retrieval database according to the relationship between the generated bit sequences and the three features of the corresponding videos. The sender divides the secret information into segments and sends the corresponding retrieval information and carrier videos to the receiver. The receiver uses the retrieval information to recover the secret information from the carrier videos. The experimental results show that the proposed method achieves larger capacity, less time cost, a higher hiding success rate, and stronger robustness compared with existing video-based coverless steganography schemes.


Introduction
In today's era of frequent information leakage and theft, the safe transmission of confidential information is extremely important. Information hiding technologies help solve the problem of secure transmission and effective recovery of secret information. Traditional steganography schemes mainly embed secret information by changing specific features of the carrier [1][2][3][4]. However, because the carrier is modified, this kind of algorithm risks detection by steganalysis. Coverless steganography instead sets up a specific mapping relationship with the characteristics of carriers to hide secret information. It offers better concealment than traditional information hiding algorithms since the carrier is not changed. Existing coverless steganography schemes mainly fall into two categories: coverless steganography based on text [5][6][7][8] and coverless steganography based on image [9][10][11][12].
Coverless text steganography usually uses the unique features of text (such as word frequency and keywords) to hide secret information, and was first proposed by Chen et al. [13]. They vectorized and segmented the Chinese secret information and obtained the retrieval information corresponding to the secret information from a Chinese text retrieval database. Using the statistical features of text, Zhang et al. [14] selected normal texts containing the secret information retrieved from a text database. Different from the aforementioned methods, Wang et al. [15] used the nonrepetitive and diverse lyrics generated by a GAN to hide secret information, which achieved good perceptibility and embedding rate. Coverless image steganography was first proposed by Zhou in 2015 [16]. The key point of this kind of algorithm is to extract specific features of the image, such as texture, colour, and shape, to establish a specific mapping relationship for hiding the secret information. Zhou et al. [16] divided the secret information into bit sequences and then sent the specific carrier images and auxiliary information matched with the bit sequences to the receiver. Zheng et al. [9] divided the carrier images into blocks and used the direction information of scale-invariant feature transform (SIFT) feature points to hide the segmented secret bit sequences. Similarly, Zhou et al. [10] used the histogram of oriented gradients (HOG) of nonoverlapping image blocks to hide the secret bit sequences. Owing to the powerful feature extraction ability of deep neural networks [17], some researchers introduced them to coverless steganography. Liu et al. [18] used DWT to transform images and the DenseNet network to recommend carrier images. Luo et al.
[19] used the labels of the objects in the carrier image to generate hash bit sequences and hid secret information by sending carrier images containing multiple objects. The receiver used Faster RCNN [20] to extract the labels of the objects in the carrier images to recover the secret information. In addition to mapping hash bits from image features, some research hid secret information through image generation. Wu et al. [11] set up a mapping relationship between secret information and texture images and used the texture synthesis process to hide secret information. Chen et al. [12] divided natural images into multiple image blocks, each of which can represent 1 bit of secret information, and retrieved the corresponding image blocks to synthesize the carrier image according to the secret information. Generative adversarial networks have caused many technological changes in the field of computer vision [21], and their ability to generate realistic natural images has been widely recognized. Li et al. [22] used an encoder to extract the content vector of the secret image and input it into a generative model to generate a realistic and natural carrier image under a penalty mechanism. Yu et al. [23] used the vectorized secret information to directly control the generative model to generate the carrier image and introduced an attention mechanism to correct image distortion and background anomalies, improving concealment. However, when more secret information bits need to be hidden, coverless steganography based on image or text must transmit a large number of carriers, which arouses the suspicion of external attackers and increases the risk of the secret information being attacked.
Compared with image and text, video offers more features for hiding secret information, such as frame image features, temporal features between frames, and audio features. Therefore, video-based coverless steganography does not need to transmit many carriers when more secret information bits must be hidden. At the same time, due to the wide use of portable multimedia devices such as smartphones, short videos are abundant on the Internet, and the daily spread of video makes it an ideal covert communication carrier. These advantages provide a basis for the development of video-based coverless steganography. However, only a few coverless steganography methods are based on video. Tan et al. [24] calculated the directional optical flow features of adjacent frame images, obtained robust histograms of oriented optical flow (RHOOF), and then mapped the hash bits according to the discrimination relationships between histogram components. The optical flow information of adjacent frames is sensitive to random noise, so the robustness of this scheme needs improvement. Pan et al. [25] first performed framing on the video to extract valid frame images and then used the semantic information extracted from the frame images by MobileNetV2 [26] to generate hash sequences. However, this method used only a single-frame image feature of the video, and its hiding capacity and hiding success rate were relatively low. Moreover, it took a long time to train the MobileNetV2 network, and the trained model also took a long time to generate a byte, which weakened the practicability of the scheme. Zou et al. [27] used a deep neural network to extract the hash codes of frame images and set up mapping rules to improve the capacity of the scheme.
However, the hash codes of the frame images were directly generated by the neural network, whose robustness was poor, resulting in weak resistance to noise.
To make more effective use of the features of the carrier video and improve hiding capacity and robustness, a coverless video steganography scheme based on audio and frame features is proposed in this work. First, the frame images and audio components of the carrier video are extracted. Then, three features of these two components are mapped into bit sequences, and the retrieval database is established according to the mapping relationship. The sender divides the secret information into bit sequences of equal length and then searches for the retrieval information and carrier videos in the retrieval database. The retrieval information and carrier videos are sent to the receiver, who recovers the secret information according to the mapping rules. The contributions of this paper are as follows: (1) A novel coverless video steganography scheme based on audio and frame features is proposed, which makes full use of the features of the frame images and audio of the carrier videos. (2) The feasibility of audio features for coverless steganography is investigated, which has not been fully studied in existing research. The short-term energy and DWT coefficients of audio are used to hide secret information, and the experimental results show good performance.
(3) The robustness, capacity, time cost, and hiding success rate are analysed and tested. The proposed method achieves good improvements compared with the existing video-based coverless steganography schemes. The rest of the paper is arranged as follows. Preliminaries are presented in Section 2. The proposed coverless video steganography is described in Section 3. The experimental results and analysis are shown in Section 4. Finally, conclusions are drawn in Section 5.

2.1. Short-Term Energy of the Audio Signal.
Since the audio signal changes continuously with time, it can be characterized as a nonstationary random process with short-term correlation, so short-term analysis is generally used for audio signal processing. The signal is first divided into frames to ensure local stability. Then, windows are applied to keep the signal continuous, as shown in Figure 1.
Assuming the audio signal is X(n), the i-th frame of the signal is obtained after windowing by

Y_i(n) = w(n) · X(n + (i − 1) · i_nc), 1 ≤ n ≤ L_f, 1 ≤ i ≤ f_n, (1)

where w(·) is the window function with width w_len, f_n is the total number of frames, L_f is the frame length, i_nc is the frameshift length, and Y_i(n) represents the n-th signal value of the i-th frame of the audio signal. The short-term energy, which reflects the strength of the audio signal, is obtained by

E(i) = Σ_{n=1}^{L_f} Y_i(n)², (2)

where L_f represents the frame length.
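The framing and short-term energy computation above can be sketched in a few lines (a minimal illustration assuming a rectangular window; the function names are ours, and the defaults match the paper's w_len = 200 and i_nc = 80):

```python
def frame_signal(x, frame_len=200, hop=80):
    """Split signal x into overlapping frames (rectangular window assumed)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return [x[i * hop : i * hop + frame_len] for i in range(n_frames)]

def short_term_energy(frames):
    """Per-frame energy: E(i) = sum of squared samples of frame i."""
    return [sum(s * s for s in frame) for frame in frames]
```

With a real window (e.g. Hamming), each frame would be multiplied elementwise by the window before squaring.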

2.2. Discrete Wavelet Transform.
Discrete wavelet transform (DWT) [18] is a transform method whose process is as follows:

ψ_{m,n}(t) = a_0^{−m/2} · ψ(a_0^{−m} · t − n · b_0), (3)

W_f(m, n) = ∫ f(t) · ψ*_{m,n}(t) dt. (4)

Equation (3) is the discrete wavelet function of DWT, where m and n are integers, a_0 is a constant greater than 1, and b_0 is a constant greater than 0; the different values of the scale a and shift b are determined by m, n, a_0, and b_0, and these two parameters are related to the selection of the discrete wavelet function ψ_{m,n}(t). Equation (4) represents the process of DWT, where f(t) represents the audio signal in the time domain, t represents time, and * denotes the complex conjugate of the discrete wavelet function ψ_{m,n}(t).
After DWT, the audio signal is decomposed into low-frequency and high-frequency components. The low-frequency component contains most of the energy of the audio signal, whereas the high-frequency component mainly contains detailed information of the speech signal, such as the impact of noise. As shown in Figure 2, after each DWT, the length of the low-frequency information is halved, and its contour becomes more distinct and stable. Therefore, the low-frequency component of DWT has good stability and robustness and can be used for information hiding.
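For illustration, one level of the Haar wavelet (the simplest DWT; the paper itself later selects the rbio3.1 basis) shows how the low-frequency part halves in length at each decomposition:

```python
import math

def haar_dwt(x):
    """One level of the Haar DWT: returns (approx, detail) coefficients.
    The approximation (low-frequency) part is half the input length."""
    approx = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return approx, detail
```

Applying this c times leaves a low-frequency band of length len(x) / 2^c, matching the 1/2^c length reduction described above.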

2.3. Scale-Invariant Feature Transformation.
Scale-invariant feature transformation (SIFT) is scale invariant and robust to variations in illumination, noise, and viewing angle. Because of its excellent stability and robustness, it can be applied to information hiding [12]. The steps of SIFT feature detection are as follows: (1) Detect the extrema of scale space. A difference-of-Gaussian function is used to search for potential feature points invariant to scale and orientation over all images in scale space.

Our Proposed Method
In this paper, we propose a coverless video steganography based on audio and frame features, whose framework is shown in Figure 3. In our method, we first process the video to obtain the audio and image components. Then, three features of these two components are extracted: the SIFT feature, the short-term energy feature, and the DWT coefficient feature, which are used to generate hash bit sequences through different robust mapping methods. After that, the retrieval database is built according to the position information. At the sender, the secret information is divided into bit sequences, which are used to search for corresponding retrieval information and videos in the retrieval database. The retrieval information and videos are sent to the receiver. After receiving them, the receiver obtains the bit sequences by calculating the corresponding features of the videos according to the retrieval information, and thereby recovers the secret information.

Mapping of Bit Sequence.
The mapping method of the bit sequence determines the robustness and accuracy of coverless steganography, so it is the core part of the algorithm. Our method uses three features, extracted from the audio and frame images, respectively. The three mapping schemes are described as follows.

Mapping Based on Short-Term Energy of the Audio.
The mapping based on the short-term energy of audio is as follows: (1) Process the audio signal X(n). According to equation (1), the audio signal X(n) is divided into frames and windowed to get f_n frames of the audio signal Y_i(n). Here, we set the frame length w_len = 200 and the frameshift i_nc = 80. Then, we calculate the short-term energy of each frame according to equation (2). (2) Segment the energy of the audio signal. Following the principle that the total energy of L_0 = 180 frames maps 1 bit of information, the short-term energy E(i) is accumulated into h_0 segments of short-term energy En(h):

En(h) = Σ_{i=(h−1)·L_0+1}^{h·L_0} E(i), 1 ≤ h ≤ h_0, (5)

of which the first 8 × N_0 segments are used to generate bit sequences.
(3) Generate the hash sequences. Eight short-term energy segments are selected from En(h) in sequence, and their mean value is taken as the threshold K. According to the relationship between each segment's short-term energy and the threshold K, we obtain the bit sequence B_1, as shown in Figure 4. A second hash sequence B_1′ is obtained by bit inversion.
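The thresholding in step (3) can be sketched as follows (a hedged illustration: we assume bit = 1 when a segment's energy exceeds the mean K, and that the second sequence is the bitwise complement):

```python
def energy_to_byte(seg_energies):
    """Map 8 short-term-energy segments to one byte.
    Rule (assumed): bit = 1 when a segment exceeds the mean K of the 8 segments.
    Bit inversion of the result yields a second byte."""
    k = sum(seg_energies) / len(seg_energies)   # threshold K = mean energy
    bits = [1 if e > k else 0 for e in seg_energies]
    inverted = [1 - b for b in bits]            # complement gives a second byte
    return bits, inverted
```

Each group of 8 segments thus yields two candidate bytes for the retrieval database.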

Mapping Based on DWT Coefficients of the Audio.
We use the stable low-frequency information obtained by DWT to generate robust bit sequences as follows: (1) Perform DWT on the audio signal. We perform DWT on the audio signal X(n) four consecutive times and output the absolute value of the low-frequency information U, whose length is l. (2) Process the coefficients of the low-frequency information U. We use L_1 = 2750 values of the low-frequency information U to map 1 bit of information, obtaining h_1 = floor(l/L_1) low-frequency coefficients Zc.
(3) Generate the hash sequences. By comparing adjacent DWT coefficients Zc, we obtain the bit sequence H of length h_1 − 1:

H(t) = 1 if Zc(t + 1) > Zc(t), otherwise H(t) = 0, 1 ≤ t ≤ h_1 − 1. (6)

(4) Output the bit sequences in bytes. The bit sequence H is output byte by byte in order as the byte sequence B_2, as shown in Figure 5. We obtain the byte sequence B_2′ by bit-inverting B_2.
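Step (3) can be sketched directly (assuming the comparison rule H(t) = 1 when Zc(t + 1) > Zc(t)):

```python
def adjacent_compare_bits(coeffs):
    """Produce len(coeffs) - 1 bits from adjacent low-frequency coefficients:
    bit t is 1 when coeffs[t + 1] > coeffs[t], else 0 (comparison rule assumed)."""
    return [1 if coeffs[t + 1] > coeffs[t] else 0 for t in range(len(coeffs) - 1)]
```

Comparing adjacent accumulated coefficients, rather than using their raw values, makes the bits robust to uniform amplitude changes such as volume scaling.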

Security and Communication Networks
Mapping Based on the SIFT Feature of the Frame Images.
(1) Process the frame image. For a frame image, we first convert it to greyscale, resize it to 512 × 512, and divide it into 3 × 3 blocks. (2) Generate and count SIFT feature points. We perform SIFT detection on the frame image to obtain the locations of the SIFT feature points. Then, we count the number of SIFT feature points S(i) in each image subblock, 1 ≤ i ≤ 9.
(3) Generate the hash sequences. By comparing the numbers of SIFT feature points of different image subblocks, we obtain the hash bit sequence B_3, as shown in Figure 6. We then obtain the hash bit sequence B_3′ by bit-inverting B_3.
(4) Repeat steps 1, 2, and 3 until the bit sequences of all images are generated.
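The block partitioning and counting can be sketched as below (SIFT detection itself is omitted; we assume the keypoint coordinates are already available, and the adjacent-count comparison rule is our assumption about Figure 6):

```python
def count_points_per_block(points, size=512, grid=3):
    """Count feature points falling in each cell of a grid x grid partition.
    points: iterable of (x, y) coordinates in a size x size image."""
    cell = size // grid
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int(x) // cell, grid - 1)   # clamp edge pixels into last cell
        row = min(int(y) // cell, grid - 1)
        counts[row * grid + col] += 1
    return counts

def counts_to_bits(counts):
    """One possible comparison rule (assumed): bit = 1 when S(i+1) > S(i)."""
    return [1 if counts[i + 1] > counts[i] else 0 for i in range(len(counts) - 1)]
```

With a 3 × 3 grid this yields 8 bits per frame, i.e. one byte, plus a second byte via bit inversion.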

Establishment of Retrieval Database.
The retrieval database helps the sender search for the carriers corresponding to the secret information, so it is an important part of the algorithm. The establishment process is as follows: (1) Extract two components of the video. We use Arabic numerals to mark the position of each video, which serves as the video ID of the subsequent hash sequences, and then extract the frame images I(m) and audio X(n) from it. (2) Extract different features. We extract the SIFT features of the frame images I(m) and mark their feature ID as 0. Then, the short-term energy features and DWT coefficient features of the audio X(n) are extracted, and their feature IDs are marked as 1 and 2, respectively.
(3) Generate hash sequences and update the retrieval information. We obtain hash sequences using the three mapping methods mentioned above. At the same time, we append 0 to the end of the feature ID if the hash sequence is generated by feature mapping directly, otherwise 1. (4) Repeat steps 1 to 3 until all 256 types of byte sequences are mapped, and the retrieval database is established, as shown in Figure 7. The establishment of the retrieval database is described in Algorithm 1.
It can be seen from Figure 7 that a byte sequence may have multiple pieces of corresponding retrieval information. Therefore, the sender can randomly select one of them as the retrieval information, so that the same byte of secret information can map to multiple different items. This makes the auxiliary information transmitted by the sender more variable, increases the cracking difficulty for external attackers, and enhances the security of our method.
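A minimal sketch of such a retrieval index (the entry fields, such as video ID, feature ID, and offset, are illustrative assumptions about Figure 7's layout):

```python
import random
from collections import defaultdict

def build_index(entries):
    """Retrieval database sketch: byte value -> list of location entries.
    Each entry is e.g. (video_id, feature_id, offset, inverted_flag)."""
    index = defaultdict(list)
    for byte_val, location in entries:
        index[byte_val].append(location)
    return index

def pick_location(index, byte_val):
    """Randomly choose one of possibly many locations for a byte value,
    which varies the transmitted auxiliary information between sessions."""
    return random.choice(index[byte_val])
```

The random choice is what gives the same secret byte multiple distinct mappings, as described above.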

Transmission of Secret Information.
The specific process of secret information transmission is as follows: (1) Construct the retrieval database of videos and obtain the carrier videos V. (2) For the secret information S of length L_s, segment it every 8 bits (1 byte) and pad the tail with auxiliary information to get the byte sequence P = P_1, P_2, …, P_{L_P}: if mod(L_s, 8) ≠ 0, we pad 0s to the end of S to complete a byte and append 1 extra byte recording the number of 0s padded; if mod(L_s, 8) = 0, the sender appends the byte 0000 0000 to indicate that the original secret information has not been padded. (3) Get the retrieval information C_i according to P_i from the retrieval database. (4) Send the retrieval information C and the carrier videos V to the receiver; if the sender and receiver have an encryption protocol, the retrieval information C can be encrypted. The transmission of secret information is described in Algorithm 2.
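The padding rule of step (2) can be sketched with bit lists (function name is ours):

```python
def pad_secret(bits):
    """Pad a bit sequence to whole bytes, then append one count byte recording
    how many zero bits were added (0000 0000 means no padding was needed)."""
    n_pad = (8 - len(bits) % 8) % 8
    padded = bits + [0] * n_pad
    count_byte = [int(b) for b in format(n_pad, "08b")]
    return padded + count_byte
```

For example, a 3-bit secret gains 5 padding zeros plus the count byte 0000 0101, so the receiver can strip exactly what was added.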

Recovery of Secret Information.
The specific process of secret information recovery at the receiver is as follows: (1) The receiver recovers the byte sequence P_i according to the retrieval information C_i and the mapping methods described in Section 3.2. (2) Repeat step 1 until all the remaining retrieval information in C has been matched to obtain the byte sequence P = P_1, P_2, …, P_{L_P}. (3) Recover the secret information S according to P. If the last byte of P is 0000 0000, it is simply removed to recover the secret information S. Otherwise, according to its decimal value, the padded 0s in the last two bytes are removed to get the original secret information S. The recovery of secret information is described in Algorithm 3.
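The unpadding of step (3), sketched with bit lists (the inverse of the padding rule above; function name is ours):

```python
def unpad_secret(byte_bits):
    """Read the final count byte, drop it, then strip that many padded
    zero bits from the remaining sequence (count 0 means nothing to strip)."""
    count = int("".join(str(b) for b in byte_bits[-8:]), 2)
    body = byte_bits[:-8]
    return body[:len(body) - count] if count else body
```

Feeding in a padded sequence returns the original secret bits exactly.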
Experimental Results and Analysis
As far as we know, Pan et al. [25] proposed the first coverless video steganography method in 2020. Therefore, we conduct performance comparison experiments with Pan's scheme on the same data set, containing public videos we obtained from the Internet using crawler technology. The video data set consists of 240 short videos in MP4 format. Most are standard-definition videos, and a few are high-definition videos; the longest video lasts about 5 minutes. The themes of this video data set include news briefs, music videos, entertainment broadcasts, video clips, and documentary clips. In the test of audio features, we extract the audio components of the videos, with sampling rate Fs of 44100, and remove the weak signal values (signals whose absolute value falls below a threshold) at the beginning and end of the audio components to reduce noise influence. In the test of frame image features, we select some videos from the data set for experimental testing of frame images. The partial data sets are shown in Figure 8.
Tan et al. [24] proposed a coverless steganography scheme based on optical flow analysis of video in 2021. To compare with this latest scheme, we select the public data set UCF101 used in paper [24] for the robustness comparison. UCF101 is a public video data set containing various actions and scenes. Following Tan's settings, we randomly select videos of different actions and scenes. The size of these files is about 200~800 KB, and the duration of the videos is about 2~10 seconds, as shown in Figure 9.

Capacity.
The hidden capacity is an important indicator of an information hiding algorithm. A coverless steganography algorithm with a large capacity helps reduce the number of carriers needed for transmission. Our scheme uses the frame image and audio components of video to establish the mapping relationship with the secret information, so the hiding capacity of our algorithm should be discussed for the frame images and audio together.

Capacity of the Audio.
In this paper, we segment the audio signal and use the feature mapping of each segment to generate bytes. For the short-term energy feature, we first divide the audio signal into frames, then use 8 × 180 frames of the audio signal to map a 1-byte sequence, and use bit inversion to get another byte sequence. According to equation (1), 1 second of audio at sampling rate Fs is divided into

f_n = floor((Fs − w_len)/i_nc) + 1 (12)

frames. Since 8 × L_0 frames map 1 byte and bit inversion doubles the output, an x-second video can be mapped to

C_1 = 2 × floor(x · f_n/(8 × L_0)) (13)

bytes.
For the DWT coefficient feature, we perform DWT on the audio signal four times to obtain stable low-frequency information. The length of the audio becomes 1/2^4 of the original, so 1 second of audio yields Fs/2^4 coefficients after four DWT decompositions. Since 8 × 2750 DWT coefficients map 1 byte and bit inversion doubles the output, an x-second video can be mapped to

C_2 = 2 × floor(x · Fs/(2^4 × 8 × L_1)) = 2 × floor(x · Fs/352000) (14)

bytes. Setting n_2 = Fs/352000, this simplifies to C_2 = 2 × floor(x · n_2).
Therefore, the hidden capacity of x seconds of audio is C_1 + C_2 bytes when the audio sampling rate is Fs. The audio feature capacity is thus related to the audio duration x, the sampling rate Fs, and the parameters L_0 and L_1. To balance the robustness and capacity of the scheme, we conduct robustness tests on the values of L_0 and L_1 in Section 4.2.1 and finally determine these two parameters.

Capacity of the Video.
For a frame image, our algorithm uses frame image feature mapping and bit inversion operations to generate 2 bytes, whereas Pan's method can map only 1 byte when the robustness is optimal, and Tan's method can map 4 bytes.
For an x-second video with audio sampling rate Fs, from which M frame images can be extracted, we use the total number of bits mapped on a given carrier to measure the hidden capacity. The results are shown in Table 1.
It can be seen that the number of bits generated per frame image in our scheme is consistent with that of Zou's scheme, twice that of Pan's scheme, but half that of Tan's

Input: Secret information S; Output: Retrieval information C = C_1, C_2, …, C_{L_P}
(1) Construct retrieval database R and obtain carrier videos V
(2) Segment the secret information: S′ = segment(S)
(3) Pad the byte sequence: P = pad(S′)
(4) For i = 1 to L_P
(5) Search the index information C_i corresponding to P_i
(6) End for
(7) Send the retrieval information C and carrier videos V to the receiver
ALGORITHM 2: Transmission of secret information.

Input: Carrier videos V, retrieval information C = C_1, C_2, …, C_{L_P}; Output: Secret information S = S_1, S_2, …, S_{L_S}
(1) Receive retrieval information C and carrier videos V
(2) For i = 1 to L_P
(3) Obtain the byte sequence P_i according to C_i and the mapping method
(4) End for
(5) Remove the padding bytes at the end of the byte sequence P: S′ = Remove(P)
(6) Connect the byte sequences to restore the secret information: S = Connect(S′)
ALGORITHM 3: Recovery of secret information.
scheme. However, the other three schemes cannot use audio to map bits, whereas our method can map 8 × (C_1 + C_2) bits using the audio features. Ideally, when the video is long enough, the number of hash bits mapped from audio is sufficient, and the capacity of our scheme can exceed that of Tan's scheme.

Robustness.
During transmission, the carrier videos will be affected by noise or external attacks; therefore, robustness is an important indicator. We apply external noise or attacks to the frame images and audio of the carrier video and evaluate robustness by comparing the byte sequence P_1(t) recovered from the retrieval information with the byte sequence P_0(t) obtained from the original secret information segmentation:

Acc = (1/L_P) · Σ_{t=1}^{L_P} [P_0(t) = P_1(t)], (18)

where L_P is the number of bytes in sequences P_0(t) and P_1(t) and [·] equals 1 when the two bytes match and 0 otherwise. The paper [24] used the bit error rate to measure the robustness of the algorithm, calculated as

BER = (1/L_c) · Σ_{t=1}^{L_c} |p_0(t) − p_1(t)|, (19)

where p_0(t) and p_1(t) represent the t-th bits of sequences P_0(t) and P_1(t), respectively, and L_c is the total number of bits.
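Both metrics can be sketched directly (function names are ours):

```python
def match_rate(p0, p1):
    """Fraction of recovered bytes equal to the original bytes (accuracy)."""
    assert len(p0) == len(p1)
    return sum(a == b for a, b in zip(p0, p1)) / len(p0)

def bit_error_rate(b0, b1):
    """Fraction of differing bits between two bit sequences (BER)."""
    assert len(b0) == len(b1)
    return sum(a != b for a, b in zip(b0, b1)) / len(b0)
```

A perfectly robust scheme gives a match rate of 1.0 and a BER of 0.0 under every attack.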

Robustness Based on Audio
Features. The audio is mainly affected by external Gaussian white noise. In this section, we use seven levels of Gaussian noise under different SNRs to test the robustness of the two audio features. We repeat each experiment 20 times, discard the maximum and minimum results, and average the rest.
We compare the robustness of the short-term energy feature with different frame numbers L_0. Our method uses the accumulated value of 180 frames of short-term energy of the audio signal as the mapping unit. Table 2 shows the robustness test results for different frame numbers L_0. The robustness of the short-term energy feature increases slowly as the number of frames increases, while the capacity decreases as L_0 grows. Therefore, to balance robustness and capacity, we set L_0 = 180.
We compare the robustness of the DWT coefficient feature with different segment lengths L_1 in Table 3. Our method uses the cumulative sum of 2750 values of U as the mapping unit. There is no significant difference in anti-noise performance for different L_1, while L_1 is negatively correlated with the capacity of the algorithm. To balance robustness and capacity, we set L_1 = 2750.
We compare the robustness of the DWT coefficient feature with different numbers of DWT decompositions c in Table 4. At first, robustness improves as the number of decompositions increases, but it decreases when c is greater than 4. Therefore, we set c = 4.
We compare the robustness of the DWT coefficient feature with different wavelet basis functions in Table 5. The comprehensive robustness is best when the wavelet basis function is rbio3.1; therefore, we adopt rbio3.1. The overall robustness of the two audio features is shown in Table 6, with the short-term energy parameter L_0 = 180, the DWT coefficient parameter L_1 = 2750, the wavelet basis function rbio3.1, and the number of DWT decompositions c = 4. The audio feature mapping is strongly robust, reaching 86% accuracy under Gaussian noise with an SNR of 0.

Robustness Based on Frame Image Features.
In this section, we use a variety of geometric and noise attacks with different parameters to test the robustness of different methods on our video data set. According to the settings of paper [25], the parameter j of Pan's scheme is set to 9. The experimental results of the different methods on our database are shown in Table 7.
It can be seen that our method is more robust than Zou's and Pan's methods under most external image attacks. Because those two schemes use a neural network to directly extract features from the frame images, changes in pixel values strongly affect the network's output, resulting in weak resistance to noise. In particular, the pixel matrix of a frame image differs greatly from the original when the video undergoes a geometric attack, and, influenced by the prior knowledge of the training set, the features extracted by the neural network may differ greatly from the original features. Thanks to the scale invariance of SIFT, our method can stably extract the feature points of the frame images and is robust to both noise and geometric attacks.
To compare with the state-of-the-art scheme [24], we use equation (19) to conduct robustness comparisons with Pan's and Tan's schemes on the data set UCF101. According to the settings of paper [24], the bin number N is set to 8 and the subblock number S to 4; according to the settings of paper [25], j is set to 9. The results are shown in Table 8. Most of the experimental results of our scheme are stronger than those of the other two schemes, especially the anti-compression performance. Anti-compression performance is particularly important for video-based coverless steganography, because a carrier video generally undergoes a compression step before sending, which often damages the video content.

Efficiency Analysis.
The complexity and efficiency affect the feasibility and practicability of a steganography scheme. The cost of our scheme mainly lies in computing the three features and mapping them to hash bits. We measure the efficiency of the schemes by the time required to hide one byte, in units of s/B. From the results in Table 9, the time required by our scheme is the least: about one quarter of the time cost of Tan's method and about one seventh of that of Zou's and Pan's methods. Therefore, the cost of our scheme is the lowest, which enhances its feasibility.

Hiding Success Rate.
An information hiding algorithm should consider not only capacity but also the hiding success rate, which can be expressed by the number of different bytes that the videos can hide. The hiding success rate reflects the effectiveness of the algorithm and is calculated as

rate = N/2^w, (20)

where N is the number of distinct byte sequences among the Q bit sequences generated by multiple videos and w = 8 in this experiment. We use 85 videos in the video data set to test our method and Pan's method; the results are shown in Figure 10. The hiding success rate of our method is always higher than that of Pan's method, and only 9 videos are enough to map all 256 types of byte sequences.
This is because we use three features and the bit inversion operation, so a single video can generate a variety of hash sequences. The hiding success rate of Pan's method only approaches 99% with 85 videos, which means that the redundancy of the bit sequences generated by multiple videos is high and a large number of videos are needed to map all kinds of bit sequences.
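Under our interpretation of the rate formula (coverage of the 2^w possible byte values), the computation can be sketched as:

```python
def hiding_success_rate(byte_sequences, w=8):
    """Fraction of all 2**w possible byte values covered by the generated
    hash sequences (interpretation of the paper's rate; w = 8 here)."""
    return len(set(byte_sequences)) / (2 ** w)
```

A scheme that generates many distinct hash bytes per video reaches a rate of 1.0 with few videos, which is exactly what Figure 10 measures.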

Security Analysis.
The coverless video steganography based on audio and frame features proposed in this paper provides security in multiple respects: (1) We use three features of the video to map the hash bit sequences and hide the secret information, rather than modifying the carrier video. Therefore, the method can resist steganalysis tools, which ensures the security of the secret information.
(2) The carrier videos used by our method come from the abundant short videos on the Internet, which greatly reduces outside attention to the covert communication and thus improves its security.

Conclusion
A coverless video steganography based on audio and frame features is proposed in this work, which makes full use of the short-term energy feature, DWT coefficient feature, and SIFT feature of video to map hash bit sequences and hide secret information. The experimental results show that, compared with existing coverless video steganography, our method has larger capacity, less time cost, a higher hiding success rate, and stronger robustness to most external attacks. In the future, we will try to further improve the robustness and capacity.
Data Availability
The video database we built can be obtained upon request to the corresponding author. The UCF101 data used to support the findings of this study are available at https://www.crcv.ucf.edu/data/UCF101.php.

Conflicts of Interest
The authors declare that they have no conflicts of interest.