SFRNet: Feature Extraction-Fusion Steganalysis Network Based on Squeeze-and-Excitation Block and RepVgg Block

In the era of big data, convolutional neural networks (CNNs) have been widely used in image classification and have achieved excellent performance. In recent years, more and more researchers have begun to combine deep neural networks with steganalysis to improve performance. However, most CNN-based steganalysis algorithms have only been tested against the WOW and S-UNIWARD algorithms; meanwhile, their versatility is insufficient due to long training time and limits on image size. This paper proposes a new network architecture, called SFRNet, to solve these problems. The feature extraction and fusion layer can extract more features from the digital image. The RepVgg block is used to accelerate inference and increase memory utilization. The SE block improves the detection accuracy rate because it learns feature weights that assign large weights to effective feature maps and small weights to invalid or ineffective ones. Experimental results show that the SFRNet achieves excellent detection accuracy against four state-of-the-art steganography algorithms in the spatial domain, namely HUGO, WOW, S-UNIWARD, and MiPOD, under different payloads. The SFRNet detection accuracy rate reaches 89.6% against the S-UNIWARD algorithm at a payload of 0.4 bpp and 72.5% at 0.2 bpp. At the same time, the training time of our network is reduced by 35% compared with Yedroudj-Net.


Introduction
The rapid development of social networks makes it convenient for users to exchange data. A large number of digital images are uploaded to the Internet every day. The proliferation of digital images provides a good medium for criminals to commit crimes using steganographic algorithms. Digital image steganography is a hiding technology that takes into account both data security and communication: it uses the redundancy of the cover image to embed secret information into a public carrier and transmits it through a public channel so that the secret information is not discovered or intercepted by a third party. Image steganalysis is the opposite of image steganography: it determines whether an image contains secret information by capturing minor disturbances in the stego image that are not easily perceived by the human visual system. This provides a basis for extracting the secret information hidden in the image. In recent years, image steganalysis has played an increasingly important role in many information security systems and has attracted many researchers [1]. At the same time, fast-developing adaptive steganography algorithms use syndrome-trellis codes (STC) [2] to minimize distortion and preserve more complex image statistical properties. The current typical spatial adaptive steganography algorithms include HUGO [3], S-UNIWARD [4], WOW [5], and MiPOD [6]. They hide the secret information in regions where it is difficult to establish a steganalysis model, which improves the security of steganography algorithms and brings significant challenges to steganalysis.
Steganalysis can be considered a two-class image classification problem. Since convolutional neural networks can extract features in the spatial and frequency domains of images, more and more researchers are beginning to combine deep neural networks with steganalysis to improve performance. The steganographic noise that steganalysis must detect is a weak signal that is strongly affected by image content, so a traditional classification network tends to ignore it. Such a network therefore needs to be specially modified before it can be used in steganalysis, for example by suppressing the image content and enhancing the steganographic noise signal.
Qian et al. [16] proposed a steganalysis network, Qian-Net, based on a convolutional neural network, which uses a Gaussian activation function in place of the rectified linear unit (ReLU) [17]. Xu et al. [18] proposed Xu-Net based on the Qian-Net framework. It uses a high-pass filter in the preprocessing layer to extract noise residuals and adds an absolute value (ABS) layer and a TanH-ReLU hybrid activation function. Ye et al. [19] proposed Ye-Net, a deeper network that uses high-pass filters as the preprocessing layer and the truncated linear unit (TLU) as the activation function and introduces the selection channel. Boroumand et al. [20] proposed SRNet, a 48-layer deep learning steganalysis framework that learns its filters to improve the detection accuracy rate against steganography algorithms. Yedroudj et al. [21] proposed a network architecture based on the concept of Alex-Net [22], called Yedroudj-Net. In addition to the ABS layer and the TLU activation function, it uses three fully connected layers. Zhang et al. [23] proposed Zhu-Net, which optimizes the filter kernels of the preprocessing layer and uses pyramid pooling [24] to obtain an excellent detection accuracy rate on the S-UNIWARD and WOW algorithms. The problem with neural-network-based steganalysis tools is that they cannot analyze larger images because of limited computing resources. Their versatility is also poor: most of the steganalysis algorithms have only been tested against the WOW and S-UNIWARD algorithms. At the same time, the training time of the neural networks is too long. To enhance the practicality and universality of the steganalysis network framework, we propose in this paper a feature fusion steganalysis framework based on the network structures of RepVgg [25] and squeeze-and-excitation [26], which is called the SFRNet.
Experimental results show that the SFRNet achieves excellent detection accuracy against four different steganography algorithms under different payloads. The SFRNet detection accuracy rate reaches 89.6% against the S-UNIWARD algorithm at a payload of 0.4 bpp and 72.5% at 0.2 bpp. In summary, we make the following contributions in this paper:
(i) Instead of using the image as an input, we extract the features of images through the rich model, merge them into a feature matrix, and use the generated feature matrix as the actual input of the network, which removes the dependence of the deep neural network on the size of the input image.
(ii) We propose SFRNet, a simple architecture with a favorable speed-accuracy trade-off compared with the state of the art, which uses the RepVgg block as the convolution layer of the network and uses the squeeze-and-excitation (SE) block to improve the detection accuracy rate.
(iii) We show the effectiveness of the SFRNet in steganalysis as well as its efficiency and ease of implementation.
The rest of the paper is organized as follows. Section 2 introduces the prior knowledge, including the SRM and several of its variants and the deep learning methods. Section 3 proposes the SFRNet and describes feature extraction-fusion and the detailed structure of the network. Section 4 introduces the dataset partition, training details, and specific parameters of the SFRNet steganalysis framework. In Section 5, we validate the effectiveness of the proposed model on several state-of-the-art steganographic algorithms and compare the performance of the SFRNet with several advanced steganalysis algorithms. The study ends with the conclusion in Section 6.

The Feature Extraction Method.
Fridrich and Kodovský [8] proposed the spatial rich model (SRM) based on the subtractive pixel adjacency matrix (SPAM) model, designing various linear and nonlinear high-pass filters (HPFs) in the spatial domain. The SRM filters the image with these filters to obtain a wide variety of residual images and then separately counts the frequency of occurrence of each pattern of adjacent residual samples to obtain the co-occurrence matrix. Finally, the elements of the co-occurrence matrix are arranged into vectors that serve as steganalysis features, as shown in Figure 1. These features comprehensively capture the changes in the correlation between adjacent image pixels caused by a steganography algorithm. The SRM improves the detection accuracy rate of steganalysis algorithms and has been used and improved by researchers of general steganalysis.
Denemark et al. [9] proposed the steganalysis method maxSRM, a variant of the SRM combined with the channel selection strategy. The maxSRM and maxSRMd2 are built in the same manner as the SRM, but the process of forming the co-occurrence matrices is modified to consider the embedding change probabilities estimated from the analyzed image. The version of the maxSRM in which all co-occurrence scan directions are replaced with the oblique direction "d2," as shown in Figure 2, is called maxSRMd2. Compared with SRM, maxSRM and maxSRMd2 achieve a significant performance improvement.

The RepVgg Block.
A classic convolutional neural network (ConvNet), VGG [27], achieved massive success in image recognition with a simple architecture composed of a stack of Conv, ReLU, and pooling layers. With Inception, ResNet [28], and DenseNet, much research interest shifted to well-designed architectures, making the models more and more complicated. The complicated multibranch designs make the model difficult to implement and customize, slow down inference, and reduce memory utilization. Ding et al. [25] presented RepVgg, whose inference-time body is VGG-like and composed of nothing but a stack of 3 × 3 convolutions and ReLU, while the training-time model has a multibranch topology.
In the SFRNet, we use the RepVgg block instead of the conventional convolution to accelerate inference and increase memory utilization. The RepVgg block uses ResNet-like identity and 1 × 1 branches so that the training-time information flow of a building block is $y = x + g(x) + f(x)$. Following [25], let $W^{(3)} \in \mathbb{R}^{C_2 \times C_1 \times 3 \times 3}$ denote the kernel of a 3 × 3 conv layer with $C_1$ input channels and $C_2$ output channels and $W^{(1)} \in \mathbb{R}^{C_2 \times C_1}$ the kernel of the 1 × 1 branch. Let $\mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}$ be the accumulated mean, standard deviation, learned scaling factor, and bias of the BN layer following the 3 × 3 conv layer, $\mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}$ those of the BN layer following the 1 × 1 conv layer, and $\mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}$ those of the identity branch. The identity branch can be viewed as a 1 × 1 conv layer with an identity matrix as the kernel. Let $M^{(1)} \in \mathbb{R}^{N \times C_1 \times H_1 \times W_1}$ and $M^{(2)} \in \mathbb{R}^{N \times C_2 \times H_2 \times W_2}$ be the input and output and $*$ be the convolution operator:
$$M^{(2)} = \mathrm{bn}(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}) + \mathrm{bn}(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}).$$
Then, after each BN layer is converted together with its preceding conv layer into a conv layer with a bias vector, the final bias is obtained by adding up the three bias vectors and the final 3 × 3 kernel by adding the 1 × 1 kernels onto the central point of the 3 × 3 kernel, which can be easily implemented by first zero-padding the two 1 × 1 kernels to 3 × 3 and adding the three kernels up [25], as shown in Figure 3.
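The kernel and bias merging described above can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the function names, the nested-list tensor layout, and the single-position check in `conv_at_center` are assumptions made for illustration, and the BN parameters are passed as hypothetical `(mean, std, gamma, beta)` tuples of per-channel lists.

```python
def fuse_conv_bn(kernel, mean, std, gamma, beta):
    """Fold a BN layer into the conv before it.

    kernel[o][i][r][c] is a C2 x C1 x 3 x 3 nested list. Returns (kernel', bias')
    with kernel'[o] = gamma[o] / std[o] * kernel[o] and
    bias'[o] = beta[o] - mean[o] * gamma[o] / std[o].
    """
    fused_kernel, fused_bias = [], []
    for o, k_o in enumerate(kernel):
        scale = gamma[o] / std[o]
        fused_kernel.append([[[scale * v for v in row] for row in k_i] for k_i in k_o])
        fused_bias.append(beta[o] - mean[o] * scale)
    return fused_kernel, fused_bias


def pad_1x1_to_3x3(kernel_1x1):
    """Zero-pad a C2 x C1 matrix of 1 x 1 weights to C2 x C1 x 3 x 3."""
    return [[[[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]] for w in row]
            for row in kernel_1x1]


def identity_1x1(channels):
    """The identity branch viewed as a 1 x 1 conv with an identity-matrix kernel."""
    return [[1.0 if o == i else 0.0 for i in range(channels)] for o in range(channels)]


def reparameterize(w3, bn3, w1, bn1, bn0):
    """Merge the 3 x 3, 1 x 1, and identity branches into one 3 x 3 kernel and bias."""
    channels = len(w3)  # the identity branch requires C1 == C2
    branches = [
        fuse_conv_bn(w3, *bn3),
        fuse_conv_bn(pad_1x1_to_3x3(w1), *bn1),
        fuse_conv_bn(pad_1x1_to_3x3(identity_1x1(channels)), *bn0),
    ]
    kernel = [[[[sum(k[o][i][r][c] for k, _ in branches) for c in range(3)]
                for r in range(3)] for i in range(channels)] for o in range(channels)]
    bias = [sum(b[o] for _, b in branches) for o in range(channels)]
    return kernel, bias


def conv_at_center(patch, kernel, bias):
    """Output of a 3 x 3 conv at the center of one C1 x 3 x 3 input patch."""
    return [bias[o] + sum(kernel[o][i][r][c] * patch[i][r][c]
                          for i in range(len(patch)) for r in range(3) for c in range(3))
            for o in range(len(kernel))]


def three_branch_at_center(patch, w3, bn3, w1, bn1, bn0):
    """Training-time output: bn(conv3x3) + bn(conv1x1) + bn(identity)."""
    def bn(value, o, params):
        mean, std, gamma, beta = params
        return (value - mean[o]) * gamma[o] / std[o] + beta[o]

    zero_bias = [0.0] * len(w3)
    conv3 = conv_at_center(patch, w3, zero_bias)
    conv1 = [sum(w1[o][i] * patch[i][1][1] for i in range(len(patch)))
             for o in range(len(w1))]
    return [bn(conv3[o], o, bn3) + bn(conv1[o], o, bn1) + bn(patch[o][1][1], o, bn0)
            for o in range(len(w3))]
```

For any input patch, `conv_at_center` applied to the merged kernel and bias should agree with `three_branch_at_center` up to floating-point error, which is why the single-branch model can replace the multibranch one at inference time.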

The Squeeze-and-Excitation Block.
Hu et al. [26] focus on the channel relationship and propose the "Squeeze-and-Excitation" block, which learns to use global information to selectively emphasize informative features and suppress less useful ones. Liu et al. [29] construct a new effective network with diverse filter modules (DFMs) and squeeze-and-excitation modules (SEMs), called DFSE-Net, which can better capture the embedding artifacts. Their experiments showed that networks can pay more attention to critical channels through SEMs.
The squeeze-and-excitation block is not a complete network structure; a squeeze-and-excitation network can be constructed by simply stacking a collection of SE blocks. The SE block learns feature weights that give effective feature maps large weights and invalid or ineffective feature maps small weights, as shown in

SFRNet
The proposed network architecture is called SFRNet: a feature extraction-fusion steganalysis network via squeeze-and-excitation block and RepVgg block. First, we explain the preprocessing method, i.e., how to obtain the feature matrix. Then, the architecture of the network is presented. We also explore the values of key parameters through experiments.

Feature Extraction and Fusion Layer.
The steganography algorithm modifies the original image content as little as possible when embedding secret information in the cover image to avoid detection. In other words, the steganography algorithm introduces noise into the image that usually cannot be perceived by the human perceptual system. This noise is also easily ignored by image classification networks, which focus on the content of the image. At the same time, the embedding modifies the correlation between adjacent pixels of the original image and also the correlation between adjacent pixels of the residual image. The SRM and its variants are used to process the image, mainly to suppress the influence of image content. We propose a feature information fusion block inspired by [30].
We use the following steps to extract and fuse the features to obtain the feature matrix used as the input of the model. First, the residual images of the stego image and the cover image are computed with the high-pass filters to obtain submodels. Then, each submodel is quantized, rounded, and truncated, and the co-occurrence matrix is extracted. Finally, the feature vectors are obtained by processing the co-occurrence matrix with the merging rules in SRM. The high-pass filters are shown in Figure 5. In the equations that define the feature vector extraction process, $X_{ij}$ represents the $(i, j)$ pixel of the cover image, $N_{ij}$ is the set of pixels adjacent to $X_{ij}$ with $X_{ij} \notin N_{ij}$, $c \in \mathbb{N}$ is the residual order, $\hat{X}_{ij}(\cdot)$ is a predictor of $cX_{ij}$ defined on $N_{ij}$, $K_k$ is the $k$th high-pass filter, and $R_K$ is the residual filtered by the $k$th high-pass filter. Round(·) denotes element-wise rounding, and Trunc(·) denotes element-wise truncation. The purpose of truncation is to curb the dynamic range of the residual to allow its description using co-occurrence matrices with a small $T$. The SRM model extracts the co-occurrence matrix in the horizontal direction, defined as equation (4). The vertical co-occurrence, $C_d^{(v)}$, is defined analogously. The maxSRM model extracts the co-occurrence matrix in the horizontal direction, defined by equation (5), where $\beta_{ij}^{k}$ is the embedding change probability for the $k$th high-pass filter; refer to [9] for details. The vertical co-occurrence, $C_d^{(v)}$, is again defined analogously. The scan direction of the maxSRMd2 model differs from that of SRM and maxSRM: it is replaced by "d2," as shown in Figure 2, so the co-occurrence matrix is defined as equations (6) and (7).
Figure 5: The high-pass filters (filter classes include 1st and 3rd order, 2nd order, and square).
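As an illustration of the residual, quantization, truncation, and co-occurrence steps described above, the following is a minimal sketch in plain Python. It is a simplification under stated assumptions, not the rich-model extractor used in the paper: it uses only one hypothetical first-order horizontal filter (right neighbor as predictor, c = 1), rounds halves away from zero (an assumption about the rounding convention), and skips the symmetry-based merging rules.

```python
import math
from collections import Counter


def first_order_residual(image):
    """Horizontal first-order residual R_ij = X_{i,j+1} - X_ij for each row."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in image]


def round_half_away(x):
    """Round to the nearest integer, halves away from zero (assumed convention)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_truncate(residual, q=1.0, T=2):
    """Element-wise R <- Trunc_T(Round(R / q)); values are clipped to [-T, T]."""
    return [[max(-T, min(T, round_half_away(v / q))) for v in row] for row in residual]


def horizontal_cooccurrence(residual, order=4):
    """Normalized counts of horizontally adjacent residual tuples of length `order`."""
    counts = Counter()
    for row in residual:
        for j in range(len(row) - order + 1):
            counts[tuple(row[j:j + order])] += 1
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


# Tiny made-up 2 x 6 "image" used only to exercise the functions.
image = [[10, 12, 11, 11, 15, 14],
         [3, 3, 4, 9, 9, 8]]
quantized = quantize_truncate(first_order_residual(image), q=1.0, T=2)
cooc = horizontal_cooccurrence(quantized, order=4)
```

With T = 2 each residual takes one of five values, so a fourth-order co-occurrence has at most 5^4 = 625 bins before any merging, which is why a small T keeps the feature dimension manageable.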

Security and Communication Networks
We choose $q \in \{0.5, 1, 2\}$, $T = 2$, and $d = 4$ in all extraction methods, obtaining 106 feature vectors. Among them, 17 are 338-dimensional feature vectors and 89 are 325-dimensional feature vectors. In the definition of each feature vector, $\vec{F}_k$ is the feature vector calculated using the $k$th submodel, and Merge(·) merges two matrices into one by combining elements with the same or similar statistical laws in the horizontal co-occurrence matrix $C_d^{(h)}$ and the vertical one $C_d^{(v)}$. We use the zero vector [0, 0] as the separator between consecutive feature vectors and concatenate them into a single feature vector of 34,969 dimensions; the empty positions after the last feature are filled with the zero vector [0, 0, 0, ..., 0, 0]. This is defined as equation (9), where * denotes SRM, maxSRM, or maxSRMd2. Then, we obtain the final feature matrix fused from the three feature vectors. In its definition, $MF_{cover}$ is the feature matrix of the cover image, $MF_{stego}$ is the feature matrix of the stego image, and Reshape(·) converts a feature vector of 34,969 dimensions to a feature matrix of 187 × 187. Finally, our goal is to use the SFRNet to train a mapping Map(·) based on the difference between them so that the mapping satisfies equations (11) and (12):
$$\mathrm{Map}(MF_{cover}) = 0, \quad (11)$$
$$\mathrm{Map}(MF_{stego}) = 1. \quad (12)$$
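The dimension bookkeeping above can be checked with a short sketch. The function name `pack_feature_vectors` and the ordering of the 338- and 325-dimensional vectors are assumptions for illustration; only the counts, the [0, 0] separator, the zero padding, and the 187 × 187 target shape come from the text.

```python
def pack_feature_vectors(vectors, side=187, separator=(0.0, 0.0)):
    """Concatenate feature vectors with a separator between consecutive vectors,
    zero-pad the tail, and reshape the result to a side x side matrix."""
    flat = []
    for index, vector in enumerate(vectors):
        if index > 0:
            flat.extend(separator)  # [0, 0] between consecutive feature vectors
        flat.extend(vector)
    capacity = side * side
    if len(flat) > capacity:
        raise ValueError("features do not fit into the target matrix")
    flat.extend([0.0] * (capacity - len(flat)))  # zero padding after the last feature
    return [flat[r * side:(r + 1) * side] for r in range(side)]


# Placeholder vectors of ones with the dimensions given in the text.
vectors = [[1.0] * 338 for _ in range(17)] + [[1.0] * 325 for _ in range(89)]
matrix = pack_feature_vectors(vectors)

n_features = sum(len(v) for v in vectors)        # 17 * 338 + 89 * 325 = 34,671
n_separators = 2 * (len(vectors) - 1)            # 105 gaps * 2 zeros = 210
n_padding = 187 * 187 - n_features - n_separators  # 34,969 - 34,881 = 88
```

Under these assumptions the 106 vectors contribute 34,671 features, the separators add 210 zeros, and 88 trailing zeros complete the 34,969 = 187 × 187 entries.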

The SFRNet Architecture.
The overall structure of the SFRNet is presented in Figure 6. The SFRNet accepts an input image of size 256×256 and outputs two-class labels (stego and cover). It is composed of several layers, including one feature extraction-fusion block, five convolution blocks with different numbers of RepVgg blocks, three SE blocks, and three fully connected layers. The layer types and parameters are displayed inside boxes in Figure 6. N×(C×W×H) means that the batch size is N, the number of channels is C, and the width and height of the feature matrix are W and H. RepVgg denotes the RepVgg block. The details of the RepVgg block and SE block are described below. AVG stands for average pooling, and GAP stands for global average pooling.

The SE Blocks.
Squeeze is achieved by using global average pooling to generate channel-wise statistics. The statistic $z$ is generated by squeezing the input $U$ through its spatial dimensions $H \times W$, and the $c$th element of $z$ is calculated by
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).$$
Excitation is used to fully capture channel-wise dependencies. First, it must be capable of learning a nonlinear interaction between channels, and second, it must learn a non-mutually-exclusive relationship. The excitation operation can be defined by
$$s = \sigma(W_2 \, \delta(W_1 z)),$$
where $\delta$ refers to the ReLU function, $W_1$ and $W_2$ are the weights of the fully connected operations, and $\sigma$ refers to the sigmoid function. The final output of the SE block is obtained by rescaling $U$ with the activations $s$:
$$x_c = F_{scale}(u_c, s_c) = s_c u_c,$$
where $F_{scale}(u_c, s_c)$ refers to channel-wise multiplication between the scalar $s_c$ and the feature map $u_c$ and $X = [x_1, x_2, \ldots, x_C]$ is the final output of the SE block. In our architecture, an SE block follows each of the first three stages, as shown in Figure 6. To show the effect of the SE block on detecting steganography algorithms, we conducted a comparative experiment based on the SFRNet with and without the SE block. The results in Figures 7 and 8 show that the SE block accelerates convergence and yields better performance against the WOW algorithm at 0.4 bpp.
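The squeeze, excitation, and rescaling steps described above can be sketched in plain Python as follows. This is an illustrative forward pass only, not the SFRNet code: tensors are nested lists indexed as `U[c][i][j]`, the fully connected layers are bias-free matrices `W1` and `W2` (an assumption), and the weights in the example are made up.

```python
import math


def squeeze(U):
    """Global average pooling: one statistic z_c per channel of U[c][i][j]."""
    return [sum(sum(row) for row in u_c) / (len(u_c) * len(u_c[0])) for u_c in U]


def matvec(W, v):
    """Matrix-vector product for a fully connected layer without bias."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def excite(z, W1, W2):
    """s = sigmoid(W2 * relu(W1 * z)); one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in matvec(W1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(W2, hidden)]


def se_block(U, W1, W2):
    """Rescale every channel u_c of U by its learned weight s_c."""
    s = excite(squeeze(U), W1, W2)
    return [[[s_c * v for v in row] for row in u_c] for s_c, u_c in zip(s, U)]


# Two 2 x 2 channels and hand-picked weights (reduction from 2 channels to 1 hidden unit).
U = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
W1 = [[1.0, 1.0]]
W2 = [[0.0], [0.0]]
out = se_block(U, W1, W2)
```

Because `W2` is all zeros in this toy example, both channel weights equal sigmoid(0) = 0.5; with trained weights the channels would be scaled differently, which is how the block emphasizes some feature maps over others.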

Nonlinear Activation Layer.
Two different activation functions, TLU and ReLU, are used in the SFRNet. The classical ReLU can prevent gradient vanishing/exploding and accelerate network convergence. The ReLU is used in "stage3," where it selectively responds to embedded signals in the input feature map and yields more efficient features. Note that the remaining layers do not use an activation function.
Compared with the cover image content, the signal introduced by the embedded message is usually of low amplitude. The high-frequency stego noise is added to the cover as a weak signal that is significantly affected by the image content. Therefore, the TLU is used to reduce the dynamic range of the input feature maps in "stage1" and "stage2," suppressing image content and extracting embedding signals more effectively. It can be defined as
$$\mathrm{TLU}(x) = \begin{cases} -T, & x < -T, \\ x, & -T \le x \le T, \\ T, & x > T, \end{cases}$$
where $T > 0$ is the threshold determined by experiments. To investigate the impact of the parameter $T$ in our network, we conduct several experiments with the SFRNet for a range of different $T$ values. The results are shown in Table 1 and Figures 9 and 10. When the value of $T$ is 1, the model achieves better performance and faster convergence.
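For reference, the truncated linear unit amounts to clipping each activation to the interval [-T, T]. The sketch below uses hypothetical helper names `tlu` and `tlu_map` and plain Python lists; it only illustrates the activation itself, not how it is wired into the network.

```python
def tlu(x, T=1.0):
    """Truncated linear unit: identity on [-T, T], saturating at -T and T outside."""
    if T <= 0:
        raise ValueError("the threshold T must be positive")
    return max(-T, min(T, x))


def tlu_map(feature_map, T=1.0):
    """Apply the TLU element-wise to a 2-D feature map given as nested lists."""
    return [[tlu(v, T) for v in row] for row in feature_map]


clipped = tlu_map([[2.0, -0.5], [-4.0, 1.0]], T=1.0)
```

Unlike ReLU, the TLU keeps negative residual values and bounds large magnitudes in both directions, which matches the stated goal of limiting the dynamic range of the input feature maps.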

Experiments
Python 3.8.3 was used for architecture construction, and the model was implemented mainly with PyTorch 1.4.0. The operating system of the machine is Ubuntu 20.04 LTS, and the CUDA version is 11.0. The machine has a GeForce RTX 2080 SUPER GPU (8 GB, 250 W), an Intel i7-9700K processor, and 32 GB of RAM (two 16 GB modules at 2666 MHz).

Dataset and the Steganographic Schemes.
All experiments in this paper were evaluated and compared on the standard dataset BOSSBase ver. 1.01. This source contains 10,000 images acquired by seven digital cameras in the RAW format and subsequently processed by converting them to 8-bit grayscale, resizing, and central-cropping to 512 × 512 pixels. The image and camera information is shown in Table 2. The image source is widely used in research fields such as information hiding, forensics, and steganalysis and can be found at http://dde.binghamton.edu/download/.
Because other steganalysis algorithms use 256 × 256 images as input, we decided to evaluate the effectiveness of all models on images with a size of 256 × 256. To this end, we resized all the images to 256 × 256 pixels using "imresize()" in MATLAB with the default setting to generate the final datasets.
In our experiments, several state-of-the-art steganographic methods in the spatial domain, such as WOW, S-UNIWARD, MiPOD, and HUGO, were employed to produce standard datasets. The embedding algorithms WOW and S-UNIWARD are implemented with the STC simulator based on the publicly available code, which can be found at the same URL as the BOSSBase original images. We use the MATLAB version rather than the C++ implementation to avoid the problem reported in [31], namely that all images are embedded with the same key for all the steganographic algorithms. All methods were used to process the original images with two payloads: 0.2 bpp and 0.4 bpp. We use bits per pixel (bpp) to represent the size of the secret data embedded into cover images in all experiments. For each steganography algorithm, we randomly select 5000 image pairs for training, 1000 image pairs for validation, and 4000 image pairs for testing, and the testing set was untouched during all of the training phases. Then, the feature extractor is employed to extract the features for image steganalysis.
Figure 6: The SFRNet architecture.
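The random partition into training, validation, and testing pairs can be sketched as below. The function name `split_pairs`, the fixed seed, and the use of image indices to stand for cover/stego pairs are assumptions for illustration; the paper does not state how the random selection was implemented.

```python
import random


def split_pairs(num_images=10000, n_train=5000, n_val=1000, n_test=4000, seed=0):
    """Randomly partition image indices into disjoint train/validation/test sets.

    Each index stands for one cover/stego pair, so a pair never straddles two sets.
    """
    if n_train + n_val + n_test > num_images:
        raise ValueError("requested split sizes exceed the number of images")
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


train_ids, val_ids, test_ids = split_pairs()
```

Splitting by pair index rather than by individual image keeps a cover and its stego version in the same subset, so the test set stays untouched by training in the sense described above.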

Hyperparameters.
In the SFRNet architecture, the Adam [32] optimizer is used to update the parameters of the model in the learning phase since Adam can reach convergence faster than stochastic gradient descent (SGD). Due to GPU memory limitations, the minibatch size in training is set to 64, containing 32 cover images and their 32 corresponding stego images. The training dataset is shuffled after each epoch. Dropout is applied after every fully connected layer. Based on the above settings, the networks are then trained to minimize the cross-entropy loss. The SFRNet is trained for up to 150 epochs, but we often stop training earlier to prevent overfitting: when the cross-entropy loss on the training set keeps decreasing but the detection accuracy rate on the validation set begins to decline, we stop the training. The performance is evaluated by the testing accuracy rate of the model that achieved the best validation accuracy during training.
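The stopping rule and model selection described above can be sketched as a small function over a history of validation accuracies. The `patience` parameter (how many epochs without improvement are tolerated) is a hypothetical choice introduced here; the text only says that training stops once validation accuracy begins to decline, without giving a concrete criterion.

```python
def select_best_epoch(val_accuracy, patience=10, max_epochs=150):
    """Return (best_epoch, best_accuracy, stopped_at), with epochs counted from 1.

    Training is considered stopped once validation accuracy has not improved
    for `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best_epoch, best_acc, stopped_at = None, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracy[:max_epochs], start=1):
        stopped_at = epoch
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc  # checkpoint this model
        elif epoch - best_epoch >= patience:
            break  # validation accuracy has stopped improving
    return best_epoch, best_acc, stopped_at


# Made-up validation curve: improves for three epochs, then declines.
history = [0.60, 0.70, 0.75, 0.74, 0.73, 0.72, 0.71]
result = select_best_epoch(history, patience=3)
```

In a real training loop the checkpoint saved at `best_epoch` would be the model evaluated on the test set, so the test accuracy never influences when training stops.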

Comparison with the State-of-the-Art Steganalysis.
We report the detection accuracy rate obtained when detecting the S-UNIWARD and WOW embedding algorithms at 0.2 bpp and 0.4 bpp, as shown in Table 3. The steganalysis methods compared are Yedroudj-Net, SRNet, DFSE-Net, and Zhu-Net. The detection accuracy of Zhu-Net is 0.3% higher than that of the SFRNet when applied to the WOW algorithm at 0.4 bpp. Apart from this case, the SFRNet generally performs better than the other four steganalysis networks against the WOW and S-UNIWARD algorithms at 0.2 bpp and 0.4 bpp, as shown in Table 3 and Figure 11. The results against the MiPOD and HUGO algorithms are shown in Table 4 and Figure 12. The detection accuracy is increased by 8%-10% compared with the latest method, Zhu-Net, against the MiPOD algorithm at 0.4 bpp and 0.2 bpp. Compared with Zhu-Net, the detection accuracy is increased by 4%-7% against the HUGO algorithm at 0.2 bpp and 0.4 bpp. The good performance shown in Figure 12 demonstrates the effectiveness of the network structure of the SFRNet.

The Time Consumption and Computational Complexity of SFRNet.
We compare the number of parameters and the time spent on network training and testing of the six types of steganalysis networks, as shown in Table 5. The SFRNet reduces time consumption while improving accuracy. Although the SFRNet designed in this paper has a deeper network structure, the use of the RepVgg block makes its computational complexity and time consumption lower than those of Zhu-Net and SRNet while still ensuring considerable accuracy. Compared with Yedroudj-Net, training time is reduced by about 35%.

Conclusions
In this paper, a deep neural network with high accuracy and low time consumption is proposed for steganalysis. The feature extraction-fusion layers are used to extract features from the original images and combine them into a feature matrix, which provides versatility for the steganalysis method. Furthermore, we use the SE block and the RepVgg block to construct the SFRNet, which significantly reduces the computational complexity while ensuring accuracy. At the same time, the SE block is used to extract the channel correlation of the feature matrix. The experimental results show that the SFRNet has excellent steganalysis performance in the spatial domain. In particular, compared with the latest algorithm under low payload, the detection accuracy is improved by up to 10%. In the future, we will extend our methods to the frequency domain.
Data Availability
The steganography algorithm code and BOSSBase ver. 1.01 data used to support the findings of this study are available at http://dde.binghamton.edu/download/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.