Image Compression Based on Hybrid Domain Attention and Postprocessing Enhancement

Deep learning-based image compression methods have made significant progress recently; their two key components are the entropy model for latent representations and the encoder-decoder network. Both inaccurate entropy estimation and information redundancy in the latent representations reduce compression efficiency. To address these issues, this study proposes an image compression method based on a hybrid domain attention mechanism and postprocessing enhancement. We embed hybrid domain attention modules as nonlinear transforms in both the main encoder-decoder network and the hyperprior network to construct more compact latent features and hyperpriors, and then model the latent features with parametric Gaussian-scale mixture models to obtain more precise entropy estimation. In addition, we address the errors introduced by quantization by adding an inverse quantization module. On the decoding side, we also provide a postprocessing enhancement module to further improve compression performance. The experimental results show that the peak signal-to-noise ratio (PSNR) and multiscale structural similarity (MS-SSIM) of the proposed method are higher than those of traditional compression methods and advanced neural network-based methods.


Introduction
In the age of information technology, images have become important information carriers, and massive amounts of image data create enormous transmission and storage pressure. For example, an uncompressed RGB image with a resolution of 512 × 768 has a theoretical storage size of about 1.125 MB, whereas after compression its storage size can be one-sixtieth of the original or even smaller. Therefore, image compression is crucial in computer vision, and the growth of information technology places ever higher demands on image compression efficiency. Traditional image compression methods [1][2][3] have achieved good performance through finely designed manual features and complex processing. For example, JPEG [1] employs the discrete cosine transform (DCT) [4] to eliminate redundancy among pixels. Traditional compression algorithms, however, lack learning capabilities. The advancement of deep learning offers new ideas for image compression methods.
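The storage figure quoted above can be checked with a few lines of arithmetic. The sketch below assumes 8 bits per color channel and takes 1 MB as 2^20 bytes; the 60:1 ratio is the one mentioned in the text, and the resulting bit rate is only an illustration.

```python
# Raw size of an uncompressed RGB image and the effect of a 60:1 compression ratio.
WIDTH, HEIGHT, CHANNELS = 512, 768, 3    # resolution used in the text
BYTES_PER_SAMPLE = 1                      # assumption: 8 bits per channel

raw_bytes = WIDTH * HEIGHT * CHANNELS * BYTES_PER_SAMPLE
raw_mb = raw_bytes / 2**20                # 1 MB taken as 2**20 bytes

compression_ratio = 60                    # "one-sixtieth" from the text
compressed_bytes = raw_bytes / compression_ratio
bits_per_pixel = compressed_bytes * 8 / (WIDTH * HEIGHT)

print(f"raw size: {raw_mb:.3f} MB")
print(f"compressed size: {compressed_bytes / 1024:.1f} KB")
print(f"bit rate: {bits_per_pixel:.2f} bpp")
```

At a ratio of 60:1, the 24 bits per pixel of the raw image shrink to 0.4 bits per pixel, which is the kind of operating point at which rate-distortion curves are usually compared.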
In contrast to traditional methods, learning-based methods replace the linear transform module with a nonlinear neural network, so their performance is determined by how the network structure is constructed to produce more compact latent features. In addition, accurate entropy estimation is one of the pivotal factors in improving the performance of image compression: a good entropy model better fits the true distribution of an image. A further important task is to design the quantization module, which has a significant impact on compression performance.
A classical deep learning-based image compression structure converts images into compressible latent representations by stacking convolutional neural networks [5]. These latent representations are then losslessly entropy coded, exploiting their statistical redundancy, to create bitstreams. At the same time, the jointly optimized decoder decodes the latent representations into images. Classical learning-based image compression algorithms [6,7] use a variational autoencoder structure, and significant progress has been made in various technical components of this architecture, including the use of generalized divisive normalization (GDN) modules in nonlinear transforms, which has been validated for probabilistic modeling and image compression tasks. A nonlocal model was used in the literature [8], and a CNN-based wavelet transform was used in the literature [9] to reduce redundancy among pixels. The entropy estimation model with a hyperprior was first proposed in the literature [7] to capture the hidden information of latent representations, aiding the generation of entropy model parameters and reducing the mismatch between the entropy model and the marginal distributions of the latent representations. In the literature [10,11], an autoregressive module was introduced, and the autoregressive and hyperprior modules complement each other to improve the entropy estimation. The literature [12] proposed a parallelizable checkerboard context model and changed the decoding order to speed up decoding without degrading performance. The literature [13] used importance maps to guide image compression based on generative adversarial network loss functions to enhance the subjective image quality. The literature [14] used a manually set distortion weight map to control the bit-rate allocation and assigned a larger distortion weight to the region of interest in the image, thus enhancing the quality of the region of interest. The problem that DNN-based networks cannot be directly used for near-lossless coding is addressed in the literature [15]. It can be seen that the design and improvement of deep neural network-based image compression methods focus on the structures of the main encoder-decoder, the quantization module, and the entropy estimation module.
In addition, improving image compression quality is the eternal theme of image coding and decoding. Enhancing the image quality at the decoding end is equivalent to improving the compression efficiency. Moreover, since practical lossy compression standards are not theoretically optimal, there is information redundancy that can be further explored and exploited. Furthermore, we find that existing methods still have some limitations. For example, the information transmitted by a hyperprior is neither standardized nor fully utilized: this part of the information, which is encoded into the bitstream to construct the entropy model, is not used for image reconstruction. Although existing compression algorithms offer decent compression efficiency, we also want to eliminate compression artifacts and blur. The purpose of this study is to investigate how to build an effective learning-based image compression approach. To summarize, the following are the study's innovations:
(i) This study describes a hybrid domain attention mechanism that is embedded into the transform encoder-decoder and the hyperprior module to output latent representations and hyperpriors with channel-wise, global context-adaptive activation. The mechanism is embedded into different network layers rather than applied only to the quantized latent representations. The generated attention mask dynamically analyzes the relevance of the features to be compressed and allocates more bits to more essential features, which helps increase the entropy coding efficiency.
(ii) Because quantization has zero gradient almost everywhere, a deep learning-based network that contains it cannot be trained directly. We use a hybrid quantization approach to fix this problem: forward propagation adopts the rounding function, and backward propagation passes the gradient straight through. Apart from being nondifferentiable, quantization also leads to a loss of information, so this study proposes an inverse quantization module to alleviate the errors induced by quantization.
(iii) Finally, as our image compression method is lossy, the reconstructed images will inevitably contain compression artifacts. To further increase image compression quality, obtain results with rich texture information, and generate vivid details, we perform a filtering process on the reconstructed image.

Traditional Image Methods.
Traditional image compression standards use hand-designed encoders, such as the discrete cosine transform used by JPEG [1], which separates high-frequency information from low-frequency information and allocates bits according to signal importance to reduce information redundancy. To improve compression performance, entropy coding is also used. Entropy coding encodes a source according to the probabilities of its symbols. The process is lossless, i.e., no image information is lost. Entropy coding methods include Huffman coding, run-length coding, and arithmetic coding.
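To make the idea of coding a source according to its probabilities concrete, the following minimal sketch computes the Shannon entropy of a toy symbol sequence and the code lengths of a Huffman code built for it. The function names and the toy string are illustrative assumptions and are not taken from any compression standard.

```python
import heapq
import math
from collections import Counter


def shannon_entropy(symbols):
    """Average information content of the sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code of the sequence."""
    counts = Counter(symbols)
    if len(counts) == 1:                       # degenerate single-symbol source
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


data = "aaaabbcd"                              # toy source with skewed probabilities
lengths = huffman_code_lengths(data)
avg_bits = sum(lengths[s] for s in data) / len(data)
print(f"entropy: {shannon_entropy(data):.2f} bits/symbol")
print(f"Huffman: {avg_bits:.2f} bits/symbol, lengths = {lengths}")
```

Frequent symbols receive short codes and rare symbols long ones, so the average code length approaches the entropy of the source. The same principle underlies the learned entropy models discussed later, where the probabilities are predicted by a network instead of being counted.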

Deep Learning-Based Methods.
Deep learning-based image compression is not independent of traditional image compression; rather, it builds on it. There are two main approaches: a convolutional neural network-based compression approach using autoencoders, and a compression method that combines traditional encoders with postprocessing filtering. Traditional image compression uses transform coding, which introduces blocking effects and compression artifacts, and while many methods deal with these issues well, deep learning has a superior ability to solve these types of problems. For the design of the transform encoder-decoder, some works [16,17] used recurrent neural networks (RNNs) to achieve recursive compression of residual information, while others [6,7,10,11] used stacked convolutional blocks. Considering the limited receptive field of convolutional blocks, Cheng et al. [18] used residual blocks to enlarge the receptive field. Rippel and Bourdev [19] proposed the use of feature pyramid pooling (FPN) to obtain more powerful feature representations. In addition, since convolutional operations share features, information redundancy arises. Li et al. [20] proposed using an importance map to adjust the bit allocation of images; the importance map is derived by training a branch consisting of a 3-layer convolutional network. However, the method in [20] requires additional weights to learn the importance map explicitly, which increases the computational overhead, and it is difficult to adaptively allocate bits for deep features. Attention mechanisms have demonstrated considerable strength in adaptively learning feature importance in recent years, with significant results in tasks such as natural language processing [21] and semantic segmentation [22]. Furthermore, introducing the nonlocal block (NLB) into neural networks can greatly improve image denoising and image super-resolution reconstruction performance [23,24]. This study, therefore, presents a hybrid domain attention mechanism that is embedded into both the main and hyper coders, which allows features to have adaptive responses, reinforcing important features and weakening unimportant ones, and further improves the compression performance. Akbar et al. [25] employed multiplicative convolution, which is also based on the idea of allocating more bits to important regions and reducing spatial redundancy.
Deep learning-based methods require gradient-based network training, but quantization operations are not differentiable. The literature [6] proposed adding uniform noise as an alternative to true quantization. The literature [11] sets the gradient of the quantization operation to a fixed value so that training can proceed. To make the quantization smoother, soft-to-hard quantization is used instead of direct scalar quantization in the literature [26].
In the entropy coding section, different entropy models have been proposed for the quantized latent representations. In the earlier literature [6], an entropy model constructed from a piecewise linear function was used for bit-rate estimation; the parameters of its probabilistic model are fixed at the end of training, and the quantized latent representations are entropy encoded and decoded based on these probabilities. To improve the entropy model, a hyperprior structure was proposed in the literature [7], in which the authors used a Gaussian-scale mixture (GSM) (Gaussian with different means and scales) modeling approach to estimate the probability of latent features. The parameters needed for GSM modeling are derived by the hyperprior module, and the latent features are then encoded into the bitstream and sent to the decoder. This achieves image-adaptive coding and also obtains better performance than BPG (4 : 4 : 4) [3]. In the literature [10], an autoregressive context module is proposed to perform parameter prediction of the entropy model in conjunction with the hyperprior structure proposed in the literature [7].
To enhance the image reconstruction quality even further, many studies [26,27] used a generative adversarial network (GAN) as a distortion measure in the training phase to guide the decoder to generate more realistic texture structures. The resulting reconstructed images have good subjective quality, but the generated textures are not real textures and lack fidelity. For this reason, TuCode technology in [28] proposed an enhancement module that acts on full-resolution images to reduce compression artifacts by building a simple neural network to filter the reconstructed images.
The literature [29] investigated the effect of decoding network complexity on image compression performance and concluded that postprocessing networks do not significantly improve compression performance when the network at the decoding end has a strong enough reconstruction capability. When neural networks are used with traditional encoding schemes, the auxiliary information generated during the encoding process has a significant effect on image denoising: ByteDance uses prediction information and coding unit (CU) block partition information to assist the neural network in replacing the deblocking (DB) and sample adaptive offset (SAO) modules for in-loop processing, and Qualcomm uses a network with BS-assisted information to replace the DB filter for block filtering.

Image Compression Architecture.
Figure 1 shows the network structure. We deploy a deep learning-based autoencoder network. This method mainly includes the main encoder-decoder, hyper encoder-decoder, autoregressive context module, and postprocessing enhancement module. In particular, given the training images x, a transform encoder generates the corresponding latent features y. The quantizer quantizes the latent features to ŷ, and the quantized ŷ is subsequently entropy coded into a bitstream for transmission. The autoregressive context module combined with the hyperprior module is used for entropy coding. Entropy coding in this way first estimates the distribution of the latent representations through the hyperprior network, and the output of the hyperprior encoder is then quantized and encoded into the bitstream. It is encoded into the bitstream because this part of the bitstream is required during decoding, and an accurate entropy model improves the compression efficiency.
Autoregressive prior information is used in estimating the distribution of the latent representations y in order to obtain an accurate entropy model. The attention module is used to adjust bit allocation based on the importance of the information. Considering that the goal of image compression algorithms is to obtain the highest possible reconstruction quality for a given bit-rate target, the method in this study adds a postprocessing enhancement module: the image decoded by the main decoder and the mean information generated for modeling the Gaussian distribution are jointly fed into the postprocessing network to assist in generating the final reconstructed image.

Computational Intelligence and Neuroscience
From information theory, when the features to be coded are concentrated, their information entropy is lower, fewer bits are required, and the resulting reconstructed image has more distortion. Conversely, when the information entropy of the features is higher, more bits are required and the image distortion is lower; therefore, we use the hyperparameter λ to establish a trade-off between the two. The entire compression method is trained by optimizing the following loss function [30]:
$$L = R + \lambda D \tag{1}$$
where R refers to the bit rate, and D indicates the distortion between the images before and after compression. There are two commonly used distortion metrics, namely, multiscale structural similarity (MS-SSIM) and mean square error (MSE) [31]. The distortion and bit rate are weighted by λ.
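The role of λ in this trade-off can be illustrated with a small sketch. It assumes the common form of the objective, rate plus λ times distortion, uses MSE as the distortion, and compares two hypothetical operating points; none of the numbers come from the experiments in this study.

```python
def mse(original, reconstructed):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def rd_loss(rate_bpp, distortion, lam):
    """Rate-distortion objective: rate plus lambda-weighted distortion."""
    return rate_bpp + lam * distortion


x = [0.10, 0.50, 0.90, 0.30]         # toy "image" with values in [0, 1]
x_hat = [0.12, 0.48, 0.85, 0.33]     # toy reconstruction
d = mse(x, x_hat)

# Two hypothetical operating points, given as (rate in bpp, distortion).
low_rate = (0.2, 4 * d)              # cheap to transmit, more distorted
high_rate = (0.8, d)                 # expensive to transmit, less distorted

for lam in (10.0, 1000.0):
    best = min((low_rate, high_rate), key=lambda p: rd_loss(p[0], p[1], lam))
    print(f"lambda = {lam:6.1f} -> preferred rate: {best[0]} bpp")
```

A small λ makes the rate term dominate, so the low-rate point wins; a large λ penalizes distortion heavily and favors the high-rate point. Training one model per λ value is what produces the separate points of a rate-distortion curve.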
In training, we use an entropy estimation approach consistent with the model in the literature [10], and we model the latent features as follows:
$$p_{\hat{y}\mid\hat{z}}(\hat{y}\mid\hat{z}) = \prod_i \left(\mathcal{N}(\mu_i, \sigma_i^2) * u\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right)(\hat{y}_i) \tag{2}$$
Each quantized latent representation $\hat{y}_i$ is modeled as a Gaussian distribution with mean $\mu_i$ and scale $\sigma_i$, which are predicted from the hidden variable z. z is called the hyperprior, $u(\cdot)$ denotes a uniform distribution, and $*$ denotes the convolution operation. Because prior information about z does not exist, we model the hyperprior z as follows:
$$p_{\hat{z}\mid\psi}(\hat{z}\mid\psi) = \prod_i \left(p_{z_i\mid\psi^{(i)}}(\psi^{(i)}) * u\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)\right)(\hat{z}_i) \tag{3}$$
where $p_{z_i\mid\psi^{(i)}}$ represents each univariate distribution, and $\psi^{(i)}$ represents the parameters of this distribution. Finally, the bit rate in our method consists of the bit rate of the latent representations y and the bit rate of the hidden variable z. These bit rates are denoted as follows:
$$R = R_{\hat{y}} + R_{\hat{z}} = \mathbb{E}\left[-\log_2 p_{\hat{y}\mid\hat{z}}(\hat{y}\mid\hat{z})\right] + \mathbb{E}\left[-\log_2 p_{\hat{z}\mid\psi}(\hat{z}\mid\psi)\right] \tag{4}$$
Table 1 shows the network architecture and related parameters for the separate components in our compression method. One of them is the hybrid domain attention mechanism (HDAM), which will be described in Section 3.2. In addition, the inverse quantization module will be described in Section 3.3 and the postprocessing module will be described in Section 3.4.
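The bit cost implied by modeling a quantized latent as a Gaussian convolved with a unit-width uniform distribution can be sketched as follows. Convolving with the uniform density on (-0.5, 0.5) and evaluating at an integer is the same as integrating the Gaussian over the rounding interval, so the probability mass is a difference of two Gaussian CDF values. The toy latents, the predicted parameters, and the probability floor are illustrative assumptions.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def symbol_probability(y_hat, mu, sigma):
    """Probability mass of integer y_hat under N(mu, sigma^2) * U(-0.5, 0.5)."""
    return gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)


def estimated_bits(y_hat, mu, sigma, floor=1e-9):
    """Code length in bits; the floor avoids log2(0) for very unlikely symbols."""
    return -math.log2(max(symbol_probability(y_hat, mu, sigma), floor))


y_hat = [0, 1, -2, 0]                               # toy quantized latents
good_mu, good_sigma = [0.1, 0.9, -1.8, 0.0], 0.5    # hypothetical accurate prediction
poor_mu, poor_sigma = [0.0, 0.0, 0.0, 0.0], 3.0     # hypothetical vague prediction

good_bits = sum(estimated_bits(y, m, good_sigma) for y, m in zip(y_hat, good_mu))
poor_bits = sum(estimated_bits(y, m, poor_sigma) for y, m in zip(y_hat, poor_mu))
print(f"accurate entropy model: {good_bits:.2f} bits")
print(f"vague entropy model:    {poor_bits:.2f} bits")
```

The closer the predicted mean is to the actual latent and the tighter the predicted scale, the fewer bits the entropy coder needs, which is why the hyperprior and context modules are trained to predict these parameters well.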

Hybrid Domain Attention Mechanism.
In previous studies, transform encoders were often implemented using stacks of convolutional neural networks. In learning-based image compression, learning a transform encoder that retains less redundant information and more information critical to reconstruction is one of the keys to better compression performance. Convolutional layers are limited by the range of their receptive fields and have only a local inductive bias; increasing the depth of the network allows longer-range dependencies to be captured but also brings a significant increase in computational effort. Many recent studies have used attention modules to improve image restoration and compression performance [8,32], introducing them into the transforms to model the global dependencies between features, which results in a less redundant latent representation of the image on the encoder side. The attention mechanism is inspired by the human visual mechanism: it adaptively learns the weights of different features and identifies the areas that need to be focused on. Previous works have used only spatial information to generate attention maps, which does not allow correlations between channels of the latent features to be mined well. As shown in Figure 2, we examined the effect of each of the first 32 channels of the latent representations on the distortion of the reconstructed images and concluded that each channel has a different importance for the final reconstruction. Unlike previous works and inspired by [33], we propose a hybrid domain attention mechanism that works in both the channel and spatial domains. The hybrid attention module is embedded in the transform encoders and adaptively learns relevant compression features to obtain a transform encoder with reduced information redundancy that preserves the key information for reconstruction.
In our hybrid domain attention module, the input features are first passed through the channel attention module to obtain the channel domain attention map, which is then multiplied element by element with the input features to obtain the features required by the spatial attention mechanism. The output of the channel domain is used as the input feature map for the spatial attention mechanism, and the spatial domain attention map and these input features are then multiplied element by element to obtain the final output. Figure 3(a) depicts the architecture of the channel attention mechanism, and the channel mask can be described as follows:
$$A_c(X) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(X)\right)\right) \tag{5}$$
where X is the input feature map, $A_c(\cdot)$ denotes the channel domain attention map, MLP stands for a 2-layer neural network, and σ represents the sigmoid activation function. The feature map X is first aggregated by average pooling over its spatial dimensions; the aggregated information is then sent to the MLP for compression and finally passes through a sigmoid activation to create the channel attention map. The spatial attention module is divided into a backbone branch, which uses traditional residual blocks to generate features, and a mask branch, in which the nonlocal block (NLB) [34] is embedded. Figure 3(b) depicts the spatial attention mechanism, and the spatial mask can be described as follows:
$$A_S(X) = \sigma\left(F_{\mathrm{NLB}}(X)\right) \tag{6}$$
where $A_S(\cdot)$ represents the attention mask, and $F_{\mathrm{NLB}}(\cdot)$ denotes the result of applying the NLB and the subsequent residual blocks and convolution. As shown in Table 1, we integrate hybrid domain attention mechanisms into the framework of the proposed method.
The attention module aids the network's global adaptive response by reinforcing essential features while weakening unimportant ones, thus implicitly learning a feature importance map and allocating more bits to textured regions, which results in better visual quality at similar bit rates.
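The channel-then-spatial ordering described in this section can be sketched in plain Python on a tiny feature map. This is only an illustration under stated assumptions: the weights are hand-set rather than learned, a ReLU is assumed between the two MLP layers, and the spatial mask is a fixed stand-in for the output of the NLB-based mask branch, which is not reproduced here.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def global_average_pool(x):
    """x: list of C channels, each an H x W nested list. Returns C channel means."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def dense(vec, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def channel_attention(x, w1, b1, w2, b2):
    """Channel mask: sigmoid(MLP(GAP(x))) with an assumed ReLU between the layers."""
    hidden = [max(0.0, h) for h in dense(global_average_pool(x), w1, b1)]
    return [sigmoid(o) for o in dense(hidden, w2, b2)]


def scale_channels(x, mask):
    """Multiply every element of channel c by mask[c]."""
    return [[[m * v for v in row] for row in ch] for ch, m in zip(x, mask)]


def apply_spatial_mask(x, spatial_mask):
    """Multiply every channel element-wise by the same H x W mask."""
    return [[[v * s for v, s in zip(row, mrow)] for row, mrow in zip(ch, spatial_mask)]
            for ch in x]


# Toy input: 2 channels of size 2 x 2.
x = [[[1.0, 1.0], [1.0, 1.0]],
     [[0.0, 2.0], [4.0, 2.0]]]
w1, b1 = [[1.0, 1.0]], [0.0]                 # hypothetical weights: 2 -> 1 units
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]         # hypothetical weights: 1 -> 2 units

mask = channel_attention(x, w1, b1, w2, b2)  # one weight per channel
channel_out = scale_channels(x, mask)        # input to the spatial stage
spatial_mask = [[1.0, 0.5], [0.5, 1.0]]      # stand-in for the NLB mask branch
out = apply_spatial_mask(channel_out, spatial_mask)
print(mask)
```

With these hand-set weights the first channel is emphasized and the second is suppressed, which mirrors the intended behavior of reinforcing important features and weakening unimportant ones.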

Hybrid Quantization.
For lossy image compression, all features need to be quantized into integer form for entropy coding. Deep learning-based training, on the other hand, is hampered by the inherently nondifferentiable nature of quantization. In the training phase, deep learning-based image compression methods [6,7] typically approximate quantization by adding uniform noise for end-to-end optimization:
$$\tilde{y} = y + u\left(-\tfrac{1}{2}, \tfrac{1}{2}\right) \tag{7}$$
Other works use straight-through estimation (STE) of the gradient and manually set the backward propagation expression for the rounding function:
$$\frac{d}{dy}\,\mathrm{round}(y) = 1 \tag{8}$$
In our method, we integrate the two quantization methods. For the encoder output y, we use STE quantization to round the values and feed the quantized result into the decoder, while for the entropy model network, we use approximate quantization with added noise for entropy modeling. In addition, to reduce the loss of information due to quantization, we incorporate an inverse quantization subnetwork. The fractional information lost through rounding can be treated as additive noise in the range (−0.5, 0.5), which is why most compression methods today approximate quantization by adding noise. The inverse quantization network is similar to existing image denoising networks and makes the inverse-quantized features as close as possible to the prequantization features. Figure 4 depicts the specific network structure. Adding uniform noise treats the loss due to quantization as random noise, but the loss due to rounding is actually deterministic and traceable, which helps the inverse quantization network perform a more accurate "denoising job."
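The two quantization surrogates combined in this section can be sketched without a deep learning framework. The functions below only illustrate the forward and backward rules; in a real training setup the backward rule would be registered with the framework's automatic differentiation, which is not shown. All names and toy values are illustrative assumptions.

```python
import random


def ste_round(y):
    """Forward pass of STE quantization: ordinary rounding to the nearest integer."""
    return [float(round(v)) for v in y]


def ste_backward(grad_output):
    """Backward pass of STE: the incoming gradient is passed through unchanged."""
    return list(grad_output)


def noisy_quantize(y, rng):
    """Training-time surrogate: add U(-0.5, 0.5) noise instead of rounding."""
    return [v + rng.uniform(-0.5, 0.5) for v in y]


rng = random.Random(0)                    # fixed seed, chosen only for repeatability
y = [0.2, 1.7, -2.4]                      # toy encoder output

y_hat = ste_round(y)                      # branch fed to the decoder
y_tilde = noisy_quantize(y, rng)          # branch fed to the entropy model
grad_y = ste_backward([0.3, -1.0, 2.0])   # gradient w.r.t. y equals gradient w.r.t. y_hat

rounding_error = [q - v for q, v in zip(y_hat, y)]
print(y_hat, rounding_error)
```

The rounding error is a deterministic function of y and always lies within half a quantization step, whereas the added noise is random. This difference is what the inverse quantization subnetwork is meant to exploit.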

Postprocessing Module.
Since image compression approaches based on deep learning are lossy, a hyperprior-based model needs to keep the dimensionality of the latent representation low; otherwise, the latent representation itself may contain redundancy, which results in unavoidable compression artifacts and poor compression performance. In this case, the hyperprior module may lose some information, which affects the accuracy of the parameters required to model the entropy rate, especially at high bit rates and high resolutions. Taking into account the degradation introduced by image compression and the need to improve the quality of the compressed image and provide better visual effects, we propose introducing a postprocessing module at the main decoding end. In addition, we propose reusing the mean information μ derived jointly from the hyperprior module and the autoregressive module. As shown in equation (2), through the training of the neural network, μ captures the hidden information in y, so μ contains rich structural information. To exploit the full potential of this auxiliary information, we feed the mean information into the postprocessing module to further assist the postprocessing network in removing compression artifacts.
Figure 3: (a) The structure of spatial attention in the hybrid domain attention module. "RB" stands for resblock. "NLB" stands for the nonlocal module proposed in [34]. (b) The structure of channel attention in the hybrid domain attention module. "FC" stands for a 2-layer fully connected layer. "GAP" stands for global average pooling.
Inspired by image denoising and super-resolution network design strategies [35], our postprocessing network uses a residual network structure for quality enhancement of the reconstructed image. As illustrated in Figure 5, we start with two convolutional layers to extract shallow features, changing the number of channels from 6 to 32, and then cascade three identical modules for detail enhancement. Because we are dealing with full-resolution images, we use only three enhancement blocks to keep the computational cost from rising as the network depth grows. Each enhancement block contains two multiscale residual blocks to extract multiscale features. These blocks use three different convolutional kernel sizes (5 × 5, 3 × 3, and 1 × 1) together with PReLU activation functions. Finally, a convolution changes the channel dimension back to that of the image before global residual learning is applied to obtain the enhanced image.
In fact, a general postprocessing module is not always necessary, because it brings little benefit if the decoder network is powerful enough [29]. However, our postprocessing can increase the quality of the decoded images even further owing to the reuse of information from μ. The ablation experiments in Section 4.3 demonstrate the effectiveness of the proposed postprocessing module.

Operation Details
(1) Experimental environment. We used PyTorch to implement our method, and we ran all of our tests on an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of video RAM.
(2) Experimental data. To verify that the proposed technique is effective, we conducted a series of tests. Since image compression tasks are unsupervised and do not require additional label files, most image compression methods crawl high-resolution images from the web as the training dataset. We used 20,745 high-quality images provided by Flickr.com. These images were randomly cropped to a size of 256 × 256 during preprocessing, yielding a total of around 800,000 images as the training set. To assess the efficiency of image compression methods, we tested the performance of the various methods using the Kodak Photo CD image dataset [36] as testing data.
(3) Comparison of methods. Our comparative experiments include classical traditional image compression methods (JPEG2000 [2], BPG [3]) and more recent deep learning-based image compression methods (Ballé et al. [6,7], Minnen et al. [10], Lee et al. [11]). JPEG2000 [2] compression was tested using the official test model OpenJPEG 2000 configured in YUV 420 [37], and BPG [3] compression was tested using the BPG software [9] in the format YUV 440.
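The random 256 × 256 cropping used to build the training set can be sketched as follows. The function name, the 1024 × 768 source size, and the fixed seed are illustrative assumptions; the source does not describe the cropping code itself. The last lines show the average number of crops per source image implied by the figures quoted above.

```python
import random


def random_crop_box(width, height, crop=256, rng=random):
    """Return (left, top, right, bottom) of a random crop x crop window."""
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    left = rng.randint(0, width - crop)    # randint is inclusive at both ends
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


rng = random.Random(0)                     # fixed seed for repeatability
boxes = [random_crop_box(1024, 768, rng=rng) for _ in range(5)]  # hypothetical source size

crops_per_image = 800_000 / 20_745         # figures quoted in the text
print(boxes[0])
print(f"about {crops_per_image:.1f} crops per source image")
```

Each returned box can be passed to an image library's crop routine. Drawing several boxes per image is how roughly 20,000 photographs can yield a training set of several hundred thousand patches.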

Compression Performance.
The unit of PSNR and MS-SSIM is dB, and a higher value means less distortion and better visual quality. As shown in Figure 6, our rate-distortion curves plotted using PSNR (see Figure 6(a)) and MS-SSIM (see Figure 6(b)) as distortion metrics show that our method has significant performance advantages over JPEG 2000 [2], BPG [3], and Ballé2018 [7] at the same BPP for both evaluation metrics. Our technique performs comparably to Minnen [10] at low bit rates and shows clear gains over Minnen [10] at high bit rates. Furthermore, our method improved performance at all bit rates, demonstrating the efficacy and stability of the proposed method.
We also conducted ablation studies to validate the effectiveness of each module. As seen in Figure 7, we retrained the model with the hybrid domain attention mechanism embedded into the baseline (four sets of models were trained), and it can be seen that it produced a gain of around 0.2 dB over the baseline model. We also retrained the model with the quantization in the baseline changed to our hybrid quantization (four sets of models were trained), which produced a gain of about 0.2 dB over the baseline, with more gain at higher bit rates. Finally, we retrained the model with our postprocessing module added to the baseline (four sets of models were trained), which produced a gain of nearly 0.2 dB over the baseline as well.
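For readers who want to relate the decibel figures above to raw error values, the following sketch shows the standard PSNR formula and a commonly used decibel mapping for MS-SSIM. The mapping -10 log10(1 - MS-SSIM) is an assumption here, since the text does not state which conversion was used, and the input values are arbitrary examples.

```python
import math


def psnr_db(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def ms_ssim_db(ms_ssim):
    """MS-SSIM mapped to dB as -10 * log10(1 - MS-SSIM) (assumed convention)."""
    return -10.0 * math.log10(1.0 - ms_ssim)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


print(f"PSNR at MSE 65.025: {psnr_db(65.025):.2f} dB")
print(f"MS-SSIM 0.99 in dB: {ms_ssim_db(0.99):.2f} dB")
print(f"MSE ratio for a 0.2 dB gain: {mse_ratio_for_gain(0.2):.3f}")
```

Under the PSNR definition, a gain of 0.2 dB at a fixed bit rate corresponds to a reduction in mean squared error of roughly 4.5 percent.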
Four aspects affect the efficiency of image compression: how to extract more compact features, how to reduce quantization losses, the reconstruction ability of the decoder, and the accuracy of the entropy estimation. In our method, we improved the baseline model in these four dimensions, and from our ablation experiments, we conclude that the hybrid domain attention mechanism has the greatest impact on performance.

Visual Comparison.
To make the effectiveness of our framework clearer, we provide some visualization results. Figure 8 shows the visualization of some images compressed using different compression methods in the Kodak Photo CD dataset [36]. Since neural network-based image compression methods cannot strictly limit the BPP of an image, we can only compare different compression methods at similar BPP of the compressed images. As can be seen from the figures, our method has a higher evaluation index at similar BPP.
In terms of qualitative observation, the image compression methods JPEG 2000 [2] and BPG [3] have obvious block effects and blurring phenomena, and the deep learning-based image compression methods of Lee et al. [11] and Minnen et al. [10] have some loss of edge texture information in the reconstructed images, although they do not have as obvious blurring compression artifacts as the traditional compression methods, while the reconstruction quality of the compression methods in this study is higher.
This is because our method embeds a hybrid spatial-channel attention mechanism that responds in a globally adaptive way, reinforces important features (e.g., edge texture details), and allocates more bits to them. Our framework also uses a hybrid quantization mechanism with an added inverse quantization module and a postprocessing module, all of which guide the network to use fewer bits to obtain higher reconstruction quality. Overall, the approach in this study is also superior in visual quality.

Conclusions
We proposed an efficient trainable image compression method that achieves good performance. In particular, we add a hybrid domain attention module that not only improves the transform capability of the encoder-decoder network but also generates more compact attention masks with the hyperprior distribution, which facilitates more accurate probability estimates and thus improves entropy coding efficiency. In addition, our method reuses intermediate layer information to synthesize the final reconstructed image through a postprocessing enhancement module. We also add a quantization adaptive adjustment module to repair quantization losses. This method reduces the total error by optimizing the network parameters through backward propagation. The experimental results suggest that our method offers a substantial improvement over both the traditional compression methods JPEG 2000 [2] and BPG [3] and the learning-based method of Ballé et al. 2017 [6]. Compared to the advanced learning-based image compression method of Minnen et al. [10], this method also has a higher performance index and better visual effect. In the realm of learning-based image compression, extracting compact latent features is especially significant. Attention mechanisms are still unable to completely eliminate data redundancy in latent features, and more capable attention mechanisms are still being explored. The postprocessing module on the decoder side, the inverse quantization module in hybrid quantization, and the nonlocal block (NLB) [34] in the hybrid domain attention mechanism all add to the computational complexity, but the performance advantages are significant. We expect that more efficient and lightweight attention mechanisms for extracting latent features can be explored.

Data Availability
The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.