Advertisement Synthesis Network for Automatic Advertisement Image Synthesis



Introduction
With the development of the times and the transformation of the economic model, the commodity market has moved from the product era to the brand era. Products alone no longer determine an enterprise's competitive advantage in the market; branding and marketing have instead become an important means for enterprises to stand out from the quagmire of homogenisation. Therefore, advertising is the most effective and necessary means of marketing [1].
It has been found that vivid and intuitive pictures can have a positive impact on consumers' purchasing decisions [2]. A well-designed product promotional image can show the product's characteristics through different scenes, which can inspire consumers to buy and change their attitudes towards and impressions of the product. Therefore, most scholars agree that the visual design of product promotional images can directly affect the effectiveness of advertisements [3]. However, designing and producing product advertisement images requires a huge investment of time and capital, while enterprises always want to produce attractive advertisements at low cost to stimulate consumers to buy.
To reduce the labour cost of creating these ads, technologies that support the automatic synthesis of ads have received considerable attention, including the automatic assembly of graphical elements using aesthetic principles [4] and the simultaneous creation of a series of banners for different display sizes [5]. However, both of these methods generate advertisements by simply splicing together elements such as product images, text messages, and brand logos, which does not produce adverts with images that are attractive enough to consumers.
Fortunately, with the introduction of the diffusion model [6], which is based on a hierarchical construction of denoising autoencoders, image synthesis techniques [7,8] have achieved impressive results. This not only makes it possible to synthesise highly creative and artistic complex advertisement images but also greatly reduces the design cost of advertisement images, bringing a revolutionary change to the advertisement design industry.
More specifically, suppose we receive an order from a fruit company that wants us to create several product advertisements to promote the cherries and watermelons they sell. All we need to do is write a reasonable prompt text and feed it into the Stable Diffusion (SD) model for image synthesis [9], and we can easily obtain a series of vivid images of fruit products (as shown in Figure 1).
However, existing image synthesis algorithms can only be applied to the production of advertisement images for generic target products (e.g., various types of fruit) and cannot directly generate advertisement images for a specific target product (e.g., a specific brand of sports shoes). For example, suppose we need to make a promotional image for a Converse sports shoe; writing prompts and feeding them directly to the SD model can only generate an advertisement image of an object with a similar appearance to the target object (as shown in Figure 2). Obviously, such a product advertisement image cannot be used for product promotion and publicity. Although there is strong demand, this topic has not been well explored by previous researchers.
Therefore, we propose the Advertisement Synthesis Network (ASNet) in this paper to address this challenge. Unlike previous methods, ASNet is capable of generating consistent-looking, high-quality product advertisement images of the input target object in a zero-shot manner. Here, consistent-looking means that the appearance details of the target object are completely preserved when ASNet generates advertisement images, which is the biggest advantage of ASNet.
To achieve this, we utilise a two-stage generation structure in ASNet. Specifically, we first generate a pseudo-product advertisement image using the SD-based Pre-Synthesis model. The product shown in the pseudo-product advertisement image has appearance characteristics similar to those of the target product.
Then, we use the Pseudo Target Object Encoder (PTOE) to extract scene features and the Real Target Object Encoder (RTOE) to extract real target features. Finally, we combine these features by injecting them into the pretrained diffusion model for interaction and reconstruct the real advertisement image within the pseudo-product advertisement image.
In sum, our work makes the following contributions:
(1) We propose a novel Advertisement Synthesis Network for the automatic generation of advertisement images for a given product. ASNet is a two-stage, end-to-end model that takes prompt text and target object images as inputs to synthesise consistent-looking product advertisement images. To the best of our knowledge, ASNet is the first fully automated advertisement image generation model that requires no manual intervention.
(2) In comparisons with state-of-the-art image generation models, we obtain superior advertisement image synthesis results on test data. We believe that the two-stage generation protocol used in this paper breaks the paradigm of conventional advertisement image synthesis methods and can provide a generic solution idea for similar tasks.

Related Work
Early Generative Adversarial Networks (GANs) [10,11] are capable of sampling and generating high-resolution images, but they are difficult to optimise [12][13][14] and struggle to capture the complete distribution of the data [15]. In contrast, Variational Autoencoders (VAEs) [16] and flow-based models [17,18] are easier to optimise [19][20][21], but the quality of the images they generate is lower than that of GAN-based models.
Recently, diffusion models (DMs) [6] have achieved state-of-the-art synthesis results on image data and beyond [22,23] by decomposing the image formation process into a sequential application of denoising autoencoders. The subsequently proposed latent diffusion models (LDMs) [9] achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
Diffusion model-based methods have shown great promise in image generation, beating GAN-based methods in the diversity of generated images, and their image synthesis has brought about unprecedented changes.

Advertisement Image Synthesis Model.
Traditional methods for automatic advertisement image generation typically use graphical design strategies that are driven by design rules or structured data.
O'Donovan et al. [24] proposed constructing an energy function by assembling various heuristic visual cues and design principles to optimise the layout of a single page, and extended the function into an interactive tool for the automatic generation of advertisement images. Yang et al. [24] proposed a system for generating visual text presentation layouts for advertisement images, in which colours are determined automatically with the help of a colour harmony model and a colour tone model, and theme colours are defined by the designer. Liu et al. [25] introduced an intelligent banner release tool, Luban, which can automatically synthesise banners for different commodities.

International Journal of Antennas and Propagation
With the recent great success of deep learning-based GAN and SD models for image generation [26,27], these models have been widely studied in the field of advertisement synthesis. However, although these methods are capable of generating realistic and natural-looking images, they are still rarely used for the automatic generation of advertisement images because of the difficulty of finding suitable data pairs for supervised learning. To address this problem, You et al. [28] created a dataset containing 13,280 advertisement images with rich annotations, including the outline and colour of the elements as well as the category and target of the advertisement, and constructed a new probabilistic model to guide the synthesis of the advertisement's style. The aim is to use a data-driven approach to capture the relationships between individual design attributes and elements in an advertisement image and to automatically synthesise the input elements into an advertisement image based on a specified style.

Overview of the ASNet.
The Advertisement Synthesis Network (ASNet) pipeline is shown in Figure 3. It is capable of generating high-quality and creative advertisement images after inputting specific target products and well-designed prompt texts. Unlike traditional methods, the ASNet proposed in this paper is able to synthesise advertisement images that match the appearance of the target product in an end-to-end manner without manual intervention.
Our core idea in building ASNet is to first generate preliminary target product scenes using Pre-Synthesis, then extract representative scene features using the Pseudo Target Object Encoder (PTOE) and real target object features using the Real Target Object Encoder (RTOE). Finally, these features are injected into the pretrained diffusion model and recombined in the initially generated target product scene.
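The data flow described above can be summarised in a short sketch. This is only an illustration of the two-stage structure under stated assumptions: the class name `ASNetSketch`, its field names, and the toy stub functions are hypothetical and stand in for the real networks, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]   # toy stand-in for an image tensor
Feature = List[float]       # toy stand-in for a feature map


@dataclass
class ASNetSketch:
    """Two-stage data flow; every callable is a hypothetical placeholder."""
    pre_synthesis: Callable[[str], Image]            # prompt text -> pseudo ad image
    ptoe: Callable[[Image], Feature]                 # scene features from pseudo ad
    rtoe: Callable[[Image], Feature]                 # detail features from real product
    diffusion: Callable[[Feature, Feature], Image]   # conditioned synthesis

    def synthesise(self, prompt: str, target_image: Image) -> Image:
        pseudo_ad = self.pre_synthesis(prompt)       # stage 1: scene only
        f_p = self.ptoe(pseudo_ad)                   # scene information
        f_r = self.rtoe(target_image)                # real product details
        return self.diffusion(f_p, f_r)              # stage 2: recombination


# Toy stubs so the sketch runs; they perform no real image synthesis.
toy = ASNetSketch(
    pre_synthesis=lambda prompt: [[float(len(prompt))]],
    ptoe=lambda img: [img[0][0]],
    rtoe=lambda img: [img[0][0]],
    diffusion=lambda f_p, f_r: [[f_p[0] + f_r[0]]],
)
result = toy.synthesise("a pair of sneakers", [[2.0]])
```

The point of the sketch is only the ordering: the prompt never reaches the second stage directly, and the real product image never reaches the first.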

Pre-Synthesis.
The Pre-Synthesis (PS) module is built on the Stable Diffusion model and is used to initially generate advertisement image scenes for the target product in text-to-image form. Mathematically, PS is described as
$I_{\mathrm{pseudo}} = P_\theta(t)$,
where $P_\theta$ denotes the PS with network parameters $\theta$. We use PS to convert the prompt text $t$ into the initially generated advertisement image $I_{\mathrm{pseudo}}$. It is worth noting that the initially generated advertisement image does not have the appearance of the target product; we only use its generated advertisement scene for the secondary generation.

Target Object Encoder.
The Target Object Encoder (TOE) module is shown in Figure 3. The TOE module can extract rich feature details and scene information from the input image for the secondary synthesis of target product advertisement images. TOE consists of PTOE and RTOE: PTOE is used to extract scene features from the pseudo-advertisement image, and RTOE is used to extract detailed features from the real product image.

Figure 2: Comparison between the synthesised advertisement images and the target product. We synthesised a set of advertisement images of sneakers using the text-to-image mode of the Stable Diffusion model, but the correlation between them and our target product is very low, and there is a significant discrepancy in texture and appearance. Panels: Target Object; Stable Diffusion Synthetic Object. Prompt text: "A pair of sneakers, in the flowers, real rendering, beautiful color".
The network architecture of RTOE consists of the self-supervised model DINO-V2 [29] for feature extraction, followed in series by a single linear layer T() for fine-tuning.
The input to the RTOE module is a target product image without a background, $I_{\mathrm{real}}$. Product images without backgrounds help RTOE obtain cleaner and less ambiguous features in the feature extraction phase. After receiving the real target product image, RTOE encodes it and fine-tunes the encoded features to finally obtain a spatially aligned feature $f_r$, which is mathematically described as
$f_r = T(\mathrm{DINO}(I_{\mathrm{real}}))$,
where $\mathrm{DINO}(\cdot)$ denotes the DINO-V2 feature extractor.
However, it is not enough to generate an advertisement image using only the feature information of the real target product image. We also need additional guidance to supply the scene information. Therefore, we constructed a PTOE $\Phi_p$ to extract scene information from the pre-generated pseudo-advertisement image using a ControlNet-style [30] network that generates a range of detailed feature information at hierarchical resolutions. This process is expressed as
$f_p = \Phi_p(I_{\mathrm{pseudo}})$,
where $f_p$ denotes the scene features extracted from the pseudo-advertisement image.
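The RTOE structure (a pretrained feature extractor followed by one linear layer) can be illustrated with a toy sketch. The token values, weights, and the identity "encoder" below are hypothetical placeholders; the real module uses DINO-V2 features, which are not reproduced here.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]


def linear(tokens: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Apply y = W x + b to every token (the single linear layer T)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def rtoe(image_tokens: Matrix, encoder: Callable[[Matrix], Matrix],
         weight: Matrix, bias: Vector) -> Matrix:
    """Encoder output passed through the linear projection."""
    return linear(encoder(image_tokens), weight, bias)


def identity_encoder(tokens: Matrix) -> Matrix:
    """Hypothetical stand-in for the DINO-V2 backbone."""
    return tokens


tokens = [[1.0, 2.0], [3.0, 4.0]]           # two toy "patch tokens"
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # projects 2 dims to 3 dims
b = [0.0, 0.0, 0.5]
f_r = rtoe(tokens, identity_encoder, W, b)
```

In this toy setup the projection changes the feature width from 2 to 3, which mirrors the role of aligning encoder features with what the downstream diffusion model expects.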
Feature Injection.
After obtaining $f_p$ and $f_r$, we combine them to synthesise an advertising image of the real target product. We inject them into a pre-trained text-to-image diffusion model, at which point we probabilistically sample the image using UNet and project it into the latent space using the Stable Diffusion model to guide the image synthesis.
We denote the sampling function of the UNet model by $\upsilon_\vartheta$. It starts denoising from an initial latent noise $\varepsilon \sim U([0, 1])$, takes $f_p$ and $f_r$ as conditions to generate a new image latent $z_j$, and uses the decoder $D(\cdot)$ to generate the final advertisement image of the real target product. Here, $j$ is the diffusion time step, and $\alpha_j$ and $\sigma_j$ are denoising hyperparameters.
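The control flow of this conditioned sampling can be sketched as a simple loop. This is a minimal illustration, not the paper's update rule: `toy_step` and `toy_decoder` are hypothetical placeholders, and the actual step involving the time-dependent denoising hyperparameters is not reproduced.

```python
import random
from typing import Callable, List

Latent = List[float]


def initial_latent(size: int, rng: random.Random) -> Latent:
    """Draw the starting latent from a uniform distribution on [0, 1)."""
    return [rng.random() for _ in range(size)]


def sample(denoise_step: Callable[[Latent, int, Latent, Latent], Latent],
           decoder: Callable[[Latent], Latent],
           z_init: Latent, f_p: Latent, f_r: Latent, num_steps: int) -> Latent:
    """Run the conditioned denoiser from step num_steps - 1 down to 0, then decode."""
    z = z_init
    for j in reversed(range(num_steps)):
        z = denoise_step(z, j, f_p, f_r)
    return decoder(z)


def toy_step(z: Latent, j: int, f_p: Latent, f_r: Latent) -> Latent:
    """Hypothetical update: move halfway towards the sum of the conditions."""
    return [0.5 * v + 0.5 * (p + r) for v, p, r in zip(z, f_p, f_r)]


def toy_decoder(z: Latent) -> Latent:
    """Hypothetical decoder: identity."""
    return list(z)


image = sample(toy_step, toy_decoder, [8.0], [1.0], [1.0], num_steps=3)
```

Both feature sets are passed to every step, which reflects the description that the scene and product features jointly condition the whole denoising trajectory.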

Loss Function.
We employ the mean squared error to construct the loss function used to train the network.

After extracting the foreground mask from two adjacent video frames, we obtain the target object from the previous frame by applying the mask. For the next frame, we obtain the remaining background image by masking out the foreground object. Through this series of operations, we acquire the target object and the scene image, and the original data frames serve as the ground truth of the data pairs. The list of raw video data used to extract the image pairs is shown in Table 1; it encompasses all kinds of scenes and is conducive to improving the generalisation ability of the model.
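The masking steps above can be sketched on toy data. This is a minimal sketch under assumptions: images are small nested lists, the function names are hypothetical, and treating the unmasked next frame as the ground truth is one reading of the description.

```python
from typing import List, Tuple

Image = List[List[float]]   # toy greyscale image: rows of pixel values
Mask = List[List[int]]      # 1 = foreground object, 0 = background


def apply_mask(frame: Image, mask: Mask, keep_foreground: bool) -> Image:
    """Keep either the foreground or the background pixels; zero out the rest."""
    return [
        [px if (m == 1) == keep_foreground else 0.0 for px, m in zip(row, mrow)]
        for row, mrow in zip(frame, mask)
    ]


def build_pair(prev_frame: Image, prev_mask: Mask,
               next_frame: Image, next_mask: Mask) -> Tuple[Image, Image, Image]:
    """Return (target object, scene image, ground truth) for one frame pair."""
    target_object = apply_mask(prev_frame, prev_mask, keep_foreground=True)
    scene = apply_mask(next_frame, next_mask, keep_foreground=False)
    return target_object, scene, next_frame   # assumed ground truth: full next frame


prev_frame = [[5.0, 1.0], [1.0, 1.0]]
prev_mask = [[1, 0], [0, 0]]
next_frame = [[2.0, 2.0], [2.0, 7.0]]
next_mask = [[0, 0], [0, 1]]
target_object, scene, ground_truth = build_pair(prev_frame, prev_mask,
                                                next_frame, next_mask)
```

Because the object moves between frames, the target object and the scene come from different poses of the same item, which is what makes the pair useful for learning "the same object in different scenes".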

Baseline.
To the best of our knowledge, this paper is the first to propose an end-to-end approach to generating an advertisement image for a specific product, so there are no directly comparable algorithms available. We therefore used the following approximate approaches to complete the comparison experiments.
(1) Advertisement image synthesis for target goods using the reference image approach in Midjourney [37]. This approach takes as input a background-less image of the target product and a set of prompt texts, and combines the two inputs to generate an advertisement image with the characteristics of the target product.
(2) Combining the text-to-image and image-to-image modes of the Stable Diffusion model to synthesise an advertisement image for the target product. Specifically, we first use the text-to-image mode in Stable Diffusion to generate an advertisement image. Then, we feed this advertisement image together with the background-less image of the target product into the image-to-image mode and finally synthesise the advertisement image of the target product.
(3) DALL-E 3, a powerful image synthesis model that helps generate images with a high degree of consistency and coherence.

Evaluation Metrics.
We observe in Figure 4 that the model proposed in this paper is capable of synthesising complex, realistic images. In general, traditional performance metrics such as the Fréchet Inception Distance (FID) [38] can be used to evaluate the quality of the images generated by a model. However, the numerical results of FID do not always agree with actual human sensory judgement [39]. To better measure the generative capacity of our system, we introduced systematic human evaluations to quantitatively evaluate the model. Three performance metrics are included in the systematic human evaluations: photorealism [40], caption similarity [41], and sample diversity [39].
For photorealism, users are asked to score the advertisement images synthesised by the different methods, and images that look more realistic should receive higher scores. For caption similarity, users score the advertisement images against the corresponding caption prompts, and images that match the caption better are given higher scores.
Similarly, for sample diversity, users are asked to score the diversity of the four synthetic advertisement images generated by each of the different models, with more diverse advertisement images receiving higher scores.
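One simple way to turn such user scores into per-method numbers is to average them for each metric. The sketch below assumes plain averaging, which the text does not specify; the method names and scores are placeholders, not results from the experiments.

```python
from statistics import mean
from typing import Dict, List

Ratings = Dict[str, Dict[str, List[float]]]   # method -> metric -> user scores


def summarise(ratings: Ratings) -> Dict[str, Dict[str, float]]:
    """Average the user scores for every (method, metric) pair."""
    return {
        method: {metric: mean(scores) for metric, scores in metrics.items()}
        for method, metrics in ratings.items()
    }


def best_method(summary: Dict[str, Dict[str, float]], metric: str) -> str:
    """Name of the method with the highest mean score on one metric."""
    return max(summary, key=lambda method: summary[method][metric])


# Placeholder scores for illustration only.
toy_ratings: Ratings = {
    "method_a": {"photorealism": [3.0, 4.0, 5.0], "caption_similarity": [2.0, 4.0]},
    "method_b": {"photorealism": [2.0, 2.0, 5.0], "caption_similarity": [4.0, 5.0]},
}
summary = summarise(toy_ratings)
```

Keeping the three metrics separate, rather than collapsing them into a single score, makes it visible when a method is realistic but poorly matched to its caption.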

Experiment Data.
The ASNet model proposed in this paper generates the corresponding advertisement images for a given product image. In the process of generating advertisement images, ASNet first generates pseudo-advertisement images using the prompt text corresponding to the target product. After that, we input the target product image without background to correct the information and finally synthesise an advertisement image that is consistent with the target product.
To demonstrate more intuitively the practical effect of the algorithm proposed in this paper, we randomly selected background-free images of four typical commercial products (shown in Figure 5) and designed corresponding prompt texts for them as the basic input data for the experiment (shown in Table 2).

Quantitative Analysis.
The main goal of our work is to synthesise advertising images of the target products end to end. To verify the validity of the work in this paper, we tested advertisement image synthesis on four different target products.
Table 3 shows the results of the systematic human evaluations, in which the scores for photorealism, caption similarity, and sample diversity obtained by the ASNet proposed in this paper are higher than those of the other algorithms.

Qualitative Analysis.
Figure 4 shows a visual comparison of our method with the baseline methods. From the visual analysis of the experimental results, the advertisement images synthesised by each method are clear, reasonable, and aesthetically pleasing.
However, when we compare the target product image with the synthesised advertisement images one by one, we can clearly see that the images synthesised by Stable Diffusion and Midjourney are not consistent with the shape and texture details of the target product image. After careful comparison and summary, we find that the advertisement images synthesised by the algorithm proposed in this paper are both consistent-looking and consistent-detail.
In terms of consistent-looking, we can clearly observe in the first row of Figure 4 that the Converse sneaker advertisement synthesised by the proposed method is basically consistent with the target product image in both product appearance and colour texture. By contrast, in the advertisement image synthesised by Stable Diffusion, although the colour of the synthesised sneakers is similar to that of the target product image, their appearance is very different from that of the target product image. Further, we can see that the Midjourney image, although similar to the target product image in terms of shape, is markedly different in terms of colour and the original placement of the sneakers.

Dataset          Type    Samples   Quality
BURST [31]       Video   1493      Low
MOSE [32]        Video   1507      High
VIPSeg [33]      Video   3110      High
UVO [34]         Video   10337     Low
YouTubeVOS [35]  Video   4453      Low
YouTubeViS [36]  Video   2283      Low
"Type" refers to the original type of data; "Samples" refers to the number of data items of that type; "Quality" refers specifically to the image resolution.
Similarly, consider the fourth row in Figure 4. Here, we need to generate an advertisement image from a target product image of Apple mobile phones. The original target product image contains two overlapping mobile phones, one showing its back and the other its front. Our proposed algorithm synthesises a mobile phone advertisement image that has a high degree of similarity in appearance and a more consistent product pose with respect to the target product image. In contrast, the mobile phone advertisement images synthesised by the other two algorithms could not maintain a consistent appearance with the target product image, and in some cases the generated advertisement images were completely inconsistent with it. These two sets of comparisons demonstrate the superior performance of the algorithm proposed in this paper in terms of appearance consistency.
For consistent-detail, observing the second and third rows in Figure 4 shows that the advertisement images generated by the algorithm proposed in this paper present product details better. For example, comparing the target product image of the Chanel perfume in the second row with the synthetic advertisement images, our proposed algorithm is able to effectively maintain the consistency of the trademark information, while the advertisement images generated by the other two algorithms either lose the trademark information or generate unrelated trademark information. The Coca-Cola advertisement images in the third row show the same problem: our proposed algorithm synthesised an advertisement image that keeps the trademark information consistent, but the other two algorithms synthesised advertisement images that partially misrepresent the Coca-Cola logo.

Robustness and Generalisation Experiments.
To verify the robustness and generalisation ability of ASNet, we choose unconventional product categories and low-quality product images as inputs to the model. As shown in the first row of Figure 6, we choose the universal charger for mobile phone batteries, a product that is almost nonexistent now, as the research object. Around 2000, mobile phone batteries were still removable, so universal chargers were widely used. However, as mobile phones became more integrated, batteries stopped being removable, so it is unlikely that universal chargers appear in the recent datasets used for training. Nevertheless, Figure 6 shows that this type of unconventional product does not affect the performance of ASNet, and our model still preserves details well.
Unfortunately, the second row of Figure 6 shows that if we choose a low-quality product image as the input to the model, the resulting advertisement image is very disappointing and does not even have any practical meaning. The reason is that low-quality product images carry extremely limited feature information, and the model cannot understand these features. Therefore, the generated advertisement image shown in the second row of Figure 6 is similar to the target product image in some features but is inconsistent with it overall.

User Purchase Intention Study
To further clarify the impact of ASNet-generated advertising images on real consumers, we measure their effect on consumers' purchase intention and brand perception through a simple questionnaire.
We conducted the experiment through a street questionnaire. A total of 100 volunteers were recruited for this experiment, and their participation was voluntary. Each participant was shown, in randomly shuffled order, the advertisement images generated by the different models, along with the original image of the target product. After viewing the advertisement images, participants were asked to rate their willingness to buy and their brand perception on a scale of 1-10 (the higher the value, the stronger their desire to buy or the better their perception of the brand).

During the questionnaire survey, we collected basic personal information from the subjects, including gender, age, and education level. After eliminating 29 invalid questionnaires, 71 valid questionnaires remained, and the specific information of these 71 people is shown in Table 4.
The impact of the advertisement images generated by the various models on consumers' purchase intention and brand perception was explored through the questionnaire survey. As shown in Table 5, the advertisement images generated by our proposed ASNet are more likely to have a positive impact on consumers' purchase intention and brand perception.
Combined with the results in Table 3, we can reasonably speculate that this result is due to the fact that the advertisement images generated by ASNet maintain the structural and detailed integrity of the reference target object very well, so they are more realistic.

Limitations and Future Work
The ASNet proposed in this paper uses as its base model the SD model, which is based on Markov chains for the forward and reverse diffusion processes. It can recover the real data more accurately and is better able to maintain image details, so it can generate realistic and attractive advertisement images. However, it also has certain defects and limitations. For example, when the quality of the input target object image is low, ASNet cannot maintain the consistent-looking of the target object well, because ASNet inherits the characteristics of the SD model and will fill in the unknown parts when it cannot accurately identify the detailed features of the target object. Future work should consider how to solve this problem and improve the generalisation ability and robustness of the model. In addition, although ASNet can automatically generate advertisement images end to end, it still needs professionals to write prompts for the synthesis scenes according to the product characteristics. In future work, we can consider converting product descriptions into prompt words automatically through text models, which can further improve the degree of automation of ASNet.

Conclusions
In this paper, we present a new Advertisement Synthesis Network model for advertisement image synthesis of targeted products. To the best of our knowledge, this is the first end-to-end automatic ad image synthesis model that can transform a simple target product image into a well-designed and aesthetically pleasing product advertising image through a two-stage generation approach. The Advertisement Synthesis Network is likely to dramatically reduce the cost of advertisement design and revolutionise the advertisement design industry. At the same time, the two-stage generation solution used in this paper can provide a generic solution idea for similar tasks.

Figure 1: Synthesis of advertising images of watermelon and cherries from Stable Diffusion models.

Figure 3: The pipeline of the proposed ASNet. ASNet requires a background-free target object image and its corresponding prompt text as inputs. We first employ the Stable Diffusion model as a Pre-Synthesis to generate a preliminary target object scene (pseudo-advertising image). Then, we feed the pseudo-advertising image and the target product without background into the Target Object Encoder for encoding. Finally, we feed the encoded features into the pre-trained diffusion model for the synthesis of the real advertising image.

Figure 4: Comparison between the target product images and the synthetic advertising images. The left side of the red dotted line shows a sample of the target product image used for testing. The right side of the red dotted line shows the synthetic advertisement images generated by the three different methods.

Figure 6: ASNet model generalisation ability and robustness test. We use unconventional product categories and low-quality product images as input to synthesise advertisement images.

Network Model Parameter Setting and Evaluation Metrics

Implementation Details and Hyperparameters.
The models covered in this paper were implemented using the PyTorch framework, and the models were trained and tested using four GeForce RTX 3090 Ti GPUs. During training, we resized the images to a resolution of 512 × 512. We set the initial learning rate to 1e-5 and used Adam as the optimiser.
ASNet must keep the target product consistent-looking, so the ideal training data for ASNet are image pairs of "the same object in different scenes," but these image pairs cannot be directly constructed from existing datasets. To solve this problem, a video dataset is generally used to capture different frames containing the same object. In detail, we select two adjacent frames from a video and extract the mask of the foreground object.

Table 1: Details of the dataset used for training.

Table 2: The prompt text for the product image.
perfume liquid sank into the sea, surrounded by bubbles. There's too much foam. The soft light reflected through the water. Large water ripple network makes the picture beautiful, high resolution, fine detail, front view, C4D rendering

Table 3: The results of the comparisons with other models. This table shows the performance metrics comparing the advertising image synthesis results of the proposed ASNet in this paper and other advertising image synthesis algorithms.

Table 4: Descriptive statistics of the study sample.

Table 5: Numerical results of the impact of synthetic advertising images on real consumers.