Medical Image Description Based on Multimodal Auxiliary Signals and Transformer



Introduction
In recent years, with the development of interdisciplinary fusion techniques, the automatic generation of image descriptions using deep learning has become a popular research topic in the field of computer vision [1][2][3]. Image description combines image recognition and text generation within machine learning. Mainstream tasks include news image description generation [4], visual question answering [5], visual inference, remote sensing image description [6], and automatic text-to-image generation. These techniques have great value in many practical application scenarios: in the medical field, they can take on part of a doctor's work in writing medical reports [7]; in early childhood education, they can play the role of a lecturer [8]; they can also help visually impaired people perceive the visual content of their surroundings.
Currently, as society progresses, people's pursuit of a healthy life has led to a dramatic increase in the number of medical images, which in turn has increased the workload of imaging physicians. The excessive pressure and the shortage of experienced imaging doctors make patients wait longer for medical reports. To address this problem, we study the automatic generation of radiology reports. Generally, a diagnostic report written by a physician is a textual presentation of the findings in the patient's medical image, as shown in Figure 1. The generated medical report should be accurate, complete, and readable.
In recent years, with the continuous development of deep learning technology, techniques for the medical image description task have continued to mature. Most models adopt the encoder-decoder framework [9]. The encoder consists of a convolutional neural network (CNN) that extracts image features, and the decoder was initially composed of a recurrent neural network (RNN) that generates the text description. To improve this network architecture, researchers applied the transformer [10] to the medical image description task. Compared with existing image description methods, the transformer model lets multimodal data interact in the multihead attention module; thus, it can reason better over multimodal data and finally output a more accurate description.
There are two key problems with current deep learning-based medical image description. On the one hand, despite the emergence of public medical image description datasets, such as IU X-ray and MIMIC-CXR [11], medical image description datasets are still scarce. In addition, these datasets may suffer from data bias. In terms of visual bias, the distribution of medical image data is extremely unbalanced, which leads to the low sensitivity of existing deep learning models to rare disease conditions [12]. In terms of text bias, as shown in Figure 1, the medical report describes all symptoms in the image, so the generated reports are mostly occupied by descriptions of healthy regions. Moreover, for normal regions, similar symptom statements are always repeated across the reports in the dataset. This unbalanced text distribution makes normal statements overly repetitive and makes it difficult for the model to generate descriptions of abnormalities. Some researchers have proposed using transfer learning [13] to address the data bias problem. In 2021, Wu et al. [14] explored the role of text-label auxiliary information in transformer-based medical image description generation, and their experiments showed that multimodal data interaction can improve the accuracy of the model. On the other hand, although current state-of-the-art radiology report generation models show great improvements in performance metrics, the generated medical reports still do not meet market application standards and can be inconsistent with real reports.
To address the above issues, this manuscript explores the impact of multimodal auxiliary data on the performance of the transformer model. A multimodal data-assisted knowledge fusion framework (MDAKF) is proposed to mimic the working mode of a radiologist. Like an experienced radiologist, it pays more attention to areas with a high probability of disease, then analyzes the overall image to label abnormal areas, and finally writes the corresponding report accurately. MDAKF introduces two modules, namely, the multimodal data-assisted knowledge module (MDAK) and the multimodal data fusion module (MDF). MDAK mitigates visual data bias by extracting abnormal regions from the input images. MDF enhances the information exchange between different modalities [15] so that the text and audio [16] information complement each other. Thus, our MDAKF model can generate medical report descriptions that are more accurate and more sensitive to rare abnormal findings.
In summary, the main contributions of this manuscript are as follows: (1) To alleviate the data bias problem, we propose a multimodal data-assisted knowledge fusion network, including multimodal data-assisted knowledge (MDAK) and multimodal data fusion (MDF). (2) To further strengthen the audio-assisted knowledge in the Transformer model, we introduce several multimodal feature fusion methods that combine text and audio features, so that the two modalities complement each other and jointly promote model performance. (3) Extensive experiments were performed on the IU X-ray and COV-CTR datasets, and the analysis of various performance indicators confirms the effectiveness of the proposed method, which outperforms previous models.
The rest of this manuscript is organized as follows: Section 2 describes related work and methods for generating medical image descriptions. Section 3 describes the overall architecture of the MDAKF network. This is followed by experimental results (Section 4) and our conclusions (Section 5).

Related Works
This manuscript introduces related work in three aspects: transformer-based image description, automatic medical image report generation, and multimodal feature fusion.

Transformer-Based Image Description.
The task of image description generation has received extensive attention from many researchers. Most current approaches for the automatic generation of image text descriptions use an end-to-end encoding-decoding structure. The encoder typically uses a convolutional neural network (CNN) to perform feature extraction on the input image. The decoder uses a recurrent neural network (RNN) or its variant LSTM to convert the image features from the encoding side into text descriptions. Some researchers introduced the Transformer encoder-decoder architecture [17][18][19] in place of this network architecture; experiments have shown that the Transformer model can improve various performance indicators in medical image description tasks. The grid features extracted in the Transformer are also called feature maps; their advantage is that they cover the entire image and capture the details of the target [20]. However, the semantic level of such grid features is usually relatively low. Ji et al. [21] suggested utilizing a Faster R-CNN network to extract features of the target region. They then incorporated global information into the transformer encoders to combine the benefits of both top-down and bottom-up image description generation schemes.
M2Transformer [22] utilizes an encoder-decoder architecture to transform images into descriptive text, where the decoders are mesh-connected to the encoders and the input of each decoder is a weighted combination of the results of all encoders, facilitating the capture of more detailed features. At the same time, memory slots are used in the attention mechanism of the encoding phase to provide a priori knowledge for the subsequent process. The dual-level [17] Transformer aligns region and grid features. It utilizes the comprehensive relation attention (CRA) module to obtain self-attention information for regions and grids by attending to absolute and relative position information. Subsequently, the self-attention information is aligned by the locality-constrained cross attention (LCCA) module. Finally, the input features undergo encoding and decoding to generate a textual description.

Automatic Generation of Medical Image Reports.
Current medical image description tasks are still dominated by the encoder-decoder framework, which translates images into descriptive sentences. This framework has been very successful in driving the technology forward. However, medical image description differs considerably from natural image description in that it is more specialized and diverse, and describing specific regions in radiological images imposes higher accuracy requirements. Park et al. [23] proposed generating medical image reports with the encoder-decoder framework. A convolutional neural network (CNN) was used as the encoder to extract image features. The decoder was composed of recurrent neural networks (RNNs) that perform well in language generation tasks [24], such as the long short-term memory network [25] (LSTM) and the gated recurrent unit [26] (GRU). Later, to increase the variability of the generated text, Chen et al. [27] proposed a dual LSTM to generate normal and abnormal report information separately. Recently, some researchers have used the Transformer for both the encoder and decoder [22] as the network architecture for image description generation.
To make the generated radiology reports more accurate and convincing, some researchers started using deep learning to simulate the process that doctors go through when writing medical reports. For example, when radiologists observe medical images, they combine knowledge and work experience in the medical field to write a complete diagnostic report that reflects the medical images. Zhang et al. [28] found that auxiliary signals help in the generation of image descriptions. Liu et al. [29] proposed adding a text-visual dual attention mechanism to the CNN-LSTM structure to make the generated text more complete. It was shown that the introduction of auxiliary signals positively affected the generation of various types of image descriptions.
When processing NLP tasks, previously generated text is usually crucial for subsequent steps. LSTM, RNN, GRU, and other deep learning models can record previous textual information, but they have specific storage mechanisms, and the capacity of their storage modules is limited, which may constrain the effectiveness of the decoder. Song et al. [30, 31] proposed a new storage component, relational memory (RM), to save feature information and give the model a larger storage capacity. The experimental results show that improving the storage module in the transformer can indeed improve overall performance.

Multimodal Feature Fusion.
A modality is a form in which information is expressed in the computing field, and multimodality refers to the combination of multiple modalities. Multimodal feature fusion is the focus of current research on multimodal information processing [32]. Different modalities are not expressed in exactly the same way, so fusing them directly may introduce information redundancy. A good multimodal feature fusion algorithm can make the feature information richer [33].
Multimodal feature fusion techniques are currently used in various computer vision fields, such as image description generation and image segmentation. In the early stages of research, commonly used multimodal fusion methods mainly included the element-wise product, the element-wise sum, or simple concatenation between different types of features; these are simple but lack in-depth analysis [34]. In 2018, Wu and Han [35] proposed a new multimodal fusion method, multimodal circulant fusion. This fusion can take full advantage of the interactions between multimodal feature elements and further improve performance. Specifically, after reshaping the visual or text vectors into circulant matrices, they defined two interaction operations between the original feature vectors and the reshaped circulant matrices. Finally, they used element-by-element sums to obtain a joint representation of the two cross-fused vectors. As each row of the circulant matrix is shifted by one element, the newly defined interaction operations explore almost all possible interactions between the different modal vectors. Recently, Yang et al. [36] introduced a new transformer-based architecture that uses "fusion bottlenecks" for multilayer modal fusion through a small number of bottleneck tokens, so that information between different modalities is processed by the model and only the necessary information is shared. The results showed that this algorithm can not only reduce the computational cost but also improve the fusion performance.
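As a rough, simplified illustration of the circulant interaction idea described above (our own Python sketch, not the authors' multimodal circulant fusion implementation; the averaging and the particular interaction operation are assumptions):

import torch

def circulant(v: torch.Tensor) -> torch.Tensor:
    # Build a circulant matrix: each row is the vector rolled by one more element.
    return torch.stack([torch.roll(v, shifts=i, dims=0) for i in range(v.numel())])

def circulant_fusion(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Let each modality's circulant matrix interact with the other modality's vector,
    # then sum the two cross-fused vectors element by element.
    fused_ab = circulant(a) @ b / a.numel()
    fused_ba = circulant(b) @ a / b.numel()
    return fused_ab + fused_ba

visual, text = torch.randn(512), torch.randn(512)
joint = circulant_fusion(visual, text)  # joint representation, shape (512,)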

The Proposed Method
Automatic medical image report generation is a cross-modal task combining visual and textual information and is essentially an image-to-text generation task. We take radiological images as input to the model and transform them into the corresponding source sequence X = (x_1, x_2, ..., x_n), x_n ∈ R^d, where x_n is a visual feature of the medical image extracted by the visual extractor and d is the dimension of the feature vector. The corresponding report is the target sequence Y = (y_1, y_2, ..., y_t), y_t ∈ V, where y_t is the generated token, t is the length of the generated sequence, and V is the vocabulary of all possible tokens. An overview of our proposed model is shown in Figure 2, and its details are illustrated in the following subsections.
Model Structure.
Our model can be divided into four main parts: the visual feature extraction module, the multimodal auxiliary signal feature extraction and fusion module, the Transformer encoder, and the Transformer decoder. Our innovation mainly lies in how audio and text features are used to help generate more accurate reports from medical images. The four components and the training objective of the task are described in detail below.

Visual Feature Extraction.
The visual extraction module takes the input medical image and extracts the visual feature sequence X_n with a pretrained convolutional neural network (CNN), such as VGG or ResNet, and the encoded result is used as the source sequence for all subsequent modules. The process is formulated as

{x_1, x_2, ..., x_n} = f_v(Img), (1)

where f_v(·) denotes the visual extractor. The audio-text auxiliary feature sequence is obtained analogously as

RT_s = f_e(rt_1, rt_2, ..., rt_s). (3)

The outputs of the encoder are the hidden states I and V encoded from the input features X_n and RT_s. Finally, since the hidden feature sequences I and V are aligned, they are directly summed to obtain I' = LayerNorm(I + V), where LayerNorm denotes layer normalization.
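For concreteness, a minimal sketch of such a visual extractor is given below (our own illustrative PyTorch code, not the exact implementation; the class name VisualExtractor is an assumption, while ResNet-101, the 7 x 7 x 2048 grid features, and the 512-dimensional projection follow the parameter settings reported later in this manuscript):

import torch
import torch.nn as nn
from torchvision import models

class VisualExtractor(nn.Module):
    """Illustrative visual extractor: pretrained CNN backbone + linear projection."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Drop the average-pooling and classification layers, keep the convolutional trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)  # project 2048-d grid features to 512-d

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.trunk(img)                  # (B, 2048, 7, 7) for a 224 x 224 input
        feat = feat.flatten(2).transpose(1, 2)  # (B, 49, 2048) grid-feature sequence
        return self.proj(feat)                  # (B, 49, 512) source sequence X_n

x = VisualExtractor()(torch.randn(1, 3, 224, 224))  # x.shape == (1, 49, 512)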

Transformer Decoder.
The backbone decoder uses a transformer variant architecture (R2Gen) containing RM [30] storage components, which replaces the layer normalization of each decoder layer with MCLN [30]. The transformer decoder is likewise stacked from a multihead attention mechanism and a feedforward neural network. The hidden states output from the encoding side and the tokens generated at the previous time steps are fed to the decoder to finally obtain the target token y_t:

y_t = f_d(I', y_1, y_2, ..., y_{t-1}),

where f_d(·) denotes the decoder. Text tags alone are less practical as auxiliary signals in a real system, whereas speech is a mainstream and more convenient mode of human-computer interaction. Therefore, we choose audio as the auxiliary signal. We fuse the audio with text features to give it richer semantic information, so that it can be better aligned with the visual features of the annotated medical images.
The audio-assisted data are abnormal keywords that occur with high frequency in the image training corpus, such as "emphysema," "pneumonia," "cardiomegaly," "pneumothorax," and "lesion," which carry the attributes of abnormal content categories and abnormal regions in the medical images. Finally, the text data are converted into audio files by machine speech synthesis.
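As an illustration of this selection step, the sketch below (our own simplified Python example; the candidate keyword list and corpus format are assumptions, not taken from the datasets) counts how often candidate abnormal terms occur in the training reports and keeps the most frequent ones for speech synthesis:

from collections import Counter

# Hypothetical candidate vocabulary of abnormal findings.
ABNORMAL_TERMS = ["emphysema", "pneumonia", "cardiomegaly", "pneumothorax", "lesion", "effusion"]

def select_audio_keywords(reports, top_k=5):
    """Count abnormal-term frequencies over the training reports and keep the top_k terms."""
    counts = Counter()
    for report in reports:
        tokens = report.lower().split()
        for term in ABNORMAL_TERMS:
            counts[term] += tokens.count(term)
    return [term for term, _ in counts.most_common(top_k)]

# Toy usage; the selected keywords would then be passed to a text-to-speech
# engine to produce the audio auxiliary files.
reports = ["heart size is enlarged consistent with cardiomegaly",
           "no pneumothorax or pleural effusion"]
print(select_audio_keywords(reports))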
The transformer encoder mainly contains the multihead attention mechanism (MHA) and the feedforward neural network (FFN). The MHA consists of n parallel attention heads:

MHA(q, k, v) = Concat(head_1, ..., head_n) W^O,
head_i = Attention(q W_i^Q, k W_i^K, v W_i^V),

where q denotes the extracted visual features X_n, and k and v denote the audio-text fusion features RT_s. These sequences are fed into the transformer encoder to obtain the hidden sequence V, which focuses on the visual and audio-text fusion features:

V = MHA(X_n, RT_s, RT_s).

Then, we take V as q and the visual features as k and v, and the encoder obtains the visual features I with aligned auxiliary information:

I = MHA(V, X_n, X_n).

However, audio signals with similar pronunciations lack recognizability, which may lead to incorrect mapping of audio labels to medical image regions. Therefore, we further fuse the audio with text features as the auxiliary signal to improve the overall performance of the model.
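The two cross-attention steps described above can be sketched as follows (an illustrative PyTorch fragment under our own naming and sequence-length assumptions, not the exact implementation): the visual features first attend to the audio-text fusion features, and the result is then used as the query against the visual features to obtain auxiliary-aligned visual states:

import torch
import torch.nn as nn

d_model, n_heads = 512, 8
mha_av = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # visual -> audio-text fusion
mha_va = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # fused -> visual
norm = nn.LayerNorm(d_model)

X_n = torch.randn(1, 49, d_model)   # visual feature sequence from the CNN extractor
RT_s = torch.randn(1, 10, d_model)  # audio-text fusion feature sequence (length assumed)

# V: hidden sequence focusing on the audio-text fusion features (q = X_n, k = v = RT_s).
V, _ = mha_av(query=X_n, key=RT_s, value=RT_s)
# I: visual features with aligned auxiliary information (q = V, k = v = X_n).
I, _ = mha_va(query=V, key=X_n, value=X_n)
# I': aligned hidden sequences summed and layer-normalized before decoding.
I_prime = norm(I + V)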

Multimodal Data Feature Fusion.
In this manuscript, we fuse the audio and text representations in the feature fusion module to obtain a fused multimodal representation instead of the original single audio representation. The fused representation is semantically more discriminative and bridges the heterogeneous gap between the modalities.

Four feature fusion schemes are used for the multimodal fusion of the two auxiliary feature vectors: addition (Add), concatenation (Concat), Hadamard product (Mul), and attention selection fusion (ATT). The fusion methods are shown in Figure 3.
Add is the point-by-point summation of R_s and T_s:

RT_s = R_s + T_s.

Concat is the feature concatenation of R_s and T_s:

RT_s = Concat(R_s, T_s).

Mul is the Hadamard product of R_s and T_s:

RT_s = R_s ⊙ T_s.

ATT refers to constructing two attention mechanisms for the audio and text representations separately and then performing a point-by-point summation of the attention-selected features:

A_R = sigmoid(L1(R_s)),
A_T = sigmoid(L2(T_s)),
RT_s = A_R + A_T,

where L1 and L2 denote the two fully connected layers, and A_R and A_T denote the features acquired after passing through the attention mechanism. Then, by letting the multimodal auxiliary signal interact with the multihead attention mechanism in the encoder, the trained model can generate medical reports that pay more attention to the regions indicated by the auxiliary signals. This alleviates the imbalanced data distribution within medical datasets.
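The four fusion schemes can be sketched as follows (an illustrative PyTorch fragment based on our reading of the formulas above; the linear layer sizes and the projection back to d dimensions after concatenation are assumptions):

import torch
import torch.nn as nn

d = 512
L1, L2 = nn.Linear(d, d), nn.Linear(d, d)  # fully connected layers used by ATT
proj_cat = nn.Linear(2 * d, d)             # assumed projection back to d dims after Concat

def fuse(R_s, T_s, mode="add"):
    """Fuse audio features R_s and text features T_s into RT_s."""
    if mode == "add":                       # point-by-point summation
        return R_s + T_s
    if mode == "cat":                       # concatenation (projected back to d dims)
        return proj_cat(torch.cat([R_s, T_s], dim=-1))
    if mode == "mul":                       # Hadamard product
        return R_s * T_s
    if mode == "att":                       # attention selection fusion
        A_R = torch.sigmoid(L1(R_s))
        A_T = torch.sigmoid(L2(T_s))
        return A_R + A_T
    raise ValueError(mode)

R_s, T_s = torch.randn(1, 10, d), torch.randn(1, 10, d)
RT_s = fuse(R_s, T_s, mode="add")           # the variant reported later to work best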

Experimental Setup
In this section, we describe in detail the two public datasets, the widely used metrics, and the experimental settings. Then, we evaluate and analyze the proposed approach.
IU-X-ray is a widely used benchmark dataset for evaluating the performance of radiology report generation methods. It contains 7470 chest X-ray images associated with 3955 radiology reports. We randomly split the dataset into training, validation, and test sets in a 7 : 1 : 2 ratio, with no overlap of patients between the sets. The COV-CTR dataset contains lung CT images and their corresponding diagnostic reports in Chinese; the lung CT images were collected during the COVID-19 outbreak, and Li et al. [41] provided the corresponding diagnostic reports to construct the COV-CTR dataset. It includes a total of 728 images, of which 349 are COVID-19 and 379 are non-COVID-19. For a fair comparison, we randomly divided the data into training, validation, and test sets in the ratio of 8 : 1 : 1.

Performance Metrics.
Performance evaluation refers to judging the quality of the descriptions generated by the image description model. Typically, medical image description experiments use an automated rule-based evaluation method [43]. This methodology entails collecting in advance a predetermined number of reference descriptions authored by humans for the given image. The similarity between the description produced by the model and the reference descriptions is evaluated by keyword matching, which serves as a means of assessing the efficacy of the model. The mainstream metrics include ROUGE, BLEU, CIDEr, and METEOR.
BLEU measures the similarity between statements by calculating the degree of overlap of the n-grams in the generated report and the target report. METEOR calculates the similarity between candidate and reference texts based on word-level precision and recall, as well as penalties for word order. METEOR handles word matching and word order flexibly, so it reflects human evaluation of text quality more closely than BLEU. ROUGE-L calculates the longest common subsequence of a sentence. CIDEr calculates the cosine similarity between the real description and the model-generated description to measure the effect of the image description.
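As a small illustration of n-gram overlap scoring, the sketch below (our own example using NLTK, not the evaluation code of this work) computes a smoothed sentence-level BLEU-4 score between a generated report and a reference report:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the lungs are clear of focal airspace disease pneumothorax or pleural effusion".split()
candidate = "the lungs are clear no pneumothorax or pleural effusion".split()

# BLEU-4: equal weights over 1- to 4-grams, with smoothing for short reports.
score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4 = {score:.3f}")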

Parameter Settings.
The visual extractors are pretrained on ImageNet, and the extracted features are 7 * 7 feature maps with 2048 channels, which are further projected into 512-dimensional feature maps. The same hyperparameters are used for training on both datasets. Specifically, the learning rates of the visual extractor and the other parameters are set to 5e-5 and 1e-4, respectively, and the batch size is 4. In addition, the number of heads and the dimension of the multihead attention are 8 and 512, respectively.
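These settings translate into two optimizer parameter groups, as in the following sketch (illustrative PyTorch code under our own assumptions; ToyReportModel and its attribute names are stand-ins, not the actual MDAKF implementation):

import torch
import torch.nn as nn

class ToyReportModel(nn.Module):
    """Stand-in model with a visual extractor and a Transformer body."""
    def __init__(self):
        super().__init__()
        self.visual_extractor = nn.Linear(2048, 512)  # placeholder for the CNN backbone
        self.transformer = nn.Transformer(d_model=512, nhead=8, batch_first=True)

model = ToyReportModel()

# Two learning rates: a smaller one for the pretrained visual extractor,
# a larger one for the rest of the model (encoder, decoder, fusion modules).
visual_params = list(model.visual_extractor.parameters())
visual_ids = {id(p) for p in visual_params}
other_params = [p for p in model.parameters() if id(p) not in visual_ids]

optimizer = torch.optim.Adam([
    {"params": visual_params, "lr": 5e-5},
    {"params": other_params, "lr": 1e-4},
])
# Training then uses a batch size of 4 with 8 attention heads of total dimension 512.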

Model Performance Comparison.
We compare our approach with a series of state-of-the-art radiology report generation models (Transformer [44], M2Transformer [22], CoAtt [7], HGRG-Agent [45], KERP [46], PPKED [14], SAT [47], AdaAtt [48], R2Gen [30], and ASGMD [49]). For the IU X-ray dataset, the indicator data of the R2Gen, PPKED, Transformer, MDAK (ours), and MDAKF (ours) models are obtained through our experiments, while the indicator data of the other models are taken from the original papers. For the COV-CTR dataset, the indicator data of R2Gen, MDAK (ours), and MDAKF (ours) are obtained through our experiments, while the remaining data are taken from the original paper of the ASGK [41] model. As shown in Table 1, MDAKF outperforms state-of-the-art methods in some of the metrics on both the COV-CTR and IU X-ray datasets, which demonstrates the effectiveness and accuracy of incorporating audio as an auxiliary signal in medical image description. Specifically, on the IU X-ray dataset, the MDAK method increases the CIDEr metric, which is specifically designed to assess the quality of generated reports, from 0.398 to 0.424. On the COV-CTR dataset, the MDAK and MDAKF methods reach 1.452 and 1.243 on the CIDEr metric, respectively.

Ablation Experiments.
In this section, a quantitative analysis is performed to investigate the contribution of each component in MDAKF. The experimental results of adding MDAK and MDAKF are shown in Table 2.
From Table 2, it can be seen that the MDAK model with audio-assisted signals outperforms the base model (R2Gen) and significantly improves the quality of report generation, fully verifying the effectiveness of the MDAK module. The automatic report generation indicators of MDAK show clear improvement, especially METEOR and CIDEr: the METEOR score increases from 0.187 to 0.201, and the CIDEr score increases from 0.398 to 0.424. The indicators also increase on the COV-CTR dataset; for example, the BLEU4 score increases from 0.528 to 0.539, and the ROUGE-L score increases from 0.677 to 0.683. The visual features of the medical images significantly affect the natural language decoder. By simulating the working mode of radiologists through audio-assisted signals, the visual encoding process of the model focuses more on the image areas aligned with the audio-assisted signal, ultimately providing richer visual features. These experiments indicate that focusing on the abnormal areas specified by the audio tags can improve the quality of medical report generation.
To improve the effectiveness of the auxiliary signals, we add a multimodal auxiliary signal fusion module (MDAKF), which combines the semantic information of the text and audio cross-modal data. To verify its effectiveness, we compare the four feature fusion schemes. The symbol in parentheses in MDAKF (add, ATT, cat, and mul) indicates the feature fusion scheme, detailed in the Multimodal Data Feature Fusion subsection. From the experimental results in Table 2, it can be seen that adding the MDAKF module continues to improve the overall performance of the model, with gains on various indicators. Among them, MDAKF (add) has the best overall performance, so we chose it as the final MDAKF model. In Table 2, the performance indicators of the MDAKF (add) model are improved: the BLEU1 score increases from 0.470 to 0.494, and the ROUGE-L score increases from 0.371 to 0.389. It can also be seen that multimodal feature fusion has a significant impact on model performance; designing a more reasonable multimodal fusion scheme is therefore a key research direction for the future. Finally, the experimental results indicate that MDAKF can provide better semantic feature guidance for the model, further improving its performance in generating medical reports. The performance of different visual feature extraction networks is shown in Table 3. We can see that the best performance of the MDAKF model is achieved by using the ResNet-101 network to extract visual features.

Visualization Experiments.
To further verify the validity of our model, we select some medical images from the IU-X-ray and COV-CTR medical image report generation datasets for qualitative analysis. These medical images are fed into different models to generate reports. As shown in Figure 4, the first three generated report instances are selected from IU-X-ray, and the fourth report is selected from COV-CTR. We can observe the medical reports generated by the MDAKF and R2Gen models.
In the reports, we can see that the generated reports all follow a consistent pattern, reporting the main findings first (e.g., "cardiac silhouette" and "lung volume"), followed by underlying disease (e.g., "pleural effusion" and "nontenderness"). In addition, MDAKF covers almost all of the essential medical terms of the ground-truth report in its generated reports. We compare the reports generated by R2Gen and MDAKF with the Ground_Truth reports and use red and blue to mark the overlap between them and the actual report. As shown in Figure 4, our proposed network model outperforms the baseline model in the generated medical reports. The MDAKF module can provide more accurate abnormal visual regions during model training, thereby alleviating the problem of visual data bias. To verify this, we extract multiple chest X-rays from the IU-X-ray and COV-CTR datasets and visualize the image and audio attention maps guided by the MDAKF module. In Figure 5, the first medical image is from the COV-CTR dataset, and the second medical image is from the IU-X-ray dataset.

Conclusion
In this manuscript, a multimodal data-assisted knowledge fusion network is proposed to automatically generate medical reports. The network is based on the R2Gen framework and aims to facilitate the generation of diagnostic reports using multimodal auxiliary information features. We study different types of auxiliary signals so that the automatically generated medical reports focus better on disease regions and alleviate the visual data bias problem. Extensive experiments have demonstrated the effectiveness of our proposed MDAKF network. In the future, we will investigate more efficient multimodal feature fusion methods to enhance the auxiliary signal feature representation.

[Figure panels: Image, Heatmap, and Report columns showing example images, attention heatmaps, and the corresponding report texts.]

Figure 1: Example of a chest X-ray image and its report.



Figure 4: Visualization of medical image description.
On the original model benchmark, the RM structure records and stores frequently occurring words and phrases, driving the model to eventually learn to generate medical reports with more accurate and fluent descriptions. However, this cannot solve the insensitivity of existing automatic medical report generation techniques to abnormal medical images. We improve the existing model by adding auxiliary signals to enhance its overall performance and describe in detail how the auxiliary signals are applied to the transformer network architecture.

Table 1: Performance of different methods on the IU X-ray and COV-CTR datasets. Bold values indicate the best performance on each dataset.

Table 2: Ablation experiments of each module. Bold values indicate the best performance on each dataset.

Table 3: Performance of different feature extraction networks.