Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks



Introduction
Ethiopia is a country located in the eastern part of Africa and is home to over 84 nations and nationalities. The Amharic Language is the official working language of the Federal Democratic Republic of Ethiopia [1,2]. It is spoken by over half of the population and is also spoken in countries such as Eritrea, Canada, the United States, and Sweden [3]. Amharic is written using the Fidel or Abugida script, derived from the ancient Ge'ez language of Ethiopia [1].
The use of artificial intelligence techniques to generate image captions has become widespread in the past few years [4]. This task involves the generation of textual descriptions for images using computer vision and natural language processing techniques. While the accuracy of English Language image captioning models has improved, they still suffer from semantic errors. This study aims to build a deep learning-based Amharic Language image caption model using attention mechanisms so that it can be applied in various Amharic Language software applications such as tools for the visually impaired, editing software, virtual assistants, and image search [5].
The popularization of deep learning approaches has made it possible to solve complex problems successfully. Recent image captioning studies use the encoder-decoder architecture, which was originally used in machine translation [6][7][8]. The encoder-decoder approach submits images to an encoder that converts the visual elements into a fixed-length vector, which is then decoded into a textual description [6]. Pretrained convolutional neural network (CNN) models are used for the encoder, while long short-term memory (LSTM) or gated recurrent unit (GRU) neural networks are commonly used for language generation. However, the encoder-decoder method is limited in its ability to preserve all source information in the fixed-length vector, and the unidirectional LSTM decoder only preserves past information, which leads to poor outcomes for long sequential data [9].
In this study, we aim to address the limitations of previous models [10,11] by incorporating an attention mechanism that focuses on both visual and linguistic features. The attention mechanism allows the model to extract only the relevant information, while also emphasizing high-level semantic features that better describe the image content [12]. Additionally, we employ a bidirectional GRU (Bi-GRU) architecture, which captures input information from both the forward and backward directions. This improves the capture of both visual and textual features, which can narrow the gap between them and lead to semantically richer image captions.
This study proposes a hybridized attention-based approach for Amharic Language captions to address the existing gaps in the area. The approach combines a Bi-GRU with an attention decoder to generate words by focusing on the required information, which leads to semantically correct image captions. The Flickr8k and BNATURE datasets were used to evaluate the performance of the proposed model.
The study is organized as follows: Section 2 covers related works; Section 3 outlines the proposed approach; Section 4 explains the model building; Section 5 specifies the experiments and methodology; Section 6 presents the results and discussion; and finally, the conclusion is presented in Section 7.

Related Works
This section describes the major works on image captioning, which can be divided into three groups: retrieval-based, template-based, and deep neural network (DNN) techniques.

Retrieval-Based Approaches.
The retrieval-based method employs a search procedure to find suitable image descriptions [13]. This method retrieves relevant candidate descriptive phrases from a database and then generates an intelligible caption based on the text descriptions of the retrieved similar image sets [14]. In general, the retrieval-based approach requires extensive data that covers each possible query picture.

Template-Based Approaches.
The template-based methods use specified templates with some empty slots for caption generation [15,16]. The approach proposes a triplet of scene components (object, action, and scene) that fill the template slots for the captions. The triplets provide a general idea of what the image shows so that a caption can be generated. Some authors [17] suggest a quadruplet template that includes (nouns-verbs-scenes-prepositions).
The template-based approaches can generate more grammatically correct descriptions than retrieval-based methods. However, this approach is rigid and cannot generate variable-length image descriptions [15,18]. Hence, the resulting image captions lack naturalness compared with manually written sentences.

DNNs Approaches.
DNNs are a type of artificial neural network (ANN) consisting of multiple hidden layers, allowing for the creation of more complex models capable of higher levels of abstraction. The papers [19,20] discuss the use of ANNs for prediction purposes. These studies focus on training ANNs using meta-heuristic algorithms to improve precision and to determine the neural network (NN) input coefficients. An integrated algorithm was used and compared to other algorithms such as ant colony and invasive weed optimization. The results showed that the proposed algorithm had better convergence of the NN coefficients and reduced the prediction error of the NN. ANNs are a key component of deep learning, forming the foundation for designing and deploying complex DNNs that learn and predict on intricate data. A deep learning-based cyberattack detection and classification technique for intelligent systems (FDL-CADIS) was introduced in [21]. The technique transforms malware binary files into 2-dimensional images and uses a MobileNetv2 model and an ensemble of voting-based classifiers for classification. The experimental analysis showed promising performance in detecting and classifying malware cyberattacks. The image caption model leverages the power of DNNs, which rely on ANNs as a fundamental component, to learn and make predictions on complex data and to generate concise textual descriptions of the visual content.
DNNs dominate previous image captioning techniques compared with template-based and retrieval-based methods. A recent study [19] proposes an innovative approach to image captioning using a DNN architecture. The DNN employs a CNN as an encoder to extract image features, which are then projected into an LSTM model. The authors also introduce a novel decoder neural network language model called the structure-content neural language model (SC-NLM), which generates words using a combination of vector content and structure. The approach enhances the accuracy of image captioning by leveraging the strengths of both the content and the structural information. Researchers have made several enhancements to the image caption model in order to produce semantically accurate and fluent captions. The problem of image caption generation was addressed in [22] using a custom ensemble model consisting of an Inception model and a 2-layer LSTM model. The results were evaluated using bilingual evaluation understudy (BLEU) scores, and the model achieved a BLEU-4 score of 55.8%.
Reference [23] proposed utilizing semantic representation to improve image captioning by incorporating important image elements that are not captured by global feature representations. The authors employed region-based convolutional neural networks (R-CNN) to identify such elements, generating detailed captions that utilized semantic embedding. However, the static representation of semantic elements failed to consider the relevant image features. The paper [24] used an area attention-based encoder-decoder model that associates parts of the image with the words of a description. In another study [25], the authors propose a novel method for generating image captions that combines spatial and channel-wise attention mechanisms over a 3D CNN feature map. Unlike the previous approaches [24], which mainly used features from spatial locations, the authors incorporate features extracted from different channels and multiple layers to focus on essential regions of the image. However, their approach overlooks the relevance of sentence generation for image description.
The study [25] proposes a deep learning model that generates captions using a custom ensemble of LSTM and CNN algorithms. The author employs GRU and bidirectional LSTM for caption generation and uses Global Vectors (GloVe) embedding to generate the word vectors. This model does not include an attention mechanism. The study [26] presents a variant model based on LSTM that was inspired by stimulus-driven and concept-driven attention mechanisms in psychology. The attention mechanism is used to detect image features and to obtain the attention distribution in the images using a Gaussian filter applied to change region impact factors.
A question-answering model is proposed in [27] that uses target relationship detection to answer questions about image content. It incorporates a new attention mechanism and theories related to word vector space to improve image semantic tasks. The model uses question-based attention and converts target semantic information into word vector space to improve its generalization. The authors of [28] proposed a semantic text summarization method for long videos. The researchers used a unidirectional LSTM to generate the final caption. However, a unidirectional LSTM is restricted to past information only, which yields poor outcomes for long sequences.
The aforementioned studies, like most image captioning studies, were done on the English Language. There are a few studies in other languages, for example [29], which studied image captioning in the Bangla Language. The authors used a hybrid encoder-decoder architecture to generate the image caption. They also created their own dataset for the study. This dataset is called Bangla natural language image to text (BNLIT) and contains 8,700 images with a single annotation for each image.
Another study [30] proposed an encoder-decoder model for image captioning in the Bangla Language. The authors used Inception-v3 as a CNN encoder to extract the visual elements. They applied a Bi-GRU to generate the textual description of the images. The authors argue that GRU gives better results than LSTM. They also proposed a new dataset called BNATURE. This dataset was prepared based on the Flickr8k dataset and contains 8,000 images with five Bengali Language captions for each image.
In general, the previous studies [29,30] achieved good BLEU scores for the Bangla Language. However, their approach is limited in the image feature extraction part: it passes the entire feature set directly into the language model without filtering the relevant features that come from the image.

Figure 1 shows the proposed model architecture for image caption generation in the Amharic Language. It comprises four main components that are discussed in detail in the next sections.

Word Embedding.
In NLP, word embeddings are used to convert words or document vocabulary into numerical form. The embeddings are applied to clean caption data obtained from image descriptions in the training datasets. This involves transforming the sentences into word tokens. In this context, an input sentence containing $T$ words is represented by the tokens $x_1, x_2, \ldots, x_T$. Each word in the sentence, $x_i$, is transformed into a feature vector, $e_i$, through a matrix-vector product [31].
The weight matrix for word embedding, denoted $W^{wrd}$, belongs to $\mathbb{R}^{d^w \times |V|}$, where $d^w$ is the size of the word embedding (a hyperparameter) and $|V|$ is the size of the vocabulary (the number of distinct words in the corpus). Each word vector $v_i$ also has size $|V|$. The final output of the word embedding process is a real-valued feature vector, $Emb_s = \{e_1, e_2, \ldots, e_T\}$, which represents the input sentence. The formula for the conversion of a single word $x_i$ into a feature vector $e_i$ is
$$e_i = W^{wrd} v_i$$
where $W^{wrd}$ is the weight matrix for word embedding and $v_i$ is the one-hot encoded vector for the $i$-th word in the vocabulary.
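The matrix-vector product above can be illustrated with a small sketch. The vocabulary, the matrix values, and the function names `one_hot` and `embed` are hypothetical and chosen only for illustration; they are not the trained embedding of the study. Multiplying the weight matrix by a one-hot vector simply selects one column of the matrix.

```python
def one_hot(index, vocab_size):
    """Return the one-hot vector v_i for the word with the given index."""
    return [1.0 if k == index else 0.0 for k in range(vocab_size)]


def embed(W_wrd, v):
    """Compute e = W_wrd v, where W_wrd has d_w rows and |V| columns."""
    return [sum(w * x for w, x in zip(row, v)) for row in W_wrd]


# Hypothetical toy setup: |V| = 4 words, embedding size d_w = 3.
vocab = {"<start>": 0, "dog": 1, "runs": 2, "<end>": 3}
W_wrd = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

sentence = ["<start>", "dog", "runs", "<end>"]
# Emb_s: one d_w-dimensional feature vector per word of the sentence.
emb_s = [embed(W_wrd, one_hot(vocab[w], len(vocab))) for w in sentence]
```

In practice the product is usually implemented as a table lookup, since only one column of the matrix contributes to each result.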

CNN Image
Finally, the feature vectors are fed into a dense layer with ReLU activation, which reduces the dimension to match the size of the word embedding $Emb_s$. The resulting feature vector is then used by the visual attention to determine the relevant image section for generating captions.

Visual Attention Mechanism.
The visual attention function [30] is calculated in the proposed method from the previous hidden state ($h_{t-1}$) and the output of the encoder ($a_i$). The attention score is computed for each time step ($t$) and location ($i$) in the image by applying a nonlinear activation function (tanh) to $h_{t-1}$ and $a_i$, and then using the SoftMax activation function to get the attention distribution. Each part of the image is assigned a weight that represents its importance when combining the feature vectors $a_i$. The visual feature vectors are 2048-dimensional real vectors and form the set $A = [a_1, a_2, \ldots, a_L]$. Mathematically, the attention score $\alpha_{ti}$ is given by
$$\alpha_{ti} = \mathrm{softmax}\left(\tanh\left(W_a^{T} a_i + W_h^{T} h_{t-1}\right)\right)$$

Applied Computational Intelligence and Soft Computing
where $W_a$ and $W_h$ are weight matrices that learn the importance of the $i$-th element of the input sequence and of the previous hidden state, respectively. The superscript $T$ denotes the transpose of the weight matrix. The context vector is obtained by computing a weighted sum of the visual feature vectors using the attention weights (scores) obtained from the SoftMax activation function. In other words, for each time step ($t$), the context vector represents the combined visual representation of the image locations, where the contribution of each location to the context vector is determined by its corresponding attention weight. Mathematically, the context vector ($C_t$) is expressed as
$$C_t = \sum_{i=1}^{L} \alpha_{ti} a_i$$
This helps to reduce the gap between the image and the candidate captions for the image.
Finally, the context vector and the text feature representation are combined to reduce the gap between the image and the candidate caption. The result of this combination is represented by $Con_x$, which is calculated as the element-wise sum ($\oplus$) of the word embedding ($Emb_s$) and the context vector ($C_t$). $Con_x$ is thus the combined representation of the image and text features:
$$Con_x = Emb_s \oplus C_t$$
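The attention step and the element-wise combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the weights `w_a` and `w_h` are treated as vectors so that each location receives a scalar score, the feature vectors are assumed to be already reduced to the embedding size, and all numbers are made up.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_attention(features, h_prev, w_a, w_h):
    """Return (alpha_t, C_t) for one decoding step.

    features: list of L image feature vectors a_i
    h_prev:   previous decoder hidden state h_{t-1}
    w_a, w_h: weight vectors (assumed shapes, see lead-in)
    """
    scores = [math.tanh(dot(w_a, a_i) + dot(w_h, h_prev)) for a_i in features]
    alpha = softmax(scores)
    dim = len(features[0])
    # C_t is the attention-weighted sum of the feature vectors.
    context = [
        sum(alpha[i] * features[i][k] for i in range(len(features)))
        for k in range(dim)
    ]
    return alpha, context


def combine(word_embedding, context):
    """Element-wise sum of a word embedding and the context vector."""
    return [e + c for e, c in zip(word_embedding, context)]


# Hypothetical toy values: L = 3 image locations, 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [0.5, -0.5]
alpha, context = visual_attention(features, h_prev, w_a=[1.0, 0.5], w_h=[0.2, 0.2])
con_x = combine([1.0, 2.0], context)
```

The weights in `alpha` sum to one, so `context` always lies within the range spanned by the feature vectors; the location with the highest score contributes most.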

Bi-GRU with Attention Mechanism Language Decoder.
The Bi-GRU language decoder takes the output of the concatenation layer as input and feeds it into a Bi-GRU network. The Bi-GRU computes both the forward and backward hidden sequences (represented as $\overrightarrow{h}$ and $\overleftarrow{h}$, respectively), as shown in the Bi-GRU layer section in Figure 2. The computation is done as follows:
(i) The update gate, $z_t$, is calculated using a sigmoid activation function ($\sigma$) on the weighted sum of the previous hidden state ($h_{t-1}$) and the concatenation of the visual attention context and text feature at time step $t$ ($Con_{xt}$), plus the corresponding bias term $b_z$.
(ii) The reset gate, $r_t$, is calculated using a sigmoid activation function on the weighted sum of the previous hidden state ($h_{t-1}$) and the concatenation of the visual attention context and text feature at time step $t$ ($Con_{xt}$), plus the corresponding bias term $b_r$.
(iii) The current memory content, $\tilde{h}_t$, is calculated using a ReLU activation function on the weighted sum of the element-wise product of the reset gate output and the previous hidden state ($r_t * h_{t-1}$) and the concatenation of the visual attention context and text feature at time step $t$ ($Con_{xt}$), plus the corresponding bias term $b$.
(iv) The final memory at the current unit, $h_t$, is calculated as the element-wise sum of the previous hidden state with a weight of $(1 - z_t)$ and the current memory content with a weight of $z_t$.
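The four steps above can be sketched as a single GRU step plus a bidirectional pass over a sequence. This is an illustrative sketch with hypothetical names (`gru_step`, `bi_gru`, `toy_params`) and made-up parameters. One point is an assumption: the text names ReLU for the current memory content, whereas the standard GRU uses tanh, so the activation is left as a parameter that defaults to tanh, and `relu` can be passed instead.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def affine(W, x, b):
    """Return W x + b for a matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def gru_step(h_prev, x_t, p, activation=math.tanh):
    """One GRU step; x_t plays the role of Con_xt at time step t."""
    concat = h_prev + x_t  # list concatenation [h_{t-1}, Con_xt]
    z = [sigmoid(v) for v in affine(p["W_z"], concat, p["b_z"])]  # update gate
    r = [sigmoid(v) for v in affine(p["W_r"], concat, p["b_r"])]  # reset gate
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t
    h_cand = [activation(v) for v in affine(p["W"], gated, p["b"])]
    # h_t = (1 - z_t) * h_{t-1} + z_t * candidate
    return [(1.0 - z_k) * h_k + z_k * c_k for z_k, h_k, c_k in zip(z, h_prev, h_cand)]


def bi_gru(inputs, p_fwd, p_bwd, hidden_size):
    """Run a forward and a backward GRU and join their states per time step."""
    h = [0.0] * hidden_size
    forward = []
    for x_t in inputs:
        h = gru_step(h, x_t, p_fwd)
        forward.append(h)
    h = [0.0] * hidden_size
    backward = []
    for x_t in reversed(inputs):
        h = gru_step(h, x_t, p_bwd)
        backward.append(h)
    backward.reverse()  # align backward states with the original time order
    return [f + b for f, b in zip(forward, backward)]


def toy_params(hidden_size, input_size, value):
    """Hypothetical constant parameters, used only for this illustration."""
    cols = hidden_size + input_size
    matrix = [[value] * cols for _ in range(hidden_size)]
    bias = [0.0] * hidden_size
    return {"W_z": matrix, "b_z": bias, "W_r": matrix, "b_r": bias,
            "W": matrix, "b": bias}


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = bi_gru(inputs, toy_params(2, 2, 0.1), toy_params(2, 2, -0.1), hidden_size=2)
```

Each entry of `states` holds the forward state followed by the backward state for one time step, so every position sees information from both directions of the sequence.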
$W$, $W_z$, and $W_r$ are the weight matrices for the current unit, update gate, and reset gate, respectively, and $b$, $b_z$, and $b_r$ are the corresponding bias terms.

Additive Attention.
Additive attention is a method for calculating attention context vectors in a language decoder. The method operates in two steps. The first step calculates a matching score (alignment score) $e_{tj}$ between the current hidden state $h_t$ and a previous hidden state $h_j$ using an additive projection. The second step calculates the attention context vector $C_j$ for hidden state $h_j$ based on the alignment scores. The alignment scores are calculated from $h_t$ and $h_j$ using the learned attention parameters $V_a$, $W$, and $U_a$, where $d$ is the dimensionality of the hidden state. The attention layer section in Figure 2 shows the calculation of the attention context vector, which is obtained by summing the products of the alignment scores and the hidden states $h_j$.
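The two steps of additive attention can be sketched as below. The exact scoring equation is not reproduced in the text, so this sketch assumes the commonly used additive form $e_{tj} = V_a^{T} \tanh(W h_t + U_a h_j)$ followed by a softmax over $j$; the function name and all parameter values are hypothetical, and the role of $d$ is not modelled.

```python
import math


def matvec(M, x):
    """Return M x for a matrix M given as a list of rows."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(h_t, states, W, U_a, v_a):
    """Return (weights, context) for the current state h_t.

    Step 1: alignment score e_tj = v_a . tanh(W h_t + U_a h_j) for every h_j.
    Step 2: context = sum_j softmax(e_t)_j * h_j.
    """
    query = matvec(W, h_t)
    scores = []
    for h_j in states:
        key = matvec(U_a, h_j)
        hidden = [math.tanh(q + k) for q, k in zip(query, key)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * h_j[k] for w, h_j in zip(weights, states)) for k in range(dim)]
    return weights, context


# Hypothetical toy values: three hidden states of size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention([1.0, 1.0], states, identity, identity, [1.0, 1.0])
```

With these toy values the third state receives the largest weight because it aligns best with the current state; with all-zero parameters every state would receive the same weight and the context would be their plain average.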
The final output of the additive attention layer is transformed into a 1-dimensional representation by a flatten layer. This representation is used as input for the final two dense layers. The first dense layer uses the length parameter and the second dense layer outputs a probability distribution over the vocabulary to generate the final image caption. The use of a Bi-GRU network with an additive attention language decoder enhances the existing work by considering both the visual and textual relevant features to generate realistic captions, as shown in Figure 3.
Figure 3 displays the pipeline of both the baseline and the proposed model. The solid black line represents the baseline model while the solid green line represents the proposed model. The red dotted line highlights the modifications and the gaps addressed by the proposed model in comparison to the baseline model.
The baseline model has two major limitations: (1) a direct connection between the encoder and the Bi-GRU decoder hinders the decoder from focusing on the crucial image information; (2) the baseline model lacks the ability to effectively select important visual-textual features during image caption generation. To address these limitations, the proposed model incorporates an additive attention mechanism into the language decoder that enables it to concentrate on the relevant visual-textual features.

Dataset Preparation.
In this study, two well-known datasets, Flickr8k (8,000 images) and BNATURE (8,000 images), were used for training and testing the proposed model. The datasets were divided into three parts, with 75% of the images for training, 12.5% for validation, and 12.5% for testing. Each image in the datasets was annotated with five separate captions in both the English and Amharic Languages. The English captions were translated into Amharic using Google Translate and then reviewed by Amharic Language experts to correct grammar and semantic errors. A sample of the dataset is presented in Figure 4.
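For 8,000 images, the 75% / 12.5% / 12.5% split corresponds to 6,000 training, 1,000 validation, and 1,000 test images. A minimal sketch of such a split is shown below; the file names, the seeded shuffle, and the function name `split_dataset` are assumptions made for illustration, since the text does not state how images were assigned to each part.

```python
import random


def split_dataset(image_ids, train_frac=0.75, val_frac=0.125, seed=0):
    """Shuffle the image ids and split them into train/validation/test lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle (an assumption)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 8,000 dataset images.
image_ids = [f"img_{i:04d}.jpg" for i in range(8000)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
```

Splitting by image rather than by caption keeps all five captions of an image in the same part, which avoids leaking test images into training.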

Sample English captions from the dataset (Figure 4):
Image 1:
A couple of several people sitting on a ledge overlooking the beach.
A group of people sit on a wall at the beach.
A group of teens sit on a wall by a beach.
Crowd of people at the beach.
Several young people sitting on a rail above a crowded beach.
Image 2:
A black and white dog is running in a grassy garden surrounded by a white fence.
A black and white dog is running through the grass.
A Boston terrier is running in the grass.
A Boston Terrier is running on lush green grass in front of a white fence.
A dog runs on the green grass near a wooden fence.

The Flickr8k and BNATURE datasets were chosen for two reasons. First, they are readily available and their size is suitable for training with a low-power graphics processing unit (GPU), making them easier to analyse than other datasets such as Flickr30k and Microsoft common objects in context (MSCOCO). Second, the captions of these datasets were collected from human annotators on Amazon Mechanical Turk (AMT), who were given precise instructions to verbalize the main action depicted in the images.

Text Preprocessing Technique.
Preprocessing is a crucial step in the development of machine learning and deep learning models. It aims to convert the raw data into a format that is suitable for processing and analysis. This step is especially critical when working with the Amharic Language, as it requires specific language-based rules and the removal of irrelevant words and characters.
The data cleaning process is applied to ensure that the captions are in a standardized format that is easier for the algorithms to process. This practice includes the lower-casing of words to avoid duplications and the removal of special characters, such as "+," "%," "$," "#," "@," etc. Additionally, any words that contain numbers, like "Hello123," are also removed. To further enhance the accuracy of the captions, any inappropriate words and symbols are manually adjusted to ensure that the translated captions maintain their original meaning.
In this study, data cleaning was performed on a dataset of 40,000 (8,000 × 5) sentences. Once the data cleaning was complete, each caption sentence was marked with the tags "<start>" and "<end>" so that the algorithms can identify where an image caption begins and ends.
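The cleaning rules described above (lower-casing, dropping words that contain numbers, removing special characters, and adding the start and end tags) can be sketched as follows. The function name `clean_caption`, the order of the steps, and the sample sentences are assumptions for illustration; the Amharic sample is a made-up caption, not taken from the dataset. The pattern `\w` matches Unicode word characters, so Ethiopic letters are kept while punctuation is removed.

```python
import re

# Anything that is not a Unicode word character or whitespace, plus "_".
SPECIAL_CHARS = re.compile(r"[^\w\s]|_")


def clean_caption(caption):
    """Apply the cleaning rules and wrap the caption in <start>/<end> tags."""
    words = caption.lower().split()
    # Drop whole words that contain a digit, e.g. "hello123".
    words = [w for w in words if not any(ch.isdigit() for ch in w)]
    # Strip special characters such as + % $ # @ from the remaining words.
    words = [SPECIAL_CHARS.sub("", w) for w in words]
    # Discard words that became empty after stripping.
    words = [w for w in words if w]
    return "<start> " + " ".join(words) + " <end>"


english = clean_caption("A dog runs, Hello123 on the #grass!")
amharic = clean_caption("ውሻ በሣር ላይ ይሮጣል።")  # illustrative sample sentence
```

Lower-casing has no effect on Ethiopic script, so for the Amharic captions the step matters only for any Latin-script words that remain after translation.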
After the data cleaning process, the next steps in the preprocessing stage are tokenization and vectorization. Tokenization is the process of breaking down the caption sentences into smaller units, such as words, characters, or sentences. In this study, the image captions were broken down into word-level tokens. The vectorization step converts the tokenized captions into numerical representations, making it easier for the algorithms to process the information.
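Word-level tokenization and vectorization can be illustrated with the sketch below. It assumes whitespace tokenization, an integer index per distinct word, and padding with a `<pad>` token up to a fixed length; the padding scheme and the function names are assumptions, since the text does not specify them.

```python
def build_vocab(captions):
    """Assign an integer id to every distinct word, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for caption in captions:
        for word in caption.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def vectorize(caption, vocab, max_len):
    """Convert a caption into a fixed-length list of word ids."""
    ids = [vocab[word] for word in caption.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Hypothetical cleaned captions, already wrapped in <start>/<end> tags.
captions = [
    "<start> a dog runs <end>",
    "<start> a dog sits on grass <end>",
]
vocab = build_vocab(captions)
max_len = max(len(c.split()) for c in captions)
vectors = [vectorize(c, vocab, max_len) for c in captions]
```

The resulting integer sequences are what an embedding layer consumes: each id selects one row or column of the embedding matrix.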

Image Preprocessing.
The image dataset used in this study consists of 8,000 unique images that have been preprocessed to ensure a uniform image size. The images were resized to a resolution of 299 pixels in width and 299 pixels in height, with 3 colour channels. In addition to scaling, grayscale histogram and data augmentation techniques were applied to further preprocess the image datasets.
To extract the image features, a pretrained CNN model was used instead of training a model from scratch. Training a convolutional network from scratch requires a large dataset and high computing power; therefore, the Inception-v3 CNN model is used for this purpose. Inception-v3 is a pretrained model that was originally trained for image classification tasks and has a lower number of parameters than other pretrained CNN models such as VGG-16 and ResNet-50.
The Inception-v3 model is used as a feature extractor to obtain the most important features from the images. The feature extraction was implemented by removing the last SoftMax layer of the Inception-v3 model and keeping the 2048 features of each image. This approach ensures that the most significant information is captured and used when building the deep learning models. As a result, the model can give more accurate and reliable image captions.

Evaluation Metric.
BLEU is a precision-based metric for machine-generated text. Evaluating the quality of machine-generated text is crucial in the development of language models. The BLEU metric is a widely used method for measuring the accuracy of machine-generated text. It ranges from 0 to 100, where 100 indicates a perfect match and 0 indicates a perfect mismatch.
This approach evaluates the quality of machine-generated text by comparing the n-grams of the predicted output of a model to the n-grams of the reference data [32]. A high BLEU score, close to 100, indicates a better model, while a score close to zero indicates a poor model. BLEU has been widely used in the field of natural language processing and has been found to be a reliable and effective method for evaluating the quality of machine-generated text.
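The n-gram comparison behind BLEU can be sketched as below. This is a simplified sentence-level version with clipped n-gram precision, a brevity penalty, and uniform weights on a 0 to 100 scale; it is meant to illustrate the idea and is not the evaluation script used in the study. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it occurs
    # in any single reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Cumulative BLEU up to max_n on a 0-100 scale (no smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty based on the reference closest in length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - ref_len / len(candidate))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return 100.0 * penalty * math.exp(log_mean)


reference = "a dog runs on the green grass".split()
perfect = bleu(reference, [reference])
unrelated = bleu("two cats sleep".split(), [reference])
```

Here `max_n=1` gives the 1-gram score and `max_n=4` the 4-gram score; without smoothing, a candidate that shares no 4-gram with any reference scores zero, which is why short captions often need a smoothed variant.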

Experiments and Model Building
In this study, two experiments were conducted to evaluate the performance of the proposed model in comparison to the base model. The first experiment was performed on the CNN-Bi-GRU encoder-decoder model, while the second experiment was performed on the proposed hybridized attention-based CNN-Bi-GRU model. The experiment on the base model includes the following steps: (1) image features are extracted using the pretrained Inception-v3 model; (2) the extracted features are fed into the CNN encoder; (3) the output of the CNN encoder is combined with the word embedding layer and sent to the Bi-GRU layer; (4) the final output of the model is the predicted word probability based on the Bi-GRU unit.
The experiment on the proposed hybridized attention-based CNN-Bi-GRU model covers the following: (1) attention mechanisms are implemented on both the CNN encoder and the Bi-GRU decoder; (2) the first, visual attention is placed between the CNN and the Bi-GRU to allow the Bi-GRU to focus only on the relevant image features; (3) the attention mechanism is extended to the existing Bi-GRU decoder to reduce the gap between the visual and the textual components and to generate semantically correct image captions.

The experiments were performed using the Flickr8k dataset with captions in both the English and Amharic languages. The results of the experiments are evaluated using BLEU scores. The model is trained and optimized with the hyperparameters shown in Table 1. Generally, the study is conducted based on the following assumptions: (i) the image features extracted using the pretrained Inception-v3 CNN model provide enough information for the encoder-decoder models to generate image captions; (ii) the tokenization and vectorization techniques applied to the image captions enable the models to effectively comprehend language patterns; (iii) the hybridized attention mechanism, applied to both the CNN encoder and the Bi-GRU decoder, can enhance the performance of the model and generate semantically correct image captions.
On the other hand, it is important to note that the results of the model are influenced by the limitations of the data and algorithms used.

Results and Discussion
The experimental results show that the proposed hybridized model outperforms the basic model in terms of image captioning accuracy for the Amharic Language. As indicated in Table 2, the hybridized model achieves a 15%, 12%, 10%, and 10% improvement in 1G-BLEU, 2G-BLEU, 3G-BLEU, and 4G-BLEU scores, respectively. Gram (G) refers to the number of consecutive words in each n-gram that is compared. This significant increase in performance can be attributed to the integration of visual attention and Bi-GRU with an attention mechanism.
The visual attention allows the model to focus on the most important parts of the image, while the Bi-GRU attention mechanism selects the most relevant words to describe the content of the image. The resulting captions are more descriptive and accurately reflect the context of the image. Furthermore, the use of Bi-GRU with an attention decoder during the caption generation process ensures that the generated words are highly relevant and appropriate for the image context.
The results of the two models are compared on the BNATURE and Flickr8k datasets, as shown in Table 3. The purpose of these experiments is to evaluate the robustness and generalizability of the models using different datasets. The results with the BNATURE dataset provide more evidence of the performance improvement of the proposed hybridized model compared to the base model. These results demonstrate the ability of the hybridized model to adapt to different data sources without losing its semantic accuracy. This indicates that the proposed model is effective not only on the Flickr8k dataset but also on other similar datasets, making it a more flexible and practical solution for image captioning. The proposed model generated image captions with better meaning than the basic model.
The results show that the proposed model outperforms the other models, as it achieved better results on all BLEU scores in both datasets. The model's training accuracy on the BNATURE dataset was 0.925, with a testing accuracy of 0.928. The loss during the training phase was approximately 0.186, and during the testing phase it was 0.181, as shown in Figure 5. Similarly, on the Flickr8k dataset, the training accuracy was 0.882 and the testing accuracy was 0.885. The overall training loss was 0.303 and the test loss was 0.295, as shown in Figure 6.
One of the advantages of using the Bi-GRU technique in this study is its simple configuration compared to Bi-LSTM. It uses only two gates, the update and reset gates, which results in faster computation. Moreover, incorporating DNN techniques such as Inception-v3 and Bi-GRU improves the quality of the image captions. Inception-v3 was used as the encoder to obtain visual features, while the Bi-GRU decoder was used to predict the words that make up the image caption. The results of the study show that the use of attention-based Bi-GRU, that is, the hybridized model, results in better image captions.
The proposed model has demonstrated its advantages over other models in terms of image captioning accuracy and speed of computation. The results of the study show that the use of Bi-GRU and attention-based techniques significantly improves the semantic quality of the image captions.
However, there are also some shortcomings of the proposed model that need to be considered. One of the main drawbacks is the increased complexity of the model. The use of multiple components such as Inception-v3 and Bi-GRU with attention mechanisms results in a more complex model that requires more computational resources and longer training times. Additionally, the model may not perform well on images with complex or unusual content, as it relies on the accuracy of the visual features extracted by the Inception-v3 encoder. In such cases, alternative encoders could be considered. Another shortcoming is that the model may not generalize well to different languages or cultures, as it is trained on a specific dataset that represents a particular language and cultural context. To address this issue, future research could explore the use of multilingual or cross-cultural datasets to train the model and increase its generalizability.
In summary, while the proposed model has demonstrated its effectiveness in improving the quality of image captions, it also has limitations that need to be considered in future research. Despite these limitations, the results of the study provide strong evidence of the potential of using Bi-GRU with attention mechanisms for image captioning.
The proposed model improves image captioning by focusing on essential image features and the corresponding text features. The results show a 10% increase in performance, as measured by the 4G-BLEU score, compared to the basic Bi-GRU decoder.
The experimental results in Table 3 show the performance of the proposed hybridized model on two different datasets. To assess the effectiveness of the proposed approach, the results were compared with two base models, CNN-Bi-GRU [30] and Bag-LSTM [7].
The integration of visual attention and Bi-GRU with an attention mechanism decoder has proven to be a more effective approach for generating image captions. The hybridized model has demonstrated a significant improvement in performance compared to the existing models, with a 21% increase in 4G-BLEU score compared to both CNN-Bi-GRU and Bag-LSTM. The results in Figure 7 further confirm the superiority of the proposed hybridized model in generating high-quality image captions.

Conclusion
The significance of Amharic image captioning for a range of Amharic language-based applications has been highlighted in this study. By combining the image processing and text processing domains, the challenge of generating grammatically and semantically correct captions has been addressed. A hybridized attention-based CNN-Bi-GRU model has been proposed to overcome these challenges and enhance the quality of Amharic image captions.
The proposed model comprises four main components: the word embedding, the image encoder, the visual attention mechanism, and the language decoder.
Word embedding is a technique in NLP that maps each word in a vocabulary to a high-dimensional vector of real numbers, which can be used as a representation of the word's meaning. The image encoder extracts image features. The visual attention mechanism focuses on the critical areas of the image, and the language decoder learns a two-way (bidirectional) long-term dependency between sequential information to produce an image description. Experiments on the translated Flickr8k and BNATURE datasets have shown that the proposed model outperforms the baseline CNN-Bi-GRU and Bag-LSTM models.
The results indicate that integrating the visual attention mechanism and the Bi-GRU language decoder into the image captioning process improves the semantics of the generated descriptions. The study concludes that this approach to Amharic image captioning is an effective means of improving the quality of image descriptions and is a valuable contribution to the field. In the future, we aim to further develop our dataset to be more similar to the Flickr8k English dataset by collecting images that reflect the diverse cultures of our country. This will involve collecting images from various sources and annotating each image with five Amharic sentences. Additionally, we plan to apply the proposed model to new tasks such as recognizing activities in Amharic videos.

Data Availability
The data used to support this study are available at https://drive.google.com/file/d/1geLObzMiTaJLpOwE1MMzZ4BXc64Go3g2/view?usp=sharing.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
