RLMD-PA: A Reinforcement Learning-Based Myocarditis Diagnosis Combined with a Population-Based Algorithm for Pretraining Weights

Myocarditis is inflammation of the heart muscle that is becoming more prevalent, especially with the spread of COVID-19. Noninvasive cardiac magnetic resonance (CMR) imaging can be used to diagnose myocarditis, but the interpretation is time-consuming and requires expert physicians. Computer-aided diagnostic systems can facilitate the automatic screening of CMR images for triage. This paper presents an automatic model for myocarditis classification based on a deep reinforcement learning approach, called reinforcement learning-based myocarditis diagnosis combined with a population-based algorithm (RLMD-PA), that we evaluated using the Z-Alizadeh Sani myocarditis dataset of CMR images prospectively acquired at Omid Hospital, Tehran. This model addresses the imbalanced classification problem inherent to the CMR dataset and formulates the classification problem as a sequential decision-making process. The policy architecture is based on a convolutional neural network (CNN). To implement this model, we first apply the artificial bee colony (ABC) algorithm to obtain initial values for the RLMD-PA weights. Next, the agent receives a sample at each step and classifies it. For each classification action, the agent gets a reward from the environment, in which the reward of the minority class is greater than the reward of the majority class. Eventually, the agent finds an optimal policy under the guidance of a particular reward function and a helpful learning environment. Experimental results based on standard performance metrics show that RLMD-PA achieves high accuracy for myocarditis classification, indicating that the proposed model is suitable for myocarditis diagnosis.


Introduction
Myocarditis is a condition that causes inflammation of the heart muscle [1]. It can affect heart pump function as well as electrical activation and conduction, resulting in heart failure and arrhythmia, respectively. The etiology is diverse, including infection (e.g., viral infections such as COVID-19 and parvovirus) [2], systemic inflammatory and autoimmune diseases, and drug reactions. Symptoms of myocarditis include chest pain, fatigue, and shortness of breath [3]. Patients with suspected myocarditis should seek cardiology advice for early diagnosis and treatment. Endomyocardial biopsy, an invasive procedure, is recommended in severe cases to confirm the diagnosis and to guide treatment [4]. Management comprises supportive measures, symptomatic heart failure therapy, antimicrobials for identified infective agents, and immunosuppression for severe inflammation. Early diagnosis and prompt institution of treatment can significantly reduce morbidity and mortality. Noninvasive cardiac imaging with cardiovascular magnetic resonance imaging (MRI) [5] can help clinch the diagnosis. However, MRI requires expert interpretation, which is manually intensive and subject to operator bias. In this regard, automated diagnostic systems can be developed that employ various machine learning and data mining algorithms to solve medical image classification problems efficiently [6]. They can be applied to reporting workflows to screen images automatically, saving physicians time, reducing errors, and enhancing diagnostic accuracy.
Deep models have demonstrated excellent performance in diverse applications, including natural language processing [7][8][9], computer vision, and medical image analysis [10,11]. Deep learning-based algorithms converge to suitable weights by minimizing the error between the real and predicted outputs. Typically, deep models use gradient-based algorithms such as backpropagation to learn the weights. However, such optimization methods are sensitive to the initial weights and may become trapped in local minima [12]. This issue is mainly encountered during classification [13]. A few researchers have shown that population-based meta-heuristic (PBMH) algorithms [14,15] may help to overcome this problem [16]. Among PBMH algorithms, the ABC algorithm is one of the most effective optimizers [17,18]. It emulates the behavior of bees in nature and, unlike traditional optimization algorithms, dispenses with the need to calculate gradients, thereby reducing the probability of getting stuck in local optima [19].
Classification performance in many machine learning algorithms may be adversely affected by imbalanced classification [20], which occurs when one class contains disproportionately more data than the others [21]. While models trained on imbalanced data may still attain reasonable detection rates for majority samples, the performance for minority samples is weak, as minority class specimens can be difficult to identify due to their rarity and randomness. Also, misclassification of minority class samples can result in high costs. Methods have been proposed to address the problem at two levels [22]: the data level and the algorithmic level. In the former [23][24][25], training data are manipulated to balance the class distribution by oversampling the minority class and/or undersampling the majority class [26]. For instance, the synthetic minority oversampling technique (SMOTE) generates new samples by linear interpolation between adjoining minority samples [24], whereas NearMiss undersamples majority samples using the nearest neighbor algorithm [25]. Of note, oversampling and undersampling risk overfitting and loss of valuable information, respectively [27]. At the algorithmic level, the importance of the minority class can be raised using techniques [28][29][30][31][32] that include cost-sensitive learning, ensemble learning, and decision threshold adjustment. In cost-sensitive learning, different misclassification costs are assigned to each class in the loss function, with a higher cost being allocated to minority class misclassification. Ensemble learning systems train several subclassifiers and then apply voting or combination to obtain better results. Threshold adjustment techniques train the classifier on the imbalanced dataset and modify the decision threshold at test time. Deep learning-based methods have also been suggested for imbalanced data classification [33][34][35]. The authors in Reference [36] introduced a new loss function for deep networks that could capture classification errors from both minority and majority classes. Reference [37] introduces a method that could learn the unique features of an imbalanced dataset while maintaining intercluster and interclass margins.
To the best of our knowledge, only one work [3] based on deep learning models has been proposed for the diagnosis of myocarditis. The authors developed an algorithm for classifying images based on a CNN and the k-means algorithm [38], which has the following workflow: after the data preprocessing stage, the images were placed in several clusters, and each cluster was treated as a class for the CNN to classify. The algorithm was repeated for different clusters, and all the results were combined for the final decision. The main problem with the method was that it treated the image matrix as a vector in k-means, which lost the information carried by the pixels surrounding a given pixel.
This paper presents a method based on the ABC algorithm and reinforcement learning, called RLMD-PA, that we believe addresses the abovementioned problems. The RLMD-PA model poses the classification problem as a guessing game embodied in a sequential decision-making process. At each step, the agent receives an environmental state represented by a training instance and then executes a classification under the direction of a policy. If the agent classifies the sample correctly, it is given a positive reward; otherwise, it is given a negative one. The minority class is rewarded more than the majority class. The agent's goal is to accumulate as many rewards as possible during the sequential decision-making process, that is, to classify the samples as correctly as possible. The main contributions of this article are as follows: (1) we considered the classification problem of medical images as a sequential decision-making process and presented a reinforcement learning-based algorithm for imbalanced classification; (2) instead of random weight initialization, we developed an encoding strategy and calculated the optimal initial values using the ABC algorithm; and (3) this work is based on a new well-annotated MRI dataset acquired from Tehran's Omid Hospital that we have named the Z-Alizadeh Sani myocarditis dataset and made publicly downloadable. The rest of the article is structured as follows: the second section is a brief overview of the ABC algorithm and how it works. The third section introduces the proposed model. The fourth section presents the evaluation criteria, dataset, and analysis of the results. The last section states the conclusions and future work.

Artificial Bee Colony Algorithm. Artificial bee colony (ABC), introduced by Karaboga and Basturk [39], is one of the most efficient algorithms for optimizing numerical problems. It is straightforward, robust, and population-based [19]. The algorithm emulates the intelligent foraging behavior of bees to arrive at the optimal solution. There is a list of food sources that bees seek out over time to get to the best positions. The algorithm involves three groups of bees: employed bees, onlooker bees, and scout bees. Employed bees discover the positions of food sources, whereas onlooker bees wait in the hive for the nectar information from food positions to be sent by employed bees. Onlooker bees use the information to select food source positions. Once an employed bee has exhausted the food source, it becomes a scout bee to search for new positions randomly. The number of employed bees equals the number of unemployed (onlooker and scout) bees. The steps for optimizing a problem using the ABC algorithm are as follows:
(1) Initialization: in the first step, an initial population S of size C is formed from the positions (solutions) according to equation (1), where i indexes the i-th position, each solution $s_i$ has D dimensions, and D is the number of parameters that must be optimized. $s^j_{\min}$ and $s^j_{\max}$ are the smallest and largest values of the j-th parameter $s^j$, respectively.
(2) Employed bee phase: at this point, new solutions are found by searching the neighborhood of the current potential solutions. To keep the population size constant, the quality of each new solution is evaluated. If it is better than the previous one, it replaces it; otherwise, the previous solution is kept. This step is given by equation (2), where k is a randomly chosen solution such that k ≠ i, and $\varphi_i^j$ is a random number picked from the interval [0, 1]. The potentially new solution $v_i$ is obtained by changing only one element of $s_i$.
(3) Onlooker bee phase: for the onlooker bee update, one solution is selected stochastically from the candidate solutions according to the probability $p_i$ given in equation (3). The more suitable a solution is, the higher the chance that it will be selected. If the solution chosen from the employed bees scores higher than the onlooker bee's current solution, it replaces the current solution. This process is repeated for all onlooker bees in population S.
(4) Scout bee phase: a solution whose fitness does not improve after some repetitions can trap the algorithm in a local optimum [40]. To prevent this, once a solution's fitness has not improved after t iterations, the algorithm discards it, and a new solution is supplied according to equation (2).
(5) Algorithm end condition: although different conditions can be defined for ending the algorithm, this study uses an iteration limit, which means that the algorithm ends after MaxItr iterations.
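For reference, the three update rules referred to in steps (1) to (3) are commonly written as below. This is a sketch in the standard ABC form that appears consistent with the symbol definitions in the steps; the fitness symbol $\mathrm{fit}_i$ is an assumed name, and the exact notation of the original equations may differ.

```latex
\begin{align}
s_i^j &= s_{\min}^j + \mathrm{rand}(0,1)\,\bigl(s_{\max}^j - s_{\min}^j\bigr),
      \quad i = 1,\dots,C,\; j = 1,\dots,D \tag{1}\\
v_i^j &= s_i^j + \varphi_i^j\,\bigl(s_i^j - s_k^j\bigr), \quad k \neq i \tag{2}\\
p_i &= \frac{\mathrm{fit}_i}{\sum_{n=1}^{C} \mathrm{fit}_n} \tag{3}
\end{align}
```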
The complete ABC algorithm is given in Algorithm 1.
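The five steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `abc_minimize`, the fitness transform `1 / (1 + cost)` (which assumes non-negative costs), the clipping to the search box, and the random re-initialization in the scout phase are all assumptions made for the example. The coefficient `phi` is drawn from [0, 1] as stated in the text; many ABC variants draw it from [-1, 1] instead.

```python
import random


def abc_minimize(objective, dim, lower, upper,
                 pop_size=20, limit=10, max_itr=100, seed=0):
    """Minimal artificial bee colony sketch that minimizes `objective` in a box."""
    rng = random.Random(seed)

    def random_solution():
        # Initialization: a random position inside [lower, upper] per dimension.
        return [lower + rng.random() * (upper - lower) for _ in range(dim)]

    pop = [random_solution() for _ in range(pop_size)]
    costs = [objective(s) for s in pop]
    trials = [0] * pop_size  # iterations without improvement, per solution

    def try_improve(i):
        # Change one element of solution i using a random partner k != i.
        k = rng.choice([n for n in range(pop_size) if n != i])
        j = rng.randrange(dim)
        phi = rng.random()  # [0, 1] as in the text
        cand = list(pop[i])
        cand[j] = min(upper, max(lower, cand[j] + phi * (cand[j] - pop[k][j])))
        cost = objective(cand)
        if cost < costs[i]:  # greedy replacement keeps the population size fixed
            pop[i], costs[i], trials[i] = cand, cost, 0
        else:
            trials[i] += 1

    best_i = min(range(pop_size), key=lambda n: costs[n])
    best, best_cost = list(pop[best_i]), costs[best_i]
    for _ in range(max_itr):
        for i in range(pop_size):  # employed bee phase
            try_improve(i)
        fits = [1.0 / (1.0 + c) for c in costs]  # assumed fitness transform
        total = sum(fits)
        for i in range(pop_size):  # onlooker bee phase
            if rng.random() < fits[i] / total:
                try_improve(i)
        for i in range(pop_size):  # scout bee phase
            if trials[i] > limit:
                pop[i] = random_solution()
                costs[i] = objective(pop[i])
                trials[i] = 0
        best_i = min(range(pop_size), key=lambda n: costs[n])
        if costs[best_i] < best_cost:
            best, best_cost = list(pop[best_i]), costs[best_i]
    return best, best_cost
```

For example, `abc_minimize(lambda x: sum(v * v for v in x), dim=3, lower=-5.0, upper=5.0)` searches for the minimum of a sphere function and returns the best position found together with its cost.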

Reinforcement Learning. Reinforcement learning [41] is an important branch of machine learning that encompasses many domains. Reinforcement learning can achieve relatively good classification results because it can effectively learn useful features from noisy data. In Reference [42], the authors defined classification as a sequential decision problem that used several agents to interact with the environment in order to learn an optimal policy function. Due to the complex simulation between the agents and the environment, the run time was inordinately prolonged. The model presented in [43] is a reinforcement learning-based classifier for noisy text data. The proposed structure comprises two components: a sample selector and a relational classifier. The former selects high-quality sentences from the noisy data under the guidance of the agent, whereas the latter learns from the selected clean data and gives a delayed reward to the sample selector as feedback. Finally, the model yields a superior classifier and a high-quality dataset. The authors in Reference [44] proposed a solution for time series data in which the reward function and Markov process are explicitly defined. In various specific applications [45][46][47][48], reinforcement learning has been applied to learn efficient features. These models promote features that are valuable for the classification, which leads to higher rewards that guide the agent to select more worthy features.

Contrast Media & Molecular Imaging
To date, limited work has been done on deep learning for the classification of imbalanced data. In Reference [44], an ensemble pruning technique for deciding subclassifiers that adopted reinforcement learning was proposed. However, the model underperformed when the amount of data was increased. This is because it is difficult to choose classifiers when there are too many subclassifiers.

The Proposed Solution
The overall structure of the proposed model is shown in Figure 1. We considered two critical issues for classification. First, we formulated a vector that includes all the learnable weights in our model; we obtained initial values for the weights with ABC and then applied backpropagation for the rest of the training. Second, as mentioned, another problem that most classifiers suffer from, including ours, is imbalanced data. To address this, we employed reinforcement learning [49]. These concepts are detailed in the following sections.

Pretraining Phase. Weight initialization is an essential part of deep models. Sometimes, incorrect initial values can lead to a failure of convergence in the model. The proposed model has a deep network with weights θ that need to be optimized. In this section, we present an encoding strategy and fitness function for the ABC algorithm.

Encoding Strategy.
In our work, the encoding strategy aims to arrange the CNN and feed-forward weights in a vector that will be considered the position of the bees in the ABC. Setting the specific weights is a challenge. Nevertheless, we have designed an encoding strategy that is as appropriate as possible after a few experiments. Figure 2 illustrates an example with encoding of a three-layer CNN network with three filters in each layer and a feed-forward network with three hidden layers. Note that all weight matrices in the vector are stored in rows.
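The row-wise flattening described above can be sketched as follows. This is an illustrative assumption about how such an encoding might look, not the authors' code: the helper names `encode` and `decode` are hypothetical, and weights are represented as plain nested lists so the example stays self-contained.

```python
def encode(weight_matrices):
    """Flatten a list of 2-D weight matrices into one vector, row by row."""
    vector, shapes = [], []
    for matrix in weight_matrices:
        shapes.append((len(matrix), len(matrix[0])))
        for row in matrix:
            vector.extend(row)
    return vector, shapes


def decode(vector, shapes):
    """Rebuild the matrices from the flat vector and the recorded shapes."""
    matrices, pos = [], 0
    for rows, cols in shapes:
        matrices.append(
            [vector[pos + r * cols: pos + (r + 1) * cols] for r in range(rows)]
        )
        pos += rows * cols
    return matrices
```

Each bee position in ABC would then be one such vector, and `decode` would restore the layer weights before the network is evaluated. In PyTorch, `torch.nn.utils.parameters_to_vector` and `torch.nn.utils.vector_to_parameters` provide a comparable flattening for model parameters.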

Fitness Function.
The fitness function, which measures the effectiveness of a solution in the ABC algorithm, is defined following [12].

Classification
Input: D: dimensions of every solution, C: population size, limit: number of cycles, MaxItr: maximum number of iterations;
(1) Initialize a population of solutions S = [s_1, s_2, ..., s_C] using equation (1);
…
(6) Produce new solution x_new using equation (2);
(7) Calculate the fitness f_new for x_new;
(8) Replace x_i with x_new if better;
(9) Calculate the probability p for every solution in S using equation (3);
(10) //Onlooker Bee Phase
(11) for i = 1 to C do
(12)   if rand(0, 1) < p_i then
(13)     Produce new solution x_new by using equation (2);
(14)     Calculate the fitness f_new for x_new;
(15)     Replace x_i with x_new if better;
(16) //Scout Bee Phase
(17) If an abandoned solution is found, replace it with the solution produced by equation (2);
(18) Put the best solution in x_best;
(19) Itr = Itr + 1;
ALGORITHM 1: Pseudocode of the ABC algorithm.

Due to the difference in the amount of data between our two classes, we face the problem of imbalanced classification. To address this, we used the imbalanced classification Markov decision process (ICMDP) to construct a sequential decision problem. In reinforcement learning, an agent tries to obtain an optimal policy by performing a series of actions in the environment while maximizing its score. In the case of our model, a sample of the dataset is provided to the agent at each time step and classified. The environment then returns the immediate score to the agent. A correct classification yields a positive score, whereas a wrong classification yields a negative one. By maximizing cumulative rewards, the agent can arrive at the optimal policy. Let $D = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}$ be the imbalanced set of existing images with N samples, where $x_i$ corresponds to the i-th image, and $y_i$ is its corresponding label. The following explains the intended settings:
(i) Policy $\pi_\theta$: the policy π is a mapping function $S \rightarrow A$, where S and A are the sets of states and actions, respectively. In other words, every $\pi_\theta(s_t)$ means performing the action $a_t$ in the state $s_t$. $\pi_\theta$ is the classifier model with weights θ.
(ii) State $s_t$: each state $s_t$ is mapped to a sample $x_t$ from the dataset D. The first sample $x_1$ is taken as the initial state $s_1$. So that the model does not learn a particular order, D is shuffled in each episode.
(iii) Action $a_t$: the action $a_t$ is performed to predict the label of $x_t$. Since the classification is binary, $a_t \in \{0, 1\}$, where zero represents the minority class and one represents the majority class.
(iv) Reward $r_t$: the reward reflects the performance of an action. An agent that classifies correctly gets a positive reward; otherwise, it gets a negative reward. The amount of this reward should not be the same for both classes. A carefully calibrated reward can significantly improve model performance. In this work, the reward for an action is defined according to the following rule [27]: a correct or incorrect classification of a minority-class sample receives +1 or −1, and a correct or incorrect classification of a majority-class sample receives +λ or −λ, where $D_H$ and $D_S$ represent the minority and majority classes, that is, healthy and sick, respectively, and λ is a value in the interval [0, 1]. The magnitude of the majority-class reward, λ, does not exceed that of the minority-class reward because the minority class is more critical owing to its fewer data. In effect, we can ascribe more importance to the minority class so that its influence approximates that of the majority class. In the results section, we will see the importance of the value λ.
(v) Terminal E: every training episode ends at a terminal state. An episode is the transition trajectory from an initial state to a final state, namely, $\{(s_1, a_1, y_1), (s_2, a_2, y_2), (s_3, a_3, y_3), \ldots, (s_t, a_t, y_t)\}$. In our case, an episode stops when all the training data have been classified or when a sample of the minority class is misclassified.
(vi) Transition probability P: the agent goes from state $s_t$ to the next state $s_{t+1}$ based on the order in which the data are read. The transition probability is denoted $p(s_{t+1} \mid s_t, a_t)$.
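The reward rule of item (iv) and the termination rule of item (v) can be sketched as one small function. The function name `reward` and the argument `minority_label` are hypothetical; the values follow the +1/−1 and +λ/−λ scheme described in the text, with label 0 standing for the minority class.

```python
def reward(action, label, lam, minority_label=0):
    """Return (reward, end) for one classification step.

    `end` is True only when a minority-class sample is misclassified,
    which terminates the episode.
    """
    correct = (action == label)
    if label == minority_label:
        # Minority class: full-magnitude reward; a mistake ends the episode.
        return (1.0, False) if correct else (-1.0, True)
    # Majority class: reward scaled by lam in [0, 1]; the episode continues.
    return (lam, False) if correct else (-lam, False)
```

With `lam = 0.3`, a correct majority-class prediction earns 0.3 while a correct minority-class prediction earns 1.0, so minority samples weigh more in the cumulative reward.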
In ICMDP, the policy function receives a sample and reports the probability of every label. In reinforcement learning, the intention is to maximize the discounted cumulative reward. Equation (7) is termed the return function, which accumulates all the rewards that the agent collects while searching the space. The discount factor $\gamma \in (0, 1]$ [50] is the coefficient that weights the effect of each reward. The function Q measures the quality of a state-action combination. Equation (8) is expanded according to Bellman's formula [51]:

$$Q^{\pi}(s, a) = E_{\pi}\left[ r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right].$$
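For orientation, the policy, the return, and the Q function are usually written as follows in this setting. This is a sketch of the standard forms under the definitions above; the symbol $G_t$ for the return is an assumed name, and the original equations may use different notation.

```latex
\begin{align}
\pi_\theta(a \mid s) &= P(a_t = a \mid s_t = s) \\
G_t &= \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \\
Q^{\pi}(s, a) &= E_{\pi}\left[\, G_t \mid s_t = s,\ a_t = a \,\right]
\end{align}
```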
By maximizing the function Q under the policy π, more cumulative rewards can be achieved. The optimal policy $\pi^*$ is obtained from the optimal function $Q^*$, and by combining equations (9) and (10), the function $Q^*$ can be expressed recursively [27]. In a low-dimensional state space, the function Q can easily be solved with a table. However, the table technique is inadequate when the state space is high-dimensional or continuous. To solve this problem, Q-learning algorithms are used. In these algorithms, the tuple (s, a, r, s′) is saved in the experience replay memory M. The agent draws a mini-batch B from M and performs gradient descent on these data using the loss function of equation (12), in which y is the target estimate of the function Q [27], s′ is the state following s, a′ is the action performed in s′, and end indicates whether the agent has misclassified a minority-class sample. Finally, the weights θ of the policy π are updated with the resulting gradient. In conclusion, the optimal function $Q^*$ can be achieved by minimizing the loss function presented in equation (12). Notably, the optimal policy $\pi^*$ is derived from $Q^*$, which gives the optimal model for the proposed classifier.
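The target and loss used in this kind of Q-learning update can be sketched as below. This assumes the usual deep Q-learning form, in which the target is r when the episode has ended and r + γ max Q(s′, a′) otherwise, and the loss is the sum of squared differences over the mini-batch. The names `td_target`, `batch_loss`, and `q_fn` are hypothetical, and `q_fn` stands in for the policy network by returning a list of Q values, one per action.

```python
def td_target(reward, end, next_q_values, gamma):
    """Target y for one transition."""
    if end:
        # Terminal transition: no future reward is bootstrapped.
        return reward
    return reward + gamma * max(next_q_values)


def batch_loss(batch, q_fn, gamma):
    """Sum of squared errors between targets and current Q estimates."""
    loss = 0.0
    for state, action, reward, next_state, end in batch:
        y = td_target(reward, end, q_fn(next_state), gamma)
        loss += (y - q_fn(state)[action]) ** 2
    return loss
```

In a PyTorch setting the same quantities would be computed on tensors and the loss would be backpropagated through the network; the plain-Python version only shows the arithmetic.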

Overall Algorithm.
We devised the simulation environment according to the above. The structure of the policy network depends on the complexity and number of training samples. The network input follows the structure of the training samples, and the number of outputs equals the number of data classes, which is 2. The general training algorithm of the RLMD-PA model is displayed in Algorithm 2. In this algorithm, the policy weights are first initialized using the ABC algorithm, and then the agent continues the training process until an optimal policy is reached. Actions are chosen by a greedy policy and evaluated by Algorithm 3. The algorithm is repeated E times, where E is taken as 18,000 in this paper. At each step, the policy network weights are stored.
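One episode of the environment described above might look like the following sketch. It is an illustration under stated assumptions, not the authors' Algorithm 2: the function name `run_episode` is hypothetical, `policy` is any callable that maps a sample to an action in {0, 1}, label 0 denotes the minority class, and the weight update that would follow each step is omitted.

```python
import random


def run_episode(samples, policy, lam, minority_label=0, seed=0):
    """Play one ICMDP episode and return (total_reward, transitions)."""
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)  # a new sample order in every episode
    total, transitions = 0.0, []
    for x, y in order:
        a = policy(x)  # greedy action from the current policy
        correct = (a == y)
        if y == minority_label:
            r, end = (1.0 if correct else -1.0), (not correct)
        else:
            r, end = (lam if correct else -lam), False
        total += r
        transitions.append((x, a, r, end))
        if end:  # a minority-class mistake terminates the episode
            break
    return total, transitions
```

The collected transitions correspond to the tuples that would be stored in the replay memory during training.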

Dataset.
Cardiac magnetic resonance imaging (CMR) [52] allows for comprehensive anatomical and functional evaluation of the heart as well as detailed tissue characterization [53]. It is the preeminent imaging modality for the noninvasive diagnosis of myocarditis without biopsy. The Lake Louise criteria (LLC) [54] introduced benchmark criteria for diagnosing myocarditis using CMR [55] based on the presence of myocardial necrosis, edema, and hyperemia. The presence of late gadolinium enhancement confirms myocardial necrotic damage. T2-weighted images uncover areas of interstitial edema, which indicates an inflammatory response. T1-weighted images before and after contrast can depict hyperemia in the myocardial tissue. Fulfilling two of the three LLC criteria confers 80% accuracy for diagnosing myocarditis [56]. This article presents a model for identifying myocarditis by considering the three LLC criteria.
A one-year CMR research project on myocarditis was conducted from September 2016 at Omid Hospital in Tehran, Iran, where we performed CMR on patients who were clinically suspected to have myocarditis (e.g., chest pain, elevated troponin, negative functional imaging and/or coronary angiographic findings, and suspected viral etiology) and the treating physician assessed that CMR would likely affect clinical management (e.g., ongoing symptoms, ongoing myocardial injury evidenced by persistent ECG abnormalities, and presence of ventricular dysfunction). The protocol had been approved by the local ethics committee. CMR examination was performed on a 1.5-Tesla system [57]. All cases were scanned with body coils in standard supine position. T1-weighted images were acquired in the axial views. Shortly after gadolinium injection, the T1-weighted sequences were repeated. After approximately 10-15 minutes, late gadolinium enhancement [58] sequences were performed in standard left ventricular short- and long-axis views. Table 1 summarizes the CMR sequence parameters [3].
A total of 586 patients were identified who had positive evidence of myocarditis on the CMR images, which might show one or more areas of disease. A total of 307 healthy subjects were included as controls. We chose eight CMR images from each patient or control subject for the analysis, which were one long-axis image and one short-axis image acquired using each of the following four CMR sequences: late gadolinium enhancement, perfusion, T2-weighted, and steady-state free precession. The final CMR dataset comprises 4,686 and 2,449 samples from sick (i.e., myocarditis) and healthy subjects, respectively. Figure 3 shows example images obtained from this dataset. It may be noted that in this study, analysis is performed at the image level, and not at the patient level. In other words, prediction is based on a single image regardless of how many images are available for each patient.
(1) Function Reward(x_t, a_t, y_t):
…
(8)     end_t = True;
(9)   end
(10) else
(11)   if y_t == a_t then
(12)     r_t = λ;
(13)   else
(14)     r_t = −λ;
(15)   end
(16) end
(17) return

… purposes. Approval was granted on the grounds of existing datasets. Informed consent was received from all of the patients in this study. All methods were carried out in accordance with relevant guidelines and regulations. Ethical approval for using these data was obtained from the Tehran Omid Hospital.

Metrics.
To evaluate the classification performance of the proposed model, we used six standard performance metrics, namely, accuracy, recall, precision, F-measure, specificity, and G-means [59], which are defined in terms of TP, TN, FN, and FP, the numbers of true positives, true negatives, false negatives, and false positives, respectively. The F-measure and G-means are commonly applied to evaluate imbalanced classification [27], which matches our dataset sample distribution and the motivation for our proposed method. In addition, it is noteworthy that our prediction is per image. In this way, the intelligent myocarditis classification system can effectively screen entire CMR studies and flag individual images for scrutiny by physician readers. For this purpose, low FP and high recall would be desired.
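The six metrics can be computed from the confusion-matrix counts as in the sketch below, which uses their standard definitions (with G-means taken as the geometric mean of recall and specificity). The function name `classification_metrics` is hypothetical, and the sketch assumes that none of the denominators is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "g_means": math.sqrt(recall * specificity),
    }
```

Because G-means multiplies recall by specificity, it drops sharply when either class is poorly recognized, which is why it is informative for imbalanced data.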

Details of Model.
This work used Python and the PyTorch framework. The code is written in Jupyter notebooks. We used five layers of two-dimensional convolution for the CNN, with 128, 64, 32, 16, and 8 filters. The kernel size, stride, and padding in each layer are 3, 2, and 1 for both dimensions, respectively. Each convolution layer is followed by a max-pooling layer with dimensions of 2 × 2. The three fully connected layers have 128, 64, and 32 hidden units, respectively. To prevent overfitting, dropout with a probability of 0.4 and early stopping are employed. In every experiment, the batch size is set to 64. The images in the dataset are in gray-scale, and the light intensities of image pixels are mapped to the range [0, 1]. The images in the dataset come in different sizes and are all resized to 100 × 100 for analysis.
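A quick way to follow the spatial size of the feature maps through such a stack is the output-size arithmetic used by PyTorch's `Conv2d` and `MaxPool2d` (floor division, no dilation). The sketch below is only a shape check under assumed defaults: pooling stride equal to the 2 × 2 window and no pooling padding. The helper names are hypothetical.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out(n, kernel=2):
    """Output size of max pooling with stride == kernel; None if it does not fit."""
    return (n - kernel) // kernel + 1 if n >= kernel else None


def trace_shapes(size, n_layers=5):
    """List (after_conv, after_pool) per layer, stopping when pooling cannot apply."""
    trace = []
    for _ in range(n_layers):
        c = conv_out(size)
        p = pool_out(c)
        trace.append((c, p))
        if p is None:
            break
        size = p
    return trace
```

For a 100 × 100 input, `trace_shapes(100)` gives 50 then 25, 13 then 6, and 3 then 1 for the first three stages; the fourth convolution keeps the size at 1, where a 2 × 2 window no longer fits under these assumptions. The deeper layers therefore presumably rely on a detail not stated here, such as pooling padding or a different pooling stride.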

Experimental Results.
While standard techniques like data augmentation and weighted loss function [60] can sometimes be used to correct the imbalanced data distributions, they are not applicable in all situations. In our experiments, data augmentation and weighted loss function do not enrich our model, which is not unexpected.
We used k-fold cross-validation with k = 5 (5-CV) in all our implementations. The entire dataset is divided into k subsets; k − 1 subsets are used for training and the remaining one for testing. This procedure is iterated k times so that every subset is used exactly four times for training and once for testing. All parameters are expressed as means, standard deviations, medians, minimums, and maximums. First, we compared our proposed method with the only published work in this field, CNN-KCL [3]. Next, to investigate the contributions of the two distinct components ABC and RL in our model, we compared the performance of a basic model without ABC and RL, that is, CNN + random weight, versus the models CNN + ABC and CNN + RL, which used ABC and RL for training, respectively. The evaluation results of our RLMD-PA model performance as well as the aforementioned comparisons on the Z-Alizadeh Sani myocarditis dataset are presented in Tables 2 and 3. In general, the RLMD-PA model reduces the error by more than 43%. From the means of all the performance metrics, the RLMD-PA model outperforms the CNN-KCL method as well as the CNN + random weight, CNN + ABC, and CNN + RL combinations of its components. Both ABC and RL individually improve on the basic CNN across all assessed performance metrics, which supports the combined use of weight initialization and reinforcement learning. For better visualization, the results are illustrated in Figure 4. In terms of time, the best model was obtained after 100 iterations in 2 hours, while CNN-KCL reached its best after 350 iterations in 5 hours.

Table 3: 5-CV classification performances (F-measure, specificity, and G-means) obtained for automated myocarditis detection using various combinations of methods with the Z-Alizadeh Sani myocarditis dataset.
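The 5-fold split used in these experiments can be sketched with plain index arithmetic. The generator name `k_fold_indices` is hypothetical, and the sketch omits the shuffling (and any stratification by class) that would normally precede the split.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

With the 7,135 images of this dataset (4,686 + 2,449), each of the five folds holds 1,427 test images and the remaining 5,708 are used for training.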
Standard machine learning classifiers have not been successful in classifying medical images, because they typically treat images as one-dimensional vectors, which causes the neighboring pixels of a specific pixel to be spaced apart. In order to compare with our deep model, we used five algorithms: support vector machine (SVM) [61], k-nearest neighbor [62], naïve Bayes [63], logistic regression [64], and random forests [65] to classify the CMR images of the study dataset. SVM performed the best among these methods but is still inferior to deep models. The results are summarized in Tables 4 and 5, and the mean performance metrics are shown in Figure 5.

Investigation of Other Metaheuristic Algorithms on the Algorithm.
The proposed model employs the ABC algorithm in conjunction with backpropagation for the initial values. To compare the performance of ABC against alternative training algorithms, we replaced ABC in our model with five conventional algorithms, namely, gradient descent with momentum backpropagation (GDM) [66], gradient descent with adaptive learning rate backpropagation (GDA) [67], gradient descent with momentum and adaptive learning rate backpropagation (GDMA) [68], one-step secant backpropagation (OSS) [69], and Bayesian regularization backpropagation (BR) [70], and four metaheuristic algorithms, namely, gray wolf optimization (GWO) [71], the Bat algorithm (BA) [72], Cuckoo optimization algorithm (COA) [73], and whale optimization algorithm (WOA) [74]. The population size and number of function evaluations are 100 and 25,000 for all metaheuristic algorithms, respectively. Other parameter settings can be seen in Table 6. The performance metrics of these comparisons are summarized in Tables 7 and 8 and illustrated in Figure 6. In general, metaheuristic algorithms are better than conventional algorithms, with the exception of GDMA, in terms of accuracy, recall, and F-measure scores. Importantly, the ABC algorithm outperformed all conventional and metaheuristic algorithms, reducing the error in the recall and F-measure criteria by more than 25% and 22%, respectively.

Table 4: 5-CV classification performances (accuracy, recall, and precision) obtained for automated myocarditis detection using various machine learning algorithms with the Z-Alizadeh Sani myocarditis dataset.

Table 5: 5-CV classification performance (F-measure, specificity, and G-means) obtained for automated myocarditis detection using various machine learning algorithms with the Z-Alizadeh Sani myocarditis dataset.

Explore the Reward Function.
The reward function is a practical device that helps the agent to achieve its goal. In this work, the minority class reward is +1/−1, while the majority class reward is +λ/−λ. To examine the effect of the value λ on the classification model, we test 11 values of λ ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} on the model. Details of the results for all the criteria for these experiments are given in Table 9. For better visualization, we have plotted the trends in Figure 7. On examination, for the accuracy criterion, the chart ascends as λ goes from 0 to 0.3 and descends as λ goes from 0.3 to 1. This pattern holds for all criteria. If λ = 0, the importance of the majority class is disregarded, and if λ = 1, the importance of both classes is the same. Although the minority class is more important to us, the majority class cannot be ignored.

Table 8: Results of 5-CV classification performances (F-measure, specificity, and G-means) obtained for automated myocarditis detection using various conventional and metaheuristic algorithms with the Z-Alizadeh Sani myocarditis dataset.

Conclusion and Future Directions
This article presents a new model for classifying myocarditis images. The proposed model consists of two steps. First, the model weights are initialized using the ABC algorithm. Next, the classification is formulated as an ICMDP problem. The environment assigns a high reward to the minority class and a low reward to the majority class. The algorithm terminates when the agent makes a wrong classification for the minority class, or the number of episodes runs out. We performed several experiments to examine various factors that affect the performance of the proposed model. The designed experiments confirmed that the RLMD-PA model with ABC and RL is an effective classifier for myocarditis images.
In the future, we will try to employ an ensemble convolutional neural network (ECNN) as our model, which uses a set of CNNs and combines them to yield higher performance. In addition, we can also work with the generative adversarial network (GAN), which is widely used in many applications. It may also be worth exploring the developed model for other medical applications such as stroke detection, cancer detection, and plaque detection.

Data Availability
The dataset used to support the findings of this study is available on GitHub: https://github.com/vahid-moravvej/Z-Alizadeh-Sani-myocarditis-dataset.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Seyed Vahid Moravvej, Roohallah Alizadehsani, and Ru-San Tan contributed to prepare the first draft. Nahrizul Adib Kadri, Muhammad Mokhzaini Azizan, and U. Rajendra Acharya contributed to editing the final draft. Sadia Khanam and Zahra Sobhaninia contributed to all analysis of the data and produced the results accordingly. Afshin Shoeibi and Fahime Khozeimeh searched for papers and then extracted data. Zahra Alizadeh Sani, N. Arunkumar, Abbas Khosravi, and Saeid Nahavandi provided overall guidance and managed the project.