Chimp Optimization Algorithm to Optimize a Convolutional Neural Network for Recognizing Persian/Arabic Handwritten Words

Handwritten character recognition is an attractive subject in computer vision. In recent years, numerous researchers have implemented techniques to recognize handwritten characters using optical character recognition (OCR) approaches for many languages. One of the most common methods to improve OCR accuracy is based on convolutional neural networks (CNNs). A CNN model contains several kernels accompanied by pooling layers and nonlinear functions. This model overcomes the problem of adjusting the weights and interconnections of a neural network (NN) to create an appropriate pipeline for processing spatial and temporal information. However, training a CNN is challenging. Various optimization strategies, such as the firefly algorithm (FA) and ant colony optimization (ACO), have recently been used to optimize a CNN's biases and weights. In this study, we apply a well-known nature-inspired technique called the chimp optimization algorithm (ChOA) to train a classical CNN structure, LeNet-5, for Persian/Arabic handwritten word recognition. The proposed method is tested on two well-known, publicly available handwritten word datasets. To evaluate the approach thoroughly, the results are compared with three optimization methods: ACO, FA, and particle swarm optimization (PSO). The results indicate that the proposed ChOA technique considerably improves the performance of the original LeNet model and also outperforms the other methods.


Introduction
Image processing and computer vision have been popular for solving numerous problems such as object recognition, face recognition, place recognition, and emotion recognition. One of the widespread real-world applications of image processing is optical character recognition (OCR) [1]. OCR is a technology developed to detect and recognize text characters in any image containing printed or handwritten text. This technology has a wide range of applications, most notably automatic identification of scanned documents, automatic reading of license plates, and automatic recognition of traffic signs [2,3]. OCR can be considered a compound problem because of the variety of styles, complex rules, fonts, orientations, and languages in which text can be written. Therefore, approaches from diverse areas of computer science (i.e., natural language processing, pattern recognition, and image processing) can be used to overcome its challenges. Persian and Arabic scripts are specific sets of characters used in Middle Eastern countries, and these characters are often found in books, wall carvings, or street signboards. However, not everyone is able to read images of Persian and Arabic scripts, so there is a need for a system that converts these images into text that can be translated by machine. Many techniques have been developed to create more efficient and robust OCR systems for recognizing characters [4].
Deep learning (DL) is recently gaining attention, and it has been considered as the base pipeline for numerous artificial intelligence (AI) applications including emergency vehicle recognition [5], biometric authentication based on finger knuckle and fingernail [6], synthetic aperture radar (SAR) image processing [7], the diagnosis of tumor location in breast cancer [8], speech recognition [9,10], liver or lung tumor segmentation [11][12][13], and various applications in computer networks such as network traffic classification [14] and network intrusion detection system [15]. In the 1980s, researchers developed the first DL-based architecture built for computer vision applications on the foundations of artificial neural networks (ANNs) [16].
The ANNs with additional hidden layers have a large capacity for feature extraction. However, an ANN often encounters gradient propagation problems or converges to a local optimum when its structure becomes complex and deep. The core benefit of DL is that it extracts features from raw data and learns these features automatically. These extracted features are obtained by composing lower-level features to generate the higher levels of the hierarchy [17].
DL is able to handle more complex challenges quickly and well because it employs more complex frameworks, which permit massive parallelization. These complex frameworks can be used to reduce the error of regression techniques and to enhance classification efficiency. A DL pipeline comprises several different layers (e.g., convolution layers (Conv layers), pooling layers, activation functions, fully connected (FC) layers, memory cells, gates, and so on), depending on the model used (i.e., deep belief network, recursive neural networks, recurrent neural networks, convolutional neural networks, or unsupervised pretrained networks). Among the several DL pipelines, the focus of this study is on CNNs.
Although a CNN framework is able to learn numerous complex patterns through several biases and weights inside the Conv layers, this framework encounters many challenges in the training step to produce the optimal outcomes.
These biases and weights reach their best possible values through a learning step with a large number of input samples. In fact, in all CNN models, the number of training samples plays a crucial role in obtaining the best possible outcome. To this end, several optimization strategies have been implemented to adjust the values of the biases and weights. The best-known techniques are adaptive gradient strategies. Essentially, these techniques adjust the learning rate (LR) during backpropagation: they reduce the LR of a parameter when its gradient is large and increase it when the gradient is small. The most preferred approach among gradient-based approaches is stochastic gradient descent (SGD) [18]. However, because the LR in SGD must be tuned manually, SGD has been shown to perform poorly, especially when the network is large, as a CNN model is. This significantly increases the training time of the models for large-scale input. To address this challenge and increase the efficiency of adaptive gradient approaches, new variants of adaptive gradient techniques have been proposed. These algorithms include YOGI [19], which improves how the LR is varied to achieve better convergence, and Nostalgic Adam [20], which places larger weights on past gradients than on recent ones. However, they have not gained popularity in image processing with CNNs, which has led to the use of metaheuristic algorithms as alternatives.
Recently, many metaheuristic schemes have been used to solve complicated problems such as detection problems [21] and scheduling [22]. One popular way to classify metaheuristics is by their source of inspiration: human-based approaches, physics-based approaches, swarm intelligence approaches, and evolutionary approaches [23,24]. Evolutionary techniques simulate patterns in natural evolution and apply operators inspired by biological mechanisms such as mutation and crossover. The genetic algorithm (GA), which is motivated by Darwinian evolutionary ideas, can be considered a conventional evolutionary method. Conventional approaches of this group include evolution strategy [25], differential evolution [26], and evolutionary programming [27].
Swarm intelligence techniques, which simulate the behavior of animals hunting or moving in groups, are another group of metaheuristics [28]. The sharing of information among all individuals throughout the optimization procedure is the main characteristic of this group. Conventional algorithms of this group include dolphin echolocation [29], the salp swarm approach [30], and the krill herd approach [31].
Physics-based approaches originate from physical laws in real life. These strategies describe the interaction of search solutions through rules rooted in physical processes. The most widely used strategies in this group are charged system search [32], the multiverse optimizer [33], and the gravitational search algorithm [34]. Human-based approaches form the last group of optimizers and are motivated by human behavior and human cooperation in communities. One of the most widely used techniques in this group, motivated by human sociopolitical development, is the imperialist competitive technique [35,36].
Metaheuristic algorithms are valued for their simplicity, gradient independence, and ability to escape local optima. These properties make them a strong choice for training ANNs with a large number of parameters [37]. Zhang et al. [38] used a whale optimization algorithm (WOA) to reduce the error rate during the pretraining process of a CNN for the classification of skin cancer images. This optimization approach employs half-value precision as the cost function. The outcomes of their analysis indicated that the accuracy of the approach is better than that of other recently published approaches.
In this paper, from the numerous kinds of the metaheuristic techniques, the chimp optimization algorithm (ChOA) is employed to obtain weights and biases for improving the performance of the LeNet-5.
This optimizer belongs to the swarm-based metaheuristic algorithms and is inspired by the sexual motivation (SM) and individual diversity (ID) of chimps in their group hunting. Also, to evaluate our results, some other well-known optimizers, namely the firefly algorithm (FA), ant colony optimization (ACO), and particle swarm optimization (PSO), are used for comparison.
Although many metaheuristic optimization algorithms (MOAs) have been employed to improve CNN performance, our contribution can be considered in four categories: (1) Based on the no free lunch (NFL) theorem [39,40], which says that "there is no metaheuristic algorithm that can solve all optimization problems well, so there is always a need to develop new algorithms," this is the first time that the chimp optimization algorithm (ChOA) is used to improve CNN performance by adjusting the learnable parameters. (2) We introduce an approach to reducing the classification errors of the CNN based on the local and global search capabilities of the gradient descent (GD) and ChOA algorithms. (3) We improve the performance of the GD algorithm by finding the optimal initial weights using the ChOA algorithm. (4) The CNN improved by GD and ChOA is used to recognize handwritten words in both Persian and Arabic. The rest of this paper is organized as follows. Section 2 presents a literature review of convolutional neural networks. Section 3 explains the chimp optimization algorithm. In Section 4, the proposed method is explained in detail. Section 5 presents and discusses the experimental results, and finally Section 6 gives the conclusion.

Convolutional Neural Network
Convolutional neural networks (CNNs) are considered the most common type of ANNs. Theoretically, CNNs resemble a multilayer perceptron (MLP), since each neuron in an MLP maps its weighted inputs to an output using an activation function. They are specifically designed to recognize multidimensional data with a high degree of invariance to distortion, shift, and scaling [41][42][43]. The convolution layers (Conv layers) generate new images, called feature maps, from the input image. These Conv layers operate quite differently from other neural network layers: they use convolution kernels to transform images rather than connection weights and a weighted sum. The convolution product recognizes visual patterns within the input image [38,44]. Each neuron in a feature map is connected to the neurons of the previous layer within a region that has the same dimensions as the corresponding kernel. This region is the receptive field of the neuron. Weight sharing is another beneficial property: the identical convolutional kernel is used over the whole image. By allowing the same feature to be searched for across the whole image, weight sharing decreases the number of parameters to be trained [44,45]. The output of a convolution layer can be calculated as

$$I_i^{l} = f\left(\sum_{j} I_j^{l-1} \otimes w_{ij} + b_i\right), \quad (1)$$

where $I_i^{l}$ is the output, $I_j^{l-1}$ is the jth feature map of the previous layer, $f$ is the activation function, $\otimes$ indicates the convolution operator, $w_{ij}$ denotes the convolution kernels, and $b_i$ is the bias value [46]. The pooling layers reduce the dimension of the input data. This reduction is implemented by combining the neighboring pixels of a given image region into a single value. The well-known pooling operations are L2 pooling, average pooling (Mean-Pooling), and maximum pooling (Max-Pooling). L2 pooling computes the L2 norm of the inputs, Max-Pooling passes the maximum value through, and Mean-Pooling takes the average of the input values [47,48].
An FC layer is a layer of an ANN in which all neurons connect to all neurons of the following layer. For each node of an FC layer, the input is a linear combination of the outputs of the prior layer plus a bias.
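The three pooling operations described above can be compared on a small feature map. The following is a minimal pure-Python sketch, not the implementation used in this work; the function name `pool2d`, the non-overlapping window layout, and the example values are illustrative assumptions.

```python
import math

def pool2d(image, size, mode="max"):
    """Pool a 2D list with non-overlapping size x size windows (illustrative helper)."""
    reducers = {
        "max": max,                                                   # Max-Pooling
        "mean": lambda window: sum(window) / len(window),             # Mean-Pooling
        "l2": lambda window: math.sqrt(sum(v * v for v in window)),   # L2 pooling
    }
    reduce_window = reducers[mode]
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the neighboring pixels of one image region.
            window = [image[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            pooled_row.append(reduce_window(window))
        pooled.append(pooled_row)
    return pooled

# Hypothetical 4 x 4 feature map.
feature_map = [
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 4.0, 0.0, 3.0],
    [5.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
max_pooled = pool2d(feature_map, 2, "max")    # [[4.0, 4.0], [5.0, 1.0]]
mean_pooled = pool2d(feature_map, 2, "mean")  # [[2.5, 1.75], [1.25, 1.0]]
l2_pooled = pool2d(feature_map, 2, "l2")
```

Each 2 × 2 region collapses to one value, so the 4 × 4 map becomes 2 × 2 regardless of which reduction is chosen.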
In this paper, to recognize words correctly, we use the LeNet-5 model [49,50] with eight layers. The initial Conv layer (Conv1) uses six different filters of dimension 5 × 5 to extract key information from the input data (the input grayscale image size is 32 × 32) and create each feature map. Every neuron of Conv1 receives its inputs from a 5 × 5 receptive field in the preceding (input) layer. Due to the weight sharing technique, the same bias and weights are used for all units in a feature map, producing a linear position-invariant kernel over all areas of the input image. During the training process, the shared weights of the Conv layer are adjusted. With its six 5 × 5 filters and their biases, the Conv1 layer has 122304 connections, 4704 neurons, and 156 trainable parameters [51,52]. The succeeding layer is a downsampling layer (Pooling2) with a 2 × 2 kernel for each feature map and six feature maps. After the downsampling strategy is applied to the receptive field using the mean value, the result is multiplied by one trainable coefficient and added to a second one; these two trainable coefficients differ between feature maps but are shared by all output pixels of a feature map. The Pooling2 layer has 5880 connections and 12 trainable parameters. The next convolutional layer is Conv3, which has a 5 × 5 kernel for each feature map and 16 feature maps. The succeeding layer (Pooling4) is a downsampling layer with a 2 × 2 filter for each feature map and 16 feature maps. This downsampling layer has 2000 connections and 32 trainable parameters. The next convolutional layer is Conv5, which has a 5 × 5 kernel for each feature map, 120 feature maps, and 48120 connections and trainable parameters. The next layer is a fully connected layer (FC6) with 84 neurons. The numbers of trainable parameters, connections, neurons, and feature maps of each CNN layer are listed in Table 1.
The overall pipeline of LeNet-5 is shown in Figure 1.
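The layer sizes quoted above follow from simple counting, which the sketch below reproduces. It is an illustrative check rather than part of the method: the helper names are hypothetical, stride-1 "valid" convolutions and non-overlapping pooling are assumed, and Conv3 is left out because the classical LeNet-5 connects it only partially to the Pooling2 maps, so the full-connectivity formula does not apply to it.

```python
def conv_layer_stats(in_maps, out_maps, kernel, in_size):
    """Counts for a conv layer in which every output map sees all input maps."""
    out_size = in_size - kernel + 1                      # valid convolution, stride 1
    weights_per_neuron = in_maps * kernel * kernel + 1   # kernel weights plus one bias
    neurons = out_maps * out_size * out_size
    return {
        "params": out_maps * weights_per_neuron,         # shared within each feature map
        "neurons": neurons,
        "connections": neurons * weights_per_neuron,
    }

def pool_layer_stats(maps, window, in_size):
    """Counts for a LeNet-style pooling layer with two trainable coefficients per map."""
    out_size = in_size // window
    neurons = maps * out_size * out_size
    return {
        "params": 2 * maps,                               # one multiplier and one offset per map
        "neurons": neurons,
        "connections": neurons * (window * window + 1),   # window inputs plus the offset
    }

conv1 = conv_layer_stats(in_maps=1, out_maps=6, kernel=5, in_size=32)
pool2 = pool_layer_stats(maps=6, window=2, in_size=28)
pool4 = pool_layer_stats(maps=16, window=2, in_size=10)
conv5 = conv_layer_stats(in_maps=16, out_maps=120, kernel=5, in_size=5)

for name, stats in [("Conv1", conv1), ("Pooling2", pool2), ("Pooling4", pool4), ("Conv5", conv5)]:
    print(name, stats)
```

Under these assumptions the counts agree with the figures in the text: 156 parameters, 4704 neurons, and 122304 connections for Conv1; 12 parameters and 5880 connections for Pooling2; 32 parameters and 2000 connections for Pooling4; and 48120 parameters and connections for Conv5.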

Chimp Optimization Algorithm
In order to design a metaheuristic algorithm, two criteria need to be considered: exploitation (intensification) and exploration (diversification). The solution accuracy and convergence speed of the strategy are determined mainly by the balance between these two criteria. Exploitation can be described as searching near promising solutions to improve their quality locally. In contrast, exploration can be described as searching the space globally. This ability is related to avoiding local optima and escaping local optima entrapment. In the first step, an efficient optimizer needs to explore the search space to discover different solutions. After an adequate transition, the technique typically uses local information to generate improved solutions, which are commonly in the areas close to the existing solutions [53,54]. The social behavior of chimps differs from other flocking behaviors in two respects: SM and ID. SM implies that, in addition to the nutritional benefits of group hunting, chimps' hunting can be influenced by the probable social advantages of obtaining food (meat). Obtaining food offers a chance to trade it in return for social favors such as grooming and sex. In the final phase, this motivation leads chimps to forget their duties in the hunt, so they attempt to obtain food chaotically.
This unconditional behavior in the final phase improves the convergence rate and supports the exploitation step [53,54].
ID means that the individuals do not all have the same intelligence and ability as the other group members, but they all carry out their responsibilities as well as they can. The ability of each individual can be beneficial in some step of the hunting procedure. Hence, each member of the group accepts responsibility for a part of the hunt according to its particular ability.
In a chimp colony, we can observe four kinds of chimps: attackers, chasers, barriers, and drivers. Drivers are responsible for following the prey without trying to catch up with it. Barriers are responsible for placing themselves in trees to build a dam across the path of the prey. Chasers are responsible for catching up with the prey by moving rapidly after it. Lastly, attackers anticipate the breakout route of the prey to force it back towards the chasers or down into the lower canopy. This strategy is illustrated in Figure 2.
Accordingly, attackers require much greater cognitive skill to anticipate the next actions of the prey. This important attacking role therefore correlates positively with physical ability, smartness, and age. In addition, group members can either keep the same responsibility during the entire procedure or change duties during the same hunt [55,57]. As mentioned before, the prey can be hunted during the exploitation and exploration steps. To this end, we formulate chasing and driving the prey as follows:

$$d = \left| c \cdot X_{prey}(t) - m \cdot X_{chimp}(t) \right|, \qquad X_{chimp}(t+1) = X_{prey}(t) - a \cdot d, \quad (2)$$

where $X_{chimp}$ is the position vector of a chimp, $X_{prey}$ is the position vector of the prey, $d$ is the distance between the chimp and the prey, $c$, $m$, and $a$ are the coefficient vectors, and $t$ is the current iteration number. The $a$, $m$, and $c$ vectors are calculated by

$$a = 2 \cdot f \cdot r_1 - f, \qquad c = 2 \cdot r_2, \qquad m = \text{Chaotic\_value}, \quad (3)$$

where $r_1$ and $r_2$ are random vectors in the range [0, 1] and $f$ is reduced nonlinearly from 2.5 to 0 over the iterations, in both the exploration and exploitation steps. Lastly, $m$ is a chaotic vector computed from one of several chaotic maps to represent the effect of SM on the hunting procedure [57]. Six chaotic maps have been used in [57], as shown in Table 2 and Figure 3.
The Gauss/mouse map has been used in this study, since it has the greatest effect on finding the global minimum and on convergence speed.
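A single driving and chasing step can be written out directly from the coefficient definitions of the original ChOA formulation, a = 2·f·r1 − f, c = 2·r2, and m equal to a chaotic value, together with d = |c·X_prey − m·X_chimp| and X_chimp(t+1) = X_prey − a·d. The sketch below is a hedged illustration: the function names are hypothetical, the update is applied per dimension, and the chaotic value is passed in as a plain number.

```python
import random

def choa_coefficients(f, chaotic_value, rng):
    """Draw one set of (a, c, m) coefficients for a given f (hypothetical helper)."""
    r1, r2 = rng.random(), rng.random()   # random numbers in [0, 1)
    a = 2.0 * f * r1 - f                  # lies in [-f, f)
    c = 2.0 * r2                          # lies in [0, 2)
    m = chaotic_value                     # stands in for the chaotic map output
    return a, c, m

def chase_step(x_chimp, x_prey, f, chaotic_value, rng):
    """Move one chimp relative to the prey, one dimension at a time."""
    new_position = []
    for xc, xp in zip(x_chimp, x_prey):
        a, c, m = choa_coefficients(f, chaotic_value, rng)
        d = abs(c * xp - m * xc)          # distance term
        new_position.append(xp - a * d)   # position update
    return new_position

rng = random.Random(0)
early = chase_step([1.0, 2.0], [3.0, 4.0], f=2.5, chaotic_value=0.7, rng=rng)
late = chase_step([1.0, 2.0], [3.0, 4.0], f=0.0, chaotic_value=0.7, rng=rng)
print(early, late)
```

With f = 0 the coefficient a vanishes and the chimp lands exactly on the prey position, which is the limiting case of the attack phase; with f = 2.5 the step can overshoot the prey considerably.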
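To make the chaotic factor concrete, the sketch below iterates the Gauss (mouse) map in its common textbook form, x(k+1) = 0 if x(k) = 0 and (1/x(k)) mod 1 otherwise. This form, the function names, and the starting value are assumptions made for illustration; the exact definition used in this study is the one listed in Table 2 and may differ in detail.

```python
def gauss_mouse_map(x):
    """One step of the Gauss (mouse) map in its common textbook form (assumed)."""
    if x == 0.0:
        return 0.0
    return (1.0 / x) % 1.0   # fractional part of 1/x

def chaotic_sequence(x0, length):
    """Generate `length` successive map values starting from x0."""
    values = []
    x = x0
    for _ in range(length):
        x = gauss_mouse_map(x)
        values.append(x)
    return values

# Hypothetical starting value; every generated value stays in [0, 1).
sequence = chaotic_sequence(0.37, 10)
print(sequence)
```

Because the outputs remain in [0, 1), they can be used directly as the chaotic coefficient m in the position update.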
There are two strategies for modeling the attacking behavior: blocking and encircling the prey. The hunt is typically carried out by the attacker chimps. By driving, blocking, and chasing, the chimps are able to discover the prey's location and then encircle it. To mimic this behavior mathematically, it is assumed that the first chaser, barrier, driver, and attacker are better informed about the position of a potential prey. These four chimps are also considered the best available solutions. Therefore, the positions of the other chimps are updated based on the best chimps' positions (the best solutions obtained so far). This process is illustrated in Figure 4.
In ChOA, when the attacker attacks the prey, the aforementioned operators are used to update the locations of all chimps (i.e., driver, chaser, barrier, and attacker) based on their current locations. However, the ChOA strategy may become trapped in local minima, so other operators are needed to prevent this problem. Since the exploration procedure among the chimps is based on the positions of the driver, chaser, barrier, and attacker chimps, all members diverge to seek prey and converge to attack it. This divergence means that the vector a takes random values smaller than −1 or greater than 1, so that all members are forced to diverge and move away from the prey. This divergence strategy permits the ChOA to search globally and corresponds to the exploration procedure.
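The update of an ordinary chimp from the four best chimps can be sketched as follows. In the original ChOA formulation each leader proposes a position through the same distance-based rule, and the chimp moves to the average of the four proposals; that averaging rule is taken from the original algorithm and is an assumption here, as are the function name and the example values.

```python
import random

def update_from_leaders(x, leaders, f, chaotic_value, rng):
    """Move chimp x using the attacker, barrier, chaser, and driver positions.

    `leaders` is a list of four position vectors; the new position is the mean
    of the four leader-based estimates (rule assumed from the original ChOA).
    """
    new_x = []
    for j, xj in enumerate(x):
        estimates = []
        for leader in leaders:
            a = 2.0 * f * rng.random() - f            # coefficient in [-f, f)
            c = 2.0 * rng.random()                    # coefficient in [0, 2)
            d = abs(c * leader[j] - chaotic_value * xj)
            estimates.append(leader[j] - a * d)       # estimate suggested by this leader
        new_x.append(sum(estimates) / len(estimates))
    return new_x

# Hypothetical leaders in a two-dimensional search space.
leaders = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
rng = random.Random(0)
exploring = update_from_leaders([9.0, -9.0], leaders, f=2.5, chaotic_value=0.7, rng=rng)
converged = update_from_leaders([9.0, -9.0], leaders, f=0.0, chaotic_value=0.7, rng=rng)
print(exploring, converged)
```

When f reaches 0, every estimate equals the corresponding leader position, so the chimp moves to the centroid of the four leaders.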
To model the attack mathematically, the value of f is reduced. The value of a is reduced as well, since it depends on f. In other words, a is a random value in [−f, f], and as f decreases from 2.5 to 0 over the iterations, the range of a shrinks accordingly. In particular, when the value of a is in the range [−1, 1], the next position of a chimp can be anywhere between its current position and the prey position. Figure 5 shows the inequality that forces the chimp to attack the prey.
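The link between the size of a and the attack behavior can be made explicit. The short derivation below is a sketch that assumes the coefficient definition and update rule of the original ChOA formulation, namely $a = 2 f r_1 - f$ with $r_1 \in [0, 1]$ and $X_{chimp}(t+1) = X_{prey}(t) - a\,d$, read one dimension at a time.

```latex
\begin{align*}
  0 \le r_1 \le 1 \;&\Longrightarrow\; -f \le a = 2 f r_1 - f \le f, \\
  \left| X_{chimp}(t+1) - X_{prey}(t) \right| &= |a|\, d, \\
  |a| \le 1 \;&\Longrightarrow\; \left| X_{chimp}(t+1) - X_{prey}(t) \right| \le d .
\end{align*}
```

Under these assumptions, once f has dropped below 1 every step lands within distance d of the prey, whereas for f greater than 1 values of |a| above 1 remain possible and push the chimp farther than d from the prey.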
To conclude, the search for prey in ChOA starts by creating a stochastic population of chimps (candidate solutions). Next, all group members are randomly divided into four independent groups named driver, chaser, barrier, and attacker. Every group member updates its f coefficient using its own group's strategy.
Throughout the iterations, the driver, chaser, barrier, and attacker chimps estimate the potential prey positions. Each candidate solution then updates its distance from the prey. The a and c vectors are adjusted to accelerate convergence while also avoiding local optima.
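The whole procedure can be put together in a compact loop. The sketch below is a simplified illustration under several assumptions, not the exact configuration used in the experiments: f decays linearly from 2.5 to 0 and is shared by all chimps (instead of group-specific nonlinear strategies), the chaotic value follows the textbook Gauss map and is re-seeded if it collapses to zero, the four leaders are simply the four best chimps of the current population, and a sphere function stands in for the CNN training loss.

```python
import random

def sphere(x):
    """Toy objective used in place of a network loss."""
    return sum(v * v for v in x)

def gauss_mouse(x):
    return 0.0 if x == 0.0 else (1.0 / x) % 1.0

def choa_minimize(objective, dim, n_chimps=20, n_iter=200, bounds=(-5.0, 5.0), seed=0):
    rng = random.Random(seed)
    low, high = bounds
    # Stochastic initial population of candidate solutions.
    population = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_chimps)]
    chaos = 0.7
    best_x, best_f = None, float("inf")
    for t in range(n_iter):
        ranked = sorted(population, key=objective)
        leaders = [list(p) for p in ranked[:4]]   # attacker, barrier, chaser, driver
        score = objective(leaders[0])
        if score < best_f:
            best_x, best_f = list(leaders[0]), score
        f = 2.5 * (1.0 - t / n_iter)              # assumed linear decay from 2.5 to 0
        chaos = gauss_mouse(chaos) or rng.random()  # re-seed if the map hits zero
        new_population = []
        for x in population:
            new_x = []
            for j in range(dim):
                estimate = 0.0
                for leader in leaders:
                    a = 2.0 * f * rng.random() - f
                    c = 2.0 * rng.random()
                    d = abs(c * leader[j] - chaos * x[j])
                    estimate += leader[j] - a * d
                # Average the four estimates and keep the result inside the bounds.
                new_x.append(min(high, max(low, estimate / 4.0)))
            new_population.append(new_x)
        population = new_population
    final_best = min(population, key=objective)
    final_score = objective(final_best)
    if final_score < best_f:
        best_x, best_f = list(final_best), final_score
    return best_x, best_f

best_x, best_f = choa_minimize(sphere, dim=3)
print(best_x, best_f)
```

The returned pair is the best position seen over all iterations and its objective value; replacing `sphere` with a function that evaluates a network on training data turns the same loop into a weight search.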

Design of the Proposed Technique
As mentioned before, in this study, we propose to train a CNN model to recognize Arabic and Persian words. This CNN model is optimized using a metaheuristic approach to obtain the best possible biases and weights and thereby increase the model's efficiency.
In order to train CNNs with any optimization approach, the final aim and the main problem need to be formulated mathematically in an appropriate way. The variables of a CNN pipeline that determine the accuracy of the model are the biases and weights. Since the biases and weights are the trainable variables, altering these values in the neurons of all convolutional layers changes the output of the pipeline. Consequently, by using an optimizer to guide the selection of new biases and weights, we can obtain higher accuracy. Since the ChOA accepts all input variables in the form of a vector, the variables of the CNN can be given to this approach as

$$\left[\theta_1, \ldots, \theta_h, W_{1,1}, W_{1,2}, \ldots, W_{n,n}\right], \quad (4)$$

where $\theta_j$ is the bias (threshold) of the jth hidden node, $n$ is the dimension of the input sample, and $W_{ij}$ is the connection weight between the ith and jth layers.
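Packing the trainable variables into one vector, biases first and weights after, can be sketched as below. The helper names and the tiny example sizes are hypothetical; a real network would apply the same idea layer by layer.

```python
def flatten_parameters(biases, weights):
    """Concatenate biases and a 2D weight matrix into one flat vector."""
    vector = list(biases)
    for row in weights:
        vector.extend(row)
    return vector

def unflatten_parameters(vector, n_biases, n_rows, n_cols):
    """Recover the biases and the weight matrix from a flat vector."""
    biases = vector[:n_biases]
    flat_weights = vector[n_biases:]
    weights = [flat_weights[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
    return biases, weights

# Hypothetical toy layer with two biases and a 2 x 3 weight matrix.
biases = [0.1, -0.2]
weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
vector = flatten_parameters(biases, weights)
restored_biases, restored_weights = unflatten_parameters(vector, 2, 2, 3)
print(vector)
```

The optimizer only ever sees `vector`; the network is rebuilt from it each time a candidate solution has to be evaluated.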

Mathematical Problems in Engineering
In the next step, we need to define the objective function for the ChOA strategy. To be effective, a CNN needs to be fitted to the whole set of training data. To this end, the performance of a CNN can be evaluated using the average of the mean squared error (MSE) over the training data. By applying a set of training data (2D images or patches) to the CNN model (LeNet-5 in this study), we can measure the difference between the desired values and the obtained values through the following equation:

$$\mathrm{MSE} = \sum_{i=1}^{m} \left( o_i^{k} - d_i^{k} \right)^2, \quad (5)$$

in which $m$ is the number of outputs, $d_i^{k}$ is the ideal output of the ith input unit when the kth training sample is used, and $o_i^{k}$ is the actual output of the ith input unit when the kth training sample is applied to the input. Since the biases and weights take the best values found so far at each iteration, the probability of obtaining an improved model increases progressively. However, due to the stochastic nature of the ChOA strategy, there is no guarantee that the best CNN model is achieved. Nevertheless, with an adequate number of iterations, the ChOA strategy is able to reach a solution that works more efficiently than a random initial solution. Figure 6 shows the network training steps using the proposed algorithm. The learnable parameters include all parameters in both the convolution and the fully connected layers. The performance of the classical gradient descent (GD) algorithms depends strongly on the weight initialization, and with random initialization they can become trapped in a local optimum.
In this study, we use the global search capability of the ChOA algorithm to find an optimal weight initialization in order to achieve a more appropriate response. In ChOA, one vector of initial parameters is generated for each chimp. Here, in each iteration, the best answer is used in the CNN training phase. Thus, the probability of reaching the global optimum increases.
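The objective described above can be illustrated with a few lines of code. This is a sketch under the assumption that the per-sample error is the sum of squared differences between actual and desired outputs and that the fitness is its average over the training samples; the function names and example values are hypothetical.

```python
def sample_squared_error(outputs, targets):
    """Sum of squared differences over the m outputs of one training sample."""
    return sum((o - d) ** 2 for o, d in zip(outputs, targets))

def average_mse(all_outputs, all_targets):
    """Average of the per-sample errors over the whole training set."""
    errors = [sample_squared_error(o, d) for o, d in zip(all_outputs, all_targets)]
    return sum(errors) / len(errors)

# Two hypothetical training samples with two outputs each.
network_outputs = [[1.0, 0.0], [0.5, 0.5]]
desired_outputs = [[1.0, 0.0], [0.0, 1.0]]
fitness = average_mse(network_outputs, desired_outputs)
print(fitness)  # 0.25
```

A lower value means the candidate weight vector reproduces the desired outputs more closely, so the optimizer minimizes this quantity.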
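The benefit of a good starting point for GD can be seen on a one-dimensional toy loss with one local and one global minimum. In the sketch below, a fixed list of candidates stands in for the population explored by the global search, so this is only an illustration of the two-stage idea and not the ChOA itself; the loss function, learning rate, and candidate values are arbitrary assumptions.

```python
def loss(w):
    """Toy non-convex loss: local minimum near w = 1.13, global minimum near w = -1.30."""
    return w ** 4 - 3.0 * w ** 2 + w

def grad(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def gradient_descent(w, lr=0.01, steps=500):
    """Plain gradient descent from the starting point w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Stage 1 (stand-in for the global search): keep the best of a few candidates.
candidates = [2.0, 0.5, -1.0, -2.0]
w_init = min(candidates, key=loss)

# Stage 2: refine the selected starting point with GD.
w_hybrid = gradient_descent(w_init)

# Baseline: GD from an unlucky starting point in the wrong basin.
w_plain = gradient_descent(2.0)

print(w_hybrid, loss(w_hybrid))
print(w_plain, loss(w_plain))
```

GD started at 2.0 settles in the shallower minimum, whereas the run seeded with the best candidate reaches the deeper one.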

Experimental Results and Discussion
Standardized databases play a key role in all papers, as they permit comparison of methods and repeatability. The suggested approach has been validated on two well-known public datasets: ENIT/IFN (a dataset of Arabic handwritten words) and Iranshahr (a dataset of Persian handwritten words) (see Figure 7). The ENIT/IFN dataset contains 32,492 word images of Tunisian village and town names and includes five subsets, namely, a, b, c, d, and e. More than 1000 writers were employed to write the vocabulary, which comprises 946 unique village and city names [58,59]. The well-known Iranshahr dataset includes nearly 17,000 images of handwritten names of 503 cities of Iran [60][61][62].
For the training, validation, and testing phases, all images in each dataset were divided randomly into three groups: 70% of the images for training, 15% for validation, and the remaining 15% for testing. The parameters used in the ChOA algorithm and the other algorithms are shown in Table 3. These parameters were selected experimentally in an initial study. The results showed that these values are more suitable for the algorithms studied and compared in this research.
A Hewlett-Packard (HP) computer with a 64-bit Windows 10 Home operating system, 8.00 GB of installed memory (RAM), and an Intel(R) Core(TM) i7-6500U CPU was used to evaluate our approach. For data processing, MATLAB and Statistics Toolbox Release 2021a were employed.
To obtain the results, each strategy was trained and tested on the two datasets for 40 epochs. The number of iterations in each epoch varies depending on the size of the images and the number of batches. Figure 8 shows that the final losses of all methods are similar, and Figure 9 depicts the first fifty iterations. As clearly illustrated in Figure 8, among these strategies, the ChOA technique converges more rapidly in the initial iterations. So, it can be the most efficient optimizer and can achieve higher accuracy. Table 4 shows the CNN classifier performance after being trained with the GD and ChOA separately and with the GD and ChOA together. It can be seen that initializing the weights of the CNN classifier using ChOA and then training the classifier using GD results in better performance. Furthermore, if the CNN classifier uses random initial weights and is trained with GD, the performance is better than when the CNN classifier only uses ChOA for initialization without training.
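A random 70/15/15 split of this kind can be sketched in a few lines. The experiments in this work were run in MATLAB, so the Python below is only an illustration; the function name, the fixed seed, and the placeholder file names are assumptions.

```python
import random

def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the items and cut them into training, validation, and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)       # reproducible random order
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]           # the remainder goes to testing
    return train, val, test

# Hypothetical list of image identifiers.
images = [f"word_{i:03d}.png" for i in range(100)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 70 15 15
```

Fixing the seed keeps the three groups identical across runs, which makes comparisons between optimizers fair.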
In Tables 5 and 6, the best accuracy values are highlighted in bold. According to Tables 5 and 6, LeNet-5 optimized using the ChOA method achieves the highest accuracy for both datasets. Moreover, the NN-based methods in [61,64] gave the worst results among all methods for the Persian and Arabic datasets, respectively. For the Arabic dataset, the smallest difference in accuracy is between SIFT [65] and the NN-based method [64]. For the Persian dataset, the smallest difference in accuracy is between the NN-based method [60] and LeNet-5.

Conclusion
This study demonstrated the effect of applying the ChOA, one of the metaheuristic strategies, to optimize the LeNet-5 model. First, the training problem of the CNN was formulated for the ChOA approach. Then, the ChOA technique was employed to find the best values for the weights and biases in order to increase the performance of the CNN model. The suggested ChOA approach was used for training on two public datasets (handwritten Arabic, ENIT/IFN, and handwritten Persian, Iranshahr). To assess the accuracy of the ChOA technique, the outcomes were compared with three recently published methods and three other stochastic optimization approaches (ACO, PSO, and FA). The outcomes showed that the ChOA strategy can train the CNN effectively. This approach increases the probability of finding optimal values for the weights and biases in a CNN model.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Table 6: Experimental outcomes for the Arabic dataset.