Improved Arabic Alphabet Characters Classification Using Convolutional Neural Networks (CNN)

Handwritten character recognition is a challenging research topic. Many works have been presented to recognize the letters of different languages, but the availability of Arabic handwritten character databases remains limited. Motivated by this topic of research, we propose a convolutional neural network for the classification of Arabic handwritten letters. Seven optimization algorithms are evaluated, and the best algorithm is reported. Faced with the scarcity of available Arabic handwritten datasets, various data augmentation techniques are implemented to improve the robustness of the convolutional neural network model. The proposed model is further improved with the dropout regularization method to avoid overfitting. Moreover, suitable choices of optimization algorithm and data augmentation approach are presented to achieve good performance. The model has been trained on two Arabic handwritten character datasets, AHCD and Hijja. The proposed algorithm achieved high recognition accuracies of 98.48% and 91.24% on AHCD and Hijja, respectively, outperforming other state-of-the-art models.


Introduction
Approximately a quarter of a billion people around the world speak and write the Arabic language [1]. There are many historical books and documents written in Arabic that represent a crucial data source for most Arab countries [1,2].
Recently, the area of Arabic handwritten character recognition (AHCR) has received increased research attention [3][4][5]. It is a challenging topic in computer vision and pattern recognition [1]. This is due to the following: (i) The differences between handwriting patterns [3].
(iv) As shown in Figure 1, in the Arabic language the shape of each handwritten character depends on its position in the word. For example, in the word "أمراء" the character "Alif" is written in two different forms, "أ" and "ا". In the Arabic language, each character has between two and four shapes. Table 1 shows the different shapes of the twenty-eight Arabic alphabet letters.
Most researchers improved the CNN architecture to achieve good handwritten character recognition performance [6,13]. However, a neural network with excellent performance usually requires good tuning of the CNN hyperparameters and a good choice of optimization algorithm [14][15][16]. A large training dataset [17,18] is also required to achieve outstanding performance.
The main contributions of this research can be summarized as follows: (i) Suggesting a CNN model for recognizing Arabic handwritten characters. (ii) Tuning different hyperparameters to improve the model performance. (iii) Applying different optimization algorithms and reporting the effectiveness of the best ones. (iv) Presenting different data augmentation techniques and reporting the influence of each method on the improvement of Arabic handwritten character recognition. (v) Mixing two different Arabic handwritten character datasets for shape variety and testing the impact of the presented data augmentation approaches on the mixed dataset.
The rest of this paper is organized as follows. In Section 2, we review related work on Arabic handwritten character classification. In Sections 3 and 4, we describe the convolutional neural network architecture and the tuning of the model hyperparameters. In Section 5, we give a detailed description of the optimization algorithms used. In Section 6, we describe the data augmentation techniques chosen in this study. In Section 7, we provide an overview of the experimental results showing the distinguished performance of the CNN. Section 8 concludes and outlines possible future research directions.

Related Work
In recent years, many studies have addressed the classification and recognition of letters, including Arabic handwritten characters. Nevertheless, relatively few approaches have been proposed for recognizing individual characters of the Arabic language. As a result, Arabic handwritten character recognition is less advanced than that of English, French, Chinese, Devanagari, Hangul, Malayalam, etc.
Impressive results were achieved in the classification of handwritten characters from different languages, using deep learning models and in particular the CNN.
El-Sawy et al. [6] gathered their own Arabic Handwritten Character Dataset (AHCD) from 60 participants; it consists of 16,800 characters. They achieved a classification accuracy of 88% using a CNN model consisting of 2 convolutional layers. To improve the CNN performance, regularization and different optimization techniques were applied to the model, improving the testing accuracy to 94.93%.
Altwaijry and Al-Turaiki [13] presented a new Arabic handwritten letters dataset (named "Hijja"), comprising 47,434 characters written by 591 participants. Their proposed CNN model achieved 88% and 97% testing accuracy on the Hijja and AHCD datasets, respectively.
Younis [19] designed a CNN model to recognize Arabic handwritten characters. The CNN consisted of three convolutional layers followed by one final fully connected layer. The model achieved an accuracy of 94.7% on the AHCD database and 94.8% on the AIA9K (Arabic alphabet) dataset.
Latif et al. [20] designed a CNN to recognize a mix of handwriting of multiple languages: Persian, Devanagari, Eastern Arabic, Urdu, and Western Arabic. The input image is of size (28 × 28) pixels, followed by two convolutional layers, with a max-pooling operation applied after each. The overall accuracy on the combined multilanguage database was 99.26%, with an average accuracy of around 99% for each individual language.
Alrobah and Albahli [21] analyzed the Hijja dataset and found irregularities such as distorted letters and blurred symbols. They used a CNN model to extract the important features and an SVM model for classification, achieving a testing accuracy of 96.3%.
Mudhsh et al. [22] designed a VGGnet architecture for recognizing Arabic handwritten characters and digits. The model consists of 13 convolutional layers, 2 max-pooling layers, and 3 fully connected layers. Data augmentation and dropout were used during training to avoid overfitting.
Boufenar et al. [23] used the popular CNN architecture AlexNet, which consists of 5 convolutional layers, 3 max-pooling layers, and 3 fully connected layers. Experiments were conducted on two different databases, OIHACDB-40 and AHCD. With good tuning of the CNN hyperparameters and the use of dropout and mini-batch techniques, accuracies of 100% and 99.98% were achieved on OIHACDB-40 and AHCD, respectively.
Mustapha et al. [24] proposed a Conditional Deep Convolutional Generative Adversarial Network (CDCGAN) for the guided generation of isolated handwritten Arabic characters. The CDCGAN was trained on the AHCD dataset, achieving a 10% performance gap between real and generated handwritten Arabic characters. Table 2 summarizes the literature reviewed on recognizing Arabic handwritten characters with CNN models. From this literature, we notice that most CNN architectures have been trained on adult Arabic handwriting ("AHCD"). In addition, we observe that most researchers try to improve performance through good tuning of the CNN hyperparameters.

The Proposed Arabic Handwritten Characters Recognition System
As shown in Figure 2, the model proposed in this study is composed of three principal components: the proposed CNN architecture, the optimization algorithms, and the data augmentation techniques. The proposed CNN model contains four convolution layers, two max-pooling operations, and an ANN classifier with three fully connected hidden layers. To avoid overfitting and improve the model performance, various optimization techniques were used, such as dropout, mini-batch training, and the choice of activation function. Figure 3 describes the proposed CNN model. The recognition performance for Arabic handwritten letters was further improved through a good choice of optimization algorithm and the use of different data augmentation techniques (geometric transformations, feature space augmentation, noise injection, and mixing images).

Convolution Neural Network Architecture
A CNN model [25][26][27][28][29][30][31][32][33][34] is a series of convolution layers followed by fully connected layers. The convolution layers extract important features from the input data; the fully connected layers classify the data. The CNN input is the image to be classified; the output corresponds to the predicted class of the Arabic handwritten character.

Input Data.
The input data is an image I of size (m × m × s), where (m × m) defines the width and height of the image and s denotes the number of channels. The value of s is 1 for a grayscale image and 3 for an RGB color image.

Convolution Layer.
The convolution layer consists of a convolution operation followed by a pooling operation.

Convolution Operation.
The classical convolution operation between an input image I of dimension (m × m) and a filter F of size (n × n) is defined as follows (see Figure 4):

C = I ⊗ F.

Here, ⊗ denotes the convolution operation and C is the convolution map of size (a × a), where

a = (m − n + 2p)/s_L + 1.

s_L is the stride and denotes the number of pixels by which F slides over I. p is the padding; it is often necessary to add a border of zeros around I to preserve complete image information. Figure 4 shows an example of the convolution operation between an input image of dimension (8 × 8) and a filter F of size (3 × 3). Here, the convolution map C is of size (6 × 6), with a stride s_L = 1 and a padding p = 0.
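The convolution operation and the output-size formula above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; like most CNN frameworks, it computes a cross-correlation (the filter is not flipped):

```python
import numpy as np

def conv_output_size(m, n, p=0, stride=1):
    """Output side length: a = (m - n + 2p) / s_L + 1."""
    return (m - n + 2 * p) // stride + 1

def convolve2d(image, kernel, stride=1):
    """Naive 'valid' 2-D convolution (no padding), as in the Figure 4 example."""
    m, n = image.shape[0], kernel.shape[0]
    a = conv_output_size(m, n, p=0, stride=stride)
    out = np.zeros((a, a))
    for i in range(a):
        for j in range(a):
            # Element-wise product of the filter with the current patch, then sum.
            patch = image[i * stride:i * stride + n, j * stride:j * stride + n]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(8, 8)   # matches the (8 x 8) input of Figure 4
kernel = np.random.rand(3, 3)  # (3 x 3) filter F
c = convolve2d(image, kernel)
print(conv_output_size(8, 3))  # 6
print(c.shape)                 # (6, 6)
```

With m = 8, n = 3, p = 0, and s_L = 1, the formula gives a = 6, matching the (6 × 6) map of Figure 4.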
Generally, a nonlinear activation function is applied to the convolution map C. The commonly used activation functions are the Sigmoid [34][35][36], the Hyperbolic Tangent "Tanh" [35,37], and the Rectified Linear Unit "ReLU" [37,38]:

C_a = f(C),

where C_a is the convolution map after applying the nonlinear activation function f. Figure 5 shows the C_a map when the ReLU activation function is applied to C.

Pooling Operation.
The pooling operation is used to reduce the dimension of C_a, thus reducing the computational complexity of the network. During the pooling operation, a kernel K of size (s_p × s_p) slides over C_a, where s_p denotes the number of pixels by which K slides over C_a. In our analysis s_p is set to 2. The pooling operation is expressed as

P = pool(C_a),

where P is the pooling map and pool is the pooling operation. The commonly used pooling operations are average-pooling, max-pooling, and min-pooling. Figure 6 describes the concept of the average-pooling and max-pooling operations using a kernel of size (2 × 2) and a stride of 2.
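The pooling variants above can be sketched as a single NumPy routine. This is a minimal illustration of a (2 × 2) kernel with stride 2, as in Figure 6, not the framework's optimized implementation:

```python
import numpy as np

def pool2d(c_a, size=2, stride=2, mode="max"):
    """Slide a (size x size) kernel over the activation map C_a."""
    a = (c_a.shape[0] - size) // stride + 1
    out = np.zeros((a, a))
    op = {"max": np.max, "avg": np.mean, "min": np.min}[mode]
    for i in range(a):
        for j in range(a):
            window = c_a[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

c_a = np.array([[1., 3., 2., 4.],
                [5., 6., 7., 8.],
                [3., 2., 1., 0.],
                [1., 2., 3., 4.]])
print(pool2d(c_a, mode="max"))  # [[6. 8.] [3. 4.]]
print(pool2d(c_a, mode="avg"))  # [[3.75 5.25] [2. 2.]]
```

Note how a (4 × 4) map is reduced to (2 × 2), halving each spatial dimension and quartering the computation of the following layer.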

Computational Intelligence and Neuroscience

Concatenation Operation.
The concatenation operation maps the set of convoluted images into a vector called the concatenation vector Y:

Y = [P^c_1, P^c_2, ..., P^c_n],

where P^c_i is the output of the i-th convolution layer and n denotes the number of filters applied to the convoluted images P^{c−1}_{i−1}.

Fully Connected Layer.
The CNN classification operation is performed through the fully connected layers [39]. Their input is the concatenation vector Y; the predicted class y is the output of the CNN classifier. The classification is performed through a series of t fully connected hidden layers, each of which is a parallel collection of artificial neurons. Like synapses in the biological brain, the artificial neurons are connected through weights W. The output of the i-th fully connected hidden layer is

Y_i = f(H_i),

where the weighted sum vector H_i is

H_i = W_i Y_{i−1} + B_i.

Here, f is a nonlinear activation function (Sigmoid, Tanh, ReLU, etc.) and the bias value B_i defines the activation level of the artificial neurons.
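A single fully connected layer, as described by the two equations above, can be sketched as follows. The sizes (an 8-element concatenation vector, a 4-neuron layer) are arbitrary choices for illustration:

```python
import numpy as np

def dense_forward(y_prev, W, b, f=lambda h: np.maximum(h, 0.0)):
    """One fully connected layer: H_i = W_i . Y_{i-1} + B_i, output Y_i = f(H_i).

    The default activation f is ReLU."""
    return f(W @ y_prev + b)

rng = np.random.default_rng(0)
Y = rng.standard_normal(8)       # concatenation vector from the conv layers
W = rng.standard_normal((4, 8))  # weights of a 4-neuron hidden layer
b = np.zeros(4)                  # biases set the activation level
h = dense_forward(Y, W, b)
print(h.shape)         # (4,)
print((h >= 0).all())  # True: ReLU output is nonnegative
```

Stacking t such calls, each feeding its output into the next layer's `y_prev`, reproduces the series of hidden layers used for classification.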

CNN Learning Process
A trained CNN is a system capable of determining the exact class of a given input. Training is achieved through updates of the layers' parameters (filters, weights, and biases) based on the error between the CNN predicted class and the class label. The CNN learning process is an iterative process based on feedforward propagation and backpropagation.

Feedforward Propagation.
For the CNN model, the feedforward equations can be derived from (1)-(6). The Softmax activation function [40,41] is applied in the final layer to generate the predicted class of the input image I. For a multiclass model, the Softmax is expressed as follows:

y_i = e^{h_i} / Σ_{j=1}^{c} e^{h_j},

where c denotes the number of classes, y_i is the i-th coordinate of the output vector y, and h_i = Σ_{j=1}^{n} h_j w_ij is the artificial neuron output.
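The Softmax formula can be implemented directly; the sketch below uses the standard max-subtraction trick for numerical stability, which leaves the result unchanged since the constant cancels in the ratio:

```python
import numpy as np

def softmax(h):
    """y_i = exp(h_i) / sum_j exp(h_j), computed stably."""
    e = np.exp(h - np.max(h))  # subtracting max(h) avoids overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # example final-layer outputs for c = 3 classes
y = softmax(logits)
print(np.isclose(y.sum(), 1.0))  # True: a probability distribution over the classes
print(int(np.argmax(y)))         # 0: the predicted class
```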

Backpropagation.
To update the CNN parameters and perform the learning process, a backpropagation optimization algorithm is used to minimize a selected cost function E. In this analysis, the cross-entropy (CE) cost function [40] is used:

E = − Σ_{i=1}^{c} ŷ_i log(y_i),

where ŷ_i is the desired output (data label). The most widely used optimization algorithm for classification problems is gradient descent (GD). Various refinements of the GD algorithm, such as Momentum, AdaGrad, RMSprop, Adam, AdaMax, and Nadam, were used to improve the CNN performance.

Gradient Descent

[40,42]. GD is the simplest form of gradient descent optimization. It is easy to implement and gives significant classification accuracy. The general update equation of the CNN parameters using the GD algorithm is

φ(t + 1) = φ(t) − α (∂E/∂φ(t)),

where φ represents the parameters being updated (the filters F, the weights W, and the biases B), (∂E/∂φ) is the gradient with respect to the parameter φ, and α is the model learning rate. A too-large value of α may lead to divergence of the GD algorithm and may cause the model performance to oscillate; a too-small α stalls the learning process.

Gradient Descent with Momentum

[43]. The momentum hyperparameter m defines the velocity by which the learning rate α is increased when the model approaches the minimum of the cost function E. The update equations using the momentum GD algorithm are

v(t) = m v(t − 1) + α (∂E/∂φ(t)),
φ(t + 1) = φ(t) − v(t),

where v(t) is the momentum gained at the t-th iteration.

AdaGrad

[44]. In this algorithm, the learning rate is a function of the gradient (∂E/∂φ). It is defined as follows:

α(t) = α / √(G(t) + ϵ),

where ϵ is a small smoothing value used to avoid division by 0 and G(t) is the sum of the squares of the past gradients (∂E/∂φ(t)). With a small magnitude of (∂E/∂φ), the value of α(t) increases; if (∂E/∂φ) is very large, α(t) stays small. AdaGrad thus adapts the learning rate for each parameter at a given time t by taking the previous gradient updates into account. The parameter update equation using AdaGrad is

φ(t + 1) = φ(t) − (α / √(G(t) + ϵ)) (∂E/∂φ(t)).

AdaDelta

[45]. The issue with AdaGrad is that after many iterations the learning rate becomes very small, which leads to slow convergence. To fix this problem, the AdaDelta algorithm takes an exponentially decaying average over past squared gradients:

E[G²(t)] = γ E[G²(t − 1)] + (1 − γ) G²(t),

where E[G²(t)] is the decaying average over past squared gradients and γ is usually set around 0.9.

RMSprop

[45,46]. In practice, RMSprop is identical to AdaDelta's first update vector, derived above:

φ(t + 1) = φ(t) − (α / √(E[G²(t)] + ϵ)) (∂E/∂φ(t)).

Adam

This gradient descent optimizer computes the effective learning rate based on two vectors:

r(t) = β1 r(t − 1) + (1 − β1) (∂E/∂φ(t)),
v(t) = β2 v(t − 1) + (1 − β2) (∂E/∂φ(t))²,

where r(t) and v(t) are the 1st- and 2nd-order moment vectors, β1 and β2 are the decay rates, and r(t − 1) and v(t − 1) represent the mean and the variance of the previous gradients. When r(t) and v(t) are very small, a large step size is needed for the parameter update. To avoid this issue, a bias correction is applied to r(t) and v(t):

r̂(t) = r(t) / (1 − β1^t),
v̂(t) = v(t) / (1 − β2^t),

where β1^t is β1 to the power t and β2^t is β2 to the power t. The Adam update equation is expressed as follows:

φ(t + 1) = φ(t) − α r̂(t) / (√v̂(t) + ϵ).

AdaMax

[45,47]. The factor v(t) in the Adam algorithm adjusts the gradient inversely proportionally to the ℓ2 norm of the previous gradients (via v(t − 1)) and the current gradient (∂E/∂φ(t)). The generalization of this update to the ℓp norm tends to be numerically unstable, which is why the ℓ1 and ℓ2 norms are most common in practice. However, the ℓ∞ norm also exhibits stable behavior. As a result, the authors propose AdaMax and demonstrate that v(t) with ℓ∞ converges to the more stable value

v(t) = max(β2 v(t − 1), |∂E/∂φ(t)|),

with the update

φ(t + 1) = φ(t) − (α / v(t)) r̂(t).

Nadam

[43]. Nadam is a combination of Adam and NAG. The parameter update equation using NAG is

φ(t + 1) = φ(t) − (m v(t − 1) + α (∂E/∂φ(t))),

and the update equation using Nadam is

φ(t + 1) = φ(t) − (α / (√v̂(t) + ϵ)) (β1 r̂(t) + (1 − β1)(∂E/∂φ(t)) / (1 − β1^t)).
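The update rules above can be compared on a toy problem. The sketch below minimizes the one-dimensional cost E(φ) = φ², whose gradient ∂E/∂φ = 2φ is known in closed form, with plain GD, Momentum, and Adam; the step counts and hyperparameters are illustrative choices, not the paper's training settings:

```python
import numpy as np

# Toy cost E(phi) = phi^2, so dE/dphi = 2*phi and the minimum is at phi = 0.
grad = lambda phi: 2.0 * phi

def gd(phi, steps=200, alpha=0.1):
    """Plain gradient descent: phi <- phi - alpha * dE/dphi."""
    for _ in range(steps):
        phi = phi - alpha * grad(phi)
    return phi

def momentum(phi, steps=200, alpha=0.1, m=0.9):
    """Momentum: v <- m*v + alpha*dE/dphi; phi <- phi - v."""
    v = 0.0
    for _ in range(steps):
        v = m * v + alpha * grad(phi)
        phi = phi - v
    return phi

def adam(phi, steps=200, alpha=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected 1st/2nd moment estimates scale the step."""
    r = v = 0.0
    for t in range(1, steps + 1):
        g = grad(phi)
        r = b1 * r + (1 - b1) * g          # 1st-order moment
        v = b2 * v + (1 - b2) * g * g      # 2nd-order moment
        r_hat = r / (1 - b1 ** t)          # bias corrections
        v_hat = v / (1 - b2 ** t)
        phi = phi - alpha * r_hat / (np.sqrt(v_hat) + eps)
    return phi

for name, opt in [("GD", gd), ("Momentum", momentum), ("Adam", adam)]:
    print(name, abs(opt(5.0)) < 0.1)  # each ends close to the minimum
```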

Data Augmentation Techniques
Deep convolutional neural networks rely heavily on big data to achieve excellent performance and avoid overfitting.
To address the problem of insufficient data for Arabic handwritten characters, we present some basic data augmentation techniques that enhance the size and quality of the training datasets. The image augmentation approaches used in this study include geometric transformations, feature space augmentation, noise injection, and mixing images.
Data augmentation based on geometric transformations and feature space augmentation [17,48] typically involves rotation, flipping, shifting, and zooming.

Rotation.
The input data is rotated right or left about its axis by between 1° and 359°. The rotation degree parameter has a significant impact on the label safety of the dataset. For example, on digit identification tasks like MNIST, slight rotations such as 1° to 20° or −1° to −20° can be useful, but as the rotation degree increases, the CNN may no longer be able to accurately distinguish between some digits.

Flipping.
The input image is flipped horizontally or vertically. This augmentation is one of the simplest to implement and has proven useful on datasets such as ImageNet and CIFAR-10.

Shifting.
The input image is shifted right, left, up, or down. This transformation is a highly effective adjustment for preventing positional bias. Figure 7 shows an example of the shifting data augmentation technique applied to Arabic alphabet characters.

Zooming.
The input image is zoomed, either by adding some pixels around the image or by applying random zooms to it. The amount of zooming influences the quality of the image; for example, too much zooming can crop away informative pixels.
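In practice these geometric transformations are usually generated on the fly (e.g., with Keras's `ImageDataGenerator`); the pure-NumPy sketch below illustrates just the shifting and flipping operations on a fake 32 × 32 character, with the one-pixel shift matching the setting later reported as best. The image content and helper names are invented for illustration:

```python
import numpy as np

def shift(img, dx=0, dy=0):
    """Shift by (dy, dx) pixels, padding the exposed border with zeros (background)."""
    out = np.zeros_like(img)
    h, w = img.shape
    dst_y = slice(max(dy, 0), h + min(dy, 0))
    dst_x = slice(max(dx, 0), w + min(dx, 0))
    src_y = slice(max(-dy, 0), h + min(-dy, 0))
    src_x = slice(max(-dx, 0), w + min(-dx, 0))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def flip(img, horizontal=True):
    """Mirror the image left-right (or top-bottom)."""
    return np.fliplr(img) if horizontal else np.flipud(img)

img = np.zeros((32, 32))
img[10:20, 10:20] = 1.0     # a fake 32x32 "character" blob
shifted = shift(img, dx=1)  # one-pixel right shift
print(shifted[:, 11:21].sum() == img[:, 10:20].sum())  # True: mass moved one column
print(np.array_equal(flip(flip(img)), img))            # True: double flip restores
```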

Noise Injection.
Natural noise is present in images of Arabic handwritten characters. Noise makes recognition more difficult, and for this reason it is usually reduced with image preprocessing techniques. The goal of noise reduction is to achieve high classification accuracy, but it can alter the character shape. The main datasets in this research area are distributed with denoised images. The question we answer here is how robust the method can be made to arbitrary noise.
Adding noise [48,49] to a convolutional neural network during training helps the model learn more robust features, resulting in better performance and faster learning. Several types of noise can be added when recognizing images, including the following.
(i) Gaussian noise: injecting a matrix of random values drawn from a Gaussian distribution
(ii) Salt-and-pepper noise: randomly changing a certain fraction of the pixels to completely white ("salt") or completely black ("pepper")
(iii) Speckle noise: multiplicative noise, obtained by adding to the image its pixel-wise product with random Gaussian values

Adding noise to the input data is the most commonly used approach, but during training we can also add random noise to other parts of the CNN model, for example:

(i) Adding noise to the outputs of each layer
(ii) Adding noise to the gradients used to update the model parameters
(iii) Adding noise to the target variables

Mixing Image Databases.
In this study, we augment the training dataset by mixing two different Arabic handwritten character datasets, AHCD and Hijja. AHCD is a clean database, while Hijja is a dataset of very low-resolution images comprising many distorted alphabet images. We then evaluate the influence of the data augmentation techniques mentioned above (geometric transformations, feature space augmentation, and noise injection) on the recognition performance of the new mixed dataset.

Datasets.
In this study, two datasets of Arabic handwritten characters were used: the Arabic handwritten characters dataset "AHCD" and the Hijja dataset.
AHCD [6] comprises 16,800 handwritten characters of size (32 × 32 × 1) pixels. It was written by 60 participants between the ages of 19 and 40 years, most of them right-handed. Each participant wrote the Arabic alphabet from "alef" to "yeh" 10 times. The dataset has 28 classes and is divided into a training set of 13,440 characters and a testing set of 3,360 characters.
The Hijja dataset [13] consists of 47,434 Arabic characters of size (32 × 32 × 1) pixels, written by 591 schoolchildren ranging in age from 7 to 12 years. Collecting data from children is a very hard task: malformed characters are characteristic of children's handwriting, so the dataset comprises repeated letters, missing letters, and many distorted or unclear characters. The dataset has 29 classes and is divided into a training set of 37,933 characters and a testing set of 9,501 characters (80% for training and 20% for testing). Figure 8 shows a sample of the AHCD and Hijja Arabic handwritten letter datasets.

Experimental Environment and Performance Evaluation.
In this study, the implementation and evaluation of the CNN model were carried out in the Keras deep learning environment with a TensorFlow backend, on Google Colab with a GPU accelerator.
We evaluate the performance of our proposed model via the following measures.

Accuracy (A) measures how many correct predictions the model made over the complete test dataset:

A = (TP + TN) / (TP + TN + FP + FN).

Recall (R) is the fraction of images that are correctly classified over the total number of images that belong to the class:

R = TP / (TP + FN).

Precision (P) is the fraction of images that are correctly classified over the total number of images assigned to the class:

P = TP / (TP + FP).

The F1 measure combines Recall and Precision:

F1 = 2 × (P × R) / (P + R).

Here, TP (true positive) is the total number of images correctly labeled as belonging to a class x, FP (false positive) is the total number of images incorrectly labeled as belonging to class x, FN (false negative) is the total number of images incorrectly labeled as not belonging to class x, and TN (true negative) is the total number of images correctly labeled as not belonging to class x.
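The per-class counts and the four measures above can be computed directly from predicted and true labels. The tiny 3-class label vectors below are made-up illustration data, not results from the paper:

```python
import numpy as np

def per_class_metrics(y_true, y_pred, cls):
    """Compute TP/FP/FN/TN for one class, then Accuracy, Recall, Precision, F1."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    tn = np.sum((y_pred != cls) & (y_true != cls))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

y_true = np.array([0, 0, 1, 1, 2, 2])  # illustrative labels
y_pred = np.array([0, 1, 1, 1, 2, 0])  # illustrative predictions
a, r, p, f1 = per_class_metrics(y_true, y_pred, cls=1)
print(round(r, 2), round(p, 2))  # 1.0 0.67
```

Averaging these per-class values over all classes gives the overall figures reported in the tables.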
We also plot the ROC curve and report the area under it (AUC).
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classifier at all classification thresholds. The curve plots two parameters: (i) the true-positive rate and (ii) the false-positive rate. AUC stands for "area under the ROC curve"; that is, AUC measures the entire two-dimensional area underneath the ROC curve from (0,0) to (1,1).

Tuning of CNN Hyperparameters.
The objective is to choose the best model that fits the AHCD and Hijja datasets well. Many trial-and-error runs of the network configuration tuning mechanism were performed. The best performance was achieved when the CNN model was constructed of four convolution layers followed by three fully connected hidden layers. To reduce overfitting, a dropout with a rate of 0.6 is added between the dense layers and applied to the outputs of the prior layer that are fed to the subsequent layer. The optimized parameters used to improve the CNN performance were as follows: the optimizer algorithm is Adam, the loss function is the cross-entropy, the learning rate is 0.001, the batch size is 16, and the number of epochs is 40. We compare our model to CNN-for-AHCD over both the Hijja dataset and the AHCD dataset. The code for CNN-for-AHCD is available online [31], which allows comparison of its performance over various datasets.
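A quick sanity check on the architecture is to trace the spatial size of a 32 × 32 input through the four convolutions and two poolings. The sketch below assumes 3 × 3 kernels, no padding, and one pooling after each pair of convolutions; the paper does not state these placement details, so they are assumptions for illustration:

```python
def conv_out(m, n=3, p=0, s=1):
    # a = (m - n + 2p)/s + 1, the formula from the convolution section
    return (m - n + 2 * p) // s + 1

def pool_out(m, k=2, s=2):
    return (m - k) // s + 1

# Hypothetical shape trace of the proposed stack on a 32x32 AHCD/Hijja image.
m = 32
m = conv_out(m)  # conv1 -> 30
m = conv_out(m)  # conv2 -> 28
m = pool_out(m)  # max-pool 1 -> 14
m = conv_out(m)  # conv3 -> 12
m = conv_out(m)  # conv4 -> 10
m = pool_out(m)  # max-pool 2 -> 5
print(m)  # 5: remaining spatial size per channel before flattening
```

The flattened 5 × 5 maps then feed the three fully connected hidden layers, with dropout (rate 0.6) between the dense layers as described above.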
On the Hijja dataset, which has 29 classes, our model achieved an average overall test set accuracy of 88.46%, precision of 87.98%, recall of 88.46%, and an F1 score of 88.47%, while CNN-for-AHCD achieved an average overall test set accuracy of 80%, precision of 80.79%, recall of 80.47%, and an F1 score of 80.4%.
On the AHCD dataset, which has 28 classes, our model achieved an average overall test set accuracy of 96.66%, precision of 96.75%, recall of 96.67%, and an F1 score of 96.67%, while CNN-for-AHCD achieved an average overall test set accuracy of 93.84%, precision of 93.99%, recall of 93.84%, and an F1 score of 93.84%. The detailed metrics are reported per character in Table 3.
We note that our model outperforms CNN-for-AHCD by a large margin on all metrics. Figure 9 shows the testing AUC for the AHCD and Hijja datasets.

Optimizer Algorithms.
The objective is to choose the optimizer algorithm that yields the best performance on AHCD and Hijja. In this context, we tested the influence of the optimization algorithms described in Section 5 on the classification of handwritten Arabic characters. Using the Nadam optimization algorithm on the Hijja dataset, our model achieved an average overall test set accuracy of 88.57%, precision of 87.86%, recall of 87.98%, and an F1 score of 87.95%.
On the AHCD dataset, our model achieved an average overall test set accuracy of 96.73%, precision of 96.80%, recall of 96.73%, and an F1 score of 96.72%. The detailed results of the different optimization algorithms are given in Table 4.

Results of Data Augmentation Techniques.
Generally, neural network performance is improved through good tuning of the model hyperparameters. Such improvement in CNN accuracy is linked to the availability of training data; the networks rely heavily on big data to avoid overfitting and perform well.
Data augmentation is the solution to the problem of limited data. The image augmentation techniques used and discussed in this study include geometric transformations and feature space augmentation (rotation, shifting, flipping, and zooming), noise injection, and mixing images from two different datasets.
For the geometric transformations and feature space augmentation, we must carefully choose the magnitude of each transformation. For example, on MNIST, if a digit image is rotated by 180°, the network will not be able to accurately distinguish between the handwritten digits "6" and "9". Likewise, on the AHCD and Hijja datasets, if rotation or flipping is applied too aggressively, the network will be unable to distinguish between some handwritten Arabic characters. For example, as shown in Figure 10, with a rotation of 180°, the isolated character Daal (د) becomes the same as the isolated character Noon (ن). The detailed results of the rotation, shifting, flipping, and zooming data augmentation techniques are given in Table 5.

As shown in Table 5 and Figure 11, using the rotation and shifting augmentation approaches, our model achieved testing accuracies of 98.48% and 91.24% on the AHCD and Hijja datasets, respectively. We achieved this accuracy by rotating the input image by 10° and shifting it by just one pixel.
Adding noise is another technique for augmenting the training input data; in most cases, it also increases the robustness of the network.
In this work we used three types of noise to augment our data: (i) Gaussian noise, (ii) salt-and-pepper noise, and (iii) speckle noise. The detailed results for the different types of noise injection are given in Table 6. As shown, adding the different types of noise improves the model accuracy, which demonstrates the robustness of the proposed architecture. We achieved good results when adding noise to the outputs of each layer.
A further idea explored in this study is to augment the training data by mixing the two datasets AHCD and Hijja and then applying the previously mentioned data augmentation methods to the new mixed dataset. Our purpose in using the malformed handwritten characters of the Hijja dataset is to improve the accuracy of our method on noisy data. The detailed results of the data augmentation techniques on the mixed database are given in Table 7. As shown, the model performance depends on the proportion of the Arabic handwriting "Hijja" database used. The children had trouble following the reference paper, which resulted in very low-resolution images comprising many unclear characters; therefore, mixing in this dataset reduces performance.

Conclusions and Possible Future Research Directions
In this paper, we proposed a convolutional neural network (CNN) to recognize Arabic handwritten characters.
We trained the model on the two Arabic datasets AHCD and Hijja. Through good tuning of the network hyperparameters, we achieved accuracies of 96.73% and 88.57% on AHCD and Hijja, respectively. To improve the model performance, we implemented different optimization algorithms; for both databases, we achieved excellent performance using the Nadam optimizer.
To address the shortage of Arabic handwritten datasets, we applied different data augmentation techniques based on geometric transformations, feature space augmentation, noise injection, and mixing of datasets. Using the rotation and shifting techniques, we achieved accuracies of 98.48% and 91.24% on AHCD and Hijja, respectively.
To improve the robustness of the CNN model and increase the amount of training data, we added three types of noise (Gaussian, salt-and-pepper, and speckle).
Also in this work, we first augmented the data by mixing the two Arabic handwritten character datasets and then tested the previously mentioned data augmentation techniques on the new mixed dataset, where the first database, "AHCD", comprises clear images with very good resolution while the second database, "Hijja", has many distorted characters. The experiments show that the geometric transformations (rotation, shifting, and flipping), feature space augmentation, and noise injection consistently improve the network performance, but the rate of use of the unclean database "Hijja" harms the model accuracy.
An interesting future direction is the cleaning and processing of the Hijja dataset to eliminate the low-resolution and unclear images, followed by applying the proposed CNN network and data augmentation techniques to the new mixed and cleaned database.
In addition, we are interested in evaluating other augmentation approaches, such as adversarial training, neural style transfer, and generative adversarial networks, for the recognition of Arabic handwritten characters. We plan to incorporate our work into an application that teaches children Arabic spelling.

Abbreviations

AHCR: Arabic handwritten characters recognition
DL: Deep learning
CNNs: Convolutional neural networks
AHCD: Arabic handwritten character dataset
SVM: Support vector machine
ADBase: Arabic digits database
HACDB: Handwritten Arabic characters database
OIHACDB: Offline handwritten Arabic character database
CDCGAN: Conditional deep convolutional generative adversarial network
Tanh: Hyperbolic tangent
ReLU: Rectified linear unit
CE: Cross-entropy
GD: Gradient descent
NAG: Nesterov accelerated gradient
TP: True positive
FP: False positive
FN: False negative
TN: True negative
AUC: Area under curve
ROC: Receiver operating characteristic curve
ELU: Exponential linear unit

Symbols

I: Image
m: Width and height of the image
c: Number of channels
F: Filter
n: Filter size
⊗: Convolution operation
C: Convolution map
a: Size of the convolution map
s_L: Stride
p: Padding
f: Nonlinear activation function
C_a: Convolution map after activation
E: Cost function
ŷ_i: Desired output
φ: Parameter being updated (filter F, weight W, or bias B)
∂E/∂φ: Gradient
α: Model learning rate
m: Momentum
v(t): Moment gained at the t-th iteration
ϵ: Smoothing value
G(t): Sum of the squares of the gradients
E[G²(t)]: Decaying average
r(t): Moment vector
β: Decay rate
r(t − 1): Mean of the previous gradient
v(t − 1): Variance of the previous gradient

Data Availability
Previously reported AHCD data were used to support this study and are available at https://www.kaggle.com/mloey1/ahcd1. These prior studies (and datasets) are cited at relevant places within the text as [43].