Isolated Handwritten Pashto Character Recognition Using a k-NN Classification Tool Based on Zoning and HOG Feature Extraction Techniques



Introduction
In this modern digital age of ever-growing computer technology, machine learning algorithms play a key role in all fields of life, especially in the areas of text recognition [1], network security [2,3], privacy [4], traffic flow prediction [5], object detection [6], and many others. One of the major applications of machine learning algorithms is the development of Optical Character Recognition (OCR) systems.
The OCR system reads the text from an image and converts it into a computer-readable form. Several research works have addressed the automatic recognition of multiple languages such as Arabic, English, Persian, Chinese, and Urdu [7,8].
The main problems associated with these languages are the cursive writing styles, writers' handwriting habits, and secondary components (diacritics). The Pashto language has incorporated most of the Arabic, Urdu, and Persian letters with some minor modifications. Because of this incorporation of letters, the Pashto language is cursive in nature. The Pashto language has a large character set (44 characters), larger than those of Urdu (38 characters), Arabic (28 characters), and Persian (32 characters).
This large character set and the minor differences between character shapes make the recognition process more complex for Pashto script.
Pashto is the native language of a large community living in the northern areas of Pakistan and an official language of Afghanistan. Ahmad et al. [9] used k-nearest neighbors (k-NN) as a classification tool for printed Pashto character recognition by using high-level feature extraction techniques. Boulid et al. [7] suggested the use of a neural network with spatial distribution of pixels (SDP) and local binary patterns (LBP) for the recognition of handwritten Arabic characters. Boufenar et al. [10,11] presented an artificial immune recognition (AIR) system based on both statistical and structural features for handwritten Arabic letter recognition. Askari et al. [12] introduced the derivative projection profile (DPP) as a feature extraction technique and a neural network as a classification tool for isolated Arabic character recognition. El-Sawy presented an Arabic OCR system using convolutional neural networks [13].
Boufenar and Batouche proposed a deep convolutional neural network (DCNN) for Arabic letter recognition [14]. The performance of the suggested system depends on hyperparameter tuning and the size of the dataset in use. Naz et al. [8] suggested an approach combining a convolutional neural network and a recursive neural network for Urdu Nastali'q text recognition. They tested the system on the well-known Urdu printed text-line image (UPTI) dataset. Sarvaramini et al. [15] suggested the use of the convolutional neural network (CNN) for offline Persian character recognition.
This paper presents an OCR system for offline Pashto characters. A medium-sized database of handwritten Pashto characters is developed for the proposed research work. The results are discussed in Section 4, followed by conclusions in Section 5.

Literature Review
Recent research shows prominent improvements in OCR for many languages, especially those that are cursive in nature.
These include Arabic, Urdu, Persian, and others that share the same cursive script. Optical character recognition (OCR) systems convert images of text into a computer-readable form. The recognition rate of cursive scripts is low compared to noncursive scripts such as English. This is due to the ambiguity in writing styles. Recent work on OCR for cursive-text-based languages has achieved significant results, which are discussed below.
Khan et al. [16] provided a baseline study for handwritten Pashto character recognition using zoning features and three different classifier models. The proposed model showed an accuracy of 56% for a support vector machine, 78% for an artificial neural network, and 80.7% for a convolutional neural network. A dataset of 4488 characters was used for simulation purposes in their research work. Bhuiyan and Alsaade [17] suggested a hybrid neural network model for Arabic character recognition.
They built the hybrid network by combining a bidirectional associative memory (BAM) and a multilayer perceptron (MLP). Tavoli et al. [18] proposed a new feature extractor for the recognition of Arabic and Persian words, namely, the statistical geometric components of straight lines (SGCSL) technique. Oujaoura et al. [19] suggested a method for offline Arabic letter identification using three feature extraction techniques, including Zernike moments, in conjunction with neural networks. Zernike moments outperformed the other two techniques in recognition rate.
Boufenar et al. [10] conducted a study of handwritten Arabic character recognition on the well-known Offline Isolated Handwritten Arabic Character (OIHACDB) and Arabic Handwritten Character Database (AHCD) datasets using deep convolutional neural networks (DCNN) and showed state-of-the-art accuracy using this method. Younis presented a DCNN for handwritten Arabic character recognition [20]. He also performed batch normalization to prevent overfitting. The model was tested on the AIA9K and AHCD datasets.
Jebril et al. [21] used histogram of oriented gradients (HOG) features and support vector machines as the classifier on a self-made database. Althobaiti and Lu suggested a novel approach to feature extraction using an encoded Freeman chain code and change of tangent for isolated handwritten Arabic character recognition [22,23]. Jehangir et al. [24] proposed Zernike moments for feature extraction and linear discriminant analysis for the automatic recognition of handwritten Pashto text.
Naz et al. [25] presented the use of 2-dimensional long short-term memory (2DMLSTM) networks for Urdu script recognition based on zoning features. The referenced model is tested on the Urdu Printed Text line Images (UPTI) dataset. Ahmed et al. [26] presented an algorithm for Urdu character recognition using bidirectional long short-term memory (BLSTM) on the Urdu nasta'liq handwritten dataset (UNHD). Jameel and Kumar proposed basis spline (B-spline) curves for Urdu character recognition [27]. Nawaz et al. [28] compared Siamese and triplet networks and showed a performance improvement when they are combined with a CNN for handwritten Urdu character recognition.
This work is based on offline Pashto character recognition using k-nearest neighbors (k-NN) as a classification tool.
The histogram of oriented gradients (HOG) and zoning-based density techniques are used as the feature extraction tools in the proposed research work.
This work proposes a character database of 11352 character images (258 samples for each of the 44 characters in the Pashto language). The performance capabilities of the proposed OCR system are tested using 10-fold cross validation.

Proposed Methodology
The proposed handwritten Pashto character recognition system consists of 4 main phases, as depicted in Figure 1: the data collection and accumulation phase, the data processing and character database development phase, the feature extraction and feature map development phase, and, finally, the recognition and identification phase. The data collection and accumulation phase is completed by collecting handwritten Pashto samples from different people, while the preprocessing steps include scanning and correction steps. This phase aims to prepare the data for feature extraction, because properly formed characters yield accurate feature values, which ultimately result in high recognition rates for the handwritten characters. For feature extraction, we propose the HOG and zoning techniques.
These techniques capture the distinguishing numerical values of the characters. The classification and recognition phase is completed using a k-NN classifier based on the feature map accumulated with the HOG and zoning techniques. Individual characters are extracted from the scanned images to build a database. The database thus formed contains 258 samples for each of the 44 characters, for a total of 11,352 characters (258 × 44 = 11,352).
The final database contained character images with nonuniform dimensions and off-centre characters (appearing at the top, bottom, right, or left), as shown in Figure 2. These sliced images were preprocessed to form normalized and centralized character images.

Preprocessing.
Preprocessing is a preliminary step necessary to achieve better classification accuracy. Preprocessing steps are applied here to form normalized and centralized character images. We applied the following preprocessing steps.

Size Normalization.
To achieve the best classification results, the sliced images must be normalized and centralized. Normalization scales each image to a fixed size. All images here are normalized to 64 × 64 pixels and converted to grayscale. Figure 3 shows the size normalization to 64 × 64.
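The normalization step can be illustrated with a short, dependency-free sketch. The interpolation method and the colour-to-gray conversion are not stated in the text, so nearest-neighbour sampling and the ITU-R BT.601 luma weights are assumed here purely for illustration, and the function names `to_grayscale` and `resize_nearest` are hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray levels.

    Assumption: BT.601 luma weights; the paper does not name a formula.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def resize_nearest(image, size=64):
    """Rescale a 2-D list of pixels to size x size.

    Assumption: nearest-neighbour sampling; the paper does not name
    the interpolation it used.
    """
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


# A sliced character image of arbitrary size becomes a 64 x 64 gray image.
sliced = [[(255, 255, 255)] * 100 for _ in range(80)]
normalized = resize_nearest(to_grayscale(sliced), size=64)
```

A production pipeline would more likely rely on an image library with anti-aliased resampling; the sketch only makes the fixed output size explicit.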

Centralization. Some of the images contained characters that occurred at different positions (top, bottom, right, and left). First, the centroids of the character and of the image are calculated separately, so that every character can be fixed at the central point and accurate features can be calculated for each handwritten Pashto character. In our case, the character image is 64 × 64 pixels, so its central point is (32, 32). Then, the character centroid is shifted to the centroid of the image to produce a centralized image. Figure 4 shows the centralization of the characters "alif" and "twe."
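The centroid shift described above can be sketched as follows. This is a minimal illustration under stated assumptions: the image is a 2-D list in which nonzero values are ink, the image centre is taken as the 0-based geometric centre ((h - 1) / 2, (w - 1) / 2), which may differ by half a pixel from the indexing convention behind the (32, 32) quoted in the text, and ink shifted outside the frame is discarded. The names `centroid` and `centralize` are hypothetical.

```python
def centroid(image):
    """Return the (row, col) centroid of the nonzero (ink) pixels."""
    count = row_sum = col_sum = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        raise ValueError("image contains no ink pixels")
    return row_sum / count, col_sum / count


def centralize(image):
    """Shift the ink so that its centroid lands on the image centre."""
    height, width = len(image), len(image[0])
    char_r, char_c = centroid(image)
    # Whole-pixel offset from the character centroid to the image centre.
    d_row = round((height - 1) / 2 - char_r)
    d_col = round((width - 1) / 2 - char_c)
    shifted = [[0] * width for _ in range(height)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            new_r, new_c = r + d_row, c + d_col
            if value and 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = value
    return shifted


# A 2 x 2 blob in the top-left corner of an 8 x 8 image moves to the middle.
demo = [[0] * 8 for _ in range(8)]
for r in (0, 1):
    for c in (0, 1):
        demo[r][c] = 1
centred = centralize(demo)
```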

Feature Extraction.
Feature extraction is a pivotal stage of an OCR system. Features describe the image in terms of numerical values. There are two types: statistical features are calculated via mathematical computations, whereas structural features are derived from the structure of the image. A good feature extractor should discriminate between different characters while retaining similarity for similar character images. We applied two techniques, namely, histogram of oriented gradients (HOG) and zoning-based density features, to our database and compared their results.

Zoning Features. Zoning-based features are efficient for reading and extracting accurate image patterns. Due to its high feature extraction capabilities, this technique is frequently used in many text recognition problems.
This technique divides the image into 8 × 8 zones and then calculates the pixel density in each zone; these densities form the feature vector. It gives a feature vector of 64 features. Figure 5 shows the character "bhe" divided into 64 zones.
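A minimal sketch of the zoning computation is given below. It assumes a binarized 64 × 64 image stored as a 2-D list with nonzero values marking ink, and it defines density as the fraction of ink pixels in a zone; the text says only "pixel densities", so that exact definition is an assumption, as is the function name `zoning_features`.

```python
def zoning_features(image, zones=8):
    """Return zones * zones density values, scanned zone by zone, row-major."""
    height, width = len(image), len(image[0])
    zone_h, zone_w = height // zones, width // zones
    features = []
    for zone_r in range(zones):
        for zone_c in range(zones):
            ink = sum(
                1
                for r in range(zone_r * zone_h, (zone_r + 1) * zone_h)
                for c in range(zone_c * zone_w, (zone_c + 1) * zone_w)
                if image[r][c]
            )
            # Density: ink pixels divided by the zone area.
            features.append(ink / (zone_h * zone_w))
    return features


# Fill the first zone completely and half of the second zone.
sample = [[0] * 64 for _ in range(64)]
for r in range(8):
    for c in range(8):
        sample[r][c] = 1
    for c in range(8, 12):
        sample[r][c] = 1
vector = zoning_features(sample)
```

For a 64 × 64 image each zone covers 8 × 8 pixels, so the vector has 64 entries, each between 0 and 1.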

Histograms of Oriented Gradients.
The histogram of oriented gradients (HOG) was first introduced by Dalal and Triggs [29]. Its primary purpose was human detection. Nowadays, this technique is used for character recognition [21,30,31], pedestrian detection [32], face recognition [33], and many other problems of interest. We generated HOG features using a cell size of 16 × 16 pixels, a block size of 2 × 2 cells, and 9 bins. The HOG visualization over the Pashto character "ye" is shown in Figure 6.
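The size of the resulting HOG descriptor follows from these parameters once a block stride is fixed. The text does not state the stride or the final feature length, so the sketch below assumes a 64 × 64 input and a stride of one cell (overlapping blocks, a common default) and also shows the non-overlapping alternative. The function name `hog_feature_length` is hypothetical.

```python
def hog_feature_length(image_size=64, cell_size=16, block_cells=2,
                       bins=9, block_stride_cells=1):
    """Number of values in a HOG descriptor for a square image.

    Assumption: blocks slide by block_stride_cells cells in each direction.
    """
    cells_per_side = image_size // cell_size
    blocks_per_side = (cells_per_side - block_cells) // block_stride_cells + 1
    values_per_block = block_cells * block_cells * bins
    return blocks_per_side * blocks_per_side * values_per_block


overlapping = hog_feature_length()                         # stride of one cell
non_overlapping = hog_feature_length(block_stride_cells=2)
```

With 16 × 16 cells a 64 × 64 image has 4 × 4 cells. A one-cell stride gives 3 × 3 blocks of 2 × 2 × 9 = 36 values each, that is, 324 values; non-overlapping blocks would give 2 × 2 × 36 = 144.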

Results
Results are calculated for the proposed system based on a zoning-based density feature set and a histogram of oriented gradients (HOG) feature set. For each feature extraction technique, the results are obtained with the k-nearest neighbors (k-NN) classification model using 10-fold cross validation. Ten-fold cross validation using the 1-NN classifier with HOG features achieved an accuracy of 80.34%, while zoning features achieved a relatively lower accuracy of 76.42%.
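For readers unfamiliar with the protocol, 10-fold cross validation can be sketched as below: the samples are shuffled once, split into ten folds, and each fold serves once as the test set while the other nine form the training set. The shuffling, the seed, and the function name `k_fold_indices` are illustrative assumptions; the text does not say how the folds were drawn or whether they were stratified by character class.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


# With the 11352 samples of the database, each fold holds 1135 or 1136 images.
splits = list(k_fold_indices(11352))
```

The reported accuracy would then be the mean of the ten per-fold accuracies.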
The results are compared, and a graph is generated using these data, as shown in Figure 7. Accuracy tends to increase as the training data increase because the classifier learns more accurately and produces better results.
The k-NN parameter k was tuned, and the best score, obtained for k = 1, was calculated using the Euclidean distance, as depicted in Figure 8.
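The classification rule itself is simple enough to sketch in a few lines. The version below is a brute-force k-NN with Euclidean distance and majority voting; the function name `knn_predict` and the toy data are hypothetical, and no claim is made that this matches the implementation actually used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=1):
    """Label a query vector by majority vote among its k nearest neighbours."""
    neighbours = sorted(
        ((math.dist(features, query), label)
         for features, label in zip(train_features, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbours[:k])
    # With k = 1 this is simply the label of the closest training sample.
    return votes.most_common(1)[0][0]


# Toy two-class example with 2-D feature vectors.
toy_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
toy_labels = ["alif", "alif", "ye", "ye"]
nearest_one = knn_predict(toy_features, toy_labels, [0.2, 0.1], k=1)
nearest_three = knn_predict(toy_features, toy_labels, [5.5, 5.0], k=3)
```

In the setting of this paper the feature vectors would be the zoning or HOG descriptors, and k would be swept over several values to produce an accuracy curve like the one in Figure 8.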
Figure 8 shows the k value vs. accuracy. The accuracy obtained for each value of k is listed in Table 1. From Figures 9 and 10, it is evident that as the training set grows, the overall recognition rate of the classifier increases, but the time consumption also increases. The applicability of the system is also validated using other performance metrics, namely precision, false-positive rate, false-negative rate, true-positive rate, true-negative rate, F1 score, and accuracy, based on both the HOG-based and zoning-based feature maps. Experimental results based on these performance metrics are depicted in Figure 11.
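The metrics listed above follow from the four confusion counts of a single class treated one-vs-rest. The text does not say how the per-class values were aggregated over the 44 characters, so the sketch below covers a single class only; the function names and the example counts are hypothetical and are not results from the paper.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def one_vs_rest_metrics(tp, fp, fn, tn):
    """Compute the listed metrics from one class's confusion counts."""
    return {
        "precision": safe_ratio(tp, tp + fp),
        "true_positive_rate": safe_ratio(tp, tp + fn),
        "false_negative_rate": safe_ratio(fn, tp + fn),
        "false_positive_rate": safe_ratio(fp, fp + tn),
        "true_negative_rate": safe_ratio(tn, fp + tn),
        "f1_score": safe_ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
    }


# Made-up counts for one character class, for illustration only.
example = one_vs_rest_metrics(tp=8, fp=2, fn=2, tn=88)
```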

Conclusions
Handwritten text recognition is regarded as one of the most daunting tasks in this research area. During the last two decades, cursive text recognition has gained significant interest in the research community. However, the unavailability of a standard database makes it more challenging. To address these problems, an OCR system for handwritten Pashto character recognition is presented in this paper. A medium-sized database containing 11352 character samples (44 characters × 258 samples) was developed for the analysis and experimental work. Histogram of oriented gradients and zoning techniques are used for feature extraction. The resulting feature map is used for the identification and recognition of the handwritten Pashto characters with the k-NN classification tool. Based on the calculated feature map, the histogram of oriented gradients gives an accuracy of 80.34%, while zoning-based density features give an accuracy of 76.42%. Ten-fold cross validation was applied to evaluate the system results.

3.1. Data Collection. Since there is no standard database available for handwritten Pashto characters, a medium-sized database is developed by collecting data from different people. Most of the samples are collected from the Department of Pashto, University of Swabi, KP (Khyber Pakhtunkhwa), Pakistan, from multiple students and teachers varying in age, gender, and educational background.

Figures 9 and 10: Training data vs. time vs. accuracy for HOG and zoning-based density features.