A Framework for the Selection of Binarization Techniques on Palm Leaf Manuscripts Using Support Vector Machine

Challenges for text processing in ancient document images are mainly due to the high degree of variation in foreground and background. Image binarization is an image segmentation technique used to separate the image into text and background components. Although several techniques for binarizing text documents have been proposed, their performance varies and depends on the image characteristics. Therefore, selecting the binarization technique can be key to achieving improved results. This paper proposes a framework for selecting binarization techniques for palm leaf manuscripts using Support Vector Machines (SVMs). The overall process is divided into three steps: (i) feature extraction: feature patterns are extracted from grayscale images based on global intensity, local contrast, and intensity; (ii) treatment of imbalanced data: the imbalanced dataset is balanced using the Synthetic Minority Oversampling Technique to improve prediction performance; and (iii) selection: an SVM is applied to select the appropriate binarization technique. The proposed framework has been evaluated with palm leaf manuscript images and a benchmarking dataset from the DIBCO series, and the prediction performance on imbalanced and balanced datasets has been compared. Experimental results showed that the proposed framework can be used as an integral part of an automatic selection process.


Introduction
Binarization of ancient document images is a crucial process for removing unrelated artefacts and background noise. Image binarization is essential to document image analysis and significantly affects overall performance, since poor binarization results in poor recognition of the original characters and may add noise to the image. Therefore, determining the proper binarization technique can be a key factor in achieving promising results from document image analysis.
Many binarization techniques [1][2][3][4] have been proposed in the literature. Some techniques perform well on certain datasets but not on others. For this reason, selecting the binarization technique can be a key step in improving the performance of document image analysis. In most circumstances, a human operator compares results from different processing techniques and then selects one of the approaches based on visual impression, examination, or intuition. However, an automated approach requires some form of quantitative assessment so that the "optimal" technique is selected. An automated selection process, if implemented, would assist and improve system performance. This study therefore aims to propose a selection process for the most appropriate binarization technique by machine learning; in particular, the selection is based on Support Vector Machines (SVMs) due to their appropriateness for classification problems [5].
Over the past five centuries, palm leaves have been used as one of the most popular media for written documents in Asian regions. These ancient documents are heritage passed down through many generations. Libraries and museums all across Thailand contain a large collection of palm leaf manuscripts written in ancient local languages. Currently, scanners are able to binarize documents with a good contrast of foreground components and a uniform background [6].

Advances in Decision Sciences
However, most of the palm leaf manuscripts are of poor quality due to smeared or smudged characters, poor writing, and nonuniform changes in color caused by long-term storage.
In this study, the proposed framework has been applied to select binarization techniques on a practical dataset collected by the project for Palm Leaf Preservation in Northeastern Thailand Division, Mahasarakham University [7]. The benchmarking dataset from the DIBCO series [8][9][10][11][12][13] was also used to evaluate the framework. Four binarization techniques for degraded documents are used in the proposed method: the Adaptive Logical Level (ALL) technique [4], the Improvement of Integrated Function (IIF) algorithm [3], the Background Estimation (BE) technique [14], and the Local Maximum and Minimum (LMM) technique [2]. The framework first extracts feature patterns from a grayscale image by considering global intensity, local contrast, and intensity. As the data in the datasets could be imbalanced, the Synthetic Minority Oversampling Technique (SMOTE) [15] is used to synthesize data in order to balance the classes and improve performance.
The remainder of this paper is structured as follows: the proposed selection framework of binarization techniques is described in the next section. Experimental results are then given in Section 3, followed by a conclusion of this work.

Selection Framework of Binarization Techniques
This section explains the selection process based on machine learning. The selection is performed by classifying the appropriate technique based on the features extracted from the image. In this study, the issue of imbalanced data has been addressed in order to improve accuracy. An SVM is then used to select the appropriate binarization technique for generating the binary image. Figure 1 illustrates the overall process of the proposed method for selecting the optimal technique. The dataset is first separated into two sets, that is, a training set and a test set, for the learning process. For local binarization techniques, intensity and contrast have been the most frequently used features [16]. A contrast feature has also been used and modified by Su et al. [17]. If a significant intensity change occurs at the boundary between the foreground text and the background, the grayscale contrast indicates the characteristics that differentiate the foreground from the background. In this study, the contrast feature has been used for feature extraction, and the contrast value has been modified by decomposing the image into subimages. In addition, this study also uses the intensity values through the mean, standard deviation, maximum, and minimum of the intensity of the subareas. The features of an image used in this study are explained below.
(a) Global Features. The histogram of an image represents the relative frequency of occurrence of the various gray levels in the image. It gives a global description of the image, and the shape of the histogram reveals significant contrast information. A discrete function of the histogram [18], $p(l)$, is given by the relation

$$p(l) = \frac{n_l}{n},$$

where $l$ is the level of grayscale with $0 \le l \le L - 1$, $n_l$ is the number of pixels in the image with the $l$th level of grayscale, and $n$ is the total number of pixels in the image.
Figure 2: Decomposition of an image into subimages.
The image histogram carries important content of the image. For global binarization techniques based on clustering, this content is useful for distinguishing the foreground from the background of an image.
In this study, 64 bins of the grayscale histogram of the image are extracted and used as features for the selection module. They represent the global characteristics of the image and can assist the decision on selecting the appropriate technique.
The mean and standard deviation of the intensity of an image [19] represent compact features. The mean of an image ($\mu_I$) captures the first-order moment, and the standard deviation of the image ($\sigma_I$) captures the second-order moment. These expressions are as follows:

$$\mu_I = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} I(x, y),$$

$$\sigma_I = \sqrt{\frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} \bigl(I(x, y) - \mu_I\bigr)^2},$$

where $I(x, y)$ is the intensity value of the pixel at position $(x, y)$, $M$ is the number of columns, and $N$ is the number of rows of the image. The other two intensity features used in this study are the minimum $I_{\min}$ and maximum $I_{\max}$ intensity values of the image.
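The global feature vector described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `global_features`, the assumption of 256 gray levels grouped four per bin, and the list-of-rows image layout are all hypothetical choices made for the example.

```python
import math

def global_features(image, bins=64, levels=256):
    """Return [64 histogram bins, min, max, mean, std] for a grayscale image.

    `image` is a list of rows of integer intensities in [0, levels - 1].
    The 256-level assumption and equal-width bins are illustrative only.
    """
    pixels = [p for row in image for p in row]
    n = len(pixels)
    width = levels // bins  # 4 gray levels per bin for 256 levels and 64 bins
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p // width, bins - 1)] += 1.0 / n  # relative frequency n_l / n
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    return hist + [min(pixels), max(pixels), mean, std]

image = [[0, 64, 128], [255, 255, 32]]
features = global_features(image)
print(len(features))  # 68
```

The length of 68 matches the count of global features given in the text: 64 histogram bins plus the minimum, maximum, mean, and standard deviation.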
(b) Local Features. For image binarization, intensity and contrast have been widely used as descriptors to classify foreground and unrelated objects in images [16]. An example of image binarization implemented using contrast features was proposed by Su et al. [17]. Because a significant intensity change occurs at the boundary between the foreground text and the background, the contrast of gray levels relates to the characteristics of the foreground and background. In this paper, contrast features are used as a local feature by decomposing the image into subimages, as shown in Figure 2. In addition, this study also uses pixel intensity-based features by calculating the mean, standard deviation, maximum, and minimum of the intensity of the subimages.
The contrast feature in a local neighborhood from Su et al.'s study [17] is defined as follows:

$$\mathrm{Cont}(x, y) = \frac{I_{\max}(x, y) - I(x, y)}{I_{\max}(x, y) + I(x, y) + \epsilon},$$

where $I(x, y)$ and $I_{\max}(x, y)$ denote the intensity of pixel $(x, y)$ and the maximum intensity value within the local area. $\epsilon$ is an infinitely small positive number, which is added in case the local maximum is equal to 0. $\mathrm{Cont}(x, y)$ refers to the contrast value of the pixel $(x, y)$ being estimated.
In this study, the contrast feature is calculated in each subimage, and this feature is modified to the following expression:

$$\mathrm{Cont}(i, j) = \frac{I_{\max}(i, j) - I_{\mu}(i, j)}{I_{\max}(i, j) + I_{\mu}(i, j) + \epsilon},$$

where $I_{\mu}(i, j)$ and $I_{\max}(i, j)$ denote the average intensity of subimage $(i, j)$ and the maximum intensity value of subimage $(i, j)$. $\epsilon$ is an infinitely small positive number, which is added in case the maximum intensity value is equal to 0. $\mathrm{Cont}(i, j)$ refers to the contrast value of the subimage $(i, j)$ being estimated, with $1 \le i \le P$ and $1 \le j \le Q$, where $P$ and $Q$ are the numbers of columns and rows of the decomposed image.
The mean and standard deviation of the intensity of an image [19] represent compact features and are computed here for each subimage. The mean ($\mu_I$) captures the first-order moment, and the standard deviation ($\sigma_I$) captures the second-order moment. These expressions are as follows:

$$\mu_I = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} I(x, y),$$

$$\sigma_I = \sqrt{\frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} \bigl(I(x, y) - \mu_I\bigr)^2},$$

where $I(x, y)$ is the intensity value of the pixel at position $(x, y)$, $M$ is the number of columns, and $N$ is the number of rows of the image.
The other two intensity features used in this study are the minimum $I_{\min}(i, j)$ and maximum $I_{\max}(i, j)$ intensity values of subimage $(i, j)$.
Forty-five subimages have been considered in this study. From five features (contrast, mean, standard deviation, maximum, and minimum), 225 feature patterns are then extracted from an image. The overall number of features is 293, including 68 feature patterns of global features.
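The local feature extraction can be sketched as follows. Several points here are assumptions for illustration only: the text gives 45 subimages but not the grid shape, so a 5-row by 9-column grid is used; the contrast is assumed to take the normalized-difference form (subimage maximum minus subimage mean, divided by their sum plus a small constant); and the names `split_blocks` and `local_features` are hypothetical.

```python
import math

EPS = 1e-9  # small positive constant guarding against a zero denominator

def split_blocks(image, rows, cols):
    """Yield subimages as flat pixel lists, scanning the grid row by row."""
    h, w = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            yield [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]

def local_features(image, rows=5, cols=9):
    """Return contrast, mean, std, max, min for each subimage (5 per block)."""
    feats = []
    for block in split_blocks(image, rows, cols):
        n = len(block)
        mean = sum(block) / n
        std = math.sqrt(sum((p - mean) ** 2 for p in block) / n)
        hi, lo = max(block), min(block)
        contrast = (hi - mean) / (hi + mean + EPS)  # assumed form of the contrast
        feats.extend([contrast, mean, std, hi, lo])
    return feats

# Synthetic 50 x 90 test image; each subimage is then 10 x 10 pixels.
image = [[(x * y) % 256 for x in range(90)] for y in range(50)]
print(len(local_features(image)))  # 225
```

The image must be at least as large as the grid in both directions so that no subimage is empty.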

Principal Component Analysis.
In machine learning, it is common to deal with data having high-dimensional input features. To improve prediction performance, dimensionality reduction may be applied through a transformation of the original data. Principal Components Analysis (PCA) is the most widely used technique [21] and a powerful tool [20] for feature selection in the transformed space for dimension reduction. This technique is an unsupervised method based on a correlation or covariance matrix that has been used in applications such as face recognition and image compression [20].
PCA [22] is calculated from the eigenvectors and eigenvalues of the data covariance matrix. The process finds the axis system in which the covariance matrix is diagonal. This technique can reduce the dimension of the representation while preserving as much of the original information content as possible. The next subsection addresses the problem of imbalanced data for machine learning.
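As a rough illustration of the eigenvector view of PCA, the sketch below finds only the first principal component of a small dataset by power iteration on the sample covariance matrix. It is not the procedure used in the study; the function name `pca_first_component`, the toy data, and the choice of power iteration (instead of a full eigendecomposition) are assumptions made to keep the example dependency-free.

```python
import math

def pca_first_component(data, iterations=200):
    """Return (unit axis, eigenvalue, scores) for the largest-variance direction."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; the start vector must not be orthogonal to the answer.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    # Project the centered data onto the principal axis (2-D reduced to 1-D).
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, eigenvalue, scores

data = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.1], [3.0, 3.0], [4.0, 3.9]]
axis, variance, scores = pca_first_component(data)
print(axis, variance)
```

The variance of the projected scores equals the leading eigenvalue, which is the sense in which the projection preserves as much of the original variation as a single axis can.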

Treating Imbalanced Data with SMOTE.
Real-world datasets usually suffer from imbalanced data. This is a significant problem affecting learning algorithms. It arises in situations where some classes have many more instances than others. Examples of real-world cases with an imbalanced dataset are biomedical applications, fraud detection, and network intrusion [23].
The issue of imbalanced data needs to be approached at the data level, with the objective of balancing the training data before the learning process is applied. Approaches to deal with imbalanced data can be separated into three categories: undersampling, oversampling, and combined techniques. The undersampling technique aims to balance the dataset by removing instances of the majority class, while oversampling aims to balance the dataset by adding instances of the minority class. The combined technique is a combination of both undersampling and oversampling.
In this study, some classes contain only a few instances in the dataset; an oversampling technique is therefore adopted. There are several oversampling techniques, such as the Random Oversampling Technique and the Synthetic Minority Oversampling Technique (SMOTE) [15]. SMOTE has been shown to be a successful method in many applications [23]; it generates synthetic data based on the feature space similarities between minority examples. Other techniques, such as the Random Oversampling Technique, perform oversampling by replicating minority class instances randomly. For this reason, the SMOTE algorithm may avoid the overfitting problem [24]. The SMOTE algorithm used in this study is shown in Algorithm 1.
The number of minority class instances is increased by the above algorithm, and the synthetic instances are generated using the Euclidean distance. The minority class instances that are close together are considered first, before they are employed to form new minority class instances.
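The interpolation step of SMOTE can be sketched as below. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `smote`, the parameters `percent`, `k`, and `seed`, and the toy minority set are hypothetical, and `percent` is interpreted as the number of synthetic instances per original instance times 100.

```python
import math
import random

def smote(minority, percent=200, k=3, seed=0):
    """Return synthetic minority instances (percent / 100 per original instance)."""
    rng = random.Random(seed)
    per_instance = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x by Euclidean distance, excluding x.
        others = [z for j, z in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda z: math.dist(x, z))[:k]
        for _ in range(per_instance):
            z = rng.choice(neighbours)      # pick one neighbour at random
            delta = rng.random()            # random number in [0, 1)
            # New point on the line segment between x and the chosen neighbour.
            synthetic.append([a + (b - a) * delta for a, b in zip(x, z)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_points = smote(minority, percent=200, k=3)
print(len(new_points))  # 10
```

Because every synthetic point lies between two existing minority instances, the new data stay inside the region already occupied by the minority class instead of duplicating points exactly.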
In this study, the class imbalance problem in multiclass data was addressed with the One-Against-All (OAA) scheme [25]. The OAA scheme is a promising technique for multiclass problems and is suitable for small training sets [26]. This scheme can be used to deal with the data balancing issue across multiple classes, and it can also reduce the complexity of the machine learning process. As some of the classes in this study contain few instances, the OAA scheme was applied.
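The relabeling behind a one-against-all scheme can be illustrated with a short sketch. The helper names `one_against_all` and `predict`, the toy labels, and the example decision scores are hypothetical; the sketch only shows how one multiclass problem becomes one binary problem per class.

```python
def one_against_all(labels):
    """Build one +1/-1 target vector per class (that class vs. all the others)."""
    return {c: [1 if y == c else -1 for y in labels] for c in sorted(set(labels))}

def predict(scores):
    """Pick the class whose binary classifier returns the largest decision score."""
    return max(scores, key=scores.get)

labels = ["LMM", "ALL", "LMM", "BE", "IIF", "LMM"]
targets = one_against_all(labels)
print(targets["LMM"])  # [1, -1, 1, -1, -1, 1]
print(predict({"LMM": 0.3, "ALL": -1.2, "BE": 0.9, "IIF": -0.4}))  # BE
```

Each binary problem can then be balanced and trained on its own, which is one way the scheme interacts with the oversampling step.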

Selection Module.
The selection process of this study is based on the Support Vector Machine (SVM) due to its appropriateness for classification problems. SVM is a classification method based on statistical learning theory, introduced by Vapnik [27], and its applications have provided good results. In this study, SVM was used to select the appropriate binarization technique by learning from the feature patterns of a training dataset. The selected binarization technique is then used to generate the binary image.
The decision function of the SVM is calculated from a training dataset. In this study, radial basis functions (RBFs) were used as the kernel to separate the classes. The libSVM library [28] was used to implement this module.
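For reference, the standard form of a binary SVM decision function with an RBF kernel can be written as follows. The notation is the usual textbook one and is an assumption here, since the text does not give symbols: $\mathbf{x}_i$ are the training feature vectors, $y_i \in \{-1, +1\}$ their labels, $\alpha_i$ the learned coefficients (nonzero only for support vectors), $b$ the bias, and $\gamma > 0$ the kernel width parameter.

```latex
f(\mathbf{x}) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \Big),
\qquad
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\big( -\gamma \, \lVert \mathbf{x}_i - \mathbf{x} \rVert^{2} \big)
```

In a one-against-all setting, one such function is trained per binarization technique, and the technique with the largest value of the sum before taking the sign is selected.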

Experimental Results
In this experiment, the selection framework was evaluated using SVM. In the next subsection, the selection framework is evaluated with the real-world dataset of palm leaf manuscripts and the dataset of the DIBCO series, including DIBCO 2009, H-DIBCO 2010, DIBCO 2011, H-DIBCO 2012, and DIBCO 2013 [8][9][10][11][12][13]. The ground truth of the palm leaf images was selected by human visual inspection, while the ground truth images of the DIBCO series were generated following a semiautomatic procedure based on [29]. In the following subsection, the original binarization techniques and the proposed framework are compared on the dataset of the DIBCO series.

Evaluation of the Selection Framework.
In general, accuracy is used to illustrate the overall classification performance. For an imbalanced dataset, where the numbers of instances per class differ greatly, this measure may be unsuitable because misclassification of the smaller classes may occur [24]. Other evaluation measures for the imbalance problem have been proposed, namely, the F-measure (FM), the geometric mean (G-mean), and the area under the ROC curve (AUC) [23,30]. These indicators aim to maximize accuracy on both the minority class and the majority class, so they are well suited to the class imbalance problem. These measures were therefore applied to evaluate the performance of the selection of the binarization techniques in this study. The overall features comprised 68 global features (64 bins of histogram, a minimum, a maximum, a mean, and a standard deviation value of the intensity of the image) and 225 local features (45 subimages each of contrast, mean, standard deviation, maximum, and minimum). In this experiment, the benchmarking dataset and the palm leaf manuscript dataset were used, with 10-fold cross-validation on each dataset. The datasets of the experiment were separated into two types: the imbalanced dataset and the balanced dataset (the imbalanced dataset after applying SMOTE). The details of each dataset are described in the following subsections.
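The weakness of accuracy on imbalanced data, mentioned at the start of this subsection, can be seen in a short sketch. The definitions below are one common multiclass choice (G-mean as the geometric mean of per-class recalls) and are an assumption, since the text does not spell out the exact formulas used; the function names are hypothetical. The class counts are those of the imbalanced benchmarking dataset, and the predictions come from a deliberately trivial classifier that always answers LMM.

```python
import math

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hit = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recall[c] = hit / total
    return recall

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    r = per_class_recall(y_true, y_pred)
    return math.prod(r.values()) ** (1 / len(r))

y_true = ["LMM"] * 42 + ["ALL"] * 7 + ["BE"] * 12 + ["IIF"] * 5
y_pred = ["LMM"] * 66  # always predict the majority class
print(round(accuracy(y_true, y_pred), 3))  # 0.636
print(g_mean(y_true, y_pred))              # 0.0
```

Accuracy looks moderate because the majority class dominates, while the G-mean drops to zero as soon as any class is never recognized.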

3.1.1. Benchmarking Dataset from DIBCO Series.
This dataset was divided into three categories, as follows.
(1) Imbalanced Dataset. The dataset is composed of 66 instances (images) in four classes: 42 instances in LMM, 7 instances in ALL, 12 instances in BE, and 5 instances in IIF (ratio of instances, LMM : ALL : BE : IIF = 64 : 10 : 18 : 7).
(3) Balanced Dataset 2. SMOTE was also applied to synthesize the minority classes from the imbalanced dataset, but the number of instances of minority classes in ALL was increased by 400 percent, in BE by 200 percent, and in IIF by 400 percent. The number of instances after synthesis was 138, with 42 instances in LMM, 36 instances in ALL, 35 instances in BE, and 25 instances in IIF (ratio of instances, LMM : ALL : BE : IIF = 30 : 26 : 25 : 18).

Palm Leaf Manuscript Dataset.
The datasets of the experiment were separated into two groups, as follows.
(2) Balanced Dataset. As the class distribution of this dataset is imbalanced, SMOTE was applied to synthesize the minority classes. LMM is the majority class, and the minority classes are ALL, BE, and IIF. The number of instances of minority classes in ALL was increased by 100 percent, in BE by 200 percent, and in IIF by 200 percent. The number of instances after synthesis was 784, with 280 instances in LMM, 192 instances in ALL, 174 instances in BE, and 138 instances in IIF (ratio of instances, LMM : ALL : BE : IIF = 36 : 24 : 22 : 18).
The performance of the selection of binarization techniques on the imbalanced and balanced datasets is shown in Table 1.
With respect to selection on the imbalanced dataset, the performance of class LMM is significantly better than that of classes ALL, BE, and IIF, which have fewer instances. After applying SMOTE to balance the dataset, the performance of class LMM increased slightly, while the performance of classes ALL, BE, and IIF improved greatly. The selection accuracy, G-mean, and AUC also improved significantly over the imbalanced dataset in both datasets. In the palm leaf dataset, these measures improved by 40.8%, 90.2%, and 29.7%, respectively. In the benchmarking dataset, balanced dataset group 2 improved these measures by 11.8%, 73.6%, and 56.8%, respectively, whereas balanced dataset group 1 improved only in the G-mean and AUC measures, while accuracy decreased slightly.

A Comparison of the Proposed Framework and Original Binarization Techniques.
In this study, the proposed technique was compared to the original binarization techniques, including ALL, IIF, LMM, and BE, on the dataset from the DIBCO series. The evaluation measures used in this study to compare the binarized images with the ground truth images are described in [11]. These measures consist of F-measure, PSNR, distance reciprocal distortion metric (DRD), and misclassification penalty metric (MPM). The evaluation result is shown in Table 2.
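Two of these measures are simple enough to sketch directly. The code below computes the pixel-level F-measure and the PSNR for a pair of binary images, using the usual definitions (F-measure as the harmonic mean of precision and recall on text pixels; PSNR as 10 log10(C^2 / MSE) with C = 1 for 0/1 images). The function name and toy images are hypothetical, DRD and MPM are omitted, and the sketch assumes the images differ in at least one pixel and share at least one text pixel so that no division by zero occurs.

```python
import math

def f_measure_and_psnr(result, truth):
    """Compare two equal-size binary images (1 = text, 0 = background)."""
    pairs = [(r, t) for rr, tr in zip(result, truth) for r, t in zip(rr, tr)]
    tp = sum(1 for r, t in pairs if r == 1 and t == 1)
    fp = sum(1 for r, t in pairs if r == 1 and t == 0)
    fn = sum(1 for r, t in pairs if r == 0 and t == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    mse = sum((r - t) ** 2 for r, t in pairs) / len(pairs)
    psnr = 10 * math.log10(1.0 / mse)  # C = 1 for images with values 0 and 1
    return f, psnr

truth = [[1, 1, 0, 0], [0, 1, 0, 0]]
result = [[1, 0, 0, 0], [0, 1, 1, 0]]
f, psnr = f_measure_and_psnr(result, truth)
print(round(f, 4), round(psnr, 4))  # 0.6667 6.0206
```

Benchmark reports commonly state the F-measure as a percentage, which corresponds to multiplying the value above by 100.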
Among the four original binarization techniques in Table 2, the LMM technique provided the best results in terms of F-measure, PSNR, and DRD, while IIF gave the best result in terms of MPM. The proposed framework achieved better results on all measures than each of the original binarization techniques.
The automatic selection among multiple binarization techniques in this framework can assist users by recommending the appropriate technique. Figure 3 shows two sample results of an automatic selection from three binarization techniques (ALL, BE, and LMM) on palm leaf manuscripts.

Conclusions
This paper described experiments and results on document degradation and proposed a framework for automatic selection from multiple binarization techniques by using SVM with imbalanced and balanced datasets (by applying SMOTE) for palm leaf manuscripts. The framework was also evaluated on a benchmarking dataset. The learning performance of the selection was measured using F-measure, accuracy, G-mean, and AUC. In another experiment, the original binarization techniques and the proposed framework were compared based on F-measure, PSNR, DRD, and MPM.
With regard to the key points of this study, the automatic selection of binarization techniques used in this framework will be helpful to users by recommending the appropriate technique. A comparison of the selection framework on imbalanced and balanced datasets across all measures indicates that the selection performs better on the balanced dataset than on the imbalanced dataset. However, the performance of the selection framework on a dataset balanced by SMOTE is quite low in some measures if there are only a few instances, as in benchmarking dataset group 1. Because of this, the oversampling percentage of the minority classes must be increased. This framework forms a prototype system for research in this area. It is noted that this framework still needs to be refined, and the selection could be ranked by the user and modified into a semiautomatic approach. In addition, the top two or three rankings of the selection may also be combined to integrate these techniques. Furthermore, there is a need to generate benchmarking datasets on palm leaf manuscripts for future research.

(2) Balanced Dataset 1. As the class distribution of this dataset is imbalanced, SMOTE was applied to synthesize the minority classes. LMM is the majority class, and the minority classes are ALL, BE, and IIF. The number of instances of minority classes in ALL was increased by 200 percent, in BE by 100 percent, and in IIF by 200 percent. The number of instances after synthesis was 103, with 42 instances in LMM, 24 instances in ALL, 21 instances in BE, and 16 instances in IIF (ratio of instances, LMM : ALL : BE : IIF = 40 : 23 : 20 : 15).

Figure 3: Two sample results from an automatic selection of binarization technique.
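Algorithm 1: SMOTE.
$T$ is the input dataset
$S$ is the set of minority class instances
For each instance $x_i$ in $S$
    Find the $k$-Nearest Neighbours (minority class instances) to $x_i$ in $S$
    Obtain $\hat{x}_i$ by randomising one from the $k$ instances
    $\delta$ = random number between zero and one
    $x_{\text{new}} = x_i + (\hat{x}_i - x_i) \times \delta$
    Add $x_{\text{new}}$ to $S$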

Table 1: Performance of the selection of binarization techniques using SVM on imbalanced and balanced datasets.

Table 2: Evaluation results of original binarization techniques and the proposed framework on benchmarking dataset.