Text processing in ancient document images is challenging mainly because of the high degree of variation in foreground and background. Image binarization is an image segmentation technique used to separate an image into text and background components. Although several techniques for binarizing text documents have been proposed, their performance varies and depends on the image characteristics. Selecting a suitable binarization technique for each image can therefore be key to achieving improved results. This paper proposes a framework for selecting binarization techniques for palm leaf manuscripts using Support Vector Machines (SVMs). The overall process is divided into three steps: (i) feature extraction: feature patterns are extracted from grayscale images based on global intensity, local contrast, and local intensity; (ii) treatment of imbalanced data: the imbalanced dataset is balanced using the Synthetic Minority Oversampling Technique (SMOTE) to improve prediction performance; and (iii) selection: an SVM is applied to select the appropriate binarization technique. The proposed framework has been evaluated on palm leaf manuscript images and on benchmarking datasets from the DIBCO series, and its prediction performance has been compared between imbalanced and balanced datasets. Experimental results showed that the proposed framework can be used as an integral part of an automatic selection process.
Binarization of ancient document images is a crucial step for removing unrelated artefacts and background noise. It strongly affects the overall performance of document image analysis, because poor binarization leads to poor recognition of the original characters and can add noise to the image. Determining the proper binarization technique is therefore a key factor in achieving promising results from document image analysis.
There are many binarization techniques [
Over the past five centuries, palm leaves have been used as one of the most popular media for written documents in Asian regions. These ancient documents are heritage passed down through many generations. Libraries and museums all across Thailand contain a large collection of palm leaf manuscripts written in ancient local languages. Currently, scanners are able to binarize documents with a good contrast of foreground components and a uniform background [
In this study, a proposed framework has been applied to select the binarization techniques with practical dataset that was collected by the project for Palm Leaf Preservation in Northeastern Thailand Division, Mahasarakham University [
The remainder of this paper is structured as follows: the proposed framework for selecting binarization techniques is described in the next section. In Section
This section explains the selection process based on machine learning technique. The selection is performed by classifying the appropriate techniques based on the features extracted from the image. In this study, the issue of imbalanced data has been addressed in order to improve the accuracy. SVM is then used to select the appropriate binarization technique for generating the binary image. Figure
Overall process of the proposed method for selecting the optimal binarization techniques.
Feature extraction is an essential step in any learning method which transforms the characteristics of original data to feature patterns for decision making. This subsection explains the feature pattern of the images used in the dataset, and
The most commonly used features applied to global binarization techniques are intensity histograms. They are used to convey the colour distribution information. This forms a compact representation of the colour feature. Furthermore, the mean, standard deviation, minimum, and maximum of intensity are also used as global features.
For local binarization techniques, intensity and contrast have been the most frequently used features [
The image histogram carries important information about the content of the image. For global binarization techniques based on clustering, this information is useful for separating the pixels of an image into two classes, foreground and background.
In this study, a 64-bin grayscale histogram of the image is extracted and used as features for the selection module. It represents the global characteristics of the image and can assist the decision on selecting the appropriate technique.
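The global feature vector described here can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes 8-bit gray levels, a histogram normalised to relative frequencies, and the population standard deviation, none of which are specified in the text. The function name `global_features` is hypothetical.

```python
import math

def global_features(image, bins=64, levels=256):
    """Return 64 histogram bins plus mean, std, min, and max of intensity.

    `image` is a list of rows of integer gray levels in [0, levels - 1].
    """
    pixels = [p for row in image for p in row]
    n = len(pixels)
    width = levels // bins  # 4 gray levels per bin when bins=64 and levels=256

    # Histogram normalised to relative frequencies (an assumption; the
    # text does not say whether raw counts or frequencies are used).
    hist = [0.0] * bins
    for p in pixels:
        hist[min(p // width, bins - 1)] += 1.0 / n

    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)  # population std
    return hist + [mean, std, float(min(pixels)), float(max(pixels))]

# Tiny synthetic image used only to show the shape of the output.
image = [[0, 64, 128], [255, 255, 32]]
features = global_features(image)
print(len(features))  # 64 histogram bins + 4 statistics
```

The four statistics are appended after the histogram so that all global features form one contiguous block of the final feature vector.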
The mean and standard deviation of the intensity of an image [
The other two intensity features are minimum
Image decomposition.
The contrast feature in local neighborhood from Su et al.’s study [
In this study, the contrast feature is calculated in each subimage, and this feature is then modified as the following expression:
The mean and standard deviation of the intensity of an image [
The other two intensity features are the minimum
Forty-five subimages are considered in this study. Five features (contrast, mean, standard deviation, maximum, and minimum) are extracted from each subimage, giving 225 local feature patterns per image. Together with the 68 global feature patterns (64 histogram bins plus the mean, standard deviation, minimum, and maximum of intensity), the overall number of features is 293.
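As a rough sketch of how the local features could be assembled, the example below splits an image into 45 subimages and computes the five features for each. Two points are assumptions made only for illustration: the 5 x 9 grid layout (the text gives only the total of 45 subimages), and the contrast formula, which uses the local maximum and minimum in the style of Su et al. as a stand-in for the modified expression used in this study. The function name `local_features` is hypothetical.

```python
import math

def local_features(image, rows=5, cols=9, eps=1e-6):
    """Return 5 features for each of rows * cols subimages.

    The 5 x 9 grid and the contrast formula are assumptions, not the
    authors' exact decomposition or their modified contrast expression.
    """
    height, width = len(image), len(image[0])
    contrast, means, stds, maxima, minima = [], [], [], [], []
    for r in range(rows):
        for c in range(cols):
            # Integer boundaries so the subimages tile the whole image.
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            block = [p for row in image[top:bottom] for p in row[left:right]]
            lo, hi = min(block), max(block)
            mean = sum(block) / len(block)
            std = math.sqrt(sum((p - mean) ** 2 for p in block) / len(block))
            # Stand-in contrast: (max - min) / (max + min + eps).
            contrast.append((hi - lo) / (hi + lo + eps))
            means.append(mean)
            stds.append(std)
            maxima.append(float(hi))
            minima.append(float(lo))
    return contrast + means + stds + maxima + minima

# Synthetic 10 x 18 image so that every subimage is 2 x 2 pixels.
image = [[(x * 7 + y * 13) % 256 for x in range(18)] for y in range(10)]
features = local_features(image)
print(len(features))  # 45 subimages * 5 features
```

The image must be at least 5 pixels high and 9 pixels wide so that no subimage is empty.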
In machine learning, data with high-dimensional input features are often unavoidable. Dimensionality reduction, which transforms the original data into a lower-dimensional representation, can be applied to improve prediction performance and to provide faster, lower-cost predictors.
PCA [
Real world datasets usually have the problem of imbalanced data. It is a significant problem affecting learning algorithms. It associates with the situations that some classes have much larger numbers of instances than the others. Examples of real world cases with an imbalanced dataset are biomedical applications, fraud detection, and network intrusion [
The issue of imbalanced data can be approached at the data level, with the objective of balancing the training data before the learning process is applied. Approaches to dealing with imbalanced data fall into three categories: undersampling, oversampling, and combined techniques. Undersampling balances the dataset by removing instances of the majority class, whereas oversampling balances it by adding instances to the minority class. Combined techniques use both undersampling and oversampling.
In this study, there are cases that contain only a few instances in the dataset; oversampling technique is therefore adopted. There are several oversampling techniques such as the
For each instance Find the Obtain Add End for
The number of new minority class instances is increased by the above algorithm and the synthetic instances are generated by
In this study, the class imbalanced problem in multiclass data was addressed with the
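The oversampling loop outlined above can be illustrated with a simplified SMOTE-style sketch. It is not the authors' implementation: the function name `smote`, the parameters `per_instance`, `k`, and `seed`, and the use of Euclidean distance are assumptions made for the example. For a multiclass problem, the function would be called once per minority class.

```python
import math
import random

def smote(minority, per_instance=1, k=5, seed=0):
    """Create synthetic minority instances by interpolating towards neighbours.

    `minority` is a list of at least two feature vectors from one class.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for i, x in enumerate(minority):
        # Find the k nearest minority neighbours of x (excluding x itself).
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(x, m))
        neighbours = others[:k]
        for _ in range(per_instance):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # random position on the segment x -> neighbour
            # Add the interpolated point as a new synthetic instance.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical minority class with four two-dimensional instances.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, per_instance=2, k=2)
print(len(new_points))  # 4 instances * 2 synthetic points each
```

Because each synthetic point lies on a segment between two existing minority instances, the new data stay inside the region already occupied by that class.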
A selection process of this study is based on
The decision function of SVM is calculated from a training dataset. In this study, radial basis functions (RBFs) were used to separate the classes. For building this module, the libSVM library [
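For reference, the radial basis function kernel has the standard form K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it for two feature vectors; the value of `gamma` is an arbitrary illustrative choice, since the kernel parameters used in this study are not given here.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF kernel; gamma=0.5 is an arbitrary illustrative value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical vectors give the maximum, 1.0
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # distant vectors give a value near 0
print(same, far)
```

The kernel value decreases monotonically with distance, so the SVM treats images with similar feature vectors as likely candidates for the same binarization technique.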
In this experiment, the framework of the selection was evaluated by using SVM. In the next subsection, the selection framework is evaluated with the real world dataset of palm leaf manuscript and the dataset of DIBCO series including DIBCO 2009, H-DIBCO 2010, DIBCO 2011, H-DIBCO 2012, and DIBCO 2013 [
In general, accuracy is used to illustrate the overall classification performance. In case of imbalanced dataset, it is premised that if the number of prior classes is very different, this measure may be unsuitable because misclassification may occur [
The overall feature set comprised 68 global features (64 histogram bins plus the minimum, maximum, mean, and standard deviation of image intensity) and 225 local features (contrast, mean, standard deviation, maximum, and minimum for each of 45 subimages). Both the benchmarking dataset and the palm leaf manuscript dataset were used, with 10-fold cross-validation on each. Each dataset was evaluated in two forms: the original imbalanced dataset and a balanced dataset obtained by applying SMOTE to the imbalanced one. The details of each dataset are described in the following subsections.
This dataset was divided into three categories that are the following.
The datasets of the experiment were separated into two groups that are the following.
The performance of selection of binarization techniques on imbalanced and balanced dataset is shown in Table
Performance of the selection of binarization techniques using SVM on imbalanced and balanced dataset.
| Measure | Class | DIBCO series: imbalanced dataset | DIBCO series: balanced dataset 1 by SMOTE (1 : 2 : 2) | DIBCO series: balanced dataset 2 by SMOTE (2 : 4 : 4) | Palm leaf dataset (1 : 2 : 2): imbalanced dataset | Palm leaf dataset (1 : 2 : 2): balanced dataset by SMOTE |
|---|---|---|---|---|---|---|
| | ALL | 0.000 | 0.370 | 0.632 | 0.231 | 0.865 |
| | LMM | 0.778 | 0.650 | 0.679 | 0.735 | 0.857 |
| | BE | 0.000 | 0.615 | 0.882 | 0.033 | 0.943 |
| | IIF | 0.000 | 0.522 | 0.889 | 0.000 | 0.949 |
| Accuracy | | 0.636 | 0.592 | 0.754 | 0.588 | 0.980 |
| G-mean | | 0.000 | 0.448 | 0.736 | 0.000 | 0.902 |
| AUC | | 0.381 | 0.828 | 0.949 | 0.683 | 0.980 |
On the imbalanced datasets, the performance for class LMM is much better than for classes ALL, BE, and IIF, which have fewer instances. After SMOTE is applied to balance the datasets, the performance for classes ALL, BE, and IIF improves greatly, while class LMM changes only moderately: it rises from 0.735 to 0.857 on the palm leaf dataset but falls from 0.778 to 0.650 and 0.679 on the two balanced benchmarking datasets. Balancing with SMOTE also substantially improves the overall selection accuracy, G-mean, and AUC on both datasets. On the palm leaf dataset, these measures improved by 40.8%, 90.2%, and 29.7%, respectively. On the benchmarking dataset, balanced dataset group 2 improved by 11.8%, 73.6%, and 56.8%, respectively, whereas balanced dataset group 1 improved only in G-mean and AUC, with a slight decrease in accuracy.
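The gap between accuracy and G-mean on the imbalanced data (for example, an accuracy of 0.636 alongside a G-mean of 0.000 on the DIBCO series) can be illustrated with a small sketch. It assumes the common multiclass definition of G-mean as the geometric mean of the per-class recalls, which may differ in detail from the definition used in this study, and it uses hypothetical labels rather than the experimental data.

```python
import math

def accuracy_and_gmean(y_true, y_pred):
    """Return overall accuracy and the geometric mean of per-class recalls."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(t == c for t in y_true)
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(hits / total)
    gmean = math.prod(recalls) ** (1.0 / len(recalls))
    return accuracy, gmean

# Hypothetical imbalanced case: the classifier always predicts the majority class.
y_true = ["LMM"] * 7 + ["ALL", "BE", "IIF"]
y_pred = ["LMM"] * 10
accuracy, gmean = accuracy_and_gmean(y_true, y_pred)
print(accuracy, gmean)  # 0.7 0.0
```

A single class with zero recall drives the G-mean to zero, which is why this measure exposes a classifier that ignores the minority classes even when its accuracy looks acceptable.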
In this study, the proposed technique was compared to the original binarization techniques including ALL, IIF, LMM, and BE with the dataset from DIBCO series. The evaluation measures of this study are described in [
Evaluation results of original binarization techniques and the proposed framework on benchmarking dataset.
| Measure | ALL | IIF | BE | LMM | Proposed framework |
|---|---|---|---|---|---|
| | 80.7873 | 55.2662 | 86.2899 | 89.7139 | 91.2494 |
| PSNR | 16.0682 | 14.6766 | 17.8789 | 19.1531 | 19.6587 |
| DRD | 8.8728 | 11.8249 | 8.2245 | 3.7656 | 2.8869 |
| MPM | 7.8459 | 1.3269 | 11.3429 | 2.5617 | 1.1085 |
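Of the measures in this table, PSNR is the simplest to illustrate. The sketch below assumes the usual definition PSNR = 10 log10(C^2 / MSE), with C = 1 because the binary images are represented with pixel values 0 and 1; the exact formulation used for the reported values is given in the cited evaluation reference and may differ in detail. The function name `psnr` is hypothetical.

```python
import math

def psnr(result, ground_truth):
    """PSNR in dB between two binary images given as rows of 0/1 pixels."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(result, ground_truth)
             for a, b in zip(row_a, row_b)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(1.0 / mse)  # C = 1 for 0/1 pixels

# Hypothetical 10 x 10 ground truth and a result that differs in one pixel.
ground_truth = [[0] * 10 for _ in range(10)]
result = [row[:] for row in ground_truth]
result[0][0] = 1
value = psnr(result, ground_truth)
print(round(value, 2))  # MSE is 1/100, so the PSNR is close to 20 dB
```

A higher PSNR means the binarized result is closer to the ground truth, which matches the direction of improvement reported for the proposed framework.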
Based on the four original binarization techniques in Table
An automatic selection of multiple binarization techniques used in this framework has been applied to the users in recommending the appropriate technique. Figure
Two sample results from an automatic selection of binarization technique: original image 1 and its binary result (by LMM technique); original image 2 and its binary result (by ALL technique).
This paper described experiments and results on document degradation and proposed a framework for automatic selection among multiple binarization techniques for palm leaf manuscripts, using SVM with imbalanced and SMOTE-balanced datasets. The framework was also evaluated on a benchmarking dataset. The proposed measurement of learning for the selection is based on
With regard to the key points of this study, the automatic selection of binarization techniques in this framework can help users by recommending the appropriate technique. A comparison of the selection framework on imbalanced and balanced datasets shows better performance on the balanced datasets for nearly all measures. However, the performance on a dataset balanced by SMOTE remains quite low in some measures when the minority classes still contain few instances, as in benchmarking dataset group 1; in such cases, the proportion of minority-class instances must be increased further. This framework forms a prototype system for research in this area. It still needs to be refined; for example, the selection could be ranked and adjusted by the user in a semiautomatic approach. In addition, the top two or three ranked techniques may be combined. Furthermore, benchmarking datasets of palm leaf manuscripts need to be created for future research.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors wish to express their thanks and appreciation to the members of the Preservation of Palm Leaf Manuscripts Project, Mahasarakham University, Thailand, for their support and for providing images from their database. In particular, the authors thank Dr. Phatthanaphong Chomphuwiset for proofreading this paper.