Text Extraction from Historical Document Images by the Combination of Several Thresholding Techniques

This paper presents a new technique for the binarization of historical document images characterized by deteriorations and damages making their automatic processing difficult at several levels. The proposed method is based on hybrid thresholding combining the advantages of global and local methods and on the mixture of several binarization techniques. Two stages have been included. In the first stage, global thresholding is applied on the entire image and two different thresholds are determined, from which most of the image pixels are classified into foreground or background. In the second stage, the remaining pixels are assigned to the foreground or background class based on local analysis. In this stage, several local thresholding methods are combined and the final binary value of each remaining pixel is chosen as the most probable one. The proposed technique has been tested on a large collection of standard and synthetic documents, compared with well-known methods using standard measures, and shown to be more powerful.


Introduction
Binarization is an important step in the process of document analysis and recognition. Its goal is to segment the image into two classes (foreground and background in the case of document images). The resulting image is a binary image in black and white, where black represents the foreground and white represents the background. In fact, document image binarization is critical in the sense that a bad separation will cause the loss of pertinent information and/or add useless information (noise), generating wrong results. This difficulty increases for old documents, which exhibit various types of damage and degradation caused by the digitization process itself, aging effects, humidity, marks, fungus, dirt, and so forth, making the automatic processing of these materials difficult at several levels.
A great number of techniques have been proposed in the literature for the binarization of gray-scale or colored document images, but none of them is generic and efficient for all types of documents.
The binarization techniques of grayscale images may be classified into two categories: global thresholding and local thresholding [1,2]. A third category of hybrid methods can be added [3]. The global thresholding methods are widely used in many document image analysis applications for their simplicity and efficiency. However, these methods are powerful only when the original documents are of good quality, well contrasted, and have a clear bimodal pattern that separates foreground text and background. For historical document images, which are generally noisy and of poor quality, global thresholding methods are not suitable because no single threshold is able to completely separate the foreground from the background of the image, since there is no sufficient distinction between the gray ranges of background and foreground pixels. This kind of document requires a more detailed analysis, which may be guaranteed by local methods. Local methods calculate a different threshold for each pixel based on the information of its neighborhood. These methods are more robust against uneven illumination, low contrast, and varying colors than global ones, but they are very time consuming since a separate threshold is computed for each pixel of the image considering its neighborhood. This calculation becomes slower as the size of the considered neighborhood increases. Hybrid methods, in contrast, combine global and local information for segmenting the image.
In this paper, we propose a new hybrid thresholding technique for binarizing images of historical documents. The proposed approach uses a mixture of thresholding methods and it combines the advantages of the two families of techniques: the computation speed and the efficiency. The remainder of this paper is organized as follows. In Section 2, we present some existing binarization methods. Then in Section 3, we describe the proposed approach. The experiments performed and the results will be shown in Section 4, before concluding.

State of the Art
According to [4], existing methodologies for image binarization may be divided into two main strategies: grouping based and thresholding based. Thresholding based methods use global or local threshold(s) to separate the text from the background. In grouping based methods we distinguish two categories: region based grouping and clustering based grouping methods. Region based grouping methods are mainly based on spatial-domain region growing or on splitting and merging. Clustering based grouping methods are based on the classification of intensity or color values as a function of a homogeneity criterion. Several techniques have been employed to achieve this classification: the K-means algorithm, artificial neural networks, and so forth.
Sezgin and Sankur [5] established a classification of binarization methods into six categories according to the information that they exploit.
(i) Histogram-based methods: the methods of this class perform a thresholding based on the form of the histogram.
(ii) Clustering-based methods: these methods assign the image pixels to one of the two clusters: object and background.
(iii) Entropy-based methods: these algorithms use the information theory to obtain the threshold.
(iv) Object attribute-based methods: they find a threshold value based on some similarity measurements between original and binary images.
(v) Spatial binarization methods: they find the optimal threshold value taking into account spatial measures.
(vi) Locally adaptive methods: these methods are designed to give a new threshold for every pixel. Several kinds of adaptive methods exist. We find methods based on local gray range, local variation, and so forth.
We present in this section some of the binarization methods most frequently cited in the literature, considering only thresholding based methods.

Global Methods.
Let I denote the grayscale image, whose intensities vary from 0 (black) to 255 (white), and H its histogram of intensities. The number of pixels having a gray level g is denoted h(g).

Otsu's Method.
Otsu's method [6] tries to find the threshold which separates the gray-level histogram in an optimal way into two segments (maximizing the inter-segment variance or, equivalently, minimizing the intra-segment variance). The calculation of the interclass or intraclass variances is based on the normalized histogram p = [p(0) · · · p(255)] of the image, where ∑ p(g) = 1.

The interclass variance for each gray level t is given by

σ_B²(t) = ω_0(t) · ω_1(t) · [μ_0(t) − μ_1(t)]²,

where ω_0(t) = ∑_{g=0}^{t} p(g) and ω_1(t) = 1 − ω_0(t) are the two class probabilities and μ_0(t) and μ_1(t) are the corresponding class means; the optimal threshold is the gray level maximizing σ_B²(t).

ISODATA Method.

Thresholding using ISODATA [7] consists in finding a threshold T by iteratively separating the gray-level histogram into two classes, without a priori knowledge of the values associated with each class. This method starts by dividing the interval of nonnull values of the histogram into two equidistant parts, taking m_1 and m_2 as the arithmetic mean of each class. Until convergence, the optimal threshold T is repeatedly computed as the closest integer to (m_1 + m_2)/2 and the two means m_1 and m_2 are updated.
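The ISODATA iteration above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name and the convergence tolerance `eps` are our own choices.

```python
import numpy as np

def isodata_threshold(image, eps=0.5):
    """Iteratively split the gray levels until the threshold stabilizes.

    Sketch of the ISODATA scheme: start from an equidistant split,
    then repeat T = round((m1 + m2) / 2) until convergence.
    `eps` is an assumed tolerance, not a parameter from the paper.
    """
    g = image.astype(float).ravel()
    t = g.min() + (g.max() - g.min()) / 2.0   # initial equidistant split
    while True:
        m1 = g[g <= t].mean()                 # mean of the first class
        m2 = g[g > t].mean()                  # mean of the second class
        t_new = round((m1 + m2) / 2.0)
        if abs(t_new - t) < eps:
            return t_new
        t = t_new

# Example: two well-separated gray-level populations
img = np.array([[10, 12, 11], [200, 205, 210]])
t = isodata_threshold(img)
```

On this toy image the threshold converges between the two populations, separating the dark row from the bright one.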

Kapur et al.'s Method.
Kapur et al.'s method [8] is an entropy-based method which takes into account the foreground probability distribution and the background probability distribution in the determination of the division entropy. The binarization threshold T is chosen as the gray level for which the value H(T) = H_f(T) + H_b(T) is maximal, such that

H_f(T) = −∑_{g=0}^{T} (p_g / P_T) · ln(p_g / P_T),
H_b(T) = −∑_{g=T+1}^{255} (p_g / (1 − P_T)) · ln(p_g / (1 − P_T)),

where p_g is the occurrence probability of the gray level g in the image and P_T = ∑_{g=0}^{T} p_g.
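The entropy criterion above can be sketched as follows; a minimal NumPy illustration of Kapur's maximization, with our own function name and a brute-force search over all 256 candidate thresholds.

```python
import numpy as np

def kapur_threshold(image):
    """Entropy-based threshold: sketch of Kapur et al.'s criterion.

    Chooses the gray level t maximizing the sum of the foreground
    and background entropies of the normalized histogram.
    """
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    p = hist / hist.sum()                      # occurrence probabilities p_g
    best_t, best_h = 0, -np.inf
    for t in range(256):
        P = p[:t + 1].sum()                    # cumulative probability P_t
        if P in (0.0, 1.0):
            continue                           # one class would be empty
        pf = p[:t + 1] / P                     # foreground distribution
        pb = p[t + 1:] / (1 - P)               # background distribution
        hf = -np.sum(pf[pf > 0] * np.log(pf[pf > 0]))
        hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
        if hf + hb > best_h:
            best_t, best_h = t, hf + hb
    return best_t

# Example: three dark and three bright gray levels, equally frequent
img = np.repeat([10, 20, 30, 200, 210, 220], 10).reshape(6, 10)
t = kapur_threshold(img)
```

With this symmetric bimodal image, the entropy sum peaks as soon as the three dark levels form one class and the three bright levels the other.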

Iterative Global Thresholding (IGT).
This method selects a global threshold for the entire image based on an iterative procedure [9]. At each iteration i, the following steps are performed: (a) calculating the average gray level m_i of the image; (b) subtracting m_i from all pixels of the image; (c) histogram equalization to extend the pixels over the whole gray-level interval.
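One iteration of this procedure can be sketched as below. This is a simplified illustration: the stopping criterion of [9] is omitted, and step (c) is approximated by a linear contrast stretch rather than a full histogram equalization.

```python
import numpy as np

def igt_iteration(image):
    """One IGT-style iteration (simplified sketch, assumptions above)."""
    img = image.astype(float)
    m = img.mean()                          # (a) average gray level
    img = img - m                           # (b) subtract it from all pixels
    lo, hi = img.min(), img.max()
    # (c) stretch back over the whole gray-level interval
    # (a simple stand-in for the histogram equalization step)
    return (img - lo) / (hi - lo) * 255.0

out = igt_iteration(np.array([[0, 100], [200, 50]]))
```

After one pass the pixel values again span the full [0, 255] range, ready for the next iteration.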

Local Methods.
Local methods compute a local threshold for each pixel by sliding a square or rectangular window over the entire image.

Bernsen's Method.
It is an adaptive local method [10]. For each pixel of coordinates (x, y), the threshold is given by

T(x, y) = (Z_low + Z_high) / 2,

where Z_low and Z_high are the lowest and the highest gray levels, respectively, in a square window w × w centered on the pixel (x, y). However, if the local contrast C(x, y) = Z_high − Z_low is below a threshold L (L = 15), then the neighborhood is considered to consist of a single class: foreground or background.
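A minimal sketch of Bernsen's rule follows, using an illustrative 3 × 3 window and assigning low-contrast neighborhoods to the background; both choices are assumptions for the example, not prescribed by the method.

```python
import numpy as np

def bernsen_binarize(image, w=3, contrast_min=15):
    """Bernsen sketch: midrange threshold inside a w x w window,
    with a single-class fallback when the local contrast is too low
    (here assumed to be background)."""
    img = image.astype(int)
    pad = w // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            win = padded[y:y + w, x:x + w]
            z_low, z_high = int(win.min()), int(win.max())
            if z_high - z_low < contrast_min:
                out[y, x] = 255            # low-contrast: single class
            else:
                t = (z_low + z_high) // 2  # midrange threshold
                out[y, x] = 0 if img[y, x] <= t else 255
    return out

# dark 1-pixel stroke on a light background
page = np.full((5, 5), 200)
page[2, :] = 50
result = bernsen_binarize(page)
```

The stroke pixels fall below the local midrange and come out black, while uniform regions trigger the low-contrast fallback.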

Niblack's Method.
The local threshold T(x, y) is calculated using the mean m(x, y) and standard deviation s(x, y) of all pixels in the window (the neighborhood of the pixel in question) [11]. Thus, the threshold is given by

T(x, y) = m(x, y) + k · s(x, y),

where k is a parameter used for determining the proportion of edge pixels considered as object pixels; it takes negative values (k is fixed to −0.2 by the authors).
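The formula is easy to evaluate for a single window; a minimal sketch, with the sliding-window machinery left out:

```python
import numpy as np

def niblack_threshold(window, k=-0.2):
    """Local Niblack threshold for one window: T = m + k * s,
    with k = -0.2 as fixed by the authors."""
    return window.mean() + k * window.std()

win = np.array([[100, 120], [140, 160]], dtype=float)
t = niblack_threshold(win)   # mean 130, std ~22.36 -> T ~125.5
```

In a full implementation this function would be applied to the window around every pixel, which is exactly why local methods are slow.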

Sauvola and Pietikäinen's Method.

Sauvola and Pietikäinen's algorithm [3] is a modification of Niblack's method aimed at better performance on documents whose background contains a light texture, strong variations, or uneven illumination. In Sauvola's modification, the local binarization threshold is given by

T(x, y) = m(x, y) · [1 + k · (s(x, y) / R − 1)],

where R is the dynamic range of the standard deviation and the parameter k takes positive values in the interval [0.2, 0.5].
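The modification can be sketched for one window as below; R = 128 is a common choice for 8-bit images, assumed here for illustration.

```python
import numpy as np

def sauvola_threshold(window, k=0.2, R=128.0):
    """Sauvola's local threshold: T = m * (1 + k * (s / R - 1)).
    R is the dynamic range of the standard deviation (assumed 128
    for 8-bit images); k lies in [0.2, 0.5]."""
    m, s = window.mean(), window.std()
    return m * (1.0 + k * (s / R - 1.0))

# In a flat window (s = 0) the threshold drops to m * (1 - k),
# which is what makes Sauvola robust on textured backgrounds.
t = sauvola_threshold(np.full((3, 3), 100.0))
```

Unlike Niblack, a zero-variance neighborhood pulls the threshold well below the mean, so flat background regions are not misclassified as text.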

Nick Method.
This method considerably improves the binarization of light and low-contrast images by moving the binarization threshold downwards [2]. The threshold calculation is done as follows:

T = m + k · √((∑ p_i² − m²) / NP),

where k is the Niblack factor, varying between −0.1 and −0.2 according to the application needs, m is the average gray level, p_i is the gray level of pixel i, and NP is the total number of pixels. In their tests, the authors used a window of size 19 × 19.
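A single-window sketch of the NICK threshold, following the formula above (the function name is ours):

```python
import numpy as np

def nick_threshold(window, k=-0.1):
    """NICK local threshold: T = m + k * sqrt((sum(p_i^2) - m^2) / NP),
    with k between -0.1 and -0.2; the authors used 19x19 windows."""
    p = window.astype(float).ravel()
    m, NP = p.mean(), p.size
    return m + k * np.sqrt((np.sum(p ** 2) - m ** 2) / NP)

t = nick_threshold(np.full((3, 3), 100))
```

Even on a flat window the square-root term stays large (it is dominated by ∑p_i² rather than the variance), which is what pushes the threshold downwards on light images.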

Sari et al.'s Method.

This method uses an artificial neural network of the multilayer perceptron (MLP) type to classify the image pixels into two classes: foreground and background [12]. The MLP has one hidden layer, 25 inputs, and one single output. To assign a new value (black or white) to a pixel, the MLP takes as input a vector of 25 values corresponding to the intensities of the pixels in a 5 × 5 window centered on the processed pixel. The MLP parameters (structure, input statistics, etc.) were chosen after several experiments.

Hybrid Methods

Improved IGT Method.

This method [13] is an improvement of the IGT technique from [9] and it consists of two passes.
In the first pass, a global thresholding is applied to the entire image, and in the second pass a local thresholding processes the areas still containing noise. To do this, the binary image resulting from the global thresholding is divided into several segments of size n × n and for each segment the frequency F of black pixels is calculated. The segments satisfying the criterion F > μ + k · σ are kept, where μ and σ denote the mean and the standard deviation of the black-pixel frequency over the segments and k is a constant (equal to 2 according to the authors). For each detected area, the IGT method is applied to the corresponding area in the original image. Areas of size 50 × 50 give good results according to the authors.
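The segment-selection criterion can be sketched as follows. This is an illustration only: the segment size and k are shrunk to fit a toy image (the authors use 50 × 50 segments and k = 2), and the function name is ours.

```python
import numpy as np

def noisy_segments(binary, seg=4, k=2.0):
    """Return the top-left coordinates of segments whose black-pixel
    frequency F exceeds mu + k * sigma over all segments (0 = black,
    255 = white). Segment size and k are illustrative here."""
    h, w = binary.shape
    freqs, coords = [], []
    for y in range(0, h, seg):
        for x in range(0, w, seg):
            block = binary[y:y + seg, x:x + seg]
            freqs.append(np.mean(block == 0))   # black-pixel frequency F
            coords.append((y, x))
    freqs = np.array(freqs)
    mu, sigma = freqs.mean(), freqs.std()
    return [c for c, f in zip(coords, freqs) if f > mu + k * sigma]

# 8x8 image: one all-black segment among three white ones
b = np.full((8, 8), 255)
b[0:4, 0:4] = 0
flagged = noisy_segments(b, seg=4, k=1.0)
```

Only the unusually dark segment is flagged for re-thresholding with IGT.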

Gangamma and Srikanta's Method.

Gangamma and Srikanta [14] proposed a method based on a simple and effective combination of spatial filters with gray-scale morphological operations to remove the background and improve the quality of historical document images of palm scripts. The first step of this technique is to apply adaptive histogram equalization (AHE) to overcome the problem of uneven illumination in the document image. A morphological opening operation is applied to the resulting image, and the opened image is then added to the histogram-equalized image. After that, a morphological closing operation is applied to the image for smoothing. The histogram-equalized image is subtracted from the smoothed image and the result is subtracted again from the previous addition image. A Gaussian filter is subsequently applied in order to remove the noise. A final improvement is obtained by adding the last image to the histogram-equalized image. Finally, a global thresholding (Otsu's algorithm) is required to separate the text from the background.

Thresholding by Background Subtraction.

This technique has been proposed in [15] and consists of three steps. The background is modeled by removing the handwriting, applying a closing of the original image with a small disk as a structuring element. After that, the background is subtracted from the original image, which leaves only the foreground. Finally, the resulting image is segmented using Otsu's algorithm, with the computed threshold multiplied by an empirical constant.

Tabatabaie and Bohlool's Method.
It is a nonparametric method proposed for the binarization of badly illuminated document images [16]. In this method, the morphological closing operation is used to solve the problem of uneven background illumination. Indeed, closing may produce a reasonable estimate of the background if the appropriate structuring element is used. Experiments show that a structuring element of size equal to twice the stroke size gives the best results. The appropriate structuring element size is estimated as follows. A global threshold is first applied to the original image. Then, for each pixel, we look for the size of the largest black square and save these values in a matrix S. The biggest value of S in each connected set of pixels is calculated and assigned to the other elements of the set. After that, the histogram h_S is built from the matrix S: the value of h_S at the point s is equal to the number of elements of S having the value s. Finally, we determine s_max, the greatest value of s satisfying the selection criterion; the structuring element size will be 2 · s_max.

Proposed Technique
As we said earlier, global thresholding techniques are generally simple and fast; they calculate a single threshold intended to eliminate all background pixels and preserve all foreground pixels. Unfortunately, these techniques are only applicable when the original documents are of good quality, well contrasted, and with a bimodal histogram. Figure 1 shows an example.
When the documents are of poor quality, containing different types of damages (stains, transparency effects, etc.), with a textured background and uneven illumination or when the gray levels of the foreground pixels and the gray levels of the background pixels are close, it is not possible to find a threshold that completely separates the foreground from the background of the image (Figure 2).
In this case, a more detailed analysis is needed, and we have recourse to local methods. Local methods are more accurate and may be applied to variable backgrounds, quite dark or with low contrast, but they are very slow since the threshold calculation, based on the local neighborhood information, is done for each pixel of the image. This computation becomes slower with larger sliding windows.
To solve this problem, we propose a hybrid thresholding approach that is fast and at the same time as effective as local methods, which is achieved by combining the advantages of both families of binarization methods. The proposed technique uses two thresholds T_1 and T_2 and runs in two passes. In the first pass, a global thresholding is performed in order to classify most of the pixels of the image. All pixels having a gray level higher than T_2 are removed (become white) because they represent background pixels. All pixels having a gray level lower than T_1 are considered foreground pixels and are therefore kept and colored black. The remaining pixels are left to the second pass, in which they are locally binarized by combining the results of several local thresholding methods to select the most probable value.
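The first pass described above can be sketched in a few lines of NumPy; the use of −1 as a marker for the undecided pixels is our own convention for the example.

```python
import numpy as np

def global_pass(image, t1, t2):
    """First pass of the two-threshold scheme: gray levels below t1
    become foreground (0), levels above t2 become background (255),
    everything in between is left undecided (-1) for the local pass."""
    out = np.full(image.shape, -1, dtype=int)
    out[image < t1] = 0     # certainly foreground
    out[image > t2] = 255   # certainly background
    return out

img = np.array([[10, 128, 250]])
out = global_pass(img, 50, 200)
```

Only the pixels marked −1 need the (slow) local analysis, which is where the speed gain of the hybrid approach comes from.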
We detail the processing steps in the following. The goal is to separate the image into two classes, foreground and background, and since a single threshold is not able to accomplish this task, the use of two separation thresholds seems to be a suitable solution. These two thresholds are estimated from the gray-level histogram of the original image and represent the average intensity of the foreground and of the background, respectively.

Estimation of Two Thresholds
To obtain these two thresholds, we first compute a global threshold T using a global thresholding algorithm, which can be Otsu's algorithm, Kapur's, or any other global algorithm. In our approach, we opted for Otsu's algorithm because this technique has shown its efficiency and outperformed other global methods in several comparative studies [17,18]. T separates the gray-level histogram of the image into two classes: foreground and background. T_1 and T_2 are estimated from T: T_1 is the average intensity of the foreground, represented by the first part of the histogram (below T), and T_2 is the average intensity of the background, represented by the second part (above T).

Global Image Thresholding Using T_1 and T_2.

After the estimation of the two thresholds T_1 and T_2, all pixels having a gray level higher than T_2 are transformed into white, which eliminates most of the image background, and those whose gray level is less than T_1 are colored black; these pixels are certainly foreground pixels. The resulting image, noted I_g, still contains some noise, but all the foreground information is preserved.

Local Thresholding of the Remaining Pixels.
The pixels left unprocessed in the previous step (those with a gray level between T_1 and T_2) may belong to the foreground and thus must be preserved; likewise, they may be background or noise pixels and should be removed. The decision to assign the remaining pixels to one of the two classes, foreground or background, is performed using a local process examining the neighborhood of these pixels. To guarantee a more correct classification, we propose to apply several local thresholding methods. In our experiments, we chose the following methods: Niblack, Sauvola, and Nick, since these methods were ranked in the first places in several previous comparative studies [2,19,20]. For each pixel (x, y) of I_g not yet classified, we calculate locally its new binary values (0 for black and 1 for white) obtained by applying Niblack's, Sauvola's, and Nick's methods, and we thus obtain three temporary images B_1, B_2, and B_3, respectively. Each of the three local methods computes the binary value of each remaining pixel (x, y) by comparing its gray level with LT_1, LT_2, and LT_3, the local thresholds computed using Niblack's, Sauvola's, and Nick's methods, respectively.
The final binary value B(x, y) of each remaining pixel is the value produced by at least two of the three methods (a majority vote).
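The majority vote over the three temporary images can be sketched as follows, assuming the 0/255 black/white convention used throughout this example.

```python
import numpy as np

def majority_vote(b1, b2, b3):
    """Final decision for the remaining pixels: keep the binary value
    produced by at least two of the three local binarizations
    (0 = black/foreground, 255 = white/background)."""
    black_votes = ((b1 == 0).astype(int) + (b2 == 0).astype(int)
                   + (b3 == 0).astype(int))
    return np.where(black_votes >= 2, 0, 255)

# pixel 0: two methods say black; pixel 1: only one does
b1 = np.array([[0, 255]])
b2 = np.array([[0, 0]])
b3 = np.array([[255, 255]])
final = majority_vote(b1, b2, b3)
```

A pixel is kept as foreground only when at least two of Niblack's, Sauvola's, and Nick's thresholds classify it as black, which is what makes the combined decision more robust than any single method.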

Experiments and Results
Experiments have been performed in order to estimate the performance of our approach. We applied the proposed technique over a large test set and compared the obtained results with well-known methods, including global, local, and hybrid methods. The comparison will estimate both the binarization quality and the execution time.
Firstly, for the parameterized methods, a series of experiments was performed in order to set their optimal parameter values. Let PS(M) denote the parameter set of a specific thresholding method M. For example, PS(Niblack) = {w, k}, PS(Sauvola) = {w, k, R}, and so forth. We try to find the optimal values of PS(M) giving the binarization results closest to the ground truth images. A specific range of values is first defined for each parameter. To improve the accuracy of the process, we used a wide initial range for every parameter. For Niblack's method, for example, the range of the window size w is defined as [3, 299]. After that, we apply the binarization method with the different values of PS(M) from the predefined ranges on the test set described in Section 4.1. We compare the binarization results with the ground truth images using the evaluation measures detailed in Section 4.2. A ranking of the obtained results is then performed according to each measure separately. By calculating the sum of all ranks, we can infer the optimal set of parameter values as the one leading to the top ranks. The optimal parameter values of the parameterized methods are summarized in Table 1.

The first set of test images comprises four standard collections of real historical documents. These four collections contain a total of 50 real document images (37 handwritten and 13 printed) coming from the collections of several libraries, with the associated ground truth images. All the images contain representative degradations which appear frequently (e.g., variable background intensity, shadows, smear, smudge, low contrast, and bleed-through). Figure 3 shows some images from these collections.
The second set of images is a synthetic collection composed of 150 synthetic document images constructed by the fusion of 15 different backgrounds and 10 binary images (Figure 4). The fusion is done by applying the image mosaicing by superimposing technique for blending [21]. The idea is as follows: we start with some document images in black and white, which represent the ground truth, and with some backgrounds extracted from old documents, and we apply a fusion procedure to generate as many different images of old documents. Stathis et al., in [22], proposed two different techniques for the blending: maximum intensity and image averaging. We adopted the image averaging technique in order to obtain a more natural result.

Evaluation Measures

F-Measure.

The F-measure combines the precision and recall of the binarized image against the ground truth:

FM = (2 · Recall · Precision) / (Recall + Precision),

where Recall = TP / (TP + FN), Precision = TP / (TP + FP), and TP, FP, and FN denote the numbers of true positive, false positive, and false negative pixels, respectively. A higher F-measure indicates a better binarization quality.

PSNR.
PSNR is a similarity measure between two images; the higher the value of the PSNR, the higher the similarity of the two images [24,25]:

PSNR = 10 · log10(C² / MSE), with MSE = (∑_{x=1}^{M} ∑_{y=1}^{N} (I_1(x, y) − I_2(x, y))²) / (M · N),

where I_1 and I_2 represent the two matched images, M and N are their height and width, respectively, and C is the difference between foreground and background.
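The measure is straightforward to compute; a minimal sketch (C = 255 is assumed for 8-bit images, and the degenerate case of identical images, where the MSE is zero, is not handled):

```python
import numpy as np

def psnr(img1, img2, C=255.0):
    """PSNR = 10 * log10(C^2 / MSE); higher means more similar.
    C is the foreground/background difference (255 for 8-bit images)."""
    mse = np.mean((img1.astype(float) - img2.astype(float)) ** 2)
    return 10.0 * np.log10(C ** 2 / mse)

a = np.zeros((2, 2))
b = np.full((2, 2), 16)   # every pixel off by 16 -> MSE = 256
v = psnr(a, b)
```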

NRM (Negative Metric Rate).
NRM is based on the pixelwise mismatches between the ground truth and the binarized image [26]. It combines the false negative rate NR_FN and the false positive rate NR_FP as follows:

NRM = (NR_FN + NR_FP) / 2, with NR_FN = N_FN / (N_FN + N_TP) and NR_FP = N_FP / (N_FP + N_TN),

where N_TP, N_FP, N_FN, and N_TN are the numbers of true positive, false positive, false negative, and true negative pixels, respectively. Contrary to the F-measure and the PSNR, a better binarization quality is indicated by a lower NRM value.
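A minimal sketch of the NRM computation, assuming the 0 = foreground, 255 = background convention:

```python
import numpy as np

def nrm(gt, binarized):
    """Negative rate metric: average of the false-negative and
    false-positive rates (0 = black/foreground, 255 = background)."""
    fg, bg = (gt == 0), (gt == 255)
    n_tp = np.sum(fg & (binarized == 0))
    n_fn = np.sum(fg & (binarized == 255))
    n_tn = np.sum(bg & (binarized == 255))
    n_fp = np.sum(bg & (binarized == 0))
    nr_fn = n_fn / (n_fn + n_tp)
    nr_fp = n_fp / (n_fp + n_tn)
    return (nr_fn + nr_fp) / 2.0

gt = np.array([[0, 0, 255, 255]])
pred = np.array([[0, 255, 255, 255]])   # one foreground pixel lost
score = nrm(gt, pred)
```

Here half of the foreground pixels are missed (NR_FN = 0.5) and no background pixel is misclassified (NR_FP = 0), giving NRM = 0.25.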

MPM (Misclassification Penalty Metric).
The misclassification penalty metric MPM evaluates the binarization result against the ground truth on an object-by-object basis [26]:

MPM = (MP_FN + MP_FP) / 2, with MP_FN = (∑_i d_FN^i) / D and MP_FP = (∑_j d_FP^j) / D,

where d_FN^i and d_FP^j denote the distances of the i-th false negative and the j-th false positive pixel from the contour of the ground truth segmentation. The normalization factor D is the sum over all the pixel-to-contour distances of the ground truth object. A low MPM score denotes that the algorithm is good at identifying an object's boundary.

DRD (Distance Reciprocal Distortion Metric).

DRD is an objective distortion measure for binary document images, proposed by Lu et al. in [25]. This measure correlates properly with human visual perception and measures the distortion of all the flipped pixels as follows:

DRD = (∑_{k=1}^{S} DRD_k) / NUBN,

where NUBN is the number of nonuniform 8 × 8 blocks in the ground truth image, S is the number of flipped pixels, and DRD_k is the distortion of the k-th flipped pixel of coordinates (x, y), calculated using the 5 × 5 normalized weight matrix defined in [25].

The results obtained according to the different evaluation measures are given in Table 2. The final ranking of the compared methods is shown in Table 3, which also summarizes the partial ranks of each method according to each evaluation measure and the sum of ranks. From Tables 2 and 3, it is clear that our proposed method is ranked first overall and has the best performance according to all measures of binarization quality. It exceeded local methods such as the famous Sauvola and Pietikäinen method, which ranked 3rd in our experiments, and even the other hybrid techniques. Indeed, the combination of three local thresholding techniques enabled a more robust determination of the binary value of each pixel by taking the most likely value.
Regarding the execution time, our method is very fast compared to local methods (about 52 times faster than Sauvola and Pietikäinen's method), which enabled us to save about 98% of the execution time. This is logical because only a portion of the pixels (those having a gray level between the two thresholds T_1 and T_2) is locally analyzed.

Conclusion
In this paper we tackled the problem of foreground/background separation in images of historical documents. We proposed a hybrid approach for the binarization of degraded document images. The proposed approach runs in two passes. Firstly, a global thresholding using Otsu's algorithm is applied to the entire image and two different thresholds are determined. All pixels below the first threshold are preserved and all pixels higher than the second threshold are eliminated, as they surely represent background pixels. The remaining pixels are then processed locally based on their neighborhood information. In this step, three local thresholding methods are combined in order to obtain a more accurate decision. Since the number of pixels processed locally is very small compared to the total number of pixels, the time required for binarization is reduced considerably without decreasing performance. To validate our approach, we compared it with state-of-the-art methods from the literature; the results obtained on standard and synthetic collections are encouraging and confirm our approach.