Fuzzy-Based Segmentation for Variable Font-Sized Text Extraction from Images/Videos

Textual information embedded in multimedia can provide a vital tool for indexing and retrieval. Much work has been done on text localization and detection because of its fundamental importance. One of the biggest challenges of text detection is dealing with variation in font sizes and image resolution. This problem is aggravated by undersegmentation or oversegmentation of the regions in an image. This paper addresses the problem with a novel fuzzy-based postprocessing segmentation method that handles variation in text sizes and image resolution. The methodology is tested on the ICDAR 2011 Robust Reading Challenge dataset, and the results demonstrate the strength of the proposed method.


Introduction
Recently there has been a rapid surge in multimedia repositories, which has raised the need for efficient retrieval, indexing, and browsing of multimedia information. Several methodologies have been presented in the literature to retrieve image and video data; they exploit color, texture, shape, relations between objects, and so forth. However, text embedded in images can be extraordinarily instrumental for data retrieval, since visual text in multimedia communicates information such as news headlines, movie titles, product trade names, summaries of sports contests, dates and times of events, and so forth. Such information can be influential for the understanding and retrieval of images or videos.
Text embedded in images may be categorized into two classes, namely, caption text and scene text. Caption text is imposed over the image in the editing process, for example, news headings and match summaries or scores; it is also referred to as artificial text or superimposed text. Scene text is an actual part of the scene, for example, the brand name of a product during a commercial break, text on a signboard or name plate, and text visible on clothing or products.
One of the key challenges in text detection is dealing with text size variations. The variation may be classified into two types: first, variation in the spatial resolution of images and, second, variation of font sizes within an image. This paper focuses on this problem and provides viable solutions for both categories.
The rest of the paper is organized as follows. Section 2 highlights related work. Section 3 introduces the proposed method for segmenting text in images. Section 4 presents the dataset used and the results of the text segmentation algorithm. Section 5 provides concluding remarks.

Literature Review
A variety of techniques for text extraction have appeared in the recent past [1][2][3][4][5][6]. Comprehensive surveys can be found in [7][8][9]. These techniques can be categorized into two main types according to the text features they use: region-based and texture-based methods [10]. Texture-based methods rely on the textural properties of text that distinguish it from the background. These techniques mostly use Gabor filters, wavelets, the fast Fourier transform, spatial variance, and so forth, and they further apply machine learning methods such as support vector machines (SVM), multilayer perceptrons (MLP), and AdaBoost [11][12][13][14][15]. Region-based methods use distinct region features to extract text content. This methodology relies on the color dissimilarity between text and its surrounding pixels. Procedures based on color, edges, and connected components are frequently used in this category [16][17][18][19]. These techniques typically work in a bottom-up fashion, first segmenting small regions and later grouping the potential text regions. Region-based methods are generally composed of three modules: (1) segmenting the image into small regions, which aims at separating the character regions from the background, (2) merging and grouping small regions to form words and sentences, and (3) differentiating between text and nontext objects.
Segmentation identifies the occurrence of different regions in the image but does not recognize the relation between these regions. It is essential to merge the characters of a word to form a text object, because most text detection techniques work on groups of characters and it is very difficult to detect an isolated character [20,21]. This grouping can utilize pixel-level features or exploit high-level features.
At present, only a few pixel-level merging methods have been introduced in the literature on text detection. Dilation is the most commonly used merging technique [22][23][24][25][26], wherein the dimensions of the morphological operator intrinsically determine the range of the homogeneous segmented regions. Consequently, large text blocks tend to be oversegmented, whereas small text areas may be skipped. A fixed-size structuring element works only for a limited spatial resolution and a small range of font sizes. The size of the structuring element should depend on the size of the text, but it usually has a fixed value that cannot cope with variation in image resolution and text size. Some methodologies use a pyramid approach to solve this problem and extend the range of detectable text sizes [23,27,28]. This greatly increases the computational requirements or demands parallel processing mechanisms.
Object-level merging is closer to human vision and deals with objects and regions instead of pixels. It connects potential character objects to form text strings. The grouping and merging therefore depend on high-level features, which gives better performance.
Wolf and Jolion [29] used the disparity in heights and positions of connected components to merge characters. Minetto et al. [27] developed a grouping step based on the space between two text areas relative to their height. Pan et al. [30] built component relations using a minimum spanning tree; this text detection method merges characters into words using shape and spatial difference. Gonzalez and Bergasa [31] suggested that characters of the same word should share several characteristics, for instance, stroke size, height, position, adjacency, and constant interletter and interword spacing.
Shi et al. [32] used a graph model to merge neighboring regions into text strings. The adjoining nodes of each node are those that satisfy certain conditions based on difference in color, position, width ratio, and height ratio. Character candidates are linked into pairs in the method of Yao et al. [33]. If two regions have similar stroke widths (the ratio between the mean stroke widths is less than 2.0), matching sizes (the ratio between their characteristic scales does not exceed 2.5), and similar colors and are closely placed (the distance between them is less than two times the sum of their characteristic scales), they are tagged as a pair. Subsequently, a greedy hierarchical agglomerative clustering approach is used to combine the pairs into candidate chains.
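The pairing conditions quoted above for Yao et al. [33] can be written as a crisp predicate. The sketch below is only an illustration of those thresholds: the `Candidate` fields, the function name `is_pair`, and the color threshold `max_color_dist` are hypothetical (the text gives no numeric color threshold), and it is not the implementation of [33].

```python
import math
from collections import namedtuple

# Hypothetical container for a character candidate (field names are assumptions).
Candidate = namedtuple("Candidate", "stroke_width scale center color")

def is_pair(a, b, max_color_dist=25.0):
    """Crisp pairing test using the thresholds quoted in the text.

    max_color_dist is a hypothetical value; the text only says "similar colors".
    """
    # Ratio between mean stroke widths must be less than 2.0.
    sw_ok = max(a.stroke_width, b.stroke_width) / min(a.stroke_width, b.stroke_width) < 2.0
    # Ratio between characteristic scales must not exceed 2.5.
    scale_ok = max(a.scale, b.scale) / min(a.scale, b.scale) <= 2.5
    # Distance must be less than two times the sum of the characteristic scales.
    dist_ok = math.dist(a.center, b.center) < 2.0 * (a.scale + b.scale)
    # Color similarity, measured here as Euclidean distance (an assumption).
    color_ok = math.dist(a.color, b.color) <= max_color_dist
    return sw_ok and scale_ok and dist_ok and color_ok

a = Candidate(3.0, 20.0, (10, 10), (200, 200, 200))
b = Candidate(3.5, 22.0, (40, 10), (205, 198, 201))
print(is_pair(a, b))
```

Every condition here is a hard cut-off, which is exactly the property the following discussion questions.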
Although these features are defined by strict boundaries in the existing techniques, the relation between neighboring characters is not crisp. It is hard to justify declaring a character a neighbor if its distance-to-height ratio is 1.50 or less while reversing that verdict when the ratio is 1.51. The parameters that define the proximity of potential characters should be fuzzy rather than crisp. Thus, there is a need for a merging process in which the rules of inference are formulated in a general way, using fuzzy categories, and for a system that gives some weight to each of the features used for measuring the degree of neighborhood. Moreover, the similarity obtained from the currently reported features mostly does not correspond to human perception. Human perception of proximity, similar heights, and similar color cannot be fully expressed using discrete and rigid boundaries or thresholds. These linguistic variables can be better defined by fuzzy logic.
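The boundary effect described above can be made concrete with a small sketch. The 1.50 threshold is the example value from the text; the sigmoid shape, its slope of 10, and both function names are assumptions chosen only for illustration.

```python
import math

def crisp_neighbor(ratio, threshold=1.50):
    """Hard decision: neighbor only if the distance-to-height ratio <= threshold."""
    return 1.0 if ratio <= threshold else 0.0

def fuzzy_neighbor(ratio, center=1.50, slope=10.0):
    """Gradual decision: membership falls smoothly from 1 toward 0 around center.

    The slope is a hypothetical choice, not a value from the paper.
    """
    return 1.0 / (1.0 + math.exp(slope * (ratio - center)))

for r in (1.40, 1.50, 1.51, 1.60):
    print(f"ratio={r:.2f}  crisp={crisp_neighbor(r):.0f}  fuzzy={fuzzy_neighbor(r):.3f}")
```

Between ratios 1.50 and 1.51 the crisp verdict flips from 1 to 0, whereas the fuzzy degree changes only slightly, which is the behavior the merging process is meant to have.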

Methodology
Component extraction or segmentation is the procedure of dividing a digital image into multiple fragments, called superpixels [34,35]. The objective of segmentation is to reduce the computational complexity of processing the image and to make its representation easier to analyze. Image segmentation is classically used to locate objects and boundaries in images. In particular, image segmentation is the process of labeling the pixels of an image such that pixels with the same label share common characteristics such as color, intensity, and texture. Edge detection, in turn, is a basic tool used in most image processing applications to locate sharp changes in intensity at region boundaries.
The proposed segmentation method consists of two processes: splitting and merging. Splitting is performed by traditional region-based segmentation techniques, whereas merging is based on the novel fuzzy-based method. Figure 1 provides the architecture of the proposed work.

Splitting.
There exists a sharp transition between text and its background. Edge detection is a promising segmentation tool for text images because a sharp intensity transition is a common feature of all text objects. Exploiting this feature, the proposed methodology uses edge detection along with connected component labeling for segmentation: the Sobel operator is used for edge detection, and image dilation is applied to connect broken edges. The size of the dilation operator is adapted to the resolution of the image and ranges between 3 and 5% of the image width. Dilation is performed prior to the fuzzy merging only to reduce the computational effort; the proposed fuzzy merging method can work without this morphological operation.
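The splitting stage can be sketched in plain Python as Sobel edge detection, dilation with a width-dependent structuring element, and connected component labeling. This is a minimal illustration under stated assumptions: the grayscale image is a list of rows, the edge threshold of 100, the 4% fraction, the square structuring element, and all function names are hypothetical choices, not the paper's implementation.

```python
from collections import deque

def sobel_edges(img, thresh):
    """Binary edge map from the Sobel gradient magnitude (border pixels stay 0)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2 * img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2 * img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2 * img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2 * img[y-1][x] - img[y-1][x+1])
            if (gx * gx + gy * gy) ** 0.5 >= thresh:
                out[y][x] = 1
    return out

def dilate(mask, size):
    """Binary dilation with a square element of half-width size // 2."""
    h, w = len(mask), len(mask[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - r), min(h, y + r + 1)):
                    for xx in range(max(0, x - r), min(w, x + r + 1)):
                        out[yy][xx] = 1
    return out

def label_regions(mask):
    """8-connected labeling; returns bounding boxes (left, top, right, bottom)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                left, top, right, bottom = x, y, x, y
                while queue:
                    cy, cx = queue.popleft()
                    left, right = min(left, cx), max(right, cx)
                    top, bottom = min(top, cy), max(bottom, cy)
                    for ny in range(max(0, cy - 1), min(h, cy + 2)):
                        for nx in range(max(0, cx - 1), min(w, cx + 2)):
                            if mask[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                boxes.append((left, top, right, bottom))
    return boxes

def split(img, thresh=100, fraction=0.04):
    """Sobel edges, then adaptive dilation, then connected components."""
    size = max(1, round(fraction * len(img[0])))  # within the 3-5% range from the text
    return label_regions(dilate(sobel_edges(img, thresh), size))
```

The returned bounding boxes are the regions that the fuzzy merging stage would then consider for joining.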

Fuzzy Merging.
This section explains the fuzzy merging process.
Let $I$ be the input image and $R$ the set of all regions of $I$ extracted by the method described above. Let $R = \{r_1, r_2, r_3, \ldots, r_n\}$, where $n$ is the total number of regions in the image $I$.
The merging problem can be defined using graph theory. Let $G$ denote an undirected graph and $V(G) = R$ represent the vertices of $G$. The edges of the graph are $E(G) = \{(r_i, r_j) \mid r_i, r_j \in R,\ i \neq j\}$. These edges carry the probability of joining two vertices. Initially, all $e \in E(G)$ are set to null. The probability $p_{i,j}$ is calculated by fuzzy logic and is based on four factors, which are explained later in the paper.
Fuzzy-based methods assign each object a gradual membership value for joining other text instances, measured as a degree in [0, 1]. This gives the flexibility to connect objects based on more than one feature, depending on the membership values of all the parameters.
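The graph formulation above can be sketched as a dictionary of edges over all unordered pairs of distinct regions, initialized to null and then filled with a join degree in [0, 1]. The function name `build_merge_graph` and the callback `join_degree` are hypothetical; the callback stands in for the fuzzy system described in the following subsections.

```python
from itertools import combinations

def build_merge_graph(regions, join_degree):
    """Return {(i, j): degree} for every unordered pair of distinct regions."""
    # Every edge starts as null (None), as in the formulation above.
    edges = {pair: None for pair in combinations(range(len(regions)), 2)}
    for i, j in list(edges):
        # join_degree is a placeholder for the fuzzy inference output.
        edges[(i, j)] = join_degree(regions[i], regions[j])
    return edges

# Usage with a dummy scoring function (not the fuzzy system itself).
graph = build_merge_graph(["r1", "r2", "r3"], lambda a, b: 0.5)
print(graph)
```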
3.2.1. Feature Extraction. The merging of character candidates relies on a number of factors. Four features are extracted for deciding whether to join objects into words or sentences: color, height, position, and distance.
Color. Color is the first parameter for joining two text objects. The characters of a single word or sentence mostly have the same color. If the colors of two text objects are similar, the objects are candidates for merging. To obtain the degree of similarity, the difference between the two colors is calculated.
Lab color coding is used to describe the color of an object. Unlike RGB and CMYK, Lab color coding approximates the human vision system. The color difference $\Delta E$ is described in the $L^*C^*h^*$ color space, with differences in lightness, chroma, and hue calculated from $L^*a^*b^*$ coordinates. The difference between two colors with coordinates $(L_1^*, a_1^*, b_1^*)$ and $(L_2^*, a_2^*, b_2^*)$ is thus built from the lightness difference $\Delta L^*$, the chroma difference $\Delta C^*_{ab}$, and the hue difference $\Delta H^*_{ab}$. Geometrically, the quantity $\Delta H^*_{ab}$ represents the arithmetic mean of the chord lengths of the equal-chroma circles of the two colors.
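The description above (lightness, chroma, and hue differences in $L^*C^*h^*$, with the hue term interpreted as a chord length) matches the CIE94 color-difference formula. Assuming that this is the definition intended here, and leaving the application-dependent weights $k_L$, $k_C$, $k_H$, $K_1$, and $K_2$ generic, it can be written as follows.

```latex
\Delta E^*_{94} = \sqrt{
  \left(\frac{\Delta L^*}{k_L S_L}\right)^2 +
  \left(\frac{\Delta C^*_{ab}}{k_C S_C}\right)^2 +
  \left(\frac{\Delta H^*_{ab}}{k_H S_H}\right)^2 }

\Delta L^* = L_1^* - L_2^*, \qquad
C_1^* = \sqrt{a_1^{*2} + b_1^{*2}}, \qquad
C_2^* = \sqrt{a_2^{*2} + b_2^{*2}}, \qquad
\Delta C^*_{ab} = C_1^* - C_2^*

\Delta a^* = a_1^* - a_2^*, \qquad
\Delta b^* = b_1^* - b_2^*, \qquad
\Delta H^*_{ab} = \sqrt{\Delta a^{*2} + \Delta b^{*2} - \Delta C^{*2}_{ab}}

S_L = 1, \qquad S_C = 1 + K_1 C_1^*, \qquad S_H = 1 + K_2 C_1^*
```

Whether the paper uses these weighting functions or an unweighted variant cannot be determined from the surviving text.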
Height. The difference of heights is the second input parameter of the fuzzy system. Only objects with similar heights should be merged, because characters of the same word or sentence have the same or similar heights. The height difference $\Delta\mathrm{Ht}$ of two objects is computed from $\mathrm{Ht}_i$ and $\mathrm{Ht}_j$, the heights of the $i$th and $j$th objects, respectively.
Position. The position of two objects should be the same for a merger. The merging process is proposed for horizontal text only, as most caption text is horizontally aligned; it can be extended to other directions by considering position at different angles. The position difference $\Delta\mathrm{Pos}$ is computed from $\mathrm{Pos}_i$ and $\mathrm{Pos}_j$, the bottom coordinates of the bounding boxes of the $i$th and $j$th objects, respectively.
Distance. Characters of the same word or sentence are placed close together. The distance between characters varies with font size and is highly dependent on the heights of the characters. The distance $\Delta D$ between two regions in an image is computed from $x_i^{(1)}$ and $x_i^{(2)}$, the left and right coordinates of the bounding box of the $i$th object. Figure 2 illustrates height, position, and distance.
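One possible way to compute the three geometric features from bounding boxes is sketched below. The exact formulas are not recoverable from the text, so everything here is an assumption: the box layout `(left, top, right, bottom)`, the normalization by the larger of the two heights (motivated by the remark that distance depends on character height), and the function name.

```python
def geometric_features(box_i, box_j):
    """Return (height diff, position diff, distance) for two bounding boxes.

    Boxes are (left, top, right, bottom) with y growing downward.
    Normalizing by the larger height is an assumption, not the paper's formula.
    """
    ht_i = box_i[3] - box_i[1]
    ht_j = box_j[3] - box_j[1]
    ref = max(ht_i, ht_j, 1)                      # guard against zero height
    d_ht = abs(ht_i - ht_j) / ref                 # difference of heights
    d_pos = abs(box_i[3] - box_j[3]) / ref        # difference of bottom coordinates
    # Horizontal gap between the right edge of one box and the left edge of the other.
    gap = max(box_i[0], box_j[0]) - min(box_i[2], box_j[2])
    d_dist = max(0, gap) / ref                    # overlapping boxes give 0
    return d_ht, d_pos, d_dist

print(geometric_features((0, 0, 10, 20), (15, 2, 25, 20)))
```

Relative rather than absolute differences keep the features comparable across font sizes and image resolutions, which is the variation the method targets.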

3.2.2. Fuzzification. This step takes the inputs and determines the degree to which they belong to each of the appropriate fuzzy sets by means of membership functions. The input has to be a crisp numerical value bounded to the universe of discourse of the input variable, and the output is a fuzzy degree of membership in the qualifying linguistic set. Fuzzification of the input amounts to either a table lookup or a function evaluation. Let the inputs to the fuzzy system be represented in vector notation:
$$x^* = (\Delta E, \Delta\mathrm{Ht}, \Delta\mathrm{Pos}, \Delta D)^T,$$
where $x^* \in \mathbb{R}^4$ represents real-valued points. We define a symmetric Gaussian function and a sigmoid function for the inputs. The symmetric Gaussian function is defined by two parameters $c$ and $\sigma$:
$$\mu(x; c, \sigma) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right),$$
applied to each input with its own parameters, where $k = 1, 2$; $l = 1, 2$; $m = 1, 2$; and $h = 1, 2$ index the fuzzy sets of the four inputs. $c_1^{(k)}$, $c_2^{(l)}$, $c_3^{(m)}$, and $c_4^{(h)}$ represent the means of the fuzzy sets, and $\sigma_1^{(k)}$, $\sigma_2^{(l)}$, $\sigma_3^{(m)}$, and $\sigma_4^{(h)}$ determine their variances.
The third membership function of every input rises gradually from a small initial value and saturates at its maximum. A sigmoid function is used to express this behavior:
$$\mu(x; a, c) = \frac{1}{1 + e^{-a(x - c)}},$$
where $k = 3$; $l = 3$; $m = 3$; and $h = 3$ denote the fuzzy set's number, and $a_{k \cdots h}$ and $c_{k \cdots h}$ are the model parameters to be fitted.
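The fuzzification step can be sketched with the standard Gaussian and sigmoid membership functions, two Gaussian sets and one sigmoid set per input. The function names, the dictionary layout, and all parameter values in `HEIGHT_SETS` are hypothetical placeholders; the paper treats the actual parameters as values to be fitted.

```python
import math

def gaussian_mf(x, c, sigma):
    """Symmetric Gaussian membership with center c and spread sigma."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def sigmoid_mf(x, a, c):
    """Sigmoid membership with slope a and crossover point c."""
    return 1.0 / (1.0 + math.exp(-a * (x - c)))

def fuzzify(x, params):
    """Map one crisp input to its degrees in the three linguistic sets."""
    return {
        "same": gaussian_mf(x, *params["same"]),          # (c, sigma)
        "similar": gaussian_mf(x, *params["similar"]),    # (c, sigma)
        "different": sigmoid_mf(x, *params["different"]), # (a, c)
    }

# Hypothetical parameters for the height-difference input.
HEIGHT_SETS = {"same": (0.0, 0.1), "similar": (0.3, 0.1), "different": (20.0, 0.6)}

print(fuzzify(0.25, HEIGHT_SETS))
```

A single crisp value can belong to several sets at once with different degrees, which is what allows the rule base to weigh features against each other.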

Product Inference Engine.
A multiple-input, single-output fuzzy rule base is employed for the merging problem. The product inference engine (PIE) makes use of the fuzzy rule base and linguistic rules. PIE encompasses individual-rule-based inference with union combination, min implication, the min operator for the t-norm, and the max operator for the s-norm:
Ru(1): IF $\Delta E$ is $A_1$ and $\Delta\mathrm{Ht}$ is $B_1$ and $\Delta\mathrm{Pos}$ is $C_1$ and $\Delta D$ is $D_1$ THEN $y$ is $G_2$;
Ru(2): IF $\Delta E$ is $A_2$ and $\Delta\mathrm{Ht}$ is $B_1$ and $\Delta\mathrm{Pos}$ is $C_1$ and $\Delta D$ is $D_1$ THEN $y$ is $G_2$;
… and $\Delta\mathrm{Pos}$ is $C_1$ and $\Delta D$ is $D_1$ THEN $y$ is $G_2$;
Ru(5): … and $\Delta\mathrm{Pos}$ is $C_2$ and $\Delta D$ is $D_1$ THEN $y$ is $G_2$;
Ru(6): … and $\Delta\mathrm{Pos}$ is $C_2$ and $\Delta D$ is $D_1$ THEN $y$ is $G_2$;
Ru(7): …
where $(A_1, A_2, A_3)$, $(B_1, B_2, B_3)$, and $(C_1, C_2, C_3)$ are the input fuzzy membership functions for same, similar, and different; $(D_1, D_2, D_3)$ are the membership functions for minimum, average, and maximum; and $(G_1, G_2)$ are the output membership functions corresponding to not join and join. A triangular function is used as the output membership function, defined by
$$f(y; a, b, c) = \begin{cases} 0, & y \le a, \\ \dfrac{y - a}{b - a}, & a \le y \le b, \\ \dfrac{c - y}{c - b}, & b \le y \le c, \\ 0, & c \le y. \end{cases}$$
More compactly, it can be expressed as
$$f(y; a, b, c) = \max\left(\min\left(\frac{y - a}{b - a}, \frac{c - y}{c - b}\right), 0\right).$$
The parameters $a$ and $c$ define the feet of the triangle and the parameter $b$ defines the peak. Figure 3 shows the different membership functions used in the system.
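Rule evaluation with the min operator as t-norm and the max operator as s-norm can be sketched as follows, together with the triangular output function. The first two rules are the two fully legible rules from the rule base; the third rule, the variable and label names, and the membership degrees in the usage example are hypothetical.

```python
def triangular_mf(y, a, b, c):
    """Triangular membership with feet a and c and peak b (requires a < b < c)."""
    return max(min((y - a) / (b - a), (c - y) / (c - b)), 0.0)

def fire_rules(memberships, rules):
    """Return the firing strength of each output label.

    memberships: {variable: {label: degree}}
    rules: list of (antecedent dict {variable: label}, output label)
    """
    strengths = {}
    for antecedent, output in rules:
        # t-norm: the AND of all antecedents is their minimum.
        w = min(memberships[var][label] for var, label in antecedent.items())
        # s-norm: rules with the same output are combined by maximum.
        strengths[output] = max(strengths.get(output, 0.0), w)
    return strengths

RULES = [
    # Ru(1) and Ru(2) as given in the rule base.
    ({"color": "same", "height": "same", "position": "same", "distance": "minimum"}, "join"),
    ({"color": "similar", "height": "same", "position": "same", "distance": "minimum"}, "join"),
    # Hypothetical rule, added only so that both outputs appear in the example.
    ({"color": "different"}, "not_join"),
]

degrees = {
    "color": {"same": 0.2, "similar": 0.7, "different": 0.1},
    "height": {"same": 0.9},
    "position": {"same": 0.8},
    "distance": {"minimum": 0.6},
}
print(fire_rules(degrees, RULES))
```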
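The crisp output is obtained with the center average defuzzifier (CAD):
$$y^* = \frac{\sum_{l} \bar{y}^{(l)} w^{(l)}}{\sum_{l} w^{(l)}},$$
where $\bar{y}^{(l)}$ and $w^{(l)}$ are the center and height of the output fuzzy sets. CAD is chosen because it is computationally less expensive and offers better accuracy and continuity than other defuzzifiers [36].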
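Reading CAD as the center average defuzzifier, which fits the "center and height" wording, the crisp output is the height-weighted average of the output set centers. The sketch below assumes that reading; the function name, the centers 0.25 and 0.75 for not join and join, and the heights are hypothetical.

```python
def center_average_defuzzify(centers, heights):
    """Weighted average of output-set centers, weighted by their heights."""
    total = sum(heights)
    if total == 0:
        return None  # no rule fired, so no crisp decision is available
    return sum(c * h for c, h in zip(centers, heights)) / total

# Hypothetical centers for "not join" and "join" and their firing heights.
score = center_average_defuzzify([0.25, 0.75], [0.1, 0.6])
print(round(score, 4))
```

Only one multiplication per output set and a single division are needed, which is consistent with the low computational cost cited as the reason for choosing this defuzzifier.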

Results and Experiments
The dataset of the ICDAR 2011 Robust Reading Competition, Challenge 1: "Reading Text in Born-Digital Images (Web and Email)," is used in this research; it comprises 102 test images and 420 training images. The dataset exhibits vast variation in font size, resolution, background complexity, and font type. The ICDAR dataset is also recognized as the most widely used benchmark for text detection.
The ranking metric used for the text segmentation task is accuracy, defined as the ratio of correctly segmented objects to the total number of objects. In the text detection and localization problem, an isolated character is also counted as undersegmentation. The proposed method obtained 90.7% accuracy for segmentation of text objects. A comparison of the segmentation results with and without fuzzy merging is shown in Figure 4. Segmentation without fuzzy merging is tested for adaptive and fixed-size structuring elements. The results show that fuzzy merging plays a very effective role in segmentation for text detection.
To demonstrate the practicability of the proposed segmentation method, fuzzy merging is added as a post-segmentation process to textorter [37], the best-performing technique in the ICDAR Robust Reading Competition 2011 [38]; the results show a major improvement in the detection rate of textorter. Many isolated characters are not detected as text by textorter because they are not merged into complete words. The ranking metric used for the text localization task is the harmonic mean, which is computed according to the methodology proposed in [39].
It is a combination of two measures: precision and recall.
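For reference, the harmonic mean of precision $P$ and recall $R$ takes the standard form below; how $P$ and $R$ themselves are computed from the detections follows [39] and is not reproduced here.

```latex
H = \frac{2 \, P \, R}{P + R}
```

The harmonic mean is high only when both precision and recall are high, so a method cannot rank well by maximizing one at the expense of the other.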
Table 1 shows the comparison of results for different text detection methods.
Figure 5 shows the superiority of the proposed method. The results show that fuzzy merging enhances the segmentation process for text detection.
Different combinations of input and output membership functions are tested, and the results show that the combination adopted in the proposed methodology gives the best outcome. Gaussian, triangular, sigmoid, trapezoidal, and bell-shaped functions are commonly used membership functions; these are tested in different combinations within the fuzzy inference engine. Gaussian, triangular, and sigmoid functions are defined in Section 3.2.2. The bell-shaped function can be defined as
$$f(x; a, b, c) = \frac{1}{1 + \left|\dfrac{x - c}{a}\right|^{2b}}.$$
The generalized bell function is defined using three parameters $a$, $b$, and $c$; the parameter $b$ is mostly positive, whereas the parameter $c$ locates the middle of the curve. The trapezoidal curve is a function of a vector $x$ and depends on four scalar parameters $a$, $b$, $c$, and $d$:
$$f(x; a, b, c, d) = \begin{cases} 0, & x \le a, \\ \dfrac{x - a}{b - a}, & a \le x \le b, \\ 1, & b \le x \le c, \\ \dfrac{d - x}{d - c}, & c \le x \le d, \\ 0, & d \le x, \end{cases}$$
or it can be defined compactly as
$$f(x; a, b, c, d) = \max\left(\min\left(\frac{x - a}{b - a}, 1, \frac{d - x}{d - c}\right), 0\right).$$
The parameters $a$ and $d$ locate the "feet" of the trapezoid and the parameters $b$ and $c$ set the "shoulders." Figure 6 shows the comparison of different membership functions for the four inputs.
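The two additional membership functions can be sketched in their standard forms. The function names and the parameter values in the usage lines are hypothetical; the sketch only illustrates the shapes being compared.

```python
def bell_mf(x, a, b, c):
    """Generalized bell membership: width a, shape b, center c (a must be nonzero)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def trapezoid_mf(x, a, b, c, d):
    """Trapezoidal membership with feet a, d and shoulders b, c (a < b <= c < d)."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

# Hypothetical parameters, chosen only to show the shapes.
print(bell_mf(0.5, 0.2, 2, 0.5), bell_mf(0.7, 0.2, 2, 0.5))
print(trapezoid_mf(0.1, 0.0, 0.2, 0.8, 1.0), trapezoid_mf(0.5, 0.0, 0.2, 0.8, 1.0))
```

The bell function equals 0.5 at a distance of exactly `a` from its center, whereas the trapezoid is exactly 1 on the whole interval between its shoulders.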

Conclusion
This paper addresses a crucial problem of text detection, namely, variation in font size and resolution. Earlier approaches are primarily dataset specific and unable to deal with large variation of font sizes. This paper devises a fuzzy-based postprocessing method for segmentation that can operate in combination with any segmentation method. Four factors are put forth for joining characters into words. These factors are fed into the fuzzy system, which gives the verdict of joining or not joining regions. The dataset of the ICDAR 2011 Robust Reading Competition, Challenge 1: "Reading Text in Born-Digital Images (Web and Email)," is used in this research, and the results achieved show that the method is effective for the text detection problems described above.

Figure 1: Architecture of the proposed method.
$$\text{Accuracy} = \frac{\text{correctly segmented objects}}{\text{total number of objects}}.$$

Figure 4: Comparison of proposed methodology with other techniques.

Table 1: Comparison of proposed work with other techniques.