Textual information embedded in multimedia is a valuable cue for indexing and retrieval, and text localization and detection have therefore received considerable attention. One of the biggest challenges in text detection is coping with variation in font size and image resolution. The problem is aggravated by undersegmentation or oversegmentation of the regions in an image. This paper addresses the problem with a novel fuzzy-based postprocessing method for segmentation that handles variation in both text size and image resolution. The methodology is tested on the ICDAR 2011 Robust Reading Challenge dataset, and the results support the strength of the proposed method.
Recently there has been rapid growth in multimedia repositories, which has raised the need for efficient retrieval, indexing, and browsing of multimedia information. Several methodologies have been presented in the literature to retrieve image and video data by exploiting color, texture, shape, the relations between objects, and so forth. Text embedded in images, however, can be especially useful for retrieval, because visual text in multimedia conveys information such as news headlines, movie titles, product trade names, summaries of sports contests, and the date and time of events. Such information supports the understanding and retrieval of images and videos.
Text embedded in images may be categorized into two classes: caption text and scene text. Caption text is overlaid on the image during editing, for example, news headings and match summaries or scores; it is also referred to as artificial or superimposed text. Scene text is an actual part of the scene, for example, a product brand name shown during a commercial break, text on a signboard or nameplate, and text visible on clothing or products.
One of the key challenges in text detection is dealing with text size variation. This variation is of two types: variation in the spatial resolution of images and variation of font sizes within an image. This paper focuses on this problem and provides viable solutions for both types.
The rest of the paper is ordered as follows. Section
A variety of techniques for text extraction have appeared in the recent past [
Segmentation identifies the occurrence of different regions in the image but does not recognize the relation between these regions. It is essential to merge the characters of a word to form a text object, because most text detection techniques work on groups of characters and it is very difficult to detect an isolated character [
Presently, few pixel level merging methods are introduced in the literature pertaining to text detection. Dilation is the most commonly used merging technique [
Object-level merging is closer to human vision and deals with objects and regions instead of pixels. It connects potential character objects to form text strings. Grouping and merging therefore depend on high-level features, which gives better performance.
Wolf and Jolion [
Shi et al. [
Although existing techniques define these features with strict boundaries, the relation between neighboring characters is not crisp. It is unreasonable to declare a character a neighbor when its distance-to-height ratio is 1.50 or less but to reverse that verdict when the ratio is 1.51. The parameters that define the proximity of potential characters should be fuzzy rather than crisp. There is thus a need for a merging process in which the rules of inference are formulated in a general way using fuzzy categories, and in which each feature used to measure the degree of neighborhood is given some weight. Moreover, the similarity obtained with the currently reported features mostly does not correspond to human perception. Human perception of proximity, similar height, and similar color cannot be fully expressed with discrete, rigid boundaries or thresholds. These linguistic variables are better defined by fuzzy logic.
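The contrast between a crisp threshold and a graded membership can be illustrated with a short sketch. The decreasing sigmoid and its `midpoint` and `steepness` values are illustrative assumptions, not the membership functions used in this paper; only the 1.50 threshold comes from the text.

```python
import math

def crisp_neighbor(ratio, threshold=1.5):
    """Hard rule: a neighbor if and only if distance/height <= threshold."""
    return 1.0 if ratio <= threshold else 0.0

def fuzzy_neighbor(ratio, midpoint=1.5, steepness=4.0):
    """Soft rule: membership decays smoothly as the ratio grows.

    midpoint and steepness are assumed values chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(steepness * (ratio - midpoint)))

# The crisp rule flips between 1.50 and 1.51; the fuzzy degree barely moves.
for r in (1.0, 1.50, 1.51, 2.5):
    print(f"ratio={r:.2f} crisp={crisp_neighbor(r):.0f} fuzzy={fuzzy_neighbor(r):.3f}")
```

With the crisp rule, a change of 0.01 in the ratio reverses the decision, whereas the graded degree changes by about 0.01.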
Component extraction or segmentation is the procedure of dividing a digital image into multiple fragments, called superpixels [
The proposed segmentation method consists of two processes: splitting and merging. Splitting is performed by traditional region-based segmentation techniques, whereas merging is based on the novel fuzzy-based method. Figure
Architecture of the proposed method.
There is a sharp transition between text and its background. Edge detection is a promising segmentation tool for text images because a sharp intensity transition is a feature common to all text objects. Exploiting this feature, the proposed methodology segments the image using edge detection together with connected component labeling: the Sobel operator detects edges, and image dilation connects broken edges.
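A minimal pure-Python sketch of this splitting step is given here, assuming a grayscale image stored as a list of rows. The gradient threshold, the |gx| + |gy| magnitude approximation, and the use of 8-connectivity are assumptions made for illustration; the text does not specify them.

```python
from collections import deque

def sobel_edges(img, threshold):
    """Binary edge map from the Sobel gradient; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            # |gx| + |gy| is a common approximation of the gradient magnitude.
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 1
    return edges

def label_components(binary):
    """8-connected component labeling by breadth-first search."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] and not labels[sy][sx]:
                count += 1
                labels[sy][sx] = count
                queue = deque([(sy, sx)])
                while queue:
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = count
                                queue.append((ny, nx))
    return labels, count

# Two bright blocks on a dark background stand in for two characters.
image = [[0] * 14 for _ in range(8)]
for row in range(2, 6):
    for col in list(range(2, 5)) + list(range(9, 12)):
        image[row][col] = 255
_, n_components = label_components(sobel_edges(image, threshold=200))
print(n_components)
```

Each labeled component is then treated as a candidate region for the merging stage.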
The size of the dilation operator is adapted to the resolution of the image and ranges between 3% and 5% of the image width. Dilation is performed before fuzzy merging only to reduce the computational effort; the proposed fuzzy merging method can work without this morphological operation.
The following section explains the fuzzy merging process.
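The resolution-dependent dilation can be sketched as follows. The 3% to 5% range comes from the text; the square structuring element, the rounding to an odd size, and the function names are assumptions for illustration.

```python
def adaptive_kernel_size(image_width, fraction=0.03):
    """Kernel side length as a fraction of the image width.

    The fraction must lie in [0.03, 0.05] as stated in the text. Rounding up
    to an odd size (so the kernel has a center pixel) is an assumption.
    """
    if not 0.03 <= fraction <= 0.05:
        raise ValueError("fraction must be between 0.03 and 0.05")
    size = max(1, round(image_width * fraction))
    return size if size % 2 == 1 else size + 1

def dilate(binary, size):
    """Binary dilation with a size x size square structuring element."""
    h, w = len(binary), len(binary[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                for ny in range(max(0, y - r), min(h, y + r + 1)):
                    for nx in range(max(0, x - r), min(w, x + r + 1)):
                        out[ny][nx] = 1
    return out

# A one-pixel break in an edge is closed by a 3 x 3 dilation.
broken_edge = [[0, 0, 0, 0, 0],
               [1, 1, 0, 1, 1],
               [0, 0, 0, 0, 0]]
print(dilate(broken_edge, adaptive_kernel_size(100))[1])
```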
Let
The problem of merging process can be defined using the graph theory. Let
Fuzzy-based methods assign each object a gradual membership value, measured as a degree in [0, 1], for joining other text instances. This gives the flexibility to connect objects on the basis of more than one feature, depending on the membership values of all the parameters.
The merging of character candidates relies on a number of factors. Four features are extracted to decide whether objects should be joined into words or sentences: color, height, position, and distance.
The Lab color space is used to describe the color of an object. Unlike RGB and CMYK, the Lab color space is designed to approximate human vision.
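As an illustration of how the three geometric features might be measured for a pair of candidate regions, the sketch below works on bounding boxes. The exact definitions used in this paper are not reproduced here; the normalization by the larger height, the dictionary keys, and the helper name `pair_features` are all hypothetical.

```python
def pair_features(box_a, box_b):
    """Geometric features for two boxes given as (x, y, width, height).

    The origin is the top-left corner. All three definitions below are
    assumptions for illustration, not the paper's formulas.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ref = max(ha, hb)  # normalize by the taller box
    # Height similarity: 1.0 for equal heights, smaller otherwise.
    height = min(ha, hb) / ref
    # Position: vertical offset between the box centers.
    position = abs((ya + ha / 2) - (yb + hb / 2)) / ref
    # Distance: horizontal gap between the boxes (0 if they overlap).
    gap = max(xa, xb) - min(xa + wa, xb + wb)
    distance = max(0.0, gap) / ref
    return {"height": height, "position": position, "distance": distance}

# Two same-sized characters on one baseline, 5 pixels apart.
print(pair_features((0, 0, 10, 20), (15, 0, 10, 20)))
```

Normalizing by height keeps the features comparable across font sizes and image resolutions, which is the variation this paper targets.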
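One common way to compare two Lab colors is the CIE76 color difference, which is the Euclidean distance between the two Lab triples. The sketch below assumes this measure for illustration; whether it matches the color feature used in this paper is an assumption.

```python
import math

def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between (L, a, b) triples."""
    return math.sqrt(sum((c1 - c2) ** 2 for c1, c2 in zip(lab1, lab2)))

# Same lightness, slightly different chroma.
print(delta_e_cie76((50.0, 0.0, 0.0), (50.0, 3.0, 4.0)))
```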
Geometrically, the amount
Factors fed into the fuzzy system: (a) height, (b) position, and (c) distance.
This step takes the inputs and determines, by means of membership functions, the degree to which they belong to each of the appropriate fuzzy sets. The input must be a crisp numerical value within the universe of discourse of the input variable, and the output is a fuzzy degree of membership in the qualifying linguistic set. Fuzzification of the input amounts to either a table lookup or a function evaluation.
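Fuzzification by function evaluation can be sketched with the standard Gaussian, triangular, and sigmoid membership functions. The linguistic labels and all parameter values below are hypothetical and chosen only to make the example concrete.

```python
import math

def gaussian(x, center, sigma):
    """Symmetric Gaussian membership; equals 1 at the center."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def sigmoid(x, slope, center):
    """Sigmoid membership; rises from 0 to 1 and equals 0.5 at the center."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))

def fuzzify_distance(ratio):
    """Map a crisp distance-to-height ratio to degrees in three assumed sets."""
    return {
        "near": gaussian(ratio, center=0.0, sigma=0.5),
        "medium": triangular(ratio, a=0.5, b=1.5, c=2.5),
        "far": sigmoid(ratio, slope=4.0, center=2.5),
    }

print(fuzzify_distance(1.5))
```

A single crisp input thus yields one degree per linguistic set, and several sets can be partially active at once.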
Let the inputs to the fuzzy system be represented in the vector notation:
The symmetric Gaussian function is defined by two parameters
The third membership function of every input rises gradually from a small starting value and eventually saturates at its maximum. A sigmoid function is used to express this behavior. Consider
The following function is used to map
Multiple inputs and single output fuzzy rule-base is employed for the current merging problem. Product inference engine (PIE) makes use of fuzzy rule base and linguistic rules. PIE encompasses individual rule-based inference with union combination, min implication, min operator for
Membership functions: (a) color, (b) height, (c) position, (d) distance, and (e) output.
PIE can be fully defined by
Defuzzification is the mapping of fuzzy values into real-world values. The center average defuzzifier (CAD) is used; it computes the weighted average of the centers of the fuzzy sets and provides a reasonable approximation:
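In its common textbook form, the center average defuzzifier returns the sum of each output set's center multiplied by its weight (the firing strength of the corresponding rule), divided by the sum of the weights. The sketch below assumes that form; the centers and weights in the example are made-up values, not results from this paper.

```python
def center_average_defuzzify(centers, weights):
    """Weighted average of the output fuzzy set centers.

    centers: center of each output fuzzy set.
    weights: firing strength of the rule attached to each set, in [0, 1].
    """
    if len(centers) != len(weights):
        raise ValueError("centers and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("no rule fired; the output is undefined")
    return sum(c * w for c, w in zip(centers, weights)) / total

# Three assumed output sets ("do not merge", "uncertain", "merge").
score = center_average_defuzzify([0.0, 0.5, 1.0], [0.1, 0.3, 0.6])
print(round(score, 3))
```

The crisp score can then be compared with a decision level to join or keep apart the two regions.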
The dataset of the ICDAR 2011 Robust Reading Competition, Challenge 1: “Reading Text in Born-Digital Images (Web and Email),” is used in this research; it comprises 102 test images and 420 training images. The dataset exhibits wide variation in font size, resolution, background complexity, and font type. The ICDAR dataset is also recognized as the most widely used benchmark for text detection.
The ranking metric used for the text segmentation task is accuracy. Accuracy of segmentation can be defined as
Comparison of proposed methodology with other techniques.
In order to prove the practicability of the proposed segmentation method, fuzzy merging is added as the post segmentation process in textorter [
Comparison of proposed work with other techniques.
Method | Recall | Precision | Harmonic mean |
---|---|---|---|
Textorter [ | 69.62 | 85.83 | 76.88 |
TH-TextLoc* | 73.08 | 80.51 | 76.62 |
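The harmonic mean column of the table can be reproduced from the recall and precision columns. The sketch below assumes the usual definition, 2PR / (P + R), and uses only the values listed for Textorter and TH-TextLoc.

```python
def harmonic_mean(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)

for name, recall, precision in [("Textorter", 69.62, 85.83),
                                ("TH-TextLoc", 73.08, 80.51)]:
    print(name, round(harmonic_mean(recall, precision), 2))
```

Both computed values agree with the tabulated 76.88 and 76.62 after rounding to two decimals.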
Figure
Results of the proposed method. (a) Original images, (b) after splitting method, and (c) after fuzzy merging.
Different combinations of input and output membership functions were tested, and the results show that the combination adopted in the proposed methodology gives the best outcome. Gaussian, triangular, sigmoid, trapezoidal, and bell-shaped functions are commonly used membership functions, and these were tested in different combinations within the fuzzy inference engine. Gaussian, triangular, and sigmoid functions are defined in Section
The trapezoidal curve is a function of a vector “
Figure
Comparison of different membership functions.
This paper addresses a crucial problem in text detection: variation in font size and resolution. Earlier approaches are primarily dataset specific and cannot deal with large variations in font size. This paper devises a fuzzy-based postprocessing method for segmentation that can be combined with any segmentation method. Four factors are used for joining characters into words; they are fed into the fuzzy system, which decides whether or not to join regions. The method was evaluated on the dataset of the ICDAR 2011 Robust Reading Competition, Challenge 1: “Reading Text in Born-Digital Images (Web and Email),” and the results show that it is effective against the problems referred to above.
The authors declare that they have no conflicts of interest regarding the publication of this paper.