A Scene Text Detector for Text with Arbitrary Shapes

The performance of text detection is crucial for the subsequent recognition task. Currently, the accuracy of text detectors still needs further improvement, particularly for text instances with irregular shapes in complex environments. We propose a pixel-wise method based on instance segmentation for scene text detection. Specifically, a text instance is split into five components: a Text Skeleton and four Directional Pixel Regions. The instance is then restored from these elements, and when one element fails, it receives supplementary information from the other regions. Besides, a Confidence Scoring Mechanism is designed to filter out symbols and characters that resemble text instances. Experiments on several challenging benchmarks demonstrate that our method achieves state-of-the-art results in scene text detection with an F-measure of 84.6% on Total-Text and 86.3% on CTW1500.


Introduction
Detecting text in the real world is a fundamental computer vision task that directly determines the subsequent recognition results. Many real-world applications depend on accurate text detection, such as photo translation [1] and autonomous driving [2]. Nowadays, horizontal- [3][4][5] and oriented-box-based [6][7][8][9][10] methods no longer meet practical requirements, and more flexible pixel-wise detectors [11, 12] have become mainstream. However, precisely locating text instances is still a challenge because of arbitrary angles, shapes, and complex backgrounds. The first challenge involves text instances with irregular shapes. Unlike other common objects, a shaped instance often cannot be accurately described by a horizontal box or an oriented quadrilateral. Some typical methods (e.g., EAST [8] and TextBoxes++ [10]) perform well on the common benchmarks (e.g., ICDAR 2013 [13] and ICDAR 2015 [14]) but degrade on curved text challenges, as shown in Figure 1(a). The second challenge is separating the boundaries of adjacent text instances. Although pixel-wise methods do not suffer from a fixed shape, they may still fail to separate text areas with adjacent edges, as shown in Figure 1(b). The third challenge is that text identification may face a false-positive dilemma [15] because of the lack of context information: some symbols or characters similar to text may be misclassified.
To overcome the aforementioned challenges, we propose a novel method called TextCohesion. As shown in Figure 2, our method treats a text instance as a combination of a Text Skeleton and four Directional Pixel Regions, where the former roughly represents the shape and profile, and the latter are responsible for refining the original region from four directions. Notably, a pixel can belong to more than one Directional Pixel Region (e.g., up and left), which means the instance has more chances to be recovered. Furthermore, the confidence score of every Text Skeleton is reviewed, and only a skeleton whose score is higher than a threshold is considered a candidate.

Related Work

Existing methods can be roughly divided into two categories of text detection: regression-based and pixel-based. Inspired by the promise of object detection architectures such as Faster R-CNN [26] and SSD [27], a number of regression-based detectors have been proposed, which simply regress the coordinates of candidate bounding boxes as the final prediction. TextBoxes [7] adopts SSD and adjusts the default boxes to relatively long shapes to match text instances. PyrBoxes [28] proposes an SSD-based detector equipped with a grouped pyramid to enrich features. Sheng [29] proposes a novel text detector with learnable anchors to cover all varieties of text in natural scenes. Lyu [30] detects scene text by localizing the corner points of text bounding boxes and segmenting text regions in relative positions. By modifying Faster R-CNN, Rotation Region Proposal Networks [31] insert a rotation branch to fit the oriented shapes of text in natural images. These methods can achieve satisfying performance on horizontal or multioriented text areas. However, they may suffer from the shape of the bounding box, even with rotations. Mainstream pixel-wise methods draw inspiration from the fully convolutional network (FCN) [32], which removes all fully connected layers and is widely used to generate a semantic segmentation map; a transposed convolution operation then helps the shrunken feature restore its original size.
TextSnake [11] treats a text instance as a sequence of ordered, overlapping disks centered at the symmetric axis, each of which is associated with a potentially variable radius and orientation. It made significant progress on curved text benchmarks. TextField [33] learns a direction field pointing away from the nearest text boundary at each text point; an image of two-dimensional vectors represents the direction field. SPCNET [34], based on FPN [35] and Mask R-CNN [36], inserts a Text Context Module and a Re-Score mechanism to compensate for the lack of context information and inaccurate classification scores. PSENet [37] projects features into several maps and gradually expands the detected areas from small kernels to large and complete instances. These pixel-based methods significantly improve performance on curved benchmarks. However, detection failures are still possible in complex situations. Different from previous methods, the proposed method has more opportunities to recover a text instance. Specifically, the Text Skeleton represents the profile of the instance, which is smaller and less sticky than the original form. Pixels in text areas are divided into two groups according to the four directions: up-down and left-right. Ideally, a TS can be integrated with either group to restore itself; when some regions fail to reproduce the instance, there is also an opportunity to get additional supplementary information from the others. We conduct extensive experiments on standard benchmarks, including horizontal, oriented, and curved text datasets. Evaluations demonstrate that TextCohesion achieves state-of-the-art or very competitive performance.

Figure 2: The overall procedure of the proposed method consists of Feature Extraction and Postprocessing. Five feature maps are generated from the backbone (e.g., VGG16) and upsampled in the Feature Extraction step. The DPRs and the TS regions are adopted to reconstruct text instances in the postprocessing step.

Methodology
The architecture of TextCohesion is depicted in Figure 2, which consists of a feature extraction section and a postprocessing section. For image feature extraction, an FCN-based convolutional backbone followed by an up-sampling step is employed. Five feature maps containing a Text Skeleton (TS) and four Directional Pixel Regions (DPRs) are generated after up-sampling. The TS features are evaluated by a Confidence Scoring Mechanism (CSM), and the predicted text regions are finally obtained by incorporating the DPRs. To optimize the proposed network, a corresponding loss function for the TS and DPRs is designed. More details are introduced in the following sections.

Network.
The proposed method inherits the popular VGG16 network by keeping the layers from Conv1 to Conv5 and converting the last fully connected layers into convolution layers. The input images are first downsampled to multilevel features with five convolution blocks, and five feature maps (i.e., P_1, P_2, P_3, P_4, P_5) are generated. Then, these features are gradually upsampled to the original size and mixed with the corresponding output of the previous convolution block. The upsampling process can be described by

O_5 = P_5,  O_i = Up(O_{i+1}) ⊕ P_i,  i = 4, 3, 2, 1,

where O = O_1 is the output of the network, "⊕" refers to feature concatenation, and Up is the upsample function (i.e., Conv(1, 1)-Conv(3, 3)-Deconv-ReLU) used to resize the feature map to match the other layers. Five feature maps with the same resolution are leveraged as the prediction of the network (the blue box shown in Figure 2) after the upsample step. Each prediction is composed of a TS and four DPRs in the postprocessing. The DPRs contain four feature maps corresponding to different directions: R_1, R_2, R_3, and R_4. The TS is the skeleton of the text instance, which is adopted to separate instances from each other. The CSM is introduced to reduce false positives by evaluating each TS. For clarity, we take a curved text as an example to demonstrate the process of label generation in the rest of Section 3.
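As a minimal sketch of the upsample-and-concatenate fusion described above, the following toy example uses nearest-neighbour upsampling in place of the Conv(1, 1)-Conv(3, 3)-Deconv-ReLU block and random toy feature maps; the helper names `up` and `merge` are hypothetical, not from the paper:

```python
import numpy as np

def up(x):
    # Nearest-neighbour 2x upsample standing in for Conv(1,1)-Conv(3,3)-Deconv-ReLU
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge(deep, shallow):
    # Upsample the deeper feature map and concatenate it with the shallower one
    return np.concatenate([up(deep), shallow], axis=0)

# Toy feature maps P1..P5 with halving spatial sizes, shape (channels, H, W)
P = [np.random.rand(8, 64 // 2**i, 64 // 2**i) for i in range(5)]

O = P[4]                    # start from the deepest map P5
for i in range(3, -1, -1):  # fuse P4, P3, P2, P1 in turn
    O = merge(O, P[i])
print(O.shape)              # spatial size now matches P1
```

The real network would mix convolutions into each fusion step; this sketch only illustrates the resolution-matching and concatenation order.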

Text Skeleton.
Text Skeleton (TS) is an essential component representing the center part of the text instance. As shown in Figure 3(b), the gray area is the TS of the instance. The first step of generating the TS is to find the head and tail of the text. Similar to [11], we use the cosine of adjacent vertices to find the head and tail of the text instance; the remaining two longest sides along the text instance (e.g., t_0 t_n and b_0 b_n) are called sidelines in the proposed method. Then, n evenly distributed vertices are sampled from each of the two sidelines (i.e., the Top Sideline and Bottom Sideline in Figure 3(a)). After that, the vertices of the center line (Head-Tail in Figure 3) can be averaged from these sampled vertices:

c_i = (t_i + b_i) / 2,  i = 0, 1, . . . , n,

where t_0, t_1, . . . , t_i, . . . , t_n and b_0, b_1, . . . , b_i, . . . , b_n are the sampled vertices on the two sidelines of the text instance, respectively, and c_0, c_1, . . . , c_i, . . . , c_n are the vertices of the center line. Finally, the TS is obtained by thickening the center line:

e_i = c_i + β(t_i − c_i),  f_i = c_i + β(b_i − c_i),

where e_i and f_i are pixels that represent the expansion of the center line toward the two sidelines. The region e_i e_{i+1} f_{i+1} f_i forms a part of the TS, as shown in Figure 3(b). β is a parameter that controls the bold rate, and we set it to 0.2 experimentally. When these vertices are completely processed, the TS is generated correspondingly.
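The center-line averaging and the β-expansion above can be sketched directly with NumPy. This is a toy example with a horizontal three-point instance; `text_skeleton` is a hypothetical helper name:

```python
import numpy as np

def text_skeleton(top, bottom, beta=0.2):
    """Centre line and thickened Text Skeleton from sampled sideline vertices.

    top, bottom: (n, 2) arrays of vertices t_i / b_i sampled evenly from the
    two sidelines; beta is the bold rate (0.2 in the paper).
    Returns the centre vertices c_i and the expansion points e_i, f_i.
    """
    center = (top + bottom) / 2.0          # c_i = (t_i + b_i) / 2
    e = center + beta * (top - center)     # expand towards the top sideline
    f = center + beta * (bottom - center)  # expand towards the bottom sideline
    return center, e, f

top = np.array([[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]])
bot = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
c, e, f = text_skeleton(top, bot)
print(c[0], e[0], f[0])  # centre at y = 1, TS spanning y in [0.8, 1.2]
```

With β = 0.2 the skeleton occupies only a fifth of the instance height, which keeps adjacent instances from sticking together.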

Directional Pixel Regions.
Directional Pixel Regions (DPRs) are used to restore the original form of a text instance and include R_1, R_2, R_3, and R_4. Pixels that lie in the text instance but not in the TS are considered to fall into a DPR. As shown in Figure 3, the direction of every fraction is determined by the tangent angle between its corresponding center vertex (c_i) and the next one (c_{i+1}). More specifically, the tangent angle of two adjacent center vertices is calculated by

tan(Θ_i) = (y_{c_{i+1}} − y_{c_i}) / (x_{c_{i+1}} − x_{c_i}),

where x and y refer to the coordinates of the center vertices. By comparing tan(Θ_i) of the center vertices with tan(α), the regions t_i t_{i+1} e_{i+1} e_i and f_i f_{i+1} b_{i+1} b_i are labeled as DPRs. The pixels within the corresponding R_1 can be calculated as follows:

R_1(x, y) = 1, if |tan(Θ_i)| ≤ tan(α) and y_{t_i} ≤ y ≤ y_{c_i};  0, otherwise,

where condition 1 (the tangent comparison) is used to distinguish the angle of adjacent center vertices and condition 2 (the coordinate comparison) ensures that the selected pixels lie above the TS. α is a parameter that controls the boundary of the specific directional regions, which is discussed in detail in the experiment section. y_{t_i} and y_{c_i} are the vertical coordinates of the vertices (x, y) on the sideline and the center line, respectively. The generating process of R_2 is similar to that of R_1; the only difference is that the pixels are located below the TS, so condition 2 is reversed naturally:

R_2(x, y) = 1, if |tan(Θ_i)| ≤ tan(α) and y_{c_i} ≤ y ≤ y_{b_i};  0, otherwise,

where y_{b_i} and y_{b_{i+1}} are the vertical coordinates of the sampled vertices on the bottom sideline. R_3 and R_4 are generated in the same way:

R_3(x, y) = 1, if |tan(Θ_i)| > tan(α) and x_{t_i} ≤ x ≤ x_{c_i};  0, otherwise,
R_4(x, y) = 1, if |tan(Θ_i)| > tan(α) and x_{c_i} ≤ x ≤ x_{b_i};  0, otherwise,

where x_{t_i} and x_{c_i} are the horizontal coordinates of the vertices on the sideline and the center line, respectively.
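The direction assignment can be illustrated with a small sketch. This is a simplified, hypothetical stand-in for the labeling rules above: the tangent between adjacent centre vertices decides up/down (shallow tangent) versus left/right (steep tangent), and the pixel's position relative to the centre vertex picks the side; `region_label` is not a name from the paper:

```python
import math

def region_label(c_i, c_next, pixel, alpha_deg=30.0):
    """Assign a pixel outside the TS to a Directional Pixel Region (sketch)."""
    dx = c_next[0] - c_i[0]
    dy = c_next[1] - c_i[1]
    tan_theta = abs(dy / dx) if dx != 0 else float("inf")
    if tan_theta <= math.tan(math.radians(alpha_deg)):
        # shallow text direction: label as R1 (above) or R2 (below)
        return "R1" if pixel[1] > c_i[1] else "R2"
    # steep text direction: label as R3 (left) or R4 (right)
    return "R3" if pixel[0] < c_i[0] else "R4"

print(region_label((0, 0), (1, 0.1), (0, 1)))   # shallow tangent, above
print(region_label((0, 0), (0.1, 1), (-1, 0)))  # steep tangent, left
```

Note that α = 30° is the value the experiment section settles on; a full implementation would also restrict pixels to the band between the sideline and the TS.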

Confidence Scoring Mechanism.
To filter out false positives, a confidence score is utilized to weight every TS. If the score of a TS is lower than a threshold, then all components of this instance are discarded:

S = (1/n) Σ_{i=1}^{n} p_i,  TS = TP if S ≥ c; FP, otherwise,

where n is the total number of pixels in the TS, p_i is the value of the ith pixel in the TS region, and TP and FP refer to true positives and false positives, respectively. c is the threshold value used to filter out a TS with a lower confidence score, and we set it to 0.6 empirically. A TS with high confidence is retained and processed to form the final prediction together with its corresponding DPRs. In contrast, a TS classified as an FP is filtered out directly together with its components. The TS, as the central area of a text instance, contains the key features of the whole text, which are more reliable to score than the features of the entire instance.
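A minimal sketch of this filtering step, assuming the score is the mean per-pixel confidence over the TS as described above (`csm_filter` is a hypothetical helper name):

```python
def csm_filter(skeletons, threshold=0.6):
    """Keep only Text Skeletons whose mean pixel confidence passes c.

    skeletons: list of per-skeleton lists of pixel confidences p_i.
    Returns the indices of skeletons kept as true positives.
    """
    kept = []
    for idx, pixels in enumerate(skeletons):
        score = sum(pixels) / len(pixels)  # mean confidence over the TS
        if score >= threshold:
            kept.append(idx)               # TP: keep together with its DPRs
        # otherwise the TS and all its components are discarded (FP)
    return kept

print(csm_filter([[0.9, 0.8, 0.7], [0.3, 0.2, 0.4]]))  # -> [0]
```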

Loss Function.
The proposed method is trained with a loss function combining the three objectives:

L = λ_1 L_TS + L_DPR + L_CSM,

where L_DPR is a smooth L_1 [26] loss and L_TS and L_CSM are cross-entropy classification losses. The loss L_TS is computed as

L_TS = −Σ_i w_i [TS_i log(TS'_i) + (1 − TS_i) log(1 − TS'_i)],

where L_TS is a self-adjusting cross-entropy loss and w_i is a self-adjusting weight [9]. For the ith instance with area S_i, every positive pixel within it has a weight of w_i = B / S_i, where B is the average area of all text instances in one image. In that case, pixels in text instances with small areas have a bigger weight than pixels in big text areas. In our experiments, the weight λ_1 is set to 3, as the TS is more essential than the other components. The losses for the DPRs and the CSM are calculated as

L_DPR = Σ_{j=1}^{4} SmoothL1(R_j, R'_j),  L_CSM = CrossEntropy(CS, CS'),

where L_DPR is optimized by a smooth L_1 loss and the pixel losses in R_1, R_2, R_3, and R_4 are calculated separately, which means that one pixel can be simultaneously categorized into two regions (e.g., R_1 and R_3). L_CSM is a standard cross-entropy function. TS_i, R_j, and CS are the ground-truth labels, and TS'_i, R'_j, and CS' are the corresponding predicted values.
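To make the weighting concrete, here is a toy sketch of the TS and DPR terms in pure Python. The helper names are hypothetical, the CSM term is omitted for brevity, and λ_1 = 3 as above; the self-adjusting weights w_i = B / S_i would be computed from instance areas in practice:

```python
import math

def self_adjust_ce(pred, gt, weights):
    # weighted binary cross-entropy; w_i = B / S_i upweights small instances
    eps = 1e-7
    return sum(
        -w * (y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y, w in zip(pred, gt, weights)
    ) / len(pred)

def smooth_l1(pred, gt):
    # standard smooth L1, applied per direction map R1..R4
    total = 0.0
    for p, y in zip(pred, gt):
        d = abs(p - y)
        total += 0.5 * d * d if d < 1 else d - 0.5
    return total / len(pred)

# toy pixels: two TS predictions and two DPR regressions, w_i = 1.5 throughout
l_ts = self_adjust_ce([0.9, 0.2], [1, 0], [1.5, 1.5])
l_dpr = smooth_l1([0.8, 0.1], [1.0, 0.0])
loss = 3 * l_ts + l_dpr  # lambda_1 = 3 weights the TS term
print(loss)
```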

Postprocessing.
TextCohesion treats every text instance as a TS and four DPRs; hence, these components must be grouped to form the final prediction. The postprocessing is depicted in Algorithm 1: every TS represents a text instance, and after passing through the CSM, instances with higher confidence are reserved as candidates. Based on these candidates, the corresponding DPRs can be obtained. The postprocessing mainly includes three steps. (1) The TS is used to differentiate the different text instances. (2) For each TS, its outer pixels are used as initial points to iteratively search the corresponding pixels in the DPRs. (3) The TS is eventually merged with the corresponding searched regions to form the final prediction. The entire postprocessing is shown in Algorithm 1, where Neighbor(.) refers to a function that obtains the directional information of the adjacent pixels.
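The three steps above can be sketched as a breadth-first region growing. This is a simplified stand-in for Algorithm 1 that ignores the per-direction constraints of Neighbor(.); `reconstruct` is a hypothetical name:

```python
from collections import deque

def reconstruct(ts_mask, dpr_mask):
    """Grow a Text Skeleton into its full instance (Algorithm 1 sketch).

    ts_mask / dpr_mask: 2-D 0/1 grids. Starting from TS pixels, we
    iteratively absorb 4-connected DPR pixels.
    """
    h, w = len(ts_mask), len(ts_mask[0])
    result = [[v for v in row] for row in ts_mask]
    queue = deque((r, c) for r in range(h) for c in range(w) if ts_mask[r][c])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and dpr_mask[nr][nc] and not result[nr][nc]:
                result[nr][nc] = 1        # merge the DPR pixel into the instance
                queue.append((nr, nc))
    return result

ts = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
dpr = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
print(reconstruct(ts, dpr))
```

Because the search starts from the skeleton, DPR pixels that are not connected to any surviving TS are never absorbed, which is what discards the components of filtered-out instances.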

Experiment
To evaluate TextCohesion, we conduct extensive experiments on both oriented and curved benchmarks and give a detailed description of the datasets used for model training and inference, the experimental implementation, the results with comparisons, and the ablation study, respectively.

Datasets. SynthText [38] is a large-scale dataset that contains about 800K synthetic images created by blending natural images with text rendered in random fonts, sizes, colors, and orientations. These texts look realistic, as the overlaying follows carefully set up configurations and a well-designed learning algorithm.
ICDAR2015 [14] contains 1000 training and 500 test images captured by wearable cameras with relatively low resolutions. Each image includes several oriented texts annotated by four vertices of the quadrangles.
ICDAR 2017 MLT (IC17-MLT) [39] is a large-scale multilingual text dataset, which includes 7200 training images, 1800 validation images, and 9000 testing images. The dataset is composed of complete scene images in 9 languages. Similar to ICDAR 2015, the text regions in ICDAR 2017 MLT are annotated by the four vertices of a quadrangle. CTW1500 [40] is a challenging dataset for curved text detection, which is constructed by Yuliang et al. [18]. It consists of 1000 training images and 500 testing images. Different from traditional text datasets (e.g., ICDAR 2015 and ICDAR 2017 MLT), the text instances in SCUT-CTW1500 are labeled by a polygon with 14 points that can describe the shape of arbitrarily curved text.
Total-Text [41] is another word-level English curved text dataset, which is split into training and testing sets with 1255 and 300 images, respectively (Figure 4).

Implementation Details. Training. TextCohesion is optimized by SGD with backpropagation [42]. Momentum and weight decay are set to 0.9 and 5 × 10^-4, respectively. The learning rate is initialized to 10^-4 and decayed by 0.1 every 30 epochs. Following [11], all training images are augmented online: they are rotated and cropped with areas ranging from 0.24 to 1.69 and aspect ratios ranging from 0.33 to 3; after that, noise, blur, and lightness are randomly adjusted, and the images are finally resized to 512 × 512. We ensure that the text on the augmented images is still legible if it was legible before augmentation. TextCohesion is first pretrained on SynthText for 2 epochs and then fine-tuned on the other datasets. All implementations are deployed on a PC (CPU: Intel(R) Core(TM) i7-7800X @ 3.50 GHz; GPU: GTX 1080).

Inference. Images in the test stage are also resized to 512 × 512. The inference time of the proposed method is compared with that of other methods, e.g., DB [43]. The testing scale of the input image is 512 × 512 pixels, and the batch size is set to 1 during all comparison experiments. The main results are reported in Tables 1-4, where an acceptable inference time can be found.

Experiments on Curved Text Benchmarks.
To test the ability to detect arbitrarily shaped text, we evaluate our method on Total-Text and CTW1500, both of which contain curved instances. We report the performance on CTW1500 in Table 1, where the Precision (88.0%), Recall (84.6%), and F-measure (86.3%) achieved by TextCohesion significantly outperform those of other competitors. Remarkably, the Recall and F-measure surpass the second-best record by 4.7% and 2.7%, respectively. On Total-Text, our method achieves 88.1%, 81.4%, and 84.6% in Precision, Recall, and F-measure, respectively, outperforming the compared methods. Results on the oriented benchmarks are reported in Tables 3 and 4, where our method achieves F-measures of 89.1% and 73.1%, respectively. From these results, it can be observed that our method also achieves very competitive performance in dealing with oriented text. Meanwhile, thanks to the robust feature representation, TextCohesion can also locate small text instances under complex illumination and variable scales.

Ablation Study. (1) To further study the influence of the number of sampled points on precision, an ablation experiment is performed, as shown in Figure 5(a). Theoretically, the performance of the model should improve as the sampling precision increases.
In the experiment, we found that the performance of the model hardly improves further (around 85%) when the sampling number n is greater than 10; n is set to 40 in all experiments. (2) β is an important parameter used to control the ratio of the TS area to the DPR area. As shown in Figure 5(b), the network performs well when the value of β is within the range [0.1, 0.6]; in all experiments, β is set to 0.2. (3) α is used to delineate the top, bottom, left, and right regions. 30°, 45°, and 60° are the three specific angles used to investigate the influence of α. As shown in Table 1, the F-measure is relatively good when α is 30°, so we set α to 30° in all experiments.

Influence of the Confidence Scoring Mechanism.
The CSM is used to filter out false positives (e.g., symbols or characters that are similar to text). The influence of the CSM on the results of the model is shown in Table 5. The precision improves by 2.8% after the CSM (c = 0.6) is used. To test the robustness of the proposed model while changing the threshold c, a comparison experiment is reported in Table 5; the F-measure is relatively good when c is 0.6, so c is set to 0.6 in all experiments.

Conclusion and Outlook
In this paper, we propose a novel text detector, which achieves up to 86.3% F-measure on common text benchmarks, including those with irregularly shaped text instances. The text instance modeling method utilized in this detector can precisely detect text with arbitrary boundaries by splitting a text instance into four DPRs and a TS region. Moreover, a Confidence Scoring Mechanism is incorporated into the detector to filter out false positives, which further improves its detection precision. Experimental results show that the proposed text detector performs well in scene text detection. The proposed method has potential applications in the fields of photo translation, autonomous driving, and product identification.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.