Texts as Lines: Text Detection with Weak Supervision

Scene text detection methods based on deep learning have recently shown remarkable improvement. Most text detection methods train deep convolutional neural networks with full masks, which require pixel-level accuracy for good-quality training. Normally, a skilled engineer needs to drag tens of points to create a full mask for a curved text. Therefore, data labelling based on full masks is time consuming and laborious, particularly for curved texts. To reduce the labelling cost, a weakly supervised method is first proposed in this paper. Unlike other detectors (e.g., PSENet or TextSnake) that use full masks, our method only needs coarse masks for training. More specifically, the coarse mask for one text instance is a line across the text region. Compared with full mask labelling, labelling with the proposed method saves labelling time but loses much annotation information. To compensate, a network pretrained on synthetic data with full masks is used to enhance the coarse masks in real images. Finally, the enhanced masks are fed back to train our network. Experiments show that the performance of our method is close to that of fully supervised methods on ICDAR2015, CTW1500, Total-Text, and MSRA-TD500.


Introduction
At present, natural scene text detection is attracting increasing attention due to its practical applications, such as scene understanding, visual question answering, autonomous driving, text detection [1], and recognition [2,3]. Text is one of the most fundamental carriers of semantics and appears everywhere in daily life, for example, in traffic signs, commodity packages, and advertising posters. These text instances in the real world have varying sizes, random directions, and arbitrary shapes, making them extremely challenging to label and capture accurately. Unlike other general objects, scene text usually cannot be described accurately by an axis-aligned rectangle, and most detectors using an axis-aligned rectangle only reach an F-measure below 65% (see, e.g., TextSnake [4]). Recently, most scene text detectors based on deep learning have tended to detect texts of different shapes with many coordinates for better performance. However, these detectors require accurate pixel-level labels, which are expensive. The labelling consumes a large amount of manpower and financial resources, especially for texts with arbitrary shapes in complex environments. The precision of text detection is closely connected with the labelling methods of the datasets. For example, several common datasets, ICDAR2013 [5], ICDAR2015 [6], ICDAR2017 [7], Total-Text [8], CTW1500 [9], and MSRA-TD500 [10], have different labelling methods for various texts. ICDAR2013 was introduced during the ICDAR Robust Reading Competition in 2013 and mainly includes horizontal bounding boxes defined by two points at the word level. Because of this labelling peculiarity, text detectors [11,12] using box regression perform well on ICDAR2013. ICDAR2015 was released in the ICDAR2015 Robust Reading Competition for multioriented text detection, using quadrilateral boxes as the annotations, as shown in Figure 1(b).
EAST [13] and SPCNet [14], as representative detectors, achieved good results on ICDAR2015. ICDAR2017 is a dataset with texts in nine languages for multilingual scene text detection and, like ICDAR2015, uses quadrilateral boxes as the annotations. MSRA-TD500 was released in 2012, and its annotation method is the same as that of ICDAR2015. Unlike the above datasets, Total-Text and CTW1500 contain many curved texts and aim to solve the arbitrarily shaped text detection problem. CTW1500 has more than 10k text annotations and at least one curved text per image. Total-Text contains many curved and multioriented texts, which require tens of points for accurate labelling. Recently, segmentation-based text detectors [4,15,16] have shown promising performance on existing datasets with high-cost labelling. As the annotation design becomes more complicated to fit the requirements of text detection in the real world, the cost also increases. The bounding box-based labelling method has low labelling costs but cannot fit text instances in the wild accurately, as shown in Figure 1. The pixel-based labelling method matches texts with arbitrary shapes in complex environments but requires high labelling costs, as shown in Figure 1(c). To mitigate this conflict, we explore detecting texts at the pixel level but with a low labelling cost. Drawing the text region precisely is difficult, but using a line across the text to locate it is simple. Therefore, we simplify the complex text labelling to a single line, named the text line in this work. Compared with boxes or full masks, this annotation is extremely simple and contains less pixel information, as shown in Figure 1.
Hence, the following two difficulties must be considered:
(i) A weak text line label loses the text edge information and nearly all of the background information, which is rather problematic for supervised training.
(ii) The loss function focuses only on the labelled area and is not sensitive to the unlabelled ground truth.
To solve the above difficulties, a scene text detector based on weakly supervised learning is proposed in this paper. The model is first pretrained on SynthText to make it sensitive to the text region. Subsequently, during training on real data, the pretrained model is used to enhance the text line label. In addition, to enhance the weak label better, a soft label ∈ [0, 1] containing pixel location (distance) information is used. The contributions of this work are summarized as follows:
(i) We first propose a scene text detector based on weakly supervised learning that significantly simplifies the annotation process without losing much precision.
(ii) A modified crossentropy loss function named degree crossentropy is proposed. The loss function can optimize the soft label containing distance information.

Related Work
Scene text detection has received significant attention over the past few years, and numerous deep learning-based methods [17-21] have achieved great progress. An increasing number of detectors tend to capture texts at the pixel level to detect texts more precisely.

Bounding Box-Level Text Detection Methods.
Bounding box regression-based methods [19,22] are inspired by general object detection methods such as SSD [23] and Faster R-CNN [24]. TextBoxes++ [25] further regresses to quadrangles instead of horizontal bounding boxes for multioriented text detection. RRD [26] uses rotation-invariant and sensitive features from two separate branches for better long text detection. DSRN [2] maps multiscale convolution features onto a scale invariant space and obtains uniform activation of multisize text instances for detecting texts. Although regression-based methods have achieved state-of-the-art performance, it is still difficult to capture all text information in a bounding box without involving a large proportion of background and even other text instances.

Pixel-Level Text Detection Methods.
Pixel-level text detectors draw inspiration from FCN [23] and Mask R-CNN [27]. Using the mask as the annotation, PixelLink [28] performs text/nontext and link prediction at the pixel level. TextSnake [4] learns to predict local attributes, including the text centre line, text region, radius, and orientation, achieving improvements of up to 20% on curved benchmarks. CRAFT [15] trains a convolutional neural network that produces a character region score and an affinity score. PSENet [16] projects the feature map into several branches to produce multiple segmentation maps. TextField [29] detects scene text by predicting a direction field pointing away from the nearest text boundary to each text point. TextMountain [30] predicts the text centre-border probability and text centre-direction to detect scene text.
Text detectors based on instance segmentation perform better with higher precision annotation.

Weak Supervision Semantic Segmentation.
Sun et al. [31] leveraged the power of deep semantic segmentation CNNs while avoiding the need for expensive annotations for training. RTFNet [32] took advantage of thermal images and fused both the RGB and thermal information in a novel deep neural network. Tang et al. [33] proposed a normalized cut loss for semisupervised learning; the loss combines partial crossentropy on labelled pixels and normalized cut for unlabelled pixels. Wang et al. [1] proposed a self-supervised approach and developed a pipeline to automatically label drivable areas and road anomalies using RGB-D images.

Weak Supervision Text Detection Methods.
WeText [34] trains scene text detection models on a small number of character-level annotated text images, followed by boosting the performance with a much larger number of weakly annotated images at the word/text line level. WordSup [35] trains a character detector by exploiting word annotations in rich large-scale real scene text datasets.
Recently, most detectors have been trained with fully annotated masks, requiring pixel-level accuracy for good-quality prediction. Motivated by weakly supervised semantic segmentation [34, 36-38], we propose a weakly supervised scene text detector to reduce the labelling cost without sacrificing much precision.

Method
In this section, we first introduce the overall pipeline of the proposed network. Second, the label and the procedure for enhancing the text line are described in detail. Furthermore, the designed loss function for weakly supervised learning is introduced. Finally, we list the simple postprocessing mechanism.
3.1. Overview. Figure 2 shows the overall pipeline of the proposed method, which is divided into three steps: (1) pretraining the model on a synthetic dataset [17], (2) enhancing the labels on a real dataset, and (3) training with the enhanced labels. In the first step, the model is pretrained on a synthetic dataset with full masks to make it sensitive to the text region. In the second step, the pretrained model outputs an activation map of a real image as a supplement to the weakly annotated label (i.e., text line). In the final step, the enhanced label is fed back to optimize the network parameters. The output of the model in the final step forms the final prediction result through a contour search.

Labelling and Label Enhancement
3.2.1. Text Line. In this paper, we define the text line as a line across the text region, as shown in Figure 3. All characters within this text region should be connected by a continuous line (e.g., TL-1 to TL-5). There are no width or curvature requirements for these text lines. However, improper annotations such as TL-6 will result in an obvious decline in text detection accuracy. The BG in Figure 3 represents the background annotation, which has no requirements on the geometric parameters (e.g., shape, width, length, and curvature) of the line. Together, the TL and the BG constitute the original annotation.

Soft Label.
The soft label containing the distance (location) information is used in our method. The shortest distance between each text pixel and the background is calculated. Then, we map these distance values to [0, 1] as the soft label. Pixels concentrated in the centre of the text instance are given a strong (high) value that tends to 1, whereas the estimated edge area is assigned a weak (low) value that tends to 0. As shown in Figure 2 (activation map), a distance-mountain-like activation map is predicted by the model pretrained on SynthText. The shape of the soft label is the same as this distance-mountain shape. The value $P_i$ of the label is calculated using the following equation:

$$P_i = \frac{D_i}{D_{\max}}$$

Mathematical Problems in Engineering
where $D_i$ is the shortest distance between text pixel $i$ and the background pixels, and $D_{\max}$ is the maximum of all $D_i$ in the same text instance.

Label Enhancement.
As shown in Figure 2, label enhancement is an important step in the overall pipeline. The detailed processing is as follows: the network is first pretrained on SynthText for one epoch with full masks, making it sensitive to text areas. The activation maps of real images are then generated using this pretrained model. Next, we extract the text skeleton from the given weakly supervised label. Finally, the intersection of the text activation region and the text skeleton is expanded to obtain more annotation information.
The only purpose of label enhancement is to use the text line to locate the correct text region in the activation map of the real image and to obtain more supervision information. Enhanced labels only apply to the positive part (i.e., text line), while background annotations are excluded. Figure 2 (right) describes the combination of the text skeleton and activation maps. We first use the text skeleton to locate the corresponding text activation region in the activation map and then seek the corresponding text edge region through continuous dilation of the intersection of the text activation region and the text skeleton. During this search, a pixel is considered an edge pixel if its value approaches 0. Finally, the values of the pixels deemed edge pixels are used as a supplement to enhance the original annotation (i.e., text line). Note that the values in the activation map are not common binary probabilities (i.e., text/nontext prediction) but represent location (distance) values. Therefore, we can use the value of each pixel in the text region to determine its relative distance from the background.
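The soft label and the enhancement step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the city-block distance (the text only says "shortest distance"), the 4-neighbour dilation, the `edge_eps` threshold standing in for "approaches 0", and all function names are assumptions made for illustration. A practical pipeline would more likely rely on OpenCV's distance transform and dilation.

```python
from collections import deque

import numpy as np

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def distance_to_background(mask):
    """City-block distance from every pixel to the nearest background (0) pixel."""
    h, w = mask.shape
    dist = np.full((h, w), -1, dtype=np.int32)
    queue = deque()
    for y, x in zip(*np.nonzero(mask == 0)):  # every background pixel is a BFS source
        dist[y, x] = 0
        queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBOURS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny, nx] < 0:
                dist[ny, nx] = dist[y, x] + 1
                queue.append((ny, nx))
    return dist


def soft_label(mask):
    """P_i = D_i / D_max; D_max is taken over the whole mask (one instance assumed)."""
    dist = distance_to_background(mask).astype(np.float32)
    d_max = dist.max()
    return dist / d_max if d_max > 0 else np.zeros_like(dist)


def dilate(region):
    """One step of 4-neighbour binary dilation on a boolean array."""
    out = region.copy()
    out[1:, :] |= region[:-1, :]
    out[:-1, :] |= region[1:, :]
    out[:, 1:] |= region[:, :-1]
    out[:, :-1] |= region[:, 1:]
    return out


def enhance_label(activation, text_line, edge_eps=0.05, max_steps=100):
    """Grow the text line inside the activation map until values approach 0.

    `edge_eps` is a hypothetical threshold for "the pixel value approaches 0".
    Returns a boolean mask of the text region located by the text line.
    """
    inside = activation > edge_eps
    region = text_line.astype(bool) & inside  # skeleton ∩ activation region
    for _ in range(max_steps):
        grown = dilate(region) & inside
        if np.array_equal(grown, region):  # nothing left to add: edge reached
            break
        region = grown
    return region
```

Because the growth is seeded by the annotated line, activation blobs that no text line crosses are never added to the enhanced label, which matches the statement that enhancement only acts on the positive part of the annotation.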

Network Design.
We chose VGG16 [39] as our feature extractor for a fair comparison with other methods. The images are first downsampled to multilevel features with five convolution blocks, and five feature maps (i.e., $P_1, P_2, P_3, P_4, P_5$) are generated in this step. Then, the features are gradually upsampled to the original size and mixed with the corresponding output of the previous convolution block, where "‖" refers to feature concatenation and $U_p$ is the upsampling function, which feeds the feature map into Conv(1, 1)-Conv(3, 3)-Deconv-ReLU layers. The function $U$ differs from $U_p$ in that it omits the ReLU layer and reduces the channel number to 1 for the output. Finally, the output obtained through the sigmoid function is used to calculate the loss of the prediction. In addition to VGG16, other backbones (e.g., ResNet) were also adopted in a comparative study in Section 4.6 (Ablation Study).

Loss Function.
The prediction is a two-dimensional feature map, and we map its values to [0, 1] using the sigmoid function. Within a text instance, these values are not the confidences of each pixel but represent the degree of the shortest distance between each pixel and the background. The common binary crossentropy loss function is

$$L_{CE} = -\left[t \log y + (1 - t) \log (1 - y)\right]$$

where $t$ is the ground truth and $y$ is the prediction. The common crossentropy is used to evaluate the confidence of a certain category but cannot calculate a loss value with a specific meaning (e.g., our distance values).
In that case, we seek to optimize the loss containing distance values with the L1 loss $|f(x) - Y|$ or the L2 loss $|f(x) - Y|^2$. However, we find that L1 and L2 are not sensitive to the distance distribution within [0, 1]. For instance, the L1 loss between a ground truth of 0.5 and a prediction of 0.55 is too small and not conducive to backpropagation.
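The numerical point made above can be checked with a few lines of plain Python. The snippet evaluates the L1 and L2 losses for the example in the text (ground truth 0.5, prediction 0.55) and, for comparison, the standard binary crossentropy with a soft target. The function name `bce` is illustrative; the snippet says nothing about the exact form of the degree crossentropy proposed in the paper.

```python
import math


def bce(t, y):
    """Standard binary crossentropy for target t in [0, 1] and prediction y in (0, 1)."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))


t, y = 0.5, 0.55
l1 = abs(y - t)      # about 0.05
l2 = (y - t) ** 2    # about 0.0025

print(f"L1 = {l1:.4f}, L2 = {l2:.4f}")
print(f"BCE at y = t: {bce(t, t):.4f}, BCE at y = 0.55: {bce(t, y):.4f}")
```

One well-known property is visible here: with a soft target, the crossentropy is minimized at `y = t` but its minimum is the entropy of `t` (ln 2 for `t = 0.5`) rather than zero, so the raw value is harder to interpret than in the binary case.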
To solve the above difficulty, the degree crossentropy is proposed. The degree crossentropy can not only evaluate the confidence of a category but also deal with the distance information. In the loss for the positive and negative pixels, $L_{CE}(x, y)$ is the traditional crossentropy loss of pixel $(x, y)$ and $GT(x, y)$ is the corresponding ground truth of pixel $(x, y)$. Since the enhanced label may not be accurate, we treat the given label and the postenhanced supplements separately: $U_{c/e}$ is a discriminatory mechanism that calculates the losses of the original label and the postenhanced part, respectively. $L_{DCE}(x, y)$ is the degree crossentropy loss, in which $\mathrm{Pred}_p$ is the predicted result after the sigmoid function and GT is the ground truth. The loss between the prediction and any target ∈ [0, 1] is calculated, which lets us deal with the distance degree information of the text. In the implementation of $U_{c/e}$, $P(x, y)$ refers to pixel $(x, y)$ in the entire prediction map, $TL(x, y)$ and $DP(x, y)$ represent the annotated pixels and postenhanced pixels, respectively, and $G$ is the set of pixels with a difference of more than $\rho$ between the ground truth and the prediction. The postenhanced annotation from the pretrained model may not be accurate, and noise may exist. Several situations arise in label enhancement; for instance, background pixels may be treated as text pixels and become positive annotations. The causes are the annotation differences among the datasets and the unreliability of the prediction. The discriminatory mechanism $U_{c/e}$ lets our network learn from such noisy or wrong labels: the network performs strongly supervised learning on labelled pixels and distribution-supervised learning on postenhanced pixels.
More specifically, the predicted pixel values gradually decrease from the text centre to the edge without fitting the value of the label exactly. The difference between the enhanced annotation and the predicted result is considered reasonable if it is smaller than $\rho$. The value of $\rho$ is set to 0.1 in all the experiments. Therefore, the fault tolerance of $U_{c/e}$ enhances the robustness of the model and avoids some mistakes from the postenhanced annotation.
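One possible reading of the tolerance mechanism is sketched below in NumPy. It is an assumption-laden illustration, not the paper's $L_{DCE}$ or $U_{c/e}$: the per-pixel loss is taken to be an ordinary soft-target crossentropy, labelled pixels always contribute, and postenhanced pixels contribute only when they fall in the set $G$ (difference larger than $\rho$). The names `soft_bce` and `tolerant_loss` are hypothetical.

```python
import numpy as np


def soft_bce(pred, target, eps=1e-7):
    """Per-pixel crossentropy against a soft target in [0, 1] (assumed loss form)."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))


def tolerant_loss(pred, target, labelled, enhanced, rho=0.1):
    """Average loss over labelled pixels and out-of-tolerance enhanced pixels.

    labelled: boolean mask of originally annotated pixels (text line and background).
    enhanced: boolean mask of pixels added by label enhancement.
    rho:      tolerance; the source sets it to 0.1 in all experiments.
    """
    per_pixel = soft_bce(pred, target)
    outside_tol = np.abs(pred - target) > rho   # the set G in the text
    keep = labelled | (enhanced & outside_tol)
    return per_pixel[keep].sum() / max(int(keep.sum()), 1)
```

Under this reading, an enhanced pixel whose prediction already lies within $\rho$ of its possibly noisy target produces no gradient, which is one way to obtain the fault tolerance described above.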

3.5. Postprocessing. Most segmentation-based methods share a common difficulty: separating text instances that are close to each other is challenging. To solve this problem, we propose the apex-edge expansion algorithm, which makes full use of the text-mountain shape. Given the prediction result, each text instance appears as a text mountain, as shown in Figure 4(a), where the text centre line region is the peak and the pixel values tend to 1. The text edge areas resemble the foot of the mountain, and their values are mostly close to zero. Figure 4 presents an example, and the detailed procedure of the apex-edge expansion algorithm is shown in Figures 4(b) and 4(c). The postprocessing mainly includes three parts.
(1) The peak of each text mountain is selected to differentiate the text instances. A pixel block in which every inner pixel value approaches 1 is a peak.
(2) The dilate function in OpenCV is used to expand the peak region continuously until it reaches the mountain foot or meets other text areas. The expansion process is divided into many steps $S_1, S_2, \ldots, S_n$, where $S_i$ ($i \in [1, n]$) represents the entire expansion area in the $i$th step. $S_i - S_{i-1}$ ($i \in [2, n]$) is called the extended area between two adjacent steps. The expansion ends when the average score of the extended area approaches 0 or starts to increase. An average score approaching 0 means that the expansion area is close to the background, and an increasing score means that the expansion area begins to cover other text instances.
(3) After the expansion, the contour of the whole text instance is represented by many coordinates as the final prediction result.
The entire postprocessing is shown in Algorithm 1, where $S_n$ represents the prediction result and the output $D_n$ is the set of text instances. Dilation(.) is the dilate operation in OpenCV. The value and size of the expansion kernel in Dilation(.) can be changed to realize expansions in different directions and at different scales. Mean(.) is used to calculate the average value of a matrix. The complement symbol represents complementing the set. ⟶ and Δ refer to tending to a number and the value increasing, respectively.
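The three steps above can be sketched in NumPy as follows. This is a simplified illustration rather than Algorithm 1 itself: peaks are taken as 4-connected components above a threshold (0.6 is the value the experiments report), dilation uses a fixed 4-neighbour kernel instead of OpenCV's configurable `dilate`, and `foot_eps` is a hypothetical stand-in for "the average score approaches 0".

```python
from collections import deque

import numpy as np


def label_components(mask):
    """4-connected component labelling; returns (label map, number of components)."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        count += 1
        labels[sy, sx] = count
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count


def dilate(region):
    """One step of 4-neighbour binary dilation on a boolean array."""
    out = region.copy()
    out[1:, :] |= region[:-1, :]
    out[:-1, :] |= region[1:, :]
    out[:, 1:] |= region[:, :-1]
    out[:, :-1] |= region[:, 1:]
    return out


def apex_edge_expansion(pred, peak_thr=0.6, foot_eps=0.05, max_steps=100):
    """Grow each peak until the new ring's mean score nears 0 or starts to rise."""
    labels, count = label_components(pred >= peak_thr)       # step (1): peaks
    regions = [labels == k for k in range(1, count + 1)]
    active = [True] * count
    prev_score = [np.inf] * count
    for _ in range(max_steps):
        if not any(active):
            break
        for k in range(count):
            if not active[k]:
                continue
            taken = np.zeros_like(regions[k])
            for j in range(count):
                if j != k:
                    taken |= regions[j]
            # step (2): extended area S_i - S_{i-1}, excluding other instances
            ring = dilate(regions[k]) & ~regions[k] & ~taken
            if not ring.any():
                active[k] = False
                continue
            score = pred[ring].mean()
            if score <= foot_eps or score > prev_score[k]:
                active[k] = False                             # foot reached or score rising
                continue
            regions[k] |= ring
            prev_score[k] = score
    return regions                                            # step (3) would trace contours
```

Each returned boolean mask corresponds to one text instance; extracting its contour (for example with OpenCV's `findContours`, as the experiments do) is left out of the sketch.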

Experiments
In this section, we evaluate our approach on ICDAR2015, Total-Text, MSRA-TD500, and CTW1500. The experimental results demonstrate that the performance of the proposed method is comparable to that of the other methods.

Datasets.
The datasets used for testing our method are briefly introduced below.

SynthText is a large-scale dataset that contains approximately 800 K synthetic images. These images were created by blending natural images with text rendered with random fonts, sizes, colours, and orientations. We used this dataset to pretrain our model.

ICDAR2015 is a multioriented text detection dataset for English text that includes only 1,000 training images and 500 testing images. The text regions are annotated by the four vertices of a quadrilateral.

MSRA-TD500 contains 500 natural images. The indoor images are mainly signs, doorplates, and caution plates, while the outdoor images are mostly guide boards and billboards in complex backgrounds.

Total-Text is a word-level English curved text dataset that is split into training and testing sets with 1,255 and 300 images, respectively. The text in these images includes more than three different text orientations: horizontal, multioriented, and curved.

SCUT-CTW1500 contains 1,000 training images and 500 test images, which contain multioriented text, curved text, and irregularly shaped text. Text regions in this dataset are labelled with 14 scene text boundary points at the sentence level.

Data labelling to test our method: we manually marked Total-Text, CTW1500, and TD500. As shown in Figure 5, the annotation method is brief and inexpensive. For ICDAR2015, the official label was used to fit the text line label for a further verification experiment. The fitting method is simple: the text skeleton is extracted directly from the full label and used as the text line. All annotations will be released.

Training.
The network was pretrained on SynthText for one epoch and fine-tuned on the other datasets. We adopted the Adam optimizer. During the pretraining phase, the learning rate was fixed at 0.001. During the fine-tuning stage, the learning rate was initially set to 0.0001 and decayed at a rate of 0.94 every 10,000 iterations. All of the experiments were conducted on a regular workstation (CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz; GPU: GTX 1080). The model was trained with a batch size of 4 on one GPU.
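The fine-tuning schedule stated above can be written as a small helper. The sketch assumes a staircase decay (the rate drops once every 10,000 iterations); the text does not say whether the decay is stepwise or continuous, and the function name is illustrative.

```python
def fine_tune_lr(step, base_lr=1e-4, decay=0.94, decay_every=10_000):
    """Learning rate at a given iteration, assuming staircase exponential decay."""
    return base_lr * decay ** (step // decay_every)


for step in (0, 9_999, 10_000, 25_000):
    print(step, fine_tune_lr(step))
```

With these settings the rate stays at 0.0001 for the first 10,000 iterations and is multiplied by 0.94 at each subsequent 10,000-iteration boundary.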
VGG16 was adopted as the backbone network for the comparative experiments. All of the experiments use the same training strategy: (1) enhancing the text annotation information with the model pretrained on SynthText and (2) fine-tuning the pretrained model on the target dataset with the enhanced annotations. To ensure the same condition in the comparative experiments, all of the models used in label enhancement were the same model pretrained on SynthText for one epoch.

Data Augmentation.
The images were randomly rotated, cropped, and mirrored, each with a probability of 0.4. Then, colour and lightness were randomly adjusted. Finally, the images were uniformly resized to 512 × 512.
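A practical detail that the description above implies is that geometric augmentations must be applied identically to the image and its label map. The NumPy sketch below shows only rotation and mirroring, each with probability 0.4; restricting rotation to multiples of 90 degrees is an assumption (the text does not state the angle range), and cropping, colour jitter, and the final resize are omitted. The function name `augment_pair` is hypothetical.

```python
import numpy as np


def augment_pair(image, label, rng, p=0.4):
    """Apply the same random rotation and mirroring to an image and its label map."""
    if rng.random() < p:
        k = int(rng.integers(1, 4))              # 90, 180, or 270 degrees (assumed)
        image, label = np.rot90(image, k), np.rot90(label, k)
    if rng.random() < p:
        image, label = image[:, ::-1], label[:, ::-1]   # horizontal mirror
    return image, label


rng = np.random.default_rng(0)
image = np.arange(12).reshape(3, 4)
label = image.copy()                              # stand-in label aligned with the image
aug_image, aug_label = augment_pair(image, label, rng)
```

Passing a seeded `numpy.random.Generator` keeps the augmentation reproducible across runs.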

Postprocessing.
We obtained all of the text instances with the apex-edge expansion and then used findContours in OpenCV to obtain a set of edge coordinates for each text instance. Finally, the text instances of the regular text datasets (i.e., MSRA-TD500) were described by four coordinate points. Methods such as minAreaRect in OpenCV were applied to obtain the bounding boxes of text instances. For curved text datasets, we used a set of coordinate points to describe the text instance (Tables 1  and 2).

Detecting Curve Text.
The CTW1500 and Total-Text datasets were used to test the ability to detect curved text.
In the experiments, manual text line annotation was used for training. The model pretrained for one epoch on SynthText served two purposes: enhancing the annotation information and providing the starting point for fine-tuning on the other datasets. Training started from the pretrained model and achieved the best result between 20 and 40 epochs. The F-measure fluctuated by approximately 5% when the peak threshold varied within [0.5, 0.8]. For the comparative experiments, the peak threshold in the apex-edge expansion algorithm was set to 0.6 for CTW1500 and Total-Text. We continued to expand the peak region until the average score of the extended area approached 0 or the region met another text instance. The F-measure of our method with the text line was 77.6% on Total-Text, while the F-measure of our method with full masks was 81.1%, as shown in Table 3. The performance with full masks was close to that of the newest method. The difference (3.5%) shows that using the text line can still achieve good results despite the much poorer annotation. The recall (76.7%) was close to the values obtained for the other methods. On CTW1500, our method achieved an F-measure of 82.3%, which is very close to the results obtained by the other strongly supervised methods. The difference (1.9%) between the F-measure of using the text line and that of using the full mask was also acceptable.
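For reference, the F-measure used throughout these comparisons is the harmonic mean of precision and recall. The short helper below makes the relationship explicit; the precision and recall values in the usage line are hypothetical and are not results from this paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, for illustration only.
print(f"{f_measure(0.80, 0.75):.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high F-measure by trading recall for precision or the reverse.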

Detecting Long Text.
TD500 contains many long text scenes and is therefore an excellent dataset for verifying the robustness of the network in long text cases. In the experiment, the text line annotation was enhanced by the model pretrained on SynthText. The pretrained model was also used for fine-tuning on TD500. The peak threshold in the apex-edge expansion algorithm was set to 0.6, the same value as in the experiments on CTW1500 and Total-Text. Table 4 compares the proposed method with state-of-the-art methods on TD500. The proposed method achieved an F-measure of 77.2%, which is competitive with other state-of-the-art detectors trained in a strongly supervised way.

Detecting Oriented Text.
All of the parameter settings and training details for ICDAR2015 were the same as those [...] Table 5 that used extra datasets. For instance, the F-measure of PSENet [16] was 80.5% without an extra dataset. The F-measure (79.4%) of our method was already comparatively close to those of the other methods.

Table 1: Experimental results for Total-Text. "PT" refers to the model pretrained for one epoch on SynthText. "Ext." indicates external data. "FM" refers to the model trained with full masks on Total-Text. All listed results were obtained in a strongly supervised manner.

Ablation Study.
Three groups of comparative experiments were performed to verify the effectiveness of our method.

Baseline.
The baseline was trained with the text line without label enhancement, and the F-measure of the baseline on Total-Text was 65.0%, as shown in Table 3.

Label Enhancement.
The results in Table 3 are further analysed for label enhancement on Total-Text. Training with an unenhanced text line shows unsatisfactory performance (65.0%), while training with a full mask obtains an F-measure of 81.1%. The large difference (16.1%) indicates that the text line loses important supervision information. After introducing the model pretrained on SynthText to enhance the text line, the performance improved markedly, from 65.0% to 77.2%. In addition, using the synthetic text line derived from the full mask shows better performance (77.6%). The main reason is that the manual text line has a larger error in extracting the text skeleton than the synthetic text line. We also compared the performance of the model pretrained on different datasets: synthetic data (i.e., SynthText) and realistic data (i.e., SCUT-CTW1500). The F-measures using SynthText and CTW1500 were 77.2% and 79.1%, respectively. The model pretrained with realistic data thus shows a modest advantage.
This also indicates an intrinsic limitation of this method: its dependence on the pretrained model.

Geometric Parameters of the Text Line.
As shown in Table 6, the impact of the width and the offset of the text line was evaluated. For the width of the text line, we used synthetic or manually marked text lines of different widths to test our model. For the manually marked text line, we extracted its skeleton of one-pixel width and dilated the skeleton to different widths, keeping the width below that of the original text line. For the synthetic text line, the skeleton of one-pixel width was extracted from the full mask and used to create different widths. At the same text line width, using the synthetic text line usually achieved better performance than using the manual text line, and the average difference was approximately 0.4%. In addition, with increasing width, the F-measure fluctuated by approximately 1%. The offset of the text line was set to 0 in all experiments that evaluated the influence of the text line width.
Apart from the influence of width, the offset between the synthetic text line and the centre line of the text instance was also varied to test our detection method. The offset in Table 6 refers to the offset error ratio $D_o/D_t$, where $D_o$ is the distance between the text line and the text centre line and $D_t$ is the width of the text region. We only performed this experiment on the synthetic text line because the offset between the manual text line and the text centre line was difficult to calculate. The text centre line was calculated from the original coordinate annotation, and then we created the text line by setting the corresponding offset ratio. The curvature and width of the created text line were the same as those of the text centre line. All widths of the text line or text centre line were one pixel in the experiment. When the offset ratio of the text line was below 20%, the F-measure barely fluctuated. When the offset ratio exceeded 20%, the performance of the model started to be affected slightly, but the fluctuation of around 2% was still acceptable.

Table 4: Experimental results for MSRA-TD500. "PT" refers to the model pretrained on SynthText. "Ext." indicates external data. All compared methods were trained in a strongly supervised way. "FM" refers to the model trained with a full mask for strongly supervised learning.

Backbone.
As shown in Table 7, a series of experiments comparing different backbones were performed to evaluate their influence on the proposed method. Similar to VGG16, five feature maps generated from VGG11 were gradually upsampled to the original size. For the ResNet series, four feature maps were merged. The F-measure using VGG11 was similar to that of using VGG16, but the latter had a slightly slower inference time. Due to their sophisticated design, the ResNet series had a longer convergence time, but their performance was comparatively accurate and stable.

4.6.5. Loss Function. As shown in Figure 6(a), due to the instability of the enhanced annotation, the F-measure decreased after dozens of epochs on four common datasets, particularly for curved text datasets. As shown in Figure 6(b), training with the text line was unstable relative to the method with full labelling, and the model with full labelling showed better convergence with an increasing number of training epochs. After incorporating $U_{c/e}$ into the loss function, the model with the text line showed improved convergence, with a convergence fluctuation of approximately 3%.

Conclusion and Future Work
In this paper, we first introduced a novel text detector based on weakly supervised learning. The most prominent features of the method are a novel labelling scheme, named the text line, and the full use of a model pretrained on SynthText. The text line helps the detector decrease the cost of labelling, and the pretrained model improves the performance of the detector. The experiments showed that the text line with low-cost labelling can be used to train an effective text detector and further verified the feasibility of using a synthetic text dataset to enhance weak labels. Efficient low-cost text detectors have potential applications in the field of photo translation. Synthetic data will play an increasingly important role in the field of deep learning in the future. One reason is that the high cost of annotation hinders the application of algorithms in real scenes. Another reason is that synthetic data are increasingly similar to real-world images, and the development of auxiliary methods promotes the development of synthetic text. In future work, it will be important to train methods with synthetic data but apply them to the real world.

Data Availability
The data are publicly available at https://github.com/xingjici/Texts-as-Lines-Text-Detection-with-Weak-Supervision, and the corresponding code is still being cleaned up. A description of the data can be found in the Abstract.

Conflicts of Interest
The authors declare that they have no conflicts of interest.