TriangleNet: Edge Prior Augmented Network for Semantic Segmentation through Cross-Task Consistency

This paper addresses the task of semantic segmentation in computer vision, aiming at precise pixel-wise classification. We investigate the joint training of models for semantic edge detection and semantic segmentation, which has shown promise. However, implicit cross-task consistency learning in multi-task networks is limited. To address this, we propose a novel "decoupled cross-task consistency loss" that explicitly enhances cross-task consistency. Our semantic segmentation network, TriangleNet, achieves a substantial 2.88% improvement over the Baseline in mean Intersection over Union (mIoU) on the Cityscapes test set. Notably, TriangleNet operates at 77.4% mIoU / 46.2 FPS on Cityscapes, showcasing real-time inference capabilities at full resolution. With multi-scale inference, performance is further enhanced to 77.8%. Furthermore, TriangleNet consistently outperforms the Baseline on the FloodNet dataset, demonstrating its robust generalization capabilities. The proposed method underscores the significance of multi-task learning and explicit cross-task consistency enhancement for advancing semantic segmentation, and highlights the potential of multitasking in real-time semantic segmentation.


Introduction
The combination of image semantic segmentation and deep learning has a long history, producing a large number of excellent works such as FCN [1], U-Net [2], FastFCN [3], Gated-SCNN [4], the DeepLab series [5,6,7], Mask R-CNN [8], and so on, while also leaving problems unsolved. The main challenge is the fine-grained localization of pixel labels [9]. The prevailing structure of semantic segmentation networks mostly follows the encoder-decoder structure adopted by FCN [1]: downsampling first expands the receptive field to extract high-level semantics, and upsampling then recovers low-level details. The edge details lost by conventional downsampling operations in semantic segmentation networks are difficult to recover during upsampling. A compensatory solution is to introduce additional knowledge, among which edge priors are intuitive and easily accessible. One way to inject edge priors into semantic segmentation networks is to train a semantic edge detection model and a semantic segmentation model jointly. The general practice is a two-stream framework that trains a semantic edge detection branch and a semantic segmentation branch in a hard parameter-sharing manner [10]. The predictions of the semantic edge detection branch on edge points may differ from those of the semantic segmentation branch, which implies the existence of cross-task inconsistency. Conventionally, a fusion module is introduced to cope with this conflict, as in [11,12], with the intention of fusing features from the semantic edge detection branch to improve the semantic segmentation branch. However, these fusion modules are sometimes not as effective as expected. As the ablation experiments of [11] point out, the improvement in the mean of class-wise intersection-over-union (mIoU) on the Cityscapes validation set mainly depends on the duality loss (+1.44%) rather than semantic edge fusion (+0.22%) or the pyramid context module (+0.62%). A considerable amount of segmentation error along object boundaries still exists, which means the mutual consistency between the semantic segmentation branch and the semantic edge detection branch should be studied further to improve the quality of segmentation results.
We have observed that many semantic segmentation works can be loosely viewed as semantic edge detection tasks, since applying edge detectors to semantic segmentation outputs can yield semantic edge results. Their relationship can be modeled as in Figure 2. Logically, in order to preserve consistency among tasks, the results of inferring semantic edges from an input image should be the same regardless of the inference path. That is, predicting semantic edges by first predicting a semantic segmentation map from an input image should achieve predictions similar to directly predicting semantic edges from the input image. This observation aligns with the concept of inference-path invariance, which serves as the guiding ideology in the work by Zamir et al. [13]: predictions should remain consistent regardless of the specific inference paths. The input image domain, the semantic segmentation domain, and the semantic edge domain form an Elementary Consistency Unit as proposed by Zamir et al. [13], illustrated in Figure 2.
By imposing a cross-task consistency loss on the endpoint outputs of the two paths, the consistency between semantic segmentation and semantic edge detection can be learned explicitly. Based on these analyses, we propose a new framework to simultaneously train a semantic segmentation branch and a semantic edge detection branch; the overall process is shown in Figure 3.
The highlights of this paper are as follows.
1. Figure 1 illustrates the superior balance between speed and accuracy achieved by our framework on the Cityscapes dataset, distinguishing it as one of the few models capable of real-time inference at full resolution. Notably, our model operates at an impressive 77.4% mIoU while maintaining a fast frame rate of 46.2 FPS on Cityscapes.
2. We introduce a novel approach, the "decoupled cross-task consistency loss," to explicitly enhance cross-task consistency between semantic edge detection and semantic segmentation, resulting in a 1.83% improvement in mIoU on the Cityscapes test set. The decoupled loss effectively enforces consistency across tasks, facilitating the learning of shared representations and leading to improved overall performance.
3. Our model demonstrates exceptional efficacy in categories characterized by distinct edges and boundaries, as evidenced by significant IoU improvements in several categories, with "train" gaining nearly 18% IoU on the Cityscapes test set. These results reinforce the importance of incorporating edge information through our approach, highlighting its impact on segmentation performance.
4. The decoupled architecture we have designed allows for joint training of multiple tasks without the need for fusion modules during inference, thereby avoiding extra inference overhead. This efficient and practical approach enables us to leverage the advantages of multitasking for real-time semantic segmentation without compromising performance.


Related Work

Semantic Segmentation
Strengths, weaknesses, and major challenges of semantic segmentation are extensively discussed in the literature [14,15,16,9]. There are currently two broad approaches to semantic segmentation: improving objects' inner consistency, or refining details along object boundaries.
The inner inconsistency of an object is attributed to the limited receptive field, under which longer-range relationships between pixels in an image cannot be fully modeled. Consequently, dilated convolution [17] or high-resolution networks [18] are introduced to enlarge the receptive field. Furthermore, many attempts have been made to capture contextual information, such as recurrent networks [19,20], the pyramid pooling module [21], graph convolution networks [22], CRF-related networks [5,6,23], the non-local operator [24], attention mechanisms [25,26], etc. The ambiguity along edges is caused by down-sampling operations in FCNs that result in blurred predictions. It is difficult to recover spatial information lost during down-sampling through simple up-sampling. Thus, previous papers have made efforts to add priors to guide the upsampling process, many of which focus on the use of edge priors. The general practice is a two-stream framework that trains an edge detection branch and a semantic segmentation branch jointly, which will be elaborated later.

Multi-Task Learning
Driven by deep learning, many dense prediction tasks such as semantic segmentation and instance segmentation have achieved significant performance improvements. Typically, tasks are learned in isolation, i.e., each task is trained with a separate neural network. Recently, multi-task learning (MTL) techniques that learn shared representations by jointly processing multiple tasks have shown promising results.
Almost all theories about MTL are based on the assumption that tasks learned together should be relevant; otherwise, a phenomenon called negative transfer can occur. In practice, finding relevant tasks depends largely on expert experience. For example, [27,28,29,30] jointly train semantic segmentation and depth estimation to achieve better results, and [31,32] jointly train semantic segmentation and instance segmentation to increase accuracy. [4,33,34,12,11] train semantic segmentation and edge detection jointly to improve metrics. Among these, edge priors can be further subdivided into binary edge priors and semantic edge priors. For example, in GSCNN [4], the binary edge is used as a gate to improve performance. In BFP [33], binary edge information is used to propagate local features within regions. [34] adopts a domain transform to perform edge-preserving filtering controlled by a binary edge map derived from a task-specific edge detection task. [12] applies explicit semantic boundary supervision to learn semantic features and edge features in parallel, with an attention-based feature fusion module to combine high-resolution edge features with wide-receptive-field semantic features. RPCNet [11] presents an interacting multi-task learning framework for semantic segmentation and semantic boundary detection.
The most common multi-task learning framework shares some layers in the feature extraction stage and designs independent layers for each specific task, which is called the hard parameter-sharing approach [10]. This approach makes it difficult to ensure that multiple tasks work well together. Although there are means of using uncertainty [35] to determine task weights, the relationship between tasks remains unclear, which motivates the study of explicit consistency constraints between tasks.

Consistency Learning
It has been speculated that multi-task networks may automatically produce cross-task consistent predictions since their representations are shared. Numerous studies [36,37,38,13] have observed that this is not necessarily true, since consistency is not enforced directly during training, indicating the need for explicit enhancement of consistency during learning.
From the literature, two kinds of explicit consistency constraints can be summarized. One idea is formulated as the cross-task consistency theory based on inference-path invariance by [13], which first analyzes the triangular relation shown in Figure 2 and deduces the corresponding formula under an $\ell_1$-norm assumption, then generalizes to larger systems of domains in which consistency can be enforced using invariance along arbitrary paths, as long as their endpoints are the same. [39] conveys the same insight: it feeds the predictions of one task into another network to predict the other task, obtaining task-transferred predictions, and imposes explicit constraints between the transferred predictions and the predictions of the other task.
Another idea is that, for a specific geometric feature such as a boundary, the results extracted by different tasks should be consistent. For instance, [40] forces the depth border to be consistent with the segmentation border through morphing, and [41] penalizes differences between the edges of the semantic heat map and the edges of the depth map through a holistic consistency loss.

Real-Time Semantic Segmentation
Real-time semantic segmentation is a challenging and essential task in computer vision, aiming to perform pixel-wise classification with high accuracy and rapid inference speed. The demand for real-time processing in applications such as autonomous vehicles, robotics, and augmented reality has driven extensive research efforts to develop efficient algorithms and architectures. Attempts to achieve a balance between speed and accuracy in real-time semantic segmentation include efficient architectures [42,43], lightweight convolutions [44,45], knowledge distillation [46,47,48], pruning and quantization [49,50], and optimization techniques [51].
Multi-task learning has shown promise in various computer vision tasks, but it is relatively less popular in real-time semantic segmentation. One challenge in deploying multi-task learning for real-time semantic segmentation is the potential increase in inference overhead due to fusion modules or additional processing steps. While multi-task learning can be beneficial during training by leveraging shared representations and learning complementary features from related tasks, the goal is to incorporate this knowledge effectively without introducing extra inference time. To achieve this, researchers are exploring methods to learn shared representations without the need for explicit fusion modules during inference. Approaches to address this concern include decoupled architectures [52], knowledge distillation, shared layers [53], and weight sharing [42,43].

Edge Detection
One notable method of applying deep neural networks to train and predict edges in an image-to-image, end-to-end fashion is holistically-nested edge detection (HED) [54]. HED is a binary edge detection network, in which edge pixels are set to 1 and all others to 0. In practice, edge pixels appear on contours or junctions belonging to two or more semantic categories, giving rise to the challenging problem of category-aware semantic edge detection. A pioneering approach is CASENet [55], which extends HED. However, both HED and CASENet employ fixed-weight fusion to merge side outputs, ignoring image-specific and location-specific information. To address this, DFF [56] designs an adaptive weight fusion module that assigns different fusion weights for different input images and locations.
In essence, compared to binary edge detection, semantic edge detection is more tightly coupled with semantic segmentation, since it provides semantic information about edge pixels while locating edges.

Method
In this section, we first introduce the overall pipeline of our architecture, illustrated in Figure 3, and then explain its components in detail.

Model Overview
As shown in Figure 3, the overall network is a two-stream framework following the hard parameter-sharing manner [10]. It contains two branches: the upper one is an implementation of DFF [56] responsible for semantic edge detection, and the lower one is an FCN for semantic segmentation with PPM [21] and FPN [57] integrated as the decoder. The backbone of the FCN is replaceable, and its output features are shared by the two branches. Apart from the shared backbone layers, the other layers of the two branches are task-specific and parallel. An Edge Detector transfers the segmentation maps to semantic edges, which are enforced to be consistent with the output of the semantic edge detection branch by a consistency loss. We call this network TriangleNet because its underlying theory can be formulated as the triangular relation shown in Figure 2.
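To make the sharing pattern concrete, below is a minimal PyTorch sketch of the two-stream layout. The 1x1-conv heads are placeholders standing in for the actual PPM+FPN decoder and DFF-style edge branch, and all module names are our own illustrative assumptions rather than the released implementation (which uses PaddlePaddle).

```python
import torch
import torch.nn as nn
import torchvision

class TwoStreamNet(nn.Module):
    """Two-stream, hard parameter-sharing layout (simplified sketch)."""

    def __init__(self, num_classes: int = 19):
        super().__init__()
        # Shared backbone: a ResNet-18 trunk whose intermediate features are
        # consumed by both task-specific branches.
        resnet = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1,
                                  resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2,
                                     resnet.layer3, resnet.layer4])
        # Task-specific heads: placeholders for the PPM+FPN decoder and the
        # DFF-style edge branch.
        self.seg_head = nn.Conv2d(512, num_classes, kernel_size=1)
        self.edge_head = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x):
        feats = [self.stem(x)]
        for stage in self.stages:
            feats.append(stage(feats[-1]))       # 5 shared feature maps
        # Only the heads are task-specific; everything above is shared.
        seg_logits = self.seg_head(feats[-1])    # semantic segmentation
        edge_logits = self.edge_head(feats[-1])  # semantic edges
        return seg_logits, edge_logits
```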

Edge Detector
There are various strategies to extract semantic edges from semantic segmentation results: an edge detection operator such as Canny [58], the Spatial Gradient Solution [11], or a transfer network. To guarantee end-to-end training, the chosen scheme must be differentiable. For simplicity, we choose the Spatial Gradient Solution, which uses adaptive pooling to derive a spatial gradient. The formulation is as follows:

$$C_k(p) = \big| \mathrm{pool}_w(S_k(p)) - S_k(p) \big| \qquad (1)$$

where $S$ represents the probability map of semantic segmentation, $S_k(p)$ indicates the predicted probability of the $k$-th semantic category at pixel $p$, and $|\cdot|$ denotes the absolute value function. $\mathrm{pool}_w$ is an adaptive average pooling operation with kernel size $w$. As in [11], $w$ controls the derived boundary width and is set to 3.
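As a concrete illustration, Equation 1 can be sketched in PyTorch as below; we approximate the adaptive pooling with a stride-1 average pool of kernel size $w$, which is an assumption on our part. Every operation is differentiable, so the consistency signal can back-propagate into the segmentation branch.

```python
import torch
import torch.nn.functional as F

def spatial_gradient_edges(seg_prob: torch.Tensor, w: int = 3) -> torch.Tensor:
    """Derive semantic edge probabilities C from segmentation probabilities S.

    seg_prob: (N, K, H, W) softmax output of the segmentation branch.
    Returns an (N, K, H, W) boundary map; w controls the boundary width.
    """
    # Average pooling blurs S; pixels that differ from their local mean lie
    # on category boundaries, so |pool_w(S) - S| highlights edges.
    pooled = F.avg_pool2d(seg_prob, kernel_size=w, stride=1, padding=w // 2)
    return (pooled - seg_prob).abs()
```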

Task-specific Elementary Consistency Unit
TriangleNet involves three domains: the input image domain, the semantic segmentation domain, and the semantic edge domain. As illustrated in Figure 2, $\chi$ denotes the query domain (e.g., input RGB images), and $\psi = \{\psi_1, \psi_2\}$ is the set of two desired prediction domains, where $\psi_1$ represents the semantic segmentation domain and $\psi_2$ the semantic edge domain. The functions that map the query domain onto the prediction domains are defined as $\Gamma_{\chi\psi_i}$ ($i = 1, 2$), each outputting $\psi_i$ given $\chi$. $\Gamma_{\psi_1\psi_2}$ denotes the cross-task function that maps the semantic segmentation domain to the semantic edge domain. According to the Elementary Consistency Unit theory proposed by [13], predicting $\psi_2$ by first predicting $\psi_1$ from $\chi$ should achieve predictions similar to directly predicting $\psi_2$ from $\chi$.
To aid comprehension of the consistency constraint across the three domains, Figure 2 provides visual examples for each domain. $S$ represents an instance from domain $\psi_1$, while $E$ and $C$ are instances from domain $\psi_2$. Here, $E$ corresponds to the output of the semantic edge detection model, while $C$ is the output of the semantic segmentation model after passing through the edge detector. Notably, $E$ and $C$ exhibit a high degree of similarity, indicating strong consistency between the two outputs.

Ohem Cross Entropy Loss
In our framework, the $\Gamma_{\chi\psi_i}$ are neural networks. Through $\Gamma_{\chi\psi_1}$ we obtain the semantic segmentation probability map $S \in \mathbb{R}^{H \times W \times K}$, where $K$ is the number of categories. A common way of training the network $\Gamma_{\chi\psi_1}$ is to find its parameters by minimizing the Cross Entropy Loss.
What we actually use is an improved version called the Ohem Cross Entropy Loss implemented by PaddleSeg [59]. It stands for "Online Hard Example Mining Cross Entropy Loss." Instead of computing the loss over all examples in a batch, it selects only the hard examples and uses them to update the model during training, which helps deal with class imbalance and emphasizes difficult examples for better generalization. Hard examples are those with low predicted probabilities for the relevant label, i.e., examples the model finds challenging to classify correctly. We denote the Ohem Cross Entropy Loss as $L_s$, which measures the difference between $S$ and the semantic segmentation ground truth. It can be expressed as follows:

$$L_s = -\frac{1}{N_{\mathrm{hard}}} \sum_{p \in \mathcal{H}} \log P(p) \qquad (2)$$

where $Y(p)$ denotes the ground truth label at pixel $p$, $P(p) = S_{Y(p)}(p)$ represents the predicted probability of that label at pixel $p$, $\mathcal{H}$ is the set of selected hard examples, and $N_{\mathrm{hard}} = |\mathcal{H}|$ is the number of hard examples to be considered. $N_{\mathrm{hard}}$ can be a fixed number or a percentage of the batch size, depending on the implementation. In PaddleSeg [59], the number of hard examples is determined by the min_kept and thresh hyperparameters: min_kept specifies the minimum number of hard examples to keep, and thresh sets the probability threshold below which examples are considered hard.
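A minimal PyTorch sketch of this selection logic is given below, following the min_kept/thresh semantics described above; PaddleSeg's actual implementation may differ in detail.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, thresh=0.7, min_kept=10000,
                       ignore_index=255):
    """OHEM cross-entropy: average the loss over hard pixels only.

    logits: (N, K, H, W); labels: (N, H, W), ignore_index marks void pixels.
    Assumes at least one valid pixel in the batch.
    """
    per_pixel = F.cross_entropy(logits, labels, ignore_index=ignore_index,
                                reduction="none").flatten()
    valid = labels.flatten() != ignore_index
    # Probability the model assigns to the ground-truth class at each pixel.
    prob = F.softmax(logits, dim=1)
    safe_labels = labels.masked_fill(labels == ignore_index, 0)
    gt_prob = prob.gather(1, safe_labels.unsqueeze(1)).squeeze(1).flatten()
    # Hard pixels: ground-truth probability below `thresh`...
    hard = valid & (gt_prob < thresh)
    # ...but keep at least `min_kept` pixels, lowest-probability first.
    if hard.sum() < min_kept:
        order = gt_prob.masked_fill(~valid, 2.0).argsort()
        hard = torch.zeros_like(hard)
        hard[order[:min_kept]] = True
        hard &= valid
    return per_pixel[hard].mean()
```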

Semantic Edge Loss
Through $\Gamma_{\chi\psi_2}$ we obtain the semantic edge probability map $E \in \mathbb{R}^{H \times W \times K}$. While training the network $\Gamma_{\chi\psi_2}$, we minimize a loss called the Multi-Label Loss, formulated as:

$$L_e = -\sum_{k=1}^{K} \sum_{p} \Big[ G_k(p) \log E_k(p) + \big(1 - G_k(p)\big) \log\big(1 - E_k(p)\big) \Big] \qquad (3)$$

where $G_k(p)$ denotes the ground truth edge label of the $k$-th semantic category at pixel $p$ and $E_k(p)$ indicates the predicted edge probability of the $k$-th semantic category at pixel $p$. $L_e$ measures the difference between $E$ and the semantic edge ground truth $G$.
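This is a per-class binary cross-entropy, so a sketch is straightforward; we omit the class-balancing weights that implementations such as CASENet apply between edge and non-edge pixels.

```python
import torch
import torch.nn.functional as F

def multi_label_edge_loss(edge_logits: torch.Tensor,
                          edge_gt: torch.Tensor) -> torch.Tensor:
    """Multi-Label Loss L_e between predicted edges E and ground truth G.

    edge_logits: (N, K, H, W) raw outputs of the edge branch.
    edge_gt:     (N, K, H, W) binary ground-truth edge maps.
    """
    # Each of the K channels is an independent binary edge-detection problem,
    # hence sigmoid + BCE per channel rather than a softmax across K.
    return F.binary_cross_entropy_with_logits(edge_logits, edge_gt.float())
```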

Decomposed Cross-Task Consistency Loss
$\Gamma_{\psi_1\psi_2}$ is modeled as the spatial gradient operation formulated in Equation 1.

Taking $S$ as the input of $\Gamma_{\psi_1\psi_2}$ yields another semantic edge probability map $C \in \mathbb{R}^{H \times W \times K}$. $C$ and $E$ should be consistent. Instead of directly penalizing the difference between $C$ and $E$, we penalize the difference between $C$ and $G$ and between $E$ and $G$ separately, thus indirectly forcing the alignment between $C$ and $E$. The formulation is as follows:

$$L_{dc} = L_{dc}^{C} + L_{dc}^{E} \qquad (4)$$

We call $L_{dc}$ the decomposed cross-task consistency loss, in which

$$L_{dc}^{C} = \sum_{k=1}^{K} \Big( \sum_{p \in G_k^{+}} \big|1 - C_k(p)\big| + \sum_{p \in G_k^{-}} \big|C_k(p)\big| \Big) \qquad (5)$$

$$L_{dc}^{E} = \sum_{k=1}^{K} \Big( \sum_{p \in G_k^{+}} \big|1 - E_k(p)\big| + \sum_{p \in G_k^{-}} \big|E_k(p)\big| \Big) \qquad (6)$$

where $G_k^{+}$ and $G_k^{-}$ denote the edge and non-edge ground truth label sets of the $k$-th class semantic edge, respectively. Similar to $E_k(p)$, $C_k(p)$ denotes another predicted edge probability of the $k$-th semantic category at pixel $p$.

Since the predicted probabilities lie in $[0, 1]$, the absolute values on the right-hand sides of Equations 5 and 6 can be dropped. For simplicity, we define:

$$\|C - G\|_{b} = \sum_{k=1}^{K} \Big( \sum_{p \in G_k^{+}} \big(1 - C_k(p)\big) + \sum_{p \in G_k^{-}} C_k(p) \Big) \qquad (7)$$

$$\|E - G\|_{b} = \sum_{k=1}^{K} \Big( \sum_{p \in G_k^{+}} \big(1 - E_k(p)\big) + \sum_{p \in G_k^{-}} E_k(p) \Big) \qquad (8)$$

Equations 7 and 8 are variants of the $\ell_1$ norm, and we call this kind of variant the boundary-aware $\ell_1$ norm. Substituting Equations 7 and 8, via Equations 5 and 6, into Equation 4, we derive:

$$L_{dc} = \|C - G\|_{b} + \|E - G\|_{b} \qquad (9)$$
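Below is a sketch of the boundary-aware $\ell_1$ terms and the resulting $L_{dc}$, following the reconstruction above; it uses raw sums as in Equations 7-9, whereas a practical implementation might normalize by the pixel count.

```python
import torch

def boundary_aware_l1(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Boundary-aware l1 norm of Equations 7/8.

    pred: (N, K, H, W) edge probabilities (C or E), values in [0, 1].
    gt:   (N, K, H, W) binary edge labels G.
    """
    pos = gt.bool()     # edge pixels     (the sets G_k^+)
    neg = ~pos          # non-edge pixels (the sets G_k^-)
    # On edge pixels the prediction should approach 1, elsewhere 0.
    return (1.0 - pred[pos]).sum() + pred[neg].sum()

def decomposed_consistency_loss(C, E, G):
    # Penalizing C and E against G separately (Equation 9) upper-bounds the
    # direct C-vs-E discrepancy via the triangle inequality, while keeping
    # the two gradients independent.
    return boundary_aware_l1(C, G) + boundary_aware_l1(E, G)
```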

Loss Function
We perform a weighted sum of the above three losses to obtain the loss for predicting domain $\psi_1$ from $\chi$ while enforcing consistency with domain $\psi_2$:

$$L = C_s L_s + C_e L_e + C_c L_{dc} \qquad (10)$$

in which $C_s$, $C_e$, $C_c$ are hyperparameters. As pointed out by [15], grid search is competitive with or better than existing task-balancing techniques for determining the weights of loss functions. Therefore, in our experiments, $C_s$, $C_e$, $C_c$ are obtained by grid search. First, we generate grids of candidate coefficient values to explore. The loss $L_s$ is computed over the majority of pixels in the image, leading to relatively larger values than the other two losses, which focus specifically on object edges. To prevent the edge-related losses from being overshadowed by their smaller magnitudes, we assign them larger coefficients, prompting the model to pay closer attention to edges during training. Specifically, we set $C_s \in \{1\}$ and $C_e, C_c \in \{5, 10, 20\}$. We then try all combinations of hyperparameter values from the defined grids; with one value for $C_s$ and three values each for $C_e$ and $C_c$, there are $1 \times 3 \times 3 = 9$ combinations to try.
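The nine combinations can be enumerated mechanically, as sketched below; evaluate is a hypothetical stand-in for a full train-and-validate run with the given weights.

```python
import itertools
import random

def evaluate(c_s: float, c_e: float, c_c: float) -> float:
    # Hypothetical placeholder: in reality this trains TriangleNet with the
    # given loss weights and returns the validation mIoU.
    return random.random()

grid = {"C_s": [1], "C_e": [5, 10, 20], "C_c": [5, 10, 20]}
best = max(
    ((evaluate(c_s, c_e, c_c), (c_s, c_e, c_c))
     for c_s, c_e, c_c in itertools.product(*grid.values())),
    key=lambda t: t[0],
)
print("best mIoU %.3f with (C_s, C_e, C_c) = %s" % best)
```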
Substituting Equation 9 into Equation 10, we obtain:

$$L = C_s L_s + C_e L_e + C_c \big( \|C - G\|_{b} + \|E - G\|_{b} \big) \qquad (11)$$

which is equivalent to the following equation:

$$L = \big( C_s L_s + C_c \|C - G\|_{b} \big) + \big( C_e L_e + C_c \|E - G\|_{b} \big) \qquad (12)$$

where the first term pertains to network $\Gamma_{\chi\psi_1}$, while the second term pertains to network $\Gamma_{\chi\psi_2}$. These two terms are independent and can be dealt with in parallel for the task-specific layers of $\Gamma_{\chi\psi_1}$ and $\Gamma_{\chi\psi_2}$, which is exactly the original intention behind defining $L_{dc}$ as two independent parts.
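Putting the pieces together, here is a sketch of the total loss, reusing the functions from the earlier sketches; note that the segmentation-specific layers receive gradients only from the first bracketed term of Equation 12 and the edge-specific layers only from the second.

```python
def total_loss(seg_logits, edge_logits, seg_gt, edge_gt,
               c_s=1.0, c_e=10.0, c_c=20.0, w=3):
    """Equation 10, assembled from the earlier sketches in this section."""
    S = seg_logits.softmax(dim=1)           # segmentation probabilities
    E = edge_logits.sigmoid()               # edge probabilities
    C = spatial_gradient_edges(S, w)        # Equation 1
    L_s = ohem_cross_entropy(seg_logits, seg_gt)
    L_e = multi_label_edge_loss(edge_logits, edge_gt)
    L_dc = decomposed_consistency_loss(C, E, edge_gt.float())
    return c_s * L_s + c_e * L_e + c_c * L_dc
```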

Experiments
We first conducted experiments on Cityscapes [60], a popular computer vision dataset for semantic urban scene understanding. It contains 5,000 images with fine annotations collected from 50 cities in different seasons, divided into sets of 2,975, 500, and 1,525 images for training, validation, and testing. Conventionally, only 19 categories are used to assess segmentation accuracy. Although the dataset also provides coarsely annotated images, we use only the finely annotated ones. In addition, we performed experiments on FloodNet [61] to further confirm the generalization and application value of our method. Code and models are available at: https://github.com/nailperry-zd/PaddleSeg-TriangleNet.

Experiments on Cityscapes
Baseline: We append the PPM [21] and FPN [57] decoder to a naive FCN as the baseline, with ResNet-18 [62] as the backbone; that is, we train the semantic segmentation branch shown in Figure 3 independently.
Implementation details: We use version 2.3.0 of the PaddlePaddle [63] framework to carry out the following experiments. The hardware platform is a single V100 GPU with 32 GB of memory. All networks with a ResNet-18 [62] backbone share the same settings: stochastic gradient descent (SGD) with a batch size of 4 is used as the optimizer, with a momentum of 0.9 and a weight decay of 5e-4. All these ResNet-18 variants are trained for 300K iterations with an initial learning rate of 0.01. Data augmentation comprises normalization, random distortion, random horizontal flipping, random resizing with a scale range of [0.5, 2.0], and random cropping with a crop size of 1024 x 1024. During inference, we use the whole picture as input. In terms of loss weights, $C_s$, $C_e$, and $C_c$ are set to 1, 10, and 20, respectively. For quantitative evaluation, mIoU is used for accuracy comparison.

Comparison against state-of-the-art methods
We present a comprehensive comparison of our method with both real-time and non-real-time semantic segmentation algorithms in Table 1 and Table 2, respectively.
In Table 1, it is important to highlight that some of the listed models have been officially integrated into PaddlePaddle and are available in the PaddleSeg [59] open source library. This integration facilitates a rigorous evaluation of the inference speed of these PaddlePaddle-integrated models as well as our model, TriangleNet. To measure speed accurately, we use the Paddle Inference API from the PaddleSeg [59] library on an A100 GPU with 40 GB of memory, with FP32 precision. In cases where certain models have no PaddlePaddle implementation, or where our direct measurements are not available, we provide FPS data from the original papers or third-party sources within brackets in the table for reference. This meticulous approach ensures a comprehensive and fair comparison, allowing us to draw reliable conclusions regarding TriangleNet's performance in real-time semantic segmentation relative to other state-of-the-art models.
As shown in Table 1, at a resolution of 512 x 1024 our model not only surpasses ESPNetV2 in both speed and accuracy but also outperforms STDC1-Seg50 and PP-LiteSeg-T1 in accuracy, despite running at approximately 60% and 70% of their respective speeds. Similarly, at 768 x 1536, while our model runs at 85% of BiSeNetV1-L's speed, it exhibits a 1.4% increase in accuracy. Additionally, our model's speed, around 50% of STDC1-Seg75 and PP-LiteSeg-T2, is compensated by roughly 1% higher accuracy. At a resolution of 1024 x 2048, our model exhibits significantly higher accuracy than ICNet, SwiftNet, and FasterSeg.
Our method demonstrates a remarkable speed/accuracy trade-off across various resolutions when compared to real-time counterparts. Notably, our model achieves impressive accuracy without compromising speed, enabling real-time inference even at full resolution. Moreover, as shown in Table 2, our model is competitive even against non-real-time models based on ResNet-101 [62], achieving similar mIoU scores while using only one-fifth of the parameters. This highlights the efficiency and effectiveness of our approach across diverse scenarios.

Ablation Studies
Our approach involves several elements compared to the baseline.Each element may contribute to the improvement of mIoU.To verify the necessity of each element, we performed the following ablation studies.
Ablation study on joint framework: From a multi-task perspective, joint training benefits from higher task correlation. To explore this idea, we conducted experiments combining different tasks. In Table 3, the second row shows joint training of the Baseline with HED [54], a classic binary edge detection model, resulting in a slight improvement in mIoU on the Cityscapes test set. Replacing HED with DFF [56], a superior semantic edge detection model, in the third row leads to a 0.59% improvement over the Baseline in mIoU on the Cityscapes test set. The results suggest a stronger correlation between the semantic segmentation and semantic edge detection tasks. This correlation arises because semantic edges can be extracted accurately under the constraints of semantic segmentation, as semantic segmentation suppresses non-edge pixels and, in turn, relies on semantic edges to distinguish between objects and background. The two tasks complement each other, enhancing the overall performance of the model.
During this process, we adopted the Poly learning rate policy, which is widely used and proven effective, as depicted in Figure 4 (a).

Ablation study on $L_{dc}$: We introduced $L_{dc}$ during the last 50% of iterations to verify its effect. This idea draws on [74], where a loss called ABL is added in the last 20% of epochs, since the gradient of ABL is not useful while the semantic edges output by the network are still far from the semantic edge ground truth at the beginning of training, much as in our case.
During this process, we employed a custom learning rate policy named "2-cycle-SGDR Poly," which can be seen as a variant of the Cosine Annealing policy [75]. Visualizations of the Cosine Annealing and 2-cycle-SGDR Poly policies are shown in Figure 4 (b) and Figure 4 (c), respectively. In the 2-cycle-SGDR Poly policy, the learning rate periodically increases, a process referred to as "restarts" in SGDR [75]. The underlying idea is to encourage the model to traverse from one local minimum to another, particularly if it is trapped in a steep trough.
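Since the exact schedule is only shown graphically in Figure 4, the following is merely a guess at its shape: the standard poly decay run twice, with one warm restart to the initial learning rate at the halfway point; the cycle split and the poly power are our assumptions.

```python
def two_cycle_sgdr_poly(iteration: int, total_iters: int,
                        base_lr: float = 0.01, power: float = 0.9) -> float:
    """A guessed '2-cycle-SGDR Poly' schedule: poly decay with one restart."""
    half = total_iters // 2
    pos = iteration % half          # warm restart at the halfway point
    return base_lr * (1.0 - pos / half) ** power
```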
Comparing the first and second rows in Table 4, we observed that employing the 2-cycle-SGDR Poly policy alone resulted in a 0.46% increase in mIoU on the Cityscapes test set. With the introduction of $L_{dc}$ in the third row, the mIoU exhibits consistent growth on both the validation and test sets, with a more significant improvement on the test set. The inclusion of $L_{dc}$ further boosts the mIoU by 1.83% on the test set, indicating its positive impact on the overall performance of our model.
Furthermore, comparing the third and fourth rows in Table 4, we found that both the Cosine Annealing and 2-cycle-SGDR Poly policies improve the model to some extent, confirming the effectiveness of the "restarts" in SGDR; notably, 2-cycle-SGDR Poly is more suitable for our model. Therefore, for all subsequent TriangleNet variants, we adopted the 2-cycle-SGDR Poly learning rate schedule to ensure consistent and superior optimization.
Ablation study on different semantic edge detection strategies: In our exploration of state-of-the-art strategies for semantic edge detection, we compared DFF [56] and CASENet [55]. The two rows in Table 5 show that both DFF and CASENet, when employed as the semantic edge detection branch, lead to improved segmentation accuracy. This finding highlights the positive role of injecting semantic edges into the semantic segmentation process. Notably, as reported in [56], DFF surpasses CASENet in standalone semantic edge extraction; even after integration with semantic segmentation, DFF continues to deliver superior accuracy improvements, reaffirming its effectiveness in enhancing the overall performance of the model.

Ablation study on different semantic segmentation models: Our evaluation extends to various semantic segmentation models. The results in Table 6 show that integrating either U-Net [2] or the Baseline into our framework for joint training yields remarkable improvements in semantic segmentation accuracy, showcasing the effectiveness and benefits of our approach in enhancing segmentation performance.

Analyses
The ablation studies demonstrate that the semantic edge detection task exhibits a stronger correlation with the semantic segmentation task than the binary edge detection task does. When we jointly train both tasks, we observe a 0.59% improvement in mIoU on the Cityscapes test set. Additionally, the adoption of the 2-cycle-SGDR Poly learning rate policy yields a slight yet meaningful improvement of 0.46%.
However, despite the benefits of multi-task learning, the implicit learning of cross-task consistency in multi-task networks is limited. To address this limitation, we introduced an additional restriction called the decoupled cross-task consistency loss, denoted as $L_{dc}$, to explicitly enhance cross-task consistency between semantic edge detection and semantic segmentation. The explicit enhancement of consistency through this restriction resulted in a further significant improvement of 1.83% in mIoU on the Cityscapes test set. These findings underscore the importance of incorporating explicit consistency constraints during learning to achieve enhanced performance in multi-task computer vision systems.
Analyzing the IoU of each category, as shown in Table 7, we observe significant improvements for several categories, with some gaining more than 3%. Notably, the largest improvement amounts to nearly 18% in IoU. This further validates the significance of incorporating edge information through our edge prior augmentation approach, particularly for categories such as "truck," "bus," and "train," where distinct edges and boundaries are prevalent.
Furthermore, we investigate the relationship between the number of sample images per category and the respective IoU scores, visualized in Figure 5. The analysis reveals that categories with higher IoUs tend to have more samples, while those with lower IoUs have fewer, aligning with our expectations. Notably, categories such as "truck," "bus," and "train," despite their smaller sample sizes, exhibit remarkable improvements in IoU. This suggests that TriangleNet can generalize effectively from limited samples, resulting in enhanced performance in challenging categories.
However, certain categories, such as "wall," "fence," "pole," and "traffic sign," possess abundant samples but fail to achieve the expected higher IoU values. Their underperformance may be attributed to factors such as complex semantic patterns or limitations in the model architecture in capturing their unique features. Further investigation is warranted to identify the specific reasons behind these discrepancies and to devise strategies to enhance segmentation performance for these categories. In the qualitative comparison, TriangleNet demonstrates its ability to accurately locate pixels along the boundary between two objects by effectively utilizing edge priors and object shapes, which the Baseline fails to achieve. Additionally, in the first, third, and fourth rows, TriangleNet leverages the priors obtained from semantic edge detection to perceive trains, buses, and walls as coherent entities, while the Baseline tends to split them into different categories. This highlights TriangleNet's capacity to benefit from edge information and enhance its understanding of complex object structures, resulting in improved semantic segmentation performance.

Experiments on FloodNet
We also performed experiments on FloodNet [61], an Unmanned Aerial Vehicle (UAV) dataset for assessing damage from natural disasters, to further demonstrate the compatibility of our method. The dataset contains 2,343 images in total, divided into sets of 1,445, 450, and 448 images for training, validation, and testing. It contains 10 classes; the index and specific meaning of each class are given in the accompanying table. TriangleNet consistently outperforms the Baseline on this dataset as well.

In summary, TriangleNet showcases its potential in semantic segmentation by effectively leveraging edge priors and incorporating explicit cross-task consistency. This combination not only enhances accuracy but also enables real-time inference, making it well suited to various real-world applications. Further research exploring more detailed explicit constraints may lead to even greater performance improvements. The achievements of TriangleNet in the context of real-time semantic segmentation pave the way for future advances in efficient and accurate computer vision systems.

Figure 1 :
Figure 1: Run-time/accuracy trade-off comparison on the Cityscapes test set. Our models (in red) achieve an excellent run-time vs. accuracy trade-off among all previous real-time methods. FPS = 30 is the red line dividing real-time from non-real-time performance in the graph. An asterisk after a model name indicates that the inference speed was obtained using the same deep learning framework, PaddlePaddle, and the same hardware platform, an A100 40G device.

Figure 2 :
Figure 2: The multi-task learning framework of semantic segmentation and semantic edge detection coincides with the Elementary Consistency Unit theory, where the prediction $\chi \to \psi_1$ is enforced to be consistent with $\chi \to \psi_2$ using a function that relates $\psi_1$ to $\psi_2$. $S$, $E$, and $C$ respectively denote the outputs of $\Gamma_{\chi\psi_1}$, $\Gamma_{\chi\psi_2}$, and $\Gamma_{\psi_1\psi_2}$.

Figure 3 :
Figure 3: The overall pipeline of TriangleNet. The shared backbone network produces 5-layer features. The task-specific parts of the two branches are enclosed by dashed boxes.

Figure 4 :
Figure 4: Visualization of different learning rate policies.


Table 1 :
Accuracy comparison of our best models based on ResNet-18 against real-time models on the Cityscapes test dataset. "-" indicates that the corresponding data are not given. FPS: frames per second. TriangleNet 1 is an instance of the framework shown in Figure 3. In the three FPS columns, the values outside the brackets are measured by our team, whereas the values within the brackets are sourced from the original papers or from third-party papers. "+" denotes that the value is sourced from [72].

Table 2 :
Accuracy comparison of our best model based on ResNet-18 against non-real-time models on the Cityscapes test dataset.

Table 3 :
Ablation study on joint framework.

Table 4 :
Ablation on $L_{dc}$. All these models are trained for 300K iterations. "×" means that $L_{dc}$ is not involved in all iterations. TriangleNet 1 is an instance of the framework shown in Figure 3. In this situation, we get all the results by single-scale inference.

Table 5 :
Experiments on different semantic edge detection (SED) strategies. TriangleNet 2 is the same as TriangleNet 1 except that TriangleNet 2 uses CASENet [55] as the semantic edge detection branch. In this situation, we get all the results by single-scale inference.

Table 6 :
Experiments on different semantic segmentation (SEM) strategies. TriangleNet 3 is the same as TriangleNet 1 except that TriangleNet 3 uses U-Net [2] as the semantic segmentation branch.