Multitask Deep Neural Network for the Fully Automatic Measurement of the Angle of Progression

The angle of progression (AoP) for assessing fetal head (FH) descent during labor is measured on the standard plane of transperineal ultrasound images as the angle between a line through the long axis of the pubic symphysis (PS) and a second line from the right endpoint of the PS tangent to the contour of the FH. This paper presents a multitask network with a shared feature encoder and three task-specific decoders for standard plane recognition (Task1), image segmentation (Task2) of the PS and FH, and endpoint detection (Task3) of the PS. Based on the segmented FH and the two PS endpoints from standard plane images, we determine the FH tangent point whose tangent line passes through the right endpoint of the PS and then compute the AoP from these three points. An efficient channel attention unit is introduced into the shared feature encoder to improve the robustness of layer-region encoding, an attention fusion module promotes cross-branch interaction between the decoders for Task2 and Task3, and a shape-constrained loss function is designed to enhance robustness to noise based on the convex shape prior. We use Pearson's correlation coefficient and the Bland-Altman plot to assess the degree of agreement. The dataset includes 1964 images: 919 nonstandard planes and 1045 standard planes containing the PS and FH. We achieve a classification accuracy of 92.26%; for the AoP calculation, the mean (STD) absolute difference in AoP (ΔAoP) is 3.898° (3.192°), the Pearson's correlation coefficient between manual and automated AoP is 0.964, and the Bland-Altman plot demonstrates good agreement between them (P < 0.05). In conclusion, our approach achieves a fully automatic measurement of AoP with good efficiency and may help monitor labor progress in the future.


Introduction
Cesarean section is an important procedure for both the mother and the fetus in certain medical conditions [1][2][3][4]. However, an unnecessary cesarean section can lead to higher medical risks for both mothers and infants [3]. Therefore, proper maternal and fetal monitoring during labor is very important, because it is the only way to assess the progress of labor and identify deviations from normal. The clinical variable of fetal head movement is used to inform decision-making regarding the mode of delivery during active pushing [5]. In clinical practice, digital examination is a fundamental method for monitoring the descent of the fetal head (FH), but it is known to have limited accuracy [6][7][8], and repeated screening may allow vaginal bacteria to enter the cervix and uterus and harm the newborn [9].
Recently, several studies have indicated that ultrasound measurements are more accurate and repeatable than digital examination [10][11][12][13][14][15][16][17], and the angle of progression (AoP) has been found to be the most reproducible ultrasound parameter for examining FH descent [18][19][20][21]. AoP is measured transperineally as the angle between a line through the long axis of the pubic symphysis (PS) and a second line from the inferior end of the PS tangent to the contour of the fetal skull (Figure 1(a)). Barbera et al. were the first to use transperineal ultrasound (TPU) to manually measure AoP [22]. AoP has also been found to correlate with the ischial spines in different studies [23][24][25]. Tutschek et al. found that the zero station of the FH corresponds to an AoP of 116 degrees [26]. Arthuis et al. studied computed tomographic (CT) images and found that the ischial spines are associated with an AoP of 110 degrees [24], while Bamberg et al. related an AoP of 120 degrees to the ischial spines obtained with magnetic resonance imaging (MRI) [25]. Moreover, a comparison between the MRI and CT methods showed a mean difference of only 1.4 degrees [27]. A similar, feasible, and highly reproducible method was also used to examine FH descent in breech-presenting fetuses [28]. Using TPU to simultaneously assess FH descent would be desirable; however, it is technically challenging for nonexperienced operators to diagnose FH descent with TPU [29].
Recently, new techniques have been introduced in obstetrics and gynecology to provide fast and automatic identification and measurement of normal and abnormal ultrasound examination results [30,31]. Conversano et al. reported a real-time tracking algorithm for noninvasive and automatic monitoring of AoP during the second stage of labor [32,33]. In their AoP measurement process, the initial standard plane including the PS and FH was manually identified according to the targets within the TPU image and the gray-level values of their pixels. The two substructures (i.e., PS and FH) in the initial image were automatically segmented and used as the two patterns to be searched within subsequent images by maximizing similarity or cross-correlation coefficients [32]. The axis and distal end of the PS were segmented in subsequent images, and displacements from the previous position were calculated; simultaneously, the pattern location of the FH was employed to initialize automatic edge outlining in subsequent images and to calculate the displacement of the rightmost point of the FH from its previous position. Finally, the coordinates and displacements of the FH for each frame were expressed in the reference system associated with the PS distal end to calculate AoP. Different from the above method, a deep learning-based approach was first developed and tested preliminarily on a small dataset by Zhou et al. [33]. Firstly, the PS endpoints were located and the areas of the PS and FH were segmented by a deep learning network. Secondly, the central axis of the PS was obtained from the two endpoints, which are the physical points used for the determination of AoP. Thirdly, the tangent to the FH passing through the lower endpoint of the PS was determined. Finally, AoP was calculated from the central axis and the tangent. All of these methods rely on the standard planes of the TPU images.
Therefore, an end-to-end method for fully automatic measurement of AoP should be further developed.
This paper presents a fully automatic AoP measurement framework built on a multitask process that includes standard plane recognition (Figure 1(b), Task1), image segmentation (Figure 1(c), Task2) of the PS and FH, endpoint detection (Figure 1(c), Task3), and AoP calculation (Figure 1(d)). In the framework, a multitask Unet (MT-Unet) with a shared feature encoder and three task-specific decoders is proposed for the three tasks. More specifically, an efficient channel attention (ECA) unit in the shared encoder, an attention fusion module (AFM) between decoders, and a designed shape-constrained loss function (SLF) are used to improve the performance of our MT-Unet. Based on the segmented FH of Task2 and the detected endpoints of Task3, the tangent point of the FH is determined and AoP is calculated. In brief, our main contributions are as follows: (1) a two-stage measurement framework of AoP, where the MT-Unet performs standard plane recognition, image segmentation, and endpoint detection in the first stage, while tangent point determination and AoP calculation are conducted in the second stage; (2) an MT-Unet designed for AoP calculation from TPU images, with various modules (ECA, AFM, and SLF) used to improve its performance; and (3) our method outperforms existing deep learning methods for automatic AoP measurement. The remainder of this paper is structured as follows: Section 2 explains our method, dataset, and experimental details; Section 3 presents the experimental results; Section 4 provides discussion; and Section 5 gives concluding remarks.

Materials and Methods
An outline of the proposed automatic AoP measurement algorithm is shown in Figure 2. Firstly, standard plane images are selected from the original TPU images (i.e., the input), the target areas are segmented, and the two endpoints are determined by the proposed MT-Unet. Secondly, the contour of the FH region is fitted with an ellipse equation, and the right tangent point connected to the right endpoint is determined. Finally, AoP (i.e., the output) is calculated as the angle between a line through the two endpoints and a second line through the right endpoint and the tangent point. In short, the automatic AoP measurement algorithm comprises the MT-Unet and a postprocessing part.
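The final AoP computation reduces to plane geometry on three points. A minimal sketch (function name and point convention are my own; pixel y-axis orientation does not matter because only the angle between the two directions is used):

```python
import math

def aop_degrees(left_ps, right_ps, tangent_pt):
    """Angle of progression at the right (inferior) PS endpoint: the angle
    between the PS long-axis direction (left endpoint -> right endpoint)
    and the tangent direction (right endpoint -> FH tangent point)."""
    # Direction of the PS long axis
    ax, ay = right_ps[0] - left_ps[0], right_ps[1] - left_ps[1]
    # Direction from the right PS endpoint to the FH tangent point
    tx, ty = tangent_pt[0] - right_ps[0], tangent_pt[1] - right_ps[1]
    cosang = (ax * tx + ay * ty) / (math.hypot(ax, ay) * math.hypot(tx, ty))
    # Clamp against floating-point drift before taking the arccos
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))
```

For instance, with a horizontal PS axis and a tangent point straight above the right endpoint, the function returns 90°.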
2.1. MT-Unet. Taking into account the characteristics of the different labels and tasks of the same input image data, we propose a network with one shared encoder and three task-specific decoders, inspired by Zhou et al. [34]. Three main modifications are applied to the MT-Unet for accuracy improvement: the ECA module captures local cross-channel interaction in the shared encoder, the AFM module captures cross-branch interaction among the task-specific decoders, and the SLF is designed based on the convex shape prior. The architecture is shown in Figure 3, with one shared encoder and three decoders (Figure 3(a)).
The encoder is composed of five blocks, each of which contains two convolutional layers and a downsampling layer. The convolution layer includes a convolutional operation (Conv) with a kernel size of 3 × 3, a batch normalization layer (BN), and a rectified linear unit (ReLU). The downsampling layer consists of a maximum pooling (Max-pooling) operation with a kernel size of 2 × 2 and an ECA unit.
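A minimal PyTorch sketch of one such encoder block, assuming the usual Unet-style layout (the ECA unit attached to the downsampling layer is omitted here; class and variable names are my own):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: two 3x3 Conv-BN-ReLU layers, then 2x2 max pooling.
    The ECA unit that follows the pooling in the paper is omitted in this sketch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.convs(x)        # full-resolution features (reused by skip connections)
        return x, self.pool(x)   # also return the downsampled features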
Three task-specific decoders are designed for standard plane recognition (Task1), image segmentation (Task2), and endpoint detection (Task3). The decoder for Task1 consists of two ResBlocks, each containing convolutional layers (Conv and BN), a shortcut connection, and a ReLU; its final output is followed by an average pooling (Avg-pooling) operation, a fully connected (FC) layer, and a Softmax. The decoder for Task2 is made up of four upsampling blocks and a Softmax layer, where each upsampling block contains an upsampling layer and two convolutional layers. Inspired by Unet [35], we introduced skip connections into Task2. The decoder for Task3 likewise has four upsampling blocks but uses a Sigmoid activation. Several AFM units are used to fuse features between the Task2 and Task3 decoders.

ECA Unit.
The ECA module is used to capture local cross-channel interaction considering each channel and its k neighbors [36]. Given the aggregated feature (C × H × W) using channel-wise global average pooling (GAP), it generates channel weights by performing a fast 1D convolution of size k followed by a Sigmoid function (α) (Figure 3(b)).
The kernel size k represents the coverage of the local cross-channel interaction, i.e., the number of neighbors participating in the channel-attention prediction of each channel. k is adaptively determined as a function of the channel dimension C:

k = ψ(C) = | log2(C)/γ + b/γ |_odd,

where |t|_odd denotes the odd number nearest to t. In the present study, we set γ = 2 and b = 1.
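This adaptive kernel-size rule can be computed directly; a small sketch with the stated γ = 2 and b = 1 (the rounding-to-nearest-odd logic is my own implementation of |t|_odd):

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D-conv kernel size for ECA: the odd number nearest to
    log2(C)/gamma + b/gamma, clamped to be at least 1."""
    t = math.log2(channels) / gamma + b / gamma
    k = int(round(t))
    if k % 2 == 0:
        # Move to the odd neighbor on the side of t
        k += 1 if t >= k else -1
    return max(k, 1)
```

For example, 64 channels give k = 3 and 256 channels give k = 5, so deeper (wider) layers attend over more channel neighbors.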

AFM Unit.
The AFM is designed to allow the network to learn task-related features: the shared features (red arrow, S i−1 ) of the Task2 branch and the task-specific features (green arrow, L i−1 ) from the previous layer. The concatenation of S i−1 and L i−1 forms the input of the AFM.
The MT-Unet is trained in a two-stage manner. In the first stage, the network with the task-specific decoders for Task2 and Task3 is trained with a linear combination of loss functions to obtain a pretrained model: Dice loss and shape-constrained loss are used for Task2 and mean squared error loss [37] for Task3.
In the second stage, the pretrained model with the shared encoder and the Task1 decoder is trained with the cross-entropy loss; the parameters of the shared encoder are loaded from the pretrained model of the first stage. The cross-entropy loss (L_CE) for the binary classification of Task1 [38] measures the difference between the predicted probability and the ground-truth label and is equivalent to the negative log-likelihood:

L_CE = −[y log ŷ + (1 − y) log(1 − ŷ)],

where y and ŷ refer to the ground-truth label and the predicted value, respectively.
The first-stage loss consists of three components: Dice loss (L_D) for the image segmentation of Task2 [39], shape-constrained loss (L_SC) for the convex-shape segmentation of Task2 [40], and mean squared error loss (L_MSE) for the endpoint localization of Task3 [41]. The total loss is given by

L = θ_1 (w_1 L_D + (1 − w_1) L_SC) + θ_2 L_MSE,

where θ_1 and θ_2 are scaling factors determined via the weight uncertainty method [42], and the weight w_1 (0.5) is obtained via hyperparameter analysis.
Dice loss (L_D):

L_D = 1 − (1/C) Σ_{c=1}^{C} [ 2 Σ_{i=1}^{N} y_ic p_ic / (Σ_{i=1}^{N} y_ic + Σ_{i=1}^{N} p_ic) ],

where y is the ground-truth map, p is its corresponding predicted map, N is the number of pixels, and C is the number of classes (excluding the background).
Shape-constrained loss (L_SC) is defined over triples of points (p, q, r), where (p, q) is a point pair inside the segmented region and r lies on the line segment l_pq bounded by p and q; y_ip and y_iq are the ground-truth labels, while p_ip, p_iq, and p_ir are the predicted values. L_SC is activated only when p, q, and r have the same label (i.e., B^i_pqr = 1) [40].
Mean squared error loss (L_MSE):

L_MSE = Σ_{k=1}^{n} δ_k ||H_k − Ĥ_k||²,

where n = 3 is the number of points, δ_k is the loss weight, and H_k and Ĥ_k represent the predicted [33] and ground-truth heatmaps, respectively. Here, δ_1, δ_2, and δ_3 are set to 1.0, 0.8, and 0.6 according to the importance of each heatmap.
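Illustrative NumPy sketches of the Dice and weighted heatmap-MSE terms (function names are my own; whether each heatmap term is a mean or a sum over pixels is an assumption not fixed by the text):

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-6):
    """Multi-class Dice loss averaged over C foreground classes.
    y_true, y_pred: arrays of shape (C, N) holding one-hot labels and
    predicted probabilities for N pixels (background channel excluded)."""
    inter = (y_true * y_pred).sum(axis=1)
    denom = y_true.sum(axis=1) + y_pred.sum(axis=1)
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

def heatmap_mse_loss(pred, target, deltas=(1.0, 0.8, 0.6)):
    """Weighted MSE over n heatmaps: sum_k delta_k * mean((H_k - Hhat_k)^2).
    pred, target: arrays of shape (n, H, W)."""
    return sum(d * np.mean((pred[k] - target[k]) ** 2)
               for k, d in enumerate(deltas[: len(pred)]))
```

A perfect prediction drives the Dice loss to 0, while a completely disjoint one drives it toward 1, which matches the usual Dice-overlap intuition.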

2.2. Postprocessing. The output of the MT-Unet includes the image category, the segmented regions, and the coordinates of the two endpoints (i.e., right and left) of the PS. To measure AoP, the line from the right endpoint of the PS tangent to the contour of the FH must be determined. Therefore, the contour of the FH is obtained by fitting an ellipse equation via the least-squares method [43,44], and the right tangent through the right endpoint of the PS is retained to calculate AoP (see Appendix S1 for details).
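A crude stand-in for the least-squares ellipse fit of [43,44], fitting a general conic a·x² + b·xy + c·y² + d·x + e·y = 1 to contour points (the paper's actual method may differ; this sketch does not enforce the ellipse constraint b² < 4ac):

```python
import numpy as np

def fit_conic(xs, ys):
    """Unconstrained least-squares fit of a conic
    a*x^2 + b*x*y + c*y^2 + d*x + e*y = 1 to contour points.
    xs, ys: 1D NumPy arrays of point coordinates. Returns (a, b, c, d, e)."""
    A = np.column_stack([xs**2, xs * ys, ys**2, xs, ys])
    coef, *_ = np.linalg.lstsq(A, np.ones_like(xs), rcond=None)
    return coef
```

For points sampled on a circle of radius 2 centered at (1, 1), the fit recovers 0.5·x² + 0.5·y² − x − y = 1, i.e., the same circle.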

Experimental Setup
2.3.1. Dataset. Our dataset consists of 1964 TPU images of 104 volunteers during labor. The images, with a resolution of 800 × 652 in BMP format, were retrospectively collected from the Zhujiang Hospital of Southern Medical University between 2020 and 2021. TPU examinations were performed in standard B-mode using Esaote ultrasound systems. The dataset was divided into two parts (1045 standard plane images and 919 nonstandard plane images) to generate the image-level labels for classification. For the standard plane images, three types of annotations were created by a team of four expert sonographers and then manually validated: the regions of the FH and PS for image segmentation, the two endpoints of the PS for key-point positioning, and the AoP.

2.3.2. Preprocessing. The image dataset was randomly divided into training, validation, and testing sets in a ratio of 5 : 2 : 3; the standard plane set (1045 images) and the nonstandard plane set (919 images) were each split with the same ratio. Since each patient had multiple images, the split was patient-wise, so that all images from a patient fell into only one of the three sets. Furthermore, we adopted a two-stage training strategy for our MT-Unet: both standard and nonstandard plane images were used for Task1 in the second stage, but only standard plane images were used for Task2 and Task3 in the first stage. Random rotation (−30°, 30°) and random scaling were used for data augmentation during training; the input images were resized to 416 × 384 and normalized to [−1, 1].
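The patient-wise split described above can be sketched as follows (an illustrative helper, not the authors' code; function name and data layout are assumptions):

```python
import random

def split_by_patient(image_ids_by_patient, ratios=(0.5, 0.2, 0.3), seed=0):
    """Patient-wise train/val/test split: all images of a patient land in
    exactly one subset. `image_ids_by_patient` maps patient id -> list of
    image ids; ratios follow the paper's 5:2:3 split."""
    patients = sorted(image_ids_by_patient)
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    groups = (patients[:n_train],
              patients[n_train:n_train + n_val],
              patients[n_train + n_val:])
    # Expand each patient group back into its image ids
    return tuple([img for p in g for img in image_ids_by_patient[p]]
                 for g in groups)
```

Splitting on patient ids rather than image ids prevents near-duplicate frames from the same labor leaking between training and test sets.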

Training Settings.
The adaptive moment estimation (Adam) optimizer [45] was used for optimization. We used the step learning rate scheduler (StepLR) [46] with a step size of 20, decreasing the learning rate from its initial value (0.0001) by a factor gamma (0.1). The network weights were initialized via Kaiming initialization [47] and trained for 200 epochs with a batch size of 2. All experiments were carried out with PyTorch [48] on an Nvidia Titan V GPU.
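The resulting learning-rate schedule has a simple closed form; this tiny helper reproduces the behavior of PyTorch's `torch.optim.lr_scheduler.StepLR` with the stated settings:

```python
def step_lr(epoch, base_lr=1e-4, step_size=20, gamma=0.1):
    """Learning rate under StepLR: decay base_lr by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)
```

With 200 training epochs and a step size of 20, the learning rate is decayed nine times, ending at 1e-13; in practice the later decays leave the weights essentially frozen.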

Evaluation Metrics. Different performance metrics were adopted for image classification, image segmentation, endpoint detection, and AoP calculation.
For the image classification task (Task1), we used accuracy (Acc), precision (Pre), sensitivity (Sen), and specificity (Spe):

Acc = (TP + TN)/(TP + TN + FP + FN), Pre = TP/(TP + FP), Sen = TP/(TP + FN), Spe = TN/(TN + FP),

where TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively. For the image segmentation task (Task2), we used Acc and the Dice scores of the PS (Dice_PS), the FH (Dice_FH), and both targets (Dice_ALL).
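The four classification metrics follow directly from the confusion-matrix counts; a minimal sketch (function name is my own):

```python
def classification_metrics(tp, fp, fn, tn):
    """Acc, Pre, Sen, and Spe from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    sen = tp / (tp + fn)   # sensitivity / recall
    spe = tn / (tn + fp)   # specificity
    return acc, pre, sen, spe
```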
For the endpoint detection task (Task3), we first located the two pubic symphysis endpoints by regressing Gaussian heatmaps [49] and then used the Euclidean distance (Dist) between each predicted endpoint coordinate (x_p, y_p) and its ground-truth coordinate (x_t, y_t). Two distances, Dist_L and Dist_R, evaluate the performance of Task3 for the left and right endpoints, respectively.
For AoP calculation, we evaluated the angle (APT) between the predicted line (L_p) through the predicted left endpoint (x_pl, y_pl) and predicted right endpoint (x_pr, y_pr) and its corresponding ground-truth line (L_t) through the two ground-truth endpoints (x_tl, y_tl) and (x_tr, y_tr).
In addition, the absolute value of the difference in AoP (ΔAoP) between the predicted AoP (AoP p ) and the clinically acquired one (AoP t ) is an important evaluation metric.
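The endpoint-distance and line-angle metrics can be sketched as follows (function names are my own; APT is computed here as the unsigned angle between the two PS lines):

```python
import math

def endpoint_dist(pred, true):
    """Euclidean distance between a predicted and a ground-truth endpoint."""
    return math.hypot(pred[0] - true[0], pred[1] - true[1])

def apt_degrees(pred_left, pred_right, true_left, true_right):
    """APT: unsigned angle between the predicted and ground-truth PS lines."""
    v1 = (pred_right[0] - pred_left[0], pred_right[1] - pred_left[1])
    v2 = (true_right[0] - true_left[0], true_right[1] - true_left[1])
    cosang = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))
```

ΔAoP is then simply `abs(aop_pred - aop_true)` on the two angle values.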

Comparative Experiment.
To investigate the effectiveness of the key components in our MT-Unet, we conducted a series of comparative studies. We compared our Task1 with Vgg16 [50] and Resnet50 [51] for standard plane image identification, using Acc, Pre, Sen, and Spe. Furthermore, we removed ECA, AFM, and SLF from the MT-Unet to form MT-Unet_A and compared it against two independent Unets used for segmentation and localization, to investigate the effectiveness of the multitask network. Based on the MT-Unet, we removed ECA and SLF to form MT-Unet_B, removed AFM and SLF to form MT-Unet_C, and removed SLF to form MT-Unet_D. Finally, we evaluated the performance in Acc, Dice_ALL, Dice_PS, Dice_FH, Dist_L, Dist_R, APT, and ΔAoP to investigate the effectiveness of the key components of our framework.

Performance of Standard Plane Recognition (Task1).
The performance of our MT-Unet and its variants for standard plane selection (Task1) is listed in Table 1. For Task3, the endpoint localization results of the two methods are shown in Figure 5(b); the difference was quantified via Dist_L and Dist_R (i.e., the distances between the predicted and annotated coordinates).

For AoP calculation, the effects of the two methods can be evaluated via APT and ΔAoP (Figure 5). As shown in Figure 5(a), when the ECA module is used in multitask learning, the predicted discrete regions (white and blue rectangles) inside and outside the labeled area of the FH shrink drastically, and the resulting target areas are closer to the annotated regions (e.g., #1 and #3). These differences manifest as an increase in Dice scores (Table 2).
For Task3, Dist_L and/or Dist_R were lower in most cases (#1 and #2) when the ECA module was used in the multitask method (Table 2).

Comparison of Our Method with the Existing Deep Learning Approach. To the best of our knowledge, only one prior study uses deep learning for automatic AoP measurement (Zhou et al. [33]). As shown in Figure 6(a), no discrete regions outside the labeled area of the FH were observed with our method, and the area and shape of our segmented FH are closer to the label. For endpoint detection (Figure 6(b)) and AoP calculation (Figure 6(c)), improvements were found in both APT and ΔAoP in the three cases (Table 4).

Statistical Comparison of Our Method versus Clinical Manual Measurement. The accuracy of the predicted AoP was evaluated by comparing automatic and clinical manual measurements on the test dataset of 289 images. On average, the absolute error in AoP between our method and the clinical measurement is 3.898°. The linear regression plot in Figure 7(a) shows that the AoP estimates of the two methods are linearly proportional and tightly clustered around the line of best fit y = 1.04137x − 4.71874, with a Pearson's correlation coefficient of R = 0.964. In clinical studies, the standardized difference can determine whether two results differ significantly; there was a 2.26° deviation between our method and the clinical measurement, indicating a subtle difference between the two. Figure 7(b) is a Bland-Altman plot demonstrating the interchangeability of the clinical measurement and our method.
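The agreement statistics can be reproduced with a few lines of NumPy; a sketch (function name is my own; the 95% limits of agreement use the conventional bias ± 1.96 · SD of the paired differences):

```python
import numpy as np

def agreement_stats(auto_aop, manual_aop):
    """Pearson's r plus Bland-Altman bias and 95% limits of agreement
    for paired automatic and manual AoP measurements (in degrees)."""
    auto_aop = np.asarray(auto_aop, dtype=float)
    manual_aop = np.asarray(manual_aop, dtype=float)
    r = np.corrcoef(auto_aop, manual_aop)[0, 1]
    diff = auto_aop - manual_aop
    bias = diff.mean()                 # mean paired difference
    loa = 1.96 * diff.std(ddof=1)      # half-width of the limits of agreement
    return r, bias, (bias - loa, bias + loa)
```

A constant offset between the two raters yields r = 1 with a nonzero bias, which is exactly the case a Bland-Altman plot is meant to expose.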

Discussions
Compared with digital examination, ultrasound examination provides a more accurate and repeatable assessment of the FH position and prediction of arrested labor. In clinical practice, doctors use their experience to first select one standard plane image that includes the PS and the FH and then manually identify the three key points (i.e., the two endpoints of the PS and the FH tangent point connected to the right endpoint of the PS) to calculate AoP based on the contour of the PS-FH. This process requires experience in selecting standard plane images from a large collection of TPU images, and the identification of key points based on the contour of the PS-FH may introduce errors of judgment. To overcome these disadvantages of manual measurement, we have presented a multitask deep learning model that achieves end-to-end, fully automatic measurement of AoP. The main contributions of this work are as follows: (1) to the best of our knowledge, this is the first study to achieve a fully automatic measurement of AoP; (2) we developed a multitask learning framework for standard plane recognition, PS-FH segmentation, and key-point identification, where the classification task determines the standard plane, while the segmentation and localization tasks obtain the contour of the PS-FH and the endpoints of the PS, respectively; (3) we introduced attention mechanisms and an SLF into the MT-Unet for performance improvement: the AFM unit captures cross-branch interaction to promote mutual learning between the segmentation and localization branches, the ECA module avoids dimensionality reduction and captures cross-channel interaction, and the convex shape prior loss enhances robustness against noise, which is vital for the calculation of AoP; and (4) we adopted a two-stage training approach so that each branch of the network focuses on its own task.
Several steps are needed to measure AoP: first, the standard plane images are selected; then, the contour of the PS-FH is detected; finally, the three key points on the detected contour are identified for AoP calculation. These three steps are conducted automatically by our end-to-end MT-Unet, so our method is fully automatic. Other methods, based either fully or partly on standard plane images, include traditional and deep learning approaches. In the former category, Conversano et al. [32] proposed an algorithm that manually identifies the standard plane image first and adopts a pattern tracking algorithm for subsequent sessions to calculate AoP. Youssef et al. [54] reported an AoP measurement method based on commercial software; however, the technical characteristics of the software are not explained in detail [55]. In the deep learning category, Zhou et al. [33] applied an end-to-end deep learning method to measure AoP, but this approach is not fully automatic because it does not include image classification. The high accuracy of our method is attributed to the use of ECA, AFM, and SLF. In the encoder of our network, the ECA module and Max-pooling are stacked together to capture the relation between adjacent channels and to compensate for the loss of information caused by downsampling; the performance of all branches is improved by the ECA module. In the decoders for Task2 and Task3, an AFM unit captures cross-branch interaction between the segmentation branch (Task2) and the localization branch (Task3). The segmented results include the areas of the PS and FH in Task2, while the predicted points are the two endpoints of the PS in Task3; therefore, the AFM module can promote mutual learning (see Table 2). In addition, the accuracy of our method is further improved by the SLF, since the ideal shape of the FH appears elliptic in TPU images.
We relax the elliptic-shape condition to a convex prior that enforces the segmentation result to be a convex polygon. The proposed SLF brings better results, which helps the ellipse fitting and the search for the tangent point; the difference between the predicted and ground-truth AoP is reduced by using the SLF (see Table 3). It should be mentioned that the accuracy of the AoP calculation (evaluated with ΔAoP) is higher than that of the existing deep learning method of Zhou et al. [33]. The fact that our results are more consistent with the experts' suggests that our method has the potential to be adopted in practice.
Another reason for the strong performance of our multitask network is that different loss functions were applied to different tasks, since prior works showed that the best performance is achieved only when tuning is guided by task-specific loss functions and, in turn, by different evaluation metrics. For Task1, standard plane recognition is regarded as a binary classification problem measured by accuracy; since accuracy is not differentiable, cross-entropy loss is chosen as the loss function. Dice loss is chosen for Task2, as in most prior medical image segmentation tasks; additionally, we introduce the SLF based on a convex-polygon prior, modeling the areas of the PS and FH as near-oval shapes, which helps the calculation of AoP. For Task3, endpoint detection is an object localization problem, where the Euclidean distance (e.g., MSE) is usually used to evaluate the deviation between the predicted and true objects, so MSE loss is chosen accordingly. The training curves of the MT-Unet are shown in Figure 1 of Appendix S5, which shows that our network is neither over- nor underfitting. To explore the effect of batch size, we also experimented with batch sizes of 4 and 8; since Acc, Pre, and Spe dropped as the batch size increased, we kept a batch size of 2 (details are shown in Appendix S6).
Despite its better performance, the proposed approach still has limitations to address: (1) unlike other multitask networks, a two-stage training strategy is required for our MT-Unet owing to the lack of segmentation and localization labels for nonstandard plane images; (2) the effectiveness of this method on more data remains unknown; although random rotation and random scaling were used for data augmentation during training, the precision and generalization of the method remain to be investigated on a much larger training dataset; (3) the MT-Unet has 12.47 MB of parameters and a computational complexity of 37.39 GFLOPs; we focused on accuracy without considering computational complexity, and in future research we will work on lightweight models without sacrificing accuracy; and (4) inspired by the method of Conversano et al. [32], the accuracy of our method could be further improved by considering the relevance between images of the same patient.

Conclusions
To the best of our knowledge, our method is an important step toward the fully automatic measurement of AoP. In this work, the proposed MT-Unet performs the three tasks (i.e., image classification, image segmentation, and endpoint detection) needed for AoP calculation in a parallel manner, and our proposed neural network outperformed existing deep learning results.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.