End to End Multitask Joint Learning Model for Osteoporosis Classification in CT Images

Osteoporosis is a significant global health concern that can be difficult to detect early because it often produces no symptoms. At present, the examination of osteoporosis depends mainly on methods such as dual-energy X-ray and quantitative CT, which are costly in terms of equipment and human time. Therefore, a more efficient and economical method is urgently needed for diagnosing osteoporosis. With the development of deep learning, automatic diagnosis models for various diseases have been proposed. However, building these models generally requires images that contain only the lesion areas, and annotating the lesion areas is time-consuming. To address this challenge, we propose a joint learning framework for osteoporosis diagnosis that combines localization, segmentation, and classification to enhance diagnostic accuracy. Our method includes a boundary heat map regression branch for refining the segmentation and a gated convolution module for adjusting context features in the classification module. We also integrate segmentation and classification features and propose a feature fusion module to adjust the weight of different levels of vertebrae. We trained our model on a self-built dataset and achieved an overall accuracy of 93.3% for the three label categories (normal, osteopenia, and osteoporosis) on the test set. The area under the curve is 0.973 for the normal category, 0.965 for the osteopenia category, and 0.985 for the osteoporosis category. Our method provides a promising alternative for the diagnosis of osteoporosis.


Introduction
Osteoporosis (OP) is a disease characterized by impaired bone microstructure and decreased bone mineral density (BMD). With the acceleration of population aging, OP has become an increasingly serious global health problem [1]. Fragility fracture is the most serious complication of OP [2]. OP causes more than 8.9 million fragility fractures each year worldwide [3]. In the US, fragility fractures are more than four times more common than stroke, acute myocardial infarction, and breast cancer [4]. In several developed countries, osteoporotic fractures account for longer hospitalization time than these diseases, according to a meeting of the World Health Organization [5]. By 2025, the number of fragility fractures is expected to increase from 3.5 million in 2010 to 4.5 million, a 28% increase [6]. Therefore, reliable technology for the early detection and prevention of OP is urgently needed.
Currently, although dual-energy X-ray absorptiometry (DXA) is the gold standard for measuring bone mineral density for the diagnosis of OP, it is not widely used as a screening tool for OP owing to its high cost and the limited availability of equipment [7]. To overcome these limitations, a variety of osteoporosis screening tools have emerged.
Quantitative ultrasound (QUS) is one of them and has developed into an alternative to DXA for osteoporosis screening. Its benefits include being portable and economical; however, it may not be available in all primary medical settings [8]. In addition, a variety of clinical risk assessment tools have been developed to predict osteoporosis, including the fracture risk assessment tool (FRAX), the QFracture algorithm, the Garvan Fracture Risk Calculator, and the osteoporosis self-assessment tool [9]. Unfortunately, these tools calculate the risk of fracture in patients from a combination of known risk factors and have poor efficiency.
Artificial intelligence and machine learning algorithms have recently been used in the diagnosis and prediction of osteoporosis [10]. The existing methods have achieved some success in solving the binary classification problem (osteoporosis versus nonosteoporosis), whose main purpose is to identify whether the patient has osteoporosis [11]. However, these methods also have some obvious shortcomings: (1) the existing artificial intelligence algorithms treat segmentation and classification as two separate tasks, ignoring the information fusion and complementarity between the two tasks; (2) taking the average of two lumbar cancellous bone mineral density measurements (commonly the first and second lumbar vertebrae) is widely acknowledged as the best diagnostic criterion for osteoporosis in lumbar QCT [12], yet the inputs of current models tend to be CT images of a single vertebral body, disregarding the information fusion and complementarity between multiple vertebral images; (3) class imbalance in the collected data is prevalent owing to the lack of standard public datasets; (4) most methods treat osteoporosis as a binary problem, despite the urgent need and strong incentive to turn the binary problem into a three-class problem (osteoporosis, osteopenia, and normal). Although three-class classification is more difficult, identifying osteopenia can bring some predictability to the prevention and treatment of osteoporosis. In this paper, we address the challenges above in the diagnosis of osteoporosis to facilitate the timely detection of the condition and propose an instance-based and class-based multilevel joint learning framework for bone state classification. The innovation of this method lies in the following steps. Firstly, we locate a vertebral body and remove redundant information from the image.
Secondly, by constructing the boundary heat map regression auxiliary branch, the vertebral edge is refined, and the segmentation performance is improved on the segmentation branch of the shared encoder. In addition, low-level and high-level features from the segmentation branch and the auxiliary branch, including the shape and boundary of the vertebral body, are fused with feature layers from the diagnostic classifier. Finally, considering the different effects of different vertebral bodies on the classification results of bone state, we design a feature fusion module to adaptively learn feature fusion weights. The proposed method is novel because it addresses the challenges of high dimensionality, multimodality, and multiclassification associated with osteoporosis diagnosis, and these challenges have not been resolved in earlier methods. The contributions of the research are as follows:
(i) A joint learning framework is proposed to segment vertebral bodies from CT images and classify bone states (normal, osteopenia, and osteoporosis)
(ii) An instance-based and class-based data sampling balancing strategy is introduced to solve the problem of poor model prediction caused by imbalanced data in the training datasets
(iii) A boundary heat map regression branch is proposed, which uses the Gaussian function to perform "soft labeling," accelerating network convergence and improving the performance of vertebral segmentation in joint learning and single-task learning environments
(iv) The effectiveness of segmentation features in guiding a deep classification network is verified by hierarchically fusing the features of the decoder and classifier related to two segmentation tasks
(v) A feature fusion module is proposed to adaptively learn the feature weights of vertebrae 1 and 2 and balance the influence of the two vertebral images on classification results
To our knowledge, there are many studies [13][14][15][16] on the classification of bone status using vertebral images, but there are few studies on multitask joint learning and detection of bone status based on soft tissue window images at the central level of the lumbar 1 and lumbar 2 vertebrae. Experimental results show that multitask joint learning can improve the accuracy of disease classification.

Related Works
In this section, we briefly review the related research on bone state classification, grouping it into three subareas that reflect current research on bone state in medical images: vertebral positioning, vertebral CT image segmentation, and vertebral CT image classification.

Vertebral Positioning.
With the development of deep learning, convolutional neural networks are increasingly used for positioning tasks. However, most of these works describe vertebral recognition as a centroid point detection task. Chen et al. used the high-level features of convolutional neural networks to represent vertebrae from 3D CT volumes and eliminated misplaced centroid detections with a random forest classifier [17]. Dong et al. iterated the centroid probability map of a convolutional neural network using a message-passing scheme based on the relationship between the centroids of the vertebrae and used sparse regularization to optimize the localization results, obtaining a pixel-level probability for each vertebral centroid [18]. However, it may be more meaningful to directly identify the labels and bounding boxes of vertebrae (rather than the probability map of the centroid point). Zhao et al. proposed a category-consistent self-calibration recognition system to accurately predict the bounding boxes of all vertebrae, improving the discrimination of vertebral categories and the self-awareness of false positive detections [19]. All of these methods identify the vertebrae from the coronal plane, whereas what we want is a small image from the transverse view that contains only the vertebra.

Vertebral Segmentation.
Recently, machine learning has been increasingly used in the recognition and segmentation of vertebral bodies. Michael Kelm et al. used iterative variants of marginal space learning to find the bounding boxes of intervertebral discs and utilized Markov random fields and graph cuts to initialize and guide the segmentation of the vertebrae [20]. Zukić et al. employed the AdaBoost-based Viola-Jones object detection framework to find the bounding boxes of the vertebrae and then segmented them by expanding a mesh from the center of each vertebra [21]. Chu et al. applied random forest regression to detect the vertebral centers and used these to define target regions for the segmentation of the vertebrae with random forest voxel classification [22]. Although these methods can find certain vertebral bodies with specific appearances, they still need some parameters to be set empirically and fail to deal with complex pathological cases. In contrast, many recent segmentation methods are based on deep learning, using convolutional neural networks instead of the traditional explicit modeling of spine shape and appearance. For example, Sekuboyina et al. used a multiclass convolutional neural network for pixel labeling, segmented the lumbar spine on a 2D facet slice, and estimated the bounding boxes of the lumbar region using a simple multilayer perceptron to identify regions of interest in the image [23]. Janssens et al. relied on two consecutive networks to realize this task. First, they used a regression convolutional neural network to estimate the bounding box of the lumbar region and then used a classification convolutional neural network to perform voxel labeling in the bounding box to segment the vertebral body [24]. Mushtaq [27]. Tafraouti et al. extracted features from X-ray images and used a support vector machine model to identify osteoporosis, which can distinguish osteoporosis patients from normal people well [28].
Kilic and Hosgormez studied the identification of osteoporosis based on a random subspace method and a random forest ensemble model. Jang et al. used a deep learning method to identify osteoporosis [29]. In the internal and external test sets, the area under the curve (AUC) of osteoporosis screening was 0.91 (95% confidence interval (CI), 0.90-0.92) and 0.88 (95% CI, 0.85-0.90), respectively. The experimental results illustrate that chest radiographs combined with deep learning models may be used for opportunistic automatic screening of osteoporosis patients in the clinical environment [30]. In a more recent study, Xue et al. labeled the L1-L4 vertebral bodies in CT images and divided them into three categories based on bone mineral density: osteoporosis, osteopenia, and normal. The study achieved a prediction accuracy of 83.4% and a recall rate of 90.0% [31]. Dzierżak and Omiotek developed a method for diagnosing osteoporosis using spine CT imaging and deep convolutional neural networks. To address the issue of a small sample size, they pretrained their model on a large dataset, which resulted in the successful classification of osteoporosis and normal cases. This approach showed promising results for the accurate diagnosis of osteoporosis using CT scans [32]. In these methods, both the traditional machine learning algorithms and the currently popular deep learning algorithms use images containing only the region of interest as the data source. The step-by-step preprocessing is tedious, time-consuming, and inefficient. Therefore, integrating positioning, segmentation, and classification into one network should help to improve efficiency. Moreover, no research has shown that explicit or implicit features related to the first 3/4 of the vertebral body can be effectively and interpretably used in deep classification networks.

Overview.
Our proposed method aims to classify vertebral images within a joint framework to enable a more flexible diagnosis of osteoporotic lesions. To achieve this goal, as shown in Figure 1, we propose an instance-based and class-based end-to-end multitask joint learning framework. It consists mainly of a strategy for handling class imbalance and four deep learning modules: a vertebral positioning module, a vertebral segmentation module, a cascade feature extraction module combined with gated attention, and a feature fusion module. As shown in Figure 2, a new multilayer and multilevel joint learning framework is introduced, which integrates positioning, segmentation, and classification. Firstly, the target lesion (coronal vertebral body) is accurately located, and the redundant information of the image is removed by reducing the resolution (from 512 × 512 to 224 × 224). Secondly, the boundary heat map auxiliary branch is employed to refine the edge and improve segmentation performance; meanwhile, segmentation features are cascaded with the classification features to improve classification accuracy. Finally, we propose a feature fusion module, which adaptively assigns feature weights to fuse the features of lumbar L1 and lumbar L2. In multitask learning, losses of different magnitudes tend to harm the other tasks when the model leans toward fitting one task; to balance this, we assign a weight to each loss and let the neural network update these weight parameters by gradient descent.

Instance and Class-Based Sampling Methods.
In actual clinical settings, the data collected by image acquisition are unbalanced owing to the inherent difficulty of collecting labels for rare diseases or other unusual cases. Therefore, when training on extremely unbalanced data, the model is likely to be affected by the differing sizes of the categories, resulting in the underfitting of some categories, which may be ignored. At present, the methods for handling data imbalance include data resampling [33], adaptive loss functions [34], and curriculum learning [35]. Inspired by [36,37], we introduce a method to address the extreme class imbalance of our images. It combines unbalanced (instance-based) and balanced (class-based) sampling of data, and we extend the method to our practical three-category problem.
We define the training set as $D = \{(x_i, y_i)\},\ i = 1, 2, \ldots, N$, where $x_i$ is a sample and $y_i$ is its category. For a multiclass problem with $K$ categories, where category $k$ has $M_k$ samples and $N$ is the total number of samples, so that $\sum_{k=1}^{K} M_k = N$, the general sampling strategies can be described as

$$p_j = \frac{M_j^{n}}{\sum_{k=1}^{K} M_k^{n}},$$

where $p_j$ is the probability of sampling from the $j$th category. If we set $n = 0$, the probability of sampling from each category is equal to $1/K$. This is the class-based sampling method.
If we set $n = 1$, each category is sampled in proportion to its share of all samples, which is instance-based sampling. Here, we introduce a mixed sampling method based on instance and class, which is suitable for data imbalance. We denote the training dataset and sampling strategy by the symbol $(D, S)$. Instance-based sampling and class-based sampling are represented by $S_I$ and $S_C$, respectively, and the mixture produces $\hat{x}$ and $\hat{y}$, which are random convex combinations of the data and label inputs drawn under the two strategies. Here, we set $\beta = 1$. As shown in Figure 3, as $\alpha$ grows, examples from minority classes are combined with a greater weight to avoid overfitting of minority classes. Here, we set $\alpha = 0.1$ to induce a more balanced distribution of training samples by creating synthetic data points in regions where minority classes provide lower data density.
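The two sampling regimes and the mixing step can be illustrated with a minimal sketch. It assumes that the mixing coefficient is drawn from a Beta(α, β) distribution and is applied to the class-based sample, which is the reading consistent with the statement that a larger α gives minority classes more weight; the function names and the class counts in the usage note are hypothetical and not taken from the paper.

```python
import random

def class_sampling_probs(counts, n):
    """p_j = M_j**n / sum_k M_k**n (n=0: class-based, n=1: instance-based)."""
    weights = [m ** n for m in counts]
    total = sum(weights)
    return [w / total for w in weights]

def balanced_mix(x_inst, y_inst, x_cls, y_cls, alpha=0.1, beta=1.0, rng=random):
    """Convex combination of an instance-sampled and a class-sampled example.

    Assumption: lam ~ Beta(alpha, beta) weights the class-based sample,
    so small alpha keeps the result close to the instance-based sample.
    """
    lam = rng.betavariate(alpha, beta)
    x_hat = [lam * c + (1.0 - lam) * i for i, c in zip(x_inst, x_cls)]
    y_hat = [lam * c + (1.0 - lam) * i for i, c in zip(y_inst, y_cls)]
    return x_hat, y_hat, lam
```

For example, `class_sampling_probs([700, 250, 98], 0)` gives equal probabilities for the three classes, while `n = 1` reproduces the class frequencies. Labels passed to `balanced_mix` are assumed to be one-hot vectors, so the mixed label remains a valid probability vector.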

Vertebral Positioning Module Based on YOLOv3.
The basic step of vertebral CT image classification is to extract robust features from CT images, where the width W and height H of the original images are both 512 pixels. To remove redundant features, we use YOLOv3 [38] to locate the vertebral body in the image, with an input of size 512 × 512 × 3. The image features are extracted by DarkNet-53, and then target classification and position regression are performed on the resulting feature maps with the help of the feature pyramid network (FPN) structure.
In this study, we obtain the position of the prediction box in the original image, $(p_x, p_y, p_w, p_h)$. In YOLOv3, a set of anchor frames is composed of nine initial frames of different sizes. Assuming that the center coordinates, width, and height of an anchor frame are expressed as $(a_x, a_y, a_w, a_h)$, $(p_x, p_y, p_w, p_h)$ can be obtained by reverse calculation from the regression parameters $(t_x, t_y, t_w, t_h)$ output by the network. Details of the calculation formula are as follows:

$$p_x = a_x + \sigma(t_x), \quad p_y = a_y + \sigma(t_y), \quad p_w = a_w e^{t_w}, \quad p_h = a_h e^{t_h},$$

where $\sigma(\cdot)$ represents the sigmoid transformation of the variable, which keeps the offset of the center point between 0 and 1.
The main purpose of employing YOLOv3 is to obtain the center coordinates $p_x$ and $p_y$ of the prediction box and to use this position as the center of the crop around the vertebral body, yielding a 224 × 224 image that contains the complete vertebral body as the input of the subsequent convolution module. In this way, we can remove tens of thousands of useless features and improve the efficiency of the model.
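The decoding and cropping steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the decoding follows the standard YOLO form with the anchor centre standing in for the reference position, and clamping the crop window to the image border is an assumption added here because the text does not say how boxes near the edge are handled. The function names are hypothetical.

```python
import math

def decode_box(t, anchor):
    """Map regression outputs (tx, ty, tw, th) to a box (px, py, pw, ph)."""
    tx, ty, tw, th = t
    ax, ay, aw, ah = anchor

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    px = ax + sigmoid(tx)      # centre offset constrained to (0, 1)
    py = ay + sigmoid(ty)
    pw = aw * math.exp(tw)     # width and height scale the anchor size
    ph = ah * math.exp(th)
    return px, py, pw, ph

def crop_window(px, py, size=224, width=512, height=512):
    """(left, top, right, bottom) of a size x size crop centred on (px, py).

    Assumption: the window is shifted back inside the image when needed.
    """
    half = size // 2
    left = min(max(int(round(px)) - half, 0), width - size)
    top = min(max(int(round(py)) - half, 0), height - size)
    return left, top, left + size, top + size
```

With a predicted centre at (256, 256), `crop_window` returns the window (144, 144, 368, 368), which reduces the input from 512 × 512 to 224 × 224 pixels.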

Boundary Regression Auxiliary Branches.
We suggest dividing the segmentation task into two tasks: vertebral segmentation and contour determination. Thus, our network is mainly composed of a weight-sharing encoder and two decoders, namely, the segmentation branch and the boundary regression branch. In the encoder, we improve the original U-Net [39] by replacing the original two convolutions with residual blocks. In the decoder stage, we cascade the penultimate features from the boundary regression branch with the penultimate features of the segmentation branch, helping the network to better perceive and refine the vertebral contour. Since vertebrae in CT images may show hyperosteogeny or other conditions, it is necessary to reconstruct edges by constructing auxiliary tasks, which provide more explicit and implicit topological priors for the coding layer and enable them to assist the segmentation branch in obtaining more accurate target masks. The problem of boundary inaccuracy is rooted in the similarity of information in the corresponding receptive fields of pixels. When similar features lie in the interior or exterior of the segmented region, this similarity is advantageous; conversely, similar information lying on the segmentation boundary increases the uncertainty of the edge. For the boundary regression auxiliary branch in the segmentation module, we propose to delineate the edge based on the region and graph of the whole image, combining spatial proximity with pixel value similarity. In this paper, the accurate boundary of vertebral segmentation should be the inner boundary.
We combine the convolutional neural network with the level set, taking the segmentation result obtained by the neural network as the prior knowledge of level set segmentation; then we construct a gray level constraint term on the original level set function and improve the edge indicator function to deal with uneven intensity in the image.

Improve the Edge Indicator Function.
Chan and Vese proposed the well-known Chan-Vese (CV) model in 2001 [40]. This method uses a region-based segmentation strategy to divide the image into two homogeneous regions, the inner and outer regions, and evolves an active contour to find the segmentation that differs least from the original image, thereby minimizing the energy function.
Given the input image $I(x, y)$, the CV-based energy function, formula (4), is expressed in terms of the following quantities: $C_1$ and $C_2$ describe the average gray levels inside and outside the contour, respectively; $\Omega_1$ and $\Omega_2$ represent the inner and outer regions of the contour; $g$ is the edge indicator function, which can be used to prevent the curve from exceeding the target area; $G$ is the Gaussian operator; $\sigma$ is the standard deviation; and $\delta$ and $H$ represent the Dirac and Heaviside functions, respectively. The position of the contour $C$ and the unknowns $C_1(\phi)$ and $C_2(\phi)$ are finally obtained by optimizing formula (4).
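For orientation, one commonly used edge-weighted form of the Chan-Vese energy that is consistent with the symbols defined above is sketched below. The weights $\mu$, $\lambda_1$, and $\lambda_2$ and the specific form of $g$ are assumptions introduced here for illustration, and the formula used in the paper may differ in detail.

```latex
E(C_1, C_2, \phi) =
    \mu \int_{\Omega} g \, \delta(\phi) \, |\nabla \phi| \, dx \, dy
  + \lambda_1 \int_{\Omega} |I(x, y) - C_1|^2 \, H(\phi) \, dx \, dy
  + \lambda_2 \int_{\Omega} |I(x, y) - C_2|^2 \, \bigl(1 - H(\phi)\bigr) \, dx \, dy,
\qquad
g = \frac{1}{1 + |\nabla (G_\sigma * I)|^2}
```

Here $\Omega = \Omega_1 \cup \Omega_2$, and $H(\phi)$ selects the inner region, so the second and third terms penalize deviation from the mean gray levels inside and outside the contour, while the first term penalizes contour length away from strong edges.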
The evolution of the CV model is constrained by global gray-level information. However, most images, especially medical images, have uneven intensity. To solve this problem, we improve the function $g$ and construct gray-level information constraint terms to constrain the evolution direction. Bilateral filtering is a method that combines the spatial proximity of pixels with the similarity of pixel values. Building on Gaussian filtering, bilateral filtering introduces the gray values of pixels into the local weighted average. When smoothing the speckle noise of images, bilateral filtering can better preserve edge features.
In the first step, the Gaussian function $G_{sr}(x, y, \sigma)$ is used to construct a bilateral filter to obtain a smooth image. The image $I(x, y)$ is filtered using the bilateral filter operator $g(x, y) = G_{sr}(x, y, \sigma) \cdot I(x, y)$, where $\sigma_r$ is the standard deviation used to control the smoothness, and $i, j, k, l$ are the weight coefficients.
In the second step, the optimal threshold $T$ is calculated from the filtered image using the adaptive threshold principle. $T$ is the value that maximizes the interclass variance:

$$T = \arg\max_{T} \; w_0 w_1 (u_0 - u_1)^2,$$

where $w_0$ represents the ratio of pixels in the target area to the image, $u_0$ represents the corresponding average gray level, $w_1$ is the proportion of background pixels, and $u_1$ is the average gray level of background pixels. Then, the new edge indicator function $g_r$ is constructed from the filtered image and the threshold $T$.

Auxiliary Branch.
We take the segmentation results of the convolutional neural network as prior knowledge, namely, the initial contour of the level set, and the curve contour evolved through the level set is used to guide the neural network to optimize toward the edge of the vertebral body. The gray level constraint $Q$ is expressed in terms of the following quantities: $I_{high}$ is the upper limit of the vertebral gray value obtained by using the convolutional neural network model, $I_{low}$ is the lower limit of the vertebral gray value, $\sigma$ is the average of the vertebral gray value, $\eta$ is the variance of the vertebral gray value, and $w$ is a constant. The function of the gray level information constraint term is to make the level set curve evolve inside the vertebral body to approximate the inner edge contour. When the gray value of a pixel is within the upper and lower limits of the initial vertebral gray value, the energy value of the point is negative; otherwise, it is positive. The edge result obtained by the neural network is used to replace $x$ and $y$ on the initial contour plane. Gradient descent is used to minimize the energy function, which gives the final evolution equation (9) after the gray constraint function is added.
For the labels of the auxiliary branch, we use the Canny operator to detect the edges of the binary label image. Canny is built on a two-dimensional convolution. To improve the calculation speed of the Canny operator, the two-dimensional convolution can be decomposed into two one-dimensional filters, each of which is convolved with the image to give the responses $E_x$ and $E_y$. Then, the gradient amplitude $A(x, y)$ and the gradient direction $a(x, y)$ can be expressed as

$$A(x, y) = \sqrt{E_x^2 + E_y^2}, \qquad a(x, y) = \arctan\left(\frac{E_y}{E_x}\right).$$

The size of the Gaussian window is adjusted by changing the standard deviation $\sigma$ of the Gaussian function. We first apply nonmaximum suppression and then segment the image through the dual-threshold method. When the gradient of a pixel is greater than the upper threshold, it is considered an edge pixel.
Then, we construct a soft label heat map $G_{bd}$ in the form of Heatsum based on the processed images, using the Hadamard product ○; note that $G_{bd}$ is normalized to [0, 1].
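The idea of a Gaussian "soft label" around the vertebral boundary can be illustrated with a small sketch. It is a simplified stand-in and not the paper's Heatsum construction: the boundary is taken as the inner boundary of the binary mask (foreground pixels that touch background) instead of a Canny edge map, and each pixel receives a Gaussian of its distance to the nearest boundary pixel. The function names and the default `sigma` are hypothetical.

```python
import math

def boundary_pixels(mask):
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    edge = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    edge.append((y, x))
                    break
    return edge

def soft_boundary_heatmap(mask, sigma=1.5):
    """Heat map equal to 1 on the boundary, decaying with distance to it."""
    h, w = len(mask), len(mask[0])
    edge = boundary_pixels(mask)
    heat = [[0.0] * w for _ in range(h)]
    if not edge:
        return heat  # empty mask: no boundary, all-zero target
    for y in range(h):
        for x in range(w):
            d2 = min((y - ey) ** 2 + (x - ex) ** 2 for ey, ex in edge)
            heat[y][x] = math.exp(-d2 / (2.0 * sigma ** 2))
    return heat
```

Because the Gaussian peaks at 1 on the boundary itself, the result already lies in [0, 1], and a regression branch can be trained against it with a mean square error loss. The brute-force distance search is only suitable for small examples.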
Here, the boundary regression branch is utilized to refine the segmented edges. We treat this branch as a regression task trained with mean square error rather than as a boundary segmentation task combined with the segmentation branch.

Cascading Classification Module.
In the classification module, we use ResNet-101 as the basic feature extractor. ResNet [41] is a classic deep convolutional neural network in which the residual structure is used in the shallow network. The corresponding structure is illustrated in Figure 4(b). By adding the input value x to the output of the unit, followed by the ReLU activation, the residual block achieves better convergence. These steps can be approximated as an identity mapping with equal input and output, which effectively alleviates the decline in learning ability, vanishing gradients, and exploding gradients that occur as the number of layers in a convolutional neural network increases.
Inspired by gating attention [42] and the residual structure, we designed a gated residual module, as shown in Figure 4, to replace the first convolution module in each stage of ResNet-101 from conv2_x to conv5_x. The specific network parameters can be found in Figure 5. The gated residual module can be described as follows.
Assuming that $x \in \mathbb{R}^{C \times H \times W}$ is the activation feature of the convolutional neural network, where $H$ and $W$ are the height and width of the image and $C$ is the number of channels, the gating attention in general performs the following transformation:

$$\hat{x} = F(x \mid \alpha, \gamma, \beta), \qquad \alpha, \gamma, \beta \in \mathbb{R}^{C}.$$

Here, $\alpha$, $\gamma$, and $\beta$ are trainable parameters. The embedding weight $\alpha$ is mainly responsible for adjusting the embedding output, and the gating weight $\gamma$ and the bias weight $\beta$ are responsible for adjusting the gating activation. They determine the behavior of gated attention in each channel.
For the specific process, given the embedding weight $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_C]$, the module can be defined as

$$s_c = \alpha_c \lVert x_c \rVert_2 = \alpha_c \left\{ \left[ \sum_{i=1}^{H} \sum_{j=1}^{W} \left( x_c^{i,j} \right)^2 \right] + \epsilon \right\}^{1/2},$$

$$\hat{s}_c = \frac{\sqrt{C}\, s_c}{\lVert s \rVert_2} = \frac{\sqrt{C}\, s_c}{\left[ \left( \sum_{c=1}^{C} s_c^2 \right) + \epsilon \right]^{1/2}}, \tag{14}$$

where $\epsilon$ is a small constant, mainly used to avoid differentiation at zero. Equation (14) is used to normalize channels and likewise contains a small constant. The factor $\sqrt{C}$ normalizes the scale of $s_c$, preventing $\hat{s}_c$ from becoming too small when $C$ is large, and $\alpha_c$ is a trainable parameter that controls the weight of each channel. When $\alpha_c$ is close to 0, the channel does not participate in channel normalization.
Then, given the gating weight $\gamma = [\gamma_1, \gamma_2, \ldots, \gamma_C]$ and the gating bias $\beta = [\beta_1, \beta_2, \ldots, \beta_C]$, the gating function can be written as

$$\hat{x}_c = x_c \left[ 1 + \tanh\left( \gamma_c \hat{s}_c + \beta_c \right) \right].$$

Each original channel $x_c$ is adapted by its corresponding gate; $\gamma$ and $\beta$ are trainable weights and biases that control the activation of the gate.
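The three stages of the gate (embedding, channel normalization, gating) can be traced in a short sketch. It is written in plain Python on nested lists for readability and assumes the tanh-based gate of the gated channel transformation that the cited gating attention appears to follow; the function name and the default `eps` are hypothetical.

```python
import math

def gated_channel_transform(x, alpha, gamma, beta, eps=1e-5):
    """Apply a channel gate to x, a list of C channels (each H x W nested lists)."""
    num_channels = len(x)
    # Embedding: s_c = alpha_c * L2 norm of channel c (eps keeps the root stable).
    s = [
        alpha[c] * math.sqrt(sum(v * v for row in x[c] for v in row) + eps)
        for c in range(num_channels)
    ]
    # Channel normalisation: s_hat_c = sqrt(C) * s_c / ||s||_2.
    norm = math.sqrt(sum(v * v for v in s) + eps)
    s_hat = [math.sqrt(num_channels) * v / norm for v in s]
    # Gating: x_hat_c = x_c * (1 + tanh(gamma_c * s_hat_c + beta_c)).
    gates = [
        1.0 + math.tanh(gamma[c] * s_hat[c] + beta[c])
        for c in range(num_channels)
    ]
    return [[[v * gates[c] for v in row] for row in x[c]] for c in range(num_channels)]
```

When the gating weight and bias are zero, every gate equals 1 and the module reduces to an identity mapping, which is one reason this kind of gate can be inserted into a residual block without disturbing the initial behavior of the network.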
Two 1 × 2048-dimensional feature vectors, one for each vertebra, are obtained by flattening the feature maps.

Feature Fusion Module.
As mentioned above, the detection of bone status is based on the average of lumbar L1 and lumbar L2. To account for the different effects of different lumbar vertebrae on classification, we adaptively learn the fusion weights $W_1$ and $W_2$ for the two vertebrae, which satisfy $W_1 + W_2 = 1$.
Specifically, we calculate $W_1$ and $W_2$ from $F_{fuse}(X_1)$ and $F_{fuse}(X_2)$, respectively, where $F$ represents a two-layer perceptron, that is, two fully connected layers. The following softmax layer can be used to eliminate the influence of different feature dimensions. After the fused feature $X_{fuse}$ is obtained, the prediction of bone state $P(M \mid I_N)$ is given by a fully connected layer and a softmax function.
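A minimal sketch of the fusion step is given below. It assumes that $F_{fuse}$ produces one scalar score per vertebra and that the fused feature is the weighted sum $X_{fuse} = W_1 X_1 + W_2 X_2$; neither detail is stated explicitly in the text, and the scores here are plain inputs standing in for the outputs of the two fully connected layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(x1, x2, score1, score2):
    """Weighted sum of two feature vectors with softmax-normalised weights.

    Assumption: score1 and score2 stand in for F_fuse(X1) and F_fuse(X2).
    """
    w1, w2 = softmax([score1, score2])
    fused = [w1 * a + w2 * b for a, b in zip(x1, x2)]
    return fused, (w1, w2)
```

The softmax guarantees that the two weights are positive and sum to 1, so equal scores reduce the fusion to a plain average of the L1 and L2 features.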

Cascading Classification Models.
To balance the impact of the different magnitudes of multiple tasks in the training process, we introduce the trade-off parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, and $\lambda_5$ to balance these four tasks. The total loss function of multitask learning is defined in terms of the following quantities: $p_i^{I}$, $p_i^{c2}$, $p_i^{s}$, $p_i^{s}$, $p_i^{b}$, and $p_i^{c3}$, respectively, represent the predicted results of the positioning branch, category branch, confidence branch, and segmentation branch of the positioning module for a given input image, the boundary heat map regression branch, and the classification network. $S$ represents the sigmoid function, $t$ represents the prediction box result, and $q_{cla1}$ is the category result in the positioning module. $q$ represents the probability that a vertebral body exists, $G_{bd}^{n}$ represents the normalized result of $G_{bd}$, and $q_{cla2}$ is the expected result of the classification network.

Dataset and Preprocessing.
To assess the effectiveness and benefit of the joint learning framework in bone state classification, we conducted experiments on a dataset obtained from the Nantong First People's Hospital from May 2021 to May 2022, consisting of CT images of 1048 routine-dose cases. All images were collected with an Ingenuity Core 128 CT scanner (Philips Health Care, Holland), the tube voltage was 120 kV, the inpatient tube current modulation technique was used, and iDose 4 was used to reconstruct the cross-sectional images of the mediastinal window (standard B reconstruction algorithm). The reconstruction layer thickness and layer interval were both 2 mm. The longitudinal window images of the lumbar 1 and lumbar 2 center planes of each subject were selected for BMD measurement and deep learning model construction. The QCT pro4 software (Mindways, CA, USA) was used to place a region of interest (ROI) of the same size in the central cancellous bone area of the lumbar 1 and lumbar 2 vertebral bodies, avoiding the cortical bone and the visible vascular area. The software automatically calculated the BMD values of the lumbar 1 and lumbar 2 vertebral bodies, and their mean was used as the BMD value of the individual subject (individual BMD). According to the standard recommended by the "expert consensus on imaging and bone mineral density diagnosis of osteoporosis," an individual BMD > 120 mg/cm³ indicates normal bone mass, 80 mg/cm³ ≤ individual BMD ≤ 120 mg/cm³ indicates osteopenia, and an individual BMD < 80 mg/cm³ indicates osteoporosis.
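The labeling rule above can be written out directly. The sketch below takes the two QCT measurements in mg/cm³, averages them, and applies the thresholds quoted from the consensus standard; the function name and the string labels are illustrative.

```python
def bone_state(bmd_l1, bmd_l2):
    """Classify bone state from the mean BMD (mg/cm^3) of L1 and L2."""
    bmd = (bmd_l1 + bmd_l2) / 2.0
    if bmd > 120.0:
        return "normal"
    if bmd >= 80.0:
        return "osteopenia"   # 80 <= BMD <= 120
    return "osteoporosis"     # BMD < 80
```

For example, measurements of 70 and 85 mg/cm³ average to 77.5 mg/cm³ and are labeled as osteoporosis, even though the second vertebra alone would fall in the osteopenia range.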
We divide the dataset into training data (50%), validation data (10%), and test data (40%); the class distribution of the training, validation, and test datasets is shown in Figure 6. These three datasets do not have any overlapping images, and the CT images of each category are distributed across the three datasets in strict proportion. Then, all images are resized to 512 × 512, and each image is normalized from [0, 255] to [0, 1] before being fed into the network.
To increase the amount of training data and improve the generalization ability and robustness of the model, we augment the image data by flipping, rotating, and scaling, on top of the instance-based and class-based data balancing strategy.

Implementation of Framework.
We implemented the joint learning framework in Python 3.6.12 with the PyTorch framework on two NVIDIA GeForce 3090Ti GPUs. We train the framework with the SGD optimizer for 300 epochs, with learning rates ranging from 10e−1 to 10e−5, and add six adaptive parameters to the SGD optimizer to weight the losses of the multitask learning.

Measurements.
Based on previous work [49][50][51][52], accuracy, sensitivity, specificity, and F1-score were used to evaluate classification performance. Accuracy is the ratio of the number of samples correctly classified by the classifier to the total number of samples. Sensitivity is the proportion of positive cases correctly identified by the classifier among all positive samples. Specificity is the proportion of negative cases correctly identified by the classifier among all negative samples. The F1-score is the harmonic mean of precision and sensitivity. In this paper, the three-category problem is evaluated as a set of two-category problems; that is, the category under study is treated as the positive class and the other categories as the negative class. Based on previous works [53][54][55], we use the intersection over union (IOU) and Dice coefficient (Dice) to evaluate the segmentation task and the average precision (AP) to evaluate the positioning task.
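The one-vs-rest evaluation described above can be sketched as follows. The sketch assumes a confusion matrix whose rows are true classes and whose columns are predicted classes, and uses the standard definition of the F1-score as the harmonic mean of precision and sensitivity; the function name and the example matrix are hypothetical and are not results from this study.

```python
def one_vs_rest_metrics(confusion, positive):
    """Binary metrics for class `positive`, treating all other classes as negative.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[positive][positive]
    fn = sum(confusion[positive]) - tp
    fp = sum(confusion[i][positive] for i in range(n_classes)) - tp
    tn = total - tp - fn - fp

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical 3-class matrix (rows: true, columns: predicted).
example = [[50, 5, 0],
           [3, 40, 2],
           [0, 3, 45]]
osteopenia_metrics = one_vs_rest_metrics(example, positive=1)
```

Calling the function once per class index yields the per-category values that are averaged or tabulated.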

Results.
We use ten-fold cross-validation to calculate the average results and show the performance of the joint framework in Table 1. We set learning rates from 10e−1 to 10e−5 to evaluate the classification performance of the joint framework in different situations. We used normal ... In addition, we compare the best results of joint learning with state-of-the-art baselines. The comparison results are reported in Table 2, where the best performance is shown in bold. As input to the other classification methods, we use CT images (512 × 512) that contain only the regions of interest, generated from labels manually drawn by physicians. For a more intuitive comparison of classification performance, we visualize the results with a confusion matrix. As shown in Figure 7, in identifying osteopenia, joint learning performs well, with only 5 cases misclassified as normal, 2 cases misclassified as osteoporosis, and 8 cases misclassified as osteopenia; in identifying osteoporosis, only 3 cases were misclassified as osteopenia. This result suggests that the model is neither overfitting nor underfitting, and that its accuracy is not inflated by a bias toward any one category.
The histograms of accuracy and F1-score are shown in Figure 8. Compared with the highest accuracy among the advanced baseline methods, the accuracy of joint learning increases by 6.2% in the osteopenia category, 3.3% in the normal category, and 0.1% in the osteoporosis category. Notably, the overall accuracy of joint learning is 3.8% higher than that of the advanced baseline methods, which again supports the effectiveness of the joint learning strategy.

ROC Curve.
To better demonstrate the classification ability of our proposed joint learning framework, we use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) as further evaluation indicators. Taking the experimental results with a learning rate of 0.01 as an example, we draw the ROC curves of the three categories in Figure 9, where the AUC for each category is also shown. The AUC is 0.965 for the osteopenia category, 0.973 for the normal category, and 0.985 for the osteoporosis category. These values support the effectiveness of joint learning in bone CT image classification tasks.
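For reference, a per-category AUC can be computed directly from predicted scores without plotting the curve, since the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The following is a minimal standard-library sketch of this one-vs-rest computation; the function name and the toy scores are hypothetical.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for class `positive` against all other classes.

    scores[i] is the predicted score of sample i for the positive class;
    labels[i] is its true class.
    """
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example with hypothetical scores for the osteoporosis class.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = ["osteoporosis", "normal", "osteoporosis", "normal"]
toy_auc = auc_one_vs_rest(toy_scores, toy_labels, "osteoporosis")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few hundred cases; a rank-based implementation would be preferable for larger sets.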

Training Convergence.
For model training, we use the accuracy and loss curves to show the trends of accuracy and model cost during training. The accuracy and loss curves of the joint learning framework with a learning rate of 0.01 are shown in Figure 10, which shows that the model reached satisfactory performance around the 150th epoch and then remained stable. These two curves demonstrate the convergence of the model and its stability in bone CT image classification. In addition, the total training time of the joint learning framework on our dataset is about 10 hours, with each epoch taking about 2 minutes. In short, the training convergence and time indicate the computational efficiency of our network.

Model Visualization.
We further use gradient-weighted class activation mapping (Grad-CAM) to visualize the decision information of the feature extraction module. Figure 11 shows that the feature extraction module focuses on different regions for the different categories (normal, osteopenia, and osteoporosis), and that the model automatically attends to the corresponding regions. For comparison with the correctly classified cases, we also list some misclassified cases in Figure 12. The focus area of the misclassified cases differs significantly from that of the correct cases in Figure 11, which may help explain the network's decision errors.
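The core Grad-CAM computation can be sketched independently of any deep learning library. Given the activations of a convolutional layer and the gradients of the class score with respect to those activations, each channel is weighted by its spatially averaged gradient, the weighted channels are summed, and a ReLU keeps only positive evidence. The sketch below assumes both inputs are nested lists of shape (channels, height, width) that have already been extracted from the network; the function name and the final scaling to [0, 1] are illustrative choices.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from per-channel activations and gradients.

    Both arguments are nested lists indexed as [channel][row][column].
    """
    rows = len(activations[0])
    cols = len(activations[0][0])

    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(r) for r in g) / (rows * cols) for g in gradients]

    # Weighted sum over channels, followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(cols)
        ]
        for i in range(rows)
    ]

    # Scale to [0, 1] for display, unless the map is all zeros.
    peak = max(max(r) for r in cam)
    if peak > 0:
        cam = [[v / peak for v in r] for r in cam]
    return cam
```

In practice the resulting map is upsampled to the input resolution and overlaid on the CT slice.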
Meanwhile, on the full test set, the AP of the positioning task is 95%, the average IOU of the segmentation task is 0.972 ± 0.125, and the average Dice is 0.983 ± 0.036. These values show that features are selected effectively in the positioning and segmentation tasks, although in some cases these features do not benefit classification.
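For clarity, the two segmentation metrics can be computed from a predicted and a reference binary mask as sketched below. The masks are assumed to be nested lists of 0/1 values of equal shape; the function name and the convention of returning 1.0 for two empty masks are assumptions made for this illustration.

```python
def iou_and_dice(pred, target):
    """Intersection over union and Dice coefficient of two binary masks."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    intersection = sum(1 for p, t in pairs if p and t)
    pred_area = sum(1 for p, _ in pairs if p)
    target_area = sum(1 for _, t in pairs if t)
    union = pred_area + target_area - intersection

    # Two empty masks are treated as a perfect match (assumed convention).
    iou = intersection / union if union else 1.0
    dice = (2 * intersection / (pred_area + target_area)
            if (pred_area + target_area) else 1.0)
    return iou, dice
```

The two metrics are monotonically related (Dice = 2·IOU / (1 + IOU)), so Dice is always at least as large as IOU for the same pair of masks.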

Ablation Experiments.
In this section, we conduct an ablation study of our method (with a learning rate of 10e−3) to verify the impact of three components: the layered fusion (LF) of segmentation and classification features, the gated convolution (GC) module, and the feature fusion (FF) module. We use the three modules individually and in combination and calculate the overall accuracy of each experiment to evaluate whether the model is improved. The quantitative results are given in Table 3, and Figure 13 shows clearly that the accuracy of the model is greatly improved. Without any of the three modules, the accuracy of the model is only 82.1%. With layered fusion of segmentation and classification features, the overall accuracy rises to 85.6%, an increase of 3.5%. With the gated convolution module, the accuracy increases by 2.8%. With feature fusion of vertebral bodies at different levels, the overall accuracy increases by 3.3%. With any two of the modules, the overall accuracy increases by 4%, 8.1%, and 10.5%, respectively. The seven additional experiments demonstrate the feasibility and effectiveness of our proposed modules in improving classification accuracy.

Grad-CAM visualization of 9 cases. Different categories lead the network to emphasize different regions, which can serve as an explanation of the network's decisions. In each pair of rows, the first row shows the L1 vertebrae and the second row shows the corresponding L2 vertebrae. The first two rows show osteoporosis cases, the middle two rows show osteopenia cases, and the last two rows show normal cases.
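The ablation grid described in this section can be enumerated mechanically: three on/off modules give eight configurations, that is, the baseline without any module plus the seven additional experiments. The short sketch below lists them and recomputes the reported gain of layered fusion; the names are illustrative, and only the two accuracies stated in the text are used.

```python
from itertools import combinations

MODULES = ("LF", "GC", "FF")  # layered fusion, gated convolution, feature fusion


def ablation_configurations(modules=MODULES):
    """All subsets of the modules, from the empty baseline to the full model."""
    return [
        combo
        for size in range(len(modules) + 1)
        for combo in combinations(modules, size)
    ]


configurations = ablation_configurations()

# Accuracies stated in the text: baseline and baseline + LF.
BASELINE_ACCURACY = 82.1
LF_ACCURACY = 85.6
lf_gain = round(LF_ACCURACY - BASELINE_ACCURACY, 1)
```

Here `configurations[0]` is the empty baseline and `configurations[-1]` is the full model with all three modules.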

Conclusion
Machine learning can help a great deal in accurately identifying osteoporosis from CT images. In this study, we propose a joint learning framework for bone state detection that integrates positioning, segmentation, and classification into an end-to-end multitask joint learning framework. The framework operates end to end, from the original input to the final output, which increases the overall fit of the model. Classification accuracy is improved by modular task fusion, global feature association, and the fusion of features from different vertebrae. We used a CT image database containing three categories of vertebrae to evaluate this method. Extensive experiments confirm that this method improves the overall accuracy from 82.1% to 93.3%, which shows the effectiveness of joint learning in bone state image classification and contributes to solving the problem of clinical diagnosis of osteoporosis.

Data Availability
The data used to support the study are included within the article.

Ethical Approval
This retrospective study was approved by the Ethics Committee of Nantong First People's Hospital (No.: 2021KT028), who waived the need for informed consent. The study protocol was implemented according to the Good Clinical Practice guidelines defined by the Helsinki Declaration and the International Conference on Harmonisation (ICH).

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Computational Intelligence and Neuroscience