CT-Based Automatic Spine Segmentation Using Patch-Based Deep Learning

CT vertebral segmentation plays an essential role in various clinical applications, such as computer-assisted surgical interventions, assessment of spinal abnormalities, and detection of vertebral compression fractures. Automatic CT vertebral segmentation is challenging because of the overlapping shadows of thoracoabdominal structures such as the lungs, bony structures such as the ribs, and other issues such as ambiguous object borders, complicated spine architecture, patient variability, and fluctuations in image contrast. Deep learning is an emerging technique for disease diagnosis in the medical field. This study proposes a patch-based deep learning approach to extract discriminative features from unlabeled data using a stacked sparse autoencoder (SSAE). 2D slices from a CT volume are divided into overlapping patches that are fed into the model for training. A random undersampling (RUS) module is applied to balance the training data by selecting a subset of the majority class. The SSAE uses pixel intensities alone to learn high-level features that distinguish image patches. Each image is subjected to a sliding-window operation to express image patches in terms of the autoencoder's high-level features, which are then fed into a sigmoid layer to classify whether each patch is a vertebra or not. We validate our approach on three diverse publicly available datasets: VerSe, CSI-Seg, and the Lumbar CT dataset. After configuration optimization, our proposed method outperformed other models, achieving 89.9% precision, 90.2% recall, 98.8% accuracy, a 90.4% F-score, 82.6% intersection over union (IoU), and a 90.2% Dice coefficient (DC). The results of this study demonstrate that our model performs consistently across a variety of validation strategies and is flexible, fast, and generalizable, making it suitable for clinical application.


Introduction
The spine plays a key role in mobility and weight transfer in the musculoskeletal system while supporting and sustaining the body and organ structure. It also protects the spinal cord from mechanical shocks and injuries. A number of techniques are used to better understand and quantify the human spine's biomechanics, including vertebral finite element modeling [1], quantitative imaging [2], spinal alignment analysis [3], and complicated biomechanical models [4]. Biomechanical changes can result in disability and severe discomfort in the short term but can have far worse long-term complications, such as an eightfold higher mortality rate due to osteoporosis. Despite their critical nature, spine diseases are frequently underdiagnosed. This necessitates a computer-aided approach to detecting such pathologies early and efficiently, allowing for their prevention and effective treatment.
Over the years, the medical imaging community has paid increasing attention to spine image analysis, and vertebrae segmentation [5] is an essential step in comprehending it. Vertebra segmentation has diagnostic significance for detecting and classifying vertebral fractures, estimating the spine curve, and recognizing spine deformities such as kyphosis and scoliosis. Aside from diagnostic purposes, these tasks help with finite element modeling analysis, biomechanical modeling, and surgical planning for metal implantations. Annotating such large structures takes a great deal of time, making manual segmentation impractical. Consistent and precise manual delineation is also difficult due to the complicated shape of posterior vertebral components and lower scan resolutions. Various problems exist in automating these tasks, including datasets with highly variable fields of view, large scans, scan noise combined with the highly correlated shapes of nearby vertebrae, fluctuating scanner settings, and multiple anomalies or pathologies.
Vertebral segmentation traditionally relied on model-based techniques, which fit a prior shape model to the spine and then deform it to conform to the spine's actual shape. Statistical shape models [6-8], geometric models [9, 10], Markov random fields (MRF) [11, 12], and active contours [13] are all used as prior shape models. Other approaches use intensities, such as a priori variational intensity models [14], level sets [15], and automatic vertebrae segmentation from shape models based on landmark frameworks [16]. Machine learning has recently become more popular for the segmentation of vertebrae. A. Suzani et al. [17] detected vertebral structures with a multilayer perceptron (MLP) and segmented them using deformable registration. Similarly, Chu et al. [18] identified and located the vertebrae using random forest regression, followed by voxel-level segmentation using random forest classification. R. Korez et al. [19] employed 3D convolutional neural networks (CNNs) to learn vertebral appearance and predict probability maps that guided a deformable model's boundaries for vertebral body segmentation.
Recent years have seen an increase in the popularity of deep learning for vertebral segmentation, and many published approaches use convolutional [20] and recurrent neural networks instead of explicitly modeling vertebral appearance and shape. The growing popularity of deep learning in vertebral segmentation and greater processing power have prompted researchers to produce promising results. A. Sekuboyina et al. [21] used a U-Net to perform patch-based binary segmentation and then denoised the low-resolution vertebral mask heat maps. In their other work [22], two different types of neural networks were used to segment the lumbar vertebrae: first, a simple multilayer perceptron is trained to regress the lumbar region localization, and then a U-Net is trained to perform multiclass segmentation. Janssens et al. [23] improved this by substituting a CNN for the multilayer perceptron and using two sequential CNNs for lumbar vertebrae multiclass segmentation. Using a two-stage iterative technique, Lessmann et al. [24] first identified and segmented lower-resolution vertebrae one after the other and then refined the low-resolution masks using a second CNN. These findings led to the development of a single-phase fully convolutional network by Lessmann et al. [25] that iteratively regressed and segmented the vertebral anatomical label; a maximum likelihood technique is used to adjust the vertebral labels after the complete scan has been segmented. A different strategy is proposed by Payer et al. [26], using a coarse-to-fine technique comprising three steps, spine localization, vertebra labeling, and vertebrae segmentation, all of which rely on purpose-built fully convolutional networks.
One limitation of the methodologies described previously was complicated network modeling. As an alternative to those approaches, we argue for further improving vertebrae segmentation outcomes by exploiting contextual high-level features that can capture more discriminative feature representations of the samples, using a regression model to segment the vertebrae in CT images. With the fast advancement of deep learning, an increasing number of deep learning techniques, specifically stacked sparse autoencoders (SSAEs), have been applied to medical images since Hinton and Salakhutdinov [27] developed the first deep autoencoder network. Shin et al. [28] used stacked autoencoders to identify organs in MRI, showing the potential of deep learning in medical image analysis. Many complex medical imaging problems have been addressed with SSAEs, for example, a CAD system to classify gastric cancer from breath samples [29], stroke lesion segmentation using sparse autoencoder (SAE) layers followed by a support vector machine (SVM) classifier [30], SSAE-based modeling for vertebrae segmentation [31], a deformable prostate segmentation method [32], nuclei detection from histopathological images of breast cancer [33], automatic vertebrae localization and identification by SSAE and structured regression forest [34], automated nucleus detection [35], and Parkinson's disease diagnosis modeling, also based on a stacked sparse autoencoder framework [36].
Based on our previous work [37], we present CT-based automatic spine segmentation utilizing patch-based deep learning, with newly proposed PE- and RUS-modules. The PE-module is used to extract overlapping image patches and label them according to a certain pixel ratio, while the RUS-module is employed to address the class imbalance problem. We tested the generalizability and flexibility of our model on three publicly available datasets (VerSe, CSI-Seg, and the Lumbar dataset) to show that it is well suited for clinical application, which was not done in the preliminary work [37]. The proposed work is a fully connected framework for high-level feature extraction using SSAE rather than convolutional feature-based representation, which extracts features from clusters of locally connected neurons via their local receptive fields using convolution and subsampling. SSAE is a two-stage encoder-decoder architecture in which the encoder encodes pixel intensities into low-dimensional features, while the decoder uses the low-dimensional attributes to reconstruct the original pixel intensities. SSAE is a fully connected network that uses a single global weight matrix to represent features, while a CNN is a partially connected model that emphasizes locality. Moreover, SSAE extracts high-level features from the bottom up in an unsupervised manner. These efficient representations enable precise image patch classification, leading to more robust CT vertebrae segmentation. Therefore, we chose to employ SSAE rather than convolutional neural networks in this work. The major contributions of this study are as follows:
(i) The PE-module is applied to extract overlapping patches from input slices of CT images. Splitting slices into patches enhances localization because the trained network is built to focus more on the patches' local details.
(ii) To classify vertebrae patches effectively (reducing false negatives), the RUS-module is used to address the class imbalance problem by sampling an equivalent number of vertebrae and nonvertebrae patches in the training phase.
(iii) In the pretraining step, an unsupervised feature learning module based on the SSAE framework is used to learn high-level features from a large number of unlabeled image patches; in the fine-tuning step, the most discriminative sets of features are then fed to a sigmoid layer to classify each patch as vertebra or not.
(iv) We designed a five-layer SSAE architecture: one input layer, three hidden layers, and one output layer (sigmoid layer). We validated our approach on three diverse publicly available datasets (VerSe, CSI-Seg, and the Lumbar dataset) to demonstrate that it is flexible, fast, and generalizable, making it suitable for clinical application.
The remainder of the paper is organized as follows: Section 2 briefly describes our proposed method, which is composed of six modules. Section 3 describes the experimentation, including datasets, performance evaluation, model training, and architecture optimization. Section 4 reports the results and discussion in detail. Finally, Section 5 concludes this work and offers suggestions for future work.

Preprocessing.
The main aim of the preprocessing step is to increase the discrimination between vertebrae and other tissues by identifying bone pixels and removing noise from the image. We applied a thresholding approach to eliminate noise artifacts from the whole CT scan: influences from the tissues around the vertebrae, noise, and imaging artifacts are reduced by automatically setting the intensity to zero outside the bone intensity range of 100 HU (Hounsfield units) to 1,500 HU. Input spine CT scans are volumetric and must be processed slice by slice. Because the pixel intensities of vertebrae in CT scans are higher than those of other tissues, the applied threshold differentiates them from soft tissues. However, vertebrae have intensities similar to other bones (such as the ribs), so we trained a deep learning model to discriminate between vertebrae and other bony structures in CT scans. Then, a Gaussian filter with a sigma value of 2 is applied to the CT images to smooth them and ensure that the image gradients are well defined and there are no intensity singularities. The data are normalized to the range 0-1.
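As a minimal sketch of this preprocessing, assuming per-slice processing of Hounsfield-unit data (the function name and per-slice min-max normalization are our assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_slice(slice_hu):
    """Threshold to the bone window, smooth, and normalize one CT slice."""
    # Zero out intensities outside the bone range of 100-1,500 HU to
    # suppress soft tissue, noise, and imaging artifacts.
    bone = np.where((slice_hu >= 100) & (slice_hu <= 1500), slice_hu, 0.0)
    # Gaussian smoothing (sigma = 2) so the image gradients are well defined.
    smooth = gaussian_filter(bone, sigma=2)
    # Normalize to the 0-1 range.
    lo, hi = smooth.min(), smooth.max()
    return (smooth - lo) / (hi - lo) if hi > lo else smooth
```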

PE-Module.
The PE-module is applied to extract n × n overlapping patches from the input CT images using a certain pixel stride (Figure 2). A 32 × 32 patch contains 1,024 pixels in total. A patch is labeled 1 (vertebra) if the proportion of vertebra pixels inside it is equal to or greater than 60%; otherwise, it is labeled 0 (background). The PE-module uses a specific pixel stride to construct overlapping patches from the 2D slices for the sliding window. It is an image partition module employed successfully in a patch-based deep learning model to improve classification accuracy during network training.
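A sketch of the PE-module on a single slice and its binary vertebra mask follows; the 32 × 32 patch size and 60% labeling rule are from the text, while the stride value and helper names are our assumptions:

```python
import numpy as np

def extract_patches(image, mask, patch=32, stride=8, ratio=0.6):
    """Extract overlapping patches and label each one vertebra/background."""
    patches, labels = [], []
    h, w = image.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            # Each patch is flattened to a 1,024-dimensional vector.
            patches.append(image[y:y + patch, x:x + patch].ravel())
            # Label 1 (vertebra) if >= 60% of the patch pixels are vertebra.
            frac = mask[y:y + patch, x:x + patch].mean()
            labels.append(1 if frac >= ratio else 0)
    return np.array(patches), np.array(labels)
```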

RUS-Module.
After the PE-module, the numbers of image patches per class were unbalanced because the area occupied by the spine in the CT scans is very small compared to the background. The classifier may be biased toward the background because most patches are labeled 0. A high sensitivity rate is preferable from a medical perspective, but on a practical level, a high false negative rate is unsuitable [38]. It is necessary to strike a balance between the sizes of the positive and negative training patch sets to solve this dilemma.
The RUS-module is applied to balance the training data by selecting a subset of the majority class (background patches). This module deletes random image patches from the majority class (Figure 3). Expressing class B as the majority and class F as the minority, the ratio of the size of the minority class to that of the majority class is defined as r, and we performed RUS on B to achieve a balanced ratio r. After the RUS-module, the balanced ratios were r(nonvertebrae patches) = 0.6 and r(vertebrae patches) = 0.4, compared with the unbalanced ratios of 0.94 and 0.06, respectively, before the RUS-module. This improves the network's accuracy and convergence rate during model training [39]. However, the testing stage does not include a balanced class of image patches.

Figure 1: Schematic depiction of the proposed approach's pipeline.
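The RUS-module can be sketched as follows, reproducing the stated 0.6 : 0.4 background-to-vertebra target ratio; the helper names are ours:

```python
import numpy as np

def random_undersample(patches, labels, seed=0):
    """Randomly drop background patches until the set holds roughly
    0.6 background to 0.4 vertebra patches, as described above."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)            # minority: vertebra patches
    neg = np.flatnonzero(labels == 0)            # majority: background patches
    n_keep = int(len(pos) * 0.6 / 0.4)           # 1.5x the minority count
    neg_keep = rng.choice(neg, size=min(n_keep, len(neg)), replace=False)
    idx = rng.permutation(np.concatenate([pos, neg_keep]))
    return patches[idx], labels[idx]
```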

L2 Regularized SSAE.
A fundamental component of the SSAE is the autoencoder (AE), composed of three layers: an input layer, a hidden layer, and an output layer. The nodes in an AE's different layers are fully connected. A multilayer neural network can be formed by stacking multiple AEs; we build the three-hidden-layer SSAE network by stacking three AEs (Figure 4). We pretrained the model using the greedy layer-wise SSAE approach. Due to the unsupervised nature of pretraining, the label (ground truth) information is not used. Let $x = (x_1, x_2, \ldots, x_n)$ denote the autoencoder input vector, $y = (y_1, y_2, \ldots, y_n)$ the reconstructed representation of x, and $z = (z_1, z_2, \ldots, z_k)$ the activation vector of the k hidden nodes. The encoder maps the input vector x to the hidden activation $z = f(w_1 x + b_1)$ using weights $w_1$ and bias $b_1$, and the intermediate hidden layer is then used to rebuild the input features on the output layer: the decoder maps the hidden latent representation z to the reconstruction $y = f(w_2 z + b_2)$ using decoding weights $w_2$ and bias $b_2$. We constructed an L2 regularized sparse autoencoder using the following cost function:

$$E(W, b) = \frac{1}{p} \sum_{k=1}^{p} \left\| x_k - y_k \right\|^2 + \lambda \, \Omega_{\text{weights}} + \beta \, \Omega_{\text{sparsity}}, \qquad (1)$$

where W (weights) and b (biases) are the AE network parameters, the first part of the equation is the mean squared error (MSE), and p is the training data sample size. The cost function's second portion is the L2 regularization on the encoding weights, $\Omega_{\text{weights}} = \frac{1}{2} \sum_{j=1}^{k} \sum_{i=1}^{n} w_{ji}^2$, where λ denotes the L2 regularization term's penalty coefficient. Sparsity regularization is the third portion of the cost function, where β is the sparsity regularization coefficient and $\Omega_{\text{sparsity}}$ is the Kullback-Leibler (KL) divergence [28], expressed as follows:

$$\Omega_{\text{sparsity}} = \sum_{j=1}^{k} \mathrm{KL}\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \sum_{j=1}^{k} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right],$$

where $\hat{\rho}_j$ is the average activation of hidden node j over the training samples and ρ is the predefined constant sparsity parameter. For an L2 regularized sparse autoencoder, the weights w and biases b can be optimized using the scaled conjugate gradient descent algorithm [40] to obtain the encoder of a sparse autoencoder. The hidden-layer output z of the layer L−1 autoencoder is taken as the input x of the layer L autoencoder. Finally, the L2 regularized SSAE is formed by stacking the multiple sparse AEs.
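For concreteness, a NumPy sketch of equation (1) for a single sparse AE is given below; the logistic sigmoid is assumed for f, and all names are ours:

```python
import numpy as np

def sparse_ae_cost(W1, b1, W2, b2, X, lam=0.10, beta=0.20, rho=0.05):
    """Evaluate the L2 regularized sparse AE cost of equation (1).
    X is (p, n): p training patches flattened to n = 1024 pixels."""
    f = lambda a: 1.0 / (1.0 + np.exp(-a))        # logistic sigmoid
    Z = f(X @ W1.T + b1)                          # encoder activations (p, k)
    Y = f(Z @ W2.T + b2)                          # reconstructions (p, n)
    mse = np.mean(np.sum((X - Y) ** 2, axis=1))   # first term: MSE
    omega_w = 0.5 * np.sum(W1 ** 2)               # L2 penalty on encoder weights
    rho_hat = Z.mean(axis=0)                      # average hidden activations
    omega_s = np.sum(rho * np.log(rho / rho_hat)  # KL-divergence sparsity term
                     + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return mse + lam * omega_w + beta * omega_s
```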

Sigmoid Regression.
As SSAE is an unsupervised learning approach, each network layer is trained using unlabeled data, and a feature vector is used to generate the input reconstruction. A classifier then uses these feature vectors to classify the input data of the stacked sparse autoencoder. We used a sigmoid regression layer to discriminate between vertebrae and nonvertebrae patches (Figure 4). An MLP or an SVM could be used instead of the sigmoid layer. However, the MLP, a feedforward neural network with several layers and a large number of nodes per layer, can get stuck in local minima due to overfitting. The SVM classifier determines whether a pixel belongs to the target or background class based on its posterior probability value, but producing a probability image by reconstructing the score vector requires considerable generalization. In contrast, the sigmoid layer enables joint optimization of the entire deep framework via fine-tuning. Sigmoid regression is a binary classification technique for supervised learning; the output probabilities estimate each class label's likelihood based on the input data. The sigmoid regression model's coefficient vector is optimized by minimizing the cost function, with the sigmoid activation defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

where σ is the sigmoid output function and x is the input. In the supervised learning stage, the pretrained SSAE and the sigmoid layer are combined into a single model for classification. Using the scaled conjugate gradient approach [40], each iteration simultaneously updates the weights of all SSAE layers and all sigmoid-layer parameters to fine-tune the whole model.
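A minimal PyTorch sketch of the combined model used at this stage is shown below; the class name and shapes are assumptions, and an off-the-shelf optimizer would stand in for the paper's scaled conjugate gradient during joint fine-tuning:

```python
import torch
import torch.nn as nn

class SSAEClassifier(nn.Module):
    """Pretrained SSAE encoders topped by a sigmoid regression layer."""
    def __init__(self, encoders, n_hidden=200):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)   # pretrained nn.Linear layers
        self.out = nn.Linear(n_hidden, 1)         # sigmoid regression layer

    def forward(self, x):
        for enc in self.encoders:                 # stacked encoder forward pass
            x = torch.sigmoid(enc(x))
        return torch.sigmoid(self.out(x))         # P(patch is a vertebra)
```

During fine-tuning, a binary cross-entropy loss over the patch labels would update all encoder weights and the sigmoid-layer parameters simultaneously, mirroring the joint update described above.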

Postprocessing.
Following training, the trained model is validated using unseen test patches. Our study addresses a two-class classification problem in which the patch labels are 0 and 1, with 1 denoting vertebrae patches and 0 denoting nonvertebrae patches. The same preprocessing is applied to the CT scans used for testing. Input image patches are given to the trained model, which returns a value between 0 and 1 that can be interpreted as the probability that an image patch belongs to a vertebra. The segmented binary image is created by reconstructing the predicted image patches from these results. Because of the similar intensities of the vertebrae, ribs, and other skeletal structures, some background pixels are misclassified as vertebrae, while some vertebrae pixels are missed from the foreground. For this reason, morphological operations [41] were applied to the binary predicted image in the postprocessing step to eliminate the outliers.
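A sketch of this postprocessing, assuming a 0.5 binarization threshold and a minimum component size; the paper states only that morphological operations [41] remove the outliers:

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_map, threshold=0.5, min_size=64):
    """Binarize the reconstructed probability image and remove outliers."""
    binary = prob_map >= threshold
    # Morphological opening removes thin spurious responses (e.g., rib edges).
    binary = ndimage.binary_opening(binary, structure=np.ones((3, 3)))
    # Drop connected components smaller than min_size pixels.
    labeled, n = ndimage.label(binary)
    sizes = ndimage.sum(binary, labeled, range(1, n + 1))
    keep_ids = np.flatnonzero(sizes >= min_size) + 1
    return np.isin(labeled, keep_ids)
```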

Experiments
3.1. Datasets. Three publicly available datasets of CT spine images were used to evaluate the proposed automated method for vertebral segmentation. Reference segmentation ground truths for all three datasets are publicly available. Figure 5 shows example images from the datasets.

Dataset 1.
The Department of Radiological Sciences at the University of California-Irvine School of Medicine acquired CT scans using multidetector CT scanners from Philips and Siemens. Thoracolumbar spine CT scans [42] were taken at a trauma center as part of the daily routine, without intravenous contrast, from 15 adults aged 16 to 35 years. The slice thickness is 1 mm, and the in-plane resolution is 0.312 to 0.336 mm. Each scan included manual segmentations of all 12 thoracic and 5 lumbar vertebrae, for a total of 180 thoracic and 75 lumbar vertebrae across the 15 subjects, which served as ground truth references. We used 5 scans to train the model and 10 scans (120 thoracic and 50 lumbar vertebrae) to test it.

Dataset 2.
The lumbar (L1-L5) spine CT dataset comprises 10 scans and the associated manual reference ground truths of the 50 lumbar vertebrae [10]. In-plane voxel sizes ranged from 0.28 to 0.79 mm, and slice thicknesses ranged from 0.72 to 1.53 mm. Each lumbar vertebra was manually segmented to create a binary mask. These scans were utilized as a training dataset.

Dataset 3.
The VerSe [43, 44] dataset was acquired at multiple sites using CT scanners from four major vendors (Philips, Siemens, Toshiba, and GE). In terms of field of view (FoV), findings, and scan parameters, the data were carefully arranged to match a clinical distribution. Data were acquired from patients with an average age of 59 (±17) years. The dataset comprises a range of fields of view (including cervicothoracolumbar, thoracolumbar, and cervical scans), a combination of sagittal and isotropic reformations, and vertebral fractures, foreign materials, and metallic implants. Manual ground truths for the cervical, thoracic, and lumbar vertebrae are included in the data. The experiments used 25 thoracolumbar (T1-T12, L1-L5) scans as training data.

3.2. Performance Evaluation.
Evaluation metrics [45] are used to compare the performance of vertebrae segmentation with that of other existing approaches. These metrics are widely used and well known in medical image analysis. In this paper, precision, recall, accuracy, F-score, intersection over union (IoU) [46], and Dice coefficient (DC) [47] serve as quantitative measures of segmentation performance [48, 49]. We counted true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) by comparing the ground truth images with the predicted segmented images:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{IoU} = \frac{TP}{TP + FP + FN}, \qquad \text{DC} = \frac{2\,TP}{2\,TP + FP + FN}.$$

3.3. Model Training. We first determined the number of epochs required in the pretraining of SSAE to ensure that the training process of the proposed model converges. Figures 6(a)-6(c) show the patch-based learning curves for the weights between the input layer and the hidden layers (pretraining learning curves of the three hidden layers). The mean square error (MSE) between the original input and the input reconstructed by the autoencoder-decoder was computed and plotted. We conducted our studies with a variety of empirical numbers of hidden nodes. These observations reveal that the learning process converges after 300 epochs across different hidden-node settings. We chose 500 epochs in the experiment pretraining to ensure SSAE convergence. Figure 6(d) shows the fine-tuning learning curve over the number of epochs after pretraining. We found the best-fit curve for our model training with an MSE of 0.025 for training and 0.029 for validation. The learning curve changes rapidly before 2,000 epochs and stabilizes after 4,000 epochs, so we chose 5,000 epochs for model fine-tuning.
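The greedy layer-wise pretraining described above can be sketched as follows (PyTorch; Adam stands in for the paper's scaled conjugate gradient, and all names are our assumptions):

```python
import torch
import torch.nn as nn

def pretrain_layer(data, n_hidden=200, epochs=500, lam=0.10, beta=0.20, rho=0.05):
    """Train one sparse AE on `data` (p x n tensor) and return its encoder."""
    enc = nn.Linear(data.shape[1], n_hidden)
    dec = nn.Linear(n_hidden, data.shape[1])
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
    for _ in range(epochs):
        z = torch.sigmoid(enc(data))                      # hidden activations
        y = torch.sigmoid(dec(z))                         # reconstruction
        mse = ((data - y) ** 2).sum(dim=1).mean()
        omega_w = 0.5 * (enc.weight ** 2).sum()           # L2 on encoder weights
        rho_hat = z.mean(dim=0).clamp(1e-6, 1 - 1e-6)     # avoid log(0)
        kl = (rho * torch.log(rho / rho_hat)
              + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
        loss = mse + lam * omega_w + beta * kl
        opt.zero_grad(); loss.backward(); opt.step()
    return enc

def pretrain_ssae(patches, widths=(200, 200, 200)):
    """Stack three sparse AEs: layer L-1's hidden output feeds layer L."""
    encoders, h = [], patches
    for k in widths:
        enc = pretrain_layer(h, n_hidden=k)
        encoders.append(enc)
        with torch.no_grad():
            h = torch.sigmoid(enc(h))                     # input for next AE
    return encoders
```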
Following a visualization model [50], Figure 7 shows the feature representations of the first, second, and third hidden layers of the five-layered SSAE. The 200 nodes in the first hidden layer represent the learned feature representations of the vertebrae and other structures, while the second and third hidden layers (200 nodes each) represent higher-level features learned from the image patches. Each square depicts the weights between one hidden node and the pixels of the original image; in the weight matrix, white pixels represent positive values, while gray pixels represent negative values.
3.4. Architecture Optimization. Next, we optimized the architecture of the proposed model. A grid search was used to optimize the number of hidden layers and the number of nodes in each layer of the SSAE.
Until now, there has been no theory to determine the optimal SSAE architecture for a particular application. Therefore, we conducted the experiments using a variety of empirical values for the numbers of hidden layers and nodes. SSAE's high-level feature representation is determined by the number of hidden nodes. Hence, we chose empirical values (100, 200, 300, 400, and 500) for the number of hidden nodes and empirical values (1, 2, 3, and 4) for the number of hidden layers. The sparsity coefficients were set to an L2 regularization penalty λ = 0.10, a sparsity constraint β = 0.20, and a target activation ρ = 0.05. For every possible combination of hidden layers and nodes, the performance metrics were calculated, and the results are shown in Figure 8.
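The grid search can be summarized by the sketch below; `train_and_evaluate` is a hypothetical helper that would wrap the pretraining and fine-tuning steps sketched earlier (a random placeholder is used here), while the grid values and sparsity coefficients follow the text:

```python
from itertools import product
import random

def train_and_evaluate(arch, lam=0.10, beta=0.20, rho=0.05):
    """Hypothetical helper: pretrain an SSAE with hidden-layer sizes `arch`,
    fine-tune with the sigmoid layer, and return the validation DC.
    A random placeholder stands in for the real training run."""
    return random.random()

depths = [1, 2, 3, 4]                    # empirical numbers of hidden layers
widths = [100, 200, 300, 400, 500]       # empirical numbers of hidden nodes
scores = {(d, w): train_and_evaluate([w] * d)
          for d, w in product(depths, widths)}
best_depth, best_width = max(scores, key=scores.get)
print(f"Selected architecture: {best_depth} hidden layers x {best_width} nodes")
```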
The 3-hidden-layer architecture with 200 nodes per layer produced the best precision (89.9%). The same design also yielded the best recall (90.2%), accuracy (98.8%), IoU (82.6%), and DC (90.2%). A 2-hidden-layer architecture with 200 hidden nodes produced the best F-score (90.5%), while the precision, recall, accuracy, IoU, and DC values of the best-performing architecture still resulted in an acceptable F-score (90.4%). Different architectures might be chosen depending on the needs of the application. We chose the three-layer architecture for SSAE based on the DC requirement, with 200 hidden nodes in each layer. Figure 9 shows the confusion matrices of the training set and a randomly selected test case separately.

Results and Discussion
4.1. Results. Our approach is developed based on SSAE for vertebrae segmentation. In experiments, our model achieved 89.9% precision, 90.2% recall, 98.8% accuracy, a 90.4% F-score, 82.6% IoU, and 90.2% DC. To show the effectiveness of the L2 regularized SSAE, our method is compared against the state-of-the-art three-layered stacked autoencoder (TSAE) model [33]. If the L2 regularization penalty coefficient λ and the sparsity regularization coefficient β, i.e., the second and third portions of the cost function in equation (1), are set to zero, the model reduces to a three-layered stacked AE (TSAE). Table 1 reports the mean precision, recall, accuracy, F-score, IoU, and DC of our approach and the comparative TSAE model. The results show the significance of the L2 regularized SSAE in our method, which performs better than TSAE on all metrics.

4.2. Discussion.
To our knowledge, this is the first time an SSAE has been used for patch-based classification in automatic vertebral segmentation across three distinct CT spine datasets. Leveraging deep learning's strength in mining large datasets, our approach demonstrated the SSAE network's strong ability to segment vertebrae automatically. Our approach has the potential to function as fully automated CAD software with minimal human intervention; no handcrafted features need to be chosen for training or analysis. This is an important property of CAD in today's fast-paced clinical settings. The SSAE neural network was used to capture high-level features from overlapping patches through unsupervised learning. In our method, vertebrae and nonvertebrae patches were effectively classified by these high-level features, and the sigmoid regression layer was then used to incorporate them to improve classification accuracy. Our proposed approach is reliable, robust, and precise, and in terms of clinical applications, it has a high level of overall performance.
SSAE is a neural network, so the convergence of the training procedure was critical to the model's classification of image patches. In an SSAE, insufficient training epochs can produce a premature network that lacks optimal performance. It is therefore necessary to conduct a convergence test to determine the correct number of epochs in deep learning. In our experiment, we used 500 epochs for pretraining; this setting ensured the training's convergence and avoided wasted time. The architecture of the neural network is another critical consideration. As stated earlier, there are no general criteria for designing a neural network's architecture; indeed, the optimal architecture is determined by the intricacy of the data being used. In our situation, the optimization experiments provide an early indication of SSAE architecture design. We also found that sparse regularization during training was necessary to build deep feature representations that positively impacted the fine-tuning phase: during training, sparsity pushes the filters to capture more detailed features from the image patches. The performance comparison indicated our approach's effectiveness and its superior capability compared with other well-known models (Table 2).

Table 2: Segmentation performance comparative analysis of the proposed approach and existing algorithms in terms of F-score, IoU, and DC.

Methods                  | F-score (%) | IoU (%) | DC (%)
Classical U-Net [51]     | 81.4        | 71.9    | 83.7
SpineParseNet [48]       | 87.6        | 77.5    | 87.3
PaDBN model [49]         | 84.9        | 75.6    | 86.1
TSAE [33]                | 82.8        | 73.9    | 85.2
Butterfly FCN model [21] | 86.4        | 76.9    | 87.0
OP-convNet [20]          | 90.2        | 82.3    | 89.9
Mask R-CNN [52]          | 70.…        | …       | …
Proposed method          | 90.4        | 82.6    | 90.2

Bold values indicate that our method achieves better results than the others.
Our proposed approach performs well in classifying the test patches into vertebrae or nonvertebrae patches and then segmenting the vertebrae by reconstructing the image patches. Each spinal level has its own set of vertebral patterns, and significant morphologic variations can be seen between two vertebrae separated by a wide spatial distance within the spinal column, such as the upper thoracic and the lower lumbar vertebrae. It is therefore challenging to achieve accurate segmentation of all the vertebrae. Our proposed model has some limitations with segmentation in the upper thoracic vertebrae (T1-T3) due to the presence of rib structures, and the L5 vertebra also obtained a lower DC than the other vertebrae. Figure 11 illustrates poor segmentation results, where high false negatives represent vertebrae regions not detected by the model, whereas false positives are background regions wrongly segmented as vertebrae.

Conclusion
This study presented a stacked sparse autoencoder framework for automated vertebrae segmentation using three distinct publicly available CT spine datasets (VerSe, CSI-Seg, and the Lumbar dataset). We used 2D image slices to extract overlapping patches for model training. The proposed model captures high-level feature representations of pixel intensities from image patches in an unsupervised fashion. A sigmoid layer efficiently classifies vertebrae and nonvertebrae patches using these high-level features. Our approach performed optimally after setting the main parameters, such as the number of hidden layers, the dimension of hidden nodes, and the number of epochs. Sparsity constraints on the hidden layers also proved effective: training with sparsity regularization is necessary to build feature representations that positively influence the final supervised tuning phase. The scheme of distinguishing vertebrae regions using image patches rather than individual pixels also decreases the rate of false positives. The method demonstrates significant potential for resolving issues caused by morphological variations of the vertebrae. When compared to other state-of-the-art vertebrae segmentation methods, our approach outperformed them in terms of segmentation accuracy. Further experiments enabled us to identify our method's limitations, specifically for fractured vertebrae. Future work will improve our approach by developing a more discriminative deep neural network design to make our method more robust in these cases.

Figure 2: Illustration of the PE-module. An example image demonstrates the patch extraction process (vertebrae and nonvertebrae patches of 32 × 32 pixels in size). The total number of patches in each image is P.

Figure 3: Illustration of the RUS-module, which randomly deletes majority-class (background) patches.

Figure 4: Stacked sparse autoencoder (SSAE) combined with a sigmoid regression layer, illustrated for classifying the vertebrae and nonvertebrae patches.

Figure 5: Examples of images from the distinct datasets. Thoracolumbar spine CT images are shown from VerSe (a), CSI-Seg (b), and the lumbar CT spine dataset (c).

Figure 6: (Pretraining) Learning curves of the three-hidden-layer SSAE framework (a-c); all hidden layers have 200 nodes each, 500 epochs were used, and no ground truths are provided. (Fine-tuning) The model's supervised best-fit learning curve (d), with an MSE of 0.025 for training and 0.029 for validation over 5,000 epochs.

Figure 7: Visualization of the feature representations learned by the first, second, and third hidden layers.

Figure 8: Proposed method's performance (precision, recall, accuracy, F-score, IoU, and DC) with diverse architectures. The columns represent the number of hidden layers (HL), the rows present the number of hidden nodes (N) in each layer, and the top results are shown in bold.

Figure 10: Examples of segmentation results: (a) axial plane images; (b) ground truth; (c) our predicted segmentations; (d) predictions overlaid on the original images.

Figure 11: Visualization of segmentation issues, where high false negatives represent vertebrae regions that the model does not detect. This occurred in the first thoracic vertebrae because of rib and intervertebral influences connected to the vertebrae; the first row shows the original images, and the second row shows the poorly predicted segmentations.

Table 1: Performance comparative analysis of our method and TSAE in terms of precision, recall, accuracy, F-score, IoU, and DC.