An Overview of Deep Learning Methods for Left Ventricle Segmentation

Cardiac diseases are among the leading causes of death worldwide, and the number of cardiac patients increased considerably during the pandemic. It is therefore crucial to assess and analyze medical and cardiac images. Deep learning architectures, specifically convolutional neural networks, have become the primary choice for the assessment of cardiac medical images. The left ventricle is a vital part of the cardiovascular system, and its boundary and size play a significant role in the evaluation of cardiac function. Because it is automatic and gives promising results, left ventricle segmentation using deep learning has attracted a lot of attention. This article presents a critical review of deep learning methods used for left ventricle segmentation from frequently used imaging modalities, including magnetic resonance imaging, ultrasound, and computed tomography. The study also details the network architectures, software, and hardware used for training, along with the publicly available cardiac image datasets and self-prepared datasets involved. A summary of the evaluation metrics and results reported by different researchers is also presented. Finally, all this information is summarized to help readers understand the motivation and methodology of various deep learning models and to explore potential solutions to future challenges in LV segmentation.


Introduction
The capability of a machine to simulate human intelligence processes is referred to as artificial intelligence. Machine learning is a subbranch of artificial intelligence based on the idea of enabling machines or computers to perform tasks without being specifically programmed. The machine learns from data and uses patterns and experience to improve its performance in making decisions on its own. In this way, the machine becomes capable of developing analytical models that adapt to new situations autonomously.
Deep learning (DL) is a subfield of machine learning built on the artificial neural network (ANN), a model inspired by the structure and function of the brain. DL is concerned with interpreting data in a manner based on the mechanism of the human brain, by developing and simulating algorithms modeled on how the brain analyzes and learns. The training data are fed into the algorithm as input, and the successive layers of the DL algorithm analyse the original data to extract the features required for the targeted task. The entire process is free of human manipulation. One of the earliest practiced DL techniques is the ANN with a deep network structure [1]. Multilayer perceptron models [2] were proposed with the rapid progress in the research areas of computer vision (CV) and human brain neurons. This led to the development of other classical models such as back-propagation neural network models, convolutional neural network (CNN) models [3], bidirectional recurrent neural networks [4], transformers [5], long short-term memory (LSTM) [6], and deep belief network [7] models.
These research findings have significantly helped the expansion of DL architectures, paving the way for their large-scale application in numerous areas, especially image processing. Image classification, image registration, object detection, and image segmentation are among the main tasks that DL algorithms perform very efficiently.
Cardiac images are among the medical images used to assess patient health, and different cardiac images are used for the analysis of cardiac function. Assessment of cardiac function plays an essential part in clinical cardiology for patient supervision, risk estimation, disease analysis, and therapy evaluation [25,26]. For cardiac diagnosis, digital images are the basic tool for computing clinical indices from the shape and structure of the heart. Within the structure of the heart, the left ventricle (LV), right ventricle (RV), and myocardium (MYO) are the main targets of assessment. The LV is a central focus of cardiac function studies and disease diagnosis. Delineation of the LV boundary is of great clinical importance for the study of heart parameters such as the ejection fraction (EF), stroke volume (SV), LV mass (LVM), end-systolic volume (ESV), and end-diastolic volume (EDV) [27].
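Two of the indices named above follow from the ventricular volumes by their standard definitions, SV = EDV − ESV and EF = SV / EDV. The following is a minimal sketch under that assumption; the function name and the example volumes are illustrative and are not taken from any of the reviewed studies.

```python
def stroke_volume_and_ef(edv_ml, esv_ml):
    """Return (stroke volume in mL, ejection fraction in percent)."""
    if edv_ml <= 0 or not 0 <= esv_ml <= edv_ml:
        raise ValueError("expected EDV > 0 and 0 <= ESV <= EDV")
    sv = edv_ml - esv_ml        # SV = EDV - ESV
    ef = 100.0 * sv / edv_ml    # EF = SV / EDV, expressed as a percentage
    return sv, ef


# Hypothetical volumes, chosen only to exercise the formulas.
sv, ef = stroke_volume_and_ef(edv_ml=120.0, esv_ml=50.0)
print(f"SV = {sv:.1f} mL, EF = {ef:.1f}%")
```

Both volumes come from the segmented LV cavity at end-diastole and end-systole, which is why segmentation errors propagate directly into these indices.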
Some studies have reviewed the segmentation of medical and cardiac images. However, to the best of the authors' knowledge, those studies did not investigate LV segmentation solely and explicitly. Given the importance of the LV, the primary focus of this review is the segmentation of the LV using DL models. This paper provides a comprehensive overview of the different DL architectures used for LV segmentation and has been carefully compiled to present the state-of-the-art DL algorithms for this task. To find quality research in the area, the Web of Science database was used as a search engine. The keywords "left ventricle", "segmentation", and "deep learning" were used to find the related papers. Articles that do not primarily address LV segmentation were excluded, because the scope of this review is limited to LV segmentation. The review covers resources published between 2018 and December 2021.
In this article, we first discuss the three imaging modalities used for LV assessment in Section 2. Section 3 presents the basic concepts related to DL and CNN. Different DL architectures used for LV segmentation are reviewed in Section 4; the section is subdivided according to the approaches combined with DL, such as preprocessing, deformable models, and clinical index calculation. The architectures, hardware, software, and datasets used for training, and the evaluation metrics used to analyze model performance, are discussed in Section 5. The complete structure of the article is depicted in Figure 1.

Medical Images for LV Assessment
Different medical imaging modalities are used for the assessment of the LV. These modalities include magnetic resonance imaging (MRI), echocardiography, computed tomography (CT), myocardial perfusion imaging, multiple gated acquisition scanning, gated blood-pool SPECT, and fusion imaging. However, the modalities most used in the literature for LV segmentation are MRI, ultrasound (US), and CT. These modalities are described in this section.

Magnetic Resonance Images.
MR imaging is a widely used technique in the cardiac armamentarium. When MRI is applied to the heart or cardiovascular system, it is officially called "cardiovascular magnetic resonance (CMR)." Its diagnostic precision has made it the gold standard for heart analysis [28].
MRI is suitable for the evaluation of heart chambers [29,30], size and blood flow through major vessels [31], heart valves [32], and the pericardium [33]. In addition, MRI is considered the reference standard for LV size and mass measurement [34,35]. Its versatility is unmatched by other imaging methods. It provides not only precise anatomic information but also functional information that helps in identifying patients at risk. Three-dimensional geometric analysis of the LV by CMR provides more appropriate information about the shape of the LV than traditional echocardiography, with high fidelity and low variability [36]. CMR has also been successful in detecting LV hypertrophy in patients with apparently normal echocardiographic results [37].
Despite these outstanding outcomes, a few important limitations of CMR need to be kept in mind. CMR is costly, has limited availability, and lacks portability, and these constraints prevent its routine use. Compared with other imaging modalities, CMR examinations are very costly and are inadvisable for patients with metallic implants such as graft stents and cardiac pacemaker devices. Cardiac MRI may not be immediately accessible in all centers, and it can be difficult to use in patients who require serial monitoring. Another criticism of CMR is the length of the examination needed to acquire LVM data. It also has some minor issues such as device incompatibilities and patient tolerance. Figure 2 shows LV segmentation in MRI images; the red area is the LV segmented using a CNN [38].
Computational Intelligence and Neuroscience

Echocardiography.
Echocardiography uses high-frequency ultrasound waves to produce anatomical images of the heart, which is why it is also referred to as ultrasound (US) imaging. It is the most widely used imaging modality for the examination of cardiovascular diseases [39]. Owing to its easy accessibility, outstanding temporal resolution, real-time imaging, and noninvasive nature, US is considered the basic imaging technique for measuring LV function. US has become the primary choice for the analysis of LVM, and a regular LVM calculation is an essential part of the US examination [40]. US imaging is also used to measure the decrease in LVM after treatment [41]. The most used US techniques are two-dimensional US (2DE) and three-dimensional US (3DE). An example of 2DE is shown in Figure 3, where the LV boundary is marked by the red line. M-mode US imaging is also in use, but its use is very limited because, among other limiting factors, it identifies only the function of the basal segment, whereas 2DE and 3DE can capture the whole LV. Using 2DE, LVM can be calculated directly. Similarly, by acquiring a pyramidal image, 3DE provides a three-dimensional image of the whole heart and its anatomy, overcoming the limitations of 2D imaging. Therefore, 3DE has gained considerable importance over 2DE and M-mode in various patient populations [42]. Unfortunately, 3DE is not commonly available and is costly compared with 2D imaging. Other limitations of US imaging are speckle noise [43] and a low contrast ratio [44], which limit its performance.

Computed Tomography.
A computed tomography (CT) scan of the heart provides cross-sectional images of the structure of the heart. It characterizes the X-ray attenuation properties of the tissues being imaged [45,46]. CT is a growing imaging method for the noninvasive assessment of heart anatomy and function, and LV size and mass can be estimated using CT. CT is also a good alternative for LV size and mass calculation in patients who have contraindications to CMR [42]. The study in [47] compares CT and US and finds that CT can also be used as an alternative to US.
Though CT has many advantages, a few constraints cannot be ignored. CT cannot be employed for real-time intraprocedural guidance because of the unavoidable exposure to ionizing radiation, whose cancer risk is a stochastic effect [48]. Repeated regular use can raise the cancer risk, and higher image quality requires a higher radiation dose. The left part of Figure 4 is a CT scan of the heart; in the right part, the LV is highlighted [49].

Deep Learning
DL can be defined as a class of machine learning algorithms based on neural networks. Neural networks with a deep structure, that is, with more than two hidden layers, are referred to as deep neural networks. A general architecture of DL is shown in Figure 4. DL is a form of representation learning (a subtype of machine learning) with multiple levels of representation [50]. Over the past several years, DL has developed into a popular tool that attracts researchers from several fields. It helps to overcome the weaknesses of traditional methods and to solve complex problems with better results. The popularity of DL is attributable to large datasets, computational performance, training techniques (ReLU), and advanced networks (CNN). With the growth of databases, DL has achieved rapid success in both industry and academia. Not only have software advances helped DL to succeed, but the latest hardware, such as graphical processing units (GPUs), has also improved its capability [51]. Deeper layers improve the system's experience by learning features from the data, building complex representations out of simpler ones [52]. It is therefore a promising approach for solving problems in areas with high-dimensional data. Inspired by brain function, deep neural networks are built from many hidden layers sandwiched between the input and output layers. The general architecture of a deep neural network is presented in Figure 5.

Convolutional Neural Network.
DL architectures perform excellently in solving traditional artificial intelligence problems. The most established, advanced, and widely used of them is the CNN. This section discusses the CNN, its variants, and its applications. Among all neural network models, the CNN is the dominant approach to solving CV problems. The idea of the CNN architecture was developed in the 1980s [53], but it did not flourish at the time because of the lack of computational power, high-performance machines, and large storage devices needed to deal with big images. The concept gained momentum as machines improved in computation and in the capacity to store and retrieve data. Later, in [54], CNNs were successfully applied to classification problems and performed very well in CV applications. Gradient-based learning allows the CNN to produce optimized weights. CNNs perform far better than conventional multilayer perceptrons. The CNN weights are shared, so the same object does not need to be learned again at different locations, which decreases the number of learnable parameters. The CNN recognizes visual patterns directly from raw image pixels, and its performance is impressive on 2D and 3D images. The CNN model minimizes the preprocessing task, and back-propagation learning improves performance: it handles nonlinearity, and the computation decreases because of the smaller number of weights. CNNs have produced strong results in object recognition, behavior recognition, audio recognition, detection, recommendation, localization, classification, and segmentation tasks.

Convolution.
Convolution is a mathematical operation that involves the multiplication and addition (a weighted sum) of two functions. The first function (x) represents the input data and the second function (w) represents the kernel; together they produce the output, called a feature map. A CNN is similar to other neural networks in that it uses weights and biases, but it includes convolutional layers that are applied to input data of a suitable type, such as images. The CNN architecture is divided into two parts: feature extractors and classifiers. Each convolutional layer finds specific features in the input data using shared weights called kernels; with n kernels, the convolutional layer extracts n features. The input of each layer is the output of the previous layer. A simple CNN consists of a convolutional layer, pooling layer, rectifier unit, fully connected layer, and classifier. The convolutional layer is the building block of the CNN. The input image is convolved with the kernel (a learnable filter): the kernel slides over the input image, and the kernel weights are learned from the training data. The size of the output is driven by parameters such as depth, stride, and padding. Compared with a fully connected network, the CNN is more compact because it reduces the number of connections and shares weights. Figure 6 shows a general CNN structure: the input image is convolved with kernels to extract the features, and the result of the convolution is then passed through the pooling layer (mostly max pooling). The CNN extracts the features using these layers, and finally a fully connected layer [55] gives the predicted output.
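The sliding-kernel operation and the effect of stride and padding on the output size can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from any reviewed study: it assumes a single-channel image, a square kernel, and zero padding, and, like most CNN libraries, it computes cross-correlation (the kernel is not flipped). The kernel values are set by hand here, whereas in a CNN they are learned.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2D image (lists of lists) and return the feature map."""
    k = len(kernel)
    if padding:
        width = len(image[0]) + 2 * padding
        border = [[0.0] * width for _ in range(padding)]
        middle = [[0.0] * padding + list(row) + [0.0] * padding for row in image]
        image = border + middle + [row[:] for row in border]
    # Output size per axis: floor((n + 2p - k) / s) + 1 (image is already padded here).
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * image[i * stride + a][j * stride + b]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
# Hand-set vertical-edge filter, used only for illustration.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

feature_map = conv2d(image, edge_kernel)           # 4x4 input, 3x3 kernel -> 2x2 output
same_size = conv2d(image, edge_kernel, padding=1)  # padding 1 keeps the 4x4 size
print(feature_map)
```

With n kernels, the same loop would be run n times to produce n feature maps, which is the sense in which a layer with n kernels extracts n features.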
Convolution offers three main properties: sparse interaction, parameter sharing, and equivariant representation.
(i) Sparse interaction: In a traditional fully connected neural network, every output unit interacts with every input unit through a separate parameter. These parameters determine the relationship between the input and output units. A CNN instead uses kernels that are much smaller than the input data. This reduces the number of learnable parameters and the storage space and increases computational efficiency.
(ii) Parameter sharing: The same parameter is used for more than one location of the input. In convolution, each kernel value is used at every position of the input, except possibly at the boundary. This lets the CNN learn a single set of parameters instead of a separate set for every location, which further reduces the storage requirement.
(iii) Equivariance: When the input shifts, the feature map shifts by the same amount. Convolution is naturally equivariant to translation, but not to other transformations such as scaling or rotation [50].
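The saving from sparse interaction and parameter sharing can be made concrete by counting parameters. The sketch below compares a fully connected layer with a convolutional layer; the 256 × 256 image size and the 16 kernels of size 3 × 3 are assumptions chosen only for illustration.

```python
def dense_params(n_inputs, n_outputs):
    """Fully connected layer: one weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def conv_params(kernel_size, in_channels, out_channels):
    """Convolutional layer: one shared k x k kernel per (input, output) channel pair plus one bias per kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


pixels = 256 * 256                   # hypothetical single-channel image
dense = dense_params(pixels, pixels)  # map every pixel to every output unit
conv = conv_params(3, 1, 16)          # 16 shared 3x3 kernels
print(dense, conv)
```

The convolutional count does not depend on the image size at all, because the same kernels are reused at every location.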

CNN Layers.
For the past few decades, the CNN has performed strongly in CV (detection, recognition, tracking, estimation, processing, analysis, learning, restoration, and reconstruction) as a popular machine learning algorithm.
GPUs have also brought extra efficiency to its results. The CNN's gains come from several factors, such as large labeled training datasets, the rectified linear unit, regularization (dropout), and augmentation. The strength of the CNN is extracting discriminative features at different levels. The CNN architecture consists of a convolutional layer, a nonlinearity layer, and a pooling layer, followed by a fully connected layer.
(i) Input layer: This layer receives the input data. It holds the contents of the input and has no learnable parameters, so it plays no part in learning.
(ii) Convolutional layer: The convolutional layer performs the convolution operation, which is the trademark of the CNN architecture. This layer holds learnable parameters, namely weights and biases. It contains filters, or kernels, used to detect edges, shapes, and patterns in the given input image. Kernels are convolved with the input features or image pixels to produce feature maps as output: a dot product between each input patch and the filter is computed, the products are summed, and finally a bias is added. The bias can be configured according to network requirements. The convolutional layer reduces the computational cost by reducing the input size: the kernel computes the product of its weights and an input patch of kernel size. It also determines the desired features based on the kernel weights. Equation (1) shows the operation of the convolutional layer, where $Z_i^l$ and $Z_i^{l-1}$ are the outputs of the current layer and the previous layer, respectively, and $W_{i,j}^l$ and $B^l$ represent the filters and biases. Each neuron need not be connected to all neurons in the preceding and following layers. The input is convolved with the filters and a bias is added to produce the output. The final output goes through a nonlinear activation function, which activates the feature maps and forwards the result to the next layer.
(iii) Pooling layer: This layer is also called the subsampling layer. It reduces the dimensions through a downsampling operation, so a pooling operation can be written as a subsampling function. Average pooling (which uses the average value) and max pooling (which uses the highest value) are the two most used operations.
(iv) Nonlinearity layer: It applies the relevant nonlinear activation function. The most common functions are the sigmoid, rectified linear unit (ReLU), hyperbolic tangent, and softmax.
(v) Dropout: This layer regularizes the CNN model, decreases computation, and increases generalization. It randomly drops units by setting a random set of units to zero, which helps to avoid the overfitting problem.
(vi) Fully connected layer: This is a flattened layer in which each neuron of the previous layer is connected to each neuron of the current layer. Each neuron has a separate weight for each connection, so this layer has the highest number of learnable parameters. The input data are linearly processed, passed through a nonlinearity, and then propagated to the next layer.
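The pooling and nonlinearity layers in the list above can be written out using their standard textbook definitions. The sketch below is illustrative only: the function names, the non-overlapping 2 × 2 pooling window, and the sample feature map are assumptions, not details from the reviewed studies.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes negative ones."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turns a list of scores into probabilities that sum to 1."""
    shift = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool2d(fmap, size=2, mode="max"):
    """Downsample a 2D feature map with non-overlapping size x size windows."""
    agg = max if mode == "max" else (lambda window: sum(window) / len(window))
    pooled = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(agg(window))
        pooled.append(row)
    return pooled


fmap = [[-1.0, 2.0, 0.5, -3.0],
        [4.0, -2.0, 1.0, 0.0],
        [0.0, 1.0, -1.0, -2.0],
        [3.0, -4.0, 2.0, 6.0]]
activated = [[relu(v) for v in row] for row in fmap]   # nonlinearity layer
pooled_max = pool2d(activated, 2, "max")               # 4x4 -> 2x2
pooled_avg = pool2d(activated, 2, "avg")
probs = softmax([2.0, 1.0, 0.1])
```

Tanh is available directly as `math.tanh`, so it is not redefined here.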

LV Segmentation Using DL Architectures
CNNs have performed several CV tasks effectively and precisely, which is why they are widely used for image segmentation, especially of medical images. In Section 4.1, several CNN architectures used for LV segmentation are reviewed.

LV Segmentation Using Fully Convolutional Network.
The fully convolutional network (FCN) [56] uses convolutional layers in place of fully connected layers. Therefore, an FCN can handle images of variable size and has fewer parameters to learn, which also makes the network faster. The FCN and its variants used for LV segmentation are explained below.

FCN with Pre/Postprocessing.
A three-step LV segmentation method (preprocessing, LV segmentation, and postprocessing) is proposed in [57]. In the first phase, the LV is localized using the OverFeat algorithm [58] to determine the region of interest (ROI), which is then fed to the next phase, where the segmentation is performed using a temporal FCN (T-FCN) architecture. The CNN model is pretrained with GoogLeNet and fine-tuned using LV images. The T-FCN adds another hidden layer in the decoding path to restore the original size. The segmented LV boundary is further refined in the third phase using one of two algorithms: fully connected conditional random fields (CRFs) with Gaussian edge potentials [59] or semantic flow [60]. To train the network, the TWINS-UK database, which consists of more than 12,000 images, was used. The results showed that T-FCN with CRF gives the better segmentation, achieving a dice similarity coefficient (DSC) of 0.9815, an average perpendicular distance (APD) of 6.2903, and a conformity index of 0.9610. This work focused only on LV segmentation.
One preprocessing method is to crop the ROI first and then apply the segmentation to the selected ROI. This procedure for LV segmentation is presented in [61]. Clinical parameters such as LV volume (LVV), LVM, SV, and EF were also analyzed by estimating the size of the LV from MRI images. The class imbalance challenge was tackled by first finding the ROI using an FCN model; a second FCN model was then applied to these ROI images for LV segmentation. Cross-entropy and radial distance were used as loss functions. The model was trained and tested using two datasets: the publicly available Automatic Cardiac Diagnosis Challenge (ACDC) 2017 dataset and a local dataset. The ACDC-2017 dataset consists of data from 150 patients, while the local dataset consists of almost 6000 images. Performance was evaluated using the DSC and Hausdorff distance (HD) for both the cross-entropy loss and the radial distance loss. The model achieved almost the same results on both datasets, which suggests that it generalizes across datasets. The proposed model attained better DSC and HD values than U-Net and ConvDeconv-Net. The DSC for LV segmentation is 0.95 on the local dataset and 0.94 on the ACDC dataset. Similarly, the HD is 9.31 and 11.21 for the local and ACDC datasets, respectively.
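The DSC and HD used throughout this review have standard set-based definitions: DSC = 2|P ∩ T| / (|P| + |T|), and HD is the largest distance from any point of one set to its nearest point in the other set. The sketch below is a minimal illustration in plain Python. It assumes binary masks given as lists of lists and measures HD over all foreground pixels in pixel units, whereas the reviewed studies typically report HD on contours and may scale it by the pixel spacing.

```python
import math


def mask_to_points(mask):
    """Collect the (row, col) coordinates of all foreground pixels."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred, truth):
    """Dice similarity coefficient: 2 |P & T| / (|P| + |T|)."""
    p, t = mask_to_points(pred), mask_to_points(truth)
    if not p and not t:
        return 1.0                      # two empty masks agree perfectly
    return 2.0 * len(p & t) / (len(p) + len(t))


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between two non-empty masks, in pixels."""
    p, t = mask_to_points(pred), mask_to_points(truth)
    if not p or not t:
        raise ValueError("Hausdorff distance needs two non-empty masks")

    def directed(a, b):
        # Farthest that any point of a lies from its nearest point in b.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, t), directed(t, p))


truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 1, 1, 1],      # one extra pixel on the right edge
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

dsc = dice(pred, truth)
hd = hausdorff(pred, truth)
print(f"DSC = {dsc:.3f}, HD = {hd:.1f} px")
```

DSC rewards overall overlap, while HD is driven by the single worst boundary error, which is why the two are usually reported together.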

Improved FCN for Clinical Index Calculation.
Chen Qin et al. [62] proposed a model that consists of two branches: a motion estimation branch and a segmentation branch. An unsupervised Siamese-style recurrent spatial transformer network is used for motion estimation, and an FCN is used for segmentation. Because the unsupervised motion estimation branch is combined with the segmentation branch, the overall model can also be regarded as weakly supervised. A total of 220 short-axis-view subjects were obtained from the UK Biobank study. LV segmentation was assessed both with the segmentation branch alone and with the two branches combined. A DSC of 0.9217 was achieved for segmentation only, while 0.9348 was achieved for the combined model, which shows that the model performs better in combined mode.
Similarly, the LVV is calculated from MRI images in [63]. The volume of the LV is a very important feature for evaluating a patient's cardiac health, and computing it requires LV segmentation. The method segments the LV at diastole and systole to calculate the LV volume. The Sunnybrook dataset is used to train and test the model. Data augmentation is also applied by rotating the slices in different directions. The method uses a local binary pattern cascade to detect candidate ROIs. Then, a CNN model scores the ROIs, and the one with the maximum score is selected. Finally, the LV is segmented from the ROI using a hypercolumn FCN (HFCN). In the HFCN, features from different levels are concatenated to form a new layer, and the segmentation is based on this new layer. The volume is calculated from both manual and HFCN segmentations. A variance estimation method is used to estimate the final prediction. This algorithm ranked fourth in the Second Annual Data Science Bowl competition organized by Kaggle. Although the algorithm performs very well in segmentation, the model sometimes generates an irregular LV shape because it does not use prior knowledge of the 3D shape of the LV.
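One common way to turn per-slice segmentations into a volume is to sum the segmented area of each short-axis slice and multiply by the slice thickness. The sketch below illustrates that general idea only; it is not the procedure of [63], and the pixel spacing, slice thickness, and the assumption of contiguous slices without gaps are all hypothetical.

```python
def lv_volume_ml(slice_masks, pixel_spacing_mm, slice_thickness_mm):
    """Estimate LV volume in mL from a stack of binary short-axis masks."""
    dy, dx = pixel_spacing_mm
    voxel_mm3 = dy * dx * slice_thickness_mm
    n_voxels = sum(1 for mask in slice_masks for row in mask for v in row if v)
    return n_voxels * voxel_mm3 / 1000.0    # 1 mL = 1000 mm^3


# Two tiny hypothetical slices with 4 and 2 LV pixels.
slices = [
    [[0, 1, 1],
     [0, 1, 1],
     [0, 0, 0]],
    [[0, 1, 0],
     [0, 1, 0],
     [0, 0, 0]],
]
volume = lv_volume_ml(slices, pixel_spacing_mm=(1.5, 1.5), slice_thickness_mm=8.0)
print(f"{volume:.3f} mL")
```

Applying the same function to the end-diastolic and end-systolic stacks gives the EDV and ESV from which SV and EF are derived.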
The feasibility and accuracy of using an FCN to segment scar tissue in the LV were analyzed in [64]. A modified version of the FCN, the efficient neural network (ENet), is applied to cardiac images. The proposed network consists of 13 convolutional layers with a 3 × 3 kernel size and a stride of 2, with the parametric rectified linear unit (PReLU) as the activation function and cross-entropy as the loss function. Two protocols, protocol-1 and protocol-2, were used for the segmentation. In protocol-1, the ground-truth and original images were fed directly to the network for training and segmentation, whereas in protocol-2 the desired LV area was cropped before training the network. The images were cropped using the Hough transform [65]. The dataset consists of 250 images from 30 patients, which was increased to 2000 images by data augmentation. Protocol-1 and protocol-2 achieved an accuracy of 95.97% and 96.83%, a sensitivity of 97.31% and 87.89%, a specificity of 68.77% and 88.07%, and a DSC of 0.54 and 0.71, respectively. The results indicate that protocol-2 performs better than protocol-1 on most metrics, which suggests that cropping the ROI gives better segmentation results.
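Accuracy, sensitivity, specificity, and DSC can all be derived from the same four pixel counts (true and false positives and negatives). The sketch below uses the standard definitions on a hypothetical 100-pixel example with a small foreground; it shows how accuracy can stay high while DSC is low, and is not a reproduction of the results in [64].

```python
def pixel_metrics(pred, truth):
    """Compute accuracy, sensitivity, specificity, and Dice from flat binary label lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    def ratio(num, den):
        return num / den if den else float("nan")   # undefined when a class is absent

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),           # true positive rate
        "specificity": ratio(tn, tn + fp),           # true negative rate
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical image: 4 foreground pixels out of 100.
truth = [1, 1, 1, 1] + [0] * 96
pred = [1, 1, 0, 0] + [1, 1] + [0] * 94   # 2 hits, 2 misses, 2 false alarms
metrics = pixel_metrics(pred, truth)
print(metrics)
```

Because background pixels dominate, the many true negatives inflate accuracy, while DSC ignores them and reflects the quality of the foreground overlap.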

Loss Functions and Optimization Algorithm.
So far, we have discussed FCN models whose performance depends on preprocessing or on changes to the model. Another very important design choice is the loss function. In [66], an updated FCN model was analyzed with different loss functions. An iterative multipath FCN (IMFCN) segmentation model for the LV, RV, and MYO from MRI images was proposed. To tackle the class imbalance problem, the ROI was located and the images were cropped using the method proposed in [67]. The proposed model consists of an encoder, feature fusion, a decoder, and deep supervision. Encoder part used s... The efficiency of the model is also compared using different loss functions. A batch-wise class reweighting mechanism and a batch-wise weighted dice loss function were utilized for multiclass segmentation. The results of the proposed models were evaluated and compared with U-Net and LVRV-Net. To quantitatively evaluate the performance, three metrics were employed: DSC, average symmetric surface distance (ASSD), and HD. The batch-wise weighted dice loss function gave the best results among all loss functions. In this study, most segmentation inaccuracies occurred in apical and basal slices; additional processing mechanisms could lessen these errors.
The focal loss was analyzed with four skip connections in the FCN model [68]. The model was referred to as the focal residual network (FR-net), with ResNet50 as the backbone network. Cross-entropy loss was calculated between the predicted probabilities and the labels, and focal loss was applied to improve the preliminary segmentation results. The Sunnybrook dataset was used, and DSC and APD were used to evaluate the performance. The results were also compared with U-Net, FCN, and other work based on the Sunnybrook dataset. The model attained a DSC of 0.93 and an APD of 1.41.
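The loss functions discussed above can be compared on a per-pixel basis. The sketch below implements binary cross-entropy, the focal loss in its commonly used form FL = −α_t (1 − p_t)^γ log(p_t), and a plain soft Dice loss. The values γ = 2 and α = 0.25 are common defaults assumed here, not necessarily the settings of [68], and the batch-wise weighting of [66] is not reproduced.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy for one pixel with predicted foreground probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel: down-weights pixels that are already well classified."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_dice_loss(probs, labels, smooth=1.0):
    """1 - soft Dice over flat lists of probabilities and binary labels."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(labels) + smooth)


# An easy foreground pixel (p = 0.9) versus a hard one (p = 0.1).
easy_ratio = focal_loss(0.9, 1) / binary_cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / binary_cross_entropy(0.1, 1)
print(easy_ratio, hard_ratio)
```

The focal term shrinks the contribution of the easy pixel far more than that of the hard pixel, which is how it shifts training toward difficult regions such as the LV boundary.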
In addition to the loss function, the optimization algorithm is also a key component of the CNN model. In [69], the performance of different optimization algorithms was analyzed for a CNN model. Six optimization algorithms, namely, stochastic gradient descent (SGD), Nesterov accelerated gradient [70], RMSProp [71], Adam [72], AdaDelta [73], and AdaGrad [74], were used to train the model. A CNN model was proposed and trained separately with each of the six optimizers. The Sunnybrook dataset was used for training and testing. The best performance was obtained with RMSProp, with which the model achieved a DSC of 0.93, an APD of 2.13, and a percentage of good contours (GC) of 95.64.
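The difference between two of the optimizers named above can be seen in their update rules. The sketch below contrasts a plain SGD step with an RMSProp step, which divides the gradient by a running root-mean-square of past gradients. It is a toy illustration on a one-dimensional quadratic; the learning rates, decay factor, and test function are assumptions and have no connection to the experiments in [69].

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain gradient descent: step against the gradient, scaled by the learning rate."""
    return w - lr * grad


def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """RMSProp: scale the step by a running average of squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad * grad
    w = w - lr * grad / (math.sqrt(cache) + eps)
    return w, cache


def grad_f(w):
    """Gradient of the toy objective f(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


w_sgd = 0.0
w_rms, cache = 0.0, 0.0
for _ in range(1000):
    w_sgd = sgd_step(w_sgd, grad_f(w_sgd))
    w_rms, cache = rmsprop_step(w_rms, grad_f(w_rms), cache)

print(w_sgd, w_rms)
```

Because RMSProp normalizes each step by the recent gradient magnitude, its step size stays close to the learning rate regardless of how steep the loss surface is, which is one reason adaptive methods are often less sensitive to gradient scale than SGD.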

Other FCN Techniques.
One key limiting factor in DL is the amount of data. Training an FCN model on a large dataset for LV segmentation is studied in [75]. The FCN model is trained on the UK Biobank dataset. The images of 5,008 subjects (93,500 images) were used to train the model after data augmentation. The data were manually annotated by eight different experts. DSC, mean contour distance, and HD values of 0.94, 1.04, and 3.16, respectively, were achieved. The segmented LV is also used to measure the LV-EDV, LV-ESV, and LVM.
Another technique to enhance model performance is to take advantage of a pretrained model. A VGG model pretrained on ImageNet was combined with an FCN in a model called FCN-all-at-once-VGG16 [49]. The model used skip connections to combine the hierarchical features from convolutional layers at different scales. Adam was used as the optimizer with an initial learning rate of $10^{-4}$. A dataset of 1100 subjects in total was used: 100 subjects were split into 50 training, 30 validation, and 20 test cases, and the remaining 1000 cases (diastolic) were segmented using the trained model and compared with 1000 manual segmentations drawn by an expert technician. The manual drawing was performed using in-house software (A-view Cardiac, Asan Medical Centre, Seoul, Korea). For quantitative analysis, sensitivity and specificity were used as evaluation metrics. The method is limited when the number of background pixels (i.e., image pixels other than the LV mask) is large. The model was evaluated using four performance indices: DSC, Jaccard similarity coefficient, mean surface distance (MSD), and HD.
FCN was also used with a graph matching algorithm. The motion estimation of the LV from MRI images was studied in [38]. The method consists of four steps: (1) endocardial contours of the LV were predicted using an FCN, (2) features of points in short-axis cine MRI were extracted using an FCN feature descriptor, (3) the correspondence between contours of the LV myocardium was estimated by a novel graph matching algorithm, and (4) the correspondence between two LV contours and the LV motion field was estimated by incorporating the FCN feature descriptor into the graph matching algorithm. The Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2009 challenge database and the 33-subject database [53] were employed to evaluate the proposed method.
In summary, FCN and its modifications show good results for LV segmentation. Different preprocessing and postprocessing techniques, loss functions, and data variability can be used to enhance performance. Figure 7 illustrates examples of segmented images generated by FCN models.

LV Segmentation Using U-Net.
In medical images, the region to be segmented usually occupies only a small part of the entire image. The U-Net [76] has shown substantial results in the segmentation of medical images. This is possible due to the ability of U-Net to continuously suppress the background region during training and emphasize the areas that need to be segmented. That is why the most used network for LV segmentation is U-Net and its modified variants. In [77], the input cardiac MRI is fed to U-Net, which generates labeling probabilities. For postprocessing, kernel cut, a segmentation technique, was used. The output of U-Net is the input of the continuous kernel cut, which segments the desired part. LV, MYO, and RV were segmented using this approach. The results show that reasonable segmentation results can be achieved with less training data.

U-Net with
Postprocessing of in vivo diffusion tensor CMR was performed in [78]. A five-layer U-Net architecture is used to perform the LV segmentation followed by image registration. This helps to remove the bad images, and then the final segmentation is applied. To increase the size of the dataset, data augmentation (translation and rotation) was used. Batch normalization was used with U-Net to avoid overfitting. The model achieved a median DSC of 0.93. The approach in [79] preprocesses the images by selecting the ROI using the SinMod method. The ROI contains the desired part of the heart and is fed to the U-Net model for training. Curriculum learning (CL) was utilized as the training strategy. The proposed method was compared with U-Net without CL, FCN (with and without CL), and the hybrid gradient vector flow snake. The DSC, overlap, and mean average distance (MAD) were used as evaluation metrics.
In [80], the labeled images from the Kaggle database were used before training. The concept of transfer learning was utilized by pretraining the 3D U-Net model using Harvard data. The U-Net model was used to segment the LV, MYO, and RV. The DSC value achieved for LV segmentation was 0.87 without transfer learning and 0.95 using transfer learning.
Determining the ROI makes the segmentation task simpler and more accurate, as the target area is reduced to the ROI instead of the complete image. This strategy was used in [81], which proposed three new CNN architectures based on U-Net. The main purpose of the proposed models is to quantify the LV; before LV quantification, LV-ROI detection and myocardial tissue classification were performed using the same U-Net architectures. For this experiment, the Sunnybrook cardiac dataset and the Cardiac Atlas Project (CAP) dataset were used, which consist of 45 and 95 cases, respectively.
In the first proposed model, the encoding path comprises two 3 × 3 convolution operations, batch normalization, and residual learning. A 2 × 2 max-pooling operation with stride 2 was performed after the residual learning [24]. The second and third proposed architectures are named uInception and uXception. The network complexity was reduced in these networks. SGD was used as the optimizer and Jaccard distance as the loss function. Data augmentation was applied to increase the data size from 4,048 to 20,000. The segmentation accuracy was measured using the DSC, which reached 0.870, 0.869, and 0.868 for the three proposed networks, respectively. Mean square errors of 0.0135, 0.0136, and 0.0138 were achieved, while the mean absolute errors were 0.0137, 0.0136, and 0.0138. Furthermore, EDV, ESV, SV, LVM, and EF were calculated as clinical indices.
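Clinical indices such as EDV, ESV, SV, and EF follow directly from the segmented LV cavity. The sketch below assumes a short-axis stack in which the volume is approximated by summing the segmented area of each slice multiplied by the slice thickness; the function names, pixel counts, pixel spacing, and slice thickness are hypothetical illustration values and are not taken from [81].

```python
def lv_volume_ml(pixel_counts, pixel_spacing_mm, slice_thickness_mm):
    """Approximate cavity volume by summing slice area times slice thickness."""
    pixel_area_mm2 = pixel_spacing_mm[0] * pixel_spacing_mm[1]
    volume_mm3 = sum(pixel_counts) * pixel_area_mm2 * slice_thickness_mm
    return volume_mm3 / 1000.0  # 1 ml = 1000 mm^3


def stroke_volume_ml(edv_ml, esv_ml):
    """SV is the volume ejected per beat: EDV minus ESV."""
    return edv_ml - esv_ml


def ejection_fraction_percent(edv_ml, esv_ml):
    """EF is the stroke volume expressed as a percentage of EDV."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Hypothetical number of LV cavity pixels per short-axis slice.
ed_counts = [900, 1200, 1300, 1100, 700]   # end-diastole
es_counts = [400, 600, 650, 500, 250]      # end-systole
spacing = (1.5, 1.5)                       # assumed in-plane spacing in mm
thickness = 8.0                            # assumed slice thickness in mm

edv = lv_volume_ml(ed_counts, spacing, thickness)
esv = lv_volume_ml(es_counts, spacing, thickness)
sv = stroke_volume_ml(edv, esv)
ef = ejection_fraction_percent(edv, esv)
print(f"EDV = {edv:.1f} ml, ESV = {esv:.1f} ml, SV = {sv:.1f} ml, EF = {ef:.1f}%")
```

Because both volumes come from the segmentation masks, a small boundary error at end-systole, where the cavity is smallest, can shift the EF noticeably.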

U-Net with Deformable Model.
DL can be combined with deformable models used as postprocessing to segment the LV. Veni et al. trained the U-Net model for LV segmentation from the A4C view of US [82]. The segmented output is further refined using the deformable model. Using this technique, high accuracy is achieved by training the model with very little data, i.e., 69 images. A DSC value of 0.86 ± 0.06 was achieved.

Improved U-Net for Clinical Index Calculation.
Many studies focus on the calculation of heart parameters such as EF, global longitudinal strain (GLS), or LVM, and LV segmentation is one of the primary tasks required to measure these parameters. A study was performed to validate that DL methods can be used in real-time software that streams images directly from an ultrasound scanner [83]. A U-Net model was utilized for LV segmentation. The main goal was to calculate ventricular volume, EF, and mitral annular plane systolic excursion (MAPSE). All these parameters were based on the segmentation of LV. The accuracy of the model was evaluated by Bland-Altman analysis. A dataset of 75 patients was used, and the Bland-Altman analysis gave values of (−13.7 ± 8.6)% for EF and (−0.9 ± 4.6) mm for MAPSE. The results show that DL is a feasible solution for the
Similarly, LV segmentation was also performed to measure the GLS. The work in [84] utilizes the standard U-Net architecture and performs four tasks on US images: (i) classification of the cardiac view, (ii) segmentation of LV from the cardiac view, (iii) estimation of the regional motion, and (iv) fusion of measurements. The segmentation architecture comprises five levels of upsampling and downsampling. All levels consist of two convolution layers with filters ranging from 32 to 128. A 3 × 3 filter size and 2 × 2 max pooling with an equal stride of 2 × 2 were utilized. Dice was used as the loss function and Adam as the optimizer.
A method to achieve LV segmentation based on temporal area correlation was proposed in [85]. U-Net was used as the base CNN model, and a multitask module was then utilized for epicardium and endocardium segmentation. The output of the multitask module was fed to a recurrent neural network (RNN), which performs the temporal area correlation optimization. An average DSC of 0.90 ± 0.05 and an average HD of 7.6 ± 4.5 were achieved. The LVM and EF were also calculated to cross-validate the results.
For the quantitative analysis of the LV, segmentation is performed before quantification of LV parameters (area and dimension) [86]. The segmentation provides the structural information of the LV, which is further used for quantification. Initially, the U-Net architecture was used as the segmentation model. Furthermore, a Deep-CQ segmentation model with a newly proposed loss function was introduced for LV segmentation. The binary classification of each pixel as LV or background was performed using the Gibbs distribution function [87]. The segmentation performance was evaluated using the DSC metric: the Deep-CQ model achieved 0.893 ± 0.05, while U-Net yielded 0.897 ± 0.041. The main objective of this research is the quantification of LV; the Deep-CQ model performed better than U-Net for quantification, while U-Net performed better than the Deep-CQ model in segmentation.
Estimation of myocardial perfusion is an essential step to measure the blood flow through the heart muscle. The arterial input function (AIF) extraction is an important phase for calculating the myocardial perfusion. The AIF estimation is highly dependent on detecting the LV size accurately. The LV segmentation to measure the AIF was performed in [88]. A U-Net model based on ResNet was designed to segment the LV. ResNet consists of batch normalization, ReLU, and convolution layers. To estimate the output probability, sigmoid or SoftMax was used. The kernel size was 3 × 3 with a stride of 1 and a padding of 1 in all convolution layers. A weighted sum of cross-entropy and IoU was used as the loss function. To find the best hyperparameters, 45 training sessions were performed, and the best hyperparameters were used for the final training. The labeling of LV and RV was performed using an ad hoc algorithm, and experts cross-checked the labeling. The model was trained using two different sets of classes: (i) LV and background and (ii) LV, RV, and background. The model achieved DSC values of 0.87 ± 0.08 for three classes and 0.82 ± 0.22 for two classes.
The performance of the model trained for three classes was better than that for two classes because the contextual information extracted from three classes improves the LV segmentation performance.
Automatic LV segmentation from an entire echo cine was performed in [89]. The US images and the optical flow of the US images were first fed to a temporal window. The optical flow was calculated by the Horn-Schunck algorithm. The US images and the optical flow images act as inputs to two separate U-Net encoders. The outputs of both encoders were concatenated. In the third part of the model, the concatenated data were passed to a bidirectional LSTM. The U-Net decoder finally upsampled the features and segmented the LV. The data of 563 patients were used with a training and testing ratio of 80 and 20, respectively. Dice was used as the loss function and Adam as the optimizer. Network performance was compared with U-Net and U-Net Bi-Conv LSTM using the DSC. The U-Net optical Bi-Conv LSTM model achieved the best DSC value of 0.936 and an accuracy of 0.977.

Comparison of Different U-Net Models.
A comparison of three well-known CNN architectures was performed in [90]. U-Net, wide U-Net, and U-Net++ were trained using the data of 94 patients. Data augmentation was used to increase the data size and to avoid the overfitting problem. The U-Net has 32, 64, 128, 256, and 512 feature maps, while the wide U-Net has 35, 70, 140, 280, and 560 feature maps. U-Net++ has an additional block of feature maps and skip connections. Exponential linear units (ELU) were used as the activation function in all layers except the last layer, where sigmoid was used. The models were trained using the original dataset and the augmented dataset, and the performance was assessed. The U-Net++ model trained on the augmented dataset performed best among the three models, with the highest DSC value of 92.28. Moreover, U-Net++ was less overfitted than U-Net and wide U-Net.

U-Net Performance Based on Dataset Properties.
Although a comparison among different variants of U-Net was performed in [90], the training dataset and data variability also affect the performance of the network. The effect of training datasets with different variability on the performance of the CNN model was analyzed in [91]. The U-Net architecture was used as the segmentation model, and the trained models were named CNN1, CNN2, and CNN3 based on the training dataset variability. Three different training sets were collected for this experiment. CNN1 was trained using 25,389 images from a single center and a single vendor. CNN2 was trained using 27,488 images from multiple centers and a single vendor, while CNN3 was trained using 41,593 images from multiple centers and multiple vendors. The training images were preprocessed by normalizing the resolution, cropping the images to 256 × 256, and normalizing the signal intensity. APD was used as the evaluation metric. CNN3 had the largest number of training samples and the highest variability, and it achieved the best performance on unseen heterogeneous testing data, with an APD of 1 mm. EDV, ESV, LVM, and EF were used as clinical indices.
A similar and detailed study analyzing the impact of the amount of data, the quality of images, and the influence of expert annotation on LV segmentation was carried out in [92]. An openly available US image dataset was also introduced in this work. The dataset consists of apical 4-chamber views of 500 patients and is called the Cardiac Acquisitions for Multistructure Ultrasound Segmentation (CAMUS) dataset. The authors compared the performance of different CNN models based on U-Net. The models used for LV segmentation were U-Net1, U-Net2, an anatomically constrained neural network, stacked hourglasses, and U-Net++. All these architectures were based on the encoder-decoder design, and the main difference among them is the use of different layers and learning parameters. U-Net2 yielded the best segmentation results, slightly better than U-Net1, but U-Net1 needs fewer parameters to learn, so the authors chose U-Net1 for further experiments. The model was trained both to segment only the LV and in a multistructure setting in which it segments the LV endocardium (endo), LV epicardium (epi), and left atrium (LA). The model performance was consistent for both LV segmentation and LV segmentation in the context of LA.
The effect of image quality on training was also tested. Two different sets of images were given to the network for training. One set comprised only high-quality images, while the other consisted of high- and low-quality images. The outputs for the two sets did not vary significantly. The authors infer that encoder-decoder-based techniques can cope with variability in image quality. The influence of the size of the training dataset on the performance was also tested. The U-Net1 model was trained by increasing the dataset from 50 patients to 400 patients; at each level, the data of 50 more patients were added for network training. The results show that the performance of the model increases up to 250 patients and improves only slightly when the training data are increased further to 400. It is concluded that U-Net1 requires the data of 250 patients to attain promising results. The impact of expert annotation was evaluated by having the data annotated by three different experts. The network was trained each time using the data of 50 patients labeled by each of the three experts. The validation data were kept the same, and the model was tested on the remaining 400 patients' data. The network trained using the expert's data showed better results in testing. The analysis indicates that image contouring is cardiologist dependent. Furthermore, the encoder-decoder network can learn a specific way of segmenting.
The problem of labeling large datasets was addressed in [93]. A model was proposed to generate the ground-truth images. Pseudoimages were generated using a graphical model such as principal component analysis. The CycleGAN model was employed to generate the labeled images by using the pseudoimages and unlabeled original images. These labeled images were utilized to train a U-Net model. The CAMUS dataset, the EchoNet dataset, and a synthetic dataset were used to train and test the model. The results show that the model trained using the synthetic data also performs very well.

Other Models Based on U-Net Architectures.
Segmentation of LV, RV, and MYO from the apical 2-chamber (A2C) view or apical 4-chamber (A4C) view has been implemented using DL methods [94]. In this work, a neural network was tested for segmenting the LV, RV, and MYO from the apical long axis view (ALAX). In ALAX, the main difference is the LV outflow tract (LVOT), which restricts the view. Four different approaches were used in this research. First, U-Net1 was trained from scratch and used to segment the ALAX view. This model was referred to as the baseline model in this work. Second, the baseline network was trained on A2C/A4C views and then, as a form of transfer learning, trained for ALAX segmentation. Third, the baseline network was trained using A2C, A4C, and ALAX data. Because fewer ALAX data than A2C/A4C data were available, the ALAX data were repeated ten times in each epoch to compensate. In the fourth approach, the network was fed with US images and binary indicators. The purpose of the binary indicator is to inform the network whether the input image is ALAX or A2C/A4C. As the U-Net model has no dense layer, an image is created from the binary indicator and fed to the network. The CAMUS challenge dataset consisting of 500 patients was used for training, while for the ALAX view, separate data of 106 patients were collected. The proposed multiview segmentation network achieved the best DSC value of 0.921.
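Two ideas from the multiview training described above can be sketched in a few lines: repeating the under-represented ALAX samples ten times per epoch, and turning the binary view indicator into an image-sized channel because the network has no dense layer. This is only an illustration under stated assumptions; the helper names `build_epoch` and `indicator_image`, the tuple representation of samples, and the toy sample counts are hypothetical and do not reproduce the implementation in [94].

```python
import random


def build_epoch(a2c_a4c_samples, alax_samples, alax_repeats=10, seed=0):
    """Return one shuffled epoch in which every ALAX sample appears alax_repeats times."""
    epoch = list(a2c_a4c_samples) + list(alax_samples) * alax_repeats
    random.Random(seed).shuffle(epoch)
    return epoch


def indicator_image(is_alax, height, width):
    """Create a constant image encoding the view flag (1.0 for ALAX, 0.0 otherwise)."""
    value = 1.0 if is_alax else 0.0
    return [[value] * width for _ in range(height)]


# Toy data: 20 A2C/A4C samples and only 3 ALAX samples.
a2c_a4c = [("A4C", i) for i in range(20)]
alax = [("ALAX", i) for i in range(3)]

epoch = build_epoch(a2c_a4c, alax)
alax_count = sum(1 for view, _ in epoch if view == "ALAX")
print(len(epoch), alax_count)

# The indicator channel has the same spatial size as the input image, so it
# can be stacked with the US image as an additional input channel.
channel = indicator_image(True, 4, 4)
```

Oversampling in this way balances how often each view is seen per epoch without changing the loss function.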
To obtain an accurate and precise LV boundary and size, different studies modify the U-Net to improve its performance. Gutierrez-Castilla et al. [95] improved the U-Net model by changing the skip connections. The feature maps from each decoder layer were selected and upsampled to the size of the final decoder output. After upsampling, all decoder feature maps were concatenated or added together. Using these dense skip connections, the decoder features can flow directly to the final layer from each decoder layer. As no extra layers or filters are added, this model does not add any extra parameters. For training the model, two datasets, ACDC and Sunnybrook, were used, which consist of 150 and 45 patients' data, respectively. LV, RV, and MYO were segmented for the diastolic and systolic phases. DSC and HD were used as evaluation metrics. As a clinical index, EF was also calculated by segmenting the LV for ED and ES. For ED, DSC and HD values of 0.968 and 4.855 mm were achieved, respectively. Likewise, a DSC of 0.944 and an HD of 6.254 mm were attained for ES.
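While DSC measures region overlap, HD reports the worst-case boundary disagreement, which is why it is given in millimetres. The sketch below computes the symmetric Hausdorff distance between two contours represented as point lists; the function names, the toy contours, and the isotropic pixel spacing of 1.25 mm are illustrative assumptions and are not taken from [95].

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in points_a to its nearest point in points_b."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)


def hausdorff_mm(points_a, points_b, spacing_mm=1.0):
    """Symmetric Hausdorff distance in mm, assuming isotropic pixel spacing."""
    distance_px = max(directed_hausdorff(points_a, points_b),
                      directed_hausdorff(points_b, points_a))
    return distance_px * spacing_mm


# Hypothetical contour points in pixel coordinates.
predicted_contour = [(0, 0), (0, 1), (1, 0), (1, 1)]
manual_contour = [(0, 0), (0, 1), (1, 0), (4, 5)]

hd_px = hausdorff_mm(predicted_contour, manual_contour)
hd_mm = hausdorff_mm(predicted_contour, manual_contour, spacing_mm=1.25)
print(hd_px, hd_mm)
```

A single outlying contour point, such as (4, 5) here, determines the result, which explains why HD can be poor even when DSC is high.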
In the same way, a CNN model named batch-normalization-U-Net (BNU-Net) was designed for LV segmentation from MRI images [96]. The proposed model was based on the U-Net architecture, where the successive layers in the encoding path were followed by an ELU activation function and batch normalization was applied after the convolutional filters. The BNU-Net has 4 layers in the contraction path and 7 layers in the expansion path. A 2 × 2 max pooling was used after each pair of convolutional layers in the contraction path. The model was also trained using the ReLU activation function, and it was found that the model gives better performance with the ELU activation function. The model was trained using the publicly available Sunnybrook dataset, and the training data size was increased by applying affine transformations for data augmentation. DSC and sensitivity were used to compare the performance of BNU-Net with U-Net (with and without data augmentation). The BNU-Net performed better with data augmentation and gave a value of 0.93 for DSC and 0.97 for sensitivity.
Also, a novel U-Net-based CNN module named the "OF feature aggregation network" (OF-Net) integrated temporal information from cine MRI into LV segmentation [97]. The proposed model integrates the motion information with the U-Net model. Furthermore, two more CNN models were used to localize the LV (ROI-Net) and then segment it (called the attention module). The model was trained using the Flying Chairs dataset and fine-tuned using the MRI datasets. Two different publicly available datasets, Statistical Atlases and Computational Modelling of the Heart (STACOM) and ACDC, were used. Out of 100 subjects, 66 were used for training and 34 for testing, giving a total of 12,720 images for training and 6972 for testing (from the STACOM dataset). A DSC value of 94.8 ± 3.3 was achieved.
In [98], a graphical user interface was developed for LV segmentation from MRI images using PyQT libraries. Images were labeled manually, and the labeled LV images were used to train the CNN model. A publicly available dataset and an internal dataset were used to train the model with 13,535 images and test the model with 4,148 images. The model achieved a DSC of 0.87 ± 0.02.
Sonographers also use point-of-care ultrasound (POCUS), a portable form of ultrasonography used for diagnosis. The feasibility of translating POCUS echo images to high-quality traditional echo images was studied in [99]. To bring the quality of POCUS data to the level of cart-based US data, a mapping from POCUS images to cart-based US images was required. To achieve this goal, the POCUS images were analyzed, compared, and mapped with the traditional US images. A dataset of 5000 POCUS images and 16000 US images was used for the mapping. The anatomy of the LV was extracted from POCUS (using the A2C view) using the DL method and then mapped with high-quality US images. The images were classified as low quality (fair + medium) and high quality. This classification was based on the visibility of the anatomy of the desired region. Fully convolutional encoder-decoder networks based on the U-Net architecture were utilized for the translation of images. The size of the input image was 128 × 128. The model comprises ten encoding and eight decoding convolutional layers. ReLU activation, batch normalization, and dropout with ratio = 0.2 were used. In the first layer, batch normalization was not employed. Max-pooling and transpose convolution layers with stride 2 × 2 were used in the downsampling and expansive paths, respectively. The average DSC values obtained were 82.6 ± 12.3 and 88.3 ± 5.0 for low- and high-quality images, respectively. Similarly, HD values of 2.6 ± 2.7 mm and 1.9 ± 0.8 mm were obtained for low- and high-quality images.
Despite the several advantages of U-Net for medical images, it does not directly account for the effects of feature maps at different scales. To solve this problem, a pyramid network was combined with the dilated U-Net model and named multifeature pyramid U-Net (MFP-UNet) [100]. In the dilated U-Net model, two more downsampling layers were added to extract denser details of an image. As US images usually have low contrast, the images were preprocessed to enhance their contrast using Niblack's method for global thresholding. The model was trained using a self-collected dataset of 137 2D-US sequences, which yields 1080 training images and 290 test images. Furthermore, the model was also trained using the publicly available CAMUS dataset. The proposed model not only yielded good segmentation results but also required less runtime. The model was compared with U-Net, dilated U-Net, and DeepLabv3. Segmenting a test image takes about 1.2 sec for the classic U-Net, 1.33 sec for DeepLabv3, and 0.81 sec for MFP-UNet. DSC, HD, Jaccard distance, and mean absolute distance were used to compare the performance of the models.
Another concern is that, while computing the parameters, most DL models extract similar features at low levels. To avoid this problem, a modification of the attention U-Net model was proposed by introducing the attention gate mechanism [101]. This model automatically focuses on the desired region of varying size and shape. Furthermore, the class imbalance problem was addressed by introducing the Tversky loss. The model achieves 0.75, 0.87, and 0.92 for the Jaccard index, sensitivity, and specificity, respectively.
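The Tversky loss mentioned above addresses class imbalance by weighting false positives and false negatives differently. A minimal sketch is given below, assuming the common formulation TI = TP / (TP + α·FP + β·FN) with loss = 1 − TI; the weights α = 0.3 and β = 0.7, the smoothing constant, and the toy predictions are assumptions chosen for illustration and are not the settings reported in [101].

```python
def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss for flat lists of predicted probabilities and 0/1 labels.

    alpha weights false positives and beta weights false negatives;
    alpha = beta = 0.5 reduces the index to the Dice coefficient.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


labels = [1, 1, 0, 0]
under_segmented = [1.0, 0.0, 0.0, 0.0]   # misses one LV pixel (false negative)
over_segmented = [1.0, 1.0, 1.0, 0.0]    # adds one background pixel (false positive)

loss_under = tversky_loss(under_segmented, labels)
loss_over = tversky_loss(over_segmented, labels)
dice_like = tversky_loss(over_segmented, labels, alpha=0.5, beta=0.5)
print(loss_under, loss_over, dice_like)
```

With β larger than α, missing foreground pixels is penalized more heavily than including extra background, which is useful when the LV occupies only a small part of the image.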
One of the main problems that arise in DL architectures is the vanishing gradient. The research in [102] focuses on the vanishing gradient problem and proposes a residual-of-residual (ROR) U-Net model. The encoding path of the proposed model is similar to ResNet-U-Net, but three shortcut levels are introduced in the ResNet-U-Net model. First, 3 × 3 convolutions and zero padding are applied to the input image. At the second level, the identity and convolutional blocks of ResNet are divided into three branches, while at the third level, convolutional blocks and identity blocks are used. The proposed model was trained and tested using the Sunnybrook dataset and compared with the U-Net and ResNet-U-Net models. The model achieved a Jaccard index of 0.866, DSC of 0.926, precision of 0.923, false positive rate (FPR) of 0.120, and recall of 0.945.
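Precision, recall (sensitivity), specificity, and FPR, which several of the reviewed studies report, all derive from the pixel-wise confusion counts. The sketch below shows these relationships on flattened binary masks; the function name and the toy masks are hypothetical and are not taken from [102].

```python
def confusion_metrics(pred, truth):
    """Return pixel-wise metrics for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,        # sensitivity
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,           # 1 - specificity
    }


# Hypothetical flattened masks: 1 marks LV pixels, 0 marks background.
predicted_mask = [1, 1, 0, 0, 1, 0, 0, 0]
manual_mask = [1, 1, 1, 0, 0, 0, 0, 0]

metrics = confusion_metrics(predicted_mask, manual_mask)
print(metrics)
```

Specificity and FPR depend on the true-negative count, so they are dominated by the background when the LV is small relative to the image.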
The study in [103] used an unsupervised learning method to segment the LV, RV, and MYO. A U-Net model was trained using the ACDC dataset and the tetralogy of Fallot (TOF) dataset on the short-axis (SAX) view of MRI images. Then, using the transformation network, the model segments the LV, RV, and MYO from the SAX view. The model has never seen the SAX view of images before.
In [104], the localization and detection of the area containing the LV and RV were performed by a network known as the Left Ventricle Localization NET (LVLNET). This model is a lightweight encoder-decoder-like CNN. It has four layers in total, each containing two 3 × 3 kernels, batch normalization, and max pooling. This localization model identifies the central part containing the LV and RV. The image is cropped and then passed to the next proposed CNN model, called multigate dilated inception architecture (MGDIB). The MGDIB is based on the U-Net architecture, but the kernel weights are expanded by the dilation factor, which is called a dilated convolution. The number of parameters does not increase when using the dilated convolution. Two publicly available datasets, the LV segmentation challenge (LVSC) and ACDC, were used. LVSC is used in the first part for localization, while ACDC is used for the segmentation of LV and RV. Different clinical measurements such as EDV, ESV, EF, and LVM are also calculated. The DSC values obtained were 0.900 and 0.910 for ED and ES, while the HD values were 8.330 and 11.040 for ED and ES. Figure 8 depicts instances of segmented images by the U-Net model.

LV Segmentation Using Other CNN Networks.
Several studies also used CNN models other than FCN and U-Net. The details of other CNN models utilized for LV segmentation are explained in this section.

CNN Models with Preprocessing.
US imaging was also used for the heart analysis of children. For this analysis, the segmentation of LV and LA was performed on paediatric US images [105]. Preprocessing was applied to the images before training. The meaningless background was removed by resizing the images to 512 × 512. Furthermore, image augmentation was applied by rotation, random cropping, and adding salt-and-pepper and speckle noise with a probability of 0.01. To extract the spatial features, a spatial path module was designed. The spatial path is a convolutional network consisting of three layers with stride = 2, followed by batch normalization and the ReLU activation function. It extracts a large amount of low-level spatial information. A spatial attention submodule was added to exploit the interspatial information; it focuses on the "where" part of the information. The second, parallel part of the network extracts the contextual information using five convolutional layers. ResNet50 was used as the backbone network. A contextual attention submodule was used at the end of this part to refine the extracted information and to determine "what" information in an image is important.
Another hybrid approach based on a CNN and the double snake model was proposed to segment the LV from MRI images [107]. A SegNet architecture was used for the initial segmentation result. Then, an ROI was plotted around the coarsely segmented region by taking the center point of the segmented object, and a rectangular ROI was formed using the polar transform. The output from SegNet was fed to the snake models, which perform the final segmentation of the LV. For training the model, 45 subjects from the MICCAI challenge were used. DSC values of 0.96 and 0.97 were achieved for endo and epi. EF and LVM were calculated as clinical indices. Additionally, regression and Bland-Altman analyses were also performed.
Similarly, the work in [108] combines the ASM and a neural network to segment the LV. A diffusion filter was applied to the images as a preprocessing step before feeding the data to the model. This filter used eight neighboring edges to preserve the edge information along with noise removal. A CNN architecture, Faster-RCNN, was used to determine the position of the LV, and the ASM used this location to segment the LV. Because the ASM needs an initial position of the object, the region proposal network (RPN) was used to propose the regions that might contain the LV. Then, Faster-RCNN located the LV in the proposed ROI. Both RPN and Faster-RCNN were fine-tuned with ImageNet. A dataset of 30 patients was used for this work. The DSC, MAD, and HD were used to evaluate the performance. Furthermore, the model was also compared with other models proposed in the literature. The proposed model yields a DSC value of 0.921, MAD of 1.95, HD of 6.29 mm, and Jaccard of 0.86.
One more hybrid approach was proposed consisting of a CNN model and dynamic programming [109]. Initially, SegNet [110] with 17 stacked convolution layers was used for coarse segmentation, which segments the boundaries of the LV. Batch normalization, the ReLU activation function, and four max-pooling layers were used after the first four convolution layers. Secondly, the segmented results from SegNet were refined for the endocardial contour. In the last step, a dynamic programming model was used to calculate the epicardium and endocardium of the heart. The 900 subjects from Hubei hospitals were used for this study. Jaccard and DSC were used for the evaluation. DSC values of 0.90 (0.03) and 0.93 (0.02) were obtained for endo and epi, respectively. Similarly, average Jaccard values of 0.80 ± 0.06 and 0.76 ± 0.09 were obtained. The LV-EDV, LV-ESV, SV, EF, LVM in the diastolic phase, and LVM in the systolic phase were also measured. Bland-Altman analysis was performed for the comparison of these clinical indices.
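Bland-Altman analysis, used in several of the reviewed studies to compare automatically derived clinical indices with manual measurements, summarizes agreement by the mean difference (bias) and the limits of agreement, conventionally bias ± 1.96 standard deviations of the differences. The sketch below illustrates the calculation; the function name and the EF values are invented for illustration and are not data from any reviewed study.

```python
import statistics


def bland_altman(automatic, manual):
    """Return (bias, sd, lower_limit, upper_limit) of paired differences."""
    if len(automatic) != len(manual):
        raise ValueError("measurements must be paired")
    differences = [a - m for a, m in zip(automatic, manual)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return bias, sd, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical EF values (%) from a model and from an expert.
ef_automatic = [55.0, 60.0, 48.0, 62.0]
ef_manual = [57.0, 61.0, 50.0, 60.0]

bias, sd, lower, upper = bland_altman(ef_automatic, ef_manual)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

A negative bias indicates that the automatic method underestimates the index on average relative to the manual reference.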

CNN + LSTM.
A combination of an encoder-decoder network and LSTM was used in [111], with Fire dilated and D-Fire dilated layers as a replacement for standard convolutional layers. The Fire dilated modules add an extra dilation rate to the kernel by inserting zeros between the consecutive values of the kernel, and skip connections were applied to keep the temporal information of the image. Using the Fire dilated module, the network extracted more image information by adding extra parameters. Between the encoder and decoder, an LSTM module, which is a special RNN structure, is added. Along with propagating the characteristics, the LSTM also captures the temporal dependencies between consecutive frames. Images were preprocessed by cropping the image based on the ROI and resizing it to 80 × 80. A total of 2900 images from 145 subjects were used to evaluate the performance of the model, and two experts manually labeled the images. DSC, Jaccard distance, accuracy, and positive predictive value (PPV) were used as evaluation metrics. The model achieved a DSC of 0.960, Jaccard of 0.903, accuracy of 0.991, and PPV of 0.960 for LV. The proposed model was compared with simple Conv-Deconv, SegNet, FCN, and U-Net architectures.
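The idea of dilating a kernel by inserting zeros between its consecutive values can be made concrete with a small sketch. The helper `dilate_kernel` and the example weights are hypothetical and only illustrate the general mechanism of dilated convolution, not the Fire dilated module of [111].

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighboring weights of a 2D kernel."""
    rows, cols = len(kernel), len(kernel[0])
    out_rows = (rows - 1) * rate + 1
    out_cols = (cols - 1) * rate + 1
    dilated = [[0.0] * out_cols for _ in range(out_rows)]
    for i, row in enumerate(kernel):
        for j, weight in enumerate(row):
            dilated[i * rate][j * rate] = weight
    return dilated


# Illustrative 3 x 3 kernel with arbitrary nonzero weights.
kernel = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]

dilated = dilate_kernel(kernel, rate=2)
nonzero = sum(1 for row in dilated for w in row if w != 0.0)
for row in dilated:
    print(row)
print("size:", len(dilated), "x", len(dilated[0]), "nonzero weights:", nonzero)
```

With a dilation rate of 2, the 3 × 3 kernel covers a 5 × 5 window while still holding only nine weights, so the receptive field grows without changing the number of kernel weights.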
A segmentation-based deep multitask regression learning model (Indices-JSQ) was proposed in [112]. The model is mainly divided into two parts. The first part is a segmentation network named Img2Contour, and the second part is a multitask regression model (Contour2Indices). The segmentation model is based on a deep convolutional encoder-decoder architecture with three convolution layers. The ReLU activation function along with max pooling was employed. Feature maps were generated by the convolution layers with a kernel size of 5 × 5. This part segmented the LV and then passed the information to the next part of the model. The second part consists of an RNN with LSTM. Three parallel CNN architectures were used that differ in kernel size and pool size. For the first CNN model, the kernel size and pool size were 3 × 3 and 2 × 2; for the second model, they were 3 × 3 and 5 × 5; and the third model has the same kernel and pool size, i.e., 5 × 5. A dropout layer was used to avoid the overfitting problem. Information was passed to the LSTM, which further quantifies the indices. A total of 2900 short-axis views of 145 subjects were used for training. DSC and mean absolute error (MAE) were used for the performance evaluation, and the performance was compared with other CNN models. Area, dimension, and wall thickness were also calculated as clinical indices. The proposed model automatically calculates these indices, which is one of the major contributions of the model.
Tumor extraction using a convolutional LSTM network was performed in [113]. To demonstrate the generalization of the proposed ST-ConvLSTM model, it was applied to 4D ultrasound for LV segmentation. The model was trained on the publicly available 3D + time ultrasound dataset of the Challenge on Endocardial Three-dimensional Ultrasound Segmentation (CETUS), consisting of data from 15 patients. The proposed model achieved a DSC of 0.868 and 0.859 for the ED and ES phases, respectively.
The classification and segmentation of LV from multiview (A2C, A3C, A4C) US images were implemented in [114]. Initially, pyramid dilated dense convolution (PDDConv) was used to extract multilevel and multiscale features. The PDDConv network consists of batch normalization, ReLU, and dilated convolution. After extracting the features, hierarchical convolutional layers with LSTM recurrent units (hConvLSTM) were used for segmentation. Fully connected layers were used to perform the classification task using a 3D CNN. Data of three different views, i.e., A2C, A3C, and A4C, with 150 patients for each view, yielding a total of 450 patients' data consisting of 13,500 frames, were utilized for training and testing. Furthermore, the model was trained and tested using the publicly available CAMUS dataset, which has 1800 frames. To evaluate the performance of the models, the MAD, DSC, and HD metrics were used. A DSC of 0.92 was obtained for all of the A2C, A3C, and A4C views. A mean HD of 6.06 mm, 5.96 mm, and 6.06 mm and a mean MAD of 2.80 mm, 2.77 mm, and 2.83 mm were achieved for the A2C, A3C, and A4C views, respectively. The proposed model was compared with U-Net, ACNN, and U-Net++ and achieved better results. The EDV, ESV, and EF for the CAMUS dataset were also estimated using the segmented LV.
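Once the end-diastolic volume (EDV) and end-systolic volume (ESV) have been estimated from the segmented LV, the ejection fraction follows from the standard definition EF = (EDV − ESV) / EDV × 100. The sketch below shows only this arithmetic; the function name and the example volumes are hypothetical and are not taken from [114].

```python
def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes (mL)."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("ESV must lie between 0 and EDV")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Hypothetical volumes, for illustration only.
ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
print(f"EF = {ef:.1f}%")
```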

Alternative CNN Models.
A DL model was proposed to segment the LV and calculate indices such as cavity area, MYO area, cavity dimension, and wall thickness [115]. The model is named cascaded segmentation and regression network (CSRNet) and has two parts: a CNN model that segments the LV and a regression model that quantifies the LV metrics. The densely connected convolutional neural network (DenseNet) was employed to reduce the number of learnable parameters. The network mainly consists of three dense and three transition blocks. It generates three different probability maps for background, MYO, and cavity. The output of the last layer was fed to the regression component, a CNN model with three convolution layers and two fully connected layers. To train and test the network, 2900 images (145 subjects) were used. These images were split into 2320 for training and 580 for testing. Several preprocessing methods such as landmark labeling, rotation, ROI cropping, and resizing were also applied to the images. The DSC was calculated and compared with U-Net.
A good initialization is a key factor in optimizing a CNN model quickly. In [116], an initialization method was designed for a DCNN model to segment LV from MRI images. The model was trained and tested using two initialization methods: random initialization and Gabor filter initialization. Gabor filters can provide an accurate description of most spatial characteristics of simple receptive fields. Furthermore, the spectral and spatial domains are simultaneously optimized in these filters, which minimizes the number of features. The authors demonstrated that Gabor filter initialization requires less training data and reduces complexity because of the lower number of parameters. The York Cardiac Segmentation database (5011 images) was used for training. The model achieved a DSC of 0.798 with random initialization and 0.80 with Gabor initialization, and when the Gabor filters were maintained, the value further increased to 0.803.
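To make the idea of Gabor filter initialization concrete, the following sketch builds a small bank of real-valued Gabor kernels that could be copied into the first convolutional layer instead of random weights. It uses the standard Gabor formula (a Gaussian envelope multiplied by a cosine carrier); the kernel size, sigma, wavelength, aspect ratio, and number of orientations are assumed values and do not reproduce the settings of [116].

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter sampled on a size x size grid (size must be odd)."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(n_orientations=4, size=5, sigma=2.0, wavelength=4.0):
    """One kernel per orientation, evenly spaced over [0, pi)."""
    return [
        gabor_kernel(size, sigma, k * math.pi / n_orientations, wavelength)
        for k in range(n_orientations)
    ]


# Assumed configuration: four orientations of a 5 x 5 kernel.
bank = gabor_bank()
print(len(bank), "kernels of size", len(bank[0]), "x", len(bank[0][0]))
```

In a DL framework, each kernel in `bank` would be assigned to one output channel of the first convolution, either as a starting point for training or, if the filters are kept fixed, as a non-trainable layer.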
A dense V-Net model based on the V-Net architecture was proposed in [117]. A few dense layers were added to the original V-Net model to improve the performance. For training, data from 30 patients (86 frames, each containing 73 images) were collected and manually labeled by 3 experts. The improvement of the proposed model was shown by comparing it with U-Net, FCN, and V-Net. The proposed model achieved a DSC of 0.90.
A transformer-based [5] DL model was designed to handle sequential data. Transformers have mainly been used for natural language processing, and in [118] a transformer network is used to learn the image parameters. In the first part, the 3D LV volume (LVV) was passed to the transformer net, which consists of 3D Conv layers, batch normalization, ReLU, max-pooling layers, and fully connected layers. These layers extracted the transformation parameters from the LVV, which were then passed to AtlasNet, a new shape generation framework. The Atlas network has several advantages, such as improved precision and generalization capabilities and the possibility of generating a shape of arbitrary resolution without memory issues. AtlasNet consists of deformable layers and generates the 3D LV shape using the parameters obtained from the transformer. DSC, MSD, and HD were used for evaluation, and the model achieved 0.91 ± 0.027, 1.99 ± 0.64, and 8.92 ± 7.16, respectively.
CNN models such as FCN and U-Net focus on single-frame image processing. In contrast, a dense RNN was proposed in [119] to segment the LV from a four-chamber view of an MRI time sequence. An RNN can deal with sequential information. In an RNN, information from the previous cell is transmitted to the next LSTM cell, but the first cell does not receive any previous information. The proposed model used two RNN models. The first layer of the second RNN model, which performs the segmentation, receives the information from the first RNN model. In this experiment, data from 137 patients were used. The performance of the model was compared with state-of-the-art CNN models. The proposed model achieved an IoU of 92.13%. A few examples of LV segmented by CNN models are depicted in Figure 9.

Discussion
The performance of DL methods depends on various parameters, and the data processing time depends on the hardware. The details of the modified models and proposed architectures are explained in Section 4. In this section, some important data from the reviewed literature are presented. The section is divided into six subsections, and the information is conveyed in tabular form so that readers can have an overview of all important information related to imaging modality, architecture, hardware, software, database, and results.

Imaging Modality.
For the analysis of cardiac diseases, different imaging modalities have been used, such as US, CT, and MRI. Due to its high resolution, MRI is the gold standard and is the most widely used. On the other hand, US images are also highly recommended due to their ease of use and low cost. The third type of image used for cardiac analysis is CT.

Architectures.
Several CNN architectures are used by researchers for LV segmentation. The U-Net architecture is specifically designed for medical images; therefore, U-Net and its variants are the most widely used for the segmentation of LV. Besides the U-Net architecture, FCN is the second most used network architecture for LV segmentation. Table 1 shows the CNN architectures used for LV segmentation.

Hardware.
During the training process, neural networks learn millions of weights. It may take several days to train such a huge number of weights on CPUs. The training time is one of the parameters to focus on while implementing the models. Therefore, the hardware configuration plays an important role in the processing of DL models. An attractive option for DL is a GPU. The use of GPUs makes the training and testing process fast, so results can be obtained and compared in a short time. The hardware configurations used by authors for LV segmentation are listed in Table 2.

Software.
An appropriate software framework is necessary to execute complex DL architectures. Various frameworks have been used to implement LV segmentation through DL architectures. These frameworks are generally used with Python. Python is an open-source programming language that supports a remarkable set of easy-to-use library functions for the execution of DL models; therefore, Python is widely used in DL-based applications. The software frameworks described in this section are primarily used from Python. The most common among them are TensorFlow, Theano, Keras, CAFFE, Torch, and Deeplearning4j. A few researchers have also used MATLAB as a programming language. The software used by researchers is listed in Table 3.

Dataset.
The performance of DL models is highly affected by the dataset. The number of images or patients used to train and test the model is one of the key attributes of LV segmentation. Most researchers have used self-collected data, but several public datasets are also available. The details of the datasets are explained below and summarized in Table 4.
The Sunnybrook Cardiac Data come from the 2009 Cardiac MR Left Ventricle Segmentation Challenge. The data collection includes 45 cine-MRI images from a variety of different people and pathologies.
The ACDC database is made accessible to participants, after they register on a website dedicated to online evaluation, through two datasets. One dataset, referred to as the training dataset, contains 100 patients and manual references based on the analysis of one clinical expert. The second is a testing dataset consisting of fifty additional cases without manual annotations.
The STACOM dataset is made up of short-axis steady-state free precession cine MRI from the Cardiac Atlas Project database. In total, 100 individuals with postmyocardial infarction and coronary artery disease are included in the dataset. Every image contains a ground-truth annotation.
TWINS-UK is a volunteer register consisting of more than 12,000 twins. One thousand four hundred and sixty-eight consecutive female volunteers (mean age 62 ± 9 years) were recruited for this investigation. Each dataset had 12 to 14 short-axis cine images that were contiguous and evenly spaced from the atrioventricular (AV) ring to the apex, covering both ventricles.
The UKBB dataset comprises mostly a large number of healthy volunteers. By stacking a series of 2D cine images, 3D images of the LV and RV were created. LV, MYO, and RV were manually segmented in the ES and ED phases by 8 observers under the direction of 3 lead investigators, and one hundred subjects were chosen.
The Cardiac Atlas Project offers CMR for 95 individuals with coronary artery disease and mild-to-moderate left ventricular dysfunction from prospective, multicenter, randomised clinical studies. Sufficient slices along the short axis were collected to cover the whole heart in SAX. The manual segmentation of the myocardium was also included in these acquisitions.
The CETUS dataset came from 15 patients. Each patient had 13-46 3D volumetric imaging sequences, and each sequence had two manually segmented volumes at the end-diastole (ED) and end-systole (ES) phases. Figure 10 illustrates original and labeled images taken from four distinct datasets.

Results.
The segmentation performance of the models is evaluated using well-known evaluation metrics such as DSC, HD, and Jaccard distance, although some authors also used other metrics such as accuracy and sensitivity.
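The DSC [123] is an overlap-based metric calculated using equation (1). In the equation, $S_{GT}$ represents the ground-truth mask, which gives the original LV size and boundary, and $S_{Seg}$ is the mask segmented by the model. To calculate the DSC, twice the intersection region of the two masks is divided by the total region of both masks. The DSC ranges from 0 to 1, where 0 represents no similarity or overlap and 1 represents exact overlap:

$$\mathrm{DSC} = \frac{2\,\lvert S_{GT} \cap S_{Seg} \rvert}{\lvert S_{GT} \rvert + \lvert S_{Seg} \rvert} \tag{1}$$

The HD [124] is a spatial distance-based index that measures the "closeness" of two sets of points. The HD between two point sets $A$ and $B$ is defined by equation (2):

$$\mathrm{HD}(A, B) = \max\bigl(h(A, B),\, h(B, A)\bigr) \tag{2}$$

where $h(A, B)$ is the directed Hausdorff distance, which can be calculated by equation (3):

$$h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert \tag{3}$$

where $\lVert a - b \rVert$ is any norm, e.g., the Euclidean distance. The Jaccard measure [123] can be calculated using the formula presented in equation (4), which gives the ratio of the intersection to the union of the two masks (the Jaccard index $J$); the Jaccard distance in the strict sense is its complement, $1 - J$:

$$J(S_{GT}, S_{Seg}) = \frac{\lvert S_{GT} \cap S_{Seg} \rvert}{\lvert S_{GT} \cup S_{Seg} \rvert} \tag{4}$$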
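The overlap and distance metrics described above can be computed directly on binary masks. The following is a minimal sketch, assuming each mask is given as a non-empty set of (row, column) pixel coordinates and that the Euclidean norm is used for the HD; the helper `square` and the two toy masks are hypothetical and serve only to make the example self-contained. Published results often compute the HD on contour points and in millimetres, which this sketch does not attempt.

```python
import math


def dice(gt, seg):
    """Dice similarity coefficient between two sets of pixel coordinates."""
    if not gt and not seg:
        return 1.0
    return 2.0 * len(gt & seg) / (len(gt) + len(seg))


def jaccard_index(gt, seg):
    """Intersection over union; the Jaccard distance is 1 minus this value."""
    union = gt | seg
    if not union:
        return 1.0
    return len(gt & seg) / len(union)


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def square(top, left, side):
    """Hypothetical helper: pixel set of a filled square mask."""
    return {(r, c) for r in range(top, top + side) for c in range(left, left + side)}


gt = square(0, 0, 4)    # toy ground-truth mask, 16 pixels
seg = square(0, 1, 4)   # toy prediction, shifted right by one pixel

print("DSC     =", dice(gt, seg))
print("Jaccard =", jaccard_index(gt, seg))
print("HD      =", hausdorff(gt, seg))
```

For these toy masks the two overlap scores satisfy J = DSC / (2 − DSC), which is a quick consistency check when a study reports both values.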

Challenges and Future Outlook
This article shows that DL approaches have matched or outperformed the previous state-of-the-art LV segmentation techniques. DL algorithms are expected to completely replace the current LV segmentation techniques. Given this, it is reasonable to consider whether DL techniques can be directly applied to real-world applications to reduce medical practitioners' workload. However, there are still challenges in making the existing DL methods viable for real-time applications.
In medical images and, particularly, cardiac images, acquiring annotated images is the most prevalent challenge. As this article demonstrates, most of the research employed supervised learning, which necessitates a significant number of annotated images. Properly labeling the LV requires both specialised knowledge and a significant investment of time. As a result, annotated LV datasets are quite limited in comparison to publicly available datasets in other fields, such as natural images.
Moreover, the performance of DL on data that differs from the training dataset is another challenge. Even though a trained DL model is tested on unseen data, the training and testing data usually come from the same source, such as the same type of scanner. The model does not provide the anticipated outcome if new types of data, e.g., from multiple scanners or from patients with different diseases, are used to test it. To overcome this problem, a few studies have trained the model on LV segmentation data from different sources and scanners.
Also, the DL performance is highly dependent on the quality of the training images. Many imaging modalities such as CT and US are of low quality due to many factors such as speckle noise and poor contrast ratio. To produce high-quality images, many researchers use some sort of data preprocessing.
Further studies are therefore required to investigate methods for improving image quality, since better image quality may significantly boost the efficiency of the DL model and the accuracy of LV segmentation. There is a significant demand for a DL-based system that can improve image quality efficiently and effectively while simultaneously reducing noise. Combining segmentation with such enhancement methods is expected to make LV segmentation considerably more accurate.
As discussed above, one of the main challenges is the availability of large datasets. There is abundant new research aimed at alleviating the limited dataset size problem, and some LV datasets are publicly available. There is a pressing need for architectures and algorithms that are purpose-built for the segmentation of medical images, and therefore of the LV, and that also perform well when applied to limited datasets.

Conclusion
In this article, a comprehensive review of the literature focused on the analysis of cardiac images using DL for LV segmentation is presented. In the field of image processing, CNN, a subbranch of DL, has shown very promising results for different types of tasks including classification, object detection, and segmentation. CNN is also seen as a forward-looking approach, specifically in image processing. The application of CNN in medical images is extensive. Therefore, this work details and summarizes the uses of CNN for LV segmentation. The most common imaging modalities (MRI, US, and CT scan) were briefly introduced in the article. The basics of CNN architectures were also discussed to provide a better understanding of these models. Among the different CNN models, FCN, U-Net, and modified versions of these two are mostly used for LV segmentation. This work also gives a detailed discussion of the hardware, software, and datasets used for LV segmentation. The different evaluation metrics used for the performance analysis of the models were also discussed. A comparative summary was tabulated to ease comparison for the readers. This work lays a foundation for readers to gain an intuitive understanding of DL methods used for LV segmentation, specifically for medical and cardiac images.

Data Availability
All the data used to support the findings of the study are included within the article as references.

Conflicts of Interest
The authors declare that they have no conflicts of interest.