Multilevel Strip Pooling-Based Convolutional Neural Network for the Classification of Carotid Plaque Echogenicity

Carotid plaque echogenicity in ultrasound images has been found to be closely correlated with the risk of stroke in atherosclerotic patients. The automatic and accurate classification of carotid plaque echogenicity is of great significance for clinically estimating the stability of carotid plaques and predicting cardiovascular events. Existing convolutional neural networks (CNNs) can provide an automatic carotid plaque echogenicity classification; however, they require a fixed-size input image, while the carotid plaques are of varying sizes. Although cropping and scaling the input carotid plaque images is promising, it will cause content loss or distortion and hence reduce the classification accuracy. In this study, we redesign the spatial pyramid pooling (SPP) and propose multilevel strip pooling (MSP) for the automatic and accurate classification of carotid plaque echogenicity in the longitudinal section. The proposed MSP module can accept arbitrarily sized carotid plaques as input and capture a long-range informative context to improve the accuracy of classification. In our experiments, we implement an MSP-based CNN by using the visual geometry group (VGG) network as the backbone. A total of 1463 carotid plaques (335 echo-rich plaques, 405 intermediate plaques, and 723 echolucent plaques) were collected from Zhongnan Hospital of Wuhan University. The 5-fold cross-validation results show that the proposed MSP-based VGGNet achieves a sensitivity of 92.1%, specificity of 95.6%, accuracy of 92.1%, and F1-score of 92.1%. These results demonstrate that our approach provides a way to enhance the applicability of CNN by enabling the acceptance of arbitrary input sizes and improving the classification accuracy of carotid plaque echogenicity, which has a great potential for an efficient and objective risk assessment of carotid plaques in the clinic.


Introduction
Ischaemic heart disease and stroke are the leading causes of mortality and morbidity in upper-middle-income and high-income countries [1]. Most strokes and acute coronary syndromes are caused by the rupture of vulnerable atherosclerotic plaques [2], commonly due to the accumulation of fatty deposits at arterial bends and bifurcations. When carotid plaques rupture, atherothrombotic emboli consisting of clumps of platelet aggregates or plaque fragments may travel into the brain, occluding smaller arteries and resulting in a transient ischaemic attack (TIA) or stroke [3]. Ultrasound (US) imaging is a preferred modality for detecting carotid atherosclerotic plaques due to its advantages of being nonionizing, low cost, and convenient for monitoring plaque regression and progression in response to medical therapy [4,5]. Recent studies have shown that the echogenicity of carotid plaques is associated with their vulnerability [6,7]. Echolucent plaques are more vulnerable due to their large lipid cores and thin fibrous caps, while echo-rich plaques are stable because they mainly consist of calcifications and fibrotic tissue [8,9]. The classification of US carotid plaque echogenicity can provide valuable information regarding vulnerable plaques and their risks of causing cerebrovascular events [10][11][12][13]. Thus, it is of great significance to identify the echogenicity of carotid plaques, which may contribute to the risk assessment of carotid plaques and help predict cerebrovascular events. However, because of the speckle inherent in US images of carotid plaques, the complexity of tissue appearances, and the visual similarity of plaques with different echogenicities, identifying the echogenicity of carotid plaques is tedious and operator-dependent for expert observers, and accurate classification is challenging.
To tackle this challenge, researchers have made some attempts to classify carotid plaque echogenicity using traditional methods based on one or more handcrafted features. Irie et al. showed that the greyscale median (GSM) was a useful and objective metric for the assessment of carotid plaque echogenicity for the prediction of cardiovascular events in diabetic patients [14]. GSM was also used as an important feature for the identification of patients with histologically unstable carotid plaques [15]. Prahl et al. used a semiautomated method to evaluate echogenicity (SAMEE) based on a percentage white (PW) feature metric [16]. In [17], a method that combines texture features and morphological characteristics for the assessment of carotid plaque echogenicity was proposed. In [18], a bimodal gamma distribution was proposed to model the pixel statistics in the greyscale images of carotid plaques; the most discriminative features (MDFs) were extracted from the discrete Frechet distance features (DFDFs) of each carotid plaque based on the statistical model to classify the carotid plaques into three types, and a classification accuracy of 77.5% was achieved. In [19], the integral value obtained by calculating the area under the cumulative probability distribution curve (AUCPDC) was adopted to evaluate carotid plaque echogenicity. The classification accuracy for 125 plaques (43 echo-rich, 35 intermediate, and 47 echolucent plaques) was 78.4%, whereas that of the GSM was 64.8%. All these methods have shown great potential for carotid plaque classification. However, their classification accuracies were not high because they used handcrafted features that cannot fully and accurately reflect the complicated intrinsic features of carotid plaques.
Compared to handcrafted features, conventional convolutional neural networks (CNNs), such as VGGNet [20], GoogLeNet [21], and ResNet [22], have been shown to be powerful tools for automatically extracting intrinsic features from medical images [23][24][25]. A deep convolutional neural network was trained using 129,450 clinical images of skin disease to classify skin lesions in [26]. A multiorgan CAD system based on CNNs was developed for classifying both thyroid and breast nodules and investigating the impact of this system on the diagnostic efficiency of different preprocessing approaches [27]. A deep residual network was applied to automatically extract features of carotid ultrasound images and identify the carotid plaques in the images [28]. A convolutional neural network was built to automatically extract features from carotid ultrasound images for the identification of different plaque components in [29]. Conventional CNN classifiers require input images of a fixed size (e.g., 224 × 224), which conflicts with the varying sizes of carotid plaques. Although it is promising to transform carotid plaques of arbitrary sizes to a uniform size by cropping and scaling, as shown in Figure 1, this will result in geometric distortion or changes in the spatial texture features, which may negatively impact the classification accuracy of the utilized model. He et al. [30] showed that spatial pyramid pooling (SPP)-based CNNs could remove the imposed fixed-size constraint and achieve outstanding accuracy in classification and object detection tasks. However, SPP pools the input feature maps with square windows, which suits symmetrical structures such as the lumen of the artery but limits the flexibility in capturing the anisotropic context that widely exists in carotid plaques. Carotid plaques are mainly formed by an accumulation of lipid and inflammatory deposits in the subintimal space of the arterial wall; during this process, affected by the hemodynamics in the vascular lumen, most carotid plaques become long strips in the longitudinal section of carotid ultrasound (e.g., the carotid plaques in Figures 2(e) and 2(f)). The pooling operation using square windows in SPP cannot overcome this limitation efficiently because it inevitably includes contaminating information from irrelevant areas.
To this end, in this study, we considered that most carotid plaques in longitudinal-view ultrasound images are stripe-like structures and proposed MSP for the classification of carotid plaques. MSP pools the feature maps using multilevel strip-shaped windows and obtains fixed-length outputs, which are then fed into the fully connected layers. MSP not only inherits the merit of SPP, namely, accepting input images of any size, but also enlarges the receptive field and effectively captures the long-range context in the longitudinal view to improve the classification accuracy.
The main contributions of our work can be summarized as follows.
We investigate the design of the SPP module and propose an MSP module, which inherits the advantage of SPP that can accept input images of arbitrary size and overcomes the limitation of SPP to more effectively capture long-range context to improve the classification performance.
Furthermore, we present an MSP-based CNN for the automatic and reliable classification of carotid plaque echogenicity, which achieves significant improvements over the baseline VGG16, SPP-based CNN, and other popular CNNs and enables efficient stability estimations of carotid plaques so that clinicians can make suitable diagnostic schemes.

Computational and Mathematical Methods in Medicine
A large-scale clinical carotid US dataset was established for carotid plaque classification. The dataset includes 1463 carotid plaque US images, which consist of three different carotid plaque types according to their echogenicity. Each carotid plaque in this dataset has a classification label and its corresponding region of interest (ROI).
The remainder of this paper is organized as follows. Section 2 introduces the preparation of the dataset and the structures of the SPP module, the proposed MSP module, and the MSP-based CNN. Section 3 describes the experimental setup and utilized classification metrics and presents the experimental results and discussion, and conclusions are given in Section 4.

Data Acquisition and Preparation
2.1.1. Data Acquisition. In this study, a total of 1463 US images of carotid plaques were acquired from 925 patients in Zhongnan Hospital of Wuhan University by expert sonographers who have decades of experience in vascular imaging. An Acuson SC2000 (Siemens, Erlangen, Germany) US system equipped with a 5-12 MHz linear array probe (9L4) was used to acquire carotid US images. This study was approved by the Institutional Review Board (IRB) of the Medical School, Wuhan University, and written informed consent was obtained from all patients. During the acquisition process, the subjects were supine, and their heads were tilted back. The probe was positioned perpendicular to each patient's neck, moving slowly along the carotid arteries. After a carotid plaque was identified, longitudinal images of the carotid plaque in the common and internal carotid arteries were acquired.

Data Preparation
(1) Image Normalization. The appearances of carotid US images vary due to different image acquisitions depending on the equipment, operator, patient, and US machine settings. Consequently, it is important to develop methods that can address the variability in the appearances of tissues in US images. Traditionally, image normalization is used to overcome this limitation, i.e., by transforming the image data such that the same tissues have approximately similar intensity values. In this paper, the proposed deep learning network can extract high-level features from carotid US images; therefore, these features are less sensitive to image normalization. To improve the comparability of the images and the reliability of our results, we applied a linear scaling operation between the minimum and maximum values of the images as a standard processing method for normalization (without the need for any user interaction). The normalization formula is defined by $(x - x_{\min})/(x_{\max} - x_{\min})$, where $x$ is the pixel value of a carotid plaque US image and $x_{\min}$ and $x_{\max}$ are the minimum and maximum pixel values of this carotid plaque US image, respectively.
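The linear scaling above can be sketched in a few lines of plain Python. This is only an illustration: the function name `min_max_normalize`, the nested-list image representation, and the handling of a constant image (which the formula leaves undefined) are assumptions, not part of the described pipeline.

```python
def min_max_normalize(image):
    """Linearly rescale a 2D list of pixel values to the range [0, 1]."""
    flat = [pixel for row in image for pixel in row]
    x_min, x_max = min(flat), max(flat)
    if x_max == x_min:
        # Constant image: the formula divides by zero here.
        # Returning all zeros is an assumption made for this sketch.
        return [[0.0 for _ in row] for row in image]
    scale = x_max - x_min
    return [[(pixel - x_min) / scale for pixel in row] for row in image]


roi = [[10, 20], [30, 50]]
print(min_max_normalize(roi))
```

Each ROI is normalized with its own minimum and maximum, so the result does not depend on the intensity range of other images.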
(2) Groundtruth Data. Our groundtruth data were generated according to the criteria of the European carotid plaque study group, which classified carotid plaque echogenicity into three different types: echo-rich, intermediate, and echolucent [31]. This classification was performed by an expert clinician (coauthor F.W.) with at least a decade of experience in the assessment of atherosclerosis using carotid US images, who first classified the 1463 plaques into three categories based on their echogenicity (echo-rich, intermediate, and echolucent) and reclassified them three months later. The kappa value (κ = 0.747) was calculated to demonstrate that the two classifications had high intraobserver agreement. For the 232 controversial plaques, Dr. Wang classified them a third time, and the label on which two of the three classifications agreed was taken as the final result. Ultimately, the groundtruth data included 335 echo-rich plaques, 405 intermediate plaques, and 723 echolucent plaques among all 1463 plaque images.
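As an illustration of the agreement analysis, the sketch below computes a kappa value for two rounds of labels and resolves a disputed plaque by a two-out-of-three vote. It assumes the reported value is Cohen's kappa, which the text does not state explicitly; the function names and the `None` result when all three labels differ are hypothetical choices for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined (division by zero) if both rounds use a single identical label.
    return (p_o - p_e) / (1 - p_e)


def majority_label(first, second, third):
    """Return the label given at least twice, or None if all three differ."""
    label, votes = Counter([first, second, third]).most_common(1)[0]
    return label if votes >= 2 else None


round_1 = ["echo-rich", "echo-rich", "echolucent", "echolucent"]
round_2 = ["echo-rich", "echo-rich", "echolucent", "echo-rich"]
print(cohen_kappa(round_1, round_2))
print(majority_label("echo-rich", "intermediate", "echo-rich"))
```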
(3) Manual Segmentation of Plaques. Due to the large sizes of the acquired images and the fact that the area outside each plaque did not contain critical related information, the boundaries of the plaque were manually delineated for each image by the same expert clinician, and then, the ROI containing the segmented plaque was saved, as shown in Figure 1. An automatic segmentation method for carotid plaques is being studied by another member of our laboratory [32]. Because segmentation is not the focus of this study, the manual segmentation results were used as the ROIs. These ROIs containing plaques vary in size, with the largest being 134 × 564 (h × w) and the smallest being only 19 × 29 (h × w). Table 1 shows the statistical distribution and sizes of the samples per class for training and testing obtained from one fold of the 5-fold cross-validation.

Spatial Pyramid Pooling (SPP) Module.
Before describing the design of MSP, we first briefly review the structure of the SPP module, as depicted in Figure 3. In this module, the pooling operations are performed at a pyramid level of $a_n \times a_n$ bins on the $k$ feature maps obtained after the last convolutional layer, each of size $m_h \times m_w$. The size of the sliding pooling window is $(\lceil m_h/a_n \rceil, \lceil m_w/a_n \rceil)$, and the stride is $(\lfloor m_h/a_n \rfloor, \lfloor m_w/a_n \rfloor)$, where $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denote the ceiling and floor operations. The number of outputs of the $j$-level pyramid pooling layer is $k\sum_{n=1}^{j} a_n^2$, where $k$ is the number of filters in the last convolutional layer and $a_n^2$ denotes the number of bins at level $n$ of the pyramid pooling layer. As an example, a 3-level pyramid pooling layer $\{3 \times 3, 2 \times 2, 1 \times 1\}$ results in 14 bins. Finally, the outputs of the 3-level pyramid pooling layer are concatenated to obtain a fixed-dimensional vector of length $k\sum_{n=1}^{3} a_n^2 = 14k$ ($a_1 = 1$, $a_2 = 2$, $a_3 = 3$), which is input into the fully connected layer to obtain the classification results.
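The window, stride, and output-length formulas of SPP can be checked numerically with a short sketch. The function names are hypothetical, and the values 13 × 13 and k = 512 are only example inputs chosen for illustration.

```python
import math


def spp_window_and_stride(m_h, m_w, a):
    """Window and stride for one pyramid level of a x a bins."""
    window = (math.ceil(m_h / a), math.ceil(m_w / a))
    stride = (m_h // a, m_w // a)  # floor division
    return window, stride


def spp_output_length(k, levels):
    """Length of the concatenated SPP vector for k feature maps."""
    return k * sum(a * a for a in levels)


# Example inputs (assumed for illustration): a 13 x 13 feature map, k = 512.
print(spp_window_and_stride(13, 13, 3))
print(spp_output_length(512, (1, 2, 3)))
```

The output length depends only on the number of feature maps and the chosen pyramid levels, which is why the fully connected layer can have a fixed input size.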

Multilevel Strip Pooling (MSP) Module.
SPP can generate a fixed-length representation that does not depend on the size of the input image. However, it pools the feature maps using square windows to collect context, which would inevitably contain contaminating information from irrelevant regions. This is especially true for long-strip targets such as carotid plaques in the longitudinal section of the carotid ultrasound images. Thus, inspired by [33], we designed a novel MSP module to alleviate the above problem. It uses multilevel strip-shaped windows to enlarge the receptive field and performs strip pooling to allow the collection of long-range contexts, as shown in Figure 4.
Let the size of the $k$ feature maps obtained from the previous convolution layer be $m_{ih} \times m_{iw}$ ($i = 1, 2, \cdots, N$), where $N$ is the sample size of the dataset. In the $n$th-level strip pooling of $a_n \times b_n$ strips, we adopt adaptive average pooling with a kernel $(k_h, k_w)$ and stride $(s_h, s_w)$ to obtain the output. The kernel $(k_h, k_w)$ and stride $(s_h, s_w)$ can be calculated as follows:
$$k_h = \lceil m_{ih}/a_n \rceil, \quad k_w = \lceil m_{iw}/b_n \rceil, \quad s_h = \lfloor m_{ih}/a_n \rfloor, \quad s_w = \lfloor m_{iw}/b_n \rfloor.$$
Then, $j$-level strip pooling operations are performed on each feature map (the response of each filter) using a strip-shaped window in the horizontal or vertical dimension. Similar to those obtained with spatial pyramid pooling, the output vectors $v_o$ after $j$-level strip pooling can be written as
$$v_o = k \sum_{n=1}^{j} a_n \times b_n.$$
Here, $j$ is the number of levels, and $k$ denotes the number of filters of the last convolutional layer in the backbone network. Thus, the MSP layer pools the features and generates fixed-length vectors, which are then fed into the fully connected layers for the classification of carotid plaques. As an example, a 3-level strip pooling layer with strips of $\{1 \times 1, 2 \times 1, 3 \times 1\}$ in the horizontal dimension results in 6 strips. Then, we concatenate the outputs of the 3-level strip pooling layer to obtain a fixed-dimensional vector of length $k\sum_{n=1}^{3} a_n \times b_n = 6k$ ($a_1 = 1$, $a_2 = 2$, $a_3 = 3$; $b_1 = b_2 = b_3 = 1$) and input it into the fully connected layer for the classification.
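A minimal sketch of horizontal multilevel strip pooling on a single feature map is given below. It is not the authors' implementation: the function name is hypothetical, the feature map is a plain nested list, and the strip boundaries use the floor/ceil partition commonly used by adaptive average pooling (an assumption made so that every level yields exactly $a_n$ strips for any input height).

```python
def horizontal_strip_pool(feature_map, levels=(1, 2, 3)):
    """Average-pool one 2D feature map into sum(levels) full-width strips."""
    height = len(feature_map)
    pooled = []
    for a in levels:
        for i in range(a):
            # Strip i covers rows [start, end) and the full width.
            start = (i * height) // a                 # floor(i * H / a)
            end = ((i + 1) * height + a - 1) // a     # ceil((i + 1) * H / a)
            values = [v for row in feature_map[start:end] for v in row]
            pooled.append(sum(values) / len(values))
    return pooled


small = [[1, 1], [3, 3]]                    # 2 x 2 feature map
wide = [[0.5] * 35 for _ in range(8)]       # 8 x 35 feature map
print(horizontal_strip_pool(small))
print(len(horizontal_strip_pool(wide)))
```

Both inputs produce six values per feature map, so stacking the results of all $k$ feature maps gives the $6k$-dimensional vector described above, regardless of the input size.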
It should be noted that Figure 4 only shows multilevel strip pooling in the horizontal dimension. In fact, each feature map can be pooled by the MSP module using a multilevel strip window to average all the feature values in the horizontal, vertical, or both dimensions. In image classification, we can flexibly choose multilevel strip pooling in the horizontal dimension, the vertical dimension, or both, according to the structural characteristics of the target object in the image. The outputs of the multilevel strip pooling operations are fixed-dimensional vectors, which contain global and local informative contexts. In this study, since we collected carotid plaques in the longitudinal section of carotid US images, the long-range context in the horizontal dimension is more informative. According to the results of the preliminary experiment, and to keep the MSP module efficient and lightweight, we adopt MSP operations only in the horizontal dimension to capture the multilevel long-range context of carotid plaques. For example, in Figure 4, the vector bounded by the red box in the outputs is obtained by pooling the horizontally long-range area (enclosed by the red box) of the feature maps using one of the 3 × 1 strips in the 3rd-level strip pooling. Compared to SPP, MSP uses a long but narrow kernel instead of a square window for pooling, which focuses on acquiring long-range context in the horizontal dimension and avoids building unnecessary connections in the vertical dimension. Furthermore, the module is an add-on building block that can be plugged into the backbone of any network. In the following, we describe the structure of the proposed MSP-based CNN for the classification of carotid plaque echogenicity.

MSP-Based CNN for the Classification of Carotid Plaque Echogenicity.
The VGG model is one of the most popular deep learning networks because it reinforces the notion that CNNs must have a deep network of layers for a hierarchical representation of visual data to be possible. Although many follow-up works have improved upon the VGG architecture, in this work, we used the VGG network with a simple structure as the backbone to build the MSP-based CNN.
The structure of MSP-based VGGNet, as shown in Figure 5, consists of two main components. One component, which is mainly used for image feature extraction, employs the same 5 convolution and pooling blocks as VGG16 but omits the pooling layer after the last convolution layer. Each block has multiple convolution layers (with rectified linear unit (ReLU) activation), which use 3 × 3 filters with strides and paddings of 1, along with 2 × 2 max-pooling layers with strides of 2. The convolution layers operate in a sliding window manner to perform feature extraction on the input carotid plaque images of arbitrary sizes and generate feature maps of any size. The other component is the MSP module, followed by the fully connected layer and Softmax layer. The MSP module performs multilevel strip pooling on the acquired feature maps of arbitrary sizes to obtain a fixed-size feature representation, which is then input into the fully connected layer for carotid plaque echogenicity classification.
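To see why the feature maps vary in size, the sketch below traces the spatial size of an ROI through the backbone. It assumes that the 3 × 3 convolutions with stride 1 and padding 1 preserve the spatial size, that four 2 × 2 max-pooling layers remain (five blocks minus the final pooling layer), and that pooling rounds down; the function name is hypothetical.

```python
def vgg_feature_map_size(height, width, num_pools=4):
    """Spatial size after num_pools 2 x 2 max-pooling layers with stride 2."""
    for _ in range(num_pools):
        # Size-preserving convolutions are skipped; each pooling halves
        # the size, rounding down (floor mode is an assumption).
        height //= 2
        width //= 2
    return height, width


print(vgg_feature_map_size(134, 564))  # largest ROI reported in the dataset
print(vgg_feature_map_size(19, 29))    # smallest ROI reported in the dataset
```

Under these assumptions the largest and smallest ROIs yield feature maps of very different sizes, which is the situation the fixed-length MSP output is designed to handle.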
To prevent the model from overfitting, we use publicly available weights for VGG16, pretrained on the ILSVRC12 challenge dataset, and fine-tune them through transfer learning [34] for our purpose. Meanwhile, a dropout layer [35] is added to the network before the last fully connected layer. The feed-forward operation in the network with dropout is shown in Equations (3)-(6). Here, the Bernoulli function randomly generates a vector of 0s and 1s. $z^{(l)}$ denotes the vector of inputs into layer $l$, and $y^{(l)}$ denotes the vector of outputs from layer $l$. $w^{(l)}$ and $b^{(l)}$ are the weights and biases at layer $l$ [35].
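For reference, the standard dropout feed-forward pass of [35] can be written as below. This is a restatement in the notation of the paragraph above rather than a verbatim copy of Equations (3)-(6); the symbols $r^{(l)}$ (the random mask), $p$ (the retention probability), $f$ (the activation function), and $*$ (the element-wise product) are introduced here for illustration.

```latex
\begin{align}
r^{(l)}_j &\sim \mathrm{Bernoulli}(p), \\
\tilde{y}^{(l)} &= r^{(l)} * y^{(l)}, \\
z^{(l+1)}_i &= w^{(l+1)}_i \tilde{y}^{(l)} + b^{(l+1)}_i, \\
y^{(l+1)}_i &= f\left(z^{(l+1)}_i\right).
\end{align}
```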

Results and Discussion
In this section, we implement MSP-based VGGNet for the classification of carotid plaque echogenicity on the collected dataset, which was labelled with three types (echo-rich, intermediate, and echolucent). We used an open-source deep learning framework, PyTorch, for training and testing the proposed network and popular CNNs for comparison purposes. All training and testing procedures were performed on an Ubuntu 64-bit desktop personal computer with an Intel Core i9-10900K central processing unit (CPU) and 32 GB of random access memory. An NVIDIA RTX 2080 Ti graphical processing unit (GPU) with CUDA 10.1 was used for acceleration.
The cross-entropy function was used as the cost function, and the stochastic gradient descent (SGD) optimizer was adopted to minimize the cost function [36]. The number of iterations was 30, the momentum was 0.9, and the learning rate was set to 0.001, which was reduced by a factor of 10 after every 6 iterations.
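The step-wise learning rate decay described above can be sketched as follows. The sketch assumes that the "iterations" are training epochs indexed from 0 and that the decay is applied at epochs 6, 12, 18, and 24; the function name is hypothetical.

```python
def step_learning_rate(epoch, base_lr=0.001, step=6, gamma=0.1):
    """Learning rate at a given epoch: divided by 10 after every 6 epochs."""
    return base_lr * gamma ** (epoch // step)


for epoch in (0, 5, 6, 12, 29):
    print(epoch, step_learning_rate(epoch))
```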
During the training and testing phases, we used batch data to train the network. The batch data needed to be consistent in all dimensions because the batch array had to be converted into a tensor during the training and testing phases. Consequently, the batch size had to be set to 1 when SPP-based VGGNet and MSP-based VGGNet accepted images of arbitrary sizes as inputs.

Evaluation Metrics.
The performances of the networks in terms of carotid plaque classification were evaluated using the accuracy, sensitivity (recall), specificity, precision, and F1-score metrics, which are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\text{Sensitivity} = \frac{TP}{TP + FN},$$
$$\text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}},$$
where TP, FP, TN, and FN represent the numbers of true positive, false positive, true negative, and false negative cases, respectively. Sensitivity measures the ability to correctly recognize positive cases, while specificity indicates the ability to correctly classify negative cases. Precision denotes the proportion of cases classified as positive that are truly positive, and the F1-score is the harmonic mean of precision and recall, which is typically used when a model must balance the two.
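For a three-class problem, these metrics are usually computed per class in a one-vs-rest manner from the confusion matrix. The sketch below illustrates this; the function names, the `cm[true][predicted]` layout, and the example counts are hypothetical and are not taken from the reported results.

```python
def _safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class has no relevant cases."""
    return numerator / denominator if denominator else 0.0


def per_class_metrics(cm, c):
    """One-vs-rest metrics for class c; cm[t][p] counts true t predicted p."""
    total = sum(sum(row) for row in cm)
    tp = cm[c][c]
    fn = sum(cm[c]) - tp                      # class c predicted as another
    fp = sum(row[c] for row in cm) - tp       # other classes predicted as c
    tn = total - tp - fn - fp
    sensitivity = _safe_div(tp, tp + fn)
    precision = _safe_div(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": _safe_div(tn, tn + fp),
        "precision": precision,
        "f1": _safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


def overall_accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return _safe_div(correct, sum(sum(row) for row in cm))


# Made-up counts: rows are true classes, columns are predicted classes.
example_cm = [[8, 2, 0], [1, 7, 2], [0, 1, 9]]
print(per_class_metrics(example_cm, 0))
print(overall_accuracy(example_cm))
```

Averaging the per-class values over the three classes gives overall mean figures of the kind reported as SEN, SPE, PRE, and F1-score.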

Experimental Results.
We designed three experiments. The first investigated the effects of various levels and pools in the SPP and MSP modules in order to choose the best configuration. The second demonstrated the effectiveness of MSP-based VGGNet for the classification of carotid plaque echogenicity by comparing it with the baseline network VGG16 and SPP-based VGGNet. The third compared it with other popular CNNs.

Selection of the Levels and Pools in MSP and SPP Modules.
To verify whether the number of levels and pools affects the experimental results, we explored the effect of a 4-level strip pooling layer with strips of {1 × 1, 2 × 1, 3 × 1, 4 × 1}, namely, MSP-1234, and three 3-level strip pooling layers with strips of {1 × 1, 2 × 1, 3 × 1}, {1 × 1, 2 × 1, 4 × 1}, and {2 × 1, 3 × 1, 4 × 1}, namely, MSP-123, MSP-124, and MSP-234, respectively. The settings and outputs are described in Table 2, and the results are presented in Figure 6(a). The accuracy of MSP-123 reached 0.921, which was slightly higher than that in the other cases. More levels, such as in MSP-1234, or more strips, such as in MSP-124 and MSP-234, provided very little in terms of performance gains. This may be due to sufficient long-range information being collected with MSP-123. A similar pooling configuration was applied in SPP-based VGGNet. A 4-level SPP layer with pools of {1 × 1, 2 × 2, 3 × 3, 4 × 4}, namely, SPP-1234, and three 3-level SPP layers with pools of {1 × 1, 2 × 2, 3 × 3}, {1 × 1, 2 × 2, 4 × 4}, and {2 × 2, 3 × 3, 4 × 4}, namely, SPP-123, SPP-124, and SPP-234, respectively, were verified in SPP-based VGGNet. The settings used are also shown in Table 2. The results are depicted in Figure 6(b), which shows that the accuracy of SPP-123 was also slightly higher than that in the other cases, especially in the early epochs (below 7). There were no significant differences between the accuracies of the other two pools in the 3-level spatial pyramid pooling layer and the 4-level spatial pyramid pooling layer.
As a result, considering the runtime cost, we adopted a 3-level strip pooling layer with strips of {1 × 1, 2 × 1, 3 × 1} in MSP-based CNNs (MSP-123) and a 3-level SPP layer with pools of {1 × 1, 2 × 2, 3 × 3} in SPP-based CNNs (SPP-123) as the default settings in the following experiments.
The 5-fold cross-validation procedure was adopted to obtain more impartial and unbiased results. To evaluate the performance of the proposed MSP-based VGGNet, we compared it with the baseline VGG16 network and SPP-based VGGNet, and the results are shown in Figure 7 and Tables 3 and 4. From Figure 7, it can be seen that the proposed MSP-based VGGNet obtained the highest accuracy, and the testing process was stable and converged quickly. The performance metrics obtained on our dataset other than accuracy are shown in Tables 3 and 4, where SEN_ER, SEN_IM, and SEN_EL represent the sensitivities of the networks to the echo-rich plaques, intermediate plaques, and echolucent plaques, respectively, and SPE_ER, SPE_IM, SPE_EL, PRE_ER, PRE_IM, PRE_EL, F1-score_ER, F1-score_IM, and F1-score_EL denote the respective specificities, precisions, and F1-scores. SEN represents the overall mean classification sensitivity, which combines SEN_ER, SEN_IM, and SEN_EL for the three different types of plaques, and SPE, PRE, and F1-score represent the corresponding overall mean specificity, precision, and F1-score, respectively. Tables 3 and 4 show that our proposed MSP-based VGGNet performed better than VGG16 and SPP-based VGGNet in terms of classifying the three types of plaques. On the echo-rich plaques, the mean sensitivity of MSP-based VGGNet over the 5-fold cross-validation was 95.9 ± 2.3%, which surpassed those of VGG16 and SPP-based VGGNet by 17.9% and 1.1%, respectively. The mean specificities, precisions, and F1-scores were also higher than those of VGG16 and SPP-based VGGNet. This finding was also evident for the intermediate plaques.
On the echolucent plaques, although VGG16 provided the best mean sensitivity (92.9 ± 2.3%), it had relatively low specificity (79.4 ± 2.8%), precision (82.9 ± 1.8%), and F1-score (87.6 ± 1.7%). By comparison, our MSP-based VGGNet not only provided the second-ranked sensitivity (92.8 ± 2.0%), which was very close to the best sensitivity (92.9 ± 2.3%) with no statistically significant difference between them (p = 0.4), but also obtained the best specificity (93.3 ± 2.0%), precision (93.2 ± 1.8%), and F1-score (93.0 ± 1.2%). Moreover, the overall average sensitivity of 92.1 ± 1.3%, specificity of 95.6 ± 0.5%, precision of 92.0 ± 0.5%, and F1-score of 92.1 ± 0.8% obtained by our MSP-based VGGNet are higher than those of VGG16 and SPP-based VGGNet, which also demonstrates the superiority of the proposed method.
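The 5-fold cross-validation protocol behind these mean ± standard deviation figures can be sketched as a simple index split. The sketch makes several assumptions that the text does not specify: the samples are shuffled with a fixed seed, the folds are not stratified by class, and the function name is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(1463, k=5)
for train, test in splits:
    print(len(train), len(test))
```

Each plaque appears in exactly one test fold, and a metric is then reported as the mean and standard deviation of its five per-fold values.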
Finally, comparisons of the training and testing times of the three tested networks are provided in Table 5. Our MSP-based VGGNet spent the least time in both the training and testing phases because it has fewer parameters and a lower computational cost.
Figure 8 shows a comparison of our proposed network with several popular CNNs. Our MSP-based VGGNet achieved an accuracy higher than 0.9, which was much better than those of all the popular networks. Meanwhile, our proposed network converged after approximately 5 epochs on the test set, which was faster than the other networks, and its training process was more stable. The compared popular CNNs had similar classification performance: ResNet50 had a slightly higher accuracy, while the latest EfficientNet-b7 had a slightly lower accuracy. This indicates that a more complex network architecture does not necessarily yield better classification performance. Our specially designed network had a simpler architecture but should be more suitable for the classification of specific medical images than complex heavyweight networks in the case of a small dataset. Figure 9 shows the confusion matrices of ResNext50 [37], DRN-d22 [38], MobileNet-v2 [39], DenseNet121 [40], EfficientNet-b7 [41], and MSP-based VGGNet for the classification of the three types of carotid plaques using 5-fold cross-validation. From Figure 9, it is apparent that our proposed network provided the best classification rates for the

Discussion
The accurate and objective classification of carotid plaque echogenicity is crucial for stroke risk assessment and for the planning of optimal treatment strategies. In this study, we proposed an MSP-based CNN for the classification of carotid plaque echogenicity, which differs from previous work. In particular, previous classification methods [16][17][18][19] identified different types of carotid plaques using handcrafted features, whose performance was limited by their inability to comprehensively represent the complicated features of carotid plaques. Meanwhile, obtaining these handcrafted features required professional domain knowledge and manual intervention, which limits the applicability of these methods to other classification tasks. In contrast, the proposed approach can automatically extract low- and high-level features from a large number of carotid plaques and serve the classification purpose without requiring manual intervention, suggesting its utility in research and clinical studies. Compared to the popular CNNs, the proposed MSP-based CNN can accept carotid plaques of arbitrary sizes as inputs, while popular CNNs need to transform the input images to a uniform size by cropping and scaling, which causes content loss or distortion and hence has a negative impact on the classification accuracy. Although the widely used SPP network can also accept input images of any size, its ability to exploit anisotropic contextual information is limited since only square kernel shapes are applied. In contrast, our MSP-based CNN has two advantages. First, it enlarges the receptive field by strip pooling and has fewer network parameters, which results in a lower computational cost. Second, considering that the ultrasound images of carotid plaques in the longitudinal section generally show stripe-like structures and that the greyscale distribution is anisotropic, the proposed MSP-based CNN adopts multilevel strip pooling in the horizontal dimension to capture more accurate context, which helps improve the classification accuracy of carotid plaques. Experimental results show that the proposed MSP-based CNN is superior in terms of discriminating the three different types of carotid plaques compared to popular CNNs and the SPP-based CNN.
Although we achieved high classification accuracy as well as computational efficiency, we must acknowledge a number of limitations. We note that the recognition rate of intermediate plaques is lower than those of echo-rich and echolucent plaques. This may be because of the complex morphology and variability of intermediate plaques. We may consider a novel attention mechanism to more accurately capture the information closely related to the classification task while eliminating some irrelevant information, so as to improve the recognition rate. In addition, the groundtruth for this study was provided by only one expert clinician with decades of experience working with carotid ultrasound data. In a follow-up study, the generation of groundtruth datasets by multiple experts from different institutions would be needed to evaluate the sensitivities of the proposed networks on the training dataset and ensure that the results are generalizable. Furthermore, patients should be followed for at least 5 years and their carotid plaques should be reclassified to determine if the plaques are changing and becoming unstable, and patient outcome data (i.e., TIAs or strokes) should be compared to the classification results to determine if these data can be used clinically as risk indicators.

Conclusions
In this work, we investigated the design of the SPP module, proposed the MSP module, and presented MSP-based VGGNet to improve the classification performance with respect to carotid plaque echogenicity. A 5-fold cross-validation was used to evaluate the effectiveness of our network on a collected clinical dataset. In a comparison with popular CNNs, the experimental results demonstrated that our network is more effective for correctly classifying the echogenicity of carotid plaques into three types. Therefore, our network may potentially assist clinicians in using a more objective risk assessment metric for carotid plaques to monitor plaque changes and predict possible cerebrovascular events.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.