Wireless capsule endoscopy (WCE) has developed rapidly over the last several years and now enables physicians to examine the gastrointestinal tract noninvasively. However, a large number of images must be analyzed to obtain a diagnosis. Deep convolutional neural networks (CNNs) have demonstrated impressive performance in various computer vision tasks. In this work, we therefore explore the feasibility of deep learning for ulcer recognition and optimize a CNN-based ulcer recognition architecture for WCE images. By analyzing the ulcer recognition task and the characteristics of classic deep learning networks, we propose the HAnet architecture, which uses ResNet-34 as the base network and fuses hyper features from shallow layers with deep features from deeper layers to produce the final diagnostic decision. A total of 1,416 independent WCE videos were collected for this study. The overall test accuracy of our HAnet is 92.05%, and its sensitivity and specificity are 91.64% and 92.42%, respectively. According to our comparisons of F1, F2, and ROC-AUC, the proposed method outperforms several off-the-shelf CNN models, including VGG, DenseNet, and Inception-ResNet-v2, as well as classical machine learning methods with handcrafted features for WCE image classification. Overall, this study demonstrates that recognizing ulcers in WCE images with deep CNNs is feasible and could help reduce the tedious image reading workload of physicians. Moreover, the HAnet architecture tailored to this problem offers a useful reference for network structure design.
Gastrointestinal (GI) diseases pose great threats to human health. Gastric cancer, for example, ranks fourth among the most common cancers globally and is the second most common cause of death from cancer worldwide [
The emergence of wireless capsule endoscopy (WCE) has revolutionized the task of imaging GI issues; this technology offers a noninvasive alternative to the conventional method and allows exploration of the GI tract with direct visualization. WCE has been proven to have great value in evaluating focal lesions, such as those related to GI bleeding and ulcers, in the digestive tract [
WCE was first introduced in 2000 by Given Imaging and approved for use by the U.S. Food and Drug Administration in 2001 [
Illustration of a wireless capsule. This capsule is a product of Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China).
Examination of WCE images is a time-consuming and tedious endeavor for doctors because a single scan for a patient may include up to tens of thousands of images of the GI tract. Experienced physicians may spend hours reviewing each case. Furthermore, abnormal frames may occupy only a tiny portion of all of the images obtained [
These difficulties have motivated researchers to turn to computer-aided systems for tasks such as ulcer detection, polyp recognition, and bleeding area segmentation [
Ulcers are one of the most common lesions in the GI tract; an estimated 1 in 10 people suffers from ulcers [
Typical WCE images of an ulcer. Ulcerated areas in each image are marked by a white box.
Deep learning methods based on convolutional neural networks (CNNs) have achieved several breakthroughs in classification tasks in recent years. Considering the difficulty of mathematically describing the great variation in the shapes and features of abnormal regions in WCE images, and the fact that deep learning is powerful in extracting information from data, we propose applying deep learning methods to ulcer recognition using a large WCE dataset whose volume provides adequate diversity. In this paper, we carefully analyze the problem of ulcer frame classification and propose a deep learning framework based on a multiscale feature-concatenated CNN, hereinafter referred to as HAnet, to assist physicians in the WCE video examination task. Our network is verified to be effective on a large dataset containing WCE videos of 1,416 patients.
Our main contributions can be summarized in terms of the following three aspects: (1) The proposed architecture adopts state-of-the-art CNN models to efficiently extract features for ulcer recognition. It incorporates a special design that fuses hyper features from shallow layers and deep features from deep layers to improve the recognition of ulcers across widely varying scales. (2) To the best of our knowledge, this work is the first experimental study to include a large dataset consisting of over 1,400 WCE videos from ulcer patients to explore the feasibility of deep CNNs for ulcer diagnosis. Some representative datasets presented in published works are listed in Table
Representative studies and datasets of WCE videos in the literature.
| Experiment | Cases | Detail |
| --- | --- | --- |
| Li and Meng [ | 10 patients' videos | 200 images |
| Li and Meng [ | 10 patients' videos | Five videos for bleeding and the other five for ulcers |
| Li et al. [ | — | 80 representative small intestine ulcer WCE images and 80 normal images |
| Karargyris and Bourbakis [ | — | A WCE video containing 10 frames with polyps and 40 normal frames, plus an extra 20 frames with ulcers |
| Li and Meng [ | 10 patients' videos | 600 representative polyp images and 600 normal images; 60 normal images and 60 polyp images from each patient's video segments |
| Yu et al. [ | 60 patients' videos | 344 endoscopic images for training; another 120 ulcer images and 120 normal images for testing |
| Fu et al. [ | 20 patients' videos | 5000 WCE images consisting of 1000 bleeding frames and 4000 nonbleeding frames |
| Yeh et al. [ | — | 607 images containing 220, 159, and 228 images of bleeding, ulcers, and nonbleeding/ulcers, respectively |
| Yuan et al. [ | 10 patients' videos | 2400 WCE images consisting of 400 bleeding frames and 2000 normal frames |
| Yuan and Meng [ | 35 patients' videos | 3000 normal WCE images (1000 bubbles, 1000 TIs, and 1000 CIs) and 1000 polyp images |
| He et al. [ | 11 patients' videos | 440K WCE images |
| Aoki et al. [ | 180 patients' videos | 115 patients' videos (5360 images of small-bowel erosions and ulcerations) for training; 65 patients' videos (10,440 independent images) for validation |
| Ours | 1,416 patients' videos | 24,839 representative ulcer frames |
Prior related methods for abnormality recognition in WCE videos can be roughly divided into two classes: conventional machine learning techniques with handcrafted features and deep learning methods.
Conventional machine learning techniques are usually based on manually selected handcrafted features followed by a classifier. Commonly employed features include color and texture features.
Lesion areas are usually of a different color from the surrounding normal areas; for example, bleeding areas may present as red and ulcerated areas may present as yellow or white. Fu et al. [
Texture is another type of feature commonly used for pattern recognition. Texture features include local binary patterns (LBP) and filter-based features [
CNN-based deep learning methods are known to show impressive performance. The error rate in computer vision challenges (e.g., ImageNet, COCO) has decreased rapidly with the emergence of various deep CNN architectures, such as AlexNet, VGGNet, GoogLeNet, and ResNet [
Many researchers have realized that handcrafted features merely encode partial information in WCE images [
A framework for hookworm detection was proposed in [
From ulcer size analysis of our dataset, we find that most of the ulcers occupy only a tiny area in the whole image. Deep CNNs can inherently compute feature hierarchies layer by layer. Hyper features from shallow layers have high resolution but lack representation capacity; by contrast, deep features from deep layers are semantically strong but have poor resolution [
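To make this resolution trade-off concrete, the following minimal PyTorch sketch (assuming torchvision's ResNet-34 implementation) prints the feature-map sizes that each stage produces for a 480 × 480 input:

```python
import torch
from torchvision import models

# Inspect the feature hierarchy of ResNet-34 for a 480x480 WCE frame.
net = models.resnet34(weights=None)
x = torch.randn(1, 3, 480, 480)

x = net.conv1(x)      # -> (1, 64, 240, 240)
x = net.bn1(x)
x = net.relu(x)
x = net.maxpool(x)    # -> (1, 64, 120, 120)

for name in ["layer1", "layer2", "layer3", "layer4"]:
    x = getattr(net, name)(x)
    print(name, tuple(x.shape))
# layer1 (1, 64, 120, 120)   high resolution, weak semantics
# layer2 (1, 128, 60, 60)
# layer3 (1, 256, 30, 30)
# layer4 (1, 512, 15, 15)    strong semantics, coarse resolution
```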
Our dataset is collected using a WCE system provided by Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China). The WCE system consists of an endoscopic capsule, a guidance magnet robot, a data recorder, and a computer workstation with software for real-time viewing and controlling. The capsule is 28 mm × 12 mm in size and contains a permanent magnet in its dome. Images are recorded and transferred at a speed of 2 frames/s. The resolution of the WCE image is 480 × 480 pixels.
The dataset used in this work to evaluate the performance of the proposed framework contains 1,416 WCE videos from 1,416 patients (73% male, 27% female), i.e., one video per patient. The WCE videos were collected from more than 30 hospitals and 100 medical examination centers through the Ankon WCE system. Each video is independently annotated by at least two gastroenterologists. If the annotation bounding boxes of the same ulcer differ by more than 10%, an expert gastroenterologist reviews the annotation and provides the final decision. The age distribution of patients is illustrated in Figure
Age distribution of patients providing videos for this study. The horizontal axis denotes the age range, and the vertical axis denotes the proportion of cases.
We plot the distribution of ulcer size in our dataset in Figure
Distribution of ulcer size among patients. The horizontal axis denotes the ratio of ulcer area to the whole image, while the vertical axis presents the number of frames.
In this section, we introduce our design and the proposed architecture of our ulcer recognition network.
Inspired by the design concept of previous works that deal with object recognition at widely varying scales [
Ulcer recognition network framework. ResNet-34, which has 34 layers, is selected as the feature extractor. Here, we only display the structural framework for clarity. Detailed layer information can be found in Appendix.
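For reference, a minimal PyTorch sketch of the hyper(l23) variant of this framework is shown below. It assumes a torchvision ResNet-34 backbone: the layer-2, layer-3, and layer-4 feature maps are each reduced by global average pooling (GAP) and concatenated before a single fully connected classifier. Details such as the exact layer wiring and classifier head are assumptions and may differ from our implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class HANet(nn.Module):
    """Sketch of hyper(l23): fuse GAP'd layer-2/3/4 features of ResNet-34."""

    def __init__(self, num_classes: int = 2, pretrained_backbone: nn.Module = None):
        super().__init__()
        self.backbone = pretrained_backbone or models.resnet34(weights="IMAGENET1K_V1")
        self.gap = nn.AdaptiveAvgPool2d(1)
        # layer2/3/4 of ResNet-34 output 128/256/512 channels, respectively.
        self.fc = nn.Linear(128 + 256 + 512, num_classes)

    def forward(self, x):
        b = self.backbone
        x = b.maxpool(b.relu(b.bn1(b.conv1(x))))
        x = b.layer1(x)
        f2 = b.layer2(x)    # hyper features, 60x60 for a 480x480 input
        f3 = b.layer3(f2)   # hyper features, 30x30
        f4 = b.layer4(f3)   # deep features, 15x15
        v = torch.cat([self.gap(f).flatten(1) for f in (f2, f3, f4)], dim=1)
        return self.fc(v)
```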
Our WCE system outputs color images with a resolution of 480 × 480 pixels. Experiments by the computer vision community [
Cross-entropy (CE) loss is a common choice for classification tasks. For binary classification [
To deal with possible imbalance between classes, a weighting factor can be applied to different classes, which can be called weighted cross-entropy (wCE) loss [
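In PyTorch, for example, such a wCE loss can be obtained by passing per-class weights to the standard cross-entropy loss; the class ordering below is an assumption for illustration:

```python
import torch
import torch.nn as nn

w = 4.0  # weighting factor on the ulcer (positive) class
# Assumed class order: index 0 = normal, index 1 = ulcer.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, w]))

logits = torch.randn(8, 2)          # a batch of 8 predictions
labels = torch.randint(0, 2, (8,))  # ground-truth labels
loss = criterion(logits, labels)
```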
To evaluate the performance of classification, accuracy (AC), sensitivity (SE), and specificity (SP) are exploited as metrics [
Here, $\mathrm{AC} = (TP + TN)/(TP + TN + FP + FN)$, $\mathrm{SE} = TP/(TP + FN)$, and $\mathrm{SP} = TN/(TN + FP)$, where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positive, true negative, false positive, and false negative frames, respectively.
AC gives an overall assessment of the performance of the model, SE denotes the model’s ability to detect ulcer images, and SP denotes its ability to distinguish normal images. Ideally, we expect both high SE and SP, although some trade-offs between these metrics exist. Considering that further manual inspection by the doctor of ulcer images detected by computer-aided systems is compulsory, SE should be as high as possible with no negative impact on overall AC.
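As a sketch, these metrics can be computed directly from binary predictions:

```python
import numpy as np

def ac_se_sp(y_true: np.ndarray, y_pred: np.ndarray):
    """Accuracy, sensitivity, and specificity for binary labels (1 = ulcer)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    ac = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)  # ability to detect ulcer images
    sp = tn / (tn + fp)  # ability to distinguish normal images
    return ac, se, sp
```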
We use a 5-fold cross-validation strategy at the case level to evaluate the performances of different architectures; this strategy splits the total number of cases evenly into five subsets. Here, one subset is used for testing, and the four other subsets are used for training and validation. Figure
Illustration of cross-validation (green: test dataset; blue: train dataset; yellow: validation dataset).
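Because frames from one patient are highly correlated, the split is performed on case IDs rather than on frames. A sketch of such a case-level split using scikit-learn (the case identifiers are hypothetical):

```python
from sklearn.model_selection import KFold

case_ids = [f"case_{i:04d}" for i in range(1416)]  # one WCE video per patient

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_val_idx, test_idx) in enumerate(kf.split(case_ids)):
    test_cases = [case_ids[i] for i in test_idx]
    # The remaining four subsets are further divided into training and
    # validation cases; every frame inherits the split of its case.
    print(f"fold {fold}: {len(test_cases)} test cases")
```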
In this section, the implementation process of the proposed method is introduced, and its performance is evaluated by comparison with several other related methods, including state-of-the-art CNN methods and some representative WCE recognition methods based on conventional machine learning techniques.
The proposed HAnet connects hyper features to the final feature vector with the aim of enhancing the recognition of ulcers of different sizes. The HAnet models are distinguished by their architecture and training settings, which include three architectures and three training configurations in total. We illustrate these different architectures and configurations in Table
Illustration of different architectures and configurations.
| Model | Layer 2 | Layer 3 | Layer 4 | Network initialization weights | Training |
| --- | --- | --- | --- | --- | --- |
| ResNet(480) | | | ✓ | ImageNet | Train on WCE dataset |
| hyper(l2) FC-only | ✓ | | ✓ | ResNet(480) | Update FC layer only |
| hyper(l3) FC-only | | ✓ | ✓ | ResNet(480) | Update FC layer only |
| hyper(l23) FC-only | ✓ | ✓ | ✓ | ResNet(480) | Update FC layer only |
| hyper(l2) all-update | ✓ | | ✓ | ResNet(480) | Update all layers |
| hyper(l3) all-update | | ✓ | ✓ | ResNet(480) | Update all layers |
| hyper(l23) all-update | ✓ | ✓ | ✓ | ResNet(480) | Update all layers |
| hyper(l2) ImageNet | ✓ | | ✓ | ImageNet | Update all layers |
| hyper(l3) ImageNet | | ✓ | ✓ | ImageNet | Update all layers |
| hyper(l23) ImageNet | ✓ | ✓ | ✓ | ImageNet | Update all layers |
Three different architectures can be obtained when hyper features from layers 2 and 3 are used for decision in combination with features from layer 4 of our ResNet backbone: in Figure
Illustrations of HAnet architectures. (a) hyper(l2), (b) hyper(l3), (c) hyper(l23).
Each HAnet can be trained with three configurations. Figure
Illustration of three different training configurations, including ImageNet, all-update, and FC-only. Different colors are used to represent different weights. Green denotes pretrained weights from ImageNet, and purple denotes the weights of ResNet(480) that have been fine-tuned on the WCE dataset. Several other colors denote other trained weights.
The whole HAnet is trained using pretrained ResNet weights from ImageNet for initialization (denoted as ImageNet in Table
ResNet(480) is first fine-tuned on our dataset using pretrained ResNet weights from ImageNet for initialization. The training settings are identical to those in (1). Convergence is achieved during training, and the best model is selected based on validation results. We then train the whole HAnet using the fine-tuned ResNet (480) models for initialization and update all weights in HAnet (denoted as all-update in Table
The weights of the fine-tuned ResNet(480) model are used, and only the last fully connected (FC) layer is updated in HAnet (denoted as FC-only in Table
For example, the first HAnet in Table
To achieve better generalizability, data augmentation is applied online during training, as suggested in [
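The transforms below illustrate a plausible online augmentation pipeline for WCE frames; the specific operations are an assumption and not necessarily the exact transforms we used:

```python
from torchvision import transforms

# Hypothetical online augmentation; flips and rotations preserve lesion
# appearance, and capsule orientation is arbitrary anyway.
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
])
```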
The experiments are conducted on an Intel Xeon machine (Gold 6130 CPU@2.10 GHz) with Nvidia Quadro GP100 graphics cards. Detailed results are presented in Sections
To demonstrate the impact of different weighting factors, i.e.,
AC, SE, and SP evolution against the wCE weighting factor. Red, purple, and blue curves denote the results of AC, SE, and SP, respectively. The horizontal axis is the weighting factor, and the vertical axis is the value of AC, SE, and SP.
AC varies with changes in weighting factor. In general, SE improves while SP is degraded as the weighting factor increases. Detailed AC, SE, and SP values are listed in Table
Cross-validation accuracy of ResNet-18(480) with different weighting factors.
| Weighting factor | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| --- | --- | --- | --- | --- | --- | --- |
| AC (%) | 90.95 ± 0.64 | 91.00 ± 0.49 | 90.96 ± 0.68 | 91.00 ± 0.70 | 90.95 ± 0.83 | 90.72 ± 0.75 |
| SE (%) | 88.65 ± 0.64 | 89.85 ± 0.47 | 89.86 ± 1.01 | 90.67 ± 0.93 | 91.50 ± 0.76 | 91.12 ± 1.94 |
| SP (%) | 93.27 ± 1.05 | 92.15 ± 1.22 | 92.05 ± 1.09 | 91.45 ± 1.84 | 90.38 ± 1.26 | 90.32 ± 1.88 |
In the following experiments, 4.0 is used as the weighting factor as it outperforms other choices and simultaneously achieves good balance between SE and SP.
We tested 10 models in total, as listed in Table
Performances of different architectures.
| Model | AC (%) | SE (%) | SP (%) |
| --- | --- | --- | --- |
| ResNet-18(480) | 91.00 ± 0.70 | 90.55 ± 0.86 | 91.45 ± 1.84 |
| hyper(l2) FC-only | 91.64 ± 0.79 | 91.22 ± 1.08 | 92.05 ± 1.65 |
| hyper(l3) FC-only | 91.62 ± 0.65 | 91.15 ± 0.56 | 92.07 ± 1.45 |
| hyper(l23) FC-only | — | — | — |
| hyper(l2) all-update | 91.39 ± 1.02 | 91.28 ± 1.04 | 91.47 ± 1.86 |
| hyper(l3) all-update | 91.50 ± 0.81 | 90.63 ± 1.06 | 92.33 ± 1.74 |
| hyper(l23) all-update | 91.37 ± 0.72 | 91.33 ± 0.63 | 91.40 ± 1.42 |
| hyper(l2) ImageNet | 90.96 ± 0.90 | 90.52 ± 1.10 | 91.38 ± 2.01 |
| hyper(l3) ImageNet | 91.04 ± 0.80 | 90.52 ± 1.34 | 91.54 ± 1.31 |
| hyper(l23) ImageNet | 90.82 ± 0.85 | 90.26 ± 1.33 | 91.37 ± 1.48 |
According to the results in Table
The hyper ImageNet models, including hyper(l2) ImageNet, hyper(l3) ImageNet, and hyper(l23) ImageNet, give comparatively weak performance. They share the same architectures as the other hyper models; the difference is that the hyper ImageNet models are trained from the pretrained ImageNet ResNet-18 weights, whereas the other models start from ResNet-18(480) weights that have already been fine-tuned on the WCE dataset. This finding reveals that a straightforward base network such as ResNet-18(480) is a powerful feature extractor, while the complicated connections of HAnet may prevent the network from reaching good convergence points when trained from ImageNet weights directly.
To fully utilize the advantages of hyper architectures, we recommend a two-stage training process: (1) Train a ResNet-18(480) model based on the ImageNet-pretrained weights and then (2) use the fine-tuned ResNet-18(480) model as a backbone feature extractor to train the hyper models. We denote the best model in all hyper architectures as HAnet-18(480), i.e., a hyper(l23) FC-only model.
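A sketch of this two-stage recipe in PyTorch, reusing the HANet sketch above (finetuned_resnet stands for the stage-1 model, and the optimizer hyperparameters are illustrative):

```python
import torch
from torchvision import models

# Stage 1: fine-tune a plain ResNet on WCE frames starting from ImageNet
# weights; keep the best model by validation accuracy.
finetuned_resnet = models.resnet34(weights="IMAGENET1K_V1")
# ... stage-1 training on the WCE dataset happens here ...

# Stage 2: wrap the fine-tuned backbone in HAnet and update only the FC layer.
model = HANet(pretrained_backbone=finetuned_resnet)
for p in model.backbone.parameters():
    p.requires_grad = False  # freeze the feature extractor
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```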
The foregoing exploration is based on ResNet-18, and the results indicate that a hyper(l23) FC-only architecture built on a ResNet backbone fine-tuned on WCE images can be expected to improve lesion recognition in WCE videos. To optimize our network, we examine the performance of various ResNet series members to determine an appropriate backbone. The corresponding results are listed in Table
Model recognition accuracy with different settings.
| | ResNet-18(480) | ResNet-34(480) | ResNet-50(480) |
| --- | --- | --- | --- |
| AC (%) | 91.00 ± 0.70 | — | 91.29 ± 0.91 |
| SE (%) | 90.55 ± 0.86 | 90.74 ± 0.74 | 89.63 ± 1.83 |
| SP (%) | 91.45 ± 1.84 | 92.25 ± 1.72 | 92.94 ± 1.82 |
Progression of HAnet-34(480).
To evaluate the performance of HAnet, we compared the proposed method with several other methods, including several off-the-shelf CNN models [
We performed repeated 2 × 5-fold cross-validation to provide sufficient measurements for statistical tests. Table
Comparison of HAnet with other methods.
| Method | AC (%) | SE (%) | SP (%) |
| --- | --- | --- | --- |
| SPM-BoW-SVM [ | 61.38 ± 1.22 | 51.47 ± 2.65 | 70.67 ± 2.24 |
| Words-based color histogram [ | 80.34 ± 0.29 | 82.21 ± 0.39 | 78.44 ± 0.29 |
| vgg-16(480) [ | 90.85 ± 0.98 | 90.12 ± 1.17 | 92.02 ± 2.52 |
| dense-121(480) [ | 91.26 ± 0.43 | 90.47 ± 1.67 | 92.07 ± 2.04 |
| Inception-ResNet-v2(480) [ | 91.45 ± 0.80 | 90.81 ± 1.95 | 92.12 ± 2.71 |
| ResNet-34(480) [ | 91.47 ± 0.52 | 90.53 ± 1.14 | 92.41 ± 1.66 |
| HAnet-34(480) | — | — | — |
Comparison of different models. (a) Accuracy and inference time comparison. The horizontal and vertical axes denote the inference time and test accuracy, respectively. (b) Statistical test results of paired
Among the models tested, HAnet-34(480) yields the best performance, combining good efficiency and accuracy. The statistical test results further demonstrate that the improvement of our HAnet-34 is statistically significant. The number in each grid cell denotes the
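For reference, such a paired test over per-fold accuracies can be computed with SciPy; the accuracy values below are purely illustrative, not our measurements:

```python
from scipy import stats

# Per-fold test accuracies from repeated 2x5-fold cross-validation
# (illustrative values only).
acc_hanet  = [92.1, 91.8, 92.4, 91.9, 92.3, 92.0, 91.7, 92.2, 92.1, 91.9]
acc_resnet = [91.5, 91.2, 91.8, 91.4, 91.6, 91.3, 91.1, 91.7, 91.5, 91.4]

t_stat, p_value = stats.ttest_rel(acc_hanet, acc_resnet)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```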
Table
Evaluation of different criteria.
| Model | PRE | RECALL | F1 | F2 | ROC-AUC |
| --- | --- | --- | --- | --- | --- |
| vgg-16(480) | 0.9170 | 0.9012 | 0.9087 | 0.9040 | 0.9656 |
| dense-121(480) | 0.9198 | 0.9047 | 0.9118 | 0.9074 | 0.9658 |
| Inception-ResNet-v2(480) | 0.9208 | 0.9081 | 0.9138 | 0.9102 | 0.9706 |
| ResNet-34(480) | 0.9218 | 0.9053 | 0.9133 | 0.9084 | 0.9698 |
| HAnet-34(480) | — | — | — | — | — |
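For reference, F1 and F2 follow from precision (PRE) and recall as $F_\beta = (1+\beta^2)\cdot\mathrm{PRE}\cdot\mathrm{RECALL}/(\beta^2\cdot\mathrm{PRE}+\mathrm{RECALL})$; a quick sanity check against the vgg-16(480) row (small deviations stem from the rounding of the tabulated PRE and RECALL):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; beta > 1 weights recall (sensitivity) more heavily."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# vgg-16(480) row: PRE = 0.9170, RECALL = 0.9012
print(round(f_beta(0.9170, 0.9012, beta=1.0), 4))  # 0.909  vs. table 0.9087
print(round(f_beta(0.9170, 0.9012, beta=2.0), 4))  # 0.9043 vs. table 0.9040
```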
In this section, the recognition capability of the proposed method for small lesions is demonstrated and discussed. Recognition results are also visualized via the class activation map (CAM) method [
To analyze the recognition capacity of the proposed model, we study sensitivity for ulcers of different sizes; the results of ResNet-34(480) and the best hyper model, HAnet-34(480), are listed in Table
Recognition sensitivity (SE, %) for ulcers of different sizes (ratio of ulcer area to the whole image).

| Model | <1% | 1–2.5% | 2.5–5% | >5% |
| --- | --- | --- | --- | --- |
| ResNet-34(480) | 81.44 ± 3.07 | 91.86 ± 1.40 | 94.16 ± 1.26 | 96.51 ± 1.43 |
| HAnet-34(480) | 82.37 ± 3.60 | 92.78 ± 1.33 | 95.40 ± 0.74 | 97.11 ± 1.11 |
Based on the results of each row in Table
To better understand our network, we use a CAM [
Visualization network results of some representative ulcers with CAM. A total of six groups of representative frames are obtained. For each group, the left image reflects the original frame, while the right image shows the result of CAM. (a) Typical ulcer, (b) ulcer on the edge, (c) ulcer in a turbid background with bubbles, (d) multiple ulcers in one frame, (e) ulcer in a shadow, and (f) ulcer recognition in the frame with a flashlight.
The results displayed in Figure
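A minimal sketch of CAM for a fine-tuned torchvision ResNet classifier is shown below: the heat map is the FC-weighted sum of the last convolutional feature maps, upsampled to the input size for overlay (following the CAM method cited above; our exact visualization code may differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cam(resnet, image, class_idx=1):
    """Class activation map for a torchvision ResNet (class 1 = ulcer).

    image: tensor of shape (1, 3, 480, 480), normalized as in training.
    """
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(image))))
    feats = resnet.layer4(resnet.layer3(resnet.layer2(resnet.layer1(x))))
    weights = resnet.fc.weight[class_idx]               # (512,) FC weights
    heat = torch.einsum("c,bchw->bhw", weights, feats)  # weighted sum of maps
    heat = F.relu(heat)[:, None]                        # (1, 1, 15, 15)
    heat = F.interpolate(heat, size=image.shape[-2:], mode="bilinear",
                         align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat[0, 0]                                   # (480, 480) in [0, 1]
```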
In this work, we proposed a CNN architecture for ulcer detection that uses a state-of-the-art CNN architecture (ResNet-34) as the feature extractor and fuses hyper and deep features to enhance the recognition of ulcers of various sizes. A large ulcer dataset containing WCE videos from 1,416 patients was used for this study. The proposed network was extensively evaluated and compared with other methods using overall AC, SE, SP, F1, F2, and ROC-AUC as metrics.
Experimental results demonstrate that the proposed architecture outperforms off-the-shelf CNN architectures, especially for the recognition of small ulcers. Visualization with CAM further demonstrates the potential of the proposed architecture to locate a suspicious area accurately in a WCE image. Taken together, the results suggest a potential method for the automatic diagnosis of ulcers from WCE videos.
Additionally, we conducted experiments to investigate the effect of the number of cases. We used the split-0 datasets from the cross-validation experiment: 990 cases for training, 142 cases for validation, and 283 cases for testing. We constructed different training datasets from the 990 cases while fixing the validation and test datasets. First, we experimented with different numbers of training cases by randomly selecting 659, 423, and 283 cases from the 990 cases. We then ran another experiment using a similar number of frames as in the previous experiment but distributed across all 990 cases. The results demonstrate that, for a similar number of training frames, test accuracy is higher when the training dataset covers more cases. We attribute this to the richer diversity introduced by more cases and therefore recommend training the model with as many cases as possible.
While the performance of HAnet is very encouraging, its SE and SP need to be improved further. For example, the fusion strategy in the proposed architecture concatenates features from shallow layers after GAP. Semantic information in hyper features may not be as strong as that in deep features; i.e., falsely activated neural units caused by the relatively limited receptive fields of the shallow layers may add unnecessary noise to the concatenated feature vector when GAP is utilized. An attention mechanism [
The architectures of members of the ResNet series, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, are illustrated in Table
Architectures of ResNet series members.
Output sizes correspond to the 480 × 480 network input.

| Layer name | Output size | 18-Layer | 34-Layer | 50-Layer | 101-Layer | 152-Layer |
| --- | --- | --- | --- | --- | --- | --- |
| Conv1 | 240 × 240 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 |
| Maxpool | 120 × 120 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 |
| Layer 1 | 120 × 120 | [3 × 3, 64; 3 × 3, 64] × 2 | [3 × 3, 64; 3 × 3, 64] × 3 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 |
| Layer 2 | 60 × 60 | [3 × 3, 128; 3 × 3, 128] × 2 | [3 × 3, 128; 3 × 3, 128] × 4 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 8 |
| Layer 3 | 30 × 30 | [3 × 3, 256; 3 × 3, 256] × 2 | [3 × 3, 256; 3 × 3, 256] × 6 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 23 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 36 |
| Layer 4 | 15 × 15 | [3 × 3, 512; 3 × 3, 512] × 2 | [3 × 3, 512; 3 × 3, 512] × 3 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 |
| Avgpool and fc | 1 × 1 | Global average pool, 2-d fc | Global average pool, 2-d fc | Global average pool, 2-d fc | Global average pool, 2-d fc | Global average pool, 2-d fc |
The WCE datasets used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
Hao Zhang is currently an employee of Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China). He helped in providing WCE data and organizing annotation. The authors would like to thank Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China) for providing the WCE data (ankoninc.com.cn). They would also like to thank the participating engineers of Ankon Technologies Co., Ltd., for their thoughtful support and cooperation. This work was supported by a Research Project from Ankon Technologies Co. Ltd. (Wuhan, Shanghai, China), the National Key Scientific Instrument and Equipment Development Project under Grant no. 2013YQ160439, and the Zhangjiang National Innovation Demonstration Zone Special Development Fund under Grant no. ZJ2017-ZD-001.