Rapid Retrieval of Lung Nodule CT Images Based on Hashing and Pruning Methods

The similarity-based retrieval of lung nodule computed tomography (CT) images is an important task in the computer-aided diagnosis of lung lesions. It can provide similar clinical cases for physicians and help them make reliable clinical diagnostic decisions. However, when handling large-scale lung images with a general-purpose computer, traditional image retrieval methods may not be efficient. In this paper, a new retrieval framework based on a hashing method for lung nodule CT images is proposed. This method can translate high-dimensional image features into a compact hash code, so the retrieval time and required memory space can be reduced greatly. Moreover, a pruning algorithm is presented to further improve the retrieval speed, and a pruning-based decision rule is presented to improve the retrieval precision. Finally, the proposed retrieval method is validated on 2,450 lung nodule CT images selected from the public Lung Image Database Consortium (LIDC) database. The experimental results show that the proposed pruning algorithm effectively reduces the retrieval time of lung nodule CT images and improves the retrieval precision. In addition, the retrieval framework is evaluated by differentiating benign and malignant nodules, and the classification accuracy can reach 86.62%, outperforming other commonly used classification methods.


Introduction
The early diagnosis and treatment of lung cancer patients can help improve their survival rate [1]. However, with the development and improvement of various medical image scanning technologies, especially computed tomography (CT), the number of medical images is growing explosively every year. Hence, in the early medical screening process, reviewing lung lesions is an extremely labor-intensive job for radiologists. In addition, when reviewing and analyzing lesions, radiologists mainly rely on their diagnostic experience, and the diagnosis tends to be highly subjective. Moreover, clinical statistical studies show that the same radiologist, at different times, under different states of physical fatigue, may come up with a different diagnosis for the same CT image. Therefore, it is necessary to retrieve similar lung nodule CT images to improve diagnostic efficiency. By obtaining similar images from a CT image repository of pulmonary nodules, the anamnesis and successful treatments of these images can be viewed as clinical references for the case under consideration, which can lessen the reliance on a physician's clinical diagnostic experience to a certain degree.
Given the explosive growth in the number of lung images and the advantages of medical image retrieval for physicians' diagnosis of lung lesions, this paper proposes a novel retrieval framework for lung nodule CT images based on a hashing and pruning algorithm. When retrieving similar lung nodule CT images, the framework not only reduces the memory space required but also shortens the retrieval time and improves precision by using a pruning-based similarity measure.
The remainder of this paper is organized as follows. Section 2 introduces previous work related to lung nodule image retrieval and current popular retrieval methods. Section 3 describes the proposed retrieval framework in detail. Experimental results are presented with some discussion in Section 4. Section 5 concludes the paper and discusses future work.

Related Work
Recent years have witnessed the growing popularity of medical image retrieval, and there are many significant results in the field of lung imaging. Oliveira and Ferreira [2] proposed a bag-of-tasks method combining texture features and registration techniques to retrieve lung cancer images. Ng et al. [3] presented an improved hierarchical spatial descriptor and binary descriptor to retrieve similar lung nodule CT images from the perspective of spatial similarity. Aggarwal et al. [4] studied the detection and classification of lung nodules with content-based medical image retrieval. However, given the large number of lung CT images generated every year, effective medical image retrieval is still a difficult challenge.
At present, hashing-based methods for image retrieval can solve the storage and efficiency problems that traditional image retrieval methods may encounter [5]. Further, these hashing methods can transform high-dimensional image data into a low-dimensional binary space by utilizing the constructed hash functions [6]. It is precisely because of these advantages that many scholars focus on researching hashing-based image retrieval technology. Locality-sensitive hashing [7], a pioneering work, can generate compact binary codes with a random threshold. Further, in many hashing methods, such as those in [8][9][10], principal component analysis (PCA) is a common method for preprocessing image data. The simplest of these approaches is PCAH: after using PCA to reduce the dimensionality of the image data, "0" is viewed as the boundary, and the two sides correspond to "0" and "1," respectively. Moreover, according to whether label information is used to construct hash functions, hashing methods can be categorized as unsupervised hashing [8,9,11], semisupervised hashing [10], or supervised hashing [12,13]. Additionally, the core of these hashing methods is the minimization of the error when translating the image data into binary space.
Although many hashing and improved hashing methods have been presented, only a few researchers have applied them to medical image retrieval. Liu et al. [14] presented an image retrieval framework for digital mammograms with anchor graph hashing and improved its search accuracy by fusing different features. Jiang et al. [15] used a joint kernel-based supervised hashing algorithm with a small amount of supervised information to compress breast histopathological images into 10-bit hash codes and identified actionable and benign tumors based on the retrieval results. Zhang et al. [16] built a histopathological image retrieval framework using a supervised hashing with kernel (KSH) method and validated the retrieval performance on breast microscopic tissue images.
In our proposed retrieval method for lung nodule CT images, partitioning the dataset with a clustering algorithm is the foundation of the pruning algorithm. The KSH method is then used to translate the images in each cluster into short hash codes and form the hash code database. During retrieval, a pruning algorithm is employed to narrow the retrieval range and further improve the retrieval speed and precision. We apply the pruning algorithm to other state-of-the-art hashing methods to validate it, and we compare our retrieval framework with commonly used classification methods to demonstrate its performance.

Description of the Retrieval Framework and Pruning Algorithm
The retrieval framework for lung nodule CT images proposed in this paper consists of two main parts, the learning phase and the query phase, as shown in Figure 1. The aim of the learning phase is to build a hash code database. First, we use the extracted visual and medical features to represent each lung nodule CT image. Binary codes are then obtained using the KSH method and stored in a hash code database. In the query phase, given a query lung nodule CT image, we first extract the same features that were extracted in the learning phase and translate them into binary code with the constructed hash functions. Next, similar images are retrieved from the hash code database using the pruning algorithm. The retrieval results can be used to recognize benign or malignant nodules.

Lung Nodule Feature Extraction.
Feature extraction plays an important role in image retrieval and can transform high-dimensional nodule images into a lower-dimensional space while retaining the essential content of the image. Good features can help physicians to distinguish lung nodules efficiently [17][18][19]. To facilitate analysis and research on lung lesions, we extract lung nodule features based on grayscale, morphology, and texture. Grayscale features are the most basic characteristics of an image of lung nodules, and the grayscale difference can highlight the corresponding organization and structure. The proposed method extracts three gray level characteristics: grayscale mean, variance, and entropy. Grayscale entropy reflects the grayscale information contained in the nodule image and is defined as
$$E = -\sum_{i=0}^{L-1} p(i) \log p(i),$$
where $p(i)$ represents the probability density of gray level $i$ and $L$ is the number of available gray level values.
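As an illustration of the three grayscale features, the following minimal sketch computes the mean, variance, and entropy of a region of interest given as a flat list of integer gray levels. The function name `grayscale_features`, the use of the population variance, and the base-2 logarithm are assumptions made for this example; the paper does not state these details.

```python
import math
from collections import Counter


def grayscale_features(pixels):
    """Return (mean, variance, entropy) for a flat list of integer gray levels.

    Assumptions for this sketch: population variance and entropy in bits
    (base-2 logarithm).
    """
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # p(i) is estimated as the relative frequency of gray level i.
    counts = Counter(pixels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return mean, variance, entropy


# A toy region with two equally frequent gray levels has an entropy of 1 bit.
roi = [0, 0, 255, 255]
mean, variance, entropy = grayscale_features(roi)
```

Gray levels that never occur are simply absent from the counter, so the term $p(i) \log p(i)$ with $p(i) = 0$ is never evaluated.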
Morphological features are the most intuitive visual features and are helpful for identifying tumors. We describe the morphological features of lung nodules mainly using invariant moments, medical signs, and geometric features. We employ the seven invariant moments proposed by Hu to describe the shape characteristics of pulmonary nodules. The calcification area, calcification degree, cavitary area, and cavitary ratio are calculated and represent the medical sign information of the lung nodules. The geometric features consist of lung nodule perimeter, area, maximum diameter, rectangle, and roundness, where roundness describes the degree of deviation of the nodule region from a circular shape, defined as
$$R = \frac{4\pi A}{P^{2}},$$
where $A$ is the area of the nodule region and $P$ is the perimeter.
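To make the roundness measure concrete, the sketch below assumes the common isoperimetric form $4\pi \cdot \text{area} / \text{perimeter}^2$, which equals 1 for a perfect circle and decreases as the shape departs from a circle. The paper's exact expression may differ, and the function name `roundness` is hypothetical.

```python
import math


def roundness(area, perimeter):
    """Isoperimetric roundness: 1.0 for a circle, smaller for other shapes.

    Assumed definition for this sketch: 4 * pi * area / perimeter ** 2.
    """
    return 4.0 * math.pi * area / perimeter ** 2


# A circle of radius 3 versus a square of side 2.
circle = roundness(math.pi * 3 ** 2, 2 * math.pi * 3)
square = roundness(2 * 2, 4 * 2)
```

Under this definition the square scores $\pi/4 \approx 0.785$, so a nodule outline that is less circular yields a lower value.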
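Texture features can provide important information on the health of the lung. For example, the structure of diseased tissue is more chaotic and rough than that of healthy tissue [18]. Here, the gray level co-occurrence matrix, the most widely used texture analysis method in medical imaging, is used to extract texture features. The computed features include 14 characteristic values, such as contrast, angular second moment, entropy, and inverse difference moment, which are defined as
$$\mathrm{CON} = \sum_{i}\sum_{j} (i-j)^{2}\, p(i,j), \qquad \mathrm{ASM} = \sum_{i}\sum_{j} p(i,j)^{2},$$
$$\mathrm{ENT} = -\sum_{i}\sum_{j} p(i,j) \log p(i,j), \qquad \mathrm{IDM} = \sum_{i}\sum_{j} \frac{p(i,j)}{1+(i-j)^{2}},$$
where $p(i,j)$ is the $(i,j)$ entry of the normalized gray level co-occurrence matrix. We also calculate the mean and variance of these 14 feature values.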
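The following sketch shows how four of these texture features could be computed from a gray level co-occurrence matrix. It assumes the standard Haralick-style definitions, a single pixel offset, a non-symmetric matrix, and a base-2 logarithm for entropy; the function names `glcm` and `texture_features` are hypothetical, and the paper's offsets and quantization are not specified.

```python
import math


def glcm(image, dx=1, dy=0, levels=4):
    """Normalized gray level co-occurrence matrix for one pixel offset.

    `image` is a list of rows of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                counts[image[y][x]][image[ny][nx]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Contrast, angular second moment, entropy, and inverse difference moment."""
    contrast = asm = entropy = idm = 0.0
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            contrast += (i - j) ** 2 * v
            asm += v * v
            idm += v / (1 + (i - j) ** 2)
            if v > 0:  # skip empty cells so log2 is never given zero
                entropy -= v * math.log2(v)
    return {"contrast": contrast, "asm": asm, "entropy": entropy, "idm": idm}


# Two uniform rows: horizontal neighbors always share the same gray level.
features = texture_features(glcm([[0, 0], [1, 1]], levels=2))
```

In this toy image all co-occurrences lie on the diagonal of the matrix, so the contrast is zero and the inverse difference moment is at its maximum of one.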
A detailed description of the extracted features is given in Table 1. By extracting grayscale, morphological, and texture features, we obtain a 104-dimensional feature vector that uniquely represents each lung nodule CT image.

Building the Hash Code Database.
In order to achieve the rapid retrieval of lung nodule CT images using the proposed pruning algorithm, it is necessary to partition the dataset before constructing the hash code database using the hashing method. As shown in Figure 2, the construction of a hash code database includes two parts: clustering and hashing. In the first part, a spectral clustering algorithm is used to partition our training dataset into several clusters so that the distribution of lung nodule CT images in each cluster is near uniform. In addition, when retrieving a query image, the retrieval scope can be narrowed according to the distance between the query image and cluster centers. In the second part, the KSH method is employed to construct hash functions for the whole training dataset and obtain the hash code database. Furthermore, the uniform distribution of lung nodule CT images in clusters could reduce the instability of retrieval performance caused by the uneven distribution of images.

Partitioning the Lung Nodule Dataset.
Spectral clustering is a clustering algorithm based on spectral graph theory that can identify a sample space with arbitrary shapes and converge to the global optimal solution. Furthermore, the obtained clustering results outperform traditional clustering approaches, such as k-means or single linkage clustering [20,21].
In this paper, given a training dataset of lung nodule CT images $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^{d}$, where $n$ is the number of images and $d$ is the dimension of the extracted features, a graph is first built to represent these data. The vertices in the graph represent lung nodule CT images, and the weights of the edges represent the similarity of any two lung nodule CT images. An undirected weighted graph $G = (V, E)$ based on the similarity of the images can then be obtained. Thus, the clustering problem is converted into a graph partitioning problem on $G$.
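As a rough sketch of the graph construction step, the code below builds a symmetric weight matrix over feature vectors. The text does not state which similarity function weights the edges, so the Gaussian similarity, the bandwidth parameter `sigma`, and the function name `similarity_graph` are all assumptions made for illustration.

```python
import math


def similarity_graph(features, sigma=1.0):
    """Return a symmetric weight matrix W for an undirected similarity graph.

    Assumed edge weight: exp(-||x_i - x_j||^2 / (2 * sigma^2)); self-loops
    are left at zero.
    """
    n = len(features)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            weights[i][j] = weights[j][i] = math.exp(-sq_dist / (2 * sigma ** 2))
    return weights


# Three one-dimensional feature vectors; closer pairs get heavier edges.
W = similarity_graph([[0.0], [1.0], [3.0]])
```

A spectral clustering routine would then partition the graph described by this matrix, for example through the eigenvectors of its graph Laplacian.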

Construction of Hash Functions.
One of the factors affecting the performance of a hashing method is the ability to preserve the similarity of any two images in the original feature space. Hence, the key to a hashing method is to construct appropriate hash functions and maintain the similarity within the hash code. KSH is a supervised hashing method that uses a limited amount of supervised information for learning hash functions, and the retrieval results are better than other unsupervised hashing methods as well as some supervised hashing methods.
Given all the lung nodule CT images in training dataset $X$, we need to construct a group of hash functions $H = \{h_1(x), \ldots, h_r(x)\}$, each of which generates a single hash bit. That is, if the length of the hash code is $r$, then $r$ hash functions are constructed. A hash function is defined as
$$h(x) = \mathrm{sgn}\left(\sum_{j=1}^{m} \kappa(x_{(j)}, x)\, a_j - b\right),$$
where $\kappa(x, y) = \exp(-\|x - y\| / 2\sigma^{2})$ is a Gaussian kernel function for solving the problem of the linear inseparability of lung nodule images, $\{x_{(1)}, \ldots, x_{(m)}\}$ are $m$ samples randomly selected from the training dataset to support kernel computation, $\{a_1, \ldots, a_m\}$ are a group of coefficients, $\mathrm{sgn}(\cdot)$ is a sign function outputting 1 for positive input and −1 for negative input, and $b \in \mathbb{R}$ is a bias defined as
$$b = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} \kappa(x_{(j)}, x_i)\, a_j.$$
As the differences between lung nodule images are not as apparent as those between natural images, the coefficient vector $\mathbf{a} = [a_1, \ldots, a_m]^{\top}$ is vital for generating distinguishable hash functions. Here, supervised information, that is, a label matrix, is utilized to solve this problem. During the image preprocessing, we mark each lung nodule CT image with a label 1 or 0 based on whether it shows benign or malignant lesions. Then, $l$ ($m < l \le n$) images from the training dataset are randomly selected to construct a label matrix $S \in \mathbb{R}^{l \times l}$. The construction process is shown in Figure 3.
When retrieving a query image, the Hamming distance is a commonly used method to measure the similarity between the query image and the database images. However, it is difficult to directly compute this distance because of its complex formula. The research in [13] explains the corresponding relation between the Hamming distance and the code inner product. Hence, the objective function can be defined using a code matrix formed by the selected samples and the label matrix to solve $A = [\mathbf{a}_1, \ldots, \mathbf{a}_r]$ for the hash functions as follows:
$$\min_{A} \left\| \frac{1}{r}\, \mathrm{sgn}(\bar{K}_l A)\, \mathrm{sgn}(\bar{K}_l A)^{\top} - S \right\|_F^{2},$$
where $\|\cdot\|_F$ is the Frobenius norm and $\bar{K}_l$ represents the kernel computation for the images involved in label matrix $S$ and is expressed as
$$\bar{K}_l = [\bar{k}(x_1), \ldots, \bar{k}(x_l)]^{\top} \in \mathbb{R}^{l \times m},$$
with $\bar{k}(x) = [\kappa(x_{(1)}, x) - \mu_1, \ldots, \kappa(x_{(m)}, x) - \mu_m]^{\top}$ and $\mu_j = \frac{1}{n}\sum_{i=1}^{n} \kappa(x_{(j)}, x_i)$. Thus, by minimizing the error between the code matrix and the label matrix, the hash functions with supervised information can be acquired and used to encode each lung nodule CT image.
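The sketch below illustrates two ideas from this subsection: evaluating a single kernel-based hash function once its coefficients are known, and the relation between the Hamming distance and the code inner product. It assumes the KSH-style form in which the bias is the mean kernel projection over the training set, and it follows the kernel as written in the text (the more common Gaussian kernel squares the distance). The coefficient values, function names, and toy data are hypothetical; learning the coefficients from the label matrix is not shown.

```python
import math


def kernel(x, y, sigma=1.0):
    """Kernel as written in the text: exp(-||x - y|| / (2 * sigma^2))."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist / (2 * sigma ** 2))


def make_hash_function(anchors, coeffs, training, sigma=1.0):
    """Build h(x) = sgn(sum_j a_j * kernel(anchor_j, x) - b).

    The bias b is the mean projection over the training set (assumed form).
    """
    def projection(x):
        return sum(a * kernel(z, x, sigma) for a, z in zip(coeffs, anchors))

    bias = sum(projection(x) for x in training) / len(training)

    def h(x):
        # Zero input is mapped to -1 here; the text leaves that case open.
        return 1 if projection(x) - bias > 0 else -1

    return h


def hamming(u, v):
    """Number of positions at which two codes differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def inner_product(u, v):
    """Code inner product for codes with entries in {-1, +1}."""
    return sum(a * b for a, b in zip(u, v))


# For r-bit codes in {-1, +1}: inner_product = r - 2 * hamming.
u, v = [1, -1, 1, 1], [1, 1, -1, 1]
assert inner_product(u, v) == len(u) - 2 * hamming(u, v)
```

Because the inner product is an affine function of the Hamming distance, ranking by descending inner product gives the same order as ranking by ascending Hamming distance.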

Retrieval Process for Lung Nodule CT Images.
The aim of the proposed method is to achieve rapid retrieval for lung nodule CT images with higher precision. Hence, a pruning algorithm is presented. The retrieval procedure with the pruning algorithm is illustrated in Figure 4.
Given a query image, the retrieval process includes the following three steps: (1) determining candidate clusters, that is, selecting some clusters as the candidate clusters to which the query image may belong, (2) encoding the query image, that is, compressing the extracted relevant features into binary codes with the constructed hash functions, and (3) calculating similarity, that is, computing and sorting the code inner products between the query image and images in the candidate clusters and returning the similar images according to the similarity.
However, when sorting the code inner products, if the length of the hash code is $r$, the value range of the code inner product is $[-r, r]$, so images that have the same code inner product cannot be ranked directly. To solve this problem, we designed a decision rule: compare the distances between the query image and the clusters to which these similar images belong and return the image with the smaller distance first. The whole pruning process is shown in Algorithm 1.
Algorithm 1.
Input. Query image $q$, cluster centers, hash functions $H$, and the hash code database.
Output. Similar images $\{x_1, \ldots, x_k\}$.
Step 1. Calculate the distance between the query image $q$ and each cluster center.
Step 2. Select the clusters with the smallest distances as the candidate clusters.
Step 3. Compress the query image $q$ into a hash code with $H$.
Step 4. Calculate the code inner products between $q$ and the images $x_i$ in the candidate clusters as follows: $\mathrm{sim}_i = \mathrm{code}(q) \cdot \mathrm{code}(x_i)$.
Step 5. Rank the code inner products sim as follows: $x_1, \ldots, x_k \leftarrow \mathrm{sort}(\mathrm{sim}, \text{"descend"})$.
Step 6. If $\mathrm{sim}_i = \mathrm{sim}_j$, then compare the corresponding distances $d_i$ and $d_j$ from Step 1.
Step 7. If $d_i > d_j$, then return image $x_j$ first.
Step 8. Repeat Steps 6-7 until the similar images are returned in order.
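The retrieval procedure with pruning and the tie-breaking decision rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance to cluster centers, the data layout (`centers` and `clusters` dictionaries), and the names `pruned_search`, `n_candidates`, and `top_k` are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pruned_search(query_feat, query_code, centers, clusters, n_candidates, top_k):
    """Return the ids of the top_k most similar images.

    centers:  {cluster_id: center feature vector}
    clusters: {cluster_id: [(image_id, code), ...]} with codes in {-1, +1}
    """
    # Distance from the query to every cluster center; keep the nearest ones.
    dist = {c: euclidean(query_feat, center) for c, center in centers.items()}
    candidates = sorted(dist, key=dist.get)[:n_candidates]

    # Code inner products, computed only inside the candidate clusters.
    scored = []
    for c in candidates:
        for image_id, code in clusters[c]:
            sim = sum(a * b for a, b in zip(query_code, code))
            # Sort key: larger similarity first, then nearer cluster first.
            scored.append((-sim, dist[c], image_id))
    scored.sort()
    return [image_id for _, _, image_id in scored[:top_k]]


centers = {"A": [0.0, 0.0], "B": [5.0, 5.0], "C": [1.0, 0.0]}
clusters = {
    "A": [("a1", [1, 1, -1, -1])],
    "B": [("b1", [1, 1, 1, 1])],
    "C": [("c1", [1, 1, -1, -1]), ("c2", [1, 1, 1, -1])],
}
result = pruned_search([0.9, 0.0], [1, 1, 1, -1], centers, clusters,
                       n_candidates=2, top_k=3)
# result == ["c2", "c1", "a1"]: c1 and a1 tie on similarity, and c1 wins
# because its cluster center is closer to the query.
```

Cluster B is never scanned in this example, which is where the time saving comes from; the trade-off is that a similar image in a pruned cluster (here `b1`) cannot be returned.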

Image Dataset.
The image data used in our experiments are from the LIDC dataset [22]. The LIDC dataset contains 1,018 cases, each of which includes a set of chest CT images and an associated XML file that records some relevant information about the lung nodules (such as whether they are benign or malignant). There are a total of 7,371 nodules labeled by at least one radiologist, and 2,669 of these nodules are marked "nodule ≥ 3 mm." Here, in order to ensure that the training dataset does not influence the testing dataset (i.e., that no images belonging to the same case appear in both the training dataset and the testing dataset), the dataset in this research is constructed from 600 cases. Further, we randomly selected 70% of them as training data and the remaining 30% as testing data. Slices with unclear nodules in each case were discarded, resulting in a total of 2,450 slices in our dataset. The detailed contents of this dataset are listed in Table 2. This study is aimed at lung nodules, so the first step in our experiment is to obtain the region of interest. As the XML files in the LIDC record information about the lung nodules, our team designed a visual interface and parsed these XML files to obtain the relevant information, as shown in Figure 5. The rectangular regions containing lung nodules were extracted and viewed as regions of interest. Thus, the lung image database was constructed based on these regions.
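One way to guarantee that no case contributes slices to both the training and testing sets is to split at the case level rather than the slice level. The sketch below assumes that reading of the 70%/30% split; the function name `split_by_case`, the `(case_id, slice_id)` tuple layout, and the fixed random seed are hypothetical choices for this example.

```python
import random


def split_by_case(slices, train_fraction=0.7, seed=0):
    """Split (case_id, slice_id) pairs so that each case lands in one set only."""
    cases = sorted({case for case, _ in slices})
    rng = random.Random(seed)
    rng.shuffle(cases)
    n_train = round(train_fraction * len(cases))
    train_cases = set(cases[:n_train])
    train = [s for s in slices if s[0] in train_cases]
    test = [s for s in slices if s[0] not in train_cases]
    return train, test


# Ten toy cases with three slices each.
slices = [(case, idx) for case in range(10) for idx in range(3)]
train, test = split_by_case(slices)
```

Because cases hold different numbers of slices in practice, a 70% share of cases does not necessarily correspond to exactly 70% of the slices.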
In order to facilitate the research and analysis of lung lesions, we extracted the multiple features of lung nodules and stored them into our database in advance. Table 3 describes some feature values extracted from the lung nodule images.

Parameter Settings.
There are three main parameters affecting the performance of the proposed retrieval framework: the length of the compact hash code (bit), the number of clusters, and the number of candidate clusters used for retrieval.
The effect of a hashing method is to transform the high-dimensional image features of lung nodules into a low-dimensional hash code to represent each image. Based on the experience of a large number of studies, the length of the hash codes in this paper is set to between 8 and 64 bits. To determine the influence of the hash code length on the retrieval results, we first ran our experiment using the KSH method only (without using the pruning algorithm). The retrieval precision for different hash code lengths is reported in Table 4, and the best retrieval results are obtained when bit = 48. Retrieval precision is defined as
$$\text{precision} = \frac{\text{relevant results}}{\text{results}} \times 100\%,$$
where results indicates the number of returned images and relevant results denotes the number of correct results among the returned images, as judged by the label information.
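As a small worked example of the precision measure, the sketch below counts a returned image as relevant when its label matches the label of the query. The function name `retrieval_precision` and the label encoding are assumptions for illustration.

```python
def retrieval_precision(returned_labels, query_label):
    """Percentage of returned images whose label matches the query label."""
    relevant = sum(1 for label in returned_labels if label == query_label)
    return 100.0 * relevant / len(returned_labels)


# Three of the four returned images share the query's label.
precision = retrieval_precision([1, 1, 0, 1], query_label=1)  # 75.0
```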
Next, setting bit to 48, we discuss how to set the number of clusters and the number of candidate clusters appropriately to achieve a retrieval precision of 85.42% within the shortest retrieval time. Retrieval time refers to the period of time beginning with the encoding of the test images and ending when the similar lung nodule CT images have been obtained.
In order to reach the same retrieval precision without using a decision rule, the values of the number of clusters and the number of candidate clusters were acquired through experiment and are shown in Table 5. Figure 6 demonstrates how the retrieval time changes according to the number of clusters when the length of the hash code is 48 bits. We can see that the retrieval time decreases as the number of clusters increases up to 35 and increases once the number of clusters exceeds 35. Considering Table 5 and Figure 6 together, the reason is that, as the number of clusters increases, the reduction in the number of images per candidate cluster outweighs the images contributed by the newly added candidate clusters. That is, the total number of images in the candidate clusters needed to reach the same retrieval precision is smaller than before. However, when the number of clusters exceeds 35, the situation is reversed. This may be similar to the phenomenon of overfitting in statistical learning: when the number of clusters increases further, the retrieval result may become worse. Moreover, as the number of clusters increases, the time required to calculate the distance between the query image and each cluster center cannot be ignored. Hence, in order to obtain a better retrieval result, we set the length of the hash code to 48 bits and the number of clusters to 35. Additionally, as shown in Table 5, the number of candidate clusters is set to eight.
The retrieval results of lung nodule CT images based on the above parameter settings are shown in Figure 7, where the first two query images are malignant tumors, and the last two are benign tumors.

Performance Comparison of the Pruning Algorithm.
In these experiments, we applied the proposed pruning algorithm to different hashing methods such as kernelized locality-sensitive hashing (KLSH) [11], spectral hashing (SH) [8], binary reconstructive embedding (BRE) [12], PCAH, and iterative quantization (ITQ) [9]. We compared the retrieval time and precision to evaluate the performance of the pruning algorithm. The experiment flow is shown in Figure 8.
First, the different hashing methods were utilized to retrieve the similar lung nodule CT images without using the pruning algorithm. Figure 9 shows the retrieval precision of these hashing methods, in which the retrieval result of the KSH method outperforms the other hashing methods.
We then applied the proposed pruning algorithm (without using the decision rule) to these hashing methods and validated its performance when the dataset is partitioned into 35 clusters. The parameter settings for the hash code length and number of candidate clusters are shown in Table 6; they differ depending on the highest precision that each hashing method can reach. Figure 10 shows the retrieval time for all query images for the different hashing methods. We can see that the retrieval speed of these hashing methods using the pruning algorithm is about 2-4 times faster than when these methods do not use the pruning algorithm. Hence, the proposed pruning algorithm clearly reduces the retrieval time of lung nodule CT images. Moreover, the decision rule designed in the pruning algorithm is also helpful for improving the retrieval precision to some extent. By comparing the distance between the query image and the clusters, the most similar lung image can be returned first. Figure 11 shows the influence of the decision rule on the highest retrieval precision of different hashing methods.
Figure 9: Retrieval precision of different hashing methods for different hash code lengths (without using the pruning algorithm).
Figure 12 compares different classification methods, namely the support vector machine (SVM), back propagation (BP), and k-nearest neighbors (KNN), with our hashing-based method, using the classification accuracy of benign and malignant lesions. In this evaluation, we compared them with respect to overall (all nodules in the test dataset), benign nodule, and malignant nodule accuracy in the test dataset. We judged whether the query lesion was benign or malignant using the returned similar lung nodule CT images. The judging method is similar to the idea of the KNN algorithm; that is, if the number of benign lung nodules is greater than the number of malignant lung nodules in the returned similar images, the query image is diagnosed as a benign tumor. Otherwise, the query tumor is diagnosed as malignant.
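The voting rule described above can be sketched in a few lines. The string labels and the function name `classify_by_vote` are hypothetical; note that, following the rule as stated, a tie is resolved as malignant.

```python
def classify_by_vote(returned_labels):
    """Diagnose a query from the labels of its retrieved neighbors.

    Returns "benign" only when benign neighbors strictly outnumber malignant
    ones; every other case, including a tie, is "malignant".
    """
    n_benign = sum(1 for label in returned_labels if label == "benign")
    n_malignant = sum(1 for label in returned_labels if label == "malignant")
    return "benign" if n_benign > n_malignant else "malignant"


diagnosis = classify_by_vote(["benign", "malignant", "benign"])  # "benign"
```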

The KNN algorithm is commonly used as a classical baseline classification method in machine learning. Here, we employ the Euclidean distance to obtain similar samples and set k to 5. However, the calculation is not efficient enough for high-dimensional image features. Our hashing method leverages the compact hash code and code inner products to measure the similarity, which is helpful for improving its efficiency. The BP neural network is one of the most widely used neural network models. The learning rate is set to 0.01, and the number of iterations is 500 in our experiments. An SVM is a supervised learning model that uses supervised information to bridge the semantic gap. Hence, the classification results of the SVM with a radial basis function are better than those of the KNN and BP methods.
Furthermore, we can see that our method significantly outperforms the other three classification methods. The overall classification accuracy can reach 86.62%, with accuracies of 84.61% for the benign lesions and 87.67% for the malignant lesions. This improvement illustrates that hash functions with supervised information preserve the similarity of the images in the original feature space, and it validates the retrieval performance of the framework.

Conclusion
In this paper, in order to improve the retrieval efficiency of lung nodule CT images, we presented a retrieval framework based on the KSH method and a pruning algorithm. Specifically, a clustering method is first used to partition the dataset into several clusters. The KSH method is then utilized to compress the high-dimensional feature vectors into compact hash codes. Finally, a pruning algorithm is employed to narrow the retrieval range and further shorten the retrieval time while improving the retrieval precision. Here, the hash functions are used to map a 104-dimensional image feature of lung nodules into a 48-bit binary code, which, to some extent, reduces the memory space. Low memory cost, fast query speed, and a higher precision demonstrate the suitability of the proposed retrieval framework for lung nodule image retrieval. However, in this paper, we only handle benign and malignant lung nodules, which is a relatively easy task. In future work, the method should be further refined to retrieve lung images at the level of medical signs (such as calcification, lobulation, and spiculation) with a higher retrieval precision, helping physicians make reliable diagnostic decisions for clinical cases.