Statistical Analysis of Haralick Texture Features to Discriminate Lung Abnormalities

The Haralick texture features are a well-known mathematical method to detect the lung abnormalities and give the opportunity to the physician to localize the abnormality tissue type, either lung tumor or pulmonary edema. In this paper, statistical evaluation of the different features will represent the reported performance of the proposed method. Thirty-seven patients CT datasets with either lung tumor or pulmonary edema were included in this study. The CT images are first preprocessed for noise reduction and image enhancement, followed by segmentation techniques to segment the lungs, and finally Haralick texture features to detect the type of the abnormality within the lungs. In spite of the presence of low contrast and high noise in images, the proposed algorithms introduce promising results in detecting the abnormality of lungs in most of the patients in comparison with the normal and suggest that some of the features are significantly recommended than others.


Introduction
The lung is an organ that performs a multitude of vital functions every second of our lives. This fact leads to considering lung abnormalities, life-sustained diseases that have high priority in detection, diagnosis, and treatment if possible. Our focus in this paper will be on two popular abnormalities within the lung, which are pulmonary edema and lung tumor. Pulmonary edema (water in the lungs) is caused by fluid building up in the air sacs of the lungs [1,2]. On the other hand, lung cancer/tumor is a disease where uncontrolled cell growth in tissues of the lung occurred [3].
Computer-aided diagnosis (CAD) schemes for thoracic computed tomography (CT) are widely used to characterize, quantify, and detect numerous lung abnormalities, such as pulmonary edema and lung cancer [4,5]. An accurate lung segmentation method is always a critical first step in these CAD schemes and can significantly improve the performance level of these schemes. Although manual or semiautomatic lung segmentation methods for CT images were used in some early CAD schemes [6][7][8][9][10], they are impractical for current CAD schemes because multidetector CT (MDCT) scanners can generate hundreds of CT slices for a patient. An automated method for lung segmentation is needed for MDCT. In addition, the eye identification/detection of the abnormality type (pulmonary edema or tumors) in computed tomography (CT) images is very difficult even for the experienced clinicians because of its variable shape along with low contrast and high noise associated with it. As the final stage of treating the lung cancer is surgical removal of the diseased lung, hence it is necessary to identify the cancer location, which can be useful before they plan for the surgery.
The aim of our work is to develop an automated novel texture analysis based method for the segmentation of the lungs and the detection of the abnormalities, whether pulmonary edema or lung tumor. Haralick's features based on the gray level cooccurrence matrix (GLCM) are applied to capture textural patterns in lung images. The objective of this work is the selection of the most discriminating and finding out the significant texture features that can differentiate between these two types of abnormalities, in comparison to normal.
Haralick features are statistical features that are computed over the entire image. These measurements are utilized to describe the overall texture of the image using measures such as entropy and sum of variance. Chaddad et al. propose an approach, based on Haralick's features, to detect and classify colon cancer cells. This work aimed to select the most discriminating parameters for cancer cells [11]. A study to investigate the feasibility of using Haralick features to discriminate between "cancer" and "normal" subimages within a patient is illustrated in [12]. In this paper, CT images are first preprocessed for noise reduction and image enhancement, followed by segmentation techniques, as the tools to segment the lungs, and finally Haralick texture features [13][14][15] are calculated. Statistical analysis is done to detect the most significant Haralick features that will characterize the type of the abnormality within the lungs. Despite the low contrast and high noise existence in the images, the proposed algorithms introduce promising results in detecting the abnormality of lungs in most of the patients in comparison with the normal.

Materials and Methods
This paper presents a new automatic lung cancer detection system based on Haralick texture features extracted from the slice of DICOM Lung CT images. The proposed system is accomplished in four stages: image preprocessing, lung image segmentation, feature extraction, and classification. Statistical analysis is used to obtain the best features for classification to differentiate between lung cancer patients, ordered edema patients, and control subjects. The following sections will describe in detail these stages. All image analyses were achieved without any knowledge of patient clinical characteristics or status.

Dataset.
Patients with either a lung cancer tumor or pulmonary edema were encompassed in the study. This study included two datasets, the first dataset referred to the Radiology Department at New Elkasr ElAiny teaching hospital, University of Cairo. The other dataset was obtained from The Cancer Imaging Archive (TCIA) sponsored by the SPIE, NCI/NIH, AAPM, and the University of Chicago [16]. The two datasets of 532 CT images from 37 different patients were included. The images are 512 × 512 stored in DICOM format. For each lung CT image, we separate the left lung from the right lung automatically, and each separated lung is labeled as normal or edema/cancer based on the dataset information.

Preprocessing.
The main goal of preprocessing is to improve the quality of an image as well as make it in a form suited for further processing by human or machine [17]. This is accomplished by enhancing the visual appearance of an image besides removing the irrelevant noise and unwanted parts in the background. The proposed enhancement process, which is based on combining filters and noise reduction techniques for pre-and postprocessing as well, is carried out applying histogram equalization (HE) [18][19][20] followed by Wiener filtering [21,22]. Figure 1 presents the enhancement in the lung image contrast attained by applying the histogram equalization. However, the obtained gray scale image contains noises such as white noise and salt and pepper noise. Thus, Wiener filter is utilized to remove these noises from the enhanced lung image. Figure 1(c) shows the effect of applying Weiner filter on the contrast enhanced lung image.

Lung Segmentation.
Lung segmentation step aims to basically extract the voxels corresponding to the lung cavity in the axial CT scan slices from the surrounding lung anatomy. The segmentation technique proposed in [23] is utilized. This technique is based on the fact that there is a large density difference between air-filled lung tissues and surrounding tissues. Furthermore, both lungs are almost looking like mirror images of themselves in a human body. The segmentation of lung regions is achieved through the following steps. In the first step, the preprocessed CT image is converted into a binary image; a threshold of 128 was selected. Values greater than the threshold are mapped to white, while others less than that are marked as black. Consequently, the two lungs are marked and the area around them is cropped out. Second, an erosion morphological operation is employed in order to eliminate any white pixels within the two lungs. Afterward, International Journal of Biomedical Imaging the eroded and the original images are both divided into two equal regions. Black pixels for each region in the eroded image are counted; the region with the largest black area will be deemed as a lung mask. The attained lung mask is reflected in the opposite direction. As a result, right and left lung masks are obtained. These masks are multiplied with the corresponding original image regions; this will project the lung masks on the original two lungs images. Finally, update each black pixel in the obtained images by its original value; other pixels are set to 255. Figure 2 illustrates the lung extraction process.

Feature Extraction.
Feature extraction is the process of obtaining higher-level information of an image such as color, shape, and texture.  [13]. This technique has been widely used in image analysis applications, especially in the biomedical field. It consists of two steps for feature extraction. The GLCM is computed in the first step, while the texture features based on the GLCM are calculated in the second step. GLCM shows how often each gray level occurs at a pixel located at a fixed geometric position relative to each other pixel, as a function of the gray level [13]. The horizontal direction 00 with a range of 1 (nearest neighbor) was used in this work. The 9 texture descriptions used are presented in (4) to (13), where is the number of gray levels, is the normalized symmetric GLCM of dimension × , and ( , ) is the ( , )th element of the normalized GLCM [13].
Contrast (Moment 2 or standard deviation) is a measure of intensity or gray level variations between the reference pixel and its neighbor. Large contrast reflects large intensity differences in GLCM: Homogeneity measures how close the distribution of elements in the GLCM is to the diagonal of GLCM. As homogeneity increases, the contrast, typically, decreases: Entropy is the randomness or the degree of disorder present in the image. The value of entropy is the largest when all elements of the cooccurrence matrix are the same and small when elements are unequal: Energy is derived from the Angular Second Moment (ASM). The ASM measures the local uniformity of the gray levels.

4
International Journal of Biomedical Imaging When pixels are very similar, the ASM value will be large. Consider Correlation feature shows the linear dependency of gray level values in the cooccurrence matrix: where ; and ; are the means and standard deviations and are expressed as The moments are the statistical expectation of certain power functions of a random variable and are characterized as follows. Moment 1 ( 1 ) is the mean which is the average of pixel values in an image and it is represented as Moment 2 ( 2 ) is the standard deviation that can be denoted as

Moment 3 ( 3 ) measures the degree of asymmetry in the distribution and it is defined as
And finally Moment 4 ( 4 ) measures the relative peak or flatness of a distribution and is also known as kurtosis: Furthermore, difference statistics that are a subset of the cooccurrence matrix are also used. These features are based on the distribution of probability − ( ) which is defined as follows: where ( , ) is the ( , )th element of the GLCM. The most basic difference statistic texture descriptions are the ASM, mean, and entropy: When the − ( ) values are very similar or close, ASM is small. ASM is large when certain values are high and others are low: When − ( ) values are concentrated near the origin, mean is small and mean is large when they are far from the origin: Entropy is smallest when − ( ) values are unequal and largest when − ( ) values are equal. The calculation of the Haralick texture features using the previous equations for the CT images volume sequences for every segmented lung (right and left) separately was performed. For each participant the gray level cooccurrence texture features: contrast, homogeneity, entropy, energy, correlation, and 1 , 2 , 3 , and 4 accompanied by the difference statistical features: ASM, contrast, mean, and entropy were obtained for each segmented lung (right and left).

Statistical Analysis.
For the purpose of random lung assignment in healthy volunteers, the left lung represented the diseased lung in the same percentage of cases as the patient population. For the acute data, two single factor analyses of variance (ANOVA) tests were conducted for each Haralick texture feature measurement between affected (either left or right) and fellow lung (either left or right) for both categories cancer and edema patients. A single factor analysis of variance (ANOVA) was conducted as well between patients and controls. Other between-subject single factor analyses were conducted to find out the significant Haralick features that could differentiate cancer from edema.

Experimental Results
Two datasets of 532 CT images were included. For each lung CT preprocessed image, we separate the left lung from the right lung automatically as discussed before in Section 2.3, and each separated lung is labeled as normal or edema/cancer based on the dataset information. The Haralick texture features measurements for each lung separately are calculated (the gray level cooccurrence texture features: contrast, homogeneity, entropy, energy, correlation, and moments along with the difference statistical features: ASM, mean, and entropy). The mean and the standard deviation of the Haralick texture features measurements calculated as well as the ANOVA results are given for tumor patients affected lung versus fellow lung in Table 1 and for pulmonary edema patients in Table 2. The ANOVA summary of statistics for either pulmonary  edema or tumor patients versus normal is given in Table 3. The significant Haralick texture features that can differentiate between pulmonary edema and tumor are found in Table 4. From Table 1, we can conclude that Haralick texture features measurements (homogeneity, energy, correlation, and entropy) of the affected cancer lung were significantly different than that of the fellow lung. The homogeneity, energy, and correlation were significantly less than those of the normal fellow lung. While entropy of the cancerous lung is approaching being significantly more than that of the fellow lung, Moment 3 and the difference statistical feature ASM (diff ASM) texture feature measurement of the cancerous lung is approaching being significantly less than that of the normal lung. Table 2 showed that Haralick texture features measurements (homogeneity, entropy, and moments calculated from the cooccurrence matrix as well as mean and ASM computed from the difference statistics) of the pulmonary edema affected lung were also significantly different than those of the control subject lung; moreover contrast and entropy computed from the difference statistics were significantly more than those of the fellow lung.
Considering Tables 1 and 2, we can conclude from Table 3 that the homogeneity, energy, entropy, 3 , 4 , diff ASM, diff mean, and diff entropy are good biomarkers to significantly differentiate between diseased and normal lungs without any disease specification. On the other hand, the results illustrated in Tables 1, 2, and 4 show that entropy and the entropy calculated from the difference statistics would be a good candidate to significantly differentiate between pulmonary edema and cancer.

Conclusion and Discussion
The texture features analyses are well known approaches to quantify and express the heterogeneity that may not be appreciated by clinical naked eyes, and it was presented before as good imaging biomarkers to differentiate between diseases. In this paper an evaluation of the Haralick texture features is done in order to identify the most significant features that can be used in order to detect and differentiate abnormalities within the lungs for cancer and edema versus normal.
Our results indicate that entropy determined by gray level cooccurrence matrix and ASM is significantly different in  edema patients versus normal while it is not in cancer patients versus normal. Since the entropy is the degree of randomness or the degree of disorder in the image, and the angular second moment represents the uniformity in the image, this may be interpreted as the cancer disease causing a localized heterogeneity in the diseased specified area in the lung while the edema causes heterogeneous disorder in the whole lung image. High entropy values calculated implies that the elevated level of disorder and disorganization occurred due to the edema diseased lung versus the cancer diseased lung. The energy feature that is derived from the angular second moment measures and representing the local uniformity of the gray levels is a good biomarker to differentiate between cancer and edema diseases. From Table 2, contrast is a good biomarker for the pulmonary edema disease and this agrees with the texture feature meaning which means high contrast values for heavy texture changes. Gray level cooccurrence matrix textural properties such as homogeneity, correlation, mean, and moments are good significant biomarkers for diseased lung versus normal ones in general without any specification for the disease type. Our results agree with other articles indicating that textural analysis has the potential to develop into a valuable clinical tool that improves the diagnosis, tumor staging, and therapy assessment. While our results are promising, there is still further work that can be done in the detecting of the abnormality within the lungs to detect the type of that abnormality whether it will be a lung cancer or edema. A preliminary investigation has been done using statistical analysis to identify the most useful texture features that can be fed to any classification technique later. This statistical analysis is done using ANOVA. After selecting these features we can feed them for better localization and classification as further work.