Unsupervised Detection of Suspicious Tissue Using Data Modeling and PCA
Ikhlas Abdel-Qader et al.

Breast cancer is a major cause of death and morbidity among women worldwide, and early detection is key to improving outcomes. Developing algorithms that aid radiologists in identifying changes in breast tissue early on is therefore essential. In this work, an algorithm based on principal component analysis (PCA) is developed to identify suspicious regions on mammograms. The algorithm employs linear structure and curvilinear modeling prior to the PCA implementation. The algorithm is evaluated on real data in terms of the percentage of correct classifications, false positives (FP), and false negatives (FN). Over 90% accuracy in block classification is achieved using mammograms from the MIAS database.


INTRODUCTION
Breast cancer is the second leading cause of cancer death in women in the United States. Early detection is crucial for treatment success as tumor size is a major prognostic indicator. Studies have shown that early detection and treatment improve the chances of survival for breast cancer patients [1,2]. The American Cancer Society recommends all women 40 years and older undergo yearly screening mammograms. The goal of screening mammography is the detection of cancer before it becomes palpable. Unfortunately, mammograms are not 100% accurate. False positive rates of 15-30% and false negative rates of 10-30% have been reported [3]. False-positives (labeling a finding as suspicious which later is found to be benign) lead to unnecessary biopsies and anxiety, while false-negatives (failure to detect presence of cancer) result in later detection and often poorer prognosis. Nonetheless, mammography has an overall accuracy rate of 90% [2].
Although radiologists are capable of detecting a number of findings suggesting cancerous tissues in radiographic images, a significant percentage of abnormalities are missed [3]. Screening programs typically require radiologists to read large numbers of mammograms with great attention to fine details. As less than 10% of exams will have abnormalities that need further attention and only around 1% will actually have cancer, this is a rather tedious process. Fatigue, satisfaction of search (failing to detect additional abnormalities once one finding is detected), and failure to perceive subtle changes are common causes of false negative exams. Development of computer algorithms to assist radiologists in detection of abnormalities would be extremely beneficial. Masses can be hidden by normal dense glandular tissue and fine microcalcifications can blend in with background tissue. Computer-aided image analysis enables detection of masslike structures only a few millimeters in size and even smaller microcalcifications. Cancerous tissues usually arise in duct channels and lobules. It is critical to define the degree of abnormality compared to normal cells and growth rate of abnormal cells, which is named tumor grade.
Computerized feature extraction techniques are used to extract features in mammographic images that may not be readily perceived by radiologists. Many methods have been proposed in the literature for mammography detection and classification utilizing a wide variety of algorithms to achieve their goals. Chan et al. [4] used artificial neural networks to extract features from mammograms to predict whether the presence of microcalcifications is associated with malignant or benign pathology. A back-propagation artificial neural network classifier was trained and tested with a leave-one-case-out method to recognize the malignant or benign microcalcification clusters. 11 out of the 28 benign cases were correctly identified (39%) without missing any malignant cases. Lemaur et al. [5] used wavelets having a high Sobolev regularity to detect clustered microcalcification in digitized mammograms. Morrow et al. [6] used each pixel in the mammographic images as a seed to grow a region. Then, the contrast of each region is calculated and enhanced by applying an empirical transformation based on each region's seed pixel value, its contrast, and its background. The validity of microcalcification clusters and anatomic details is considerably improved in the processed images. Shen et al. [7] developed a set of shape factors to measure the roughness of contours of calcifications in mammograms and for use in microcalcification classification as malignant or benign. Wang and Karayiannis [8] used wavelet transform to decompose the mammograms into different frequency subbands. They suppressed the low-frequency subbands, making microcalcifications correspond to high-frequency components, and reconstructed the mammogram from the subbands containing only high frequencies. Zwiggelaar [3] used fractals and statistical modeling to separate the structure and texture background that are present in mammographic images.
In this work, PCA and feature modeling are combined to identify abnormal mammograms in a local processing setting. The algorithm identifies regions of suspicious tissue within areas of size 120 × 120. This paper is organized as follows. Section 2 presents the PCA procedure. Section 3 presents the linear and curvilinear modeling of the data. Simulation results are presented in Section 4, followed by the conclusions in Section 5.

BACKGROUND ON PCA
Principal component analysis has proven to be one of the best methods to find similar patterns or features in a data set. It is an essential statistical tool in pattern recognition applications, which include medicine (such as sample identification), industry (quality control and document image analysis), and government (fingerprint identification and face recognition) [9][10][11]. Ye and Auner [11] used PCA to reduce the dimensionality of signatures for different types of samples, creating a real-time approach to sample analysis with a Raman spectrometer mounted directly at the end-effector of a medical robot to enhance remote robotic surgery. Chen et al. [9] developed PCA-based algorithms to generate a set of new identifying keys for a given set of patterns, reducing the number of comparisons during the near-matching process. Pinkowski [10] used PCA for feature reduction on a speaker-dependent data set to achieve a high recognition rate when analyzing spectrograms that contain human speech utterances; a 97.5% correct recognition rate was achieved using PCA. Also, Swiniarski and Swiniarsk [12] used PCA with rough set methods for feature selection in mammograms and reported good results.
In general, the PCA algorithm can be useful whenever automated feature extraction or identification from a digital image is required. It is based on finding a match for a specific feature of the test image in an image database using similarity measures. These measures are defined based on statistical characteristics of the data: variance, covariance, eigenvectors, and eigenvalues. A brief description of PCA is given in the following subsection, followed by an introduction to the two distance measures used in this work.

A brief description of PCA
Suppose that we have $m$ training data vectors $x_1, x_2, \ldots, x_m$ of $n$ dimensions each, that is, $x_j = [x_{1j}, \ldots, x_{nj}]^T$. The PCA algorithm has two phases. The first phase is to find $p$ orthogonal and uncorrelated vectors, and the second phase is to project the given data set into a subspace spanned by these $p$ orthogonal vectors.
The first phase of PCA is as follows.
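(i) Construct an $n \times 1$ vector $\mathbf{m}$ whose $i$th element $m_i$ is the mean of the $i$th dimension of all data $x$, that is,
$$m_i = \frac{1}{m} \sum_{j=1}^{m} x_{ij}, \quad i = 1, 2, \ldots, n.$$
(ii) Subtract the mean vector $\mathbf{m}$ from each data vector $x_j$.
(iii) Form the $n \times n$ covariance matrix $C$ of the mean-centered data.
(iv) Let $\lambda_i$ and $v_i$, $i = 1, 2, \ldots, n$, be the eigenvalues and normalized eigenvectors of the matrix $C$, with $C v_i = \lambda_i v_i$ and $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$. The eigenvectors are called the principal components.
Note that the covariance matrix $C$ is symmetric and positive semidefinite. We have $v_i^T v_j = 0$ whenever $\lambda_i \neq \lambda_j$. If $\lambda_i$ is a repeated eigenvalue of $C$, the associated principal component $v_i$ is not unique.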
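The first phase can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the function names are hypothetical, the 1/m normalization of the covariance matrix is an assumption, and power iteration with deflation stands in for a library eigen-solver so that the sketch depends only on the standard library.

```python
import math
import random


def mean_vector(data):
    """Per-dimension mean of a list of equal-length data vectors."""
    count = len(data)
    return [sum(x[i] for x in data) / count for i in range(len(data[0]))]


def covariance_matrix(data):
    """Covariance of the mean-centered data (1/m normalization is an assumption)."""
    mean = mean_vector(data)
    n = len(mean)
    centered = [[x[i] - mean[i] for i in range(n)] for x in data]
    return [[sum(c[i] * c[j] for c in centered) / len(data) for j in range(n)]
            for i in range(n)]


def mat_vec(matrix, vector):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def principal_components(cov, p, iterations=500, seed=0):
    """Return the p leading (eigenvalue, unit eigenvector) pairs of cov."""
    n = len(cov)
    rng = random.Random(seed)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(p):
        # Start from a random unit vector and repeatedly apply the matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            v = [c / norm for c in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((eigenvalue, v))
        # Deflation: subtract the found component so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs
```

For real mammogram blocks, where n is in the thousands, a dedicated linear algebra routine would be the practical choice; the loop above is only meant to make the steps explicit.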
The second phase of PCA is to project a given $n \times 1$ testing data vector $x$ into the space spanned by $v_1, v_2, \ldots, v_p$, the eigenvectors associated with the $p$ largest eigenvalues of the covariance matrix $C$. This space is called the eigenspace. The projection of the testing data $x$ on the eigenspace is
$$y = V^T (x - \mathbf{m}),$$
where
$$V = [v_1, v_2, \ldots, v_p].$$
Discarding small eigenvalues, and consequently the corresponding eigenvectors, results in dimension reduction and increases the speed of computation. The largest eigenvalues are associated with the eigenvectors that contain the major modes of variation in the data. Once the test mammogram and the training set are all projected into the eigenspace, a distance measure is used to find the nearest match of the test mammogram among the mammograms in the training set. Approximately 90% of the total variance is contained in the first 5 to 10 dimensions. In most implementations of PCA for feature extraction applications, the important decision is how much of the original data variation needs to be captured in the eigenfeatures. Most researchers choose to work with 88% to 98% of the original data variation (see [11]). Including all of the principal components would be equivalent to working with the original data, since each of these vectors is a linear combination of the original vectors. Keeping the first few components and discarding the others does lose part of the original data. In this use of PCA, however, the loss is insignificant because the principal components serve only as feature vectors to be matched against the feature vectors of the mammograms in the training set. The original mammogram is preserved, and the output for the radiologist is the original mammogram with the suspicious blocks identified.
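The second phase and the matching step can be sketched as follows. This is a hedged illustration, not the authors' code: the names `project` and `nearest_label` are hypothetical, the test vector is assumed to be mean-centered before projection, and the Euclidean distance is used as the default similarity measure.

```python
import math


def project(x, mean, components):
    """Project x onto the eigenspace: one coefficient per principal component."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(vi * ci for vi, ci in zip(v, centered)) for v in components]


def euclidean(a, b):
    """Standard Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_label(test_features, training_features, training_labels,
                  distance=euclidean):
    """Label of the training item whose projected features are closest."""
    best = min(range(len(training_features)),
               key=lambda j: distance(test_features, training_features[j]))
    return training_labels[best]
```

A test block would first be flattened into a vector, projected with `project`, and then assigned the label (normal or suspicious) of its nearest training block.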
Thresholding is implemented by discarding all eigenvalues that fall below a threshold $T$, which is a measure of how much of the variation in the original data is accounted for by the retained eigenvalues and their corresponding eigenvectors. The threshold $T$ is calculated as follows:
$$T = \frac{\sum_{i=1}^{L} \lambda_i}{\sum_{i=1}^{n} \lambda_i},$$
where $L$ is usually kept much smaller than the total number of eigenvalues $n$. In our simulation, we choose $L = 36$.
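A short sketch of this variance-based selection is given below. It assumes that the threshold is the ratio of the sum of the L largest eigenvalues to the sum of all n eigenvalues; the function names and the sample eigenvalues are hypothetical.

```python
def retained_variance(eigenvalues, L):
    """Fraction of total variance captured by the L largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:L]) / sum(ordered)


def smallest_L(eigenvalues, target):
    """Smallest number of leading eigenvalues whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)
```

With such helpers one can either fix L in advance, as done in this work, or pick the smallest L that reaches a desired share of the variance.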

Distance measurements
A common way of finding similarities between two patterns is to measure the distance between them. Since minimum distance means maximum similarity, different types of distance measurements are in use. In this study, two types are employed, Euclidean and Chebyshev, in order to compare them and find the better one for mammographic data.

(i) Euclidean distance
This type of distance is the standard metric, which is the shortest distance between two vectors ($x$ and $y$) and is defined as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$

(ii) Chebyshev distance
This type of distance is also known as maximum value distance. It examines the absolute magnitude of the differences between the coordinates of a pair of objects. Chebyshev distance is used when computation time is critical. It is defined as follows:
$$d(x, y) = \max_{1 \leq i \leq n} |x_i - y_i|.$$
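The two distance measures can be written compactly using their standard definitions. The function names are illustrative only.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev_distance(x, y):
    """Largest absolute coordinate difference (maximum value distance)."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For example, for x = [1, 2, 3] and y = [4, 6, 3] the coordinate differences are 3, 4, and 0, so the Euclidean distance is 5 and the Chebyshev distance is 4. The Chebyshev measure avoids the squaring and the square root, which is why it is attractive when speed matters.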

MODELING OF IMAGE FEATURES
Initially the intention was to identify the image features using a linear structure model in an effort to improve the results obtained by utilizing PCA alone [13]. Indeed, a modeling technique that suits the features in the data can be combined with PCA to improve results, as reported in [10,11,13]. In this work, the search is for suspicious tissue, which can be identified as regions surrounded by an edge. Edges in images can be detected using several algorithms such as directional morphology, curvilinear structure detection, directional second-order Gaussian derivative, and convolution-based edge detection algorithms [14][15][16][17][18][19][20][21]. Edges may also have several shapes, such as straight lines, circles, ellipses, and parabolas. It is therefore important to use a data model that fits the nature of the features captured in the data.

Linear modeling
A convolution-based edge detector is used in this work. Four masks of size 5 × 5 pixels are used for vertical, horizontal, and oblique (±45°) structure detection, such that
$$R = a * b,$$
where $R$ is the resultant image, $a$ is the mask, $b$ is the input image, and $*$ denotes two-dimensional convolution. When a vertical mask is convolved with an image, the longitudinal linear structures are extracted in the resultant image. Using the other masks yields analogous outputs for the other orientations. Once the edges in all directions are identified, an edge mammogram is generated. Mask sizes of 3 × 3 are most common and masks are seldom larger than 7 × 7; we chose 5 × 5. Masks used in this implementation for detecting vertical, horizontal, and oblique (45° and −45°) edges are, respectively.
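The convolution step can be sketched as below. The mask coefficients shown are a hypothetical vertical-line mask chosen only for illustration; they are not the masks used by the authors, whose values are not reproduced here. The function name and the "valid" border handling (no padding) are likewise assumptions.

```python
def convolve2d(image, mask):
    """'Valid' 2-D convolution of a list-of-lists image with a square mask."""
    size = len(mask)
    rows, cols = len(image), len(image[0])
    # True convolution flips the mask in both directions before sliding it.
    flipped = [row[::-1] for row in mask[::-1]]
    result = []
    for i in range(rows - size + 1):
        out_row = []
        for j in range(cols - size + 1):
            acc = 0
            for r in range(size):
                for c in range(size):
                    acc += flipped[r][c] * image[i + r][j + c]
            out_row.append(acc)
        result.append(out_row)
    return result


# Hypothetical 5 x 5 mask that responds to thin vertical lines (rows sum to zero,
# so flat regions give no response). Not taken from the paper.
VERTICAL_MASK = [[-1, -1, 4, -1, -1] for _ in range(5)]
```

Applied to a 7 × 7 image containing a single bright vertical line in its middle column, this mask gives a strong positive response where it is centered on the line and negative responses on either side, while a uniform image gives zero everywhere.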

Curvilinear modeling
The Hough Transform (HT) was introduced by Hough in 1962 for detecting and recognizing complex patterns in data [22]. The HT does not require connected or even nearby edge points. The literature is rich in articles on the HT and its ability to track edges, lines, curves, and parabolic features in images. Fundamentally, it is a mapping of edge pixels into a parameter space [22,23]. Edges that represent straight lines, for example, are identified by their slope-intercept parameters $(m, b)$, and the line through edge pixel $(x, y)$ is
$$y = mx + b.$$
However, because of the difficulty of dealing with vertical lines in the image space, a different description of a straight line is commonly used [23]. Lines are parameterized by the orientation of the line $\theta$ and the distance of the line from the origin $\rho$ as follows:
$$\rho = (x_i - x_0)\cos\theta + (y_i - y_0)\sin\theta,$$
where all points $(x_i, y_i)$ of an edge in the image space are represented by a point $(\theta, \rho)$ in the transform space, and $(x_0, y_0)$ is the origin of the image space. This algorithm has proved efficient in applications where the features are straight lines; suspicious regions, however, tend to have circular shapes. The HT for curves is a mapping of a pixel point $(x_i, y_i)$ in the image space to a sinusoidal curve in the Hough space $(\rho, \theta)$. A complete review of detecting curves with the Hough transform can be found in [24,25]. In general, curves of various sizes can be identified in the image scene by introducing a new parameter, such as the radius $r$ of a circle. The transform makes use of the circle formula
$$(x - x_0)^2 + (y - y_0)^2 = r^2 \tag{11}$$
to find the pixels that fall on this circle and simultaneously increments the corresponding accumulator positions. The HT can be high in computational cost and complexity, and it has some disadvantages, such as the fact that some lines are replicated during detection because of spatial sampling. Nevertheless, the HT is a powerful tool for detecting features of various sizes and orientations with high accuracy, and research is ongoing into computationally efficient algorithms [26]. Recently, Olson [27] presented a superior algorithm that provides accurate and fast Hough curves with a worst-case complexity of O(n).
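The accumulator idea for circles can be illustrated with a brute-force sketch for a single, known radius. This is an assumption-laden toy version, not the authors' implementation and not the efficient algorithm of [27]: the function name, the grid bounds, and the tolerance `tol` on the squared distance are all hypothetical, and candidate centers are written (a, b) rather than (x_0, y_0).

```python
from collections import Counter


def hough_circle_centers(edge_points, radius, width, height, tol=0.5):
    """Vote for circle centers (a, b) of a known radius on a width x height grid."""
    votes = Counter()
    r_squared = radius * radius
    for x, y in edge_points:
        for a in range(width):
            for b in range(height):
                # An edge pixel supports every center whose circle passes through it:
                # (x - a)^2 + (y - b)^2 = r^2, up to the tolerance.
                if abs((x - a) ** 2 + (y - b) ** 2 - r_squared) <= tol:
                    votes[(a, b)] += 1
    return votes
```

The accumulator cell with the most votes is taken as the detected center. Searching over several radii adds a third accumulator dimension, which is where the computational cost mentioned above comes from.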

EXPERIMENTAL RESULTS
The algorithm outlined above was simulated in Matlab on mammograms from the Mammographic Image Analysis Society (MIAS) database. The simulations fall into two categories: the first uses PCA with linear modeling in local processing, and the second uses PCA with the HT as a method for curvilinear modeling. In a global approach, each mammogram is treated as a single image, whereas in local processing each mammogram is segmented into blocks prior to any processing and each block is processed as if it were an image. Instead of training on a set of mammograms, we train on a set of blocks from several mammograms. Earlier work showed that PCA performs better with local processing [13]. This is an attempt to examine PCA performance in small neighborhoods that may contain some mammographic features of interest. Twelve images are segmented into 12 blocks each, resulting in 144 training elements. These mammograms are a combination of normal, benign, and malignant mammograms. All of these images and blocks from the MIAS database are labeled, and their information is tabulated in Table 1. Table 1 is the decision table, where the status of each block is visually inspected and labeled by a radiologist. Each block is defined to be either normal (n) or suspicious (s).

Table 1: Decision table for the training set. Segments are labeled normal (n) or suspicious (s); the last column gives the status of the whole image (N or S).

No.  Image    1  2  3  4  5  6  7  8  9  10  11  12  Image is
1    mdb001   n  n  n  n  s  n  s  s  s  n   n   n   S
2    mdb002   n  n  n  n  n  n  s  s  s  n   s   s   S
3    mdb003   n  n  n  n  n  n  n  n  n  n   n   n   N
4    mdb004   n  n  n  n  n  n  n  n  n  n   n   n   N
5    mdb058   n  n  n  n  n  n  s  n  n  n   n   n   S
6    mdb072   n  n  n  n  n  n  n  s  n  n   s   n   S
7    mdb008   n  n  n  n  n  n  n  n  n  n   n   n   N
8    mdb009   n  n  n  n  n  n  n  n  n  n   n   n   N
9    mdb012   n  n  n  n  s  n  n  n  n  n   n   n   S
10   mdb013   n  n  n  n  n  n  n  s  s  n   n   n   S
11   mdb075   n  n  n  n  s  n  n  n  n  n   n   n   S
12   mdb090   n  n  n  s  s  n  s  s  n  n   n   n   S

Figure 1 shows one image from each group. Figure 1(a) is a benign mammogram in the training set; the area around the nipple is defined to be suspicious. Tissues of interest are referred to as suspicious in this paper, whether they are malignant or benign. The final decision is left to a specialist. Figure 1(b) is a normal mammogram in the training set. Figure 1(c) is a malignant mammogram, and the areas around the milk ducts are suspicious. Local processing can be computationally expensive, depending on the training set size and the number of blocks in each image. Moreover, adding more mammograms to the training set increases memory requirements.
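Local processing can be sketched as follows. The grid layout of the 12 blocks is not specified here, so the block dimensions are left as parameters; the function names are hypothetical, and the rule that an image counts as suspicious when at least one of its blocks is suspicious is an assumption that is merely consistent with the labels in Table 1.

```python
def split_into_blocks(image, block_rows, block_cols):
    """Split a list-of-lists image into non-overlapping blocks, row by row."""
    height, width = len(image), len(image[0])
    blocks = []
    for top in range(0, height - block_rows + 1, block_rows):
        for left in range(0, width - block_cols + 1, block_cols):
            blocks.append([row[left:left + block_cols]
                           for row in image[top:top + block_rows]])
    return blocks


def flatten(block):
    """Turn a 2-D block into the 1-D data vector that PCA operates on."""
    return [pixel for row in block for pixel in row]


def image_status(block_labels):
    """Assumed rule: an image is suspicious (S) if any block is suspicious (s)."""
    return "S" if "s" in block_labels else "N"
```

Each flattened block then plays the role of one data vector in the training or testing set.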
Fifteen mammograms were used as the testing set, which amounts to 180 block tests and matching results. Figure 2 shows three sample test images. Each of them belongs to one of the three classified groups: benign, normal, and malignant, respectively.

Results from linear structure modeling
This algorithm is based on extracting the linear models in each image or block using a convolution process. The detectors are composed of four 5 × 5 masks responding to vertical, horizontal, and oblique (±45°) lines. The results of this method for the Euclidean and Chebyshev distance measures are shown in Tables 2 and 3, respectively. These results indicate that the Euclidean distance achieves the same results as the Chebyshev distance in classifying a mammogram as suspicious or not: both reach 60% correct, 26.67% FP, and 13.33% FN. Out of the fifteen tests performed, 9 were correct, 4 were FP, and 2 were FN using both distance measures. However, the block-level statistics show much higher accuracy. In the Euclidean distance simulations, 159 blocks (88.33%) were correctly classified, with 9 (5%) FP and 12 (6.67%) FN classifications. With the Chebyshev distance, 157 blocks (87.22%) were correctly classified, with 11 (6.11%) FP and 12 (6.67%) FN classifications. The Euclidean distance thus has accuracy comparable to that of the Chebyshev distance but yields two fewer FP block classifications, which requires less viewing time by the radiologist.
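The percentages reported here follow directly from the block counts. A small sketch of the arithmetic is given below, assuming labels are restricted to normal ("n") and suspicious ("s"); the function name is hypothetical and values are rounded to two decimals.

```python
def classification_rates(truth, predicted):
    """Return (correct %, FP %, FN %) for lists of 'n'/'s' labels."""
    total = len(truth)
    # False positive: a normal block flagged as suspicious.
    fp = sum(1 for t, p in zip(truth, predicted) if t == "n" and p == "s")
    # False negative: a suspicious block classified as normal.
    fn = sum(1 for t, p in zip(truth, predicted) if t == "s" and p == "n")
    correct = total - fp - fn
    return tuple(round(100.0 * count / total, 2) for count in (correct, fp, fn))
```

For 180 test blocks with 9 false positives and 12 false negatives, this gives 88.33% correct, 5.0% FP, and 6.67% FN.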

Results from curvilinear modeling
The results of using the HT to identify the curvilinear features in the mammogram, as opposed to straight edges, are displayed in Tables 4 and 5.
From these tables, it is observed that correct mammogram classification improves to 73.33%, with 20% FP and 6.67% FN classifications, for the Euclidean distance, while the Chebyshev distance results improve to 66.67% correct classifications, 26.67% FP, and 6.67% FN. However, since the objective here is to alert a radiologist to suspicious regions in the test mammogram, it is again more appropriate to look at the block classification results. These are 91.11% correct classifications, 3.33% FP, and 5.56% FN for the Euclidean distance measure, and 90% correct block classifications, 4.44% FP, and 5.56% FN for the Chebyshev distance measure. It is worth noting that mdb016 was consistently classified as a suspicious mammogram although it is a normal mammogram. This is because the image is a mammogram of fatty and glandular tissue. Tissue composition in mammography is important: cancer is easier to detect in fatty tissue, and mammography becomes less sensitive as the tissue becomes denser (the greater the proportion of fibroglandular tissue). While this mammogram is not suspicious, it is not uncommon for a CAD system to mark an area of normal overlapping tissue as suspicious when the tissue is highly glandular.

CONCLUSION
Our goal is to detect abnormalities in screening mammograms as an additional support system to assist radiologists. 6 International Journal of Biomedical Imaging