
The extraction of the breast boundary is crucial to perform further analysis of mammograms. Methods to extract the breast boundary can be classified into two categories: methods based on image processing techniques and those based on models. The former use image transformation techniques such as thresholding, morphological operations, and region growing. In the second category, the boundary is extracted using more advanced techniques, such as the active contour model. The problem with thresholding methods is that it is hard to find the optimal threshold value automatically from histogram information. On the other hand, active contour models require a starting point close to the actual boundary in order to extract it successfully. In this paper, we propose a probabilistic approach to address the aforementioned problems. In our approach we use local binary patterns to describe the texture around each pixel. In addition, the smoothness of the boundary is handled by using a new probability model. Experimental results show that the proposed method reaches 38% and 50% improvement with respect to the results obtained by the active contour model and threshold-based methods respectively, and it increases the stability of the boundary extraction process up to 86%.

Breasts are soft parts of the body which are normally composed of fatty tissues as well as specialized tissues that produce milk. Breast cancer is a very serious disease. The early detection of the disease increases the success of treatment. However, its early detection is difficult since there are no symptoms during the first stages of breast cancer development. Fortunately, X-ray mammography can reveal small changes in breast tissue [

We distinguish two kinds of mammographies, namely,

To take mammograms, radiographers help patients to position their breast between two small plates where X-rays pass through the tissues of the breast. The plates then compress the breast for a moment to take an X-ray image. Each breast is compressed to a thickness of approximately 6 cm, and an X-ray image is taken perpendicular to the plane of compression [

The breast is connected to the pectoral muscle, fatty tissues are located below the skin, and lobules and ducts are located at the center of the breast. There is a phenomenon, known as

As we stated before, during a mammography process, X-rays enter from one side of the breast and exit from the other side. Inside the breast, each tissue attenuates the X-rays to some degree. As a result, the final X-ray attenuation is affected by each tissue inside the breast. The boundary of the compressed breast is thinner than its inner parts, but its texture remains, as the tissue there attenuates the X-rays to a lower degree and produces dark areas in the image. In the same way, whenever X-rays pass through tissues in the breast, they are attenuated according to the density of the tissues and produce bright areas in the image.

Due to the attenuation produced by the different densities of the breast tissues, the final image of the breast is characterized by a specific pattern of gray levels in its different regions. In other words, different regions of the mammogram have different textures (note that, since no imaging technique is perfect, we can expect noise to appear on mammograms).

Analyzing mammograms is one of the hardest tasks even for human experts. Hence, there is a considerable need for

In this paper we propose a probabilistic learning method for tracing the boundaries of the breast and the pectoral muscle. Our method overcomes the problems of thresholding and deformable techniques and provides accurate, applicable, and stable results. In addition, our method does not require preprocessing techniques to remove artifacts or to align the breast image. In contrast with the previously proposed techniques, our method learns the shape information of the mammogram from training mammograms, and, hence, there is no need for a manual determination of parameters. Also, instead of using pixel intensities or edge information, our proposed method utilizes the texture information of each pixel. Experimental results have been obtained by using the mini-MIAS database, and they show that our method is able to extract accurate boundaries even for noisy mammograms.

The rest of the paper is organized as follows. First, we review the current state of the art related to the problem of determining the breast boundary in Section

Generally, there are two different approaches to cope with the extraction of the breast boundary. The first approach is based on the combination of image processing techniques such as thresholding, watershed transformation, morphological operations, and flood fill. The second approach is founded on well-known deformable techniques such as the active contour model and level-set methods.

In early studies by Karssemeijer [

In addition, Dehghani and Dezfooli [

Since a mammogram usually contains noise and other artefacts such as labels, it is reasonable to add preliminary steps to remove them from the image. To this end, Tzikopoulos et al. [

In another approach, Maitra et al. [

Mello and Tenorio [

Maysam Shahedi et al. [

There are other similar approaches [

Another popular approach to determining the breast boundary is to use deformable models such as the active contour model or level-set methods. Since the boundary of the breast is a well-defined curve and the background region is more likely to be composed of low-intensity and low-gradient pixels, it is reasonable to use active contour models (snakes) to look for local minima. The basic idea behind active contour models is to find a contour such that the sum of its internal and external energies is minimized. Basically, the external energy is calculated as the negative of the gradient magnitude of the Gaussian-smoothed image.
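As a hedged illustration of the external energy mentioned above, the following sketch computes the negative squared gradient magnitude of a Gaussian-smoothed image. The squared form, the kernel radius, the border handling, and all function names are assumptions made for this example; they are not taken from any of the reviewed methods.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(image, sigma=1.0, radius=2):
    """Separable Gaussian blur of a 2-D list; borders are replicated."""
    kernel = gaussian_kernel_1d(sigma, radius)
    rows, cols = len(image), len(image[0])

    def clamp(v, size):
        return max(0, min(size - 1, v))

    # Horizontal pass.
    tmp = [[sum(kernel[k + radius] * image[r][clamp(c + k, cols)]
                for k in range(-radius, radius + 1))
            for c in range(cols)] for r in range(rows)]
    # Vertical pass.
    return [[sum(kernel[k + radius] * tmp[clamp(r + k, rows)][c]
                 for k in range(-radius, radius + 1))
             for c in range(cols)] for r in range(rows)]


def external_energy(image, sigma=1.0, radius=2):
    """E_ext = -|grad(G_sigma * I)|^2, using central differences."""
    blurred = smooth(image, sigma, radius)
    rows, cols = len(blurred), len(blurred[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = (blurred[r][min(c + 1, cols - 1)] - blurred[r][max(c - 1, 0)]) / 2.0
            gy = (blurred[min(r + 1, rows - 1)][c] - blurred[max(r - 1, 0)][c]) / 2.0
            energy[r][c] = -(gx * gx + gy * gy)
    return energy
```

On an image with a single clean step edge, the energy is lowest along the edge and zero in flat regions, which is the situation in which a snake is attracted to the correct contour.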

Similarly, level-set methods start with an initial contour and change it using a simple equation. The amount of change at each point on the contour is determined by using a potential function. Generally, the potential function is zero wherever the gradient of the image is maximum. Ferrari et al. [

Differently from traditional models, Thiruvenkadam et al. [

One of the most recent approaches for finding the boundary of the breast is that proposed by Marti et al. [

To the best of our knowledge, there are only a few studies on the pectoral muscle boundary extraction problem. Mustra et al. [

In a mammogram, the number of dark pixels is generally greater than the number of bright pixels, and, usually, bright pixels are distributed uniformly. This results in a histogram with significant peaks at dark pixels and a near-uniform shape at bright pixels. Due to this fact, finding a proper threshold is a difficult task, and it gets even more difficult in the presence of noise. Clearly, the determination of an inappropriate threshold value would lead to the incorrect selection of larger or smaller areas for the breast.

Threshold-based methods, which try to find the threshold value by optimizing some evaluation function directly on the histogram of the image, can easily fail. This problem is illustrated in Figure

Example of breast extraction using different thresholding methods.

As a result, all thresholding methods that only consider histogram information can fail to segment the mammogram accurately. Applying preprocessing techniques to reduce the effect of noise, or morphological operations on the resulting images, does not guarantee an accurate segmentation. In addition, due to the high degree of freedom in the breast shape, simple shape information cannot help thresholding methods to find the proper threshold value. Hence, we cannot expect to obtain accurate segmentation results by using thresholding techniques alone, and there is a high probability that they will produce breast regions that are too small or too large.
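To make the notion of a histogram-only threshold selector concrete, the sketch below implements Otsu's between-class variance criterion. Otsu's method is used here only as a familiar example of optimizing an evaluation function directly on the histogram; it is an assumption of this illustration and not necessarily the method used in the works discussed above.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with gray level <= t form the background class. Only the
    histogram is used, so no spatial or shape information is involved.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with gray level <= t
    sum_bg = 0    # sum of gray levels of those pixels
    for t, count in enumerate(histogram):
        w_bg += count
        sum_bg += t * count
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue  # one class is empty, the variance is undefined
        mean_bg = sum_bg / w_bg
        mean_fg = (weighted_total - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t
```

Because the criterion depends only on the counts per gray level, two images with the same histogram but very different breast shapes receive the same threshold, which is the weakness described above.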

Technically, active contour models and level-set methods are applicable techniques for medical image segmentation, but they are sensitive to initialization: their accuracy depends on how the initial contour is chosen. In the case of mammograms, methods of this kind are usually initialized using thresholding techniques. As a result, they are prone to getting stuck in local optima rather than converging to the actual boundary. Figure

Importance of initialization in deformable models.

Because body tissues are not uniform inside the breast and along its boundary, they appear as textured regions in the final image. In addition, in a textured region, the intensity of each pixel might be different from that of its surroundings, and, hence, each pixel will have a positive gradient. As a result, the final external force function will be highly nonlinear and will contain many local minima.

In addition to the aforementioned problem, artifacts can also interfere and make deformable models fail. This is shown in Figure

Example of interference of an artifact with the breast region.

As we stated before, body tissues are not uniform, and, as a result, they are characterized as textured regions. Hence, simple edge detection methods are not able to extract the boundary of the breast. Figure

(a) Breast image, (b) edge map, and (c) a small portion of edge map.

Figure

To start introducing our ideas, let us suppose that we are given the location of the initial point

Candidates to be the next boundary points.

With the aim to simplify this model and to make it tractable, we assume that the smoothness is a second-order Markov process. Applying this assumption to (

Using the product rule of probability, we can factorize (

Assuming that

In (

In a nutshell, to trace the boundary of the breast, we just need to find the probability distribution of the texture features,

There are many efficient texture feature extraction algorithms, such as first-order statistical measures, cooccurrence matrices, autocorrelation, Voronoi tessellations, Markov random fields, fractals, Fourier transforms, discrete cosine transforms, and Gabor filter banks [

To extract the feature vector of a given texture patch, we compute the LBP value of each pixel in the texture patch. Suppose that

Computation of the LBP value for a pixel using its 8 neighbors.

After computing the LBP values of all pixels in a given patch, a histogram of LBP values is calculated. This histogram is used as the feature vector for that patch. In this paper, to extract the vector of features from a patch, we propose to divide it into four equal subregions as shown in Figure

Division of the image patch into four regions for the extraction of texture features with LBP.
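The feature extraction described above can be sketched as follows. This is a minimal illustration that assumes the basic 8-neighbor LBP with 256 bins per subregion, a clockwise neighbor ordering, and a "greater than or equal" comparison; the names `lbp_value`, `lbp_histogram`, and `patch_features` are hypothetical, and the original implementation may differ in all of these details.

```python
# Offsets of the 8 neighbours, clockwise from the top-left pixel.
# The ordering is a convention chosen for this sketch.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_value(image, r, c):
    """Basic LBP code of pixel (r, c); the pixel must not lie on the image border."""
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image, r0, r1, c0, c1):
    """256-bin histogram of LBP codes over rows r0..r1-1 and columns c0..c1-1."""
    hist = [0] * 256
    for r in range(r0, r1):
        for c in range(c0, c1):
            hist[lbp_value(image, r, c)] += 1
    return hist


def patch_features(image, top, left, size):
    """Concatenate the LBP histograms of the four quadrants of a square patch.

    The patch starts at (top, left), has an even side length `size`, and must
    lie at least one pixel away from the image border.
    """
    half = size // 2
    features = []
    for qr in (top, top + half):
        for qc in (left, left + half):
            features.extend(lbp_histogram(image, qr, qr + half, qc, qc + half))
    return features
```

Concatenating the four histograms keeps coarse spatial information (which quadrant a pattern came from) that a single histogram over the whole patch would discard.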

After collecting the feature vectors of the boundary pixels, we should find a way to model their distribution. Usually, a Gaussian mixture model (GMM) is used in these cases. However, as we show in the experimental results, GMM is not able to model their distribution accurately. One of the difficulties in GMM is the determination of the number of components of the model. In addition, expectation maximization algorithms can get stuck in local optima, and, consequently, GMM cannot model the distribution accurately.

In this paper, we use support vector machines (SVMs) to calculate the probability of the feature vectors. We suppose that a feature vector can be classified into two classes, namely, “boundary” and “nonboundary.” Obviously, there is a decision boundary which separates these two classes. The intuition behind calculating the probability using support vector machines is that the probability of feature vectors near the decision boundary will be close to 0.5, and, on the decision boundary itself, the probability is equal to 0.5. Also, inside each region, the probability is related to the distance of the feature vector to the decision boundary. Technically, the decision function of a support vector machine is calculated as follows:
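A minimal sketch of this idea, assuming the standard kernel form of the SVM decision function and a Platt-style sigmoid that maps the decision value to a probability. The RBF kernel, the sigmoid parameters `a` and `b` (fixed here rather than fitted on training data), and the function names are all assumptions of this illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i (alpha_i * y_i) * K(s_i, x) + b.

    `dual_coefs` holds the products alpha_i * y_i, one per support vector.
    """
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


def platt_probability(f, a=-1.0, b=0.0):
    """P(boundary | f) = 1 / (1 + exp(a * f + b)).

    With b = 0 the probability is exactly 0.5 on the decision boundary (f = 0);
    in practice a and b would be fitted, the values here are placeholders.
    """
    return 1.0 / (1.0 + math.exp(a * f + b))
```

With a negative `a`, feature vectors deeper inside the “boundary” region (larger positive decision value) receive probabilities closer to 1, and those deeper inside the “nonboundary” region receive probabilities closer to 0.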

In this paper, we use the same approach to calculate the probability of the feature vectors. In a nutshell, we collect the feature vectors of boundary points as well as the feature vectors of nonboundary points from the training mammograms so as to train a two-class SVM. Then, for a new feature vector

To calculate

Distribution of the vector

Prior probability is important as we illustrate in Figure

The need for prior knowledge.

Considering the line segments between (

Mathematically, in a support-vector-machine-based probability estimation, there is a direct relation between the distance of the feature vector from the support vectors and its probability. However, since the core of the probability estimation in the support vector machine is a

To avert this problem, we need a prior probability to penalize points such as

The Laplace distribution with

In (

In our algorithm, initializing means finding the starting point of the boundary from which we start the analysis. Since this is the first point, there is no information about previous points, and the smoothness cannot be computed (i.e.,

Marginalization over

Let us suppose that we have trained the texture probability model

Then, the class of this feature vector is determined using the trained SVM, and if it is classified as a “boundary” point, its probability is computed using the same SVM. This process is repeated for every pixel of the same row of the image. If any boundary point is found, the one with maximum probability is selected as the starting point of the boundary. As it is shown in the first step of Figure

Tracing the boundary.

To find the second point of the boundary, all pixels in a radius

The important role of the smoothness factor is emphasised in steps 3 and 5 of Figure

However, the texture probability of the magenta pixel at step 5 is computed inaccurately (which may be caused by the feature vector or the SVM itself). Again, selecting this pixel lowers the smoothness of the boundary, and, as a result, the red pixel is selected as the next boundary point.
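The tracing procedure described in this section can be sketched as a greedy loop that, at each step, scores the candidate pixels around the last boundary point by the product of a texture probability and a smoothness term. Everything specific in the sketch is an assumption: the ring-shaped candidate set, the exponential penalty on the turning angle used as smoothness term, the omission of the prior term, and the callables `texture_prob` and `in_bounds`, which stand in for the trained models and the image extent.

```python
import math


def ring(center, radius):
    """Pixels whose rounded Euclidean distance from `center` equals `radius`."""
    r0, c0 = center
    return [(r0 + dr, c0 + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if round(math.hypot(dr, dc)) == radius]


def turn_smoothness(prev2, prev1, cand, scale=0.5):
    """Hypothetical smoothness score in (0, 1]; 1 means no change of direction."""
    a1 = math.atan2(prev1[0] - prev2[0], prev1[1] - prev2[1])
    a2 = math.atan2(cand[0] - prev1[0], cand[1] - prev1[1])
    # Wrap the angle difference into [0, pi].
    turn = abs(math.atan2(math.sin(a2 - a1), math.cos(a2 - a1)))
    return math.exp(-turn / scale)


def trace_boundary(start, texture_prob, in_bounds, radius=2, steps=10):
    """Greedily extend the boundary from `start` for at most `steps` points."""
    path = [start]
    for _ in range(steps):
        best, best_score = None, 0.0
        for cand in ring(path[-1], radius):
            if not in_bounds(cand) or cand in path:
                continue
            score = texture_prob(cand)
            # Smoothness needs two previous points, so it is skipped
            # when choosing the second boundary point.
            if len(path) >= 2:
                score *= turn_smoothness(path[-2], path[-1], cand)
            if score > best_score:
                best, best_score = cand, score
        if best is None:
            break  # no admissible candidate left
        path.append(best)
    return path
```

The multiplication is what lets a smooth candidate with a slightly lower texture probability win over a candidate whose texture probability was overestimated but which would bend the boundary sharply.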

We have implemented the proposed method in the MATLAB 2010 environment. In addition, we used LibSVM [

In the following experiments, the texture feature vectors of both breast and pectoral muscle pixels are extracted in windows of size

With the aim to reduce the dimensionality of the feature vectors, we applied

The Mammographic Image Analysis Society (MIAS) has generated a digital database of mammograms. Films taken from the UK National Breast Screening Programme have been digitized to 50-micron pixel edge with a Joyce-Loebl scanning microdensitometer, a device linear in the optical density range 0–3.2 representing each pixel with an 8-bit word [

For our experiments, we have used the mini-MIAS [

In order to manually extract the breast boundary and prepare training data, the gradients of sample images were processed using Adobe Photoshop CS, and the boundary of the breast was manually determined using a combination of threshold and magnetic and polygonal lasso tools. Also, in some cases, we used an interactive threshold method to find the ground-truth boundaries. In this way, we selected 57 images and extracted their boundaries manually.

The selection of the training images was done manually such that they cover different shapes and textures as much as possible. For example, based on the size of the breast, the shape of the mammogram can be near-flat, semicurved, or highly curved. From the texture perspective, some mammograms are noisy in their boundaries. In addition, some of them have dense tissues near the nipple area. In summary, the selected 57 training images provide a convenient variety of shape and texture information.

It is clear that different radiologists can draw the boundaries of the same mammogram differently. In addition, a radiologist may extract different boundaries for the same mammogram at different times. Hence, there is no perfect or exact ground-truth boundary for a mammogram. From a probabilistic point of view, there are uncertainties in the boundary. Notwithstanding, it should be noted that our algorithm is probabilistic: it learns the texture and the shape of the training boundaries and computes their corresponding probability density functions. Hence, even if several boundaries can be drawn for a given mammogram, there is just one most likely boundary, which is the one extracted by our algorithm. As a result, our algorithm does not require perfect or identical training ground-truth boundaries. Using the extracted training boundaries, positive and negative samples were collected to train the SVM as follows: for every boundary pixel, the pixel and its left and right neighbors were selected as positive samples. Also, for each mammogram, 900 random pixels of the image were selected as negative samples. The negative samples which were close to the boundary were discarded. Positive and negative samples were collected from all training images, and, in total, 50176 negative samples and 37704 positive samples were obtained. Finally, we selected another 37 hard-to-process test images from the database and extracted their ground-truth boundaries in the same way. The test boundaries were not used for the training of the SVM and the smoothness probability model. We selected only 37 test images for two main reasons: first, it is time consuming to extract the ground-truth boundary of the complete set of images in the database, and, second, in most cases, the texture and the shape information of the mammograms are similar, so they would not be good test cases.
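The sampling scheme described above can be sketched as follows. The distance below which a random pixel counts as "close to the boundary" is not stated in the text, so `min_distance` is an assumed parameter, as are the use of the Chebyshev distance, the fixed random seed, and the function name.

```python
import random


def collect_samples(boundary, height, width, n_random=900, min_distance=5, seed=0):
    """Collect positive and negative training pixels for one mammogram.

    boundary: list of (row, col) ground-truth boundary pixels.
    Positives: each boundary pixel plus its left and right neighbours.
    Negatives: random pixels, discarding those within `min_distance`
    (Chebyshev distance, an assumption) of any boundary pixel.
    """
    rng = random.Random(seed)
    positives = set()
    for r, c in boundary:
        for dc in (-1, 0, 1):
            if 0 <= c + dc < width:
                positives.add((r, c + dc))
    negatives = []
    for _ in range(n_random):
        p = (rng.randrange(height), rng.randrange(width))
        far_enough = all(max(abs(p[0] - r), abs(p[1] - c)) > min_distance
                         for r, c in boundary)
        if far_enough:
            negatives.append(p)
    return sorted(positives), negatives
```

Discarding nearby random pixels explains why the total number of negatives is lower than the number of draws: 57 training images with 900 draws each give 51300 candidates, of which 50176 remained.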

Each boundary is represented by a set of points. Given the extracted boundary

Let us suppose that the ground-truth boundary points are

Estimating the accuracy of a boundary on the extracted boundary.

To compute the value of

Now, we can use (

In addition to the accuracy, the extracted boundary has to be smooth. The smoothness of the curve can be formulated by its first and second derivatives. For a point

With the aim to assess the performance of our algorithm, we have run it three times with different configurations. In the first configuration, all terms in (

Result of tracing the boundary of the breast.

In Figure

Although we are interested in low values of

As the two left images in Figure

According to the table under Figure

There are only a few parameters in our algorithm that are determined manually. The first parameter is the size of the window in which texture features are extracted. Although we have used a predefined value for this parameter, it can be defined as a function of the size of the mammogram. Also, it is possible to use multiscale window sizes with different SVMs and select the candidate point using voting algorithms. The second parameter is the distance between candidate points which can be a function of the size of the mammogram or a fixed value. The only parameter that can affect the result of the algorithm is the

Generally, the tissues of the breast are categorized as

Among the different approaches, image-processing-based methods can be used in real-time applications, but it should be noted that they are not guaranteed to find an accurate boundary. On the other hand, deformable models are slower than image-processing-based methods, and their time complexity directly depends on the number of points that are used to form the contour. Although they are more accurate than image-processing-based methods, they can fail to find the boundaries, especially when artifacts such as labels are connected to or overlap with the breast.

The time complexity of our method is higher than that of the two previous approaches since, instead of using simple rules for selecting a pixel as boundary or nonboundary, our method uses machine-learning techniques to classify the pixels using their texture information. This increases the localization accuracy and stability of the method significantly.

Notwithstanding, a high degree of accuracy and reliability comes with a high time complexity. The most time-consuming part of our algorithm is classifying and calculating the probability of candidate points using the SVM. The complexity of the SVM depends on the number of support vectors, and the number of support vectors depends on the degree of nonlinearity of the feature space. The key to increasing the speed of the algorithm resides in replacing the SVM by other probabilistic methods. If we can achieve this goal, the time complexity of our algorithm could be reduced to the level of that of deformable models.

In order to extract the boundary of pectoral muscles, we used the same probabilistic approach that we used for the boundary of the breast except for the probability density functions that are modeled by using training data from the pectoral muscle boundary. Again, to evaluate the performance of the proposed algorithm, we applied the method in three different configurations as in the previous experiment on 38 test images. The result is shown in Figure

Result of tracing the boundary of the pectoral muscle region.

As Figure

To visually analyze the results, two representative images were selected. It is apparent from the two left images in Figure

On the contrary, the shape information adds a constraint to the probability model, and, in addition, prior probability makes fast nonlinear transitions on the final density function. This prevents the algorithm from selecting wrong candidate points. As in the previous experiment, prior information guides the algorithm to put more value on the pixels with higher texture probabilities, and, for the same reasons that we discussed previously, using this information, the extracted pectoral muscle boundary is more accurate.

Although our method is able to extract the boundary of the breast and pectoral muscle accurately, there are still cases in which the algorithm makes wrong decisions in selecting the next candidate points. This problem appears in the presence of significant amounts of noise that distorts the boundary of the breast. Also, the overlap of artifacts with the breast boundary might lead to wrong decisions. These issues are shown in Figure

Failure analysis of the algorithm.

This figure shows three different images from the mini-MIAS database. The blue curves are the ground-truth boundaries of the breast, and the red dots are the ones extracted by using our algorithm. In the two images on the left, the extracted boundary of the image “mdb065” and its corresponding edge map are shown. The yellow arrows on the image point to those parts of the boundary which are selected inappropriately. To find out the reason for these wrong selections, we can refer to the edge map of the mammogram (on the right). According to the edge map, there is a significant amount of noise in the image which has highly distorted the actual boundary. Also, the algorithm relies on extracted LBP features, but, due to the noise, uncertain features are extracted for candidate points, and, consequently, the texture probability is estimated inaccurately. As a result, the candidate points are selected wrongly in some parts of the boundary.

The two images on the right of Figure

Since we are using a probabilistic framework, we should be able to deal with uncertainties. From a probabilistic perspective, the pixels of the image with inaccurate feature vectors must have lower probabilities to be candidate points. However, if the probability density function of the feature vectors is not accurately modeled or if the feature vectors of both a noisy image patch and an ideal image patch are placed in the same region of feature vector space, their probabilities are computed wrongly, and inappropriate candidates are selected as boundary points.

As we stated before, the LBP is a powerful and efficient feature extraction method with high discriminative power. In this experiment, we utilize different feature extraction methods and compare them with LBP features. As with LBP, the dimensionality of the feature vector is reduced by applying PCA. In this experiment, we have used histogram of oriented gradients (HOG) and first-order statistical features. The cell size and block size of the HOG method are selected as 20 and 2, respectively. Also, a 9-bin histogram was used to extract the final feature vector. Regarding the statistical values, we used

Comparison between LBP, HOG, and first-order statistical features.

As we mentioned in the previous sections, threshold-based algorithms and active contour models are the most widely used techniques for extracting the boundary of the breast. Both can fail to extract an accurate boundary even in simple mammograms. In this section, we elaborate on this claim. We start our discussion by referring to Figure

Optimal threshold values and histograms of four mammograms.

Intuitively, we would expect the automatic threshold selection method to return the same gray-level (three) as threshold value, but, if we refer to the decision functions of automatic threshold selection methods, we realize that there is no guarantee that the gray-level remains the same for the second histogram as well. For this reason, most authors propose to select the threshold value using shape information of the histogram instead of statistical criteria.

However, these methods are not reliable, and they can fail to find the proper threshold. This is shown in Figure

In the first histogram, the optimal threshold value is neither a local minimum nor a local maximum. In the second and third histograms, the optimal threshold value is located at a saddle point of the histogram. However, in the fourth histogram, the optimal threshold is a local minimum of the histogram.

Suppose that we specify that the optimal threshold value is located at a local minimum of the histogram. Since there can be many local minima in the histogram, the question is which of them is the optimal threshold value. In summary, there is no reliable way to find the proper threshold value using the histogram alone. We can conclude that any algorithm that tries to find the threshold value has to find it not only by using the histogram but also by considering the shape of the segmented region after thresholding.
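The ambiguity described above is easy to reproduce. The following sketch lists all strict local minima of a histogram; the function name and the toy histogram are assumptions used for illustration only.

```python
def local_minima(histogram):
    """Return the indices whose count is strictly below both neighbours."""
    return [i for i in range(1, len(histogram) - 1)
            if histogram[i] < histogram[i - 1] and histogram[i] < histogram[i + 1]]


# A toy histogram with a dominant dark peak and a noisy tail.
toy_histogram = [50, 20, 30, 10, 25, 8, 12, 9, 11]
candidates = local_minima(toy_histogram)
```

Here `candidates` contains four gray levels, and nothing in the histogram itself indicates which one separates the breast from the background.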

On the other hand, the proposed active contour models and level-set methods work directly with the gradient of the smoothed image. Since their results are very close to each other, we just consider the active contour model. In this model, the contour is specified by a finite set of points

Usually, these terms are defined as follows:

Assuming that

An example of active contour-based segmentation.

Now, consider that we give nonzero values for parameters

Determining the values of these parameters is time consuming, and it can vary for different mammograms. For example, as it is shown in Figure

In addition, the initialization of the contour is a hard problem. In the proposed methods, the authors have initialized the active contour model using a threshold-based method. As we mentioned before, threshold-based methods have their own weaknesses, and they do not have high success rates.

However, as we showed in our experimental results, our algorithm is able to find an accurate boundary using just texture information. By adding the smoothness factor to this algorithm, smoother boundaries are found. Here, instead of using just raw gradient data or pixel intensities, we utilized texture properties of the surrounding region of the pixel.

Also, the only parameter in our algorithm is the size of the texture window. It is not a sensitive parameter like the threshold value or the parameters of active contour models, and it can be approximated easily. Even if it is considered an important parameter, we could remove its effect by analyzing the mammogram in a scale space. Other parameters of our probability model are obtained through machine-learning techniques.

With the aim to compare our algorithm with previous methods, we implemented the automatic threshold selection method in [

Comparison of our algorithm with the threshold-based and active contour model methods. The yellow boundary is that found by our algorithm, the blue boundary is that found by the snake method, and the red boundary is the one obtained by the threshold-based method.

According to this figure, the results of applying the snake model on images mdb093, mdb125, mdb071, mdb063, and mdb206 are close to our algorithm, but the threshold-based method could not find accurate boundaries. Numerical results show 50% and 38% improvement in the results obtained by our method with regard to the snake and threshold-based methods, respectively. However, it should be noted that there are situations in which threshold-based method produces accurate results, but, even in these cases, the results are very close to ours. Another important aspect about the boundary extraction algorithm is its

Finally, Figure

Result of applying the algorithm on several images from mini-MIAS database.

Extracting the breast boundary in mammograms is a difficult task that has captured the attention of the scientific community. In this paper, we have reviewed the available methods to extract breast boundaries in mammograms, namely, those based on image processing techniques and those based on deformable models, and we have described their pros and cons.

We have proposed a probabilistic approach for solving some of the problems of the current methods. Our approach comprises features of texture, smoothness, and prior probability models. Instead of using raw gray-level values or the gradient of the pixels, we used local binary pattern features to describe the properties of each pixel in the image. The contour is extracted by a region growing strategy, and an initial point in the breast contour is found by classifying the pixels as boundary or nonboundary and selecting the ones with the highest probability of being boundary.

To estimate the probability of a pixel being part of the boundary, a support vector machine is used. Our basic idea is to use the SVM score as the input for a logistic regression model and, then, find the probability by minimizing a function. Since the core of the algorithm relies on a logistic function, we can expect that there are no sharp transitions for high probabilities.

Experimental results on test data show that our method is able to extract the breast boundary accurately. We have evaluated the importance of the smoothness and prior knowledge components, and we have shown that they are necessary for finding precise boundaries. Also, we have compared our method against several off-the-shelf approaches, and we have demonstrated its advantages in terms of both accuracy and stability.