Evaluation of the Diagnostic Power of Thermography in Breast Cancer Using Bayesian Network Classifiers

Breast cancer is one of the leading causes of death among women worldwide. There are a number of techniques used for diagnosing this disease: mammography, ultrasound, and biopsy, among others. Each of these has well-known advantages and disadvantages. A relatively new method, based on the temperature a tumor may produce, has recently been explored: thermography. In this paper, we evaluate the diagnostic power of thermography in breast cancer using Bayesian network classifiers. We show how the information provided by the thermal image can be used to characterize patients suspected of having cancer. Our main contribution is the proposal of a score, based on the aforementioned information, that could help distinguish sick patients from healthy ones. Our main results suggest the potential of this technique for this purpose but also show the main limitations that have to be overcome before it can be considered an effective complementary diagnostic tool.


Introduction
Breast cancer is one of the main causes of death among women worldwide [1]. Moreover, high specificity is required in the diagnosis of this disease, given that a false positive may lead to the surgical removal of the breast [2]. Nowadays, there are different techniques for carrying out the diagnosis: mammography, ultrasound, MRI, biopsies, and, more recently, thermography [3][4][5][6]. In fact, thermography started in 1956 [7] but was discarded some years later because of the poor quality of the thermal images [8] and the low specificity values it achieved. However, with the development of new thermal imaging technology, thermography has reappeared and is being seriously considered as a complementary tool for the diagnosis of breast cancer [9]. Because of the specificity required, it is essential to have as many tools available as possible in order to reduce the number of false positives while still achieving high sensitivity. Although open biopsy is regarded as the gold standard technique for diagnosing breast cancer, it is practically the last diagnostic resource used since it is an invasive procedure that carries not only significant health implications but also psychological and economic ones [10]. Other techniques, which are not necessarily invasive, have implicit risks or limitations such as X-ray exposure, interobserver interpretability, and difficult access to expensive high-tech equipment [11,12]. Thermography is also noninvasive, but it has the advantage of using a cheaper device (an infrared camera), which is far more portable than those used in mammography, MRI, and ultrasound. Furthermore, it can be argued that some of the variables considered by thermography may be more easily interpreted than those of some of the aforementioned techniques.
As a matter of fact, in this paper we will explore and assess this argument in order to measure the potential of such a technique as a diagnostic tool for breast cancer. Moreover, our main contribution is the proposal of a score, based not only on thermographic variables but also on variables that portray more information than temperature alone, that might help differentiate sick patients from healthy ones. We will also explore the potential of thermography in diagnosing women below the age of 50, which would allow the detection of the disease in its early stages, thus reducing the percentage of mortality.
The rest of the paper is organized as follows. In Section 2, we present related research that places our work in context and helps the reader appreciate our contribution. In Section 3, we explain the materials and methods used in our experiments. In Section 4, we present the methodology and the experimental results. In Section 5, we discuss these results and, finally, in Section 6, we conclude the paper and give directions for future research.

Related Research
In our review of the related literature, we divided the works into three categories: introductory, image-based, and data-based works [13][14][15][16][17]. The introductory research mainly points out the potential of thermography as an alternative diagnostic tool for breast cancer, comparing its performance to that of other diagnostic methods such as mammography and biopsy [18,19]. Unfortunately, because this research is intended as an introduction to the topic, it lacks some important details about the data used in these studies as well as the analyses carried out.
The image-based works mainly range from cluster analyses applied to thermal images (to differentiate healthy from sick breasts) [20] to fractal analyses (to characterize the geometry of the malignant lesions) [21] to the camera calibration for capturing thermal images [3,22].
The data-based investigations present statistical analyses of patient databases (healthy and sick) such as nonparametric tests, correlation, and analysis of variance; artificial intelligence analyses such as artificial neural networks and Bayesian analysis; and numerical models such as physical and simulation models (bioheat equations) [8,9,23,24,25,26]. Only a small number of papers propose a score formed from thermographic data [27,28], and they use a maximum of 5 variables to form such a score. In our research, we propose 14 variables to calculate this score: this is the main contribution of the paper alongside the analysis of the diagnostic power of the proposed variables. In Section 3, we will present those variables in more detail and, in Section 4, we will evaluate how informative these variables are in the diagnosis of breast cancer. To end this section, it is important to mention that although the research in this category is very interesting, in some of these works the methodology is not clear. This prevents one from easily reproducing the experiments carried out there. We have done our best to present a clear methodology so that our results can be reproduced.

The Database.
For our experiments, we used a real-world database of 98 cases provided by an oncologist who has specialized in the study of thermography since 2008: 77 cases are patients with breast cancer (78.57%) and 21 cases are healthy patients (21.43%). All the results (either sick or healthy) were confirmed by an open biopsy, which is considered the gold standard diagnostic method for breast cancer [29]. We include in this study 16 explanatory variables (attributes): 8 of them form our score (proposed by the expert), 6 are obtained from the thermal image, one variable is the score itself, and the final variable is age, which was discretized into three categories as recommended for the selected algorithms [30][31][32]. In Table 1, we give details of the names, definitions, and values of each of these variables. The dependent variable (class) is the outcome (cancer or no cancer).

Bayesian Networks.
A Bayesian network (BN) [33,34] is a graphical model that represents relationships of a probabilistic nature among variables of interest. Such networks consist of a qualitative part (structural model), which provides a visual representation of the interactions among variables, and a quantitative part (set of local probability distributions), which permits probabilistic inference and numerically measures the impact of a variable or sets of variables on others. Both the qualitative and quantitative parts determine a unique joint probability distribution over the variables in a specific problem [33][34][35]. In other words, a Bayesian network is a directed acyclic graph consisting of [36]: (a) nodes (circles), which represent random variables, and arcs (arrows), which represent probabilistic relationships among these variables; and (b) for each node, a local probability distribution attached to it, which depends on the state of its parents. The joint probability distribution is given by (1):

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)), \quad (1)$$

where $Pa(X_i)$ represents the set of parent nodes of $X_i$, that is, nodes with arcs pointing to $X_i$. Equation (1) also shows how to recover a joint probability from a product of local conditional probability distributions.
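The factorization in (1) can be illustrated with a minimal sketch. The three-node network below and all of its probability values are hypothetical and chosen only for illustration; they are not estimated from the database used in this paper, and the names `parents`, `cpt`, and `joint_probability` are our own.

```python
from itertools import product

# Hypothetical structure: the class node points to two binary attributes.
parents = {"outcome": (), "asymmetry": ("outcome",), "furrow": ("outcome",)}

# cpt[node][parent_values][value] = P(node = value | parents = parent_values)
# All numbers are made-up illustration values.
cpt = {
    "outcome": {(): {1: 0.7, 0: 0.3}},
    "asymmetry": {(1,): {1: 0.8, 0: 0.2}, (0,): {1: 0.4, 0: 0.6}},
    "furrow": {(1,): {1: 0.6, 0: 0.4}, (0,): {1: 0.1, 0: 0.9}},
}

def joint_probability(assignment):
    """Multiply one local conditional probability per node, as in (1)."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        prob *= cpt[node][parent_values][assignment[node]]
    return prob

# Summing the joint over every configuration must give 1.
nodes = list(parents)
total = sum(
    joint_probability(dict(zip(nodes, values)))
    for values in product((0, 1), repeat=len(nodes))
)
```

Under these assumed tables, `joint_probability({"outcome": 1, "asymmetry": 1, "furrow": 1})` is the product 0.7 × 0.8 × 0.6, and `total` equals 1 up to floating-point rounding, which is a quick check that the local distributions define a valid joint distribution.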

Bayesian Network Classifiers.
Classification refers to the task of assigning class labels to unlabeled instances. In such a task, given a set of unlabeled cases on the one hand and a set of labels on the other, the problem to solve lies in finding a function that suitably matches each unlabeled instance to its corresponding label (class). As can be inferred, the central research interest in this specific area is the design of automatic classifiers that can estimate this function from data (in our case, we are using Bayesian networks). This kind of learning is known as supervised learning [37][38][39]. For the sake of brevity, we do not reproduce here the code of the three procedures used in the tests carried out in this research. We only describe them briefly and refer the reader to their original sources. The procedures used in these tests are as follows.
(a) The Naïve Bayes classifier (NB) is one of the most effective classifiers [38] and the benchmark against which state-of-the-art classifiers have to be compared. Its main appeal lies in its simplicity and accuracy: although its structure is always fixed (the class variable has an arc pointing to every attribute), it has been shown that this classifier has a high classification accuracy and optimal Bayes's error (see Figure 3, Section 4). In simple terms, the NB learns, from a training data sample, the conditional probability of each attribute given the class. Then, once a new case arrives, the NB uses Bayes's rule to compute the conditional probability of the class given the set of attributes, selecting the value of the class with the highest posterior probability.
(b) Hill-Climber is Weka's [41] implementation of a search and scoring algorithm, which uses greedy hill-climbing [42] for the search part and different metrics for the scoring part, such as Bayesian information criterion (BIC), Bayesian Dirichlet (BD), Akaike information criterion (AIC), and minimum description length (MDL) [43]. For the experiments reported here, we selected the MDL metric. This procedure takes an empty graph and a database as input and applies different operators for building a Bayesian network: addition, deletion, or reversal of an arc. In every search step, the MDL score of each candidate structure is calculated and the procedure keeps the structure with the best (minimum) score. It finishes searching when no new structure improves the MDL score of the previous network.
(c) Repeated Hill-Climber is Weka's [41] implementation of a search and scoring algorithm, which uses repeated runs of greedy hill-climbing [42] for the search part and different metrics for the scoring part, such as BIC, BD, AIC, and MDL. For the experiments reported here, we selected the MDL metric. In contrast to the simple Hill-Climber algorithm, Repeated Hill-Climber takes as input a randomly generated graph. It also takes a database, applies different operators (addition, deletion, or reversal of an arc), and returns the best structure found over the repeated runs of the Hill-Climber procedure. With this repetition of runs, it is possible to reduce the problem of getting stuck in a local minimum [35].
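The search-and-score idea behind the Hill-Climber procedure can be sketched as follows. This is not Weka's code: it is a minimal illustration that assumes discrete data, a common textbook form of the MDL score (negative log-likelihood in bits plus half of log2 N per free parameter), and no limit on the number of parents. The function names and the toy dataset are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def mdl_score(data, parents):
    """Assumed MDL form: -log-likelihood + (log2 N / 2) * number of free parameters."""
    n = len(data)
    cards = [len({row[i] for row in data}) for i in range(len(data[0]))]
    total = 0.0
    for child, pa in parents.items():
        pa = sorted(pa)
        joint = Counter((tuple(row[p] for p in pa), row[child]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        log_lik = sum(c * math.log2(c / marginal[cfg]) for (cfg, _), c in joint.items())
        n_params = (cards[child] - 1) * math.prod(cards[p] for p in pa)
        total += -log_lik + 0.5 * math.log2(n) * n_params
    return total

def is_acyclic(parents):
    """Repeatedly strip nodes without remaining parents; a leftover means a cycle."""
    remaining = {v: set(pa) for v, pa in parents.items()}
    while remaining:
        roots = [v for v, pa in remaining.items() if not pa]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for pa in remaining.values():
            pa.difference_update(roots)
    return True

def neighbours(parents):
    """Yield every graph reachable by adding, deleting, or reversing one arc."""
    for a, b in permutations(parents, 2):
        g = {v: set(pa) for v, pa in parents.items()}
        if a in g[b]:
            g[b].discard(a)                      # delete arc a -> b
            yield g
            r = {v: set(pa) for v, pa in g.items()}
            r[a].add(b)                          # reverse it into b -> a
            yield r
        elif b not in g[a]:
            g[b].add(a)                          # add arc a -> b
            yield g

def hill_climb(data):
    current = {i: set() for i in range(len(data[0]))}    # start from the empty graph
    current_score = mdl_score(data, current)
    while True:
        scored = [(mdl_score(data, g), g) for g in neighbours(current) if is_acyclic(g)]
        best_score, best_graph = min(scored, key=lambda pair: pair[0])
        if best_score >= current_score:                  # no neighbour improves the score
            return current, current_score
        current, current_score = best_graph, best_score

# Toy data: column 1 copies column 0, column 2 is unrelated to both.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 5
learned, learned_score = hill_climb(data)
```

On this toy dataset the search links columns 0 and 1 with a single arc and leaves column 2 isolated, because an arc into column 2 adds parameters without improving the likelihood. A repeated variant would call a similar loop from several random starting graphs and keep the lowest-scoring result.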

Evaluation Method: Stratified k-Fold Crossvalidation.
We followed the definition of the crossvalidation method given by Kohavi [37]. In k-fold crossvalidation, we split the database $D$ into $k$ mutually exclusive random samples called the folds, $D_1, D_2, \ldots, D_k$, which have approximately the same size. We trained the classifier each time $t \in \{1, 2, \ldots, k\}$ using $D \setminus D_t$ and tested it on $D_t$ (the symbol $\setminus$ denotes set difference). The crossvalidation accuracy estimation is the total number of correct classifications divided by the sample size (total number of instances in $D$). Thus, the k-fold crossvalidation estimate is as follows:

$$\mathrm{acc}_{cv} = \frac{1}{n} \sum_{(v_i, y_i) \in D} \delta\big(I(D \setminus D_{(i)}, v_i), y_i\big), \quad (2)$$

where $I(D \setminus D_{(i)}, v_i)$ denotes the label assigned by inducer $I$ to an unlabeled instance $v_i$ on dataset $D \setminus D_{(i)}$, $D_{(i)}$ is the fold that contains instance $v_i$, $y_i$ is the class of instance $v_i$, $n$ is the size of the complete dataset $D$, and $\delta(i, j)$ is a function where $\delta(i, j) = 1$ if $i = j$ and 0 if $i \neq j$. In other words, if the label assigned by the inducer to the unlabeled instance $v_i$ coincides with class $y_i$, then the result is 1; otherwise, the result is 0; that is, we consider a 0/1 loss function in our calculations of (2). It is important to mention that in stratified k-fold crossvalidation, the folds contain approximately the same proportion of classes as in the complete dataset $D$. A special case of crossvalidation occurs when $k = n$ (where $n$ represents the sample size). This case is known as leave-one-out crossvalidation [37,39].
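The estimate in (2) can be sketched in a few lines. The code below is an illustration under our own assumptions, not the routine used by Weka: folds are built by dealing the shuffled cases of each class round-robin, and `majority_inducer` is a hypothetical stand-in for a real classifier that simply predicts the most frequent training class.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Return k lists of indices with roughly the class proportions of the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin; carrying the offset keeps fold sizes even.
        for pos, i in enumerate(indices):
            folds[(offset + pos) % k].append(i)
        offset += len(indices)
    return folds

def cross_validation_accuracy(instances, labels, k, induce, seed=0):
    """Fraction of cases labelled correctly while held out (0/1 loss)."""
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(instances) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        classify = induce(train_x, train_y)   # trained on everything outside the fold
        correct += sum(classify(instances[i]) == labels[i] for i in test_idx)
    return correct / len(labels)

def majority_inducer(train_x, train_y):
    """Hypothetical baseline: always predict the most frequent training class."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda x: majority

# Same class sizes as the database described earlier: 77 cancer, 21 healthy.
labels = ["cancer"] * 77 + ["healthy"] * 21
instances = [None] * len(labels)              # attributes are irrelevant to this baseline
folds = stratified_folds(labels, k=10)
baseline = cross_validation_accuracy(instances, labels, 10, majority_inducer)
```

With these class sizes every fold holds 9 or 10 cases, 2 or 3 of them healthy, and the baseline estimate is 77/98, since the majority class in every training split is cancer. Setting `k` equal to the number of cases turns the same function into leave-one-out crossvalidation.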
For both evaluation methods, we assessed the performance of the classifiers presented in Section 3.2 using the following measures [44][45][46][47].
(a) Accuracy: the overall number of correct classifications divided by the size of the corresponding test set:

$$\text{accuracy} = \frac{cc}{N}, \quad (3)$$

where $cc$ represents the number of cases correctly classified and $N$ is the total number of cases in the test set.
(b) Sensitivity: the ability to correctly identify those patients who actually have the disease:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \quad (4)$$

where TP represents true positive cases and FN is false negative cases.
(c) Specificity: the ability to correctly identify those patients who do not have the disease:

$$\text{specificity} = \frac{TN}{TN + FP}, \quad (5)$$

where TN represents true negative cases and FP is false positive cases.
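The three measures can be computed directly from the confusion counts. The sketch below uses a hypothetical classifier that labels every case as cancer, applied to the class sizes of our database (77 sick, 21 healthy); it is meant only to show how the measures behave on imbalanced data and does not reproduce any table of this paper.

```python
def diagnostic_measures(actual, predicted, positive="cancer"):
    """Accuracy, sensitivity, and specificity from paired actual/predicted labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against a test set with no positive (or no negative) cases.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

actual = ["cancer"] * 77 + ["healthy"] * 21
always_cancer = ["cancer"] * len(actual)      # hypothetical degenerate classifier
measures = diagnostic_measures(actual, always_cancer)
```

For this degenerate classifier the sensitivity is 100%, the specificity is 0%, and the accuracy is 77/98, about 78.57%. This shows why accuracy alone can look acceptable on an imbalanced database while the specificity is at its worst possible value.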

Methodology and Experimental Results
We used stratified 10-fold crossvalidation on the 98-case database described in Section 3.1. All the algorithms described in Section 3.2.1 used this data in order to learn a classification model. Once we have this model, we then evaluate its performance in terms of accuracy, sensitivity, and specificity. We used Weka [41] for the tests carried out here (see their parameter set in Table 2). For comparison purposes other classifiers were included: a multilayer perceptron (MLP) neural network and decision trees (ID3 and C4.5) with default parameters. The fundamental goal of this experiment was to assess the diagnostic power of the thermographic variables that form the score and the interactions among these variables. To illustrate how the variable values are obtained, we cite one example.
(a) In Figure 1 we show the type of images obtained by the thermal imager; in this case, the front of the breast thermography. Using ThermaCAM Researcher Professional 2.9 [48] software, we detect the hottest areas of the breast that pass from red to gray. The breast whose furrow displays the largest gray area is assigned a positive value and the other a negative one.
Computational and Mathematical Methods in Medicine
In Figure 2 we show a general overview of the procedure of breast thermography, from thermal image acquisition to the formation of the score. Tables 3-10 show the numerical results of this experiment. Figures 3 and 4 show the structures resulting from running the Hill-Climber and Repeated Hill-Climber classifiers, and Figure 5 shows the decision tree (C4.5). We do not present the structure of the Naïve Bayes classifier since it is always fixed: there is an arc pointing to every attribute from the class. For the accuracy test, the standard

Discussion
The main objective of this paper is to assess the diagnostic power of thermography in breast cancer using Bayesian network classifiers. As can be seen from Table 3, the overall accuracy is still far from a desirable value. We chose Bayesian networks for the analyses because this model not only carries out a classification task but is also able to show interactions between the attributes and the class as well as interactions among the attributes themselves. This ability of Bayesian networks allows us to visually identify which attributes have a direct influence over the outcome and how they are related to one another. The MLP shows a comparable performance but lacks the power of explanation: it is not possible to query this network to know how it reached a specific decision. On the other hand, decision trees do have this explanation capability but lack the power to represent interactions among attributes (explanatory variables). Figures 3 and 4 show that only 5 variables (out of 16) are directly related to the score: 1C, f unique, thermovascular, curve pattern, and asymmetry t. Hence we can see that the score influence on the class outcome is null and the variable furrow (this variable is part of the score) is the only one that affects the class. Figure 5 shows that procedure C4.5 also identifies 2 of those 5 variables as being the most informative ones for making a decision: f unique and asymmetry t. In fact, if we only consider these attributes, we get the same classification performance as that when taking into account all thermographic variables. Other models, such as artificial neural networks, cannot easily identify this situation. As seen in Section 3.2, the extensive use of conditional independence allows Bayesian networks to potentially disregard spurious causes and to easily distinguish direct influences from indirect ones.
In other words, once these variables are known, they render the rest of the variables independent from the outcome. Another surprising result is that of variable age: some other tests consider this to be an important observation for the diagnosis of breast cancer [30][31][32]. However, our analyses suggest that, at least with the database used in our experiments, age is not important in a diagnosis when using thermography. As can be seen from Figures 3 and 4, age is disconnected from the rest of the variables. This may imply that thermography shows potential for diagnosing breast cancer in women younger than 50 years of age.
Regarding the sensitivity performance of our models (see Table 3), Hill-Climber and Repeated Hill-Climber achieve a perfect value of 100%. This means that, at least with our database, thermography is excellent for identifying sick patients. The Naïve Bayes classifier shows a significantly worse performance; it can be argued that this performance is due to the noise that the rest of the variables may add. Once again, if we only considered the 5 variables mentioned above, we would get the same results as those using Hill-Climber and Repeated Hill-Climber. Other models would not be capable of revealing this situation. Of course, it is mandatory to get more data in order to confirm such results.
It is important to point out that the Hill-Climber and Repeated Hill-Climber procedures identify the same 5 variables as directly influencing the outcome.
Regarding the specificity performance of our models (see Table 3), Hill-Climber and Repeated Hill-Climber achieve the worst possible value of 0%. This means these 5 variables, while being informative when detecting the presence of the disease, are not useful for detecting the absence of such disease (see Tables 5-10). On the other hand, the noise that the rest of the attributes produce when detecting the disease seems to work the other way around: it is not noise but information that makes Naïve Bayes achieve a specificity of 33%. Of course, such a value is far from desirable, but this result makes us think of proposing two different scores (one for sensitivity and one for specificity) with two different sets of variables. But our proposal of a score is a first approximation to combine thermographic variables in such a way as to allow us to tell sick patients from healthy ones. Our results show that such a score needs to be refined in order to more easily identify these types of patients.
Although the results may be discouraging, we strongly believe that they are a step forward in order to more deeply comprehend the phenomenon under investigation: breast cancer. In fact, we have proposed a score that takes into account more information than just that of temperature. Until now, few areas of research have considered other variables apart from that of temperature [27,28]. Those papers include in their analyses a total of 5 variables that can be extracted from the information a thermogram provides. Our score includes 16 variables and our work, to the best of our knowledge, presents the first analysis of this kind of data using Bayesian networks. What this analysis suggests is a refinement of the score, probably in the sense of proposing a more complex function to represent it beyond the simple addition of the values of each attribute. Intuitively, we thought that other variables, such as hyperthermia or thermovascular network, would be more significant in differentiating sick patients from healthy ones.
In the case of the database, we are aware of the limitations regarding the number of cases and the imbalance of the number of classes. Thus, we would need to collect more data so that more exhaustive tests can be carried out.

Conclusions and Future Work
Thermography has been used as an alternative method for the diagnosis of breast cancer since 2005. The basic principle is that lesions in the breasts are hotter than healthy regions. In our experience, taking into account temperature alone is not enough to diagnose breast cancer. That is why we proposed a score that considers more information than temperature alone. We have found that only 5 of the attributes that form this score directly influence the determination of whether a patient has cancer.
Although some other research projects show better performance than ours, their methodology for carrying out the experiments is not clear; thus these experiments cannot be reproduced. Therefore, we need to more closely explore the details of these models and the nature of their data. In this paper we have done our best to present the methodology used in our experiments as clearly as possible so that they can indeed be reproduced. It is true that we do not give details about how the database was formed (since this is not the primary goal of the paper). However, we believe that if we make this database available, researchers who want to reproduce our experiments should be able to do so without much trouble.
We have found that the framework of Bayesian networks provides a good model for analyzing this kind of data: it can visually show the interactions between attributes and outcome as well as the interactions among attributes and numerically measure the impact of each attribute on the class.
Although we obtained excellent sensitivity results, we also obtained very poor specificity results. The sensitivity values are consistent with the expectations of the expert, and a discussion about how helpful the Bayesian network is for better understanding the disease is already underway. Given that the diagnosis of breast cancer requires high specificity, we have to investigate more deeply the causes of those poor results. One possible direction for future research is to obtain more balanced data using techniques such as SMOTE [49], ADASYN [50], AdaC1 [51], and GSVM-RU [52]. Another possible direction is to design a score based on a more complex function than a simple sum. A third direction is reviewing how the variables are collected and trying to reduce subjectivity in them. Finally, we have also detected that medical doctors usually take into account more information than that supplied to the models for diagnosing breast cancer. Thus, we can also work more in the area of knowledge elicitation.