Augmenting Multi-Instance Multilabel Learning with Sparse Bayesian Models for Skin Biopsy Image Analysis

Skin biopsy images can reveal causes and severity of many skin diseases, which is a significant complement for skin surface inspection. Automatic annotation of skin biopsy image is an important problem for increasing efficiency and reducing the subjectiveness in diagnosis. However it is challenging particularly when there exists indirect relationship between annotation terms and local regions of a biopsy image, as well as local structures with different textures. In this paper, a novel method based on a recent proposed machine learning model, named multi-instance multilabel (MIML), is proposed to model the potential knowledge and experience of doctors on skin biopsy image annotation. We first show that the problem of skin biopsy image annotation can naturally be expressed as a MIML problem and then propose an image representation method that can capture both region structure and texture features, and a sparse Bayesian MIML algorithm which can produce probabilities indicating the confidence of annotation. The proposed algorithm framework is evaluated on a real clinical dataset containing 12,700 skin biopsy images. The results show that it is effective and prominent.


Introduction
Skin diseases are common in our daily life. Most of the skin diseases are not harmful to our health, while some kinds of them would lead to serious problems for our health. For example, malignant melanoma is a highly aggressive skin cancer which looks just like some harmless nevi in some cases. Pemphigus mostly characterized by the development of blisters on the skin is a rare skin disorder that leads to severe infection without effective treatment. Consequently, rapid recognition and correct diagnosis are important to the grave skin diseases as well as neoplasms, bullous dermatoses, sexually transmitted diseases (STD), and so forth. However, it is a great challenge for doctors specializing in dermatology since there are more than 3,000 kinds of diseases in this field, and what is worse is that the number of patients in dermatology is increasing rapidly [1], leading to great burden for doctors to precisely inspect large amount of cases every day.
Generally there are two categories of skin imaging inspection methods. The first is skin surface imaging. A doctor could be confident of making a diagnosis through observation and routine examination on the skin surface in some cases. However, in many other cases, especially in cases of skin cancer, a doctor is not easy to make a diagnosis decision when only skin surface information is available. The second is skin biopsy imaging, which is the imaging of slice of skin tissue under microscope. Skin biopsy images reflect the pathological changes behind skin lesions at a microscopic level. It is widely accepted that histopathology is the gold standard of diagnosing a skin disease [2]. Skin biopsy imaging can provide valuable information of what happens under skin surface. To reach correct annotation or diagnosis, a doctor needs not only professional knowledge and rich experience 2 BioMed Research International Widespread parakeratosis, absence of a granular layer, psoriasis-like hyperplasia in epidermis, upward elongation of the dermal papillae, dilation of the capillaries in the papillae, and dermal inflammatory cell infiltration consisting of sparse lymphocytes and histocytes around the blood vessels.
Hyperkeratosis, acanthosis, horn-like pseudocysts that can be seen in prickle cell layer, and inflammatory cell infiltration consisting of sparse lymphocytes and histocytes in the upper dermis.
Incrustation of the follicular orifices, spongiosis present between follicle wall cells, massive lymphocytes and eosinophils infiltration within and around the hair follicles, and increase and dilation of the capillaries in the dermal papillae.
Hyperkeratosis with follicular plug formed, acanthosis with irregular elongation of the epidermal rete ridges, degeneration of the basal cells, and massive lymphocytes and histocytes infiltration around the cutaneous appendages and the blood vessels in the upper dermis. in inspecting skin lesions, but also deep understanding of skin histopathological imaging. While analyzing skin biopsy images consumes more time and requires more skills, differentiating normal/lesion regions or similar skin diseases becomes great challenges for doctors. Meanwhile, current skin biopsy image inspection is heavily relied on experience and professional knowledge of histopathological laboratory experts, which are subjective and unstable. To obtain a stable and reproducible diagnosis result, a computer-aid diagnosis (CAD) system is necessary.
Hence it is meaningful to develop computational methods for automatic feature recognition and annotation of skin biopsy images. However, there are some significant challenges due to the complex structures and textures of biopsy images and indirect relationship between historic diagnosis records and images. First of all, in dermatological practice, when annotating biopsy skin images, doctors only give plain text description for a patient attached to several skin biopsy images. The plain text description involves a set of standard dermatological annotation terms and some linked words to show key features reflected by the biopsy images, as shown in Figure 1. However, in fact, the dermatological terms only reflect certain local regions instead of the whole image. See Figure 2 for details. Only one or more small local regions is responsible for a certain dermatological term. However, the correspondence between dermatological terms and local regions is unknown in current datasets. Thus we cannot model this correspondence directly.
Another challenge is that, even for the same term, its corresponding local regions may be significantly varied in size, shape, texture, lightening, inner structure, or the relation between local regions with different terms. In addition, we should be aware of the fact that sublayers of a skin tissue are strictly ordered, leading to some correlations between local visual regions as well as the corresponding features [3]. All these challenges make the task more difficult to tackle compared with traditional machine learning ones.
Several attempts have been reported publicly to build models or classifiers for skin image automatic annotation or recognition. A portion of them have attempted to design different color space-based feature extraction methods and to apply different machine learning models to achieve good performance for different kinds of skin diseases [4][5][6]. However, a large amount of these methods have to face the problem of manually labeling lesion regions. In order to build a training dataset comprising both normal and lesion skin images, we are required to pick out normal and lesion regions for each skin image. Meanwhile, a large number  of histopathological image analysis methods have also been reported for classification or grading of biopsy images [7][8][9][10]. But few of them attempted to model the indirect relationship between histopathological features and parts of a biopsy image.
Moreover, many previous methods required specialized knowledge to choose a proper color space representation and a model, which is not feasible in most cases. Recently, Bunte et al. [11] proposed a machine learning framework to combine several color space representation methods through a weighting procedure. Zhang et al. [12] proposed to convert the skin biopsy image feature recognition problem into a multi-instance (MI) learning problem and then solve it by current well-studied MI algorithms, which is the first attempt to tackle the skin biopsy image annotation problem within machine learning framework. In their paper, they applied a famous graph cutting algorithm, named nnormalized cut [13], to generate visual disjoint regions and then apply image feature extraction algorithm for each local region, so as to turn each image into a MI sample. However, they simply trained an individual MI learner for each target feature to be recognized, discarding the correlation between target features, which is not sufficient from a medical point of view.
In this paper, we attempt to tackle the skin biopsy image feature extraction problem under a recently proposed machine learning framework, multi-instance multilabel (MIML) learning. We first show that the problem is naturally a MIML learning problem. Then we propose a sparse Bayesian MIML learning algorithm with a Gaussian prior as the main model, which is able to model a posterior distribution of the target features giving images as input. We evaluate the proposed algorithm on a real dataset from the department of dermatology and venereology of a large local hospital. The evaluation results show that the proposed algorithm framework can effectively annotate the concerning terms of skin biopsy images superior to existing methods.   with a plain text description given by an expert. We only select an image for each patient and assume that each selected image contains all terms in the text description. Then we can convert the text description into a binary vector through simple word-matching procedure. Thus the original problem becomes a multilabel binary classification problem.
We further formally define the problem as follows. Let = {( 1 , 1 ), . . . , ( , ), ∈ , ∈ } be a set of images associated with the annotated terms, where is an image, = { 1 , . . . , } is a set of terms associated with the image, and , stand for the whole set of images and terms, respectively. The problem is to learn a function : → with a training image set such that when given a test image it can give the posterior probability of each term in to be annotated to .
To represent the key features of a given image, different feature extraction methods have been proposed and developed and in various fields of image understanding research [7]. However, a large body of feature extraction methods previously applied in histopathological image analysis, which extract global features, is not suitable for our biopsy image annotation task. Because in our problem there are to relationships between notation terms and local regions within images, methods extracting global features are not able to express local features corresponding to each region of interest.
If a given image can be cut properly to generate meaningful regions, the above correspondence can be directly modeled. The proper cutting of a given image should generate regions attached with terms as few as possible. Such regions are relatively simple and easy to be described. In histopathological image analysis, several image cutting methods have been applied in different tasks. Caicedo et al. [4] proposed a bag-of-words approach for histopathological image annotation. They divided an image into blocks of equal size to generate a codebook for feature representation. Ji et al. [14] and Li et al. [15] applied the almost same block-cutting method to generate MI samples from given images. Another region generating method that should be mentioned is based on block clustering proposed by Chen and Wang [16]. They generated regions by clustering 2D waveform transformation coefficiencies of each block. Thus similar blocks can be gathered into a single cluster. In their work clusters were regarded as regions and it generated discontiguous regions, not regions in common sense.
However, such cutting approaches cannot generate regions of medical meaning as we need. As shown in our previous work [12], the model that is built upon such region generating methods cannot properly capture the direct medical knowledge and experience for annotating biopsy images. An experienced doctor would annotate an image by directly inspecting some local visual disjoint regions within the image. Following this observation, we apply the same idea to cut a given image into visual disjoint regions through the normalized cut algorithm proposed by Shi and Malik [13]. The number of regions should be set before running the algorithm. Figure 3 shows the result of normalized cut for an skin biopsy image with = 11.
It should be noted that there is not any optimal for the annotation problem, since the concept of local region is not an actual cutting of an image. A smaller leads to larger regions, which may contain more than one term, while fragment regions may be generated if is large. Hence we add a region size constraint when running the cutting algorithm. A generated region should contain at least 1500 pixels to avoid too much fragments, along with a relatively large . Thus we can get as much as possible regions but avoiding too much fragments.
To further express each generated region as a vectorial representation, we propose a feature representation method that can capture both texture and structure features of regions. The method combined the features extracted through the method introduced in our previous work [8,12] and features from a graph view of the image. Briefly saying, for the first part of the features, the method performs a waveform transformation for each equal-sized block within each region and combines the waveform transformation coefficiencies to form a 9-ary real vector for each region. To make the paper self-contained, we present some details of the extracted features. The first three features 1 , 2 , 3 are means of , , values of all pixels within a region. The next three features 4 , 5 , 6 are mean DWT coefficients HH, HL and LH of all blocks. The last three features are the 1st, 2nd, and 3rd order normalized criteria [17] of the whole region.
For the second part of the features, we represent a region as a graph in which nodes are centroids of clusters of pixels and edges are the relationship between nodes with real weights. We apply a heuristic algorithm [5] to seek the centroids of local similar pixels. Then a Delaunay triangulation method [18] is applied to the set of centroids to add edges. Graph representation methods are widely used in histopathological image analysis for it is able to capture the structure of a tissue [7,9,10,19]. Figure 4 illustrates the main steps of our region feature extraction procedure.
There are three types of graph features considered in our feature representation. The first is average degree of nodes belonging to each cluster in the graph. It can be simply obtained by averaging the degrees of all nodes belonging to the same cluster. The degree of a node is the number of edges. The second is average clustering coefficient (ACC) [20], which measures the average connectivity of a node and its neighbors. The ACC for node is defined as In (1), is the number of edges between node and its neighbors and is the degree of node . The neighborhood between each pair of nodes is measured by the Euclidean distance. We calculate the values of ACC for nodes belonging to different clusters. We compute the average ACC of all nodes in the graph and nodes in the same cluster. Hence there are + 1 average ACC where is the number of clusters. The third is the diameter of the graph, which is defined as the shortest path of the longest path between pair of nodes on the graph. In our work = 4, there are 4 average degrees, 4 × 3 different types of node connection, which results in 12 ACCs, and finally a diameter value of the whole graph. Totally we get a 17-ary feature vector.  Since the generated regions are irregular in shape, padding pixels (in black) must be excluded from our feature extraction procedure. To do this, for the texture features, blocks that have at least one black pixel are discarded. Since the block in our method is of 4 × 4 pixels, it leads to a rough border of the original region which would not significantly affect the texture features. For graph features, it is not a problem since the black pixels would of course be clustered into a single cluster. Thus we can simply discard such black cluster to get rid of padding pixels. Details of the above idea were presented in our recent work [8]. Figure 5 illustrates the processing of padding pixels in our feature extraction procedure.
Thus, a skin biopsy image is decomposed into a MI example (bag), in which visual disjoint regions are instances. Moreover, we can define a binary vector to indicate whether an annotation term is associated with a given image. An annotation term can be regarded as a label associated with an image. Hence the biopsy image annotation problem can be naturally considered as a multi-instance multilabel (MIML) problem. Based on the relationship between regions and terms from clinical experience, we tackle the problem under the standard MI assumption which was firstly introduced by Dietterich et al. [21], assuming that a sample was labeled positively if at least one instance in it is positive and negative otherwise. The standard MI assumption has been widely used in bioinformatics study [22] and it is also suitable for this work.

Sparse Bayesian MIML Learning Framework.
In the previous subsection, we have shown that the problem is naturally a MIML problem. Now we propose a novel algorithm to solve this problem effectively. The general idea is that we first randomly construct a set of basic MIML learners and then learn a sparse weights vector under the relevant vector machine (RVM) [23] framework to combine the basic learners together. The learning framework prunes off many learners by automatically driving the corresponding weights to zero so as to get a sparse solution. The motivation of this work is the consideration of time complexity of building a good MIML learner. A weighted ensemble method is adopted, and the weights are determined by RVM method. The method does not require basic learners of good quality. It can find an optimal combination of learners of low quality at relatively low cost.

Generating Basic Learners.
We make use of a recently proposed Bayesian MIML learning model [24] for the generation of MIML basic learners. The method directly models a predictive distribution of terms conditioning on training data with a Gaussian process (GP) prior. We introduce a set of unobserved real-value functions = { 1 , . . . , } ranging from [0, 1], where is the number of target labels. The value of for a given instance (region) indicates to which extent it should be annotated with the concerning terms. Under the standard MI assumption, the bag label can be determined by a max or soft max function over on all instances in the bag [25].
We formally describe the procedure of basic learner construction as follows. The goal is to model the predictive probability of the concerning annotation terms , giving the training set , a prior GP , and a test sample , which can be expressed as ( | , , GP ). The prior GP can be given by a kernel function through a Gaussian process. The likelihood function associated with latent functions on can be expressed as where is the value of applying to all instances in bag and is a matrix containing all values of applying all on .
Since is unknown, we impose a prior for to avoid overfitting when evaluating it. Following Bonilla et al. 's work [26], a Gaussian prior for with zero mean and covariance is defined as follows: In (3), stands for the gram matrix for some kernel functions (e.g., RBF or poly kernel) in instance space and GP in fact indicates the relationship between terms to be annotated. In [24], they adopted a marginal likelihood maximization method to find the optimal GP , which is expensive. In this work, we do not directly work out the optimal solution for GP . On the contrary, we randomly generate GP times and then learn a vector of weights to obtain an optimal combination.
With GP , we can further derive the posterior distribution given a training dataset as Notice that the second ( ) = ∫ ( | ) ( ) is a constant value since is constant and is integrated out. Thus it can BioMed Research International 7 be ignored. Because ( | ) ( ) is not a Gaussian [26], we use some approximation methods to evaluate it. Following Nickisch and Rasmussen's work [27], we apply the Laplace approximation to convert ( | ) into a Gaussian near its true mode. According to [26,27], we can directly write down the mean and variance of the approximation distribution for ( | ). Meanwhile we notice that ( ) is also a Gaussian, which leads to a Gaussian distribution for ( | , ).
The predictive probability can then be derived from the likelihood, prior, and posterior distribution aforementioned. We have where is a test bag (image) and is a vector of applying all to all instances in . The first term on the righthand side reflects the standard MI assumption, meaning that the largest value among determines the probability to be annotated with the corresponding term. For computational convenience, we often use soft max function instead of max in (5), given by ln ∑ . The predictive distribution is also a Gaussian and can be solved directly as follows: The right-hand side of (6) is a Gaussian, which can be determined through a EM-like procedure [27]. An important thing should be noticed is that (6) has a parameter matrix GP that controls the relationship between terms. The time complexity of the above procedure can be analysed as follows. Suppose we generate a set of basic learners and | | annotation terms. For each learner, there is a random sampling procedure for GP which requires (| | 2 ) operations; training a MIML learner requires (| | × | | 2 ), where | | denotes the number of instances in training dataset.

Sparse Bayesian Ensemble.
Since the cost of calculating the optimal GP is very high, we randomly set them times to obtain a set of different learners and then apply a weighted ensemble procedure as follows: A RVM-like algorithm [23] is adopted to find the optimal weights to combine them. The main reason for using RVM is twofold. On one hand it is purely based on Bayesian theory which is consistent with our basis learner. On the other hand, RVM can give a sparse solution which is preferred in large data analysis and fast annotation. Figure 6 shows the main steps of the proposed algorithm framework.
The target model is a weighted ensemble of a set of basic learners. To get a sparse representation, we impose an ARD prior [28] on the weights which is a Gaussian with zero mean and different variances for each weight . In RVM's  optimization procedure [23], a large body of variances would be driven to infinity leading the corresponding weights to zero. Hence a large body of weights would be pruned off from the model and final a sparse model is obtained. Formally, let = { 1 , . . . , } be a set of weights associated with learners. A Gaussian prior with zero mean and different variances is imposed on . Tipping's work [23] indicated that when applying a maximum a posterior (MAP) learner to learn an optimal , a large body of would be driven to zero. Following this idea, we apply RVM algorithm on given the training dataset .
Please note that the weighted ensemble may not follow a Gaussian distribution. This is because ∑ is not guaranteed to be 1. A normalization procedure should be applied to obtain a normalized combination By applying RVM, a smooth learner can be obtained which captures the general features of the whole training dataset. RVM adopts an iterative procedure to find optimal weights.

Results.
We present the evaluation result of the proposed algorithm on a real dataset gathered from a large local hospital. The setting of basic learner generation is the same as [24] and the setting of RVM follows Tipping's original implementation [23]. The proposed method is compared with some existing approaches in histopathological image analysis. Since some of them are not consistent with the MIML setting in our work, we would implement them on a more general foundation for image analysis.

Dataset and Data Preprocessing.
The evaluation was carried out on a real skin disease clinical dataset from a large local hospital. The dataset has been reconstructed to get rid of irregular patient information and low quality biopsy images.
The biopsy images in the evaluation dataset are taken by a Leica DFC290 digital camera with 20x, 40x, and 100x microscope objective lenses. The images are taken in RGB color space and stored in JPEG format. For convenience, we only keep images at 40x magnification ratio. It contains 4,123 patients with 12,700 images. The images are 2048 × 1536 pixels with 24 colors. For computational efficacy, they are rescaled to 800×600 pixels. There are three 40x biopsy images for each patient on average. We consider 15 features to be annotated, corresponding to 15 standard terms, as shown in Table 1, and then convert the plain-text description into a 15ary binary vector in which each element indicates whether the corresponding term exists in the diagnosis record in plain text, as shown in Figure 1. Since most doctors use standard terms and link words in their description, training dataset of good quality can be obtained in this way.
Each image associated with a patient is converted into a bag through normalized cut and then a feature extraction method combined with waveform transformation and graph representation. For normalized cut, the number of regions must be set manually. In our evaluation we set = 11 which means an image would be converted into a bag consisting of 11 instances. A further discussion on the setting strategy of is presented in the next section. Different images of the same patient are associated with the 15-ary binary of the patient. We denote the dataset generated through the above procedure as 1. For waveform transformation, each region should be divided into blocks of size 4 × 4 pixels. Blocks containing at least one black pixel would be discarded. For graph representation, the number of clusters is set to 5, assuming that there are 5 different tissues in each image on average. In node identification algorithm, circles containing less than 20 pixels would not be taken into account.
Since there are other compared methods that are not consistent with the MI setting, we generate another three data representations, namely, 2, 3, and 4, for these methods. Data representation 2 is based on the equalsized block cutting method proposed in [4]. We first cut each image into 4 × 4 blocks and apply a scale-invariant feature transform (SIFT) descriptor [29] to extract features and use histogram to express it as a feature vector. 2 is a bag-of-words [14] image representation which is widely used in image understanding. Dataset 3 is an equal-sized block MI sample representation proposed in [15]. The main procedure is similar to 2, but it directly regards each block with SIFT representation as an instance. Hence in 3 there are totally 30, 000 instances in each bag. Finally dataset 4 is a clustering based representation. It clusters equal-sized blocks represented in real value vector and regards each cluster as an instance. Details of this method can be found in [16]. Table 2 lists the above data representation and their consistent methods for comparison.
Note that these datasets are only different in their preprocessing steps. In Table 2 we can see that method 4 can be fed with 1 and 3, for 4 is a MIML learning algorithm naturally consistent with MI data representation. However 3 cannot be fed to 1 because the idea of 1 is to regard each visual disjoint region instead of equal-sized block, as an instance. Different definitions of instance are originated from the difference of underlying idea of the problem. A single  Evaluate the number of misclassified label pairs one-error Evaluate the portion that a label of highest probability is not a correct label V Evaluate the average distance to go down to find the proper label for a given image Evaluate the average fraction of label pair that are misordered in the ranking list block may not contain medically acceptable features, which is not consistent with our MI framework.

Evaluation Criteria.
We adopt five different criteria to evaluate the performance of the proposed method and the compared methods. The first is accuracy, a zero-one loss function evaluating whether a single term is correctly annotated. It can be applied to evaluate the performance of methods that annotate only one term each time. Since the proposed method is a MIML one, it can be regarded as a multilabel learner. Several evaluation criteria have been proposed in multilabel learning and MIML learning study [30]. Introducing such criteria is necessary for our evaluation. Formal definition of the four multilabel evaluation criteria can be found in [30]. Table 3 lists five criteria used in our evaluation. Table 2, we use the same setting for evaluation. The evaluation is launched through a supervised learning manner. The whole dataset (with 12,700 images) is divided into training set and test set at a ratio 3 : 7. To avoid learning bias, the occurrence ratios of the concerning terms in Table 1 were kept the same as the training set. For method 1, we use a modified GPMIL and RVM implementation which were originally proposed by Kim et al. [31] and Tipping [23]. The first evaluation focuses on the annotation accuracy. Recall that we have 15 concerning annotation terms. Table 4 gives the results of annotating each term by different methods in Table 2.
The second evaluation focuses on the performance of annotating several terms simultaneously. Note that, in previous part of evaluation, accuracy of annotation was evaluated term by term; hence the overall accuracy of annotating all concerning terms may not be as high as the individual ones. We adopt four criteria listed in Table 3 to show the performance of annotation of all terms at the same time. Some criteria rely on the ranking of terms. We can get a natural ranking for the proposed method since it gives the probabilities for all terms. For other methods to be compared in our evaluation, we use the ranking strategy similar to [30]. Note that methods 2, 3, and 5 are not multilabel classifiers. Hence we only compare 1, 4 with 1 and 4 with 3. Figure 8 shows the performance evaluated by the above four criteria.
According to Table 3, the smaller results of the four criteria indicate the better performance. From Figure 8, it can be seen that method 1 achieved best performance compared to other methods in a multilabel classification setting at different training data ratios. For method 4, different data representations 1 and 3 lead to different performances. It can be seen that 1 is better than 3 in most cases. Since the intuition of 1 and 3 is totally different, it may be concluded that the representation 1 is more consistent with the term set and the models.
Finally we evaluate the sparsity of the proposed model. We vary the ratios between training data and test data and plot them with the nonzero-weighted basic learners after RVM procedure. In this case the set of basic learners contains 200 learners; that is, = 200. Figure 9 shows the result.
From Figure 9 we can see that RVM procedure can prune off about 2/3 learners, which yields a sparse ensemble learner. Figure 10 shows the corresponding annotation accuracy of different training set sizes. It can be seen that large training set would lead to high accuracy. Figure 9 indicates that the number of nonzero-weighted learners is stable at different training set sizes. The performance of the proposed method obeys the basic principle of machine learning; that is, more training data means model of high accuracy. For illustration, Figure 10 shows the relationship between accuracy and the size of training set for terms 1, 6, and 9.

3.2.
Discussions. Some important issues are worth addressing here. First, we must answer why MIML rather than MI framework is consistent with our task. MIML learning problem can be decomposed into several MI learning problems if we assume labels are independent of each other. When coming to our annotation problem, it is observed that there are correlations between annotation terms, including the cooccurence of some terms or the absence of other terms. Furthermore, some annotation terms may appear at the same time for some diseases. To capture the correlations mentioned above, MI learning framework which regards each annotation term independently is not sufficient. However, MIML learning framework is able to capture the relationship between annotation terms, as well as regions, which is superior to MI framework.
Second, our proposed regions generating method is based on normalized cut, which generates visual disjoint regions for a given image. The number of regions generated by normalized cut must be manually set. A small would lead to large regions that may contain different terms. A large would lead to fragment regions associated with the same term, as shown in Figure 11. However, in either case, MIML learning framework works according to the standard MI assumption [21,32]. The former case is equivalent to an instance corresponding to more than one term. The latter case is equivalent to several instances corresponding to the  same term. Though the quantity of would not affect the effectiveness of MIML, too small would affect the effect of feature extraction. A region contains different terms cannot be expressed as a real feature vector distinguishing between each term at the same time. Hence, in our work, we use a relative large according to medical experience to avoid a region containing more than one term and too much fragments. Third, a Bayesian model can generate probability for each concerning annotation term, which makes it available to build a more powerful model for automated skin disease diagnosis. Annotation terms can be regarded as latent variables between skin biopsy images and diseases, meaning that ( | ) = ∑ ∈ ( | ) ( | ) for independent and identically distributed (i.i.d.) terms, where , , , respectively, stand for diseases, images, a certain term, and the set of terms. And for non-i.i.d. terms, we can separate the terms into dependent term groups and apply almost the same equation as in the i.i.d. case. The method proposed in this paper can effectively evaluate ( | ), and ( | ) can be obtained directly from clinical experience. Hence, it is meaningful in CAD system design and implementation.
Finally, we discuss the multi-instance assumption implied in this work. We use the standard MI assumption [21] when considering the relationship between regions and terms. The standard MI assumption does not directly consider the impact of the number of regions and the relationship between regions to the terms. From clinical observation, most annotation terms can mainly be determined by a single region if the generated regions are not too small. Large region may contain more than one term, but it is also consistent with the standard MI assumption and this can be solved due to the power of MIML models. Though our proposed MIML model in fact considers such relationship, a simple assumption of the problem may lead to simple model.

Conclusions
In this paper we propose a MIML framework for skin biopsy image annotation. We adopt a famous graph cutting algorithm named normalized cut to transfer a biopsy image into a MI sample, in which each region is regarded as an instance. To effectively express features of biopsy images, each region is expressed as a 9-ary real vector. To reduce the model complexity and training time, we propose a novel sparse Bayesian MIML learning model, which applies a RVM-like algorithm to obtain a sparse weighted combination for a set of basic learners. We also make use of the well-studied Bayesian MIML learner as basic learners. Evaluation of a real clinical dataset shows that the proposed model can achieve good performance and reach a medical acceptable result. We have achieved an annotation accuracy up to 85% in our evaluation dataset.
The proposed annotation framework directly models doctor's experience of annotation biopsy images. Different from previous work, it is explicable since it can give the correspondence between local visual disjoint regions and the terms associated with them. Future work will focus on studying the relationship between biopsy images and the final diagnosis given the annotation term set as latent variables. And the feature fusion algorithm towards an effective feature representation is another research direction.