In the multiple instance learning (MIL) framework, an object is represented by a set of instances referred to as a bag. A bag is assigned a positive class label if it contains at least one positive instance; otherwise, it is labeled negative. The task of MIL is therefore to learn a classifier at the bag level rather than at the instance level, and traditional supervised learning approaches cannot be applied directly in this setting. In this study, we represent each bag by a vector of its dissimilarities to the other bags in the training dataset and propose a multiple instance learning based Twin Support Vector Machine (MIL-TWSVM) classifier. We use several ways to represent the dissimilarity between two bags and perform a comparative analysis of them. The experimental results on ten benchmark MIL datasets demonstrate that the proposed MIL-TWSVM classifier is computationally inexpensive and competitive with state-of-the-art approaches. The significance of the experimental results has been tested using the Friedman statistic and the Nemenyi post hoc test.
1. Introduction
Standard pattern recognition problems assume that an object is represented by a single feature vector containing sufficient information for its recognition. However, some complex real-world objects are difficult to represent with a single feature vector; that is, a single feature vector is not sufficient for their separability. Examples include a document with several paragraphs, an image containing many regions with different characteristics, and a drug molecule with various conformations. Traditional supervised learning techniques handle such problems by reducing the complex object to a single feature vector. This reduction may lose significant information, which degrades the performance of supervised learning techniques. A set of feature vectors, or multiple instance representation, can be used for a better understanding of a complex object [1, 2] and preserves more information about it. MIL is a variation of supervised learning in which a classifier is trained on sets of instances, known as bags, instead of individual instances. The objective of MIL approaches is to predict the class label of a bag. A bag may have a varying number of instances and may carry a positive or negative class label: a positive label is assigned to a bag if it contains at least one positive instance, while a negative label is assigned when all of its instances are negative [1, 3–5]. Figure 1 illustrates the frameworks of single instance learning and multiple instance learning.
Figure 1: Learning frameworks. (a) Single instance learning. (b) Multiple instance learning.
From Figure 1, it can be observed that in single instance learning each object is represented by a single feature vector, or instance, and the classifier learns at the instance level by assigning a class label to each instance individually. In the MIL framework, however, an object is represented by a set of feature vectors, or a bag, and the classifier is trained at the bag level and predicts the class label of a bag instead of an instance.
The term MIL was first used for the drug activity prediction problem (musk odor prediction) [1]. It has since been widely used by researchers to solve various real world problems such as image annotation [6–9], document categorization [6, 10, 11], object detection [12, 13], human action recognition [14], visual tracking [15–18], and spam filtering [19], among many others. Several MIL approaches have been proposed, which can be broadly categorized into two groups. The first category, known as bag-based methods, works only with the bag labels, without any knowledge of individual instance labels. Bag labels can be predicted by converting a bag into a single instance representation and applying supervised algorithms, or by defining kernels or distances between bags. MIL approaches belonging to the first category include those of Chen et al. [7], Sørensen et al. [20], and Cheplygina et al. [21], which use dissimilarity measures to represent a bag with a derived feature vector. Wang and Zucker [22] defined nearest neighbors among bags, Gärtner et al. [23] and Wang et al. [24] determined kernels between bags, Zhou et al. [10] generated a graph from the instances of a bag, and Zhang et al. [25] incorporated structure information between bags. The second category, known as instance-based methods, focuses on instance labels, and a bag label is determined by combining the classifications of its instances. The axis-parallel rectangle method [1], Diverse Density [2], and its variation [26] are examples of instance-based approaches. Bag-based methods are widely used as they have shown better performance on a wide range of MIL datasets. Therefore, this study focuses on the first category and extends the recently proposed Twin Support Vector Machine (TWSVM) classifier [27] to multiple instance learning scenarios by obtaining summarized information about each bag using different dissimilarity measures.
In recent years, many nonparallel hyperplane Support Vector Machine (SVM) classifiers have been proposed for binary classification [27–29]. For example, Mangasarian and Wild proposed the Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM), the first nonparallel hyperplane classifier, which aims to find a pair of nonparallel hyperplanes such that each hyperplane is nearest to one of the two classes and as far as possible from the other [30]. GEPSVM shows excellent performance on several benchmark datasets, especially on the "Cross-Planes" dataset. Later, utilizing the concepts of traditional SVM and GEPSVM, Jayadeva et al. proposed a novel nonparallel hyperplane binary classifier named TWSVM [27]. The aim of the TWSVM classifier is to generate two nonparallel hyperplanes such that each hyperplane lies in close affinity to one of the two classes while remaining distant from the data instances of the other class. For this purpose, it solves two SVM-type Quadratic Programming Problems (QPPs), whereas GEPSVM solves two generalized eigenvalue problems. Since TWSVM solves two smaller QPPs as opposed to a single complex QPP, its training is approximately four times faster than that of standard SVM. TWSVM has shown its superiority over other existing machine learning approaches on several benchmark datasets. Therefore, in this study, we extend TWSVM to multiple instance learning scenarios. This paper proposes a bag dissimilarity based multiple instance learning TWSVM (MIL-TWSVM) classifier. We define the dissimilarity between bags using several approaches. The proposed classifier is trained with summarized information about the instances in each bag, where each bag is represented by a feature vector containing the dissimilarity scores of the bag with respect to the other bags in the training set. The experiments have been performed on ten MIL benchmark datasets.
The results of the proposed approach have been compared with several existing MIL approaches, such as Diverse Density (DD) [2], Expectation-Maximization Diverse Density (EMDD) [26], Multi-Instance Logistic Regression (MILR) [31], Citation k-NN [22], and Multi-Instance Support Vector Machine (MISVM) [6]. The effectiveness of the proposed approach has also been analyzed using the Friedman average rank hypothesis test [32, 33]. The statistical inferences are made from the observed differences in predictive accuracy, and a modified version of the Demšar significance diagram is used to display the output of the Friedman test.
The rest of the paper is organized as follows. Section 2 provides a brief overview of multiple instance learning approaches and their applications in the real world. Section 3 includes the formulation of Twin Support Vector Machine classifier. Section 4 describes different approaches used to measure the dissimilarity between bags. The experimental results are discussed in Section 5 and finally the conclusion is drawn in Section 6.
2. Overview of Multiple Instance Learning Approaches and Their Applications
In Multiple Instance Learning (MIL), a bag is used to represent an object as follows:

(1)  B_i = { x_il | l = 1, 2, ..., n_i },  x_il ∈ R^d,

where n_i is the number of instances (feature vectors) in bag B_i and R^d is the d-dimensional feature space. Suppose the training dataset contains N bags; it is then represented as

(2)  T = { (B_i, y_i) | i = 1, ..., N },

where y_i ∈ {+1, -1} is the class label of each bag. A bag is labeled +1 if and only if it contains at least one positive instance; otherwise, it is labeled -1. The objective of an MIL problem is to learn a model which can determine the class label of an unseen bag. MIL approaches have been widely used by researchers to solve many real world problems, and drug activity recognition is one of the most popular applications. In this problem, for a given chemical molecule, the system must decide whether it is useful for drug design. A good drug has the characteristic that it is strongly bound to a target "binding site." A molecule can adopt multiple conformations, or shapes, and only one or a few of them bind well with the target protein or binding site. Dietterich et al. modeled drug activity recognition in the MIL framework [1]. They predicted whether a new molecule was suitable for drug design by analyzing a set of known molecules, developing three axis-parallel rectangle (APR) algorithms in which combinations of extracted molecule features were used to determine the axis-parallel rectangles. Zhao et al. proposed an MIL approach based on joint instance and feature selection for drug activity prediction [34]. They focused on reducing irrelevant and redundant features in order to improve the interpretability of the drug activity recognition model.
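As a concrete illustration of this representation, a MIL training set of the form given by (1) and (2) might be stored as follows. This is a minimal sketch whose variable names are our own, not part of the original work.

```python
import numpy as np

# A bag B_i is a set of n_i instances in R^d, stored as an (n_i, d) array.
# A MIL training set T pairs each bag with a single bag-level label y_i in {+1, -1}.
bag_1 = np.array([[0.2, 1.1], [0.5, 0.9], [0.4, 1.3]])                  # n_1 = 3, d = 2
bag_2 = np.array([[3.0, 3.2], [2.8, 3.1]])                              # n_2 = 2
bag_3 = np.array([[0.1, 1.0], [2.9, 3.0], [0.3, 1.2], [0.6, 0.8]])      # n_3 = 4

T = [(bag_1, +1), (bag_2, -1), (bag_3, +1)]   # bags may differ in size

# Under the standard MIL assumption, a bag is positive iff at least one of
# its instances is positive; with known instance labels this reduces to:
def bag_label(instance_labels):
    """Combine instance labels into a bag label (standard MIL assumption)."""
    return +1 if any(l == +1 for l in instance_labels) else -1
```

Note that only the bag labels in T are observed during training; the instance labels used by `bag_label` are hidden in the MIL setting.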
Diverse Density (DD) has been proposed to find a region in the feature space which contains at least one instance from each positive bag and no instances from negative bags [2]. The DD of a given point p is defined as the ratio of the number of positive bags which have instances near this point to the sum of the distances of the negative instances from p. The point p∗ at which DD is maximized corresponds to the target concept. DD suffers from local optima, and a good solution may require many restarts. The Expectation-Maximization Diverse Density (EMDD) algorithm has been proposed by Zhang and Goldman to address this problem [26]. They combined the expectation-maximization approach with DD, escaping local optima by iteratively updating the previous target point. The expectation step finds the most positive instance in each bag given an initial guess for the target point p; the maximization step then searches for a new point by maximizing DD on the selected most positive instances. These steps are repeated until the algorithm converges. EMDD performs well on a variety of MIL problems, but it is also computationally intensive. Several regular supervised classifiers have also been extended to MIL scenarios, for example, Citation k-Nearest Neighbor (Citation k-NN), Bayesian k-NN, ID3-MI, and multiple instance learning SVM. Citation k-NN is an extension of k-NN in which a bag is labeled by analyzing both its neighboring bags and the bags that consider the concerned bag a neighbor [22]. Citation k-NN uses a different distance metric (the minimal Hausdorff distance) in which the focus is shifted from instances to bags; that is, distance is measured between bags instead of between instances (see Figure 2). Citation k-NN has shown good performance on the Musk datasets. Recently, a variant of Citation k-NN has been proposed by Zhou et al. for web mining tasks, in which the minimal Hausdorff distance has been modified for text features [35]. ID3-MI is a decision tree algorithm which uses a multi-instance entropy criterion to split the tree nodes [36].
Figure 2: Distance between different bags.
SVM has been extended to the MIL scenario by Andrews et al. [6]. They used two approaches for the extension. In the first, traditional SVM is extended to MIL by treating instance labels as hidden variables constrained by the class labels of the bags. In the second, the objective is to maximize the bag margin directly. MIL Boost is another MIL approach, in which the instance weights are updated in each boosting round [12]; it uses the Noisy-OR rule to determine bag labels from the instance labels. Logistic Regression and Neural Networks have also been extended to the MIL framework. Fu and Robles-Kelly extended Logistic Regression (LR) to MIL problem domains and combined L1 and L2 regularization methods [37]. Xu and Frank also upgraded single instance LR to multi-instance data and showed its effectiveness on artificial and Musk drug activity prediction datasets [31]; they followed several assumptions to form the bag-level probability from instance-level class probabilities. Ramon and De Raedt explored the utility of Neural Networks (NN) for MIL due to their ability to learn automatically from examples [38]. In another work, Zhou and Zhang proposed BP-MLP, an extension of NN to MIL [39], in which the traditional BP algorithm is extended using a global error function defined at the bag level rather than at the instance level. Image classification and retrieval is another significant application of MIL, in which a given image is to be classified into a target category on the basis of its visual content [40–42]. In MIL, an image can be viewed as a bag of local image patches and can be labeled as positive or negative.
A positive image contains a set of image patches, or instances, at least one of which corresponds to the concept of interest to the user, while an image is considered negative if none of its patches does. For example, in beach scene classification the target class is beach, and the objective is to recognize whether an image belongs to the beach scene using its different regions or contents (see Figure 3). In this case, any visual content that displays a beach may be considered a positive image, while negative images show other visual content.
Figure 3: Classification of a scene image into beach and not beach scene.
Xu proposed an MIL extension of Neural Networks for the retrieval and classification of images [43]. Maron and Ratan applied the DD MIL approach to natural scene image classification using different kinds of bag generators [44]. A bag generator treats each image as a bag and its various subregions as instances. They performed experiments on the COREL photo library. Cheng et al. proposed BP-MLP and BP-SVM approaches for automatic image categorization in the multiple instance learning scenario and performed experiments on 2000 images obtained from the COREL repository [45]. They extracted frequent patterns from each image category and embedded an image bag into a multidimensional data point, which is useful for characterizing the similarity between the image and every common pattern of an image category. Gondra and Xu proposed a Relevance Feedback (RF) learning based Content Based Image Retrieval (CBIR) framework in the multiple instance learning scenario [8]. In another work, Pao et al. proposed an EMDD based MIL method for image classification [46]. Sener and Ikizler-Cinbis proposed an ensemble of multiple instance learning approaches for the problem of image reranking [47]. They constructed bags using three different approaches: sliding window, dynamic, and dynamic-sliding methods. The constructed bags were then used to develop multi-instance classifiers based on the multiple instance learning with instance selection (MILES) algorithm, and an image's rank was obtained by combining the decision scores of the MIL classifiers. Li and Liu used graph based MIL with instance weighting for image retrieval [48]. Different weights were assigned to each region in positive images on the basis of the learning results, and a rank was then calculated for each image. Feng et al. proposed a multi-instance semisupervised learning approach based on hierarchical sparse representation for image categorization [49].
They solved the instance confidence value identification problem under the framework of instance-level sparse representation. Several other works have also addressed image categorization in the MIL framework. Xu et al. combined deep feature learning with multiple instance learning for colon cancer classification based on histopathology images [50]. A deep learning network obtains high level features from low level features. They proposed a deep learning based system with a set of linear filters in the encoder and decoder and used the last hidden layer of the network for fully supervised feature learning, as it represents more intrinsic features than the lower layers. Wu et al. extended deep learning to the multiple instance learning framework for image annotation [51]. They used a deep convolutional neural network containing five convolutional layers, followed by a pooling layer and three fully connected layers, for learning visual representations with multiple instance learning; the last hidden layer was redesigned for multiple instance learning. Kotzias et al. also combined deep learning and multi-instance learning for knowledge transfer [52]. MIL has also been utilized in disease diagnosis through the analysis of medical images. Ding et al. treated breast ultrasound image classification as a multiple instance learning task and proposed an SVM based MIL method which classifies tumors into benign and malignant [53]. They used a self-organizing map (SOM) to map the instance space into a concept space and constructed the bag feature vector from the distribution of the instances of each bag. Li et al. proposed a novel computer aided diagnosis scheme for recognizing the tumor invasion depth of gastric cancer [54].
They extracted both bag-level and instance-level features and applied an improved Citation k-NN algorithm to identify gastric tumor invasion depth. Tong et al. used an MIL method for the detection of Alzheimer's disease (AD) and its prodromal stage, mild cognitive impairment (MCI) [55]. They built a graph for each image to identify the relationships among patches and performed experiments on 834 MRI images taken from the ADNI study. In another work, Quellec et al. proposed an MIL framework for diabetic retinopathy screening [56]. Text categorization is another popular application of MIL. Wang et al. [11] proposed a novel instance specific distance method for MIL text categorization, deriving their data from the Reuters-21578 collection with 2000 bags and 243 features. He and Wang investigated text categorization from the multiple instance view, in which each text is considered a bag and each of its sentences an instance [57]; they developed an MIL approach for Chinese text classification using k-NN. MIL has also been applied to web mining, or the web index recommendation problem, in which each web page is considered a bag and each of its linked pages an instance. The Fretcit k-NN algorithm [12], based on the minimal Hausdorff dissimilarity measure, determines the class label of an unseen bag by utilizing both references and citers. MIL approaches have also made significant contributions to visual tracking [15, 17, 58] and real time video event detection [59].
3. Twin Support Vector Machine
Twin Support Vector Machine is a binary classification technique that classifies data instances by constructing two nonparallel hyperplanes instead of the single hyperplane of traditional Support Vector Machine. It obtains the two nonparallel hyperplanes by solving two QPPs of smaller size, as compared to the single complex QPP solved by traditional SVM. TWSVM generates a hyperplane for each class in such a way that the data instances of each class lie in close affinity to their corresponding hyperplane and as far as possible from the other hyperplane. The effectiveness of TWSVM over other existing classification approaches has been validated on various benchmark datasets. TWSVM has good generalization ability and fast computational speed, due to which it has been applied to several real life applications such as intrusion detection [60, 61], activity recognition [62], image denoising [63], emotion recognition [64], text classification [65], defect prediction [66, 67], disease diagnosis [68, 69], and speaker identification [70].

Consider a binary classification problem with m training instances. The training dataset for such a problem can be represented as

(3)  T = { (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) },

where x_i ∈ R^n, i = 1, 2, ..., m, are the input data instances in the n-dimensional feature space and y_i ∈ {+1, -1} are the corresponding class labels. Let the matrices A_+1 ∈ R^(m1×n) and A_-1 ∈ R^(m2×n) comprise the data instances of class +1 and class -1, respectively. TWSVM solves the following two QPPs:

(4)  min_{w_+1, b_+1, ξ}  (1/2) ‖A_+1 w_+1 + e_+1 b_+1‖² + c_1 e_-1^T ξ
     s.t.  -(A_-1 w_+1 + e_-1 b_+1) + ξ ≥ e_-1,  ξ ≥ 0,

(5)  min_{w_-1, b_-1, η}  (1/2) ‖A_-1 w_-1 + e_-1 b_-1‖² + c_2 e_+1^T η
     s.t.  (A_+1 w_-1 + e_+1 b_-1) + η ≥ e_+1,  η ≥ 0,

and seeks the following two nonparallel hyperplanes in R^n:

(6)  x^T w_+1 + b_+1 = 0,   x^T w_-1 + b_-1 = 0.

Here, w_+1 and w_-1 are the normal vectors to the hyperplanes; b_+1 and b_-1 are the bias terms; e_+1 ∈ R^m1 and e_-1 ∈ R^m2 are vectors of ones of appropriate dimensions; and c_1 and c_2 are two positive trade-off constants.
ξ ∈ R^m2 and η ∈ R^m1 are the slack variables associated with class -1 and class +1, respectively. The first term of (4) or (5) is the sum of squared distances of the data instances from their corresponding hyperplane; minimizing it keeps the hyperplane close to the data instances of class +1 or class -1. The second term of (4) or (5) penalizes the misclassified data instances of the other class. The constraints require the hyperplane to be at least unit distance from the data instances of the other class, and the slack variables measure the error wherever the hyperplane is closer than this unit distance. In this way, each hyperplane is kept close to the data instances of its own class and as far as possible from those of the other class. The Lagrangian corresponding to (4) is given as

(7)  L(w_+1, b_+1, ξ, α, β) = (1/2) ‖A_+1 w_+1 + e_+1 b_+1‖² + c_1 e_-1^T ξ - α^T (-(A_-1 w_+1 + e_-1 b_+1) + ξ - e_-1) - β^T ξ,

where α = (α_1, α_2, ..., α_m2)^T and β = (β_1, β_2, ..., β_m2)^T are two vectors of Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) conditions are given by

(8)   A_+1^T (A_+1 w_+1 + e_+1 b_+1) + A_-1^T α = 0,
(9)   e_+1^T (A_+1 w_+1 + e_+1 b_+1) + e_-1^T α = 0,
(10)  c_1 e_-1 - α - β = 0,
(11)  -(A_-1 w_+1 + e_-1 b_+1) + ξ ≥ e_-1,  ξ ≥ 0,
(12)  α^T (-(A_-1 w_+1 + e_-1 b_+1) + ξ - e_-1) = 0,  β^T ξ = 0,
(13)  α ≥ 0,  β ≥ 0.

Since β ≥ 0, from (10) we can determine

(14)  0 ≤ α ≤ c_1.

Equations (8) and (9) lead to

(15)  [A_+1 e_+1]^T [A_+1 e_+1] [w_+1; b_+1] + [A_-1 e_-1]^T α = 0.

Let B_+1 = [A_+1 e_+1], B_-1 = [A_-1 e_-1], and u_+1 = [w_+1; b_+1]. The above equation becomes

(16)  B_+1^T B_+1 u_+1 + B_-1^T α = 0,  or  u_+1 = -(B_+1^T B_+1)^(-1) B_-1^T α.

In a similar manner,

(17)  u_-1 = (B_-1^T B_-1)^(-1) B_+1^T γ.

From (16) and (17), it is clear that obtaining the hyperplane parameters requires the inverses of the matrices B_+1^T B_+1 and B_-1^T B_-1. These matrices may be ill-conditioned, making their inverses difficult to compute. To avoid this situation, regularization terms c_3 I and c_4 I are added as follows:

(18)  u_+1 = -(B_+1^T B_+1 + c_3 I)^(-1) B_-1^T α,
      u_-1 = (B_-1^T B_-1 + c_4 I)^(-1) B_+1^T γ,

where c_3, c_4 > 0 are user defined parameters with small values and I is an identity matrix of appropriate dimension.
The Wolfe duals of (4) and (5) can be written as

(19)  max_α  e_-1^T α - (1/2) α^T B_-1 (B_+1^T B_+1 + c_3 I)^(-1) B_-1^T α,   s.t. 0 ≤ α ≤ c_1,
      max_γ  e_+1^T γ - (1/2) γ^T B_+1 (B_-1^T B_-1 + c_4 I)^(-1) B_+1^T γ,   s.t. 0 ≤ γ ≤ c_2.

Solving these dual problems yields the Lagrange multipliers, from which the hyperplane parameters are obtained and a hyperplane is constructed for each class using (6). Class +1 or -1 is assigned to a new data instance depending on its closeness to the two nonparallel hyperplanes. TWSVM assigns the class label to an instance using the following decision function:

(20)  f(x) = argmin_{i = +1, -1}  |w_i · x + b_i| / ‖w_i‖,

where |·| denotes the absolute value. TWSVM has also been extended to the nonlinear case, where data instances are not separable by linear class boundaries. For this purpose, it uses the kernel trick to transform the data instances into a higher-dimensional feature space. Nonlinear TWSVM seeks the following two kernel-generated surfaces instead of planes:

(21)  K(x^T, D^T) μ_+1 + b_+1 = 0,   K(x^T, D^T) μ_-1 + b_-1 = 0,

where K is an arbitrary kernel function and D = [A_+1; A_-1] stacks the instances of both classes. The primal QPPs of nonlinear TWSVM corresponding to the kernel-generated surfaces (21) are given below:

(22)  min_{μ_+1, b_+1, ξ}  (1/2) ‖K(A_+1, D^T) μ_+1 + e_+1 b_+1‖² + c_1 e_-1^T ξ
      s.t.  -(K(A_-1, D^T) μ_+1 + e_-1 b_+1) + ξ ≥ e_-1,  ξ ≥ 0,

      min_{μ_-1, b_-1, η}  (1/2) ‖K(A_-1, D^T) μ_-1 + e_-1 b_-1‖² + c_2 e_+1^T η
      s.t.  (K(A_+1, D^T) μ_-1 + e_+1 b_-1) + η ≥ e_+1,  η ≥ 0.

Similar to (16) and (17), the parameters of the kernel-generated surfaces can be determined as

(23)  [μ_+1; b_+1] = -(G_+1^T G_+1)^(-1) G_-1^T α,
      [μ_-1; b_-1] = (G_-1^T G_-1)^(-1) G_+1^T γ,

where G_+1 = [K(A_+1, D^T) e_+1] and G_-1 = [K(A_-1, D^T) e_-1]. As in the linear case, regularization terms c_3 I and c_4 I are added to (23) to avoid ill-conditioned matrices, and a new data instance is labeled +1 or -1 in the same manner as in the linear case.
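The linear TWSVM training procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it replaces a proper QP solver with simple projected gradient ascent on the box-constrained duals, and all function names are our own.

```python
import numpy as np

def solve_box_qp(H, e, c, iters=2000):
    """Maximize e^T a - (1/2) a^T H a subject to 0 <= a <= c,
    via projected gradient ascent (a simple stand-in for a QP solver)."""
    a = np.zeros(len(e))
    step = 1.0 / (np.linalg.norm(H, 2) + 1e-9)   # 1 / spectral norm: safe step size
    for _ in range(iters):
        a = np.clip(a + step * (e - H @ a), 0.0, c)
    return a

def train_twsvm(A_pos, A_neg, c1=1.0, c2=1.0, c3=0.01, c4=0.01):
    """Linear TWSVM: one hyperplane per class, each close to its own class
    and (softly) at least unit distance from the other class."""
    Bp = np.hstack([A_pos, np.ones((len(A_pos), 1))])   # B_{+1} = [A_{+1} e]
    Bn = np.hstack([A_neg, np.ones((len(A_neg), 1))])   # B_{-1} = [A_{-1} e]
    Mp_inv = np.linalg.inv(Bp.T @ Bp + c3 * np.eye(Bp.shape[1]))
    Mn_inv = np.linalg.inv(Bn.T @ Bn + c4 * np.eye(Bn.shape[1]))
    # The two dual box-constrained QPs in alpha and gamma.
    alpha = solve_box_qp(Bn @ Mp_inv @ Bn.T, np.ones(len(Bn)), c1)
    gamma = solve_box_qp(Bp @ Mn_inv @ Bp.T, np.ones(len(Bp)), c2)
    u_pos = -Mp_inv @ Bn.T @ alpha    # [w_{+1}; b_{+1}]
    u_neg = Mn_inv @ Bp.T @ gamma     # [w_{-1}; b_{-1}]
    return u_pos, u_neg

def predict(x, u_pos, u_neg):
    """Assign the class whose hyperplane is nearer to x."""
    d_pos = abs(x @ u_pos[:-1] + u_pos[-1]) / np.linalg.norm(u_pos[:-1])
    d_neg = abs(x @ u_neg[:-1] + u_neg[-1]) / np.linalg.norm(u_neg[:-1])
    return +1 if d_pos <= d_neg else -1
```

In the proposed MIL-TWSVM, the rows of `A_pos` and `A_neg` would be the bag dissimilarity vectors described in the next section rather than raw instances.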
4. Bag Dissimilarity Representation
In the multiple instance learning case, a classifier works at the bag level rather than the instance level and takes a bag as input. Therefore, the objective of MIL is to develop a classifier which generates a decision function for a bag. In the proposed approach, each bag is represented by a vector of its dissimilarities to the other bags in the training set; these dissimilarities form its feature vector. If there are N bags in the training set, then the ith bag, containing n_i instances, can be represented as

(24)  v_i = ( d(B_i, B_1), d(B_i, B_2), ..., d(B_i, B_N) )^T.

Thus, each bag has a single feature vector representation, and the MIL problem can be treated as a regular supervised learning problem. The dissimilarity between two bags B_i and B_j can be measured in different ways, which fall into two main categories depending on how a bag is represented. If a bag is viewed as a point set in a high-dimensional feature space, the dissimilarity between two bags can be measured by a set distance. The following distance metrics have been used to calculate the dissimilarity between bags.
(a) Hausdorff Distance. The Hausdorff distance is one of the most popular distance metrics used for object recognition in computer vision. Two bags B_i and B_j are said to be close to each other if every instance of bag B_i is close to an instance in bag B_j. The dissimilarity between two bags B_i and B_j is defined as

(25)  d(B_i, B_j) = max{ d_dir(B_i, B_j), d_dir(B_j, B_i) }.

Here, d_dir(B_i, B_j) represents the directed distance between the two bags. In detail, given two bags B_i = {x_i1, x_i2, ..., x_in_i} and B_j = {x_j1, x_j2, ..., x_jn_j}, the directed distance from B_i to B_j is calculated as

(26)  d_dir(B_i, B_j) = max_k min_l ‖x_ik - x_jl‖²,

where ‖x_ik - x_jl‖² is the squared Euclidean distance between the instances x_ik and x_jl. The N-dimensional representation v_i of the ith bag is formed as a vector of such dissimilarities between the ith bag and all the other bags in the training set. Since d_dir is not symmetric, the final Hausdorff distance between two bags is symmetrized by taking the maximum of the two directed distances. The dissimilarity between two bags can also be defined by taking the minimum or the average of the squared Euclidean distances as follows:

(27)  d(B_i, B_j) = min_k min_l ‖x_ik - x_jl‖²,
(28)  d(B_i, B_j) = (1/n_i) Σ_{k=1}^{n_i} min_l ‖x_ik - x_jl‖².

Figure 4 shows the minimum Euclidean distances between instances of two bags. According to Figure 4(a), all the instances in bag B_i have the same closest instance in bag B_j, while the instances in bag B_j have two different closest instances in bag B_i; hence the minimum distance between instances of two bags is asymmetric, d_min(B_i, B_j) ≠ d_min(B_j, B_i). This distance can be symmetrized by again taking the minimum of these distances, so that d_minmin(B_i, B_j) = d_minmin(B_j, B_i). For the average distance given by (28), d_averagemin(B_i, B_j) ≠ d_averagemin(B_j, B_i) in general.
Figure 4: Minimum Euclidean distances between instances of two bags.
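The point-set distances (25)-(28) can be sketched in Python as follows. This is a minimal illustration with our own function names, using numpy broadcasting to compute the pairwise instance distances.

```python
import numpy as np

def pairwise_sq_dists(Bi, Bj):
    """Squared Euclidean distance between every instance pair of two bags."""
    diff = Bi[:, None, :] - Bj[None, :, :]
    return np.sum(diff ** 2, axis=2)                 # shape (n_i, n_j)

def d_dir(Bi, Bj):
    """Directed Hausdorff distance (26): worst best-match from Bi into Bj."""
    return pairwise_sq_dists(Bi, Bj).min(axis=1).max()

def d_hausdorff(Bi, Bj):
    """Symmetric Hausdorff distance (25)."""
    return max(d_dir(Bi, Bj), d_dir(Bj, Bi))

def d_min(Bi, Bj):
    """Minimum distance (27): closest instance pair across the two bags."""
    return pairwise_sq_dists(Bi, Bj).min()

def d_avg_min(Bi, Bj):
    """Average minimum distance (28); note this version is asymmetric."""
    return pairwise_sq_dists(Bi, Bj).min(axis=1).mean()
```

For example, with Bi = {(0,0), (1,0)} and Bj = {(0,0), (3,0)}, d_dir(Bi, Bj) = 1 but d_dir(Bj, Bi) = 4, so the symmetrized Hausdorff distance is 4.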
(b) City Block Distance. The dissimilarity between two bags can be defined using the City Block (L1) distance metric as follows:

(29)  d(B_i, B_j) = (1/n_i) Σ_{k=1}^{n_i} min_l ‖x_ik - x_jl‖_1.
(c) Chi-Squared Distance. The chi-squared distance is a weighted Euclidean distance which measures the dissimilarity between two bags as follows:

(30)  d(B_i, B_j) = (1/n_i) Σ_{k=1}^{n_i} min_l Σ_f (x_ik,f - x_jl,f)² / (2 (x_ik,f + x_jl,f)),

where the inner sum runs over the feature components f of the instances. If a bag is instead viewed as a probability distribution in the instance space, the dissimilarity between two bags can be defined through a distribution distance. However, it is difficult to determine a probability density function in a high-dimensional feature space, and estimating the true distribution of the instances is computationally expensive. Therefore, the instance distributions are approximated, and the distance is measured between the approximated distributions using the following two distance metrics.
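The City Block and chi-squared bag dissimilarities above might be implemented as follows. This is a minimal sketch with our own function names; the chi-squared version assumes nonnegative features (for example, histogram-like data) and adds a small epsilon to avoid division by zero, which is our choice rather than part of the original formulation.

```python
import numpy as np

def d_cityblock(Bi, Bj):
    """City Block dissimilarity (29): for each instance of Bi, the L1 distance
    to its nearest instance in Bj, averaged over the instances of Bi."""
    l1 = np.abs(Bi[:, None, :] - Bj[None, :, :]).sum(axis=2)   # (n_i, n_j)
    return l1.min(axis=1).mean()

def d_chi2(Bi, Bj, eps=1e-12):
    """Chi-squared dissimilarity (30), read here as an elementwise weighted
    squared difference summed over features; assumes nonnegative features."""
    num = (Bi[:, None, :] - Bj[None, :, :]) ** 2
    den = 2.0 * (Bi[:, None, :] + Bj[None, :, :]) + eps        # eps: our guard
    chi2 = (num / den).sum(axis=2)                             # (n_i, n_j)
    return chi2.min(axis=1).mean()
```

Like the average minimum distance (28), both functions are asymmetric in their arguments, since the outer average runs only over the instances of Bi.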
(a) Earth Mover’s Distance (EMD). The Earth Mover’s distance [71] measures the minimum amount of work needed to transform one probability distribution B_i into another probability distribution B_j. Suppose each instance carries 1/n_i of the total probability mass in bag B_i of size n_i. The Earth Mover’s distance between two bags is computed as

(31)  d_EMD(B_i, B_j) = min_{f_kl} Σ_{k,l} f_kl d(x_ik, x_jl),

where d(x_ik, x_jl) is the Euclidean distance and f_kl is the flow between instances k and l, subject to the additional constraints f_kl ≥ 0, Σ_k f_kl ≤ 1/n_j, Σ_l f_kl ≤ 1/n_i, and Σ_{k,l} f_kl = 1.
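The EMD in (31) is a small linear program over the flows f_kl. The following sketch, our own code rather than the authors' implementation, solves it with SciPy's linprog.

```python
import numpy as np
from scipy.optimize import linprog

def d_emd(Bi, Bj):
    """Earth Mover's distance (31) between two bags, each instance carrying
    uniform mass 1/n_i (resp. 1/n_j); solved as a small linear program."""
    ni, nj = len(Bi), len(Bj)
    # Ground distances d(x_ik, x_jl), flattened to match the flow vector f_kl.
    D = np.sqrt(((Bi[:, None, :] - Bj[None, :, :]) ** 2).sum(axis=2)).ravel()
    # Row capacities: sum_l f_kl <= 1/n_i ; column capacities: sum_k f_kl <= 1/n_j.
    A_ub = np.zeros((ni + nj, ni * nj))
    for k in range(ni):
        A_ub[k, k * nj:(k + 1) * nj] = 1.0
    for l in range(nj):
        A_ub[ni + l, l::nj] = 1.0
    b_ub = np.concatenate([np.full(ni, 1.0 / ni), np.full(nj, 1.0 / nj)])
    # The total flow must move all of the unit probability mass.
    res = linprog(D, A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, ni * nj)), b_eq=[1.0],
                  bounds=(0, None))
    return res.fun
```

For single-instance bags this reduces to the plain Euclidean distance between the two instances, which is a convenient sanity check.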
(b) Mahalanobis Distance. Each bag B_i is approximated by a single Gaussian distribution with mean μ_i and covariance matrix Σ_i. The dissimilarity between two bags B_i and B_j under the Mahalanobis distance is defined as

(32)  d_MAHA(B_i, B_j) = (μ_i - μ_j)^T ( (1/2) Σ_i + (1/2) Σ_j )^(-1) (μ_i - μ_j).

In this way, we can calculate the dissimilarity score of a bag with respect to all the other bags. The new vector representation v_i of each bag acts as input to the TWSVM classifier, which now works at the bag level, f(v_i). The problem has thus been converted into a single instance binary classification problem in which each bag has either a +1 or a -1 class label. Figure 5 depicts an example of multiple instance learning data with four bags. Each bag contains a different number of instances: bags 1 and 3 have three instances each, bag 2 contains two instances, and bag 4 consists of four instances. A class label is associated with each bag rather than with individual instances, and traditional supervised learning approaches are not designed for this type of problem. A bag-level MIL-TWSVM classifier is therefore trained on the summarized data. During the testing phase, the same representation is computed for the query bag and the proposed classifier makes a decision for the bag on the basis of the minimum distance criterion.
Figure 5: Conversion of multiple instance learning data into single instance learning using bag dissimilarity score.
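The Mahalanobis bag distance (32) and the bag-to-vector conversion of (24) can be sketched as follows. This is a minimal illustration with our own names; the small ridge added to the pooled covariance is our own guard, since the sample covariance of a small bag may be singular.

```python
import numpy as np

def d_mahalanobis(Bi, Bj):
    """Gaussian approximation distance (32): each bag is summarized by its
    sample mean and covariance, then compared via a Mahalanobis-type form."""
    mu_i, mu_j = Bi.mean(axis=0), Bj.mean(axis=0)
    S = 0.5 * np.cov(Bi, rowvar=False) + 0.5 * np.cov(Bj, rowvar=False)
    S = S + 1e-6 * np.eye(S.shape[0])     # small ridge for a possibly singular S
    diff = mu_i - mu_j
    return float(diff @ np.linalg.inv(S) @ diff)

def bags_to_vectors(bags, dissimilarity):
    """Map each bag B_i to v_i = (d(B_i, B_1), ..., d(B_i, B_N)) as in (24),
    turning the MIL problem into a standard single-instance one."""
    return np.array([[dissimilarity(Bi, Bj) for Bj in bags] for Bi in bags])
```

The resulting N×N matrix of dissimilarity vectors, one row per bag, is exactly the summarized representation on which the bag-level MIL-TWSVM classifier is trained.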
5. Numerical Experiments
This section presents the experimental results of the proposed MIL-TWSVM classifier on ten benchmark MIL datasets. We analyze the performance of the proposed MIL-TWSVM classifier with different dissimilarity metrics. The results of MIL-TWSVM have been compared with several existing MIL approaches, such as Diverse Density (DD), Expectation-Maximization Diverse Density (EMDD), Multi-Instance Logistic Regression (MILR), Citation k-NN, and Multi-Instance Support Vector Machine (MISVM). All classifiers have been implemented in MATLAB 2012a on a Windows 7 system with an Intel Core i7 processor and 12 GB RAM. This section is divided into four subsections: the first describes the benchmark MIL datasets used in this study, the second analyzes the impact of parameters on the performance of the proposed classifier, and the experimental results are discussed and analyzed in the third and fourth subsections, respectively.
5.1. Dataset Description
In this study, the experiments have been performed on ten MIL benchmark datasets: Musk 1, Musk 2, Mutagenesis-atoms, Winter Wren, Brown Creeper, Elephant, Fox, Tiger, eastWest, and westEast. These datasets, available online at http://www.miproblems.org/, are widely adopted for the performance evaluation of new MIL approaches and represent four different categories of MIL problems. Their detailed description is given in Table 1.
Table 1: Multiple instance learning datasets.

| Dataset | +bags | −bags | Total bags | Number of attributes | Number of instances | Average bag size |
|---|---|---|---|---|---|---|
| Musk 1 | 47 | 45 | 92 | 167 | 476 | 5.17 |
| Musk 2 | 39 | 63 | 102 | 167 | 6598 | 64.69 |
| Mutagenesis-atoms | 125 | 63 | 188 | 11 | 1618 | 8.61 |
| Winter wren | 109 | 439 | 548 | 38 | 10232 | 18.67 |
| Brown creeper | 197 | 351 | 548 | 38 | 10232 | 18.67 |
| Elephant | 100 | 100 | 200 | 230 | 1391 | 6.96 |
| Fox | 100 | 100 | 200 | 230 | 1320 | 6.6 |
| Tiger | 100 | 100 | 200 | 230 | 1220 | 6.2 |
| eastWest | 10 | 10 | 20 | 25 | 213 | 10.65 |
| westEast | 10 | 10 | 20 | 25 | 213 | 10.65 |
Musk 1 and Musk 2 are two standard drug activity prediction benchmarks in which a bag represents one molecule and the different conformations (shapes) of that molecule are the instances of the bag. A human expert assigns each bag the class label "musk" or "non-musk." Musk 1 contains 92 bags and Musk 2 contains 102 bags; Musk 2 contains far more instances (molecule conformations) than Musk 1. The objective of MIL here is to predict whether a new molecule is "musk" or "non-musk." Mutagenesis-atoms also belongs to the drug activity prediction category and contains 125 positive and 63 negative bags. Brown Creeper and Winter Wren are two audio MIL datasets containing recordings of bird songs of different species. A bag is an audio fragment, labeled positive if the target species is heard in that fragment. Since birds of the same species have similar songs, each species corresponds to a different concept. Some species are often heard together, in which case fragments that are negative for one species can still help determine whether a fragment contains that species. Content-Based Image Retrieval (CBIR) is another of the most recognized applications of MIL, where the objective is to determine whether a given image is of interest to the user. An image corresponds to a bag, and its regions or patches are the instances; the class labels of individual instances are unknown. In this study, we use three image datasets: Elephant, Fox, and Tiger. The eastWest and westEast datasets belong to an ILP problem and were collected from the eastWest challenge.

The objective of this challenge is to predict whether a train is eastbound or westbound. A bag represents a train containing various cars (instances) of different shapes and sizes, with the load of each car providing its instance-level attributes. The challenge yields two MIL datasets, eastWest and westEast, because it is not clear whether an eastbound or a westbound train should be treated as the positive class: in eastWest, eastbound trains are the positive examples, while in westEast, westbound trains are positive.
5.2. Parameter Selection
This study uses the Gaussian kernel K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)) for the nonlinear case. The MIL-TWSVM classifier has four penalty parameters, c_1, c_2, c_3, and c_4, and an additional kernel parameter sigma (σ). The predictive performance of the classifier is affected by the choice of these parameters. This study uses grid search, one of the most widely used approaches for optimal parameter selection [27, 65, 72–74]. The penalty and kernel parameters are selected from the ranges c_1, c_2, c_3, c_4 ∈ {10^{-6}, …, 10^{3}} and σ ∈ {2^{-3}, …, 2^{6}}. The experiment is conducted using 10-fold cross-validation: the proposed classifier is trained with each (penalty, σ) pair in the Cartesian product of the two sets, each pair is evaluated by internal cross-validation on the training set (so multiple MIL-TWSVMs are trained per pair), and the setting that achieves the highest validation score is selected. We analyze the influence of these parameters on the performance of MIL-TWSVM on three datasets, Tiger, Fox, and Mutagenesis-atoms, as shown in Figures 6, 7, and 8. For the linear case, we set c_1 = c_2 and c_3 = c_4 to reduce the computational complexity and analyze their influence on the predictive performance of the linear classifier; for the nonlinear case, we set c_1 = c_2 = c_3 = c_4 and analyze the influence of these parameters and sigma on the nonlinear MIL-TWSVM classifier. For the Tiger dataset, the impact of the parameters is analyzed using the Max-Hausdorff dissimilarity measure, as shown in Figure 6. The linear MIL-TWSVM classifier obtains better performance with a low value of c_1 and a high value of c_3 (c_1 = 10^{-4}, c_3 = 10^{2}); its performance degrades sharply for low values of c_3. For the nonlinear case, MIL-TWSVM performs best with a high value of sigma and a low penalty (σ = 2^{3}, c_1 = 10^{-5}). On the other datasets, the Max-Hausdorff based MIL-TWSVM classifier performs well for different combinations of the penalty and sigma parameters.
Figure 6: Influence of parameters on the predictive accuracy of MIL-TWSVM (Max-Hausdorff) on the Tiger dataset: (a) linear; (b) nonlinear.

Figure 7: Influence of parameters on the predictive accuracy of MIL-TWSVM (Min-Hausdorff) on the Fox dataset: (a) linear; (b) nonlinear.

Figure 8: Influence of parameters on the predictive accuracy of MIL-TWSVM (EMD) on the Mutagenesis-atoms dataset: (a) linear; (b) nonlinear.
The impact of these parameters on the Fox and Mutagenesis-atoms datasets is analyzed using the Min-Hausdorff and EMD dissimilarity measures, respectively. On the Fox dataset, Min-Hausdorff based linear MIL-TWSVM performs best for a low value of c_1 and a high value of c_3 (c_1 = 10^{-3}, c_3 = 10^{2}), while the nonlinear classifier achieves its best accuracy with a high value of sigma and a low penalty (σ = 2^{3}, c_1 = 10^{-2}), as shown in Figure 7. For Mutagenesis-atoms, EMD-based linear MIL-TWSVM obtains its highest accuracy for low values of both c_1 and c_3 (c_1 = 10^{-3}, c_3 = 10^{-3}), and the nonlinear classifier for low values of sigma and the penalty parameter (σ = 2^{0}, c_1 = 10^{-4}), as shown in Figure 8. The proposed MIL-TWSVM classifier behaves differently on different datasets for each combination of parameters (penalty and sigma) and dissimilarity measures; appropriate parameter selection is therefore essential for good performance.
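The grid search with internal cross-validation described in this subsection can be sketched as follows. This is an illustrative Python sketch rather than the authors' MATLAB code; `train_eval` is a hypothetical callback that trains a MIL-TWSVM (or any classifier) on the training fold with the given parameters and returns its accuracy on the held-out fold.

```python
import itertools
import random

def cross_val_score(train_eval, params, X, y, k=10):
    # Plain k-fold cross-validation: average the accuracy returned by
    # train_eval over k held-out folds (deterministic shuffle for clarity).
    idx = list(range(len(X)))
    random.Random(0).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for f in folds:
        tr = [i for i in idx if i not in f]
        scores.append(train_eval(params,
                                 [X[i] for i in tr], [y[i] for i in tr],
                                 [X[i] for i in f], [y[i] for i in f]))
    return sum(scores) / k

def grid_search(train_eval, X, y):
    # Search the same grid as the paper: penalties in {10^-6, ..., 10^3}
    # and sigma in {2^-3, ..., 2^6}, keeping the (c1, sigma) pair with
    # the best internal cross-validation accuracy.
    penalties = [10.0 ** e for e in range(-6, 4)]
    sigmas = [2.0 ** e for e in range(-3, 7)]
    best = None
    for c1, s in itertools.product(penalties, sigmas):
        acc = cross_val_score(train_eval, (c1, s), X, y)
        if best is None or acc > best[0]:
            best = (acc, c1, s)
    return best  # (best accuracy, best c1, best sigma)
```

The same loop extends to the four-penalty linear case by enlarging the Cartesian product, which is exactly why the paper ties c_1 = c_2 and c_3 = c_4 to keep the search tractable.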
5.3. Results and Discussion
The predictive accuracies of the proposed MIL-TWSVM classifier with different dissimilarity measures on the ten benchmark datasets are shown in Tables 2 and 3 for the linear and nonlinear cases, respectively.
Table 2: Performance comparison of linear MIL-TWSVM with different dissimilarity scores. Each cell lists the selected penalty parameters (c1, c3) and the accuracy ± standard deviation (%).

| Dataset | Max-Hausdorff | Min-Hausdorff | Average-Hausdorff | City block | Chi-squared | Mahalanobis | EMD |
|---|---|---|---|---|---|---|---|
| Musk 1 | 10^{0}, 10^{0}; 95.56 ± 2.34 | 10^{−2}, 10^{−3}; 96.67 ± 5.09 | 10^{1}, 10^{1}; 77.78 ± 0.0 | 10^{0}, 10^{3}; 61.11 ± 7.85 | 10^{−2}, 10^{−2}; 64.48 ± 4.84 | 10^{−4}, 10^{−4}; 70.03 ± 4.58 | 10^{−2}, 10^{−2}; 75.58 ± 8.31 |
| Musk 2 | 10^{−3}, 10^{−4}; 93.02 ± 4.56 | 10^{−2}, 10^{−1}; 92.85 ± 4.77 | 10^{−3}, 10^{−1}; 79.84 ± 2.26 | 10^{−4}, 10^{−1}; 62.46 ± 4.67 | 10^{−2}, 10^{−3}; 66.67 ± 4.18 | 10^{−2}, 10^{−1}; 72.34 ± 4.28 | 10^{−1}, 10^{0}; 74.47 ± 7.23 |
| Mutagenesis-atoms | 10^{0}, 10^{−1}; 86.84 ± 1.52 | 10^{−1}, 10^{−3}; 87.04 ± 2.63 | 10^{−2}, 10^{−1}; 68.42 ± 0.0 | 10^{−1}, 10^{−1}; 68.42 ± 0.0 | 10^{−2}, 10^{−1}; 68.42 ± 0.0 | 10^{−2}, 10^{0}; 68.42 ± 0.0 | 10^{−3}, 10^{−3}; 87.36 ± 2.57 |
| Winter wren | 10^{−4}, 10^{−2}; 96.84 ± 2.86 | 10^{−1}, 10^{−3}; 97.47 ± 2.6 | 10^{−4}, 10^{−2}; 89.22 ± 5.88 | 10^{−3}, 10^{0}; 79.67 ± 6.56 | 10^{1}, 10^{−3}; 84.26 ± 4.02 | 10^{−4}, 10^{−3}; 76.89 ± 5.35 | 10^{−3}, 10^{−1}; 94.71 ± 4.53 |
| Brown creeper | 10^{−5}, 10^{−3}; 96.01 ± 3.66 | 10^{−3}, 10^{−4}; 95.88 ± 3.24 | 10^{−4}, 10^{−1}; 90.89 ± 4.22 | 10^{−1}, 10^{−3}; 76.38 ± 3.63 | 10^{−2}, 10^{−4}; 85.43 ± 5.83 | 10^{−2}, 10^{−3}; 82.67 ± 5.56 | 10^{−4}, 10^{0}; 93.45 ± 5.15 |
| Elephant | 10^{1}, 10^{−3}; 83.88 ± 3.01 | 10^{−5}, 10^{−1}; 85.56 ± 5.49 | 10^{0}, 10^{0}; 55 ± 4.08 | 10^{0}, 10^{0}; 50 ± 0.0 | 10^{−1}, 10^{−1}; 64.16 ± 13.07 | 10^{0}, 10^{0}; 50 ± 0.0 | 10^{−1}, 10^{−1}; 74 ± 5.83 |
| Fox | 10^{−4}, 10^{−5}; 58.33 ± 1.74 | 10^{−3}, 10^{2}; 67.22 ± 6.7 | 10^{−4}, 10^{1}; 54.44 ± 3.5 | 10^{−1}, 10^{−1}; 50 ± 0.0 | 10^{−2}, 10^{−3}; 55 ± 0.0 | 10^{−1}, 10^{0}; 51.5 ± 1.97 | 10^{−7}, 10^{2}; 60 ± 6.32 |
| Tiger | 10^{−4}, 10^{2}; 83.5 ± 2.45 | 10^{−6}, 10^{−2}; 86.25 ± 5.89 | 10^{−5}, 10^{−2}; 55 ± 7.28 | 10^{0}, 10^{−2}; 50 ± 0.0 | 10^{−3}, 10^{−2}; 51.5 ± 4.27 | 10^{−1}, 10^{−2}; 50 ± 0.0 | 10^{−6}, 10^{−5}; 74.5 ± 6.5 |
| eastWest | 10^{−5}, 10^{−1}; 60 ± 20 | 10^{−3}, 10^{0}; 80 ± 24.49 | 10^{−1}, 10^{0}; 80 ± 24.49 | 10^{−2}, 10^{−2}; 50 ± 0.0 | 10^{−1}, 10^{0}; 50 ± 0.0 | 10^{−4}, 10^{0}; 65 ± 22.9 | 10^{−3}, 10^{−1}; 80 ± 24.49 |
| westEast | 10^{0}, 10^{−2}; 75 ± 25 | 10^{0}, 10^{0}; 90 ± 20 | 10^{0}, 10^{−3}; 85 ± 22.9 | 10^{0}, 10^{0}; 50 ± 0.0 | 10^{−2}, 10^{0}; 50 ± 0.0 | 10^{−1}, 10^{−1}; 75 ± 25 | 10^{−2}, 10^{−1}; 75 ± 25 |
Table 3: Performance comparison of nonlinear MIL-TWSVM with different dissimilarity scores. Each cell lists the selected parameters (c1, σ) and the accuracy ± standard deviation (%).

| Dataset | Max-Hausdorff | Min-Hausdorff | Euclidean (average) | City block | Chi-squared | Mahalanobis | EMD |
|---|---|---|---|---|---|---|---|
| Musk 1 | 10^{−2}, 2^{3}; 55.57 ± 0.0 | 10^{0}, 2^{1}; 55.57 ± 0.0 | 10^{−2}, 2^{2}; 55.57 ± 0.0 | 10^{3}, 2^{2}; 55.57 ± 0.0 | 10^{−2}, 2^{1}; 78.89 ± 3.68 | 10^{−4}, 2^{3}; 71.14 ± 5.22 | 10^{−1}, 2^{0}; 55.57 ± 0.0 |
| Musk 2 | 10^{−2}, 2^{3}; 58.25 ± 3.45 | 10^{−3}, 2^{5}; 55.57 ± 0.0 | 10^{−3}, 2^{3}; 57.03 ± 2.4 | 10^{−3}, 2^{2}; 55.57 ± 0.0 | 10^{0}, 2^{1}; 77.78 ± 4.04 | 10^{−2}, 2^{4}; 77.70 ± 3.26 | 10^{−2}, 2^{0}; 55.57 ± 0.0 |
| Mutagenesis-atoms | 10^{−2}, 2^{1}; 85.52 ± 1.28 | 10^{−1}, 2^{4}; 81.58 ± 5.88 | 10^{−3}, 2^{3}; 68.42 ± 0.0 | 10^{−2}, 2^{3}; 68.42 ± 0.0 | 10^{−1}, 2^{5}; 68.42 ± 0.0 | 10^{−2}, 2^{3}; 68.42 ± 0.0 | 10^{−3}, 2^{0}; 89.47 ± 4.07 |
| Winter wren | 10^{−3}, 2^{3}; 97.56 ± 4.67 | 10^{−4}, 2^{4}; 98.02 ± 2.2 | 10^{−4}, 2^{3}; 91.10 ± 3.95 | 10^{−1}, 2^{5}; 82.86 ± 5.69 | 10^{−5}, 2^{6}; 79.92 ± 3.05 | 10^{−3}, 2^{3}; 87.18 ± 4.84 | 10^{−3}, 2^{4}; 95.49 ± 4.66 |
| Brown creeper | 10^{−3}, 2^{1}; 97.27 ± 3.84 | 10^{−5}, 2^{3}; 97.89 ± 3.71 | 10^{−3}, 2^{0}; 92.30 ± 4.08 | 10^{−5}, 2^{2}; 85.8 ± 4.53 | 10^{−3}, 2^{0}; 81.10 ± 3.26 | 10^{−3}, 2^{1}; 90.03 ± 3.35 | 10^{−4}, 2^{5}; 96.72 ± 3.24 |
| Elephant | 10^{−5}, 2^{3}; 83.62 ± 3.37 | 10^{−7}, 2^{5}; 83.33 ± 5.08 | 10^{−1}, 2^{0}; 50 ± 0.0 | 10^{0}, 2^{0}; 50 ± 0.0 | 10^{0}, 2^{3}; 55 ± 0.0 | 10^{0}, 2^{0}; 50 ± 0.0 | 10^{−1}, 2^{0}; 50 ± 0.0 |
| Fox | 10^{−7}, 2^{4}; 63.89 ± 1.01 | 10^{−2}, 2^{3}; 75 ± 4.49 | 10^{−5}, 2^{4}; 61.11 ± 6.98 | 10^{−5}, 2^{1}; 57.78 ± 7.85 | 10^{−5}, 2^{0}; 55 ± 0.0 | 10^{−4}, 2^{1}; 54 ± 4.76 | 10^{−3}, 2^{1}; 50 ± 0.0 |
| Tiger | 10^{−5}, 2^{3}; 85.63 ± 1.22 | 10^{−5}, 2^{3}; 83.13 ± 6.02 | 10^{1}, 2^{1}; 50 ± 0.0 | 10^{1}, 2^{1}; 50 ± 0.0 | 10^{−2}, 2^{2}; 50.5 ± 2.06 | 10^{0}, 2^{1}; 77 ± 8.12 | 10^{0}, 2^{1}; 50 ± 0.0 |
| eastWest | 10^{−3}, 2^{4}; 80 ± 24.49 | 10^{0}, 2^{1}; 80 ± 24.49 | 10^{−1}, 2^{2}; 80 ± 24.49 | 10^{−2}, 2^{5}; 50 ± 0.0 | 10^{−2}, 2^{4}; 50 ± 0.0 | 10^{−3}, 2^{2}; 50 ± 0.0 | 10^{−3}, 2^{5}; 80 ± 24.49 |
| westEast | 10^{0}, 2^{5}; 85 ± 22.9 | 10^{0}, 2^{3}; 85 ± 22.9 | 10^{0}, 2^{3}; 80 ± 24.49 | 10^{−1}, 2^{4}; 50 ± 0.0 | 10^{1}, 2^{4}; 50 ± 0.0 | 10^{−1}, 2^{3}; 70 ± 24.49 | 10^{0}, 2^{2}; 70 ± 24.49 |
The results report the mean and standard deviation of classification accuracy over 10-fold cross-validation; the highest value in each row marks the best-performing dissimilarity measure. In the linear case, Min-Hausdorff based MIL-TWSVM attains the highest accuracy on the Musk 1, Winter Wren, Elephant, Fox, Tiger, and westEast datasets, Max-Hausdorff based MIL-TWSVM on the Musk 2 and Brown Creeper datasets, and EMD-based MIL-TWSVM on the Mutagenesis-atoms dataset. The remaining bag dissimilarity measures perform poorly across all datasets. Similarly, in the nonlinear case, the Min-Hausdorff and Max-Hausdorff based classifiers perform best on the Winter Wren, Brown Creeper, Elephant, Tiger, Fox, eastWest, and westEast datasets, while the EMD-based nonlinear classifier achieves the highest accuracy on Mutagenesis-atoms. We can therefore conclude that MIL-TWSVM performs best with the Min-Hausdorff and Max-Hausdorff dissimilarity scores. Further, we compare the Min-Hausdorff based MIL-TWSVM classifier with existing MIL approaches, Expectation-Maximization Diverse Density (EMDD), Diverse Density (DD), Multi-Instance Logistic Regression (MILR), Citation k-NN, MIL Boost, BP-MLP, MILES, and Multi-Instance Support Vector Machine (MISVM), as shown in Table 4.
Table 4: Performance comparison of MIL approaches. Each cell lists the accuracy ± standard deviation (%) with the rank in parentheses.

| Dataset | EMDD | DD | MILR | Citation k-NN | MIL Boost | BP-MLP | MILES | MISVM | MIL-TWSVM (Min-Hausdorff) |
|---|---|---|---|---|---|---|---|---|---|
| Musk 1 | 83.61 ± 13.16 (8) | 84.65 ± 11.83 (6) | 73.51 ± 13.43 (9) | 90.37 ± 10.40 (3) | 84.84 ± 11.55 (5) | 83.7 ± 9.2 (7) | 92.28 ± 10.45 (2) | 89.21 ± 9.84 (4) | 96.67 ± 5.09 (1) |
| Musk 2 | 85.57 ± 10.28 (3) | 80.39 ± 13.44 (7) | 78.51 ± 13.06 (9) | 84.61 ± 11.54 (4) | 79.50 ± 12.39 (8) | 80.4 ± 10.34 (6) | 89.66 ± 12.36 (2) | 83.92 ± 10.36 (5) | 92.85 ± 4.77 (1) |
| Mutagenesis-atoms | 68.89 ± 12.29 (8) | 72.89 ± 9.34 (7) | 74.00 ± 10.37 (5) | 73.35 ± 10.13 (6) | 79.26 ± 9.24 (2) | 78.88 ± 8.85 (3) | 75.56 ± 8.88 (4) | 66.54 ± 1.85 (9) | 87.04 ± 2.63 (1) |
| Winter wren | 96.85 ± 3.02 (3) | 85.56 ± 4.16 (9) | 89.45 ± 2.3 (8) | 94.8 ± 4.22 (6) | 92.34 ± 5.16 (7) | 96.67 ± 7.53 (4) | 96.35 ± 6.02 (5) | 97.16 ± 2.85 (2) | 98.02 ± 2.2 (1) |
| Brown creeper | 94.5 ± 2.26 (4) | 93.78 ± 3.88 (6) | 93.6 ± 4.24 (7) | 85.77 ± 4.35 (9) | 94.02 ± 5.75 (5) | 95.36 ± 6.14 (3) | 95.8 ± 5.84 (2) | 92.80 ± 3.16 (8) | 97.89 ± 3.71 (1) |
| Elephant | 75.25 ± 10.57 (8) | 81.45 ± 10.23 (4) | 78.90 ± 9.28 (7) | 50 ± 0.0 (9) | 84.70 ± 8.46 (2) | 83.72 ± 8.82 (3) | 80.36 ± 8.26 (5) | 79.45 ± 9.32 (6) | 85.56 ± 5.49 (1) |
| Fox | 59.65 ± 9.16 (5) | 59.40 ± 9.88 (6) | 57.35 ± 10.81 (7) | 50 ± 0.0 (8) | 66.90 ± 9.58 (3) | 62.05 ± 7.43 (4) | 68.87 ± 9.04 (2) | 49.35 ± 5.71 (9) | 75 ± 4.49 (1) |
| Tiger | 71.35 ± 9.45 (8) | 72.20 ± 8.80 (7) | 75.30 ± 8.98 (6) | 50 ± 0.0 (9) | 87.62 ± 8.07 (1) | 79.45 ± 9.12 (5) | 84.22 ± 7.65 (3) | 80.70 ± 7.42 (4) | 86.25 ± 5.89 (2) |
| eastWest | 64.00 ± 28.50 (4) | 64.50 ± 32.79 (3) | 67.00 ± 36.39 (2) | 45 ± 24.10 (9) | 55.00 ± 32.18 (7.5) | 60.50 ± 21.67 (5.5) | 55.00 ± 32.18 (7.5) | 60.50 ± 21.67 (5.5) | 80 ± 24.49 (1) |
| westEast | 38.00 ± 27.63 (6.5) | 36.00 ± 29.37 (8) | 35.50 ± 34.30 (9) | 50.5 ± 27.06 (4.5) | 56.50 ± 33.83 (2) | 55.00 ± 32.18 (3) | 50.5 ± 27.06 (4.5) | 38.00 ± 25.74 (6.5) | 90 ± 20.00 (1) |
| Average rank | 5.75 | 6.3 | 6.9 | 6.75 | 4.25 | 4.35 | 3.7 | 5.9 | 1.1 |

Friedman test statistic = 36.83.
From Table 4, it is observed that the proposed MIL-TWSVM classifier achieves the highest predictive accuracy on nine of the ten benchmark datasets (on Tiger, MIL Boost is marginally more accurate) and thus performs better overall than the other existing MIL approaches.
5.4. Statistical Comparison
The Friedman test [32, 33] ranks the classifiers on each dataset independently: the best-performing classifier receives rank 1, the second best rank 2, and so on, with average ranks assigned in case of ties. Let r_i^j be the rank of the jth classifier on the ith dataset. The Friedman test statistic is calculated as

\chi_F^2 = \frac{12D}{K(K+1)} \left[ \sum_{j=1}^{K} AR_j^2 - \frac{K(K+1)^2}{4} \right], \quad \text{where } AR_j = \frac{1}{D} \sum_{i=1}^{D} r_i^j. \qquad (33)

Here D is the number of datasets used for comparison, K is the number of classifiers, and AR_j is the average rank of the jth classifier. The statistic follows a chi-squared distribution with K − 1 degrees of freedom. The null hypothesis, which states that there is no difference between the classifiers, is rejected when the value of the statistic exceeds the critical value for K − 1 degrees of freedom. The Nemenyi post hoc test [75] then reports significant differences between individual classifiers: two classifiers differ significantly if their average ranks differ by at least the critical difference (CD),

CD = q_\alpha \sqrt{\frac{K(K+1)}{6D}}, \qquad (34)

where q_\alpha is based on the studentized range statistic. The results are plotted using the modified Demšar significance diagram [76]. We calculated the average rank of each MIL approach from its performance on each dataset (see Table 4) and then computed the Friedman statistic according to (33). Table 4 shows that the Min-Hausdorff based MIL-TWSVM classifier achieves the best (lowest) average rank among all MIL approaches, with MILES obtaining the second-best average rank (3.7).
For α = 0.05, the critical value of the chi-squared distribution with 8 degrees of freedom is 15.507. The obtained Friedman test statistic of 36.83 is much larger than this critical value, so we reject the null hypothesis that there is no difference between the classifiers. The critical value q_{0.05} for nine classifiers is 3.102, and the critical difference for α = 0.05 follows from (34):

CD = 3.102 \sqrt{\frac{9 \times 10}{6 \times 10}} = 3.799. \qquad (35)

Figure 9 depicts the Demšar significance diagram, in which the MIL approaches are arranged on the y-axis in ascending order of average rank, with the corresponding ranks marked on the x-axis.
Figure 9: Average rank comparison of MIL approaches.
The critical difference value is added to the average rank of each MIL approach in order to analyze whether the proposed approach is significantly better than the others. The two red vertical lines mark the end of the best-performing approach's tail and the start of the next significantly different approach. From the figure, it is clear that the existing MIL approaches EMDD, DD, MILR, Citation k-NN, and MISVM perform significantly worse than the best-performing approach, Min-Hausdorff based MIL-TWSVM. We therefore conclude that the proposed MIL-TWSVM is a suitable choice for multiple instance learning problem domains.
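Equations (33)–(35) are straightforward to reproduce. The short sketch below plugs in the average ranks from Table 4 (K = 9 classifiers, D = 10 datasets) and recovers the reported values, χ²_F ≈ 36.83 and CD ≈ 3.799.

```python
import math

def friedman_statistic(avg_ranks, D):
    # Equation (33): chi^2_F = (12D / (K(K+1))) * (sum_j AR_j^2 - K(K+1)^2 / 4).
    K = len(avg_ranks)
    return 12.0 * D / (K * (K + 1)) * (
        sum(r * r for r in avg_ranks) - K * (K + 1) ** 2 / 4.0)

def nemenyi_cd(q_alpha, K, D):
    # Equation (34): CD = q_alpha * sqrt(K(K+1) / (6D)).
    return q_alpha * math.sqrt(K * (K + 1) / (6.0 * D))

# Average ranks of the nine classifiers from Table 4.
avg_ranks = [5.75, 6.3, 6.9, 6.75, 4.25, 4.35, 3.7, 5.9, 1.1]
print(round(friedman_statistic(avg_ranks, D=10), 2))  # 36.83
print(round(nemenyi_cd(3.102, K=9, D=10), 3))         # 3.799
```

Since 36.83 exceeds the chi-squared critical value of 15.507 and the rank gap between MIL-TWSVM (1.1) and MILES (3.7) is smaller than CD = 3.799, only the approaches whose average ranks differ from 1.1 by more than 3.799 are declared significantly worse, matching the grouping in Figure 9.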
6. Conclusion
This study has focused on multiple instance learning, in which a classifier learns from a set of feature vectors (a bag) instead of a single feature vector, and has proposed a TWSVM-based MIL approach termed MIL-TWSVM. Each bag is represented by a vector of its dissimilarities to the other bags in the training set, and the proposed classifier is trained on this summarized information. First, the performance of the MIL-TWSVM classifier was compared across different dissimilarity scores on ten benchmark MIL datasets. We then compared MIL-TWSVM with eight existing MIL approaches. Experimental results demonstrate that the proposed approach achieves the highest predictive accuracy on nine of the ten datasets, which further supports the suitability of MIL-TWSVM in multiple instance learning scenarios. These findings are also supported by statistical analysis using the Friedman test, which shows that MIL-TWSVM is significantly better than EMDD, DD, MILR, Citation k-NN, and MISVM. In the future, we are interested in extending MIL-TWSVM to the multi-instance multilabel scenario.
Competing Interests
The authors declare that they have no competing interests regarding the publication of this paper.
References

[1] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles."
[2] O. Maron and T. Lozano-Pérez.
[3] Z. H. Zhou, "Multi-instance learning: a survey," Department of Computer Science and Technology, Nanjing University, 2004.
[4] J. Foulds and E. Frank, "A review of multi-instance learning assumptions."
[5] J. Amores, "Multiple instance classification: review, taxonomy and comparative study."
[6] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning."
[7] Y. Chen, J. Bi, and J. Z. Wang, "MILES: multiple-instance learning via embedded instance selection."
[8] I. Gondra and T. Xu, "A multiple instance learning based framework for semantic image segmentation."
[9] Q. Zhang, S. A. Goldman, W. Yu, and J. E. Fritts, "Content-based image retrieval using multiple-instance learning," in Proceedings of the International Conference on Machine Learning (ICML '02), vol. 2, pp. 682–689, 2002.
[10] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li, "Multi-instance learning by treating instances as non-i.i.d. samples," in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 1249–1256, Montreal, Canada, June 2009.
[11] H. Wang, F. Nie, and H. Huang, "Learning instance specific distance for multi-instance classification," in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 507–512, San Francisco, Calif, USA, August 2011.
[12] P. Viola, J. C. Platt, and C. Zhang, "Multiple instance boosting for object detection," in Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS '05), vol. 18, pp. 1417–1424, December 2005.
[13] B. Babenko, N. Verma, P. Dollár, and S. J. Belongie, "Multiple instance learning with manifold bags," in Proceedings of the 28th International Conference on Machine Learning (ICML '11), pp. 81–88, 2011.
[14] S. Ali and M. Shah, "Human action recognition in videos using kinematic features and multiple instance learning."
[15] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 983–990, IEEE, Miami Beach, Fla, USA, June 2009.
[16] C. Leistner, A. Saffari, and H. Bischof, "MIForests: multiple-instance learning with randomized trees," in Proceedings of the European Conference on Computer Vision (ECCV '10), pp. 29–42, Springer, Berlin, Germany, 2010.
[17] Y. Xie, Y. Qu, C. Li, and W. Zhang, "Online multiple instance gradient feature selection for robust visual tracking."
[18] B. Zeisl, C. Leistner, A. Saffari, and H. Bischof, "On-line semi-supervised multiple-instance boosting," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), p. 1879, IEEE, San Francisco, Calif, USA, June 2010.
[19] Z. Jorgensen, Y. Zhou, and M. Inge, "A multiple instance learning strategy for combating good word attacks on spam filters."
[20] L. Sørensen, M. Loog, D. M. J. Tax, W. J. Lee, M. de Bruijne, and R. P. W. Duin, "Dissimilarity-based multiple instance learning."
[21] V. Cheplygina, D. M. J. Tax, and M. Loog, "Multiple instance learning with bag dissimilarities."
[22] J. Wang and J. D. Zucker, "Solving multiple-instance problem: a lazy learning approach," in Proceedings of the 17th International Conference on Machine Learning, pp. 1119–1125, San Francisco, Calif, USA, 2000.
[23] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola, "Multi-instance kernels," in Proceedings of the International Conference on Machine Learning (ICML '02), vol. 2, pp. 179–186, 2002.
[24] H.-Y. Wang, Q. Yang, and H. Zha, "Adaptive p-posterior mixture-model kernels for multiple instance learning," in Proceedings of the 25th International Conference on Machine Learning, pp. 1136–1143, Helsinki, Finland, July 2008.
[25] D. Zhang, Y. Liu, L. Si, J. Zhang, and R. D. Lawrence, "Multiple instance learning on structured data."
[26] Q. Zhang and S. A. Goldman, "EM-DD: an improved multiple-instance learning technique."
[27] Jayadeva, R. Khemchandani, and S. Chandra, "Twin support vector machines for pattern classification."
[28] Y. Tian, Z. Qi, X. Ju, Y. Shi, and X. Liu, "Nonparallel support vector machines for pattern classification."
[29] D. Tomar and S. Agarwal, "Twin support vector machine: a review from 2007 to 2014."
[30] O. L. Mangasarian and E. W. Wild, "Multisurface proximal support vector machine classification via generalized eigenvalues."
[31] X. Xu and E. Frank, "Logistic regression and boosting for labeled bags of instances," in H. Dai, R. Srikant, and C. Zhang, Eds.
[32] M. Friedman, "A comparison of alternative tests of significance for the problem of m rankings."
[33] J. Demšar, "Statistical comparisons of classifiers over multiple data sets."
[34] Z. Zhao, G. Fu, S. Liu, K. M. Elokely, R. J. Doerksen, Y. Chen, and D. E. Wilkins, "Drug activity prediction using multiple-instance learning via joint instance and feature selection."
[35] Z.-H. Zhou, K. Jiang, and M. Li, "Multi-instance learning based web mining."
[36] Y. Chevaleyre and J. D. Zucker, "Solving multiple-instance and multiple-part learning problems with decision trees and rule sets. Application to the mutagenesis problem."
[37] Z. Fu and A. Robles-Kelly, "Fast multiple instance learning via L1,2 logistic regression," in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), pp. 1–4, IEEE, Tampa, Fla, USA, December 2008.
[38] J. Ramon and L. De Raedt, "Multi instance neural networks," in Proceedings of the ICML-2000 Workshop on Attribute-Value and Relational Learning, pp. 53–60, 2000.
[39] Z. H. Zhou and M. L. Zhang, "Neural networks for multi-instance learning," in Proceedings of the International Conference on Intelligent Information Technology, pp. 455–459, Beijing, China, 2002.
[40] W. Shen, X. Bai, Z. Hu, and Z. Zhang, "Multiple instance subspace learning via partial random projection tree for local reflection symmetry in natural images."
[41] M. Birisan and P. A. Beling, "A multi-instance learning approach to filtering images for presentation to analysts."
[42] I. Gondra and T. Xu, "Image region re-weighting via multiple instance learning."
[43] Y.-Y. Xu, "Multiple-instance learning based decision neural networks for image retrieval and classification."
[44] O. Maron and A. L. Ratan, "Multiple-instance learning for natural scene classification," in Proceedings of the International Conference on Machine Learning (ICML '98), pp. 341–349, Madison, Wis, USA, 1998.
[45] H. Cheng, K. A. Hua, and N. Yu, "An automatic feature generation approach to multiple instance learning and its applications to image databases."
[46] H. T. Pao, S. C. Chuang, Y. Y. Xu, and H.-C. Fu, "An EM based multiple instance learning method for image classification."
[47] F. Sener and N. Ikizler-Cinbis, "Ensemble of multiple instance classifiers for image re-ranking."
[48] F. Li and R. Liu, "Graph-based multiple-instance learning with instance weighting for image retrieval," in Proceedings of the 18th IEEE International Conference on Image Processing (ICIP '11), pp. 2453–2456, IEEE, Brussels, Belgium, September 2011.
[49] S. Feng, W. Xiong, B. Li, C. Lang, and X. Huang, "Hierarchical sparse representation based multi-instance semi-supervised learning with application to image categorization."
[50] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, and E. I.-C. Chang, "Deep learning of feature representation with multiple instance learning for medical image analysis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '14), pp. 1626–1630, Florence, Italy, May 2014.
[51] J. Wu, Y. Yu, C. Huang, and K. Yu, "Deep multiple instance learning for image classification and auto-annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 3460–3469, IEEE, Boston, Mass, USA, June 2015.
[52] D. Kotzias, M. Denil, P. Blunsom, and N. de Freitas, "Deep multi-instance transfer learning," https://arxiv.org/abs/1411.3128.
[53] J. Ding, H. D. Cheng, J. Huang, J. Liu, and Y. Zhang, "Breast ultrasound image classification based on multiple-instance learning."
[54] C. Li, S. Zhang, H. Zhang, S. Zhang, L. Pang, K. Lam, and C. Hui, "Using the K-nearest neighbor algorithm for the classification of lymph node metastasis in gastric cancer."
[55] T. Tong, R. Wolz, Q. Gao, R. Guerrero, J. V. Hajnal, and D. Rueckert, "Multiple instance learning for classification of dementia in brain MRI."
[56] G. Quellec, M. Lamard, M. D. Abràmoff, E. Decencière, B. Lay, A. Erginay, B. Cochener, and G. Cazuguel, "A multiple-instance learning framework for diabetic retinopathy screening."
[57] W. He and Y. Wang, "Text representation and classification based on multi-instance learning," in Proceedings of the 16th International Conference on Management Science and Engineering (ICMSE '09), pp. 34–39, Moscow, Russia, September 2009.
[58] C. Xu, W. Tao, Z. Meng, and Z. Feng, "Robust visual tracking via online multiple instance learning with Fisher information."
[59] J. Xu, S. Denman, V. Reddy, C. Fookes, and S. Sridharan, "Real-time video event detection in crowded scenes using MPEG derived features: a multiple instance learning approach."
[60] X. Ding, G. Zhang, Y. Ke, B. Ma, and Z. Li, "High efficient intrusion detection methodology with twin support vector machines," in Proceedings of the International Symposium on Information Science and Engineering (ISISE '08), vol. 1, pp. 560–564, Shanghai, China, December 2008.
[61] J. He and S.-H. Zheng, "Intrusion detection model with twin support vector machines."
[62] J. A. Nasiri, N. M. Charkari, and K. Mozafari, "Energy-based model of least squares twin support vector machines for human action recognition."
[63] H.-Y. Yang, X.-Y. Wang, P.-P. Niu, and Y.-C. Liu, "Image denoising using nonsubsampled shearlet transform and twin support vector machines."
[64] D. Tomar, D. Ojha, and S. Agarwal, "An emotion detection system based on multi least squares twin support vector machine."
[65] M. A. Kumar and M. Gopal, "Least squares twin support vector machines for pattern classification."
[66] S. Agarwal and D. Tomar, "A feature selection based model for software defect prediction."
[67] D. Tomar and S. Agarwal, "Prediction of defective software modules using class imbalance learning."
[68] D. Tomar and S. Agarwal, "Hybrid feature selection based weighted least squares twin support vector machine approach for diagnosing breast cancer, hepatitis, and diabetes."
[69] D. Tomar, B. R. Prasad, and S. Agarwal, "An efficient Parkinson disease diagnosis system based on least squares twin support vector machine and particle swarm optimization," in Proceedings of the 9th IEEE International Conference on Industrial and Information Systems (ICIIS '14), pp. 1–6, IEEE, Gwalior, India, December 2014.
[70] Z. Wu and C. Yang, "Study to multi-twin support vector machines and its applications in speaker recognition," in Proceedings of the International Conference on Computational Intelligence and Software Engineering (CiSE '09), pp. 1–4, Wuhan, China, December 2009.
[71] Y. Rubner, C. Tomasi, and L. J. Guibas, "Earth mover's distance as a metric for image retrieval."
[72] C. W. Hsu, C. C. Chang, and C. J. Lin, "A practical guide to support vector classification," Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2013.
[73] R. Khemchandani and S. Sharma, "Robust least squares twin support vector machine for human activity recognition."
[74] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines."
[75] P. Nemenyi.
[76] I. Brown and C. Mues, "An experimental comparison of classification algorithms for imbalanced credit scoring data sets."