A Novel Multiple Instance Learning Method Based on Extreme Learning Machine

Since real-world data sets usually contain large numbers of instances, it is meaningful to develop efficient and effective multiple instance learning (MIL) algorithms. As a learning paradigm, MIL differs from traditional supervised learning in that it handles the classification of bags comprising unlabeled instances. In this paper, a novel efficient method based on extreme learning machine (ELM) is proposed to address the MIL problem. First, the most qualified instance is selected in each bag through a single hidden layer feedforward network (SLFN) whose input and output weights are both initialized randomly, and the single selected instance is used to represent its bag. Second, the modified ELM model is trained by using the selected instances to update the output weights. Experiments on several benchmark data sets and multiple instance regression data sets show that ELM-MIL achieves good performance; moreover, it runs several times or even hundreds of times faster than other similar MIL algorithms.


Introduction
Multiple instance learning (MIL) was first developed to solve the problem of drug activity prediction [1]. Since then, a variety of problems have been formulated as multiple instance ones, such as object detection [2], image retrieval [3], computer aided diagnosis [4], visual tracking [5][6][7], text categorization [8][9][10], and image categorization [11,12]. In MIL, each example, called a bag, contains many feature vectors (instances), some of which may be responsible for the observed classification of the example, and labels are attached only to bags (training examples) rather than to their instances. Furthermore, a bag is classified as positive if at least one of its instances is positive; otherwise, the bag is labeled as negative.
Numerous learning methods for the MIL problem have been proposed in the past decade. As the first learning algorithm for MIL, Axis-Parallel Rectangle (APR) [1] works by adjusting a hyperrectangle in the instance feature space. Then, the well-known Diverse Density (DD) [13] algorithm was proposed to measure the co-occurrence of similar instances from different positive bags. Andrews et al. [8] used the support vector machine (SVM) to solve the MIL problem in a method called MI-SVM, where a maximal margin hyperplane is chosen for the bags by considering the margin of the most positive instance in each bag. Wang and Zucker [14] proposed two variants of the k-nearest neighbor algorithm, namely, Bayesian-kNN and Citation-kNN, by taking advantage of the k-neighbors at both the instance and the bag level. Chevaleyre and Zucker derived ID3-MI [15] for multiple instance learning from the decision tree algorithm ID3; its key techniques are the so-called multiple instance coverage and multiple instance entropy. Zhou and Zhang presented a multiple instance neural network named BP-MIP [16] with a global error function defined at the level of bags. Nevertheless, most multiple instance learning algorithms take a long time to train.
Extreme learning machine (ELM) provides a powerful way of learning patterns and has several advantages, such as faster learning speed and higher generalization performance [17][18][19]. This paper is mainly concerned with extending the extreme learning machine to multiple instance learning: a novel classification method based on neural networks is presented to address the MIL problem. A two-step training procedure is employed to train ELM-MIL. During the first step, the most qualified instance is selected in each bag through SLFNs with a global error function defined at the level of bags, and the single selected instance is used to represent each bag. During the second step, by making use of the selected instances, the output parameters of the modified SLFNs are optimized the way ELM does. Experiments on several benchmark data sets and text categorization data sets show that ELM-MIL achieves good performance; moreover, it runs several times or even hundreds of times faster than other similar MIL algorithms.
The remainder of this paper is organized as follows. In Section 2, ELM is briefly introduced and an algorithmic view of ELM-MIL is provided. In Section 3, experiments on various MIL problems are conducted and the results are reported. In Section 4, the main ideas of the method are summarized and possible future work is discussed.
Computational Intelligence and Neuroscience

Proposed Methods
In this section, we first introduce ELM theory; then, a modified ELM is proposed to address the MIL problem, where the most positive instance in a positive bag or the least negative instance in a negative bag is selected.

Extreme Learning Machine.
ELM is a single hidden layer feedforward neural network in which the hidden node parameters (e.g., the input weights and hidden node biases for additive nodes and Fourier series nodes, or the centers and impact factors for RBF nodes) are chosen randomly and the output weights are determined analytically, usually by the least squares method. Because updating the input weights is unnecessary, ELM can learn much faster than the back propagation (BP) algorithm [18]. ELM can also achieve better generalization performance.
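Concretely, suppose that we are given a training set comprising $N$ samples $\{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$ and the hidden layer output (with $L$ nodes) denoted as a row vector $\mathbf{o}(\mathbf{x}) = [o_1(\mathbf{x}), \ldots, o_L(\mathbf{x})]$, where $\mathbf{x}$ is the input sample. The model of the single hidden layer neural network can be written as

$$f_L(\mathbf{x}_i) = \sum_{k=1}^{L} \beta_k G(\mathbf{a}_k, b_k, \mathbf{x}_i) = t_i, \quad i = 1, \ldots, N, \tag{1}$$

where $\beta_k$ is the weight connecting the $k$th hidden node to the output node, $f_L$ is the output of the network with $L$ hidden nodes, and $\mathbf{a}_k$ and $b_k$ are the input weights and hidden layer bias, respectively. $G(\mathbf{a}_k, b_k, \mathbf{x}_i)$ is the hidden layer function or kernel. According to the ELM theory [18][19][20], the parameters $\mathbf{a}_k$ and $b_k$ can be randomly assigned, and the hidden layer function can be any nonlinear continuous function that satisfies the universal approximation capability theorems. In general, the popular mapping functions are as follows:

(1) Sigmoid function:
$$G(\mathbf{a}, b, \mathbf{x}) = \frac{1}{1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b))}$$

(2) Gaussian function:
$$G(\mathbf{a}, b, \mathbf{x}) = \exp(-b \|\mathbf{x} - \mathbf{a}\|^2)$$

For notational simplicity, equation (1) can be written as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T},$$

where $\mathbf{H}$ is the $N \times L$ hidden layer output matrix, whose elements are as follows:

$$\mathbf{H} = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_N) \end{bmatrix},$$

and $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^{\top}$, $\mathbf{T} = [t_1, \ldots, t_N]^{\top}$. The least squares solution with minimal norm is analytically determined by using the Moore-Penrose generalized inverse:

$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T},$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$.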
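The training procedure described above (random hidden node parameters, followed by an analytic least squares solution for the output weights) can be illustrated with a minimal NumPy sketch. The function names `elm_fit` and `elm_predict`, the uniform initialization range [-1, 1], and the toy sine data are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_fit(X, T, n_hidden, rng):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    # Assumption: hidden parameters are drawn uniformly from [-1, 1].
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # input weights, one column per hidden node
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden node biases
    H = sigmoid(X @ A + b)                                   # hidden layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ T                             # minimum-norm least-squares solution
    return A, b, beta


def elm_predict(X, A, b, beta):
    """Network output for each row of X."""
    return sigmoid(X @ A + b) @ beta


# Toy data (hypothetical): approximate y = sin(x) on [0, 2*pi].
rng = np.random.default_rng(0)
X = np.linspace(0.0, 2.0 * np.pi, 200).reshape(-1, 1)
T = np.sin(X)

A, b, beta = elm_fit(X, T, n_hidden=30, rng=rng)
mse = float(np.mean((elm_predict(X, A, b, beta) - T) ** 2))
print(f"training MSE: {mse:.3e}")
```

Note that the only fitted quantity is `beta`; the hidden layer parameters `A` and `b` are never updated after they are drawn, which is the source of the speed advantage over iterative training.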

ELM-MIL.
Assume that the training set contains $N$ bags, the $i$th bag is composed of $M_i$ instances, and all instances belong to the $p$-dimensional space; for example, the $j$th instance in the $i$th bag is $[B_{ij1}, B_{ij2}, \ldots, B_{ijp}]$. Each bag is attached with a label $t_i$. If the bag is positive, then $t_i = 1$; otherwise, $t_i = 0$. Our goal is to predict whether the label of a new bag is positive or negative. Hence, the global error function is defined at the level of bags instead of at the level of instances:

$$E = \sum_{i=1}^{N} E_i,$$

where $E_i$ is the error on bag $B_i$. Based on the assumption that a positive bag contains at least one positive instance, we can simply define $E_i$ as follows:

$$E_i = \frac{1}{2}\left(\max_{1 \le j \le M_i} (o_{ij}) - t_i\right)^2,$$

where $o_{ij}$ is the network output for instance $B_{ij}$ of bag $B_i$. Our goal is to minimize this cost function over the bags.
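As a small worked example of the bag-level error, the sketch below assumes the squared-error form $E_i = \frac{1}{2}(\max_j o_{ij} - t_i)^2$ used by bag-level neural network methods such as BP-MIP [16]. The function names and the instance outputs are hypothetical values chosen for illustration.

```python
def bag_error(instance_outputs, bag_label):
    """Half the squared difference between the largest instance output and the bag label."""
    return 0.5 * (max(instance_outputs) - bag_label) ** 2


def global_error(outputs_per_bag, bag_labels):
    """Sum of the bag-level errors over all bags."""
    return sum(bag_error(o, t) for o, t in zip(outputs_per_bag, bag_labels))


# Hypothetical network outputs for two bags: one positive, one negative.
outputs_per_bag = [[0.1, 0.9, 0.3], [0.2, 0.4]]
bag_labels = [1, 0]

E = global_error(outputs_per_bag, bag_labels)
print(E)
```

Only the instance with the largest output contributes to each bag's error: the positive bag is penalized because its best instance (0.9) falls short of 1, and the negative bag because its highest-scoring instance (0.4) exceeds 0.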
The remaining problem is how to find the instance in each bag that has the maximum output. As we know, ELM chooses the input weights randomly and determines the output weights of SLFNs analytically. At first, the output weights are not known; thus, the maximum output over the $M_i$ instances of bag $i$, $\max_{1 \le j \le M_i} (o_{ij})$, cannot be calculated directly [16]. Therefore, both the input weights/hidden node biases and the output weights are initialized randomly. When the bags are fed into the original SLFN one by one, the instance having the maximum output in each bag is marked. In this way, the most positive instance (for a positive bag) or the least negative instance (for a negative bag) is picked out from each bag. The selected instances, whose number is equal to the number of training bags, are then used as the training data set to train the original network by minimizing the least squares error.
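The two-step procedure described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `elm_mil_fit` and `elm_mil_predict`, the uniform initialization range, the pseudoinverse solution without regularization, the prediction rule (threshold the maximum instance output at 0.5), and the synthetic bags are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_mil_fit(bags, labels, n_hidden, rng):
    """bags: list of (M_i, p) arrays; labels: N bag labels in {0, 1}."""
    p = bags[0].shape[1]
    # Assumption: all random parameters are drawn uniformly from [-1, 1].
    A = rng.uniform(-1.0, 1.0, size=(p, n_hidden))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random hidden biases
    beta0 = rng.uniform(-1.0, 1.0, size=n_hidden)    # random initial output weights

    # First step: in each bag, keep the instance with the maximum network output.
    winners = np.vstack([
        bag[np.argmax(sigmoid(bag @ A + b) @ beta0)] for bag in bags
    ])

    # Second step: ELM-style least-squares fit on the selected instances only.
    H = sigmoid(winners @ A + b)                     # shape (N, L)
    beta = np.linalg.pinv(H) @ np.asarray(labels, dtype=float)
    return A, b, beta


def elm_mil_predict(bags, A, b, beta, threshold=0.5):
    """Assumed rule: a bag is positive if its maximum instance output reaches the threshold."""
    scores = np.array([np.max(sigmoid(bag @ A + b) @ beta) for bag in bags])
    return (scores >= threshold).astype(int)


# Synthetic bags (hypothetical): positive bags contain one shifted instance.
rng = np.random.default_rng(0)
labels = np.array([1, 0] * 10)
bags = []
for t in labels:
    bag = rng.normal(0.0, 1.0, size=(rng.integers(3, 7), 5))
    if t == 1:
        bag[0] += 3.0
    bags.append(bag)

A, b, beta = elm_mil_fit(bags, labels, n_hidden=40, rng=rng)
pred = elm_mil_predict(bags, A, b, beta)
print("training accuracy:", float(np.mean(pred == labels)))
```

Because the instance selection uses a single forward pass with random weights and the fit is a single pseudoinverse, the whole procedure involves no iterative optimization.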
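Given a training set $\{B_i, t_i \mid i = 1, \ldots, N\}$, where bag $B_i$ contains $M_i$ instances $\{B_{i1}, B_{i2}, \ldots, B_{iM_i}\}$, each instance is denoted as a $p$-dimensional feature vector, so the $j$th instance of the $i$th bag is $[B_{ij1}, B_{ij2}, \ldots, B_{ijp}]$. The hidden nodes use the sigmoid function, and the number of hidden nodes is denoted by $L$. The algorithm can now be summarized step-by-step as follows.

Step 1. Randomly assign the input weights $\mathbf{a}_k$ and hidden node biases $b_k$ ($k = 1, \ldots, L$), as well as the initial output weights $\boldsymbol{\beta}$.

Step 2. For every bag $B_i$:
For every instance $B_{ij}$ in bag $B_i$, calculate the output $o_{ij}$ of the SLFN.
End for
Select the win-instance $B_{i,\mathrm{win}}$:

$$\mathrm{win} = \arg\max_{1 \le j \le M_i} o_{ij}$$

Step 3. Calculate the hidden layer output matrix $\mathbf{H}$ of the win-instances:

$$\mathbf{H} = \begin{bmatrix} G(\mathbf{a}_1, b_1, B_{1,\mathrm{win}}) & \cdots & G(\mathbf{a}_L, b_L, B_{1,\mathrm{win}}) \\ \vdots & \ddots & \vdots \\ G(\mathbf{a}_1, b_1, B_{N,\mathrm{win}}) & \cdots & G(\mathbf{a}_L, b_L, B_{N,\mathrm{win}}) \end{bmatrix}$$

Step 4. Calculate the new output weights $\boldsymbol{\beta}$ analytically from $\mathbf{H}$ and the bag labels, as ELM does.

Experiments
For comparison with several typical MIL methods, we conduct 10-fold cross validation, which is further repeated 10 times with different random partitions, and the average test accuracy is reported. The results are summarized in Table 1. The relation between the number of hidden layer nodes and the prediction accuracy with different regularization parameters on the MUSK1 and MUSK2 data sets is presented in Figures 1 and 2, respectively. It can be found that when the number of hidden layer nodes is over 300, the accuracy stays at a high level for both MUSK1 and MUSK2.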
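The evaluation protocol mentioned above (10-fold cross validation, repeated 10 times with different random partitions, averaging the test accuracy) can be sketched as follows. The splitting is done at the level of bags. The helper name `repeated_kfold_accuracy`, the `fit`/`predict` callable interface, and the majority-class placeholder learner are hypothetical and serve only to make the sketch runnable; any bag-level classifier could be plugged in.

```python
import numpy as np


def repeated_kfold_accuracy(bags, labels, fit, predict, k=10, repeats=10, seed=0):
    """Average test accuracy over `repeats` random k-fold partitions of the bags."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    accuracies = []
    for _ in range(repeats):
        order = rng.permutation(len(bags))           # new random partition for each repeat
        for test_idx in np.array_split(order, k):    # k disjoint test folds
            train_idx = np.setdiff1d(order, test_idx)
            model = fit([bags[i] for i in train_idx], labels[train_idx])
            pred = predict(model, [bags[i] for i in test_idx])
            accuracies.append(np.mean(pred == labels[test_idx]))
    return float(np.mean(accuracies))


# Placeholder learner (hypothetical): always predicts the majority training label.
def fit_majority(train_bags, train_labels):
    return int(np.mean(train_labels) >= 0.5)


def predict_majority(model, test_bags):
    return np.full(len(test_bags), model)


# Dummy data: 40 bags, 30 of them positive.
bags = [np.zeros((3, 2)) for _ in range(40)]
labels = np.array([1] * 30 + [0] * 10)

acc = repeated_kfold_accuracy(bags, labels, fit_majority, predict_majority)
print(f"average test accuracy: {acc:.3f}")
```

Splitting by bag rather than by instance keeps all instances of a bag on the same side of the train/test boundary, which matters because labels exist only at the bag level.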
The results are shown in Table 2 for MUSK1 and Table 3 for MUSK2. Table 1 suggests that ELM-MIL is comparable with the state-of-the-art algorithm proposed in [13]. In particular, it can be found from Tables 2 and 3 that ELM-MIL not only achieves higher test accuracy than BP-MIP, which is also a multiple instance learning method based on neural networks, but also learns significantly faster than BP-MIP on the MUSK data sets. Moreover, the iterated-discrim APR algorithm was specially devised for the MUSK data, while ELM-MIL is a general algorithm; in terms of applicability, ELM-MIL is therefore superior to the APR method. When compared with Citation-kNN, from the point of view of prediction accuracy, ELM-MIL is worse than Citation-kNN, but from the point [...]
All of them are run in MATLAB 2013b on a PC with a 2.6 GHz i5-3230 processor. Table 4 shows that the square loss of our proposed ELM-MIL is worse than that of MI-kernel, but ELM-MIL takes only a very short time to find appropriate parameters, about twenty times faster than MI-kernel. When compared with BP-MIP and Diverse Density, ELM-MIL is better than both of them in terms of performance as well as training time. These results indicate that ELM-MIL is an efficient and effective approach for multiple instance regression tasks.

Conclusions
In this paper, a novel multiple instance learning algorithm based on extreme learning machine is proposed. By modifying the error function to suit the characteristics of multiple instance problems, the most representative instance is chosen in each bag, and the chosen instances are employed to train the extreme learning machine. We have tested ELM-MIL on benchmark data sets taken from drug activity prediction, artificial data sets, and multiple instance regression. Compared with other methods, the ELM-MIL algorithm learns much faster, while its classification accuracy is slightly worse than that of state-of-the-art multiple instance algorithms. The experimental results reported in this paper are rather preliminary. For future work, there are two possible directions. First, it is possible to improve the performance of our method by exploiting feature selection techniques [3,13], that is, feature scaling with Diverse Density and feature reduction with principal component analysis. Second, one can build ensembles of several multiple instance learners to enhance the basic multiple instance learners.