
Research On Embedding-Space-Based Multiple-Instance Learning Algorithms With Feature Selection

Posted on: 2015-05-06    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L M Yuan    Full Text: PDF
GTID: 1228330422492544    Subject: Computer Science and Technology
Abstract/Summary:
Like conventional supervised learning, multiple-instance learning (MIL) concerns the classification of labeled training examples, and a MIL learner aims to correctly predict the labels of unknown examples. In supervised learning every example contains only one instance; in MIL, by contrast, examples are called bags, each of which contains one or more instances, and labels are attached to the training bags rather than to the instances inside them. Under the standard multiple-instance assumption, a training bag is labeled positive if at least one of its instances is positive, and negative otherwise.

The standard multiple-instance assumption thus implies that a positive bag contains at least one positive instance. However, many MIL problems do not satisfy this assumption, such as region-based image categorization, where an image belongs to a certain category only when several different regional target objects appear in the image simultaneously. To handle such problems, many researchers have adopted a more generalized multiple-instance assumption and, under it, proposed several embedding-space-based MIL algorithms. The basic idea can be summarized as follows. First, the training bags are embedded into an embedding space formed from the instances in the training set, so that every training bag is represented as a corresponding bag-level feature vector. Then, all bag-level feature vectors are fed into a standard supervised learning algorithm, e.g., a support vector machine (SVM). In this way, embedding-space-based MIL algorithms convert MIL into conventional supervised learning through the bag-level feature mapping.

Typically, an embedding-space-based MIL algorithm uses all the training instances to construct the embedding space. Since most training bags in general MIL problems contain multiple instances, the dimensionality of the resulting embedding space is much higher than the number of training bags, even for moderately sized data sets. This imbalance between the number of training bags and the number of features can easily lead to overfitting, so feature selection becomes a necessity for an embedding-space-based MIL algorithm. Moreover, since every feature is defined by an instance prototype, feature selection here is essentially also instance selection. In this work, we investigate embedding-space-based MIL algorithms from the perspective of feature selection and discuss two key aspects of this kind of algorithm: feature mapping and feature selection. The contributions are summarized as follows.
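To make the bag-level feature mapping concrete, here is a minimal sketch (Python/NumPy; not code from the thesis) of a MILES-style embedding: a bag's k-th feature is its maximum Gaussian similarity to the k-th instance prototype, with the bandwidth sigma as an assumed free parameter.

    import numpy as np

    def embed_bags(bags, prototypes, sigma=1.0):
        # Map each bag (an (n_i, d) array of instances) to a bag-level
        # feature vector: the k-th feature is the bag's maximum Gaussian
        # similarity to the k-th instance prototype.
        features = np.zeros((len(bags), len(prototypes)))
        for i, bag in enumerate(bags):
            # squared Euclidean distance from every instance to every prototype
            d2 = ((bag[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
            # the closest instance in the bag defines the bag's similarity
            features[i] = np.exp(-d2 / sigma ** 2).max(axis=0)
        return features

The resulting feature matrix can then be handed to any standard supervised learner, such as a linear SVM.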
(1) We have improved the MILES algorithm by using a nonlinear SVM to map the distance features. How to combine the bag-level features with SVMs is one of the basic issues for an embedding-space-based MIL algorithm. Motivated by this issue, we examine the essential aims of the two existing combinations and then improve MILES by changing the one it uses. The two combinations are the distance features with a nonlinear SVM, and the similarity features with a linear SVM: the former uses a nonlinear SVM to achieve a nonlinear mapping of the distance features, while the latter achieves it with an exponential function. Comparing these two mapping modes, we conclude that combining the distance features with a nonlinear SVM is the more appropriate choice for an embedding-space-based MIL algorithm, and we improve MILES by adopting it. The experimental results show that the improved algorithm outperforms the original one in classification performance, computational efficiency, and robustness to labeling noise.

(2) We have presented a taxonomy of the existing embedding-space-based MIL algorithms. Motivated by the lack of studies in this direction, we classify these algorithms according to how they perform feature selection, grouping them into filter-based and embedded feature-selection algorithms. Moreover, previous work tuned the related parameters on the whole data set, which biases the reported experimental results. To address this problem, we re-evaluate the algorithms with cross-validation: the related parameters are tuned on the training set only, ensuring that no test example is involved in any part of the training procedure. This experiment also helps analyze the effect of different types of feature-selection methods on embedding-space-based MIL algorithms.

(3) We have proposed three embedding-space-based MIL algorithms based on a greedy approach to forming the feature subset. How to form the feature subset is another issue for embedding-space-based MIL algorithms. The previous statistical approach neglects the specifics of MIL itself, which introduces many redundant features into the feature subset and ultimately makes the entire learning process inefficient. To solve this problem, we propose a greedy approach tailored to MIL: from the bag-level features, select those whose corresponding instances score highest within their own training bags under a given feature-selection criterion (see the sketch below). Based on this greedy approach, we propose three embedding-space-based MIL algorithms. The experimental results demonstrate that the greedy algorithms achieve classification performance highly comparable to that of the original ones with higher efficiency; in other words, the greedy approach helps an algorithm strike a balance between effectiveness and efficiency.
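As an illustration of that per-bag greedy selection, the following sketch reuses the MILES-style similarity feature from above and scores each candidate by the absolute correlation between its induced feature column and the bag labels. The abstract does not fix a particular scoring function, so both bag_feature and the correlation criterion here are stand-in assumptions.

    import numpy as np

    def bag_feature(bags, prototype, sigma=1.0):
        # similarity of every training bag to one candidate prototype
        return np.array([np.exp(-((bag - prototype) ** 2).sum(axis=1)
                                / sigma ** 2).max()
                         for bag in bags])

    def greedy_prototypes(bags, labels, sigma=1.0):
        # One prototype per training bag: the instance whose induced
        # bag-level feature scores highest under the stand-in criterion.
        y = np.asarray(labels, dtype=float)
        prototypes = []
        for bag in bags:
            scores = [abs(np.corrcoef(bag_feature(bags, inst, sigma), y)[0, 1])
                      for inst in bag]
            prototypes.append(bag[int(np.argmax(scores))])
        return np.array(prototypes)

Because only one feature survives per bag, the feature subset stays the size of the training set rather than the (much larger) total number of instances.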
(4) We have improved the MILD algorithm by considering the generalization ability of its instance selection method. MILD takes the ability of an instance to separate the training bags as its instance selection criterion, but it ignores the ability of an instance to separate unknown bags; in other words, MILD does not take the generalization ability of its instance selection method into account. Moreover, MILD does not select instance prototypes from the negative bags. To overcome these drawbacks, we apply the classical cross-validation technique to its instance selection process and propose an improved algorithm. The major improvements are taking the classification capability of an instance on the validation set as the instance selection criterion, and additionally selecting negative instance prototypes. The experiments validate that the improved algorithm significantly improves the classification accuracy of the original one.

(5) We have proposed an embedding-space-based MIL algorithm based on pairwise instance similarity within a bag. Since the most similar pair of instances in a training bag may capture the target concept present in it, we use such pairs as the prototypes. The proposed algorithm performs very well on several classification tasks in terms of classification ability and computational efficiency. Furthermore, the noise-sensitivity test shows that it is very robust to labeling noise, mainly because its instance selection looks only at the inner structure of each training bag and never at the bag's label; a sketch of the selection step follows.
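A minimal sketch of that label-free selection step, assuming instance similarity is measured by Euclidean distance and that every bag holds at least two instances:

    import numpy as np
    from itertools import combinations

    def pairwise_prototypes(bags):
        # From each training bag, keep the two most similar instances as
        # prototypes. Bag labels are never consulted, which is why the
        # selection is insensitive to labeling noise.
        prototypes = []
        for bag in bags:
            i, j = min(combinations(range(len(bag)), 2),
                       key=lambda p: ((bag[p[0]] - bag[p[1]]) ** 2).sum())
            prototypes.extend([bag[i], bag[j]])
        return np.array(prototypes)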
Keywords/Search Tags: Multiple-instance learning, Embedding space, Feature selection, Instance selection, Generalization ability, Pairwise instance similarity