
Research Of Sampling Strategy In Active Learning Algorithms

Posted on: 2014-04-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W N Wu
Full Text: PDF
GTID: 1268330392972657
Subject: Artificial Intelligence and Information Processing
Abstract/Summary:
In fields such as text mining, speech recognition, bioinformatics data mining, and visual object classification, a common practical problem is that unlabeled examples are plentiful and easy to obtain, while labeled examples are scarce and expensive to acquire. Active learning, an important branch of machine learning, addresses this by exploiting labeled and unlabeled examples together in order to obtain a high-performing classification model. In this thesis, we make a thorough study of sampling strategies in active learning and then apply the proposed algorithms to real visual object classification tasks.

How to understand and exploit the semantic information contained in visual objects has long been an important problem. The rapid development of web technologies makes it possible to collect large numbers of images in a short time, and classifying visual objects using the semantic information extracted from these unsupervised or weakly supervised images has become a challenge. More and more researchers therefore focus on designing effective machine learning algorithms that build a model on labeled images and use the resulting knowledge to decide which category a visual object belongs to. This process usually requires a large number of precisely labeled images for training, which is expensive and time-consuming. Obtaining such a model at as low a cost as possible requires making full use of the annotator resource, thereby reducing the total labeling cost.

Active learning algorithms provide effective solutions for collecting and exploiting image annotations. First, a small number of images are chosen at random and annotated. Then, through interaction between the annotators and the model, the learning system freely chooses the unlabeled images it considers most helpful and queries their annotations.
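The query loop described above can be sketched in a few lines. This is a minimal, generic pool-based loop using uncertainty sampling with a simple logistic-regression base learner; the classifier, the seed-set choice, and the query budget here are illustrative assumptions, not the specific strategies proposed in the thesis:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, epochs=200):
    # Simple logistic regression by gradient descent -- an illustrative
    # stand-in for whatever base classifier the learning system uses.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        z = np.clip(X @ w + b, -30, 30)       # clip to avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def most_uncertain(X_pool, w, b):
    # Query the pool example whose predicted probability is closest to 0.5,
    # i.e. the one the current model is least sure about.
    z = np.clip(X_pool @ w + b, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    return int(np.argmin(np.abs(p - 0.5)))

# Toy data: two Gaussian clusters, classes 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 50, axis=0)
y = np.repeat([0, 1], 50)

# Small labeled seed set (two examples per class); the rest is the pool.
labeled = [0, 1, 50, 51]
unlabeled = [i for i in range(100) if i not in labeled]

for _ in range(10):                           # query budget of 10 annotations
    w, b = train_logreg(X[labeled], y[labeled])
    q = most_uncertain(X[unlabeled], w, b)
    labeled.append(unlabeled.pop(q))          # the "annotator" reveals y here

w, b = train_logreg(X[labeled], y[labeled])
acc = float(((X @ w + b > 0).astype(int) == y).mean())
```

After ten queries the model has been trained on only 14 labeled points, yet on this easily separable toy data it classifies the full set with high accuracy, which is the workload reduction the paragraph above describes.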
By letting the learning system decide what to ask for, the annotators' workload is reduced: this approach not only makes full use of scarce annotations but also transfers the annotators' knowledge into the learning system. It is therefore important to develop active learning algorithms for the classification and retrieval of visual objects.

Several active learning algorithms have already been used to reduce the total labeling cost of visual object classification and retrieval, and these works have achieved favorable performance on practical tasks. However, they often rely on idealized assumptions that make them unsuitable for noisy or large-scale data environments. In this thesis we focus on active learning algorithms: building on existing work, we explore sampling strategies that remain usable under noise or big data, use them to obtain accurate classification models at as low a labeling cost as possible, and finally apply the proposed algorithms to object classification and retrieval tasks. Our main contributions are as follows:

(1) A sampling strategy that weights examples based on structural risk. Targeting the idealized assumption that training and test data share the same distribution, we propose a sampling strategy that weights examples according to structural risk. Our goal is to prevent the drop in classification performance that occurs when the training and test distributions differ.
In our method, the expected structural-risk error between labeled and unlabeled data is used to estimate a weight for every unlabeled example, and the most helpful example is then chosen according to these weights. Experimental comparisons with other methods show that the proposed strategy effectively improves the performance of the classification model.

(2) A batch-mode method for constructing the training set. Targeting the imbalanced classification problem that arises when a database contains many objects but few of them belong to the same category, we propose a method that constructs the training set by selecting examples in batch mode. Our goal is to avoid the adverse effect of the large number of negative examples and thus improve classifier performance. In our method, the training distribution is estimated by minimizing the variance of the structural risk, and a group of examples is then selected according to the estimated distribution. Experimental results show that, for comparable classifier performance, our labeling cost is lower than that of competing methods.

(3) A multi-annotator probabilistic model for active learning. Targeting the idealized assumption that a single annotator provides accurate annotations for the selected examples, we propose a multi-annotator probabilistic active learning model for noisy annotations. Our goal is to reduce the effect of varying annotation quality across multiple annotators. In our probabilistic model, the total labeling cost is reduced and classification performance is improved by choosing the most reliable annotator to label each selected example and by estimating the true annotation.
Experimental comparisons with other methods show that the proposed probabilistic model effectively reduces the impact of noisy annotations and thereby improves the performance of the classification model.

(4) A hash-based sampling strategy for active learning. Targeting the large time cost of selecting examples from massive data, we propose a hash-based sampling strategy. Our goal is to return the selected examples quickly and thus reduce the time needed to obtain a classification model. In our method, the important weight elements of the classification model's parameter vector are selected, and the approximate distances between the unlabeled examples and the classification boundary are then estimated. Experimental results show that the proposed algorithm effectively reduces the time cost.
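One way to read the core idea of contribution (4) — ranking pool examples by an approximate distance to a linear decision boundary computed from only the dominant weight elements — is the top-k truncation below. This is an illustrative sketch of that approximation, not the thesis's exact hashing scheme; the weight vector and pool here are synthetic:

```python
import numpy as np

def approx_margin_query(X_pool, w, b, k):
    # Keep only the k largest-magnitude weights (a crude reading of the
    # "important weight elements" idea) and rank pool examples by their
    # approximate distance to the linear decision boundary w.x + b = 0.
    idx = np.argsort(np.abs(w))[-k:]          # indices of dominant weights
    w_k = np.zeros_like(w)
    w_k[idx] = w[idx]
    dist = np.abs(X_pool @ w_k + b) / (np.linalg.norm(w_k) + 1e-12)
    return int(np.argmin(dist))               # closest-to-boundary example

rng = np.random.default_rng(1)
w = np.array([2.0, -0.1, 0.05, 1.5])          # two weights dominate the model
X_pool = rng.normal(size=(1000, 4))

q_full = approx_margin_query(X_pool, w, 0.0, k=4)    # exact margin (all weights)
q_approx = approx_margin_query(X_pool, w, 0.0, k=2)  # truncated margin
```

Because the discarded weights are small, the example chosen with only two components still lies very close to the true boundary, while the per-example dot product touches half as many dimensions — the same time-versus-exactness trade the paragraph above describes.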
Keywords/Search Tags: Active learning, Importance sampling, Cost-sensitive sampling, Multiple annotators, Hash technique, Classification and retrieval of visual objects