
Research On Machine Learning Methods That Exploit Unlabeled Data

Posted on: 2018-06-27    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Guo    Full Text: PDF
GTID: 2348330512998080    Subject: Computer Science and Technology
Abstract/Summary:
Traditional machine learning needs labeled data to train a good model, but assigning labels to training data is expensive because it requires human effort, whereas unlabeled data is usually much easier to acquire in real-world applications. Researchers have therefore long been interested in using this abundant unlabeled data to help the learning process. There are currently two main approaches. The first is semi-supervised learning, which automatically exploits unlabeled data together with a small amount of labeled data to improve learning performance; however, semi-supervised methods can sometimes degrade performance by using unlabeled data. The second approach, which avoids this risk, is crowdsourcing, which provides large amounts of labels at lower cost. The contributions of this thesis are as follows:

1. Existing work has pointed out that insufficient views can cause performance degradation for co-training, a classical semi-supervised learning algorithm: with insufficient views, examples that are inconsistent with the optimal classifier can arise during the co-training process. We propose the Weighted Co-training algorithm to address this problem. It identifies potentially inconsistent data and decreases their weight to avoid the risk (see the first sketch after this abstract). Experimental results demonstrate that, compared with standard co-training, Weighted Co-training achieves better accuracy and robustness.

2. In crowdsourcing, label quality usually depends on how difficult a task is. Based on this observation, we give a new crowdsourcing task-assignment algorithm. It first estimates the difficulty of a small portion of the tasks and uses these data to train a model that predicts the difficulty of the remaining ones. With the predicted difficulty, the tasks can be divided into easy and hard parts: for easy tasks, the labels provided by the crowd are of sufficiently high quality, while the hard ones are better assigned to specialized workers (see the second sketch after this abstract). Experimental results show that the proposed approach significantly improves label quality and reduces labeling cost.

Besides the work above, we have also studied how to reuse pre-trained models by exploiting unlabeled data. In this model-reuse problem, users need to combine the outputs of unmodifiable pre-trained models into a final prediction. We propose a new multi-view model-reuse algorithm. It estimates each pre-trained model's reliability with a belief-propagation method in which the view consistency of unlabeled data is used as regularization; with the estimated reliability, the models are combined by weighted majority voting (see the third sketch after this abstract). Experimental results show that our algorithm significantly improves prediction accuracy.
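First sketch: a minimal illustration of the weighted co-training idea from contribution 1. The abstract does not spell out how inconsistent examples are detected, so this sketch assumes that disagreement between the two view classifiers flags an example as potentially inconsistent and that its pseudo-label weight is reduced; the function name weighted_co_training, the use of scikit-learn logistic regression, and the down_weight parameter are illustrative assumptions, not the thesis's actual implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def weighted_co_training(X1_l, X2_l, y_l, X1_u, X2_u,
                             rounds=10, batch=20, down_weight=0.1):
        """Two-view co-training where suspect pseudo-labels get smaller weights."""
        X1, X2, y = X1_l.copy(), X2_l.copy(), y_l.copy()
        w = np.ones(len(y))                      # labeled examples keep full weight
        pool = np.arange(len(X1_u))              # indices of still-unlabeled examples
        h1 = LogisticRegression(max_iter=1000)
        h2 = LogisticRegression(max_iter=1000)
        for _ in range(rounds):
            if len(pool) == 0:
                break
            h1.fit(X1, y, sample_weight=w)
            h2.fit(X2, y, sample_weight=w)
            p1 = h1.predict_proba(X1_u[pool])
            p2 = h2.predict_proba(X2_u[pool])
            conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
            idx = np.argsort(-conf)[:batch]      # positions of the most confident picks
            pick = pool[idx]
            y1 = h1.classes_[p1[idx].argmax(axis=1)]
            y2 = h2.classes_[p2[idx].argmax(axis=1)]
            # take the label from whichever view is more confident
            pseudo = np.where(p1[idx].max(axis=1) >= p2[idx].max(axis=1), y1, y2)
            # views that disagree mark a potentially inconsistent example:
            # decrease its weight instead of trusting it fully
            w_new = np.where(y1 == y2, 1.0, down_weight)
            X1 = np.vstack([X1, X1_u[pick]])
            X2 = np.vstack([X2, X2_u[pick]])
            y = np.concatenate([y, pseudo])
            w = np.concatenate([w, w_new])
            pool = np.delete(pool, idx)
        return h1, h2

Down-weighting rather than discarding disagreed examples keeps the information carried by the unlabeled data while limiting the damage a wrong pseudo-label can do.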
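Second sketch: the task-assignment procedure of contribution 2. The sketch assumes that the difficulty of a small seed set of tasks is estimated from disagreement among crowd workers and that a random-forest regressor predicts the difficulty of the remaining tasks from task features; the names assign_tasks and seed_crowd_labels and the threshold value are hypothetical and stand in for details the abstract does not give.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def assign_tasks(task_features, seed_idx, seed_crowd_labels, threshold=0.3):
        """Split tasks into crowd-suitable (easy) and expert-needed (hard) ones.

        seed_crowd_labels: one array of crowd labels per seed task; disagreement
        among the workers' labels serves as a proxy for task difficulty.
        """
        # 1. estimate the difficulty of the seed tasks from crowd disagreement
        difficulty = []
        for labels in seed_crowd_labels:
            values, counts = np.unique(labels, return_counts=True)
            difficulty.append(1.0 - counts.max() / counts.sum())   # 0 = full agreement
        difficulty = np.array(difficulty)

        # 2. learn to predict the difficulty of the remaining tasks from their features
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(task_features[seed_idx], difficulty)
        rest_idx = np.setdiff1d(np.arange(len(task_features)), seed_idx)
        predicted = model.predict(task_features[rest_idx])

        # 3. easy tasks go to the crowd, hard ones to specialized workers
        easy = rest_idx[predicted <= threshold]
        hard = rest_idx[predicted > threshold]
        return easy, hard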
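Third sketch: reliability-weighted combination for the model-reuse part. The thesis estimates each pre-trained model's reliability with a belief-propagation method regularized by the view consistency of unlabeled data; that step is replaced here by a much simpler iterative agreement-with-consensus estimate, so the code illustrates only the weighted-majority-voting combination, not the thesis's actual reliability estimator.

    import numpy as np

    def combine_pretrained(predictions, n_iter=5):
        """Weighted majority voting over fixed, unmodifiable pre-trained models.

        predictions: (n_models, n_samples) array of each model's predicted labels
        on unlabeled data. Reliability is approximated by how often a model
        agrees with the current weighted consensus.
        """
        n_models, n_samples = predictions.shape
        classes = np.unique(predictions)
        weights = np.ones(n_models) / n_models
        for _ in range(n_iter):
            # weighted vote per sample
            votes = np.zeros((len(classes), n_samples))
            for m in range(n_models):
                for c_idx, c in enumerate(classes):
                    votes[c_idx] += weights[m] * (predictions[m] == c)
            consensus = classes[votes.argmax(axis=0)]
            # a model that agrees with the consensus more often is deemed more reliable
            agreement = (predictions == consensus).mean(axis=1)
            weights = agreement / agreement.sum()
        return consensus, weights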
Keywords/Search Tags: machine learning, semi-supervised learning, co-training, crowdsourcing, task difficulty, model reuse