
Research On Machine Learning Methods That Exploit Unlabeled Data

Posted on: 2018-06-27    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Guo    Full Text: PDF
GTID: 2348330512998080    Subject: Computer Science and Technology
Abstract/Summary:
Traditional machine learning needs labeled data to train a good model, but assigning labels to training data is expensive because it requires human effort, whereas unlabeled data is usually much easier to acquire in real-world applications. Researchers have therefore long been interested in using this abundant unlabeled data to help the learning process. There are currently two main approaches. The first is semi-supervised learning, which automatically exploits unlabeled data together with a small amount of labeled data to improve learning performance; however, semi-supervised methods can sometimes degrade performance by using unlabeled data. The second approach, which avoids this risk, is crowdsourcing, which provides large amounts of labels at lower cost. The contributions of this thesis are as follows:

1. Existing work has pointed out that insufficient views can cause performance degradation for co-training, a classical semi-supervised learning algorithm: with insufficient views, examples that are inconsistent with the optimal classifier can arise during the co-training process. We propose the Weighted Co-training algorithm to address this problem. It identifies potentially inconsistent data and decreases their weight to avoid the risk (see the first sketch after this abstract). Experimental results demonstrate that, compared with standard co-training, Weighted Co-training achieves better accuracy and robustness.

2. In crowdsourcing, label quality usually depends on how difficult a task is. Based on this observation, we give a new crowdsourcing task-assignment algorithm. It first estimates the difficulty of a small portion of the tasks and uses these data to train a model that predicts the difficulty of the remaining ones. With the predicted difficulty, the tasks can be divided into easy and hard parts: for easy tasks, the labels provided by the crowd are of sufficiently high quality, while the hard ones are better assigned to specialized workers (see the second sketch after this abstract). Experimental results show that the proposed approach significantly improves label quality and reduces labeling cost.

Besides the work above, we have also studied how to reuse pre-trained models by exploiting unlabeled data. In this model-reuse problem, users need to combine the outputs of unmodifiable pre-trained models into a final prediction. We propose a new multi-view model-reuse algorithm. It estimates each pre-trained model's reliability with a belief-propagation method in which the view consistency of unlabeled data is used as regularization; with the estimated reliability, the models are combined by weighted majority voting (see the third sketch after this abstract). Experimental results show that our algorithm significantly improves prediction accuracy.
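First sketch: a minimal illustration of the weighted co-training idea from contribution 1. The abstract does not spell out how inconsistent examples are detected, so this sketch assumes that disagreement between the two view classifiers flags an example as potentially inconsistent and that its pseudo-label weight is reduced; the function name weighted_co_training, the use of scikit-learn logistic regression, and the down_weight parameter are illustrative assumptions, not the thesis's actual implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def weighted_co_training(X1_l, X2_l, y_l, X1_u, X2_u,
                             rounds=10, batch=20, down_weight=0.1):
        """Two-view co-training where suspect pseudo-labels get smaller weights."""
        X1, X2, y = X1_l.copy(), X2_l.copy(), y_l.copy()
        w = np.ones(len(y))                      # labeled examples keep full weight
        pool = np.arange(len(X1_u))              # indices of still-unlabeled examples
        h1 = LogisticRegression(max_iter=1000)
        h2 = LogisticRegression(max_iter=1000)
        for _ in range(rounds):
            if len(pool) == 0:
                break
            h1.fit(X1, y, sample_weight=w)
            h2.fit(X2, y, sample_weight=w)
            p1 = h1.predict_proba(X1_u[pool])
            p2 = h2.predict_proba(X2_u[pool])
            conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
            idx = np.argsort(-conf)[:batch]      # positions of the most confident picks
            pick = pool[idx]
            y1 = h1.classes_[p1[idx].argmax(axis=1)]
            y2 = h2.classes_[p2[idx].argmax(axis=1)]
            # take the label from whichever view is more confident
            pseudo = np.where(p1[idx].max(axis=1) >= p2[idx].max(axis=1), y1, y2)
            # views that disagree mark a potentially inconsistent example:
            # decrease its weight instead of trusting it fully
            w_new = np.where(y1 == y2, 1.0, down_weight)
            X1 = np.vstack([X1, X1_u[pick]])
            X2 = np.vstack([X2, X2_u[pick]])
            y = np.concatenate([y, pseudo])
            w = np.concatenate([w, w_new])
            pool = np.delete(pool, idx)
        return h1, h2

Down-weighting rather than discarding disagreed examples keeps the information carried by the unlabeled data while limiting the damage a wrong pseudo-label can do.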
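Second sketch: the task-assignment procedure of contribution 2. The sketch assumes that the difficulty of a small seed set of tasks is estimated from disagreement among crowd workers and that a random-forest regressor predicts the difficulty of the remaining tasks from task features; the names assign_tasks and seed_crowd_labels and the threshold value are hypothetical and stand in for details the abstract does not give.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def assign_tasks(task_features, seed_idx, seed_crowd_labels, threshold=0.3):
        """Split tasks into crowd-suitable (easy) and expert-needed (hard) ones.

        seed_crowd_labels: one array of crowd labels per seed task; disagreement
        among the workers' labels serves as a proxy for task difficulty.
        """
        # 1. estimate the difficulty of the seed tasks from crowd disagreement
        difficulty = []
        for labels in seed_crowd_labels:
            values, counts = np.unique(labels, return_counts=True)
            difficulty.append(1.0 - counts.max() / counts.sum())   # 0 = full agreement
        difficulty = np.array(difficulty)

        # 2. learn to predict the difficulty of the remaining tasks from their features
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(task_features[seed_idx], difficulty)
        rest_idx = np.setdiff1d(np.arange(len(task_features)), seed_idx)
        predicted = model.predict(task_features[rest_idx])

        # 3. easy tasks go to the crowd, hard ones to specialized workers
        easy = rest_idx[predicted <= threshold]
        hard = rest_idx[predicted > threshold]
        return easy, hard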
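Third sketch: reliability-weighted combination for the model-reuse part. The thesis estimates each pre-trained model's reliability with a belief-propagation method regularized by the view consistency of unlabeled data; that step is replaced here by a much simpler iterative agreement-with-consensus estimate, so the code illustrates only the weighted-majority-voting combination, not the thesis's actual reliability estimator.

    import numpy as np

    def combine_pretrained(predictions, n_iter=5):
        """Weighted majority voting over fixed, unmodifiable pre-trained models.

        predictions: (n_models, n_samples) array of each model's predicted labels
        on unlabeled data. Reliability is approximated by how often a model
        agrees with the current weighted consensus.
        """
        n_models, n_samples = predictions.shape
        classes = np.unique(predictions)
        weights = np.ones(n_models) / n_models
        for _ in range(n_iter):
            # weighted vote per sample
            votes = np.zeros((len(classes), n_samples))
            for m in range(n_models):
                for c_idx, c in enumerate(classes):
                    votes[c_idx] += weights[m] * (predictions[m] == c)
            consensus = classes[votes.argmax(axis=0)]
            # a model that agrees with the consensus more often is deemed more reliable
            agreement = (predictions == consensus).mean(axis=1)
            weights = agreement / agreement.sum()
        return consensus, weights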
Keywords/Search Tags: machine learning, semi-supervised learning, co-training, crowdsourcing, task difficulty, model reuse