Research On The Algorithm Improving The Quality Of Crowdsourcing Data Labeling

Posted on: 2020-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: P J Yang
GTID: 2428330596968164
Subject: Software engineering
Abstract/Summary:
Large-scale labeled data is indispensable to the development and application of many data-driven research fields, particularly areas of artificial intelligence such as machine learning. In recent years, collecting labeled data through crowdsourcing systems has become increasingly popular, because it makes large quantities of labeled data available to researchers quickly and at low cost. However, the quality of the labels cannot be guaranteed, owing to the instability of crowdsourcing labelers. To improve the quality of crowdsourced labels, researchers have proposed a number of effective ground truth inference algorithms, which infer the true label of each labeled instance. To address the same problem, this thesis proposes two algorithms that outperform widely recognized benchmark algorithms. The main work is summarized below.

(1) The causes of poor-quality crowdsourced labels are analyzed, and the problem of ground truth inference, whose aim is to improve the quality of crowdsourced labeling, is systematically defined. The principles and implementation of several typical benchmark algorithms are discussed to provide a basis for the proposed algorithms and for the comparative experiments in the thesis.

(2) A ground truth inference algorithm based on gold standard data and incentive strategies is proposed. First, while using the gold standard data, the method takes full account of the types of labelers found in a real crowdsourcing environment and filters out poor-quality labelers. Second, the labeling ability of each labeler is estimated more reasonably, addressing shortcomings of the ELICE algorithm, which relies on the same gold standard data. Lastly, the effect of the algorithm is further enhanced by improving the incentive strategies used to motivate labelers.

(3) A ground truth inference algorithm based on labeler capability and labeling difficulty is proposed. The method does not rely on gold standard data and is applicable to multi-label labeling tasks. It considers labeler capability and the difficulty of each instance to establish an effective probability model of multi-label labeling, and it estimates instance difficulty in a reasonable way. Finally, the EM iterative algorithm is used to obtain the maximum likelihood estimate of the labeling model and to infer the true labels of the labeled data.

Open-source experimental tools are used to compare the proposed algorithms with other benchmark algorithms on multiple public datasets. The experimental results and analysis verify the effectiveness of the two proposed algorithms and their advantages over several other algorithms.
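
To make the idea of EM-based ground truth inference concrete, the following is a minimal sketch and not the model proposed in the thesis. It assumes binary labels and a single accuracy parameter per labeler (a "one-coin" model), and it ignores instance difficulty and gold standard data entirely. The function name `em_one_coin` and the `votes` data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def em_one_coin(votes, n_iter=50, tol=1e-6):
    """Simplified EM for binary crowdsourced labels (illustrative assumption).

    votes: list of (instance_id, labeler_id, label) with label in {0, 1}.
    Returns (post, acc): post[i] is the estimated P(true label of i is 1),
    acc[w] is the estimated accuracy of labeler w.
    """
    by_instance = defaultdict(list)
    for inst, labeler, y in votes:
        by_instance[inst].append((labeler, y))

    # Initialise each posterior with the fraction of votes for label 1.
    post = {i: sum(y for _, y in v) / len(v) for i, v in by_instance.items()}
    acc = {}

    for _ in range(n_iter):
        # M-step: a labeler's accuracy is the expected share of their votes
        # that agree with the current soft estimate of the true labels.
        agree = defaultdict(float)
        count = defaultdict(float)
        for i, v in by_instance.items():
            for labeler, y in v:
                agree[labeler] += post[i] if y == 1 else 1.0 - post[i]
                count[labeler] += 1.0
        # Clamp to avoid zero probabilities in the E-step.
        acc = {w: min(max(agree[w] / count[w], 0.01), 0.99) for w in count}

        # E-step: recompute P(true label = 1) under a uniform prior.
        new_post = {}
        for i, v in by_instance.items():
            p1 = p0 = 0.5
            for labeler, y in v:
                a = acc[labeler]
                p1 *= a if y == 1 else 1.0 - a
                p0 *= a if y == 0 else 1.0 - a
            new_post[i] = p1 / (p1 + p0)

        change = max(abs(new_post[i] - post[i]) for i in post)
        post = new_post
        if change < tol:
            break

    return post, acc


# Hypothetical toy data: labelers A and B agree with each other,
# labeler C always gives the opposite label.
votes = [
    (0, "A", 1), (0, "B", 1), (0, "C", 0),
    (1, "A", 0), (1, "B", 0), (1, "C", 1),
    (2, "A", 1), (2, "B", 1), (2, "C", 0),
    (3, "A", 0), (3, "B", 0), (3, "C", 1),
]
post, acc = em_one_coin(votes)
inferred = {i: int(p > 0.5) for i, p in post.items()}
```

In this toy run the inferred labels follow A and B, and the estimated accuracy of C falls below 0.5. A model closer to the one described in the abstract would add a per-instance difficulty parameter and support more than two label values, which this sketch deliberately leaves out.
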
Keywords/Search Tags: crowdsourcing, label quality, ground truth inference, gold standard data, incentive mechanism, machine learning