Research On Crowdsourcing Learning Methods Based On Data Mining

Posted on: 2023-05-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Wu
Full Text: PDF
GTID: 1528307331472004
Subject: Computer application technology
Abstract/Summary:
Crowdsourcing offers an alternative to traditional expert labeling, which is expensive and time-consuming. Through crowdsourcing platforms, researchers in data mining and machine learning can obtain large amounts of labeled data quickly and at low cost, and use it to build supervised learning models. Crowdsourcing therefore helps to update learning models quickly and eases the difficulty of collecting labeled data. However, unlike high-quality annotations collected from domain experts, the quality of annotations provided by non-professional crowdsourcing workers cannot be guaranteed, which poses a major challenge to the use of crowdsourcing learning in artificial intelligence. Two questions have become active research topics and application directions in machine learning: how to obtain high-quality integrated labels from noisy crowdsourced annotations through ground truth inference, and how to use noisy data to train high-quality classification models efficiently and accurately. This dissertation combines crowdsourcing learning with deep learning, active learning, and other techniques to study in depth how noisy crowdsourced annotations can be used effectively. The main work and contributions are as follows:

(1) Subjectivity-aware ground truth inference algorithm. A novel ground truth inference algorithm based on expectation maximization (EM) is proposed to address two problems: subjective bias caused by annotators' limited professional knowledge, and the distortion of ground truth inference by malicious annotators. The reliability and subjectivity of annotators and the difficulty of samples are used as the parameters of a probabilistic model, and the EM algorithm iteratively estimates the hidden variables, i.e., the ground truth of the samples, while computing and updating the parameter values. During the iterations, the estimated annotator parameters are used dynamically to identify malicious annotators, and the annotations they provide are adjusted promptly (corrected or discarded) to reduce their impact on the algorithm's performance. Finally, the effectiveness of the proposed algorithm is verified through comparative experiments against other classical algorithms on different datasets, especially datasets with high professional requirements.

(2) Annotation prediction algorithm based on deep learning. To address the information loss that occurs during ground truth inference in traditional two-step crowdsourcing models, an end-to-end deep neural network model is proposed. The model uses a special neural network structure, the "crowdsourcing layer", to model the labeling behavior of the annotators and to learn their reliability and bias. It can use the noisy crowdsourced annotations directly as the target output to train the neural network end to end, without prior ground truth inference. Once trained, the model can be used as a classifier for the classification task. In addition, the algorithm is applied to relevance evaluation tasks in information retrieval systems over heterogeneous data, and a deep neural network for heterogeneous data fusion is constructed. This model uses two different neural networks to extract features from the two kinds of heterogeneous data, fuses those features by means of a similarity factor, and then trains a neural network on the fused features. Experimental results show that the proposed deep neural network achieves higher ground truth inference accuracy and better classification performance than other classification algorithms.

(3) Multi-label learning algorithm in crowdsourcing based on active learning. To address the high labeling cost and low classification accuracy of multi-label learning, a crowdsourced multi-label learning algorithm based on active learning is proposed. A set of classifiers is trained to classify and predict multi-label data, a probabilistic model is constructed from the classifier parameters and the reliability of the annotations, and the EM algorithm is used to estimate the ground truth of the samples together with the model parameters. To exploit the potential correlation among labels, the algorithm introduces sample similarity information, encodes the multi-label correlation of samples, and generates an enhanced feature for each sample to improve classifier performance. In addition, a group of active learning strategies is proposed that, during the iterations, dynamically selects the most valuable samples and labels and the most reliable annotator for labeling, thereby allocating resources reasonably and reducing labeling cost. Experimental results show that the proposed active learning algorithm converges rapidly with fewer annotations and achieves better classification performance than other multi-label learning algorithms.
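
The EM-based ground truth inference described in contribution (1) can be illustrated with a deliberately simplified sketch. This is not the dissertation's model: it assumes binary labels and gives each annotator a single accuracy parameter, with no subjectivity or sample-difficulty terms. The function name `em_ground_truth`, the toy data, and the 0.5 threshold for flagging a possibly malicious annotator are all hypothetical choices made for illustration.

```python
import numpy as np

def em_ground_truth(labels, n_iter=50):
    """Simplified EM for binary crowd labels (illustrative assumption,
    not the dissertation's algorithm).

    labels: (n_samples, n_annotators) array of 0/1, with -1 for "no label".
    Returns (q, acc): q[i] is the posterior P(truth_i = 1), and acc[j] is
    the estimated accuracy of annotator j.
    """
    labels = np.asarray(labels)
    observed = labels >= 0
    n_obs_per_annotator = np.maximum(observed.sum(axis=0), 1)

    # Initialise the posterior with the majority-vote fraction.
    votes_for_1 = ((labels == 1) & observed).sum(axis=1)
    q = votes_for_1 / np.maximum(observed.sum(axis=1), 1)

    for _ in range(n_iter):
        # M-step: accuracy = expected fraction of labels agreeing with truth.
        agree = np.where(labels == 1, q[:, None], 1.0 - q[:, None])
        agree = np.where(observed, agree, 0.0)
        acc = np.clip(agree.sum(axis=0) / n_obs_per_annotator, 1e-6, 1 - 1e-6)
        prior = np.clip(q.mean(), 1e-6, 1 - 1e-6)

        # E-step: posterior of the hidden true label given all annotations.
        log_a, log_na = np.log(acc), np.log1p(-acc)
        ll1 = np.where(labels == 1, log_a, log_na)  # log P(label | truth = 1)
        ll0 = np.where(labels == 0, log_a, log_na)  # log P(label | truth = 0)
        ll1 = np.where(observed, ll1, 0.0).sum(axis=1) + np.log(prior)
        ll0 = np.where(observed, ll0, 0.0).sum(axis=1) + np.log1p(-prior)
        q = 1.0 / (1.0 + np.exp(np.clip(ll0 - ll1, -500, 500)))

    return q, acc

# Toy data (hypothetical): three mostly reliable annotators and one
# annotator who always gives the opposite of the true label.
truth = np.array([1, 0, 1, 1, 0, 0, 1, 0])
a0 = truth.copy()
a1 = truth.copy(); a1[0] = 1 - a1[0]   # one mistake
a2 = truth.copy(); a2[1] = 1 - a2[1]   # one mistake
a3 = 1 - truth                          # adversarial annotator
labels = np.stack([a0, a1, a2, a3], axis=1)

q, acc = em_ground_truth(labels)
inferred = (q > 0.5).astype(int)
suspected = np.flatnonzero(acc < 0.5)   # hypothetical "malicious" threshold
print(inferred, acc.round(3), suspected)
```

In this toy setting, majority voting ties on the two samples where a reliable annotator errs, while the EM estimate down-weights the adversarial annotator and can break those ties. A model closer to the one described above would add per-annotator subjectivity and per-sample difficulty parameters, and would correct or discard the flagged annotator's labels inside the loop.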
Keywords/Search Tags: crowdsourcing learning, weakly supervised learning, active learning, ground truth inference, deep learning, multi-label classification