Research On Key Technologies Of Data Cleaning Based On Crowdsourcing

Posted on: 2016-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: C Ye
Full Text: PDF
GTID: 2308330479490030
Subject: Computer Science and Technology
Abstract/Summary:
The rise of the internet and digital technology industries has led to a sharp increase in the volume and variety of data, and the importance of data quality is now widely recognized. Data cleaning is a natural way to improve data quality. However, existing data cleaning methods are often computationally expensive; some are NP-hard or not computable at all. In addition, because they lack human knowledge, existing methods often cannot find the proper values for dirty data, so the accuracy of their results is insufficient. In this thesis, we present two active learning frameworks combined with crowdsourcing for data cleaning. We use crowdsourcing to confirm true values, and an active learning mechanism to reduce cost and improve machine learning accuracy. We present three crowdsourcing-based active learning algorithms that address truth discovery, missing values, and entity recognition, and we then put forward a data cleaning system based on crowdsourcing.

The main contributions of this thesis fall into three aspects. First, we design active learning models suited to the characteristics of data cleaning problems. Second, we use a crowdsourcing platform to speed up repair relative to the original model and to increase the accuracy of data cleaning. Third, we present a framework for a data cleaning system based on crowdsourcing. We describe the three aspects briefly below.

Firstly, we present two active learning models for different data cleaning scenarios: an upfront active learning model and an iterative active learning model. We select a small number of highly informative records and send them to the crowdsourcing platform for labeling, which improves the accuracy of data cleaning.

Secondly, we apply a crowdsourcing platform to data cleaning for the first time, supplying the human knowledge needed to find true values.

Finally, we propose a data cleaning system based on crowdsourcing. The system gives the user a friendly interface for dealing with different data quality problems.
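The abstract does not spell out how informative records are chosen or how crowd answers are combined. As a purely illustrative sketch, and not the thesis's actual algorithm, the following Python assumes entropy-based uncertainty sampling for selection and simple majority voting for aggregation. The function names, the `candidates` data layout, and the sample values are all hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_crowd(candidates, budget):
    """Return the ids of the `budget` most uncertain records.

    `candidates` (hypothetical layout) maps a record id to a dict of
    {candidate_value: model_probability}. Higher entropy is treated as
    "more informative", which is an assumption of this sketch.
    """
    ranked = sorted(
        candidates,
        key=lambda rid: entropy(candidates[rid].values()),
        reverse=True,
    )
    return ranked[:budget]


def majority_vote(answers):
    """Aggregate crowd answers for one record by picking the most common one."""
    return Counter(answers).most_common(1)[0][0]


def clean(candidates, ask_crowd, budget):
    """Upfront-style pass: crowd-label the uncertain records, trust the model elsewhere.

    `ask_crowd` is a hypothetical callable that takes a record id and
    returns a list of worker answers for it.
    """
    to_ask = set(select_for_crowd(candidates, budget))
    repaired = {}
    for rid, dist in candidates.items():
        if rid in to_ask:
            repaired[rid] = majority_vote(ask_crowd(rid))
        else:
            # Fall back to the model's most probable candidate value.
            repaired[rid] = max(dist, key=dist.get)
    return repaired
```

An iterative variant would, under the same assumptions, call `select_for_crowd` repeatedly and update the model's probabilities between rounds using the answers already collected.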
Keywords/Search Tags: data cleaning, truth discovery, missing values, entity recognition, crowdsourcing