Research On Key Technologies Of On-demand Cleaning For Dirty Data

Posted on: 2019-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Qi
Full Text: PDF
GTID: 2428330566496877
Subject: Computer technology
Abstract/Summary:
In recent years, as the information age has developed, the amount of data has grown dramatically. At the same time, dirty data have come to exist in many types of databases. Because dirty data degrade the results of data mining and machine learning, data quality issues have attracted widespread attention. Understanding the relationship between data quality and the accuracy of results can guide both the selection of an appropriate algorithm for a given level of data quality and the decision of how much of the data to clean. However, little research has explored this relationship. Motivated by this, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.

After obtaining the specific impacts of different types of dirty data on different algorithms, this paper focuses on dirty data cleaning. Many data cleaning approaches exist. Among them, crowdsourced cleaning is a novel method for cleaning dirty values that automatic approaches can hardly fill. However, the time cost and overhead of crowdsourcing are high, so it is necessary to reduce cost while guaranteeing the accuracy of crowdsourced cleaning. To achieve this optimization goal, COSSET+, a crowdsourced framework optimized by a knowledge base, is presented. It combines the advantages of a knowledge-based filter and a crowdsourcing platform. Since the number of crowd values affects the cost of COSSET+, the goal is to select only part of the dirty values to be crowdsourced. This paper proves that the crowd value selection problem is NP-hard and develops an approximation algorithm for it. Experimental results demonstrate the efficiency and effectiveness of the proposed approaches.

However, since data cleaning is expensive, many users demand that cleaning costs be kept within a limited budget. Therefore, how to clean data selectively according to users' needs has become an urgent problem. To solve it, this paper takes the cost-sensitive decision tree as an example and proposes three on-demand data cleaning algorithms: a step-by-step on-demand cleaning algorithm based on splitting attribute benefits, a one-time on-demand cleaning algorithm based on splitting attribute benefits and cleaning costs, and a step-by-step on-demand cleaning algorithm based on splitting attribute benefits and cleaning costs. Experimental results demonstrate the effectiveness of the presented algorithms.
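To make the idea of cleaning under a cost limit concrete, the following is a minimal sketch of one plausible selection rule: rank attributes by estimated splitting benefit per unit of cleaning cost and clean them greedily until the budget is exhausted. This is an illustrative assumption, not the algorithm from the thesis; the function name `select_attributes_to_clean`, the benefit scores, and the cost values are all hypothetical.

```python
from typing import Dict, List, Tuple


def select_attributes_to_clean(
    benefit: Dict[str, float],
    cost: Dict[str, float],
    budget: float,
) -> Tuple[List[str], float]:
    """Greedily pick attributes to clean within a cost budget.

    Assumptions (not from the thesis): `benefit` holds an estimated
    splitting benefit per attribute, `cost` holds a strictly positive
    cleaning cost per attribute, and both use the same keys.
    """
    # Rank attributes by benefit per unit of cleaning cost, best first.
    ranked = sorted(benefit, key=lambda a: benefit[a] / cost[a], reverse=True)

    chosen: List[str] = []
    spent = 0.0
    for attr in ranked:
        # Take an attribute only if it still fits in the remaining budget.
        if spent + cost[attr] <= budget:
            chosen.append(attr)
            spent += cost[attr]
    return chosen, spent


# Hypothetical benefit scores and cleaning costs for three attributes.
benefit = {"age": 0.30, "income": 0.45, "zip": 0.05}
cost = {"age": 2.0, "income": 5.0, "zip": 1.0}
chosen, spent = select_attributes_to_clean(benefit, cost, budget=6.0)
print(chosen, spent)
```

A ratio-based greedy rule like this is a common heuristic for budgeted selection; it does not guarantee the best possible set, which is one reason a step-by-step variant that re-evaluates benefits after each cleaning round may behave differently.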
Keywords/Search Tags: dirty data, on-demand cleaning, crowdsourcing, knowledge base, decision tree