Research On Key Technologies Of On-demand Cleaning For Dirty Data

Posted on:2019-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:Z X Qi

Full Text:PDF

GTID:2428330566496877

Subject:Computer technology

Abstract/Summary:

In recent years, with the development of the information age, the amount of data has grown dramatically. At the same time, dirty data are now present in many types of databases. Because dirty data degrade the results of data mining and machine learning, data quality issues have attracted widespread attention. Understanding the relationship between data quality and result accuracy helps in selecting an appropriate algorithm for a given level of data quality and in determining what share of the data to clean. However, little research has explored this relationship. Motivated by this, this thesis experimentally compares the effects of missing, inconsistent, and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.

Having established the specific impacts of different types of dirty data on different algorithms, this thesis then turns to dirty data cleaning. Many data cleaning approaches exist. Among them, crowdsourced cleaning is a novel method for repairing dirty values that automatic approaches can hardly fill. However, crowdsourcing incurs high time cost and overhead, so it is necessary to reduce cost while guaranteeing the accuracy of crowdsourced cleaning. To achieve this optimization goal, we present COSSET+, a crowdsourced framework optimized by a knowledge base, which combines the advantages of a knowledge-based filter and a crowdsourcing platform. Since the number of crowdsourced values affects the cost of COSSET+, the goal is to select only part of the dirty values for crowdsourcing. This thesis proves that the crowd value selection problem is NP-hard and develops an approximation algorithm for it. Experimental results demonstrate the efficiency and effectiveness of the proposed approaches.

However, because data cleaning is expensive, many users require that cleaning costs stay within a limited budget. How to clean data selectively according to users' needs has therefore become an urgent problem. To solve it, this thesis takes the cost-sensitive decision tree as an example and proposes three on-demand data cleaning algorithms: a step-by-step on-demand cleaning algorithm based on splitting attribute benefits, a one-time on-demand cleaning algorithm based on splitting attribute benefits and cleaning costs, and a step-by-step on-demand cleaning algorithm based on splitting attribute benefits and cleaning costs. Experimental results demonstrate the effectiveness of the presented algorithms.
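The abstract does not give the details of the on-demand cleaning algorithms, so the following is only a minimal sketch of the general idea behind selecting attributes by benefit and cleaning cost under a budget. It is not the thesis's algorithm: the greedy benefit-per-cost ranking, the function name `select_attributes_to_clean`, and the example benefit and cost values are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def select_attributes_to_clean(
    benefit: Dict[str, float],
    cost: Dict[str, float],
    budget: float,
) -> Tuple[List[str], float]:
    """Pick attributes to clean, in one pass, without exceeding the budget.

    Hypothetical illustration: `benefit` stands in for a splitting-attribute
    benefit score and `cost` for the cleaning cost of that attribute.
    """
    if any(cost[a] <= 0 for a in benefit):
        raise ValueError("every cleaning cost must be positive")

    # Rank attributes by benefit per unit of cleaning cost, highest first.
    ranked = sorted(benefit, key=lambda a: benefit[a] / cost[a], reverse=True)

    chosen: List[str] = []
    spent = 0.0
    for attr in ranked:
        # Take the attribute only if it still fits in the remaining budget.
        if spent + cost[attr] <= budget:
            chosen.append(attr)
            spent += cost[attr]
    return chosen, spent


# Made-up example values, not taken from the thesis.
benefit = {"age": 0.30, "income": 0.25, "zip": 0.05}
cost = {"age": 4.0, "income": 2.0, "zip": 1.0}
print(select_attributes_to_clean(benefit, cost, budget=3.0))
```

A step-by-step variant would presumably re-evaluate the benefits after each cleaning step rather than ranking once, but the abstract does not specify how.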

Keywords/Search Tags:

dirty data, on-demand cleaning, crowdsourcing, knowledge base, decision tree

Related items

1	Human-in-the-loop Knowledge Graph Cleaning
2	Research On Key Technologies Of Data Cleaning Based On Crowdsourcing
3	The Research On Method Optimization Of Data Cleaning In The Construction Of Agricultural Domain Knowledge Base
4	Research On Vertical Crowdsourcing System For The Domain Knowledge Base Construction
5	Uncertain Graphs Cleaning For Reachability Queries Via Crowdsourcing
6	Research On Berth Mud-filling Knowledge And Its Application On The Management Of Mud-cleaning Engineering
7	Heterogeneous Data Sources Integration In Research And Application Of The Cleaning Strategy
8	The Research Of Decision Tree Algorithm In Data Mining
9	Classification Algorithm Study And Application Based On Decision Tree
10	Assisted Decision Support System For Insurance Company Based On The Analysis Of Business Data