Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning

Posted on:2018-02-17

Degree:Master

Type:Thesis

Country:China

Candidate:S L Yang

Full Text:PDF

GTID:2348330542964624

Subject:Computer technology

Abstract/Summary:

The correct use of high-quality data leads to better predictions, analyses, and decisions. In a multi-source heterogeneous data environment, data structures differ among data sources, the form of data representation is not uniform, and data sets often contain incomplete, incorrect, or irrelevant dirty data, so cleaning multi-source heterogeneous data is a major challenge. Data cleaning is a powerful means of guaranteeing data quality: it can improve the consistency, accuracy, authenticity, and availability of big data. To improve the efficiency and reduce the complexity of multi-source heterogeneous big data cleaning, the main work of this dissertation is as follows:

(1) To address the large amount of imprecise data in multi-source heterogeneous data environments, we propose a data cleaning strategy of hierarchical reduction and classified cleaning (the HRSC strategy). A TAN (Tree Augmented Naive Bayes) classifier is constructed from measures of the importance of data sources and data attributes, tuple weight tags, and a machine learning classification algorithm. The probability values of the data are then used to complete the classified cleaning of the imprecise data. Experiments show that, compared with existing methods for cleaning multi-source heterogeneous imprecise data, the HRSC strategy effectively improves both the accuracy and the efficiency of imprecise data cleaning.

(2) To address the large amount of redundant or similar duplicate data in multi-source heterogeneous data environments, we propose the Attribute Reduction Associated Cleaning (ARAC) strategy. Similar duplicate records are cleaned by constructing a standard library of data attributes, performing attribute reduction, and applying an SNM algorithm improved by multiple sorting. We evaluate the data cleaning model and algorithm on real data sets and validation data sets, verifying the correctness of the similar duplicate data cleaning and the integrity of the final data. The experiments show that the proposed model and algorithm effectively solve the similar duplicate data cleaning problem in multi-source heterogeneous big data.
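The similar-duplicate detection step in (2) can be illustrated with a minimal sketch. It assumes that SNM refers to the sorted neighborhood method and that "multiple sort" means running several passes, each sorted on a different key; the abstract does not give these details. The function names, the similarity measure (Python's `difflib`), the window size, the threshold, and the sample records are all hypothetical and are not taken from the thesis.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Average string similarity over the fields of two records (illustrative measure)."""
    scores = [
        SequenceMatcher(None, str(x).lower(), str(y).lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def snm_pass(records, key, window, threshold):
    """One sorted-neighborhood pass: sort by key, compare records inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records in sort order.
        for j in order[pos + 1: pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_snm(records, keys, window=3, threshold=0.85):
    """Run one pass per sort key and take the union of the candidate duplicate pairs."""
    pairs = set()
    for key in keys:
        pairs |= snm_pass(records, key, window, threshold)
    return pairs


# Hypothetical sample data: (name, address, city).
records = [
    ("John Smith", "12 Main Street", "Boston"),
    ("Jon Smith", "12 Main Street", "Boston"),
    ("Alice Wong", "5 Lake Road", "Denver"),
    ("Smith John", "12 Main St", "Boston"),
]
# Two sort keys: by name, then by address.
keys = [lambda r: r[0].lower(), lambda r: r[1].lower()]
duplicates = multi_pass_snm(records, keys)
print(sorted(duplicates))
```

Sorting on more than one key reduces the chance that two similar records end up far apart in a single sort order and are never compared. The window size trades recall against the number of comparisons.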

Keywords/Search Tags:

Machine learning, Data cleaning, Approximately duplicated data, Attribute reduction, Bayesian Network, Rough set theory

Related items

1	Research On Data Cleaning Of Approximately Duplicated Records
2	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
3	Analysis Of Financial Data Based On Rough Set Theory
4	Some Main Technology's Research Of Data Cleaning
5	Research Of Data Cleaning Method Based On Data Warehouse
6	Research On Weighted Naive Bayesian Classification Algorithm Based On Rough Set Theory
7	Research And Implementation Of Data Cleaning System Based On Pre-Processing Techniques
8	The Research Of Data Cleaning Algorithm Base On MapReduce
9	Rough Sets Theory And SVMs Based Multi-class Classification Algorithm
10	Based On Rough Set Theory Data Mining Technology And Its Application Of Potential Consumers Of Private Cars