
Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning

Posted on: 2018-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: S L Yang
GTID: 2348330542964624
Subject: Computer technology
Abstract/Summary:
Correct use of high-quality data leads to better predictions, analyses, and decisions. In a multi-source heterogeneous data environment, data structures differ between sources, data representation is not uniform, and data sets often contain incomplete, incorrect, or irrelevant dirty data, so cleaning such data is a major challenge. Data cleaning is a powerful means of guaranteeing data quality: it improves the consistency, accuracy, authenticity, and availability of big data. To improve the efficiency and reduce the complexity of cleaning multi-source heterogeneous big data, the main work of this thesis is as follows:

(1) To address the large amount of imprecise data in multi-source heterogeneous environments, we propose a data cleaning strategy of hierarchical reduction and classified cleaning (HRSC). A TAN (Tree Augmented Naive Bayes) classifier is constructed from measures of the importance of data sources and data attributes, tuple weight tags, and a machine learning classification algorithm. The probability values of the data are then used to complete the classified cleaning of the imprecise data. Experiments show that, compared with existing methods for cleaning multi-source heterogeneous imprecise data, the HRSC strategy effectively improves both the accuracy and the efficiency of cleaning.

(2) To address the large amount of redundant or similar duplicate data in multi-source heterogeneous environments, we propose an attribute reduction associated cleaning strategy (ARAC: Attribute Reduction Associated Cleaning). It builds a standard library of data attributes, applies attribute reduction, and uses an SNM algorithm improved with multiple sorts to clean similar duplicate records.

We evaluate the cleaning model and algorithm on real data sets and validation data sets, verifying both the correctness of the similar duplicate data cleaning and the integrity of the final data. The experiments show that the proposed model and algorithm effectively solve the similar duplicate data cleaning problem in multi-source heterogeneous big data.
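
The abstract does not give the details of the improved SNM algorithm in item (2). As a rough illustration only, the sketch below assumes that SNM refers to the sorted neighborhood method and that "multiple sort" means repeating the sort-and-window pass with several sort keys, then merging the matches. The function names, the string similarity measure, the window size, and the threshold are all hypothetical choices, not taken from the thesis.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Average per-field string similarity of two records (assumed measure)."""
    scores = [
        SequenceMatcher(None, str(x).lower(), str(y).lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def multipass_snm(records, key_funcs, window=3, threshold=0.85):
    """Return groups of record indices judged to be similar duplicates.

    For each sort key, records are sorted and every record is compared
    with the next (window - 1) records in that order. Matches found in
    any pass are merged with a union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        # Find the representative of i, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                if similarity(records[i], records[j]) >= threshold:
                    parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    records = [
        ("John Smith", "12 Oak Street", "Boston"),
        ("Alice Wong", "7 Pine Road", "Denver"),
        ("Jon Smith", "12 Oak Street", "Boston"),
        ("Bob Lee", "99 Elm Avenue", "Austin"),
    ]
    # Two passes: one sorted by name, one sorted by address.
    keys = [lambda r: r[0], lambda r: r[1]]
    print(multipass_snm(records, keys, window=2))
```

Sorting on more than one key gives records that sort far apart under one key (for example, because of a typo in the leading characters) another chance to fall inside the same window under a different key. Attribute reduction, as described in item (2), would plausibly shrink the set of fields compared in `similarity`, but that step is not modeled here.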
Keywords/Search Tags:Machine learning, Data cleaning, Approximately duplicated data, Attribute reduction, Bayesian Network, Rough set theory