
Research On Detection Of Approximate Duplicate Records For Massive Data

Posted on: 2012-11-27
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhang
Full Text: PDF
GTID: 2248330338493141
Subject: Computer application technology
Abstract/Summary:
With the development and widespread use of database technology, the volume of data accumulated in various professional domains keeps growing. Building a data warehouse requires importing data from different data sources, which increases the number of approximately duplicated records and adversely affects data utilization and the quality of decision making. Detecting and cleaning approximately duplicated records has therefore become an active research subject in data warehousing and data mining.

This thesis elaborates the theory of data cleaning, analyzes its necessity and the state of domestic research, and focuses on the theory, methods, evaluation criteria, and basic process of detecting approximately duplicated records in massive data. The major results are as follows:

(1) To address the problem that the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm clusters approximately duplicated records into too few clusters, an improved method is proposed that searches for a near-optimal radius by using a dynamic random function to adjust the radius continually (see the first sketch below).

(2) For detecting approximately duplicated records in massive data, an entropy-based Feature Selection Grouping Clustering (FSGC) algorithm is proposed. The basic idea is to construct an entropy metric based on the similarity between objects, evaluate the importance of each attribute, and obtain a key attribute subset (see the second sketch below). The data set is then split into small data sets according to the key attributes, and approximately duplicated records are identified within each subset by the improved DBSCAN algorithm. Theoretical analysis and experiments show that the algorithm achieves high detection efficiency, although its detection accuracy is limited.

(3) To improve detection accuracy, an N-Gram-based second clustering method is given. By comparing the records in each cluster pairwise, the method detects falsely identified duplicates (see the third sketch below). Experimental analysis shows that it effectively improves detection accuracy.

(4) To overcome the inefficiency and lack of intelligence of the traditional data cleaning model, a three-layer data cleaning architecture based on Multi-Agent technology is built, and its communication mechanism and running process are discussed (see the fourth sketch below). The architecture compensates for the limitations of the traditional model; moreover, it greatly improves intelligence and efficiency and reduces the amount of manual involvement.
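The abstract does not give the exact form of the dynamic random function in result (1), so the following Python sketch only illustrates the general idea: the DBSCAN radius is perturbed randomly around the best value found so far, and a candidate is kept when it improves a clustering score. The scoring criterion (fraction of non-noise points) and the name tune_eps are assumptions, not the thesis's actual procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def tune_eps(X, eps0=0.5, min_samples=5, iters=30, seed=0):
    """Search for a DBSCAN radius by dynamic random adjustment.

    The score (fraction of non-noise points, penalised when all
    points fall into a single cluster) is an illustrative stand-in
    for whatever criterion the thesis actually optimises.
    """
    rng = np.random.default_rng(seed)
    best_eps, best_score, eps = eps0, -np.inf, eps0
    for _ in range(iters):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_ratio = float(np.mean(labels == -1))
        score = (1.0 - noise_ratio) if n_clusters > 1 else -noise_ratio
        if score > best_score:
            best_eps, best_score = eps, score
        # Dynamic random adjustment: multiplicative jitter around the
        # best radius found so far.
        eps = best_eps * rng.uniform(0.8, 1.2)
    return best_eps
```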
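For result (2), the abstract names an entropy metric based on similarity between objects but does not define it. The sketch below uses the classic similarity-entropy measure from entropy-based feature selection (pairwise similarities derived from normalised distances), which may differ from the thesis's exact formula; attribute importance is estimated by how much the entropy changes when an attribute is removed.

```python
import numpy as np

def similarity_entropy(X):
    """Entropy of pairwise object similarities.

    S[i, j] = 1 - normalised Euclidean distance, clipped away from
    0 and 1 so the logarithms stay finite. The concrete similarity
    used in the thesis is not given in the abstract.
    """
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    d = d / (d.max() + 1e-12)
    s = np.clip(1.0 - d, 1e-6, 1.0 - 1e-6)[np.triu_indices(len(X), k=1)]
    return -np.sum(s * np.log(s) + (1.0 - s) * np.log(1.0 - s))

def rank_attributes(X):
    """Rank attributes by the entropy change caused by removing each one."""
    base = similarity_entropy(X)
    change = [abs(base - similarity_entropy(np.delete(X, j, axis=1)))
              for j in range(X.shape[1])]
    return np.argsort(change)[::-1]  # most important attribute first
```

The top-ranked attributes would then serve as the grouping key that splits the data set into small subsets before the improved DBSCAN pass.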
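For result (3), the abstract states that records inside each candidate cluster are compared pairwise with an N-Gram method. A minimal sketch follows, assuming character bigrams and a Dice-coefficient similarity with an illustrative threshold; pairs scoring below the threshold are reported as faulty duplicates.

```python
def ngrams(s, n=2):
    """Set of character n-grams of a string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2.0 * len(ga & gb) / (len(ga) + len(gb))

def find_faulty_pairs(records, threshold=0.75, n=2):
    """Pairwise check of one candidate duplicate cluster.

    `records` are field-concatenated record strings; the threshold
    value is an assumption for illustration.
    """
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if ngram_similarity(records[i], records[j], n) < threshold]
```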
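Result (4) describes only a three-layer Multi-Agent structure with communication between agents; the layer names and queue-based message bus in this skeleton are hypothetical, intended just to make the division of responsibilities concrete.

```python
from queue import Queue

class Agent:
    """Minimal base agent that communicates over a shared queue."""
    def __init__(self, name, bus):
        self.name, self.bus = name, bus

    def send(self, **content):
        self.bus.put({"from": self.name, **content})

# Hypothetical layer names; the abstract says only that the
# architecture has three layers of cooperating agents.
class UserAgent(Agent):        # top layer: accepts cleaning tasks
    def submit(self, records):
        self.send(kind="task", records=records)

class ManagerAgent(Agent):     # middle layer: schedules and dispatches
    def dispatch(self, task, worker):
        worker.run(task)

class CleanerAgent(Agent):     # bottom layer: runs detection algorithms
    def run(self, task):
        # ...apply FSGC / the improved DBSCAN to task["records"] here...
        self.send(kind="result", task_done=True)
```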
Keywords/Search Tags: Data cleaning, Approximately duplicated records, Entropy, Property optimization, DBSCAN algorithm, FSGC