
Research On Detection Of Approximate Duplicate Records For Massive Data

Posted on: 2012-11-27
Degree: Master
Type: Thesis
Country: China
Candidate: P Zhang
Full Text: PDF
GTID: 2248330338493141
Subject: Computer application technology
Abstract/Summary:
With the development and widespread use of database technology, the volume of data accumulated in various professional domains keeps growing. Building a data warehouse requires importing data from different data sources, which increases the number of approximately duplicated records and adversely affects data utilization and the quality of decision making. Detecting and cleaning approximately duplicated records has therefore become an active research subject in data warehousing and data mining.

This thesis elaborates the theory of data cleaning, analyzes its necessity and the state of domestic research, and focuses on the theory, methods, evaluation criteria, and basic process of detecting approximately duplicated records in massive data. The major results are as follows:

(1) To address the problem that the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm clusters approximately duplicated records into too few clusters, an improved method is proposed that searches for a near-optimal radius by using a dynamic random function to adjust the radius continually (see the first sketch below).

(2) For detecting approximately duplicated records in massive data, an entropy-based Feature Selection Grouping Clustering (FSGC) algorithm is proposed. The basic idea is to construct an entropy metric based on the similarity between objects, evaluate the importance of each attribute, and obtain a key attribute subset (see the second sketch below). The data set is then split into small data sets according to the key attributes, and approximately duplicated records are identified within each subset by the improved DBSCAN algorithm. Theoretical analysis and experiments show that the algorithm achieves high detection efficiency, although its detection accuracy is limited.

(3) To improve detection accuracy, an N-Gram-based second clustering method is given. By comparing the records in each cluster pairwise, the method detects falsely identified duplicates (see the third sketch below). Experimental analysis shows that it effectively improves detection accuracy.

(4) To overcome the inefficiency and lack of intelligence of the traditional data cleaning model, a three-layer data cleaning architecture based on Multi-Agent technology is built, and its communication mechanism and running process are discussed (see the fourth sketch below). The architecture compensates for the limitations of the traditional model; moreover, it greatly improves intelligence and efficiency and reduces the amount of manual involvement.
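The abstract does not give the exact form of the dynamic random function in result (1), so the following Python sketch only illustrates the general idea: the DBSCAN radius is perturbed randomly around the best value found so far, and a candidate is kept when it improves a clustering score. The scoring criterion (fraction of non-noise points) and the name tune_eps are assumptions, not the thesis's actual procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def tune_eps(X, eps0=0.5, min_samples=5, iters=30, seed=0):
    """Search for a DBSCAN radius by dynamic random adjustment.

    The score (fraction of non-noise points, penalised when all
    points fall into a single cluster) is an illustrative stand-in
    for whatever criterion the thesis actually optimises.
    """
    rng = np.random.default_rng(seed)
    best_eps, best_score, eps = eps0, -np.inf, eps0
    for _ in range(iters):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_ratio = float(np.mean(labels == -1))
        score = (1.0 - noise_ratio) if n_clusters > 1 else -noise_ratio
        if score > best_score:
            best_eps, best_score = eps, score
        # Dynamic random adjustment: multiplicative jitter around the
        # best radius found so far.
        eps = best_eps * rng.uniform(0.8, 1.2)
    return best_eps
```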
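For result (2), the abstract names an entropy metric based on similarity between objects but does not define it. The sketch below uses the classic similarity-entropy measure from entropy-based feature selection (pairwise similarities derived from normalised distances), which may differ from the thesis's exact formula; attribute importance is estimated by how much the entropy changes when an attribute is removed.

```python
import numpy as np

def similarity_entropy(X):
    """Entropy of pairwise object similarities.

    S[i, j] = 1 - normalised Euclidean distance, clipped away from
    0 and 1 so the logarithms stay finite. The concrete similarity
    used in the thesis is not given in the abstract.
    """
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    d = d / (d.max() + 1e-12)
    s = np.clip(1.0 - d, 1e-6, 1.0 - 1e-6)[np.triu_indices(len(X), k=1)]
    return -np.sum(s * np.log(s) + (1.0 - s) * np.log(1.0 - s))

def rank_attributes(X):
    """Rank attributes by the entropy change caused by removing each one."""
    base = similarity_entropy(X)
    change = [abs(base - similarity_entropy(np.delete(X, j, axis=1)))
              for j in range(X.shape[1])]
    return np.argsort(change)[::-1]  # most important attribute first
```

The top-ranked attributes would then serve as the grouping key that splits the data set into small subsets before the improved DBSCAN pass.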
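For result (3), the abstract states that records inside each candidate cluster are compared pairwise with an N-Gram method. A minimal sketch follows, assuming character bigrams and a Dice-coefficient similarity with an illustrative threshold; pairs scoring below the threshold are reported as faulty duplicates.

```python
def ngrams(s, n=2):
    """Set of character n-grams of a string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2.0 * len(ga & gb) / (len(ga) + len(gb))

def find_faulty_pairs(records, threshold=0.75, n=2):
    """Pairwise check of one candidate duplicate cluster.

    `records` are field-concatenated record strings; the threshold
    value is an assumption for illustration.
    """
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if ngram_similarity(records[i], records[j], n) < threshold]
```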
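Result (4) describes only a three-layer Multi-Agent structure with communication between agents; the layer names and queue-based message bus in this skeleton are hypothetical, intended just to make the division of responsibilities concrete.

```python
from queue import Queue

class Agent:
    """Minimal base agent that communicates over a shared queue."""
    def __init__(self, name, bus):
        self.name, self.bus = name, bus

    def send(self, **content):
        self.bus.put({"from": self.name, **content})

# Hypothetical layer names; the abstract says only that the
# architecture has three layers of cooperating agents.
class UserAgent(Agent):        # top layer: accepts cleaning tasks
    def submit(self, records):
        self.send(kind="task", records=records)

class ManagerAgent(Agent):     # middle layer: schedules and dispatches
    def dispatch(self, task, worker):
        worker.run(task)

class CleanerAgent(Agent):     # bottom layer: runs detection algorithms
    def run(self, task):
        # ...apply FSGC / the improved DBSCAN to task["records"] here...
        self.send(kind="result", task_done=True)
```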
Keywords/Search Tags: Data cleaning, Approximately duplicated records, Entropy, Property optimization, DBSCAN algorithm, FSGC