
An Improved Method For Detecting Incremental Approximately Duplicate Records Based On Clustering Tree

Posted on: 2011-02-24; Degree: Master; Type: Thesis
Country: China; Candidate: Y Dai; Full Text: PDF
GTID: 2178360308473216; Subject: Engineering and project management
Abstract/Summary:
Data drawn from multiple source channels increases the number of approximately duplicate records in a data warehouse, which seriously affects the efficiency of data utilization and the quality of decision-making. The detection and elimination of approximately duplicate records has therefore become an active research topic in data warehousing, knowledge discovery, and related fields. Because most decision-support applications run on dynamic databases, the incremental detection of approximately duplicate records has attracted the attention of organizations and scholars. The clustering-tree-based detection algorithm is a good algorithm for incremental detection of approximately duplicate records, but neither its efficiency nor its accuracy is high. In view of this, this thesis presents an improved clustering-tree-based detection method that uses the rank-based weights method for property reduction and adds a threshold to the construction of the clustering tree.

First, the thesis summarizes the theories and methods related to data quality, data cleansing, and approximately duplicate records. It then analyzes the problems in the clustering-tree-based detection algorithm and proposes an improved method to solve them: the rank-based weights method is used to reduce and rank properties, and a record threshold is added to improve the accuracy and efficiency of the algorithm. The improved algorithm is then described in detail. Finally, using MyEclipse 7.0 as the application development tool and SQL Server (administered through SQL Server Management Studio) as the DBMS, we developed an application called ICT-Syst (Improved Clustering Tree System) for the experiments. The experiments were run on a database produced by a database generator according to given rules, in order to verify the effectiveness of the improved algorithm. The results show that the improved algorithm achieves higher recall and precision, and that data-processing efficiency also improves significantly.
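The abstract does not give the weighting formula or the details of how the tree is built, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes rank-sum weights (one common rank-based weighting scheme, not necessarily the one used in the thesis), a weighted string similarity between records, and a flat list of clusters in place of a real clustering tree. The field names, the sample records, and the threshold value of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def rank_sum_weights(ranked_fields):
    """Assumed scheme: the field ranked r out of n gets 2*(n + 1 - r) / (n*(n + 1)).

    Fields are listed from most to least important; the weights sum to 1.
    """
    n = len(ranked_fields)
    return {
        field: 2 * (n + 1 - rank) / (n * (n + 1))
        for rank, field in enumerate(ranked_fields, start=1)
    }


def similarity(a, b, weights):
    """Weighted sum of per-field string similarity, a value in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field, w in weights.items()
    )


def add_record(clusters, record, weights, threshold):
    """Incremental step: join the most similar cluster if its representative
    (the first record of the cluster) is similar enough, else open a new cluster."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = similarity(cluster[0], record, weights)
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= threshold:
        best.append(record)
    else:
        clusters.append([record])
    return clusters


# Hypothetical records; only the ranked fields take part in the comparison.
weights = rank_sum_weights(["name", "city"])
clusters = []
for rec in [
    {"name": "John Smith", "city": "Boston"},
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "Alice Wong", "city": "Denver"},
]:
    add_record(clusters, rec, weights, threshold=0.8)
```

In this sketch each new record is compared only with one representative per cluster rather than with every stored record, which is the kind of saving an incremental method aims for. The threshold decides when a record is too dissimilar to join any existing cluster.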
Keywords/Search Tags: data cleansing, approximately duplicate records, clustering tree, rank-based weights method