An Improved Method For Detecting Incremental Approximately Duplicate Records Based On Clustering Tree

Posted on:2011-02-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y Dai

Full Text:PDF

GTID:2178360308473216

Subject:Engineering and project management

Abstract/Summary:


Data drawn from multiple channels increases the number of approximately duplicate records in a data warehouse, which seriously degrades both the efficiency of data utilization and the quality of decision-making. Detecting and eliminating approximately duplicate records has therefore become a hot research topic in data warehousing, knowledge discovery, and related fields. Because most decision-support applications run on dynamic databases, the incremental detection of approximately duplicate records has attracted the attention of organizations and scholars. The detection algorithm based on a clustering tree is a good incremental approach, but neither its efficiency nor its accuracy is high. In view of this, this paper presents an improved clustering-tree detection method that uses the rank-based weights method for attribute reduction and adds a threshold to the construction of the clustering tree.

First, this paper summarizes the theories and methods related to data quality, data cleansing, and approximately duplicate records. It then analyzes the problems in the clustering-tree detection algorithm and proposes an improved method to solve them: the rank-based weights method is used to reduce and rank attributes, and a record threshold is added to improve the accuracy and efficiency of the algorithm. The improved algorithm is then described in detail. Finally, using MyEclipse 7.0 as the application development tool and SQL Server (managed through SQL Server Management Studio) as the DBMS, we developed an application called ICT-Syst (Improved Clustering Tree System) for the experiments. Experiments on a database produced by a database generator according to given rules verify the effectiveness of the improved algorithm. The results show that the improved algorithm achieves higher recall and precision, and that data-processing efficiency also improves significantly.
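The abstract does not give the weighting formula or the threshold value, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes rank-sum weights (one common rank-based weighting scheme), a generic string similarity from Python's standard library, and a hypothetical threshold of 0.85. All function names, field names, and the threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def rank_sum_weights(n):
    """Rank-sum weights (an assumed scheme): the attribute ranked i
    (1 = most important) gets weight 2(n + 1 - i) / (n(n + 1))."""
    total = n * (n + 1) / 2
    return [(n + 1 - i) / total for i in range(1, n + 1)]


def field_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, ranked_fields, weights):
    """Weighted sum of per-attribute similarities over the ranked attributes."""
    return sum(
        w * field_similarity(r1[f], r2[f])
        for f, w in zip(ranked_fields, weights)
    )


def is_approx_duplicate(r1, r2, ranked_fields, weights, threshold=0.85):
    """Flag two records as approximate duplicates when their weighted
    similarity reaches the (hypothetical) threshold."""
    return record_similarity(r1, r2, ranked_fields, weights) >= threshold


# Hypothetical attributes, listed from most to least important.
ranked_fields = ["name", "address", "city"]
weights = rank_sum_weights(len(ranked_fields))

a = {"name": "John Smith", "address": "12 High Street", "city": "Leeds"}
b = {"name": "Jon Smith", "address": "12 High St.", "city": "Leeds"}
print(record_similarity(a, b, ranked_fields, weights))
```

In an incremental setting, a newly arriving record would be compared against cluster representatives rather than against every stored record; the threshold then decides whether it joins an existing cluster or starts a new one. How the thesis organizes that comparison inside the clustering tree is not stated in the abstract.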

Keywords/Search Tags:

data cleansing, approximately duplicate records, clustering tree, rank-based weights method


Related items

1	Study And Application Of The Data Cleansing Technology In ETL
2	Data Cleaning Algorithm And Applications
3	Research And Realization Of Vegetable Traceability Display System Based On An Algorithm Of Approximately Duplicate Database Records Merging
4	Research Of Data Cleansing Algorithms For Duplicate Records Detection Problem
5	Research And Application Of Data Cleansing In Multi-radar Data Fusion Algorithm
6	Research On The Method Of Approximately Duplicated Records Detection For Text Data In Big Data Environment
7	Research On Detection Of Approximate Duplicate Records For Massive Data
8	Similar Repetitive Record Detection Method In Uncertainty Database
9	Researches On Data Elimination In Forestry WEB Yellow Page Information Integration
10	Research And Implementation On Mass Data Cleaning In E-Government System