Research On DBSCAN-Based Detection Method Of Approximate Duplicate Records

Posted on:2008-05-02

Degree:Master

Type:Thesis

Country:China

Candidate:L Cui

Full Text:PDF

GTID:2178360215959794

Subject:Computer application technology

Abstract/Summary:

With the rapid development of information technology, managers increasingly depend on data to make decisions. Data warehouses, built on top of databases, have emerged to support decision analysis. However, when data from different sources are loaded into a data warehouse, many data quality problems can arise; these lead to faulty decision analysis and degrade the quality of information services. Quality-based enterprise management is therefore receiving more and more attention, and heterogeneous data sources need to be cleaned as their data are extracted into the warehouse. Given the "garbage in, garbage out" problem of earlier decision systems, and in order to support data mining, the data to be processed must be reliable and contain as few errors as possible. Data cleaning is thus becoming an important problem in data warehousing, data mining, and network data processing, and the detection of approximate duplicate records is an important part of it.

This thesis describes data cleaning in detail. It introduces the concept and significance of data cleaning and the current state of research and applications in China and abroad, and it summarizes the theories, methods, evaluation criteria, and basic workflow of data cleaning. Building on this analysis, it discusses the related knowledge and algorithms and studies field matching and record similarity in depth.

The thesis also proposes improved algorithms that address the limitations of the original ones at each step. In the approximate duplicate record detection process, the DBSCAN clustering algorithm is used to cluster the records in the dataset; it has the advantages of fast clustering and strong resistance to noise. However, when character data are converted to coordinates in a vector space using ASCII codes, some records that are not duplicates are clustered into the same class. Therefore, taking into account that DBSCAN forms clusters by region connectivity, a pair-wise comparison algorithm is applied to the records within each cluster to identify the approximate duplicate records more precisely. Tests on datasets show that the improved algorithm achieves higher precision.

Finally, the thesis summarizes the research work and outlines the focus of future research.
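
The two-stage idea in the abstract (cluster first, then compare pair-wise inside each cluster) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify how ASCII codes are turned into coordinates, so the `to_vector` embedding (mean ASCII code and record length), the `eps`, `min_pts` and `threshold` values, and the use of `difflib.SequenceMatcher` as the pair-wise similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import dist


def to_vector(record):
    # Hypothetical embedding (an assumption, not taken from the thesis):
    # (mean ASCII code of the characters, record length).
    if not record:
        return (0.0, 0.0)
    codes = [ord(c) for c in record]
    return (sum(codes) / len(codes), float(len(record)))


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one cluster label per point, -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: grow the cluster
    return labels


def find_duplicates(records, eps=3.0, min_pts=2, threshold=0.8):
    """Cluster records, then compare pair-wise only inside each cluster."""
    labels = dbscan([to_vector(r) for r in records], eps, min_pts)
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)
    pairs = []
    for members in clusters.values():
        for a, b in combinations(members, 2):
            ratio = SequenceMatcher(None, records[a], records[b]).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs


records = ["john smith", "john smyth", "htims nhoj", "zzzzzzzzzzzzzzzzzzzzzzzz"]
print(find_duplicates(records))  # [(0, 1)]
```

In this toy input, "htims nhoj" has the same mean ASCII code and length as "john smith", so the clustering step places it in the same cluster even though it is not a duplicate; the pair-wise step then filters it out, which is the effect the abstract describes.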

Keywords/Search Tags:

data cleaning, DBSCAN cluster, approximate duplicate record, pair-wise

Related items

1	Research On Detection Of Approximate Duplicate Records For Massive Data
2	Research And Application Of Data Cleaning In The Construction Of POI Data Warehouse
3	Research On Related Algorithms For Chinese Repeated Record Cleaning
4	Study Of Data Cleaning Algorithms Based On Data Warehouse
5	Research On Technologies Of Duplicate Record Data Cleaning In Big Data Environment
6	Research On Web Similar Duplicate Data Cleaning Based On Hadoop
7	Research On Technologies Of Duplicate Record Data Cleaning Under Industrial Big Data
8	Research And Implementation Of Similar Duplicate Record Detection Optimization Algorithm Based On DBSCAN
9	Similar Repetitive Record Detection Method In Uncertainty Database
10	Pre-distribute And Establish The Pair-wise Keys Of Scheme In Wsn