Research On DBSCAN-Based Detection Method Of Approximate Duplicate Records

Posted on: 2008-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: L Cui
GTID: 2178360215959794
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, organizational managers depend more and more on data to make decisions. Data warehouses, built on top of databases, have emerged to support decision analysis. During the construction of a data warehouse, however, data from different sources are loaded together, and many data quality problems may arise; these lead to faulty decision analysis and degrade the quality of information services. Quality-based enterprise management is therefore receiving more and more attention, and when heterogeneous data are extracted into a data warehouse, all data sources need to be cleaned. Given the "garbage in, garbage out" problem of earlier decision systems, and in order to support data mining, the data to be processed must be reliable, with as few errors as possible. Data cleaning is becoming an important issue in data warehousing, data mining, and network data processing, and the detection of approximate duplicate records is an especially important part of it.

This thesis describes data cleaning in detail. It introduces the concept and significance of data cleaning and the current state of research and application in China and abroad. It summarizes the theories, methods, evaluation standards, and basic workflow of data cleaning. Building on this analysis and summary, it discusses the related knowledge and algorithms and studies field matching and record similarity in depth. It also proposes improved algorithms that address the limitations of the original ones at each step. In the approximate duplicate record detection process, the DBSCAN clustering algorithm is used to cluster the records in the dataset; this algorithm has the advantages of fast clustering and strong resistance to noise. However, when character data are converted to vector-space coordinates using ASCII codes, some records that are not duplicates are clustered into the same class.
Because DBSCAN clusters by region connectivity, a pair-wise algorithm is then used to compare the records within each cluster and identify the approximate duplicate records more precisely. Datasets were tested with the improved algorithm, and the results show that it improves precision. The thesis closes by summarizing the research work and outlining the focus of future research.
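
The two-stage idea in the abstract (cluster first, then compare pair-wise inside each cluster) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the per-character ASCII histogram in `to_vector`, the use of `difflib.SequenceMatcher` as the record similarity measure, and the `eps`, `min_pts`, and `threshold` values are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations


def to_vector(record):
    """Map a string to a 128-dim count vector indexed by ASCII code (assumed encoding)."""
    vec = [0] * 128
    for ch in record.lower():
        code = ord(ch)
        if code < 128:
            vec[code] += 1
    return vec


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)
    labels = [None] * n  # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: grow the connected region
                queue.extend(j_nbrs)
    return labels


def find_duplicates(records, eps=1.5, min_pts=2, threshold=0.85):
    """Cluster records, then compare pair-wise only within each cluster."""
    labels = dbscan([to_vector(r) for r in records], eps, min_pts)
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)
    pairs = []
    for members in clusters.values():
        for a, b in combinations(members, 2):
            ratio = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return sorted(pairs)


records = ["John Smith", "Jon Smith", "htimS nhoJ", "Alice Wong"]
print(find_duplicates(records))
```

In this toy input, the reversed string "htimS nhoJ" has the same character histogram as "John Smith", so it falls into the same cluster even though it is not a duplicate; the pair-wise comparison then rejects it. This mirrors the false-positive problem the abstract attributes to the ASCII-based vector conversion.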
Keywords/Search Tags: data cleaning, DBSCAN cluster, approximate duplicate record, pair-wise