
Research On Duplicate Record Detection Algorithms In Heterogeneous Data Sources

Posted on: 2012-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: M H Li
Full Text: PDF
GTID: 2218330362450407
Subject: Computer Science and Technology
Abstract/Summary:
Data quality management runs through every stage of the data life cycle. In many areas, such as business, sports, music, and travel, many data sources provide duplicate records. Duplicate records not only cause data redundancy, wasting network bandwidth and storage space, but also flood users with useless duplicate results. It is therefore important to detect duplicate records efficiently.

Recent research has focused on detecting duplicate records under a single schema. When records come from heterogeneous data sources, however, schema mapping must be done first; that is, a unified schema has to be built before duplicate record detection. With thousands of heterogeneous data sources, the schema mapping problem becomes hard to handle, because there are not only records with a large number of different schemas but also records with unknown schemas.

To handle this case efficiently and effectively, this thesis proposes a method based on optimal matching in a bipartite graph. We focus on duplicate records that have different schemas with multiple data types. Chapter 2 presents a method to compute the similarity of records, and Chapter 3 presents an algorithm based on this similarity to detect duplicate records. Because the method based on optimal bipartite matching accounts for heterogeneity, it is better suited to duplicate record detection in multi-source data environments. However, this intuitive method has two shortcomings. In efficiency, it must compare all records pairwise; in effectiveness, its strict condition for judging duplicate records yields a low recall rate. To make the method practical, Chapter 4 presents an efficient method.
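The idea of scoring two differently structured records by an optimal bipartite matching of their fields can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the field-similarity measure (`difflib.SequenceMatcher`) and the brute-force search over matchings are stand-in assumptions chosen for clarity; a real implementation would use a proper assignment algorithm such as the Hungarian method.

```python
from difflib import SequenceMatcher
from itertools import permutations

def field_sim(a, b):
    # Similarity of two field values compared as strings, in [0, 1].
    # SequenceMatcher is a placeholder for a type-aware measure.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def record_sim(r1, r2):
    """Record similarity via optimal bipartite matching of fields.

    Each field of the shorter record is matched to a distinct field of
    the longer one; we take the matching that maximizes the summed field
    similarities. Brute force over permutations, so only suitable for
    records with a handful of fields.
    """
    if len(r1) > len(r2):
        r1, r2 = r2, r1
    best = 0.0
    for perm in permutations(range(len(r2)), len(r1)):
        total = sum(field_sim(r1[i], r2[j]) for i, j in enumerate(perm))
        best = max(best, total)
    # Normalize by the larger field count so unmatched fields are penalized.
    return best / max(len(r1), len(r2))

# Two records describing the same entity under different schemas:
a = ["John Smith", "New York", "2012"]
b = ["2012", "J. Smith", "NYC"]
print(record_sim(a, b))
```

Because the matching ignores field order, the fields "2012" and "J. Smith" still pair up with their counterparts even though the two schemas list them in different positions.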
Based on similarity estimation, its basic idea is to estimate the range of two records' similarity in O(1) time and to decide whether they are duplicates from that estimate. Theoretical analysis and experimental results show that the method is both effective and efficient.
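To make the pruning idea concrete, here is one very simple O(1) bound of the kind such a filter could use, assuming a matching-based similarity normalized by the larger field count. The bound below (field-count ratio) and the threshold value are illustrative assumptions, not the estimator developed in the thesis.

```python
def sim_upper_bound(r1, r2):
    """Constant-time upper bound on a matching-based record similarity.

    If similarity is (sum of per-pair field similarities, each <= 1)
    divided by max(k1, k2), then at most min(k1, k2) pairs can be
    matched, so the similarity can never exceed min(k1, k2) / max(k1, k2).
    Only the field counts are needed, so this costs O(1).
    """
    k1, k2 = len(r1), len(r2)
    return min(k1, k2) / max(k1, k2)

def worth_comparing(r1, r2, threshold=0.8):
    # Run the expensive optimal-matching comparison only when the cheap
    # bound cannot already rule the pair out. Threshold is illustrative.
    return sim_upper_bound(r1, r2) >= threshold

# A 1-field record can never be 80% similar to a 3-field record
# under this normalization, so the pair is skipped outright.
print(worth_comparing(["2012"], ["John Smith", "New York", "2012"]))
```

Any pair rejected by the bound is guaranteed (under these assumptions) to fall below the threshold, so the filter discards only true negatives while avoiding most pairwise matching work.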
Keywords/Search Tags:Data quality, Heterogeneous Data Sources, Duplicate Record Detection, Record Similarity, Similarity Estimation