
Research On Duplicate Record Detection Algorithms In Heterogeneous Data Sources

Posted on: 2012-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: M H Li
Full Text: PDF
GTID: 2218330362450407
Subject: Computer Science and Technology
Abstract/Summary:
Data quality management runs through every stage of the data life cycle. In many areas, such as business, sports, music, and travel, many data sources provide duplicate records. Duplicate records not only cause data redundancy, wasting network bandwidth and storage space, but also flood users with useless duplicate results. It is therefore important to detect duplicate records efficiently.

Recent research has focused on detecting duplicate records under a single schema. When records come from heterogeneous data sources, however, schema mapping must be done first; that is, a unified schema has to be built before duplicate record detection. With thousands of heterogeneous data sources, the schema mapping problem becomes hard to handle, because there are not only records with a large number of different schemas but also records with unknown schemas.

To handle this case efficiently and effectively, this thesis proposes a method based on optimal matching in a bipartite graph. We focus on duplicate records that have different schemas with multiple data types. Chapter 2 presents a method to compute the similarity of records, and Chapter 3 presents an algorithm based on this similarity to detect duplicate records. Because the method based on optimal bipartite matching accounts for heterogeneity, it is better suited to duplicate record detection in multi-source data environments. However, this intuitive method has two shortcomings. In efficiency, it must compare all records pairwise; in effectiveness, its strict condition for judging duplicate records yields a low recall rate. To make the method practical, Chapter 4 presents an efficient method.
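The idea of scoring two differently structured records by an optimal bipartite matching of their fields can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the field-similarity measure (`difflib.SequenceMatcher`) and the brute-force search over matchings are stand-in assumptions chosen for clarity; a real implementation would use a proper assignment algorithm such as the Hungarian method.

```python
from difflib import SequenceMatcher
from itertools import permutations

def field_sim(a, b):
    # Similarity of two field values compared as strings, in [0, 1].
    # SequenceMatcher is a placeholder for a type-aware measure.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def record_sim(r1, r2):
    """Record similarity via optimal bipartite matching of fields.

    Each field of the shorter record is matched to a distinct field of
    the longer one; we take the matching that maximizes the summed field
    similarities. Brute force over permutations, so only suitable for
    records with a handful of fields.
    """
    if len(r1) > len(r2):
        r1, r2 = r2, r1
    best = 0.0
    for perm in permutations(range(len(r2)), len(r1)):
        total = sum(field_sim(r1[i], r2[j]) for i, j in enumerate(perm))
        best = max(best, total)
    # Normalize by the larger field count so unmatched fields are penalized.
    return best / max(len(r1), len(r2))

# Two records describing the same entity under different schemas:
a = ["John Smith", "New York", "2012"]
b = ["2012", "J. Smith", "NYC"]
print(record_sim(a, b))
```

Because the matching ignores field order, the fields "2012" and "J. Smith" still pair up with their counterparts even though the two schemas list them in different positions.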
Based on similarity estimation, its basic idea is to estimate the range of two records' similarity in O(1) time and to decide whether they are duplicates from that estimate. Theoretical analysis and experimental results show that the method is both effective and efficient.
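To make the pruning idea concrete, here is one very simple O(1) bound of the kind such a filter could use, assuming a matching-based similarity normalized by the larger field count. The bound below (field-count ratio) and the threshold value are illustrative assumptions, not the estimator developed in the thesis.

```python
def sim_upper_bound(r1, r2):
    """Constant-time upper bound on a matching-based record similarity.

    If similarity is (sum of per-pair field similarities, each <= 1)
    divided by max(k1, k2), then at most min(k1, k2) pairs can be
    matched, so the similarity can never exceed min(k1, k2) / max(k1, k2).
    Only the field counts are needed, so this costs O(1).
    """
    k1, k2 = len(r1), len(r2)
    return min(k1, k2) / max(k1, k2)

def worth_comparing(r1, r2, threshold=0.8):
    # Run the expensive optimal-matching comparison only when the cheap
    # bound cannot already rule the pair out. Threshold is illustrative.
    return sim_upper_bound(r1, r2) >= threshold

# A 1-field record can never be 80% similar to a 3-field record
# under this normalization, so the pair is skipped outright.
print(worth_comparing(["2012"], ["John Smith", "New York", "2012"]))
```

Any pair rejected by the bound is guaranteed (under these assumptions) to fall below the threshold, so the filter discards only true negatives while avoiding most pairwise matching work.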
Keywords/Search Tags:Data quality, Heterogeneous Data Sources, Duplicate Record Detection, Record Similarity, Similarity Estimation