
Research On Duplicate Records Identification Model In Deep Web

Posted on: 2010-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: L N Liu
Full Text: PDF
GTID: 2178360308477801
Subject: Computer application technology
Abstract/Summary:
The World Wide Web (WWW, or simply the Web) has been growing at a prodigious rate since the 1990s. It now contains a mass of rich resources and constitutes valuable intellectual property. By the depth at which information is stored, the Web can be divided into two categories: the Surface Web and the Deep Web. The Deep Web refers to data sources stored in databases that cannot be reached through hyperlinks, but only through dynamically generated Web pages. Statistics show that the volume of information on the Deep Web, the amount of access to it, and its rate of growth all far exceed those of the Surface Web.

To exploit Deep Web information as effectively as possible, there is an urgent need to build Deep Web data integration systems. Because Web databases are heterogeneous and autonomous, merging the query results extracted from different Web databases is a challenge, and duplicate records identification is an essential part of cleaning the extracted results during data integration.

This thesis first gives a brief definition of the duplicate identification problem (i.e., data cleaning and deduplication) and then presents a detailed description of existing methods and models. Since most current duplicate identification work is based on the structured relational model, this thesis proposes a duplicate records identification model for semi-structured data. The model mainly comprises three modules: data preprocessing, homogeneous records processing, and heterogeneous records processing.

The model matches entity records extracted from different data sources against the global schema of a specific domain, which greatly improves the accuracy of the similarity computed between two entity records. For calculating the similarity of entity records extracted from different databases, the model provides an extensible similarity algorithm library in which different algorithms can be combined during the calculation: new similarity algorithms can be added to the library, and the similarity calculation strategies and algorithms can be adapted to the specific domain.

Experimental results show that the proposed duplicate records identification model is feasible and efficient.
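The abstract does not give the concrete algorithms, so the following is a minimal Python sketch of what such an extensible similarity library with a per-domain weighting strategy might look like. The attribute names (title, author, press), the weights, and the 0.8 threshold are illustrative assumptions, not the thesis's actual configuration.

    from difflib import SequenceMatcher

    def edit_similarity(a: str, b: str) -> float:
        """Normalized edit-style similarity via difflib's ratio."""
        return SequenceMatcher(None, a, b).ratio()

    def jaccard_similarity(a: str, b: str) -> float:
        """Token-set Jaccard similarity."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    # Extensible algorithm library: new similarity functions can be
    # registered here without touching the matching logic.
    SIMILARITY_LIBRARY = {
        "edit": edit_similarity,
        "jaccard": jaccard_similarity,
    }

    # Hypothetical strategy for a book-domain global schema: each
    # attribute names an algorithm from the library and a weight.
    STRATEGY = {
        "title":  ("edit",    0.5),
        "author": ("jaccard", 0.3),
        "press":  ("jaccard", 0.2),
    }

    def record_similarity(r1: dict, r2: dict, strategy=STRATEGY) -> float:
        """Weighted combination of attribute similarities under the global schema."""
        score = 0.0
        for attr, (algo, weight) in strategy.items():
            sim = SIMILARITY_LIBRARY[algo](r1.get(attr, ""), r2.get(attr, ""))
            score += weight * sim
        return score

    def is_duplicate(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
        """Two records from different Web databases are treated as the
        same entity when their combined similarity exceeds the threshold."""
        return record_similarity(r1, r2) >= threshold

Swapping the entries of STRATEGY is how such a design would be retargeted to another domain: the matching logic stays fixed while the algorithms and weights change with the global schema.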
Keywords/Search Tags:duplicate records, deep Web, data cleaning, semi-structured data, global schema