
Research On Duplicate Records Identification Model In Deep Web

Posted on: 2010-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: L N Liu
Full Text: PDF
GTID: 2178360308477801
Subject: Computer application technology
Abstract/Summary:
The World Wide Web (WWW, or simply the Web) has been growing at a prodigious rate since the 1990s. It now contains a mass of rich resources and constitutes valuable intellectual property. By the depth at which information is stored, the Web can be divided into two categories: the Surface Web and the Deep Web. The Deep Web refers to data sources stored in databases that cannot be reached through hyperlinks, but only through dynamically generated Web pages. Statistics show that the volume of information on the Deep Web, the amount of access to it, and its rate of growth all far exceed those of the Surface Web.

To exploit Deep Web information as effectively as possible, there is an urgent need to build Deep Web data integration systems. Because Web databases are heterogeneous and autonomous, merging the query results extracted from different Web databases is a challenge, and duplicate records identification is an essential part of cleaning the extracted results during data integration.

This thesis first gives a brief definition of the duplicate identification problem (i.e., data cleaning and deduplication) and then presents a detailed description of existing methods and models. Since most current duplicate identification work is based on the structured relational model, this thesis proposes a duplicate records identification model for semi-structured data. The model mainly comprises three modules: data preprocessing, homogeneous records processing, and heterogeneous records processing.

The model matches entity records extracted from different data sources against the global schema of a specific domain, which greatly improves the accuracy of the similarity computed between two entity records. For calculating the similarity of entity records extracted from different databases, the model provides an extensible similarity algorithm library in which different algorithms can be combined during the calculation: new similarity algorithms can be added to the library, and the similarity calculation strategies and algorithms can be adapted to the specific domain.

Experimental results show that the proposed duplicate records identification model is feasible and efficient.
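The abstract does not give the concrete algorithms, so the following is a minimal Python sketch of what such an extensible similarity library with a per-domain weighting strategy might look like. The attribute names (title, author, press), the weights, and the 0.8 threshold are illustrative assumptions, not the thesis's actual configuration.

    from difflib import SequenceMatcher

    def edit_similarity(a: str, b: str) -> float:
        """Normalized edit-style similarity via difflib's ratio."""
        return SequenceMatcher(None, a, b).ratio()

    def jaccard_similarity(a: str, b: str) -> float:
        """Token-set Jaccard similarity."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    # Extensible algorithm library: new similarity functions can be
    # registered here without touching the matching logic.
    SIMILARITY_LIBRARY = {
        "edit": edit_similarity,
        "jaccard": jaccard_similarity,
    }

    # Hypothetical strategy for a book-domain global schema: each
    # attribute names an algorithm from the library and a weight.
    STRATEGY = {
        "title":  ("edit",    0.5),
        "author": ("jaccard", 0.3),
        "press":  ("jaccard", 0.2),
    }

    def record_similarity(r1: dict, r2: dict, strategy=STRATEGY) -> float:
        """Weighted combination of attribute similarities under the global schema."""
        score = 0.0
        for attr, (algo, weight) in strategy.items():
            sim = SIMILARITY_LIBRARY[algo](r1.get(attr, ""), r2.get(attr, ""))
            score += weight * sim
        return score

    def is_duplicate(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
        """Two records from different Web databases are treated as the
        same entity when their combined similarity exceeds the threshold."""
        return record_similarity(r1, r2) >= threshold

Swapping the entries of STRATEGY is how such a design would be retargeted to another domain: the matching logic stays fixed while the algorithms and weights change with the global schema.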
Keywords/Search Tags:duplicate records, deep Web, data cleaning, semi-structured data, global schema