
Research And Application Of Deep Web Data Cleansing

Posted on: 2011-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Peng
Full Text: PDF
GTID: 2208330332972998
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet in recent years, the Deep Web has become an important part of network information resources: through query interfaces, users can dynamically retrieve vast amounts of information from back-end Deep Web databases. Because Deep Web resources are scattered across many sites and are heterogeneous, dynamic, and large in volume, they are inconvenient to use directly; data integration systems for the Deep Web have therefore emerged.

This thesis studies the data cleaning process in a Deep Web data integration system, that is, extracting the query results returned by Deep Web databases and merging them into a unified, structured model. The data cleaning process is divided into two parts, data extraction and data integration; the techniques of both parts are studied, and related algorithms and solutions are proposed. On this basis, a Deep Web oriented prototype system for data extraction is designed. The main work of this thesis is summarized as follows:

(1) An automatic XML-based method for Deep Web data extraction is proposed. The method divides data extraction into five steps: page conversion, page pre-processing, page partition, semantic annotation, and extraction rule generation. First, the Java open source tool WebHarvest transforms the HTML pages into XML documents, which are parsed into DOM trees; a depth-first traversal of each DOM tree then finds and removes the noise data in the page (a sketch of this traversal follows this summary). Next, a weighted partition algorithm based on the DOM tree divides the page into blocks and identifies the data region relevant to the subject of the user's query. Semantic annotations are then added by two algorithms: a data item attribute recognition algorithm, and an attribute value and semantic annotation segmentation algorithm. Finally, the extraction rules of the Web page are produced by an extraction rule generation algorithm (also sketched below).

(2) A method is proposed to integrate the result data from multiple Deep Web data sources. The method divides data integration into two steps: result pattern matching and data consolidation. First, an attribute vector space model is constructed to compute the similarity between attributes, thereby matching the result patterns of the multiple data sources (see the similarity sketch below); then an attribute weight calculation method and a record similarity calculation method are used to find similar duplicate records (see the duplicate detection sketch below). Finally, a similar-data processing method is applied to merge the duplicate data.

(3) Based on the above work, a Deep Web oriented data extraction prototype system is designed. The system is divided into two major parts: a data extraction module and a data integration module. The data extraction module extracts the useful data from the result pages and generates the result pattern and extraction rules. The data integration module matches the result patterns of the individual Deep Web databases to generate a global schema, then extracts the data, stores it in the database, and merges it.
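The noise-removal step in (1) amounts to a depth-first pruning of the DOM tree. Below is a minimal Java sketch of such a traversal; the abstract does not specify the thesis's noise criteria, so the tag list and the input file name "page.xml" are illustrative assumptions (the XML input is taken to be the output of the WebHarvest conversion step).

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NoiseRemover {
    // Tag names treated as page noise; the thesis's actual noise
    // criteria are not given, so this list is illustrative only.
    private static final Set<String> NOISE_TAGS =
            new HashSet<>(Arrays.asList("script", "style", "iframe"));

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // "page.xml" is a placeholder for the XML produced by WebHarvest.
        Document doc = builder.parse(new File("page.xml"));
        removeNoise(doc.getDocumentElement());
    }

    // Depth-first traversal that prunes noise elements in place.
    static void removeNoise(Node node) {
        NodeList children = node.getChildNodes();
        // Iterate backwards so removals do not shift unvisited indices.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && NOISE_TAGS.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child);
            } else {
                removeNoise(child);
            }
        }
    }
}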
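Step (1) ends with extraction rule generation. The abstract does not spell out the rule format; assuming the rules are XPath-like location paths over the converted XML documents, a hypothetical sketch of deriving such a path for a DOM node could look as follows.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;

public class XPathRuleGenerator {
    // Builds an absolute XPath for a DOM node by walking up to the root,
    // counting same-named preceding siblings for positional predicates.
    static String xpathFor(Node node) {
        if (node == null || node.getNodeType() == Node.DOCUMENT_NODE) {
            return "";
        }
        int position = 1;
        for (Node sib = node.getPreviousSibling(); sib != null;
                sib = sib.getPreviousSibling()) {
            if (sib.getNodeType() == Node.ELEMENT_NODE
                    && sib.getNodeName().equals(node.getNodeName())) {
                position++;
            }
        }
        return xpathFor(node.getParentNode())
                + "/" + node.getNodeName() + "[" + position + "]";
    }

    public static void main(String[] args) throws Exception {
        // A toy result page already converted to XML.
        String xml = "<results><record><title>A</title></record>"
                   + "<record><title>B</title></record></results>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes()));
        Node second = doc.getElementsByTagName("title").item(1);
        System.out.println(xpathFor(second)); // /results[1]/record[2]/title[1]
    }
}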
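For the result pattern matching in (2), an attribute vector space model suggests a standard cosine similarity over term vectors built from attribute labels. The sketch below is one plausible reading of that step; the tokenization and any matching threshold are assumptions, not taken from the thesis.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AttributeMatcher {
    // Builds a term-frequency vector from an attribute label,
    // e.g. "book title" -> {book=1, title=1}.
    static Map<String, Integer> termVector(String label) {
        Map<String, Integer> vector = new HashMap<>();
        for (String term : label.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Cosine similarity between two term vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double dot = 0;
        for (String t : shared) {
            dot += a.get(t) * b.get(t);
        }
        double normA = 0, normB = 0;
        for (int f : a.values()) normA += f * f;
        for (int f : b.values()) normB += f * f;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two attribute labels from different Deep Web sources: a pair is
        // matched when the similarity exceeds a chosen threshold.
        double sim = cosine(termVector("Book Title"), termVector("title of book"));
        System.out.println("similarity = " + sim);
    }
}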
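The duplicate record detection in (2) combines per-attribute similarities through attribute weights. A minimal sketch under assumed weights and an assumed threshold follows; normalized edit distance stands in here for whatever attribute-value similarity measure the thesis actually uses.

import java.util.List;

public class DuplicateDetector {
    static final double THRESHOLD = 0.8; // assumed; the thesis's value is not given

    // Record similarity as a weighted sum of per-attribute similarities.
    static double recordSimilarity(List<String> r1, List<String> r2, double[] weights) {
        double sim = 0;
        for (int i = 0; i < weights.length; i++) {
            sim += weights[i] * fieldSimilarity(r1.get(i), r2.get(i));
        }
        return sim;
    }

    // Edit-distance similarity normalized to [0, 1].
    static double fieldSimilarity(String a, String b) {
        if (a.equalsIgnoreCase(b)) return 1.0;
        int d = editDistance(a.toLowerCase(), b.toLowerCase());
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) d / max;
    }

    // Classic dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.3, 0.2}; // assumed attribute weights, summing to 1
        double sim = recordSimilarity(
                List.of("Deep Web Data Cleaning", "Peng", "2011"),
                List.of("Deep Web data cleaning", "Y. Peng", "2011"),
                weights);
        System.out.println(sim >= THRESHOLD ? "duplicate" : "distinct");
    }
}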
Keywords/Search Tags: Deep Web, Data Cleaning, Data Extraction, Data Integration, XML