
Research And Application Of Deep Web Data Cleansing

Posted on: 2011-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Peng
Full Text: PDF
GTID: 2208330332972998
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet in recent years, the Deep Web has become an important part of network information resources: through query interfaces, users can dynamically retrieve vast amounts of information from back-end Deep Web databases. Because Deep Web resources are scattered across many sites and are heterogeneous, dynamic, and large in volume, they are inconvenient to use directly; data integration systems for the Deep Web have therefore emerged.

This thesis studies the data cleaning process in a Deep Web data integration system, that is, extracting the query results returned by Deep Web databases and merging them into a unified, structured model. The data cleaning process is divided into two parts, data extraction and data integration; the techniques of both parts are studied, and related algorithms and solutions are proposed. On this basis, a Deep Web oriented prototype system for data extraction is designed. The main work of this thesis is summarized as follows:

(1) An automatic XML-based method for Deep Web data extraction is proposed. The method divides data extraction into five steps: page conversion, page pre-processing, page partition, semantic annotation, and extraction rule generation. First, the Java open source tool WebHarvest transforms the HTML pages into XML documents, which are parsed into DOM trees; a depth-first traversal of each DOM tree then finds and removes the noise data in the page (a sketch of this traversal follows this summary). Next, a weighted partition algorithm based on the DOM tree divides the page into blocks and identifies the data region relevant to the subject of the user's query. Semantic annotations are then added by two algorithms: a data item attribute recognition algorithm, and an attribute value and semantic annotation segmentation algorithm. Finally, the extraction rules of the Web page are produced by an extraction rule generation algorithm (also sketched below).

(2) A method is proposed to integrate the result data from multiple Deep Web data sources. The method divides data integration into two steps: result pattern matching and data consolidation. First, an attribute vector space model is constructed to compute the similarity between attributes, thereby matching the result patterns of the multiple data sources (see the similarity sketch below); then an attribute weight calculation method and a record similarity calculation method are used to find similar duplicate records (see the duplicate detection sketch below). Finally, a similar-data processing method is applied to merge the duplicate data.

(3) Based on the above work, a Deep Web oriented data extraction prototype system is designed. The system is divided into two major parts: a data extraction module and a data integration module. The data extraction module extracts the useful data from the result pages and generates the result pattern and extraction rules. The data integration module matches the result patterns of the individual Deep Web databases to generate a global schema, then extracts the data, stores it in the database, and merges it.
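The noise-removal step in (1) amounts to a depth-first pruning of the DOM tree. Below is a minimal Java sketch of such a traversal; the abstract does not specify the thesis's noise criteria, so the tag list and the input file name "page.xml" are illustrative assumptions (the XML input is taken to be the output of the WebHarvest conversion step).

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NoiseRemover {
    // Tag names treated as page noise; the thesis's actual noise
    // criteria are not given, so this list is illustrative only.
    private static final Set<String> NOISE_TAGS =
            new HashSet<>(Arrays.asList("script", "style", "iframe"));

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // "page.xml" is a placeholder for the XML produced by WebHarvest.
        Document doc = builder.parse(new File("page.xml"));
        removeNoise(doc.getDocumentElement());
    }

    // Depth-first traversal that prunes noise elements in place.
    static void removeNoise(Node node) {
        NodeList children = node.getChildNodes();
        // Iterate backwards so removals do not shift unvisited indices.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && NOISE_TAGS.contains(child.getNodeName().toLowerCase())) {
                node.removeChild(child);
            } else {
                removeNoise(child);
            }
        }
    }
}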
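Step (1) ends with extraction rule generation. The abstract does not spell out the rule format; assuming the rules are XPath-like location paths over the converted XML documents, a hypothetical sketch of deriving such a path for a DOM node could look as follows.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;

public class XPathRuleGenerator {
    // Builds an absolute XPath for a DOM node by walking up to the root,
    // counting same-named preceding siblings for positional predicates.
    static String xpathFor(Node node) {
        if (node == null || node.getNodeType() == Node.DOCUMENT_NODE) {
            return "";
        }
        int position = 1;
        for (Node sib = node.getPreviousSibling(); sib != null;
                sib = sib.getPreviousSibling()) {
            if (sib.getNodeType() == Node.ELEMENT_NODE
                    && sib.getNodeName().equals(node.getNodeName())) {
                position++;
            }
        }
        return xpathFor(node.getParentNode())
                + "/" + node.getNodeName() + "[" + position + "]";
    }

    public static void main(String[] args) throws Exception {
        // A toy result page already converted to XML.
        String xml = "<results><record><title>A</title></record>"
                   + "<record><title>B</title></record></results>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes()));
        Node second = doc.getElementsByTagName("title").item(1);
        System.out.println(xpathFor(second)); // /results[1]/record[2]/title[1]
    }
}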
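For the result pattern matching in (2), an attribute vector space model suggests a standard cosine similarity over term vectors built from attribute labels. The sketch below is one plausible reading of that step; the tokenization and any matching threshold are assumptions, not taken from the thesis.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AttributeMatcher {
    // Builds a term-frequency vector from an attribute label,
    // e.g. "book title" -> {book=1, title=1}.
    static Map<String, Integer> termVector(String label) {
        Map<String, Integer> vector = new HashMap<>();
        for (String term : label.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Cosine similarity between two term vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double dot = 0;
        for (String t : shared) {
            dot += a.get(t) * b.get(t);
        }
        double normA = 0, normB = 0;
        for (int f : a.values()) normA += f * f;
        for (int f : b.values()) normB += f * f;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two attribute labels from different Deep Web sources: a pair is
        // matched when the similarity exceeds a chosen threshold.
        double sim = cosine(termVector("Book Title"), termVector("title of book"));
        System.out.println("similarity = " + sim);
    }
}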
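The duplicate record detection in (2) combines per-attribute similarities through attribute weights. A minimal sketch under assumed weights and an assumed threshold follows; normalized edit distance stands in here for whatever attribute-value similarity measure the thesis actually uses.

import java.util.List;

public class DuplicateDetector {
    static final double THRESHOLD = 0.8; // assumed; the thesis's value is not given

    // Record similarity as a weighted sum of per-attribute similarities.
    static double recordSimilarity(List<String> r1, List<String> r2, double[] weights) {
        double sim = 0;
        for (int i = 0; i < weights.length; i++) {
            sim += weights[i] * fieldSimilarity(r1.get(i), r2.get(i));
        }
        return sim;
    }

    // Edit-distance similarity normalized to [0, 1].
    static double fieldSimilarity(String a, String b) {
        if (a.equalsIgnoreCase(b)) return 1.0;
        int d = editDistance(a.toLowerCase(), b.toLowerCase());
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) d / max;
    }

    // Classic dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        double[] weights = {0.5, 0.3, 0.2}; // assumed attribute weights, summing to 1
        double sim = recordSimilarity(
                List.of("Deep Web Data Cleaning", "Peng", "2011"),
                List.of("Deep Web data cleaning", "Y. Peng", "2011"),
                weights);
        System.out.println(sim >= THRESHOLD ? "duplicate" : "distinct");
    }
}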
Keywords/Search Tags: Deep Web, Data Cleaning, Data Extraction, Data Integration, XML