Research On Domain-Specific Techniques For Deep Web Data Acquisition

Posted on:2013-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:J B Guo

Full Text:PDF

GTID:2248330371493549

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of Internet technology, the high-quality information resources stored in Web databases have received widespread attention because of their well-defined structure and large data volume. However, these resources are rendered as HTML pages only when users submit queries through a Web query interface, so traditional search engines cannot reach them; this part of the Web is called the Deep Web. To make full use of Deep Web resources, it is therefore necessary to retrieve the data hidden behind query interfaces and to extract structured data from the query result pages.

This thesis studies domain-specific Deep Web data acquisition. The work consists of two parts: data retrieval and data record extraction. The main contributions are summarized as follows:

1) For range-type attributes in Deep Web query interfaces, we propose a sampling-based domain partition method, which improves the efficiency of data retrieval through top-k interfaces.
2) For classification-type attributes, we employ a data retrieval method based on a hierarchical tree model. The method adjusts the order in which queries are submitted and reduces the number of submitted queries.
3) For text-type attributes, we propose a method for selecting candidate values. The method screens candidate values by their distribution in the sample library and increases the average query harvest.
4) Based on the distribution of feature nodes in query result pages, we put forward an algorithm for locating data regions. The algorithm combines page structure with the attribute features of data records, which weakens the influence of page-structure variation on extraction results.
5) For the data record extraction phase, we propose a method that combines characteristic sequence division with tree similarity. The method not only improves the accuracy of data record extraction but also aligns the extracted data records.

Finally, experiments verify that the proposed algorithms are effective. In addition, we design an e-commerce Deep Web information integration system based on the above results.
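
The sampling-based partition of a range-type attribute mentioned in item 1) can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details the abstract does not give. It assumes a hypothetical `query(low, high)` callable that stands in for a top-k interface (at most `k` records per request), a sample of attribute values, and an estimate `est_total` of the database size. Ranges are cut at sample quantiles so that each is expected to hold at most `k` records; any range that still comes back full is bisected, which is a plain fallback added here for illustration.

```python
import math
from typing import Callable, List, Tuple

Range = Tuple[float, float]  # half-open interval [low, high)
Query = Callable[[float, float], List[float]]  # hypothetical top-k interface


def partition_by_sample(sample: List[float], est_total: int, k: int,
                        low: float, high: float) -> List[Range]:
    """Cut [low, high) at sample quantiles so that each piece is
    expected to contain at most k records."""
    pieces = max(1, math.ceil(est_total / k))
    ordered = sorted(v for v in sample if low <= v < high)
    cuts = [low]
    for i in range(1, pieces):
        if not ordered:
            break
        cut = ordered[i * len(ordered) // pieces]
        if cut > cuts[-1]:  # keep the boundaries strictly increasing
            cuts.append(cut)
    cuts.append(high)
    return list(zip(cuts[:-1], cuts[1:]))


def crawl(ranges: List[Range], query: Query, k: int,
          min_width: float = 1e-9) -> Tuple[List[float], int]:
    """Submit one query per range; bisect any range whose result is full
    (len >= k), because a full page may have been truncated."""
    records: List[float] = []
    pending = list(ranges)
    queries = 0
    while pending:
        lo, hi = pending.pop()
        hits = query(lo, hi)
        queries += 1
        if len(hits) >= k and hi - lo > min_width:
            mid = (lo + hi) / 2
            pending += [(lo, mid), (mid, hi)]
        else:
            records.extend(hits)
    return records, queries
```

A better sample gives boundaries that match the real value distribution more closely, so fewer ranges overflow and fewer follow-up queries are needed; that is the efficiency gain the sketch is meant to make concrete.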

Keywords/Search Tags:

Deep Web, data acquisition, data retrieval, data extraction

Related items

1	Research On Deep Web Data Acquisition Based On Visual Information And DOM Tree
2	Research On Deep Web Data Source Discovery And Sampling
3	Research On Issues In Data Acquisition Of Deep Web
4	Key Techniques On Deep Web Data Extraction
5	Research On Adaptive Wrapper In Deep Web Data Extraction
6	Research And Application Of Deep Web Data Cleansing
7	Research And Application On Technology Of Deep Web Schema Acquisition
8	Research On Key Issues In Deep Web Data Integration
9	Study On Methods Of Ontology-Based Deep Web Data Integration
10	Research On Domain-oriented High-quality Deep Web Data Integration Technologies