Research On Source Discovery And Query Results Extraction Of Deep Web

Posted on:2014-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:Y Sheng

GTID:2248330398974689

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of Internet technology, the network contains more and more valuable information. However, the quantity and quality of the information provided vary greatly from site to site, which makes it difficult for people to select high-quality information. Search engines sort and retrieve network resources and greatly improve the efficiency of access to valuable resources. But some resources reside in back-end databases and cannot be retrieved by traditional search engines; these resources are known as the Deep Web. The data contained in the Deep Web is highly structured, abundant, and of high quality, so studying it is of great significance.

To take advantage of the information in the Deep Web, the first problem is discovering Deep Web data sources. The second is that, for the results returned after a query is submitted to a Deep Web data source, automatically locating the data area in the result page is the precondition for information extraction. Both problems are of great importance and are the main subject of this dissertation.

First, the discovery of data sources is studied. This dissertation designs a data source discovery framework. For the problem of deciding whether a form is a query interface, it analyzes the differences between query interfaces and other forms and applies a series of rules to make the judgment. Generally, a data source is limited to a single domain. To find data sources accurately, it must be determined whether a source's category is related to the topic. This dissertation analyzes the shortcomings of traditional classification methods in terms of feature selection and improves the feature selection strategy. Experiments show that the improved method can effectively find source sites relevant to the topic.

Page data extraction is then discussed.
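The abstract does not list the rules used to separate query interfaces from other forms. As an illustration only, the following minimal sketch assumes two hypothetical rules: a form containing a password or file field is treated as a login or upload form, and a form needs at least one text, search, select, or textarea control to count as a query interface. The names FormCollector, is_query_interface, and classify_forms are invented for this sketch and are not taken from the dissertation.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect, for each <form>, the types of the controls it contains."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of control types per form
        self._current = None   # controls of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = []
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to "text".
                self._current.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_query_interface(controls):
    """Apply the assumed rules (illustrative, not the dissertation's)."""
    # Assumed rule 1: password or file fields indicate login/upload forms.
    if "password" in controls or "file" in controls:
        return False
    # Assumed rule 2: at least one field a user can fill with a query.
    query_types = ("text", "search", "select", "textarea")
    return any(c in query_types for c in controls)


def classify_forms(html):
    """Return one boolean per form in the page, in document order."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return [is_query_interface(controls) for controls in parser.forms]
```

Under these assumptions, a page holding a search form with a text box and a select list, followed by a login form with a password field, would be classified as one query interface and one non-query form. A real rule set would likely also look at field labels, submit methods, and the number of fields.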
After analyzing the characteristics of result pages returned by online databases, the author finds that the data areas correspond to structurally similar tag subtrees. This dissertation adopts a new Web page structure similarity algorithm to locate the data area. The new algorithm represents the Web page's tags as a tree and defines a special kind of subtree, so that comparing entire trees is reduced to comparing these particular subtrees. Experiments show that this algorithm discriminates better between different Web tag trees. Once the data area has been located, this dissertation uses Web structure features and keywords to extract the related records and their information.

The last part covers the framework design and the implementation of the main modules. This dissertation designs a Deep Web information integration framework. Based on the Deep Web source discovery and data extraction methods described in Chapters 3 and 4, the main modules of the integration framework are implemented.
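The abstract does not define the special subtree or the similarity measure it uses. As a rough illustration of the general idea of locating a data area by comparing sibling subtrees of the tag tree, the sketch below substitutes a generic measure: the Dice coefficient over multisets of tag paths. The Node class, the function names, and the min_records threshold are assumptions made for this example and do not come from the dissertation.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Build a tag tree from HTML, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Return the multiset of tag paths found in the subtree under node."""
    path = prefix + "/" + node.tag
    paths = Counter([path])
    for child in node.children:
        paths += tag_paths(child, path)
    return paths


def similarity(a, b):
    """Dice coefficient of two tag-path multisets (1.0 = same structure)."""
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))


def find_data_area(root, min_records=3):
    """Return (node, score) for the node whose adjacent children are most alike."""
    best, best_score = None, 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        kids = node.children
        if len(kids) < min_records:
            continue
        pairs = list(zip(kids, kids[1:]))
        score = sum(similarity(x, y) for x, y in pairs) / len(pairs)
        if score > best_score:
            best, best_score = node, score
    return best, best_score
```

Calling find_data_area(build_tree(page_html)) on a result page returns the candidate container together with its average sibling similarity. Comparing only adjacent siblings keeps the cost linear in the number of children; a fuller treatment would also need to break ties between nested regions such as table rows and their cells.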

Keywords/Search Tags:

Deep Web, Sources finding, Query interface, Web structure similarity

Related items

1	Deep Web Sources Classification And Query Interface Schema Extraction Based On Ontology
2	Integrating Deep Web data sources
3	Research Of Query Interface Integration Mechanism In DWIIS System
4	Research Of Query Interface Integration Mechanism In Dwiis System
5	Research On Key Technologies Of Deep Web Data Crawling
6	Research On Quality Estimation Model Of Deep Web Data Sources And Application
7	Study On Deep Web Query Interface Pattern Matching And Query Results Annotation
8	Research On Issues For Uncertainty Of Query In Deep Web
9	SEEDEEP: A system for exploring and querying deep web data sources
10	Research On Direction Finding And Location Algorithms For Wideband Signal