Research On Source Discovery And Query Results Extraction Of Deep Web

Posted on: 2014-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Sheng
GTID: 2248330398974689
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the network holds more and more valuable information. However, the quantity and quality of the information provided differ greatly from site to site, which makes it difficult for people to select high-quality information. Search engine technology can sort and retrieve network resources, greatly improving the efficiency of access to valuable resources. But some resources reside in back-end databases and cannot be retrieved by traditional search engines; these are known as the Deep Web. The data contained in the Deep Web is highly structured, abundant, and of high quality, so studying it is of great significance.

To take advantage of the information in the Deep Web, the first problem is the discovery of Deep Web data sources. The second is that, for the results returned after a query is submitted to the Deep Web, automatically discovering the data area is the precondition for information extraction. Both problems are of great importance and form the main content of this dissertation.

First, the discovery of data sources is studied, and a data source discovery framework is designed. For the problem of deciding whether a form is a query interface, this dissertation analyzes the differences between query interfaces and other forms and applies a series of rules to make the judgment. Generally, a data source is limited to a single domain. To find data sources accurately, it must be determined whether a source's category is related to the topic. This dissertation analyzes the shortcomings of traditional classification methods in terms of feature selection and improves the feature selection strategy. Experiments show that the improved method can effectively find source sites relevant to the topic.

Next, page data extraction is discussed.
After analyzing the characteristics of the results pages returned by online databases, the author finds that the data areas have similar structures in their corresponding tag trees. This dissertation adopts a new Web page structure similarity comparison algorithm to identify the location of the data area. The new algorithm represents Web page tags as a tree and defines a special kind of tree, so that comparing entire trees is turned into comparing these particular subtrees. Experiments show that the algorithm discriminates better between different Web tag trees. After the data area is located, this dissertation uses Web structure features, together with keyword extraction and related records, to extract the information.

Finally, the framework design and the implementation of the main modules are presented. This dissertation designs a Deep Web information integration framework. Based on the Deep Web source discovery and data extraction methods described in Chapters 3 and 4, the main modules of the integrated framework are implemented.
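
The idea of locating a data area by comparing tag subtrees can be illustrated with a small sketch. This is not the dissertation's algorithm: the abstract does not define its "special kind of tree", so the example below substitutes a simple stand-in, Jaccard similarity over the sets of tag paths in two sibling subtrees. All names (`Node`, `TagTreeBuilder`, `tag_paths`, `similarity`, `find_data_region`, `parse`) and the `threshold` and `min_records` values are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have children (a partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of element tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Return the set of tag paths found inside the subtree rooted at node."""
    path = prefix + "/" + node.tag
    paths = {path}
    for child in node.children:
        paths |= tag_paths(child, path)
    return paths


def similarity(a, b):
    """Jaccard similarity of two subtrees' tag-path sets (assumed measure)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)


def find_data_region(root, threshold=0.8, min_records=3):
    """Return the node with the most similar adjacent child subtrees."""
    best, best_count = None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        kids = node.children
        if len(kids) < min_records:
            continue
        similar = sum(
            1 for x, y in zip(kids, kids[1:]) if similarity(x, y) >= threshold
        )
        if similar + 1 >= min_records and similar > best_count:
            best, best_count = node, similar
    return best
```

Calling `find_data_region(parse(html))` on a results page returns the element whose children look most like repeated records, or `None` if no element has enough similar children. Comparing only adjacent siblings keeps the cost low, at the price of missing records that are interleaved with unrelated markup.
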
Keywords/Search Tags: Deep Web, Sources finding, Query interface, Web structure similarity