
Research On The Key Technologies About Preprocessing Of Deep Web Integrated Query System

Posted on: 2013-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhang
GTID: 2248330371972581
Subject: Management Science and Engineering
Abstract/Summary:
With the development of information technology, people increasingly turn to the network to obtain resources. The resources that traditional search engines can retrieve are called the Surface Web, which accounts for only a small fraction of all Web resources. The resources hidden in Web databases, which can be obtained only by submitting a query form that generates dynamic pages, are known as the Deep Web. The Deep Web contains a large amount of specialized information, so accessing these resources efficiently has become a key research issue.

A Deep Web Integrated Query System is a global query system that integrates the different query interfaces of a single domain. Users can retrieve resources from different Web databases by submitting a query form through this global interface. Preprocessing is the first stage of system integration and consists of three steps: Web interface discovery, query interface schema extraction, and query interface integration. Its result strongly affects the subsequent stages of query processing and result processing. Finding efficient methods for every preprocessing step is therefore the starting point of this thesis. The main work is as follows:

(1) The thesis analyzes the characteristics of Deep Web query forms and compares the advantages and disadvantages of current Web interface discovery techniques. It proposes a seed URL selection strategy for focused crawling based on multiple classifiers, improves form classification, and uses a decision-tree-based algorithm to identify query forms that are not Web interfaces.

(2) The thesis studies the schema features of query interfaces and, based on the structural features of HTML pages, proposes a schema extraction method built on the DOM tree and a DWI object model. First, a Web parser parses the interface page into a DOM tree. The tree is then traversed to find each attribute element and its corresponding label. Finally, the DWI object model expresses the schema information of the query interface.

(3) Based on the characteristics of the attribute elements of query interfaces, the thesis proposes a schema matching method built on a semantic model. The method defines similarity formulas for attributes under both simple matching and complex matching, which the thesis reports gives more effective matching results.

To test and verify the effectiveness of these preprocessing techniques, the thesis designs specific experiments, whose results show that the methods are feasible.
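
The schema extraction step in (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not define the DWI object model, so the `QueryAttribute` class, the `extract_schema` function, and the sample form below are hypothetical stand-ins. The sketch also assumes a well-formed XHTML page so that Python's standard `xml.dom.minidom` can build the DOM tree; real pages would likely need a more lenient HTML parser.

```python
from dataclasses import dataclass, field
from xml.dom import minidom


@dataclass
class QueryAttribute:
    """Hypothetical record for one query-form field (not the thesis's DWI model)."""
    name: str
    label: str
    kind: str
    options: list = field(default_factory=list)


def node_text(node):
    """Concatenate all text beneath a DOM node, collapsing whitespace."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(node_text(child))
    return " ".join(" ".join(parts).split())


def walk(node):
    """Yield a node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)


def extract_schema(xhtml):
    """Parse a well-formed page into a DOM tree and pair each field with its label."""
    dom = minidom.parseString(xhtml)
    # Assumption: labels are tied to fields through <label for="id">.
    labels = {
        lab.getAttribute("for"): node_text(lab)
        for lab in dom.getElementsByTagName("label")
        if lab.hasAttribute("for")
    }
    skipped = {"hidden", "submit", "button", "reset", "image"}
    schema = []
    for form in dom.getElementsByTagName("form"):
        for node in walk(form):
            if node.nodeType != node.ELEMENT_NODE:
                continue
            if node.tagName == "input":
                kind = node.getAttribute("type") or "text"
                if kind in skipped:
                    continue  # controls that carry no query attribute
                options = []
            elif node.tagName == "select":
                kind = "select"
                options = [node_text(o) for o in node.getElementsByTagName("option")]
            else:
                continue
            name = node.getAttribute("name")
            # Fall back to the field name when no label is attached.
            label = labels.get(node.getAttribute("id"), name)
            schema.append(QueryAttribute(name, label, kind, options))
    return schema


SAMPLE = """<form action="/search">
  <label for="t">Title</label> <input id="t" name="title" type="text"/>
  <label for="f">Format</label>
  <select id="f" name="format"><option>Hardcover</option><option>Paperback</option></select>
  <input type="submit" value="Search"/>
</form>"""

if __name__ == "__main__":
    for attribute in extract_schema(SAMPLE):
        print(attribute)
```

Matching labels through the `for` attribute is only one possible heuristic. Query forms that place labels by layout position, for example in adjacent table cells, would need additional rules that the abstract does not describe.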
Keywords/Search Tags:Deep Web, Web Interface Discovery, Schema Extraction, Schema Matching