Research On Crawling Deep Web Information

Posted on: 2011-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: K Jiang
Full Text: PDF
GTID: 2178330338979983
Subject: Computer Science and Technology
Abstract/Summary:
Web information can be divided into the Surface Web and the Deep Web according to how deeply the information is buried. The Surface Web mainly consists of "static" pages, which are stored on the server as files. The Deep Web, by contrast, consists of "dynamic" pages whose content is hidden behind search forms and is generated only in response to a direct query. This dynamic content, also known as Deep Web or Hidden Web information, is large in volume, high in quality, and highly structured. Traditional search engine crawlers, however, cannot reach it, so techniques for crawling Deep Web information at large scale have become an active research topic.

In this paper, we first survey the current approaches to searching Deep Web information and analyze their advantages and disadvantages. We then adopt the approach of extending a traditional search engine crawler with the ability to fill in forms automatically, construct query URLs, and fetch the result pages. In this way, a traditional search engine can index Deep Web information in the same way as Surface Web information.

To achieve these goals, we need to address the following main issues:
(1) How to correctly find Deep Web sources and extract their useful information despite the complexity of HTML forms.
(2) Given the diversity of form controls, how to select combinations of queries, and especially how to generate values for text box inputs.
(3) How to extract valuable information from result pages.

To solve these problems, we first analyze and optimize the traditional search engine crawler, then present solutions and algorithms for Deep Web data source discovery, form interface extraction, automatic form filling, Deep Web information extraction, semantic metadata acquisition, and re-crawling of result pages. Finally, we deploy a Deep Web information crawling system built on the traditional crawler, and present an experimental evaluation that validates the correctness of our algorithms.
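
The form-filling and query-URL construction steps described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the names `FormParser` and `build_query_urls` are hypothetical, only GET forms with text boxes and select menus are handled, and the candidate keywords for text boxes are assumed to be supplied by the caller rather than generated automatically.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collect the action, method, and controls of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.text_inputs = []      # names of free-text boxes
        self.selects = {}          # select name -> list of option values
        self._in_form = False
        self._done = False         # True once the first form has been closed
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            if not self._done and not self._in_form:
                self._in_form = True
                self.action = a.get("action") or ""
                self.method = (a.get("method") or "get").lower()
            return
        if not self._in_form:
            return
        if tag == "input":
            # An <input> without a type attribute defaults to a text box.
            if (a.get("type") or "text").lower() == "text" and a.get("name"):
                self.text_inputs.append(a["name"])
        elif tag == "select" and a.get("name"):
            self._current_select = a["name"]
            self.selects[self._current_select] = []
        elif tag == "option" and self._current_select and a.get("value"):
            self.selects[self._current_select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True
        elif tag == "select":
            self._current_select = None


def build_query_urls(page_url, html, keywords):
    """Return one GET query URL per combination of control values.

    `keywords` is an assumed, caller-supplied list of candidate values
    tried in every text box; select menus contribute their own options.
    """
    parser = FormParser()
    parser.feed(html)
    names = parser.text_inputs + list(parser.selects)
    if parser.method != "get" or not names:
        return []
    domains = [keywords] * len(parser.text_inputs) + list(parser.selects.values())
    base = urljoin(page_url, parser.action)
    return [
        base + "?" + urlencode(dict(zip(names, combo)))
        for combo in product(*domains)
    ]
```

Enumerating the full Cartesian product is the simplest possible strategy and grows quickly with the number of controls, which is why the selection of query combinations is listed above as an issue in its own right.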
Keywords/Search Tags: Deep Web, automatic form filling, Deep Web information extraction, semantic metadata acquisition, re-crawling