Research On Crawling Deep Web Information

Posted on:2011-02-02

Degree:Master

Type:Thesis

Country:China

Candidate:K Jiang

Full Text:PDF

GTID:2178330338979983

Subject:Computer Science and Technology

Abstract/Summary:

Web information can be divided into the Surface Web and the Deep Web according to how deeply the information is buried. The Surface Web mainly consists of "static" pages, which are stored on the server as files. The Deep Web, by contrast, consists of "dynamic" pages whose content is hidden behind search forms and is generated only in response to a direct request. This dynamic content, also known as Deep Web or Hidden Web information, is large in volume, high in quality, and highly structured. Traditional search engine crawlers, however, cannot reach it, so techniques for crawling Deep Web information at large scale have become an active research topic.

In this paper, we first survey the current approaches to searching Deep Web information and analyze their advantages and disadvantages. We then adopt the approach of extending a traditional search engine crawler with the ability to fill in forms automatically, construct query URLs, and obtain the result pages. In this way, a traditional search engine can index Deep Web information in the same way as Surface Web information.

To achieve these goals, we need to address the following main issues:
(1) How to correctly find Deep Web sources and extract their useful information despite the complexity of HTML forms.
(2) Given the diversity of form controls, how to select combinations of queries, and especially how to generate values for text box inputs.
(3) How to extract valuable information from result pages.

To solve these problems, we first analyze and optimize the traditional search engine crawler, then present solutions and algorithms for Deep Web data source discovery, form interface extraction, automatic form filling, Deep Web information extraction, semantic metadata acquisition, and re-crawling of result pages. Finally, we deploy a Deep Web information crawling system built on the traditional crawler and present an experimental evaluation validating the correctness of our algorithms.
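
As an illustration of the form-filling and query-URL construction step described in the abstract, the following is a minimal Python sketch and not the thesis's actual algorithm. It assumes a single GET form per page, takes candidate text-box values from a caller-supplied dictionary (how such values are generated is the open problem the thesis addresses), and enumerates every combination of values. The names `FormExtractor`, `build_query_urls`, and `candidate_values` are hypothetical.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collects the action URL and controls of the first GET form on a page."""

    def __init__(self):
        super().__init__()
        self.action = None          # action attribute of the form, if one is found
        self.in_form = False
        self.done = False           # True once the first GET form has been closed
        self.text_inputs = []       # names of free-text boxes
        self.selects = {}           # select name -> list of option values
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            method = (a.get("method") or "get").lower()
            if not self.done and not self.in_form and method == "get":
                self.in_form = True
                self.action = a.get("action") or ""
            return
        if not self.in_form:
            return
        if tag == "input" and (a.get("type") or "text").lower() == "text" and a.get("name"):
            self.text_inputs.append(a["name"])
        elif tag == "select" and a.get("name"):
            self._current_select = a["name"]
            self.selects[a["name"]] = []
        elif tag == "option" and self._current_select and a.get("value"):
            self.selects[self._current_select].append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None
        elif tag == "form" and self.in_form:
            self.in_form = False
            self.done = True


def build_query_urls(page_url, html, candidate_values):
    """Return one query URL per combination of control values.

    candidate_values maps a text-box name to a list of values to try
    (an assumption of this sketch; text boxes without candidates are
    submitted empty). Assumes the form action has no query string.
    """
    parser = FormExtractor()
    parser.feed(html)
    if parser.action is None:
        return []
    base = urljoin(page_url, parser.action)
    names = parser.text_inputs + list(parser.selects)
    domains = [candidate_values.get(n, [""]) for n in parser.text_inputs]
    domains += [parser.selects[n] or [""] for n in parser.selects]
    return [base + "?" + urlencode(dict(zip(names, combo)))
            for combo in product(*domains)]
```

Enumerating the full Cartesian product grows quickly with the number of controls, which is why selecting a useful subset of query combinations is listed above as a core issue; a practical crawler would prune this set rather than submit every combination.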

Keywords/Search Tags:

Deep Web, automatic form filling, Deep Web information extraction, semantic metadata acquisition, re-crawling

Related items

1	A Study Of Automatic Form-Filling Based On CNN And BiLSTM-CRF
2	Domain-Oriented Incremental Deep Web Crawling
3	Research On The Issues Of Semantic Annotation Based Automatic Metadata Construction
4	Research On Ontology-based Automatic Filling Forms Of Deep Web Entries
5	Study On Data Extraction And Semantic Annotation For Specific Field Deep Web
6	Research On Text Filling Algorithm Based On Deep Learning
7	Discovery Of Query Interfaces And Extraction Of Metadata Information On The Domain-Oriented Deep Web
8	Research And Application Of Web Crawling Algorithm Based On Semantic Analysis
9	Research On Technology Of Deep Web Oriented Data Extraction And Semantic Annotation
10	Research On Key Technologies Of Deep Web Information Integration