Research On An Ajax Supported Deep Web Crawler Model

Posted on:2012-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:C H Guan

Full Text:PDF

GTID:2178330335955558

Subject:Management Science and Engineering

Abstract/Summary:

The rapidly growing body of information on the Internet has become a vast resource. Only a small portion of it, the Surface Web, can be reached through search engines, while a much larger and more valuable portion, the Deep Web, is not available through general-purpose search engines. Research on collecting and crawling Deep Web information has therefore attracted growing attention.

Because many web pages are dynamic, a significant part of the Deep Web cannot be crawled effectively. The underlying reason is that more and more websites use dynamic scripts to interact with users, in particular Ajax, which is widely applied in web development. Ajax has changed the traditional website architecture based on static pages and has improved the user experience. However, its characteristics, such as JavaScript execution and state identification and switching, mean that general web crawlers cannot access Ajax-based sites or the back-end server resources behind them. These inaccessible resources become part of the Deep Web, as distinct from Surface Web resources. Retrieving information from Ajax-based Deep Web sites is therefore becoming increasingly important, and gaining access to such information is the starting point of this thesis. The main contents of this thesis are as follows:

(1) After studying the architecture and working principle of general web crawlers, and focusing on the problems a crawler faces when crawling Ajax pages (such as JavaScript execution and state identification and switching), this thesis presents the architecture and algorithms of a Deep Web crawler, called AjaxFetcher, based on a StateStorage.

(2) Using the additional functionality of an embedded browser, AjaxFetcher can simulate and execute JavaScript events in pages and receive asynchronous responses from the server. It builds the StateStorage of an Ajax-based website incrementally, identifying newly generated states by analyzing changes in the DOM structure. The StateStorage records the structure and state information of each page.

(3) When crawling Ajax pages that use pagination, each call to the same Ajax function triggers the same server-side response, so crawling efficiency needs to be improved. This thesis marks JavaScript functions that contain Ajax requests as hot spots and improves the above algorithm with a response-caching strategy, which reduces the performance loss caused by communication with the server.

Finally, this thesis verifies the validity of the new crawler through comparative experiments, whose results show that it can retrieve more Deep Web resources from Ajax pages.
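
The two mechanisms described in (2) and (3) can be illustrated with a minimal Python sketch. It is not the thesis's AjaxFetcher implementation: the abstract does not give the algorithms, so every name here (`dom_fingerprint`, `StateStorage`, `HotSpotCache`, `crawl`, the `loadPage` function name, and the three-page toy site) is a hypothetical stand-in. In particular, hashing the whitespace-normalized DOM text to decide whether a state is new is an assumption, and a stub function replaces the embedded browser and the real server.

```python
import hashlib
from collections import deque
from typing import Callable, Dict, List, Tuple


def dom_fingerprint(dom: str) -> str:
    """Hash the whitespace-normalized DOM text (assumed comparison rule)."""
    normalized = " ".join(dom.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class StateStorage:
    """Hypothetical store of page states and the events that link them."""

    def __init__(self) -> None:
        self.states: Dict[str, str] = {}  # fingerprint -> serialized DOM
        self.transitions: List[Tuple[str, str, str]] = []  # (src, event, dst)

    def add_state(self, dom: str) -> Tuple[str, bool]:
        """Record the DOM; return its fingerprint and whether it is new."""
        fp = dom_fingerprint(dom)
        is_new = fp not in self.states
        if is_new:
            self.states[fp] = dom
        return fp, is_new


class HotSpotCache:
    """Caches responses of Ajax 'hot spot' functions by name and arguments."""

    def __init__(self, fetch: Callable[[str, tuple], str]) -> None:
        self._fetch = fetch
        self._cache: Dict[Tuple[str, tuple], str] = {}
        self.server_calls = 0  # requests that actually reached the server

    def call(self, function_name: str, args: tuple) -> str:
        key = (function_name, args)
        if key not in self._cache:
            self.server_calls += 1
            self._cache[key] = self._fetch(function_name, args)
        return self._cache[key]


def crawl(initial_dom: str, events: List[str],
          fire_event: Callable[[str, str], str],
          storage: StateStorage) -> StateStorage:
    """Breadth-first exploration: fire every event in every known state."""
    start, _ = storage.add_state(initial_dom)
    queue = deque([start])
    while queue:
        src = queue.popleft()
        for event in events:
            new_dom = fire_event(storage.states[src], event)
            dst, is_new = storage.add_state(new_dom)
            storage.transitions.append((src, event, dst))
            if is_new:  # only unseen states are explored further
                queue.append(dst)
    return storage


# Toy paginated site standing in for a real Ajax page (assumption).
PAGES = {1: "<ul><li>a</li></ul>", 2: "<ul><li>b</li></ul>", 3: "<ul><li>c</li></ul>"}


def fake_server(function_name: str, args: tuple) -> str:
    (page,) = args
    return PAGES[page]


cache = HotSpotCache(fake_server)


def fire_event(dom: str, event: str) -> str:
    """Stub for the embedded browser: apply 'next' or 'prev' to the DOM."""
    current = next(n for n, body in PAGES.items() if body in dom)
    target = current + 1 if event == "next" else current - 1
    target = min(max(target, 1), len(PAGES))  # clamp to valid pages
    return f"<div id='list'>{cache.call('loadPage', (target,))}</div>"


initial = f"<div id='list'>{PAGES[1]}</div>"
storage = crawl(initial, ["next", "prev"], fire_event, StateStorage())
print(len(storage.states), len(storage.transitions), cache.server_calls)
```

On this toy site the crawl records three states and six transitions while the stub server is contacted only three times, because repeated calls to the same hot-spot function with the same arguments are answered from the cache.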

Keywords/Search Tags:

Ajax, Deep Web, Web Crawler, StateStorage, Embedded Explorer

Related items

1	Design And Implementation Of An Ajax Supported Deep Web Crawler System
2	Design And Implementation Of An Ajax-Supported Deep Web Crawler (Shanghai Jiao Tong University)
3	Research Of Deep Web Crawler Supporting Ajax
4	Research And Implementation On Theme Web Crawler Of Supporting Ajax
5	Design And Implementation Of A Web Crawler System Supported AJAX
6	Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications
7	Design And Implementation Of A Web Crawler Friendly To Ajax
8	A Web Crawler Supporting AJAX
9	Research And Implement Of Distributed Crawler System Supporting AJAX
10	Design And Implementation Of Social Network Information Crawler