Research On An Ajax Supported Deep Web Crawler Model

Posted on: 2012-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: C H Guan
Full Text: PDF
GTID: 2178330335955558
Subject: Management Science and Engineering
Abstract/Summary:
The rapidly growing body of information on the Internet has become a huge treasure. Only a small portion of it, called the Surface Web, can be reached through search engines, while a much larger and more valuable portion, called the Deep Web, is not available through general-purpose search engines. Research on collecting and crawling Deep Web information has therefore attracted increasing attention.

Because of the dynamic nature of modern web pages, a significant part of the information on the Deep Web cannot be crawled effectively. The underlying reason is that more and more websites use dynamic scripts to interact with users, in particular Ajax, which is widely applied in web development. Ajax has changed the traditional website architecture based on static pages and has improved the user experience. However, its characteristics, such as JavaScript execution and state identification and switching, make Ajax-based sites and their back-end server resources inaccessible to general web crawlers, and these inaccessible resources become part of the Deep Web rather than the Surface Web. How to retrieve information from Ajax-based Deep Web sites is therefore becoming increasingly important, and gaining access to such information is the starting point of this paper. The main contents of this paper are as follows:

(1) After studying the architecture and working principles of general web crawlers, and focusing on the problems a crawler faces when crawling Ajax pages, such as JavaScript execution and state identification and switching, this paper presents the architecture and algorithms of a Deep Web crawler (called AjaxFetcher in this paper) based on a StateStorage.

(2) By adding an embedded browser, AjaxFetcher is able to simulate and execute JavaScript events in pages and to accept asynchronous responses from the server.
It gradually builds the StateStorage of an Ajax-based website by identifying newly generated states through analysis of changes in the DOM structure. The StateStorage records the structure and state information of each page.

(3) When crawling Ajax pages that use pagination, every call to the same Ajax function produces the same server-side response, so repeated requests are wasteful and crawling efficiency needs to be improved. This paper marks the JavaScript functions that contain Ajax requests as hot spots and improves the algorithm described above with a strategy that caches response data, which reduces the performance loss caused by communication with the server.

Finally, this paper verifies the validity of the new web crawler through comparative experiments, whose results show that the crawler is able to retrieve more Deep Web resources from Ajax pages.
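The two ideas described in (2) and (3), detecting new states from DOM changes and caching the responses of hot-spot functions, can be illustrated with a small sketch. The abstract does not give AjaxFetcher's actual data structures or comparison method, so everything below is an assumption made for illustration: the `dom_fingerprint` helper (a hash over tag names and text), the fields of the `StateStorage` class, the `HotSpotCache` class, and the `fake_server` function with its `loadPage` call, which stands in for the embedded browser and the real server.

```python
import hashlib
from html.parser import HTMLParser


class _DomTokens(HTMLParser):
    """Collects tag names and non-empty text as a flat token list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.tokens.append(text)


def dom_fingerprint(html):
    """Hash of the DOM tokens; two pages with equal hashes count as one state."""
    parser = _DomTokens()
    parser.feed(html)
    parser.close()
    joined = "\n".join(parser.tokens)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class StateStorage:
    """Hypothetical state store: known DOM states plus event transitions."""

    def __init__(self):
        self.states = {}       # fingerprint -> state id
        self.transitions = []  # (source id, event label, target id)

    def add_state(self, html):
        """Return (state id, is_new) for the given DOM snapshot."""
        fingerprint = dom_fingerprint(html)
        is_new = fingerprint not in self.states
        if is_new:
            self.states[fingerprint] = len(self.states)
        return self.states[fingerprint], is_new

    def add_transition(self, source, event, target):
        self.transitions.append((source, event, target))


class HotSpotCache:
    """Caches responses of hot-spot functions, keyed by name and arguments."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._responses = {}
        self.server_calls = 0

    def call(self, function_name, args):
        key = (function_name, args)
        if key not in self._responses:
            self.server_calls += 1  # only cache misses reach the server
            self._responses[key] = self._fetch(function_name, args)
        return self._responses[key]


def fake_server(function_name, args):
    """Stand-in for an Ajax endpoint that returns one page of a list."""
    return "<ul><li>item %d</li></ul>" % args[0]


def crawl_demo():
    storage = StateStorage()
    cache = HotSpotCache(fake_server)
    current, _ = storage.add_state("<div id='list'></div>")
    for page in (1, 2, 1, 2):  # pages 1 and 2 are each visited twice
        fragment = cache.call("loadPage", (page,))
        dom = "<div id='list'>" + fragment + "</div>"
        state, _ = storage.add_state(dom)
        storage.add_transition(current, "loadPage(%d)" % page, state)
        current = state
    return len(storage.states), cache.server_calls, len(storage.transitions)


if __name__ == "__main__":
    print(crawl_demo())  # (3, 2, 4)
```

In this toy run, four pagination events produce only three distinct states and two server requests, because the revisited pages are recognized by their fingerprints and served from the cache. A real crawler would also need a cache invalidation policy for responses that change over time, which this sketch leaves out.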
Keywords/Search Tags:Ajax, Deep Web, Web Crawler, StateStorage, Embedded Explorer