Design And Implementation Of An Ajax Supported Deep Web Crawler System

Posted on:2016-11-26

Degree:Master

Type:Thesis

Country:China

Candidate:Z F Duan

Full Text:PDF

GTID:2308330503453241

Subject:Software engineering

Abstract/Summary:

Ajax technology changed the traditional static web page model. Because Ajax relies on JavaScript event execution and on recognizing and switching between page states, a page can update its content without reloading the whole page. As a result, traditional web crawlers cannot fetch that content. Such resources are distinguished from Surface Web resources and are called the Deep Web. Ajax is now widely used to build websites, so a large amount of Deep Web content is hidden and unavailable to users. It is therefore increasingly important to find an effective way to crawl Ajax-driven dynamic pages and thereby extract Deep Web resources.

This thesis first analyzes the principles of Ajax and the root cause of Deep Web resources, then explains how web crawlers work and how a crawler system is built. Following a software engineering approach, it designs and implements a crawler system called Spideep, which can parse Ajax dynamic web pages and crawl Deep Web resources. The thesis describes the requirements analysis, outline design, detailed design, and implementation of the system, as well as the software development process.

The system is divided into three main modules: the Worker Line module, the Task Manager module, and the URL scheduling module. The Worker Line module is responsible for the entire crawling process, the Task Manager module manages the crawler's tasks, and the URL scheduling module schedules the URL queue. The Worker Line module comprises page fetching (Fetcher), page analysis (Extractor), content filtering (Filter), and content storage (Writer).

To crawl Ajax pages, the implementation embeds HtmlUnit, a headless (GUI-less) browser, in the crawler system. First, Rhino, the JavaScript engine used by HtmlUnit, analyzes the JavaScript in the page source, and the DOM tree is then reconstructed. Second, HtmlUnit provides a number of page components for simulating browser operations such as button clicks, page turning, sliding, and other user behaviors. These simulated operations cause the information hidden in the Deep Web to be displayed dynamically. Finally, an HTML parsing tool, HtmlParser or Jsoup, analyzes the DOM tree to extract valuable Deep Web information.

An experiment then verified Spideep in three respects: Ajax page crawling, stability, and availability. The analysis of the experimental results shows that Spideep performs well and meets the requirements of its intended use.
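The module structure described in the abstract can be illustrated with a minimal sketch. It is not Spideep's code: the names `UrlScheduler`, `LinkAndTextExtractor`, and `run_worker_line` are hypothetical, the Task Manager module is omitted, and Python's standard library stands in for HtmlUnit and Jsoup. The fetcher is passed in as a plain function and executes no JavaScript here. In a system like the one described, that role would fall to the headless browser.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndTextExtractor(HTMLParser):
    """Extractor stage (hypothetical): collects link targets and text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


class UrlScheduler:
    """URL scheduling (hypothetical): a FIFO queue that never repeats a URL."""

    def __init__(self, seeds):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        url, _fragment = urldefrag(url)  # "#top" variants are the same page
        if url and url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None


def run_worker_line(scheduler, fetch, keep, max_pages=100):
    """Run Fetcher -> Extractor -> Filter -> Writer for at most max_pages URLs."""
    store = {}  # Writer stage: an in-memory dict stands in for real storage
    for _ in range(max_pages):
        url = scheduler.next_url()
        if url is None:
            break
        html = fetch(url)                    # Fetcher
        if html is None:
            continue
        extractor = LinkAndTextExtractor()   # Extractor
        extractor.feed(html)
        for href in extractor.links:
            scheduler.add(urljoin(url, href))
        text = " ".join(extractor.text_parts)
        if keep(url, text):                  # Filter
            store[url] = text                # Writer
    return store


# Usage with canned pages instead of real network access (assumed test data).
PAGES = {
    "http://example.test/": '<a href="/a">A</a><a href="/b#top">B</a> home',
    "http://example.test/a": '<p>deep content</p><a href="/">back</a>',
    "http://example.test/b": "<p>   </p>",
}
result = run_worker_line(
    UrlScheduler(["http://example.test/"]),
    fetch=PAGES.get,
    keep=lambda url, text: bool(text),
)
```

Injecting `fetch` and `keep` as functions keeps the stages independent, so a JavaScript-capable fetcher could replace the canned one without touching the scheduler or the other stages.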

Keywords/Search Tags:

Crawler, Ajax, DOM Tree, JavaScript, Deep Web, HtmlUnit

Related items

1	Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications
2	Design And Implementation Of A Web Crawler Friendly To Ajax
3	Research On An Ajax Supported Deep Web Crawler Model
4	The Design And Implementation Of The Deep Web Acquisition System
5	Design And Implementation Of An Ajax-supported Deep Web Crawler (Shanghai Jiao Tong University)
6	An Approach Based On WSFT Model For Crawling Deep Web
7	Research Of Deep Web Crawler Supporting Ajax
8	Research And Optimization Of Dynamic Web Crawler Based On Webmagic
9	Research And Implementation On Theme Web Crawler Of Supporting Ajax
10	Social Network Data Acquisition Technology And Implementation