Design And Implementation Of An Ajax Supported Deep Web Crawler System

Posted on: 2016-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Duan
Full Text: PDF
GTID: 2308330503453241
Subject: Software engineering
Abstract/Summary:
Ajax technology has changed the traditional static structure of web pages. Because it relies on JavaScript event execution, state recognition and state switching, part of a page can be updated without reloading the whole page, and the resulting content cannot be captured by a traditional web crawler. Such resources are distinguished from Surface Web resources and are called the Deep Web. Ajax is now widely used to build websites, so a large number of Deep Web resources remain hidden and unavailable to users. It is therefore increasingly important to find an effective way to capture Ajax dynamic script pages and so extract Deep Web resources.

This thesis first analyzes the principles of Ajax technology and the root cause of Deep Web resources, and then explains how web crawlers work and how a crawler system is built. Following software engineering methods, it designs and implements a crawler system called Spideep, which can parse Ajax dynamic web pages and crawl Deep Web resources. The thesis describes the requirements analysis, outline design, detailed design and implementation of the system, as well as the software development process. The system is divided into three main modules: the Worker Line module, the Task Manager module and the URL scheduling module. The Worker Line module is responsible for the whole crawling process, the Task Manager module manages the crawler system's tasks, and the URL scheduling module schedules the URL queue. The Worker Line module consists of web crawling (Fetcher), web analysis (Extractor), content filtering (Filter) and content storage (Writer).

To crawl Ajax pages, the system embeds HtmlUnit, a browser without a graphical interface. First, Rhino, the JavaScript parser component of HtmlUnit, analyzes the JavaScript in the page source and the DOM tree is then reconstructed. Second, HtmlUnit provides a number of page components for simulating browser operations and user behaviors such as clicking a button, turning pages and sliding. Through these simulations, the information hidden in the Deep Web is displayed dynamically. Finally, the web parsing tools Htmlparser or Jsoup analyze the DOM tree to extract valuable Deep Web information.

An experiment was carried out to verify three aspects of the Spideep Deep Web crawler system: Ajax page crawling, stability and availability. The analysis of the experimental results shows that Spideep performs well and meets the requirements of its intended use.
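The division of labor between the URL scheduling module and the four Worker Line stages can be illustrated with a minimal Java sketch. It is not taken from Spideep: every class and method name here (`WorkerLineSketch`, `UrlScheduler`, `extract`, `keep`, `crawl`) is hypothetical, the `text|link1,link2` page format is invented for the illustration, and the Fetcher is a plain function where a real system would plug in HtmlUnit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a Worker Line: Fetcher -> Extractor -> Filter -> Writer. */
final class WorkerLineSketch {

    /** URL scheduling: a FIFO queue that never hands out the same URL twice. */
    static final class UrlScheduler {
        private final Deque<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();

        void offer(String url) {
            if (seen.add(url)) {          // add() returns false for a known URL
                queue.addLast(url);
            }
        }

        String next() {
            return queue.pollFirst();     // null when the queue is empty
        }
    }

    /** Output of the Extractor stage: page text plus outgoing links. */
    static final class Extracted {
        final String text;
        final List<String> links;

        Extracted(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    /** Extractor: parses the invented "text|link1,link2" stand-in format. */
    static Extracted extract(String raw) {
        String[] parts = raw.split("\\|", 2);
        List<String> links = (parts.length < 2 || parts[1].isEmpty())
                ? List.<String>of()
                : Arrays.asList(parts[1].split(","));
        return new Extracted(parts[0], links);
    }

    /** Filter: keeps only pages that carry some text. */
    static boolean keep(String text) {
        return !text.trim().isEmpty();
    }

    /** Runs the pipeline from a seed URL; the returned list plays the Writer's role. */
    static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        UrlScheduler scheduler = new UrlScheduler();
        List<String> stored = new ArrayList<>();
        scheduler.offer(seed);
        int fetched = 0;
        String url;
        while (fetched < maxPages && (url = scheduler.next()) != null) {
            String raw = fetcher.apply(url);          // Fetcher
            fetched++;
            if (raw == null) {
                continue;                             // unreachable page
            }
            Extracted page = extract(raw);            // Extractor
            page.links.forEach(scheduler::offer);     // feed new URLs back to the scheduler
            if (keep(page.text)) {                    // Filter
                stored.add(page.text);                // Writer
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        // A three-page in-memory "web" standing in for real HTTP fetches.
        Map<String, String> fakeWeb = Map.of(
                "a", "Ajax intro|b,c",
                "b", "|a",
                "c", "Deep Web data|a,b");
        System.out.println(crawl("a", fakeWeb::get, 10)); // prints [Ajax intro, Deep Web data]
    }
}
```

Passing the Fetcher in as a function keeps the scheduling and filtering logic independent of how pages are rendered, so a JavaScript-capable fetcher could in principle replace the in-memory map without touching the other stages.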
Keywords/Search Tags: Crawler, Ajax, DOM Tree, JavaScript, Deep Web, HtmlUnit