Research On Algorithm Of Crawling Ajax Dynamic Web Pages Based On User Interface State Changes

Posted on:2017-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:L Yang

Full Text:PDF

GTID:2308330482479296

Subject:Computer technology

Abstract/Summary:

Using JavaScript and dynamic DOM manipulation on the client side is becoming a widespread way to achieve rich interactivity and responsiveness in modern Web applications. With the rise of Web 2.0, Ajax is widely used in web development. At the same time, these techniques break the concept of Web pages with unique URLs, on which traditional Web crawlers are based. In a traditional Web site each URL identifies one static page, whereas an Ajax Web application has many states. State changes are not reflected in the URL of the page, but in changes to the dynamic document object model (DOM). Some existing algorithms for crawling Ajax-based web applications have limited accuracy or suffer from state explosion. Based on the characteristics of Ajax dynamic web pages, this thesis designs an Ajax crawler system that can crawl dynamic pages.

This thesis proposes an algorithm for crawling Ajax dynamic web pages based on user interface state changes. The algorithm first initializes an empty state graph and a crawling queue, obtains the DOM tree of the initial page from a given URL, and traverses the DOM tree to identify the candidate clickable elements that can cause a state change. It then triggers the events bound to these candidate elements and compares the states before and after each event; if a new state is found, the state graph is updated. To trigger all the events bound within the same state, the algorithm implements a method for backtracking to the previous state. This method is also used to determine whether any candidate clickable elements in that state have not yet been examined.

To improve the performance of Ajax crawling, we also discuss a concurrent version of the algorithm. In the concurrent algorithm a controller monitors all crawling nodes, and each crawling node has its own state machine and browser instance, with which it crawls a specific path. The controller distributes work to the crawling nodes. The task of our dynamic partition function is to distribute the work equally over all participating crawling nodes. The partition function operates as follows: after a new state is discovered, if there are still unexplored candidate clickables left in the previous state, that previous state is assigned to another thread for further exploration. The multi-threaded crawling algorithm therefore does not need to reload the browser and return to the previous state, which greatly shortens the crawling time.

Finally, the algorithm is applied to several real dynamic Web pages to verify its feasibility and effectiveness.
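
The sequential part of the procedure described above (state graph, crawling queue, candidate clickables, state comparison, backtracking) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `TOY_APP`, `state_id`, `candidate_clickables` and `crawl` are hypothetical, the browser is replaced by a small lookup table, and comparing states by a hash of the serialized DOM is an assumption made only for this example.

```python
import hashlib
from collections import deque

# Hypothetical stand-in for a browser: (DOM, clickable) -> DOM after the click.
# A real crawler would fire the event in a browser instance instead.
TOY_APP = {
    ("<home>", "tab-news"): "<news>",
    ("<home>", "tab-about"): "<about>",
    ("<news>", "more"): "<news-full>",
    ("<news>", "home"): "<home>",
    ("<about>", "home"): "<home>",
}


def state_id(dom):
    """Identify a UI state by a hash of its serialized DOM (an assumption)."""
    return hashlib.sha1(dom.encode("utf-8")).hexdigest()[:8]


def candidate_clickables(dom):
    """Return the clickables of this DOM that may cause a state change."""
    return sorted(click for (d, click) in TOY_APP if d == dom)


def crawl(initial_dom):
    """Build the state graph as {state: {clickable: next state}}."""
    graph = {}                                  # empty state graph
    doms = {state_id(initial_dom): initial_dom}
    queue = deque([state_id(initial_dom)])      # crawling queue
    while queue:
        sid = queue.popleft()
        if sid in graph:
            continue
        graph[sid] = {}
        for click in candidate_clickables(doms[sid]):
            # Backtracking is simulated here: every event is fired from the
            # stored DOM of `sid`, so each clickable starts from the same state.
            new_dom = TOY_APP[(doms[sid], click)]
            nid = state_id(new_dom)
            graph[sid][click] = nid             # record the transition
            if nid not in doms:                 # state comparison: new state?
                doms[nid] = new_dom
                queue.append(nid)
    return graph, doms


if __name__ == "__main__":
    graph, doms = crawl("<home>")
    for sid, edges in graph.items():
        print(doms[sid], {c: doms[n] for c, n in edges.items()})
```

In this sketch, returning to the previous state costs nothing because the DOM is only a string. With a real browser, that step means reloading the page and replaying events, which is the cost the concurrent variant avoids by handing a state with unexplored clickables to another crawling node.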

Keywords/Search Tags:

Ajax, Dynamic Web Page, DOM, State graph, Concurrent crawling

Related items

1	Research And Implementation On Web Page Crawling And Analyzing Techniques For AJAX Script Network
2	Vertical Search Engine For Crawling The Web Page Design And Implementation
3	Detecting And Locating Atomicity Violations In AJAX-based Web Applications
4	Research Of Task Scheduling And AJAX Page Fetching On Distributed Crawler
5	Research On Customized Web Information Crawling And Pushing Techniques
6	Design And Implementation Of The Dynamic Crawler System Based On State Transition
7	Key Technology Research On Web Forums Crawling And Hot Topic Detection
8	Research On Ontology-based Video Website Supervision Method
9	Research And Implementation Of A Combined Focused Crawler Based On Protocol-Driven And Event-Driven Crawling Techniques
10	Research On Extensible Hash Based Dynamic Load Balancing For Parallel Web Crawling