| Using JavaScript and dynamic DOM manipulation on the client side is becoming a widespread approach for achieving rich interactivity and responsiveness in modern Web applications. With the rise of Web 2.0, Ajax is widely used in Web development. At the same time, such techniques shatter the concept of Web pages with unique URLs, on which traditional Web crawlers are based. In a traditional website each URL identifies a static page, whereas an Ajax Web application has many states. Changes of state are not reflected in the URL of the page but in changes to the dynamic Document Object Model (DOM). Some existing algorithms for crawling Ajax-based Web applications do not achieve high accuracy or suffer from state explosion. Based on the characteristics of Ajax dynamic Web pages, this paper designs an Ajax crawler system that can crawl dynamic pages. The paper proposes an algorithm for crawling Ajax dynamic Web pages based on user interface state changes. It first initializes an empty state graph and crawling queue, obtains the DOM tree of the initial page from a given URL, and traverses the DOM tree to identify the candidate clickable elements that may cause a state change. The algorithm then triggers the events bound to these candidate elements and compares the resulting states; if a new state is found, it updates the state graph. To trigger all the events bound within the same state, the algorithm implements a method for backtracking to the previous state, which is also used to determine whether any candidate clickable elements in that state remain unexplored. To improve the performance of the Ajax crawling algorithm, we also discuss a concurrent Ajax crawling algorithm. In the concurrent algorithm a controller monitors all the crawling nodes, and each crawling node is responsible for its own state machine and browser instance, which it uses to crawl a specific path. 
The controller distributes work to all crawling nodes. The task of our dynamic partition function is to distribute the work equally over all participating crawling nodes. The proposed partition function operates as follows: after a new state is discovered, if unexplored candidate clickables remain in the previous state, that previous state is assigned to another thread for further exploration. The multi-threaded crawling algorithm therefore does not need to reload the browser and return to the previous state, which greatly shortens the crawling time. Finally, the algorithm is applied to several real dynamic Web pages to verify its feasibility and effectiveness. |
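
The single-threaded crawl described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ToyBrowser`, `TOY_APP`, `state_id`, `go_to`, and `crawl` are hypothetical names, the "browser" is an in-memory transition table rather than a real browser, states are compared by hashing the whole DOM string, and backtracking is modelled as a reload followed by replaying the click path.

```python
import hashlib
from collections import deque

# Hypothetical toy application: each DOM snapshot maps clickable element ids
# to the DOM that results from firing that element's event.
TOY_APP = {
    "<home>": {"btn-news": "<news>", "btn-about": "<about>"},
    "<news>": {"btn-home": "<home>", "btn-more": "<news-more>"},
    "<about>": {"btn-home": "<home>"},
    "<news-more>": {},
}


class ToyBrowser:
    """Stand-in for a real browser instance (assumption, for illustration only)."""

    def __init__(self, app, start_dom):
        self.app = app
        self.start_dom = start_dom
        self.dom = start_dom

    def reload(self):
        # Reloading the URL always returns to the initial state.
        self.dom = self.start_dom

    def clickables(self):
        # Candidate clickable elements found in the current DOM.
        return sorted(self.app[self.dom])

    def click(self, element):
        # Fire the event bound to the element; the DOM may change.
        self.dom = self.app[self.dom][element]


def state_id(dom):
    """Identify a state by a hash of its DOM (a simplistic comparison)."""
    return hashlib.sha1(dom.encode("utf-8")).hexdigest()[:8]


def go_to(browser, path):
    """Backtrack to a state by reloading and replaying its click path."""
    browser.reload()
    for element in path:
        browser.click(element)


def crawl(browser):
    """Build a state graph: state id -> {clicked element: target state id}."""
    browser.reload()
    root = state_id(browser.dom)
    graph = {root: {}}          # starts with only the initial state
    paths = {root: []}          # click path from the initial page to each state
    queue = deque([root])       # crawling queue of states still to explore

    while queue:
        current = queue.popleft()
        go_to(browser, paths[current])
        for element in browser.clickables():
            browser.click(element)
            target = state_id(browser.dom)
            if target not in graph:
                # New state found: record it and schedule it for exploration.
                graph[target] = {}
                paths[target] = paths[current] + [element]
                queue.append(target)
            graph[current][element] = target
            # Backtrack so the remaining candidates are fired from `current`.
            go_to(browser, paths[current])
    return graph


if __name__ == "__main__":
    result = crawl(ToyBrowser(TOY_APP, "<home>"))
    for state, edges in result.items():
        print(state, edges)
```

The repeated `go_to` call is where the sequential version pays for backtracking. In the concurrent variant described above, a state with unexplored candidates would instead be handed to another crawling node with its own browser instance, so that reload-and-replay step is avoided.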