Design And Implementation Of A Web Crawler Friendly To Ajax

Posted on: 2012-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
GTID: 2178330335963673
Subject: Computer application technology
Abstract/Summary:
With the wide adoption of Web 2.0, the user-centric new generation of Web application model, Ajax technology has developed rapidly and is now used by sites such as Netease Blog, Amazon, and Google. Ajax relies on a JavaScript-driven asynchronous request/response mechanism, but a traditional web crawler does not understand JavaScript semantics, so it can neither simulate asynchronous calls by triggering JavaScript events nor parse the data those calls return. In addition, JavaScript in an Ajax application substantially changes the DOM structure and updates page content dynamically, whereas a traditional crawler assumes by default that a page's DOM structure is static and largely unchanged. These factors are major obstacles for traditional crawlers and inevitably hinder search engines in collecting information.
In response to these problems, the system described in this thesis downloads pages and obtains their source code through HTTP requests. It then builds the DOM tree of each page, analyzes the page content, and removes noisy information. JavaScript code and files are extracted by traversing the DOM tree. Built-in browser objects are constructed locally, and the open-source script engine Rhino is used to trace and execute the JavaScript code in order to extract dynamically generated links. XPath expressions are then applied to the interpreted pages to quickly locate the content to be extracted and to generate extraction rules. The extraction rules and data are stored in XML, converted with XSLT, and finally rendered as HTML.
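The script-extraction step can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses Python's standard streaming `html.parser` in place of a full DOM tree traversal, and the names `ScriptCollector`, `collect_scripts`, and `SAMPLE_PAGE` (including its URLs) are hypothetical. It only separates inline JavaScript from external script files; executing the collected code would still need a script engine such as Rhino.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the URLs are placeholders, not from the thesis.
SAMPLE_PAGE = """<html><head><script src="/js/app.js"></script></head>
<body><script>loadComments('/api/comments?page=2');</script></body></html>"""


class ScriptCollector(HTMLParser):
    """Collects inline JavaScript bodies and external script URLs."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []    # source text of <script>...</script> bodies
        self.external_scripts = []  # values of <script src="...">
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # External file: record the URL so the crawler can fetch it later.
            self.external_scripts.append(src)
        else:
            # Inline code: start buffering the text until the closing tag.
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.inline_scripts.append(code)
            self._in_script = False


def collect_scripts(html):
    """Return (inline_scripts, external_scripts) found in an HTML string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.inline_scripts, parser.external_scripts


if __name__ == "__main__":
    inline, external = collect_scripts(SAMPLE_PAGE)
    print("inline:", inline)
    print("external:", external)
```

Keeping inline code and external files in separate lists reflects the abstract's distinction between "JavaScript code and files": the files must first be downloaded before either kind can be handed to the script engine.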
This yields a way to obtain URLs and dynamic data from Ajax sites. The thesis designs an Ajax-friendly web crawler system that constructs built-in browser objects locally, uses Rhino to parse the Ajax calls in JavaScript, and accesses the data returned by asynchronous requests, offering a new approach to web crawlers that support Ajax. Finally, experiments are designed to demonstrate the feasibility of the approach.
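Once the data returned by an asynchronous request is available, XPath-based extraction rules stored in XML could be applied roughly as follows. This is a sketch under stated assumptions, not the thesis's rule format: the `<rules>`/`<rule>` layout, the sample response, and the helper names are all hypothetical. It assumes the response is well-formed XML, and it relies on `xml.etree.ElementTree`, which supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: each rule names a field and the XPath that locates it.
RULES_XML = """<rules>
  <rule name="title" xpath=".//div[@class='post']/h2"/>
  <rule name="author" xpath=".//div[@class='post']/span[@class='author']"/>
</rules>"""

# Hypothetical well-formed response, standing in for an interpreted page.
PAGE_XML = """<html><body>
  <div class="post"><h2>Crawling Ajax</h2><span class="author">Example Author</span></div>
</body></html>"""


def load_rules(rules_xml):
    """Parse the rule file into a {field name: XPath expression} mapping."""
    root = ET.fromstring(rules_xml)
    return {rule.get("name"): rule.get("xpath") for rule in root.findall("rule")}


def apply_rules(page_xml, rules):
    """Evaluate every rule against the page and collect the matched text."""
    page = ET.fromstring(page_xml)
    return {
        name: [(el.text or "").strip() for el in page.findall(xpath)]
        for name, xpath in rules.items()
    }


def to_xml(extracted):
    """Serialize the extracted fields as XML, one element per matched value."""
    root = ET.Element("data")
    for name, values in extracted.items():
        for value in values:
            ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    extracted = apply_rules(PAGE_XML, load_rules(RULES_XML))
    print(to_xml(extracted))
```

Storing both the rules and the extracted data as XML keeps the later conversion step open: an XSLT stylesheet could turn the `<data>` document into HTML, although Python's standard library has no XSLT processor, so that step is not shown here.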
Keywords/Search Tags: Ajax, Web Crawler, JavaScript Parsing, Data Extracting