Design And Implementation Of A Web Crawler Friendly To Ajax

Posted on:2012-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:M Zhang

Full Text:PDF

GTID:2178330335963673

Subject:Computer application technology

Abstract/Summary:

With the wide adoption of Web 2.0, the user-centric new generation of Web application models, Ajax technology has developed rapidly and is now used by sites such as Netease Blog, Amazon, and Google. Ajax relies on a JavaScript-driven asynchronous request/response mechanism, whereas a traditional web crawler does not understand JavaScript semantics, cannot simulate asynchronous calls by triggering JavaScript events, and cannot parse the data those calls return. In addition, in Ajax applications JavaScript substantially modifies the DOM structure and updates page content dynamically, while a traditional crawler assumes by default that a page's DOM structure is static and relatively unchanging. These factors are major obstacles for traditional crawlers and inevitably impair a search engine's ability to collect information. In response to these problems, this thesis downloads pages and obtains their source code through HTTP requests, constructs the DOM tree of each page, analyzes the page content, and removes noise. It extracts JavaScript code and files by traversing the DOM tree, constructs the built-in browser objects, and then uses the open-source script engine Rhino to trace and execute the JavaScript code in order to extract dynamically generated links. It then applies XPath expressions to the interpreted pages to quickly locate the content to be extracted, generates extraction rules, stores the extraction rules and data in XML format, converts them with XSLT, and finally renders the result as HTML.
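Two of the steps above, collecting JavaScript from a page and applying stored XPath extraction rules, can be illustrated with a minimal Python sketch. This is not the thesis's implementation (which is built around Rhino); the class `ScriptCollector`, the function `apply_rules`, and the rule file layout in `RULES_XML` are all hypothetical names chosen for illustration, and the sketch assumes the interpreted page has already been serialized as well-formed XML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ScriptCollector(HTMLParser):
    """Collects external script URLs and inline script bodies from a page."""

    def __init__(self):
        super().__init__()
        self.external = []       # values of <script src="...">
        self.inline = []         # text of <script>...</script> blocks
        self._in_inline = False  # True while inside an inline script

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self._in_inline = True
            self.inline.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_inline:
            self.inline[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False


# Hypothetical rule format: each rule names a field and the XPath that finds it.
RULES_XML = """
<rules>
  <rule name="title" xpath=".//h1"/>
  <rule name="entry" xpath=".//a[@class='entry']"/>
</rules>
"""


def apply_rules(page_xml, rules_xml):
    """Apply every stored XPath rule to the page and return a <data> element."""
    page = ET.fromstring(page_xml)
    rules = ET.fromstring(rules_xml)
    result = ET.Element("data")
    for rule in rules.findall("rule"):
        for node in page.findall(rule.get("xpath")):
            item = ET.SubElement(result, rule.get("name"))
            item.text = "".join(node.itertext()).strip()
    return result
```

Feeding a page to `ScriptCollector` yields the script sources that a script engine would then execute, and `ET.tostring(apply_rules(page, RULES_XML))` gives an XML record of the extracted data. Note that `xml.etree.ElementTree` supports only a subset of XPath, and the XSLT conversion step is not shown because the Python standard library has no XSLT processor.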
The result is a solution for obtaining URLs and dynamic data from Ajax sites. This thesis designs an Ajax-friendly web crawler system that constructs local built-in browser objects, uses Rhino to parse the Ajax calls in JavaScript, and accesses the data returned by asynchronous requests, offering a new approach to web crawlers that support Ajax. Finally, experiments are designed to demonstrate the feasibility of the approach.
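The idea of a locally constructed browser object can be sketched as a recording stand-in for `XMLHttpRequest`. In the thesis such objects are exposed to scripts running inside Rhino; the Python class below is only an analogy under that assumption, and the name `RecordingXMLHttpRequest` and its constructor parameters are hypothetical. It performs no network I/O and simply logs the URL a script asks for, which is how dynamically generated links could be discovered.

```python
from urllib.parse import urljoin


class RecordingXMLHttpRequest:
    """Hypothetical stand-in for the browser's XMLHttpRequest host object.

    It sends nothing over the network; it only records what was requested.
    """

    def __init__(self, base_url, log):
        self._base_url = base_url  # URL of the page that owns the script
        self._log = log            # shared list of (method, absolute_url)
        self._pending = None       # request prepared by open(), not yet sent

    def open(self, method, url, is_async=True):
        # Resolve relative URLs against the owning page, as a browser would.
        self._pending = (method.upper(), urljoin(self._base_url, url))

    def send(self, body=None):
        if self._pending is None:
            raise RuntimeError("send() called before open()")
        self._log.append(self._pending)
        self._pending = None
```

A crawler would hand one such object to each executed script and afterwards read the shared log to obtain the asynchronous request URLs, which it can then fetch itself to access the returned data.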

Keywords/Search Tags:

Ajax, Web Crawler, JavaScript Parsing, Data Extracting

Related items

1	Research And Implementation Of Web Crawler For URL-Specified Crawling Of Ajax-Based Web Applications
2	Design And Implementation Of An Ajax Supported Deep Web Crawler System
3	Design And Construction Of Distributed JS Parsing System
4	Application Research Of Web Crawler Based On Chrome Headless In Web Vulnerability Scanning
5	Research And Implementation On Theme Web Crawler Of Supporting Ajax
6	Research On The Rapid Extraction Method Of Url For Dynamic Pages
7	The Environmental Monitor System Based On AJAX
8	Design And Implementation Of A Web Crawler System Supported AJAX
9	Design And Realization Of A Web Page Gathering System With Javascript Parsing
10	Design And Realization Of A Web Page Gathering System With JavaScript Parsing