Design And Realization Of A Web Page Gathering System With JavaScript Parsing

Posted on: 2009-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: H X Bai
Full Text: PDF
GTID: 2198360308477820
Subject: Computer application technology
Abstract/Summary:
With the wide application of search engines, web page gathering technology has developed rapidly. Web page gathering is the first step in the search engine workflow, and the quality of the gathered pages directly affects the QoS (Quality of Service) of a search engine. Ideally, the gathered pages should be the pages coherent with the users' vision information (CUVI). So far, however, this idea has received little attention. To address this blind spot, this thesis designs a web page gathering system for CUVI pages. To crawl CUVI pages, the first problem to address is redirection in web pages, which is one of the main uses of JavaScript. In this thesis, the problem of gathering CUVI pages is solved by introducing JavaScript parsing into the gathering system.

The thesis is divided into two parts: the design and implementation of a JavaScript parser, and the design and implementation of a web page gathering system.

For the JavaScript parser, the necessity of handling JavaScript is investigated first. The distribution of functions performed by JS code in HTML documents is obtained by analyzing typical research data. Then a simple JS parser, JSParser, is designed and implemented according to the gathering system's requirements for JavaScript parsing. Finally, experiments verify that JSParser meets the requirements of the gathering system in both performance and functionality.

The web page gathering system consists of two sub-modules: a collector and a controller. Web page analysis is introduced into the design of the collector and combined with JSParser, which achieves the goal of crawling CUVI pages. In the implementation of the collector, epoll is used to support the high concurrency the collector requires. By maintaining a FIFO queue of IP addresses in the controller, the collector can download web pages politely, so that the collector and the Internet cooperate well.

Extensive testing verifies that introducing JSParser into the gathering system does not degrade the performance of the system, and that the system runs well when abundant IP addresses are available.
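The abstract does not describe how the controller's FIFO queue of IP addresses is organized. The following Python sketch shows one plausible reading, under stated assumptions: each IP address keeps its own list of pending URLs, and an address goes to the back of the queue after each download, so consecutive requests to the same host are spaced out by requests to other hosts. The class name `PoliteScheduler` and its methods are hypothetical and do not come from the thesis.

```python
from collections import deque


class PoliteScheduler:
    """Hypothetical controller sketch: round-robin over IPs via a FIFO queue."""

    def __init__(self):
        self._ip_queue = deque()   # FIFO of IPs that still have pending URLs
        self._pending = {}         # ip -> deque of URLs waiting to be fetched

    def add(self, ip, url):
        # A new IP joins the back of the FIFO the first time it is seen.
        if ip not in self._pending:
            self._pending[ip] = deque()
            self._ip_queue.append(ip)
        self._pending[ip].append(url)

    def next_url(self):
        # Take the IP at the front, hand out one of its URLs, and
        # re-queue the IP at the back if it still has work left.
        if not self._ip_queue:
            return None
        ip = self._ip_queue.popleft()
        urls = self._pending[ip]
        url = urls.popleft()
        if urls:
            self._ip_queue.append(ip)
        else:
            del self._pending[ip]
        return ip, url
```

With many distinct IP addresses in the queue, the gap between two requests to the same host grows naturally, which fits the abstract's remark that the system runs well with abundant IP addresses. A real collector would probably also enforce a minimum delay per host; that is omitted here.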
Keywords/Search Tags: Web page gathering system, users' vision information, JavaScript parsing, web page analysis, FIFO queue of IP addresses