Design And Realization Of A Web Page Gathering System With JavaScript Parsing

Posted on:2009-10-30

Degree:Master

Type:Thesis

Country:China

Candidate:H X Bai

GTID:2198360308477820

Subject:Computer application technology

Abstract/Summary:

With the wide application of search engines, web page gathering technology has developed rapidly. Web page gathering is the first step in a search engine's workflow, and the quality of the gathered pages directly affects the search engine's QoS (Quality of Service). Ideally, the gathered pages should be pages coherent with the users' vision information (CUVI). However, this idea has so far received no attention.

To address this blind spot, this thesis designs a web page gathering system for CUVI pages. To crawl CUVI pages, the first problem to address is redirection in web pages, one of the main functions of JavaScript. In this thesis, the problem of gathering CUVI pages is solved well by introducing JavaScript parsing into the gathering system.

The thesis consists of two parts: the design and implementation of a JavaScript parser, and the design and implementation of a web page gathering system.

For the JavaScript parser, the necessity of handling JavaScript is investigated first. The functional distribution of JS programs in HTML documents is obtained by analyzing typical research data. Then a simple JS parser, JSParser, is designed and implemented according to the gathering system's requirements for JavaScript parsing. Finally, experiments verify that JSParser meets the requirements of the gathering system in both performance and function.

The web page gathering system consists of two sub-modules: a collector and a controller. Web page analysis is creatively introduced into the design of the collector and combined with JSParser, which achieves the goal of crawling CUVI pages. In the implementation of the collector, EPOLL is used to support the collector's high concurrency. By maintaining a FIFO queue of IP addresses in the controller, the collector can download web pages politely, so that the collector cooperates well with the Internet. Extensive testing verifies that introducing JSParser into the gathering system does not affect the system's performance, and that the system runs well when abundant IP addresses are available.
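
The abstract does not describe the controller's internals, so the following is only a minimal Python sketch of how a FIFO queue of IP addresses could enforce polite downloading. The class name `PoliteScheduler`, the `min_delay` parameter, and the per-IP URL queues are all assumptions made for illustration, not the thesis's actual design.

```python
from collections import deque


class PoliteScheduler:
    """Hypothetical controller: a FIFO queue of IPs, each with pending URLs."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay  # assumed minimum gap (seconds) between hits on one IP
        self.ip_queue = deque()     # FIFO of IP addresses that still have pending URLs
        self.pending = {}           # ip -> deque of URLs waiting to be downloaded
        self.last_hit = {}          # ip -> time of the most recent download

    def add(self, ip, url):
        """Queue a URL under its server IP; new IPs join the back of the FIFO."""
        if ip not in self.pending:
            self.pending[ip] = deque()
            self.ip_queue.append(ip)
        self.pending[ip].append(url)

    def next_url(self, now):
        """Return (ip, url) for the first IP that may be contacted at `now`, else None."""
        for _ in range(len(self.ip_queue)):
            ip = self.ip_queue.popleft()
            if now - self.last_hit.get(ip, float("-inf")) >= self.min_delay:
                url = self.pending[ip].popleft()
                self.last_hit[ip] = now
                if self.pending[ip]:
                    self.ip_queue.append(ip)  # more work left: rejoin at the back
                else:
                    del self.pending[ip]      # nothing left for this IP
                return ip, url
            self.ip_queue.append(ip)          # too soon for this IP: rotate it to the back
        return None
```

In this sketch, `next_url` returns `None` when every queued IP was contacted less than `min_delay` seconds ago, so the collector would sit idle. With many distinct IP addresses in the queue some IP is almost always ready, which is consistent with the observation that the system runs well with abundant IP addresses.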

Keywords/Search Tags:

Web page gathering system, users' vision information, JavaScript parsing, web page analysis, FIFO queue of IP addresses

Related items

1	Design And Realization Of A Web Page Gathering System With JavaScript Parsing
2	Research On Search Engine Based On Web Page Mining
3	Stored In Corporate Competitive Intelligence, Intelligence Collecting Platform Based On Web-page Analysis
4	Web Page-oriented Handheld Devices Automatically Cutting Technology Research
5	Research On Webpage Recognition Technology Based On Vision And Semantics
6	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
7	Research On Mining Structure Of WEB Page For Information Extraction
8	Research On Web Page Classification And Information Collection
9	A Study Of Hybrid Cache Management Mechanism Based On Page Classifier And Page Placer
10	Research On Vision Based Algorithm In Chinese Web-Page Classification