
Design And Implementation Of Distributed Web Crawler Based On Groovy

Posted on: 2011-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z Yang
Full Text: PDF
GTID: 2218330338966977
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the network is gradually replacing traditional means of accessing information, and the volume of information available online is growing rapidly. In practice, users have their own browsing habits and are sensitive to specific topics, but a general-purpose Web crawler cannot meet the need for personalized information collection, and existing topic crawlers also have drawbacks.

This thesis analyzes the distribution of topic pages on the Web, then designs and implements a distributed, customizable topic crawler system, named CTCS, for the Windows environment. It describes the system's working environment, network topology, subsystems with their functional modules and workflows, and the communication interfaces among subsystems. It then discusses the design and implementation of each CTCS subsystem in detail. Some features of CTCS are as follows:

(1) For manual customization, CTCS abandons the traditional pure data-entry configuration profile and embeds Groovy scripts in the configuration, so that logical expressions can appear in it. This greatly improves configuration flexibility and the accuracy and speed with which the crawler obtains data.
(2) To meet the requirements of Deep Web data capture, CTCS implements a state-holding HTTPClient component that maintains client state on top of the HTTP protocol.
(3) CTCS uses Java RMI (Remote Method Invocation) to build a flexible distributed solution; this module not only supports the crawler system but can also be configured for other distributed business applications.
(4) CTCS introduces a log center that collects logs and provides an early-warning function, which greatly facilitates development and maintenance of the system.

Finally, the operation of the system is presented in detail.
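
The idea behind feature (2), keeping client state across otherwise stateless HTTP requests, can be illustrated with a minimal sketch. It is not the CTCS component itself: it substitutes the JDK's standard java.net.CookieManager for the thesis's HTTPClient component, and the class name StatefulSessionSketch, its method names, the example.com URLs, and the SESSION cookie are all hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the CTCS implementation.
public class StatefulSessionSketch {
    // One CookieManager per crawl session holds the client-side state.
    private final CookieManager cookies = new CookieManager();

    /** Records the Set-Cookie headers of a response received from the given URI. */
    public void rememberResponse(URI uri, Map<String, List<String>> responseHeaders)
            throws IOException {
        cookies.put(uri, responseHeaders);
    }

    /** Returns the Cookie header values that should accompany a request to the given URI. */
    public List<String> cookiesFor(URI uri) throws IOException {
        Map<String, List<String>> headers =
                cookies.get(uri, Collections.<String, List<String>>emptyMap());
        List<String> values = headers.get("Cookie");
        return values == null ? Collections.<String>emptyList() : values;
    }

    public static void main(String[] args) throws IOException {
        StatefulSessionSketch session = new StatefulSessionSketch();

        // Simulate a login response that sets a session cookie (assumed values).
        Map<String, List<String>> loginResponse = Collections.singletonMap(
                "Set-Cookie", Collections.singletonList("SESSION=abc123; Path=/"));
        session.rememberResponse(URI.create("http://example.com/login"), loginResponse);

        // A later request to the same host carries the stored cookie,
        // which is what lets a crawler reach pages behind a login or form.
        System.out.println(session.cookiesFor(URI.create("http://example.com/search")));
    }
}
```

Holding one cookie store per session keeps the state handling separate from the fetching logic, so the same store could back any HTTP client that lets the caller set request headers.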
Keywords/Search Tags: Web crawler, Topic crawler, RMI, Groovy, Distributed software