
Research On Large-Scale Web Collection Technology Based On Grid

Posted on: 2008-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: C Yang
GTID: 2178360245998080
Subject: Computer Science and Technology
Abstract/Summary:
VCE Search Engine is an information search engine project built on a grid platform using the Globus Toolkit. In this paper, we design and implement a feasible, high-performance web collection system based on grid technology, with the aim of building the web collection module of the VCE Search Engine project.

This paper first analyzes the general characteristics of grid technology and presents its related concepts. It then describes Globus, the development toolkit for grid projects, covering its origin, its system framework, and the basic process of developing a grid project. Next, the paper introduces and analyzes the characteristics of web collection technology: the basic theory, the basic workflow, the basic data flow, and the categorization of the working modules of web crawlers. It presents the general evaluation metrics of a web collection system and analyzes the factors that affect those metrics. Through experiment, it shows the influence of web collection bandwidth on web collection speed and on the web collection error rate. From the analysis of grid technology, the paper summarizes the advantages of using grid technology for large-scale web collection: higher bandwidth is easier to obtain, services are convenient to acquire and release, information is easier to gather and coordinate, and scalability is good.

Based on the analysis of grid technology and of the factors that influence a web collection system, this paper designs a grid-based web collection model. It presents the framework of the model and analyzes its basic working model. Combining the dynamic distributed working model with the static distributed working model, it designs an improved interactive working model. It then presents the task partition mechanism for grid-based web collection, which mainly consists of the choice of seed URLs and the construction of the partition function. It then designs a two-layer task scheduling algorithm: a task scheduling algorithm for the WAN layer and a task scheduling algorithm for the LAN layer. It then shows that this model can achieve relatively high performance.

Finally, using the model designed in this paper, we implement a large-scale grid-based web collection system. Experiments validate the related theory.
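The abstract mentions a partition function for dividing collection tasks but does not give its definition. As an illustration only, the sketch below uses host-name hashing, a scheme commonly used in distributed crawlers; it is an assumption, not the function defined in the thesis. The names `partition` and `assign`, the choice of MD5, and the example URLs are all hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def partition(url: str, num_nodes: int) -> int:
    """Map a URL to a node index by hashing its host name.

    Assumption: hashing on the host keeps all pages of one site on the
    same collection node. The thesis may use a different criterion.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


def assign(urls, num_nodes):
    """Group URLs into one bucket per node index."""
    buckets = {i: [] for i in range(num_nodes)}
    for u in urls:
        buckets[partition(u, num_nodes)].append(u)
    return buckets


if __name__ == "__main__":
    seeds = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    for node, node_urls in assign(seeds, 3).items():
        print(node, node_urls)
```

With this scheme, URLs that share a host always land on the same node, which avoids duplicate fetches across nodes without any coordination. The trade-off is that a few very large sites can unbalance the load.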
Keywords/Search Tags: distributed system, grid, web collection, crawler