Design And Implementation Of Dynamic Adaptive Resource Collection System

Posted on: 2014-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2268330425475932
Subject: Computer technology
Abstract/Summary:
Today, people are used to obtaining information through search engines as the Internet provides more and more valuable information. In 2012, the total number of web pages in China increased by nearly 41% over 2011, which places higher demands on the resource collection systems of search engines. The Internet contains a large number of pages, and the number of dynamic pages is growing dramatically. During resource collection, a system inevitably encounters various exceptions, such as slow server responses, duplicate pages, invalid web links, and complicated link relationships between resources. This paper focuses on solutions to those problems.

This paper aims to design and implement a resource collection system that can not only adapt dynamically and automatically to a variety of abnormal conditions on the WAN, but also use information already collected to discover the links between web pages and predict more similar web pages. The system uses statistics gathered in real time during the collection process as the basis for a real-time filter, which is designed to filter out duplicate web pages, invalid accesses, and links that time out, so as to improve collection efficiency. Compared with a general collection system, the system adapts better to unstable network conditions and handles large numbers of spam links better.
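The abstract does not give the filter's internals, so the following is only a minimal sketch of how a statistics-driven real-time filter might look. The class name `RealTimeFilter`, the per-host failure counters, and the `max_failure_rate` and `min_samples` thresholds are all assumptions made for illustration, not the thesis's actual design.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class RealTimeFilter:
    """Hypothetical filter driven by statistics gathered while crawling."""

    def __init__(self, max_failure_rate=0.5, min_samples=4):
        self.max_failure_rate = max_failure_rate  # assumed threshold
        self.min_samples = min_samples            # assumed warm-up count per host
        self.seen = set()                         # URLs already accepted
        self.stats = defaultdict(lambda: {"ok": 0, "failed": 0})

    def record(self, url, ok):
        """Update per-host counters after a fetch attempt.

        Pass ok=False for timeouts and invalid links.
        """
        host = urlsplit(url).netloc
        self.stats[host]["ok" if ok else "failed"] += 1

    def should_fetch(self, url):
        """Return False for repeated URLs and for hosts that fail too often."""
        if url in self.seen:
            return False
        counters = self.stats[urlsplit(url).netloc]
        total = counters["ok"] + counters["failed"]
        if total >= self.min_samples:
            if counters["failed"] / total > self.max_failure_rate:
                return False
        self.seen.add(url)
        return True
```

In this sketch a crawler would call `should_fetch` before scheduling a URL and `record` after every download attempt, so the filter's decisions follow the conditions observed so far.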
To address the difficulty of finding web links, this paper proposes a method for link analysis and prediction. Based on the analysis of web links, the prediction method greatly expands search coverage by gaining access to a large number of similar pages, which complements conventional link extraction methods.

This paper uses a distributed architecture to design and implement the resource collection system. In addition to dividing the work among and implementing the basic page download module, page parsing module, URL de-duplication module, and URL scheduling module, the system adds a real-time filtering module and a URL prediction module, as well as auxiliary modules for statistical information, URL clustering, and URL classification, which give the system its dynamic adaptive characteristics.

Experimental results show that the proposed methods can identify and adjust to abnormal conditions, improving the system and ensuring stable collection. The system can also effectively predict web links that are difficult to find. In addition to conventional link extraction, this paper thus provides another effective way to obtain web links.
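The abstract does not describe the prediction algorithm itself. As a rough illustration of the general idea of predicting similar pages from already collected links, the sketch below groups known URLs that differ only in their last numeric field and proposes the missing numbers in between. The function name `predict_urls`, the grouping rule, and the `limit` parameter are assumptions for this example, not the method used in the thesis.

```python
import re
from collections import defaultdict

# Matches the last run of digits in a URL.
LAST_NUMBER = re.compile(r"(\d+)(?!.*\d)")


def predict_urls(known_urls, limit=100):
    """Guess unseen URLs by filling gaps in the last numeric field.

    Simplification: leading zeros in the numeric field are not preserved.
    """
    clusters = defaultdict(set)
    for url in known_urls:
        match = LAST_NUMBER.search(url)
        if match:
            # Cluster key: the text before and after the numeric field.
            key = (url[:match.start()], url[match.end():])
            clusters[key].add(int(match.group(1)))

    predicted = []
    for (prefix, suffix), numbers in sorted(clusters.items()):
        if len(numbers) < 2:
            continue  # a single example is too weak to generalise from
        for n in range(min(numbers), max(numbers) + 1):
            if n not in numbers:
                predicted.append(prefix + str(n) + suffix)
                if len(predicted) >= limit:
                    return predicted
    return predicted
```

Given pages numbered 1, 2, and 5 under the same path, this sketch would propose pages 3 and 4 as candidates. A real system would still need to fetch and verify each candidate before treating it as a valid link.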
Keywords/Search Tags: resource collection, dynamic self-adaption, real-time filtering, URL prediction