Design and Implementation of the On-Demand Crawler Subsystem of a Website Information Acquisition System

Posted on: 2012-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
GTID: 2218330335998190
Subject: Software engineering
Abstract/Summary:
Since the mid-1990s, the internet has become a platform for important social activities such as government, business, education, and entertainment, because it offers information independence, ease of access, wide geographic reach, and low maintenance cost. As a result, people are paying more and more attention to internet security, which differs from traditional security. However, traditional search engines cannot provide customized service, and their results are not timely enough for some specific requirements. We designed a simple spider system that executes customized tasks in a timely manner.

Unlike a traditional search engine, which handles one huge global task, our system targets a limited number of websites and narrows the search scope as much as possible by restricting the search width (a limited number of sites) and depth (a maximum URL depth), in order to meet the user's critical real-time requirements.

Furthermore, to achieve high parallelism, we split a task into many sub-tasks and use a consistent hash algorithm to schedule them. The algorithm keeps the workload of the crawlers balanced and minimizes the reassignment of sub-tasks when the number of crawlers increases or decreases.

Our tests on a set of specified websites show that this crawler system is efficient, scalable, and robust.
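The scheduling idea described in the abstract can be illustrated with a minimal consistent-hash ring. This is only a sketch under stated assumptions, not the thesis's implementation: the class name `ConsistentHashRing`, the use of MD5, the number of virtual nodes per crawler, and the crawler and sub-task identifiers are all hypothetical choices made for illustration.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a point on the ring (a 128-bit integer from MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hypothetical scheduler: maps sub-task keys to crawler ids."""

    def __init__(self, replicas: int = 100):
        # Each crawler is placed on the ring `replicas` times (virtual
        # nodes) so that the load spreads more evenly across crawlers.
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> crawler id

    def add_crawler(self, crawler_id: str) -> None:
        for i in range(self.replicas):
            point = ring_hash(f"{crawler_id}#{i}")
            if point in self._owner:
                continue  # hash collision; skip this virtual node
            bisect.insort(self._points, point)
            self._owner[point] = crawler_id

    def remove_crawler(self, crawler_id: str) -> None:
        for i in range(self.replicas):
            point = ring_hash(f"{crawler_id}#{i}")
            if self._owner.get(point) == crawler_id:
                del self._owner[point]
                del self._points[bisect.bisect_left(self._points, point)]

    def assign(self, subtask_key: str) -> str:
        """Return the crawler owning the first ring point after the key."""
        if not self._points:
            raise LookupError("no crawlers registered")
        idx = bisect.bisect_right(self._points, ring_hash(subtask_key))
        # Wrap around to the first point when the key hashes past the end.
        return self._owner[self._points[idx % len(self._points)]]
```

With this layout, adding a crawler moves only the sub-tasks that now fall to the new crawler, and removing it restores the previous assignment, which matches the property the abstract attributes to the consistent hash algorithm. A sub-task key here could be, for example, a site URL, though the thesis does not say what it uses as the key.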
Keywords/Search Tags: Parallel crawlers, Task allocation, Crawler management, Consistent hash algorithm