
Research And Design Of The General Crawler In Search Engine

Posted on: 2014-02-13
Degree: Master
Type: Thesis
Country: China
Candidate: L Gao
Full Text: PDF
GTID: 2248330395996752
Subject: Computer application technology
Abstract/Summary:
In recent years, the rapid growth of the Internet has brought explosive growth in information. An important concern for Internet users is how to quickly find the page content they want, and the faster information grows, the more urgent that need becomes. Search engines make it far more convenient for users to get the information they need quickly. This thesis studies and discusses the technologies and algorithms used in the crawling system of a search engine. The main job of a crawling system is to download Web pages to provide data support for the search engine. To retrieve Web pages, the crawler first maintains an initial crawl queue. It then fetches the pages in the queue while extracting new links from each page and adding them to the queue, and it continues until the crawl queue is empty.

The thesis first describes the types of search engines, their history, and their common framework, which gives a preliminary understanding of how a search engine operates. It then discusses crawling systems in detail: their classification, their common framework, and some of the algorithms used in designing one. Finally, it discusses in detail the overall design and implementation of CWebSpider, an independently developed system.
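The crawl loop described in the abstract (seed a queue, fetch each queued page, extract its links, enqueue the new ones, stop when the queue is empty) can be sketched as follows. This is a minimal illustration in Python, not CWebSpider's implementation: the names `crawl`, `LinkExtractor`, `FAKE_WEB`, and the `max_pages` limit are hypothetical, and an in-memory dictionary stands in for the network so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch, max_pages=1000):
    """Breadth-first crawl: runs until the queue is empty or max_pages is hit.

    `fetch` is any callable mapping a URL to HTML text, or None on failure.
    """
    queue = deque(seeds)   # the crawl queue, initialised with the seed URLs
    seen = set(seeds)      # URLs already queued, to avoid fetching twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue       # download failed; move on to the next URL
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Assumed stand-in for the Web: three pages linking to each other.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

A FIFO queue gives breadth-first order. A scheduler that ranks pages by importance, as the thesis describes, would replace the plain queue with a priority structure.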
The main research contents of this thesis are as follows:
(1) A generic web crawler framework, CWebSpider, is designed, and its internal framework is discussed in depth.
(2) The crawling, scheduling, and extraction algorithms of CWebSpider are explained in depth, and the detailed implementation of the CWebSpider system on the Linux platform is discussed through an analysis of its techniques and algorithms.
(3) For the crawling algorithm of CWebSpider, the downloader is designed as a network layer and an application layer and is heavily optimized, which improves the system's crawling efficiency and scalability. For the scheduling algorithm, an improved algorithm based on OPIC is designed, which effectively increases the chance of fetching more important Web pages. For duplicate detection, a Bloom filter method is implemented, which significantly reduces memory use and improves the efficiency of duplicate checking.
(4) The performance of CWebSpider is evaluated, the experimental results are analyzed, and directions for further work are outlined...
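To illustrate why a Bloom filter saves memory in duplicate detection, here is a minimal Python sketch. It is an assumption-laden illustration, not the thesis's implementation: the class name, the default sizes, and the use of SHA-256 with double hashing to derive bit positions are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array plus k hash positions per item (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # 1 Mbit = 128 KiB here

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


visited = BloomFilter()
visited.add("http://example.test/a")
print("http://example.test/a" in visited)  # added URLs are always reported
print("http://example.test/b" in visited)  # unseen URLs are very rarely reported
```

The filter stores only bits, never the URLs themselves, so memory stays constant as the crawl grows. The cost is a small false-positive rate: a crawler using it may occasionally skip a page it has not actually visited.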
Keywords/Search Tags:Web Crawler, Scheduler, Downloader, Extractor