Research And Design Of The General Crawler In Search Engine

Posted on:2014-02-13

Degree:Master

Type:Thesis

Country:China

Candidate:L Gao

Full Text:PDF

GTID:2248330395996752

Subject:Computer application technology

Abstract/Summary:

In recent years, as the Internet has boomed, the amount of information online has grown explosively. An important concern for Internet users is how to quickly find the page content they want. The faster information grows, the more urgent this need becomes. Search engines give users convenient, fast, on-demand access to information. This thesis studies and discusses the crawler-related technologies and algorithms in a crawling system. The main work of a crawling system is to download Web pages and so provide data support for the search engine. To retrieve Web pages, the crawler first maintains an initial crawl queue and then crawls the pages in that queue; while doing so, it extracts new links from each Web page and adds them to the crawl queue, continuing until the crawl queue is empty.

The content of this thesis covers the following aspects. It first describes the types of search engines, their history, and their common framework, giving a preliminary understanding of how a search engine operates. It then discusses crawling systems in detail: their classification, their common framework, and some of the algorithms used in designing a crawling system. Finally, it discusses in detail the overall design and implementation of CWebSpider, an independently developed system.
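The crawl loop described in the abstract (seed queue, download, extract links, enqueue, repeat until the queue is empty) can be illustrated with a minimal Python sketch. This is not CWebSpider's code: the names `crawl`, `LinkExtractor`, and the caller-supplied `fetch` function are hypothetical, and a plain first-in, first-out queue stands in for the thesis's scheduler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=1000):
    """Crawl breadth-first from `seeds` until the queue is empty.

    `fetch` is a hypothetical caller-supplied function that maps a URL to
    its HTML text, or returns None if the page cannot be downloaded.
    Returns the URLs in the order they were downloaded.
    """
    queue = deque(seeds)
    seen = set(seeds)  # URLs already queued, so no page is crawled twice
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded
```

Passing `fetch` as a parameter keeps the loop testable without network access; for example, `crawl(["http://example.com/"], pages.get)` works with an in-memory dictionary `pages` that maps URLs to HTML strings.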
The main research contents of this thesis are as follows:
(1) A generic web crawler framework, CWebSpider, is designed, and its internal framework is discussed in depth.
(2) The crawling, scheduling, and extraction algorithms of CWebSpider are explained in depth, and the detailed implementation of the CWebSpider system on the Linux platform is discussed through an analysis of its techniques and algorithms.
(3) For the crawling algorithm of CWebSpider, the fetcher is designed as a network layer and an application layer, and a number of optimizations improve the system's crawling efficiency and scalability. For the scheduling algorithm, an improved algorithm based on OPIC is designed, which effectively increases the chance of fetching more important Web pages. For duplicate detection, a Bloom filter method is implemented, which significantly reduces memory usage and improves the efficiency of duplicate detection.
(4) The performance of CWebSpider is evaluated, the experimental results are analyzed, and directions for further work are outlined...
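To illustrate why a Bloom filter saves memory in duplicate detection, here is a minimal Python sketch. It is a generic textbook construction, not the thesis's implementation: the class name `BloomFilter`, the SHA-256 double-hashing scheme, and the default 1% false-positive rate are all assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Set-membership filter: no false negatives, rare false positives."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all k bit positions from two 64-bit
        # halves of a single SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A crawler would test `url in seen` before enqueueing a link and call `seen.add(url)` afterwards. With these sizing formulas, a filter for 1,000 URLs at a 1% false-positive rate needs roughly 1.2 KB regardless of URL length, at the cost of occasionally skipping a page that was never actually crawled.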

Keywords/Search Tags:

Web Crawler, Scheduler, Downloader, Extractor

Related items

1	Design And Implementation Of Android Platform Video Sniffer And Downloader
2	Amelioration de resumes automatiques produits par extraction de phrases: Etude de cas avec Extractor (French text)
3	Research On Topic Focused Web Crawler And Related Technologies
4	The Study And Implementation Of Focused Crawler Technology For Android Technical Information
5	The Design And Implementation Of Oil-extractor Intelligent Monitoring Device
6	Fuzzy Extractor:Constructions And Security Proofs
7	The Study Of Oil Extractor Remote Monitoring Systems Based On GPRS
8	Research On Distributed Information Collection Strategy For Financial Credit
9	Organization Entity Information Extractor From Webpage Base On CRF
10	The Research On Embedded Network Downloader Based On PowerPC-eMule Section