Research Of URL Analysis Model And Scheduling Technology Of Focused Crawler

Posted on:2012-03-04

Degree:Master

Type:Thesis

Country:China

Candidate:H Wu

Full Text:PDF

GTID:2218330368981968

Subject:Computer application technology

Abstract/Summary:

The amount of information on the Internet has surged with the Internet's rapid development. Finding the information they need quickly, accurately, and comprehensively has become an important problem for users, and search engines were invented to help them locate the resources they are interested in. However, a traditional general-purpose search engine wastes resources by blindly pursuing high coverage, and its query results often include unrelated web pages. Vertical search engines emerged to solve these problems: they collect only the web pages related to the topics the users are interested in. The focused crawler is a key component of a vertical search engine; it judges whether a web page is related to the topic and carries out the actual fetching of web pages.

There are two important problems in the focused crawler field: how to judge whether a web page that has not yet been fetched is related to the topic, and how to schedule URLs. This paper proposes a solution to each.

First, the traditional URL analysis model based on content evaluation has a high accuracy rate, but its efficiency is low and it may give low-quality web pages high scores. The traditional URL analysis model based on link evaluation easily causes topic drift because it ignores web page content entirely. This paper combines the two models and introduces the users' queries to amend the topic description, yielding a URL analysis model that combines content evaluation and link evaluation. Experiments show that this model improves performance measures such as the accuracy rate.

Second, existing crawler URL scheduling technologies have problems such as a bottleneck at the control node and unbalanced URL distribution. To address these, this paper designs a URL scheduling scheme based on concurrent crawling by multiple nodes. The scheme uses a scalable Bloom filter to filter duplicate URLs, consistent hashing to distribute URLs, and the UDT protocol to transfer URLs in blocks. Experiments with the scheme deployed in the focused crawler demonstrate its feasibility.
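
The scheduling idea described in the abstract (filter duplicate URLs, then distribute the rest across crawler nodes by consistent hashing) can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `BloomFilter`, `ConsistentHashRing`, and `schedule`, the node names, and all parameters are hypothetical. The filter here is a plain fixed-size Bloom filter rather than the scalable variant the thesis uses, the full URL is assumed to be the hashing key, and the UDT block transfer is left out.

```python
import bisect
import hashlib


def _hash(value: str, seed: int = 0) -> int:
    """Stable 64-bit hash from SHA-256; the seed selects a different hash function."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Fixed-size Bloom filter (assumption: the thesis uses a scalable variant).

    Membership tests can give false positives but never false negatives.
    """

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        return [_hash(item, seed) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class ConsistentHashRing:
    """Consistent hashing with virtual nodes (replicas) to even out the load."""

    def __init__(self, nodes, replicas: int = 100):
        ring = [(_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)]
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [node for _, node in ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owners[idx]


def schedule(urls, ring: ConsistentHashRing, seen: BloomFilter) -> dict:
    """Drop URLs the filter has (probably) seen, then group the rest by node."""
    assignments = {}
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        assignments.setdefault(ring.node_for(url), []).append(url)
    return assignments


if __name__ == "__main__":
    ring = ConsistentHashRing(["crawler-a", "crawler-b", "crawler-c"])
    seen = BloomFilter()
    batch = ["http://example.com/1", "http://example.com/2", "http://example.com/1"]
    for node, urls in schedule(batch, ring, seen).items():
        print(node, urls)
```

In this sketch no central node decides where a URL goes: any crawler holding the same node list computes the same owner. Adding a node only reassigns the keys that fall on the new node's ring points, which is the property that keeps URL distribution stable as the set of crawlers changes.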

Keywords/Search Tags:

search engine, focused crawler, URL analysis model, URL scheduling

Related items

1	Research And Design On Focused Crawler Of Search Engine
2	The Research On Focused Crawling Algorithm In Vertical Search Engine
3	Research And Implement Of Focused-crawler Relevance Algorithm In Search Engine
4	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
5	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
6	Research And Implementation Of A Time-based Focused Search Engine
7	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
8	Customizable Focused Crawler
9	The Optimization And Achieve For Focused Crawling Algorithm Based On The Website Content Framework
10	Research And Implementation Of Focused Crawler's Search Strategy In The Vertical Search Engine