Research Of URL Analysis Model And Scheduling Technology Of Focused Crawler

Posted on: 2012-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Wu
Full Text: PDF
GTID: 2218330368981968
Subject: Computer application technology
Abstract/Summary:
The amount of information on the Internet has surged with the Internet's rapid development, and finding the information they need quickly, accurately, and comprehensively has become an important problem for users. Search engines were invented to help users find the resources they are interested in. However, a traditional general-purpose search engine wastes resources by blindly pursuing high coverage, and it often returns results that include unrelated web pages. Vertical search engines emerged to solve these problems: they collect only the web pages related to the topics in which users are interested. The focused crawler plays an important part in a vertical search engine; it is responsible for judging whether a web page is related to the topic and for the actual fetching of web pages.

There are two important problems in the focused crawler field: how to judge whether a web page yet to be fetched is related to the topic, and how to schedule URLs. This paper proposes a solution to each.

First, the traditional URL analysis model based on content evaluation has a high accuracy rate, but its efficiency is low and it may give low-quality web pages a high score. The traditional URL analysis model based on link evaluation easily causes topic drift because it completely ignores web page content. This paper combines the two models and introduces the user's query to amend the topic description, yielding a URL analysis model that combines content evaluation and link evaluation. Experiments show that this model improves performance measures such as accuracy rate.

Second, existing crawler URL scheduling technologies all have problems such as a bottleneck at the control node and unbalanced URL distribution. To address these problems, this paper designs a URL scheduling scheme based on concurrent multi-node crawling. The scheme uses a scalable Bloom filter to filter duplicate URLs, consistent hashing to distribute URLs, and the UDT protocol to transfer URLs in blocks. Experiments with the scheme applied to the focused crawler demonstrate its feasibility.
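The scheduling idea described above (drop duplicate URLs with a Bloom filter, then assign each remaining URL to a crawler node by consistent hashing) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The names `BloomFilter`, `ConsistentHashRing`, and `schedule`, the filter size, the number of hash functions, and the number of virtual nodes are all assumptions chosen for illustration. The filter here is a plain fixed-size one rather than a scalable one, the full URL is used as the hashing key, and the UDT block transfer step is omitted.

```python
import bisect
import hashlib


def _hash(data: str, seed: int = 0) -> int:
    """Stable 64-bit hash; the seed selects one of several hash functions."""
    digest = hashlib.sha256(f"{seed}:{data}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Fixed-size Bloom filter (assumed sizes; a scalable variant would
    add further filters as this one fills up)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        return [_hash(item, seed) % self.num_bits for seed in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May report a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class ConsistentHashRing:
    """Hash ring with virtual nodes; each key goes to the next point clockwise."""

    def __init__(self, nodes, replicas: int = 64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._keys, _hash(key)) % len(self._keys)
        return self._ring[idx][1]


def schedule(urls, ring: ConsistentHashRing, seen: BloomFilter) -> dict:
    """Group unseen URLs into per-node batches (hypothetical helper)."""
    batches = {}
    for url in urls:
        if url in seen:  # probable duplicate: skip it
            continue
        seen.add(url)
        batches.setdefault(ring.node_for(url), []).append(url)
    return batches


if __name__ == "__main__":
    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    seen = BloomFilter()
    urls = ["http://a.example/1", "http://a.example/2", "http://a.example/1"]
    print(schedule(urls, ring, seen))
```

Because every node can compute `node_for` locally, no central control node has to assign URLs, which is one way to avoid the bottleneck the abstract mentions. With consistent hashing, removing a node only remaps the URLs that belonged to that node.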
Keywords/Search Tags: search engine, focused crawler, URL analysis model, URL scheduling