
Research And Implementation Of Focused Crawler's Search Strategy In The Vertical Search Engine

Posted on: 2014-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Xu
GTID: 2298330467975896
Subject: Mechanical and electrical engineering
Abstract/Summary:
With the development of the Internet, the number of Web resources is growing rapidly, and search engines have become an important tool for obtaining information online. In pursuing maximum retrieval coverage, general search engines may sacrifice the quality of the retrieved information, so they are increasingly unable to meet users' demand for retrieval services in specialized fields, and the development of vertical search engines that focus on a particular topic has become a new trend. As an important part of a vertical search engine, the focused crawler has a great impact on the quality and efficiency of information retrieval; designing a high-performance focused crawler is therefore an important issue in the field of vertical search engines.

The main object of study in this paper is focused crawler technology. First, the paper gives an overview of research on vertical search engines and focused crawlers and their working principles, and points out the advantages that vertical search engines have in information retrieval compared with general search engines. It then analyzes the search strategy of the focused crawler and discusses its accuracy and importance in predicting topic relevance. The paper centers on how the text and hypertext information in web pages affects the focused crawler's search strategy.

First, we describe in detail how to represent topic information, how to extract keywords, and how to calculate keyword weights and the degree of topic relevance. We then analyze the disadvantages of the TF-IDF algorithm, which is used to calculate keyword weights, and propose an improved scheme. We use the vector space model (VSM) to calculate the degree of topic relevance of web pages.

To distinguish the degree of relevance of URLs, we introduce a topic-feature factor into the traditional HITS algorithm, giving an improved HITS algorithm.
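
The keyword weighting and VSM relevance steps described above can be illustrated with a minimal Python sketch. It uses the classic TF-IDF formula and cosine similarity only; the improved TF-IDF scheme proposed in the thesis is not specified in this abstract and is not reproduced here. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return classic TF-IDF weight dicts for tokenized documents.

    Assumption: plain term frequency times log inverse document
    frequency, not the thesis's improved variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: one topic description and two candidate pages.
topic = ["focused", "crawler", "search", "strategy"]
page_a = ["focused", "crawler", "topic", "search"]
page_b = ["cooking", "recipe", "pasta", "sauce"]

vectors = tf_idf_vectors([topic, page_a, page_b])
sim_a = cosine_similarity(vectors[0], vectors[1])
sim_b = cosine_similarity(vectors[0], vectors[2])
```

In this toy setup `page_a` shares terms with the topic and scores higher than `page_b`, which shares none. A crawler could use such a score to rank fetched pages, though the thresholds and term extraction used in the thesis are not given here.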
In the improved HITS algorithm, each web page is scored with three vectors (hub vector, authority vector, and content vector) to calculate its degree of topic relevance, in order to avoid the topic drift phenomenon as far as possible, and we discuss the formulas for the hub and authority values.

The content-based search strategy and the search strategy based on URL analysis both suffer from relying on a single evaluation criterion, so this paper combines the two into a composite search strategy, under which the focused crawler can select the best search strategy for each crawling stage.
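
The idea of adding a content factor to HITS can be sketched as follows. The abstract does not give the thesis's actual hub and authority formulas, so this is only one plausible reading: each page's authority score is scaled by a content relevance value in [0, 1] before the hub update, so that off-topic pages pass on less weight. The function names, the link graph, and the relevance values are all hypothetical.

```python
import math


def _normalize(scores):
    """Scale scores to unit L2 norm; leave an all-zero dict unchanged."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: s / norm for page, s in scores.items()}


def topic_weighted_hits(links, relevance, iterations=20):
    """HITS iteration with an assumed content-relevance factor.

    links: dict mapping a page to the list of pages it links to.
    relevance: dict mapping a page to a content relevance in [0, 1].
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of in-linking pages...
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # ...scaled by the page's own content relevance (the assumption).
        new_auth = {p: relevance.get(p, 0.0) * s for p, s in new_auth.items()}
        # Hub: sum of the weighted authority scores of linked pages.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, []))
            for page in pages
        }
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Hypothetical link graph and relevance scores.
links = {"seed": ["on_topic", "off_topic"], "hub2": ["on_topic"]}
relevance = {"seed": 0.9, "hub2": 0.8, "on_topic": 0.9, "off_topic": 0.1}
hub, auth = topic_weighted_hits(links, relevance)
```

With these toy inputs the on-topic page ends up with a higher authority score than the off-topic one, which is the kind of effect that limits topic drift. How the thesis actually combines the three vectors, and how the composite strategy switches between crawling stages, is not stated in the abstract.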
Keywords/Search Tags:Vertical Search Engine, Focused Crawler, Content Analysis, URL Analysis, Search Strategy