
Research On The Topic Crawler Algorithm Based On Vector Space Model

Posted on: 2017-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: R B Yao
Full Text: PDF
GTID: 2348330482491377
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, the network has become the leading channel for access to information and has brought rapid changes to how people work and live; the convenience of information retrieval is self-evident. The explosive growth of information has also exposed some drawbacks: traditional search engines can no longer return specific results to users within a limited time. For this reason, the focused search engine came into being and has become a focus of search engine research.

A topical crawler is built on the traditional web crawler and adds two modules: topic establishment and topic relevance evaluation. It emphasizes the depth of crawling; ideally, it downloads only the web pages that are relevant to a given topic and avoids downloading all others. It must solve three core issues: topic establishment, relevance analysis and evaluation, and the search algorithm.

This thesis studies the key techniques of the focused crawler, namely the description of the crawling topic, the calculation of relevance, and the search strategy for web pages, with emphasis on relevance calculation and search strategy for a topic crawler based on the vector space model. After researching and analyzing traditional topic crawler algorithms, it proposes a multi-granularity SH crawler algorithm based on the vector space model. The main research work is as follows:

1. The traditional Vector Space Model (VSM) represents a text by its feature keywords. Keyword weights are calculated with the TF-IDF method, which counts how often the relevant keywords appear in the text. The result is only a fuzzy match on the words of the text and ignores the Web page itself.
Thus, the accuracy of this approach is poor. By combining the position of a feature word within a text with its position weight across different texts, we use a modified TF-IDF formula to calculate keyword weights and take special positions into account, which changes the traditional VSM. The topic relevance of a page is then calculated with the improved VSM.

2. We analyze the advantages and shortcomings of the Shark-search algorithm and the HITS algorithm. To address the noise links of Shark-search and the topic drift of HITS, the web page is analyzed in depth and the VIPS (Vision-based Page Segmentation) algorithm is used to process the given page block by block. To predict relevant links, a multi-granularity Shark-search algorithm is combined with the query-dependent HITS algorithm. This not only compensates for the "global" issue and reduces noise links, but also eliminates the topic drift of HITS.

3. Precision and recall are used as the experimental evaluation indexes to compare the crawling quality of the multi-granularity SH topic crawler based on the vector space model against other crawler algorithms. We collect and analyze the statistics; the experimental results show that the topic crawler proposed in this thesis is more effective at improving crawling quality.
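The idea in item 1, TF-IDF weights that also account for where a keyword appears on the page, can be illustrated with a minimal sketch. The abstract does not give the modified formula, so everything specific here is an assumption made for illustration: the position names (`title`, `anchor`, `body`), their weight values, the smoothed IDF form, and all function names are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical position weights (illustrative values, not from the thesis).
POSITION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_tf(page):
    """Term frequency in which each occurrence counts by its position weight.

    `page` maps a position name (e.g. "title") to the text found there.
    """
    tf = Counter()
    for position, text in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in text.lower().split():
            tf[term] += weight
    return tf


def idf(term, pages_tf):
    """Smoothed inverse document frequency over the page collection (assumed form)."""
    df = sum(1 for tf in pages_tf if term in tf)
    return math.log((1 + len(pages_tf)) / (1 + df)) + 1.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance(topic_terms, page, pages):
    """Cosine similarity between a topic vector and a position-weighted page vector."""
    pages_tf = [weighted_tf(p) for p in pages]
    page_tf = weighted_tf(page)
    topic_tf = Counter(t.lower() for t in topic_terms)
    # The vocabulary spans topic and page terms so off-topic words lower the score.
    vocabulary = sorted(set(topic_tf) | set(page_tf))
    weights = {term: idf(term, pages_tf) for term in vocabulary}
    topic_vec = [topic_tf.get(t, 0.0) * weights[t] for t in vocabulary]
    page_vec = [page_tf.get(t, 0.0) * weights[t] for t in vocabulary]
    return cosine(topic_vec, page_vec)


pages = [
    {"title": "topic crawler", "body": "web search engine"},
    {"title": "web search engine", "body": "topic crawler"},
    {"title": "cooking recipes", "body": "pasta sauce"},
]
scores = [relevance(["topic", "crawler"], p, pages) for p in pages]
```

In this toy collection the first two pages contain exactly the same words, but the page whose topic terms sit in the title scores higher than the page where they appear only in the body, which a plain word-count TF-IDF could not distinguish. The third page shares no terms with the topic and scores zero.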
Keywords/Search Tags: topic crawler, VSM, relevance calculation, search strategy, focused crawler, multi-granularity Shark-search algorithm, HITS algorithm, web page segmentation