
Research And Implementation On Algorithms Of Topical Crawler

Posted on: 2014-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: J J Du
Full Text: PDF
GTID: 2268330401976354
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, both the number of users and the amount of information online have grown rapidly. This growth challenges search engines: a traditional general-purpose search engine can no longer provide comprehensive, in-depth professional service, and vertical search engines have emerged in response. The topic crawler is the collection module of a vertical search engine and is responsible for gathering web pages, so it directly affects the quality of the search service. As an important component of the search engine, it merits study and improvement.

In recent years, research on topic crawlers has concentrated on two aspects: the web crawling strategy and the topic relevance algorithm. This thesis focuses on both. The main work and results are:

(1) A discussion of topic crawler technology. The thesis briefly describes the distribution characteristics of topic pages on the Internet, URLs, the application of regular expressions, fetching web pages, analyzing their content, and related topics. These form the foundation for building a topic crawler.

(2) Study and improvement of the topic relevance algorithm. Building on the traditional vector space model (VSM) and on the structural characteristics of web pages, the thesis assigns extra weight to feature keywords that appear at special positions in a page. Based on the semantics of the subject words, it introduces a semantic similarity matrix to transform the subject words in the page. These methods improve the crawler's ability to recognize topic-relevant pages and raise the download rate of topic-relevant pages, avoiding irrelevant pages as far as possible.

(3) Introduction of the genetic algorithm, which is strong at global search, and the simulated annealing algorithm, which is effective at local search, into the topic crawler as a search strategy. Combined with the VSM relevance algorithm and the importance of URL links, they are used to calculate the priority of links not yet crawled, which determines the crawling direction of the topic crawler.

(4) Implementation of an industry-customized crawler: the crawler module of an Australian contractor system. The thesis describes in detail the technical characteristics of an industry-customized crawler.

(5) Experiments using Heritrix, a general-purpose crawler framework. The author compares the improved VSM relevance algorithm with the traditional VSM. The crawler then uses HITS, Best-First, and the SAGA (simulated annealing genetic algorithm) focused-crawling search strategy, each combined with the improved VSM algorithm, to fetch topic pages. Counts of the relevant pages downloaded under each of the three strategies show that the SAGA-based strategy has some advantage over both HITS, a link-relevance algorithm, and Best-First, a content-relevance algorithm.
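
The improved relevance measure in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the field names, the values in POSITION_WEIGHTS, the whitespace tokenizer, and the toy similarity matrix are all assumptions chosen for illustration. It shows the general idea only: weight term counts by where they occur in the page, transform the page vector with a semantic similarity matrix, then compare it with the topic vector by cosine similarity.

```python
import math

# Hypothetical position weights (not taken from the thesis): a term found in
# the title counts more than the same term found in the body text.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "anchor": 1.5, "body": 1.0}


def weighted_term_vector(page_fields, topic_terms):
    """Build a position-weighted term-frequency vector over the topic terms."""
    index = {term: i for i, term in enumerate(topic_terms)}
    vector = [0.0] * len(topic_terms)
    for field, text in page_fields.items():
        weight = POSITION_WEIGHTS.get(field, 1.0)
        # Naive whitespace tokenization, assumed here for brevity.
        for token in text.lower().split():
            if token in index:
                vector[index[token]] += weight
    return vector


def apply_similarity(vector, sim):
    """Transform the vector with a semantic similarity matrix: v' = S v."""
    n = len(vector)
    return [sum(sim[i][j] * vector[j] for j in range(n)) for i in range(n)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(page_fields, topic_terms, topic_vector, sim):
    """Topic relevance of a page under the position-weighted, transformed VSM."""
    page_vector = weighted_term_vector(page_fields, topic_terms)
    return cosine(apply_similarity(page_vector, sim), topic_vector)


if __name__ == "__main__":
    terms = ["crawler", "spider", "search"]
    topic = [1.0, 0.0, 1.0]
    # Toy matrix: "crawler" and "spider" are treated as near-synonyms.
    sim = [[1.0, 0.8, 0.0],
           [0.8, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
    page = {"title": "Spider design", "body": "a search engine component"}
    print(relevance(page, terms, topic, sim))
```

With an identity matrix in place of sim, a page that mentions only "spider" would score zero against a topic vector built on "crawler". The similarity matrix lets such a page still register as relevant, which is the kind of effect the abstract attributes to the semantic transformation.
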
Keywords/Search Tags: Topic Crawler, VSM Algorithm, Search Strategy based on SAGA