Research on Web Information Collection Technology Based on Focused Crawling

Posted on: 2012-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: B Jiang
GTID: 2208330332992482
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and the World Wide Web (WWW), web information is growing exponentially, and users find it increasingly difficult to locate what they need in this vast repository. To address this problem, topic-oriented focused crawler technology has been proposed in the field of web information retrieval. The focused crawler is the foundation and core of topical search, and as the technology has matured it has gradually been applied, in both practice and research, to personal information collection, link effectiveness analysis, site structure analysis, user interest mining, and related areas. Topical web information collection based on focused crawling therefore has broad practical significance.

This thesis introduces the basic principles and workflow of search engine systems and web crawlers, and analyzes in particular the characteristics of focused crawlers, page topic analysis, and search strategy algorithms based on link structure and content. In implementing the key technologies, the thesis determines the topic relevance of collected pages and extracts topic features. Computing topic relevance with a web topic relevance algorithm based on the vector space model improves the precision of topical information collection. The topic relevance of candidate URLs is predicted from the distribution characteristics of topic pages and from extended metadata. Taking the tunneling characteristics of topic pages into account improves the recall of topical information collection. The link structure analysis algorithm PageRank is introduced, and the thesis proposes TPR (Topical PageRank), a method that computes a single combined value from a URL's relevance and importance. Experiments measured the technical indicators of information collection and validated the effectiveness of the improvements.
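As a rough illustration of the vector space model relevance measure mentioned in the abstract, the sketch below scores a page against a topic by cosine similarity of term-frequency vectors. The abstract does not give the thesis's actual weighting scheme or the TPR formula, so everything here is an assumption: the function names, the whitespace tokenizer, the raw term-frequency weights, and in particular `combined_score`, which is only a hypothetical linear mix of relevance and a PageRank-style importance value, not the author's TPR.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercased, whitespace-split tokens.

    Assumption: a real system would add stemming, stop-word removal,
    and TF-IDF weighting; none of that is specified in the abstract.
    """
    return Counter(text.lower().split())


def cosine_relevance(page_vec, topic_vec):
    """Cosine similarity between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(w * topic_vec.get(t, 0) for t, w in page_vec.items())
    norm_p = math.sqrt(sum(w * w for w in page_vec.values()))
    norm_t = math.sqrt(sum(w * w for w in topic_vec.values()))
    if norm_p == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_p * norm_t)


def combined_score(relevance, importance, alpha=0.5):
    """Hypothetical linear mix of topic relevance and link importance.

    This is NOT the thesis's TPR formula, which the abstract does not state;
    alpha is an illustrative parameter.
    """
    return alpha * relevance + (1 - alpha) * importance
```

A crawler built along these lines could rank its frontier by `combined_score`, fetching first the URLs whose estimated relevance and importance are jointly highest.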
Keywords/Search Tags:Vertical Search Engine, Focused Crawler, Topic Relevant Degree, Tunnel, TPR