Research And Implementation Of Subject-oriented Dual-bound Web Page Crawling Methods

Posted on: 2012-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: C X Jia
GTID: 2178330338984133
Subject: Computer software and theory
Abstract/Summary:
Information on the Web tends to cluster locally by topic, and this is one of the Internet's characteristic features. As demand for subject-oriented access to Internet information grows, users expect information gathering to be more effective and more quickly updated, to discover the main resource areas for a topic automatically, and to support the study of how topics change and are distributed. Because information on any specific subject generally makes up only a small, scattered part of the Web, traditional breadth-first and depth-first search strategies struggle to collect it efficiently enough. The main task of a subject-oriented web crawling system is to crawl as many topic-relevant pages as possible with limited network bandwidth, storage capacity, and time.

This thesis first briefly introduces how general search engines work and then describes their key technologies, including web crawlers, information extraction, text classification, and web page ranking. It next introduces how subject-oriented search engines work and analyzes their key technologies and research focus.

We then study two key technologies: the construction and updating of the topic feature model, and the identification of topic-relevant pages.

Next, the thesis focuses on the crawling strategies of topic crawlers, discussing content-based heuristic methods and methods based on the Web's hyperlink structure. To address efficiency and topic drift, it proposes a new dual-bound page collection method that combines page content with hyperlink structure; the method improves the search engine's resource coverage and better avoids topic drift.

Finally, we implemented a prototype search engine. The system automatically and accurately crawls topic-relevant pages, saves network bandwidth, and runs stably. In comparison tests, its recall, precision, and subject satisfaction index all reached a high level.
Keywords/Search Tags: Topic Search Engine, Web crawler, Crawl strategy, HITS, Shark Search
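
The abstract does not spell out the dual-bound method, so the following is only a minimal sketch of the general idea it builds on: a best-first topic crawler whose link priority mixes a content signal with a link-derived signal, loosely in the spirit of Shark Search. It is not the thesis's algorithm and it does not implement HITS. The in-memory `pages` graph, the function names, and the `gamma`, `decay`, and `threshold` parameters are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def cosine(text, topic_terms):
    """Cosine similarity between a text's term counts and a set of topic terms."""
    counts = Counter(text.lower().split())
    if not counts or not topic_terms:
        return 0.0
    dot = sum(counts[t] for t in topic_terms)
    norm = math.sqrt(sum(c * c for c in counts.values())) * math.sqrt(len(topic_terms))
    return dot / norm


def crawl(pages, seed, topic_terms, gamma=0.5, decay=0.5, threshold=0.2, limit=10):
    """Best-first crawl over a hypothetical in-memory graph.

    pages maps url -> (page_text, [(anchor_text, target_url), ...]).
    Returns the visited (url, relevance) pairs in visiting order.
    """
    frontier = [(-1.0, seed)]  # max-heap emulated with negated priorities
    enqueued = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        relevance = cosine(text, topic_terms)  # content signal of the parent page
        visited.append((url, relevance))
        for anchor, target in links:
            if target in enqueued or target not in pages:
                continue
            inherited = decay * relevance                # score passed along the link
            anchor_score = cosine(anchor, topic_terms)   # hint from the anchor text
            priority = gamma * inherited + (1 - gamma) * anchor_score
            # Links scoring below the threshold are pruned to limit topic drift;
            # they may still be enqueued later via a more relevant parent.
            if priority >= threshold:
                enqueued.add(target)
                heapq.heappush(frontier, (-priority, target))
    return visited
```

Called with a small graph and a topic such as `{"crawler", "topic"}`, the sketch visits pages reachable through on-topic anchors first and skips links whose combined score falls below `threshold`. Raising `threshold` trades coverage for less topic drift; lowering it does the opposite.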