
Research And Design Of Topic Crawler Through Tunnels Algorithm

Posted on: 2012-11-04 | Degree: Master | Type: Thesis
Country: China | Candidate: X Chang | Full Text: PDF
GTID: 2218330368488324 | Subject: Computer software and theory
Abstract/Summary:
With the spread of Internet applications and services, the volume of information on the Internet is growing exponentially, and retrieving relevant information is becoming steadily harder for users. Search engines have therefore attracted increasing attention. Google has announced that it has indexed 10,000 billion pages, and China has more than 100 billion web pages, a number that is still growing exponentially. Extracting and using such a vast body of information effectively has become a major challenge for search engines.
Generic search engines have gradually shown the shortcomings of low coverage, low accuracy, and low timeliness. Topical search engines (also called professional or vertical search engines) emerged to improve search efficiency, meet the growing demand for personalized service, and serve the specific needs of particular fields and user groups, and they have become increasingly important. A topical search engine focuses only on pages related to its theme and aims to collect and update the relevant information completely. It is characterized as "specialized, refined, and deep" and represents the development trend of the next generation of search engines. Search engines gather web information through web crawlers, so the question of which crawling strategy fetches web information effectively has become a focus of crawler research.
This thesis focuses on improving the page-fetching efficiency of a topic crawler. After analyzing the VSM (vector space model) web page classification algorithm, it improves the algorithm in three respects: feature extraction, feature item weighting, and core vocabulary formation. For feature item weighting, it proposes a weight calculation method that combines several factors from a semantic perspective, which improves the precision of the text similarity calculation.
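As a rough illustration of the VSM similarity step described above, the sketch below represents a page and a topic as sparse term-weight vectors and compares them with cosine similarity. The abstract does not give the thesis's mixed-factor weighting formula, so this example assumes plain smoothed TF-IDF; the function names and the optional `boost` multiplier (a stand-in for any extra per-term factor) are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs, boost=None):
    """Build a sparse {term: weight} vector from a token list.

    doc_freq maps a term to the number of documents containing it.
    boost is a hypothetical per-term multiplier (assumption, not the
    thesis's actual mixed-factor weighting).
    """
    counts = Counter(tokens)
    total = len(tokens)
    vector = {}
    for term, count in counts.items():
        tf = count / total
        # Smoothed IDF so unseen terms do not divide by zero.
        idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        weight = tf * idf
        if boost:
            weight *= boost.get(term, 1.0)
        vector[term] = weight
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A page whose similarity to the topic vector exceeds some chosen threshold would be classified as on-topic; the threshold itself is a tuning parameter that the abstract does not specify.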
On this basis, the thesis adopts the idea that "better parents have better children": it takes the influence of relevance inherited from parent pages into account to improve the topic similarity judgment, and it predicts how many steps the crawler should continue forward by dynamically adjusting the similarity, so that the crawler is guided through tunnels with a flexibly set k value. The result is a dynamically adjusted tunneling algorithm for topic crawlers. It avoids the loss of precision or recall that results from setting k too low or too high. The topic similarity algorithm proposed here decides whether an unrelated page is discarded. This greatly expands the crawler's effective search range, links many otherwise independent web communities into a relatively complete topical group, guides the crawler through web tunnels effectively, and improves both its precision and its recall.
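The tunneling idea can be sketched as a best-first crawl over a link graph in which off-topic pages are still expanded while a hop budget k remains. This is a minimal sketch under stated assumptions, not the thesis's algorithm: the linear blend of a page's own and parent similarity (`alpha`), the rule that derives the budget from the blended score (`tunnel_budget`), and the names `threshold` and `k_max` are all hypothetical choices made for illustration.

```python
import heapq


def inherited_score(own, parent, alpha=0.7):
    """Blend a page's own similarity with its parent's (assumed linear mix)."""
    return alpha * own + (1 - alpha) * parent


def tunnel_budget(score, k_max=3):
    """Assumed rule: a more promising lineage earns more off-topic hops."""
    return round(k_max * score)


def crawl(graph, similarity, seeds, threshold=0.5, k_max=3, alpha=0.7):
    """Return on-topic pages in visit order.

    graph maps a URL to its outlinks; similarity maps a URL to its
    topic similarity in [0, 1]. Both stand in for real fetching and scoring.
    """
    # Heap entries: (-priority, url, off-topic hops left, parent score).
    frontier = [(-1.0, seed, k_max, 1.0) for seed in seeds]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier:
        _, url, hops_left, parent_score = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        own = similarity.get(url, 0.0)
        score = inherited_score(own, parent_score, alpha)
        if own >= threshold:
            relevant.append(url)
            # An on-topic page resets the budget from its blended score.
            hops_left = tunnel_budget(score, k_max)
        elif hops_left <= 0:
            continue  # Tunnel budget exhausted: discard without expanding.
        else:
            hops_left -= 1  # Spend one hop to tunnel through this page.
        for link in graph.get(url, []):
            if link not in visited:
                heapq.heappush(frontier, (-score, link, hops_left, score))
    return relevant
```

On a chain where an on-topic seed reaches another on-topic page only through two off-topic pages, a call with `k_max=3` finds both on-topic pages, while `k_max=1` stops inside the tunnel and returns only the seed. For simplicity this sketch marks a page as visited on first arrival, so a page first reached with an exhausted budget is not reconsidered via a better path.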
Keywords/Search Tags: Topic Crawler, VSM, Web Community, Tunnel, Topic Similarity