Research On A Method Of Focused Crawler For Vertical Search System

Posted on: 2014-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: L W Wang
Full Text: PDF
GTID: 2268330392972112
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of information on the Internet, general search engines, whose results are characterized as "broad, generic, deep", cannot meet the needs of users in different fields who query for topic-specific information. Vertical search engines arose in response.

As the core of a vertical search engine, the focused crawler uses a page-crawling method that directly affects the engine's performance. Traditional focused crawlers have several shortcomings. They describe the topic with a set of feature words and ignore the semantic relationships between those words, which weakens the topic description. Their page segmentation extracts only the relevant-text block and does not consider the relevant-link block. Their priority prediction for candidate links considers only text evaluation or only link-structure evaluation, and either sets all candidate links to the same priority or calculates each priority separately, which is computationally expensive. Traditional Tunneling causes the number of pages unrelated to the topic to increase rapidly, which reduces the accuracy of the focused crawler. To address these shortcomings, this thesis proposes a focused crawler based on topic-related concepts and a comprehensive value, as follows:

1) A topic-related concept set is obtained from the ODP classification tree, and a topic vector is then built by combining it with the topic description document to describe the topic. Taking the concepts related to the topic concept into consideration enhances the topic description.

2) Page segmentation is used to filter noise; then, depending on the type of page, the relevant block text is extracted to calculate topic relevance. This solves the problem of inaccurate page topic relevance calculation caused by noise.

3) Text evaluation and R-HITS are combined to predict the priority of candidate links. The relevant link block is extracted, and its links serve as candidate links, which are divided into high-related links, low-related links and ordinary links. High-related links receive the maximum priority, low-related links are discarded directly, and the priority of ordinary links is calculated from the Page Content, Block Text, Anchor Text and Link-Structure scores.

4) Building on Tunneling, each URL from pages irrelevant to the topic is inserted into the Irrelevant URLs Queue. If the number of URLs from the same website exceeds the upper limit, further URLs from that website are not added to the Irrelevant URLs Queue, which eases the sharp increase in topic-irrelevant pages.

Finally, with Precision and Sum of Information as evaluation indices, the advantages of the proposed focused crawler are demonstrated. Experimental results show that the proposed focused crawler has higher Precision and Sum of Information. It has good application prospects and high practical value for topic page collection in vertical search engines.
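Steps 3) and 4) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how links are sorted into the three classes, how the four scores are combined, or what the per-site limit is. The thresholds `HIGH` and `LOW`, the equal `WEIGHTS`, the `[0, 1]` score range, and the names `link_priority` and `IrrelevantUrlQueue` are all assumptions made for illustration. The sketch also treats the upper limit as the largest number of URLs admitted per site.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

MAX_PRIORITY = 1.0      # assumed scale: all scores and priorities lie in [0, 1]
HIGH, LOW = 0.8, 0.2    # hypothetical thresholds separating the three link classes
# Assumed equal weights for the four scores named in the abstract.
WEIGHTS = {"page": 0.25, "block": 0.25, "anchor": 0.25, "link": 0.25}


def link_priority(relevance, scores):
    """Return the priority of a candidate link, or None if it is discarded.

    relevance: assumed text relevance of the link to the topic, in [0, 1].
    scores: dict with keys 'page', 'block', 'anchor', 'link', standing for the
            Page Content, Block Text, Anchor Text and Link-Structure scores.
    """
    if relevance >= HIGH:   # high-related link: maximum priority
        return MAX_PRIORITY
    if relevance <= LOW:    # low-related link: discarded directly
        return None
    # Ordinary link: comprehensive value as a weighted sum of the four scores.
    return sum(WEIGHTS[key] * scores[key] for key in WEIGHTS)


class IrrelevantUrlQueue:
    """Queue of URLs taken from off-topic pages, capped per website."""

    def __init__(self, per_site_limit):
        self.per_site_limit = per_site_limit
        self.counts = defaultdict(int)   # URLs admitted so far, keyed by host
        self.queue = deque()

    def offer(self, url):
        """Enqueue url unless its site already reached the limit."""
        site = urlsplit(url).netloc
        if self.counts[site] >= self.per_site_limit:
            return False                 # site is over quota: URL is dropped
        self.counts[site] += 1
        self.queue.append(url)
        return True
```

A crawler built along these lines would call `link_priority` for each link in a relevant link block and push the non-`None` results into its frontier. It would call `IrrelevantUrlQueue.offer` for URLs found on off-topic pages, so that a single site cannot flood the tunneling queue.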
Keywords/Search Tags: Focused crawler, Topic-related concept, Page segmentation, Tunneling, R-HITS