Research On A Method Of Focused Crawler For Vertical Search System

Posted on:2014-01-31

Degree:Master

Type:Thesis

Country:China

Candidate:L W Wang

GTID:2268330392972112

Subject:Computer application technology

Abstract/Summary:

With the rapid growth of information on the Internet, the results returned by general search engines, which are characterized as "broad, generic, deep", cannot meet the needs of users in different fields who query for information on a specific topic. The vertical search engine arose in response.

The focused crawler is the core of a vertical search engine, and the method it uses to crawl pages directly affects the engine's performance. Traditional focused crawlers have several shortcomings. They describe the topic with a set of feature words and ignore the semantic relationships between those words, which weakens the topic description. Their page segmentation extracts only the relevant text block and does not consider the relevant link block. Their priority prediction for candidate links considers only text evaluation or only link-structure evaluation, and either assigns all candidate links the same priority or calculates each priority separately, which requires a large amount of computation. Traditional Tunneling causes the number of pages unrelated to the topic to increase rapidly, which reduces the accuracy of the focused crawler. To address these shortcomings, a focused crawler based on topic-related concepts and a comprehensive value is proposed, as follows:

1) A topic-related concept set is obtained from the ODP classification tree, and a topic vector is then built by combining it with the topic description document to describe the topic. In this thesis, the concepts related to the topic concept are taken into consideration, which enhances the topic description.

2) Page segmentation is used to filter noise; then, depending on the type of page, the text of the relevant block is extracted to calculate topic relevance. This solves the problem of inaccurate page topic relevance calculation caused by noise.

3) Text evaluation and R-HITS are combined to predict the priority of candidate links. The relevant link block is extracted, and its links serve as candidate links, which are divided into high-related links, low-related links and ordinary links. The high-related links get the maximum priority, low-related links are discarded directly, and the priority of ordinary links is calculated from Page Content, Block Text, Anchor Text and Link-Structure Score.

4) Based on Tunneling, each URL from pages that are irrelevant to the topic is inserted into the Irrelevant URLs Queue. If the number of URLs from the same web site exceeds the upper limit, further URLs from that web site do not join the Irrelevant URLs Queue, which eases the sharp increase in irrelevant pages.

Finally, with Precision and Sum of Information as evaluation indices, the advantages of the proposed focused crawler are demonstrated. Experimental results show that the proposed focused crawler achieves higher Precision and Sum of Information. It has good application prospects and high practical value for topic page collection in vertical search engines.
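The per-site cap on the Irrelevant URLs Queue described in step 4 can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the class name IrrelevantUrlQueue, the method names, the default limit of 3, and the choice to identify a web site by its host name are all assumptions made for illustration, since the abstract does not state them. The sketch also assumes the cap counts every URL ever admitted for a site, not only those still waiting in the queue.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class IrrelevantUrlQueue:
    """FIFO queue of URLs found on off-topic pages, capped per web site.

    Hypothetical helper: names and defaults are illustrative only.
    """

    def __init__(self, max_per_site=3):
        # Assumed limit; the abstract does not give the actual upper bound.
        self.max_per_site = max_per_site
        self._queue = deque()
        # Number of URLs admitted so far for each site (assumed cumulative).
        self._per_site = defaultdict(int)

    def offer(self, url):
        """Enqueue url unless its site has reached the cap.

        Returns True if the URL was queued, False if it was rejected.
        """
        # Assumption: a "web site" is identified by the URL's host name.
        site = urlparse(url).netloc.lower()
        if self._per_site[site] >= self.max_per_site:
            return False
        self._per_site[site] += 1
        self._queue.append(url)
        return True

    def pop(self):
        """Return the oldest queued URL."""
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)
```

With max_per_site=2, a third URL offered from the same host is rejected while a URL from a different host is still accepted, so a single site with many off-topic pages cannot flood the tunneling queue.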

Keywords/Search Tags:

Focused crawler, Topic-related concept, Page segmentation, Tunneling, R-HITS

Related items

1	Research On The Topic Crawler Algorithm Based On Vector Space Model
2	Research On Topic Focused Web Crawler And Related Technologies
3	Research On Focused Crawler Based On Page Segmentation
4	Research On Crawling Techniques Of Focused Search Engine
5	Research On The Key Technology And Implementation Of The Focused Crawler Based On HITS And Shark-Search
6	Focused Crawler Based On Domain Ontology And Similarity Concept Context Graph
7	Research On Focused Crawler Based On SVM Classification Algorithm
8	The Design And Implementation Of The Complex Rules-Driven Focused Crawler System
9	The Research Of Specific-topic Crawling Strategy Based On Hierarchical Optimized Dynamic Concept Context Graph
10	Design And Implementation Of Focused Crawler For Blogs