Research Of Focused Crawling Strategy

Posted on: 2008-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Zheng
Full Text: PDF
GTID: 2178360242478829
Subject: Computer application technology
Abstract/Summary:
As the amount of information on the web grows, general-purpose search engines can no longer satisfy the needs of personalized information retrieval. Vertical search engines, which offer more precise classification and more complete data within a domain, have emerged in response. Which strategy a focused crawler should use to crawl the web efficiently is currently one of the most active research problems in the vertical search engine area, so this thesis concentrates on focused crawling strategy.

After analyzing the advantages and disadvantages of classic focused crawling strategies, this thesis proposes an ontology-based focused crawler model. The model has three main parts: a topical relevance filtering strategy, a URL queue prioritizing strategy, and ontology management. Each part is described in detail.

To overcome the deficiencies of the keyword-based topic filtering strategies in wide use today, the thesis proposes a concept-based topic filtering strategy derived from the idea of concept congregation. The strategy aggregates the keywords of a webpage that relate to topical concepts, weighting each by its contribution to the topic, and can therefore compute semantic topical relevance effectively. Experiments showed that this ontology-based topic filtering strategy achieves higher precision than the keyword-based one.

To forecast link value precisely and avoid the problem of "topic changing" (topic drift), two enhanced URL queue priority ordering strategies are proposed. The first is a linkage forecasting strategy, which performs semantic topical filtering first and then performs linkage filtering. The second is a combined forecasting strategy, which combines the semantic relevance of the parent page and of the anchor text into a single metric for prioritizing the waiting queue. The experiments showed that each strategy has its own advantages and disadvantages, so the choice between them depends on the requirements of the application.
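
The abstract gives no formulas, so the following is only a minimal Python sketch of how a concept-based relevance score and the combined forecasting strategy described above might fit together. The ontology contents, the concept weights, the mixing parameter `alpha`, and the names `concept_relevance`, `combined_priority`, and `Frontier` are all assumptions made for illustration and are not taken from the thesis.

```python
import heapq
import re

# Hypothetical ontology: concept -> (weight, keywords mapped to that concept).
# Both the concepts and the weights are illustrative assumptions.
ONTOLOGY = {
    "crawler": (1.0, {"crawler", "spider", "crawling"}),
    "search engine": (0.8, {"search", "retrieval", "index"}),
    "ontology": (0.6, {"ontology", "concept", "semantic"}),
}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def concept_relevance(text, ontology=ONTOLOGY):
    """Aggregate keyword hits per concept, weight them, and normalize by length."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    score = 0.0
    for weight, keywords in ontology.values():
        hits = sum(1 for token in tokens if token in keywords)
        score += weight * hits
    return score / len(tokens)


def combined_priority(parent_text, anchor_text, alpha=0.5):
    """Mix parent-page relevance and anchor-text relevance (alpha is assumed)."""
    return (alpha * concept_relevance(parent_text)
            + (1 - alpha) * concept_relevance(anchor_text))


class Frontier:
    """URL waiting queue that returns the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

In this sketch a link whose anchor text mentions topical concepts is popped from the frontier before a link from the same parent page whose anchor text does not, which is the ordering behavior the combined strategy aims for.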

To avoid the "domain best" problem, an ontology management module, consisting of an ontology builder and an ontology model, was added to the system. It automatically learns the weight of each concept from the information collected while crawling. With this learning capability, the crawler maintains good performance even when crawling for a long time.

Finally, the thesis presents a demo system, Focused crawler, which was used to test the strategies proposed here.
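
The abstract does not say how concept weights are learned. As one possible illustration of learning weights from information collected while crawling, the sketch below moves each weight toward the share of concept hits observed on pages judged relevant. The update rule, the `rate` parameter, and the function name `update_weights` are assumptions, not the method used in the thesis.

```python
def update_weights(weights, concept_hits, rate=0.1):
    """Return new concept weights nudged toward observed hit shares.

    weights:      concept -> current weight
    concept_hits: concept -> number of hits seen on relevant pages (assumed input)
    rate:         assumed learning rate between 0 and 1
    """
    total = sum(concept_hits.values())
    if total == 0:
        # Nothing observed yet, so the weights are left unchanged.
        return dict(weights)
    updated = {}
    for concept, weight in weights.items():
        observed = concept_hits.get(concept, 0) / total
        updated[concept] = (1 - rate) * weight + rate * observed
    return updated
```

Calling such a function periodically during a crawl would let concepts that keep appearing on relevant pages gain weight gradually, without a single batch of pages dominating the ontology.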
Keywords/Search Tags: focused crawler, ontology, search engine