
Study On Focused Crawling Technique For Vertical Search Engine

Posted on: 2013-03-02  Degree: Master  Type: Thesis
Country: China  Candidate: Y Shi  Full Text: PDF
GTID: 2248330362473931  Subject: Computer system architecture
Abstract/Summary:
As the Internet develops rapidly, problems with traditional general-purpose search engines have gradually emerged, such as low Web coverage and inaccurate search results. A vertical search engine is better suited to these problems: it collects domain-specific Web pages with a focused crawling technique and provides a corresponding retrieval service. Focused crawling is therefore the core of a vertical search engine, and improving it directly improves the engine's performance. This thesis studies the key techniques involved in focused crawling, namely topic description, priority calculation for candidate links, and adaptive crawling strategies. The main contents are as follows:

(1) A topic description method based on Wikipedia was proposed. A clear and accurate topic description is the foundation of a focused crawler, and the way topic relevance is computed depends on it. Many current focused crawlers, however, describe the topic with the vector space model and compute topic relevance by simple lexical matching. This disregards the semantic relations between keywords and scatters the keywords in the topic vector, which weakens the topic description. Although ontologies or semantic dictionaries could be used to analyze the semantic relations between words, suitable ontology models are rare, and some semantic dictionaries are weak in openness, vocabulary coverage, and regular updating. To address these shortcomings, the new method used Wikipedia, which is freely available, up to date, and objective, as background knowledge. It constructed the topic vector space from the category tree and mapped the topic's descriptive article to a vector in that space, which introduced semantic analysis into the relevance computation. It also established a disambiguation reference table to handle terms that would otherwise be mapped to a concept that is not the nearest in nature, or to more than one concept.
Experimental results show that this method significantly outperforms the traditional topic description method in terms of precision and sum of information.

(2) An approach to calculating the priorities of candidate links with page segmentation was presented. How candidate links are prioritized determines the direction and outcome of focused crawling. Many focused crawlers calculate the priority of a candidate link from the content of the Web page containing it, its anchor text, and the text surrounding it. However, Web pages contain a lot of noise such as advertisements, the surrounding text of an anchor link is hard to delimit, and anchor text carries limited information. In this thesis, page segmentation based on depth-first traversal was therefore introduced first to filter out part of the noise nodes in Web pages. The priority of a candidate link was then measured as an aggregate value that takes into account the content of the Web page containing it, its block text, and its anchor text. Experiments validated that page segmentation effectively enhances the performance of focused crawling.

(3) Two adaptive strategies, based on information gain (IG) and ratio of sum of information (RSI) respectively, were introduced. The initial topic description generated from the concept hierarchy in the category tree tends to be unrealistic and inaccurate. To improve it, at every interval during crawling the crawler automatically learned, from all crawled pages and according to the two adaptive strategies, the contribution of each concept in the topic vector space to the topic description, and fed this back to modify the weight of each concept.
Experiments demonstrate that both adaptive strategies enhance the crawling ability of the focused crawler: Web pages crawled with IG are more related to the topic than those crawled with RSI, whereas RSI beats IG in overall stability.

Finally, a focused crawling prototype system was designed and implemented, on which a series of experiments was carried out to validate and analyze the methods proposed in this thesis.
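
The aggregate link priority described in (2) can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything below is an assumption made for illustration: the function names (`cosine`, `vectorize`, `link_priority`), the weighted-sum form, the example weights, and the use of plain bag-of-words cosine similarity as a stand-in for the Wikipedia-based relevance measure of (1).

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no relevance to the topic
    return dot / (norm_a * norm_b)


def link_priority(topic, page_text, block_text, anchor_text,
                  weights=(0.2, 0.3, 0.5)):
    """Aggregate priority of a candidate link.

    Combines the topic relevance of three sources named in the abstract:
    the containing page, the segmented block around the link, and the
    anchor text. The weights are hypothetical, not taken from the thesis.
    """
    topic_vec = vectorize(topic)
    sources = (page_text, block_text, anchor_text)
    scores = [cosine(topic_vec, vectorize(s)) for s in sources]
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    topic = "focused crawler topic relevance"
    page = "a focused crawler ranks links by topic relevance"
    # Two links on the same page: one in an on-topic block, one in an ad block.
    on_topic = link_priority(topic, page, "crawler topic relevance", "focused crawler")
    in_ad = link_priority(topic, page, "buy cheap shoes today", "discount shoes")
    print(on_topic > in_ad)
```

Under these assumptions, a link inside an advertisement block scores lower than a link inside an on-topic block even though both share the same containing page, which is the effect that block-level text is meant to provide over page-level text alone.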
Keywords/Search Tags: Focused Crawling, Wikipedia, Topic Description, Page Segmentation, Adaptive Strategy