
An Incremental Focused Crawling Based On Rocchio's Method

Posted on: 2008-02-07  Degree: Master  Type: Thesis
Country: China  Candidate: M Li  Full Text: PDF
GTID: 2178360212997042  Subject: Computer software and theory
Abstract/Summary:
While crawling the World Wide Web, a focused crawler (or topical crawler) aims to collect as many web pages relevant to some specific topic(s), and as few irrelevant ones, as possible. On the server side, a focused crawler can serve as the core component of various topical search engines; on the client side, it can be deployed as an information-foraging agent. It is no accident that focused crawling has sparked great interest in both academic and engineering circles.

This paper studies online learning for focused crawling via Rocchio's method, the classical relevance-feedback approach in information retrieval. Our proposed approach enhances a focused crawler's performance incrementally and adaptively. In particular, it makes full use of the structural information in HTML code to characterize the link context precisely.

Facing the difficulty of utilizing the diverse relevance clues of a hyperlink, this paper proposes a novel feature representation scheme based on both the tag names and the DOM tree of the HTML code. Suppose there is a hyperlink a in page u pointing to page v. During the crawling session, all the relevance clues must be extracted from u and a alone. Page u's HTML code corresponds to a tree structure, or DOM tree: each HTML tag element corresponds to an internal node, while each text element corresponds to a leaf node. We denote the DOM tree of u by dom(u); the anchor element a is then an internal node of dom(u), and its anchor text is the concatenation of the strings of all leaf nodes of the subtree rooted at a. The URL of v is the "href" attribute of a. We denote internal node n's tag name as tag(n), and the parent node of n as parent(n). To characterize v precisely, we can use only the hyperlink a pointing to it. Unlike the traditional approach of using the text around the hyperlink within a specified boundary, our approach takes all the text of u into account, albeit with each token additionally prefixed by a "relative path prefix". We use the relative path prefix to capture the diverse structural relations between a text node and the anchor a.
The definition of the "relative path prefix" is given below. Suppose t is a leaf node (text node) of dom(u); then t must have a "nearest ancestor node" with respect to the anchor a (i.e., their nearest common ancestor), which is denoted ancestor(t, a). The relative path prefix of t with respect to a is the string:

tag(parent(t)) -> tag(parent(parent(t))) -> … -> tag(ancestor(t, a)) <- … <- tag(parent(parent(a))) <- tag(parent(a))

When computing the relative path prefix, all inline HTML tags are ignored, since they are mainly concerned with presentation style rather than the logical structure of the document.

Following traditional text processing, we extract tokens from the text while ignoring their positions. Prefixed with the relative path prefix, these tokens act as the features of the hyperlink in question.

Our representation of hyperlinks has the following characteristics:

(a) There is no need to decide the surrounding text's boundary in an ad-hoc manner; all the text is taken into account.

(b) Because of the relative path prefix, our feature set is very sparse. However, the relative path prefix helps to filter noise, because the occurrence of irrelevant path prefixes is low.

A prerequisite of a focused crawler is a web page classifier, which can be acquired by training on adequate positive and negative instances drawn from the various web page directories available online. We utilize a traditional machine learning algorithm (e.g., a Support Vector Machine) to train a reliable web page classifier. Each downloaded page is fed to this pre-trained classifier to tell whether it is a relevant web page on topic. This process of online classification can be looked upon as the relevance-feedback process of the information retrieval area, in which the web page classifier acts as a human reader. From this perspective, focused crawling can be modeled as continuous online relevance feedback by the web page classifier.
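The relative path prefix defined above can be sketched in code. The following is a minimal illustration, assuming a toy Node class in place of a real HTML parser; the class, the INLINE_TAGS set, and all identifiers are our own illustrative choices, not part of the thesis:

```python
from dataclasses import dataclass, field

# Toy stand-in for a parsed DOM node; a real crawler would build
# this tree from an HTML parser.
@dataclass(eq=False)  # eq=False: compare nodes by identity, not by value
class Node:
    tag: str                              # tag name, or "#text" for leaves
    text: str = ""
    parent: "Node | None" = None
    children: "list[Node]" = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

# Inline tags are skipped: they affect presentation, not logical structure.
INLINE_TAGS = {"b", "i", "em", "strong", "span", "font"}

def _chain_to_root(node):
    """The node itself followed by all of its ancestors, in order."""
    out = []
    while node is not None:
        out.append(node)
        node = node.parent
    return out

def relative_path_prefix(text_node: Node, anchor: Node) -> str:
    """tag(parent(t)) -> ... -> tag(ancestor(t, a)) <- ... <- tag(parent(a))"""
    up = _chain_to_root(text_node.parent)
    down = _chain_to_root(anchor.parent)
    common = next(n for n in up if n in down)      # nearest common ancestor
    up_tags = [n.tag for n in up[: up.index(common) + 1]
               if n.tag not in INLINE_TAGS]
    down_tags = [n.tag for n in down[: down.index(common)]
                 if n.tag not in INLINE_TAGS]
    return "->".join(up_tags) + "".join("<-" + t for t in reversed(down_tags))
```

A token w in text node t would then yield the feature `relative_path_prefix(t, a) + ":" + w`, e.g. `"h2->div<-ul<-li:crawler"` for a heading above the list that contains the anchor.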
Using the above feature representation of hyperlinks, a traditional relevance feedback algorithm, Rocchio, can be used to train a global feature profile online and dynamically, thereby incrementally improving download precision. We give the steps in detail as follows:

(a) First, we encode the features of each hyperlink (or its corresponding web page), as well as a globally maintained topical central vector, using tokens prefixed with the relative path prefix.

(b) After downloading a batch of pages, the positive and negative relevance feedbacks produced by the web page classifier are folded into the global topical central vector using Rocchio's method.

(c) During crawling, the similarity between each hyperlink's feature vector and the central vector is computed. Hyperlinks with high similarity are given high download priority accordingly.

(d) Both the hyperlink feature vectors and the central vector are represented with the Vector Space Model of the IR area. Every English token prefixed with a relative path prefix corresponds to a dimension of the vector space, whose weight is computed by the traditional TF-IDF scheme. To be precise, the weight of a token is TF × log(N / DF), where TF denotes its frequency in the document, DF denotes the document frequency, i.e., the number of documents in the collection containing it, and N denotes the size of the document collection. Our program computes the values of N and DF dynamically.

(e) If a page has too many links on it, we limit the number of links considered. Through experiments we set this limit to 40, which strikes a balance between speed and effectiveness.
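Steps (b)–(d) above can be sketched as follows. This is a minimal sketch using Python dicts as sparse vectors; the Rocchio coefficients alpha, beta, and gamma are common textbook defaults we assume for illustration, and all function names are our own, not values or identifiers reported in the thesis:

```python
import math
from collections import defaultdict

def tfidf_vector(tokens, df, n_docs):
    """Sparse TF-IDF vector over relative-path-prefixed tokens (step d):
    weight(t) = TF(t) * log(N / DF(t)). Tokens with no recorded DF are skipped."""
    tf = defaultdict(int)
    for tok in tokens:
        tf[tok] += 1
    return {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if df.get(t)}

def rocchio_update(centroid, positives, negatives,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Fold a batch of classifier feedbacks into the global topical
    central vector (step b). Vectors are dicts: token -> weight."""
    new = defaultdict(float, {t: alpha * w for t, w in centroid.items()})
    for vec in positives:                        # relevant pages pull the
        for t, w in vec.items():                 # centroid toward them ...
            new[t] += beta * w / len(positives)
    for vec in negatives:                        # ... irrelevant ones push away
        for t, w in vec.items():
            new[t] -= gamma * w / len(negatives)
    return {t: w for t, w in new.items() if w > 0}   # keep the vector sparse

def cosine(u, v):
    """Similarity used to rank frontier hyperlinks (step c)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A crawler would then assign each frontier link the priority `cosine(link_vector, centroid)` and always download the highest-priority URL next.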
In our approach, both N and IDFwillbeupdatedcontinuouslyonline.To validate the performance of our method, we chose 12 topics fromDMOZas thetest-bed forourevaluation,and we chooseBestFirst method asthe main comparative object, which is embarrassingly simple but has betterperformancethanmanycomplicatedalgorithms.Theexperimentalresultsshowthat:(a) In the side of total performance: though the two methods havesimilar performance on some topics, our method outperforms BestFirst onmost topics, especially on those topics with stronger"Topical Locality"attributes. (b) On the issue of incrementally adaptability of the approach: ourapproach has stable performance and high harvest ratio. Our approachespecially performs well on those topics with stronger"topical locality". Onsuch topics, our approach usually gives surprisingly good results during thelattercrawlingsession.Though our method achieves the elementary success, we shouldexperiment on the larger dataset. Furthermore, our task is only comparedwith BestFirst method, and we should compare with more other popularfocused crawling methods, including Chakrabarti's accelerated focusedcrawlingfromWWW2002.For the link forecast is the core of the focused crawling, we willcontinuetoresearchthelinkstructureandanalysisinthefuture.Itisbelievedthatthetechniquewillplayaroleinmorefields.
Keywords/Search Tags: Incremental