
Incremental Focused Crawling Using Anchor Text

Posted on: 2008-08-29    Degree: Master    Type: Thesis
Country: China    Candidate: Z C Yao    Full Text: PDF
GTID: 2178360212497013    Subject: Computer software and theory
Abstract/Summary:
The World Wide Web, which appeared in the 1990s, became within a few years the biggest service on the Internet, overtaking other Internet services such as FTP and Telnet. As the Web spread around the world, more and more information was uploaded to it. Web resources grew so fast that they also brought the problem of "information burst": faced with an ocean of information, getting exactly the information one needs becomes very difficult, and knowledge seems scarce in front of abundant information. Search engines were born to solve this problem.

But search engines cannot solve the problem of "information burst" once and for all. First, although a search engine can return large quantities of related results, those results are the same for all users and their quality still has room to improve. Secondly, search engines place ever higher demands on network bandwidth and computing resources, and how to lower this load effectively is also a problem.

Focused crawling was born to address these problems in search engines. Built on search-engine technology, a focused crawler applies machine learning and other intelligent methods to download more relevant pages at lower cost. Researchers and search-engine practitioners are devoting great energy to this technology.

Research on focused crawling concentrates mainly on two aspects: how to define the topic, and how to organize the crawl queue so that the crawling procedure runs more efficiently. Work on the first problem has led to using a classifier as the basic framework of focused crawling; work on the second problem is what we call the crawling strategy. Research on crawling strategies currently follows two directions: crawl prediction based on content, and crawl prediction based on Web structure.

Incremental focused crawling using anchor text is one form of content-based crawl prediction. Because anchor text carries suggestive, condensed information, researchers are doubly interested in it, yet its apparent weaknesses also make many of them shrink from it. First, many researchers believe that because anchor text is so short and small, even slight noise distorts it, making it very "weak". Secondly, anchor text sometimes does not fully reflect the content of the page a link points to, but is merely navigation information such as "Return" or "Next", and it also contains a great deal of cheating (spam) information.

Researchers have tried to use the context around anchor text to strengthen its resistance to interference, but we find that anchor-text context not only fails to resolve anchor text's so-called "problems" but brings real problems of its own. First, it introduces impure, irrelevant information; secondly, the context of a navigation link is usually also navigational, and the context of a cheating link is also cheating.

Based on our study of Web structure and links, we make two assumptions:
1. The author of a web page always hopes that the anchor text describes the content of the linked page well.
2. The Web is connected: a web page is linked to by many links and does not depend only on navigation links.

From these assumptions we draw the following conclusion: a page on the Web is always linked to by some anchor text that carries enough information to describe that page.

On this basis, we believe that pure anchor text can be used to guide crawling; the remaining task is to find a suitable computing model. Experiments show that applying a traditional classification model to anchor text performs badly. Through our research we find that the Centroid classifier performs well on anchor text. Moreover, the Centroid classifier uses only positive examples and therefore supports the incremental procedure well.
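To make the Centroid classifier's role concrete, the following is a minimal sketch of a centroid model over anchor text. The tokenization, term weighting, class name, and score threshold below are illustrative assumptions, not the thesis's exact implementation:

```python
import math
from collections import Counter

class CentroidAnchorClassifier:
    """Centroid classifier trained on positive anchor-text examples only.

    New anchors are scored by cosine similarity to the centroid; newly
    confirmed on-topic anchors can be folded into the centroid cheaply,
    which is what makes it convenient for incremental crawling.
    """

    def __init__(self):
        self.centroid = Counter()   # summed term weights of positive examples

    @staticmethod
    def _vector(text):
        # Plain term-frequency vector; a real system might use TF-IDF
        # and a proper (e.g. Chinese) word segmenter.
        return Counter(text.lower().split())

    def train(self, anchor_texts):
        # Incremental update: just add the new positive examples in.
        for text in anchor_texts:
            self.centroid.update(self._vector(text))

    def relevance(self, anchor_text):
        """Cosine similarity between the anchor text and the centroid."""
        vec = self._vector(anchor_text)
        dot = sum(w * self.centroid[t] for t, w in vec.items())
        norm = math.sqrt(sum(w * w for w in vec.values())) * \
               math.sqrt(sum(w * w for w in self.centroid.values()))
        return dot / norm if norm else 0.0

# Illustrative usage: seed with topic anchors, score candidate links.
clf = CentroidAnchorClassifier()
clf.train(["ice hockey scores", "NHL ice hockey news"])
print(clf.relevance("ice hockey standings"))   # relatively high
print(clf.relevance("next page"))              # near zero
```

Because the model is only a summed vector of positive examples, retraining amounts to adding new vectors, which is why it fits the incremental procedure well.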
A focused crawler is usually made up of the following parts: a Fetcher, an HTMLParser, a Classifier, a visited-link judgment module (IsURLVisited), a Frontier, and a Crawled Frontier. The focused crawler runs as follows (an illustrative sketch of this loop is given at the end of this abstract):
1. Off-line training of the text classifier.
2. Initialization of the Frontier.
3. The Fetcher gets a URL from the Frontier, downloads that web page, and puts the URL into the Crawled Frontier. From this step on, the crawler is in the on-line fetching phase; it repeats steps 3-6 until the end condition is satisfied.
4. The HTMLParser parses the web page, extracts URLs, and converts the page into plain text.
5. The Classifier classifies the plain text and puts the relevance value, along with the URL, into the Crawled Frontier.
6. IsURLVisited checks whether each URL extracted by the HTMLParser has already been visited; if not, the URL is put into the Frontier.

We use the harvest rate to estimate the performance of a focused crawler. The harvest rate is the average relevance value of the downloaded web pages.

Our framework adopts two classifiers: the Background Classifier, which judges how related a downloaded web page is, and the Fore Classifier, which guides the crawling procedure. We store the page text, and that text is later used to retrain the Fore Classifier.

The execution process is almost the same as the common focused-crawler procedure except in two respects:
1. The off-line training phase trains two classifiers: the Background Classifier uses an SVM model and the Fore Classifier uses the Centroid model.
2. On-line crawling uses the incremental feature: after a certain number of web pages have been fetched, the program pauses and retrains the Fore Classifier.

For off-line training of the Background Classifier we use Yahoo's on-line directory categories /Sports/Hockey/Ice-Hockey and /Entertainment/Genres. There are 554 web page links under /Sports/Hockey/Ice-Hockey; after training, the classifier's precision, recall, and F1 value are 0.8591, 0.8624, and 0.8598 respectively. /Entertainment/Genres has 81,572 web page links in total; after training, the classifier's precision and recall are 0.7875 and 0.8122 respectively.

We carried out four experiments in total:
1. A contrast experiment of crawling with the Best-First method according to anchor text.
2. A contrast experiment of crawling with different anchor-text fetching strategies.
3. A contrast experiment of crawling with different document-feature weight computing methods.
4. Incremental focused crawling using anchor text.

According to these experiments, our crawling strategy is clearly better than the others and is very efficient. Through analysis we draw the following conclusions: using pure anchor text to guide the crawling process leads to a good harvest rate; the Centroid classifier judges the relevance of anchor text well; and the incremental feature is an effective way to handle topic expansion. These conclusions correct the view that anchor text is too short and small to guide crawling.

Focused crawling still has some problems to solve before it can be put into actual use, such as Chinese word segmentation and P2P topic search. There is a long way to go to combine focused crawling with existing search-engine technology and finally push search engines to the fourth generation.

In closing, the Web has let people communicate with one another to their satisfaction since it appeared, but the distributed computing environment is still evolving, and that evolution will bring us even greater convenience. The Web's development is foreseeable, and our work is just a beginning for further research.
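As a companion to the crawl loop and harvest-rate measure described above, here is a minimal sketch under assumed interfaces: fetch, parse, the classifiers' relevance/train methods, the 0.5 storage threshold, and the retraining interval are illustrative placeholders rather than values or APIs from the thesis.

```python
from collections import deque

def focused_crawl(seed_urls, background_clf, fore_clf, fetch, parse,
                  max_pages=10_000, retrain_every=500):
    """Sketch of the two-classifier focused-crawl loop.

    fetch(url) -> (html, [(link_url, anchor_text), ...]) and
    parse(html) -> plain text are assumed helpers; background_clf scores
    whole pages (SVM in the thesis) and fore_clf scores anchor texts
    (Centroid model) and is retrained incrementally from stored text.
    """
    frontier = deque((1.0, url) for url in seed_urls)     # Frontier
    crawled = {}                                          # Crawled Frontier: url -> relevance
    stored_text = []                                      # kept to retrain the Fore Classifier

    while frontier and len(crawled) < max_pages:          # end condition
        _, url = frontier.popleft()
        if url in crawled:                                # IsURLVisited check
            continue
        html, links = fetch(url)                          # Fetcher downloads the page
        text = parse(html)                                # HTMLParser -> plain text
        crawled[url] = background_clf.relevance(text)     # relatedness of the page
        if crawled[url] > 0.5:                            # assumed storage threshold
            stored_text.append(text)

        for link_url, anchor in links:
            if link_url not in crawled:
                frontier.append((fore_clf.relevance(anchor), link_url))
        frontier = deque(sorted(frontier, reverse=True))  # best-first by anchor score

        if len(crawled) % retrain_every == 0 and stored_text:
            fore_clf.train(stored_text)                   # incremental retraining
            stored_text.clear()

    # Harvest rate: average relevance value of the downloaded pages
    return sum(crawled.values()) / len(crawled) if crawled else 0.0
```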
Keywords/Search Tags: Incremental