
Centroid-Based Focused Crawler Having Incremental Ability

Posted on: 2008-10-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Wang    Full Text: PDF
GTID: 1118360242960151    Subject: Computer application technology
Abstract/Summary:
Along with the rapid development of the World Wide Web, the Web pages processed by a focused crawler are becoming more and more complex. Many of today's Web pages cover multiple topics, and even when they do not, not every URL on a page is of interest to the user. How to crawl selectively within a Web page is therefore one of the problems researchers want to solve. Our method for tackling this problem works as follows:

First, use a back-end classifier to decide whether a downloaded Web page is relevant;

Second, extract anchor texts and their corresponding URLs from a relevant Web page and put them into a priority queue;

Third, score each anchor text with the front-end and back-end classifiers respectively, sum the two scores, select the URL with the highest anchor-text score from the frontier, and download its corresponding Web page.

The detailed procedures used in this work are:

(1) Use the Yahoo category hierarchy to train an SVM-light back-end classifier, and employ it to judge whether a Web page is relevant;

(2) Use the Perl module HTML::TokeParser to extract anchor texts and their corresponding URLs from a relevant Web page;

(3) Use a centroid vector derived from a root set as the front-end classifier. After an anchor text is transformed into its vector representation, score this vector with both the front-end and back-end classifiers.

First of all, we deal with the third sub-problem: how to obtain the centroid vector of a root set. Experiments on the Reuters-21578 and 20 Newsgroups datasets show that the traditional TFIDF model is not the best method for calculating the feature weight of a document in a root set. This motivates us to propose a new model, named TFIDF-2, for that purpose, together with three heuristic rules (Max, Ave and Sum) for calculating the feature weights of a centroid vector (see the illustrative sketch below). With these two methods, this paper can finally obtain a centroid vector corresponding to a root set. The centroid vector is then used to score anchor texts, which amounts to extracting relevant documents from an unlabeled dataset: the unlabeled dataset consists of anchor texts, and the relevant documents are the anchor texts that are relevant to the root set, i.e. of interest to users. On the Reuters-21578 and 20 Newsgroups datasets, this paper verifies the effectiveness of using a centroid vector to extract relevant anchor texts from an unlabeled dataset. In addition, when applying a centroid vector to text classification, this paper introduces a new method for calculating the similarity between a centroid vector and a test document. The new method takes into account not only the similarity between the test document and the centroid vector, but also the importance of each centroid feature that occurs in the test document.

After obtaining the front-end and back-end classifiers, this paper applies them to focused crawling. As the front-end classifier, the centroid vector provides the focused crawler with an immediate payoff, whereas the back-end classifier gives it a delayed one. As a result, this two-classifier framework endows the focused crawler with a degree of tunneling ability: it can start from some relevant Web pages, pass through irrelevant Web pages, and finally reach other relevant Web pages.
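The sketch below illustrates one plausible reading of the centroid construction (the Max/Ave/Sum heuristic rules) and of a similarity measure that also weights centroid features found in the test document. The exact TFIDF-2 weights are assumed to be supplied per document, and the function names (build_centroid, centroid_similarity) and formulas are illustrative assumptions, not the dissertation's own definitions.

```python
# Illustrative sketch only: per-document feature weights (e.g. TFIDF-2) are
# assumed to be given; the rule definitions and the similarity measure are a
# plausible reading of the abstract, not the dissertation's exact formulas.
from collections import defaultdict
import math

def build_centroid(doc_vectors, rule="Ave"):
    """Combine per-document feature weights from the root set into a centroid.

    doc_vectors: list of {feature: weight} dicts.
    rule: 'Max', 'Ave' or 'Sum' -- the three heuristic rules named in the paper.
    """
    merged = defaultdict(list)
    for vec in doc_vectors:
        for feature, weight in vec.items():
            merged[feature].append(weight)

    if rule == "Max":
        centroid = {f: max(ws) for f, ws in merged.items()}
    elif rule == "Sum":
        centroid = {f: sum(ws) for f, ws in merged.items()}
    else:  # "Ave"
        centroid = {f: sum(ws) / len(doc_vectors) for f, ws in merged.items()}

    # L2-normalise so later dot products behave like cosine similarity.
    norm = math.sqrt(sum(w * w for w in centroid.values())) or 1.0
    return {f: w / norm for f, w in centroid.items()}

def centroid_similarity(centroid, doc_vector):
    """Cosine-style similarity boosted by the importance of the centroid
    features that actually occur in the test document (an assumed reading
    of the paper's 'new method')."""
    shared = set(centroid) & set(doc_vector)
    if not shared:
        return 0.0
    cosine = sum(centroid[f] * doc_vector[f] for f in shared)
    # Fraction of centroid mass covered by the test document.
    coverage = sum(centroid[f] for f in shared) / sum(centroid.values())
    return cosine * coverage
```

In this reading, the coverage factor plays the role of the "importance of a centroid feature which occurs in the test document"; the dissertation may weight this differently.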
Guided by the front-end and back-end classifiers, a focused crawler can use anchor texts alone to predict the relevance of their corresponding URLs accurately. This strategy therefore gives a focused crawler not only high performance but also fine-grained crawling.

On-line incremental crawling is one of the problems that must be settled in the domain of focused crawling. Generally, it is neither necessary nor possible for the initial seed URLs or Web page sample data to cover all aspects of a topic. Furthermore, some topics change over time. All of these changes can be fed into a focused crawler, automatically or manually, in order to optimize and update its crawling strategy; as a result, the adaptability of the focused crawler is improved.

The contributions and novelties of this paper are listed below:

1. Because the traditional TFIDF model is unsuitable for calculating the feature weight of a document in a root set (also called a positive set), a new model named TFIDF-2 is proposed in this paper.

2. Based on the TFIDF and TFIDF-2 models, three heuristic rules (Max, Ave and Sum) are put forward for obtaining a centroid vector corresponding to a root set.

3. A new method is proposed to calculate the similarity between a centroid vector and a test document. This technique takes into consideration not only the similarity between them but also the importance of each centroid feature that occurs in the test document.

4. Using the feature-weight calculation methods proposed in this paper, a centroid vector can be obtained from a root set and used as a front-end classifier to guide a focused crawler. As a front-end classifier, a centroid vector has many advantages: it is easy to obtain and convenient to update.

5. A two-classifier framework is proposed. This framework enables a focused crawler to tunnel through irrelevant Web pages before reaching other relevant Web pages. In this setting, the path traversed by a focused crawler is made up of clusters connected by many irrelevant Web pages; each cluster, which can to some degree be called a community, consists of many relevant Web pages.

6. A crawling strategy with incremental ability is proposed. Through on-line learning, the centroid vector lets a focused crawler adjust and optimize its crawling strategy quickly, and consequently its adaptability is improved.

7. The big difference between our crawling strategy and others' is that our method stores URLs together with their corresponding anchor texts in the priority queue, whereas other researchers' frontiers store URLs with their corresponding priorities (scores). Such a URL priority is a score that stays constant throughout the URL's lifetime, which is clearly unreasonable: as a focused crawler proceeds, its crawling strategy accumulates more and more experience and can therefore predict a URL's relevance more accurately than before.

In summary, a centroid-based crawling strategy is proposed in this paper. Guided by a front-end and a back-end classifier, this strategy can achieve high-performance incremental crawling with the help of anchor texts. Additionally, URLs are stored in the priority queue together with their corresponding anchor texts; in fact, the final goal of this strategy is to postpone predicting the relevance of a URL. This strategy has many merits. On one hand, as time goes on, the frontier accumulates more and more candidate URLs, so the focused crawler can pick the most representative URLs from it.
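As a rough illustration of contribution 7 and the summary above, the sketch below shows a frontier that stores raw anchor texts rather than frozen priority scores, so relevance is re-estimated at selection time with the current classifiers. The class and attribute names (Frontier, front_end, back_end) are invented for illustration and are not the dissertation's code.

```python
# Illustrative sketch, assuming front_end and back_end are callables that map
# an anchor text to a relevance score (e.g. a centroid scorer and an SVM).
class Frontier:
    def __init__(self, front_end, back_end):
        self.front_end = front_end      # immediate payoff (centroid-based)
        self.back_end = back_end        # delayed payoff (e.g. SVM-based)
        self.entries = {}               # url -> anchor text

    def add(self, url, anchor_text):
        # Store the anchor text itself; no score is frozen at insertion time.
        self.entries.setdefault(url, anchor_text)

    def score_anchor(self, anchor_text):
        # Combined score: sum of front-end and back-end scores, as in the abstract.
        return self.front_end(anchor_text) + self.back_end(anchor_text)

    def pop_best(self):
        """Re-score every stored anchor text with the current classifiers and
        return the most promising URL; because nothing is cached, on-line
        updates to the centroid take effect immediately."""
        if not self.entries:
            return None
        best_url = max(self.entries,
                       key=lambda u: self.score_anchor(self.entries[u]))
        anchor = self.entries.pop(best_url)
        return best_url, anchor
```

The trade-off in this reading is explicit: re-scoring on every selection costs more than popping from a heap of fixed priorities, but it is what allows the deferred, experience-aware relevance prediction the abstract describes.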
On the other hand, the more Web pages the crawling strategy downloads, the more experience it accumulates; as a result, it can predict the relevance of a URL more accurately than before.

Although some advances have been achieved in the domain of focused crawling, much work remains to be done because of the limited research time and the author's abilities. To be more specific, the following points need improvement and are worth further research:

Select more categories from DMOZ and conduct experiments to verify the robustness and validity of the proposed technique.

This paper only compares the proposed technique with the Best-First crawling strategy. As important future work, more experiments will be done to compare our method with Context Graphs and with the Critic-Apprentice framework proposed by Chakrabarti.

Carry out a theoretical analysis of the centroid vector when it is used for text classification and for extracting relevant documents from an unlabeled dataset.

Combine a front-end classifier consisting of a centroid vector and a decision tree with the SVM-light back-end classifier, and apply this three-classifier framework to the deep Web domain.
Keywords/Search Tags: Centroid-Based