
Research on Key Techniques of Focused Crawlers

Posted on: 2009-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: J J Li
Full Text: PDF
GTID: 2178360242481599
Subject: Software engineering
Abstract/Summary:
With the development of the Web, the vast amount of information published online has deeply affected our lives: almost anything we need can be found with a few clicks of the mouse. At the same time, ours is an age of information explosion, and the sheer volume of information makes it hard to find what we need quickly. It is impossible to crawl the entire Web, and traditional search engines cannot satisfy users' demand for topic-focused search. The focused crawler emerged to address this problem, and it plays an important role in the development and study of search engines.

The first spider program in the world, the World Wide Web Wanderer, was developed by Matthew Gray at MIT to track the growth of the Web. At first it was used to count the number of servers on the Web; later it could also fetch URLs. A search engine is composed of three parts: a crawler, an index builder, and a query processor. According to how they collect information and provide service, search engines can be divided into three kinds: directory-based, robot-based, and meta-search engines.

A focused crawler filters out non-relevant links according to some page-analysis algorithm and saves the relevant links into the queue of URLs not yet crawled. It then chooses the next URL to crawl from that queue according to a definite search strategy, and repeats this process until some stop condition of the system is reached; a minimal sketch of this loop is given below.

A basic focused crawler program is composed of page fetching, page parsing, page classification, priority calculation for the queue of URLs to be crawled, and database management. A focused crawler is a web-crawling system that searches for topic-related resources. Its search range is the whole Internet, but it does not try to exhaust the Web as traditional systems do. Before crawling, the crawler builds a description of the topic it wants to search for; it then guides its own crawl direction using both text information and web-structure information, so as to gather the largest quantity of relevant information with limited resources. The goal of a focused crawler is to predict the probability that a page is relevant to the topic.

A focused crawler can collect web information relevant to a given topic and thereby improves collection and indexing ability. As an important component of both general-purpose and topic-specific search engines, a focused crawler can crawl new page information regularly to refresh the search engine's topic-specific database and provide fresh data.

A focused crawler must solve the following problems: how to describe and define the topic of interest, how to decide the visiting order of URLs, and how to estimate the relevance between the topic and a page.

System structure: the basic program is organized into the modules listed above, namely page fetching, page parsing, page classification, priority calculation for the crawl queue, and database management.

On this basis, we classify focused crawling strategies and expound content-based crawling methods such as best-first search, Fish-search, and Shark-search, as well as crawling methods based on link-authority estimation such as social network analysis, HITS, and PageRank; a compact PageRank sketch is also given below.

We design an integrated crawling strategy based on improved versions of the Shark-search and PageRank algorithms. Its basic idea is to compute the relevance score of each child node using the Shark-search algorithm, as sketched below.
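The generic crawl loop described above can be made concrete as follows. This is a minimal sketch, not the thesis's implementation: requests and BeautifulSoup stand in for the fetcher and parser, the relevance callback and the STOP_AFTER and THRESHOLD constants are illustrative, and URL normalization and politeness rules are omitted.

```python
import heapq
import requests
from bs4 import BeautifulSoup

STOP_AFTER = 500   # stop condition: total pages to fetch (illustrative)
THRESHOLD = 0.2    # minimum relevance for a link to be kept (illustrative)

def crawl(seed_urls, relevance):
    """relevance(page, link_tag) -> float is supplied by the caller."""
    frontier = [(-1.0, url) for url in seed_urls]   # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < STOP_AFTER:
        _, url = heapq.heappop(frontier)            # best-scored URL first
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        fetched += 1
        page = BeautifulSoup(html, "html.parser")
        for tag in page.find_all("a", href=True):   # extract candidate links
            link = tag["href"]                      # URL normalization omitted
            score = relevance(page, tag)            # page/anchor analysis
            if score >= THRESHOLD and link not in seen:  # filter out non-relevant links
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
```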
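For the link-authority side, a compact power-iteration PageRank can serve as the authority estimate. This sketch assumes the link graph fits in memory as a dict from each page to its outlinks; the damping factor and iteration count are the usual textbook defaults rather than values from the thesis.

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph: dict mapping each page to the list of pages it links to."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iters):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    if target in new_rank:          # ignore edges leaving the graph
                        new_rank[target] += share
            else:                                   # dangling page: spread evenly
                for other in new_rank:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For example, pagerank({"a": ["b"], "b": ["a", "c"], "c": []}) returns the highest score for "a", since "b" concentrates its vote on it.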
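On the content side, the child-node scoring that the integrated strategy starts from follows the published Shark-search heuristic: a child URL inherits a decayed share of its parent's relevance and adds a neighborhood score built from the anchor text and its surrounding context, with similarity computed by the VSM cosine measure. The constants DECAY, GAMMA, and BETA below are illustrative defaults, not the thesis's tuned parameter values.

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    """VSM similarity: cosine of the term-frequency vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

DECAY, GAMMA, BETA = 0.8, 0.7, 0.5   # illustrative constants

def child_score(topic, parent_sim, parent_inherited, anchor, context):
    # Inherited component: decayed parent relevance, or the parent's own
    # inherited score when the parent page was judged irrelevant.
    inherited = DECAY * (parent_sim if parent_sim > 0 else parent_inherited)
    # Neighborhood component: anchor-text similarity, backed up by the text
    # surrounding the link when the anchor itself matches nothing.
    anchor_sim = cosine_sim(topic, anchor)
    context_sim = 1.0 if anchor_sim > 0 else cosine_sim(topic, context)
    neighborhood = BETA * anchor_sim + (1 - BETA) * context_sim
    # Potential score of the child URL, used as its crawl-priority input.
    return GAMMA * inherited + (1 - GAMMA) * neighborhood
```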
The crawl priority of a URL is finally decided by the combined value of its relevance score and its authority value.

The single queue of URLs to be crawled is cut into two queues, called hot_queue and url_queue. Three kinds of parameters are added: page-content relevance parameters a and b, node relevance-score parameters c and d, and anchor-text parameter e. The improved Shark-search heuristic algorithm still uses the Vector Space Model (VSM) to calculate the similarity score between pages and the RDV.

Because the improved Shark-search algorithm still does not consider the effect of link structure on the topic, we propose an integrated crawling strategy built from the improved Shark-search and PageRank algorithms: a page is first judged for topical relevance by its content, and the range of resource downloading is then extended to discover important resources that are likely to be similar. A sketch of the resulting two-queue frontier is given below.

According to the experimental results, the integrated crawling strategy avoids topic drift and maintains the relevance scores of the crawled pages. At the same time, the crawling strategy is still being refined, because of the limitations of the RDV and of the way the parameters are set.

Improving the accuracy of link-value prediction is the focus of further study. Applying the concept-indexing theory of modern information retrieval to link-value calculation is a new direction. Crawling is inherently repetitive, and how to combine the Web's dynamic patterns of change with statistical results is an important problem for improving the accuracy of value estimation. Crawlers usually use a fixed search strategy and lack adaptability, so how to improve the self-adaptability of focused crawlers is still under research.
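The two-queue frontier and the combined priority could look as follows. This is a hedged sketch: the weight W and the hot_queue admission threshold HOT are illustrative, the relevance and authority inputs stand for the improved Shark-search score and the PageRank value, and the exact formulas in which the thesis's parameters a through e enter are not reproduced here.

```python
import heapq

W, HOT = 0.7, 0.6    # relevance weight and hot_queue threshold (illustrative)

class Frontier:
    def __init__(self):
        self.hot_queue = []   # highly relevant URLs, crawled first
        self.url_queue = []   # remaining candidate URLs

    def push(self, url, relevance, authority):
        # Crawl priority: weighted combination of the Shark-search relevance
        # score and the PageRank-style authority value.
        priority = W * relevance + (1 - W) * authority
        queue = self.hot_queue if relevance >= HOT else self.url_queue
        heapq.heappush(queue, (-priority, url))     # negate for max-heap

    def pop(self):
        # Drain hot_queue before falling back to url_queue.
        queue = self.hot_queue or self.url_queue
        return heapq.heappop(queue)[1] if queue else None
```

Draining hot_queue first keeps the crawler on topic, while url_queue preserves candidate links whose authority value may justify a later visit.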
Keywords/Search Tags: Technique