
Research And Design On Focused Crawler Of Search Engine

Posted on: 2011-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: S He
Full Text: PDF
GTID: 2178360305972974
Subject: Computer application technology

Abstract/Summary:
With the rapid development of the Internet, the gap between the growth of Web information and people's ability to retrieve it keeps widening, and closing it requires support from search engine technology. Because Internet resources grow exponentially, collecting information from the network faces challenges in index size, update rate, individual needs, and many other aspects. Traditional search engines can no longer meet people's growing demand for personalized information retrieval services. Building focused search engines for specific domains has therefore become a new trend, known as the fourth generation of search engines, and research on the focused crawler, which plays a key role in a focused search engine, has become one of the most active directions in Web data mining.

This thesis studies the focused crawler by analyzing its relevance algorithms and designing the download logic on top of Heritrix. A focused crawler is a special crawler whose main objective is to fetch as many relevant pages, and as few irrelevant pages, as possible within a limited time. The main research work includes:

① Studied the architecture and related theories of the focused crawler, analyzed the relevant technologies and key algorithms, and then designed and implemented SAS-Crawler, a crawler based on a simulated annealing search strategy.

② For computing the topic relevance of a page, the page structure is analyzed and different weights are assigned to terms that appear under different tags and in different positions, which makes the relevance calculation more accurate (a sketch of such a weighted VSM computation follows this list).

③ For predicting the relevance of a link URL, several heuristic cues are considered, such as the anchor text, the link context, the parent page, and the number of links pointing to the URL. Combining the context and the link structure avoids "topic drifting" while enlarging the search space (see the second sketch below).

④ For link selection, a strategy based on simulated annealing is adopted. Many relevant pages sit behind "tunnel" pages: they are linked only from irrelevant pages and are therefore hard to reach, even though the pages that an irrelevant page links to may well be on topic. The simulated annealing strategy helps the crawler escape local optima and thus download more relevant pages (see the third sketch below).

⑤ Studied the open-source crawler Heritrix and improved it by adding a topic construction module, a page-topic relevance computation module, and a link evaluation module, and by integrating the simulated annealing search strategy. On this basis SAS-Crawler was proposed and evaluated experimentally.
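The abstract does not give the exact tag weights or the relevance formula, so the following is a minimal Java sketch of a tag-position weighted VSM relevance computation: term counts are weighted by the tag they appear under and compared to a topic vector by cosine similarity. The tag names, weight values, and class/method names are illustrative assumptions, not the thesis's actual design.

import java.util.HashMap;
import java.util.Map;

/** Minimal sketch: weighted VSM relevance of a page to a topic vector.
 *  Tag weights and tokenization are illustrative assumptions. */
public class PageRelevance {

    // Assumed weights: terms in prominent positions count more than plain body text.
    private static final Map<String, Double> TAG_WEIGHTS = Map.of(
            "title", 4.0, "h1", 3.0, "meta", 2.0, "anchor", 1.5, "body", 1.0);

    /** Build a weighted term-frequency vector from (tag, text) fragments of a page. */
    public static Map<String, Double> pageVector(Map<String, String> taggedText) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, String> e : taggedText.entrySet()) {
            double w = TAG_WEIGHTS.getOrDefault(e.getKey(), 1.0);
            for (String term : e.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) vec.merge(term, w, Double::sum);
            }
        }
        return vec;
    }

    /** Cosine similarity between the weighted page vector and the topic vector. */
    public static double cosine(Map<String, Double> page, Map<String, Double> topic) {
        double dot = 0, np = 0, nt = 0;
        for (Map.Entry<String, Double> e : page.entrySet()) {
            np += e.getValue() * e.getValue();
            Double t = topic.get(e.getKey());
            if (t != null) dot += e.getValue() * t;
        }
        for (double v : topic.values()) nt += v * v;
        return (np == 0 || nt == 0) ? 0 : dot / (Math.sqrt(np) * Math.sqrt(nt));
    }
}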
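The abstract lists the heuristic cues for an unvisited URL (anchor text, link context, parent page, in-link count) but not how they are combined; the sketch below simply takes a weighted sum, and both the weights and the log-normalization of the in-link count are assumptions made for illustration.

/** Minimal sketch: predicted relevance of an unvisited URL from heuristic cues.
 *  The weights and the log-normalization of in-links are illustrative assumptions. */
public class LinkScore {

    public static double score(double anchorTextSim,   // similarity of anchor text to the topic
                               double linkContextSim,  // similarity of surrounding text to the topic
                               double parentPageSim,   // relevance of the page containing the link
                               int inLinkCount) {      // number of known links pointing to the URL
        double popularity = Math.log(1 + inLinkCount) / Math.log(1 + 1000); // crude normalization
        return 0.4 * anchorTextSim
             + 0.3 * linkContextSim
             + 0.2 * parentPageSim
             + 0.1 * Math.min(1.0, popularity);
    }
}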
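The abstract states only that link selection is based on simulated annealing so that links found on irrelevant "tunnel" pages can still be followed. The sketch below uses the standard acceptance rule: a randomly drawn candidate link is accepted over the greedy best with probability exp(-Δ/T), and the temperature is cooled after each selection. The initial temperature and cooling rate are assumed values.

import java.util.List;
import java.util.Random;

/** Minimal sketch: simulated-annealing link selection from a frontier of scored links.
 *  The initial temperature and cooling rate are illustrative assumptions. */
public class AnnealingSelector {
    private double temperature = 1.0;            // assumed initial temperature
    private static final double COOLING = 0.95;  // assumed cooling rate per selection
    private final Random rnd = new Random();

    /** Return the index of the next link to crawl, given predicted relevance scores. */
    public int select(List<Double> scores) {
        // Greedy choice: the highest-scoring link in the frontier.
        int best = 0;
        for (int i = 1; i < scores.size(); i++) {
            if (scores.get(i) > scores.get(best)) best = i;
        }
        // Random candidate, possibly a low-scoring "tunnel" link.
        int candidate = rnd.nextInt(scores.size());
        double delta = scores.get(best) - scores.get(candidate);
        int chosen = (rnd.nextDouble() < Math.exp(-delta / temperature)) ? candidate : best;
        temperature *= COOLING; // cool down: selection becomes greedier over time
        return chosen;
    }
}

Early in the crawl the high temperature lets low-scoring tunnel links through; as the temperature decays, selection converges toward the greedy, highest-scoring links.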
Keywords/Search Tags:search engine, focused crawler, web hyperlink analysis, VSM, simulated annealing