
Research On The Crawler Of Search Engine

Posted on: 2011-08-23  Degree: Master  Type: Thesis
Country: China  Candidate: Y Gong  Full Text: PDF
GTID: 2178360305983029  Subject: Computer application technology
Abstract/Summary:
As an application of information retrieval technology in the Internet era, the search engine lets people obtain network resources more effectively. But as the Internet has grown, the traditional search engine, namely the general search engine, can no longer satisfy people's increasing demand for information retrieval services. This thesis studies and discusses techniques related to the focused crawler, which holds an important position in the focused search engine.

A web crawler is used to download web pages from the Internet. Starting from some seed links, a general web crawler attempts to traverse all the web pages on the Internet. A focused crawler aims to retrieve more pages related to a given topic: apart from the fundamental functions of a general web crawler, it should be able to analyze the links and content of web pages to guide and predict its crawling path. The crawling strategy the crawler uses to visit the Internet has a significant impact on the focused crawler's efficiency. This thesis studies and improves the focused crawling algorithm based on the Context Graph. The main research work is as follows:

(1) The thesis studies the technical principles and workflow of general and focused crawlers and carefully analyzes the focused crawler's crawling strategies. It introduces the strategies based on link analysis and on content analysis that focused crawlers commonly use, and analyzes their strengths and weaknesses.
(2) To resolve the problem that traditional focused crawling algorithms cannot deal with "the tunnel", the thesis introduces in detail a crawling algorithm based on the Context Graph. By predicting the level of web pages in the context graph, the algorithm advances along the most promising path toward target documents, finding them more quickly while crawling few irrelevant pages, and thereby resolves "the tunnel".
(3) To improve the quality of feature selection and feature evaluation in the Context Graph crawling algorithm, the thesis uses a feature selection method based on word frequency difference and a modified TF-IDF formula that incorporates the word's category weight.
(4) A demo system, Focused crawler, is presented. The experimental results show that the improved algorithm proposed in the thesis improves both the feature selection quality and the focused crawler's performance.
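The layer-based crawl ordering described in item (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the class name `LayeredFrontier`, its methods, and the example URLs are hypothetical, and it assumes a separate classifier has already assigned each page a predicted layer (0 for a target document, k for a page roughly k links away from one) and a confidence score. The frontier always serves the non-empty layer closest to the targets first, which is how a page in an off-topic "tunnel" can still be scheduled.

```python
import heapq
from itertools import count


class LayeredFrontier:
    """Crawl frontier with one priority queue per context-graph layer.

    Layer 0 holds pages predicted to be target documents; layer k holds
    pages predicted to be about k links away from a target. The last
    queue is reserved for pages that fit no layer confidently.
    (Hypothetical helper for illustration only.)
    """

    def __init__(self, num_layers):
        # Layers 0..num_layers-1 plus one extra "other" queue at index num_layers.
        self._queues = [[] for _ in range(num_layers + 1)]
        self._tie = count()  # tie-breaker so equal scores pop in insertion order

    def push(self, url, layer, confidence):
        # heapq is a min-heap, so store the negated confidence
        # to pop the most confident page in a layer first.
        heapq.heappush(self._queues[layer], (-confidence, next(self._tie), url))

    def pop(self):
        # Serve the lowest-numbered non-empty layer, i.e. the pages
        # predicted to be closest to target documents.
        for layer, queue in enumerate(self._queues):
            if queue:
                neg_conf, _, url = heapq.heappop(queue)
                return url, layer, -neg_conf
        return None  # frontier exhausted


# Example: a confident layer-2 page still waits behind layer-1 pages.
frontier = LayeredFrontier(num_layers=3)
frontier.push("http://example.com/far", layer=2, confidence=0.9)
frontier.push("http://example.com/near", layer=1, confidence=0.6)
frontier.push("http://example.com/nearer", layer=1, confidence=0.8)
print(frontier.pop())
```

In this sketch the layer number takes strict precedence over the confidence score; a real crawler might instead blend the two, which is a design choice the abstract does not specify.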
Keywords/Search Tags: Search engine, Focused crawler, Context Graph, Feature selection