Research On The Crawler Of Search Engine

Posted on:2011-08-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Gong

Full Text:PDF

GTID:2178360305983029

Subject:Computer application technology

Abstract/Summary:

Search engines, as the main application of information retrieval technology on the Internet, help people obtain network resources more effectively. As the Internet has grown, however, traditional general-purpose search engines can no longer satisfy the increasing demand for information retrieval services. This thesis studies the techniques behind the focused crawler, which plays an important role in a focused search engine.

A web crawler downloads web pages from the Internet. Starting from a set of seed links, a general web crawler tries to traverse all reachable web pages. A focused crawler instead aims to collect more pages related to a given topic: in addition to the basic functions of a general crawler, it must analyze the links and content of web pages to guide and predict its crawling path. The crawling strategy therefore has a significant impact on the focused crawler's efficiency. This thesis studies and improves the focused crawling algorithm based on the Context Graph. The main work is as follows:

(1) The technical principles and workflow of general and focused crawlers are studied, and the crawling strategies of focused crawlers are analyzed in detail. The thesis introduces the two kinds of strategy most commonly used by focused crawlers, those based on link analysis and those based on content analysis, and analyzes the strengths and weaknesses of each.

(2) To address the problem that traditional focused crawling algorithms cannot deal with "the tunnel" (relevant pages that can only be reached through irrelevant ones), the thesis describes in detail a crawling algorithm based on the Context Graph. By predicting the layer of each web page in the context graph, the algorithm advances along the most promising path toward target documents, so that it finds them more quickly while crawling few irrelevant pages.

(3) To improve the quality of feature selection and feature weighting in the Context Graph crawling algorithm, the thesis uses a feature selection method based on word-frequency difference and a modified TF-IDF formula that incorporates each word's category weight.

(4) A demonstration focused crawler system is implemented. The experimental results show that the improved algorithm raises both the quality of feature selection and the performance of the focused crawler.
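The layer-guided crawl order described in (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function `focused_crawl`, the `predict_layer` callback, and the toy link graph are all hypothetical, and the real layer predictor would be a classifier trained on the context graph rather than a lookup table. The sketch assumes the commonly described scheme in which links found on a page inherit that page's predicted layer as their priority, so links from pages judged closer to a target are followed first.

```python
import heapq
from typing import Callable, Dict, List


def focused_crawl(
    seeds: List[str],
    links: Dict[str, List[str]],
    predict_layer: Callable[[str], int],
    max_pages: int,
) -> List[str]:
    """Visit pages so that links found on low-layer pages are followed first.

    `links` stands in for downloading a page and extracting its out-links.
    `predict_layer` stands in for the context-graph classifier (hypothetical):
    0 means a target page, larger values mean further from a target.
    """
    frontier = []  # min-heap of (priority, insertion order, url)
    seen = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            # Assumption: seeds get the best priority so they are fetched first.
            heapq.heappush(frontier, (0, counter, url))
            counter += 1

    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        layer = predict_layer(url)  # classify the page just fetched
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                # Out-links inherit the layer of the page they were found on.
                heapq.heappush(frontier, (layer, counter, out))
                counter += 1
    return visited


# Toy data (hypothetical): "b" is not a target itself but leads to one.
LINKS = {"seed": ["a", "b"], "a": ["a1"], "b": ["target"], "a1": [], "target": []}
LAYERS = {"seed": 2, "a": 3, "b": 1, "a1": 3, "target": 0}

order = focused_crawl(["seed"], LINKS, lambda u: LAYERS[u], max_pages=4)
print(order)
```

In this toy graph the link to `target` is followed before the link to `a1`, even though `a1` was discovered earlier, because it was found on a page predicted to be one layer away from a target. That is the sense in which the crawl passes through a non-target page to reach a target.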

Keywords/Search Tags:

Search engine, Focused crawler, Context Graph, Feature selection

Related items

1	The Focused Web Crawling Strategy Based On Incremental Learning
2	The Research On Focused Crawling Algorithm In Vertical Search Engine
3	Research And Design On Focused Crawler Of Search Engine
4	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
5	Research And Implementation On Focused Crawler With New Strategy For The Vertical Search Engine
6	Research And Implementation Of A Time-based Focused Search Engine
7	Research On An Algorithm Of Focused Crawler In Vertical Search Engine
8	Customizable Focused Crawler
9	Research And Implement Of Focused-crawler Relevance Algorithm In Search Engine
10	Research And Implementation On Focused Crawler With Search Strategy