Focused Crawling

Posted on:2009-06-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Xue

Full Text:PDF

GTID:2208360245461401

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the web, traditional general-purpose search engines have exposed a series of problems, such as low coverage, heavy resource consumption, long update cycles, and low relevance. To tackle these problems and to satisfy users' subject-oriented queries, topic-focused crawling technology came into being. Focused crawling builds on general-purpose search engine technology and applies machine learning and other intelligent methods to download more relevant pages at low cost. With its high degree of specialization and clear objectives, topic-focused crawling has held a place in the development of search engines since the 1990s. Currently, research on topic-focused crawling concentrates on two hotspots: document categorization and crawling strategy.

Document categorization technology is studied in this thesis. Topic-focused crawlers usually classify documents with TF-IDF or SVM. However, the TF-IDF algorithm considers only term frequency (TF) and document frequency (DF). It neglects the term's positional information, even though terms in different positions differ in their importance for classifying a document. To solve this problem, "An algorithm of term weighting based on information of term position" is proposed in this thesis: it assigns different weighting factors to terms in different positions, so that a term's weight reflects its importance more objectively. This algorithm improves the accuracy of document categorization, and the weighting factors can be adjusted to obtain good results.

To address the disadvantages of Best-First search, and drawing on the information people use when they judge whether a hyperlink is useful to them, "A crawling strategy based on comprehensive information of URL" is proposed in this thesis. It calculates the predicted relevance of a URL by analyzing the similarity of the current Web page, the URL's directory information, and the anchor text of the hyperlink, and then puts the URL into one of several priority crawling queues according to that predicted relevance. A URL with a low relevance value is not discarded but is placed in a waiting queue. When the other crawling queues are empty, the system crawls these URLs to find new on-topic Web pages. This crawling strategy is simple, yet it improves efficiency and recall.

Finally, based on "An algorithm of term weighting based on information of term position" and "A crawling strategy based on comprehensive information of URL", the design and implementation of the topic-focused crawler, including the system structure and methods, are detailed in this thesis.
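
The position-based term weighting described in the abstract could be sketched roughly as follows. The abstract does not give the actual formula, so everything here is an assumption for illustration: the position names (`title`, `heading`, `body`), the factor values, the whitespace tokenizer, and the choice to scale each occurrence by its position factor before multiplying by a plain `log(N/df)` inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical position factors; the abstract only says they are adjustable.
POSITION_FACTORS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_tf(doc, factors=POSITION_FACTORS):
    """Count each term occurrence, scaled by the factor of the field it is in."""
    tf = Counter()
    for position, text in doc.items():
        factor = factors.get(position, 1.0)  # unknown positions count as body text
        for term in text.lower().split():
            tf[term] += factor
    return tf


def position_tfidf(docs, factors=POSITION_FACTORS):
    """Return one {term: weight} dict per document."""
    tfs = [weighted_tf(doc, factors) for doc in docs]
    # Document frequency: number of documents that contain the term at all.
    df = Counter(term for tf in tfs for term in tf)
    n = len(docs)
    return [{t: w * math.log(n / df[t]) for t, w in tf.items()} for tf in tfs]


# Toy documents, invented for this example.
docs = [
    {"title": "focused crawler", "body": "a crawler downloads pages"},
    {"title": "cooking tips", "body": "a focused cook saves time"},
]
weights = position_tfidf(docs)
```

In this toy run, "crawler" appears once in the title and once in the body of the first document, so it ends up weighted more heavily than "pages", which appears only in the body. Changing the values in `POSITION_FACTORS` is the adjustable part.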
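
The URL-based crawling strategy could be sketched along the same lines. This is a minimal sketch under stated assumptions, not the thesis's implementation: the weighted-sum combination, the weights, the two thresholds, the token-overlap similarity, the topic vocabulary, and the names `predict_relevance` and `Frontier` are all hypothetical. Only the overall shape comes from the abstract: score a URL from the current page, its directory information and its anchor text, route it to a priority queue, and keep low-scoring URLs in a waiting queue that is served when the other queues are empty.

```python
import re
from collections import deque
from urllib.parse import urlparse

TOPIC_TERMS = {"crawler", "search", "index"}  # hypothetical topic vocabulary
WEIGHTS = (0.4, 0.2, 0.4)  # hypothetical weights: page, directory, anchor
HIGH, LOW = 0.5, 0.2       # hypothetical queue thresholds


def overlap(text):
    """Fraction of tokens in text that are topic terms (0.0 for empty text)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(t in TOPIC_TERMS for t in tokens) / len(tokens) if tokens else 0.0


def predict_relevance(page_score, url, anchor_text):
    """Combine current-page similarity, URL directory terms and anchor text."""
    directory = urlparse(url).path.rsplit("/", 1)[0]  # path without the file name
    w_page, w_dir, w_anchor = WEIGHTS
    return (w_page * page_score
            + w_dir * overlap(directory)
            + w_anchor * overlap(anchor_text))


class Frontier:
    """Two priority queues plus a waiting queue for low-scoring URLs."""

    def __init__(self):
        self.high, self.normal, self.waiting = deque(), deque(), deque()

    def add(self, url, score):
        if score >= HIGH:
            self.high.append(url)
        elif score >= LOW:
            self.normal.append(url)
        else:
            self.waiting.append(url)  # kept for later, not discarded

    def next_url(self):
        # The waiting queue is only served when the other queues are empty.
        for queue in (self.high, self.normal, self.waiting):
            if queue:
                return queue.popleft()
        return None


frontier = Frontier()
links = [  # (similarity of the current page, URL, anchor text), invented data
    (0.1, "http://example.com/recipes/cake.html", "chocolate cake"),
    (0.9, "http://example.com/search/crawler/intro.html", "crawler index"),
    (0.5, "http://example.com/news/today.html", "search news"),
]
for page_score, url, anchor in links:
    frontier.add(url, predict_relevance(page_score, url, anchor))
order = [frontier.next_url() for _ in links]
```

With these toy inputs the crawler-related link is fetched first, the news link second, and the recipe link last, since it sits in the waiting queue until the other two queues are drained.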

Keywords/Search Tags:

topic focused crawler, term position, directory layer of URL, anchor text

Related items

1	A Focused Crawler Based On Statistical Machine Translation And Topic Propagation
2	Design And Implementation Of Focused Crawler For Blogs
3	Based On The Theme Of The Html Tags Crawler Design And Realization
4	Research On Topic Focused Web Crawler And Related Technologies
5	The Design And Implementation Of The Topic-focused Web Crawler System
6	Design And Implementation Of Focused Crawler To Application Store
7	Research And Realization On Focused Crawler Key Technologies Of Vertical Search Engine
8	Customizable Focused Crawler
9	The Research And Implement Of Topic-focused Web Crawler Based On SVM Classification Algorithm
10	Research On A Method Of Focused Crawler For Vertical Search System