
The Research On The Focused Crawling Technology Based On The Concept Tree

Posted on: 2006-03-17  Degree: Master  Type: Thesis
Country: China  Candidate: Y C Ceng  Full Text: PDF
GTID: 2168360155962145  Subject: Software engineering
Abstract/Summary:
Web robots crawl the Web in one of two ways: exhaustive crawls and focused crawls. Exhaustive crawls, which attempt to fetch every Web page, consume vast storage and bandwidth resources, and they make it very difficult to find Web documents that serve a particular purpose. Focused crawls retrieve only the important subset of the WWW that is semantically relevant to a specific topic, which reduces network traffic and download volume, so developing focused crawling technology is very important. However, current techniques have defects: the starting URLs must already lead to the target topic set, and the crawler cannot reach another relevant topic region that is not link-adjacent to the current one when no relevant documents connect the two. Having studied the principles of crawling Web pages and the key technologies of Web robots, this paper presents a focused crawling method based on the concept tree, named FCMCT for short, to improve the harvest rate of the crawl; FCMCT endows the URLs awaiting crawling with layer semantic information by using the semantic model of the concept tree.
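The abstract does not give the tree structure itself, but the idea of extracting a knowledge path from a concept tree for a target topic can be sketched as follows. The tree, node names, and function name here are illustrative assumptions, not taken from the thesis.

```python
# Sketch: extracting a knowledge path from a concept tree.
# The tree contents and names below are hypothetical examples.

def knowledge_path(tree, target, path=None):
    """Return the list of concepts from the root down to `target`,
    or None if the target concept is not in the tree."""
    path = (path or []) + [tree["name"]]
    if tree["name"] == target:
        return path
    for child in tree.get("children", []):
        found = knowledge_path(child, target, path)
        if found:
            return found
    return None

# A tiny, made-up concept tree for the software-engineering domain.
tree = {
    "name": "computer science",
    "children": [
        {"name": "software engineering",
         "children": [{"name": "web crawling", "children": []}]},
        {"name": "networks", "children": []},
    ],
}

print(knowledge_path(tree, "web crawling"))
# Each concept on the returned path would define one topic layer.
```

In FCMCT's terms, each node on the path from the root to the target topic would correspond to one topic layer of decreasing generality.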
In FCMCT, a knowledge path is obtained from the concept tree of a domain according to the target topic, and the topic layers are built along that path. Documents and all topic layers except the irrelevant layer are represented as vectors of class words, and the similarity between them is estimated by the cosine of their inner product. The URLs extracted from a Web document are assigned to the waiting queue of the topic layer that the document is relevant to, so each link carries the layer semantic information of the document's content. When all the waiting queues are ordered, each link is given a value that combines this layer semantic information about the class words with other metrics. In this way, the URLs awaiting crawling carry layer semantic information derived from both the Web documents' content and the class words. Based on FCMCT, a prototype focused crawler built on the concept tree is implemented with a non-recursive method and a multithreading mechanism. A memory-based task manager is responsible for adding, ordering, and allotting URLs during the crawling process, while a thread pool controls many crawler worker threads that fetch Web documents in parallel.
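The scoring step above, matching a document's class-word vector against each topic layer's vector by cosine similarity and picking the queue for its out-links, can be sketched as follows. The class-word vocabulary, layer vectors, and sample document are illustrative assumptions, not the thesis's actual data.

```python
# Sketch: scoring a document against topic layers by the cosine of
# the inner product of class-word vectors, then choosing the waiting
# queue for the URLs extracted from it. All vocabulary is made up.

import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Class-word vectors for each topic layer (hypothetical values).
layers = {
    "web crawling":         Counter({"crawler": 3, "url": 2, "robot": 2}),
    "software engineering": Counter({"software": 3, "design": 2, "test": 1}),
}

# Class-word counts extracted from a fetched document.
doc = Counter("crawler url queue crawler robot".split())

# The document's out-links go to the waiting queue of the most
# similar topic layer.
best_layer = max(layers, key=lambda name: cosine(doc, layers[name]))
print(best_layer)
```

A separate per-layer queue lets the crawler prioritize links found on highly relevant pages while still keeping, at lower priority, links from pages matching more general layers of the knowledge path.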
Keywords/Search Tags:Web Robots, Focused Crawlers, Focused Crawls, Exhaustive Crawls, Topic Layers, Concepts, Concept Tree