Research On Topic Focused Web Crawler And Related Technologies

Posted on: 2008-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: X H Pu
Full Text: PDF
GTID: 2178360245497925
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, more and more information has become available to people. Faced with the massive resources of the Internet, however, people are usually interested only in information from a particular area. The problem is how to find the desired information quickly and accurately in this voluminous and complex body of online information. The web crawler, which first appeared in 1994, emerged to meet this need. The generic web crawler has provided great convenience, but because it aims at comprehensive coverage it has no domain-specific features, so it falls short in both accuracy and speed. To improve the quality of information services, researchers began to study the subject-oriented web crawler.
In this paper, we address two issues of the subject-oriented web crawler. One is how to define the subject; the other is how to efficiently order the links waiting in the download queue. The crawler aims to visit only relevant pages and to collect a large number of hyperlinks that point to relevant pages. The crawling method in this paper is a novel one, based on the semi-structured features of web pages and on content information. Experimental results show that it is a very effective method for focused crawling.
Blogs, as an emerging phenomenon on the Internet, have attracted the attention of more and more people. We treat "blog" as a special "subject" in this paper, and accordingly design and implement a blog-oriented web crawler.
With the explosive growth of information on the Internet, the web has become a huge worldwide information network. At the current scale of the Internet, a single web crawler cannot visit the entire web within a practical time frame, so the distributed web crawler is the inevitable trend of development. A distributed web crawler crawls in parallel, which improves the efficiency and extensibility of the whole system.
In the distributed design, we mainly consider two levels of parallelism. One is multithreading within a node; the other is distributed parallelism among nodes. We focus on distribution and parallelism among nodes, and address two issues of the distributed web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the web site achieves the goals of the distributed web crawler. The performance of a single node in the distributed crawler should not fall much below that of a standalone crawler. While pursuing load balance across the system, we should also reduce communication and management overhead as much as possible.
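The site-based hash assignment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual hash function or node layout, so everything here is an assumption for illustration: the function names `site_of` and `node_for`, the use of the URL's host name as the "site", SHA-256 as the hash, and a fixed node count. The idea it shows is only that all URLs from one site map to the same node, which keeps per-site state local and limits communication between nodes.

```python
import hashlib
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the lowercased host name of a URL (assumed here to define the 'site')."""
    host = urlsplit(url).hostname  # hostname is already lowercased, port stripped
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return host


def node_for(url: str, num_nodes: int) -> int:
    """Map a URL to a node index in [0, num_nodes) by hashing its site.

    A stable digest is used instead of Python's built-in hash(), whose value
    for strings changes between processes and so could not be shared by nodes.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    digest = hashlib.sha256(site_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


if __name__ == "__main__":
    # Hypothetical URLs; the two example.com pages always land on the same node.
    for u in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(node_for(u, 4), u)
```

One limitation of plain modulo hashing is that changing the number of nodes reassigns most sites. A scheme that must support dynamic configuration would need to handle this, for example with consistent hashing, but whether the thesis does so is not stated in the abstract.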
Keywords/Search Tags: Crawler, Spider, Robots, Wanderers, theme crawler, focused crawler, subject-oriented crawler, distributed crawler, blog