
Design And Implementation Of Topic-focused Crawler For Education News

Posted on: 2012-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z Lu
GTID: 2218330362956307
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the World Wide Web (WWW), it is widely accepted that the Internet, often called the "Fourth Media", will become the most promising and dynamic medium after newspapers, radio and television, and an important carrier of hot social news. To follow hot news on the Internet in a timely way, and in particular to track education-related hot topics and their trends, education-related organizations introduced a hot news and analysis system. The focused spider designed and implemented in this thesis sits in the information collection layer and is a fundamental part of that system. It is responsible for collecting information within the education field.
A traditional Web crawler serves search engines and cannot meet the needs of a specific topic, whereas a focused spider selectively crawls topic-related web pages. This thesis studies key technologies such as the degree of topic relevance, text extraction and hyperlink extraction, proposes a general framework for an education news crawler, and designs the modules of the crawler system. Drawing on related technologies and tools and on the needs of the system itself, it discusses the concrete implementation of the core modules in detail. The main work is as follows. First, to focus on the main sites, a crawling strategy based on a weight model was designed. Second, to improve the efficiency of hyperlink extraction, an accurate hyperlink extraction strategy based on XPath was adopted. Finally, to re-crawl already visited URLs as little as possible, a duplicate-avoidance crawling strategy based on Berkeley DB was used.
Analysis of the system's results shows that the spider runs stably and continuously supplies data to the hot news and analysis system. The spider implemented in this thesis meets the requirements and has achieved satisfactory results.
The XPath-based hyperlink extraction strategy and the Berkeley DB-based duplicate-avoidance strategy are important for the implementation of the focused spider.
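The three strategies named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the site weights, the XPath expression, the example.com host names and the `Frontier` class are all hypothetical, the standard-library `dbm.dumb` key-value store stands in for Berkeley DB, and `xml.etree.ElementTree` (which supports only a subset of XPath and requires well-formed markup) stands in for a full HTML/XPath parser.

```python
import dbm.dumb
import heapq
import os
import tempfile
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Hypothetical per-site weights: a higher weight means the site is crawled earlier.
SITE_WEIGHTS = {"edu.example.com": 3.0, "news.example.com": 1.0}
DEFAULT_WEIGHT = 0.1  # assumed weight for sites not listed above


def extract_links(page_url, markup, xpath=".//div[@class='news-list']//a"):
    """Return absolute URLs of the <a> elements matched by an XPath expression.

    Restricting the search to one container (here an assumed 'news-list' div)
    skips navigation and advertisement links instead of scanning every anchor.
    """
    root = ET.fromstring(markup)  # assumes well-formed (XHTML-like) markup
    links = []
    for anchor in root.findall(xpath):
        href = anchor.get("href")
        if href:
            links.append(urljoin(page_url, href))
    return links


class Frontier:
    """Weighted URL queue backed by an on-disk set of already seen URLs."""

    def __init__(self, db_path):
        # dbm.dumb is a stand-in for Berkeley DB: a persistent key-value store.
        self._seen = dbm.dumb.open(db_path, "c")
        self._heap = []
        self._counter = 0  # tie-breaker: equal weights pop in insertion order

    def push(self, url):
        """Queue a URL unless it was seen before; return True if queued."""
        key = url.encode("utf-8")
        if key in self._seen:
            return False
        self._seen[key] = b"1"
        weight = SITE_WEIGHTS.get(urlparse(url).netloc, DEFAULT_WEIGHT)
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the queued URL whose site has the highest weight."""
        return heapq.heappop(self._heap)[2]

    def close(self):
        self._seen.close()
```

In this sketch a URL is recorded as seen when it is queued rather than when it is fetched, which keeps the queue itself free of duplicates; a crawler that needs to retry failed fetches would have to record the two states separately.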
Keywords/Search Tags: focused spider, information extraction, topic similarity, hyperlink extraction