Design And Implementation Of Topic-focused Crawler For Education News

Posted on:2012-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:Z Lu

Full Text:PDF

GTID:2218330362956307

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of the World Wide Web (WWW), the Internet, often called the "Fourth Media", is widely regarded as the medium with the greatest potential and vitality after newspapers, radio, and television, and as an important carrier of hot social news. To follow hot Internet news in a timely manner, and in particular to track education-related hot spots and their trends, education organizations introduced the hot news and analysis system. The focused spider designed and implemented in this thesis sits in the information collection layer and is a fundamental part of the hot news and analysis system. It is responsible for collecting information within the education field.

A traditional Web crawler serves a general search engine and cannot meet the needs of a specific topic, whereas a focused spider selectively crawls topic-related web pages. This thesis studies key technologies such as measuring the degree of topic relatedness, text extraction, and hyperlink extraction. It proposes a general framework for a crawler for education news and designs the modules of the crawler system. Drawing on related technologies and tools and on the requirements of the system itself, it discusses the concrete implementation of the core modules in detail. The main work is as follows. First, to focus on the main sites, we designed a crawling strategy based on a weight model. Second, to improve the efficiency of hyperlink extraction, we adopted an accurate hyperlink extraction strategy based on XPath. Finally, to avoid re-crawling already visited URLs as far as possible, we used a duplicate-avoidance crawling strategy based on Berkeley DB.

Analysis of the system's results shows that the spider runs steadily and continuously contributes data to the hot news and analysis system. The spider implemented in this thesis meets the requirements and has achieved satisfactory results.
The XPath-based hyperlink extraction strategy and the Berkeley DB-based duplicate-avoidance strategy presented here are important to the implementation of a focused spider.
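
The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the site weights, the XPath expression, the class name Frontier, and the function name extract_links are all hypothetical, Python's standard dbm module stands in for Berkeley DB as the on-disk key-value store, and the XPath subset of xml.etree.ElementTree is used on well-formed markup only.

```python
import dbm
import heapq
import os
import tempfile
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlsplit

# Hypothetical per-site weights; the abstract does not give the actual weight model.
SITE_WEIGHTS = {"edu.example.org": 1.0, "news.example.com": 0.4}
DEFAULT_WEIGHT = 0.1


def extract_links(xhtml, base_url, xpath=".//div[@class='news']//a"):
    """Return absolute URLs for the anchors matched by a site-specific XPath.

    Only anchors inside the targeted block are returned, so navigation and
    advertising links elsewhere on the page are never followed.
    """
    root = ET.fromstring(xhtml)  # assumes well-formed markup
    return [urljoin(base_url, a.get("href"))
            for a in root.findall(xpath) if a.get("href")]


class Frontier:
    """Weight-ordered URL queue with a persistent record of seen URLs."""

    def __init__(self, db_path):
        # dbm is a stand-in for Berkeley DB: a disk-backed key-value store.
        self._seen = dbm.open(db_path, "c")
        self._heap = []
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def add(self, url):
        """Queue url unless it was seen before; return True if queued."""
        key = url.encode("utf-8")
        if key in self._seen:
            return False
        self._seen[key] = b"1"
        weight = SITE_WEIGHTS.get(urlsplit(url).hostname, DEFAULT_WEIGHT)
        # heapq is a min-heap, so negate the weight to pop heavy sites first.
        heapq.heappush(self._heap, (-weight, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return the queued URL whose site has the highest weight."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

    def close(self):
        self._seen.close()


if __name__ == "__main__":
    page = ("<html><body><div class='news'>"
            "<a href='/a.html'>A</a>"
            "<a href='http://edu.example.org/b.html'>B</a>"
            "</div><div class='ads'><a href='/ad.html'>ad</a></div>"
            "</body></html>")
    with tempfile.TemporaryDirectory() as tmp:
        frontier = Frontier(os.path.join(tmp, "seen"))
        for link in extract_links(page, "http://news.example.com/index.html"):
            frontier.add(link)
        while len(frontier):
            print(frontier.pop())
        frontier.close()
```

Keeping the seen-URL record on disk rather than in memory lets it survive a restart of the crawler, which is one plausible reason for choosing an embedded store such as Berkeley DB. Real pages are rarely well-formed XML, so a practical crawler would likely parse them with a tolerant HTML parser before applying XPath.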

Keywords/Search Tags:

focused spider, information extraction, topic similarity, hyperlink extraction

Related items

1	Application And Research Of Information Extraction And Topic Spider For Criminal Investigation Web Pages
2	Topic Chain-based Topic Information Extraction From Chinese Food Complaint Documents
3	The Research On Focused Web Information Extraction
4	Technology For Domain-oriented Automatic Information Extraction From Semi-structured Web
5	An Automatic Extraction Method For Chinese Article Keywords Based On TextRank And Similarity Of Word Items
6	Research Of Focused Search Engine About Petroleum Subject
7	Research On Keyword Extraction Algorithm For Chinese Text Based On Document Topic Structure And Semantics
8	The Design And Implementation Of The Topic-focused Web Crawler System
9	Research On Web-Based Extraction Technology Of Hyperlink And Web Page Content
10	The Design And Implementation Of Enterprise Information-Oriented Web Focused Search