Design And Implementation Of News-Collecting System

Posted on: 2009-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: F Zhang
GTID: 2178360278457090
Subject: Computer technology
Abstract/Summary:
Applications built on news pages, such as automatic classification, automatic summarization, sensitive-information monitoring, and web data mining, all depend on gathering news from the Internet automatically. Taking semi-structured news pages as the object of study, this thesis addresses two questions: how to collect news pages from a complex network space, and how to extract the content from them. A detailed scheme for a news-collecting system is designed and implemented. The main work covers three aspects.

1) The design of the news-collecting system is given, based on the B/S (browser/server) structure. The system is divided into two subsystems: a meta search engine subsystem and an information extraction subsystem. A detailed modular design is given for both.

2) The key techniques of the meta search engine subsystem are implemented. Multithreading is used to query the underlying search engines in parallel, and regular expressions are used to parse the search results. A strategy for removing duplicate results and ordering the remaining ones is then designed and implemented.

3) Based on a thorough analysis of the tag structure of news pages, the concept of the smallest table block is proposed. The problem of extracting content from a news page is then transformed into the problem of finding the smallest table block. Based on this transformation, an approach that uses Bayesian theory to extract the news content from a page is presented.

The experimental results show that the design of the news-collecting system is practical and feasible, that the content extraction algorithm is accurate and effective, and that a preliminary level of automated news collection is achieved.
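The meta search step described in the abstract (parallel queries, regular-expression parsing, duplicate removal, and ordering) can be pictured with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the result pattern RESULT_RE, the URL normalization rule, the reciprocal-rank scoring, and the names parse_results, normalize, merge, and meta_search. The fetchers are plain callables that return HTML, so no real search engine API is implied.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

# Hypothetical pattern for one result link; each real engine would need its own.
RESULT_RE = re.compile(r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')


def parse_results(html):
    """Return (url, title) pairs in the order they appear on the page."""
    return [(m.group("url"), m.group("title")) for m in RESULT_RE.finditer(html)]


def normalize(url):
    """Assumed duplicate key: lower-cased host plus path without a trailing slash."""
    parts = urlsplit(url.strip())
    key = parts.netloc.lower() + parts.path.rstrip("/")
    return key + ("?" + parts.query if parts.query else "")


def merge(result_lists):
    """Remove duplicates and order results by summed reciprocal rank.

    The scoring rule is an assumption chosen for this sketch, not the
    strategy designed in the thesis.
    """
    scores = {}
    first_seen = {}
    for results in result_lists:
        for rank, (url, title) in enumerate(results, start=1):
            key = normalize(url)
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
            first_seen.setdefault(key, (url, title))
    ordered = sorted(scores, key=lambda k: -scores[k])
    return [first_seen[k] for k in ordered]


def meta_search(query, fetchers):
    """Call every fetcher in its own thread, then parse and merge the pages."""
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        pages = list(pool.map(lambda fetch: fetch(query), fetchers))
    return merge([parse_results(page) for page in pages])
```

A result returned by several engines accumulates score from each list, so under this assumed rule it moves ahead of results that only one engine returned at a similar rank.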
Keywords: Meta search engine, Search engine, Information retrieval, Information extraction, News content extraction