Research And Implementation Of Template Based WEB News Searching Technology

Posted on: 2011-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Lin
GTID: 2178360308452667
Subject: Software engineering
Abstract/Summary:
At present, locating online news of interest requires a search engine. Various news search engines exist, such as Google News and Baidu News, but none of them satisfies the needs of enterprise use. The purpose of this paper is to establish an integrated enterprise work platform that supports news search, news editing and news archiving, built on a Web news search engine that searches news efficiently, accurately and completely. A customized Web news search engine handles less information than a general search engine does, so it can obtain news faster and more accurately and then extract the news content from web pages into a unified format for further use.

In this paper, we analyze popular Web news search technologies and then, guided by practical requirements, study two key problems of a template-based distributed Web news search engine:
1) The distributed structure of the search engine and dynamic task assignment among crawlers. We design a distributed search engine architecture with many crawlers and one coordination server. Crawlers obtain tasks from the coordination server and carry them out; the search results are uploaded from the crawlers to the server and aggregated there. To assign tasks among crawlers, we propose a shortest-time assignment algorithm based on historical data, which ensures that the search is completed in the shortest time.
2) Automatic template-based extraction of Web news content. To extract news content from web pages efficiently and accurately, we design and implement a template-based method that automatically detects the tag templates of web sites and then extracts news content using those templates.
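
The template idea in item 2) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, which the abstract does not spell out; it assumes that a "tag template" can be approximated by a single tag path (tag names plus `class` attributes), and that text repeated verbatim on every sample page of a site is boilerplate. The names `PathTextParser`, `detect_template` and `extract` are hypothetical, and at least two sample pages from the same site are required.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Collect (tag path, text) pairs in a single pass over the page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.chunks = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            cls = dict(attrs).get("class") or ""
            self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(("/".join(self.stack), text))


def parse(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


def detect_template(sample_pages):
    """Return the tag path carrying the most page-specific text (assumed heuristic)."""
    if len(sample_pages) < 2:
        raise ValueError("need at least two sample pages from the same site")
    parsed = [parse(page) for page in sample_pages]
    seen_in = Counter()  # number of pages on which each (path, text) pair occurs
    for chunks in parsed:
        for item in set(chunks):
            seen_in[item] += 1
    weight = defaultdict(int)
    for chunks in parsed:
        for path, text in chunks:
            if seen_in[(path, text)] < len(parsed):  # skip text present on every page
                weight[path] += len(text)
    return max(weight, key=weight.get)


def extract(html, template_path):
    """Join the text found at the template's tag path."""
    return " ".join(text for path, text in parse(html) if path == template_path)


def make_page(*paragraphs):
    # Hypothetical site layout used only for this demonstration.
    body = "".join(f"<p>{p}</p>" for p in paragraphs)
    return ("<html><body><div class='nav'>Home | World | Sports</div>"
            f"<div class='content'>{body}</div>"
            "<div class='footer'>Copyright Example News</div></body></html>")


template = detect_template([
    make_page("Flood waters rose on Monday.", "Officials urged caution."),
    make_page("A new bridge opened downtown.", "Traffic eased within hours."),
])
print(template)
print(extract(make_page("Markets closed higher.", "Analysts were surprised."), template))
```

Once a template has been detected for a site, each further page needs only one parsing pass, which is in the spirit of a cost that grows linearly with page size.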
The time complexity of the extraction method is O(n), where n is the size of the web page.

Then, building on these key technologies, we analyzed the specific requirements, designed the architecture in UML, including the use case model, static logical model and dynamic logical model, and implemented an enterprise news search platform using .NET and PostgreSQL to meet the practical business requirements of Intelligence Company w. The platform has passed functional and performance testing.

At present, the platform is in trial use in the company's daily business and satisfies its business requirements. With 5 crawlers, it can search over 30,000 pieces of Web news per day in 4.5 hours. The overall daily news coverage rate is 92%, higher than Google News's 38.51% and Baidu News's 19.2%. The accuracy rate is 90.5%, higher than the old tool's 67.68%. These results show that the methods proposed in this paper and the product we implemented are well suited to supporting field-oriented Web news search.
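
To make the idea of assigning tasks from historical crawl times concrete, here is a minimal Python sketch. It does not reproduce the thesis's algorithm, which the abstract does not describe. Instead it uses the standard greedy longest-processing-time-first heuristic, which approximates the minimum finishing time but does not guarantee it. The function name `assign_tasks`, the site names and the durations are all hypothetical.

```python
import heapq


def assign_tasks(history_seconds, crawler_count):
    """Greedy longest-processing-time-first assignment (assumed stand-in).

    history_seconds maps each site to its estimated crawl time from past runs.
    Returns (estimated finishing time, {crawler index: [sites]}).
    """
    heap = [(0.0, i) for i in range(crawler_count)]  # (current load, crawler index)
    heapq.heapify(heap)
    plan = {i: [] for i in range(crawler_count)}
    # Place the slowest sites first, each on the currently least-loaded crawler.
    for site, cost in sorted(history_seconds.items(), key=lambda kv: kv[1], reverse=True):
        load, crawler = heapq.heappop(heap)
        plan[crawler].append(site)
        heapq.heappush(heap, (load + cost, crawler))
    makespan = max(load for load, _ in heap)
    return makespan, plan


# Hypothetical historical crawl times in seconds.
history = {"site-a": 50, "site-b": 40, "site-c": 30, "site-d": 20, "site-e": 10}
makespan, plan = assign_tasks(history, crawler_count=2)
print(makespan, plan)
```

In a setup like the one described, the coordination server would recompute such a plan before each run as new timing data arrives; that step is omitted here.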
Keywords/Search Tags: Web news search, distributed crawlers, text extraction, search engine