Research And Implementation Of Multi-Type News Syndication Based On Content

Posted on: 2011-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: M Qiu
Full Text: PDF
GTID: 2178360302964541
Subject: Computer application technology
Abstract/Summary:
In the current age of information explosion, selecting and syndicating useful information effectively has become a difficult problem for users. At the same time, as online news spreads ever more widely and quickly, the gap between traditional media and the Internet, the "Fourth Medium", is steadily narrowing. To bridge this gap in the news domain and to gather useful information for users quickly, this thesis studies the syndication of multi-source, multi-style news and examines the key technologies that news syndication requires. Building on RSS, the currently popular syndication technology, it proposes a solution designed specifically for the news domain: a content-based multi-style news syndication system. The key technologies are summarized below.

Because agents are autonomous and able to cooperate, news from different sources is processed by different agents. A content-based news syndication system architecture (C-NSSA) for different styles of news from different sources is also proposed. It combines automatic keyword extraction with feature vector space similarity calculation to guide the automatic syndication of such news. Built on a multi-agent structure, the architecture provides news collection, news page preprocessing, automatic keyword extraction, keyword set matching, and communication with users. C-NSSA offers high parallelism, high reliability, and high extensibility. Guided by C-NSSA, the thesis studies in depth the key technologies for automatic news syndication: news page parsing, automatic keyword extraction from news text, and content-based keyword set matching.

For news page parsing, a method based on analysis of the news page structure is proposed. The method uses HTML DOM technology to transform an HTML page into a DOM tree. The news title and main body are then extracted by calculating the size of a text node group, or the total size of the text nodes in a group, according to the analysis of the page structure. This method is well suited to the proposed architecture.

For keyword extraction, the traditional TF*IDF method is improved using co-occurrence theory. The new method builds on the plain TF*IDF algorithm and also takes into account the position of words and the co-occurrence between words. In addition, because the volume of news content is very large, news items are divided into groups called channels, and the calculation is carried out differently for news in different channels. This method is more effective for extracting keywords from news text.

In the final step, keyword set matching, the widely used vector space model (VSM) and the cosine coefficient method are used to calculate the similarity between two keyword sets. Because video news has no text body and each news item has only a limited number of keywords, both of which affect the matching result, word co-occurrence is taken into account once again. The architecture has been applied in a project, which is described in this thesis. Experiments were also designed, and their results show that the content-based multi-style news syndication system works well.
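
The page-parsing step in the abstract (build a tree from the HTML, then pick the title and main body by the size of text node groups) can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the thesis's method: it uses Python's standard `html.parser` instead of a full DOM library, treats the text directly inside each container element as one "group", and takes the group with the most characters as the body. The names `TextGroupParser`, `extract_title_and_body`, and the `GROUP_TAGS` set are hypothetical.

```python
from html.parser import HTMLParser

# Container tags treated as text-node groups; this choice is an assumption.
GROUP_TAGS = {"div", "td", "article", "section"}
# Text inside these tags is never article content.
SKIP_TAGS = {"script", "style"}


class TextGroupParser(HTMLParser):
    """Collect text per container element and remember the page title."""

    def __init__(self):
        super().__init__()
        self.groups = []      # finished groups, each a list of text strings
        self.stack = [[]]     # one open text list per open container
        self.title = ""
        self.in_title = False
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag in GROUP_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False
        elif tag in GROUP_TAGS and len(self.stack) > 1:
            self.groups.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.in_title:
            self.title += text
        else:
            # Text belongs to the innermost open container only.
            self.stack[-1].append(text)


def extract_title_and_body(html):
    """Return (title, body), where body is the largest text group."""
    parser = TextGroupParser()
    parser.feed(html)
    parser.close()
    groups = parser.groups + parser.stack  # include any unclosed groups
    body = max(groups, key=lambda g: sum(len(t) for t in g))
    return parser.title, " ".join(body)
```

Short navigation or footer blocks lose to the article container because their total text length is small. A real news page would likely need extra heuristics, such as link density, which this sketch leaves out.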
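
The keyword-extraction step in the abstract can be sketched in a similar spirit. The abstract does not give the improved formula, so everything specific here is an assumption: the position feature is reduced to a fixed boost for words that appear in the title (`title_boost` is a hypothetical parameter), the IDF term is computed only over documents in the same channel, and the co-occurrence component is omitted entirely.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, title_tokens, channel_docs, top_k=5, title_boost=2.0):
    """Rank the words of one news item by position-weighted TF*IDF.

    doc_tokens:   tokens of the news body
    title_tokens: tokens of the news title (stands in for the position feature)
    channel_docs: token lists of the other items in the same channel,
                  used as the corpus for the IDF term
    title_boost:  hypothetical multiplier for words that occur in the title
    """
    tf = Counter(doc_tokens)
    n_docs = len(channel_docs) + 1  # the channel corpus plus this document
    title_set = set(title_tokens)
    scores = {}
    for word, count in tf.items():
        # Document frequency within the channel, counting this document too.
        df = 1 + sum(1 for d in channel_docs if word in d)
        idf = math.log(n_docs / df) + 1.0  # +1 keeps the weight above zero
        weight = (count / len(doc_tokens)) * idf
        if word in title_set:
            weight *= title_boost
        scores[word] = weight
    # Highest weight first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Restricting the IDF corpus to one channel means a word that is common across a whole channel (for example a sport's name in a sports channel) is down-weighted there, even if it is rare in the news collection overall.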
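
For the matching step, the cosine coefficient over a vector space model can be written compactly. This is a minimal sketch assuming each keyword set is represented as a dictionary from keyword to weight; the co-occurrence adjustment mentioned in the abstract is not specified there and is not modeled. The function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse keyword-weight vectors.

    a, b: dicts mapping keyword -> non-negative weight.
    Returns 0.0 when either vector is empty or all zero.
    """
    # Only keywords present in both sets contribute to the dot product.
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two news items with no keyword in common score 0.0 under this measure, which may be part of why short keyword sets make plain cosine matching fragile.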
Keywords/Search Tags: News, Automatic Syndication, Keyword Extraction, Word Co-occurrence, Multi-Style