Research On Web News Topic Organization And Acquisition System

Posted on: 2009-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Wang
Full Text: PDF
GTID: 2178360278970825
Subject: Software engineering
Abstract/Summary:
As information on the Internet grows ever more abundant, the Internet is becoming a vital source for knowledge acquisition. The sheer volume, however, makes it hard to find valuable information efficiently, so organizing the information on the Internet is very important. This work aims to design an intelligent information acquisition system based on topic detection and tracking, which provides information collection, retrieval, and management functions and delivers highly effective information services. Such a system matters for search engines, information supervision, and knowledge management. This dissertation studies the model and principles of the knowledge acquisition system and analyzes the difficulties of several key technologies in it. The major contributions are as follows:
(1) A general web crawler is designed and implemented to meet the demands of the system; it analyzes the Robots protocol, classifies web page styles, and parses the news time. Experiments show that the crawler generalizes well, downloads web pages automatically, and provides sufficient support for the downstream information applications.
(2) The problem that navigation, advertisement, and copyright information embedded in web pages degrades topic detection performance is analyzed carefully, and a web noise cleaning technique based on the VSM (Vector Space Model) is proposed.
(3) A topic detection method based on an adaptive centroid vector is proposed to avoid the shortcomings of current adaptive methods. The new method introduces named entities to represent a topic and combines the preliminary topic centroid vector with every modified centroid vector for topic detection. Experiments show that the new algorithm lowers the probability of miss and false alarm errors and improves the performance of the topic detection system.
(4) A topic tracking method based on LS-SVM is proposed. The new method adopts LSI (Latent Semantic Indexing) for dimension reduction and text representation, and then applies an SVM to perform semantic-based topic tracking. Experimental results show that, compared with conventional methods, the new method raises precision and recall and effectively improves topic tracking performance.
(5) A method for generating the cause-and-effect of a topic based on NS-IMMC is proposed. The new method chooses representative sentences from news documents according to the characteristics of news structure (NS, News Structure), and then uses IMMC (Improved Min-Max Clustering) to cluster these representative sentences into a multi-document summary that represents the topic's cause-and-effect. Experimental results show that the cause-and-effect generated by the new method covers the content comprehensively, is logically coherent, and accurately reflects the development of the topic.
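To make the centroid idea in contribution (3) concrete, the following is a minimal sketch of plain single-pass, centroid-based topic detection over a bag-of-words vector space model. It is not the adaptive method of the thesis: it does not use named entities and does not combine the preliminary centroid with the modified centroids. The function names (`to_vector`, `cosine`, `detect_topics`) and the `threshold` value are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term-frequency vector (a very simple VSM representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_topics(stories, threshold=0.3):
    """Assign each story to the most similar topic centroid, or open a new topic.

    Centroids are kept as summed term vectors; cosine similarity is
    scale-invariant, so the sum behaves like the mean for this purpose.
    The threshold is an illustrative assumption, not a value from the thesis.
    """
    centroids = []  # one summed Counter per topic
    labels = []     # topic index assigned to each story, in input order
    for text in stories:
        vec = to_vector(text)
        best_idx, best_sim = -1, 0.0
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            centroids[best_idx].update(vec)  # fold the story into the topic
            labels.append(best_idx)
        else:
            centroids.append(Counter(vec))   # first story of a new topic
            labels.append(len(centroids) - 1)
    return labels


if __name__ == "__main__":
    sample = [
        "earthquake hits city rescue teams arrive",
        "rescue teams search city after earthquake",
        "football team wins cup final",
        "cup final won by football team",
    ]
    print(detect_topics(sample))
```

In this sketch a story that is not similar enough to any existing centroid starts a new topic, which corresponds to the detection decision; raising `threshold` trades false alarms for misses.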
Keywords/Search Tags: Topic Detection, Topic Tracking, Vector Space Model, Named Entities, Cause-and-Effect of Topic