Research On Text Processing Technology For Topics Of Hot News

Posted on: 2016-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Fang
GTID: 1108330503453425
Subject: Computer application technology
Abstract/Summary:
The proliferation of online news across the web calls for automatic, computer-aided processing. The main tasks in topic processing for hot news are topic detection, topic tracking, and topic evolution. These tasks aim to automatically detect topics, aggregate related news stories, and reveal patterns of topic evolution. At present, the main constraint on topic-related tasks still lies in text categorization technology, which is the focus of this dissertation. Several text categorization methods based on various text representation models are proposed to improve the performance of text categorization and, ultimately, of hot topic management. The dissertation is organized as follows:

i) Schemes and representation models for text categorization
Based on a survey of state-of-the-art text categorization schemes, the existing schemes were distinguished from one another. We proposed a 3-dimension scheme that represents a text categorization method by 3 factors. This scheme was then extended to a 6-tuple scheme that can represent any text categorization method. Specifically, the scheme reflects the implementation process of text categorization and presents the result systematically. For text categorization methods based on topic models, it can fully express content that other schemes cannot.

ii) Multi-strategy integration for text categorization based on SVM decision trees
We explored combining SVMs (support vector machines) with decision trees for text categorization. Four factors that dominate the construction of the decision tree were examined: the construction style of the tree, data scale, structural form, and inter-class distance. To construct an SVM decision tree, a method integrating multiple strategies was proposed. The resulting structured classifier has a clearer architecture, fewer levels, and better adaptability to classification, so time efficiency and classification accuracy can be improved simultaneously.

iii) Hot topic evolution method for dynamic topics
In a topic cycle, both the focus topics and the number of subtopics vary, so the topics are dynamic. Obtaining the number of topics dynamically therefore became a focus of our attention. An ILDA-based model was proposed to obtain the required parameters. With this method, the input corpus can be updated dynamically, which better meets the requirements of topic evolution. A topic evolution analysis system built on it can run automatically without a given topic number, satisfying the prespecified requirements. Experiments on Chinese and English corpora showed that it is portable and useful in practice.

iv) Self-adapted topic model combining dynamic and static features to improve classification
By exploring the cause of the "Rich Topic Gets Richer" problem and its solution, we found that "static in motion" is one of the phenomena that occur in topic evolution. A self-adapted topic model combining dynamic and static features was proposed. In this model, the weighted static factor emphasizes the weight of steady topic features, and the weighted dynamic factor obtained from resampling emphasizes the major features occurring in neighboring cycles. To some degree, this decreases topic bias and benefits subtle or fine-grained topic classification.

To examine the proposed schemes, a framework for hot topic management was set up. It is composed of five modules: data acquisition, knowledge base construction, hot topic detection, hot topic tracking, and hot topic evolution. The system has achieved the prespecified running capability and has been applied in an advanced project.
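
As an illustration of how inter-class distance (one of the four factors named in part ii) could shape the structure of an SVM decision tree, the following is a minimal sketch and not the dissertation's method. It assumes each class is summarized by a centroid vector and splits the classes recursively around the two most distant centroids. The function name `build_hierarchy` and the sample labels are hypothetical. Training a binary SVM at each internal node is omitted.

```python
import math
from itertools import combinations


def build_hierarchy(centroids):
    """Recursively split class labels into a binary tree.

    centroids: dict mapping a class label to its centroid vector.
    Returns a label (leaf) or a (left, right) tuple (internal node).
    Assumption: in a full classifier, each internal node would hold a
    binary SVM separating its left group from its right group.
    """
    labels = sorted(centroids)
    if len(labels) == 1:
        return labels[0]

    # Use the two most distant class centroids as seeds for the split.
    seed_a, seed_b = max(
        combinations(labels, 2),
        key=lambda pair: math.dist(centroids[pair[0]], centroids[pair[1]]),
    )

    # Each seed starts its own group, so neither group can be empty.
    left = {seed_a: centroids[seed_a]}
    right = {seed_b: centroids[seed_b]}

    # Assign every remaining class to the nearer seed.
    for label in labels:
        if label in (seed_a, seed_b):
            continue
        point = centroids[label]
        if math.dist(point, centroids[seed_a]) <= math.dist(point, centroids[seed_b]):
            left[label] = point
        else:
            right[label] = point

    return (build_hierarchy(left), build_hierarchy(right))


# Hypothetical toy centroids in a 2-dimensional feature space.
example = {
    "sports": [0.0, 0.0],
    "football": [0.0, 1.0],
    "finance": [10.0, 0.0],
    "stocks": [10.0, 1.0],
}
tree = build_hierarchy(example)
print(tree)
```

On this toy input the two finance-like classes end up under one branch and the two sport-like classes under the other, so the first decision separates the well-separated groups and the finer distinctions are deferred to lower levels.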
Keywords/Search Tags: hot spot, topic detection and tracking, evolution analysis, text categorization, topic model