Research on Topic Detection Technology for Network News

Posted on: 2014-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: A H Zhao
Full Text: PDF
GTID: 2248330398458455
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, the network has become, as a new medium, an important channel through which people obtain information. Faced with massive amounts of network news, how to obtain hot news topics quickly and accurately, and how to organize and analyze the news effectively, is a focus of research in the field of information processing. As a key technology for solving this problem, Topic Detection and Tracking aims to detect previously unknown topics in the flow of network news and to follow up on known topics. The technology takes topics as its main thread and automatically aggregates scattered information, helping people understand the full details of an event and its related activities. It has broad application prospects in fields such as information security, finance and securities, and industry research.

This thesis summarizes the current state of research on topic detection technology, analyzes the problems it faces, and sets out the research approach. It introduces the key technologies involved in detail and explores online topic detection in depth. The main work is as follows.

First, the thesis studies the construction of the topic model. Taking the characteristics of news into account, it treats the title and the body as two separate parts and represents each text with a dual vector, which gives adequate weight to the news title and improves the efficiency of detection. In addition, it uses the center vector space model to construct the topic model: the weight of each feature in the topic model is recalculated whenever a new news story is added, so the topic model is adjusted dynamically and online real-time detection is achieved.

Second, the thesis proposes a network hot topic detection algorithm based on core word clusters. To address the shortcomings of the single-pass clustering algorithm, the clustering process uses a two-tier strategy. First, it micro-clusters the title vectors of the news to discover new topics and puts the news that meets the preconditions into a candidate set. Then it clusters the news in the candidate set, analyzes the heat of each topic, and finally obtains the hot topics on the network for a given period of time. Experimental results show that the method improves the recognition performance and accuracy of topic detection.

Third, the thesis proposes a method of subtopic division. It is currently difficult to distinguish the subtopics within a hot news topic on the Internet. To solve this problem, a subtopic division method based on Latent Dirichlet Allocation is presented. It describes each news document with Latent Dirichlet Allocation and uses a standard Bayesian method to determine the number of topics that best fits the documents. Because documents in different subtopics are highly similar to one another, a correlation analysis of feature words is introduced. Calculating the similarity of news stories with an improved Kullback-Leibler distance makes it possible to distinguish effectively between stories that have similar content but belong to different topics. Finally, a hot news topic is divided into subtopics by clustering the news documents with the single-pass incremental clustering algorithm. Experimental results verify the effectiveness of the improved similarity calculation and show that the method improves the performance of subtopic division compared with the baseline method.
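The single-pass incremental clustering and the dynamically updated center vector mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the feature weighting, the dual title/body representation details, or the threshold, so the raw term-frequency vectors, the whitespace tokenizer `term_vector`, the running-mean centroid update, and the threshold of 0.3 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, count terms."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class Topic:
    """A topic represented by the centroid (mean vector) of its stories."""

    def __init__(self, vector):
        self.centroid = dict(vector)
        self.size = 1

    def add(self, vector):
        # Assumed update rule: recompute every feature weight as a running mean.
        n = self.size
        terms = set(self.centroid) | set(vector)
        self.centroid = {
            t: (self.centroid.get(t, 0.0) * n + vector.get(t, 0.0)) / (n + 1)
            for t in terms
        }
        self.size = n + 1


def single_pass(vectors, threshold=0.3):
    """Assign each story, in arrival order, to the most similar topic,
    or open a new topic when no similarity reaches the threshold."""
    topics, labels = [], []
    for vec in vectors:
        best_idx, best_sim = -1, 0.0
        for i, topic in enumerate(topics):
            sim = cosine(vec, topic.centroid)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx >= 0 and best_sim >= threshold:
            topics[best_idx].add(vec)
            labels.append(best_idx)
        else:
            topics.append(Topic(vec))
            labels.append(len(topics) - 1)
    return topics, labels


stories = [
    "earthquake hits city",
    "city earthquake rescue teams",
    "stock market falls",
    "market stock rally",
]
topics, labels = single_pass([term_vector(s) for s in stories])
print(labels)  # [0, 0, 1, 1]
```

Because each story is compared only with the current centroids and processed once, the result depends on arrival order and on the threshold, which is one of the known weaknesses of single-pass clustering that a two-tier strategy is meant to mitigate.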
Keywords/Search Tags: Topic Detection, Vector Space Model, Text Clustering, Subtopic Division, Similarity Calculation