
Research On Bursty Topic Detection In News Streams

Posted on: 2013-01-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: G Du    Full Text: PDF
GTID: 1228330374499507    Subject: Signal and Information Processing
Abstract/Summary:
Bursty topic detection extracts fast-emerging topics from large volumes of text data. The research involves many aspects, including burst feature detection, topic modeling, and topic evolution analysis, and different kinds of corpora often call for different solutions. This dissertation focuses on newspaper headline streams. Newspaper headlines are short and sparse in the vector space model, yet draw on a large vocabulary with many variations. These factors pose great challenges for bursty topic detection. The contributions of this dissertation are as follows:

(1) Reliability optimization of burst features. Kleinberg's two-state automaton is the best-known generative model for burst feature detection. To maximize reliability, we examined this model in depth and proposed an algorithm that estimates its optimal resolution parameter. Furthermore, our method produces a word-adaptive resolution for each word, breaking the bottleneck of the single unified resolution in the original model. Experiments show that the resolution estimated by our algorithm approximates the optimum obtained by exhaustive search.

(2) Robustness optimization of burst features. In newspaper headline streams, burst events and trivial events are unevenly distributed: a few burst events are typically accompanied by many trivial events. The document sequence therefore introduces noise into the feature detection results; in other words, the results are not robust to noise in the document sequence. To remove this noise, we proposed an elimination method that can be integrated seamlessly into nearly any burst feature detection algorithm. The method adopts the idea of entropy maximization and converts the maximization problem into a KL-distance minimization problem. Experiments illustrate that our method effectively removes noise from burst features while preserving the important ones.
Therefore, the robustness of the burst features is greatly improved.

(3) The static structure of bursty topics. Newspaper headlines are short and sparse. To extract the topic structure, we propose a hierarchical clustering method that constructs a word co-occurrence tree. Analyzing this tree reveals many so-called omitted-quotation features caused by bursty topics, based on the phenomenon that bursty topics often form short co-occurrence patterns which omit the quotation of their general context. We therefore discover bursty topics by identifying omitted-quotation features in the word co-occurrence tree; this process is purely static and does not analyze the headlines' time stamps. Specifically, we propose a word weighting algorithm to rank the words in each headline, then a heuristic clustering algorithm that assigns the headlines to clusters, each represented by a topic word as its center. Applying this algorithm iteratively in a hierarchical manner yields the co-occurrence tree, from which omitted-quotation features are detected to describe the underlying bursty topics in the headline stream. Experiments show that our clustering method dramatically reduces the dimensionality and sparsity of the headline data, and that the cluster centers are descriptive enough for topic presentation. Furthermore, compared against the bursty events recorded in Wikipedia, our method covers most of them, and the omitted-quotation features depict the real events well.

(4) The dynamic structure of bursty topics. In classical burst feature detection, the burst intensity of a word is represented by its frequency, and a frequency sequence is analyzed to model the burst features. In this dissertation, we instead assume that bursty topics are word co-occurrence patterns which function as connecting links between preceding and following headlines.
Given a topic word, we model the headline stream relevant to that word as the evolution, over time, of the patterns co-occurring with it, and identify the patterns that connect a preceding headline and develop the following one as bursty topics. Specifically, we rank the words in each headline so as to maximize the likelihood of a fixed-start random walk (the start is always the topic word); this ranking moves expressive words to the front of a headline. We then assume the ranked headline stream is generated by a dynamic topic model based on a semi-random walk: each headline is generated by referring to its nearest preceding headline and developing its own words randomly, with the reference part of each headline as the model's hidden variable. Inferring the hidden variables yields an evolution sequence of the headline stream in which bursty topics can be identified. Experiments suggest that our algorithm effectively captures the changes of word co-occurrence patterns in the headline stream and identifies more bursty topics than frequency-based methods. Moreover, it also locates the starting points of bursty topics more accurately than the baseline methods.
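To make the baseline of contribution (1) concrete, the classic model it optimizes can be sketched as two-state Viterbi decoding. The sketch below is a simplified illustration of Kleinberg's two-state automaton with a single global resolution parameter `s`; the dissertation's contribution, estimating a word-adaptive resolution, is not reproduced here, and all function and parameter names are our own.

```python
import math

def kleinberg_two_state(counts, totals, s=2.0, gamma=1.0):
    """Two-state automaton burst detection in the spirit of Kleinberg's model.

    counts[t] : occurrences of the word in time window t
    totals[t] : total documents in window t
    s         : rate scaling of the burst state (the "resolution")
    gamma     : weight of the cost for entering the burst state

    Returns the minimum-cost state sequence (0 = base, 1 = burst),
    computed by Viterbi decoding.
    """
    n = len(counts)
    p0 = sum(counts) / sum(totals)   # base emission rate
    p1 = min(s * p0, 0.9999)         # burst emission rate (clamped below 1)
    trans = gamma * math.log(n)      # cost of a base -> burst transition

    def sigma(p, k, d):
        # negative log-likelihood of k occurrences in d documents at rate p
        return -(k * math.log(p) + (d - k) * math.log(1.0 - p))

    cost = [0.0, trans]              # start in the base state
    back = []
    for t in range(n):
        # best predecessor for each state (leaving a burst is free)
        pred0 = 0 if cost[0] <= cost[1] else 1
        pred1 = 0 if cost[0] + trans <= cost[1] else 1
        back.append((pred0, pred1))
        c0 = min(cost[0], cost[1]) + sigma(p0, counts[t], totals[t])
        c1 = min(cost[0] + trans, cost[1]) + sigma(p1, counts[t], totals[t])
        cost = [c0, c1]

    # backtrack the cheapest state sequence
    cur = 0 if cost[0] <= cost[1] else 1
    seq = [cur]
    for t in range(n - 1, 0, -1):
        cur = back[t][cur]
        seq.append(cur)
    seq.reverse()
    return seq
```

For a word appearing 1, 1, 10, 10, and 1 times in five windows of 100 headlines each, the decoder marks the middle two windows as bursting; the quality of such output is exactly what the choice of `s` governs, which motivates the resolution estimation of contribution (1).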
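As a rough illustration of the fixed-start ranking idea in contribution (4), the sketch below orders a headline's words by following co-occurrence links greedily from the topic word. This is a hypothetical simplification: the dissertation maximizes a fixed-start random-walk probability likelihood, whereas this greedy chain only approximates that objective, and all names here are our own.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(headlines):
    """Symmetric word co-occurrence counts over tokenized headlines."""
    co = Counter()
    for words in headlines:
        for a, b in combinations(sorted(set(words)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def rank_headline(words, topic_word, co):
    """Order a headline's words starting from the topic word, at each step
    taking the strongest co-occurrence link from the previous word.
    A greedy stand-in for maximizing the fixed-start random-walk likelihood;
    assumes topic_word appears in the headline."""
    remaining = [w for w in words if w != topic_word]
    ordered = [topic_word]
    cur = topic_word
    while remaining:
        cur = max(remaining, key=lambda w: co.get((cur, w), 0))
        remaining.remove(cur)
        ordered.append(cur)
    return ordered
```

On a toy stream where "flood" co-occurs with "city" twice but with "rescue" once, the headline ["city", "rescue", "flood"] is reordered to ["flood", "city", "rescue"], pushing the more expressive co-occurrence to the front, which is the behavior the ranking step relies on before the semi-random-walk model is applied.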
Keywords/Search Tags: bursty topic detection, burst feature detection, topic modeling, topic evolution