
Research Of Topic Detection And Tracking On Microblog Feeds

Posted on: 2017-04-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J J Huang
GTID: 1318330512986006
Subject: Computer software and theory
Abstract/Summary:
Microblog platforms have become one of the most popular social networks for discussing social issues and people's daily lives. The user-generated content (UGC for short) on a microblog platform forms feed streams that are generated in real time and updated dynamically. Because this content changes so quickly, there is an urgent need for an effective tool that reveals which new topics are currently attracting the most online attention and how these topics evolve over time. Monitoring such streams matters for early warning of emergencies, product marketing, public opinion management, information recommendation, and so on. Topic detection and tracking in microblog feed streams is one such tool: it reveals what is drawing public attention and helps the organizations concerned draw up countermeasures earlier.

However, because microblog feeds are short, noisy, low in quality, fast-changing and large in volume, previous topic detection and tracking methods face several challenges when applied directly to these texts: (1) Detecting topics in a large volume of low-quality microblog feeds consumes considerable time and memory. (2) It is difficult to generate meaningful and coherent topics when directly clustering the high-dimensional, sparse vectors of the microblog feeds. (3) It is difficult to balance the demands of real-time detection and accuracy when detecting emerging topics in microblog feed streams. (4) It is difficult to track topics and determine their evolution states over time.

In this study, we propose a topic detection and tracking framework for microblog streams that addresses the above challenges through microblog feed sampling, topic abstraction, emerging topic detection and topic tracking. The main points of this study are listed as follows:

(1) To handle the low quality and large volume of the microblog feed stream, we propose a high-quality microblog extraction method based on time-frequency transformation, which extracts a small set of representative feeds from the raw stream. Considering the feed content quality, the social network features of the microblog feed, URLs and other features, we propose a feature fusion algorithm based on wavelet transformation to estimate the impact of each feature on feed quality. Experimental results on one million microblog feeds demonstrate that the proposed method effectively extracts high-quality information with low redundancy among the extracted feeds.

(2) To address the high dimensionality, feature sparsity and noise that affect vector-space-based clustering methods on microblog feeds, we propose a short text clustering and topic extraction (STC-TE for short) method based on the frequent itemsets that appear in the feeds. The method first studies the impact of multiple features on the quality of short texts. It then mines a massive number of frequent itemsets from the high-quality short text set by setting a low support level, and applies a similar-itemset filtering strategy to filter out most of the less important frequent itemsets. Furthermore, based on the similarity between frequent itemsets, evaluated using their relevant texts, we propose a cluster self-adaptive spectral clustering (CSA_SC for short) algorithm to cluster the important frequent itemsets and extract topics. Finally, large-scale short texts are classified into different topic clusters according to the topic words extracted from the frequent itemset clusters. The method is tested on a Sina Weibo dataset of one million short texts to evaluate its performance in important frequent itemset selection and clustering, topic word extraction, and large-scale short text classification. Experimental results show that the STC-TE method achieves topic extraction and large-scale short text clustering with high accuracy.

(3) To address the difficulty of measuring the similarity between a pair of microblog feeds and of effectively identifying valuable keywords in a large vocabulary, we propose a high utility pattern clustering method to detect topics in microblog feed streams. This method first extracts a group of representative patterns from the microblog feed streams, and then groups these patterns into topic clusters. The approach works well on large-scale microblog feed streams because it clusters the patterns that describe topics well, rather than directly clustering noise and short microblog feeds. Furthermore, the proposed method can detect coherent topics and emerging topics simultaneously. Extensive experimental results on Twitter streams and Sina Weibo streams show that the developed method outperforms other existing topic detection methods, making it a desirable solution for detecting events in microblog streams.

(4) To balance the demands of real-time detection and accuracy, we propose a probabilistic method based on topic novelty and fading. This method aligns emerging word detection from a temporal perspective with coherent topic mining from a spatial perspective. Specifically, we first design a metric that estimates word novelty and fading based on locally weighted linear regression (LWLR); the metric highlights the novelty of words that express an emerging topic and suppresses the novelty of words that express an existing topic. We then track emerging topics by leveraging topic novelty and fading probabilities, which are learnt by formulating and solving an optimization problem. We evaluate our method on a microblog feed stream containing over one million feeds. Experimental results show that the proposed method performs promisingly, in both effectiveness and efficiency, in detecting emerging topics and tracking topic evolution over time.
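As a rough illustration of the regression step named in point (4), the following minimal sketch fits a locally weighted linear regression to a word's per-window frequency series and reads the local slope at the latest window as a trend signal. It is not the dissertation's actual metric, which the abstract does not specify: the Gaussian kernel, the bandwidth `tau`, and the function names `lwlr_slope` and `word_trend` are all assumptions made for this example.

```python
import math


def lwlr_slope(xs, ys, x0, tau=1.5):
    """Slope of a locally weighted linear fit around x0.

    Assumption: Gaussian kernel weights with bandwidth tau;
    the abstract does not say which kernel is used.
    """
    weights = [math.exp(-((x - x0) ** 2) / (2.0 * tau ** 2)) for x in xs]
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    denom = sw * sxx - sx * sx
    if denom == 0:
        # Degenerate case (e.g. a single time window): no trend.
        return 0.0
    # Closed-form weighted least-squares slope.
    return (sw * sxy - sx * sy) / denom


def word_trend(counts, tau=1.5):
    """Hypothetical trend signal for one word.

    counts[i] is the word's frequency in time window i. A positive
    value suggests rising usage (novelty); a negative value suggests
    declining usage (fading).
    """
    xs = list(range(len(counts)))
    return lwlr_slope(xs, counts, x0=xs[-1], tau=tau)


rising = word_trend([1, 1, 1, 1, 5, 9])   # burst in the latest windows
fading = word_trend([9, 5, 1, 1, 1, 1])   # burst that has died down
```

Weighting windows near the present more heavily lets a recent burst dominate the fit, which is one plausible reason to prefer a local regression over a global trend line when the goal is timely detection.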
Keywords/Search Tags:Microblog feed streams, topic detection, topic tracking, high quality microblog feed, topic abstraction, emerging topic