Research On Clustering Algorithms For Social Media Content

Posted on: 2015-11-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C S Li
Full Text: PDF
GTID: 1108330422492538
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Web 2.0, Internet users have become increasingly active and have created text data on a huge scale in social media services. Meanwhile, the real-world social relations of users have been captured by these services. Faced with this massive and valuable data, many researchers have found that traditional clustering methods do not work well, because social media data exhibit new characteristics: they may contain strong noise, be highly sparse, consist of short texts, be dynamic, and have missing values.

Recently, several approaches have been proposed for dealing with these new characteristics, such as probabilistic topic models and graph partitioning. However, some existing clustering methods still have shortcomings on social media data. For example, they ignore the distribution of the whole corpus and the relations among data items, which usually causes traditional methods to perform poorly on social media data.

This dissertation focuses on the problems caused by these new characteristics and proposes five novel clustering methods that draw on the latest research results in graph clustering models and probabilistic topic models. The main contributions of this dissertation are as follows:

(1) Two algorithms, the DOM tree structure based web page segmentation (TPS) algorithm and the graph partition based web page segmentation (GPPS) algorithm, are proposed for modeling the mapping between the DOM tree structure and the semantic modules of a page. Treating a web page as a set of semantic information blocks, TPS and GPPS divide the page into several modules, each with a single topic. The TPS algorithm segments a page by heuristic rules, whereas the GPPS algorithm treats the DOM tree as a graph and detects the semantic modules with a graph clustering method. A substantial number of experiments on various web data sets show that TPS and GPPS provide effective and robust semantic modules. The TPS and GPPS algorithms are used to filter noisy text in web pages and thus serve as a preliminary step for the other approaches in the dissertation.

(2) A topic based bursty event detection (TBE) algorithm is proposed to cluster the bursty words in text streams on social media. The TBE algorithm first detects the bursty words using a Gaussian distribution. Then, TBE simultaneously considers the co-occurrence relationships among bursty words and the hidden topics that generate the events. Finally, events are tracked over the timeline via their topics, which are determined by the TBE algorithm. A visualization method is also designed to illustrate the detected events. Experimental results on a blog data set and the Reuters data set demonstrate that TBE significantly outperforms the state-of-the-art method in event detection.

(3) A topic event detection and tracking (TEDT) algorithm is proposed by extending the traditional topic model. The TEDT algorithm uses the occurrence probability of words in a topic to measure the distance among words and proposes a stream clustering method to generate the events that have the highest occurrence probability in this topic. Then, the event topic is used to track the changes of events over the timeline. A visualization approach is proposed to illustrate the detected events. Experimental results on a blog data set and the Reuters data set demonstrate that TEDT outperforms the traditional topic model in event detection.

(4) A generative model named the Author-Topic-Community (ATC) model is proposed. The ATC model infers author interest profiles and their community structures simultaneously, based on the contents of the documents written by these authors and on their social relationships. The ATC model uses knowledge of users' expertise to compensate for missing links between users. A learning algorithm based on variational inference is adopted to estimate the model parameters. Because the author topics and the author community structure reinforce each other, the inferred ATC model achieves more robust author interest profiling and community discovery. Experimental results on synthetic data and on DBLP, blog, Digg, and Twitter data show that the ATC model performs better than other author topic models.

In this dissertation, the five clustering algorithms are proposed to solve different problems in social media. The TPS and GPPS algorithms are used to filter noisy content in web pages and are a preliminary step for the other algorithms. The TBE algorithm simultaneously considers the co-occurrence relationships between bursty words and the hidden topics that generate the events in order to cluster the bursty words. The TEDT algorithm extends the traditional topic model and tackles two limitations of the topic model by clustering the co-occurring features of the underlying topics in the text stream. The ATC model uses knowledge of users' expertise to compensate for missing links between users and learns the users' expertise and communities.

In summary, this dissertation establishes several novel clustering methods to counter the new characteristics of social media data, which may contain strong noise, be highly sparse, consist of short texts, be dynamic, and have missing values. The research in this dissertation further improves clustering techniques for social media and promises to bring more and better choices to the fields of financial analysis and recommender systems in e-commerce.
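
As a rough illustration of the first step of the TBE algorithm (detecting bursty words with a Gaussian distribution), the sketch below models each word's count per time window as Gaussian and flags a word when its current count lies far above its historical mean. The abstract does not give the actual detection rule, so the z-score threshold `k`, the function name `bursty_words`, and the input layout are all assumptions made for illustration, not the dissertation's method.

```python
from statistics import mean, pstdev


def bursty_words(history, current, k=2.0, min_std=1e-9):
    """Flag words whose current count is unusually high.

    history: dict mapping word -> list of counts in past time windows
    current: dict mapping word -> count in the current time window
    k:       assumed threshold, in standard deviations above the mean
    Returns a list of (word, z_score) pairs, highest score first.
    """
    flagged = []
    for word, count in current.items():
        past = history.get(word, [])
        if len(past) < 2:
            # Too little history to estimate a mean and a deviation.
            continue
        mu = mean(past)
        # Guard against a zero deviation for perfectly flat histories.
        sigma = max(pstdev(past), min_std)
        z = (count - mu) / sigma
        if z > k:
            flagged.append((word, z))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: one rare word that spikes, one common stable word.
history = {"earthquake": [1, 2, 1, 2], "the": [100, 102, 98, 100]}
current = {"earthquake": 20, "the": 101}
result = bursty_words(history, current)
print(result)
```

A frequent but stable word such as "the" is not flagged, because its count is judged against its own history rather than against a global cutoff. The later TBE steps (combining co-occurrence with hidden topics) are not sketched here, since the abstract does not describe them in enough detail.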
Keywords/Search Tags: social media, clustering, probabilistic model, text stream, event detection, variational inference, user topic, community detection