Research On Clustering Algorithms For Social Media Content

Posted on:2015-11-07

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C S Li

Full Text:PDF

GTID:1108330422492538

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of Web 2.0, Internet users have become increasingly active and have created text data at a huge scale on social media services. Meanwhile, users' real-world social relations have been captured by these services. Faced with such massive and valuable data, many researchers have found that traditional clustering methods do not work well, because social media data exhibit new characteristics: they may contain strong noise, be highly sparse, consist of short texts, be dynamic, and have missing values.

Recently, several approaches have been proposed to deal with these new characteristics, such as probabilistic topic models and graph partitioning. However, existing clustering methods still have shortcomings on social media data. For example, they ignore the distribution of the whole corpus and the relations among data items, which usually leads to poor performance.

This dissertation focuses on the problems caused by these new characteristics and proposes five novel clustering methods that draw on the latest research results in graph clustering models and probabilistic topic models. The main contributions of this dissertation are as follows:

(1) Two algorithms, the DOM tree structure based web page segmentation (TPS) algorithm and the graph partition based web page segmentation (GPPS) algorithm, are proposed to model the mapping between DOM tree structure and the semantic modules of a page. Treating a web page as a set of semantic information blocks, TPS and GPPS divide the page into several modules, each with its own topic. TPS segments a page using heuristic rules, whereas GPPS treats the DOM tree as a graph and detects the semantic modules with a graph clustering method. Extensive experiments on various web data sets show that TPS and GPPS produce effective and robust semantic modules.
The TPS and GPPS algorithms are used to filter noisy text from web pages; they are the preliminary work for the other approaches in the dissertation.

(2) A topic based bursty event detection (TBE) algorithm is proposed to cluster the bursty words in text streams on social media. TBE first detects bursty words using a Gaussian distribution. It then simultaneously considers the co-occurrence relationships among bursty words and the hidden topics that generate the events. Finally, events are tracked along the timeline via the topics determined by TBE. A visualization method is also designed to illustrate the detected events. Experimental results on a blog data set and a Reuters data set show that TBE significantly outperforms the state-of-the-art method in event detection.

(3) A topic event detection and tracking (TEDT) algorithm is proposed by extending the traditional topic model. TEDT uses the occurrence probability of words in a topic to measure the distance among words, and proposes a stream clustering method to generate the events that have the highest occurrence probability in that topic. The event topic is then used to track changes in events along the timeline. A visualization approach is proposed to illustrate the detected events. Experimental results on a blog data set and a Reuters data set show that TEDT outperforms the traditional topic model in event detection.

(4) A generative model named the Author-Topic-Community (ATC) model is proposed. The ATC model infers author interest profiles and community structures simultaneously, based on the contents of the documents written by the authors and on their social relationships. The ATC model uses knowledge of users' expertise to compensate for missing links between users. A learning algorithm based on variational inference is adopted to estimate the model parameters.
Through the mutual reinforcement between author topics and author community structure, the inferred ATC model achieves more robust author interest profiling and community discovery. Experimental results on synthetic data and on DBLP, blog, Digg, and Twitter data show that the ATC model performs better than other author-topic models.

In this dissertation, five clustering algorithms are proposed to solve different problems in social media. The TPS and GPPS algorithms are used to filter noisy content from web pages; they are the preliminary work for the other algorithms. The TBE algorithm clusters bursty words by simultaneously considering the co-occurrence relationships between bursty words and the hidden topics that generate the events. The TEDT algorithm extends the traditional topic model and addresses two limitations of topic models by clustering the co-occurring features of the underlying topics in the text stream. The ATC model uses knowledge of users' expertise to compensate for missing links between users, and learns users' expertise and communities.

In summary, this dissertation establishes several novel clustering methods to counter the new characteristics of social media data, which may contain strong noise, be highly sparse, consist of short texts, be dynamic, and have missing values. The research advances clustering techniques for social media, and it is expected to offer more and better options for financial analysis and for recommender systems in e-commerce.
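
As an illustration of the first TBE step mentioned in the abstract (detecting bursty words using a Gaussian distribution), the following is a minimal sketch and not the dissertation's actual procedure. It assumes that each word's per-window frequency is modeled as Gaussian and that a word is bursty in a window when its frequency exceeds the mean by more than k standard deviations. The function name `bursty_words`, the windowing, and the threshold `k=2.0` are hypothetical choices made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def bursty_words(windows, k=2.0):
    """Flag bursty words per time window.

    windows: list of token lists, one list per time window.
    k: assumed threshold in standard deviations (not from the source).
    Returns a dict mapping window index -> set of bursty words.
    """
    # Word counts for each time window.
    counts = [Counter(tokens) for tokens in windows]
    vocab = set().union(*counts)
    bursts = {i: set() for i in range(len(windows))}

    for word in vocab:
        # Frequency series of this word across all windows.
        series = [c[word] for c in counts]
        mu, sigma = mean(series), pstdev(series)
        if sigma == 0:
            continue  # constant frequency: never bursty
        for i, freq in enumerate(series):
            # Bursty if the frequency lies far in the upper Gaussian tail.
            if freq > mu + k * sigma:
                bursts[i].add(word)
    return bursts


if __name__ == "__main__":
    # Five quiet windows, then one window where "quake" spikes.
    stream = [["the"] * 3 for _ in range(5)] + [["the"] * 3 + ["quake"] * 10]
    print(bursty_words(stream))
```

The bursty words found this way would then be the input to the clustering step, which the abstract describes as combining co-occurrence relationships with hidden topics; that step is not sketched here.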

Keywords/Search Tags:

social media, clustering, probabilistic model, text stream, event detection, variational inference, user topic, community detection

Related items

1	Research On Text Clustering Algorithm And Its Application In Topic Detection
2	Research And Implementation Of Distributed Topic Clustering Technology For Text Flow
3	Research On Topic Detection Method Of Complex Short Text Based On Topic Model
4	Research On Construction Method And Application Of Deep Probabilistic Models Integrating Text Structure Information
5	Event Detection From Microblogs Based On Topic Model
6	Research On Key Issues On Topic Detection And Topic Diffusion In Social Media
7	Research And Application Of Probabilistic Generative Model With Variational Learning And Inference
8	Probabilistic Generative Models-based Topic Modeling Of Text And Its Applications
9	Research On Bursty Event Detection And Traceability Analysis In Social Media
10	A Topic Model Based On Community Structure