
Research Of Analysis Technology On Text Content And Profiles In Social Media

Posted on: 2016-11-04    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J G Du    Full Text: PDF
GTID: 1108330503455327    Subject: Computer application technology
Abstract/Summary:
Recent years have witnessed the rapid development of online social media, in which a large number of users participate. Online social media significantly influence the way we live. In such a web environment, understanding and mining users' content on online social media has become important for both the research and industry communities. Social media not only enrich people's lives but also bring both opportunities and challenges for research on users' text content and profiles. On the one hand, the text content that users generate is rich and varied; on the other hand, users' profiles are complicated. Both of these facts make such research challenging. In this dissertation, we study techniques for analyzing the text content generated by users and their profiles. For both text content and profiles, we propose corresponding models of the data in order to better understand users on social media. To summarize, the contributions of this work include the following.

First, we model text content and propose an unsupervised model to identify organizational phrases in argumentative texts. People usually argue with one another using not only claims and evidence about the topic under discussion but also language that organizes them, which we refer to as shell. We study how to separate shell from topical content with an unsupervised model. Along this line, we develop a latent variable model, named the Shell Topic Model (STM), that jointly models topics and shell. The model uses bigram language models for shell and unigram models for topical content. Experiments on two real data sets show that our model can identify meaningful shell phrases. In addition, on two extrinsic tasks, separating shell from topical content with our model improves performance compared with baseline methods that do not distinguish between the two.

Second, we model users' profiles (link relations) and propose a model that estimates the probability that a tweet is read, based on users' retweeting behavior (one kind of link relation). With Twitter's tremendous growth, it has drawn increasing interest from researchers. The literature typically assumes that Twitter users can catch up with, or read, all the tweets posted by their friends. We relax this assumption and model users' reading behavior. Specifically, we propose a ReadBehavior model to measure the probability that a user reads a specific tweet. The model captures users' retweeting behavior and the correlation between a tweet's posting time and the corresponding retweeting time. Since reading probability is not well defined and is difficult to evaluate directly, we evaluate the model extrinsically: based on it, we develop a PageRank-like algorithm to infer influential users and use that algorithm to evaluate the model. The experimental results show that the algorithm based on our model outperforms related algorithms that do not consider users' reading behavior.
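To make the second contribution concrete, a PageRank-like influence score weighted by reading probabilities could, as one hypothetical instantiation (the exact formulation in the dissertation may differ; the damping factor d and the normalization below are assumptions), be iterated as

I(u) = \frac{1-d}{N} + d \sum_{v \in \mathrm{followers}(u)} \frac{P_{\mathrm{read}}(v, u)}{\sum_{w \in \mathrm{friends}(v)} P_{\mathrm{read}}(v, w)} \, I(v),

where P_read(v, u) denotes the ReadBehavior estimate of the probability that user v reads user u's tweets, followers(u) and friends(v) are the link relations on Twitter, N is the number of users, and d is a damping factor as in standard PageRank.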
Third, we model text content together with users' profiles and propose a general topic model with document relative similarities. Topic modeling has been widely used in text mining. Previous topic models such as Latent Dirichlet Allocation (LDA) are successful at learning hidden topics, but they do not take into account documents' profiles (metadata). To tackle this problem, many augmented topic models have been proposed to jointly model text and metadata, but most existing models handle only categorical and numerical metadata. We identify another type of metadata that is more natural to obtain from users' profiles: relative similarities among documents. We propose a general model that links LDA with constraints derived from these document relative similarities; specifically, the constraints act as a regularizer of the LDA log likelihood. Experiments show that our model is able to learn coherent topics, and the results also show that it outperforms the baselines on age prediction and document classification.
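As an illustration of the third contribution (a sketch only; the dissertation's exact objective may differ, and the distance dist, margin \epsilon, and weight \lambda below are assumptions), relative-similarity constraints of the form "document i is more similar to document j than to document k" can regularize the LDA log likelihood as

\mathcal{L} = \log p(\mathbf{w} \mid \alpha, \beta) - \lambda \sum_{(i,j,k) \in \mathcal{C}} \max\bigl(0,\; \epsilon + \mathrm{dist}(\theta_i, \theta_j) - \mathrm{dist}(\theta_i, \theta_k)\bigr),

where \mathcal{C} is the set of similarity triples, \theta_i is document i's topic distribution, and the first term is the usual LDA log likelihood with hyperparameters \alpha and \beta.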
Keywords/Search Tags: social media, topic modeling, reading probability, organizational phrases, document relative similarities