Forum Topic Model Based On A Combination Of Selective Long Text And Short Text

Posted on: 2016-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: L Zheng
Full Text: PDF
GTID: 2308330479482179
Subject: Software engineering
Abstract/Summary:
This thesis analyzes the text of posts in forum systems and mines the topics of their comments and reviewers. Latent Dirichlet Allocation (LDA) is an unsupervised machine learning technique for identifying hidden topical information in large document collections. Link-LDA introduces links between documents and other objects and uses those links to model the topical relationship between them, which improves model performance. Proposing a model whose link information suits the application scenario is therefore important. In this thesis, we propose a new topic model for blogs, post bars, and forum systems that treats the post as long text and the comment data (comment and reviewer) as short text, and combines them in a single framework.

By observing and analyzing forums, we found three important features of forum systems, which form the basis of our model. First, most comments are short texts, usually written in response to a topic or question raised in the post, so we assume that each such comment belongs to exactly one topic and that each comment is associated with one reviewer. Second, posts and comments share the same vocabulary so that they can influence each other: the topic of the post is the basis of the comments, while the comments can affect the topic distribution of the post. However, not all comments respond to the content of the post, so we introduce an additional topic for comments. Comments assigned to this topic are defined as spam, and a reviewer who spams frequently is defined as a spam user. In this way the model binds the long text and the short text selectively. Finally, for spam comments we introduce partially supervised information: some comments can be identified as spam without semantic analysis, and their plain text is marked as spam to serve as supervised information. With this supervision, we can better identify the topics of comments.

Based on these findings, we propose Forum-LDA, a topic model adapted to forum systems that selectively binds long text and short text with partially supervised information. We conducted experiments with Forum-LDA on Chinese and English datasets and compared it with LDA and Link-LDA. The results show that Forum-LDA generalizes better, strengthens the clustering of posts with their comments and reviewers, and finds the topics of comments more accurately, which in turn helps identify spam comments and spam users in the corpus. It therefore has certain advantages over LDA and Link-LDA.
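
The selective binding of a long post and its short comments can be pictured as a generative process. The following is a minimal sketch under stated assumptions, not the specification of Forum-LDA itself: the symmetric Dirichlet priors `alpha` and `beta`, the fixed spam probability `spam_prob`, and the function name `generate_thread` are all hypothetical and introduced here only for illustration. Reviewers and the partially supervised spam labels are left out for brevity.

```python
import numpy as np


def generate_thread(rng, n_topics, vocab_size, post_len, n_comments,
                    comment_len, spam_prob=0.2, alpha=0.5, beta=0.5):
    """Sample one post and its comments under a simplified, assumed process."""
    # Topic-word distributions over a vocabulary shared by posts and comments.
    # The extra last row (index n_topics) plays the role of the additional
    # "spam" topic; this layout is an assumption of the sketch.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics + 1)

    # Long text: the post is a mixture over the regular topics,
    # with one topic drawn per word as in plain LDA.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    post_topics = rng.choice(n_topics, size=post_len, p=theta)
    post_words = np.array(
        [rng.choice(vocab_size, p=phi[z]) for z in post_topics]
    )

    comments = []
    for _ in range(n_comments):
        # Short text: each comment carries exactly one topic. It is either
        # bound to the post (topic drawn from the post's mixture) or
        # assigned to the additional spam topic.
        if rng.random() < spam_prob:
            z = n_topics
        else:
            z = int(rng.choice(n_topics, p=theta))
        words = rng.choice(vocab_size, size=comment_len, p=phi[z])
        comments.append((z, words))
    return post_words, comments


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    post, comments = generate_thread(
        rng, n_topics=3, vocab_size=20, post_len=50,
        n_comments=10, comment_len=5,
    )
    n_spam = sum(1 for z, _ in comments if z == 3)
    print(f"post words: {len(post)}, comments: {len(comments)}, spam: {n_spam}")
```

In this sketch, a comment whose topic index equals `n_topics` is a spam comment. Inference would run in the opposite direction, estimating the topic of each comment from the observed words, and comments already marked as spam could have their topic fixed rather than estimated.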
Keywords/Search Tags: topic model, forum system, post and comments, long text and short text, additional topic, partially supervised