A Study on Dynamic Topic Modeling and Text Summarization for Web Forums

Posted on: 2013-01-26 | Degree: Master | Type: Thesis
Country: China | Candidate: Z C Ren | Full Text: PDF
GTID: 2248330395969709 | Subject: Computer system structure
Abstract/Summary:
The development of the Internet has brought great technical innovation, and web forums, as one kind of social media, have become an important tool for lively communication. In web forums, people post and reply under their own profiles, and a single forum often covers many areas, such as culture, politics, and sports. Many popular web forums have become not only a "first place" for users to voice opinions but also an important platform for understanding what people think and discuss, because in recent years sensitive events have usually surfaced first in web forums. However, tracking the dynamic changes in the themes of current forum threads and extracting the major content of each thread, so that sensitive topics can be better monitored and tracked, has become a challenging problem.
At the same time, users browsing a thread are often confused by the large volume of redundant data, which reduces reading efficiency and quality. Helping users quickly understand a forum thread is therefore also meaningful work.
Since topic models were proposed, several research works on topic tracking on the Internet have appeared in recent years. However, there is a lack of research results for web forums, which combine complicated social interaction with streams of short text. The reason is that forum threads suffer from topic dependency and topic drift. Meanwhile, because forum posts are short documents and users often pay little attention to grammar, rhetoric, and spelling when posting, thread documents exhibit the semantic sparsity characteristic of short text.
In this thesis, to address the lack of effective document summarization methods for web forums, we propose a model based on the LDA topic model: the Post Propagation Model.
When building topic models for web forums, we take into account the reply relationships among posts, and we treat the topic distribution in a forum as a dynamic process in order to address the dependency and drift problems. To infer the variables in the model more accurately, we use the Gibbs EM sampling algorithm to determine the dynamic variables, and then derive the distribution of topics at different times.
To give users a clearer understanding of each thread, we propose three web forum summarization methods based on the Post Propagation Model (PPM). By summing the topic weights in sentences, we determine the saliency of each topic; we then obtain the summary by extracting sentences according to saliency in each thread. To improve on this, we introduce Markov random walks and generate the summary with a topic-sensitive ranking process that assigns scores to sentences.
Because no suitable experimental data set is currently available, we crawl 400 documents from two high-participation web forums to build our own data set. We first examine the results of our topic model on this data set, in particular how the same topic changes across time periods. The experimental results show that the proposed PPM outperforms static topic models such as LDA because it is more sensitive in topic detection. For the web forum summarization task, we create human-generated reference summaries for the data set collected from popular web forums and adopt ROUGE, a widely used evaluation method in the field of document summarization. The experimental results show that our new method outperforms the other baselines on every ROUGE metric.
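The idea of a topic-sensitive random walk over sentences, mentioned in the abstract, can be illustrated with a minimal Python sketch. This is not the thesis's Post Propagation Model or its exact ranking procedure. It assumes that a per-sentence topic weight is already available (for example, from some topic model), that sentence similarity is bag-of-words cosine similarity, and that the damping factor is 0.85. The function names `topic_sensitive_rank` and `summarize` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_sensitive_rank(sentences, topic_weights, damping=0.85,
                         iters=100, tol=1e-9):
    """Score sentences with a random walk biased toward topical sentences.

    topic_weights[i] is an assumed, precomputed topic weight for sentence i.
    The walk follows similarity edges with probability `damping` and
    otherwise jumps to a sentence drawn from the normalized topic weights.
    """
    n = len(sentences)
    total = sum(topic_weights)
    if n == 0 or n != len(topic_weights) or total <= 0:
        raise ValueError("need one positive-sum weight per sentence")
    prior = [w / total for w in topic_weights]

    bags = [Counter(s.lower().split()) for s in sentences]
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]

    # Row-normalize similarities into transition probabilities.
    # A sentence with no similar neighbours jumps according to the prior.
    trans = []
    for row in sim:
        row_sum = sum(row)
        trans.append([s / row_sum for s in row] if row_sum > 0 else prior[:])

    scores = prior[:]
    for _ in range(iters):
        new = [(1 - damping) * prior[j]
               + damping * sum(scores[i] * trans[i][j] for i in range(n))
               for j in range(n)]
        delta = sum(abs(x - y) for x, y in zip(new, scores))
        scores = new
        if delta < tol:  # stop once the scores have stabilized
            break
    return scores


def summarize(sentences, topic_weights, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = topic_sensitive_rank(sentences, topic_weights)
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(posts, weights, k=3)` on a list of thread sentences and their assumed topic weights returns three sentences in thread order. Because the jump distribution is the normalized topic weights rather than a uniform one, sentences tied to salient topics receive higher scores, which is the sense in which the ranking is topic-sensitive.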
Keywords/Search Tags:Web Forums, Topic Modeling, Document Summarization, Markov Random Walks, Topic-sensitive PageRank, Natural Language Processing, Machine Learning, Information Retrieval