Clustering Community Questions Based On Topic Models

Posted on: 2012-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Miao
GTID: 2248330362968181
Subject: Software engineering
Abstract/Summary:
In the era of Web 2.0, Community Question Answering (CQA) services, such as Yahoo! Answers and Baidu Zhidao, have become a popular approach to information seeking. As user-posted questions accumulate to a huge volume, effective management of CQA is becoming an increasingly important issue. Usually, the archives of community questions are organized in the form of hierarchical categories, whose maintenance relies heavily on users' behavior: askers are required to select appropriate categories when submitting questions. To reduce users' effort and make question management in CQA more intelligent, in this paper we study the problem of Community Question Clustering, which is primarily concerned with grouping a collection of community questions into predefined categories. This problem can be naturally formulated as a clustering task. In the area of data mining, text clustering is meant to partition a document collection into disjoint clusters, each of which corresponds to a specific category or domain. Unlike traditional text data, community questions contain unstructured text data and structured user data simultaneously. Along these two dimensions, we formally define clustering features of community questions and investigate the effectiveness of probabilistic topic models. Based on the traditional PLSA model, we propose Basic-PLSA, which performs clustering on the textual features alone. Then, we take user information into account, which provides a valuable indication of the correlations between categories and community questions. To exploit user information, we extend Basic-PLSA in different ways and propose the User-PLSA and Reg-PLSA models. In particular, User-PLSA combines text and user information in a unified probabilistic framework, and Reg-PLSA smooths parameter estimation with a question graph derived from user information.
The experimental results show that our Community Question Clustering problem can be solved effectively by the proposed models, among which Reg-PLSA achieves the best clustering results and efficiency. Meanwhile, evaluation reveals that incorporating user information improves the performance of Basic-PLSA significantly. Also, through empirical studies, we demonstrate how variation in textual features influences the clustering of community questions. In addition, using these clustering models, we proceed to examine the identification of new categories in CQA. With categorical prior knowledge, we modify the unsupervised Basic-PLSA model into a semi-supervised method, which successfully finds new categories of high quality for CQA and achieves better identification performance than Basic-PLSA.
Keywords/Search Tags: CQA, Clustering, Community Question, Topic Model
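
The text-only clustering idea mentioned in the abstract can be illustrated with a minimal sketch of standard PLSA fitted by EM, where each question is assigned to the topic with the highest P(z|d). This is not the thesis's implementation: the function name plsa_cluster, its parameters, and the toy questions are hypothetical, and the user-based extensions (User-PLSA, Reg-PLSA) and the semi-supervised variant are not covered.

```python
import random
from collections import Counter


def plsa_cluster(docs, n_topics, n_iter=50, seed=0):
    """Cluster tokenized documents with plain PLSA fitted by EM.

    Returns (labels, p_z_d), where labels[d] is the topic maximizing P(z|d).
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w_index = {w: i for i, w in enumerate(vocab)}
    # counts[d][w] = n(d, w), the number of times word w occurs in document d
    counts = [Counter(w_index[w] for w in doc) for doc in docs]
    n_docs, n_words = len(docs), len(vocab)

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive initialization of P(w|z) and P(z|d)
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        # Tiny floor keeps every row strictly positive before normalization
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d, doc_counts in enumerate(counts):
            for w, n_dw in doc_counts.items():
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z)
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step
                for z in range(n_topics):
                    new_w_z[z][w] += n_dw * post[z]
                    new_z_d[d][z] += n_dw * post[z]
        # M-step: renormalize expected counts into probabilities
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]

    labels = [max(range(n_topics), key=lambda z: row[z]) for row in p_z_d]
    return labels, p_z_d


# Toy questions (made up for this sketch), already tokenized
questions = [
    ["python", "code", "bug", "error"],
    ["code", "python", "compile", "error"],
    ["bug", "compile", "python", "code"],
    ["pasta", "recipe", "sauce", "cook"],
    ["cook", "recipe", "oven", "pasta"],
    ["sauce", "oven", "cook", "recipe"],
]
labels, p_z_d = plsa_cluster(questions, n_topics=2)
print(labels)
```

On a corpus like this, with two clearly separated vocabularies, one would hope the two groups receive different labels, but EM can settle in a local optimum, so the result depends on the seed and the number of iterations.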