Clustering Community Questions Based On Topic Models

Posted on:2012-06-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Miao

Full Text:PDF

GTID:2248330362968181

Subject:Software engineering

Abstract/Summary:

In the era of Web 2.0, Community Question Answering (CQA) services, such as Yahoo! Answers and Baidu Zhidao, have become a popular approach to information seeking. As user-posted questions accumulate to a huge volume, effective management of CQA is becoming an increasingly important issue. Usually, the archives of community questions are organized into hierarchical categories, whose maintenance relies heavily on users' behavior: askers are required to select appropriate categories when submitting questions. To reduce user effort and make question management in CQA more intelligent, in this paper we study the problem of Community Question Clustering, which is primarily concerned with grouping a collection of community questions into predefined categories.

This problem can be naturally reformulated as a clustering task. In data mining, text clustering partitions a document collection into disjoint clusters, each of which corresponds to a specific category or domain. Unlike traditional text data, community questions contain unstructured text data and structured user data simultaneously. Along these two dimensions, we formally define clustering features of community questions and investigate the effectiveness of probabilistic topic models. Based on the traditional PLSA model, we propose Basic-PLSA, which performs clustering on the textual features alone. Then we take user information into account, which provides a valuable indication of the correlations between categories and community questions. To exploit user information, we extend Basic-PLSA in different ways and propose the User-PLSA and Reg-PLSA models. In particular, User-PLSA combines text and user information in a unified probabilistic framework, and Reg-PLSA smooths parameter estimation with a question graph derived from user information.

The experimental results show that our Community Question Clustering problem can be solved effectively by the proposed models, among which Reg-PLSA achieves the best clustering results and efficiency. Meanwhile, evaluation reveals that incorporating user information significantly improves the performance of Basic-PLSA. Through empirical studies, we also demonstrate how variation in textual features influences the clustering of community questions.

In addition, using these clustering models, we proceed to examine the identification of new categories in CQA. With categorical prior knowledge, we modify the unsupervised Basic-PLSA model into a semi-supervised method, which successfully finds new categories of high quality for CQA and achieves better identification performance than Basic-PLSA.
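
The idea of clustering questions on textual features alone with a PLSA topic model can be illustrated with a minimal sketch. The abstract does not give the thesis's actual Basic-PLSA formulation, so the code below is only the standard PLSA expectation-maximization procedure under stated assumptions: each latent topic is treated as one cluster, and each question is assigned to the topic with the largest P(z|d). The function name `plsa_cluster`, its parameters, and the toy count matrix are hypothetical and chosen purely for illustration.

```python
import numpy as np


def plsa_cluster(counts, n_topics, n_iter=50, seed=0):
    """Fit standard PLSA with EM on a (questions x words) count matrix.

    Illustrative assumption: one latent topic per cluster. Returns the
    cluster label of each question plus the fitted P(z|d) and P(w|z).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation; each row is normalised to a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        # Shape of `joint`: (questions, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate both distributions from the expected
        # counts n(d,w) * P(z|d,w).
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    # Hard cluster assignment: the most probable topic per question.
    labels = p_z_d.argmax(axis=1)
    return labels, p_z_d, p_w_z


# Hypothetical toy data: 4 questions over a 6-word vocabulary.
toy_counts = np.array(
    [
        [3, 2, 1, 0, 0, 0],
        [2, 3, 2, 0, 0, 0],
        [0, 0, 0, 2, 3, 1],
        [0, 0, 0, 1, 2, 3],
    ],
    dtype=float,
)

labels, p_z_d, p_w_z = plsa_cluster(toy_counts, n_topics=2)
print(labels)
```

The dense three-dimensional posterior keeps the sketch short but only suits small vocabularies; a realistic question archive would need sparse updates. The user-information extensions (User-PLSA and Reg-PLSA) are not sketched here because the abstract does not specify them.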

Keywords/Search Tags:

CQA, Clustering, Community Question, Topic Model

Related items

1	Research On The Re-use Of Community Question Answering Knowledge
2	Research On Tag Generation For Chinese Short Text Based On Community Question Answering System
3	Research And Application Of Answer Ranking And Question Retrieval In Community Question Answering System
4	Research On Community-based Question Retrieval Technology Based On Topic Translation Model
5	A Study Of Question Retrieval Technology In The Chinese Community Question Answering Systems
6	The Research And Implementation Of Expert Finding Method For Community Question Answering
7	Topic Mining And Q&A Recommendation Research In Socialized Q&A Community In UGC Environment
8	Expert Discovery Method For Question Answering Community Based On LSTM Model
9	Key Issues Of Question Understanding In Community Question Answering System
10	Research On Topic Model Based Personalized Expertise Ranking Algorithm In Community Based Questioning Answering