Short Text Clustering For Question Answering Systems

Posted on: 2012-01-28 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: X L Ni
GTID: 1118330335462378 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Web 2.0, User-Interactive Question Answering (UIQA) systems have attracted increasing attention. UIQA systems connect askers with answerers and encourage answerers in the QA community to solve questions. However, UIQA systems are also filled with duplicate or similar questions, and this redundancy prevents users from obtaining knowledge quickly.

We investigate short text clustering algorithms for grouping the questions in a UIQA system. A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. To find the core terms, we model the collection of short text snippets as a graph in which each vertex represents a short text snippet and each weighted edge measures the relationship between the two snippets it connects. Each term can bisect the graph such that the snippets in one part contain the term, whereas those in the other part do not. The term that minimizes the inter-class similarity and maximizes the inner-class similarity is selected as the core term. TermCut then bisects the short text collection into two clusters: one whose snippets contain the core term and one whose snippets do not. We bisect the collection iteratively until a final set of clusters is formed.

Based on the TermCut strategy, we propose two clustering algorithms: Cluster Number based TermCut (CNTC) and Threshold based TermCut (TTC). CNTC uses prior knowledge of the target cluster number as its stop condition: bisection terminates once the target number of clusters is obtained. In some cases, however, the target cluster number is difficult to know in advance. Unlike CNTC, TTC uses a similarity threshold to decide whether to stop bisecting: the clustering process stops when a further bisection yields no improvement in the inter-class and inner-class similarity measures.

We design a prototype that applies the proposed short text clustering algorithm to question recommendation. A topic-based user interest model is proposed to capture the different interests of users. Based on this model, questions are ranked according to each user's interests, and the top-ranked questions are clustered and recommended to the user. A demonstration of the clustering algorithm is then given.
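
The bisection strategy described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not give the edge weights or the exact selection criterion, so the whitespace tokenizer, the Jaccard edge weight, and the score (mean inner-class similarity minus mean inter-class similarity) are all assumptions made for illustration, as are the function names `tokenize`, `jaccard`, `split_score`, `best_split`, and `termcut`. Passing `num_clusters` mimics a CNTC-style stop condition; leaving it as `None` mimics a TTC-style stop based on a hypothetical `threshold`.

```python
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not the thesis's preprocessing)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets, used here as the assumed edge weight."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_score(with_term, without, tokens):
    """Assumed criterion: mean inner-class similarity minus mean inter-class similarity."""
    inner_pairs = list(combinations(with_term, 2)) + list(combinations(without, 2))
    inner = sum(jaccard(tokens[i], tokens[j]) for i, j in inner_pairs)
    inner /= max(len(inner_pairs), 1)
    inter = sum(jaccard(tokens[i], tokens[j]) for i in with_term for j in without)
    inter /= len(with_term) * len(without)
    return inner - inter


def best_split(cluster, tokens):
    """Return (score, core_term, with_term, without) for a cluster, or None if no term splits it."""
    best = None
    candidates = set().union(*(tokens[i] for i in cluster))
    for term in sorted(candidates):  # sorted so that ties are broken deterministically
        with_term = [i for i in cluster if term in tokens[i]]
        without = [i for i in cluster if term not in tokens[i]]
        if not with_term or not without:
            continue  # this term does not bisect the cluster
        score = split_score(with_term, without, tokens)
        if best is None or score > best[0]:
            best = (score, term, with_term, without)
    return best


def termcut(texts, num_clusters=None, threshold=0.0):
    """Iteratively bisect the collection on the best-scoring term.

    num_clusters set  -> stop at that many clusters (CNTC-style stop condition).
    num_clusters None -> stop when the best score is <= threshold (TTC-style stop condition).
    """
    tokens = [tokenize(t) for t in texts]
    clusters = [list(range(len(texts)))]
    while num_clusters is None or len(clusters) < num_clusters:
        splits = [(best_split(c, tokens), k) for k, c in enumerate(clusters)]
        splits = [(s, k) for s, k in splits if s is not None]
        if not splits:
            break  # no cluster can be bisected any further
        (score, _term, part_a, part_b), k = max(splits, key=lambda item: item[0][0])
        if num_clusters is None and score <= threshold:
            break  # bisecting would not improve the assumed criterion
        clusters[k:k + 1] = [part_a, part_b]
    return [[texts[i] for i in c] for c in clusters]


if __name__ == "__main__":
    questions = [
        "how to install python",
        "install python on windows",
        "best pizza recipe",
        "easy pizza dough recipe",
    ]
    for group in termcut(questions, num_clusters=2):
        print(group)
```

In this sketch, each round evaluates every term of every current cluster, which is quadratic in cluster size per term; it is meant only to make the bisect-on-a-core-term idea concrete, not to be efficient.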
Keywords/Search Tags:Web, Question Answering System, User Interactive Question Answering System, Short Text Clustering, Question Clustering