Short Text Clustering For Question Answering Systems

Posted on:2012-01-28

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X L Ni

GTID:1118330335462378

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of Web 2.0, User-Interactive Question Answering (UIQA) systems have attracted increasing attention. UIQA systems connect askers with answerers and encourage answerers in the QA community to solve questions. However, UIQA systems are also filled with duplicate or similar questions, and this redundancy prevents users from obtaining knowledge quickly.

We investigate short text clustering algorithms for grouping the questions in a UIQA system. A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. To find the core terms, we model the collection of short text snippets as a graph in which each vertex represents a short text snippet and each weighted edge measures the relationship between the two snippets it connects. Each term can bisect the graph such that the snippets in one part contain the term, whereas the snippets in the other part do not. The term whose bisection minimizes the inter-class similarity and maximizes the inner-class similarity is selected as the core term. TermCut then bisects the short text collection into two clusters: one whose snippets contain the core term and one whose snippets do not. We bisect the collection iteratively until a final set of clusters is formed.

Based on the TermCut strategy, we propose two clustering algorithms: Cluster Number based TermCut (CNTC) and Threshold based TermCut (TTC). CNTC uses prior knowledge of the target cluster number as its stop condition: bisection terminates when the target number of clusters is obtained. In some cases, however, it is difficult to obtain prior knowledge of the target cluster number. Unlike CNTC, TTC uses a similarity threshold to determine whether to stop bisecting. The clustering process of TTC stops when a bisection does not lead to any improvement of the inter-class similarity and the inner-class dissimilarity.

We design a prototype that applies the proposed short text clustering algorithm to question recommendation. A topic-based user interest model is proposed to capture different user interests. Based on the model, we rank the questions according to each user's interest. The top-ranked questions are clustered and recommended to the user. A demonstration of the clustering algorithm is then given.
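
The core-term bisection and the CNTC stop condition described in this abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not give the edge weight, the exact selection criterion, or the rule for choosing which cluster to bisect next, so the Jaccard similarity, the ratio of inter-class to inner-class similarity, and the "best score first" rule below are all assumptions, as are the function names `best_split` and `termcut_by_count`.

```python
from itertools import combinations


def jaccard(a, b):
    """Edge weight between two snippets' term sets (assumed: Jaccard similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_sum(docs, group):
    """Sum of edge weights over all pairs of snippets inside one group."""
    return sum(jaccard(docs[i], docs[j]) for i, j in combinations(group, 2))


def cross_sum(docs, left, right):
    """Sum of edge weights over all pairs that straddle the two groups."""
    return sum(jaccard(docs[i], docs[j]) for i in left for j in right)


def best_split(docs, cluster):
    """Try every term as a core term; return (score, term, with_term, without).

    The score is inter-class similarity divided by inner-class similarity
    (an assumed criterion, lower is better). Returns None when no term
    separates the cluster into two non-empty parts.
    """
    best = None
    terms = sorted(set().union(*(docs[i] for i in cluster)))
    for term in terms:
        with_term = [i for i in cluster if term in docs[i]]
        without = [i for i in cluster if term not in docs[i]]
        if not with_term or not without:
            continue  # this term does not bisect the cluster
        inner = pair_sum(docs, with_term) + pair_sum(docs, without)
        inter = cross_sum(docs, with_term, without)
        score = inter / (inner + 1e-9)  # small constant avoids division by zero
        if best is None or score < best[0]:
            best = (score, term, with_term, without)
    return best


def termcut_by_count(docs, k):
    """CNTC-style loop: bisect until k clusters exist or no split is possible."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < k:
        candidates = []
        for idx, cluster in enumerate(clusters):
            split = best_split(docs, cluster)
            if split is not None:
                candidates.append((split[0], idx, split))
        if not candidates:
            break  # every remaining cluster is indivisible
        # Assumed rule: bisect the cluster whose best split scores lowest.
        _, idx, (_, _, with_term, without) = min(
            candidates, key=lambda c: (c[0], c[1])
        )
        clusters[idx:idx + 1] = [with_term, without]
    return clusters


# Hypothetical toy questions, already reduced to term sets.
questions = [
    {"python", "list", "sort"},
    {"python", "list", "reverse"},
    {"java", "string", "split"},
    {"java", "string", "join"},
]
print(termcut_by_count(questions, 2))
```

On this toy input the two Python questions end up in one cluster and the two Java questions in the other. A TTC-style variant would replace the `len(clusters) < k` condition with a check on whether the best available split still improves the chosen similarity measure; the abstract does not specify that threshold, so it is left out here.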

Keywords/Search Tags:

Web, Question Answering System, User Interactive Question Answering System, Short Text Clustering, Question Clustering

Related items

1	Applications Of Short Text Similarity Assessment In User-interactive Question Answering
2	Question Recommendation Mechanism In User-Interactive Question Answering Systems
3	Research On Key Techniques Of Question Understanding For Open-domain Question Answering System
4	Research On The Re-use Of Community Question Answering Knowledge
5	The Design And Implementation Of A Web-based Intelligent Question-answering System
6	Research And Application Of Key Technologies Of Community Question Answering
7	Research On Question Analysis And Answer Extraction Methods Of Chinese Question Answering Systems
8	Research Of Specific Domain Question Answering System Based On Internet Information
9	Research On Tag Generation For Chinese Short Text Based On Community Question Answering System
10	Research On Text Retrieval Of Restricted Question Answering System