Research On The Semantic Mining Of Question-answer Pairs In Web Communities

Posted on: 2014-02-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B X Wang
Full Text: PDF
GTID: 1268330392972599
Subject: Computer application technology
Abstract/Summary:
The rapid development of Web 2.0 techniques has brought a boom in User-generated Content (UGC). As a new kind of web information resource, high-quality UGC has shown great research and application value, so collecting and mining the UGC resource is of real significance. As a typical form of UGC produced directly by users' knowledge-sharing behavior via web media, the Question-Answer (QA) resource, composed of large numbers of QA pairs, represents human knowledge in the form of web text and embodies the linguistic principles of human communication on the web. Consequently, a high-quality QA resource plays an important role in both question-answering system construction and natural language processing research.

Web communities (e.g., community-driven question answering (cQA) portals, online forums, etc.) provide a platform for information exchange. In these communities, people tend to communicate and share their knowledge by asking questions and delivering answers, so the communities contain a large number of QA pairs. A considerable portion of the QA pairs have descriptive answers, whose value lies in improving the performance of QA systems by covering complicated questions. However, since community users are not obligated to provide meaningful information, the valuable QA pairs are mostly buried in useless information. For constructing the knowledge base of a QA system, automatically identifying and extracting QA pairs from noisy community content is a challenging task.

This thesis focuses on the essential problems in the semantic mining of QA pairs in web communities. Our research covers both automatic QA pair detection in the communities based on semantic relationships, and QA information generation and fusion tasks that use the semantic knowledge hidden in the QA pairs.
In detail, the major contents of this thesis include the following four parts.

Quantifying the semantic relevance between questions and their answers is the essential issue in the semantic mining of community QA pairs. QA pairs in the communities are typically short texts, which leads to severe feature sparsity and a lack of word co-occurrence information, and makes the semantic relevance calculation for QA pairs different from that for ordinary texts. After investigating the characteristics and difficulties of semantic relevance quantification for short texts, and based on a discussion of the language characteristics of community QA pairs, this thesis proposes two deep learning models with different architectures to quantify the semantic relevance of QA pairs. The experimental results indicate that the proposed deep learning models improve the precision of semantic relevance quantification for QA pairs.

Non-textual features based on the structural and social information of web communities have been widely used in QA resource mining studies. General non-textual features are based on heuristic regularities observed in the communities. Such features usually fail to offer deterministic information to QA pair identification models, so the identification precision cannot be further improved. To obtain deterministic information for answer identification in forum threads, this thesis presents a forum thread segmentation strategy and extracts a group of new features based on the segments. Using the segment information, an approach is put forward to find the best answers.
The experimental results show that the segment features make a good contribution to improving the precision of the answer identification methods, and that the approach using the segment features outperforms the classical methods.

The purpose of the statistical question generation research is to obtain, from the web community QA resource, statistical information reflecting the semantic relationships between words, and to use that information in the automatic Question Generation (QG) task, thereby extending the application field of QA resource mining. Mainstream QG techniques based on parsers or sentence transformation rules usually do not adapt well to descriptive source texts and are less efficient on large-scale web data. This thesis systematically describes the statistical question generation task in detail and proposes a Deep Belief Network (DBN) based model to generate the essential words of a question from a given answer text. The generated words are then reorganized into questions following simple patterns. The experimental results show that the DBN model is promising for QG tasks oriented to complicated descriptive texts.

It is common for a question in a cQA portal to have several correct answers; in this case, summarizing the answers is more meaningful than selecting a single best answer. As an effective operation for improving the quality of the QA resource, answer summarization reflects the level of semantic mining research on QA pairs. From the perspective of the answers' topic information, this thesis discusses different strategies for answer summarization. It presents an Adaptive Maximum Marginal Relevance (AMMR) based approach that summarizes the answers using the given topic information as the basis.
When the topic information is absent, this thesis proposes two different algorithms, based respectively on the principles of original answer set reconstruction and sub-topic generation. In the experiments, the summarization methods are evaluated and analyzed using precision and redundancy as the evaluation metrics.
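
For readers unfamiliar with the idea behind relevance-versus-redundancy summarization, the following is a minimal sketch of plain Maximal Marginal Relevance selection over candidate answer sentences. It is not the thesis's AMMR method, whose adaptive component is not described in this abstract. The bag-of-words cosine similarity, the function names (`bow`, `cosine`, `mmr_select`), and the parameters `k` and `lam` are all illustrative assumptions.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts (an assumed, very simple text model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(topic, sentences, k=2, lam=0.7):
    """Greedily pick up to k sentences by plain MMR.

    Each step picks the sentence maximizing
        lam * sim(sentence, topic) - (1 - lam) * max sim(sentence, already selected),
    so lam near 1 favors relevance and lam near 0 favors low redundancy.
    """
    topic_vec = bow(topic)
    vecs = [bow(s) for s in sentences]
    selected = []                      # indices of chosen sentences, in pick order
    remaining = list(range(len(sentences)))

    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], topic_vec)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return [sentences[i] for i in selected]
```

With a given topic string (for example, the question text) and a list of answer sentences, `mmr_select(topic, sentences, k=2)` returns two sentences that are relevant to the topic while penalizing near-duplicates. This mirrors the precision and redundancy trade-off that the abstract names as its evaluation criteria.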
Keywords/Search Tags: Question-Answer Pair Detection, Semantic Relevance Quantification for Short Texts, Deep Learning, Non-textual Feature, Statistical Question Generation, Answer Summarization