
Key Technologies of Open-Domain Reading Comprehension

Posted on: 2011-05-13 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Z C Zhang | Full Text: PDF
GTID: 1118360332957943 | Subject: Computer application technology

Abstract/Summary
Automatic reading comprehension (RC), one of the traditional research areas in AI, has become a hotspot of current research in natural language processing (NLP), stimulated by the continuous TREC question answering (QA) tracks held since 1999. RC can be used to verify and evaluate the performance of fundamental NLP technologies, and to explore text understanding methodologies at different discourse levels.

The major problem in open-domain QA over large-scale document collections is the difficulty of pinpointing errors and analyzing what contributes to faulty answers, owing to the complex architecture of QA systems. RC, in contrast, uses a single document as its target corpus and needs no document retrieval module, so it can serve as an alternative testbed for QA that concentrates on the question analysis and answer extraction phases.

An in-depth review of related work in open-domain RC shows that most answer extraction methods are shallow and leave considerable room for improvement. This thesis therefore pursues four research directions to address the problem.

1. Question classification is an important component of both RC and QA systems, since it strongly influences the performance of final answer extraction. Compared with ordinary text, the questions users raise are generally short and contain few features for classification, so sparseness in the training question set becomes serious and degrades classification performance at large. In view of this problem, the thesis proposes a new question classification approach based on cue word identification and training data expansion: it first identifies the key features for classification and then extends the training set with questions mined automatically from the Web, alleviating the sparseness of the training data.
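As a rough illustration of the data-expansion idea, here is a minimal pure-Python sketch of the nearest-neighbor component only (the thesis pairs it with an SVM, omitted here). The example questions, category labels, and cosine bag-of-words similarity are illustrative assumptions, not details taken from the thesis:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector for a question."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Small seed set of labeled questions (hypothetical examples) ...
seed = [
    ("who wrote Hamlet", "PERSON"),
    ("when did the war end", "TIME"),
    ("where is the Nile", "LOCATION"),
]
# ... expanded with questions "mined from the Web" to ease the
# sparseness of the training data, as the thesis proposes.
web_expanded = [
    ("who painted the Mona Lisa", "PERSON"),
    ("when was the bridge built", "TIME"),
    ("where was the treaty signed", "LOCATION"),
]
train = [(bow(q), label) for q, label in seed + web_expanded]

def classify(question):
    """1-nearest-neighbor classification over the expanded set."""
    q = bow(question)
    return max(train, key=lambda ex: cosine(q, ex[0]))[1]
```

With the expanded set, "where was the canal opened" finds its nearest neighbor among the Web-mined questions rather than the sparse seed set, which is the effect the expansion is meant to achieve.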
This approach combines a nearest-neighbor model with a support vector machine (SVM), and thereby improves classification performance.

2. When extracting answer sentences from a text for users' questions, performance is limited by bag-of-words (BOW) methods. The thesis proposes an answer sentence extraction method based on tree kernels over syntactic and shallow semantic structures. The method coherently combines multiple features (syntactic tree structures, shallow semantic trees, words, and the contextual information of sentences in the text) and extracts answer sentences with a machine learning model built on these features.

3. The information about an entity or fact in a text is often spread over multiple sentences describing different facets. Although the sentences cohere in context, no single sentence conveys the whole information about the entity or fact, so when a user's question relates to multiple sentences it is difficult to extract the right answer from any single sentence alone. The thesis presents a new answer extraction method, the concept graph matching model. It constructs concept graphs for both the text and the question, extracts from the text graph the sub-graph that best matches the question graph, and selects one concept node from this sub-graph as the final answer string. Because the text concept graph is built from the concepts and syntactic/semantic relations of all sentences in the text, this approach improves answer extraction over methods that use information from a single sentence.

4. The thesis presents an answer sentence extraction method for why-type questions. The method uses not only the words and semantic relations in the question, to identify the text sentences corresponding to the question topic, but also the rhetorical features among the text sentences.
In addition, it employs causal relations between words, mined from a large-scale document collection, to discriminate causal relations between sentences. A machine learning model augmented with these two kinds of features predicts and ranks the probability of each sentence being the answer to a why question.
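The concept graph matching idea of direction 3 can be sketched with hand-built triples standing in for the parser-derived graphs. The entities, relation labels, and the overlap-counting match score below are all illustrative assumptions; the thesis builds its graphs from syntactic/semantic analysis of every sentence in the text:

```python
# Graphs as sets of (head, relation, dependent) triples;
# "?x" in the question graph marks the answer slot.
text_graph = {
    ("Fleming", "agent", "discovered"),
    ("discovered", "object", "penicillin"),
    ("discovered", "time", "1928"),
}
question_graph = {
    ("?x", "agent", "discovered"),
    ("discovered", "object", "penicillin"),
}

def match_answer(text_graph, question_graph, slot="?x"):
    """Return the text-graph node that, substituted for the answer
    slot, makes the most question triples appear in the text graph
    (a toy stand-in for best-matching sub-graph extraction)."""
    nodes = {n for t in text_graph for n in (t[0], t[2])}
    def score(node):
        substituted = {
            tuple(node if e == slot else e for e in triple)
            for triple in question_graph
        }
        return len(substituted & text_graph)
    return max(nodes, key=score)
```

Here substituting "Fleming" for the slot recovers both question triples from the text graph, so that node is extracted as the answer string; any node matching only one triple scores lower.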
Keywords/Search Tags:Reading Comprehension, Question Classification, Answer Extraction, Tree kernel, Concept Graph Matching, Rhetorical Identification