
Research For Chinese Reading Comprehension Based On Word Distributed Representation

Posted on: 2015-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhang
GTID: 2308330461985028
Subject: Computational Mathematics
Abstract/Summary:
Natural Language Processing (NLP) is a core area of artificial intelligence. Within NLP, advances in Reading Comprehension (RC) technology help people obtain useful information precisely. The RC task aims to automatically extract the answer sentence from a natural-language article, given the article and an article-related question. In recent years there have been many studies on the RC task in China and abroad. They mainly compute match scores between a sentence and a question in an article based on one-hot word representations; however, few studies have introduced distributed word representations into the reading comprehension task.

In this thesis, we first train a word embedding matrix with a Neural Language Model (NLM). We then cast the reading comprehension task as a binary classification task, employing a maximum entropy model on the Chinese Reading Comprehension Corpus (CRCC), a dataset developed by Shanxi University. Using the distributed word representation matrix, we construct several features that measure the similarity between a question and a sentence in an article:

1) MAXOUT feature: the Euclidean distance between the element-wise maximum vectors of the embedding matrices of the question sentence and the answer sentence;
2) Arithmetic mean feature: the Euclidean distance between the mean vectors of the two embedding matrices;
3) Average word-pair similarity feature: the average over a matrix whose elements are the Euclidean distances between the vector of each word in the question and the vector of each word in the sentence;
4) Angle cosine feature: obtained by adding the angle cosine to the MAXOUT feature.

Because the CRCC is a small dataset, we use held-out validation to conduct the experiments.
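The four similarity features can be sketched in NumPy as follows. This is a hypothetical reimplementation from the feature descriptions only; the function name, the exact cosine variant, and any scaling are assumptions, not the thesis's actual code:

```python
import numpy as np

def sentence_features(Q, S):
    """Similarity features between a question matrix Q and a candidate
    answer-sentence matrix S (one row per word embedding vector).
    Illustrative only; the thesis's exact formulas may differ."""
    # 1) MAXOUT: Euclidean distance between element-wise maximum vectors
    maxout = float(np.linalg.norm(Q.max(axis=0) - S.max(axis=0)))
    # 2) Arithmetic mean: Euclidean distance between the mean vectors
    mean_dist = float(np.linalg.norm(Q.mean(axis=0) - S.mean(axis=0)))
    # 3) Average word-pair similarity: mean of all pairwise word distances
    pair_dists = np.linalg.norm(Q[:, None, :] - S[None, :, :], axis=2)
    avg_pair = float(pair_dists.mean())
    # 4) Angle cosine between the two element-wise maximum vectors
    q_max, s_max = Q.max(axis=0), S.max(axis=0)
    cosine = float(q_max @ s_max /
                   (np.linalg.norm(q_max) * np.linalg.norm(s_max)))
    return {"maxout": maxout, "mean": mean_dist,
            "avg_pair": avg_pair, "cosine": cosine}
```

For identical question and sentence matrices, the distance-based features are zero and the cosine is one, which is a quick sanity check on the implementation.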
We segment the corpus into five training/test splits and use HumSent accuracy to evaluate model performance. Initial results are obtained with the unscaled word embedding matrix; scale optimization is then applied to the matrix, and the embedding matrix and features are selected to improve performance on the training and test sets. The results show that adding the optimized word embedding matrices to the maximum entropy model, trained with 11 features, yields a HumSent accuracy of 63.37%. Based on a character embedding, a HumSent accuracy of 63.81% is obtained, an improvement of 2.07%. Overall, distributed word representation embedding matrices can improve the performance of the RC model to a certain extent.
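The classification step can be sketched as a minimal binary maximum-entropy model (equivalently, logistic regression) trained by gradient descent over similarity features like those above. This is an illustrative implementation under assumed hyperparameters, not the toolkit or the 11-feature setup actually used in the thesis:

```python
import numpy as np

def train_maxent(X, y, lr=0.1, epochs=2000):
    """Train a binary maximum-entropy (logistic regression) model.
    X: (n_samples, n_features) feature matrix; y: 0/1 labels
    (1 = the sentence answers the question)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(answer | features)
        # Gradient of the average log-loss
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def predict(w, b, X):
    """Label each candidate sentence as answer (1) or not (0)."""
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
```

In the thesis's setting, each row of X would hold the feature values for one question–sentence pair, and the predicted positive sentence per question would be scored against the gold answer to compute HumSent accuracy.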
Keywords/Search Tags: Reading comprehension, Maximum entropy model, Neural language model, Distributed word representation